Disasters can strike at any moment, whether they are natural or human-caused. When they do, your critical apps need a business continuity strategy to keep running. SingleStore Helios SmartDR is your fully automated disaster recovery (DR) solution, seamlessly replicating your data and configurations to a secondary region.
In the face of a crisis, you can bring up your application in the secondary region with just a few clicks, ensuring minimal downtime. The best part is SmartDR is incredibly cost-effective, eliminating the need for idle computing resources in the secondary region.
In this article, we’ll guide you through how SingleStore Helios SmartDR meets your DR requirements in a way that’s both cost effective and simple.
The difference between disaster recovery and high availability
Before we dive into the details, let’s clarify the difference between High Availability (HA) and Disaster Recovery (DR). The two terms are often used interchangeably, but they serve distinct purposes. Both aim to maintain the availability of your critical data or application, but HA focuses on preventing or minimizing interruptions from minor glitches or isolated infrastructure events. DR, on the other hand, deals with large-scale events and disasters that affect a whole region and have no known resolution time.
For example, HA can help when the health of compute nodes degrades, a storage volume fails, a network outage affects multiple racks or a power outage shuts down a whole data center (Availability Zone). DR plans come into effect when an entire region fails due to a natural disaster, like an earthquake or flood, and we don’t know when it will be back online. In some industries, like banking or healthcare, it is mandatory to have a strong DR plan that meets certain Recovery Point Objective (RPO) and Recovery Time Objective (RTO) requirements.
Disaster Recovery solutions are typically measured by RPO and RTO. RPO defines the maximum amount of data loss the business can accept when the system is recovered. RTO defines the maximum acceptable interval from the time of failure to the time normal operations resume, including data availability. Both are expressed as time intervals. For example, an RPO of five minutes means that when the database is restored, up to five minutes of the most recent data may be lost. The RPO you can achieve depends on how frequently data is replicated from the primary to the secondary region. Continuing the example, an RTO of 15 minutes means the database outage must end, and operational readiness resume, within 15 minutes of the failure.
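To make the arithmetic concrete, here is a minimal sketch that compares what actually happened in an incident against the RPO and RTO targets from the example above. The function names and timestamps are hypothetical and chosen only for illustration; they are not part of any SingleStore API.

```python
from datetime import datetime, timedelta


def measure_incident(last_replicated_at, failed_at, restored_at):
    """Return (data-loss window, recovery time) for one incident."""
    # Anything written after the last successful replication is at risk.
    data_loss_window = failed_at - last_replicated_at
    # Downtime runs from the failure until normal operations resume.
    recovery_time = restored_at - failed_at
    return data_loss_window, recovery_time


def meets_objectives(data_loss_window, recovery_time, rpo, rto):
    """True when the incident stayed within both targets."""
    return data_loss_window <= rpo and recovery_time <= rto


# Hypothetical incident timeline.
last_sync = datetime(2024, 1, 1, 12, 0)
failure = datetime(2024, 1, 1, 12, 4)
restored = datetime(2024, 1, 1, 12, 16)

loss, downtime = measure_incident(last_sync, failure, restored)
within_targets = meets_objectives(
    loss, downtime, rpo=timedelta(minutes=5), rto=timedelta(minutes=15)
)
```

In this timeline the data-loss window is four minutes and the recovery time is 12 minutes, so both the five-minute RPO and the 15-minute RTO are met.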
SingleStore SmartDR is different
With SingleStore Helios SmartDR, you can achieve a cross-region DR strategy with ease. SmartDR replicates not only your data but also all the cluster configurations, users, roles and permissions, firewall policies, Pipelines (for ingest) and all other metadata. What sets SmartDR apart is that it does not require duplicate compute running idle all the time in the secondary region, which would double your compute credits. Compute is typically the most expensive part of your infrastructure.
Configuring SmartDR
You can get started with SmartDR in a few easy steps:
- Go to the Replication tab from the top menu and click on Configure Replication.
- Choose the Primary region where your databases are located. This is pre-selected and cannot be changed.
- Choose the Secondary region where you want to replicate your databases. You can select any region from the drop-down list.
- Choose the databases that you want to replicate. You can select one or more databases from the list.
- Choose Storage Only as the Replication Type. This is the only option available at the moment.
- Click the Submit button. You will see a status bar showing the progress of the replication process, and the replication status of each database.
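The choices made in the steps above can be summarized as a small data structure with a few sanity checks. This is only a sketch of the form's logic, not a SingleStore API: the `ReplicationRequest` class, the `validate` function and the region names are all hypothetical, and the rule that the secondary region must differ from the primary is an assumption based on the cross-region purpose of the feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationRequest:
    """Hypothetical model of the Configure Replication form."""
    primary_region: str       # pre-selected in the UI, cannot be changed
    secondary_region: str     # chosen from the drop-down list
    databases: tuple          # one or more databases to replicate
    replication_type: str = "Storage Only"


def validate(request):
    """Return a list of problems; an empty list means the form is complete."""
    problems = []
    if request.secondary_region == request.primary_region:
        problems.append("secondary region must differ from the primary region")
    if not request.databases:
        problems.append("select at least one database")
    if request.replication_type != "Storage Only":
        problems.append("Storage Only is the only replication type available")
    return problems


# Hypothetical region and database names.
request = ReplicationRequest("us-east-1", "us-west-2", ("orders", "customers"))
problems = validate(request)
```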
SmartDR works as a set-it-and-forget-it solution. After you configure the replication, SingleStore automatically starts asynchronous replication of your data to the secondary region. You can check the last sync status under the Replication tab in the UI. We use symmetric, bidirectional replication to ensure that data in your object storage is replicated to the secondary region. The best part is that this process does not require idle compute resources to be provisioned in the secondary region.
If a disaster strikes and you want to fail over, you only have to click a single failover button. We then provision your environment on demand in the secondary or target region. After the failover is completed, your secondary region will look exactly like the primary (or recently failed) region. At this point, you just need to point your app to the new workspace.
Pre-provision your secondary region with SmartDR
SingleStore also gives you the flexibility to pre-provision your secondary region. This has three main advantages:
- You can configure your private endpoints for secure connections to your app before initiating failover.
- You can test DR by failing over to the secondary region without impacting your primary region.
- You can have a more predictable failover as workspaces are already active. Pre-provisioning allows you to secure workspaces and "beat the rush" that can occur in surviving regions if another region goes down.
To pre-provision, you simply have to navigate to the replication tab and select “Start pre-provisioning.” SingleStore will configure your secondary region with the same topology and configurations as your primary region in the background. This includes the workspaces, databases, users and group permissions, security configurations and pipelines. A key advantage of pre-provisioning your secondary region is that you can configure the private endpoints in this secondary region so that when you fail over, the process of connecting to your application is faster and simpler.
During this process, your primary region still actively replicates data to your secondary region. This means you can test a failover, for example to meet compliance requirements, without disrupting your production environment. To test your failover, simply switch to the secondary region, attach your database to the workspace and start querying to make sure your latest data has been replicated to the secondary region (when you attach, a new branch of the database is created implicitly; more on that in a moment). During this test, you can also connect your application to the secondary region and validate that the app can insert into or update your database as expected, all without impacting production.
For this test failover to work, we leverage a SingleStore feature called DB branching. DB branches are like Git branches, but for databases. When you create a branch, it shares metadata references to the parent’s data files. The actual files don’t have to be copied, so branching is fast and cost-effective.
Apart from their shared history, branches operate independently of each other; any changes you make to the branch do not impact the parent database. While testing your failover, SingleStore automatically creates a branch of your database to reflect all the data up to the point of failover. Now, you can test your application updates in the secondary region confidently without impacting the workloads in the primary region.
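The idea behind branching can be illustrated with a toy copy-on-write model. This is a conceptual sketch only and does not describe SingleStore’s internal implementation; the `Database` class, the file names and the contents are hypothetical. A branch copies the references to the parent’s data files rather than the files themselves, and later writes land only in the branch.

```python
class Database:
    """Toy model: a database is a set of references to immutable data files."""

    def __init__(self, name, files=None):
        self.name = name
        # Maps a file name to a blob held in shared object storage.
        self.files = dict(files or {})

    def branch(self, name):
        # Only the references are copied; the blobs themselves are shared.
        return Database(name, self.files)

    def write(self, file_name, blob):
        # New data is recorded in this database's own reference map.
        self.files[file_name] = blob


parent = Database("orders", {"segment-1": b"rows 1-100"})
dr_test = parent.branch("orders_dr_test")

# A write during the failover test affects only the branch.
dr_test.write("segment-2", b"rows 101-150")
```

After the write, `parent.files` still contains only `segment-1`, while both databases point at the same underlying blob for that file.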
Finally, pre-provisioning an active workspace in the secondary region offers another significant advantage: it ensures a more predictable failover experience. In the event of a disaster, initiating a failover allows SingleStore to immediately attach the databases to the ready infrastructure and allow customers to perform a smooth endpoint switch, ensuring minimal application downtime.
Let’s dive deep into the failover experience
We have already described how straightforward it is to fail over with SingleStore during an event. Just click the failover button under the Replication tab and SingleStore handles it all: securing the infrastructure, bringing up the databases and providing you with a connection string to resume app availability.
In this experience, SingleStore prioritizes availability over consistency. Asynchronous DR solutions replicate data across regions, and if Region A (primary region) suddenly goes offline, some data still in the pipeline or not yet replicated to Region B (secondary region) may be lost during failover.
Customers often accept this trade-off as part of their Recovery Point Objective (RPO), for instance by allowing a five-minute window for potential data loss. Consider a DR strategy that replicates from Region A to Region B. Picture a failure hitting Region A, causing a 90% drop in traffic. To minimize downtime, you swiftly decide to fail over to Region B. The good news? Region B comes online without waiting for Region A to fully stop. However, here’s the catch: writes are still trickling into Region A, but they won’t be visible in the new primary region. In many systems, those writes in Region A are simply lost.
But SingleStore has a neat trick up its sleeve. With native support for DB branching, plus symmetric, bidirectional replication, SingleStore ensures your data is not lost once the dust has settled, provided the failed region eventually comes back online. The goal of SingleStore replication is simple: to make sure the contents of object storage in both regions are (eventually) the same. So your files and metadata will eventually be in both regions.
Let’s look at this example: Region A is unresponsive and the user quickly fails over to Region B (they had first enabled the failover option in Region A). At this time, SingleStore simply creates a branch of the data available in Region B, attaching the database to reduce downtime. This action is independent of Region A. Once the dust settles in Region A, our control plane, which keeps trying to reconnect, will commit the missing data and replicate it to Region B. You can access this data through a separate DB branch in Region B and write it to the live database using `INSERT ... SELECT` or `UPDATE` statements as necessary.
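As a rough illustration of that last reconciliation step, the sketch below copies late-arriving rows into the live data with an `INSERT ... SELECT`. It uses Python’s built-in SQLite module purely as a stand-in, with two tables in one in-memory database playing the roles of the live database and the recovered branch. The table names, columns and rows are hypothetical, and the exact statement you would run against SingleStore depends on your schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the live database in Region B after failover.
    CREATE TABLE orders_live (id INTEGER PRIMARY KEY, status TEXT);
    -- Stand-in for the branch holding data recovered from Region A.
    CREATE TABLE orders_recovered (id INTEGER PRIMARY KEY, status TEXT);

    INSERT INTO orders_live VALUES (1, 'shipped'), (2, 'paid');
    INSERT INTO orders_recovered VALUES (2, 'paid'), (3, 'paid');
""")

# Copy only the rows that never reached the live database.
conn.execute("""
    INSERT INTO orders_live (id, status)
    SELECT r.id, r.status
    FROM orders_recovered AS r
    WHERE NOT EXISTS (SELECT 1 FROM orders_live AS l WHERE l.id = r.id)
""")
conn.commit()

live_ids = [row[0] for row in conn.execute("SELECT id FROM orders_live ORDER BY id")]
```

Here order 3 was written to Region A after the last replication, so it is the only row copied across; order 2 already exists in the live table and is left untouched.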
With support for bidirectional replication, SingleStore ensures that new data entered in Region B’s live database is also replicated back to Region A. This way, SingleStore will make sure all your data that is in object storage is eventually available in both regions.
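The convergence goal can be sketched with two dictionaries standing in for the object storage buckets in each region. This is a simplified model, not SingleStore’s replication protocol: it assumes data files are immutable and uniquely named, so making both regions equal only requires copying the files each side is missing. The function names and file names are hypothetical.

```python
def sync(source, target):
    """Copy objects missing from target and return the names that were copied."""
    missing = sorted(source.keys() - target.keys())
    for name in missing:
        target[name] = source[name]
    return missing


def replicate_bidirectional(region_a, region_b):
    """Run one replication pass in each direction."""
    a_to_b = sync(region_a, region_b)
    b_to_a = sync(region_b, region_a)
    return a_to_b, b_to_a


# Region A holds a late write that never replicated before the failover;
# Region B holds a file written after it became the live region.
region_a = {"file-1": b"a", "file-2": b"b", "file-3-late": b"c"}
region_b = {"file-1": b"a", "file-2": b"b", "file-4-post-failover": b"d"}

copied_to_b, copied_to_a = replicate_bidirectional(region_a, region_b)
```

After one pass in each direction both regions hold the same four files, and only the two files that differed were transferred.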
How do I fail back to the primary region once it is operational again?
SingleStore’s unique ability to make sure your object storage data is eventually available in both regions makes failing back to the primary region a breeze. Once you are confident the primary region (Region A) is fully operational, you simply have to navigate to the Replication tab in the active secondary region (Region B) and click failback.
SingleStore will take care of the rest: it configures the infrastructure in Region A, updates users, permissions and firewall settings, attaches the DBs and presents you with a connection string to connect and resume your application in the primary region (Region A). This failback process is incremental, so it is not necessary to copy the whole database from Region B back to Region A. Only the changes have to be copied back.
Conclusion
To sum up, SingleStore empowers you to operate all your mission-critical applications with confidence. Our platform fully supports ACID transactions for real-time OLTP workloads, ensuring data integrity and consistency. A highly available architecture that distributes tables across availability zones provides resilience and performance. Our online point-in-time recovery (PITR) feature offers fast data recovery, enabling you to restore your data to any point in time.
Now with SmartDR, you can be confident your essential applications are disaster-proof, even if an entire region goes offline. SmartDR’s intuitive replication setup can be configured with just a few clicks, safeguarding your data in a different region. And the best part? There’s no need for an active workspace in the secondary region, which translates to substantial cost savings by eliminating idle compute capacity.
For more information about SmartDR and its configuration, please visit our documentation page or reach out to team@singlestore.com for a personalized demo.