Enhancing business continuity with AWS disaster recovery strategies

Drashti Jadav, Cloud Center of Excellence

In today's digital landscape, maintaining business continuity is more critical than ever. A robust disaster recovery (DR) strategy ensures that your applications and data remain resilient against unexpected disruptions, whether they're caused by hardware failures, natural disasters, or cyber-attacks. AWS offers a powerful suite of tools and services to help you design a disaster recovery architecture tailored to your needs. In this blog, we’ll explore various DR strategies on AWS and how you can leverage cloud-based solutions to ensure rapid recovery and minimal downtime.

Understanding RPO and RTO

Before diving into specific strategies, it's essential to understand two key metrics:

Recovery point objective (RPO): This is the maximum amount of data loss you can tolerate, measured in time. For example, an RPO of four hours means you can afford to lose up to four hours of data.
Recovery time objective (RTO): This is the maximum acceptable amount of time required to restore service after a disruption. For instance, an RTO of two hours means you need to get your systems back online within two hours of an outage.
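
To make the two metrics concrete, here is a minimal Python sketch. It assumes periodic backups are the only protection, so the worst-case data loss equals the interval between backups. The function names and the sample intervals are illustrative, not part of any AWS API.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups only, a failure just before the next backup
    loses everything written since the previous one."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(estimated_recovery: timedelta, rto: timedelta) -> bool:
    """True if the estimated time to restore service stays within the RTO."""
    return estimated_recovery <= rto


# The examples from the definitions above: RPO of four hours, RTO of two hours.
rpo = timedelta(hours=4)
rto = timedelta(hours=2)

print(meets_rpo(timedelta(hours=6), rpo))     # False: six-hourly backups are too sparse
print(meets_rpo(timedelta(hours=1), rpo))     # True: hourly backups fit the RPO
print(meets_rto(timedelta(minutes=90), rto))  # True: a 90-minute recovery fits the RTO
```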

AWS disaster recovery strategies

AWS provides several strategies to meet your DR objectives, each with different levels of cost, complexity, and recovery speed.

[Figure: four DR strategies and their relative RTO and RPO]

The figure above highlights four strategies and shows how each one results in a different RTO and RPO.

To choose the most effective of these strategies, work with the business owner of the workload to evaluate the benefits and risks, based on input from the engineering/IT team. Determine the required RTO and RPO for the workload and assess the investment you're prepared to make in terms of money, time, and effort.

The following chart illustrates how the four disaster recovery strategies relate to RTO and RPO, adding the dimension of the cost of implementing each solution. The vertical red line shows the RTO defined by an organization—DR strategies to the right of this line are not acceptable.

Backup and restore provides the lowest cost, but the longest RTO
Pilot light provides a medium cost and RTO
Warm standby provides a high cost and low RTO
Multi-site active/active provides the highest cost and a near-zero RTO
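
The selection rule implied by the list (pick the lowest-cost strategy whose RTO falls within your target) can be sketched in Python as follows. The cost ranking comes from the list above. The RTO figures in hours are placeholder assumptions for illustration only, not AWS numbers, so replace them with estimates for your own workload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    relative_cost: int        # 1 = lowest cost, 4 = highest (ordering from the list above)
    typical_rto_hours: float  # ASSUMPTION: placeholder values, not AWS figures


STRATEGIES = [
    Strategy("Backup and restore", 1, 24.0),
    Strategy("Pilot light", 2, 4.0),
    Strategy("Warm standby", 3, 0.5),
    Strategy("Multi-site active/active", 4, 0.0),
]


def cheapest_strategy(required_rto_hours: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose RTO is within the requirement."""
    candidates = [s for s in STRATEGIES if s.typical_rto_hours <= required_rto_hours]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


choice = cheapest_strategy(required_rto_hours=2.0)
print(choice.name if choice else "no strategy meets the RTO")
```

With these placeholder figures, a two-hour RTO rules out backup and restore and pilot light, so the sketch selects warm standby.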

Let’s take a closer look at these strategies.

Active/passive DR strategies

The figure above illustrates the active/passive strategy. The workload runs from a primary site (in this case, an AWS Region), where all requests are processed. If a disaster prevents the active Region from supporting the workload, a passive site, known as the recovery Region, takes over. Switching the workload to the recovery Region is called “failover.” To meet tighter RTO and RPO targets, data is kept live and the infrastructure is partially or fully deployed in the recovery site before failover. If data must instead be restored from backups, the recovery point is pushed further back and data may be lost. Similarly, if the infrastructure requires additional setup before it can handle live traffic, recovery time increases. These increases in RTO and RPO are acceptable as long as business goals are still met.
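
The failover behavior described here can be modeled with a small Python sketch. It is a toy model, not an AWS API: the class name, the Region names, and the rule of failing over after three consecutive failed health checks are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ActivePassiveRouter:
    """Toy model of active/passive routing between two Regions."""
    primary: str
    recovery: str
    failure_threshold: int = 3  # ASSUMPTION: consecutive failed checks before failover
    consecutive_failures: int = 0
    failed_over: bool = False

    def record_health_check(self, primary_healthy: bool) -> None:
        if self.failed_over:
            return  # failback is a separate, deliberate step and is not modeled here
        if primary_healthy:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.failed_over = True

    @property
    def target_region(self) -> str:
        """Region that should receive all requests right now."""
        return self.recovery if self.failed_over else self.primary


router = ActivePassiveRouter(primary="us-east-1", recovery="us-west-2")
for healthy in [True, False, False, False]:
    router.record_health_check(healthy)
print(router.target_region)  # us-west-2
```

Requiring several consecutive failures before switching is one common way to avoid failing over on a single transient error. The right threshold depends on your RTO.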

When more than one Region is active, a disaster in one Region causes failover to reroute traffic to the remaining active Region(s). Even when data is replicated across Regions, backing up data is still essential for disaster recovery because it safeguards against human error and software faults. If such an event deletes or corrupts data, point-in-time recovery from backups is required to restore the data to its last known good state.
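
Point-in-time recovery comes down to picking the most recent backup taken before the data went bad. The Python sketch below shows that selection with hypothetical backup timestamps. The function name and the sample times are illustrative assumptions.

```python
from datetime import datetime
from typing import Optional, Sequence


def last_known_good(backups: Sequence[datetime], corrupted_at: datetime) -> Optional[datetime]:
    """Return the most recent backup taken strictly before the corruption time."""
    earlier = [b for b in backups if b < corrupted_at]
    return max(earlier, default=None)


# Hypothetical backups taken every six hours.
backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 12, 0),
]
corrupted_at = datetime(2024, 1, 1, 9, 30)

restore_point = last_known_good(backups, corrupted_at)
print(restore_point)                 # 2024-01-01 06:00:00
print(corrupted_at - restore_point)  # 3:30:00 of data lost, which must fit within the RPO
```

Note that the 12:00 backup is skipped even though it is the newest, because it was taken after the corruption and would contain the bad data.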

Architecture of the DR strategies

Backup and restore

This approach protects against disasters regardless of the scope of their impact. For cross-Region failover, you must restore infrastructure in the recovery Region in addition to recovering data from backups. Tools like AWS CloudFormation, AWS Cloud Development Kit (AWS CDK), and Terraform help ensure consistent infrastructure deployment across Regions.
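
The idea behind infrastructure as code is that one definition is reused in every Region, with only a few parameters changing. The Python sketch below builds a minimal CloudFormation-style template as a dictionary and pairs it with per-Region parameter values. The stack contents, the parameter name, and the deployment_plan helper are hypothetical, and the sketch does not call AWS.

```python
import json

# One template, defined once and reused in every Region.
TEMPLATE = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative workload stack (hypothetical)",
    "Parameters": {
        # In a real template this value would feed an Auto Scaling group.
        "DesiredCapacity": {"Type": "Number", "Default": 0},
    },
    "Resources": {
        # A single S3 bucket stands in for the real workload resources.
        "BackupBucket": {"Type": "AWS::S3::Bucket"},
    },
}


def deployment_plan(capacity_by_region: dict) -> dict:
    """Pair each Region with the same template body and its own parameter values."""
    body = json.dumps(TEMPLATE, sort_keys=True)
    return {
        region: {"template_body": body, "parameters": {"DesiredCapacity": capacity}}
        for region, capacity in capacity_by_region.items()
    }


plan = deployment_plan({"us-east-1": 4, "us-west-2": 0})
for region, stack in plan.items():
    print(region, stack["parameters"])
```

Because both Regions receive the same template body, the recovery Region cannot drift from the primary in anything except the values you deliberately parameterize.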

Although backup and restore has the longest RTO of the four strategies, AWS services like Amazon EventBridge can automate detection and recovery, which reduces RTO. This will be explored in detail in a future blog post.

Pilot light

In the pilot light strategy, critical data remains live while services are kept idle. This means that data stores and databases are up to date (or nearly so) with the active Region and ready for read operations. As shown in the figure above, an Amazon Aurora global database replicates data to a local read-only cluster in the recovery Region. However, as with all disaster recovery strategies, backups (such as the Aurora DB cluster snapshot in the figure above) are essential. If a disaster wipes out or corrupts data, these backups allow you to "rewind" to the last known good state.

With the pilot light strategy, basic infrastructure components such as Elastic Load Balancing and Amazon EC2 Auto Scaling (shown in the figure above) are in place, but functional elements, like compute, are “shut off.” In the cloud, the best way to shut off an Amazon EC2 instance is to not deploy it, and the figure above shows zero instances deployed. To "turn on" these instances, you use an Amazon Machine Image (AMI) that was previously built and replicated across Regions. Amazon EC2 instances launched from this AMI include the necessary operating system and software packages. Like a pilot light in a furnace, which cannot heat your house until activated, the pilot light strategy cannot process requests until the remaining infrastructure is deployed.
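
The "shut off" and "turn on" states can be pictured as an Auto Scaling group whose desired capacity goes from zero to the production level. The Python sketch below is a toy stand-in, not the AWS API. The class, the function, the group name, and the AMI ID are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AutoScalingGroup:
    """Toy stand-in for an Amazon EC2 Auto Scaling group (not the AWS API)."""
    name: str
    ami_id: str                # AMI previously built and replicated to the recovery Region
    desired_capacity: int = 0  # pilot light: nothing deployed until failover

    def can_serve_requests(self) -> bool:
        return self.desired_capacity > 0


def activate_pilot_light(group: AutoScalingGroup, production_capacity: int) -> None:
    """'Turn on' compute by requesting instances launched from the replicated AMI."""
    if production_capacity < 1:
        raise ValueError("production_capacity must be at least 1")
    group.desired_capacity = production_capacity


# Hypothetical group in the recovery Region.
web_tier = AutoScalingGroup(name="web-recovery", ami_id="ami-0123456789abcdef0")
print(web_tier.can_serve_requests())  # False: the pilot light is lit, the furnace is off

activate_pilot_light(web_tier, production_capacity=4)
print(web_tier.can_serve_requests())  # True: compute is now deployed
```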

Warm standby

Like the pilot light strategy, the warm standby strategy keeps data live and includes periodic backups. The key difference lies in the infrastructure and the code running on it. In a warm standby setup, a minimal deployment is maintained that can handle requests, but only at reduced capacity; it cannot manage full production-level traffic. As shown in the figure above, this is represented by one Amazon EC2 instance deployed per tier. Warm standby is easier to test because no extra steps are required for the passive endpoint to process synthetic test transactions before it goes live. However, before failover, the infrastructure needs to scale up to accommodate full production demands.
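
The scale-up step before failover is simple arithmetic: compare what is running in the recovery Region with what production needs. Here is a minimal Python sketch, assuming one instance per tier as in the description above. The tier names and production counts are hypothetical.

```python
def instances_to_add(running: dict, production: dict) -> dict:
    """For each tier, how many more instances are needed to reach production capacity."""
    return {
        tier: max(needed - running.get(tier, 0), 0)
        for tier, needed in production.items()
    }


def is_production_ready(running: dict, production: dict) -> bool:
    """True once every tier has at least its production instance count."""
    return all(count == 0 for count in instances_to_add(running, production).values())


warm_standby = {"web": 1, "app": 1}  # one instance per tier
production = {"web": 6, "app": 4}    # ASSUMPTION: illustrative production sizes

print(instances_to_add(warm_standby, production))     # {'web': 5, 'app': 3}
print(is_production_ready(warm_standby, production))  # False: scale up before failover
```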

Multi-site active/active

In a multi-site active/active strategy, two or more Regions handle requests simultaneously. Failover involves re-routing requests away from a Region that is unable to process them. Data is replicated across Regions and actively used to serve read requests in each location. For write requests, various patterns can be employed, such as writing to the local Region or re-routing writes to specific Regions, which will be explored in an upcoming blog post. As with all disaster recovery strategies, data is backed up to recover from accidental deletion or corruption. The figure above shows Amazon DynamoDB global tables as the database tier. This is an ideal choice for multi-site active/active setups because any Region can accept writes, and the data is replicated across all Regions, typically within a second.
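
The two write patterns mentioned above, along with failover away from an unavailable Region, can be sketched in Python as follows. The enum names, the function, the Region names, and the rule for picking a fallback Region (first healthy Region in alphabetical order) are assumptions made for illustration.

```python
from enum import Enum
from typing import Set


class WritePattern(Enum):
    LOCAL = "write to the Region that received the request"
    DESIGNATED = "re-route every write to one specific Region"


def route_write(request_region: str, pattern: WritePattern,
                healthy_regions: Set[str], designated_region: str) -> str:
    """Pick the Region that should process a write request."""
    if not healthy_regions:
        raise RuntimeError("no healthy Region available")
    preferred = request_region if pattern is WritePattern.LOCAL else designated_region
    if preferred in healthy_regions:
        return preferred
    # Failover: route away from a Region that cannot process requests.
    # ASSUMPTION: fall back to the first healthy Region in alphabetical order.
    return sorted(healthy_regions)[0]


healthy = {"eu-west-1", "us-east-1"}
print(route_write("eu-west-1", WritePattern.LOCAL, healthy, "us-east-1"))       # eu-west-1
print(route_write("eu-west-1", WritePattern.DESIGNATED, healthy, "us-east-1"))  # us-east-1
print(route_write("us-west-2", WritePattern.LOCAL, healthy, "us-east-1"))       # eu-west-1
```

Writing locally keeps latency low but requires a data store that accepts writes in every Region. Sending all writes to one Region avoids write conflicts at the cost of cross-Region latency.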

Building resilience through tailored recovery solutions

Disaster events can threaten the availability of your workload, but AWS Cloud services help mitigate or eliminate these risks. By first understanding the business requirements for your workload, you can select the most suitable DR strategy. Leveraging AWS services, you can then design an architecture that meets your business's recovery time and recovery point objectives.

TP is a proud AWS Advanced Tier Services Partner. Our extensive expertise and experience empower us to assist customers in leveraging the AWS Well-Architected Framework. This framework provides best practices and guidelines that ensure your workloads on AWS are designed and operated to be reliable, secure, efficient, and cost-effective. We are here to listen, understand, and help you achieve your goals.

Learn more: https://www.tp.com/why-tp/aws/
