Cloud Disaster Recovery – What are the options?
When disaster strikes, there’s not a moment to lose. Every second of downtime costs you money and damages your reputation. You need a plan to get back on your feet - fast.
Without knowing what disaster might strike - or when - you must be prepared for anything.
It could be a natural disaster, warfare, alien invasion, or a spilled Latte Grande with extra caramel syrup in a data center somewhere. Even a ‘small’ human error could result in massive data loss or service disruption.
But with a solid Disaster Recovery Plan, you can get your cloud back to ‘business as usual’ without breaking a sweat. Let’s have a look at how you can go about this.
Define your recovery timeline
The first thing that you need to decide is how fast you need to restore your systems: seconds, minutes, or hours?
Take time to really think about this.
Naturally, the shortest disaster recovery time sounds the most appealing, but it comes with a cost. So you need to figure out whether the benefit is worth that cost – in certain circumstances, near-instantaneous recovery is worth it; in others, it isn’t.
The cost of continuity
To help understand this, you need to put a clear value on business continuity – how much does every second of downtime cost you? Then you can compare this figure to the known cost of disaster recovery solutions.
It’s useful to think of it a bit like an insurance policy. Insurers use the ‘loss ratio’ – claims paid out divided by premiums collected – to judge whether a policy is priced sensibly, and a typical loss ratio is around 60%. The same logic applies to disaster recovery: the cost of your ‘policy’ (your recovery plan) should stay comfortably below the ‘payout’ it protects (your potential business loss).
From the buyer’s side, an auto insurance policy might cost you annual premiums in the region of 10% of your total potential payout. But is the risk of a cloud disaster the same as the risk of an auto accident? That’s hard to know, because threats can come from many directions, but it gives you some parameters to get started with.
The complexity of your cloud will also be a factor, and you must consider the knock-on effect of losing customers and partners after a serious outage.
Ultimately, your own risk appetite will be the deciding factor in what you determine to be an ‘acceptable cost’ of business continuity. With a clear price tag on continuity, you have a better idea of how much downtime is acceptable.
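To make this concrete, here’s a rough back-of-the-envelope sketch in Python. Every figure in it is a placeholder assumption – substitute your own numbers:

```python
# Rough sketch of the continuity-cost calculation above.
# All figures are placeholder assumptions.
hourly_downtime_cost = 10_000      # revenue at risk per hour of downtime
knock_on_multiplier = 1.5          # lost customers/partners after an outage
plausible_outage_hours = 8         # worst-case annual downtime you plan for

potential_annual_loss = hourly_downtime_cost * knock_on_multiplier * plausible_outage_hours

# Using the auto-insurance-style heuristic above (~10% of the potential payout)
acceptable_dr_budget = 0.10 * potential_annual_loss

print(f"Potential loss: ${potential_annual_loss:,.0f} per year")
print(f"Acceptable DR spend: ~${acceptable_dr_budget:,.0f} per year")
```

If your risk appetite is lower, raise the percentage; the point is to anchor your DR budget to a number rather than a gut feeling.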
Now, let’s turn this into clear recovery goals.
Clear recovery goals
There are two metrics commonly used as disaster recovery goals, and their values will determine which recovery option strikes the best balance of cost vs. speed.
- The first of these is your Recovery Time Objective (RTO): the maximum amount of time you will allow for returning to normal operations if something goes amiss – essentially your maximum acceptable downtime.
- Second, you must assess your Recovery Point Objective (RPO), which is the amount of data (measured in elapsed time) that you’re willing to lose, based on your most recent ‘restore point.’
The RPO value has a direct impact on how often data must be backed up – a tighter RPO demands more frequent (or continuous) backups, and in practice many organizations opt for continuous backups due to the volume of data involved. Your RPO is effectively the maximum age of your most recent restore point, so you can determine it by asking how far back in time you’re willing to go – in other words, how much data you’re prepared to lose.
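As an illustration, here’s one way to turn an RPO into a backup cadence in Python. The ‘back up at half the RPO’ rule of thumb is our assumption rather than a standard – it simply leaves headroom so that a failure just before the next backup still meets the objective:

```python
# Sketch: derive a backup interval from an RPO target.
# Assumption: schedule backups at half the RPO, so a failure just
# before the next backup still stays within the objective.
def backup_interval_minutes(rpo_minutes: int) -> int:
    """Return a backup interval (in minutes) that comfortably satisfies the RPO."""
    return max(1, rpo_minutes // 2)

for rpo in (60, 240, 1440):        # 1-hour, 4-hour, and 24-hour RPOs
    print(f"RPO {rpo:>4} min -> back up every {backup_interval_minutes(rpo)} min")
```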
Ways to reduce downtime in case of a disaster
Good planning
One way to reduce your recovery timeline is through good planning and clear communication. It’s important to review and update your disaster recovery plan regularly, as your needs will change over time. Roles and responsibilities must be clear, as this avoids confusion and speeds up recovery.
Runbook
To enhance recovery speed, keep an updated runbook covering all the potential issues you can think of. By defining each one with a clear plan for getting everything back to normal, you can drastically reduce your downtime and organize your tasks effectively, with clear responsibilities for each team member.
Disaster detection
You also need to think about detection methods. Your detection speed will have a huge impact on how fast you can recover from a disaster situation. For this, we recommend using Amazon CloudWatch, which can pick up problems based on workload KPIs, anomaly detection, service validation, and service API metrics.
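As a minimal sketch, here’s how you might create a CloudWatch alarm on a workload KPI with boto3. The alarm name, threshold, load balancer, and SNS topic ARN are all placeholders:

```python
# Sketch: alarm on a spike in 5xx errors from a (hypothetical) load balancer.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="dr-detect-5xx-spike",                    # placeholder name
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer",
                 "Value": "app/my-alb/0123456789abcdef"}],  # placeholder ALB
    Statistic="Sum",
    Period=60,                                          # evaluate per minute
    EvaluationPeriods=3,                                # 3 consecutive breaches
    Threshold=50,                                       # >50 errors per minute
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:dr-alerts"],  # placeholder topic
)
```

The faster an alarm like this fires, the sooner your recovery clock starts – which is exactly what your RTO is measured against.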
Infrastructure as Code
You must also be able to recover your configurations and infrastructure using Infrastructure as Code. You can do this with AWS CloudFormation or Terraform (we have a few Terraform templates for this), so you can quickly redeploy your infrastructure (as well as recovered data) across multiple accounts and Regions.
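For example, assuming your infrastructure is already captured in a CloudFormation template stored in S3, redeploying it into a recovery Region with boto3 might look something like this (the stack name, template URL, and Region are placeholders):

```python
# Sketch: redeploy infrastructure from a stored template into a recovery Region.
import boto3

cfn = boto3.client("cloudformation", region_name="eu-west-1")   # recovery Region

cfn.create_stack(
    StackName="my-workload-recovery",                   # placeholder name
    TemplateURL="https://s3.amazonaws.com/my-dr-bucket/workload.yaml",
    Capabilities=["CAPABILITY_NAMED_IAM"],              # needed if the template creates IAM resources
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "dr"}],
)

# Block until the stack is up, then hand over to the data-restore steps.
cfn.get_waiter("stack_create_complete").wait(StackName="my-workload-recovery")
```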
Testing
It is wise to regularly test your Disaster Recovery solution, so you know it will work as you expect (and hope) if the worst happens. This will help you identify issues and see if you can meet your RPO and RTO goals. You can test your solution with AWS Config by enabling AWS managed conformance packs like Operational Best Practices for Data Resiliency.
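As a sketch, enabling a conformance pack with boto3 might look like this, assuming you’ve copied the managed template into your own S3 bucket (the pack name and URI are placeholders):

```python
# Sketch: deploy an AWS Config conformance pack from a staged template.
import boto3

config = boto3.client("config", region_name="us-east-1")

config.put_conformance_pack(
    ConformancePackName="data-resiliency-checks",       # placeholder name
    TemplateS3Uri="s3://my-dr-bucket/operational-best-practices-for-data-resiliency.yaml",
)
```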
What types of Disaster Recovery are available?
If you’re using AWS for your cloud, you have several options that give you different levels of disaster recovery protection, depending on your needs.
These are:
- Backup and restore (active/passive)
- Pilot Light (active/passive)
- Warm Standby (active/passive)
- Multisite (active/active)
Backup and restore
A Backup and Restore policy is the most basic level of protection and should be part of any Disaster Recovery plan. You can set this up with AWS Backup, but note that it doesn’t provide automatic recovery.
Unless you’re running a single, simple workload, additional layers of protection will be needed.
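As a minimal sketch, a daily backup plan in AWS Backup via boto3 could look like this, assuming a backup vault named ‘dr-vault’ already exists (all names and the schedule are placeholders):

```python
# Sketch: a daily backup plan with 35-day retention.
import boto3

backup = boto3.client("backup", region_name="us-east-1")

backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "daily-dr-backups",           # placeholder name
        "Rules": [{
            "RuleName": "daily-0500-utc",
            "TargetBackupVaultName": "dr-vault",        # assumed to exist
            "ScheduleExpression": "cron(0 5 * * ? *)",  # daily at 05:00 UTC
            "Lifecycle": {"DeleteAfterDays": 35},       # retention window
        }],
    }
)
```

You would still need to assign resources to the plan (via a backup selection) and, crucially, rehearse the restore side – a backup you’ve never restored is a hope, not a plan.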
‘Pilot Light’ disaster recovery
This approach is called a ‘pilot light’ because, just like the pilot light in a home heating system, it makes it easy to relight your cloud when it goes out. It’s not running at full burn, but there’s enough of a flame to get things going again reasonably quickly.
The pilot light method is quite economical and uses a minimum of resources to offer a faster recovery time.
With the pilot light method, a copy of your core workload infrastructure is kept in another AWS Region. It’s recommended to use a different account for each Region (for security), and you’ll need to use Infrastructure as Code to deploy across accounts and Regions. To guard against data corruption or loss, you should also use point-in-time snapshots or versioning.
If you’re not sure how fast you need to get up and running, the pilot light approach gives you a recovery time of under an hour (usually much less) without being excessively expensive. If you later decide that a faster recovery time is needed, you can always move up to the next tier.
You can use the pilot light method by leveraging AWS Elastic Disaster Recovery.
AWS Elastic Disaster Recovery continuously replicates your applications and databases, so you can rapidly spin up a full-capacity deployment in the recovery VPC. It also replicates your network settings and configurations, so the recovered environment is equally secure.
Because it uses less compute (compared to the options below), AWS Elastic Disaster Recovery is much more economical, while still letting you restore to your maximum capacity in a short time.
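To give a feel for the API, here’s a hedged boto3 sketch of launching a recovery drill with the Elastic Disaster Recovery (drs) client. It assumes your source servers are already replicating, and omits the pagination and error handling you’d want in production:

```python
# Sketch: launch drill instances for all replicating source servers.
import boto3

drs = boto3.client("drs", region_name="eu-west-1")      # recovery Region

# List replicating source servers (paginate in real deployments).
servers = drs.describe_source_servers()["items"]

# isDrill=True launches drill instances; set it to False for a real recovery.
job = drs.start_recovery(
    isDrill=True,
    sourceServers=[{"sourceServerID": s["sourceServerID"]} for s in servers],
)
print("Started recovery job:", job["job"]["jobID"])
```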
‘Warm standby’ disaster recovery
Some organizations prefer a ‘warm standby’ approach, which is like a warmed-up version of the pilot light. Just as with the pilot light, a warm standby continuously replicates and backs up your data in your recovery Region, but it additionally runs a scaled-down version of your functional stack.
The benefit of using the warm standby approach is that your services/workloads are already running (although scaled down), so they can start handling requests immediately.
Sure, the initial capacity will be lower, but it quickly scales to meet demand, reducing the recovery time to a few minutes at most.
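As an illustration, promoting a warm standby to full capacity can be as simple as resizing an Auto Scaling group in the recovery Region. Here’s a boto3 sketch with placeholder names and capacity figures:

```python
# Sketch: scale a warm-standby fleet up to production capacity on failover.
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-workload-standby",         # placeholder name
    MinSize=4,
    MaxSize=12,
    DesiredCapacity=8,                                  # production-scale capacity
)
```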
Multisite (active/active)
This is the highest level of protection, offering nearly instantaneous recovery when a problem is detected.
With a multisite strategy, your workloads run simultaneously in multiple Regions. This method also requires continuous backup to ensure near-zero data loss.
Multisite disaster recovery essentially relies on a fully redundant mirror of your cloud, running at full capacity and ready to take over whenever you need it to. As you might imagine, this uses a lot of resources and comes at a cost.
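One common way to run active/active across Regions is weighted DNS with health checks, so an unhealthy site automatically drops out of rotation. Here’s a hedged boto3 sketch – the hosted zone ID, record name, IPs, and health check IDs are all placeholders:

```python
# Sketch: Route 53 weighted records across two Regions with health checks.
import boto3

route53 = boto3.client("route53")

def weighted_record(site: str, ip: str, health_check_id: str) -> dict:
    """Build an UPSERT change for one site's weighted A record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",                  # placeholder record
            "Type": "A",
            "SetIdentifier": site,
            "Weight": 50,                               # split traffic evenly
            "TTL": 60,
            "HealthCheckId": health_check_id,           # unhealthy sites drop out
            "ResourceRecords": [{"Value": ip}],
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",                     # placeholder zone
    ChangeBatch={"Changes": [
        weighted_record("site-a", "198.51.100.10", "hc-site-a-id"),  # Region A
        weighted_record("site-b", "203.0.113.10", "hc-site-b-id"),   # Region B
    ]},
)
```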
What else do you need?
Data backup is the foundation for Disaster Recovery, but what else do you need? Which level of protection is necessary to ensure the continuity of your cloud, and what other options are there?
Should you use a Backup as a Service (BaaS) or a Disaster Recovery as a Service (DRaaS) platform? Do you really need a full backup and disaster recovery environment mirrored on virtual machines?
It's important to understand what your specific needs are, as there may be other Disaster Recovery options which are a better fit, or ones that are just more cost-effective.