Many people have a false sense of security when moving to the public cloud from a performance and recovery perspective. They have been led to believe, either by unscrupulous salespeople or simple misunderstanding, that moving to the cloud is a panacea for all sorts of reasons, not just cost. The truth is, “cloud” is not the same as the “EASY” button, and sound engineering design can’t be overlooked.
Your cloud computing disaster recovery (DR) plan will depend heavily on your cloud use case. From a practical standpoint, there are typically two scenarios that organizations find themselves in as it relates to DR in the cloud:
- IaaS supporting individual workloads (i.e., virtual machines) to take advantage of reduced capital costs.
- IaaS/PaaS supporting applications to take advantage of rapid elasticity.
If you are supporting individual workloads, then your DR planning will most likely focus on how to bring a failed instance back up. This is the simpler of the two from a planning standpoint and can be handled by following your non-cloud data center restoration procedures, except it should be much faster. For example, if a piece of hardware fails in your data center, you have to physically replace it, which takes time, and then restore the function. In the cloud, you just bring up a new instance and restore. Losing an entire physical data center is catastrophic, whereas in the cloud you can move to a different part of your provider’s infrastructure (e.g., availability zone/region). This still takes planning and processes, but the similarity to a traditional data center DR process makes it smoother.
The second scenario is much more problematic for one reason: There is far more reliance on cloud characteristics, which must be accounted for when creating your DR plan. Here are some key disaster recovery best practices to make sure that happens.
Prioritize assets and understand interdependencies
You first need to establish priorities for your systems and applications. Perform a risk analysis and determine which assets are critical, important or ancillary. In that vein, I suggest you have a specific answer to the following question for your critical and important assets:
- How long can you live without an asset? You will need to know whether you have to be fully functional in five minutes or can take the time to bring the asset up on another CSP and restore from backup. This will give you metrics for later use.
Once you have identified the critical and important assets, you need to know what specific systems and software make up those assets. After you have this, the next step is to determine the technical dependencies and “fire-up order” needed to bring the critical assets back online.
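The dependency mapping and “fire-up order” described above amount to a topological sort of the asset graph. As a minimal sketch (the asset names and dependencies here are purely illustrative), Python’s standard library can compute the order directly:

```python
from graphlib import TopologicalSorter

# Hypothetical asset map: each asset lists what must be running before it.
deps = {
    "web-frontend": {"app-server"},
    "app-server":   {"database", "cache"},
    "cache":        {"network"},
    "database":     {"network", "storage"},
    "network":      set(),
    "storage":      set(),
}

# static_order() yields dependencies before the assets that need them,
# which is exactly the fire-up order for restoration.
fire_up_order = list(TopologicalSorter(deps).static_order())
print(fire_up_order)
```

A useful side effect: if your dependency map is circular (two assets that each require the other to start), the sorter raises a `CycleError`, flagging a plan that cannot actually be executed.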
Having running systems and applications is great, but without the necessary data, most would agree it is a futile effort. In that vein, you need to answer the question: How much data can I live with re-entering?
If you batch enter nightly, you can likely just reapply the batch, but if you process in real time, you may need more frequent backups or database snapshots. It is important to understand how much data you can live without, and then implement the appropriate backup, duplication or snapshot processes to ensure the data you require is available during the restoration process.
In particular, how you snapshot or back up the data in your databases will be among the most important decisions and processes you undertake in DR planning. It’s important to have current data and to make sure data is available in multiple locations for restoration. This can be done in a number of ways, and the best way likely depends on your cloud service provider.
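The “how much data can I live with re-entering” answer is, in effect, a recovery point objective (RPO). A minimal sketch of turning that answer into a snapshot schedule (the safety factor here is an illustrative assumption, not a standard):

```python
from datetime import timedelta

def snapshot_interval(rpo: timedelta, safety_factor: float = 0.5) -> timedelta:
    """Snapshot more often than the RPO strictly requires, so that one
    missed or still-in-flight snapshot leaves you inside the acceptable
    window of data loss."""
    return rpo * safety_factor

# If the business can tolerate re-entering one hour of data:
print(snapshot_interval(timedelta(hours=1)))  # 0:30:00
```

The point of the sketch is that the snapshot cadence should be derived from the business answer, not picked arbitrarily by the operations team.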
Architect your application for DR in the cloud
One of the key factors in a successful recovery is whether your applications can actually be recovered. In the cloud, that means your application should be architected to leverage cloud functionality that can be thought of as built-in DR mechanisms (i.e., a new VM is spun up and configured when errors occur in a running VM, then processing is transferred). While the cloud services may perform failover correctly, many applications aren't architected to survive failover themselves. In a recent blog post, Netflix indicated it had redesigned its application to be near stateless, and that architecture worked out well for the company during the recent AWS outage. Thus, when a server or region of servers goes down, the DR process involves just bringing up more servers, and the company does not have to worry about application state as part of the DR process.
Whether or not you back up server images or virtual machines depends on the architecture of your application and your cloud environment. If your images are consistent and do not contain special configuration information, then you can likely skip backing up images and just focus on the data, restoring the data and then spinning up the images to bring you back to business. Even if you have custom configurations on systems, certain cloud management platforms provide the ability to automate that customization.
Leverage virtual machines
One of the advantages virtual machines provide is the potential to have hot spares up and running (a.k.a. overprovisioning). While one of the main reasons organizations move to the cloud is reduced cost, you should not be penny-wise and pound-foolish. Depending on how quickly you need to get a machine back up and running, it may be worthwhile to have hot spares running in sites that are not likely to be affected. This is no different than having a hot DR site. As the AWS outage showed us, it may take a while, if ever, for you to bring up a fully suspended/off instance. It reminds me of the scene in Top Gun where they have Maverick on alert, but the catapult is broken. Do you want to wait for the instance to get spun up, or do you want some in hot standby? If you cannot wait, then have them spun up. This will all be determined during the risk assessment. While some may complain of the increased cost, a running but idle virtual machine is significantly less expensive to an organization than a piece of idle metal sitting in a data center.
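The hot-spare tradeoff can be made concrete in a small selection routine. This is a sketch with made-up instance IDs, states and zone names, not any provider’s API:

```python
# Illustrative inventory: instance id -> (state, zone).
instances = {
    "i-001": ("running",   "east-1a"),  # currently serving traffic
    "i-002": ("hot-spare", "west-2a"),  # idle but booted: seconds to promote
    "i-003": ("stopped",   "west-2b"),  # must be spun up first: minutes
}

def pick_failover(instances, failed_zone):
    """Prefer a hot spare outside the failed zone; fall back to a cold start."""
    spares = [i for i, (state, zone) in sorted(instances.items())
              if state == "hot-spare" and zone != failed_zone]
    if spares:
        return spares[0], "promote"   # just redirect traffic
    cold = [i for i, (state, zone) in sorted(instances.items())
            if state == "stopped" and zone != failed_zone]
    if cold:
        return cold[0], "spin-up"     # slower, and may fail under load
    return None, "none"

print(pick_failover(instances, "east-1a"))
```

The `"none"` branch is deliberate: the risk assessment should tell you, before an outage, whether ending up in that branch is acceptable.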
Also, consider what happens when your cloud service provider fails to spin up your instance. You need to plan for failed failover!
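Planning for failed failover means the recovery logic itself needs a fallback path. A hedged sketch, where the error type and the actions are stand-ins for real provider calls:

```python
def failover(primary_action, fallback_actions, attempts=3):
    """Try the planned recovery first; if the provider cannot spin up the
    instance, move on to the next region/provider rather than retrying
    forever against a failing one."""
    for action in [primary_action] + list(fallback_actions):
        for _ in range(attempts):
            try:
                return action()
            except RuntimeError:  # stand-in for a provider capacity error
                continue
    raise RuntimeError("all failover paths exhausted")

# Usage with stand-in actions simulating the primary region failing:
def east():
    raise RuntimeError("no capacity")  # simulated failed spin-up

def west():
    return "instance-in-west"

print(failover(east, [west]))  # instance-in-west
```

The structure matters more than the code: every step in your DR runbook should name its own “plan B,” and the number of retries before giving up on a path should come out of your risk analysis.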
Databases and testing
How your database responds in a DR situation depends on how that type of database is implemented. Most organizations use either a master-slave arrangement with traditional SQL databases or NoSQL-based databases. Each has advantages and disadvantages, and the selection depends more on your application requirements than on your disaster recovery procedure. Whatever you choose, make sure it is possible to restore in the following scenarios:
- Single system loss
- Multiple node failure
- Complete main data loss
While it should go without saying, disaster recovery best practices require testing for failure situations. You should have processes and procedures in place that test your DR processes on a regular basis to make sure they perform as expected. While you can’t test for every condition, you can test for a number of them. Netflix noted in its summary of the AWS outage that it used a tool called “Chaos Monkey” to help in failure simulations.
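The idea behind a Chaos Monkey-style tool can be sketched in a few lines; this is an illustrative simulation, not Netflix’s actual implementation:

```python
import random

def chaos_round(fleet, kill_fraction=0.2, rng=None):
    """Randomly 'terminate' a fraction of the fleet, Chaos Monkey-style,
    so recovery paths get exercised before a real outage forces the issue."""
    rng = rng or random.Random()
    n_victims = max(1, int(len(fleet) * kill_fraction))
    victims = set(rng.sample(sorted(fleet), n_victims))
    return fleet - victims, victims

fleet = {"web-1", "web-2", "web-3", "db-1", "db-2"}
survivors, killed = chaos_round(fleet, rng=random.Random(42))
# After each round, verify the service still meets its SLA using only
# `survivors` -- that check is the actual test of your DR process.
```

Running this kind of exercise on a schedule, against non-production first, turns DR testing from an annual event into a routine habit.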
A final word of advice: Your DR processes and procedures should be documented and available locally (or in multiple cloud locations if stored in the cloud). Make sure you update those processes and procedures as things change as well. A decent guide to look at is the Contingency Planning Guide for Federal Information Systems (NIST Special Publication 800-34 Rev. 1).
Remember, the cloud is nothing new, just different technology that we need to apply good practices to. If you don’t take time to implement disaster recovery best practices in the cloud, any money you saved from moving to the cloud will be severely offset in the event of a failure.
About the author:
Philip Cox is director of security and compliance for SystemExperts Corporation, a consulting firm that specializes in system security and management. He is a well-known authority in the areas of system integration and security. He serves on the Trusted Cloud Initiative Architecture workgroup, as well as the PCI Virtualization and Scoping SIGs. He frequently writes and lectures on issues dealing with heterogeneous system integration and compliance (PCI-DSS and ISO) and is the lead author of Windows 2000 Security Handbook Second Edition(Osborne McGraw-Hill) and contributing author for Windows NT/2000 Network Security (Macmillan Technical Publishing) and CIW Security Professional Certification Bible (Wiley).
This was first published in May 2011