If you aren't familiar with the term, resiliency refers to the ability of a system to continue to provide service during a disruptive event. The disruptive event could be anything from a natural disaster, such as a flood or earthquake, to a manmade situation like a power outage, to a "run of the mill" operational failure like an infrastructure issue, hardware failure or misconfiguration.
No matter what causes the disruption, the degree to which a system provides resiliency is the degree to which it can continue to operate during that critical time period. In this tip, I will discuss the importance of cloud resiliency and offer steps for starting to calculate and monitor it.
Clouds aren't immune to downtime
Traditional wisdom holds that cloud technologies -- by virtue of their economies of scale and scalable architecture -- can offer a number of advantages to customers in the arena of resiliency. Specifically, because cloud services are often created in such a way as to span multiple redundant geographic locations and are built around scalable and rapid provisioning, many customers believe this means their applications will continue to operate "automagically" under adverse circumstances.
Needless to say, organizations that actually use the cloud (and, let's face it, that's most of us nowadays) know that "the devil is in the details" when it comes to these points. For example, remember the April 2011 AWS outage? Many still call it the cloudpocalypse. In case you're not familiar with this event, it was a short-duration outage of the Amazon Web Services environment that had a fairly large-scale impact: It rendered a number of downstream subscribers, such as Reddit, Quora and Foursquare, unavailable. However, note that it's not just Amazon that has had issues. In that same year, Microsoft, Google, Intuit and VMware's Cloud Foundry all had issues with downtime as well.
Organizations must remember that even if cloud providers could remove all possibility of downtime in their infrastructure (which again, emphatically, they can't), downtime in a cloud-intersecting service can still happen. Consider what happens to a cloud service when its supporting, traditionally delivered, on-premises components or services go down -- for example, a software as a service (SaaS) application that leverages authentication components hosted in a traditional manner. There might be very good security or usability reasons to host components this way -- perhaps users demand the enhanced usability of not having to remember another password just for the SaaS, or enterprise security policies stipulate that only enterprise credentials are authoritative and trusted for login. However, that kind of integration also establishes an operational dependency, becoming a single point of failure in the process. If a cloud service stays available in adverse circumstances but the dependencies don't, users remain offline as long as they would if the cloud service were out of commission.
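To make the dependency point concrete, here is a minimal sketch of the arithmetic. It assumes that every component in the chain must be up for users to work and that failures are independent, which is a simplification. The availability figures and the name `serial_availability` are hypothetical and chosen only for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


# Hypothetical figures, for illustration only.
saas_availability = 0.9995          # the cloud service itself
on_prem_auth_availability = 0.995   # traditionally hosted authentication

combined = serial_availability([saas_availability, on_prem_auth_availability])
print(f"SaaS alone:        {saas_availability:.2%}")
print(f"SaaS plus on-prem: {combined:.2%}")
```

Under these assumptions, the combined figure can never be higher than that of the weakest component, which is why the on-premises dependency matters as much as the provider's own uptime.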
The point is that your enterprise won't be immune to downtime just because it's using cloud services. Given this, the question for practitioners becomes: How can we get a handle on situations where users might be impacted? What can we do to monitor cloud resiliency so that we know to take mitigation steps when availability risks exceed acceptable tolerances?
Taking stock: Usage information
The critical first step in monitoring cloud resiliency is determining how resilient cloud services are in the first place. This is more complicated than it sounds because it requires looking past any given cloud service provider's published uptime numbers, contractually negotiated uptime service-level agreements, or other statements that providers have made either formally or informally about uptime. Certainly, the statements made by the provider (in particular, those it has agreed to contractually) are important -- but they only tell part of the story; also required are two other pieces of information unique to you: information about your usage and information about your interdependencies.
The first data point -- information about usage -- essentially just means taking stock of which cloud services an enterprise uses, where it uses them and how they're being used. Cloud usage can grow organically, in many cases without direct oversight by IT (this is particularly true of SaaS). Those areas of usage need to be accounted for with respect to possible sources of downtime. Why? Because it could very well be that users apply these services to business-critical applications (you might wish this weren't the case, but sometimes it is); it's also possible that certain consumer-oriented services aren't built with "five nines" of uptime in mind. You need to understand this usage in order to thoroughly evaluate overall downtime scenarios and the potential impact these scenarios might have on the business as a whole.
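For readers who want to see what "five nines" means in practice, the following sketch converts an availability percentage into the downtime it permits per year. It assumes a 365-day year; the function name `allowed_downtime_minutes` is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_minutes(availability, hours=HOURS_PER_YEAR):
    """Minutes of downtime per period permitted by a given availability."""
    return (1.0 - availability) * hours * 60


for label, availability in [
    ("three nines", 0.999),
    ("four nines", 0.9999),
    ("five nines", 0.99999),
]:
    minutes = allowed_downtime_minutes(availability)
    print(f"{label} ({availability:.3%}): about {minutes:.1f} minutes per year")
```

Five nines works out to roughly five minutes of downtime a year, a target that a consumer-oriented service adopted without IT oversight may never have been designed to meet.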
There are a few ways to start gathering information about usage. One strategy is to leverage an existing process, such as a business impact analysis (BIA) to collect data about what cloud usage exists and how critical that usage is. If your organization requires a periodic BIA -- perhaps as part of disaster recovery or business continuity planning -- leveraging that data to also learn about cloud usage can save your organization quite a bit of elbow grease. Another option is to leverage cloud discovery tools emerging in the marketplace (such as Skyhigh Discover or CipherCloud for Cloud Discovery) that can help discover where employees are using cloud without the IT department's knowledge. Either method can help inform you about which cloud services your enterprise is currently using and how they potentially intersect, and help you assess issues that could arise should any individual service be offline.
Taking stock: Understanding cloud deployment architectures
The second step in monitoring cloud resiliency is for the company to understand the architecture of its cloud deployments so that it can look for single points of failure. It should pay particular attention to authentication systems, data storage systems or any other "master" repository of information or centralized point of processing that is located in the traditional (i.e., non-cloud) IT infrastructure.
A good way to get a handle on this is to start with artifacts that your organization may already have that illustrate interaction points. For example, if your business deals with a PCI environment, recall that it is required to have diagrams showing data flows (requirement 1.1.3) -- at least for cardholder data. Likewise, if your enterprise has gone through a formalized threat model that includes creating a data flow diagram, interaction points between components (such as those in the cloud and those in a traditional IT environment) may already be illustrated in those documents. If neither of those resources exists, the business may need to do some digging to uncover the interaction points. One strategy to help round out that information is to examine traffic to the central data repositories and key components that are the most likely candidates to be single points of failure.
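Once the interaction points are written down, even a simple script can highlight likely single points of failure. The sketch below assumes you have already built a mapping from each cloud service to the components it relies on and a list of which components are hosted on premises. All service names, component names and the function `shared_on_prem_dependencies` are hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory: each cloud service and the components it needs.
dependencies = {
    "saas-crm": ["corp-ldap", "saas-crm-api"],
    "cloud-storage": ["corp-ldap", "storage-gateway"],
    "expense-app": ["corp-ldap", "hr-database"],
}

# Hypothetical list of components hosted in the traditional IT environment.
on_premises = {"corp-ldap", "hr-database", "storage-gateway"}


def shared_on_prem_dependencies(dependencies, on_premises, min_dependents=2):
    """Return on-premises components that several cloud services rely on."""
    dependents = defaultdict(set)
    for service, components in dependencies.items():
        for component in components:
            if component in on_premises:
                dependents[component].add(service)
    return {
        component: sorted(services)
        for component, services in dependents.items()
        if len(services) >= min_dependents
    }


for component, services in shared_on_prem_dependencies(dependencies, on_premises).items():
    print(f"{component} is required by: {', '.join(services)}")
```

In this made-up inventory, the authentication directory is the only on-premises component shared by every service, so it is the first candidate to examine. A real analysis would also need to account for redundancy and failover, which this sketch ignores.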
Starting with these steps is by no means a recipe for removing all concerns about downtime or ensuring that you get to the resiliency that your enterprise wants. However, they will give you a starting point and at least the raw materials you'll require to do a more systematic analysis.
About the author:
Ed Moyle is the director of emerging business and technology at ISACA. He previously worked as senior security strategist at Savvis Inc. and as senior manager at Computer Task Group. Prior to that, he served as vice president and information security officer at Merrill Lynch Investment Managers.