Most organizations moving assets into cloud provider environments realize it’s prudent to do a thorough risk assessment beforehand. However, even the most robust cloud infrastructure can suffer from availability and security failures. In recent months, there have been several well-publicized cloud outages and cloud computing breaches. Cloud consumers can learn some simple lessons from these incidents.
Cloud outages demonstrate need for redundancy
Perhaps the most publicized cloud incidents have involved Amazon Web Services (AWS). In April, Amazon experienced a widespread outage of its Elastic Compute Cloud (EC2) service, bringing down a number of well-known companies’ online services, including Reddit and Foursquare. The outage began when Amazon engineers were upgrading network capacity in one of the company’s Eastern U.S. “availability zones,” and a configuration error forced traffic to the wrong router, overloading it. Amazon’s Elastic Block Store (EBS), a storage service for use within EC2, was directly affected by this router failure, which caused storage nodes to lose their connection to the rest of the redundant Amazon network. These nodes then began trying to claim any available space within their immediate clusters, causing a low-level denial-of-service (DoS) scenario.
Amazon got things back on track, but suffered a second outage in August, when a major data center hosting AWS services in Dublin, Ireland, lost power, affecting hundreds of companies like Netflix and Quora. The incident was resolved in a relatively short time, but both of these cloud outages still offer lessons.
Cloud customers who need a high level of availability should plan accordingly by building redundancy into their cloud services design. EC2 supports the use of Elastic IPs (EIPs), static IP addresses that belong to the customer’s account and can be remapped from one EC2 instance to another. By setting up redundant architectures in different availability zones, traffic bound for a failed zone could be quickly remapped (with the same IP addresses) to the systems in the failover zone(s). These systems could actually run simultaneously as a fully redundant, cross-zone architecture, thus mimicking a true high availability (HA) design in AWS. Costs will increase with this type of architecture, particularly for data transfer between zones, but downtime will be largely mitigated even if Amazon experiences a failure in one zone. This is a practical and realistic option for AWS customers, especially those that cannot afford any downtime for their computing assets and resources. For very large sites and interconnected infrastructures, this redundancy could be somewhat expensive, but likely less expensive than hosting a completely separate site elsewhere.
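The Elastic IP failover described above can be sketched in a few lines of Python. This is a minimal illustration, not an AWS-provided tool: the `Instance` record, the instance IDs and the `remap` callback are all hypothetical names introduced here. In a real deployment, `remap` would presumably wrap a call to the EC2 AssociateAddress API, and the health flag would come from whatever monitoring the customer already runs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Instance:
    """Hypothetical record describing one EC2 instance in the redundant design."""
    instance_id: str
    zone: str
    healthy: bool


def pick_failover_target(active: Instance,
                         standbys: Iterable[Instance]) -> Optional[Instance]:
    """Return the first healthy standby that lives outside the active zone."""
    for candidate in standbys:
        if candidate.healthy and candidate.zone != active.zone:
            return candidate
    return None


def fail_over(active: Instance,
              standbys: Iterable[Instance],
              remap: Callable[[str], None]) -> Optional[Instance]:
    """Remap the Elastic IP to a standby when the active instance is unhealthy.

    `remap` is an assumed callback that receives the target instance ID and
    performs the actual address reassociation.
    """
    if active.healthy:
        return None  # the active instance is fine, so leave the mapping alone
    target = pick_failover_target(active, standbys)
    if target is not None:
        remap(target.instance_id)
    return target
```

Passing the remapping step in as a callback keeps the selection logic testable without AWS credentials, and it makes the cross-zone rule explicit: a standby in the same zone as the failed instance is never chosen, since it is likely to be affected by the same outage.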
Hosting resources with multiple CSPs or hosting backups in a private cloud internally are additional options; organizations that want the highest degree of resilience and don’t completely trust one CSP may want to pursue those alternatives.
Cloud computing breach illustrates need for CSP security reviews
Another significant cloud incident occurred in October, when hackers compromised user accounts at German cloud provider Hetzner Online. A post-mortem review found Hetzner was severely deficient in configuring and maintaining a number of fundamental security controls. First, the site allowed logged-in users to traverse directories, reaching both other users’ content and platform files, including password files. Second, many of the passwords stored in these files were unencrypted. Although Hetzner later implemented strong 256-bit AES encryption for its passwords, the damage had already been done. In addition, it appears the company was performing only limited logging and may not have been reviewing any of the logs.
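To make the first failure concrete, the following Python sketch shows one common way to block directory traversal. It does not reflect Hetzner’s actual code, which is not public in this account; the function name and the example paths are hypothetical. The idea is to resolve every requested path and refuse it unless the result still sits inside the user’s own directory.

```python
from pathlib import Path


def resolve_user_path(user_root: str, requested: str) -> Path:
    """Resolve `requested` inside `user_root`, rejecting anything that escapes it."""
    root = Path(user_root).resolve()
    # Joining with an absolute `requested` discards `root`, and `..` segments
    # are collapsed by resolve(), so both cases are caught by the check below.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"path escapes user directory: {requested}")
    return candidate
```

With a user root of `/srv/users/alice` (an assumed layout), a request for `docs/report.txt` is allowed, while `../bob/secret.txt` or `/etc/passwd` raises `PermissionError`. A check like this is exactly the kind of control a customer’s security review should ask a provider to demonstrate.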
The fundamental lesson to learn from this incident is that a thorough review of any CSP security program should be considered mandatory prior to doing business with a cloud provider. Most contractual stipulations will not go into a deep level of technical detail, but they should mandate regular audits be performed against some agreed-upon set of security controls, ranging from well-known standards like ISO 27001 to regulations and compliance mandates like the PCI DSS or industry guidance such as the Cloud Security Alliance’s Cloud Controls Matrix (CCM).
The major risk inherent in cloud adoption is ceding control to an external organization, and that includes security controls. Consumers of cloud services should demand a high level of visibility and transparency into exactly what controls the providers have, as well as how they’re maintained.
About the author:
Dave Shackleford is the senior vice president of research and the chief technology officer at IANS. Dave is a SANS analyst, instructor and course author, as well as a GIAC technical director. Dave previously was the founder and principal consultant with Voodoo Security, and has consulted with hundreds of organizations in the areas of security, regulatory compliance and network architecture and engineering. Dave is a former QSA with several years' experience performing PCI assessments. He is a VMware vExpert, and has extensive experience designing and configuring secure virtualized infrastructures. Dave previously was CSO for Configuresoft, CTO for the Center for Internet Security, and has also worked as a security architect, analyst, and manager for several Fortune 500 companies.