Cloud availability and resiliency: Planning for failure

Gartner advises companies to take responsibility for cloud service resiliency.

SAN DIEGO -- The public cloud is a utility and utilities fail, making it critical that cloud customers prepare for downtime, said Richard Jones, managing vice president for Cloud and Data Center Strategies in the Gartner IT Professionals Research group.

Many enterprises appear to assume the cloud is reliable, but cloud provider outages -- such as the Amazon Web Services outage in April -- illustrate that cloud availability is susceptible to failure, he said in a presentation Thursday at the Gartner Catalyst Conference 2011 in San Diego.

“You need to understand your risk should this happen. You can’t expect something to be up and running all the time,” Jones said. “You need to take responsibility for uptime.”

IT professionals need to gain transparency into their cloud provider’s infrastructure to understand the level of cloud availability and resiliency, he said. “If there’s no transparency, assume the worst for design purposes.”

Cloud customers have varying degrees of control over resiliency, depending on the cloud model, Jones said. In the SaaS model, they have the least control; companies must review the provider’s track record on cloud availability, devise an exit strategy before signing a contract, and back up their data on a regular basis.

In the PaaS model, customers have more control and the ability to build some resiliency into their application, Jones said. “You’re going to want to design the application to accommodate failure,” he said, explaining that application resiliency includes stateless design and distributed and decoupled logic.

Jones advised looking for PaaS vendors that have decoupled their middleware and service components to support redundancy, scalability and failure handling. Customers should also understand the vendor’s resiliency features, such as data replication, snapshots and data location, as well as how to leverage them.

IT managers have the greatest control over resiliency in the IaaS model, Jones said. He discussed Amazon’s architecture of five global regions and availability zones within those regions, and referenced Amazon’s best practices for resiliency, which include using multiple availability zones.

Using a cloud provider with a global presence that can distribute your environment broadly can help build resilience against regional outages, but comes with additional cost, he said.
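As a rough illustration of what spreading a workload across zones can mean on the client side, the Python sketch below tries a list of zones in order and moves on when one is unreachable. It is not taken from Jones' presentation or from Amazon's guidance; the zone names, `call_with_failover` and `demo_request` are hypothetical, and the simulated outage stands in for a real network call.

```python
from typing import Callable, Dict, Sequence


class AllZonesFailed(Exception):
    """Raised when every zone in the list has failed."""


def call_with_failover(zones: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each zone in order and return the first successful response."""
    errors: Dict[str, Exception] = {}
    for zone in zones:
        try:
            return request(zone)
        except ConnectionError as exc:
            # Record the failure and fall through to the next zone.
            errors[zone] = exc
    raise AllZonesFailed(f"no zone answered: {sorted(errors)}")


def demo_request(zone: str) -> str:
    """Hypothetical stand-in for a network call; simulates zone-a being down."""
    if zone == "zone-a":
        raise ConnectionError("zone-a unreachable")
    return f"served from {zone}"


print(call_with_failover(["zone-a", "zone-b"], demo_request))
```

The sketch only covers request routing. Data replication between zones, and the extra cost Jones mentioned, are separate concerns it does not address.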

“No given solution is bulletproof,” Jones said. “Something can always happen.”
