
Amazon dodges another AWS reboot, but how?

In a surprise move, Amazon Web Services was able to patch a Xen hypervisor security flaw without a mass reboot of its EC2 infrastructure. So how did AWS pull it off?

After being forced to reboot a significant portion of its public cloud instances last October to install a Xen hypervisor security update, Amazon Web Services Inc. has managed to dodge another EC2 reboot to address what is believed to be a similar flaw.

The mystery, however, is that nobody knows how AWS could have installed the fix without a reboot, and the cloud computing giant isn't sharing details.

Last month AWS announced plans to once again reboot a significant portion of its EC2 server fleet because of a Xen hypervisor vulnerability that required a software update. A similar Xen hypervisor security advisory triggered an AWS reboot last fall, and also caused other major cloud providers such as IBM SoftLayer and Rackspace to reboot their public cloud infrastructure.

But last Monday AWS made a surprise announcement: The mass reboot was canceled, and 99.9% of its affected EC2 instances would receive live updates to the hypervisor to avoid a reboot.

"Since we posted the information below, our team has been working around-the-clock to find ways to minimize the impact for those requiring a reboot," AWS wrote in a support update. "We're happy to share that we'll now be able to live-update the vast majority of our older hardware for this Xen Security Advisory."

In the space of just a few months, AWS would appear to have radically improved its ability to address a major hypervisor security flaw without affecting system availability, evolving from mandatory reboots of large portions of its public cloud infrastructure to on-the-fly updates with virtually no downtime.

Adding to the intrigue, it would seem that whatever method AWS used to avert the reboot was finalized recently, likely between Feb. 27, when it announced the need for another reboot, and March 3, when the reboot was canceled.

So how did AWS avoid the reboot? The company couldn't be reached for comment and has offered few details. But security analysts are ruling out live migration -- seemingly the most likely option -- because Amazon does not support it, at least not as far as anyone knows.

"I think it's unlikely that they would use live migration to accomplish this without warning admins that their machines were going to be live-migrated," said John Burke, principal research analyst at Nemertes Research Group Inc., based in Mokena, Ill. "If they had functionality they felt [was] robust enough for this purpose, I would think they would be announcing it and offering it as an optional method to avoid the restart."

Instead, Burke said, it's more likely AWS devised a way to perform hot patching on its servers. 

Rich Mogull, analyst and CEO at Phoenix-based research firm Securosis LLC, agreed with the hot-patching theory, and said it's possible that since the last reboot, AWS came up with a way to live-patch this specific kind of hypervisor vulnerability without rebooting the hardware.

Mogull pointed out that AWS' most recent announcement specified that its "older hardware" was live-updated to avoid the reboot, which suggests that newer hardware avoided rebooting or updates altogether.

"It seems they use some hardware techniques to reduce or eliminate reboots on newer hardware," Mogull said.

Burke said it's possible whatever method AWS used to avert the reboots is contingent on the company's custom infrastructure.

"It may be dependent on how they build their systems in the first place, which they do regard as competitive IP," Burke said. "It may also be something their management systems make possible where other [cloud provider systems] don't."

AWS is keeping its cards close to the vest for now. Mogull said this is exactly the kind of information the company likes to keep secret because it could be a competitive differentiator.

"They have a ton of secret recipes they never talk about," Mogull said. "Things that you would think are simple on the surface, they always stay silent on. All those little advantages add up, and they really keep competitors guessing. For example, they don't even talk about how exactly an EBS [Elastic Block Store] snapshot or security groups work."

Burke concurred, saying that the secret recipe behind the averted reboot could be a big advantage for AWS, depending on how it was performed.

"Assuming it is not human-resource intensive, yes, it could be a differentiator," he said. "If it is not set-and-forget automatable, though, then no."
