By now you have heard that Amazon Web Services had a massive disruption yesterday, affecting Elastic Compute Cloud (EC2) instances in the company's northern Virginia data center. The disruption has been long-lived (Amazon's dashboard is still showing problems), and it certainly blew any claim to an annual uptime of 99.9 percent, which allows 8.76 hours of downtime per year. In fact, it likely blew 99.8 percent uptime, which allows 17.52 hours of downtime. While 99.8 percent sounds good, the fact that some sites have been down the better part of a day has a real impact on revenue. The downtime is bad for those who manage Amazon's Web Services, and it's bad for those who use them. No one likes downtime. But it's not necessarily a reason to avoid the cloud, and don't make the mistake of thinking that owning your own infrastructure would have spared you a similar problem.
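The downtime figures above are simple arithmetic on the number of hours in a year, and it can be useful to run the same numbers against your own provider's claims. The following is a minimal sketch, assuming a 365-day year; the function names are made up for illustration and do not come from Amazon or any provider's tooling.

```python
# Hours in a non-leap year (an assumption; a leap year has 8,784).
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Hours of downtime per year permitted by an annual uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def achieved_uptime_percent(downtime_hours: float) -> float:
    """Annual uptime percentage implied by a total number of downtime hours."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


if __name__ == "__main__":
    for pct in (99.9, 99.8):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% uptime allows {hours:.2f} hours of downtime per year")
```

Run in the other direction, `achieved_uptime_percent(24)` shows that a single full day of downtime already puts a site below 99.8 percent for the year, even if nothing else goes wrong.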
Technologies and strategies like cloud, virtualization, clustering and RAID are not going to magically dissolve failure. Failure happens. It happens to everyone. When failure happens to a giant like Amazon, it is a Big Deal--partly, in Justin Santa Barbara's opinion, because Amazon is a victim of its own success and of the promises it made to customers.
Amazon is also a victim of the hype surrounding cloud computing, with its notion that cloud computing today provides resilient, fault-tolerant services that can auto-magically recover from failure. Automation is key to the success of any cloud service, particularly one on the scale of EC2. It would appear, and this is pure speculation on my part, that Amazon's automatic recovery processes exacerbated the outage by consuming storage at a massive rate. Gartner's Lydia Leong has a good explanation of what happened in the Amazon outage and of the auto-immune vulnerabilities of resiliency. Ooooops. I am not going to kick Amazon. Mistakes happen to the best of us, and I have to think the folks who designed and manage Amazon's service are pretty talented. But if I were an Amazon EC2 customer, or any cloud customer, I'd be taking a good, hard look at my cloud providers' availability claims to see if they are sufficient to meet my business needs.
What do you do as a customer or potential customer of Amazon's services? Amazon is notoriously tight-lipped about its operations and management, and there are some folks--like Roman Stanek, whose company is an Amazon customer--who would like more transparency. Transparency is important not only during an event, but during the purchase process. If you are going to base your business, in whole or in part, on an external service, you need assurances that the service is run reliably and that the operation's processes are going to result in effective assessments of an outage's severity and a realistic assessment of recovery.
You can't really demand a demonstration of a catastrophic fail-over and recovery. You can, however, review the available materials and the processes a provider uses, assess their effectiveness based on your own experience--or, if you don't have the expertise, on the assessment of a trusted adviser--and determine whether the provider can keep its own promises. If a service provider isn't forthcoming with a potential customer, then perhaps you should decide not to do business with that organization.