Archive for April 22, 2011

Friday, April 22, 2011

Amazon EC2 Outage

Justin Santa Barbara:

I used the word “contract” above, but I meant it in the technical sense, not the legal sense. The legal contract is the SLA, which I consider relatively worthless. Engineers designed to the AWS technical ‘guarantees’, but a multi-AZ failure shouldn’t happen if AWS is upholding their end of the ‘bargain’.

Update (2011-04-25): Rich Wolski:

Thus, in the extreme, the best solution to the problem of tolerating failure reduces to the End-to-End argument: the application itself must include logic for managing failures, no matter how well engineered the system is that they are using, when continuous operation is a requirement. Admittedly, this statement is extreme but clouds like AWS are also extreme in the scale they can support making such reductionist logic potentially useful. AWS is an extraordinarily well-designed and engineered system, as its availability characteristics indicate. There is just no getting around the Law of Large Numbers and for AWS, one has to believe the numbers are large.

High Scalability:

So many great articles have been written on the Amazon Outage. Some aim at being helpful, some chastise developers for being so stupid, some chastise Amazon for being so incompetent, some talk about the pain they and their companies have experienced, and some even predict the downfall of the cloud. Still others say we have seen a sea change in future of the cloud, a prediction that’s hard to disagree with, though the shape of the change remains…cloudy.


There are three things we will do to prevent a single Availability Zone from impacting the EBS control plane across multiple Availability Zones. The first is that we will immediately improve our timeout logic to prevent thread exhaustion when a single Availability Zone cluster is taking too long to process requests. This would have prevented the API impact from 12:50 AM PDT to 2:40 AM PDT on April 21st. To address the cause of the second API impact, we will also add the ability for our EBS control plane to be more Availability Zone aware and shed load intelligently when it is over capacity. This is similar to other throttles that we already have in our systems. Additionally, we also see an opportunity to push more of our EBS control plane into per-EBS cluster services. By moving more functionality out of the EBS control plane and creating per-EBS cluster deployments of these services (which run in the same Availability Zone as the EBS cluster they are supporting), we can provide even better Availability Zone isolation for the EBS control plane.