Tuesday, April 26, 2011

My Thoughts about Amazon Web Service Failure


Amazon Web Services (AWS) had a major failure last week, and there was a lot of buzz about it since it took down several major web sites. AWS is one of the leading cloud infrastructure (or platform-as-a-service) companies. When they fail, it's big news.

Here’s my take.  No solution will provide 100% up-time.  There will always be a use-case that was not anticipated, a failure mode that was not thought of, or a human error that couldn’t be mitigated.  Is this a reason to call cloud computing with AWS a failure?  No.  Although I don’t know what their up-time stats are, I’m willing to bet that even if you don’t use a multi-site implementation, the benefits of scalability, flexibility and up-time still rival what a lot of companies could do on their own for the cost.
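To put rough numbers on the up-time point, here is a minimal sketch of the arithmetic. The availability figures are made up for illustration (they are not AWS's actual stats), the function names are my own, and the multi-site calculation assumes sites fail independently, which is optimistic since real outages are often correlated.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for an availability fraction (e.g. 0.999)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(site_availability: float, sites: int) -> float:
    """Availability of `sites` redundant sites, ASSUMING independent failures.

    The service is down only when every site is down at the same time.
    """
    return 1.0 - (1.0 - site_availability) ** sites


if __name__ == "__main__":
    # Illustrative figures only, not measured data from any provider.
    for availability in (0.99, 0.999, 0.9999):
        hours = downtime_hours_per_year(availability)
        print(f"{availability:.2%} up-time -> about {hours:.1f} hours down per year")

    two_sites = combined_availability(0.99, sites=2)
    print(f"Two independent 99% sites -> {two_sites:.2%} combined")
```

Even "three nines" leaves room for most of a working day of outage per year, which is why the interesting question is how the provider handles the failure, not whether one ever happens.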

What’s important here is how the vendor performed: what the cloud provider’s track record is and, more importantly, how they handled and reacted to the failure.  Did the vendor acknowledge the issue immediately?  Was the vendor open and honest in their communications?  Was the vendor able to recover in a reasonable amount of time given the incident?  Did the vendor conduct a root cause investigation and implement processes and/or technology to mitigate the risk of recurrence?  We wish we could have 100% up-time, but what we should demand from any provider is continuous improvement.  Any failure should make them stronger.

Real data to answer the above has eluded me, but from what I gather, they communicated openly and gave estimates and updates.

I closely watched one of the first major failures of Google services a few years ago.  After they recovered, Google presented their root cause findings.  Within those findings, Google acknowledged not only the technical issues and what they were going to do about them, but also what they could have improved in their reaction to the event.  As a result, Google revamped how they react and how they communicate.  This is clearly an example of a company that understands how important up-time is.  The on-line status board from Google is one of the results of this event.  In my opinion Google sets the bar not only in high availability for a cloud provider but also in accountability, professionalism, and how they respond.  My point is to show what a mature service provider looks like and how, at a minimum, a provider should be able to react.

What was interesting is that at the beginning of the year, Microsoft failed to learn from others (like Google) and, in my opinion, responded in a way that demonstrates what NOT to do.  I wrote about it here if you are interested.

If you look at the benefits of cloud computing, you still need to compare the availability and other benefits of going to the cloud vs. a local implementation.  I submit that many companies can’t come close to delivering that level of availability, along with the other features, at the price of AWS.  Sure, an outage at AWS can have a huge impact, but because of the devastating impact an outage can have on their own business, and because of their scale, AWS can afford to pay handsomely for security and availability.  None of this absolves customers of understanding the implementation and the impact of the key implementation choices that they make.

- Chris Claborne
