Monday, January 17, 2011

Google Draws a Line, Microsoft Stumbles

Google Announces New SLA

Google drew a line in the sand on January 14, 2011, when it announced some changes to its service level agreement (SLA). First, the SLA will no longer have an "out" for planned downtime. Customers will receive SLA credits for any downtime, planned or unplanned. Google claims to be the first major cloud provider to eliminate maintenance windows from its service level agreement.

Previously, Google did not count periods shorter than 10 minutes as downtime. That meant that even though short outages could add up to hours over a long enough period of time, the company had no obligation to compensate users. Google is ending that policy and will now credit users for any amount of downtime, no matter how brief.
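To make the effect of that change concrete, here is a minimal sketch that totals a month of outages under both rules. The outage durations are made-up numbers for illustration only, and the function name is hypothetical; the only figure taken from the announcement is the 10-minute threshold.

```python
# Hypothetical outage durations in minutes for one month (made-up numbers).
outages_min = [3, 7, 2, 12, 9, 4]

# Per the old policy described above: periods shorter than 10 minutes were not counted.
OLD_THRESHOLD_MIN = 10


def counted_downtime(outages, threshold_min=0):
    """Sum outage minutes, ignoring any outage shorter than threshold_min."""
    return sum(d for d in outages if d >= threshold_min)


old_rule = counted_downtime(outages_min, OLD_THRESHOLD_MIN)  # only the 12-minute outage counts
new_rule = counted_downtime(outages_min)                     # every outage counts

print(f"Counted under the old rule: {old_rule} minutes")
print(f"Counted under the new rule: {new_rule} minutes")
```

With these sample numbers, five of the six outages fall under the threshold, so most of the month's real downtime would never have shown up in an SLA calculation under the old rule.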

Gmail users have seen some downtime in the past. Google was very open and very aggressive, not only about posting status but, more importantly, about looking for the root cause, working vigorously to improve up-time and processes, and continually improving the way they communicate with customers. I will say that during the first outage Google had, they didn’t communicate well: it was hard to find out what happened, when it would be fixed, or who was affected. They took that experience and improved drastically. Last year, as a "Business Customer", I received a break on my renewal as compensation for downtime, although I never experienced any. This latest move puts teeth behind their quest to improve availability rather than just talk about it.

The Radicati Group, 2010. "Corporate IT Survey – Messaging & Collaboration, 2010-2011"

Any outage by a cloud provider receives a lot of press. It's not only damaging to the provider but always calls into question the use of any cloud service as a viable alternative to on-premises computing. Microsoft recently had their own disaster and it was pretty bad (see below). Google takes the opportunity in this press release to address the up-time of cloud vs. on-premises e-mail like Exchange. According to the Radicati Group, Gmail fares pretty well against on-premises solutions like Lotus and Exchange. In my humble opinion, this is just another stake in the heart of on-premises e-mail systems.

Disclaimer: Before I go any further, I have to tell you that I am a Google groupie or “fanboy”. I have converted everything that I do regarding personal office productivity apps over to Google Apps in order to truly experience it, and I love it (more on that in a later posting). This isn’t to the exclusion of Microsoft or other cloud providers. Now back to your regularly scheduled blog.

It would be unfair of me to state that Google’s SLA rocks, but it may be kicking a little butt out there. Microsoft’s official cloud-based business solution promises “Financially-backed, guaranteed 99.9% up-time”, which translates to 43.2 minutes of downtime in a 30-day month. I think Google has done two things here: they’ve drawn a line in the sand (isn’t competition wonderful) and they’ve set the bar for their own improvements.

According to Google’s announcement, when they factor in the accumulation of small delays of a few seconds and longer disruptions, they achieved 99.984% up-time for both business and consumer services. That averages out to 7 minutes of downtime per month. I dare say that a large swath of users saw 100% up-time (if you factor out user infrastructure and Internet connectivity). I work for a Fortune 500 company, and just getting to three 9s availability is a real task. The latest research from the Radicati Group found that on-premises email averaged 3.8 hours of downtime per month. Based on their announcement, it looks like Google is shooting to better their availability metric by pushing it to 99.99%, and to do it for millions of their users while adding functionality. The new goal would translate to reducing downtime to 4.32 minutes per month vs. Microsoft’s 43.2 minutes. Microsoft Office 365 on-line is brand new, so they don’t have availability numbers to stand on except for their reputation with Hotmail. It will be interesting to see how both competitors fare in 2011.
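The downtime figures quoted here all come from the same simple conversion. As a rough sketch, assuming a 30-day month (43,200 minutes), the arithmetic looks like this; the function names are made up for illustration and are not from any vendor's SLA tooling.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month


def downtime_minutes(pct):
    """Downtime in minutes per month implied by an availability percentage."""
    return (100.0 - pct) / 100.0 * MINUTES_PER_MONTH


def availability_from_downtime(downtime_min):
    """Availability percentage implied by a number of downtime minutes per month."""
    return 100.0 * (1.0 - downtime_min / MINUTES_PER_MONTH)


# Availability figures mentioned in this post.
for label, pct in [
    ("Microsoft SLA (three 9s)", 99.9),
    ("Google 2010 reported", 99.984),
    ("Google goal (four 9s)", 99.99),
    ("Dial tone (five 9s)", 99.999),
]:
    minutes = downtime_minutes(pct)
    print(f"{label}: {pct}% = {minutes:.2f} min/month ({minutes * 60:.1f} s)")

# Going the other way: the Radicati figure of 3.8 hours/month for on-premises email.
on_prem_pct = availability_from_downtime(3.8 * 60)
print(f"3.8 hours/month of downtime = {on_prem_pct:.2f}% availability")
```

Run in reverse, the Radicati figure of 3.8 hours per month works out to roughly 99.47% availability, which is short of three 9s.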

Microsoft Hotmail Stumbles

On December 30th, 2010, Microsoft took a huge hit to their availability numbers when 17,000 Hotmail users lost all of their data. The outage lasted well into the new year. This was a major hit to their Hotmail service and their public cloud image. The timing is horrible, as Microsoft is just starting to roll out Office 365, their on-line office productivity suite. InfoWorld beat them up pretty badly in their coverage (as the news media likes to do). The primary thing that I took away from this event (as covered by InfoWorld) is that Microsoft is still learning, and I think they are where Google was (application maturity wise) about 18 or more months ago. This includes not only their ability to maintain up-time but also how they react to downtime: messaging, status boards, processes, etc. I don’t underestimate their ability to come back from this, but a few things concern me. 1) Their messaging sucked and they are hopefully learning from this. 2) The outage was very long. 3) Many users lost data and it’s not clear, after over a week, if all of that data was restored. It’s one thing for the press to beat up on you for an outage, but the execution on the recovery was bungled and it won’t give customers or potential customers the warm fuzzies.

Everyone has their problems. had at least two outages last year. It’s just a fact of life. As pointed out in Google’s announcement, their mail SaaS isn’t dial tone. There are a lot of moving parts in these systems and we may never get to “5 9s availability” (99.999%), which is considered “dial tone availability” and translates to 25.9 seconds of downtime per month.

Evaluating a Cloud Provider's Availability

Separating the hype and the marketing

The press really enjoys taking providers to task when there is an outage, so those stories must sell. Still, I have to say, where there’s smoke there’s probably fire. When evaluating a provider’s availability numbers, it helps to use the press articles to balance out the marketing hype that you will see on the provider’s web site. You should look not only at how much downtime they experienced, but also at how they managed the event, how they communicated, and whether there is continuous improvement in the provider’s process. Take a look at the “Business Continuity” section of my “Challenges” article for what to look for.

What’s really needed is an independent, fact-based monitoring organization that can help bring better clarity to provider performance. I’ve seen some out there but can’t recall the sites right now. I’m sure this would raise the question of how exactly the monitoring would get done. A disreputable site could just dedicate machines and people to make sure the monitoring point that a company might use stays up.

Be Reasonable

As I suggested in my previous article, “Challenges & Risks of Implementing Cloud Computing”, you not only need to look at your provider’s performance history, but also use a reasonable measure of quality. For example, if a goal of “three 9s” availability is something that you can reasonably achieve internally, then it’s a good measure. 100% up-time is not, and even 5 9s may not be. Don’t forget, you also need to factor in your ability to deliver quality network connectivity to your users, as that plays a role in your real service availability numbers.
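One rough way to factor in network connectivity is to multiply the availabilities together. This is only a sketch under a simplifying assumption: the user needs both the provider and the network to be up, and the two fail independently. The 99.5% network figure is hypothetical, and the function name is made up for illustration.

```python
def combined_availability(*availabilities_pct):
    """Availability (in percent) of a chain of components that must all be up.

    Assumes the components fail independently of one another.
    """
    combined = 1.0
    for pct in availabilities_pct:
        combined *= pct / 100.0
    return combined * 100.0


provider_pct = 99.9  # the "three 9s" SLA figure discussed in this post
network_pct = 99.5   # hypothetical availability of your own network path

end_to_end = combined_availability(provider_pct, network_pct)
print(f"End-to-end availability: {end_to_end:.4f}%")
```

Under those assumptions, what your users actually experience is always a little worse than the weakest link, so a provider's three 9s can easily turn into something closer to two 9s by the time it reaches the desktop.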

Coming soon

I’m keeping an eye out for what is sure to be a market for independent companies that monitor the availability of software-as-a-service companies. If I can track down a good list, I’ll pass it on. Having an independent view will become much more important as companies start to do some comparison shopping.


Google SLA Announcement

InfoWorld article on Google Cloud SLA

Microsoft Fail - InfoWorld

Challenges & Risks of Implementing Cloud Computing

Wikipedia article on High Availability
- Chris Claborne
