Monday, January 18, 2016

Zynga Returning to AWS Part 2

In my last cloud article, “Zynga and the cost of agility”, I referenced an article about Zynga’s return to AWS.  While at AWS re:Invent 2015, I had a chance to hear firsthand from Zynga’s CIO, Dorion Carroll, why Zynga left AWS in the early days and why they’re back.  It was an excellent presentation, and the story is a real lesson in why a CIO should think hard before buying infrastructure (even if it will be deployed in a co-location facility).



As a quick recap, Zynga rocked the burgeoning cloud business landscape when they announced they were pulling out of AWS and building their own private cloud starting in 2009.  They started with Equinix and completed their first large dedicated data center in 2011.  Carroll gave some compelling reasons (at the time) for why they left.
  • At the time AWS couldn’t keep up with the growth Zynga was going through.
    Games launched from 2009 to 2011 were rocketing from 0 to 10 million users in weeks.
  • EC2 performance / technology couldn’t match running on bare metal servers (at the time)
  • EC2 consumption was in the tens of thousands of instances and they needed more
  • Zynga needed to scale an analytics infrastructure not possible on Amazon at the time.

Before Zynga had finished racking and stacking the infrastructure, bolting together all of the storage, and turning it over to the deployment team, AWS had revised their technology 3+ times.  Zynga teams spent a lot of time on design and technology choices and then started executing on that plan.  In the meantime, AWS continued to invest more in R&D and evolve what they delivered.  They can do this faster than a company building its own infrastructure can.

As the RW article indicates, Zynga’s business changed quickly and so did their infrastructure needs.  With multiple datacenters and acres of equipment depreciating, you can’t afford to just change it all out when better technology comes out next year.  The previous article touched on the value of AWS’s “return on agility” but there’s more to it.  Carroll wanted to get out of the business (read: cost) of technology R&D and maintaining massive old datacenters, and spend that money on developing games.  With every technology refresh there is a cost of learning what the state of the art is.  This cost isn’t just design; it’s also the time to work with vendors and to learn how to install, configure and operate the new technology.  If the technology is bleeding edge, you have to invest even more to stay current with patches and to learn its limitations.

Carroll mentioned that there was real stress around purchasing hardware fast enough and managing the process through to delivery.  On the operations side, at one point his team was replacing 80 to 100 drives per month due to failure.  According to Carroll, AWS advanced their storage and compute three times in his first year of equipment depreciation, much faster than Zynga could.  Zynga’s advantage of running their own cloud eroded.  The bottom line is that a three-year CapEx commitment means you are committed to what you buy now versus what is state of the art in one, two and three years.

When Zynga started moving back, it was clear it was the right thing to do.  Here are some of the key points Carroll made:
  • They were able to go from 100 database servers to 3 in AWS
  • They were able to drop 300 Linux servers
  • They replaced a large MySQL farm with the DynamoDB managed service
  • They saw a 30 to 40X improvement in query performance (even under full write load)
  • Modernizing storage tech was expected to yield > 75% reduction of instances
  • It’s liberating to be able to spin up a massive infrastructure and test a proposed configuration and then shut it down and stop paying for it.
  • AWS feeds a culture of experimentation to drive innovation.
  • They analyze petabytes of data in AWS today, something their aging private cloud equipment can no longer keep up with.
  • Not only did Amazon release new generations of technology during the migration, but Zynga also had the flexibility to try various configurations of instance classes and persistence technologies.

The last part of the presentation covered how Zynga’s move back to AWS may be one of the biggest moves at scale in the history of AWS.  Zynga’s approach to the move mirrors 99% of all the other migration stories told at re:Invent, which is “lift and shift”, then optimize.  For example, after migrating Words with Friends, they were able to reduce overall size by over 30%.

Now that Zynga’s move is nearly complete, they are more stable in the cloud.  They no longer have to deal with hardware failures and instead spend that time on other areas that improve their up-time even further.

Zynga didn’t just one day decide to start moving to AWS.  They spent 15 months in planning and 5 months to move it.  The lesson here is that you have to plan, design, test and validate your assumptions to pull off a move of this scale.

Like other CIOs I heard at re:Invent 2015, Carroll doesn’t think they could do what they do today on AWS using a private cloud.  Although he didn’t explicitly talk about financial ROI, there were references to finance being heavily involved in the decision making.  I’m guessing they liked the fact that cost could scale closely with demand rather than following the huge step function in cost it takes to operate your own cloud.  Zynga continues to optimize using new managed services and architectures in AWS.  Zynga experiments to drive innovation and is currently trying micro-services, AWS Lambda, Redshift and Glacier.
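
To make the step-function point concrete, here is a rough sketch in Python.  Every number in it (block size, prices, monthly demand) is made up for illustration and has nothing to do with Zynga’s or AWS’s actual costs.  It assumes that owned capacity is bought in fixed blocks and can’t be returned once purchased, while on-demand capacity is billed per unit actually used.

```python
import math


def owned_cost_series(demands, block_size, cost_per_block):
    """Monthly cost when capacity is bought in fixed blocks.

    Assumption: once a block is bought you keep paying for it,
    so capacity only ever steps up, never down.
    """
    blocks = 0
    costs = []
    for demand in demands:
        # Buy more blocks only when demand exceeds what is already owned.
        blocks = max(blocks, math.ceil(demand / block_size))
        costs.append(blocks * cost_per_block)
    return costs


def on_demand_cost_series(demands, cost_per_unit):
    """Monthly cost when you pay only for the units you actually use."""
    return [demand * cost_per_unit for demand in demands]


# Hypothetical numbers, chosen only to show the shape of the curves.
BLOCK_SIZE = 1000        # servers per purchased block
COST_PER_BLOCK = 50_000  # monthly cost of owning one block
COST_PER_UNIT = 60       # monthly on-demand cost per server

monthly_demand = [200, 900, 1100, 2500, 1800, 600]

owned = owned_cost_series(monthly_demand, BLOCK_SIZE, COST_PER_BLOCK)
cloud = on_demand_cost_series(monthly_demand, COST_PER_UNIT)

for demand, o, c in zip(monthly_demand, owned, cloud):
    print(f"demand={demand:>5}  owned={o:>7}  on-demand={c:>7}")
```

In this toy example the per-server price of on-demand is higher than owning a fully used block, yet the owned cost stays at its peak after demand falls off, which is the kind of mismatch I suspect finance was looking at.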

Carroll’s final two statements sum it up.
“AWS is investing more in computing infrastructure and innovation and moving faster than we ever could.”

“Using AWS allows Zynga to focus on developing great games, investing in product innovation, and improving player expectations.”



Chris Claborne

(aka Christian Claborne for you googlers :)
