Amazon’s cloud computing failures began early Thursday morning and have continued into Friday, April 22nd. Affected web sites include Quora.com, Reddit.com, GroupMe.com and Scvngr.com, which all posted messages to their visitors about the issue. Most of those web sites have been inaccessible for hours, and others have been only partly operational.
Companies use Amazon’s cloud-based service (known as Elastic Compute Cloud or “EC2”) to host their Web sites and applications and to store data. Amazon’s customers include start-ups like the social networking site Foursquare, but also big companies like Pfizer, Netflix and Nasdaq. EC2 is designed to cope with giant traffic spikes of the type Amazon experiences during its pre-Christmas shopping rush each December. Today, Amazon said that a networking glitch made its storage volumes automatically create back-ups of themselves, filling up storage capacity and causing connectivity issues.
For several years, we’ve been pounding the table about many of the potential problems that Cloud Service Providers have “swept under the rug” or ignored. These include failure/disaster recovery, SLAs, security, lack of standards, vendor lock-in, etc. During the session “Everyone can now afford a Disaster Recovery Center*” at the 2011 Cloud Connect conference, the speaker stated that disaster recovery could be solved by transferring the workloads from affected cloud data centers to other data centers owned by the same Cloud Service Provider. He gave an example where cloud outages in San Francisco, CA resulted in jobs being transferred to cloud data centers in London, England, with data being replicated there. I strongly challenged the speaker about the complexity of doing this. In particular, I raised the resulting extra processing load on the servers in London, data replication issues and network bandwidth saturation. He danced around those problems and remained confident disaster recovery would actually be a strength, rather than a weakness, of cloud computing. I wonder what he thinks now, after the Amazon cloud outages?
* The session, “Everyone Can Now Afford a Disaster Recovery (DR) Center,” was supposed to detail the ways in which cloud computing has disrupted the cost dynamics of disaster recovery. “The economics of cloud computing have changed the disaster recovery game, allowing everyone to afford a DR center and pay for DR services only when they are needed. Attendees of this session will learn about new strategies for data protection and disaster recovery in the Cloud.”
“Cloud computing is revolutionizing Disaster Recovery,” said Dr. Ian Howells, CMO at StorSimple. “The natural advantages of the cloud being available from anywhere with high availability, elasticity and utility billing make it ideal for next generation Disaster Recovery strategies that are now affordable for widespread usage. What is needed is a framework to optimize content, data and application movement between the cloud and on-premises infrastructure.”
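To make the objection about cross-region failover concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (capacity, load, data volume, link speed) is invented for illustration and does not come from Amazon or from the conference session, and `Region` and `failover_check` are hypothetical names. It only shows two things: moving one region's jobs to another adds the two loads together, and bulk replication time is bounded by the bandwidth of the link between the sites.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    capacity_units: float  # total compute the region can serve (made-up unit)
    current_load: float    # compute already in use, same unit


def failover_check(failed: Region, target: Region,
                   data_to_replicate_gb: float, link_gbps: float) -> dict:
    """Estimate the effect of moving all of `failed`'s work onto `target`."""
    # The target keeps its own jobs and inherits those of the failed region.
    combined_load = target.current_load + failed.current_load
    utilization = combined_load / target.capacity_units

    # Gigabytes -> gigabits (x8); dividing by the link rate gives seconds.
    transfer_hours = data_to_replicate_gb * 8 / link_gbps / 3600

    return {
        "utilization": utilization,
        "overloaded": utilization > 1.0,
        "transfer_hours": transfer_hours,
    }


# Illustrative figures only (assumptions, not measurements).
san_francisco = Region("San Francisco", capacity_units=100, current_load=70)
london = Region("London", capacity_units=100, current_load=60)

result = failover_check(san_francisco, london,
                        data_to_replicate_gb=50_000, link_gbps=10)
print(result)
```

With these made-up numbers, London would be asked to run at 130% of its capacity, and copying the data over a fully dedicated 10 Gbps link would take roughly 11 hours. Real deployments differ in many ways, but the sketch shows why spare capacity and replication bandwidth have to be planned for rather than assumed.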
After Amazon’s cloud went dark, one location-based service provider (SCVNGR) tweeted: ‘The sky is falling! Amazon’s cloud seems to be down (raining?) so we’re experiencing some issues too. Be back soon!’ Foursquare and a number of other social media sites hosted by Amazon’s cloud were also forced to post apology notices.
Here are a few comments from executives who have considered EC2 and cloud computing:
- “We don’t think the cloud is enterprise-ready,” said Jimmy Tam, general manager of Peer Software, which provides data backup for businesses. “Are you really going to trust your corporate jewels to these cloud providers?”
- “Clearly you’re not in control of your data, your information,” said Campbell McKellar, founder of Loosecubes, a Web site for finding temporary workspace that was not available Thursday. “It’s a major business interruption. I’m getting business interruption insurance tomorrow, believe me, and maybe we get a different cloud provider as a backup.”
- As Ben Parr from Mashable pointed out, the event revealed that Amazon’s cloud redundancies failed to stop a mass outage. “Its Availability Zones are supposed to be able to fail independently without bringing the whole system down. Instead, there was a single point of failure that shouldn’t have been there,” said Parr.
- Cheezburger Network CEO Ben Huh said that outages like this can be a learning opportunity for companies. “It’s not a catastrophe unless something valuable (like user data) was lost,” said Huh. “It’s an opportunity to learn about the service provider’s weakness and how to design more stable, reliable systems. Services recover very quickly from outages as long as they are relatively short. Long-term outages are another beast.”
We could find only one voice that extolled the benefits of cloud computing in light of Amazon’s massive failure:
“The benefits of the cloud are significant,” said Jeff Janer, chief executive of Springpad, a service that people use to save items online, which went offline as a result of Amazon’s problem. “Amazon as a resource for a company like ours makes an awful lot of sense. We’re just all keeping our fingers crossed that they get back as quickly as possible.”
We invite comments from Viodi View readers regarding their experiences and/or opinions about cloud failure recovery and other vulnerabilities.
For the latest status on Amazon’s Web Services by region, please visit their Service Health Dashboard: http://status.aws.amazon.com/