6. Whenever mission‐critical applications are concerned, how "secure" cloud providers claim to be matters a great deal less than the clawback service level agreements (SLAs) they
provide, or whether auditors can adequately evaluate their offerings against regulatory compliance criteria.
AWS outage. For everyone involved (not least Amazon’s own operations staff), it’s been a very long four days. What are the lessons to learn?
1. Read your cloud provider’s SLA very carefully ‐ Amazingly, the four‐day outage did not breach Amazon’s EC2 SLA, which, as a FAQ explains, “guarantees 99.95% availability of the
service within a Region over a trailing 365 period.” Since it has been the EBS (Elastic Block Store) and RDS (Relational Database Service) services rather than EC2 itself that have failed (and all the
failures have been restricted to Availability Zones within a single Region), the SLA has not been breached, legally speaking. That’s no consolation for those affected, of course, nor is it
any excuse for the disruption they’ve suffered. But it certainly gives pause for thought.
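To put the 99.95% figure in perspective, the arithmetic can be sketched in a few lines of Python. This is an illustration only: it assumes the "trailing 365 period" in the FAQ means 365 days, and the function names are invented for this example rather than taken from any Amazon tooling.

```python
HOURS_PER_DAY = 24


def allowed_downtime_hours(availability_pct, period_days=365):
    """Hours of downtime an availability target permits over the period."""
    total_hours = period_days * HOURS_PER_DAY
    return total_hours * (1 - availability_pct / 100)


def achieved_availability_pct(downtime_hours, period_days=365):
    """Availability percentage actually delivered, given the downtime."""
    total_hours = period_days * HOURS_PER_DAY
    return 100 * (1 - downtime_hours / total_hours)


# Assumption: the SLA period is 365 days (8,760 hours).
budget = allowed_downtime_hours(99.95)
four_day_outage = 4 * HOURS_PER_DAY  # 96 hours

print(f"Allowed downtime at 99.95%: {budget:.2f} hours")
print(f"Availability after a 96-hour outage: "
      f"{achieved_availability_pct(four_day_outage):.2f}%")
```

On these assumptions the allowance is about 4.38 hours a year, so a 96‐hour outage would have used it up roughly 22 times over had it counted against the EC2 SLA. It did not count, because the services that failed were EBS and RDS.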
2. Don’t take your provider’s assurances for granted ‐ Many of the affected customers were paying extra to host their instances in more than one Availability Zone (AZ). Amazon
recommends this course of action to ensure resilience against failure. (Each AZ, according to Amazon’s FAQ, “runs on its own physically distinct, independent infrastructure, and is
engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate,
such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone.”)
Unfortunately, this turned out to be a technical specification rather than a contractual guarantee. It will take Amazon quite some effort to repair the reputational damage this event has brought upon it. Justin Santa Barbara, founder and CEO of FathomDB, was forthright in
his blog post on Why the sky is falling: “AWS broke their promises on the failure scenarios for Availability Zones … The sites that are down were correctly designing to the ‘contract’; the problem is that AWS didn’t follow their own specifications. Whether that happened
through incompetence or dishonesty or something a lot more forgivable entirely, we simply don’t know at this point.” While it’s easy to be wise after the event, Amazon’s vulnerability to this type of failure may have been visible in a deep‐enough due diligence exercise. As
Amazon competitor Joyent’s Chief Scientist Jason Hoffman notes on the company’s blog, “This is not a ‘speed bump’ or a ‘cloud failure’ or ‘growing pains’, this is a foreseeable consequence of fundamental architectural decisions made by Amazon.”
3. Most customers will still forgive Amazon its failings ‐ However badly they’ve been affected, customers have sung Amazon’s praises in recognition of how much it has helped them run a powerful infrastructure at lower cost and effort. Many
prefaced criticisms with gratitude for what Amazon had made possible, such as BigDoor’s CEO Keith Smith: “AWS has allowed us to scale a complex system quickly, and extremely cost effectively. At any given point in time, we have 12 database servers, 45 app servers, six
static servers and six analytics servers up and running. Our systems auto‐scale when traffic or processing requirements spike, and auto‐shrink when not needed in order to conserve dollars.”
4. There are many ways you can supplement a cloud provider’s resilience ‐ As O’Reilly’s George Reese points out, “if your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault. You either deemed an outage of this nature
an acceptable risk or you failed to design for Amazon’s cloud computing model.” It’s useful to review the techniques customers have used to minimize their exposure to failures at Amazon.
Twilio, for example, didn’t go down. Although the company hasn’t explained exactly what its exposure was to the affected Northern Virginia Availability Zones, it has described its architectural design principles in a first entry on its new engineering blog by co‐founder and CTO
Evan Cooke. These include decomposing resources into independent pools, building in support for quick timeouts and retries, and having idempotent interfaces that allow multiple retries of failed requests. Of course all this is easier said than done if all your experience is in
designing tightly‐coupled enterprise application stacks that assume a resilient local area network. Cooke’s post goes on to describe some of the characteristics that make Twilio’s architecture capable of operating in this more fault‐tolerant manner. To start with, “Separate
business logic into small stateless services that can be organized in simple homogeneous pools.” Another step is to partition the reading and writing of data: “if there is a large pool of data that is written infrequently, separate the reads and writes to that data … For example,
by writing to a database master and reading from database slaves, you can scale up the number of read slaves to improve availability and performance.” Another site that didn’t go down is Netflix, which runs all its infrastructure in the Amazon cloud.
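The combination of retries and idempotent interfaces can be sketched as follows. This is a minimal illustration under stated assumptions, not Twilio’s code: FlakyCounterService and call_with_retries are hypothetical names, the "remote" service is simulated in memory, and a raised TimeoutError stands in for a short network timeout expiring.

```python
import time
import uuid


class FlakyCounterService:
    """Hypothetical stand-in for a remote service whose replies get lost.

    Each request ID is applied at most once, so retrying is safe.
    """

    def __init__(self, failures_before_success=2):
        self.total = 0
        self._seen = {}  # request_id -> result recorded for that request
        self._failures_left = failures_before_success

    def add(self, request_id, amount):
        if request_id not in self._seen:
            # First time this request is seen: apply the change once.
            self.total += amount
            self._seen[request_id] = self.total
        if self._failures_left > 0:
            # Simulate the work succeeding but the response being lost.
            self._failures_left -= 1
            raise TimeoutError("simulated lost response")
        return self._seen[request_id]


def call_with_retries(operation, attempts=4, base_delay=0.01):
    """Run operation(), retrying on TimeoutError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** attempt)


service = FlakyCounterService(failures_before_success=2)
request_id = str(uuid.uuid4())  # the same ID is reused on every retry
result = call_with_retries(lambda: service.add(request_id, 5))
print(result, service.total)  # 5 5
```

The request is sent three times but applied once, because the service recognises the repeated request ID. Without that check, each retry after a lost response would add another 5.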
5. Building in extra resilience comes at a cost ‐ Bob Warfield describes how a previous company used Amazon.com infrastructure in a way that allowed it to “bring back the service in another region if the one we were in totally failed within 20 minutes
and with no more than 5 minutes of data loss.” As he goes on to say, the choices you make about the length of outage you’re prepared to support have consequences for the cost your customers or enterprise must fund. “Smart users and PaaS vendors will look into
packaging several options because you should be backed up to S3 regardless, so what you’re basically arguing about and paying extra for is how ‘warm’ the alternate site is and how much has to be spun up from scratch via S3.”
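The trade‐off Warfield describes can be framed as picking the cheapest standby arrangement that still meets your recovery targets. The sketch below is illustrative only: the option names, recovery times and cost multipliers are made‐up figures, and only the 20‐minute and 5‐minute targets come from his description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandbyOption:
    name: str
    recovery_minutes: int   # time to bring the service back elsewhere
    data_loss_minutes: int  # worst-case window of lost writes
    relative_cost: float    # hypothetical multiplier on the hosting bill


# Hypothetical figures, chosen only to illustrate the trade-off.
OPTIONS = [
    StandbyOption("cold: rebuild everything from S3 backups", 240, 60, 1.1),
    StandbyOption("warm: small standby, frequent snapshots", 20, 5, 1.4),
    StandbyOption("hot: full replica in a second region", 2, 0, 2.0),
]


def cheapest_option(options, max_recovery_minutes, max_data_loss_minutes):
    """Return the cheapest option meeting both targets, or None."""
    candidates = [
        o for o in options
        if o.recovery_minutes <= max_recovery_minutes
        and o.data_loss_minutes <= max_data_loss_minutes
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.relative_cost)


# Targets quoted above: back within 20 minutes, at most 5 minutes of data lost.
choice = cheapest_option(OPTIONS, max_recovery_minutes=20, max_data_loss_minutes=5)
print(choice.name)
```

Tightening either target pushes the choice toward the "hotter" and more expensive options, which is the cost Warfield says someone has to fund.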
6. Understanding the trade‐offs helps you frame what to ask ‐ There are questions you should be asking to satisfy yourself that a cloud service you rely on is not exposing you to a
similar failure (or at least that, if it is, you understand this and are willing to bear the consequences in return for a lower cost). Referring to Netflix’s practice of randomly killing
resources and services in order to test its resilience, Bob Warfield adds this advice:
“That’s likely another good question to ask your PaaS and Cloud vendors — “Do you take down production infrastructure to test your failover?” Of course you’d like to see that and
not just take their word for it too.”
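The idea of killing resources at random to test failover can be tried on a small scale. The sketch below is an in‐memory simulation under stated assumptions, not Netflix’s tooling: Pool and survives_random_failures are hypothetical names, and the "workers" are just strings.

```python
import random


class Pool:
    """Hypothetical pool of interchangeable, stateless workers."""

    def __init__(self, size):
        self.healthy = {f"worker-{i}" for i in range(size)}

    def kill_random(self, rng):
        """Remove one healthy worker chosen at random and return its name."""
        victim = rng.choice(sorted(self.healthy))
        self.healthy.discard(victim)
        return victim

    def handle(self, request):
        """Serve the request from any worker that is still healthy."""
        if not self.healthy:
            raise RuntimeError("no healthy workers left")
        worker = min(self.healthy)  # any healthy worker will do
        return f"{worker} handled {request}"


def survives_random_failures(pool, kills, rng):
    """Kill workers one at a time, checking the pool still serves requests."""
    for _ in range(kills):
        pool.kill_random(rng)
        try:
            pool.handle("health-check")
        except RuntimeError:
            return False
    return True


rng = random.Random(42)  # fixed seed so the run is repeatable
print(survives_random_failures(Pool(size=3), kills=2, rng=rng))  # True
print(survives_random_failures(Pool(size=3), kills=3, rng=rng))  # False
```

A real test would terminate live instances and watch real traffic, which is exactly why it is worth asking a vendor to show it rather than describe it.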
7. Lack of transparency may be Amazon’s ‘Achilles heel’ ‐ Several affected customers have complained of the lack of useful information forthcoming from Amazon during the outage. BigDoor CEO Keith Smith wrote, “If Amazon had been more
forthcoming with what they are experiencing, we would have been able to restore our systems sooner.” GoodData’s Roman Stanek called on Amazon to tear down its wall of secrecy: “Our dev‐ops people can’t read from the tea‐leaves how to organize our systems for
performance, scalability and most importantly disaster recovery. The difference between ‘reasonable’ SLAs and ‘five‐9s’ is the difference between improvisation and the complete alignment of our respective operational processes … There should not be communication
walls between IaaS, PaaS, SaaS and customer layers of the cloud infrastructure.”
Amazon’s challenge in the coming weeks is to show that it is prepared to give its customers the information they need to build in that resilience reliably. If it does not meet that need
and allows others to do better, it may gradually start losing its current dominant position in IaaS provision.
9. 9 Good Cloud
Driver 1: For those prepared to innovate, there’s the power of the cloud to increase competitiveness and
realize new opportunities.
Driver 2: There’s the threat posed by the cloud (and those fast, nimble, cheap startups) to those
established businesses that are unable to innovate fast enough (e.g. consumerization trends).
BUSINESS CASE: Something we’ll come back to.
There is ample evidence that forward‐looking enterprises are thinking carefully about these threats
and opportunities.
They’re changing the way they organise IT and its relationship with the business.
Intuit generates more than half its $4+ billion revenues from connected services (SaaS).
Ginny Lee, CIO, explains that to support this growth in on‐demand capabilities, she “had to turn the
IT organization from a service provider into a change agent … I had to change the mindsets of
people within IT to make sure they know that their mission is to enable growth and a great
customer experience.”
Financial services giant Fidelity is using the cloud to provide employee portals to its clients
that combine customer HR data and benefit plans with relevant information about 401k investment
planning.
Xerox is cloud‐enabling its high‐volume printing systems to serve its customers better and open up
opportunities to provide turnkey marketing services to smaller companies.
Postage meter provider Pitney Bowes faces falling spending on stamps and so is building a
secure mailbox in the cloud.