AWS provides a platform well suited to building highly available systems: reliable, affordable, fault-tolerant systems that operate with minimal human interaction. This presentation covers the high-availability and fault-tolerance concepts and service features that you can use to build highly reliable and highly available applications in the AWS Cloud: architectures involving multiple Availability Zones, including EC2 best practices and RDS Multi-AZ deployments; loosely coupled and self-healing systems involving SQS and Auto Scaling; networking best practices for high availability, including Elastic IP addresses, load balancing, and DNS; and services that are inherently built with high availability and fault tolerance in mind, including S3, Elastic Beanstalk, and more.
Ianni Vamvadelis, Manager, Solution Architecture, AWS
Daniel Richardson, Director of Engineering, JustEat
2. What is High Availability (HA)?
• Percentage of time an application operates
• Loss of availability is known as an outage or downtime
– Planned and unplanned
– App is offline, unreachable, or partially available
– App is unresponsive
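The "percentage of time an application operates" definition can be turned into a small calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes a 365-day year and that planned and unplanned downtime are counted together.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def availability_percent(uptime_hours: float, total_hours: float) -> float:
    """Availability as the percentage of the period the application was operating."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * uptime_hours / total_hours


def downtime_hours_per_year(availability_pct: float) -> float:
    """Downtime budget (planned plus unplanned) implied by an availability target."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(target)
        print(f"{target}% availability allows {hours:.2f} hours of downtime per year")
```

For example, a 99.9% target leaves roughly 8.76 hours of downtime per year.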
3. HA is related to …
• Scalability
– Often slow is indistinguishable from unavailable.
• Fault Tolerance
– Apps continue functioning when components fail
• Disaster Recovery
– Restoring service after a catastrophic event
4. HA and DR
• A continuum from High Availability to Disaster Recovery
• Business continuity plan
• Not an all-or-nothing proposition
In the face of internal or external events, how do you…
– Keep your applications running 24x7
– Make sure your data is safe
– Get an application recovered after a major disaster
6. US-WEST (Oregon)
EU-WEST (Ireland)
AWS GovCloud (US)
ASIA PAC (Tokyo)
US-EAST (Virginia)
ASIA PAC (Sydney)
US-WEST (N. California)
ASIA PAC (Singapore)
SOUTH AMERICA (Sao Paulo)
7. US-WEST (Oregon)
EU-WEST (Ireland)
AWS GovCloud (US)
ASIA PAC (Tokyo)
US-EAST (Virginia)
ASIA PAC (Sydney)
US-WEST (N. California)
ASIA PAC (Singapore)
SOUTH AMERICA (Sao Paulo)
120. JUST EAT
13 countries
34,000+ restaurants
8m+ members
Over 50m orders
16,000+ restaurants in UK, 8m visits a month
121. PLATFORM
• Devices in restaurants
• Apps and External Services
– Consumer Website
– Public API
– Customer Care Tools
– Restaurant Services
• APIs
– Order API
– Ratings API
– Search API
– …
• Common Infrastructure
– SQL Server
– Networking
– Monitoring
– Emails
122. DESIGN FOR FAILURE
[Diagram: devices in restaurants; Web Service and Device Service instances in Auto Scaling groups across eu-west-1a, eu-west-1b, and eu-west-1c; an orders queue, a JCT service, and orders data]
127. EVERYTHING MULTI AZ – CONSUMER WEBSITE
Monitor to keep resource usage at a max of 66% of capacity in each AZ when everything’s available.
[Diagram: Auto Scaling group across eu-west-1a, eu-west-1b, and eu-west-1c; labels show 66% (three times) and 99% (twice)]
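The 66% and 99% figures on this slide are consistent with simple headroom arithmetic: if three AZs each run at 66% and one is lost, the two survivors absorb its load and land at about 99%. The sketch below works that out. It assumes equally sized AZs and evenly redistributed load, and the function names are hypothetical.

```python
def max_safe_utilization(az_count: int, az_failures: int = 1) -> float:
    """Highest per-AZ utilisation (as a fraction) that still fits after losing AZs.

    Assumes every AZ has equal capacity and load spreads evenly over survivors.
    """
    if az_failures < 0 or az_failures >= az_count:
        raise ValueError("az_failures must be between 0 and az_count - 1")
    return (az_count - az_failures) / az_count


def utilization_after_failure(per_az_utilization: float, az_count: int,
                              az_failures: int = 1) -> float:
    """Per-AZ utilisation of the surviving AZs once the failed AZs' load moves over."""
    survivors = az_count - az_failures
    if survivors <= 0:
        raise ValueError("at least one AZ must survive")
    return per_az_utilization * az_count / survivors


if __name__ == "__main__":
    print(f"Max safe utilisation with 3 AZs: {max_safe_utilization(3):.1%}")
    print(f"66% per AZ after losing one of 3: {utilization_after_failure(0.66, 3):.0%}")
```

With three AZs the ceiling is two thirds of capacity, which is why the monitoring target sits at 66%.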
128. EVERYTHING MULTI AZ – INTERNAL APIS
Applications assume that internal APIs will fail or run slowly, so they can cope with the loss of an AZ or of instances and will just degrade gracefully.
Alarms tell us that performance has been degraded, but the platform will self-heal as new instances are launched.
[Diagram: Auto Scaling group across eu-west-1a, eu-west-1b, and eu-west-1c; labels show 80% (three times) and 100% (twice)]
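One way a caller can "assume that internal APIs will fail or run slowly" is to wrap each call in a timeout and return a fallback value instead of propagating the error. The following is a minimal sketch of that idea, not the presenters' implementation; `call_with_fallback` and the example lambdas are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Shared worker pool for outbound calls (the size is an arbitrary assumption).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn, fallback, timeout_s):
    """Run fn() with a time limit; return fallback if it fails or is too slow.

    A timed-out call keeps running in its worker thread; only the caller
    stops waiting for it.
    """
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both the timeout and any exception raised inside fn.
        return fallback


if __name__ == "__main__":
    # Hypothetical slow ratings lookup: the page renders without ratings.
    slow_ratings = lambda: (time.sleep(0.3), [4, 5])[1]
    print(call_with_fallback(slow_ratings, [], timeout_s=0.05))
```

The caller degrades to the fallback (here, an empty ratings list) rather than failing the whole request.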
129. EVERYTHING MULTI AZ – SQL SERVER 2012
Connection strings simply contain both primary and secondary servers; no code changes required.
Alarms tell us that failover has occurred, but it happens without manual intervention.
[Diagram: Primary, Secondary, and Witness servers spread across eu-west-1a, eu-west-1b, and eu-west-1c]
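The slide does not show the connection string itself. One common way to name both servers, assuming SQL Server database mirroring and an ADO.NET (SqlClient) style client, is the `Failover Partner` keyword. The sketch below only assembles such a string; the host names, database name, and helper function are hypothetical.

```python
def mirrored_connection_string(primary: str, secondary: str, database: str) -> str:
    """Build an ADO.NET-style connection string that names both servers.

    Assumption: the client driver understands the "Failover Partner" keyword
    (as SqlClient does for database mirroring) and retries against the
    secondary when the primary is unavailable.
    """
    parts = {
        "Server": primary,
        "Failover Partner": secondary,
        "Database": database,
    }
    return ";".join(f"{key}={value}" for key, value in parts.items())


if __name__ == "__main__":
    # Hypothetical host names, one per Availability Zone.
    print(mirrored_connection_string(
        "sql-a.example.internal", "sql-b.example.internal", "Orders"))
```

Because the failover target lives in configuration, the application code that opens connections does not need to change.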