How Netflix Leverages Multiple Regions to Increase Availability: An Active-Active Case Study

Video and slides synchronized; mp3 and slide downloads available at http://bit.ly/1mp4JK1.

Ruslan Meshenberg discusses Netflix's challenges, operational tools and best practices needed to provide high availability through multiple regions. Filmed at qconlondon.com.

Ruslan Meshenberg is the Director of Engineering for Platform, Persistence and Metadata services at Netflix. He initiated and continues to lead the NetflixOSS program, which shares the infrastructure and tooling Netflix developed for cloud-native applications with the wider community.


1. How Netflix Leverages Multiple Regions to Increase Availability Ruslan Meshenberg 10 March 2014
2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/netflix-availability-regions InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
3. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
4. Netflix
5. Who am I? • Director, Platform Engineering • Led Multi-Regional Resiliency • Leading NetflixOSS Program • @rusmeshenberg
6. Failure
7. Assumptions Hardware Will Fail • Slowly Changing • Large Scale • Rapid Change • Large Scale • Slowly Changing • Small Scale • Rapid Change • Small Scale Telcos Web Scale Speed Scale Everything Works Everything Is Broken Software Will Fail Enterprise IT Startups
8. Incidents – Impact and Mitigation PR X Incidents CS XX Incidents Metrics Impact – Feature Disable XXX Incidents No Impact – Fast Retry or Automated Failover XXXX Incidents Public relations media impact High customer service calls Affects AB test results Y incidents mitigated by Active-Active, game day practicing YY incidents mitigated by better tools and practices YYY incidents mitigated by better data tagging
9. Does an Instance Fail? • It can, plan for it • Bad code / configuration pushes • Latent issues • Hardware failure • Test with Chaos Monkey
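Slide 9's advice — plan for instance failure and test it with Chaos Monkey — boils down to randomly terminating instances in a running Auto Scaling group and checking that the service shrugs it off. A minimal sketch of that idea with the 2014-era AWS SDK for Java; the group name is a placeholder, and this is not Chaos Monkey's actual implementation (the real Simian Army adds scheduling, opt-in groups, and probability controls).

```java
import java.util.List;
import java.util.Random;

import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.AutoScalingGroup;
import com.amazonaws.services.autoscaling.model.DescribeAutoScalingGroupsRequest;
import com.amazonaws.services.autoscaling.model.Instance;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

/** Illustrative only: terminate one random instance of an ASG -- the core idea behind Chaos Monkey. */
public class TinyChaosMonkey {
    public static void main(String[] args) {
        String asgName = "my-service-v042";                 // hypothetical ASG name
        AmazonAutoScalingClient asg = new AmazonAutoScalingClient();
        AmazonEC2Client ec2 = new AmazonEC2Client();

        // Look up the group's current members.
        List<AutoScalingGroup> groups = asg.describeAutoScalingGroups(
                new DescribeAutoScalingGroupsRequest().withAutoScalingGroupNames(asgName))
                .getAutoScalingGroups();
        if (groups.isEmpty() || groups.get(0).getInstances().isEmpty()) {
            return;                                         // nothing to break
        }

        // Pick a victim at random and terminate it; a healthy ASG replaces it automatically.
        List<Instance> instances = groups.get(0).getInstances();
        Instance victim = instances.get(new Random().nextInt(instances.size()));
        ec2.terminateInstances(new TerminateInstancesRequest()
                .withInstanceIds(victim.getInstanceId()));
        System.out.println("Terminated " + victim.getInstanceId() + " from " + asgName);
    }
}
```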
10. Does a Zone Fail? • Rarely, but happened before • Routing issues • DC-specific issues • App-specific issues within a zone • Test with Chaos Gorilla
11. Does a Region Fail? • Full region – unlikely, very rare • Individual Services can fail region-wide • Most likely, a region-wide configuration issue • Test with Chaos Kong
12. Everything Fails… Eventually • Keep your services running by embracing isolation and redundancy • Construct a highly agile and highly available service from ephemeral and assumed broken components
13. Isolation • Changes in one region should not affect others • Regional outage should not affect others • Network partitioning between regions should not affect functionality / operations
14. Redundancy • Make more than one (of pretty much everything) • Specifically, distribute services across Availability Zones and regions
15. History: X-mas Eve 2012 • Netflix multi-hour outage • US-East-1 regional Elastic Load Balancing issue • “...data was deleted by a maintenance process that was inadvertently run against the production ELB state data”
16. Isthmus – Normal Operation Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C Cassandra Replicas US-West-2 ELB US-East ELB Tunnel ELB Zuul Infrastructure*
17. Isthmus – Failover Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C Cassandra Replicas US-West-2 ELB US-East ELB Tunnel ELB Zuul Infrastructure*
18. Zuul Overview Elastic Load Balancing Elastic Load Balancing Elastic Load Balancing
19. Zuul
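Slides 16–19 show Zuul taking traffic that lands on one region's ELB and routing it onward — in the Isthmus setup, across to the US-East backend. Zuul 1.x does this with small routing filters; below is a minimal sketch of such a filter that proxies requests to a remote region's ELB. The target hostname and the shouldFilter decision are placeholders, not Netflix's production routing rules.

```java
import java.net.MalformedURLException;
import java.net.URL;

import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

/** Sketch of a Zuul 1.x route filter that forwards requests to a remote region's ELB. */
public class CrossRegionRouteFilter extends ZuulFilter {

    // Hypothetical failover target; a real deployment would read this from dynamic config.
    private static final String REMOTE_ELB = "http://my-api.us-east-1.elb.example.com";

    @Override
    public String filterType() { return "route"; }   // runs when the request is routed

    @Override
    public int filterOrder() { return 10; }

    @Override
    public boolean shouldFilter() {
        // Placeholder decision: route everything; in practice this keys off config/health signals.
        return true;
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        try {
            ctx.setRouteHost(new URL(REMOTE_ELB));   // proxy this request to the remote ELB
        } catch (MalformedURLException e) {
            throw new RuntimeException(e);
        }
        return null;
    }
}
```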
20. Denominator – Abstracting the DNS Layer Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C Cassandra Replicas ELBs Zone A Cassandra Replicas Regional Load Balancers Zone B Cassandra Replicas Zone C Cassandra Replicas UltraDNS DynECT DNS Amazon Route 53 Denominator
21. Isthmus – Only for ELB Failures • Other services may fail region-wide • Not worthwhile to develop one-offs for each one
22. Active-Active – Full Regional Resiliency Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C Cassandra Replicas Regional Load Balancers Zone A Cassandra Replicas Regional Load Balancers Zone B Cassandra Replicas Zone C Cassandra Replicas
23. Active-Active – Failover Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C Cassandra Replicas Regional Load Balancers Zone A Cassandra Replicas Regional Load Balancers Zone B Cassandra Replicas Zone C Cassandra Replicas
24. Can’t we just deploy in 2 regions?
25. Active-Active Architecture
26. 2 main challenges • Routing the users to the services • Replicating the data
27. Separating the Data – Eventual Consistency • 2–4 region Cassandra clusters • Eventual consistency != hopeful consistency
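Slide 27's point that eventual consistency is not "hopeful" consistency comes down to being explicit about replication topology and per-request consistency levels. A minimal sketch with the 2.x-era DataStax Java driver; the keyspace, table, and datacenter names are illustrative, and LOCAL_QUORUM is one common pattern rather than necessarily Netflix's exact setting (Netflix largely used Astyanax at the time).

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

/** Sketch: a multi-region keyspace plus explicit per-request consistency levels. */
public class MultiRegionCassandra {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")      // a local-region node; placeholder address
                .build();
        Session session = cluster.connect();

        // Replicate to both regions' datacenters (DC names here are illustrative).
        session.execute(
            "CREATE KEYSPACE IF NOT EXISTS subscriptions WITH replication = " +
            "{'class': 'NetworkTopologyStrategy', 'us-east': 3, 'us-west-2': 3}");
        session.execute(
            "CREATE TABLE IF NOT EXISTS subscriptions.plans (user_id text PRIMARY KEY, plan text)");

        // Writes wait only for local replicas; the other region catches up
        // asynchronously -- that is the 'eventual' part.
        SimpleStatement write = new SimpleStatement(
            "INSERT INTO subscriptions.plans (user_id, plan) VALUES ('u1', '4-screen')");
        write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(write);

        // Reads also stay in-region; the slide-28 benchmark exercised CL.ONE.
        SimpleStatement read = new SimpleStatement(
            "SELECT plan FROM subscriptions.plans WHERE user_id = 'u1'");
        read.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        System.out.println(session.execute(read).one());

        cluster.close();
    }
}
```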
28. Benchmarking Global Cassandra Write-intensive test of cross-region replication capacity 16 x hi1.4xlarge SSD nodes per zone = 96 total 192 TB of SSD in six locations up and running Cassandra in 20 minutes Zone A US-West-2 Region - Oregon Cassandra Replicas Zone B Cassandra Replicas Zone C Cassandra Replicas Zone A US-East-1 Region - Virginia Cassandra Replicas Zone B Cassandra Replicas Zone C Cassandra Replicas Test Load Test Load Validation Load Interzone Traffic 1 Million Writes CL.ONE (Wait for One Replica to ack) 1 Million Reads after 500 ms CL.ONE with No Data Loss Interregion Traffic Up to 9 Gbit/s, 83 ms 18 TB backups from S3
29. Propagating EVCache Invalidations (cross-region flow): 1. Set data in EVCACHE 2. Write events to SQS and EVCACHE_REGION_REPLICATION 3. Read from SQS in batches 4. Call the Writer with Key, Write Time, TTL & Value after checking this is the latest event for the key in the current batch; goes cross-region through ELB over HTTPS 5. Check the write time to ensure this is the latest operation for the key 6. Delete the value for the key 7. Return the keys that were successful 8. Delete keys from SQS that were successful. Components: EVCache Client / App Server, SQS, and an EVCache Replication Service (Drainer + Writer) with replication metadata in each of US-EAST-1 and US-WEST-2.
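The slide-29 flow (write an invalidation event locally, ship it through SQS, and have a drainer/writer pair apply it in the remote region) hinges on step 5: only apply an event if it is newer than anything already applied for that key, so out-of-order SQS delivery cannot resurrect stale data. A small last-write-wins sketch of that check; the class and method names are invented for illustration and are not the actual EVCache replication service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration of slide 29, step 5: last-write-wins handling of replicated invalidations. */
public class InvalidationWriter {

    /** A replicated cache event: key, the time it was written, and its TTL. */
    static class CacheEvent {
        final String key;
        final long writeTimeMillis;
        final int ttlSeconds;
        CacheEvent(String key, long writeTimeMillis, int ttlSeconds) {
            this.key = key;
            this.writeTimeMillis = writeTimeMillis;
            this.ttlSeconds = ttlSeconds;
        }
    }

    // Metadata tracking the newest write time applied per key (the slide's "Metadata" store).
    private final Map<String, Long> lastApplied = new ConcurrentHashMap<>();

    /** Apply a cross-region event only if it is the latest operation for the key; drop stale ones. */
    public boolean apply(CacheEvent event) {
        Long previous = lastApplied.get(event.key);
        if (previous != null && previous >= event.writeTimeMillis) {
            return false;                        // older than what we already applied -- ignore
        }
        lastApplied.put(event.key, event.writeTimeMillis);
        deleteFromLocalCache(event.key);         // step 6: delete the value for the key
        return true;                             // caller acks the SQS message (steps 7-8)
    }

    private void deleteFromLocalCache(String key) {
        // Placeholder for the real EVCache delete call in the local region.
        System.out.println("evicting " + key);
    }
}
```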
30. Announcing Dynomite • Cross-AZ & Region replication to existing Key Value stores – Memcached – Redis • Thin Dynamo implementation for replication • Keep native protocol – No code refactoring • OSS soon!
31. Archaius – Region-isolated Configuration
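Archaius (slide 31) keeps configuration dynamic and scoped, so a property flipped in one region does not leak into another. A minimal sketch with the Archaius 1.x API; the config name, property key, and the idea of region-specific override files are assumptions for illustration, not Netflix's actual keys.

```java
import java.io.IOException;

import com.netflix.config.ConfigurationManager;
import com.netflix.config.DynamicBooleanProperty;
import com.netflix.config.DynamicPropertyFactory;

/** Sketch of region-scoped dynamic configuration with Archaius 1.x. */
public class RegionIsolatedConfig {
    public static void main(String[] args) throws IOException {
        // Loads myservice.properties plus cascaded environment/region-specific overrides
        // from the classpath, driven by the deployment context.
        ConfigurationManager.loadCascadedPropertiesFromResources("myservice");

        // Hypothetical per-region kill switch; it is re-read on every get(), so an operator
        // can flip it in us-east-1 without touching us-west-2 or eu-west-1.
        DynamicBooleanProperty recommendationsEnabled =
                DynamicPropertyFactory.getInstance()
                        .getBooleanProperty("myservice.recommendations.enabled", true);

        if (recommendationsEnabled.get()) {
            System.out.println("serving recommendations");
        } else {
            System.out.println("feature disabled in this region");
        }
    }
}
```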
32. Running Isthmus and Active-Active
33. Multiregional Monkeys • Detect failure to deploy • Differences in configuration • Resource differences
34. Operating in N Regions • Best practices: avoiding peak times for deployment • Early problem detection / rollbacks • Automated canaries / continuous delivery
35. Application deployment
36. 6 deployments
37. Automating deployment
38. Monitoring and Alerting • Per region metrics • Global aggregation • Anomaly detection
39. Failover and Fallback • Directional DNS changes (or Route 53) • For fallback, ensure data consistency • Some challenges – Cold cache – Autoscaling • (Iterate, Automate)+
40. Validating the Whole Thing Works
41. Does Isolation Work?
42. Does Failover Work?
43. Steady State
44. Wii.FM?
45. We’re here to help you get to global scale… Apache Licensed Cloud Native OSS Platform http://netflix.github.com
46. Technical Indigestion – what do all these do?
47. Asgard – Manage Red/Black Deployments
48. Ice – Slice and dice detailed costs and usage
49. Automatically Baking AMIs with Aminator • AutoScaleGroup instances should be identical • Base plus code/config • Immutable instances • Works for 1 or 1000… • Aminator Launch – Use Asgard to start AMI or – CloudFormation Recipe
50. Discovering your Services – Eureka • Map applications by name to – AMI, instances, Zones – IP addresses, URLs, ports – Keep track of healthy, unhealthy and initializing instances • Eureka Launch – Use Asgard to launch AMI or use CloudFormation Template
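Slide 50 describes Eureka as the name-to-instance map that clients query at runtime. A minimal lookup sketch using the legacy eureka-client 1.x API; the application name "MOVIESERVICE" is a placeholder, and the client's own registration details come from an eureka-client.properties file that is not shown here.

```java
import com.netflix.appinfo.InstanceInfo;
import com.netflix.appinfo.MyDataCenterInstanceConfig;
import com.netflix.discovery.DefaultEurekaClientConfig;
import com.netflix.discovery.DiscoveryManager;

/** Sketch: resolve another service's instances by name via Eureka (legacy 1.x API). */
public class EurekaLookup {
    public static void main(String[] args) {
        // Bootstrap the client; service URLs, app name, etc. come from
        // eureka-client.properties on the classpath (not shown).
        DiscoveryManager.getInstance().initComponent(
                new MyDataCenterInstanceConfig(),
                new DefaultEurekaClientConfig());

        // Ask Eureka for the next healthy instance of a (hypothetical) service.
        InstanceInfo target = DiscoveryManager.getInstance().getDiscoveryClient()
                .getNextServerFromEureka("MOVIESERVICE", false /* not secure */);

        System.out.println("Call " + target.getHostName() + ":" + target.getPort());

        DiscoveryManager.getInstance().shutdownComponent();
    }
}
```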
51. Deploying Eureka Service – 1 per Zone
52. Searchable state history for a Region / Account Edda AWS Instances, ASGs, etc. Eureka Services metadata Your Own Custom State Monkeys Timestamped delta cache of JSON describe call results for anything of interest… Edda Launch Use Asgard to launch AMI or use CloudFormation Template
53. Edda Query Examples Find any instances that have ever had a specific public IP address: $ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0" ["i-0123456789","i-012345678a","i-012345678b"] Show the most recent change to a security group: $ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2" --- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810 +++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504 @@ -1,33 +1,33 @@ { … "ipRanges" : [ "10.10.1.1/32", "10.10.1.2/32", + "10.10.1.3/32", - "10.10.1.4/32" … }
54. Archaius – Property Console
55. Archaius library – configuration management Based on Pytheas. Not open sourced yet. SimpleDB or DynamoDB for NetflixOSS. Netflix uses Cassandra for multi-region…
56. Denominator: DNS for Multi-Region Availability Zone A Cassandra Replicas Zone B Cassandra Replicas Zone C Cassandra Replicas Regional Load Balancers Zone A Cassandra Replicas Denominator Zone B Cassandra Replicas Zone C Cassandra Replicas Regional Load Balancers UltraDNS DynECT DNS AWS Route 53 Zuul API Router Denominator – manage traffic via multiple DNS providers with Java code
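Denominator (slides 20 and 56) exists so the failover tooling can talk to UltraDNS, DynECT, and Route 53 through one Java abstraction instead of three vendor SDKs. Rather than quote Denominator's exact API, the sketch below shows the shape of that abstraction with invented types (RegionalDnsProvider, upsertCname, DnsFailover) so the DNS-driven failover of slide 39 is concrete; none of these names come from the real library.

```java
import java.util.Map;

/** Hypothetical provider-neutral DNS interface in the spirit of Denominator; illustrative only. */
interface RegionalDnsProvider {
    /** Point the given record at a region's load balancer endpoint. */
    void upsertCname(String zone, String recordName, String target, int ttlSeconds);
}

/** Failover = the same change issued against every configured DNS vendor. */
public class DnsFailover {
    private final Map<String, RegionalDnsProvider> providers; // e.g. "ultradns", "dynect", "route53"

    public DnsFailover(Map<String, RegionalDnsProvider> providers) {
        this.providers = providers;
    }

    /** Shift api.example.com from the failed region's ELB to a healthy one. */
    public void failover(String healthyRegionElb) {
        for (Map.Entry<String, RegionalDnsProvider> entry : providers.entrySet()) {
            // Short TTL so clients re-resolve quickly after the change.
            entry.getValue().upsertCname("example.com.", "api.example.com.",
                    healthyRegionElb, 60);
            System.out.println("updated DNS via " + entry.getKey());
        }
    }
}
```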
57. Zuul – Smart and Scalable Routing Layer
58. Ribbon library for internal request routing
59. Ribbon – Zone-Aware LB
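Ribbon (slides 58–59) is the client-side load balancer that internal callers and Zuul use to pick an instance, preferring servers in the caller's own zone when that zone is healthy. A minimal sketch using Ribbon's BaseLoadBalancer with a hard-coded server list; in production the list is fed from Eureka and the default zone-aware balancer applies zone affinity, and the hostnames here are placeholders.

```java
import java.util.Arrays;

import com.netflix.loadbalancer.BaseLoadBalancer;
import com.netflix.loadbalancer.Server;

/** Sketch: client-side load balancing with Ribbon's BaseLoadBalancer. */
public class RibbonExample {
    public static void main(String[] args) {
        BaseLoadBalancer lb = new BaseLoadBalancer();

        // Placeholder servers; real services get this list from Eureka and use
        // the zone-aware load balancer to prefer same-zone instances.
        lb.addServers(Arrays.asList(
                new Server("movieservice-1a.example.com", 7001),
                new Server("movieservice-1b.example.com", 7001),
                new Server("movieservice-1c.example.com", 7001)));

        // Each call picks a server according to the configured rule (round robin by default).
        for (int i = 0; i < 5; i++) {
            Server chosen = lb.chooseServer(null);
            System.out.println("request " + i + " -> " + chosen.getHostPort());
        }
    }
}
```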
60. Karyon – Common server container • Bootstrapping o Dependency & Lifecycle management via Governator o Service registry via Eureka o Property management via Archaius o Hooks for Latency Monkey testing o Preconfigured status page and healthcheck servlets
61. Karyon • Embedded Status Page Console o Environment o Eureka o JMX
62. Karyon • Sample Service using Karyon available as "Hello-netflix-oss" on GitHub
63. Hystrix Circuit Breaker: Fail fast -> recover fast
64. Hystrix Circuit Breaker State Flow
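Slides 63–64 cover the Hystrix circuit breaker: wrap each remote dependency call in a command, fail fast when the dependency is unhealthy, and serve a fallback so the breaker can recover. A minimal HystrixCommand sketch; the command and group names, the simulated failure, and the fallback value are illustrative.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

/** Sketch: wrapping a flaky dependency call in a Hystrix command with a fallback. */
public class RecommendationsCommand extends HystrixCommand<String> {

    private final String userId;

    public RecommendationsCommand(String userId) {
        super(HystrixCommandGroupKey.Factory.asKey("RecommendationsService"));
        this.userId = userId;
    }

    @Override
    protected String run() throws Exception {
        // Placeholder for the real remote call; failures and timeouts here trip the breaker.
        if (Math.random() < 0.5) {
            throw new RuntimeException("dependency timed out");
        }
        return "personalized row for " + userId;
    }

    @Override
    protected String getFallback() {
        // Fail fast with a degraded answer instead of queuing behind a sick dependency.
        return "top-10 overall (fallback)";
    }

    public static void main(String[] args) {
        System.out.println(new RecommendationsCommand("u1").execute());
    }
}
```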
65. Turbine Dashboard Per Second Update Circuit Breakers in a Web Browser
66. Either you break it, or users will
67. Add some Chaos to your system
68. Clean up your room! – Janitor Monkey Works with Edda history to clean up after Asgard
69. Conformity Monkey Track and alert for old code versions and known issues Walks Karyon status pages found via Edda
70. In use by many companies netflixoss@netflix.com
71. http://www.meetup.com/Netflix-Open-Source-Platform/
72. Takeaways By embracing isolation and redundancy for availability we transformed our Cloud Infrastructure to be resilient against region-wide outages Use NetflixOSS to scale your Enterprise Contribute to existing GitHub projects and add your own http://netflix.github.com @NetflixOSS @rusmeshenberg
73. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/netflix-availability-regions
