Netflix Global Cloud Architecture
1. Globally Distributed Cloud Applications at Netflix
October 2012
Adrian Cockcroft – @adrianco #netflixcloud
http://www.linkedin.com/in/adriancockcroft
2. Adrian Cockcroft
• Director, Architecture for Cloud Systems, Netflix Inc.
  – Previously Director for Personalization Platform
• Distinguished Availability Engineer, eBay Inc. 2004-7
  – Founding member of eBay Research Labs
• Distinguished Engineer, Sun Microsystems Inc. 1988-2004
  – 2003-4 Chief Architect High Performance Technical Computing
  – 2001 Author: Capacity Planning for Web Services
  – 1999 Author: Resource Management
  – 1995 & 1998 Author: Sun Performance and Tuning
  – 1996 Japanese Edition of Sun Performance and Tuning
    • SPARC & Solaris Performance Tuning (SunSoft Press series)
• More
  – Twitter @adrianco
  – Blog http://perfcap.blogspot.com
  – Presentations at http://www.slideshare.net/adrianco
3. The Netflix Streaming Service
Now in USA, Canada, Latin America, UK, Ireland, Sweden, Denmark, Norway and Finland
11. What Netflix Did
• Moved to SaaS
  – Corporate IT – OneLogin, Workday, Box, Evernote…
  – Tools – Pagerduty, AppDynamics, EMR (Hadoop)
• Built our own PaaS
  – Customized to make our developers productive
  – Large scale, global, highly available, leveraging AWS
• Moved incremental capacity to IaaS
  – No new datacenter space since 2008 as we grew
  – Moved our streaming apps to the cloud
12. Keeping up with Developer Trends
In production at Netflix
• Big Data/Hadoop 2009
• AWS Cloud 2009
• Application Performance Management 2010
• Integrated DevOps Practices 2010
• Continuous Integration/Delivery 2010
• NoSQL 2010
• Platform as a Service; Fine grain SOA 2010
• Social coding, open development/github 2011
14. Portability vs. Functionality
• Portability – the Operations focus
  – Avoid vendor lock-in
  – Support datacenter based use cases
  – Possible operations cost savings
• Functionality – the Developer focus
  – Less complex test and debug, one mature supplier
  – Faster time to market for your products
  – Possible developer time/cost savings
15. Functional PaaS
• IaaS base – all the features of AWS
  – Very large scale, mature, global, evolving rapidly
  – ELB, Autoscale, VPC, SQS, EIP, EMR, etc, etc.
  – E.g. Large files (TB) and multipart writes in S3 (see the upload sketch after this list)
• Functional PaaS – Netflix added features
  – Continuous build/deploy, SOA, HA patterns
  – Asgard console, Monkeys, Big data tools
  – Cassandra/Zookeeper data store automation
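As an illustration of the multipart S3 writes mentioned above, here is a minimal sketch using the AWS SDK for Java TransferManager, which transparently splits large objects into multipart uploads. The bucket, key and file path are hypothetical examples, not taken from the deck.

```java
import java.io.File;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;

public class LargeFileUpload {
    public static void main(String[] args) throws InterruptedException {
        // Credentials come from the default AWS provider chain; large files are
        // automatically uploaded as parallel multipart uploads.
        TransferManager tm = new TransferManager();

        // Hypothetical bucket/key: a large encoded video asset.
        Upload upload = tm.upload("example-content-bucket",
                                  "encodes/title-1234/stream.mp4",
                                  new File("/data/encodes/title-1234/stream.mp4"));

        upload.waitForCompletion();  // blocks until all parts have been uploaded
        tm.shutdownNow();
    }
}
```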
16. How Netflix Works
(Diagram) Consumer Electronics / Customer Device (PC, PS3, TV…) talk to the AWS Cloud – Web Site or Discovery API, Services, Personalization, User Data, DRM, Streaming API, QoS Logging, CDN Management and Steering, Content Encoding – and stream from CDN Edge Locations (OpenConnect CDN Boxes).
20. Current Architectural Patterns for Availability
• Isolated Services
  – Resilient Business logic
• Three Balanced Availability Zones
  – Resilient to Infrastructure outage
• Triple Replicated Persistence
  – Durable distributed Storage
• Isolated Regions
  – US and EU don't take each other down
22. Three Balanced Availability Zones
Test with Chaos Gorilla
(Diagram) Load Balancers feed Zone A, Zone B and Zone C; each zone runs Cassandra and Evcache replicas.
23. Triple Replicated Persistence
Cassandra maintenance affects individual replicas
(Diagram) Load Balancers feed Zone A, Zone B and Zone C; each zone runs Cassandra and Evcache replicas.
24. Isolated Regions
(Diagram) US-East Load Balancers feed Zones A, B and C, each with Cassandra replicas; EU-West Load Balancers feed Zones A, B and C, each with Cassandra replicas.
25. Failure Modes and Effects
Failure Mode        | Probability | Mitigation Plan
Application Failure | High        | Automatic degraded response
AWS Region Failure  | Low         | Wait for region to recover
AWS Zone Failure    | Medium      | Continue to run on 2 out of 3 zones
Datacenter Failure  | Medium      | Migrate more functions to cloud
Data store failure  | Low         | Restore from S3 backups
S3 failure          | Low         | Restore from remote archive
26. Netflix Deployed on AWS
(Diagram: workload areas by year moved) Content (2009), Logs (2009), Play (2010), WWW (2010), API (2010), CS (2011)
Items shown: Content Management, EC2 Encoding, S3 (Terabytes), International CS lookup, DRM, Sign-Up, Metadata, Search, Solr, Device Config & Actions, Diagnostics, EMR, Hive & Pig (Petabytes), Business Intelligence, CDN routing, Movie Choosing, TV Movie Choosing, Bookmarks, Ratings, Social, Facebook, Logging, Customer Call Log, CS Analytics, CDNs, ISPs, Terabits, Customers
28. Datacenter to Cloud Transition Goals
• Faster
  – Lower latency than the equivalent datacenter web pages and API calls
  – Measured as mean and 99th percentile
  – For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
  – Avoid needing any more datacenter capacity as subscriber count increases
  – No central vertically scaled databases
  – Leverage AWS elastic capacity effectively
• Available
  – Substantially higher robustness and availability than datacenter services
  – Leverage multiple AWS availability zones
  – No scheduled down time, no central database schema to change
• Productive
  – Optimize agility of a large development team with automation and tools
  – Leave behind complex tangled datacenter code base (~8 year old architecture)
  – Enforce clean layered interfaces and re-usable components
29. Netflix Datacenter vs. Cloud Arch
Central SQL Database → Distributed Key/Value NoSQL
Sticky In-Memory Session → Shared Memcached Session
Chatty Protocols → Latency Tolerant Protocols
Tangled Service Interfaces → Layered Service Interfaces
Instrumented Code → Instrumented Service Patterns
Fat Complex Objects → Lightweight Serializable Objects
Components as Jar Files → Components as Services
30. Cassandra on AWS
A highly available and durable deployment pattern
31. Cassandra Service Pattern
(Diagram) REST Clients → Data Access REST Service (Astyanax Cassandra Client) → Cassandra Cluster, managed by Priam, between 6 and 72 nodes. Also shown: Datacenter Update Flow and AppDynamics Service Flow Visualization.
32. Production Deployment
Totally Denormalized Data Model
Over 50 Cassandra Clusters
Over 500 nodes
Over 30TB of daily backups
Biggest cluster 72 nodes
1 cluster over 250K writes/s
33. Astyanax – Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
1. Client writes to local Cassandra coordinator
2. Coordinator writes to 2 other zones
3. Nodes return ack
4. Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Token aware clients. Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
(Diagram: Cassandra nodes with local disks in Zones A, B and C)
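To make the client side of this flow concrete, here is a minimal Astyanax sketch of a token-aware connection and a single write; the cluster, keyspace, column family, seed address and row/column names are hypothetical, not taken from the deck, and the consistency level mirrors the "one node, a quorum, or all nodes" choices above.

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxWriteExample {
    public static void main(String[] args) throws Exception {
        // Build a token-aware client; all names here are hypothetical examples.
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("example_cluster")
            .forKeyspace("example_keyspace")
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)          // learn the ring layout
                .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE)      // route to the right replica
                .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_ONE))  // or CL_QUORUM / CL_ALL
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("example_pool")
                .setPort(9160)
                .setMaxConnsPerHost(3)
                .setSeeds("10.0.0.1:9160"))   // hypothetical seed node
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getClient();

        ColumnFamily<String, String> cf = ColumnFamily.newColumnFamily(
            "bookmarks", StringSerializer.get(), StringSerializer.get());

        // Step 1 of the flow: the client sends the write to a local coordinator;
        // replication to the other zones (steps 2-4) happens server side.
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(cf, "customer-1234").putColumn("last-position", "00:42:07", null);
        m.execute();

        context.shutdown();
    }
}
```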
34. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
1. Client writes to local replicas
2. Local write acks returned to Client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
(Diagram: US and EU regions, each with Cassandra nodes and disks in Zones A, B and C; 100+ms latency between regions)
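In Astyanax terms, "Consistency Level = Local Quorum" corresponds to CL_LOCAL_QUORUM on the write, so the client only waits for 2 of the 3 replicas in its own region. A small hedged sketch (class, method and row/column names are hypothetical; the Keyspace and ColumnFamily are assumed to be set up as in the previous example):

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;

public class LocalQuorumWrite {
    /** Write one column, waiting only for a quorum of replicas in the local region. */
    public static void writeBookmark(Keyspace keyspace, ColumnFamily<String, String> cf,
                                     String rowKey, String position) throws Exception {
        MutationBatch m = keyspace.prepareMutationBatch()
            .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM); // ack after 2 of 3 local replicas
        m.withRow(cf, rowKey).putColumn("last-position", position, null);
        m.execute(); // cross-region replication (steps 3-5 above) continues asynchronously
    }
}
```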
35. ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
  – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
  – High throughput raw SSTable processing
  – Re-normalizes many clusters to a consistent view
  – Extract, Transform, then Load into Teradata
37. Cloud Deployment Scalability
New Autoscaled AMI – zero to 500 instances from 21:38:52 to 21:46:32, 7m40s
Scaled up and down over a few days, total 2176 instance launches, m2.2xlarge (4 core, 34GB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
41.0 104.2 149.0 171.8 215.8 562.0
38. Scalability from 48 to 288 nodes on AWS
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Client Writes/s by node count – Replication Factor = 3
Measured client writes/s at increasing node counts, scaling from 48 to 288 nodes: 174,373 → 366,828 → 537,172 → 1,099,837 (at 288 nodes)
Used 288 m1.xlarge instances: 4 CPU, 15 GB RAM, 8 ECU
Cassandra 0.86; benchmark config only existed for about 1 hr
42. Chaos Monkey
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
• Computers (Datacenter or AWS) randomly die
  – Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient
  – Allow any instance to fail without customer impact
• Chaos Monkey hours
  – Monday-Friday 9am-3pm random instance kill
• Application configuration option
  – Apps now have to opt-out from Chaos Monkey
43. Responsibility and Experience
• Make developers responsible for failures
  – Then they learn and write code that doesn't fail
• Use Incident Reviews to find gaps to fix
  – Make sure it's not about finding "who to blame"
• Keep timeouts short, fail fast
  – Don't let cascading timeouts stack up
• Make configuration options dynamic (see the sketch after this list)
  – You don't want to push code to tweak an option
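The "dynamic configuration options" point maps to the Archaius dynamic properties service listed later in this deck. Here is a minimal hedged sketch of how a short timeout could be tuned at runtime without a code push; the property name and default value are hypothetical.

```java
import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

public class RemoteCallTimeout {
    // Hypothetical property name; a small default keeps the timeout short so
    // callers fail fast instead of letting cascading timeouts stack up.
    private static final DynamicIntProperty TIMEOUT_MS =
        DynamicPropertyFactory.getInstance().getIntProperty("example.service.timeout.ms", 250);

    public static int currentTimeoutMs() {
        // Re-read on every call: if the property source changes the value,
        // the new timeout takes effect immediately, with no redeploy.
        return TIMEOUT_MS.get();
    }
}
```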
45. Distributed Operational Model
• Developers
  – Provision and run their own code in production
  – Take turns to be on call if it breaks (pagerduty)
  – Configure autoscalers to handle capacity needs
• DevOps and PaaS (aka NoOps)
  – DevOps is used to build and run the PaaS
  – PaaS constrains Dev to use automation instead
  – PaaS puts more responsibility on Dev, with tools
47. Unconventional Culture
See culture deck at http://jobs.netflix.com
• Brave/Aggressive from the top down
• Focus on talent density above everything
• Reduce process, remove complexity
• Freedom and Responsibility
• One product focus for the whole company
• (almost) full information sharing across co.
• Simplified managers role
48. Managers Role
• Hiring, Architecture, Project Management
• No vacation policy to track
• (Almost) no remote employees or contractors
• No bonuses to allocate
• No expenses to approve
• Pay mark to market handled at VP level
49. Netflix Organization
DevOps Org reporting into Product Group, not ITops
CEO – Reed Hastings
CPO – Chief Product Officer – Neil Hunt
VP – Cloud and Platform Engineering – Yury
(Org chart) Teams and functions shown include: Platform and Cloud Ops; Personalization and Membership; Persistence; Reliability Engineering; Platform and Data Science; Architecture; Cloud Solutions; Billing; Platform Engineering; Performance Eng; Future planning; Base Platform; Monitoring; Metadata; Alert Routing; Data sources; Business Intelligence; Security Arch; Zookeeper; Monkeys; Benchmarking; Incident Lifecycle; Vault processing; Efficiency; Cassandra Ops; Build Tools; Memcached; AWS VPC; AWS Instances; Hyperguard; PagerDuty; Cassandra; Hadoop on EMR; AWS API; Powerpoint :-)
51. Components
• Continuous build framework turns code into AMIs
• AWS accounts for test, production, etc.
• Cloud access gateway
• Service registry
• Configuration properties service
• Persistence services
• Monitoring, alert forwarding
• Backups, archives
52. Netflix Open Source Strategy
• Release PaaS Components git-by-git
  – Source at github.com/netflix – we build from it…
  – Intros and techniques at techblog.netflix.com
  – Blog post or new code every few weeks
• Motivations
  – Give back to Apache licensed OSS community
  – Motivate, retain, hire top engineers
  – "Peer pressure" code cleanup, external contributions
53. Instance creation
(Pipeline diagram) Bakery & Build tools combine the Base AMI with application Code → Image baked; Asgard (with Autoscaling and Odin scripts) launches the ASG / Instance started → Instance Running (Application).
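Asgard drives this launch step through the AWS Auto Scaling APIs. Purely as a hedged illustration of what an ASG definition involves (not Asgard's actual code), a minimal AWS SDK for Java sketch with hypothetical names and IDs might look like this:

```java
import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
import com.amazonaws.services.autoscaling.model.CreateLaunchConfigurationRequest;

public class LaunchAsgExample {
    public static void main(String[] args) {
        AmazonAutoScalingClient autoscaling = new AmazonAutoScalingClient();

        // The launch configuration points at the freshly baked AMI (hypothetical IDs/names).
        autoscaling.createLaunchConfiguration(new CreateLaunchConfigurationRequest()
            .withLaunchConfigurationName("exampleapp-v042")
            .withImageId("ami-12345678")
            .withInstanceType("m2.2xlarge"));

        // The ASG spreads instances across three availability zones, matching the
        // "three balanced zones" pattern shown earlier in the deck.
        autoscaling.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
            .withAutoScalingGroupName("exampleapp-v042")
            .withLaunchConfigurationName("exampleapp-v042")
            .withAvailabilityZones("us-east-1a", "us-east-1b", "us-east-1c")
            .withMinSize(3)
            .withMaxSize(500));
    }
}
```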
55. Runtime
Runtime components for calling other services, managing the service, and resiliency aids: Astyanax, Priam, Curator, Chaos Monkey, Latency Monkey, NIWS REST client, LB, Exhibitor, Janitor Monkey, Cass JMeter, Dependency Command, Explorers.
56. Open Source Projects
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
• Priam – Cassandra as a Service
• Exhibitor – Zookeeper as a Service
• Servo and Autoscaling Scripts
• Astyanax – Cassandra client for Java
• Curator – Zookeeper Patterns
• Honu – Log4j streaming to Hadoop
• CassJMeter – Cassandra test suite
• EVCache – Memcached as a Service
• Circuit Breaker – Robust service pattern
• Cassandra Multi-region EC2 datastore support
• Eureka / Discovery – Service Directory
• Asgard – AutoScaleGroup based AWS console
• Aegisthus – Hadoop ETL for Cassandra
• Archaius – Dynamic Properties Service
• Chaos Monkey – Robustness verification
• Explorers – EntryPoints
• Latency Monkey – Server-side latency/error injection
• Governator – Library lifecycle and dependency injection
• Janitor Monkey
• Odin – Workflow orchestration
• REST Client + mid-tier LB
• Bakeries and AMI
• Async logging
• Configuration REST endpoints
• Build dynaslaves
57. Roadmap for 2012
• More resiliency and improved availability
• More automation, orchestration
• "Hardening" the platform, code clean-up
• Lower latency for web services and devices
• IPv6 – now running in prod, rollout in process
• More open sourced components
• See you at AWS Re:Invent in November…
58. Takeaway
Netflix has built and deployed a scalable global Platform as a Service. Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud
59. Amazon Cloud Terminology Reference
See http://aws.amazon.com/. This is not a full list of Amazon Web Service features.
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
  – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
  – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
  – Reserved Instances – pre-paid to reduce cost for long term usage
  – Availability Zone – datacenter with own power and cooling hosting cloud instances
  – Region – group of Avail Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
• RDS – Relational Database Service (managed MySQL master and slaves)
• DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)
• DirectConnect – secure pipe from AWS VPC to external datacenter
• IAM – Identity and Access Management (fine grain role based security keys)