Netflix Global Cloud Architecture
1. Globally Distributed Cloud Applications at Netflix
October 2012
Adrian Cockcroft – @adrianco #netflixcloud
http://www.linkedin.com/in/adriancockcroft
2. Adrian Cockcroft
• Director, Architecture for Cloud Systems, Netflix Inc.
  – Previously Director for Personalization Platform
• Distinguished Availability Engineer, eBay Inc. 2004-7
  – Founding member of eBay Research Labs
• Distinguished Engineer, Sun Microsystems Inc. 1988-2004
  – 2003-4 Chief Architect High Performance Technical Computing
  – 2001 Author: Capacity Planning for Web Services
  – 1999 Author: Resource Management
  – 1995 & 1998 Author: Sun Performance and Tuning
  – 1996 Japanese Edition of Sun Performance and Tuning
    • SPARC & Solaris Performance Tuning (SunSoft Press series)
• More
  – Twitter @adrianco
  – Blog http://perfcap.blogspot.com
  – Presentations at http://www.slideshare.net/adrianco
3. The Netflix Streaming Service
Now in USA, Canada, Latin America, UK, Ireland, Sweden, Denmark, Norway and Finland
11. What Netflix Did
• Moved to SaaS
  – Corporate IT – OneLogin, Workday, Box, Evernote…
  – Tools – Pagerduty, AppDynamics, EMR (Hadoop)
• Built our own PaaS
  – Customized to make our developers productive
  – Large scale, global, highly available, leveraging AWS
• Moved incremental capacity to IaaS
  – No new datacenter space since 2008 as we grew
  – Moved our streaming apps to the cloud
12. Keeping up with Developer Trends
In production at Netflix
• Big Data/Hadoop 2009
• AWS Cloud 2009
• Application Performance Management 2010
• Integrated DevOps Practices 2010
• Continuous Integration/Delivery 2010
• NoSQL 2010
• Platform as a Service; Fine grain SOA 2010
• Social coding, open development/github 2011
14. Portability vs. Functionality
• Portability – the Operations focus
  – Avoid vendor lock-in
  – Support datacenter based use cases
  – Possible operations cost savings
• Functionality – the Developer focus
  – Less complex test and debug, one mature supplier
  – Faster time to market for your products
  – Possible developer time/cost savings
15. Functional PaaS
• IaaS base – all the features of AWS
  – Very large scale, mature, global, evolving rapidly
  – ELB, Autoscale, VPC, SQS, EIP, EMR, etc, etc.
  – E.g. Large files (TB) and multipart writes in S3 (see the upload sketch after this list)
• Functional PaaS – Netflix added features
  – Continuous build/deploy, SOA, HA patterns
  – Asgard console, Monkeys, Big data tools
  – Cassandra/Zookeeper data store automation
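As an illustration of the multipart S3 writes mentioned above, here is a minimal sketch using the AWS SDK for Java TransferManager, which transparently splits large objects into multipart uploads. The bucket, key and file path are hypothetical examples, not taken from the deck.

```java
import java.io.File;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;

public class LargeFileUpload {
    public static void main(String[] args) throws InterruptedException {
        // Credentials come from the default AWS provider chain; large files are
        // automatically uploaded as parallel multipart uploads.
        TransferManager tm = new TransferManager();

        // Hypothetical bucket/key: a large encoded video asset.
        Upload upload = tm.upload("example-content-bucket",
                                  "encodes/title-1234/stream.mp4",
                                  new File("/data/encodes/title-1234/stream.mp4"));

        upload.waitForCompletion();  // blocks until all parts have been uploaded
        tm.shutdownNow();
    }
}
```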
16. How Netflix Works
(Diagram) Consumer Electronics / Customer Device (PC, PS3, TV…) talk to the AWS Cloud – Web Site or Discovery API, Services, Personalization, User Data, DRM, Streaming API, QoS Logging, CDN Management and Steering, Content Encoding – and stream from CDN Edge Locations (OpenConnect CDN Boxes).
20. Current Architectural Patterns for Availability
• Isolated Services
  – Resilient Business logic
• Three Balanced Availability Zones
  – Resilient to Infrastructure outage
• Triple Replicated Persistence
  – Durable distributed Storage
• Isolated Regions
  – US and EU don't take each other down
22. Three Balanced Availability Zones
Test with Chaos Gorilla
(Diagram) Load Balancers feed Zone A, Zone B and Zone C; each zone runs Cassandra and Evcache replicas.
23. Triple Replicated Persistence
Cassandra maintenance affects individual replicas
(Diagram) Load Balancers feed Zone A, Zone B and Zone C; each zone runs Cassandra and Evcache replicas.
24. Isolated Regions
(Diagram) US-East Load Balancers feed Zones A, B and C, each with Cassandra replicas; EU-West Load Balancers feed Zones A, B and C, each with Cassandra replicas.
25. Failure Modes and Effects
Failure Mode        | Probability | Mitigation Plan
Application Failure | High        | Automatic degraded response
AWS Region Failure  | Low         | Wait for region to recover
AWS Zone Failure    | Medium      | Continue to run on 2 out of 3 zones
Datacenter Failure  | Medium      | Migrate more functions to cloud
Data store failure  | Low         | Restore from S3 backups
S3 failure          | Low         | Restore from remote archive
26. Netflix Deployed on AWS
(Diagram: workload areas by year moved) Content (2009), Logs (2009), Play (2010), WWW (2010), API (2010), CS (2011)
Items shown: Content Management, EC2 Encoding, S3 (Terabytes), International CS lookup, DRM, Sign-Up, Metadata, Search, Solr, Device Config & Actions, Diagnostics, EMR, Hive & Pig (Petabytes), Business Intelligence, CDN routing, Movie Choosing, TV Movie Choosing, Bookmarks, Ratings, Social, Facebook, Logging, Customer Call Log, CS Analytics, CDNs, ISPs, Terabits, Customers
28. Datacenter to Cloud Transition Goals
• Faster
  – Lower latency than the equivalent datacenter web pages and API calls
  – Measured as mean and 99th percentile
  – For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
  – Avoid needing any more datacenter capacity as subscriber count increases
  – No central vertically scaled databases
  – Leverage AWS elastic capacity effectively
• Available
  – Substantially higher robustness and availability than datacenter services
  – Leverage multiple AWS availability zones
  – No scheduled down time, no central database schema to change
• Productive
  – Optimize agility of a large development team with automation and tools
  – Leave behind complex tangled datacenter code base (~8 year old architecture)
  – Enforce clean layered interfaces and re-usable components
29. Netflix Datacenter vs. Cloud Arch
Central SQL Database → Distributed Key/Value NoSQL
Sticky In-Memory Session → Shared Memcached Session
Chatty Protocols → Latency Tolerant Protocols
Tangled Service Interfaces → Layered Service Interfaces
Instrumented Code → Instrumented Service Patterns
Fat Complex Objects → Lightweight Serializable Objects
Components as Jar Files → Components as Services
30. Cassandra on AWS
A highly available and durable deployment pattern
31. Cassandra Service Pattern
(Diagram) REST Clients → Data Access REST Service (Astyanax Cassandra Client) → Cassandra Cluster, managed by Priam, between 6 and 72 nodes. Also shown: Datacenter Update Flow and AppDynamics Service Flow Visualization.
32. Production Deployment
Totally Denormalized Data Model
Over 50 Cassandra Clusters
Over 500 nodes
Over 30TB of daily backups
Biggest cluster 72 nodes
1 cluster over 250K writes/s
33. Astyanax – Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
1. Client writes to local Cassandra coordinator
2. Coordinator writes to 2 other zones
3. Nodes return ack
4. Data written to internal commit log disks (no more than 10 seconds later)
If a node goes offline, hinted handoff completes the write when the node comes back up.
Token aware clients. Requests can choose to wait for one node, a quorum, or all nodes to ack the write.
SSTable disk writes and compactions occur asynchronously.
(Diagram: Cassandra nodes with local disks in Zones A, B and C)
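To make the client side of this flow concrete, here is a minimal Astyanax sketch of a token-aware connection and a single write; the cluster, keyspace, column family, seed address and row/column names are hypothetical, not taken from the deck, and the consistency level mirrors the "one node, a quorum, or all nodes" choices above.

```java
import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolType;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class AstyanaxWriteExample {
    public static void main(String[] args) throws Exception {
        // Build a token-aware client; all names here are hypothetical examples.
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster("example_cluster")
            .forKeyspace("example_keyspace")
            .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)          // learn the ring layout
                .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE)      // route to the right replica
                .setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_ONE))  // or CL_QUORUM / CL_ALL
            .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("example_pool")
                .setPort(9160)
                .setMaxConnsPerHost(3)
                .setSeeds("10.0.0.1:9160"))   // hypothetical seed node
            .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
            .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start();
        Keyspace keyspace = context.getClient();

        ColumnFamily<String, String> cf = ColumnFamily.newColumnFamily(
            "bookmarks", StringSerializer.get(), StringSerializer.get());

        // Step 1 of the flow: the client sends the write to a local coordinator;
        // replication to the other zones (steps 2-4) happens server side.
        MutationBatch m = keyspace.prepareMutationBatch();
        m.withRow(cf, "customer-1234").putColumn("last-position", "00:42:07", null);
        m.execute();

        context.shutdown();
    }
}
```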
34. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
1. Client writes to local replicas
2. Local write acks returned to Client, which continues when 2 of 3 local nodes are committed
3. Local coordinator writes to remote coordinator
4. When data arrives, remote coordinator node acks and copies to other remote zones
5. Remote nodes ack to local coordinator
6. Data flushed to internal commit log disks (no more than 10 seconds later)
If a node or region goes offline, hinted handoff completes the write when the node comes back up. Nightly global compare and repair jobs ensure everything stays consistent.
(Diagram: US and EU regions, each with Cassandra nodes and disks in Zones A, B and C; 100+ms latency between regions)
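In Astyanax terms, "Consistency Level = Local Quorum" corresponds to CL_LOCAL_QUORUM on the write, so the client only waits for 2 of the 3 replicas in its own region. A small hedged sketch (class, method and row/column names are hypothetical; the Keyspace and ColumnFamily are assumed to be set up as in the previous example):

```java
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.model.ConsistencyLevel;

public class LocalQuorumWrite {
    /** Write one column, waiting only for a quorum of replicas in the local region. */
    public static void writeBookmark(Keyspace keyspace, ColumnFamily<String, String> cf,
                                     String rowKey, String position) throws Exception {
        MutationBatch m = keyspace.prepareMutationBatch()
            .setConsistencyLevel(ConsistencyLevel.CL_LOCAL_QUORUM); // ack after 2 of 3 local replicas
        m.withRow(cf, rowKey).putColumn("last-position", position, null);
        m.execute(); // cross-region replication (steps 3-5 above) continues asynchronously
    }
}
```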
35. ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
  – http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
  – High throughput raw SSTable processing
  – Re-normalizes many clusters to a consistent view
  – Extract, Transform, then Load into Teradata
37. Cloud Deployment Scalability
New Autoscaled AMI – zero to 500 instances from 21:38:52 to 21:46:32, 7m40s
Scaled up and down over a few days, total 2176 instance launches, m2.2xlarge (4 core, 34GB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
41.0 104.2 149.0 171.8 215.8 562.0
38. Scalability from 48 to 288 nodes on AWS
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Client Writes/s by node count – Replication Factor = 3
Measured client writes/s at increasing node counts, scaling from 48 to 288 nodes: 174,373 → 366,828 → 537,172 → 1,099,837 (at 288 nodes)
Used 288 m1.xlarge instances: 4 CPU, 15 GB RAM, 8 ECU
Cassandra 0.86; benchmark config only existed for about 1 hr
42. Chaos Monkey
http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
• Computers (Datacenter or AWS) randomly die
  – Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient
  – Allow any instance to fail without customer impact
• Chaos Monkey hours
  – Monday-Friday 9am-3pm random instance kill
• Application configuration option
  – Apps now have to opt-out from Chaos Monkey
43. Responsibility and Experience
• Make developers responsible for failures
  – Then they learn and write code that doesn't fail
• Use Incident Reviews to find gaps to fix
  – Make sure it's not about finding "who to blame"
• Keep timeouts short, fail fast
  – Don't let cascading timeouts stack up
• Make configuration options dynamic (see the sketch after this list)
  – You don't want to push code to tweak an option
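The "dynamic configuration options" point maps to the Archaius dynamic properties service listed later in this deck. Here is a minimal hedged sketch of how a short timeout could be tuned at runtime without a code push; the property name and default value are hypothetical.

```java
import com.netflix.config.DynamicIntProperty;
import com.netflix.config.DynamicPropertyFactory;

public class RemoteCallTimeout {
    // Hypothetical property name; a small default keeps the timeout short so
    // callers fail fast instead of letting cascading timeouts stack up.
    private static final DynamicIntProperty TIMEOUT_MS =
        DynamicPropertyFactory.getInstance().getIntProperty("example.service.timeout.ms", 250);

    public static int currentTimeoutMs() {
        // Re-read on every call: if the property source changes the value,
        // the new timeout takes effect immediately, with no redeploy.
        return TIMEOUT_MS.get();
    }
}
```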
45. Distributed Operational Model
• Developers
  – Provision and run their own code in production
  – Take turns to be on call if it breaks (pagerduty)
  – Configure autoscalers to handle capacity needs
• DevOps and PaaS (aka NoOps)
  – DevOps is used to build and run the PaaS
  – PaaS constrains Dev to use automation instead
  – PaaS puts more responsibility on Dev, with tools
47. Unconventional Culture
See culture deck at http://jobs.netflix.com
• Brave/Aggressive from the top down
• Focus on talent density above everything
• Reduce process, remove complexity
• Freedom and Responsibility
• One product focus for the whole company
• (almost) full information sharing across co.
• Simplified managers role
48. Managers Role
• Hiring, Architecture, Project Management
• No vacation policy to track
• (Almost) no remote employees or contractors
• No bonuses to allocate
• No expenses to approve
• Pay mark to market handled at VP level
49. Netflix Organization
DevOps Org reporting into Product Group, not ITops
CEO – Reed Hastings
CPO – Chief Product Officer – Neil Hunt
VP – Cloud and Platform Engineering – Yury
(Org chart) Teams and functions shown include: Platform and Cloud Ops; Personalization and Membership; Persistence; Reliability Engineering; Platform and Data Science; Architecture; Cloud Solutions; Billing; Platform Engineering; Performance Eng; Future planning; Base Platform; Monitoring; Metadata; Alert Routing; Data sources; Business Intelligence; Security Arch; Zookeeper; Monkeys; Benchmarking; Incident Lifecycle; Vault processing; Efficiency; Cassandra Ops; Build Tools; Memcached; AWS VPC; AWS Instances; Hyperguard; PagerDuty; Cassandra; Hadoop on EMR; AWS API; Powerpoint :-)
51. Components
• Continuous build framework turns code into AMIs
• AWS accounts for test, production, etc.
• Cloud access gateway
• Service registry
• Configuration properties service
• Persistence services
• Monitoring, alert forwarding
• Backups, archives
52. Netflix Open Source Strategy
• Release PaaS Components git-by-git
  – Source at github.com/netflix – we build from it…
  – Intros and techniques at techblog.netflix.com
  – Blog post or new code every few weeks
• Motivations
  – Give back to Apache licensed OSS community
  – Motivate, retain, hire top engineers
  – "Peer pressure" code cleanup, external contributions
53. Instance creation
(Pipeline diagram) Bakery & Build tools combine the Base AMI with application Code → Image baked; Asgard (with Autoscaling and Odin scripts) launches the ASG / Instance started → Instance Running (Application).
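Asgard drives this launch step through the AWS Auto Scaling APIs. Purely as a hedged illustration of what an ASG definition involves (not Asgard's actual code), a minimal AWS SDK for Java sketch with hypothetical names and IDs might look like this:

```java
import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.CreateAutoScalingGroupRequest;
import com.amazonaws.services.autoscaling.model.CreateLaunchConfigurationRequest;

public class LaunchAsgExample {
    public static void main(String[] args) {
        AmazonAutoScalingClient autoscaling = new AmazonAutoScalingClient();

        // The launch configuration points at the freshly baked AMI (hypothetical IDs/names).
        autoscaling.createLaunchConfiguration(new CreateLaunchConfigurationRequest()
            .withLaunchConfigurationName("exampleapp-v042")
            .withImageId("ami-12345678")
            .withInstanceType("m2.2xlarge"));

        // The ASG spreads instances across three availability zones, matching the
        // "three balanced zones" pattern shown earlier in the deck.
        autoscaling.createAutoScalingGroup(new CreateAutoScalingGroupRequest()
            .withAutoScalingGroupName("exampleapp-v042")
            .withLaunchConfigurationName("exampleapp-v042")
            .withAvailabilityZones("us-east-1a", "us-east-1b", "us-east-1c")
            .withMinSize(3)
            .withMaxSize(500));
    }
}
```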
55. Runtime
Runtime components for calling other services, managing the service, and resiliency aids: Astyanax, Priam, Curator, Chaos Monkey, Latency Monkey, NIWS REST client, LB, Exhibitor, Janitor Monkey, Cass JMeter, Dependency Command, Explorers.
56. Open Source Projects
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
• Priam – Cassandra as a Service
• Exhibitor – Zookeeper as a Service
• Servo and Autoscaling Scripts
• Astyanax – Cassandra client for Java
• Curator – Zookeeper Patterns
• Honu – Log4j streaming to Hadoop
• CassJMeter – Cassandra test suite
• EVCache – Memcached as a Service
• Circuit Breaker – Robust service pattern
• Cassandra Multi-region EC2 datastore support
• Eureka / Discovery – Service Directory
• Asgard – AutoScaleGroup based AWS console
• Aegisthus – Hadoop ETL for Cassandra
• Archaius – Dynamic Properties Service
• Chaos Monkey – Robustness verification
• Explorers – EntryPoints
• Latency Monkey – Server-side latency/error injection
• Governator – Library lifecycle and dependency injection
• Janitor Monkey
• Odin – Workflow orchestration
• REST Client + mid-tier LB
• Bakeries and AMI
• Async logging
• Configuration REST endpoints
• Build dynaslaves
57. Roadmap for 2012
• More resiliency and improved availability
• More automation, orchestration
• "Hardening" the platform, code clean-up
• Lower latency for web services and devices
• IPv6 – now running in prod, rollout in process
• More open sourced components
• See you at AWS Re:Invent in November…
58. Takeaway
Netflix has built and deployed a scalable global Platform as a Service. Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
@adrianco #netflixcloud
59. Amazon Cloud Terminology Reference
See http://aws.amazon.com/. This is not a full list of Amazon Web Service features.
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
  – Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
  – Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
  – Reserved Instances – pre-paid to reduce cost for long term usage
  – Availability Zone – datacenter with own power and cooling hosting cloud instances
  – Region – group of Avail Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
• RDS – Relational Database Service (managed MySQL master and slaves)
• DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)
• DirectConnect – secure pipe from AWS VPC to external datacenter
• IAM – Identity and Access Management (fine grain role based security keys)