Netflix has built and deployed a scalable global Platform as a Service. Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
1. The Netflix Open Source Platform
September 26th, 2012
Adrian Cockcroft, Ruslan Meshenberg
@adrianco @rusmeshenberg #netflixcloud
http://www.linkedin.com/in/adriancockcroft
http://www.linkedin.com/in/ruslanmeshenberg
2. What Netflix Did
• Moved to SaaS
– Corporate IT – OneLogin, Workday, Box, Evernote…
– Tools – PagerDuty, AppDynamics, Elastic MapReduce
• Built our own PaaS
– Customized to make our developers productive
– When we started, we had little choice
• Moved incremental capacity to IaaS
– No new datacenter space since 2008 as we grew
– Moved our streaming apps to the cloud
5. Netflix Choice was AWS with our own platform and tools
Unique platform requirements and extreme scale, agility and flexibility
6. Leverage AWS Scale: “the biggest public cloud”
AWS investment in features and automation
Use AWS zones and regions for high availability, scalability and global deployment
7. What about other PaaS?
• CloudFoundry – Open Source by VMware
– Developer-friendly, easy to get started
– Missing scale and some enterprise features
• Rightscale
– Widely used to abstract away from AWS
– Creates its own lock-in problem…
• AWS is growing into this space
– We didn’t want a vendor between us and AWS
– We wanted to build a thin PaaS that gets thinner
9. Keeping up with Developer Trends
Trend, with the year it went into production at Netflix:
• Big Data/Hadoop 2009
• AWS Cloud 2009
• Application Performance Management 2010
• Integrated DevOps Practices 2010
• Continuous Integration/Delivery 2010
• NoSQL 2010
• Platform as a Service; Fine grain SOA 2010
• Social coding, open development/github 2011
11. Portability vs. Functionality
• Portability – the Operations focus
– Avoid vendor lock-in
– Support datacenter based use cases
– Possible operations cost savings
• Functionality – the Developer focus
– Less complex test and debug, one mature supplier
– Faster time to market for your products
– Possible developer cost savings
12. Portable PaaS
• Portable IaaS Base - some AWS compatibility
– Eucalyptus – AWS licensed compatible subset
– CloudStack – Citrix Apache project
– OpenStack – Rackspace, Cloudscaling, HP etc.
• Portable PaaS
– VMware Cloud Foundry - run it yourself in your DC
– AppFog and Stackato – Cloud Foundry/OpenStack
– Vendor options: Rightscale, Enstratus, Smartscale
13. Functional PaaS
• IaaS base - all the features of AWS
– Very large scale, mature, global, evolving rapidly
– ELB, Autoscale, VPC, SQS, EIP, EMR, DynamoDB etc.
– Large files (TB) and multipart writes in S3
• Functional PaaS – Netflix added features
– Very large scale, mature, flexible, customizable
– Asgard console, Monkeys, Big data tools
– Cassandra/Zookeeper data store automation
14. Developers choose Functional
Don’t let the roadie write the set list!
(yes you do need all those guitars on tour…)
15. Freedom and Responsibility
• Developers leverage cloud to get freedom
– Agility of a single organization, no silos
• But now developers are responsible
– For compliance, performance, availability etc.
“As far as my rehab is concerned, it is within my ability to change and change for the better” - Eddie Van Halen
16. Amazon Cloud Terminology Reference
See http://aws.amazon.com/ This is not a full list of Amazon Web Service features
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
– Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
– Instance – a running computer system. Ephemeral: when it is de-allocated, nothing is kept.
– Reserved Instances – pre-paid to reduce cost for long term usage
– Availability Zone – datacenter with own power and cooling hosting cloud instances
– Region – group of Availability Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Store (network disk filesystem that can be mounted on an instance)
• RDS – Relational Database Service (managed MySQL master and slaves)
• DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)
• DirectConnect – secure pipe from AWS VPC to external datacenter
• IAM – Identity and Access Management (fine grain role based security keys)
17. What Runs in the Cloud?
Step by Step Netflix Product Transition
23. Current Architectural Patterns for Availability
• Isolated Services
– Resilient Business logic
• Three Balanced Availability Zones
– Resilient to Infrastructure outage
• Triple Replicated Persistence
– Durable distributed Storage
• Isolated Regions
– US and EU don’t take each other down
25. Three Balanced Availability Zones
Test with Chaos Gorilla
[Diagram: Load Balancers in front of Zone A, Zone B and Zone C; each zone holds Cassandra and Evcache Replicas]
26. Triple Replicated Persistence
Cassandra maintenance drops individual replicas
[Diagram: Load Balancers in front of Zone A, Zone B and Zone C; each zone holds Cassandra and Evcache Replicas]
27. Isolated Regions
[Diagram: US-East Load Balancers and EU-West Load Balancers, each in front of its own Zone A, Zone B and Zone C; each zone holds Cassandra Replicas]
28. Failure Modes and Effects
Failure Mode | Probability | Mitigation Plan
Application Failure | High | Automatic degraded response
AWS Region Failure | Low | Wait for region to recover
AWS Zone Failure | Medium | Continue to run on 2 out of 3 zones
Datacenter Failure | Medium | Migrate more functions to cloud
Data store failure | Low | Restore from S3 backups
S3 failure | Low | Restore from remote archive
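The "continue to run on 2 out of 3 zones" mitigation implies spare capacity in every zone. The arithmetic can be sketched as below. This is an illustration only, not Netflix's sizing method: the function names and the 120-instance figure are made up for the example, and it assumes load is spread evenly across zones.

```python
import math


def instances_per_zone(peak_instances_needed: int, zones: int = 3, zones_lost: int = 1) -> int:
    """Instances to run in each zone so the surviving zones still cover peak load.

    Hypothetical helper: assumes traffic is balanced evenly across zones.
    """
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_instances_needed / surviving)


def utilization_when_healthy(peak_instances_needed: int, zones: int = 3, zones_lost: int = 1) -> float:
    """Fraction of provisioned capacity in use at peak while all zones are up."""
    per_zone = instances_per_zone(peak_instances_needed, zones, zones_lost)
    return peak_instances_needed / (per_zone * zones)


# Example with an assumed peak need of 120 instances.
print(instances_per_zone(120))                   # instances to run in each of 3 zones
print(round(utilization_when_healthy(120), 3))   # share of capacity used when healthy
```

Under these assumptions, surviving one zone outage out of three means running each zone at roughly two thirds of its capacity at peak.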
29. Observed Regional Failures
• Power Outages
– Platform survives any one zone outage
– Two recent zone outages, one OK, one triggered a bug
• Router Bug Takes Region Offline
– A few minutes of no network traffic, then recovered
– AWS has redesigned routes to be per zone
• Control Plane Overload Affects Entire Region
– Consequence of other outages
– We lose control of our infrastructure
30. Netflix Deployed on AWS
[Diagram: functions deployed on AWS, by area and year: Content (2009), Logs (2009), Play (2010), WWW (2010), API (2010), CS (2011). Labels shown: Content Management, Encoding, S3, EC2, EMR, Hive & Pig, Business Intelligence, DRM, CDN routing, Bookmarks, Logging, Sign-Up, Search, Movie Choosing, Ratings, Metadata, Device Config, TV Movie Choosing, Social Facebook, International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics. Scale labels: Terabytes, Petabytes, Terabits. Delivery path: CDNs, ISPs, Customers.]
32. Datacenter to Cloud Transition Goals
• Faster
– Lower latency than the equivalent datacenter web pages and API calls
– Measured as mean and 99th percentile
– For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
– Avoid needing any more datacenter capacity as subscriber count increases
– No central vertically scaled databases
– Leverage AWS elastic capacity effectively
• Available
– Substantially higher robustness and availability than datacenter services
– Leverage multiple AWS availability zones
– No scheduled down time, no central database schema to change
• Productive
– Optimize agility of a large development team with automation and tools
– Leave behind complex tangled datacenter code base (~8 year old architecture)
– Enforce clean layered interfaces and re-usable components
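The "mean and 99th percentile" latency goal can be illustrated with a short sketch. The sample values are invented for the example, and the nearest-rank percentile definition is an assumption; the slides do not say how the percentile was computed.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 12]

print(mean(latencies_ms))            # average hides how bad the slow request is
print(percentile(latencies_ms, 50))  # typical request
print(percentile(latencies_ms, 99))  # tail latency
```

Tracking both numbers matters because a single slow request barely moves the median but dominates the tail.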
33. Netflix Datacenter vs. Cloud Arch
Datacenter | Cloud
Central SQL Database | Distributed Key/Value NoSQL
Sticky In-Memory Session | Shared Memcached Session
Chatty Protocols | Latency Tolerant Protocols
Tangled Service Interfaces | Layered Service Interfaces
Instrumented Code | Instrumented Service Patterns
Fat Complex Objects | Lightweight Serializable Objects
Components as Jar Files | Components as Services
35. Chaos Monkey
• Computers (Datacenter or AWS) randomly die
– Fact of life, but too infrequent to test resiliency
• Test to make sure systems are resilient
– Allow any instance to fail without customer impact
• Chaos Monkey hours
– Monday-Friday 9am-3pm random instance kill
• Application configuration option
– Apps now have to opt-out from Chaos Monkey
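The selection rules above (weekday business hours, random instance, opt-out per application) can be sketched as follows. This is a hypothetical illustration, not the Chaos Monkey source: the function names and data shapes are assumptions, the 3pm boundary is treated as exclusive, and the code only picks a victim without terminating anything.

```python
import random
from datetime import datetime


def in_chaos_hours(now: datetime) -> bool:
    """True Monday to Friday, from 9am up to (not including) 3pm."""
    return now.weekday() < 5 and 9 <= now.hour < 15


def pick_victim(instances, opted_out_apps, now, rng=random):
    """Pick one random instance id to kill, or None if nothing is eligible.

    instances: list of (instance_id, app_name) pairs (assumed shape).
    opted_out_apps: set of app names that opted out.
    """
    if not in_chaos_hours(now):
        return None
    candidates = [iid for iid, app in instances if app not in opted_out_apps]
    if not candidates:
        return None
    return rng.choice(candidates)


fleet = [("i-1", "api"), ("i-2", "billing")]
# Wednesday morning: only the "api" instance is eligible here.
print(pick_victim(fleet, {"billing"}, datetime(2012, 9, 26, 10, 0)))
# Saturday: outside Chaos Monkey hours.
print(pick_victim(fleet, {"billing"}, datetime(2012, 9, 29, 10, 0)))
```

Restricting kills to working hours means engineers are at their desks when a weakness is exposed.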
36. Responsibility and Experience
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure it’s not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up
• Make configuration options dynamic
– You don’t want to push code to tweak an option
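"Keep timeouts short, fail fast" can be sketched with a short wait on a dependency call and a fallback value, which is one way to produce a degraded response. This is a minimal illustration using Python's standard library, not the Netflix platform code; `call_with_timeout`, the timeout values and the fallback strings are assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Small shared pool for dependency calls (size chosen arbitrarily for the sketch).
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback if it is slow or raises."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller stops waiting; a call already running continues in its thread.
        future.cancel()
        return fallback
    except Exception:
        return fallback


def slow_dependency():
    time.sleep(0.5)  # simulates a dependency that has become slow
    return "full response"


print(call_with_timeout(lambda: "full response", 0.2, "degraded response"))
print(call_with_timeout(slow_dependency, 0.05, "degraded response"))
```

Because the caller gives up after its own short budget, a slow dependency costs one short wait rather than a chain of long ones stacking up through every calling layer.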
38. Distributed Operational Model
• Developers
– Provision and run their own code in production
– Take turns to be on call if it breaks (PagerDuty)
– Configure autoscalers to handle capacity needs
• DevOps and PaaS (aka NoOps)
– DevOps is used to build and run the PaaS
– PaaS constrains Dev to use automation instead
– PaaS puts more responsibility on Dev, with tools
39. What’s Left for Corp IT?
• Corporate Security and Network Management
– Billing and remnants of streaming service back-ends in DC
• Running Netflix’ DVD Business
– Tens of Oracle instances
– Hundreds of MySQL instances
– Thousands of VMware VMs
– Zabbix, Cacti, Sumologic, Puppet, Chef
• Employee Productivity
– Building networks and WiFi
– SaaS OneLogin SSO Portal
– Evernote Premium, Safari Online Bookshelf, Dropbox for Teams
– Google Enterprise Apps, Workday HCM/Expense, Box.com
– Many more SaaS migrations coming…
[Figure: Corp WiFi Performance]
40. Netflix Organization
DevOps Org Reporting into Product Group, not ITops
[Org chart: Netflix Cloud Platform Team. Groups shown: Cloud Ops Reliability Engineering; Build Tools and Automation; Platform and Persistence Engineering; Cloud Performance; Cloud Solutions; Architecture. Responsibilities shown: Perforce, Jenkins, Artifactory, JIRA, Base AMI, Bakery, Netflix App Console, Platform jars, Key store, Zookeeper, Cassandra, Benchmarking, JVM GC Tuning, Wiresharking, Monitoring, Monkeys, Entrypoints, Alert Routing, Incident Lifecycle, PagerDuty, Security Arch, Efficiency, AWS VPC, Hyperguard, Future planning, Powerpoint :). Underlying layer: AWS API, AWS Instances.]
41. Netflix Open Source Strategy
• Steadily release PaaS Components git-by-git
• Source at github.com/netflix – we build from it…
• Intros and techniques at techblog.netflix.com
56. Runtime, Cont’d
[Diagram: runtime components grouped under "Calling other services", "Managing service" and "Resiliency aids". Components shown: Astyanax, Priam, Curator, Chaos Monkey, Latency Monkey, NIWS LB, Exhibitor, Janitor Monkey, Cass JMeter, Dependency Command, REST client, Explorers.]
57. Open Source Projects
Legend: Github / Techblog, Apache Contributions, Techblog Post, Coming Soon
• Priam – Cassandra as a Service
• Exhibitor – Zookeeper as a Service
• Servo and Autoscaling Scripts
• Astyanax – Cassandra client for Java
• Curator – Zookeeper Patterns
• Honu – Log4j streaming to Hadoop
• CassJMeter – Cassandra test suite
• EVCache – Memcached as a Service
• Circuit Breaker – Robust service pattern
• Cassandra – Multi-region EC2 datastore support
• Eureka / Discovery – Service Directory
• Asgard – AutoScaleGroup based AWS console
• Aegisthus – Hadoop ETL for Cassandra
• Archaius – Dynamics Properties Service
• Chaos Monkey – Robustness verification
• Explorers
• EntryPoints
• Latency Monkey – Server-side latency/error injection
• Governator – Library lifecycle and dependency injection
• Janitor Monkey
• Odin – Workflow orchestration
• REST Client + mid-tier LB
• Bakeries and AMI
• Async logging
• Configuration REST endpoints
• Build dynaslaves
59. Roadmap for 2012
• More resiliency and improved availability
• More automation, orchestration
• “Hardening” the platform, code clean-up
• Lower latency for web services and devices
• IPv6 – now running in prod, rollout in process
• More open sourced components
• See you at AWS Re:Invent in November…
60. Takeaway
Netflix has built and deployed a scalable global Platform as a Service.
Key components of the Netflix PaaS are being released as Open Source projects so you can build your own custom PaaS.
http://github.com/Netflix
http://techblog.netflix.com
http://slideshare.net/Netflix
http://www.linkedin.com/in/adriancockcroft
http://www.linkedin.com/in/ruslanmeshenberg
@adrianco @rusmeshenberg #netflixcloud