1. Performance Architecture for Cloud
March 7, 2011
Adrian Cockcroft
@adrianco #netflixcloud #ccevent
http://www.linkedin.com/in/adriancockcroft
acockcroft@netflix.com
2. Who, Why, What
Netflix in the Cloud
Cloud Performance Challenges
Performance Architecture and Tools
3. Netflix.com is now ~100% Cloud
See http://techblog.netflix.com
Detailed SlideShare presentation: Netflix on Cloud, http://slideshare.net/adrianco
We have 25 minutes, not half a day, to discuss everything!
4. A Nice Problem To Have…
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
37x growth, Jan 2010 to Jan 2011
5. Data Center
We stopped building our own datacenters
Capacity growth is accelerating and unpredictable
Product launch spikes: iPhone, Wii, PS3, Xbox
6. We want to use clouds, we don't have time to build them
Public cloud for agility and scale
AWS because they are big enough to allocate thousands of instances per hour for us
7. Netflix EC2 Instances per Account (summer 2010; production is up ~3x now…)
[Chart: "Many Thousands" of instances across the Content Encoding, Test and Production, and Log Analysis accounts, over "Several Months"]
8. AWS Performance? Mostly good, better than expected overall
• The Good
– Large EC2 instance types (esp. the m2 range)
– Internal disk performance
– Network performance within and between Availability Zones
– Robustness and scalability of S3, SQS
• The Bad
– Elastic Load Balancer has too many limitations
– SimpleDB needs a memcached front end; too many limitations at terabyte scale
• The Ugly
– EBS performance is slow and inconsistent; we avoid it
9. Learnings
• Datacenter-oriented tools don't work
– Ephemeral instances
– High rate of change
– Need too much hand-holding and manual setup
• Cloud tools don't scale for the enterprise
– Too many tools are "startup" oriented
– Built our own tools for 1000s of instances
– Drove vendors to be dynamic, scale, and add APIs
• "Fork-lifted" apps are fragile
– Too many datacenter-oriented assumptions
– We re-wrote our code base!
– (We re-write it continuously anyway)
11. Model Driven Architecture
• Datacenter practices
– Lots of unique hand-tweaked systems
– Hard to enforce patterns
• Model-driven cloud architecture
– Perforce/Ivy/Hudson based builds for everything
– Every production instance is a pre-baked AMI
– Every application is managed by an autoscaler
No exceptions: every change is a new AMI (a sketch of the pattern follows)
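As an illustration of that pattern, here is a minimal sketch using the boto Python library, the usual way to script AWS at the time: each change is baked into a fresh AMI, registered as a versioned launch configuration, and rolled out through the autoscaling group. The names, AMI ID, and sizes are hypothetical; this is not Netflix's actual tooling.

```python
# Minimal sketch of "every change is a new AMI" with boto (Python AWS
# library). All names and IDs below are hypothetical examples.
import boto.ec2.autoscale
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup

conn = boto.ec2.autoscale.connect_to_region('us-east-1')

# Each deployment registers a new launch configuration pointing at the
# newly baked AMI; nothing is hand-tweaked on a running instance.
lc = LaunchConfiguration(
    name='myapp-v42',            # hypothetical: versioned per AMI bake
    image_id='ami-12345678',     # hypothetical pre-baked AMI
    instance_type='m2.xlarge',
    security_groups=['myapp'])
conn.create_launch_configuration(lc)

# Every application is managed by an autoscaling group; rolling to the
# new AMI means pointing the group at the new launch configuration.
ag = AutoScalingGroup(
    group_name='myapp-asg',
    launch_config=lc,
    availability_zones=['us-east-1a', 'us-east-1b'],
    min_size=4, max_size=100)
conn.create_auto_scaling_group(ag)
```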
12. Model Driven Implications
• Automated "least privilege" security
– Tightly specified security groups
– Fine-grained IAM keys to access AWS resources (a policy sketch follows this list)
– Performance tools security and integration
• Model-driven performance monitoring
– Hundreds of instances appear in a few minutes…
– Tools have to "garbage collect" dead instances
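As an illustration of what a tightly scoped, per-application IAM policy might look like, here is a hedged sketch using boto; the user name, policy name, and bucket are hypothetical examples, not Netflix's actual policies.

```python
# Hedged sketch: attach a least-privilege policy to a per-application
# IAM user so instances can reach only the AWS resources they need.
# The user, policy name, and bucket below are hypothetical.
import json
import boto

iam = boto.connect_iam()

policy = {
    "Statement": [{
        "Effect": "Allow",
        # Only the S3 operations this app actually performs...
        "Action": ["s3:GetObject", "s3:PutObject"],
        # ...and only on its own bucket.
        "Resource": "arn:aws:s3:::myapp-data/*"
    }]
}

iam.put_user_policy('myapp-user', 'myapp-least-privilege',
                    json.dumps(policy))
```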
14. What is Capacity Planning?
• We care about
– CPU, memory, network, and disk resources consumed
– Application response times
• We need to know
– how much of each resource we are using now
– how much we will use in the future
– how much headroom we have to handle higher loads
• We want to understand
– how headroom varies
– how it relates to response times and throughput (see the sketch after this list)
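One way to connect headroom to response times and throughput is Little's law (concurrency = throughput × response time). The sketch below uses made-up numbers to show the arithmetic; it is an illustration, not Netflix's capacity model.

```python
# Sketch: relating throughput, response time and headroom via
# Little's law (N = X * R). All numbers are made up for illustration.

threads_per_instance = 200     # hypothetical max concurrent requests
throughput = 1500.0            # requests/sec currently served
response_time = 0.080          # seconds, mean response time

# Little's law: average concurrency needed to sustain this load.
concurrency = throughput * response_time      # = 120 busy threads

# Headroom: fraction of concurrency capacity still unused.
headroom = 1.0 - concurrency / threads_per_instance
print("concurrency=%.0f headroom=%.0f%%" % (concurrency, headroom * 100))

# As response time degrades under load, concurrency rises and headroom
# shrinks even if throughput stays flat, which is why headroom has to be
# tracked against both metrics.
```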
15. Capacity Planning in Clouds (a few things have changed…)
• Capacity is expensive
• Capacity takes time to buy and provision
• Capacity only increases; it can't easily be shrunk
• Capacity comes in big chunks, paid up front
• Planning errors can cause big problems
• Systems are clearly defined assets
• Systems can be instrumented in detail
• Depreciate assets over 3 years (reservations!)
16. OK, so just give me the data!
Throughput: not hard
Response time: mean + 2×SD? Percentiles? (a sketch follows)
Utilization….
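Why hedge between mean + 2×SD and percentiles? Latency distributions are typically long-tailed, so the two can disagree badly. A quick sketch with synthetic, hypothetical data:

```python
# Sketch: why "mean + 2*SD" and percentiles disagree on skewed latency
# data. The synthetic distribution stands in for real response times.
import numpy as np

rng = np.random.RandomState(42)
# Log-normal: a common shape for service latencies (mostly fast, long tail).
latencies = rng.lognormal(mean=np.log(0.05), sigma=1.0, size=100000)

mean_2sd = latencies.mean() + 2 * latencies.std()
p95, p99 = np.percentile(latencies, [95, 99])

print("mean+2SD = %.3fs" % mean_2sd)
print("p95      = %.3fs" % p95)
print("p99      = %.3fs" % p99)

# On long-tailed data, mean+2SD lands at some hard-to-interpret point
# between percentiles; the percentiles are what users actually feel.
```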
17. Utilization
"Utilization is virtually useless as a metric", CMG 2006 paper by Adrian Cockcroft
Virtualization is a DoS attack on capacity planning…
18. What would you say if you were asked:
Q: That system is slow, how busy is it?
A: I have no idea…
A: The graph in this tool looks about 50%
A: But the graph in this other tool is 65%
A: Amazon CloudWatch says 82%
A: Linux says us sy ni id wa st ☹
A: Why do you want to know?
A: I'm sorry, you don't understand your question….
19. What's the problem with Utilization?
• CPU capacity
– Varying capacity due to multi-tenancy
– Non-identical servers or CPUs (check /proc/cpuinfo)
– Non-linear capacity due to hyperthreading etc.
• Measurement errors
– Monitoring tools that ignore "stolen time" (all of them; see the /proc/stat sketch below)
– Mechanisms with built-in bias (clock tick counting)
– Platform- and release-specific changes in metrics
Every tool shows a different value for the same metric!
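Those cryptic columns (us sy ni id wa st) come from /proc/stat. Here is a minimal sketch of sampling them, including the steal ("st") column that datacenter-era tools drop; it assumes a kernel recent enough to report steal time.

```python
# Minimal sketch: sample the aggregate CPU counters in /proc/stat and
# report steal time, the column most datacenter-era tools ignore.
# First-line fields: cpu user nice system idle iowait irq softirq steal
import time

def cpu_times():
    with open('/proc/stat') as f:
        parts = f.readline().split()   # first line is the "cpu" summary
    return [int(x) for x in parts[1:9]]

before = cpu_times()
time.sleep(1)
after = cpu_times()
user, nice, system, idle, iowait, irq, softirq, steal = \
    [b - a for a, b in zip(before, after)]
total = float(user + nice + system + idle + iowait + irq + softirq + steal)

print("busy %.1f%%  idle %.1f%%  steal %.1f%%" % (
    100 * (total - idle - iowait - steal) / total,
    100 * (idle + iowait) / total,
    100 * steal / total))

# A tool that drops "steal" overstates the capacity actually available
# to this guest on a multi-tenant hypervisor.
```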
21. Monitoring Issues
• Problem
– Too many tools, each with a good reason to exist
– Hard to get an integrated view of a problem
– Too much manual work building dashboards
– Tools are not discoverable; views are not filtered
• Solution
– Get vendors to add deep-linking URLs and APIs
– An integration "portal" ties everything together
– Underlying dependency database
– Dynamic portal generation: relevant data, all tools
22. Data Sources
External Testing
• External URL availability and latency alerts and reports – Keynote
• Stress testing – SOASTA
Request Trace Logging
• Netflix REST calls – Chukwa to DataOven with GUID transaction identifier
Application Logging
• Generic HTTP – AppDynamics service tier aggregation, end-to-end tracking
• Tracers and counters – log4j, tracer central, Chukwa to DataOven
• Trackid and audit/debug logging – DataOven, AppDynamics GUID cross-reference
JMX Metrics
• Application-specific real time – Nimsoft, AppDynamics, Epic
• Service and SLA percentiles – Nimsoft, AppDynamics, Epic, logged to DataOven
Tomcat and Apache Logs
• Stdout logs – S3 – DataOven, Nimsoft alerting
• Standard-format access and error logs – S3 – DataOven, Nimsoft alerting
JVM
• Garbage collection – Nimsoft, AppDynamics
• Memory usage, call stacks, resource/call – AppDynamics
Linux
• System CPU/Net/RAM/Disk metrics – AppDynamics, Epic, Nimsoft alerting
• SNMP metrics – Epic; network flows – Fastip
AWS
• Load balancer traffic – Amazon CloudWatch; SimpleDB usage stats
• System configuration – CPU count/speed and RAM size, overall usage
24. Dashboards Architecture
• Integrated dashboard view
– Single web page containing content from many tools
– Filtered to highlight the most "interesting" data
• Relevance controller
– Drill in, add and remove content interactively
– Given an application, alert, or problem area, dynamically build a dashboard relevant to your role and needs (see the sketch after this list)
• Dependency and incident model
– Model driven: interrogates tools and AWS APIs
– Document store to capture dependency tree and states
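To make the relevance idea concrete, here is a hedged sketch: a plain Python dict stands in for the dependency document store, and hypothetical deep-link URL templates stand in for real tool APIs. Given an alerting application, it assembles one link per tool for every affected service.

```python
# Sketch of dynamic dashboard generation from a dependency model. The
# dict stands in for the document store; the deep-link URL templates
# are hypothetical stand-ins for real tool APIs.
DEPENDENCIES = {                  # service -> services it calls
    'api':            ['ratings', 'queue'],
    'ratings':        ['simpledb-cache'],
    'queue':          [],
    'simpledb-cache': [],
}

TOOL_LINKS = [                    # hypothetical deep-link templates
    'http://appdynamics.example/app/%s/latency',
    'http://epic.example/graph?service=%s&metric=cpu',
]

def affected(service):
    """The service plus everything it depends on, transitively."""
    seen, stack = set(), [service]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(DEPENDENCIES.get(s, []))
    return seen

def build_dashboard(alerting_service):
    """One deep link per (affected service, tool) pair."""
    return [tmpl % s for s in sorted(affected(alerting_service))
            for tmpl in TOOL_LINKS]

for url in build_dashboard('api'):
    print(url)
```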
26. AppDynamics: how to look deep inside your cloud applications
• Automatic monitoring
– Base AMI bakes in all monitoring tools
– Outbound calls only: no discovery/polling issues
– Inactive instances removed after a few days (a cleanup sketch follows this list)
• Incident alarms (deviation from baseline)
– Business transaction latency and error rate
– Alarm thresholds discover their own baseline
– Email contains a URL to the Incident Workbench UI
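A minimal sketch of the instance garbage collection idea, assuming boto for the EC2 inventory; the monitoring registry dict and the deregister() call are hypothetical placeholders for whatever API a given monitoring tool exposes.

```python
# Sketch: "garbage collect" monitored instances that no longer exist in
# EC2. Cloud registries fill with dead entries unless something prunes
# them. The registry dict and deregister() are hypothetical placeholders.
import time
import boto.ec2

GRACE_SECONDS = 3 * 24 * 3600     # "removed after a few days"

def deregister(instance_id):
    # Placeholder for the monitoring tool's removal API.
    print("deregistering %s" % instance_id)

def live_instance_ids():
    conn = boto.ec2.connect_to_region('us-east-1')
    return set(i.id for r in conn.get_all_instances()
               for i in r.instances if i.state == 'running')

def gc(monitored):
    """monitored: dict of instance_id -> last report time (unix secs)."""
    live = live_instance_ids()
    now = time.time()
    for instance_id, last_seen in list(monitored.items()):
        if instance_id not in live and now - last_seen > GRACE_SECONDS:
            deregister(instance_id)
            del monitored[instance_id]
```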
30. Point Finger and Assess Impact (an async S3 write was slow, no big deal)
31. Summary
• Performance of AWS systems isn't an issue
• Broken datacenter tools and metrics are the issue!
• Integrating too many different tools
– They are not designed to be integrated
– Did I mention that I hate Flash-based user interfaces?
– We have "persuaded" vendors to add APIs
• If you can't see deep inside your app, you're ☹
Questions? Job applications?
@adrianco #netflixcloud #ccevent