1. Performance Architecture for Cloud
March 7, 2011
Adrian Cockcroft
@adrianco #netflixcloud #ccevent
http://www.linkedin.com/in/adriancockcroft
acockcroft@netflix.com
2. Who, Why, What
Netflix in the Cloud
Cloud Performance Challenges
Performance Architecture and Tools
3. Netflix.com is now ~100% Cloud
See http://techblog.netflix.com
Detailed SlideShare presentation: Netflix on Cloud, http://slideshare.net/adrianco
We have 25 minutes, not half a day, to discuss everything!
4. A Nice Problem To Have…
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
37x growth, Jan 2010 to Jan 2011
5. Data Center
We stopped building our own datacenters
Capacity growth is accelerating and unpredictable
Product launch spikes: iPhone, Wii, PS3, Xbox
6. We want to use clouds, we don't have time to build them
Public cloud for agility and scale
AWS because they are big enough to allocate thousands of instances per hour for us
7. Netflix EC2 Instances per Account (summer 2010; production is up ~3x now…)
[Chart: "Many Thousands" of instances across the Content Encoding, Test and Production, and Log Analysis accounts, over "Several Months"]
8. AWS Performance? Mostly good, better than expected overall
• The Good
– Large EC2 instance types (esp. the m2 range)
– Internal disk performance
– Network performance within and between Availability Zones
– Robustness and scalability of S3, SQS
• The Bad
– Elastic Load Balancer has too many limitations
– SimpleDB needs a memcached front end; too many limitations at terabyte scale
• The Ugly
– EBS performance is slow and inconsistent; we avoid it
9. Learnings
• Datacenter-oriented tools don't work
– Ephemeral instances
– High rate of change
– Need too much hand-holding and manual setup
• Cloud tools don't scale for the enterprise
– Too many tools are "startup" oriented
– Built our own tools for 1000s of instances
– Drove vendors to be dynamic, scale, and add APIs
• "Fork-lifted" apps are fragile
– Too many datacenter-oriented assumptions
– We re-wrote our code base!
– (We re-write it continuously anyway)
11. Model Driven Architecture
• Datacenter practices
– Lots of unique hand-tweaked systems
– Hard to enforce patterns
• Model-driven cloud architecture
– Perforce/Ivy/Hudson based builds for everything
– Every production instance is a pre-baked AMI
– Every application is managed by an autoscaler
No exceptions: every change is a new AMI (a sketch of the pattern follows)
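As an illustration of that pattern, here is a minimal sketch using the boto Python library, the usual way to script AWS at the time: each change is baked into a fresh AMI, registered as a versioned launch configuration, and rolled out through the autoscaling group. The names, AMI ID, and sizes are hypothetical; this is not Netflix's actual tooling.

```python
# Minimal sketch of "every change is a new AMI" with boto (Python AWS
# library). All names and IDs below are hypothetical examples.
import boto.ec2.autoscale
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup

conn = boto.ec2.autoscale.connect_to_region('us-east-1')

# Each deployment registers a new launch configuration pointing at the
# newly baked AMI; nothing is hand-tweaked on a running instance.
lc = LaunchConfiguration(
    name='myapp-v42',            # hypothetical: versioned per AMI bake
    image_id='ami-12345678',     # hypothetical pre-baked AMI
    instance_type='m2.xlarge',
    security_groups=['myapp'])
conn.create_launch_configuration(lc)

# Every application is managed by an autoscaling group; rolling to the
# new AMI means pointing the group at the new launch configuration.
ag = AutoScalingGroup(
    group_name='myapp-asg',
    launch_config=lc,
    availability_zones=['us-east-1a', 'us-east-1b'],
    min_size=4, max_size=100)
conn.create_auto_scaling_group(ag)
```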
12. Model Driven Implications
• Automated "least privilege" security
– Tightly specified security groups
– Fine-grained IAM keys to access AWS resources (a policy sketch follows this list)
– Performance tools security and integration
• Model-driven performance monitoring
– Hundreds of instances appear in a few minutes…
– Tools have to "garbage collect" dead instances
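As an illustration of what a tightly scoped, per-application IAM policy might look like, here is a hedged sketch using boto; the user name, policy name, and bucket are hypothetical examples, not Netflix's actual policies.

```python
# Hedged sketch: attach a least-privilege policy to a per-application
# IAM user so instances can reach only the AWS resources they need.
# The user, policy name, and bucket below are hypothetical.
import json
import boto

iam = boto.connect_iam()

policy = {
    "Statement": [{
        "Effect": "Allow",
        # Only the S3 operations this app actually performs...
        "Action": ["s3:GetObject", "s3:PutObject"],
        # ...and only on its own bucket.
        "Resource": "arn:aws:s3:::myapp-data/*"
    }]
}

iam.put_user_policy('myapp-user', 'myapp-least-privilege',
                    json.dumps(policy))
```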
14. What is Capacity Planning?
• We care about
– CPU, memory, network, and disk resources consumed
– Application response times
• We need to know
– how much of each resource we are using now
– how much we will use in the future
– how much headroom we have to handle higher loads
• We want to understand
– how headroom varies
– how it relates to response times and throughput (see the sketch after this list)
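One way to connect headroom to response times and throughput is Little's law (concurrency = throughput × response time). The sketch below uses made-up numbers to show the arithmetic; it is an illustration, not Netflix's capacity model.

```python
# Sketch: relating throughput, response time and headroom via
# Little's law (N = X * R). All numbers are made up for illustration.

threads_per_instance = 200     # hypothetical max concurrent requests
throughput = 1500.0            # requests/sec currently served
response_time = 0.080          # seconds, mean response time

# Little's law: average concurrency needed to sustain this load.
concurrency = throughput * response_time      # = 120 busy threads

# Headroom: fraction of concurrency capacity still unused.
headroom = 1.0 - concurrency / threads_per_instance
print("concurrency=%.0f headroom=%.0f%%" % (concurrency, headroom * 100))

# As response time degrades under load, concurrency rises and headroom
# shrinks even if throughput stays flat, which is why headroom has to be
# tracked against both metrics.
```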
15. Capacity Planning in Clouds (a few things have changed…)
• Capacity is expensive
• Capacity takes time to buy and provision
• Capacity only increases; it can't easily be shrunk
• Capacity comes in big chunks, paid up front
• Planning errors can cause big problems
• Systems are clearly defined assets
• Systems can be instrumented in detail
• Depreciate assets over 3 years (reservations!)
16. OK, so just give me the data!
Throughput: not hard
Response time: mean + 2×SD? Percentiles? (a sketch follows)
Utilization….
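Why hedge between mean + 2×SD and percentiles? Latency distributions are typically long-tailed, so the two can disagree badly. A quick sketch with synthetic, hypothetical data:

```python
# Sketch: why "mean + 2*SD" and percentiles disagree on skewed latency
# data. The synthetic distribution stands in for real response times.
import numpy as np

rng = np.random.RandomState(42)
# Log-normal: a common shape for service latencies (mostly fast, long tail).
latencies = rng.lognormal(mean=np.log(0.05), sigma=1.0, size=100000)

mean_2sd = latencies.mean() + 2 * latencies.std()
p95, p99 = np.percentile(latencies, [95, 99])

print("mean+2SD = %.3fs" % mean_2sd)
print("p95      = %.3fs" % p95)
print("p99      = %.3fs" % p99)

# On long-tailed data, mean+2SD lands at some hard-to-interpret point
# between percentiles; the percentiles are what users actually feel.
```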
17. Utilization
"Utilization is virtually useless as a metric", CMG 2006 paper by Adrian Cockcroft
Virtualization is a DoS attack on capacity planning…
18. What would you say if you were asked:
Q: That system is slow, how busy is it?
A: I have no idea…
A: The graph in this tool looks about 50%
A: But the graph in this other tool is 65%
A: Amazon CloudWatch says 82%
A: Linux says us sy ni id wa st ☹
A: Why do you want to know?
A: I'm sorry, you don't understand your question….
19. What's the problem with Utilization?
• CPU capacity
– Varying capacity due to multi-tenancy
– Non-identical servers or CPUs (check /proc/cpuinfo)
– Non-linear capacity due to hyperthreading etc.
• Measurement errors
– Monitoring tools that ignore "stolen time" (all of them; see the /proc/stat sketch below)
– Mechanisms with built-in bias (clock tick counting)
– Platform- and release-specific changes in metrics
Every tool shows a different value for the same metric!
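Those cryptic columns (us sy ni id wa st) come from /proc/stat. Here is a minimal sketch of sampling them, including the steal ("st") column that datacenter-era tools drop; it assumes a kernel recent enough to report steal time.

```python
# Minimal sketch: sample the aggregate CPU counters in /proc/stat and
# report steal time, the column most datacenter-era tools ignore.
# First-line fields: cpu user nice system idle iowait irq softirq steal
import time

def cpu_times():
    with open('/proc/stat') as f:
        parts = f.readline().split()   # first line is the "cpu" summary
    return [int(x) for x in parts[1:9]]

before = cpu_times()
time.sleep(1)
after = cpu_times()
user, nice, system, idle, iowait, irq, softirq, steal = \
    [b - a for a, b in zip(before, after)]
total = float(user + nice + system + idle + iowait + irq + softirq + steal)

print("busy %.1f%%  idle %.1f%%  steal %.1f%%" % (
    100 * (total - idle - iowait - steal) / total,
    100 * (idle + iowait) / total,
    100 * steal / total))

# A tool that drops "steal" overstates the capacity actually available
# to this guest on a multi-tenant hypervisor.
```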
21. Monitoring Issues
• Problem
– Too many tools, each with a good reason to exist
– Hard to get an integrated view of a problem
– Too much manual work building dashboards
– Tools are not discoverable; views are not filtered
• Solution
– Get vendors to add deep-linking URLs and APIs
– An integration "portal" ties everything together
– Underlying dependency database
– Dynamic portal generation: relevant data, all tools
22. Data Sources
External Testing
• External URL availability and latency alerts and reports – Keynote
• Stress testing – SOASTA
Request Trace Logging
• Netflix REST calls – Chukwa to DataOven with GUID transaction identifier
Application Logging
• Generic HTTP – AppDynamics service tier aggregation, end-to-end tracking
• Tracers and counters – log4j, tracer central, Chukwa to DataOven
• Trackid and audit/debug logging – DataOven, AppDynamics GUID cross-reference
JMX Metrics
• Application-specific real time – Nimsoft, AppDynamics, Epic
• Service and SLA percentiles – Nimsoft, AppDynamics, Epic, logged to DataOven
Tomcat and Apache Logs
• Stdout logs – S3 – DataOven, Nimsoft alerting
• Standard-format access and error logs – S3 – DataOven, Nimsoft alerting
JVM
• Garbage collection – Nimsoft, AppDynamics
• Memory usage, call stacks, resource/call – AppDynamics
Linux
• System CPU/Net/RAM/Disk metrics – AppDynamics, Epic, Nimsoft alerting
• SNMP metrics – Epic; network flows – Fastip
AWS
• Load balancer traffic – Amazon CloudWatch; SimpleDB usage stats
• System configuration – CPU count/speed and RAM size, overall usage
24. Dashboards Architecture
• Integrated dashboard view
– Single web page containing content from many tools
– Filtered to highlight the most "interesting" data
• Relevance controller
– Drill in, add and remove content interactively
– Given an application, alert, or problem area, dynamically build a dashboard relevant to your role and needs (see the sketch after this list)
• Dependency and incident model
– Model driven: interrogates tools and AWS APIs
– Document store to capture dependency tree and states
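To make the relevance idea concrete, here is a hedged sketch: a plain Python dict stands in for the dependency document store, and hypothetical deep-link URL templates stand in for real tool APIs. Given an alerting application, it assembles one link per tool for every affected service.

```python
# Sketch of dynamic dashboard generation from a dependency model. The
# dict stands in for the document store; the deep-link URL templates
# are hypothetical stand-ins for real tool APIs.
DEPENDENCIES = {                  # service -> services it calls
    'api':            ['ratings', 'queue'],
    'ratings':        ['simpledb-cache'],
    'queue':          [],
    'simpledb-cache': [],
}

TOOL_LINKS = [                    # hypothetical deep-link templates
    'http://appdynamics.example/app/%s/latency',
    'http://epic.example/graph?service=%s&metric=cpu',
]

def affected(service):
    """The service plus everything it depends on, transitively."""
    seen, stack = set(), [service]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(DEPENDENCIES.get(s, []))
    return seen

def build_dashboard(alerting_service):
    """One deep link per (affected service, tool) pair."""
    return [tmpl % s for s in sorted(affected(alerting_service))
            for tmpl in TOOL_LINKS]

for url in build_dashboard('api'):
    print(url)
```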
26. AppDynamics: how to look deep inside your cloud applications
• Automatic monitoring
– Base AMI bakes in all monitoring tools
– Outbound calls only: no discovery/polling issues
– Inactive instances removed after a few days (a cleanup sketch follows this list)
• Incident alarms (deviation from baseline)
– Business transaction latency and error rate
– Alarm thresholds discover their own baseline
– Email contains a URL to the Incident Workbench UI
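A minimal sketch of the instance garbage collection idea, assuming boto for the EC2 inventory; the monitoring registry dict and the deregister() call are hypothetical placeholders for whatever API a given monitoring tool exposes.

```python
# Sketch: "garbage collect" monitored instances that no longer exist in
# EC2. Cloud registries fill with dead entries unless something prunes
# them. The registry dict and deregister() are hypothetical placeholders.
import time
import boto.ec2

GRACE_SECONDS = 3 * 24 * 3600     # "removed after a few days"

def deregister(instance_id):
    # Placeholder for the monitoring tool's removal API.
    print("deregistering %s" % instance_id)

def live_instance_ids():
    conn = boto.ec2.connect_to_region('us-east-1')
    return set(i.id for r in conn.get_all_instances()
               for i in r.instances if i.state == 'running')

def gc(monitored):
    """monitored: dict of instance_id -> last report time (unix secs)."""
    live = live_instance_ids()
    now = time.time()
    for instance_id, last_seen in list(monitored.items()):
        if instance_id not in live and now - last_seen > GRACE_SECONDS:
            deregister(instance_id)
            del monitored[instance_id]
```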
30. Point Finger and Assess Impact (an async S3 write was slow, no big deal)
31. Summary
• Performance of AWS systems isn't an issue
• Broken datacenter tools and metrics are the issue!
• Integrating too many different tools
– They are not designed to be integrated
– Did I mention that I hate Flash-based user interfaces?
– We have "persuaded" vendors to add APIs
• If you can't see deep inside your app, you're ☹
Questions? Job applications?
@adrianco #netflixcloud #ccevent