(Presented by Stackdriver) Key decisions related to architecture, tools, processes, and even team composition can have a dramatic effect on the human effort required to operate distributed applications on AWS. If you make the wrong decisions in these areas, you spend your days, nights, weekends, and vacations dealing with issues and noise. If you make the right decisions, you and your team can focus on building customer value, and your time away from work is spent… not working.
Stackdriver and Smugmug describe the seven most important practices that world-class operations teams employ to minimize operational overhead, highlighting real-world examples to illustrate the importance of each.
4. Stuff we have in common
✓ Years of AWS experience
✓ Success and failure with many lessons learned
✓ Both using Stackdriver for infrastructure monitoring
✓ Lots of data
✓ Philosophically aligned on how to run on AWS
‣ Superheroes
Friday, November 15, 13
6. CLOUD HYPE
[Figure: hype curve plotted against TIME. Labels: Peak of Expectations, DevOps Nirvana, Operational Enlightenment, Transition to Distributed Systems, Lure of Elasticity]
14. 2: Choose the right instance type
15. Factors to Consider
CPU
Network
Disk I/O
Workload
Cost
Tools to help you decide
vmstat
iostat
sar
R
Excel
Stackdriver + agent
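As a rough illustration of how output from a tool like vmstat can feed the instance-type decision, the sketch below averages a few samples and guesses which factor (CPU or disk I/O) is the constraint. The sample numbers are made up, the column layout assumes the standard Linux procps vmstat format, and the helper names and thresholds are hypothetical, not anything from the talk.

```python
from statistics import mean

# Illustrative vmstat output: made-up numbers in the standard Linux procps layout.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  10240 402112    0    0     5    12  210  340 20  5 70  5  0
 3  0      0 801200  10240 402300    0    0     0   480  890 1500 60 10 10 20  0
 2  1      0 799800  10240 402400    0    0     0   900  950 1700 50 10 10 30  0
"""


def summarize_vmstat(text):
    """Average the non-idle CPU, I/O wait, and run-queue columns of vmstat output."""
    lines = [ln.split() for ln in text.strip().splitlines()]
    # The column-name row is the one whose first field is "r" (run queue).
    header_idx = next(i for i, cols in enumerate(lines) if cols and cols[0] == "r")
    header = lines[header_idx]
    rows = [dict(zip(header, map(int, cols))) for cols in lines[header_idx + 1:]]
    return {
        "avg_non_idle_pct": mean(100 - r["id"] for r in rows),
        "avg_iowait_pct": mean(r["wa"] for r in rows),
        "avg_run_queue": mean(r["r"] for r in rows),
    }


def likely_bottleneck(summary, iowait_threshold=15, cpu_threshold=80):
    """Very rough guess; both thresholds are assumptions to tune per workload."""
    if summary["avg_iowait_pct"] > iowait_threshold:
        return "disk I/O"
    if summary["avg_non_idle_pct"] - summary["avg_iowait_pct"] > cpu_threshold:
        return "CPU"
    return "neither"


summary = summarize_vmstat(SAMPLE_VMSTAT)
print(summary, likely_bottleneck(summary))
```

A high I/O wait would point toward an instance type with better disk throughput, while sustained CPU saturation would point toward more or faster cores; cost and network still have to be weighed separately.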
24. Simple rules for confidently waking up ops@ at 3am
1. Something had better be broken (or close to it) for the customer
2. The broken thing should be as obvious as possible
3. It should be clear what action I can take to make the situation better
Customers seeing huge spike in 5XX errors
Code deploy to web cluster one hour ago
Revert!
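One way to picture an alert that follows the three rules is a page that is only built when customers are affected, states the symptom plainly, and attaches the most likely action. The sketch below is a minimal illustration; the `Deploy` type, the `build_page` function, the spike factor, and the deploy window are all hypothetical choices, not part of any tool named in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deploy:
    target: str            # e.g. "web cluster"
    finished_at: datetime


def build_page(error_rate: float,
               baseline_rate: float,
               now: datetime,
               last_deploy: Optional[Deploy],
               spike_factor: float = 5.0,
               deploy_window: timedelta = timedelta(hours=2)) -> Optional[str]:
    """Return the text of a page, or None if nobody should be woken up."""
    # Rule 1: only page when customers are actually seeing errors well above normal.
    if error_rate < baseline_rate * spike_factor:
        return None
    # Rule 2: state the broken thing as plainly as possible.
    lines = [f"Customers seeing spike in 5XX errors: "
             f"{error_rate:.1%} (baseline {baseline_rate:.1%})"]
    # Rule 3: attach the action the on-call person can take.
    if last_deploy is not None and now - last_deploy.finished_at <= deploy_window:
        minutes = int((now - last_deploy.finished_at).total_seconds() // 60)
        lines.append(f"Code deploy to {last_deploy.target} {minutes} minutes ago")
        lines.append("Suggested action: revert")
    else:
        lines.append("No recent deploy found; suggested action: investigate")
    return "\n".join(lines)


now = datetime(2013, 11, 15, 3, 0)
deploy = Deploy("web cluster", now - timedelta(hours=1))
print(build_page(0.12, 0.01, now, deploy))
```

Correlating the symptom with the most recent change is what turns a noisy alert into one with an obvious next step.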
27. Cloud Integration
[Architecture diagram. Recoverable labels: System Agents, Workers, Custom Metrics, Agents, API; Data Ingestion; DNS; Elastic Load Balancing w/ haproxy (Load Balancing 1, 2, ... n); Cell-1, Cell-2, ... Cell-n, each with GW and MQ; Online, Analysis, Archival, Serving; S3, Cassandra; Anomaly, Health, Batch, Aggregation, Correlation, Trending; Web/Mobile, UI]
Localized failure
Identical dimensions
Easy to reason
Network partitions ok
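The properties listed for cells can be illustrated with a small routing sketch. The slides do not say how customers are assigned to cells, so the stable-hash assignment below is purely an assumption, and `cell_for` and `affected_customers` are hypothetical names used only for illustration.

```python
import hashlib


def cell_for(customer_id: str, cell_count: int) -> int:
    """Map a customer to one of cell_count identical cells (numbered from 1).

    Uses SHA-256 rather than the built-in hash() so the mapping is the same
    on every host and across restarts.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count + 1


def affected_customers(customers, failed_cell: int, cell_count: int):
    """Localized failure: only customers mapped to the failed cell are affected."""
    return [c for c in customers if cell_for(c, cell_count) == failed_cell]


customers = [f"customer-{i}" for i in range(12)]
for cell in (1, 2, 3):
    print(cell, affected_customers(customers, failed_cell=cell, cell_count=3))
```

Because every cell has the same shape, reasoning about one cell carries over to the rest. Note that plain modulo hashing reshuffles most customers when the cell count changes; a lookup table or consistent hashing would avoid that.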
31. You cannot pre-test every change
So you need to be really good at detecting issues, very quickly
32. Monitoring is a key part of quality assurance for dynamic systems
But monitoring tools need to be intelligent
Distributed sensors
Cloud-aware
Anomaly detection
Synthetic transactions
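To make the anomaly-detection item concrete, here is a minimal sketch that flags a sample when it sits far from the mean of the preceding window, measured in standard deviations. This is a generic rolling z-score, not a description of how Stackdriver detects anomalies; the function name, window size, and threshold are assumptions.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return the indexes of samples that deviate sharply from the previous window."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        # Only judge a sample once a full window of earlier samples exists.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A flat window (sigma == 0) gives no scale to compare against, so skip it.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


latencies_ms = [10, 11, 10, 12, 11, 10, 50, 11, 10]
print(detect_anomalies(latencies_ms))  # [6]
```

The flagged sample still enters the window afterwards, which inflates the deviation and can mask follow-up anomalies for a few samples; a production detector would usually handle that, along with seasonality.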
33. Training
Recommended reading:
• Systemantics (aka The Systems Bible)
• High Scalability (http://highscalability.com/)
• James Hamilton’s blog (http://perspectives.mvdirona.com/)
34. Visit us at http://www.smugmug.com/
35. Visit us at booth 315!