6. “The SRE team is responsible for the availability, latency,
performance, efficiency, change management, monitoring, emergency
response and capacity planning”
8. Different Adoptions
Google: Development Team 1, Development Team 2, Development Team 3, SRE Team
Netflix: Cross Functional Team 1, Cross Functional Team 2, Cross Functional Team 3
Same High-Velocity, High-Quality Results
27. Using the tiered metrics to set SLOs
“More than 10% of 100 largest customers are experiencing greater than 0.5% packet loss”
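A minimal sketch of how the quoted condition could be evaluated, assuming packet loss is available per customer as a fraction between 0 and 1. The function name, parameter names, and input shape are hypothetical and not taken from the talk; only the 10% and 0.5% thresholds come from the slide.

```python
from typing import Mapping


def slo_breached(
    packet_loss_by_customer: Mapping[str, float],
    loss_threshold: float = 0.005,    # 0.5% packet loss, from the slide
    customer_fraction: float = 0.10,  # 10% of customers, from the slide
) -> bool:
    """Return True when more than `customer_fraction` of the given
    customers see packet loss greater than `loss_threshold`.

    `packet_loss_by_customer` is assumed to hold the 100 largest
    customers, keyed by a customer identifier (hypothetical input).
    """
    total = len(packet_loss_by_customer)
    if total == 0:
        return False  # nothing to evaluate
    affected = sum(
        1 for loss in packet_loss_by_customer.values() if loss > loss_threshold
    )
    return affected / total > customer_fraction
```

With 100 customers, the condition triggers once 11 or more of them exceed 0.5% loss; exactly 10 affected customers is not "more than 10%".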
28. Incident Response Maturity
“Aware” that incidents are normal
Having well-documented processes and procedures, with learning
inputs to the process
29. Runbooks
Problem:
We realized that organic growth of content was getting out of hand and critical ops content was hard to
find.
Solution:
Create runbooks and place them under source control
33. Smarter Paging
1. Compare text
2. Classify the failures
3. Invoke further action
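The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline described in the talk: the failure signatures, action names, and the 0.6 similarity threshold are all hypothetical, and text comparison is done with Python's standard `difflib.SequenceMatcher`, which may differ from whatever comparison method the team used.

```python
from difflib import SequenceMatcher

# Hypothetical known failure signatures (not from the talk).
KNOWN_FAILURES = {
    "disk_full": "No space left on device",
    "conn_refused": "Connection refused by upstream host",
}

# Hypothetical follow-up actions per failure class.
ACTIONS = {
    "disk_full": "run cleanup job",
    "conn_refused": "restart upstream service",
}


def similarity(a: str, b: str) -> float:
    """Step 1: compare two texts; returns a ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(alert_text: str, threshold: float = 0.6) -> str:
    """Step 2: classify the failure by its closest known signature."""
    best_label, best_score = "unknown", 0.0
    for label, signature in KNOWN_FAILURES.items():
        score = similarity(alert_text, signature)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else "unknown"


def handle_alert(alert_text: str) -> str:
    """Step 3: invoke further action; unknown failures page a human."""
    return ACTIONS.get(classify(alert_text), "page on-call engineer")
```

Alerts that match a known signature closely enough are routed to an automated action, and anything unrecognized still falls back to paging the on-call engineer.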
34. Takeaways
• One size SRE does not fit all
• Keep identifying toil and automate it away!
• Reliability still number one throughout
• Patience: culture shifts take time
• Blueprint set, moving our process on to other projects