This document provides an overview of site reliability engineering (SRE) presented by Jamie Donoghue from VisionLed Consulting. SRE aims to implement a shared ownership model between operations and development teams through practices like establishing service level objectives (SLOs) and running blameless post-mortems after incidents. Key aspects of SRE covered include automating operational work to reduce toil, measuring toil and reliability metrics, and building SRE capabilities over beginner, intermediate, and advanced levels. The presentation concludes with a question and answer section.
NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method for Digital Infra/Ops Management
1. Site Reliability Engineering
The Modern Approach to Digital Infra/Ops Management
#ISSLearningDay
Mr. Jamie Donoghue, VisionLed Consulting
2 August 2019
3. Title
• What is SRE?
➢ Shared Ownership
➢ SLOs and Blameless Post-Mortems
➢ Reduce Cost of Failure
➢ Automate This Year’s Job Away
➢ Measure Toil & Reliability
• Building your SRE capability
• Q&A
5. What is Site Reliability Engineering?
6. What SREs Do
• Champion reliability practices
• Guide designs and processes with an eye toward
resilience and low toil
• Reduce technical complexity and sprawl
(inefficiency)
• Drive the usage of common tools and components
(standardisation)
• Use software to improve resilience and automate
operations
7. What is Toil?
• In SRE, we want to spend time (50%) on long-term
engineering project work instead of operational work.
- because operational work may be misinterpreted, we use a
specific word: toil
- SREs should spend less than X% of their time on toil and the rest
on coding (projects)
- Excess toil is redirected to the development team
• The work of reducing toil and scaling up services is the
'Engineering' in Site Reliability Engineering
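One way to measure toil is to log hours by kind of work and compute the share that counts as toil. The sketch below is illustrative only: the function name `toil_fraction`, the category names, and the hours are assumptions, not something from the slides.

```python
def toil_fraction(hours_by_category, toil_categories):
    """Return the share of logged hours spent on toil, as a fraction in 0..1.

    hours_by_category: dict mapping a work category to hours logged.
    toil_categories: set of category names that count as toil (hypothetical).
    """
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(h for c, h in hours_by_category.items() if c in toil_categories)
    return toil / total


# Hypothetical week for one SRE: 22 of 40 hours were toil.
week = {"tickets": 12, "manual deploys": 10, "project coding": 18}
share = toil_fraction(week, {"tickets", "manual deploys"})
print(f"toil share: {share:.0%}")
```

Tracking this number per person or per team over time shows whether automation work is actually reducing toil.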
Toil Characteristics
Manual
Repetitive
Automatable
Reactive
No Enduring Value
Scales Linearly With Growth
11. Share Ownership the Google Way
https://web.devopstopologies.com/
12. Share Ownership the Acquia Way
• We embed SREs within Product Teams, rather
than build teams that run Products on behalf of
Developers
• The entire Product Team (incl. SRE) is expected to
‘own the Product’
• The SRE identifies risks to SLOs as part of their
day-to-day activities and brings improvement
opportunities directly to the Product Owner for
prioritisation in the team’s backlog.
https://www.acquia.com/
15. SLA, SLO, SLI
• SLA: If 99% of your system requests aren’t
completed in 5ms, you get a refund.
• SLO: 99.5% of requests will
be completed in 5ms.
• SLI: Latency of a request
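The relationship between the three terms can be sketched in code: the SLI is what you measure, the SLO is the internal target, and the SLA is the contractual threshold. The 5 ms, 99.5%, and 99% values come from the slide; the function names and the sample data are assumptions made for illustration.

```python
SLO_TARGET = 0.995  # internal objective from the slide
SLA_TARGET = 0.99   # contractual threshold from the slide


def latency_sli(latencies_ms, threshold_ms=5.0):
    """SLI: fraction of requests completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one request")
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms)


def assess(latencies_ms):
    """Compare the measured SLI against the SLO and the SLA."""
    sli = latency_sli(latencies_ms)
    return {"sli": sli, "slo_met": sli >= SLO_TARGET, "sla_met": sli >= SLA_TARGET}


# Hypothetical sample: 992 fast requests and 8 slow ones.
sample = [1.0] * 992 + [9.0] * 8
print(assess(sample))
```

In this sample the SLO is missed while the SLA still holds, which is the point of setting the SLO tighter than the SLA: the team gets a warning before refunds are owed.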
16. Service Level Objectives
• Once you've passed the happiness
test, increasing reliability will have
diminishing returns.
• In addition, higher reliability costs
you more to provide, reducing
your ability to make changes and
release new features.
17. Error Budget
• If your agreed reliability target per month is 99.9%
• Your agreed unreliability is 0.1%
• This agreed unreliability is your error budget
• An error budget of 0.1% = 43.8 minutes of permissible
unplanned downtime per month
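The 43.8-minute figure can be reproduced with a few lines of arithmetic. This sketch assumes an average month of 365.25 / 12 days (about 43,830 minutes), which is the assumption that yields the slide's number; the function name is hypothetical.

```python
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # average month: 43,830 minutes


def error_budget_minutes(slo_target, period_minutes=MINUTES_PER_MONTH):
    """Permissible unplanned downtime for a given reliability target.

    slo_target is a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return (1.0 - slo_target) * period_minutes


print(round(error_budget_minutes(0.999), 1))
```

The same function shows how quickly the budget shrinks as the target rises: each extra nine divides the permissible downtime by ten.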
18. Using the Error Budget
Imagine your service has gone down, and
you have a permissible error budget of
43.8 minutes
• What activities to detect and manually
recover will occur within this time period?
• Do you believe you can recover within the
error budget?
• 5 Minutes to discuss
• 5 Minutes to share
19. Using the Error Budget
No cause for concern
Definite cause for concern
20. Blameless Post-Mortems
• Do a Post Mortem for every incident
• Post-Mortems are blameless
➢ i.e. they focus on process and technology, not people
21. Blameless Post-Mortems Agenda
• Document timeline of the Incident
• With the team determine
• What went well
• What didn’t go well (process failure, technical root cause)
• What was lucky (or circumstantial)
• File an action item for each item that didn’t go well, or was circumstantial,
including:
• Clear requirements and acceptance criteria
• Level of Effort and Prioritisation
• Openly share the post-mortem with the rest of the organisation
• Review post-mortem periodically
30. SRE Won’t Work Without…
• Authority to stop releases when the
Error Budget has been exhausted
• Authority to overflow operational
work to the Dev Team when
operational load is > 50%
• These must be authorised in a policy
(with CIO/CTO endorsement)
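The two authorities above can be expressed as simple rules. The sketch below is one possible encoding, not an actual policy: the `ServiceStatus` type, the field names, and the action strings are all assumptions; only the "budget exhausted" and "> 50%" conditions come from the slide.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    budget_minutes: float    # error budget for the period
    downtime_minutes: float  # unplanned downtime consumed so far
    operational_load: float  # fraction of SRE time spent on operational work


def policy_actions(status, max_operational_load=0.5):
    """Return the actions the policy authorises for the given status."""
    actions = []
    if status.downtime_minutes >= status.budget_minutes:
        actions.append("stop releases")
    if status.operational_load > max_operational_load:
        actions.append("overflow operational work to dev team")
    return actions


# Hypothetical month: budget exhausted and operational load at 60%.
print(policy_actions(ServiceStatus(43.8, 50.0, 0.6)))
```

Writing the conditions down this explicitly is what makes them enforceable: the decision follows from the numbers rather than from a negotiation during the incident.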
31. Beginner SRE Teams
• Staffing and hiring plan (with funding)
• Policy for:
• Launch readiness
• On-call rotation
• Balance of operational work/projects
• Post-mortems
• Overflow of operational work to development
• Agreed SLA, SLO, SLI with all relevant parties (end-to-
end)
• Documentation for release processes, service setup,
teardown, rollback and failover
• Runbooks for routine operational tasks
https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey
32. Intermediate SRE Teams
• Periodic reviews of SRE project work and impact with
business leaders
• Periodic reviews of SLIs and SLOs with business leaders
• Rollback mechanism for canary releases (ideally automated)
• Periodic testing of incident management, using a combination
of role-playing with some automation in place
• There’s an escalation policy tied to SLO violations
• Teams measure demand vs. capacity and use active
forecasting to determine when demand might exceed
capacity.
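As a rough illustration of demand-versus-capacity forecasting, the sketch below fits a straight line to past demand and estimates how many periods remain before the trend reaches capacity. A linear trend is an assumption made here for simplicity; the slides do not prescribe a forecasting method, and the function name and data are hypothetical.

```python
def periods_until_capacity(demand_history, capacity):
    """Estimate periods until a linear demand trend reaches capacity.

    demand_history: demand per period, oldest first (at least two values).
    Returns None when the fitted trend is flat or falling.
    """
    n = len(demand_history)
    if n < 2:
        raise ValueError("need at least two observations")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(demand_history) / n
    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, demand_history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    latest_fit = intercept + slope * (n - 1)
    return max(0.0, (capacity - latest_fit) / slope)


# Hypothetical monthly peak requests per second against a capacity of 160.
print(periods_until_capacity([100, 110, 120, 130], capacity=160))
```

Real demand is rarely this linear, so a result like this is better treated as a trigger for a planning conversation than as a precise date.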
https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey
33. Advanced SRE Teams
• Project work can be and is often executed
horizontally, positively impacting many services at
once, as opposed to linearly or, worse, per service
• Most service alerts are based on SLO burn rate
• Automated disaster recovery testing is in
place and positive impact can be measured
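Burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO period. The sketch below assumes that definition; the function names and the alert threshold are hypothetical and would be chosen per service.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the error rate the SLO permits."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    error_rate = bad_events / total_events
    return error_rate / (1.0 - slo_target)


def should_alert(bad_events, total_events, slo_target, threshold):
    """Alert when the budget is being spent at or above the threshold rate."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold


# Hypothetical window: 20 failed requests out of 10,000 against a 99.9% SLO.
print(round(burn_rate(20, 10_000, 0.999), 2))
print(should_alert(20, 10_000, 0.999, threshold=10.0))
```

Alerting on burn rate rather than on raw error counts ties each page to the error budget, so an alert means the SLO is genuinely at risk.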
https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey