NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method for Digital Infra/Ops Management

Site Reliability Engineering
The Modern Approach to Digital Infra/Ops Management
#ISSLearningDay
Mr. Jamie Donoghue, VisionLed Consulting
2 August 2019

Jamie Donoghue
Director and Principal Consultant
MBA, Business Agility Coach, Lean Change Facilitator, DevOps Leader,
Strategic Product Manager, CISA, CGEIT, CISM, CRISC, COBIT,
P3O, MSP, PRINCE2, PMP, ITIL Expert, ScrumMaster,
LeSS Practitioner, Lean IT, Lean Kanban, 6σ
Jamie, a dual citizen of the UK and NZ, has spent over 20 years improving IT
Services for public and private organisations in the UK, Australia and South
East Asia.
As an architect, consultant and coach, he specialises in creating high-
performance, cross-functional teams that are competent, accountable and
inspired.
jamie@visionled.co
© VisionLed Consulting. All Rights Reserved.

Agenda
• What is SRE?
➢ Shared Ownership
➢ SLOs and Blameless Post-Mortems
➢ Reduce Cost of Failure
➢ Automate This Year’s Job Away
➢ Measure Toil & Reliability
• Building your SRE capability
• Q&A

Opposing Forces

What is Site Reliability Engineering?

What SREs Do
• Champion reliability practices
• Guide designs and processes with an eye toward
resilience and low toil
• Reduce technical complexity and sprawl
(inefficiency)
• Drive the usage of common tools and components
(standardisation)
• Use software to improve resilience and automate
operations

What is Toil?
• In SRE, we want to spend time (50%) on long-term
engineering project work instead of operational work.
- because "operational work" may be misinterpreted, we use a
specific word: toil
- SREs should spend less than X% of their time on toil and the rest
on coding (projects)
- Excess toil is redirected to the development team
• The work of reducing toil and scaling up services is the
'Engineering' in Site Reliability Engineering
Toil Characteristics
Manual
Repetitive
Automatable
Reactive
No Enduring Value
Scales Linearly With Growth

SRE versus DevOps
https://www.youtube.com/watch?v=uTEL8Ff1Zvk

SRE implements DevOps (in part at least)

Share Ownership the Google Way
https://web.devopstopologies.com/

Share Ownership the Acquia Way
• We embed SREs within Product Teams, rather
than build teams that run Products on behalf of
Developers
• The entire Product Team (incl. SRE) is expected to
‘own the Product’
• The SRE identifies risks to SLOs as part of their
day-to-day activities and brings improvement
opportunities directly to the Product Owner for
prioritisation in the team’s backlog.
https://www.acquia.com/

SLA, SLO, SLI
• SLA: Consequences for missing a target
• SLO: Targets for measurement
• SLI: What to measure

SLA, SLO, SLI (Example)
• SLA: If 99% of your system requests aren’t completed in 5ms, you get a refund.
• SLO: 99.5% of requests will be completed in 5ms.
• SLI: Latency of a request
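The relationship between the three terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "completed in 5ms" is read as "completed within 5 ms", and the names `LatencyTarget`, `good_ratio`, `evaluate` and the sample data are hypothetical, not part of the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyTarget:
    threshold_ms: float    # a request is "good" if it completes within this time
    min_good_ratio: float  # fraction of requests that must be good


def good_ratio(latencies_ms, threshold_ms):
    """SLI: the fraction of requests completed within the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


# Figures from the slide: the SLO (99.5%) is stricter than the SLA (99%).
SLO = LatencyTarget(threshold_ms=5.0, min_good_ratio=0.995)
SLA = LatencyTarget(threshold_ms=5.0, min_good_ratio=0.99)


def evaluate(latencies_ms):
    """Compare the measured SLI against the SLO target and the SLA floor."""
    sli = good_ratio(latencies_ms, SLO.threshold_ms)
    return {
        "sli": sli,
        "slo_met": sli >= SLO.min_good_ratio,
        "refund_due": sli < SLA.min_good_ratio,
    }


# Hypothetical sample: 1000 requests, 8 of them slower than 5 ms.
sample = [2.0] * 992 + [9.0] * 8
result = evaluate(sample)
```

With this sample the SLI is 99.2%: the internal SLO is missed, but the SLA is still honoured, so no refund is due. Keeping the SLO tighter than the SLA gives the team a warning before customers are owed anything.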

Service Level Objectives
• Once you've passed the happiness
test, increasing reliability will have
diminishing returns.
• In addition, higher reliability costs
you more to provide, reducing
your ability to make changes and
release new features.

Error Budget
• If your agreed reliability target per month is 99.9%
• Your agreed unreliability is 0.1%
• This agreed unreliability is your error budget
• An error budget of 0.1% = 43.8 minutes of permissible
unplanned downtime per month
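The 43.8-minute figure can be reproduced with a short calculation. The sketch below assumes an average month of 365.25 / 12 days (about 30.44 days), which is the month length that yields the slide's number; the function names are illustrative.

```python
# Average month length in minutes (assumption: 365.25-day year / 12).
AVERAGE_MONTH_MINUTES = 365.25 * 24 * 60 / 12  # 43,830 minutes


def error_budget_minutes(slo, period_minutes=AVERAGE_MONTH_MINUTES):
    """Permissible downtime for a reliability target given as a fraction."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * period_minutes


def budget_remaining(slo, downtime_minutes, period_minutes=AVERAGE_MONTH_MINUTES):
    """Budget left after the downtime already incurred in this period."""
    return error_budget_minutes(slo, period_minutes) - downtime_minutes


budget = error_budget_minutes(0.999)   # about 43.8 minutes
left = budget_remaining(0.999, 30.0)   # about 13.8 minutes after a 30-minute outage
```

A negative result from `budget_remaining` means the budget for the period has been overspent.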

Using the Error Budget
Imagine your service has gone down, and
you have a permissible error budget of
43.8 minutes
• What activities to detect and manually
recover will occur within this time period?
• Do you believe you can recover within the
error budget?
• 5 Minutes to discuss
• 5 Minutes to share

Using the Error Budget
No cause for concern
Definite cause for concern

Blameless Post-Mortems
• Do a Post-Mortem for every incident
• Post-Mortems are blameless
➢ i.e. they focus on process and technology, not people

Blameless Post-Mortems Agenda
• Document timeline of the Incident
• With the team determine
➢ What went well
➢ What didn’t go well (process failure, technical root cause)
➢ What was lucky (or circumstantial)
• File an action item for each item that didn’t go well, or was circumstantial,
including:
➢ Clear requirements and acceptance criteria
➢ Level of Effort and Prioritisation
• Openly share the post-mortem with the rest of the organisation
• Review post-mortems periodically

Canary Releases
https://www.youtube.com/watch?v=FT2O-qLj9Hc

Automate This Year’s Job Away

Runbook Automation
https://youtu.be/iFEKobyFqwQ

Infrastructure as Code

Measure Toil and Reliability

SRE Won’t Work Without…
• Authority to stop releases when the
Error Budget has been exhausted
• Authority to overflow operational
work to the Dev Team when
operational load is > 50%
• These must be authorised in a policy
(with CIO/CTO endorsement)
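The two authorities above amount to simple, checkable rules. The following is a minimal sketch of how such a policy might be encoded, not a description of any particular tool: `ServiceStatus`, its fields and the sample figures are hypothetical, and only the 50% threshold comes from the slide.

```python
from dataclasses import dataclass

# From the slide: overflow operational work when load is > 50%.
OPERATIONAL_LOAD_LIMIT = 0.50


@dataclass(frozen=True)
class ServiceStatus:
    error_budget_minutes: float  # budget agreed for the period
    downtime_minutes: float      # unplanned downtime so far in the period
    operational_hours: float     # SRE hours spent on operational work (toil)
    total_hours: float           # total SRE hours in the period


def releases_allowed(status):
    """Releases stop once the error budget has been exhausted."""
    return status.downtime_minutes < status.error_budget_minutes


def operational_load(status):
    """Share of SRE time spent on operational work."""
    if status.total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return status.operational_hours / status.total_hours


def overflow_to_dev(status):
    """Operational work overflows to the Dev Team above the agreed limit."""
    return operational_load(status) > OPERATIONAL_LOAD_LIMIT


# Hypothetical figures for two services.
healthy = ServiceStatus(43.8, 12.0, 60.0, 160.0)
strained = ServiceStatus(43.8, 50.0, 100.0, 160.0)
```

For `healthy`, releases continue and the load is 37.5%, so nothing overflows. For `strained`, the budget is spent and the load is 62.5%, so releases stop and operational work is redirected. The code only reports the decision; the authority to act on it still has to come from the endorsed policy.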

Beginner SRE Teams
• Staffing and hiring plan (with funding)
• Policy for:
➢ Launch readiness
➢ On-call rotation
➢ Balance of operational work/projects
➢ Post-mortems
➢ Overflow of operational work to development
• Agreed SLA, SLO, SLI with all relevant parties (end-to-end)
• Documentation for release processes, service setup,
teardown, rollback and failover
• Runbooks for routine operational tasks
https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey

Intermediate SRE Teams
• Periodic reviews of SRE project work and impact with
business leaders
• Periodic reviews of SLIs and SLOs with business leaders
• Rollback mechanism for canary releases (ideally automated)
• Periodic testing of incident management, using a combination
of role-playing with some automation in place
• There’s an escalation policy tied to SLO violations
• Teams measure demand vs. capacity and use active
forecasting to determine when demand might exceed
capacity.
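As one possible illustration of the forecasting bullet, the sketch below fits a straight line to past demand and reports how many periods remain before the projection exceeds capacity. It assumes roughly linear growth, which real demand often does not follow; the function names, the 36-period horizon and the sample figures are hypothetical.

```python
def linear_fit(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def periods_until_exceeded(history, capacity, horizon=36):
    """First future period in which projected demand exceeds capacity.

    Returns None if the projection stays within capacity for the whole horizon.
    """
    slope, intercept = linear_fit(history)
    last_x = len(history) - 1
    for ahead in range(1, horizon + 1):
        if intercept + slope * (last_x + ahead) > capacity:
            return ahead
    return None


# Hypothetical monthly peak demand (requests per second) and provisioned capacity.
monthly_peak = [100, 110, 120, 130, 140, 150]
months_left = periods_until_exceeded(monthly_peak, capacity=205)
```

Here demand grows by 10 per month from a last value of 150, so the projection first exceeds 205 six months out. That lead time is what tells the team when to start provisioning.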

Advanced SRE Teams
• Project work can be, and often is, executed
horizontally, positively impacting many services at
once, as opposed to linearly or, worse, per service
• Most service alerts are based on SLO burn rate
• Automated disaster recovery testing is in
place and positive impact can be measured
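Burn rate is the observed error rate divided by the error rate the SLO allows; a burn rate of 1 spends the budget exactly over the SLO period. The sketch below shows the calculation and a single-window alert check. The threshold of 14.4 is an example value (spending 2% of a 30-day budget within one hour), and the function names and request counts are hypothetical.

```python
def burn_rate(bad_events, total_events, slo):
    """How fast the error budget is being spent relative to the allowed rate."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction in (0, 1)")
    error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo
    return error_rate / allowed_error_rate


def burn_rate_threshold(budget_fraction, period_hours, window_hours):
    """Burn rate that spends `budget_fraction` of the budget in one window."""
    return budget_fraction * period_hours / window_hours


def should_alert(bad_events, total_events, slo, threshold):
    return burn_rate(bad_events, total_events, slo) >= threshold


# Example threshold: 2% of a 30-day budget consumed in 1 hour -> about 14.4.
threshold = burn_rate_threshold(0.02, period_hours=30 * 24, window_hours=1)

# Hypothetical one-hour windows against a 99.9% SLO.
page = should_alert(bad_events=150, total_events=10_000, slo=0.999, threshold=threshold)
quiet = should_alert(bad_events=5, total_events=10_000, slo=0.999, threshold=threshold)
```

In the first window the error rate is 1.5%, a burn rate of about 15, so the alert fires. In the second the burn rate is about 0.5 and nothing is raised. Alerting on burn rate ties each alert to the SLO rather than to an arbitrary infrastructure metric.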

Thank You!
jamie@visionled.co
