Site Reliability Engineering
The Modern Approach to Digital Infra/Ops Management
#ISSLearningDay
Mr. Jamie Donoghue, VisionLed Consulting
2 August 2019
Jamie Donoghue
Director and Principal Consultant
MBA, Business Agility Coach, Lean Change Facilitator, DevOps Leader,
Strategic Product Manager, CISA, CGEIT, CISM, CRISC, COBIT,
P3O, MSP, PRINCE2, PMP, ITIL Expert, ScrumMaster,
LeSS Practitioner, Lean IT, Lean Kanban, 6σ
Jamie, a dual citizen of the UK and NZ, has spent over 20 years improving IT
Services for public and private organisations in the UK, Australia and South
East Asia.
As an architect, consultant and coach, he specialises in creating high-
performance, cross-functional teams that are competent, accountable and
inspired.
jamie@visionled.co
© VisionLed Consulting. All Rights Reserved.
Title
• What is SRE?
➢ Shared Ownership
➢ SLOs and Blameless Post-Mortems
➢ Reduce Cost of Failure
➢ Automate This Year’s Job Away
➢ Measure Toil & Reliability
• Building your SRE capability
• Q&A
Opposing Forces
What is Site Reliability Engineering?
What SREs Do
• Champion reliability practices
• Guide designs and processes with an eye toward
resilience and low toil
• Reduce technical complexity and sprawl
(inefficiency)
• Drive the usage of common tools and components
(standardisation)
• Use software to improve resilience and automate
operations
What is Toil?
• In SRE, we want to spend time (50%) on long-term
engineering project work instead of operational work.
- because operational work may be misinterpreted, we use a
specific word: toil
- SREs should spend less than X% of their time on toil and the rest
on coding (projects)
- Excess toil is redirected to the development team
• The work of reducing toil and scaling up services is the
'Engineering' in Site Reliability Engineering
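As a rough illustration of the time split described above, the sketch below totals logged hours and reports the share spent on toil. The `WorkEntry` type, the sample entries, and the 50% limit are assumptions made for this example; the slide leaves the exact limit as X%.

```python
from dataclasses import dataclass


@dataclass
class WorkEntry:
    """One logged block of work (hypothetical structure for this sketch)."""
    description: str
    hours: float
    is_toil: bool


def toil_fraction(entries: list[WorkEntry]) -> float:
    """Return the share of logged hours classified as toil (0.0 to 1.0)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


# Assumed limit; the slide gives it only as "X%".
TOIL_LIMIT = 0.5

# Made-up week of work for one SRE.
week = [
    WorkEntry("Restart stuck batch job by hand", 6.0, True),
    WorkEntry("Manually rotate certificates", 4.0, True),
    WorkEntry("Build self-service deploy tool", 30.0, False),
]

fraction = toil_fraction(week)
over_limit = fraction > TOIL_LIMIT
print(f"toil: {fraction:.0%}, over limit: {over_limit}")
```

Tracking the fraction per person or per team over time is one way to decide when excess toil should be redirected to the development team.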
Toil Characteristics
Manual
Repetitive
Automatable
Reactive
No Enduring Value
Scales Linearly With Growth
SRE versus DevOps
https://www.youtube.com/watch?v=uTEL8Ff1Zvk
SRE implements DevOps (in part at least)
Share Ownership the Google Way
https://web.devopstopologies.com/
Share Ownership the Acquia Way
• We embed SREs within Product Teams, rather
than build teams that run Products on behalf of
Developers
• The entire Product Team (incl. SRE) is expected to
‘own the Product’
• The SRE identifies risks to SLOs as part of their
day-to-day activities and brings improvement
opportunities directly to the Product Owner for
prioritisation in the team’s backlog.
https://www.acquia.com/
SLA, SLO, SLI
• SLA (Service Level Agreement): consequences for missing a target
➢ e.g. If 99% of your system requests aren’t completed in 5ms, you get a refund.
• SLO (Service Level Objective): targets for measurement
➢ e.g. 99.5% of requests will be completed in 5ms.
• SLI (Service Level Indicator): what to measure
➢ e.g. Latency of a request
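The relationship between the three terms can be sketched with the slide's own numbers: the SLI is the measured share of requests completed within 5 ms, the SLO is the 99.5% target, and the SLA is the 99% level below which a refund is due. The function name and the sample latencies are made up for illustration.

```python
def fraction_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """SLI: share of requests completed within the latency threshold."""
    if not latencies_ms:
        return 1.0  # no traffic: treated as meeting the target (an assumption)
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


THRESHOLD_MS = 5.0   # from the slide's example
SLO_TARGET = 0.995   # target for the measurement
SLA_TARGET = 0.99    # missing this level triggers a refund

# Made-up sample: 1,000 requests, 992 fast and 8 slow.
samples = [2.0] * 992 + [9.0] * 8

sli = fraction_within(samples, THRESHOLD_MS)
slo_met = sli >= SLO_TARGET
refund_due = sli < SLA_TARGET
print(f"SLI={sli:.3f} slo_met={slo_met} refund_due={refund_due}")
```

In this sample the SLO is missed while the SLA still holds, which is the margin that a tighter SLO is meant to provide.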
Service Level Objectives
• Once you've passed the happiness
test, increasing reliability will have
diminishing returns.
• In addition, higher reliability costs
you more to provide, reducing
your ability to make changes and
release new features.
Error Budget
• If your agreed reliability target per month is 99.9%
• Your agreed unreliability is 0.1%
• This agreed unreliability is your error budget
• An error budget of 0.1% = 43.8 minutes of permissible
unplanned downtime per month
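The 43.8-minute figure can be reproduced with a short calculation. This is a minimal sketch that assumes an average month of 365.25 / 12 days; a 30-day month gives 43.2 minutes instead. The function name is hypothetical.

```python
def error_budget_minutes(slo: float, period_days: float) -> float:
    """Permissible downtime in minutes for an availability target over a period."""
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be between 0 and 1")
    return (1.0 - slo) * period_days * 24 * 60


AVERAGE_MONTH_DAYS = 365.25 / 12  # about 30.44 days (assumption)

budget = error_budget_minutes(0.999, AVERAGE_MONTH_DAYS)
print(f"{budget:.1f} minutes")
```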
Using the Error Budget
Imagine your service has gone down, and
you have a permissible error budget of
43.8 minutes
• What activities to detect and manually
recover will occur within this time period?
• Do you believe you can recover within the
error budget?
• 5 Minutes to discuss
• 5 Minutes to share
Using the Error Budget
No cause for concern
Definite cause for concern
Blameless Post-Mortems
• Do a post-mortem for every incident
• Post-Mortems are blameless
➢ i.e. they focus on process and technology, not people
Blameless Post-Mortems Agenda
• Document timeline of the Incident
• With the team, determine:
➢ What went well
➢ What didn’t go well (process failure, technical root cause)
➢ What was lucky (or circumstantial)
• File an action item for each item that didn’t go well, or was circumstantial,
including:
➢ Clear requirements and acceptance criteria
➢ Level of Effort and Prioritisation
• Openly share the post-mortem with the rest of the organisation
• Review post-mortem periodically
Canary Releases
https://www.youtube.com/watch?v=FT2O-qLj9Hc
Automate This Year’s Job Away
Runbook Automation
https://youtu.be/iFEKobyFqwQ
Infrastructure as Code
Measure Toil and Reliability
SRE Won’t Work Without…
• Authority to stop releases when the
Error Budget has been exhausted
• Authority to overflow operational
work to the Dev Team when
operational load is > 50%
• These must be authorised in a policy
(with CIO/CTO endorsement)
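The two authorities above could be encoded as simple checks, for example in a release pipeline or a weekly review. This is only a sketch: the `ServiceStatus` fields and the sample values are hypothetical, and a real policy would define how downtime and operational load are measured.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    """Snapshot of a service for policy checks (hypothetical fields)."""
    budget_minutes: float      # error budget for the period
    downtime_minutes: float    # unplanned downtime so far in the period
    operational_load: float    # share of SRE time spent on operational work


def releases_allowed(status: ServiceStatus) -> bool:
    """Releases stop once the error budget has been exhausted."""
    return status.downtime_minutes < status.budget_minutes


def overflow_to_dev(status: ServiceStatus) -> bool:
    """Operational work overflows to the Dev Team when load exceeds 50%."""
    return status.operational_load > 0.5


# Made-up values: budget already exceeded, operational load at 60%.
status = ServiceStatus(budget_minutes=43.8, downtime_minutes=50.0, operational_load=0.6)
print(releases_allowed(status), overflow_to_dev(status))
```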
Beginner SRE Teams
• Staffing and hiring plan (with funding)
• Policy for:
➢ Launch readiness
➢ On-call rotation
➢ Balance of operational work/projects
➢ Post-mortems
➢ Overflow of operational work to development
• Agreed SLA, SLO, SLI with all relevant parties (end-to-end)
• Documentation for release processes, service setup,
teardown, rollback and failover
• Runbooks for routine operational tasks
https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey
Intermediate SRE Teams
• Periodic reviews of SRE project work and impact with
business leaders
• Periodic reviews of SLIs and SLOs with business leaders
• Rollback mechanism for canary releases (ideally automated)
• Periodic testing of incident management, using a combination
of role-playing with some automation in place
• There’s an escalation policy tied to SLO violations
• Teams measure demand vs. capacity and use active
forecasting to determine when demand might exceed
capacity.
https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey
Advanced SRE Teams
• Project work can be, and often is, executed
horizontally, positively impacting many services at
once, as opposed to linearly or, worse, per service
• Most service alerts are based on SLO burn rate
• Automated disaster recovery testing is in
place and positive impact can be measured
https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey
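To make "SLO burn rate" concrete: burn rate is commonly defined as the observed error rate divided by the error rate the SLO allows, so a value of 1.0 uses up the error budget exactly over the SLO period. The sketch below assumes a 99.9% SLO, made-up request counts, and an arbitrary paging threshold of 10; none of these values come from the slides.

```python
def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows."""
    if not 0.0 <= slo < 1.0:
        raise ValueError("slo must be at least 0 and below 1")
    if total_events == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo
    observed_error_rate = bad_events / total_events
    return observed_error_rate / allowed_error_rate


SLO = 0.999
ALERT_THRESHOLD = 10.0  # assumed paging threshold for this sketch

# Made-up counts for the last hour: 120 failed out of 10,000 requests.
rate = burn_rate(bad_events=120, total_events=10_000, slo=SLO)
should_page = rate >= ALERT_THRESHOLD
print(f"burn rate {rate:.1f}, page: {should_page}")
```

Alerting on burn rate rather than on raw error counts ties each page to how quickly the error budget is being consumed.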
Q&A
Thank You!
jamie@visionled.co

Mais conteúdo relacionado

Mais procurados

Putting the Pro in Process Design with Donna Knapp - an ITSM Academy Webinar
Putting the Pro in Process Design with Donna Knapp - an ITSM Academy WebinarPutting the Pro in Process Design with Donna Knapp - an ITSM Academy Webinar
Putting the Pro in Process Design with Donna Knapp - an ITSM Academy WebinarITSM Academy, Inc.
 
AppSphere 15 - Smoke Jumping with AppDynamics
AppSphere 15 - Smoke Jumping with AppDynamicsAppSphere 15 - Smoke Jumping with AppDynamics
AppSphere 15 - Smoke Jumping with AppDynamicsAppDynamics
 
DevOps Without Measurement is a Fail
DevOps Without Measurement is a FailDevOps Without Measurement is a Fail
DevOps Without Measurement is a FailTori Wieldt
 
Continuous Delivery: The One Question You Must Answer
Continuous Delivery: The One Question You Must AnswerContinuous Delivery: The One Question You Must Answer
Continuous Delivery: The One Question You Must AnswerDevOps.com
 
iSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
iSQI Certification Days DASA – DevOps & ISTQB Frank FrambachiSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
iSQI Certification Days DASA – DevOps & ISTQB Frank FrambachIevgenii Katsan
 
AppSphere 15 - APM Adoption within an Energy Supply & Trading Organisation
AppSphere 15 - APM Adoption within an Energy Supply & Trading OrganisationAppSphere 15 - APM Adoption within an Energy Supply & Trading Organisation
AppSphere 15 - APM Adoption within an Energy Supply & Trading OrganisationAppDynamics
 
Add Watson to your Apps
Add Watson to your AppsAdd Watson to your Apps
Add Watson to your AppsJason Anderson
 
Accenture DevOps: Delivering applications at the pace of business
Accenture DevOps: Delivering applications at the pace of businessAccenture DevOps: Delivering applications at the pace of business
Accenture DevOps: Delivering applications at the pace of businessAccenture Technology
 
DOES14 - Jonny Wooldridge - The Cambridge Satchel Company - 10 Enterprise Tip...
DOES14 - Jonny Wooldridge - The Cambridge Satchel Company - 10 Enterprise Tip...DOES14 - Jonny Wooldridge - The Cambridge Satchel Company - 10 Enterprise Tip...
DOES14 - Jonny Wooldridge - The Cambridge Satchel Company - 10 Enterprise Tip...Gene Kim
 
LesAffairesDevOps-Dec2020-Keynote-FromProjectToProduct-SteveMercier
LesAffairesDevOps-Dec2020-Keynote-FromProjectToProduct-SteveMercierLesAffairesDevOps-Dec2020-Keynote-FromProjectToProduct-SteveMercier
LesAffairesDevOps-Dec2020-Keynote-FromProjectToProduct-SteveMercierSteve Mercier
 
Getting Your DevOps-enabled Product Teams to See the Forest from the Trees
Getting Your DevOps-enabled Product Teams to See the Forest from the TreesGetting Your DevOps-enabled Product Teams to See the Forest from the Trees
Getting Your DevOps-enabled Product Teams to See the Forest from the TreesDevOps.com
 
What do the "Cool Kids" know about DevOps?
What do the "Cool Kids" know about DevOps?What do the "Cool Kids" know about DevOps?
What do the "Cool Kids" know about DevOps?Bill Holtshouser
 
Outcome-Driven Product Backlog Management by Mike Dwyer - Agile Maine Day 2016
Outcome-Driven Product Backlog Management by Mike Dwyer - Agile Maine Day 2016Outcome-Driven Product Backlog Management by Mike Dwyer - Agile Maine Day 2016
Outcome-Driven Product Backlog Management by Mike Dwyer - Agile Maine Day 2016agilemaine
 
Agile Technology Delivery Process Mr
Agile Technology Delivery Process   MrAgile Technology Delivery Process   Mr
Agile Technology Delivery Process MrMurray Robinson
 
DevOps : Consulting with Foresight
DevOps : Consulting with ForesightDevOps : Consulting with Foresight
DevOps : Consulting with ForesightInfoSeption
 
Offshore development model in 10 steps sap yard
Offshore development model in 10 steps   sap yardOffshore development model in 10 steps   sap yard
Offshore development model in 10 steps sap yardSAPYard
 
The Best Way to Get Trained on Ivanti Products
The Best Way to Get Trained on Ivanti ProductsThe Best Way to Get Trained on Ivanti Products
The Best Way to Get Trained on Ivanti ProductsIvanti
 
Why a DevOps approach is critical to achieve digital transformation
Why a DevOps approach is critical to achieve digital transformationWhy a DevOps approach is critical to achieve digital transformation
Why a DevOps approach is critical to achieve digital transformationAgileSparks
 

Mais procurados (20)

Richard Powell CV
Richard Powell CVRichard Powell CV
Richard Powell CV
 
Putting the Pro in Process Design with Donna Knapp - an ITSM Academy Webinar
Putting the Pro in Process Design with Donna Knapp - an ITSM Academy WebinarPutting the Pro in Process Design with Donna Knapp - an ITSM Academy Webinar
Putting the Pro in Process Design with Donna Knapp - an ITSM Academy Webinar
 
AppSphere 15 - Smoke Jumping with AppDynamics
AppSphere 15 - Smoke Jumping with AppDynamicsAppSphere 15 - Smoke Jumping with AppDynamics
AppSphere 15 - Smoke Jumping with AppDynamics
 
DevOps Without Measurement is a Fail
DevOps Without Measurement is a FailDevOps Without Measurement is a Fail
DevOps Without Measurement is a Fail
 
Continuous Delivery: The One Question You Must Answer
Continuous Delivery: The One Question You Must AnswerContinuous Delivery: The One Question You Must Answer
Continuous Delivery: The One Question You Must Answer
 
iSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
iSQI Certification Days DASA – DevOps & ISTQB Frank FrambachiSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
iSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
 
AppSphere 15 - APM Adoption within an Energy Supply & Trading Organisation
AppSphere 15 - APM Adoption within an Energy Supply & Trading OrganisationAppSphere 15 - APM Adoption within an Energy Supply & Trading Organisation
AppSphere 15 - APM Adoption within an Energy Supply & Trading Organisation
 
Add Watson to your Apps
Add Watson to your AppsAdd Watson to your Apps
Add Watson to your Apps
 
The hothouse approach
The hothouse approachThe hothouse approach
The hothouse approach
 
Accenture DevOps: Delivering applications at the pace of business
Accenture DevOps: Delivering applications at the pace of businessAccenture DevOps: Delivering applications at the pace of business
Accenture DevOps: Delivering applications at the pace of business
 
DOES14 - Jonny Wooldridge - The Cambridge Satchel Company - 10 Enterprise Tip...
DOES14 - Jonny Wooldridge - The Cambridge Satchel Company - 10 Enterprise Tip...DOES14 - Jonny Wooldridge - The Cambridge Satchel Company - 10 Enterprise Tip...
DOES14 - Jonny Wooldridge - The Cambridge Satchel Company - 10 Enterprise Tip...
 
LesAffairesDevOps-Dec2020-Keynote-FromProjectToProduct-SteveMercier
LesAffairesDevOps-Dec2020-Keynote-FromProjectToProduct-SteveMercierLesAffairesDevOps-Dec2020-Keynote-FromProjectToProduct-SteveMercier
LesAffairesDevOps-Dec2020-Keynote-FromProjectToProduct-SteveMercier
 
Getting Your DevOps-enabled Product Teams to See the Forest from the Trees
Getting Your DevOps-enabled Product Teams to See the Forest from the TreesGetting Your DevOps-enabled Product Teams to See the Forest from the Trees
Getting Your DevOps-enabled Product Teams to See the Forest from the Trees
 
What do the "Cool Kids" know about DevOps?
What do the "Cool Kids" know about DevOps?What do the "Cool Kids" know about DevOps?
What do the "Cool Kids" know about DevOps?
 
Outcome-Driven Product Backlog Management by Mike Dwyer - Agile Maine Day 2016
Outcome-Driven Product Backlog Management by Mike Dwyer - Agile Maine Day 2016Outcome-Driven Product Backlog Management by Mike Dwyer - Agile Maine Day 2016
Outcome-Driven Product Backlog Management by Mike Dwyer - Agile Maine Day 2016
 
Agile Technology Delivery Process Mr
Agile Technology Delivery Process   MrAgile Technology Delivery Process   Mr
Agile Technology Delivery Process Mr
 
DevOps : Consulting with Foresight
DevOps : Consulting with ForesightDevOps : Consulting with Foresight
DevOps : Consulting with Foresight
 
Offshore development model in 10 steps sap yard
Offshore development model in 10 steps   sap yardOffshore development model in 10 steps   sap yard
Offshore development model in 10 steps sap yard
 
The Best Way to Get Trained on Ivanti Products
The Best Way to Get Trained on Ivanti ProductsThe Best Way to Get Trained on Ivanti Products
The Best Way to Get Trained on Ivanti Products
 
Why a DevOps approach is critical to achieve digital transformation
Why a DevOps approach is critical to achieve digital transformationWhy a DevOps approach is critical to achieve digital transformation
Why a DevOps approach is critical to achieve digital transformation
 

Semelhante a NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method for Digital Infra/Ops Management

Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas Naresh Jain
 
Why Agile Fail. *Hint* -it's more than just process
Why Agile Fail. *Hint* -it's more than just processWhy Agile Fail. *Hint* -it's more than just process
Why Agile Fail. *Hint* -it's more than just processTasktop
 
How Salesforce built a Scalable, World-Class, Performance Engineering Team
How Salesforce built a Scalable, World-Class, Performance Engineering TeamHow Salesforce built a Scalable, World-Class, Performance Engineering Team
How Salesforce built a Scalable, World-Class, Performance Engineering TeamSalesforce Developers
 
How to create awesome customer experiences
How to create awesome customer experiencesHow to create awesome customer experiences
How to create awesome customer experiencesMorgan Simonsen
 
Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce Salesforce Engineering
 
Foundations of scaling agile with SAFe
Foundations of scaling agile with SAFeFoundations of scaling agile with SAFe
Foundations of scaling agile with SAFeYuval Yeret
 
Continuous Testing: A Key to DevOps Success
Continuous Testing: A Key to DevOps SuccessContinuous Testing: A Key to DevOps Success
Continuous Testing: A Key to DevOps SuccessTechWell
 
A confused tester in agile world finalversion
A confused tester in agile world finalversionA confused tester in agile world finalversion
A confused tester in agile world finalversionAshish Kumar
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityAcquia
 
Agile Truths and Misconceptions
Agile Truths and MisconceptionsAgile Truths and Misconceptions
Agile Truths and MisconceptionsRichard Cheng
 
Puppet Labs EMC DevOps Day NYC Aug-2015
Puppet Labs  EMC DevOps Day NYC Aug-2015Puppet Labs  EMC DevOps Day NYC Aug-2015
Puppet Labs EMC DevOps Day NYC Aug-2015Bob Sokol
 
How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...Puppet
 
Holistic Product Development
Holistic Product DevelopmentHolistic Product Development
Holistic Product DevelopmentGary Pedretti
 
DevOps Roadshow - removing barriers between development and operations
DevOps Roadshow - removing barriers between development and operationsDevOps Roadshow - removing barriers between development and operations
DevOps Roadshow - removing barriers between development and operationsMicrosoft Developer Norway
 
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...Business of Software Conference
 
DevOps Transformation - Another View
DevOps Transformation - Another ViewDevOps Transformation - Another View
DevOps Transformation - Another ViewAgron Fazliu
 
Agile at Glasswing
Agile at GlasswingAgile at Glasswing
Agile at GlasswingRajeev Soni
 
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous DeliveryWebinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous DeliveryXebiaLabs
 

Semelhante a NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method for Digital Infra/Ops Management (20)

Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas Value Driven Development by Dave Thomas
Value Driven Development by Dave Thomas
 
Why Agile Fail. *Hint* -it's more than just process
Why Agile Fail. *Hint* -it's more than just processWhy Agile Fail. *Hint* -it's more than just process
Why Agile Fail. *Hint* -it's more than just process
 
How Salesforce built a Scalable, World-Class, Performance Engineering Team
How Salesforce built a Scalable, World-Class, Performance Engineering TeamHow Salesforce built a Scalable, World-Class, Performance Engineering Team
How Salesforce built a Scalable, World-Class, Performance Engineering Team
 
How to create awesome customer experiences
How to create awesome customer experiencesHow to create awesome customer experiences
How to create awesome customer experiences
 
Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce
 
Foundations of scaling agile with SAFe
Foundations of scaling agile with SAFeFoundations of scaling agile with SAFe
Foundations of scaling agile with SAFe
 
Continuous Testing: A Key to DevOps Success
Continuous Testing: A Key to DevOps SuccessContinuous Testing: A Key to DevOps Success
Continuous Testing: A Key to DevOps Success
 
A confused tester in agile world finalversion
A confused tester in agile world finalversionA confused tester in agile world finalversion
A confused tester in agile world finalversion
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
 
Agile Truths and Misconceptions
Agile Truths and MisconceptionsAgile Truths and Misconceptions
Agile Truths and Misconceptions
 
Puppet Labs EMC DevOps Day NYC Aug-2015
Puppet Labs  EMC DevOps Day NYC Aug-2015Puppet Labs  EMC DevOps Day NYC Aug-2015
Puppet Labs EMC DevOps Day NYC Aug-2015
 
How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...
 
Holistic Product Development
Holistic Product DevelopmentHolistic Product Development
Holistic Product Development
 
DevOps Roadshow - removing barriers between development and operations
DevOps Roadshow - removing barriers between development and operationsDevOps Roadshow - removing barriers between development and operations
DevOps Roadshow - removing barriers between development and operations
 
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
 
DevOps Transformation - Another View
DevOps Transformation - Another ViewDevOps Transformation - Another View
DevOps Transformation - Another View
 
Fundamentals of Agile
Fundamentals of AgileFundamentals of Agile
Fundamentals of Agile
 
Agile at Glasswing
Agile at GlasswingAgile at Glasswing
Agile at Glasswing
 
Expo qa15 Keynote
Expo qa15 KeynoteExpo qa15 Keynote
Expo qa15 Keynote
 
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous DeliveryWebinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
 

Mais de NUS-ISS

Designing Impactful Services and User Experience - Lim Wee Khee
Designing Impactful Services and User Experience - Lim Wee KheeDesigning Impactful Services and User Experience - Lim Wee Khee
Designing Impactful Services and User Experience - Lim Wee KheeNUS-ISS
 
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...NUS-ISS
 
How the World's Leading Independent Automotive Distributor is Reinventing Its...
How the World's Leading Independent Automotive Distributor is Reinventing Its...How the World's Leading Independent Automotive Distributor is Reinventing Its...
How the World's Leading Independent Automotive Distributor is Reinventing Its...NUS-ISS
 
The Importance of Cybersecurity for Digital Transformation
The Importance of Cybersecurity for Digital TransformationThe Importance of Cybersecurity for Digital Transformation
The Importance of Cybersecurity for Digital TransformationNUS-ISS
 
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...NUS-ISS
 
Understanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix GohUnderstanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix GohNUS-ISS
 
Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng Tsze
Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng TszeDigital Product-Centric Enterprise and Enterprise Architecture - Tan Eng Tsze
Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng TszeNUS-ISS
 
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...NUS-ISS
 
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...
Beyond the Hype: What Generative AI Means for the Future of Work - Damien Cum...NUS-ISS
 
Supply Chain Security for Containerised Workloads - Lee Chuk Munn
Supply Chain Security for Containerised Workloads - Lee Chuk MunnSupply Chain Security for Containerised Workloads - Lee Chuk Munn
Supply Chain Security for Containerised Workloads - Lee Chuk MunnNUS-ISS
 
Future of Learning - Yap Aye Wee.pdf
Future of Learning - Yap Aye Wee.pdfFuture of Learning - Yap Aye Wee.pdf
Future of Learning - Yap Aye Wee.pdfNUS-ISS
 
Future of Learning - Khoong Chan Meng
Future of Learning - Khoong Chan MengFuture of Learning - Khoong Chan Meng
Future of Learning - Khoong Chan MengNUS-ISS
 
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7NUS-ISS
 
Product Management in The Trenches for a Cloud Service
Product Management in The Trenches for a Cloud ServiceProduct Management in The Trenches for a Cloud Service
Product Management in The Trenches for a Cloud ServiceNUS-ISS
 
Overview of Data and Analytics Essentials and Foundations
Overview of Data and Analytics Essentials and FoundationsOverview of Data and Analytics Essentials and Foundations
Overview of Data and Analytics Essentials and FoundationsNUS-ISS
 
Predictive Analytics
Predictive AnalyticsPredictive Analytics
Predictive AnalyticsNUS-ISS
 
Feature Engineering for IoT
Feature Engineering for IoTFeature Engineering for IoT
Feature Engineering for IoTNUS-ISS
 
Master of Technology in Software Engineering
Master of Technology in Software EngineeringMaster of Technology in Software Engineering
Master of Technology in Software EngineeringNUS-ISS
 
Master of Technology in Enterprise Business Analytics
Master of Technology in Enterprise Business AnalyticsMaster of Technology in Enterprise Business Analytics
Master of Technology in Enterprise Business AnalyticsNUS-ISS
 
Diagnosing Complex Problems Using System Archetypes
Diagnosing Complex Problems Using System ArchetypesDiagnosing Complex Problems Using System Archetypes
Diagnosing Complex Problems Using System ArchetypesNUS-ISS
 

Mais de NUS-ISS (20)

Designing Impactful Services and User Experience - Lim Wee Khee
Designing Impactful Services and User Experience - Lim Wee KheeDesigning Impactful Services and User Experience - Lim Wee Khee
Designing Impactful Services and User Experience - Lim Wee Khee
 
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
Upskilling the Evolving Workforce with Digital Fluency for Tomorrow's Challen...
 
How the World's Leading Independent Automotive Distributor is Reinventing Its...
How the World's Leading Independent Automotive Distributor is Reinventing Its...How the World's Leading Independent Automotive Distributor is Reinventing Its...
How the World's Leading Independent Automotive Distributor is Reinventing Its...
 
The Importance of Cybersecurity for Digital Transformation
The Importance of Cybersecurity for Digital TransformationThe Importance of Cybersecurity for Digital Transformation
The Importance of Cybersecurity for Digital Transformation
 
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
Architecting CX Measurement Frameworks and Ensuring CX Metrics are fit for Pu...
 
Understanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix GohUnderstanding GenAI/LLM and What is Google Offering - Felix Goh



NUS-ISS Learning Day 2019-Site Reliability Engineering – The Modern Method for Digital Infra/Ops Management

  • 1. Site Reliability Engineering The Modern Approach to Digital Infra/Ops Management #ISSLearningDay Mr. Jamie Donoghue, VisionLed Consulting 2 August 2019
  • 2. Jamie Donoghue, Director and Principal Consultant. MBA, Business Agility Coach, Lean Change Facilitator, DevOps Leader, Strategic Product Manager, CISA, CGEIT, CISM, CRISC, COBIT, P3O, MSP, PRINCE2, PMP, ITIL Expert, ScrumMaster, LeSS Practitioner, Lean IT, Lean Kanban, 6σ. Jamie, a dual citizen of the UK and NZ, has spent over 20 years improving IT Services for public and private organisations in the UK, Australia and South East Asia. As an architect, consultant and coach, he specialises in creating high-performance, cross-functional teams that are competent, accountable and inspired. jamie@visionled.co © VisionLed Consulting. All Rights Reserved.
  • 3. Title
      • What is SRE?
          ➢ Shared Ownership
          ➢ SLOs and Blameless Post-Mortems
          ➢ Reduce Cost of Failure
          ➢ Automate This Year’s Job Away
          ➢ Measure Toil & Reliability
      • Building your SRE capability
      • Q&A
  • 5. What is Site Reliability Engineering?
  • 6. What SREs Do
      • Champion reliability practices
      • Guide designs and processes with an eye toward resilience and low toil
      • Reduce technical complexity and sprawl (inefficiency)
      • Drive the usage of common tools and components (standardisation)
      • Use software to improve resilience and automate operations
  • 7. What is Toil?
      • In SRE, we want to spend time (50%) on long-term engineering project work instead of operational work.
          - Because "operational work" may be misinterpreted, we use a specific word: toil
          - SREs should spend less than X% of their time on toil and the rest on coding (projects)
          - Excess toil is redirected to the development team
      • The work of reducing toil and scaling up services is the 'Engineering' in Site Reliability Engineering
      Toil Characteristics
      • Manual
      • Repetitive
      • Automatable
      • Reactive
      • No Enduring Value
      • Scales Linearly With Growth
  • 9. SRE implements DevOps (in part at least)
  • 10. SRE implements DevOps (in part at least)
  • 11. Share Ownership the Google Way https://web.devopstopologies.com/
  • 12. Share Ownership the Acquia Way
      • We embed SREs within Product Teams, rather than build teams that run Products on behalf of Developers
      • The entire Product Team (incl. SRE) is expected to ‘own the Product’
      • The SRE identifies risks to SLOs as part of their day-to-day activities and brings improvement opportunities directly to the Product Owner for prioritisation in the team’s backlog.
      https://www.acquia.com/
  • 13. SRE implements DevOps (in part at least)
  • 14. SLA, SLO, SLI
      • SLA: Consequences for missing a target
      • SLO: Targets for measurement
      • SLI: What to measure
  • 15. SLA, SLO, SLI
      • SLA: If 99% of your system requests aren’t completed in 5 ms, you get a refund.
      • SLO: 99.5% of requests will be completed in 5 ms.
      • SLI: Latency of a request
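The relationship between the three terms on slide 15 can be sketched in a few lines of Python. This is only an illustration using the slide's numbers (5 ms, 99.5%, 99%); the function names, the sample data and the choice to count a request as "good" when its latency is at or below the limit are assumptions, not part of the talk.

```python
from typing import Dict, Sequence

LATENCY_LIMIT_MS = 5.0  # slide 15: requests should complete in 5 ms
SLO_TARGET = 0.995      # slide 15: 99.5% of requests within the limit
SLA_THRESHOLD = 0.99    # slide 15: below 99%, the customer gets a refund


def latency_sli(latencies_ms: Sequence[float], limit_ms: float = LATENCY_LIMIT_MS) -> float:
    """SLI: fraction of requests that completed within the latency limit."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for latency in latencies_ms if latency <= limit_ms)
    return good / len(latencies_ms)


def evaluate(latencies_ms: Sequence[float]) -> Dict[str, object]:
    """Compare the measured SLI with the SLO (target) and the SLA (consequence)."""
    sli = latency_sli(latencies_ms)
    return {
        "sli": sli,
        "slo_met": sli >= SLO_TARGET,
        "refund_due": sli < SLA_THRESHOLD,
    }


# Hypothetical sample: 1,000 requests, 8 of them slower than 5 ms.
sample = [2.0] * 992 + [9.0] * 8
report = evaluate(sample)
print(report)
```

In this sample the SLI is 99.2%: the internal objective is missed, but the contractual threshold is not, which is the gap an SLO is meant to leave above an SLA.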
  • 16. Service Level Objectives
      • Once you've passed the happiness test, increasing reliability will have diminishing returns.
      • In addition, higher reliability costs you more to provide, reducing your ability to make changes and release new features.
  • 17. Error Budget
      • If your agreed reliability target per month is 99.9%
      • Your agreed unreliability is 0.1%
      • This agreed unreliability is your error budget
      • An error budget of 0.1% = 43.8 minutes of permissible unplanned downtime per month
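The 43.8-minute figure on slide 17 can be reproduced with a short calculation. The sketch below assumes an average month of 365.25 / 12 days (43,830 minutes), which is one common convention and appears to be the one behind the slide's number; the function names are illustrative.

```python
# Assumption: an "average month" is 365.25 days / 12 = 43,830 minutes.
MINUTES_PER_AVERAGE_MONTH = 365.25 * 24 * 60 / 12


def error_budget_minutes(slo_target: float,
                         window_minutes: float = MINUTES_PER_AVERAGE_MONTH) -> float:
    """Permissible unplanned downtime for a reliability target over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_minutes


def remaining_budget_minutes(budget_minutes: float, downtime_minutes: float) -> float:
    """Budget left after the downtime already incurred (negative if overspent)."""
    return budget_minutes - downtime_minutes


budget = error_budget_minutes(0.999)
print(round(budget, 1))                                  # 43.8
print(round(remaining_budget_minutes(budget, 30.0), 1))  # 13.8
```

A 30-day month gives 43.2 minutes instead, so it is worth stating the window explicitly when agreeing the target.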
  • 18. Using the Error Budget
      Imagine your service has gone down, and you have a permissible error budget of 43.8 minutes
      • What activities to detect and manually recover will occur within this time period?
      • Do you believe you can recover within the error budget?
      • 5 Minutes to discuss
      • 5 Minutes to share
  • 19. Using the Error Budget (figure labels: No cause for concern; Definite cause for concern)
  • 20. Blameless Post-Mortems
      • Do a post-mortem for every incident
      • Post-mortems are blameless
          ➢ i.e. they focus on process and technology, not people
  • 21. Blameless Post-Mortems Agenda
      • Document the timeline of the incident
      • With the team, determine:
          • What went well
          • What didn’t go well (process failure, technical root cause)
          • What was lucky (or circumstantial)
      • File an action item for each item that didn’t go well, or was circumstantial, including:
          • Clear requirements and acceptance criteria
          • Level of effort and prioritisation
      • Openly share the post-mortem with the rest of the organisation
      • Review post-mortems periodically
  • 22. SRE implements DevOps (in part at least)
  • 24. SRE implements DevOps (in part at least)
  • 25. Automate This Year’s Job Away
  • 28. SRE implements DevOps (in part at least)
  • 29. Measure Toil and Reliability
  • 30. SRE Won’t Work Without…
      • Authority to stop releases when the Error Budget has been exhausted
      • Authority to overflow operational work to the Dev Team when operational load exceeds 50%
      • These must be authorised in a policy (with CIO/CTO endorsement)
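The two authorities on slide 30 amount to simple policy checks. The following is a minimal sketch, assuming downtime and working hours are tracked somewhere; the `ServiceStatus` record and both function names are hypothetical, and only the 50% limit and the "budget exhausted" rule come from the slide.

```python
from dataclasses import dataclass

OPERATIONAL_LOAD_LIMIT = 0.50  # slide 30: overflow when operational load > 50%


@dataclass(frozen=True)
class ServiceStatus:
    """Hypothetical snapshot of one service for the current period."""
    error_budget_minutes: float  # agreed budget, e.g. 43.8 per month
    downtime_minutes: float      # unplanned downtime incurred so far
    toil_hours: float            # SRE hours spent on operational work
    total_hours: float           # total SRE hours worked


def releases_allowed(status: ServiceStatus) -> bool:
    """Releases stop once the error budget has been exhausted."""
    return status.downtime_minutes < status.error_budget_minutes


def overflow_to_dev(status: ServiceStatus) -> bool:
    """Operational work overflows to the Dev Team above the load limit."""
    if status.total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return status.toil_hours / status.total_hours > OPERATIONAL_LOAD_LIMIT


# Hypothetical period: budget overspent and 60% of time spent on toil.
status = ServiceStatus(error_budget_minutes=43.8, downtime_minutes=50.0,
                       toil_hours=24.0, total_hours=40.0)
print(releases_allowed(status))  # False
print(overflow_to_dev(status))   # True
```

The checks are trivial on purpose: the slide's point is that the authority to act on them must be written into policy, not that the logic is hard.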
  • 31. Beginner SRE Teams
      • Staffing and hiring plan (with funding)
      • Policy for:
          • Launch readiness
          • On-call rotation
          • Balance of operational work/projects
          • Post-mortems
          • Overflow of operational work to development
      • Agreed SLA, SLO, SLI with all relevant parties (end-to-end)
      • Documentation for release processes, service setup, teardown, rollback and failover
      • Runbooks for routine operational tasks
      https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey
  • 32. Intermediate SRE Teams
      • Periodic reviews of SRE project work and impact with business leaders
      • Periodic reviews of SLIs and SLOs with business leaders
      • Rollback mechanism for canary releases (ideally automated)
      • Periodic testing of incident management, using a combination of role-playing with some automation in place
      • There’s an escalation policy tied to SLO violations
      • Teams measure demand vs. capacity and use active forecasting to determine when demand might exceed capacity.
      https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey
  • 33. Advanced SRE Teams
      • Project work can be, and often is, executed horizontally, positively impacting many services at once, as opposed to linearly or, worse, per service
      • Most service alerts are based on SLO burn rate
      • Automated disaster recovery testing is in place and positive impact can be measured
      https://cloud.google.com/blog/products/devops-sre/how-to-start-and-assess-your-sre-journey
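"SLO burn rate" on slide 33 is usually understood as how fast the error budget is being consumed relative to the rate the SLO allows: a burn rate of 1 spends exactly the budget over the window. The sketch below uses that common definition; the alert threshold of 10 and the request counts are hypothetical values for illustration, not figures from the talk.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO permits."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


def should_alert(rate: float, threshold: float = 10.0) -> bool:
    """Hypothetical rule: alert when the budget burns 10x faster than allowed."""
    return rate >= threshold


# Hypothetical one-hour windows against a 99.9% SLO.
slow_burn = burn_rate(bad_events=30, total_events=10_000, slo_target=0.999)
fast_burn = burn_rate(bad_events=200, total_events=10_000, slo_target=0.999)
print(should_alert(slow_burn))  # False (about 3x the allowed rate)
print(should_alert(fast_burn))  # True (about 20x the allowed rate)
```

Alerting on burn rate rather than on individual errors ties each page directly to the risk of exhausting the error budget.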