A quick summary of
SRE – Site Reliability Engineering
Yogesh shah
Agenda
• What is SRE & its background
• Before going to SRE
• SRE and DevOps
• Components of SRE
• Reliability
• SLA
• SLO
• SLI
• Error budget
• Toil
• Things we did not cover
• References
What is SRE, History & Background
SRE = Site Reliability Engineering
The term SRE originated at Google more than a decade ago, and it
has been the backbone of Google’s highly reliable and valuable
suite of products and services.
For a long time Google did not make the details of SRE public,
as it considered SRE the secret sauce of its success.
When the DevOps movement started, Google could see that
there was a lot of interest in implementing DevOps, but there was
no clear path and people were struggling to implement it.
Scrum, SAFe, Lean, DevOps ... now SRE
• Framework direction: Dev → Ops
• Flexibility: Rigid → open to interpretation
• Ease of implementation: Easy → very hard
• Fit for market demand: Low → high
Software delivery mechanisms: what is at the center, type, advantages, difficulties
Waterfall / Project management
• Centers around: Plan
• Outcome: Fixed target
• Type: Process
• Advantages: Easy to implement; scope, time, and cost are fixed
• Difficulties: Changing requirements; too heavy, complex & costly
ITIL
• Centers around: SLA
• Outcome: Predefined service quality
• Type: Framework
• Advantages: Easy to implement; clear accountability; predictable service quality
• Difficulties: Meeting the SLA != customer satisfaction; too heavy & complex
Scrum / SAFe
• Centers around: Timebox, focus
• Outcome: Delivery of changing requirements
• Type: Framework
• Advantages: Simple to understand
• Difficulties: Difficult to implement; works best in pockets, but consistency is hard to achieve
Lean
• Centers around: Flow of work
• Outcome: Removal of waste
• Type: Methodology
• Advantages: Easy to implement; clear accountability; predictable service quality
• Difficulties: Meeting the SLA != customer satisfaction; too heavy & complex
DevOps
• Centers around: Unifying Dev & Ops
• Outcome: End-to-end accountability for Dev & Ops
• Type: Philosophy
• Advantages: Great vision
• Difficulties: Open to interpretation
What is SRE in comparison to the others
• Centers around: Reliability
• Outcome: Customer satisfaction, with control over the balance between
enhancement and reliability
• Type: Implementation pattern
• Advantage: Implements DevOps
• Disadvantage: None
• Addresses the so-far neglected question: “Is the system ready to handle change
without impacting customer experience?”
• SRE happens when a software engineer is tasked with what used to be
called operations.
SRE and DevOps
But what is DevOps?
DevOps is about a combined team (Dev & Ops)
using a common set of tools and processes to deliver
any software change.
SRE is an implementation of DevOps.
DevOps
• Reduce organizational silos
• Accept failure as normal
• Implement gradual change
• Leverage tooling & automation
• Measure everything
SRE
• Shares ownership with developers by using the same tools and techniques across the stack
• Has a formula for balancing accidents and failures against new releases
• Encourages moving quickly by reducing the cost of failure
• Encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system
• Believes that operations is a software problem, and defines prescriptive ways of measuring availability, uptime, outages, toil, etc.
Components of SRE
Reliability
SLA
SLO
SLI
Error Budget
Toil
Defining Reliability
The most important feature of any system is its reliability
• A clunky system with great features doesn’t work
• 100% reliability is most often the wrong target, as it slows down velocity
• Reliability beyond a certain point has diminishing returns
• Each additional 9 makes the system 10 times more reliable, but as a rule of thumb it also costs about 10 times more
User experience decides reliability
• Users, not monitoring metrics, decide reliability; hence, to say that a system is reliable, one
needs to measure user experience
Engineering and talented developers alone are not enough for highly reliable systems; a well-trained incident response team is a must
• To achieve highly reliable (99.999…) systems, a well-trained incident response team
(proactive & reactive) is required. Talented developers and a well-engineered system are
not enough on their own
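To illustrate the "each additional 9" point above, the following sketch shows how the allowed downtime shrinks by a factor of 10 with every extra nine. The 28-day window and the function names are assumptions chosen for illustration; they are not prescribed by SRE.

```python
# Allowed downtime for a target with a given number of nines.
# Assumption: a 28-day measurement window.

WINDOW_MINUTES = 28 * 24 * 60  # 40,320 minutes in 28 days


def unreliability(nines: int) -> float:
    """Allowed failure fraction: 2 nines = 99%, 3 nines = 99.9%, and so on."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 10.0 ** -nines


def allowed_downtime_minutes(nines: int,
                             window_minutes: float = WINDOW_MINUTES) -> float:
    """Minutes of downtime the target tolerates over the window."""
    return window_minutes * unreliability(nines)


if __name__ == "__main__":
    for n in range(2, 6):
        target = 100.0 * (1.0 - unreliability(n))
        print(f"{n} nines ({target:.3f}%): "
              f"{allowed_downtime_minutes(n):.2f} minutes per 28 days")
```

Each step down the printed list leaves one tenth of the previous downtime allowance, which is why every extra nine is so much harder to deliver.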
Reliability
• SRE helps define reliability in a clear way, using the concept of an error
budget
• Thanks to the error budget, reliability is understood
consistently across the organization
• 100% reliability is the wrong target, as it slows down velocity
• User happiness is directly proportional to reliability up to a point;
beyond that, users don’t care
SLA – Service Level Agreements
• These are the agreements you make with your customers about
the reliability of your service. An SLA has to have consequences if it is
violated
• Violating an SLA is a costly affair in many respects; hence, getting an
informative warning with enough time to react is a must to prevent
an SLA violation
SLO – Service Level Objectives
• Reliability is a feature, so it is prioritized against other functional features. However,
prioritizing reliability is challenging, and SLOs are the key to prioritizing it
alongside other features
• The SLO is the target for a specified level of reliability. In other words, reliability is measured against the SLO
• Your SLOs should always be stricter than your SLAs, because customers are usually impacted
before the SLA is actually breached.
• An SLO is effectively an internal promise to meet customer expectations. Violating an SLO
is a serious issue because you can no longer afford more outages, so you will
want to take steps to remove risk from your service by devoting engineering
and automation effort to reducing and eliminating areas of risk.
• A good rule of thumb for setting SLO targets is the “happiness test”: find the threshold beyond which
users tend to become grumpy due to degraded service performance
• Identifying SLOs and setting their targets is an important but tough task, and SRE has
clear guidelines to identify SLOs, set targets, and revise SLOs, targets, or both
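The relationship between SLO and SLA described above can be sketched as a simple check in which the SLO acts as an early-warning threshold that trips before the SLA does. The function name `service_status`, the status labels, and the example thresholds are hypothetical and only serve the illustration.

```python
def service_status(measured: float, slo: float, sla: float) -> str:
    """Classify measured reliability (a fraction such as 0.999).

    Assumes the SLO is stricter (higher) than the SLA, as recommended above.
    """
    if slo <= sla:
        raise ValueError("the SLO should be stricter than the SLA")
    if measured >= slo:
        return "ok"
    if measured >= sla:
        # Internal promise missed: time to act before customers are owed anything.
        return "slo_missed"
    # Contractual promise broken: SLA consequences apply.
    return "sla_breached"


if __name__ == "__main__":
    # Hypothetical targets: SLO of 99.9%, SLA of 99.5%.
    for measured in (0.9995, 0.997, 0.99):
        print(measured, service_status(measured, slo=0.999, sla=0.995))
```

The gap between the two thresholds is the reaction time the team buys itself.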
SLI – Service Level Indicators
What is an SLI
• Now we understand what reliability is, but how do we measure it?
• The reliability of a service should be a quantitative measure of customer experience. SRE helps you
find a suitable metric based on the characteristics of your service
• The metric chosen to measure the level of service provided to users is called an SLI. In simple words, it is a
quantitative measure of user experience
• How an SLI is measured depends on the implementation of the service and the environment
in which it operates
Relationship between SLI & SLO
• The SLO is the target we choose, evaluated over a period of time (e.g. 99% of requests are served within 2 seconds over the last 4 weeks)
• The SLI is how the service is performing against that target at a given point in time
• Comparing the SLI with the SLO target tells us whether a given period was good or bad
• SLOs can differ by time period, customer type, frequency of SLO misses, etc.; the concept of an error budget
helps you manage this
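To make the example SLO above concrete (99% of requests served within 2 seconds), here is a minimal sketch that computes a latency SLI per day and marks each day as good or bad against the target. The per-day granularity, the sample latencies, and all names are assumptions made for illustration.

```python
from typing import Dict, List

SLO_TARGET = 0.99          # 99% of requests ...
LATENCY_THRESHOLD_S = 2.0  # ... served within 2 seconds


def latency_sli(latencies_s: List[float],
                threshold_s: float = LATENCY_THRESHOLD_S) -> float:
    """Fraction of requests served within the latency threshold."""
    if not latencies_s:
        raise ValueError("no requests in this period")
    good = sum(1 for latency in latencies_s if latency <= threshold_s)
    return good / len(latencies_s)


def classify_periods(latencies_by_day: Dict[str, List[float]],
                     target: float = SLO_TARGET) -> Dict[str, bool]:
    """Map each day to True (SLI met the target) or False (it did not)."""
    return {day: latency_sli(values) >= target
            for day, values in latencies_by_day.items()}


if __name__ == "__main__":
    sample = {
        "day-1": [0.3] * 99 + [2.5],      # 99 of 100 requests within 2 s
        "day-2": [0.3] * 95 + [3.0] * 5,  # 95 of 100 requests within 2 s
    }
    print(classify_periods(sample))
```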
How SRE helps
• SRE provides an SLI menu for typical user journeys (system characteristics)
• SRE provides a simple formula to measure SLIs. It is always a ratio (good events / valid events)
• SRE provides blueprints for implementing SLI capture mechanisms, along with their tradeoffs
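The ratio formula above (good events / valid events) can be sketched as follows. The `Event` type and its two flags are hypothetical; what counts as "valid" and "good" depends on the service being measured.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Event:
    valid: bool  # counts toward the SLI (e.g. a real user request, not a test probe)
    good: bool   # met the success criterion (e.g. answered correctly and fast enough)


def sli(events: Iterable[Event]) -> float:
    """SLI = good events / valid events, as a fraction between 0 and 1."""
    valid = [event for event in events if event.valid]
    if not valid:
        raise ValueError("no valid events to measure")
    good = sum(1 for event in valid if event.good)
    return good / len(valid)


if __name__ == "__main__":
    events = (
        [Event(valid=True, good=True)] * 3
        + [Event(valid=True, good=False)]
        + [Event(valid=False, good=False)] * 5  # excluded from the ratio
    )
    print(f"SLI = {sli(events):.2%}")
```

Invalid events are left out of both the numerator and the denominator, so traffic the service is not responsible for cannot drag the SLI down.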
Error Budget
• Identifying, documenting, and agreeing on SLOs and SLIs is great progress, but how do
we make all this work?
• The error budget is useful to:
  • actively balance the reliability of the system against progress on other features in a coherent manner
  • inform everyone how much headroom is available before customer experience is impacted
  • state quantitatively how much failure or unreliability is allowed
• E.g.
  • If the intended reliability is 99.9%, the error budget is 0.1%
  • A 0.1% error budget = 40.32 minutes of downtime over 28 days
  • These 40.32 minutes are the error budget implied by the SLO we agree with all stakeholders. That means we have
40.32 minutes in total, over the 28 days, to recover from any failure. A failure can have any cause: HDD failure, bad code,
maintenance error, etc.
• It prompts a lot of useful thinking.
  • Assume the reliability target for your platform is 95% over 28 days. That means you are allowed
1.4 days of downtime. Do you really need CI/CD, blue-green deployment, test automation,
etc.?
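The arithmetic behind the examples above, and the idea of balancing reliability against new features, can be sketched like this. The function names and the release rule in `can_release` are hypothetical; a real error budget policy is something each organization agrees on for itself.

```python
from typing import Sequence


def error_budget_minutes(slo_percent: float, window_days: int = 28) -> float:
    """Total allowed downtime, in minutes, for an SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def remaining_budget_minutes(slo_percent: float,
                             downtime_minutes: Sequence[float],
                             window_days: int = 28) -> float:
    """Budget left after recorded downtime (a negative value means overspent)."""
    return error_budget_minutes(slo_percent, window_days) - sum(downtime_minutes)


def can_release(slo_percent: float,
                downtime_minutes: Sequence[float],
                window_days: int = 28) -> bool:
    """Hypothetical policy: ship risky changes only while budget remains."""
    return remaining_budget_minutes(slo_percent, downtime_minutes, window_days) > 0


if __name__ == "__main__":
    print(round(error_budget_minutes(99.9), 2))              # 40.32 (minutes)
    print(round(error_budget_minutes(95.0) / (24 * 60), 2))  # 1.4 (days)
    print(can_release(99.9, downtime_minutes=[12.0, 7.5]))
```

With a 99.9% target, two incidents of 12 and 7.5 minutes still leave budget for further change; a single 45-minute outage would not.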
Toil
• Toil is work tied to running a production system or service
• Toil satisfies the following conditions:
  • Manual
  • Repetitive
  • Automatable
  • Tactical
  • Devoid of long-term value
• Overhead (attending meetings, responding to email, etc.) is not toil
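The conditions above can be read as a checklist. The sketch below encodes that reading; the `Task` type, its field names, and the rule that every condition must hold are assumptions made for illustration, since in practice judging toil involves some discretion.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    production_related: bool  # tied to running a production system or service
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    long_term_value: bool


def is_toil(task: Task) -> bool:
    """True only if the task meets every condition in the list above."""
    return (task.production_related
            and task.manual
            and task.repetitive
            and task.automatable
            and task.tactical
            and not task.long_term_value)


if __name__ == "__main__":
    restart = Task("restart a stuck service by hand", True, True, True, True, True, False)
    meeting = Task("attend a planning meeting", False, True, True, False, False, False)
    for task in (restart, meeting):
        print(f"{task.name}: {'toil' if is_toil(task) else 'not toil'}")
```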
Not covered
• Detailed steps and workshops for developing SLOs and SLIs
• Setting achievable SLO targets
• Define SLIs
• Manage growth of SLI parameter
• SLI menu, implementation patterns, tradeoffs and cost analysis
• Define and analyze error budget
• Error budget policy, thresholds and scenarios
• Identify and address SLO risks
• Consequences of missing SLO
• There is much more
References
• SRE Introduction – Set of videos about SRE introduction
• SRE – How Google Runs Production Systems
• SRE Workbook – Practical ways to implement SRE
Thank you

Mais conteĂşdo relacionado

Mais procurados

Service Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIService Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIKnoldus Inc.
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkMichae Blakeney
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRESquadcast Inc
 
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...ITSM Academy, Inc.
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceFranklin Angulo
 
What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)jeetendra mandal
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...DevOpsDays Tel Aviv
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)Setyo Legowo
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLADr Ganesh Iyer
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationDr Ganesh Iyer
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!New Relic
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityAcquia
 
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7NUS-ISS
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsRauno De Pasquale
 
Reconstructing the SRE
Reconstructing the SREReconstructing the SRE
Reconstructing the SREBob Wise
 
Rapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsRapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsMarc Hornbeek
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringMichael Kehoe
 

Mais procurados (20)

Service Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIService Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLI
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning Talk
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRE
 
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ Squarespace
 
SRE vs DevOps
SRE vs DevOpsSRE vs DevOps
SRE vs DevOps
 
SRE 101
SRE 101SRE 101
SRE 101
 
What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)What is Site Reliability Engineering (SRE)
What is Site Reliability Engineering (SRE)
 
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
Implementing SRE practices: SLI/SLO deep dive - David Blank-Edelman - DevOpsD...
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil Elimination
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
 
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7Site Reliability Engineer (SRE), We Keep The Lights On 24/7
Site Reliability Engineer (SRE), We Keep The Lights On 24/7
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE Concepts
 
Reconstructing the SRE
Reconstructing the SREReconstructing the SRE
Reconstructing the SRE
 
Rapid Strategic SRE Assessments
Rapid Strategic SRE AssessmentsRapid Strategic SRE Assessments
Rapid Strategic SRE Assessments
 
The Next Wave of Reliability Engineering
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability Engineering
 

Semelhante a Sre summary

What is DevOps? How can it impact my Customers and my Business
What is DevOps? How can it impact my Customers and my BusinessWhat is DevOps? How can it impact my Customers and my Business
What is DevOps? How can it impact my Customers and my BusinessQualitest
 
Scaled Agile FrameworkÂŽ Overview
Scaled Agile FrameworkÂŽ OverviewScaled Agile FrameworkÂŽ Overview
Scaled Agile FrameworkÂŽ OverviewCprime
 
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...Business of Software Conference
 
S.R.E - create ultra-scalable and highly reliable systems
S.R.E - create ultra-scalable and highly reliable systemsS.R.E - create ultra-scalable and highly reliable systems
S.R.E - create ultra-scalable and highly reliable systemsRicardo Amaro
 
Stc 2016 regional-round-ppt-automation testing with devops in agile methodolgy
Stc 2016 regional-round-ppt-automation testing with devops in agile methodolgyStc 2016 regional-round-ppt-automation testing with devops in agile methodolgy
Stc 2016 regional-round-ppt-automation testing with devops in agile methodolgyArchana Krushnan
 
TDWI STL 20140613 Agile - Paul Holway
TDWI STL 20140613 Agile - Paul HolwayTDWI STL 20140613 Agile - Paul Holway
TDWI STL 20140613 Agile - Paul HolwayTDWI St. Louis
 
Patching is Your Friend in the New World Order of EPM and ERP Cloud
Patching is Your Friend in the New World Order of EPM and ERP CloudPatching is Your Friend in the New World Order of EPM and ERP Cloud
Patching is Your Friend in the New World Order of EPM and ERP CloudDatavail
 
Dev ops training in chennai
Dev ops training in chennaiDev ops training in chennai
Dev ops training in chennairaj esaki
 
Puppet Labs EMC DevOps Day NYC Aug-2015
Puppet Labs  EMC DevOps Day NYC Aug-2015Puppet Labs  EMC DevOps Day NYC Aug-2015
Puppet Labs EMC DevOps Day NYC Aug-2015Bob Sokol
 
How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...Puppet
 
Deliver Fast and Reliably with Dev Ops and Atlassian
Deliver Fast and Reliably with Dev Ops and AtlassianDeliver Fast and Reliably with Dev Ops and Atlassian
Deliver Fast and Reliably with Dev Ops and AtlassianXpand IT
 
Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce Salesforce Engineering
 
Applying both of waterfall and iterative development
Applying both of waterfall and iterative developmentApplying both of waterfall and iterative development
Applying both of waterfall and iterative developmentDeny Prasetia
 
Erp implementation guide
Erp implementation guideErp implementation guide
Erp implementation guidePPTS India Pvt Ltd
 
Sdec10 lean package implementation
Sdec10 lean package implementationSdec10 lean package implementation
Sdec10 lean package implementationTerry Bunio
 
Agile Course Presentation
Agile Course PresentationAgile Course Presentation
Agile Course PresentationSoumya De
 
An Agile Overview @ ShoreTel Sky
An Agile Overview @ ShoreTel SkyAn Agile Overview @ ShoreTel Sky
An Agile Overview @ ShoreTel Skygirabrent
 

Semelhante a Sre summary (20)

What is DevOps? How can it impact my Customers and my Business
What is DevOps? How can it impact my Customers and my BusinessWhat is DevOps? How can it impact my Customers and my Business
What is DevOps? How can it impact my Customers and my Business
 
Scaled Agile FrameworkÂŽ Overview
Scaled Agile FrameworkÂŽ OverviewScaled Agile FrameworkÂŽ Overview
Scaled Agile FrameworkÂŽ Overview
 
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
 
S.R.E - create ultra-scalable and highly reliable systems
S.R.E - create ultra-scalable and highly reliable systemsS.R.E - create ultra-scalable and highly reliable systems
S.R.E - create ultra-scalable and highly reliable systems
 
Dev ops
Dev opsDev ops
Dev ops
 
Stc 2016 regional-round-ppt-automation testing with devops in agile methodolgy
Stc 2016 regional-round-ppt-automation testing with devops in agile methodolgyStc 2016 regional-round-ppt-automation testing with devops in agile methodolgy
Stc 2016 regional-round-ppt-automation testing with devops in agile methodolgy
 
DevOps 101
DevOps 101DevOps 101
DevOps 101
 
TDWI STL 20140613 Agile - Paul Holway
TDWI STL 20140613 Agile - Paul HolwayTDWI STL 20140613 Agile - Paul Holway
TDWI STL 20140613 Agile - Paul Holway
 
Patching is Your Friend in the New World Order of EPM and ERP Cloud
Patching is Your Friend in the New World Order of EPM and ERP CloudPatching is Your Friend in the New World Order of EPM and ERP Cloud
Patching is Your Friend in the New World Order of EPM and ERP Cloud
 
Dev ops training in chennai
Dev ops training in chennaiDev ops training in chennai
Dev ops training in chennai
 
Puppet Labs EMC DevOps Day NYC Aug-2015
Puppet Labs  EMC DevOps Day NYC Aug-2015Puppet Labs  EMC DevOps Day NYC Aug-2015
Puppet Labs EMC DevOps Day NYC Aug-2015
 
How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...How to Build High-Performing IT Teams - Including New Data on IT Performance ...
How to Build High-Performing IT Teams - Including New Data on IT Performance ...
 
Deliver Fast and Reliably with Dev Ops and Atlassian
Deliver Fast and Reliably with Dev Ops and AtlassianDeliver Fast and Reliably with Dev Ops and Atlassian
Deliver Fast and Reliably with Dev Ops and Atlassian
 
Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce Quality Testing and Agile at Salesforce
Quality Testing and Agile at Salesforce
 
Applying both of waterfall and iterative development
Applying both of waterfall and iterative developmentApplying both of waterfall and iterative development
Applying both of waterfall and iterative development
 
Erp implementation guide
Erp implementation guideErp implementation guide
Erp implementation guide
 
Sdec10 lean package implementation
Sdec10 lean package implementationSdec10 lean package implementation
Sdec10 lean package implementation
 
Agile Course Presentation
Agile Course PresentationAgile Course Presentation
Agile Course Presentation
 
Agile 101
Agile 101Agile 101
Agile 101
 
An Agile Overview @ ShoreTel Sky
An Agile Overview @ ShoreTel SkyAn Agile Overview @ ShoreTel Sky
An Agile Overview @ ShoreTel Sky
 

Último

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraĂşjo
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Último (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Sre summary

  • 1. A quick summary of SRE – Site Reliability Engineering Yogesh shah
  • 2. Agenda • What is SRE & its background • Before going to SRE • SRE and DevOps • Components of SRE • Reliability • SLA • SLO • SLI • Error budget • Toil • Things we did not cover • References
  • 3. What is SRE, History & Background SRE = Site Reliability Engineering. The term SRE originated at Google more than a decade ago, and it has been the backbone of Google’s highly reliable & valuable suite of products & services. Google didn’t make the details of SRE public, as it considered them the secret sauce of its success. When the DevOps movement started, Google could see that there was a lot of interest in implementing DevOps, but there was no clear path and people were struggling to implement it.
  • 4. Scrum, SAFe, Lean, DevOps … now SRE… • Framework direction: Dev → Ops • Flexibility: Rigid → open for interpretation • Ease of implementation: Easy → very hard • Fit for market demand: Less → high
    Software delivery mechanism | What is at the center | Type | Advantages | Difficulties
    Waterfall / Project management | Centers around: Plan; Outcome: Fixed target | Process | Easy to implement; Scope, time, cost fixed | Changing requirement; Too heavy, complex & costly
    ITIL | Centers around: SLA; Outcome: Predefined service quality | Framework | Easy to implement; Clear accountability; Predictable service quality | Meet SLA != customer satisfaction; Too heavy & complex
    Scrum / SAFe | Centers around: Timebox, Focus; Outcome: Delivery of changing requirement | Framework | Simple to understand | Difficult to implement; Works best in pockets but consistency is hard to achieve
    Lean | Centers around: Flow of work; Outcome: Removal of waste | Methodology | Easy to implement; Clear accountability; Predictable service quality | Meet SLA != customer satisfaction; Too heavy & complex
    DevOps | Centers around: Unify Dev & Ops; Outcome: End to end accountability for Dev & Ops | Philosophy | Great vision | Open to interpretation
  • 5. What is SRE in comparison to the others • Centers around: Reliability • Outcome: Customer satisfaction with control over the balance of enhancement & reliability • Type: Implementation pattern • Advantage: Implements DevOps • Disadvantage: None :) • Addresses the so far neglected question “is the system ready to handle change without impacting customer experience?” • SRE happens when a software engineer is tasked with what used to be called operations.
  • 6. SRE and DevOps But what is DevOps? DevOps is about a combined team (Dev & Ops) using a common set of tools & processes to deliver any software change. SRE is an implementation of DevOps. DevOps: • Reduce organization silos • Accept failure as normal • Implement gradual change • Leverage tooling & automation • Measure everything. SRE: • Shares ownership with developers by using the same tools and techniques across the stack • Has a formula for balancing accidents and failures against new releases • Encourages moving quickly by reducing the cost of failure • Encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system • Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.
  • 8. Defining Reliability. The most important feature of any system is its reliability: • A clunky system with great features doesn’t work • 100% reliability is most often the wrong target as it slows down velocity • Reliability beyond a certain point has diminishing returns • Each additional 9 after the decimal point makes the system 10 times more reliable but costs 10 times more. User experience decides reliability: • Users, not monitoring metrics, decide reliability; hence, to say a system is reliable, one needs to measure user experience. Engineering and talented developers alone are not enough for highly reliable systems; a well-trained incident response team is a must: • To achieve highly reliable (99.999…) systems, a well-trained incident response team (proactive & reactive) is required. Talented developers and a well-engineered system are not enough on their own.
  • 9. Reliability • SRE helps define reliability in a clear way using the concept of an error budget • Thanks to the error budget, reliability is understood consistently across the organization • 100% reliability is the wrong target as it slows down velocity • User happiness and reliability are directly proportional up to a point; beyond that, the user doesn’t care
  • 10. SLA – Service Level Agreement • These are the agreements you make with your customers about the reliability of your service. An SLA has to have consequences if it is violated • Violating an SLA is costly in many respects; hence, getting an informative warning with enough time to react is a must to prevent an SLA violation
  • 11. SLO – Service Level Objectives • Reliability is a feature, hence it is prioritized against other functional features. However, prioritizing reliability is challenging, and SLOs are key to prioritizing reliability along with other features • The target for a specified level of reliability is the SLO. In other words, the SLO is the yardstick against which reliability is measured • An SLO should always be stricter than your SLA, because customers are usually impacted before the SLA is actually breached • An SLO is effectively an internal promise to meet customer expectations. Violating an SLO is a really important issue: you can no longer afford more outages, so you will want to take steps to remove risks from your service by devoting engineering and automation effort to reducing and eliminating areas of risk • A good rule of thumb for setting SLO targets is the “happiness test”: the threshold beyond which users tend to become grumpy due to degraded service performance • Identifying SLOs and setting their targets is an important but tough task, and SRE has clear guidelines to identify SLOs, set targets, and revise the SLOs, the targets, or both
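The relationship between SLO and SLA described on this slide can be sketched as a simple check. This is a minimal illustration, not an official SRE formula: the function name `reliability_status`, the status labels, and the example targets (99.9% SLO, 99.5% SLA) are all assumptions made for the example.

```python
def reliability_status(sli: float, slo_target: float, sla_target: float) -> str:
    """Classify a measured reliability value against an internal SLO and an external SLA.

    All values are fractions between 0 and 1 (e.g. 0.999 for 99.9%).
    """
    # The slide's rule: the SLO must be stricter than the SLA, so that an
    # SLO violation acts as an early warning before the SLA is breached.
    if slo_target <= sla_target:
        raise ValueError("SLO target should be stricter (higher) than the SLA target")
    if sli < sla_target:
        return "sla_breached"   # contractual consequences apply
    if sli < slo_target:
        return "slo_violated"   # internal promise missed, SLA still intact
    return "ok"


# Hypothetical targets: 99.9% SLO, 99.5% SLA
for measured in (0.9995, 0.997, 0.99):
    print(measured, reliability_status(measured, slo_target=0.999, sla_target=0.995))
```

The middle case is the useful one: the service is already below its SLO but still above the SLA, which is the window in which the team can react.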
  • 12. SLI – Service Level Indicators. What is an SLI • Now we understand what reliability is, but how do we measure it? • The reliability of a service should be a quantitative measure of customer experience. SRE helps you find a suitable metric based on the characteristics of your service • The metric chosen to measure the level of service provided to users is called the SLI. In simple words, it is a quantitative measure of user experience • How an SLI is measured depends on the implementation and the environment in which the service operates. Relationship between SLI & SLO • The SLI is how the service is performing against the target at a given point in time • The SLO is the target we choose, against which the SLI is measured over a period of time (e.g. 99% of requests are served within 2 seconds over the last 4 weeks) • The SLI tells us whether a given period was good or bad, based on the measured SLI against the SLO target • SLOs can differ for different times, different customer types, frequency of SLO misses, etc.; however, the concept of an error budget helps you manage this. How SRE helps • SRE provides an SLI menu for typical user journeys (system characteristics) • SRE provides a simple formula to measure SLIs: it is always a ratio (good events / valid events) • SRE provides blueprints to implement the SLI capture mechanism, along with their tradeoffs
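The "good events / valid events" ratio can be illustrated with the slide's own example of requests served within 2 seconds. This is a minimal sketch under stated assumptions: the `Request` type, its `valid` flag (for events excluded from the SLI, such as health checks), and the sample data are hypothetical, and treating exactly 2.0 seconds as "good" is a choice made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    latency_seconds: float
    valid: bool = True  # hypothetical flag: False for events excluded from the SLI


def latency_sli(requests: List[Request], threshold_seconds: float = 2.0) -> Optional[float]:
    """Return good events / valid events, or None if there are no valid events."""
    valid = [r for r in requests if r.valid]
    if not valid:
        return None  # no valid events, so the ratio is undefined
    good = sum(1 for r in valid if r.latency_seconds <= threshold_seconds)
    return good / len(valid)


def meets_slo(sli: Optional[float], slo_target: float = 0.99) -> bool:
    """Compare the measured SLI with the SLO target (99% by default)."""
    return sli is not None and sli >= slo_target


# Made-up measurement window: 98 fast requests, 2 slow ones, 1 excluded event
window = [Request(0.3)] * 98 + [Request(2.5)] * 2 + [Request(9.0, valid=False)]
sli = latency_sli(window)
print(f"SLI = {sli:.2%}, meets 99% SLO: {meets_slo(sli)}")
```

In this made-up window 98 of 100 valid requests are good, so the SLI is 98% and the 99% SLO is missed; the excluded event does not count either way.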
  • 13. Error Budget • Identifying, documenting and agreeing on SLOs and SLIs can be great progress, but how can we make all this work? • An error budget is useful: • To actively balance the reliability of the system against progress on other features in a coherent manner • To inform everyone how much headroom is available before customer experience is impacted • To state quantitatively how much failure or unreliability is allowed • E.g. • If the intended reliability is 99.9%, the error budget is 0.1% • 0.1% error budget = 40.32 mins of downtime over 28 days • These 40.32 mins are the error budget that follows from the SLO we agree with all stakeholders. That means we have 40.32 mins for recovering from any failure. A failure can have any cause: HDD failure, bad code, maintenance error, etc. • It prompts a lot of useful thinking • Assume that the reliability target for your platform is 95% over 28 days. That means you are allowed 1.4 days of downtime. Now, do you really need CI/CD, blue-green deployment, test automation, etc.?
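The arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch: the function names and the 25 minutes of downtime used in the last step are assumptions for illustration, while the 28-day window and the 99.9% and 95% targets come from the slide.

```python
def error_budget_minutes(slo_target: float, window_days: int = 28) -> float:
    """Allowed downtime in minutes: (1 - SLO target) * length of the window."""
    window_minutes = window_days * 24 * 60  # 28 days = 40,320 minutes
    return (1.0 - slo_target) * window_minutes


def budget_remaining_minutes(slo_target: float, downtime_minutes: float,
                             window_days: int = 28) -> float:
    """Error budget left after the downtime already spent in the window."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# 99.9% over 28 days: 0.1% of 40,320 minutes
print(f"{error_budget_minutes(0.999):.2f} minutes")

# 95% over 28 days, expressed in days
print(f"{error_budget_minutes(0.95) / (24 * 60):.2f} days")

# Hypothetical: 25 minutes of downtime already spent against a 99.9% target
print(f"{budget_remaining_minutes(0.999, 25):.2f} minutes left")
```

A negative remaining budget would mean the reliability target for the window has already been missed.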
  • 14. Toil • Toil is work related to running a production system/service • Toil satisfies the following conditions: • Manual • Repetitive • Automatable • Tactical • Devoid of long-term value • Overhead (attending meetings, responding to email, etc.) is not toil
  • 15. Not covered • Detailed steps and workshops for developing SLOs and SLIs • Setting achievable SLO targets • Defining SLIs • Managing growth of SLI parameters • SLI menu, implementation patterns, tradeoffs and cost analysis • Defining and analyzing the error budget • Error budget policy, thresholds and scenarios • Identifying and addressing SLO risks • Consequences of missing an SLO • There is much more
  • 16. References • SRE Introduction – Set of videos about SRE introduction • SRE – How Google Runs Production Systems • SRE Workbook – Practical Ways to Implement SRE