SlideShare uma empresa Scribd logo
1 de 26
Baixar para ler offline
Site Reliability Engineering
Presenter Name: Keet Malin Sugathadasa
Designation: Associate Technical Lead
Presented By
Keet Malin Sugathadasa
Associate Tech Lead at Cognite
More than 3 years of experience in
various roles related to Software
Engineering
Contributor to NPM and
Stackoverflow
Research Interests –Cyber
Security, Cloud Computing,
Distributed Computing.
AGENDA
• What is Site Reliability Engineering (SRE)
• The 5 Pillars of SRE
• SLOs, SLIs, SLAs
• Error Budgets
• Toil
• Ensuring Successful operations of a
production system
What is DevOps
Like Agile came in to remove the gap between BA &
Dev, DevOps made the gap between Dev & Ops go
away
What is SRE?
• DevOps has been a community built set of practices, a culture;
• while SRE was groomed inside Google as a secret sauce.
Reduce Organizational Silos
• SRE teams share ownership of production with
developers
• SRE teams get involved in development at very early
stages
• But products may not start with SRE support at first.
When onboarding, following items get checked
• System architecture and interservice dependencies
• Instrumentation, metrics, and monitoring
• Emergency response
• Capacity planning
• Change management
• Performance: availability, latency, and efficiency
Reduce Silos
Accept Failure as Normal
Blameless Postmortems
• When things have actually gone bazooka,
who’s fault is it?
• Answer: Nobody’s. It's the system’s fault.
It allowed people to act that way!
• Ask WHY not WHO!
If nobody is blamed, people open up, and
then the root cause cascade opens up.
Agility[Devs] vs Stability[Ops]
• What is availability?
• Clear definitions
• How available you want to be?
• Clear numerical indicators
• What to do when availability is
not met?
SLI - SLO - SLA : Service Level what?
Service Level Indicator: A metric aggregated over time, ( 90th percentile, median )
• Batch throughput
• Failures per request
• Is the ratios of errors to total number of requests received in last 5 minutes < 1%?
• Request latency
• Is the average latency of requests in last 5 minutes < 300ms?
• Is the 90th percentile of the latency of requests in last 5 minutes < 300ms?
Service Level Objectives: Number which SLI needs to be
• Is above indicator is YES 99.9% of the time?
• Monitor the SLIs over a long time and decide this
Service Level Agreement: A legal agreement
• The the level of reliability I promise & what will I do if I do not
• Usually based on SLOs but a business agreement
Risk and availability
• 100% availability is impossible.
• Each 9 you add to the SLO,
increases your cost
• Each 9 you add, you lose your
comfort
Error Budgets
• Once you decide the SLO, you get X number of minutes to go unavailable.
• X is your Error Budget
• If you reach that budget, you cannot release new features anymore
• Under AND over spending is bad.
Implement Gradual Change
Gradual change
• Updates should be pushed as canaries, not as bulk version changes
• Less code change means lesser mean time to recover on failure
• Rate of change would depend on selection of SLO
Tooling & Automation
Toil
Toil is the manual repetitive work tied to running in PROD ( which can be
automated )
Toil & Toil budget
SREs actively measure Toil. Toil budget should be
around 30% to 50%
If toil is not kept at its margins, it fills up to 100%
easily
But a little amount of toil is not harmful.
• Automation might be harder than the manual
work
• Helps newcomers to orient themselves
Measuring
Service reliability needs to be measured
• Uptime
• Mean time to failure
• Mean time to recover
Whatsapp (Example Use case)
• Message Delivery Time
• Message Throughput
• Image Resolution (Compression Algorithm)
• Video Compression Quality
• Etc etc
Hope is not a
Strategy!
Thank you

Mais conteúdo relacionado

Mais procurados

Service Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIService Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLI
Knoldus Inc.
 

Mais procurados (20)

SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
SRE-iously! Defining the Principles, Habits, and Practices of Site Reliabilit...
 
Building an SRE Organization @ Squarespace
Building an SRE Organization @ SquarespaceBuilding an SRE Organization @ Squarespace
Building an SRE Organization @ Squarespace
 
Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practices
 
DevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE ConceptsDevOps Torino Meetup - SRE Concepts
DevOps Torino Meetup - SRE Concepts
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRE
 
SRE-iously! Reliability!
SRE-iously! Reliability!SRE-iously! Reliability!
SRE-iously! Reliability!
 
Reconstructing the SRE
Reconstructing the SREReconstructing the SRE
Reconstructing the SRE
 
How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)How Small Team Get Ready for SRE (public version)
How Small Team Get Ready for SRE (public version)
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...
 
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
Site Reliability Engineering: An Enterprise Adoption Story (an ITSM Academy W...
 
SRE in Startup
SRE in StartupSRE in Startup
SRE in Startup
 
Site reliability engineering - Lightning Talk
Site reliability engineering - Lightning TalkSite reliability engineering - Lightning Talk
Site reliability engineering - Lightning Talk
 
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)
 
SRE vs DevOps
SRE vs DevOpsSRE vs DevOps
SRE vs DevOps
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil Elimination
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
 
DevOps & SRE at Google Scale
DevOps & SRE at Google ScaleDevOps & SRE at Google Scale
DevOps & SRE at Google Scale
 
Service Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLIService Level Terminology : SLA ,SLO & SLI
Service Level Terminology : SLA ,SLO & SLI
 
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
Observability for Modern Applications (CON306-R1) - AWS re:Invent 2018
 
SRE From Scratch
SRE From ScratchSRE From Scratch
SRE From Scratch
 

Semelhante a Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa

Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous DeliveryWebinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
XebiaLabs
 
Can you process 10 trillion logs per day software architecture conference 2015
Can you process 10 trillion logs per day software architecture conference 2015Can you process 10 trillion logs per day software architecture conference 2015
Can you process 10 trillion logs per day software architecture conference 2015
Sumo Logic
 

Semelhante a Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa (20)

Hidden Costs of Chasing the Mythical 'Five Nines'
Hidden Costs of Chasing the Mythical 'Five Nines'Hidden Costs of Chasing the Mythical 'Five Nines'
Hidden Costs of Chasing the Mythical 'Five Nines'
 
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous DeliveryWebinar: Demonstrating Business Value for DevOps & Continuous Delivery
Webinar: Demonstrating Business Value for DevOps & Continuous Delivery
 
Adapting Scrum in an Organization with Tailored Processes
Adapting Scrum in an Organization with Tailored ProcessesAdapting Scrum in an Organization with Tailored Processes
Adapting Scrum in an Organization with Tailored Processes
 
DevOps 101
DevOps 101DevOps 101
DevOps 101
 
DevOps By The Numbers
DevOps By The NumbersDevOps By The Numbers
DevOps By The Numbers
 
Kanban testing
Kanban testingKanban testing
Kanban testing
 
DevOps 101
DevOps 101DevOps 101
DevOps 101
 
Top 10 Agile Metrics
Top 10 Agile MetricsTop 10 Agile Metrics
Top 10 Agile Metrics
 
Agile Transformation: People, Process and Tools to Make Your Transformation S...
Agile Transformation: People, Process and Tools to Make Your Transformation S...Agile Transformation: People, Process and Tools to Make Your Transformation S...
Agile Transformation: People, Process and Tools to Make Your Transformation S...
 
How to test a Mainframe Application
How to test a Mainframe ApplicationHow to test a Mainframe Application
How to test a Mainframe Application
 
Analyst Keynote: Continuous Delivery: Making DevOps Awesome
Analyst Keynote: Continuous Delivery: Making DevOps AwesomeAnalyst Keynote: Continuous Delivery: Making DevOps Awesome
Analyst Keynote: Continuous Delivery: Making DevOps Awesome
 
Can you process 10 trillion logs per day software architecture conference 2015
Can you process 10 trillion logs per day software architecture conference 2015Can you process 10 trillion logs per day software architecture conference 2015
Can you process 10 trillion logs per day software architecture conference 2015
 
ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015
 
Patching is Your Friend in the New World Order of EPM and ERP Cloud
Patching is Your Friend in the New World Order of EPM and ERP CloudPatching is Your Friend in the New World Order of EPM and ERP Cloud
Patching is Your Friend in the New World Order of EPM and ERP Cloud
 
Measuring DevOps Performance
Measuring DevOps PerformanceMeasuring DevOps Performance
Measuring DevOps Performance
 
Scaling unstable systems velocity 2015
Scaling unstable systems   velocity 2015Scaling unstable systems   velocity 2015
Scaling unstable systems velocity 2015
 
Approaching Quality in Digital Era
Approaching Quality in Digital EraApproaching Quality in Digital Era
Approaching Quality in Digital Era
 
Serverless Days Helsinki 2019 Rolf Koski - Business Driven Availability
Serverless Days Helsinki 2019 Rolf Koski - Business Driven AvailabilityServerless Days Helsinki 2019 Rolf Koski - Business Driven Availability
Serverless Days Helsinki 2019 Rolf Koski - Business Driven Availability
 
What is DevOps? How can it impact my Customers and my Business
What is DevOps? How can it impact my Customers and my BusinessWhat is DevOps? How can it impact my Customers and my Business
What is DevOps? How can it impact my Customers and my Business
 
Delivering A Great End User Experience
Delivering A Great End User ExperienceDelivering A Great End User Experience
Delivering A Great End User Experience
 

Mais de Keet Sugathadasa

Mais de Keet Sugathadasa (9)

Chaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in ProductionChaos Engineering - The Art of Breaking Things in Production
Chaos Engineering - The Art of Breaking Things in Production
 
Human Computer Interaction - Facebook Messenger
Human Computer Interaction - Facebook MessengerHuman Computer Interaction - Facebook Messenger
Human Computer Interaction - Facebook Messenger
 
Cyber Security and Cloud Computing
Cyber Security and Cloud ComputingCyber Security and Cloud Computing
Cyber Security and Cloud Computing
 
How to compete in hackathons
How to compete in hackathonsHow to compete in hackathons
How to compete in hackathons
 
Quality Engineering - When to Stop Testing
Quality Engineering - When to Stop TestingQuality Engineering - When to Stop Testing
Quality Engineering - When to Stop Testing
 
Training Report WSO2 internship
Training Report  WSO2 internshipTraining Report  WSO2 internship
Training Report WSO2 internship
 
Object oriented programming interview questions
Object oriented programming interview questionsObject oriented programming interview questions
Object oriented programming interview questions
 
Interview Facing Workshop
Interview Facing WorkshopInterview Facing Workshop
Interview Facing Workshop
 
Revolutionizing digital authentication with gsma mobile connect
Revolutionizing digital authentication with gsma mobile connectRevolutionizing digital authentication with gsma mobile connect
Revolutionizing digital authentication with gsma mobile connect
 

Último

"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
mphochane1998
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
jaanualu31
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
Kamal Acharya
 

Último (20)

Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
Moment Distribution Method For Btech Civil
Moment Distribution Method For Btech CivilMoment Distribution Method For Btech Civil
Moment Distribution Method For Btech Civil
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Hospital management system project report.pdf
Hospital management system project report.pdfHospital management system project report.pdf
Hospital management system project report.pdf
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 

Site Reliability Engineering (SRE) - Tech Talk by Keet Sugathadasa

  • 1. Site Reliability Engineering Presenter Name: Keet Malin Sugathadasa Designation: Associate Technical Lead
  • 2. Presented By Keet Malin Sugathadasa Associate Tech Lead at Cognite More than 3 years of experience in various roles related to Software Engineering Contributor to NPM and Stackoverflow Research Interests –Cyber Security, Cloud Computing, Distributed Computing.
  • 3. AGENDA • What is Site Reliability Engineering (SRE) • The 5 Pillars of SRE • SLOs, SLIs, SLAs • Error Budgets • Toil • Ensuring Successful operations of a production system
  • 4. What is DevOps Like Agile came in to remove the gap between BA & Dev, DevOps made the gap between Dev & Ops go away
  • 5. What is SRE? • DevOps has been a community built set of practices, a culture; • while SRE was groomed inside Google as a secret sauce.
  • 6.
  • 7.
  • 9. • SRE teams share ownership of production with developers • SRE teams get involved in development at very early stages • But products may not start with SRE support at first. When onboarding, following items get checked • System architecture and interservice dependencies • Instrumentation, metrics, and monitoring • Emergency response • Capacity planning • Change management • Performance: availability, latency, and efficiency Reduce Silos
  • 11. Blameless Postmortems • When things have actually gone bazooka, who’s fault is it? • Answer: Nobody’s. It's the system’s fault. It allowed people to act that way! • Ask WHY not WHO! If nobody is blamed, people open up, and then the root cause cascade opens up.
  • 12. Agility[Devs] vs Stability[Ops] • What is availability? • Clear definitions • How available you want to be? • Clear numerical indicators • What to do when availability is not met?
  • 13. SLI - SLO - SLA : Service Level what? Service Level Indicator: A metric aggregated over time, ( 90th percentile, median ) • Batch throughput • Failures per request • Is the ratios of errors to total number of requests received in last 5 minutes < 1%? • Request latency • Is the average latency of requests in last 5 minutes < 300ms? • Is the 90th percentile of the latency of requests in last 5 minutes < 300ms? Service Level Objectives: Number which SLI needs to be • Is above indicator is YES 99.9% of the time? • Monitor the SLIs over a long time and decide this Service Level Agreement: A legal agreement • The the level of reliability I promise & what will I do if I do not • Usually based on SLOs but a business agreement
  • 14.
  • 15. Risk and availability • 100% availability is impossible. • Each 9 you add to the SLO, increases your cost • Each 9 you add, you lose your comfort
  • 16. Error Budgets • Once you decide the SLO, you get X number of minutes to go unavailable. • X is your Error Budget • If you reach that budget, you cannot release new features anymore • Under AND over spending is bad.
  • 17.
  • 19. Gradual change • Updates should be pushed as canaries, not as bulk version changes • Less code change means lesser mean time to recover on failure • Rate of change would depend on selection of SLO
  • 21. Toil Toil is the manual repetitive work tied to running in PROD ( which can be automated )
  • 22. Toil & Toil budget SREs actively measure Toil. Toil budget should be around 30% to 50% If toil is not kept at its margins, it fills up to 100% easily But a little amount of toil is not harmful. • Automation might be harder than the manual work • Helps newcomers to orient themselves
  • 23. Measuring Service reliability needs to be measured • Uptime • Mean time to failure • Mean time to recover
  • 24. Whatsapp (Example Use case) • Message Delivery Time • Message Throughput • Image Resolution (Compression Algorithm) • Video Compression Quality • Etc etc
  • 25. Hope is not a Strategy!