SITE RELIABILITY ENGINEERING
FOR GROWING ORGANIZATIONS
My company in 20 seconds
• End-to-end payments platform
• API-first
• Docker, C#, ASP.NET, Java, PowerShell, SQL
• #31 Nilson top merchant acquirer
• Inc. 5000 fastest-growing company
• STL Top Place to Work
• Sound fun? It is. Come see me.
WHAT IS SRE?
• “Ops, if everything is treated as a software problem”
• Typically experienced software devs with a passion for automation & infrastructure
• Sort of like DevOps, but with more of a focus on production automation, resiliency and scalability
• Google wrote the book on it (Site Reliability Engineering), and many companies are adopting and exploring it
• SRE for Google won't be SRE for your team!
GROWING COMPANY PROBLEMS
EXPECTATIONS OFTEN OUTPACE CAPACITY
• If you don't set service level expectations, they will form around 100% uptime
• Keeping everything running as-is gets treated as a sunk cost
• Code atrophies or gets frozen, but the business keeps changing
GROWING COMPANY PROBLEMS
FINANCES GET MORE FORMAL - YOU NEED METRICS TO JUSTIFY ENHANCEMENTS
• How many more users can we support at our current growth rate?
• If you buy X, will that let us scale?
• If we agree to buy X, can we wait until next year's budget?
These are very hard questions to answer without data and documented expectations for performance & uptime.
GROWING COMPANY PROBLEMS
COMPLEXITY IS EVER INCREASING
You will need more automation to keep it running, not more people
• People are an ongoing cost; automation is a capitalizable investment
• With a bigger customer base, five-minute outages become damaging experiences. Machines can react faster.
• Four nines (99.99%) allows less than 5 minutes of downtime per month. How quickly can you triage an alert?
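The four-nines figure can be checked with a little arithmetic. The sketch below assumes a 30-day month and treats availability as simple wall-clock uptime; the function name `allowed_downtime_minutes` is made up for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 30) -> float:
    """Minutes of downtime permitted over the period at a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare a few common targets over an assumed 30-day month.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows {minutes:.2f} minutes of downtime")
```

At 99.99% this works out to about 4.32 minutes per 30 days, which leaves very little room for a human to notice, triage and fix an outage by hand.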
WHY YOU WANT A DEDICATED SRE TEAM
• Improve reaction time to incidents
‒ Focus can be spent on documenting tribal knowledge, minimizing mistakes and improving RTO (recovery time objective)
• Learn from mistakes, turn them into opportunities
‒ SRE teams can focus on blameless postmortems, extracting as much marrow as possible from incidents, then championing change
• Raise awareness of system behavior, weaknesses & strengths
‒ SRE can be an independent consulting agency or PR firm for dev teams
‒ SRE will create and/or publicize metrics to show facts
• Dedicate bandwidth to forward-looking system behaviors
‒ Usually this is done as time permits (which is limited when companies grow fast)
SOUNDS GOOD! HOW IS IT DONE?
AUTOMATED MONITORING
‒ Monitor externally, the way your customers see you, AND internally, the way you see yourself
‒ There will be false alarms, so not everyone should see these alerts
SOUNDS GOOD! HOW IS IT DONE?
LOG INDEXING AND AGGREGATION
SOUNDS GOOD! HOW IS IT DONE?
• Build self-healing systems when we can
‒ Service health checks & automated recovery actions
‒ Desired state configuration
‒ Service orchestration
• Document procedures/playbooks/runbooks when we can't
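One way to picture "self-heal when we can, fall back to a runbook when we can't" is a bounded recovery loop. This is a minimal sketch, not any particular orchestrator's behavior: `check` and `recover` are hypothetical callables you would supply (for example, a health probe and a service restart), and the three return values are invented labels.

```python
from typing import Callable


def heal_if_unhealthy(
    check: Callable[[], bool],
    recover: Callable[[], None],
    max_attempts: int = 3,
) -> str:
    """Run a health check; on failure, try a recovery action a bounded number of times.

    Returns "healthy" if no action was needed, "recovered" if a recovery
    attempt fixed it, or "escalate" if a human with a runbook should take over.
    """
    if check():
        return "healthy"
    for _ in range(max_attempts):
        recover()       # hypothetical action, e.g. restart the service
        if check():     # re-check after every attempt
            return "recovered"
    # Automation gave up: hand off to the documented procedure.
    return "escalate"
```

Capping the attempts is a deliberate choice in this sketch: unbounded automated recovery can itself become a source of load or flapping.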
HEALTH CHECKS
• More than just a socket connection
‒ Does a typical request return a 200 OK?
‒ How many 2xx/3xx responses vs. 4xx/5xx?
‒ Can you connect to your downstream dependencies?
‒ How long have you been up?
• Provide rich info, but quickly
‒ Other endpoints can give more expensive info
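The questions above could feed a health report along these lines. This is a sketch under assumptions: the field names, the 5% error-ratio threshold and the shape of `dependency_checks` are all invented for illustration, and a real service would expose the result over HTTP rather than return a dict.

```python
import time
from typing import Callable, Dict, Optional

# Hypothetical threshold: report unhealthy above 5% error responses.
MAX_ERROR_RATIO = 0.05


def health_report(
    status_counts: Dict[int, int],
    dependency_checks: Dict[str, Callable[[], bool]],
    started_at: float,
    now: Optional[float] = None,
) -> dict:
    """Summarize response codes, dependency reachability and uptime."""
    now = time.monotonic() if now is None else now

    # 2xx/3xx count as OK, 4xx/5xx count as errors.
    ok = sum(n for code, n in status_counts.items() if code < 400)
    errors = sum(n for code, n in status_counts.items() if code >= 400)
    total = ok + errors
    error_ratio = errors / total if total else 0.0

    # A dependency check that raises is treated as unreachable.
    dependencies = {}
    for name, check in dependency_checks.items():
        try:
            dependencies[name] = bool(check())
        except Exception:
            dependencies[name] = False

    healthy = error_ratio <= MAX_ERROR_RATIO and all(dependencies.values())
    return {
        "status": "healthy" if healthy else "unhealthy",
        "uptime_seconds": now - started_at,
        "responses": {"ok": ok, "error": errors},
        "dependencies": dependencies,
    }
```

Everything here reads counters that are already in memory, apart from the dependency probes, which should be kept cheap; slower diagnostics belong on a separate endpoint.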
SERVICE LEVELS
• SLOs – Service Level Objectives
‒ Where you’d like to be
• SLAs – Service Level Agreements
‒ Where you tell your customers you’ll be
‒ Come with penalties if missed
‒ More lenient than your SLOs
• Error Budgets
‒ Based on your SLO, how much risk can you tolerate?
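An error budget is simply the failure allowance implied by the SLO. The sketch below assumes a request-based SLO (share of successful requests); the `ErrorBudget` type and its field names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float    # failures the SLO tolerates for this traffic volume
    remaining_failures: float  # negative once the budget is overspent
    consumed_fraction: float   # 1.0 means the budget is exactly used up


def error_budget(slo_pct: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute how much of the error budget a period's failures have used."""
    if not 0 < slo_pct <= 100:
        raise ValueError("slo_pct must be in (0, 100]")
    allowed = total_requests * (1 - slo_pct / 100)
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        # A 100% SLO (or no traffic) leaves no budget at all.
        consumed = 0.0 if failed_requests == 0 else float("inf")
    return ErrorBudget(allowed, allowed - failed_requests, consumed)
```

For example, a 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures, so 400 failures would use roughly 40% of the budget.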
SIGNAL VS NOISE
• Alert fatigue is real. Keep your alerts actionable.
• Rare errors can be the most interesting, but error velocity is an indicator.
• Strengthen the signal-to-noise ratio to combat fatigue.
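"Error velocity" can be read as the number of errors inside a sliding time window, so that an alert fires on a burst rather than on every single error. This is a minimal sketch of that idea; the class name, the 60-second window and the threshold of 10 are assumptions, not values from the talk.

```python
from collections import deque


class ErrorVelocityAlert:
    """Fire only when errors arrive faster than a threshold within a time window."""

    def __init__(self, window_seconds: float = 60.0, threshold: int = 10):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # timestamps of recent errors, oldest first

    def record(self, timestamp: float) -> bool:
        """Record one error; return True if the windowed count reaches the threshold."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop errors that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

A rare, isolated error still gets logged for later investigation, but only a sustained rate pages someone, which keeps the alert actionable.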
MY EXPERIENCES
• Team was formed from various departments
• Carried forward some SRE-related projects from dev
• Matured & documented processes
‒ Playbooks
‒ Dependencies, metrics, app catalog
• Sharing responsibility for prod incidents with operations and dev teams
• Finding ways to consult on app design & rollout
• We are first responders, but the dev & ops teams are on call
STORIES FROM THE FIELD
5,124 HOURS
AKA CISCO FIELD NOTICE FN-64291
AUTO-IMMUNE DISORDER
AGGRESSIVE HEALTH CHECKING
STORIES FROM THE FIELD
WHAT’S THE STRANGEST PLACE YOU’VE
WORKED A PRODUCTION INCIDENT?
THANK YOU!
• Twitter: jmloeffler
• G+: jmloeffler
• Github: jmloeffler