1. A quick summary of SRE - Site Reliability Engineering
Yogesh Shah
2. Agenda
• What is SRE & its background
• Before going to SRE
• SRE and DevOps
• Components of SRE
• Reliability
• SLA
• SLO
• SLI
• Error budget
• Toil
• Things we did not cover
• References
3. What is SRE, History & Background
SRE = Site Reliability Engineering
The term SRE originated at Google more than a decade ago, and it has been the backbone of Google's highly reliable and valuable suite of products and services.
Google initially did not make the details of SRE public, as it considered SRE the secret sauce of its success.
When the DevOps movement started, Google could see that there was a lot of interest in implementing DevOps, but there was no clear path and people were struggling to implement it.
4. Scrum, SAFe, Lean, DevOps ... now SRE
• Framework direction: Dev → Ops
• Flexibility: Rigid → open to interpretation
• Ease of implementation: Easy → very hard
• Fit for market demand: Less → high
Software delivery mechanisms compared (what is at the center, type, advantages, difficulties)
Waterfall / Project management
• Centers around: Plan
• Outcome: Fixed target
• Type: Process
• Advantages: Easy to implement; scope, time and cost are fixed
• Difficulties: Changing requirements; too heavy, complex and costly
ITIL
• Centers around: SLA
• Outcome: Predefined service quality
• Type: Framework
• Advantages: Easy to implement; clear accountability; predictable service quality
• Difficulties: Meeting the SLA != customer satisfaction; too heavy and complex
Scrum / SAFe
• Centers around: Timebox, focus
• Outcome: Delivery of changing requirements
• Type: Framework
• Advantages: Simple to understand
• Difficulties: Difficult to implement; works best in pockets, but consistency is hard to achieve
Lean
• Centers around: Flow of work
• Outcome: Removal of waste
• Type: Methodology
• Advantages: Easy to implement; clear accountability; predictable service quality
• Difficulties: Meeting the SLA != customer satisfaction; too heavy and complex
DevOps
• Centers around: Unifying Dev & Ops
• Outcome: End-to-end accountability for Dev & Ops
• Type: Philosophy
• Advantages: Great vision
• Difficulties: Open to interpretation
5. What is SRE in comparison with the others
• Centers around: Reliability
• Outcome: Customer satisfaction, with control over the balance between enhancement and reliability
• Type: Implementation pattern
• Advantage: Implements DevOps
• Disadvantage: None :)
• Addresses a question that has been neglected so far: "Is the system ready to handle change without impacting customer experience?"
• SRE happens when a software engineer is tasked with what used to be called operations.
6. SRE and DevOps
But what is DevOps?
DevOps is about a combined team (Dev & Ops) using a common set of tools and processes to deliver any software change.
SRE is an implementation of DevOps.
DevOps
• Reduce organizational silos
• Accept failure as normal
• Implement gradual change
• Leverage tooling & automation
• Measure everything
SRE
• Shares ownership with developers by using the same tools and techniques across the stack
• Has a formula for balancing accidents and failures against new releases
• Encourages moving quickly by reducing the cost of failure
• Encourages "automating this year's job away" and minimizing manual systems work, to focus on efforts that bring long-term value to the system
• Believes that operations is a software problem, and defines prescriptive ways of measuring availability, uptime, outages, toil, etc.
8. Defining Reliability
The most important feature of any system is its reliability
• A clunky system with great features doesn't work
• 100% reliability is most often the wrong target, as it slows down velocity
• Reliability beyond a certain point has diminishing returns
• Each additional 9 after the decimal point makes the system 10 times more reliable, but costs roughly 10 times more
User experience decides reliability
• Users, not monitoring metrics, decide reliability; hence, to say that a system is reliable, one needs to measure user experience
Engineering and talented developers alone are not enough for highly reliable systems; a well-trained incident response team is a must
• To achieve highly reliable (99.999…%) systems, a well-trained incident response team (proactive & reactive) is required. Talented developers and a well-engineered system alone are not enough
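To make the "each additional 9" point concrete, here is a minimal Python sketch that lists the downtime allowed by targets with two to five nines. The 28-day window and the function names are assumptions chosen for illustration; the sketch only shows the tenfold drop in allowed downtime and says nothing about cost.

```python
# Allowed downtime for availability targets with an increasing number of nines.
# The 28-day window is an assumption made for this sketch.

WINDOW_MINUTES = 28 * 24 * 60  # 40,320 minutes in 28 days


def allowed_downtime_minutes(nines: int, window_minutes: float = WINDOW_MINUTES) -> float:
    """Downtime allowed by a target of `nines` nines (2 -> 99%, 3 -> 99.9%, ...)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines  # 2 nines -> 0.01, 3 nines -> 0.001
    return window_minutes * unavailability


def target_label(nines: int) -> str:
    """Human-readable target, e.g. 3 -> '99.9%'."""
    decimals = max(nines - 2, 0)
    return f"{100 * (1 - 10.0 ** -nines):.{decimals}f}%"


if __name__ == "__main__":
    for nines in range(2, 6):
        minutes = allowed_downtime_minutes(nines)
        print(f"{target_label(nines):>8}: {minutes:8.2f} minutes of downtime per 28 days")
```

Each step from one row to the next divides the allowed downtime by ten, which is why every extra 9 is so much harder to achieve than the previous one.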
9. Reliability
• SRE helps define reliability clearly, using the concept of an error budget
• Thanks to the error budget, reliability is understood consistently across the organization
• 100% reliability is the wrong target, as it slows down velocity
• User happiness and reliability are directly proportional up to a point; beyond that, the user doesn't care
10. SLA - Service Level Agreement
• SLAs are the agreements you make with your customers about the reliability of your service. An SLA has to have consequences if it is violated
• Violating an SLA is costly in many respects; hence an informative warning, with enough time to react, is a must to prevent SLA violations
11. SLO - Service Level Objectives
• Reliability is a feature, hence it is prioritized against other functional features. However, prioritizing reliability is challenging, and SLOs are key to prioritizing it alongside other features
• The SLO is the target for a specified level of reliability. In other words, the SLO is the yardstick against which reliability is measured
• An SLO should always be stricter than the corresponding SLA, because customers are usually impacted before the SLA is actually breached
• An SLO is effectively an internal promise to meet customer expectations. A violation of the SLO is a really important issue, because you can no longer afford more outages; you will want to remove risk from your service by devoting engineering and automation effort to reducing and eliminating areas of risk
• A good rule of thumb for setting SLO targets is the "happiness test": the threshold beyond which users tend to become grumpy due to degraded service performance
• Identifying SLOs and selecting their targets is an important but tough task, and SRE has clear guidelines for identifying SLOs, setting targets, and revising SLOs, targets or both
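The relationship between the SLO and the SLA can be sketched as a small check: because the SLO is stricter, measured reliability crosses the SLO first, which gives the team an early warning before the SLA is breached. The following Python sketch assumes reliability is expressed as a percentage; the `Status` names, the `classify` function and the example thresholds (SLO 99.9%, SLA 99.5%) are hypothetical and not taken from the slides.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    SLO_VIOLATED = "SLO violated, SLA still met"
    SLA_BREACHED = "SLA breached"


def classify(measured_pct: float, slo_pct: float, sla_pct: float) -> Status:
    """Compare measured reliability with the internal SLO and the external SLA."""
    if slo_pct <= sla_pct:
        # The SLO only works as an early warning if it is the stricter target.
        raise ValueError("the SLO should be stricter (higher) than the SLA")
    if measured_pct < sla_pct:
        return Status.SLA_BREACHED
    if measured_pct < slo_pct:
        return Status.SLO_VIOLATED
    return Status.HEALTHY


if __name__ == "__main__":
    # Hypothetical thresholds and measurements.
    for measured in (99.95, 99.7, 99.2):
        status = classify(measured, slo_pct=99.9, sla_pct=99.5)
        print(f"{measured}% -> {status.value}")
```

The middle state is the useful one: the service has missed its internal promise, but there is still time to react before the agreement with the customer is violated.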
12. SLI - Service Level Indicators
What is an SLI
• Now we understand what reliability is, but how do we measure it?
• The reliability of a service should be a quantitative measure of customer experience. SRE helps you find a suitable metric based on the characteristics of your service
• The metric chosen to measure the level of service provided to the user is called the SLI. In simple words, it is a quantitative measure of user experience
• How an SLI is measured varies with the implementation of the service and the environment in which it operates
Relationship between SLI & SLO
• The SLI shows how the service is performing against the target at a given point in time
• The SLO is the target we choose for the SLI, measured over a period of time (e.g. 99% of requests served within 2 seconds over the last 4 weeks)
• Comparing the SLI with the SLO target tells us whether a given period was good or bad
• SLOs can differ for different times, different customer types, frequency of SLO misses, etc.; the concept of the error budget helps you manage this
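The example SLO above ("99% of requests served within 2 seconds") can be checked with a few lines of Python. This is a minimal sketch, assuming the latencies of all requests in the window are available as a list of seconds; the function names and the sample data are hypothetical.

```python
from typing import Sequence


def latency_sli(latencies_s: Sequence[float], threshold_s: float = 2.0) -> float:
    """Percentage of requests in the window served within `threshold_s` seconds."""
    if not latencies_s:
        raise ValueError("no requests in the window")
    good = sum(1 for latency in latencies_s if latency <= threshold_s)
    return 100.0 * good / len(latencies_s)


def meets_slo(sli_pct: float, slo_target_pct: float = 99.0) -> bool:
    """True when the measured SLI reaches the SLO target."""
    return sli_pct >= slo_target_pct


if __name__ == "__main__":
    # Hypothetical window: 197 fast requests and 3 slow ones.
    window = [0.3] * 197 + [2.5] * 3
    sli = latency_sli(window)
    print(f"SLI = {sli}% -> SLO met: {meets_slo(sli)}")
```

In practice the window would cover the whole SLO period (for example the last 4 weeks) and the data would come from monitoring rather than from an in-memory list.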
How SRE helps
• SRE provides an SLI menu for typical user journeys (system characteristics)
• SRE provides a simple formula to measure SLIs: it is always a ratio (good events / valid events)
• SRE provides blueprints for implementing SLI capture mechanisms, along with their tradeoffs
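Written out, the ratio mentioned above looks as follows. Expressing it as a percentage and the event counts in the worked line are assumptions made for illustration.

```latex
\[
\mathrm{SLI} = \frac{\text{good events}}{\text{valid events}} \times 100\%
\]
% Hypothetical example: 9,990 good events out of 10,000 valid events
\[
\mathrm{SLI} = \frac{9990}{10000} \times 100\% = 99.9\%
\]
```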
13. Error Budget
• Identifying, documenting and agreeing on SLOs and SLIs is great progress, but how do we make all this work?
• The error budget is useful:
  • To actively balance the reliability of the system against the progress of other features in a coherent manner
  • To inform everyone how much headroom is available before customer experience is impacted
  • To state quantitatively how much failure or unreliability is allowed
• E.g.
  • If the intended reliability is 99.9%, the error budget is 0.1%
  • A 0.1% error budget = 40.32 minutes of downtime over 28 days
  • These 40.32 minutes are the error budget implied by the SLO we agree with all stakeholders. That means we have 40.32 minutes in total to recover from failures in that period, whatever the cause: disk failure, bad code, maintenance error, etc.
• It prompts a lot of useful thinking
  • Assume the reliability target for your platform is 95% over 28 days. That means you are allowed 1.4 days of downtime. Do you then really need CI/CD, blue-green deployment, test automation, etc.?
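The arithmetic behind these examples can be sketched in Python. The `ErrorBudget` class, its field names and the 25-minute outage are hypothetical; only the 99.9% and 95% targets and the 28-day window come from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ErrorBudget:
    """Error budget for an availability SLO over a rolling window."""
    slo_pct: float                      # reliability target, e.g. 99.9
    window_days: float = 28.0           # window used in the slide's example
    outages_minutes: List[float] = field(default_factory=list)

    @property
    def total_minutes(self) -> float:
        """Total downtime the SLO allows in the window."""
        window_minutes = self.window_days * 24 * 60
        return window_minutes * (100.0 - self.slo_pct) / 100.0

    @property
    def remaining_minutes(self) -> float:
        """Budget left after the recorded outages (negative means overspent)."""
        return self.total_minutes - sum(self.outages_minutes)

    def record_outage(self, minutes: float) -> None:
        """Charge an outage of the given duration against the budget."""
        self.outages_minutes.append(minutes)


if __name__ == "__main__":
    budget = ErrorBudget(slo_pct=99.9)
    print(f"Total budget: {budget.total_minutes:.2f} minutes")
    budget.record_outage(25.0)  # hypothetical incident
    print(f"Remaining after a 25-minute outage: {budget.remaining_minutes:.2f} minutes")

    relaxed = ErrorBudget(slo_pct=95.0)
    print(f"95% target: {relaxed.total_minutes / (24 * 60):.1f} days of downtime allowed")
```

With a 99.9% target the total is the 40.32 minutes from the slide, and a single 25-minute incident uses up more than half of it; with a 95% target the same window allows 1.4 days.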
14. Toil
• Toil is work related to running a production system or service
• Toil satisfies the following conditions:
  • Manual
  • Repetitive
  • Automatable
  • Tactical
  • Devoid of long-term value
• Overhead (attending meetings, responding to email, etc.) is not toil
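As a rough illustration, the conditions above can be read as a checklist. The Python sketch below treats them as a strict "all must hold" test, which is an assumption of this sketch; the `Task` fields and the two example tasks are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    production_related: bool   # tied to running a production system or service
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    long_term_value: bool


def is_toil(task: Task) -> bool:
    """Strict reading of the slide: every condition must hold for a task to be toil."""
    return (
        task.production_related
        and task.manual
        and task.repetitive
        and task.automatable
        and task.tactical
        and not task.long_term_value
    )


if __name__ == "__main__":
    restart = Task("restart a stuck job by hand", True, True, True, True, True, False)
    meeting = Task("attend the weekly status meeting", False, True, True, False, True, False)
    for task in (restart, meeting):
        print(f"{task.name}: {'toil' if is_toil(task) else 'not toil'}")
```

The meeting example comes out as "not toil" because it is overhead rather than work tied to running the production service.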
15. Not covered
• Detailed steps and workshops for developing SLOs and SLIs
• Setting achievable SLO targets
• Defining SLIs
• Managing growth of SLI parameters
• SLI menu, implementation patterns, tradeoffs and cost analysis
• Defining and analyzing the error budget
• Error budget policy, thresholds and scenarios
• Identifying and addressing SLO risks
• Consequences of missing an SLO
• There is much more
16. References
• SRE Introduction - Set of videos about SRE introduction
• SRE - How Google runs production systems
• SRE Workbook - Practical ways to implement SRE