According to Google, SRE is what you get when you treat operations as if it’s a software problem. In this video, I briefly explain the SLIs typically associated with a system: availability, latency, and quality.
YouTube video here: https://youtu.be/EgpCw15fIK8
4. How do SLOs help?
• Product: If reliability is a feature, when do you prioritise it versus other features?
• Development: How do you balance the risk to reliability from changing a system with the requirement to build new, cool features for that system?
• Operations: What is the right level of reliability for the system you support?
https://www.usenix.org/sites/default/files/conference/protected-files/srecon18emea_slides_fong-jones.pdf
5. SLI Equation
1. SLIs fall between 0% and 100%
0% means nothing works, 100% means nothing is broken
2. SLIs have a consistent format
Consistency allows common tooling to be built around SLIs
Alerting logic, error budget calculations, and SLO analysis and reporting tools can all be written to expect the same inputs: good events, valid events, and an SLO threshold.
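The equation usually quoted for this format is SLI = good events / valid events × 100%. A minimal Python sketch of that ratio follows, together with the two calculations the slide says can share the same inputs. The function names are hypothetical, and the error budget formula (budget = 100% minus the SLO threshold) is the conventional definition, assumed here rather than taken from the slide.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events / valid events * 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return 100.0 * good_events / valid_events


def meets_slo(good_events: int, valid_events: int, slo_threshold: float) -> bool:
    """True when the measured SLI is at or above the SLO threshold (percent)."""
    return sli_percent(good_events, valid_events) >= slo_threshold


def error_budget_remaining(good_events: int, valid_events: int,
                           slo_threshold: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    Assumes the conventional definition: the budget is the share of valid
    events allowed to be bad, i.e. 100% minus the SLO threshold.
    """
    if not 0.0 <= slo_threshold < 100.0:
        raise ValueError("slo_threshold must be in [0, 100)")
    sli_percent(good_events, valid_events)  # reuse the input validation
    allowed_bad = valid_events * (100.0 - slo_threshold) / 100.0
    actual_bad = valid_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

For example, 9,995 good events out of 10,000 valid events give an SLI of 99.95%, which meets a 99.9% SLO and leaves roughly half of the error budget.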
7. Availability SLI
• The suggested specification for a request/response Availability SLI is:
The proportion of valid requests served successfully
• Turning this specification into an implementation requires making two choices:
• Which of the requests this system serves are valid for the SLI?
• What makes a response successful?
• Sample success/failure indicators include HTTP/RPC response codes
• Example: the percentage of HTTP GET requests for /profile/{user} or /profile/{user}/avatar that have a 2XX, 3XX or 4XX (excl. 429) status, measured at the load balancer
• The availability of a virtual machine could be defined as the proportion of minutes that it was booted and accessible via SSH
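The request/response example above can be sketched in Python as follows. This is only an illustration: the `(method, path, status)` tuples are a hypothetical stand-in for whatever the load balancer actually logs, and the function names are assumptions, not part of any tool.

```python
import re

# Paths counted as valid for this SLI: /profile/{user} and /profile/{user}/avatar.
VALID_PATH = re.compile(r"^/profile/[^/]+(/avatar)?$")


def is_valid(method: str, path: str) -> bool:
    """Choice 1: which requests are valid for the SLI."""
    return method == "GET" and VALID_PATH.match(path) is not None


def is_success(status: int) -> bool:
    """Choice 2: a response is successful if it is 2XX, 3XX or 4XX, except 429."""
    return 200 <= status < 500 and status != 429


def availability_sli(requests):
    """Percentage of valid requests served successfully.

    `requests` is an iterable of (method, path, status) tuples, a hypothetical
    log format. Returns None when no valid request was seen.
    """
    statuses = [status for method, path, status in requests
                if is_valid(method, path)]
    if not statuses:
        return None
    good = sum(1 for status in statuses if is_success(status))
    return 100.0 * good / len(statuses)
```

Requests that are not valid (other methods, other paths) are excluded from both the numerator and the denominator, so they neither help nor hurt the SLI.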
https://miro.medium.com/max/2496/1*4_Isk3nxCga6jFL-I9VhzA.png
8. Latency
• The suggested specification for a request/response Latency SLI is:
The proportion of valid requests served faster than a threshold
• Turning this specification into an implementation requires making two choices:
• Which of the requests this system serves are valid for the SLI?
• What threshold differentiates between requests that are fast and slow?
• Example: the percentage of HTTP GET requests for /profile/{user} that send their entire response within X ms, measured at the load balancer
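A minimal Python sketch of the latency example above. The slide leaves the threshold as X, so it is a parameter here; the `(method, path, latency_ms)` tuples are a hypothetical log format, and "within X ms" is read as less than or equal to X, which is an assumption.

```python
import re

# Only GET /profile/{user} is valid for this SLI, as in the example above.
PROFILE_PATH = re.compile(r"^/profile/[^/]+$")


def latency_sli(requests, threshold_ms: float):
    """Percentage of valid requests whose full response took <= threshold_ms.

    `requests` is an iterable of (method, path, latency_ms) tuples, a
    hypothetical log format. Returns None when no valid request was seen.
    """
    latencies = [latency for method, path, latency in requests
                 if method == "GET" and PROFILE_PATH.match(path)]
    if not latencies:
        return None
    fast = sum(1 for latency in latencies if latency <= threshold_ms)
    return 100.0 * fast / len(latencies)
```

Counting requests under a threshold, rather than averaging latencies, keeps the result in the same 0% to 100% good/valid form as the other SLIs.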
https://miro.medium.com/max/730/1*FfS0Jg6Nq5yW7Skndxvmkg.png
https://www.usenix.org/sites/default/files/conference/protected-files/srecon18emea_slides_fong-jones1.pdf
9. Quality
• The suggested specification for a request/response Quality SLI is:
The proportion of valid requests served without degrading quality
• Turning this specification into an implementation requires making two choices:
• Which of the requests this system serves are valid for the SLI?
• How do you determine whether the response was served with degraded quality?
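One possible way to make the second choice concrete is to have the service mark each response it serves in a degraded mode. The Python sketch below assumes such a marker exists; the `Response` type, its `degraded` flag, and the `is_valid` predicate are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Response:
    path: str
    degraded: bool  # hypothetical flag set by the service when quality was reduced


def quality_sli(responses: Iterable[Response],
                is_valid: Callable[[Response], bool] = lambda r: True
                ) -> Optional[float]:
    """Percentage of valid responses served without degraded quality.

    `is_valid` encodes the first choice (which requests count); by default
    every response counts. Returns None when no valid response was seen.
    """
    valid = [r for r in responses if is_valid(r)]
    if not valid:
        return None
    good = sum(1 for r in valid if not r.degraded)
    return 100.0 * good / len(valid)
```

The flag is only one option; how degradation is detected depends on the system, and the slide leaves that choice open.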
https://blog.readytomanage.com/wp-content/uploads/2012/07/quality-total-quality-cartoon.jpg