What makes a “good” service is a moving target. Technologies and requirements change over time. It can be impossible to ensure that none of your services have been left behind.
The Service ScoreCard approach is to have a small check for each service initiative we have. A check can be anything measurable: deployment frequency, whether everyone on the on-call team has a phone, or whether the service runs the latest version of the JVM.
The Service ScoreCard gives each service a grade from 'F' to 'A+', based on which checks in the list it passes or fails. As soon as anyone sees a service's grade slipping, everyone rallies to improve it.
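A minimal sketch of how pass/fail checks could roll up into a letter grade. The `Check` type, the metadata field names, the intermediate letters, and the grade thresholds are all illustrative assumptions; the talk does not state the actual checks or cut-offs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    """One measurable pass/fail check (hypothetical structure)."""
    name: str
    passed: Callable[[Dict], bool]  # takes service metadata, returns pass/fail


# Assumed thresholds: minimum fraction of passing checks per grade, best first.
GRADE_THRESHOLDS = [
    (1.00, "A+"),
    (0.90, "A"),
    (0.80, "B"),
    (0.70, "C"),
    (0.60, "D"),
]


def score(service: Dict, checks: List[Check]) -> float:
    """Fraction of checks the service passes (0.0 when there are no checks)."""
    if not checks:
        return 0.0
    passing = sum(1 for check in checks if check.passed(service))
    return passing / len(checks)


def grade(fraction: float) -> str:
    """Map a pass fraction to a letter grade, falling back to 'F'."""
    for minimum, letter in GRADE_THRESHOLDS:
        if fraction >= minimum:
            return letter
    return "F"


# Example checks modelled on the ones named above; field names are made up.
CHECKS = [
    Check("deployed in the last 7 days", lambda s: s["deploys_last_7_days"] >= 1),
    Check("on-call team all have phones", lambda s: s["oncall_without_phone"] == 0),
    Check("runs the latest JVM", lambda s: s["jvm_is_latest"]),
]

example_service = {
    "deploys_last_7_days": 3,
    "oncall_without_phone": 0,
    "jvm_is_latest": False,
}

example_grade = grade(score(example_service, CHECKS))
```

Keeping every check a plain boolean makes it cheap to add a new initiative: a new entry in the list changes the denominator, and every service's grade adjusts on the next run.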
We can then set up rules based on the grades: “Only services graded B and above can deploy 24/7”, “A moratorium on services without an A+”, or “No SRE support for services below a C grade”.
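Rules like these can be sketched as simple comparisons over an ordered grade scale. The intermediate letters between 'F' and 'A+', the function names, and the reading of the third rule as "no support below a C" are assumptions made for illustration only.

```python
# Assumed grade scale, worst to best; only 'F' and 'A+' are given in the text.
GRADE_ORDER = ["F", "D", "C", "B", "A", "A+"]


def at_least(grade: str, minimum: str) -> bool:
    """True when `grade` is equal to or better than `minimum`."""
    return GRADE_ORDER.index(grade) >= GRADE_ORDER.index(minimum)


def can_deploy_anytime(grade: str) -> bool:
    """Only services graded B and above can deploy 24/7."""
    return at_least(grade, "B")


def under_moratorium(grade: str) -> bool:
    """Moratorium on services without an A+."""
    return grade != "A+"


def gets_sre_support(grade: str) -> bool:
    """No SRE support for services below a C grade (one possible reading)."""
    return at_least(grade, "C")
```

For example, a service graded "C" would keep SRE support under this sketch but lose the right to deploy at any hour.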
2. Agenda
1 The Problem
2 A Solution idea
3 A Solution tour
4 The results
5 Takeaways, lessons learnt & questions
3. “If it's not broken, I’ll fix it.”
From Australia, on loan as
Staff SRE @ LinkedIn
jobs, companies, recruiter
& Finder of encoding bugs
about me
Danny ☃ Lawrence
10. Operational Excellence
effective and efficient delivery of information,
technology, and services required by end users
that add measurable value.
Gamifying Operational Excellence
11. Operational Excellence
Doing everything required to make sure
all of your services are as fast and as reliable
as possible.
14. Mostly Java
Multitudes of services
Doing lots of things
Service-oriented architecture
Everything talks to everything
My direct team looks after 80+ services
We have 200+ SREs
LinkedIn SRE Crash Course
21. Upgrading dependencies & libraries
Java / Jetty / Play / Tomcat
Correct usage of TLS
Switching databases / caches
Migrate from SVN to Git
Reduce application startup time
Set up error budgeting
True up the number of metrics
Some examples
22. A GOOD service
can turn into a BAD service
if you are not checking it.
29. BAD services wake me up
Time will cause GOOD to turn BAD
Hard to know what is BAD
Hard to know why it is BAD
Not sure how to fix the BAD
132.
                                When we started    Now
Average grade for my team       40%                80%
Average score across SRE        35%                60%
Checks in 24 hours              15,560             89,859
Number of checks per service    15                 31
133. We can now explore new ways
to use the scores