Gamifying Operational Excellence
The Service Score Card
Agenda
1 The Problem
2 A Solution idea
3 A Solution tour
4 The results
5 Take aways & Lessons Learnt & Questions
“If it's not broken, I’ll fix it.”
From Australia, on loan as
Staff SRE @ LinkedIn
jobs, companies, recruiter
& Finder of encoding bugs
about me
Danny ☃ Lawrence
Good news
SRECON.
You passed
the ☃ test.
Some terms
(before we really get started)
Operational Excellence
effective and efficient delivery of information,
technology, and services required by end users
that add measurable value.
Operational Excellence
Doing everything required to make sure
all of your services are as fast and as reliable
as possible.
Gamification
application of game-design elements and game
principles in non-game contexts.
Some background
(LinkedIn SRE crash course)
Mostly Java
Multitudes of services
Doing lots of things
Service-oriented architecture
Everything talks to everything
My direct team looks after 80+ services
We have 200+ SREs
The Problem
(What started this whole thing)
Problem 1:
The GOOD
&
The BAD
BAD services
wake me up
GOOD services
let me sleep
What makes a GOOD
service at LinkedIn
is a moving target.
Technologies and dependencies
change
over time.
Some examples
Upgrading dependencies & libraries
Java / Jetty / Play / Tomcat
Correct usage of TLS
Switching databases / caches
Migrating from SVN to Git
Reducing application startup time
Setting up error budgeting
Truing up the number of metrics
A GOOD service
can turn into a BAD service
if you are not checking it.
Unfortunately
BAD services
do not
magically
turn into
GOOD services
Problem 2:
Knowing what is BAD
Problem 3:
Knowing why it’s BAD
Problem 4:
Tribal knowledge
about how to get to GOOD
The only thing SREs hate more than
not having documentation
is writing documentation.
The Problem
summary
BAD services wake me up
Time will cause GOOD to turn BAD
Hard to know what is BAD
Hard to know why it is BAD
Not sure how to fix the BAD
The Service ScoreCard
(A solution)
In order to determine the health
of the services we support,
we define a list of production requirements.
Apply a weight to each requirement
Codify each requirement into a check.
Execute these checks
for each service
Tally up the results for each service.
Grade the service from
“F” to “A+”
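The steps so far (weight each requirement, run the checks, tally, grade) can be sketched roughly as follows. This is a minimal illustration, not the Service Scorecard's actual code: the function names, the example weights and especially the grade boundaries are assumptions, since the talk only says that grades run from "F" to "A+". The name "other-service" is made up.

```python
# Hypothetical sketch: weighted tally and letter grade for one service.
# The grade boundaries below are assumed for illustration only.
GRADE_BOUNDARIES = [
    (0.97, "A+"),
    (0.90, "A"),
    (0.80, "B"),
    (0.70, "C"),
    (0.60, "D"),
]


def tally(results):
    """results: list of (weight, state) pairs, state in [0.0, 1.0].

    Returns the weighted fraction of points earned, between 0.0 and 1.0.
    """
    total = sum(weight for weight, _ in results)
    if total == 0:
        return 0.0
    earned = sum(weight * state for weight, state in results)
    return earned / total


def grade(score):
    """Map a 0.0-1.0 score to a letter grade from "F" to "A+"."""
    for minimum, letter in GRADE_BOUNDARIES:
        if score >= minimum:
            return letter
    return "F"


def high_scores(services):
    """services: dict of service name -> list of (weight, state) pairs.

    Returns (name, score, grade) tuples, best score first.
    """
    table = [(name, tally(r), grade(tally(r))) for name, r in services.items()]
    return sorted(table, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    example = {
        "jobs-server": [(5, 1.0), (3, 1.0), (2, 0.5)],
        "other-service": [(5, 0.0), (3, 1.0), (2, 0.0)],
    }
    for name, score, letter in high_scores(example):
        print(f"{name}: {score:.0%} {letter}")
```

Keeping the weights next to the results means a heavily weighted failing check drags the grade down more than a minor one, which is the point of weighting requirements in the first place.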
Add all the services to a high-score system
Then
Publish those scores to the company
This is great,
but how do I improve the score?
How can I add check X to the system?
What makes a check?
Checks are one type of plugin.
Fetch plugins gather data.
Check plugins check the data.
We use the fetch plugin to gather
remote data from:
SVN, GIT, Configuration DBs,
host databases, monitoring systems,
build systems, deployment systems.
Basically,
if we can fetch it,
then we do so.
We build a giant context object.
The check plugin will look at our
context object.
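One way to picture the context object: each fetch plugin's data is stored under its tag, and check plugins read it with attribute access such as ctx.ownership.eng_team. The sketch below is an assumption about the mechanics, not the tool's real implementation; build_context, to_namespace, the stand-in fetchers and their return data are all made up for illustration.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively turn dicts into objects that allow attribute access."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    if isinstance(value, list):
        return [to_namespace(v) for v in value]
    return value


def build_context(service_name, fetchers):
    """Run every fetcher and merge its data under the fetcher's tag.

    fetchers: dict of tag -> callable(service_name) returning
    (state, message, data). A failed fetch leaves an empty namespace.
    """
    merged = {"service_name": service_name}
    for tag, fetch in fetchers.items():
        state, _message, data = fetch(service_name)
        merged[tag] = data if state else {}
    return to_namespace(merged)


# Stand-in fetchers with hard-coded data (no network calls).
def fake_fetch_ownership(service_name):
    return True, "gathered data", {"eng_team": "jobs-team"}


def fake_fetch_hosts(service_name):
    return True, "gathered data", {"hostnames": ["host1", "host2"]}


if __name__ == "__main__":
    ctx = build_context(
        "jobs-server",
        {"ownership": fake_fetch_ownership, "hosts": fake_fetch_hosts},
    )
    print(ctx.service_name, ctx.ownership.eng_team, len(ctx.hosts.hostnames))
```

Because every fetcher writes into one object, a check never needs to know where a value came from, only where it lives in the context.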
All plugins are small Python scripts,
where small is 10-30 lines of code.
Simply return 2 or 3 things.
state*: True, False, None or 0.0 - 1.0
message*: short string
data: python dict of interesting things.
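A runner has to cope with both the two-item and the three-item return shape, and with the different kinds of state. The helper below shows one plausible normalisation. How the real tool treats None is not stated in the talk; here it is assumed to mean "not applicable". The name normalize_result is also an assumption.

```python
def normalize_result(result):
    """Normalise a plugin return value to (score, message, data).

    score is a float in [0.0, 1.0], or None when the plugin returned None
    (assumed here to mean "not applicable").
    """
    if len(result) == 2:
        state, message = result
        data = {}
    elif len(result) == 3:
        state, message, data = result
    else:
        raise ValueError("plugin must return 2 or 3 items")

    if state is None:
        score = None
    elif isinstance(state, bool):
        # bool is checked before numbers because True/False are also ints.
        score = 1.0 if state else 0.0
    elif isinstance(state, (int, float)) and 0.0 <= state <= 1.0:
        score = float(state)
    else:
        raise ValueError(f"invalid state: {state!r}")

    return score, str(message), data or {}
```

With this in place, `normalize_result((False, "missing eng_team"))` and `normalize_result((0.5, "half done", {"fixed": 3}))` both yield the same three-item shape, so the tally step does not need special cases.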
Example fetch plugin
import requests as r  # assumption: "r" on the slide is the requests library
import ssc  # Service Scorecard plugin helpers, as used on the slide

@ssc.tags("ownership")
def fetch_ownership(service_name):
    "Fetch all the ownership data of a service"
    o = r.get("http://owners/" + service_name)
    return True, "gathered owner data", o.json()  # state, message, data
Example check plugin
@ssc.weight(5)
@ssc.tags('ownership')
@ssc.wiki('http://wiki/ssc_eng_owner')
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team  # state, message
    return False, "missing eng_team"
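The decorators on the example plugins suggest that metadata (weight, tags, wiki link) is attached to each function and read back by whatever runs the checks. A minimal stand-in could look like the sketch below. The class `_Ssc`, the attribute names `ssc_weight`, `ssc_tags` and `ssc_wiki`, and the `run_check` helper are invented for illustration and are not the real `ssc` API.

```python
from types import SimpleNamespace


class _Ssc:
    """Tiny stand-in for the `ssc` module used on the slides (hypothetical)."""

    @staticmethod
    def weight(value):
        def decorate(func):
            func.ssc_weight = value
            return func
        return decorate

    @staticmethod
    def tags(*names):
        def decorate(func):
            func.ssc_tags = names
            return func
        return decorate

    @staticmethod
    def wiki(url):
        def decorate(func):
            func.ssc_wiki = url
            return func
        return decorate


ssc = _Ssc()


@ssc.weight(5)
@ssc.tags("ownership")
@ssc.wiki("http://wiki/ssc_eng_owner")
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"


def run_check(check, ctx):
    """Run one check and bundle its result with the decorator metadata."""
    state, message = check(ctx)[:2]
    return {
        "name": check.__name__,
        "weight": getattr(check, "ssc_weight", 1),
        "wiki": getattr(check, "ssc_wiki", None),
        "state": state,
        "message": message,
    }


if __name__ == "__main__":
    ctx = SimpleNamespace(ownership=SimpleNamespace(eng_team=None))
    print(run_check(check_eng_team, ctx))
```

Storing the wiki URL on the check itself is what lets a failing result link straight to the page that explains how to fix it.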
Putting it all together
Problems
Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD
Problems
Understanding what is BAD
[Images of the Service Scorecard; no text recovered]
Problems
Understanding what is BAD
Knowing why it is BAD
[Images of the Service Scorecard; no text recovered]
Problems
Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD
[Images; no text recovered]
What is the check?
Why is it important?
How long will it take to fix?
How will it be fixed?
AngularJS
image: CC BY 4.0 https://angular.io/presskit.html (2017)
{{service_name}}
becomes
jobs-server
{{context.ownership.eng_owner}}
becomes
jobs-team
Using our fetched data in the wiki
{{service_name}}
{html}
<script src="https://cdn/angularjs.js"></script>
{html}
var query = $location.search();
var service_name = query['service_name'];
var url = 'http://ssc/api/' + service_name;
// Fetch this service's context object and expose it to the page.
$http.get(url).then(
    function(response) {
        $scope.ctx = response.data;
    }
);
{{ctx.ownership.owner_eng}}
{{ctx.number_of_hosts}}
{{ctx.product.lib.jetty.version}}
{{ctx.hosts.hostnames}}
{{ctx.is_deployed_in_prod}}
{{ctx.commits.last_commit}}
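AngularJS does this substitution in the browser. To make the idea concrete, here is the same kind of `{{dotted.path}}` lookup done in plain Python against a nested dict. It only illustrates the binding idea and is not how AngularJS evaluates expressions; the `render` and `lookup` helpers and the sample values are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(ctx, dotted_path):
    """Follow a dotted path such as 'ownership.owner_eng' through nested dicts."""
    value = ctx
    for key in dotted_path.split("."):
        value = value[key]
    return value


def render(template, ctx):
    """Replace each {{ctx.some.path}} placeholder with the value from ctx.

    Unknown paths are left untouched so a missing key stays visible on the page.
    """
    def substitute(match):
        path = match.group(1)
        if path.startswith("ctx."):
            path = path[len("ctx."):]
        try:
            return str(lookup(ctx, path))
        except (KeyError, TypeError):
            return match.group(0)

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    sample_ctx = {
        "ownership": {"owner_eng": "jobs-team"},
        "number_of_hosts": 12,  # made-up value
    }
    page = "Owner: {{ctx.ownership.owner_eng}}, hosts: {{ctx.number_of_hosts}}"
    print(render(page, sample_ctx))
```

The practical effect is that one wiki page per check can serve every service: the instructions stay the same and only the filled-in values change.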
Problems
Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD
Now
Reports show what is BAD
Checks validate why it is BAD
Wiki shows how to fix the BAD
No more of these emails:
“If you use lib-core, then upgrade it,
we found a bug”
How many of my 80 services use this lib?
How do I check?
How do I upgrade?
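With a context object per service, the first two questions become a query. A rough sketch, assuming each context exposes library versions under product, lib, library name, version (mirroring `ctx.product.lib.jetty.version` from the wiki example); the service data, the version numbers and the `services_using` helper are invented for illustration.

```python
def parse_version(text):
    """Turn '9.4.2' into (9, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def services_using(contexts, lib_name, below=None):
    """Return sorted names of services that use lib_name.

    contexts: dict of service name -> nested dict context.
    below: optional version string; if given, only services on an
    older version than this are returned.
    """
    matches = []
    for name, ctx in contexts.items():
        lib = ctx.get("product", {}).get("lib", {}).get(lib_name)
        if lib is None:
            continue
        if below is None or parse_version(lib["version"]) < parse_version(below):
            matches.append(name)
    return sorted(matches)


if __name__ == "__main__":
    # Made-up contexts for three services.
    contexts = {
        "jobs-server": {"product": {"lib": {"lib-core": {"version": "1.2.0"}}}},
        "service-b": {"product": {"lib": {"lib-core": {"version": "1.10.1"}}}},
        "service-c": {"product": {"lib": {}}},
    }
    print(services_using(contexts, "lib-core"))
    print(services_using(contexts, "lib-core", below="1.3.0"))
```

Versions are compared as integer tuples rather than strings so that 1.10.1 correctly sorts after 1.3.0.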
Where does this tool fit?
pre-commit, Build, Deployment, Monitoring
Service Scorecard
API
hack-days, Reporting, Deployment, Monitoring
Results
&
Outcomes
What do we do with the scores?
Priority #1:
Getting the grades better
                               When we started   Now
Average grade for my team      40%               80%
Average score across SRE       35%               60%
Checks in 24 hours             15,560            89,859
Number of checks per service   15                31
We can now explore new ways
to use the scores
Carrot
&
Stick
Carrot / GOOD
Stick / BAD
No SRE support
for
F Grade
services.
F Grade services generally
cause the
most problems.
No deploy moratorium
for
A+ services
A+ services generally
cause the
least problems.
A services
are allowed to deploy 24/7
Premium SRE support
for A+ services
Priority build queues
for
GOOD
Services.
Tiger teams
to raise the
scores on
F Grade services
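Because the scores are available through an API, carrot-and-stick ideas like these could be expressed as a simple lookup that deployment or build tooling consults. The sketch below is hypothetical: the policy fields, the "standard" support level, the reading of "GOOD" as grade A or better, and the `policy_for` function are assumptions for illustration, not LinkedIn's actual rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    sre_support: str  # "none", "standard" or "premium" (assumed levels)
    exempt_from_moratorium: bool
    deploy_24x7: bool
    priority_builds: bool
    needs_tiger_team: bool


def policy_for(grade):
    """Translate a letter grade into the carrot-and-stick ideas from the talk."""
    good = grade in ("A+", "A")  # assumption: what counts as "GOOD"
    if grade == "A+":
        support = "premium"
    elif grade == "F":
        support = "none"
    else:
        support = "standard"
    return Policy(
        sre_support=support,
        exempt_from_moratorium=(grade == "A+"),
        deploy_24x7=good,
        priority_builds=good,
        needs_tiger_team=(grade == "F"),
    )


if __name__ == "__main__":
    for letter in ("A+", "A", "C", "F"):
        print(letter, policy_for(letter))
```

Keeping the rules in one function makes the incentives explicit and easy to change as the team learns which ones actually move the grades.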
Hack Days
FREE BEER
Basically any problem
can be solved with
FREE BEER
OR T-Shirts
Influence where we allocate
open headcount
Simple way to get things done
Take aways
&
Lessons Learnt
Everyone cares about Reliability.
Everyone cares about Reliability,
Everyone is a Site Reliability Engineer.
Everyone cares about Reliability,
You just need to empower them.
Hack Days are important,
This POC was built in an afternoon.
Getting the data was easy,
Finding interesting ways to use it is hard.
Make it as easy as possible
to do the right thing.
Cheers !
Q & A

The servicescore card - Gamifying Operational Excellence - SRECON