Modern Operations:
Solving DevOps’ Last Mile Problem
Damon Edwards
@damonedwards
2018
June 18, 2016
Beaverton, OR
Ops Improvement
DevOps
Ops Tools
Community
Damon Edwards
This talk isn’t about cool technology.
Kubernetes
Istio
Lambda
Ansible
Docker
Prometheus
Chef
Collectd
AWS
Puppet
VMs
Nagios
???
???
???
This talk is about how we operate it.
Ops
Business
Idea
Shorter Time-to-Market
Fast Feedback
from Users
Dev Ops
Running
Services
Improved Quality
Digital and DevOps
Availability Auditing
Security Compliance
"Go faster!"
“Open up!”
“Lock it down!”
Why now?
Let’s start with a true story…
Digital
Agile
DevOps
SRE
Cloud
Docker
Kubernetes
Microservices
CHANGE
Wow
That is cool
I wish I could
work there
But nobody was talking about what
happened after deployment…
It was just another Tuesday…
NOC
NOC
Biz
Manager
Escalate!
NOC NOC
NOC
(Bob)
Open
Incident
Ticket
9:30am 10:00am
NOC (Bob)
Biz Manager
Ticket
Context Wagon
Yes, but this
looks different
Haven’t there been
some intermittent
errors this week?
v3
?!
NOC
(Bob)
Open
Incident
Ticket
Ticket
Biz
Manager
App-specific
SREs
“Try this.”
“Try that.”
SRE
SysAdmin
with Prod Access
(Steve)
SRE
SRE
SRE
SRE
SRE
SRE
Bridge
Call
Biz
Manager
fixed?
fixed?
NOC (Bob)
Biz Manager
NOC (Bob)
Biz Manager
SysAdmin (Steve)
7 x SRE
Ticket
Context Wagon
Ticket
Context Wagon
SRE
“It’s a problem
with the Foo
service”
SRE
SRE
Foo
SRE
SRE
SRE
SRE
Bridge
Call
Biz
Manager
Foo
Service
No.
NOC
(Bob)
Update
Ticket
Ticket
Foo
Lead Dev
+ add
12:00pm
NOC (Bob)
Biz Manager
Foo SRE
Ticket
Context Wagon
Can you
fix it?
Dev
Foo
Lead Dev
(Karen)
ding!
Ignore.
App
Manager
Hey did you see
that ticket?
Foo
Lead Dev
(Karen)
sigh.
I’ll take a look
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
Scrum
Ticket
Context Wagon
Foo
Lead Dev
(Karen)
I’m going to need
more log files
Ticket
SysAdmin Team
+ add
Update
Ticket
Chat
“Can someone with
access to Foo Service
in Prod01 help me with
ticket #42516?”
SysAdmin
(Lee) Ticket
“logs
attached”
Foo
Lead Dev
(Karen)
Ticket
“no the
other ones”
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Ticket
Context Wagon
Foo
Lead Dev
(Karen)
Logs
-Who restarted these services? (and why?)
-They didn’t use the correct environment
variables!
-This entire service pool needs to be restarted!
Ticket
Update
Ticket
NOC
(Bob)
Update
Ticket
Ticket
Middleware Team
+ add
“Middleware, please
urgent restart this entire
app pool with the correct
environment variable”
2:00pm
NOC
(Bob)
Middleware
Manager
(Melissa)
No way. It’s the middle
of the day! You need
business approval.
NOC
(Bob)
Update
Ticket
Ticket
SVP for Line of
Business
+ add
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
Ticket
Context Wagon
2:30pm
Update
Ticket
Ticket
SVP for Line of
Business
+ add
SVP
(Susan)
Chief of
Staff
Tech VP
Tech VP
Update
Ticket
Ticket
“Restart approved”
Customer
impact?
Ticket
5:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Ticket
Context Wagon
SharePoint
Ticket
Middleware
Manager
(Melissa)
Who knows these
production services
the best?
Ellen!
Middleware Middleware
(Scott)
Ellen
to
Europe
office
Middleware
(Scott)
Trial and error
.doc
5:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Ticket
Context Wagon
Middleware
(Scott)
Bar
Service
10 min Middleware
(Scott)
Waiting for
Acme Service
Acme startup
failed
Bar
Service
6:00pm
Come on.. no.no.no.
What? Why?
Middleware
(Scott)
-Bar app startup timed out. Error says can’t
connect to Acme service.
- I looked at Acme but it seems to be running
-Is this error message correct? Why can’t Bar
connect?
Ticket
Update
Ticket
Middleware
(Scott)
Bar SRE
+ add
Bar SRE
(Linda)
Middleware
(Scott)
-URGENT: Network
connection issue
between Bar and
Acme
Ticket
Update
Ticket
Network
SRE Team
+ add
6:45pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Ticket
Context Wagon
The new environment pre-flight
check is preventing startup.
Looks like Bar’s connection to
Acme is being blocked.
Bar SRE
(Linda)
Bar
Lead Dev
Bar
Lead Dev
(Liu)
Business
Managers
I can comment out
the test… But the
CD pipeline only
goes to QA ENV!
Network Dir
(Carlos)
Middleware
(Scott)
Carlos, I need a favor.
Can you escalate?
Middleware
Manager
(Melissa)
Customers are
calling. What
is going on?
Last week..
Net SRE
VP
VP
Priority!
Different
Incident!
Net SRE Net SRE
Net SRE
It’s the network!
Business
Managers
Your
network is
broken!
Business
Managers
We are already
working on it!
Network VPs
Network
SRE
(Hari)
The firewall is
blocking the traffic
You’ll have to take
it up with the
Firewall Team
-URGENT: Firewall is
blocking connection
between Bar and Acme
Ticket
Open
Firewall
Ticket
Firewall
Team
+ add
Firewall Engineer
(Freddie)
Middleware
(Scott)
Paging on-call…
Open bridge…
8:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Ticket
Context Wagon
Firewall Engineer
(Freddie)
Middleware
(Scott)
Can’t be the firewall, it hasn’t
changed since last Thursday.
No, it’s the firewall.
There was a rule change last
Thursday that would stop Bar
from talking to Acme.
Can you change it back?
Sure we make changes on
Thursday…
Chief of
Staff
SVP and VPs are livid… this was
supposed to be a safe change!!
Freddie, we’ve got customers calling.
Update
Firewall
Ticket
Firewall Engineer
(Freddie)
8:00pm
ESCALATE:
Emergency
production firewall
rule change review
Ticket
Update
Firewall
Ticket
NetSec
+ add
Firewall Engineer
(Freddie)
Paging on-call…
NetSec
(Nicole)
This is production so I’ll have
to get others on the Network
CAB…
Chief of
Staff
Firewall
(Freddie)
Middleware
(Scott)
Customer outage!
… I’ll call SVP Susan
Middleware
Manager
VP
VP
Bar
Lead Dev
9:00pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Ticket
Context Wagon
Chief of
Staff
Firewall
(Freddie)
Middleware
(Scott)
Customer outage!
APPROVE: Emergency
firewall rule change
Ticket
Update
Firewall
Ticket
NetSec
(Nicole)
… I’ll call SVP Susan
Middleware
Manager
VP
VP
Bar
Lead Dev
Firewall
(Freddie)
Net L2
(Bob)
Firewall
change
Restart Bar
9:30pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Firewall (Freddie)
Ticket
Context Wagon
NetSec (Nicole)
Middleware
(Scott)
Update
Ticket
Ticket
Customer Engagement
Manager
+ add
Policy
!!
“Ready for
API tests”
9:45pm
Firewall
(Freddie)
Net L2
(Bob)
Middleware
(Scott)
Firewall
change
Restart Bar
I think we
are good!
Middleware
Manager
VP
VP
Bar
Lead Dev
You
“think?”
Customer
Engagement
Manager
(Varsha)
NOC
(Bob)
Customer Engagement
Manager
(Varsha)
Update
Ticket
Ticket
“APIs OK”
Middleware
(Scott)
11:00pm
Middleware
(Scott)
Update
Ticket
Ticket
“Services
restarted OK”
NOC
NOC
Lights are green…
I guess it is fixed.
Close
Ticket
NOC
(Bob)
Zzz
11:30pm
NOC (Bob)
Biz Manager
App Manager
Lead Dev (Karen)
Foo SRE
SysAdmin (Lee)
Middleware Manager
SVP
Chief of Staff
2 x Tech VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Firewall (Freddie)
Ticket
Context Wagon
NetSec (Nicole)
Cust. Engmt. (Varsha)
Next Day
SVP
(Susan)
Whose fault is this?!
Why are we so bad at change?
What additional processes
and approvals are you
adding to never let this
happen again?!
VP
VP
Dir
Dir
VP
Dir
VP
Middleware (Scott)
Bar SRE (Linda)
Network PM (Carlos)
Network SRE (Bob)
Firewall (Freddie)
NetSec (Nicole)
Cust. Engmt. (Varsha)
Later…
We’ve invested in Cloud, Agile,
DevOps, Containers…
Why does everything still take too
long and cost too much?
Executive Team
Our transformation has
largely ignored Ops
Most companies chase the symptoms…
Symptoms (labels repeated across the diagram): Waiting, Defects, Manual/Motion, Task Switching, Partially Done, Extra Process
Not Here!
Not Here!
Not Here!
Here?
Even Better!
HC: 2 HC: 9
CW: 10
HC: 3
CW: 6
HC: 3
CW: 7
HC: 4
CW: 10
HC: 4
CW: 12
HC: 3
CW: 12
HC: 4
CW: 13
HC: 8
CW: 15
HC: 6
CW: 17
HC: 4
CW: 18
HC: 3
CW: 18
HC = Headcount
CW = in Context Wagon
Follow the conventional wisdom:
“We need better tools”
“We need more people”
“We need more discipline and attention to detail”
“We need more change reviews/approvals”
Challenge the conventional
wisdom about operations work
Forces That Undermine Operations
Silos Queues
Toil Low Trust
Silos
Silos cause disconnects and mismatches
Backlog Information
I need X
Priorities
Tools
Backlog
I do X
Requests
for X
Silo A
Information
Priorities
Silo B
Tools
Context
Context
Process
Process
Tooling
Tooling
Capacity
Capacity
1
2
3
Silos interfere with feedback loops
Producer Consumer
Ops
Ops
Ops
Function A
Function B
Function C
Becomes siloed labor pools of functional specialists
Requests fulfilled by semi-manual or manual effort

Primary management focus is
on protecting team capacity
Forces That Undermine Operations
Silos Queues
Toil Low Trust
How do we cover for our silos’ disconnects and mismatches?
Silo A Silo B
Ticket
Queue
??
Silo A Silo B
We all know how well that works
Ticket
Queue
Request queues are an expensive way to manage work
Ticket
Queue
Queues Create…
Longer Cycle Time
Increased Risk
More Variability
More Overhead
Lower Quality
Less Motivation
Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development
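Why queues lengthen cycle time can be illustrated with a standard queueing formula. The sketch below is not from the talk. It assumes a single-server M/M/1 queue (random arrivals, random service times), for which the mean time a request waits before work starts is ρ / (μ − λ), where λ is the arrival rate, μ is the service rate, and ρ = λ/μ is the utilization. The function name and the ticket rates are made-up illustrations.

```python
def mean_wait_in_queue(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request waits before work starts, for an M/M/1 queue.

    Both rates are in requests per unit time (here, tickets per day).
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("arrival_rate must be >= 0 and service_rate must be > 0")
    if arrival_rate >= service_rate:
        raise ValueError("arrivals must be slower than service, or the queue grows without bound")
    utilization = arrival_rate / service_rate
    return utilization / (service_rate - arrival_rate)


if __name__ == "__main__":
    service_rate = 10.0  # hypothetical: the team closes 10 tickets per day
    for utilization in (0.5, 0.8, 0.9, 0.95):
        wait_days = mean_wait_in_queue(utilization * service_rate, service_rate)
        print(f"utilization {utilization:.0%}: mean wait {wait_days:.2f} days")
```

Under these assumptions the mean wait at 95% utilization is nineteen times the wait at 50%, which is one way to see why a fully loaded specialist team behind a ticket queue slows everyone who depends on it.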
What do queues do to value streams?
Queue
A
Queue
B
Queues disintegrate and
obfuscate value streams
Ticket queues become “snowflake makers”
??
Silo A Silo B
Ticket
Queue
Snowflakes
(each unique, technically acceptable but unreproducible and brittle)
Forces That Undermine Operations
Silos Queues
Toil Low Trust
Excessive toil prevents fixing the system
“Toil is the kind of work tied to running a production
service that tends to be manual, repetitive,
automatable, tactical, devoid of enduring value, and
that scales linearly as a service grows.”
- Vivek Rau, Google
Excessive toil prevents fixing the system
Toil at manageable percentage of capacity
Toil, Engineering Work
Reduce toil
Improve the business
Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
E.W., Toil
No capacity to reduce toil
No capacity to improve business
Toil Impacts Development As Well
“28% - 63% of development teams’ total
time was consumed by operations toil.”
• Waiting for environments
• Rework due to env. differences
• Network issues
• Broken lower environments
• Incident escalations
• Handoffs
• Requests for information
• Change meetings
• And more…
• 2016-2017 study of development teams
• 14 enterprises (insurance, healthcare, finance, travel, retail)
• Tech org headcount: 900 - 6800
Forces That Undermine Operations
Silos Queues
Toil Low Trust
Where are decisions made? Who can take action?
escalate
1° 2° 3° 4°
escalate escalateor
All work is contextual
rm -rf $PATHNAME
Is this dangerous?
John
Allspaw
All work is contextual
rm -rf $PATHNAME
Answer is always
“it depends”
John
Allspaw
escalate
1° 2° 3° 4°
escalate escalateor
Context
Where are decisions made? Who can take action?
Low trust + approvals = illusion of control
Ticket
System
Add up the total number of approval requests and
…subtract the info radiators (“I need to be in the loop”)
…subtract the CYAs (“Prove you followed the process”)
…subtract the too removed to judge (“mostly guessing”)
How many are you left with?
How many were the right call?
How many got rejected?
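The subtraction exercise above can be run against exported approval data. A minimal sketch, assuming each approval request has been hand-labeled with one hypothetical category (`info_radiator`, `cya`, `too_removed`, or `informed_decision`) and an outcome. The class, field names, and sample records are illustrative and do not come from any ticketing system. "How many were the right call?" needs human judgment, so it is left out.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels for approvals that add no real control.
NOISE_CATEGORIES = {"info_radiator", "cya", "too_removed"}


@dataclass(frozen=True)
class ApprovalRequest:
    category: str   # one of NOISE_CATEGORIES, or "informed_decision"
    rejected: bool  # True if the approver said no


def summarize(requests: list[ApprovalRequest]) -> dict[str, int]:
    """Total the requests, subtract the noise, and report what is left."""
    counts = Counter(r.category for r in requests)
    remaining = [r for r in requests if r.category not in NOISE_CATEGORIES]
    return {
        "total": len(requests),
        "info_radiators": counts["info_radiator"],
        "cyas": counts["cya"],
        "too_removed": counts["too_removed"],
        "left_with": len(remaining),
        "rejected": sum(r.rejected for r in remaining),
    }


# Made-up sample: ten approval requests pulled from a ticket system.
SAMPLE = (
    [ApprovalRequest("info_radiator", False)] * 4
    + [ApprovalRequest("cya", False)] * 3
    + [ApprovalRequest("too_removed", False)] * 2
    + [ApprovalRequest("informed_decision", False)]
)

if __name__ == "__main__":
    for label, value in summarize(SAMPLE).items():
        print(f"{label}: {value}")
```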
Forces That Undermine Operations
Silos Queues
Toil Low Trust
So what can we do differently?
Obvious: Get rid of as many silos as possible
Old Silo A Old Silo B Old Silo C Old Silo D
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
“Horizontal” shared
responsibility, not
everyone doing everything!
Shared and dedicated responsibility is key
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
Development Team 1
Development Team 2
Development Team n
SRE
Team
Clear handoff requirements
Error budget consequences
“Netflix”
Model
“Google”
Model
Same
high-quality,
high-velocity
results!
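The “error budget consequences” attached to the “Google” model refer to the SRE practice of deriving an allowed amount of unreliability from a Service Level Objective. A minimal worked sketch, assuming an availability SLO measured over a 30-day window. The function names, the 99.9% target, and the downtime figure are illustrative assumptions, not numbers from the talk.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after observed downtime; negative means the budget is overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    slo = 0.999      # assumed 99.9% availability target
    downtime = 50.0  # assumed minutes of downtime in this window
    print(f"budget: {error_budget_minutes(slo):.1f} min")
    print(f"remaining: {budget_remaining(slo, downtime):.1f} min")
```

A 99.9% target over 30 days allows roughly 43 minutes of downtime. When the remaining budget goes negative, one commonly described consequence is to slow or pause feature releases until reliability recovers.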
But what about the cross-cutting concerns?
Old Silo A Old Silo B Old Silo C Old Silo D
Cross-Functional Team 1
Cross-Functional Team 2
Cross-Functional Team n
Specialist
Capabilities
Specialist
Capabilities
Specialist
Capabilities
Ticket
Queue
Ticket
Queue
Ticket
Queue
Ticket
Queue
Ticket
Queue
Ticket
Queue
Operations as a Service: Turn handoffs into self-service
Operations as a Service
On
Demand
On
Demand
On
Demand
On
Demand
Cross-Functional Product Team 1: Ops (embedded)
Cross-Functional Product Team 2: Ops (embedded)
Cross-Functional Product Team n: Ops (embedded)
Ops (builds & operates)
Ops Capability
SRE, Dev, or
Specialist
Ops Capability
SRE, Dev, or
Specialist
Ops Capability
SRE, Dev, or
Specialist
Development Team 1
Development Team 2
Development Team n
Ops/SRE
Team
Operations as a Service
On
Demand
On
Demand
On
Demand
On
Demand
Ops
(builds & operates)
Ops Capability
SRE, Dev, or
Specialist
Ops Capability
SRE, Dev, or
Specialist
Ops Capability
SRE, Dev, or
Specialist
Operations as a Service: Works with any org model
Use tickets only for what they are good for
1. Documenting true problems/issues/exceptions
2. Routing for necessary approvals
Not as a general-purpose work management system!
Ticket
System
Security or compliance “in the way”?
Operations as a Service
On
Demand
On
Demand
On
Demand
On
Demand
Cross-Functional Product Team 1: Ops (embedded)
Cross-Functional Product Team 2: Ops (embedded)
Cross-Functional Product Team n: Ops (embedded)
Ops (builds & operates)
Ops Capability
SRE, Dev, or
Specialist
Ops Capability
SRE, Dev, or
Specialist
Ops Capability
SRE, Dev, or
Specialist
Build-in
Security
Here
Build-in
Compliance
Here
“Shift Left” the ability to take action
Push the ability to take action this direction
escalate
1° 2° 3° 4°
escalate escalateor
OaaS Enablement and tooling
Reduce Toil
1. Track toil levels for each team
2. Set toil limits for each team
3. Fund efforts to reduce toil (with emphasis on teams over toil limits)
Bonus: Use Service Level Objectives, Error Budgets, and other lessons from SRE
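Steps 1 and 2 above can be sketched as a small report. Everything in the example is an assumption for illustration: the `TeamWeek` record, the team names, and the hours are made up, and the 50% default limit mirrors the cap Google's SRE book describes for its own SRE teams. Any organization would pick its own limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamWeek:
    team: str
    toil_hours: float   # hours spent on manual, repetitive operational work
    total_hours: float  # total hours worked by the team in the same period

    @property
    def toil_ratio(self) -> float:
        """Share of the team's capacity consumed by toil (0.0 to 1.0)."""
        return self.toil_hours / self.total_hours if self.total_hours else 0.0


def teams_over_limit(weeks: list[TeamWeek], limit: float = 0.5) -> list[TeamWeek]:
    """Return teams whose toil ratio exceeds the limit, worst first."""
    over = [w for w in weeks if w.toil_ratio > limit]
    return sorted(over, key=lambda w: w.toil_ratio, reverse=True)


# Hypothetical tracking data for one week.
SAMPLE_WEEKS = [
    TeamWeek("foo-sre", toil_hours=12, total_hours=40),
    TeamWeek("firewall", toil_hours=26, total_hours=40),
    TeamWeek("middleware", toil_hours=30, total_hours=40),
]

if __name__ == "__main__":
    for week in teams_over_limit(SAMPLE_WEEKS):
        print(f"{week.team}: {week.toil_ratio:.0%} toil, candidate for funding")
```

Sorting worst first gives step 3 a starting point: the teams furthest over their limit are the first candidates for funded toil-reduction work.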
Example Operations as a Service Platform (shameless plug)
Scripts APIs Tools Cloud VMs Containers
Workflow and
Scheduling
Collect and
Process Output
Infrastructure
details and state
Config.
Man.
CMDB
Monitor.
Metrics
Cloud
Corp
Directory
Authentication
and roles
ITSM Tickets, work
status, approvals
>_
Create workflows ● Define policies ● Execute workflows
Web GUI API CLI
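The “create workflows, define policies, execute workflows” idea can be sketched independently of any product. The code below is a hypothetical illustration and is not Rundeck's API: a specialist team registers a vetted procedure once, attaches the roles allowed to run it, and anyone holding such a role can execute it on demand, with every attempt recorded. All class, role, and workflow names are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Workflow:
    name: str
    allowed_roles: frozenset[str]   # policy: who may run this workflow
    action: Callable[[str], str]    # the vetted procedure, given a target


@dataclass
class SelfServicePortal:
    workflows: dict[str, Workflow] = field(default_factory=dict)
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def register(self, workflow: Workflow) -> None:
        """Called by the specialist team that owns the procedure."""
        self.workflows[workflow.name] = workflow

    def run(self, user: str, roles: list[str], name: str, target: str) -> str:
        """Called by any team; the policy check replaces the ticket handoff."""
        workflow = self.workflows[name]
        if not workflow.allowed_roles & set(roles):
            self.audit_log.append((user, name, "denied"))
            raise PermissionError(f"{user} may not run {name}")
        result = workflow.action(target)
        self.audit_log.append((user, name, "ok"))
        return result


def build_demo_portal() -> SelfServicePortal:
    portal = SelfServicePortal()
    portal.register(
        Workflow(
            name="restart-app-pool",
            allowed_roles=frozenset({"middleware", "foo-dev"}),
            # Stand-in for the real restart procedure.
            action=lambda target: f"restarted {target} with vetted environment variables",
        )
    )
    return portal


if __name__ == "__main__":
    portal = build_demo_portal()
    print(portal.run("karen", ["foo-dev"], "restart-app-pool", "foo-prod01"))
    print(portal.audit_log)
```

The design choice worth noting is that the policy lives next to the procedure, so granting access does not require handing out production credentials or opening a ticket for each run.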
Recap
Don’t forget about Ops.
Challenge conventional wisdom.
Leverage the Operations as a
Service design pattern
“Shift-Left” control and decision
making.
Focus on removing silos and
queues
Learn from SRE: Reduce toil to
create capacity to change
Understand the forces
undermining operations work
Let’s talk…
@damonedwards
damon@rundeck.com
Dive Deeper Into Operations as a Service:
https://www.rundeck.com/oaas

Mais conteúdo relacionado

Mais procurados

Demystifying DevSecOps
Demystifying DevSecOpsDemystifying DevSecOps
Demystifying DevSecOpsArchana Joshi
 
DevOps concepts, tools, and technologies v1.0
DevOps concepts, tools, and technologies v1.0DevOps concepts, tools, and technologies v1.0
DevOps concepts, tools, and technologies v1.0Mohamed Taman
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityAcquia
 
Dev ops != Dev+Ops
Dev ops != Dev+OpsDev ops != Dev+Ops
Dev ops != Dev+OpsShalu Ahuja
 
Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1Mohammed A. Imran
 
DevOps Introduction
DevOps IntroductionDevOps Introduction
DevOps IntroductionRobert Sell
 
DevOps, Common use cases, Architectures, Best Practices
DevOps, Common use cases, Architectures, Best PracticesDevOps, Common use cases, Architectures, Best Practices
DevOps, Common use cases, Architectures, Best PracticesShiva Narayanaswamy
 
SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...SRE (service reliability engineer) on big DevOps platform running on the clou...
SRE (service reliability engineer) on big DevOps platform running on the clou...DevClub_lv
 
How to SRE when you have no SRE
How to SRE when you have no SREHow to SRE when you have no SRE
How to SRE when you have no SRESquadcast Inc
 
DevSecOps and the CI/CD Pipeline
 DevSecOps and the CI/CD Pipeline DevSecOps and the CI/CD Pipeline
DevSecOps and the CI/CD PipelineJames Wickett
 
DevSecOps: Taking a DevOps Approach to Security
DevSecOps: Taking a DevOps Approach to SecurityDevSecOps: Taking a DevOps Approach to Security
DevSecOps: Taking a DevOps Approach to SecurityAlert Logic
 
Overview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesOverview of Site Reliability Engineering (SRE) & best practices
Overview of Site Reliability Engineering (SRE) & best practicesAshutosh Agarwal
 
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Abeer R
 
The DevOps Journey
The DevOps JourneyThe DevOps Journey
The DevOps JourneyMicro Focus
 
Introduction to DevSecOps
Introduction to DevSecOpsIntroduction to DevSecOps
Introduction to DevSecOpsSetu Parimi
 
Site (Service) Reliability Engineering
Site (Service) Reliability EngineeringSite (Service) Reliability Engineering
Site (Service) Reliability EngineeringMark Underwood
 

Mais procurados (20)

Demystifying DevSecOps
Demystifying DevSecOpsDemystifying DevSecOps
Demystifying DevSecOps
 
DevOps concepts, tools, and technologies v1.0
DevOps concepts, tools, and technologies v1.0DevOps concepts, tools, and technologies v1.0
DevOps concepts, tools, and technologies v1.0
 
A Crash Course in Building Site Reliability
A Crash Course in Building Site ReliabilityA Crash Course in Building Site Reliability
A Crash Course in Building Site Reliability
 
Introduction to DevSecOps
Introduction to DevSecOpsIntroduction to DevSecOps
Introduction to DevSecOps
 
Dev ops != Dev+Ops
Dev ops != Dev+OpsDev ops != Dev+Ops
Dev ops != Dev+Ops
 
Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1Practical DevSecOps Course - Part 1
Practical DevSecOps Course - Part 1
 
Implementing DevSecOps
Implementing DevSecOpsImplementing DevSecOps
Implementing DevSecOps
 
Modern Operations: Solving DevOps’ Last Mile Problem

  • 1. Modern Operations: Solving DevOps’ Last Mile Problem Damon Edwards @damonedwards 2018 June 18, 2016 Beaverton, OR
  • 3. This talk isn’t about cool technology.
  • 4. This talk isn’t about cool technology. Kubernetes Istio Lambda
  • 5. This talk isn’t about cool technology. Kubernetes Istio Lambda Ansible Docker Prometheus Chef Collectd AWS Puppet VMs Nagios
  • 6. This talk isn’t about cool technology. Kubernetes Istio Lambda Ansible Docker Prometheus Chef Collectd AWS Puppet VMs Nagios ??? ??? ???
  • 7. This talk is about how we operate it.
  • 8. Ops Business Idea Shorter Time-to-Market Fast Feedback from Users Dev Ops Running Services Improved Quality Digital and DevOps Availability Auditing Security Compliance “Go faster!” “Open up!” “Lock it down!” Why now?
  • 9. Let’s start with a true story…
  • 11. But nobody was talking about what happened after deployment…
  • 12. It was just another Tuesday…
  • 13. NOC NOC Biz Manager Escalate! NOC NOC NOC (Bob) Open Incident Ticket 9:30am 10:00am NOC (Bob) Biz Manager Ticket Context Wagon Yes, but this looks different Haven’t there been some intermittent errors this week? v3 ?!
  • 14. NOC (Bob) Open Incident Ticket Ticket Biz Manager App-specific SREs “Try this.” “Try that.” SRE SysAdmin with Prod Access (Steve) SRE SRE SRE SRE SRE SRE Bridge Call Biz Manager fixed? fixed? NOC (Bob) Biz Manager NOC (Bob) Biz Manager SysAdmin (Steve) 7 x SRE Ticket Context Wagon Ticket Context Wagon
  • 15. SRE “It’s a problem with the Foo service” SRE SRE Foo SRE SRE SRE SRE Bridge Call Biz Manager Foo Service No. NOC (Bob) Update Ticket Ticket Foo Lead Dev + add 12:00pm NOC (Bob) Biz Manager Foo SRE Ticket Context Wagon Can you fix it?
  • 16. Foo Lead Dev (Karen) ding! Ignore. App Manager Hey, did you see that ticket? Foo Lead Dev (Karen) sigh. I’ll take a look NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE Scrum Ticket Context Wagon
  • 17. Foo Lead Dev (Karen) I’m going to need more log files Ticket SysAdmin Team + add Update Ticket Chat “Can someone with access to Foo Service in Prod01 help me with ticket #42516?” SysAdmin (Lee) Ticket “logs attached” Foo Lead Dev (Karen) Ticket “no, the other ones” NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Ticket Context Wagon
  • 18. Foo Lead Dev (Karen) Logs -Who restarted these services? (and why?) -They didn’t use the correct environment variables! -This entire service pool needs to be restarted! Ticket Update Ticket NOC (Bob) Update Ticket Ticket Middleware Team + add “Middleware, please urgent restart this entire app pool with the correct environment variable” 2:00pm Ticket Context Wagon
  • 19. NOC (Bob) Middleware Manager (Melissa) No way. It’s the middle of the day! You need business approval. NOC (Bob) Update Ticket Ticket SVP for Line of Business + add NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager Ticket Context Wagon 2:30pm
  • 20. Update Ticket Ticket SVP for Line of Business + add SVP (Susan) Chief of Staff Tech VP Tech VP Update Ticket Ticket “Restart approved” Customer impact? Ticket 5:00pm NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Ticket Context Wagon
  • 21. Share point Ticket Middleware Manager (Melissa) Who knows these production services the best? Ellen! Middleware Middleware (Scott) Ellen to Europe office Middleware (Scott) Trial and error .doc 5:00pm NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Middleware (Scott) Ticket Context Wagon
  • 22. Share point Middleware (Scott) Trial and error .doc NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Middleware (Scott) Ticket Context Wagon Middleware (Scott) Bar Service 10 min Middleware (Scott) Waiting for Acme Service Acme startup failed Bar Service 6:00pm
  • 23. Come on.. no.no.no. What? Why? Middleware (Scott)
  • 24. Come on.. no.no.no. What? Why? Middleware (Scott)
  • 25. Come on.. no.no.no. What? Why? Middleware (Scott)
  • 26. -Bar app startup timed out. Error says can’t connect to Acme service. -I looked at Acme but it seems to be running -Is this error message correct? Why can’t Bar connect? Ticket Update Ticket Middleware (Scott) Bar SRE + add Bar SRE (Linda) Middleware (Scott) -URGENT: Network connection issue between Bar and Acme Ticket Update Ticket Network SRE Team + add 6:45pm NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Middleware (Scott) Bar SRE (Linda) Ticket Context Wagon The new environment pre-flight check is preventing startup. Looks like Bar’s connection to Acme is being blocked.
  • 27. Bar SRE (Linda) Middleware (Scott) -URGENT: Network connection issue between Bar and Acme Ticket Update Ticket Network SRE Team + add Bar Lead Dev 6:45pm NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Middleware (Scott) Bar SRE (Linda) Customers are calling. What is going on? The new environment pre-flight check is preventing startup. Looks like Bar’s connection to Acme is being blocked. Bar Lead Dev (Liu) Business Managers I can comment out the test… But the CD pipeline only goes to QA ENV!
  • 28. Network Dir (Carlos) Middleware (Scott) Carlos, I need a favor. Can you escalate? Middleware Manager (Melissa) Customers are calling. What is going on? Last week.. Net SRE VP VP Priority! Different Incident! Net SRE Net SRE Net SRE It’s the network! Business Managers Your network is broken! Business Managers We are already working on it! Network VPs
  • 29. Network SRE (Hari) The firewall is blocking the traffic. You’ll have to take it up with the Firewall Team -URGENT: Firewall is blocking connection between Bar and Acme Ticket Open Firewall Ticket Firewall Team + add Firewall Engineer (Freddie) Middleware (Scott) Paging on-call… Open bridge… Can’t be the firewall, it hasn’t changed since last Thursday. No, it’s the firewall. 8:00pm NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Middleware (Scott) Bar SRE (Linda) Network PM (Carlos) Network SRE (Bob) Ticket Context Wagon
  • 30. Firewall Engineer (Freddie) Middleware (Scott) Can’t be the firewall, it hasn’t changed since last Thursday. No, it’s the firewall. There was a rule change last Thursday that would stop Bar from talking to Acme. Can you change it back? Sure we make changes on Thursday… Chief of Staff SVP and VPs are livid… this was supposed to be a safe change!! Freddie, we’ve got customers calling. Update Firewall Ticket Firewall Engineer (Freddie) 8:00pm
  • 31. ESCALATE: Emergency production firewall rule change review Ticket Update Firewall Ticket NetSec + add Firewall Engineer (Freddie) Paging on-call… NetSec (Nicole) This is production so I’ll have to get others on the Network CAB… Chief of Staff Firewall (Freddie) Middleware (Scott) Customer outage! … I’ll call SVP Susan Middleware Manager VP VP Bar Lead Dev 9:00pm NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Ticket Context Wagon
  • 32. Chief of Staff Firewall (Freddie) Middleware (Scott) Customer outage! APPROVE: Emergency firewall rule change Ticket Update Firewall Ticket NetSec (Nicole) … I’ll call SVP Susan Middleware Manager VP VP Bar Lead Dev Firewall (Freddie) Net L2 (Bob) Middleware (Scott) Firewall change Restart Bar 9:30pm NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Middleware (Scott) Bar SRE (Linda) Network PM (Carlos) Network SRE (Bob) Firewall (Freddie) Ticket Context Wagon NetSec (Nicole)
  • 33. Middleware (Scott) Update Ticket Ticket Customer Engagement Manager + add Policy !! “Ready for API tests” 9:45pm Firewall (Freddie) Net L2 (Bob) Middleware (Scott) Firewall change Restart Bar I think we are good! Middleware Manager VP VP Bar Lead Dev You “think?”
  • 34. “Ready for API tests” Customer Engagement Manager (Varsha) NOC (Bob) Customer Engagement Manager (Varsha) Update Ticket Ticket “APIs OK” Middleware (Scott) Update Ticket 11:00pm Ticket Context Wagon
  • 35. Middleware (Scott) Update Ticket Ticket “Services restarted OK” NOC NOC Lights are green… I guess it is fixed. Close Ticket NOC (Bob) Zzz 11:30pm NOC (Bob) Biz Manager App Manager Lead Dev (Karen) Foo SRE SysAdmin (Lee) Middleware Manager SVP Chief of Staff 2 x Tech VP Middleware (Scott) Bar SRE (Linda) Network PM (Carlos) Network SRE (Bob) Firewall (Freddie) Ticket Context Wagon NetSec (Nicole) Cust. Engmt. (Varsha)
  • 37. NOC Lights are green… I guess it is fixed. Close Ticket NOC (Bob) Zzz Next Day SVP (Susan) Whose fault is this?! Why are we so bad at change? What additional processes and approvals are you adding to never let this happen again?! VP VP Dir Dir VP Dir VP Middleware (Scott) Bar SRE (Linda) Network PM (Carlos) Network SRE (Bob) Firewall (Freddie) NetSec (Nicole) Cust. Engmt. (Varsha)
  • 39. We’ve invested in Cloud, Agile, DevOps, Containers… Why does everything still take too long and cost too much? Executive Team Our transformation has largely ignored Ops
  • 40. Most companies chase the symptoms…
  • 44–47. [Figure, built up over four slides, annotated with recurring labels: Waiting, Defects, Motion/Manual, Task Switching, Partially Done, Extra Process]
  • 55. Here?
  • 58. HC: 2 HC: 9 CW: 10 HC: 3 CW: 6 HC: 3 CW: 7 HC: 4 CW: 10 HC: 4 CW: 12 HC: 3 CW: 12 HC: 4 CW: 13 HC: 8 CW: 15 HC: 6 CW: 17 HC: 4 CW: 18 HC: 3 CW: 18 HC = Headcount CW = in Context Wagon
  • 60. “We need better tools” Follow the conventional wisdom:
  • 61. “We need better tools” “We need more people” Follow the conventional wisdom:
  • 62. “We need better tools” “We need more people” “We need more discipline and attention to detail” Follow the conventional wisdom:
  • 63. “We need better tools” “We need more people” “We need more discipline and attention to detail” “We need more change reviews/approvals” Follow the conventional wisdom:
  • 65. Challenge the conventional wisdom about operations work
  • 66. Forces That Undermine Operations Silos Queues Toil Low Trust
  • 69. Backlog Information I need X Priorities Tools Silos
  • 70. Backlog Information I need X Priorities Tools Silos Backlog I do X Requests for X Silo A Information Priorities Silo B Tools
  • 71. Silos cause disconnects and mismatches Backlog Information I need X Priorities Tools Backlog I do X Requests for X Silo A Information Priorities Silo B Tools Context Context Process Process Tooling Tooling Capacity Capacity
  • 72. 1 2 3 Silos Interfere with feedback loops
  • 73. 1 2 3 Silos Interfere with feedback loops Producer Consumer Ops Ops Ops
  • 74. Function A Function B Function C Becomes siloed labor pools of functional specialists Requests fulfilled by semi-manual or manual effort Primary management focus is on protecting team capacity
  • 75. Forces That Undermine Operations Silos Queues Toil Low Trust
  • 76. How do we cover for our silos’ disconnects and mismatches? Silo A Silo B
  • 77. How do we cover for our silos’ disconnects and mismatches? Silo A Silo B Ticket Queue
  • 78. ?? Silo A Silo B We all know how well that works Ticket Queue
  • 79. Request queues are an expensive way to manage work Ticket Queue Queues Create… Longer Cycle Time Increased Risk More Variability More Overhead Lower Quality Less Motivation Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development
  • 80. What do queues do to value streams?
  • 81. What do queues do to value streams? Queue A Queue B
  • 82. What do queues do to value streams? Queue A Queue B Queues disintegrate and obfuscate value streams
  • 83. Ticket queues become “snowflake makers” ?? Silo A Silo B Ticket Queue
  • 84. Ticket queues become “snowflake makers” ?? Silo A Silo B Ticket Queue Snowflakes (each unique, technically acceptable but unreproducible and brittle)
  • 85. Forces That Undermine Operations Silos Queues Toil Low Trust
  • 86. Excessive toil prevents fixing the system
  • 87. Excessive toil prevents fixing the system “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” -Vivek Rau Google
  • 88. Excessive toil prevents fixing the system. Toil at manageable percentage of capacity: Engineering Work has room to reduce toil and improve the business. Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”): no capacity to reduce toil, no capacity to improve business.
  • 91. Toil Impacts Development As Well • 2016-2017 study of development teams • 14 enterprises (insurance, healthcare, finance, travel, retail) • Tech org headcount: 900 - 6800
  • 92. Toil Impacts Development As Well “28% - 63% of development teams’ total time was consumed by operations toil.” • 2016-2017 study of development teams • 14 enterprises (insurance, healthcare, finance, travel, retail) • Tech org headcount: 900 - 6800
  • 93. Toil Impacts Development As Well “28% - 63% of development teams’ total time was consumed by operations toil.” Waiting for environments Incident escalations Rework due to env. differences Handoffs Network issues Requests for information Broken lower environments Change meetings And more… • 2016-2017 study of development teams • 14 enterprises (insurance, healthcare, finance, travel, retail) • Tech org headcount: 900 - 6800
  • 94. Forces That Undermine Operations Silos Queues Toil Low Trust
  • 95. Where are decisions made? Who can take action? escalate 1° 2° 3° 4° escalate escalate or
  • 96. All work is contextual John Allspaw
  • 97. All work is contextual rm -rf $PATHNAME John Allspaw
  • 98. All work is contextual rm -rf $PATHNAME Is this dangerous? John Allspaw
  • 103. All work is contextual rm -rf $PATHNAME Answer is always “it depends” John Allspaw
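One way to picture the “it depends” answer is a check that looks at the context around the command rather than at the command itself. The following is only a sketch under stated assumptions: `assess_delete` and `PROTECTED_ROOTS` are hypothetical names invented for illustration, and the protected list is a made-up policy, not something from the talk.

```python
from pathlib import PurePosixPath

# Hypothetical policy: locations this imaginary team never wants wiped.
PROTECTED_ROOTS = ("/", "/etc", "/home", "/var/lib")


def assess_delete(pathname: str, protected=PROTECTED_ROOTS) -> str:
    """Give a rough, context-dependent verdict on `rm -rf "$PATHNAME"`."""
    if not pathname.strip():
        # An empty or unset variable is a classic way for this command to go wrong.
        return "refuse: PATHNAME is empty"

    path = PurePosixPath(pathname)
    if not path.is_absolute():
        return "it depends: relative path, the target changes with the working directory"

    roots = [PurePosixPath(r) for r in protected]
    if path in roots:
        return f"refuse: {path} is a protected location"

    # "/" is an ancestor of everything, so only the more specific roots count here.
    # Note: ".." segments are not resolved by this purely lexical check.
    if any(r in path.parents for r in roots if r != PurePosixPath("/")):
        return f"it depends: {path} sits under a protected location"

    return f"probably fine: {path} is outside the protected locations"
```

The same command string gets a different verdict for `/tmp/build-123`, `/var/lib/app/data`, and an empty variable, which is the point of the slide: the person closest to the context is best placed to judge.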
  • 104. escalate 1° 2° 3° 4° escalate escalate or Context Where are decisions made? Who can take action?
  • 105. Low trust + approvals = illusion of control Ticket System
  • 106. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and
  • 107. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”)
  • 108. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”)
  • 109. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”)
  • 112. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with?
  • 113. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with? How many were the right call?
  • 114. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with? How many were the right call? How many got rejected?
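The subtraction exercise above can be run against a ticket export. This is a minimal sketch, assuming each approval request has already been labeled by hand; the `kind` labels, the `approval_audit` function, and the sample numbers are all made up for illustration.

```python
from collections import Counter

# Hypothetical labels for why an approval step exists; these three mirror the slide.
NOT_REAL_DECISIONS = {"info_radiator", "cya", "too_removed"}


def approval_audit(requests):
    """Summarise approval requests given as dicts with 'kind' and 'rejected' keys."""
    kinds = Counter(r["kind"] for r in requests)
    subtracted = sum(kinds[k] for k in NOT_REAL_DECISIONS)
    remaining = [r for r in requests if r["kind"] not in NOT_REAL_DECISIONS]
    return {
        "total": len(requests),
        "subtracted": subtracted,
        "left": len(remaining),
        "rejected": sum(1 for r in remaining if r["rejected"]),
    }


# Made-up sample, purely to exercise the function.
sample = (
    [{"kind": "info_radiator", "rejected": False}] * 5
    + [{"kind": "cya", "rejected": False}] * 3
    + [{"kind": "too_removed", "rejected": False}] * 4
    + [{"kind": "informed_judgment", "rejected": False}] * 2
    + [{"kind": "informed_judgment", "rejected": True}]
)
summary = approval_audit(sample)
```

With this invented sample, 15 requests shrink to 3 informed decisions, of which 1 was a rejection. Whether those decisions were the right call cannot be computed from the ticket data; that needs review after the fact.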
  • 116. Forces That Undermine Operations Silos Queues Toil Low Trust
  • 117. So what can we do differently?
  • 118. Obvious: Get rid of as many silos as possible Old Silo A Old Silo B Old Silo C Old Silo D
  • 119. Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Obvious: Get rid of as many silos as possible
  • 120. Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Obvious: Get rid of as many silos as possible “Horizontal” shared responsibility, not everyone doing everything!
  • 121. Shared and dedicated responsibility is key Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Development Team 1 Development Team 2 Development Team n SRE Team Clear handoff requirements Error budget consequences “Netflix” Model “Google” Model
  • 123. Shared and dedicated responsibility is key Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Development Team 1 Development Team 2 Development Team n SRE Team Clear handoff requirements Error budget consequences “Netflix” Model “Google” Model Same high-quality, high-velocity results!
  • 124. But what about the cross-cutting concerns? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities
  • 125. But what about the cross-cutting concerns? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities Ticket Queue Ticket Queue Ticket Queue
  • 126. But what about the cross-cutting concerns? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities Ticket Queue Ticket Queue Ticket Queue Ticket Queue Ticket Queue Ticket Queue
  • 127. Operations as a Service: Turn handoffs into self-service Operations as a Service On Demand On Demand On Demand On Demand Ops (embedded) Cross-Functional Product Team 1 Cross-Functional Product Team n Ops (embedded) Ops (builds & operates) Cross-Functional Product Team 2 Ops (embedded) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist
  • 128. Development Team 1 Development Team 2 Development Team n Ops/SRE Team Operations as a Service On Demand On Demand On Demand On Demand Ops (builds & operates) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Operations as a Service: Works with any org model
  • 129. Use tickets only for what they are good for Ticket System
  • 130. Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions Ticket System
  • 131. Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions 2. Routing for necessary approvals Ticket System
  • 132. Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions 2. Routing for necessary approvals Not as a general purpose work management system! Ticket System
  • 133. Security or compliance “in the way”? Operations as a Service On Demand On Demand On Demand On Demand Ops (embedded) Cross-Functional Product Team 1 Cross-Functional Product Team n Ops (embedded) Ops (builds & operates) Cross-Functional Product Team 2 Ops (embedded) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Build in Security Here Build in Compliance Here
  • 134. “Shift Left” the ability to take action escalate 1° 2° 3° 4° escalate escalate or
  • 135. “Shift Left” the ability to take action Push the ability to take action this direction escalate 1° 2° 3° 4° escalate escalate or
  • 136. “Shift Left” the ability to take action Push the ability to take action this direction escalate 1° 2° 3° 4° escalate escalate or OaaS Enablement and tooling
  • 138. Reduce Toil 1. Track toil levels for each team
  • 139. Reduce Toil 1. Track toil levels for each team 2. Set toil limits for each team
  • 140. Reduce Toil 1. Track toil levels for each team 2. Set toil limits for each team 3. Fund efforts to reduce toil (with emphasis on teams over toil limits)
  • 141. Reduce Toil 1. Track toil levels for each team 2. Set toil limits for each team 3. Fund efforts to reduce toil (with emphasis on teams over toil limits) Bonus: Use Service Level Objectives, Error Budgets, and other lessons from SRE
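Steps 1 to 3 can be sketched as a small report over per-team time tracking. The snippet below is illustrative only: `TeamWeek`, `toil_report`, the team names, and the hours are hypothetical, and the default limit of 0.5 is an assumption borrowed from the commonly cited SRE guideline of keeping toil below half of a team’s time; each organization would set its own limits.

```python
from dataclasses import dataclass


@dataclass
class TeamWeek:
    team: str
    toil_hours: float   # hours spent on manual, repetitive operational work
    total_hours: float  # total available capacity in the same period


def toil_report(weeks, limit=0.5):
    """Return (team, toil share, over_limit) rows, worst offenders first."""
    rows = []
    for w in weeks:
        share = w.toil_hours / w.total_hours if w.total_hours else 0.0
        rows.append((w.team, round(share, 2), share > limit))
    # Teams over the limit come first, then by descending toil share,
    # so funding discussions can start at the top of the list.
    rows.sort(key=lambda row: (not row[2], -row[1]))
    return rows


# Made-up numbers, purely for illustration.
report = toil_report([
    TeamWeek("middleware", 30, 40),
    TeamWeek("network", 12, 40),
    TeamWeek("firewall", 22, 40),
])
```

Sorting over-limit teams to the top reflects step 3, where funding is weighted toward teams that have exceeded their toil limit.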
  • 142. Example Operations as a Service Platform (shameless plug) Scripts APIs Tools Cloud VMs Containers Workflow and Scheduling Collect and Process Output Infrastructure details and state Config. Man. CMDB Monitor. Metrics Cloud Corp Directory Authentication and roles ITSM Tickets, work status, approvals Create workflows ● Define policies ● Execute workflows Web GUI API CLI
  • 143. Recap: Don’t forget about Ops. Challenge conventional wisdom. Leverage the Operations as a Service design pattern. “Shift-Left” control and decision making. Focus on removing silos and queues. Learn from SRE: Reduce toil to create capacity to change. Understand the forces undermining operations work.