All engineering teams run into trouble from time to time. Alert fatigue, caused by technical debt or a failure to plan for growth, can quickly burn out SREs, overloading both development and operations with reactive work. Layer in the potential for communication problems between teams, and we can find ourselves in a place so troublesome we cannot easily see a path out. At times like this, our natural instinct as reliability engineers is to double down and fight through the issues. Often, however, we need to step back, assess the situation, and ask for help to put the team back on the road to success.
We will look at Code Yellow, the term we use for this process of “righting the ship”, and discuss how to identify teams that are struggling. Drawing on three separate experiences, we will examine some of the root causes, the steps that were taken, and how the engineering organization as a whole supports the process.
Helping operations top-heavy teams succeed smartly
1. Helping operations top-heavy teams the smart way
Jeff Weiner
Chief Executive Officer
Michael Kehoe
Staff Site Reliability Engineer
Todd Palino
Sr Staff Site Reliability Engineer
2. This Is The Only Slide You May Need a Picture Of
slideshare.net/ToddPalino slideshare.net/MichaelKehoe3
3. Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years American
• Former Network Engineer at the University of Queensland
4. Todd Palino
$ WHOAMI
• Senior Staff SRE @ LinkedIn
• Capacity Engineering Team
• Co-Author of Kafka: The Definitive Guide
• Late of VeriSign Infrastructure Engineering
5. When Operations Isn’t Perfect
Code Yellow
https://devops.com/code-yellow-when-operations-isnt-perfect/
6. This talk is not
• How to quickly erase all your technical debt
• How to change your engineering culture
7. This talk is
• How to identify team anti-patterns
• How to work through high toil
• How to create sustainable workloads
20. Building a formula for success
• Problem Statement: Define the areas that need attacking
• Exit Criteria: Define success criteria
• Resource Acquisition: Get the help that you require
• Planning: Plan for short-term & long-term
• Communication & Partnerships: Communicate expectations with clients & partners
21. Define the areas that need attacking
Problem Statement
• Admit there is a problem
• Measure the problem
• Understand the problem
• Determine the underlying causes that need to be fixed
22. Define success criteria
Exit Criteria
• Define concrete goals
• Define concrete success criteria
• Measure via an operational metric
• Measure via a project being completed
• Define timelines for completion
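The exit criteria above can be written down as data so that progress is checked the same way every week. What follows is a minimal sketch, not the speakers' tooling: the `ExitCriterion` class, the criterion names, the thresholds, and the dates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExitCriterion:
    """One concrete, measurable goal for leaving Code Yellow (hypothetical model)."""
    name: str
    target: float           # threshold that counts as success
    current: float          # latest measured value
    deadline: date          # timeline for completion
    lower_is_better: bool = True

    def met(self) -> bool:
        # An operational metric such as pages per shift should fall to the target;
        # a project tracked as percent complete should rise to it.
        if self.lower_is_better:
            return self.current <= self.target
        return self.current >= self.target


def can_exit(criteria):
    """Code Yellow ends only when every criterion is met."""
    return all(c.met() for c in criteria)


def overdue(criteria, today):
    """Names of unmet criteria whose deadline has already passed."""
    return [c.name for c in criteria if not c.met() and today > c.deadline]


# Illustrative values only: one operational metric, one project milestone.
criteria = [
    ExitCriterion("pages_per_oncall_shift", target=10, current=14,
                  deadline=date(2024, 6, 30)),
    ExitCriterion("deploy_automation_percent_done", target=100, current=100,
                  deadline=date(2024, 6, 30), lower_is_better=False),
]

print(can_exit(criteria))
print(overdue(criteria, date(2024, 7, 15)))
```

Keeping both kinds of criteria in one structure reflects the two measurement styles on the slide: an operational metric and a completed project, each with its own deadline.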
23. Get the help you require
Resource Acquisition
• Ask other teams for help
• Get dedicated engineers, project managers, or other roles as required
• Set an exit date for resources
24. Plan for the short-term & long-term
Planning
• Plan out short-term work
• Plan out longer-term projects
• Do they need to be rescheduled?
• Prioritize work that will reduce toil & burnout (Automation + Measurement)
25. Communicate expectations with clients & partners
Communication & Partnerships
• Communicate problem statement & exit criteria
• Send regular progress updates
• Ensure that stakeholders understand delays & expected outcomes