Helping operations top-heavy teams succeed smartly

All engineering teams run into trouble from time to time. Alert fatigue, caused by technical debt or a failure to plan for growth, can quickly burn out SREs, overloading both development and operations with reactive work. Layer in the potential for communication problems between teams, and we can find ourselves in a place so troublesome we cannot easily see a path out. At times like this, our natural instinct as reliability engineers is to double down and fight through the issues. Often, however, we need to step back, assess the situation, and ask for help to put the team back on the road to success. We will look at the process for Code Yellow, the term we use for this process of “righting the ship”, and discuss how to identify teams that are struggling. Through a look at three separate experiences, we will examine some of the root causes, what steps were taken, and how the engineering organization as a whole supports the process.

Helping operations top-heavy teams the smart way
Jeff Weiner, Chief Executive Officer
Michael Kehoe, Staff Site Reliability Engineer
Todd Palino, Sr Staff Site Reliability Engineer

This Is The Only Slide You May Need a Picture Of
slideshare.net/ToddPalino slideshare.net/MichaelKehoe3

Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years American
• Former Network Engineer at the University of Queensland

Todd Palino
$ WHOAMI
• Senior Staff SRE @ LinkedIn
• Capacity Engineering Team
• Co-Author of Kafka: The Definitive Guide
• Late of VeriSign Infrastructure Engineering

When Operations Isn’t Perfect
Code Yellow
https://devops.com/code-yellow-when-operations-isnt-perfect/

This talk is not
• How to quickly erase all your technical debt
• How to change your engineering culture

This talk is
• How to identify team anti-patterns
• How to work through high toil
• How to create sustainable workloads

Today’s agenda
1 Background
2 Scenario 1: Traffic-SRE
3 Scenario 2: Kafka-SRE
4 Building A Formula For Success
5 Key Learnings
6 Q&A

Personal Experience in the past 19 months
ASSISTANCE RENDERED
• Traffic-SRE: Technical Debt / Resource Allocation
• Voyager-SRE: Technical Debt
• Capacity War-room
• Espresso-SRE: Reliability
• Kafka-SRE: Capacity and Alert Fatigue

Scenario 1: Traffic-SRE
Problem Statement
Technical Debt
• Written documentation needed improvement
• Deployment infrastructure needed investment
• Alert Fatigue

Problem Statement
Resource Allocations
• Backlog of work for clients
• Staff shortage

Scenario 2: Kafka-SRE
Problem Statement
Capacity Planning
• Multi-tenant Infrastructure
• No resource controls
• Unclear resource ownership
• Ad-hoc capacity planning
• Sudden 100% increase in traffic
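
To make the impact of a sudden 100% traffic increase concrete, here is a minimal arithmetic sketch. It assumes utilization scales linearly with traffic, which is a simplification and not something the slides claim; the function names and the 60% starting figure are hypothetical.

```python
def utilization_after_growth(current_utilization: float, growth: float) -> float:
    """Projected utilization if traffic grows by `growth` (1.0 means +100%).

    Assumes utilization scales linearly with traffic (illustrative only).
    """
    return current_utilization * (1.0 + growth)


def max_safe_utilization(growth: float, ceiling: float = 1.0) -> float:
    """Highest current utilization that stays at or under `ceiling` after growth."""
    return ceiling / (1.0 + growth)


# Hypothetical cluster running at 60% before traffic doubles.
projected = utilization_after_growth(0.60, 1.0)
print(f"projected utilization: {projected:.0%}")
print(f"max safe utilization before a doubling: {max_safe_utilization(1.0):.0%}")
```

Under this assumption, a shared cluster without resource controls has to sit at or below half of its capacity to absorb a doubling, which is why ad-hoc planning leaves so little margin.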

Problem Statement
Alert Fatigue
• Multiple applications overutilized
• No time for proactive work
• Most alerts non-actionable
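
One way to put a number on "most alerts non-actionable" is to label each alert during triage and compute the share that needed no action. The sketch below is illustrative only: the `Alert` record, its `actionable` flag, and the sample alert names are assumptions, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    actionable: bool  # hypothetical label recorded by the on-call engineer


def non_actionable_ratio(alerts: list[Alert]) -> float:
    """Return the fraction of alerts that required no action (0.0 if empty)."""
    if not alerts:
        return 0.0
    noise = sum(1 for alert in alerts if not alert.actionable)
    return noise / len(alerts)


# Made-up sample of one shift's alerts.
sample = [
    Alert("disk-usage-warning", False),
    Alert("broker-offline", True),
    Alert("cpu-spike", False),
    Alert("under-replicated-partitions", False),
]
print(f"non-actionable: {non_actionable_ratio(sample):.0%}")
```

Tracking this ratio over time gives the team a measurable signal for alert fatigue rather than an impression.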

Building a formula for success
• Problem Statement: Define the areas that need attacking
• Exit Criteria: Define success criteria
• Resource Acquisition: Get the help that you require
• Planning: Plan for short-term & long-term
• Communication & Partnerships: Communicate expectations with clients & partners

Building a formula for success
Problem Statement: Define the areas that need attacking
• Admit there is a problem
• Measure the problem
• Understand the problem
• Determine the underlying causes that need to be fixed

Building a formula for success
Exit Criteria: Define success criteria
• Define concrete goals
• Define concrete success criteria
• Measure via an operational metric
• Measure via a project being completed
• Define timelines for completion
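
The exit criteria above can be written down as data, so that "are we done?" becomes a check rather than a debate. This is a minimal sketch under stated assumptions: the `ExitCriterion` fields, the thresholds, and the dates are hypothetical and are not the format the speakers used.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExitCriterion:
    description: str
    target: float            # threshold for the operational metric
    current: float           # latest measured value
    deadline: date           # timeline for completion
    lower_is_better: bool = True

    def met(self) -> bool:
        """True when the measured value has reached the target."""
        if self.lower_is_better:
            return self.current <= self.target
        return self.current >= self.target


def can_exit(criteria: list[ExitCriterion]) -> bool:
    """A Code Yellow can end only when every defined criterion is met."""
    return bool(criteria) and all(c.met() for c in criteria)


# Hypothetical criteria: one operational metric, one project completion.
criteria = [
    ExitCriterion("Pages per on-call shift", 5, 12, date(2030, 1, 31)),
    ExitCriterion("Deployment automation complete (%)", 100, 100,
                  date(2030, 1, 31), lower_is_better=False),
]
for c in criteria:
    print(f"{c.description}: {'met' if c.met() else 'not met'}")
print("can exit:", can_exit(criteria))
```

Requiring at least one criterion before `can_exit` returns true is a deliberate choice here: an effort with no defined success criteria should not be able to declare itself finished.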

Building a formula for success
Resource Acquisition: Get the help you require
• Ask other teams for help
• Get dedicated engineers, project managers, or other roles as required
• Set an exit date for resources

Building a formula for success
Planning: Plan for the short-term & long-term
• Plan out short-term work
• Plan out longer-term projects
• Do they need to be rescheduled?
• Prioritize work that will reduce toil & burnout (Automation + Measurement)

Building a formula for success
Communication & Partnerships: Communicate expectations with clients & partners
• Communicate problem statement & exit criteria
• Send regular progress updates
• Ensure that stakeholders understand delays & expected outcomes

Key Learnings
• Measure: Measure toil/overhead
• Prioritize: Prioritize efforts to remove overhead/toil
• Communicate: Communicate with partners & teams
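
As a rough illustration of "measure toil/overhead", the sketch below computes the share of a team's logged hours spent on reactive work. The category names, the hours, and the choice of which categories count as toil are all assumptions made for the example.

```python
def toil_fraction(hours_by_category: dict[str, float],
                  toil_categories: set[str]) -> float:
    """Return the share of logged hours that fall into toil categories."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours for category, hours in hours_by_category.items()
               if category in toil_categories)
    return toil / total


# Hypothetical week for one engineer, in hours.
week = {"alerts": 14, "tickets": 10, "project": 12, "meetings": 4}
share = toil_fraction(week, {"alerts", "tickets"})
print(f"toil share: {share:.0%}")
```

Measuring this before and after a Code Yellow gives a simple way to show whether the prioritized work actually reduced overhead.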
