Reducing MTTR and False Escalations: Event Correlation at LinkedIn
Jeff Weiner
Chief Executive Officer
Michael Kehoe
Staff SRE
Rusty Wickell
Sr Operations Engineer
False Escalations
• Have you ever:
  • Been woken because your service is unhealthy due to a dependency?
  • Been woken because someone believes your service is responsible?
  • Spent hours trying to work out why your service is broken?
Today’s agenda
1 Introductions
2 The Problem Statement
3 Architecture Considerations
4 Platform Overview
5 Ecosystem Integration
6 Key Takeaways
7 Q&A
Introduction
Who are we?
PRODUCTION-SRE TEAM AT LINKEDIN
• Assist in restoring stability to services during site-critical issues
• Develop applications to improve MTTD and MTTR
• Provide direction and guidelines for site monitoring
• Build tools for efficient site-issue detection, correlation, and troubleshooting
Problem Statement
[Diagram: Service Complexity affects Learning Curve, MTTR, and Reliability]
Problem Statement
• Learning Curve: understanding services is harder
• High MTTR: complexity delays identification of the cause
• False Escalations: lack of understanding results in false escalations
Project Goals
• Unified API: internal application shows high latency/errors
• Web Frontend: external monitoring shows high latency/errors
Project Goals
• Reduce MTTR: reduce impact on members
• Reduce False Escalations: fewer disruptions to on-call SREs
Project Goals
Internal application shows high latency/errors
Applicable Use-cases
External monitoring shows high latency/errors
Non-Applicable Use-cases
Architecture Considerations
• Real-Time Metrics Analysis: running metric correlation via stream processing
• Ad-Hoc Metric Analytics: metric correlation on demand
• Alert Correlation: processing alerts and performing correlation
Architecture Considerations
REAL-TIME METRIC ANALYTICS
• Pros
  • Fast response time
  • Ability to do advanced analytics in real time
• Cons
  • Resource intensive = Expensive
Architecture Considerations
AD-HOC METRIC ANALYTICS
• Pros
  • Smaller resource footprint
• Cons
  • Analysis time is slow
Architecture Considerations
ALERT CORRELATION
• Pros
  • Leverage already existing alerts
  • Strong signal-to-noise ratio
• Cons
  • Analysis constrained to alerts only (boolean state)
Architecture Considerations
EVALUATION
• Alert Correlation gives us a strong signal
• Real-time analytics is expensive, but useful
• Ad-hoc metric analytics is slower, but cheaper
Platform Overview
• Call Graph: understanding how services depend on each other
• Ad-Hoc Metric Correlation: K-Means analysis
• Alert Correlation: using alerts to confirm performance
• Recommendations Engine: collating and decorating data
Correlation Engine Overview
Architecture
[Diagram components: Callgraph-api, Callgraph-be, correlate-fe, drilldown, inVisualize, site-stabilizer]
Learning Curve
• Scattered knowledge
• Outdated documentation
• Poor understanding of dependencies
Callgraph
• Stores: call count, latency, and error rates
• Created: programmatically
• Interface: API and a user interface
How do we map service
• Lookup: Service/API
• Service Discovery: Services, APIs, Protocols
• Metrics: Destination service, Endpoint, Protocol
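The Callgraph described above can be pictured as a directed graph whose edges carry call count, latency, and error rate, keyed by destination service, endpoint, and protocol. The sketch below is only an illustration under that assumption; the class, method, and field names are hypothetical and are not taken from LinkedIn's Callgraph API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EdgeMetrics:
    """Metrics for one caller -> callee edge (field names are illustrative)."""
    call_count: int
    avg_latency_ms: float
    error_rate: float  # fraction of failed calls, 0.0-1.0


class CallGraph:
    """Hypothetical in-memory call graph; not the real Callgraph service."""

    def __init__(self):
        # caller -> {(callee, endpoint, protocol): EdgeMetrics}
        self._edges = defaultdict(dict)

    def record(self, caller, callee, endpoint, protocol, metrics):
        """Store (or overwrite) the metrics for one edge."""
        self._edges[caller][(callee, endpoint, protocol)] = metrics

    def downstreams(self, service):
        """Return the services that `service` calls directly, sorted by name."""
        return sorted({callee for callee, _, _ in self._edges.get(service, {})})

    def edge_metrics(self, caller, callee):
        """Return metrics for every endpoint/protocol `caller` uses on `callee`."""
        return {
            (endpoint, protocol): m
            for (c, endpoint, protocol), m in self._edges.get(caller, {}).items()
            if c == callee
        }
```

A lookup by service then answers "what does this service depend on, and how is each dependency behaving", which is the question the later correlation steps need.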
Site Stabilizer | Real-Time and Ad-Hoc Metrics Analysis
Approaches that we tried
• Threshold. Challenge: not all metrics had thresholds
• Statistical. Challenge: expensive, real-time processing, tuning based on the individual metric's behaviour
• Machine Learning. Challenge: expensive, real-time processing, tuning based on the individual metric's behaviour
Clustering Algorithm: K-Means
• Partitions: n observations to k clusters
• Store: can be trained and saved
[Figure: cluster 1, cluster 2, and cluster 3, each with a cluster center]
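As a rough illustration of "n observations to k clusters", here is a minimal pure-Python version of Lloyd's algorithm, the standard K-Means procedure. It is a sketch, not the Site Stabilizer implementation: the function names are hypothetical, and returning the distance to the nearest center from `predict` is an assumption about what a "predict score" could look like.

```python
import math
import random


def _nearest(point, centers):
    """Return (index, distance) of the center closest to `point`."""
    distances = [math.dist(point, c) for c in centers]
    best = min(range(len(centers)), key=distances.__getitem__)
    return best, distances[best]


def fit_kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: partition `points` (a list of tuples) into k clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers picked from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx, _ = _nearest(p, centers)
            clusters[idx].append(p)
        # Move each center to the mean of its members; keep it if it has none.
        new_centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers


def predict(point, centers):
    """Assign `point` to its nearest center; returns (cluster index, distance)."""
    return _nearest(point, centers)
```

Because the trained model here is just a list of centers, it can be serialized (for example as JSON) and reloaded later, which is one simple way to read "can be trained and saved".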
K-Means: How it works
[Figure: observations grouped into cluster 1, cluster 2, and cluster 3 around their cluster centers, yielding a predict score]
Ranking
• Predict score: using K-Means
• Trend score: based on the trend of the time series
• WoW: leverages week-on-week data
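The slides name three ranking inputs but not how they are computed or combined. The sketch below assumes a least-squares slope for the trend score, a relative change against the same window one week earlier for WoW, and a simple weighted sum; all three choices, and every name in the code, are assumptions made for illustration.

```python
def trend_score(series):
    """Least-squares slope of the series against its sample index."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den if den else 0.0


def wow_score(current, last_week):
    """Relative change of the current window's mean versus last week's window."""
    cur = sum(current) / len(current)
    prev = sum(last_week) / len(last_week)
    return (cur - prev) / prev if prev else 0.0


def rank_metrics(metrics, weights=(1.0, 1.0, 1.0)):
    """Rank metric names, most anomalous first.

    metrics: {name: (predict_score, current_series, last_week_series)}
    weights: assumed weights for (predict, trend, WoW).
    """
    w_pred, w_trend, w_wow = weights
    scored = {
        name: w_pred * pred
        + w_trend * abs(trend_score(cur))
        + w_wow * abs(wow_score(cur, prev))
        for name, (pred, cur, prev) in metrics.items()
    }
    return sorted(scored, key=scored.get, reverse=True)
```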
Typical Workflow
• Identify: identify the critical metrics using the k-means method
• Drilldown: drill down to the corresponding critical services
inVisualize | Alert Correlation and Visualization
• Polls the monitoring system continuously for alerts
• Ingests and represents call count, average latency, and error rate from Callgraph
• Correlates downstream alerts using Callgraph
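One way to picture "correlates downstream alerts using Callgraph" is a breadth-first walk from the affected service that collects every reachable dependency with an active alert. This is a sketch under that assumption; the function name and the plain-dict graph representation are hypothetical.

```python
from collections import deque


def alerting_downstreams(service, call_graph, alerting):
    """Return [(service, depth)] for reachable downstreams that have active alerts.

    call_graph: {caller: [callee, ...]}
    alerting:   set of service names with at least one active alert
    """
    seen = {service}  # guards against cycles in the graph
    queue = deque([(service, 0)])
    hits = []
    while queue:
        current, depth = queue.popleft()
        if current != service and current in alerting:
            hits.append((current, depth))
        for callee in call_graph.get(current, []):
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, depth + 1))
    return hits
```

The depth is kept so that a caller can, for example, prefer the deepest alerting dependency as the more likely origin; that preference is a design choice, not something the slides state.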
inVisualize Assumptions
• The more alerts a service has, the more likely it is affected or broken
• The greater the change in latency/errors to a downstream, the more likely it is broken
• The higher the call count to a downstream, the more valuable it is
inVisualize | Alert Correlation and Visualization
• Saves state continuously for replay
• Ranks services by a score, accessible via API
• Score is normalized to a 0-100 range
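The three assumptions (more alerts, larger latency/error change, higher call count) and the 0-100 normalization can be combined in many ways; the slides do not give the formula. The sketch below assumes a simple product of "evidence" and traffic share, scaled so the top service scores 100. The dictionary keys and the formula itself are hypothetical.

```python
def score_services(observations):
    """Score downstream services on a 0-100 scale.

    observations: {service: {"alerts": int, "latency_change": float,
                             "error_change": float, "call_count": int}}
    """
    if not observations:
        return {}
    total_calls = sum(o["call_count"] for o in observations.values()) or 1
    raw = {}
    for service, o in observations.items():
        # Only increases in latency/errors count as evidence of breakage.
        change = max(o["latency_change"], 0.0) + max(o["error_change"], 0.0)
        traffic_share = o["call_count"] / total_calls
        raw[service] = (o["alerts"] + change) * traffic_share
    top = max(raw.values())
    if top == 0:
        return {s: 0.0 for s in raw}
    # Normalize so the highest-scoring service gets 100.
    return {s: round(100.0 * v / top, 1) for s, v in raw.items()}
```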
Recommendation Engine
• Input: service, colo, duration
• Collate: collates the outputs from Site Stabilizer and inVisualize
• User Interface: responsible service, SRE team, correlation confidence score
• Decorate: with information such as scheduled changes, deployments, and A/B experiments
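The collate-and-decorate flow can be sketched as follows, assuming both Site Stabilizer and inVisualize have already produced 0-100 scores for the requested service, colo, and duration. Averaging the two scores, and all names in the code, are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Recommendation:
    service: str       # responsible service
    sre_team: str      # owning SRE team
    confidence: float  # correlation confidence score, 0-100
    context: list = field(default_factory=list)  # deployments, changes, A/B experiments


def recommend(metric_scores, alert_scores, owners, events):
    """Collate two score maps, pick the top service, and decorate it.

    metric_scores / alert_scores: {service: score in 0-100}
    owners: {service: SRE team}
    events: {service: [description of recent change, ...]}
    """
    services = set(metric_scores) | set(alert_scores)
    if not services:
        return None
    # Assumed collation rule: plain average, missing scores count as 0.
    combined = {
        s: (metric_scores.get(s, 0.0) + alert_scores.get(s, 0.0)) / 2
        for s in services
    }
    best = max(sorted(combined), key=combined.get)  # sorted() makes ties deterministic
    return Recommendation(
        service=best,
        sre_team=owners.get(best, "unknown"),
        confidence=combined[best],
        context=list(events.get(best, [])),
    )
```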
Ecosystem Integration
Escalate to correct SRE
Nurse Plan arguments
• service-name: my-frontend
• req_confidence = 85
• escalate=true
Find what’s wrong with ‘my-frontend’ in DatacenterB
Service: Service-C
Confidence: 91%
Reason: ‘Service-C’ has high latency after a deploy
Service Owner: SRE
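The example above suggests a simple decision: escalate only when the recommendation's confidence meets `req_confidence` and `escalate` is enabled. The sketch below shows that logic using the slide's example values; it is not Nurse's actual interface, and the function name, dictionary keys, and returned actions are hypothetical.

```python
def plan_action(recommendation, req_confidence=85, escalate=True):
    """Decide what to do with a correlation result.

    recommendation: dict with 'service', 'confidence', 'owner', and 'reason'.
    """
    if recommendation["confidence"] < req_confidence:
        # Not confident enough to page anyone.
        return {"action": "none", "reason": "confidence below threshold"}
    if not escalate:
        # Confident, but the caller only asked for a report.
        return {"action": "report", "service": recommendation["service"]}
    return {
        "action": "escalate",
        "service": recommendation["service"],
        "to": recommendation["owner"],
        "reason": recommendation["reason"],
    }
```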
Key Takeaways
• Approach: understand what correlation infrastructure makes sense
• Dependencies: understand dependencies
Key Takeaways
Feedback Loops
• Important to get some feedback on accuracy
• Provides a means to do reporting:
  • System effectiveness
  • Engineers saved from escalations
• Use feedback data to train system = Improve Results
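The reporting side of the feedback loop can be sketched as a small aggregation over feedback records. The record fields (`correct`, `escalation_avoided`) are hypothetical; the slides do not describe how feedback is stored.

```python
def feedback_report(feedback):
    """Summarize feedback records.

    feedback: list of dicts with boolean 'correct' and 'escalation_avoided' keys.
    """
    total = len(feedback)
    correct = sum(1 for f in feedback if f["correct"])
    saved = sum(1 for f in feedback if f["escalation_avoided"])
    return {
        "total": total,
        "accuracy": correct / total if total else 0.0,  # system effectiveness
        "engineers_saved": saved,  # escalations that did not reach an engineer
    }
```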
Team
• Michael Kehoe
• Rusty Wickell
• Reynold PJ
• Govindaluri Kishore
• Renjith Rejan
Q&A
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at LinkedIn
