Making Observability Actionable
At Scale
Sisir Koppaka
CTO
Squadcast
1
DevConnect Conference 2019
DBS Asia Hub, Academy
Singapore
Hi there!
Squadcast - Building a simple and free incident
response tool to help increase adoption of Site
Reliability Engineering (SRE)
Built real-time data science pipelines at two
different startups in NYC and BLR
Can disease diagnosis and tracking be automated
with ultrasound? - Research at MIT
Studied Reliability & Production Engineering at IIT
Kharagpur
What I’m definitely NOT!
An expert in Banking
2
Building reliable software
at scale is really hard.
School of Hard Knocks
Squadcast
System of engagement for managing
reliability end-to-end combining human +
machine data
3
Democratize SRE!
Service Definitions
Service Level Objectives (SLOs)
Service Level Indicators (SLIs)
Error Budgets
And
ACTIONS! - what we’ll focus on for this talk
4
What Service Level Objectives (SLOs) Look Like
SLOs, SLIs, Error Budgets and SRE best practices like
generic mitigation help limit toil and help you turn
the vicious cycle into a virtuous cycle
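To make the relationship between an SLO, an SLI and an error budget concrete, here is a minimal sketch of the arithmetic. It assumes a request-based SLI (successful requests over total requests in a window); the class name, field names and sample numbers are illustrative and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # all valid requests seen in the SLO window
    failed_requests: int   # requests in the window that failed

    @property
    def sli(self) -> float:
        """Measured success ratio over the window."""
        return 1.0 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates in the window (the error budget)."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (negative once the SLO is breached)."""
        return 1.0 - self.failed_requests / self.allowed_failures


# Illustrative numbers only: a 99.9% SLO over one million requests.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(f"SLI: {budget.sli:.5f}")
print(f"Error budget remaining: {budget.budget_remaining:.0%}")
```

A team could gate risky releases on `budget_remaining`: plenty of budget left means change can move quickly, while a nearly spent budget suggests slowing down and working on reliability.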
5
6
What does Observability really mean?
How well are you able to infer a system’s internal state given its output?
[Diagram: System, with Input and Output]
Excellent Observability
- Proactive Customer Success
- High TBF
- Low TTA, TTR
- Transparency
- Predictable Change Velocity
- Low Toil
- Sticky Customers
Not-so-good Observability
- Reactive Customer Success
- Low TBF
- High TTA, TTR
- Lack of Transparency
- Unpredictable Change Velocity
- High Toil
- Meh Customers
7
Pillars of
Observability*
Logs
Metrics
Traces
*an apolitical rendition
What does Observability
really mean?
How well are you able to
infer a system’s internal
state given its output?
System
Input Output
8
What the data tells us at
Squadcast!
Time-to-act (TTA) and Time-to-resolve (TTR)
are on average larger and more variable
outside the main working shift
Incident Response globally could be more
consistent, transferable, and scalable
within organizations. Response patterns
cannot be versioned or programmed
against.
Similar to CI/CD circa 2005.
Are we at peak observability as a
community? No. If we can’t act
effectively, we cannot claim peak
observability.
*Normalized across three 8-hour shifts across the world. Data is not representative of any
individual customer.
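The shift comparison above can be sketched as follows. This is only an illustration of the calculation, not Squadcast's pipeline: it assumes each incident carries a trigger time and an acknowledgement time in the responder's local time, and it buckets incidents into three fixed 8-hour shifts. The function names and the sample incidents are made up.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def shift_of(ts: datetime) -> str:
    """Bucket a local timestamp into one of three 8-hour shifts."""
    return ("00-08", "08-16", "16-24")[ts.hour // 8]


def tta_by_shift(incidents):
    """Return {shift: (mean TTA in seconds, standard deviation of TTA in seconds)}.

    `incidents` is an iterable of (triggered_at, acknowledged_at) datetime pairs.
    """
    buckets = defaultdict(list)
    for triggered_at, acknowledged_at in incidents:
        tta = (acknowledged_at - triggered_at).total_seconds()
        buckets[shift_of(triggered_at)].append(tta)
    return {shift: (mean(values), pstdev(values)) for shift, values in buckets.items()}


# Made-up incidents: two during the day shift, two at night.
sample = [
    (datetime(2019, 5, 1, 10, 0), datetime(2019, 5, 1, 10, 4)),
    (datetime(2019, 5, 1, 14, 0), datetime(2019, 5, 1, 14, 6)),
    (datetime(2019, 5, 2, 2, 0), datetime(2019, 5, 2, 2, 10)),
    (datetime(2019, 5, 2, 3, 0), datetime(2019, 5, 2, 3, 30)),
]
stats = tta_by_shift(sample)
for shift in sorted(stats):
    avg, spread = stats[shift]
    print(f"{shift}: mean TTA {avg:.0f}s, stdev {spread:.0f}s")
```

The same grouping works for TTR by swapping the acknowledgement time for the resolution time.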
9
A Deeper Look
(SRE teams at 72 companies)
A majority of respondents
considered themselves SREs (at
well-known companies).
56% were managing between
50-500 services, and 32% were
managing 10-50 services.
10
We may need a fourth
pillar to optimize for
peak observability by
building an active
knowledge repository
of Actions.
11
Pillars of
Observability*
Logs
Metrics
Traces
Actions
Data
Impact
What are Squadcast
Actions?
Quick Demo
12
Actions
- Primitives
squadctl circleci:rebuild platform-js/master/latest
squadctl namespace:action :repo/branch/tag
- Runbooks
for the long tail of response activity
Markdown-supported active runbooks in a
language of your choice
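The primitive syntax shown above (`squadctl namespace:action :repo/branch/tag`) can be broken into its parts with a small parser. This is a hypothetical sketch, not the actual squadctl implementation: it reads `:repo/branch/tag` as a placeholder for three slash-separated fields, and the `ActionCall` type and `parse_action` function are names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionCall:
    namespace: str  # integration the action belongs to, e.g. "circleci"
    action: str     # what to do in that integration, e.g. "rebuild"
    repo: str
    branch: str
    tag: str


def parse_action(command: str) -> ActionCall:
    """Parse 'squadctl namespace:action repo/branch/tag' into its components."""
    parts = command.split()
    if len(parts) != 3 or parts[0] != "squadctl":
        raise ValueError(f"expected 'squadctl namespace:action repo/branch/tag', got {command!r}")

    namespace, sep, action = parts[1].partition(":")
    if not (sep and namespace and action):
        raise ValueError(f"expected 'namespace:action', got {parts[1]!r}")

    target = parts[2].split("/")
    if len(target) != 3 or not all(target):
        raise ValueError(f"expected 'repo/branch/tag', got {parts[2]!r}")

    return ActionCall(namespace, action, *target)


call = parse_action("squadctl circleci:rebuild platform-js/master/latest")
print(call)
```

Note that this simple split would reject branch names that themselves contain a slash; a real tool would need a rule for that case.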
Building Actions
A few things we learnt
along the way
13
Don’t Repeat Yourself (DRY)
Audit Trails with immutable log
Continuous Security
Composing Action Primitives into
Workflows
Continuous Feedback in the SDLC
Heterogeneous Workloads become easier
to support
Hybrid Cloud
And many more...
14
Let’s look at a real example
A Fortune 100 Enterprise has over 100 TB of release artifacts, growing
at double-digit % every year. They have different Engineering teams
for each product line, have a NOC that routes production incidents
to the appropriate team, have a SOC…..
Can we unlock additional value by taking more actions
during incident response that improve observability, and
thereby, the change velocity?
Use Cases
➔ Automatically flagging build artifacts for telemetry spikes, and rolling back
➔ Flagging build artifacts for new vulnerabilities and automated rollbacks
➔ Scaling production environment based on external events such as traffic
spikes
➔ And many more
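The first use case (flag a build artifact when telemetry spikes, then roll back) might look like the sketch below. It assumes a simple z-score test of the current error rate against a recent baseline; the threshold, function names and sample values are illustrative assumptions, not the detection method used in the talk.

```python
from statistics import mean, pstdev


def is_spike(baseline, current, z_threshold=3.0):
    """Return True when `current` sits more than `z_threshold` deviations above the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any increase counts as a spike.
        return current > mu
    return (current - mu) / sigma > z_threshold


def decide(build_id, baseline_error_rates, current_error_rate):
    """Flag the build and recommend a rollback if its error rate spikes."""
    if is_spike(baseline_error_rates, current_error_rate):
        return {"build": build_id, "flag": "telemetry-spike", "action": "rollback"}
    return {"build": build_id, "flag": None, "action": "none"}


# Made-up error rates for the builds that preceded the new release.
baseline = [0.010, 0.012, 0.011, 0.009, 0.010]
print(decide("platform-js/master/latest", baseline, 0.050))
print(decide("platform-js/master/latest", baseline, 0.011))
```

The returned flag is the kind of metadata that could be written back onto the artifact so later incidents can query for it.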
15
Release Promotion and the SRE Loop
For a simple workload
[Diagram: release promotion pipeline with numbered callouts 1-3. Stages shown: VCS, Dev, Staging, GA, Production, each holding Artifacts behind a Quality Gate. SRE loop steps shown: Triage, Generic Remediation, SLO Breach, Incident Routing, Root Cause Analysis.]
16
Release Promotion and the SRE Loop
[Diagram: release promotion pipeline with numbered callouts 1-3. Stages shown: VCS, Dev, Staging, GA, Production, each holding Artifacts behind a Quality Gate. SRE loop steps shown: Triage, Generic Remediation, SLO Breach, Incident Routing, Root Cause Analysis.]
Motivation
Improving Observability can reduce the drag force on change velocity
17
Drag Force Reduction At Scale
With Superior Traceability
- Backpropagate accurate and
real-time metadata associated with
releases to JFrog Artifactory
(example used here) or Sonatype
Nexus
- Use metadata to programmatically
drive incident response using
Artifactory Query Language in
Squadcast Runbooks
Quick Demo
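As a rough illustration of driving response from release metadata, the sketch below builds an Artifactory Query Language (AQL) query that finds artifacts in a repository carrying a given property. It uses AQL's `items.find(...)` form with a property criterion prefixed by `@`. The repository name and the `incident.flag` property are hypothetical, and the code only constructs the query text; sending it (typically as a plain-text POST to Artifactory's AQL search endpoint) is left out.

```python
import json


def build_aql(repo: str, prop: str, value: str) -> str:
    """Build an AQL query for items in `repo` whose property `prop` equals `value`."""
    criteria = {
        "repo": {"$eq": repo},
        f"@{prop}": {"$eq": value},  # "@name" addresses an item property in AQL
    }
    return f'items.find({json.dumps(criteria)}).include("repo","path","name")'


# Hypothetical repository and property names, for illustration only.
query = build_aql("docker-prod-local", "incident.flag", "telemetry-spike")
print(query)
```

A runbook step could run a query like this to list every release artifact already flagged during earlier incidents before deciding what to roll back.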
18
How Squadcast Works
- Squadcast Actions and Runbooks
which trigger programmatic
response during incident response
- Human-in-the-loop,
machine-assisted
- Primitives can be composed -
primitives to snippets to more
complicated workflows
- Functional from all interfaces
including mobile, ‘coz incidents
happen anytime, anywhere.
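The idea of composing primitives into larger workflows while keeping a human in the loop can be sketched as plain function composition with an approval callback. Everything here is hypothetical: `rollback` and `notify` stand in for real Action primitives, and `approve` stands in for whatever prompt a responder would see on web or mobile.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]
Primitive = Callable[[Context], Context]


def rollback(ctx: Context) -> Context:
    """Hypothetical primitive: record that the flagged build was rolled back."""
    return {**ctx, "rolled_back": ctx["build"]}


def notify(ctx: Context) -> Context:
    """Hypothetical primitive: record that the owning team was notified."""
    return {**ctx, "notified": True}


def compose(*steps: Primitive, approve: Callable[[str, Context], bool]) -> Primitive:
    """Chain primitives into one workflow, asking `approve` before each step runs."""
    def workflow(ctx: Context) -> Context:
        log: List[str] = []
        for step in steps:
            if not approve(step.__name__, ctx):
                # The human declined: stop here and leave a trail of what happened.
                log.append(f"skipped {step.__name__}")
                break
            ctx = step(ctx)
            log.append(f"ran {step.__name__}")
        return {**ctx, "log": log}
    return workflow


run_workflow = compose(rollback, notify, approve=lambda name, ctx: True)
result = run_workflow({"build": "platform-js/master/latest"})
print(result["log"])
```

Because each run returns its own log, the same structure also gives a simple audit trail of which steps ran and which were declined.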
19
Understanding Failure Modes
- Known Knowns (e.g. that telemetry spike): automate
- Known Unknowns (e.g. external traffic spikes): prepare, then human-in-loop
- Unknown Knowns (e.g. vulnerabilities): prepare, then human-in-loop
- Unknown Unknowns: convert to the other types
Let’s start the clock!
20
Responding to Failure Modes
- Known Knowns (e.g. telemetry spike): automate
- Known Unknowns (e.g. external traffic spikes): prepare, then human-in-loop
- Unknown Knowns (e.g. vulnerabilities): prepare, then human-in-loop
- Unknown Unknowns: convert to the other 3 types
Let’s start the clock!
What we’ll take a look at in the Demo
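The four failure modes and their responses can be written down as a small lookup, which is handy when a runbook has to decide whether it may act on its own. The enum and function names are illustrative; the response for each quadrant is taken from the slide.

```python
from enum import Enum


class FailureMode(Enum):
    KNOWN_KNOWN = "known known"          # e.g. a telemetry spike
    KNOWN_UNKNOWN = "known unknown"      # e.g. external traffic spikes
    UNKNOWN_KNOWN = "unknown known"      # e.g. vulnerabilities
    UNKNOWN_UNKNOWN = "unknown unknown"


# Response per quadrant, as listed on the slide.
RESPONSE = {
    FailureMode.KNOWN_KNOWN: "automate",
    FailureMode.KNOWN_UNKNOWN: "prepare, then human-in-loop",
    FailureMode.UNKNOWN_KNOWN: "prepare, then human-in-loop",
    FailureMode.UNKNOWN_UNKNOWN: "convert to one of the other three types",
}


def needs_human(mode: FailureMode) -> bool:
    """Only known knowns are handled without a person in the loop."""
    return mode is not FailureMode.KNOWN_KNOWN


for mode in FailureMode:
    print(f"{mode.value}: {RESPONSE[mode]} (human needed: {needs_human(mode)})")
```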
21
DEMO
1. Improving traceability by building a loop
between release metadata / change
requests and incident response
2. Enrich production context by annotating
Actions more comprehensively in your
visualization tool like Grafana
3. Try at home - Improve and automate
response to vulnerabilities on a real-time
basis (you can start with automating
response to vulnerabilities from Snyk)
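For the second demo item, annotating Actions in Grafana, a minimal sketch of the annotation body is shown below. Grafana's annotations HTTP API accepts a JSON body with `time` (epoch milliseconds), `tags` and `text`; the tag names, the helper function and the sample values are assumptions made for illustration, and the code only builds the payload rather than calling a Grafana server.

```python
import json
from datetime import datetime, timezone


def annotation_payload(action: str, actor: str, when: datetime, incident_id: str) -> dict:
    """Build a Grafana annotation body describing an Action taken during an incident."""
    return {
        "time": int(when.timestamp() * 1000),  # Grafana expects epoch milliseconds
        "tags": ["squadcast-action", f"incident:{incident_id}"],
        "text": f"{actor} ran {action}",
    }


# Made-up action, responder and incident identifier.
payload = annotation_payload(
    action="circleci:rebuild platform-js/master/latest",
    actor="on-call engineer",
    when=datetime(2019, 5, 1, 10, 0, tzinfo=timezone.utc),
    incident_id="demo-1",
)
print(json.dumps(payload, indent=2))
```

Posting one annotation per Action puts the response timeline on the same graphs as the telemetry, so a spike and the rollback that fixed it can be read side by side.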
Known Known
Ex - telemetry spike
Automate
Unknown Knowns
Ex - Vulnerabilities
Prepare, then
human-in-loop
Known Unknowns
Ex - External Traffic Spikes
Prepare, then
human-in-loop
22
Known Known
Ex - telemetry spike
Automate
23
Known Known
Ex - telemetry spike
Automate
24
Here’s one more idea...
Actions help make your system more Observable.
What does the modern
enterprise gain from
the fourth pillar of
Observability?
25
Top 3 Priorities of the Modern Enterprise*
88% Revenue Acceleration
71% Improved Agility and faster Time to
Market
47% Cost Reduction
29% Better Management of Regulatory and
Compliance Risks
29% Increased CSAT
41% Other (Brand, Strategic, Financial)
*McKinsey Digital Survey of CIOs/CTOs at 52 enterprises. 78 percent work at
orgs with 5,000+ employees, and 44 percent work at companies with annual
revenues of $10 billion+
26
Thank you!
t: @sisirkoppaka / @squadcastHQ
e: sisir@squadcast.com

Making Observability Actionable At Scale - DBS DevConnect 2019

  • 1. Making Observability Actionable At Scale Sisir Koppaka CTO Squadcast 1 DevConnect Conference 2019 DBS Asia Hub, Academy Singapore
  • 2. Hi there ! Squadcast - Building a simple and free incident response tool to help increase adoption of Site Reliability Engineering (SRE) Built real-time data science pipelines at two different startups in NYC and BLR Can disease diagnosis and tracking be automated with ultrasound? - Research at MIT Studied Reliability & Production Engineering at IIT Kharagpur What I’m definitely NOT! An expert in Banking 2 Building reliable software at scale is really hard. School of Hard Knocks
  • 3. Squadcast System of engagement for managing reliability end-to-end combining human + machine data 3 Democratize SRE! Service Definitions Service Level Objectives (SLOs) Service Level Indicators (SLIs) Error Budgets And ACTIONS! - what we’ll focus on for this talk
  • 4. 4 What Service Level Objectives (SLOs) Look Like SLOs, SLIs, Error Budgets and SRE best practices like generic mitigation help limit toil and help you turn the vicious cycle into a virtuous cycle
  • 5. 5
  • 6. 6 Excellent What does Observability really mean? How well are you able to infer a system’s internal state given its output? System Input Output Not-so-good Proactive Customer Success High TBF Low TTA, TTR Transparency Predictable Change Velocity Low Toil Sticky Customers Reactive Customer Success Low TBF High TTA, TTR Lack of Transparency Unpredictable Change Velocity High Toil Meh Customers Observability ...
  • 7. 7 Pillars of Observability* Logs Metrics Traces *an apolitical rendition What does Observability really mean? How well are you able to infer a system’s internal state given its output? System Input Output
  • 8. 8 What the data tells us at Squadcast! Time-to-act (TTA) and Time-to-resolve (TTR) are on average larger and more variable outside the main working shift Incident Response globally could be more consistent, transferable, and scalable within organizations. Response patterns cannot be versioned or programmed against. Similar to CI/CD circa 2005. Are we at peak observability as a community? No. If we can’t act effectively, we cannot claim peak observability. *Normalized across three 8-hour shifts across the world. Data is not representative of any individual customer.
  • 9. 9 A Deeper Look (SRE teams at 72 companies) A majority of respondents considered themselves SREs (at well-known companies). 56% were managing between 50-500 services, and 32% were managing 10-50 services.
  • 10. 10
  • 11. We may need a fourth pillar to optimize for peak observability by building an active knowledge repository of Actions. 11 Pillars of Observability* Logs Metrics Traces Actions Data Impact
  • 12. What are Squadcast Actions? Quick Demo 12 Actions - Primitives squadctl circleci:rebuild platform-js/master/latest squadctl namespace:action :repo/branch/tag - Runbooks for the long tail of response activity Markdown-supported active runbooks in a language of your choice
  • 13. Building Actions A few things we learnt along the way 13 Don’t Repeat Yourself (DRY) Audit Trails with immutable log Continuous Security Composing Action Primitives into Workflows Continuous Feedback in the SDLC Heterogeneous Workloads become easier to support Hybrid Cloud And many more...
  • 14. 14 Let’s look at a real example A Fortune 100 Enterprise has over 100 TB of release artifacts, growing at double-digit % every year. They have different Engineering teams for each product line, have a NOC that routes production incidents to the appropriate team, have a SOC….. Can we unlock additional value by taking more actions during incident response that improves observability, and thereby, the change velocity? Use Cases ➔ Automatically flagging build artifacts for telemetry spikes, and rolling back ➔ Flagging build artifacts for new vulnerabilities and automated rollbacks ➔ Scaling production environment based on external events such as traffic spikes ➔ And many more
  • 15. 15 Release Promotion and the SRE Loop For a simple workload 1 2 3 V C S Dev Artifacts Quality Gate Staging Artifacts Quality Gate GA Artifacts Quality Gate Production Artifacts Quality Gate Triage Generic Remediation SLO Breach Incident Routing Root Cause Analysis
  • 16. 16 Release Promotion and the SRE Loop 1 2 3 V C S Dev Artifacts Quality Gate Staging Artifacts Quality Gate GA Artifacts Quality Gate Production Artifacts Quality Gate Triage Generic Remediation SLO Breach Incident Routing Root Cause Analysis Motivation Improving Observability can reduce the drag force on change velocity
  • 17. 17 Drag Force Reduction At Scale With Superior Traceability - Backpropagate accurate and real-time metadata associated with releases to JFrog Artifactory (example used here) or Sonatype Nexus - Use metadata to programmatically drive incident response using Artifactory Query Language in Squadcast Runbooks Quick Demo
  • 18. 18 How Squadcast Works - Squadcast Actions and Runbooks which trigger programmatic response during incident response - Human-in-the-loop, machine-assisted - Primitives can be composed - primitives to snippets to more complicated workflows - Functional from all interfaces including mobile, ‘coz incidents happen anytime, anywhere.
  • 19. 19 Known Known Ex - that telemetry spike Automate Known Unknowns Ex - External Traffic Spikes Prepare, then human-in-loop Unknown Knowns Ex - Vulnerabilities Prepare, then human-in-loop Unknown Unknowns Convert to others Let’s start the clock! Understanding Failure Modes
  • 20. 20 Known Known Ex - telemetry spike Automate Known Unknowns Ex - External Traffic Spikes Prepare, then human-in-loop Unknown Knowns Ex - Vulnerabilities Prepare, then human-in-loop Unknown Unknowns Convert to other 3 types Let’s start the clock! What we’ll take a look at in the Demo Responding to Failure Modes
  • 21. 21 DEMO 1. Improving traceability by building a loop between release metadata / change requests and incident response 2. Enrich production context by annotating Actions more comprehensively in your visualization tool like Grafana 3. Try at home - Improve and automate response to vulnerabilities on a real-time basis (you can start with automating response to vulnerabilities from Snyk) Known Known Ex - telemetry spike Automate Unknown Knowns Ex - Vulnerabilities Prepare, then human-in-loop Known Unknowns Ex - External Traffic Spikes Prepare, then human-in-loop
  • 22. 22 Known Known Ex - telemetry spike Automate
  • 23. 23 Known Known Ex - telemetry spike Automate
  • 24. 24 Here’s one more idea... Actions help make your system more Observable.
  • 25. What does the modern enterprise gain from the fourth pillar of Observability? 25 Top 3 Priorities of the Modern Enterprise* 88% Revenue Acceleration 71% Improved Agility and faster Time to Market 47% Cost Reduction 29% Better Management of Regulatory and Compliance Risks 29% Increased CSAT 41% Other (Brand, Strategic, Financial) *McKinsey Digital Survey of CIOs/CTOs at 52 enterprises. 78 percent work at orgs with 5,000+ employees, and 44 percent work at companies with annual revenues of $10 billion+
  • 26. 26 Thank you! t: @sisirkoppaka / @squadcastHQ e: sisir@squadcast.com
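
Slide 12 shows an Action primitive in the form squadctl namespace:action :repo/branch/tag, with squadctl circleci:rebuild platform-js/master/latest as the example. As a rough illustration of how such a string could be split into parts that a workflow might then compose, here is a minimal Python sketch. The ActionPrimitive class and parse_primitive function are hypothetical names invented for this example; they are not part of Squadcast's tooling, and the sketch assumes the target always has exactly three slash-separated segments, as in the slide's example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionPrimitive:
    """Hypothetical container for the parts of one primitive."""
    namespace: str  # e.g. "circleci"
    action: str     # e.g. "rebuild"
    repo: str       # e.g. "platform-js"
    branch: str     # e.g. "master"
    tag: str        # e.g. "latest"


def parse_primitive(command: str) -> ActionPrimitive:
    """Split 'squadctl namespace:action repo/branch/tag' into its parts.

    Assumption: exactly three whitespace-separated tokens, and a target
    with exactly three non-empty segments (branch names containing '/'
    are not handled by this sketch).
    """
    parts = command.split()
    if len(parts) != 3 or parts[0] != "squadctl":
        raise ValueError(f"expected 'squadctl <ns>:<action> <target>', got {command!r}")

    _, verb, target = parts

    # The verb is 'namespace:action'; partition keeps this to one split.
    namespace, sep, action = verb.partition(":")
    if not sep or not namespace or not action:
        raise ValueError(f"expected 'namespace:action', got {verb!r}")

    segments = target.split("/")
    if len(segments) != 3 or not all(segments):
        raise ValueError(f"expected 'repo/branch/tag', got {target!r}")

    repo, branch, tag = segments
    return ActionPrimitive(namespace, action, repo, branch, tag)


if __name__ == "__main__":
    print(parse_primitive("squadctl circleci:rebuild platform-js/master/latest"))
```

Keeping the parsed result immutable (frozen=True) is one way to fit the slide 13 point about audit trails: a primitive that has been logged cannot be modified afterwards by later steps in a workflow.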