I’m No Hero: Full Stack Reliability at LinkedIn
SITE RELIABILITY ENGINEERING © 2016 LinkedIn Corporation. All Rights Reserved.
Todd Palino
What is Site Reliability Engineering?
Types of SRE
• Embedded
• Central (or Production SRE)
• Tools and Infrastructure
We Can’t Do It Alone
• The Kafka SRE team is 3 people in the US and 1.5 SREs in Bangalore
• We manage over 6,000 application instances
  – 100 Kafka clusters, with 1,800 brokers
  – Over 1 trillion messages a day
• The environment is never static from one day to the next
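To put those figures in perspective, the back-of-the-envelope arithmetic can be sketched as below. The inputs are the numbers on the slide. The headcount of 4.5 is simply 3 + 1.5, and the per-second rate assumes traffic is spread evenly over the day, which is a simplification rather than anything the talk states.

```python
# Figures taken from the slide; everything derived below is rough arithmetic.
SRE_HEADCOUNT = 3 + 1.5                 # US + Bangalore
APP_INSTANCES = 6000                    # "over 6000", so a lower bound
CLUSTERS = 100
BROKERS = 1800
MESSAGES_PER_DAY = 1_000_000_000_000    # "over 1 trillion", also a lower bound

SECONDS_PER_DAY = 24 * 60 * 60


def scale_summary():
    """Return the derived ratios as a dict of floats."""
    return {
        "instances_per_sre": APP_INSTANCES / SRE_HEADCOUNT,
        "brokers_per_cluster": BROKERS / CLUSTERS,
        "messages_per_second": MESSAGES_PER_DAY / SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in scale_summary().items():
        print(f"{name}: {value:,.0f}")
```

On these assumptions each SRE covers more than 1,300 application instances, and the clusters together average roughly 11.6 million messages per second.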
Maslow’s Hierarchy
Todd’s Hierarchy of Reliability
Infrastructure as a Service
• SREs do not deploy hardware and OS
• Production Operations
  – Datacenter Technicians
  – Systems Operations
  – Network Operations
• Provide all basic OS and network services
• There is still tweaking for individual applications
Common Repositories
• All source code and configurations are committed to one place
• Subversion and Git are centrally managed
• Consistent management
  – Precommit checks
  – ACLs and review boards
• Connects directly to the build systems
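The talk does not say which precommit checks are run, so the following is only a minimal sketch of the idea: a central check that every commit must pass before it is accepted. The two rules (a size cap and a scan for merge conflict markers) and all of the names are hypothetical.

```python
# Hypothetical precommit check; the rules and names are illustrative only.
MAX_FILE_BYTES = 1_000_000                   # assumed size cap, not from the talk
CONFLICT_MARKERS = ("<<<<<<< ", ">>>>>>> ")  # Git-style conflict markers


def check_file(path, content):
    """Return a list of human-readable problems found in one changed file."""
    problems = []
    if len(content.encode("utf-8")) > MAX_FILE_BYTES:
        problems.append(f"{path}: file exceeds {MAX_FILE_BYTES} bytes")
    for lineno, line in enumerate(content.splitlines(), start=1):
        if line.startswith(CONFLICT_MARKERS):
            problems.append(f"{path}:{lineno}: unresolved merge conflict marker")
    return problems


def precommit(changed_files):
    """changed_files maps path -> text content. Returns (ok, problems)."""
    problems = []
    for path, content in sorted(changed_files.items()):
        problems.extend(check_file(path, content))
    return (not problems, problems)
```

Because the check lives with the repository rather than with each team, every commit is held to the same rules, which is the "consistent management" point above.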
Containerization
• Most of our stack is Java
  – Python is well-supported
  – Always a few one-offs
• Java applications have Tomcat and Jetty containers
  – Hooks for monitoring
  – Client libraries are managed by the team that owns the application
• Provides a consistent control surface for applications
Build and Deployment
• When code is committed, it is automatically built
  – Successes become deployment artifacts
  – Failures are tracked via Jira
• Build systems are centrally managed
• Common tools
  – Dependency management and introspection
  – Version management
  – Error budgeting
  – Deployment
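The slide names error budgeting without defining it. One common definition, assumed here rather than taken from the talk, is that the budget is the unavailability an availability target still permits: (1 - SLO) times the length of the window. The function names and the 30-day default below are illustrative.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the budget still unspent; a negative value means overspent."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        # A 100% target leaves no budget at all.
        return 0.0 if downtime_minutes == 0 else float("-inf")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f} minutes per 30 days")
```

With a 99.9% target over 30 days, the budget comes to about 43.2 minutes. A shared tool that tracks how much of that has been spent gives every team the same basis for deciding whether a risky deployment can go ahead.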
Monitoring
• Monitoring, graphing, and alerting as a service
• Completely self-service
  – Applications annotate metrics and they are automatically collected
  – Monitoring dashboards can be created by anyone
• Automatic metrics and dashboards for common features
  – HTTP servers, system and OS metrics
  – Client libraries (such as Kafka)
• Additional metrics can be published outside the container
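"Annotate and collect" can be pictured with a small decorator-based sketch. This is not LinkedIn's monitoring API: the registry, the `timed_metric` decorator, the metric name, and `lookup_profile` are all hypothetical, and a real system would export the data rather than keep it in a dict.

```python
import functools
import time

# Hypothetical in-process registry; a real collector would scrape or export this.
REGISTRY = {}


def timed_metric(name):
    """Decorator that records call count and total seconds under `name`."""
    def decorator(func):
        stats = REGISTRY.setdefault(name, {"calls": 0, "total_seconds": 0.0})

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the call even if the wrapped function raised.
                stats["calls"] += 1
                stats["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@timed_metric("profile.lookup")
def lookup_profile(member_id):
    """Stand-in for application code; the owner only adds the annotation."""
    return {"id": member_id}


def collect():
    """What an automatic collector might read: a snapshot of every metric."""
    return {name: dict(stats) for name, stats in REGISTRY.items()}
```

The application owner writes only the annotation. Collection, graphing, and alerting can then be handled by the shared service without any per-application setup.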
Site Up
• With the stack supporting it, applications sit on top
  – SREs architect and run the application
  – SRE and developers respond to failures
• The NOC monitors high-level metrics
  – Overall site health and growth metrics
  – They also coordinate incident response
• Incident response is blameless
  – Fix the problem, don’t fix the blame
Review and Revise
• All components are constantly improving
  – Incidents expose issues in the infrastructure
  – Feedback from usage of the tools
• Steering committees discuss large-scale changes
  – Production Operations, SRE, and Development each have their own
  – Composed of individual contributors, not managers
• Open collaboration
  – Common repositories mean everyone can help
16
I'm No Hero: Full Stack Reliability at LinkedIn

Mais conteúdo relacionado

Mais procurados

Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...Legacy Typesafe (now Lightbend)
 
Nano Server - the future of Windows Server - Thomas Maurer
Nano Server - the future of Windows Server - Thomas MaurerNano Server - the future of Windows Server - Thomas Maurer
Nano Server - the future of Windows Server - Thomas MaurerITCamp
 
What's New in Hyper-V 2016 - Thomas Maurer
What's New in Hyper-V 2016 - Thomas MaurerWhat's New in Hyper-V 2016 - Thomas Maurer
What's New in Hyper-V 2016 - Thomas MaurerITCamp
 
The Top Outages of 2021: Analysis and Takeaways
The Top Outages of 2021: Analysis and TakeawaysThe Top Outages of 2021: Analysis and Takeaways
The Top Outages of 2021: Analysis and TakeawaysThousandEyes
 
Cisco IT and ThousandEyes
Cisco IT and ThousandEyesCisco IT and ThousandEyes
Cisco IT and ThousandEyesThousandEyes
 
UCS Update: Efficiently Managing your server environment for traditional ente...
UCS Update: Efficiently Managing your server environment for traditional ente...UCS Update: Efficiently Managing your server environment for traditional ente...
UCS Update: Efficiently Managing your server environment for traditional ente...Cisco Canada
 
APIC EM APIs: a deep dive
APIC EM APIs: a deep diveAPIC EM APIs: a deep dive
APIC EM APIs: a deep diveCisco DevNet
 
Cisco ONE Enterprise Cloud (UCSD) Hands-on Lab
Cisco ONE Enterprise Cloud (UCSD) Hands-on LabCisco ONE Enterprise Cloud (UCSD) Hands-on Lab
Cisco ONE Enterprise Cloud (UCSD) Hands-on LabCisco Canada
 
Open Source Applied - Real World Use Cases
Open Source Applied - Real World Use CasesOpen Source Applied - Real World Use Cases
Open Source Applied - Real World Use CasesAll Things Open
 
Cisco ACI for the Microsoft Cloud Platform
Cisco ACI for the Microsoft Cloud PlatformCisco ACI for the Microsoft Cloud Platform
Cisco ACI for the Microsoft Cloud PlatformShashi Kiran
 
Is the Cloud Going to Kill Traditional Application Delivery?
Is the Cloud Going to Kill Traditional Application Delivery?Is the Cloud Going to Kill Traditional Application Delivery?
Is the Cloud Going to Kill Traditional Application Delivery?Imperva Incapsula
 
Travelling in time with SQL Server 2016 - Damian Widera
Travelling in time with SQL Server 2016 - Damian WideraTravelling in time with SQL Server 2016 - Damian Widera
Travelling in time with SQL Server 2016 - Damian WideraITCamp
 
Riverbed Performance Management
Riverbed Performance ManagementRiverbed Performance Management
Riverbed Performance ManagementCTI Group
 
ThousandEyes Alerting Essentials for Your Network
ThousandEyes Alerting Essentials for Your NetworkThousandEyes Alerting Essentials for Your Network
ThousandEyes Alerting Essentials for Your NetworkThousandEyes
 
Oracle Public Cloud Operations from ThousandEyes Connect
Oracle Public Cloud Operations from ThousandEyes ConnectOracle Public Cloud Operations from ThousandEyes Connect
Oracle Public Cloud Operations from ThousandEyes ConnectThousandEyes
 
Ocs F5 Bigip Bestpractices
Ocs F5 Bigip BestpracticesOcs F5 Bigip Bestpractices
Ocs F5 Bigip BestpracticesThiago Gutierri
 
F5 iHealth Presentation 10 22-10
F5 iHealth Presentation 10 22-10F5 iHealth Presentation 10 22-10
F5 iHealth Presentation 10 22-10F5 Networks
 
Introduction to ThousandEyes
Introduction to ThousandEyesIntroduction to ThousandEyes
Introduction to ThousandEyesThousandEyes
 
Top 5 favourite features of Cisco ACI in Pulsant Cloud Data Centres
Top 5 favourite features of Cisco ACI in Pulsant Cloud Data Centres Top 5 favourite features of Cisco ACI in Pulsant Cloud Data Centres
Top 5 favourite features of Cisco ACI in Pulsant Cloud Data Centres Martin Lipka
 
SDN in the Enterprise: APIC Enterprise Module
SDN in the Enterprise:  APIC Enterprise Module SDN in the Enterprise:  APIC Enterprise Module
SDN in the Enterprise: APIC Enterprise Module Cisco Canada
 

Mais procurados (20)

Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
Modernizing Your Aging Architecture: What Enterprise Architects Need To Know ...
 
Nano Server - the future of Windows Server - Thomas Maurer
Nano Server - the future of Windows Server - Thomas MaurerNano Server - the future of Windows Server - Thomas Maurer
Nano Server - the future of Windows Server - Thomas Maurer
 
What's New in Hyper-V 2016 - Thomas Maurer
What's New in Hyper-V 2016 - Thomas MaurerWhat's New in Hyper-V 2016 - Thomas Maurer
What's New in Hyper-V 2016 - Thomas Maurer
 
The Top Outages of 2021: Analysis and Takeaways
The Top Outages of 2021: Analysis and TakeawaysThe Top Outages of 2021: Analysis and Takeaways
The Top Outages of 2021: Analysis and Takeaways
 
Cisco IT and ThousandEyes
Cisco IT and ThousandEyesCisco IT and ThousandEyes
Cisco IT and ThousandEyes
 
UCS Update: Efficiently Managing your server environment for traditional ente...
UCS Update: Efficiently Managing your server environment for traditional ente...UCS Update: Efficiently Managing your server environment for traditional ente...
UCS Update: Efficiently Managing your server environment for traditional ente...
 
APIC EM APIs: a deep dive
APIC EM APIs: a deep diveAPIC EM APIs: a deep dive
APIC EM APIs: a deep dive
 
Cisco ONE Enterprise Cloud (UCSD) Hands-on Lab
Cisco ONE Enterprise Cloud (UCSD) Hands-on LabCisco ONE Enterprise Cloud (UCSD) Hands-on Lab
Cisco ONE Enterprise Cloud (UCSD) Hands-on Lab
 
Open Source Applied - Real World Use Cases
Open Source Applied - Real World Use CasesOpen Source Applied - Real World Use Cases
Open Source Applied - Real World Use Cases
 
Cisco ACI for the Microsoft Cloud Platform
Cisco ACI for the Microsoft Cloud PlatformCisco ACI for the Microsoft Cloud Platform
Cisco ACI for the Microsoft Cloud Platform
 
Is the Cloud Going to Kill Traditional Application Delivery?
Is the Cloud Going to Kill Traditional Application Delivery?Is the Cloud Going to Kill Traditional Application Delivery?
Is the Cloud Going to Kill Traditional Application Delivery?
 
Travelling in time with SQL Server 2016 - Damian Widera
Travelling in time with SQL Server 2016 - Damian WideraTravelling in time with SQL Server 2016 - Damian Widera
Travelling in time with SQL Server 2016 - Damian Widera
 
Riverbed Performance Management
Riverbed Performance ManagementRiverbed Performance Management
Riverbed Performance Management
 
ThousandEyes Alerting Essentials for Your Network
ThousandEyes Alerting Essentials for Your NetworkThousandEyes Alerting Essentials for Your Network
ThousandEyes Alerting Essentials for Your Network
 
Oracle Public Cloud Operations from ThousandEyes Connect
Oracle Public Cloud Operations from ThousandEyes ConnectOracle Public Cloud Operations from ThousandEyes Connect
Oracle Public Cloud Operations from ThousandEyes Connect
 
Ocs F5 Bigip Bestpractices
Ocs F5 Bigip BestpracticesOcs F5 Bigip Bestpractices
Ocs F5 Bigip Bestpractices
 
F5 iHealth Presentation 10 22-10
F5 iHealth Presentation 10 22-10F5 iHealth Presentation 10 22-10
F5 iHealth Presentation 10 22-10
 
Introduction to ThousandEyes
Introduction to ThousandEyesIntroduction to ThousandEyes
Introduction to ThousandEyes
 
Top 5 favourite features of Cisco ACI in Pulsant Cloud Data Centres
Top 5 favourite features of Cisco ACI in Pulsant Cloud Data Centres Top 5 favourite features of Cisco ACI in Pulsant Cloud Data Centres
Top 5 favourite features of Cisco ACI in Pulsant Cloud Data Centres
 
SDN in the Enterprise: APIC Enterprise Module
SDN in the Enterprise:  APIC Enterprise Module SDN in the Enterprise:  APIC Enterprise Module
SDN in the Enterprise: APIC Enterprise Module
 

Destaque

Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesTodd Palino
 
Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak PerformanceTodd Palino
 
Site Reliability Engineering Helps Google Conquer The World
Site Reliability Engineering Helps Google Conquer The WorldSite Reliability Engineering Helps Google Conquer The World
Site Reliability Engineering Helps Google Conquer The WorldVistara
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into OverdriveTodd Palino
 
Works of site reliability engineer
Works of site reliability engineerWorks of site reliability engineer
Works of site reliability engineerShohei Kobayashi
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaJiangjie Qin
 
Kafka overview and use cases
Kafka overview and use casesKafka overview and use cases
Kafka overview and use casesIndrajeet Kumar
 
Stephen McHenry - Chanecellor of Site Reliability Engineering, Google
Stephen McHenry - Chanecellor of Site Reliability Engineering, GoogleStephen McHenry - Chanecellor of Site Reliability Engineering, Google
Stephen McHenry - Chanecellor of Site Reliability Engineering, GoogleIE Group
 
Building an E-commerce website in MEAN stack
Building an E-commerce website in MEAN stackBuilding an E-commerce website in MEAN stack
Building an E-commerce website in MEAN stackdivyapisces
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per SecondAmazon Web Services
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineMonal Daxini
 
You got a couple Microservices, now what? - Adding SRE to DevOps
You got a couple Microservices, now what?  - Adding SRE to DevOpsYou got a couple Microservices, now what?  - Adding SRE to DevOps
You got a couple Microservices, now what? - Adding SRE to DevOpsGonzalo Maldonado
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereGwen (Chen) Shapira
 
The Startup Relationship Survival Guide by Nicole Cottrell
The Startup Relationship Survival Guide by Nicole CottrellThe Startup Relationship Survival Guide by Nicole Cottrell
The Startup Relationship Survival Guide by Nicole CottrellPHX Startup Week
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafkaJiangjie Qin
 
SRE - drupal day aveiro 2016
SRE - drupal day aveiro 2016SRE - drupal day aveiro 2016
SRE - drupal day aveiro 2016Ricardo Amaro
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsBrendan Gregg
 

Destaque (20)

Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
 
Kafka at Peak Performance
Kafka at Peak PerformanceKafka at Peak Performance
Kafka at Peak Performance
 
Site Reliability Engineering Helps Google Conquer The World
Site Reliability Engineering Helps Google Conquer The WorldSite Reliability Engineering Helps Google Conquer The World
Site Reliability Engineering Helps Google Conquer The World
 
Putting Kafka Into Overdrive
Putting Kafka Into OverdrivePutting Kafka Into Overdrive
Putting Kafka Into Overdrive
 
Works of site reliability engineer
Works of site reliability engineerWorks of site reliability engineer
Works of site reliability engineer
 
SRE From Scratch
SRE From ScratchSRE From Scratch
SRE From Scratch
 
Producer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache KafkaProducer Performance Tuning for Apache Kafka
Producer Performance Tuning for Apache Kafka
 
Kafka overview and use cases
Kafka overview and use casesKafka overview and use cases
Kafka overview and use cases
 
Stephen McHenry - Chanecellor of Site Reliability Engineering, Google
Stephen McHenry - Chanecellor of Site Reliability Engineering, GoogleStephen McHenry - Chanecellor of Site Reliability Engineering, Google
Stephen McHenry - Chanecellor of Site Reliability Engineering, Google
 
Building an E-commerce website in MEAN stack
Building an E-commerce website in MEAN stackBuilding an E-commerce website in MEAN stack
Building an E-commerce website in MEAN stack
 
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second(BDT318) How Netflix Handles Up To 8 Million Events Per Second
(BDT318) How Netflix Handles Up To 8 Million Events Per Second
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
You got a couple Microservices, now what? - Adding SRE to DevOps
You got a couple Microservices, now what?  - Adding SRE to DevOpsYou got a couple Microservices, now what?  - Adding SRE to DevOps
You got a couple Microservices, now what? - Adding SRE to DevOps
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
The Startup Relationship Survival Guide by Nicole Cottrell
The Startup Relationship Survival Guide by Nicole CottrellThe Startup Relationship Survival Guide by Nicole Cottrell
The Startup Relationship Survival Guide by Nicole Cottrell
 
SRE Tools
SRE ToolsSRE Tools
SRE Tools
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
No data loss pipeline with apache kafka
No data loss pipeline with apache kafkaNo data loss pipeline with apache kafka
No data loss pipeline with apache kafka
 
SRE - drupal day aveiro 2016
SRE - drupal day aveiro 2016SRE - drupal day aveiro 2016
SRE - drupal day aveiro 2016
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 

Semelhante a I'm No Hero: Full Stack Reliability at LinkedIn

Linked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaLinked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaNitin Kumar
 
Splunk conf2014 - Getting Deeper Insights into your Virtualization and Storag...
Splunk conf2014 - Getting Deeper Insights into your Virtualization and Storag...Splunk conf2014 - Getting Deeper Insights into your Virtualization and Storag...
Splunk conf2014 - Getting Deeper Insights into your Virtualization and Storag...Splunk
 
Oracle Management Cloud newpres-v1.1
Oracle Management Cloud   newpres-v1.1Oracle Management Cloud   newpres-v1.1
Oracle Management Cloud newpres-v1.1Lee Bonfield
 
Splunk Sales Presentation Imagemaker 2014
Splunk Sales Presentation Imagemaker 2014Splunk Sales Presentation Imagemaker 2014
Splunk Sales Presentation Imagemaker 2014Urena Nicolas
 
Managing Your Application Security Program with the ThreadFix Ecosystem
Managing Your Application Security Program with the ThreadFix EcosystemManaging Your Application Security Program with the ThreadFix Ecosystem
Managing Your Application Security Program with the ThreadFix EcosystemDenim Group
 
Splunk bangalore user group 2020-06-01
Splunk bangalore user group   2020-06-01Splunk bangalore user group   2020-06-01
Splunk bangalore user group 2020-06-01NiketNilay
 
Getting Started with Splunk Enterprise
Getting Started with Splunk EnterpriseGetting Started with Splunk Enterprise
Getting Started with Splunk EnterpriseSplunk
 
MuleSoft Surat Virtual Meetup#16 - Anypoint Deployment Option, API and Operat...
MuleSoft Surat Virtual Meetup#16 - Anypoint Deployment Option, API and Operat...MuleSoft Surat Virtual Meetup#16 - Anypoint Deployment Option, API and Operat...
MuleSoft Surat Virtual Meetup#16 - Anypoint Deployment Option, API and Operat...Jitendra Bafna
 
Hybrid Analysis Mapping: Making Security and Development Tools Play Nice Toge...
Hybrid Analysis Mapping: Making Security and Development Tools Play Nice Toge...Hybrid Analysis Mapping: Making Security and Development Tools Play Nice Toge...
Hybrid Analysis Mapping: Making Security and Development Tools Play Nice Toge...Denim Group
 
Oracle: Building Cloud Native Applications
Oracle: Building Cloud Native ApplicationsOracle: Building Cloud Native Applications
Oracle: Building Cloud Native ApplicationsKelly Goetsch
 
Veracode Corporate Overview - Print
Veracode Corporate Overview - PrintVeracode Corporate Overview - Print
Veracode Corporate Overview - PrintAndrew Kanikuru
 
SwitchIT-02.2018-Company-overview.pptx
SwitchIT-02.2018-Company-overview.pptxSwitchIT-02.2018-Company-overview.pptx
SwitchIT-02.2018-Company-overview.pptxWILFRIEDKOUASSIKAN
 
Optimizing Your Application Security Program with Netsparker and ThreadFix
Optimizing Your Application Security Program with Netsparker and ThreadFixOptimizing Your Application Security Program with Netsparker and ThreadFix
Optimizing Your Application Security Program with Netsparker and ThreadFixDenim Group
 
SAP security made easy
SAP security made easySAP security made easy
SAP security made easyERPScan
 
The SAS developer portal – developer.sas.com 2.0: How we built it by Joe Furb...
The SAS developer portal –developer.sas.com 2.0: How we built it by Joe Furb...The SAS developer portal –developer.sas.com 2.0: How we built it by Joe Furb...
The SAS developer portal – developer.sas.com 2.0: How we built it by Joe Furb...Nordic APIs
 
Big Data Analytics for Real-time Operational Intelligence with Your z/OS Data
Big Data Analytics for Real-time Operational Intelligence with Your z/OS DataBig Data Analytics for Real-time Operational Intelligence with Your z/OS Data
Big Data Analytics for Real-time Operational Intelligence with Your z/OS DataPrecisely
 
Government and Education Webinar: Improving Application Performance
Government and Education Webinar: Improving Application PerformanceGovernment and Education Webinar: Improving Application Performance
Government and Education Webinar: Improving Application PerformanceSolarWinds
 
Attacks to SAP Web Applications: Your crown jewels online (BlackHat DC)
 	Attacks to SAP Web Applications: Your crown jewels online (BlackHat DC) 	Attacks to SAP Web Applications: Your crown jewels online (BlackHat DC)
Attacks to SAP Web Applications: Your crown jewels online (BlackHat DC)Onapsis Inc.
 
SYN328: Learn why AppDNA should be a part of every consultant’s toolkit
SYN328: Learn why AppDNA should be a part of every consultant’s toolkitSYN328: Learn why AppDNA should be a part of every consultant’s toolkit
SYN328: Learn why AppDNA should be a part of every consultant’s toolkitJeremy Saunders
 

Semelhante a I'm No Hero: Full Stack Reliability at LinkedIn (20)

Linked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafkaLinked in multi tier, multi-tenant, multi-problem kafka
Linked in multi tier, multi-tenant, multi-problem kafka
 
Splunk conf2014 - Getting Deeper Insights into your Virtualization and Storag...
Splunk conf2014 - Getting Deeper Insights into your Virtualization and Storag...Splunk conf2014 - Getting Deeper Insights into your Virtualization and Storag...
Splunk conf2014 - Getting Deeper Insights into your Virtualization and Storag...
 
Oracle Management Cloud newpres-v1.1
Oracle Management Cloud   newpres-v1.1Oracle Management Cloud   newpres-v1.1
Oracle Management Cloud newpres-v1.1
 
Splunk Sales Presentation Imagemaker 2014
Splunk Sales Presentation Imagemaker 2014Splunk Sales Presentation Imagemaker 2014
Splunk Sales Presentation Imagemaker 2014
 
Managing Your Application Security Program with the ThreadFix Ecosystem
Managing Your Application Security Program with the ThreadFix EcosystemManaging Your Application Security Program with the ThreadFix Ecosystem
Managing Your Application Security Program with the ThreadFix Ecosystem
 
Splunk bangalore user group 2020-06-01
Splunk bangalore user group   2020-06-01Splunk bangalore user group   2020-06-01
Splunk bangalore user group 2020-06-01
 
Getting Started with Splunk Enterprise
Getting Started with Splunk EnterpriseGetting Started with Splunk Enterprise
Getting Started with Splunk Enterprise
 
MuleSoft Surat Virtual Meetup#16 - Anypoint Deployment Option, API and Operat...
MuleSoft Surat Virtual Meetup#16 - Anypoint Deployment Option, API and Operat...MuleSoft Surat Virtual Meetup#16 - Anypoint Deployment Option, API and Operat...
MuleSoft Surat Virtual Meetup#16 - Anypoint Deployment Option, API and Operat...
 
Hybrid Analysis Mapping: Making Security and Development Tools Play Nice Toge...
Hybrid Analysis Mapping: Making Security and Development Tools Play Nice Toge...Hybrid Analysis Mapping: Making Security and Development Tools Play Nice Toge...
Hybrid Analysis Mapping: Making Security and Development Tools Play Nice Toge...
 
Oracle: Building Cloud Native Applications
Oracle: Building Cloud Native ApplicationsOracle: Building Cloud Native Applications
Oracle: Building Cloud Native Applications
 
Veracode Corporate Overview - Print
Veracode Corporate Overview - PrintVeracode Corporate Overview - Print
Veracode Corporate Overview - Print
 
SwitchIT-02.2018-Company-overview.pptx
SwitchIT-02.2018-Company-overview.pptxSwitchIT-02.2018-Company-overview.pptx
SwitchIT-02.2018-Company-overview.pptx
 
Optimizing Your Application Security Program with Netsparker and ThreadFix
Optimizing Your Application Security Program with Netsparker and ThreadFixOptimizing Your Application Security Program with Netsparker and ThreadFix
Optimizing Your Application Security Program with Netsparker and ThreadFix
 
SAP security made easy
SAP security made easySAP security made easy
SAP security made easy
 
The SAS developer portal – developer.sas.com 2.0: How we built it by Joe Furb...
The SAS developer portal –developer.sas.com 2.0: How we built it by Joe Furb...The SAS developer portal –developer.sas.com 2.0: How we built it by Joe Furb...
The SAS developer portal – developer.sas.com 2.0: How we built it by Joe Furb...
 
Big Data Analytics for Real-time Operational Intelligence with Your z/OS Data
Big Data Analytics for Real-time Operational Intelligence with Your z/OS DataBig Data Analytics for Real-time Operational Intelligence with Your z/OS Data
Big Data Analytics for Real-time Operational Intelligence with Your z/OS Data
 
Government and Education Webinar: Improving Application Performance
Government and Education Webinar: Improving Application PerformanceGovernment and Education Webinar: Improving Application Performance
Government and Education Webinar: Improving Application Performance
 
Attacks to SAP Web Applications: Your crown jewels online (BlackHat DC)
 	Attacks to SAP Web Applications: Your crown jewels online (BlackHat DC) 	Attacks to SAP Web Applications: Your crown jewels online (BlackHat DC)
Attacks to SAP Web Applications: Your crown jewels online (BlackHat DC)
 
Webinar–AppSec: Hype or Reality
Webinar–AppSec: Hype or RealityWebinar–AppSec: Hype or Reality
Webinar–AppSec: Hype or Reality
 
SYN328: Learn why AppDNA should be a part of every consultant’s toolkit
SYN328: Learn why AppDNA should be a part of every consultant’s toolkitSYN328: Learn why AppDNA should be a part of every consultant’s toolkit
SYN328: Learn why AppDNA should be a part of every consultant’s toolkit
 

Mais de Todd Palino

Leading Without Managing: Becoming an SRE Technical Leader
Leading Without Managing: Becoming an SRE Technical LeaderLeading Without Managing: Becoming an SRE Technical Leader
Leading Without Managing: Becoming an SRE Technical LeaderTodd Palino
 
From Operations to Site Reliability in Five Easy Steps
From Operations to Site Reliability in Five Easy StepsFrom Operations to Site Reliability in Five Easy Steps
From Operations to Site Reliability in Five Easy StepsTodd Palino
 
Code Yellow: Helping Operations Top-Heavy Teams the Smart Way
Code Yellow: Helping Operations Top-Heavy Teams the Smart WayCode Yellow: Helping Operations Top-Heavy Teams the Smart Way
Code Yellow: Helping Operations Top-Heavy Teams the Smart WayTodd Palino
 
Why Does (My) Monitoring Suck?
Why Does (My) Monitoring Suck?Why Does (My) Monitoring Suck?
Why Does (My) Monitoring Suck?Todd Palino
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowTodd Palino
 
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...Todd Palino
 
Running Kafka for Maximum Pain
Running Kafka for Maximum PainRunning Kafka for Maximum Pain
Running Kafka for Maximum PainTodd Palino
 

Mais de Todd Palino (7)

Leading Without Managing: Becoming an SRE Technical Leader
Leading Without Managing: Becoming an SRE Technical LeaderLeading Without Managing: Becoming an SRE Technical Leader
Leading Without Managing: Becoming an SRE Technical Leader
 
From Operations to Site Reliability in Five Easy Steps
From Operations to Site Reliability in Five Easy StepsFrom Operations to Site Reliability in Five Easy Steps
From Operations to Site Reliability in Five Easy Steps
 
Code Yellow: Helping Operations Top-Heavy Teams the Smart Way
Code Yellow: Helping Operations Top-Heavy Teams the Smart WayCode Yellow: Helping Operations Top-Heavy Teams the Smart Way
Code Yellow: Helping Operations Top-Heavy Teams the Smart Way
 
Why Does (My) Monitoring Suck?
Why Does (My) Monitoring Suck?Why Does (My) Monitoring Suck?
Why Does (My) Monitoring Suck?
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to Know
 
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...
 
Running Kafka for Maximum Pain
Running Kafka for Maximum PainRunning Kafka for Maximum Pain
Running Kafka for Maximum Pain
 


I'm No Hero: Full Stack Reliability at LinkedIn

  • 1. I’m No Hero: Full Stack Reliability at LinkedIn
  • 2. Todd Palino
  • 3. What is Site Reliability Engineering?
  • 4. Types of SRE
      • Embedded
      • Central (or Production SRE)
      • Tools and Infrastructure
  • 5.
  • 6. We Can’t Do It Alone
      • The Kafka SRE team is 3 people in the US, and 1.5 SREs in Bangalore
      • We manage over 6000 application instances
          – 100 Kafka clusters, with 1800 brokers
          – Over 1 trillion messages a day
      • The environment is never static from one day to the next
  • 7. Maslow’s Hierarchy
  • 8. Todd’s Hierarchy of Reliability
  • 9. Infrastructure as a Service
      • SREs do not deploy hardware and OS
      • Production Operations
          – Datacenter Technicians
          – Systems Operations
          – Network Operations
      • Provide all basic OS and network services
      • There is still tweaking for individual applications
  • 10. Common Repositories
      • All source code and configurations are committed to one place
      • Subversion and Git centrally managed
      • Consistent management
          – Precommit checks
          – ACLs and Review boards
      • Connects directly to the build systems
  • 11. Containerization
      • Most of our stack is Java
          – Python is well-supported
          – Always a few one-offs
      • Java applications have Tomcat and Jetty containers
          – Hooks for monitoring
          – Client libraries are managed by the team that owns the application
      • Provides a consistent control surface for applications
  • 12. Build and Deployment
      • When code is committed, it is automatically built
          – Successes become deployment artifacts
          – Failures are tracked via Jira
      • Build systems are centrally managed
      • Common tools
          – Dependency management and introspection
          – Version management
          – Error budgeting
          – Deployment
  • 13. Monitoring
      • Monitoring, graphing, and alerting as a service
      • Completely self-service
          – Applications annotate metrics and they are automatically collected
          – Monitoring dashboards can be created by anyone
      • Automatic metrics and dashboards for common features
          – HTTP servers, system and OS metrics
          – Client libraries (such as Kafka)
      • Additional metrics can be published outside the container
  • 14. Site Up
  • 15. Site Up
      • With the stack supporting it, applications sit on top
          – SREs architect and run the application
          – SRE and developers respond to failures
      • The NOC monitors high-level metrics
          – Overall site health and growth metrics
          – They also coordinate incident response
      • Incident response is blameless
          – Fix the problem, don’t fix the blame
  • 16. Review and Revise
      • All components are constantly improving
          – Incidents expose issues in the infrastructure
          – Feedback from usage of the tools
      • Steering committees discuss large-scale changes
          – Production Operations, SRE, and Development all have their own
          – Comprised of individual contributors, not managers
      • Open collaboration
          – Common repositories means everyone can help

Editor's Notes

  1. This is not far from the truth. We go through a lot of beer. We’ll get to why I drink shortly.

     Site Reliability Engineering, or SRE, combines several roles that fit together into one Operations position:
       • Foremost, we are administrators. We manage all of the systems in our area.
       • We are also architects. We do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together.
       • And we are also developers. We identify tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them.

     This is all well and good for describing the responsibilities, but how do we do it? An SRE needs certain knowledge:
       • A little knowledge of all the components
       • Understanding of how they fit together
       • Understanding of how to fit them into the infrastructure

     Combined with the ability to build tools and automation around the applications, SRE allows the developers to focus on the application, not on running the application. At the end of the day, our job is to keep the site running, always.
  2. At LinkedIn we have three types of SREs. The work is generally the same, but the scope is different for each.

     Embedded SRE teams are closely aligned with a development team, working with a specific application. This requires deep knowledge of the application itself, and the SREs often find themselves working in the code. The development team and the SRE team work together on feature planning, with the SRE team providing their expertise in operations to inform the architecture of the application.

     Central SRE teams (at LinkedIn we now call them Production SRE) oversee a number of different applications for a variety of development teams. Many of these applications are not big enough on their own to warrant their own teams, so the central SRE team is tasked with managing the operations of the applications, including making sure there’s hardware for them to run on. Production SRE is also the home of our NOC team, who provide high-level site monitoring and coordinate incidents that impact more than one team.

     Tools and Infrastructure SREs are a category to themselves. These teams are responsible for developing and deploying the infrastructure that everything at LinkedIn uses: for example, build and deployment systems, monitoring and alerting systems, and other tools that are common to all teams.

     My role is that of an Embedded SRE, working directly with the development teams responsible for Streaming. So, on to why I drink.
  3. This is an overview of the Streaming ecosystem at LinkedIn, highly simplified (it doesn’t account for multiple sites and simplifies many of the data flows). Within the Streaming organization, we have 3 teams: Data*, Kafka, and Samza.

     Data* manages our change capture systems. There are several versions of these, with the latest being Brooklin. Brooklin uses Apache Kafka underneath for streaming changes from Espresso (a key-value store) to client systems.

     Apache Kafka is the heart of our big data systems. Not only does it underpin Brooklin, but some of our data storage systems, such as Espresso and Voldemort (two different key-value systems), use Kafka for replication between components. We also have a number of multitenant Kafka clusters, which are used by every system and application at LinkedIn. These are used for user tracking data, system and application metrics, logging, and queuing all sorts of other messages. Because Kafka is used for metrics, driving our monitoring and alerting systems, we have separate monitoring systems that we maintain for Kafka. Our team is also responsible for managing Zookeeper, used by us and many other applications.

     Samza is the third team, and they manage our stream processing platform that uses Apache Samza. This relies heavily on Kafka to provide the data and a place for intermediate results to be written. Some of the applications that run here are things like our data standardization systems and messaging applications.
  4. My team is quite small. We have 3 SREs dedicated to Kafka and Zookeeper in the US, with a little more than another full SRE in our team in Bangalore, India. This is to manage a deployment with well over 6000 application instances. For the core part of that, the Kafka clusters themselves, we have over 100 separate clusters made up of more than 1800 servers. They’re processing over a trillion messages a day in total.

     What’s more, LinkedIn’s landscape is changing daily. There are thousands of applications running, with new versions many times a day. Hardware is always changing, and we always have new features to contend with. There’s always someone who needs our help. How can we manage to run this ecosystem effectively with so small a team? The answer lies in what I call full-stack reliability.
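To put those figures in proportion, the arithmetic below works out a few averages. It is only a back-of-the-envelope sketch: every input is a lower bound quoted in the talk ("over", "more than"), the headcount of 4.5 comes from the slide, and the variable names are made up for illustration.

```python
# Figures quoted in the talk; each is a lower bound, so the results are too.
SRES = 3 + 1.5                       # 3 in the US plus 1.5 in Bangalore (slide figure)
APP_INSTANCES = 6000                 # "over 6000 application instances"
CLUSTERS = 100                       # "over 100 separate clusters"
BROKERS = 1800                       # "more than 1800 servers"
MESSAGES_PER_DAY = 1_000_000_000_000 # "over a trillion messages a day"

SECONDS_PER_DAY = 24 * 60 * 60

instances_per_sre = APP_INSTANCES / SRES
brokers_per_cluster = BROKERS / CLUSTERS
messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY

print(f"instances per SRE:   {instances_per_sre:,.0f}")
print(f"brokers per cluster: {brokers_per_cluster:.0f}")
print(f"messages per second: {messages_per_second:,.0f}")
```

That is roughly 1,300 application instances per SRE, 18 brokers per cluster on average, and about 11.6 million messages per second averaged over a day.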
  5. Many of us will be familiar with Maslow’s hierarchy of needs. This diagram illustrates the theory that there are basic needs that must be met in order for us to function as human beings. Each need builds upon the one below it. None can stand unless the ones beneath are met. What makes the SRE teams at LinkedIn effective is that we have built our environment in a similar fashion.

     When building a system within a cloud environment, you have many services that are provided for you to take advantage of. This includes hardware, databases, load balancers, monitoring, and any number of other tools. The idea is that you want to be able to focus on your application, not on running the things that are required but are not core to your business.
  6. Here is what my stack looks like. I’m not as fancy as Maslow, with his colors, but the same theory stands. Each layer describes a basic need when it comes to reliability in our applications. None of the layers can stand unless the ones below them are satisfied. My stack has 6 layers, starting from the bottom:
       • Infrastructure as a Service
       • Common Repositories
       • Containerization
       • Build and Deployment
       • Monitoring
       • Site Up

     We’ll cover each of these in turn.
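The rule that no layer can stand unless the ones below it are satisfied can be written down as a small check. This is a minimal sketch for illustration only: the layer names come from the talk, while the function and its `satisfied` input are hypothetical.

```python
# Layers from the talk, listed bottom to top.
LAYERS = [
    "Infrastructure as a Service",
    "Common Repositories",
    "Containerization",
    "Build and Deployment",
    "Monitoring",
    "Site Up",
]


def highest_standing_layer(satisfied):
    """Return the topmost layer that stands, or None if the base is missing.

    `satisfied` is a set of layer names considered healthy (hypothetical input).
    A layer stands only if it and every layer below it are satisfied.
    """
    standing = None
    for layer in LAYERS:              # walk upward from the base
        if layer not in satisfied:
            break                     # nothing above this gap can stand
        standing = layer
    return standing
```

For example, if every layer except Containerization is healthy, the function returns "Common Repositories": Build and Deployment, Monitoring, and Site Up do not count, because the layer beneath them is missing.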
  7. As an SRE, I have never set foot in a LinkedIn datacenter, nor have I had my hands on one of our servers. I haven’t even installed an operating system on one of them. Likewise, I have never worked on our networking hardware, or directly made modifications to a service like DNS. All of these services are provided by a separate organization, named Production Operations. The 3 larger teams that SRE works with on a day-to-day basis are:
       • Datacenter Technicians, the people who actually deal directly with the hardware. They are the ones on site in each datacenter to both deploy and maintain the systems.
       • Systems Operations, the team responsible for the operating system deployment. They are also responsible for maintaining services like DNS.
       • Network Operations, which performs a similar function for the networking, handling all the routers and switches, as well as firewalls, load balancers, and more.

     The ProdOps team provides all basic OS and network services so that other teams do not have to have specialists in these areas and there is consistency across the infrastructure. For most applications, when I need to deploy new services I can allocate systems from a common pool and deploy with one command. If I need DNS changes, or network ACLs, I open a request for the change and it’s taken care of promptly.

     When I need to deploy a new broker, it’s a little different because brokers use custom hardware and tuning. For this, I put in a ticket for new hardware. Within a specified time, I get a hostname for the new system. I can trust that it’s already configured the way I need it, and it’s fully integrated with LinkedIn’s systems. I just need to deploy my application. How we get to that deployable application involves the next 3 layers.
  8. Applications start as source code, and how that is managed forms the base of the application layers. We use a single set of repositories for all code and configuration, with code and configuration kept separate from each other. These Subversion and Git repositories (we use both right now) are centrally managed by our Tools team. They have consistent precommit checks, which not only help to validate the simple format of certain files (like XML or YAML), but also perform more complex checks like rejecting duplicate class definitions. There are also ACLs and review boards tied in so that individual teams can make sure that changes to their applications are appropriately vetted before they are committed. These repositories are tied into our build system as well, as we’ll discuss in the next layer.

     This may seem like a small thing to make up such a fundamental layer, but the management of code and config is critical. We have cultural tenets of craftsmanship and openness, and this serves both of them. Precommit checks allow us to follow a set of standards as to how we write code. Having it all in one place means that anyone can check out anyone else’s code: there are no secrets. It’s also important that we maintain configurations the same way we maintain code. Reviewing things before they are checked in means we are able to catch a lot of problems before they get out to production.
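The two kinds of precommit check mentioned above (a simple format check and a duplicate class check) might look something like the following. This is a hedged sketch, not LinkedIn's tooling: the function names, the regular expression, and the shape of `changed_files` are all assumptions, and the duplicate check is deliberately naive (it ignores Java packages, so identically named classes in different packages would be flagged).

```python
import re
import xml.etree.ElementTree as ET


def xml_is_well_formed(text):
    """Format check: does the text parse as XML at all?"""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Naive matcher for top-level Java class declarations (an illustrative assumption).
CLASS_DECL = re.compile(
    r"^\s*(?:public\s+|final\s+|abstract\s+)*class\s+(\w+)", re.MULTILINE
)


def duplicate_classes(sources):
    """Map each class name declared in more than one file to the files declaring it."""
    seen = {}
    for path, text in sources.items():
        for name in CLASS_DECL.findall(text):
            seen.setdefault(name, []).append(path)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}


def precommit(changed_files):
    """Return rejection messages for a commit; an empty list means it may proceed.

    `changed_files` is a hypothetical mapping of file path to file content.
    """
    errors = []
    for path, text in changed_files.items():
        if path.endswith(".xml") and not xml_is_well_formed(text):
            errors.append(f"{path}: XML is not well-formed")
    java = {p: t for p, t in changed_files.items() if p.endswith(".java")}
    for name, paths in sorted(duplicate_classes(java).items()):
        errors.append(
            f"class {name} defined in multiple files: {', '.join(sorted(paths))}"
        )
    return errors
```

A real hook would run inside the version control system and read the files from the pending commit; here the content is passed in directly so the checks can be exercised on their own.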
  9. Most of the applications we are working with are Java. We do have a large number of Python applications, as that is the other supported language and it’s used a lot by the SRE teams for writing the tools around the applications. Of course, there are more languages than that in use: I have a few Golang apps that we have written. Because that is not a fully supported language, I had to take a few extra steps to make sure it would integrate with all of our build and deployment systems.

     All of the Java applications run in a container, usually Tomcat or Jetty, that encapsulates the application and provides all of the common pieces for the application developer. For example, the monitoring systems (which make up the next layer) are simply hooked in here. Most client libraries are accessed via Spring here. The versions have already been vetted by other teams, and any configuration parameters either have sane defaults or are surfaced in the application’s config.

     The most important thing about the containers is that they provide a reliable control surface for the application. This allows the app to interact with all of the tooling within LinkedIn without needing to specifically implement it. For one example, the container provides an HTTP endpoint of its own. For any app, I can quickly determine what the port number of this endpoint is, because there is a registry of port numbers, and I know that I can request `/admin` on that endpoint and get back either a good or a bad response, depending on the health of the application. A number of tools and automatic monitoring systems depend on this.
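A tool built on that control surface could be as small as the sketch below. Only the `/admin` path and the idea of a port registry come from the talk. The registry contents, the application and host names, and the rule that a 2xx status means "good" are assumptions made for illustration.

```python
import urllib.error
import urllib.request

# Hypothetical port registry: application name -> port of the container's own endpoint.
PORT_REGISTRY = {"example-app": 12345}


def admin_url(app, host, registry=PORT_REGISTRY):
    """Build the health URL for an app from the registry."""
    port = registry[app]
    return f"http://{host}:{port}/admin"


def is_healthy(status_code):
    """Assumption: any 2xx status counts as a good response, anything else as bad."""
    return 200 <= status_code < 300


def check(app, host, timeout=2.0):
    """Request /admin on the app's endpoint and report whether the app looks healthy."""
    url = admin_url(app, host)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status)
    except urllib.error.HTTPError as err:
        # The endpoint answered, but with an error status.
        return is_healthy(err.code)
    except (urllib.error.URLError, OSError):
        # No answer at all (connection refused, timeout, DNS failure).
        return False
```

Because every container exposes the same endpoint, a tool like this needs no per-application logic beyond the registry lookup.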
  10. As soon as code is committed to the repository, a build task is started. Most of us are familiar with these processes from open source projects, and we handle our internal applications the same way. Build successes automatically become deployable artifacts and are pushed up to Artifactory. Failures have a ticket created for them, assigned to the person who checked in the code. In many cases, the bad commit is automatically reverted to keep trunk (or master) clean.

      As with everything else, these build systems are centrally managed by the Tools team. For all of them, we maintain helper applications that make working with apps easy. With common repositories and build systems, for example, I can easily introspect and manage the dependency tree. As the owner of the Kafka client library, this is very important. When I have a critical fix that needs to go out to hundreds of applications, I can push a library update into all the dependent applications with as little as a single command.

      We also have systems for tracking the versions of applications that are deployed. The tracking system enforces certain rules and deployment steps, which can be defined for each app, so we can set a release process that can be followed by anyone. This means I can trust developers to deploy applications to production, because they will always follow the deployment path we have worked out together.

      Deployment is pretty amazing as well. Not only can we use the version tracking system to perform multiple steps with the push of a button, but if I need to get a little more manual it’s still only one command to deploy anywhere in our infrastructure.
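Introspecting the dependency tree to find every application affected by a library fix amounts to walking the dependency graph backwards. The sketch below shows one way to do that. The artifact names and the `DEPENDS_ON` data are invented for illustration, and the talk does not describe how the real tooling stores or traverses this information.

```python
from collections import deque

# Hypothetical data: each artifact maps to the artifacts it depends on directly.
DEPENDS_ON = {
    "app-a": ["kafka-client", "http-lib"],
    "app-b": ["metrics-lib"],
    "metrics-lib": ["kafka-client"],
    "app-c": ["http-lib"],
}


def dependents_of(library, depends_on):
    """Return every artifact that depends on `library`, directly or transitively."""
    # Invert the graph: for each artifact, who uses it directly?
    used_by = {}
    for artifact, deps in depends_on.items():
        for dep in deps:
            used_by.setdefault(dep, set()).add(artifact)

    # Breadth-first walk up the inverted graph.
    found = set()
    queue = deque([library])
    while queue:
        current = queue.popleft()
        for parent in used_by.get(current, ()):
            if parent not in found:
                found.add(parent)
                queue.append(parent)
    return found
```

With the sample data, `dependents_of("kafka-client", DEPENDS_ON)` includes `app-b` even though it never names the Kafka client directly, because it pulls it in through `metrics-lib`. Transitive users like that are the ones a manual search tends to miss.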
  11. Once deployed, monitoring is the most important part of running an application. If there’s an application that doesn’t have some sort of monitoring on it, it may as well not exist at all. At LinkedIn, our monitoring systems, including graphing and alerting, are all provided as a service for the rest of the organization by our Infrastructure SRE team.

      What’s more, it is a completely self-service system. Metrics do not have to be approved and on-boarded before they can be used. If a developer wants to expose a new metric, all they have to do is annotate the sensor within the application. The container logic takes care of polling the sensor and producing the metrics into Kafka. From there, the monitoring system consumes them and within about 5 minutes, graphs are available. We can then set up a dashboard with multiple metrics, including alert thresholds. Once the metrics are in the system, they are accessible by everyone, and anyone can set up their own dashboard to watch something.

      Many common components have their own metrics and dashboards automatically provided without the application needing to annotate them. For example, if an application uses a Kafka client, there are a number of metrics that are produced by default. There are also dashboards for some common things, like HTTP servers.

      It’s also possible to publish metrics into the system separately from the container. Since we use Kafka for collecting metrics, all you have to do is publish a metrics message. We have helper REST applications for this.
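The annotate-then-poll flow can be illustrated with a rough Python analogue. The talk describes Java applications annotating sensors inside a container. Here a decorator stands in for the annotation, and a plain `publish` callable stands in for a Kafka producer. The message fields and all names are assumptions, not LinkedIn's actual metrics format.

```python
import json
import time

# Registry of sensors, filled in by the decorator below.
_SENSORS = {}


def sensor(name):
    """Mark a zero-argument function as a metric sensor (stand-in for an annotation)."""
    def register(func):
        _SENSORS[name] = func
        return func
    return register


# Application code: the developer only marks the sensor.
_requests_served = 0


@sensor("requests_served")
def requests_served():
    return _requests_served


# Container-side logic: poll every registered sensor and publish one message each.
def poll_sensors(app, publish, now=time.time):
    """Read each sensor and hand a JSON metrics message to `publish`.

    `publish` is a hypothetical stand-in for sending to a Kafka metrics topic.
    """
    for name, read in sorted(_SENSORS.items()):
        message = {"app": app, "metric": name, "value": read(), "timestamp": now()}
        publish(json.dumps(message))
```

Calling `poll_sensors("example-app", print)` on a timer would emit one message per sensor. The application never has to know where the metrics go, which is what makes the system self-service.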
  12. Let’s be honest, none of this runs 100% all the time. With applications in a constant state of change, where does that leave the top of the stack, where I, an SRE, am trying to keep the site up? Everything is on fire all the time, and that’s OK. Hardware is always failing, but ProdOps is detecting that and resolving it. The developers are constantly checking in changes, some of them pretty sketchy, and the tooling is taking care of building the code and generating deployables. Thanks to our Infrastructure SRE team, when those sketchy changes do make it to production, there is monitoring to detect problems and help us resolve them quickly.
  13. SRE focuses on architecting and running the application. We write tools and scripts to support this, and sometimes we write more general tools that other teams use as well. When something breaks, I work with my developers (as an embedded SRE) and get it fixed.

      Our NOC is there to monitor the high-level metrics, as most of the monitoring and alerting goes directly to the teams responsible. The NOC watches overall site health, and they track many metrics related to site growth. When there is a problem, they help coordinate multiple teams in fixing it. This is what we call “site up”, and it is the top priority.

      A big component of this is that our incident process, both the response and the followup, is blameless. It doesn’t matter who caused a problem; what is important is that we fix it and then make sure it doesn’t happen again. Trying to figure out who is at fault takes time away from other things, and only serves to make someone feel bad and make them less likely to contribute something meaningful in the future.
  14. As with any system, you must review it all the time and make sure you’re headed in the right direction. Like any other application, the infrastructure components are constantly being improved. Some of the incidents we have to resolve expose deficiencies, whether it’s something we missed monitoring or a process that needs to be changed to be safer. As users of the tools and infrastructure, SREs and developers are providing feedback on what works and what doesn’t.

      For bigger changes to what we’re doing, we have several steering committees that can be engaged to provide broader input and direction. The ProdOps, SRE, and development organizations each have their own committee covering different areas, and we collaborate with each other as needed. The committees are made up of individual contributors (higher-level technical employees), not managers. This is important, because it feeds into our culture of strong technical leadership.

      Most importantly, our systems are set up to provide for open collaboration between all teams. Common code and config repositories are one aspect of this: when everyone can see what’s going on, everyone can contribute. This means that when I find a problem with a tool, I can create a fix and send the owner a patch to review. The alternative is just giving them the feedback, after which they need to set aside time to look at it among all the other things they have to do, duplicate the problem, create a fix, and get it reviewed.