The operations engineer is often seen as the hero, toiling away late nights on call to keep the systems running through failures of hardware and of code. While developers try as hard as possible to move quickly and break things, we stand as the voice of reason urging caution. We’re the only ones who truly understand the systems, but you’ll rarely find documentation because it’s just too complex and changeable to write down. When we’re doing our jobs well, we’re unappreciated because nobody understands how difficult it is. When things break, everyone thinks we’re doing our jobs badly. These are not the things we aspire to.
At LinkedIn, Site Reliability Engineers are one layer in a stack that starts with the way we manage our code and basic hardware, and is built with common systems for application management, monitoring, and alerting. Each layer has its own specialist engineers, focused on making their piece as resilient as it can be and building it to integrate with the rest of the stack. This lets Software Engineers concentrate on developing their applications, without having to spend time building systems to build, package, and distribute their code. SREs can dedicate their time to integrating applications with the stack, architecting and scaling deployments, as well as developing tools and documentation to make the job easier. When the inevitable failure happens, many experts come together to quickly identify and resolve the problem and improve the entire stack for everyone.
Description:
Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
This is not far from the truth. We go through a lot of beer. We’ll get to why I drink shortly.
Site Reliability Engineering, or SRE, combines several roles that fit together into one Operations position.
Foremost, we are administrators. We manage all of the systems in our area.
We are also architects. We do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together.
And we are also developers. We identify tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them.
This is all well and good for describing the responsibilities, but how do we do it? An SRE needs certain knowledge:
A little knowledge of all the components
Understanding of how they fit together
Understanding of how to fit them into the infrastructure
Combined with the ability to build tools and automation around the applications, SRE allows the developers to focus on the application, not on running the application. At the end of the day, our job is to keep the site running, always.
At LinkedIn we have three types of SREs. The work is generally the same, but the scope is different for each.
Embedded SRE teams are closely aligned with a development team, working with a specific application. This requires deep knowledge of the application itself, and the SREs often find themselves working in the code. The development team and the SRE team work together on feature planning, with the SRE team providing their expertise in operations to inform the architecture of the application.
Central SRE teams (at LinkedIn we now call them Production SRE) oversee a number of different applications for a variety of development teams. Many of these applications are not big enough on their own to warrant their own teams, so the central SRE team assists with managing the operations of the applications, including making sure there’s hardware for them to run on. Production SRE is also the home of our NOC team, who provide high-level site monitoring and coordinate incidents that impact more than one team.
Tools and Infrastructure SREs are a category unto themselves. These teams are responsible for developing and deploying the infrastructure that everything at LinkedIn uses: the build and deployment systems, the monitoring and alerting systems, and other tools that are common to all teams.
My role is that of an Embedded SRE, working directly with the development teams responsible for Streaming. So, on to why I drink.
This is an overview of the Streaming ecosystem at LinkedIn, highly simplified (it doesn’t account for multiple sites and simplifies many of the data flows). Within the Streaming organization, we have 3 teams – Data*, Kafka, and Samza.
Data* manages our change capture systems. There are several versions of these, with the latest being Brooklin. Brooklin uses Apache Kafka underneath for streaming changes from Espresso (a key-value store) to client systems.
Apache Kafka is the heart of our big data systems. Not only does it underpin Brooklin, but some of our data storage systems, such as Espresso and Voldemort (two different key-value stores), also use Kafka for replication between components. We also have a number of multitenant Kafka clusters, which are used by every system and application at LinkedIn. These are used for user tracking data, system and application metrics, logging, and queuing all sorts of other messages. Because Kafka is used for metrics, driving our monitoring and alerting systems, we maintain separate monitoring systems for Kafka itself. Our team is also responsible for managing Zookeeper, which is used by us and by many other applications.
Samza is the third team, and they manage our stream processing platform, which uses Apache Samza. It relies heavily on Kafka, both to provide the input data and as a place for intermediate results to be written. The applications that run here include our data standardization systems and messaging applications.
My team is quite small. We have 3 SREs dedicated to Kafka and Zookeeper in the US, plus a little more than one additional full-time SRE on our team in Bangalore, India. That is to manage a deployment with well over 6000 application instances. For the core part of that, the Kafka clusters themselves, we have over 100 separate clusters comprising more than 1800 servers. They’re processing over a trillion messages a day in total.
What’s more, LinkedIn’s landscape is changing daily. There are thousands of applications running, with new versions deployed many times a day. Hardware is always changing, and we always have new features to contend with. There’s always someone who needs our help. How can we manage to run this ecosystem effectively with so small a team? The answer lies in what I call full-stack reliability.
Many of us will be familiar with Maslow’s hierarchy of needs. This diagram illustrates the theory that there are basic needs that must be met in order for us to function as human beings. Each need builds upon the one below it. None can stand unless the ones beneath are met.
What makes the SRE teams at LinkedIn effective is that we have built our environment in a similar fashion. When building a system within a cloud environment, you have many services that are provided for you to take advantage of. This includes hardware, databases, load balancers, monitoring, and any number of other tools. The idea is that you want to be able to focus on your application, not running those things that are not core to your business, but are still required.
Here is what my stack looks like. I’m not as fancy as Maslow, with his colors, but the same theory stands. Each layer describes a basic need when it comes to reliability in our applications. None of the layers can stand unless the ones below them are satisfied.
My stack has 6 layers, starting from the bottom:
Infrastructure as a Service
Common Repositories
Containerization
Build and Deployment
Monitoring
Site Up
We’ll cover each of these in turn.
As an SRE, I have never set foot in a LinkedIn datacenter, nor have I had my hands on one of our servers. I haven’t even installed an operating system on one of them. Likewise, I have never worked on our networking hardware, or directly made modifications to a service like DNS.
All of the services are provided by a separate organization, named Production Operations. The 3 larger teams that SRE works with on a day-to-day basis are:
the Datacenter Technicians, who are the people who actually deal directly with the hardware. They are the ones on site in each datacenter to both deploy and maintain the systems.
Systems Operations, the team responsible for the operating system deployment. They are also responsible for maintaining services like DNS.
Network Operations, which performs a similar function for the network, handling all the routers and switches, as well as firewalls, load balancers, and more.
The ProdOps team provides all basic OS and network services so that other teams do not have to have specialists in these areas and there is consistency across the infrastructure. For most applications, when I need to deploy new services I can allocate systems from a common pool and deploy with one command. If I need DNS changes, or network ACLs, I open a request for the change and it’s taken care of promptly.
When I need to deploy a new broker, it’s a little different because they use custom hardware and tuning. For this, I put in a ticket for new hardware. Within a specified time, I get a hostname for the new system. I can trust that it’s already configured the way I need it, and it’s fully integrated with LinkedIn’s systems. I just need to deploy my application. How we get to that deployable application involves the next 3 layers.
Applications start as source code, and how that code is managed forms the base of the application layers. We use a single set of repositories for all code and configuration, which are kept separate from each other. These Subversion and Git repositories (we use both right now) are centrally managed by our Tools team. They have consistent precommit checks, which not only validate the format of certain files (like XML or YAML), but also perform more complex checks like rejecting duplicate class definitions. There are also ACLs and review boards tied in, so that individual teams can make sure that changes to their applications are appropriately vetted before they are committed. These repositories are tied into our build system as well, as we’ll discuss in the next layer.
This may seem like a small thing to make up such a fundamental layer, but the management of code and config is critical. We have cultural tenets of craftsmanship and openness, and this serves both of them. Precommit checks allow us to follow a set of standards as to how we write code. Having it all in one place means that anyone can check out anyone else’s code – there are no secrets. It’s also important that we maintain configurations the same way we maintain code. Requiring reviews before things are checked in means we are able to catch a lot of problems before they get out to production.
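To make the precommit idea concrete, here is a minimal, generic sketch of a format check that rejects malformed YAML or XML before it lands in the repository. It is an illustration only, not LinkedIn’s actual hook, and the convention of passing changed files as arguments is an assumption.

```python
#!/usr/bin/env python3
"""Generic precommit sketch: reject commits containing malformed YAML or XML.
Illustration only -- not LinkedIn's actual hook."""
import sys
import xml.etree.ElementTree as ET

import yaml  # pip install pyyaml


def check_file(path):
    """Return True if the file parses cleanly, False otherwise."""
    try:
        with open(path) as f:
            if path.endswith((".yaml", ".yml")):
                yaml.safe_load(f)
            elif path.endswith(".xml"):
                ET.parse(f)
    except (yaml.YAMLError, ET.ParseError) as err:
        print(f"REJECTED {path}: {err}")
        return False
    return True


if __name__ == "__main__":
    # Assumes the hook runner passes the changed files as arguments.
    results = [check_file(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```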
Most of the applications we are working with are Java. We do have a large number of Python applications, as that is the other supported language and it’s used a lot by the SRE teams for writing the tools around the applications. Of course, there are more languages than that in use – I have a few Golang apps that we have written. Because that is not a fully supported language, I had to take a few extra steps to make sure it would integrate with all of our build and deployment systems.
All of the Java applications run in a container, usually Tomcat or Jetty, that encapsulates the application and provides all of the common pieces for the application developer. For example, the monitoring systems (which make up the next layer) are simply hooked in here, and most client libraries are accessed via Spring. The versions have already been vetted by other teams, and any configuration parameters either have sane defaults or are surfaced in the application’s config.
The most important thing about the containers is that they provide a reliable control surface for the application. This allows the app to interact with all of the tooling within LinkedIn without needing to specifically implement it. For one example, the container provides an HTTP endpoint of its own. For any app, I can quickly determine what the port number of this endpoint is, because there is a registry of port numbers, and I know that I can request `/admin` on that endpoint and get back either a good or a bad response, depending on the health of the application. A number of tools and automatic monitoring systems depend on this.
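As a rough sketch of how tooling can lean on that control surface, the snippet below polls an application’s `/admin` endpoint and reports health. The port-registry lookup is a hypothetical stand-in for whatever interface the real registry exposes, and the host and app names are placeholders.

```python
#!/usr/bin/env python3
"""Sketch: poll an application's /admin endpoint to check its health.
The port lookup below is a hypothetical stand-in for the real port registry."""
import sys

import requests  # pip install requests


def lookup_admin_port(app_name):
    """Hypothetical registry lookup; the single entry is illustrative only."""
    registry = {"example-app": 10251}
    return registry[app_name]


def is_healthy(host, app_name, timeout=5):
    port = lookup_admin_port(app_name)
    try:
        resp = requests.get(f"http://{host}:{port}/admin", timeout=timeout)
    except requests.RequestException:
        return False
    # The container answers /admin with a good or bad response
    # depending on the health of the application.
    return resp.ok


if __name__ == "__main__":
    host, app = sys.argv[1], sys.argv[2]
    print("healthy" if is_healthy(host, app) else "unhealthy")
```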
As soon as code is committed to the repository, a build task is started. Most of us are familiar with these processes from open source projects, and we handle our internal applications the same way. Successful builds automatically become deployable artifacts and are pushed up to Artifactory. Failures have a ticket created for them, assigned to the person who checked in the code. In many cases, the bad commit is automatically reverted to keep trunk (or master) clean.
As with everything else, these build systems are centrally managed by the Tools team. For all of them, we maintain helper applications that make working with the apps easier. With common repositories and build systems, I can easily introspect and manage the dependency tree, for example. That is very important to me as the owner of the Kafka client library: when I have a critical fix that needs to go out to hundreds of applications, I can push a library update into all the dependent applications with as little as a single command.
We also have a system for tracking the versions of applications that are deployed. It enforces rules and deployment steps that can be defined for each app, which means we can set a release process that anyone can follow. That, in turn, means I can trust developers to deploy applications to production, because they will always follow the deployment path we have worked out together.
Deployment is pretty amazing as well. Not only can we use the version tracking system to perform multiple steps with the push of a button, but if I need to get a little more manual, it’s still only one command to deploy anywhere in our infrastructure.
Once deployed, monitoring is the most important part of running an application. If there’s an application that doesn’t have some sort of monitoring on it, it may as well not exist at all. At LinkedIn, our monitoring systems, including graphing and alerting, are all provided as a service for the rest of the organization by our Infrastructure SRE team.
What’s more, it is a completely self-service system. Metrics do not have to be approved and on-boarded before they can be used. If a developer wants to expose a new metric, all they have to do is annotate the sensor within the application. The container logic takes care of polling the sensor and producing the metrics into Kafka. From there, the monitoring system consumes them and within about 5 minutes, graphs are available. We can then set up a dashboard with multiple metrics, including alert thresholds. Once the metrics are in the system, they are accessible by everyone, and anyone can set up their own dashboard to watch something.
Many common components have their own metrics and dashboards automatically provided without the application needing to annotate them. For example, if an application uses a Kafka client, there are a number of metrics that are produced by default. There are also dashboards for some common things, like HTTP servers. It’s also possible to publish metrics into the system separately from the container. Since we use Kafka for collecting metrics, all you have to do is publish a metrics message. We have helper REST applications for this.
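As a hedged sketch of that last point, the snippet below publishes a metric straight into the Kafka-based pipeline. The topic name, broker address, and message layout are assumptions made for illustration, not the actual schema.

```python
#!/usr/bin/env python3
"""Sketch: publish a metric message directly into the Kafka metrics pipeline.
Topic name, broker address, and message layout are illustrative assumptions."""
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="metrics-kafka.example.com:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

metric = {
    "host": "myhost.example.com",      # placeholder host
    "service": "my-helper-tool",       # hypothetical tool emitting the metric
    "metrics": {"queue_depth": 42},    # illustrative sensor value
    "timestamp": int(time.time() * 1000),
}

# One message on the (assumed) metrics topic is all it takes; the monitoring
# system consumes it and graphs become available a few minutes later.
producer.send("metrics-events", metric)
producer.flush()
```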
Let’s be honest, none of this runs 100% of the time. With applications in a constant state of change, where does this leave us at the top of the stack, where I, as an SRE, am trying to keep the site up?
Everything is on fire all the time, and that’s OK. Hardware is always failing, but ProdOps is detecting that and resolving it. The developers are constantly checking in changes, some of them pretty sketchy, and the tooling is taking care of building the code and generating deployables. Thanks to our Infrastructure SRE team, when those sketchy changes do make it to production, there is monitoring to detect problems and help us resolve them quickly.
SRE focuses on architecting and running the application. We write tools and scripts to support this, and sometimes we write more general tools that other teams use as well. When something breaks, I work with my developers (as an embedded SRE) and get it fixed.
Our NOC is there to monitor the high-level metrics, as most of the monitoring and alerting goes directly to the teams responsible. The NOC watches overall site health, and they track many metrics related to site growth. When there is a problem, they help coordinate multiple teams in fixing it.
This is what we call “site up”, and it is the top priority. A big component of this is that our incident process, both the response and the followup, are blameless. It doesn’t matter who caused a problem, what is important is that we fix it and then make sure it doesn’t happen again. Trying to figure out who is at fault takes time away from other things, and only serves to make someone feel bad and make them less likely to contribute something meaningful in the future.
As with any system, you must review it all the time and make sure you’re headed in the right direction. Like any other application, the infrastructure components are constantly being improved. Some of the incidents we have to resolve expose deficiencies, whether it’s monitoring we missed or a process that needs to be changed to be safer. As users of the tools and infrastructure, SREs and developers are providing feedback on what works and what doesn’t.
For bigger changes to what we’re doing, we have several steering committees that can be engaged to provide broader input and direction. The ProdOps, SRE, and development organizations each have their own committee covering different areas, and we collaborate with each other as needed. The committees are made up of individual contributors and senior technical employees, not managers. This is important, because it feeds into our culture of strong technical leadership.
Most importantly, our systems are set up to provide for open collaboration between all teams. Common code and config repositories are one aspect of this – when everyone can see what’s going on, everyone can contribute. This means that when I find a problem with a tool, I can create a fix and send the owner a patch to review, as opposed to just giving them the feedback, after which they would need to set aside time to look at it among everything else they have to do, reproduce the problem, create a fix, and get it reviewed.