The operations engineer is often seen as the hero, toiling away late nights on call to keep the systems running through failures of hardware and of code. While developers try as hard as possible to move quickly and break things, we stand as the voice of reason urging caution. We’re the only ones who truly understand the systems, but you’ll rarely find documentation because it’s just too complex and changeable to write down. When we’re doing our jobs well, we’re unappreciated because nobody understands how difficult it is. When things break, everyone thinks we’re doing our jobs badly. These are not the things we aspire to.
At LinkedIn, Site Reliability Engineers are one layer in a stack that starts with the way we manage our code and basic hardware, and is built with common systems for application management, monitoring, and alerting. Each layer has its own specialist engineers, focused on making their piece as resilient as it can be and building it to integrate with the rest of the stack. This lets Software Engineers concentrate on developing their applications, without having to spend time building systems to build, package, and distribute their code. SREs can dedicate their time to integrating applications with the stack, architecting and scaling deployments, as well as developing tools and documentation to make the job easier. When the inevitable failure happens, many experts come together to quickly identify and resolve the problem and improve the entire stack for everyone.
Description:
Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016
This is not far from the truth. We go through a lot of beer. We’ll get to why I drink shortly.
Site Reliability Engineering, or SRE, combines several roles that fit together into one Operations position.
Foremost, we are administrators. We manage all of the systems in our area.
We are also architects. We do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together.
And we are also developers. We identify tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them.
This is all well and good for describing the responsibilities, but how do we do it? An SRE needs certain knowledge:
A little knowledge of all the components
Understanding of how they fit together
Understanding of how to fit them into the infrastructure
Combined with the ability to build tools and automation around the applications, SRE allows the developers to focus on the application, not on running the application. At the end of the day, our job is to keep the site running, always.
At LinkedIn we have three types of SREs. The work is generally the same, but the scope is different for each.
Embedded SRE teams are closely aligned with a development team, working with a specific application. This requires deep knowledge of the application itself, and the SREs often find themselves working in the code. The development team and the SRE team work together on feature planning, with the SRE team providing their expertise in operations to inform the architecture of the application.
Central SRE teams (at LinkedIn we now call them Production SRE) oversee a number of different applications for a variety of development teams. Many of these applications are not big enough on their own to warrant their own teams, so the central SRE team assists with managing the operations of the applications, including making sure there’s hardware for them to run on. Production SRE is also the home of our NOC team, who provide high-level site monitoring and coordinate incidents that impact more than one team.
Tools and Infrastructure SREs are a category unto themselves. These teams are responsible for developing and deploying the infrastructure that everything at LinkedIn uses: the build and deployment systems, the monitoring and alerting systems, and other tools that are common to all teams.
My role is that of an Embedded SRE, working directly with the development teams responsible for Streaming. So, on to why I drink.
This is an overview of the Streaming ecosystem at LinkedIn, highly simplified (it doesn’t account for multiple sites and simplifies many of the data flows). Within the Streaming organization, we have 3 teams – Data*, Kafka, and Samza.
Data* manages our change capture systems. There are several versions of these, with the latest being Brooklin. Brooklin uses Apache Kafka underneath for streaming changes from Espresso (a key-value store) to client systems.
Apache Kafka is the heart of our big data systems. Not only does it underpin Brooklin, but some of our data storage systems, such as Espresso and Voldemort (two different key-value stores), also use Kafka for replication between components. We also have a number of multitenant Kafka clusters, which are used by every system and application at LinkedIn. These are used for user tracking data, system and application metrics, logging, and queuing all sorts of other messages. Because Kafka is used for metrics, driving our monitoring and alerting systems, we maintain separate monitoring systems for Kafka itself. Our team is also responsible for managing Zookeeper, which is used by us and by many other applications.
Samza is the third team, and they manage our stream processing platform, which uses Apache Samza. It relies heavily on Kafka, both to provide the input data and as a place for intermediate results to be written. The applications that run here include our data standardization systems and messaging applications.
My team is quite small. We have 3 SREs dedicated to Kafka and Zookeeper in the US, plus a little more than one additional full-time SRE on our team in Bangalore, India. That is to manage a deployment with well over 6000 application instances. For the core part of that, the Kafka clusters themselves, we have over 100 separate clusters comprising more than 1800 servers. They’re processing over a trillion messages a day in total.
What’s more, LinkedIn’s landscape is changing daily. There are thousands of applications running, with new versions deployed many times a day. Hardware is always changing, and we always have new features to contend with. There’s always someone who needs our help. How can we manage to run this ecosystem effectively with so small a team? The answer lies in what I call full-stack reliability.
Many of us will be familiar with Maslow’s hierarchy of needs. This diagram illustrates the theory that there are basic needs that must be met in order for us to function as human beings. Each need builds upon the one below it. None can stand unless the ones beneath are met.
What makes the SRE teams at LinkedIn effective is that we have built our environment in a similar fashion. When building a system within a cloud environment, you have many services that are provided for you to take advantage of. This includes hardware, databases, load balancers, monitoring, and any number of other tools. The idea is that you want to be able to focus on your application, not running those things that are not core to your business, but are still required.
Here is what my stack looks like. I’m not as fancy as Maslow, with his colors, but the same theory stands. Each layer describes a basic need when it comes to reliability in our applications. None of the layers can stand unless the ones below them are satisfied.
My stack has 6 layers, starting from the bottom:
Infrastructure as a Service
Common Repositories
Containerization
Build and Deployment
Monitoring
Site Up
We’ll cover each of these in turn.
As an SRE, I have never set foot in a LinkedIn datacenter, nor have I had my hands on one of our servers. I haven’t even installed an operating system on one of them. Likewise, I have never worked on our networking hardware, or directly made modifications to a service like DNS.
All of the services are provided by a separate organization, named Production Operations. The 3 larger teams that SRE works with on a day-to-day basis are:
the Datacenter Technicians, who are the people who actually deal directly with the hardware. They are the ones on site in each datacenter to both deploy and maintain the systems.
Systems Operations, the team responsible for the operating system deployment. They are also responsible for maintaining services like DNS.
Network Operations, which performs a similar function for the network, handling all the routers and switches, as well as firewalls, load balancers, and more.
The ProdOps team provides all basic OS and network services so that other teams do not have to have specialists in these areas and there is consistency across the infrastructure. For most applications, when I need to deploy new services I can allocate systems from a common pool and deploy with one command. If I need DNS changes, or network ACLs, I open a request for the change and it’s taken care of promptly.
When I need to deploy a new broker, it’s a little different because they use custom hardware and tuning. For this, I put in a ticket for new hardware. Within a specified time, I get a hostname for the new system. I can trust that it’s already configured the way I need it, and it’s fully integrated with LinkedIn’s systems. I just need to deploy my application. How we get to that deployable application involves the next 3 layers.
Applications start as source code, and how that code is managed forms the base of the application layers. We use a single set of repositories for all code and configuration, which are kept separate from each other. These Subversion and Git repositories (we use both right now) are centrally managed by our Tools team. They have consistent precommit checks, which not only validate the format of certain files (like XML or YAML), but also perform more complex checks like rejecting duplicate class definitions. There are also ACLs and review boards tied in, so that individual teams can make sure that changes to their applications are appropriately vetted before they are committed. These repositories are tied into our build system as well, as we’ll discuss in the next layer.
This may seem like a small thing to make up such a fundamental layer, but the management of code and config is critical. We have cultural tenets of craftsmanship and openness, and this serves both of them. Precommit checks allow us to follow a set of standards as to how we write code. Having it all in one place means that anyone can check out anyone else’s code – there are no secrets. It’s also important that we maintain configurations the same way we maintain code. Requiring reviews before things are checked in means we are able to catch a lot of problems before they get out to production.
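To make the precommit idea concrete, here is a minimal, generic sketch of a format check that rejects malformed YAML or XML before it lands in the repository. It is an illustration only, not LinkedIn’s actual hook, and the convention of passing changed files as arguments is an assumption.

```python
#!/usr/bin/env python3
"""Generic precommit sketch: reject commits containing malformed YAML or XML.
Illustration only -- not LinkedIn's actual hook."""
import sys
import xml.etree.ElementTree as ET

import yaml  # pip install pyyaml


def check_file(path):
    """Return True if the file parses cleanly, False otherwise."""
    try:
        with open(path) as f:
            if path.endswith((".yaml", ".yml")):
                yaml.safe_load(f)
            elif path.endswith(".xml"):
                ET.parse(f)
    except (yaml.YAMLError, ET.ParseError) as err:
        print(f"REJECTED {path}: {err}")
        return False
    return True


if __name__ == "__main__":
    # Assumes the hook runner passes the changed files as arguments.
    results = [check_file(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```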
Most of the applications we are working with are Java. We do have a large number of Python applications, as that is the other supported language and it’s used a lot by the SRE teams for writing the tools around the applications. Of course, there are more languages than that in use – I have a few Golang apps that we have written. Because that is not a fully supported language, I had to take a few extra steps to make sure it would integrate with all of our build and deployment systems.
All of the Java applications run in a container, usually Tomcat or Jetty, that encapsulates the application and provides all of the common pieces for the application developer. For example, the monitoring systems (which make up the next layer) are simply hooked in here, and most client libraries are accessed via Spring. The versions have already been vetted by other teams, and any configuration parameters either have sane defaults or are surfaced in the application’s config.
The most important thing about the containers is that they provide a reliable control surface for the application. This allows the app to interact with all of the tooling within LinkedIn without needing to specifically implement it. For one example, the container provides an HTTP endpoint of its own. For any app, I can quickly determine what the port number of this endpoint is, because there is a registry of port numbers, and I know that I can request `/admin` on that endpoint and get back either a good or a bad response, depending on the health of the application. A number of tools and automatic monitoring systems depend on this.
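As a rough sketch of how tooling can lean on that control surface, the snippet below polls an application’s `/admin` endpoint and reports health. The port-registry lookup is a hypothetical stand-in for whatever interface the real registry exposes, and the host and app names are placeholders.

```python
#!/usr/bin/env python3
"""Sketch: poll an application's /admin endpoint to check its health.
The port lookup below is a hypothetical stand-in for the real port registry."""
import sys

import requests  # pip install requests


def lookup_admin_port(app_name):
    """Hypothetical registry lookup; the single entry is illustrative only."""
    registry = {"example-app": 10251}
    return registry[app_name]


def is_healthy(host, app_name, timeout=5):
    port = lookup_admin_port(app_name)
    try:
        resp = requests.get(f"http://{host}:{port}/admin", timeout=timeout)
    except requests.RequestException:
        return False
    # The container answers /admin with a good or bad response
    # depending on the health of the application.
    return resp.ok


if __name__ == "__main__":
    host, app = sys.argv[1], sys.argv[2]
    print("healthy" if is_healthy(host, app) else "unhealthy")
```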
As soon as code is committed to the repository, a build task is started. Most of us are familiar with these processes from open source projects, and we handle our internal applications the same way. Successful builds automatically become deployable artifacts and are pushed up to Artifactory. Failures have a ticket created for them, assigned to the person who checked in the code. In many cases, the bad commit is automatically reverted to keep trunk (or master) clean.
As with everything else, these build systems are centrally managed by the Tools team. For all of them, we maintain helper applications that make working with the apps easier. With common repositories and build systems, I can easily introspect and manage the dependency tree, for example. That is very important to me as the owner of the Kafka client library: when I have a critical fix that needs to go out to hundreds of applications, I can push a library update into all the dependent applications with as little as a single command.
We also have a system for tracking the versions of applications that are deployed. It enforces rules and deployment steps that can be defined for each app, which means we can set a release process that anyone can follow. That, in turn, means I can trust developers to deploy applications to production, because they will always follow the deployment path we have worked out together.
Deployment is pretty amazing as well. Not only can we use the version tracking system to perform multiple steps with the push of a button, but if I need to get a little more manual, it’s still only one command to deploy anywhere in our infrastructure.
Once deployed, monitoring is the most important part of running an application. If there’s an application that doesn’t have some sort of monitoring on it, it may as well not exist at all. At LinkedIn, our monitoring systems, including graphing and alerting, are all provided as a service for the rest of the organization by our Infrastructure SRE team.
What’s more, it is a completely self-service system. Metrics do not have to be approved and on-boarded before they can be used. If a developer wants to expose a new metric, all they have to do is annotate the sensor within the application. The container logic takes care of polling the sensor and producing the metrics into Kafka. From there, the monitoring system consumes them and within about 5 minutes, graphs are available. We can then set up a dashboard with multiple metrics, including alert thresholds. Once the metrics are in the system, they are accessible by everyone, and anyone can set up their own dashboard to watch something.
Many common components have their own metrics and dashboards automatically provided without the application needing to annotate them. For example, if an application uses a Kafka client, there are a number of metrics that are produced by default. There are also dashboards for some common things, like HTTP servers. It’s also possible to publish metrics into the system separately from the container. Since we use Kafka for collecting metrics, all you have to do is publish a metrics message. We have helper REST applications for this.
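As a hedged sketch of that last point, the snippet below publishes a metric straight into the Kafka-based pipeline. The topic name, broker address, and message layout are assumptions made for illustration, not the actual schema.

```python
#!/usr/bin/env python3
"""Sketch: publish a metric message directly into the Kafka metrics pipeline.
Topic name, broker address, and message layout are illustrative assumptions."""
import json
import time

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="metrics-kafka.example.com:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

metric = {
    "host": "myhost.example.com",      # placeholder host
    "service": "my-helper-tool",       # hypothetical tool emitting the metric
    "metrics": {"queue_depth": 42},    # illustrative sensor value
    "timestamp": int(time.time() * 1000),
}

# One message on the (assumed) metrics topic is all it takes; the monitoring
# system consumes it and graphs become available a few minutes later.
producer.send("metrics-events", metric)
producer.flush()
```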
Let’s be honest, none of this runs 100% of the time. With applications in a constant state of change, where does this leave us at the top of the stack, where I, as an SRE, am trying to keep the site up?
Everything is on fire all the time, and that’s OK. Hardware is always failing, but ProdOps is detecting that and resolving it. The developers are constantly checking in changes, some of them pretty sketchy, and the tooling is taking care of building the code and generating deployables. Thanks to our Infrastructure SRE team, when those sketchy changes do make it to production, there is monitoring to detect problems and help us resolve them quickly.
SRE focuses on architecting and running the application. We write tools and scripts to support this, and sometimes we write more general tools that other teams use as well. When something breaks, I work with my developers (as an embedded SRE) and get it fixed.
Our NOC is there to monitor the high-level metrics, as most of the monitoring and alerting goes directly to the teams responsible. The NOC watches overall site health, and they track many metrics related to site growth. When there is a problem, they help coordinate multiple teams in fixing it.
This is what we call “site up”, and it is the top priority. A big component of this is that our incident process, both the response and the followup, are blameless. It doesn’t matter who caused a problem, what is important is that we fix it and then make sure it doesn’t happen again. Trying to figure out who is at fault takes time away from other things, and only serves to make someone feel bad and make them less likely to contribute something meaningful in the future.
As with any system, you must review it all the time and make sure you’re headed in the right direction. Like any other application, the infrastructure components are constantly being improved. Some of the incidents we have to resolve expose deficiencies, whether it’s monitoring we missed or a process that needs to be changed to be safer. As users of the tools and infrastructure, SREs and developers are providing feedback on what works and what doesn’t.
For bigger changes to what we’re doing, we have several steering committees that can be engaged to provide broader input and direction. The ProdOps, SRE, and development organizations each have their own committee covering different areas, and we collaborate with each other as needed. The committees are made up of individual contributors and senior technical employees, not managers. This is important, because it feeds into our culture of strong technical leadership.
Most importantly, our systems are set up to provide for open collaboration between all teams. Common code and config repositories are one aspect of this – when everyone can see what’s going on, everyone can contribute. This means that when I find a problem with a tool, I can create a fix and send the owner a patch to review, as opposed to just giving them the feedback, after which they would need to set aside time to look at it among everything else they have to do, reproduce the problem, create a fix, and get it reviewed.