This document discusses Netflix's approach to developing and deploying its API in a way that allows it to move fast while staying safe. It focuses on how Netflix uses automation, architecture, and insight to rapidly innovate and scale its API to support more than 50 million subscribers in over 40 countries, across more than 1,000 device types. Key aspects include automated testing, red/black deployments, predictive autoscaling, and real-time metrics and debugging, all of which enable continuous delivery while maintaining high availability, resiliency and rollback capabilities.
Role of API
• Enable rapid innovation
• Conduit for metadata between Devices and Services
• Implement business logic
• Scale with business
• Maintain resiliency
http://goo.gl/VhokZV
Move Fast; Stay Safe
Developing and Deploying the Netflix API
Sangeeta Narayanan
@sangeetan
http://www.linkedin.com/in/sangeetanarayanan
Editor's Notes
Started out as a DVD-rental-by-mail service
Introduced on-demand video streaming over the internet in 2007
Has since expanded internationally
2012 marked a foray into the world of original programming
Shows like House of Cards and Orange Is the New Black have been received with high acclaim, as evidenced by recent Emmy wins. The strategy is to expand internationally and pursue high-quality content to drive engagement and acquisition.
Global expansion, high quality originals and personalized content have fueled rapid subscriber growth.
Netflix now accounts for over one third of downstream internet traffic in North America at peak. This number has been in the news a lot lately!
Our members can choose to enjoy our service on over 1000 device types.
Edge Engineering operates the services that provide the personalized discovery and streaming experience for our members.
This is an extremely high-level view of the Netflix service. The API is the internet-facing service that all devices connect to for the user experience. The API in turn consumes data from several middle-tier services, applies business logic on top of it as needed, and provides an abstraction layer for devices to interact with.
The API, in effect, acts as a broker of metadata between services and devices. Put another way, almost all product functionality flows through the API.
We are constantly striving for a balance between velocity and availability.
This talk will cover some of the strategies and techniques we employ in pursuit of that balance between velocity and availability. I will focus on three areas: architecture, automation and insight.
Let’s look at a couple of examples of architectural choices that enable velocity and resiliency.
This is an overview of the Netflix Streaming architecture.
Zooming in on the interaction between the API and the devices it serves.
We support over 1000 device types.
Embracing the Differences: http://techblog.netflix.com/2012/07/embracing-differences-inside-netflix.html
Inside the API container
The Dynamic Scripting Platform reduces chattiness and allows API clients to develop and operate endpoints customized to their apps, on top of the API platform. Feature development and operations are distributed in this model, with endpoint development and operations decoupled from those of the API (assuming the requisite functionality is available in the API).
Move away from a resource-based API to an experience-based API
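To make this concrete, here is a minimal sketch of the shape of an experience-based endpoint. All names here are invented, and the real endpoints were Groovy scripts deployed onto the API platform rather than plain Java classes; the point is that one device-owned handler fans out to the mid-tier services server-side and returns a single payload tailored to that UI, instead of the device making several chatty resource calls.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical experience-based endpoint for a TV home screen. One call from
// the device; the fan-out to mid-tier services happens server-side.
public class TvHomeScreenEndpoint {

    public Map<String, Object> handle(String userId) {
        // Fetch the pieces of the experience concurrently. In the real system
        // each call would be protected by the fault-tolerance layer described later.
        CompletableFuture<String> profile =
                CompletableFuture.supplyAsync(() -> fetchProfile(userId));
        CompletableFuture<String> rows =
                CompletableFuture.supplyAsync(() -> fetchRecommendationRows(userId));

        // A single response shaped for this device, instead of several
        // resource-oriented round trips over a high-latency network.
        return Map.of("profile", profile.join(), "rows", rows.join());
    }

    // Stubs standing in for calls to mid-tier services.
    private String fetchProfile(String userId) { return "profile-for-" + userId; }
    private String fetchRecommendationRows(String userId) { return "rows-for-" + userId; }
}
```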
Device teams are able to operate and manage their endpoints independently. This screenshot from our dashboard shows the activity on various endpoints across all API environments.
API Server stats
Going back to the internals of the API container
Hystrix provides fault tolerance and resiliency by implementing the circuit breaker and bulkheading patterns to protect the API from failures in upstream dependencies.
http://techblog.netflix.com/2012/11/hystrix.html
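As a minimal sketch of the pattern: the RatingsClient below is an invented stand-in for a middle-tier dependency, but the HystrixCommand shape is the real API. run() executes on a bulkheaded thread pool, and a timeout, error, or open circuit routes the call to getFallback() instead.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetRatingsCommand extends HystrixCommand<String> {

    private final String videoId;

    public GetRatingsCommand(String videoId) {
        // Commands in the same group share a bulkheaded thread pool by default.
        super(HystrixCommandGroupKey.Factory.asKey("RatingsService"));
        this.videoId = videoId;
    }

    @Override
    protected String run() {
        // The remote call; failures here never propagate directly to the API.
        return RatingsClient.fetchRating(videoId);
    }

    @Override
    protected String getFallback() {
        // A static fallback keeps the API responsive while the dependency is down.
        return "NOT_RATED";
    }

    // Hypothetical stub for a middle-tier service client.
    static class RatingsClient {
        static String fetchRating(String videoId) {
            return "PG-13";
        }
    }
}
```

A caller would invoke it with new GetRatingsCommand(videoId).execute(), or observe() for a non-blocking variant.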
Global AWS deployment in 3 EC2 regions. Each region has 3 availability zones.
Each region runs a ‘cluster’ of EC2 instances consisting of one or more ASGs (Auto Scaling Groups). Instances are ephemeral, i.e. they come and go; software is written to handle the loss of instances.
Eureka maintains a registry of healthy instances for each application, and a software load balancer uses this registry to route traffic within the SOA.
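As a rough sketch, assuming a EurekaClient obtained through injection: getNextServerFromEureka round-robins over instances that are registered and passing heartbeats, so unhealthy instances drop out of rotation on their own. The ServiceRouter wrapper and the URL formatting are illustrative only.

```java
import com.netflix.appinfo.InstanceInfo;
import com.netflix.discovery.EurekaClient;

public class ServiceRouter {

    private final EurekaClient eurekaClient;

    public ServiceRouter(EurekaClient eurekaClient) {
        this.eurekaClient = eurekaClient;
    }

    public String resolve(String vipAddress) {
        // Returns the next UP instance registered under this VIP; instances
        // that stop heartbeating fall out of the registry automatically.
        InstanceInfo instance = eurekaClient.getNextServerFromEureka(vipAddress, false);
        return "http://" + instance.getHostName() + ":" + instance.getPort();
    }
}
```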
If we lose an AZ, instances are allocated across the remaining AZs. In the event of a region outage, traffic fails over to the other region.
The Simian Army simulates various outage scenarios that help us validate that our systems handle failures gracefully, as designed. They also serve as practice drills for our teams.
Our traffic pattern ebbs and flows based on time of day and day of week. We use Amazon’s autoscaling policies to adjust capacity dynamically. This is pretty effective, but we ran into some of its limitations, such as its inability to handle a traffic surge after an outage.
To offset these limitations, we created Scryer (not yet open sourced, but in production at Netflix). Scryer evaluates capacity needs based on historical data (week-over-week and month-over-month metrics), adjusts instance minimums algorithmically, and relies on Amazon Auto Scaling for unpredicted events.
This graph shows that Scryer’s predictions closely track actual RPS. In production, Scryer allows us to bring instances into service before they are needed, unlike Amazon’s reactive autoscaling engine, which triggers a ramp-up based on immediate need and must then wait for server start-up to complete. Because the instances are there in advance, Scryer smooths out load averages and response times, which in turn improves the customer experience.
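Scryer is not open source, so the following is only a toy sketch of the idea under invented assumptions (a fixed per-instance capacity, a flat average over prior weeks, and a 20% headroom factor): predict the coming window's load from the same window in past weeks, and raise the ASG minimum before the demand arrives, leaving reactive autoscaling to absorb anything unpredicted.

```java
import java.util.List;

public class PredictiveScaler {

    private static final double RPS_PER_INSTANCE = 1000.0; // assumed instance capacity
    private static final double HEADROOM = 1.2;            // assumed 20% safety margin

    // historicalRps: observed RPS for this same time window in each of the
    // last N weeks (the week-over-week data mentioned above).
    public int predictMinInstances(List<Double> historicalRps) {
        double predictedRps = historicalRps.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        // Set the ASG minimum ahead of time; reactive autoscaling still
        // handles anything the prediction misses.
        return (int) Math.ceil(predictedRps * HEADROOM / RPS_PER_INSTANCE);
    }
}
```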
We want to move fast; but protect ourselves from the dangers of doing so. Automation increases velocity while reducing risk by removing the potential for human error. It also helps to bring consistency and predictability to operations.
Shift the curve so you can go faster without compromising availability
We are trying to stay on the edge, but with safety guards in place.
We have implemented Continuous Delivery to deal with the need for velocity. Releasing software in a steady stream allows us to go faster, bring predictability to our releases and minimize the risks associated with introducing change.
This is a view of our delivery pipeline. We deploy to internal environments several times a day. Production deployments are less frequent because of our farm sizes and the red/black deployment model we follow (details in later slides), but we have the ability to deploy on demand in an automated fashion.
We follow the ‘Operate what you Build’ model where developers are responsible for shepherding their changes all the way through to production. We provide them with the tools necessary to help them gain confidence in the quality of their code. One such tool is the automated Canary Analyzer.
Canary reports are generated at periodic intervals and emailed to the team. They are also available from the dashboard. The canary report shows an overall confidence score for the readiness of that build; this one didn’t do very well.
Details of the problematic metrics that contributed to the poor canary score.
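The real analyzer is considerably more sophisticated, but a toy sketch of the scoring idea might look like the following, with an arbitrary 15% tolerance as an assumption: compare each canary metric against the baseline fleet and fold the per-metric results into a single confidence score.

```java
import java.util.Map;

public class CanaryScorer {

    private static final double TOLERANCE = 0.15; // assumed acceptable degradation

    // baseline/canary: metric name -> observed value (e.g. error rate, latency),
    // where lower is better for every metric in this simplified sketch.
    public double score(Map<String, Double> baseline, Map<String, Double> canary) {
        if (baseline.isEmpty()) {
            return 0.0;
        }
        int passed = 0;
        for (Map.Entry<String, Double> entry : baseline.entrySet()) {
            double canaryValue = canary.getOrDefault(entry.getKey(), Double.MAX_VALUE);
            // A metric passes if the canary stays within tolerance of the baseline.
            if (canaryValue <= entry.getValue() * (1 + TOLERANCE)) {
                passed++;
            }
        }
        return 100.0 * passed / baseline.size();
    }
}
```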
We have a complex web of dependencies. Some problems cannot be caught until we are in Production.
We mitigate that by running a separate dependency-update pipeline, which allows us to validate the latest set of dependencies independently of our own code. This validation goes through all the steps of the normal pipeline, including the canary process. We also have detailed insight into the changes that went into each canary, including library and config changes.
The same pipeline is also available to developers for their feature branches so they can test their code in production in isolation.
Ready for deployment
In the event that a newly deployed version of the software proves to be problematic, the system can be rolled back to the previous version. The old cluster is kept alive for a few hours so the automation knows what to roll back to. Because of our extensive use of autoscaling, provisioning the clusters accurately is tricky, and having to do it manually across three regions would make rollbacks slow and prone to error. Even though rollbacks are rare, the cost of getting one wrong is too high.
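A schematic sketch of that model, with entirely hypothetical names: because the previous cluster stays provisioned (with traffic disabled) during the grace period, a rollback is a pre-sized traffic flip rather than a slow re-provisioning exercise across three regions.

```java
public class RedBlackDeployment {

    // Hypothetical abstraction over whatever shifts traffic between clusters.
    interface TrafficRouter {
        void enableTraffic(String clusterName);
        void disableTraffic(String clusterName);
    }

    private final TrafficRouter router;

    public RedBlackDeployment(TrafficRouter router) {
        this.router = router;
    }

    public void promote(String newCluster, String oldCluster) {
        router.enableTraffic(newCluster);
        router.disableTraffic(oldCluster);
        // The old cluster is kept alive (but idle) for a few hours, so the
        // automation always knows exactly what to roll back to.
    }

    public void rollback(String newCluster, String oldCluster) {
        router.enableTraffic(oldCluster);  // instances are already warm and sized
        router.disableTraffic(newCluster);
    }
}
```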
Dynamic configuration using Archaius allows features to be toggled dynamically. If a newly introduced feature proves to be problematic, turning it off is an easy way to restore system health. Archaius is a set of configuration management APIs based on the Apache Commons Configuration library. It allows configuration changes to be propagated in a matter of minutes, at runtime, without requiring application downtime. Configuration properties are multi-dimensional and context-aware, so their scope can be limited to a specific context, e.g. env=test/staging/production or region=us-east/us-west/eu-west.
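A minimal Archaius feature toggle looks like the following (the property name is a made-up example). The property re-reads its current value on each call, so a configuration change pushed to the config source takes effect on running instances within minutes, with no restart.

```java
import com.netflix.config.DynamicBooleanProperty;
import com.netflix.config.DynamicPropertyFactory;

public class RecommendationFeature {

    // Hypothetical property name; the second argument is the default used
    // until a value is pushed through the configuration source.
    private static final DynamicBooleanProperty NEW_ROW_ENABLED =
            DynamicPropertyFactory.getInstance()
                    .getBooleanProperty("api.recommendations.newRow.enabled", false);

    public boolean isEnabled() {
        // Reads the live value each time, so flipping the property off at
        // runtime disables the feature without a deploy or restart.
        return NEW_ROW_ENABLED.get();
    }
}
```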
Top: Notification of scheduled deployment emailed to the team.
Bottom: a chatbot provides real-time updates
http://techblog.netflix.com/2012/12/hystrix-dashboard-and-turbine.html
Real-time dashboard powered by Turbine and Hystrix
We can see an outage in real time: the number of 5XX errors and latency spiked during the incident. This data is streamed from hundreds of servers, aggregated using Turbine, and pushed to the dashboard.
As service owners, we are responsible for defining and configuring our own alerts. And respond to them at 4am too!
We need to be mindful of the number of metrics we are publishing so we don’t inundate the monitoring systems. That is part of the canary analysis as well.
Our big data pipeline (based on Kafka, Druid and Suro) powers this console, which allows for real-time debugging and request tracing. http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html
All changes in production are recorded by publishing them to a central system, and can be used for auditing and correlation with production events.
Good architectural practices, automation & tooling and deep insight into our systems allow us to operate resilient systems and go fast at scale. But the key piece that brings it all together and completes the picture is our culture.
Employees have the freedom to make major decisions and act on them without approvals. The counterbalance is the responsibility they assume for the implications of their actions. Management’s job is to set the appropriate context so employees have all the information they need to make the right decisions and judgement calls. This fosters a blameless culture where people feel empowered to take risks.