This document discusses Netflix's approach to developing and deploying its API in a way that allows it to move fast while staying safe. It focuses on how Netflix uses automation, architecture, and insight to rapidly innovate and scale its API to support more than 50 million subscribers in over 40 countries, across more than 1,000 device types. Key aspects include automated testing, red/black deployments, predictive autoscaling, and real-time metrics and debugging, all of which enable continuous delivery while maintaining high availability, resiliency and rollback capabilities.
Role of API
• Enable rapid innovation
• Conduit for metadata between Devices and Services
• Implement business logic
• Scale with business
• Maintain resiliency
http://goo.gl/VhokZV
Move Fast; Stay Safe
Developing and Deploying the Netflix API
Sangeeta Narayanan
@sangeetan
http://www.linkedin.com/in/sangeetanarayanan
Editor's Notes
Started out as a DVD-rental-by-mail service
Introduced on-demand video streaming over the internet in 2007
Has since expanded internationally
2012 marked a foray into the world of original programming
Shows like House of Cards and Orange Is the New Black have been received with high acclaim, as evidenced by recent Emmy wins. The strategy is to expand internationally and pursue high-quality content to drive engagement and acquisition.
Global expansion, high quality originals and personalized content have fueled rapid subscriber growth.
Netflix now accounts for over one third of downstream internet traffic in North America at peak. This number has been in the news a lot lately!
Our members can choose to enjoy our service on over 1000 device types.
Edge Engineering operates the services that provide the personalized discovery and streaming experience for our members.
This is an extremely high-level view of the Netflix service. The API is the internet-facing service that all devices connect to for the user experience. The API in turn consumes data from several middle-tier services, applies business logic on top of it as needed, and provides an abstraction layer for devices to interact with.
The API, in effect, acts as a broker of metadata between services and devices. Put another way, almost all product functionality flows through the API.
We are constantly striving for a balance between velocity and availability.
This talk will cover some of the strategies and techniques we employ in pursuit of that balance between velocity and availability. I will focus on three areas: architecture, automation and insight.
Let’s look at a couple of examples of architectural choices that enable velocity and resiliency.
This is an overview of the Netflix Streaming architecture.
Zooming in on the interaction between the API and the devices it serves.
We support over 1000 device types.
Embracing the Differences: http://techblog.netflix.com/2012/07/embracing-differences-inside-netflix.html
Inside the API container
The Dynamic Scripting Platform reduces chattiness and allows API clients to develop and operate endpoints customized to their apps, on top of the API platform. Feature development and operations are distributed in this model, with endpoint development and operations decoupled from those of the API (assuming the requisite functionality is available in the API).
Move away from a resource-based API to an experience-based API
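To make this concrete, here is a minimal sketch of the shape of an experience-based endpoint. All names here are invented, and the real endpoints were Groovy scripts deployed onto the API platform rather than plain Java classes; the point is that one device-owned handler fans out to the mid-tier services server-side and returns a single payload tailored to that UI, instead of the device making several chatty resource calls.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical experience-based endpoint for a TV home screen. One call from
// the device; the fan-out to mid-tier services happens server-side.
public class TvHomeScreenEndpoint {

    public Map<String, Object> handle(String userId) {
        // Fetch the pieces of the experience concurrently. In the real system
        // each call would be protected by the fault-tolerance layer described later.
        CompletableFuture<String> profile =
                CompletableFuture.supplyAsync(() -> fetchProfile(userId));
        CompletableFuture<String> rows =
                CompletableFuture.supplyAsync(() -> fetchRecommendationRows(userId));

        // A single response shaped for this device, instead of several
        // resource-oriented round trips over a high-latency network.
        return Map.of("profile", profile.join(), "rows", rows.join());
    }

    // Stubs standing in for calls to mid-tier services.
    private String fetchProfile(String userId) { return "profile-for-" + userId; }
    private String fetchRecommendationRows(String userId) { return "rows-for-" + userId; }
}
```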
Device teams are able to operate and manage their endpoints independently. This screenshot from our dashboard shows the activity on various endpoints across all API environments.
API Server stats
Going back to the internals of the API container
Hystrix provides fault tolerance and resiliency by implementing the circuit breaker and bulkheading patterns to protect the API from failures in upstream dependencies.
http://techblog.netflix.com/2012/11/hystrix.html
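As a minimal sketch of the pattern: the RatingsClient below is an invented stand-in for a middle-tier dependency, but the HystrixCommand shape is the real API. run() executes on a bulkheaded thread pool, and a timeout, error, or open circuit routes the call to getFallback() instead.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class GetRatingsCommand extends HystrixCommand<String> {

    private final String videoId;

    public GetRatingsCommand(String videoId) {
        // Commands in the same group share a bulkheaded thread pool by default.
        super(HystrixCommandGroupKey.Factory.asKey("RatingsService"));
        this.videoId = videoId;
    }

    @Override
    protected String run() {
        // The remote call; failures here never propagate directly to the API.
        return RatingsClient.fetchRating(videoId);
    }

    @Override
    protected String getFallback() {
        // A static fallback keeps the API responsive while the dependency is down.
        return "NOT_RATED";
    }

    // Hypothetical stub for a middle-tier service client.
    static class RatingsClient {
        static String fetchRating(String videoId) {
            return "PG-13";
        }
    }
}
```

A caller would invoke it with new GetRatingsCommand(videoId).execute(), or observe() for a non-blocking variant.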
Global AWS deployment in 3 EC2 regions. Each region has 3 availability zones.
Each region runs a ‘cluster’ of EC2 instances consisting of one or more ASGs (Auto Scaling Groups). Instances are ephemeral, i.e. they come and go; software is written to handle the loss of instances.
Eureka maintains a registry of healthy instances for each application, and a software load balancer uses this registry to route traffic within the SOA.
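As a rough sketch, assuming a EurekaClient obtained through injection: getNextServerFromEureka round-robins over instances that are registered and passing heartbeats, so unhealthy instances drop out of rotation on their own. The ServiceRouter wrapper and the URL formatting are illustrative only.

```java
import com.netflix.appinfo.InstanceInfo;
import com.netflix.discovery.EurekaClient;

public class ServiceRouter {

    private final EurekaClient eurekaClient;

    public ServiceRouter(EurekaClient eurekaClient) {
        this.eurekaClient = eurekaClient;
    }

    public String resolve(String vipAddress) {
        // Returns the next UP instance registered under this VIP; instances
        // that stop heartbeating fall out of the registry automatically.
        InstanceInfo instance = eurekaClient.getNextServerFromEureka(vipAddress, false);
        return "http://" + instance.getHostName() + ":" + instance.getPort();
    }
}
```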
If we lose an AZ, instances are allocated across the remaining AZs. In the event of a region outage, traffic fails over to the other region.
The Simian Army simulates various outage scenarios that help us validate that our systems handle failures gracefully, as designed. They also serve as practice drills for our teams.
Our traffic pattern ebbs and flows based on time of day and day of week. We use Amazon’s autoscaling policies to adjust capacity dynamically. This is pretty effective, but we ran into some of its limitations, such as its inability to handle a traffic surge after an outage.
To offset these limitations, we created Scryer (not yet open sourced, but in production at Netflix). Scryer evaluates capacity needs based on historical data (week-over-week and month-over-month metrics), adjusts instance minimums algorithmically, and relies on Amazon Auto Scaling for unpredicted events.
This graph shows that Scryer’s predictions closely track actual RPS. In production, Scryer allows us to bring instances into service before they are needed, unlike Amazon’s reactive autoscaling engine, which triggers a ramp-up based on immediate need and must then wait for server start-up to complete. Because the instances are there in advance, Scryer smooths out load averages and response times, which in turn improves the customer experience.
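Scryer is not open source, so the following is only a toy sketch of the idea under invented assumptions (a fixed per-instance capacity, a flat average over prior weeks, and a 20% headroom factor): predict the coming window's load from the same window in past weeks, and raise the ASG minimum before the demand arrives, leaving reactive autoscaling to absorb anything unpredicted.

```java
import java.util.List;

public class PredictiveScaler {

    private static final double RPS_PER_INSTANCE = 1000.0; // assumed instance capacity
    private static final double HEADROOM = 1.2;            // assumed 20% safety margin

    // historicalRps: observed RPS for this same time window in each of the
    // last N weeks (the week-over-week data mentioned above).
    public int predictMinInstances(List<Double> historicalRps) {
        double predictedRps = historicalRps.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        // Set the ASG minimum ahead of time; reactive autoscaling still
        // handles anything the prediction misses.
        return (int) Math.ceil(predictedRps * HEADROOM / RPS_PER_INSTANCE);
    }
}
```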
We want to move fast; but protect ourselves from the dangers of doing so. Automation increases velocity while reducing risk by removing the potential for human error. It also helps to bring consistency and predictability to operations.
Shift the curve so you can go faster without compromising availability
We are trying to stay on the edge, but with safety guards in place.
We have implemented Continuous Delivery to deal with the need for velocity. Releasing software in a steady stream allows us to go faster, bring predictability to our releases and minimize the risks associated with introducing change.
This is a view of our delivery pipeline. We deploy to internal environments several times a day. Production deployments are less frequent because of our farm sizes and the red/black deployment model we follow (details in later slides), but we have the ability to deploy on demand in an automated fashion.
We follow the ‘Operate what you Build’ model where developers are responsible for shepherding their changes all the way through to production. We provide them with the tools necessary to help them gain confidence in the quality of their code. One such tool is the automated Canary Analyzer.
Canary reports are generated at periodic intervals and emailed to the team. They are also available from the dashboard. The canary report shows an overall confidence score for the readiness of that build; this one didn’t do very well.
Details of the problematic metrics that contributed to the poor canary score.
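The real analyzer is considerably more sophisticated, but a toy sketch of the scoring idea might look like the following, with an arbitrary 15% tolerance as an assumption: compare each canary metric against the baseline fleet and fold the per-metric results into a single confidence score.

```java
import java.util.Map;

public class CanaryScorer {

    private static final double TOLERANCE = 0.15; // assumed acceptable degradation

    // baseline/canary: metric name -> observed value (e.g. error rate, latency),
    // where lower is better for every metric in this simplified sketch.
    public double score(Map<String, Double> baseline, Map<String, Double> canary) {
        if (baseline.isEmpty()) {
            return 0.0;
        }
        int passed = 0;
        for (Map.Entry<String, Double> entry : baseline.entrySet()) {
            double canaryValue = canary.getOrDefault(entry.getKey(), Double.MAX_VALUE);
            // A metric passes if the canary stays within tolerance of the baseline.
            if (canaryValue <= entry.getValue() * (1 + TOLERANCE)) {
                passed++;
            }
        }
        return 100.0 * passed / baseline.size();
    }
}
```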
We have a complex web of dependencies. Some problems cannot be caught until we are in Production.
We mitigate that by running a separate dependency-update pipeline, which allows us to validate the latest set of dependencies independently of our own code. This validation goes through all the steps of the normal pipeline, including the canary process. We also have detailed insight into the changes that went into each canary, including library and config changes.
The same pipeline is also available to developers for their feature branches so they can test their code in production in isolation.
Ready for deployment
In the event that a newly deployed version of the software proves to be problematic, the system can be rolled back to the previous version. The old cluster is kept alive for a few hours so the automation knows what to roll back to. Because of our extensive use of autoscaling, provisioning the clusters accurately is tricky, and having to do it manually across three regions would make rollbacks slow and prone to error. Even though rollbacks are rare, the cost of getting one wrong is too high.
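A schematic sketch of that model, with entirely hypothetical names: because the previous cluster stays provisioned (with traffic disabled) during the grace period, a rollback is a pre-sized traffic flip rather than a slow re-provisioning exercise across three regions.

```java
public class RedBlackDeployment {

    // Hypothetical abstraction over whatever shifts traffic between clusters.
    interface TrafficRouter {
        void enableTraffic(String clusterName);
        void disableTraffic(String clusterName);
    }

    private final TrafficRouter router;

    public RedBlackDeployment(TrafficRouter router) {
        this.router = router;
    }

    public void promote(String newCluster, String oldCluster) {
        router.enableTraffic(newCluster);
        router.disableTraffic(oldCluster);
        // The old cluster is kept alive (but idle) for a few hours, so the
        // automation always knows exactly what to roll back to.
    }

    public void rollback(String newCluster, String oldCluster) {
        router.enableTraffic(oldCluster);  // instances are already warm and sized
        router.disableTraffic(newCluster);
    }
}
```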
Dynamic configuration using Archaius allows features to be toggled dynamically. If a newly introduced feature proves to be problematic, turning it off is an easy way to restore system health. Archaius is a set of configuration management APIs based on the Apache Commons Configuration library. It allows configuration changes to be propagated in a matter of minutes, at runtime, without requiring application downtime. Configuration properties are multi-dimensional and context-aware, so their scope can be limited to a specific context, e.g. env=test/staging/production or region=us-east/us-west/eu-west.
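A minimal Archaius feature toggle looks like the following (the property name is a made-up example). The property re-reads its current value on each call, so a configuration change pushed to the config source takes effect on running instances within minutes, with no restart.

```java
import com.netflix.config.DynamicBooleanProperty;
import com.netflix.config.DynamicPropertyFactory;

public class RecommendationFeature {

    // Hypothetical property name; the second argument is the default used
    // until a value is pushed through the configuration source.
    private static final DynamicBooleanProperty NEW_ROW_ENABLED =
            DynamicPropertyFactory.getInstance()
                    .getBooleanProperty("api.recommendations.newRow.enabled", false);

    public boolean isEnabled() {
        // Reads the live value each time, so flipping the property off at
        // runtime disables the feature without a deploy or restart.
        return NEW_ROW_ENABLED.get();
    }
}
```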
Top: Notification of scheduled deployment emailed to the team.
Bottom: a chatbot provides real-time updates
http://techblog.netflix.com/2012/12/hystrix-dashboard-and-turbine.html
Real-time dashboard powered by Turbine and Hystrix
We can see an outage in real time: the number of 5XX errors and latency spiked during the incident. This data is streamed from hundreds of servers, aggregated using Turbine, and pushed to the dashboard.
As service owners, we are responsible for defining and configuring our own alerts. And respond to them at 4am too!
We need to be mindful of the number of metrics we are publishing so we don’t inundate the monitoring systems. That is part of the canary analysis as well.
Our big data pipeline (based on Kafka, Druid and Suro) powers this console, which allows for real-time debugging and request tracing. http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html
All changes in production are recorded by publishing them to a central system, and can be used for auditing and correlation with production events.
Good architectural practices, automation & tooling and deep insight into our systems allow us to operate resilient systems and go fast at scale. But the key piece that brings it all together and completes the picture is our culture.
Employees have the freedom to make major decisions and act on them without approvals. The counterbalance is the responsibility they assume for the implications of their actions. Management’s job is to set the appropriate context so employees have all the information they need to make the right decisions and judgement calls. This fosters a blameless culture where people feel empowered to take risks.