SlideShare a Scribd company logo
1 of 22
Download to read offline
Herding Kats -
Netflix’s Kubernetes Journey
Andrew Spyker - @aspyker
10/27/2020
Hint: Don’t know him? Search: “Tiger King”
About the Speaker (@aspyker)
● Manager, Compute Platform
○ OS Engineering (OS, Kernel, Images)
○ Container Management Platform (Titus)
○ Image Baking and Registry
○ 20 engineers who design, implement, operate
● Part of Cloud Infrastructure Engineering
○ Compute, Networking, Reliability, Performance, Traffic/Capacity
● Part of Platform Engineering
○ Infrastructure, Productivity, Data Platform, TPM
○ Supports all of Product Engineering Cloud Infrastructure
Engineering
Batch first, added services later with focus on developer productivity
- Simpler compute management, deployment packaging, local dev
As organic adoption grew, focused on scale and reliability
- First very large scale users - Flink, Media Encoding
- Netflix Streaming critical services onboarded
Focus on efficiency - largest “cluster” in Netflix fleet
- First to make Titus cheaper, later focus on workloads
Titus and containers become supported first class (as much as VMs)
Netflix Container Management History (Titus)
2015
2016
2019
2018
Design Principles of Titus
● Runs any “Docker” container (IaaS, not PaaS)
○ Deep integration with existing Netflix platform services
● AWS workloads work unchanged
○ Not trying to abstract AWS EC2 away
○ Deep AWS Collaboration, making containers work well on EC2
● Just what Netflix needs
○ Not extendable to all vendors/companies’ needs
What runs on Titus today
The Netflix Streaming Product
Open Connect CDN monitoring/planning
Machine learning and recommendations
Workloads that power our Studios and Creatives
Content Engineering (Encoding, Artwork personalization, etc.)
Big Data platform (Presto, Notebooks, ETL, etc.)
We started with Mesos and then moved on
Mesos in 2015?
● Used at Netflix in other systems
● Already proven at scale for general use (Twitter, Apple, etc.)
● Just enough model, very adaptable to our needs
● Worked very well!
NOT Mesos in 2019?
● Web Scale users moving away
● “Northbound” agent to control plane communication
“Easy” to move as we don’t leverage Mesos features
Goal: remove Mesos …
bringing all existing workloads along seamlessly
Existing Stack
Agent
Agent
Existing
Agents
Mesos agent
Titus container
runtime
User ContainersUser ContainersUser Containers
Titus API
(Federation)
Titus Scheduling &
Coordination
Mesos etcd
Agent
Agent
Existing
Agents
Virtual Kubelet
Titus container
runtime
User ContainersUser ContainersUser Containers
Titus Scheduling &
Coordination
API Server
Kubernetes Powered Stack
Move
workloads
over time
Rollout timeframe
Q1
2019
Q2 Q3 Q4 Q1
2020
Q2 Q3
Design Implementation
and test env rollout
Production
rollout
Major scale-up
Multiple production
issues
Reliability Focus
Mesos turned off
Production Incidents
Foundational change to our production systems during unprecedented time
Already partially deployed by the start of shelter in place
All team members impacted by world events across 2020
A story of learning and eventual success
A story of success
Kinds of failures we saw
Scale
● API Server tuning for timeouts and caching for list and watch
Distributed
● Load balancing for web hooks (timeouts, network security)
● Overloading etcd with custom CRDs
Other “normal” challenges in running new infrastructure
Time to Recovery >> Root Cause
Recovery time was much longer than our previous system
Why?
● Didn’t know the new system as well as old system
● It takes real incidents to learn all areas of a new system
● Recency bias in diagnosis and remediation actions
Actions taken
● Implement recovery escape valves, use first, then diagnose
● Game days including breaking our systems
From Mesos architecture - “two monoliths”
⮑ Kubernetes architecture - “distributed system”
Likely rushed the productization of this change
Realized we under-invested in
● Cross component dashboarding
● Centralized log analysis
● Distributed tracing
Understanding and Observability
Our current production architecture
Cassandra
Titus Control Plane
● API
● Scheduling
● Job Lifecycle Control
EC2 Autoscaling
Fenzo
container
container
container
docker
Titus Hosts
Virtual Kubelet
Docker
Docker Registry
containercontainerUser Containers
AWS Virtual Machines
Kubernetes API
Server and etcd
Titus System ServicesBatch/Workflow
Systems
Service
CI/CD
Current and Future work
Control-plane responsibilities
- Fleet capacity management
- Cluster ingress / load balancing
- Batch job mgmt
- Service cluster mgmt, autoscaling
- Disruption budgeting
Transitioning to K8s native scheduler
Titus Scheduling and Coordination
Current Soon
Mesos Based Kubernetes Based
Monolithic Distributed
Mostly Netflix Specific Netflix Extensions
Java Golang
Transitioning to “Real” Kubelet
Benefits
● Full user controlled workload composition (pods)
● Networking and storage partners integration
● Off-the shelf open source add-ons
Challenges
● User namespaces aren’t supported (security)
● Pod container start/stop ordering not guaranteed
● Changes in user experience (ssh, metric, etc.)
Exposing the K8s API?
Most Netflix engineers do not need to understand K8s
Platforms and partners
● Abstract the infrastructure layer and use Titus API
● Some can better integrate at the K8s level
Current considerations
● Could lock our users into this API and tools
● Without it, makes tools (kubectl) hard to use by infrastructure partners
● Need for reliable | scalable | operational | agile cluster aggregation
Machine Learning for Infrastructure
Resource Isolation (NUMA, etc.) for latency sensitive workloads
Easier for users + better efficiency
● Actual resource usage and overcommit
● Capacity of services, batch runtime length
● Compute, networking, memory, storage needs
Easier to operate
● Fleet wide capacity management
Q & A Time ...
Where is Carole
Baskin’s Husband?

More Related Content

What's hot

쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)
쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)
쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)충섭 김
 
How AWS Minimizes the Blast Radius of Failures (ARC338) - AWS re:Invent 2018
How AWS Minimizes the Blast Radius of Failures (ARC338) - AWS re:Invent 2018How AWS Minimizes the Blast Radius of Failures (ARC338) - AWS re:Invent 2018
How AWS Minimizes the Blast Radius of Failures (ARC338) - AWS re:Invent 2018Amazon Web Services
 
Running Microservices and Docker on AWS Elastic Beanstalk - August 2016 Month...
Running Microservices and Docker on AWS Elastic Beanstalk - August 2016 Month...Running Microservices and Docker on AWS Elastic Beanstalk - August 2016 Month...
Running Microservices and Docker on AWS Elastic Beanstalk - August 2016 Month...Amazon Web Services
 
Let's build Developer Portal with Backstage
Let's build Developer Portal with BackstageLet's build Developer Portal with Backstage
Let's build Developer Portal with BackstageOpsta
 
Gaming on AWS - 1. AWS로 글로벌 게임 런칭하기 - 장르별 아키텍처 중심
Gaming on AWS - 1. AWS로 글로벌 게임 런칭하기 - 장르별 아키텍처 중심Gaming on AWS - 1. AWS로 글로벌 게임 런칭하기 - 장르별 아키텍처 중심
Gaming on AWS - 1. AWS로 글로벌 게임 런칭하기 - 장르별 아키텍처 중심Amazon Web Services Korea
 
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...Amazon Web Services Korea
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Infrastructure as Code for Beginners
Infrastructure as Code for BeginnersInfrastructure as Code for Beginners
Infrastructure as Code for BeginnersDavid Völkel
 
AWS Enterprise Summit :: 하이브리드 클라우드 인프라를 통한 데이터센터 확장과 마이그레이션 방안 (조성진 매니저)
AWS Enterprise Summit :: 하이브리드 클라우드 인프라를 통한 데이터센터 확장과 마이그레이션 방안 (조성진 매니저)AWS Enterprise Summit :: 하이브리드 클라우드 인프라를 통한 데이터센터 확장과 마이그레이션 방안 (조성진 매니저)
AWS Enterprise Summit :: 하이브리드 클라우드 인프라를 통한 데이터센터 확장과 마이그레이션 방안 (조성진 매니저)Amazon Web Services Korea
 
Kubernetes day 2 Operations
Kubernetes day 2 OperationsKubernetes day 2 Operations
Kubernetes day 2 OperationsPaul Czarkowski
 
Secrets in Kubernetes
Secrets in KubernetesSecrets in Kubernetes
Secrets in KubernetesJerry Jalava
 
Rancher 2.0 Technical Deep Dive
Rancher 2.0 Technical Deep DiveRancher 2.0 Technical Deep Dive
Rancher 2.0 Technical Deep DiveLINE Corporation
 
Migrating the media supply chain to the AWS cloud
Migrating the media supply chain to the AWS cloud Migrating the media supply chain to the AWS cloud
Migrating the media supply chain to the AWS cloud Amazon Web Services
 
(DVO305) Turbocharge YContinuous Deployment Pipeline with Containers
(DVO305) Turbocharge YContinuous Deployment Pipeline with Containers(DVO305) Turbocharge YContinuous Deployment Pipeline with Containers
(DVO305) Turbocharge YContinuous Deployment Pipeline with ContainersAmazon Web Services
 
An Architectural Deep Dive With Kubernetes And Containers Powerpoint Presenta...
An Architectural Deep Dive With Kubernetes And Containers Powerpoint Presenta...An Architectural Deep Dive With Kubernetes And Containers Powerpoint Presenta...
An Architectural Deep Dive With Kubernetes And Containers Powerpoint Presenta...SlideTeam
 
Day 5 - AWS Autoscaling Master Class - The New Capacity Plan
Day 5 - AWS Autoscaling Master Class - The New Capacity PlanDay 5 - AWS Autoscaling Master Class - The New Capacity Plan
Day 5 - AWS Autoscaling Master Class - The New Capacity PlanAmazon Web Services
 
Microservices with NGINX pdf
Microservices with NGINX pdfMicroservices with NGINX pdf
Microservices with NGINX pdfKatherine Bagood
 
Platform Engineering
Platform EngineeringPlatform Engineering
Platform EngineeringOpsta
 

What's hot (20)

쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)
쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)
쿠버네티스를 이용한 기능 브랜치별 테스트 서버 만들기 (GitOps CI/CD)
 
How AWS Minimizes the Blast Radius of Failures (ARC338) - AWS re:Invent 2018
How AWS Minimizes the Blast Radius of Failures (ARC338) - AWS re:Invent 2018How AWS Minimizes the Blast Radius of Failures (ARC338) - AWS re:Invent 2018
How AWS Minimizes the Blast Radius of Failures (ARC338) - AWS re:Invent 2018
 
Running Microservices and Docker on AWS Elastic Beanstalk - August 2016 Month...
Running Microservices and Docker on AWS Elastic Beanstalk - August 2016 Month...Running Microservices and Docker on AWS Elastic Beanstalk - August 2016 Month...
Running Microservices and Docker on AWS Elastic Beanstalk - August 2016 Month...
 
Let's build Developer Portal with Backstage
Let's build Developer Portal with BackstageLet's build Developer Portal with Backstage
Let's build Developer Portal with Backstage
 
Gaming on AWS - 1. AWS로 글로벌 게임 런칭하기 - 장르별 아키텍처 중심
Gaming on AWS - 1. AWS로 글로벌 게임 런칭하기 - 장르별 아키텍처 중심Gaming on AWS - 1. AWS로 글로벌 게임 런칭하기 - 장르별 아키텍처 중심
Gaming on AWS - 1. AWS로 글로벌 게임 런칭하기 - 장르별 아키텍처 중심
 
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...
AWS Transit Gateway를 통한 Multi-VPC 아키텍처 패턴 - 강동환 솔루션즈 아키텍트, AWS :: AWS Summit ...
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Infrastructure as Code for Beginners
Infrastructure as Code for BeginnersInfrastructure as Code for Beginners
Infrastructure as Code for Beginners
 
Openstack swift - VietOpenStack 6thmeeetup
Openstack swift - VietOpenStack 6thmeeetupOpenstack swift - VietOpenStack 6thmeeetup
Openstack swift - VietOpenStack 6thmeeetup
 
AWS Enterprise Summit :: 하이브리드 클라우드 인프라를 통한 데이터센터 확장과 마이그레이션 방안 (조성진 매니저)
AWS Enterprise Summit :: 하이브리드 클라우드 인프라를 통한 데이터센터 확장과 마이그레이션 방안 (조성진 매니저)AWS Enterprise Summit :: 하이브리드 클라우드 인프라를 통한 데이터센터 확장과 마이그레이션 방안 (조성진 매니저)
AWS Enterprise Summit :: 하이브리드 클라우드 인프라를 통한 데이터센터 확장과 마이그레이션 방안 (조성진 매니저)
 
Kubernetes day 2 Operations
Kubernetes day 2 OperationsKubernetes day 2 Operations
Kubernetes day 2 Operations
 
Introducing AWS Fargate
Introducing AWS FargateIntroducing AWS Fargate
Introducing AWS Fargate
 
Secrets in Kubernetes
Secrets in KubernetesSecrets in Kubernetes
Secrets in Kubernetes
 
Rancher 2.0 Technical Deep Dive
Rancher 2.0 Technical Deep DiveRancher 2.0 Technical Deep Dive
Rancher 2.0 Technical Deep Dive
 
Migrating the media supply chain to the AWS cloud
Migrating the media supply chain to the AWS cloud Migrating the media supply chain to the AWS cloud
Migrating the media supply chain to the AWS cloud
 
(DVO305) Turbocharge YContinuous Deployment Pipeline with Containers
(DVO305) Turbocharge YContinuous Deployment Pipeline with Containers(DVO305) Turbocharge YContinuous Deployment Pipeline with Containers
(DVO305) Turbocharge YContinuous Deployment Pipeline with Containers
 
An Architectural Deep Dive With Kubernetes And Containers Powerpoint Presenta...
An Architectural Deep Dive With Kubernetes And Containers Powerpoint Presenta...An Architectural Deep Dive With Kubernetes And Containers Powerpoint Presenta...
An Architectural Deep Dive With Kubernetes And Containers Powerpoint Presenta...
 
Day 5 - AWS Autoscaling Master Class - The New Capacity Plan
Day 5 - AWS Autoscaling Master Class - The New Capacity PlanDay 5 - AWS Autoscaling Master Class - The New Capacity Plan
Day 5 - AWS Autoscaling Master Class - The New Capacity Plan
 
Microservices with NGINX pdf
Microservices with NGINX pdfMicroservices with NGINX pdf
Microservices with NGINX pdf
 
Platform Engineering
Platform EngineeringPlatform Engineering
Platform Engineering
 

Similar to Herding Kats - Netflix’s Journey to Kubernetes Public

Pivotal Container Service (PKS) at SF Cloud Foundry Meetup
Pivotal Container Service (PKS) at SF Cloud Foundry MeetupPivotal Container Service (PKS) at SF Cloud Foundry Meetup
Pivotal Container Service (PKS) at SF Cloud Foundry Meetupcornelia davis
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsSonja Schweigert
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsWeaveworks
 
Google Cloud Fundamentals by CloudZone
Google Cloud Fundamentals by CloudZoneGoogle Cloud Fundamentals by CloudZone
Google Cloud Fundamentals by CloudZoneIdan Tohami
 
OSDC 2018 | Three years running containers with Kubernetes in Production by T...
OSDC 2018 | Three years running containers with Kubernetes in Production by T...OSDC 2018 | Three years running containers with Kubernetes in Production by T...
OSDC 2018 | Three years running containers with Kubernetes in Production by T...NETWAYS
 
Episode 1: Building Kubernetes-as-a-Service
Episode 1: Building Kubernetes-as-a-ServiceEpisode 1: Building Kubernetes-as-a-Service
Episode 1: Building Kubernetes-as-a-ServiceMesosphere Inc.
 
Open shift and docker - october,2014
Open shift and docker - october,2014Open shift and docker - october,2014
Open shift and docker - october,2014Hojoong Kim
 
How Kubernetes helps Devops
How Kubernetes helps DevopsHow Kubernetes helps Devops
How Kubernetes helps DevopsSreenivas Makam
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thingaspyker
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsAll Things Open
 
Pivotal Container Service Overview
Pivotal Container Service Overview Pivotal Container Service Overview
Pivotal Container Service Overview VMware Tanzu
 
DEVNET-1169 CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...
DEVNET-1169	CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...DEVNET-1169	CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...
DEVNET-1169 CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...Cisco DevNet
 
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...tdc-globalcode
 
Continuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event KeynoteContinuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event KeynoteWeaveworks
 
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...Weaveworks
 
Introduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeIntroduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeTerry Wang
 
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...Weaveworks
 
Integration in the Cloud
Integration in the CloudIntegration in the Cloud
Integration in the CloudRob Davies
 

Similar to Herding Kats - Netflix’s Journey to Kubernetes Public (20)

Pivotal Container Service (PKS) at SF Cloud Foundry Meetup
Pivotal Container Service (PKS) at SF Cloud Foundry MeetupPivotal Container Service (PKS) at SF Cloud Foundry Meetup
Pivotal Container Service (PKS) at SF Cloud Foundry Meetup
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
 
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOpsHybrid and Multi-Cloud Strategies for Kubernetes with GitOps
Hybrid and Multi-Cloud Strategies for Kubernetes with GitOps
 
Google Cloud Fundamentals by CloudZone
Google Cloud Fundamentals by CloudZoneGoogle Cloud Fundamentals by CloudZone
Google Cloud Fundamentals by CloudZone
 
OSDC 2018 | Three years running containers with Kubernetes in Production by T...
OSDC 2018 | Three years running containers with Kubernetes in Production by T...OSDC 2018 | Three years running containers with Kubernetes in Production by T...
OSDC 2018 | Three years running containers with Kubernetes in Production by T...
 
Episode 1: Building Kubernetes-as-a-Service
Episode 1: Building Kubernetes-as-a-ServiceEpisode 1: Building Kubernetes-as-a-Service
Episode 1: Building Kubernetes-as-a-Service
 
The rise of microservices
The rise of microservicesThe rise of microservices
The rise of microservices
 
Open shift and docker - october,2014
Open shift and docker - october,2014Open shift and docker - october,2014
Open shift and docker - october,2014
 
How Kubernetes helps Devops
How Kubernetes helps DevopsHow Kubernetes helps Devops
How Kubernetes helps Devops
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thing
 
Netflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger ThingsNetflix and Containers: Not Stranger Things
Netflix and Containers: Not Stranger Things
 
Pivotal Container Service Overview
Pivotal Container Service Overview Pivotal Container Service Overview
Pivotal Container Service Overview
 
DEVNET-1169 CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...
DEVNET-1169	CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...DEVNET-1169	CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...
DEVNET-1169 CI/CT/CD on a Micro Services Applications using Docker, Salt & Ni...
 
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
TDC2017 | São Paulo - Trilha Cloud Computing How we figured out we had a SRE ...
 
Continuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event KeynoteContinuous Lifecycle London 2018 Event Keynote
Continuous Lifecycle London 2018 Event Keynote
 
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...
Achieve Data & Operational Sovereignty: Managing Hybrid & Edge EKS Deployment...
 
Introduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud NativeIntroduction to containers, k8s, Microservices & Cloud Native
Introduction to containers, k8s, Microservices & Cloud Native
 
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...
Overcoming Regulatory & Compliance Hurdles with Hybrid Cloud EKS and Weave Gi...
 
Integration in the Cloud
Integration in the CloudIntegration in the Cloud
Integration in the Cloud
 
Kubermatic.pdf
Kubermatic.pdfKubermatic.pdf
Kubermatic.pdf
 

More from aspyker

Season 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data ScientistsSeason 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data Scientistsaspyker
 
CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2aspyker
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemonsaspyker
 
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, KayentaNetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, Kayentaaspyker
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containersaspyker
 
SRECon Lightning Talk
SRECon Lightning TalkSRECon Lightning Talk
SRECon Lightning Talkaspyker
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Sourceaspyker
 
Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1aspyker
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17aspyker
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4aspyker
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integrationaspyker
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Programaspyker
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflixaspyker
 
Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3aspyker
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016aspyker
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016aspyker
 
Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1aspyker
 
CS80A Foothill College Open Source Talk
CS80A Foothill College Open Source TalkCS80A Foothill College Open Source Talk
CS80A Foothill College Open Source Talkaspyker
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015aspyker
 

More from aspyker (20)

Season 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data ScientistsSeason 7 Episode 1 - Tools for Data Scientists
Season 7 Episode 1 - Tools for Data Scientists
 
CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2CMP376 - Another Week, Another Million Containers on Amazon EC2
CMP376 - Another Week, Another Million Containers on Amazon EC2
 
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and DaemonsQConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
QConSF18 - Disenchantment: Netflix Titus, its Feisty Team, and Daemons
 
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, KayentaNetflixOSS Meetup S6E2 - Spinnaker, Kayenta
NetflixOSS Meetup S6E2 - Spinnaker, Kayenta
 
NetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & ContainersNetflixOSS Meetup S6E1 - Titus & Containers
NetflixOSS Meetup S6E1 - Titus & Containers
 
SRECon Lightning Talk
SRECon Lightning TalkSRECon Lightning Talk
SRECon Lightning Talk
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1Netflix OSS Meetup Season 5 Episode 1
Netflix OSS Meetup Season 5 Episode 1
 
Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4
 
Re:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS IntegrationRe:invent 2016 Container Scheduling, Execution and AWS Integration
Re:invent 2016 Container Scheduling, Execution and AWS Integration
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Program
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
 
Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3Netflix Open Source Meetup Season 4 Episode 3
Netflix Open Source Meetup Season 4 Episode 3
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016Netflix Container Runtime - Titus - for Container Camp 2016
Netflix Container Runtime - Titus - for Container Camp 2016
 
Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1Netflix Open Source Meetup Season 4 Episode 1
Netflix Open Source Meetup Season 4 Episode 1
 
CS80A Foothill College Open Source Talk
CS80A Foothill College Open Source TalkCS80A Foothill College Open Source Talk
CS80A Foothill College Open Source Talk
 
Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015Triangle Devops Meetup 10/2015
Triangle Devops Meetup 10/2015
 

Recently uploaded

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Recently uploaded (20)

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Herding Kats - Netflix’s Journey to Kubernetes Public

  • 1. Herding Kats - Netflix’s Kubernetes Journey Andrew Spyker - @aspyker 10/27/2020
  • 2. Hint: Don’t know him? Search: “Tiger King”
  • 3. About the Speaker (@aspyker) ● Manager, Compute Platform ○ OS Engineering (OS, Kernel, Images) ○ Container Management Platform (Titus) ○ Image Baking and Registry ○ 20 engineers who design, implement, operate ● Part of Cloud Infrastructure Engineering ○ Compute, Networking, Reliability, Performance, Traffic/Capacity ● Part of Platform Engineering ○ Infrastructure, Productivity, Data Platform, TPM ○ Supports all of Product Engineering Cloud Infrastructure Engineering
  • 4. Batch first, added services later with focus on developer productivity - Simpler compute management, deployment packaging, local dev As organic adoption grew, focused on scale and reliability - First very large scale users - Flink, Media Encoding - Netflix Streaming critical services onboarded Focus on efficiency - largest “cluster” in Netflix fleet - First to make Titus cheaper, later focus on workloads Titus and containers become supported first class (as much as VMs) Netflix Container Management History (Titus) 2015 2016 2019 2018
  • 5. Design Principles of Titus ● Runs any “Docker” container (IaaS, not PaaS) ○ Deep integration with existing Netflix platform services ● AWS workloads work unchanged ○ Not trying to abstract AWS EC2 away ○ Deep AWS Collaboration, making containers work well on EC2 ● Just what Netflix needs ○ Not extendable to all vendors/companies’ needs
  • 6. What runs on Titus today The Netflix Streaming Product Open Connect CDN monitoring/planning Machine learning and recommendations Workloads that power our Studios and Creatives Content Engineering (Encoding, Artwork personalization, etc.) Big Data platform (Presto, Notebooks, ETL, etc.)
  • 7. We started with Mesos and then moved on Mesos in 2015? ● Used at Netflix in other systems ● Already proven at scale for general use (Twitter, Apple, etc.) ● Just enough model, very adaptable to our needs ● Worked very well! NOT Mesos in 2019? ● Web Scale users moving away ● “Northbound” agent to control plane communication “Easy” to move as we don’t leverage Mesos features
  • 8. Goal: remove Mesos … bringing all existing workloads along seamlessly
  • 9. Existing Stack Agent Agent Existing Agents Mesos agent Titus container runtime User ContainersUser ContainersUser Containers Titus API (Federation) Titus Scheduling & Coordination Mesos etcd Agent Agent Existing Agents Virtual Kubelet Titus container runtime User ContainersUser ContainersUser Containers Titus Scheduling & Coordination API Server Kubernetes Powered Stack Move workloads over time
  • 10. Rollout timeframe Q1 2019 Q2 Q3 Q4 Q1 2020 Q2 Q3 Design Implementation and test env rollout Production rollout Major scale-up Multiple production issues Reliability Focus Mesos turned off
  • 12. Foundational change to our production systems during unprecedented time Already partially deployed by the start of shelter in place All team members impacted by world events across 2020 A story of learning and eventual success A story of success
  • 13. Kinds of failures we saw Scale ● API Server tuning for timeouts and caching for list and watch Distributed ● Load balancing for web hooks (timeouts, network security) ● Overloading etcd with custom CRDs Other “normal” challenges in running new infrastructure
  • 14. Time to Recovery >> Root Cause Recovery time was much longer than our previous system Why? ● Didn’t know the new system as well as old system ● It takes real incidents to learn all areas of a new system ● Recency bias in diagnosis and remediation actions Actions taken ● Implement recovery escape valves, use first, then diagnose ● Game days including breaking our systems
  • 15. From Mesos architecture - “two monoliths” ⮑ Kubernetes architecture - “distributed system” Likely rushed the productization of this change Realized we under-invested in ● Cross component dashboarding ● Centralized log analysis ● Distributed tracing Understanding and Observability
  • 16. Our current production architecture Cassandra Titus Control Plane ● API ● Scheduling ● Job Lifecycle Control EC2 Autoscaling Fenzo container container container docker Titus Hosts Virtual Kubelet Docker Docker Registry containercontainerUser Containers AWS Virtual Machines Kubernetes API Server and etcd Titus System ServicesBatch/Workflow Systems Service CI/CD
  • 18. Control-plane responsibilities - Fleet capacity management - Cluster ingress / load balancing - Batch job mgmt - Service cluster mgmt, autoscaling - Disruption budgeting Transitioning to K8s native scheduler Titus Scheduling and Coordination Current Soon Mesos Based Kubernetes Based Monolithic Distributed Mostly Netflix Specific Netflix Extensions Java Golang
  • 19. Transitioning to “Real” Kubelet Benefits ● Full user controlled workload composition (pods) ● Networking and storage partners integration ● Off-the shelf open source add-ons Challenges ● User namespaces aren’t supported (security) ● Pod container start/stop ordering not guaranteed ● Changes in user experience (ssh, metric, etc.)
  • 20. Exposing the K8s API? Most Netflix engineers do not need to understand K8s Platforms and partners ● Abstract the infrastructure layer and use Titus API ● Some can better integrate at the K8s level Current considerations ● Could lock our users into this API and tools ● Without it, makes tools (kubectl) hard to use by infrastructure partners ● Need for reliable | scalable | operational | agile cluster aggregation
  • 21. Machine Learning for Infrastructure Resource Isolation (NUMA, etc.) for latency sensitive workloads Easier for users + better efficiency ● Actual resource usage and overcommit ● Capacity of services, batch runtime length ● Compute, networking, memory, storage needs Easier to operate ● Fleet wide capacity management
  • 22. Q & A Time ... Where is Carole Baskin’s Husband?