SlideShare a Scribd company logo
1 of 35
Download to read offline
Confidential + Proprietary
Task Migration at Scale Using CRIU
Linux Plumbers Conference 2018
Victor Marmol Andy Tucker
vmarmol@google.com agtucker@google.com
2018-11-15
Confidential + Proprietary
Who we are
Outside of Google, we’ve worked on open source cluster management and
containers
2
lmcty
Confidential + Proprietary
Who we are
Inside Google: we’re part of the Borg team
● Manages all compute jobs
● Runs on every server
Images by Connie Zhou3
Confidential + Proprietary
What is Borg?
Google’s cluster management system
● Borgmaster: Cluster control and main API
entrypoint
● Borglet: On-machine management daemon
● Suite of tools and UIs for managing jobs
● Many purpose-built platforms created on top of
Borg
● Everything runs on Borg and everything runs in
containers
4
Confidential + Proprietary
Borg Machine
Borg basics
Base compute primitive: Task
● A priority signals how quickly a task should
schedule
● It’s appclass describes a task as either
serving (latency sensitive) or batch
● Static content/binaries provided by
packages
● A container isolates a task’s resources
● Native Linux processes
● Share an IP with the machine, ports
are allocated for each task
Task
Container
Processes Packages
Allocated Ports
5
Processes
Processes
Processes
Packages
Packages
Packages
Confidential + Proprietary
Borg basics: evictions
When a task is forcefully terminated by Borg
● Typically receive a notification: 1-5min
● Our SLO allows for quite a few evictions
● Applications must handle them
Reasons for evictions
● Preemption: a higher priority task needs the resources
● Software upgrades (e.g.: kernel, firmware)
● Re-balancing for availability or performance
6
Confidential + Proprietary
Evictions are impactful and hard to handle
Technical Complexity
● Handling evictions requires state management
○ How and what state to serialize and where to store it
● Application-specific and not very reusable
Lost Compute
● Batch jobs run at lower priorities and get preempted often
● Even platforms that handle them for users, don’t do a great job
Task 0 Shard 1 Task 2 Task 3 Task 4Evicted!
Compute is lost... 7
Confidential + Proprietary
Migrations to avoid evictions
Transparently replace evictions with migration
Native task migration offering in Borg
● Borg controls the eviction → always knows when to migrate
● Native management of state allows reuse for all workloads
Various possible mechanisms
● Checkpoint/restore
○ Pause application, transfer state, resume
○ Long blackout period, no brownout
● Live
○ Very short blackout, but with a longer brownout
○ Very low impact to applications
8
Confidential + Proprietary
Challenges with task migration
Migrating network connections
Port collisions and port use
Storage migration is slow
Must virtualize machine-local resources
Linux process state hard to migrate
9
Confidential + Proprietary
Challenges with task migration
Migrating network connections
Port collisions and port use
Storage migration is slow
Must virtualize machine-local resources
Linux process state hard to migrate
Little to no local storage
Linux namespaces
CRIU!
NET namespaces and IPv6 per-container
Drop the connection, user handles reconnections
10
Confidential + Proprietary
Migration
Workflow
11
Confidential + Proprietary
Machine A
Task Migration
Checkpoint/Restore
12
Confidential + Proprietary
Machine A
Task
Isolated Task Environment
● Linux namespaces
● Little local storage
● IPv6
● Google libraries (e.g.: Stubby/gRPC)
13
Confidential + Proprietary
Machine A
Task
Checkpoint
● Pause task
14
Confidential + Proprietary
Machine A
Checkpoint
● Pause task
● Serialize stateTask
15
Confidential + Proprietary
Machine A
Checkpoint
● Pause task
● Serialize state
● Upload to distributed storage
Task
Colossus
16
Confidential + Proprietary
Machine A
Colossus
17
Confidential + Proprietary
Colossus
Migration
● Borgmaster chooses new machine to
schedule the task.
Machine B
18
Confidential + Proprietary
Colossus
Machine B
Restore
● Download from distributed storage
Task
19
Confidential + Proprietary
Machine B
Restore
● Download from distributed storage
● Deserialize state Task
20
Confidential + Proprietary
Machine B
Restore
● Download from distributed storage
● Deserialize state
● Continue running task
Task
21
Confidential + Proprietary
Machine B
Isolated Task Environment
● Machine is opaque to the task
● Your local data travels with the task
● Your IP changes
● Google libraries re-establish
connections
Task
22
Confidential + Proprietary
Task
IPv6 + NET namespace
Networking
Networking @ Google
● Standardized RPC implementation: Stubby/gRPC
● Nearly all communication is RPC
● Unique IPv6 address per task
● BNS: Borg DNS, used by RPC layer
Task Migration
● Stubby/gRPC automatically reconnects
● Reconnect is transparent to users
● IP address changes, but this is rarely a problem
23
Network
Process
Stubby/gRPC Library
Confidential + Proprietary
Storage
Storage @ Google: Minimized local storage
● Most tasks are stateless, few require local
SSD/HDD
● Those that require state use our remote storage
stacks (e.g.: Colossus, Spanner)
● Small local storage is offered via tmpfs
Task migration
● Lack of local storage greatly simplifies work
● Remote storage stacks use RPC and thus recover
gracefully
● Small local storage is migrated with task
24
Task
Colossus
Process
Stubby/gRPC Library
Spanner
Tmpfs scratch space
PD
Confidential + Proprietary
Task environment
Container
● Primarily used for resource isolation
● Full namespaces applied
Security
● Root is not mapped into user namespace
● Capabilities are strictly limited
Root filesystem
● Separate from the host machine’s
● Built and bundled by the task as a package
25
Task
Container
full cgroups + full namespaces
Process
no root + limited caps
Isolated Filesystem
Confidential + Proprietary
CRIU
Checkpoint/Restore in User Space
● Used to serialize/deserialize the task’s process
Security and isolation
● Run inside a task’s container
● Run with minimal privileges
The Migrator
● Injected into task during a migration, orchestrates
the migration
● Manages execution of CRIU
● Encrypts and compresses checkpoint on the fly
○ Pretends to be a CRIU pageserver 26
Task
Container
Borglet [root]
Process
Migrator [root]
CRIU
Colossus
Process [user]
CRIU [user]
Confidential + Proprietary
In practice today
Migrations take 1-2min and succeed 90%+ of the time
Where the time goes
● Checkpoint/restore is relatively fast for well-behaved tasks
● Writing/reading to remote storage dominates checkpoint/restore
● Scheduling delays are also a large source of latency
Causes of failures
● Timeouts from high task resource usage (e.g.: threads, memory)
● Different host environments
● Misc failures in serialization (e.g.: unsupported features)
27
Confidential + Proprietary
Users
Works well for batch jobs
● Latency tolerant, longer-running, and lower priority
● Some are highly sharded and see many evictions
● Long pipelines suffer when some parts are evicted
User feedback
● They love it! Super simple to adopt
● Desire for advanced features
○ Migration notifications
○ User-controlled pause/resume
Not a great offering for latency-sensitive jobs
28
1s
10s
1m
3m
10m
Batch/LTLS
Confidential + Proprietary
Adoption challenges
Handling connection failures
● In theory: users are taught to expect failures
● In practice: users don’t handle failures well
○ Expect them not to occur and reset their state when they do
Isolating task environment
● Users make assumptions about the underlying host
○ Services are available via localhost
○ Expecting host:port to work
● Users don’t expect the underlying host to change at runtime
○ Certain features detected at startup and never refreshed (e.g.: kernel, CPU, location)
29
Confidential + Proprietary
Experience with CRIU
In one word? AMAZING!
● Mostly worked out of the box with few changes
● Reliability and performance have been great in production
● Community has been helpful and quick to fix issues
Our changes
● Performance improvements for checkpoint/restore
● Increasing/improving some limitations (see next slide)
● Most patches sent upstream
30
Confidential + Proprietary
CRIU security
CRIU suggested to run as root
● Security auditing found a series of bugs
● A malicious task can hijack a CRIU process
Recommendation
● Run CRIU as the task’s user
● Run in user namespace without root mapped in
● Trim privileges to minimal set
31
Task
Container
Borglet [root]
Process [user]
Migrator [root]
CRIU [user]
Confidential + Proprietary
What could do with improvements
Performance
● Some expensive operations remain, some have kernel limitations
○ e.g.: waitpid on all threads is O(n2
)
Security
● Reducing need for root and elevated capabilities
● Not well tested in this setup
Misc
● Contributing patches back is a bit hard
32
Confidential + Proprietary
What could do with improvements
Live migration
● Parts of incremental restore are very, very difficult
● Lots of work ahead to do the type of brownout used in VM live migration today
Handling time
● Hard to abstract away many of the time HW counters
● Time namespaces to the rescue?
33
Confidential + Proprietary
Future work
Increasing adoption internally
● Reduce lost compute and simplify user tasks
● Targeting on-by-default for large batch workloads
Machine-to-machine migration
● Skip the distributed storage of the checkpoint
● Reduces migration times to ~30s
Live migration
● Able to address latency sensitive workloads
● Will require some work in our stack and in CRIU
34
Confidential + Proprietary
Questions?
Native task migration offering in Borg
● Reduces compute lost to evictions
● Simplifies task handling of preemptions
● Addresses most batch workloads
● Serving workloads need live migration
CRIU
● Works amazingly well out of the box
● Security an area of investment
● We are excited about and look forward to live migration!
Victor Marmol
vmarmol@google.com
Andy Tucker
agtucker@google.com
35

More Related Content

What's hot

홍성우, 게임 서버의 목차 - 시작부터 출시까지, NDC2019
홍성우, 게임 서버의 목차 - 시작부터 출시까지, NDC2019홍성우, 게임 서버의 목차 - 시작부터 출시까지, NDC2019
홍성우, 게임 서버의 목차 - 시작부터 출시까지, NDC2019devCAT Studio, NEXON
 
게임서버 구축 방법비교 : GBaaS vs. Self-hosting
게임서버 구축 방법비교 : GBaaS vs. Self-hosting게임서버 구축 방법비교 : GBaaS vs. Self-hosting
게임서버 구축 방법비교 : GBaaS vs. Self-hostingiFunFactory Inc.
 
카프카, 산전수전 노하우
카프카, 산전수전 노하우카프카, 산전수전 노하우
카프카, 산전수전 노하우if kakao
 
라이브 서비스를 위한 게임 서버 구성
라이브 서비스를 위한 게임 서버 구성라이브 서비스를 위한 게임 서버 구성
라이브 서비스를 위한 게임 서버 구성Hyunjik Bae
 
High Availability on pfSense 2.4 - pfSense Hangout March 2017
High Availability on pfSense 2.4 - pfSense Hangout March 2017High Availability on pfSense 2.4 - pfSense Hangout March 2017
High Availability on pfSense 2.4 - pfSense Hangout March 2017Netgate
 
Quic을 이용한 네트워크 성능 개선
 Quic을 이용한 네트워크 성능 개선 Quic을 이용한 네트워크 성능 개선
Quic을 이용한 네트워크 성능 개선NAVER D2
 
containerd the universal container runtime
containerd the universal container runtimecontainerd the universal container runtime
containerd the universal container runtimeDocker, Inc.
 
게임서버프로그래밍 #8 - 성능 평가
게임서버프로그래밍 #8 - 성능 평가게임서버프로그래밍 #8 - 성능 평가
게임서버프로그래밍 #8 - 성능 평가Seungmo Koo
 
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지Changje Jeong
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안SANG WON PARK
 
온라인 게임 처음부터 끝까지 동적언어로 만들기
온라인 게임 처음부터 끝까지 동적언어로 만들기온라인 게임 처음부터 끝까지 동적언어로 만들기
온라인 게임 처음부터 끝까지 동적언어로 만들기Seungjae Lee
 
仮想マシンにおけるメモリ管理
仮想マシンにおけるメモリ管理仮想マシンにおけるメモリ管理
仮想マシンにおけるメモリ管理Akari Asai
 
오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개Hyoungjun Kim
 
Issues of OpenStack multi-region mode
Issues of OpenStack multi-region modeIssues of OpenStack multi-region mode
Issues of OpenStack multi-region modeJoe Huang
 
게임서버프로그래밍 #7 - 패킷핸들링 및 암호화
게임서버프로그래밍 #7 - 패킷핸들링 및 암호화게임서버프로그래밍 #7 - 패킷핸들링 및 암호화
게임서버프로그래밍 #7 - 패킷핸들링 및 암호화Seungmo Koo
 
오딘: 발할라 라이징 MMORPG의 성능 최적화 사례 공유 [카카오게임즈 - 레벨 300] - 발표자: 김문권, 팀장, 라이온하트 스튜디오...
오딘: 발할라 라이징 MMORPG의 성능 최적화 사례 공유 [카카오게임즈 - 레벨 300] - 발표자: 김문권, 팀장, 라이온하트 스튜디오...오딘: 발할라 라이징 MMORPG의 성능 최적화 사례 공유 [카카오게임즈 - 레벨 300] - 발표자: 김문권, 팀장, 라이온하트 스튜디오...
오딘: 발할라 라이징 MMORPG의 성능 최적화 사례 공유 [카카오게임즈 - 레벨 300] - 발표자: 김문권, 팀장, 라이온하트 스튜디오...Amazon Web Services Korea
 
Kubernetes Internals
Kubernetes InternalsKubernetes Internals
Kubernetes InternalsShimi Bandiel
 

What's hot (20)

홍성우, 게임 서버의 목차 - 시작부터 출시까지, NDC2019
홍성우, 게임 서버의 목차 - 시작부터 출시까지, NDC2019홍성우, 게임 서버의 목차 - 시작부터 출시까지, NDC2019
홍성우, 게임 서버의 목차 - 시작부터 출시까지, NDC2019
 
게임서버 구축 방법비교 : GBaaS vs. Self-hosting
게임서버 구축 방법비교 : GBaaS vs. Self-hosting게임서버 구축 방법비교 : GBaaS vs. Self-hosting
게임서버 구축 방법비교 : GBaaS vs. Self-hosting
 
카프카, 산전수전 노하우
카프카, 산전수전 노하우카프카, 산전수전 노하우
카프카, 산전수전 노하우
 
라이브 서비스를 위한 게임 서버 구성
라이브 서비스를 위한 게임 서버 구성라이브 서비스를 위한 게임 서버 구성
라이브 서비스를 위한 게임 서버 구성
 
High Availability on pfSense 2.4 - pfSense Hangout March 2017
High Availability on pfSense 2.4 - pfSense Hangout March 2017High Availability on pfSense 2.4 - pfSense Hangout March 2017
High Availability on pfSense 2.4 - pfSense Hangout March 2017
 
Quic을 이용한 네트워크 성능 개선
 Quic을 이용한 네트워크 성능 개선 Quic을 이용한 네트워크 성능 개선
Quic을 이용한 네트워크 성능 개선
 
containerd the universal container runtime
containerd the universal container runtimecontainerd the universal container runtime
containerd the universal container runtime
 
게임서버프로그래밍 #8 - 성능 평가
게임서버프로그래밍 #8 - 성능 평가게임서버프로그래밍 #8 - 성능 평가
게임서버프로그래밍 #8 - 성능 평가
 
KAFKA 3.1.0.pdf
KAFKA 3.1.0.pdfKAFKA 3.1.0.pdf
KAFKA 3.1.0.pdf
 
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
AWS를 활용하여 Daily Report 만들기 : 로그 수집부터 자동화된 분석까지
 
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
Apache kafka 모니터링을 위한 Metrics 이해 및 최적화 방안
 
온라인 게임 처음부터 끝까지 동적언어로 만들기
온라인 게임 처음부터 끝까지 동적언어로 만들기온라인 게임 처음부터 끝까지 동적언어로 만들기
온라인 게임 처음부터 끝까지 동적언어로 만들기
 
仮想マシンにおけるメモリ管理
仮想マシンにおけるメモリ管理仮想マシンにおけるメモリ管理
仮想マシンにおけるメモリ管理
 
오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개오픈소스 프로젝트 따라잡기_공개
오픈소스 프로젝트 따라잡기_공개
 
Issues of OpenStack multi-region mode
Issues of OpenStack multi-region modeIssues of OpenStack multi-region mode
Issues of OpenStack multi-region mode
 
Open Match Deep Dive
Open Match Deep DiveOpen Match Deep Dive
Open Match Deep Dive
 
게임서버프로그래밍 #7 - 패킷핸들링 및 암호화
게임서버프로그래밍 #7 - 패킷핸들링 및 암호화게임서버프로그래밍 #7 - 패킷핸들링 및 암호화
게임서버프로그래밍 #7 - 패킷핸들링 및 암호화
 
오딘: 발할라 라이징 MMORPG의 성능 최적화 사례 공유 [카카오게임즈 - 레벨 300] - 발표자: 김문권, 팀장, 라이온하트 스튜디오...
오딘: 발할라 라이징 MMORPG의 성능 최적화 사례 공유 [카카오게임즈 - 레벨 300] - 발표자: 김문권, 팀장, 라이온하트 스튜디오...오딘: 발할라 라이징 MMORPG의 성능 최적화 사례 공유 [카카오게임즈 - 레벨 300] - 발표자: 김문권, 팀장, 라이온하트 스튜디오...
오딘: 발할라 라이징 MMORPG의 성능 최적화 사례 공유 [카카오게임즈 - 레벨 300] - 발표자: 김문권, 팀장, 라이온하트 스튜디오...
 
Microservices at Mercari
Microservices at MercariMicroservices at Mercari
Microservices at Mercari
 
Kubernetes Internals
Kubernetes InternalsKubernetes Internals
Kubernetes Internals
 

Similar to Task migration using CRIU

Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17aspyker
 
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]Noam Elfanbaum
 
Server fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil AhujaServer fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil Ahujacamunda services GmbH
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthNicolas Brousse
 
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...peknap
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixDatabricks
 
Continuous Deployment Applied at MyHeritage
Continuous Deployment Applied at MyHeritageContinuous Deployment Applied at MyHeritage
Continuous Deployment Applied at MyHeritageRan Levy
 
The Road to Kubernetes
The Road to KubernetesThe Road to Kubernetes
The Road to KubernetesDeniz Zoeteman
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kevin Lynch
 
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationHKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationLinaro
 
Accumulo Summit Keynote 2018
Accumulo Summit Keynote 2018Accumulo Summit Keynote 2018
Accumulo Summit Keynote 2018Accumulo Summit
 
Building a Small Datacenter
Building a Small DatacenterBuilding a Small Datacenter
Building a Small Datacenterssuser4b98f0
 
Building a Small DC
Building a Small DCBuilding a Small DC
Building a Small DCAPNIC
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016aspyker
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Sharma Podila
 
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesMulti-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesLINE Corporation
 
Webinar: Building a multi-cloud Kubernetes storage on GitLab
Webinar: Building a multi-cloud Kubernetes storage on GitLabWebinar: Building a multi-cloud Kubernetes storage on GitLab
Webinar: Building a multi-cloud Kubernetes storage on GitLabMayaData Inc
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3LibbySchulze
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAnthony Scata
 

Similar to Task migration using CRIU (20)

Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17Series of Unfortunate Netflix Container Events - QConNYC17
Series of Unfortunate Netflix Container Events - QConNYC17
 
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
Bootstrapping a ML platform at Bluevine [Airflow Summit 2020]
 
Server fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil AhujaServer fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil Ahuja
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
Solving Real-Time Scheduling Problems With RT_PREEMPT and Deadline-Based Sche...
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at Netflix
 
OpenFlow @ Google
OpenFlow @ GoogleOpenFlow @ Google
OpenFlow @ Google
 
Continuous Deployment Applied at MyHeritage
Continuous Deployment Applied at MyHeritageContinuous Deployment Applied at MyHeritage
Continuous Deployment Applied at MyHeritage
 
The Road to Kubernetes
The Road to KubernetesThe Road to Kubernetes
The Road to Kubernetes
 
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
 
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolationHKG15-305: Real Time processing comparing the RT patch vs Core isolation
HKG15-305: Real Time processing comparing the RT patch vs Core isolation
 
Accumulo Summit Keynote 2018
Accumulo Summit Keynote 2018Accumulo Summit Keynote 2018
Accumulo Summit Keynote 2018
 
Building a Small Datacenter
Building a Small DatacenterBuilding a Small Datacenter
Building a Small Datacenter
 
Building a Small DC
Building a Small DCBuilding a Small DC
Building a Small DC
 
Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016Netflix Container Scheduling and Execution - QCon New York 2016
Netflix Container Scheduling and Execution - QCon New York 2016
 
Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016Scheduling a fuller house - Talk at QCon NY 2016
Scheduling a fuller house - Talk at QCon NY 2016
 
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messagesMulti-Tenancy Kafka cluster for LINE services with 250 billion daily messages
Multi-Tenancy Kafka cluster for LINE services with 250 billion daily messages
 
Webinar: Building a multi-cloud Kubernetes storage on GitLab
Webinar: Building a multi-cloud Kubernetes storage on GitLabWebinar: Building a multi-cloud Kubernetes storage on GitLab
Webinar: Building a multi-cloud Kubernetes storage on GitLab
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runners
 

More from Rohit Jnagal

Native container monitoring
Native container monitoringNative container monitoring
Native container monitoringRohit Jnagal
 
Kubernetes intro public - kubernetes meetup 4-21-2015
Kubernetes intro   public - kubernetes meetup 4-21-2015Kubernetes intro   public - kubernetes meetup 4-21-2015
Kubernetes intro public - kubernetes meetup 4-21-2015Rohit Jnagal
 

More from Rohit Jnagal (6)

Cat @ scale
Cat @ scaleCat @ scale
Cat @ scale
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoring
 
Kubernetes intro public - kubernetes meetup 4-21-2015
Kubernetes intro   public - kubernetes meetup 4-21-2015Kubernetes intro   public - kubernetes meetup 4-21-2015
Kubernetes intro public - kubernetes meetup 4-21-2015
 
Docker n co
Docker n coDocker n co
Docker n co
 
Docker Overview
Docker OverviewDocker Overview
Docker Overview
 
Docker internals
Docker internalsDocker internals
Docker internals
 

Recently uploaded

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Recently uploaded (20)

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Task migration using CRIU

  • 1. Confidential + Proprietary Task Migration at Scale Using CRIU Linux Plumbers Conference 2018 Victor Marmol Andy Tucker vmarmol@google.com agtucker@google.com 2018-11-15
  • 2. Confidential + Proprietary Who we are Outside of Google, we’ve worked on open source cluster management and containers 2 lmcty
  • 3. Confidential + Proprietary Who we are Inside Google: we’re part of the Borg team ● Manages all compute jobs ● Runs on every server Images by Connie Zhou3
  • 4. Confidential + Proprietary What is Borg? Google’s cluster management system ● Borgmaster: Cluster control and main API entrypoint ● Borglet: On-machine management daemon ● Suite of tools and UIs for managing jobs ● Many purpose-built platforms created on top of Borg ● Everything runs on Borg and everything runs in containers 4
  • 5. Confidential + Proprietary Borg Machine Borg basics Base compute primitive: Task ● A priority signals how quickly a task should schedule ● It’s appclass describes a task as either serving (latency sensitive) or batch ● Static content/binaries provided by packages ● A container isolates a task’s resources ● Native Linux processes ● Share an IP with the machine, ports are allocated for each task Task Container Processes Packages Allocated Ports 5 Processes Processes Processes Packages Packages Packages
  • 6. Confidential + Proprietary Borg basics: evictions When a task is forcefully terminated by Borg ● Typically receive a notification: 1-5min ● Our SLO allows for quite a few evictions ● Applications must handle them Reasons for evictions ● Preemption: a higher priority task needs the resources ● Software upgrades (e.g.: kernel, firmware) ● Re-balancing for availability or performance 6
  • 7. Confidential + Proprietary Evictions are impactful and hard to handle Technical Complexity ● Handling evictions requires state management ○ How and what state to serialize and where to store it ● Application-specific and not very reusable Lost Compute ● Batch jobs run at lower priorities and get preempted often ● Even platforms that handle them for users, don’t do a great job Task 0 Shard 1 Task 2 Task 3 Task 4Evicted! Compute is lost... 7
  • 8. Confidential + Proprietary Migrations to avoid evictions Transparently replace evictions with migration Native task migration offering in Borg ● Borg controls the eviction → always knows when to migrate ● Native management of state allows reuse for all workloads Various possible mechanisms ● Checkpoint/restore ○ Pause application, transfer state, resume ○ Long blackout period, no brownout ● Live ○ Very short blackout, but with a longer brownout ○ Very low impact to applications 8
  • 9. Confidential + Proprietary Challenges with task migration Migrating network connections Port collisions and port use Storage migration is slow Must virtualize machine-local resources Linux process state hard to migrate 9
  • 10. Confidential + Proprietary Challenges with task migration Migrating network connections Port collisions and port use Storage migration is slow Must virtualize machine-local resources Linux process state hard to migrate Little to no local storage Linux namespaces CRIU! NET namespaces and IPv6 per-container Drop the connection, user handles reconnections 10
  • 12. Confidential + Proprietary Machine A Task Migration Checkpoint/Restore 12
  • 13. Confidential + Proprietary Machine A Task Isolated Task Environment ● Linux namespaces ● Little local storage ● IPv6 ● Google libraries (e.g.: Stubby/gRPC) 13
  • 14. Confidential + Proprietary Machine A Task Checkpoint ● Pause task 14
  • 15. Confidential + Proprietary Machine A Checkpoint ● Pause task ● Serialize stateTask 15
  • 16. Confidential + Proprietary Machine A Checkpoint ● Pause task ● Serialize state ● Upload to distributed storage Task Colossus 16
  • 18. Confidential + Proprietary Colossus Migration ● Borgmaster chooses new machine to schedule the task. Machine B 18
  • 19. Confidential + Proprietary Colossus Machine B Restore ● Download from distributed storage Task 19
  • 20. Confidential + Proprietary Machine B Restore ● Download from distributed storage ● Deserialize state Task 20
  • 21. Confidential + Proprietary Machine B Restore ● Download from distributed storage ● Deserialize state ● Continue running task Task 21
  • 22. Confidential + Proprietary Machine B Isolated Task Environment ● Machine is opaque to the task ● Your local data travels with the task ● Your IP changes ● Google libraries re-establish connections Task 22
  • 23. Confidential + Proprietary Task IPv6 + NET namespace Networking Networking @ Google ● Standardized RPC implementation: Stubby/gRPC ● Nearly all communication is RPC ● Unique IPv6 address per task ● BNS: Borg DNS, used by RPC layer Task Migration ● Stubby/gRPC automatically reconnects ● Reconnect is transparent to users ● IP address changes, but this is rarely a problem 23 Network Process Stubby/gRPC Library
  • 24. Confidential + Proprietary Storage Storage @ Google: Minimized local storage ● Most tasks are stateless, few require local SSD/HDD ● Those that require state use our remote storage stacks (e.g.: Colossus, Spanner) ● Small local storage is offered via tmpfs Task migration ● Lack of local storage greatly simplifies work ● Remote storage stacks use RPC and thus recover gracefully ● Small local storage is migrated with task 24 Task Colossus Process Stubby/gRPC Library Spanner Tmpfs scratch space PD
  • 25. Confidential + Proprietary Task environment Container ● Primarily used for resource isolation ● Full namespaces applied Security ● Root is not mapped into user namespace ● Capabilities are strictly limited Root filesystem ● Separate from the host machine’s ● Built and bundled by the task as a package 25 Task Container full cgroups + full namespaces Process no root + limited caps Isolated Filesystem
  • 26. Confidential + Proprietary CRIU Checkpoint/Restore in User Space ● Used to serialize/deserialize the task’s process Security and isolation ● Run inside a task’s container ● Run with minimal privileges The Migrator ● Injected into task during a migration, orchestrates the migration ● Manages execution of CRIU ● Encrypts and compresses checkpoint on the fly ○ Pretends to be a CRIU pageserver 26 Task Container Borglet [root] Process Migrator [root] CRIU Colossus Process [user] CRIU [user]
  • 27. Confidential + Proprietary In practice today Migrations take 1-2min and succeed 90%+ of the time Where the time goes ● Checkpoint/restore is relatively fast for well-behaved tasks ● Writing/reading to remote storage dominates checkpoint/restore ● Scheduling delays are also a large source of latency Causes of failures ● Timeouts from high task resource usage (e.g.: threads, memory) ● Different host environments ● Misc failures in serialization (e.g.: unsupported features) 27
  • 28. Confidential + Proprietary Users Works well for batch jobs ● Latency tolerant, longer-running, and lower priority ● Some are highly sharded and see many evictions ● Long pipelines suffer when some parts are evicted User feedback ● They love it! Super simple to adopt ● Desire for advanced features ○ Migration notifications ○ User-controlled pause/resume Not a great offering for latency-sensitive jobs 28 1s 10s 1m 3m 10m Batch/LTLS
  • 29. Confidential + Proprietary Adoption challenges Handling connection failures ● In theory: users are taught to expect failures ● In practice: users don’t handle failures well ○ Expect them not to occur and reset their state when they do Isolating task environment ● Users make assumptions about the underlying host ○ Services are available via localhost ○ Expecting host:port to work ● Users don’t expect the underlying host to change at runtime ○ Certain features detected at startup and never refreshed (e.g.: kernel, CPU, location) 29
  • 30. Confidential + Proprietary Experience with CRIU In one word? AMAZING! ● Mostly worked out of the box with few changes ● Reliability and performance have been great in production ● Community has been helpful and quick to fix issues Our changes ● Performance improvements for checkpoint/restore ● Increasing/improving some limitations (see next slide) ● Most patches sent upstream 30
  • 31. Confidential + Proprietary CRIU security CRIU suggested to run as root ● Security auditing found a series of bugs ● A malicious task can hijack a CRIU process Recommendation ● Run CRIU as the task’s user ● Run in user namespace without root mapped in ● Trim privileges to minimal set 31 Task Container Borglet [root] Process [user] Migrator [root] CRIU [user]
  • 32. Confidential + Proprietary What could do with improvements Performance ● Some expensive operations remain, some have kernel limitations ○ e.g.: waitpid on all threads is O(n2 ) Security ● Reducing need for root and elevated capabilities ● Not well tested in this setup Misc ● Contributing patches back is a bit hard 32
  • 33. Confidential + Proprietary What could do with improvements Live migration ● Parts of incremental restore are very, very difficult ● Lots of work ahead to do the type of brownout used in VM live migration today Handling time ● Hard to abstract away many of the time HW counters ● Time namespaces to the rescue? 33
  • 34. Confidential + Proprietary Future work Increasing adoption internally ● Reduce lost compute and simplify user tasks ● Targeting on-by-default for large batch workloads Machine-to-machine migration ● Skip the distributed storage of the checkpoint ● Reduces migration times to ~30s Live migration ● Able to address latency sensitive workloads ● Will require some work in our stack and in CRIU 34
  • 35. Confidential + Proprietary Questions? Native task migration offering in Borg ● Reduces compute lost to evictions ● Simplifies task handling of preemptions ● Addresses most batch workloads ● Serving workloads need live migration CRIU ● Works amazingly well out of the box ● Security an area of investment ● We are excited about and look forward to live migration! Victor Marmol vmarmol@google.com Andy Tucker agtucker@google.com 35