Scaling DevOps of Microservices at Uber (Code Conf 2018)
1. Scaling DevOps at Uber
Kiran Bondalapati, Uber
Igniting opportunity by setting the world in motion
2. 10+ billion trips
15M+ trips per day
6 continents, 65 countries and 600+ cities
75M active monthly users
3M+ active drivers
16,000+ employees worldwide
3000+ developers worldwide
7. Pre-history: PHP (outsourced)
Marketplace: Node.js, moving to Go
Core Services: Python, moving to Go, Java
Maps: Python and Java
Data: Python and Java
Metrics: Go
Code
20000+ repos
Multiple languages and frameworks
Multiple communication protocols
10. 4000+ builds per day
Build times affect developer productivity
Build sizes affect deployments
Build
Build without docker
Optimize layer generation
Distributed cache for intermediate layers
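One way to picture the distributed cache for intermediate layers is to key each layer by a hash of its parent layer's key plus the build step that produces it, so any build host that has already seen the same prefix of steps can reuse the result. The sketch below is a minimal illustration under that assumption, not Uber's build system: `layer_key`, `LayerCache`, the step strings, and the in-memory dict standing in for a shared store are all hypothetical.

```python
import hashlib


def layer_key(parent_key: str, step: str) -> str:
    """Content-address a layer by its parent key and the build step text."""
    return hashlib.sha256(f"{parent_key}\n{step}".encode("utf-8")).hexdigest()


class LayerCache:
    """Hypothetical cache; a dict stands in for a shared, networked store."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def build(self, steps):
        """Walk the steps in order, reusing any layer already in the cache."""
        key = ""
        for step in steps:
            key = layer_key(key, step)
            if key in self._store:
                self.hits += 1
            else:
                self.misses += 1
                # A real builder would run the step here and upload the result.
                self._store[key] = f"layer for: {step}"
        return key


cache = LayerCache()
first = cache.build(["base image", "install dependencies", "compile source v1"])
second = cache.build(["base image", "install dependencies", "compile source v2"])
print(cache.hits, cache.misses)
```

Only the last step differs between the two builds, so the second build reuses the first two layers and rebuilds one. Because each key includes the parent key, a change to an early step invalidates every layer after it.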
11. 100s of services pulling 1000s of images from Registry
Deploy
Vertical Scaling
Horizontal Scaling
P2P Distribution - Scales with Load
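A back-of-the-envelope model can show why P2P distribution scales with load where a central registry does not. The sketch below is an idealized assumption, not a measurement: every host and the registry have the same link speed, and in the P2P case each holder of the image seeds exactly one new host per round. The function names and the 1 GB image / 10 Gbps link figures are made up for illustration.

```python
def central_seconds(hosts: int, image_gb: float, link_gbps: float) -> float:
    """Every host pulls the full image through the registry's single link."""
    return hosts * image_gb * 8 / link_gbps


def p2p_rounds(hosts: int) -> int:
    """Rounds until all hosts hold the image if each holder seeds one host per round."""
    holders, rounds = 1, 0  # before round 1, only the registry holds the image
    while holders - 1 < hosts:
        holders *= 2
        rounds += 1
    return rounds


def p2p_seconds(hosts: int, image_gb: float, link_gbps: float) -> float:
    """Each round costs one full image transfer over a single link."""
    return p2p_rounds(hosts) * image_gb * 8 / link_gbps


for hosts in (10, 100, 1000):
    print(hosts, central_seconds(hosts, 1.0, 10.0), p2p_seconds(hosts, 1.0, 10.0))
```

Under these assumptions the central pull time grows linearly with the number of hosts, while the P2P time grows with its logarithm, because every host that finishes downloading adds upload capacity.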
12. Reproduce Halloween and New Year
Systemic issues are hard to catch in unit tests
Cascading failures are common in real life
Test
Hailstorm load testing framework
uDestroy random failure injection framework
Regular failure and failover drills
no testee … no workee
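Random failure injection can be sketched as a wrapper that makes a fraction of calls fail, so the caller's fallback path is exercised on purpose. This is a generic illustration, not the uDestroy API: `with_failures`, `call_with_fallback`, and the two lookup functions are hypothetical names, and the 30% failure rate is arbitrary.

```python
import random


class InjectedFailure(Exception):
    """Raised instead of a real result to simulate a dependency failure."""


def with_failures(func, failure_rate: float, rng: random.Random):
    """Wrap func so that roughly failure_rate of calls raise InjectedFailure."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_fallback(primary, fallback, *args):
    """Caller-side handling that a failure drill is meant to exercise."""
    try:
        return primary(*args)
    except InjectedFailure:
        return fallback(*args)


def live_lookup(key):
    return f"live value for {key}"


def cached_lookup(key):
    return f"cached value for {key}"


rng = random.Random(42)  # fixed seed so the same drill can be replayed
flaky_lookup = with_failures(live_lookup, failure_rate=0.3, rng=rng)
results = [call_with_fallback(flaky_lookup, cached_lookup, i) for i in range(1000)]
fallbacks = sum(r.startswith("cached") for r in results)
print(f"{fallbacks} of {len(results)} calls used the fallback")
```

Seeding the generator keeps a drill reproducible, which helps when a failure found in one run has to be debugged in the next.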
13. Containers are sized for peak load
Dynamic utilization affects cluster efficiency
Typical auto-scaling does not help
Run
Combine responsive and revocable tasks
Oversubscribe resources
Rate limiting of revocation
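Rate limiting of revocation can be pictured as a token bucket in front of the eviction path: when a responsive service needs its resources back, revocable tasks are evicted only as fast as the bucket allows. The sketch below is an assumption about how such a limiter might look, not Uber's scheduler; `RevocationLimiter`, `reclaim`, and the task names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class RevocationLimiter:
    """Token bucket that caps how fast revocable tasks may be evicted."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Return True if one revocation may proceed at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def reclaim(revocable_tasks, needed_cpus, limiter, now):
    """Evict revocable tasks until enough CPU is freed or the limiter says stop."""
    evicted, freed = [], 0.0
    for name, cpus in revocable_tasks:
        if freed >= needed_cpus or not limiter.allow(now):
            break
        evicted.append(name)
        freed += cpus
    return evicted, freed


limiter = RevocationLimiter(rate_per_sec=1.0, burst=2)
tasks = [("batch-a", 2.0), ("batch-b", 2.0), ("batch-c", 2.0)]
first_pass = reclaim(tasks, needed_cpus=6.0, limiter=limiter, now=0.0)
second_pass = reclaim(tasks[2:], needed_cpus=2.0, limiter=limiter, now=1.0)
print(first_pass, second_pass)
```

The first pass frees only two tasks' worth of CPU because the burst is spent; the third eviction has to wait for the bucket to refill. This trades reclaim speed for stability of the oversubscribed workloads.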
14. M3 metrics platform
~5B time series
~10M metrics/sec
Changing services, metrics, infrastructure, ...
Monitoring
Rule-based alert generators
Git-based review and update
Measure on-call quality
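A rule-based alert generator can be sketched as a template that is expanded against the current list of services, so alerts follow the services as they change instead of being written by hand. The example below is illustrative only: `Rule`, `Alert`, `generate_alerts`, the tag scheme, and the `service.metric` query string are hypothetical and do not reflect M3's query syntax or Uber's alerting configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A template that applies to every service carrying a given tag."""
    name: str
    tag: str
    metric: str
    threshold: float


@dataclass(frozen=True)
class Alert:
    """One concrete alert produced by expanding a rule for a service."""
    service: str
    rule: str
    query: str
    threshold: float


def generate_alerts(rules, services):
    """Expand each rule into one alert per matching service."""
    alerts = []
    for rule in rules:
        for service, tags in sorted(services.items()):
            if rule.tag in tags:
                query = f"{service}.{rule.metric}"  # made-up query format
                alerts.append(Alert(service, rule.name, query, rule.threshold))
    return alerts


rules = [Rule("high-error-rate", tag="http", metric="error_rate", threshold=0.05)]
services = {"trips": {"http"}, "maps": {"http", "batch"}, "etl": {"batch"}}
alerts = generate_alerts(rules, services)

services["pricing"] = {"http"}  # a new service picks up the rule automatically
regenerated = generate_alerts(rules, services)
print(len(alerts), len(regenerated))
```

Because the rules are plain data, they can live in a Git repository and go through the same review and update flow as code.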
15. HW/SW tends to have faults
100M+ alerts per month across Uber stack
Many faults are transient/temporary
Remediate
Smart alert prioritization
Automate manual tasks - reboot, restart, ...
SLA-aware remediation
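SLA-aware remediation can be read as: an automated restart is allowed only if the service keeps enough capacity to meet its SLA while the restart is in flight. The sketch below encodes that reading as an assumption; `plan_restarts`, the `serving`/`alerting` flags, and the `min_serving` floor are hypothetical and not taken from Uber's tooling.

```python
def plan_restarts(instances, min_serving):
    """Choose alerting instances to restart while keeping min_serving in service.

    instances maps name -> {"serving": bool, "alerting": bool}.
    Returns (restart_now, deferred), both sorted by instance name.
    """
    serving = sum(1 for inst in instances.values() if inst["serving"])
    restart_now, deferred = [], []
    for name in sorted(instances):
        inst = instances[name]
        if not inst["alerting"]:
            continue
        # Restarting an instance that still serves traffic removes capacity.
        cost = 1 if inst["serving"] else 0
        if serving - cost >= min_serving:
            serving -= cost
            restart_now.append(name)
        else:
            deferred.append(name)
    return restart_now, deferred


fleet = {
    "a": {"serving": True, "alerting": True},
    "b": {"serving": True, "alerting": True},
    "c": {"serving": False, "alerting": True},
    "d": {"serving": True, "alerting": False},
    "e": {"serving": True, "alerting": False},
}
restart_now, deferred = plan_restarts(fleet, min_serving=3)
print(restart_now, deferred)
```

Here `a` and `c` are restarted immediately, while `b` is deferred because taking it down would leave fewer than three serving instances. A deferred restart can be retried once `a` is back, or escalated to a person.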
17. Standards-based innovation
Layercake architecture
Avoid cyclic dependencies
Avoid cascading failures while designing
Incremental deployments - code and config
Test often … including production
Add guardrails to automation
Design for understandability
Learnings
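The advice to avoid cyclic dependencies can be enforced mechanically, for example as a check that runs when a service declares its dependencies. The sketch below is a plain depth-first search over a hypothetical dependency map; the service names and the `find_cycle` helper are made up for illustration and do not describe Uber's tooling.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of services, or None if acyclic.

    deps maps each service to the list of services it calls.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so the path loops back.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in sorted(deps):
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


layered = {"api": ["trips", "pricing"], "trips": ["pricing"], "pricing": []}
tangled = {"api": ["trips"], "trips": ["pricing"], "pricing": ["api"]}
print(find_cycle(layered), find_cycle(tangled))
```

A layered design in which services only call into lower layers passes this check by construction, which is one plausible reading of the layercake architecture named above.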
18. Larger systems
Bigger impact of change
Scale
Larger teams
Less each person knows
Our understanding of systems breaks more often than actual systems do.