Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1ph8Rq1.
Roy Rapoport discusses canary analysis deployment and observability patterns that he believes are generally useful, and talks about the difference between manual and automated canary analysis. Filmed at qconnewyork.com.
Roy Rapoport manages the Insight Engineering group at Netflix, responsible for building Netflix's Operational Insight platforms, including cloud telemetry, alerting, and real-time analytics. He originally joined Netflix as part of its datacenter-based IT/Ops group, and prior to transferring over to Product Engineering, was managing Service Delivery for IT/Ops.
Canary Analyze All The Things: How We Learned to Keep Calm and Release Often
1. Canary Analyze All the Things
Roy Rapoport
@royrapoport
June 12, 2014
Significant contributions by Chris Sanden, @chris_sanden
2. Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/canary-analysis-deployment-pattern
3. Presented at QCon New York
www.qconnewyork.com
4. Oh, the Places We’ll Go!
• Introductions
• Proposed Use Case and Definition
• Continuous Improvement / MVP Model
• Issues, Solutions
• Cloud Considerations
• The Road at Netflix
5. A Word About Me …
• About 20 years in technology
• Systems engineering, networking, software development, QA, release management
• Time at Netflix: 1809 days (4y:11m:14d)
• At Netflix:
• Systems Engineering, Service Delivery in IT/Ops
• Troubleshooter and Builder of Python Things[tm] in Product Engineering
• Current role: Insight Engineering in Product Engineering
• Real-Time Operational Insight
6. A Word About Netflix…
Just the Stats
• 16 years
• 2000+ employees
• 48 million users
• 5x10^9 hours/quarter
7. A Word About Netflix…
Freedom and Responsibility Culture
• Optimize speed of innovation
• Constrain availability
• Cost will be what cost will be
• Hire smart (experienced) people
• Get out of their way
• Anti-process bias
8. A Word About Netflix…
Technology and Operations
• Service Oriented Architecture
• Decentralized Operations. You:
• Build
• Test
• Deploy
• Set up alerting and monitoring
• Wake up at 2AM
11. So You’ve Just Done a Release
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/cat
{"response": "meow"}
12. So You’ve Just Done a Release
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/dog
{"response": "woof"}
13. So You’ve Just Done a Release
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/fox
{"response": "wa-pa-pa-pa-pa-pa-pow"}
The correct answer to “what does the fox say?” is left as an exercise for the reader
15. You Need Better Testing!
“I’m going to push to production, though I’m pretty sure it’s going to kill the system”
- Said no one, ever*
* Hopefully
16. Detour
[Chart: Rate of Change vs Availability. X-axis: Rate of Change (1, 10, 100, 1000). Y-axis: Availability (nines), 0 to 6. Labels: Operations, Engineering.]
17. You Need Better Testing! Deployments!
Canary Analysis
• A deployment process where
• a new change (in behavior, code, or both)
• is rolled out into production gradually,
• with checkpoints along the way to examine the new (canary) systems
• (optionally versus the old (baseline) systems)
• and make go/no-go decisions.
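The definition above can be sketched as a small loop. This is only an illustration under stated assumptions, not the process used at Netflix: the function name `staged_rollout` and the `deploy`, `canary_ok`, and `rollback` callbacks are hypothetical stand-ins for whatever a real deployment system provides, and the stage sizes (1, 10, 1000 servers) are borrowed from the "One Possible Process" diagram.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    canary_ok: Callable[[int], bool],
    rollback: Callable[[], None],
) -> bool:
    """Roll a change out stage by stage; stop at the first failed checkpoint."""
    for size in stages:
        deploy(size)             # grow the canary fleet to `size` servers
        if not canary_ok(size):  # checkpoint: go / no-go decision
            rollback()           # no-go: back the change out
            return False
    return True


# Usage with stand-in callbacks that only record what happened.
log = []
result = staged_rollout(
    stages=[1, 10, 1000],
    deploy=lambda n: log.append(f"deploy {n}"),
    canary_ok=lambda n: n < 1000,  # pretend the final checkpoint fails
    rollback=lambda: log.append("rollback"),
)
```

The go/no-go decision is kept behind a callback so the same loop would work whether a person or an automated analysis makes the call.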
18. Canary Analysis Is Not
• A replacement for any sort of software testing
• A/B Testing
• Releasing 100% to production and hoping for the best
19. One Possible Process
[Diagram. Elements: Version Control System; Build & Deployment System; Automated Canary Analysis; Customers; 1000 servers @ 1.0.1; 1 server, 10 servers, and 1000 servers @ 1.0.2. Arrow labels: commit, build, deploy, go.]
20. One Possible Process
[Diagram. Elements: Version Control System; Build & Deployment System; Automated Canary Analysis; Customers; 1000 servers @ 1.0.1; 1000 servers @ 1.0.2. Arrow label: go.]
21. One Possible Process
[Diagram. Elements: Version Control System; Build & Deployment System; Automated Canary Analysis; Customers; 1000 servers @ 1.0.1; 1000 servers @ 1.0.2. Arrow label: no go.]
23. Are We There Yet?
• We’re not
• You’re probably not either
33. A Quick Recap
• Observe
• Segregate metrics
• Partial deploy
• Compare to Baseline
• Absolutes are never right
• Automate decision
• Automate execution
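One way to read "Compare to Baseline" and "Absolutes are never right" is to judge a canary metric by its ratio to the same metric on the baseline, rather than against a fixed threshold. The sketch below assumes that reading; the function name `relative_check` and the 20% tolerance are illustrative choices, not values from the talk.

```python
from statistics import mean


def relative_check(canary, baseline, tolerance=0.2):
    """Return True if the canary mean is within `tolerance` of the baseline mean."""
    base = mean(baseline)
    if base == 0:
        # No meaningful ratio exists; only an all-zero canary matches.
        return mean(canary) == 0
    ratio = mean(canary) / base
    return abs(ratio - 1.0) <= tolerance


# Hypothetical latency samples in milliseconds.
baseline_latency = [100, 110, 90]
healthy = relative_check([105, 115, 95], baseline_latency)     # ratio 1.05
degraded = relative_check([150, 160, 140], baseline_latency)   # ratio 1.5
```

Because the check is relative, the same tolerance could be reused across metrics with very different scales.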
35. To Save You Some Time …
• Not all metrics are created equal
• Focus on System and Application Metrics
• Weight by category (system, latency, etc.)
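A minimal sketch of weighting by category, assuming each category has already been reduced to a score between 0 and 1. The function name `weighted_score`, the category names, and the weights are hypothetical and chosen only for illustration.

```python
def weighted_score(category_scores, weights):
    """Combine per-category scores into one score using per-category weights."""
    total_weight = sum(weights[c] for c in category_scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(category_scores[c] * weights[c] for c in category_scores)
    return weighted_sum / total_weight


# Hypothetical example: latency counts three times as much as system metrics.
score = weighted_score(
    {"system": 1.0, "latency": 0.5},
    {"system": 1, "latency": 3},
)
# (1.0 * 1 + 0.5 * 3) / 4 = 0.625
```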
36. To Save You Some Time …
• Outliers are out, lying
• Use a group of servers
• Balance fidelity with customer impact
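One possible reading of "use a group of servers" is that a single misbehaving server should not decide the outcome. The sketch below summarizes a group with the median, which one extreme value barely moves. The use of the median, the function name `group_value`, and the sample numbers are assumptions made for illustration, not something the slides specify.

```python
from statistics import mean, median


def group_value(per_server_values):
    """Summarize one metric across a group of servers, resisting outliers."""
    return median(per_server_values)


# Hypothetical latencies (ms) from five canary servers; one is an outlier.
latencies = [101, 99, 100, 950, 102]
robust = group_value(latencies)  # 101
naive = mean(latencies)          # 270.4, dominated by the outlier
```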
37. To Save You Some Time …
• Exercise without warmup can result in injury
• Repeat canary analysis frequently
• Both traffic and startup time are factors
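One simple way to account for startup time is to discard samples taken before a canary has had time to warm up. The sketch below assumes samples are (timestamp, value) pairs and that a fixed warmup window is good enough; the function name `drop_warmup` and the 60-second window are illustrative, and traffic-based warmup (the other factor named above) is not modeled.

```python
def drop_warmup(samples, started_at, warmup_seconds):
    """Keep only samples taken at or after the end of the warmup window."""
    cutoff = started_at + warmup_seconds
    return [(t, v) for t, v in samples if t >= cutoff]


# Hypothetical latency samples: (seconds since launch, latency in ms).
samples = [(0, 900), (30, 400), (60, 120), (90, 110)]
warm = drop_warmup(samples, started_at=0, warmup_seconds=60)
```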
38. To Save You Some Time …
• vive la différence!
• Hot-OK, Cold-OK
• Let Application Owners Choose
39. To Save You Some Time …
• Signal is better than no1$#[NO CARRIER]
• Ignore weak signals
41. Good News
• Software-Defined Everything
• Incremental Pricing
42. Bad News
• Capacity Management
• Unpredictable Inconsistency
44. Numbers
• 752 services in production
• In-house telemetry platform
• A few metrics
45. Been there. Done that. Manually. Artisanally.
• Started in the Data Center
• Manual, dashboard-driven
57. For Our Next Trick …
• Configuration GUI
• Deployment System Integration
• ACA All The Things
• OpenConnect firmware updates
• Client software changes
• Configuration changes in production
58. Summary
• Canary Analysis makes your changes
• Safer
• Faster
• Easier
• Most people can start doing it
• Everyone can do it better