2. Agenda
● Who we are
● Our journey into Kubernetes
● Why Canary Release
● How we solve it
● Next Step
3. Intuit Confidential and Proprietary
Intuit mission
Powering Prosperity Around the World
4. Who we are
● Founded: 1983
● IPO: 1993
● Employees: 9,000
● Customers: 50M
● FY18 Revenue: $6B
● Locations: 21
5. Challenges in our cloud journey
● Too much time spent on infrastructure tasks.
○ AWS/Chef expertise
○ No standard deployment pipeline
● High cost of cross-team contributions.
● Engineers just want to get features out to customers
as soon as possible without worrying about deployment or infrastructure.
6. Intuit Development Platform (Modern SaaS)
(architecture diagram, recovered as a component list)
● Operations integrations: Splunk (Logging), PagerDuty (Alerts), AppDynamics (Monitoring), Wavefront (Monitoring), ServiceNow (CM), IDPS (Secrets)
● Intuit Kubernetes Service (IKS): core Kubernetes with Intuit network & security policies & best practices, built on EKS and Kops, with Security & Compliance, Continuous Operations (Monitoring, Analytics, Remediation), Olympus (SSO & AWS Roles), NetGenie (Certs)
● CI/CD: GitHub (Apps as Code), IBP 2.0 Jenkins (Build & Test, CI/CD), Quality Frameworks (TDS, Overwatch, TrinityJS, Hubble…), JFrog Artifactory (CDP), Argo CD (GitOps)
● Dev Patterns: JSK + Config + Experimentation, Intuit API (v4), Streaming/Messaging, Serverless Framework, Argo Workflows, UX Fabric
● Multi-Cluster Service Mesh and Gateway; Service Catalog
● AWS Infrastructure: VPC, ALB/NLB, S3, RDS, DynamoDB, ElastiCache, ...
● Developer and Operations Experience: Onboarding, Monitoring, Management, Multi-Cluster Mgmt (IKSM), Discover Lean/Play, Metrics/Analytics (Team Speed Dashboards)
7. Key Components of Modern SaaS platform
● CI/CD pipeline supporting GitOps for containers
○ Jenkins 2.0 for pipeline
○ Artifactory as Docker image repo
○ Argo CD for deployment
● Monitoring
○ Pod metrics in Wavefront using Heapster
○ Splunk for log analysis
○ AppDynamics as APM
8. What is a performance environment?
● Solving for
○ Identifying bottlenecks
○ Performance/Latency/Capacity
● Challenges
○ Very difficult to simulate production traffic
○ Hard to replicate the production dataset
○ Dependencies do not behave like production
9. What is a Canary Release?
● “ ... a small set of end users selected for testing act
as the canaries ... negative results from a canary
release can be inferred from telemetry and metrics in
relation to key performance indicators … ”
● What we measure:
○ Pod metrics
○ JVM metrics
○ App metrics
10. Common questions on Canary Release
● How is Canary Release different from Blue/Green?
○ Blue/Green takes 100% of the traffic and is used for
quick fallback to minimize potential downtime.
● How can I release software that’s not fully tested?
○ Your functional tests are supposed to catch functional issues.
○ Canary is meant to catch performance drift and other scale issues
in prod.
11. Canary Analysis Tools
● Netflix Kayenta (hosting)
○ Requires a minimum of 60 data points per metric.
○ Calculates mean and std. dev. per metric.
○ Score = sum of (weight × metric group score), aka
the Model.
○ Supports custom Judge implementations.
● Wavefront as data store for canary and prod metrics.
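The scoring formula above can be sketched in Python. This is a simplified, illustrative judge, not Kayenta's actual implementation: the pass/fail scoring, tolerance, and function names are assumptions.

```python
# Simplified sketch of Kayenta-style canary scoring (illustrative only):
# each metric group is judged pass/fail by comparing the canary mean to the
# baseline mean, and the final score is the weighted sum of group scores.
from statistics import mean, stdev

def metric_group_score(canary, baseline, tolerance=2.0):
    """Return 100 if the canary mean stays within `tolerance` standard
    deviations of the baseline mean, else 0 (a toy judge)."""
    base_mean, base_std = mean(baseline), stdev(baseline)
    return 100.0 if abs(mean(canary) - base_mean) <= tolerance * base_std else 0.0

def canary_score(groups):
    """groups: name -> (canary samples, baseline samples, weight).
    Weights are assumed to sum to 1, so the score lands in [0, 100]."""
    return sum(w * metric_group_score(c, b) for c, b, w in groups.values())
```

Kayenta's 60-data-point minimum maps to the sample length fed into each group here; a real judge also classifies high/low deviations rather than scoring strictly pass/fail.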
12. Changes to production pipeline
● Collect JVM and App Metrics
○ Jolokia (JVM) and Telegraf (Wavefront integration) sidecars
○ Netflix Servo (MBeans) for App Metrics
● Support Canary Deployment (Jenkins pipeline)
○ Canary Deployment stage using Argo CD
○ Wait and Compute Score stage
○ Approval stage for prod deployment (if score > 90)
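The approval stage reduces to a simple gate on the computed score. The > 90 threshold comes from the slide; the stage names returned below are hypothetical.

```python
# Sketch of the approval gate: after the Wait & Compute Score stage, the
# canary score decides whether the pipeline proceeds to prod deployment.
# The > 90 threshold is from the slide; the stage names are illustrative.
def approval_gate(score, threshold=90.0):
    """Return the next pipeline stage for a given canary score."""
    if score > threshold:
        return "deploy-prod"       # promote: the canary looked healthy
    return "rollback-canary"       # abort: score too low, prod stays as-is
```

Note the strict inequality: a score of exactly 90 does not pass the gate as described.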
13. Canary Release Flow
(flow diagram, recovered as text)
● Jenkins pipeline stages: PR → Deploy Stage → Sanity Test → Deploy Canary → Wait & Compute Score → Approval → Deploy Prod
● On K8s, the canary pod runs alongside the prod pods; Jolokia and Telegraf sidecars on the service pods ship metrics to Wavefront.
● Kayenta reads canary and prod metrics from Wavefront and computes the score (Model A vs. Model B).
14. The Canary Analysis Model
● Pod (Heapster)
○ CPU, Memory, Page Faults
● JVM (Jolokia & Telegraf)
○ Heap Usage, Thread Count, GC Count
● Application Level (Jolokia & Telegraf & Servo)
○ Business metrics
○ Server Error Count
○ HTTP 200, 400, 500 Counts
15. Canary Model Refinement
● Start with the Happy Path (in DR)
○ Assert on similar results (score = 100)
● Test “Unhappy Paths” (in DR)
○ Spike in application errors (score < 100)
○ Spike in memory/threads for GC/thread count (score < 100)
○ Combine the two spikes and assert that the aggregate
score is lower.
● Refine using prod traffic with a manual gate
○ Assert the Canary Score against other monitoring tools.
16. What we have learned
● Start with as many metrics as possible because:
○ Each run takes time (minimum one hour).
○ “What if” scenarios can be applied to the collected
metrics.
● At least ten metric groups are needed for a meaningful score.
● How do you compare a set of latency metrics in a
one-minute window? Mean, TP50, TP99?
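The latency question above can be made concrete: mean, TP50, and TP99 summarize the same one-minute window very differently when the tail is heavy. A minimal nearest-rank percentile illustrates this (an assumption for demonstration; Wavefront and Kayenta do their own summarization).

```python
# Mean vs. TP50 vs. TP99 over one window: a handful of slow requests moves
# the mean and TP99 while TP50 stays flat, so the choice of summary matters.
import math

def percentile(samples, p):
    """Nearest-rank percentile (no interpolation), p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

window = [10.0] * 98 + [500.0, 900.0]   # 100 latency samples, 2 slow outliers
# mean ≈ 23.8, TP50 = 10, TP99 = 500: three very different answers
```

A canary that only compared TP50 here would miss the tail regression entirely, while a mean-only comparison would flag it but hide how many users were affected.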
17. In Summary
● A performance environment is never the same as production.
● Canary Release detects performance drift and bottlenecks
using the production environment and traffic.
● Canary Release Process
○ Define Metrics and Model
○ Orchestrate the canary release
○ Collect Metrics
○ Compute/Validate the score
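The four-step process above can be sketched as a minimal orchestration loop. The callables are stubs for what would, in the pipeline described earlier, deploy via Argo CD, query Wavefront, and score via Kayenta; all names here are illustrative.

```python
# Minimal sketch of the canary release process: deploy the canary, collect
# metrics, compute the score, and validate it before promoting to prod.
from typing import Any, Callable, Dict

def run_canary_release(
    deploy_canary: Callable[[], None],
    collect_metrics: Callable[[], Dict[str, Any]],
    compute_score: Callable[[Dict[str, Any]], float],
    deploy_prod: Callable[[], None],
    threshold: float = 90.0,
) -> bool:
    """Run one canary cycle; return True if the release was promoted."""
    deploy_canary()                 # orchestrate the canary release
    metrics = collect_metrics()     # collect metrics over the canary window
    score = compute_score(metrics)  # compute the score (the defined model)
    if score > threshold:           # validate the score against the gate
        deploy_prod()
        return True
    return False
```

Passing the steps in as callables keeps the orchestration testable without any of the real infrastructure behind it.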
18. Next Step (Making it Scale!)
● Argo Rollouts for canary deployment
○ Eliminates custom deployment logic in the Jenkins pipeline.
○ Enables scaling the canary up and prod down.
○ Adds Baseline support.
● Prometheus for metric collection
○ Eliminates sidecars like Jolokia and Telegraf.
● Service Mesh to throttle Canary (5%) and Baseline (5%).
19. Thank you!
With contributions from Parin Shah and Danny Thomson!
Billy Yuen
billy_yuen@intuit.com
We’re Hiring!!