The document discusses deploying next-generation systems at Twilio while ensuring zero downtime and zero regressions. It describes the challenges of replacing two core systems while maintaining high availability and horizontal scalability without losing any data. It outlines techniques used such as running tests against both systems, canary deployments, shadow mode with double bookkeeping for rollbacks, and gradual rollouts. The goal is reliable online upgrades of critical systems processing millions of transactions.
3. [Chart: Twilio Transactions Per Month, Dec 2009 through Dec 2012. Y-axis: 0 to 125,000,000. Data labels: 365,782; 25,679,631; 67,186,111; 123,090,34...]
10. THE CHALLENGE
• Design, build, deploy replacements of 2 core systems.
➡ They must be HA.
➡ They must be horizontally-scalable.
... Oh, and don’t lose a single billing event or API request in the process.
#twiliocon
22. TEST, TEST, TEST
• Unit & Functional tests for local development
• The same tests run in our Staging Cluster
• The same cluster tests run against Both API Frameworks
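One way to run the same check against both API frameworks is to send each request once per stack and collect the differences. The sketch below is illustrative only: the `X-Requested-Api-Stack` header comes from the nginx configuration later in this deck, while `run_against_both`, `fake_send`, and the stack names in `STACKS` are hypothetical stand-ins for a real test harness and HTTP client.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical transport: takes (path, headers) and returns (status, body).
Transport = Callable[[str, Dict[str, str]], Tuple[int, str]]

STACKS = ["php", "python"]  # assumed stack names for this sketch


def run_against_both(path: str, expected_status: int, send: Transport) -> List[str]:
    """Run one check against every stack and return a list of failure messages."""
    failures = []
    for stack in STACKS:
        status, _body = send(path, {"X-Requested-Api-Stack": stack})
        if status != expected_status:
            failures.append(f"{stack}: expected {expected_status}, got {status}")
    return failures


def fake_send(path: str, headers: Dict[str, str]) -> Tuple[int, str]:
    """Stand-in for a real HTTP call to the staging cluster."""
    stack = headers["X-Requested-Api-Stack"]
    if stack == "python" and path == "/broken":
        return 500, "error"  # simulate a regression in only one stack
    return 200, "ok"
```

An empty list means both stacks agree with the expectation; any entry names the stack that regressed.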
35. NGINX.CONF
# Map the HTTP header X-Requested-Api-Stack: <value>
# to a named location in nginx
map $http_x_requested_api_stack $requested_stack_default_php {
    default @php;
    python  @python;
}
location @python {
    proxy_pass http://127.0.0.1:5555;
}
location @php {
    proxy_pass http://127.0.0.1:12345;
}
# "Kwijibo" is a placeholder file name that is not expected to exist,
# so try_files falls through to the named location chosen by the map.
location ~ / {
    try_files Kwijibo $requested_stack_default_php;
}
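To make the routing decision easier to reason about, here is a small Python model of what the configuration above does. It is a sketch, not part of the deployment: `named_location` and `upstream_for` are hypothetical helper names, and the upstream addresses are simply copied from the slide.

```python
from typing import Mapping

# Named locations and their upstreams, as given in the nginx.conf slide.
UPSTREAMS = {
    "@php": "127.0.0.1:12345",
    "@python": "127.0.0.1:5555",
}


def named_location(headers: Mapping[str, str]) -> str:
    """Mimic the map block: 'python' selects @python, anything else @php."""
    value = ""
    for name, header_value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "x-requested-api-stack":
            value = header_value
    # nginx matches plain-string map keys ignoring case.
    return "@python" if value.lower() == "python" else "@php"


def upstream_for(headers: Mapping[str, str]) -> str:
    """Return the upstream address a request with these headers would reach."""
    return UPSTREAMS[named_location(headers)]
```

Requests without the header, or with an unknown value, stay on the PHP stack, which is what makes the header safe to roll out gradually.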
Let's rewind to 2009 and take a look at what we built.
At Twilio, we nerd out on billing every day. It's odd and quirky, but super powerful, and mission critical to the advancement of Twilio. We decided to invest in building this early on in the company.
High availability: always be processing.
Realtime scoreboards, usage, metrics.
Two piggy banks. Single-process dequeuers, limited by the number of transactions you can process on a single database.
Once we understood the problems, we went to the drawing board. We built something new to replace the infrastructure, alongside it:
• A system that disconnects our dequeuers from our processors.
• Processors powered by a REST API that uses status codes for success and error resolution.
• Only inserting into our databases as a log server.
• To fix our realtime issue, Redis as an in-flight datastore to atomically process metrics as we process transactions.
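As a rough illustration of a dequeuer that is decoupled from its processor and reacts to status codes, consider the sketch below. The specific policy (2xx means done, 4xx means reject without retry, anything else means retry up to `max_attempts`) is an assumption made for this example, not the rule set described in the talk, and `drain` and the event shape are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, object]
Processor = Callable[[Event], int]  # returns an HTTP-style status code


def drain(queue: Deque[Event], process: Processor,
          max_attempts: int = 3) -> Dict[str, List[Event]]:
    """Pop events, hand each to the processor, and act on the status code."""
    done: List[Event] = []
    rejected: List[Event] = []
    while queue:
        event = queue.popleft()
        status = process(event)
        if 200 <= status < 300:
            done.append(event)          # processed successfully
        elif 400 <= status < 500:
            rejected.append(event)      # bad event: retrying will not help
        else:
            attempts = int(event.get("attempts", 0)) + 1
            if attempts >= max_attempts:
                rejected.append(event)  # give up, keep it for inspection
            else:
                queue.append({**event, "attempts": attempts})  # retry later
    return {"done": done, "rejected": rejected}
```

Because the dequeuer only sees a status code, the processor behind it can be scaled or replaced without changing the queue-handling logic. Rejected events are kept rather than discarded, in line with the goal of not losing a billing event.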
Now we have two systems side by side, but we need to compare the two. Double bookkeeping lets us compare balances and metrics. If we have a bug, we need to roll back. And we don't want to do it all at once.
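A comparison of the two sets of books can be as simple as diffing balances per account. This is a minimal sketch under assumed data shapes: `compare_books`, the account identifiers, and the use of `Decimal` balances keyed by account are all hypothetical.

```python
from decimal import Decimal
from typing import Dict, List, Optional, Tuple

Balances = Dict[str, Decimal]
Mismatch = Tuple[str, Optional[Decimal], Optional[Decimal]]


def compare_books(legacy: Balances, new: Balances) -> List[Mismatch]:
    """Return (account, legacy_balance, new_balance) for every disagreement."""
    mismatches: List[Mismatch] = []
    for account in sorted(set(legacy) | set(new)):
        old_value = legacy.get(account)   # None if the account is missing
        new_value = new.get(account)
        if old_value != new_value:
            mismatches.append((account, old_value, new_value))
    return mismatches
```

An empty result is the signal that an account is a candidate for migration; any mismatch is something to investigate before enabling it.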
With so much throughput, we couldn't just shut down the billing system. We also couldn't lose a billing event. So we built an abstraction between the two systems that would allow us to atomically control transactions. When the faucet is turned off, it waits until both queues are drained and sends us a report.
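The faucet idea can be sketched in a few lines. This is a single-threaded toy, not the real abstraction: the `Faucet` class, the queue names `legacy` and `new`, and the report fields are assumptions, and "waiting for the queues to drain" is modeled by emptying them in place.

```python
from collections import deque
from typing import Deque, Dict, List


class Faucet:
    """Gate in front of two queues (hypothetical names: 'legacy' and 'new')."""

    def __init__(self) -> None:
        self.is_open = True
        self.queues: Dict[str, Deque[dict]] = {"legacy": deque(), "new": deque()}
        self.held: List[dict] = []           # events that arrive while closed
        self.processed = {"legacy": 0, "new": 0}

    def submit(self, event: dict) -> None:
        if self.is_open:
            for queue in self.queues.values():
                queue.append(event)          # both systems see every event
        else:
            self.held.append(event)          # delayed, never dropped

    def turn_off(self) -> dict:
        """Close the faucet, drain both queues, and return a report."""
        self.is_open = False
        for name, queue in self.queues.items():
            while queue:                     # stand-in for waiting on consumers
                queue.popleft()
                self.processed[name] += 1
        return {
            "processed": dict(self.processed),
            "held": len(self.held),
            "in_sync": self.processed["legacy"] == self.processed["new"],
        }

    def turn_on(self) -> None:
        """Reopen the faucet and release everything that was held."""
        self.is_open = True
        pending, self.held = self.held, []
        for event in pending:
            self.submit(event)
```

Holding events while the faucet is closed, rather than refusing them, is what keeps the "don't lose a single billing event" requirement intact during a pause.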
Check the books. Can we bring any accounts online?
Nope, we should not enable any accounts yet.
If we had moved accounts, we can migrate them back with a click.
Or, if the cluster catches fire, we can turn off the entire new system and reroute billing traffic back to the legacy system.
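The two rollback levers described here, per-account flags and a global off switch, could look roughly like this. `BillingRouter` and its method names are hypothetical; the sketch only shows how the two controls interact.

```python
from typing import Set


class BillingRouter:
    """Decide which billing system handles an account (illustrative only)."""

    def __init__(self) -> None:
        self.new_system_enabled = True     # global switch for the new cluster
        self.migrated: Set[str] = set()    # accounts flagged onto the new system

    def migrate(self, account_id: str) -> None:
        self.migrated.add(account_id)

    def migrate_back(self, account_id: str) -> None:
        self.migrated.discard(account_id)

    def kill_switch(self) -> None:
        """Send all traffic back to the legacy system, keeping the flags."""
        self.new_system_enabled = False

    def route(self, account_id: str) -> str:
        if self.new_system_enabled and account_id in self.migrated:
            return "new"
        return "legacy"
```

Keeping the account flags untouched when the kill switch is thrown means the same set of accounts can be re-enabled once the new system is healthy again.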
TODO: Graph build-out each week with the story.
Practice is good. We tested our platform thoroughly in a practice mode with no account flags turned on.
As we progressed and fixed the edge cases, we migrated 1%, then 5%, and eventually all accounts over a period of time.
Planning with your tools lets you build a gradual deployment with ease.
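A percentage-based rollout like the 1%, 5%, all progression can be driven by a stable hash of the account identifier. This is one common approach, not necessarily the one used here; `bucket` and `is_migrated` are hypothetical names, and the choice of SHA-256 with 100 buckets is an assumption.

```python
import hashlib


def bucket(account_id: str) -> int:
    """Map an account to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_migrated(account_id: str, percent: int) -> bool:
    """True if this account falls inside the current rollout percentage."""
    return bucket(account_id) < percent
```

Because the bucket never changes for a given account, raising the percentage only adds accounts, and lowering it moves the most recently added ones back first.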
Just to follow up: better tools equal better deployments.
When we had issues with our new in-flight store, we had a way to roll back. When we were seeing discrepancies in balances, we would investigate, fix, deploy, and compare.
So you get the idea: to migrate to a new micro-payments platform, we must engineer tools that let us migrate back and forth with ease, so that we can spend time on the solutions.