Scaling highly available database infrastructure to 100x, 1000x, and beyond has historically been one of the hardest technical challenges that any successful web business must face. This is quickly changing with fully managed database services such as Amazon DynamoDB and Amazon Redshift: scaling work that previously required herculean effort is now as simple as an API call.
Over the last few years, Twilio has evolved their database infrastructure into a pipeline consisting of Amazon SQS, sharded MySQL, Amazon DynamoDB, Amazon S3, Amazon EMR, and Amazon Redshift. In this session, Twilio covers how they achieved success, specifically:
- How they replaced their data pipeline deployed to Amazon EC2 to meet their scaling needs with zero downtime.
- How they adopted Amazon DynamoDB and Amazon Redshift at the same scale as their MySQL infrastructure, at 1/5th the cost and operational overhead.
- Why they believe adopting managed database services like Amazon DynamoDB is key to accelerating delivery of value to their customers.
Sponsored by Twilio.
2. Hi, I’m Ryan
Tech Lead of the User Data team at Twilio
3. What is Twilio?
We provide a communications API that enables phones, VoIP, and messaging to be embedded into web, desktop and mobile software.
4. How Does it Work?
A user calls your number
Twilio receives the call
Your app responds
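The three steps above can be sketched from the application's side. This is a minimal sketch, assuming the app answers Twilio's HTTP request with a TwiML document; the function name `build_voice_response` and the greeting text are hypothetical and not taken from the talk.

```python
from xml.sax.saxutils import escape


def build_voice_response(greeting: str) -> str:
    """Return a TwiML document that tells Twilio to speak `greeting` to the caller."""
    # <Response> is the TwiML root element; <Say> reads its text aloud.
    # escape() keeps characters such as & and < from breaking the XML.
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Say>{escape(greeting)}</Say></Response>"
    )


if __name__ == "__main__":
    print(build_voice_response("Thanks for calling!"))
```

In a real deployment this string would be the body of the HTTP response served at the URL configured for the phone number.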
5. What is the User Data Team?
•We scale Twilio's backend database infrastructure
•We build customer facing data APIs
•We manage data policies and data security at rest
10. Problems at Scale
•Many consumers of data
•Data with different performance characteristics
•Failure in the database degrades many services
•Horizontal scaling and orchestration is complicated
12. What is a Service-Oriented Architecture?
An architecture in which required system behavior is decomposed into discrete units of functionality, implemented as individual services for applications to compose and consume.
13. Communicate Through Interfaces, Not Databases
[Diagram components: API, Web, Billing, Carriers, Call/Message Service, In Flight Service, In Flight MySQL, Post Flight Service, Post Flight MySQL]
14. Database Can Change Without Changing Every Service
[Diagram components: API, Web, Billing, Carriers, Call/Message Service, In Flight Service, In Flight MySQL, Post Flight Service, Post Flight Amazon DynamoDB]
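The idea that a database can change behind a service interface can be sketched as follows. This is an illustrative sketch only: `PostFlightStore`, `RowListStore`, `KeyedStore`, and `PostFlightService` are hypothetical names, and both stores are in-memory stand-ins for a MySQL-backed and a DynamoDB-backed implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class PostFlightStore(ABC):
    """Storage interface hidden behind the (hypothetical) Post Flight Service."""

    @abstractmethod
    def put(self, account_id: int, record: dict) -> None: ...

    @abstractmethod
    def records_for(self, account_id: int) -> List[dict]: ...


class RowListStore(PostFlightStore):
    """Stand-in for a relational backend: one flat list of rows."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def put(self, account_id: int, record: dict) -> None:
        self.rows.append({"account_id": account_id, **record})

    def records_for(self, account_id: int) -> List[dict]:
        return [
            {k: v for k, v in row.items() if k != "account_id"}
            for row in self.rows
            if row["account_id"] == account_id
        ]


class KeyedStore(PostFlightStore):
    """Stand-in for a key-value backend: records grouped by hash key."""

    def __init__(self) -> None:
        self.by_account: Dict[int, List[dict]] = {}

    def put(self, account_id: int, record: dict) -> None:
        self.by_account.setdefault(account_id, []).append(dict(record))

    def records_for(self, account_id: int) -> List[dict]:
        return list(self.by_account.get(account_id, []))


class PostFlightService:
    """Consumers call this service and never touch the store directly."""

    def __init__(self, store: PostFlightStore) -> None:
        self._store = store

    def record_call(self, account_id: int, sid: str) -> None:
        self._store.put(account_id, {"sid": sid, "type": "call"})

    def history(self, account_id: int) -> List[dict]:
        return self._store.records_for(account_id)
```

Swapping `RowListStore` for `KeyedStore` changes nothing for callers of `PostFlightService`, which is the property the slide describes.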
15. SOA Doesn’t Solve Everything
No matter how many services you put in front of MySQL, it’s still a single point of failure.
19. Rolling it Out With Zero Downtime (the hard part)
•We provide a 24/7, always on service
•Communications is intolerant of inconsistency and latency
•There is no maintenance window
20. Bringing Up a New Shard
[Diagram labels: Application, Master1, Slave1, Master2, Slave2; key range 0-9]
21. Split Odds and Evens for Writes
[Diagram labels: Application, Master1, Slave1, Master2, Slave2; writes split into Odds and Evens; key range 0-9]
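Splitting writes by odd and even keys amounts to a small routing function in the application. The sketch below is an assumption-laden illustration: the slides do not say which master takes which half, so the mapping of odd ids to Master1 and even ids to Master2 is hypothetical, as is the function name.

```python
from collections import Counter


def shard_for_write(record_id: int) -> str:
    """Pick the master that accepts writes for `record_id`."""
    # Assumption: odd ids are written to Master1, even ids to Master2.
    return "Master1" if record_id % 2 else "Master2"


if __name__ == "__main__":
    # Ids ending in 0-9 split evenly across the two masters.
    print(Counter(shard_for_write(i) for i in range(10)))
```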
25. A Necessary Burden
In the beginning, the burden of managing our own databases was non-negotiable.
26. The Landscape has Changed
We now have a variety of managed database services that solve these problems for us, such as Amazon RDS, Amazon DynamoDB, Amazon SimpleDB, and Amazon Redshift.
27. Cost Is Never Optimized
Application developers do not (and should not) optimize for database cost.
33. Thinking in Terms of Throughput
Amazon DynamoDB allows us to scale in terms of throughput, not machines. This is the future of resource provisioning.
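Provisioning by throughput can be made concrete with DynamoDB's capacity-unit arithmetic: one write capacity unit covers one write per second of an item up to 1 KB, and one read capacity unit covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads). The sketch below is illustrative; the function names and the workload numbers are hypothetical.

```python
import math


def write_capacity_units(writes_per_second: float, item_size_bytes: int) -> int:
    """Write capacity units needed for a steady write rate."""
    # Each write consumes one unit per 1 KB of item size, rounded up.
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return math.ceil(writes_per_second * units_per_write)


def read_capacity_units(
    reads_per_second: float,
    item_size_bytes: int,
    strongly_consistent: bool = True,
) -> int:
    """Read capacity units needed for a steady read rate."""
    # Each read consumes one unit per 4 KB of item size, rounded up.
    units_per_read = max(1, math.ceil(item_size_bytes / 4096))
    total = reads_per_second * units_per_read
    if not strongly_consistent:
        total /= 2  # eventually consistent reads cost half as much
    return math.ceil(total)


if __name__ == "__main__":
    # Hypothetical workload: 500 writes/s of 1,500-byte items.
    print(write_capacity_units(500, 1500))
    # Hypothetical workload: 1,000 reads/s of 3,000-byte items.
    print(read_capacity_units(1000, 3000))
```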
39. SELECT * FROM events WHERE IpAddress='5.6.7.8' ORDER BY Date DESC;
40. SELECT * FROM events WHERE IpAddress='5.6.7.8' AND Date<='2014-10-03' ORDER BY Date DESC;
GET /Accounts/2/Events?IpAddress=5.6.7.8&Date<=2014-10-03
41. AccountId (Hash)   Date (Range)   IpAddress_Date       Type
    2                  2014-10-03     5.6.7.8|2014-10-03   call
    2                  2014-10-01     5.6.7.8|2014-10-01   message
GET /Accounts/2/Events
AccountId=2, ScanIndexForward=false
42. AccountId (Hash)   IpAddress_Date (Range)   Date         Type
    2                  5.6.7.8|2014-10-03       2014-10-03   call
    2                  5.6.7.8|2014-10-01       2014-10-01   message
GET /Accounts/2/Events?IpAddress=5.6.7.8
AccountId=2, IpAddress_Date begins with “5.6.7.8|”, ScanIndexForward=false
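The key conditions above map onto a DynamoDB Query request. The sketch below builds the request parameters only and sends nothing; the table name `events` and the function name are hypothetical, and the expression syntax shown is the current `KeyConditionExpression` form rather than whatever request format the talk used.

```python
from typing import Optional


def build_events_query(account_id: int, ip_address: Optional[str] = None) -> dict:
    """Build keyword arguments for a DynamoDB Query on the events table."""
    params = {
        "TableName": "events",  # hypothetical table name
        "KeyConditionExpression": "AccountId = :account",
        "ExpressionAttributeValues": {":account": {"N": str(account_id)}},
        "ScanIndexForward": False,  # return items in descending range-key order
    }
    if ip_address is not None:
        # The composite range key starts with "<ip>|", so a prefix match
        # selects every event for that IP address.
        params["KeyConditionExpression"] += (
            " AND begins_with(IpAddress_Date, :prefix)"
        )
        params["ExpressionAttributeValues"][":prefix"] = {"S": f"{ip_address}|"}
    return params


if __name__ == "__main__":
    print(build_events_query(2, "5.6.7.8"))
```

Assuming boto3 is the client library, the result could be passed as `client.query(**params)`.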
43. AccountId (Hash)   IpAddress_Date (Range)   Date         Type
    2                  5.6.7.8|2014-10-03       2014-10-03   call
    2                  5.6.7.8|2014-10-01       2014-10-01   message
GET /Accounts/2/Events?IpAddress=5.6.7.8&Date<=2014-10-03
AccountId=2, IpAddress_Date LT “5.6.7.8|2014-10-03”, ScanIndexForward=false
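Why a string comparison on the composite key can stand in for a date filter is easiest to see in a small in-memory simulation. The sketch below is not DynamoDB code: `query_events` is a hypothetical helper, and the third row is invented to show that other IP addresses are excluded. It keeps the IP prefix as a lower bound and uses an inclusive upper bound to match `Date<=`; both choices are assumptions about the intended semantics rather than something the slide states.

```python
from typing import List, Optional

EVENTS = [
    {"AccountId": 2, "IpAddress_Date": "5.6.7.8|2014-10-03", "Type": "call"},
    {"AccountId": 2, "IpAddress_Date": "5.6.7.8|2014-10-01", "Type": "message"},
    # Hypothetical extra row for a different IP address.
    {"AccountId": 2, "IpAddress_Date": "5.6.7.9|2014-10-02", "Type": "call"},
]


def query_events(
    items: List[dict],
    account_id: int,
    ip_address: str,
    max_date: Optional[str] = None,
) -> List[dict]:
    """Simulate a range-key query on IpAddress_Date, newest first."""
    low = f"{ip_address}|"
    high = f"{ip_address}|{max_date}" if max_date else None
    matches = [
        item
        for item in items
        if item["AccountId"] == account_id
        and item["IpAddress_Date"].startswith(low)
        # ISO dates sort lexicographically, so string order is date order.
        and (high is None or item["IpAddress_Date"] <= high)
    ]
    # Equivalent of ScanIndexForward=false: descending range-key order.
    return sorted(matches, key=lambda item: item["IpAddress_Date"], reverse=True)


if __name__ == "__main__":
    for event in query_events(EVENTS, 2, "5.6.7.8", "2014-10-03"):
        print(event["IpAddress_Date"], event["Type"])
```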
44. Need to Handle Exceeded Throughput Failures
Exceeding provisioned throughput is a runtime error.
45. Handling Exceeded Write Throughput with Amazon SQS
Queuing events to Amazon SQS and processing them asynchronously allows us to gracefully deal with write throughput errors.
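The queue-then-write pattern can be sketched without any AWS dependency. In the sketch below a `deque` stands in for the SQS queue and `ThrottledTable` for a table with limited write capacity; `ThroughputExceeded`, `ThrottledTable`, and `drain` are all hypothetical names, not part of any AWS SDK.

```python
from collections import deque


class ThroughputExceeded(Exception):
    """Hypothetical stand-in for DynamoDB's throughput-exceeded error."""


class ThrottledTable:
    """In-memory table that accepts at most `capacity` writes per tick."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.items: list = []

    def put(self, item) -> None:
        if self.used >= self.capacity:
            raise ThroughputExceeded()
        self.used += 1
        self.items.append(item)

    def next_tick(self) -> None:
        """Start a new time window with fresh capacity."""
        self.used = 0


def drain(queue: deque, table: ThrottledTable) -> int:
    """Write queued events until the table throttles; return the count written."""
    written = 0
    while queue:
        try:
            table.put(queue[0])
        except ThroughputExceeded:
            break  # leave the event queued so a later pass can retry it
        queue.popleft()  # remove the event only after a successful write
        written += 1
    return written


if __name__ == "__main__":
    events = deque(["e1", "e2", "e3"])
    table = ThrottledTable(capacity=2)
    print(drain(events, table), "written,", len(events), "still queued")
```

Because an event leaves the queue only after its write succeeds, a throttled write delays the event instead of losing it.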
50. Brief History
2008 - 2011
All business intelligence queries run on replicas of MySQL clusters serving production traffic.
51. Brief History
2011 - 2013
Data pushed to Amazon S3 and queried with Pig on Amazon EMR, improving our ability to aggregate, but with high latency.
52. Brief History
2013 - Present
The move to Amazon Redshift cut the time these reports took from hours to seconds, allowing us to answer critical BI and financial questions in near real time.
53. Pushing Data Into Amazon Redshift
[Diagram components: Post Flight Service, Kafka, SQS (DLQ), Amazon S3 Loader, S3, Warehouse Loader, Amazon Redshift]
55. Managed Services as a Culture
Our focus on creating an experience that unifies and simplifies communications is reflected in our adoption of managed services.
56. Managed Services as a Culture
Understanding and focusing on our areas of expertise and leveraging managed services for the rest accelerates the delivery of value and innovation to our customers.