Presented at JAX London 2013
Hailo, the taxi app, has served more than 5 million passengers in 15 cities and has taken fares of $100 million this year. I'm going to talk about how that rapid growth has been powered by a platform based on Cassandra and operational analytics and insights powered by Acunu Analytics. I'll cover some challenges and lessons learned from scaling fast!
8. Facts and figures
• The world’s highest-rated taxi app – over 11,000 five-star reviews
• Over 500,000 registered passengers
• A Hailo hail is accepted around the world every 4 seconds
• Hailo operates in 15 cities on 3 continents from Tokyo to Toronto in
nearly 2 years of operation
JAXLONDON2013
9. Hailo is growing
• Hailo is a marketplace that facilitates over $100M in run-rate
transactions and is making the world a better place for passengers
and drivers
• Hailo has raised over $50M in financing from the world's best
investors including Union Square Ventures, Accel, the founder of
Skype (via Atomico), Wellington Partners (Spotify), Sir Richard
Branson, and our CEO's mother, Janice
11. “NoSQL DBs trade off traditional features to better
support new and emerging use cases”
Andy Gross, Riak
http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
12. What are we trading off?
• More widely used, tested and documented software
• Ad-hoc querying
• Talent pool with direct experience
13. What do we get back in return?
• High availability
• Scalability
• Operational simplicity
17. Consistency level (CL)
How many replicas must respond to declare success?
Level          Description
ONE            1st response
QUORUM         N/2 + 1 replicas
LOCAL_QUORUM   N/2 + 1 replicas in local data centre
EACH_QUORUM    N/2 + 1 replicas in each data centre
ALL            All replicas
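The arithmetic behind the table can be sketched in a few lines of Python. This is an illustration only, not Cassandra's implementation: the function names and the `rf_by_dc` mapping (data centre name to replication factor) are assumptions made for the example, and "N/2 + 1" is read as integer division.

```python
def quorum(replicas: int) -> int:
    """Quorum size for a replication factor N: floor(N / 2) + 1."""
    if replicas < 1:
        raise ValueError("replication factor must be at least 1")
    return replicas // 2 + 1


def required_responses(level: str, rf_by_dc: dict, local_dc: str) -> int:
    """Replica responses needed before a request is declared successful.

    rf_by_dc is a hypothetical mapping of data centre name -> replication factor.
    """
    total = sum(rf_by_dc.values())
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(total)
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])
    if level == "EACH_QUORUM":
        # A quorum is needed in every data centre.
        return sum(quorum(rf) for rf in rf_by_dc.values())
    if level == "ALL":
        return total
    raise ValueError(f"unknown consistency level: {level}")
```

With three replicas in each of two data centres, LOCAL_QUORUM waits for 2 responses, while QUORUM and EACH_QUORUM both wait for 4.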
18. Big Table
• Sparse column based data model
• SSTable disk storage
• Append-only commit log
• Memtable (buffer and sort)
• Immutable SSTable files
• Compaction
http://research.google.com/archive/bigtable-osdi06.pdf
http://www.slideshare.net/geminimobile/bigtable-4820829
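The write path listed above can be illustrated with a toy in-memory model. This is a deliberately simplified sketch and not how Cassandra or Bigtable is implemented: the class name `TinyLSM`, the `memtable_limit` parameter and the use of Python tuples as "SSTables" are all assumptions made for the example.

```python
class TinyLSM:
    """Toy model: commit log -> memtable -> immutable sorted tables -> compaction."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))  # log first, never rewritten
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Sort the memtable and freeze it as an immutable table."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Check the memtable first, then tables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all tables into one, keeping the newest value per key."""
        merged = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []
```

Writes never modify an existing table; an overwritten key simply appears again in a newer table, and compaction later discards the stale copy.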
25. Hailo launched in London in November 2011
• Launched on AWS
• Two PHP/MySQL web apps plus a Java backend
• Mostly built by a team of 3 or 4 backend engineers
• MySQL multi-master for single AZ resilience
26. Why Cassandra?
• A desire for greater resilience – “become a utility”
Cassandra is designed for high availability
• Plans for international expansion around a single consumer app
Cassandra is good at global replication
• Expected growth
Cassandra scales linearly for both reads and writes
• Prior experience
I had experience with Cassandra and could recommend it
27. The path to adoption
• Largely unilateral decision by developers – a result of a startup
culture
• Replacement of key consumer app functionality, splitting up the
PHP/MySQL web app into a mixture of global PHP/Java services
backed by a Cassandra data store
• Launched into production in September 2012 – originally just
powering North American expansion, before gradually switching
over Dublin and London
28. One year on...
• Further breakdown of functionality into Go/Java SOA
• Migrating all online databases to Cassandra
33. Considerations for entity storage
• Do not read the entire entity, update one property and then write
back a mutation containing every column
• Only mutate columns that have been set
• This avoids read-before-write race conditions
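A minimal sketch of the idea, assuming a hypothetical `Entity` wrapper that remembers which properties were set. The plain dict used as `store` and the `apply_mutation` helper stand in for a column-level write; neither is a real client API.

```python
class Entity:
    """Tracks which properties were set so only those columns are written."""

    def __init__(self, row_key, columns=None):
        self.row_key = row_key
        self._columns = dict(columns or {})
        self._dirty = set()

    def set(self, name, value):
        self._columns[name] = value
        self._dirty.add(name)

    def mutation(self):
        """Only the columns changed since load, not the whole entity."""
        return {name: self._columns[name] for name in sorted(self._dirty)}


def apply_mutation(store, row_key, mutation):
    """Stand-in for a column-level write: merges columns into the stored row."""
    store.setdefault(row_key, {}).update(mutation)


# Two writers load the same entity and change different properties.
store = {"driver:1": {"name": "Ann", "phone": "111", "status": "offline"}}
writer_a = Entity("driver:1", store["driver:1"])
writer_b = Entity("driver:1", store["driver:1"])
writer_a.set("phone", "222")
writer_b.set("status", "online")
apply_mutation(store, writer_a.row_key, writer_a.mutation())
apply_mutation(store, writer_b.row_key, writer_b.mutation())
```

Because each mutation carries only the columns that writer set, both changes survive. Had each writer written back every column, the second write would have restored the stale phone number.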
38. Considerations for time series storage
• Choose row key carefully, since this partitions the records
• Think about how many records you want in a single row
• Denormalise on write into many indexes
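One way to picture these considerations is a row key made of an index name, an index value and a time bucket, with each event written once per index. This is a sketch under assumptions: the key layout, the daily bucket, the field names `driverId` and `city`, and the dict used as `store` are all illustrative, not Hailo's schema.

```python
from datetime import datetime, timezone


def row_key(index_name, index_value, ts, bucket_format="%Y%m%d"):
    """Row key = index + value + time bucket; the bucket caps the row's size."""
    return f"{index_name}:{index_value}:{ts.strftime(bucket_format)}"


def write_event(store, event, index_fields):
    """Denormalise on write: store the event under one row per index."""
    ts = event["timestamp"]
    # Column name starts with the timestamp so columns sort by time within a row.
    column_name = (ts.isoformat(), event["id"])
    keys = []
    for field in index_fields:
        key = row_key(field, event[field], ts)
        store.setdefault(key, {})[column_name] = event
        keys.append(key)
    return keys


store = {}
event = {
    "id": "e1",
    "timestamp": datetime(2013, 10, 29, 12, 0, tzinfo=timezone.utc),
    "driverId": "d7",
    "city": "LON",
}
keys = write_event(store, event, ["driverId", "city"])
```

A coarser bucket (for example `"%Y%m"`) gives fewer, larger rows; a finer one gives more, smaller rows. Reading "all events for driver d7 on a given day" then touches a single row.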
39. Analytics
• With Cassandra we lost the ability to carry out analytics,
e.g. COUNT, SUM, AVG, GROUP BY
• We use Acunu Analytics to give us this ability in real time, for preplanned query templates
• It is backed by Cassandra and therefore highly available, resilient
and globally distributed
• Integration is straightforward (HTTP POST)
42. Get a picture of driver supply
SELECT COUNT DISTINCT(driverId)
FROM driverLocs
WHERE timestamp BETWEEN '1 day ago' AND 'now'
GROUP BY timestamp(hour)

SELECT COUNT
FROM driverLocs
WHERE timestamp BETWEEN '1 day ago' AND 'now'
GROUP BY latitude(0.01), longitude(0.01)
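To make clear what the two queries compute, here is a plain-Python equivalent over an in-memory list of location records. It is a sketch only: it assumes each record is a dict with `timestamp`, `driverId`, `latitude` and `longitude` keys, and that `latitude(0.01)` means bucketing into 0.01-degree cells. It says nothing about how Acunu Analytics evaluates the queries.

```python
import math
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def distinct_drivers_per_hour(locs, now):
    """COUNT DISTINCT(driverId) over the last day, grouped by hour."""
    since = now - timedelta(days=1)
    seen = defaultdict(set)
    for loc in locs:
        if since <= loc["timestamp"] <= now:
            hour = loc["timestamp"].replace(minute=0, second=0, microsecond=0)
            seen[hour].add(loc["driverId"])
    return {hour: len(drivers) for hour, drivers in seen.items()}


def points_per_cell(locs, now, cell=0.01):
    """COUNT over the last day, grouped by a latitude/longitude grid.

    Cells are identified by integer indices (floor(coordinate / cell)).
    """
    since = now - timedelta(days=1)
    counts = Counter()
    for loc in locs:
        if since <= loc["timestamp"] <= now:
            cell_id = (
                math.floor(loc["latitude"] / cell),
                math.floor(loc["longitude"] / cell),
            )
            counts[cell_id] += 1
    return counts
```

The first result gives a per-hour picture of how many drivers were active; the second gives a density grid of where location updates came from.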
49. Stats Cluster
• AWS VPCs with OpenVPN links
• 3 AZs per region
• m1.large machines
• ~1TB/node
• Provisioned IOPS EBS
Operational Cluster
• ~200GB/node
50. Backups
• SSTable snapshot
• Used to upload to S3, but this was taking >6 hours and consuming
all our network bandwidth
• Now take EBS snapshot of the data volumes
51. Encryption
• Requirement for NYC launch
• We use dm-crypt to encrypt the entire EBS volume
• Chose dm-crypt because it is uncomplicated
• Our tests show a 1% hit in disk performance, which
agrees with what Amazon suggests
53. Multi DC
• Something that Cassandra makes trivial
• Would have been very difficult to accomplish active-active inter-DC
replication with a team of 2 without Cassandra
• Rolling repair needed to make it safe (we use LOCAL_QUORUM)
• We schedule “narrow repairs” on different nodes in our cluster each
night
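A nightly rotation like this can be sketched as follows. The slides do not say what a "narrow repair" is or how it is scheduled; this example assumes it means a primary-range repair (`nodetool repair -pr`) and that one node is picked per night in round-robin order. The host addresses and function names are hypothetical.

```python
from datetime import date, timedelta


def node_for_night(nodes, night):
    """Pick one node per night, cycling through the cluster in order."""
    return nodes[night.toordinal() % len(nodes)]


def repair_command(host):
    """Command line for a primary-range repair on one node.

    Assumption: "narrow repair" is read here as `nodetool repair -pr`.
    """
    return ["nodetool", "-h", host, "repair", "-pr"]


nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical addresses
start = date(2013, 10, 29)
schedule = [node_for_night(nodes, start + timedelta(days=i)) for i in range(len(nodes))]
```

Over any run of `len(nodes)` consecutive nights every node is repaired once. With a larger cluster, the cycle must still finish within the tables' `gc_grace_seconds` window, so several nodes per night may be needed.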
54. Compression
• Our stats cluster was running at ~1.5TB per node
• We didn’t want to add more nodes
• With compression, we are now back to ~600GB
• Easy to accomplish
• `nodetool upgradesstables` on a rolling schedule
56. “The days of the quick and dirty are over”
Simon V, EVP Operations
57. Technically, everything is fine…
• Our COO feels that C* is “technically good and beautiful”, a
“perfectly good option”
• Our EVPO says that C* reminds him of a time series database in
use at Goldman Sachs that had “very good performance”
…but there are concerns
58. People who can attempt to query MySQL vs. people who can attempt to query Cassandra
63. Lesson learned
• Have an advocate - get someone who will sell the vision internally
• Learn the theory - teach each team member the fundamentals
• Make an effort to get everyone on board
70. Lesson learned
• Be pro-active with Cassandra, even if it seems to be running
smoothly
• Peer-review data models, take time to think about them
• Big rows are bad - use cfstats to look for them
• Mixed workloads can cause problems - use cfhistograms and look
out for signs of data modeling problems
• Think about the compaction strategy for each CF
72. Lessons learned
• EBS is nearly always the cause of Amazon outages
• EBS is a single point of failure (it will fail everywhere in your cluster)
• EBS is slow
• EBS is expensive
• EBS is unnecessary!
74. Lessons learned
• Keep the business informed – explain the tradeoffs in simple terms
• Sing from the same hymn sheet
• Make sure there are solutions in place for every use case from the
beginning
75. People who can attempt to query MySQL vs. people who can attempt to query Cassandra
77. We like Cassandra
• Solid design
• HA characteristics
• Easy multi-DC setup
• Simplicity of operation
78. Lessons for successful adoption
• Have an advocate, sell the dream
• Learn the fundamentals, get the best out of Cassandra
• Invest in tools to make life easier
• Keep management in the loop, explain the tradeoffs
79. The future
• We will continue to invest in Cassandra as we expand globally
• We will hire people with experience running Cassandra
• We will focus on expanding our reporting facilities
• We aspire to extend our network (1M consumer installs, wallet)
beyond cabs
• We will continue to hire the best engineers in London, NYC and Asia