Choosing the Right Database - Facebook DevC Malang Hackdays 2017
1. Choosing the Right Database
Rendy B. Junior - Data Eng Lead @ Traveloka
Facebook DevC Malang Hackdays 2017
2. About
Traveloka
Traveloka is a leading Southeast Asian technology company that serves a wide
range of travel needs in one platform. Its app has been downloaded more than 20 million
times, making it one of the most popular travel booking apps in the region.
Rendy B. Junior
Data Engineering Lead, Traveloka
Has gone through the revolution of Traveloka's data pipeline since the early phase of the
organization. Established both the batch and real-time data platforms, which power
organizational insights and tens of data-intensive application use cases. Has handled
10,000x growth of data over the past 3 years.
4. How we use our data
● Business Intelligence
● Analytics
● Personalization
● Fraud Detection
● Ad optimization
● Cross-selling
● A/B testing
● etc.
5. Why do you need a database?
6. Why do you need a database?
So our service can remember something.
To store state.
Our case:
- Available flights / hotels (inventory)
- Customer bookings
- Payments
7. Why do you need a database?
Common client-server architecture: the database serves a backend service.
(Don’t expose it to clients / the public…)
[Diagram: Android app (internal storage, etc.) and web browser (cookie) → backend service → database]
9. Know your use case
Examples:
● Login feature: you need to store user data
● E-commerce with search, add to cart, and payment: you need to
store inventory, cart, and payment data
10. Common requirements
Common requirements for early-stage apps:
Transactional / OLTP (that’s it)
What usually works: MySQL, PostgreSQL
Logical modeling:
Business requirement → identify entities → ERD → normalize
Physical modeling: define the schema (fields & types), the PK, and the
indexes
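The modeling steps above can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for MySQL / PostgreSQL; the tables (users, bookings), their columns, and the index name are hypothetical and not taken from the talk.

```python
import sqlite3

# In-memory SQLite stands in for MySQL / PostgreSQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Physical model: fields and types, primary keys, and one index.
# Table and column names are hypothetical.
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    email   TEXT NOT NULL UNIQUE
);
CREATE TABLE bookings (
    booking_id INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    status     TEXT NOT NULL,
    created_at TEXT NOT NULL
);
-- Index chosen from the query pattern "list the bookings of a user".
CREATE INDEX idx_bookings_user ON bookings(user_id);
""")

# OLTP-style write: both inserts commit together or not at all.
with conn:
    conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
    conn.execute(
        "INSERT INTO bookings VALUES (10, 1, 'booked', '2017-01-01T10:00:00')"
    )

rows = conn.execute(
    "SELECT booking_id, status FROM bookings WHERE user_id = ?", (1,)
).fetchall()
print(rows)
```

The normalized form keeps user attributes in one place, and the index follows directly from the query the application is expected to run.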
12. I need analytics on my transactions!
Common (temporary) solution for analytics in early-stage apps:
A replica / slave of the transactional DB
Example: MySQL master-slave replication
Don’t do analytics on the master… ever…
Don’t use SQL scripts; use analytics tools (there are a lot of open source tools, e.g. Superset)
Don’t stay with this solution! Use it temporarily (explained in a later slide)
[Diagram: Master → Slave → Analytics tools]
13. I need user activity insights!
Several things are not really transactional…
But we need to get insights from them.
● How often do users use the sort feature? Has it ever been clicked?
● How many searches per day? What’s the trend for the past 7 days?
We need to send and store user activity, and be able to get
insights from it. How?
14. I need user activity insights!
Send your user activity to the backend (usually called tracking).
Usually an RDBMS like MySQL will work for small data, but won’t work for huge data.
● It is not designed for high write activity; at some point your throughput will get stuck
● Normally it is 6TB max, and user activity is usually a lot more than that
● It is not designed for analytics workloads (aggregates over a huge number of rows)
More activity = slow user activity database = slow app
tl;dr: eventually it will slow down, and by then you’ll need something different
[Diagram: Android app → backend service → database]
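As a small illustration of tracking, the sketch below keeps a handful of user activity events in memory and answers the two kinds of questions raised earlier: whether the sort feature has ever been clicked, and how many searches happened per day over the past 7 days. The event names and fields are hypothetical, and an in-memory list only works at toy scale.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical tracking events as the backend might receive them.
events = [
    {"event": "search", "user_id": 1, "date": date(2017, 1, 1)},
    {"event": "search", "user_id": 2, "date": date(2017, 1, 1)},
    {"event": "sort_click", "user_id": 1, "date": date(2017, 1, 1)},
    {"event": "search", "user_id": 1, "date": date(2017, 1, 2)},
]

def daily_counts(events, event_name, end, days=7):
    """Count events named `event_name` per day over the `days` days ending at `end`."""
    start = end - timedelta(days=days - 1)
    counts = Counter(
        e["date"] for e in events
        if e["event"] == event_name and start <= e["date"] <= end
    )
    # Include days with zero events so the trend has no gaps.
    return [
        (start + timedelta(days=i), counts[start + timedelta(days=i)])
        for i in range(days)
    ]

trend = daily_counts(events, "search", end=date(2017, 1, 2))
sort_ever_clicked = any(e["event"] == "sort_click" for e in events)
```

The same aggregation over billions of events is where a transactional RDBMS starts to struggle.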
15. I need user activity insights!
Common solution (so-called big data):
High write activity? Handled by a datahub such as Kafka.
Huge data size? Handled by a datalake such as HDFS or S3.
Analytics queries failing on huge data? Handled by Hadoop or Spark.
Important note:
Use them only when you really need them! Those (fancy) things are hard to manage…
Don’t bother if your scale is not there yet.
[Diagram: Android app, backend service, datahub, datalake]
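The role of the datahub can be sketched without any big data tooling. In the illustration below a standard-library queue stands in for a Kafka topic, and a newline-delimited JSON string stands in for a file written to the datalake. The function names are hypothetical; a real setup would use the actual Kafka and HDFS / S3 clients.

```python
import json
import queue

# Stand-in for a datahub topic; a real setup would use e.g. Kafka.
hub = queue.Queue()

def track(event):
    """Called by the backend: enqueue the event and return immediately."""
    hub.put(event)

def drain_batch(hub, max_batch=1000):
    """Consumer side: pull up to `max_batch` events off the hub."""
    batch = []
    while len(batch) < max_batch:
        try:
            batch.append(hub.get_nowait())
        except queue.Empty:
            break
    return batch

def to_datalake_lines(batch):
    """Serialize a batch as newline-delimited JSON, e.g. one file per batch."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in batch)

for i in range(3):
    track({"event": "search", "user_id": i})

batch = drain_batch(hub)
lines = to_datalake_lines(batch)
```

The backend only pays for a cheap enqueue, while writes to storage happen in large sequential batches, which is the property that absorbs high write activity.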
16. Remember our replica / slave?
(Cont.) Don’t stay on the replica / slave.
Use it for a short period of time only (the early stage of your app).
Why?
● Eventually the replica will not be able to handle your load
● You will eventually want to join data from different databases (not only MySQL)
● You will want to store your processed data as well
Overall, accessing your DB directly is not a good approach; it goes against best practice...
17. Transactions and user activity are not so different
So? Treat your transaction data like user activity data: send it as tracking.
E.g. you have booking data with two statuses: booked and paid. Send it as two tracking events:
1) booking event
2) payment event
along with all the details such as booking id, product id, timestamp, etc.
So instead of having one booking table with two possible statuses, you’ll have two
tables, one for each status.
This is called immutable data, a log.
Read more: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
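A minimal sketch of the idea, assuming two in-memory lists in place of the two event tables. The function and field names are hypothetical. Events are only ever appended, and the current status of a booking is derived from the logs rather than stored in a mutable column.

```python
# Append-only logs: events are never updated in place.
booking_events = []   # hypothetical "booking event" table
payment_events = []   # hypothetical "payment event" table

def record_booking(booking_id, product_id, ts):
    booking_events.append(
        {"booking_id": booking_id, "product_id": product_id, "ts": ts}
    )

def record_payment(booking_id, ts):
    payment_events.append({"booking_id": booking_id, "ts": ts})

def current_status(booking_id):
    """Derive the status from the logs instead of reading a status column."""
    if any(e["booking_id"] == booking_id for e in payment_events):
        return "paid"
    if any(e["booking_id"] == booking_id for e in booking_events):
        return "booked"
    return None

record_booking("B1", "P9", "2017-01-01T10:00:00")
record_booking("B2", "P3", "2017-01-01T10:05:00")
record_payment("B1", "2017-01-01T10:02:00")
```

Because nothing is overwritten, the history (when a booking was made, when it was paid) stays available for analytics.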
18. So one architecture for analytics & transactions
Both transaction and analytics events flow here...
[Diagram: Android app, backend, tracking service, datahub, datalake]
19. Fully managed alternative
Alternative: Google Analytics, Flurry, etc.
Pros:
Good for the early stage of an app / website; the solution is already quite complete.
Simple to manage.
Cons:
You don’t own your data
● Eventually you’ll need to query it yourself
● Or build data processing that enables a feature such as recommendations
Eventually you have to build your own. Data is an important asset.
ML, a promising field for competitive advantage, is also based on data.
21. We no longer live in an RDBMS-only world...
Not-so-common application requirements
Several examples:
● I want to store text and then search it. Use an inverted index
such as Solr or Elasticsearch.
● I want to create a social network. Use a graph database such as
Neo4j.
● I need to store system metrics and aggregate them. Use a time
series DB such as Graphite.
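To make the first example concrete, here is a toy inverted index. It is only a sketch of the underlying data structure, not how Solr or Elasticsearch are implemented; it splits on whitespace and does no stemming or ranking. The sample documents are made up.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {
    1: "cheap flight to Malang",
    2: "hotel in Malang",
    3: "cheap hotel deals",
}
index = build_index(docs)
```

A lookup touches only the entries for the query words, instead of scanning every document the way a `LIKE '%word%'` query would in an RDBMS.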
22. Know the databases available out there (outline item 3)
The problem, as well as the blessing: there are tons of them!
● RDBMS: MySQL, PostgreSQL
● Document-based: MongoDB, DynamoDB
● Columnar database: Redshift, BigQuery
● NewSQL: Aurora, Spanner
● Time series: InfluxDB
● Inverted index: Elasticsearch
● Distributed filesystem: HDFS, S3, GCS
● Graph: Neo4j, ArangoDB
● Cache: Redis, Memcached
● Key-value: HBase, DynamoDB
And many more…
Knowing the concepts and common use cases is the key.
E.g. a transactional / operational DB is best served by an RDBMS, while columnar is usually used for analytics.
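The row versus columnar distinction can be illustrated with plain Python structures. This is a conceptual sketch with made-up data, not a description of how Redshift or BigQuery store data internally.

```python
# Row-oriented layout: each record is stored together (suits OLTP lookups).
rows = [
    {"booking_id": 1, "user_id": 10, "price": 100},
    {"booking_id": 2, "user_id": 11, "price": 250},
    {"booking_id": 3, "user_id": 10, "price": 50},
]

# Column-oriented layout: each column is stored together (suits aggregates).
columns = {
    "booking_id": [1, 2, 3],
    "user_id": [10, 11, 10],
    "price": [100, 250, 50],
}

# Transactional query: fetch one whole record by key.
booking = next(r for r in rows if r["booking_id"] == 2)

# Analytical query: aggregate one field over all records.
total_from_rows = sum(r["price"] for r in rows)   # walks every full record
total_from_columns = sum(columns["price"])        # reads a single column
```

With ~100 columns per row, an aggregate over one column reads roughly 1% of the data in the columnar layout, which is why that layout is the usual choice for analytics.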
24. Sample case in Traveloka
Use cases:
business intelligence, personalization, fraud detection, A/B testing, etc.
[Architecture diagram. Sources: Traveloka app (Android, iOS), Traveloka services. Ingest: Kafka, Cloud Pub/Sub, batch ingest. Pipelines: ETL (streaming and batch), Cloud Dataflow. Storage: S3 data lake, Cloud Storage, NoSQL DB, data warehouse. Analytics / query: BigQuery, Hive, Presto. Consumers of data: DOMO analytics UI. Also shown: monitoring, logging.]
25. Epilogue: Lessons learned from Traveloka (outline item 6)
● Always choose technology based on your requirements, not
because it is fancy
● Be careful of gotchas! There's no silver bullet… (example: Google
“MySQL limitation”)
● Eventually, you’ll need to adapt to your growth by:
○ Splitting use cases based on query pattern and latency (see appendix)
○ Choosing scalable tech based on growth estimates (need to test!)
○ Autoscaling! (and using managed services if possible)
● Keep learning; new databases keep coming
● New to these things? Ask around! It saves you a lot of time
28. Outline
1. Define your database requirement
2. Logical data modeling
3. Know databases available out there
4. Choose database, physical data modeling
5. Proof of concept, test
6. Epilogue: Lessons learned from Traveloka
29. Define database requirements (outline item 1)
● Understand the business requirement
● Query patterns - try to write your query as a sentence or in SQL
● Expected queries/s, inserts/s, updates/s - peak and low
● Latency SLA - subsecond? seconds? minutes?
● Consistency - strong or eventual?
● Consider data growth - will the DB survive in 3 years? 5 years?
● Retention policy - is data > 1 year old still relevant?
Don’t forget basic requirements
● e.g. high availability, reliability
Be aware of constraints
● Cost, budget
● Maintainability (managed/self-service, community/proprietary)
30. Define database requirements - case study (outline item 1)
● Business requirement:
○ We want to recommend items based on the type of a user's latest purchase,
showing relevant items on the home page to increase user engagement
● Write your query as a sentence:
○ For user id = 1234, return the latest purchase item type
○ or in SQL: select latest_purchase_item_type from user where id = 1234
● Expected queries per second:
○ I want to show recommendations on the homepage, and the homepage is viewed
1,000 times/s during peak and 100 times/s during low hours
● Latency SLA:
○ Max query latency of 200ms at the 95th percentile, for a smooth UX
31. Define database requirements - case study (outline item 1)
● Consistency - strong or eventual?
○ Eventual consistency is OK; max data lag is 1 hour
● Consider data growth
○ The business expects user growth of 3x next year and 10x in 3 years
● Retention policy - is data > 1 year old still relevant?
○ We cannot delete user data, but purchase data can be deleted after 1 year
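The growth figures can be turned into a rough load projection. The sketch below uses the homepage traffic and growth factors from the case study; treating query load as proportional to user count is an assumption made here for illustration, not something the case study states.

```python
# Figures from the case study. Linear scaling of load with users is an assumption.
peak_qps_now = 1_000
low_qps_now = 100
growth = {"next year": 3, "in 3 years": 10}

def projected_qps(qps_now, factor):
    """Project queries per second, assuming load grows in proportion to users."""
    return qps_now * factor

projection = {
    horizon: (projected_qps(low_qps_now, f), projected_qps(peak_qps_now, f))
    for horizon, f in growth.items()
}
for horizon, (low, peak) in projection.items():
    print(f"{horizon}: {low}-{peak} queries/s")
```

Under that assumption the database should be sized and tested for about 10,000 reads/s at peak, not for today's 1,000.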
32. Logical data modeling (outline item 2)
● Think of entities and their properties
● PK - what makes each row unique?
● Indexes you might need - refer back to your query patterns
Case study
● There is a user entity, with the latest purchase item as its
property
● User id as the PK is ideal
● No need for other indexes
33. Know the databases available out there (outline item 3)
The problem, as well as the blessing: there are tons of them!
● RDBMS: MySQL, PostgreSQL
● Document-based: MongoDB, DynamoDB
● Columnar database: Redshift, BigQuery
● NewSQL: Aurora, Spanner
● Time series: InfluxDB
● Inverted index: Elasticsearch
● Distributed filesystem: HDFS, S3, GCS
● Graph: Neo4j, ArangoDB
● Cache: Redis, Memcached
● Key-value: HBase, DynamoDB
And many more…
Knowing the concepts and common use cases is the key.
E.g. a transactional / operational DB is best served by an RDBMS, while columnar is usually used for analytics.
34. Choose database, physical data modeling (outline item 4)
Choose database
● Use your database requirements.
○ Our case study: queries are always by key, so it is a key-value use case
● Shortlist DBs that commonly solve your use case.
○ Our case study: MongoDB, DynamoDB, HBase, Bigtable
● Do a cost-benefit analysis; find the trade-offs
● Look for gotchas; google something like “problems with xxDB”
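The "query always by key" pattern from the case study looks like this in miniature. A Python dict stands in for the key-value store; the key layout (`user:<id>`) and the function names are hypothetical.

```python
# A dict stands in for a key-value store such as those shortlisted above.
# Key layout and field names are hypothetical.
store = {}

def put_latest_purchase(user_id, item_type):
    """Overwrite the single value kept per user key."""
    store[f"user:{user_id}"] = {"latest_purchase_item_type": item_type}

def get_latest_purchase(user_id):
    """Single lookup by key; no joins, scans, or secondary indexes."""
    value = store.get(f"user:{user_id}")
    return value["latest_purchase_item_type"] if value else None

put_latest_purchase(1234, "hotel")
put_latest_purchase(1234, "flight")   # a newer purchase replaces the old value
```

If every read and write fits these two functions, the relational features of an RDBMS are not needed, and the choice comes down to cost, operations, and gotchas among the shortlisted stores.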
35. Choose database, physical data modeling (outline item 4)
Physical data modeling in the old days:
● Choose between MySQL and PostgreSQL
● Create an ERD, convert it to normalized form
● Define the PK and indexes
Nowadays:
How data is modeled physically differs from one database to another!
Several examples:
● In MongoDB, a logical object with nested properties can be stored as is
● In HBase, defining the row key is crucial
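Both examples can be sketched in a few lines. The document below is a plain Python dict serialized as JSON, standing in for a MongoDB document. The row key function shows one common HBase-style pattern (entity id plus reversed timestamp) and is offered as an assumption about a reasonable design, not as the layout used at Traveloka. All field names and values are hypothetical.

```python
import json

# Document store style: the nested object is stored as is.
user_doc = {
    "_id": 1234,
    "name": "Ayu",
    "latest_purchase": {"item_type": "flight", "ts": 1483264800},
}
stored = json.dumps(user_doc)     # serialized unchanged, nesting intact
restored = json.loads(stored)

MAX_TS = 10**10  # upper bound for epoch seconds in this sketch

def row_key(user_id, ts):
    """Composite key: rows sort by user, then newest event first.

    Keys are compared as strings, so both parts are zero-padded.
    """
    return f"{user_id:010d}#{MAX_TS - ts:010d}"

keys = sorted([
    row_key(1234, 1483264800),
    row_key(1234, 1483268400),   # one hour later
    row_key(42, 1483264800),
])
```

Since a store like HBase keeps rows sorted by key, this layout makes "latest events for user X" a short scan from the start of that user's key range. A poorly chosen key would force a full scan instead.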
36. Proof of concept, test! (outline item 5)
● Do capacity planning
○ Row count: ~500 million
○ Number of columns: ~100
○ Size: ~10TB
○ Updates: 100-1,000 updates/s; reads: 10,000 reads/s
○ etc...
● Plan the test with part of the expected capacity, e.g. 1/10
○ Define success criteria, e.g. query latency of 200ms at the 95th percentile
● Load test data
● Test queries and inserts/updates
○ Cross-check against the success criteria
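Checking the success criterion can be automated. The sketch below computes a nearest-rank 95th percentile over measured latencies and compares it with the 200ms limit. The sample latencies are made up, and nearest-rank is only one of several percentile definitions.

```python
def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no measurements")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, to avoid floating-point surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

def meets_sla(latencies_ms, limit_ms=200, pct=95):
    """Success criterion from the test plan: p95 latency within the limit."""
    return percentile(latencies_ms, pct) <= limit_ms

# Hypothetical measurements from a load test, in milliseconds.
latencies_ms = [50] * 90 + [150] * 6 + [900] * 4
p95 = percentile(latencies_ms, 95)
```

Note that the average of these samples would hide the slow 4%, while a maximum would fail the test on a single outlier; the percentile states exactly how much tail latency is tolerated.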