Choosing the Right Database - Facebook DevC Malang Hackdays 2017

Choosing the Right Database
Rendy B. Junior - Data Eng Lead @ Traveloka
Facebook DevC Malang Hackdays 2017

About
Traveloka
Traveloka is a leading Southeast Asia technology company that provides a wide
range of travel needs in one platform. Its app has been downloaded more than 20 million
times, making it one of the most popular travel booking apps in the region.
Rendy B Junior
Data Engineering Lead, Traveloka
Has been through the revolution of Traveloka’s data pipeline since the early phase of the
organization. Established both batch and real-time data platforms, which power
organization insights and tens of data-intensive application use cases. Managed to
handle 10,000x growth of data over the past 3 years.

How we use our data
● Business Intelligence
● Analytics
● Personalization
● Fraud Detection
● Ads optimization
● Cross selling
● AB Test
● etc.

Why do you need a database?

So our service could remember something.
To store state.
Our case:
- Flights / hotel available (inventory)
- Customer Bookings
- Payment

Common client-server architecture. The database serves a backend service.
(don’t expose it to clients / the public…)
[Diagram: Android App (internal storage, etc.) and Web browser (cookie) → Backend Service → Database]

Back to basics
What’s your requirement?

Know your use case
Examples:
● Login feature, need to store user data
● Ecommerce with search - add to cart - payment, need to
store inventory, cart, payment

Common requirements for early stage apps:
Transactional / OLTP (that’s it)
What usually works: MySQL, PostgreSQL
Logical modeling:
Biz requirement → identify entity → ERD → Normalize
Physical modeling: Define schema (field & type), PK, define
indexes
Common requirements
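A minimal sketch of the physical modeling step (schema, PK, indexes), assuming an in-memory SQLite database as a stand-in for MySQL / PostgreSQL. The `users` and `cart_item` tables and their columns are hypothetical, not from the talk.

```python
import sqlite3

# In-memory SQLite stands in for MySQL / PostgreSQL here.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE cart_item (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(id),
    product_id INTEGER NOT NULL,
    quantity   INTEGER NOT NULL
);
-- Index chosen from the query pattern "show the cart of a given user".
CREATE INDEX idx_cart_item_user ON cart_item(user_id);
""")

conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
conn.executemany(
    "INSERT INTO cart_item (user_id, product_id, quantity) VALUES (?, ?, ?)",
    [(1, 101, 2), (1, 102, 1)],
)
conn.commit()

# The query the index was defined for.
cart = conn.execute(
    "SELECT product_id, quantity FROM cart_item "
    "WHERE user_id = ? ORDER BY product_id",
    (1,),
).fetchall()
```

The index follows the query pattern, not the other way around: write the queries first, then decide which columns need an index.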

How about analytics?

Common (temporary) solution for analytics in early stage apps:
Replica / Slave on transactional DB
Example: MySQL master slave replication
Don’t do analytics in Master… ever…
Don’t use SQL script, use analytics tools (a lot of open source tools, e.g. Superset)
Don’t stay at this solution! Use it temporarily (will explain in later slide)
I need analytics on my transaction!
[Diagram: Master → Slave → Analytics Tools]

Some things are not really transactional…
But we need to get insights from them.
● How many users use the sort feature? Has it ever been clicked?
● How many searches per day? What’s the trend for the past 7 days?
We need to send and store user activity, and be able to get
insight from it, how?
I need user activity insights!
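A small sketch of the two questions above, assuming user activity arrives as (event name, day) pairs. The event names and sample data are hypothetical and only illustrate the kind of aggregation an activity store has to answer.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical tracking events: (event name, day it happened).
events = [
    ("search", date(2017, 11, 1)),
    ("search", date(2017, 11, 1)),
    ("sort_click", date(2017, 11, 1)),
    ("search", date(2017, 11, 2)),
    ("search", date(2017, 11, 7)),
]

def daily_counts(events, name, last_day, days=7):
    """Return [(day, count)] for `name` over the `days` days ending at `last_day`."""
    counts = Counter(day for event, day in events if event == name)
    window = [last_day - timedelta(days=offset) for offset in range(days - 1, -1, -1)]
    return [(day, counts[day]) for day in window]

# "How many searches per day? What's the trend for the past 7 days?"
search_trend = daily_counts(events, "search", last_day=date(2017, 11, 7))

# "Has the sort feature ever been clicked?"
sort_clicks = sum(1 for event, _ in events if event == "sort_click")
```

This is an aggregate over every stored event, which is why the workload looks so different from a transactional lookup by key.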

Send your user activity to backend (usually called tracking)
Usually an RDBMS like MySQL will work for small data, but won’t work for huge data.
● It is not designed for high write activity; at some point your throughput will get stuck
● Storage is normally capped at around 6TB, and user activity is usually a lot more than that
● It is not designed for analytics workload (aggregate on huge number of row)
More activities = slow user activity database = slow app
tl;dr, eventually it will slow down, eventually you’ll need something different by then
[Diagram: Android App → Backend Service → Database]

Common solution (so called big data):
High write activity? Handled by datahub such as Kafka.
Huge size of data? Handled by datalake such as HDFS or S3
Failed analytics query on huge data? Handled by Hadoop or Spark
Important note:
Use it when you really need it! Those (fancy) things are hard to manage…
Don’t bother if your scale is not there yet.
[Diagram: Android App → Backend Service → Datahub → Datalake]

(Cont.) Don’t keep relying on the replica / slave.
Use it for a short period of time only (early stage of your app).
Why?
● Eventually the replica will not be able to handle your load
● You will eventually want to join data from different databases (not only MySQL)
● You want to store your processed data as well
Overall, letting analytics access your transactional db directly is not a good approach; it goes against best practice...
Remember about our replica / slave?

So? Treat your transaction data like user activity data, send it as tracking.
E.g. you have booking data with two statuses: booked and paid. Send them as two tracking events:
1) booking event
2) payment event
along with all the details like booking id, product id, timestamp, etc.
So instead of having one booking table with two possible statuses, you’ll have two
tables, one for each status.
This is called immutable data, a log.
Read more: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Transaction and user activity is not so different
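A minimal sketch of the immutable-log idea, assuming events are kept in an in-memory list. The `Event` fields, `track`, and `current_status` are hypothetical names for illustration; in the architecture from this talk the log would live in a datahub / datalake, not in process memory.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: an event cannot be modified after creation
class Event:
    kind: str        # "booking" or "payment"
    booking_id: str
    product_id: str
    timestamp: int   # Unix seconds

log = []  # append-only: events are added, never updated or deleted

def track(event):
    """Record one event by appending it to the log."""
    log.append(event)

def current_status(booking_id):
    """Derive the mutable 'status' by replaying the log in time order."""
    status = None
    for event in sorted(log, key=lambda e: e.timestamp):
        if event.booking_id != booking_id:
            continue
        if event.kind == "booking":
            status = "booked"
        elif event.kind == "payment":
            status = "paid"
    return status

track(Event("booking", "B-1", "P-9", 1510000000))
track(Event("booking", "B-2", "P-3", 1510000100))
track(Event("payment", "B-1", "P-9", 1510000200))
```

Nothing is overwritten: the "paid" status of B-1 is derived from the payment event, and the original booking event is still there for analytics.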

Both transaction and analytics events are flowing here...
So one architecture for analytics & transaction
[Diagram: Android App → Backend and Tracking Service → Datahub → Datalake]

Alternative: Google Analytics, Flurry etc.
Pros:
Good for early stage of app / web, solution is quite complete already.
Simple to manage.
Cons:
You don’t own your data
● Eventually you’ll need to query it yourself
● Or build data processing that enables a feature such as recommendation
Eventually you have to build your own. Data is an important asset.
ML, a promising source of competitive advantage, is also built on data.
Fully managed alternative

How about not-so-common
application requirements?

Not-so-common application requirements
Several examples:
● I want to store text, and then search it. Use inverted index
like Solr, ElasticSearch.
● I want to create a social network. Use graph database like
Neo4j.
● I need to store system metrics and aggregate it. Use time
series db like Graphite.
We no longer live in RDBMS-only world...

The problem, as well as the blessing: there are tons of them!
● RDBMS: MySQL, PostgreSQL
● Document based: MongoDB, DynamoDB
● Columnar database: Redshift, BigQuery
● NewSQL: Aurora, Spanner
● Time Series: InfluxDB
● Inverted Index: ElasticSearch
● Distributed filesystem: HDFS, S3, GCS
● Graph: Neo4j, ArangoDB
● Cache: Redis, Memcached
● Key value: HBase, DynamoDB
And many more…
Knowing the concepts and common use cases is the key (kuntji).
e.g. a transactional / operational db is best served by an RDBMS; columnar is usually used for analytics
3. Know databases available out there

Lesson Learned

Sample case in Traveloka
Use cases:
business intelligence, personalization, fraud detection, AB test, etc.
[Architecture diagram. Sources: Traveloka App (Android, iOS), Traveloka Services. Ingest: Kafka, Cloud Pub/Sub. Pipelines: streaming and batch, ETL, batch ingest, Cloud Dataflow. Storage: S3 Data Lake, Cloud Storage, NoSQL DB, Data Warehouse. Analytics and query: BigQuery, Hive, Presto. Consumers of data: DOMO, Analytics UI. Also shown: Monitoring, Logging.]

● Always use technology based on your requirement, not
because it is fancy
● Careful of gotchas! There's no silver bullet… (example: Google
“MySQL limitation”)
● Eventually, you’ll need to adapt to your growth by:
○ Split use cases based on query pattern and latency (see appendix)
○ Scalable tech based on growth estimation (need to test!)
○ Autoscale! (and managed service if possible)
● Keep learning, new databases keep coming
● New to those things? Ask around! It saves you a lot of time
6. Epilogue: Lessons learned from Traveloka

Appendix: Methodical Approach

Outline
1. Define your database requirement
2. Logical data modeling
3. Know databases available out there
4. Choose database, physical data modeling
5. Proof of concept, test
6. Epilogue: Lessons learned from Traveloka

● Understand biz requirement
● Query patterns - try to write your query in sentence or SQL
● Expected query/s, insert/s, update/s - peak and low
● Latency SLA - subsecond? seconds? minutes?
● Consistency - strong or eventual?
● Consider data growth - will the db survive in 3y? 5y?
● Retention policy - is data > 1 year old still relevant?
Don’t forget basic requirements
● e.g. high availability, reliability
Be aware of constraints
● Cost, budget
● Maintainability (Managed/self-service, Community/Proprietary)
1. Define database requirement

● Biz requirement:
○ We want to recommend based on latest purchase item type of a user,
show relevant items in home page to increase user engagement
● Write your query in sentence:
○ For user id = 1234, return latest purchase item type
○ or in SQL: select latest_purchase_item_type from user where id = 1234
● Expected query per second:
○ I want to show recommendations on the homepage, and the homepage is viewed
1,000 times/s during peak and 100 times/s during low hours
● Latency SLA:
○ Max query latency of 200ms at the 95th percentile, for UX convenience
1. Define database requirement - study case

● Consistency - strong or eventual
○ Eventually consistent is OK, data lag max 1 hour
● Consider data growth
○ Business expects user growth of 3x next year, and 10x in 3 years
● Retention policy - is data > 1 year old still relevant?
○ We could not delete user data, but purchase could be deleted after 1 year
1. Define database requirement - study case
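The study case numbers can be written down as a small record so they are easy to check later. This is only a sketch: the `Requirement` class and `projected_peak_qps` are hypothetical names, and the projection assumes homepage views grow in proportion to users, which the talk does not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Requirement:
    peak_qps: int        # homepage views per second at peak
    low_qps: int         # homepage views per second during low hours
    p95_latency_ms: int  # max query latency at the 95th percentile
    max_data_lag_s: int  # eventual consistency: accepted data lag
    growth_1y: int       # expected user growth multiplier next year
    growth_3y: int       # expected user growth multiplier in 3 years

# Values taken from the study case slides.
req = Requirement(
    peak_qps=1000,
    low_qps=100,
    p95_latency_ms=200,
    max_data_lag_s=60 * 60,
    growth_1y=3,
    growth_3y=10,
)

def projected_peak_qps(req, growth):
    # Assumption: query load scales linearly with the number of users.
    return req.peak_qps * growth

peak_next_year = projected_peak_qps(req, req.growth_1y)
peak_in_3_years = projected_peak_qps(req, req.growth_3y)
```

Under that assumption, the database chosen today should be tested against the 3-year figure, not only the current peak.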

● Think of entities and their properties
● PK - Each row unique by?
● Indexes you might need - Refer back to your query pattern
Study case
● There is user entity, with latest purchase item as its
properties
● User id as PK will be ideal
● No need for other index
2. Logical Data Modeling

The problem, as well as the blessing: there are tons of them!
● RDBMS: MySQL, PostgreSQL
● Document based: MongoDB, DynamoDB
● Columnar database: Redshift, BigQuery
● NewSQL: Aurora, Spanner
● Time Series: InfluxDB
● Inverted Index: ElasticSearch
● Distributed filesystem: HDFS, S3, GCS
● Graph: Neo4j, ArangoDB
● Cache: Redis, Memcached
● Key value: HBase, DynamoDB
And many more…
Knowing the concepts and common use cases is the key / kuntji.
e.g. a transactional / operational db is best served by an RDBMS; columnar is usually used for analytics

Choose database
● Use your database requirement.
○ Our study case: queries are always by key, so it is a key-value use case
● Shortlist databases that commonly solve your use case.
○ Our study case: MongoDB, DynamoDB, HBase, Bigtable
● Do cost benefit analysis, find the trade off
● Look for gotchas, google something like “problems with xxDB”

Physical Data Modeling in the Old days:
● Choose between MySQL or PostgreSQL
● Create ERD, convert to normalized form
● Define PK and index
Nowadays:
How data is modeled physically differs from one database to another!
Several examples:
● In MongoDB, a logical object with nested properties can be stored as is
● In HBase, defining the row key is crucial
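A sketch of the two examples above in plain Python, with no database driver involved. The document fields and the zero-padded `row_key` layout are hypothetical; the only HBase fact relied on is that rows are kept sorted by row key, compared as raw bytes.

```python
# Document store (MongoDB-style): the nested object is stored as one document.
user_document = {
    "_id": 1234,
    "latest_purchase": {"item_type": "flight", "purchased_at": "2017-11-01"},
}

def row_key(user_id, width=10):
    """Hypothetical row key layout: zero-padded user id as ASCII bytes."""
    return str(user_id).zfill(width).encode("ascii")

user_ids = (2, 10, 1234)

# Byte-wise ordering of unpadded ids does not follow numeric order...
naive_keys = sorted(str(user_id).encode("ascii") for user_id in user_ids)

# ...while fixed-width keys sort the same way the numbers do.
padded_keys = sorted(row_key(user_id) for user_id in user_ids)
```

The same logical entity ends up with a very different physical shape, and in the key-sorted case the key layout decides which rows sit next to each other.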

● Do capacity planning
○ Row Count: ~500 million
○ Num of column: ~100
○ Size: ~10TB
○ Update: 100-1000 update / s
○ Read: 10,000 read / s
○ etc...
● Plan test with part of the expected capacity, e.g. 1/10
○ Define success criteria, e.g. query latency 200ms for percentile 95
● Load test data
● Test query and insert/update
○ Cross check with success criteria
5. Proof of concept, test!
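The arithmetic behind the test plan can be sketched as follows, using the capacity numbers from this slide. The nearest-rank `percentile` helper, the decimal definition of TB, and the sample latencies are assumptions for illustration, not measurements.

```python
import math

ROWS = 500_000_000   # expected row count
SIZE_TB = 10         # expected total size
TEST_DIVISOR = 10    # test with 1/10 of the expected capacity

# Assumption: 1 TB = 10**12 bytes.
bytes_per_row = SIZE_TB * 10**12 // ROWS
test_rows = ROWS // TEST_DIVISOR
test_size_tb = SIZE_TB / TEST_DIVISOR

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def meets_sla(latencies_ms, limit_ms=200, pct=95):
    """Success criterion from the slide: latency at the given percentile within the limit."""
    return percentile(latencies_ms, pct) <= limit_ms

# Hypothetical latency samples in milliseconds.
passing_run = [50] * 90 + [150] * 6 + [400] * 4
failing_run = [50] * 90 + [400] * 10
```

A few slow outliers do not fail the run, but once more than 5% of the queries exceed the limit the criterion is no longer met.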
