Choosing the Right Database
Rendy B. Junior - Data Eng Lead @ Traveloka
Facebook DevC Malang Hackdays 2017
About
Traveloka
Traveloka is a leading Southeast Asia technology company that provides a wide
range of travel needs in one platform. Its app has been downloaded more than 20 million
times, making it one of the most popular travel booking apps in the region.
Rendy B Junior
Data Engineering Lead, Traveloka
Has gone through the revolution of Traveloka's data pipeline since the early phase of the
organization. Established both batch and real-time data platforms, which power
organization insights and tens of data-intensive application use cases. Managed to
handle 10,000x growth of data over the past 3 years.
#EnablingMobility
How we use our data
● Business Intelligence
● Analytics
● Personalization
● Fraud Detection
● Ads optimization
● Cross selling
● AB Test
● etc.
Why do you need a database?
So our service could remember something.
To store state.
Our case:
- Flights / hotel available (inventory)
- Customer Bookings
- Payment
Why do you need a database?
Common client-server architecture. The database serves a backend service
(don't expose it to clients or the public…)
Why do you need a database?
[Diagram: Android App (internal storage, etc.) and Web browser (cookie) → Backend Service → Database]
Back to basics
What’s your requirement?
Know your use case
Examples:
● Login feature, need to store user data
● Ecommerce with search - add to cart - payment, need to
store inventory, cart, payment
Common requirements for early stage apps:
Transactional / OLTP (that’s it)
What usually works: MySQL, PostgreSQL
Logical modeling:
Biz requirement → identify entity → ERD → Normalize
Physical modeling: Define schema (field & type), PK, define
indexes
Common requirements
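As a minimal sketch of the physical modeling step for the login use case, the block below defines a schema (fields and types), a PK, and one index. It uses Python's built-in sqlite3 only so that it runs anywhere; the slides recommend MySQL or PostgreSQL. The table name, column names, and sample row are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema for the "login feature" use case: one user entity.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,       -- PK: what makes each row unique
        email TEXT NOT NULL,          -- field & type
        password_hash TEXT NOT NULL,
        created_at TEXT NOT NULL
    )
    """
)
# Index chosen from the query pattern "find a user by email at login".
conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")

conn.execute(
    "INSERT INTO users (email, password_hash, created_at) VALUES (?, ?, ?)",
    ("a@example.com", "not-a-real-hash", "2017-01-01T00:00:00"),
)
row = conn.execute(
    "SELECT id, email FROM users WHERE email = ?", ("a@example.com",)
).fetchone()
print(row)
```

The index follows from the query pattern rather than from the entity itself, which is why the query is written first.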
How about analytics?
Common (temporary) solution for analytics in early stage apps:
Replica / Slave on transactional DB
Example: MySQL master slave replication
Don’t do analytics in Master… ever…
Don’t use SQL script, use analytics tools (a lot of open source tools, e.g. Superset)
Don't stay with this solution! Use it temporarily (explained in a later slide)
I need analytics on my transaction!
[Diagram: Master → Slave → Analytics Tools]
Several things are not really transactional…
But we still need to get insights from them.
● How many users use the sort feature? Has it ever been clicked?
● How many searches per day? What's the trend for the past 7 days?
We need to send and store user activity, and be able to get
insights from it. How?
I need user activity insights!
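The two questions above can be sketched as simple aggregations over stored activity events. This is only an illustration on a tiny in-memory list; the event names (`search`, `sort_click`), the dates, and the `daily_counts` helper are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical tracking events: (event_date, event_name).
events = [
    (date(2017, 5, 1), "search"),
    (date(2017, 5, 1), "sort_click"),
    (date(2017, 5, 2), "search"),
    (date(2017, 5, 2), "search"),
    (date(2017, 5, 7), "search"),
]

def daily_counts(events, name, end, days=7):
    """Count events called `name` per day for the `days` days ending at `end`."""
    counts = Counter(d for d, n in events if n == name)
    window = [end - timedelta(days=i) for i in range(days - 1, -1, -1)]
    return [(d, counts.get(d, 0)) for d in window]

# Searches per day, and the trend for the past 7 days.
search_trend = daily_counts(events, "search", end=date(2017, 5, 7))
# Has the sort feature ever been clicked?
sort_ever_clicked = any(n == "sort_click" for _, n in events)
```

At real scale the same aggregation runs over a huge number of rows, which is the workload the following slides are concerned with.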
Send your user activity to the backend (usually called tracking)
Usually an RDBMS like MySQL will work for small data, but won't work for huge data.
● It is not designed for high write activity; at some point your throughput will get stuck
● Normally it is 6TB max, and user activity is usually a lot more than that
● It is not designed for analytics workloads (aggregating over a huge number of rows)
More activity = slow user activity database = slow app
tl;dr: it will eventually slow down, and by then you'll need something different
I need user activity insights!
[Diagram: Android App → Backend Service → Database]
Common solution (so-called big data):
High write activity? Handled by a datahub such as Kafka.
Huge size of data? Handled by a datalake such as HDFS or S3.
Failed analytics query on huge data? Handled by Hadoop or Spark.
Important note:
Use it when you really need it! Those (fancy) things are hard to manage…
Don’t bother if your scale is not there yet.
I need user activity insights!
[Diagram: Android App → Backend Service → Datahub → Datalake]
(Cont.) Don't keep using the replica / slave.
Use it for a short period of time only (the early stage of your app).
Why?
● Eventually the replica will not be able to handle your load
● You will eventually want to join data from different databases (not only MySQL)
● You will want to store your processed data as well
Overall, accessing your DB directly like this is not a good approach; it goes against best practice...
Remember our replica / slave?
So? Treat your transaction data like user activity data: send it as tracking.
E.g. you have booking data with two statuses: booked and paid. Send it as two tracking events:
1) booking event
2) payment event
along with all the details such as booking id, product id, timestamp, etc.
So instead of having one booking table with two possible statuses, you'll have two
tables, one for each status.
This is called immutable data, a log.
Read more: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Transactions and user activity are not so different
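The booking example can be sketched as an append-only log in which the current status is derived by replaying events rather than by updating a row. This is a minimal in-memory illustration; the `Event` fields, the `track` and `booking_status` helpers, and the sample ids are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: an event is never modified once written
class Event:
    event_type: str   # "booking" or "payment"
    booking_id: str
    product_id: str
    timestamp: int    # epoch seconds; hypothetical field names

log = []  # append-only list standing in for the tracking pipeline

def track(event):
    """Append an event; existing entries are never updated or deleted."""
    log.append(event)

def booking_status(booking_id):
    """Derive the current status by replaying the log instead of updating a row."""
    status = None
    for e in sorted(log, key=lambda e: e.timestamp):
        if e.booking_id != booking_id:
            continue
        if e.event_type == "booking":
            status = "booked"
        elif e.event_type == "payment":
            status = "paid"
    return status

track(Event("booking", "B-1", "flight-42", 1000))
track(Event("booking", "B-2", "hotel-7", 1010))
track(Event("payment", "B-1", "flight-42", 1060))
print(booking_status("B-1"), booking_status("B-2"))
```

Because nothing is overwritten, the same log can feed both the transactional view (current status) and analytics (how long bookings take to get paid).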
Both transaction and analytics events are flowing here...
So one architecture for analytics & transaction
[Diagram: Android App → Backend, Tracking Service → Datahub → Datalake]
Alternative: Google Analytics, Flurry etc.
Pros:
Good for early stage of app / web, solution is quite complete already.
Simple to manage.
Cons:
But you don't own your data
● Eventually you'll need to query it yourself
● Or build data processing that enables a feature such as recommendation
Eventually you have to build your own. Data is an important asset.
ML, a promising field for competitive advantage, is also based on data.
Fully managed alternative
How about not-so-common
application requirements?
Not-so-common application requirements
Several examples:
● I want to store text, and then search it. Use an inverted index
such as Solr or Elasticsearch.
● I want to create a social network. Use a graph database such as
Neo4j.
● I need to store system metrics and aggregate them. Use a time
series DB such as Graphite.
We no longer live in an RDBMS-only world...
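To make the first example concrete, here is a toy inverted index: a map from each word to the documents that contain it. It is only a sketch of the concept, not how Solr or Elasticsearch are implemented; the function names and sample documents are made up for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

docs = {
    1: "cheap flight to Malang",
    2: "hotel in Malang",
    3: "cheap hotel in Bali",
}
index = build_index(docs)
print(search(index, "cheap hotel"))
```

A text search becomes a few set lookups and intersections instead of a scan over every row, which is why this structure suits the use case better than a plain RDBMS table.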
The problem, as well as the blessing: there are tons of them!
● RDBMS: MySQL, PostgreSQL
● Document based: MongoDB, DynamoDB
● Columnar database: Redshift, BigQuery
● NewSQL: Aurora, Spanner
● Time Series: InfluxDB
● Inverted Index: Elasticsearch
● Distributed filesystem: HDFS, S3, GCS
● Graph: Neo4j, ArangoDB
● Cache: Redis, Memcached
● Key value: HBase, DynamoDB
And many more…
Knowing the concepts and common use cases is the key.
e.g. a transactional / operational DB is best served by an RDBMS; columnar is usually used for analytics
3. Know databases available out there
Lessons Learned
Use cases:
business intelligence, personalization, fraud detection, ab test, etc.
Sample case in Traveloka
[Architecture diagram. Labels: Traveloka App (Android, iOS); Traveloka Services; Kafka; Cloud Pub/Sub (Ingest); Batch Ingest; Streaming; Batch; ETL; S3 Data Lake; Cloud Storage (Storage); Cloud Dataflow (Pipelines); Data Warehouse; BigQuery (Analytics); NoSQL DB; Hive, Presto (Query); Monitoring; Logging; Consumer of Data: DOMO, Analytics UI]
● Always choose technology based on your requirements, not
because it is fancy
● Beware of gotchas! There's no silver bullet… (example: Google
“MySQL limitation”)
● Eventually, you'll need to adapt to your growth by:
○ Splitting use cases based on query pattern and latency (see appendix)
○ Choosing scalable tech based on growth estimation (need to test!)
○ Autoscaling! (and managed services if possible)
● Keep learning; new databases keep coming
● New to these things? Ask around! It saves you a lot of time
6. Epilogue: Lessons learned from Traveloka
Thank you
Appendix: Methodical Approach
Outline
1. Define your database requirement
2. Logical data modeling
3. Know databases available out there
4. Choose database, physical data modeling
5. Proof of concept, test
6. Epilogue: Lessons learned from Traveloka
● Understand biz requirement
● Query patterns - try to write your query in sentence or SQL
● Expected query/s, insert/s, update/s - peak and low
● Latency SLA - subsecond? seconds? minutes?
● Consistency - strong or eventual?
● Consider data growth - will the db survive in 3y? 5y?
● Retention policy - is data > 1 year old still relevant?
Don’t forget basic requirements
● e.g. high availability, reliability
Be aware of constraints
● Cost, budget
● Maintainability (Managed/self-service, Community/Proprietary)
1. Define database requirement
● Biz requirement:
○ We want to recommend based on the latest purchase item type of a user, and
show relevant items on the home page to increase user engagement
● Write your query in a sentence:
○ For user id = 1234, return latest purchase item type
○ or in SQL (assuming a users table): SELECT latest_purchase_item_type FROM users WHERE id = 1234
● Expected queries per second:
○ I want to show recommendations on the homepage, and the homepage is viewed
1,000 times/s during peak, and 100 times/s during low hours
● Latency SLA:
○ Max query latency of 200ms at the 95th percentile, for UX convenience
1. Define database requirement - case study
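The query pattern from this slide can be tried out directly. The sketch below uses Python's built-in sqlite3 purely so it is runnable; the table name `users`, the helper function, and the sample rows are assumptions, since the slide gives only the column name and the user id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for the case study; only the column name and
# the id 1234 come from the slide.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, latest_purchase_item_type TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, latest_purchase_item_type) VALUES (?, ?)",
    [(1234, "flight"), (5678, "hotel")],
)

def latest_purchase_item_type(user_id):
    """For a given user id, return the latest purchase item type (or None)."""
    row = conn.execute(
        "SELECT latest_purchase_item_type FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None

print(latest_purchase_item_type(1234))
```

Writing the query out early shows that every access is a lookup by a single key, which matters when shortlisting databases later.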
● Consistency - strong or eventual
○ Eventually consistent is OK, data lag max 1 hour
● Consider data growth
○ The business expects user growth of 3x next year, and 10x in 3 years
● Retention policy - is data > 1 year old still relevant?
○ We cannot delete user data, but purchase data can be deleted after 1 year
1. Define database requirement - case study
● Think of entities and their properties
● PK - What makes each row unique?
● Indexes you might need - Refer back to your query pattern
Case study
● There is a user entity, with latest purchase item as one of its
properties
● User id would be the ideal PK
● No need for any other index
2. Logical Data Modeling
The problem, as well as the blessing: there are tons of them!
● RDBMS: MySQL, PostgreSQL
● Document based: MongoDB, DynamoDB
● Columnar database: Redshift, BigQuery
● NewSQL: Aurora, Spanner
● Time Series: InfluxDB
● Inverted Index: Elasticsearch
● Distributed filesystem: HDFS, S3, GCS
● Graph: Neo4j, ArangoDB
● Cache: Redis, Memcached
● Key value: HBase, DynamoDB
And many more…
Knowing the concepts and common use cases is the key.
e.g. a transactional / operational DB is best served by an RDBMS; columnar is usually used for analytics
3. Know databases available out there
Choose database
● Use your database requirements.
○ Our case study: queries are always by key, so it is a key-value use case
● Shortlist DBs that commonly solve your use case.
○ Our case study: MongoDB, DynamoDB, HBase, Bigtable
● Do a cost-benefit analysis, find the trade-offs
● Look for gotchas, google something like “problems with xxDB”
4. Choose database, physical data modeling
Physical data modeling in the old days:
● Choose between MySQL and PostgreSQL
● Create an ERD, convert it to normalized form
● Define PK and indexes
Nowadays:
How data is modeled physically differs from one database to another!
Several examples:
● In MongoDB, a logical object with nested properties can be stored as is
● In HBase, defining the row key is crucial
4. Choose database, physical data modeling
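The two examples can be contrasted with plain Python data structures, without any database client. This is a rough sketch under stated assumptions: the field names, the `purchase:` column prefix, and the reversed, zero-padded row key are illustrative choices, not something the slides prescribe.

```python
import json

# Logical object: a user with a nested "latest purchase" property
# (hypothetical field names).
user = {
    "user_id": 1234,
    "latest_purchase": {"item_type": "flight", "purchased_at": "2017-05-01"},
}

# Document-store style: the nested object is kept as one document.
document = json.dumps(user, sort_keys=True)

# Wide-column / key-value style: everything hangs off a row key, so its design
# decides how rows are located. Reversing the zero-padded id is one way to
# spread sequential ids across the key space (an assumption for illustration).
def row_key(user_id, width=10):
    return str(user_id).zfill(width)[::-1]

row = {
    "row_key": row_key(user["user_id"]),
    "columns": {
        "purchase:item_type": user["latest_purchase"]["item_type"],
        "purchase:purchased_at": user["latest_purchase"]["purchased_at"],
    },
}
print(document)
print(row["row_key"])
```

The logical model is the same in both cases; only the physical layout changes, and with it the questions that need answering (document shape versus row key design).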
● Do capacity planning
○ Row count: ~500 million
○ Number of columns: ~100
○ Size: ~10TB
○ Update: 100-1,000 updates/s
○ Read: 10,000 reads/s
○ etc...
● Plan a test with a fraction of the expected capacity, e.g. 1/10
○ Define success criteria, e.g. query latency of 200ms at the 95th percentile
● Load test data
● Test query and insert/update
○ Cross-check with the success criteria
5. Proof of concept, test!
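The arithmetic behind such a plan can be sketched as follows: scale the capacity figures down to 1/10 for the test, then check measured latencies against the 95th-percentile criterion. The capacity numbers come from the slide; the latency samples are made up, and the nearest-rank percentile is one common definition chosen here as an assumption.

```python
# Figures from the capacity plan above.
ROW_COUNT = 500_000_000
TOTAL_SIZE_BYTES = 10 * 10**12     # ~10TB, taking 1TB = 10^12 bytes
READS_PER_SECOND = 10_000
TEST_DIVISOR = 10                  # test with 1/10 of expected capacity

avg_row_bytes = TOTAL_SIZE_BYTES / ROW_COUNT       # average bytes per row
test_rows = ROW_COUNT // TEST_DIVISOR
test_reads_per_second = READS_PER_SECOND // TEST_DIVISOR

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]

def meets_sla(latencies_ms, limit_ms=200, pct=95):
    """Success criterion: the pct-th percentile latency is within the limit."""
    return percentile(latencies_ms, pct) <= limit_ms

# Hypothetical latencies (ms) from a load test, not real measurements.
latencies_ms = [20] * 90 + [150] * 6 + [400] * 4
print(avg_row_bytes, test_rows, test_reads_per_second)
print(percentile(latencies_ms, 95), meets_sla(latencies_ms))
```

Note how the same samples pass at the 95th percentile but would fail at the 99th, so the percentile in the success criterion should be agreed on before the test is run.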

MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Choosing the Right Database - Facebook DevC Malang Hackdays 2017

  • 1. Choosing the Right Database Rendy B. Junior - Data Eng Lead @ Traveloka Facebook DevC Malang Hackdays 2017
  • 2. About
    Traveloka: Traveloka is a leading Southeast Asia technology company that provides a wide range of travel needs in one platform. The app has been downloaded more than 20 million times, making it one of the most popular travel booking apps in the region.
    Rendy B Junior, Data Engineering Lead, Traveloka: Has been through the revolution of Traveloka's data pipeline since the early phase of the organization. Established both the batch and real-time data platforms which power organization insights and tens of data-intensive application use cases. Managed to handle 10,000x growth of data over the past 3 years.
  • 4. How we use our data ● Business Intelligence ● Analytics ● Personalization ● Fraud Detection ● Ads optimization ● Cross selling ● AB Test ● etc.
  • 5. Why do you need a database?
  • 6. Why do you need a database?
    So our service can remember something, that is, to store state. In our case:
    - Flights / hotels available (inventory)
    - Customer bookings
    - Payments
  • 7. Why do you need a database?
    Common client-server architecture: the database serves a backend service (don’t expose it to the client / public…).
    Diagram labels: Android App (internal storage, etc.), Web browser (cookie), Backend Service, Database
  • 8. Back to basics: what’s your requirement?
  • 9. Know your use case
    Examples:
    ● Login feature: need to store user data
    ● E-commerce with search, add to cart and payment: need to store inventory, cart and payment
  • 10. Common requirements
    Common requirements for early-stage apps: transactional / OLTP (that’s it).
    What usually works: MySQL, PostgreSQL
    Logical modeling: biz requirement → identify entities → ERD → normalize
    Physical modeling: define the schema (fields & types), PK and indexes
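The physical modeling step on slide 10 can be sketched with Python's built-in sqlite3 module standing in for MySQL or PostgreSQL. This is only an illustration: the `users` and `bookings` tables, their columns and the index name are assumptions made for the example, not taken from the slides.

```python
import sqlite3


def create_schema(conn):
    """Define schema (fields and types), primary keys and one index."""
    conn.executescript(
        """
        CREATE TABLE users (
            id    INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE
        );
        CREATE TABLE bookings (
            id         INTEGER PRIMARY KEY,
            user_id    INTEGER NOT NULL REFERENCES users(id),
            status     TEXT NOT NULL,
            created_at TEXT NOT NULL
        );
        -- Index chosen from the query pattern "all bookings of one user".
        CREATE INDEX idx_bookings_user_id ON bookings(user_id);
        """
    )


def bookings_for_user(conn, user_id):
    """Return (booking id, status) pairs for one user, oldest id first."""
    rows = conn.execute(
        "SELECT id, status FROM bookings WHERE user_id = ? ORDER BY id",
        (user_id,),
    )
    return rows.fetchall()


# Hypothetical sample data, in memory only.
conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
conn.execute(
    "INSERT INTO bookings (id, user_id, status, created_at) "
    "VALUES (10, 1, 'booked', '2017-01-01')"
)
```

The index follows the query pattern rather than being added by default, which is the same habit the appendix recommends for logical modeling.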
  • 11. How about analytics?
  • 12. I need analytics on my transactions!
    Common (temporary) solution for analytics in early-stage apps: a replica / slave of the transactional DB. Example: MySQL master-slave replication.
    ● Don’t do analytics on the master… ever…
    ● Don’t use SQL scripts; use analytics tools (there are a lot of open source tools, e.g. Superset)
    ● Don’t stay with this solution! Use it temporarily (explained in a later slide)
    Diagram labels: Master, Slave, Analytics Tools
  • 13. I need user activity insights!
    Several things are not really transactional, but we need to get insights from them.
    ● How many users use the sort feature? Has it ever been clicked?
    ● How many searches per day? What’s the trend over the past 7 days?
    We need to send and store user activity, and be able to get insights from it. How?
  • 14. I need user activity insights!
    Send your user activity to the backend (usually called tracking).
    Usually an RDBMS like MySQL will work for small data, but won’t work for huge data.
    ● It is not designed for high write activity; at some point your throughput will get stuck
    ● Normally it is 6TB max, and user activity is usually a lot more than that
    ● It is not designed for analytics workloads (aggregates over a huge number of rows)
    More activities = slow user activity database = slow app
    tl;dr: eventually it will slow down, and by then you’ll need something different.
    Diagram labels: Android App, Backend Service, Database
  • 15. I need user activity insights!
    Common solution (so-called big data):
    ● High write activity? Handled by a datahub such as Kafka.
    ● Huge size of data? Handled by a datalake such as HDFS or S3.
    ● Failed analytics queries on huge data? Handled by Hadoop or Spark.
    Important note: use it when you really need it! Those (fancy) things are hard to manage… Don’t bother if your scale is not there yet.
    Diagram labels: Android App, Backend Service, Datalake, Datahub
  • 16. Remember our replica / slave?
    (Cont.) Don’t stay on the replica / slave. Use it for a short period of time only (the early stage of your app). Why?
    ● Eventually the replica will not be able to handle your load
    ● You will eventually want to join data from different databases (not only MySQL)
    ● You want to store your processed data as well
    Overall, it is not a good approach to access your DB this way; it is against best practice...
  • 17. Transactions and user activity are not so different
    So? Treat your transaction data like user activity data: send it as tracking.
    E.g. you have booking data with two statuses: booked, paid. Send it as two tracking events, 1) a booking event and 2) a payment event, along with all the details like booking id, product id, timestamp, etc.
    So instead of having one booking table with two possible statuses, you’ll have two tables, one for each status. This is called immutable data, a log.
    Read more: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
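The booking example on slide 17 can be sketched in Python as an append-only list of events from which the current status is derived by replaying the log. The `Event` fields follow the details named on the slide (booking id, product id, timestamp); the function names and sample values are assumptions for illustration, and a real setup would send these events to a datahub rather than keep them in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable tracking event; frozen so it cannot be modified."""
    event_type: str   # "booking" or "payment"
    booking_id: str
    product_id: str
    timestamp: int    # seconds since epoch


def append_event(log, event):
    """Append only: existing events are never updated or deleted."""
    log.append(event)


def current_status(log, booking_id):
    """Derive the status by replaying events in time order."""
    status = None
    for event in sorted(log, key=lambda e: e.timestamp):
        if event.booking_id != booking_id:
            continue
        if event.event_type == "booking":
            status = "booked"
        elif event.event_type == "payment":
            status = "paid"
    return status


# Hypothetical events: B1 is booked then paid, B2 is only booked.
log = []
append_event(log, Event("booking", "B1", "P9", 1000))
append_event(log, Event("booking", "B2", "P3", 1005))
append_event(log, Event("payment", "B1", "P9", 1010))
```

Because nothing is overwritten, the same log can answer both "what is the status now?" and "when did each step happen?", which is what makes it usable for transactions and analytics alike.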
  • 18. So: one architecture for analytics & transactions
    Both transaction and analytics events flow here...
    Diagram labels: Android App, Backend, Tracking Service, Datalake, Datahub
  • 19. Fully managed alternative
    Alternative: Google Analytics, Flurry, etc.
    Pros: good for the early stage of an app / web; the solution is already quite complete and simple to manage.
    Cons: you don’t own your data.
    ● Eventually you’ll need to query it yourself
    ● Or create data processing that enables a feature such as recommendation
    Eventually you have to build your own. Data is an important asset. ML, a promising field for competitive advantage, is also based on data.
  • 20. How about not-so-common application requirements?
  • 21. Not-so-common application requirements
    Several examples:
    ● I want to store text, and then search it. Use an inverted index like Solr or Elasticsearch.
    ● I want to create a social network. Use a graph database like Neo4j.
    ● I need to store system metrics and aggregate them. Use a time series DB like Graphite.
    We no longer live in an RDBMS-only world...
  • 22. 3. Know databases available out there
    The problem, as well as the blessing: there are tons of them!
    ● RDBMS: MySQL, PostgreSQL
    ● Document based: MongoDB, DynamoDB
    ● Columnar database: Redshift, BigQuery
    ● NewSQL: Aurora, Spanner
    ● Time Series: InfluxDB
    ● Inverted Index: Elasticsearch
    ● Distributed filesystem: HDFS, S3, GCS
    ● Graph: Neo4j, ArangoDB
    ● Cache: Redis, Memcached
    ● Key value: HBase, DynamoDB
    And many more…
    Knowing the concepts and common use cases is the key, e.g. transactional / operational DBs are best served by an RDBMS, while columnar is usually used for analytics.
  • 23. Lessons Learned
  • 24. Sample case in Traveloka
    Use cases: business intelligence, personalization, fraud detection, AB test, etc.
    Diagram labels: Consumer of Data, Streaming, Batch, Traveloka App, Kafka, ETL, Data Warehouse, S3 Data Lake, Batch Ingest, Android, iOS, DOMO Analytics UI, NoSQL DB, Traveloka Services, Ingest (Cloud Pub/Sub), Storage (Cloud Storage), Pipelines (Cloud Dataflow), Analytics (BigQuery), Monitoring, Logging, Hive, Presto, Query
  • 25. 6. Epilogue: Lessons learned from Traveloka
    ● Always choose technology based on your requirements, not because it is fancy
    ● Be careful of gotchas! There's no silver bullet… (example: Google “MySQL limitation”)
    ● Eventually, you’ll need to adapt to your growth by:
      ○ Splitting use cases based on query pattern and latency (see appendix)
      ○ Using scalable tech based on growth estimation (need to test!)
      ○ Autoscaling! (and a managed service if possible)
    ● Keep learning; new databases keep coming
    ● New to those things? Ask around! It saves you a lot of time
  • 27. Appendix: Methodical Approach
  • 28. Outline
    1. Define your database requirement
    2. Logical data modeling
    3. Know databases available out there
    4. Choose database, physical data modeling
    5. Proof of concept, test
    6. Epilogue: Lessons learned from Traveloka
  • 29. 1. Define database requirement
    ● Understand the biz requirement
    ● Query patterns: try to write your query as a sentence or in SQL
    ● Expected queries/s, inserts/s, updates/s: peak and low
    ● Latency SLA: subsecond? seconds? minutes?
    ● Consistency: strong or eventual?
    ● Consider data growth: will the DB survive in 3y? 5y?
    ● Retention policy: is data > 1 year old still relevant?
    Don’t forget basic requirements
    ● e.g. high availability, reliability
    Be aware of constraints
    ● Cost, budget
    ● Maintainability (managed / self-service, community / proprietary)
  • 30. 1. Define database requirement - case study
    ● Biz requirement:
      ○ We want to recommend items based on the latest purchased item type of a user, and show relevant items on the home page to increase user engagement
    ● Write your query as a sentence:
      ○ For user id = 1234, return latest purchase item type
      ○ or in SQL: select latest_purchase_item_type from user where id = 1234
    ● Expected queries per second:
      ○ I want to show recommendations on the homepage, and the homepage is viewed 1,000 times/s during peak and 100 times/s during low hours
    ● Latency SLA:
      ○ Max query latency of 200ms at the 95th percentile, for UX convenience
  • 31. 1. Define database requirement - case study
    ● Consistency: strong or eventual?
      ○ Eventually consistent is OK, data lag max 1 hour
    ● Consider data growth
      ○ Business expects user growth of 3x next year, and 10x in 3 years
    ● Retention policy: is data > 1 year old still relevant?
      ○ We cannot delete user data, but purchases can be deleted after 1 year
  • 32. 2. Logical Data Modeling
    ● Think of entities and their properties
    ● PK: each row is unique by?
    ● Indexes you might need: refer back to your query patterns
    Case study
    ● There is a user entity, with latest purchase item as its property
    ● User id as PK would be ideal
    ● No need for other indexes
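The case study's logical model (a user entity keyed by user id, with the latest purchase item type as a property) can be sketched in Python as below. The `User` and `UserStore` names are hypothetical, and the in-memory dict only stands in for whichever database is eventually chosen; the point is that the single query pattern is a lookup by primary key, so no other index is needed.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class User:
    """User entity; user_id plays the role of the primary key."""
    user_id: int
    latest_purchase_item_type: Optional[str] = None


class UserStore:
    """In-memory stand-in for a table keyed by user id."""

    def __init__(self) -> None:
        self._by_id: Dict[int, User] = {}

    def record_purchase(self, user_id: int, item_type: str) -> None:
        """Overwrite the latest purchase item type for this user."""
        user = self._by_id.setdefault(user_id, User(user_id))
        user.latest_purchase_item_type = item_type

    def latest_purchase_item_type(self, user_id: int) -> Optional[str]:
        """The case study's query: look up by key, return one property."""
        user = self._by_id.get(user_id)
        return user.latest_purchase_item_type if user else None


# Hypothetical usage: user 1234 buys a flight, then a hotel.
store = UserStore()
store.record_purchase(1234, "flight")
store.record_purchase(1234, "hotel")
```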
  • 33. 3. Know databases available out there
    The problem, as well as the blessing: there are tons of them!
    ● RDBMS: MySQL, PostgreSQL
    ● Document based: MongoDB, DynamoDB
    ● Columnar database: Redshift, BigQuery
    ● NewSQL: Aurora, Spanner
    ● Time Series: InfluxDB
    ● Inverted Index: Elasticsearch
    ● Distributed filesystem: HDFS, S3, GCS
    ● Graph: Neo4j, ArangoDB
    ● Cache: Redis, Memcached
    ● Key value: HBase, DynamoDB
    And many more…
    Knowing the concepts and common use cases is the key, e.g. transactional / operational DBs are best served by an RDBMS, while columnar is usually used for analytics.
  • 34. 4. Choose database, physical data modeling
    Choose database
    ● Use your database requirements.
      ○ Our case study: queries are always by key, so it is key-value
    ● Shortlist DBs that commonly solve your use case.
      ○ Our case study: MongoDB, DynamoDB, HBase, Bigtable
    ● Do a cost-benefit analysis; find the trade-offs
    ● Look for gotchas; google something like “problems with xxDB”
  • 35. 4. Choose database, physical data modeling
    Physical data modeling in the old days:
    ● Choose between MySQL and PostgreSQL
    ● Create an ERD, convert to normalized form
    ● Define PK and indexes
    Nowadays: how data is modeled physically differs from one database to another! Several examples:
    ● In MongoDB, a logical object with nested properties can be stored as is
    ● In HBase, defining the row key is crucial
  • 36. 5. Proof of concept, test!
    ● Do capacity planning
      ○ Row count: ~500 million
      ○ Number of columns: ~100
      ○ Size: ~10TB
      ○ Updates: 100-1000 updates / s
      ○ Reads: 10,000 reads / s
      ○ etc...
    ● Plan the test with part of the expected capacity, e.g. 1/10
      ○ Define success criteria, e.g. query latency of 200ms at the 95th percentile
    ● Load test data
    ● Test queries and inserts/updates
      ○ Cross-check with the success criteria
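The proof-of-concept steps above can be sketched in Python: scale the capacity plan down to 1/10 for the test, then check measured latencies against the success criterion of 200ms at the 95th percentile. The numbers come from slide 36; the function names, the nearest-rank percentile definition and the use of decimal terabytes (10^12 bytes) are assumptions for illustration.

```python
def scaled_plan(plan, divisor):
    """Scale every capacity number down by the same integer divisor."""
    return {name: value // divisor for name, value in plan.items()}


def avg_row_bytes(size_tb, rows):
    """Average row size, assuming decimal terabytes (10**12 bytes)."""
    return size_tb * 10**12 // rows


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_sla(latencies_ms, limit_ms=200, pct=95):
    """Success criterion: the pct-th percentile latency is within the limit."""
    return percentile(latencies_ms, pct) <= limit_ms


# Capacity plan from the slide, using the upper bound for updates.
full_plan = {
    "rows": 500_000_000,
    "size_tb": 10,
    "reads_per_s": 10_000,
    "updates_per_s": 1_000,
}
test_plan = scaled_plan(full_plan, 10)
```

With these figures the average row comes to roughly 20 KB, which is a useful sanity check when generating test data at 1/10 scale.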