This document summarizes the lessons learned from Traveloka's journey in building a scalable data pipeline. Some key lessons include: (1) splitting data pipelines based on query patterns and SLAs, (2) using technologies like Kafka to decouple data publishing and consumption and handle high throughput, (3) planning for a data warehouse from the beginning, and (4) testing scalability and choosing technologies suited for specific use cases. The document also outlines Traveloka's future plans to simplify their data architecture through a single entry point for data and less operational complexity.
Scalable data pipeline at Traveloka - Facebook Dev Bandung
1. Scalable Data Pipeline @ Traveloka: How We Get There
Stories and lessons learned on building a scalable data pipeline at Traveloka.
2. Very early days
Diagram: Applications & Services write to a single shared store (Raw Activity, Key Value, Time Series); a Summarizer feeds the Internal Dashboard, and Report Scripts + Crontab produce reports.
3. Full... Split & Shard! Raw, KV, and Time Series DB
Diagram: Applications & Services now write to separate, sharded stores: Raw Activity (sharded) and Key Value DB (sharded); the Summarizer builds a Time Series Summary that backs the Internal Dashboard and Report Scripts + Crontab.
Lesson Learned:
1. UNIX principle: “Do One Thing and Do It Well”
2. Split use cases based on SLA & query pattern
3. Scalable tech based on growth estimation
4. Throughput? Kafka comes to the rescue
Diagram: Applications & Services publish to Kafka as a datahub; a raw-data consumer reads from Kafka, inserts into Raw Activity (sharded), and updates Key Value (sharded).
Lesson Learned:
1. Use something that can handle higher throughput for cases with high write volume, such as tracking
2. Decouple publish and consume (see the sketch below)
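To make the "decouple publish and consume" point concrete, here is a minimal sketch (not Traveloka's actual code) using the kafka-python client; the broker addresses, topic name, and event fields are hypothetical.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["kafka-1:9092", "kafka-2:9092"]   # hypothetical broker addresses
TOPIC = "tracking.raw-activity"              # hypothetical topic name

# Publisher side: the application only appends events to Kafka and returns
# immediately, so a slow database can no longer block user-facing writes.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": 42, "event": "search", "ts": "2017-01-01T00:00:00Z"})
producer.flush()

# Consumer side: a separate process reads at its own pace and writes to the
# raw-activity and key-value stores; it can be scaled or restarted without
# touching the publishing applications.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="raw-data-consumer",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    event = message.value
    # insert into Raw Activity (sharded) and update Key Value (sharded) here
    print(event)
```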
5. We need a data warehouse and a BI tool, and we need them fast!
Diagram: Raw Activity (sharded) and other sources → Python ETL (temporary solution) → Star Schema DW on Postgres → Periscope BI tool.
Lesson Learned:
1. Think about the DW from the beginning of the data pipeline (star-schema sketch below)
2. BI tools: do not reinvent the wheel
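As an illustration of the star-schema DW at this stage, here is a minimal loading sketch against Postgres using psycopg2; the table and column names are made up, not Traveloka's actual schema, and the upsert assumes Postgres 9.5+.

```python
import psycopg2

conn = psycopg2.connect("dbname=dwh user=etl")  # hypothetical DSN
cur = conn.cursor()

# One dimension table plus one fact table referencing it by surrogate key.
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_product (
        product_key  SERIAL PRIMARY KEY,
        product_code TEXT UNIQUE,
        product_name TEXT
    );
    CREATE TABLE IF NOT EXISTS fact_booking (
        booking_date DATE,
        product_key  INT REFERENCES dim_product(product_key),
        bookings     INT,
        revenue      NUMERIC
    );
""")

# Upsert the dimension row, then load the fact row pointing at its key.
cur.execute(
    """
    INSERT INTO dim_product (product_code, product_name)
    VALUES (%s, %s)
    ON CONFLICT (product_code) DO UPDATE SET product_name = EXCLUDED.product_name
    RETURNING product_key
    """,
    ("FLIGHT", "Flight booking"),
)
product_key = cur.fetchone()[0]
cur.execute(
    "INSERT INTO fact_booking VALUES (%s, %s, %s, %s)",
    ("2016-05-01", product_key, 120, 3400.50),
)
conn.commit()
```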
6. Postgres couldn’t handle the load!
Diagram: Raw Activity (sharded) and other sources → Python ETL (temporary solution) → Star Schema DW on Redshift → Periscope BI tool.
Lesson Learned:
1. Choose the specific tech that best fits the use case
7. Scaling out MongoDB every so often is not manageable...
Diagram: Kafka as datahub → Gobblin as consumer → Raw Activity on S3.
Lesson Learned:
1. MongoDB shards: scalability needs to be tested!
8. “Have” to adopt big data
Diagram: Kafka as datahub → Gobblin as consumer → Raw Activity on S3 → processing on Spark → Star Schema DW on Redshift.
Lesson Learned:
1. Processing has to be easily scalable (sketch below)
2. Scale processing separately for day-to-day jobs and backfill jobs
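A minimal PySpark sketch of the batch path described above: read raw activity from S3, aggregate it, and stage the result for the Redshift star schema. The bucket, paths, and field names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-activity-rollup").getOrCreate()

# Raw activity written by Gobblin, partitioned by date (hypothetical layout).
raw = spark.read.json("s3a://example-bucket/raw-activity/dt=2017-01-01/")

daily = (
    raw.groupBy(F.col("event_type"), F.to_date("ts").alias("event_date"))
       .agg(F.count("*").alias("events"),
            F.countDistinct("user_id").alias("users"))
)

# Stage the aggregate back to S3; loading it into the Redshift star schema
# (e.g. via COPY) would be a separate step.
daily.write.mode("overwrite").parquet("s3a://example-bucket/dw-staging/daily_activity/")
```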
9. Near real time on big data is challenging
Diagram: Kafka as datahub → MemSQL as the near-real-time DB (write-path sketch below).
Lesson Learned:
1. Dig into requirements until they are very specific; for data this mostly relates to 1) latency SLA, 2) query pattern, 3) accuracy, 4) processing requirements, 5) tools integration
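A minimal sketch of the near-real-time write path, assuming only that MemSQL speaks the MySQL wire protocol (so pymysql can be used) and that events arrive from the Kafka datahub; the table and field names are illustrative, not Traveloka's.

```python
import json
import pymysql
from kafka import KafkaConsumer

db = pymysql.connect(host="memsql-host", user="pipeline", password="secret",
                     db="realtime", autocommit=True)
consumer = KafkaConsumer("tracking.raw-activity",
                         bootstrap_servers=["kafka-1:9092"],
                         group_id="near-real-time-writer",
                         value_deserializer=lambda raw: json.loads(raw.decode("utf-8")))

with db.cursor() as cur:
    for message in consumer:
        event = message.value
        # Aggregate as events arrive: one row per (event_type, minute),
        # incremented in place so alerting dashboards can query it directly.
        cur.execute(
            """
            INSERT INTO event_counts_per_minute (event_type, minute, events)
            VALUES (%s, DATE_FORMAT(%s, '%%Y-%%m-%%d %%H:%%i:00'), 1)
            ON DUPLICATE KEY UPDATE events = events + 1
            """,
            (event["event_type"], event["ts"]),
        )
```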
10. Open your mind to any combination of tech!
Diagram: PubSub as datahub → Dataflow for stream processing → Key Value on DynamoDB (cross-cloud sketch below).
Lesson Learned:
1. A combination of cloud providers is possible, but be careful about latency
2. During a research project, always prepare plans B & C plus a proper buffer in the timeline
3. Autoscale!
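A minimal Apache Beam (Python SDK) sketch of this cross-cloud combination: read from Pub/Sub, run on Dataflow, and write key-value rows to DynamoDB from a DoFn. The subscription, table, region, and fields are hypothetical.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WriteToDynamoDB(beam.DoFn):
    def setup(self):
        import boto3  # AWS client created once per worker
        self.table = (boto3.resource("dynamodb", region_name="ap-southeast-1")
                           .Table("user_latest_activity"))

    def process(self, element):
        event = json.loads(element.decode("utf-8"))
        # Keep only the latest activity per user, keyed by user_id.
        self.table.put_item(Item={"user_id": str(event["user_id"]),
                                  "last_event": event["event"],
                                  "ts": event["ts"]})


options = PipelineOptions(streaming=True)  # plus --runner=DataflowRunner, project, etc.
with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(subscription="projects/example/subscriptions/tracking")
     | beam.ParDo(WriteToDynamoDB()))
```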
11. More autoscale!
Diagram: PubSub as datahub → BigQuery as the near-real-time DB (streaming-insert sketch below).
Lesson Learned:
1. Autoscale = cost monitoring
Caveat: autoscale != everything solved, e.g. PubSub's default quota of 200 MB/s (it can be increased, but only via a manual request)
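A minimal sketch of streaming events into BigQuery as the near-real-time store, using the google-cloud-bigquery client's streaming-insert API; the project, dataset, table, and schema are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")
table = client.get_table("example-project.realtime.raw_activity")

rows = [
    {"user_id": 42, "event": "search", "ts": "2018-01-01T00:00:00Z"},
]
# Streaming insert: rows become queryable within seconds, and capacity
# scales on the BigQuery side without cluster administration.
errors = client.insert_rows_json(table, rows)
if errors:
    raise RuntimeError("BigQuery insert failed: {}".format(errors))
```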
12. More autoscale!
Diagram: Kafka as datahub → Gobblin as consumer → Raw Activity on S3 → processing on Spark → Hive & Presto on Qubole as the query engine → BI & exploration tools (query sketch below).
Lesson Learned:
1. Make scalability as granular as possible; in this case, separate compute and storage scalability
2. Separate BI with a well-defined SLA from the exploration use case
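A minimal sketch of the exploration path: Presto queries Hive tables whose data stays on S3, so compute can be scaled (or shut down) independently of storage. The host and table names are hypothetical; pyhive is used here as the client.

```python
from pyhive import presto

# Connect to a Presto coordinator (e.g. the one managed by Qubole).
conn = presto.connect(host="presto.example.internal", port=8080,
                      catalog="hive", schema="analytics")
cur = conn.cursor()
cur.execute("""
    SELECT event_type, count(*) AS events
    FROM raw_activity
    WHERE dt = '2018-01-01'
    GROUP BY event_type
    ORDER BY events DESC
""")
for event_type, events in cur.fetchall():
    print(event_type, events)
```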
15. Key Lessons Learned
● Scalability in mind -- esp. disk full.. :)
● Scale as granularly as possible -- compute, storage
● Scalability needs to be tested (of course!)
● Do one thing, and do it well; dig into your requirements -- SLA, query pattern
● Decouple publish and consume -- publisher availability is very important!
● Choose tech that is specific to the use case
● Careful of gotchas! There's no silver bullet...
16. Future Roadmap
- In the past, we would see a problem or need, see what technology could solve it, and
plug it into the existing pipeline.
- It works well.
- But after some time, we need to maintain a lot of different components.
- Multiple clusters:
- Kafka
- Spark
- Hive/Presto
- Redshift
- etc
- Multiple data entry points for analysts:
- BigQuery
- Hive/Presto
- Redshift
17. Future Roadmap
Our goal:
- Simplifying our data architecture.
- Single data entry point for data analysts/scientists, both streaming and batch
data.
- Without compromising what we can do now.
- Reliability, speed, and scale.
- Less or no ops.
- We also want to make migration as simple/easy as possible.
18. Future Roadmap
How will we achieve this?
- There are a few options that we are considering right now.
- Some of them introduce new technologies/components.
- Some of them make use of our existing technology to its maximum
potential.
- We are trying exciting, (relatively) new technologies:
- Google BigQuery
- AWS Athena
- AWS Redshift Spectrum
- etc
Mongo track (raw, sharded) + mongosdim + mongo summary + hi-chart + JS script
Because we were running out of space, we rebuilt it so it would be scalable (it wasn't, lol)
Separated raw and summary
Because the app often hit high-latency queries when fetching key-value data
Separated out what the application uses
Lesson learned:
Don't build a multi-purpose DB (separate tracking & summary)
Foresee data growth and plan how far things need to scale
Separate out DBs that need a well-defined SLA; their load must be predictable (everything used to be mixed in the DWH, so it could get hit by ad-hoc scripts)
Kafka + custom consumer + Mongo
Because the app often went down whenever a heavy query was running and inserts became slow
Decoupling read and write
Raise tracking throughput -> so that, on the app side, writes don't bottleneck on the DB
Lesson learned
Decoupling at the infrastructure level is important
The datahub concept has been validated so far
Mongo track (raw, sharded) + Postgres DWH + ETL in Python + BI tools
We finally have ETL; processing is more "expressive" and no longer depends on monpro
Because Postgres can connect to all sorts of BI tools
Because with BI tools and SQL, the data is more accessible to non-coding users
Lesson learned
Build the DWH from the very beginning
For analysis, SQL compatibility is very important; it is a very ubiquitous skill and a good fit for analysts because it does not require programmatic coding (declarative is enough, and faster)
Don't rebuild commodity tools that are not our focus (we tried building our own BI tool)
Redshift DWH
Performance
Space
Lesson learned:
Assess the right technology against its specific requirements
Gobblin
The tracking Mongo was getting full, so data that is not queried is moved to S3 instead and no longer has to be written to Mongo
Make tracking data available on S3
Lesson learned
Foresee growth and look for a solution that scales to that growth
ETL with Spark + Airflow
Python on a single node could not keep up, and scaling it was hard
Data dependencies are easier to define (see the DAG sketch below)
Rerunnability (?)
Lesson learned
Keep distributed processing in mind
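A minimal Airflow DAG sketch (Airflow 1.x style imports) of the two points above: explicit dependencies between Spark jobs and per-date rerunnability for backfills. The job scripts, schedule, and paths are hypothetical.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"owner": "data-eng", "retries": 1}

dag = DAG(
    dag_id="daily_activity_etl",
    default_args=default_args,
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

# Each task is a spark-submit over one day of data; {{ ds }} makes the job
# idempotent per execution date, which is what allows clean backfills/reruns.
extract = BashOperator(
    task_id="extract_raw_activity",
    bash_command="spark-submit /jobs/extract_raw_activity.py --date {{ ds }}",
    dag=dag,
)
build_facts = BashOperator(
    task_id="build_fact_tables",
    bash_command="spark-submit /jobs/build_fact_tables.py --date {{ ds }}",
    dag=dag,
)

extract >> build_facts  # the fact build only runs after extraction succeeds
```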
Near real time with MemSQL
We wanted to shut Mongo down entirely, but then where should monitoring move? S3 only gets data hourly
Separate out the near-real-time concerns, such as alerting
Lesson learned
When migrating, cover all the use cases explicitly; this one was somewhat left behind
Keep digging into requirements until they are very specific; for data this is mostly related to 1) latency SLA, 2) query pattern, 3) accuracy, 4) processing requirements, 5) tools integration
Real time with PubSub + Dataflow + DynamoDB
Mongosdim was also going to be decommissioned because Mongo does not scale; it moved to something better suited to key-value data
The processing was moved to state-of-the-art tooling at the same time
Lesson learned
For research, always prepare a buffer and a plan B (and a plan C if needed)
Integration across clouds is not as scary as you might think, but be aware that latency becomes the main problem. The nicest option would be a VPN (not tried yet)
Near real time with BigQuery
MemSQL needs its own admin in order to scale
Lesson learned
Keep broadening your knowledge; there may be tech that can help, so you don't have to migrate twice
Data lake with Hive + Presto
Redshift could not cope when used for the exploration case; it makes no sense for weird, ad-hoc exploration queries to share a DB with regular dashboards and reports
Lesson learned
Scalability is sometimes just a sales pitch; look more carefully at which part will hit its limit first and whether that part alone can be scaled (here Redshift loses to Presto, because Presto separates compute and storage so each can scale on its own)
We were a year late compared to others in adopting Presto. Keep up to date with technology and be aware of which use case each one fits