This talk presents how Traveloka uses Google Cloud BigQuery to build a Data Provisioning API, which enables Traveloka's microservices to consume data from BigQuery.
2. Session 4
14:40 - 15:15
Data Lake API with BigQuery
PRESENTERS:
Imre Nagi, Software Engineer, Traveloka
Rendy Bambang Jr., Data System Architect, Traveloka
6. Metrics
● ~4 TiB of data per day goes into Pub/Sub
● ~400 TB (~500 billion rows) of data in BigQuery
● ~250 TB of data in GCS
● >2 PiB of BigQuery data scanned per month (excluding ETLs)
● >60k batch jobs executed per day
● >2500 Dataflow jobs per day
● >1500 charts using BigQuery generated via BI tools
7. AGENDA
● How we use Data
● Problem Statement
● Data Lake API
● Future work
8. Data drives product & enables business use cases
Each mission team has unique use cases in terms of data usage. The data team in Traveloka needs to fulfill these needs in order to maximize Traveloka's growth and revenue.
9. How we use data
● Personalization
● Fraud Detection
● Improving User Experience
● A/B Test
● Giving recommendation
● Review Moderation
● Photo classification
● and many other use cases
11. Data Provisioning in Traveloka
Machine To Machine: Frequent, Small request
Machine To Human: Huge Data, High Latency
12. Data Provisioning in Traveloka
Machine To Machine: Frequent, Small request; Huge Data, High Latency
Machine To Human: Huge Data, High Latency
13. How we previously delivered big data to product teams
(1) Product team requests data for a specific use case
(2) Data team provides raw or pre-processed data in a blob storage
(3) Data team grants access to bucket or tables for the team's microservices
(4) Product team pulls the data and does its job
15. This becomes problematic
● Systems are tightly coupled
● No column-level access control
● Hard to audit data usage
16. What do we need?
A standardised way of accessing data
● Clear contract between client and server
● Client is not tightly coupled to internal implementation
● Better access control
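To make the idea of a clear contract with better access control more concrete, here is a minimal Python sketch. It is an illustration only: the request and response types, the `ACL` table, the client name `hotel-service`, and the dataset and column names are all hypothetical and do not come from the talk. The client names a logical dataset and columns rather than a physical table, and the server rejects columns the client is not permitted to read.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical contract types; none of these names come from the talk.
@dataclass(frozen=True)
class DataRequest:
    client_id: str
    dataset: str                 # logical dataset name, not a physical table
    columns: tuple[str, ...]     # columns the client wants to read


@dataclass(frozen=True)
class DataResponse:
    job_id: str
    status: str                  # "PENDING" or "REJECTED" in this sketch
    reason: str = ""


# Hypothetical column-level ACL: client -> dataset -> allowed columns.
ACL = {"hotel-service": {"bookings": {"booking_id", "hotel_id", "check_in"}}}


def denied_columns(req: DataRequest) -> set[str]:
    """Return the requested columns that the client may not read."""
    allowed = ACL.get(req.client_id, {}).get(req.dataset, set())
    return set(req.columns) - allowed


def submit(req: DataRequest) -> DataResponse:
    """Accept the request only if every requested column is permitted."""
    denied = denied_columns(req)
    if denied:
        reason = "columns not permitted: " + ", ".join(sorted(denied))
        return DataResponse(job_id="", status="REJECTED", reason=reason)
    return DataResponse(job_id=f"{req.client_id}-{req.dataset}", status="PENDING")
```

Because the client only sees `DataRequest` and `DataResponse`, the storage behind the API can change without breaking consumers, and every request passes through one place where access can be checked and audited.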
20. Data Provisioning API underlying architecture
Storage: Cloud Storage (for storing results), Cloud SQL, BigQuery
Interface: Data Provision API
Consumers: Traveloka Backend Services
Data Source: BigQuery, Tracking Data
Processing Pipeline: BigQuery SQL
Orchestration
Monitoring
Logging
Architecture: Data Provision API overall architecture
22. How Data Provisioning API works
Components: Service Client, Interface, Query Interpreter, Kubernetes Engine, BigQuery, Cloud SQL, Cloud Storage, Monitoring, Logging
Operations:
● Query Validation
● Job Creation
● Query Execution
● Write to Permanent Table
● Export Permanent Table to GCS
● Generate Signed URL
● Store the Signed URL
● ACL Checks
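The operations named on this slide can be strung together roughly as follows. This is a sketch under stated assumptions, not the actual Traveloka implementation: the slide does not give the exact ordering (placing the ACL check right after validation is a guess), and `FakeBackend`, `ALLOWED`, `run_job`, the bucket name, and the URL format are hypothetical in-memory stand-ins for BigQuery, Cloud Storage, and Cloud SQL so that the example runs without any cloud credentials.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class FakeBackend:
    """In-memory stand-in for BigQuery, Cloud Storage and Cloud SQL."""
    tables: dict = field(default_factory=dict)   # permanent result tables
    objects: dict = field(default_factory=dict)  # exported result files
    jobs: dict = field(default_factory=dict)     # job metadata store


# Hypothetical ACL: which datasets each client may query.
ALLOWED = {"hotel-service": {"bookings"}}


def validate(query: dict) -> None:
    """Query validation: reject requests missing required fields."""
    if not query.get("dataset") or not query.get("columns"):
        raise ValueError("query needs a dataset and at least one column")


def check_acl(client_id: str, query: dict) -> None:
    """ACL check: the client must be allowed to read the dataset."""
    if query["dataset"] not in ALLOWED.get(client_id, set()):
        raise PermissionError(f"{client_id} may not read {query['dataset']}")


def run_job(backend: FakeBackend, client_id: str, query: dict) -> str:
    validate(query)
    check_acl(client_id, query)

    # Job creation: record the job before any work starts.
    job_id = uuid.uuid4().hex
    backend.jobs[job_id] = {"status": "RUNNING"}

    # Query execution + write to a permanent table (simulated by storing the SQL).
    sql = f"SELECT {', '.join(query['columns'])} FROM {query['dataset']}"
    table = f"results_{job_id}"
    backend.tables[table] = sql

    # Export the permanent table to object storage (simulated).
    path = f"gs://example-results/{job_id}.csv"
    backend.objects[path] = table

    # Generate a signed URL (placeholder string) and store it with the job.
    signed_url = f"https://example.invalid/{job_id}.csv?signature=placeholder"
    backend.jobs[job_id] = {"status": "DONE", "signed_url": signed_url}
    return job_id
```

A client would then poll the job record by `job_id` and download the result through the stored signed URL, which keeps the result bucket itself private.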
24. Future Improvement
● Use a queue to manage the jobs
● Add more capabilities to the query features (complex aggregation, etc.)
● Separate the ACL service to enable service reuse
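As one possible reading of the queue-based job management item, the sketch below puts submitted jobs on a queue and lets a fixed pool of workers drain it. The talk does not say which queueing technology is intended; Python's in-process `queue.Queue` is used here purely as a stand-in, and `worker`, `run_all`, and the job naming are hypothetical.

```python
import queue
import threading


def worker(jobs: queue.Queue, results: dict) -> None:
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel: no more work for this worker
            jobs.task_done()
            break
        job_id, query = job
        # Stand-in for running the query and exporting the result.
        results[job_id] = f"processed: {query}"
        jobs.task_done()


def run_all(queries: list, num_workers: int = 2) -> dict:
    """Enqueue every query and process them with a fixed number of workers."""
    jobs: queue.Queue = queue.Queue()
    results: dict = {}
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for i, q in enumerate(queries):
        jobs.put((f"job-{i}", q))
    for _ in threads:            # one sentinel per worker
        jobs.put(None)
    jobs.join()                  # wait until every item has been handled
    for t in threads:
        t.join()
    return results
```

Capping the number of workers bounds how many queries run at once, so a burst of requests waits in the queue instead of all hitting the warehouse simultaneously.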