SlideShare a Scribd company logo
1 of 33
Download to read offline
WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Thiago Rigo and David Mariassy, GetYourGuide
Modern ETL Pipelines
with Change Data Capture
#UnifiedDataAnalytics #SparkAISummit
Who are we?
5 years of experience in
Business Intelligence and Data
Engineering roles from the Berlin
e-commerce scene.
Data Engineer, Data Platform.
Software engineer for the past 7
years, last 3 focused on data
engineering.
Senior Data Engineer, Data
Platform.
Agenda
1 Intro to GetYourGuide
2 GYG’s Legacy ETL Pipeline
3 Rivulus ETL Pipeline
4 Conclusion
5 Questions
Intro to GetYourGuide
We make it simple to book and enjoy
incredible experiences
Europe’s largest marketplace
for travel experiences
50k+
Products in 150+
countries
25M+
Tickets sold
$650M+
In VC funding
600+
Strong global team
150+
Traveler nationalities
GYG’s Legacy ETL Pipeline
Breaking
schema
changes
upstream
Requires
special
knowledge
Long
recovery
times
Difficult to
test
Bad SLAs
Where we started
Requires
special
knowledge
Breaking
schema
changes
upstream
Long
recovery
times
Difficult to
test
Bad SLAs
Automatic
handling of
schema
changes
Familiar
tooling (Scala,
SQL)
Maximum
parallelism
Built for
testability
Better SLAs
What we wanted
Rivulus ETL Pipeline
Overview
Extraction Layer
The pipeline
Debezium
● Open source distributed platform for change
data capture
● Can read several databases
○ MySQL, Postgres, Cassandra, Oracle, SQL
Server, and Mongo DB
● It works as a connector part of Kafka Connect
● It streams the database's event log into Kafka
● Streams those changes to Kafka
● Scala library
● Keeps track of all schema changes applied to
the tables
● Holds PK, timestamp and partition columns
● Prevents breaking changes from being
introduced
○ Type changes
● Upcast types
● Schema Service works on column level
Schema Service
Automatic
handling of
schema
changes
Avro Converter
● Regular Scala application
● Runs as part of Airflow DAG
● Reads raw Avro files from S3
● Communicates with Schema Service to handle
schema changes automatically
● Writes out Parquet files
Automatic
handling of
schema
changes
Upsert
● Spark application
● Runs as part of Airflow DAG
● Reads in new Parquet files
● Communicates with Schema Service to get PK,
timestamp and partition columns
● Compacts the data based on table’s PK
● Creates Hive table which contains a replica of
source DB
Transformation Layer
The performance penalty of managing
transformation dependencies
inefficiently
The gradual forsaking of performance on the altar
of dependency management
Humble
beginnings
● Small set of
transformations.
● Small team / single
engineer.
● Simple one-to-one type
dependencies.
● Defining an optimal
dependency model by
hand is possible.
The gradual forsaking of performance on the altar
of dependency management
Humble
beginnings
Complexity on
the horizon
● Growing set of transformations.
● Growing team.
● One-to-many / many-to-many type
dependencies.
● Defining a dependency model by
hand becomes cumbersome and
error-prone
The gradual forsaking of performance on the altar
of dependency management
Humble
beginnings
Complexity on
the horizon
The hard
choice
between
performance
and
correctness
● As optimal dependency models
become ever more difficult to maintain
and expand manually without making
errors, teams decide to optimise for
correctness over performance.
● This results in crude dependency
models with a lot of sequential
execution in places where
parallelization would be possible.
The gradual forsaking of performance on the altar
of dependency management
Humble
beginnings
Complexity on
the horizon
The
performance
bottleneck
strikes back
The hard
choice
between
performance
and
correctness
● Sequential execution results in
long execution and long recovery
times. In other words Poor SLAs.
● 💣🔥
Rivulus SQL for automated dependency
inference
Maximum
parallelism
● SQL transformations
○ A collection of Rivulus SQL files that make use of a set of
custom template variables.
● Executor app
○ Spark app that executes a single transformation at a
time.
● DGB (Dependency Graph Builder)
○ Parses all files in the SQL library and builds a
dependency graph of the transformations by
interpolating Rivulus SQL template vars.
● Airflow
○ Executes the transformations on Databricks in the order
specified by the DGB.
Main components
Rivulus SQL syntax
● {% reference:target ‘dim_tour’ %}
○ Declares a dependency between this transformation and the
dim_tour transformation that must be defined in the same SQL
library
● {% reference:source ‘gyg__customer’ %}
○ Declares a dependency between this transformation and a raw
data source (gyg.customer) that is loaded to Hive by an
extraction job
● {% load ‘file.sql’ %}
○ Loads a reusable subquery defined in file.sql into this
transformation.
Familiar
tooling (Scala,
SQL)
{
"fact_nps_feedback": {
"source_dependencies": [
"gyg__nps_feedback"
],
"transformation_dependencies": [
"dim_nps_feedback_stage"
]
}
}
DGB
Airflow
Example
Executor
app
invocations
on DB
SELECT
nps_feedback_id
, nps_feedback_stage_id
, booking_id
, score
, feedback
, update_timestamp
, source
FROM {% reference:source 'gyg__nps_feedback' %} AS nf
LEFT JOIN {% reference:target 'dim_nps_feedback_stage' %} nfs
ON nfs.nps_feedback_stage_name = nf.stage
Build time
Rivulus SQL
Build time
Runtime
A word on testing
● Maximum parallelism enhances testability
● Separation of config from code
○ Configurable input and output paths
Built for
testability
Conclusion
Results
Eliminated
vulnerability
to upstream
schema
changes
Democratized
our ETL by
migrating all
business logic
to SQL
Minimized
recovery
time by
maximizing
parallelism
Designed
for E2E
testability
Cut
processing
time by 70%
(further
reductions are
possible)
Next Steps
Intra-day
micro-batches
Database
Replication as
a Service
Rivulus SQL is
GYG’s
standard tool
for writing
transformations
Delta for Upsert
Questions?
We’re hiring!
https://careers.getyourguide.com

More Related Content

What's hot

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Hive Bucketing in Apache Spark
Hive Bucketing in Apache SparkHive Bucketing in Apache Spark
Hive Bucketing in Apache Spark
Tejas Patil
 

What's hot (20)

Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Beyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFramesBeyond SQL: Speeding up Spark with DataFrames
Beyond SQL: Speeding up Spark with DataFrames
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Spark
SparkSpark
Spark
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Hive Bucketing in Apache Spark
Hive Bucketing in Apache SparkHive Bucketing in Apache Spark
Hive Bucketing in Apache Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 

Similar to Modern ETL Pipelines with Change Data Capture

Similar to Modern ETL Pipelines with Change Data Capture (20)

Encode Club workshop slides
Encode Club workshop slidesEncode Club workshop slides
Encode Club workshop slides
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
Building scalable data with kafka and spark
Building scalable data with kafka and sparkBuilding scalable data with kafka and spark
Building scalable data with kafka and spark
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 
AirBNB's ML platform - BigHead
AirBNB's ML platform - BigHeadAirBNB's ML platform - BigHead
AirBNB's ML platform - BigHead
 
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...
 
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at DatabricksLessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
 
Ultimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on KubernetesUltimate Guide to Microservice Architecture on Kubernetes
Ultimate Guide to Microservice Architecture on Kubernetes
 
Dataflow.pptx
Dataflow.pptxDataflow.pptx
Dataflow.pptx
 
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
Agile Oracle to PostgreSQL migrations (PGConf.EU 2013)
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development ModelLaskar: High-Velocity GraphQL & Lambda-based Software Development Model
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model
 
Building data pipelines at Shopee with DEC
Building data pipelines at Shopee with DECBuilding data pipelines at Shopee with DEC
Building data pipelines at Shopee with DEC
 
Transform Your Data Integration Platform From Informatica To ODI
Transform Your Data Integration Platform From Informatica To ODI Transform Your Data Integration Platform From Informatica To ODI
Transform Your Data Integration Platform From Informatica To ODI
 
GraphQL across the stack: How everything fits together
GraphQL across the stack: How everything fits togetherGraphQL across the stack: How everything fits together
GraphQL across the stack: How everything fits together
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
karishmasinghjnh
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
gajnagarg
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 

Recently uploaded (20)

Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls roorkee Escorts ☎️9352988975 Two shot with one girl ...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 

Modern ETL Pipelines with Change Data Capture

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Thiago Rigo and David Mariassy, GetYourGuide Modern ETL Pipelines with Change Data Capture #UnifiedDataAnalytics #SparkAISummit
  • 3. Who are we? 5 years of experience in Business Intelligence and Data Engineering roles from the Berlin e-commerce scene. Data Engineer, Data Platform. Software engineer for the past 7 years, last 3 focused on data engineering. Senior Data Engineer, Data Platform.
  • 4. Agenda 1 Intro to GetYourGuide 2 GYG’s Legacy ETL Pipeline 3 Rivulus ETL Pipeline 4 Conclusion 5 Questions
  • 6. We make it simple to book and enjoy incredible experiences
  • 7. Europe’s largest marketplace for travel experiences 50k+ Products in 150+ countries 25M+ Tickets sold $650M+ In VC funding 600+ Strong global team 150+ Traveler nationalities
  • 10. Requires special knowledge Breaking schema changes upstream Long recovery times Difficult to test Bad SLAs Automatic handling of schema changes Familiar tooling (Scala, SQL) Maximum parallelism Built for testability Better SLAs What we wanted
  • 15. Debezium ● Open source distributed platform for change data capture ● Can read several databases ○ MySQL, Postgres, Cassandra, Oracle, SQL Server, and Mongo DB ● It works as a connector part of Kafka Connect ● It streams the database's event log into Kafka ● Streams those changes to Kafka
  • 16. ● Scala library ● Keeps track of all schema changes applied to the tables ● Holds PK, timestamp and partition columns ● Prevents breaking changes from being introduced ○ Type changes ● Upcast types ● Schema Service works on column level Schema Service Automatic handling of schema changes
  • 17. Avro Converter ● Regular Scala application ● Runs as part of Airflow DAG ● Reads raw Avro files from S3 ● Communicates with Schema Service to handle schema changes automatically ● Writes out Parquet files Automatic handling of schema changes
  • 18. Upsert ● Spark application ● Runs as part of Airflow DAG ● Reads in new Parquet files ● Communicates with Schema Service to get PK, timestamp and partition columns ● Compacts the data based on table’s PK ● Creates Hive table which contains a replica of source DB
  • 20. The performance penalty of managing transformation dependencies inefficiently
  • 21. The gradual forsaking of performance on the altar of dependency management Humble beginnings ● Small set of transformations. ● Small team / single engineer. ● Simple one-to-one type dependencies. ● Defining an optimal dependency model by hand is possible.
  • 22. The gradual forsaking of performance on the altar of dependency management Humble beginnings Complexity on the horizon ● Growing set of transformations. ● Growing team. ● One-to-many / many-to-many type dependencies. ● Defining a dependency model by hand becomes cumbersome and error-prone
  • 23. The gradual forsaking of performance on the altar of dependency management Humble beginnings Complexity on the horizon The hard choice between performance and correctness ● As optimal dependency models become ever more difficult to maintain and expand manually without making errors, teams decide to optimise for correctness over performance. ● This results in crude dependency models with a lot of sequential execution in places where parallelization would be possible.
  • 24. The gradual forsaking of performance on the altar of dependency management Humble beginnings Complexity on the horizon The performance bottleneck strikes back The hard choice between performance and correctness ● Sequential execution results in long execution and long recovery times. In other words Poor SLAs. ● 💣🔥
  • 25. Rivulus SQL for automated dependency inference Maximum parallelism
  • 26. ● SQL transformations ○ A collection of Rivulus SQL files that make use of a set of custom template variables. ● Executor app ○ Spark app that executes a single transformation at a time. ● DGB (Dependency Graph Builder) ○ Parses all files in the SQL library and builds a dependency graph of the transformations by interpolating Rivulus SQL template vars. ● Airflow ○ Executes the transformations on Databricks in the order specified by the DGB. Main components
  • 27. Rivulus SQL syntax ● {% reference:target ‘dim_tour’ %} ○ Declares a dependency between this transformation and the dim_tour transformation that must be defined in the same SQL library ● {% reference:source ‘gyg__customer’ %} ○ Declares a dependency between this transformation and a raw data source (gyg.customer) that is loaded to Hive by an extraction job ● {% load ‘file.sql’ %} ○ Loads a reusable subquery defined in file.sql into this transformation. Familiar tooling (Scala, SQL)
  • 28. { "fact_nps_feedback": { "source_dependencies": [ "gyg__nps_feedback" ], "transformation_dependencies": [ "dim_nps_feedback_stage" ] } } DGB Airflow Example Executor app invocations on DB SELECT nps_feedback_id , nps_feedback_stage_id , booking_id , score , feedback , update_timestamp , source FROM {% reference:source 'gyg__nps_feedback' %} AS nf LEFT JOIN {% reference:target 'dim_nps_feedback_stage' %} nfs ON nfs.nps_feedback_stage_name = nf.stage Build time Rivulus SQL Build time Runtime
  • 29. A word on testing ● Maximum parallelism enhances testability ● Separation of config from code ○ Configurable input and output paths Built for testability
  • 31. Results Eliminated vulnerability to upstream schema changes Democratized our ETL by migrating all business logic to SQL Minimized recovery time by maximizing parallelism Designed for E2E testability Cut processing time by 70% (further reductions are possible)
  • 32. Next Steps Intra-day micro-batches Database Replication as a Service Rivulus SQL is GYG’s standard tool for writing transformations Delta for Upsert