In this talk we'll present how at GetYourGuide we've built from scratch a completely new ETL pipeline using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as Sqoop. Another approach that has become quite popular lately is to use Debezium as the change data capture layer, which reads the databases' binlogs and streams those changes directly to Kafka. Since having data once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL around Debezium.
We'll walk the audience through the steps we followed to architect and develop such a solution, using Databricks to reduce operation time. With this new pipeline we can now refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
2. Thiago Rigo and David Mariassy, GetYourGuide
Modern ETL Pipelines with Change Data Capture
#UnifiedDataAnalytics #SparkAISummit
3. Who are we?
● 5 years of experience in Business Intelligence and Data Engineering roles in the Berlin e-commerce scene. Data Engineer, Data Platform.
● Software engineer for the past 7 years, the last 3 focused on data engineering. Senior Data Engineer, Data Platform.
6. We make it simple to book and enjoy incredible experiences
7. Europe's largest marketplace for travel experiences
● 50k+ products in 150+ countries
● 25M+ tickets sold
● $650M+ in VC funding
● 600+ strong global team
● 150+ traveler nationalities
15. Debezium
● Open source distributed platform for change data capture
● Can read several databases
○ MySQL, Postgres, Cassandra, Oracle, SQL Server, and MongoDB
● Runs as a connector on top of Kafka Connect
● Reads the database's event log (e.g. MySQL's binlog) and streams those changes to Kafka
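To make the stream contents concrete: each Debezium change event wraps the row state before and after the change plus source metadata. A minimal Scala model of that envelope follows; the class and field names are our simplification for illustration, not Debezium's actual Avro classes, though the before/after/op/ts_ms/source fields mirror the real event format.

// Simplified model of a Debezium change event envelope (names are ours;
// the real events are Avro records produced via Kafka Connect).
final case class SourceMetadata(db: String, table: String, file: String, pos: Long)

final case class ChangeEvent[Row](
  before: Option[Row],    // row state before the change; empty for inserts
  after: Option[Row],     // row state after the change; empty for deletes
  op: String,             // "c" create, "u" update, "d" delete, "r" snapshot read
  tsMs: Long,             // when the change was applied on the source database
  source: SourceMetadata  // e.g. binlog file and position for MySQL
)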
16. Schema Service: automatic handling of schema changes
● Scala library
● Keeps track of all schema changes applied to the tables
● Holds PK, timestamp and partition columns
● Prevents breaking changes from being introduced
○ Type changes
● Upcasts types where compatible
● Works at the column level
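As a rough illustration of the responsibilities listed above, a client of such a service might look like the following sketch. All names and signatures here are assumptions, not GetYourGuide's actual API.

import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical Schema Service interface (names and signatures are assumptions).
final case class TableMeta(pk: Seq[String], timestampCol: String, partitionCols: Seq[String])
final case class BreakingChange(column: String, from: DataType, to: DataType)

trait SchemaService {
  // Latest schema registered for a table, if one exists.
  def currentSchema(table: String): Option[StructType]

  // PK, timestamp and partition columns used by the downstream Upsert job.
  def tableMeta(table: String): TableMeta

  // Column-level check of an incoming schema: rejects breaking type changes,
  // returns an evolved schema with safe upcasts (e.g. int -> long) applied.
  def evolve(table: String, incoming: StructType): Either[BreakingChange, StructType]
}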
17. Avro Converter: automatic handling of schema changes
● Regular Scala application
● Runs as part of an Airflow DAG
● Reads raw Avro files from S3
● Communicates with the Schema Service to handle schema changes automatically
● Writes out Parquet files
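The talk describes the converter as a plain Scala app; purely as an illustration of the same Avro-to-Parquet step, here is a Spark-based sketch. The bucket layout, table name and the placement of the Schema Service call are all assumptions.

import org.apache.spark.sql.SparkSession

// Illustrative sketch only: read raw Debezium Avro output from S3 and rewrite
// it as Parquet. Requires the spark-avro package; paths are assumptions.
val spark = SparkSession.builder.appName("avro-converter").getOrCreate()

val raw = spark.read.format("avro").load("s3://data-lake/raw/gyg__customer/")

// In the real pipeline this is where the app would ask the Schema Service
// whether the incoming schema is compatible and upcast columns if needed.
raw.write.mode("append").parquet("s3://data-lake/staging/gyg__customer/")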
18. Upsert
● Spark application
● Runs as part of an Airflow DAG
● Reads in new Parquet files
● Communicates with the Schema Service to get PK, timestamp and partition columns
● Compacts the data based on the table's PK
● Creates a Hive table which contains a replica of the source DB
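A minimal sketch of that compaction step, keeping only the latest change event per primary key before overwriting the Hive replica. The paths, table and column names are illustrative assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Sketch: deduplicate change events by PK, keeping the most recent row.
val spark = SparkSession.builder.appName("upsert").enableHiveSupport().getOrCreate()

val updates = spark.read.parquet("s3://data-lake/staging/gyg__customer/")

val latestPerKey = Window
  .partitionBy("customer_id")             // the table's PK, from the Schema Service
  .orderBy(col("update_timestamp").desc)  // its timestamp column

updates
  .withColumn("rn", row_number().over(latestPerKey))
  .filter(col("rn") === 1)
  .drop("rn")
  .write
  .mode("overwrite")
  .saveAsTable("replica.gyg__customer")   // Hive replica of the source table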
21. The gradual forsaking of performance on the altar of dependency management
Humble beginnings
● Small set of transformations.
● Small team / single engineer.
● Simple one-to-one dependencies.
● Defining an optimal dependency model by hand is possible.
22. The gradual forsaking of performance on the altar of dependency management
Complexity on the horizon
● Growing set of transformations.
● Growing team.
● One-to-many / many-to-many dependencies.
● Defining a dependency model by hand becomes cumbersome and error-prone.
23. The gradual forsaking of performance on the altar of dependency management
The hard choice between performance and correctness
● As optimal dependency models become ever more difficult to maintain and expand manually without making errors, teams decide to optimise for correctness over performance.
● This results in crude dependency models with a lot of sequential execution in places where parallelization would be possible.
24. The gradual forsaking of performance on the altar of dependency management
The performance bottleneck strikes back
● Sequential execution results in long execution and long recovery times. In other words: poor SLAs.
● 💣🔥
25. Rivulus SQL for automated dependency inference: maximum parallelism
26. Main components
● SQL transformations
○ A collection of Rivulus SQL files that make use of a set of custom template variables.
● Executor app
○ Spark app that executes a single transformation at a time.
● DGB (Dependency Graph Builder)
○ Parses all files in the SQL library and builds a dependency graph of the transformations by interpolating Rivulus SQL template vars.
● Airflow
○ Executes the transformations on Databricks in the order specified by the DGB.
27. Rivulus SQL syntax: familiar tooling (Scala, SQL)
● {% reference:target 'dim_tour' %}
○ Declares a dependency between this transformation and the dim_tour transformation, which must be defined in the same SQL library.
● {% reference:source 'gyg__customer' %}
○ Declares a dependency between this transformation and a raw data source (gyg.customer) that is loaded into Hive by an extraction job.
● {% load 'file.sql' %}
○ Loads a reusable subquery defined in file.sql into this transformation.
28. {
"fact_nps_feedback": {
"source_dependencies": [
"gyg__nps_feedback"
],
"transformation_dependencies": [
"dim_nps_feedback_stage"
]
}
}
DGB
Airflow
Example
Executor
app
invocations
on DB
SELECT
nps_feedback_id
, nps_feedback_stage_id
, booking_id
, score
, feedback
, update_timestamp
, source
FROM {% reference:source 'gyg__nps_feedback' %} AS nf
LEFT JOIN {% reference:target 'dim_nps_feedback_stage' %} nfs
ON nfs.nps_feedback_stage_name = nf.stage
Build time
Rivulus SQL
Build time
Runtime
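To give a feel for how such a graph can be derived, here is a minimal Scala sketch that scans a Rivulus SQL file for reference tags. The regex and function names are our own assumptions, not the actual DGB implementation; applied to the query above, it yields the source dependency gyg__nps_feedback and the transformation dependency dim_nps_feedback_stage.

// Hypothetical dependency extraction from Rivulus SQL (regex and names are
// assumptions). Returns (sourceDeps, transformationDeps) for one SQL file.
val ReferenceTag = """\{%\s*reference:(source|target)\s+'([^']+)'\s*%\}""".r

def dependenciesOf(sql: String): (Set[String], Set[String]) = {
  val refs = ReferenceTag.findAllMatchIn(sql).toList
  val sources = refs.collect { case m if m.group(1) == "source" => m.group(2) }.toSet
  val targets = refs.collect { case m if m.group(1) == "target" => m.group(2) }.toSet
  (sources, targets)
}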
29. A word on testing: built for testability
● Maximum parallelism enhances testability
● Separation of config from code
○ Configurable input and output paths
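For instance, keeping all I/O paths in config and the transformation itself as a pure function lets the same Executor code run against local fixtures in tests. A sketch of the idea, with all names being assumptions:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of config/code separation (names are assumptions): the transformation
// is a pure DataFrame -> DataFrame function, and paths come from config, so
// tests can point input and output at local fixture and temp directories.
final case class JobConfig(inputPath: String, outputPath: String)

def transform(df: DataFrame): DataFrame = df  // stands in for one SQL transformation

def run(spark: SparkSession, cfg: JobConfig): Unit = {
  val input = spark.read.parquet(cfg.inputPath)
  transform(input).write.mode("overwrite").parquet(cfg.outputPath)
}

// In a test: run(spark, JobConfig("src/test/fixtures/in", tmpDir)) and assert
// on the Parquet written to tmpDir.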