The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially for those in a rapidly evolving organization.
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
1.
2. Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Nedra Albrecht, Senior Software Engineer
Derek Gorthy, Software Engineer II
3. Introductions
Nedra Albrecht
▪ Joined Zillow in October 2017
▪ 20 years of database development experience with OLTP, OLAP, and Big Data systems
▪ Founding member of the Zillow Offers data engineering team
Derek Gorthy
▪ Joined Zillow in August 2019
▪ Background in developing highly scalable data pipelines and machine learning applications
▪ 4+ years of experience using Apache Spark
4. Agenda
▪ What is Zillow Offers (ZO)?
▪ Previous architecture
▪ Scope of the Zillow Offers data engineering domain
▪ Next-generation architecture
  ▪ Overview
  ▪ Design process
  ▪ Key components
▪ Lessons learned
7. Zillow Offers Data Engineering (2018)
1. Onboard a variety of internal and external data sources
2. Develop data pipelines quickly
3. Enable analytic teams to develop specific business logic
8. Original Architecture
Sources: internal data sources (via Kinesis or API call) and external data sources
Pipelines 1 … N (Airflow + custom logic), each some combination of: Convert to Parquet → Merge Deltas → Custom Logic
Serving: Hive and Presto over Combined Data Table 1 … Combined Data Table N, exposed through views
▪ Data stored as a JSON object in each row
▪ JSON extract used to expose data
▪ Cleansing, exposing, and data type validation implemented through nested views
11. Zillow Offers Data Engineering
2020:
1. Decrease the time it takes to onboard a new data source
2. Earlier detection of data quality issues in our pipelines
3. Library-based processing that can be extended across Zillow
2018:
1. Onboard a variety of internal and external data sources
2. Develop data pipelines quickly
3. Enable analytic teams to develop specific business logic
12. New Architecture
Sources: internal data sources (via Kafka, API call, or Hive/EDW) and external data sources
Pipeline generation: config + schema drive Pipelines 1 … N, orchestrated by Airflow
Pipeler library steps (Velocity): Convert to Parquet → Validate Schema → Validate Data
Pipeler library steps (Quality): Merge Deltas → Flatten Arrays → Business Logic → Data Auditing
Serving: Hive valid dataset → Hive served dataset → data marts
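The config-plus-schema generation step above can be sketched as follows. The config keys and step names are illustrative assumptions, not the actual Pipeler config format; the idea is that a declarative per-source config expands into an ordered task list an orchestrator such as Airflow can schedule:

```python
# Hypothetical per-source pipeline config (format assumed for illustration).
PIPELINE_CONFIG = {
    "source": "kafka://home-events",
    "schema": "home_event_v1",
    "steps": [
        "convert_to_parquet",
        "validate_schema",
        "validate_data",
        "merge_deltas",
        "flatten_arrays",
        "business_logic",
        "data_auditing",
    ],
}

def generate_pipeline(config):
    """Expand a declarative config into ordered (task_name, step) pairs.

    Each pair would become one orchestrated task; the library supplies the
    implementation of every named step.
    """
    return [(f"{config['schema']}.{step}", step) for step in config["steps"]]

tasks = generate_pipeline(PIPELINE_CONFIG)
```

Onboarding a new source then means writing a config and a schema rather than a new pipeline, which is how this design serves the 2020 goal of faster onboarding.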
13. Establish Processing Layers
(Same diagram as slide 12, highlighting the storage layers: Hive/EDW sources, config/schema-driven pipeline generation, the Hive valid and served datasets, and the data marts.)
14. Pipeler Library
(Same diagram as slide 12, highlighting the Pipeler library itself: config and schema feed pipeline generation, and the library supplies the processing steps: Convert to Parquet, Validate Schema, Validate Data, Merge Deltas, Flatten Arrays, Business Logic, and Data Auditing.)
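One way to read the library idea above is a registry of small, reusable, individually testable steps that pipelines compose by name. This structure is an assumption, not Pipeler's actual API, and plain Python lists stand in for Spark DataFrames to keep the sketch dependency-free:

```python
# Registry mapping step names (as they appear in a pipeline config) to
# transformation functions. Structure is hypothetical.
STEP_REGISTRY = {}

def step(name):
    """Decorator that registers a records -> records transformation."""
    def register(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return register

@step("validate_data")
def validate_data(records):
    # In Spark this would take and return a DataFrame; a list of dicts
    # stands in here. Drop rows that fail a simple data contract.
    return [r for r in records if r.get("price", 0) > 0]

@step("business_logic")
def business_logic(records):
    # Source-specific logic plugs in as just another registered step.
    return [{**r, "price_bucket": "high" if r["price"] > 200 else "low"}
            for r in records]

def run_pipeline(records, step_names):
    """Apply the configured steps in order."""
    for name in step_names:
        records = STEP_REGISTRY[name](records)
    return records
```

Because each step is an isolated function, it can be unit tested on its own and reused across every Zillow pipeline, which is what "library-based processing that can be extended across Zillow" asks for.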
16. Data Processing vs. Business Logic
(Same diagram as slide 12, highlighting the split between the two kinds of steps. Velocity, the generic data processing: Convert to Parquet, Validate Schema, Validate Data. Quality, the business-logic side: Merge Deltas, Flatten Arrays, Business Logic, Data Auditing.)
17. Validating Data Early
(Same diagram as slide 12, highlighting the validation path for one source: an internal data source lands via Kafka, passes through Validate Schema and Validate Data, and only then reaches the Hive valid dataset.)
18. Key Takeaways
Data engineers are not limited to pipeline building; they also develop tooling
▪ Pipeler processing library
▪ Configuration framework
Early detection and alerting of data quality issues
▪ Enforcing code-based contracts
▪ Data quality should be owned by all teams
Proactive collaboration between data engineering and product teams on event design
▪ Schema design and registry