This document discusses machine learning data lineage using Delta Lake. It introduces Richard Zang and Denny Lee, then outlines the machine learning lifecycle and challenges of model management. It describes how MLflow Model Registry can track model versions, stages, and metadata. It also discusses how Delta Lake allows data to be processed continuously and incrementally in a data lake. Delta Lake uses a transaction log and file format to provide ACID transactions and allow optimistic concurrency control for conflicts.
Machine Learning Data Lineage with MLflow and Delta Lake
2. Machine Learning Data Lineage
with MLflow and Delta Lake
Richard Zang, Senior Software Engineer, Databricks
Denny Lee, Staff Developer Advocate, Databricks
3. Richard Zang
Senior Software Engineer at
Databricks
Previously
▪ Senior Software Engineer at
Hortonworks
▪ Senior Software Engineer at
Opentext Analytics
4. Denny Lee
Staff Developer Advocate at
Databricks
Previously
▪ Senior Director of Data Science
Engineering at Concur
▪ Principal Program Manager at
Microsoft
▪ Project Isotope (Azure
HDInsight)
▪ SQLCAT DW/BI Lead
8. Components
▪ Tracking: Record and query experiments (code, metrics, parameters, artifacts, models)
▪ Models: General model format that standardizes deployment options
▪ Model Registry: Centralized and collaborative model lifecycle management
▪ Projects: Packaging format for reproducible runs on any compute platform
9. Model Lifecycle Data Lineage
[Diagram labels: Data Scientists; Tracking (Parameters, Metrics, Artifacts, Metadata, Models); Models (Flavor 1, Flavor 2); Model Registry (v1, v2, v3; Staging, Production, Archived); Deployment Engineers; Serving (In-Line Code, Containers, Batch & Stream Scoring, Cloud Inference Services, OSS Serving Solutions); v0, v1]
10. Challenges in Model Management
When you’re working on one ML app alone, keeping the model in files is manageable.
MODEL DEVELOPER
classifier_v1.h5
classifier_v2.h5
classifier_v3_sept_19.h5
classifier_v3_new.h5
…
11. Challenges in Model Management
When you work in a large organization with many models, management becomes a big challenge:
• Where can I find the best version of this model?
• How was this model trained?
• How can I track docs for each model?
• How can I review models?
[Diagram labels: MODEL DEVELOPER, REVIEWER, MODEL USER, ???]
12. MLflow Model Registry
Repository of named, versioned models with comments & tags
Track each model’s stage: dev, staging, production, archived
Easily load a specific version
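The three capabilities on this slide can be illustrated with a toy in-memory registry. This is a minimal sketch for intuition only, not MLflow's implementation or API: `ToyRegistry`, `ModelVersion`, and all method names are hypothetical, and the artifact strings simply reuse the file names from the earlier slide.

```python
from dataclasses import dataclass, field

# Stage names taken from the slide.
STAGES = ("dev", "staging", "production", "archived")


@dataclass
class ModelVersion:
    version: int
    artifact: str                  # hypothetical stand-in for the stored model file
    stage: str = "dev"
    comment: str = ""
    tags: dict = field(default_factory=dict)


class ToyRegistry:
    """Hypothetical in-memory registry of named, versioned models."""

    def __init__(self):
        self._models = {}          # model name -> list of ModelVersion

    def register(self, name, artifact, comment="", tags=None):
        # Each registration under the same name gets the next version number.
        versions = self._models.setdefault(name, [])
        mv = ModelVersion(len(versions) + 1, artifact,
                          comment=comment, tags=dict(tags or {}))
        versions.append(mv)
        return mv

    def get(self, name, version):
        # Load one specific version (versions are 1-based).
        return self._models[name][version - 1]

    def transition(self, name, version, stage):
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.get(name, version).stage = stage

    def latest_in_stage(self, name, stage):
        # Most recently registered version currently in the given stage, if any.
        matches = [v for v in self._models[name] if v.stage == stage]
        return matches[-1] if matches else None


registry = ToyRegistry()
registry.register("classifier", "classifier_v1.h5", comment="baseline")
registry.register("classifier", "classifier_v2.h5", tags={"reviewed": "yes"})
registry.transition("classifier", 1, "archived")
registry.transition("classifier", 2, "production")

print(registry.get("classifier", 2).artifact)
print(registry.latest_in_stage("classifier", "production").version)
```

In MLflow itself, a registered version is typically addressed with a `models:/<name>/<version>` URI, for example `mlflow.pyfunc.load_model("models:/classifier/2")`; the model name here is again only illustrative.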
13. MLflow Model Registry
[Diagram labels: MODEL DEVELOPER; Model Registry; DOWNSTREAM USERS; AUTOMATED JOBS; REST SERVING; REVIEWERS, CI/CD TOOLS]
14. A Data Engineer’s Dream...
[Diagram labels: CSV, JSON, TXT…; Kinesis; Data Lake; AI & Reporting]
Process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch and streaming
16. Implementing Atomicity
Changes to the table are stored as ordered, atomic units called commits
000000.json: Add 1.parquet, Add 2.parquet
000001.json: Remove 1.parquet, Remove 2.parquet, Add 3.parquet
…
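The commit sequence on this slide can be replayed in order to see which files make up the table at a given version. The sketch below assumes that the two adds belong to 000000.json and the remaining three actions to 000001.json. The dictionaries are a simplified stand-in for commit contents, not Delta Lake's actual JSON schema, and `replay` is a hypothetical helper.

```python
# Simplified commit contents, keyed by commit file name (assumed mapping).
commits = {
    "000000.json": [{"add": "1.parquet"}, {"add": "2.parquet"}],
    "000001.json": [
        {"remove": "1.parquet"},
        {"remove": "2.parquet"},
        {"add": "3.parquet"},
    ],
}


def replay(commits, up_to=None):
    """Apply commits in file-name order and return the set of live data files.

    Each commit is applied as a whole, so a reader sees either all of its
    actions or none of them.
    """
    live = set()
    for name in sorted(commits):
        if up_to is not None and name > up_to:
            break
        for action in commits[name]:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live


state_v0 = replay(commits, up_to="000000.json")   # table as of the first commit
state_v1 = replay(commits)                        # table as of the latest commit

print(sorted(state_v0))
print(sorted(state_v1))
```

Replaying only up to an earlier commit yields the table as it was at that version, which is why an ordered log of atomic commits is enough to reconstruct any past state.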
17. Solving Conflicts Optimistically
1. Record start version
2. Record reads/writes
3. Attempt commit
4. If someone else wins, check if anything you read has changed.
5. Try again.
[Diagram: User 1 and User 2 each perform "Read: Schema" and "Write: Append" against the log 000000.json, 000001.json, 000002.json]
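The five steps above can be sketched with a toy, single-threaded log. This is an illustration under stated assumptions, not Delta Lake's code: `ToyLog`, `commit_with_retry`, and the `changed` field are hypothetical, and the put-if-absent check that a real storage layer must perform atomically is simulated here with a plain length comparison.

```python
class CommitConflict(Exception):
    """Raised when a winning commit changed something this writer read."""


class ToyLog:
    def __init__(self):
        self.commits = []          # list index = commit version

    def latest_version(self):
        return len(self.commits) - 1

    def try_commit(self, version, entry):
        # Put-if-absent: succeeds only if `version` is the next free slot.
        if version != len(self.commits):
            return False
        self.commits.append(entry)
        return True


def commit_with_retry(log, reads, entry, start_version, max_attempts=5):
    seen = start_version                       # 1. record start version
    for _ in range(max_attempts):              # 2. `reads` and `entry` are the recorded reads/writes
        target = seen + 1
        if log.try_commit(target, entry):      # 3. attempt commit
            return target
        for winner in log.commits[target:]:    # 4. someone else won: did they change what we read?
            if reads & winner["changed"]:
                raise CommitConflict(f"{sorted(reads & winner['changed'])} changed")
        seen = log.latest_version()            # 5. try again on top of the winners
    raise CommitConflict("too many attempts")


log = ToyLog()
log.try_commit(0, {"user": "setup", "changed": {"schema"}})     # 000000.json

# Both users start from version 0, read the schema, and append data.
start = log.latest_version()
v1 = commit_with_retry(log, {"schema"}, {"user": "User 1", "changed": set()}, start)
v2 = commit_with_retry(log, {"schema"}, {"user": "User 2", "changed": set()}, start)
print(v1, v2)

# A schema change that lands first invalidates a writer that read the old schema.
log.try_commit(3, {"user": "User 3", "changed": {"schema"}})
try:
    commit_with_retry(log, {"schema"}, {"user": "User 4", "changed": set()}, 2)
    conflict_detected = False
except CommitConflict:
    conflict_detected = True
print(conflict_detected)
```

In this sketch, two appends that only read the schema do not conflict, so the losing writer simply lands one version later, matching the 000001.json and 000002.json entries in the diagram. A retry is refused only when a winning commit changed something the writer had read.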