SlideShare uma empresa Scribd logo
1 de 66
Baixar para ler offline
Dr. Jim Dowling1,2
Slides together with Alexandru A. Ormenisan1,2
, Mahmoud Ismail1,2
PROVENANCE
FOR MACHINE LEARNING PIPELINES
KTH - Royal Institute of Technology (1)
Logical Clocks AB (2)
Growing Consensus on how to manage complexity of AI
Data validation
Distributed
ENGINEER
Model
Serving
A/B
Testing
Monitoring
Pipeline Management
HyperParameter
Tuning
Feature Engineering
Data
Collection
Hardware
Management
Data Model Prediction
φ(x)
2
Growing Consensus on how to manage complexity of AI
Data validation
Distributed
ENGINEER
Model
Serving
A/B
Testing
Monitoring
Pipeline Management
HyperParameter
Tuning
Feature Engineering
Data
Collection
Hardware
Management
Data Model Prediction
φ(x)
ML PLATFORM
TRAIN and SERVE
FEATURE
STORE
What is provenance for ML Pipelines?
ML Pipeline
Feature
engineering
Training Serving
Raw Data Features Models
Governance
…search:
Discovery
Debug &
Analyse ?
Serving
Feature
Engineering
Training &
Validating
?
Pipeline
Automation
Integrity &
Garbage
Collection
Traceability &
Compliance
Why track provenance?
End-to-End Machine Learning (ML) Pipelines
▪ Logical Clocks – Hopsworks (world’s first/only fully open source)
▪ Uber Michelangelo
▪ Airbnb – Bighead/Zipline
▪ Comcast
▪ Twitter
▪ GO-JEK Feast
▪ Conde Nast
▪ Facebook FB Learner
▪ Netflix
▪ Reference: www.featurestore.org
Feature Stores in Production
Event DataRaw Data
Data Lake
Data
Pipelines
BI
Platforms
SQL Data
Feature
Pipelines
Feature
Store
FEATURES FOR MODEL TRAINING
SERVE RT FEATURES TO ONLINE MODELS
FEATURES FOR ANALYTICAL MODELS (BATCH)
Feature Stores make existing Data Infrastructure available to Data Scientists and Online Apps
Click features every 10
secs
CDC data every 30
secs
User profile updates every
hour
Featurized weblogs data every
day
Online
Feature
Store
Offline
Feature
Store
SQL DW
S3, HDFS
SQL
Event Data
Real-Time Data
User-Entered Features (<2
secs)
Online
App
Low
Latency
Features
High
Latency
Features
Train,
Batch App
Feature Store
No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores.
<10ms
TBs/PBs
Feature Pipelines update the Feature Store (2 Databases!) with data from backend Platforms
Feature Store
ClickFeatureGroup
TableFeatureGroup
UserFeatureGroup
LogsFeatureGroup
Event Data
SQL DW
S3, HDFS
SQL
DataFrameAPI
Kafka Input
Flink
RTFeatureGroup
Online
App
Train,
Batch App
User Clicks
DB Updates
User Profile Updates
Weblogs
Real-time features
Kafka Output
The FeatureGroup abstraction hides the complexity of dealing with 2 databases
Features name Pclass Sex Survive Name Balance
Train / Test
Datasets
Survivename PClass Sex Balance
Join key
Feature
Groups
Titanic
Passenger List
Passenger
Bank Account
File format
.tfrecord
.npy
.csv
.hdf5,
.petastorm,
etc
Storage
GCS
Amazon S3
HopsFS
Features, FeatureGroups, and Train/Test Datasets are all versioned
The FeatureGroup abstraction hides the complexity of dealing with 2 databasesFeature Store Concepts in Hopsworks
Example Ingestion of data into a FeatureGroup
https://docs.hopsworks.ai/
dataframe = spark.read.json("s3://dataset/rain.json")
# do feature engineering on your dataframe
df.withColumn('precipitation', (df.val-min)/(max-min))
fg = fs.create_feature_group("rain",
version=1,
description="Rain features",
primary_key=['date', 'location_id'],
online_enabled=True)
fg.save(dataframe)
Example Creation of Train/Test Data from a Feature Store
https://docs.hopsworks.ai/
# Join features across FeatureGroups. Use “on=[..]” to explicitly enter the JOIN key.
feature_join = rain_fg.select_all()
.join(temperature_fg.select_all(), on=["date", "location_id"])
.join(location_fg.select_all()))
td = fs.create_training_dataset("training_dataset",
version=1,
data_format="tfrecords",
description="Training dataset, TfRecords format",
splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})
# The train/test/validation files are now saved to the filesystem (S3, HDFS, etc)
td.save(feature_join)
# Use the training data as follows:
df = td.read(split="train")
Event DataRaw Data
Feature Pipeline FEATURE STORE TRAIN/VALIDATE MODEL SERVING
MONITOR
Data Lake
The end of the End-to-End ML Pipeline!
ML Pipelines start and stop at the Feature Store
End-to-End ML Pipelines on Hopsworks. Provenance is Collecting Metadata.
HopsFS
Code and
configuration
Data Lake,
Warehous
e, Kafka
Feature
Store
Model
registry
Prediction Logs
Monitoring Logs
Feature
Engineering
Serving on
Kubernetes
Model
Training
Model
Deploy
Serving and
Monitoring
Experiments/
Development Scaleout
Metadata
Features
Validate
Deploy to
Log
Artifact (File)
Artifact
Metadata
Elasticsearch
Sync
Metadata is data that describes other data.
Artifacts and Metadata in End-to-End ML PipelinesWhat is Metadata?
Artifacts and Metadata in End-to-End ML Pipelines
File System (S3, HopsFS, etc)
Metastore (Database)
Provenance queries
● SQL or Free-Text or Graph?
● Update Throughput?
● Latency of queries?
● Size of Metadata?
https://www.dataplatformschool.com/blog/w0y8g0-the-data-governance-zoo
Metadata Cataloging Systems - a whole industry
3 Mechanisms for Metadata Collection. Polyglot Metadata Storage for Efficient Querying.
File Systems, Databases, Data Warehouses, Message Bus, etc
Metastore (Database)
Crawler
Job
Pull
(REST) API
Push
Change Data
Capture(CDC) API
Application
Instrumented APIJob
Graph DB Search (Elastic)
Metadata Query API
Artifacts and Metadata in End-to-End ML Pipelines
File System (S3, HopsFS, etc)
Metastore
Consistency issues
Synchronization
?
Metadata is data that describes other data.
Unspoken Assumption:
Why are Data and Metadata always separate stores?
Artifacts and Metadata in End-to-End ML PipelinesWhat is Metadata Revisited?
Artifacts and Metadata in End-to-End ML Pipelines
Raw Data Features
Experiments
(Progs, Logs,
Checkpoints)
Models
Artifacts
Metadata
File System (S3, HopsFS, etc)
Metastore (Database)
Experiments
(HParams, Env,
Results, Graphs)
Feature Stats
(Min,Max,Std,
Mean, Distrib.)
Governance
(Privileges,Audit,
Retention, etc)
Model Desc
(Privileges, Perf,
Provenance, etc)
Mechanism 4: Artifacts and Metadata in the same system - a Unified Metadata Layer (Hopsworks)
Features
Experiments
(Progs, Logs,
Checkpoints)
Models
Artifacts
Metadata
HopsFS
Metastore (NDB)
Experiments
(HParams, Env,
Results, Graphs)
Feature Stats
(Min,Max,Std,
Mean, Distrib.)
Governance
(Privileges,Audit,
Retention, etc)
Model Desc
(Privileges, Perf,
Provenance, etc)
Metastore (NDB)
Raw Data
extends extends extends extends
Libraries
Application
Data
platform
Metadata
store
Explicit
Top–down tracking of provenance.
Push/Pull, CDC, or instrumented
application or library code.
Standalone Metadata Store.
Implicit
Bottom-up tracking of provenance.
Requires redesigning the platform.
Conventions link files to artifacts.
Metadata is strongly consistent
with storage platform.
Mechanism 4: Implicit Provenance
ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata
Mahmoud Ismail1
, Mikael Ronström2
, Seif Haridi1
, Jim Dowling1
1
KTH - Royal Institute of Technology 2
Oracle
19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (IEEE/ACM CCGrid 2019), May 15th
25
Tightly coupled Metadata and Data - replicating Metadata to External Systems
HopsFS
Scaleout
Metadata
Artifact (File)
Artifact
Metadata
Elasticsearch
Sync
DN1 DN2 DN3 DN4 DN5
26
● Highly scalable next-generation distribution of HDFS
mkdir /Images
write /Images/cat.png
Images
cat.png
/
DN1, DN3, DN5
What is HopsFS?
27
inodeID name parentID
Block
storage
(Datenodes)
Metadata
storage
NDB
mkdir /Images
1 / 0
2 Images 1
3 cat.png 2
write /Images/cat.png
What is HopsFS?
28
● Drop-in replacement distribution of HDFS
● 16X - 37X the throughput of HDFS
● 37 larger clusters than HDFS
● 10 times lower latency
What is HopsFS?
29
ePipe: Near Real-Time Polyglot Persistence of HopsFS
Metadata
30
HopsFS
Get all images with 1 cat and 1 guitar
1 cat and 1 guitar
Full-text search is not supported by NDB
Search in HopsFS
31
Store
X
Store
Y
HopsFS
App A
App B
?
Polyglot Persistence - Replicating Metadata to External Systems for Efficient Querying
32
ePipe: Near Real-Time Polyglot Persistence of HopsFS
Metadata
33
● ePipe is a databus that provides replicated metadata as a service for HopsFS
● ePipe internally
• creates a consistent and correctly ordered change stream for HopsFS
metadata
• and eventually delivers the change stream with low latency (sub second)
(Near Real-time) to consumers
ePipe
34
● Extend HopsFS with a logging table to log file system changes
● Leverage the NDB events API to live stream changes on the logging table to
ePipe
● ePipe enriches the file system events with appropriate data and publish the
enriched events to the consumers
ePipe: Design Decisions
35
Create /f1
name operationinodeID name parentID
1 / 0
Inodes table logging table
NDB
HopsFS Namenodes
2 f1 1
3 f2 1
f1 CREATE
f2 CREATE
f2 DELETE
f1 DELETE
Create /f2Delete /f2Delete /f1
Inodes table and logging table updated in the same Transaction to ensure Consistency/Integrity
36
HopsFS
NDB
NDB
log fs
changes
ePipe
Change
stream
Store
X
Store
Y
App A
App Benrichment
subscribe
for changes
ePipe
37
HopsFS
NDB
NDB
ePipe
Create f1
Append f1
Create f2
Delete f1
Delete f2
Create f1
Append f1
Create f2
Delete f1
Delete f2
Epoch1
Epoch2Epoch3
Order across epochs Order within epoch
Delete f1 after Create f1 Create f1 ?? Append f1
Delete f2 after Create f2 Create f2 ?? Delete f1
…..
Inconsistencies
100 ms
Ordering of Log Entries
38
● Property 1: The epochs are totally ordered.
● Property 2: The changes within the same transaction happen in the same
epoch.
● Property 3: The changes on files are ordered only if they are in different
epochs, that is, no ordering is guaranteed within the same epoch.
NDB Ordering Properties
39
HopsFS
NDB
NDB
ePipe
Epoch1
Epoch2Epoch3
Delete f2 ,2
Create f1 ,1
Append f1 ,2
Create f2 ,1
Delete f1 ,3
Delete f2 ,2
Create f1 ,1
Append f1 ,2
Create f2 ,1
Delete f1 ,3
We introduced a version number per inode
which we will increment whenever
a change occurs to an inode.
Append f1 after Create f1
Create f2 ?? Delete f1
Strengthening NDB Ordering Properties
40
● Property 1 & 2 & 3
● Property 4 & 5: The version number ensures the serializability of the changes
on the same file/directory within epochs.
● Property 6: The order of changes for different files/directories within the same
epoch doesn't matter.
ePipe ordering Properties
41
Logging overhead on HopsFS
42
Logging overhead on HopsFS
43
logscalebase10
Notifications Throughput
44
logscalebase10
Latency: average Lag Time
45
● Supports failure recovery thanks to the persistent logging table
• The log entries are deleted only once the associated events are successfully
replicated to the downstream consumers.
• At least once delivery semantics.
● Pluggable architecture
• For example, filter events based on file name or any other attribute.
● Not Limited to HopsFS
• Can be extended to watch for other logging tables for different purposes.
More about ePipe
46
● A databus that provides replicated metadata as a service for HopsFS
● Low overhead on HopsFS
● Low replication lag (sub-second)
● High throughput
● Pluggable architecture
ePipe Properties
What is provenance - ML Pipeline
ML Pipeline
Feature
engineering
Training Serving
Raw Data Features Models
MLFlow Metadata - Explicit API calls
def train(data_path, max_depth, min_child_weight, estimators, model_name):
X_train, X_test, y_train, y_test = build_data(..)
mlflow.set_tracking_uri("jdbc:mysql://username:password@host:3306/database")
mlflow.set_experiment("My Experiment")
with mlflow.start_run() as run:
...
mlflow.log_param("max_depth", max_depth)
mlflow.log_param("min_child_weight", min_child_weight)
mlflow.log_param("estimators", estimators)
with open("test.txt", "w") as f:
f.write("hello world!")
mlflow.log_artifacts("/full/path/to/test.txt")
...
model.fit(X_train, y_train) # auto-logging
...
mlflow.tensorflow.log_model(model, "tensorflow-model",
registered_model_name=model_name)
Hopsworks Metadata - Implicit Metadata
def train(data_path, max_depth, min_child_weight, estimators):
X_train, X_test, y_train, y_test = build_data(..)
...
print("hello world") # monkeypatched - prints in notebook
...
model.fit(X_train, y_train) # auto-logging
…
#Saves model to ”hopsfs://Projects/myProj/models/..”
hops.export_model(model, "tensorflow",..,model_name)
...
# maggy makes an API call to track this dict
return {'accuracy': accuracy, 'loss': loss, 'diagram': 'diagram.png'}
from maggy import experiment
experiment.lagom(train, name="My Experiment", ...)
Metadata
In [ ]:
add(fg_eng, raw_data, features)
…
add(training, features, model)
<fg_eng, raw_data, features>
Pipeline code
What is provenance - Metadata
Feature
engineering
Training Serving
Raw
Data
Features Models
ePipe (with ML Provenance)
Distributed File System (HopsFS)
Full Text Search (Elastic)
Feature
engineering
Training Serving
Raw
Data
Features Models
Let the platform manage the metadata!
ML Artifacts
Features, Feature Metadata,
Train/Test Datasets
Models, Model Metadata
Possibly thousands of files
Distributed File System
Generate thousands of operations
Change Data Capture (CDC)
Capture only relevant operations
Systems Challenges - Operations
More context for file system operations?
user: John user: Alex
Are any of these operations related?
user: John,
app1
user: John,
app3
user: Alex,
app2
Certificates (with AppId) enabled FS Operation
Order of operations
Order of operations
Richer provenance information
Distributed File System
Read/Write/Create/Delete/XAttr/Metadata
Resource Manager - Yarn (Application Context)
Application X
Job Manager - Hopsworks (Job Context)
Workflow Manager - Airflow (Pipeline Context)
Link input/output files via Apps
Different Executions of the same Job
Jobs as Stages of the same Pipeline
Additional Context
Richer provenance information
<file, op, user_id, app_id, job_id, pipeline_id>
Hopsworks Conventions
/training_datasets
/models
/logs
/notebooks
/featurestore
CDC API - Filtering Mechanisms
/training_datasets
/models
/featurestoreProject
Example
CDC API - Filtering Mechanisms
Path based filtering
Path based filtering
Tag based filtering
Example:
Custom metadata based on HDFS XAttr.
Tag: <tutorial>, <debug>
Tags can enable logging of all operations,
if path based filtering is not easy to set
CDC API - Filtering Mechanisms
Path based filtering
Tag based filtering
Coalesce FS Operations
Example:
Read file1
Read file2
…
Read filen
Access1
Training Dataset
CDC API - Filtering Mechanisms
Parent Create Artifact Create
Parent Delete Artifact Delete
Children Read Artifact Access
Children
Create/Delete/
Append/Truncate
Artifact Mutation
Namenodes
NDB
ePipe
Cache
per namenode
Log table
With duplicates
Remove duplicates
In [ ]:
hops.load_training_dataset(
“/Projects/LC/Training_Datasets/ImageNet”)
…
hops.save_model(“/Projects/LC/Models/ResNet”)
Optimization - FS Operation Coalesce
Path based filtering
Tag based filtering
Coalesce FS Operations
Filtered Operations
Filesystem Op Metadata Stored
Create/Delete Artifact existence
XAttr Add metadata to artifact
Read Artifact used by ..
Children Files
Create/Delete
Artifact mutation
Append/Truncate Artifact mutation
Permissions/ACL Artifact metadata mutation
CDC API - Filtering Mechanisms
DataOps
CI/CD Platform
Feature Store
...
Commit-0002
Commit-0001
Commit-0097
Model Training &
Model Validation
MLOps
CI/CD Platform
Model Repository
Model Serving
& Monitoring
Data
Develop/Test
Feature Pipelines2Data1 Develop Model3
Train/Validate
Model4
Deploy/
Monitor5
Hopsworks ML Pipelines
Metadata Store
CDC events CDC events
CDC events
API calls
CDC events
API calls
Bias Detected
!
?
Provenance example
What do I do
DeltaCommit
10/01/20@10:10:01
DeltaCommit
10/01/20@10:12:01
DeltaCommit
10/01/20@12:10:01
… …
DeltaCommit
20/07/20@02:10:01
Hudi Feature Timeline
Claim of Model Bias!
Can we determine the exact features used?
Provenance + Time travel
Feature
engineering
Training Serving
Raw Data Features Models
Training
timestamp
Application
● Provenance improves understanding of complex ML Pipelines.
● Provenance should not change the core ML pipeline code.
● Provenance facilitates Debugging, Analyzing, Automating and Cleaning
of ML Pipelines.
● Provenance and Time Travel facilitate reproducibility of experiments.
● In Hopsworks, we introduced a new mechanism for provenance based
on embedded metadata in a scale-out consistent metadata layer.
Summary
● Ormenisan et al, Time-travel and Provenance for ML Pipelines, Usenix OpML 2020
● Niazi et al, HopsFS, Usenix Fast 2017
● Ismail et al, ePipe, CCGrid 2019
● Small Files in HopsFS, ACM Middleware 2018
● Ismail et al, HopsFS-S3, ACM Middleware 2020
● Meister et al, Oblivious Training Functions, 2020
● Hopsworks
References
@hopsworks
http://github.com/logicalclocks/hopsworks

Mais conteúdo relacionado

Mais procurados

Kim Hammar - Feature Store: the missing data layer in ML pipelines? - HopsML ...
Kim Hammar - Feature Store: the missing data layer in ML pipelines? - HopsML ...Kim Hammar - Feature Store: the missing data layer in ML pipelines? - HopsML ...
Kim Hammar - Feature Store: the missing data layer in ML pipelines? - HopsML ...Kim Hammar
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsAndrzej Michałowski
 
The Feature Store in Hopsworks
The Feature Store in HopsworksThe Feature Store in Hopsworks
The Feature Store in HopsworksJim Dowling
 
Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Jim Dowling
 
Berlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingBerlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingJim Dowling
 
Hopsworks data engineering melbourne april 2020
Hopsworks   data engineering melbourne april 2020Hopsworks   data engineering melbourne april 2020
Hopsworks data engineering melbourne april 2020Jim Dowling
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxLex Avstreikh
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureData Science Milan
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Databricks
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
ADF Gold Nuggets (Oracle Open World 2011)
ADF Gold Nuggets (Oracle Open World 2011)ADF Gold Nuggets (Oracle Open World 2011)
ADF Gold Nuggets (Oracle Open World 2011)Lucas Jellema
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Modern Data Stack France
 
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019Kim Hammar
 
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks Lucidworks
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Databricks
 
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Lucidworks
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyAsynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyJim Dowling
 
Accelerated data access
Accelerated data accessAccelerated data access
Accelerated data accessgordonyorke
 
AutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsAutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsSteven Gustafson
 

Mais procurados (20)

Kim Hammar - Feature Store: the missing data layer in ML pipelines? - HopsML ...
Kim Hammar - Feature Store: the missing data layer in ML pipelines? - HopsML ...Kim Hammar - Feature Store: the missing data layer in ML pipelines? - HopsML ...
Kim Hammar - Feature Store: the missing data layer in ML pipelines? - HopsML ...
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systems
 
The Feature Store in Hopsworks
The Feature Store in HopsworksThe Feature Store in Hopsworks
The Feature Store in Hopsworks
 
Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021
 
Berlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowlingBerlin buzzwords 2020-feature-store-dowling
Berlin buzzwords 2020-feature-store-dowling
 
Hopsworks data engineering melbourne april 2020
Hopsworks   data engineering melbourne april 2020Hopsworks   data engineering melbourne april 2020
Hopsworks data engineering melbourne april 2020
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
 
Hopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, SunnyvaleHopsworks at Google AI Huddle, Sunnyvale
Hopsworks at Google AI Huddle, Sunnyvale
 
ADF Gold Nuggets (Oracle Open World 2011)
ADF Gold Nuggets (Oracle Open World 2011)ADF Gold Nuggets (Oracle Open World 2011)
ADF Gold Nuggets (Oracle Open World 2011)
 
Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)Spark ML par Xebia (Spark Meetup du 11/06/2015)
Spark ML par Xebia (Spark Meetup du 11/06/2015)
 
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
Hopsworks hands on_feature_store_palo_alto_kim_hammar_23_april_2019
 
The CoFX Data Model
The CoFX Data ModelThe CoFX Data Model
The CoFX Data Model
 
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks
Practical End-to-End Learning to Rank Using Fusion - Andy Liu, Lucidworks
 
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
Analytics Zoo: Building Analytics and AI Pipeline for Apache Spark and BigDL ...
 
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
Learning to Rank: From Theory to Production - Malvina Josephidou & Diego Cecc...
 
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and MaggyAsynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
Asynchronous Hyperparameter Search with Spark on Hopsworks and Maggy
 
Accelerated data access
Accelerated data accessAccelerated data access
Accelerated data access
 
AutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital DecisionsAutoML for Data Science Productivity and Toward Better Digital Decisions
AutoML for Data Science Productivity and Toward Better Digital Decisions
 

Semelhante a Metadata and Provenance for ML Pipelines with Hopsworks

Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of MetadataJim Dowling
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for HadoopJim Dowling
 
Flux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineFlux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineJan Wiegelmann
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...Big Data Value Association
 
Log Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin KropovLog Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin KropovSoftServe
 
Log Data Analysis Platform
Log Data Analysis PlatformLog Data Analysis Platform
Log Data Analysis PlatformValentin Kropov
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real ExperienceIhor Bobak
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseHao Chen
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017Jags Ramnarayan
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData
 
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap...
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap...A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap...
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap...nadine39280
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsStephan Ewen
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleJim Dowling
 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowDaniel S. Katz
 

Semelhante a Metadata and Provenance for ML Pipelines with Hopsworks (20)

Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Data provenance in Hopsworks
Data provenance in HopsworksData provenance in Hopsworks
Data provenance in Hopsworks
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for Hadoop
 
Flux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / PipelineFlux - Open Machine Learning Stack / Pipeline
Flux - Open Machine Learning Stack / Pipeline
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
 
Log Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin KropovLog Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin Kropov
 
Log Data Analysis Platform
Log Data Analysis PlatformLog Data Analysis Platform
Log Data Analysis Platform
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real Experience
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Enterprise Data Lakes
Enterprise Data LakesEnterprise Data Lakes
Enterprise Data Lakes
 
SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017SnappyData at Spark Summit 2017
SnappyData at Spark Summit 2017
 
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...SnappyData, the Spark Database. A unified cluster for streaming, transactions...
SnappyData, the Spark Database. A unified cluster for streaming, transactions...
 
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap...
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap...A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap...
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Ap...
 
Apache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and FriendsApache Flink Overview at SF Spark and Friends
Apache Flink Overview at SF Spark and Friends
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
 
Swift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance WorkflowSwift Parallel Scripting for High-Performance Workflow
Swift Parallel Scripting for High-Performance Workflow
 

Mais de Jim Dowling

ARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfJim Dowling
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfJim Dowling
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfJim Dowling
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdfJim Dowling
 
Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Jim Dowling
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022Jim Dowling
 
GANs for Anti Money Laundering
GANs for Anti Money LaunderingGANs for Anti Money Laundering
GANs for Anti Money LaunderingJim Dowling
 
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala UniversityInvited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala UniversityJim Dowling
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJim Dowling
 
Berlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsBerlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsJim Dowling
 
All AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AIAll AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AIJim Dowling
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Jim Dowling
 
End-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in FinanceEnd-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in FinanceJim Dowling
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraJim Dowling
 
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsScaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsJim Dowling
 
Odsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on HopsOdsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on HopsJim Dowling
 

Mais de Jim Dowling (18)

ARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdfARVC and flecainide case report[EI] Jim.docx.pdf
ARVC and flecainide case report[EI] Jim.docx.pdf
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdf
 
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdfPyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
PyCon Sweden 2022 - Dowling - Serverless ML with Hopsworks.pdf
 
_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf_Python Ireland Meetup - Serverless ML - Dowling.pdf
_Python Ireland Meetup - Serverless ML - Dowling.pdf
 
Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning Building Hopsworks, a cloud-native managed feature store for machine learning
Building Hopsworks, a cloud-native managed feature store for machine learning
 
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022Real-Time Recommendations  with Hopsworks and OpenSearch - MLOps World 2022
Real-Time Recommendations with Hopsworks and OpenSearch - MLOps World 2022
 
GANs for Anti Money Laundering
GANs for Anti Money LaunderingGANs for Anti Money Laundering
GANs for Anti Money Laundering
 
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala UniversityInvited Lecture on GPUs and Distributed Deep Learning at Uppsala University
Invited Lecture on GPUs and Distributed Deep Learning at Uppsala University
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
 
Jfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocksJfokus 2019-dowling-logical-clocks
Jfokus 2019-dowling-logical-clocks
 
Berlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on HopsBerlin buzzwords 2018 TensorFlow on Hops
Berlin buzzwords 2018 TensorFlow on Hops
 
All AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AIAll AI Roads lead to Distribution - Dot AI
All AI Roads lead to Distribution - Dot AI
 
Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)Distributed TensorFlow on Hops (Papis London, April 2018)
Distributed TensorFlow on Hops (Papis London, April 2018)
 
End-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in FinanceEnd-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in Finance
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
 
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUsScaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
Scaling out Tensorflow-as-a-Service on Spark and Commodity GPUs
 
Odsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on HopsOdsc workshop - Distributed Tensorflow on Hops
Odsc workshop - Distributed Tensorflow on Hops
 

Último

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Último (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Metadata and Provenance for ML Pipelines with Hopsworks

  • 1. Dr. Jim Dowling1,2 Slides together with Alexandru A. Ormenisan1,2 , Mahmoud Ismail1,2 PROVENANCE FOR MACHINE LEARNING PIPELINES KTH - Royal Institute of Technology (1) Logical Clocks AB (2)
  • 2. Growing Consensus on how to manage complexity of AI Data validation Distributed ENGINEER Model Serving A/B Testing Monitoring Pipeline Management HyperParameter Tuning Feature Engineering Data Collection Hardware Management Data Model Prediction φ(x) 2
  • 3. Growing Consensus on how to manage complexity of AI Data validation Distributed ENGINEER Model Serving A/B Testing Monitoring Pipeline Management HyperParameter Tuning Feature Engineering Data Collection Hardware Management Data Model Prediction φ(x) ML PLATFORM TRAIN and SERVE FEATURE STORE
  • 4. What is provenance for ML Pipelines? ML Pipeline Feature engineering Training Serving Raw Data Features Models
  • 5. Governance …search: Discovery Debug & Analyse ? Serving Feature Engineering Training & Validating ? Pipeline Automation Integrity & Garbage Collection Traceability & Compliance Why track provenance?
  • 7. ▪ Logical Clocks – Hopsworks (world’s first/only fully open source) ▪ Uber Michelangelo ▪ Airbnb – Bighead/Zipline ▪ Comcast ▪ Twitter ▪ GO-JEK Feast ▪ Conde Nast ▪ Facebook FB Learner ▪ Netflix ▪ Reference: www.featurestore.org Feature Stores in Production
  • 8. Event DataRaw Data Data Lake Data Pipelines BI Platforms SQL Data Feature Pipelines Feature Store FEATURES FOR MODEL TRAINING SERVE RT FEATURES TO ONLINE MODELS FEATURES FOR ANALYTICAL MODELS (BATCH) Feature Stores make existing Data Infrastructure available to Data Scientists and Online Apps
  • 9. Click features every 10 secs CDC data every 30 secs User profile updates every hour Featurized weblogs data every day Online Feature Store Offline Feature Store SQL DW S3, HDFS SQL Event Data Real-Time Data User-Entered Features (<2 secs) Online App Low Latency Features High Latency Features Train, Batch App Feature Store No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores. <10ms TBs/PBs Feature Pipelines update the Feature Store (2 Databases!) with data from backend Platforms
  • 10. Feature Store ClickFeatureGroup TableFeatureGroup UserFeatureGroup LogsFeatureGroup Event Data SQL DW S3, HDFS SQL DataFrameAPI Kafka Input Flink RTFeatureGroup Online App Train, Batch App User Clicks DB Updates User Profile Updates Weblogs Real-time features Kafka Output The FeatureGroup abstraction hides the complexity of dealing with 2 databases
  • 11. Features name Pclass Sex Survive Name Balance Train / Test Datasets Survivename PClass Sex Balance Join key Feature Groups Titanic Passenger List Passenger Bank Account File format .tfrecord .npy .csv .hdf5, .petastorm, etc Storage GCS Amazon S3 HopsFS Features, FeatureGroups, and Train/Test Datasets are all versioned The FeatureGroup abstraction hides the complexity of dealing with 2 databasesFeature Store Concepts in Hopsworks
  • 12. Example Ingestion of data into a FeatureGroup https://docs.hopsworks.ai/ dataframe = spark.read.json("s3://dataset/rain.json") # do feature engineering on your dataframe df.withColumn('precipitation', (df.val-min)/(max-min)) fg = fs.create_feature_group("rain", version=1, description="Rain features", primary_key=['date', 'location_id'], online_enabled=True) fg.save(dataframe)
  • 13. Example Creation of Train/Test Data from a Feature Store https://docs.hopsworks.ai/ # Join features across FeatureGroups. Use “on=[..]” to explicitly enter the JOIN key. feature_join = rain_fg.select_all() .join(temperature_fg.select_all(), on=["date", "location_id"]) .join(location_fg.select_all())) td = fs.create_training_dataset("training_dataset", version=1, data_format="tfrecords", description="Training dataset, TfRecords format", splits={'train': 0.7, 'test': 0.2, 'validate': 0.1}) # The train/test/validation files are now saved to the filesystem (S3, HDFS, etc) td.save(feature_join) # Use the training data as follows: df = td.read(split="train")
  • 14. Event DataRaw Data Feature Pipeline FEATURE STORE TRAIN/VALIDATE MODEL SERVING MONITOR Data Lake The end of the End-to-End ML Pipeline! ML Pipelines start and stop at the Feature Store
  • 15. End-to-End ML Pipelines on Hopsworks. Provenance is Collecting Metadata. HopsFS Code and configuration Data Lake, Warehous e, Kafka Feature Store Model registry Prediction Logs Monitoring Logs Feature Engineering Serving on Kubernetes Model Training Model Deploy Serving and Monitoring Experiments/ Development Scaleout Metadata Features Validate Deploy to Log Artifact (File) Artifact Metadata Elasticsearch Sync
  • 16. Metadata is data that describes other data. Artifacts and Metadata in End-to-End ML PipelinesWhat is Metadata?
  • 17. Artifacts and Metadata in End-to-End ML Pipelines File System (S3, HopsFS, etc) Metastore (Database) Provenance queries ● SQL or Free-Text or Graph? ● Update Throughput? ● Latency of queries? ● Size of Metadata?
  • 19. 3 Mechanisms for Metadata Collection. Polyglot Metadata Storage for Efficient Querying. File Systems, Databases, Data Warehouses, Message Bus, etc Metastore (Database) Crawler Job Pull (REST) API Push Change Data Capture(CDC) API Application Instrumented APIJob Graph DB Search (Elastic) Metadata Query API
  • 20. Artifacts and Metadata in End-to-End ML Pipelines File System (S3, HopsFS, etc) Metastore Consistency issues Synchronization ?
  • 21. Metadata is data that describes other data. Unspoken Assumption: Why are Data and Metadata always separate stores? Artifacts and Metadata in End-to-End ML PipelinesWhat is Metadata Revisited?
  • 22. Artifacts and Metadata in End-to-End ML Pipelines Raw Data Features Experiments (Progs, Logs, Checkpoints) Models Artifacts Metadata File System (S3, HopsFS, etc) Metastore (Database) Experiments (HParams, Env, Results, Graphs) Feature Stats (Min,Max,Std, Mean, Distrib.) Governance (Privileges,Audit, Retention, etc) Model Desc (Privileges, Perf, Provenance, etc)
  • 23. Mechanism 4: Artifacts and Metadata in the same system - a Unified Metadata Layer (Hopsworks) Features Experiments (Progs, Logs, Checkpoints) Models Artifacts Metadata HopsFS Metastore (NDB) Experiments (HParams, Env, Results, Graphs) Feature Stats (Min,Max,Std, Mean, Distrib.) Governance (Privileges,Audit, Retention, etc) Model Desc (Privileges, Perf, Provenance, etc) Metastore (NDB) Raw Data extends extends extends extends
  • 24. Libraries Application Data platform Metadata store Explicit Top–down tracking of provenance. Push/Pull, CDC, or instrumented application or library code. Standalone Metadata Store. Implicit Bottom-up tracking of provenance. Requires redesigning the platform. Conventions link files to artifacts. Metadata is strongly consistent with storage platform. Mechanism 4: Implicit Provenance
  • 25. ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata Mahmoud Ismail1 , Mikael Ronström2 , Seif Haridi1 , Jim Dowling1 1 KTH - Royal Institute of Technology 2 Oracle 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (IEEE/ACM CCGrid 2019), May 15th 25 Tightly coupled Metadata and Data - replicating Metadata to External Systems HopsFS Scaleout Metadata Artifact (File) Artifact Metadata Elasticsearch Sync
  • 26. DN1 DN2 DN3 DN4 DN5 26 ● Highly scalable next-generation distribution of HDFS mkdir /Images write /Images/cat.png Images cat.png / DN1, DN3, DN5 What is HopsFS?
  • 27. 27 inodeID name parentID Block storage (Datenodes) Metadata storage NDB mkdir /Images 1 / 0 2 Images 1 3 cat.png 2 write /Images/cat.png What is HopsFS?
  • 28. 28 ● Drop-in replacement distribution of HDFS ● 16X - 37X the throughput of HDFS ● 37 larger clusters than HDFS ● 10 times lower latency What is HopsFS?
  • 29. 29 ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata
  • 30. 30 HopsFS Get all images with 1 cat and 1 guitar 1 cat and 1 guitar Full-text search is not supported by NDB Search in HopsFS
  • 31. 31 Store X Store Y HopsFS App A App B ? Polyglot Persistence - Replicating Metadata to External Systems for Efficient Querying
  • 32. 32 ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata
  • 33. 33 ● ePipe is a databus that provides replicated metadata as a service for HopsFS ● ePipe internally • creates a consistent and correctly ordered change stream for HopsFS metadata • and eventually delivers the change stream with low latency (sub second) (Near Real-time) to consumers ePipe
  • 34. 34 ● Extend HopsFS with a logging table to log file system changes ● Leverage the NDB events API to live stream changes on the logging table to ePipe ● ePipe enriches the file system events with appropriate data and publish the enriched events to the consumers ePipe: Design Decisions
  • 35. 35 Create /f1 name operationinodeID name parentID 1 / 0 Inodes table logging table NDB HopsFS Namenodes 2 f1 1 3 f2 1 f1 CREATE f2 CREATE f2 DELETE f1 DELETE Create /f2Delete /f2Delete /f1 Inodes table and logging table updated in the same Transaction to ensure Consistency/Integrity
  • 37. 37 HopsFS NDB NDB ePipe Create f1 Append f1 Create f2 Delete f1 Delete f2 Create f1 Append f1 Create f2 Delete f1 Delete f2 Epoch1 Epoch2Epoch3 Order across epochs Order within epoch Delete f1 after Create f1 Create f1 ?? Append f1 Delete f2 after Create f2 Create f2 ?? Delete f1 ….. Inconsistencies 100 ms Ordering of Log Entries
  • 38. 38 ● Property 1: The epochs are totally ordered. ● Property 2: The changes within the same transaction happen in the same epoch. ● Property 3: The changes on files are ordered only if they are in different epochs, that is, no ordering is guaranteed within the same epoch. NDB Ordering Properties
  • 39. 39 HopsFS NDB NDB ePipe Epoch1 Epoch2Epoch3 Delete f2 ,2 Create f1 ,1 Append f1 ,2 Create f2 ,1 Delete f1 ,3 Delete f2 ,2 Create f1 ,1 Append f1 ,2 Create f2 ,1 Delete f1 ,3 We introduced a version number per inode which we will increment whenever a change occurs to an inode. Append f1 after Create f1 Create f2 ?? Delete f1 Strengthening NDB Ordering Properties
  • 40. 40 ● Property 1 & 2 & 3 ● Property 4 & 5: The version number ensures the serializability of the changes on the same file/directory within epochs. ● Property 6: The order of changes for different files/directories within the same epoch doesn't matter. ePipe ordering Properties
  • 45. 45 ● Supports failure recovery thanks to the persistent logging table • The log entries are deleted only once the associated events are successfully replicated to the downstream consumers. • At least once delivery semantics. ● Pluggable architecture • For example, filter events based on file name or any other attribute. ● Not Limited to HopsFS • Can be extended to watch for other logging tables for different purposes. More about ePipe
  • 46. 46 ● A databus that provides replicated metadata as a service for HopsFS ● Low overhead on HopsFS ● Low replication lag (sub-second) ● High throughput ● Pluggable architecture ePipe Properties
  • 47. What is provenance - ML Pipeline ML Pipeline Feature engineering Training Serving Raw Data Features Models
  • 48. MLFlow Metadata - Explicit API calls def train(data_path, max_depth, min_child_weight, estimators, model_name): X_train, X_test, y_train, y_test = build_data(..) mlflow.set_tracking_uri("jdbc:mysql://username:password@host:3306/database") mlflow.set_experiment("My Experiment") with mlflow.start_run() as run: ... mlflow.log_param("max_depth", max_depth) mlflow.log_param("min_child_weight", min_child_weight) mlflow.log_param("estimators", estimators) with open("test.txt", "w") as f: f.write("hello world!") mlflow.log_artifacts("/full/path/to/test.txt") ... model.fit(X_train, y_train) # auto-logging ... mlflow.tensorflow.log_model(model, "tensorflow-model", registered_model_name=model_name)
  • 49. Hopsworks Metadata - Implicit Metadata def train(data_path, max_depth, min_child_weight, estimators): X_train, X_test, y_train, y_test = build_data(..) ... print("hello world") # monkeypatched - prints in notebook ... model.fit(X_train, y_train) # auto-logging … #Saves model to ”hopsfs://Projects/myProj/models/..” hops.export_model(model, "tensorflow",..,model_name) ... # maggy makes an API call to track this dict return {'accuracy': accuracy, 'loss': loss, 'diagram': 'diagram.png'} from maggy import experiment experiment.lagom(train, name="My Experiment", ...)
  • 50. Metadata In [ ]: add(fg_eng, raw_data, features) … add(training, features, model) <fg_eng, raw_data, features> Pipeline code What is provenance - Metadata Feature engineering Training Serving Raw Data Features Models
  • 51. ePipe (with ML Provenance) Distributed File System (HopsFS) Full Text Search (Elastic) Feature engineering Training Serving Raw Data Features Models Let the platform manage the metadata!
  • 52. ML Artifacts Features, Feature Metadata, Train/Test Datasets Models, Model Metadata Possibly thousands of files Distributed File System Generate thousands of operations Change Data Capture (CDC) Capture only relevant operations Systems Challenges - Operations
  • 53. More context for file system operations? user: John user: Alex Are any of these operations related? user: John, app1 user: John, app3 user: Alex, app2 Certificates (with AppId) enabled FS Operation Order of operations Order of operations Richer provenance information
  • 54. Distributed File System Read/Write/Create/Delete/XAttr/Metadata Resource Manager - Yarn (Application Context) Application X Job Manager - Hopsworks (Job Context) Workflow Manager - Airflow (Pipeline Context) Link input/output files via Apps Different Executions of the same Job Jobs as Stages of the same Pipeline Additional Context Richer provenance information <file, op, user_id, app_id, job_id, pipeline_id>
  • 56. /training_datasets /models /featurestoreProject Example CDC API - Filtering Mechanisms Path based filtering
  • 57. Path based filtering Tag based filtering Example: Custom metadata based on HDFS XAttr. Tag: <tutorial>, <debug> Tags can enable logging of all operations, if path based filtering is not easy to set CDC API - Filtering Mechanisms
  • 58. Path based filtering Tag based filtering Coalesce FS Operations Example: Read file1 Read file2 … Read filen Access1 Training Dataset CDC API - Filtering Mechanisms
  • 59. Parent Create Artifact Create Parent Delete Artifact Delete Children Read Artifact Access Children Create/Delete/ Append/Truncate Artifact Mutation Namenodes NDB ePipe Cache per namenode Log table With duplicates Remove duplicates In [ ]: hops.load_training_dataset( “/Projects/LC/Training_Datasets/ImageNet”) … hops.save_model(“/Projects/LC/Models/ResNet”) Optimization - FS Operation Coalesce
  • 60. Path based filtering Tag based filtering Coalesce FS Operations Filtered Operations Filesystem Op Metadata Stored Create/Delete Artifact existence XAttr Add metadata to artifact Read Artifact used by .. Children Files Create/Delete Artifact mutation Append/Truncate Artifact mutation Permissions/ACL Artifact metadata mutation CDC API - Filtering Mechanisms
  • 61. DataOps CI/CD Platform Feature Store ... Commit-0002 Commit-0001 Commit-0097 Model Training & Model Validation MLOps CI/CD Platform Model Repository Model Serving & Monitoring Data Develop/Test Feature Pipelines2Data1 Develop Model3 Train/Validate Model4 Deploy/ Monitor5 Hopsworks ML Pipelines Metadata Store CDC events CDC events CDC events API calls CDC events API calls
  • 63. DeltaCommit 10/01/20@10:10:01 DeltaCommit 10/01/20@10:12:01 DeltaCommit 10/01/20@12:10:01 … … DeltaCommit 20/07/20@02:10:01 Hudi Feature Timeline Claim of Model Bias! Can we determine the exact features used? Provenance + Time travel Feature engineering Training Serving Raw Data Features Models Training timestamp Application
  • 64. ● Provenance improves understanding of complex ML Pipelines. ● Provenance should not change the core ML pipeline code. ● Provenance facilitates Debugging, Analyzing, Automating and Cleaning of ML Pipelines. ● Provenance and Time Travel facilitate reproducibility of experiments. ● In Hopsworks, we introduced a new mechanism for provenance based on embedded metadata in a scale-out consistent metadata layer. Summary
  • 65. ● Ormenisan et al, Time-travel and Provenance for ML Pipelines, Usenix OpML 2020 ● Niazi et al, HopsFS, Usenix Fast 2017 ● Ismail et al, ePipe, CCGrid 2019 ● Small Files in HopsFS, ACM Middleware 2018 ● Ismail et al, HopsFS-S3, ACM Middleware 2020 ● Meister et al, Oblivious Training Functions, 2020 ● Hopsworks References