SlideShare uma empresa Scribd logo
1 de 28
Baixar para ler offline
Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or
distribute these slides for commercial purposes. You may make copies of these slides and
use or distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides.
For the rest of the details of the license, see
https://creativecommons.org/licenses/by-sa/2.0/legalcode
Collecting, Labeling, and
Validating Data
Welcome
“Data is the hardest part of ML and the most important piece to get right...
Broken data is the most common cause of problems in production ML systems”
- Scaling Machine Learning at Uber with Michelangelo - Uber
The importance of data
“No other activity in the machine learning life cycle has a higher return on
investment than improving the data a model has access to.”
- Feast: Bridging ML Models and Data - Gojek
Introduction to Machine
Learning Engineering
for Production
Overview
Outline
● Machine learning (ML) engineering for production: overview
● Production ML = ML development + software development
● Challenges in production ML Yields
Data Training
Evaluation
Traditional ML modeling
Production ML systems require so much more
Configuration
Data Verification
Feature Extraction Process Management Tools
Analysis Tools
Machine Resource
Management
Serving
Infrastructure
Monitoring
Data Collection
ML Code
Academic/Research ML Production ML
Static Dynamic - Shifting
Optimal tuning and training Continuously assess and retrain
Highest overall accuracy Fast inference, good interpretability
High accuracy algorithm Entire system
Very important Crucial
Data
Model training
Priority for design
Challenge
Fairness
ML modeling vs production ML
Machine learning
development
Modern software
development
+
Production machine learning Managing the entire life cycle of data
● Labeling
● Feature space coverage
● Minimal dimensionality
● Maximum predictive data
● Fairness
● Rare conditions
Accounts for:
Modern software development
● Scalability
● Extensibility
● Configuration
● Consistency & reproducibility
● Safety & security
● Modularity
● Testability
● Monitoring
● Best practices
Production machine learning system
Define
Project
Define data
and establish
baseline
Label and
organize
data
Select and
train model
Perform error
analysis
Deploy in
production
Monitor and
maintain
system
Scoping Data Modeling Deployment
New data
● Build integrated ML systems
● Continuously operate it in production
● Handle continuously changing data
● Optimize compute resource costs
Challenges in production grade ML
Introduction to Machine
Learning Engineering
for Production
ML Pipelines
● ML Pipelines
● Directed Acyclic Graphs and Pipeline Orchestration Frameworks
● Intro to TensorFlow Extended (TFX)
Outline
Infrastructure for
automating, monitoring, and maintaining
model training and deployment
ML pipelines
Scoping Data Modeling Deployment
New data
CD Foundation MLOps reference architecture
18
Production ML infrastructure
● A directed acyclic graph (DAG) is a directed graph that has no cycles
● ML pipeline workflows are usually DAGs
● DAGs define the sequencing of the tasks to be performed, based on their
relationships and dependencies.
Scoping Data Modeling Deployment
Directed acyclic graphs
● Responsible for scheduling the various components in
an ML pipeline DAG dependencies
● Help with pipeline automation
● Examples: Airflow, Argo, Celery, Luigi, Kubeflow
Configuration
Data
Verification
Feature Extraction Process Management
Tools
Analysis Tools
Machine
Resource
Management
Serving
Infrastructure
Monitoring
Data Collection
ML Code
Pipeline orchestration frameworks
TensorFlow Extended (TFX)
End-to-end platform for deploying production ML pipelines
Sequence of components that are designed for scalable, high-performance
machine learning tasks
Data
Ingestion
Data
Validation
Feature
Engineering
Train
Model
Validate
Model
Push if
good
Serve
Model
Libraries
Components
Bulk Inference
Tuner InfraValidator
DATA
INGESTION
TENSORFLOW
DATA VALIDATION
TENSORFLOW
TRANSFORM
ESTIMATOR OR
KERAS MODEL
TENSORFLOW
MODEL ANALYSIS
TENSORFLOW
SERVING
VALIDATION
OUTCOMES
ExampleGen
Example
Validator
SchemaGen
StatisticsGen
Transform Trainer Evaluator Pusher
Model Server
TFX production components
Data
Ingestion
Data
Validation
Feature
Engineering
Train
Model
Validate
Model
Push if
good
Serve
Model
TFX Hello World Key points
● Production ML pipelines: automating, monitoring, and maintaining end-to-end processes
● Production ML is much more than just ML code
○ ML development + software development
● TFX is an open-source end-to-end ML platform
Scoping Data Modeling Deployment
New data
Collecting Data
Importance of Data
● Importance of data quality
● Data pipeline: data collection, ingestion and preparation
● Data collection and monitoring
Outline
“Data is the hardest part of ML and the most important piece to get right... Broken data is the most
common cause of problems in production ML systems”
- Scaling Machine Learning at Uber with Michelangelo - Uber
The importance of data
“No other activity in the machine learning life cycle has a higher return on investment than improving
the data a model has access to.”
- Feast: Bridging ML Models and Data - Gojek
ML: Data is a first class citizen
● Software 1.0
● Software 2.0
○ Explicit instructions to the computer
○ Specify some goal on the behavior of a program
○ Find solution using optimization techniques
○ Good data is key for success
○ Code in Software = Data in ML
Software 1.0
Software 2.0
P
r
o
g
r
a
m
C
o
m
p
l
e
x
i
t
y
● Models aren’t magic
● Meaningful data:
○ maximize predictive content
○ remove non-informative data
○ feature space coverage
Everything starts with data
Garbage in, garbage out
● Data collection
● Data ingestion
● Data formatting
● Feature engineering
● Feature extraction
Define
Project
Define data
and establish
baseline
Label and
organize
data
Select and
train model
Perform error
analysis
Deploy in
production
Monitor and
maintain
system
Scoping Data Modeling Deployment
Data pipeline
● Downtime
● Errors
● Distributions
shifts
● Data failure
● Service failure
Define
Project
Define data
and establish
baseline
Label and
organize
data
Select and
train model
Perform error
analysis
Deploy in
production
Monitor and
maintain
system
Scoping Data Modeling Deployment
Data collection and monitoring Key Points
● Understand users, translate user needs into data problems
● Ensure data coverage and high predictive signal
● Source, store and monitor quality data responsibly
Collecting Data
Example Application:
Suggesting Runs
Example application: Suggesting runs
Users Runners
User Need Run more often
User Actions Complete run using the app
ML System Output
● What routes to suggest
● When to suggest them
ML System Learning
● Patterns of behaviour around accepting run prompts
● Completing runs
● Improving consistency
Key considerations
● Data availability and collection
○ What kind of/how much data is available?
○ How often does the new data come in?
○ Is it annotated?
■ If not, how hard/expensive is it to get it labeled?
● Translate user needs into data needs
○ Data needed
○ Features needed
○ Labels needed
Runner ID Run Runner Time Elevation Fun
AV3DE Boston Marathon 03:40:32 1,300 ft Low
X8KGF Seattle Oktoberfest
5k
00:35:40 0 ft High
BH9IU Houston
Half-marathon
02:01:18 200 ft Medium
FEATURES
LABELS
EXAMPLES
Example dataset
● Identify data sources
● Check if they are refreshed
● Consistency for values, units, & data types
● Monitor outliers and errors
Get to know your data
● Inconsistent formatting
○ Is zero “0”, “0.0”, or an indicator of a missing measurement
● Compounding errors from other ML Models
● Monitor data sources for system issues and outages
Dataset issues
● Intuition about data value can be misleading
○ Which features have predictive value and which ones do
not?
● Feature engineering helps to maximize the predictive signals
● Feature selection helps to measure the predictive signals
Measure data effectiveness
Data Needed
● Running data from the app
● Demographic data
● Local geographic data
Translate user needs into data needs
Features Needed
● Runner demographics
● Time of day
● Run completion rate
● Pace
● Distance ran
● Elevation gained
● Heart rate
Translate user needs into data needs
Labels Needed
● Runner acceptance or rejection of app suggestions
● User generated feedback regarding why suggestion was
rejected
● User rating of enjoyment of recommended runs
Translate user needs into data needs
Key points
● Understand your user, translate their needs into data problems
○ What kind of/how much data is available
○ What are the details and issues of your data
○ What are your predictive features
○ What are the labels you are tracking
○ What are your metrics
Collecting Data
Responsible Data:
Security, Privacy &
Fairness
● Data Sourcing
● Data Security and User Privacy
● Bias and Fairness
Outline
Example: classifier trained on the Open Images dataset
Avoiding problematic biases in datasets
Source Data Responsibly
Build synthetic
dataset
Open source
dataset
Web scraping
Build your own
dataset
Collect live data
● Data collection and management isn’t just about your
model
○ Give user control of what data can be collected
○ Is there a risk of inadvertently revealing user data?
● Compliance with regulations and policies (e.g. GDPR)
Data security and privacy
● Protect personally identifiable information
○ Aggregation - replace unique values with summary
value
○ Redaction - remove some data to create less
complete picture
Users privacy How ML systems can fail users
● Representational harm
● Opportunity denial
● Disproportionate product failure
● Harm by disadvantage
Fair Accountable Transparent Explainable
● Make sure your models are fair
○ Group fairness, equal accuracy
● Bias in human labeled and/or collected
data
● ML Models can amplify biases
Commit to fairness Biased data representation
● Accurate labels are necessary for supervised learning
● Labeling can be done by:
○ Automation (logging or weak supervision)
○ Humans (aka “Raters”, often semi-supervised)
Reducing bias: Design fair labeling systems Types of human raters
Raters
Subject matter
experts
Generalists
crowdsourcing tools
Specialized tools, e.g. medical Image
labelling
Your Users
Derived labels, e.g. tagging
photos
● Ensure rater pool diversity
● Investigate rater context and incentives
● Evaluate rater tools
● Manage cost
● Determine freshness requirements
Key points
Labeling Data
Case Study: Degraded
Model Performance
You’re an Online Retailer
Selling Shoes ...
Your model predicts
click-through rates
(CTR), helping you decide
how much inventory to
order
When suddenly
Your AUC and prediction accuracy
have dropped on men’s dress shoes!
Why? How do we know that we
have a problem?
Case study: taking action
● How to detect problems early on?
● What are the possible causes?
● What can be done to solve these?
What causes problems?
Kinds of problems:
● Fast - example: bad sensor, bad software update
● Slow - example: drift
• Trend and seasonality
• Distribution of features changes
• Relative importance of features changes
Data changes
• Styles change
• Scope and processes change
• Competitors change
• Business expands to other geos
World changes
Gradual problems
Data collection problem Systems problem
• Bad sensor/camera
• Bad log data
• Moved or disabled
sensors/cameras
• Bad software update
• Loss of network connectivity
• System down
• Bad credentials
Sudden problems
Why “Understand” the model?
● Mispredictions do not have uniform cost to your business
● The data you have is rarely the data you wish you had
● Model objective is nearly always a proxy for your business
objectives
● Some percentage of your customers may have a bad experience
The real world does not stand still!
Labeling Data
Data and Concept
Change in
Production ML
● Detecting problems with deployed models
○ Data and concept change
● Changing ground truth
○ Easy problems
○ Harder problems
○ Really hard problems
Outline
● Data and scope changes
● Monitor models and validate data to find problems early
● Changing ground truth: label new training data
Detecting problems with deployed models
● Ground truth changes slowly (months, years)
● Model retraining driven by:
○ Model improvements, better data
○ Changes in software and/or systems
● Labeling
○ Curated datasets
○ Crowd-based
Easy problems
● Ground truth changes faster (weeks)
● Model retraining driven by:
○ Declining model performance
○ Model improvements, better data
○ Changes in software and/or system
● Labeling
○ Direct feedback
○ Crowd-based
Harder problems
● Ground truth changes very fast (days, hours, min)
● Model retraining driven by:
○ Declining model performance
○ Model improvements, better data
○ Changes in software and/or system
● Labeling
○ Direct feedback
○ Weak supervision
Really hard problems
● Model performance decays over time
○ Data and Concept Drift
● Model retraining helps to improve performance
○ Data labeling for changing ground truth and scarce labels
Key points
Labeling Data
Process Feedback and
Human Labeling
Variety of Methods
○ Process Feedback (Direct Labeling)
○ Human Labeling
○ Semi-Supervised Labeling
○ Active Learning
○ Weak Supervision
○ Semi-Supervised Labeling
○ Active Learning
○ Weak Supervision
Practice later as advanced
labeling methods
Data labeling
Process Feedback
Human Labeling
Example: Actual vs predicted click-through
Example: Cardiologists labeling MRI images
Data labeling
● Using business/organisation available data
● Frequent model retraining
● Labeling ongoing and critical process
● Creating a training datasets requires labels
Why is labeling important in production ML?
Direct labeling: continuous creation of training dataset
Features from
inference requests
Labels from
monitoring
predictions
Join results with
inference requests
Similar to reinforcement learning
rewards
Process feedback - advantages
● Training dataset continuous creation
● Labels evolve quickly
● Captures strong label signals
Process feedback - disadvantages
● Hindered by inherent nature of the problem
● Failure to capture ground truth
● Largely bespoke design
Process feedback - Open-Source log analysis tools
Logstash
Free and open source data processing pipeline
● Ingests data from a multitude of sources
● Transforms it
● Sends it to your favorite "stash."
Fluentd
Open source data collector
Unify the data collection and consumption
Google Cloud Logging
● Data and events from Google Cloud and AWS
● BindPlane. Logging: application components, on-premise and hybrid
cloud systems
● Sends it to your favorite "stash"
AWS ElasticSearch
Azure Monitor
Cloud Log Analysis
Process feedback - Cloud log analytics Human labeling
People (“raters”) to examine data and assign labels manually
Raw data Unlabeled and ambiguous data
is sent to raters for annotation
A training data set is
ready for use
Human labeling - Methodology
Unlabeled data is collected
Human “raters” are recruited
Instructions to guide raters are created
Labels are collected and conflicts resolved
Data is divided and assigned to raters
Human labeling - advantages
● More labels
● Pure supervised learning
Human labeling - Disadvantages
Quality consistency: Many datasets
difficult for human labeling
Slow
Expensive
Small dataset curation
Why is human labeling a problem?
Slow, difficult and
expensive
MRI: high cost for specialist labeling
Single rater: limited #examples per day
Recruitment is slow and expensive
● Various methods of data labeling
○ Process feedback
○ Human labeling
● Advantages and disadvantages of both
Key points
Validating Data
Detecting Data Issues
● Data issues
○ Drift and skew
■ Data and concept Drift
■ Schema Skew
■ Distribution Skew
● Detecting data issues
Outline
Drift
Changes in data over time, such as data collected once
a day
Skew
Difference between two static versions, or different
sources, such as training set and serving set
Drift and skew Typical ML pipeline
Data
Batch processing
Request
Real-time
processing
During training During serving
Model Decay : Data drift
Update
Performance decay : Concept drift
Training
Serving
● Detecting schema skew
○ Training and serving data do not conform to the same
schema
● Detecting distribution skew
○ Dataset shift → covariate or concept shift
● Requires continuous evaluation
Detecting data issues
Dataset shift
Training Serving
Joint
Conditional
Marginal
Covariate shift
Concept shift
Detecting distribution skew
Skew detection workflow
Training data
Baseline stats Schema
Serving data
Stats
Validate statistics
Detect anomalies
Alert and analyze
Validating Data
TensorFlow
Data Validation
● Understand, validate, and monitor ML data at
scale
● Used to analyze and validate petabytes of data at
Google every day
● Proven track record in helping TFX users maintain
the health of their ML pipelines
TensorFlow Data Validation (TFDV) TFDV capabilities
● Generates data statistics and browser visualizations
● Infers the data schema
● Performs validity checks against schema
● Detects training/serving skew
1. Schema Skew
2. Feature Skew
3. Distribution Skew
Training data
Baseline stats Schema
Serving data
Stats
Validate statistics
Detect anomalies
Alert and analyze
Skew detection - TFDV
● Supported for categorical features
● Expressed in terms of L-infinity distance (Chebyshev Distance):
● Set a threshold to receive warnings
2 2 2 2 2
2 1 1 1 2
2 1 1 2
2 1 1 1 2
2 2 2 2 2
Skew - TFDV
Serving and training data don’t conform to same schema:
● For example, int != float
Schema skew
Training feature values are different than the serving feature
values:
● Feature values are modified between training and serving
time
● Transformation applied only in one of the two instances
Feature skew
Distribution of serving and training dataset is significantly
different:
● Faulty sampling method during training
● Different data sources for training and serving data
● Trend, seasonality, changes in data over time
Distribution skew
● TFDV: Descriptive statistics at scale with the embedded facets visualizations
● It provides insight into:
○ What are the underlying statistics of your data
○ How does your training, evaluation, and serving dataset statistics compare
○ How can you detect and fix data anomalies
Key points
● Differences between ML modeling and a production ML system
● Responsible data collection for building a fair production ML system
● Process feedback and human labeling
● Detecting data issues
Practice data validation with TFDV in this week’s exercise notebook
Test your skills with the programming assignment
Wrap up

Mais conteúdo relacionado

Semelhante a C2_W1---.pdf

Semelhante a C2_W1---.pdf (20)

MLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in ProductionMLOps and Data Quality: Deploying Reliable ML Models in Production
MLOps and Data Quality: Deploying Reliable ML Models in Production
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
Role of ML engineer
Role of ML engineerRole of ML engineer
Role of ML engineer
 
Integrating Azure Machine Learning and Predictive Analytics with SharePoint O...
Integrating Azure Machine Learning and Predictive Analytics with SharePoint O...Integrating Azure Machine Learning and Predictive Analytics with SharePoint O...
Integrating Azure Machine Learning and Predictive Analytics with SharePoint O...
 
Demystifying Data Science
Demystifying Data ScienceDemystifying Data Science
Demystifying Data Science
 
Pydata Chicago - work hard once
Pydata Chicago - work hard oncePydata Chicago - work hard once
Pydata Chicago - work hard once
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years o...
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOps
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
 
The A-Z of Data: Introduction to MLOps
The A-Z of Data: Introduction to MLOpsThe A-Z of Data: Introduction to MLOps
The A-Z of Data: Introduction to MLOps
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in Production
 
AI projects - Lifecyle & Best Practices
AI projects - Lifecyle & Best PracticesAI projects - Lifecyle & Best Practices
AI projects - Lifecyle & Best Practices
 
Introduction to ml ops in daily apps
Introduction to ml ops in daily appsIntroduction to ml ops in daily apps
Introduction to ml ops in daily apps
 
Transition to a modern data platform
Transition to a modern data platform Transition to a modern data platform
Transition to a modern data platform
 
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will loveScaling & Transforming Stitch Fix's Visibility into What Folks will love
Scaling & Transforming Stitch Fix's Visibility into What Folks will love
 
Data Ops at TripActions
Data Ops at TripActionsData Ops at TripActions
Data Ops at TripActions
 

Último

Call Girls Btm Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Btm Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Btm Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Btm Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
➥🔝 7737669865 🔝▻ bhavnagar Call-girls in Women Seeking Men 🔝bhavnagar🔝 Esc...
➥🔝 7737669865 🔝▻ bhavnagar Call-girls in Women Seeking Men  🔝bhavnagar🔝   Esc...➥🔝 7737669865 🔝▻ bhavnagar Call-girls in Women Seeking Men  🔝bhavnagar🔝   Esc...
➥🔝 7737669865 🔝▻ bhavnagar Call-girls in Women Seeking Men 🔝bhavnagar🔝 Esc...
amitlee9823
 
Call Girls In Chandapura ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Chandapura ☎ 7737669865 🥵 Book Your One night StandCall Girls In Chandapura ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Chandapura ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Bidadi Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Bidadi Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Bidadi Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Bidadi Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
➥🔝 7737669865 🔝▻ Satara Call-girls in Women Seeking Men 🔝Satara🔝 Escorts S...
➥🔝 7737669865 🔝▻ Satara Call-girls in Women Seeking Men  🔝Satara🔝   Escorts S...➥🔝 7737669865 🔝▻ Satara Call-girls in Women Seeking Men  🔝Satara🔝   Escorts S...
➥🔝 7737669865 🔝▻ Satara Call-girls in Women Seeking Men 🔝Satara🔝 Escorts S...
amitlee9823
 
Call Girls In Sarjapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Sarjapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Sarjapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Sarjapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Devanahalli ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Devanahalli ☎ 7737669865 🥵 Book Your One night StandCall Girls In Devanahalli ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Devanahalli ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ bharuch Call-girls in Women Seeking Men 🔝bharuch🔝 Escorts...
➥🔝 7737669865 🔝▻ bharuch Call-girls in Women Seeking Men  🔝bharuch🔝   Escorts...➥🔝 7737669865 🔝▻ bharuch Call-girls in Women Seeking Men  🔝bharuch🔝   Escorts...
➥🔝 7737669865 🔝▻ bharuch Call-girls in Women Seeking Men 🔝bharuch🔝 Escorts...
amitlee9823
 
Just Call Vip call girls Firozabad Escorts ☎️9352988975 Two shot with one gir...
Just Call Vip call girls Firozabad Escorts ☎️9352988975 Two shot with one gir...Just Call Vip call girls Firozabad Escorts ☎️9352988975 Two shot with one gir...
Just Call Vip call girls Firozabad Escorts ☎️9352988975 Two shot with one gir...
gajnagarg
 
reStartEvents 5:9 DC metro & Beyond V-Career Fair Employer Directory.pdf
reStartEvents 5:9 DC metro & Beyond V-Career Fair Employer Directory.pdfreStartEvents 5:9 DC metro & Beyond V-Career Fair Employer Directory.pdf
reStartEvents 5:9 DC metro & Beyond V-Career Fair Employer Directory.pdf
Ken Fuller
 
Call Girls In Kengeri Satellite Town ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Kengeri Satellite Town ☎ 7737669865 🥵 Book Your One night StandCall Girls In Kengeri Satellite Town ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Kengeri Satellite Town ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls Hosur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hosur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Hosur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hosur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Bommanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Bommanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service ...Call Girls Bommanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Bommanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
amitlee9823
 
➥🔝 7737669865 🔝▻ Secunderabad Call-girls in Women Seeking Men 🔝Secunderabad🔝...
➥🔝 7737669865 🔝▻ Secunderabad Call-girls in Women Seeking Men  🔝Secunderabad🔝...➥🔝 7737669865 🔝▻ Secunderabad Call-girls in Women Seeking Men  🔝Secunderabad🔝...
➥🔝 7737669865 🔝▻ Secunderabad Call-girls in Women Seeking Men 🔝Secunderabad🔝...
amitlee9823
 
Call Girls Devanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Devanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Devanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Devanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 

Último (20)

Personal Brand Exploration - Fernando Negron
Personal Brand Exploration - Fernando NegronPersonal Brand Exploration - Fernando Negron
Personal Brand Exploration - Fernando Negron
 
Call Girls Btm Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Btm Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Btm Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Btm Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
➥🔝 7737669865 🔝▻ bhavnagar Call-girls in Women Seeking Men 🔝bhavnagar🔝 Esc...
➥🔝 7737669865 🔝▻ bhavnagar Call-girls in Women Seeking Men  🔝bhavnagar🔝   Esc...➥🔝 7737669865 🔝▻ bhavnagar Call-girls in Women Seeking Men  🔝bhavnagar🔝   Esc...
➥🔝 7737669865 🔝▻ bhavnagar Call-girls in Women Seeking Men 🔝bhavnagar🔝 Esc...
 
Call Girls Alandi Road Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Road Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Road Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Road Call Me 7737669865 Budget Friendly No Advance Booking
 
Call Girls In Chandapura ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Chandapura ☎ 7737669865 🥵 Book Your One night StandCall Girls In Chandapura ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Chandapura ☎ 7737669865 🥵 Book Your One night Stand
 
Personal Brand Exploration ppt.- Ronnie Jones
Personal Brand  Exploration ppt.- Ronnie JonesPersonal Brand  Exploration ppt.- Ronnie Jones
Personal Brand Exploration ppt.- Ronnie Jones
 
Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Sa...
Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Sa...Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Sa...
Pooja 9892124323, Call girls Services and Mumbai Escort Service Near Hotel Sa...
 
Call Girls Bidadi Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Bidadi Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Bidadi Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Bidadi Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
➥🔝 7737669865 🔝▻ Satara Call-girls in Women Seeking Men 🔝Satara🔝 Escorts S...
➥🔝 7737669865 🔝▻ Satara Call-girls in Women Seeking Men  🔝Satara🔝   Escorts S...➥🔝 7737669865 🔝▻ Satara Call-girls in Women Seeking Men  🔝Satara🔝   Escorts S...
➥🔝 7737669865 🔝▻ Satara Call-girls in Women Seeking Men 🔝Satara🔝 Escorts S...
 
Call Girls In Sarjapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Sarjapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Sarjapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Sarjapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Devanahalli ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Devanahalli ☎ 7737669865 🥵 Book Your One night StandCall Girls In Devanahalli ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Devanahalli ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ bharuch Call-girls in Women Seeking Men 🔝bharuch🔝 Escorts...
➥🔝 7737669865 🔝▻ bharuch Call-girls in Women Seeking Men  🔝bharuch🔝   Escorts...➥🔝 7737669865 🔝▻ bharuch Call-girls in Women Seeking Men  🔝bharuch🔝   Escorts...
➥🔝 7737669865 🔝▻ bharuch Call-girls in Women Seeking Men 🔝bharuch🔝 Escorts...
 
Just Call Vip call girls Firozabad Escorts ☎️9352988975 Two shot with one gir...
Just Call Vip call girls Firozabad Escorts ☎️9352988975 Two shot with one gir...Just Call Vip call girls Firozabad Escorts ☎️9352988975 Two shot with one gir...
Just Call Vip call girls Firozabad Escorts ☎️9352988975 Two shot with one gir...
 
reStartEvents 5:9 DC metro & Beyond V-Career Fair Employer Directory.pdf
reStartEvents 5:9 DC metro & Beyond V-Career Fair Employer Directory.pdfreStartEvents 5:9 DC metro & Beyond V-Career Fair Employer Directory.pdf
reStartEvents 5:9 DC metro & Beyond V-Career Fair Employer Directory.pdf
 
Call Girls In Kengeri Satellite Town ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Kengeri Satellite Town ☎ 7737669865 🥵 Book Your One night StandCall Girls In Kengeri Satellite Town ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Kengeri Satellite Town ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Hosur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hosur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Hosur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Hosur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls Bommanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Bommanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service ...Call Girls Bommanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
Call Girls Bommanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service ...
 
➥🔝 7737669865 🔝▻ Secunderabad Call-girls in Women Seeking Men 🔝Secunderabad🔝...
➥🔝 7737669865 🔝▻ Secunderabad Call-girls in Women Seeking Men  🔝Secunderabad🔝...➥🔝 7737669865 🔝▻ Secunderabad Call-girls in Women Seeking Men  🔝Secunderabad🔝...
➥🔝 7737669865 🔝▻ Secunderabad Call-girls in Women Seeking Men 🔝Secunderabad🔝...
 
Call Girls Devanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Devanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Devanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Devanahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Hyderabad 💫✅💃 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATIS...
Hyderabad 💫✅💃 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATIS...Hyderabad 💫✅💃 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATIS...
Hyderabad 💫✅💃 24×7 BEST GENUINE PERSON LOW PRICE CALL GIRL SERVICE FULL SATIS...
 

C2_W1---.pdf

  • 1. Copyright Notice These slides are distributed under the Creative Commons License. DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute these slides for commercial purposes. You may make copies of these slides and use or distribute them for educational purposes as long as you cite DeepLearning.AI as the source of the slides. For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode Collecting, Labeling, and Validating Data Welcome “Data is the hardest part of ML and the most important piece to get right... Broken data is the most common cause of problems in production ML systems” - Scaling Machine Learning at Uber with Michelangelo - Uber The importance of data “No other activity in the machine learning life cycle has a higher return on investment than improving the data a model has access to.” - Feast: Bridging ML Models and Data - Gojek Introduction to Machine Learning Engineering for Production Overview
  • 2. Outline ● Machine learning (ML) engineering for production: overview ● Production ML = ML development + software development ● Challenges in production ML Yields Data Training Evaluation Traditional ML modeling Production ML systems require so much more Configuration Data Verification Feature Extraction Process Management Tools Analysis Tools Machine Resource Management Serving Infrastructure Monitoring Data Collection ML Code Academic/Research ML Production ML Static Dynamic - Shifting Optimal tuning and training Continuously assess and retrain Highest overall accuracy Fast inference, good interpretability High accuracy algorithm Entire system Very important Crucial Data Model training Priority for design Challenge Fairness ML modeling vs production ML
  • 3. Machine learning development Modern software development + Production machine learning Managing the entire life cycle of data ● Labeling ● Feature space coverage ● Minimal dimensionality ● Maximum predictive data ● Fairness ● Rare conditions Accounts for: Modern software development ● Scalability ● Extensibility ● Configuration ● Consistency & reproducibility ● Safety & security ● Modularity ● Testability ● Monitoring ● Best practices Production machine learning system Define Project Define data and establish baseline Label and organize data Select and train model Perform error analysis Deploy in production Monitor and maintain system Scoping Data Modeling Deployment New data
  • 4. ● Build integrated ML systems ● Continuously operate it in production ● Handle continuously changing data ● Optimize compute resource costs Challenges in production grade ML Introduction to Machine Learning Engineering for Production ML Pipelines ● ML Pipelines ● Directed Acyclic Graphs and Pipeline Orchestration Frameworks ● Intro to TensorFlow Extended (TFX) Outline
  • 5. Infrastructure for automating, monitoring, and maintaining model training and deployment ML pipelines Scoping Data Modeling Deployment New data CD Foundation MLOps reference architecture 18 Production ML infrastructure ● A directed acyclic graph (DAG) is a directed graph that has no cycles ● ML pipeline workflows are usually DAGs ● DAGs define the sequencing of the tasks to be performed, based on their relationships and dependencies. Scoping Data Modeling Deployment Directed acyclic graphs ● Responsible for scheduling the various components in an ML pipeline DAG dependencies ● Help with pipeline automation ● Examples: Airflow, Argo, Celery, Luigi, Kubeflow Configuration Data Verification Feature Extraction Process Management Tools Analysis Tools Machine Resource Management Serving Infrastructure Monitoring Data Collection ML Code Pipeline orchestration frameworks
  • 6. TensorFlow Extended (TFX) End-to-end platform for deploying production ML pipelines Sequence of components that are designed for scalable, high-performance machine learning tasks Data Ingestion Data Validation Feature Engineering Train Model Validate Model Push if good Serve Model Libraries Components Bulk Inference Tuner InfraValidator DATA INGESTION TENSORFLOW DATA VALIDATION TENSORFLOW TRANSFORM ESTIMATOR OR KERAS MODEL TENSORFLOW MODEL ANALYSIS TENSORFLOW SERVING VALIDATION OUTCOMES ExampleGen Example Validator SchemaGen StatisticsGen Transform Trainer Evaluator Pusher Model Server TFX production components Data Ingestion Data Validation Feature Engineering Train Model Validate Model Push if good Serve Model TFX Hello World Key points ● Production ML pipelines: automating, monitoring, and maintaining end-to-end processes ● Production ML is much more than just ML code ○ ML development + software development ● TFX is an open-source end-to-end ML platform Scoping Data Modeling Deployment New data
  • 7. Collecting Data Importance of Data ● Importance of data quality ● Data pipeline: data collection, ingestion and preparation ● Data collection and monitoring Outline “Data is the hardest part of ML and the most important piece to get right... Broken data is the most common cause of problems in production ML systems” - Scaling Machine Learning at Uber with Michelangelo - Uber The importance of data “No other activity in the machine learning life cycle has a higher return on investment than improving the data a model has access to.” - Feast: Bridging ML Models and Data - Gojek
  • 8. ML: Data is a first class citizen ● Software 1.0 ● Software 2.0 ○ Explicit instructions to the computer ○ Specify some goal on the behavior of a program ○ Find solution using optimization techniques ○ Good data is key for success ○ Code in Software = Data in ML Software 1.0 Software 2.0 P r o g r a m C o m p l e x i t y ● Models aren’t magic ● Meaningful data: ○ maximize predictive content ○ remove non-informative data ○ feature space coverage Everything starts with data Garbage in, garbage out ● Data collection ● Data ingestion ● Data formatting ● Feature engineering ● Feature extraction Define Project Define data and establish baseline Label and organize data Select and train model Perform error analysis Deploy in production Monitor and maintain system Scoping Data Modeling Deployment Data pipeline
  • 9. ● Downtime ● Errors ● Distributions shifts ● Data failure ● Service failure Define Project Define data and establish baseline Label and organize data Select and train model Perform error analysis Deploy in production Monitor and maintain system Scoping Data Modeling Deployment Data collection and monitoring Key Points ● Understand users, translate user needs into data problems ● Ensure data coverage and high predictive signal ● Source, store and monitor quality data responsibly Collecting Data Example Application: Suggesting Runs Example application: Suggesting runs Users Runners User Need Run more often User Actions Complete run using the app ML System Output ● What routes to suggest ● When to suggest them ML System Learning ● Patterns of behaviour around accepting run prompts ● Completing runs ● Improving consistency
  • 10. Key considerations ● Data availability and collection ○ What kind of/how much data is available? ○ How often does the new data come in? ○ Is it annotated? ■ If not, how hard/expensive is it to get it labeled? ● Translate user needs into data needs ○ Data needed ○ Features needed ○ Labels needed Runner ID Run Runner Time Elevation Fun AV3DE Boston Marathon 03:40:32 1,300 ft Low X8KGF Seattle Oktoberfest 5k 00:35:40 0 ft High BH9IU Houston Half-marathon 02:01:18 200 ft Medium FEATURES LABELS EXAMPLES Example dataset ● Identify data sources ● Check if they are refreshed ● Consistency for values, units, & data types ● Monitor outliers and errors Get to know your data ● Inconsistent formatting ○ Is zero “0”, “0.0”, or an indicator of a missing measurement ● Compounding errors from other ML Models ● Monitor data sources for system issues and outages Dataset issues
  • 11. ● Intuition about data value can be misleading ○ Which features have predictive value and which ones do not? ● Feature engineering helps to maximize the predictive signals ● Feature selection helps to measure the predictive signals Measure data effectiveness Data Needed ● Running data from the app ● Demographic data ● Local geographic data Translate user needs into data needs Features Needed ● Runner demographics ● Time of day ● Run completion rate ● Pace ● Distance ran ● Elevation gained ● Heart rate Translate user needs into data needs Labels Needed ● Runner acceptance or rejection of app suggestions ● User generated feedback regarding why suggestion was rejected ● User rating of enjoyment of recommended runs Translate user needs into data needs
  • 12. Key points ● Understand your user, translate their needs into data problems ○ What kind of/how much data is available ○ What are the details and issues of your data ○ What are your predictive features ○ What are the labels you are tracking ○ What are your metrics Collecting Data Responsible Data: Security, Privacy & Fairness ● Data Sourcing ● Data Security and User Privacy ● Bias and Fairness Outline Example: classifier trained on the Open Images dataset Avoiding problematic biases in datasets
  • 13. Source Data Responsibly Build synthetic dataset Open source dataset Web scraping Build your own dataset Collect live data ● Data collection and management isn’t just about your model ○ Give user control of what data can be collected ○ Is there a risk of inadvertently revealing user data? ● Compliance with regulations and policies (e.g. GDPR) Data security and privacy ● Protect personally identifiable information ○ Aggregation - replace unique values with summary value ○ Redaction - remove some data to create less complete picture Users privacy How ML systems can fail users ● Representational harm ● Opportunity denial ● Disproportionate product failure ● Harm by disadvantage Fair Accountable Transparent Explainable
  • 14. ● Make sure your models are fair ○ Group fairness, equal accuracy ● Bias in human labeled and/or collected data ● ML Models can amplify biases Commit to fairness Biased data representation ● Accurate labels are necessary for supervised learning ● Labeling can be done by: ○ Automation (logging or weak supervision) ○ Humans (aka “Raters”, often semi-supervised) Reducing bias: Design fair labeling systems Types of human raters Raters Subject matter experts Generalists crowdsourcing tools Specialized tools, e.g. medical Image labelling Your Users Derived labels, e.g. tagging photos
  • 15. ● Ensure rater pool diversity ● Investigate rater context and incentives ● Evaluate rater tools ● Manage cost ● Determine freshness requirements Key points Labeling Data Case Study: Degraded Model Performance You’re an Online Retailer Selling Shoes ... Your model predicts click-through rates (CTR), helping you decide how much inventory to order When suddenly Your AUC and prediction accuracy have dropped on men’s dress shoes!
  • 16. Why? How do we know that we have a problem?
  • 17. Case study: taking action ● How to detect problems early on? ● What are the possible causes? ● What can be done to solve these? What causes problems? Kinds of problems: ● Fast - example: bad sensor, bad software update ● Slow - example: drift • Trend and seasonality • Distribution of features changes • Relative importance of features changes Data changes • Styles change • Scope and processes change • Competitors change • Business expands to other geos World changes Gradual problems Data collection problem Systems problem • Bad sensor/camera • Bad log data • Moved or disabled sensors/cameras • Bad software update • Loss of network connectivity • System down • Bad credentials Sudden problems
  • 18. Why “Understand” the model? ● Mispredictions do not have uniform cost to your business ● The data you have is rarely the data you wish you had ● Model objective is nearly always a proxy for your business objectives ● Some percentage of your customers may have a bad experience The real world does not stand still! Labeling Data Data and Concept Change in Production ML ● Detecting problems with deployed models ○ Data and concept change ● Changing ground truth ○ Easy problems ○ Harder problems ○ Really hard problems Outline ● Data and scope changes ● Monitor models and validate data to find problems early ● Changing ground truth: label new training data Detecting problems with deployed models
  • 19. ● Ground truth changes slowly (months, years) ● Model retraining driven by: ○ Model improvements, better data ○ Changes in software and/or systems ● Labeling ○ Curated datasets ○ Crowd-based Easy problems ● Ground truth changes faster (weeks) ● Model retraining driven by: ○ Declining model performance ○ Model improvements, better data ○ Changes in software and/or system ● Labeling ○ Direct feedback ○ Crowd-based Harder problems ● Ground truth changes very fast (days, hours, min) ● Model retraining driven by: ○ Declining model performance ○ Model improvements, better data ○ Changes in software and/or system ● Labeling ○ Direct feedback ○ Weak supervision Really hard problems ● Model performance decays over time ○ Data and Concept Drift ● Model retraining helps to improve performance ○ Data labeling for changing ground truth and scarce labels Key points
  • 20. Labeling Data Process Feedback and Human Labeling Variety of Methods ○ Process Feedback (Direct Labeling) ○ Human Labeling ○ Semi-Supervised Labeling ○ Active Learning ○ Weak Supervision ○ Semi-Supervised Labeling ○ Active Learning ○ Weak Supervision Practice later as advanced labeling methods Data labeling Process Feedback Human Labeling Example: Actual vs predicted click-through Example: Cardiologists labeling MRI images Data labeling ● Using business/organisation available data ● Frequent model retraining ● Labeling ongoing and critical process ● Creating a training datasets requires labels Why is labeling important in production ML?
  • 21. Direct labeling: continuous creation of training dataset Features from inference requests Labels from monitoring predictions Join results with inference requests Similar to reinforcement learning rewards Process feedback - advantages ● Training dataset continuous creation ● Labels evolve quickly ● Captures strong label signals Process feedback - disadvantages ● Hindered by inherent nature of the problem ● Failure to capture ground truth ● Largely bespoke design Process feedback - Open-Source log analysis tools Logstash Free and open source data processing pipeline ● Ingests data from a multitude of sources ● Transforms it ● Sends it to your favorite "stash." Fluentd Open source data collector Unify the data collection and consumption
  • 22. Google Cloud Logging ● Data and events from Google Cloud and AWS ● BindPlane. Logging: application components, on-premise and hybrid cloud systems ● Sends it to your favorite "stash" AWS ElasticSearch Azure Monitor Cloud Log Analysis Process feedback - Cloud log analytics Human labeling People (“raters”) to examine data and assign labels manually Raw data Unlabeled and ambiguous data is sent to raters for annotation A training data set is ready for use Human labeling - Methodology Unlabeled data is collected Human “raters” are recruited Instructions to guide raters are created Labels are collected and conflicts resolved Data is divided and assigned to raters Human labeling - advantages ● More labels ● Pure supervised learning
  • 23. Human labeling - Disadvantages Quality consistency: Many datasets difficult for human labeling Slow Expensive Small dataset curation Why is human labeling a problem? Slow, difficult and expensive MRI: high cost for specialist labeling Single rater: limited #examples per day Recruitment is slow and expensive ● Various methods of data labeling ○ Process feedback ○ Human labeling ● Advantages and disadvantages of both Key points
  • 24. Validating Data Detecting Data Issues ● Data issues ○ Drift and skew ■ Data and concept Drift ■ Schema Skew ■ Distribution Skew ● Detecting data issues Outline Drift Changes in data over time, such as data collected once a day Skew Difference between two static versions, or different sources, such as training set and serving set Drift and skew Typical ML pipeline Data Batch processing Request Real-time processing During training During serving
  • 25. Model Decay : Data drift Update Performance decay : Concept drift Training Serving ● Detecting schema skew ○ Training and serving data do not conform to the same schema ● Detecting distribution skew ○ Dataset shift → covariate or concept shift ● Requires continuous evaluation Detecting data issues Dataset shift Training Serving Joint Conditional Marginal Covariate shift Concept shift Detecting distribution skew
  • 26. Skew detection workflow Training data Baseline stats Schema Serving data Stats Validate statistics Detect anomalies Alert and analyze Validating Data TensorFlow Data Validation ● Understand, validate, and monitor ML data at scale ● Used to analyze and validate petabytes of data at Google every day ● Proven track record in helping TFX users maintain the health of their ML pipelines TensorFlow Data Validation (TFDV) TFDV capabilities ● Generates data statistics and browser visualizations ● Infers the data schema ● Performs validity checks against schema ● Detects training/serving skew
  • 27. 1. Schema Skew 2. Feature Skew 3. Distribution Skew Training data Baseline stats Schema Serving data Stats Validate statistics Detect anomalies Alert and analyze Skew detection - TFDV ● Supported for categorical features ● Expressed in terms of L-infinity distance (Chebyshev Distance): ● Set a threshold to receive warnings 2 2 2 2 2 2 1 1 1 2 2 1 1 2 2 1 1 1 2 2 2 2 2 2 Skew - TFDV Serving and training data don’t conform to same schema: ● For example, int != float Schema skew Training feature values are different than the serving feature values: ● Feature values are modified between training and serving time ● Transformation applied only in one of the two instances Feature skew
  • 28. Distribution of serving and training dataset is significantly different: ● Faulty sampling method during training ● Different data sources for training and serving data ● Trend, seasonality, changes in data over time Distribution skew ● TFDV: Descriptive statistics at scale with the embedded facets visualizations ● It provides insight into: ○ What are the underlying statistics of your data ○ How does your training, evaluation, and serving dataset statistics compare ○ How can you detect and fix data anomalies Key points ● Differences between ML modeling and a production ML system ● Responsible data collection for building a fair production ML system ● Process feedback and human labeling ● Detecting data issues Practice data validation with TFDV in this week’s exercise notebook Test your skills with the programming assignment Wrap up