DataOps: Machine Learning in Production
Stepan Pushkarev
CTO of Hydrosphere.io
Mission: Accelerate Machine Learning to Production
Open-source products:
- Mist: Serverless proxy for Spark
- ML Lambda: ML Function as a Service
- Sonar: Data and ML Monitoring
Business Model: Subscription services and hands-on consulting
About
How long is the journey?
Challenges in Production Solutions
Ad-hoc and disjointed application and
deployment architecture
Training / serving data skew
Reference architecture of data pipelines
and machine learning pipelines
Hand-off between Data Scientist -> ML Eng
-> Data Eng -> SA Eng -> QA -> Ops
Streamline deployment
Data format drifts, new data, wrong
features
Biased Training set / training issue
Model Degradation
Schema first design
Manual Quality Models
Statistical Quality Models
Unsupervised / automatic quality models
Predictive retraining
Continuous labeling and active learning
Performance issues Model Optimisations (out of the scope)
Vulnerability issues, malicious users,
adversarial input
Adversarial training
(out of the scope today)
Application Architectures are well studied
S3/HDFS/DWH
ML Architecture: data scientist + S3 = magic?
train.py
prepare.py
clean.py
Challenges of isolated ML architectures
- Focus on training. Serving and retraining are not part of the design
- Isolated from the rest of the applications and the team
- Offline. Turning it into an interactive multi-tenant system is a pain
- Designed for static “Kaggle” dataset while 90% of the
world’s data points have a timestamp (it’s live!)
- Batch. Painful to turn into streaming.
- For internal use. No SLA to support real production users
- Plan to throw one away - PoC mode, not designed for QA
Reference Architecture that matters
Streaming ML Architecture that matters
Comfort Zone for Data Scientist in the
middle of Production
ML architecture takeaways
● Should be designed in-house from day one. Not yet
another “Big Data Lake Platform” for the enterprise
● No data silos and forks. Up-to-date valid identical state
(data sets) for offline, batch, real-time and interactive
use cases
● Unified experimentation, testing and production
environments
● Unified data and application architecture. All
applications are data driven, and all data processing stages
are built as applications.
● Unified deployment, monitoring and metrics infrastructure
ML deployment and serving requirements
- Plumbing: Model metadata for REST, gRPC or Streaming API
- Unified across ML frameworks
- Immutable model versioning
- Agnostic to training pipeline and notebook environment
- Support for stateful and unsupervised models
- Support for prediction, search and recommendation models
- Support for model meta-pipelines (e.g. encoder->decoder)
- Infrastructure and runtime optimized for Serving
- Model optimization for Serving
- Support for streaming applications
Streamline Deployment and integration
model.pkl model.zip
How to integrate it into AI Application?
Model server = Model Artifact + ...
matching_model v2
[
....
]
Build Docker and deploy to the cloud.
Now what?
It is still an anonymous black box.
Model server = Model Artifact +
Metadata + Runtime + Deps
/predict
input:
string text;
bytes image;
output:
string summary;
JVM DL4j
GPU
matching_model v2
[
....
]
gRPC HTTP server
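The contract on this slide (a /predict endpoint that takes string text and bytes image and returns string summary) can be written down as data that travels with the model artifact. A minimal sketch, assuming plain Python; ModelContract, FIELD_TYPES and validate_request are hypothetical names for illustration, not an API of any product mentioned in this deck:

```python
from dataclasses import dataclass, field

# Map the type names used on the slide to Python types.
FIELD_TYPES = {"string": str, "bytes": bytes}


@dataclass(frozen=True)
class ModelContract:
    """Hypothetical metadata shipped next to the model artifact."""
    name: str
    version: int
    endpoint: str
    inputs: dict = field(default_factory=dict)   # field name -> type name
    outputs: dict = field(default_factory=dict)  # field name -> type name


def validate_request(contract: ModelContract, payload: dict) -> list:
    """Return a list of contract violations (empty if the payload is valid)."""
    errors = []
    for name, type_name in contract.inputs.items():
        if name not in payload:
            errors.append(f"missing input '{name}'")
        elif not isinstance(payload[name], FIELD_TYPES[type_name]):
            errors.append(f"input '{name}' must be {type_name}")
    for name in payload:
        if name not in contract.inputs:
            errors.append(f"unexpected input '{name}'")
    return errors


# The model from the slide, described instead of left as a black box.
matching_model = ModelContract(
    name="matching_model",
    version=2,
    endpoint="/predict",
    inputs={"text": "string", "image": "bytes"},
    outputs={"summary": "string"},
)
```

With metadata like this, a generic gRPC or HTTP server could reject malformed requests before they reach the runtime.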
Model server = Model Artifact +
Metadata + Runtime + Deps + Sidecar
/predict
input:
string text;
bytes image;
output:
string summary;
JVM DL4j
GPU
matching_model v2
[
....
]
gRPC HTTP server
routing, shadowing
pipelining
tracing
metrics
autoscaling
A/B, canary
sidecar
serving
requests
ML applications
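The routing, shadowing and canary duties of the sidecar can be sketched in a few lines. This is an illustration under assumptions, not how any particular sidecar is implemented: the function names, the hash-based bucketing and the 5% default are all hypothetical.

```python
import hashlib


def bucket(request_id: str) -> float:
    """Map a request id to a stable number in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def route(request_id: str, canary_fraction: float) -> str:
    """Pick the model version that answers this request."""
    return "canary" if bucket(request_id) < canary_fraction else "stable"


def handle(request_id, payload, stable, canary, shadow=None, canary_fraction=0.05):
    """Serve from stable or canary; mirror the request to an optional shadow model."""
    if shadow is not None:
        try:
            shadow(payload)  # result is discarded; only metrics and logs matter
        except Exception:
            pass  # a failing shadow model must never affect the caller
    target = canary if route(request_id, canary_fraction) == "canary" else stable
    return target(payload)
```

Hashing the request id keeps the split deterministic, so the same request always lands on the same version.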
Streaming: reuse Model as a Service
Immutable model state
Microservices ecosystem for
updates/rollbacks and A/B
One click deployment, no custom
Kafka-Flink-Spark implementation
required
Decoupled from streaming
Out of the box support for all
ML frameworks
Can be reused for online
requests
Can store state in Kafka and
keep model stateless
TF Serving challenges
● Other ML runtimes (DL4J, Scikit,
Spark ML). Servables are overkill.
● Need better versioning and
immutability (Docker per version)
● Don’t want to deal with state
(model loaded, offloaded, etc)
● Want to re-use microservices stack
(tracing, logging, metrics)
● Need better scalability
AWS SageMaker challenges
● Cost. m4.xlarge per model. 100 models = 100 m4.xlarge
● No Model API and metadata. Model is still a black box
● No Versioning
AWS SageMaker advantages
● Great docs and quick starts
● Integration with AWS ecosystem
● Python SDK and notebooks integration
model.zip
Model Deployment takeaways
● Eliminates hand-off between Data Scientist -> ML Eng ->
Data Eng -> SA Eng -> QA -> Ops
● Sticks components together: Data + Model + Applications +
Automation = AI Application
● Enables quick transition from research to production. ML
engineers can deploy models many times a day
But wait… This is not safe!
How to ensure we’ll not break things in prod?
Challenges in Production Solutions
Ad-hoc and disjointed application and
infrastructure architecture
Training / serving data skew
Reference architecture examples
Hand-off between Data Scientist -> ML
Eng -> Data Eng -> SA Eng -> QA -> Ops
Streamline deployment
Data format drifts, new data, wrong
features
Biased Training set / training issue
Model Degradation
Concept drift
Schema first design
Manual Quality Models
Statistical Quality Models
Unsupervised / automatic quality models
Predictive retraining
Continuous labeling and active learning
Performance issues Model Optimisations (out of the scope)
Vulnerability issues, malicious users,
adversarial input
Adversarial training
(out of the scope today)
Cost of AI/ML Error
● Fun
© http://blog.ycombinator.com/how-adversarial-attacks-work/
● Not fun
● Not fun at all…
● People's lives
● Money
● Business
Where may AI fail in prod?
● Bad training data
● Bad serving data
● Training/serving data skew
● Misconfiguration
● Deployment issue
● Retraining issue
● Performance
Everywhere!
Data exploration in production
Research:
Data Scientist makes
assumptions based on results
of data exploration
Production:
The model works if and only
if format and statistical
properties of prod data are
the same as in research
Push to Prod
Continuous data exploration
and validation?
Step 1: Schema first design
● Data defines the contracts and API between components
● Avro/Protobuf for all data records
● Confluent Schema Registry to manage Avro schemas
● Not only for Kafka. Must be used for all the data
pipelines (batch, Spark, etc)
{"namespace": "example.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "number", "type": ["int", "null"]},
{"name": "color", "type": ["string", "null"]}
]
}
Schema first design
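To make the contract idea concrete, here is a hand-rolled check of a record against the User schema above. It is a sketch that covers only the primitive types and unions used in that schema and ignores details such as Avro's 32-bit int range; a real pipeline would rely on an Avro library and a schema registry. The helper names are hypothetical.

```python
import json

SCHEMA = json.loads("""
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "number", "type": ["int", "null"]},
     {"name": "color", "type": ["string", "null"]}
 ]
}
""")

# Only the primitive Avro types that appear in the schema above.
PRIMITIVES = {
    "string": lambda v: isinstance(v, str),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "null": lambda v: v is None,
}


def matches(value, avro_type) -> bool:
    """True if value fits a primitive type or any branch of a union."""
    branches = avro_type if isinstance(avro_type, list) else [avro_type]
    return any(PRIMITIVES[t](value) for t in branches)


def check_record(record: dict, schema: dict) -> list:
    """Return the names of fields that are missing or have the wrong type."""
    bad = []
    for f in schema["fields"]:
        if f["name"] not in record or not matches(record[f["name"]], f["type"]):
            bad.append(f["name"])
    return bad
```

Running the same check in batch jobs and in streaming consumers is what makes the schema a contract rather than documentation.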
Step 2: Extend schema
● Avro/Protobuf can catch data format bugs
● How about data profile? Min, max, mean, etc?
● Describe a data profile, statistical properties and
validation rules in extended schema!
{"name": "User",
"fields": [
{"name": "name", "type": "string", "min_length": 2, "max_length": 128},
{"name": "age", "type": ["int", "null"], "range": "[10, 100]"},
{"name": "sex", "type": ["string", "null"], "enum": "[male, female, ...]"},
{"name": "wage", "type": ["int", "null"], "validator": "DSL here..."}
]
}
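A sketch of how such extended rules could be enforced and turned into a quality metric. It covers only the name and age fields; writing the range as a list rather than the string "[10, 100]" is an assumption, and the function names are hypothetical.

```python
# Extended-schema rules for two of the fields from the slide. Writing the
# range as a list instead of the string "[10, 100]" is an assumption.
EXTENDED_FIELDS = [
    {"name": "name", "type": "string", "min_length": 2, "max_length": 128},
    {"name": "age", "type": ["int", "null"], "range": [10, 100]},
]


def violations(record: dict, fields: list) -> list:
    """Return (field, rule) pairs for every rule the record breaks."""
    broken = []
    for spec in fields:
        value = record.get(spec["name"])
        if value is None:
            continue  # nullability and types are the base schema's job
        if "min_length" in spec and len(value) < spec["min_length"]:
            broken.append((spec["name"], "min_length"))
        if "max_length" in spec and len(value) > spec["max_length"]:
            broken.append((spec["name"], "max_length"))
        if "range" in spec:
            low, high = spec["range"]
            if not low <= value <= high:
                broken.append((spec["name"], "range"))
    return broken


def violation_rate(records: list, fields: list) -> float:
    """Share of records breaking at least one rule: a simple quality metric."""
    if not records:
        return 0.0
    return sum(1 for r in records if violations(r, fields)) / len(records)
```

The violation rate per batch or time window is one number that can be charted and alerted on.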
Extended schema generates
Data Quality metrics
Metrics types:
● Profiling
● Timeliness
● Completeness
Step 3: Generate Extended Schema
● Manually specified schema provides a data quality model and
improves data pipeline reliability, but it is hard to maintain
● We can automatically profile data shapes and generate
statistical properties to be used in extended schema
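A minimal sketch of automatic profiling: derive length bounds, value ranges and null fractions from sample records. The output keys mirror the extended-schema example where possible; null_fraction and mean are additions assumed here for illustration, and profile is a hypothetical name.

```python
def profile(records: list) -> list:
    """Derive extended-schema entries from sample records."""
    if not records:
        return []
    names = sorted({key for record in records for key in record})
    fields = []
    for name in names:
        values = [r[name] for r in records if r.get(name) is not None]
        spec = {"name": name, "null_fraction": 1 - len(values) / len(records)}
        if values and all(isinstance(v, str) for v in values):
            # String fields get length bounds.
            lengths = [len(v) for v in values]
            spec.update(min_length=min(lengths), max_length=max(lengths))
        elif values and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        ):
            # Numeric fields get a value range and a mean.
            spec.update(range=[min(values), max(values)],
                        mean=sum(values) / len(values))
        fields.append(spec)
    return fields
```

A generated profile is a starting point; someone still has to review the bounds before they become validation rules.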
Step 4: Clustering and Anomaly detection
● How to deal with multidimensional datasets and
complicated seasonality
● Rule based programs -> statistical models -> machine
learning models
Algorithms to consider:
● Deep Autoencoders
● Density based clustering algorithms with
Elbow method
● Clustering algorithms with Silhouette
method
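Of the options listed, the silhouette method is the easiest to show in isolation. The sketch below computes the mean silhouette coefficient for a given labeling in plain Python (3.8+ for math.dist); it does not do the clustering itself. In practice you would compare this score across candidate cluster counts and keep the best one.

```python
import math


def silhouette(points: list, labels: list) -> float:
    """Mean silhouette coefficient; higher means tighter, better separated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Distance to itself is 0, so divide by (n - 1) to skip it.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Mean distance to the nearest other cluster.
        b = min(mean_dist(point, m) for l, m in clusters.items() if l != label)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)
```

For four points forming two tight pairs far apart, the natural labeling scores close to 1 and a labeling that mixes the pairs scores below 0.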
Model server = Metadata + Model Artifact +
Runtime + Deps + Sidecar + Training Metadata
/predict
input:
output:
JVM DL4j
GPU
matching_model v2
[
....
]
gRPC HTTP server
sidecar
serving
requests
training data stats:
- min
- max
- clusters
- autoencoder
compare with prod
data in runtime
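A sketch of the simplest version of this runtime comparison: the sidecar keeps the min and max seen in training and watches the share of production requests that fall outside them. RangeMonitor, the window size and the threshold are assumptions for illustration; clusters or an autoencoder would replace the range test in a richer setup.

```python
from collections import deque


class RangeMonitor:
    """Flags requests whose features leave the range seen in training."""

    def __init__(self, training_stats: dict, window: int = 100, threshold: float = 0.2):
        self.stats = training_stats          # feature -> (min, max) from training
        self.recent = deque(maxlen=window)   # 1 if the request was out of range
        self.threshold = threshold

    def observe(self, features: dict) -> bool:
        """Record one production request; return True if it is out of range."""
        out_of_range = any(
            name in features and not low <= features[name] <= high
            for name, (low, high) in self.stats.items()
        )
        self.recent.append(1 if out_of_range else 0)
        return out_of_range

    def alert(self) -> bool:
        """True when the share of out-of-range requests in the window is too high."""
        return bool(self.recent) and sum(self.recent) / len(self.recent) > self.threshold
```

Because the monitor only needs the request payload, it can run in the sidecar without touching the model runtime.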
Model Monitoring
● Feedback loop for ML
Engineer
● Eliminates hand-off with
Ops
● Safe experiment on
shadowed traffic
● Shifts experimentation
to prod environment
● Fills the gap between
research and prod
● Correlation with
business metrics
Research: Quality monitoring of NLU system
Figure from: Bapna, Ankur, et al. "Towards zero-shot frame semantic parsing for domain scaling."
arXiv preprint arXiv:1707.02363 (2017).
Research: Quality monitoring of NLU system
Source image: Kurata, Gakuto, et al. "Leveraging sentence-level information with encoder lstm for semantic slot filling." arXiv preprint arXiv:1601.01530 (2016).
● Train and test offline on restaurants domain
● Deploy to prod
● Feed the model with new random Wiki data
● Monitor intermediate input representations (neural network hidden states)
Research: Quality monitoring of NLU system
● Red and Purple - cluster
of “Bad” production data
● Yellow and Blue - dev and
test data
Model Retraining - open questions
When to retrain?
When/how to push to prod?
What data to use for retraining?
Manually on demand
Works well for 1 model
But does not scale
Automatically by
schedule
Not safe
Can be expensive
Solution: Predictive Retraining + Safe Deployment
● Retrain when model
monitoring alerts
● Warm up new models on
shadowed/canary traffic
● Optionally relabel
● Rollback automatically
when monitoring alerts
● Build retraining dataset
from monitoring stats
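The loop described on this slide can be read as a small decision table. The sketch below is one possible reading, with hypothetical action names and inputs; it is not a description of an existing implementation.

```python
def next_action(prod_alert: bool, candidate_exists: bool,
                candidate_alert: bool, warmup_done: bool) -> str:
    """Decide the next step of a hypothetical retrain-and-deploy loop."""
    if candidate_exists:
        if candidate_alert:
            return "rollback"      # drop the candidate, keep the current model
        if warmup_done:
            return "promote"       # candidate survived shadowed/canary traffic
        return "keep_warming"      # not enough traffic seen yet
    if prod_alert:
        return "retrain"           # a monitoring alert triggers retraining
    return "noop"
```

Keeping the decision a pure function of monitoring signals makes it easy to test and to apply to many models at once.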
Quality Solutions takeaways
● Makes ML in production safe and predictable
● Allows ML operations to scale in prod to hundreds and
thousands of models
● Blurs the line between research and production by
enabling ML experiments on shadowed or canary traffic
● Data Quality model should be a part of the Protocol
● Very dirty and time consuming job of Data QA can be
automated with machine learning
Webinar takeaway: Production-ready ML apps
with the speed of Prototyping
Thank you
- Stepan Pushkarev
- @hydrospheredata
- https://github.com/Hydrospheredata
- https://hydrosphere.io/
- spushkarev@hydrosphere.io

Mais conteúdo relacionado

Mais procurados

Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...Sri Ambati
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineRobert Dempsey
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in productionTuri, Inc.
 
Seamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflowSeamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflowDatabricks
 
Machine Learning system architecture – Microsoft Translator, a Case Study : ...
Machine Learning system architecture – Microsoft Translator, a Case Study :  ...Machine Learning system architecture – Microsoft Translator, a Case Study :  ...
Machine Learning system architecture – Microsoft Translator, a Case Study : ...Vishal Chowdhary
 
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...Databricks
 
Magdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine LearningMagdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine LearningLviv Startup Club
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDatabricks
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOpsCarl W. Handlin
 
Weave GitOps - continuous delivery for any Kubernetes
Weave GitOps - continuous delivery for any KubernetesWeave GitOps - continuous delivery for any Kubernetes
Weave GitOps - continuous delivery for any KubernetesWeaveworks
 
Richard Coffey (x18140785) - Research in Computing CA2
Richard Coffey (x18140785) - Research in Computing CA2Richard Coffey (x18140785) - Research in Computing CA2
Richard Coffey (x18140785) - Research in Computing CA2Richard Coffey
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLJordan Birdsell
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityDatabricks
 
MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.Knoldus Inc.
 
Managers guide to effective building of machine learning products
Managers guide to effective building of machine learning productsManagers guide to effective building of machine learning products
Managers guide to effective building of machine learning productsGianmario Spacagna
 
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...Debraj GuhaThakurta
 
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Databricks
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...Robert Grossman
 

Mais procurados (20)

Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
 
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
Design Patterns for Machine Learning in Production - Sergei Izrailev, Chief D...
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning Pipeline
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
Seamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflowSeamless MLOps with Seldon and MLflow
Seamless MLOps with Seldon and MLflow
 
Machine Learning system architecture – Microsoft Translator, a Case Study : ...
Machine Learning system architecture – Microsoft Translator, a Case Study :  ...Machine Learning system architecture – Microsoft Translator, a Case Study :  ...
Machine Learning system architecture – Microsoft Translator, a Case Study : ...
 
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 
Magdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine LearningMagdalena Stenius: MLOPS Will Change Machine Learning
Magdalena Stenius: MLOPS Will Change Machine Learning
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in Production
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOps
 
Weave GitOps - continuous delivery for any Kubernetes
Weave GitOps - continuous delivery for any KubernetesWeave GitOps - continuous delivery for any Kubernetes
Weave GitOps - continuous delivery for any Kubernetes
 
Richard Coffey (x18140785) - Research in Computing CA2
Richard Coffey (x18140785) - Research in Computing CA2Richard Coffey (x18140785) - Research in Computing CA2
Richard Coffey (x18140785) - Research in Computing CA2
 
MLOps - The Assembly Line of ML
MLOps - The Assembly Line of MLMLOps - The Assembly Line of ML
MLOps - The Assembly Line of ML
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 
MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.MLOps Bridging the gap between Data Scientists and Ops.
MLOps Bridging the gap between Data Scientists and Ops.
 
Managers guide to effective building of machine learning products
Managers guide to effective building of machine learning productsManagers guide to effective building of machine learning products
Managers guide to effective building of machine learning products
 
What is MLOps
What is MLOpsWhat is MLOps
What is MLOps
 
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
 
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
 
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
How to Lower the Cost of Deploying Analytics: An Introduction to the Portable...
 

Semelhante a Data ops: Machine Learning in production

Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Provectus
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine LearningYuriy Guts
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...All Things Open
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)dtz001
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabszekeLabs Technologies
 
What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?Matei Zaharia
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLPaco Nathan
 
Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5gdgsurrey
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflowDatabricks
 
Build, Train, and Deploy ML Models at Scale
Build, Train, and Deploy ML Models at ScaleBuild, Train, and Deploy ML Models at Scale
Build, Train, and Deploy ML Models at ScaleAmazon Web Services
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 
Paige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine LearningPaige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine LearningEdunomica
 
OSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningOSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningPaco Nathan
 
AzureML Welcome to the future of Predictive Analytics
AzureML Welcome to the future of Predictive Analytics AzureML Welcome to the future of Predictive Analytics
AzureML Welcome to the future of Predictive Analytics Ruben Pertusa Lopez
 

Semelhante a Data ops: Machine Learning in production (20)

Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
 
Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
NEXiDA at OMG June 2009
NEXiDA at OMG June 2009NEXiDA at OMG June 2009
NEXiDA at OMG June 2009
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Automated Machine Learning
Automated Machine LearningAutomated Machine Learning
Automated Machine Learning
 
The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
Deployment Design Patterns - Deploying Machine Learning and Deep Learning Mod...
 
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
AllThingsOpen 2018 - Deployment Design Patterns (Dan Zaratsian)
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?What are the Unique Challenges and Opportunities in Systems for ML?
What are the Unique Challenges and Opportunities in Systems for ML?
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
 
Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5Certification Study Group - NLP & Recommendation Systems on GCP Session 5
Certification Study Group - NLP & Recommendation Systems on GCP Session 5
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
Build, Train, and Deploy ML Models at Scale
Build, Train, and Deploy ML Models at ScaleBuild, Train, and Deploy ML Models at Scale
Build, Train, and Deploy ML Models at Scale
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
Paige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine LearningPaige Roberts: Shortcut MLOps with In-Database Machine Learning
Paige Roberts: Shortcut MLOps with In-Database Machine Learning
 
OSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine LearningOSCON 2014: Data Workflows for Machine Learning
OSCON 2014: Data Workflows for Machine Learning
 
Machine learning
Machine learningMachine learning
Machine learning
 
AzureML Welcome to the future of Predictive Analytics
AzureML Welcome to the future of Predictive Analytics AzureML Welcome to the future of Predictive Analytics
AzureML Welcome to the future of Predictive Analytics
 

Mais de Stepan Pushkarev

AI for the Human Retina to Protect Newborn Vision
AI for the Human Retina to Protect Newborn VisionAI for the Human Retina to Protect Newborn Vision
AI for the Human Retina to Protect Newborn VisionStepan Pushkarev
 
Automating machine learning lifecycle with kubeflow
Automating machine learning lifecycle with kubeflowAutomating machine learning lifecycle with kubeflow
Automating machine learning lifecycle with kubeflowStepan Pushkarev
 
Handling inference in anomalous ever changing environment
Handling inference in anomalous ever changing environmentHandling inference in anomalous ever changing environment
Handling inference in anomalous ever changing environmentStepan Pushkarev
 
Multi runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningMulti runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningStepan Pushkarev
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operationsStepan Pushkarev
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architectureStepan Pushkarev
 

Mais de Stepan Pushkarev (7)

AI for the Human Retina to Protect Newborn Vision
AI for the Human Retina to Protect Newborn VisionAI for the Human Retina to Protect Newborn Vision
AI for the Human Retina to Protect Newborn Vision
 
Automating machine learning lifecycle with kubeflow
Automating machine learning lifecycle with kubeflowAutomating machine learning lifecycle with kubeflow
Automating machine learning lifecycle with kubeflow
 
Handling inference in anomalous ever changing environment
Handling inference in anomalous ever changing environmentHandling inference in anomalous ever changing environment
Handling inference in anomalous ever changing environment
 
Multi runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningMulti runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learning
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operations
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 

Data ops: Machine Learning in production

  • 1. DataOps: Machine Learning in Production Stepan Pushkarev CTO of Hydrosphere.io
  • 2. About
      Mission: Accelerate Machine Learning to Production
      Open-source products:
      - Mist: Serverless proxy for Spark
      - ML Lambda: ML Function as a Service
      - Sonar: Data and ML Monitoring
      Business model: Subscription services and hands-on consulting
  • 3. How long is the journey?
  • 4. How long is the journey?
  • 5. Challenges in Production -> Solutions
      - Ad-hoc and disjointed application and deployment architecture; training / serving data skew -> Reference architecture of data pipelines and machine learning pipelines
      - Hand-off between Data Scientist -> ML Eng -> Data Eng -> SA Eng -> QA -> Ops -> Streamline deployment
      - Data format drifts, new data, wrong features; biased training set / training issue; model degradation -> Schema first design; manual quality models; statistical quality models; unsupervised / automatic quality models; predictive retraining; continuous labeling and active learning
      - Performance issues -> Model optimisations (out of scope)
      - Vulnerability issues, malicious users, adversarial input -> Adversarial training (out of scope today)
  • 7. S3/HDFS/DWH ML Architecture: data scientist + S3 = magic? train.py prepare.py clean.py
  • 8. Challenges of isolated ML architectures
      - Focus on training. Serving and re-training are not designed
      - Isolated from the rest of the applications and the team
      - Offline. Turning into interactive multi-tenant is a pain
      - Designed for a static “Kaggle” dataset while 90% of the world’s data points have a timestamp (it’s live!)
      - Batch. Painful to turn into streaming
      - For internal use. No SLA to support real production users
      - Plan to throw one away: PoC mode, not designed for QA
  • 10. Streaming ML Architecture that matters Comfort Zone for Data Scientist in the middle of Production
  • 11. ML architecture takeaways
      ● Should be designed in house from day one: not yet another “Big Data Lake Platform” for the enterprise
      ● No data silos and forks. Up-to-date, valid, identical state (data sets) for offline, batch, real-time and interactive use cases
      ● Unified experimentation, testing and production environments
      ● Unified data and application architecture. All applications are data driven, and all data processing stages are built as applications
      ● Unified deployment, monitoring and metrics infrastructure
  • 12. Challenges in Production -> Solutions
      - Ad-hoc and disjointed application and infrastructure architecture; training / serving data skew -> Reference architecture examples
      - Hand-off between Data Scientist -> ML Eng -> Data Eng -> SA Eng -> QA -> Ops -> Streamline deployment
      - Data format drifts, new data, wrong features; biased training set / training issue; model degradation -> Schema first design; manual quality models; statistical quality models; unsupervised / automatic quality models; predictive retraining; continuous labeling and active learning
      - Performance issues -> Model optimisations (out of scope)
      - Vulnerability issues, malicious users, adversarial input -> Adversarial training (out of scope today)
  • 13. ML deployment and serving requirements
      - Plumbing: Model metadata for REST, gRPC or Streaming API
      - Unified across ML frameworks
      - Immutable model versioning
      - Agnostic to training pipeline and notebook environment
      - Support for stateful and unsupervised models
      - Support for prediction, search and recommendation models
      - Support for model meta-pipelines (e.g. encoder->decoder)
      - Infrastructure and runtime optimized for Serving
      - Model optimization for Serving
      - Support for streaming applications
  • 14. Streamline Deployment and integration model.pkl model.zip How to integrate it into AI Application?
  • 15. Model server = Model Artifact + ... matching_model v2 [ .... ] Build Docker and deploy to the cloud. Now what? It is still an anonymous black box.
  • 16. Model server = Model Artifact + Metadata + Runtime + Deps /predict input: string text; bytes image; output: string summary; JVM DL4j GPU matching_model v2 [ .... ] gRPC HTTP server
  • 17. Model server = Model Artifact + Metadata + Runtime + Deps + Sidecar /predict input: string text; bytes image; output: string summary; JVM DL4j GPU matching_model v2 [ .... ] gRPC HTTP server routing, shadowing pipelining tracing metrics autoscaling A/B, canary sidecar serving requests
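The "Artifact + Metadata" idea on these slides can be sketched as an explicit contract that travels with the model. This is a minimal illustration only: `ModelContract`, `check_request` and the type table are hypothetical names, not part of any Hydrosphere API; the field names (`text`, `image`, `summary`) and `matching_model` v2 are taken from the slide.

```python
from dataclasses import dataclass

# Hypothetical mapping from contract type names to Python types.
_TYPES = {"string": str, "bytes": bytes}


@dataclass(frozen=True)
class ModelContract:
    """Metadata shipped next to the model artifact (illustrative only)."""
    name: str
    version: int
    runtime: str
    inputs: dict
    outputs: dict

    def check_request(self, payload: dict) -> list:
        """Return a list of contract violations; empty means the request is OK."""
        errors = []
        for field_name, type_name in self.inputs.items():
            if field_name not in payload:
                errors.append(f"missing input: {field_name}")
            elif not isinstance(payload[field_name], _TYPES[type_name]):
                errors.append(f"{field_name}: expected {type_name}")
        for field_name in payload:
            if field_name not in self.inputs:
                errors.append(f"unexpected input: {field_name}")
        return errors


contract = ModelContract(
    name="matching_model",
    version=2,
    runtime="JVM/DL4J",
    inputs={"text": "string", "image": "bytes"},
    outputs={"summary": "string"},
)
```

With such a contract, a /predict handler or a sidecar could reject malformed requests before they reach the model, so the model stops being an anonymous black box.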
  • 19. Streaming: reuse Model as a Service
      - Immutable model state
      - Microservices ecosystem for updates/rollbacks and A/B
      - One click deployment, no custom Kafka-Flink-Spark implementation required
      - Decoupled from streaming
      - Out of the box support for all ML frameworks
      - Can be reused for online requests
      - Can store state in Kafka and keep model stateless
  • 20. TF Serving challenges
      ● Other ML runtimes (DL4J, Scikit, Spark ML). Servables are overkill.
      ● Need better versioning and immutability (Docker per version)
      ● Don’t want to deal with state (model loaded, offloaded, etc)
      ● Want to re-use microservices stack (tracing, logging, metrics)
      ● Need better scalability
  • 21. AWS SageMaker challenges
      ● Cost. m4.xlarge per model. 100 models = 100 m4.xlarge
      ● No Model API and metadata. Model is still a black box
      ● No Versioning
      AWS SageMaker advantages
      ● Great docs and quick starts
      ● Integration with AWS ecosystem
      ● Python SDK and notebooks integration
  • 22. Model Deployment takeaways
      ● Eliminates hand-off between Data Scientist -> ML Eng -> Data Eng -> SA Eng -> QA -> Ops
      ● Sticks components together: Data + Model + Applications + Automation = AI Application
      ● Enables quick transition from research to production. ML engineers can deploy models many times a day
      But wait… This is not safe! How to ensure we’ll not break things in prod?
  • 23. Challenges in Production -> Solutions
      - Ad-hoc and disjointed application and infrastructure architecture; training / serving data skew -> Reference architecture examples
      - Hand-off between Data Scientist -> ML Eng -> Data Eng -> SA Eng -> QA -> Ops -> Streamline deployment
      - Data format drifts, new data, wrong features; biased training set / training issue; model degradation; concept drift -> Schema first design; manual quality models; statistical quality models; unsupervised / automatic quality models; predictive retraining; continuous labeling and active learning
      - Performance issues -> Model optimisations (out of scope)
      - Vulnerability issues, malicious users, adversarial input -> Adversarial training (out of scope today)
  • 24. Cost of AI/ML Error ● Fun © http://blog.ycombinator.com/how-adversarial-attacks-work/
  • 25. Cost of AI Error ● Fun ● Not fun
  • 26. Cost of AI/ML Error ● Fun ● Not fun ● Not fun at all...
  • 27. Cost of AI Error ● Fun ● Not fun ● Not fun at all… ● People's lives
  • 28. Cost of AI/ML Error ● Fun ● Not fun ● Not fun at all… ● People's lives ● Money
  • 29. Cost of ML Error ● Fun ● Not fun ● Not fun at all… ● People's lives ● Money ● Business
  • 30. Where may AI fail in prod? Everywhere!
  • 31. Where may AI fail in prod? ● Bad training data ● Bad serving data ● Training/serving data skew ● Misconfiguration ● Deployment issue ● Retraining issue ● Performance Everywhere!
  • 32. Data exploration in production Research: Data Scientist makes assumptions based on results of data exploration
  • 33. Data exploration in production Research: Data Scientist explores datasets and makes assumptions/hypothesis Production: The model works if and only if the format and statistical properties of prod data are the same as in research Push to Prod
  • 34. Data exploration in production Research: Data Scientist makes assumptions based on results of data exploration Production: The model works if and only if format and statistical properties of prod data are the same as in research Push to Prod Continuous data exploration and validation?
  • 35. Step 1: Schema first design
      ● Data defines the contracts and API between components
      ● Avro/Protobuf for all data records
      ● Confluent Schema Registry to manage Avro schemas
      ● Not only for Kafka. Must be used for all the data pipelines (batch, Spark, etc)
      {"namespace": "example.avro",
       "type": "record",
       "name": "User",
       "fields": [
         {"name": "name", "type": "string"},
         {"name": "number", "type": ["int", "null"]},
         {"name": "color", "type": ["string", "null"]}
       ]
      }
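To show how a schema can act as the contract between components, here is a small pure-Python sketch that checks records against the `User` schema from the slide. It is an assumption-laden illustration: a real pipeline would use an Avro library together with the schema registry, and `format_errors` and `_matches` are hypothetical helpers that only understand the `string`, `int` and `null` types and unions used here.

```python
import json

# The Avro record schema shown on the slide.
USER_SCHEMA = json.loads("""
{"namespace": "example.avro", "type": "record", "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "number", "type": ["int", "null"]},
   {"name": "color", "type": ["string", "null"]}
 ]}
""")


def _matches(value, avro_type):
    """Check one value against a primitive type or a union (list of types)."""
    if isinstance(avro_type, list):
        return any(_matches(value, t) for t in avro_type)
    if avro_type == "null":
        return value is None
    if avro_type == "string":
        return isinstance(value, str)
    if avro_type == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    raise ValueError(f"type not supported by this sketch: {avro_type}")


def format_errors(record, schema):
    """Return format violations for one record; empty list means it conforms."""
    errors = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in record:
            errors.append(f"{name}: missing")
        elif not _matches(record[name], field["type"]):
            errors.append(f"{name}: wrong type")
    return errors
```

For example, `format_errors({"name": "Ann", "number": 7, "color": None}, USER_SCHEMA)` returns an empty list, while a record with a numeric `name` is reported as a format bug before it reaches any model.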
  • 37. Step 2: Extend schema
      ● Avro/Protobuf can catch data format bugs
      ● How about the data profile? Min, max, mean, etc?
      ● Describe a data profile, statistical properties and validation rules in an extended schema!
      {"name": "User",
       "fields": [
         {"name": "name", "type": "string", "min_length": 2, "max_length": 128},
         {"name": "age", "type": ["int", "null"], "range": "[10, 100]"},
         {"name": "sex", "type": ["string", "null"], "enum": "[male, female, ...]"},
         {"name": "wage", "type": ["int", "null"], "validator": "DSL here..."}
       ]
      }
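A rough sketch of how the extended-schema rules could be enforced at runtime. Several things here are assumptions made for the example: the rules are written as native Python lists instead of the string-encoded form on the slide, the `nullable` flag stands in for the `["type", "null"]` unions, the enum values are an illustrative subset (the slide elides the full list), and the `validator` DSL is left out because the slide does not define it.

```python
# Hypothetical, simplified form of the extended schema from the slide.
EXTENDED_FIELDS = [
    {"name": "name", "min_length": 2, "max_length": 128},
    {"name": "age", "range": [10, 100], "nullable": True},
    {"name": "sex", "enum": ["male", "female"], "nullable": True},  # illustrative subset
]


def profile_errors(record, fields):
    """Return data-profile violations for one record (empty list if none)."""
    errors = []
    for rule in fields:
        name = rule["name"]
        value = record.get(name)
        if value is None:
            if not rule.get("nullable", False):
                errors.append(f"{name}: required")
            continue
        if "min_length" in rule and len(value) < rule["min_length"]:
            errors.append(f"{name}: shorter than {rule['min_length']}")
        if "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{name}: longer than {rule['max_length']}")
        if "range" in rule:
            low, high = rule["range"]
            if not low <= value <= high:
                errors.append(f"{name}: outside [{low}, {high}]")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: not in enum")
    return errors
```

The point of the design is that format checks and profile checks live in the same schema, so every pipeline stage can apply the same rules.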
  • 38. Extended schema generates Data Quality metrics. Metric types: ● Profiling ● Timeliness ● Completeness
  • 39. Step 3: Generate Extended Schema
      ● A manually specified schema provides a data quality model and improves data pipeline reliability, but it is hard to maintain
      ● We can automatically profile data shapes and generate statistical properties to be used in the extended schema
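Automatic profiling can be approximated in a few lines. The sketch below derives per-field properties (completeness, min, max, mean, string lengths) from sample records; `profile_column` and `generate_extended_schema` are hypothetical names, and the choice of statistics is an assumption based on the properties mentioned in the slides.

```python
import statistics


def profile_column(values):
    """Derive simple statistical properties for one field from sample values."""
    present = [v for v in values if v is not None]
    profile = {"completeness": len(present) / len(values)}
    if present and all(isinstance(v, (int, float)) for v in present):
        profile.update(min=min(present), max=max(present),
                       mean=statistics.mean(present))
    elif present and all(isinstance(v, str) for v in present):
        lengths = [len(v) for v in present]
        profile.update(min_length=min(lengths), max_length=max(lengths))
    return profile


def generate_extended_schema(records):
    """Profile every field that appears in the sample records."""
    names = sorted({key for record in records for key in record})
    return {name: profile_column([r.get(name) for r in records])
            for name in names}


sample = [
    {"name": "Ann", "age": 30},
    {"name": "Bob", "age": None},
    {"name": "Chloe", "age": 50},
]
schema = generate_extended_schema(sample)
```

The generated profile can then be reviewed by a person and stored next to the format schema, which avoids maintaining the ranges by hand.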
  • 40. Step 4: Clustering and Anomaly detection
      ● How to deal with multidimensional datasets and complicated seasonality?
      ● Rule-based programs -> statistical models -> machine learning models
      Algorithms to consider:
      ● Deep Autoencoders
      ● Density-based clustering algorithms with the Elbow method
      ● Clustering algorithms with the Silhouette method
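As an illustration of the Silhouette method mentioned above, the following sketch computes the mean silhouette coefficient for one-dimensional points under a given cluster assignment. It is deliberately simplified: real data would be multidimensional and clustered by a library, the function name `silhouette` is a placeholder, and at least two clusters are assumed.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points; needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(point - q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(abs(point - q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)


points = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
good = silhouette(points, [0, 0, 0, 1, 1, 1])
bad = silhouette(points, [0, 0, 1, 1, 1, 0])
```

A clustering that matches the structure of the data scores close to 1, so comparing scores across candidate cluster counts is one way to pick the number of clusters; production points that fit none of the training clusters are then candidates for anomalies.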
  • 41. Model server = Metadata + Model Artifact + Runtime + Deps + Sidecar + Training Metadata /predict input: output: JVM DL4j GPU matching_model v2 [ .... ] gRPC HTTP server sidecar serving requests training data stats: - min - max - clusters - autoencoder compare with prod data in runtime
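The "compare with prod data in runtime" step could look roughly like this for the simplest training statistics (min and max). `TrainingStats`, `out_of_range_ratio`, `should_alert` and the 10% threshold are all assumptions for the example; clusters or autoencoder reconstruction error would replace the range check in a fuller version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingStats:
    """Statistics captured at training time and shipped with the model."""
    minimum: float
    maximum: float


def out_of_range_ratio(batch, stats):
    """Fraction of production values that fall outside the training range."""
    outside = sum(1 for x in batch
                  if not stats.minimum <= x <= stats.maximum)
    return outside / len(batch)


def should_alert(batch, stats, threshold=0.1):
    """Alert when too many production values were never seen in training."""
    return out_of_range_ratio(batch, stats) > threshold


stats = TrainingStats(minimum=0.0, maximum=100.0)
```

Running the check in a sidecar keeps the model itself unchanged while still giving the ML engineer a signal when serving data drifts away from the training data.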
  • 42. Model Monitoring ● Feedback loop for ML Engineer ● Eliminates hand-off with Ops ● Safe experiment on shadowed traffic ● Shifts experimentation to prod environment ● Fills the gap between research and prod ● Correlation with business metrics
  • 43. Research: Quality monitoring of NLU system Figure from: Bapna, Ankur, et al. "Towards zero-shot frame semantic parsing for domain scaling." arXiv preprint arXiv:1707.02363 (2017).
  • 44. Research: Quality monitoring of NLU system
      ● Train and test offline on restaurants domain
      ● Deploy to prod
      ● Feed the model with new random Wiki data
      ● Monitor intermediate input representations (neural network hidden states)
      Source image: Kurata, Gakuto, et al. "Leveraging sentence-level information with encoder lstm for semantic slot filling." arXiv preprint arXiv:1601.01530 (2016).
  • 45. Research: Quality monitoring of NLU system ● Red and Purple - cluster of “Bad” production data ● Yellow and Blue - dev and test data
  • 46. Model Retraining - open questions When to retrain? When/how to push to prod? What data to use for retraining? Manually on demand Works well for 1 model But does not scale
  • 47. Model Retraining - open questions When to retrain? When/how to push to prod? What data to use for retraining? Manually on demand Works well for 1 model But does not scale Automatically by schedule Not safe Can be expensive
  • 48. Solution: Predictive Retraining + Safe Deployment ● Retrain when model monitoring alerts ● Warm up new models on shadowed/canary traffic ● Optionally relabel ● Rollback automatically when monitoring alerts ● Build retraining dataset from monitoring stats
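One possible way to read the bullets above as a decision rule is sketched below. The policy, the `Action` enum and the flag names are all hypothetical; the slides do not specify the exact order of checks, so this is one interpretation, not the speaker's implementation.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    RETRAIN = "retrain"
    PROMOTE = "promote"
    ROLLBACK = "rollback"


def next_action(prod_alert, recently_promoted, candidate_ready, candidate_alert):
    """Pick the next step from monitoring signals (illustrative policy only).

    prod_alert:        monitoring alert on the model serving live traffic
    recently_promoted: the live model was promoted a short time ago
    candidate_ready:   a retrained model is warming up on shadow/canary traffic
    candidate_alert:   monitoring alert on that candidate
    """
    if prod_alert and recently_promoted:
        return Action.ROLLBACK  # roll back automatically when monitoring alerts
    if candidate_ready:
        # Promote only candidates that stayed healthy on shadowed traffic.
        return Action.RETRAIN if candidate_alert else Action.PROMOTE
    if prod_alert:
        return Action.RETRAIN  # retrain when model monitoring alerts
    return Action.KEEP
```

For instance, `next_action(True, False, False, False)` yields `Action.RETRAIN`, and a healthy candidate on shadow traffic yields `Action.PROMOTE`.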
  • 49. Quality Solutions takeaways
      ● Makes ML in production safe and predictable
      ● Allows ML operations to scale in prod to hundreds and thousands of models
      ● Blurs the line between research and production by enabling ML experiments on shadowed or canary traffic
      ● The Data Quality model should be a part of the Protocol
      ● The very dirty and time-consuming job of Data QA can be automated with machine learning
  • 50. Webinar takeaway: Production-ready ML apps at the speed of prototyping
  • 51. Thank you - Stepan Pushkarev - @hydrospheredata - https://github.com/Hydrospheredata - https://hydrosphere.io/ - spushkarev@hydrosphere.io