SlideShare a Scribd company logo
1 of 44
Download to read offline
Apache Spark™ MLlib 2.x:
How to Productionize your
Machine Learning Models
Richard Garris (Principal Solutions Architect)
Jules S. Damji (Spark Community Evangelist)
March 9 , 2017
About Me
• Jules Damji
• Twitter: @2twitme; LinkedIn
• Spark Community Evangelist @ Databricks
• Worked in various software engineering roles building
distributed systems and applications at Sun, Netscape,
LoudCloud/Opsware, VeriSign, Scalix, Ebrary & Hortonworks
Webinar Logistics
3
Empower anyone to innovate faster with big data.
Founded by the creators of Apache Spark.
Contributes 75%of the open source code,
10xmore than any other company.
VISION
WHO WE ARE
A fully-managed data processing platform
for the enterprise, powered by Apache Spark.
PRODUCT
CLUSTER TUNING &
MANAGEMENT
INTERACTIVE
WORKSPACE
PRODUCTION
PIPELINE
AUTOMATION
OPTIMIZED DATA
ACCESS
DATABRICKS ENTERPRISE SECURITY
YOUR	TEAMS
Data Science
Data Engineering
Many others…
BI Analysts
YOUR	DATA
Cloud Storage
Data Warehouses
Data Lake
VIRTUAL ANALYTICS PLATFORM
Apache Spark™ MLlib 2.x:
How to Productionize your
Machine Learning Models
Richard Garris (Principal Solutions Architect)
March 9 , 2017
About Me
• Richard L Garris
• rlgarris@databricks.com
• Twitter @rlgarris
• Principal Data Solutions Architect @ Databricks
• 12+ years designing Enterprise Data Solutions for everyone from
startups to Global 2000
• Prior Work Experience PwC, Google and Skytree – the Machine
Learning Company
• Ohio State Buckeye and Masters from CMU
Outline
• Spark MLlib 2.X
• Model Serialization
• Model Scoring System Requirements
• Model Scoring Architectures
• Databricks Model Scoring
About Apache Spark™ MLlib
• Started with Spark 0.8 in the
AMPLab in 2014
• Migration to Spark
DataFrames started with
Spark 1.3 with feature parity
within 2.X
• Contributions by 75+ orgs,
~250 individuals
• Distributed algorithms that
scale linearly with the data
MLlib’s Goals
• General purpose machine learning library optimized for big data
• Linearly scalable = 2x more machines , runtime theoretically cut in half
• Fault tolerant = resilient to the failure of nodes
• Covers the most common algorithms with distributed implementations
• Built around the concept of a Data Science Pipeline (scikit-learn)
• Written entirely using Apache Spark™
• Integrates well with the Agile Modeling Process
A Model is a Mathematical Function
• A model is a function: 𝑓 𝑥
• Linear regression 𝑦	 = 	𝑏0	 + 	𝑏1 𝑥1	 + 	𝑏2 𝑥2
ML Pipelines
Train	model
Evaluate
Load	data
Extract	features
A	very	simple	pipeline
ML Pipelines
Train	model	1
Evaluate
Datasource	1
Datasource	2
Datasource	3
Extract	featuresExtract	features
Feature	transform	1
Feature	transform	2
Feature	transform	3
Train	model	2
Ensemble
A	real	pipeline!
Productionizing Models Today
Data Science Data Engineering
Develop Prototype
Model using Python/R Re-implement model for
production (Java)
Problems with ProductionizingModels
Develop Prototype
Model using Python/R
Re-implement model for
production (Java)
- Extra work
- Different code paths
- Data science does not translate to production
- Slow to update models
Data Science Data Engineering
MLlib 2.X Model Serialization
Data Science Data Engineering
Develop Prototype
Model using Python/R
Persist model or Pipeline:
model.save(“s3n://...”)
Load Pipeline (Scala/Java)
Model.load(“s3n://…”)
Deploy in production
Scala
val lrModel = lrPipeline.fit(dataset)
// Save the Model
lrModel.write.save("/models/lr")
•
MLlib 2.X Model Serialization Snippet
Python
lrModel = lrPipeline.fit(dataset)
# Save the Model
lrModel.write.save("/models/lr")
•
Model Serialization Output
Code
// List Contents of the Model Dir
dbutils.fs.ls("/models/lr")
•
Output
Remember	this	is	a	pipeline	
model	and	these	are	the	stages!
Transformer Stage (StringIndexer)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head(”/models/lr/stages/00_strI
dx_bb9728f85745/metadata/part-00000")
// Display the Parquet File in the Data dir
display(spark.read.parquet(”/models/lr/sta
ges/00_strIdx_bb9728f85745/data/"))
Output
{
"class":"org.apache.spark.ml.feature.StringIndexerModel",
"timestamp":1488120411719,
"sparkVersion":"2.1.0",
"uid":"strIdx_bb9728f85745",
"paramMap":{
"outputCol":"workclassIdx",
"inputCol":"workclass",
"handleInvalid":"error"
}
}
Metadata	and	params
Data	(Hashmap)
Estimator Stage (LogisticRegression)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head(”/models/lr/stages/18_logr
eg_325fa760f925/metadata/part-00000")
// Display the Parquet File in the Data dir
display(spark.read.parquet("/models/lr/sta
ges/18_logreg_325fa760f925/data/"))
Output
Model	params
Intercept	+	Coefficients
{"class":"org.apache.spark.ml.classification.LogisticRegressionModel",
"timestamp":1488120446324,
"sparkVersion":"2.1.0",
"uid":"logreg_325fa760f925",
"paramMap":{
"predictionCol":"prediction",
"standardization":true,
"probabilityCol":"probability",
"maxIter":100,
"elasticNetParam":0.0,
"family":"auto",
"regParam":0.0,
"threshold":0.5,
"fitIntercept":true,
"labelCol":"label” }}
Output
Decision	Tree	Splits
Estimator Stage (DecisionTree)
Code
// Display the Parquet File in the Data dir
display(spark.read.parquet(”/models/dt/stages/18_dtc_3d614bcb3ff825/data/"))
// Re-save as JSON
spark.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").json((”/models/json/dt").
Visualize Stage (DecisionTree)
Visualization	of	the	Tree	
In	Databricks
What are the Requirements for
a Robust Model Deployment
System?
Model Scoring Environment Examples
• In Web Applications / Ecommerce Portals
• Mainframe / Batch Processing Systems
• Real-Time Processing Systems / Middleware
• Via API / Microservice
• Embedded in Devices (Mobile Phones, Medical Devices, Autos)
Hidden Technical Debt in ML Systems
“Hidden Technical Debt in Machine
Learning Systems “, Google NIPS 2015
“Hidden Technical Debt in MachineLearning Systems “, GoogleNIPS 2015
Agile Modeling Process
Set	Business	Goals
Understand	 Your	
Data
Create	Hypothesis
Devise	Experiment
Prepare	Data
Train-Tune-Test	
Model
Deploy	Model
Measure/Evaluate	
Results
Agile Modeling Process
Set	Business	Goals
Understand	 Your	
Data
Create	Hypothesis
Devise	Experiment
Prepare	Data
Train-Tune-Test	
Model
Deploy	Model
Measure/Evaluate	
Results
Focus	of	this	
talk
Set	Business	Goals
Understand	 Your	
Data
Create	Hypothesis
Devise	Experiment
Prepare	Data
Train-Tune-Test	
Model
Deploy	Model
Measure/Evaluate	
Results
Deployment Should be Agile
• Deployment needs to
support A/B testing and
experiments
• Deployment should
support measuring and
evaluating model
performance
• Deployment should be
fast and adaptive to
business needs
Model A/B Testing, Monitoring, Updates
• A/B testing– comparing two versions to see what performs better
• Monitoringis the process of observing the model’s performance, logging it’s
behavior and alerting when the model degrades
• Logging should log exactly the data feedinto the model at the time of scoring
• Model update process
• Benchmark (or Shadow Models)
• Phase-In (20% traffic)
• Avoid Big Bang
Consider the Scoring Environment
Customer SLAs
•Response time
•Throughput
(predictionsper
second)
•Uptime / Reliability
Tech Stack
–C / C++
–Legacy (mainframe)
–Java
Batch Real-Time
Scoring in Batch vs Real-Time
• Synchronous
• Could be Seconds:
– Customer is waiting
(human real-time)
• Subsecond:
– High FrequencyTrading
– Fraud Detection on the Swipe
• Asynchronous
• Internal Use
• Triggers can be event based on time
based
• Used for EmailCampaigns,
Notifications
Open Loop – human being involved
Closed Loop– no human involved
• Model Scoring– almost always
closed loop, some open loop e.g.
alert agents or customer service
• Model Training– usually open loop
with a data scientist in the loop to
update the model
Online Learning and Open / Closed Loop
• Online is closed loop, entirely
machine driven but modelingis
risky
• need to have proper model
monitoringand safeguards to
prevent abuse / sensitivity to noise
• MLlibsupports online through
streaming models (k-means, logistic
regression support online)
• Alternative – use a more complex
model to better fit new data rather
than using online learning
Open / Closed Loop Online Learning
Model Scoring – Bot Detection
Not All Models Return Boolean – e.g. a Yes / No
Example: Login Bot Detector
Different behavior depending on probability that use is a bot
0.0-0.4 ☞ Allow login
0.4-0.6 ☞ Send ChallengeQuestion
0.6 to 0.75 ☞ Send SMS Code
0.75 to 0.9 ☞ Refer to Agent
0.9 - 1.0 ☞ Block
Model Scoring – Recommendations
Output is a ranking of the top n items
API – send user ID + number of items
Return sorted set of items to recommend
Optional –
pass context sensitive information to tailor results
Model Scoring Architectures
Architecture Option A
Precompute Predictions using Spark and Serve from Database
Train	ALS	Model
Send	Email	Offers	
to	Customers
Save	Offers	to	
NoSQL
Ranked	Offers
Display	Ranked	
Offers	in	Web	/	
Mobile
Recurring	
Batch
Architecture Option B
Spark Stream and Score using an API with Cached Predictions
Web	Activity	Logs
Kill	User’s	Login	
SessionCompute	Features Run	Prediction
Streaming
Cache	Predictions API	Check
Architecture Option C
Train with Spark and Score Outside of Spark
Train	Model	in	
Spark
Save	Model	to	S3	
/	HDFS
New	Data
Copy	
Model	to	
Production
Predictions
Load	coefficients	and	
intercept	from	file
Databricks Model Scoring
Databricks Model Scoring
• Based on Architecture Option C
• Goal: Deploy MLlib model outside of Apache Spark and
Databricks.
• Easy to Embedin Existing Environments
• Low Latency and Complexity
• Low Overhead
• Train Modelin Databricks
– Call Fit on Pipeline
– Save Model as JSON
• Deploy modelin external system
– Add dependency on “dbml-local”
package (without Spark)
– Load model from JSON at startup
– Make predictions in real time
Databricks Model Scoring
Code
// Fit and Export the Model in Databricks
val lrModel = lrPipeline.fit(dataset)
ModelExporter.export(lrModel, " /models/db ")
// In YourApplication (Scala)
import com.databricks.ml.local.ModelImport
val lrModel = ModelImport.import("s3a:/...")
val jsonInput = ...
val jsonOutput = lrModel.transform(jsonInput)
Databricks Model Scoring Private Beta
• Private Beta Available for Databricks Customers
• Available on Databricks using Apache Spark 2.1
• Only logistic regression available now
• Additional Estimators and Transformers in Progress
Demo Model Serialization
https://databricks-prod-
cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8
714f173bcfc/1526931011080774/1904316851197504/63204405618
00420/latest.html
Thank You!
Questions?
Happy Sparking

More Related Content

What's hot

Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkDatabricks
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowDatabricks
 
An Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaAn Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaSpark Summit
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowDatabricks
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureData Science Milan
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...Databricks
 
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleData Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleDatabricks
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operationsStepan Pushkarev
 
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Databricks
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Databricks
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersJen Aman
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkDatabricks
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDatabricks
 
Spark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit
 
Operationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriOperationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriDatabricks
 
How Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaHow Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaSpark Summit
 
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Databricks
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Databricks
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelinesjeykottalam
 

What's hot (20)

Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache SparkData-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
An Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal MalohlavaAn Introduction to Sparkling Water by Michal Malohlava
An Introduction to Sparkling Water by Michal Malohlava
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflowImproving the Life of Data Scientists: Automating ML Lifecycle through MLflow
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow
 
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML InfrastructureMLOps with a Feature Store: Filling the Gap in ML Infrastructure
MLOps with a Feature Store: Filling the Gap in ML Infrastructure
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at ScaleData Agility—A Journey to Advanced Analytics and Machine Learning at Scale
Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operations
 
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
Spark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza Karimi
 
Operationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer NoriOperationalizing Machine Learning at Scale with Sameer Nori
Operationalizing Machine Learning at Scale with Sameer Nori
 
How Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-ShmaHow Spark Enables the Internet of Things- Paula Ta-Shma
How Spark Enables the Internet of Things- Paula Ta-Shma
 
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 

Similar to Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models

How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018Adam Gibson
 
DevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-usDevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-useltonrodriguez11
 
Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...DataWorks Summit
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabszekeLabs Technologies
 
BBBT Watson Data Platform Presentation
BBBT Watson Data Platform PresentationBBBT Watson Data Platform Presentation
BBBT Watson Data Platform PresentationRitika Gunnar
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningProvectus
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerProvectus
 
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021Sandesh Rao
 
A survey on Machine Learning In Production (July 2018)
A survey on Machine Learning In Production (July 2018)A survey on Machine Learning In Production (July 2018)
A survey on Machine Learning In Production (July 2018)Arnab Biswas
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Sri Ambati
 
Microsoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenMicrosoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenGoDataDriven
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableJustin Basilico
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPaige_Roberts
 

Similar to Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models (20)

How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018
 
DevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-usDevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-us
 
Machine learning
Machine learningMachine learning
Machine learning
 
Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...Software engineering practices for the data science and machine learning life...
Software engineering practices for the data science and machine learning life...
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
BBBT Watson Data Platform Presentation
BBBT Watson Data Platform PresentationBBBT Watson Data Platform Presentation
BBBT Watson Data Platform Presentation
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
 
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021
 
A survey on Machine Learning In Production (July 2018)
A survey on Machine Learning In Production (July 2018)A survey on Machine Learning In Production (July 2018)
A survey on Machine Learning In Production (July 2018)
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...Building a Real-Time Security Application Using Log Data and Machine Learning...
Building a Real-Time Security Application Using Log Data and Machine Learning...
 
Microsoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenMicrosoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDriven
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms Reliable
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
 

More from Anyscale

ACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfAnyscale
 
What's Next for MLflow in 2019
What's Next for MLflow in 2019What's Next for MLflow in 2019
What's Next for MLflow in 2019Anyscale
 
Putting AI to Work on Apache Spark
Putting AI to Work on Apache SparkPutting AI to Work on Apache Spark
Putting AI to Work on Apache SparkAnyscale
 
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Anyscale
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Monitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksMonitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksAnyscale
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksAnyscale
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Anyscale
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark Anyscale
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Anyscale
 

More from Anyscale (11)

ACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdf
 
What's Next for MLflow in 2019
What's Next for MLflow in 2019What's Next for MLflow in 2019
What's Next for MLflow in 2019
 
Putting AI to Work on Apache Spark
Putting AI to Work on Apache SparkPutting AI to Work on Apache Spark
Putting AI to Work on Apache Spark
 
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Monitoring Error Logs at Databricks
Monitoring Error Logs at DatabricksMonitoring Error Logs at Databricks
Monitoring Error Logs at Databricks
 
Monitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at DatabricksMonitoring Large-Scale Apache Spark Clusters at Databricks
Monitoring Large-Scale Apache Spark Clusters at Databricks
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0
 

Recently uploaded

VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT  - Elevating Productivity in Today's Agile EnvironmentHarnessing ChatGPT  - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT - Elevating Productivity in Today's Agile EnvironmentVictorSzoltysek
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisamasabamasaba
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
tonesoftg
tonesoftgtonesoftg
tonesoftglanshi9
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park masabamasaba
 

Recently uploaded (20)

VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT  - Elevating Productivity in Today's Agile EnvironmentHarnessing ChatGPT  - Elevating Productivity in Today's Agile Environment
Harnessing ChatGPT - Elevating Productivity in Today's Agile Environment
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 

Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models

  • 1. Apache Spark™ MLlib 2.x: How to Productionize your Machine Learning Models Richard Garris (Principal Solutions Architect) Jules S. Damji (Spark Community Evangelist) March 9 , 2017
  • 2. About Me • Jules Damji • Twitter: @2twitme; LinkedIn • Spark Community Evangelist @ Databricks • Worked in various software engineering roles building distributed systems and applications at Sun, Netscape, LoudCloud/Opsware, VeriSign, Scalix, Ebrary & Hortonworks
  • 4. Empower anyone to innovate faster with big data. Founded by the creators of Apache Spark. Contributes 75%of the open source code, 10xmore than any other company. VISION WHO WE ARE A fully-managed data processing platform for the enterprise, powered by Apache Spark. PRODUCT
  • 5. CLUSTER TUNING & MANAGEMENT INTERACTIVE WORKSPACE PRODUCTION PIPELINE AUTOMATION OPTIMIZED DATA ACCESS DATABRICKS ENTERPRISE SECURITY YOUR TEAMS Data Science Data Engineering Many others… BI Analysts YOUR DATA Cloud Storage Data Warehouses Data Lake VIRTUAL ANALYTICS PLATFORM
  • 6. Apache Spark™ MLlib 2.x: How to Productionize your Machine Learning Models Richard Garris (Principal Solutions Architect) March 9 , 2017
  • 7. About Me • Richard L Garris • rlgarris@databricks.com • Twitter @rlgarris • Principal Data Solutions Architect @ Databricks • 12+ years designing Enterprise Data Solutions for everyone from startups to Global 2000 • Prior Work Experience PwC, Google and Skytree – the Machine Learning Company • Ohio State Buckeye and Masters from CMU
  • 8. Outline • Spark MLlib 2.X • Model Serialization • Model Scoring System Requirements • Model Scoring Architectures • Databricks Model Scoring
  • 9. About Apache Spark™ MLlib • Started with Spark 0.8 in the AMPLab in 2014 • Migration to Spark DataFrames started with Spark 1.3 with feature parity within 2.X • Contributions by 75+ orgs, ~250 individuals • Distributed algorithms that scale linearly with the data
  • 10. MLlib’s Goals • General purpose machine learning library optimized for big data • Linearly scalable = 2x more machines , runtime theoretically cut in half • Fault tolerant = resilient to the failure of nodes • Covers the most common algorithms with distributed implementations • Built around the concept of a Data Science Pipeline (scikit-learn) • Written entirely using Apache Spark™ • Integrates well with the Agile Modeling Process
  • 11. A Model is a Mathematical Function • A model is a function: 𝑓 𝑥 • Linear regression 𝑦 = 𝑏0 + 𝑏1 𝑥1 + 𝑏2 𝑥2
  • 14. Productionizing Models Today Data Science Data Engineering Develop Prototype Model using Python/R Re-implement model for production (Java)
  • 15. Problems with ProductionizingModels Develop Prototype Model using Python/R Re-implement model for production (Java) - Extra work - Different code paths - Data science does not translate to production - Slow to update models Data Science Data Engineering
  • 16. MLlib 2.X Model Serialization Data Science Data Engineering Develop Prototype Model using Python/R Persist model or Pipeline: model.save(“s3n://...”) Load Pipeline (Scala/Java) Model.load(“s3n://…”) Deploy in production
  • 17. Scala val lrModel = lrPipeline.fit(dataset) // Save the Model lrModel.write.save("/models/lr") • MLlib 2.X Model Serialization Snippet Python lrModel = lrPipeline.fit(dataset) # Save the Model lrModel.write.save("/models/lr") •
  • 18. Model Serialization Output Code // List Contents of the Model Dir dbutils.fs.ls("/models/lr") • Output Remember this is a pipeline model and these are the stages!
  • 19. Transformer Stage (StringIndexer) Code // Cat the contents of the Metadata dir dbutils.fs.head(”/models/lr/stages/00_strI dx_bb9728f85745/metadata/part-00000") // Display the Parquet File in the Data dir display(spark.read.parquet(”/models/lr/sta ges/00_strIdx_bb9728f85745/data/")) Output { "class":"org.apache.spark.ml.feature.StringIndexerModel", "timestamp":1488120411719, "sparkVersion":"2.1.0", "uid":"strIdx_bb9728f85745", "paramMap":{ "outputCol":"workclassIdx", "inputCol":"workclass", "handleInvalid":"error" } } Metadata and params Data (Hashmap)
  • 20. Estimator Stage (LogisticRegression) Code // Cat the contents of the Metadata dir dbutils.fs.head(”/models/lr/stages/18_logr eg_325fa760f925/metadata/part-00000") // Display the Parquet File in the Data dir display(spark.read.parquet("/models/lr/sta ges/18_logreg_325fa760f925/data/")) Output Model params Intercept + Coefficients {"class":"org.apache.spark.ml.classification.LogisticRegressionModel", "timestamp":1488120446324, "sparkVersion":"2.1.0", "uid":"logreg_325fa760f925", "paramMap":{ "predictionCol":"prediction", "standardization":true, "probabilityCol":"probability", "maxIter":100, "elasticNetParam":0.0, "family":"auto", "regParam":0.0, "threshold":0.5, "fitIntercept":true, "labelCol":"label” }}
  • 21. Output Decision Tree Splits Estimator Stage (DecisionTree) Code // Display the Parquet File in the Data dir display(spark.read.parquet(”/models/dt/stages/18_dtc_3d614bcb3ff825/data/")) // Re-save as JSON spark.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").json((”/models/json/dt").
  • 23. What are the Requirements for a Robust Model Deployment System?
  • 24. Model Scoring Environment Examples • In Web Applications / Ecommerce Portals • Mainframe / Batch Processing Systems • Real-Time Processing Systems / Middleware • Via API / Microservice • Embedded in Devices (Mobile Phones, Medical Devices, Autos)
  • 25. Hidden Technical Debt in ML Systems “Hidden Technical Debt in Machine Learning Systems “, Google NIPS 2015 “Hidden Technical Debt in MachineLearning Systems “, GoogleNIPS 2015
  • 26. Agile Modeling Process Set Business Goals Understand Your Data Create Hypothesis Devise Experiment Prepare Data Train-Tune-Test Model Deploy Model Measure/Evaluate Results
  • 27. Agile Modeling Process Set Business Goals Understand Your Data Create Hypothesis Devise Experiment Prepare Data Train-Tune-Test Model Deploy Model Measure/Evaluate Results Focus of this talk
  • 28. Set Business Goals Understand Your Data Create Hypothesis Devise Experiment Prepare Data Train-Tune-Test Model Deploy Model Measure/Evaluate Results Deployment Should be Agile • Deployment needs to support A/B testing and experiments • Deployment should support measuring and evaluating model performance • Deployment should be fast and adaptive to business needs
  • 29. Model A/B Testing, Monitoring, Updates • A/B testing– comparing two versions to see what performs better • Monitoringis the process of observing the model’s performance, logging it’s behavior and alerting when the model degrades • Logging should log exactly the data feedinto the model at the time of scoring • Model update process • Benchmark (or Shadow Models) • Phase-In (20% traffic) • Avoid Big Bang
  • 30. Consider the Scoring Environment Customer SLAs •Response time •Throughput (predictionsper second) •Uptime / Reliability Tech Stack –C / C++ –Legacy (mainframe) –Java
  • 31. Batch Real-Time Scoring in Batch vs Real-Time • Synchronous • Could be Seconds: – Customer is waiting (human real-time) • Subsecond: – High FrequencyTrading – Fraud Detection on the Swipe • Asynchronous • Internal Use • Triggers can be event based on time based • Used for EmailCampaigns, Notifications
  • 32. Open Loop – human being involved Closed Loop– no human involved • Model Scoring– almost always closed loop, some open loop e.g. alert agents or customer service • Model Training– usually open loop with a data scientist in the loop to update the model Online Learning and Open / Closed Loop • Online is closed loop, entirely machine driven but modelingis risky • need to have proper model monitoringand safeguards to prevent abuse / sensitivity to noise • MLlibsupports online through streaming models (k-means, logistic regression support online) • Alternative – use a more complex model to better fit new data rather than using online learning Open / Closed Loop Online Learning
  • 33. Model Scoring – Bot Detection Not All Models Return Boolean – e.g. a Yes / No Example: Login Bot Detector Different behavior depending on probability that use is a bot 0.0-0.4 ☞ Allow login 0.4-0.6 ☞ Send ChallengeQuestion 0.6 to 0.75 ☞ Send SMS Code 0.75 to 0.9 ☞ Refer to Agent 0.9 - 1.0 ☞ Block
  • 34. Model Scoring – Recommendations Output is a ranking of the top n items API – send user ID + number of items Return sorted set of items to recommend Optional – pass context sensitive information to tailor results
  • 36. Architecture Option A Precompute Predictions using Spark and Serve from Database Train ALS Model Send Email Offers to Customers Save Offers to NoSQL Ranked Offers Display Ranked Offers in Web / Mobile Recurring Batch
  • 37. Architecture Option B Spark Stream and Score using an API with Cached Predictions Web Activity Logs Kill User’s Login SessionCompute Features Run Prediction Streaming Cache Predictions API Check
  • 38. Architecture Option C Train with Spark and Score Outside of Spark Train Model in Spark Save Model to S3 / HDFS New Data Copy Model to Production Predictions Load coefficients and intercept from file
  • 40. Databricks Model Scoring • Based on Architecture Option C • Goal: Deploy MLlib model outside of Apache Spark and Databricks. • Easy to Embedin Existing Environments • Low Latency and Complexity • Low Overhead
  • 41. • Train Modelin Databricks – Call Fit on Pipeline – Save Model as JSON • Deploy modelin external system – Add dependency on “dbml-local” package (without Spark) – Load model from JSON at startup – Make predictions in real time Databricks Model Scoring Code // Fit and Export the Model in Databricks val lrModel = lrPipeline.fit(dataset) ModelExporter.export(lrModel, " /models/db ") // In YourApplication (Scala) import com.databricks.ml.local.ModelImport val lrModel = ModelImport.import("s3a:/...") val jsonInput = ... val jsonOutput = lrModel.transform(jsonInput)
  • 42. Databricks Model Scoring Private Beta • Private Beta Available for Databricks Customers • Available on Databricks using Apache Spark 2.1 • Only logistic regression available now • Additional Estimators and Transformers in Progress