SlideShare uma empresa Scribd logo
1 de 35
Baixar para ler offline
Distributed Models Over Distributed Data
with MLflow, Pyspark, and Pandas
Thunder Shiviah
thunder@databricks.com
Senior Solutions Architect, Databricks
SAIS EUROPE - 2019
Abstract
● More data often improves DL models in high variance problem spaces (with semi or
unstructured data) such as NLP, image, video but does not significantly improve high
bias problem spaces where traditional ML is more appropriate.
● That said, in the limits of more data, there still not a reason to go distributed if you
can instead use transfer learning.
● Data scientists have pain points
○ running many models in parallel
○ automating the experimental set up
○ Getting others (especially analysts) within an organization to use their models
● Databricks solves these problems using pandas udfs, ml runtime and mlflow.
Why do single node data science?
Is more data always better?
The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, and Fernando Pereira,
Google 2009
Banko and Brill, 2001
Team A came up with a very sophisticated algorithm using the Netflix
data. Team B used a very simple algorithm, but they added in
additional data beyond the Netflix set: information about movie genres
from the Internet Movie Database (IMDB). Guess which team did
better?
Team B got much better results, close to the best results on the Netflix
leaderboard! I'm really happy for them, and they're going to tune their
algorithm and take a crack at the grand prize.
But the bigger point is, adding more, independent data usually
beats out designing ever-better algorithms to analyze an existing
data set. I'm often surprised that many people in the business, and
even in academia, don't realize this. - "More data usually beats better algorithms", Anand
Rajaraman
Okay, case closed. Right?
Not so fast!
● First Norvig, Banko & Brill were focused on NLP specifically, not general ML.
● Second, Norvig et al were not saying that ML models trained on more data are better
than ML models trained on less data. Norvig was comparing ML (statistical
learning) to hand coded language rules in NLP: “...a deep approach that relies on
hand-coded grammars and ontologies, represented as complex networks of relations;”
(Norvig, 2009)
● In his 2010 UBC Department of Computer Science's Distinguished Lecture Series
talk, Norvig mentions that regarding NLP, google is quickly hitting their limits
with simple models and is needing to focus on more complex models (deep
learning).
● Re: the Netflix data set: The winning team actually then commented and said
Our experience with the Netflix data is different.
IMDB data and the likes gives more information only about the movies, not about the users ... The test set is
dominated by heavily rated movies (but by sparsely rating users), thus we don't really need this extra
information about the movies.
Our experiments clearly show that once you have strong CF models, such extra data is redundant
and cannot improve accuracy on the Netflix dataset.
Netflix algorithm in production
Netflix recommendations: beyond the 5 stars, Xavier Amatriain (Netflix)
Oh, and what about the size of those data sets?
● 1 billion word corpus = ~2GB
● Netflix prize data = 700Mb compressed
○ 1.5 GB uncompressed (source)
So where does that leave us?
Bias vs Variance
Andrew Ng, AI is the New Electricity
Conclusion: more data makes sense for high variance
(semi-structured or unstructured) problem domains like
text and images. Sampling makes sense for high bias
domains such as structured problem domains.
Should we always use more data with deep learning?
No! Transfer learning on smaller data often beats
training nets from scratch on larger data-sets.
Open AI pointed out that while the amount of compute has been a key component of AI
progress, “Massive compute is certainly not a requirement to produce important results.”
(source)
In a benchmark run by our very own Matei Zaharia at Stanford, Fast.ai was able to win both
fastest and cheapest image classification:
Imagenet competition, our results were:
● Fastest on publicly available infrastructure, fastest on GPUs, and fastest on a single
machine (and faster than Intel’s entry that used a cluster of 128 machines!)
● Lowest actual cost (although DAWNBench’s official results didn’t use our actual cost,
as discussed below).Overall, our findings were:
● Algorithmic creativity is more important than bare-metal performance
(source)
Transfer learning models with a small number of training examples can achieve better
performance than models trained from scratch on 100x the data
Introducing state of the art text classification with universal language models,
Jeremy Howard and Sebastian Ruder
Take-away: Even in the case of deep learning, if an
established model exists, it’s better to use transfer
learning on small data then train from scratch on larger
data
So where does databricks fit into this story?
Training models (including hyperparameter search and
cross validation) is embarrassingly parallel
Shift from distributed data to distributed models
Introducing Pandas UDF for PySpark: How to run your native Python code with PySpark, fast.
Data Scientists spend lots of time setting up their
environment
Rule of frequency: automate the things you do the most
Source: xkcd
Maybe...
Source: xkcd
ML runtime
Data Scientists have a difficult time tracking models
Data Scientists don’t know how to get the rest of the
organization to use their models
Key insight: Most modern enterprise run on SQL not REST
APIs
Use MLflow + spark UDFs to democratize ML within the
org.
See my mlflow deployment example notebook.
Summary
● More data often improves DL models in high variance problem spaces (with semi or
unstructured data) such as NLP, image, video but does not significantly improve high
bias problem spaces where traditional ML is more appropriate.
● That said, in the limits of more data, there still not a reason to go distributed if you
can instead use transfer learning.
● Data scientists have pain points
○ running many models in parallel
○ automating the experimental set up
○ Getting others (especially analysts) within an organization to use their models
● Databricks solves these problems using pandas udfs, ml runtime and mlflow.
● See my paper for more information.
Thank you!
thunder@databricks.com

Mais conteúdo relacionado

Mais procurados

"Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow""Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow"Databricks
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflowDatabricks
 
Using MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsUsing MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsWeaveworks
 
MLflow with Databricks
MLflow with DatabricksMLflow with Databricks
MLflow with DatabricksLiangjun Jiang
 
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Databricks
 
Pythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlowPythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlowFernando Ortega Gallego
 
Machine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache SparkMachine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache SparkDatabricks
 
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Databricks
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Databricks
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkDatabricks
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchDatabricks
 
MLflow: A Platform for Production Machine Learning
MLflow: A Platform for Production Machine LearningMLflow: A Platform for Production Machine Learning
MLflow: A Platform for Production Machine LearningMatei Zaharia
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflowDatabricks
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOpsDatabricks
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleDatabricks
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySparkRussell Jurney
 
Managing the Machine Learning Lifecycle with MLOps
Managing the Machine Learning Lifecycle with MLOpsManaging the Machine Learning Lifecycle with MLOps
Managing the Machine Learning Lifecycle with MLOpsFatih Baltacı
 

Mais procurados (20)

MLOps in action
MLOps in actionMLOps in action
MLOps in action
 
"Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow""Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow"
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
Using MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOpsUsing MLOps to Bring ML to Production/The Promise of MLOps
Using MLOps to Bring ML to Production/The Promise of MLOps
 
MLflow with Databricks
MLflow with DatabricksMLflow with Databricks
MLflow with Databricks
 
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
Continuous Delivery of Deep Transformer-Based NLP Models Using MLflow and AWS...
 
Pythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlowPythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlow
 
Machine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache SparkMachine Learning at Scale with MLflow and Apache Spark
Machine Learning at Scale with MLflow and Apache Spark
 
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
 
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
Advanced MLflow: Multi-Step Workflows, Hyperparameter Tuning and Integrating ...
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache Spark
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 
MLflow: A Platform for Production Machine Learning
MLflow: A Platform for Production Machine LearningMLflow: A Platform for Production Machine Learning
MLflow: A Platform for Production Machine Learning
 
What is MLOps
What is MLOpsWhat is MLOps
What is MLOps
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflow
 
Databricks Overview for MLOps
Databricks Overview for MLOpsDatabricks Overview for MLOps
Databricks Overview for MLOps
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at Scale
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Managing the Machine Learning Lifecycle with MLOps
Managing the Machine Learning Lifecycle with MLOpsManaging the Machine Learning Lifecycle with MLOps
Managing the Machine Learning Lifecycle with MLOps
 
MLOps with Kubeflow
MLOps with Kubeflow MLOps with Kubeflow
MLOps with Kubeflow
 

Semelhante a Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas

Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Grigory Sapunov
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...Edge AI and Vision Alliance
 
Practical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Practical Artificial Intelligence: Deep Learning Beyond Cats and CarsPractical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Practical Artificial Intelligence: Deep Learning Beyond Cats and CarsAlexey Rybakov
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016MLconf
 
Deep learning with tensorflow
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflowCharmi Chokshi
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde..."Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...Edge AI and Vision Alliance
 
Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017Clarisse Hedglin
 
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupYves Peirsman
 
Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019 Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019 Stefan Geißler
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Deep learning on mobile
Deep learning on mobileDeep learning on mobile
Deep learning on mobileAnirudh Koul
 
building intelligent systems with large scale deep learning
building intelligent systems with large scale deep learningbuilding intelligent systems with large scale deep learning
building intelligent systems with large scale deep learningmustafa sarac
 
Deep neural networks and tabular data
Deep neural networks and tabular dataDeep neural networks and tabular data
Deep neural networks and tabular dataJimmyLiang20
 
Neuromation.io AI Ukraine Presentation
Neuromation.io AI Ukraine PresentationNeuromation.io AI Ukraine Presentation
Neuromation.io AI Ukraine PresentationBohdan Klimenko
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera, Inc.
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or realityAwantik Das
 
Demystifying Ml, DL and AI
Demystifying Ml, DL and AIDemystifying Ml, DL and AI
Demystifying Ml, DL and AIGreg Werner
 

Semelhante a Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas (20)

Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016
 
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn..."Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
"Large-Scale Deep Learning for Building Intelligent Computer Systems," a Keyn...
 
Practical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Practical Artificial Intelligence: Deep Learning Beyond Cats and CarsPractical Artificial Intelligence: Deep Learning Beyond Cats and Cars
Practical Artificial Intelligence: Deep Learning Beyond Cats and Cars
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Deep learning with tensorflow
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflow
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde..."Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
"Deep Learning Beyond Cats and Cars: Developing a Real-life DNN-based Embedde...
 
Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017Austin,TX Meetup presentation tensorflow final oct 26 2017
Austin,TX Meetup presentation tensorflow final oct 26 2017
 
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP MeetupDealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
Dealing with Data Scarcity in Natural Language Processing - Belgium NLP Meetup
 
Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019 Stefan Geissler kairntech - SDC Nice Apr 2019
Stefan Geissler kairntech - SDC Nice Apr 2019
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Deep learning on mobile
Deep learning on mobileDeep learning on mobile
Deep learning on mobile
 
building intelligent systems with large scale deep learning
building intelligent systems with large scale deep learningbuilding intelligent systems with large scale deep learning
building intelligent systems with large scale deep learning
 
Deep neural networks and tabular data
Deep neural networks and tabular dataDeep neural networks and tabular data
Deep neural networks and tabular data
 
Deep learning for NLP
Deep learning for NLPDeep learning for NLP
Deep learning for NLP
 
Neuromation.io AI Ukraine Presentation
Neuromation.io AI Ukraine PresentationNeuromation.io AI Ukraine Presentation
Neuromation.io AI Ukraine Presentation
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
Demystifying Ml, DL and AI
Demystifying Ml, DL and AIDemystifying Ml, DL and AI
Demystifying Ml, DL and AI
 

Mais de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Último

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 

Último (20)

科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 

Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas

  • 1. Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas Thunder Shiviah thunder@databricks.com Senior Solutions Architect, Databricks SAIS EUROPE - 2019
  • 2. Abstract ● More data often improves DL models in high variance problem spaces (with semi or unstructured data) such as NLP, image, video but does not significantly improve high bias problem spaces where traditional ML is more appropriate. ● That said, in the limits of more data, there still not a reason to go distributed if you can instead use transfer learning. ● Data scientists have pain points ○ running many models in parallel ○ automating the experimental set up ○ Getting others (especially analysts) within an organization to use their models ● Databricks solves these problems using pandas udfs, ml runtime and mlflow.
  • 3. Why do single node data science?
  • 4. Is more data always better? The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, and Fernando Pereira, Google 2009
  • 6. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better? Team B got much better results, close to the best results on the Netflix leaderboard! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize. But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this. - "More data usually beats better algorithms", Anand Rajaraman
  • 9. ● First Norvig, Banko & Brill were focused on NLP specifically, not general ML. ● Second, Norvig et al were not saying that ML models trained on more data are better than ML models trained on less data. Norvig was comparing ML (statistical learning) to hand coded language rules in NLP: “...a deep approach that relies on hand-coded grammars and ontologies, represented as complex networks of relations;” (Norvig, 2009) ● In his 2010 UBC Department of Computer Science's Distinguished Lecture Series talk, Norvig mentions that regarding NLP, google is quickly hitting their limits with simple models and is needing to focus on more complex models (deep learning). ● Re: the Netflix data set: The winning team actually then commented and said Our experience with the Netflix data is different. IMDB data and the likes gives more information only about the movies, not about the users ... The test set is dominated by heavily rated movies (but by sparsely rating users), thus we don't really need this extra information about the movies. Our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset.
  • 10. Netflix algorithm in production Netflix recommendations: beyond the 5 stars, Xavier Amatriain (Netflix)
  • 11. Oh, and what about the size of those data sets? ● 1 billion word corpus = ~2GB ● Netflix prize data = 700Mb compressed ○ 1.5 GB uncompressed (source)
  • 12. So where does that leave us?
  • 14.
  • 15. Andrew Ng, AI is the New Electricity
  • 16. Conclusion: more data makes sense for high variance (semi-structured or unstructured) problem domains like text and images. Sampling makes sense for high bias domains such as structured problem domains.
  • 17. Should we always use more data with deep learning?
  • 18. No! Transfer learning on smaller data often beats training nets from scratch on larger data-sets. Open AI pointed out that while the amount of compute has been a key component of AI progress, “Massive compute is certainly not a requirement to produce important results.” (source)
  • 19. In a benchmark run by our very own Matei Zaharia at Stanford, Fast.ai was able to win both fastest and cheapest image classification: Imagenet competition, our results were: ● Fastest on publicly available infrastructure, fastest on GPUs, and fastest on a single machine (and faster than Intel’s entry that used a cluster of 128 machines!) ● Lowest actual cost (although DAWNBench’s official results didn’t use our actual cost, as discussed below).Overall, our findings were: ● Algorithmic creativity is more important than bare-metal performance (source)
  • 20. Transfer learning models with a small number of training examples can achieve better performance than models trained from scratch on 100x the data Introducing state of the art text classification with universal language models, Jeremy Howard and Sebastian Ruder
  • 21. Take-away: Even in the case of deep learning, if an established model exists, it’s better to use transfer learning on small data then train from scratch on larger data
  • 22. So where does databricks fit into this story?
  • 23. Training models (including hyperparameter search and cross validation) is embarrassingly parallel
  • 24. Shift from distributed data to distributed models Introducing Pandas UDF for PySpark: How to run your native Python code with PySpark, fast.
  • 25. Data Scientists spend lots of time setting up their environment
  • 26. Rule of frequency: automate the things you do the most Source: xkcd
  • 29. Data Scientists have a difficult time tracking models
  • 30.
  • 31. Data Scientists don’t know how to get the rest of the organization to use their models
  • 32. Key insight: Most modern enterprise run on SQL not REST APIs
  • 33. Use MLflow + spark UDFs to democratize ML within the org. See my mlflow deployment example notebook.
  • 34. Summary ● More data often improves DL models in high variance problem spaces (with semi or unstructured data) such as NLP, image, video but does not significantly improve high bias problem spaces where traditional ML is more appropriate. ● That said, in the limits of more data, there still not a reason to go distributed if you can instead use transfer learning. ● Data scientists have pain points ○ running many models in parallel ○ automating the experimental set up ○ Getting others (especially analysts) within an organization to use their models ● Databricks solves these problems using pandas udfs, ml runtime and mlflow. ● See my paper for more information.