Production-Ready BIG ML Workflows
From Zero to Hero
Daniel Marcous
Google, Waze, Data Wizard
dmarcous@gmail/google.com
What’s a Data Wizard you ask?
Gain Actionable Insights!
What’s here?
Example Use Case
Optimizing Waze ETA Prediction
Methodology
Deploying to production - step by step
Pitfalls
What to look out for in both methodology and code
Use Cases
Showing off what we actually do in Waze Analytics
Based on tough lessons learned and on recommendations and input from Google experts.
Why Big ML?
Bigger Is Better!
● Trying everything
○ Grid search all the parameters you ever wanted.
○ Cross-validate in parallel with no extra effort.
● Algorithmic benefits
○ Some models (Artificial Neural Networks) can’t perform well without training on a lot of data.
○ Keep training until you hit 0
● Data size
○ Tons of training data
■ We store EVERYTHING
■ No need to sample from the wrong populations!
○ Millions of features
■ text processing with TF-IDF
Challenges
Bigger is harder
● Skill gap - Big data engineer (Scala/Java) VS Researcher/PHD (Python/R)
● Curse of dimensionality
○ Some algorithms require exponential time/memory as dimensions grow
○ Harder and more important to tell what’s gold and what’s noise
○ Unbalanced data goes a long way with more records
● Big model != Small model
○ Different parameter settings
Solution = Workflow
A car goes from Chinatown to Times Square. How long will it take to arrive?
● Ride features
● Historical data
● Real time data
● User features
● Map features
Measure first, optimize
second.
Before you start
● Create synthetic input data
○ Raw input
○ Feature row
○ Output data
○ Metric value
● Set up your metrics
○ Derived from business needs
○ Confusion matrix
○ Precision / recall
■ Per class metrics
○ AUC
○ Coverage
Remember: Desired short-term behaviour does not imply long-term behaviour.
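The confusion matrix and per-class precision / recall listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used at Waze: the `heavy_traffic`-style YES/NO labels and both function names are assumptions made up for the example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def per_class_metrics(actual, predicted):
    """Return {label: (precision, recall)} derived from the confusion matrix."""
    matrix = confusion_matrix(actual, predicted)
    labels = sorted(set(actual) | set(predicted))
    metrics = {}
    for label in labels:
        true_pos = matrix[(label, label)]
        predicted_as_label = sum(matrix[(a, label)] for a in labels)
        actually_label = sum(matrix[(label, p)] for p in labels)
        precision = true_pos / predicted_as_label if predicted_as_label else 0.0
        recall = true_pos / actually_label if actually_label else 0.0
        metrics[label] = (precision, recall)
    return metrics

# Hypothetical labels: was heavy traffic observed on the ride?
actual = ["YES", "YES", "NO", "NO", "NO", "YES"]
predicted = ["YES", "NO", "NO", "NO", "YES", "YES"]
metrics = per_class_metrics(actual, predicted)
```

Keeping the metric code this small makes it easy to run the same function on synthetic data first and on production output later.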
Measure
Preprocess
(parse, clean, join, etc.)
Naive implementation -
preprocess & measure
● Synthetic Data
○ Input - {user_id: 1 , from: 32.113, 34.818, to: 32.113, 34.802, time_start: 17:15}
○ Feature record - {from_neighbourhood: Neot-Afeka, to_neighbourhood: Ramat-Aviv, hour_start: 17, minute_start: 15, air_distance: 1.5km, road_distance: 2.9km, heavy_traffic: NO}
○ Predicted value - {ETA: 17:25.43}
○ Actual value - {ETA: 17:26.34}
● Metric computation - absolute error
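A rough sketch of the naive preprocess-and-measure step on the synthetic record above. Assumptions, none of which come from the slides: `air_distance` is taken to be the great-circle (haversine) distance between the two coordinates, the ETA strings are read as `HH:MM.SS`, and all function names are hypothetical. Neighbourhood lookup, road distance and traffic features are left out because they need map data.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def air_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def to_feature_row(raw):
    """Turn a raw input record into part of a feature record."""
    start = datetime.strptime(raw["time_start"], "%H:%M")
    return {
        "hour_start": start.hour,
        "minute_start": start.minute,
        "air_distance_km": round(air_distance_km(*raw["from"], *raw["to"]), 1),
    }

def absolute_error_seconds(predicted_eta, actual_eta):
    """Absolute error between two ETAs given as 'HH:MM.SS' strings (assumed format)."""
    fmt = "%H:%M.%S"
    delta = datetime.strptime(predicted_eta, fmt) - datetime.strptime(actual_eta, fmt)
    return abs(delta.total_seconds())

# The synthetic input record from the slide.
raw_input = {"user_id": 1, "from": (32.113, 34.818), "to": (32.113, 34.802),
             "time_start": "17:15"}
features = to_feature_row(raw_input)
error = absolute_error_seconds("17:25.43", "17:26.34")
```

Under these assumptions the computed air distance rounds to the 1.5 km shown in the feature record, and the absolute error for the example prediction is 51 seconds.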
Preprocess
Measure
Create a sample dataset
On real input
Monitor.
Visualise - easiest way to measure quickly
● Set up your dashboard
○ Amounts of input data
■ Before / after joining
○ Amounts of output data
○ Metrics (See “Measure first, optimize second”)
● Different model comparison - what’s best, when and where
● Timeseries Analysis
Dashboard
monitoring
Dashboard should support - picking
different models, comparing metrics.
Pick models
to compare
Statistical tests on distributions
t.test / AUC
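As an illustration of the kind of statistical test a dashboard could run when comparing two models, here is a small sketch of Welch's t statistic, the test R's `t.test` performs by default. The error samples are hypothetical, and the sketch stops at the statistic: turning it into a p-value needs the t distribution, which is left to a statistics library.

```python
import math
from statistics import mean, variance

def welch_t_statistic(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Hypothetical per-ride absolute errors (seconds) for two models.
errors_model_a = [62, 70, 58, 66, 69]
errors_model_b = [55, 61, 57, 60, 62]
t_stat = welch_t_statistic(errors_model_a, errors_model_b)
```

A positive statistic here means model A's mean error is higher than model B's; how large it must be to count as significant depends on the degrees of freedom.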
Dashboard
monitoring
Dashboard should support -
Timeseries anomaly detection, and
impact analysis (deploying new
model)
Start small and grow.
Getting a feel
Advanced variable selection with
regularisation techniques in R.
Intercepts - by significance
No intercept = not entered into the model
Getting a feel
Trying modeling techniques in R.
Root mean square error
Lower = better (~ kinda)
Fit a gradient boosted
trees model
Getting a feel
Modeling bigger data with R, using
parallelism.
Fit and combine 6 random forest
models (10k trees each) in parallel
Start with a flow.
Basic moving parts (Naive Flow)
[Diagram components: Data source 1 ... Data source N, Preprocess, Training, Feature matrix, Scoring, Models 1..N, Predictions 1..N, Dashboard, Serving DB, Feedback loop, Conf., User/Model assignments, Application]
Good ML code trumps
performance.
Why so many parts you ask?
● Scaling
● Fault tolerance
○ Failed preprocessing / training doesn’t affect the serving model
○ Rerunning only failed parts
● Different logical parts - Different processes (see “Clean Code” by Uncle Bob)
● Easier to tweak and deploy changes
Test your infrastructure.
Set up a baseline.
Start with a neutral launch
● Take a snapshot of your PRODUCTION metric reads:
○ The ones you chose earlier in the process as important to you
■ Confusion matrix
■ Weighted average % classified correctly
■ % subject coverage
● Latency
○ Building the feature matrix on the last day's data takes X minutes
You are here:
Remember: You are running with a naive model. Anything better than the old model / random is OK.
Go to work.
Coffee recommended at this point.
Optimize
What? How?
● Grid search over
parameters
● Cross-validate everything
● Evaluate metrics
○ Using a Spark predefined
Evaluator
○ Using user defined metrics
● Tweak preprocessing (mainly features)
○ Feature engineering
○ Feature transformers
■ Discretize / Normalise
○ Feature selectors
○ In Apache Spark Latest
● Tweak training
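To make the grid search and cross-validation loop concrete without depending on Spark, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: a one-feature ridge model stands in for the real one, the penalty grid is arbitrary, the distance and duration numbers are invented, and mean absolute error plays the role of a user-defined metric.

```python
from statistics import mean

def fit_ridge_slope(xs, ys, penalty):
    """Fit y ~ w * x with an L2 penalty on w (one-feature ridge, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

def mean_absolute_error(xs, ys, slope):
    """User-defined metric: mean absolute error of the fitted line."""
    return mean(abs(y - slope * x) for x, y in zip(xs, ys))

def cross_validate(xs, ys, penalty, folds=3):
    """Average held-out MAE over `folds` interleaved folds."""
    scores = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        slope = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], penalty)
        scores.append(mean_absolute_error([xs[i] for i in test],
                                          [ys[i] for i in test], slope))
    return mean(scores)

def grid_search(xs, ys, grid):
    """Cross-validate every candidate and return (best candidate, all scores)."""
    scores = {penalty: cross_validate(xs, ys, penalty) for penalty in grid}
    return min(scores, key=scores.get), scores

# Hypothetical data: road distance (km) vs. ride duration (seconds).
road_km = [1.0, 2.0, 2.9, 4.1, 5.0, 6.2, 7.5, 8.0, 9.3]
duration_s = [130, 250, 380, 500, 640, 760, 930, 990, 1150]
best_penalty, cv_scores = grid_search(road_km, duration_s, grid=[0.0, 1.0, 10.0, 100.0])
```

Each grid point and each fold is independent of the others, which is why this loop parallelises so easily on a cluster.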
Spark ML
Building a training pipeline with
spark.ml.
Create dummy variables
Required response label format
The ML model itself
Labels back to readable format
Assembled training pipeline
Spark ML
Cross-validate, grid search params
and evaluate metrics.
Grid search with reference to
ML model stage (RF)
Metrics to evaluate
Yes, you can definitely extend
and add your own metrics.
A/B
Test your changes
● Best testing - production A/B test
○ Use current production model and new model in parallel
○ Local ETA Model (averaging road ETA) vs. Global ETA Model (error on full ride) vs. Hybrid
● Metrics improvements (Remember your dashboard?)
○ Local ETA Model ~65s error vs. Global ETA Model ~60s error vs. Hybrid ~58.6s error
Compare to baseline
A/B Infrastructures
Setting up a very basic A/B testing infrastructure, built upon the modeling wrapper presented earlier.
Conf holds a mapping of:
model -> user_id/subject list
Score in parallel (inside a map)
Distributed=awesome.
Fancy Scala union for all score files
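The shape of such an A/B setup can be sketched in plain Python: a conf that maps each model to its assigned users, scoring per model in parallel, and a union of the per-model outputs. This is only an illustration under stated assumptions. The model names echo the Local / Global / Hybrid comparison, but the assignments, the lambda "models" and their coefficients are invented, and a thread pool stands in for the distributed map used in the talk.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical conf: model name -> list of user_ids assigned to it.
ASSIGNMENTS = {
    "local_eta": [1, 2, 3],
    "global_eta": [4, 5],
    "hybrid": [6],
}

# Hypothetical stand-ins for trained models: feature row -> ETA in seconds.
MODELS = {
    "local_eta": lambda row: 200.0 * row["road_distance_km"],
    "global_eta": lambda row: 190.0 * row["road_distance_km"] + 30.0,
    "hybrid": lambda row: 195.0 * row["road_distance_km"] + 15.0,
}

def score_model(model_name, feature_rows):
    """Score only the users assigned to `model_name`."""
    assigned = set(ASSIGNMENTS[model_name])
    predict = MODELS[model_name]
    return [
        {"user_id": row["user_id"], "model": model_name, "eta_seconds": predict(row)}
        for row in feature_rows
        if row["user_id"] in assigned
    ]

def score_all(feature_rows):
    """Score every model in parallel, then union the per-model outputs."""
    with ThreadPoolExecutor() as pool:
        per_model = list(pool.map(lambda name: score_model(name, feature_rows),
                                  ASSIGNMENTS))
    return [prediction for predictions in per_model for prediction in predictions]

rows = [{"user_id": uid, "road_distance_km": 2.9} for uid in range(1, 7)]
predictions = score_all(rows)
```

Because every prediction carries the model that produced it, the dashboard can compare metrics per model, and moving a user between models is a conf change rather than a code change.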
Ad-Hoc statistics
● If you wrote your code right, you can easily reuse it in a notebook !
● Answer ad-hoc questions
○ How many predictions did you output last month?
○ How many new users had a prediction with probability > 0.7?
○ How accurate were we on last month's predictions? (join with real data)
● No need to rebuild anything!
Enter Apache Zeppelin
Playing with it
Read a Parquet file, show statistics, register as table and run SparkSQL on it.
Parquet - already has a schema inside
For usage in SparkSQL
Putting it all together
Work Process
Step by step for deploying your big ML
workflows to production, ready for
operations and optimisations.
1. Measure first, optimize second.
a. Define metrics.
b. Preprocess data (using examples)
c. Monitor. (dashboard setup)
2. Start small and grow.
3. Start with a flow.
a. Good ML code trumps performance.
b. Test your infrastructure.
4. Set up a baseline.
5. Go to work.
a. Optimize.
b. A/B.
i. Test new flow in parallel to existing flow.
ii. Update user assignments.
6. Watch. Iterate. (see 5.)
Possible Pitfalls
● Code produced with
○ Apache Spark 1.6 / Scala 2.11.4
● RDD vs. DataFrame
○ Enter “Dataset API” (v2.0+)
● mllib vs. spark.ml
○ Always use spark.ml if the functionality exists
● Unbalanced partitions
○ Stuck on reduce
○ Stragglers
● Parameter tuning
○ spark.sql.shuffle.partitions
○ Executors
○ Driver VS executor memory
Use Cases
@Waze
Irregular Traffic Events
Major events, causing out of the ordinary traffic
● Exploration tool over time & space
● Seasonal traffic anomaly detection
Dangerous Places
Find most dangerous places, using custom developed clustering algorithms
● Alert authorities / users
● Compare & share with 3rd parties (NYPD)
Parking Places Detection
Parking entrance
Parking lot
Street parking
Speed Limits Inference
[Diagram labels: Waze Segment Data, Machine Learning, Speed Limit Prediction, Waze Segment Data, Community Verification, Show in App]
Text Mining - Store Sentiments
Text Mining - Sentiment by Time & Place
Code & Slides
https://github.com/dmarcous/BigMLFlow/
Daniel Marcous
dmarcous@google.com
dmarcous@gmail.com

A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 

Último (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Production-Ready BIG ML Workflows: From Zero to Hero - Daniel Marcous @ Waze

  • 1. Production-Ready BIG ML Workflows From Zero to Hero Daniel Marcous Google, Waze, Data Wizard dmarcous@gmail/google.com
  • 2. What’s a Data Wizard you ask? Gain Actionable Insights!
  • 4. Example Use Case Optimizing Waze ETA Prediction What’s here? Methodology Deploying to production - step by step Pitfalls What to look out for in both methodology and code Use Cases Showing off what we actually do in Waze Analytics Based on tough lessons learned & Google experts recommendations and inputs.
  • 6. Bigger Is Better! ● Trying everything ○ Grid search all the parameters you ever wanted. ○ Cross validate in parallel with no extra effort. ● Algorithmic benefits ○ Some models (Artificial Neural Networks) don’t perform well without training on a lot of data. ○ Keep training until you hit 0 ● Data size ○ Tons of training data ■ We store EVERYTHING ■ No need for sampling on wrong populations! ○ Millions of features ■ Text processing with TF-IDF
  • 8. Bigger is harder ● Skill gap - Big data engineer (Scala/Java) VS Researcher/PhD (Python/R) ● Curse of dimensionality ○ Some algorithms require exponential time/memory as dimensions grow ○ Harder and more important to tell what’s gold and what’s noise ○ Unbalanced data goes a long way with more records ● Big model != Small model ○ Different parameter settings
  • 10. A car goes from Chinatown to Times Square. How long will it take to arrive? ● Ride features ● Historical data ● Real time data ● User features ● Map features
  • 12. Before you start ● Create synthetic input data ○ Raw input ○ Feature row ○ Output data ○ Metric value ● Set up your metrics ○ Derived from business needs ○ Confusion matrix ○ Precision / recall ■ Per class metrics ○ AUC ○ Coverage Remember: Desired short-term behaviour does not imply long-term behaviour. Measure Preprocess (parse, clean, join, etc.)
  • 13. Naive implementation - preprocess & measure ● Synthetic Data ○ Input - {user_id: 1, from: 32.113, 34.818, to: 32.113, 34.802, time_start: 17:15} ○ Feature record - {from_neighbourhood: Neot-Afeka, to_neighbourhood: Ramat-Aviv, hour_start: 17, minute_start: 15, air_distance: 1.5km, road_distance: 2.9km, heavy_traffic: NO} ○ Predicted value - {ETA: 17:25.43} ○ Actual value - {ETA: 17:26.34} ● Metric computation - absolute error Preprocess Measure
  • 14. Create a sample dataset On real input
  • 16. Visualise - easiest way to measure quickly ● Set up your dashboard ○ Amounts of input data ■ Before /after joining ○ Amounts of output data ○ Metrics (See “Measure first, optimize second”) ● Different model comparison - what’s best, when and where ● Timeseries Analysis
  • 17. Dashboard monitoring Dashboard should support - picking different models, comparing metrics. Pick models to compare Statistical tests on distributions t.test / AUC
  • 18. Dashboard monitoring Dashboard should support - Timeseries anomaly detection, and impact analysis (deploying new model)
  • 20. Getting a feel Advanced variable selection with regularisation techniques in R. Intercepts - by significance No intercept = not entered to model
  • 21. Getting a feel Trying modeling techniques in R. Root mean square error Lower = better (~ kinda) Fit a gradient boosted trees model
  • 22. Getting a feel Modeling bigger data with R, using parallelism. Fit and combine 6 random forest models (10k trees each) in parallel
  • 23. Start with a flow.
  • 24. Basic moving parts (Naive Flow) Data source 1 Data source N Preprocess Training Feature matrix Scoring Models 1..N Predictions 1..N Dashboard Serving DB Feedback loop Conf. User/Model assignments Application
  • 25. Good ML code trumps performance.
  • 26. Why so many parts you ask? ● Scaling ● Fault tolerance ○ Failed preprocessing / training doesn’t affect serving model ○ Rerunning only failed parts ● Different logical parts - Different processes (see “Clean Code” by Uncle Bob) ● Easier to tweak and deploy changes
  • 28. Set up a baseline. Start with a neutral launch
  • 29. ● Take a snapshot of your PRODUCTION metric reads: ○ The ones you chose earlier in the process as important to you ■ Confusion matrix ■ Weighted average % classified correctly ■ % subject coverage ● Latency ○ Building feature matrix on last day data takes X minutes You are here: Remember : You are running with a naive model. Everything better than the old model / random is OK.
  • 30. Go to work. Coffee recommended at this point.
  • 31. Optimize What? How? ● Grid search over parameters ● Cross validate Everything ● Evaluate metrics ○ Using a Spark predefined Evaluator ○ Using user defined metrics ● Tweak preprocessing (mainly features) ○ Feature engineering ○ Feature transformers ■ Discretize / Normalise ○ Feature selectors ○ In Apache Spark Latest ● Tweak training
  • 32. Spark ML Building a training pipeline with spark.ml. Create dummy variables Required response label format The ML model itself Labels back to readable format Assembled training pipeline
  • 33. Spark ML Cross-validate, grid search params and evaluate metrics. Grid search with reference to ML model stage (RF) Metrics to evaluate Yes, you can definitely extend and add your own metrics.
  • 35. ● Best testing - production A/B test ○ Use current production model and new model in parallel ○ Local ETA Model (averaging road ETA) VS Global ETA Model (error on full ride) VS Hybrid ● Metrics improvements (Remember your dashboard?) ○ Local ETA Model ~65s error VS Global ETA Model ~60s error VS Hybrid ~58.6s error Compare to baseline
  • 36. A/B Infrastructures Setting up a very basic A/B testing infrastructure built upon our earlier presented modeling wrapper. Conf hold Mapping of: model -> user_id/subject list Score in parallel (inside a map) Distributed=awesome. Fancy scala union for all score files
  • 38. ● If you wrote your code right, you can easily reuse it in a notebook! ● Answer ad-hoc questions ○ How many predictions did you output last month? ○ How many new users had a prediction with probability > 0.7? ○ How accurate were we on last month’s predictions? (join with real data) ● No need to rebuild anything! Enter Apache Zeppelin
  • 39. Playing with it Read a Parquet file, show statistics, register as table and run SparkSQL on it. Parquet - already has a schema inside For usage in SparkSQL
  • 40. Putting it all together
  • 41. Work Process Step by step for deploying your big ML workflows to production, ready for operations and optimisations. 1. Measure first, optimize second. a. Define metrics. b. Preprocess data (using examples) c. Monitor. (dashboard setup) 2. Start small and grow. 3. Start with a flow. a. Good ML code trumps performance. b. Test your infrastructure. 4. Set up a baseline. 5. Go to work. a. Optimize. b. A/B. i. Test new flow in parallel to existing flow. ii. Update user assignments. 6. Watch. Iterate. (see 5.)
  • 42. Possible Pitfalls ● Code produced with ○ Apache Spark 1.6 / Scala 2.11.4 ● RDD VS DataFrame ○ Enter “Dataset API” (V2.0+) ● mllib VS spark.ml ○ Always use spark.ml if functionality exists ● Unbalanced partitions ○ Stuck on reduce ○ Stragglers ● Parameter tuning ○ spark.sql.shuffle.partitions ○ Executors ○ Driver VS executor memory
  • 44. Irregular Traffic Events Major events, causing out of the ordinary traffic ● Exploration tool over time & space ● Seasonal traffic anomaly detection
  • 45. Dangerous Places Find most dangerous places, using custom developed clustering algorithms ● Alert authorities / users ● Compare & share with 3rd parties (NYPD)
  • 46. Parking Places Detection Parking entrance Parking lot Street parking
  • 47. Speed Limits Inference Waze Segment Data Machine Learning Speed Limit Prediction Waze Segment Data Community Verification Show in App
  • 48. Text Mining - Store Sentiments
  • 49. Text Mining - Sentiment by Time & Place
  • 51. Code & Slides https://github.com/dmarcous/BigMLFlow/ Daniel Marcous dmarcous@google.com dmarcous@gmail.com
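
As a rough illustration of the ideas on slides 13, 35 and 36 (an absolute-error metric, a model-to-users A/B mapping, and a comparison between models), here is a minimal Python sketch. It is not the code from the talk, which was written for Spark and Scala. Every name in it (`ASSIGNMENTS`, `MODELS`, `RIDES`, the two toy models and their speed factors) is a hypothetical stand-in chosen for illustration, and the ride data is synthetic.

```python
from statistics import mean

# Hypothetical A/B configuration: model name -> user_ids assigned to it.
ASSIGNMENTS = {
    "local": [1, 2],
    "global": [3, 4],
}

# Hypothetical stand-in models: each maps a feature record to an ETA in seconds.
MODELS = {
    "local": lambda ride: ride["road_distance_km"] * 200.0,
    "global": lambda ride: ride["road_distance_km"] * 210.0,
}

# Synthetic rides with a known actual duration, so the metric can be checked by hand.
RIDES = [
    {"user_id": 1, "road_distance_km": 3.0, "actual_s": 630.0},
    {"user_id": 2, "road_distance_km": 2.0, "actual_s": 410.0},
    {"user_id": 3, "road_distance_km": 3.0, "actual_s": 640.0},
    {"user_id": 4, "road_distance_km": 1.0, "actual_s": 200.0},
]


def user_to_model(assignments):
    """Invert the configuration into a user_id -> model name lookup."""
    return {uid: name for name, uids in assignments.items() for uid in uids}


def mean_abs_error_by_model(rides, assignments=ASSIGNMENTS, models=MODELS):
    """Score each ride with its assigned model and average the absolute error."""
    lookup = user_to_model(assignments)
    errors = {name: [] for name in assignments}
    for ride in rides:
        name = lookup[ride["user_id"]]
        predicted = models[name](ride)
        errors[name].append(abs(predicted - ride["actual_s"]))
    # Skip models that received no rides to avoid averaging an empty list.
    return {name: mean(errs) for name, errs in errors.items() if errs}


def best_model(rides):
    """Return the model name with the lowest mean absolute error."""
    mae = mean_abs_error_by_model(rides)
    return min(mae, key=mae.get)


if __name__ == "__main__":
    print(mean_abs_error_by_model(RIDES))
    print(best_model(RIDES))
```

Keeping the assignment mapping as plain configuration, separate from the scoring function, means user assignments can be changed without touching the scoring code, which is roughly the separation the workflow on slide 41 asks for when it lists "Update user assignments" as its own step.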