4. Example Use Case
Optimizing Waze ETA Prediction
What's here?
● Methodology - deploying to production, step by step
● Pitfalls - what to look out for, in both methodology and code
● Use Cases - showing off what we actually do in Waze Analytics
Based on tough lessons learned and on recommendations and input from Google experts.
6. Bigger Is Better!
● Trying everything
○ Grid search all the parameters you ever wanted.
○ Cross-validate in parallel with no extra effort.
● Algorithmic benefits
○ Some models (e.g. artificial neural networks) can't do well without training on a lot of data.
○ Keep training until error hits zero.
● Data size
○ Tons of training data
■ We store EVERYTHING
■ No need for sampling (and no risk of sampling the wrong populations)!
○ Millions of features
■ Text processing with TF-IDF (see the sketch below)
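The slide calls out TF-IDF as a source of millions of features. A minimal spark.ml sketch of that idea (the DataFrame rawDF and its "text" column are assumptions for illustration, not the actual Waze schema):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Split free text into tokens
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val words = tokenizer.transform(rawDF)

// Hash tokens into a very wide sparse feature space (~1M dimensions here)
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 20)
val tf = hashingTF.transform(words)

// Re-weight raw term counts by inverse document frequency
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
val tfidf = idf.transform(tf)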
8. Bigger is harder
● Skill gap - big data engineer (Scala/Java) VS researcher/PhD (Python/R)
● Curse of dimensionality
○ Some algorithms require exponential time/memory as the number of dimensions grows
○ It gets harder, and more important, to tell what's gold and what's noise
○ Class imbalance in the data goes a long way when you have more records
● Big model != Small model
○ A model tuned at small scale needs different parameter settings at large scale
10. A car goes from Chinatown to Times Square. How long will it take to arrive?
● Ride features
● Historical data
● Real-time data
● User features
● Map features
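To make the inputs above concrete, here is a hypothetical feature row for this ETA question. Every field name below is an illustration, not the actual Waze schema:

// Hypothetical feature row for the ETA question; names and types are illustrative only
case class RideFeatures(
  rideId: Long,
  originSegment: String,        // ride features
  destinationSegment: String,
  histMedianEtaSec: Double,     // historical data
  currentAvgSpeedKmh: Double,   // real-time data
  userSpeedRatio: Double,       // user features
  highwayFraction: Double,      // map features
  actualEtaSec: Double          // label to predict
)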
12. Before you start
● Create synthetic input data
○ Raw input
○ Feature row
○ Output data
○ Metric value
● Set up your metrics
○ Derived from business needs
○ Confusion matrix
○ Precision / recall
■ Per-class metrics
○ AUC
○ Coverage
Remember: desired short-term behaviour does not imply long-term behaviour.
[Flow diagram: Measure → Preprocess (parse, clean, join, etc.)]
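A minimal sketch of the synthetic-data-plus-metrics setup, using mllib's MulticlassMetrics on tiny synthetic (prediction, label) pairs (all values here are made up):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Synthetic (prediction, label) pairs standing in for real model output
val scores = sc.parallelize(Seq(
  (1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)))

val metrics = new MulticlassMetrics(scores)
println(metrics.confusionMatrix)   // confusion matrix
println(metrics.precision(1.0))    // per-class precision for label 1.0
println(metrics.recall(1.0))       // per-class recall for label 1.0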
16. Visualise - the easiest way to measure quickly
● Set up your dashboard (sketch below)
○ Amounts of input data
■ Before/after joining
○ Amounts of output data
○ Metrics (see "Measure first, optimize second")
● Comparison of different models - what's best, when, and where
● Time-series analysis
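A minimal sketch of the counts such a dashboard might track. The DataFrame names (rawDF, featureMatrix, predictions) are assumptions carried over from the examples above:

// Illustrative dashboard inputs
val rawCount    = rawDF.count()          // input rows before joining
val joinedCount = featureMatrix.count()  // input rows after joining
val outCount    = predictions.count()    // output rows

// Persist one timestamped record per run to feed time-series charts
println(s"${System.currentTimeMillis()},$rawCount,$joinedCount,$outCount")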
24. Basic moving parts (Naive Flow)
[Diagram: Data source 1 .. Data source N → Preprocess → Feature matrix → Training → Models 1..N → Scoring → Predictions 1..N → Serving DB → Application. A Dashboard reads the data and metrics; a Feedback loop and a Conf. of user/model assignments feed back into scoring.]
26. Why so many parts you ask?
● Scaling
● Fault tolerance
○ Failed preprocessing/training doesn't affect the serving model
○ Rerun only the failed parts
● Different logical parts, different processes (cf. "Clean Code" by Uncle Bob)
● Easier to tweak and deploy changes
28. Set up a baseline.
Start with a neutral launch
29. ● Take a snapshot of your PRODUCTION metric reads:
○ The ones you chose earlier in the process as important to you
■ Confusion matrix
■ Weighted average % classified correctly
■ % subject coverage
● Latency
○ Building the feature matrix on the last day's data takes X minutes
Remember: you are running with a naive model. Anything better than the old model / random is OK.
31. Optimize
What?
● Tweak preprocessing (mainly features)
○ Feature engineering
○ Feature transformers
■ Discretize / normalise
○ Feature selectors (in the latest Apache Spark)
● Tweak training
How?
● Grid search over parameters
● Cross-validate everything
● Evaluate metrics
○ Using a predefined Spark Evaluator
○ Using user-defined metrics
32. Spark ML
Building a training pipeline with spark.ml. The original slide shows a code screenshot annotated with:
● Create dummy variables
● Required response label format
● The ML model itself
● Labels back to readable format
● Assembled training pipeline
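The screenshot itself is not in the transcript; below is a minimal sketch of such a pipeline, assuming a RandomForestClassifier and illustrative inputs (trainingDF and the column names "city", "label", etc. are assumptions):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature._

// "Dummy variables": index a categorical column, then one-hot encode it
val cityIndexer = new StringIndexer().setInputCol("city").setOutputCol("cityIndex").fit(trainingDF)
val cityEncoder = new OneHotEncoder().setInputCol("cityIndex").setOutputCol("cityVec")

// Required response label format: string label -> numeric index
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(trainingDF)

// Assemble feature columns into the single vector column spark.ml expects
val assembler = new VectorAssembler()
  .setInputCols(Array("cityVec", "histMedianEtaSec", "currentAvgSpeedKmh"))
  .setOutputCol("features")

// The ML model itself
val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("features")

// Labels back to readable format
val labelConverter = new IndexToString()
  .setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

// Assembled training pipeline
val pipeline = new Pipeline()
  .setStages(Array(cityIndexer, cityEncoder, labelIndexer, assembler, rf, labelConverter))
val model = pipeline.fit(trainingDF)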
33. Spark ML
Cross-validating, grid searching params and evaluating metrics. The slide's code screenshot is annotated with:
● Grid search with reference to the ML model stage (RF)
● Metrics to evaluate
● Yes, you can definitely extend Evaluator and add your own metrics.
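A minimal sketch of that step, reusing the pipeline and rf from the previous block (the grid values and fold count are illustrative):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Grid search with a reference to the ML model stage (rf)
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(20, 50, 100))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

// Metric to evaluate; extend Evaluator to plug in your own metrics
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setMetricName("f1")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDF)  // best model by cross-validated F1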
35. ● Best testing - a production A/B test
○ Use the current production model and the new model in parallel
○ Local ETA Model (averaging road ETA) VS Global ETA Model (error on full ride) VS Hybrid
● Metrics improvements (remember your dashboard?)
○ Local ETA Model ~65s error VS Global ETA Model ~60s error VS Hybrid ~58.6s error
Compare to baseline
36. A/B Infrastructures
Setting up a very basic A/B testing infrastructure built upon the modeling wrapper presented earlier. The slide's code screenshot is annotated with:
● Conf holds a mapping of: model -> user_id/subject list
● Score in parallel (inside a map) - distributed = awesome.
● Fancy Scala union for all score files
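The wrapper itself is not in the transcript; below is a minimal sketch of the idea the annotations describe. Here loadConf, models (a map from model name to a fitted PipelineModel), featureMatrix, and the output path are all hypothetical:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Conf: model name -> user ids assigned to that model (hypothetical loader)
val assignments: Map[String, Seq[Long]] = loadConf()

// Score each model on its assigned users inside a map; each transform runs distributed
val scored: Seq[DataFrame] = models.toSeq.map { case (name, model) =>
  val subset = featureMatrix.filter(col("user_id").isin(assignments(name): _*))
  model.transform(subset).withColumn("model", lit(name))
}

// Union all score DataFrames into one output (unionAll in Spark 1.6; union in 2.x)
val allScores = scored.reduce(_ unionAll _)
allScores.write.parquet("/predictions/latest")  // path is illustrative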
38. ● If you wrote your code right, you can easily reuse it in a notebook!
● Answer ad-hoc questions
○ How many predictions did you output last month?
○ How many new users had a prediction with probability > 0.7?
○ How accurate were we on last month's predictions? (join with real data)
● No need to rebuild anything!
Enter Apache Zeppelin
39. Playing with it
Read a Parquet file, show statistics, register it as a table and run SparkSQL on it. The slide's code screenshot is annotated with:
● Parquet already has a schema inside
● For usage in SparkSQL
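A minimal sketch of such a notebook paragraph, using the Spark 1.6 API mentioned on the pitfalls slide (the path and the numeric probability column are assumptions):

// Parquet already carries its schema, so no parsing code is needed
val predictions = sqlContext.read.parquet("/predictions/latest")

// Show quick statistics on a numeric column
predictions.describe("probability").show()

// Register as a table for usage in SparkSQL
predictions.registerTempTable("predictions")
sqlContext.sql(
  "SELECT COUNT(*) AS high_conf FROM predictions WHERE probability > 0.7").show()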
41. Work Process
Step by step for deploying your big ML workflows to production, ready for operations and optimisation.
1. Measure first, optimize second.
a. Define metrics.
b. Preprocess data (using examples)
c. Monitor. (dashboard setup)
2. Start small and grow.
3. Start with a flow.
a. Good ML code trumps performance.
b. Test your infrastructure.
4. Set up a baseline.
5. Go to work.
a. Optimize.
b. A/B.
i. Test new flow in parallel to existing flow.
ii. Update user assignments.
6. Watch. Iterate. (see 5.)
42. Possible Pitfalls
● Code produced with Apache Spark 1.6 / Scala 2.11.4
● RDD VS DataFrame
○ Enter the Dataset API (v2.0+)
● mllib VS spark.ml
○ Always use spark.ml if the functionality exists there
● Unbalanced partitions
○ Stuck on reduce
○ Stragglers
● Parameter tuning (see the sketch below)
○ spark.sql.shuffle.partitions
○ Executors
○ Driver VS executor memory
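A minimal sketch of those knobs (all values are illustrative, not recommendations; featureMatrix is the assumed DataFrame from earlier):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.shuffle.partitions", "400")  // default is 200; raise for big shuffles
  .set("spark.executor.memory", "8g")          // executor memory
  .set("spark.driver.memory", "4g")            // driver memory

// Unbalanced partitions: repartition before a heavy reduce to avoid stragglers
val balanced = featureMatrix.repartition(400)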
44. Irregular Traffic Events
Major events causing out-of-the-ordinary traffic
● Exploration tool over time & space
● Seasonal traffic anomaly detection
45. Dangerous Places
Finding the most dangerous places, using custom-developed clustering algorithms
● Alert authorities / users
● Compare & share with 3rd parties (NYPD)