4. Example Use Case
Optimizing Waze ETA Prediction
What's here?
● Methodology - deploying to production, step by step
● Pitfalls - what to look out for, in both methodology and code
● Use Cases - showing off what we actually do in Waze Analytics
Based on tough lessons learned and on recommendations and input from Google experts.
6. Bigger Is Better!
● Trying everything
○ Grid search all the parameters you ever wanted.
○ Cross-validate in parallel with no extra effort.
● Algorithmic benefits
○ Some models (e.g. artificial neural networks) can't do well without training on a lot of data.
○ Keep training until error hits zero.
● Data size
○ Tons of training data
■ We store EVERYTHING
■ No need for sampling (and no risk of sampling the wrong populations)!
○ Millions of features
■ Text processing with TF-IDF (see the sketch below)
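The slide calls out TF-IDF as a source of millions of features. A minimal spark.ml sketch of that idea (the DataFrame rawDF and its "text" column are assumptions for illustration, not the actual Waze schema):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Split free text into tokens
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val words = tokenizer.transform(rawDF)

// Hash tokens into a very wide sparse feature space (~1M dimensions here)
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 20)
val tf = hashingTF.transform(words)

// Re-weight raw term counts by inverse document frequency
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
val tfidf = idf.transform(tf)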
8. Bigger is harder
● Skill gap - big data engineer (Scala/Java) VS researcher/PhD (Python/R)
● Curse of dimensionality
○ Some algorithms require exponential time/memory as the number of dimensions grows
○ It gets harder, and more important, to tell what's gold and what's noise
○ Class imbalance in the data goes a long way when you have more records
● Big model != Small model
○ A model tuned at small scale needs different parameter settings at large scale
10. A car goes from Chinatown to Times Square. How long will it take to arrive?
● Ride features
● Historical data
● Real-time data
● User features
● Map features
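To make the inputs above concrete, here is a hypothetical feature row for this ETA question. Every field name below is an illustration, not the actual Waze schema:

// Hypothetical feature row for the ETA question; names and types are illustrative only
case class RideFeatures(
  rideId: Long,
  originSegment: String,        // ride features
  destinationSegment: String,
  histMedianEtaSec: Double,     // historical data
  currentAvgSpeedKmh: Double,   // real-time data
  userSpeedRatio: Double,       // user features
  highwayFraction: Double,      // map features
  actualEtaSec: Double          // label to predict
)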
12. Before you start
● Create synthetic input data
○ Raw input
○ Feature row
○ Output data
○ Metric value
● Set up your metrics
○ Derived from business needs
○ Confusion matrix
○ Precision / recall
■ Per-class metrics
○ AUC
○ Coverage
Remember: desired short-term behaviour does not imply long-term behaviour.
[Flow diagram: Measure → Preprocess (parse, clean, join, etc.)]
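A minimal sketch of the synthetic-data-plus-metrics setup, using mllib's MulticlassMetrics on tiny synthetic (prediction, label) pairs (all values here are made up):

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Synthetic (prediction, label) pairs standing in for real model output
val scores = sc.parallelize(Seq(
  (1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)))

val metrics = new MulticlassMetrics(scores)
println(metrics.confusionMatrix)   // confusion matrix
println(metrics.precision(1.0))    // per-class precision for label 1.0
println(metrics.recall(1.0))       // per-class recall for label 1.0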
16. Visualise - the easiest way to measure quickly
● Set up your dashboard (sketch below)
○ Amounts of input data
■ Before/after joining
○ Amounts of output data
○ Metrics (see "Measure first, optimize second")
● Comparison of different models - what's best, when, and where
● Time-series analysis
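A minimal sketch of the counts such a dashboard might track. The DataFrame names (rawDF, featureMatrix, predictions) are assumptions carried over from the examples above:

// Illustrative dashboard inputs
val rawCount    = rawDF.count()          // input rows before joining
val joinedCount = featureMatrix.count()  // input rows after joining
val outCount    = predictions.count()    // output rows

// Persist one timestamped record per run to feed time-series charts
println(s"${System.currentTimeMillis()},$rawCount,$joinedCount,$outCount")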
24. Basic moving parts (Naive Flow)
[Diagram: Data source 1 .. Data source N → Preprocess → Feature matrix → Training → Models 1..N → Scoring → Predictions 1..N → Serving DB → Application. A Dashboard reads the data and metrics; a Feedback loop and a Conf. of user/model assignments feed back into scoring.]
26. Why so many parts you ask?
● Scaling
● Fault tolerance
○ Failed preprocessing/training doesn't affect the serving model
○ Rerun only the failed parts
● Different logical parts, different processes (cf. "Clean Code" by Uncle Bob)
● Easier to tweak and deploy changes
28. Set up a baseline.
Start with a neutral launch
29. ● Take a snapshot of your PRODUCTION metric reads:
○ The ones you chose earlier in the process as important to you
■ Confusion matrix
■ Weighted average % classified correctly
■ % subject coverage
● Latency
○ Building the feature matrix on the last day's data takes X minutes
Remember: you are running with a naive model. Anything better than the old model / random is OK.
31. Optimize
What?
● Tweak preprocessing (mainly features)
○ Feature engineering
○ Feature transformers
■ Discretize / normalise
○ Feature selectors (in the latest Apache Spark)
● Tweak training
How?
● Grid search over parameters
● Cross-validate everything
● Evaluate metrics
○ Using a predefined Spark Evaluator
○ Using user-defined metrics
32. Spark ML
Building a training pipeline with spark.ml. The original slide shows a code screenshot annotated with:
● Create dummy variables
● Required response label format
● The ML model itself
● Labels back to readable format
● Assembled training pipeline
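The screenshot itself is not in the transcript; below is a minimal sketch of such a pipeline, assuming a RandomForestClassifier and illustrative inputs (trainingDF and the column names "city", "label", etc. are assumptions):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature._

// "Dummy variables": index a categorical column, then one-hot encode it
val cityIndexer = new StringIndexer().setInputCol("city").setOutputCol("cityIndex").fit(trainingDF)
val cityEncoder = new OneHotEncoder().setInputCol("cityIndex").setOutputCol("cityVec")

// Required response label format: string label -> numeric index
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(trainingDF)

// Assemble feature columns into the single vector column spark.ml expects
val assembler = new VectorAssembler()
  .setInputCols(Array("cityVec", "histMedianEtaSec", "currentAvgSpeedKmh"))
  .setOutputCol("features")

// The ML model itself
val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("features")

// Labels back to readable format
val labelConverter = new IndexToString()
  .setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

// Assembled training pipeline
val pipeline = new Pipeline()
  .setStages(Array(cityIndexer, cityEncoder, labelIndexer, assembler, rf, labelConverter))
val model = pipeline.fit(trainingDF)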
33. Spark ML
Cross-validating, grid searching params and evaluating metrics. The slide's code screenshot is annotated with:
● Grid search with reference to the ML model stage (RF)
● Metrics to evaluate
● Yes, you can definitely extend Evaluator and add your own metrics.
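A minimal sketch of that step, reusing the pipeline and rf from the previous block (the grid values and fold count are illustrative):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Grid search with a reference to the ML model stage (rf)
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(20, 50, 100))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

// Metric to evaluate; extend Evaluator to plug in your own metrics
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setMetricName("f1")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainingDF)  // best model by cross-validated F1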
35. ● Best testing - a production A/B test
○ Use the current production model and the new model in parallel
○ Local ETA Model (averaging road ETA) VS Global ETA Model (error on full ride) VS Hybrid
● Metrics improvements (remember your dashboard?)
○ Local ETA Model ~65s error VS Global ETA Model ~60s error VS Hybrid ~58.6s error
Compare to baseline
36. A/B Infrastructures
Setting up a very basic A/B testing infrastructure built upon the modeling wrapper presented earlier. The slide's code screenshot is annotated with:
● Conf holds a mapping of: model -> user_id/subject list
● Score in parallel (inside a map) - distributed = awesome.
● Fancy Scala union for all score files
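The wrapper itself is not in the transcript; below is a minimal sketch of the idea the annotations describe. Here loadConf, models (a map from model name to a fitted PipelineModel), featureMatrix, and the output path are all hypothetical:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Conf: model name -> user ids assigned to that model (hypothetical loader)
val assignments: Map[String, Seq[Long]] = loadConf()

// Score each model on its assigned users inside a map; each transform runs distributed
val scored: Seq[DataFrame] = models.toSeq.map { case (name, model) =>
  val subset = featureMatrix.filter(col("user_id").isin(assignments(name): _*))
  model.transform(subset).withColumn("model", lit(name))
}

// Union all score DataFrames into one output (unionAll in Spark 1.6; union in 2.x)
val allScores = scored.reduce(_ unionAll _)
allScores.write.parquet("/predictions/latest")  // path is illustrative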
38. ● If you wrote your code right, you can easily reuse it in a notebook!
● Answer ad-hoc questions
○ How many predictions did you output last month?
○ How many new users had a prediction with probability > 0.7?
○ How accurate were we on last month's predictions? (join with real data)
● No need to rebuild anything!
Enter Apache Zeppelin
39. Playing with it
Read a Parquet file, show statistics, register it as a table and run SparkSQL on it. The slide's code screenshot is annotated with:
● Parquet already has a schema inside
● For usage in SparkSQL
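A minimal sketch of such a notebook paragraph, using the Spark 1.6 API mentioned on the pitfalls slide (the path and the numeric probability column are assumptions):

// Parquet already carries its schema, so no parsing code is needed
val predictions = sqlContext.read.parquet("/predictions/latest")

// Show quick statistics on a numeric column
predictions.describe("probability").show()

// Register as a table for usage in SparkSQL
predictions.registerTempTable("predictions")
sqlContext.sql(
  "SELECT COUNT(*) AS high_conf FROM predictions WHERE probability > 0.7").show()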
41. Work Process
Step by step for deploying your big ML workflows to production, ready for operations and optimisation.
1. Measure first, optimize second.
a. Define metrics.
b. Preprocess data (using examples)
c. Monitor. (dashboard setup)
2. Start small and grow.
3. Start with a flow.
a. Good ML code trumps performance.
b. Test your infrastructure.
4. Set up a baseline.
5. Go to work.
a. Optimize.
b. A/B.
i. Test new flow in parallel to existing flow.
ii. Update user assignments.
6. Watch. Iterate. (see 5.)
42. Possible Pitfalls
● Code produced with Apache Spark 1.6 / Scala 2.11.4
● RDD VS DataFrame
○ Enter the Dataset API (v2.0+)
● mllib VS spark.ml
○ Always use spark.ml if the functionality exists there
● Unbalanced partitions
○ Stuck on reduce
○ Stragglers
● Parameter tuning (see the sketch below)
○ spark.sql.shuffle.partitions
○ Executors
○ Driver VS executor memory
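A minimal sketch of those knobs (all values are illustrative, not recommendations; featureMatrix is the assumed DataFrame from earlier):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.shuffle.partitions", "400")  // default is 200; raise for big shuffles
  .set("spark.executor.memory", "8g")          // executor memory
  .set("spark.driver.memory", "4g")            // driver memory

// Unbalanced partitions: repartition before a heavy reduce to avoid stragglers
val balanced = featureMatrix.repartition(400)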
44. Irregular Traffic Events
Major events causing out-of-the-ordinary traffic
● Exploration tool over time & space
● Seasonal traffic anomaly detection
45. Dangerous Places
Finding the most dangerous places, using custom-developed clustering algorithms
● Alert authorities / users
● Compare & share with 3rd parties (NYPD)