A Look Under the Hood of H2O Driverless AI

Plano, TX 5/1/19
Arno Candel
CTO H2O.ai
@ArnoCandel

LinkedIn Workforce Report | United States | August 2018
Why Driverless AI?

Driverless AI: AutoML for the Enterprise
Tabular Data with Outcomes
Automatic ML & DS
Grandmaster Recipes
• Feature Engineering
• Time Series
• Model Tuning / Ensembling
• Overfitting Protection
• Bring Your Own Recipe
Powered by datatable, H2O-3 and H2O4GPU
ML Interpretability (reason codes in production)
Automatic Report
Scoring Pipeline (Python & Java, C++ soon)
AutoVis
Scores 
Diagnostics
Debugging
ML: machine learning 
DS: data science
Put models in production in days vs months

Confidential
Industry Use Cases
Save Time. Save Money. Gain a Competitive Advantage.
Financial Services
Wholesale / Commercial Banking
• Know Your Customers (KYC)
• Anti-Money Laundering (AML)
Card / Payments Business
• Transaction frauds
• Collusion fraud
• Real-time targeting
• Credit risk scoring
• In-context promotion
Retail Banking
• Deposit fraud
• Customer churn prediction
• Auto-loan
Healthcare
• Early cancer detection
• Product recommendations
• Personalized prescription matching
• Medical claim fraud detection
• Flu season prediction
• Drug discovery
• ER and hospital management
• Remote patient monitoring
• Medical test predictions
Telecom
• Predictive maintenance
• Avoidable truck-rolls
• Customer churn prediction
• Improved customer viewing experience
• Master data management
• In-context promotions
• Intelligent ad placements
• Personalized program recommendations
Marketing and Retail
• Funnel predictions
• Personalized ads
• Credit scoring
• Fraud detection
• Next best offer
• Next best customer
• Smart profiling
• Prediction
• Customer recommendations
• Ad predictions and spend
Driverless AI: Used Across Many Industries

“Driverless AI is giving amazing results in terms
of feature and model performance”
Venkatesh Ramanathan
Senior Data Scientist, PayPal
“Driverless AI helped us gain an edge with our
Intelligent Marketing Cloud for our clients. AI to
do AI, truly is improving our system on a daily basis.”
Martin Stein
Chief Product Officer, G5
“H2O Driverless AI feature engineering is better than
anything I've seen out there right now. And the scoring
pipeline generation is probably one of the bigger
pluses for me. These features alone have provided
us with a true competitive edge in agile manufacturing.
It's a massive time saver.”
Dr. Robert Coop
AI and ML Manager, Stanley Black & Decker
“Driverless AI powers our data science team to
operate efficiently and experiment at scale… with this
latest innovation, we have the opportunity to impact
care at large.”
Bharath Sudarshan
Director of Data Science, Armada Health
“H2O.ai is doing a great job in enhancing the product
at such a rapid rate. Each release provides significant
increases in usability and value. Driverless AI gives
startups like ours an effective alternative to large
data science teams and their outsized cost. It can
dramatically reduce the time needed to deliver first-rate
ML models for a wide range of markets.”
Marc Stein
CEO, Underwrite.ai
Driverless AI: Customer Feedback

Driverless AI Architecture
InfoWorld Tech of the Year Award: 2018 & 2019

2 months for Grandmasters — 2 hours for Driverless AI
single run, fully automated: 2h on DGX Station! 6h on PC
Driverless AI: 10th place in private LB at Kaggle (out of 2926)
Driverless AI: top 10 in BNP Paribas Kaggle competition

https://www.h2o.ai/blog/
Driverless AI — Teamwork and Maker’s Culture

Feature v1.0 v1.1 v1.2 v1.3 v1.4 v1.5 v1.6 LTS v1.7 v1.8 LTS v2.0
Kaggle Grandmaster Recipes for i.i.d. data, XGBoost Models
Automatic Visualization
Machine Learning Interpretability
Standalone Python Scoring Pipeline
Hardware acceleration: NVIDIA GPUs (DGX-1 etc.)
User Management and Security (LDAP/Kerberos)
Data Connectors: NFS/HDFS/S3/GCS/BigQuery, CSV/Excel/Parquet/Feather
Native Installer (RPM/DEB) and Cloud Neutral: Amazon/Microsoft/Google
Kaggle Grandmaster Recipes for Time-Series
Automatic Documentation
Deep Learning TensorFlow Models (CPU/GPU)
Standalone Java Scoring Pipeline (MOJO)
Deep Learning for NLP / Text (CPU/GPU)
LightGBM Models (CPU/GPU)
Improved Time-Series Recipes (Multiple Windows, MLI for Time-Series)
Local Feature Brain
Improved Scalability, FTRL Models, Model Diagnostics, Data Splitting, Retrain Final Model, etc.
C++ Scoring Pipeline (Runtime for MOJO), with Python and R bindings
Improved Time-Series Recipes (backtesting, test-time augmentation, single time-series)
Project Workspace
Bring Your Own Recipe (Transformers, Models, Scorers) - Custom Python Code
Data Augmentation
Model Monitoring
R client API
Multi-Node and Multi-User Deployment
Driverless AI Roadmap v1.7.0 MAY ‘19

MLI - Machine Learning Interpretability
Gain confidence in models before deploying them!
Shapley values, partial dependence, ICE, original and transformed features
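To make two of these terms concrete, here is a minimal sketch of ICE (individual conditional expectation) curves and partial dependence. It is not the Driverless AI implementation: `toy_model`, `ice_curves`, and `partial_dependence` are hypothetical names, and the model is a made-up formula standing in for any fitted predictor. An ICE curve varies one feature for a single row while holding the others fixed; partial dependence is the average of those curves.

```python
from statistics import mean

def ice_curves(predict, rows, feature, grid):
    """One ICE curve per row: predictions as `feature` sweeps over `grid`."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature] = value      # force the feature to the grid value
            curve.append(predict(modified))  # all other features stay as observed
        curves.append(curve)
    return curves

def partial_dependence(predict, rows, feature, grid):
    """Partial dependence is the pointwise average of the ICE curves."""
    curves = ice_curves(predict, rows, feature, grid)
    return [mean(curve[j] for curve in curves) for j in range(len(grid))]

def toy_model(row):
    # Hypothetical stand-in for a fitted model: y = 2*x0 + x0*x1
    return 2.0 * row[0] + row[0] * row[1]

rows = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
grid = [0.0, 1.0, 2.0]
ice = ice_curves(toy_model, rows, feature=0, grid=grid)
pd_curve = partial_dependence(toy_model, rows, feature=0, grid=grid)
```

Because the toy model contains an interaction term, the ICE curves have different slopes per row, which the averaged partial dependence curve alone would hide.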

Automatic Visualization
Scalable outlier detection (no sampling)
Contains novel statistical algorithms to only show “relevant” aspects of the data
(soon: actionable recipes and interactive visualization)

Secret Sauce: 1) Grandmaster Feature Engineering
Numerical/Categorical Interactions, Target Encoding, Clustering, Dimensionality Reduction, Weight of Evidence, etc.
Time-Series: Lags and historical aggregates with causality constraints
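As an illustration of one item on this list, the sketch below shows out-of-fold target encoding, a common way to encode a categorical column by the mean of the target without letting a row see its own label. This is an assumption about how such an encoder is typically built, not the Driverless AI recipe; the function name, the interleaved fold assignment, and the additive smoothing scheme are all hypothetical choices.

```python
from collections import defaultdict

def target_encode_oof(categories, targets, n_folds=3, smoothing=1.0):
    """Out-of-fold mean target encoding with additive smoothing.

    Row i belongs to fold i % n_folds (an illustrative choice). Each row is
    encoded from statistics computed on the other folds only.
    """
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        # Accumulate statistics from every row NOT in the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        prior = total / seen if seen else 0.0
        # Encode the held-out rows, shrinking rare categories toward the prior.
        for i in range(fold, n, n_folds):
            c = categories[i]
            encoded[i] = (sums[c] + smoothing * prior) / (counts[c] + smoothing)
    return encoded

cats = ["a", "b", "a", "b", "a", "b"]
y = [1, 0, 1, 0, 0, 1]
encoded = target_encode_oof(cats, y)
```

Computing the encoding out of fold is one form of the overfitting protection the slides mention: changing a row's own target leaves its encoded value unchanged.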

Secret Sauce: 2) Grandmaster Pipeline Tuning + Validation
19,000 features tested
1,000 models trained
reliable generalization estimates (overfitting avoidance)
Example: Driverless AI BNP Paribas on 3-GPU workstation
evolutionary strategies
DOI: 10.1126/science.aaa9375
MTV
1 final optimal scoring pipeline
massively parallel processing (multi-CPU, multi-GPU)
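The slide names evolutionary strategies without giving details, so the following is only a toy sketch of the general idea: keep a population of candidate feature sets, score each by validation accuracy, keep the best half, and mutate them. Everything here is an assumption for illustration, including the 1-nearest-neighbour model, the synthetic data, the population size, and the one-bit mutation; it does not reproduce the Driverless AI algorithm.

```python
import random

def knn_accuracy(mask, train, valid):
    """Validation accuracy of a 1-nearest-neighbour model on the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    correct = 0
    for xv, yv in valid:
        nearest = min(train, key=lambda row: sum((row[0][j] - xv[j]) ** 2 for j in cols))
        correct += nearest[1] == yv
    return correct / len(valid)

def evolve(train, valid, n_features, population=8, generations=10, seed=0):
    """Evolve boolean feature masks; returns the best mask and the best score per generation."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)] for _ in range(population)]
    history = []
    for _ in range(generations):
        scored = sorted(pop, key=lambda m: knn_accuracy(m, train, valid), reverse=True)
        history.append(knn_accuracy(scored[0], train, valid))
        survivors = scored[: population // 2]      # selection: keep the best half
        children = []
        for parent in survivors:
            child = list(parent)
            flip = rng.randrange(n_features)       # mutation: toggle one feature
            child[flip] = not child[flip]
            children.append(child)
        pop = survivors + children
    best = max(pop, key=lambda m: knn_accuracy(m, train, valid))
    return best, history

def make_data(n, rng, n_noise=4):
    """Synthetic rows: 2 informative features followed by pure-noise features."""
    rows = []
    for _ in range(n):
        label = rng.random() < 0.5
        signal = [rng.gauss(2.0 if label else -2.0, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 5.0) for _ in range(n_noise)]
        rows.append((signal + noise, label))
    return rows

rng = random.Random(42)
train, valid = make_data(60, rng), make_data(40, rng)
best_mask, history = evolve(train, valid, n_features=6)
```

Because the survivors are carried over unchanged, the best validation score never decreases between generations. Note that this toy reuses one validation set for every decision; a reliable generalization estimate would need data that the search never touched.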

Secret Sauce: 3) Statistical Learning & Deep Learning
Statistical Learning (https://web.stanford.edu/~hastie/Papers/ESLII.pdf)
GLM/CART/RF/GBM/XGBoost, K-Means/PCA/SVD
Typically better for structured data (CSV, SQL, Transactional)
Deep Learning (http://www.deeplearningbook.org)
TensorFlow Deep Learning
Typically better for unstructured data (Images, Video, Audio, Text)

Time Series in Driverless AI
[Figure: timeline of time steps 1 to 12 with Gap=1 and Forecast Horizon=2, showing train/valid and train/test splits separated by a gap, and marking lag sizes as invalid (no information available) or valid (information available)]
• Automatic Selection or Manual Control for:
• Forecast Horizon
• Gap between Training and Production
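The relationship between gap, forecast horizon, and usable lags can be sketched as follows. The rule used here is an assumption that the slides do not state explicitly: if training data ends at time T and the forecast window covers T+gap+1 through T+gap+horizon, then a lag feature is known for every row of that window only if the lag is at least gap + horizon. The function names are hypothetical.

```python
def min_valid_lag(gap, horizon):
    """Smallest lag whose source value is known for every row of the forecast window.

    Assumption: training ends at time T and the forecast window covers
    T+gap+1 .. T+gap+horizon, so a lag must reach back gap + horizon steps.
    """
    return gap + horizon

def is_lag_valid(lag, gap, horizon):
    """True if the lag never looks into the gap or the forecast window itself."""
    return lag >= min_valid_lag(gap, horizon)

def lag_feature(series, lag):
    """Shift a series by `lag` steps; positions without history become None."""
    head = [None] * min(lag, len(series))
    return head + list(series[: max(len(series) - lag, 0)])

series = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]  # time steps 1..12
gap, horizon = 1, 2
valid_lags = [lag for lag in range(1, 6) if is_lag_valid(lag, gap, horizon)]
lag3 = lag_feature(series, 3)
```

With Gap=1 and Forecast Horizon=2, lags 1 and 2 would require values from the gap or the forecast window, so under this assumption only lags of 3 or more carry information at prediction time.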

Text / Natural Language Processing in Driverless AI
Now also CharCNN and Bi-GRU LSTM, and custom embeddings!

1.7.0: BYOR — Bring Your Own Recipe!

Open-Source Recipes - Makers Gonna Make!
Bring Your Own Recipe!

Bring Your Own Recipes At Full Speed!
BYOR is a first-class citizen: native integration, no performance penalty, no memory overhead, no restrictions, even MOJOs possible.
H2O.ai Dev API = BYOR API

With Freedom Comes Responsibility
Now some of the responsibility is with the creator and user of the Recipe.
Example: User disables all but 3 specific custom transformers, {MyLog, MyRound, MyRandom}, and Identity for numerical columns. Features like log(EDUCATION) will show up, even though there is no statistical benefit (same signal:noise as EDUCATION).
Solution: DAI needs more statistical checks (work in progress).
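One way to see why log(EDUCATION) adds nothing for models that split on thresholds (an assumption about the model family, since the slide does not say which models were used): a strictly increasing transform preserves the ordering of the values, so every row partition reachable by a threshold on the transformed column is already reachable on the original. In the sketch below, `MyLog` is a plain Python stand-in with transformer-like methods, not the Driverless AI BYOR base class, and the EDUCATION column is made-up data.

```python
import math
import random

class MyLog:
    """Hypothetical stand-in for a custom transformer; not the Driverless AI API."""

    def fit_transform(self, values, y=None):
        return self.transform(values)

    def transform(self, values):
        return [math.log(v) for v in values]

def split_partitions(values):
    """All left-side row sets reachable by a threshold split on `values`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    partitions = set()
    for k in range(1, len(order)):
        # A threshold can only separate rows with distinct values.
        if values[order[k - 1]] < values[order[k]]:
            partitions.add(frozenset(order[:k]))
    return partitions

rng = random.Random(0)
education = [rng.randint(1, 6) for _ in range(50)]   # made-up ordinal column
log_education = MyLog().fit_transform(education)
same_splits = split_partitions(education) == split_partitions(log_education)
```

Here `same_splits` is true: the transformed feature offers a tree exactly the same candidate splits as the original. A statistical check of this kind is one possible way to filter out redundant engineered features; for linear models the two columns are not equivalent, so the argument is limited to threshold-based learners.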

AutoDoc - Automatic Documentation of Experiments
Full transparency into automation process: 
Validation scheme, model tuning, feature selection, ensembling, metrics, diagnostics.
Includes custom recipes, fully editable/customizable Word document.
