Driverless AI is H2O.ai's latest flagship product for automatic machine learning. It fully automates some of the most challenging and productive tasks in applied data science such as feature engineering, model tuning, model ensembling and production deployment. Driverless AI turns Kaggle-winning grandmaster recipes into production-ready code (Java and C++), and is specifically designed to avoid common mistakes such as under- or overfitting, data leakage or improper model validation, some of the hardest challenges in data science. Other industry-leading capabilities include automatic data visualization and machine learning interpretability.
With Driverless AI, data scientists of all proficiency levels can train and deploy modeling pipelines with just a few clicks from the GUI. Advanced users can use the client API from Python or R. Driverless AI builds hundreds or thousands of models under the hood to select the best feature engineering and modeling pipeline for every specific problem such as churn prediction, fraud detection, real-estate pricing, store sales prediction, marketing ad campaigns and many more.
With Bring-Your-Own-Recipe, domain experts and advanced data scientists can now write their own recipes and seamlessly extend Driverless AI with their favorite tools from the rich ecosystem of open-source data science and machine learning libraries.
In this talk, we explain how Driverless AI works and demonstrate it with live demos.
Arno's Bio:
Arno Candel is the Chief Technology Officer at H2O.ai. He is the main committer of H2O-3 and Driverless AI and has been designing and implementing high-performance machine-learning algorithms since 2012. Previously, he spent a decade in supercomputing at ETH and SLAC and collaborated with CERN on next-generation particle accelerators.
Arno holds a PhD and Masters summa cum laude in Physics from ETH Zurich, Switzerland. He was named “2014 Big Data All-Star” by Fortune Magazine and featured by ETH GLOBE in 2015. Follow him on Twitter: @ArnoCandel.
3. Driverless AI: AutoML for the Enterprise
Tabular Data with Outcomes
Automatic ML & DS
Grandmaster Recipes
• Feature Engineering
• Time Series
• Model Tuning / Ensembling
• Overfitting Protection
• Bring Your Own Recipe
Powered by datatable, H2O-3 and H2O4GPU
ML Interpretability (reason codes in production)
Automatic Report
Scoring Pipeline (Python & Java, C++ soon)
AutoVis
Scores
Diagnostics
Debugging
ML: machine learning
DS: data science
Put models in production in days vs months
4. Driverless AI: Used Across Many Industries
Industry Use Cases
Save Time. Save Money. Gain a Competitive Advantage.
Financial Services
Wholesale / Commercial Banking
• Know Your Customers (KYC)
• Anti-Money Laundering (AML)
Card / Payments Business
• Transaction frauds
• Collusion fraud
• Real-time targeting
• Credit risk scoring
• In-context promotion
Retail Banking
• Deposit fraud
• Customer churn prediction
• Auto-loan
Healthcare
• Early cancer detection
• Product recommendations
• Personalized prescription matching
• Medical claim fraud detection
• Flu season prediction
• Drug discovery
• ER and hospital management
• Remote patient monitoring
• Medical test predictions
Telecom
• Predictive maintenance
• Avoidable truck-rolls
• Customer churn prediction
• Improved customer viewing experience
• Master data management
• In-context promotions
• Intelligent ad placements
• Personalized program recommendations
Marketing and Retail
• Funnel predictions
• Personalized ads
• Credit scoring
• Fraud detection
• Next best offer
• Next best customer
• Smart profiling
• Prediction
• Customer recommendations
• Ad predictions and spend
5. Driverless AI: Customer Feedback
“Driverless AI is giving amazing results in terms of feature and model performance”
Venkatesh Ramanathan
Senior Data Scientist, PayPal
“Driverless AI helped us gain an edge with our Intelligent Marketing Cloud for our clients. AI to do AI, truly is improving our system on a daily basis.”
Martin Stein
Chief Product Officer, G5
“H2O Driverless AI feature engineering is better than anything I've seen out there right now. And the scoring pipeline generation is probably one of the bigger pluses for me. These features alone have provided us with a true competitive edge in agile manufacturing. It's a massive time saver.”
Dr. Robert Coop
AI and ML Manager, Stanley Black & Decker
“Driverless AI powers our data science team to operate efficiently and experiment at scale… with this latest innovation, we have the opportunity to impact care at large.”
Bharath Sudarshan
Director of Data Science, Armada Health
“H2O.ai is doing a great job in enhancing the product at such a rapid rate. Each release provides significant increases in usability and value. Driverless AI gives startups like ours an effective alternative to large data science teams and their outsized cost. It can dramatically reduce the time needed to deliver first-rate ML models for a wide range of markets.”
Marc Stein
CEO, Underwrite.ai
7. 2 months for Grandmasters — 2 hours for Driverless AI
single run, fully automated: 2h on DGX Station! 6h on PC
Driverless AI: 10th place in private LB at Kaggle (out of 2926)
Driverless AI: top 10 in BNP Paribas Kaggle competition
9. Driverless AI Roadmap v1.7.0 MAY ‘19
Feature v1.0 v1.1 v1.2 v1.3 v1.4 v1.5 v1.6 LTS v1.7 v1.8 LTS v2.0
Kaggle Grandmaster Recipes for i.i.d. data, XGBoost Models
Automatic Visualization
Machine Learning Interpretability
Standalone Python Scoring Pipeline
Hardware acceleration: NVIDIA GPUs (DGX-1 etc.)
User Management and Security (LDAP/Kerberos)
Data Connectors: NFS/HDFS/S3/GCS/BigQuery, CSV/Excel/Parquet/Feather
Native Installer (RPM/DEB) and Cloud Neutral: Amazon/Microsoft/Google
Kaggle Grandmaster Recipes for Time-Series
Automatic Documentation
Deep Learning TensorFlow Models (CPU/GPU)
Standalone Java Scoring Pipeline (MOJO)
Deep Learning for NLP / Text (CPU/GPU)
LightGBM Models (CPU/GPU)
Improved Time-Series Recipes (Multiple Windows, MLI for Time-Series)
Local Feature Brain
Improved Scalability, FTRL Models, Model Diagnostics, Data Splitting, Retrain Final Model, etc.
C++ Scoring Pipeline (Runtime for MOJO), with Python and R bindings
Improved Time-Series Recipes (backtesting, test-time augmentation, single time-series)
Project Workspace
Bring Your Own Recipe (Transformers, Models, Scorers) - Custom Python Code
Data Augmentation
Model Monitoring
R client API
Multi-Node and Multi-User Deployment
10. MLI - Machine Learning Interpretability
Gain confidence in models before deploying them!
Shapley values, partial dependence, ICE, original and transformed features
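To make the Shapley values mentioned above concrete, here is a minimal brute-force sketch. It is not the Driverless AI MLI implementation. The function `shapley_values`, the `toy_model` scoring function and the all-zero baseline are assumptions chosen for illustration. Exact enumeration like this is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, row, baseline):
    """Exact Shapley values for one row; absent features take baseline values."""
    n = len(row)
    features = range(n)

    def value(coalition):
        # Features in the coalition use the row's values, the rest use the baseline.
        x = [row[i] if i in coalition else baseline[i] for i in features]
        return predict(x)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(x):
    # Hypothetical scoring function with one additive and one interaction term.
    return 2.0 * x[0] + x[1] * x[2]

row = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, row, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for the row and the prediction for the baseline, which is the property that makes them usable as reason codes.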
11. Automatic Visualization
Scalable outlier detection (no sampling)
Contains novel statistical algorithms to only show “relevant” aspects of the data
(soon: actionable recipes and interactive visualization)
12. Secret Sauce: 1) Grandmaster Feature Engineering
Numerical/Categorical Interactions, Target Encoding, Clustering, Dimensionality Reduction, Weight of Evidence, etc.
Time-Series: Lags and historical aggregates with causality constraints
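A minimal sketch of what "lags and historical aggregates with causality constraints" can mean in practice: every feature for time t is built only from values strictly before t. The helper names `lag_features` and `rolling_mean`, the `gap` parameter (extra steps between the last known value and the prediction time) and the `sales` series are assumptions for illustration, not Driverless AI internals.

```python
def lag_features(series, lags, gap=0):
    """Build lag features from values at least `gap + lag` steps in the past."""
    rows = []
    for t in range(len(series)):
        row = {}
        for lag in lags:
            src = t - gap - lag
            # None marks a lag that would reach before the start of the history.
            row[f"lag_{lag}"] = series[src] if src >= 0 else None
        rows.append(row)
    return rows

def rolling_mean(series, window, gap=0):
    """Historical aggregate: mean of `window` values ending `gap + 1` steps before t."""
    out = []
    for t in range(len(series)):
        end = t - gap          # exclusive end index, so the value at t is never used
        start = end - window
        out.append(sum(series[start:end]) / window if start >= 0 else None)
    return out

sales = [10, 12, 11, 15, 14, 18]   # hypothetical series
lags = lag_features(sales, lags=[1, 2], gap=1)
means = rolling_mean(sales, window=2, gap=1)
print(lags[3], means[3])
```

Because the slices never include index t or later, a model trained on these features cannot see the value it is asked to predict.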
13. Secret Sauce: 2) Grandmaster Pipeline Tuning + Validation
19,000 features tested
1,000 models trained
reliable generalization estimates (overfitting avoidance)
Example: Driverless AI BNP Paribas on 3-GPU workstation
evolutionary strategies
DOI: 10.1126/science.aaa9375
MTV
1 final optimal scoring pipeline
massively parallel processing (multi-CPU, multi-GPU)
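The idea of evolutionary strategies selecting one final pipeline out of many candidates can be sketched with a toy search over feature subsets. This is an illustration of the general technique, not the algorithm Driverless AI uses. The names `evolve`, `toy_fitness` and `USEFUL` are hypothetical, and in a real setting the fitness would be a cross-validated score of a trained model.

```python
import random

def evolve(fitness, n_features, population=8, generations=20, seed=0):
    """Toy evolutionary search over feature-inclusion masks (lists of bools)."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)] for _ in range(population)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Keep the better half, then refill with mutated copies of the survivors.
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: population // 2]
        children = []
        for parent in survivors:
            child = parent[:]
            flip = rng.randrange(n_features)
            child[flip] = not child[flip]  # single-bit mutation
            children.append(child)
        pop = survivors + children
        candidate = max(pop, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best

USEFUL = {0, 3, 5}  # hypothetical: the features that actually carry signal

def toy_fitness(mask):
    # Stand-in for a validation score: reward useful features, penalise noise.
    return sum(1.0 if i in USEFUL else -0.5 for i, on in enumerate(mask) if on)

best_mask = evolve(toy_fitness, n_features=8)
print(best_mask, toy_fitness(best_mask))
```

Since the survivors are carried over unchanged, the best score never decreases from one generation to the next. Each candidate can also be scored independently, which is what makes this kind of search easy to parallelize.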
19. Bring Your Own Recipes At Full Speed!
BYOR is a first-class citizen: native integration, no performance penalty, no memory overhead, no restrictions, even MOJOs possible.
H2O.ai Dev API = BYOR API
20. With Freedom Comes Responsibility
Now some of the responsibility is with the creator and user of the Recipe.
Example:
User disables all but 3 specific custom transformers: {MyLog, MyRound, MyRandom} and Identity for numerical columns:
Features like log(EDUCATION) will show up, even though there is no statistical benefit (same signal:noise as EDUCATION).
Solution: Driverless AI (DAI) needs more statistical checks - work in progress (WIP)
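One way to see why log(EDUCATION) adds nothing, at least for tree-based models: a strictly increasing transform keeps the ordering of the values, so every possible split separates the rows in exactly the same way. The sketch below checks this on made-up data. The `best_split_partition` helper and the `education` and `default` columns are assumptions for illustration only.

```python
import math

def best_split_partition(values, target):
    """Return (left-side row indices, squared error) of the best single split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_sse, best_left = float("inf"), None
    for k in range(1, len(order)):
        # Only split between distinct feature values.
        if values[order[k - 1]] == values[order[k]]:
            continue
        left, right = order[:k], order[k:]
        sse = 0.0
        for side in (left, right):
            mean = sum(target[i] for i in side) / len(side)
            sse += sum((target[i] - mean) ** 2 for i in side)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left, best_sse

education = [1, 2, 2, 3, 4, 5, 6]   # hypothetical positive-valued column
default = [1, 1, 0, 1, 0, 0, 0]     # hypothetical binary target
log_education = [math.log(v) for v in education]

raw_split = best_split_partition(education, default)
log_split = best_split_partition(log_education, default)
print(raw_split == log_split)
```

A statistical check of this kind (does the transformed feature induce any new partition of the rows?) is one possible way to filter out such redundant features. Whether Driverless AI does so is not stated in the slides.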
21. AutoDoc - Automatic Documentation of Experiments
Full transparency into the automation process:
Validation scheme, model tuning, feature selection, ensembling, metrics, diagnostics.
Includes custom recipes, fully editable/customizable Word document.