SlideShare uma empresa Scribd logo
1 de 33
Data Scientist, H2O.ai
GitHub: mlandry22, Email: mark@h2o.ai
Mark Landry
An Analysis of Driverless AI Feature Creation
Driver vs Driverless
Overview
• Analyze two recent problems with an emphasis on feature
generation
• Overview of the problem
• Discussion of my approach
• Show the features Driverless created
• Feature creation
• Feature representation
• Results
Who am I?
• Data scientist @ H2O since 2015, user since 2014
• 15 top 20 finishes in Kaggle, highest rank 33
• R | H2O | data.table | GBM
• The “driver”
Problem #1
How Many Attempts will a Student Make
• Online question/answer platform with computer science
problems
• Predict the number of attempts a particular student will make on
a particular problem
• Data
• Student: level, ranking, highest ranking
• Problem: type, 3-tier level, points awarded
• Training: 124,000 attempt counts
• Testing: 60,000 attempt counts
Regression as Classification
• Natural problem is numerical
• End user prefers buckets
• Volume
• 1: 53%
• 2 31%
• 3 9%
• 4 4%
• 5 1%
• 6 2%
The Driver Approach
• Think of it like a recommender problem
• Standard: Matrix factorization, collaborative filtering
• GBM: use deep categorical encodings
• Frequent use of target encoding
interaction
interaction
interaction
The Driver Approach
A messy chain of hierarchical target encoding and if/else statements
The Driver Approach
h2o.gbm feature importance – primarily using three target encodings
The Driverless Approach
The Driverless Approach
The Driverless Approach
The Driverless Approach
Top 5 Features: divided into components
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points *
last_online_time_seconds} {0}
The Driverless Approach
Feature #1: same base target encoding I found to be the best
• {16} {CV TE} {problem_id} {0}
• 16: indicator of base features – 16 is later used twice more in the top
15
• CV TE: Cross-Fold target encoding
• Problem_id: feature used as basis for target encoding
• 0: the target; multinomial, so the two other uses are for class 1 & 5
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points *
last_online_time_seconds} {0}
The Driverless Approach
Feature #2: deeper interaction
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• 51: base feature ID
• CV TE: out of sample target encoding result
• points * problem: interaction of two different features, both related to
the problem; it is subdividing the problem further
• 0: again, rate of class 0 as the target
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points *
last_online_time_seconds} {0}
The Driverless Approach
Feature #3: no transformation
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• max rating
• used as is – no alternate encoding; was first natural feature in my model as well
• this is the first variable of the student dimension
• a 4-digit number with close to a normal distribution
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points *
last_online_time_seconds} {0}
The Driverless Approach
Feature #4: frequency encoding
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• Counting the occurrences of two fields
• Last online & rating are both numerics in the student dimension
• {61} {NumToCatTE} {rating * max_rating * points *
last_online_time_seconds} {0}
The Driverless Approach
Feature 5: four-way interaction w/ target encoding
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points *
last_online_time_seconds} {0}
• Target encoding of class 0
• Finding the rate for each value of the result of a four way interaction
The Driverless Approach
Top 5 Features: did I try?
• YES {16} {CV TE} {problem_id} {0}
• NO {51} {CV TE} {points * problem_id} {0}
• YES {3} {max_rating}
• NO {32} {freq} {last_online_time_seconds * rating}
• NO {61} {NumToCatTE} {rating * max_rating * points *
last_online_time_seconds} {0}
The Driverless Approach
Problem #2
Bank Customer Churn
• Identify customers likely to churn balances in the next quarter
by 50%
• Data
• 300,00 training rows; 200,000 testing rows
• 377 columns
• Customer: age, gender, demographics
• Reported assets, liabilities
• Monthly balance history
The Driver Approach
Exploit before Explore
• 377 columns made [quick] manual investigation harder
• Rather than iterate: {analyze > model > analyze > … },
I changed to {model > analyze > model > … }
lagged features
The Driver Approach
After lagging, try differences and ratios
• Lagging features present the balance features at
several time steps.
• But often, the interesting part is not the raw balance
itself, but whether it is growing or shrinking
• Decision trees have a hard time “seeing” this so it is
wise to engineer mathematical features: + - * /
The Driver Approach
After lagging, try differences and ratios
• I used the leading monthly feature from the model and
created new features representing month-over-month
differences and a binary indicator
• One field, one specific length (1 month), two calculations
The Driverless Approach
The Driverless Approach
The Driverless Approach
It knows math!
subtraction
subtraction
subtraction
The Driverless Approach
Top 10 Features: divided into categories
• (3) Subtraction: #1, #6, #8
• (1) Truncated SVD components: #2
• (2) Cluster Distances: #3, #9
• (1) Target encoding: #4
• (3) Direct features: #5, #7, #10
The Driverless Approach
Lagged balances also used in clusters & SVD
• Distance to cluster #1 after segmenting columns into 6 clusters
• BAL_prev6
• D_prev1
• D_prev2
• I_AQB_PrevQ1
• Component #1 of truncated SVD of
• D_prev1
• D_prev2
• EOP_prev1_1
The Driverless Approach
Final Analysis
• On first iteration, Driverless AI had surpassed my manual
modeling
• Features were well beyond what I would have ever attempted
• Accuracy was stable: Driverless AI self-reported scores within
1% of competition submission
The Driverless Approach
Competition Results
Kaggle Grandmaster
Kaggle Grandmaster
Also Driverless AI
100% Driverless AI
The End
Thank You

Mais conteúdo relacionado

Mais procurados

Lessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at NetflixLessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at Netflix
Justin Basilico
 
Catch Me If You Can: Keeping Up With ML Models in Production
Catch Me If You Can: Keeping Up With ML Models in ProductionCatch Me If You Can: Keeping Up With ML Models in Production
Catch Me If You Can: Keeping Up With ML Models in Production
Databricks
 

Mais procurados (20)

Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
 
Modern Machine Learning Infrastructure and Practices
Modern Machine Learning Infrastructure and PracticesModern Machine Learning Infrastructure and Practices
Modern Machine Learning Infrastructure and Practices
 
Lessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at NetflixLessons Learned from Building Machine Learning Software at Netflix
Lessons Learned from Building Machine Learning Software at Netflix
 
Personalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing RecommendationsPersonalized Page Generation for Browsing Recommendations
Personalized Page Generation for Browsing Recommendations
 
MLconf seattle 2015 presentation
MLconf seattle 2015 presentationMLconf seattle 2015 presentation
MLconf seattle 2015 presentation
 
Machine Learning system architecture – Microsoft Translator, a Case Study : ...
Machine Learning system architecture – Microsoft Translator, a Case Study :  ...Machine Learning system architecture – Microsoft Translator, a Case Study :  ...
Machine Learning system architecture – Microsoft Translator, a Case Study : ...
 
Recommendations for Building Machine Learning Software
Recommendations for Building Machine Learning SoftwareRecommendations for Building Machine Learning Software
Recommendations for Building Machine Learning Software
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NET
 
Making Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons LearnedMaking Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons Learned
 
Catch Me If You Can: Keeping Up With ML Models in Production
Catch Me If You Can: Keeping Up With ML Models in ProductionCatch Me If You Can: Keeping Up With ML Models in Production
Catch Me If You Can: Keeping Up With ML Models in Production
 
Deploying ml
Deploying mlDeploying ml
Deploying ml
 
DutchMLSchool. ML Automation
DutchMLSchool. ML AutomationDutchMLSchool. ML Automation
DutchMLSchool. ML Automation
 
Agile Machine Learning for Real-time Recommender Systems
Agile Machine Learning for Real-time Recommender SystemsAgile Machine Learning for Real-time Recommender Systems
Agile Machine Learning for Real-time Recommender Systems
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
 
Agile data visualisation
Agile data visualisationAgile data visualisation
Agile data visualisation
 
Workshop: Your first machine learning project
Workshop: Your first machine learning projectWorkshop: Your first machine learning project
Workshop: Your first machine learning project
 

Semelhante a Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product Manager, H2O.ai

From science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning productFrom science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning product
Bruce Kuo
 

Semelhante a Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product Manager, H2O.ai (20)

From science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning productFrom science to engineering, the process to build a machine learning product
From science to engineering, the process to build a machine learning product
 
[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)[CS570] Machine Learning Team Project (I know what items really are)
[CS570] Machine Learning Team Project (I know what items really are)
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem
 
Before Kaggle
Before KaggleBefore Kaggle
Before Kaggle
 
Just the Facets, Ma'am
Just the Facets, Ma'amJust the Facets, Ma'am
Just the Facets, Ma'am
 
AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...
 
Test Cases - are they dead?
Test Cases - are they dead?Test Cases - are they dead?
Test Cases - are they dead?
 
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
Алексей Ященко и Ярослав Волощук "False simplicity of front-end applications"
 
Model-Driven Optimization: Generating Smart Mutation Operators for Multi-Obj...
 Model-Driven Optimization: Generating Smart Mutation Operators for Multi-Obj... Model-Driven Optimization: Generating Smart Mutation Operators for Multi-Obj...
Model-Driven Optimization: Generating Smart Mutation Operators for Multi-Obj...
 
EKON 23 Code_review_checklist
EKON 23 Code_review_checklistEKON 23 Code_review_checklist
EKON 23 Code_review_checklist
 
Model-based Testing: Taking BDD/ATDD to the Next Level
Model-based Testing: Taking BDD/ATDD to the Next LevelModel-based Testing: Taking BDD/ATDD to the Next Level
Model-based Testing: Taking BDD/ATDD to the Next Level
 
JUG Poznan - 2017.01.31
JUG Poznan - 2017.01.31 JUG Poznan - 2017.01.31
JUG Poznan - 2017.01.31
 
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...Taking your machine learning workflow to the next level using Scikit-Learn Pi...
Taking your machine learning workflow to the next level using Scikit-Learn Pi...
 
Predictive Analytics based Regression Test Optimization
Predictive Analytics based Regression Test OptimizationPredictive Analytics based Regression Test Optimization
Predictive Analytics based Regression Test Optimization
 
Database and application performance vivek sharma
Database and application performance vivek sharmaDatabase and application performance vivek sharma
Database and application performance vivek sharma
 
Technical debt management strategies
Technical debt management strategiesTechnical debt management strategies
Technical debt management strategies
 
MDE in Practice
MDE in PracticeMDE in Practice
MDE in Practice
 
OOP.pptx
OOP.pptxOOP.pptx
OOP.pptx
 
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector QuantizationPR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
 
DevOps: Find Solutions, Not More Defects
DevOps: Find Solutions, Not More DefectsDevOps: Find Solutions, Not More Defects
DevOps: Find Solutions, Not More Defects
 

Mais de Sri Ambati

Mais de Sri Ambati (20)

H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Generative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptxGenerative AI Masterclass - Model Risk Management.pptx
Generative AI Masterclass - Model Risk Management.pptx
 
AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek AI and the Future of Software Development: A Sneak Peek
AI and the Future of Software Development: A Sneak Peek
 
LLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5thLLMOps: Match report from the top of the 5th
LLMOps: Match report from the top of the 5th
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Open-Source AI: Community is the Way
Open-Source AI: Community is the WayOpen-Source AI: Community is the Way
Open-Source AI: Community is the Way
 
Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product Manager, H2O.ai

  • 1.
  • 2. Data Scientist, H2O.ai GitHub: mlandry22, Email: mark@h2o.ai Mark Landry
  • 3. An Analysis of Driverless AI Feature Creation Driver vs Driverless
  • 4. Overview • Analyze two recent problems with an emphasis on feature generation • Overview of the problem • Discussion of my approach • Show the features Driverless created • Feature creation • Feature representation • Results
  • 5. Who am I? • Data scientist @ H2O since 2015, user since 2014 • 15 top 20 finishes in Kaggle, highest rank 33 • R | H2O | data.table | GBM • The “driver”
  • 6. Problem #1 How Many Attempts will a Student Make • Online question/answer platform with computer science problems • Predict the number of attempts a particular student will make on a particular problem • Data • Student: level, ranking, highest ranking • Problem: type, 3-tier level, points awarded • Training: 124,000 attempt counts • Testing: 60,000 attempt counts
  • 7. Regression as Classification • Natural problem is numerical • End user prefers buckets • Volume • 1: 53% • 2 31% • 3 9% • 4 4% • 5 1% • 6 2%
  • 8. The Driver Approach • Think of it like a recommender problem • Standard: Matrix factorization, collaborative filtering • GBM: use deep categorical encodings • Frequent use of target encoding interaction interaction interaction
  • 9. The Driver Approach A messy chain of hierarchical target encoding and if/else statements
  • 10. The Driver Approach h2o.gbm feature importance – primarily using three target encodings
  • 14. The Driverless Approach Top 5 Features: divided into components • {16} {CV TE} {problem_id} {0} • {51} {CV TE} {points * problem_id} {0} • {3} {max_rating} • {32} {freq} {last_online_time_seconds * rating} • {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
  • 15. The Driverless Approach Feature #1: same base target encoding I found to be the best • {16} {CV TE} {problem_id} {0} • 16: indicator of base features – 16 is later used twice more in the top 15 • CV TE: Cross-Fold target encoding • Problem_id: feature used as basis for target encoding • 0: the target; multinomial, so the two other uses are for class 1 & 5 • {51} {CV TE} {points * problem_id} {0} • {3} {max_rating} • {32} {freq} {last_online_time_seconds * rating} • {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
  • 16. The Driverless Approach Feature #2: deeper interaction • {16} {CV TE} {problem_id} {0} • {51} {CV TE} {points * problem_id} {0} • 51: base feature ID • CV TE: out of sample target encoding result • points * problem: interaction of two different features, both related to the problem; it is subdividing the problem further • 0: again, rate of class 0 as the target • {3} {max_rating} • {32} {freq} {last_online_time_seconds * rating} • {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
  • 17. The Driverless Approach Feature #3: no transformation • {16} {CV TE} {problem_id} {0} • {51} {CV TE} {points * problem_id} {0} • {3} {max_rating} • max rating • used as is – no alternate encoding; was first natural feature in my model as well • this is the first variable of the student dimension • a 4-digit number with close to a normal distribution • {32} {freq} {last_online_time_seconds * rating} • {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
  • 18. The Driverless Approach Feature #4: frequency encoding • {16} {CV TE} {problem_id} {0} • {51} {CV TE} {points * problem_id} {0} • {3} {max_rating} • {32} {freq} {last_online_time_seconds * rating} • Counting the occurrences of two fields • Last online & rating are both numerics in the student dimension • {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
  • 19. The Driverless Approach Feature 5: four-way interaction w/ target encoding • {16} {CV TE} {problem_id} {0} • {51} {CV TE} {points * problem_id} {0} • {3} {max_rating} • {32} {freq} {last_online_time_seconds * rating} • {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0} • Target encoding of class 0 • Finding the rate for each value of the result of a four way interaction
  • 20. The Driverless Approach Top 5 Features: did I try? • YES {16} {CV TE} {problem_id} {0} • NO {51} {CV TE} {points * problem_id} {0} • YES {3} {max_rating} • NO {32} {freq} {last_online_time_seconds * rating} • NO {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
  • 22. Problem #2 Bank Customer Churn • Identify customers likely to churn balances in the next quarter by 50% • Data • 300,00 training rows; 200,000 testing rows • 377 columns • Customer: age, gender, demographics • Reported assets, liabilities • Monthly balance history
  • 23. The Driver Approach Exploit before Explore • 377 columns made [quick] manual investigation harder • Rather than iterate: {analyze > model > analyze > … }, I changed to {model > analyze > model > … } lagged features
  • 24. The Driver Approach After lagging, try differences and ratios • Lagging features present the balance features at several time steps. • But often, the interesting part is not the raw balance itself, but whether it is growing or shrinking • Decision trees have a hard time “seeing” this so it is wise to engineer mathematical features: + - * /
  • 25. The Driver Approach After lagging, try differences and ratios • I used the leading monthly feature from the model and created new features representing month-over-month differences and a binary indicator • One field, one specific length (1 month), two calculations
  • 28. The Driverless Approach It knows math! subtraction subtraction subtraction
  • 29. The Driverless Approach Top 10 Features: divided into categories • (3) Subtraction: #1, #6, #8 • (1) Truncated SVD components: #2 • (2) Cluster Distances: #3, #9 • (1) Target encoding: #4 • (3) Direct features: #5, #7, #10
  • 30. The Driverless Approach Lagged balances also used in clusters & SVD • Distance to cluster #1 after segmenting columns into 6 clusters • BAL_prev6 • D_prev1 • D_prev2 • I_AQB_PrevQ1 • Component #1 of truncated SVD of • D_prev1 • D_prev2 • EOP_prev1_1
  • 31. The Driverless Approach Final Analysis • On first iteration, Driverless AI had surpassed my manual modeling • Features were well beyond what I would have ever attempted • Accuracy was stable: Driverless AI self-reported scores within 1% of competition submission
  • 32. The Driverless Approach Competition Results Kaggle Grandmaster Kaggle Grandmaster Also Driverless AI 100% Driverless AI