SlideShare uma empresa Scribd logo
1 de 32
Baixar para ler offline
How to Effectively Combine
Numerical Features and Categorical Features
June 14th, 2017
Liangjie Hong
Head of Data Science, Etsy Inc.
• Head of Data Science
- Etsy Inc. in NYC, NY (2016. – Present)
- Search & Discovery; Personalization and Recommendation; Computational Advertising
• Senior Manager of Research
- Yahoo Research in Sunnyvale, CA (2013 – 2016)
Leading science efforts for personalization and search sciences
• Published papers in SIGIR, WWW, KDD, CIKM, AAAI, WSDM, RecSys and ICML
• 3 Best Paper Awards, 2000+ Citations with H-Index 18
• Program committee members in KDD, WWW, SIGIR, WSDM, AAAI, EMNLP, ICWSM, ACL, CIKM,
IJCAI and various journal reviewers
Liangjie Hong
About This Paper
• Authors
Qian Zhao, PhD Student from University of Minnesota
Yue Shi, Research Scientist at Facebook
Liangjie Hong, Head of Data Science at Etsy Inc.
• Paper Venue
Full Research Paper in The 26th International World Wide Web Conference, 2017 (WWW 2017)
High-Level Takeaways
• A new family of models to handle categorical features and numerical features well by combining
embedding models and tree-based models
• A simple learning algorithm that can be easily extended from existing data mining and machine learning
toolkits
• State-of-the-art performance on major datasets
Why we need GB-CENT
Why we need GB-CENT
Motivations
• Real-World Data
Categorical features: user ids, item ids, words, document ids, ...
Numerical features: dwell time, average purchase prices, click-through-rate,...
Why we need GB-CENT
Motivations
• Real-World Data
Categorical features: user ids, item ids, words, document ids, ...
Numerical features: dwell time, average purchase prices, click-through-rate,...
• Ideas
Converting categorical features into numerical ones (e.g., statistics, embedding methods, topic models...)
Converting numerical features into categorical ones (e.g., bucketizing, binary codes, sigmoid transformation...)
Why we need GB-CENT
Motivations
Two Families of Powerful Practical Data Mining and Machine Learning Tools
• Tree-based Models
Decision Trees, Random Forest, Gradient Boosted Decision Trees…
• Matrix-based Embedding Models
Matrix Factorization, Factorization Machines…
Why we need GB-CENT: Tree-based Models
• Pros:
Interpretability for simple trees
Effectiveness in certain tasks: IR ranking models
Simple and easy to train
Handle numerical features well
…
Why we need GB-CENT: Tree-based Models
• Pros:
Interpretability for simple trees
Effectiveness in certain tasks: IR ranking models
Simple and easy to train
Handle numerical features well
…
• Cons:
Need one-hot-encoding to handle categorical features and therefore
cannot easily handle features with large cardinality*
For complex trees, features might appear multiple times in a tree – hard
to explain
…
*TIANQI CHEN AND CARLOS GUESTRIN. XGBOOST: A SCALABLE TREE BOOSTING SYSTEM. KDD '16.
Why we need GB-CENT: Embedding-based Models
• Pros:
Predictive power
Effectiveness in certain tasks: recommender systems
Handle categorical features well through one-hot-encoding
…
Why we need GB-CENT: Embedding-based Models
• Pros:
Predictive power
Effectiveness in certain tasks: recommender systems
Handle categorical features well through one-hot-encoding
…
• Cons:
Numerical features usually need preprocessing and hard to handle.
Hard to interpret in general
…
Why we need GB-CENT
Tree-based models are good at numerical features.
Embedding models are good at categorical features.
Why not combine them two?
What is GB-CENT
What is GB-CENT
In a nutshell, GB-CENT is Gradient Boosted Categorical Embedding and Numerical Trees, which combines
• Matrix-based Embedding Models
Handle large-cardinality categorical features…
• Tree-based Models
Handle numerical features…
What is GB-CENT
In a nutshell, GB-CENT is Gradient Boosted Categorical Embedding and Numerical Trees, which combines
• Factorization Machines
Handle large-cardinality categorical features…
• Gradient Boosted Decision Trees
Handle numerical features…
What is GB-CENT
CAT-E (Factorization Machines)
• Bias term for each categorical feature
• Embedding for each categorical feature
• Interactions between meaningful categorical groups
e.g., users, items, age groups, gender...
No numerical features
What is GB-CENT
CAT-NT (Gradient Boosted Decision Trees)
• One tree per categorical feature (potentially)
• For each tree, the training data is all data instances with numerical features containing this particular
categorical feature.
No categorical features
What is GB-CENT
CAT-E (Factorization Machines)
• generalizes categorical features by embedding them into low-dimensional space.
CAT-NT (Gradient Boosted Decision Trees)
• memorizes each categorical feature’s peculiarities.
HENG-TZE CHENG, LEVENT KOC, JEREMIAH HARMSEN, TAL SHAKED, TUSHAR CHANDRA, HRISHI ARADHYE, GLEN ANDERSON, GREG CORRADO,
WEI CHAI, MUSTAFA ISPIR, ROHAN ANIL, ZAKARIA HAQUE, LICHAN HONG, VIHAN JAIN, XIAOBING LIU, AND HEMAL SHAH. WIDE & DEEP
LEARNING FOR RECOMMENDER SYSTEMS. IN PROCEEDINGS OF THE 1ST WORKSHOP ON DEEP LEARNING FOR RECOMMENDER
SYSTEMS (DLRS 2016). ACM, NEW YORK, NY, USA, 7-10.
What is GB-CENT
Different from GBDT:
• The number of trees in GB-CENT depends on the cardinality of categorical features in the data set, while GBDT has a
pre-specified number of trees M.
• Each tree in GB-CENT only takes numerical features as input while GBDT takes in both categorical and numerical
features.
• Learning a tree for GBDT uses all N instances in the data set while the tree for a categorical feature in GB-CENT only
involves its supporting instances.
What is GB-CENT
Training GB-CENT:
• Train CAT-E part firstly using Stochastic Gradient Descent (SGD)
• Train CAT-NT part secondly
What is GB-CENT
Training GB-CENT:
• Train CAT-E part firstly using Stochastic Gradient Descent (SGD)
• Train CAT-NT part secondly
-- 1) Sort categorical features by their support (how many data instances)
-- 2) Check whether we meet minTreeSupport
-- 3) Use maxTreeDepth and minNodeSplit to fit a tree
-- 4) Use minTreeGain to decide whether keeping a tree
How does GB-CENT perform
How does GB-CENT perform
• Datasets
MovieLens
Statistics: 240K users, 33K movies, 22M instances, 5 ratings
Categorical features: user_id, item_id, genre, language, country, grade
Numerical features: year, runTime, imdbVotes, imdbRating, metaScore
RedHat
Statistics: 151K customers, 7 categories, 2M instances, binary response
Categorical features: people_id, activity_category
Numerical features: activity characteristics
How does GB-CENT perform
• Datasets
MovieLens
Evaluation Metric: Root Mean Squared Error (RMSE)
RedHat
Evaluation Metric: Area Under the Curve (AUC)
80% of train, 10% of validation and 10% of testing
We also compare empirical training time.
How does GB-CENT perform
• Baselines
GB-CENT variants:
1) CAT-E
2) CAT-NT
3) GB-CENT
GBDT variants:
1) GBDT-OH: GBDT + One-hot-encoding for categorical features
2) GBDT-CE: Fit CAT-E firstly and then feed into GBDT
FM variants:
1) FM-S: Transform numerical features by sigmoid and feed into FM
2) FM-D: Transform numerical features by discretizing them and feed into FM
SVDFeature variants:
1) SVDFeature-S: Transform numerical features by sigmoid and feed into SVDFeature
2) SVDFeature-D: Transform numerical features by discretizing them and feed into SVDFeature
All latent dimensionality is 20. For GB-CENT, minTreeSupport = 50, minTreeGain = 0.0, minNodeSplit = 50 and maxTreeDepth
= 3.
How does GB-CENT perform
How does GB-CENT perform
Main takeaway: Learn many shallow small trees
How does GB-CENT perform
Main takeaway: Learn many shallow small trees
How does GB-CENT perform
GB-CENT
• Combine Factorization Machines (handle categorical features) and GBDT (handle numerical features) together
• Combine interpretable results and high predictive power
• Achieve high performance in real-world datasets
Summary
Questions

Mais conteúdo relacionado

Mais procurados

Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningJulian Bright
 
Digital Pragmatism with Business Intelligence, Big Data and Data Visualisation
Digital Pragmatism with Business Intelligence, Big Data and Data VisualisationDigital Pragmatism with Business Intelligence, Big Data and Data Visualisation
Digital Pragmatism with Business Intelligence, Big Data and Data VisualisationJen Stirrup
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2Mahmoud Alfarra
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data MiningAmritanshu Mehra
 
Data Mining : Concepts
Data Mining : ConceptsData Mining : Concepts
Data Mining : ConceptsPragya Pandey
 
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data ScienceAjay Ohri
 
Data science and business analytics
Data  science and business analyticsData  science and business analytics
Data science and business analyticsInbavalli Valli
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Edureka!
 
Towards Visualization Recommendation Systems
Towards Visualization Recommendation SystemsTowards Visualization Recommendation Systems
Towards Visualization Recommendation SystemsAditya Parameswaran
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceMark West
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningAbcdDcba12
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CSThanveen
 

Mais procurados (20)

Demystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine LearningDemystifying Data Science with an introduction to Machine Learning
Demystifying Data Science with an introduction to Machine Learning
 
3 classification
3  classification3  classification
3 classification
 
Data mining
Data miningData mining
Data mining
 
Next Generation Text Analytics Platform
Next Generation Text Analytics PlatformNext Generation Text Analytics Platform
Next Generation Text Analytics Platform
 
Digital Pragmatism with Business Intelligence, Big Data and Data Visualisation
Digital Pragmatism with Business Intelligence, Big Data and Data VisualisationDigital Pragmatism with Business Intelligence, Big Data and Data Visualisation
Digital Pragmatism with Business Intelligence, Big Data and Data Visualisation
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2
 
Knowledge Discovery and Data Mining
Knowledge Discovery and Data MiningKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data Mining : Concepts
Data Mining : ConceptsData Mining : Concepts
Data Mining : Concepts
 
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 5 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Data science and business analytics
Data  science and business analyticsData  science and business analytics
Data science and business analytics
 
Ch 1 intro_dw
Ch 1 intro_dwCh 1 intro_dw
Ch 1 intro_dw
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
 
Towards Visualization Recommendation Systems
Towards Visualization Recommendation SystemsTowards Visualization Recommendation Systems
Towards Visualization Recommendation Systems
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data Science
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
Data mining and its applications!
Data mining and its applications!Data mining and its applications!
Data mining and its applications!
 

Semelhante a How to Effectively Combine Numerical Features and Categorical Features

Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learningjaumebp
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!Khalid Salama
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTuri, Inc.
 
rsec2a-2016-jheaton-morning
rsec2a-2016-jheaton-morningrsec2a-2016-jheaton-morning
rsec2a-2016-jheaton-morningJeff Heaton
 
Big Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiBig Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiVijay Susheedran C G
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introductionNeeraj Tewari
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
Prepare your data for machine learning
Prepare your data for machine learningPrepare your data for machine learning
Prepare your data for machine learningIvo Andreev
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017StampedeCon
 
Deep feature synthesis
Deep feature synthesisDeep feature synthesis
Deep feature synthesisDurra Sahtout
 
What is cluster analysis
What is cluster analysisWhat is cluster analysis
What is cluster analysisPrabhat gangwar
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using   Genetics-Based Machine LearningLarge Scale Data Mining using   Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningXavier Llorà
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptAravindReddy565690
 
Predict oscars (4:17)
Predict oscars (4:17)Predict oscars (4:17)
Predict oscars (4:17)Thinkful
 
Computational intelligence for big data analytics bda 2013
Computational intelligence for big data analytics   bda 2013Computational intelligence for big data analytics   bda 2013
Computational intelligence for big data analytics bda 2013oj08
 
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...KamleshKumar394
 
Challenges Faced & Lessons Learned Conducting Cleveland Clinic's First UX Stu...
Challenges Faced & Lessons Learned Conducting Cleveland Clinic's First UX Stu...Challenges Faced & Lessons Learned Conducting Cleveland Clinic's First UX Stu...
Challenges Faced & Lessons Learned Conducting Cleveland Clinic's First UX Stu...Kaitlan Chu
 
chương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfchương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfphongnguyen312110237
 

Semelhante a How to Effectively Combine Numerical Features and Categorical Features (20)

Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine LearningLarge Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
 
rsec2a-2016-jheaton-morning
rsec2a-2016-jheaton-morningrsec2a-2016-jheaton-morning
rsec2a-2016-jheaton-morning
 
Big Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiBig Data Real Time Training in Chennai
Big Data Real Time Training in Chennai
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introduction
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Prepare your data for machine learning
Prepare your data for machine learningPrepare your data for machine learning
Prepare your data for machine learning
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Data Mining Lecture_1.pptx
Data Mining Lecture_1.pptxData Mining Lecture_1.pptx
Data Mining Lecture_1.pptx
 
Deep feature synthesis
Deep feature synthesisDeep feature synthesis
Deep feature synthesis
 
What is cluster analysis
What is cluster analysisWhat is cluster analysis
What is cluster analysis
 
Large Scale Data Mining using Genetics-Based Machine Learning
Large Scale Data Mining using   Genetics-Based Machine LearningLarge Scale Data Mining using   Genetics-Based Machine Learning
Large Scale Data Mining using Genetics-Based Machine Learning
 
Data Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).pptData Mining- Unit-I PPT (1).ppt
Data Mining- Unit-I PPT (1).ppt
 
Predict oscars (4:17)
Predict oscars (4:17)Predict oscars (4:17)
Predict oscars (4:17)
 
Computational intelligence for big data analytics bda 2013
Computational intelligence for big data analytics   bda 2013Computational intelligence for big data analytics   bda 2013
Computational intelligence for big data analytics bda 2013
 
Machine learning
Machine learning Machine learning
Machine learning
 
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
A Comprehensive Study of Clustering Algorithms for Big Data Mining with MapRe...
 
Challenges Faced & Lessons Learned Conducting Cleveland Clinic's First UX Stu...
Challenges Faced & Lessons Learned Conducting Cleveland Clinic's First UX Stu...Challenges Faced & Lessons Learned Conducting Cleveland Clinic's First UX Stu...
Challenges Faced & Lessons Learned Conducting Cleveland Clinic's First UX Stu...
 
chương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdfchương 1 - Tổng quan về khai phá dữ liệu.pdf
chương 1 - Tổng quan về khai phá dữ liệu.pdf
 

Mais de Domino Data Lab

What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...Domino Data Lab
 
The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...Domino Data Lab
 
Racial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops dataRacial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops dataDomino Data Lab
 
Data Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itData Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itDomino Data Lab
 
Supporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationSupporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationDomino Data Lab
 
Leveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryLeveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryDomino Data Lab
 
Summertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile VirusSummertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile VirusDomino Data Lab
 
Reproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterReproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterDomino Data Lab
 
GeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data ScienceGeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data ScienceDomino Data Lab
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Domino Data Lab
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Domino Data Lab
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at ScaleDomino Data Lab
 
How I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataHow I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataDomino Data Lab
 
Software Engineering for Data Scientists
Software Engineering for Data ScientistsSoftware Engineering for Data Scientists
Software Engineering for Data ScientistsDomino Data Lab
 
Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Domino Data Lab
 
Building Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyBuilding Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyDomino Data Lab
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsDomino Data Lab
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino Data Lab
 
The Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceThe Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceDomino Data Lab
 

Mais de Domino Data Lab (20)

What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...What's in your workflow? Bringing data science workflows to business analysis...
What's in your workflow? Bringing data science workflows to business analysis...
 
The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...The Proliferation of New Database Technologies and Implications for Data Scie...
The Proliferation of New Database Technologies and Implications for Data Scie...
 
Racial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops dataRacial Bias in Policing: an analysis of Illinois traffic stops data
Racial Bias in Policing: an analysis of Illinois traffic stops data
 
Data Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itData Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using it
 
Supporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationSupporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentation
 
Leveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive IndustryLeveraging Data Science in the Automotive Industry
Leveraging Data Science in the Automotive Industry
 
Summertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile VirusSummertime Analytics: Predicting E. coli and West Nile Virus
Summertime Analytics: Predicting E. coli and West Nile Virus
 
Reproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterReproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with Jupyter
 
GeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data ScienceGeoViz: A Canvas for Data Science
GeoViz: A Canvas for Data Science
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
 
Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)Doing your first Kaggle (Python for Big Data sets)
Doing your first Kaggle (Python for Big Data sets)
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at Scale
 
How I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataHow I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked Data
 
Software Engineering for Data Scientists
Software Engineering for Data ScientistsSoftware Engineering for Data Scientists
Software Engineering for Data Scientists
 
Making Big Data Smart
Making Big Data SmartMaking Big Data Smart
Making Big Data Smart
 
Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...Moving Data Science from an Event to A Program: Considerations in Creating Su...
Moving Data Science from an Event to A Program: Considerations in Creating Su...
 
Building Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technologyBuilding Data Analytics pipelines in the cloud using serverless technology
Building Data Analytics pipelines in the cloud using serverless technology
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...
 
The Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data ScienceThe Role and Importance of Curiosity in Data Science
The Role and Importance of Curiosity in Data Science
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 

Último (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 

How to Effectively Combine Numerical Features and Categorical Features

  • 1. How to Effectively Combine Numerical Features and Categorical Features June 14th, 2017 Liangjie Hong Head of Data Science, Etsy Inc.
  • 2. • Head of Data Science - Etsy Inc. in NYC, NY (2016. – Present) - Search & Discovery; Personalization and Recommendation; Computational Advertising • Senior Manager of Research - Yahoo Research in Sunnyvale, CA (2013 – 2016) Leading science efforts for personalization and search sciences • Published papers in SIGIR, WWW, KDD, CIKM, AAAI, WSDM, RecSys and ICML • 3 Best Paper Awards, 2000+ Citations with H-Index 18 • Program committee members in KDD, WWW, SIGIR, WSDM, AAAI, EMNLP, ICWSM, ACL, CIKM, IJCAI and various journal reviewers Liangjie Hong
  • 3. About This Paper • Authors Qian Zhao, PhD Student from University of Minnesota Yue Shi, Research Scientist at Facebook Liangjie Hong, Head of Data Science at Etsy Inc. • Paper Venue Full Research Paper in The 26th International World Wide Web Conference, 2017 (WWW 2017)
  • 4. High-Level Takeaways • A new family of models to handle categorical features and numerical features well by combining embedding models and tree-based models • A simple learning algorithm that can be easily extended from existing data mining and machine learning toolkits • State-of-the-art performance on major datasets
  • 5. Why we need GB-CENT
  • 6. Why we need GB-CENT Motivations • Real-World Data Categorical features: user ids, item ids, words, document ids, ... Numerical features: dwell time, average purchase prices, click-through-rate,...
  • 7. Why we need GB-CENT Motivations • Real-World Data Categorical features: user ids, item ids, words, document ids, ... Numerical features: dwell time, average purchase prices, click-through-rate,... • Ideas Converting categorical features into numerical ones (e.g., statistics, embedding methods, topic models...) Converting numerical features into categorical ones (e.g., bucketizing, binary codes, sigmoid transformation...)
  • 8. Why we need GB-CENT Motivations Two Families of Powerful Practical Data Mining and Machine Learning Tools • Tree-based Models Decision Trees, Random Forest, Gradient Boosted Decision Trees… • Matrix-based Embedding Models Matrix Factorization, Factorization Machines…
  • 9. Why we need GB-CENT: Tree-based Models • Pros: Interpretability for simple trees Effectiveness in certain tasks: IR ranking models Simple and easy to train Handle numerical features well …
  • 10. Why we need GB-CENT: Tree-based Models • Pros: Interpretability for simple trees Effectiveness in certain tasks: IR ranking models Simple and easy to train Handle numerical features well … • Cons: Need one-hot-encoding to handle categorical features and therefore cannot easily handle features with large cardinality* For complex trees, features might appear multiple times in a tree – hard to explain … *TIANQI CHEN AND CARLOS GUESTRIN. XGBOOST: A SCALABLE TREE BOOSTING SYSTEM. KDD '16.
  • 11. Why we need GB-CENT: Embedding-based Models • Pros: Predictive power Effectiveness in certain tasks: recommender systems Handle categorical features well through one-hot-encoding …
  • 12. Why we need GB-CENT: Embedding-based Models • Pros: Predictive power Effectiveness in certain tasks: recommender systems Handle categorical features well through one-hot-encoding … • Cons: Numerical features usually need preprocessing and hard to handle. Hard to interpret in general …
  • 13. Why we need GB-CENT Tree-based models are good at numerical features. Embedding models are good at categorical features. Why not combine them two?
  • 15. What is GB-CENT In a nutshell, GB-CENT is Gradient Boosted Categorical Embedding and Numerical Trees, which combines • Matrix-based Embedding Models Handle large-cardinality categorical features… • Tree-based Models Handle numerical features…
  • 16. What is GB-CENT In a nutshell, GB-CENT is Gradient Boosted Categorical Embedding and Numerical Trees, which combines • Factorization Machines Handle large-cardinality categorical features… • Gradient Boosted Decision Trees Handle numerical features…
  • 17. What is GB-CENT CAT-E (Factorization Machines) • Bias term for each categorical feature • Embedding for each categorical feature • Interactions between meaningful categorical groups e.g., users, items, age groups, gender... No numerical features
  • 18. What is GB-CENT CAT-NT (Gradient Boosted Decision Trees) • One tree per categorical feature (potentially) • For each tree, the training data is all data instances with numerical features containing this particular categorical feature. No categorical features
  • 19. What is GB-CENT CAT-E (Factorization Machines) • generalizes categorical features by embedding them into low-dimensional space. CAT-NT (Gradient Boosted Decision Trees) • memorizes each categorical feature’s peculiarities. HENG-TZE CHENG, LEVENT KOC, JEREMIAH HARMSEN, TAL SHAKED, TUSHAR CHANDRA, HRISHI ARADHYE, GLEN ANDERSON, GREG CORRADO, WEI CHAI, MUSTAFA ISPIR, ROHAN ANIL, ZAKARIA HAQUE, LICHAN HONG, VIHAN JAIN, XIAOBING LIU, AND HEMAL SHAH. WIDE & DEEP LEARNING FOR RECOMMENDER SYSTEMS. IN PROCEEDINGS OF THE 1ST WORKSHOP ON DEEP LEARNING FOR RECOMMENDER SYSTEMS (DLRS 2016). ACM, NEW YORK, NY, USA, 7-10.
  • 20. What is GB-CENT Different from GBDT: • The number of trees in GB-CENT depends on the cardinality of categorical features in the data set, while GBDT has a pre-specified number of trees M. • Each tree in GB-CENT only takes numerical features as input while GBDT takes in both categorical and numerical features. • Learning a tree for GBDT uses all N instances in the data set while the tree for a categorical feature in GB-CENT only involves its supporting instances.
  • 21. What is GB-CENT Training GB-CENT: • Train CAT-E part firstly using Stochastic Gradient Descent (SGD) • Train CAT-NT part secondly
  • 22. What is GB-CENT Training GB-CENT: • Train CAT-E part firstly using Stochastic Gradient Descent (SGD) • Train CAT-NT part secondly -- 1) Sort categorical features by their support (how many data instances) -- 2) Check whether we meet minTreeSupport -- 3) Use maxTreeDepth and minNodeSplit to fit a tree -- 4) Use minTreeGain to decide whether keeping a tree
  • 23. How does GB-CENT perform
  • 24. How does GB-CENT perform • Datasets MovieLens Statistics: 240K users, 33K movies, 22M instances, 5 ratings Categorical features: user_id, item_id, genre, language, country, grade Numerical features: year, runTime, imdbVotes, imdbRating, metaScore RedHat Statistics: 151K customers, 7 categories, 2M instances, binary response Categorical features: people_id, activity_category Numerical features: activity characteristics
  • 25. How does GB-CENT perform • Datasets MovieLens Evaluation Metric: Root Mean Squared Error (RMSE) RedHat Evaluation Metric: Area Under the Curve (AUC) 80% of train, 10% of validation and 10% of testing We also compare empirical training time.
  • 26. How does GB-CENT perform • Baselines GB-CENT variants: 1) CAT-E 2) CAT-NT 3) GB-CENT GBDT variants: 1) GBDT-OH: GBDT + One-hot-encoding for categorical features 2) GBDT-CE: Fit CAT-E firstly and then feed into GBDT FM variants: 1) FM-S: Transform numerical features by sigmoid and feed into FM 2) FM-D: Transform numerical features by discretizing them and feed into FM SVDFeature variants: 1) SVDFeature-S: Transform numerical features by sigmoid and feed into SVDFeature 2) SVDFeature-D: Transform numerical features by discretizing them and feed into SVDFeature All latent dimensionality is 20. For GB-CENT, minTreeSupport = 50, minTreeGain = 0.0, minNodeSplit = 50 and maxTreeDepth = 3.
  • 27. How does GB-CENT perform
  • 28. How does GB-CENT perform Main takeaway: Learn many shallow small trees
  • 29. How does GB-CENT perform Main takeaway: Learn many shallow small trees
  • 30. How does GB-CENT perform
  • 31. GB-CENT • Combine Factorization Machines (handle categorical features) and GBDT (handle numerical features) together • Combine interpretable results and high predictive power • Achieve high performance in real-world datasets Summary