Version 1.0
Machine Learning - Feature
Selection
Feature selection describes the process of picking particular,
relevant data features out of a wider data set, to be used to
perform model training.
Obioma Anomnachi
Engineer @ Anant
Data Preparation
● Data preparation deals with
transformations applied to data that
prepare it for use with machine
learning algorithms
○ Previously, we’ve covered a number
of methods within the field:
https://blog.anant.us/spark-and-cassandra-for-machine-learning-data-pre-processing/
○ Vectorization and Encoding help
organize raw data into a form that
ML models can work with
○ Standardization can help to better
express the variance within data and
prepare it for models that expect
data within certain ranges
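As a rough illustration of the standardization point above, the sketch below rescales a single numeric column in two common ways. The function names and the `incomes` column are hypothetical, chosen only for this example; this is not code from the linked post.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column of raw values on a large scale.
incomes = [20_000.0, 40_000.0, 60_000.0, 80_000.0]

z_scores = standardize(incomes)   # centred on 0, spread of 1
scaled = min_max_scale(incomes)   # smallest value maps to 0.0, largest to 1.0
print(z_scores)
print(scaled)
```

Both functions assume the column is not constant; a constant column would cause a division by zero.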
Data Preparation (2)
● Imputation is one of a number of methods
for dealing with missing fields for particular
rows within your data
● Feature selection actually falls within the
same category as PCA, a previously covered
topic. Both methods are types of
dimensionality reduction.
○ Dimensionality reduction focuses on
removing irrelevant data from the data set to
reduce computational costs, improve model
performance, and work towards “legibility” -
or the ability of the model to be understood
by humans.
Feature Selection - Overview
● Feature selection, as a subcategory of dimensionality reduction, is concerned with picking the
most relevant features out of a dataset. It is a process for removing irrelevant or misleading
columns from a dataset before any models are trained.
○ Just like ML models in general, feature selection methods can be supervised or unsupervised, depending on
whether the data that they interact with is labeled or not.
■ Unsupervised feature selection processes do not have a label against which they can compare the
relevance of the data, so the most they can accomplish is to remove redundant data from the data set.
■ Supervised processes can compare how highly certain fields are correlated with the label we want the
model to predict in the end, so data can be defined as irrelevant if it has no bearing on that outcome.
■ Essentially, supervised methods are about the relationship between your data and the labels while
unsupervised methods are about the relationships between your data and the rest of your data.
Feature Selection - Unsupervised
● Unsupervised methods can work within individual features to remove ones that, even in isolation, fail
to add information to the wider data set.
○ Variance Thresholds are used to remove any fields with variance below a certain value.
○ In the most extreme case, fields that contain the same value for every row in the dataset can safely be
dropped. Variance thresholds allow less extreme settings but generally accomplish the same type of thing.
● They can also work across the entire set of features to remove redundant ones.
○ A correlation matrix can be built between fields in the data set. When two fields show extremely high correlation
with each other, only one of them needs to be kept.
○ For an extreme example, consider a data set that contains two fields measuring the exact same thing with
different units. At most, one of those should make it into the training set. In a correlation matrix they would show a
correlation of 1 with each other, signalling that we only need one.
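The two unsupervised ideas above can be sketched in a few lines of plain Python. This is a minimal sketch, not the demo from the talk: the function names, the thresholds (`0.0` and `0.95`) and the toy columns are all assumptions made for illustration.

```python
from math import sqrt
from statistics import pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def drop_low_variance(columns, threshold=0.0):
    """Keep only columns whose population variance exceeds the threshold."""
    return {name: vals for name, vals in columns.items()
            if pvariance(vals) > threshold}


def drop_redundant(columns, max_abs_corr=0.95):
    """Greedily keep a column only if it is not highly correlated
    with any column that has already been kept."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) < max_abs_corr for other in kept.values()):
            kept[name] = vals
    return kept


# Hypothetical toy data: one length in two units, a constant field, and a weight.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.06, 62.99, 66.93, 70.87, 74.80],
    "country":   [1, 1, 1, 1, 1],
    "weight_kg": [70.0, 52.0, 80.0, 61.0, 75.0],
}

varied = drop_low_variance(data)    # removes the constant "country" field
selected = drop_redundant(varied)   # removes "height_in", which duplicates "height_cm"
print(list(selected))               # ['height_cm', 'weight_kg']
```

The variance filter runs first on purpose: a constant column has zero spread, so its correlation with anything is undefined. Which field of a correlated pair survives depends only on column order here; a real pipeline would likely want a more deliberate rule.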
Feature Selection - Supervised
● Supervised feature selection methods compare
predictor fields to the label field, picking out
the fields most relevant to prediction
outcomes.
○ Supervised methods get divided further into
three groups.
○ Filter methods use statistical or
information-theoretic measures to score each
field on its relation to the label field, then
drop the least relevant fields.
○ Wrapper methods train and test models on
different subsets of fields, using the testing
results to determine the best subset to keep.
○ Intrinsic methods combine the training and
testing steps of the wrapper methods with
rule based methods for selecting out subsets
of fields to test
Feature Selection - Supervised - Filter Methods
● Filter methods use statistical analysis to
perform feature selection. Which test to use
depends on the data types of the label field
and the predictor field being analyzed.
○ Numerical fields cover any fields with integer
or decimal types.
○ Categorical fields include boolean types,
ordinal categories, and nominal categories.
● Each of these combinations has various
associated statistical tests. Some of these are
familiar, like Pearson’s correlation
coefficient, a measure of linear correlation,
and ANOVA, a test of statistical significance
widely used in scientific research.
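To make the two tests named above concrete, here is a minimal sketch of each used as a filter score: Pearson's correlation for a numerical predictor against a numerical label, and a one-way ANOVA F statistic for a numerical predictor against a categorical label. The field names and values are hypothetical, and only the raw statistics are computed (no p-values).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def anova_f(values, groups):
    """One-way ANOVA F statistic for a numeric field split by a categorical label."""
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    n, k = len(values), len(by_group)
    grand_mean = sum(values) / n
    means = {g: sum(vs) / len(vs) for g, vs in by_group.items()}
    ss_between = sum(len(vs) * (means[g] - grand_mean) ** 2 for g, vs in by_group.items())
    ss_within = sum((v - means[g]) ** 2 for g, vs in by_group.items() for v in vs)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Numerical label: rank predictors by the absolute value of their correlation with it.
predictors = {
    "size_m2":   [50, 60, 80, 100, 120],
    "age_years": [30, 5, 18, 12, 25],
}
price = [150, 185, 240, 310, 355]
ranked = sorted(predictors, key=lambda name: abs(pearson(predictors[name], price)),
                reverse=True)
print(ranked)  # most relevant predictor first

# Categorical label: a larger F statistic means the field separates the classes better.
label = ["a", "a", "a", "b", "b", "b"]
signal = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9]
noise = [2.0, 1.0, 3.0, 2.1, 0.9, 3.0]
f_signal = anova_f(signal, label)
f_noise = anova_f(noise, label)
print(f_signal, f_noise)
```

In both cases the filter step is the same: score every field against the label, then keep the top-scoring fields or those above a chosen cutoff.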
Feature Selection - Supervised - Wrapper Methods
● Wrapper methods train models on subsets of fields and evaluate the
performance of those models to determine the best subset of features to
select.
○ The most obvious method included in this subset is the Exhaustive Feature Selection
method, where each combination of features is used to train a model. Each
model’s performance is compared and the best performing subset is selected as
the set of features for the actual learning task. This returns the best performing
subset of features over all of the possible combinations, but the number of
combinations grows exponentially with the number of features.
○ Other techniques include:
■ Forward Feature Selection - Start with the best single feature and add
features until criteria are met.
■ Backward Feature Elimination - Start with all of the features and remove
them until criteria are satisfied.
■ Recursive Feature Elimination - Recursively remove features or groups
of features that are determined to be least important
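The sketch below shows exhaustive and forward selection wrapped around a very small model. The model (a nearest-centroid classifier), the score (leave-one-out accuracy), the stopping rule and the toy rows are all assumptions chosen to keep the example self-contained; any model and evaluation metric could take their place.

```python
from itertools import combinations


def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier that
    only looks at the given feature indices."""
    correct = 0
    for i, (row, label) in enumerate(zip(rows, labels)):
        # Build one centroid per class from every row except the held-out one.
        sums, counts = {}, {}
        for j, (other, other_label) in enumerate(zip(rows, labels)):
            if j == i:
                continue
            acc = sums.setdefault(other_label, [0.0] * len(features))
            for pos, f in enumerate(features):
                acc[pos] += other[f]
            counts[other_label] = counts.get(other_label, 0) + 1

        def distance(cls):
            return sum((row[f] - sums[cls][pos] / counts[cls]) ** 2
                       for pos, f in enumerate(features))

        predicted = min(sums, key=distance)
        correct += predicted == label
    return correct / len(rows)


def exhaustive_selection(rows, labels, n_features):
    """Score every non-empty feature subset and return the best one.
    Ties go to the subset generated first, i.e. the smallest."""
    subsets = (subset for size in range(1, n_features + 1)
               for subset in combinations(range(n_features), size))
    return max(subsets, key=lambda s: loo_accuracy(rows, labels, s))


def forward_selection(rows, labels, n_features):
    """Greedily add the feature that improves the score most; stop when none does."""
    selected, best_score = [], 0.0
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        score, feature = max((loo_accuracy(rows, labels, selected + [f]), f)
                             for f in candidates)
        if score <= best_score:
            break
        selected.append(feature)
        best_score = score
    return selected, best_score


# Hypothetical data: feature 0 separates the classes, features 1 and 2 are noise.
rows = [
    [1.0, 5.0, 0.3], [1.2, 9.0, 0.9], [0.9, 2.0, 0.5], [1.1, 7.0, 0.1],
    [3.0, 6.0, 0.4], [3.2, 1.0, 0.8], [2.9, 8.0, 0.2], [3.1, 3.0, 0.7],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

best_subset = exhaustive_selection(rows, labels, n_features=3)
forward_subset, forward_score = forward_selection(rows, labels, n_features=3)
print(best_subset, forward_subset, forward_score)
```

Exhaustive selection evaluates all 7 non-empty subsets of the 3 features here. Forward selection evaluates far fewer, at the cost of possibly missing the best combination on harder data.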
Feature Selection - Supervised - Intrinsic Methods
● Intrinsic methods are similar to wrapper methods of feature selection in that they involve training
a model.
○ While wrapper methods do preliminary training of example models in order to extract statistical information,
intrinsic methods take place during the actual model training process.
○ L1 regularization, or LASSO, directly changes the cost function to help avoid overfitting. It does this by
adding a penalty proportional to the absolute value of each field’s coefficient. That penalty can drive
coefficients all the way down to zero, effectively removing those fields from the model.
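A minimal sketch of how an L1 penalty zeroes out a coefficient, using cyclic coordinate descent on the objective (1 / 2n) * sum((y - Xw)^2) + alpha * sum(|w|). The solver, the `alpha` value and the toy data are assumptions made for illustration. The data is constructed so that both columns and the target are already centred, which lets the sketch skip the intercept.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; anything within [-alpha, alpha] becomes 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_coordinate_descent(rows, targets, alpha, n_iters=200):
    """Minimise (1 / 2n) * sum((y - X w)^2) + alpha * sum(|w|).
    Assumes columns and targets are already centred (no intercept term)."""
    n, p = len(rows), len(rows[0])
    weights = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            # Correlation of feature j with the residual that leaves feature j out.
            rho = sum(
                row[j] * (y - sum(w * x for w, x in zip(weights, row)) + weights[j] * row[j])
                for row, y in zip(rows, targets)
            ) / n
            scale = sum(row[j] ** 2 for row in rows) / n
            weights[j] = soft_threshold(rho, alpha) / scale
    return weights


# Hypothetical data: the target is roughly 3 * feature 0; feature 1 is unrelated.
rows = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 0.0], [1.0, 2.0], [2.0, -1.0]]
targets = [-5.9, -3.1, 0.0, 3.1, 5.9]

unpenalised = lasso_coordinate_descent(rows, targets, alpha=0.0)
penalised = lasso_coordinate_descent(rows, targets, alpha=0.5)
print(unpenalised)  # both coefficients non-zero
print(penalised)    # the coefficient of feature 1 is exactly 0.0
```

With `alpha=0.0` the solver reduces to ordinary least squares and keeps a small weight on the unrelated feature. With `alpha=0.5` that weight is cut to zero, while the weight on the informative feature is only shrunk.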
Demo
Resources
● https://sebastianraschka.com/faq/docs/feature_sele_categories.html
● http://www.feat.engineering/classes-of-feature-selection-methodologies.html
● https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/#:~:text=Feature%20selection%20is%20the%20process,the%20performance%20of%20the%20model.
● https://scikit-learn.org/stable/modules/feature_selection.html
● https://en.wikipedia.org/wiki/Feature_selection
● https://www.simplilearn.com/tutorials/machine-learning-tutorial/feature-selection-in-machine-learning
● https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM,CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037

 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 

Data Engineer’s Lunch #67: Machine Learning - Feature Selection

  • 1. Version 1.0 Machine Learning - Feature Selection Feature selection describes the process of picking particular, relevant data features out of a wider data set, to be used to perform model training. Obioma Anomnachi Engineer @ Anant
  • 2. Data Preparation ● Data preparation deals with transformations applied to data that prepare it for use with machine learning algorithms ○ Previously, we’ve covered a number of methods within the field: https://blog.anant.us/spark-and-cassandra-for-machine-learning-data-pre-processing/ ○ Vectorization and Encoding help organize raw data into a form that ML models can work with ○ Standardization can help to better express the variance within data and prepare it for models that expect data within certain ranges
  • 3. Data Preparation (2) ● Imputation is one of a number of methods for dealing with missing fields for particular rows within your data ● Feature selection actually falls within the same category as PCA, a previously covered topic. Both methods are types of dimensionality reduction. ○ Dimensionality reduction focuses on removing irrelevant data from the data set to reduce computational costs, improve model performance, and work towards “legibility” - or the ability of the model to be understood by humans.
  • 4. Feature Selection - Overview ● Feature selection, as a subcategory of dimensionality reduction, is concerned with picking the most relevant features out of a dataset. It is a process for removing irrelevant or misleading columns from a dataset before any models are trained. ○ Just like ML models in general, feature selection methods can be supervised or unsupervised, depending on whether the data that they interact with is labeled or not. ■ Unsupervised feature selection processes do not have a label against which they can compare the relevance of the data, so the most they can accomplish is to remove redundant data from the data set. ■ Supervised processes can compare how highly certain fields are correlated with the label we want the model to predict in the end, so data can be defined as irrelevant if it has no bearing on that outcome. ■ Essentially, supervised methods are about the relationship between your data and the labels, while unsupervised methods are about the relationships between your data and the rest of your data.
  • 5. Feature Selection - Unsupervised ● Unsupervised methods can work within single features to remove ones that, even in isolation, fail to add information to the wider data set. ○ Variance thresholds are used to remove any fields with variance below a certain value. ○ In the most extreme case, fields that contain the same value for every row in the dataset can safely be dropped. Variance thresholds allow less extreme settings but generally accomplish the same type of thing. ● They can also work across the entire set of features to remove redundant ones. ○ A correlation matrix can be built between fields in the data set. When two fields show extremely high correlation with each other, only one of them needs to be kept. ○ For an extreme example, consider a data set that contains two fields measuring the exact same thing in different units. At most, one of those should make it into the training set. In the correlation matrix they would show a correlation of 1 with each other, signalling that we only need one.
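The two unsupervised checks described on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the talk's demo: the column names, the sample values, and the `0.95` correlation limit are all assumptions made up for the example.

```python
from math import sqrt
from statistics import pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def drop_low_variance(columns, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {name: vals for name, vals in columns.items() if pvariance(vals) > threshold}


def drop_correlated(columns, limit=0.95):
    """Greedily keep a column only if |r| against every kept column is <= limit.

    Assumes constant columns were already removed, since their correlation
    is undefined (division by zero).
    """
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) <= limit for other in kept.values()):
            kept[name] = vals
    return kept


# Hypothetical data set: the same height in two units, one constant field.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.06, 62.99, 66.93, 70.87, 74.80],
    "country_code": [1.0, 1.0, 1.0, 1.0, 1.0],
    "weight_kg": [70.0, 55.0, 90.0, 62.0, 80.0],
}
selected = drop_correlated(drop_low_variance(data))
print(list(selected))
```

The constant `country_code` field is removed by the variance threshold, and `height_in` is removed because it is almost perfectly correlated with `height_cm`. Which member of a correlated pair survives depends only on column order here; a real pipeline might prefer the field with fewer missing values or clearer units.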
  • 6. Feature Selection - Supervised ● Supervised feature selection methods compare predictor fields to the label field, picking out the fields most relevant to prediction outcomes. ○ Supervised methods are divided further into three groups. ○ Filter methods use statistical or information-theoretic measures to score fields by their relation to the label field and drop the least relevant ones. ○ Wrapper methods train and test models on different subsets of fields, using the testing results to determine which fields to keep. ○ Intrinsic methods combine the training and testing steps of the wrapper methods with rule-based methods for selecting subsets of fields to test.
  • 7. Feature Selection - Supervised - Filter Methods ● Filter methods use statistical analysis to perform feature selection. Which algorithm to use depends on the types of the label field and the predictor field being analyzed. ○ Numerical fields cover any fields with integer or decimal types. ○ Categorical fields include boolean types, ordinal categories, and nominal categories. ● Each of these combinations has various associated statistical tests. Some of these are familiar, like Pearson’s correlation coefficient, a measure of correlation, and ANOVA, a test of statistical significance used in scientific research.
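As a rough sketch of a filter method for the numerical-predictor, numerical-label case, the snippet below scores each field by the absolute value of its Pearson correlation with the label and keeps the top `k`. The helper names, field names, and data are hypothetical; other type combinations would need a different statistic (for example ANOVA for a numerical predictor against a categorical label).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def select_k_best(features, label, k):
    """Rank fields by |correlation with the label| and return the top k names."""
    scores = {name: abs(pearson(vals, label)) for name, vals in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical data: predict a price from three numerical fields.
price = [10.0, 20.0, 30.0, 40.0, 50.0]
features = {
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
    "rooms": [1.0, 3.0, 2.0, 5.0, 4.0],
    "area": [1.0, 2.0, 3.0, 4.0, 6.0],
}
top, scores = select_k_best(features, price, k=2)
print(top)
```

Because no model is trained, the scoring is cheap, but each field is judged on its own, so interactions between fields are not taken into account.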
  • 8. Feature Selection - Supervised - Wrapper Methods ● Wrapper methods train models on subsets of fields and evaluate the performance of those models to determine the best subset of features to select. ○ The most obvious method in this group is Exhaustive Feature Selection, where each combination of features is used to train a model. Each model’s performance is compared and the best performing subset is selected as the set of features for the actual learning task. This returns the best performing subset of features over all of the possible combinations, but the number of combinations doubles with every added feature, so it is only practical for small feature sets. ○ Other techniques include: ■ Forward Feature Selection - Start with the best single feature and add features until criteria are met. ■ Backward Feature Elimination - Start with all of the features and remove them until criteria are satisfied. ■ Recursive Feature Elimination - Recursively remove features or groups of features that are determined to be least important.
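Exhaustive feature selection can be illustrated with a small standard-library sketch. The model used here, a 1-nearest-neighbour classifier scored by leave-one-out accuracy, is an assumption chosen only because it needs no external library; the rows and labels are invented for the example.

```python
from itertools import combinations


def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    that only looks at the column indices in `subset`."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[c] - rows[j][c]) ** 2 for c in subset),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def exhaustive_selection(rows, labels):
    """Evaluate every non-empty feature subset and return the best one.

    Subsets are tried from smallest to largest, and a subset replaces the
    current best only if it is strictly better, so ties favour fewer features.
    """
    n_features = len(rows[0])
    best_subset, best_score = None, -1.0
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            score = loo_accuracy(rows, labels, subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


# Hypothetical data: column 0 separates the classes, columns 1 and 2 do not.
rows = [
    (0.0, 5.0, 1.0),
    (0.1, 1.0, 9.0),
    (0.2, 9.0, 4.0),
    (1.0, 5.1, 8.9),
    (1.1, 1.1, 4.1),
    (1.2, 9.1, 1.1),
]
labels = [0, 0, 0, 1, 1, 1]
best_subset, best_score = exhaustive_selection(rows, labels)
print(best_subset, best_score)
```

With `n` features the loop trains and scores `2**n - 1` models, which is why forward selection, backward elimination, and recursive elimination are usually preferred once the feature count grows.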
  • 9. Feature Selection - Supervised - Intrinsic Methods ● Intrinsic methods are similar to wrapper methods of feature selection in that they involve training a model. ○ While wrapper methods do preliminary training of example models in order to extract statistical information, intrinsic methods take place during the actual model training process. ○ L1 Regularization, or LASSO, directly changes the cost function to help avoid overfitting. It does this by adding a penalty proportional to the absolute value of the coefficient of each field in the training set. This penalty can drive coefficients all the way down to zero, effectively removing those fields from the data set.
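To show how the L1 penalty zeroes out coefficients, here is a minimal coordinate-descent sketch of LASSO in plain Python. It assumes centered data with no intercept and minimizes `(1 / (2n)) * sum of squared errors + alpha * sum(|w_j|)`; the function names, the data, and `alpha=0.5` are illustrative assumptions, and a real project would more likely use a library implementation such as the one in scikit-learn.

```python
def soft_threshold(value, alpha):
    """Shrink value towards zero by alpha; values within [-alpha, alpha] become 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Fit LASSO weights by cycling over one coefficient at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w


# Hypothetical data: y = 3.0 * x0 + 0.1 * x1, so x1 barely matters.
X = [
    [1.0, 1.0],
    [1.0, -1.0],
    [-1.0, 1.0],
    [-1.0, -1.0],
]
y = [3.1, 2.9, -2.9, -3.1]
weights = lasso_coordinate_descent(X, y, alpha=0.5)
print(weights)
```

The weight for the weak second field lands at exactly zero, so that field is effectively dropped, while the strong first field keeps a shrunken weight of about 2.5 instead of 3.0. Raising `alpha` removes more fields at the cost of more shrinkage on the ones that remain.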
  • 10. Demo
  • 11. Resources ● https://sebastianraschka.com/faq/docs/feature_sele_categories.html ● http://www.feat.engineering/classes-of-feature-selection-methodologies.html ● https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/#:~:text=Feature%20selection%20is%20the%20process,the%20performance%20of%20the%20model. ● https://scikit-learn.org/stable/modules/feature_selection.html ● https://en.wikipedia.org/wiki/Feature_selection ● https://www.simplilearn.com/tutorials/machine-learning-tutorial/feature-selection-in-machine-learning ● https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
  • 12. Strategy: Scalable Fast Data Architecture: Cassandra, Spark, Kafka Engineering: Node, Python, JVM,CLR Operations: Cloud, Container Rescue: Downtime!! I need help. www.anant.us | solutions@anant.us | (855) 262-6826 3 Washington Circle, NW | Suite 301 | Washington, DC 20037