In Data Engineer's Lunch #67, Obioma Anomnachi will discuss feature selection as part of a machine learning workflow. Feature selection is the process of picking particular, relevant data features out of a wider data set to be used for model training.
Accompanying Blog: Coming Soon!
Accompanying YouTube: https://youtu.be/3CPpoQv2tjU
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Data Engineer’s Lunch Weekly at 12 PM EST Every Monday:
https://www.meetup.com/Data-Wranglers-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
Data Engineer's Lunch #67: Machine Learning - Feature Selection
1. Version 1.0
Machine Learning - Feature Selection
Feature selection describes the process of picking particular, relevant data features out of a wider data set, to be used to perform model training.
Obioma Anomnachi
Engineer @ Anant
2. Data Preparation
● Data preparation deals with transformations applied to data that prepare it for use with machine learning algorithms
○ Previously, we’ve covered a number of methods within the field: https://blog.anant.us/spark-and-cassandra-for-machine-learning-data-pre-processing/
○ Vectorization and Encoding help organize raw data into a form that ML models can work with
○ Standardization can help to better express the variance within data and prepare it for models that expect data within certain ranges (see the sketch after this slide)
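To make the encoding and standardization steps concrete, here is a minimal sketch using scikit-learn; the library choice and the sample columns are assumptions for illustration, not from the talk.

```python
# A minimal sketch of encoding and standardization with scikit-learn.
# The library choice and the sample columns are assumptions for illustration.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "color": ["red", "blue", "red"],  # nominal category
    "price": [10.0, 250.0, 40.0],     # numeric field on a wide range
})

# One-hot encode the categorical column into a model-friendly vector form.
encoded = OneHotEncoder().fit_transform(df[["color"]]).toarray()

# Standardize the numeric column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(df[["price"]])
```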
3. Data Preparation (2)
● Imputation is one of a number of methods for dealing with missing fields for particular rows within your data (see the imputation sketch after this slide)
● Feature selection actually falls within the same category as PCA, a previously covered topic. Both methods are types of dimensionality reduction.
○ Dimensionality reduction focuses on removing irrelevant data from the data set to reduce computational costs, improve model performance, and work towards “legibility” - or the ability of the model to be understood by humans.
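As a small illustration of imputation, a sketch using scikit-learn's SimpleImputer; the sample array is an assumption.

```python
# A minimal sketch of imputation with scikit-learn's SimpleImputer.
# The sample array is an assumption for illustration.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],  # this row has a missing field
              [7.0, 6.0]])

# Fill missing values with the column mean; "median" and
# "most_frequent" are common alternative strategies.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
```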
4. Feature Selection - Overview
● Feature selection, as a subcategory of dimensionality reduction, is concerned with picking the most relevant features out of a dataset. It is a process for removing irrelevant or misleading columns from a dataset before any models are trained.
○ Just like ML models in general, feature selection methods can be supervised or unsupervised, depending on whether the data that they interact with is labeled or not.
■ Unsupervised feature selection processes do not have a label against which they can compare the relevance of the data, so the most they can accomplish is to remove redundant data from the data set.
■ Supervised processes can compare how highly certain fields are correlated with the label we want the model to predict in the end, so data can be defined as irrelevant if it has no bearing on that outcome.
■ Essentially, supervised methods are about the relationship between your data and the labels, while unsupervised methods are about the relationships between your data and the rest of your data.
5. Feature Selection - Unsupervised
● Unsupervised methods can work within singular features to remove ones that, even in isolation, fail to add information to the wider data set
○ Variance thresholds are used to remove any fields with variance below a certain value.
○ In the most extreme case, fields that contain the same value for every row in the dataset can safely be dropped. Variance thresholds allow less extreme settings but generally accomplish the same type of thing.
● They can also work across the entire set of features to remove redundant ones (see the sketch after this slide).
○ A correlation matrix can be built between fields in the data set. When fields show extremely high correlation with each other, only one of them needs to be kept.
○ For an extreme example, consider a data set that contains two fields measuring the exact same thing with different units. At most one of those should make it into the training set. In this test they would show 100% correlation with each other, signaling that we only need one.
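A minimal sketch of both unsupervised checks described above, using pandas and scikit-learn; the sample columns are made up for illustration.

```python
# Variance thresholding plus a correlation-matrix check with made-up data.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "constant": [1, 1, 1, 1],               # zero variance: safe to drop
    "meters":   [1.0, 2.0, 3.0, 4.0],
    "feet":     [3.28, 6.56, 9.84, 13.12],  # same quantity, different unit
})

# Variance threshold: drop any column whose variance falls below 0.1.
reduced = VarianceThreshold(threshold=0.1).fit_transform(df)

# Correlation matrix: "meters" and "feet" correlate at ~1.0,
# signaling that only one of the pair needs to be kept.
print(df.corr())
```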
6. Feature Selection - Supervised
● Supervised feature selection methods compare predictor fields to the label field, picking out the fields most relevant to prediction outcomes.
○ Supervised methods are divided further into three groups.
○ Filter methods use information theory to select and drop the least relevant fields based on their relation to the label field.
○ Wrapper methods progressively remove fields from the data set and train and test models, using the testing results to determine the best fields to remove.
○ Intrinsic methods combine the training and testing steps of the wrapper methods with rule-based methods for selecting subsets of fields to test.
7. Feature Selection - Supervised - Filter Methods
● Filter methods use statistical analysis to perform feature selection. Which algorithm to use depends on the types of the label field and the predictor fields being analyzed (see the sketch after this slide).
○ Numerical fields cover any fields with integer or decimal types.
○ Categorical fields include boolean types, ordinal categories, and nominal categories.
● Each of these combinations has various associated statistical tests. Some of these are familiar, like Pearson’s correlation coefficient, a measure of correlation, and ANOVA, a measure of statistical significance used in scientific research.
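One possible filter-method sketch, assuming scikit-learn's SelectKBest with an ANOVA F-test (f_classif) and the bundled iris dataset; these choices are assumptions, not from the talk.

```python
# A minimal sketch of a filter method using scikit-learn's SelectKBest.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # numeric predictors, categorical label

# f_classif runs an ANOVA F-test between each numeric predictor and the
# categorical label; keep the two highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)  # per-feature ANOVA F-statistics
```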
8. Feature Selection - Supervised - Wrapper Methods
● Wrapper methods train models on subsets of fields and evaluate the performance of those models to determine the best subset of features to select.
○ The most obvious method in this group is exhaustive feature selection, where each combination of features is used to train a model. Each model’s performance is compared, and the best-performing subset is selected as the set of features for the actual learning task. This returns the best-performing subset of features over all of the possible combinations.
○ Other techniques include (see the sketch after this slide):
■ Forward Feature Selection - start with the best single feature and add features until criteria are met.
■ Backward Feature Elimination - start with all of the features and remove them until criteria are satisfied.
■ Recursive Feature Elimination - recursively remove features or groups of features that are determined to be least important.
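A sketch of forward feature selection and recursive feature elimination, assuming scikit-learn's SequentialFeatureSelector and RFE; the estimator and dataset choices are assumptions for illustration.

```python
# A minimal sketch of two wrapper methods with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward feature selection: greedily add the best feature one at a time.
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward").fit(X, y)

# Recursive feature elimination: repeatedly drop the least important feature.
rfe = RFE(estimator, n_features_to_select=2).fit(X, y)

print(forward.get_support(), rfe.get_support())  # masks of kept features
```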
9. Feature Selection - Supervised - Intrinsic Methods
● Intrinsic methods are similar to wrapper methods of feature selection in that they involve training a model.
○ While wrapper methods do preliminary training of example models in order to extract statistical information, intrinsic methods take place during the actual model training process.
○ L1 Regularization - or LASSO - directly changes the cost function to help avoid overfitting. In the process, it penalizes the model’s coefficient for each field in the training set. These coefficients can be driven all the way to zero, effectively removing fields from the data set (see the sketch after this slide).
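A minimal LASSO sketch, assuming scikit-learn's Lasso and the bundled diabetes dataset; the alpha value is arbitrary and chosen for illustration.

```python
# A minimal sketch of intrinsic selection via L1 regularization (LASSO).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# The L1 penalty (scaled by alpha) drives some coefficients to exactly
# zero during training, effectively dropping those fields.
model = Lasso(alpha=1.0).fit(X, y)

print(model.coef_)  # zero entries mark fields the model selected out
```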