The process of converting a data set with vast dimensions into a data set with fewer dimensions while ensuring that it conveys similar information concisely.
Concept
R code
2. WHAT IS DIMENSION REDUCTION?
The process of converting a data set with vast dimensions into a data set
with fewer dimensions while ensuring that it conveys similar
information concisely.
4. UNDERSTANDING THE CURSE
Consider a 3-class pattern recognition problem.
1) Start with 1 dimension/feature
2) Divide the feature space into uniform bins
3) Compute the ratio of examples for each class at each bin
4) For a new example, find its bin and choose the predominant class in that bin
In this case, we decide to start with one feature and divide the real line into 3 bins
- but there is overlap between the classes, so let's add a 2nd feature to improve discrimination.
5. UNDERSTANDING THE CURSE
2 dimensions: moving to two dimensions increases the number of bins from 3 to 3^2 = 9.
QUESTION: What should we keep constant?
The total number of examples? This results in a 2D scatter plot with reduced overlap but higher sparsity.
To address sparsity, what about keeping the density of examples per bin constant (say 3)? This increases the
number of examples required from 9 to at least 27 (9 x 3 = 27).
6. UNDERSTANDING THE CURSE
Moving to 3 features
The number of bins grows to 3^3 = 27.
For the same number of examples, the 3D scatter plot is almost empty.
Constant density:
To keep the initial density of 3 examples per bin, we need 27 x 3 = 81 examples.
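The arithmetic above can be reproduced with a few lines of R (a minimal sketch; the 3 bins per feature and 3 examples per bin come from this walkthrough):

# Bins per feature and target density from the walkthrough above
bins_per_feature <- 3
examples_per_bin <- 3
for (d in 1:3) {
  n_bins <- bins_per_feature^d                # 3, 9, 27
  n_examples <- n_bins * examples_per_bin     # 9, 27, 81
  cat(sprintf("dimensions = %d: bins = %d, examples needed = %d\n",
              d, n_bins, n_examples))
}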
7. IMPLICATIONS OF THE CURSE OF DIMENSIONALITY
Exponential growth with dimensionality in the number of examples
required to accurately estimate a function
In practice, the curse of dimensionality means that for a given sample size, there is a
maximum number of features above which the performance of a classifier will degrade
rather than improve.
In most cases, the information lost by discarding some features is compensated for by a
more accurate mapping in the lower-dimensional space.
8. MULTICOLLINEARITY
Multicollinearity is a state of very high
intercorrelations or inter-associations among the
independent variables. It is therefore a type of
disturbance in the data; if it is present, the
statistical inferences made from the data may not
be reliable.
9. UNDERSTANDING MULTI-COLLINEARITY
Consider 2 dimensions x1 and x2, which are, let us say,
measurements of several objects in centimetres (x1) and inches (x2).
Now, if you were to use both these dimensions in machine learning,
they would convey similar information and introduce a lot of noise into the
system.
We are better off using just one dimension. Here we have converted
the data from 2D (x1 and x2) to 1D (z1), which has made the data
relatively easier to explain.
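As a hedged illustration (the simulated measurements and variable names below are made up for this sketch), two almost perfectly correlated dimensions such as centimetres and inches collapse onto a single derived dimension z1:

set.seed(1)
x1 <- runif(50, 10, 100)             # lengths in cm (simulated)
x2 <- x1 / 2.54 + rnorm(50, 0, 0.1)  # the same lengths in inches, plus tiny noise
cor(x1, x2)                          # close to 1: the two columns are redundant

pca <- prcomp(cbind(x1, x2), scale. = TRUE)
summary(pca)       # PC1 explains essentially all of the variance
z1 <- pca$x[, 1]   # the single derived dimension z1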
10. BENEFITS OF APPLYING DIMENSION REDUCTION
Data Compression; Reduction of storage space
Less computing; Faster processing
Removal of multi-collinearity (redundant features) to reduce noise for better
model fit
Better visualization and interpretation
11. APPROACHES FOR DIMENSION REDUCTION
Feature selection:
choosing a subset of all the features
Feature extraction:
creating new features by combining existing ones
In either case, the goal is to find a low-dimensional representation of the data that preserves (most of)
the information or structure in the data
12. COMMON METHODS FOR DIMENSION REDUCTION
MISSING VALUES
While exploring data, if we encounter missing values, what do we do? Our first step should be to
identify the reason, then impute the missing values or drop the variable using an appropriate method.
But what if we have too many missing values? Should we impute them or drop the variable?
Dropping may be a good option, because the variable would not carry many details about the data set
and would not help in improving the power of the model.
Next question: is there any threshold of missing values for dropping a variable?
It varies from case to case. If the information contained in the variable is not that important, you can
drop the variable if it has more than ~40-50% missing values.
13. COMMON METHODS FOR DIMENSION REDUCTION
MISSING VALUES
R code:
summary(data)
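summary(data) reports the missing values per variable. As a minimal sketch (the data frame name `data` is an assumption; the 40% cutoff is the rule of thumb from the previous slide), the share of missing values per column can be computed and high-missingness variables dropped like this:

# Proportion of missing values in each column
missing_share <- colMeans(is.na(data))
print(sort(missing_share, decreasing = TRUE))

# Drop variables with more than ~40% missing values (threshold is a judgment call)
data_reduced <- data[, missing_share <= 0.40]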
14. COMMON METHODS FOR DIMENSION REDUCTION
LOW VARIANCE
Let's think of a scenario where we have a constant variable (all observations have the
same value, say 5) in our data set. Do you think it can improve the power of the model?
Of course NOT, because it has zero variance. When the number of dimensions is high,
we should drop variables having low variance compared to the others, because these
variables will not explain the variation in the target variable.
nearZeroVar (from the caret package; see the example below)
15. EXAMPLE AND CODE
To identify these types of predictors, the following two metrics can be calculated:
- the frequency of the most prevalent value over the second most frequent value
(called the "frequency ratio"), which would be near one for well-behaved predictors
and very large for highly unbalanced data;
- the "percent of unique values", the number of unique values divided by the total
number of samples (times 100), which approaches zero as the granularity of the data
increases.
If the frequency ratio is greater than a pre-specified threshold and the unique value
percentage is less than a threshold, we might consider a predictor to be near zero-
variance.
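These two metrics are exactly what caret's nearZeroVar() reports. A minimal sketch, assuming a data frame named `data` (the thresholds shown are caret's defaults):

library(caret)

# saveMetrics = TRUE returns the frequency ratio and percent unique for every column
nzv <- nearZeroVar(data, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE)
print(nzv)

# Drop the near-zero-variance predictors
drop_cols <- nearZeroVar(data)   # indices of offending columns
if (length(drop_cols) > 0) data <- data[, -drop_cols]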
17. COMMON METHODS FOR DIMENSION REDUCTION
DECISION TREE
It can be used as an all-in-one solution to tackle multiple challenges such as missing
values, outliers and identifying significant variables.
- How? We need to understand how a Decision Tree works and the
concepts of Entropy and Information Gain.
21. DECISION TREE: ENTROPY AND INFORMATION GAIN
To build a decision tree, we need to calculate two types of Entropy using frequency tables:
a) Entropy using the frequency table of one attribute (the target, e.g. Play)
b) Entropy using the frequency table of two attributes (the target split by a predictor)
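As a hedged sketch of these calculations (the entropy and information-gain formulas are standard; the weather data and the Play/Outlook column names are the ones used in the R code later in this deck):

# Entropy of a categorical vector: E = -sum(p * log2(p))
entropy <- function(x) {
  p <- table(x) / length(x)
  -sum(p * log2(p))
}

# Information gain of splitting the target by one attribute:
# IG = E(target) - sum over attribute values of (weight * E(target | value))
info_gain <- function(target, attribute) {
  weights <- table(attribute) / length(attribute)
  cond_entropy <- sum(weights * tapply(target, attribute, entropy))
  entropy(target) - cond_entropy
}

w <- read.csv("weather.csv", header = TRUE)
entropy(w$Play)               # entropy of the target
info_gain(w$Play, w$Outlook)  # information gain of Outlook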
29. KEY POINTS:
A branch with entropy more than 0 needs further splitting.
The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
31. R CODE
## fit and plot the tree (the rpart model was presumably built on an earlier slide; recreated here so the snippet runs)
library(rpart)
library(rattle)   # provides fancyRpartPlot
w <- read.csv("weather.csv", header = TRUE)
fit <- rpart(Play ~ ., data = w, method = "class")
fancyRpartPlot(fit)

## to calculate information gain
library(FSelector)
weights <- information.gain(Play ~ ., data = w)
print(weights)
subset <- cutoff.k(weights, 3)   # keep the 3 attributes with the highest information gain
f <- as.simple.formula(subset, "Play")
print(f)
32. COMMON METHODS FOR DIMENSION REDUCTION
Random Forest
Similar to the decision tree is the Random Forest. I would also recommend using the built-in
feature importance provided by random forests to select a smaller subset of input
features.
Just be careful: random forests have a tendency to be biased towards variables that have
a larger number of distinct values, i.e. they favour numeric variables over binary/categorical variables.
36. R CODE
w <- read.csv("weather.csv",header=T)
library(randomForest)
set.seed(12345)
w <- as.data.frame(w)
w <- read.csv("weather.csv",header=T)
str(w)
w <- w[names(w)[1:5]]
fit1 <- randomForest(Play~ Outlook+Temperature+Humidity+Windy, data=w,
importance=TRUE, ntree=20)
varImpPlot(fit1)
37. COMMON METHODS FOR DIMENSION REDUCTION
High Correlation
Dimensions exhibiting high correlation can lower the performance of the model.
Moreover, it is not good to have multiple variables carrying similar information or variation;
this is also known as "Multicollinearity".
You can use a Pearson (continuous variables) or polychoric (discrete variables) correlation
matrix to identify the variables with high correlation, and select one of them using the
VIF (Variance Inflation Factor). Variables having a higher value (VIF > 5) can be dropped.
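As a minimal sketch of the correlation-matrix step (the numeric data frame `num_data` and the 0.9 cutoff are illustrative assumptions; findCorrelation is caret's helper for flagging one variable from each highly correlated pair):

library(caret)

cor_mat <- cor(num_data, use = "pairwise.complete.obs")  # Pearson correlation matrix

# Indices of columns to drop so that no remaining pair exceeds |correlation| 0.9
high_cor <- findCorrelation(cor_mat, cutoff = 0.9)
num_data_reduced <- if (length(high_cor) > 0) num_data[, -high_cor] else num_data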
38. IMPACT OF MULTICOLLINEARITY
R demonstration with an example
13 independent variables against a response variable of crime rate
- a discrepancy is noticeable when studying the regression equation with regard to
expenditure on police services in the years 1959 and 1960. Why should police
expenditure in one year be associated with an increase in crime rate, and expenditure in the
previous year with a decrease? It does not make sense.
39. IMPACT OF MULTICOLLINEARITY
Second: even though the F statistic is highly significant, which provides evidence for the presence
of a linear relationship between all 13 variables and the response variable, the β coefficients of both
the 1959 and 1960 expenditures have non-significant t ratios. A non-significant t means there is no
slope! In other words, police expenditure has no effect whatsoever on crime rate!
40. HIGH CORRELATION - VIF
The most widely used diagnostic for multicollinearity is the variance inflation factor (VIF).
- The VIF may be calculated for each predictor by doing a linear regression of that predictor on all
the other predictors, and then obtaining the R² from that regression. The VIF is just 1/(1 - R²).
- It's called the variance inflation factor because it estimates how much the variance of a
coefficient is "inflated" because of linear dependence with other predictors. Thus, a VIF of 1.8 tells
us that the variance (the square of the standard error) of a particular coefficient is 80% larger than
it would be if that predictor were completely uncorrelated with all the other predictors.
- The VIF has a lower bound of 1 but no upper bound. Authorities differ on how high the VIF has to
be to constitute a problem. Personally, I tend to get concerned when a VIF is greater than 2.50,
which corresponds to an R² of .60 with the other variables.
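A hedged sketch of both ways to obtain the VIF in R (the response `y`, predictor `x1` and data frame `df` are placeholders; car::vif is the usual convenience function):

# Manual VIF for one predictor: regress it on all the other predictors
r2_x1 <- summary(lm(x1 ~ ., data = df[, setdiff(names(df), "y")]))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# Or compute VIFs for every predictor of a fitted model at once
library(car)
fit <- lm(y ~ ., data = df)
vif(fit)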
42. KEY POINTS:
The variables with high VIFs are control variables, and the variables of interest
do not have high VIFs.
Suppose the sample consists of U.S. colleges. The dependent variable is graduation
rate, and the variable of interest is an indicator (dummy) for public vs. private. Two
control variables are average SAT scores and average ACT scores for entering
freshmen. These two variables have a correlation above .9, which corresponds to
VIFs of at least 5.26 for each of them. But the VIF for the public/private indicator is
only 1.04. So there's no problem to be concerned about, and no need to delete one
or the other of the two controls.
http://statisticalhorizons.com/multicollinearity
43. COMMON METHODS FOR DIMENSION REDUCTION
Backward Feature Elimination / Forward Feature Selection
In this method, we start with all n dimensions. We compute the sum of squared residuals
(SSR) after eliminating each variable in turn (n times), then identify the variable whose
removal produced the smallest increase in the SSR and remove it, leaving
us with n-1 input features.
Repeat this process until no other variable can be dropped.
The reverse of this is the "Forward Feature Selection" method: we select one variable
and analyse the performance of the model when adding another variable.
Here, the choice of variable is based on the larger improvement in model performance.
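R's built-in step() does stepwise elimination/selection in this spirit, although it ranks models by AIC rather than raw SSR; a minimal hedged sketch (the lm models and data frame `df` are placeholders):

full_model <- lm(y ~ ., data = df)

# Backward elimination: drop one predictor at a time while the criterion improves
backward_fit <- step(full_model, direction = "backward", trace = FALSE)

# Forward selection: start from the intercept-only model and add predictors
null_model <- lm(y ~ 1, data = df)
forward_fit <- step(null_model, direction = "forward",
                    scope = formula(full_model), trace = FALSE)
summary(backward_fit)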
44. COMMON METHODS FOR DIMENSION REDUCTION
Factor Analysis
Let's say some variables are highly correlated. These variables can be grouped by their
correlations, i.e. all variables in a particular group can be highly correlated among themselves but
have low correlation with the variables of other group(s). Here each group represents a single
underlying construct or factor. These factors are small in number compared to the large number of
dimensions. However, these factors are difficult to observe.
There are basically two methods of performing factor analysis:
EFA (Exploratory Factor Analysis)
CFA (Confirmatory Factor Analysis)
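A minimal exploratory-factor-analysis sketch with base R's factanal() (the numeric data frame `num_data` and the choice of 2 factors are assumptions for illustration):

# Exploratory factor analysis: extract 2 latent factors from the correlated variables
fa <- factanal(num_data, factors = 2, rotation = "varimax", scores = "regression")
print(fa$loadings)   # how strongly each observed variable loads on each factor
head(fa$scores)      # the reduced, 2-dimensional representation of the data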
45. COMMON METHODS FOR DIMENSION REDUCTION
Principal Component Analysis (PCA)
In this technique, variables are transformed into a new set of variables that are linear combinations of the original
variables. These new variables are known as principal components. They are obtained in such a way that the
first principal component accounts for most of the possible variation in the original data, after which each succeeding
component has the highest possible remaining variance.
The second principal component must be orthogonal to the first principal component. In other words, it does its
best to capture the variance in the data that is not captured by the first principal component. For a two-dimensional
dataset, there can be only two principal components.
The principal components are sensitive to the scale of measurement, so to fix this issue we should always
standardize the variables before applying PCA. After applying PCA, the original features lose their individual meaning,
so if interpretability of the results is important for your analysis, PCA is not the right technique for your project.
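A minimal prcomp() sketch (the data frame `num_data` is a placeholder; scale. = TRUE does the standardization the slide calls for):

pca <- prcomp(num_data, center = TRUE, scale. = TRUE)

summary(pca)             # proportion of variance explained by each component
pca$rotation             # loadings: each PC as a linear combination of the originals
reduced <- pca$x[, 1:2]  # keep the first two principal components as new features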
50. PROPERTIES:
- Eigenvectors and eigenvalues always come in pairs.
- Eigenvectors can only be found for square matrices, but not all square matrices have eigenvectors.
- For an n x n matrix (that has eigenvectors), there are n eigenvectors.
- Another property of eigenvectors is that even if I scale the vector by some amount
before I multiply it, I still get the same multiple of it as a result. This
is because if you scale a vector by some amount, all you are doing is making it longer, not changing its
direction.
51. PROPERTIES:
All the eigenvectors of a (symmetric) matrix such as a covariance matrix are perpendicular,
i.e. at right angles to each other, no matter how many dimensions you have. By the way,
another word for perpendicular, in maths talk, is orthogonal.
This is important:
it means we can express the data in terms of these perpendicular eigenvectors, instead of expressing
it in terms of the x and y axes.
52. PROPERTIES:
Another important thing to know is that when mathematicians find eigenvectors,
they like to find the eigenvectors whose length is exactly one. This is because, as
you know, the length of a vector doesn't affect whether it's an eigenvector or not,
whereas the direction does. So, in order to keep eigenvectors standard, whenever
we find an eigenvector we usually scale it to have a length of 1, so that all
eigenvectors have the same length.
53. HOW DOES ONE GO ABOUT FINDING THESE MYSTICAL EIGENVECTORS?
- Unfortunately, it’s only easy(ish) if you have a rather small matrix,
-The usual way to find the eigenvectors is by some complicated iterative method
which is beyond the scope of this tutorial
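In practice, R's eigen() finds them for us. A small hedged sketch (the simulated data is just for illustration) that also verifies the unit-length and orthogonality properties from the previous slides:

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)     # toy 2-dimensional data
C <- cov(X)                           # covariance matrix (square, symmetric)

e <- eigen(C)
e$values                              # eigenvalues, largest first
e$vectors                             # unit-length eigenvectors in the columns

sqrt(colSums(e$vectors^2))            # each eigenvector has length 1
sum(e$vectors[, 1] * e$vectors[, 2])  # dot product ~ 0: they are orthogonal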
59. METHOD:
Step 5: Deriving the new data set
FinalData = RowFeatureVector x RowDataAdjust, where:
RowFeatureVector is the matrix with the eigenvectors in the columns transposed
so that the eigenvectors are now in the rows, with the most significant eigenvector
at the top;
RowDataAdjust is the mean-adjusted data transposed, i.e. the data
items are in each column, with each row holding a separate dimension.
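A hedged R sketch of this step (the toy data mirrors the eigen() sketch above; the variable names follow the slide's terminology, not code from the original deck):

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)   # toy 2-dimensional data
e <- eigen(cov(X))                  # eigenvectors of the covariance matrix

RowDataAdjust    <- t(scale(X, center = TRUE, scale = FALSE))  # mean-adjusted data, one column per data item
RowFeatureVector <- t(e$vectors)                               # eigenvectors as rows, most significant first

# Deriving the new data set: project the adjusted data onto the eigenvector basis
FinalData <- RowFeatureVector %*% RowDataAdjust
dim(FinalData)   # one row per principal component, one column per data item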