Principal Component Analysis (PCA) Understanding Document
Theory:
Let the data points be the following on which PCA will be applied.
X Y
2.5 2.4
0.5 0.7
2.2 2.9
1.9 2.2
3.1 3.0
2.3 2.7
2 1.6
1 1.1
1.5 1.6
1.1 0.9
Subtract the mean of each axis from every data point. The mean-adjusted dataset is:
X Y
.69 .49
-1.31 -1.21
.39 .99
.09 .29
1.29 1.09
.49 .79
.19 -.31
-.81 -.81
-.31 -.31
-.71 -1.01
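As a sketch, the mean-subtraction step can be written in plain Java (the class and method names here are hypothetical, not from any library):

```java
public class MeanCenter {
    // Subtract each column's mean from every row, producing the
    // mean-adjusted dataset shown above.
    public static double[][] center(double[][] data) {
        int rows = data.length, cols = data[0].length;
        double[] mean = new double[cols];
        for (double[] row : data)
            for (int j = 0; j < cols; j++)
                mean[j] += row[j] / rows;
        double[][] out = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                out[i][j] = data[i][j] - mean[j];
        return out;
    }

    public static void main(String[] args) {
        double[][] data = {
            {2.5, 2.4}, {0.5, 0.7}, {2.2, 2.9}, {1.9, 2.2}, {3.1, 3.0},
            {2.3, 2.7}, {2.0, 1.6}, {1.0, 1.1}, {1.5, 1.6}, {1.1, 0.9}
        };
        double[][] adjusted = center(data);
        // First adjusted row is (0.69, 0.49), as in the table above.
        System.out.printf("%.2f %.2f%n", adjusted[0][0], adjusted[0][1]);
    }
}
```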
Calculate the covariance matrix:
cov  X            Y
X    0.616555556  0.615444444
Y    0.615444444  0.716555556
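The covariance matrix above can be computed from the mean-adjusted data with a short loop; this is a hypothetical sketch, using the sample covariance (division by n - 1):

```java
public class Covariance {
    // Sample covariance matrix of mean-adjusted data (divides by n - 1),
    // matching the 2x2 matrix shown above for the example dataset.
    public static double[][] cov(double[][] centered) {
        int n = centered.length, d = centered[0].length;
        double[][] c = new double[d][d];
        for (double[] row : centered)
            for (int i = 0; i < d; i++)
                for (int j = 0; j < d; j++)
                    c[i][j] += row[i] * row[j] / (n - 1);
        return c;
    }

    public static void main(String[] args) {
        double[][] centered = {
            {0.69, 0.49}, {-1.31, -1.21}, {0.39, 0.99}, {0.09, 0.29},
            {1.29, 1.09}, {0.49, 0.79}, {0.19, -0.31}, {-0.81, -0.81},
            {-0.31, -0.31}, {-0.71, -1.01}
        };
        double[][] c = cov(centered);
        // c[0][0] is approximately 0.616555556, as in the table above.
        System.out.printf("%.9f %.9f%n", c[0][0], c[0][1]);
    }
}
```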
Calculate the eigenvalues and the eigenvectors.
eigenvalues
0.0490833989
1.28402771
eigenvector 1 eigenvector 2
-0.735178656 -0.677873399
0.677873399 -0.735178656
The eigenvector with the highest eigenvalue is the principal component of the data set.
Once eigenvectors are found from the covariance matrix, the next step is to order them by
eigenvalue, highest to lowest. This gives you the components in order of significance. Now, if you
like, you can decide to ignore the components of lesser significance. You do lose some information,
but if the eigenvalues are small, you don't lose much. If you leave out some components, the final
data set will have fewer dimensions than the original. To be precise, if you originally have n dimensions
in your data, you calculate n eigenvectors and eigenvalues; if you then choose only the first p
eigenvectors, the final data set has only p dimensions.
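For a 2x2 symmetric covariance matrix the eigenvalues and eigenvectors can be found in closed form from the characteristic equation; the following hypothetical sketch reproduces the values above (up to the arbitrary sign of each eigenvector), already ordered highest eigenvalue first:

```java
public class Eigen2x2 {
    // Closed-form eigen-decomposition of the symmetric matrix [[a, b], [b, c]].
    // Returns {lambda1, v1x, v1y, lambda2, v2x, v2y}, largest eigenvalue first,
    // so the first (unit) eigenvector is the principal component.
    public static double[] decompose(double a, double b, double c) {
        double mid = (a + c) / 2.0;
        double r = Math.sqrt((a - c) * (a - c) / 4.0 + b * b);
        double l1 = mid + r, l2 = mid - r;
        double[] v1 = unitEigenvector(a, b, l1);
        double[] v2 = unitEigenvector(a, b, l2);
        return new double[]{l1, v1[0], v1[1], l2, v2[0], v2[1]};
    }

    private static double[] unitEigenvector(double a, double b, double lambda) {
        // For b != 0, (b, lambda - a) solves (A - lambda*I) v = 0.
        double x = b, y = lambda - a;
        if (b == 0.0) { x = (lambda == a) ? 1.0 : 0.0; y = 1.0 - x; }
        double norm = Math.hypot(x, y);
        return new double[]{x / norm, y / norm};
    }

    public static void main(String[] args) {
        // The covariance matrix from the example above.
        double[] e = decompose(0.616555556, 0.615444444, 0.716555556);
        // e[0] is approximately 1.28402771, e[3] approximately 0.0490833989.
        System.out.printf("%.8f %.10f%n", e[0], e[3]);
    }
}
```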
FeatureVector = [eig1 eig2 eig3.....]
eigenvector 1 eigenvector 2
-0.677873399 -0.735178656
-0.735178656 0.677873399
We can choose to leave out the smaller, less significant component and only have a single column:
eigenvector 1
-0.677873399
-0.735178656
FinalData = RowFeatureVector * RowDataAdjust
where RowFeatureVector is the matrix with the eigenvectors in the columns transposed so that the
eigenvectors are now in the rows, with the most significant eigenvector at the top and
RowDataAdjust is the mean-adjusted data transposed, i.e. the data items are in each column, with each
row holding a separate dimension.
Transformed Data (two eigenvectors)
X Y
-.827970186 -.175115307
1.77758033 .142857227
-.992197494 .384374989
-.274210416 .130417207
-1.67580142 -.209498461
-.912949103 .175282444
.0991094375 -.349824698
1.14457216 .0464172582
.438046137 .0177646297
1.22382056 -.162675287
Transformed Data (Single eigenvector)
X
-.827970186
1.77758033
-.992197494
-.274210416
-1.67580142
-.912949103
.0991094375
1.14457216
.438046137
1.22382056
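The multiplication FinalData = RowFeatureVector * RowDataAdjust is, sample by sample, just a dot product of each mean-adjusted row with each chosen eigenvector. A hypothetical sketch (eigenvectors passed as rows, most significant first):

```java
public class Project {
    // FinalData: each mean-adjusted row is projected onto each chosen
    // eigenvector. components[p] is the p-th eigenvector, given as a row.
    public static double[][] transform(double[][] centered, double[][] components) {
        int n = centered.length, d = centered[0].length, k = components.length;
        double[][] out = new double[n][k];
        for (int i = 0; i < n; i++)
            for (int p = 0; p < k; p++)
                for (int j = 0; j < d; j++)
                    out[i][p] += centered[i][j] * components[p][j];
        return out;
    }

    public static void main(String[] args) {
        double[][] centered = {{0.69, 0.49}};
        double[][] components = {{-0.677873399, -0.735178656}};
        // First transformed value is approximately -0.827970186,
        // as in the table above.
        System.out.printf("%.9f%n", transform(centered, components)[0][0]);
    }
}
```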
E.g.:
Suppose we have n features.
FinalData = SampleData (1×n matrix) * EigenVector (n×1 matrix)
= 1×1 matrix, i.e. the 1st of the n eigenvectors (sorted in descending order of their
eigenvalues) is used to get the 1st value of the features after PCA.
To get the Final Data :
FinalData = RowFeatureVector * RowDataAdjust
Getting back the old data:
RowDataAdjust = RowFeatureVector^(-1) * FinalData
(Since the eigenvectors are orthonormal, RowFeatureVector^(-1) is simply its transpose.)
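A hypothetical reconstruction sketch, using the transpose as the inverse (valid because the eigenvectors are orthonormal) and adding the mean back at the end:

```java
public class Reconstruct {
    // RowDataAdjust = RowFeatureVector^T * FinalData, then add the mean back.
    // scores[i] holds sample i's values along the kept eigenvectors;
    // components[p] is the p-th eigenvector (as a row); mean is the original
    // per-dimension mean. With fewer eigenvectors than dimensions the
    // reconstruction is only approximate (lossy).
    public static double[][] inverseTransform(double[][] scores,
                                              double[][] components,
                                              double[] mean) {
        int n = scores.length, k = components.length, d = components[0].length;
        double[][] out = new double[n][d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++) {
                for (int p = 0; p < k; p++)
                    out[i][j] += scores[i][p] * components[p][j];
                out[i][j] += mean[j];
            }
        return out;
    }

    public static void main(String[] args) {
        // Trivial round trip: identity components recover scores plus mean.
        double[][] back = inverseTransform(new double[][]{{1.0, 2.0}},
                new double[][]{{1.0, 0.0}, {0.0, 1.0}}, new double[]{10.0, 20.0});
        System.out.printf("%.1f %.1f%n", back[0][0], back[0][1]);
    }
}
```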
Java Library :
1. java-statistical-analysis-tool (JSAT):
https://code.google.com/p/java-statistical-analysis-tool/source/browse/trunk/JSAT/src/jsat/datatransform/PCA.java?spec=svn414&r=414
License : GNU GPL v3
2. efficient-java-matrix-library: (EJML)
https://code.google.com/p/efficient-java-matrix-library/wiki/PrincipleComponentAnalysisExample
License : GNU Lesser GPL
3. Michael Thomas Flanagan's Java Scientific Library
http://www.ee.ucl.ac.uk/~mflanaga/java/PCA.html
License : This library is no longer publicly available
Here we can commercially use the efficient-java-matrix-library (EJML):
Explaining EJML :
Here is the code you can use after adding the EJML jar to the classpath:
https://code.google.com/p/efficient-java-matrix-library/wiki/PrincipleComponentAnalysisExample
We can write a test component class for this class.
Process:
1. First, we have to provide all the data samples via
pca.addSample(sample);
2. Then we have to call pca.computeBasis(n);
This is the main step; here n is the number of dimensions to which we want to reduce our features.
3. Now we can use the computed eigenvectors to transform samples via
sampleToEigenSpace(double[] sampleData).
Points to Note :
PCA is not very useful for data consisting of 0's and 1's, because such data can be
easily converted to a sparse-matrix format, which automatically reduces the memory requirement.
The PCA output can never usefully be converted to a sparse-matrix format, as it will not contain 0's.
So it is better not to use PCA if the data consists of 0's and 1's.
(We did not find any Java library that accepts a sparse matrix as the input format for PCA.)
Links :
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf