1. Literature review of dimensionality reduction
Feature Selection (FS) & Feature Extraction (FE)
o K. Fukunaga and D. R. Olsen, "An algorithm for finding intrinsic dimensionality of data," IEEE
Transactions on Computers, vol. 20, no. 2, pp. 176-183, 1971.
This paper is the earliest work on dimensionality reduction that I could find. The
intrinsic dimensionality is defined as the dimension of the smallest space we can obtain
under some constraint. The paper addressed two problems: intrinsic dimensionality for
representation (mapping) and intrinsic dimensionality for classification (separating).
FOCUS
Two well-known feature selection algorithms, FOCUS and RELIEF, appear in almost all of
the literature.
o H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the
Ninth National Conference on Artificial Intelligence (AAAI-91), pages 547--552, Anaheim, CA, USA,
1991. AAAI Press.
o Almuallim H., Dietterich T.G.: Efficient Algorithms for Identifying Relevant Features, Proc. of the
Ninth Canadian Conference on Artificial Intelligence, University of British Columbia, Vancouver,
May 11-15, 1992, 38-45
The first paper above proposed the FOCUS algorithm; the second upgraded it to
FOCUS-2. FOCUS implements the MIN-FEATURES bias, which prefers consistent hypotheses
definable over as few features as possible. In its simplest implementation, it performs a
breadth-first search over feature subsets of increasing size and checks each one for
inconsistency.
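The breadth-first search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `is_consistent` and `focus` and the toy data are hypothetical, and features are assumed to be discrete.

```python
from itertools import combinations

def is_consistent(rows, labels, features):
    """True if no two rows agree on `features` but carry different labels."""
    seen = {}
    for row, label in zip(rows, labels):
        key = tuple(row[f] for f in features)
        if seen.setdefault(key, label) != label:
            return False
    return True

def focus(rows, labels):
    """Return the first smallest feature subset that is consistent with the labels."""
    n = len(rows[0])
    for size in range(n + 1):                      # smallest subsets first
        for subset in combinations(range(n), size):
            if is_consistent(rows, labels, subset):
                return subset
    return tuple(range(n))  # reached only if the data itself is inconsistent

# Toy data (hypothetical): the label is the XOR of features 0 and 1;
# feature 2 merely duplicates feature 0.
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
labels = [0, 1, 1, 0]
print(focus(rows, labels))
```

Because subsets are enumerated by increasing size, the search is exponential in the number of relevant features, which is why it is only practical when that number is small.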
RELIEF
o K. Kira and L. Rendell. The feature selection problem: Traditional methods and a new algorithm. In
Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), pages 129--134,
Menlo Park, CA, USA, 1992. AAAI Press.
o Kononenko, Igor. Estimating Attributes: Analysis and Extensions of RELIEF, In European Conference
on Machine Learning, pages 171-182, 1994.
In the first paper, RELIEF was presented as a weight-based feature ranking algorithm.
From the set of training instances, it first chooses a sample of instances;
the user must provide the number of instances in this sample. RELIEF randomly picks this
sample and, for each instance in it, finds the Near Hit (the closest instance of the
same class) and the Near Miss (the closest instance of a different class)
based on a Euclidean distance measure.
Figure: Near Hit and Near Miss (instances of Class A and Class B)
The basic idea is to update the weights, which are initialized to zero, using the
following equation:
$$W[A] = W[A] - \mathrm{diff}(A, R, H)^2 + \mathrm{diff}(A, R, M)^2$$
where A is the attribute, R is the randomly selected instance, H is the near hit, M is the
near miss, and diff calculates the difference between two instances on attribute A. After
exhausting all instances in the sample, RELIEF chooses all features whose weight is
greater than or equal to a threshold.
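The weight update can be illustrated with a short sketch. It assumes numeric attributes, a range-normalized `diff`, and sampling with replacement; the function names, the toy data and the threshold value are hypothetical and are not taken from the paper.

```python
import random

def diff(a, x, y, ranges):
    """Difference between instances x and y on attribute a, scaled to [0, 1]."""
    return abs(x[a] - y[a]) / ranges[a] if ranges[a] else 0.0

def nearest(r, candidates):
    """Candidate closest to r by squared Euclidean distance."""
    return min(candidates, key=lambda c: sum((p - q) ** 2 for p, q in zip(r, c)))

def relief(X, y, m, seed=0):
    rng = random.Random(seed)
    n = len(X[0])
    ranges = [max(x[a] for x in X) - min(x[a] for x in X) for a in range(n)]
    w = [0.0] * n                                   # weights start at zero
    for _ in range(m):                              # m is the user-supplied sample size
        i = rng.randrange(len(X))
        r = X[i]
        hit = nearest(r, [X[j] for j in range(len(X)) if j != i and y[j] == y[i]])
        miss = nearest(r, [X[j] for j in range(len(X)) if y[j] != y[i]])
        for a in range(n):
            w[a] += -diff(a, r, hit, ranges) ** 2 + diff(a, r, miss, ranges) ** 2
    return w

# Toy data (hypothetical): attribute 0 separates the classes, attribute 1 is noise.
X = [(0.0, 0.1), (0.1, 0.9), (0.2, 0.5), (0.9, 0.2), (1.0, 0.8), (0.8, 0.4)]
y = [0, 0, 0, 1, 1, 1]
w = relief(X, y, m=4)
selected = [a for a, wa in enumerate(w) if wa >= 1.0]   # threshold chosen for illustration
print(w, selected)
```

A relevant attribute tends to differ on the near miss and agree on the near hit, so its weight grows, while an irrelevant attribute accumulates contributions of both signs.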
In the second paper, RELIEF, which only dealt with two-class problems,
was upgraded into RELIEF-F, which can handle noisy, incomplete, and multi-class data
sets. First, RELIEF was extended to search for the k nearest hits/misses instead of only one
near hit/miss. The extended version was called RELIEF-A. It averages the contributions of
the k nearest hits/misses to deal with noisy data. Second, different diff functions yield
different RELIEF versions, so the author proposed three versions
of RELIEF (RELIEF-B, RELIEF-C and RELIEF-D) to deal with incomplete data sets.
RELIEF-D: Given two instances (I1 and I2)
If one instance (I1) has an unknown value:
$$\mathrm{diff}(A, I_1, I_2) = 1 - P(\mathrm{value}(A, I_2) \mid \mathrm{class}(I_1))$$
If both instances have unknown values:
$$\mathrm{diff}(A, I_1, I_2) = 1 - \sum_{V}^{\#\mathrm{values}(A)} P(V \mid \mathrm{class}(I_1)) \, P(V \mid \mathrm{class}(I_2))$$
Finally, RELIEF-D was extended to deal with multi-class problems, giving RELIEF-E and
RELIEF-F. Of the two, RELIEF-F showed advantages on both noise-free and noisy data.
RELIEF-F: find one near miss M(C) for each different class C and average their
contributions when updating the estimates W[A]:
$$W[A] = W[A] - \mathrm{diff}(A, R, H)/m + \sum_{C \neq \mathrm{class}(R)} \left[ P(C) \, \mathrm{diff}(A, R, M(C)) \right] / m$$
FS: Liu, Dash et al.
Review of DR methods
An important contribution of Liu and Dash is their review and categorization of feature
selection methods for classification.
o Dash, M., & Liu, H. 1997. Feature selection for classification. Intelligent Data Analysis, 1, 131-156.
The major contribution of this paper is the following figure, which lays out the
feature selection process clearly.
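Figure: Feature selection process with validation. The Original Feature Set feeds
Generation, which passes a Subset to Evaluation; the Goodness of the Subset is checked
against the Stopping Criterion (No: return to Generation; Yes: proceed to Validation).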
Liu et al. claimed that each feature selection method can be characterized by the
type of generation procedure and the evaluation function it uses. The paper therefore
presented a table classifying feature selection methods, in which each row stands for one
type of evaluation measure and each column for one kind of generation method. Most
of the methods listed in the cells of this table (5 x 3 = 15 cells in total) were
discussed in the review paper.
Table: Two-dimensional categorization of feature selection methods
Evaluation measure    | Generation: Heuristic | Generation: Complete | Generation: Random
Distance measure      | Relief, Relief-F, Sege84 | B&B, BFF, Bobr88 |
Information measure   | DTM, Kroll-Saha96 | MDLM |
Dependency measure    | POE1ACC, PRESET | |
Consistency measure   | | Focus, MIFES |
Classifier error rate | SBS, SFS, SBS-FLASH, PQSS, Moor-Lee94, BDS, RC, Quei-Gels84 | Ichi-Skla84a, Ichi-Skla84b, AMB&B, BS | GA, SA, RGSS, RMHC-PF1
LVF + Consistency Measure
H. Liu, M. Dash, et al. published many papers on their feature selection methods.
o Liu, H., and Setiono, R. (1996) A probabilistic approach to feature selection - A filter solution. In 13th
International Conference on Machine Learning (ICML'96), July 1996, pp. 319-327. Bari, Italy.
o H. Liu and R. Setiono. Some issues on scalable feature selection. In 4th World Congress of Expert
Systems: Application of Advanced Information Technologies, 1998.
In the introduction of the first paper, the authors first compared the wrapper
approach and the filter approach. They pointed out several reasons why the wrapper
approach, despite certain advantages, is not as general as the filter approach: (1) it
inherits the learning algorithm's bias; (2) it has a high computational cost; and (3) large
datasets cause problems when running some algorithms, which makes it impractical to employ
computationally intensive learning algorithms such as neural networks or genetic
algorithms. Second, the section addressed two types of feature selection search:
exhaustive (checking correlations of all orders) and heuristic (making use of 1st- and
2nd-order information, i.e., single attributes and combinations of two attributes).
The first paper introduced random search based on a Las Vegas algorithm, which
uses randomness to guide the search in such a way that a correct solution is guaranteed
to be found sooner or later. The probabilistic approach is called LVF; it combines the
inconsistency rate with Las Vegas random search. LVF only works on discrete attributes
because it relies on the inconsistency calculation.
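As a rough illustration of how the inconsistency rate and the random search fit together, consider the sketch below. It is a simplified reading of LVF, not the published pseudocode: the names `inconsistency_rate` and `lvf`, the parameters `max_tries` and `threshold`, and the toy data are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict
from itertools import product

def inconsistency_rate(rows, labels, features):
    """Fraction of rows that disagree with the majority label of their pattern."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in features)][label] += 1
    clashes = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return clashes / len(rows)

def lvf(rows, labels, max_tries, threshold=0.0, seed=0):
    """Randomly sample subsets; keep the smallest one within the allowed rate."""
    rng = random.Random(seed)
    n = len(rows[0])
    best = list(range(n))
    for _ in range(max_tries):
        size = rng.randint(1, len(best))            # never larger than the current best
        candidate = sorted(rng.sample(range(n), size))
        if len(candidate) < len(best) and \
                inconsistency_rate(rows, labels, candidate) <= threshold:
            best = candidate
    return best

# Toy data (hypothetical): all 3-bit patterns, label = XOR of features 0 and 1.
rows = list(product([0, 1], repeat=3))
labels = [r[0] ^ r[1] for r in rows]
print(lvf(rows, labels, max_tries=200))
```

The result depends on the number of tries: with too few, the search may return a larger subset than necessary, which is the usual trade-off of a Las Vegas style search.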
The second paper pointed out three big issues in feature selection: (1) a large
number of features; (2) a large number of instances; and (3) features expanding as the
environment changes. The authors used the LVF algorithm to reduce the number of
features. To handle a large number of instances, an upgraded LVF algorithm (LVS) was
developed; its main idea is to start from a fraction of the data set and then add more
instances step by step until some condition is satisfied. The last issue reduces to the
missing attribute problem in many cases.
Figure: (1) Scaling data; (2) Expanding features
At the end of this paper, the authors identified computational implementation as a
potential area for future work. Two ideas were listed: one is to use parallel feature
selection; the other is to use database techniques such as data warehouses and metadata.
o Huan Liu, Rudy Setiono: Incremental Feature Selection. Applied Intelligence 9(3): 217-230 (1998)
This paper is almost the same as the one addressing LVS; the only difference is that
the algorithm is named LVI.
ABB + Consistency Measure
o H. Liu, H. Motoda, and M. Dash, A Monotonic Measure for Optimal Feature Selection, Proc. of
ECML-98, pages 101-106, 1998.
o M. Dash, H. Liu and H. Motoda, "Consistency Based Feature Selection", pp 98 -- 109, PAKDD 2000,
Kyoto, Japan. April, 2000. Springer.
The first paper studied the monotonicity of the inconsistency measure, noting that
most error- or distance-based measures are not monotonic. The
authors argued that a monotonic measure is necessary to find an optimal subset of
features when using a complete but not exhaustive search. The paper gave the ABB
(Automatic Branch & Bound) algorithm for selecting a subset of features.
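To show why monotonicity permits a complete search without exhaustive enumeration, here is a minimal branch-and-bound sketch. It assumes the bound is the inconsistency rate of the full feature set and that subsets are explored breadth-first by removing one feature at a time; the names `inconsistency_rate` and `abb` and the toy data are hypothetical, and the details may differ from the published algorithm.

```python
from collections import Counter, defaultdict, deque
from itertools import product

def inconsistency_rate(rows, labels, features):
    """Fraction of rows that disagree with the majority label of their pattern."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in features)][label] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values()) / len(rows)

def abb(rows, labels):
    full = frozenset(range(len(rows[0])))
    bound = inconsistency_rate(rows, labels, sorted(full))   # bound set automatically
    best, seen, queue = full, {full}, deque([full])
    while queue:
        subset = queue.popleft()
        for f in subset:
            child = subset - {f}
            if child in seen:
                continue
            seen.add(child)
            if inconsistency_rate(rows, labels, sorted(child)) <= bound:
                queue.append(child)                 # legitimate: keep branching
                if len(child) < len(best):
                    best = child
            # otherwise prune: by monotonicity no subset of child can do better
    return sorted(best)

# Toy data (hypothetical): all 3-bit patterns, label = XOR of features 0 and 1.
rows = list(product([0, 1], repeat=3))
labels = [r[0] ^ r[1] for r in rows]
print(abb(rows, labels))
```

Pruning is only safe because removing features can never decrease the inconsistency rate; with a non-monotonic measure the same search could discard the optimal subset.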
The second paper focused on the consistency measure, which was used with five different
feature selection algorithms: FOCUS (exhaustive search), ABB (Automatic Branch &
Bound) (complete search), SetCover (heuristic search), LVF (probabilistic search), and
QBB (a combination of LVF and ABB) (hybrid search).
QBB + Consistency Measure
o M. Dash and H. Liu, "Hybrid search of feature subsets," in PRICAI'98, (Singapore), Springer-Verlag,
November 1998.
This paper proposed a hybrid algorithm (QBB) that combines probabilistic and complete
search: it begins with LVF to reduce the number of features and then runs ABB to obtain
the optimal feature subset. The suggested guideline is: if M is known to be small, apply
FocusM; else if N >= 9, apply ABB; else apply QBB.
VHR (Discretization)
o H. Liu and R. Setiono, "Dimensionality reduction via discretization," Knowledge Based Systems 9(1),
pp. 71--77, 1996.
This paper proposed a vertical and horizontal reduction (VHR) method to build a data
and dimensionality reduction (DDR) system. The algorithm is based on the idea of
ChiMerge discretization. Duplicates that appear after merging are removed from the data
set. At the end, if an attribute has been merged into a single value, the
attribute can be discarded.
Neural Network Pruning
o R. Setiono and H. Liu, "Neural network feature selector," IEEE Trans. on Neural Networks, vol. 8, no.
3, pp. 654-662, 1997.
This paper proposed the use of a three-layer feedforward neural network to select those
input attributes (features) that are most useful for discriminating classes in a given set of
input patterns. The proposed algorithm was built on a network pruning algorithm. A simple
criterion was developed to remove an attribute based on the accuracy
rate of the network.
Kohavi, John et al.
In contrast to Liu et al., Kohavi, John et al. concentrated their research on the wrapper
approach.
o R. Kohavi and G.H. John, Wrappers for feature subset selection, Artificial Intelligence 97(1-2) (1997),
273--324.
In this influential paper Kohavi and John presented a number of disadvantages of the
filter approach to the feature selection problem, steering research towards algorithms
adopting the wrapper approach.
The wrapper approach to feature subset selection is shown in the following figure:
Relevance measure
Definition 1 (Optimal feature subset). Given an inducer I and a dataset D with
features X1, X2, …, Xn, from a distribution $\mathcal{D}$ over the labeled instance space, an
optimal feature subset, Xopt, is a subset of the features such that the accuracy of the
induced classifier C = I(D) is maximal.
Definitions 2-3 (Existing definitions of relevance). Almuallim & Dietterich: A feature Xi is
said to be relevant to a concept C if Xi appears in every Boolean formula that represents C,
and irrelevant otherwise. Gennari et al.: Xi is relevant iff there exist some $x_i$ and $y$ for
which $p(X_i = x_i) > 0$ such that $p(Y = y \mid X_i = x_i) \neq p(Y = y)$.
Definition 4. Let $S_i = \{X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_m\}$, and let $s_i$ be a value
assignment to all features in $S_i$. Xi is relevant iff there exist some $x_i$, $y$ and $s_i$ for
which $p(X_i = x_i) > 0$ such that
$p(Y = y, S_i = s_i \mid X_i = x_i) \neq p(Y = y, S_i = s_i)$.
Definition 5 (Strong relevance). Xi is strongly relevant iff there exist some $x_i$, $y$ and
$s_i$ for which $p(S_i = s_i, X_i = x_i) > 0$ such that
$p(Y = y \mid S_i = s_i, X_i = x_i) \neq p(Y = y \mid S_i = s_i)$.
Definition 6 (Weak relevance). Xi is weakly relevant iff it is not strongly relevant and
there exists a subset $S_i'$ of $S_i$ for which there exist some $x_i$, $y$ and $s_i'$ with
$p(S_i' = s_i', X_i = x_i) > 0$ such that
$p(Y = y \mid S_i' = s_i', X_i = x_i) \neq p(Y = y \mid S_i' = s_i')$.
The following figure shows a view of feature set relevance.
Search & Induce
This paper then demonstrated the wrapper approach with two searching methods: hill-
climbing and best-first search, two induction algorithms: decision tree (ID3) and Naïve-
Bayes, on 14 datasets.
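The wrapper idea with hill-climbing search can be sketched as follows. To keep the example self-contained, the inducer here is a 1-nearest-neighbour classifier scored by leave-one-out accuracy, which is an assumption of this sketch and not one of the induction algorithms used in the paper; the names `loo_accuracy` and `forward_hill_climb` and the data are hypothetical.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `features`."""
    if not features:
        return 0.0
    correct = 0
    for i, xi in enumerate(X):
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: sum((xi[f] - X[k][f]) ** 2 for f in features))
        correct += y[j] == y[i]
    return correct / len(X)

def forward_hill_climb(X, y, evaluate=loo_accuracy):
    """Greedily add the feature that most improves the inducer's estimated accuracy."""
    selected, best = [], evaluate(X, y, [])
    remaining = list(range(len(X[0])))
    while remaining:
        scores = [(evaluate(X, y, selected + [f]), f) for f in remaining]
        score, f = max(scores, key=lambda t: t[0])
        if score <= best:                           # no improvement: local optimum
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected, best

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [(0.0, 0.1), (0.1, 0.9), (0.2, 0.5), (0.9, 0.2), (1.0, 0.8), (0.8, 0.4)]
y = [0, 0, 0, 1, 1, 1]
print(forward_hill_climb(X, y))
```

The `evaluate` argument is the only place where the learning algorithm enters, which is what makes the search a wrapper: swapping in another inducer changes which subsets look good.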
This paper also gave some directions for future work: (1) other search engines such as
simulated annealing and genetic algorithms; (2) selection of the initial subset of
features; (3) incremental operations and aggregation techniques; (4) parallel computing
techniques; and (5) the overfitting issue (using cross-validation).
FE: PCA
o Partridge M. and R. A. Calvo. Fast Dimensionality Reduction and Simple PCA, Intelligent Data
Analysis, 2(3), 1998.
A fast and simple algorithm for approximately calculating the principal components (PCs)
of a data set and so reducing its dimensionality is described. This Simple Principal
Components Analysis (SPCA) method was used for dimensionality reduction of two
high-dimensional image databases, one of handwritten digits and one of handwritten
Japanese characters. It was tested and compared with other techniques: matrix methods
such as SVD and several data methods.
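For orientation, the sketch below shows one data-driven way to approximate the first principal component without forming or decomposing the covariance matrix. It uses plain power iteration, which is a swapped-in technique chosen for brevity and is not the SPCA algorithm of the paper; the function name and the data are hypothetical.

```python
def first_principal_component(data, iterations=100):
    """Approximate the leading principal direction by power iteration."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    v = [1.0] * d                                   # arbitrary starting direction
    for _ in range(iterations):
        # Apply the (unnormalized) covariance matrix to v using the data directly.
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        v = [sum(p * r[j] for p, r in zip(proj, centered)) for j in range(d)]
        norm = sum(c * c for c in v) ** 0.5         # assumes the data is not constant
        v = [c / norm for c in v]
    return mean, v

# Toy data (hypothetical): points lying exactly on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
mean, v = first_principal_component(data)
scores = [sum((x[j] - mean[j]) * v[j] for j in range(2)) for x in data]  # 1-D projection
print(v, scores)
```

Projecting each centered point onto `v` reduces the two-dimensional data to one coordinate; further components would require deflation, which this sketch omits.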
o David Gering (2002) In fulfillment of the Area Exam doctoral requirements: Linear and Nonlinear
Data Dimensionality Reduction, http://www.ai.mit.edu/people/gering/areaexam/areaexam.doc
This report presented three different approaches to deriving PCA: Pearson's Least
Squares Distance approach, Hotelling's Change of Variables approach, and the author's
new method of Matrix Factorization for Variation Compression. The report also
covers Multidimensional Scaling (MDS). The author then gave three
implementations of the two techniques: Eigenfaces, Locally Linear Embedding (LLE),
and Isometric feature mapping (Isomap).
FE: Soft Computing Approaches
This is an interesting area.
o Pal, N.R. and K.K.Chintalapudi (1997). "A connectionist system for feature selection", Neural, Parallel
and Scientific Computation Vol. 5, 359-382.
o N. R. Pal, "Soft Computing for Feature Analysis," Fuzzy Sets and Systems, Vol. 103, 201-221, 1999.
In the first paper, Pal and Chintalapudi proposed a connectionist model for selecting a
subset of good features for pattern recognition problems. Each input node of an MLP has
an associated multiplier, which allows or restricts the passing of the corresponding
feature into the higher layers of the net. A high value of the attenuation factor indicates
that the associated feature is either redundant or harmful. The network learns both the
connection weights and the attenuation factors. At the end of learning, features with high
values of the attenuation factor are eliminated.
The second paper gave an overview of using three soft computing techniques: fuzzy logic,
neural networks, and genetic algorithms for feature ranking, selection and extraction.
o Fuzzy sets were introduced in 1965 by Zadeh as a new way to represent
vagueness in everyday life.
o Neural networks have the characteristics of parallel computing, robustness, built-in
learning ability, and the capability to deal with imprecise, fuzzy, noisy, and
probabilistic information.
o Genetic Algorithms (GAs) are biologically inspired tools for optimization. They
are parallel and randomized search techniques.
This paper also mentioned the separation of the two problems of feature selection and
feature extraction.
For feature extraction using neural networks, Pal reviewed several methods such as
o PCA neural networks by Rubner (J. Rubner and P. Tavan. A self-organization network for
principal-component analysis. Europhysics Letters, 10:693-698, 1989.) (See also the
book Principal Component Neural Networks: Theory and Applications, Kostas I.
Diamantaras and S. Y. Kung, New York: Wiley, 1996);
o Nonlinear projection by Sammon (J. W. Sammon, Jr., "A nonlinear mapping for data
structure analysis," IEEE Trans. Comput., vol. C-18, pp. 401--409, May 1969.)
For feature ranking & selection using neural networks, Pal summarized three types of
methods:
o Saliency based feature ranking (SAFER)
(Ruck, D.W., S.K.Rogers and M.Kabrisky (1990). "Feature selection using a multilayer perceptron",
Journal of Neural Network Computing, 40-48.)
o Sensitivity based feature ranking (SEFER)
(R. K. De, N. R. Pal, and S. K. Pal. Feature analysis: Neural network and fuzzy set theoretic approaches.
Pattern Recognition, 30(10):1579--1590, 1997)
o An attenuator based feature selection (AFES)
Neural networks
o De Backer S., Naud A., Scheunders P.. - Nonlinear dimensionality reduction techniques for
unsupervised feature extraction. - In: Pattern recognition letters, 19(1998), p. 711-720
In this paper, a study was performed on unsupervised nonlinear feature extraction. Four
techniques were studied: a multidimensional scaling approach (MDS), Sammon's
mapping (SAM), Kohonen's self-organizing map (SOM) and an auto-associative
feedforward neural network (AFN). All four yield better classification results than the
optimal linear approach, PCA, and can therefore be used as a feature extraction step in
the design of classification schemes. Because of the nature of the techniques, SOM and
AFN perform better for very low dimensions. Because of the complexity of the
techniques, MDS and SAM are best suited to high-dimensional data sets with a limited
number of data points, while SOM and AFN are more appropriate for low-dimensional
problems with a large number of data points.
Aha et al.
o Aha, D. W. & Bankert, R. L. (1994), Feature selection for case-based classification of cloud types: An
empirical comparison, in Working Notes of the AAAI-94 Workshop on Case-based Reasoning, pp
106-112.
o Aha, D. W. and Bankert, R. L. (1995), A comparative evaluation of sequential feature selection
algorithms, In Proceedings of the Fifth International Workshop on Artificial Intelligence and
Statistics, editors D. Fisher and H. Lenz, pp. 1--7, Ft. Lauderdale, FL.
The two papers gave a framework for sequential feature selection, BEAM. The following
figures show the two versions of the algorithm given in the two papers.
The framework treats feature selection as a combination of a search method (FSS
and BSS) and an evaluation function (IB1 and Index).
Bradley et al
o P. S. Bradley, O. L. Mangasarian, and W. N. Street. Feature selection via mathematical programming.
INFORMS Journal on Computing, 10:209--217, 1998.
This paper formulated feature selection as a mathematical programming problem.
The task is to discriminate between two given point sets ($A \in R^{m \times n}$ and
$B \in R^{k \times n}$) in an n-dimensional feature space using as few of the given features
as possible. In mathematical programming terms, the aim is to generate a separating plane
($P := \{x \mid x \in R^n, x^T \omega = \gamma\}$, suppressing as many of the components of ω
as possible) in a feature space of as small a dimension as possible while minimizing the
average distance of misclassified points to the plane:
$$(FS) \quad \min_{\omega, \gamma, y, z, v} \; (1-\lambda)\left(\frac{e^T y}{m} + \frac{e^T z}{k}\right) + \lambda \, e^T v_* \quad \text{s.t.} \quad -A\omega + e\gamma + e \le y, \;\; B\omega - e\gamma + e \le z, \;\; y \ge 0, \; z \ge 0, \;\; -v \le \omega \le v, \quad \lambda \in [0,1)$$
Typically this is achieved in a feature space of reduced dimensionality, that is,
$e^T v_* < n$. The term $e^T v_*$ is a discrete step function term, and different
approximations of it lead to different algorithms (below, e is a vector of ones and ε is the
base of the natural logarithm). Three methods were proposed in this paper:
o Standard sigmoid: $e^T v_* \approx e^T (e + \varepsilon^{-\alpha v})^{-1}$. This gives FSS
(FS sigmoid).
o Concave exponential (advantageous for its simplicity and concavity):
$e^T v_* \approx e^T (e - \varepsilon^{-\alpha v})$. This gives FSV (FS concave).
o Treating it as a linear program with equilibrium constraints gives FSL. Reformulating
to avoid the computational difficulty gives FSB (FS bilinear program).
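A small numeric sketch may help compare the two smooth surrogates for the step-function term, which counts the nonzero components of v. The function names, the value of alpha and the vector below are illustrative assumptions; v is taken to be nonnegative, as implied by the constraint -v <= ω <= v.

```python
import math

def step_count(v):
    """Exact step-function term: number of strictly positive components of v."""
    return sum(1 for x in v if x > 0)

def sigmoid_approx(v, alpha=5.0):
    """Sum of 1 / (1 + exp(-alpha * v_i)), the sigmoid surrogate behind FSS."""
    return sum(1.0 / (1.0 + math.exp(-alpha * x)) for x in v)

def concave_exp_approx(v, alpha=5.0):
    """Sum of 1 - exp(-alpha * v_i), the concave exponential surrogate behind FSV."""
    return sum(1.0 - math.exp(-alpha * x) for x in v)

# Hypothetical bound vector: two suppressed features and two active ones.
v = [0.0, 0.0, 2.0, 3.0]
print(step_count(v), sigmoid_approx(v), concave_exp_approx(v))
```

On this vector the concave exponential surrogate contributes exactly zero for each suppressed feature, whereas the sigmoid contributes one half, which gives some intuition for why the concave form is convenient on v >= 0.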
This paper also gave an adaptation of the optimal brain damage (OBD).
Algorithms for FSS, FSV, FSB and OBD were presented. Experiments were carried out
on the WPBC (Wisconsin Prognostic Breast Cancer) problem and the Ionosphere
problem.
o P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector
machines. In Proc. 15th International Conf. on Machine Learning, pages 82--90. Morgan Kaufmann,
San Francisco, CA, 1998.
This paper actually extracted parts of the previous paper, focusing on FSV and its
algorithm (Successive Linearization Algorithm - SLA). The introduction of SVM was
just to add another suppressing term in the objective function.
Hall et al.
o Hall, M. A., Smith, L. A. (1999). Feature selection for machine learning: comparing a correlation-
based filter approach to the wrapper. Proceedings of the Florida Artificial Intelligence Symposium
(FLAIRS-99).
o Practical feature subset selection for machine learning. Proceedings of the 21st Australian Computer
Science Conference. Springer. 181-191.
o M. Hall. Correlation-based feature selection for machine learning. PhD thesis, University of Waikato,
1999.
o Hall, M. (2000). Correlation-based feature selection of discrete and numeric class machine learning. In
Proceedings of the International Conference on Machine Learning, pages 359-366, San Francisco, CA.
Morgan Kaufmann Publishers.
In these papers and the thesis, Hall presented his new method, CFS (Correlation-based
Feature Selection). It is an algorithm that couples a feature-subset evaluation formula
with an appropriate correlation measure and a heuristic search strategy.
Further experiments compared CFS with a wrapper, a well-known approach to feature
selection that employs the target learning algorithm to evaluate feature sets. In many
cases CFS gave results comparable to the wrapper, and in general it outperformed the
wrapper on small datasets. CFS executes many times faster than the wrapper, which
allows it to scale to larger datasets.
Langley et al
o Blum, Avrim and Langley, Pat. Selection of Relevant Features and Examples in Machine Learning, In
Artificial Intelligence, Vol.97, No.1-2, pages 245-271,1997.
This paper addressed the two problems of irrelevant features and irrelevant samples. For
feature selection, the authors used almost the same relevance concepts and definitions as
John, Kohavi and Pfleger.
Second, the authors regarded feature selection as a heuristic search problem. Four
basic issues were studied: (1) the starting point; (2) the organization of the search:
exhaustive or greedy; (3) the evaluation of alternative subsets; and (4) the halting
criterion. This discussion parallels the one in Liu's publications.
Then, the authors reviewed three types of feature selection methods: (1) those that embed
the selection within the basic induction algorithm; (2) those that use the selection to filter
features passed to induction; and (3) those that treat the selection as a wrapper around the
induction process. The following two tables list the characteristics of different methods.
Devaney et al. - Conceptual Clustering
o Devaney, M., and Ram, A. Efficient Feature Selection in Conceptual Clustering. Machine Learning:
ICML '97, 92-97, Morgan Kaufmann, San Francisco, CA, 1997.
This paper addressed feature selection in unsupervised settings. A typical
wrapper approach cannot be applied because the dataset has no class labels.
One solution is to use the average predictive accuracy over all attributes. Another is to
use category utility (M. A. Gluck and J. E. Corter. Information, uncertainty, and the unity of categories.
In Proceedings of the 7th Annual Conference of the Cognitive Science Society, pages 283--287, Irvine, CA,
1985).
The COBWEB system (D. H. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139--172, 1987) applying the category utility was used in the paper.
In the category utility equation, the $C_k$ terms are the concepts in the partition, $A_i$ is
each attribute, and $V_{ij}$ is each of the possible values for attribute $A_i$. The equation
yields a measure of the increase in the number of attribute values that can be predicted
given a set of concepts, $C_1 \ldots C_k$, over the number of attribute values that could be
predicted without using any concepts.
The term $\sum_i \sum_j P(A_i = V_{ij})^2$ is the probability of each attribute value
independent of class membership and is obtained from the parent of the partition. The
$P(C_k)$ term weights the values for each concept according to its size, and the division
by n, the number of concepts in the partition, allows comparison of partitions of
different sizes.
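The description above matches the commonly stated form of category utility for nominal attributes, CU = (1/n) Σ_k P(C_k) [ Σ_i Σ_j P(A_i = V_ij | C_k)^2 - Σ_i Σ_j P(A_i = V_ij)^2 ]. The sketch below computes it under that assumption; the function names and the two toy partitions are hypothetical, and the code is not taken from COBWEB.

```python
from collections import Counter

def sum_sq_probs(rows):
    """Sum over attributes i and values j of P(A_i = V_ij)^2, estimated from rows."""
    total = 0.0
    for i in range(len(rows[0])):
        counts = Counter(row[i] for row in rows)
        total += sum((c / len(rows)) ** 2 for c in counts.values())
    return total

def category_utility(partition):
    """partition: list of concepts, each a non-empty list of attribute-value tuples."""
    parent = [row for concept in partition for row in concept]
    base = sum_sq_probs(parent)                     # term obtained from the parent
    gain = sum(len(c) / len(parent) * (sum_sq_probs(c) - base) for c in partition)
    return gain / len(partition)                    # divide by the number of concepts

# Two hypothetical partitions of the same four objects.
good = [[("red", "round"), ("red", "round")], [("blue", "square"), ("blue", "square")]]
bad = [[("red", "round"), ("blue", "square")], [("red", "round"), ("blue", "square")]]
print(category_utility(good), category_utility(bad))
```

A partition whose concepts make attribute values predictable scores higher than one whose concepts mirror the parent distribution, which is what lets the measure guide feature selection without class labels.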
For continuous attributes, the paper used the CLASSIT algorithm (Gennari, J. H. (1990). An
experimental study of concept formation. Doctoral dissertation, Department of Information & Computer
Science, University of California, Irvine. ).
In the CLASSIT evaluation function, K is the number of classes in the partition,
$\sigma_{ik}$ the standard deviation of attribute i in class k, and $\sigma_{ip}$ the standard
deviation of attribute i at the parent node.
The method, as the authors described, blurs the traditional wrapper/filter
distinction: it is like a wrapper model in that the underlying learning algorithm is
used to guide the descriptor search, but it is like a filter in that the evaluation function
measures an intrinsic property of the data rather than some type of predictive accuracy.
A hill-climbing search algorithm, AICC, was then proposed, and the heart disease and
LED datasets were used to benchmark the methodology.
Caruana & Freitag
o Caruana, R. and D. Freitag. Greedy Attribute Selection. in International Conference on Machine
Learning. 1994.
The paper examined five greedy hillclimbing procedures (forward selection, backward
elimination, forward stepwise selection, backward stepwise elimination, and backward
stepwise elimination–SLASH) that search for attribute sets that generalize well with
ID3/C4.5. A caching scheme was presented that made attribute hillclimbing more
practical computationally.