Literature review of dimensionality reduction
Feature Selection (FS) & Feature Extraction (FE)

o   K. Fukunaga and DR Olsen, "An algorithm for finding intrinsic dimensionality of data," IEEE
    Transactions on Computers, vol. 20, no. 2, pp. 176-183, 1971.

This paper is the earliest literature I could collect on the dimensionality reduction issue. The
intrinsic dimensionality is defined as the dimension of the smallest space onto which the data
can be mapped under some constraint. Two problems were addressed in this paper: intrinsic
dimensionality for representation (mapping) and intrinsic dimensionality for classification (separating).

FOCUS
There are two well-known dimensionality reduction algorithms that appear in almost all of the
literature: FOCUS and RELIEF.

o   H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the
    Ninth National Conference on Artificial Intelligence (AAAI-91), pages 547--552, Anaheim, CA, USA,
    1991. AAAI Press.
o   Almuallim H., Dietterich T.G.: Efficient Algorithms for Identifying Relevant Features, Proc. of the
    Ninth Canadian Conference on Artificial Intelligence, University of British Columbia, Vancouver,
    May 11-15, 1992, 38-45

The first paper above proposed the FOCUS algorithm; the second upgraded it to FOCUS-2.
FOCUS implements the Min-Features bias, which prefers consistent hypotheses definable over
as few features as possible. In the simplest implementation, it performs a breadth-first search
over feature subsets and checks each candidate for inconsistency.
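A minimal sketch of this simplest form of FOCUS on nominal data, written in Python: enumerate feature subsets breadth-first by size and return the first subset on which no two instances agree on every selected feature while disagreeing on the class. This is an illustrative reading of the Min-Features bias, not the authors' implementation.

    from itertools import combinations

    def consistent(X, y, subset):
        """True if no two instances match on the subset yet carry different class labels."""
        seen = {}
        for row, label in zip(X, y):
            key = tuple(row[j] for j in subset)
            if seen.setdefault(key, label) != label:
                return False
        return True

    def focus(X, y):
        """Breadth-first search over subset sizes; return the smallest consistent subset."""
        n_features = len(X[0])
        for size in range(n_features + 1):
            for subset in combinations(range(n_features), size):
                if consistent(X, y, subset):
                    return subset
        return tuple(range(n_features))

FOCUS-2, the upgraded version, speeds up the same search; the Min-Features bias is unchanged.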

RELIEF
o   K. Kira and L. Rendell. The feature selection problem: Traditional methods and a new algorithm. In
    Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), pages 129--134,
    Menlo Park, CA, USA, 1992. AAAI Press.
o   Kononenko, Igor. Estimating Attributes: Analysis and Extensions of RELIEF, In European Conference
    on Machine Learning, pages 171-182, 1994.

In the first paper, RELIEF was presented as a weight-based feature ranking algorithm. From
the set of training instances it first draws a sample of instances, whose size must be provided
by the user. RELIEF picks this sample randomly and, for each instance in it, finds the Near Hit
(the closest instance of the same class) and the Near Miss (the closest instance of a different
class) under a Euclidean distance measure.

Figure: Near Hit and Near Miss (instances of Class A and Class B)

The basic idea is to update the weights, which are initialized to zero, according to the following
equation:

                         W[A] := W[A] − diff(A, R, H)^2 + diff(A, R, M)^2

where A is the attribute, R is the randomly selected instance, H is the near hit, M is the near
miss, and diff computes the difference between two instances on attribute A. After exhausting
all instances in the sample, RELIEF selects all features whose weight is greater than or equal to
a threshold.
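The core loop, as described above, can be sketched in a few lines of Python (NumPy); the sample size m, the threshold tau, and the assumption that the features are numeric and pre-scaled to a common range are choices of this illustration, not prescriptions from the paper.

    import numpy as np

    def relief(X, y, m=100, tau=0.1, seed=0):
        """Two-class RELIEF: accumulate per-feature weights over m sampled instances."""
        rng = np.random.default_rng(seed)
        n_samples, n_features = X.shape
        W = np.zeros(n_features)
        for _ in range(m):
            i = rng.integers(n_samples)                          # randomly selected instance R
            R = X[i]
            dist = np.linalg.norm(X - R, axis=1)                 # Euclidean distances to R
            dist[i] = np.inf                                     # exclude R itself
            H = X[np.where(y == y[i], dist, np.inf).argmin()]    # near hit
            M = X[np.where(y != y[i], dist, np.inf).argmin()]    # near miss
            W += (R - M) ** 2 - (R - H) ** 2                     # the weight update above, per feature
        return np.flatnonzero(W / m >= tau)                      # features whose relevance passes the threshold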

In the second paper, RELIEF, which only dealt with two-class problems, was upgraded into
RELIEF-F, which can handle noisy, incomplete, and multi-class data sets. First, RELIEF was
extended to search for the k nearest hits/misses instead of only one near hit/miss; this extended
version, RELIEF-A, averages the contributions of the k nearest hits/misses to deal with noisy
data. Second, different diff functions give different RELIEF versions, so the author proposed
three versions of RELIEF (RELIEF-B, RELIEF-C and RELIEF-D) to deal with incomplete data sets.

RELIEF-D: given two instances I1 and I2,
   o If one instance (I1) has an unknown value:
         diff(A, I1, I2) = 1 − P(value(A, I2) | class(I1))
   o If both instances have unknown values:
         diff(A, I1, I2) = 1 − Σ_{V=1}^{#values(A)} P(V | class(I1)) × P(V | class(I2))
Finally, by extending RELIEF-D the author obtained versions that deal with multi-class
problems, namely RELIEF-E and RELIEF-F; RELIEF-F showed advantages on both noise-free
and noisy data.

RELIEF-F: find one near miss M(C) for each different class and average their contributions
when updating the estimate W[A]:

            W[A] := W[A] − diff(A, R, H)/m + Σ_{C ≠ class(R)} [P(C) × diff(A, R, M(C))] / m
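A sketch of this multi-class update for a single sampled instance R (Python/NumPy); diff is taken here as the absolute difference of pre-scaled numeric attributes and P(C) as the empirical class frequency, which are assumptions of this illustration.

    import numpy as np

    def relieff_update(W, X, y, i, m):
        """Add instance i's contribution to W: one near hit, one near miss per other class."""
        R, cls = X[i], y[i]
        dist = np.linalg.norm(X - R, axis=1)
        dist[i] = np.inf
        H = X[np.where(y == cls, dist, np.inf).argmin()]        # near hit
        W -= np.abs(R - H) / m                                  # - diff(A, R, H) / m
        for c in np.unique(y):
            if c == cls:
                continue
            prior = np.mean(y == c)                             # P(C)
            Mc = X[np.where(y == c, dist, np.inf).argmin()]     # near miss M(C) in class c
            W += prior * np.abs(R - Mc) / m                     # + P(C) * diff(A, R, M(C)) / m
        return W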



FS: Liu, Dash et al.
Review of DR methods
One important piece of work by Liu and Dash is their review, which groups all feature selection
methods under the classification scenario.

o   Dash, M., &: Liu, H. 1997. Feature selection for classification. Intelligent Data Analysis, 1, 131-156.
The major contribution of this paper is to present the following figure, in which the
feature selection process is addressed clearly.


Figure: Feature selection process with validation (original feature set -> subset generation ->
subset evaluation, yielding the goodness of the subset -> stopping criterion; if not met, generate
the next subset; if met, validation)

Liu et al. claimed that each feature selection method can be characterized by the type of
generation procedure and the evaluation function it uses, so the paper presented a table
classifying feature selection methods. Each row of the table stands for one type of evaluation
measure and each column for one kind of generation method; most of the methods listed in its
cells (5 x 3 = 15 cells in total) are addressed in the review.

Table: Two-dimensional categorization of feature selection methods

  Evaluation measure       Heuristic generation                        Complete generation                      Random generation
  Distance measure         Relief, Relief-F, Sege84                    B&B, BFF, Bobr88                         -
  Information measure      DTM, Kroll-Saha96                           MDLM                                     -
  Dependency measure       POE1ACC, PRESET                             -                                        -
  Consistency measure      -                                           Focus, MIFES                             -
  Classifier error rate    SBS, SFS, SBS-SLASH, PQSS, Moor-Lee94,      Ichi-Skla84a, Ichi-Skla84b, AMB&B, BS    GA, SA, RGSS, RMHC-PF1
                           BDS, RC, Quei-Gels84



LVF + Consistency Measure
H. Liu, M. Dash, et al. published many papers on their feature selection methods.

o    Liu, H., and Setiono, R. (1996) A probabilistic approach to feature selection - A filter solution. In 13th
     International Conference on Machine Learning (ICML'96), July 1996, pp. 319-327. Bari, Italy.
o    H. Liu and S. Setiono. Some issues on scalable feature selection. In 4th World Congress of Expert
     Systems: Application of Advanced Information Technologies, 1998.

In the introduction of the first paper, the authors first compared the wrapper approach and the
filter approach. Several reasons were given why the wrapper approach is not as general as the
filter approach, despite its advantages: (1) it inherits the learning algorithm's bias, (2) it has a
high computational cost, and (3) large data sets cause problems for some algorithms, making it
impractical to employ computationally intensive learning algorithms such as neural networks or
genetic algorithms. Second, the introduction addressed two types of feature selection search:
exhaustive (checking correlations of all orders) and heuristic (making use of 1st- and 2nd-order
information, i.e., single attributes and combinations of two attributes).




The first paper introduced a random search based on a Las Vegas algorithm, which uses
randomness to guide the search and guarantees that, sooner or later, it will reach a correct
solution. The probabilistic approach is called LVF; it uses the inconsistency rate together with
the Las Vegas random search. LVF only works on discrete attributes because it relies on the
inconsistency calculation.
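A minimal sketch of LVF on discrete data (Python): the inconsistency rate of a subset counts, for each pattern of selected-attribute values, the instances beyond the majority class, and the Las Vegas search keeps the smallest random subset whose rate stays within a tolerance gamma. The parameter names and the fixed number of trials used as the stopping rule are assumptions of this sketch.

    import random
    from collections import Counter, defaultdict

    def inconsistency_rate(X, y, subset):
        """Sum over matching patterns of (group size - majority class count), divided by |X|."""
        groups = defaultdict(Counter)
        for row, label in zip(X, y):
            groups[tuple(row[j] for j in subset)][label] += 1
        return sum(sum(c.values()) - max(c.values()) for c in groups.values()) / len(X)

    def lvf(X, y, gamma=0.0, max_tries=1000, seed=0):
        """Las Vegas Filter: random subsets, never larger than the current best consistent one."""
        rng = random.Random(seed)
        n = len(X[0])
        best = list(range(n))
        for _ in range(max_tries):
            size = rng.randint(1, len(best))
            cand = sorted(rng.sample(range(n), size))
            if size <= len(best) and inconsistency_rate(X, y, cand) <= gamma:
                best = cand
        return best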

The second paper pointed out three big issues in feature selection identified by Liu et al.:
(1) a large number of features; (2) a large number of instances; and (3) feature expansion due
to a changing environment. The authors proposed the LVF algorithm to reduce the number of
features, and an upgraded LVF algorithm (LVS) was developed for large numbers of instances;
its main idea is to use some percentage of the data set and then add more instances step by step
until certain conditions are satisfied. The last issue becomes, in many cases, a missing-attribute
problem.
Figures: 1. Scaling data; 2. Expanding features
At the end of the paper, the authors considered the computing implementation as a potential
area of future work. Two ideas were listed: one is to use parallel feature selection; the other is
to use database techniques such as data warehouses and metadata.
o   Huan Liu, Rudy Setiono: Incremental Feature Selection. Applied Intelligence 9(3): 217-230 (1998)

This paper is almost the same as the one addressing LVS; the only difference is that the
algorithm is named LVI.

ABB + Consistency Measure
o   H. Liu, H. Motoda, and M. Dash, A Monotonic Measure for Optimal Feature Selection, Proc. of
    ECML-98, pages 101-106, 1998.
o   M. Dash, H. Liu and H. Motoda, "Consistency Based Feature Selection", pp 98 -- 109, PAKDD 2000,
    Kyoto, Japan. April, 2000. Springer.

The first paper studied the monotonicity of the inconsistency measure, in view of the fact that
most error- or distance-based measures are not monotonic. The authors argued that a monotonic
measure is necessary to find an optimal subset of features using complete, but not exhaustive,
search. The paper gave an ABB (Automatic Branch & Bound) algorithm to select a subset of
features.
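A rough sketch of ABB (Python), reusing the inconsistency_rate helper from the LVF sketch above: the bound is the inconsistency rate of the full feature set, the search removes one feature at a time in breadth-first fashion, and any child subset that exceeds the bound is pruned together with everything below it.

    from collections import deque

    def abb(X, y):
        """Automatic Branch & Bound: smallest subsets whose inconsistency stays within the bound."""
        n = len(X[0])
        full = tuple(range(n))
        bound = inconsistency_rate(X, y, full)           # the bound is set automatically from the full set
        best, queue, visited = [full], deque([full]), {full}
        while queue:
            subset = queue.popleft()
            for j in subset:                              # expand: remove one feature at a time
                child = tuple(f for f in subset if f != j)
                if not child or child in visited:
                    continue
                visited.add(child)
                if inconsistency_rate(X, y, child) <= bound:   # legitimate node: keep searching below it
                    queue.append(child)
                    if len(child) < len(best[0]):
                        best = [child]
                    elif len(child) == len(best[0]):
                        best.append(child)
        return best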
The second paper focused on the consistency measure, which was used with five different
feature selection algorithms: FOCUS (exhaustive search), ABB (Automatic Branch & Bound,
complete search), SetCover (heuristic search), LVF (probabilistic search), and QBB (a
combination of LVF and ABB, hybrid search).

QBB + Consistency Measure
o   M. Dash and H. Liu, "Hybrid search of feature subsets," in PRICAI'98, (Singapore), Springer-Verlag,
    November 1998.

This paper proposed a hybrid algorithm (QBB) combining probabilistic and complete search: it
begins with LVF to reduce the number of features and then runs ABB to obtain the optimal
feature subset. The recommended strategy is: if M is known to be small, apply FocusM; else if
N >= 9, apply ABB; otherwise apply QBB.




VHR (Discretization)
o   H. Liu and R. Setiono, "Dimensionality reduction via discretization," Knowledge Based Systems 9(1),
    pp. 71--77, 1996.

This paper proposed a vertical and horizontal reduction (VHR) method to build a data and
dimensionality reduction (DDR) system. The algorithm is based on the idea of Chi-merge for
discretization. Duplicates produced by the merging are removed from the data set, and if an
attribute ends up merged into only one value, it simply means that the attribute can be discarded.
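A rough sketch of the idea in Python: a ChiMerge-style pass merges adjacent intervals of each numeric attribute while the chi-square statistic stays below a threshold, then the vertical reduction drops attributes left with a single interval and the horizontal reduction drops duplicate discretized rows. The threshold value, the integer class labels and the interval encoding are choices of this illustration, not details taken from the paper.

    import numpy as np

    def chi2_pair(a, b):
        """Chi-square statistic for two adjacent intervals given their per-class counts."""
        table = np.vstack([a, b]).astype(float)
        expected = table.sum(1, keepdims=True) * table.sum(0, keepdims=True) / table.sum()
        expected[expected == 0] = 1e-9                   # empty cells contribute (almost) nothing
        return ((table - expected) ** 2 / expected).sum()

    def chimerge(values, labels, n_classes, threshold=2.7):
        """Merge adjacent intervals of one attribute; return the interval lower bounds."""
        order = np.argsort(values)
        values, labels = values[order], labels[order]
        bounds, counts = [], []
        for v, c in zip(values, labels):                 # one initial interval per distinct value
            if not bounds or v != bounds[-1]:
                bounds.append(v)
                counts.append(np.zeros(n_classes))
            counts[-1][c] += 1
        while len(bounds) > 1:
            chis = [chi2_pair(counts[i], counts[i + 1]) for i in range(len(bounds) - 1)]
            i = int(np.argmin(chis))
            if chis[i] > threshold:                      # no adjacent pair is similar enough to merge
                break
            counts[i] += counts.pop(i + 1)
            bounds.pop(i + 1)
        return bounds

    def vhr(X, y, n_classes):
        """Vertical reduction: drop single-interval attributes; horizontal: drop duplicate rows."""
        cuts = [chimerge(X[:, j], y, n_classes) for j in range(X.shape[1])]
        keep = [j for j, c in enumerate(cuts) if len(c) > 1]
        disc = np.column_stack([np.searchsorted(cuts[j], X[:, j], side="right") for j in keep])
        return keep, np.unique(np.column_stack([disc, y]), axis=0)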
Neural Network Pruning
o   R. Setiono and H. Liu, Neural network feature selector," IEEE Trans. on Neural Networks, vol. 8, no.
    3, pp. 654-662, 1997.

This paper proposed the use of a three-layer feedforward neural network to select the input
attributes (features) that are most useful for discriminating classes in a given set of input
patterns. A network pruning algorithm is the foundation of the proposed method, and a simple
criterion based on the accuracy rate of the network determines whether an attribute can be
removed.

Kohavi, John et al.
Compared to Liu et al., Kohavi, John et al. did much of their research on the wrapper
approach.

o   R. Kohavi and G.H. John, Wrappers for feature subset selection, Artificial Intelligence 97(1-2) (1997),
    273--324.

In this influential paper Kohavi and John presented a number of disadvantages of the
filter approach to the feature selection problem, steering research towards algorithms
adopting the wrapper approach.

The wrapper approach to feature subset selection is shown in the following figure:




Relevance measure

Definition 1: an optimal feature subset. Given an inducer I and a dataset D with features
X1, X2, ..., Xn, drawn from a distribution over the labeled instance space, an optimal feature
subset Xopt is a subset of the features such that the accuracy of the classifier C = I(D) induced
by I is maximal.

Definitions 2-3: existing definitions of relevance. Almuallim & Dietterich: a feature Xi is
said to be relevant to a concept C if Xi appears in every Boolean formula that represents C,
and irrelevant otherwise. Gennari et al.: Xi is relevant iff there exist some xi and y for
which p(Xi = xi) > 0 such that p(Y = y | Xi = xi) ≠ p(Y = y).

Definition 4. Let Si = {X1, ..., X(i-1), X(i+1), ..., Xm} and let si be a value assignment to all
features in Si. Xi is relevant iff there exist some xi, y and si for which p(Xi = xi) > 0 such that
p(Y = y, Si = si | Xi = xi) ≠ p(Y = y, Si = si).

Definition 5 (Strong relevance). Xi is strongly relevant iff there exist some xi, y and si for
which p(Si = si, Xi = xi) > 0 such that p(Y = y | Si = si, Xi = xi) ≠ p(Y = y | Si = si).

Definition 6 (Weak relevance). Xi is weakly relevant iff it is not strongly relevant and there
exists a subset Si' of Si for which there exist some xi, y and si' with p(Si' = si', Xi = xi) > 0
such that p(Y = y | Si' = si', Xi = xi) ≠ p(Y = y | Si' = si').

The following figure shows a view of feature set relevance.




Search & Induce

The paper then demonstrated the wrapper approach with two search methods, hill-climbing
and best-first search, and two induction algorithms, a decision tree learner (ID3) and
Naive-Bayes, on 14 datasets.
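A minimal sketch of the wrapper idea with greedy forward hill-climbing (Python, using scikit-learn's GaussianNB and cross_val_score as a stand-in inducer and accuracy estimator; the choice of estimator and the number of folds are assumptions of this sketch, not the paper's setup).

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    def wrapper_forward_selection(X, y, inducer=GaussianNB(), cv=5):
        """Add, at each step, the feature whose inclusion most improves cross-validated accuracy."""
        selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
        while remaining:
            score, j = max((cross_val_score(inducer, X[:, selected + [j]], y, cv=cv).mean(), j)
                           for j in remaining)
            if score <= best_score:                  # hill-climbing stops at the first non-improving step
                break
            best_score, selected = score, selected + [j]
            remaining.remove(j)
        return selected, best_score

Best-first search, the stronger alternative used in the paper, differs only in that it keeps a queue of partially explored subsets instead of committing to a single path.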
The paper also gave some directions for future work: (1) other search engines such as
simulated annealing and genetic algorithms; (2) selecting the initial subset of features; (3)
incremental operations and aggregation techniques; (4) parallel computing techniques; and (5)
the overfitting issue (using cross-validation).

FE: PCA
o   Partridge M. and RA Calvo. Fast Dimensionality Reduction and Simple PCA, Intelligent Data
    Analysis, 2(3), 1998.

A fast and simple algorithm for approximately calculating the principal components (PCs) of a
data set, and thereby reducing its dimensionality, is described. This Simple Principal
Components Analysis (SPCA) method was used for dimensionality reduction of two
high-dimensional image databases, one of handwritten digits and one of handwritten Japanese
characters, and was tested and compared with other techniques, including matrix methods such
as the SVD and several data methods.
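For reference, a minimal sketch of the exact PCA computation that SPCA approximates (Python/NumPy, via eigendecomposition of the covariance matrix); this is the standard baseline, not the SPCA algorithm itself.

    import numpy as np

    def pca(X, n_components):
        """Project X onto its leading principal components."""
        Xc = X - X.mean(axis=0)                             # centre the data
        eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        order = np.argsort(eigvals)[::-1][:n_components]    # directions of largest variance first
        components = eigvecs[:, order]
        return Xc @ components, components

On image databases of the size used in the paper, forming and diagonalizing the full covariance matrix is the cost that fast approximate schemes such as SPCA aim to avoid.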

o   David Gering (2002) In fulfillment of the Area Exam doctoral requirements: Linear and Nonlinear
    Data Dimensionality Reduction, http://www.ai.mit.edu/people/gering/areaexam/areaexam.doc

This report presented three different approaches to deriving PCA: Pearson's Least Squares
Distance approach, Hotelling's Change of Variables approach, and the author's new method of
Matrix Factorization for Variation Compression. The report also addresses Multidimensional
Scaling (MDS). The author then gave three implementations of the two techniques: Eigenfaces,
Locally Linear Embedding (LLE) and Isometric feature mapping (Isomap).

FE: Soft Computing Approaches
This is an interesting area.

o   Pal, N.R. and K.K.Chintalapudi (1997). "A connectionist system for feature selection", Neural, Parallel
    and Scientific Computation Vol. 5, 359-382.
o   R. N. Pal, "Soft Computing for Feature Analysis," Fuzzy Sets and Systems, Vol.103, 201221, 1999.
In the first paper, Pal and Chintalapudi proposed a connectionist model for selection of a
subset of good features for pattern recognition problems. Each input node of an MLP has
an associated multiplier, which allows or restricts the passing of the corresponding
feature into the higher layers of the net. A high value of the attenuation factor indicates
that the associated feature is either redundant or harmful. The network learns both the
connection weights and the attenuation factors. At the end of learning, features with high
value of the attenuation factors are eliminated.
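A rough sketch of the gating idea in Python: each input is multiplied by a value derived from a learnable attenuation factor before entering the network, and features whose multipliers remain high after training are kept. The particular gate function exp(-a^2), the cut-off, and the mlp_forward callable are illustrative assumptions, not the form used in the paper.

    import numpy as np

    def gate(attenuation):
        """Map an attenuation factor to a multiplier in (0, 1]; high attenuation -> multiplier near 0."""
        return np.exp(-attenuation ** 2)

    def gated_forward(X, attenuation, mlp_forward):
        """Attenuate each feature before it reaches the (hypothetical) MLP forward pass."""
        return mlp_forward(X * gate(attenuation))

    def surviving_features(attenuation, cutoff=0.5):
        """After weights and attenuation factors are trained jointly (by backprop, not shown),
        keep only the features whose multiplier stays above the cutoff."""
        return np.flatnonzero(gate(attenuation) > cutoff)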

The second paper gave an overview of using three soft computing techniques: fuzzy logic,
neural networks, and genetic algorithms for feature ranking, selection and extraction.
    o Fuzzy sets were introduced in 1965 by Zadeh as a new way to represent
        vagueness in everyday life.
    o Neural networks have the characteristics of parallel computing, robustness, built-in
        learnability, and the capability to deal with imprecise, fuzzy, noisy, and probabilistic
        information.
    o Genetic Algorithms (GAs) are biologically inspired tools for optimization. They
        are parallel and randomized search techniques.
This paper also mentioned the separation of the two problems of feature selection and
feature extraction.

For feature extraction using neural networks, Pal reviewed several methods such as
    o PCA neural networks by Rubner (J. Rubner and P. Tavan. A self-organization network for
         principal-component analysis. Europhysics Letters, 10:693-698, 1989.) (This reminded me of
         the book Principal Component Neural Networks: Theory and Applications by Kostas I.
         Diamantaras and S. Y. Kung, New York: Wiley, 1996);
    o Nonlinear projection by Sammon (J. W. Sammon, Jr., "A nonlinear mapping for data
         structure analysis," IEEE Trans. Comput., vol. C-18, pp. 401--409, May 1969.)
For feature ranking & selection using neural networks, Pal summarized three types of
methods:
    o Saliency based feature ranking (SAFER)
(Ruck, D.W., S.K.Rogers and M.Kabrisky (1990). "Feature selection using a multilayer perceptron",
Journal of Neural Network Computing, 40-48.)
    o Sensitivity based feature ranking (SEFER)
(R. K. De, N. R. Pal, and S. K. Pal. Feature analysis: Neural network and fuzzy set theoretic approaches.
Pattern Recognition, 30(10):1579--1590, 1997)
    o An attenuator based feature selection (AFES)

Neural networks

o   De Backer S., Naud A., Scheunders P.. - Nonlinear dimensionality reduction techniques for
    unsupervised feature extraction. - In: Pattern recognition letters, 19(1998), p. 711-720

In this paper, a study is performed on unsupervised non-linear feature extraction. Four
techniques were studied: a multidimensional scaling approach (MDS), Sammon’s
mapping (SAM), Kohonen’s self-organizing map (SOM) and an auto-associative
feedforward neural network (AFN). All four yield better classification results than the
optimal linear approach, PCA, and can therefore be utilized as a feature extraction step in the
design of classification schemes. Because of the nature of the techniques, SOM and
AFN perform better for very low dimensions. Because of the complexity of the
techniques, MDS and SAM are most suited for high-dimensional data sets with a limited
number of data points, while SOM and AFN are more appropriate for low-dimensional
problems with a large number of data points.
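For orientation, a minimal sketch of classical (metric) MDS from a pairwise distance matrix (Python/NumPy, double centring followed by eigendecomposition); the paper studies nonlinear MDS variants and Sammon's mapping, so this is only the simplest baseline form of the MDS family.

    import numpy as np

    def classical_mds(D, n_components=2):
        """Embed points in n_components dimensions from an n x n matrix of pairwise distances D."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n                  # centring matrix
        B = -0.5 * J @ (D ** 2) @ J                          # double-centred (Gram) matrix
        eigvals, eigvecs = np.linalg.eigh(B)
        order = np.argsort(eigvals)[::-1][:n_components]
        scale = np.sqrt(np.clip(eigvals[order], 0, None))    # guard against small negative eigenvalues
        return eigvecs[:, order] * scale

Sammon's mapping and the nonlinear MDS variants instead minimize a stress function between original and embedded distances by iterative optimization, which is what makes them costly for large numbers of data points.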

Aha et al.
o   Aha, D. W. & Bankert, R. L. (1994), Feature selection for case-based classification of cloud types: An
    empirical comparison, in Working Notes of the AAAI-94 Workshop on Case-based Reasoning, pp
    106-112. 22
o   Aha, D. W. and Bankert, R. L. (1995), A comparative evaluation of sequential feature selection
    algorithms, In Proceedings of the Fifth International 33 Workshop on Artificial Intelligence and
    Statistics, editors D. Fisher and H. Lenz, pp. 1--7, Ft. Lauderdale, FL.

The two papers gave a framework for sequential feature selection (BEAM); each paper presents
its own version of the algorithm in a figure. The framework treats feature selection as a
combination of a search method (FSS and BSS) and an evaluation function (IB1 and Index).

Bradley et al
o   P. S. Bradley, O. L. Mangasarian, and W. N. Street. Feature selection via mathematical programming.
    INFORMS Journal on Computing, 10:209--217, 1998.

This paper tried to transform feature selection into a mathematical programming problem. The
task becomes discriminating two given sets (A ∈ R^{m×n} and B ∈ R^{k×n}) in an
n-dimensional feature space while using as few of the given features as possible. In
mathematical programming terms, we attempt to generate a separating plane
P = {x | x ∈ R^n, x^T ω = γ}, suppressing as many components of ω as possible, i.e., in a
feature space of as small a dimension as possible, while minimizing the average distance of
misclassified points to the plane:
(FS)   min_{ω, γ, y, z, v}   (1 − λ) (e^T y / m + e^T z / k) + λ e^T v*
       subject to   −Aω + eγ + e ≤ y,   Bω − eγ + e ≤ z,   y ≥ 0,   z ≥ 0,   −v ≤ ω ≤ v,   with λ ∈ [0, 1)
Typically this will be achieved in a feature space of reduced dimensionality, that is, e^T v* < n.
The term e^T v* involves the step function (·)*, which is discrete; different smooth
approximations of it lead to different algorithms. Three methods were proposed in this paper (a
small numeric sketch of the first two follows below):
     o Standard sigmoid: approximate the step function componentwise by t* ≈ (1 + ε^{−αt})^{−1},
          giving FSS (FS Sigmoid).
     o Concave exponential (advantageous for its simplicity and concavity): t* ≈ 1 − ε^{−αt},
          giving FSV (FS Concave).
     o Treat it as a linear program with equilibrium constraints, giving FSL; after reformulating
          to avoid the computational difficulty, this becomes FSB (FS Bilinear program).
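A small numeric sketch of the two smooth approximations of the step-function term e^T v* (Python/NumPy, with exp written in place of the paper's ε notation; the value of α and the sample vector are arbitrary):

    import numpy as np

    v = np.array([0.0, 0.01, 0.5, 2.0])      # magnitudes of the weight components
    alpha = 5.0

    exact   = np.count_nonzero(v)                           # e^T v*: number of nonzero components
    sigmoid = np.sum(1.0 / (1.0 + np.exp(-alpha * v)))      # FSS: sigmoid approximation
    concave = np.sum(1.0 - np.exp(-alpha * v))              # FSV: concave exponential approximation

    print(exact, round(sigmoid, 3), round(concave, 3))      # larger alpha makes both approximations sharper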

This paper also gave an adaptation of the optimal brain damage (OBD).

Algorithms for FSS, FSV, FSB and OBD were presented. Experiments were carried out
on the WPBC (Wisconsin Prognostic Breast Cancer) problem and the Ionosphere
problem.

o   P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector
    machines. In Proc. 15th International Conf. on Machine Learning, pages 82--90. Morgan Kaufmann,
    San Francisco, CA, 1998.

This paper is essentially extracted from parts of the previous paper, focusing on FSV and its
algorithm (the Successive Linearization Algorithm, SLA). The introduction of the SVM was
just to add another suppression term to the objective function.

Hall et al.
o   Hall, M. A., Smith, L. A. (1999). Feature selection for machine learning: comparing a correlation-
    based filter approach to the wrapper. Proceedings of the Florida Artificial Intelligence Symposium
    (FLAIRS-99).
o   Practical feature subset selection for machine learning. Proceedings of the 21st Australian Computer
    Science Conference. Springer. 181-191.
o   M. Hall. Correlation-based feature selection for machine learning. PhD thesis, University of Waikato,
    1999.
o   Hall, M. (2000). Correlation-based feature selection of discrete and numeric class machine learning. In
    Proceedings of the International Conference on Machine Learning, pages 359-366, San Francisco, CA.
    Morgan Kaufmann Publishers.
In these papers and the thesis, Hall presented his new method, CFS (Correlation-based Feature
Selection): an algorithm that couples a subset evaluation formula with an appropriate
correlation measure and a heuristic search strategy.
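The evaluation formula at the heart of CFS is the merit of a k-feature subset S, Merit_S = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature inter-correlation. A minimal sketch in Python, using Pearson correlation as a stand-in for the symmetrical-uncertainty style measures Hall uses for discrete data:

    import numpy as np
    from itertools import combinations

    def cfs_merit(X, y, subset):
        """Reward correlation with the class, penalize redundancy among the chosen features."""
        k = len(subset)
        r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
        if k == 1:
            return r_cf
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                        for a, b in combinations(subset, 2)])
        return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

A forward best-first search would then add, at each step, the feature that most increases this merit.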

Further experiments compared CFS with a wrapper—a well known approach to feature
selection that employs the target learning algorithm to evaluate feature sets. In many
cases CFS gave comparable results to the wrapper, and in general, outperformed the
wrapper on small datasets. CFS executes many times faster than the wrapper, which
allows it to scale to larger datasets.

Langley et al.
o   Blum, Avrim and Langley, Pat. Selection of Relevant Features and Examples in Machine Learning, In
    Artificial Intelligence, Vol. 97, No. 1-2, pages 245-271, 1997.

This paper addressed the two problems of irrelevant features and irrelevant examples. For
feature selection, the authors used almost the same relevance concepts and definitions as John,
Kohavi and Pfleger.




Secondly, the authors regarded feature selection as a problem of heuristic search, and four
basic issues were studied: (1) the starting point; (2) the organization of the search, exhaustive
or greedy; (3) the evaluation of alternative subsets; and (4) the halting criterion. This
discussion parallels Liu's publications.
The authors then reviewed three types of feature selection methods: (1) those that embed the
selection within the basic induction algorithm; (2) those that use selection to filter the features
passed to induction; and (3) those that treat selection as a wrapper around the induction
process. Two tables in the paper list the characteristics of the different methods.
Devaney et al. - Conceptual Clustering
o   Devaney, M., and Ram, A. Efficient Feature Selection in Conceptual Clustering. Machine Learning:
    ICML '97, 92-97, Morgan Kaufmann, San Francisco, CA, 1997.

This paper addressed feature selection in unsupervised situations, where a typical wrapper
approach cannot be applied because of the absence of class labels in the dataset. One solution is
to use the average predictive accuracy over all attributes. Another is to use category utility
(M. A. Gluck and J. E. Corter. Information, uncertainty, and the unity of categories. In Proceedings of the
7th Annual Conference of the Cognitive Science Society, pages 283--287, Irvine, CA, 1985).

The COBWEB system (D. H. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139--172, 1987), which applies the category utility, was used in the paper:

     CU = [ Σ_k P(C_k) Σ_i Σ_j P(A_i = V_ij | C_k)^2  −  Σ_i Σ_j P(A_i = V_ij)^2 ] / n
The C_k terms are the concepts in the partition, A_i is each attribute, and V_ij is each of the
possible values of attribute A_i. The equation yields a measure of the increase in the number of
attribute values that can be predicted given a set of concepts C_1 ... C_k over the number of
attribute values that could be predicted without using any concepts.
The term Σ_i Σ_j P(A_i = V_ij)^2 is the probability of each attribute value independent of class
membership and is obtained from the parent of the partition. The P(C_k) term weights the
values for each concept according to its size, and the division by n, the number of concepts in
the partition, allows comparison of partitions of different sizes.
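A minimal sketch of this nominal-attribute category utility in Python, with the partition given as a list of row-index arrays over a table of nominal values (the representation of the partition is a choice of this illustration):

    import numpy as np

    def p_squared_sum(rows):
        """Sum over attributes A_i and values V_ij of P(A_i = V_ij)^2 within the given rows."""
        total = 0.0
        for col in rows.T:
            _, counts = np.unique(col, return_counts=True)
            total += np.sum((counts / len(col)) ** 2)
        return total

    def category_utility(data, partition):
        """CU = (1/n) * sum_k P(C_k) * [predictability within C_k - predictability at the parent]."""
        baseline = p_squared_sum(data)                   # sum_i sum_j P(A_i = V_ij)^2 at the parent
        n = len(partition)
        cu = 0.0
        for cluster in partition:
            rows = data[cluster]
            cu += (len(rows) / len(data)) * (p_squared_sum(rows) - baseline)
        return cu / n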
For continuous attributes, the paper used the category utility of the CLASSIT algorithm (Gennari, J. H. (1990).
An experimental study of concept formation. Doctoral dissertation, Department of Information & Computer
Science, University of California, Irvine), which replaces the sums of squared probabilities with terms of the
form 1/σ_ik, where K is the number of classes in the partition, σ_ik is the standard deviation of attribute i in
class k, and σ_ip is the standard deviation of attribute i at the parent node.
The method, as the authors describe it, blurs the traditional wrapper/filter distinction: it is like a
wrapper model in that the underlying learning algorithm is used to guide the descriptor search,
but it is like a filter in that the evaluation function measures an intrinsic property of the data
rather than some type of predictive accuracy.

A hill-climbing based search algorithm, AICC, was then proposed, and the heart disease and
LED datasets were used to benchmark the methodology.




Caruana & Freitag
o   Caruana, R. and D. Freitag. Greedy Attribute Selection. in International Conference on Machine
    Learning. 1994.

The paper examined five greedy hillclimbing procedures (forward selection, backward
elimination, forward stepwise selection, backward stepwise elimination, and backward
stepwise elimination with SLASH) that search for attribute sets that generalize well with
ID3/C4.5. A caching scheme was presented that makes attribute hillclimbing more practical
computationally.

Mais conteúdo relacionado

Mais procurados

EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...
EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...
EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...cscpconf
 
Certified global minima
Certified global minimaCertified global minima
Certified global minimassuserfa7e73
 
Citython presentation
Citython presentationCitython presentation
Citython presentationAnkit Tewari
 
imple and new optimization algorithm for solving constrained and unconstraine...
imple and new optimization algorithm for solving constrained and unconstraine...imple and new optimization algorithm for solving constrained and unconstraine...
imple and new optimization algorithm for solving constrained and unconstraine...salam_a
 

Mais procurados (6)

EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...
EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...
EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...
 
Certified global minima
Certified global minimaCertified global minima
Certified global minima
 
Citython presentation
Citython presentationCitython presentation
Citython presentation
 
Unit Root Test
Unit Root Test Unit Root Test
Unit Root Test
 
imple and new optimization algorithm for solving constrained and unconstraine...
imple and new optimization algorithm for solving constrained and unconstraine...imple and new optimization algorithm for solving constrained and unconstraine...
imple and new optimization algorithm for solving constrained and unconstraine...
 
Ch03
Ch03Ch03
Ch03
 

Destaque

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learningbutest
 
SP.Matveev.IComp.Cover.AUG2016
SP.Matveev.IComp.Cover.AUG2016SP.Matveev.IComp.Cover.AUG2016
SP.Matveev.IComp.Cover.AUG2016Alex Matveev
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 

Destaque (7)

Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
20161010090514287
2016101009051428720161010090514287
20161010090514287
 
SP.Matveev.IComp.Cover.AUG2016
SP.Matveev.IComp.Cover.AUG2016SP.Matveev.IComp.Cover.AUG2016
SP.Matveev.IComp.Cover.AUG2016
 
Polynomial stations
Polynomial stationsPolynomial stations
Polynomial stations
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 

Semelhante a FOCUS.doc

FOCUS.doc
FOCUS.docFOCUS.doc
FOCUS.docbutest
 
Survey on Various Classification Techniques in Data Mining
Survey on Various Classification Techniques in Data MiningSurvey on Various Classification Techniques in Data Mining
Survey on Various Classification Techniques in Data Miningijsrd.com
 
Two-Stage Eagle Strategy with Differential Evolution
Two-Stage Eagle Strategy with Differential EvolutionTwo-Stage Eagle Strategy with Differential Evolution
Two-Stage Eagle Strategy with Differential EvolutionXin-She Yang
 
Machine Learning and Artificial Neural Networks.ppt
Machine Learning and Artificial Neural Networks.pptMachine Learning and Artificial Neural Networks.ppt
Machine Learning and Artificial Neural Networks.pptAnshika865276
 
On Feature Selection Algorithms and Feature Selection Stability Measures : A ...
On Feature Selection Algorithms and Feature Selection Stability Measures : A ...On Feature Selection Algorithms and Feature Selection Stability Measures : A ...
On Feature Selection Algorithms and Feature Selection Stability Measures : A ...AIRCC Publishing Corporation
 
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ijcsit
 
On Feature Selection Algorithms and Feature Selection Stability Measures : A...
 On Feature Selection Algorithms and Feature Selection Stability Measures : A... On Feature Selection Algorithms and Feature Selection Stability Measures : A...
On Feature Selection Algorithms and Feature Selection Stability Measures : A...AIRCC Publishing Corporation
 
Analytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion miningAnalytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion miningcsandit
 
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...cscpconf
 
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MININGANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MININGcsandit
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsDinusha Dilanka
 
Higgs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleHiggs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleSajith Edirisinghe
 
Higgs bosob machine learning challange
Higgs bosob machine learning challangeHiggs bosob machine learning challange
Higgs bosob machine learning challangeTharindu Ranasinghe
 
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...ijsc
 
SVM Based Identification of Psychological Personality Using Handwritten Text
SVM Based Identification of Psychological Personality Using Handwritten Text SVM Based Identification of Psychological Personality Using Handwritten Text
SVM Based Identification of Psychological Personality Using Handwritten Text IJERA Editor
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1  Feature SetOptimal Feature Selection from VMware ESXi 5.1  Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 

Semelhante a FOCUS.doc (20)

FOCUS.doc
FOCUS.docFOCUS.doc
FOCUS.doc
 
Survey on Various Classification Techniques in Data Mining
Survey on Various Classification Techniques in Data MiningSurvey on Various Classification Techniques in Data Mining
Survey on Various Classification Techniques in Data Mining
 
Two-Stage Eagle Strategy with Differential Evolution
Two-Stage Eagle Strategy with Differential EvolutionTwo-Stage Eagle Strategy with Differential Evolution
Two-Stage Eagle Strategy with Differential Evolution
 
Machine Learning and Artificial Neural Networks.ppt
Machine Learning and Artificial Neural Networks.pptMachine Learning and Artificial Neural Networks.ppt
Machine Learning and Artificial Neural Networks.ppt
 
nnml.ppt
nnml.pptnnml.ppt
nnml.ppt
 
On Feature Selection Algorithms and Feature Selection Stability Measures : A ...
On Feature Selection Algorithms and Feature Selection Stability Measures : A ...On Feature Selection Algorithms and Feature Selection Stability Measures : A ...
On Feature Selection Algorithms and Feature Selection Stability Measures : A ...
 
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C...
 
On Feature Selection Algorithms and Feature Selection Stability Measures : A...
 On Feature Selection Algorithms and Feature Selection Stability Measures : A... On Feature Selection Algorithms and Feature Selection Stability Measures : A...
On Feature Selection Algorithms and Feature Selection Stability Measures : A...
 
Analytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion miningAnalytical study of feature extraction techniques in opinion mining
Analytical study of feature extraction techniques in opinion mining
 
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
Radial Basis Function Neural Network (RBFNN), Induction Motor, Vector control...
 
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MININGANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
ANALYTICAL STUDY OF FEATURE EXTRACTION TECHNIQUES IN OPINION MINING
 
Performance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning AlgorithmsPerformance Comparision of Machine Learning Algorithms
Performance Comparision of Machine Learning Algorithms
 
Higgs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - KaggleHiggs Boson Machine Learning Challenge - Kaggle
Higgs Boson Machine Learning Challenge - Kaggle
 
Higgs bosob machine learning challange
Higgs bosob machine learning challangeHiggs bosob machine learning challange
Higgs bosob machine learning challange
 
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...
Single Reduct Generation Based on Relative Indiscernibility of Rough Set Theo...
 
[IJET-V2I3P22] Authors: Harsha Pakhale,Deepak Kumar Xaxa
[IJET-V2I3P22] Authors: Harsha Pakhale,Deepak Kumar Xaxa[IJET-V2I3P22] Authors: Harsha Pakhale,Deepak Kumar Xaxa
[IJET-V2I3P22] Authors: Harsha Pakhale,Deepak Kumar Xaxa
 
SVM Based Identification of Psychological Personality Using Handwritten Text
SVM Based Identification of Psychological Personality Using Handwritten Text SVM Based Identification of Psychological Personality Using Handwritten Text
SVM Based Identification of Psychological Personality Using Handwritten Text
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1  Feature SetOptimal Feature Selection from VMware ESXi 5.1  Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 

Mais de butest

Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 
Download
DownloadDownload
Downloadbutest
 
resume.doc
resume.docresume.doc
resume.docbutest
 
Download.doc.doc
Download.doc.docDownload.doc.doc
Download.doc.docbutest
 

Mais de butest (20)

Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 
Download
DownloadDownload
Download
 
resume.doc
resume.docresume.doc
resume.doc
 
Download.doc.doc
Download.doc.docDownload.doc.doc
Download.doc.doc
 

FOCUS.doc

  • 1. Literature review of dimensionality reduction Feature Selection (FS) & Feature Extraction (FE) o K. Fukunaga and DR Olsen, "An algorithm for finding intrinsic dimensionality of data," IEEE Transactions on Computers, vol. 20, no. 2, pp. 176-183, 1971. This paper is the earliest literature I could collect on dimensionality reduction issue. The intrinsic dimensionality is defined as the smallest dimensional space we can obtain under some constraint. Two problems were addressed in this paper: the intrinsic dimensionality representation (mapping) and intrinsic dimensionality for classification (separating). FOCUS There are several famous dimensionality reduction algorithms which appear in almost all literatures. They are FOCUS and RELIEF. o H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), pages 547--552, Anaheim, CA, USA, 1991. AAAI Press. o Almuallim H., Dietterich T.G.: Efficient Algorithms for Identifying Relevant Features, Proc. of the Ninth Canadian Conference on Artificial Intelligence, University of British Columbia, Vancouver, May 11-15, 1992, 38-45 The first paper above proposed the FOCUS algorithm; the second upgraded it into FOCUS-2. FOCUS implements the Min-Feature bias that prefers consistent hypotheses definable over as few features as possible. In the simplest implementation, it does a breadth-first search and check for any inconsistency. RELIEF o K. Kira and L. Rendell. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), pages 129--134, Menlo Park, CA, USA, 1992. AAAI Press. o Kononenko, Igor. Estimating Attributes: Analysis and Extensions of RELIEF, In European Conference on Machine Learning, pages 171-182, 1994. In the first paper, RELIEF was presented as a feature ranking algorithm, using a weight- based algorithm. From the set of training instances, it first chooses a sample of instances; the user must provide the number of instances in this sample. RELIEFrandomly picks this sample of instances, and for each instance in it finds Near Hit (minimum distance to the same class instance) and Near Miss (minimum distance to the different class instance) instances based on a Euclidean distance measure. Class B Class A
  • 2. Figure: Near Hit and Near Miss The basic idea is to update the weights that are initialized to zero in the beginning based on the following equation: W[ A] W[ A] diff ( A, R, H ) 2 diff ( A, R, M ) 2 Where, A is the attribute; R is the instance randomly selected; H is the near hit; M is the near miss; and diff calculates the difference between two instances on attribute. After exhausting all instances in the sample, RELIEF chooses all features having weight greater than or equal to a threshold. IN the second paper, the algorithm of RELIEF, which only dealt with two-class problems, was upgraded into RELIEFF, which could handle noisy, incomplete, and multi-class data sets. Firstly, RELIEF was extended to search for k-nearest hits/misses instead of only one near/miss. The extended version was called RELIEF-A. It averaged the contribution of k nearest hits/misses to deal with noisy data. Secondly, based on different diff functions, we could get different BELIEF versions. So in this paper the author proposed three versions of RELIEF (RELIEF-B, RELIEF-C and RELIEF-D) to deal with incomplete data sets. RELIEF-D: Given two instances (I1 and I2)  If one instance (I1) has unknown value diff ( A, I1, I 2) 1 P(value( A, I 2) | class( I1))  If both instances have unknown values #values( A) diff ( A, I1, I 2) 1 ( P(V | calss( I1)) P(V | class( I 2))) V Finally, extension was made on RELIEF-D, the author got version to deal with multi- class problems, i.e., RELIEF-E and RELIEF-F. However, the RELIEF-F show advantages both in noise free and noisy data. RELIEF-F: find one near miss M(C) near each different class and average their contribution for updating estimates W[A] : W [ A] W [ A] diff ( A, R, H ) / m [ P(C) diff ( A, R, M (C))]/ m C class ( R ) FS: Liu, Dash et al. Review of DR methods One of important work of Liu and Dash is to review and group all feature selection, under classification scenarios. o Dash, M., &: Liu, H. 1997. Feature selection for classification. Intelligent Data Analysis, 1, 131-156.
  • 3. The major contribution of this paper is to present the following figure, in which the feature selection process is addressed clearly. Original Subset Generation Evaluation Feature Set Subset Goodness of No Stopping Yes Validation Criterion Figure: Feature selection process with validation And Liu et al. claimed that each feature selection method could be characterized in the types of generation procedure and evaluation function used by it. So this paper presented a table to classify all feature selection methods. In this table, each row stands for one type of evaluation measures. Each column represents for one kind of generation method. Most of the methods listed in each cell in this table (totally there are 5 X 3 = 15 cells) were addressed in this review paper. Table: Two dimensional categorization of feature selection methods Evaluation Generation methods Measures Heuristic Complete Random Distance measure Relief, Relief-F, Sege84 B&B, BFF, Bobr88 Information measure DTM, Kroll-Saha96 MDLM Dependency measure POE1ACC, PRESET Consistency measure Focus, MIFES Classifier error rate SBS, SFS, SBS- Ichi-Skla84a, Ichi- GA, SA, RGSS, FLASH, PQSS, Moor- Skla84b, AMB&B, RMHC-PF1 Lee94, BDS, RC, Quei- BS Gels84 LVF + Consistency Measure H. Liu, M. Dash, et al. published many papers on their feature selection methods. o Liu, H., and Setiono, R. (1996) A probabilistic approach to feature selection - A filter solution. In 13th International Conference on Machine Learning (ICML'96), July 1996, pp. 319-327. Bari, Italy. o H. Liu and S. Setiono. Some issues on scalable feature selection. In 4th World Congress of Expert Systems: Application of Advanced Information Technologies, 1998. In the introduction section of the first paper, the authors first compared the wrapper approach and the filter approach. Some reasons were pointed out why the wrapper
  • 4. approach is not as general as the filter approach, although it has certain advantages: (1) learning algorithm’s bias, (2) high computational cost, (3) large dataset will cause problems while running some algorithm, and impractical to employ computationally intensive ;earning algorithms such as neural network or genetic algorithm. Second, the section addressed two types of feature selection methods: exhaustive (check all order correlations) and heuristic (make use of 1st and 2nd order information – one attribute and the combination of two attributes) search. The first paper tried to introduce random search based on a Las Vegas algorithm, which used randomness to guide the search to guarantee sooner or later it will get a correct solution. The probabilistic approach is called LVF, which use the inconsistency rate and the LV random search. LVF only works on discrete attributes because it replies on the inconsistency calculation. The second paper pointed out three big issues in feature selection, identified by Liu et al. (1) large number of features; (2) large number of instance; and (3) feature expanding due to environment changing. The authors proposed a LVF algorithm to reduce the number of features. An upgraded LVF algorithm (LVS) was developed. The major idea is to use some percentage of data set and then add more instances step by step until some conditions satisfied. The last issue became the missing attribute problem in many cases.
  • 5. 1. Scaling data 2. Expanding features At the end of this paper, the authors considered the computing implementing as a potential future work area. Twp ideas were listed. One is to use parallel feature selection; the other is to use the database techniques such as data warehouse, metadata. o Huan Liu, Rudy Setiono: Incremental Feature Selection. Applied Intelligence 9(3): 217-230 (1998) This paper actually is almost the same as the one addressing LVS; the different is just the name of algorithm is LVI. ABB + Consistency Measure o H. Liu, H. Motoda, and M. Dash, A Monotonic Measure for Optimal Feature Selection, Proc. of ECML-98, pages 101-106, 1998. o M. Dash, H. Liu and H. Motoda, "Consistency Based Feature Selection", pp 98 -- 109, PAKDD 2000, Kyoto, Japan. April, 2000. Springer. The first paper studied the monotonic characteristic of the inconsistency measure, with regard to the fact that most error- or distance-based measures are not monotonic. The authors argued that monotonic measure should be necessary to find an optimal subset of features while using complete, but not exhaustive search. This paper gave an ABB (Automatic Branch & Bound) algorithm to select subset of features.
  • 6. The second paper focused on the consistency measure, which was used on five different algorithms of feature selection: FOCUS (exhaustive search), ABB (Automated Branch & Bounce) (complete search), SetCover (heuristic search), LVF (probabilistic search), and QBB (combination of LVF and ABB) (hybrid search). QBB + Consistency Measure o M. Dash and H. Liu, "Hybrid search of feature subsets," in PRICAI'98, (Singapore), Springer-Verlag, November 1998. This paper proposed a hybrid algorithm (QBB) of probabilistic and complete search, which began with LVF to reduce the number of features and then ran ABB to get optimal feature subset. So if M is known as SMALL, apply FocusM; else if N >= 9, apply ABB; else apply QBB. VHR (Discretization) o H. Liu and R. Setiono, "Dimensionality reduction via discretization," Knowledge Based Systems 9(1), pp. 71--77, 1996. This paper proposed a vertical and horizontal reduction (VHR) method to build a data and dimensionality reduction (DDR) system. This algorithm is based on the idea of Chi- merge during the discretization. The duplication after merging is removed from the data set. At the end, if an attribute is merged to only one value, it simple means that the attribute could be discarded.
  • 7. Neural Network Pruning o R. Setiono and H. Liu, Neural network feature selector," IEEE Trans. on Neural Networks, vol. 8, no. 3, pp. 654-662, 1997. This paper proposed the use of a three layer feed forward neural network to select those input attributes (features) that are most useful for discriminating classes in a given set of input patterns. A network pruning algorithm was the foundation of the proposed algorithm. A simple criterion was developed to remove an attribute based on the accuracy rate of the network. Kohavi, John et al. Compared to Liu et al, Kohavi, John et al. did much research work in the Wrapper approach. o R. Kohavi and G.H. John, Wrappers for feature subset selection, Artificial Intelligence 97(1-2) (1997), 273--324. In this influential paper Kohavi and John presented a number of disadvantages of the filter approach to the feature selection problem, steering research towards algorithms adopting the wrapper approach. The wrapper approach to feature subset selection is shown in the following figure: Relevance measure Definition 1: an optimal feature subset. Given an inducer I, and a dataset D with features X1, X2, …, Xn, from a distribution D over the labeled instance space, an optimal feature subset, Xopt, is a subset of the features such that the accuracy of the inducer classifier C = I(D) is maximal. Definition 2-3: Existing definitions of relevance. Almullim & Dietterich: A feature Xi is said to be relevant to a concept C if Xi appears in every Boolean formula that represents C
and irrelevant otherwise. Gennari et al.: X_i is relevant iff there exist some x_i and y for which p(X_i = x_i) > 0 such that p(Y = y | X_i = x_i) ≠ p(Y = y).

Definition 4. Let S_i = {X_1, ..., X_{i-1}, X_{i+1}, ..., X_m}, and let s_i be an assignment of values to all features in S_i. X_i is relevant iff there exist some x_i, y, and s_i for which p(X_i = x_i) > 0 such that p(Y = y, S_i = s_i | X_i = x_i) ≠ p(Y = y, S_i = s_i).

Definition 5 (Strong relevance). X_i is strongly relevant iff there exist some x_i, y, and s_i for which p(S_i = s_i, X_i = x_i) > 0 such that p(Y = y | S_i = s_i, X_i = x_i) ≠ p(Y = y | S_i = s_i).

Definition 6 (Weak relevance). X_i is weakly relevant iff it is not strongly relevant and there exists a subset S_i' of S_i for which there exist some x_i, y, and s_i' with p(S_i' = s_i', X_i = x_i) > 0 such that p(Y = y | S_i' = s_i', X_i = x_i) ≠ p(Y = y | S_i' = s_i').

The following figure shows a view of feature set relevance.

Search & Induce
The paper then demonstrated the wrapper approach with two search methods, hill-climbing and best-first search, and two induction algorithms, decision trees (ID3) and Naïve Bayes, on 14 datasets.
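A minimal sketch of the wrapper idea with greedy forward (hill-climbing) search, assuming scikit-learn is available; the inducer and the accuracy estimate below are illustrative placeholders, not the paper's experimental setup.

```python
from sklearn.model_selection import cross_val_score

def wrapper_forward_selection(estimator, X, y, cv=5):
    """Greedy forward search: repeatedly add the single feature that most
    improves the cross-validated accuracy of the wrapped inducer."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining:
        candidates = [(cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean(), f)
                      for f in remaining]
        score, feature = max(candidates)
        if score <= best_score:        # no candidate improves the estimate: stop
            break
        best_score = score
        selected.append(feature)
        remaining.remove(feature)
    return selected, best_score

# Usage (hypothetical data):
# from sklearn.naive_bayes import GaussianNB
# subset, score = wrapper_forward_selection(GaussianNB(), X, y)
```

Best-first search differs in keeping a queue of evaluated-but-unexpanded subsets so the search can backtrack, rather than committing to the single best successor at each step.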
This paper also gave some directions for future work: (1) other search engines, such as simulated annealing and genetic algorithms; (2) selection of the initial subset of features; (3) incremental operation and aggregation techniques; (4) parallel computing techniques; (5) the overfitting issue (using cross-validation).

FE: PCA
o   Partridge, M. and R. A. Calvo. Fast Dimensionality Reduction and Simple PCA, Intelligent Data Analysis, 2(3), 1998.

This paper described a fast and simple algorithm for approximately calculating the principal components (PCs) of a data set and thereby reducing its dimensionality. The Simple Principal Components Analysis (SPCA) method was used for dimensionality reduction of two high-dimensional image databases, one of handwritten digits and one of handwritten Japanese characters, and was tested against other techniques: matrix methods such as SVD and several data methods.

o   David Gering (2002). In fulfillment of the Area Exam doctoral requirements: Linear and Nonlinear Data Dimensionality Reduction, http://www.ai.mit.edu/people/gering/areaexam/areaexam.doc

This report presented three different approaches to deriving PCA: Pearson's least squares distance approach, Hotelling's change of variables approach, and the author's new method of matrix factorization for variation compression. The report also addressed Multidimensional Scaling (MDS), and then gave three implementations of the two techniques: Eigenfaces, Locally Linear Embedding (LLE), and Isometric feature mapping (Isomap).

FE: Soft Computing Approaches
This is an interesting area.
o   Pal, N.R. and K.K. Chintalapudi (1997). "A connectionist system for feature selection", Neural, Parallel and Scientific Computation, Vol. 5, 359-382.
o   N. R. Pal, "Soft Computing for Feature Analysis," Fuzzy Sets and Systems, Vol. 103, 201-221, 1999.
In the first paper, Pal and Chintalapudi proposed a connectionist model for selecting a subset of good features for pattern recognition problems. Each input node of an MLP has an associated multiplier that allows or restricts the passing of the corresponding feature into the higher layers of the net. A high value of the attenuation factor indicates that the associated feature is either redundant or harmful. The network learns both the connection weights and the attenuation factors, and at the end of learning, features with high attenuation factors are eliminated.

The second paper gave an overview of using three soft computing techniques, fuzzy logic, neural networks, and genetic algorithms, for feature ranking, selection, and extraction.
o   Fuzzy sets were introduced in 1965 by Zadeh as a new way to represent the vagueness of everyday life.
o   Neural networks offer parallel computation, robustness, built-in learning ability, and the capability to deal with imprecise, fuzzy, noisy, and probabilistic information.
o   Genetic Algorithms (GAs) are biologically inspired tools for optimization; they are parallel, randomized search techniques.

This paper also discussed the separation of the two problems of feature selection and feature extraction. For feature extraction using neural networks, Pal reviewed several methods, such as:
o   PCA neural networks by Rubner (J. Rubner and P. Tavan. A self-organization network for principal-component analysis. Europhysics Letters, 10:693-698, 1989.) (This reminded me of the book Principal Component Neural Networks: Theory and Applications by Kostas I. Diamantaras and S. Y. Kung, New York: Wiley, 1996.)
o   Nonlinear projection by Sammon (J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis," IEEE Trans. Comput., vol. C-18, pp. 401-409, May 1969.)

For feature ranking and selection using neural networks, Pal summarized three types of methods:
o   Saliency-based feature ranking (SAFER) (Ruck, D.W., S.K. Rogers and M. Kabrisky (1990). "Feature selection using a multilayer perceptron", Journal of Neural Network Computing, 40-48.)
o   Sensitivity-based feature ranking (SEFER) (R. K. De, N. R. Pal, and S. K. Pal. Feature analysis: Neural network and fuzzy set theoretic approaches. Pattern Recognition, 30(10):1579-1590, 1997.)
o   Attenuator-based feature selection (AFES)

Neural networks
o   De Backer, S., Naud, A., Scheunders, P. Nonlinear dimensionality reduction techniques for unsupervised feature extraction. Pattern Recognition Letters, 19 (1998), 711-720.

In this paper, a study is performed on unsupervised nonlinear feature extraction. Four techniques were studied: a multidimensional scaling approach (MDS), Sammon's mapping (SAM), Kohonen's self-organizing map (SOM), and an auto-associative feedforward neural network (AFN). All four yield better classification results than the optimal linear approach, PCA, and can therefore be utilized as a feature extraction step in a design for classification schemes. Because of the nature of the techniques, SOM and AFN perform better for very low dimensions; because of their complexity, MDS and SAM are most suited for high-dimensional data sets with a limited number of data points, while SOM and AFN are more appropriate for low-dimensional problems with a large number of data points.
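As a rough illustration of using an unsupervised embedding as a feature extraction step ahead of a classifier, the sketch below uses scikit-learn's metric MDS as a stand-in for the techniques compared in the paper; the data, the target dimensionality, and the classifier are made-up placeholders.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data: 3 informative dimensions buried among 17 noisy ones.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.hstack([y[:, None] + 0.3 * rng.normal(size=(200, 3)),
               rng.normal(size=(200, 17))])

# Unsupervised feature extraction: embed the data set in 3 dimensions.
# (scikit-learn's MDS has no out-of-sample transform, so the embedding is
# computed on the full data set before the classifier is evaluated.)
Z = MDS(n_components=3, random_state=0).fit_transform(X)

knn = KNeighborsClassifier()
print("raw features:", cross_val_score(knn, X, y, cv=5).mean())
print("MDS features:", cross_val_score(knn, Z, y, cv=5).mean())
```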
Aha et al.
o   Aha, D. W. & Bankert, R. L. (1994), Feature selection for case-based classification of cloud types: An empirical comparison, in Working Notes of the AAAI-94 Workshop on Case-based Reasoning, pp. 106-112.
o   Aha, D. W. and Bankert, R. L. (1995), A comparative evaluation of sequential feature selection algorithms, in Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, editors D. Fisher and H. Lenz, pp. 1-7, Ft. Lauderdale, FL.

These two papers gave a framework for sequential feature selection, BEAM. The following figures show the two versions of the algorithm from the two papers. The framework treats feature selection as a combination of the search method (FSS and BSS) and the evaluation function (IB1 and Index).

Bradley et al.
o   P. S. Bradley, O. L. Mangasarian, and W. N. Street. Feature selection via mathematical programming. INFORMS Journal on Computing, 10:209-217, 1998.

This paper cast feature selection as a mathematical programming problem. The task becomes discriminating two given sets, A ∈ R^{m×n} and B ∈ R^{k×n}, in an n-dimensional feature space using as few of the given features as possible.
In mathematical programming terms, we attempt to generate a separating plane P = {x | x ∈ R^n, x^T ω = γ}, suppressing as many of the components of ω as possible, i.e., working in a feature space of as small a dimension as possible, while minimizing the average distance of misclassified points to the plane:

(FS)    min_{ω, γ, y, z, v}   (1 − λ) (e^T y / m + e^T z / k) + λ e^T v_*,    λ ∈ [0, 1)
        s.t.   −Aω + eγ + e ≤ y,    Bω − eγ + e ≤ z,    y ≥ 0,   z ≥ 0,   −v ≤ ω ≤ v.

Typically this will be achieved in a feature space of reduced dimensionality, that is, with e^T v_* < n. The term e^T v_*, which counts the nonzero components of ω, involves a step function and is therefore discrete; different approximations of it lead to different algorithms. Three were proposed in this paper (α > 0 a parameter, ε the base of the natural logarithm):
o   Standard sigmoid: e^T v_* ≈ e^T (e + ε^{−αv})^{−1}, giving FSS (FS sigmoid).
o   Concave exponential (advantageous for its simplicity and concavity): e^T v_* ≈ e^T (e − ε^{−αv}), giving FSV (FS concave).
o   Treating the problem as a linear program with equilibrium constraints, giving FSL; after reformulating it to avoid the computational difficulty, we get FSB (FS bilinear program).

This paper also gave an adaptation of optimal brain damage (OBD). Algorithms for FSS, FSV, FSB, and OBD were presented, and experiments were carried out on the WPBC (Wisconsin Prognostic Breast Cancer) problem and the Ionosphere problem.

o   P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In Proc. 15th International Conf. on Machine Learning, pages 82-90. Morgan Kaufmann, San Francisco, CA, 1998.

This paper essentially extracted parts of the previous one, focusing on FSV and its algorithm, the Successive Linearization Algorithm (SLA). The SVM formulation was introduced mainly to add another suppression term to the objective function.

Hall et al.
o   Hall, M. A. and Smith, L. A. (1999). Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. Proceedings of the Florida Artificial Intelligence Symposium (FLAIRS-99).
o   Hall, M. A. and Smith, L. A. Practical feature subset selection for machine learning. Proceedings of the 21st Australian Computer Science Conference. Springer, 181-191.
o   Hall, M. Correlation-based feature selection for machine learning. PhD thesis, University of Waikato, 1999.
o   Hall, M. (2000). Correlation-based feature selection of discrete and numeric class machine learning. In Proceedings of the International Conference on Machine Learning, pages 359-366, San Francisco, CA. Morgan Kaufmann Publishers.
In these papers and his thesis, Hall presented his new method, CFS (Correlation-based Feature Selection). CFS couples an evaluation formula (a merit heuristic that favors features highly correlated with the class yet weakly correlated with each other) with an appropriate correlation measure and a heuristic search strategy. Further experiments compared CFS with a wrapper, a well-known approach to feature selection that employs the target learning algorithm to evaluate feature sets. In many cases CFS gave results comparable to the wrapper and, in general, outperformed the wrapper on small datasets. CFS executes many times faster than the wrapper, which allows it to scale to larger datasets.

Langley et al.
o   Blum, Avrim and Langley, Pat. Selection of Relevant Features and Examples in Machine Learning, Artificial Intelligence, Vol. 97, No. 1-2, pages 245-271, 1997.

This paper addressed the two problems of irrelevant features and irrelevant examples. For feature selection, the authors used almost the same relevance concepts and definitions as John, Kohavi, and Pfleger. They then regarded feature selection as a problem of heuristic search and studied four basic issues: (1) the starting point; (2) the organization of the search, exhaustive or greedy; (3) the evaluation of alternative subsets; and (4) the halting criterion. This discussion parallels Liu's publications. The authors then reviewed three types of feature selection methods: (1) those that embed the selection within the basic induction algorithm; (2) those that use selection to filter the features passed to induction; and (3) those that treat selection as a wrapper around the induction process. The following two tables list the characteristics of the different methods.
Devaney et al. - Conceptual Clustering
o   Devaney, M., and Ram, A. Efficient Feature Selection in Conceptual Clustering. Machine Learning: ICML '97, 92-97, Morgan Kaufmann, San Francisco, CA, 1997.

This paper addressed feature selection in unsupervised settings, where the typical wrapper approach cannot be applied because the dataset carries no class labels. One solution is to use average predictive accuracy over all attributes. Another is to use category utility (M. A. Gluck and J. E. Corter. Information, uncertainty, and the unity of categories. In Proceedings of the 7th Annual Conference of the Cognitive Science Society, pages 283-287, Irvine, CA, 1985). The paper used the COBWEB system (D. H. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987), which applies the category utility

    CU = (1/n) Σ_k P(C_k) [ Σ_i Σ_j P(A_i = V_ij | C_k)^2 − Σ_i Σ_j P(A_i = V_ij)^2 ].

The C_k terms are the concepts in the partition, A_i is each attribute, and V_ij is each of the possible values for that attribute. The equation yields a measure of the increase in the number of attribute values that can be predicted given a set of concepts C_1, ..., C_k over the number that could be predicted without using any concepts. The term Σ_i Σ_j P(A_i = V_ij)^2 captures the probability of each attribute value independent of class membership and is obtained from the parent of the partition. The P(C_k) term weights the values for each concept according to its size, and the division by n, the number of concepts in the partition, allows comparison of partitions of different sizes. For continuous attributes, the paper used the CLASSIT algorithm (Gennari, J. H. (1990). An experimental study of concept formation. Doctoral dissertation, Department of Information & Computer Science, University of California, Irvine).
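CLASSIT replaces the sums over discrete attribute values with attribute standard deviations; its category utility is commonly written (reconstructed here from the term descriptions that follow, rather than quoted from the paper) as

    CU = (1/K) Σ_k P(C_k) Σ_i (1/σ_ik − 1/σ_ip)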
where K is the number of classes in the partition, σ_ik is the standard deviation of attribute i in class k, and σ_ip is the standard deviation of attribute i at the parent node. As the authors describe, the method blurs the traditional wrapper/filter distinction: it is like a wrapper model in that the underlying learning algorithm is used to guide the search through descriptors, but it is like a filter in that the evaluation function measures an intrinsic property of the data rather than some type of predictive accuracy. A hill-climbing-based search algorithm, AICC, was then proposed, and the heart disease and LED datasets were used to benchmark the methodology.

Caruana & Freitag
o   Caruana, R. and D. Freitag. Greedy Attribute Selection. In International Conference on Machine Learning, 1994.

The paper examined five greedy hill-climbing procedures (forward selection, backward elimination, forward stepwise selection, backward stepwise elimination, and backward stepwise elimination-SLASH) that search for attribute sets that generalize well with ID3/C4.5. A caching scheme was presented that makes attribute hill-climbing much more practical computationally.
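As a rough sketch of the kind of caching that makes such hill-climbing affordable, repeated evaluations of the same attribute set can be memoized; the cache key and the evaluation callback below are illustrative assumptions, not the paper's implementation.

```python
def make_cached_evaluator(evaluate):
    """Wrap an attribute-set evaluation function with a cache so the
    hill-climber never pays twice for the same candidate subset."""
    cache = {}

    def cached(attr_set):
        key = frozenset(attr_set)            # attribute order is irrelevant
        if key not in cache:
            cache[key] = evaluate(attr_set)  # e.g. held-out accuracy of a decision tree
        return cache[key]

    return cached

# Usage with any subset-scoring function (hypothetical):
# score = make_cached_evaluator(lambda s: cross_val_score(tree, X[:, sorted(s)], y).mean())
# score([0, 3, 7]); score([7, 0, 3])   # the second call is a cache hit
```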