IMPROVING CLASSIFIER ACCURACY USING UNLABELED DATA
                                           Thamar I. Solorio          Olac Fuentes

                                               Department of Computer Science
                                    Instituto Nacional de Astrofísica, Óptica y Electrónica
                                                     Luis Enrique Erro #1
                                           Santa María Tonantzintla, Puebla, México



ABSTRACT

This paper describes an algorithm for improving classifier accuracy using unlabeled data. This is of practical significance given the high cost of obtaining labeled data and the large pool of unlabeled data readily available. The algorithm consists of building a classifier using a very small set of previously labeled data, then classifying a larger set of unlabeled data using that classifier, and finally building a new classifier using a combined data set containing the original set of labeled data and the set of previously unlabeled data. The algorithm proposed here was implemented using three well-known base learning algorithms: feedforward neural networks trained with Backpropagation, the Naive Bayes classifier and the C4.5 rule induction algorithm. Preliminary experimental results using 10 datasets from the UCI repository show that using unlabeled data improves the classification accuracy by 5% on average and that for 80% of the experiments the use of unlabeled data results in an improvement in the classifier's accuracy.

1. INTRODUCTION

One of the problems addressed by machine learning is that of data classification. Since the 1960s, many algorithms for data classification have been proposed. However, all learning algorithms suffer the same weakness: when the training set is small the classifier accuracy is low. Thus, these algorithms can become impractical because they need a very large training set. In many domains, unlabeled data are readily available, but manual labeling is time-consuming, difficult or even impossible. For example, there are millions of text documents available on the world-wide web, but, for the vast majority, a label indicating their topic is not available. Another example is character recognition: gathering examples of handwritten characters is easy, but manually labeling each character is a tedious task. Something similar occurs in astronomy: thousands of spectra per night can be obtained with an automated telescope, but an astronomer needs several minutes to manually classify each spectrum.

Thus the question is: can we take advantage of the large pool of unlabeled data? It would be extremely useful to find an algorithm that improves classification accuracy when the labeled data are insufficient. This is the problem addressed in this paper. We evaluated the impact of incorporating unlabeled data into the learning process using several learning algorithms. Experimental results show that the classifiers trained with labeled and unlabeled data are more accurate than the ones trained with labeled data only. This result holds for the overall averages over ten learning tasks.

Even though the interest in learning algorithms that use unlabeled data is recent, several methods have been proposed. Blum and Mitchell proposed a method for combining labeled and unlabeled data called co-training [1]. This method is targeted to a particular type of problem: classification where the examples can naturally be described using several different types of information. In other words, an instance can be classified using different subsets of the attributes describing that instance. Basically, the co-training algorithm is this: two weak classifiers are built, each one using a different kind of information, and the algorithm then bootstraps from these classifiers using unlabeled data. They focused on the problem of web-page classification, where each example can be classified using the words contained in that page or using the links that point to that page.

Nigam et al. proposed a different approach, where a theoretical argument is presented showing that useful information about the target function can be extracted from unlabeled data [2]. The algorithm learns to classify text from labeled and unlabeled documents. The idea in Nigam's approach was to combine the Expectation Maximization algorithm (EM) with the Naive Bayes classifier. They report an error reduction of up to 30%. In this work we extended this approach, incorporating unlabeled data into three different learning algorithms, and evaluated it using several data sets from the UCI Repository [3].

Unlabeled data have also been used for improving the performance of artificial neural networks. Fardanesh and Ersoy used the backpropagation algorithm, and the results
show that the classifier error can be decreased using unlabeled data in some problem domains [4].

The paper is organized as follows: the next section presents the learning algorithms. Section 3 describes how unlabeled data are incorporated into the classifier's training. Section 4 presents experimental results that compare the performance of the algorithms trained using labeled and unlabeled data to that of the classifiers trained with labeled data only. Finally, some conclusions and directions for future work are presented.

2. LEARNING ALGORITHMS

Experiments in this work were made with three of the most successful classification learning algorithms: feedforward neural networks trained with backpropagation, the C4.5 learning algorithm [5] and the Naive Bayes classifier.

2.1 Backpropagation and Feedforward Neural Networks

For problems involving real-valued attributes, Artificial Neural Networks (ANNs) are among the most effective learning methods currently known. Algorithms such as Backpropagation use gradient descent or another optimization algorithm to tune network parameters to best fit a training set of input-output pairs. The Backpropagation algorithm was applied in this work to a feedforward network containing two layers of sigmoidal units.

[Figure: network diagram with inputs X1, X2, X3, X4, a hidden layer, and weight labels WIH and WHO]
Figure 1. Representation of a feedforward neural network with one hidden layer.

2.2 Naive Bayes Classifier

The Naive Bayes classifier is a probabilistic algorithm based on the simplifying assumption that the attribute values are conditionally independent given the target values. Even though we know that in practice this assumption does not hold, the algorithm's performance has been shown to be comparable to that of neural networks in some domains [6,7]. The Naive Bayes classifier applies to learning tasks where each instance x can be described as a tuple of attribute values <a1, a2, ..., an> and the target function f(x) can take on any value from a finite set V.

When a new instance x is presented, the Naive Bayes classifier assigns to it the most probable target value by applying this rule:

$$F(x) = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j)$$

To summarize, the learning task of the Naive Bayes classifier is to build a hypothesis by estimating the different P(vj) and P(ai|vj) terms based on their frequencies over the training data.

2.3 The C4.5 Algorithm

C4.5 is an extension to the decision-tree learning algorithm ID3 [8]. Only a brief description of the method is given here; more information can be found in [5]. The algorithm consists of the following steps:
 1. Build the decision tree from the training set (conventional ID3).
 2. Convert the resulting tree into an equivalent set of rules. The number of rules is equal to the number of possible paths from the root to a leaf node.
 3. Prune each rule by removing any preconditions whose removal improves its accuracy, according to a validation set.
 4. Sort the pruned rules in descending order according to their accuracy, and consider them in this sequence when classifying subsequent instances.

Since the learning tasks used to evaluate this work involve nominal and numeric values, we implemented the version of C4.5 that incorporates continuous values.

3. INCORPORATING UNLABELED DATA

The algorithm for combining labeled and unlabeled data is described in this section. We apply the same procedure with all three learning algorithms. First, the data set is divided randomly into several groups: one of these groups, with its original classifications, is used as the training set, another group is set aside as the test set, and the
remaining data are the unlabeled examples. A classifier C1 is built using the training set and the learning algorithm L1. Then, we use C1 to classify the unlabeled examples. With the labels assigned by C1 we merge both sets into one training set to build a final classifier C2. Finally, the test data are classified using C2.

The process described above was carried out ten times with each learning task, and the overall averages are the results described in the next section.

4. EXPERIMENTAL RESULTS

We used the following datasets from the UCI repository: wine, glass, chess, breast cancer, lymphography, balloons, thyroid disease, tic-tac-toe, ionosphere and iris. Figure 2 compares the performance of C4.5 trained using the labeled data only with the same algorithm using both labeled and unlabeled data as described in the previous section. One point is plotted for each of the ten learning tasks taken from the Irvine repository of machine learning datasets [3]. We can see that most points lie above the dotted line, which indicates that the error rate of the C4.5 classifier trained with labeled and unlabeled data is smaller than the error of C4.5 trained with labeled data only. Similarly, Figure 3 compares the performance of the Naive Bayes classifier trained using labeled and unlabeled data to that obtained using only labeled data. Again, a lower error can be attained by incorporating unlabeled data. Finally, Figure 4 compares the performance of a neural network trained with unlabeled data incorporated to that of one trained using only labeled data.

As we can see in the three figures, the algorithm that shows the largest improvement with the incorporation of unlabeled data is C4.5. Over the ten learning tasks, C4.5 showed an average improvement of 8%, while the average improvements for neural networks and Naive Bayes were 5% and 3% respectively. Table 1 summarizes the results obtained in these experiments.

5. CONCLUSIONS AND FUTURE WORK

We have shown how learning from small sets of labeled training data can be improved upon with the use of larger sets of unlabeled data. Our experimental results using several training sets and three different learning algorithms show that for the vast majority of the cases, using unlabeled data improves the quality of the predictions made by the algorithms. This is of practical significance in domains where unlabeled data are readily available, but manual labeling may be time-consuming, difficult or impractical. Present and future work includes:
• Applying this methodology using ensembles of classifiers, where presumably the labeling of the unlabeled data, and thus the final classifications assigned by the algorithm, can be made more accurate.
• Experimental studies to characterize situations in which this approach is not applicable. It is clear that when the set of labeled examples is large enough or when the pseudo-labels cannot be assigned accurately, the use of unlabeled data cannot improve and may even decrease the overall classification accuracy.
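The procedure of Section 3 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the experiments: the names NaiveBayes and self_train, the add-one smoothing of the frequency estimates, and the toy weather-style data are all assumptions introduced here. The base learner is a small frequency-based Naive Bayes for nominal attributes in the spirit of Section 2.2, and the split into labeled and unlabeled groups is assumed to have been done already.

```python
from collections import Counter


class NaiveBayes:
    """Frequency-based Naive Bayes for nominal attributes (hypothetical helper)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.class_count = Counter(y)
        # Prior P(v_j), estimated from class frequencies.
        self.prior = {c: n / len(y) for c, n in self.class_count.items()}
        # counts[(i, a, c)]: number of class-c examples whose attribute i equals a.
        self.counts = Counter(
            (i, a, c) for row, c in zip(X, y) for i, a in enumerate(row)
        )
        # Distinct values per attribute, used for add-one smoothing (an assumption).
        self.n_values = [len({row[i] for row in X}) for i in range(len(X[0]))]
        return self

    def predict_one(self, row):
        def score(c):
            # P(v_j) times the product over attributes of P(a_i | v_j).
            s = self.prior[c]
            for i, a in enumerate(row):
                s *= (self.counts[(i, a, c)] + 1) / (
                    self.class_count[c] + self.n_values[i]
                )
            return s

        return max(self.classes, key=score)

    def predict(self, X):
        return [self.predict_one(row) for row in X]


def self_train(make_learner, X_lab, y_lab, X_unlab):
    """Build C1 on labeled data, pseudo-label the unlabeled data, build C2."""
    c1 = make_learner().fit(X_lab, y_lab)
    pseudo_labels = c1.predict(X_unlab)
    c2 = make_learner().fit(X_lab + X_unlab, y_lab + pseudo_labels)
    return c1, c2


# Toy data (hypothetical): the class depends only on the first attribute.
X_lab = [("sunny", "weak"), ("sunny", "strong"),
         ("rainy", "strong"), ("rainy", "weak")]
y_lab = ["play", "play", "stay", "stay"]
X_unlab = [("sunny", "weak"), ("rainy", "strong"), ("sunny", "strong")]

c1, c2 = self_train(NaiveBayes, X_lab, y_lab, X_unlab)
pseudo = c1.predict(X_unlab)
print(pseudo)
print(c2.predict([("rainy", "weak")]))
```

Passing the learner as a factory (make_learner) mirrors the fact that the same procedure is applied unchanged to each of the three base algorithms; any object with fit and predict methods could be substituted for the Naive Bayes sketch.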




                           Naive Bayes      Neural Networks                 C4.5
Dataset               C1     C2  Ratio     C1     C2  Ratio     C1     C2  Ratio
Wine               11.11   6.31   0.57   7.61   7.57   0.99  22.34  20.56   0.92
Glass              69.35  68.82   0.99  25.23  26.21   1.04  61.04  58.60   0.96
Chess              28.35  37.91   1.34      -      -      -  19.58  18.53   0.95
Breast cancer      27.30  27.51   1.01   5.94   5.36   0.90  10.70  10.28   0.96
Lymphography       27.82  36.72   1.32      -      -      -  40.17  38.28   0.95
Balloons           28.43  32.18   1.13      -      -      -  32.50  25.00   0.77
Thyroid disease     9.31   8.44   0.91  10.00   8.18   0.82  18.97  18.46   0.97
Tic-tac-toe        34.87  32.98   0.95      -      -      -  20.94  19.81   0.95
Ionosphere         64.04  64.04   1.00  12.52  12.47   1.00  25.64  21.81   0.85
Iris                4.93   2.64   0.54   9.80   9.42   0.96  18.10  16.76   0.93
Average                           0.97                 0.95                 0.92

Table 1. Comparison of the error rates of the three algorithms. C1 is the classifier built using labeled data only; C2 is the classifier built combining labeled and unlabeled data. The Ratio column presents the result for C2 divided by the corresponding figure for C1. Bold marks the lowest error for a given dataset and the largest reduction in error as a fraction of the original error for each learning task. C4.5 shows the best improvement in 60% of the tasks. In 77% of the learning tasks the error was reduced when using unlabeled data, and in 80% of the tasks the best overall results were obtained by a classifier that used unlabeled data.
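As a quick check of how the Ratio column and the summary figures follow from the error rates, the arithmetic can be reproduced with a few lines of Python. The (C1, C2) pairs below are copied from Table 1; the variable names are hypothetical, and experiments with no reported neural network result are simply left out.

```python
from statistics import mean

# (C1 error, C2 error) pairs copied from Table 1, in the row order of the table.
RESULTS = {
    "Naive Bayes": [
        (11.11, 6.31), (69.35, 68.82), (28.35, 37.91), (27.30, 27.51),
        (27.82, 36.72), (28.43, 32.18), (9.31, 8.44), (34.87, 32.98),
        (64.04, 64.04), (4.93, 2.64),
    ],
    "Neural Networks": [
        (7.61, 7.57), (25.23, 26.21), (5.94, 5.36),
        (10.00, 8.18), (12.52, 12.47), (9.80, 9.42),
    ],
    "C4.5": [
        (22.34, 20.56), (61.04, 58.60), (19.58, 18.53), (10.70, 10.28),
        (40.17, 38.28), (32.50, 25.00), (18.97, 18.46), (20.94, 19.81),
        (25.64, 21.81), (18.10, 16.76),
    ],
}

# Average of the per-task ratios C2 / C1 for each algorithm.
avg_ratio = {
    alg: round(mean(c2 / c1 for c1, c2 in pairs), 2)
    for alg, pairs in RESULTS.items()
}

# Fraction of all reported experiments in which C2 has strictly lower error.
all_pairs = [pair for pairs in RESULTS.values() for pair in pairs]
reduced = sum(c2 < c1 for c1, c2 in all_pairs)
fraction_reduced = reduced / len(all_pairs)

print(avg_ratio)
print(reduced, len(all_pairs), round(100 * fraction_reduced))
```

The averages come out as 0.97, 0.95 and 0.92, and the error is strictly reduced in 20 of the 26 reported experiments, which matches the 77% quoted in the caption.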




Figure 2. Comparison of C4.5 using labeled data only with C4.5 using unlabeled data. Points above the diagonal line exhibit lower error when C4.5 is given unlabeled data.

Figure 3. Comparison of the Naive Bayes classifier using labeled data only with the Naive Bayes classifier using unlabeled data. Points above the diagonal line exhibit lower error when the Naive Bayes classifier is given unlabeled data.

Figure 4. Comparison results of an ANN. Points above the diagonal line exhibit lower error when the ANN is given unlabeled data.

6. ACKNOWLEDGEMENT

We would like to thank CONACyT for partially supporting this work under grant J31877-A.

7. REFERENCES

[1] A. Blum, & T. Mitchell, Combining Labeled and Unlabeled Data with Co-Training, Proc. 1998 Conference on Computational Learning Theory, July 1998.

[2] K. Nigam, A. McCallum, S. Thrun, & T. Mitchell, Learning to Classify Text from Labeled and Unlabeled Documents, Machine Learning, 1999, 1-22.

[3] C. Merz, & P. M. Murphy, UCI repository of machine learning databases, http://www.ics.uci.edu./~mlearn/MLRepository.html, 1996.

[4] M. T. Fardanesh, & O. K. Ersoy, Classification Accuracy Improvement of Neural Network Classifiers by Using Unlabeled Data, IEEE Transactions on Geoscience and Remote Sensing, Vol. 36, No. 3, 1998, 1020-1025.

[5] J. R. Quinlan, C4.5: Programs for Machine Learning (San Mateo, CA: Morgan Kaufmann, 1993).

[6] D. Lewis, & M. Ringuette, A comparison of two learning algorithms for text categorization, Third Annual Symposium on Document Analysis and Information Retrieval, 1994, 81-93.

[7] T. Joachims, A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, Proc. 1997 International Conference on Machine Learning, 1997.

[8] J. R. Quinlan, Induction of decision trees, Machine Learning, 1(1), 1986, 81-106.

 
Analysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data MiningAnalysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data Miningijdmtaiir
 
Introduction Of Artificial neural network
Introduction Of Artificial neural networkIntroduction Of Artificial neural network
Introduction Of Artificial neural networkNagarajan
 
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesAI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesValue Amplify Consulting
 
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...IAEME Publication
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Feed forward neural network for sine
Feed forward neural network for sineFeed forward neural network for sine
Feed forward neural network for sineijcsa
 
Robust Fault-Tolerant Training Strategy Using Neural Network to Perform Funct...
Robust Fault-Tolerant Training Strategy Using Neural Network to Perform Funct...Robust Fault-Tolerant Training Strategy Using Neural Network to Perform Funct...
Robust Fault-Tolerant Training Strategy Using Neural Network to Perform Funct...Eswar Publications
 
Performance Evaluation of Different Data Mining Classification Algorithm and ...
Performance Evaluation of Different Data Mining Classification Algorithm and ...Performance Evaluation of Different Data Mining Classification Algorithm and ...
Performance Evaluation of Different Data Mining Classification Algorithm and ...IOSR Journals
 

Semelhante a Improving Classifier Accuracy using Unlabeled Data..doc (20)

Types of Machine Learnig Algorithms(CART, ID3)
Types of Machine Learnig Algorithms(CART, ID3)Types of Machine Learnig Algorithms(CART, ID3)
Types of Machine Learnig Algorithms(CART, ID3)
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
 
Classification Of Iris Plant Using Feedforward Neural Network
Classification Of Iris Plant Using Feedforward Neural NetworkClassification Of Iris Plant Using Feedforward Neural Network
Classification Of Iris Plant Using Feedforward Neural Network
 
Performance Evaluation of Classifiers used for Identification of Encryption A...
Performance Evaluation of Classifiers used for Identification of Encryption A...Performance Evaluation of Classifiers used for Identification of Encryption A...
Performance Evaluation of Classifiers used for Identification of Encryption A...
 
Comparison of hybrid pso sa algorithm and genetic algorithm for classification
Comparison of hybrid pso sa algorithm and genetic algorithm for classificationComparison of hybrid pso sa algorithm and genetic algorithm for classification
Comparison of hybrid pso sa algorithm and genetic algorithm for classification
 
11.comparison of hybrid pso sa algorithm and genetic algorithm for classifica...
11.comparison of hybrid pso sa algorithm and genetic algorithm for classifica...11.comparison of hybrid pso sa algorithm and genetic algorithm for classifica...
11.comparison of hybrid pso sa algorithm and genetic algorithm for classifica...
 
Knowledge base completion presentation
Knowledge base completion presentationKnowledge base completion presentation
Knowledge base completion presentation
 
Predicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensemblesPredicting rainfall using ensemble of ensembles
Predicting rainfall using ensemble of ensembles
 
Neural networks, naïve bayes and decision tree machine learning
Neural networks, naïve bayes and decision tree machine learningNeural networks, naïve bayes and decision tree machine learning
Neural networks, naïve bayes and decision tree machine learning
 
Analysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data MiningAnalysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data Mining
 
"Agro-Market Prediction by Fuzzy based Neuro-Genetic Algorithm"
"Agro-Market Prediction by Fuzzy based Neuro-Genetic Algorithm""Agro-Market Prediction by Fuzzy based Neuro-Genetic Algorithm"
"Agro-Market Prediction by Fuzzy based Neuro-Genetic Algorithm"
 
Introduction Of Artificial neural network
Introduction Of Artificial neural networkIntroduction Of Artificial neural network
Introduction Of Artificial neural network
 
PNN and inversion-B
PNN and inversion-BPNN and inversion-B
PNN and inversion-B
 
F017533540
F017533540F017533540
F017533540
 
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI TechnologiesAI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
AI Class Topic 6: Easy Way to Learn Deep Learning AI Technologies
 
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
AN IMPROVED METHOD FOR IDENTIFYING WELL-TEST INTERPRETATION MODEL BASED ON AG...
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Feed forward neural network for sine
Feed forward neural network for sineFeed forward neural network for sine
Feed forward neural network for sine
 
Robust Fault-Tolerant Training Strategy Using Neural Network to Perform Funct...
Robust Fault-Tolerant Training Strategy Using Neural Network to Perform Funct...Robust Fault-Tolerant Training Strategy Using Neural Network to Perform Funct...
Robust Fault-Tolerant Training Strategy Using Neural Network to Perform Funct...
 
Performance Evaluation of Different Data Mining Classification Algorithm and ...
Performance Evaluation of Different Data Mining Classification Algorithm and ...Performance Evaluation of Different Data Mining Classification Algorithm and ...
Performance Evaluation of Different Data Mining Classification Algorithm and ...
 

Mais de butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Mais de butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

Improving Classifier Accuracy using Unlabeled Data..doc

IMPROVING CLASSIFIER ACCURACY USING UNLABELED DATA

Thamar I. Solorio, Olac Fuentes
Department of Computer Science
Instituto Nacional de Astrofísica, Óptica y Electrónica
Luis Enrique Erro #1
Santa María Tonantzintla, Puebla, México

ABSTRACT

This paper describes an algorithm for improving classifier accuracy using unlabeled data. This is of practical significance given the high cost of obtaining labeled data and the large pool of unlabeled data readily available. The algorithm consists of building a classifier using a very small set of previously labeled data, then classifying a larger set of unlabeled data using that classifier, and finally building a new classifier using a combined data set containing the original set of labeled data and the set of previously unlabeled data. The algorithm proposed here was implemented using three well-known learning algorithms as base learners: feedforward neural networks trained with Backpropagation, the Naive Bayes Classifier and the C4.5 rule induction algorithm. Preliminary experimental results using 10 datasets from the UCI repository show that using unlabeled data improves the classification accuracy by 5% on average and that for 80% of the experiments the use of unlabeled data results in an improvement in the classifier's accuracy.

1. INTRODUCTION

One of the problems addressed by machine learning is that of data classification. Since the 1960s, many algorithms for data classification have been proposed. However, all learning algorithms suffer from the same weakness: when the training set is small, the classifier accuracy is low. Thus, these algorithms can become an impractical solution due to the need for a very large training set. In many domains, unlabeled data are readily available, but manual labeling is time-consuming, difficult or even impossible. For example, there are millions of text documents available on the world-wide web, but, for the vast majority, a label indicating their topic is not available. Another example is character recognition: gathering examples of handwritten characters is easy, but manually labeling each character is a tedious task. Something similar occurs in astronomy: thousands of spectra per night can be obtained with an automated telescope, but an astronomer needs several minutes to manually classify each spectrum.

Thus the question is: can we take advantage of the large pool of unlabeled data? It would be extremely useful if we could find an algorithm that improved classification accuracy when the labeled data are insufficient. This is the problem addressed in this paper. We evaluated the impact of incorporating unlabeled data into the learning process using several learning algorithms. Experimental results show that the classifiers trained with labeled and unlabeled data are more accurate than the ones trained with labeled data only. This is the result of the overall averages from ten learning tasks.

Even though the interest in learning algorithms that use unlabeled data is recent, several methods have been proposed. Blum and Mitchell proposed a method for combining labeled and unlabeled data called co-training [1]. This method is targeted at a particular type of problem: classification where the examples can naturally be described using several different types of information. In other words, an instance can be classified using different subsets of the attributes describing that instance. In essence, the co-training algorithm builds two weak classifiers, each one using a different kind of information, and then bootstraps from these classifiers using unlabeled data. They focused on the problem of web-page classification, where each example can be classified using the words contained in that page or using the links that point to that page.

Nigam et al. proposed a different approach, where a theoretical argument is presented showing that useful information about the target function can be extracted from unlabeled data [2]. The algorithm learns to classify text from labeled and unlabeled documents. The idea in Nigam's approach was to combine the Expectation Maximization algorithm (EM) with the Naive Bayes classifier. They report an error reduction of up to 30%. In this work we extend this approach, incorporating unlabeled data into three different learning algorithms, and evaluate it using several data sets from the UCI Repository [3].

Unlabeled data have also been used for improving the performance of artificial neural networks. Fardanesh and Ersoy used the backpropagation algorithm, and the results
show that the classifier error can be decreased using unlabeled data in some problem domains [4].

The paper is organized as follows: the next section presents the learning algorithms. Section 3 describes how unlabeled data are incorporated into the classifier's training. Section 4 presents experimental results that compare the performance of the algorithms trained using labeled and unlabeled data to those obtained by the classifiers trained with labeled data only. Finally, some conclusions and directions for future work are presented.

2. LEARNING ALGORITHMS

Experiments in this work were made with three of the most successful classification learning algorithms: feedforward neural networks trained with backpropagation, the C4.5 learning algorithm [5] and the Naive Bayes classifier.

2.1 Backpropagation and Feedforward Neural Networks

For problems involving real-valued attributes, Artificial Neural Networks (ANNs) are among the most effective learning methods currently known. Algorithms such as Backpropagation use gradient descent or another optimization algorithm to tune network parameters to best fit a training set of input-output pairs. The Backpropagation algorithm was applied in this work to a feedforward network containing two layers of sigmoidal units.

[Figure 1 diagram; labels: X1, X2, X3, X4, HIDDEN LAYER, WIH, WHO]

Figure 1. Representation of a feedforward neural network with one hidden layer.

2.2 Naive Bayes Classifier

The Naive Bayes classifier is a probabilistic algorithm based on the simplifying assumption that the attribute values are conditionally independent given the target values. Even though we know that in practice this assumption does not hold, the algorithm's performance has been shown to be comparable to that of neural networks in some domains [6,7]. The Naive Bayes classifier applies to learning tasks where each instance x can be described as a tuple of attribute values $\langle a_1, a_2, \ldots, a_n \rangle$ and the target function f(x) can take on any value from a finite set V.

When a new instance x is presented, the Naive Bayes classifier assigns to it the most probable target value by applying this rule:

$$F(x) = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j)$$

To summarize, the learning task of the Naive Bayes classifier is to build a hypothesis by estimating the different $P(v_j)$ and $P(a_i \mid v_j)$ terms based on their frequencies over the training data.

2.3 The C4.5 Algorithm

C4.5 is an extension of the decision-tree learning algorithm ID3 [8]. Only a brief description of the method is given here; more information can be found in [5]. The algorithm consists of the following steps:
1. Build the decision tree from the training set (conventional ID3).
2. Convert the resulting tree into an equivalent set of rules. The number of rules is equal to the number of possible paths from the root to a leaf node.
3. Prune each rule by removing any preconditions whose removal improves its accuracy, according to a validation set.
4. Sort the pruned rules in descending order according to their accuracy, and consider them in this sequence when classifying subsequent instances.

Since the learning tasks used to evaluate this work involve nominal and numeric values, we implemented the version of C4.5 that incorporates continuous values.

3. INCORPORATING UNLABELED DATA

The algorithm for combining labeled and unlabeled data is described in this section. We apply the same procedure with each of the three learning algorithms. First, the data set is divided randomly into several groups: one of these groups is considered with its original classifications as the training set, another group is separated as the test set and the
remaining data are the unlabeled examples. A classifier C1 is built using the training set and the learning algorithm L1. Then, we use C1 to classify the unlabeled examples. With the labels assigned by C1 we merge both sets into one training set to build a final classifier C2. Finally, the test data are classified using C2.

The process described above was carried out ten times with each learning task and the overall averages are the results described in the next section.

4. EXPERIMENTAL RESULTS

We used the following datasets from the UCI repository: wine, glass, chess, breast cancer, lymphography, balloons, thyroid disease, tic-tac-toe, ionosphere and iris. Figure 2 compares the performance of C4.5 trained using the labeled data only with the same algorithm using both labeled and unlabeled data as described in the previous section. One point is plotted for each of the ten learning tasks taken from the Irvine repository of machine learning datasets [3]. We can see that most points lie above the dotted line, which indicates that the error rate of the C4.5 classifier trained with labeled and unlabeled data is smaller than the error of C4.5 trained with labeled data only. Similarly, Figure 3 compares the performance of the Naive Bayes classifier trained using labeled and unlabeled data to that obtained using only labeled data. Again, a lower degree of error can be attained by incorporating unlabeled data. Finally, Figure 4 compares the performance of a neural network trained with unlabeled data incorporated to that of one trained using only labeled data.

As we can see in the three figures, the algorithm that shows the largest improvement with the incorporation of unlabeled data is C4.5. In the ten learning tasks C4.5 presented an average improvement of 8%, while the average improvements for neural networks and Naive Bayes were 5% and 3% respectively. Table 1 summarizes the results obtained in these experiments.

5. CONCLUSIONS AND FUTURE WORK

We have shown how learning from small sets of labeled training data can be improved upon with the use of larger sets of unlabeled data. Our experimental results using several training sets and three different learning algorithms show that for the vast majority of the cases, using unlabeled data improves the quality of the predictions made by the algorithms. This is of practical significance in domains where unlabeled data are readily available, but manual labeling may be time-consuming, difficult or impractical. Present and future work includes:
• Applying this methodology using ensembles of classifiers, where presumably the labeling of the unlabeled data and thus the final classifications assigned by the algorithm can be made more accurate.
• Experimental studies to characterize situations in which this approach is not applicable. It is clear that when the set of labeled examples is large enough or when the pseudo-labels cannot be assigned accurately, the use of unlabeled data cannot improve and may even decrease the overall classification accuracy.

               Naive Bayes Classifier     Neural Networks            C4.5
Dataset        C1      C2      Ratio      C1      C2      Ratio      C1      C2      Ratio
Wine           11.11    6.31   0.57        7.61    7.57   0.99       22.34   20.56   0.92
Glass          69.35   68.82   0.99       25.23   26.21   1.04       61.04   58.60   0.96
Chess          28.35   37.91   1.34        -       -      -          19.58   18.53   0.95
Breast         27.30   27.51   1.01        5.94    5.36   0.90       10.70   10.28   0.96
Lympho         27.82   36.72   1.32        -       -      -          40.17   38.28   0.95
Balloons       28.43   32.18   1.13        -       -      -          32.50   25.00   0.77
Thyroid         9.31    8.44   0.91       10.00    8.18   0.82       18.97   18.46   0.97
tic_tac_toe    34.87   32.98   0.95        -       -      -          20.94   19.81   0.95
ionosphere     64.04   64.04   1.00       12.52   12.47   1.00       25.64   21.81   0.85
Iris            4.93    2.64   0.54        9.80    9.42   0.96       18.10   16.76   0.93
Average                        0.97                       0.95                       0.92
Table 1. Comparison of the error rates of the three algorithms. C1 is the classifier built using labeled data only; C2 is the classifier built combining labeled and unlabeled data. Column Ratio presents the result for C2 divided by the corresponding figure for C1. In bold we can see the lowest error for a given dataset and the largest reduction in error as a fraction of the original error for each learning task. C4.5 shows the best improvement in 60% of the tasks. In 77% of the learning tasks the error was reduced when using unlabeled data, and in 80% of the tasks the best overall results were obtained by a classifier that used unlabeled data.

Figure 2. Comparison of C4.5 using labeled data only with C4.5 using unlabeled data. Points above the diagonal line exhibit lower error when C4.5 is given unlabeled data.

Figure 3. Comparison of the Naive Bayes Classifier using labeled data only with the Naive Bayes Classifier using unlabeled data. Points above the diagonal line exhibit lower error when the Naive Bayes Classifier is given unlabeled data.

6. ACKNOWLEDGEMENT

We would like to thank CONACyT for partially supporting this work under grant J31877-A.

7. REFERENCES

[1] A. Blum, & T. Mitchell, Combining Labeled and Unlabeled Data with Co-Training, Proc. 1998 Conference on Computational Learning Theory, July 1998.
[2] K. Nigam, A. McCallum, S. Thrun, & T. Mitchell, Learning to Classify Text from Labeled and Unlabeled Documents, Machine Learning, 1999, 1-22.
[3] C. Merz, & P. M. Murphy, UCI repository of machine learning databases, http://www.ics.uci.edu./~mlearn/MLRepository.html, 1996.
[4] M. T. Fardanesh, & O. K. Ersoy, Classification Accuracy Improvement of Neural Network Classifiers by Using Unlabeled Data, IEEE Transactions on Geoscience and Remote Sensing, 36(3), 1998, 1020-1025.
[5] J. R. Quinlan, C4.5: Programs for Machine Learning (San Mateo, CA: Morgan Kaufmann, 1993).
[6] D. Lewis, & M. Ringuette, A comparison of two learning algorithms for text categorization, Third Annual Symposium on Document Analysis and Information Retrieval, 1994, 81-93.
[7] T. Joachims, A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, Proc. 1997 International Conference on Machine Learning, 1997.
[8] J. R. Quinlan, Induction of decision trees, Machine Learning, 1(1), 1986, 81-106.

Figure 4. Comparison results of an ANN. Points above the diagonal line exhibit lower error when the ANN is given unlabeled data.
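
As an illustration only, the procedure of Section 3 (build C1 from the labeled set, let C1 label the unlabeled examples, then build C2 from the merged set) can be sketched with the Naive Bayes rule of Section 2.2 as the base learner. This is a minimal sketch and not the authors' implementation: the function names `train_naive_bayes` and `train_with_unlabeled` are hypothetical, attributes are assumed to be nominal, and the add-one (Laplace) smoothing and the use of log-probabilities are assumptions that the paper does not mention.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(X, y):
    """Fit a Naive Bayes model on nominal attributes; return a predict(x) function.

    Assumption of this sketch: add-one (Laplace) smoothing of P(a_i | v_j).
    """
    n = len(y)
    class_counts = Counter(y)
    # value_counts[(i, v)][a] = number of class-v examples whose attribute i equals a
    value_counts = defaultdict(Counter)
    domains = defaultdict(set)  # distinct values observed for each attribute
    for xs, v in zip(X, y):
        for i, a in enumerate(xs):
            value_counts[(i, v)][a] += 1
            domains[i].add(a)

    def predict(x):
        def log_score(v):
            # log P(v_j) + sum_i log P(a_i | v_j), estimated from frequencies
            score = math.log(class_counts[v] / n)
            for i, a in enumerate(x):
                num = value_counts[(i, v)][a] + 1
                den = class_counts[v] + len(domains[i])
                score += math.log(num / den)
            return score
        # arg max over the target values seen in training
        return max(sorted(class_counts), key=log_score)

    return predict


def train_with_unlabeled(learner, X_lab, y_lab, X_unlab):
    """Build C1 on labeled data, label X_unlab with C1, then build C2 on both sets."""
    c1 = learner(X_lab, y_lab)
    pseudo_labels = [c1(x) for x in X_unlab]
    c2 = learner(list(X_lab) + list(X_unlab), list(y_lab) + pseudo_labels)
    return c1, c2, pseudo_labels


if __name__ == "__main__":
    # Toy data, invented for this example only.
    X_lab = [("sunny", "hot"), ("rainy", "cool")]
    y_lab = ["no", "yes"]
    X_unlab = [("sunny", "mild"), ("rainy", "mild")]
    c1, c2, pseudo = train_with_unlabeled(train_naive_bayes, X_lab, y_lab, X_unlab)
    print(pseudo)                  # ['no', 'yes']
    print(c2(("sunny", "cool")))   # no
```

Because `train_with_unlabeled` takes the learner as an argument, the same wrapper could in principle be reused with any base learner that returns a prediction function, which mirrors the way the paper applies one procedure to three algorithms.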