SlideShare a Scribd company logo
1 of 4
Download to read offline
Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51
www.ijera.com 48 | P a g e
Selection of Significant Features Using Decision Tree Classifiers
Preeti Kumari1
, K. Rajeswari2
,
1
M.E Student, PCCOE, Pune
2
Ph.D Research Scholar, SASTRA and Associate professor, PCCOE, Pune, India
Abstract
Data Mining refers to extraction or mining knowledge from huge volume of data. Classification is an important
data mining technique with various applications. It classifies data of various kinds and used to classify each item
in a set of data into one of predefined set of classes or groups. This work has been carried out to select the most
significant attributes from a dataset using Decision tree classifier j48.This Decision tree classification algorithm
can be efficiently used in selecting the most significant feature of a dataset and hence we can reduce the
dimensionality of a dataset and yet obtain a very good accuracy. In this paper we have selected two datasets
from university of California, Irvine website, of weather and vote and applied j48 algorithm on it. After each
iteration, one attribute is removed and its accuracy has been found .We have compared the accuracy of dataset
with different attribute count and then selected the most significant features of dataset.
I. Literature review
1.1 Data mining
Data mining (sometimes called data or
knowledge discovery) is the process of analyzing
data from different perspectives and summarizing it
into useful information - information that can be used
to increase revenue, cuts costs, or both. Data mining
is a tool for analyzing data allowing users to analyze
data from many different dimensions or angles,
categorize it, and summarize the relationships
identified. Technically, data mining is the process of
finding correlations or patterns among dozens of
fields in large relational databases.
Classification is a data mining function that
assigns items in a collection to target categories or
classes. The goal of classification is to accurately
predict the target class for each case in the data. For
example, a classification model could be used to
identify loan applicants as low, medium, or high
credit risks. Classification models are tested by
comparing the predicted values to known target
values in a set of test data. The historical data for a
classification project is typically divided into two
data sets: one for building the model; the other for
testing the model to classify a given instance.
1.2 Feature selection
Feature selection is a term commonly used in
data mining to describe the tools and techniques
available for reducing inputs to a manageable size for
processing and analysis. Feature selection implies
not only cardinality reduction, which means imposing
an arbitrary or predefined cutoff on the number of
attributes that can be considered when building a
model, but also the choice of attributes, meaning that
either the analyst or the modeling tool actively selects
or discards attributes based on their usefulness for
analysis. The ability to apply feature selection is
critical for effective analysis, because datasets
frequently contain far more information than is
needed to build the model. For example, a dataset
might contain 500 columns that describe the
characteristics of customers, but if the data in some
of the columns is very sparse you would gain very
little benefit from adding them to the model. If you
keep the unneeded columns while building the
model, more CPU and memory are required during
the training process, and more storage space is
required for the completed model. Selection of
significant features of a dataset is an important task in
any application.
1.3 Decision tree classifier J48
J48 are the improved versions of C4.5 algorithms
or can be called as optimized implementation of the
C4.5. The output of J48 is the Decision tree. A
Decision tree is similar to the tree structure having
root node, intermediate nodes and leaf node. Each
node in the tree consist a decision and that decision
leads to our result. Decision tree divide the input
space of a data set into mutually exclusive areas, each
area having a label, a value or an action to describe
its data points. Splitting criterion is used to calculate
which attribute is the best to split that portion tree of
the training data that reaches a particular node.
II. Methodology
In Weka datasets should be formatted to the
ARFF format. The Weka Explorer will use these
automatically if it does not recognize a given file as
an ARFF file, the Preprocess panel has facilities for
importing data from a database, and for
preprocessing this data using a filtering algorithm.
These filters can be used to transform the data and
RESEARCH ARTICLE OPEN ACCESS
Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51
www.ijera.com 49 | P a g e
make it possible to delete instances and attributes
according to specific criteria Datasets
lassify tab is used for the classification purpose. We
have selected the decision tree classifier J48 to obtain
the results.
Step 1: Take the input dataset descretize it properly.
Step 2: Apply the decision tree classifier algorithm
J48 on the whole data set and note the accuracy given
by it.
Step 3: Reduce the dimension of the given dataset by
eliminating one or more attribute pair and note the
accuracy of dataset after applying J48 on it.
Step 4: Repeat step 4 for all combination of
attributes.
Step 5: Compare the different accuracy provided by
the dataset of different attribute taken together and
identify the significant feature of the dataset
III. Results and Discussion
1) Weather dataset
Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51
www.ijera.com 50 | P a g e
Best attribute: Temperature
Here we have selected temperature is the best attribute which alone gives the accuracy of 64.28 %
which is far better than the accuracy provided by the four attribute taken together 50%
2) Vote dataset
Best attribute: physician-free-freeze
For the vote dataset, which consist of 16 attributes which together gives accuracy of 96.32%, after
applying the algorithm we have found out that even single attribute physician-free-freeze gives the
accuracy of 95.63%, which is very efficient. Hence for this dataset it can be considered as the best
attribute.
IV. Conclusion
Thus in this paper we have tried to select best
significant feature of data set using repeatedly
applying the decision tree classification algorithm
J48 over the dataset and found that for weather
dataset temperature is the best attribute which alone
gives the accuracy of 64.28 % which is far better than
the accuracy provided by the four attribute taken
together 50% .For the vote dataset, which consist of
16 attributes which together gives accuracy of
96.32%, after applying the algorithm we have found
out that even single attribute physician-free-freeze
gives the accuracy of 95.63%, which is very efficient.
Hence for this dataset it can be considered as the best
attribute. Future work will focus on analysis on the
feature selected with various other techniques like
principle component analysis, genetic algorithms.
Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com
ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51
www.ijera.com 51 | P a g e
References
[1]. Olivier Henchiri , Nathalie Japkowicz. “A
Feature Selection and Evaluation Scheme
for Computer Virus Detection, Proceedings
of the Sixth International Conference on
Data Mining “, (ICDM'06).
[2] M. Dash, H. Liu, Feature Selection for
Classification, Intelligent Data Analysis 1
(1997) 131-156.
[3] Frank,A. & Asuncion ,A. (2010).UCI
Machine Learning Repository
[http://archive.ics.uci.edu/ml].Irvine,CA:
University of California, School of
Information and Computer Science.
[4] Data mining book – kamber
[5] V.Vaithiyanathan, K.Rajeswari, Swati
Tonge,, “ Improved Apriori algorithm based
on Selection Criterion”, IEEE Conference
ICCIC Dec 18-20, 2012, Coimbatore.
Catalog Number: CFP1220J-ART ISBN:
978-1-4673-1344-5
[6] Machine Learning Group at the University
of Waikato:http://www.cs.waikato.ac.nz/ml
/weka/

More Related Content

What's hot

Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RIOSR Journals
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusabilityAlexander Decker
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYEditor IJMTER
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection methodIJSRD
 
Classification By Clustering Based On Adjusted Cluster
Classification By Clustering Based On Adjusted ClusterClassification By Clustering Based On Adjusted Cluster
Classification By Clustering Based On Adjusted ClusterIOSR Journals
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...IJERA Editor
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
 
The pertinent single-attribute-based classifier for small datasets classific...
The pertinent single-attribute-based classifier  for small datasets classific...The pertinent single-attribute-based classifier  for small datasets classific...
The pertinent single-attribute-based classifier for small datasets classific...IJECEIAES
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for properIJDKP
 
Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...eSAT Publishing House
 
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEA CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEIJDKP
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentIJDKP
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CAREijistjournal
 
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...theijes
 
An efficient feature selection in
An efficient feature selection inAn efficient feature selection in
An efficient feature selection incsandit
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSEditor IJCATR
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...IJMER
 

What's hot (20)

Selecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-RSelecting the correct Data Mining Method: Classification & InDaMiTe-R
Selecting the correct Data Mining Method: Classification & InDaMiTe-R
 
11.software modules clustering an effective approach for reusability
11.software modules clustering an effective approach for  reusability11.software modules clustering an effective approach for  reusability
11.software modules clustering an effective approach for reusability
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection method
 
Classification By Clustering Based On Adjusted Cluster
Classification By Clustering Based On Adjusted ClusterClassification By Clustering Based On Adjusted Cluster
Classification By Clustering Based On Adjusted Cluster
 
A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...A Survey on Constellation Based Attribute Selection Method for High Dimension...
A Survey on Constellation Based Attribute Selection Method for High Dimension...
 
IJET-V2I6P32
IJET-V2I6P32IJET-V2I6P32
IJET-V2I6P32
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
 
M43016571
M43016571M43016571
M43016571
 
The pertinent single-attribute-based classifier for small datasets classific...
The pertinent single-attribute-based classifier  for small datasets classific...The pertinent single-attribute-based classifier  for small datasets classific...
The pertinent single-attribute-based classifier for small datasets classific...
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for proper
 
Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...Comparative study of various supervisedclassification methodsforanalysing def...
Comparative study of various supervisedclassification methodsforanalysing def...
 
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEA CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environment
 
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARECLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
CLUSTERING DICHOTOMOUS DATA FOR HEALTH CARE
 
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
 
An efficient feature selection in
An efficient feature selection inAn efficient feature selection in
An efficient feature selection in
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...
 

Viewers also liked

Apresentação Casulo
Apresentação CasuloApresentação Casulo
Apresentação CasuloYaraTerra
 
Inter Biz # Walmart
Inter Biz # WalmartInter Biz # Walmart
Inter Biz # Walmartmonsadako
 
Aula roteiro 2011.1 - 1o.per. - 2a.parte (alunos)
Aula roteiro   2011.1 - 1o.per. - 2a.parte (alunos)Aula roteiro   2011.1 - 1o.per. - 2a.parte (alunos)
Aula roteiro 2011.1 - 1o.per. - 2a.parte (alunos)informaticafdv
 
Will Gallows & o Troll Barriga de Serpente
Will Gallows & o Troll Barriga de SerpenteWill Gallows & o Troll Barriga de Serpente
Will Gallows & o Troll Barriga de SerpenteNanda Soares
 
Arch Linux: Uma distribuição leve e simples - Érico Nunes
Arch Linux: Uma distribuição leve e simples - Érico NunesArch Linux: Uma distribuição leve e simples - Érico Nunes
Arch Linux: Uma distribuição leve e simples - Érico NunesTchelinux
 
Gisela Espinosa. Coloquio Regiones, 2010
Gisela Espinosa. Coloquio Regiones, 2010Gisela Espinosa. Coloquio Regiones, 2010
Gisela Espinosa. Coloquio Regiones, 2010Pro Regiones
 
As cores do mundo
As cores do mundoAs cores do mundo
As cores do mundovaltermn
 
Derecho politico limitado
Derecho politico limitadoDerecho politico limitado
Derecho politico limitadoHugo Araujo
 

Viewers also liked (20)

Apresentação Casulo
Apresentação CasuloApresentação Casulo
Apresentação Casulo
 
Inter Biz # Walmart
Inter Biz # WalmartInter Biz # Walmart
Inter Biz # Walmart
 
Classroom
ClassroomClassroom
Classroom
 
Empleate y ocupate
Empleate y ocupateEmpleate y ocupate
Empleate y ocupate
 
Case de Sucesso Symnetics: Brasil Telecom
Case de Sucesso Symnetics: Brasil TelecomCase de Sucesso Symnetics: Brasil Telecom
Case de Sucesso Symnetics: Brasil Telecom
 
Aula roteiro 2011.1 - 1o.per. - 2a.parte (alunos)
Aula roteiro   2011.1 - 1o.per. - 2a.parte (alunos)Aula roteiro   2011.1 - 1o.per. - 2a.parte (alunos)
Aula roteiro 2011.1 - 1o.per. - 2a.parte (alunos)
 
Case de Sucesso Symnetics: Petrobras
Case de Sucesso Symnetics: PetrobrasCase de Sucesso Symnetics: Petrobras
Case de Sucesso Symnetics: Petrobras
 
18 cartões anjos azuis
18 cartões anjos azuis18 cartões anjos azuis
18 cartões anjos azuis
 
Webquest
WebquestWebquest
Webquest
 
Will Gallows & o Troll Barriga de Serpente
Will Gallows & o Troll Barriga de SerpenteWill Gallows & o Troll Barriga de Serpente
Will Gallows & o Troll Barriga de Serpente
 
Lesson5
Lesson5Lesson5
Lesson5
 
Arch Linux: Uma distribuição leve e simples - Érico Nunes
Arch Linux: Uma distribuição leve e simples - Érico NunesArch Linux: Uma distribuição leve e simples - Érico Nunes
Arch Linux: Uma distribuição leve e simples - Érico Nunes
 
Gisela Espinosa. Coloquio Regiones, 2010
Gisela Espinosa. Coloquio Regiones, 2010Gisela Espinosa. Coloquio Regiones, 2010
Gisela Espinosa. Coloquio Regiones, 2010
 
2 festa junina 2013
2 festa junina 20132 festa junina 2013
2 festa junina 2013
 
As cores do mundo
As cores do mundoAs cores do mundo
As cores do mundo
 
Wethebiocloud.programa.2013
Wethebiocloud.programa.2013Wethebiocloud.programa.2013
Wethebiocloud.programa.2013
 
El comercio electrónico.odt
El comercio electrónico.odtEl comercio electrónico.odt
El comercio electrónico.odt
 
Derecho politico limitado
Derecho politico limitadoDerecho politico limitado
Derecho politico limitado
 
Gestão para a organização da copa 2014
Gestão para a organização da copa 2014Gestão para a organização da copa 2014
Gestão para a organização da copa 2014
 
Mecanismos de estímulo às pessoas para inovação
Mecanismos de estímulo às pessoas para inovaçãoMecanismos de estímulo às pessoas para inovação
Mecanismos de estímulo às pessoas para inovação
 

Similar to G046024851

AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILESAN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILEScscpconf
 
Study on Relavance Feature Selection Methods
Study on Relavance Feature Selection MethodsStudy on Relavance Feature Selection Methods
Study on Relavance Feature Selection MethodsIRJET Journal
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...IRJET Journal
 
Clustering of Big Data Using Different Data-Mining Techniques
Clustering of Big Data Using Different Data-Mining TechniquesClustering of Big Data Using Different Data-Mining Techniques
Clustering of Big Data Using Different Data-Mining TechniquesIRJET Journal
 
IRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware PerformanceIRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware PerformanceIRJET Journal
 
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor DriveIRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor DriveIRJET Journal
 
Distributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebDistributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebEditor IJCATR
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesIRJET Journal
 
Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)eSAT Journals
 
IRJET-Scaling Distributed Associative Classifier using Big Data
IRJET-Scaling Distributed Associative Classifier using Big DataIRJET-Scaling Distributed Associative Classifier using Big Data
IRJET-Scaling Distributed Associative Classifier using Big DataIRJET Journal
 
Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Jayanti Pande
 
A Firefly based improved clustering algorithm
A Firefly based improved clustering algorithmA Firefly based improved clustering algorithm
A Firefly based improved clustering algorithmIRJET Journal
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmIRJET Journal
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET Journal
 
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINEA NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINEaciijournal
 
Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII)Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII)aciijournal
 
An experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestAn experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestIJDKP
 
Predicting students' performance using id3 and c4.5 classification algorithms
Predicting students' performance using id3 and c4.5 classification algorithmsPredicting students' performance using id3 and c4.5 classification algorithms
Predicting students' performance using id3 and c4.5 classification algorithmsIJDKP
 

Similar to G046024851 (20)

AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILESAN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
AN EFFICIENT FEATURE SELECTION IN CLASSIFICATION OF AUDIO FILES
 
Ijetcas14 338
Ijetcas14 338Ijetcas14 338
Ijetcas14 338
 
Study on Relavance Feature Selection Methods
Study on Relavance Feature Selection MethodsStudy on Relavance Feature Selection Methods
Study on Relavance Feature Selection Methods
 
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
E-Healthcare monitoring System for diagnosis of Heart Disease using Machine L...
 
Clustering of Big Data Using Different Data-Mining Techniques
Clustering of Big Data Using Different Data-Mining TechniquesClustering of Big Data Using Different Data-Mining Techniques
Clustering of Big Data Using Different Data-Mining Techniques
 
IRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware PerformanceIRJET- Deep Learning Model to Predict Hardware Performance
IRJET- Deep Learning Model to Predict Hardware Performance
 
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor DriveIRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
IRJET- Analysis of PV Fed Vector Controlled Induction Motor Drive
 
Distributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic WebDistributed Digital Artifacts on the Semantic Web
Distributed Digital Artifacts on the Semantic Web
 
Feature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering TechniquesFeature Subset Selection for High Dimensional Data using Clustering Techniques
Feature Subset Selection for High Dimensional Data using Clustering Techniques
 
Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)Efficient classification of big data using vfdt (very fast decision tree)
Efficient classification of big data using vfdt (very fast decision tree)
 
IRJET-Scaling Distributed Associative Classifier using Big Data
IRJET-Scaling Distributed Associative Classifier using Big DataIRJET-Scaling Distributed Associative Classifier using Big Data
IRJET-Scaling Distributed Associative Classifier using Big Data
 
Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.
 
A Firefly based improved clustering algorithm
A Firefly based improved clustering algorithmA Firefly based improved clustering algorithm
A Firefly based improved clustering algorithm
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering Algorithm
 
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data MiningIRJET- Study and Evaluation of Classification Algorithms in Data Mining
IRJET- Study and Evaluation of Classification Algorithms in Data Mining
 
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINEA NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
 
Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII)Advanced Computational Intelligence: An International Journal (ACII)
Advanced Computational Intelligence: An International Journal (ACII)
 
An experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestAn experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forest
 
Predicting students' performance using id3 and c4.5 classification algorithms
Predicting students' performance using id3 and c4.5 classification algorithmsPredicting students' performance using id3 and c4.5 classification algorithms
Predicting students' performance using id3 and c4.5 classification algorithms
 
Hx3115011506
Hx3115011506Hx3115011506
Hx3115011506
 

Recently uploaded

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Recently uploaded (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

G046024851

  • 1. Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51 www.ijera.com 48 | P a g e Selection of Significant Features Using Decision Tree Classifiers Preeti Kumari1 , K. Rajeswari2 , 1 M.E Student, PCCOE, Pune 2 Ph.D Research Scholar, SASTRA and Associate professor, PCCOE, Pune, India Abstract Data Mining refers to extraction or mining knowledge from huge volume of data. Classification is an important data mining technique with various applications. It classifies data of various kinds and used to classify each item in a set of data into one of predefined set of classes or groups. This work has been carried out to select the most significant attributes from a dataset using Decision tree classifier j48.This Decision tree classification algorithm can be efficiently used in selecting the most significant feature of a dataset and hence we can reduce the dimensionality of a dataset and yet obtain a very good accuracy. In this paper we have selected two datasets from university of California, Irvine website, of weather and vote and applied j48 algorithm on it. After each iteration, one attribute is removed and its accuracy has been found .We have compared the accuracy of dataset with different attribute count and then selected the most significant features of dataset. I. Literature review 1.1 Data mining Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining is a tool for analyzing data allowing users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. For example, a classification model could be used to identify loan applicants as low, medium, or high credit risks. Classification models are tested by comparing the predicted values to known target values in a set of test data. The historical data for a classification project is typically divided into two data sets: one for building the model; the other for testing the model to classify a given instance. 1.2 Feature selection Feature selection is a term commonly used in data mining to describe the tools and techniques available for reducing inputs to a manageable size for processing and analysis. Feature selection implies not only cardinality reduction, which means imposing an arbitrary or predefined cutoff on the number of attributes that can be considered when building a model, but also the choice of attributes, meaning that either the analyst or the modeling tool actively selects or discards attributes based on their usefulness for analysis. The ability to apply feature selection is critical for effective analysis, because datasets frequently contain far more information than is needed to build the model. For example, a dataset might contain 500 columns that describe the characteristics of customers, but if the data in some of the columns is very sparse you would gain very little benefit from adding them to the model. If you keep the unneeded columns while building the model, more CPU and memory are required during the training process, and more storage space is required for the completed model. Selection of significant features of a dataset is an important task in any application. 1.3 Decision tree classifier J48 J48 are the improved versions of C4.5 algorithms or can be called as optimized implementation of the C4.5. The output of J48 is the Decision tree. A Decision tree is similar to the tree structure having root node, intermediate nodes and leaf node. Each node in the tree consist a decision and that decision leads to our result. Decision tree divide the input space of a data set into mutually exclusive areas, each area having a label, a value or an action to describe its data points. Splitting criterion is used to calculate which attribute is the best to split that portion tree of the training data that reaches a particular node. II. Methodology In Weka datasets should be formatted to the ARFF format. The Weka Explorer will use these automatically if it does not recognize a given file as an ARFF file, the Preprocess panel has facilities for importing data from a database, and for preprocessing this data using a filtering algorithm. These filters can be used to transform the data and RESEARCH ARTICLE OPEN ACCESS
  • 2. Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51 www.ijera.com 49 | P a g e make it possible to delete instances and attributes according to specific criteria Datasets lassify tab is used for the classification purpose. We have selected the decision tree classifier J48 to obtain the results. Step 1: Take the input dataset descretize it properly. Step 2: Apply the decision tree classifier algorithm J48 on the whole data set and note the accuracy given by it. Step 3: Reduce the dimension of the given dataset by eliminating one or more attribute pair and note the accuracy of dataset after applying J48 on it. Step 4: Repeat step 4 for all combination of attributes. Step 5: Compare the different accuracy provided by the dataset of different attribute taken together and identify the significant feature of the dataset III. Results and Discussion 1) Weather dataset
  • 3. Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51 www.ijera.com 50 | P a g e Best attribute: Temperature Here we have selected temperature is the best attribute which alone gives the accuracy of 64.28 % which is far better than the accuracy provided by the four attribute taken together 50% 2) Vote dataset Best attribute: physician-free-freeze For the vote dataset, which consist of 16 attributes which together gives accuracy of 96.32%, after applying the algorithm we have found out that even single attribute physician-free-freeze gives the accuracy of 95.63%, which is very efficient. Hence for this dataset it can be considered as the best attribute. IV. Conclusion Thus in this paper we have tried to select best significant feature of data set using repeatedly applying the decision tree classification algorithm J48 over the dataset and found that for weather dataset temperature is the best attribute which alone gives the accuracy of 64.28 % which is far better than the accuracy provided by the four attribute taken together 50% .For the vote dataset, which consist of 16 attributes which together gives accuracy of 96.32%, after applying the algorithm we have found out that even single attribute physician-free-freeze gives the accuracy of 95.63%, which is very efficient. Hence for this dataset it can be considered as the best attribute. Future work will focus on analysis on the feature selected with various other techniques like principle component analysis, genetic algorithms.
  • 4. Preeti Kumari Int. Journal of Engineering Research and Applications www.ijera.com ISSN : 2248-9622, Vol. 4, Issue 6( Version 2), June 2014, pp.48-51 www.ijera.com 51 | P a g e References [1]. Olivier Henchiri , Nathalie Japkowicz. “A Feature Selection and Evaluation Scheme for Computer Virus Detection, Proceedings of the Sixth International Conference on Data Mining “, (ICDM'06). [2] M. Dash, H. Liu, Feature Selection for Classification, Intelligent Data Analysis 1 (1997) 131-156. [3] Frank,A. & Asuncion ,A. (2010).UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].Irvine,CA: University of California, School of Information and Computer Science. [4] Data mining book – kamber [5] V.Vaithiyanathan, K.Rajeswari, Swati Tonge,, “ Improved Apriori algorithm based on Selection Criterion”, IEEE Conference ICCIC Dec 18-20, 2012, Coimbatore. Catalog Number: CFP1220J-ART ISBN: 978-1-4673-1344-5 [6] Machine Learning Group at the University of Waikato:http://www.cs.waikato.ac.nz/ml /weka/