Evaluation of Supervised Learning Algorithms on Gene Expression Data CSCI 6505 – Machine Learning Adan Cosgaya  [email_address] Winter 2006 Dalhousie University Machine Learning Prediction
Outline
Introduction
Definition of the Problem
Related Work
Algorithms
Description of the Data
Methodology of Experiments [diagram labels: Feature Selection (gene subset); Algorithm; All features]
Methodology of Experiments (cont…)
Results: Cross-validation results are lower; because it uses nearly all the data for training and testing, it gives a more realistic estimate.
Results (cont…): There is an increase in the overall accuracy, more noticeable on DLBCL.
Results (cont…)
Relevance of Results
Relevance of Results (cont…)
Conclusions & Future Work
Conclusions & Future Work (cont…)
References

CSCI 6505 Machine Learning Project


Editor's Notes

  1. (At the end.) We used Weka to perform the experiments. We evaluated k-nearest neighbors (KNN), Naive Bayes (NB), decision trees (DT), and support vector machines (SVM). Each has its own strengths and limitations, so it is difficult to say in advance which one gives the best results; they must be evaluated on the same datasets and with common evaluation criteria. In our experiments, we perform comparative studies using the full set of features as well as a subset of them. A DNA microarray is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic, or silicon chip, forming an array. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously.
  2. In a classification problem, we are given m training instances and l classes, where each instance consists of n features and a known class label C. The goal is to predict the class label of a new instance. For our problem, the features are gene expression coefficients and the instances correspond to patients. Here, n >> m. Overfitting means building models that are very good on the training set but perform poorly on future independent samples. How can we guard against overfitting? Split the data into a training set and a cross-validation set, and use the latter to monitor generalization performance; when overfitting sets in, stop the training process. Finding disease markers (classifiers) from gene expression data with machine learning algorithms carries a high risk of overfitting, due to the abundance of attributes (simultaneously measured gene expression values) and the shortage of available examples (observations). DNA microarray experiments on biological samples generate thousands of gene expression measurements. The resulting datasets are high-dimensional and often noisy because of the process involved in the experiments. This is a challenging problem, and its results can be used to diagnose a disease or predict the survival of a patient. The approach taken by this project is to provide comparative results indicating that a small number of instances can be used to create a useful model, and that feature selection improves classification accuracy.
  3. Golub et al. … their results demonstrate the feasibility of cancer classification based solely on gene expression. A. Rosenwald et al. … for diffuse large-B-cell lymphoma. Furey et al. … their results indicate that SVM is able to classify this kind of data and can be used to identify the presence of a disease. Guyon et al. … their results show an increase in the overall performance of SVM classification with the reduced set of features.
  4. KNN - To classify a given instance I, the algorithm ranks the neighbors of I by similarity and collects the class labels of the k most similar ones. A majority vote is then taken, and I is assigned the class label with the greatest number of votes among the k nearest neighbors. The best choice of k depends on the dataset.
     NB - The training phase consists of calculating the conditional probability P(x|c) of an instance given a class label, and the prior probability P(c) of each class. To classify an unseen instance, the posterior probability of each class given the instance is calculated, and the instance is assigned the class with the highest probability.
     DT - The algorithm builds a tree from a training dataset: it recursively partitions the set by choosing an attribute and creating a separate branch for each value of the chosen attribute. The best attribute to split on is the one with the highest information gain, that is, the lowest remaining entropy after the split. To classify an instance, the method starts at the root node, tests the attribute specified by the node, and moves down the branch corresponding to the value of that attribute in the given instance. This process is repeated for the subtree rooted at the new node until a leaf is reached, and the instance is labeled with the class indicated by the leaf.
     SVM - The Support Vector Machine (SVM) method finds a linear discriminant, called a hyperplane, that separates the classes in a given dataset. The best hyperplane is the one that keeps the maximum separation between the classes, so that the model generalizes better; we are therefore looking for the maximum-margin hyperplane.
  5. The datasets used for this evaluation were obtained from the Kent Ridge Biomedical Data Set Repository. They correspond to gene expression data obtained from DNA microarrays. Leukemia dataset. The gene expression measurements were taken from bone marrow and blood samples. Diffuse Large-B-Cell Lymphoma (DLBCL) dataset. This dataset consists of biopsy samples from 240 patients that were examined for gene expression using DNA microarrays. The number of microarray features is 7399, and each sample belongs to one of two classes: Alive or Dead. The two classes correspond to the prediction of survival after chemotherapy for diffuse large-B-cell lymphoma.
  6. FEATURE SELECTION Due to the high-dimensional nature of this type of data, we chose a smaller subset of the original features. Another reason to perform feature selection is that having many more features than instances increases the risk of overfitting. TESTING METHODOLOGY We divided both datasets using different train/test ratios (66/34, 80/20, and 90/10) and averaged the results (macroaveraging). However, because our datasets are small, we also evaluated accuracy using 10-fold cross-validation. The major advantage of cross-validation is that all the cases in the dataset are used for testing, and nearly all the cases are used for training the classifier. This resampling technique can provide a good estimate of the accuracy.
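To make the cross-validation idea concrete, here is a minimal sketch of k-fold splitting. It is not the tooling used in the experiments: the helper names (`k_fold_indices`, `majority_baseline`, `cross_val_accuracy`), the fixed random seed, and the majority-class classifier standing in for a real algorithm are all assumptions made for illustration.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def majority_baseline(train_labels, n_test):
    """Placeholder classifier: always predict the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n_test

def cross_val_accuracy(labels, k=10, seed=0):
    """Fraction of samples predicted correctly when each is tested in exactly one fold."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(labels), k, seed):
        preds = majority_baseline([labels[i] for i in train_idx], len(test_idx))
        correct += sum(p == labels[i] for p, i in zip(preds, test_idx))
    return correct / len(labels)

# Hypothetical label vector, only to exercise the functions.
labels = ["alive"] * 15 + ["dead"] * 10
print(cross_val_accuracy(labels, k=10))
```

Each sample lands in exactly one test fold, and with k = 10 every classifier is trained on roughly 90% of the data, which is the property the slide relies on.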
  7. The data poses a binary classification task: we want to determine whether a patient is alive or dead, or which of two types of leukemia the patient has. However, accuracy alone can give misleadingly optimistic estimates, so to evaluate the performance of the classification algorithms we also use precision, recall, and F-measure. Precision is the proportion of instances that actually have class C among all those that were classified as class C. Recall is the proportion of instances that were classified as class C among all instances that truly have class C, i.e. how much of the class was captured. F-measure combines the two as their harmonic mean. To give equal importance to each class, we average the values of precision, recall, and F-measure obtained for each class C. Classes are almost evenly represented in the training samples, which is why accuracy can still be trusted as a measure of performance.
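The per-class measures and their average can be sketched as follows. This is an illustrative sketch under two assumptions: F-measure is taken to be the balanced harmonic mean of precision and recall (F1), and the function names (`per_class_scores`, `macro_average`) and the toy labels are invented for the example.

```python
def per_class_scores(y_true, y_pred, cls):
    """Return (precision, recall, f_measure) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # correctly labeled as cls
    fp = sum(t != cls and p == cls for t, p in pairs)  # wrongly labeled as cls
    fn = sum(t == cls and p != cls for t, p in pairs)  # cls instances that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    # Assumed form of F-measure: harmonic mean of precision and recall (F1).
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

def macro_average(y_true, y_pred):
    """Average precision, recall, and F-measure over classes, weighting each class equally."""
    classes = sorted(set(y_true))
    scores = [per_class_scores(y_true, y_pred, c) for c in classes]
    return tuple(sum(s[i] for s in scores) / len(classes) for i in range(3))

# Hypothetical predictions for four samples.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "a", "a", "b"]
precision, recall, f_measure = macro_average(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

In this toy case class `a` has precision 2/3 and recall 1, class `b` has precision 1 and recall 1/2, so the macro-averaged values are the plain means of those per-class numbers.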
  8. For both datasets, the evaluation over an independent test set and cross-validation broadly agree; however, cross-validation results are lower, most likely because it uses nearly all the data for training and testing, giving a more realistic estimate. In the Leukemia dataset, the classification accuracies in both evaluation methods are remarkably high: there are features that completely determine the class, and the Naive Bayes and SVM algorithms tend to slightly outperform KNN and DT. In the case of SVM, this is because the classes are linearly separable; for NB, its good performance under the assumption of feature independence indicates that at least some features completely determine the class, despite possibly redundant or noisy features. For the DLBCL dataset, the accuracy is considerably lower for all algorithms, with KNN (66.92% and 62.91%) the best classifier. Decision Trees gave the lowest accuracy, which is due to the large number of features involved. Surprisingly, KNN outperforms SVM in DLBCL and almost matches it in Leukemia.
  9. We must point out that reducing the dimensionality by keeping only the best-ranked features increases the accuracy compared with using the full set of features. The results obtained from the independent test set evaluations and from cross-validation still broadly agree, with the cross-validation measures again a little lower. For the Leukemia dataset, the reduced dimensionality brought a slight increase in the overall accuracy, indicating that this dataset can be described to a high degree of accuracy by a reduced number of features. For the DLBCL dataset, feature selection significantly increased the overall performance of all the algorithms, with Naive Bayes (78.84% and 70.83%) and SVM (75.37% and 71.25%) achieving the highest accuracies.
  10. Since cross-validation gives a more realistic view of the algorithms' behavior, the table summarizes the best performance for each type of classifier, with and without feature selection, in terms of 10-fold cross-validation. The figure shows the variation of the F-measure for each algorithm on both datasets, reinforcing the assumption that SVM outperforms the rest. It is interesting to note that the measures are consistent among all the algorithms within each dataset. For example, Leukemia with all features is in the range [0.847, 0.985], and DLBCL with feature selection is in the range [0.612, 0.706].
  11. Performance depends ... This is confirmed by the remarkably high results obtained with the Leukemia dataset, which drop dramatically with the DLBCL data. Feature selection … No matter which algorithm is used, all of them benefit from feature selection, which increases performance. This is especially important for algorithms such as KNN, where distances must be computed in terms of features. The use of an information-gain-based method such as gain ratio seems to preserve the underlying correlation between the selected features and the class labels. SVM … As initially suspected, SVM classification gave the best results; however, even though SVM performs well with high-dimensional data, we have shown that it can also benefit from reducing the dimensionality with feature selection. Decision Trees … it is widely known that they do not behave well with high-dimensional and noisy datasets.
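As a rough illustration of ranking features by gain ratio, the sketch below scores each feature and keeps the top-ranked ones. It is not the procedure used in the experiments: the median-threshold discretization of continuous expression values, the function names (`binarize`, `gain_ratio`, `rank_features`), and the toy data are all assumptions.

```python
import math
from collections import Counter
from statistics import median

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def binarize(column):
    """Assumed discretization: split a continuous feature at its median."""
    threshold = median(column)
    return ["high" if v > threshold else "low" for v in column]

def gain_ratio(values, labels):
    """Information gain of a discrete feature divided by its split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Entropy of the class labels that remains after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0

def rank_features(rows, labels, top_n):
    """Return the indices of the top_n features with the highest gain ratio."""
    scores = []
    for j in range(len(rows[0])):
        column = binarize([row[j] for row in rows])
        scores.append((gain_ratio(column, labels), j))
    scores.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [j for _, j in scores[:top_n]]

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
rows = [[0.1, 5.0, 1.0], [0.2, 1.0, 1.1], [0.9, 5.1, 0.9], [0.8, 1.2, 1.0]]
labels = ["a", "a", "b", "b"]
print(rank_features(rows, labels, top_n=2))
```

Dividing by the split information penalizes features that fragment the data into many small groups, which is the usual motivation for preferring gain ratio over raw information gain.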
  12. Surprisingly, KNN … its relatively strong performance makes it a good choice of baseline when applied to gene expression data. The DLBCL dataset … The low results might be because predicting whether a patient is dead or alive a certain time after chemotherapy involves other circumstances, such as the living environment and the care of the patient, which cannot be numerically measured but do affect the final prediction.
  13. While our results indicate that SVM, by its very nature, deals well with high-dimensional gene expression data, we have shown that other methods work surprisingly well too. The datasets used contain relatively few instances and do not allow one method to demonstrate absolute superiority. We have also shown that there is no single approach that works well in all situations, and the choice of one algorithm over the others should be evaluated on a case-by-case basis.
  14. Knowing that data transformation methods destroy the underlying meaning of the set of features, it would be interesting to see whether algorithms such as SVM and Naive Bayes (which assumes feature independence) benefit from the transformation. Another direction for future research is the statistical analysis of the effect of noisy gene expression data on the reliability of the classifier. This is of interest because the methods used to obtain this type of data can be subject to "noise"; it is crucial to determine its effect on the results and to assess the robustness of an algorithm in the presence of noisy measurements or mislabeled classes. Finally, more experiments with other datasets should be performed before drawing final conclusions.