Machine Learning Methods: an overview
Master in Bioinformatics – April 9th, 2010
Paolo Marcatili, University of Rome “Sapienza”, Dept. of Biochemical Sciences “Rossi Fanelli”, [email_address]
Agenda (Overview)
Why (Overview)
How (Overview)
Probability and statistics are fundamental: they provide a solid framework for creating models and acquiring knowledge.
Datasets (Overview)
Methods (Overview)
Machine Learning can: predict unknown function values; infer classes and assign samples to them.
Machine Learning cannot: provide knowledge; learn.
Where is the information: in the data, or in the model?
Love all, trust a few, do wrong to none. (Overview)
A toy dataset: 4 patients, 4 controls; then 2 more samples; then 10 more.
Assessment (Overview)
The goal is prediction of unknown data! Problems: few data, robustness.
50% Training set: used to tune the model parameters. 25% Test set: used to verify that the machine has “learnt”. 25% Validation set: final assessment of the results. Unfeasible with few data.
Leave-one-out: for each sample Ai, train on all the samples minus {Ai} and test on {Ai}; repeat. Computationally intensive; a good estimate of the mean error, but high variance.
K-fold cross validation: divide your data into K subsets S1..Sk; train on all the samples minus Si and test on Si; repeat. A good compromise (see the sketch below).
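A minimal sketch of these assessment protocols (holdout split and K-fold cross validation, with leave-one-out as the K = n case). The toy data, the sizes, and the nearest-centroid classifier are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))            # 40 samples, 5 features (toy data)
y = (X[:, 0] > 0).astype(int)           # toy labels

def nearest_centroid(train_X, train_y, test_X):
    """Classify each test point by the closer class centroid."""
    c0 = train_X[train_y == 0].mean(axis=0)
    c1 = train_X[train_y == 1].mean(axis=0)
    d0 = np.linalg.norm(test_X - c0, axis=1)
    d1 = np.linalg.norm(test_X - c1, axis=1)
    return (d1 < d0).astype(int)

# 50/25/25 holdout: train / test / validation
idx = rng.permutation(len(X))
n = len(X)
train, test, val = idx[: n // 2], idx[n // 2 : 3 * n // 4], idx[3 * n // 4 :]
pred = nearest_centroid(X[train], y[train], X[test])
print("holdout test error:", np.mean(pred != y[test]))

# K-fold cross validation (leave-one-out is the case K = n)
K = 5
folds = np.array_split(idx, K)
errors = []
for i in range(K):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
    pred = nearest_centroid(X[train_idx], y[train_idx], X[test_idx])
    errors.append(np.mean(pred != y[test_idx]))
print("mean CV error:", np.mean(errors))
```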
Sensitivity: TP / [TP + FN]: given the disease is present, the likelihood of testing positive.
Specificity: TN / [TN + FP]: given the disease is not present, the likelihood of testing negative.
Positive Predictive Value: TP / [TP + FP]: given a test is positive, the likelihood that the disease is present.
The receiver operating characteristic (ROC) is a graphical plot of sensitivity vs. (1 - specificity) for a binary classifier as its discrimination threshold is varied. The area under the ROC (AROC) is often used as a parameter to compare different classifiers; a computation sketch follows.
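A small sketch of these quantities computed from raw counts, plus a ROC built by sweeping the discrimination threshold of a made-up scoring classifier:

```python
import numpy as np

def confusion(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)   # P(test positive | disease)
    specificity = tn / (tn + fp)   # P(test negative | no disease)
    ppv = tp / (tp + fp)           # P(disease | test positive)
    return sensitivity, specificity, ppv

# ROC: sweep the discrimination threshold of a scoring classifier
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
scores = y + rng.normal(scale=1.0, size=200)   # noisy made-up scores
roc = []
for t in np.sort(scores):
    tp, tn, fp, fn = confusion(y, (scores >= t).astype(int))
    roc.append((fp / (fp + tn), tp / (tp + fn)))  # (1-specificity, sensitivity)
# The area under this curve (AROC/AUC) summarizes the classifier.
```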
Agenda (Supervised Learning)
Supervised Learning
Basic idea: use data plus the classification of known samples to find “fingerprints” of the classes in the data.
Example: use microarray data from different conditions; the classes are genes related/unrelated to different cancer types.
Support Vector Machines (Supervised Learning)
Basic idea: plot your data in an N-dimensional space and find the best hyperplane separating the different classes. Further samples can then be classified by the region of the space they fall into.
Toy data: weight vs. length, classes Pass and Fail. The Optimal Hyperplane (OHP), found by the simplest kind of SVM (a linear SVM, or LSVM), separates the two classes with maximum margin; the samples lying on the margin are the support vectors.
What if the data are not linearly separable? Two remedies. (1) Allow mismatches: soft margins (add a weight matrix penalizing misclassified samples). (2) Map the data into a higher-dimensional feature space (e.g. weight², length², weight·length) where a separating hyperplane may exist; in the original space it corresponds to a hypersurface. Since only the inner product is needed to calculate the dual problem and the decision function, the mapping never has to be computed explicitly: replacing the inner product with a kernel function is the kernel trick. A sketch follows.
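A minimal sketch of soft margins and the kernel trick using scikit-learn's SVC; the library choice and the toy length/weight data are assumptions, not from the slides:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Toy non-linearly-separable data: "Pass" inside a ring, "Fail" outside.
X = rng.normal(size=(200, 2))                    # columns: length, weight
y = (np.linalg.norm(X, axis=1) < 1.0).astype(int)

# Soft margins: C controls the cost of mismatches (smaller C = softer).
linear = SVC(kernel="linear", C=1.0).fit(X, y)

# Kernel trick: an RBF (or polynomial) kernel replaces the inner product,
# implicitly mapping to a higher-dimensional space, with no need to build
# explicit (length**2, weight**2, length*weight) features.
rbf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print("linear acc:", linear.score(X, y), "rbf acc:", rbf.score(X, y))
```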
SVM example (Supervised Learning)
Knowledge-based analysis of microarray gene expression data by using support vector machines. Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terrence S. Furey, Manuel Ares, Jr., and David Haussler.
We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data. To judge overall performance, we define the cost of using the method M as C(M) = fp(M) + 2·fn(M), where fp(M) is the number of false positives for method M, and fn(M) is the number of false negatives for method M. The false negatives are weighted more heavily than the false positives because, for these data, the number of positive examples is small compared with the number of negatives.
Hidden Markov Models (Supervised Learning)
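As an illustration of what an HMM computes, a minimal Viterbi decoding sketch: given an observed sequence, recover the most probable hidden-state path. The two-state model and all its probabilities are invented:

```python
import numpy as np

# Hypothetical two-state HMM over DNA (say, high-GC vs. low-GC regions);
# every number here is made up for illustration.
states = ["H", "L"]
start = np.log([0.5, 0.5])
trans = np.log([[0.9, 0.1], [0.2, 0.8]])             # trans[i][j] = P(j | i)
emit = {"A": np.log([0.2, 0.3]), "C": np.log([0.3, 0.2]),
        "G": np.log([0.3, 0.2]), "T": np.log([0.2, 0.3])}

def viterbi(seq):
    """Most probable hidden-state path, computed in log space."""
    V = start + emit[seq[0]]                 # best log-prob ending in each state
    back = []
    for sym in seq[1:]:
        scores = V[:, None] + trans          # scores[i, j]: come from i, go to j
        back.append(scores.argmax(axis=0))   # best predecessor of each state
        V = scores.max(axis=0) + emit[sym]
    path = [int(V.argmax())]
    for ptr in reversed(back):               # follow back-pointers
        path.append(int(ptr[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi("GGCACTGAA"))
```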
Decision trees (Supervised Learning)
A decision tree mimics the behavior of an expert.
Majority rules!
Random Forests (Supervised Learning)
Split the data into several subsets and construct a decision tree (DT) for each set. Each DT expresses a vote; the majority wins. Much more accurate and robust (bootstrap). A sketch of the scheme follows the example below.
Prediction of protein–protein interactions using random decision forest framework. Xue-Wen Chen and Mei Liu. Motivation: Protein interactions are of biological interest because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Domains are the building blocks of proteins; therefore, proteins are assumed to interact as a result of their interacting domains. Many domain-based models for protein interaction prediction have been developed, and preliminary results have demonstrated their feasibility. Most of the existing domain-based methods, however, consider only single-domain pairs (one domain from one protein) and assume independence between domain–domain interactions. Results: In this paper, we introduce a domain-based random forest of decision trees to infer protein interactions. Our proposed method is capable of exploring all possible domain interactions and making predictions based on all the protein domains. Experimental results on the Saccharomyces cerevisiae dataset demonstrate that our approach can predict protein–protein interactions with higher sensitivity (79.78%) and specificity (64.38%) compared with that of the maximum likelihood approach. Furthermore, our model can be used to infer interactions not only for single-domain pairs but also for multiple domain pairs.
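A minimal sketch of the bootstrap-and-vote scheme described above, built from scikit-learn decision trees (the library choice, tree count, and depth are arbitrary; a full random forest would also randomize the features tried at each split):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))                 # toy data
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

trees = []
for _ in range(25):                           # 25 trees, an arbitrary choice
    boot = rng.integers(0, len(X), size=len(X))   # bootstrap resample
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[boot], y[boot]))

# Each tree votes; the majority wins.
votes = np.stack([t.predict(X) for t in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("forest training accuracy:", np.mean(forest_pred == y))
```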
Bayesian Networks (Supervised Learning)
The probabilistic approach is extremely powerful, but a complete representation requires a huge amount of information/data. Not all correlations or cause-effect relationships between variables are significant: consider only meaningful links!
Bayes Theorem again! (see the sketch below)
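A tiny sketch of “consider only meaningful links”: with an invented network A -> B and A -> C, the joint distribution factorizes as P(A)·P(B|A)·P(C|A), so three small tables replace the full joint table, and Bayes' theorem gives posteriors by enumeration:

```python
# Invented conditional probability tables for the network A -> B, A -> C.
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2},
               False: {True: 0.1, False: 0.9}}
P_C_given_A = {True: {True: 0.5, False: 0.5},
               False: {True: 0.2, False: 0.8}}

def joint(a, b, c):
    # The factorization encoded by the network structure.
    return P_A[a] * P_B_given_A[a][b] * P_C_given_A[a][c]

# Inference by enumeration (Bayes theorem again): P(A = True | B = True)
num = sum(joint(True, True, c) for c in (True, False))
den = sum(joint(a, True, c) for a in (True, False) for c in (True, False))
print("P(A=True | B=True) =", num / den)   # 0.24 / 0.31 ~ 0.774
```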
Neural Networks (Supervised Learning)
Neural networks interpolate functions. They have nothing to do with brains.
Parameter settings: avoid overfitting. Learning --> validation --> usage. There is no underlying model, but it often works; a minimal sketch follows.
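To make “interpolate functions” concrete, a minimal one-hidden-layer network trained by gradient descent; the architecture, learning rate, and target function are arbitrary choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(X)                                   # function to interpolate

W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)

for step in range(2000):
    h = np.tanh(X @ W1 + b1)                    # hidden layer
    out = h @ W2 + b2                           # linear output
    err = out - y                               # gradient of MSE (up to a constant)
    # Backpropagation through the two layers
    gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)            # tanh' = 1 - tanh^2
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    for p, g in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2)):
        p -= 0.1 * g                            # gradient step
# Early stopping on a validation set would guard against overfitting.
```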
Neural Networks (Supervised Learning)
Protein Disorder Prediction: Implications for Structural Proteomics. Rune Linding, Lars Juhl Jensen, Francesca Diella, Peer Bork, Toby J. Gibson, and Robert B. Russell.
Abstract: A great challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Disordered regions in proteins often contain short linear peptide motifs (e.g., SH3 ligands and targeting signals) that are important for protein function. We present here DisEMBL, a computational tool for prediction of disordered/unstructured regions within a protein sequence. As no clear definition of disorder exists, we have developed parameters based on several alternative definitions and introduced a new one based on the concept of “hot loops,” i.e., coils with high temperature factors. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability, and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects. The tool is freely available via a web interface (http://dis.embl.de) and can be downloaded for use in large-scale studies.
Agenda (Unsupervised Learning)
Unsupervised Learning
If we have no idea of the actual classification of the data, we can try to guess it.
Clustering (Unsupervised Learning)
Put together similar objects to define classes.
How? K-means; hierarchical top-down; hierarchical bottom-up; fuzzy clustering.
Which metric? Euclidean; correlation; Spearman rank; Manhattan (computed in the sketch below).
Which “shape”? Compact or concave clusters, outliers, inner radius, cluster separation.
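The four metrics named above, computed for two invented expression profiles:

```python
import numpy as np
from scipy.stats import spearmanr

x = np.array([1.0, 2.0, 3.0, 4.5])     # toy profiles
y = np.array([2.0, 2.5, 3.5, 6.0])

euclidean = np.linalg.norm(x - y)                # straight-line distance
manhattan = np.abs(x - y).sum()                  # sum of coordinate gaps
correlation = 1 - np.corrcoef(x, y)[0, 1]        # Pearson-based distance
spearman = 1 - spearmanr(x, y).correlation       # rank-based distance
print(euclidean, manhattan, correlation, spearman)
```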
Hierarchical Clustering (Unsupervised Learning)
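A short sketch of bottom-up (agglomerative) clustering with SciPy; the toy data and the average-linkage choice are placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(3, 0.3, size=(10, 2))])   # two obvious groups

Z = linkage(X, method="average")      # repeatedly merge the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```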
K-means (Unsupervised Learning)
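A minimal K-means sketch showing the two alternating steps (assign samples to the nearest centroid, then move each centroid to the mean of its samples), on toy data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(4, 0.5, size=(50, 2))])   # toy data, two blobs
k = 2
centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init

for _ in range(100):
    # 1. assign each sample to its nearest centroid
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # 2. move each centroid to the mean of its assigned samples
    new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new, centroids):                 # converged
        break
    centroids = new
print(centroids)
```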
PCA (Unsupervised Learning)
Multidimensional data are hard to visualize, their variability is not equally distributed, and variables are correlated. Change the coordinate system to remove correlation, then retain only the most variable coordinates. How: generalized eigenvectors, SVD. Pro: noise (and information) reduction.
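A minimal PCA-by-SVD sketch of the steps above: center the data, rotate to uncorrelated coordinates, keep the most variable ones. The data are toy:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)   # correlated variables

Xc = X - X.mean(axis=0)                 # center
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)     # variance share per component
scores = Xc @ Vt[:2].T                  # project onto the top 2 components
print("explained variance:", explained.round(3))
```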
Agenda (Caveats)
Data independence (Caveats)
Training set, test set and validation set must be clearly separated. E.g., a neural network to infer gene function from sequence: training set, annotated gene sequences with deposit date before Jan 2007; test set, annotated gene sequences with deposit date after Jan 2007. But the annotation of new sequences is often inferred from old sequences!
Biases (Caveats)
Data should be unbiased, i.e. a good sample of our “space”. E.g., a neural network to find disordered regions: training set, solved structures, residues present in SEQRES but not in ATOM. But solved structures are typically small, globular, cytoplasmic proteins.
Take-home message (Caveats)
Bayes Theorem (Supplementary)
a) AIDS affects 0.01% of the population. b) The AIDS test, when performed on patients, is correct 99.9% of the time. c) The AIDS test, when performed on uninfected people, is correct 99.99% of the time. If a person tests positive, how likely is it that they are infected?
P(A|T) = P(T|A)·P(A) / (P(T|A)·P(A) + P(T|¬A)·P(¬A)) = 49.97%
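The same computation spelled out in code, using the numbers given above (the result is about one chance in two, despite the very accurate test):

```python
p_aids = 0.0001            # 0.01% prevalence
p_pos_aids = 0.999         # sensitivity: test correct on patients
p_pos_healthy = 0.0001     # 1 - specificity: test wrong on the uninfected

p_pos = p_pos_aids * p_aids + p_pos_healthy * (1 - p_aids)
p_aids_pos = p_pos_aids * p_aids / p_pos
print(p_aids_pos)          # 0.4997..., i.e. about 50%
```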