Introduction to Bioinformatics:
      Mining Your Data

          Gerry Lushington
         Lushington in Silico

  modeling / informatics consultant
What is Data Mining?
 Use of computational methods to perceive trends in data that
 can be used to explain or predict important outcomes or
 properties

Applicable across many disciplines:
Molecular bioinformatics

Medical Informatics

Health Informatics

Biodiversity informatics
Example Applications:
     Find relationships between:

Convenient Observables                vs.    Important Outcomes
a)    Relative gene expression data          1.   Disease susceptibility
b)    Relative protein abundance data        2.   Drug efficacy
c)    Relative lipid & metabolite profiles   3.   Toxin susceptibility
d)    Glycosylation variants                 4.   Immunity
e)    SNPs, alleles                          5.   Genetic disorders
f)    Cellular traits                        6.   Microbial virulence
g)    Organism traits                        7.   Species adaptive success
h)    Behavioral traits                      8.   Species complementarity
i)    Case history
Goals for this lecture:
Focus on Data Mining: how to approach your data and use it to
understand biology

Overview of available techniques

Understanding model validation

Try to think about data you’ve seen: what techniques might be
useful?




        Don’t worry about grasping everything:
     K-INBRE Bioinformatics Core is here to help!!
Basic Data Mining:
Find relationships between:
a) Easy to measure properties   vs.
b) Important (but harder to measure) outcomes or attributes

Use relationships to understand the conceptual basis for
outcomes in b)

Use relationships to predict outcomes in new cases where
outcome has not yet been measured
Basic Data Mining: simple measurables
Basic Data Mining: general observation

[Figure: samples divided into Unhappy and Happy groups]

Basic Data Mining: relationship (#1)

[Figure: samples grouped as Unhappy and Happy]

    Blue = happy; Red = unhappy   accuracy = 12/20 = 60%
Basic Data Mining: relationship (#2)

[Figure: samples grouped as Unhappy and Happy]

  Blue + BIG Red = happy; little red = unhappy     accuracy = 17/20 = 85%
Data Mining: procedure

1.   Data Acquisition
2.   Data Preprocessing
3.   Feature Selection
4.   Classification
5.   Validation
6.   Prediction & Iteration
Data Mining: procedure

1.   Data acquisition

Key issues include:
a) format conversion from instrument
b) any necessary mathematical manipulations (e.g., Density = M/V)

[Figure: example data annotated "Peak heights?" and "Peak positions?"]
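
The "mathematical manipulations" step can be sketched as a small conversion routine. This is a minimal illustration only: the sample identifiers, units, and readings below are hypothetical, and the only thing taken from the slide is the relation Density = M/V.

```python
# Hypothetical raw readings exported from an instrument:
# (sample id, mass in g, volume in mL). These values are made up.
raw_rows = [
    ("s1", 12.0, 10.0),
    ("s2", 9.5, 10.0),
    ("s3", 21.0, 20.0),
]


def density(mass_g: float, volume_ml: float) -> float:
    """Derived property from the slide: Density = M / V."""
    if volume_ml <= 0:
        raise ValueError("volume must be positive")
    return mass_g / volume_ml


# Convert the raw rows into one record per sample, adding the derived column.
records = [
    {"sample": sid, "mass_g": m, "volume_ml": v, "density_g_per_ml": density(m, v)}
    for sid, m, v in raw_rows
]

for rec in records:
    print(rec["sample"], rec["density_g_per_ml"])
```
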
Data Mining: procedure

2.   Data Preprocessing

Key issues include:
a) Normalization to account for experimental bias
   (use controls to scale data)
b) Statistical detection of flagrant outliers
   (subjective: requires experience and/or domain knowledge)

[Figure: groups of measurements, each with a control C and samples 1, 2, 3]
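
Both preprocessing issues can be illustrated with a short sketch. It assumes each batch has one control reading used as a scale factor, and it uses a z-score cutoff as one possible outlier test; the batch layout, the numbers, and the cutoff are assumptions for illustration, not values from the lecture.

```python
from statistics import mean, stdev

# Hypothetical readings: each batch has one control (C) and three samples.
batches = {
    "batch1": {"C": 2.0, "samples": [4.0, 6.0, 8.0]},
    "batch2": {"C": 4.0, "samples": [8.0, 12.0, 16.0]},
}


def normalize_to_control(batch):
    """Scale each sample by its batch control to remove batch-level bias."""
    return [value / batch["C"] for value in batch["samples"]]


def flag_outliers(values, z_cutoff=3.0):
    """Flag values more than z_cutoff sample standard deviations from the mean.

    The cutoff is a judgment call, which is one reason outlier detection
    is partly subjective.
    """
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > z_cutoff for v in values]


# After scaling, the two batches become directly comparable.
normalized = {name: normalize_to_control(b) for name, b in batches.items()}
print(normalized)
```

With very few values a z-score can never exceed a small bound, so a lower cutoff or a different test may be needed for small data sets.
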
Data Mining: procedure

3.   Feature Selection

Which of the many measurable properties relate to the outcome of interest?
a)   Intrinsic information content
b)   Redundancy relative to other properties
c)   Correlation with target attribute
d)   Iterative model training
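
Criteria a) to c) can each be approximated with a simple statistic. The sketch below assumes variance as a stand-in for information content and Pearson correlation for both redundancy and relevance; these are common choices, not necessarily the ones used in the lecture, and the feature values are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    # A constant sequence has no defined correlation; report 0.0 for it.
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical data: three measured properties and a 0/1 outcome for six samples.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],  # exactly 2 * f1, so redundant
    "f3": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],    # constant, so uninformative
}
outcome = [0, 0, 0, 1, 1, 1]

# a) information content (proxy: variance)
variance = {name: pvariance(vals) for name, vals in features.items()}
# b) redundancy between two properties
redundancy_f1_f2 = pearson(features["f1"], features["f2"])
# c) correlation with the target attribute
target_corr = {name: pearson(vals, outcome) for name, vals in features.items()}

print(variance, redundancy_f1_f2, target_corr)
```
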
Data Mining: procedure

3.   Feature Selection

Iterative model training:
• Train preliminary models based on random sets of properties
• Evaluate models according to correlative or predictive performance
• Experiment with promising sets, adding or deleting descriptors to gauge
  impact on performance
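
The iterative approach can be sketched as a random search over feature subsets. Everything here is an assumption made for illustration: the nearest-centroid classifier, the use of training-set accuracy as the score, the function names, and the toy data. A real analysis would score subsets on held-out data.

```python
import random
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [mean(col) for col in zip(*rows)]


def nearest_centroid_accuracy(X, y, cols):
    """Training-set accuracy of a nearest-centroid classifier on the given columns."""
    sub = [[row[c] for c in cols] for row in X]
    cents = {
        label: centroid([r for r, l in zip(sub, y) if l == label])
        for label in set(y)
    }

    def predict(r):
        return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(r, cents[lab])))

    return mean(1.0 if predict(r) == l else 0.0 for r, l in zip(sub, y))


def random_subset_search(X, y, n_trials=20, subset_size=2, seed=0):
    """Train preliminary models on random feature subsets and keep the best one."""
    rng = random.Random(seed)
    best_cols, best_score = None, -1.0
    for _ in range(n_trials):
        cols = sorted(rng.sample(range(len(X[0])), subset_size))
        score = nearest_centroid_accuracy(X, y, cols)
        if score > best_score:
            best_cols, best_score = cols, score
    return best_cols, best_score


def try_deleting(X, y, cols):
    """Score each subset obtained by deleting one descriptor, to gauge its impact."""
    return {
        c: nearest_centroid_accuracy(X, y, [k for k in cols if k != c])
        for c in cols
        if len(cols) > 1
    }


# Hypothetical data: column 0 separates the classes, the others do not.
X = [
    [1.0, 5.0, 3.0, 7.0],
    [1.2, 3.0, 4.0, 2.0],
    [0.8, 4.0, 3.5, 6.0],
    [3.0, 5.0, 3.2, 3.0],
    [3.2, 3.0, 4.1, 7.0],
    [2.9, 4.0, 3.6, 2.0],
]
y = [0, 0, 0, 1, 1, 1]

best_cols, best_score = random_subset_search(X, y)
print(best_cols, best_score, try_deleting(X, y, best_cols))
```
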
Data Mining: procedure

4.   Classification

How to predict which sample will have which outcome?
a)   Correlative methods
b)   Distance-based clustering
c)   Boundary detection
d)   Rule learning
e)   Weighted probability
Data Mining: procedure

4.   Classification: a) Correlative methods

[Figure: y plotted against x; the y axis runs from -n to +n, with outcomes labeled NO and YES]
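
One way to read the correlative approach is: fit a trend between a property x and an outcome score y, then label new samples by where the predicted y falls. The sketch below assumes an ordinary least-squares line and a cutoff of zero between NO and YES; both choices, and the data, are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_outcome(x, slope, intercept, cutoff=0.0):
    """Label a new sample YES if the predicted y exceeds the cutoff, else NO."""
    return "YES" if slope * x + intercept > cutoff else "NO"


# Hypothetical training data: property x and a signed outcome score y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [-2.0, -1.0, 0.0, 1.0, 2.0]

slope, intercept = fit_line(xs, ys)
print(predict_outcome(4.5, slope, intercept), predict_outcome(1.5, slope, intercept))
```
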
Data Mining: procedure

4.   Classification: b) Distance-based clustering

[Figure: samples plotted on axes x1 and x2, in four groups labeled y1 to y4]

y1 = resistant to types I & II diabetes
y2 = susceptible only to type II
y3 = susceptible only to type I
y4 = susceptible to types I & II
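
Once clusters are known, a new sample can be assigned to the nearest one. This minimal sketch assumes Euclidean distance to fixed cluster centers; the center coordinates are hypothetical and only the labels y1 to y4 come from the slide.

```python
from math import dist

# Hypothetical cluster centers in the (x1, x2) plane, one per outcome class.
centroids = {
    "y1": (2.0, 8.0),  # resistant to types I & II
    "y2": (8.0, 8.0),  # susceptible only to type II
    "y3": (9.0, 4.0),  # susceptible only to type I
    "y4": (2.0, 3.0),  # susceptible to types I & II
}


def assign_cluster(point, centers):
    """Return the label of the center closest to the point (Euclidean distance)."""
    return min(centers, key=lambda label: dist(point, centers[label]))


print(assign_cluster((1.5, 7.0), centroids))
print(assign_cluster((8.5, 4.5), centroids))
```
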
Data Mining: procedure

4.   Classification: c) Boundary detection

[Figure: samples plotted on axes x1 and x2, separated into "Resistant to type I" and "Susceptible to type I" regions]
Data Mining: procedure

4.   Classification: d) Rule learning

[Figure: samples plotted on axes x1 and x2, with "Resistant to type I" and "Susceptible to type I" regions marked by levels a and b on x2 and c on x1]

If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible
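
The rule set from this slide translates directly into code. The thresholds a, b, and c are parameters here because the slide gives no numeric values; the numbers in the usage lines are hypothetical.

```python
def classify_by_rules(x1, x2, a, b, c):
    """Apply the slide's rules in order; the first matching rule wins."""
    if x1 < c and x2 > a:
        return "resistant"
    elif x1 > c and x2 > b:
        return "resistant"
    # Everything else, including x1 exactly equal to c, is susceptible.
    return "susceptible"


# Hypothetical thresholds, for illustration only.
a, b, c = 2.0, 5.0, 3.0
print(classify_by_rules(1.0, 3.0, a, b, c))  # left of c, above a
print(classify_by_rules(4.0, 4.0, a, b, c))  # right of c, not above b
```
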

Data Mining: procedure

4.   Classification: e) Weighted probability

[Figure: three one-dimensional views of the samples, each split into Resistant and Susc. regions: along x1 at cutoff a, along x2 at cutoff b, and along the combination Fx1 - Gx2 at cutoff c; the figure also carries the label "E=9"]

If Fx1 - Gx2 < c then resistant
Else susceptible
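
The weighted rule combines both properties into a single score before applying one cutoff. In this sketch F, G, and c are passed in as parameters, since the slide does not give their values; the numbers used below are hypothetical.

```python
def weighted_score(x1, x2, F, G):
    """Weighted combination of the two properties: F*x1 - G*x2."""
    return F * x1 - G * x2


def classify_weighted(x1, x2, F, G, c):
    """Slide rule: resistant if F*x1 - G*x2 < c, otherwise susceptible."""
    return "resistant" if weighted_score(x1, x2, F, G) < c else "susceptible"


# Hypothetical weights and cutoff, for illustration only.
F, G, c = 1.0, 2.0, 0.0
print(classify_weighted(1.0, 2.0, F, G, c))
print(classify_weighted(5.0, 1.0, F, G, c))
```
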
Data Mining: procedure

5.   Validation

Define criteria and tests to prove model validity
a)   Accuracy
b)   Sensitivity vs. Specificity
c)   Receiver Operating Characteristic (ROC) plot
d)   Cross-validation
Data Mining: procedure

5.   Validation: a) Accuracy

[Figure: samples plotted on axes x1 and x2, divided into Resistant (Neg.) and Susc. (Pos.) regions]

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154

where TP, TN, FP, and FN are the numbers of true positives, true negatives,
false positives, and false negatives.
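
Accuracy can be computed by tallying the four outcome counts from actual and predicted labels. The sketch assumes "susceptible" is the positive class, as in the figure; the function names and the short label lists are hypothetical.

```python
def confusion_counts(actual, predicted, positive="susceptible"):
    """Count true/false positives and negatives for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels, for illustration only.
actual = ["susceptible", "susceptible", "resistant", "resistant"]
predicted = ["susceptible", "resistant", "resistant", "susceptible"]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))
```
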
Data Mining: procedure

5.   Validation: b) Sensitivity vs. Specificity

[Figure: samples plotted on axes x1 and x2, divided into Resistant (Neg.) and Susc. (Pos.) regions]

Sensitivity = TP / (TP + FN) = 67 / 72

FPR = FP / (TN + FP) = 6 / 81

Note: Specificity = 1 - FPR
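
The ratios on this slide can be checked numerically. The counts below are read off the slide's fractions (67 of 72 positives detected, 6 of 81 negatives misclassified); the function names are illustrative.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that the model detects: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Fraction of actual negatives flagged as positive: FP / (TN + FP)."""
    return fp / (tn + fp)


def specificity(fp, tn):
    """Specificity = 1 - FPR, which equals TN / (TN + FP)."""
    return 1.0 - false_positive_rate(fp, tn)


tp, fn = 67, 5   # 67 / 72 positives detected
fp, tn = 6, 75   # 6 / 81 negatives misclassified

print(f"sensitivity = {sensitivity(tp, fn):.3f}")
print(f"FPR         = {false_positive_rate(fp, tn):.3f}")
print(f"specificity = {specificity(fp, tn):.3f}")
```
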
Data Mining: procedure

5.   Validation: b) Sensitivity vs. Specificity

[Figure: the Resistant (Neg.) / Susc. (Pos.) plot with the boundary moved between less and more stringent positions, captioned "Varying model stringency"]

Sensitivity = TP / (TP + FN) = 69 / 72

FPR = FP / (TN + FP) = 19 / 81

Note: Specificity = 1 - FPR
Data Mining: procedure

5.   Validation: c) Receiver Operating Characteristic (ROC) plot

[Figure: ROC curve, with Sens on the vertical axis and FPR on the horizontal axis]

The area under the curve is an excellent measure of model performance:
1.0: perfect model
0.5: random
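
An ROC curve is obtained by sweeping the model's stringency and recording (FPR, sensitivity) at each setting. The sketch below assumes the model outputs a numeric score where higher means "more likely positive", and uses the trapezoid rule for the area; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, sensitivity) points from (0, 0) to (1, 1).

    labels are 1 for positive and 0 for negative, and both classes must be
    present. Samples with tied scores are added together.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold past every sample that shares this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# A perfectly ranked model and an uninformative one (all scores tied).
perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
uninformative = auc(roc_points([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))
print(perfect, uninformative)
```
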
Data Mining: procedure

5.   Validation

Predictions are imperfect due to:
• Imperfect Algorithms
• Imperfect Data
Cross-Validation:

• Carefully monitor features that are useful across different
  independent data subsets
• This can be accomplished with N-fold cross-validation:

[Figure: the data divided into five parts for Trial 1 to Trial 5; each trial holds out a different part as Test and uses the rest to Train]

     Model performance = mean predictive performance over 5 trials

• The best feature selection and classification algorithms will perform
  consistently well across independent trials
• The best features will be consistently important across trials
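
N-fold cross-validation can be sketched as an index splitter plus an averaging loop. The function names and the placeholder scoring function are assumptions for illustration; in practice the samples would usually be shuffled first, and `evaluate` would train a model on the training indices and score it on the test indices.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, test_indices); every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, evaluate, n_folds=5):
    """Model performance = mean of evaluate(train, test) over the folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


# Show the five splits for ten samples.
for trial, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"Trial {trial}: test={test} train={train}")
```
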
Data Mining: procedure

6.   Prediction & Iteration

Analysis is only useful if it is used; it only improves if it is tested
a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and
   greater understanding
Questions?


      Lushington in Silico
Geraldlushington3117 at aol.com
     Geraldlushington.org

More Related Content

What's hot (20)

Dynamic programming
Dynamic programming Dynamic programming
Dynamic programming
 
Global and Local Sequence Alignment
Global and Local Sequence AlignmentGlobal and Local Sequence Alignment
Global and Local Sequence Alignment
 
Protein database
Protein databaseProtein database
Protein database
 
Cath
CathCath
Cath
 
UPGMA
UPGMAUPGMA
UPGMA
 
Fasta
FastaFasta
Fasta
 
sequence of file formats in bioinformatics
sequence of file formats in bioinformaticssequence of file formats in bioinformatics
sequence of file formats in bioinformatics
 
UniProt
UniProtUniProt
UniProt
 
In silico structure prediction
In silico structure predictionIn silico structure prediction
In silico structure prediction
 
Biological database
Biological databaseBiological database
Biological database
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Entrez databases
Entrez databasesEntrez databases
Entrez databases
 
Scoring matrices
Scoring matricesScoring matrices
Scoring matrices
 
Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-Sequence alig Sequence Alignment Pairwise alignment:-
Sequence alig Sequence Alignment Pairwise alignment:-
 
PIR- Protein Information Resource
PIR- Protein Information ResourcePIR- Protein Information Resource
PIR- Protein Information Resource
 
Biological networks
Biological networksBiological networks
Biological networks
 
BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES
 
Pymol
PymolPymol
Pymol
 
FASTA
FASTAFASTA
FASTA
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 

Viewers also liked

Classification using L1-Penalized Logistic Regression
Classification using L1-Penalized Logistic RegressionClassification using L1-Penalized Logistic Regression
Classification using L1-Penalized Logistic RegressionSetia Pramana
 
Cross-validation aggregation for forecasting
Cross-validation aggregation for forecastingCross-validation aggregation for forecasting
Cross-validation aggregation for forecastingDevon Barrow
 
Learning from data: data mining approaches for Energy & Weather/Climate appli...
Learning from data: data mining approaches for Energy & Weather/Climate appli...Learning from data: data mining approaches for Energy & Weather/Climate appli...
Learning from data: data mining approaches for Energy & Weather/Climate appli...matteodefelice
 
Lecture7 cross validation
Lecture7 cross validationLecture7 cross validation
Lecture7 cross validationStéphane Canu
 
100505 koenig biological_databases
100505 koenig biological_databases100505 koenig biological_databases
100505 koenig biological_databasesMeetika Gupta
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to BioinformaticsDenis C. Bauer
 
1.bioinformatics introduction 32.03.2071
1.bioinformatics introduction 32.03.20711.bioinformatics introduction 32.03.2071
1.bioinformatics introduction 32.03.2071RajDip Basnet
 
B.sc biochem i bobi u-1 introduction to bioinformatics
B.sc biochem i bobi u-1 introduction to bioinformaticsB.sc biochem i bobi u-1 introduction to bioinformatics
B.sc biochem i bobi u-1 introduction to bioinformaticsRai University
 
Bioinformatics issues and challanges presentation at s p college
Bioinformatics  issues and challanges  presentation at s p collegeBioinformatics  issues and challanges  presentation at s p college
Bioinformatics issues and challanges presentation at s p collegeSKUASTKashmir
 
Nucleic Acid Sequence databases
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databasesPranavathiyani G
 
databases in bioinformatics
databases in bioinformaticsdatabases in bioinformatics
databases in bioinformaticsnadeem akhter
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data MiningSushil Kulkarni
 

Viewers also liked (20)

Classification using L1-Penalized Logistic Regression
Classification using L1-Penalized Logistic RegressionClassification using L1-Penalized Logistic Regression
Classification using L1-Penalized Logistic Regression
 
Cross-validation aggregation for forecasting
Cross-validation aggregation for forecastingCross-validation aggregation for forecasting
Cross-validation aggregation for forecasting
 
Data mining ppt
Data mining pptData mining ppt
Data mining ppt
 
Machine learning group computer science department ULB - Lab'InSight Artifici...
Machine learning group computer science department ULB - Lab'InSight Artifici...Machine learning group computer science department ULB - Lab'InSight Artifici...
Machine learning group computer science department ULB - Lab'InSight Artifici...
 
Learning from data: data mining approaches for Energy & Weather/Climate appli...
Learning from data: data mining approaches for Energy & Weather/Climate appli...Learning from data: data mining approaches for Energy & Weather/Climate appli...
Learning from data: data mining approaches for Energy & Weather/Climate appli...
 
Lecture7 cross validation
Lecture7 cross validationLecture7 cross validation
Lecture7 cross validation
 
100505 koenig biological_databases
100505 koenig biological_databases100505 koenig biological_databases
100505 koenig biological_databases
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to Bioinformatics
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to Bioinformatics
 
1.bioinformatics introduction 32.03.2071
1.bioinformatics introduction 32.03.20711.bioinformatics introduction 32.03.2071
1.bioinformatics introduction 32.03.2071
 
B.sc biochem i bobi u-1 introduction to bioinformatics
B.sc biochem i bobi u-1 introduction to bioinformaticsB.sc biochem i bobi u-1 introduction to bioinformatics
B.sc biochem i bobi u-1 introduction to bioinformatics
 
Bioinformatics issues and challanges presentation at s p college
Bioinformatics  issues and challanges  presentation at s p collegeBioinformatics  issues and challanges  presentation at s p college
Bioinformatics issues and challanges presentation at s p college
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Nucleic Acid Sequence databases
Nucleic Acid Sequence databasesNucleic Acid Sequence databases
Nucleic Acid Sequence databases
 
Biological Databases
Biological DatabasesBiological Databases
Biological Databases
 
Major databases in bioinformatics
Major databases in bioinformaticsMajor databases in bioinformatics
Major databases in bioinformatics
 
Biological databases
Biological databasesBiological databases
Biological databases
 
blast bioinformatics
blast bioinformaticsblast bioinformatics
blast bioinformatics
 
databases in bioinformatics
databases in bioinformaticsdatabases in bioinformatics
databases in bioinformatics
 
Ch 1 Intro to Data Mining
Ch 1 Intro to Data MiningCh 1 Intro to Data Mining
Ch 1 Intro to Data Mining
 

Similar to Introduction to Data Mining / Bioinformatics

Online Chemical Modeling Environment: Models
Online Chemical Modeling Environment: ModelsOnline Chemical Modeling Environment: Models
Online Chemical Modeling Environment: ModelsSSA KPI
 
Machine Learning Techniques for the Evaluating of External ...
Machine Learning Techniques for the Evaluating of External ...Machine Learning Techniques for the Evaluating of External ...
Machine Learning Techniques for the Evaluating of External ...butest
 
Software Engineering Course 2009 - Mining Software Archives
Software Engineering Course 2009 - Mining Software ArchivesSoftware Engineering Course 2009 - Mining Software Archives
Software Engineering Course 2009 - Mining Software ArchivesKim Herzig
 
Machine Learning Challenges For Automated Prompting In Smart Homes
Machine Learning Challenges For Automated Prompting In Smart HomesMachine Learning Challenges For Automated Prompting In Smart Homes
Machine Learning Challenges For Automated Prompting In Smart HomesBarnan Das
 
Model-driven decision support for monitoring network design based on analysis...
Model-driven decision support for monitoring network design based on analysis...Model-driven decision support for monitoring network design based on analysis...
Model-driven decision support for monitoring network design based on analysis...Velimir (monty) Vesselinov
 
Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...
Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...
Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...Saifeng (Aaron) Liu
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data MiningKai Koenig
 
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses Dmitry Grapov
 
Randomization Approach in Case-Based Reasoning: Case of study of mammography ...
Randomization Approach in Case-Based Reasoning: Case of study of mammography ...Randomization Approach in Case-Based Reasoning: Case of study of mammography ...
Randomization Approach in Case-Based Reasoning: Case of study of mammography ...Miled Basma Bentaiba
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-stepsShesha R
 
Interscience discovering knowledge in data an introduction to data mining
Interscience discovering knowledge in data   an introduction to data miningInterscience discovering knowledge in data   an introduction to data mining
Interscience discovering knowledge in data an introduction to data miningCludius
 
Cluster Analysis : Assignment & Update
Cluster Analysis : Assignment & UpdateCluster Analysis : Assignment & Update
Cluster Analysis : Assignment & UpdateBilly Yang
 
Malware detection-using-machine-learning
Malware detection-using-machine-learningMalware detection-using-machine-learning
Malware detection-using-machine-learningSecurity Bootcamp
 
Azure machine learning 101 Parts 1 & 2 - Classification Algorithms
Azure machine learning 101  Parts 1 & 2  -  Classification Algorithms Azure machine learning 101  Parts 1 & 2  -  Classification Algorithms
Azure machine learning 101 Parts 1 & 2 - Classification Algorithms Setu Chokshi
 
Azure machine learning 101 - Part 1
Azure machine learning 101 - Part 1Azure machine learning 101 - Part 1
Azure machine learning 101 - Part 1Setu Chokshi
 
Qtp (basics to advanced)
Qtp (basics to advanced)Qtp (basics to advanced)
Qtp (basics to advanced)G.C Reddy
 

Similar to Introduction to Data Mining / Bioinformatics (20)

Online Chemical Modeling Environment: Models
Online Chemical Modeling Environment: ModelsOnline Chemical Modeling Environment: Models
Online Chemical Modeling Environment: Models
 
Geog2 question 2
Geog2 question 2Geog2 question 2
Geog2 question 2
 
clustering.ppt
clustering.pptclustering.ppt
clustering.ppt
 
Machine Learning Techniques for the Evaluating of External ...
Machine Learning Techniques for the Evaluating of External ...Machine Learning Techniques for the Evaluating of External ...
Machine Learning Techniques for the Evaluating of External ...
 
Software Engineering Course 2009 - Mining Software Archives
Software Engineering Course 2009 - Mining Software ArchivesSoftware Engineering Course 2009 - Mining Software Archives
Software Engineering Course 2009 - Mining Software Archives
 
Machine Learning Challenges For Automated Prompting In Smart Homes
Machine Learning Challenges For Automated Prompting In Smart HomesMachine Learning Challenges For Automated Prompting In Smart Homes
Machine Learning Challenges For Automated Prompting In Smart Homes
 
Model-driven decision support for monitoring network design based on analysis...
Model-driven decision support for monitoring network design based on analysis...Model-driven decision support for monitoring network design based on analysis...
Model-driven decision support for monitoring network design based on analysis...
 
Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...
Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...
Prostate Cancer Diagnosis using Deep Learning with 3D Multiparametric MRI: Pr...
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
Metabolomics and Beyond Challenges and Strategies for Next-gen Omic Analyses
 
Randomization Approach in Case-Based Reasoning: Case of study of mammography ...
Randomization Approach in Case-Based Reasoning: Case of study of mammography ...Randomization Approach in Case-Based Reasoning: Case of study of mammography ...
Randomization Approach in Case-Based Reasoning: Case of study of mammography ...
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
 
Interscience discovering knowledge in data an introduction to data mining
Interscience discovering knowledge in data   an introduction to data miningInterscience discovering knowledge in data   an introduction to data mining
Interscience discovering knowledge in data an introduction to data mining
 
Cluster Analysis : Assignment & Update
Cluster Analysis : Assignment & UpdateCluster Analysis : Assignment & Update
Cluster Analysis : Assignment & Update
 
Seminarppt
SeminarpptSeminarppt
Seminarppt
 
Input modeling
Input modelingInput modeling
Input modeling
 
Malware detection-using-machine-learning
Malware detection-using-machine-learningMalware detection-using-machine-learning
Malware detection-using-machine-learning
 
Azure machine learning 101 Parts 1 & 2 - Classification Algorithms
Azure machine learning 101  Parts 1 & 2  -  Classification Algorithms Azure machine learning 101  Parts 1 & 2  -  Classification Algorithms
Azure machine learning 101 Parts 1 & 2 - Classification Algorithms
 
Azure machine learning 101 - Part 1
Azure machine learning 101 - Part 1Azure machine learning 101 - Part 1
Azure machine learning 101 - Part 1
 
Qtp (basics to advanced)
Qtp (basics to advanced)Qtp (basics to advanced)
Qtp (basics to advanced)
 


Introduction to Data Mining / Bioinformatics

  • 1. Introduction to Bioinformatics: Mining Your Data. Gerry Lushington, Lushington in Silico (modeling / informatics consultant).
  • 2. What is Data Mining? Use of computational methods to perceive trends in data that can be used to explain or predict important outcomes or properties. Applicable across many disciplines: molecular bioinformatics; medical informatics; health informatics; biodiversity informatics.
  • 3. Example Applications: Find relationships between convenient observables and important outcomes. Convenient observables: a) relative gene expression data; b) relative protein abundance data; c) relative lipid & metabolite profiles; d) glycosylation variants; e) SNPs, alleles; f) cellular traits; g) organism traits; h) behavioral traits; i) case history. Important outcomes: 1. disease susceptibility; 2. drug efficacy; 3. toxin susceptibility; 4. immunity; 5. genetic disorders; 6. microbial virulence; 7. species adaptive success; 8. species complementarity.
  • 4. Goals for this lecture: Focus on Data Mining (how to approach your data and use it to understand biology); overview of available techniques; understanding model validation. Try to think about data you’ve seen: what techniques might be useful? Don’t worry about grasping everything: K-INBRE Bioinformatics Core is here to help!!
  • 5. Basic Data Mining: Find relationships between: a) easy-to-measure properties vs. b) important (but harder to measure) outcomes or attributes. Use relationships to understand the conceptual basis for outcomes in b). Use relationships to predict outcomes in new cases where the outcome has not yet been measured.
  • 6. Basic Data Mining: simple measurables
  • 7. Basic Data Mining: general observation. (Figure: samples grouped as Unhappy vs. Happy.)
  • 8. Basic Data Mining: relationship (#1). Rule: blue = happy; red = unhappy. Accuracy = 12/20 = 60%. (Figure: samples grouped as Unhappy vs. Happy.)
  • 9. Basic Data Mining: relationship (#2). Rule: blue + BIG red = happy; little red = unhappy. Accuracy = 17/20 = 85%. (Figure: samples grouped as Unhappy vs. Happy.)
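The two rules on slides 8 and 9 can be scored with a few lines of Python. This is only a sketch: the slides give the accuracies but not the underlying samples, so the counts below are made up and were chosen solely so that the two rules reproduce 12/20 and 17/20.

```python
# Hypothetical toy data: counts chosen only so the two rules reproduce the
# accuracies quoted on the slides. Each sample is (color, size, is_happy).
samples = (
    [("blue", "big", True)] * 4
    + [("blue", "big", False)] * 2
    + [("red", "big", True)] * 5
    + [("red", "little", True)] * 1
    + [("red", "little", False)] * 8
)


def rule_1(color, size):
    """Relationship #1: blue = happy, red = unhappy."""
    return color == "blue"


def rule_2(color, size):
    """Relationship #2: blue or big red = happy, little red = unhappy."""
    return color == "blue" or size == "big"


def accuracy(rule, data):
    """Fraction of samples whose predicted label matches the observed one."""
    correct = sum(rule(color, size) == happy for color, size, happy in data)
    return correct / len(data)


print(accuracy(rule_1, samples))  # 0.6
print(accuracy(rule_2, samples))  # 0.85
```

Adding the second measurable (size) is what lifts the accuracy: the five big red samples that rule #1 gets wrong are captured by rule #2.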
  • 10. Data Mining: procedure. 1. Data Acquisition; 2. Data Preprocessing; 3. Feature Selection; 4. Classification; 5. Validation; 6. Prediction & Iteration.
  • 11. Data Mining: procedure (step 1: Data Acquisition). What to record: peak heights? Peak positions? Key issues include: a) format conversion from instrument; b) any necessary mathematical manipulations (e.g., density = M/V).
  • 12. Data Mining: procedure (step 2: Data Preprocessing). Key issues include: a) normalization to account for experimental bias; b) statistical detection of flagrant outliers.
  • 13. Data Mining: procedure (step 2: Data Preprocessing). Use controls to scale data. Key issues include: a) normalization to account for experimental bias; b) statistical detection of flagrant outliers. (Figure labels: C, 1, 2, 3, repeated.)
  • 14. Data Mining: procedure (step 2: Data Preprocessing). Slide note: subjective (requires experience and/or domain knowledge). Key issues include: a) normalization to account for experimental bias; b) statistical detection of flagrant outliers.
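The two preprocessing issues on slides 12 to 14 (scaling by controls, flagging flagrant outliers) might be sketched as follows. The slides do not specify a method, so both choices here are assumptions: dividing by the mean control reading of each run, and the median/MAD "modified z-score" with a cutoff of 3.5. All data values are hypothetical.

```python
from statistics import mean, median


def scale_by_control(sample_values, control_values):
    """Divide each measurement by the mean control reading of the same run."""
    control_mean = mean(control_values)
    return [value / control_mean for value in sample_values]


def flag_outliers(values, cutoff=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds cutoff."""
    med = median(values)
    mad = median([abs(value - med) for value in values])
    if mad == 0:
        return [False] * len(values)  # no spread, nothing to flag
    return [abs(0.6745 * (value - med) / mad) > cutoff for value in values]


# Hypothetical runs: run_2 reads twice as high as run_1 (instrument bias).
run_1 = scale_by_control([12.0, 8.0, 11.0], control_values=[10.0, 10.0])
run_2 = scale_by_control([24.0, 16.0, 22.0], control_values=[20.0, 20.0])
print(run_1, run_2)

# Hypothetical scaled readings with one flagrant outlier at the end.
flags = flag_outliers([1.0, 1.1, 0.9, 1.2, 0.8, 9.0])
print(flags)
```

As slide 14 notes, the cutoff is a judgment call: a flagged value may be an instrument glitch or a genuinely unusual sample, and deciding which takes domain knowledge.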
  • 15. Data Mining: procedure (step 3: Feature Selection). Which out of many measurable properties relate to outcome of interest? a) Intrinsic information content; b) redundancy relative to other properties; c) correlation with target attribute; d) iterative model training.
  • 16. Data Mining: procedure (step 3: Feature Selection). Same question and criteria a) to d) as slide 15. (Figure: panels of properties 1 2 3 4, two marked x.)
  • 17. Data Mining: procedure (step 3: Feature Selection). Same question and criteria a) to d) as slide 15. (Figure: panels of properties 1 2 3 4, one marked x.)
  • 18. Data Mining: procedure (step 3: Feature Selection). Same question and criteria a) to d) as slide 15. (Figure: panels of properties 1 2 3 4, one marked x, plus an additional panel 1 2 3 4.)
  • 19. Data Mining: procedure (step 3: Feature Selection). Iterative model training: train preliminary models based on random sets of properties; evaluate models according to correlative or predictive performance; experiment with promising sets, adding or deleting descriptors to gauge impact on performance. Same question and criteria a) to d) as slide 15.
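Criteria a) to c) from slides 15 to 19 can be illustrated with a simple correlation filter. This is a sketch under stated assumptions, not the method used in the lecture: it uses Pearson correlation, the thresholds `min_relevance` and `max_redundancy` are arbitrary, and the feature names and values are invented. It does not cover d), iterative model training.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a) a constant property carries no information
    return cov / (spread_x * spread_y)


def select_features(features, target, min_relevance=0.3, max_redundancy=0.9):
    """Keep relevant properties, skipping ones redundant with those kept."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    for name in sorted(features, key=relevance.get, reverse=True):
        if relevance[name] < min_relevance:
            continue  # c) too weakly correlated with the target
        if any(abs(pearson(features[name], features[kept])) > max_redundancy
               for kept in selected):
            continue  # b) nearly duplicates a property already kept
        selected.append(name)
    return selected


# Hypothetical data: six samples, outcome 0/1, four candidate properties.
target = [0, 0, 0, 1, 1, 1]
features = {
    "gene_a": [1, 2, 1, 5, 6, 5],
    "gene_b": [2, 4, 2, 10, 12, 9],   # nearly proportional to gene_a
    "lipid_c": [5, 2, 6, 2, 4, 1],
    "noise": [1, 5, 2, 2, 4, 1],
    "constant": [3, 3, 3, 3, 3, 3],
}
chosen = select_features(features, target)
print(chosen)
```

A filter like this is cheap but ignores interactions between properties, which is one reason to follow it with the trial-and-error model training described on slide 19.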
  • 20. Data Mining: procedure (step 4: Classification). Predict which sample will have which outcome? a) Correlative methods; b) distance-based clustering; c) boundary detection; d) rule learning; e) weighted probability.
  • 21. Data Mining: procedure (step 4: Classification). Same question and methods a) to e) as slide 20. (Figure: plot of y vs. x.)
  • 22. Data Mining: procedure (step 4: Classification). Same question and methods a) to e) as slide 20. (Figure: plot of y vs. x; a y scale from -n to +n is split into NO and YES outcomes.)
  • 23. Data Mining: procedure (step 4: Classification). Same question and methods a) to e) as slide 20. (Figure: plot of x2 vs. x1.)
  • 24. Data Mining: procedure (step 4: Classification). Same question and methods a) to e) as slide 20. (Figure: plot of x2 vs. x1 with four groups, y1 to y4.) y1 = resistant to types I & II diabetes; y2 = susceptible only to type II; y3 = susceptible only to type I; y4 = susceptible to types I & II.
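A minimal sketch of a distance-based approach to the four groups on slide 24: assign a new sample to the group whose centroid is nearest in (x1, x2) space. The slides do not name a specific algorithm, so nearest-centroid assignment is a stand-in chosen for brevity, and all coordinates are hypothetical.

```python
from math import dist


def centroids(points_by_class):
    """Mean (x1, x2) position of the training samples in each class."""
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in points_by_class.items()
    }


def classify(point, class_centroids):
    """Return the label whose centroid is closest (Euclidean) to point."""
    return min(class_centroids, key=lambda label: dist(point, class_centroids[label]))


# Hypothetical training samples for the four outcome groups y1..y4.
training = {
    "y1": [(1.0, 4.0), (1.5, 4.5), (1.0, 5.0)],
    "y2": [(4.0, 4.0), (4.5, 5.0), (5.0, 4.5)],
    "y3": [(1.0, 1.0), (1.5, 0.5), (0.5, 1.5)],
    "y4": [(4.0, 1.0), (5.0, 1.5), (4.5, 0.5)],
}
centers = centroids(training)
print(classify((1.2, 4.4), centers))
print(classify((4.6, 0.9), centers))
```

Because Euclidean distance mixes the two axes, x1 and x2 should be on comparable scales first, which is one reason preprocessing comes earlier in the procedure.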
  • 25. Data Mining: procedure (step 4: Classification). Same question and methods a) to e) as slide 20. (Figure: plot of x2 vs. x1 with regions labeled Resistant to type I and Susceptible to type I.)
  • 26. Data Mining: procedure (step 4: Classification). Same question and methods a) to e) as slide 20. (Figure: plot of x2 vs. x1 with regions Resistant to type I and Susceptible to type I; thresholds a and b on x2, threshold c on x1.) Rule: If x1 < c and x2 > a then resistant; else if x1 > c and x2 > b then resistant; else susceptible. E = 9.
  • 27. Data Mining: procedure (step 4: Classification). Same question and methods a) to e) as slide 20. (Figure: three one-dimensional splits: x1 at threshold a, Resistant | Susc.; x2 at threshold b, Susc. | Resistant; Fx1 - Gx2 at threshold c, Resistant | Susc.) Rule: If Fx1 - Gx2 < c then resistant; else susceptible.
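The rules written out on slides 26 and 27 translate directly into code. The slides give the rule structure but no numbers, so the thresholds `a`, `b`, `c` and the weights `F`, `G` used in the calls below are hypothetical values picked only to exercise each branch.

```python
def rule_based(x1, x2, a, b, c):
    """Slide 26: axis-aligned boundaries a and b on x2, c on x1."""
    if x1 < c and x2 > a:
        return "resistant"
    elif x1 > c and x2 > b:
        return "resistant"
    return "susceptible"


def weighted(x1, x2, F, G, c):
    """Slide 27: a single threshold c on the weighted combination F*x1 - G*x2."""
    return "resistant" if F * x1 - G * x2 < c else "susceptible"


# Hypothetical thresholds and weights, for illustration only.
print(rule_based(1.0, 3.0, a=2.0, b=4.0, c=2.5))
print(rule_based(3.0, 3.0, a=2.0, b=4.0, c=2.5))
print(rule_based(3.0, 5.0, a=2.0, b=4.0, c=2.5))
print(weighted(1.0, 2.0, F=1.0, G=0.5, c=0.5))
print(weighted(3.0, 1.0, F=1.0, G=0.5, c=0.5))
```

The first form needs three thresholds and two conditions; the second replaces them with one oblique boundary, at the cost of having to choose the weights F and G.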
  • 28. Data Mining: procedure (step 5: Validation). Define criteria and tests to prove model validity: a) accuracy; b) sensitivity vs. specificity; c) Receiver Operating Characteristic (ROC) plot; d) cross-validation.
  • 29. Data Mining: procedure (step 5: Validation). Same criteria a) to d) as slide 28. (Figure: plot of x2 vs. x1 with regions Resistant (Neg.) and Susc. (Pos.).) Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154.
  • 30. Data Mining: procedure (step 5: Validation). Same criteria a) to d) as slide 28. (Figure: plot of x2 vs. x1 with regions Resistant (Neg.) and Susc. (Pos.).) Sensitivity = TP / (TP + FN) = 67 / 72. FPR = FP / (TN + FP) = 6 / 81. Note: Specificity = 1 - FPR.
  • 31. Data Mining: procedure (step 5: Validation). Same criteria a) to d) as slide 28. (Figure: plot of x2 vs. x1 with regions Resistant (Neg.) and Susc. (Pos.); varying model stringency, from less to more.) Sensitivity = TP / (TP + FN) = 69 / 72. FPR = FP / (TN + FP) = 19 / 81. Note: Specificity = 1 - FPR.
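The sensitivity and FPR figures on slides 30 and 31 can be recomputed from the confusion counts implied by the quoted fractions (for example, 67 / 72 means TP = 67 and FN = 5). The function names are illustrative; the counts come from the slides.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """FPR: FP / (TN + FP)."""
    return fp / (tn + fp)


def specificity(fp, tn):
    """Specificity = 1 - FPR = TN / (TN + FP)."""
    return tn / (tn + fp)


# Counts implied by slide 30 (67/72 and 6/81) and slide 31 (69/72 and 19/81).
slide_30 = {"tp": 67, "fn": 5, "fp": 6, "tn": 75}
slide_31 = {"tp": 69, "fn": 3, "fp": 19, "tn": 62}

for name, counts in (("slide 30", slide_30), ("slide 31", slide_31)):
    sens = sensitivity(counts["tp"], counts["fn"])
    fpr = false_positive_rate(counts["fp"], counts["tn"])
    spec = specificity(counts["fp"], counts["tn"])
    print(f"{name}: sensitivity={sens:.3f} FPR={fpr:.3f} specificity={spec:.3f}")
```

Comparing the two rows shows the trade-off the slides describe: the second setting catches two more true positives (69 vs. 67) but admits thirteen more false positives (19 vs. 6).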
  • 32. Data Mining: procedure (step 5: Validation). Same criteria a) to d) as slide 28. (Figure: ROC plot of sensitivity vs. FPR.)
  • 33. Data Mining: procedure (step 5: Validation). Same criteria a) to d) as slide 28. (Figure: ROC plot of sensitivity vs. FPR.) Area under the curve is an excellent measure of model performance: 1.0 = perfect model; 0.5 = random.
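The ROC plot on slides 32 and 33 is traced by sweeping the decision threshold and recording (FPR, sensitivity) at each setting; the area under it can then be estimated with the trapezoid rule. The sketch below assumes each sample has a numeric model score where higher means "more likely positive"; the scores and labels are hypothetical, and the input must contain at least one positive and one negative.

```python
def roc_points(scores, labels):
    """(FPR, sensitivity) pairs as the threshold is lowered past each score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit all samples tied at this score in one step.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical scores: perfectly ranked, uninformative, and in between.
perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
uninformative = auc(roc_points([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))
mixed = auc(roc_points([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
print(perfect, uninformative, mixed)
```

The first two cases reproduce the reference values on slide 33: a model that ranks every positive above every negative scores 1.0, and one that cannot separate them scores 0.5.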
  • 34. Data Mining: procedure (step 5: Validation). Predictions are imperfect due to: imperfect algorithms; imperfect data. Same criteria a) to d) as slide 28.
  • 35. Cross-Validation: Carefully monitor features that are useful across different independent data subsets. This can be accomplished with N-fold cross-validation. (Figure: Trials 1 to 5, each split into Test and Train portions.) Model performance = mean predictive performance over 5 trials. The best feature selection and classification algorithms will yield the best consistent performance across independent trials. The best features will be consistently important across trials.
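The 5-trial scheme on slide 35 can be sketched as a generic N-fold loop: shuffle the sample indices, split them into N folds, hold out one fold per trial for testing, and average the test accuracy. The `fit` and `predict` callables are placeholders for whatever classifier is being validated; the midpoint-threshold model and the data below are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices and deal them into k non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return (mean test accuracy, per-trial accuracies) over k trials."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        hits = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores), scores


# Placeholder model: threshold halfway between the two class means.
def fit_midpoint(train_x, train_y):
    pos = [x for x, y in zip(train_x, train_y) if y]
    neg = [x for x, y in zip(train_x, train_y) if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_midpoint(threshold, x):
    return x > threshold


# Hypothetical, well-separated one-dimensional data (10 samples per class).
xs = [float(i) for i in range(10)] + [float(i) for i in range(20, 30)]
ys = [False] * 10 + [True] * 10
mean_score, trial_scores = cross_validate(xs, ys, fit_midpoint, predict_midpoint)
print(mean_score, trial_scores)
```

Beyond the mean, the spread of the per-trial scores is worth inspecting: a model or feature set that does well in only some trials is not performing consistently in the sense slide 35 asks for.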
  • 36. Data Mining: procedure (step 6: Prediction & Iteration). Analysis is only useful if it is used, and only improves if it is tested. a) Good validation requires successful new predictions; b) imperfect predictions can lead to method refinement and greater understanding.
  • 37. Questions? Lushington in Silico Geraldlushington3117 at aol.com Geraldlushington.org