This presentation, prepared by Gerry Lushington, is a friendly introduction to the basics of data mining, as applied to biological problems. The intended audience is students and scientific researchers from a non-computational background.
2. What is Data Mining?
Use of computational methods to perceive trends in data that can be used to explain or predict important outcomes or properties
Applicable across many disciplines:
Molecular bioinformatics
Medical Informatics
Health Informatics
Biodiversity informatics
3. Example Applications:
Find relationships between convenient observables and important outcomes.
Convenient observables:
a) Relative gene expression data
b) Relative protein abundance data
c) Relative lipid & metabolite profiles
d) Glycosylation variants
e) SNPs, alleles
f) Cellular traits
g) Organism traits
h) Behavioral traits
i) Case history
Important outcomes:
1. Disease susceptibility
2. Drug efficacy
3. Toxin susceptibility
4. Immunity
5. Genetic disorders
6. Microbial virulence
7. Species adaptive success
8. Species complementarity
4. Goals for this lecture:
Focus on Data Mining: how to approach your data and use it to
understand biology
Overview of available techniques
Understanding model validation
Try to think about data you’ve seen: what techniques might be
useful?
Don’t worry about grasping everything:
K-INBRE Bioinformatics Core is here to help!!
5. Basic Data Mining:
Find relationships between:
a) Easy to measure properties vs.
b) Important (but harder to measure) outcomes or attributes
Use relationships to understand the conceptual basis for
outcomes in b)
Use relationships to predict outcomes in new cases where
outcome has not yet been measured
8. Basic Data Mining: relationship (#1)
[Figure: scatter of unhappy vs. happy samples]
Rule: Blue = happy; Red = unhappy. Accuracy = 12/20 = 60%
9. Basic Data Mining: relationship (#2)
[Figure: same scatter of unhappy vs. happy samples]
Rule: Blue + big Red = happy; little Red = unhappy. Accuracy = 17/20 = 85%
10. Data Mining: procedure
1. Data Acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
11. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: raw instrument spectrum — peak heights? peak positions?]
Key issues include:
a) format conversion from instrument
b) any necessary mathematical manipulations (e.g., Density = M/V)
12. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
Key issues include:
a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers
13. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
Key issues include:
a) Normalization to account for experimental bias (use controls to scale data)
b) Statistical detection of flagrant outliers
[Figure: bar charts of four batches, each with a control (C) and samples 1-3]
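The control-based scaling mentioned above can be sketched as follows. This is a minimal illustration, assuming a simple divide-by-control scheme; the batch values are hypothetical, not data from the lecture.

```python
# Sketch: normalize each batch by its control reading so that batches
# measured under different instrument conditions become comparable.

def normalize_by_control(batches):
    """Scale every sample in a batch by that batch's control value."""
    normalized = []
    for control, samples in batches:
        normalized.append([s / control for s in samples])
    return normalized

# Two hypothetical batches: same biology, but the second was measured
# at double the instrument gain.
batches = [
    (2.0, [4.0, 6.0, 8.0]),
    (4.0, [8.0, 12.0, 16.0]),
]

scaled = normalize_by_control(batches)
# After normalization the two batches agree.
```

After scaling, both batches reduce to the same profile, which is exactly the experimental bias the control is meant to remove.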
14. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
Note: these choices are subjective (require experience and/or domain knowledge)
Key issues include:
a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers
15. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
Which out of many measurable properties relate to outcome of interest?
a) Intrinsic information content
b) Redundancy relative to other properties
c) Correlation with target attribute
d) Iterative model training
16. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: batch bar charts (samples 1-4) with low-information features crossed out]
Which out of many measurable properties relate to outcome of interest?
a) Intrinsic information content
b) Redundancy relative to other properties
c) Correlation with target attribute
d) Iterative model training
17. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: batch bar charts (samples 1-4) with a redundant feature crossed out]
Which out of many measurable properties relate to outcome of interest?
a) Intrinsic information content
b) Redundancy relative to other properties
c) Correlation with target attribute
d) Iterative model training
18. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: batch bar charts (samples 1-4) with a feature uncorrelated with the target crossed out]
Which out of many measurable properties relate to outcome of interest?
a) Intrinsic information content
b) Redundancy relative to other properties
c) Correlation with target attribute
d) Iterative model training
19. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
Iterative model training:
• Train preliminary models based on random sets of properties
• Evaluate models according to correlative or predictive performance
• Experiment with promising sets, adding or deleting descriptors to gauge impact on performance
Which out of many measurable properties relate to outcome of interest?
a) Intrinsic information content
b) Redundancy relative to other properties
c) Correlation with target attribute
d) Iterative model training
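One simple way to act on criteria (a)-(c) is to score each candidate feature on its own and rank them. The sketch below is a toy illustration, assuming a one-feature threshold rule as the scoring scheme; the data and function names are hypothetical, not the lecture's method.

```python
# Sketch of feature scoring: rank each feature by how well a single
# threshold rule on that feature alone predicts the class labels.

def threshold_accuracy(values, labels):
    """Best accuracy achievable by thresholding one feature."""
    best = 0.0
    for t in values:
        for sign in (1, -1):
            preds = [sign * (v - t) > 0 for v in values]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            best = max(best, acc)
    return best

def rank_features(features, labels):
    """Return feature names, most informative first."""
    return sorted(features,
                  key=lambda name: -threshold_accuracy(features[name], labels))

features = {
    "informative": [0.1, 0.2, 0.9, 0.8],  # tracks the labels
    "noise":       [0.5, 0.4, 0.5, 0.4],  # does not
}
labels = [False, False, True, True]
ranking = rank_features(features, labels)
```

A full iterative scheme (step d) would then add or delete ranked features from a trained model and keep only those that improve validated performance.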
20. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
Predict which sample will have which outcome?
a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability
21. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: correlative fit of outcome y vs. property x]
Predict which sample will have which outcome?
a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability
22. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: y vs. x fit with decision thresholds (−n … +n) on y mapping to NO / YES]
Predict which sample will have which outcome?
a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability
23. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: samples clustered in the (x1, x2) plane]
Predict which sample will have which outcome?
a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability
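Distance-based classification can be sketched with a nearest-centroid rule: assign a new sample to the class whose cluster center is closest in the (x1, x2) plane. The cluster coordinates below are hypothetical illustration data.

```python
# Sketch of distance-based classification: label a new sample by the
# nearest cluster centroid in feature space.
import math

def centroid(points):
    """Mean (x1, x2) position of a list of 2-D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

def classify(sample, clusters):
    """clusters: {label: [(x1, x2), ...]} -> nearest-centroid label."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return min(clusters,
               key=lambda lab: dist(sample, centroid(clusters[lab])))

clusters = {
    "resistant":   [(1.0, 4.0), (1.5, 4.5), (0.5, 5.0)],
    "susceptible": [(4.0, 1.0), (4.5, 0.5), (5.0, 1.5)],
}
label = classify((1.2, 4.2), clusters)
```

A new sample near the resistant cluster is labeled resistant; one near the susceptible cluster is labeled susceptible.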
24. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: four clusters y1-y4 in the (x1, x2) plane]
Predict which sample will have which outcome?
a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability
Cluster legend:
y1 = resistant to types I & II diabetes
y2 = susceptible only to type II
y3 = susceptible only to type I
y4 = susceptible to types I & II
25. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: boundary in the (x1, x2) plane separating samples resistant to type I from samples susceptible to type I]
Predict which sample will have which outcome?
a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability
26. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: (x1, x2) plane with learned thresholds a, b, c separating resistant from susceptible (type I)]
Predict which sample will have which outcome?
a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability
Learned rules:
If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible
(E = 9)
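The learned rule set above translates directly into code. The threshold values assigned to a, b, and c below are hypothetical stand-ins for the cut-offs in the figure.

```python
# The rule-learning classifier from the slide, written as code.
# A, B, C are hypothetical values for the learned thresholds a, b, c.
A, B, C = 2.0, 4.0, 3.0

def predict_rules(x1, x2):
    """If x1 < c and x2 > a: resistant;
       else if x1 > c and x2 > b: resistant;
       else: susceptible."""
    if x1 < C and x2 > A:
        return "resistant"
    if x1 > C and x2 > B:
        return "resistant"
    return "susceptible"
```

Such nested threshold rules are what decision-tree learners produce: each rule carves out an axis-aligned region of the (x1, x2) plane.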
27. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: weighted splits — on x1 at a (Resistant vs. Susc.), on x2 at b (Susc. vs. Resistant), combined as Fx1 − Gx2 split at c (Resistant vs. Susc.)]
Predict which sample will have which outcome?
a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability
Combined rule:
If Fx1 − Gx2 < c then resistant
Else susceptible
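The weighted combination rule is a linear classifier: a score F·x1 − G·x2 compared against a threshold c. The weight values below are hypothetical, chosen only so the sketch runs.

```python
# The weighted-combination rule from the slide as a linear classifier.
# F, G, C are hypothetical values for the learned weights and threshold.
F, G, C = 1.0, 2.0, 0.0

def predict_linear(x1, x2):
    """If F*x1 - G*x2 < c: resistant; else: susceptible."""
    return "resistant" if F * x1 - G * x2 < C else "susceptible"
```

Unlike the axis-aligned rules on the previous slide, this single weighted score defines one oblique boundary line in the (x1, x2) plane.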
28. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
Define criteria and tests to prove model validity
a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation
29. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: (x1, x2) plane with Resistant (Neg.) and Susc. (Pos.) regions]
Define criteria and tests to prove model validity
a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154
30. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: same (x1, x2) plane with Resistant (Neg.) and Susc. (Pos.) regions]
Define criteria and tests to prove model validity
a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation
Sensitivity = TP / (TP + FN) = 67 / 72
FPR = FP / (TN + FP) = 6 / 81
Note: Specificity = 1 − FPR
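The slide's metrics follow directly from its confusion counts. In the sketch below, FN = 5 and TN = 75 are inferred from the stated totals (67 of 72 positives caught, 6 of 81 negatives flagged).

```python
# Validation metrics from the slide's counts:
# 72 actual positives split TP = 67, FN = 5;
# 81 actual negatives split TN = 75, FP = 6 (FN, TN inferred).
TP, FN, TN, FP = 67, 5, 75, 6

sensitivity = TP / (TP + FN)                  # 67 / 72
fpr = FP / (TN + FP)                          # 6 / 81
specificity = 1 - fpr                         # equals TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)
```

Sensitivity measures how many true positives the model catches; the false positive rate (FPR) measures how many negatives it wrongly flags, and specificity is its complement.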
31. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: same plane, with the decision boundary shifted by varying model stringency (more vs. less stringent)]
Define criteria and tests to prove model validity
a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation
At the less stringent setting:
Sensitivity = TP / (TP + FN) = 69 / 72
FPR = FP / (TN + FP) = 19 / 81
Note: Specificity = 1 − FPR
32. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: ROC plot — Sensitivity (y-axis) vs. FPR (x-axis)]
Define criteria and tests to prove model validity
a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation
33. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
[Figure: ROC plot with the area under the curve shaded]
Define criteria and tests to prove model validity
a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation
Area under the curve (AUC) is an excellent measure of model performance:
1.0 = perfect model
0.5 = random
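Given a few (FPR, Sensitivity) operating points, the AUC can be sketched with the trapezoid rule. Below, the two operating points are the ones reported on the preceding slides; padding the curve with the (0, 0) and (1, 1) endpoints and interpolating linearly between points are assumptions of this sketch.

```python
# Sketch: area under an ROC curve by the trapezoid rule.

def auc(points):
    """points: (fpr, sensitivity) pairs sorted from (0,0) to (1,1)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid segment
    return area

# The two operating points from the slides, plus the endpoints.
roc = [(0.0, 0.0), (6 / 81, 67 / 72), (19 / 81, 69 / 72), (1.0, 1.0)]
score = auc(roc)
```

The diagonal from (0, 0) to (1, 1) scores exactly 0.5 (random guessing), while a curve that hugs the top-left corner approaches 1.0.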
34. Data Mining: procedure
1. Data acquisition Predictions are imperfect due to:
2. Data Preprocessing • Imperfect Algorithms
3. Feature Selection • Imperfect Data
4. Classification
5. Validation
6. Prediction & Iteration
Define criteria and tests to prove model validity
a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation
35. Cross-Validation:
• Carefully monitor features that are useful across different independent data subsets
• This can be accomplished with N-fold cross-validation:
[Figure: five trials; in each trial a different fifth of the data is held out for testing and the rest is used for training]
Model performance = mean predictive performance over 5 trials
• Best feature selection and classification algorithms will yield best consistent performance across independent trials
• Best features will be consistently important across trials
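The fold-splitting behind N-fold cross-validation can be sketched as follows; this minimal version assumes the sample count divides evenly by the fold count.

```python
# Sketch of N-fold cross-validation splitting: each fold serves once
# as the test set while the remaining folds form the training set.
# Model performance is then the mean over the folds.

def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    fold = n_samples // k
    for i in range(k):
        test = indices[i * fold:(i + 1) * fold]
        train = indices[:i * fold] + indices[(i + 1) * fold:]
        yield train, test

splits = list(k_fold_splits(20, k=5))
```

Every sample is tested exactly once, and no trial's test sample ever appears in that trial's training set, which is what makes the five performance estimates independent.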
36. Data Mining: procedure
1. Data acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration
Analysis is only useful if it is used, and it only improves if it is tested
a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and
greater understanding
37. Questions?
Lushington in Silico
Geraldlushington3117 at aol.com
Geraldlushington.org