Women Who Code-HSV Event:
'An Introduction to Machine Learning and Genomics'. Dr. Lasseigne will introduce the R programming language and the foundational concepts of machine learning through real-world examples, including applications in genomics, with an emphasis on complex human disease research.
Brittany Lasseigne, PhD, is a postdoctoral fellow in the lab of Dr. Richard Myers at the HudsonAlpha Institute for Biotechnology and a 2016-2017 Prevent Cancer Foundation Fellow. Dr. Lasseigne received a BS in biological engineering from the James Worth Bagley College of Engineering at Mississippi State University and a PhD in biotechnology science and engineering from The University of Alabama in Huntsville. As a graduate student, she studied the role of epigenetics and copy number variation in cancer, identifying novel diagnostic biomarkers and prognostic signatures associated with kidney cancer. In her current position, Dr. Lasseigne’s research focus is the application of genetics and genomics to complex human diseases. Her recent work includes the identification of gene variants linked to ALS, characterization of gene expression patterns in schizophrenia and bipolar disorder, and development of non-invasive biomarker assays. Dr. Lasseigne is currently focused on integrating genomic data across cancers with functional annotations and patient information to explore novel mechanisms in cancer etiology and progression, identify therapeutic targets, and understand genomic changes associated with patient survival. Based upon those analyses, she is creating tools to share with the scientific community.
9. Multidimensional Data Sets
Cells, Tissues, & Diseases + Functional Annotations + Big Data
Goal: Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Image from encodeproject.org
10. Case study: The Cancer Genome Atlas
• Multiple data types for 11,000+ patients across 33 tumor types
• 549,625 files with 2,000+ metadata attributes
• >2.5 petabytes of data
14. Multidimensional Data Sets
Cells, Tissues, & Diseases + Functional Annotations
• We have lots of data and complex problems
• We want to make data-driven predictions and need to automate model building
Image from encodeproject.org and xorlogics.com.
15. Multidimensional Data Sets
Complex problems + Big Data → Machine Learning!
Image from encodeproject.org and xorlogics.com.
16. Machine Learning
• A data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when you have complex problems and lots of data ('big data')

Traditional Programming: Data ([2,3]) + Program (+) → Computer → Output (5)
Machine Learning: Data ([2,3]) + Output (5) → Computer → Program (+)

• Our goal isn't to make perfect guesses, but to make useful guesses: we want to build a model that is useful for the future
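The [2,3] → 5 example can be sketched in code (the talk itself uses R; this Python/scikit-learn translation is for illustration only): instead of writing the program (+), we hand the computer inputs and outputs and let it learn the program as model weights.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: we supply the program (+) and compute the output.
assert 2 + 3 == 5

# Machine learning: we supply data and outputs, and the computer learns the
# "program" -- here, weights close to [1, 1], i.e., addition.
X = np.array([[1, 1], [2, 3], [4, 2], [0, 5], [3, 3]])
y = X.sum(axis=1)  # the observed outputs

model = LinearRegression().fit(X, y)
print(model.coef_)              # the learned "program": weights near [1, 1]
print(model.predict([[2, 3]]))  # applying it to the slide's example input
```

The learned model is not a perfect symbolic "+", just a useful approximation of it, which is exactly the point of the goal bullet above.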
22. The R Programming Language
• Open-source implementation of S, which was originally developed at Bell Labs
• Free programming language and software environment for advanced statistical computing and graphics
• Functional programming language written primarily in C and Fortran
• Good at data manipulation, modeling and computing, and data visualization
• Cross-platform compatible
• Vast community (e.g., CRAN, R-bloggers, Bioconductor)
• Over 10,000 packages, including parallel/high-performance computing packages
• Used extensively by statisticians and academics
• Popularity has increased substantially in recent years
• Drawbacks: the learning curve can be steep (better recently), limited GUI (RStudio!), documentation can be sparse, memory allocation can be an issue
24. Iris Dataset in R
Fisher's/Anderson's iris data set: measurements (cm) of the sepal length and width and petal length and width (4 features) for 50 flowers from each of 3 species (Iris setosa, versicolor, and virginica)
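In R the data set ships as the built-in `iris` data frame; an equivalent way to load it in Python (an illustrative substitution, since the talk uses R) is via scikit-learn:

```python
from sklearn.datasets import load_iris

iris = load_iris()
# 150 flowers x 4 features: sepal length/width and petal length/width, in cm
print(iris.data.shape)     # (150, 4)
print(iris.feature_names)  # the 4 measurement names
print(iris.target_names)   # the 3 species: setosa, versicolor, virginica
```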
28. Iris Data: Adding Regularization (LASSO)
• Model building with a large number of features for a moderate number of samples can result in 'overfitting': the model is too specific to the training set and not generalizable enough for accurate predictions with new data
• Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant)
• This results in FEATURE SELECTION
• Ridge regression and LASSO regression are methods of regression with regularization

LASSO shrinks the coefficients of redundant features to zero:
Petal.Width ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + b
→ Petal.Width ~ 0*Petal.Length + 0*Sepal.Width + C*Sepal.Length + b
→ Petal.Width ~ Sepal.Length + b
Fitted model: Petal.Width ~ 0.968*Sepal.Length + 0.187 (p-value < 2.2×10^-16)
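A minimal sketch of this feature-selection effect in Python/scikit-learn (the talk works in R; the penalty strengths `alpha=0.001` and `alpha=0.5` below are illustrative assumptions, not values from the slides): as the L1 penalty grows, coefficients of redundant predictors of Petal.Width are driven to exactly zero.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso

iris = load_iris()
X = iris.data[:, :3]  # sepal length, sepal width, petal length
y = iris.data[:, 3]   # petal width (the response)

# Weak penalty: all three predictors keep nonzero coefficients.
weak = Lasso(alpha=0.001).fit(X, y)

# Stronger penalty: redundant predictors are zeroed out entirely --
# the FEATURE SELECTION the slide describes.
strong = Lasso(alpha=0.5).fit(X, y)

print(weak.coef_)
print(strong.coef_)
print((strong.coef_ == 0).sum())  # how many features were dropped
```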
29. Iris Data: Decision Trees
• Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive
• Decision tree limitations: each decision boundary at each split is a concrete binary decision, and the decision criteria consider only one input feature at a time (not a combination of multiple input features)
• Examples: video games, clinical decision models

Fitted tree:
Petal.Length < 2.35 cm → Setosa (40/0/0)
else: Petal.Width < 1.65 cm → Versicolor (0/40/12), else Virginica (0/0/28)
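A depth-2 tree like the slide's can be fit in a few lines (sketched in Python/scikit-learn rather than the talk's R; the exact split thresholds learned here may differ slightly from the slide's 2.35 cm and 1.65 cm, which came from its own training split):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 mirrors the slide: one petal split, then one more.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each split tests a single feature -- the limitation noted above.
print(export_text(tree, feature_names=iris.feature_names))
print(tree.score(iris.data, iris.target))  # training accuracy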
31. Iris Data: Neural Nets
• Neural networks (NNs) emulate how the human brain works, with a network of interconnected neurons (essentially logistic regression units) organized in multiple layers, allowing more complex, abstract, and subtle decisions
• Lots of tuning parameters (number of hidden layers, number of neurons in each layer, and multiple ways to tune learning)
• Learning is an iterative feedback mechanism in which training-data error is used to adjust the corresponding input weights, propagated back to previous layers (i.e., back-propagation)
• NNs are good at learning non-linear functions and can handle multiple outputs, but they have a long training time and models are susceptible to local-minimum traps (this can be mitigated by doing multiple rounds, which takes more time!)

'Neuron' diagram: inputs X1, X2 → summation of inputs and activation with a sigmoid function → output
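A small network on iris can be sketched in Python/scikit-learn (the talk uses R; the layer size, iteration count, and seed below are arbitrary tuning choices of exactly the kind the slide warns about):

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # scaling helps convergence

# One hidden layer of 8 'neurons'; the logistic activation matches the
# sigmoid unit in the slide's neuron diagram. Training iterates with
# back-propagation until the loss stops improving.
nn = MLPClassifier(hidden_layer_sizes=(8,), activation='logistic',
                   max_iter=5000, random_state=0)
nn.fit(X, iris.target)
print(nn.score(X, iris.target))  # training accuracy
```

Re-fitting with different `random_state` seeds is one cheap version of the "multiple rounds" mitigation for local minima mentioned above.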
32. Other Machine Learning Methods
• Naive Bayes (based on prior probabilities)
• Hidden Markov Models (Bayesian network with hidden states)
• K Nearest Neighbors (instance-based learning: clustering!)
• Support Vector Machines (discriminator defined by a separating hyperplane)
• Additional ensemble method approaches (combining multiple models)
• And new methods coming out all the time…

Workflow: Raw Data → Clean/Normalize Data → split into Training Set and Test Set → Build Model → Tune Model → Test → Apply to New Data (validation cohort or model application)
Algorithm Selection is an Important Step!
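The workflow above (clean/normalize → train/test split → build → test) might look like this in Python/scikit-learn; the 80/20 split and the choice of K Nearest Neighbors as the algorithm are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Raw data
iris = load_iris()

# Split: the test set is held out and never touched during model building
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Clean/normalize + build model, chained so the scaler is fit on
# training data only (avoiding leakage into the test set)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# Test: accuracy on unseen data estimates how the model will generalize
print(model.score(X_test, y_test))
```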
35. Integrating genomic data with machine learning to improve predictive modeling
1) Cross-Cancer Patient Outcome Prediction Model
2) Improved Kidney Cancer Patient Outcome Prediction Model
36. 'Common Survival Genes' across 19 cancers
• 'Common Survival Genes': Cox regression uncorrected p-value < 0.05 for a gene in at least 9/19 cancers
• 84 genes, enriched for proliferation-related processes including mitosis, cell and nuclear division, and spindle formation
• Clustering by Cox regression p-values yields 7 'Proliferative Informative Cancers' and 12 'Non-Proliferative Informative Cancers'
[Heatmap: top cross-cancer survival genes, colored by scaled -log10 Cox p-value; rows are the 19 cancer types ESCA, STAD, OV, LUSC, GBM, LAML, LIHC, SARC, BLCA, CESC, HNSC, BRCA, ACC, MESO, KIRP, LUAD, PAAD, LGG, KIRC]
Ramaker & Lasseigne, et al. 2017.
37. 'Common Survival Genes' across 19 cancers (build of slide 36): the 7 Proliferative Informative Cancers (PICs) are highlighted on the heatmap.
Ramaker & Lasseigne, et al. 2017.
38. 'Common Survival Genes' across 19 cancers (build of slide 36): the 12 Non-Proliferative Informative Cancers (Non-PICs) are highlighted on the heatmap.
Ramaker & Lasseigne, et al. 2017.
40. Integrating genomic data with machine learning to improve predictive modeling
1) Cross-Cancer Patient Outcome Prediction Model
2) Improved Kidney Cancer Patient Survival Prediction Model
41. TCGA Kidney Renal Cell Carcinoma (KIRC) Data Set
• 291 tumor samples with clinical, RNA-seq, DNAm, and CNV data available (~1/3 of patients died from disease)
42. Can we improve clinically relevant phenotype prediction with multi-omics classifiers?
Clinically annotated multidimensional data sets: RNA expression, protein expression, microRNA expression, mutations, DNAm, CNV
Approach: Cox regression with LASSO feature selection, combining RNA, DNAm, and CNV
45. Multi-omic classifiers to predict patient outcome
• An RNA+DNAm+CNV model of patient survival outperformed each data type alone or with another single data type, as well as models built on features before dimension reduction
• Synergistic effect from combining RNA, DNAm, and CNV into combined features for prediction of patient outcome
• Some principal components were strongly correlated with CIN or DNAmIN status

Model         | Test AUC
CNV           | <0.5
RNA           | 0.5683
DNAm          | 0.6794
DNAm+CNV      | 0.6571
RNA+CNV       | 0.6730
RNA+DNAm      | 0.7397
RNA+DNAm+CNV  | 0.7619