Women Who Code-HSV Event:
'An Introduction to Machine Learning and Genomics'. Dr. Lasseigne will introduce the R programming language and the foundational concepts of machine learning through real-world examples, including applications in genomics, with an emphasis on complex human disease research.
Brittany Lasseigne, PhD, is a postdoctoral fellow in the lab of Dr. Richard Myers at the HudsonAlpha Institute for Biotechnology and a 2016-2017 Prevent Cancer Foundation Fellow. Dr. Lasseigne received a BS in biological engineering from the James Worth Bagley College of Engineering at Mississippi State University and a PhD in biotechnology science and engineering from The University of Alabama in Huntsville. As a graduate student, she studied the role of epigenetics and copy number variation in cancer, identifying novel diagnostic biomarkers and prognostic signatures associated with kidney cancer. In her current position, Dr. Lasseigne’s research focus is the application of genetics and genomics to complex human diseases. Her recent work includes the identification of gene variants linked to ALS, characterization of gene expression patterns in schizophrenia and bipolar disorder, and development of non-invasive biomarker assays. Dr. Lasseigne is currently focused on integrating genomic data across cancers with functional annotations and patient information to explore novel mechanisms in cancer etiology and progression, identify therapeutic targets, and understand genomic changes associated with patient survival. Based upon those analyses, she is creating tools to share with the scientific community.
9. Multidimensional Data Sets
Cells, Tissues, & Diseases + Functional Annotations + Big Data
Goal: Improve disease prevention, diagnosis, prognosis, and treatment efficacy
Image from encodeproject.org
10. Case study: The Cancer Genome Atlas
• Multiple data types for 11,000+ patients across 33 tumor types
• 549,625 files with 2,000+ metadata attributes
• >2.5 petabytes of data
14. Multidimensional Data Sets
Cells, Tissues, & Diseases + Functional Annotations
• We have lots of data and complex problems
• We want to make data-driven predictions and need to automate model building
Image from encodeproject.org and xorlogics.com.
15. Multidimensional Data Sets
Complex problems + Big Data → Machine Learning!
Image from encodeproject.org and xorlogics.com.
16. Machine Learning
• A data analysis method that automates analytical model building
• Makes data-driven predictions or discovers patterns without explicit human intervention
• Useful when you have complex problems and lots of data ('big data')

Traditional Programming: Data ([2,3]) + Program (+) → Computer → Output (5)
Machine Learning: Data ([2,3]) + Output (5) → Computer → Program (+)

• Our goal isn't to make perfect guesses, but to make useful guesses: we want to build a model that is useful for the future
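The [2,3] → 5 example can be sketched in code (the talk itself uses R; this Python/scikit-learn translation is for illustration only): instead of writing the program (+), we hand the computer inputs and outputs and let it learn the program as model weights.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: we supply the program (+) and compute the output.
assert 2 + 3 == 5

# Machine learning: we supply data and outputs, and the computer learns the
# "program" -- here, weights close to [1, 1], i.e., addition.
X = np.array([[1, 1], [2, 3], [4, 2], [0, 5], [3, 3]])
y = X.sum(axis=1)  # the observed outputs

model = LinearRegression().fit(X, y)
print(model.coef_)              # the learned "program": weights near [1, 1]
print(model.predict([[2, 3]]))  # applying it to the slide's example input
```

The learned model is not a perfect symbolic "+", just a useful approximation of it, which is exactly the point of the goal bullet above.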
22. The R Programming Language
• Open-source implementation of S, which was originally developed at Bell Labs
• Free programming language and software environment for advanced statistical computing and graphics
• Functional programming language written primarily in C and Fortran
• Good at data manipulation, modeling and computing, and data visualization
• Cross-platform compatible
• Vast community (e.g., CRAN, R-bloggers, Bioconductor)
• Over 10,000 packages, including parallel/high-performance computing packages
• Used extensively by statisticians and academics
• Popularity has increased substantially in recent years
• Drawbacks: the learning curve can be steep (better recently), limited GUI (RStudio!), documentation can be sparse, memory allocation can be an issue
24. Iris Dataset in R
Fisher's/Anderson's iris data set: measurements (cm) of the sepal length and width and petal length and width (4 features) for 50 flowers from each of 3 species (Iris setosa, versicolor, and virginica)
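In R the data set ships as the built-in `iris` data frame; an equivalent way to load it in Python (an illustrative substitution, since the talk uses R) is via scikit-learn:

```python
from sklearn.datasets import load_iris

iris = load_iris()
# 150 flowers x 4 features: sepal length/width and petal length/width, in cm
print(iris.data.shape)     # (150, 4)
print(iris.feature_names)  # the 4 measurement names
print(iris.target_names)   # the 3 species: setosa, versicolor, virginica
```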
28. Iris Data: Adding Regularization (LASSO)
• Model building with a large number of features for a moderate number of samples can result in 'overfitting': the model is too specific to the training set and not generalizable enough for accurate predictions with new data
• Regularization is a technique for preventing this by introducing tuning parameters that penalize the coefficients of variables that are linearly dependent (redundant)
• This results in FEATURE SELECTION
• Ridge regression and LASSO regression are methods of regression with regularization

LASSO shrinks the coefficients of redundant features to zero:
Petal.Width ~ A*Petal.Length + B*Sepal.Width + C*Sepal.Length + b
→ Petal.Width ~ 0*Petal.Length + 0*Sepal.Width + C*Sepal.Length + b
→ Petal.Width ~ Sepal.Length + b
Fitted model: Petal.Width ~ 0.968*Sepal.Length + 0.187 (p-value < 2.2×10^-16)
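A minimal sketch of this feature-selection effect in Python/scikit-learn (the talk works in R; the penalty strengths `alpha=0.001` and `alpha=0.5` below are illustrative assumptions, not values from the slides): as the L1 penalty grows, coefficients of redundant predictors of Petal.Width are driven to exactly zero.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Lasso

iris = load_iris()
X = iris.data[:, :3]  # sepal length, sepal width, petal length
y = iris.data[:, 3]   # petal width (the response)

# Weak penalty: all three predictors keep nonzero coefficients.
weak = Lasso(alpha=0.001).fit(X, y)

# Stronger penalty: redundant predictors are zeroed out entirely --
# the FEATURE SELECTION the slide describes.
strong = Lasso(alpha=0.5).fit(X, y)

print(weak.coef_)
print(strong.coef_)
print((strong.coef_ == 0).sum())  # how many features were dropped
```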
29. Iris Data: Decision Trees
• Decision trees can take different data types (categorical, binary, numeric) as input/output variables, handle missing data and outliers well, and are intuitive
• Decision tree limitations: each decision boundary at each split is a concrete binary decision, and the decision criteria consider only one input feature at a time (not a combination of multiple input features)
• Examples: video games, clinical decision models

Fitted tree:
Petal.Length < 2.35 cm → Setosa (40/0/0)
else: Petal.Width < 1.65 cm → Versicolor (0/40/12), else Virginica (0/0/28)
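A depth-2 tree like the slide's can be fit in a few lines (sketched in Python/scikit-learn rather than the talk's R; the exact split thresholds learned here may differ slightly from the slide's 2.35 cm and 1.65 cm, which came from its own training split):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 mirrors the slide: one petal split, then one more.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each split tests a single feature -- the limitation noted above.
print(export_text(tree, feature_names=iris.feature_names))
print(tree.score(iris.data, iris.target))  # training accuracy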
31. Iris Data: Neural Nets
• Neural networks (NNs) emulate how the human brain works, with a network of interconnected neurons (essentially logistic regression units) organized in multiple layers, allowing more complex, abstract, and subtle decisions
• Lots of tuning parameters (number of hidden layers, number of neurons in each layer, and multiple ways to tune learning)
• Learning is an iterative feedback mechanism in which training-data error is used to adjust the corresponding input weights, propagated back to previous layers (i.e., back-propagation)
• NNs are good at learning non-linear functions and can handle multiple outputs, but they have a long training time and models are susceptible to local-minimum traps (this can be mitigated by doing multiple rounds, which takes more time!)

'Neuron' diagram: inputs X1, X2 → summation of inputs and activation with a sigmoid function → output
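A small network on iris can be sketched in Python/scikit-learn (the talk uses R; the layer size, iteration count, and seed below are arbitrary tuning choices of exactly the kind the slide warns about):

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # scaling helps convergence

# One hidden layer of 8 'neurons'; the logistic activation matches the
# sigmoid unit in the slide's neuron diagram. Training iterates with
# back-propagation until the loss stops improving.
nn = MLPClassifier(hidden_layer_sizes=(8,), activation='logistic',
                   max_iter=5000, random_state=0)
nn.fit(X, iris.target)
print(nn.score(X, iris.target))  # training accuracy
```

Re-fitting with different `random_state` seeds is one cheap version of the "multiple rounds" mitigation for local minima mentioned above.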
32. Other Machine Learning Methods
• Naive Bayes (based on prior probabilities)
• Hidden Markov Models (Bayesian network with hidden states)
• K Nearest Neighbors (instance-based learning: clustering!)
• Support Vector Machines (discriminator defined by a separating hyperplane)
• Additional ensemble method approaches (combining multiple models)
• And new methods coming out all the time…

Workflow: Raw Data → Clean/Normalize Data → split into Training Set and Test Set → Build Model → Tune Model → Test → Apply to New Data (validation cohort or model application)
Algorithm Selection is an Important Step!
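The workflow above (clean/normalize → train/test split → build → test) might look like this in Python/scikit-learn; the 80/20 split and the choice of K Nearest Neighbors as the algorithm are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Raw data
iris = load_iris()

# Split: the test set is held out and never touched during model building
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Clean/normalize + build model, chained so the scaler is fit on
# training data only (avoiding leakage into the test set)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# Test: accuracy on unseen data estimates how the model will generalize
print(model.score(X_test, y_test))
```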
35. Integrating genomic data with machine learning to improve predictive modeling
1) Cross-Cancer Patient Outcome Prediction Model
2) Improved Kidney Cancer Patient Outcome Prediction Model
36. 'Common Survival Genes' across 19 cancers
• 'Common Survival Genes': Cox regression uncorrected p-value < 0.05 for a gene in at least 9/19 cancers
• 84 genes, enriched for proliferation-related processes including mitosis, cell and nuclear division, and spindle formation
• Clustering by Cox regression p-values yields 7 'Proliferative Informative Cancers' and 12 'Non-Proliferative Informative Cancers'
[Heatmap: top cross-cancer survival genes, colored by scaled -log10 Cox p-value; rows are the 19 cancer types ESCA, STAD, OV, LUSC, GBM, LAML, LIHC, SARC, BLCA, CESC, HNSC, BRCA, ACC, MESO, KIRP, LUAD, PAAD, LGG, KIRC]
Ramaker & Lasseigne, et al. 2017.
37. 'Common Survival Genes' across 19 cancers (build of slide 36): the 7 Proliferative Informative Cancers (PICs) are highlighted on the heatmap.
Ramaker & Lasseigne, et al. 2017.
38. 'Common Survival Genes' across 19 cancers (build of slide 36): the 12 Non-Proliferative Informative Cancers (Non-PICs) are highlighted on the heatmap.
Ramaker & Lasseigne, et al. 2017.
40. Integrating genomic data with machine learning to improve predictive modeling
1) Cross-Cancer Patient Outcome Prediction Model
2) Improved Kidney Cancer Patient Survival Prediction Model
41. TCGA Kidney Renal Cell Carcinoma (KIRC) Data Set
• 291 tumor samples with clinical, RNA-seq, DNAm, and CNV data available (~1/3 of patients died from disease)
42. Can we improve clinically relevant phenotype prediction with multi-omics classifiers?
Clinically annotated multidimensional data sets: RNA expression, protein expression, microRNA expression, mutations, DNAm, CNV
Approach: Cox regression with LASSO feature selection, combining RNA, DNAm, and CNV
45. Multi-omic classifiers to predict patient outcome
• An RNA+DNAm+CNV model of patient survival outperformed each data type alone or with another single data type, as well as models built on features before dimension reduction
• Synergistic effect from combining RNA, DNAm, and CNV into combined features for prediction of patient outcome
• Some principal components were strongly correlated with CIN or DNAmIN status

Model         | Test AUC
CNV           | <0.5
RNA           | 0.5683
DNAm          | 0.6794
DNAm+CNV      | 0.6571
RNA+CNV       | 0.6730
RNA+DNAm      | 0.7397
RNA+DNAm+CNV  | 0.7619