Stephen Friend Complex Traits: Genomics and Computational Approaches 2012-02-23

If the physicists do it, the software engineers do it,
Why can’t we do it?:

Moving beyond linear investigations
Both of the science and of how we work

Integrating layers of omics data models and building
using compute spaces capable of enabling models
to be evolved by teams of teams

Stephen Friend MD PhD

Sage Bionetworks (Non-Profit Organization)
Seattle/ Beijing/ Amsterdam
February 23, 2012

So what is the problem?

Most approved therapies were assumed to be monotherapies for diseases representing homogeneous populations

Our existing disease models often assume pathway knowledge sufficient to infer correct therapies

The value of appropriate representations/ maps

“Data Intensive” Science- Fourth Scientific Paradigm

Equipment capable of generating
massive amounts of data

IT Interoperability

Open Information System

Host evolving computational models
in a “Compute Space”

WHY NOT USE “DATA INTENSIVE” SCIENCE TO BUILD BETTER DISEASE MAPS?

what will it take to understand disease?

DNA, RNA, PROTEIN (dark matter)

MOVING BEYOND ALTERED COMPONENT LISTS

2002 Can one build a “causal” model?

Preliminary Probabilistic Models - Rosetta / Schadt

Networks facilitate direct
identification of genes that are
causal for disease
Evolutionarily tolerated weak spots

Gene symbol | Gene name | Variance of OFPM explained by gene expression* | Mouse model | Source
Zfp90 | Zinc finger protein 90 | 68% | tg | Constructed using BAC transgenics
Gas7 | Growth arrest specific 7 | 68% | tg | Constructed using BAC transgenics
Gpx3 | Glutathione peroxidase 3 | 61% | tg | Provided by Prof. Oleg Mirochnitchenko (University of Medicine and Dentistry at New Jersey, NJ) [12]
Lactb | Lactamase beta | 52% | tg | Constructed using BAC transgenics
Me1 | Malic enzyme 1 | 52% | ko | Naturally occurring KO
Gyk | Glycerol kinase | 46% | ko | Provided by Dr. Katrina Dipple (UCLA) [13]
Lpl | Lipoprotein lipase | 46% | ko | Provided by Dr. Ira Goldberg (Columbia University, NY) [11]
C3ar1 | Complement component 3a receptor 1 | 46% | ko | Purchased from Deltagen, CA
Tgfbr2 | Transforming growth factor beta receptor 2 | 39% | ko | Purchased from Deltagen, CA

Nat Genet (2005) 205:370

DIVERSE POWERFUL USE OF MODELS AND NETWORKS

List of Influential Papers in Network Modeling

  50 network papers
  http://sagebase.org/research/resources.php

“Data Intensive” Science- Fourth Scientific Paradigm
Score Card for Medical Sciences

Equipment capable of generating
massive amounts of data A-

IT Interoperability D

Open Information System D-

Host evolving computational models
in a “Compute Space” F

We still consider much clinical research as if we were
hunter-gatherers - not sharing.

TENURE

FEUDAL STATES

Clinical/genomic data
are accessible but minimally usable

Little incentive to annotate and curate
data for other scientists to use

Mathematical
models of disease
are not built to be
reproduced or
versioned by others

Lack of standard forms for future rights and consents

Sage Mission
Sage Bionetworks is a non-profit organization with a vision to
create a commons where integrative bionetworks are evolved by
contributor scientists with a shared vision to accelerate the
elimination of human disease

Building Disease Maps Data Repository

Commons Pilots Discovery Platform
Sagebase.org

Sage Bionetworks Collaborators

  Pharma Partners
  Merck, Pfizer, Takeda, Astra Zeneca,
Amgen, Johnson & Johnson
  Foundations
  Kauffman CHDI, Gates Foundation

  Government
  NIH, LSDF, NCI

  Academic
  Levy (Framingham)
  Rosengren (Lund)
  Krauss (CHORI)

  Federation
  Ideker, Califano, Nolan, Schadt

JUN ZHU
Model of Breast Cancer: Co-expression

Figure panels:
A) Miller, 159 samples
B) Christos, 189 samples
C) NKI, 295 samples
D) Wang, 286 samples
E) Super modules: Cell cycle, Pre-mRNA, ECM, Blood vessel, Immune response

Data sources:
NKI: N Engl J Med. 2002 Dec 19;347(25):1999.
Wang: Lancet. 2005 Feb 19-25;365(9460):671.
Miller: Breast Cancer Res. 2005;7(6):R953.
Christos: J Natl Cancer Inst. 2006 15;98(4):262.

Zhang B et al., Towards a global picture of breast cancer (manuscript).

CHRIS GAITERI - ALZHEIMER’S

What is this?
Bayesian networks enriched in inflammation genes correlated with disease severity in pre-frontal cortex of 250 Alzheimer’s patients.

What does it mean?
Inflammation in AD is an interactive multi-pathway system. More broadly, network structure organizes complex disease effects into coherent sub-systems and can prioritize key genes.

Are you joking?
Gene validation shows novel key drivers increase Abeta uptake and decrease neurite length through an ROS burst. (highly relevant to AD pathology)

ELIAS NETO
Causal Model Selection Hypothesis Tests in Systems Genetics
Elias Chaibub Neto (1), Aimee T. Broman (2), Mark P. Keller (2), Alan D. Attie (2), Bin Zhang (1), Jun Zhu (1), Brian S. Yandell (2)
(1) Sage Bionetworks, Seattle, WA USA; (2) University of Wisconsin-Madison, Madison, WI USA

Abstract
Current efforts in systems genetics have focused on the development of statistical approaches aiming to disentangle causal relationships among molecular phenotypes in segregating populations. Model selection criteria, such as the AIC and BIC, have been widely used for this purpose, in spite of being unable to quantify the uncertainty associated with the model selection call. Here we propose three novel hypothesis tests to perform model selection among models representing distinct causal relationships. We focus on models composed of pairs of phenotypes and use their common QTL to determine which phenotype has a causal effect on the other, or whether the phenotypes are not causally related, and are only statistically associated. Our hypothesis tests are fully analytical and avoid the use of computationally expensive permutation or re-sampling strategies. They adapt and extend Vuong's (and Clarke’s) model selection test to the comparison of four possibly misspecified models, handling the full range of possible causal relationships among a pair of phenotypes. We evaluate the performance of our tests against the AIC, BIC and a published causality inference test in simulation studies. Furthermore, we compare the precision of the causal predictions made by the methods using biologically validated causal relationships extracted from a database of 247 knockout experiments in yeast. Overall, our model selection hypothesis tests achieve higher precision than the alternative methods at the expense of reduced statistical power.

Pairwise Causal Models
Given a pair of phenotypes, Y1 and Y2, that co-map to the same quantitative trait loci, Q, we consider the following models:
[Figure: the four pairwise causal models M1 to M4]

Vuong’s Model Selection Test
Vuong's test derives from the Kullback-Leibler Information Criterion (KLIC).
Let $h_0(y \mid x)$ represent the true model.
Consider the parametric family of conditional models $\{ f(y \mid x; \phi) : \phi \in \Phi \}$. Then

$\mathrm{KLIC}(h_0, f) = E_0[\log h_0(y \mid x)] - E_0[\log f(y \mid x; \phi^*)]$,

where the expectation $E_0$ is computed w.r.t. $h_0(y, x)$, and $\phi^*$ is the parameter value that minimizes $\mathrm{KLIC}(h_0, f)$.

Consider two models: $f_1 \equiv f_1(y \mid x; \phi_1^*)$ and $f_2 \equiv f_2(y \mid x; \phi_2^*)$.
Model $f_1$ is a better approximation of $h_0$ than $f_2$ if and only if

$\mathrm{KLIC}(h_0, f_1) < \mathrm{KLIC}(h_0, f_2) \iff E_0[\log f_1] > E_0[\log f_2]$.

Let $LR_{12} = \log f_1 - \log f_2$. Then we test

$H_0: E_0[LR_{12}] = 0$, $H_1: E_0[LR_{12}] > 0$, $H_2: E_0[LR_{12}] < 0$.

The quantity $E_0[LR_{12}]$ is unknown, but the sample mean and variance of

$\widehat{LR}_{12,i} = \log \hat f_{1,i} - \log \hat f_{2,i}$, with $\hat f_{1,i} \equiv f_1(y_i \mid x_i; \hat\phi_1)$ and $\hat\phi_1 \equiv$ ML estimate of $\phi_1$,

converge a.s. to $E_0[LR_{12}]$ and $\mathrm{Var}_0[LR_{12}] = \sigma_{12.12}$.

Let $\widehat{LR}_{12} = \sum_i \widehat{LR}_{12,i}$. Then under $H_0$

$(n \hat\sigma_{12.12})^{-1/2} \, \widehat{LR}_{12} \to_d N(0, 1)$.

If different models have different dimensions we consider

$\widehat{LR}^*_{12} = \widehat{LR}_{12} - D_{12}$,

where $D_{12}$ represents a difference of AIC or BIC penalties, and adopt the test statistic

$Z_{12} = (n \hat\sigma_{12.12})^{-1/2} \, \widehat{LR}^*_{12}$.

Clarke’s Model Selection Test
Represents a non-parametric version of Vuong’s test.
Vuong’s null: the mean log-likelihood ratio is 0.
Clarke’s null: the median log-likelihood ratio is 0.
Paired sign test on log-likelihood scores:
Scores: $(\widehat{LR}_{12,1}, \widehat{LR}_{12,2}, \widehat{LR}_{12,3}, \widehat{LR}_{12,4}, \widehat{LR}_{12,5}, \ldots, \widehat{LR}_{12,n})$
Signs: $(+, -, +, +, -, \ldots, +)$
Let $T_{12}$ = {# of positive signs}. Then under Clarke’s null

$T_{12} \sim \mathrm{Binomial}(n, 1/2)$.

Causal Model Selection Tests (CMST)
In our applications we consider four models: M1, M2, M3 and M4.
We derive intersection-union tests based on six separate Vuong (Clarke) tests:
f1 vs f2, f1 vs f3, f1 vs f4, f2 vs f3, f2 vs f4, f3 vs f4.
We propose three distinct CMST tests: (1) parametric, (2) non-parametric, and (3) joint-parametric CMST tests.

Parametric CMST:
H0: model M1 is not closer to the true model than M2, M3 or M4.
H1: model M1 is closer to the true model than M2, M3 and M4.

$H_0: \{E_0[LR_{12}] = 0\} \cup \{E_0[LR_{13}] = 0\} \cup \{E_0[LR_{14}] = 0\}$
$H_1: \{E_0[LR_{12}] > 0\} \cap \{E_0[LR_{13}] > 0\} \cap \{E_0[LR_{14}] > 0\}$

The rejection region and p-value for this IU-test are given by:

$\min\{z_{12}, z_{13}, z_{14}\} > c_\alpha$, $\quad p_1 = \max\{p_{12}, p_{13}, p_{14}\}$.

Non-parametric CMST:
Analogous to the parametric CMST. Just replace Vuong’s by Clarke’s tests.

Joint parametric CMST:
Simple application of Vuong tests overlooks the dependency among the test statistics.
Let $S_1$ represent the sample covariance matrix of $\widehat{LR}_{12,i}$, $\widehat{LR}_{13,i}$ and $\widehat{LR}_{14,i}$.
Under regularity conditions we have that $S_1$ converges a.s. to $\Sigma_1$.
It follows from the MCT and Slutsky’s theorem that when

$(E_0[LR_{12}], E_0[LR_{13}], E_0[LR_{14}])^T = (0, 0, 0)^T$

we have that

$Z_1 = n^{-1/2} \, \mathrm{diag}(S_1)^{-1/2} \, \widehat{LR}_1 \to_d N_3(0, \rho_1)$,

where $\widehat{LR}_1 = (\widehat{LR}_{12}, \widehat{LR}_{13}, \widehat{LR}_{14})^T$ and $\rho_1 = \mathrm{diag}(S_1)^{-1/2} \, \Sigma_1 \, \mathrm{diag}(S_1)^{-1/2}$.

We consider the hypotheses

$H_0: \min\{E_0[LR_{12}], E_0[LR_{13}], E_0[LR_{14}]\} \le 0$
$H_1: \min\{E_0[LR_{12}], E_0[LR_{13}], E_0[LR_{14}]\} > 0$

and adopt the test statistic $W_1 = \min\{Z_1\}$. The p-value is computed as

$P(W_1 \ge w_1) = P(Z_{12} \ge w_1, Z_{13} \ge w_1, Z_{14} \ge w_1)$.

Simulation Study
We conducted a simulation study generating data from the models in the figure below.
The results are shown below:
[Figures: simulated models and simulation results]

Yeast Data Analysis
We analyzed the yeast genetical genomics data set from Brem and Kruglyak (2005).
We evaluated the precision of the causal predictions made by the methods using validated causal relationships extracted from a database of 247 knock-out experiments (Hughes 2000, Zhu 2008).
In total, 46 of the ko-genes showed significant eQTLs, and we tested a total of 4,928 ko-gene/putative target gene relations.

Conclusions
Advantages of the Causal Model Selection Tests:
1- Fully analytical hypothesis tests that avoid the use of computationally expensive permutation or re-sampling techniques.
2- Achieve better controlled type I error rates.
3- Achieve higher precision rates.
Main disadvantage: lower statistical power.
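The parametric CMST described on the poster can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the per-observation log-likelihoods of each fitted model are already available as plain lists, and the function names (`normal_sf`, `vuong_z`, `parametric_cmst_pvalue`) are hypothetical. It computes the pairwise Vuong statistic Z12 and the intersection-union p-value p1 = max{p12, p13, p14}.

```python
import math
from statistics import pvariance


def normal_sf(z):
    """Upper-tail probability P(Z >= z) of a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def vuong_z(loglik_1, loglik_2, penalty_diff=0.0):
    """Vuong statistic Z12 for model 1 versus model 2.

    loglik_1, loglik_2: per-observation log-likelihoods at the ML estimates.
    penalty_diff: D12, a difference of AIC or BIC penalties (0 when the
    two models have the same dimension).
    Raises ZeroDivisionError if all log-likelihood ratios are identical.
    """
    lr = [a - b for a, b in zip(loglik_1, loglik_2)]
    n = len(lr)
    sigma2 = pvariance(lr)  # sample variance of the LR_{12,i}
    return (sum(lr) - penalty_diff) / math.sqrt(n * sigma2)


def parametric_cmst_pvalue(loglik_m1, rival_logliks, penalty_diffs=None):
    """Intersection-union p-value: the largest pairwise one-sided p-value."""
    if penalty_diffs is None:
        penalty_diffs = [0.0] * len(rival_logliks)
    pairwise = [
        normal_sf(vuong_z(loglik_m1, rival, d))
        for rival, d in zip(rival_logliks, penalty_diffs)
    ]
    return max(pairwise)
```

Taking the maximum of the pairwise p-values is what makes the test conservative: M1 is only declared closest to the true model when it beats every rival, which is consistent with the poster's note about reduced power.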

ELIAS NETO
Causal Model Selection Hypothesis Tests in Systems Genetics

The Schadt et al. (2005) approach was based on
penalized likelihood model selection,
where we simply select the model with the best
score.

The proposed hypothesis test allows us to attach
a p-value to the selected model and, in this way,
allows the quantification of the uncertainty
associated with the model selection call.

The proposed tests are fully analytical and avoid
computationally expensive permutation and re-
sampling techniques.
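To make "attaching a p-value to the selected model" concrete, here is a minimal sketch of the non-parametric variant, Clarke's paired sign test on per-observation log-likelihood differences. It is an illustration under stated assumptions rather than the authors' code: the function name `clarke_sign_test` is hypothetical, ties are simply counted as non-positive, and only the one-sided alternative (model 1 better than model 2) is computed.

```python
from math import comb


def clarke_sign_test(loglik_1, loglik_2):
    """Return (T12, p) for Clarke's paired sign test.

    T12 is the number of observations where model 1 has the higher
    log-likelihood. Under Clarke's null (median log-likelihood ratio
    is 0), T12 ~ Binomial(n, 1/2), so p = P(T >= T12).
    """
    diffs = [a - b for a, b in zip(loglik_1, loglik_2)]
    n = len(diffs)
    t12 = sum(1 for d in diffs if d > 0)
    # Exact upper tail of Binomial(n, 1/2); no permutation needed.
    p = sum(comb(n, k) for k in range(t12, n + 1)) / 2 ** n
    return t12, p
```

Because the binomial tail is computed exactly, the test stays fully analytical, in line with the claim that no permutation or re-sampling step is required.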

ZHI WANG

A multi-tissue immune-driven theory of weight loss

[Figure labels: Hypothalamus (Leptin signaling); Fatty acids; Macrophage/inflammation; Liver; Adipose; M1 macrophage; Phagocytosis-induced lipolysis]

PLATFORM
Sage Platform and Infrastructure Builders-
( Academic Biotech and Industry IT Partners...)

PILOTS= PROJECTS FOR COMMONS
Data Sharing Commons Pilots-
(Federation, CCSB, Inspire2Live....)
[Figure labels: PLATFORM; NEW MAPS; RULES GOVERN]

Why not share clinical/genomic data and model building in the
ways currently used by the software industry
(power of tracking workflows and versioning)?
Leveraging Existing Technologies

Addama

Taverna
tranSMART

sage bionetworks synapse project
Watch What I Do, Not What I Say

Reduce, Reuse, Recycle

Most of the People You Need to Work with Don’t Work with You

My Other Computer is Cloudera Amazon Google

Sage Metagenomics Project

Processed Data
(S3)

•  > 10k genomic and expression standardized datasets indexed in SCR
•  Error detection, normalization in mG
•  Access raw or processed data via download or API in downstream analysis
•  Building towards open, continuous community curation

Sage Metagenomics using Amazon Simple Workflow

Full case study at http://aws.amazon.com/swf/testimonials/swfsagebio/

Amazon SWF and Synapse

Amazon SWF
•  Maintains state of analysis
•  Tracks step execution
•  Logs workflow history
•  Dispatches work to Amazon or remote worker nodes
•  Efficiently matches job size to hardware
•  Provides error handling and recovery

Synapse
•  Hosts raw and processed data for further reuse in public or private projects
•  Provides visibility into intermediate results and algorithmic details
•  Allows programmatic access to data; integration with R
•  Provides standard terminologies for annotations
•  Search across data sets
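The workflow responsibilities listed for this slide (maintaining state, tracking step execution, logging history, error handling and recovery) can be illustrated with a toy in-memory tracker. This is only a conceptual sketch: `WorkflowRun` and `run_step` are hypothetical names, and nothing here uses or mirrors the actual Amazon SWF or Synapse APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class WorkflowRun:
    """Toy record of one analysis run (hypothetical, not the SWF API)."""

    state: Dict[str, object] = field(default_factory=dict)
    history: List[Tuple[str, str]] = field(default_factory=list)

    def run_step(self, name: str, func: Callable[[dict], object], retries: int = 1):
        """Run func(state), store its result, log the outcome, retry on failure."""
        for _attempt in range(retries + 1):
            try:
                self.state[name] = func(self.state)
                self.history.append((name, "completed"))
                return self.state[name]
            except Exception as exc:
                # Keep the failure in the history so the run stays auditable.
                self.history.append((name, f"failed: {exc}"))
        raise RuntimeError(f"step {name!r} failed after {retries + 1} attempts")


run = WorkflowRun()
run.run_step("load", lambda state: [3.0, 1.0, 2.0])
run.run_step("normalize", lambda state: [x - min(state["load"]) for x in state["load"]])
```

After these two calls, `run.state` holds the intermediate result of every step and `run.history` records the order in which they completed, which is the kind of visibility into intermediate results the slide attributes to the platform.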

Synapse Roadmap

Timeline: Internal Alpha, Public Beta Testing, Synapse 1.0, Synapse 1.5, Future
(Q1-2012 Q2-2012 Q3-2012 Q4-2012 Q1-2013 Q2-2013 Q3-2013 Q4-2013)

Synapse Platform Functionality
•  Data Repository
•  Projects and security
•  R integration
•  Analysis provenance
•  Publishing figures
•  Search
•  Controlled Vocabularies
•  Governance of restricted data
•  Workflow templates
•  Wiki & collaboration tools
•  Integrated management of cloud resources
•  Social networking
•  User-customized dashboards
•  R Studio integration
•  Curation tool integration

Data / Analysis Capabilities
•  TCGA
•  METABRIC breast cancer challenge
•  40+ manually curated clinical studies
•  8000+ GEO / Array Express datasets
•  Clinical, genomic, compound sensitivity
•  Bioconductor and custom R analysis
•  Predictive modeling workflows
•  Automated processing of common genomics platforms
•  TBD: Integrations with other visualization and analysis packages

INTEROPERABILITY
[Figure labels: SYNAPSE; Genome Pattern; CYTOSCAPE; tranSMART; I2B2]

Now accepting submissions

Editor-in-Chief: Eric Schadt (USA)

Open Network Biology is an open access journal that publishes articles relating to predictive, network-based models of living systems linked to the corresponding coherent data sets upon which the models are based. In addition to articles describing these large data sets, the journal also welcomes submissions of original research, software and methods, along with reviews and commentary, relevant to the emerging field of network biology.

Submit your manuscript and benefit from:
•  High visibility for articles through unrestricted online access
•  Free article redistribution under a Creative Commons attribution license
•  No limits on article length, additional files, colour figures or movies
•  Rapid, immediate open access publication on acceptance
•  An integrated repository for network model data and code

www.opennetworkbiology.com

Five Pilots involving Sage Bionetworks
•  CTCAP
•  Arch2POCM
•  The Federation
•  Portable Legal Consent
•  Sage Congress Project

[Figure labels: PLATFORM; NEW MAPS; RULES GOVERN]

Clinical Trial Comparator Arm
Partnership (CTCAP)
  Description: Collate, Annotate, Curate and Host Clinical Trial Data
with Genomic Information from the Comparator Arms of Industry and
Foundation Sponsored Clinical Trials: Building a Site for Sharing
Data and Models to evolve better Disease Maps.
  Public-Private Partnership of leading pharmaceutical companies,
clinical trial groups and researchers.
  Neutral Conveners: Sage Bionetworks and Genetic Alliance
[nonprofits].
  Initiative to share existing trial data (molecular and clinical) from
non-proprietary comparator and placebo arms to create powerful
new tool for drug development.

Started Sept 2010

Sharing clinical/genomic data and analysis will
maximize clinical impact and enable discovery

•  Graphic of curated to QCed to models

Arch2POCM

Restructuring the Precompetitive Space for Drug Discovery

How to potentially De-Risk High-Risk Therapeutic Areas

Arch2POCM: scale and scope
•  Proposed Goal: Initiate 2 programs. One for Oncology/Epigenetics/
Immunology. One for Neuroscience/Schizophrenia/Autism. Both
programs will have 8 drug discovery projects (targets) - ramped up
over a period of 2 years

–  It is envisioned that Arch2POCM’s funding partners will select targets
that are judged as slightly too risky to be pursued at the top of pharma’s
portfolio, but that have significant scientific potential that could benefit
from Arch2POCM’s crowdsourcing effort

•  These will be executed over a period of 5 years making a total of 16
drug discovery projects

–  Projected pipeline attrition by Year 5 (assuming 12 targets loaded in
early discovery)
•  30% will enter Phase 1
•  20% will deliver Ph 2 POCM data

How can we accelerate the pace of scientific discovery?
2008
2009
2010
2011

Ways to move beyond
“traditional” collaborations?

Intra-lab vs Inter-lab
Communication

Colrain/ Industrial PPPs Academic
Unions

sage federation:
model of biological age

[Figure: Predicted Age (liver expression) plotted against Chronological Age (years), with regions labeled Faster Aging and Slower Aging]

Age Differential
•  Clinical Association: Gender, BMI, Disease
•  Genotype Association
•  Gene Pathway Expression
Reproducible science == shareable science

Sweave: combines programmatic analysis with narrative

Dynamic generation of statistical reports
using literate data analysis

Friedrich Leisch. Sweave: Dynamic generation of statistical reports
using literate data analysis. In Wolfgang Härdle and Bernd Rönz, editors, Compstat 2002 –
Proceedings in Computational Statistics, pages 575-580.
Physica Verlag, Heidelberg, 2002. ISBN 3-7908-1517-9

Federated Aging Project: Combining analysis + narrative = Sweave Vignette

[Figure labels: R code + narrative; Sweave Vignette outputs: PDF (plots + text + code snippets), HTML, Data objects; Sage Lab, Califano Lab, Ideker Lab; Shared Data Repository; JIRA: Source code repository & wiki; Submitted Paper]

For 11/12 compounds, the #1 predictive feature in an unbiased
analysis corresponds to the known stratifier of sensitivity
Can the approach make new discoveries?

[Figure: per-compound panels showing the top-ranked predictive features (#1 to #3). Features shown: CML lineage, EGFR mut, ERBB2 expr, BRAF mut, HGF expr, NRAS mut, KRAS mut, TP53 mut, CDKN2A copy, MDM2 expr]
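The claim that an unbiased analysis ranks the known stratifier first can be illustrated with a deliberately simple stand-in: score each molecular feature by the absolute Pearson correlation between its values and the drug response across cell lines, then sort. This is not the method used in the talk, which is not specified on the slide; the function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom else 0.0


def rank_features(features, response):
    """Sort feature names by |correlation| with the response, best first.

    features: dict mapping feature name -> list of values per cell line.
    response: list of drug sensitivity values, same cell-line order.
    """
    scores = {name: abs(pearson(vals, response)) for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data for illustration only: four cell lines, two binary features.
toy_features = {"EGFR_mut": [1, 1, 0, 0], "KRAS_mut": [1, 0, 1, 0]}
toy_response = [0.9, 0.8, 0.1, 0.2]
ranking = rank_features(toy_features, toy_response)
```

In the toy data the response tracks the first feature, so it is ranked #1; comparing such a ranking against a known stratifier is the sanity check the slide reports for 11 of 12 compounds.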

Presentation outline

1) Predicting drug response from cancer cell lines
2) Future approaches: network-based predictors and multi-task learning
3) Standardized workflows for data management, versioning and method comparison

Cancer cell line encyclopedia
Molecular characterization (1,000 cell lines)
Currently
  mRNA
  copy number
  somatic mutations (36 cancer-related genes)
In progress
  targeted exon sequencing
  epigenetics
  microRNA
  lncRNA
  phospho-tyrosine kinase
  metabolites
Viability screens (500 cell lines, 24 compounds)
Small molecule screen

Network / pathway prior information (Vaske, et al.)

TCGA / ICGC
Molecular characterization (50 tumor types)
  genomics
  transcriptomics
  epigenetics
Clinical data

Transfer learning

Predictive model (Vaske, et al.)

1)  Data management APIs to load standardized objects, e.g. R ExpressionSets (Matt Furia):

    ccleFeatureData <- getEntity(ccleFeatureDataId)
    ccleResponseData <- getEntity(ccleResponseDataId)
    tcgaFeatureData <- getEntity(tcgaFeatureDataId)
    tcgaResponseData <- getEntity(tcgaResponseDataId)

2)  Automated, standardized workflows for curation and QC of large-scale datasets (Brig Mecham).
    A. TCGA: Automated cloud-based processing.
    B. GEO / Array Express: Normalization workflows, curation of phenotype using standard ontologies.
    C. Additional studies with genetic and phenotypic data in Sage repository (e.g. CCLE and Sanger cell line datasets)

    Observed Data = Systematic Variation + Random Variation
    Normalization: Remove the influence of adjustment variables on data...

3)  Pluggable API to implement predictive modeling algorithms.
    A)  Support for all commonly used machine learning methods (for automated benchmarking against new methods)
    B)  Pluggable custom methods as R classes implementing customTrain() and customPredict() methods.
        A)  Can be arbitrarily complex (e.g. pathway and other priors)
        B)  Support for parallelization in for each loops.

4)  Statistical performance assessment across models.
    [Figure labels: custom model 1, custom model 2, custom model N]

5)  Output of candidate biomarkers and feature evaluation (e.g. GSEA, pathway analysis)
    [Figure labels: custom model 1, custom model 2, custom model N]

6)  Experimental follow-up on top predictions (TBD)
    E.g. for cell lines: medium throughput suppressor / enhancer screens of drug sensitivity for knockdown / overexpression of predicted biomarkers.
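The pluggable-model idea in steps 3 and 4 (custom methods exposing customTrain() and customPredict(), then compared on equal terms) can be sketched as follows. The talk describes R classes; this is a Python analog for illustration only, with hypothetical class names, snake_case method names, and a simple interleaved k-fold split standing in for whatever benchmarking procedure was actually used.

```python
from statistics import mean


class MeanModel:
    """Baseline model: always predicts the training mean of the response."""

    def custom_train(self, X, y):
        self.mu = mean(y)
        return self

    def custom_predict(self, X):
        return [self.mu for _ in X]


class OneFeatureLinearModel:
    """Least-squares line fitted on a single feature column."""

    def __init__(self, column=0):
        self.column = column

    def custom_train(self, X, y):
        x = [row[self.column] for row in X]
        mx, my = mean(x), mean(y)
        sxx = sum((a - mx) ** 2 for a in x)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        self.slope = sxy / sxx if sxx else 0.0
        self.intercept = my - self.slope * mx
        return self

    def custom_predict(self, X):
        return [self.intercept + self.slope * row[self.column] for row in X]


def cross_validated_mse(model_factory, X, y, folds=3):
    """Mean squared error over interleaved folds for any pluggable model."""
    errors = []
    for k in range(folds):
        test_idx = list(range(k, len(y), folds))
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = model_factory().custom_train(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        preds = model.custom_predict([X[i] for i in test_idx])
        errors += [(p - y[i]) ** 2 for p, i in zip(preds, test_idx)]
    return mean(errors)
```

Because `cross_validated_mse` only relies on the two-method interface, any new model, however complex internally, can be benchmarked against the existing ones without changing the evaluation code.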

Portable Legal Consent (Activating Patients)
John Wilbanks

Sage Congress Project
April 20 2012

RealNames Parkinson’s Project

Revisiting Breast Cancer Prognosis

Fanconi’s Anemia

(Responders Competitions - IBM-DREAM)

Networking Disease Model Building
