5. $H^2 = \sigma^2_G / \sigma^2_P$
Heritability (H2) is the proportion of phenotypic
variability attributed to genetic variability in a
population
This estimate captures the genetic architecture of a
phenotype, which is important for the way we model
genetic risk for disease
6. For example:
(1) Is my G-P association specified correctly?
Or, how much of the P variation is additive vs. interaction?
P = a + b1*g1
vs.
P = a + b1*g1 + z1*c1
vs.
P = a + b1*g1 + b2*g2 + b3*g1*g2 + z1*c1
Stratification (confounding)?
Epistasis (interaction)?
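A minimal sketch of why the choice between P = a + b1*g1 and P = a + b1*g1 + z1*c1 matters. It simulates a confounder c1 that drives both g1 and P, then compares the estimated b1 with and without adjustment. The effect sizes, the continuous stand-in for genotype, and all function names are illustrative assumptions, not values from the slides.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Sample covariance (cov(x, x) is the sample variance)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def slope_unadjusted(g, p):
    """b1 from the model P = a + b1*g1."""
    return cov(g, p) / cov(g, g)


def slope_adjusted(g, c, p):
    """b1 from the model P = a + b1*g1 + z1*c1 (closed-form two-predictor OLS)."""
    s_gg, s_cc, s_gc = cov(g, g), cov(c, c), cov(g, c)
    s_gp, s_cp = cov(g, p), cov(c, p)
    return (s_gp * s_cc - s_cp * s_gc) / (s_gg * s_cc - s_gc ** 2)


# Hypothetical simulation: c1 (e.g., ancestry stratum) shifts both g1 and P.
rng = random.Random(0)
n = 20000
true_b1, true_z1 = 0.5, 1.0  # assumed effect sizes, for illustration only
c1 = [rng.gauss(0, 1) for _ in range(n)]
g1 = [c + rng.gauss(0, 1) for c in c1]
P = [true_b1 * g + true_z1 * c + rng.gauss(0, 1) for g, c in zip(g1, c1)]

b1_naive = slope_unadjusted(g1, P)   # absorbs part of the effect of c1
b1_adj = slope_adjusted(g1, c1, P)   # close to the simulated b1
print(f"unadjusted b1 = {b1_naive:.2f}, adjusted b1 = {b1_adj:.2f}")
```

Under these assumptions the unadjusted slope is inflated toward roughly twice the simulated effect, while the adjusted slope recovers it.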
7. For example:
(2) How much P variation is explained by what we can measure?
i.e., how much variation in P is captured by GWAS?
[Figure: Manhattan plot of −log10(P) by chromosome (1-22, X) from a genome-wide association study, with case vs. control counts by genotype (AA, Aa, aa). WTCCC, Nature 447, 7 June 2007]
Image credit: illumina
8. What is the exposomic architecture of a complex
phenotype?
$E^2 = ? / \sigma^2_P$
Important for the way we model exposomic risk for
disease, including the role of mixtures!
10. 🏠 $C^2 = \sigma^2_{\mathrm{shared}} / \sigma^2_P$
Shared E (C2) is the proportion of phenotypic variability
attributed to shared household or geography
(but not genetics)
11. Lakhani et al., Nature Genetics 2019
Decompose the mixture of genetics, shared E
(C2), and indicators of shared E (air pollution,
weather, and socioeconomic status)
chirag “the better” braden
Arjun Manrai
(BCH CHIP)
http://apps.chiragjpgroup.org/catch/
12. To decompose P, G, and E, we need to
measure them all simultaneously!
13. Decomposing the mixture of genetics, shared E (C2), and indicators of
shared E (air pollution, weather, and socioeconomic status)
in real-world data
+ Disease (ICD9/ICD10),
procedures, drugs, labs
N ~ 45M
chirag “the better” braden
Jian Yang
Peter M Visscher
Arjun Manrai
(BCH CHIP)
insurance claims
Weather
Air Pollution
Census SES
15. Amassing (the largest) twin and sibling cohort
in the US to estimate G and E in ~500 P
• Assume familial relationships in
subscriber groups
• Subscriber group less than 15 members
• Both members are children of the primary
subscriber (e.g., employed individual)
• Same date of birth
• Year of birth occurs on or after 1985
• Member enrollment greater than 36
months
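The inclusion criteria above can be sketched as a filter over enrollment records. This is a minimal illustration, not the study's pipeline: the `Member` fields, the `candidate_twin_pairs` name, and the record layout are hypothetical, and higher-order multiples (triplets) are not handled specially.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Member:
    member_id: str
    group_id: str         # subscriber group (assumed to be one family)
    relation: str         # relation to primary subscriber, e.g. "child"
    sex: str
    dob: date
    months_enrolled: int


def candidate_twin_pairs(members, max_group_size=15, min_birth_year=1985,
                         min_months=36):
    """Return (id_a, id_b, kind) for pairs meeting the listed criteria."""
    groups = {}
    for m in members:
        groups.setdefault(m.group_id, []).append(m)

    pairs = []
    for group in groups.values():
        if len(group) >= max_group_size:      # group must have < 15 members
            continue
        eligible = [
            m for m in group
            if m.relation == "child"              # child of primary subscriber
            and m.dob.year >= min_birth_year      # born in or after 1985
            and m.months_enrolled > min_months    # enrolled > 36 months
        ]
        for a, b in combinations(eligible, 2):
            if a.dob == b.dob:                    # same date of birth
                kind = "same-sex" if a.sex == b.sex else "opposite-sex"
                pairs.append((a.member_id, b.member_id, kind))
    return pairs
```

Labelling each pair as same-sex or opposite-sex at this stage is a design choice that feeds the zygosity estimate used later in the deck.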
Twin type            N
Same sex - female    17,919
Same sex - male      17,835
Opposite sex         20,642
Total                56,396
Largest collection of twins in US (next largest has ~28k pairs)
724K siblings!
Lakhani et al., Nature Genetics 2019
16. Where do we get E indicators?
Exposome Data Warehouse (~1TB)
Geographical information system-enabled
database to map individuals to E
17. US distribution of twins and siblings that can be “linked”
to variation in air pollution, climate, and socioeconomic
status
18. We mapped 13360 ICD9 billing codes to 1809
PheWAS codes (in addition to 95 Mendelian
disorders)
Denny, Bastarache, et al. 2013
Rzhetsky, White et al. 2013
CARDIOVASCULAR
hypertension (401)
cardiac dysrhythmias (427)
DIGESTIVE
irritable bowel syndrome (564.1)
ENDOCRINE
type 1 diabetes (250.1)
type 2 diabetes (250.2)
(and 11 more phenotype groups)
19. h2 = 2(rmz - rdz)
c2 = 2rdz - rmz
In a twin study, h2 and c2 can be
estimated using Falconer’s formula
Tetrachoric correlation to estimate rmz & rdz
h2 : narrow-sense heritability
c2 : shared environment
rmz: correlation of phenotype between identical twins
rdz: correlation of phenotype between fraternal twins
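Falconer's formula as stated above, written as a small function. The correlations in the usage lines are made-up inputs for illustration, not estimates from the study.

```python
def falconer(r_mz, r_dz):
    """Falconer's formula: h2 = 2(rmz - rdz), c2 = 2*rdz - rmz."""
    h2 = 2 * (r_mz - r_dz)
    c2 = 2 * r_dz - r_mz
    return h2, c2


# Hypothetical twin correlations, for illustration only.
h2, c2 = falconer(r_mz=0.6, r_dz=0.4)
print(f"h2 = {h2:.2f}, c2 = {c2:.2f}")
```

Note that h2 + c2 equals rmz by construction, so the two estimates are not independent of each other.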
20. … but we do not know the zygosity status of
claimants…
But we do know:
Opposite sex twins: all fraternal
Same Sex twins 👯 : mixture of identical and fraternal
21. We can estimate the proportion of fraternal and
identical twins using opposite sex twin prevalence
Weinberg, 1902
Benyamin, et al, 2005, 2006
P(mz) ~ 1 - 2(NOS / Nall) = 0.26
P(ss) = Nss / Nall = 0.63
p = P(mz|ss) = P(mz) / P(ss) = 0.41
h2 = 2/p (rss - ros)
c2 = (ros(p+1) - rss) / p
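A sketch of the two steps above. It assumes opposite-sex pairs are all fraternal (so ros stands in for rdz) and that the same-sex correlation is a mixture, rss = p*rmz + (1 - p)*rdz; substituting these into Falconer's formula gives the h2 and c2 expressions on the slide. Function names are illustrative.

```python
def weinberg_mz_given_ss(n_os, n_ss):
    """Estimate p = P(mz | ss) from opposite-sex and same-sex pair counts."""
    n_all = n_os + n_ss
    p_mz = 1 - 2 * n_os / n_all   # P(mz) ~ 1 - 2(NOS / Nall)
    p_ss = n_ss / n_all           # P(ss) = Nss / Nall
    return p_mz / p_ss


def h2_c2_mixed_zygosity(r_ss, r_os, p):
    """h2 = 2/p (rss - ros); c2 = (ros(p+1) - rss) / p."""
    h2 = 2 / p * (r_ss - r_os)
    c2 = (r_os * (p + 1) - r_ss) / p
    return h2, c2


# Round-trip check with hypothetical correlations: build rss from an assumed
# rmz and rdz, then confirm the estimators agree with Falconer's formula.
p = 0.41
r_mz, r_dz = 0.6, 0.4                 # made-up values for illustration
r_ss = p * r_mz + (1 - p) * r_dz      # mixture of identical and fraternal
r_os = r_dz                           # opposite-sex pairs are all fraternal
h2, c2 = h2_c2_mixed_zygosity(r_ss, r_os, p)
print(f"h2 = {h2:.2f} (Falconer: {2 * (r_mz - r_dz):.2f}), c2 = {c2:.2f}")
```

Because h2 is scaled by 2/p, uncertainty in p or a small difference between rss and ros is amplified, which is one reason very large pair counts help.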
22. Lakhani et al., Nature Genetics 2019
http://apps.chiragjpgroup.org/catch/
Patient cohorts in the “real-world” :
overall heritability (0.32) and shared environment (0.09): a
global view among 560 phenotypes
gives a nuanced view of G and E
CaTCH: Claims analysis of Twin Correlation and Heritability
US-based, ages < 25
[Figure: statistic by phenotype category: Overall (0.32), Endocrine (0.4), Metabolic (0.4)]
23. Decomposing G (h2), shared E (c2) and factors of shared E
(SES, air quality, and temperature)
in 2 candidate phenotypes:
Lyme and Obesity
Lakhani et al., Nature Genetics 2019
temperature/seasonality = 5% of shared environment
SES explains 10% of shared E
significant total genetics
significant total shared E
zero genetics
moderate shared environment
24. h2 and c2 estimates for 560 phenotypes versus statistical significance:
326/560 traits (58%) have a heritable component and 180/560 (32%) have a
shared environment component!
r=0.817
Lakhani et al., Nature Genetics 2019
25. … but air pollution, climate, and geocoded SES
play a modest role in total shared environment (c2)
r=0.817
Lakhani et al., Nature Genetics 2019
26. 56K twins and 700K siblings in a massive health insurance cohort
point to complex and elusive variation in 560 phenotypes
560 phenotypes
http://apps.chiragjpgroup.org/catch/
h2=0.3
c2=0.1 (🏠)
?
https://rdcu.be/boZeV
27. • 58.2% of P had non-zero h2
• 32% of P had a non-zero c2
• 87.9% had non-zero age fixed effect
• 50% had non-zero sex fixed effect
• Shared c2 != pollution, income, climate
• 0.32 + 0.09 = 0.41
• 1 - 0.41 = 0.59!
The shared environment and genetic contribution in phenotype
is complex and modest:
Where is the rest of the variation in P?
28. Explaining the missing variation in P with the Exposome:
A data-driven paradigm for robust discovery of E
Wild, 2005, 2012
Ioannidis , 2009
Rappaport and Smith, 2010, 2011
Buck-Louis and Sundaram 2012
Miller and Jones, 2014
Patel CJ and Ioannidis JPAI, 2014ab
Ioannidis, 2016
Manrai et al 2017
29. Hypothesis - on average, explaining only 10% of the stuff in red
560 phenotypes
http://apps.chiragjpgroup.org/catch/
h2=0.3
c2=0.1 (🏠)
~10%
https://rdcu.be/boZeV
30. Challenges in EWAS for identification of specific E factors, E
mixtures, and variance explained (E2) of the exposome
Power (both low and large sample sizes)
Model misspecification and mis?-interpretation
Dense correlational web of the exposome
Time-dependence of the exposome-phenome associations
ARPH 2016
JAMA 2014
JECH 2014
Scalable measurement of the exposome
31. Red: positive ρ
Blue: negative ρ
thickness: |ρ|
for each pair of E:
Spearman ρ
(575 factors: 81,937 correlations)
Correlation globes paint a complex view of the exposome:
average correlation of < 0.3
permuted data to produce
“null ρ”
sought replication in > 1
cohort
Pac Symp Biocomput. 2015
JECH. 2015
Chung et al., ES&T 2018
Effective number of
variables:
500 (10% decrease)
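The pairwise step behind a correlation globe can be sketched as follows: Spearman ρ for a pair of exposures, with a permutation "null ρ" obtained by shuffling one of the two. This is a stdlib-only illustration under assumed names; it is not the published pipeline and omits the replication step across cohorts.

```python
import random


def ranks(xs):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def permutation_pvalue(xs, ys, n_perm=1000, seed=0):
    """Share of shuffles whose |null rho| is at least the observed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

In a globe, each retained pair would then be drawn as an edge coloured by the sign of ρ and weighted by |ρ|.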
32. Co-E patterns between
females and males are similar
females
males
… however:
Co-E within household
weaker!
Chung et al., ES&T 2018
Jake Chung
33. High-throughput data analytics to mitigate analytical challenges of
exposome-based research:
What model do we use, how do we assess them, and how to
interpret them?
Agier, Portengen et al, EHP 2016
34. Unclear what model to use in the context of adjustment
variables
Estimating the Vibration of Effects (or Risk)
Variable of Interest
e.g., 1 SD of log(serum Vitamin D)
Adjusting Variable Set
n=13
All-subsets Cox regression
2^13 + 1 = 8,193 models
SES [3rd tertile]
education [>HS]
race [white]
body mass index [normal]
total cholesterol
any heart disease
family heart disease
any hypertension
any diabetes
any cancer
current/past smoker [no smoking]
drink 5/day
physical activity
Data Source
NHANES 1999-2004
417 variables of interest
time to death
N≧1000 (≧100 deaths)
effect sizes
p-values
[Figure: vibration of effects for Vitamin D (1 SD of log): −log10(p-value) vs. hazard ratio across all models, grouped by number of adjustment variables k = 0-13, with 1st/50th/99th percentile indicators and the median p-value/HR for each k. RHR = 1.14, RP = 4.68. JCE, 2015]
[Figure: vibration of effects for Thyroxine (1 SD of log): −log10(p-value) vs. hazard ratio, same layout. RHR = 1.15, RP = 2.90]
http://bit.ly/effectvibration
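A sketch of the bookkeeping behind the vibration of effects: enumerate every subset of the 13 adjusting variables, fit one model per subset, and summarize the spread of hazard ratios and p-values. The model fitting itself is left out. The summaries here assume RHR is the ratio of the 99th to the 1st percentile hazard ratio and RP is the difference in −log10(p-value) between the same percentiles; the slides show only the 1/50/99 percentile markers, so treat these definitions and all names as assumptions.

```python
import math
from itertools import combinations

# The 13 adjusting variables listed on the slide.
ADJUSTERS = [
    "SES", "education", "race", "body mass index", "total cholesterol",
    "any heart disease", "family heart disease", "any hypertension",
    "any diabetes", "any cancer", "smoking", "drink 5/day",
    "physical activity",
]


def all_adjustment_sets(adjusters):
    """Yield every subset of the adjusters, from the empty set to all of them."""
    for k in range(len(adjusters) + 1):
        for subset in combinations(adjusters, k):
            yield subset


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def vibration(hazard_ratios, p_values):
    """Return (RHR, RP) across the models fitted for one variable of interest."""
    rhr = percentile(hazard_ratios, 99) / percentile(hazard_ratios, 1)
    neg_log_p = [-math.log10(p) for p in p_values]
    rp = percentile(neg_log_p, 99) - percentile(neg_log_p, 1)
    return rhr, rp


n_sets = sum(1 for _ in all_adjustment_sets(ADJUSTERS))
print(n_sets)  # number of adjustment sets, 2**13
```

Enumerating the subsets gives 2^13 adjustment sets; the slide's count of 8,193 includes one further model that this sketch does not attempt to reproduce.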
38. JCE, 2015
Risk and significance depend on the modeling scenario!
The Vibration of Effects: beware of the Janus effect
(both risk and protection?!)
“risk” “protection”
“significant”
http://bit.ly/effectvibration
39. How do we interpret complex models?
What is a mixture? - an integration of correlated E? Interaction?
Curr Epidemiol Rep 2017
P = a + b1*e1 + b2*e2 + b3*e3
vs.
P = a + b1*e1 + b2*e2 + b3*e3 + b4*e1*e2
vs.
P = a + b1*e1 + b2*e2 + b3*e3 + b4*e1*e2
+ b5*e1*e3 + b6*e2*e3 + b7*e1*e2*e3
All terms significant?
Only interaction terms?
Additive?
Simple interaction?
All 2 way/n-way?
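The three candidate mixture models differ only in which interaction terms they include, so the term lists can be generated mechanically. A minimal sketch, with hypothetical function names, that builds the right-hand side for a given set of exposures and a maximum interaction order:

```python
from itertools import combinations


def model_terms(exposures, max_order):
    """Main effects plus every interaction up to max_order exposures."""
    terms = []
    for order in range(1, max_order + 1):
        for combo in combinations(exposures, order):
            terms.append("*".join(combo))
    return terms


def formula(exposures, max_order):
    """Write the model in the slide's notation, numbering coefficients b1, b2, ..."""
    terms = model_terms(exposures, max_order)
    rhs = " + ".join(f"b{i}*{t}" for i, t in enumerate(terms, start=1))
    return f"P = a + {rhs}"


E = ["e1", "e2", "e3"]
print(formula(E, 1))            # additive model
print(formula(E, 3))            # all 2-way and 3-way interactions
print(len(model_terms(E, 3)))   # 2**3 - 1 terms
```

With n exposures the fully interacting model has 2^n - 1 terms, so the number of parameters grows much faster than the number of exposures measured.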
40. Our current exposomic research cohorts may be underpowered
for complex modeling!
Chung et al, Environment International, 2019
42. Emerging large sample sized cohorts will expose massive
misspecification, or confounding, and correlation amongst E!
Manrai et al, American Journal of Epidemiology 2019
EWAS in telomere length
(Patel et al., IJE 2017)
Adjusted for age, poverty, ethnicity, and sex
GWAS in telomere length
(Pearson and Manolio, JAMA 2008)
Before and after confounding adjustment
43. Aside from residual confounding and mis-specification:
Modest association sizes and R2 (if there are effects) may be diluted
by the complex phenomenon of individual exposure and time
Athersuch Bioanalysis 2012
Cumulative (cadmium, PCB)
Constant, but excreted (phenols, vitamins)
Intervention (drugs)
Seasonal (allergen)
In-utero
Not shown: Diurnal
44. In conclusion:
In complex traits and phenotypes, we are missing much
of phenotypic variation of the population.
H2 ~ 0.3
C2 ~ 0.1
45. Challenges in EWAS for identification of specific E factors, E
mixtures, and variance explained (E2) of the exposome
Power (both low and large sample sizes)
Model misspecification and mis(?)-interpretation
Dense correlational web of the exposome
Time-dependence of the exposome-phenome associations
ARPH 2016
JAMA 2014
JECH 2014
Scalable measurement of the exposome
46. Now hiring!
Data scientist for translational exposome informatics!
[Figure: phenome = exposome + genome, across individuals. Labels: diabetes, telomeres, air pollution, nutrients, influenza, FTO, TCF7L2; thousands of environmental exposures, millions of genetic variants]
Develop data science tools for dissecting relationships between the
phenome, exposome, and genome to optimize medical decision
making in the era of massive data.
(you)
your phenome
P E G
@chiragjp
chirag@hms.harvard.edu
47. Harvard DBMI
Susanne Churchill
Nathan Palmer
Sophia Mamousette
Sunny Alvear
Chirag J Patel
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
NIH Common Fund
Big Data to Knowledge
Acknowledgements
RagGroup
Jake Chung
Kajal Claypool
Chirag Lakhani
Danielle Rasooly
Alan LeGoallec
Braden Tierney
Yixuan He
Mentioned Collaborators
Arjun Manrai
John Ioannidis
Peter Visscher