Intro to Biomedical Informatics 701
1. Bioinformatics for discovery:
Introduction to GWAS and EWAS
BMI 701:Introduction to Biomedical Informatics
12/1/2015
chirag@hms.harvard.edu
@chiragjp
www.chiragjpgroup.org
Chirag J Patel
2. P = G + E
Type 2 Diabetes
Cancer
Alzheimer's
Gene expression
Phenotype Genome
Variants
Environment
Infectious agents
Nutrients
Pollutants
Drugs
Complex traits are a function of genes and
environment...
3. We are great at G investigation!
over 2000
Genome-wide Association Studies (GWAS)
https://www.ebi.ac.uk/gwas/
G
7. A new paradigm of GWAS for discovery of G in P:
Human Genome Project to GWAS
Sequencing of the genome
2001
HapMap project:
http://hapmap.ncbi.nlm.nih.gov/
Characterize common variation
2001-current day
High-throughput variant
assay
< $99 for ~1M variants
Measurement tools
~2003 (ongoing)
ARTICLES
Genome-wide association study of 14,000
cases of seven common diseases and
3,000 shared controls
The Wellcome Trust Case Control Consortium*
There is increasing evidence that genome-wide association (GWA) studies represent a powerful approach to the identification of genes involved in common human diseases. We describe a joint GWA study (using the Affymetrix GeneChip 500K Mapping Array Set) undertaken in the British population, which has examined ~2,000 individuals for each of 7 major diseases and a shared set of ~3,000 controls. Case-control comparisons identified 24 independent association signals at P < 5 × 10^-7: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes. On the basis of prior findings and replication studies thus-far completed, almost all of these signals reflect genuine susceptibility effects. We observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied. Across all diseases, we identified a ...
Vol 447 | 7 June 2007 | doi:10.1038/nature05911
Nature 2007
Comprehensive, high-throughput analyses
GWAS
8. Number of raw publications with subject of "GWAS"
[Figure: bar chart of the number of publications per year, 1994-2013 (y-axis 0-3000)]
pubmed MeSH terms:
human + GWAS
9. Number of raw publications with subject of "GWAS"
[Figure: bar chart of the number of publications per year, 1994-2013 (y-axis 0-3000), annotated with the milestones listed below]
pubmed MeSH terms:
human + GWAS
Risch + Merikangas
linkage vs. association
human genome sequenced
GWAS
age-related macular degeneration
mega-meta-GWAS
WTCCC
GWAS is relevant today (even with NGS around the corner)
11. Geneticists have made substantial progress in identifying the genetic basis of many human diseases, at least those with conspicuous determinants. These successes include Huntington's disease, Alzheimer's disease, and some forms of breast cancer. However, the detection of genetic factors for complex diseases, such as schizophrenia, bipolar disorder, and diabetes, has been far more complicated. There have been numerous reports of genes or loci that might underlie these disorders, but few of these findings have been replicated. The modest nature of the gene effects for these disorders likely explains the contradictory and inconclusive claims about their identification. Despite the small effects of such genes, the magnitude of their attributable risk (the proportion of people affected due to them) may be large because they are quite frequent in the population, making them of public health significance.

Has the genetic study of complex disorders reached its limits? The persistent lack of replicability of these reports of linkage between various loci and complex diseases might imply that it has. We argue below that ...

... age analysis we have chosen for this argument is a popular current paradigm in which pairs of siblings, both with the disease, are examined for sharing of alleles at multiple sites in the genome defined by genetic markers. The more often the affected siblings share the same allele at a particular site, the more likely the site is close to the disease gene. Using the formulas in (1), we calculate the expected proportion Y of alleles shared by a pair of affected siblings for the best possible case, that is, a closely linked marker locus (recombination fraction θ = 0) that is fully informative (heterozygosity = 1) (2), as

$$Y = \frac{1 + w}{2 + w}, \qquad \text{where } w = \frac{pq(\gamma - 1)^2}{(p\gamma + q)^2}$$

If there is no linkage of a marker at a particular site to the disease, the siblings would be expected to share alleles 50% of the time; that is, Y would equal 0.5. Values of Y for various values of p and γ are given in the third column of the table. For an allele of moderate frequency (p is 0.1 to 0.5) that con-
linkage analysis for
about 2 or less will ne
because the numbe
(more than -2500)
able.
Although testsof
est effect are of low
above example, direc
a disease locus itself
To illustrate this poi
sion/disequilibrium t
In this test, transmis
at a locus from heter
affected offspring is e
lian inheritance, all a
chance ofbeing tran
eration. In contrast,
associated with dise
mitted more often th
For this approach,
with multiple affect
just on single affect
parents. For the same
can calculate the pr
parents as pq(y + 1
the probability for a
transmit the high ris
Association tests ca
pairs of affected sibl
associatedwithdiseas
over 50% is the same
the probability ofpar
creased at lowvalues
the probability ofpar
creased. The formula
The Future of Genetic Studies of
Complex Human Diseases
Neil Risch and Kathleen Merikangas
Science, 1996
A new paradigm is needed for discovery!
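The allele-sharing formula quoted in the Risch and Merikangas excerpt, Y = (1 + w)/(2 + w) with w = pq(γ - 1)^2/(pγ + q)^2, can be evaluated directly to see how weak the linkage signal becomes for modest effects. The sketch below is only an illustration: the function name and the grid of p and γ values are assumptions, and γ is taken to be the genotypic risk ratio (printed as "y" in the scanned excerpt).

```python
def expected_allele_sharing(p: float, gamma: float) -> float:
    """Expected proportion Y of alleles shared by an affected sibling pair.

    p     : frequency of the risk allele (q = 1 - p)
    gamma : genotypic risk ratio per copy of the risk allele
    Assumes the best case quoted in the excerpt: no recombination between
    marker and disease locus, and a fully informative marker.
    """
    q = 1.0 - p
    w = p * q * (gamma - 1.0) ** 2 / (p * gamma + q) ** 2
    return (1.0 + w) / (2.0 + w)


if __name__ == "__main__":
    # Hypothetical grid of allele frequencies and risk ratios.
    for gamma in (1.5, 2.0, 4.0):
        for p in (0.01, 0.1, 0.5):
            y = expected_allele_sharing(p, gamma)
            print(f"gamma={gamma:<4} p={p:<5} Y={y:.3f}")
```

With γ = 1 (no effect) the function returns exactly 0.5, the value expected under no linkage; for small γ the sharing stays close to 0.5, which is why very large numbers of sibling pairs would be needed to detect it.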
13. Single nucleotide polymorphisms (SNPs):
How many SNPs are in the human genome?
>3,000,000,000 bases in human genome
SNPs appear about once every ~1,000 bases
~3,000,000 SNPs
40-60% have minor allele frequency <5%
GWAS focus on frequency >5%
HapMap Consortium, 2010
14. Can't measure everything:
Tag SNPs and Linkage Disequilibrium (LD)
LD = co-occurrence of SNPs in a contiguous region
Bush and Moore, 2012
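One common way to quantify the co-occurrence of alleles at two SNPs is the disequilibrium coefficient D and the squared correlation r². The slide does not name a specific measure, so the sketch below is an assumption: it uses the standard definitions D = p_AB - p_A p_B and r² = D² / (p_A(1 - p_A) p_B(1 - p_B)), with a hypothetical haplotype encoding.

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic SNPs.

    haplotypes: list of 2-character strings such as "AB", "Ab", "aB", "ab".
    The first character is the allele at SNP 1, the second the allele at
    SNP 2 (this encoding is hypothetical, chosen for illustration).
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "A" for h in haplotypes) / n   # frequency of A at SNP 1
    p_b = sum(h[1] == "B" for h in haplotypes) / n   # frequency of B at SNP 2
    p_ab = sum(h == "AB" for h in haplotypes) / n    # frequency of the AB haplotype
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


if __name__ == "__main__":
    perfect_ld = ["AB"] * 6 + ["ab"] * 4          # alleles always travel together
    no_ld = ["AB", "Ab", "aB", "ab"]              # all combinations equally common
    print(ld_stats(perfect_ld))
    print(ld_stats(no_ld))
```

When r² is close to 1, genotyping one SNP tells you nearly everything about the other, which is the property tag SNPs rely on.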
15. The phenomenon of LD makes GWAS possible:
How and why?: Indirect association
[Figure 3. Indirect Association. Genotyped SNPs often lie in a region of high linkage ... will be statistically associated with disease as a surrogate for the disease SNP ...]
doi:10.1371/journal.pcbi.1002822.g003
Bush and Moore, 2012
LD blocks
16. Can't measure everything:
Tag SNPs and Linkage Disequilibrium
Tag SNPs are common proxies for other SNPs
500K - 1M per chip
[Figure, panels a-d: -log10[P] for SNPs in the regions around SLC30A8, IDE, EXT2 and ALX4]
LD block
2 alleles are correlated because they are inherited
together
Sladek et al, 2007
18. Assessing Thousands of Factors Simultaneously:
Data-driven search for differences in SNP frequencies
~100,000 - ~1,000,000 association tests
disease cases
healthy controls
GCAGGTACATG...GGTA...
GCAGGTACACG...GGTA...
GCAGGTACATG...GGTA...
GCAGGTACACG...GGTA...
GCAGGTACATG...GGTA...
GCAGGTACACG...GGTA...
disease cases
GCAGGTACATG...GGTA...
GCAGGTACATG...GGTA...
GCAGGTACATG...GGTA...
GCAGGTACATG...GGTA...
healthy controls
19. Associating One SNP with Disease
Case-Control Study Design
[Diagram: SNP (A/a) -?-> Disease]
                 A    a
diseased                   (cases)
non-diseased               (controls)
20. Associating One SNP with Disease
What is an "Odds Ratio"?
[Diagram: SNP (A/a) -?-> Disease]
                 A    a
diseased         c    d    (cases)
non-diseased     x    y    (controls)
Chi-squared test
Odds Ratio a vs A:
Odds of disease with allele a
vs.
Odds of disease with allele A
1: equal odds (no difference)
>1: increased odds (increased risk)
<1: decreased odds (decreased risk)
21. Associating One SNP with Disease
Calculating the Odds Ratio
[Diagram: SNP (A/a) -?-> Disease]
                 A    a
diseased         c    d    (cases)
non-diseased     x    y    (controls)
Chi-squared test
Odds with allele a: $\frac{d/(d+y)}{y/(d+y)} = \frac{d}{y}$
Odds with allele A: $\frac{c/(c+x)}{x/(c+x)} = \frac{c}{x}$
Odds Ratio a vs A: $\frac{d/y}{c/x} = \frac{d/c}{y/x} = \frac{dx}{cy}$
How would you interpret an OR of 2?
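The odds ratio and the chi-squared test for the 2x2 allele table can be computed in a few lines. This is a minimal sketch using the slide's cell labels (c, d for alleles A, a in cases; x, y for A, a in controls); the counts in the example are invented for illustration, and the p-value uses the 1-degree-of-freedom identity p = erfc(sqrt(χ²/2)) without a continuity correction.

```python
import math


def odds_ratio_a_vs_A(c, d, x, y):
    """Odds of disease with allele a divided by odds with allele A: dx / (cy)."""
    return (d / y) / (c / x)


def chi_squared_2x2(c, d, x, y):
    """Pearson chi-squared statistic (1 df, no continuity correction) and p-value."""
    n = c + d + x + y
    stat = n * (c * y - d * x) ** 2 / ((c + d) * (x + y) * (c + x) * (d + y))
    p_value = math.erfc(math.sqrt(stat / 2.0))  # survival function of chi2 with 1 df
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical allele counts: cases carry A=300, a=200; controls A=350, a=150.
    c, d, x, y = 300, 200, 350, 150
    print("OR a vs A:", round(odds_ratio_a_vs_A(c, d, x, y), 3))
    print("chi-squared, p:", chi_squared_2x2(c, d, x, y))
```

An OR of 2 would mean the odds of being a case are twice as high for chromosomes carrying allele a as for those carrying allele A; it is a ratio of odds, not of risks.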
22. Associating One SNP with Disease
Cohort Study Design
[Diagram: SNP (A/a) -?-> Disease vs. SNP (A/a) -?-> Non-diseased]
• Direct measure of risk vs. odds ratio
• Need to wait!
• If incidence is low, N needs to be large!
Cox survival regression
Relative Risk
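To make the contrast between a direct measure of risk and an odds ratio concrete, the sketch below computes both from cohort counts. The function names and the two example cohorts are assumptions chosen for illustration; the point is that the two measures nearly coincide when the disease is rare and diverge when it is common.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk in carriers divided by risk in non-carriers."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)


def odds_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Odds of disease in carriers divided by odds in non-carriers."""
    odds_exposed = cases_exposed / (n_exposed - cases_exposed)
    odds_unexposed = cases_unexposed / (n_unexposed - cases_unexposed)
    return odds_exposed / odds_unexposed


if __name__ == "__main__":
    # Hypothetical cohorts of 1,000 carriers and 1,000 non-carriers.
    rare = (20, 1000, 10, 1000)      # 2% vs 1% incidence
    common = (400, 1000, 200, 1000)  # 40% vs 20% incidence
    for label, counts in (("rare", rare), ("common", common)):
        print(label, "RR =", round(relative_risk(*counts), 3),
              "OR =", round(odds_ratio(*counts), 3))
```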
23. Models to associate genotypes with disease
Examples for a case-control study
Disease (ND=4): AA, AA, Aa, Aa
Non-diseased (NC=4): aa, aa, Aa, Aa
24. Models to associate genotypes with disease
Examples for a case-control study
Disease (ND=4): AA, AA, Aa, Aa
Non-diseased (NC=4): aa, aa, Aa, Aa
                 A    a
diseased         6    2
non-diseased     2    6
OR A (vs a)
OR a (vs A)
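The allelic table on this slide can be rebuilt by counting alleles in each group. The sketch below assumes the genotype assignment implied by the slide's 6/2 and 2/6 counts (diseased: AA, AA, Aa, Aa; non-diseased: aa, aa, Aa, Aa); the helper names are hypothetical.

```python
from collections import Counter


def allele_counts(genotypes):
    """Count individual alleles across a list of genotype strings like 'Aa'."""
    return Counter("".join(genotypes))


def allelic_odds_ratio(case_counts, control_counts, allele, reference):
    """Odds ratio of `allele` vs `reference`, cases relative to controls."""
    case_odds = case_counts[allele] / case_counts[reference]
    control_odds = control_counts[allele] / control_counts[reference]
    return case_odds / control_odds


if __name__ == "__main__":
    diseased = ["AA", "AA", "Aa", "Aa"]       # assumed from the 6 A / 2 a row
    non_diseased = ["aa", "aa", "Aa", "Aa"]   # assumed from the 2 A / 6 a row
    cases, controls = allele_counts(diseased), allele_counts(non_diseased)
    print(dict(cases), dict(controls))
    print("OR A (vs a):", allelic_odds_ratio(cases, controls, "A", "a"))
    print("OR a (vs A):", allelic_odds_ratio(cases, controls, "a", "A"))
```

The two odds ratios are reciprocals of each other, so the choice of reference allele changes the number but not the conclusion.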
25. Models to associate genotypes with disease
Genotypic Test ("2 or 1 df test")
Diseased (ND=4): AA, AA, Aa, Aa
Non-diseased (NC=4): aa, aa, Aa, Aa
                 AA   Aa   aa
diseased         2    2    0
non-diseased     0    2    2
OR AA (vs. Aa)
OR aa (vs. Aa)
26. Associating One SNP with Quantitative Trait
(e.g., height, weight, cholesterol)
[Figure: box plots of trait (height) by genotype. Left panel, SNP rs1234: GG, GC, CC. Right panel, SNP rs123456: CC, CT, TT.]
27. Associating One SNP with Quantitative Trait
Linear Regression and Additive Risk Model
$y = \alpha + \beta x + \varepsilon$
[Figure: height by genotype for SNP rs123456, coded CC (0), CT (1), TT (2), with the fitted line marking intercept α and slope β]
height = α + βx
x = 0 if individual is CC
x = 1 if individual is CT
x = 2 if individual is TT
β: change in height for 1 risk allele
T = risk allele
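The additive model on this slide codes each genotype as the number of risk alleles (CC = 0, CT = 1, TT = 2) and fits a straight line through the trait values. The sketch below uses the closed-form least-squares estimates; the heights are invented for illustration and the function names are assumptions.

```python
GENOTYPE_CODE = {"CC": 0, "CT": 1, "TT": 2}  # number of copies of risk allele T


def fit_additive_model(genotypes, trait):
    """Ordinary least squares for trait = alpha + beta * (risk allele count)."""
    x = [GENOTYPE_CODE[g] for g in genotypes]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(trait) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, trait))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta = sxy / sxx                  # change in trait per additional T allele
    alpha = mean_y - beta * mean_x    # expected trait for genotype CC
    return alpha, beta


if __name__ == "__main__":
    # Hypothetical heights (cm) for six individuals.
    genotypes = ["CC", "CC", "CT", "CT", "TT", "TT"]
    heights = [160, 162, 165, 167, 170, 172]
    alpha, beta = fit_additive_model(genotypes, heights)
    print(f"alpha = {alpha:.1f} cm, beta = {beta:.1f} cm per T allele")
```

In a real analysis the slope would be reported with a standard error and p-value, and covariates such as age and sex would usually be included.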
28. Prototypical "Manhattan plot" to visualize associations
Science, 2007
~100,000 - ~1,000,000 association tests
[Figure: Manhattan plot, -log10(P) (0-15) by chromosome (1-22, X); second panel, observed test statistic. NATURE | Vol 447 | 7 June 2007]
                 AA   Aa   aa
diseased
non-diseased
29. [Figure: seven Manhattan plots, -log10(P) (0-15) by chromosome (1-22, X). Panels: Bipolar disorder, Coronary artery disease, Crohn's disease, Hypertension, Rheumatoid arthritis, Type 1 diabetes, Type 2 diabetes]
Figure 4 | Genome-wide scan for seven diseases. For each of seven diseases, -log10 of the trend test P value for quality-control-positive SNPs, excluding ... Chromosomes are shown in alternating colours for clarity, with P values < 1 × 10^-5 highlighted in green. All panels are truncated at ...
30. Type I Error:
False Positives!
what is a p-value?
probability of a result at least as extreme as the one observed if there is no difference (H0)
Many tests: some can be significant (low p-value) by chance!
100 tests at a significance level of 0.05...
how many would be significant by chance?
Bonferroni "correction":
Divide the 0.05 significance level by the number of tests
e.g., 1M SNPs: 0.05/(1 × 10^6) = 5 × 10^-8
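The arithmetic behind the multiple-testing question on this slide can be checked directly. The sketch below assumes independent tests with every null hypothesis true, which is a simplification (nearby SNPs are correlated through LD); the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing `alpha` when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Family-wise error rate, assuming the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    print(expected_false_positives(100, 0.05))          # about 5 tests by chance
    print(prob_at_least_one_false_positive(100, 0.05))  # close to 1
    print(bonferroni_threshold(1_000_000))              # the usual GWAS threshold
```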
31. QQplot:
Distribution of observed p-values vs. H0 p-values
[Figure, left: histogram of runif(10000), a random uniform distribution; x-axis 0.0-1.0, y-axis Frequency. p-values under H0]
[Figure, right: histogram of gwas$P.value; x-axis 0.0-1.0, y-axis Frequency. p-values of GWAS in Total Cholesterol. Global Lipids Consortium, 2012]
32. QQplot:
Distribution of observed p-values vs. H0 p-values
[Figure: histogram of gwas$P.value; x-axis 0.0-1.0, y-axis Frequency]
p-values of GWAS in Total Cholesterol
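A QQ plot pairs each observed p-value with the value expected at the same rank under a uniform null distribution. The sketch below computes those pairs on the -log10 scale and, as an addition not shown on the slide, the genomic inflation factor λ (median observed 1-df chi-squared statistic divided by its null median, about 0.455). The simulated p-values stand in for real GWAS output, and the plotting position (i + 0.5)/n is one common convention among several.

```python
import math
import random
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-squared distribution with 1 df


def qq_points(p_values):
    """Pairs of (expected, observed) -log10(p), smallest p-value first."""
    n = len(p_values)
    observed = sorted(p_values)
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed)]


def genomic_inflation(p_values):
    """Lambda: chi-squared statistic at the median p-value over the null median."""
    z = NormalDist().inv_cdf(1.0 - median(p_values) / 2.0)
    return z * z / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    random.seed(0)
    # Simulated null p-values in (0, 1]; a real analysis would load GWAS results.
    null_p = [1.0 - random.random() for _ in range(10_000)]
    print("lambda under H0:", round(genomic_inflation(null_p), 3))
    print("most extreme point:", qq_points(null_p)[0])
```

Points that follow the diagonal indicate p-values consistent with the null; a λ well above 1 suggests inflation across the whole genome (for example from population stratification), while a departure only in the upper tail is the pattern expected from true associations.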
33. Which diseases show evidence of association?
Examining the QQplot of test statistics in WTCCC
... sent study cannot provide conclusive exclusion of any given gene. This is the consequence of several factors including: less-than-complete coverage of common variation genome-wide on the Affymetrix chip; poor coverage (by design) of rare variants, including many structural variants (thereby reducing power to detect rare, penetrant, alleles) (ref. 25); difficulties with defining the full genomic extent of the gene of interest; and, despite the sample size, relatively low power to detect, at levels of ...
... already allow us, for selected diseases, to highlight pathways and mechanisms of particular interest. Naturally, extensive resequencing and fine-mapping work, followed by functional studies will be required before such inferences can be translated into robust statements about the molecular and physiological mechanisms involved.
We turn now to a discussion of the main findings for each disease, focusing here only on the most significant and interesting results ...
[Figure: seven quantile-quantile plots, observed test statistic (0-30) vs. expected chi-squared value (0-25). Panels: BD, CAD, CD, HT, RA, T1D, T2D]
Figure 3 | Quantile-quantile plots for seven genome-wide scans. For each of the seven disease collections, a quantile-quantile plot of the results of the trend test is shown in black for all SNPs that pass the standard project filters, have a minor allele frequency >1% and missing data rate <1%. SNPs that ... 360,000 SNPs. SNPs at which the test statistic exceeds 30 are represented by triangles. Additional quantile-quantile plots, which also exclude all SNPs located in the regions of association listed in Table 3, are superimposed in blue (for BD, the exclusion of these SNPs has no visible effect on the plot, and ...
35. Ice Cream ↔ Drowning
Confounding bias
What is a confounder?
Summer!
?
A confounder is correlated with both the "risk" factor and the disease, leading to invalid inference.
Common source of bias in observational studies (e.g., case-control, cohort, etc.)
36. Population Stratification:
A source of possible confounding in GWAS
[Diagram: SNP, Disease, and race/ethnicity, with the SNP-Disease link marked "?"]
Ancestry correlated with allele frequency and disease
GWAS are done on specific populations separately.
(most have been done in populations of European ancestry)
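A small numeric example shows how ancestry can manufacture an association. In the sketch below, all counts are invented: within each of two hypothetical populations the allele is unrelated to disease (odds ratio of 1), but the populations differ in both allele frequency and their share of cases, so pooling them produces a strong spurious odds ratio.

```python
def odds_ratio_a_vs_A(case_A, case_a, control_A, control_a):
    """Allelic odds ratio of a vs A from a 2x2 table of allele counts."""
    return (case_a * control_A) / (case_A * control_a)


# Hypothetical allele counts: (case_A, case_a, control_A, control_a).
POPULATION_1 = (80, 20, 720, 180)   # allele a is rare; few cases sampled
POPULATION_2 = (360, 540, 40, 60)   # allele a is common; many cases sampled


def pooled(*tables):
    """Add the tables cell by cell, ignoring which population each came from."""
    return tuple(sum(cells) for cells in zip(*tables))


if __name__ == "__main__":
    print("population 1 OR:", odds_ratio_a_vs_A(*POPULATION_1))
    print("population 2 OR:", odds_ratio_a_vs_A(*POPULATION_2))
    print("pooled OR:", round(odds_ratio_a_vs_A(*pooled(POPULATION_1, POPULATION_2)), 2))
```

Analysing each population separately, as the slide notes, removes the artefact in this toy example; real studies also commonly adjust for ancestry estimated from the genotype data.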
37. Mediation
SNPs indicative of a mediator factor?
Example: FTO and Type 2 Diabetes
[Diagram: FTO, Body Mass, and Diabetes, with the FTO-Diabetes link marked "?"]
Association between FTO and Type 2 Diabetes via BMI?
... or does FTO have an independent role in Type 2 Diabetes...?
41. Type 2 Diabetes Mellitus:
A complex, multifactorial disease
• Insulin production vs. use
• beta-cell function
• insulin sensitivity (BMI)
• Moves glucose from blood into cells
• Complications arise due to glucose in blood, hyperglycemia
• diagnosed by blood glucose levels
CDC,
family history: 25%
body weight, diet, lifestyle, age
43. ARTICLES
A genome-wide association study
identifies novel risk loci for type 2 diabetes
Robert Sladek, Ghislain Rocheleau, Johan Rung, Christian Dina, Lishuang Shen, David Serre, Philippe Boutin, Daniel Vincent, Alexandre Belisle, Samy Hadjadj, Beverley Balkau, Barbara Heude, Guillaume Charpentier, Thomas J. Hudson, Alexandre Montpetit, Alexey V. Pshezhetsky, Marc Prentki, Barry I. Posner, David J. Balding, David Meyre, Constantin Polychronakos & Philippe Froguel
Type 2 diabetes mellitus results from the interaction of environmental factors with a combination of genetic variants, most of which were hitherto unknown. A systematic search for these variants was recently made possible by the development of high-density arrays that permit the genotyping of hundreds of thousands of polymorphisms. We tested 392,935 single-nucleotide polymorphisms in a French case-control cohort. Markers with the most significant difference in genotype frequencies between cases of type 2 diabetes and controls were fast-tracked for testing in a second cohort. This identified four loci containing variants that confer type 2 diabetes risk, in addition to confirming the known association with the TCF7L2 gene. These loci include a non-synonymous polymorphism in the zinc transporter SLC30A8, which is expressed exclusively in insulin-producing β-cells, and two linkage disequilibrium blocks that contain genes potentially involved in β-cell development or function (IDE-KIF11-HHEX and EXT2-ALX4). These associations explain a substantial portion of disease risk and constitute proof of principle for the genome-wide approach to the elucidation of complex genetic traits.
The rapidly increasing prevalence of type 2 diabetes mellitus (T2DM) is thought to be due to environmental factors, such as increased availability of food and decreased opportunity and motivation for physical activity, acting on genetically susceptible individuals. The heritability of T2DM is one of the best established among common diseases and, consequently, genetic risk factors for T2DM have been the subject of intense research (ref. 1). Although the genetic causes of many monogenic forms of diabetes (maturity onset diabetes in the young, neonatal mitochondrial and other syndromic types of diabetes mellitus) have been elucidated, few variants leading to common T2DM have been clearly identified and individually confer only a small risk (odds ratio ≈ 1.1-1.25) of developing T2DM (ref. 1). Linkage studies have reported many T2DM-linked chromosomal regions and have identified putative, causative genetic variants in CAPN10 (ref. 2), ENPP1 (ref. 3), HNF4A (refs ...
... genotypes for 392,935 single-nucleotide polymorphisms (SNPs) in 1,363 T2DM cases and controls (Supplementary Table 1). In order to enrich for risk alleles (ref. 21), the diabetic subjects studied in stage 1 were selected to have at least one affected first degree relative and age at onset under 45 yr (excluding patients with maturity onset diabetes in the young). Furthermore, in order to decrease phenotypic heterogeneity and to enrich for variants determining insulin resistance and β-cell dysfunction through mechanisms other than severe obesity, we initially studied diabetic patients with a body mass index (BMI) <30 kg m^-2. Control subjects were selected to have fasting blood glucose <5.7 mmol l^-1 in DESIR, a large prospective cohort for the study of insulin resistance in French subjects (ref. 22).
Genotypes for each study subject were obtained using two plat- ...
Sladek, 2007
How many SNPs (p-value?)
European-based; N ~ 1000
cases: high fasting blood glucose/non-obese
controls: non-obese
45. Identification of four novel T2DM loci
Our fast-track stage 2 genotyping confirmed the reported association for rs7903146 (TCF7L2) on chromosome 10, and in addition identified significant associations for seven SNPs representing four new T2DM loci (Table 1). In all cases, the strongest association for the MAX statistic (see Methods) was obtained with the additive model. The most significant of these corresponds to rs13266634, a non-synonymous SNP (R325W) in SLC30A8, located in a 33-kb linkage disequilibrium block on chromosome 8, containing only the 3′ end of this gene (Fig. 2a). SLC30A8 encodes a zinc transporter expressed solely in the secretory vesicles of β-cells and is thus implicated in the final stages of insulin biosynthesis, which involve co-crystallization ...
Table 1 | Confirmed association results
SNP | Chromosome | Position (nucleotides) | Risk allele | Major allele | MAF (case) | MAF (ctrl) | Odds ratio (het) | Odds ratio (hom) | PAR | λs | Stage 2 pMAX | Stage 2 pMAX (perm) | Stage 1 pMAX | Stage 1 pMAX (perm) | Nearest gene
rs7903146 | 10 | 114,748,339 | T | C | 0.406 | 0.293 | 1.65 ± 0.19 | 2.77 ± 0.50 | 0.28 | 1.0546 | 1.5 × 10^-34 | <1.0 × 10^-7 | 3.2 × 10^-17 | <3.3 × 10^-10 | TCF7L2
rs13266634 | 8 | 118,253,964 | C | C | 0.254 | 0.301 | 1.18 ± 0.25 | 1.53 ± 0.31 | 0.24 | 1.0089 | 6.1 × 10^-8 | 5.0 × 10^-7 | 2.1 × 10^-5 | 1.8 × 10^-5 | SLC30A8
rs1111875 | 10 | 94,452,862 | G | G | 0.358 | 0.402 | 1.19 ± 0.19 | 1.44 ± 0.24 | 0.19 | 1.0069 | 3.0 × 10^-6 | 7.4 × 10^-6 | 9.1 × 10^-6 | 7.3 × 10^-6 | HHEX
rs7923837 | 10 | 94,471,897 | G | G | 0.335 | 0.377 | 1.22 ± 0.21 | 1.45 ± 0.25 | 0.20 | 1.0065 | 7.5 × 10^-6 | 2.2 × 10^-5 | 3.4 × 10^-6 | 2.5 × 10^-6 | HHEX
rs7480010 | 11 | 42,203,294 | G | A | 0.336 | 0.301 | 1.14 ± 0.13 | 1.40 ± 0.25 | 0.08 | 1.0041 | 1.1 × 10^-4 | 2.9 × 10^-4 | 1.5 × 10^-5 | 1.2 × 10^-5 | LOC387761
rs3740878 | 11 | 44,214,378 | A | A | 0.240 | 0.272 | 1.26 ± 0.29 | 1.46 ± 0.33 | 0.24 | 1.0046 | 1.2 × 10^-4 | 2.8 × 10^-4 | 1.8 × 10^-5 | 1.3 × 10^-5 | EXT2
rs11037909 | 11 | 44,212,190 | T | T | 0.240 | 0.271 | 1.27 ± 0.30 | 1.47 ± 0.33 | 0.25 | 1.0045 | 1.8 × 10^-4 | 4.5 × 10^-4 | 1.8 × 10^-5 | 1.3 × 10^-5 | EXT2
rs1113132 | 11 | 44,209,979 | C | C | 0.237 | 0.267 | 1.15 ± 0.27 | 1.36 ± 0.31 | 0.19 | 1.0044 | 3.3 × 10^-4 | 8.1 × 10^-4 | 3.7 × 10^-5 | 2.9 × 10^-5 | EXT2
Significant T2DM associations were confirmed for eight SNPs in five loci. Allele frequencies, odds ratios (with 95% confidence intervals) and PAR were calculated using only the stage 2 data. Allele frequencies in the controls were very close to those reported for the CEU set (European subjects genotyped in the HapMap project). Induced sibling recurrent risk ratios (λs) were estimated using stage 2 genotype counts for the control subjects and assuming a T2DM prevalence of 7% in the French population. hom, homozygous; het, heterozygous; major allele, the allele with the higher frequency in controls; pMAX, P-value of the MAX statistic from the χ² distribution; pMAX (perm), P-value of the MAX statistic from the permutation-derived empirical distribution (pMAX and pMAX (perm) are adjusted for variance inflation); risk allele, the allele with higher frequency in cases compared with controls.
[Figure, panels a-b: -log10[P] for SNPs across the SLC30A8 and IDE-KIF11-HHEX regions]
NATURE | Vol 445 | 22 February 2007
Sladek, 2007
[Figure: genome-wide plot by chromosome of -log10[pMAX], the P-value obtained by the MAX statistic, for each SNP]
How would you interpret the p-values?
Odds ratios?
Confirmed 8 SNPs with N ~ 1000
47. g the Diabetes Genetics
nvestigation of NIDDM
nd (iv) the Framingham
omponent studies (n ΒΌ
ry Table 1 online.
aring, the four consortia
n 10 and 20 SNPs promi-
their individual, interim,
mentary Table 2 online).
oci with consistent effects
dies. Two of these repre-
6PC2 and GCK. In addi-
nerated evidence for an
NPs around the MTNR1B
rs1387153, P = 2.2 ×
10^-11; DFS: rs10830963,
5.8 × 10^-4, for the most
ch analysis). The associa-
d on formal meta-analysis
r exclusion of individuals
= 1.1 × 10^-57; rs4607517
NR1B), P = 3.2 × 10^-50;
pplementary Table 3 and
ent efforts to harmonize
(including the additional
data from the WTCCC, DGI and FUSION scans) (ref. 10) (Supplementary Note). We found strong evidence that the minor G allele of rs10830963 was associated with increased risk of T2D (odds ratio = 1.09 (1.05-1.12), P = 3.3 × 10^-7; Fig. 2 and Supplementary Table 6 online). The possibility that the fasting glucose association might ...
Study ID | OR (95% CI) | Weight (%)
DGI | 1.12 (0.96, 1.30) | 4.61
FUSION | 1.20 (1.03, 1.39) | 4.89
WTCCC | 1.07 (0.95, 1.20) | 8.03
deCODE | 1.14 (1.03, 1.27) | 9.58
KORA | 1.00 (0.84, 1.19) | 3.53
Rotterdam | 1.17 (1.04, 1.30) | 8.75
CCC | 1.07 (0.88, 1.31) | 2.69
ADDITION/ELY | 1.16 (1.02, 1.33) | 6.04
Norfolk | 1.00 (0.90, 1.10) | 10.56
UKT2DGC | 1.03 (0.96, 1.10) | 23.18
OxGN/58BC | 0.91 (0.75, 1.10) | 2.85
FUSION Stage 2 | 1.15 (1.02, 1.30) | 7.41
METSIM | 1.16 (1.03, 1.30) | 7.90
Overall (I² = 26.6%, P = 0.176) | 1.09 (1.05, 1.12) | 100.00
Meta-analysis P value = 3.3 × 10^-7
[Forest plot x-axis: 0.722, 1, 1.39]
Figure 2 Association of rs10830963 with type 2 diabetes (T2D) in 13 case-control studies.
VOLUME 41 | NUMBER 1 | JANUARY 2009 NATURE GENETICS
Meta-analysis of SNP rs10830963:
Combining findings from multiple cohorts
Prokopenko, 2009
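The overall estimate in the forest plot can be approximated from the per-study odds ratios with a fixed-effect, inverse-variance meta-analysis. The sketch below is an illustration under stated assumptions, not the authors' exact procedure: standard errors are back-calculated from the printed 95% confidence intervals on the log scale, so rounding in the slide's numbers carries through, and the function name is hypothetical.

```python
import math

# (study, odds ratio, lower 95% limit, upper 95% limit) as read from the forest plot.
STUDIES = [
    ("DGI", 1.12, 0.96, 1.30), ("FUSION", 1.20, 1.03, 1.39),
    ("WTCCC", 1.07, 0.95, 1.20), ("deCODE", 1.14, 1.03, 1.27),
    ("KORA", 1.00, 0.84, 1.19), ("Rotterdam", 1.17, 1.04, 1.30),
    ("CCC", 1.07, 0.88, 1.31), ("ADDITION/ELY", 1.16, 1.02, 1.33),
    ("Norfolk", 1.00, 0.90, 1.10), ("UKT2DGC", 1.03, 0.96, 1.10),
    ("OxGN/58BC", 0.91, 0.75, 1.10), ("FUSION Stage 2", 1.15, 1.02, 1.30),
    ("METSIM", 1.16, 1.03, 1.30),
]


def fixed_effect_meta(studies, z=1.96):
    """Inverse-variance pooled odds ratio with a 95% confidence interval."""
    total_weight = 0.0
    weighted_sum = 0.0
    for _name, odds_ratio, lower, upper in studies:
        se = (math.log(upper) - math.log(lower)) / (2 * z)  # SE of log(OR) from the CI
        weight = 1.0 / se ** 2
        total_weight += weight
        weighted_sum += weight * math.log(odds_ratio)
    pooled_log = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (math.exp(pooled_log),
            math.exp(pooled_log - z * pooled_se),
            math.exp(pooled_log + z * pooled_se))


if __name__ == "__main__":
    pooled, low, high = fixed_effect_meta(STUDIES)
    print(f"pooled OR = {pooled:.2f} ({low:.2f}, {high:.2f})")
```

Larger studies have narrower intervals and therefore larger weights, which is why no single small cohort drives the combined estimate.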
48. ARTICLES
By combining genome-wide association data from 8,130 individuals with type 2 diabetes (T2D) and 38,987 controls of European descent and following up previously unidentified meta-analysis signals in a further 34,412 cases and 59,925 controls, we identified 12 new T2D association signals with combined P < 5 × 10^-8. These include a second independent signal at the KCNQ1 locus; the first report, to our knowledge, of an X-chromosomal association (near DUSP9); and a further instance of overlap between loci implicated in monogenic and multifactorial forms of diabetes (at HNF1A). The identified loci affect both beta-cell function and insulin action, and, overall, T2D association signals show evidence of enrichment for genes involved in cell cycle regulation. We also show that a high proportion of T2D susceptibility loci harbor independent association signals influencing apparently unrelated complex traits.
Type 2 diabetes (T2D) is characterized by insulin resistance and deficient beta-cell function (ref. 1). The escalating prevalence of T2D and the limitations of currently available preventative and therapeutic options highlight the need for a more complete understanding of T2D pathogenesis. To date, approximately 25 genome-wide significant common variant associations with T2D have been described, mostly through genome-wide association (GWA) analyses (refs 2-13). The identities of the variants and genes mediating the susceptibility effects at most of these signals have yet to be established, and the known variants account for less than 10% of the overall estimated genetic contribution to T2D predisposition. Although some of the unexplained heritability will reflect variants poorly captured by existing GWA platforms, we reasoned that an expanded meta-analysis of existing GWA data would ...
... the inverse-variance method (Online Methods, Fig. 1, Supplementary Tables 1 and 2 and Supplementary Note). We observed only modest genomic control inflation (λgc = 1.07), suggesting that the observed results were not due to population stratification. After removing SNPs within established T2D loci (Supplementary Table 3), the resulting quantile-quantile plot was consistent with a modest excess of disease associations of relatively small effect (Supplementary Note). Weak evidence for association at HLA variants strongly associated with autoimmune forms of diabetes (Supplementary Table 3 and Supplementary Note) suggested some case admixture involving subjects with type 1 diabetes or latent autoimmune diabetes of adulthood; however, failure to detect T2D associations at other non-HLA type 1 diabetes susceptibility loci (for example, INS, PTPN22 and ...
Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis
Voight, 2010
Meta-analyses for T2D:
N>40K and 90K identifies >30 loci among 2,400,000 SNPs
49. ARTICLES
13 autosomal loci exceeded the threshold for genome-wide significance (P ranging from 2.8 × 10^-8 to 1.4 × 10^-22) with allele-specific odds ... (r² < 0.05), and conditional analyses (see below) establish these SNPs as independent (Fig. 2 and Supplementary Table 4). Further analysis ...
[Figure: Manhattan plots, -log10(P) by chromosome (1-22, X), for the unconditional analysis (top) and the conditional analysis (bottom). Legend: locus established previously; locus identified by current study; locus not confirmed by current study; suggestive statistical association (P < 1 × 10^-5); association in identified or established region (P < 1 × 10^-4). Labelled loci: BCL11A, THADA, NOTCH2, ADAMTS9, IRS1, IGF2BP2, WFS1, ZBED3, CDKAL1, HHEX/IDE, KCNQ1 (2 signals), TCF7L2, KCNJ11, CENTD2, MTNR1B, HMGA2, ZFAND6, PRC1, FTO, HNF1B, DUSP9, TSPAN8/LGR5, HNF1A, CDC123/CAMK1D, CHCHD9, CDKN2A/2B, SLC30A8, TP53INP1, JAZF1, KLF14, PPAR]
Figure 1 Genome-wide Manhattan plots for the DIAGRAM+ stage 1 meta-analysis. Top panel summarizes the results of the unconditional meta-analysis. Previously established loci are denoted in red and loci identified by the current study are denoted in green. The ten signals in blue are those taken forward but not confirmed in stage 2 analyses. The genes used to name signals have been chosen on the basis of proximity to the index SNP and should not be presumed to indicate causality. The lower panel summarizes the results of equivalent meta-analysis after conditioning on 30 previously established and newly identified autosomal T2D-associated SNPs (denoted by the dotted lines below these loci in the upper panel). Newly discovered conditional signals (outside established loci) are denoted with an orange dot if they show suggestive levels of significance (P < 10^-5), whereas secondary signals close to already confirmed T2D loci are shown in purple (P < 10^-4).
Meta-analyses for T2D:
N>40K and 90K identifies >30 loci among 2,400,000 SNPs
51. ~90% of GWAS hits are non-coding!
Stamatoyannopoulos, Science 2012
Systematic Localization of Common
Disease-Associated Variation in
Regulatory DNA
Matthew T. Maurano, Richard Humbert, Eric Rynes, Robert E. Thurman, Eric Haugen, Hao Wang, Alex P. Reynolds, Richard Sandstrom, Hongzhu Qu, Jennifer Brody, Anthony Shafer, Fidencio Neri, Kristen Lee, Tanya Kutyavin, Sandra Stehling-Sun, Audra K. Johnson, Theresa K. Canfield, Erika Giste, Morgan Diegel, Daniel Bates, R. Scott Hansen, Shane Neph, Peter J. Sabo, Shelly Heimfeld, Antony Raubitschek, Steven Ziegler, Chris Cotsapas, Nona Sotoodehnia, Ian Glass, Shamil R. Sunyaev, Rajinder Kaul, John A. Stamatoyannopoulos
Genome-wide association studies have identified many noncoding variants associated with common
diseases and traits. We show that these variants are concentrated in regulatory DNA marked by
deoxyribonuclease I (DNase I) hypersensitive sites (DHSs). Eighty-eight percent of such DHSs are active
during fetal development and are enriched in variants associated with gestational exposure-related
phenotypes. We identified distant gene targets for hundreds of variant-containing DHSs that may explain
phenotype associations. Disease-associated variants systematically perturb transcription factor recognition
sequences, frequently alter allelic chromatin states, and form regulatory networks. We also demonstrated
tissue-selective enrichment of more weakly disease-associated variants within DHSs and the de novo
identification of pathogenic cell types for Crohn's disease, multiple sclerosis, and an electrocardiogram
trait, without prior knowledge of physiological mechanisms. Our results suggest pervasive involvement of
regulatory DNA variation in common human disease and provide pathogenic insights into diverse disorders.
Disease- and trait-associated genetic variants are rapidly being identified with genome-wide association studies
(GWAS) and related strategies (1). To date, hundreds of GWAS have been conducted, spanning diverse diseases and
quantitative phenotypes (2) (fig. S1A). However, the majority (~93%) of disease- and trait-associated variants
emerging from these studies lie within noncoding sequence (fig. S1B), complicating their functional evaluation.
Several lines of evidence suggest the involvement of a proportion of such variants in transcriptional regulatory
mechanisms, including modulation of promoter and enhancer elements (3-6) and enrichment within expression
quantitative trait loci (eQTL) (3, 7, 8).
Human regulatory DNA encompasses a variety of cis-regulatory elements within which the cooperative binding of
transcription factors creates focal alterations in chromatin structure. Deoxyribonuclease I (DNase I)
hypersensitive sites (DHSs) are sensitive and precise markers of this actuated regulatory DNA, and DNase I mapping
has been instrumental in the discovery and census of human cis-regulatory elements (9). We performed DNase I
mapping genome-wide (10) in 349 cell and tissue samples, including 85 cell types studied under the ENCODE Project
(10) and 264 samples studied under the Roadmap Epigenomics Program (11). These encompass several classes
nome. In total, we identified 3,899,693 distinct DHS positions along the genome (collectively spanning 42.2%),
each of which was detected in one or more cell or tissue types (median = 5).
Disease- and trait-associated variants are concentrated in regulatory DNA. We examined the distribution of 5654
noncoding genome-wide significant associations [5134 unique single-nucleotide polymorphisms (SNPs); fig. S1 and
table S2] for 207 diseases and 447 quantitative traits (2) with the deep genome-scale maps of regulatory DNA
marked by DHSs. This revealed a collective 40% enrichment of GWAS SNPs in DHSs (fig. S1C, P < 10^-55, binomial,
compared to the distribution of HapMap SNPs). Fully 76.6% of all noncoding GWAS SNPs either lie within a DHS
(57.1%, 2931 SNPs) or are in complete linkage disequilibrium (LD) with SNPs in a nearby DHS (19.5%, 999 SNPs)
(Fig. 1A) (12). To confirm this enrichment, we sampled variants from the 1000 Genomes Project (13) with the same
genomic feature localization (intronic versus intergenic), distance from the nearest transcriptional start site,
and allele frequency in individuals of European ancestry. We confirmed significant enrichment both for SNPs
within DHSs (P < 10^-59, simulation) and also including variants in complete LD (r^2 = 1) with SNPs in DHSs
(P < 10^-37, simulation) (fig. S2).
In total, 47.5% of GWAS SNPs fall within gene bodies (fig. S1B); however, only 10.9% of intronic GWAS SNPs within
DHSs are in strong LD (r^2 ≥ 0.8) with a coding SNP, indicating that the vast majority of noncoding genic variants
are not simply tagging coding sequence. Analogously, only 16.3% of GWAS variants within coding sequences are in
strong LD with variants in DHSs. SNPs on widely used genotyping arrays (e.g., Affymetrix) were modestly enriched
within DHSs (fig. S2), possibly due to selection of SNPs with robust experimental performance in genotyping
assays. However, we found no evidence for sequence composition bias (table S3).
To further examine the enrichment of GWAS SNPs in regulatory DNA, we systematically classified all noncoding GWAS
SNPs by the quality
1 Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA. 2 Laboratory of Disease Genomics
RESEARCH ARTICLE
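The percentages quoted in the excerpt can be checked against the SNP counts it reports. The sketch below assumes the denominator is the 5134 unique noncoding GWAS SNPs; that choice is an inference from the numbers, not something the excerpt states explicitly, and the helper name is hypothetical.

```python
UNIQUE_SNPS = 5134   # unique noncoding GWAS SNPs (assumed denominator)
IN_DHS = 2931        # SNPs lying within a DHS, from the excerpt
IN_LD = 999          # SNPs in complete LD with a SNP in a nearby DHS


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


pct_in_dhs = percent(IN_DHS, UNIQUE_SNPS)
pct_in_ld = percent(IN_LD, UNIQUE_SNPS)
pct_either = percent(IN_DHS + IN_LD, UNIQUE_SNPS)

print(f"within a DHS:        {pct_in_dhs:.2f}%")
print(f"complete LD with DHS: {pct_in_ld:.2f}%")
print(f"either:              {pct_either:.2f}%")
```

With this denominator the two parts round to the quoted 57.1% and 19.5%. The combined ratio comes out just above 76.5%, so the quoted 76.6% appears consistent with adding the two rounded parts.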
52. There have been few, if any, similar bursts of discovery in the
history of medical research.
David Hunter and Peter Kraft, NEJM, 2007
53. Common claims discussed with regard to GWAS:
Despite its issues, GWAS has yielded many discoveries relative to its cost
to a doubling of the number of associated variants discovered. The proportion of genetic variation explained by
significantly associated SNPs is usually low (typically less than 10%) for many complex traits, but for diseases
such as CD and multiple sclerosis (MS [MIM 126200]), and for quantitative traits such as height and lipid traits,
between
Figure 1. GWAS Discoveries over Time
Data obtained from the Published GWAS Catalog (see Web Resources). Only the top SNPs representing loci with
association p values < 5 × 10^-8 are included, and so that multiple counting is avoided, SNPs identified for the
same traits with LD r^2 > 0.8 estimated from the entire HapMap samples are excluded.
Five years of GWAS Discovery (Visscher, 2012)
~500,000 SNP chips × ~$500/chip = $250M
$250M / ~2000 loci = $125K/locus
Candidate genes: >$250M!
100 NIH R01s
Fighter jet
Hadron Collider: $9B
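The back-of-the-envelope arithmetic on this slide can be reproduced in a few lines. This is only a sketch using the slide's approximate figures; the variable names are made up for the example, and the comparison ratio at the end is derived here rather than quoted from the slide.

```python
# Approximate figures taken from the slide
N_CHIPS = 500_000                    # SNP chips run over five years of GWAS
COST_PER_CHIP = 500                  # US dollars per chip
N_LOCI = 2_000                       # loci discovered
HADRON_COLLIDER_COST = 9_000_000_000 # US dollars, as quoted on the slide

total_cost = N_CHIPS * COST_PER_CHIP
cost_per_locus = total_cost / N_LOCI
collider_ratio = HADRON_COLLIDER_COST / total_cost

print(f"Total GWAS cost: ${total_cost:,}")
print(f"Cost per locus:  ${cost_per_locus:,.0f}")
print(f"Collider cost / GWAS cost: {collider_ratio:.0f}x")
```

On these rough numbers the total comes to $250M, or $125K per locus, a small fraction of the $9B quoted for the Hadron Collider.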
54. P = G + E
Type 2 Diabetes
Cancer
Alzheimer's
Gene expression
Phenotype Genome
Variants
Environment
Infectious agents
Nutrients
Pollutants
Drugs
Complex traits are a function of genes and
environment...
55. Nothing comparable to elucidate E influence!
We lack high-throughput methods
and data to discover new E in P...
E: ???
58. $H^2 = \sigma^2_G / \sigma^2_P$
Heritability ($H^2$) is the share of phenotypic variability ($\sigma^2_P$)
attributed to genetic variability ($\sigma^2_G$) in a population
Indicator of the proportion of phenotypic
differences attributed to G.
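To make the variance ratio concrete, the sketch below simulates a trait under the additive model P = G + E and computes H^2 as var(G) / var(P). The variances (0.6 and 0.4), the sample size, the normal distributions, and the function names are all assumptions chosen for illustration. In real data G is not observed directly, so heritability has to be estimated indirectly, for example from relatives.

```python
import random
from statistics import pvariance


def simulate_trait(n: int, var_g: float, var_e: float, seed: int = 0):
    """Simulate P = G + E with independent, normally distributed G and E."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, var_g ** 0.5) for _ in range(n)]
    e = [rng.gauss(0.0, var_e ** 0.5) for _ in range(n)]
    p = [gi + ei for gi, ei in zip(g, e)]
    return g, p


def broad_sense_heritability(genetic_values, phenotypes) -> float:
    """H^2 = variance of genetic values / variance of phenotypes."""
    return pvariance(genetic_values) / pvariance(phenotypes)


# Hypothetical trait: 60% of variance genetic, 40% environmental
g, p = simulate_trait(n=20_000, var_g=0.6, var_e=0.4)
h2 = broad_sense_heritability(g, p)
print(f"Estimated H^2: {h2:.2f}")
```

With these settings the estimate should fall close to the simulated value of 0.6, with some sampling noise.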
59. Height is an example of a heritable trait:
Francis Galton shows how it's done (1887)
"mid-height of 205 parents
described 60% of variability of 928
offspring"
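One common way to analyse data of this kind is to regress offspring height on mid-parent height; under simplifying assumptions the slope of that line is an estimate of heritability. The sketch below runs such a regression on simulated data. The heights, spreads, and seed are hypothetical and are not Galton's records. The target slope of 0.6 and the 928 offspring are borrowed from the slide purely for illustration.

```python
import random


def ols_slope(x, y) -> float:
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate_families(n_offspring: int, slope: float, mean: float = 68.0,
                      sd_mid: float = 1.8, sd_noise: float = 2.2, seed: int = 1):
    """Simulate mid-parent and offspring heights (inches; hypothetical values)."""
    rng = random.Random(seed)
    mid = [rng.gauss(mean, sd_mid) for _ in range(n_offspring)]
    off = [mean + slope * (m - mean) + rng.gauss(0.0, sd_noise) for m in mid]
    return mid, off


mid_parent, offspring = simulate_families(n_offspring=928, slope=0.6)
estimate = ols_slope(mid_parent, offspring)
print(f"Slope of offspring on mid-parent height: {estimate:.2f}")
```

Because the simulation builds the slope in directly, this only shows that the regression recovers it; it says nothing about the true heritability of height.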