This document discusses using gene expression data from the Gene Expression Omnibus to identify relationships between genes and phenotypic concepts like diseases, environmental exposures, and experimental conditions. It describes extracting concepts from sample annotations using the Unified Medical Language System and relating differential gene expression to these concepts. This establishes a network of relationships between genes, phenotypes, and environmental contexts. Identifying such phenome-genome and envirome-genome relationships could help discover new disease-associated genes.
5. Invited to HLS Meeting
“I think when Ari [Ne’eman] talks about autism and I talk about autism, we’re talking about people with different clusters of autism. I know he doesn’t like the word ‘cure.’ If my daughter could function the way Ari could, I would consider her cured,” says Singer. “I have to believe my daughter doesn’t want to be spending time peeling skin off her arm.”
6. Patterns across tens of thousands of patients…
Preprocessing: (1) We grouped the 6905 distinct (non-procedure) ICD9 codes in the dataset into 802 PheWAS categories (dimensionality reduction). (2) We only considered PheWAS codes with at least 5% prevalence and patients with fewer than 50 occurrences of any particular code in a 6-month period. This preprocessing step left us with 4927 individuals with 45 common category codes.
Clustering: For each patient, we counted the number of occurrences of each code in each 6-month window from age 0 to age 15. We then applied standard hierarchical clustering with Euclidean distance, Ward's linkage, and a minimum cluster size of 2% of the population.
Analysis: Significant elements of clusters were assessed by creating 15,000 permutations of random cluster assignments and building an empirical chi-squared statistic distribution for the observed vs. expected number of code occurrences in each time window in each cluster.
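The counting and permutation steps described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the analysis code behind the slide: the event format `(age_in_months, code)`, the function names, and the per-element statistic `(observed - expected)² / expected` are all hypothetical choices made for the example. The clustering step itself is not reimplemented here; a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would be one option.

```python
import random
from collections import Counter

WINDOW_MONTHS = 6
MAX_AGE_MONTHS = 15 * 12                     # ages 0 to 15 years
N_WINDOWS = MAX_AGE_MONTHS // WINDOW_MONTHS  # 30 six-month windows


def count_features(events, codes):
    """Turn one patient's (age_in_months, code) events into a flat count vector.

    The vector has one entry per (code, window) pair, ordered code by code.
    Events outside the 0-15 year range or with unlisted codes are ignored.
    """
    counts = Counter()
    for age_months, code in events:
        if code in codes and 0 <= age_months < MAX_AGE_MONTHS:
            counts[(code, int(age_months // WINDOW_MONTHS))] += 1
    return [counts[(c, w)] for c in codes for w in range(N_WINDOWS)]


def element_statistic(vectors, labels, cluster, feature):
    """Chi-squared-style term for one (cluster, feature) element.

    Expected count assumes occurrences are spread across clusters in
    proportion to cluster size (an assumption made for this sketch).
    """
    total = sum(vec[feature] for vec in vectors)
    observed = sum(vec[feature] for vec, lab in zip(vectors, labels) if lab == cluster)
    expected = total * labels.count(cluster) / len(labels)
    return (observed - expected) ** 2 / expected if expected > 0 else 0.0


def permutation_p_value(vectors, labels, cluster, feature,
                        n_permutations=1000, seed=0):
    """Empirical P value from random reassignments of the cluster labels."""
    rng = random.Random(seed)
    observed_stat = element_statistic(vectors, labels, cluster, feature)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if element_statistic(vectors, shuffled, cluster, feature) >= observed_stat:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

With 45 codes and 30 windows, each patient would be represented by a 1,350-entry vector, and the permutation test would be repeated for every cluster and feature of interest.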
[Figure: Basic Cluster Characteristics. Labels: patients; code counts for 0-6 months, 6-12 months, 12-18 months; patient clustering.]
13. GIANT study
A further possible source of missing heritability is allelic heterogeneity: the presence of multiple, independent variants influencing a trait at the same locus. We performed genome-wide conditional analyses in a subset of stage 1 studies, including a total of 106,336 individuals. Each study repeated the primary GWA analysis but additionally adjusted for SNPs representing the 180 loci associated at P < 5 × 10⁻⁶ (Supplementary Methods). We then meta-analysed these studies in the same way as for the primary GWA study meta-analysis. Nineteen SNPs within the 180 loci were associated with height at P < 3.3 × 10⁻⁷ (a Bonferroni-corrected significance threshold calculated from the approximately 15% of the genome covered by the conditioned 2 Mb loci; Table 1, Fig. 2, Supplementary Methods and Supplementary Figs 1 and 3). The distances of the second signals to the lead SNPs suggested that both are likely to be affecting the same gene, rather than being coincidentally in close proximity. At 17 of 17 loci (excluding two contiguous loci in the HMGA1 region), the second signal occurred within 500 kilobases (kb), rather than between 500 kb and 1 Mb, of this lead SNP (binomial test P = 2 × 10⁻⁵). Further analyses of allelic heterogeneity may identify additional variants that increase the proportion of variance explained. For example, within the 180 2-Mb loci, a total of 45 independent SNPs reached P < 1 × 10⁻⁵ when we would expect less than 2 by chance.
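Two of the figures in the paragraph above can be roughly reproduced with a few lines of arithmetic. The sketch below rests on assumptions not stated in the excerpt: a null probability of 0.5 for a second signal falling in the nearer 500 kb window (the two windows are equally wide), a two-sided exact test, and a conventional figure of one million independent tests genome-wide (the count implied by the usual 5 × 10⁻⁸ threshold). The function and constant names are illustrative only.

```python
from math import comb


def binomial_two_sided(successes, trials, p=0.5):
    """Exact two-sided binomial P value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under the null probability p.
    """
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))


# 17 of 17 second signals within 500 kb rather than 500 kb to 1 Mb.
p_binom = binomial_two_sided(17, 17)  # 2 * 0.5**17, about 1.5e-5

# Bonferroni threshold for the ~15% of the genome inside the conditioned loci.
ASSUMED_GENOME_WIDE_TESTS = 1_000_000  # assumption, not from the excerpt
GENOME_FRACTION = 0.15
threshold = 0.05 / (GENOME_FRACTION * ASSUMED_GENOME_WIDE_TESTS)  # about 3.3e-7
```

Under these assumptions the binomial P value comes out near 1.5 × 10⁻⁵, which rounds to the quoted 2 × 10⁻⁵, and the threshold lands on the quoted 3.3 × 10⁻⁷. The paper's Supplementary Methods may define either quantity differently.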
Although GWA studies have identified many variants robustly associated with common human diseases and traits, the biological significance of these variants, and the genes on which they act, is often unclear. We first tested the overlap between the 180 height-associated variants and two types of putatively functional variants, non-synonymous (ns) SNPs and cis-expression quantitative trait loci (cis-eQTLs, variants strongly associated with expression of nearby genes). Height variants were 2.4-fold more likely to overlap with cis-eQTLs in lymphocytes than expected by chance (47 variants: P = 4.7 × 10⁻¹¹) (Supplementary Table 7) and 1.7-fold more likely to be closely correlated (r² ≥ 0.8 in the HapMap CEU sample) with nsSNPs (24 variants, P = 0.004) (Supplementary Methods and Supplementary Table 8). Although the presence of a correlated cis-eQTL or nsSNP at an individual locus does not establish the causality of any particular variant, this enrich-
[Figure 1, panel a: proportion of variance explained (y axis) against P-value threshold (x axis, 5.00 × 10⁻⁸ to 5.00 × 10⁻²), showing the average with lower and upper 95% confidence intervals. Proportion of variance explained by 180 SNPs: FINGESTURE 0.08 ± 0.02; RS2 0.11 ± 0.01; RS3 0.11 ± 0.01; GOOD 0.09 ± 0.02; QIMR 0.11 ± 0.02.]
[Figure 1, panel b: cumulative expected variance explained (%) and sample size required (79,000; 134,000; 235,000; 487,000) against cumulative expected number of loci (0 to 700).]
Figure 1 | Phenotypic variance explained by common variants. a, Variance
explained is higher when SNPs not reaching genome-wide significance are
included in the prediction model. The y axis represents the proportion of
100’s of genes implicated…
14. Criteria for Treatment
• “Growth hormone deficiency (GHD)”
• “Idiopathic short stature (ISS), defined by height standard deviation score ≤ -2.25” associated with growth rates unlikely to result in normal adult height, in whom other causes of short stature have been excluded
and a little story from 25 years ago
19. Survival 3 Years After a WBC Test
(White, Male, 50-69 Years; Using Last WBC Between 7/28/05 and 7/27/06)
20. But over most of medicine…
• Even the most basic of autonomy, taking your data with you, is not the status quo.
21.
22. What does data tell us about human rights and autonomy
• There is no “normal” but there are desired outcomes.
• Utilities are not shared across parents, patients, providers and payors.
• Autonomy makes the data-sharing broader.
• Broader data sharing highlights distinct utility functions.
• Activist-level data sharing today
  – Less energy required with #OpenData
• Much to be done in getting data analyses done “right”
• In healthcare: Recognize and harness patients as collaborators.