1. An integrated map of genetic
variation from 1,092 human
genomes
The 1000 Genomes Project Consortium
http://www.1000genomes.org
Nature 491, 56–65 (01 November 2012)
2. Primary goal
• to create a complete and detailed
catalogue of human genetic
variations, which in turn can be used for
association studies relating genetic
variation to disease.
3. Primary goal
• to discover >95% of the variants (e.g.
SNPs, CNVs, indels) with minor allele
frequencies as low as 1% across the
genome and 0.1–0.5% in gene regions
• to estimate the population
frequencies, haplotype backgrounds and
linkage disequilibrium patterns of variant
alleles
4. Secondary goals
• support of better SNP and probe selection for
genotyping platforms in future studies
• improvement of the human reference sequence.
• the completed database will be a useful tool for
studying regions under selection, variation in
multiple populations and understanding the
underlying processes of mutation and
recombination.
5. Project design
• to sequence each sample to about 4X coverage;
at this depth sequencing cannot provide the
complete genotype of each sample, but should
allow the detection of most variants with
frequencies as low as 1%.
• Combining the data from 2500 samples should
allow highly accurate estimation (imputation) of
the variants and genotypes for each sample that
were not seen directly by the light sequencing.
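The low-coverage argument above can be illustrated with a back-of-the-envelope calculation. This is only a sketch under simplifying assumptions that are not part of the project's actual methods: read depth is Poisson-distributed, each read samples either chromosome with probability 1/2, and sequencing errors are ignored. The function names are hypothetical.

```python
import math

def prob_alt_read_seen(depth: float) -> float:
    """P(at least one read shows the variant) in one heterozygous carrier.

    Assumption: reads carrying the variant allele ~ Poisson(depth / 2).
    """
    return 1.0 - math.exp(-depth / 2.0)

def expected_alt_reads(n_samples: int, allele_freq: float, depth: float) -> float:
    """Expected number of variant-carrying reads pooled over all samples."""
    n_chromosomes = 2 * n_samples              # diploid samples
    alt_chromosomes = n_chromosomes * allele_freq
    return alt_chromosomes * depth / 2.0       # each chromosome gets depth/2 reads

# One carrier at 4X: the variant is missed in roughly one case out of seven.
single = prob_alt_read_seen(4.0)
# 2500 samples, 1% allele frequency, 4X: many supporting reads in the pool.
pooled = expected_alt_reads(2500, 0.01, 4.0)

print(f"{single:.3f}")  # 0.865
print(f"{pooled:.0f}")  # 100
```

Under these assumptions a single sample often lacks enough evidence for a confident genotype, while the pooled data contain on the order of 100 reads supporting a 1% variant, which is why discovery works at low depth and per-sample genotypes are then filled in by imputation.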
6. Project design / Stages
• The 1000 genomes full project has been
divided into phases to represent the dispersed
nature of the sample collection.
7. Project design / Stages / Pilot
Three pilot studies provided data to inform the
design of the full-scale project:
• Pilot 1: low coverage pilot (2-4X, WGS of 180
samples)
• Pilot 2: high coverage pilot (20-60X, WGS of 2
mother-father-adult child trios)
• Pilot 3: the exon targeted pilot (50X, 1000 gene
regions in 900 samples)
The pilot was completed in 2009.
8. Project design / Stages / Phase 1
Phase 1 represents low coverage and exome
data analysis available for the first 1092
samples.
9. Project design / Stages / Phase 1
DONE!
Results published in Nature 491, 56–65 (01
November 2012)
11. Project design / Stages / Phase 2
• Phase 2 represents an expanded set of
samples, around 1700 in number (the sequence
data has been finalized).
• This data is being used for method development
to both improve on existing methods from phase
1 and also develop new methods to handle
features like multi allelic variant sites and true
integration of complex variation and structural
variants.
12. Project design / Stages / Phase 3
• Phase 3 represents 2500 samples including
new African samples and samples from South
Asia. The new methods developed in phase 2
will be applied to this data set and a final
catalogue of variation will be released.
13. Amounts of Data
• Full genomic sequence of 1,700 individuals is
now available (200 TB of genomic data).
14. Amounts of Data
• > 2 human genomes every 24 hours
• 60-fold more sequence data than what has
been published in DNA databases over the
past 25 years.
17. Samples / Ancestry-based groups
• Europe (IBS (Iberian Populations in Spain), GBR (British
from England and Scotland), CEU (Utah residents with
ancestry from northern and western Europe), FIN
(Finnish in Finland), TSI (Toscani in Italia));
• East Asia (JPT (Japanese in Tokyo, Japan), CHB (Han
Chinese in Beijing, China), CHS (Han Chinese South));
• Africa (ASW (African Ancestry in SW USA), YRI (Yoruba
in Ibadan, Nigeria), LWK (Luhya in Webuye, Kenya));
• Americas (MXL (Mexican Ancestry in Los
Angeles, CA, USA), PUR (Puerto Ricans in Puerto
Rico), CLM (Colombians in Medellin, Colombia)).
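For analysis scripts, the grouping above is often convenient as a lookup table. The following is a minimal sketch; the variable and function names are hypothetical, and only the population codes and group labels come from the list above.

```python
# Population codes grouped by ancestry, transcribed from the list above.
ANCESTRY_GROUPS = {
    "Europe": ["IBS", "GBR", "CEU", "FIN", "TSI"],
    "East Asia": ["JPT", "CHB", "CHS"],
    "Africa": ["ASW", "YRI", "LWK"],
    "Americas": ["MXL", "PUR", "CLM"],
}

# Invert the mapping so a population code can be looked up directly.
GROUP_OF = {
    code: group
    for group, codes in ANCESTRY_GROUPS.items()
    for code in codes
}

def ancestry_group(population_code: str) -> str:
    """Return the ancestry-based group for a population code (case-insensitive)."""
    return GROUP_OF[population_code.upper()]

n_populations = len(GROUP_OF)
print(n_populations)          # 14
print(ancestry_group("yri"))  # Africa
```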
18. Data
• combination of low-coverage (2–6x) whole-
genome sequence data, targeted deep (50–100x)
exome sequence data and dense SNP genotype
data.
• the approach was augmented with statistical
methods for selecting higher quality variant calls
from candidates obtained using multiple
algorithms, and for integrating SNPs, indels and larger
structural variants within a single framework.
21. • A key goal of the 1000 Genomes Project was
to identify more than 95% of SNPs at 1%
frequency in a broad set of populations.
• Our current resource includes ~50%, 98% and
99.7% of the SNPs with frequencies of
~0.1%, 1.0% and 5.0%, respectively, in ~2,500
UK-sampled genomes.
23. Genetic variation
• 3.60 million single nucleotide polymorphisms
(SNPs), of which 24,000 were in GENCODE
(coding) regions
• 344,000 small indels (440 coding), an indel-to-SNP
ratio of roughly 1:10 genome-wide; the much lower
ratio in coding regions demonstrates strong
selection against indels there.
• 717 large deletions (the most confident category
of SVs that we currently can detect), of which 39
overlapped GENCODE regions.
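The selection argument for indels can be checked directly from the counts on this slide. The sketch below only does the arithmetic; the variable names are illustrative.

```python
# Counts taken from the slide above.
snps_total = 3_600_000
snps_coding = 24_000
indels_total = 344_000
indels_coding = 440

# SNPs per indel, genome-wide and within coding regions.
genome_ratio = snps_total / indels_total
coding_ratio = snps_coding / indels_coding

# How much rarer indels are, relative to SNPs, inside coding regions.
depletion = coding_ratio / genome_ratio

print(f"genome-wide: 1 indel per {genome_ratio:.1f} SNPs")  # 10.5
print(f"coding:      1 indel per {coding_ratio:.1f} SNPs")  # 54.5
print(f"relative depletion in coding regions: {depletion:.1f}x")  # 5.2x
```

Relative to SNPs, indels are therefore about five times rarer in coding sequence than in the genome as a whole.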
24. • Most common variants (94% of variants with
frequency ≥5%) were known before the
current phase of the project and had their
haplotype structure mapped through earlier
projects.
• Only 62% of variants in the range 0.5–5% and
13% of variants with frequencies of <0.5% had
been described previously.
27. • Variants present at 10% and above across the
entire sample are almost all found in all of the
populations studied.
• By contrast, 17% of low-frequency variants in
the range 0.5–5% were observed in a single
ancestry group, and 53% of rare variants at
0.5% were observed in a single population.
29. • The derived allele frequency distribution
shows substantial divergence between
populations below a frequency of 40%, such
that individuals from populations with
substantial African ancestry carry up to three
times as many low-frequency variants (0.5–
5%) as those of European or East Asian
origin, reflecting ancestral bottlenecks in non-
African populations.
30. • However, individuals from all populations
show an enrichment of rare variants (<0.5%
frequency), reflecting recent explosive
increases in population size and the effects of
geographic differentiation.
32. • Variants present twice across the entire
sample (referred to as f2 variants), typically
the most recent of informative mutations, are
found within the same population in 53% of
cases
• However, between-population sharing
identifies recent historical connections.
34. • At the most highly conserved coding
sites, 85% of non-synonymous variants and
more than 90% of stop-gain and splice-
disrupting variants are below 0.5% in
frequency, compared with 65% of
synonymous variants.
35. • Individuals typically carry more than 2,500
non-synonymous variants at conserved
positions, of which 20–40 are likely to be
damaging, and about 150 loss-of-function (LOF)
variants (splice-site variants, stop gains,
frameshift indels).
• Counting only rare (<0.5%) variants, the
per-individual numbers are much lower: 130–400
non-synonymous variants, 10–20 LOF variants, 2–5
damaging mutations, and 1–2 variants identified
previously from cancer genome sequencing.
38. • The non-synonymous to synonymous ratio
among rare (<0.5%) variants is typically in the
range 1–2, and among common variants in the
range 0.5–1.5, suggesting that 25–50% of rare
non-synonymous variants are deleterious.
• However, the segregating rare load among
gene groups in KEGG pathways varies
substantially.
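One way to see how the non-synonymous to synonymous ratios quoted above lead to the 25–50% figure is the simple excess calculation below. This is a hedged reading, not necessarily the consortium's exact procedure: it assumes the ratio among common variants approximates what survives selection, and that the lower and upper ends of the two ranges go together. The function name is hypothetical.

```python
def deleterious_fraction(ns_ratio_rare: float, ns_ratio_common: float) -> float:
    """Share of rare non-synonymous variants in excess of the common-variant baseline.

    Assumption: the excess of non-synonymous variants among rare alleles,
    relative to common alleles, consists of deleterious variants that
    selection has not yet removed.
    """
    return 1.0 - ns_ratio_common / ns_ratio_rare

# Upper ends of both ranges: rare ratio 2, common ratio 1.5.
low_end = deleterious_fraction(2.0, 1.5)
# Lower ends of both ranges: rare ratio 1, common ratio 0.5.
high_end = deleterious_fraction(1.0, 0.5)

print(low_end)   # 0.25
print(high_end)  # 0.5
```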