Aug 2013: NIST highly confident genotype calls for NA12878
1. Using Highly Confident Genotype Calls for NA12878 to understand sequencing accuracy
Genome in a Bottle Consortium
Justin Zook, Ph.D. and Marc Salit, Ph.D.
National Institute of Standards and Technology
2. Why create a set of highly confident genotypes for a genome?
• Current validation methods have limited purview or accuracy
• Sanger confirmation
– Limited by number of sites (and sometimes it’s wrong)
• High depth NGS confirmation
– May have same systematic errors
• Genotyping microarrays
– Limited to known (easier) variants
– Problems with neighboring variants, homopolymers, duplications
• Mendelian inheritance
– Can’t account for some systematic errors
• Simulated data
– Generally not very representative of errors in real data
• Ti/Tv
– Varies by region of genome, and only gives overall statistic
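As a minimal sketch of the Ti/Tv statistic mentioned above, the following counts transitions (A<->G, C<->T) and transversions in a list of SNPs. The function name `ti_tv_ratio` and the input format (a list of (ref, alt) base pairs) are assumptions made for illustration and are not taken from the NIST analysis.

```python
# Hypothetical helper (not from the slides): transition/transversion ratio.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """snps: iterable of (ref, alt) single-base pairs. Returns Ti/Tv as a float."""
    ti = tv = 0
    for ref, alt in snps:
        pair = (ref.upper(), alt.upper())
        if pair[0] == pair[1]:
            continue  # not a variant; skip
        if pair in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; Ti/Tv is undefined")
    return ti / tv


# Three transitions and one transversion.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ti_tv_ratio(example))
```

Because the result is one number for the whole call set, it cannot point to which individual calls are wrong, which is the limitation the slide notes.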
3. Goals for Data Integration
• Carefully define highly confident regions of the genome
– distinguish between Hom Ref and Uncertain
• ~0 false positive AND false negative calls in confident regions
• Include as much of the genome as possible in the confident regions (i.e., don’t just take the intersection)
• Avoid bias towards any particular platform
• Avoid bias towards any particular bioinformatics algorithms
5. Integration of Data to Form Highly Confident Genotype Calls
• Candidate variants: find all possible variant sites
• Confident variants: find highly confident sites across multiple datasets
• Find characteristics of bias: identify sites with atypical characteristics signifying sequencing, mapping, or alignment bias
• Arbitration: for each site, remove datasets with decreasingly atypical characteristics until all datasets agree
• Confidence level: even if all datasets agree, identify sites as uncertain if few datasets have typical characteristics, or if the sites fall in known segmental duplications or long repeats
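The arbitration and confidence-level steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the consortium's implementation: the `Call` record, the single `atypicality` score per dataset (higher means more atypical), and the `typical_cutoff` and `min_typical` thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Call:
    dataset: str        # hypothetical dataset label
    genotype: str       # e.g. "0/1"
    atypicality: float  # assumed summary score; higher = more atypical


def arbitrate(calls, typical_cutoff=1.0, min_typical=2, in_difficult_region=False):
    """Return (genotype, status) for one site; status is 'confident' or 'uncertain'."""
    # Order datasets from most typical to most atypical.
    remaining = sorted(calls, key=lambda c: c.atypicality)
    # Drop the most atypical dataset until the remaining genotypes agree.
    while len({c.genotype for c in remaining}) > 1:
        remaining.pop()
    if not remaining:
        return None, "uncertain"
    genotype = remaining[0].genotype
    # Even when datasets agree, require enough datasets with typical characteristics.
    n_typical = sum(c.atypicality < typical_cutoff for c in remaining)
    if in_difficult_region or n_typical < min_typical:
        return genotype, "uncertain"
    return genotype, "confident"


calls = [Call("A", "0/1", 0.2), Call("B", "0/1", 0.4), Call("C", "0/0", 3.5)]
print(arbitrate(calls))
print(arbitrate(calls, in_difficult_region=True))
```

In this sketch, a site in a segmental duplication or long repeat is passed in as `in_difficult_region=True` and is reported as uncertain regardless of agreement.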
6. Characteristics of Sequence Data/Genotype associated with bias
• Systematic sequencing errors
– Strand bias
– Base Quality Rank Sum Test
• Local Alignment problems
– Distance from end of read
– Read Position Rank Sum
– HaplotypeScore
• Mapping problems
– Mapping Quality
– Higher (or lower) than expected coverage (CNV)
– Length of aligned reads
• Abnormal allele balance or Quality/Depth
– Allele Balance
– Quality/Depth
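Two of the characteristics listed above, allele balance and Quality/Depth, are simple ratios. The sketch below shows one way they might be computed and checked for a heterozygous call. The function names, the definition of allele balance as the alternate-read fraction, the use of ref + alt reads as depth, and the thresholds are all assumptions for illustration; real pipelines may define and tune these differently.

```python
def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele (assumed definition)."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return alt_reads / total


def quality_by_depth(qual, depth):
    """Variant quality score normalized by read depth."""
    if depth == 0:
        raise ValueError("depth is zero")
    return qual / depth


def flag_atypical(ref_reads, alt_reads, qual, ab_range=(0.3, 0.7), min_qd=2.0):
    """Return the names of characteristics that look atypical for a het call.

    The thresholds are illustrative placeholders, not published cutoffs.
    """
    flags = []
    ab = allele_balance(ref_reads, alt_reads)
    if not (ab_range[0] <= ab <= ab_range[1]):
        flags.append("allele_balance")
    if quality_by_depth(qual, ref_reads + alt_reads) < min_qd:
        flags.append("quality_by_depth")
    return flags


print(flag_atypical(45, 5, 400.0))   # only 10% of reads support the alt allele
print(flag_atypical(20, 20, 400.0))  # balanced het with adequate quality
```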
7. Regions excluded as uncertain
More recently, we also exclude homopolymers and long STRs, and 30 bp on each side of
uncertain heterozygous and homozygous variant positions
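The 30 bp exclusion around uncertain variant positions can be illustrated with a small interval-padding routine. This is a sketch that assumes single-base positions given as 1-based coordinates and emits BED-style 0-based, half-open intervals; the function name `padded_exclusions` is hypothetical.

```python
def padded_exclusions(positions, pad=30):
    """positions: iterable of (chrom, pos) with 1-based pos.

    Returns merged (chrom, start, end) intervals, 0-based and half-open,
    covering each position plus `pad` bases on each side.
    """
    intervals = sorted(
        (chrom, max(0, pos - 1 - pad), pos + pad) for chrom, pos in positions
    )
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


uncertain = [("chr1", 100), ("chr1", 120), ("chr1", 500)]
for chrom, start, end in padded_exclusions(uncertain):
    print(chrom, start, end, sep="\t")
```

Nearby uncertain positions merge into one excluded block, so the confident regions are what remains after subtracting these intervals.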
8. Example of Arbitration: SSE suspected from strand bias
[Figure: read alignments for Platform A and Platform B. Annotations: Homopolymer; Strand Bias (SNP overrepresented on reverse strands)]
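One common way to quantify strand bias like that in the example above is Fisher's exact test on a 2x2 table of allele (ref/alt) by strand (forward/reverse). The following is a minimal pure-Python sketch; the function name and the example counts are invented for illustration and do not come from the slide.

```python
from math import comb


def strand_bias_pvalue(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test p-value for allele-by-strand counts."""
    row1 = ref_fwd + ref_rev   # reads supporting the reference allele
    row2 = alt_fwd + alt_rev   # reads supporting the alternate allele
    col1 = ref_fwd + alt_fwd   # reads on the forward strand
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x forward-strand reference reads.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(ref_fwd)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Alternate allele seen only on reverse-strand reads (made-up counts).
print(strand_bias_pvalue(20, 20, 0, 15))
# Alternate allele spread evenly across strands.
print(strand_bias_pvalue(20, 20, 8, 7))
```

A small p-value indicates that the alternate allele is distributed across strands differently from the reference allele, which is the pattern that raises suspicion of a systematic sequencing error.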
9. Verification of “Highly Confident” Genotype accuracy
• Sanger sequencing
– 100% accuracy but only 100s of sites
• X Prize Fosmid sequencing
– Artifacts at end of fosmids
• Microarrays
– Differences appear to be FP or FN in arrays
• Broad 250bp HaplotypeCaller
– Very highly concordant, except a few systematic errors and homopolymers
• Platinum genomes pedigree SNPs
– Some systematic errors are inherited; different representations of complex variants
• Real Time Genomics Trio SNPs and indels
– Some interesting sites called by RTG complex caller but have no evidence in mapped reads
10. GCAT – Interactive Performance Metrics
• NIST is working with GCAT to use our highly confident variant calls
• Assess performance of many combinations of mappers and variant callers
• www.bioplanet.com/gcat
11. Why do calls differ from our highly confident genotypes?
Calls not in Integration
• Platform-specific systematic sequencing errors for SNPs
• Analysis-specific
• Difficult-to-map regions
• Indels in long homopolymers
Calls specific to Integration
• Different complex variant representation
• Some are incorrectly filtered as suspected FPs
17. Challenges with assessing performance
• All variant types are not equal
• Nearby variants are often difficult to align
– Multiple representations
• All regions of the genome are not equal
– Homopolymers, STRs, duplications
– Can be similar or different in different genomes
• Labeling difficult variants as uncertain leads to higher apparent accuracy when assessing performance
• Genotypes fall in 3+ categories (not positive/negative)
– standard diagnostic accuracy measures not well posed
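Because genotypes fall into three or more categories, a comparison against the confident calls is more naturally tabulated as a multi-class table than as a 2x2 positive/negative table. The sketch below is one illustrative way to build such a table. It assumes VCF-style genotype strings (such as "0/1") keyed by site; the function names and category labels are hypothetical.

```python
from collections import Counter


def genotype_class(gt):
    """Map a VCF-style genotype string to a coarse category."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "no_call"
    if all(a == "0" for a in alleles):
        return "hom_ref"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_var"


def genotype_confusion(truth, test):
    """Count (truth_class, test_class) pairs over the sites in `truth`.

    truth, test: dicts mapping a site key to a genotype string.
    Sites absent from `test` are counted as 'no_call'.
    """
    table = Counter()
    for site, truth_gt in truth.items():
        test_gt = test.get(site)
        test_class = genotype_class(test_gt) if test_gt is not None else "no_call"
        table[(genotype_class(truth_gt), test_class)] += 1
    return table


truth = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr1", 300): "0/0"}
test = {("chr1", 100): "0/1", ("chr1", 200): "0/1"}
for (truth_class, test_class), n in sorted(genotype_confusion(truth, test).items()):
    print(truth_class, test_class, n)
```

A hom_var site called as het, as in the example, is neither a clean true positive nor a clean false negative, which is why the two-class accuracy measures are not well posed here.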
18. How to incorporate inheritance in multi-platform integration
• Adding confidence
– Site follows expected inheritance pattern (and not all homozygous)
• Identifying errors
– Mendelian inheritance errors
– Sites where all family members are heterozygous
– Some CNVs
• Limitations of inheritance
– All homozygous sites can still be systematic errors
– Some errors can follow inheritance pattern (e.g., incorrect alignment around indel, some CNVs)
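The inheritance checks above can be illustrated for the simplest case, a single trio at a diploid autosomal site. This is a sketch under those assumptions, with hypothetical function names and category labels; it does not model larger pedigrees, sex chromosomes, or CNVs.

```python
from itertools import product


def alleles(gt):
    """Split a VCF-style diploid genotype such as '0/1' into its alleles."""
    return tuple(gt.replace("|", "/").split("/"))


def is_mendelian_consistent(child, mother, father):
    """True if the child could inherit one allele from each parent."""
    child_alleles = sorted(alleles(child))
    return any(
        sorted((m, f)) == child_alleles
        for m, f in product(alleles(mother), alleles(father))
    )


def classify_trio_site(child, mother, father):
    genotypes = [child, mother, father]
    if not is_mendelian_consistent(child, mother, father):
        return "mendelian_error"
    if all(len(set(alleles(g))) == 2 for g in genotypes):
        # Consistent with inheritance, but flagged on the slide as a likely error.
        return "all_het_suspicious"
    if all(len(set(alleles(g))) == 1 for g in genotypes):
        # Consistent, but can still be a systematic error.
        return "consistent_all_homozygous"
    return "consistent_informative"


print(classify_trio_site("0/1", "0/0", "1/1"))
print(classify_trio_site("1/1", "0/0", "0/1"))
print(classify_trio_site("0/1", "0/1", "0/1"))
print(classify_trio_site("1/1", "1/1", "1/1"))
```

Only the last category adds confidence in the sense of the first bullet; the all-homozygous and all-heterozygous cases pass the Mendelian check without ruling out a shared systematic error.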
19. Availability of data, genotype calls, and methods
• Data for NA12878 is available on NCBI GIAB ftp site (see blogs on genomeinabottle.org)
– mirrored to Amazon today
• Highly confident genotype calls and bed files available on GIAB ftp site
• Pre-print of manuscript available on arxiv.org
• See genomeinabottle.org blog posts for more information
20. Acknowledgements
• GCAT – David Mittelman and Jason Wang
• FDA HPC – Mike Mikailov, Brian Fitzgerald, et al.
• HSPH – Brad Chapman, Oliver Hofmann, Win Hide
• Genome in a Bottle Consortium
– www.genomeinabottle.org
• newsletters, blogs, forums, announcements
– new partners welcome! Open to anyone
– targeting pilot reference material availability in early 2014
Editor's Notes
----- Meeting Notes (5/28/13 17:05) -----
ask heng for decoy