1. Introduction/Background
Genome in a Bottle: You’ve Sequenced a genome,
How Well Did You Do?
Justin Zook1, Peter Krusche2, Michael Eberle2, Len Trigg3, Kevin Jacobs4, Brendan O’Fallon5, Marc Salit1,
the Global Alliance for Genomics and Health Benchmarking Team, and the Genome in a Bottle Consortium
(1) Genome-Scale Measurements Group, National Institute of Standards and Technology
(2) Illumina, Inc.; (3) Real Time Genomics; (4) Helix; (5) ARUP Laboratories
• The Global Alliance for Genomics and Health Benchmarking Team has developed a variety of resources for benchmarking germline small variant calls:
• Standardized performance metric definitions (e.g., false positives, false negatives, precision, recall/sensitivity, genotype error rate)
• Links to high-confidence calls and data for benchmark genomes
• Benchmarking tools
• Integrate variant comparison tools into a single benchmarking framework
• Enable stratification of performance by variant type and genome context
• Sophisticated variant comparison tools are important to handle different representations of complex variants
• Benchmarking tools have been used in PrecisionFDA Challenges
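The standardized metric definitions above reduce to simple arithmetic on the TP/FP/FN counts produced by a benchmark comparison. A minimal sketch (the counts here are illustrative, not from the poster):

```python
def metrics(tp, fp, fn):
    """Standard benchmarking metrics from comparison counts."""
    precision = tp / (tp + fp)   # fraction of test calls that are true
    recall = tp / (tp + fn)      # sensitivity: fraction of benchmark calls found
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

print(metrics(tp=9900, fp=100, fn=300))
```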
Public Benchmark Callsets/Genomes
Resources
• Web-based implementation: GA4GH Benchmarking App on
https://precision.fda.gov
• GitHub site: https://github.com/ga4gh/benchmarking-tools
• Description of intermediate formats: doc/ref-impl
• Benchmark descriptions and downloads: resources/high-confidence-sets
• Stratification bed files and descriptions: resources/stratification-bed-files
• Please contribute / join the discussion! Email jzook@nist.gov
Genome       PGP ID    Coriell ID  NIST ID  NIST RM #
CEPH Mother  N/A       GM12878     HG001    8398
AJ Son       huAA53E0  GM24385     HG002    8391 (son)/8392
AJ Father    hu6E4515  GM24149     HG003    8392 (trio)
AJ Mother    hu8E87A9  GM24143     HG004    8392 (trio)
Chinese Son  hu91BD69  GM24631     HG005    8393
Practical Implications of Benchmark Callsets
Results: Stratification by Variant Type and Context
Table 1: Genomes currently being characterized by GIAB by integrating data from multiple technologies. Vials from a large homogeneous batch of DNA are available as NIST Reference Materials (RMs).
Conclusions: Benchmarking Best Practices
Different variant representations change variant counts.
MNP (=> one TP / FP / FN):
chr1 16837188 TGC CGT
SNPs (=> two TP / FP / FN):
chr1 16837188 T C
chr1 16837190 C T
Variant types can change when decomposing
or recomposing variants:
Complex variant:
chr1 201586350 CTCTCTCTCT CA
DEL + SNP:
chr1 201586350 CTCTCTCTCT C
chr1 201586359 T A
Variants cannot always be canonicalized uniquely:
Complex variant:
chr20 21221450 GCCC GCG
Decomposition 1:
chr20 21221450 GC G
chr20 21221452 C G
Decomposition 2:
chr20 21221452 C G
chr20 21221453 C <DEL>
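As a rough illustration of the MNP example above: an equal-length substitution can be decomposed into per-base SNPs mechanically, while variants of unequal length need an alignment step and, as shown above, may not decompose uniquely. A hypothetical sketch:

```python
def decompose_mnp(chrom, pos, ref, alt):
    """Split an equal-length substitution into per-base SNP records."""
    assert len(ref) == len(alt), "unequal lengths need an alignment step"
    return [(chrom, pos + i, r, a)
            for i, (r, a) in enumerate(zip(ref, alt)) if r != a]

print(decompose_mnp("chr1", 16837188, "TGC", "CGT"))
# [('chr1', 16837188, 'T', 'C'), ('chr1', 16837190, 'C', 'T')]
```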
What is the SNP sensitivity in coding exons?
• 97.98% sensitivity vs. PG (Illumina Platinum Genomes)
• FNs predominantly in low-MQ and/or segmental duplication regions
• ~80% of FNs supported by long or linked reads
• 99.96% sensitivity vs. NISTv3.3.2
• 62x lower FN rate than vs. PG
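The FN rate behind these comparisons is FN / (TP + FN) = 1 - sensitivity. The counts below are hypothetical; they also show why ratios like the 62x figure above should be computed from raw counts rather than rounded percentages, which give a different ratio:

```python
def fn_rate(tp, fn):
    """False-negative rate = FN / (TP + FN) = 1 - sensitivity."""
    return fn / (tp + fn)

# hypothetical counts for the same test callset vs. two benchmark sets
vs_pg   = fn_rate(tp=19596, fn=404)   # sensitivity 97.98% -> FN rate 2.02%
vs_nist = fn_rate(tp=19992, fn=8)     # sensitivity 99.96% -> FN rate 0.04%
print(f"{vs_pg / vs_nist:.1f}x lower FN rate vs. NISTv3.3.2")  # ~50x here
```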
True accuracy is hard to estimate, especially in difficult regions.
[Figure: FN rate vs. average, across 0.3x to 30x, stratified by indel size (11 to 50 bp, 51 to 200 bp) and tandem repeat unit length (2 bp, 3 bp, 4 bp)]
What is the FP rate for compound heterozygous indels?
• 93% precision vs. PG
• 4/10 manually inspected putative FPs were errors in test set
• 6/10 were correct in test set (partial calls or missing in PG vcf)
• 95% precision vs. NISTv3.3.2
• 9/10 manually inspected FPs were errors in test set (1 error in v3.3.2)
• Accuracy often varies by variant type and genomic context
• Error rates for complex variants > indels > SNPs
• Error rates in tandem repeats and difficult-to-map regions are greater than in non-repetitive regions
• The benchmarking team has made available a set of bed files describing difficult and interesting regions
• Different types of tandem repeats
• Low-mappability regions
• Segmental duplications
• Coding regions
PrecisionFDA Challenge Results Example (precision.fda.gov)
Genome       PGP ID  Coriell ID  NIST ID  NIST RM #
CEPH Mother  N/A     GM12878     HG001    8398
CEPH Father  N/A     GM12877     N/A      N/A
Table 2: Genomes with high-confidence calls from the Illumina Platinum Genomes Project, generated by phasing parents and 11 children and identifying variants inherited as expected.
Methods: Accounting for different
representations of complex variants
• Complex variants (i.e., nearby SNPs and indels) can usually be correctly represented in multiple ways
• GA4GH Benchmarking tools account for these differences in representation
Sophisticated comparison tools (right) make a significant difference in performance metrics compared to naïve tools (left).
Example Complex Variant where normalization
alone (e.g., with vt) does not work
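One way to see why record-level normalization cannot reconcile different representations: trimming reduces each record independently, so a single complex record never becomes identical to an equivalent SNP + indel pair. A sketch using the CT-repeat example from this poster; the trim function is a simplified stand-in for what tools like vt do, not their actual implementation:

```python
def trim(pos, ref, alt):
    """Right-trim shared suffix, then left-trim shared prefix (keeping at
    least one base), as trimming-style normalizers do per record."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    return pos, ref, alt

complex_rep = [trim(201586350, "CTCTCTCTCT", "CA")]
decomposed  = [trim(201586350, "CTCTCTCTCT", "C"),
               trim(201586359, "T", "A")]
print(complex_rep)   # one trimmed record
print(decomposed)    # still two records: no record-level match is possible
assert complex_rep != decomposed
```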
Benchmark sets: Use benchmark sets with both high-confidence variant calls and high-confidence regions, so that both false negatives and false positives can be assessed.
Stringency of variant comparison: Determine whether it is important that the genotype match exactly, that only the allele match, or that the call just needs to be near the true variant.
Variant comparison tools: Use sophisticated variant comparison engines such as vcfeval, xcmp, or varmatch that can determine whether different representations of the same variant are consistent with the benchmark call. Subsetting by high-confidence regions and, if desired, targeted regions should only be done after comparison, to avoid problems comparing variants with differing representations.
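Conceptually, engines like vcfeval and xcmp compare the haplotype sequences implied by variant calls rather than the VCF records themselves. A simplified sketch using the CT-repeat example above; the flanking sequence is hypothetical, and the DEL + SNP pair is written in a non-overlapping form for simplicity:

```python
def apply_variants(ref_seq, start, variants):
    """Apply sorted, non-overlapping (pos, ref, alt) variants (1-based pos,
    ref_seq begins at `start`) and return the resulting haplotype sequence."""
    out, cursor = [], 0
    for pos, ref, alt in variants:
        i = pos - start
        out.append(ref_seq[cursor:i])
        assert ref_seq[i:i + len(ref)] == ref, "REF does not match reference"
        out.append(alt)
        cursor = i + len(ref)
    out.append(ref_seq[cursor:])
    return "".join(out)

ref = "ACTCTCTCTCTG"                      # hypothetical flanking sequence
hap_complex = apply_variants(ref, 201586349,
                             [(201586350, "CTCTCTCTCT", "CA")])
hap_decomposed = apply_variants(ref, 201586349,
                                [(201586350, "CTCTCTCTC", "C"),   # 8 bp DEL
                                 (201586359, "T", "A")])          # SNP
assert hap_complex == hap_decomposed      # same haplotype, different records
print(hap_complex)                        # ACAG
```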
Manual curation: Manually curate alignments, ideally from multiple data types, around at least a subset of putative false positive and false negative calls, to ensure they are truly errors in the user's callset and to understand the cause(s) of errors. Report back to benchmark set developers any potential errors found in the benchmark set (e.g., using https://goo.gl/forms/ECbjHY7nhz0hrCR52 for GIAB).
Interpretation of metrics: All performance metrics should be interpreted only with respect to the limitations of the variants and regions in the benchmark set. Performance metrics are likely to be lower for more difficult variant types and regions that are not fully represented in the benchmark set, such as those in repetitive or difficult-to-map regions. When comparing methods, method 1 may perform better in the high-confidence regions, but method 2 may perform better for more difficult variants outside the high-confidence regions.
Stratification: Overall performance metrics can be useful, but for many applications it is important to assess performance for particular variant types and genome contexts. Performance often varies significantly across variant types and genome contexts, and stratification allows users to understand this. In addition, stratification allows users to see if some variant types and genome contexts of interest are not sufficiently represented.
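Stratification amounts to partitioning calls by BED-style intervals before computing metrics. A minimal sketch with illustrative intervals and calls (real stratification uses the bed files linked above):

```python
def in_regions(chrom, pos, regions):
    """regions: list of (chrom, start, end), 0-based half-open BED intervals."""
    return any(c == chrom and start <= pos < end for c, start, end in regions)

def stratified_recall(calls, regions):
    """calls: list of (chrom, pos, is_tp) benchmark variants.
    Returns recall inside vs. outside the regions."""
    counts = {True: [0, 0], False: [0, 0]}   # in_region -> [tp, total]
    for chrom, pos, is_tp in calls:
        bucket = counts[in_regions(chrom, pos, regions)]
        bucket[0] += is_tp
        bucket[1] += 1
    return {("inside" if k else "outside"): tp / n
            for k, (tp, n) in counts.items() if n}

tandem_repeats = [("chr1", 100, 200)]                 # hypothetical stratum
calls = [("chr1", 150, True), ("chr1", 160, False),   # inside: 1 TP, 1 FN
         ("chr1", 300, True), ("chr1", 310, True)]    # outside: 2 TP
print(stratified_recall(calls, tandem_repeats))  # {'inside': 0.5, 'outside': 1.0}
```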
Confidence intervals: Confidence intervals for performance metrics such as precision and recall should be calculated. This is particularly critical for the smaller numbers of variants found when benchmarking in targeted regions and/or less common stratified variant types and regions.
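One common choice for such intervals is the Wilson score interval for a binomial proportion such as precision or recall. A stdlib-only sketch with illustrative counts:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# hypothetical precision: 9,900 TPs out of 10,000 calls
lo, hi = wilson_ci(9900, 10000)
print(f"precision 0.9900, 95% CI ({lo:.4f}, {hi:.4f})")
```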