The Genome in a Bottle (GIAB) project provides reference materials and benchmarks for validating genome sequencing and variant calling. It has characterized variants in five human genomes, including both common and difficult variants. While GIAB currently enables benchmarking of easier variants, it is working to characterize more difficult variants and regions. Many challenges remain in benchmarking structural variants and lower-confidence regions, and collaborations are welcome to help address them.
1. Best Practices for Using Genome in a Bottle Reference
Materials to Benchmark Small and Large Germline
Variant Calls
Justin Zook, on behalf of the GIAB Consortium
NIST Genome-Scale Measurements Group
Joint Initiative for Metrology in Biology (JIMB)
August 21, 2018
2. Take-home Messages
• Genome in a Bottle is:
– “Open science”
– Authoritative characterization of human genomes
• Currently enables benchmarking of “easier” variants
– Clinical validation
– Technology development, optimization, and demonstration
• Now working on difficult variants and regions
– Finalizing a tiered benchmark set of variants >=50 bp plus confident regions
– Many challenges remain and collaborations welcome!
3. Why Genome in a Bottle?
• A map of every individual’s genome
will soon be possible, but how will
we know if it is correct?
• Diagnostics and precision medicine
require high levels of confidence
• Well-characterized, broadly
disseminated genomes are needed
to benchmark performance of
sequencing
• Open, transparent data/analyses
• Enable technology development,
optimization, and demonstration
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
4. GIAB is evolving with technologies
• 2012: No human benchmark calls available; GIAB Consortium formed
• 2014: Small variant genotypes for ~77% of pilot genome NA12878
• 2015: NIST releases first human genome Reference Material
• 2016: 4 new genomes; small variants for 90% of 5 genomes on GRCh37/38
• 2017+: Characterizing difficult variants; developing tumor samples
5. GIAB has characterized 5 human genome RMs
• Pilot genome
– NA12878
• PGP Human Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be processed and analyzed in the same way as any other extracted DNA sample.
Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high-confidence regions is:
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
Latest small variant characterization: https://doi.org/10.1101/281006
6. Open consent enables secondary reference samples
• >30 products now available
based on broadly-consented,
well-characterized GIAB PGP cell
lines
• Genomic DNA + DNA spike-ins
– Clinical variants
– Somatic variants
– Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing
• …
7. Example Derived Product addresses clinical need
• NIST RMs do not contain many
challenging variants in clinically
targeted regions
• SeraCare developed a product with
synthetic complex variants spiked
into GIAB cell line DNA
• An interlaboratory study
demonstrated the utility of these
to evaluate detection of difficult
variants
8. GIAB convenes unique, diverse stakeholders
• Government
• Clinical laboratories
• Academic laboratories
• Bioinformatics developers
• NGS technology developers
• Reference samples
9. All data and analyses are open and public
• 51 authors, 14 institutions, 12 datasets, 7 genomes
• Data described in ISA-tab
• New data on GIAB NCBI FTP
10. GIAB “Open Science” Virtuous Cycle
• Cycle: users analyze GIAB samples → benchmark vs. GIAB data → critical feedback to GIAB → GIAB integrates new methods → new benchmark data → (repeat)
• Users analyze GIAB samples during method development, optimization, and demonstration, and as part of assay validation
• Feedback helps GIAB/NIST expand to more difficult regions
11. Evolution of high-confidence small variants

Version | HC Regions | HC Calls | HC indels | Concordant with PG | NIST-only in beds | PG-only in beds | PG-only variants | Phased
v2.19 | 2.22 Gb | 3,153,247 | 352,937 | 3,030,703 | 87 | 404 | 1,018,795 | 0.3%
v3.2.2 | 2.53 Gb | 3,512,990 | 335,594 | 3,391,783 | 57 | 52 | 657,715 | 3.9%
v3.3 | 2.57 Gb | 3,566,076 | 358,753 | 3,441,361 | 40 | 60 | 608,137 | 8.8%
v3.3.2 | 2.58 Gb | 3,691,156 | 487,841 | 3,529,641 | 47 | 61 | 469,202 | 99.6%

Curation of discordant calls: 5-7 errors in NIST (NIST-only sites) and 1-7 errors in NIST (PG-only sites), i.e., ~2 FPs and ~2 FNs per million NIST variants in PG and NIST bed files
12. Important characteristics of benchmark calls
What does “gold standard” mean?
1. Accurate
– high-confidence variants,
genotypes, haplotypes, and regions
– When results from any method are
compared to the benchmark, the
majority of differences (FPs/FNs)
are errors in the method
2. Representative examples
– Different types of variants in
different genome contexts
3. Comprehensive characterization
– Many examples of different variant
types/genome contexts
– Eventually, diploid assembly
benchmarking
13. Best Practices for Benchmarking Small Variants
https://github.com/ga4gh/benchmarking-tools
https://doi.org/10.1101/270157
https://precision.fda.gov/
• Describe public “Truth” VCFs with confident regions
• Enable stratification of performance in difficult regions
• Tools to compare different representations of complex variants
• Standardized intermediate VCF output of comparison tools
• Standardized definitions of performance metrics based on matching stringency
• Standardized output formats for performance metrics
• Web-based interface for performance metrics
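The standardized performance metrics referenced above (precision, recall/sensitivity, F1) can be illustrated with a minimal sketch. The counts and the helper name `performance_metrics` are hypothetical; real comparison tools such as hap.py derive TP/FP/FN via sophisticated variant matching rather than from raw counts.

```python
# Minimal sketch: precision, recall, and F1 from TP/FP/FN counts produced
# by a benchmark comparison. All counts below are made up.

def performance_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1, guarding against empty counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical whole-genome small-variant comparison:
m = performance_metrics(tp=3_500_000, fp=700, fn=1_500)
print(f"precision={m['precision']:.6f} recall={m['recall']:.6f} f1={m['f1']:.6f}")
```

Stratifying these counts by variant type and genome context, as the GA4GH tools do, simply means computing the same metrics per stratum.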
15. FN rates high in some tandem repeats
[Figure: FN rate vs. average coverage (0.3x, 1x, 3x, 10x, 30x), stratified by indel size (11-50 bp, 51-200 bp) and tandem repeat unit size (2 bp, 3 bp, 4 bp)]
16. What are we assessing and what is still challenging?

Type of variant | Genome context | Fraction of variants called* | Number of variants missing* | How to improve?
Simple SNPs | Not repetitive | ~97% | >100k | Machine learning
Simple indels | Not repetitive | ~93% | >10k | Machine learning
All variants | Low mappability | <30% | >170k | Use linked reads and long reads
All variants | Regions not in GRCh37/38 | 0 | >>100k??? | De novo assembly; long reads
Small indels | Tandem repeats and homopolymers | <50% | >200k | STR/homopolymer callers; long reads; better handling of complex and compound variants
Indels 15-50bp | All | <25% | >30k | Assembly-based callers; integrate larger variants differently; long reads
Indels >50bp | All | <1% | >20k |

* Approximate values based on the fraction of variants in GATK HaplotypeCaller or FermiKit calls that are inside v3.3.2 high-confidence regions
17. How can we extend our approach to structural
variants?
Similarities to small variants
• Collect callsets from multiple
technologies
• Compare callsets to find calls
supported by multiple technologies
Differences from small variants
• Callsets have limited sensitivity
• Variants are often imprecisely
characterized
– breakpoints, size, type, etc.
• Representation of variants is poorly
standardized, especially when complex
• Comparison tools are still in their infancy
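Because SV breakpoints and sizes are often imprecise, comparison tools match calls leniently rather than requiring exact coordinates. A minimal sketch of one common approach, reciprocal overlap plus size similarity; the thresholds and function names are illustrative, and real tools such as truvari also use sequence-based matching:

```python
# Hypothetical sketch of lenient SV matching: two deletion calls count as
# the same event if they share enough reciprocal overlap and have similar
# sizes. Thresholds here are illustrative, not GIAB's actual parameters.

def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Fraction of the longer interval covered by the intersection."""
    inter = max(0, min(a_end, b_end) - max(a_start, b_start))
    longer = max(a_end - a_start, b_end - b_start)
    return inter / longer if longer else 0.0

def svs_match(a, b, min_ro=0.5, min_size_ratio=0.7):
    """a, b: (chrom, start, end) deletion calls; True if they likely match."""
    if a[0] != b[0]:
        return False
    ro = reciprocal_overlap(a[1], a[2], b[1], b[2])
    small, large = sorted([a[2] - a[1], b[2] - b[1]])
    size_ratio = small / large if large else 0.0
    return ro >= min_ro and size_ratio >= min_size_ratio

# Two callers report the same ~300 bp deletion with shifted breakpoints:
print(svs_match(("chr1", 10_000, 10_300), ("chr1", 10_050, 10_340)))  # True
```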
18. Integration of diverse data types and analyses
• Data publicly available
– Deep short reads
– Linked reads
– Long reads
– Optical/nanopore mapping
• Analyses
– Small variant calling
– SV calling
– Local and global assembly
Integration workflow:
1. Discover & refine sequence-resolved calls from multiple datasets & analyses
2. Compare variant and genotype calls from different methods
3. Evaluate/genotype calls with other data
4. Identify features associated with reliability of calls from each method
5. Form benchmark calls using heuristics & machine learning
6. Compare benchmarks to high-quality callsets and examine differences
19. Evolution of SV calls for AJ Trio
• v0.2.0: Only deletions; overlap- and size-based clustering; output sites with multi-technology support
• v0.3.0: New calling methods; deletions and insertions; sequence-resolved calls; sequence-based clustering; output sites with multi-technology support
• v0.4.0: Include some single-technology calls; evaluate read support to remove some false positives; add genotypes for trio
• v0.5.0: Better calling methods, especially for large insertions; include more single-technology calls; add some phasing info
• v0.6 Tier 1: Exclude clusters of differing calls; high-confidence regions
20. Can you trust the SV benchmark results?
• Important to use sophisticated benchmarking tools:
– github.com/spiralgenetics/truvari
– github.com/nhansen/SVanalyzer
• Volunteers compared their callsets to v0.6 Tier 1, stratified by variant type and overlap with tandem repeats, and manually curated 10 random putative FPs and FNs from each category
• Short reads vs. v0.6: >90% of putative FPs and FNs are errors from short reads
• Long reads vs. v0.6: >90% of putative FNs are errors from long-read methods; ~50% of putative FP insertions appear to be real variants missed in v0.6
Draft SV calls for feedback: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/
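The curation step described above, drawing 10 random putative FPs and FNs from each category, can be sketched as a reproducible stratified sample. The data layout and the helper name `sample_for_curation` are hypothetical:

```python
# Sketch: pick a fixed number of putative FPs/FNs per stratum for manual
# curation. Strata (variant type x tandem-repeat overlap) and records are
# made up for illustration.
import random

def sample_for_curation(calls, per_stratum=10, seed=42):
    """calls: list of dicts with 'stratum' and 'call_id' keys."""
    rng = random.Random(seed)  # fixed seed so the curation set is reproducible
    by_stratum = {}
    for c in calls:
        by_stratum.setdefault(c["stratum"], []).append(c)
    picked = {}
    for stratum, items in by_stratum.items():
        k = min(per_stratum, len(items))  # small strata are taken whole
        picked[stratum] = rng.sample(items, k)
    return picked

calls = [{"stratum": ("DEL", "tandem_repeat"), "call_id": i} for i in range(200)]
calls += [{"stratum": ("INS", "unique"), "call_id": i} for i in range(5)]
picked = sample_for_curation(calls, per_stratum=10)
print({s: len(v) for s, v in picked.items()})  # 10 from the first stratum, 5 from the second
```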
22. Oxford Nanopore “Ultralong reads”
Contributors: Noah Spies, David Catoe, Marc Salit, Matt Loose, Nick Loman, Josh Quick, Nate Olson
• So far: 16x total mapped coverage; 8x in reads > 50 kb; 4x in reads > 100 kb
• Estimated 30-40x total in 2018
23. The road ahead...
• 2018: Further automate integration; large variants; difficult small variants; phasing
• 2019: Difficult small & large variants; somatic sample development; germline samples from new ancestries
• 2020+: Diploid assembly; somatic structural variation; segmental duplications; centromere/telomere; ...
24. Approaches to Benchmarking Variant Calling
• Well-characterized whole genome Reference Materials
• Many samples characterized in clinically relevant regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time
25. Challenges in Benchmarking Variant Calling
• It is difficult to do robust benchmarking of tests designed to
detect many analytes (e.g., many variants)
• Best to benchmark only within high-confidence bed file, but…
• Benchmark calls/regions tend to be biased towards easier
variants and regions
– Some clinical tests are enriched for difficult sites
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
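A minimal sketch of the last point above, using a Wilson score interval for sensitivity; the counts are made up, and the slide does not prescribe this particular interval:

```python
# Sketch: 95% Wilson score confidence interval for a binomial proportion,
# applied to sensitivity = TP / (TP + FN). Counts below are hypothetical.
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a proportion (default ~95% coverage)."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
            / denom)
    return (center - half, center + half)

# e.g., 4,980 of 5,000 benchmark variants in the test's bed file were found:
lo, hi = wilson_interval(4980, 5000)
print(f"sensitivity 95% CI: [{lo:.4f}, {hi:.4f}]")
```

The Wilson interval behaves better than the naive normal approximation when the proportion is near 0 or 1, which is exactly the regime of variant-calling sensitivity.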
26. Take-home Messages
• Genome in a Bottle is:
– “Open science”
– Authoritative characterization of human genomes
• Currently enables benchmarking of “easier” variants
– Clinical validation
– Technology development, optimization, and demonstration
• Now working on difficult variants and regions
– Finalizing a tiered benchmark set of variants >=50 bp plus confident regions
– Many challenges remain and collaborations welcome!
27. Acknowledgements
• NIST/JIMB
– Marc Salit
– Jenny McDaniel
– Lindsay Harris
– David Catoe
– Lesley Chapman
– Nate Olson
– Justin Wagner
– Noah Spies
• FDA
• Genome in a Bottle Consortium
• GA4GH Benchmarking Team
28. For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
Latest small variant benchmark: https://doi.org/10.1101/281006
Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://doi.org/10.1101/270157
Public workshops
– Next workshop tentatively Spring 2019 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc
opportunities
available!