GIAB update for GRC GIAB workshop 191015
1. Genome in a Bottle: Developing Benchmarks for Challenging Variants With Linked & Long Reads
October 15, 2019
www.slideshare.net/genomeinabottle
2. NIST Human Genomics Team
• Purpose: Inspire trust in
human genome
measurements to enable
– Technology innovation
– Clinical translation
– Science-based regulatory
oversight
– Human health
• Values:
– Understand stakeholder
needs
– Collaborate with experts and
synthesize results
• Sequencing technologies
• Informatics developers
– Open science
• Open data
• Open analyses
• Open samples
3. Why start Genome in a Bottle?
• A map of every individual’s
genome will soon be possible, but
how will we know if it is correct?
• Diagnostics and precision
medicine require high levels of
confidence
• Well-characterized, broadly
disseminated genomes are needed
to benchmark performance of
sequencing
• NIST and FDA funding for the work
O’Rawe et al, Genome Medicine, 2013
https://doi.org/10.1186/gm432
4. Human Genome Sequencing needed a new class of
Reference Materials with billions of reference values
By Russ London at English Wikipedia, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=9923576
5. Many diverse contributors to GIAB
Government
Clinical Laboratories
Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders
6. GIAB has characterized 7 human
genomes
• Pilot genome
– NA12878
• PGP Human
Genomes
– Ashkenazi Jewish son
– Ashkenazi Jewish trio
– Chinese son
• Parents also
characterized
National Institute of Standards & Technology
Report of Investigation
Reference Material 8391
Human DNA for Whole-Genome Variant Assessment
(Son of Eastern European Ashkenazim Jewish Ancestry)
This Reference Material (RM) is intended for validation, optimization, and process evaluation purposes. It consists
of a male whole human genome sample of Eastern European Ashkenazim Jewish ancestry, and it can be used to assess
performance of variant calling from genome sequencing. A unit of RM 8391 consists of a vial containing human
genomic DNA extracted from a single large growth of human lymphoblastoid cell line GM24385 from the Coriell
Institute for Medical Research (Camden, NJ). The vial contains approximately 10 µg of genomic DNA, with the peak
of the nominal length distribution longer than 48.5 kb, as referenced by Lambda DNA, and the DNA is in TE buffer
(10 mM TRIS, 1 mM EDTA, pH 8.0).
This material is intended for assessing performance of human genome sequencing variant calling by obtaining
estimates of true positives, false positives, true negatives, and false negatives. Sequencing applications could include
whole genome sequencing, whole exome sequencing, and more targeted sequencing such as gene panels. This
genomic DNA is intended to be analyzed in the same way as any other sample a lab would process and analyze
extracted DNA. Because the RM is extracted DNA, it is not useful for assessing pre-analytical steps such as DNA
extraction, but it does challenge sequencing library preparation, sequencing machines, and the bioinformatics steps of
mapping, alignment, and variant calling. This RM is not intended to assess subsequent bioinformatics steps such as
functional or clinical interpretation.
Information Values: Information values are provided for single nucleotide polymorphisms (SNPs), small insertions
and deletions (indels), and homozygous reference genotypes for approximately 88 % of the genome, using methods
similar to those described in reference 1. An information value is considered to be a value that will be of interest and use to
the RM user, but insufficient information is available to assess the uncertainty associated with the value. We describe
and disseminate our best, most confident, estimate of the genotypes using the data and methods currently available.
These data and genomic characterizations will be maintained over time as new data accrue and measurement and
informatics methods become available. The information values are given as a variant call file (vcf) that contains the
high-confidence SNPs and small indels, as well as a tab-delimited “bed” file that describes the regions that are called
high-confidence. Information values cannot be used to establish metrological traceability. The files referenced in this
report are available at the Genome in a Bottle ftp site hosted by the National Center for Biotechnology Information
(NCBI). The Genome in a Bottle ftp site for the high-confidence vcf and high confidence regions is:
7. Open consent enables secondary reference samples to
meet specific clinical needs
• >50 products now available
based on broadly-consented,
well-characterized GIAB PGP cell
lines
• Genomic DNA + DNA spike-ins
• Clinical variants
• Somatic variants
• Difficult variants
• Clinical matrix (FFPE)
• Circulating tumor DNA
• Stem cells (iPSCs)
• Genome editing
• …
8. Reference Genomes vs. Benchmark Genomes
• Primary uses: mapping and
annotation
• De novo assembly without
reference
• Traditionally not diploid
• Combination of individuals that
often aren’t public samples
• Primary use: benchmarking and
optimization
• Variant calls and regions on
reference genome
• Diploid-aware is essential
• Widely available individual samples
9. Design of our human genome reference values
Benchmark Variant Calls
10. Design of our human genome reference values
Benchmark Variant Calls
Benchmark Regions – regions in which the benchmark contains (almost) all the variants
12. Design of our human genome reference values
Benchmark Variant Calls
Benchmark Regions
Variants from any method being evaluated
13. Design of our human genome reference values
Benchmark Variant Calls
Benchmark Regions
Variants from any method being evaluated
• Matching variants assumed to be true positives
• Majority of variants unique to method should be false positives (FPs)
• Majority of variants unique to benchmark should be false negatives (FNs)
• Variants outside benchmark regions are not assessed
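The comparison logic on this slide can be sketched in a few lines of Python. This is a minimal illustration and not GIAB's or GA4GH's actual implementation: the names `in_regions` and `compare`, the tuple layout for variants, and the toy coordinates are all assumptions made for the example. It also assumes variants match only when chromosome, position, ref, and alt are identical, and it checks only the start position against the regions. Real comparison tools, such as those from the Global Alliance Benchmarking Team listed under For More Information, also reconcile different representations of the same variant.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical layouts for this sketch:
# a variant is (chrom, 1-based position, ref, alt);
# regions map chrom -> sorted, non-overlapping BED-style intervals
# (0-based start, exclusive end).
Variant = Tuple[str, int, str, str]
Regions = Dict[str, List[Tuple[int, int]]]


def in_regions(variant: Variant, regions: Regions) -> bool:
    """Return True if the variant's start position lies in a benchmark region."""
    chrom, pos, _ref, _alt = variant
    intervals = regions.get(chrom, [])
    # Index of the last interval whose start is <= the 0-based position.
    i = bisect_right(intervals, (pos - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos - 1 < intervals[i][1]


def compare(benchmark: List[Variant], query: List[Variant],
            regions: Regions) -> Dict[str, float]:
    """Count TP/FP/FN inside the benchmark regions only."""
    # Variants outside the benchmark regions are not assessed.
    bench = {v for v in benchmark if in_regions(v, regions)}
    calls = {v for v in query if in_regions(v, regions)}
    tp = len(bench & calls)   # matching variants
    fp = len(calls - bench)   # unique to the method being evaluated
    fn = len(bench - calls)   # unique to the benchmark
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall}


# Toy data (invented for illustration only).
regions = {"chr1": [(100, 200), (500, 600)]}
benchmark = [("chr1", 150, "A", "G"), ("chr1", 550, "C", "T"),
             ("chr1", 900, "G", "A")]
query = [("chr1", 150, "A", "G"), ("chr1", 180, "T", "C"),
         ("chr1", 950, "G", "C")]
result = compare(benchmark, query, regions)
print(result)
```

In the toy data, the calls at positions 900 and 950 fall outside the regions and are ignored, leaving one match, one call unique to the method, and one call unique to the benchmark.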
16. GIAB has extensive public,
unembargoed data
Short reads
• BGISEQ
• Complete
Genomics
• Illumina
• Ion Torrent
• SOLiD
Linked reads
• 10x Genomics
• BGISEQ stLFR
• Illumina 6kb
mate-pair
• HiC
• Strand-seq
Long reads
• PacBio
• PacBio CCS
• Promethion
• Ultralong Oxford
Nanopore
Optical/electronic
mapping
• BioNano
• Nabsys
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/
Now >100x
public 10-15kb
CCS data for
HG002
18. Extensive “ultralong” ONT Data
3 Promethion
flow cells
110 MinION
flow cells
4x >250kb
7x
>100kb
David Catoe
Nate Olson
Noah Spies
Marc Salit
Matt Loose
Nick Loman
Josh Quick
20. Now using linked and long reads for
difficult variants and regions
GIAB Public Data
• Linked Reads
– 10x Genomics
– Complete Genomics/BGI stLFR
– Hi-C
– Strand-seq (underway)
• Long Reads
– PacBio Continuous Long Reads
– PacBio Circular Consensus Seq
– Oxford Nanopore “ultralong”
– Promethion
GIAB Use Cases
• Develop structural variant
benchmark
• Diploid assembly of difficult
regions like MHC
• Expand small variant benchmark
21. Figure: SV size distribution panels: 50 to 1000 bp (Alu) and 1kbp to 10kbp (LINE)
Discovery: 498876 (296761 unique) calls >=50bp and 1157458 (521360 unique) calls >=20bp discovered in 30+ sequence-resolved callsets from 4 technologies for AJ Trio
Compare SVs: 128715 sequence-resolved SV calls >=50bp after clustering sequence changes within 20% edit distance in trio
Discovery Support: 30062 SVs with 2+ techs or 5+ callers predicting sequences <20% different or BioNano/Nabsys support in trio
Evaluate/genotype: 19748 SVs with consensus variant genotype from svviz in son
Filter complex: 12745 SVs not within 1kb of another SV
Regions: 9641 SVs inside 2.66 Gbp benchmark regions supported by diploid assembly
v0.6
tinyurl.com/GIABSV06
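As an illustration of the "Filter complex" step, the sketch below keeps only SVs that are not within 1 kb of another SV. It is not the v0.6 pipeline code: the function names `gap` and `filter_isolated`, the `(chrom, start, end)` tuple layout, and the toy coordinates are assumptions for this example. The slide does not say how distance was measured, so the sketch assumes the gap between interval edges, with overlapping SVs counted as distance 0 and a gap strictly under 1000 bp counted as "within 1kb".

```python
from typing import List, Tuple

# Hypothetical layout for this sketch: (chrom, start, end).
SV = Tuple[str, int, int]


def gap(a: SV, b: SV) -> int:
    """Distance between two SV intervals on one chromosome (0 if they overlap)."""
    return max(0, max(a[1], b[1]) - min(a[2], b[2]))


def filter_isolated(svs: List[SV], min_gap: int = 1000) -> List[SV]:
    """Keep SVs with no other SV on the same chromosome closer than min_gap."""
    kept = []
    for i, sv in enumerate(svs):
        has_neighbor = any(
            j != i and other[0] == sv[0] and gap(sv, other) < min_gap
            for j, other in enumerate(svs)
        )
        if not has_neighbor:
            kept.append(sv)
    return kept


# Toy coordinates (invented for illustration only).
svs = [
    ("chr1", 1000, 1100),  # 400 bp from the next SV, so both are dropped
    ("chr1", 1500, 1600),
    ("chr1", 5000, 5300),  # isolated
    ("chr2", 1200, 1250),  # only SV on its chromosome
]
isolated = filter_isolated(svs)
print(isolated)
```

The pairwise check is quadratic in the number of SVs. That keeps the sketch short and correct for nested intervals, but a sorted sweep or interval tree would be the usual choice at the scale of the counts above.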
22. Reference genomes and benchmark genomes
are converging
Reference genomes
that are polished
diploid assemblies of
open cell lines
Benchmark genomes
and tools to stratify
by genome context
and variant type
New diploid
assembly-derived
benchmarks
New tools to assess
diploid assembly
quality
23. The road
ahead... 2019
Integration pipeline development
for small and structural variants
Manuscripts for small and
structural variants
2020
Difficult large variants
Somatic sample development
Germline samples from new
ancestries
Diploid assembly
2021+
Somatic integration pipeline
Somatic structural variation
Large segmental duplications
Centromere/telomere
Diploid assembly benchmarking
...
24. Acknowledgment of many GIAB contributors
Government
Clinical Laboratories
Academic Laboratories
Bioinformatics developers
NGS technology developers
Reference samples
* Funders
25. For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google groups
GIAB slides, including 2019 Workshop slides: www.slideshare.net/genomeinabottle
Public, Unembargoed Data:
– http://www.nature.com/articles/sdata201625
– ftp://ftp-trace.ncbi.nlm.nih.gov/giab/
– github.com/genome-in-a-bottle
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
– Web-based implementation at precision.fda.gov
– Best Practices at https://rdcu.be/bqpDT
Public workshops
– Next workshop planned for April 1-2, 2020 at Stanford University, CA, USA
Justin Zook: jzook@nist.gov
NIST postdoc opportunities available!
Diploid assembly, cancer genomes, other ‘omics, …
Editor's Notes
This is a good slide for 644:
give a clinical anecdote
Also numbers - attendance, publications, data, RM unit sales
Reference sample distributors
How much money from IAA?
- sustained funding
Quantify collaborators' input
GIAB steering committee
Examples of others contributing data, analyses
How to describe emails