150224 giab 30 min generic slides
1. Genome in a Bottle: So you’ve
sequenced a genome – how well did
you do?
February 2015
Justin Zook, Marc Salit, and the Genome
in a Bottle Consortium
2. Whole genome sequencing technologies
disagree about 100,000’s of variants
[Figure: Venn diagram of SNP calls from Platform #1, Platform #2, and Platform #3]
# SNPs (% of SNPs detected by any platform), one entry per region of the diagram:
3,198,316 (80.05%)
125,574 (3.14%)
230,311 (5.76%)
121,440 (3.04%)
208,038 (5.21%)
71,944 (1.80%)
39,604 (0.99%)
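The percentages on this slide can be checked with a few lines of arithmetic. The sketch below sums the seven region counts to get the number of SNPs detected by any platform and recomputes each percentage. It assumes (this is not stated in the text) that the largest region is the three-platform concordance, so the remainder is the number of SNPs on which the platforms disagree.

```python
# Region counts from the Venn diagram on this slide (order as listed).
region_counts = [3_198_316, 125_574, 230_311, 121_440, 208_038, 71_944, 39_604]

# Denominator: SNPs detected by any platform (the union of all regions).
total = sum(region_counts)

# Percentage of the union that falls in each region, rounded as on the slide.
percentages = [round(100 * n / total, 2) for n in region_counts]

# ASSUMPTION: the largest region is the three-platform concordance,
# so every other region holds SNPs that at least one platform did not call.
discordant = total - max(region_counts)

print(total, percentages, discordant)
```

Under that assumption, roughly 800,000 SNPs are called by only one or two of the three platforms, which is the "100,000's of variants" in the slide title.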
4. NIST-hosted
Genome in a Bottle Consortium
• Infrastructure for performance
assessment of NGS
– support science-based regulatory
oversight
• No widely accepted set of metrics
to characterize the fidelity of
variant calls from NGS…
• Genome in a Bottle Consortium is
developing standards to address
this…
– well-characterized human genomes
as Reference Materials (RMs)
• characterized and disseminated by NIST
– tools and methods to use these RMs
• Global Alliance for Genomics and
Health Benchmarking Team
http://genomeinabottle.org
5. Genome in a Bottle
Consortium Development
• NIST met with sequencing
technology developers to assess
standards needs
– Stanford, June 2011
• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011
• Small, invitational workshop at
NIST to develop consortium for
human genome reference
materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash
U, Broad, technology developers,
clinical labs, CAP, PGP, Partners,
ABRF, others
– developed draft work plan
– April 2012
• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
• Website
– www.genomeinabottle.org
6. Others working in this space…
Well-characterized genomes
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Human Longevity, Inc.
• Hydatidiform mole haploid
cell line
• Genome Reference
Consortium
Performance Metrics
• Global Alliance for
Genomics and Health
Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website
7. NIST Plays a Role in the First FDA Authorization for
Next-Generation Sequencer
November 20, 2013
8. Measurement Process
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• gDNA reference
materials will be
developed to
characterize
performance of a part
of process
– materials will be
certified for their
variants against a
reference sequence,
with confidence
estimates
Analytical
steps
Pre-Analytical
steps
Clinical
Interpretation
9. Putting “Genomes” in Bottles
• NIST worked with GIAB
to select genomes
• Current genomes
– NA12878 HapMap
sample as Pilot sample
• part of 17-member
pedigree
– 2 trios from PGP
• Ashkenazim
• Asian
[Figure: CEPH Utah Pedigree 1463. Top row: 12889, 12890, 12891, 12892. Middle row: 12877, 12878. Bottom row (11 children): 12879, 12880, 12881, 12882, 12883, 12884, 12885, 12886, 12887, 12888, 12893.]
10. NIST Human Genome RMs in the
pipeline
• All are 10 µg samples of DNA
isolated from multistage large
growth cell cultures
– all are intended to act as stable,
homogeneous references
suitable for use in regulated
applications
– all genomes also available from
Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent
• Asian Trio
– ~10000 son; parents not yet
planned as NIST RM
11. Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in
confident regions
• Include as much of the genome as possible in
the confident regions (i.e., don’t just take the
intersection)
• Avoid bias towards any particular platform
– take advantage of strengths of each platform
• Avoid bias towards any particular
bioinformatics algorithms
13. Integration Methods to Establish
Benchmark Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of bias
Confidence Level
[Figure: histograms of Annotation #1 (e.g., coverage) and Annotation #2 (e.g., strand bias) for Dataset #1, Dataset #2, and Dataset #3, with Site A, Site B, and Site C marked and a region of potential bias indicated]
Dataset       Site A   Site B   Site C
Dataset #1    0/0      0/0      1/1
Dataset #2    0/1      0/1      1/1
Dataset #3    0/0      0/1      1/1
Integration   0/0      0/1      Uncertain
14. Integration Methods to Establish
Benchmark Variant Calls
Candidate variants
Concordant variants
Find characteristics of bias
Arbitrate using evidence of
bias
Confidence Level
Zook et al., Nature Biotechnology, 2014.
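The arbitration step can be illustrated with a toy example. The sketch below is not the published integration method. It is a minimal rule, written for illustration, that ignores calls from datasets showing evidence of bias at a site and reports a genotype only when the remaining datasets agree. The function name `integrate_site` and the per-site bias flags are hypothetical; the flags were chosen so that the output matches the Site A, Site B, and Site C example.

```python
def integrate_site(calls, biased):
    """Arbitrate the genotype at one site.

    calls  -- dict mapping dataset name to a genotype string such as "0/1"
    biased -- set of dataset names with evidence of bias at this site
    """
    # Keep only calls from datasets with no evidence of bias.
    trusted = {gt for name, gt in calls.items() if name not in biased}
    if len(trusted) == 1:
        # All unbiased datasets agree on one genotype.
        return trusted.pop()
    # Either every dataset shows bias, or unbiased datasets disagree.
    return "Uncertain"


# Genotypes from the slide; the bias flags are HYPOTHETICAL.
sites = {
    "Site A": ({"Dataset #1": "0/0", "Dataset #2": "0/1", "Dataset #3": "0/0"},
               {"Dataset #2"}),
    "Site B": ({"Dataset #1": "0/0", "Dataset #2": "0/1", "Dataset #3": "0/1"},
               {"Dataset #1"}),
    "Site C": ({"Dataset #1": "1/1", "Dataset #2": "1/1", "Dataset #3": "1/1"},
               {"Dataset #1", "Dataset #2", "Dataset #3"}),
}

integrated = {name: integrate_site(calls, biased)
              for name, (calls, biased) in sites.items()}
print(integrated)
```

Note that under this rule a site can be uncertain even when all datasets agree, if none of them is free of evidence of bias.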
15. Assigning confidence to genotypes
High-confidence sites
• Sequencing/bioinformatics
methods agree or we
understand the biases
causing disagreement
• At least some methods have
no evidence of bias
• Inherited as expected
Less confident sites
• In a region known to be
difficult for current
technologies
• State reasons for lower
confidence
• If a site is near a low
confidence site, make it low
confidence
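The last rule above (a site near a low-confidence site becomes low confidence) might be sketched as follows. The window size of 50 bp and the function name `propagate_low_confidence` are assumptions made for illustration; the slides do not state a distance.

```python
def propagate_low_confidence(sites, window=50):
    """Downgrade high-confidence sites that lie near a low-confidence site.

    sites  -- list of (position, is_high_confidence) on one chromosome
    window -- HYPOTHETICAL distance in bp that counts as "near"
    """
    low_positions = [pos for pos, high in sites if not high]
    result = []
    for pos, high in sites:
        # A site is "near" if any low-confidence site is within the window.
        near_low = any(abs(pos - low) <= window for low in low_positions)
        result.append((pos, high and not near_low))
    return result


example = [(100, True), (130, False), (400, True)]
adjusted = propagate_low_confidence(example)
print(adjusted)
```

In this example the site at position 100 is downgraded because it is 30 bp from the low-confidence site at 130, while the site at 400 stays high confidence.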
16. Challenges with assessing
performance
• All variant types are not
equal
• All regions of the genome
are not equal
• Labeling difficult variants
as uncertain leads to
higher apparent accuracy
when assessing
performance
• Genotypes fall in 3+
categories (not
positive/negative)
– standard diagnostic
accuracy measures not
well posed
17. Challenge in variant comparison: Complex
variants have multiple correct representations
[Figure: the same variant as aligned by BWA, ssaha2, CGTools, and Novoalign against the reference (Ref); labels: T insertion, TCTCT insertion]
                              FP SNPs       FP MNPs      FP indels
Traditional comparison        0.38% (610)   100% (915)   6.5% (733)
Comparison with realignment   0.15% (249)   4.2% (38)    2.6% (298)
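One way to see why differently represented variants can still be the same call is to apply each representation to the reference and compare the resulting sequences. The sketch below does this on a toy reference. It is not the realignment method used for the table above; the reference string, coordinates, and the helper name `apply_variants` are made up for illustration.

```python
def apply_variants(reference, variants):
    """Apply non-overlapping variants to a reference string.

    variants -- list of (pos, ref, alt) with 0-based positions
    """
    haplotype = reference
    # Apply from right to left so earlier positions are not shifted.
    for pos, ref, alt in sorted(variants, reverse=True):
        if haplotype[pos:pos + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        haplotype = haplotype[:pos] + alt + haplotype[pos + len(ref):]
    return haplotype


reference = "GATCTCTA"  # toy sequence containing a TC repeat

# The same 2-bp insertion placed at the left or right end of the repeat.
insertion_left = [(1, "A", "ATC")]
insertion_right = [(5, "C", "CTC")]

# One MNP versus the same change written as two adjacent SNPs.
as_mnp = [(2, "TC", "CA")]
as_snps = [(2, "T", "C"), (3, "C", "A")]

same_insertion = (apply_variants(reference, insertion_left)
                  == apply_variants(reference, insertion_right))
same_mnp = apply_variants(reference, as_mnp) == apply_variants(reference, as_snps)
print(same_insertion, same_mnp)
```

A comparison that matches only on position and alleles would count each of these pairs as a disagreement, which is consistent with the high apparent FP rates for MNPs and indels in the traditional comparison.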
18. Global Alliance for Genomics and Health
Benchmarking Task Team
• Formed June 2014 to develop
methods and tools for comparing
variant calls to a benchmark
• Developed standardized definitions
for performance metrics like TP, FP,
and FN.
• Initial focus on germline SNPs/indels
• Developing benchmarking tools
• Comparison engine
• Pluggable web interface with
modules for:
• Reporting/calculation of metrics
• Visualization/user interface
• Working with Genome in a Bottle
Consortium to host data and calls
from their well-characterized
genomes
www.bioplanet.com/gcat
Example User Interface
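As a rough illustration of what TP, FP, and FN counting involves, the sketch below compares query genotypes with benchmark genotypes site by site and derives precision and recall. It is not the Benchmarking Task Team's definition or tooling. In particular, counting a genotype mismatch (for example 0/1 called where the benchmark says 1/1) as both an FP and an FN is an assumption made here to handle the three-category genotype problem, and the site keys are invented.

```python
def count_outcomes(truth, query):
    """Count TP, FP, FN over sites; a missing site is treated as "0/0"."""
    tp = fp = fn = 0
    for site in set(truth) | set(query):
        t = truth.get(site, "0/0")
        q = query.get(site, "0/0")
        if t == "0/0" and q == "0/0":
            continue  # no variant in either set
        if t == q:
            tp += 1
        elif t == "0/0":
            fp += 1
        elif q == "0/0":
            fn += 1
        else:
            # ASSUMPTION: a genotype mismatch counts as both FP and FN.
            fp += 1
            fn += 1
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Return (precision, recall); NaN when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


truth = {"chr1:100": "0/1", "chr1:200": "1/1", "chr1:300": "0/1"}
query = {"chr1:100": "0/1", "chr1:200": "0/1", "chr1:400": "0/1"}

tp, fp, fn = count_outcomes(truth, query)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, precision, recall)
```

With this toy input the counts are 1 TP, 2 FP, and 2 FN, so precision and recall are both one third.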
19. Stratifying Performance
• Measure performance for
different types of variants in
different sequence contexts
– Types of variants
• SNPs
• indels of different sizes
• complex variants
• structural variants
– Sequence contexts
• Homopolymers,
• STRs
• Duplications
– Functional context
• Exome vs genome, etc
– Data characteristics
• Coverage
• Mapping quality
• Challenge of smaller gene
panels vs genome
sequencing
– one RM may not have a
sufficient number of
examples of different classes
of variants or sequence
contexts
– likely need more samples
with specific types of variants
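Stratified performance can be sketched as recall computed separately for each combination of variant type and sequence context. The records, field names, and the function `stratified_recall` below are hypothetical toy data, not GIAB results.

```python
from collections import defaultdict


def stratified_recall(benchmark_variants, called_keys):
    """Return {(type, context): (found, total, recall)} per stratum.

    benchmark_variants -- list of dicts with 'key', 'type', 'context'
    called_keys        -- set of variant keys reported by the pipeline
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for variant in benchmark_variants:
        stratum = (variant["type"], variant["context"])
        total[stratum] += 1
        if variant["key"] in called_keys:
            found[stratum] += 1
    return {s: (found[s], total[s], found[s] / total[s]) for s in total}


# HYPOTHETICAL benchmark variants and pipeline calls.
benchmark = [
    {"key": "v1", "type": "SNP", "context": "unique"},
    {"key": "v2", "type": "SNP", "context": "unique"},
    {"key": "v3", "type": "SNP", "context": "homopolymer"},
    {"key": "v4", "type": "indel", "context": "homopolymer"},
    {"key": "v5", "type": "indel", "context": "STR"},
]
called = {"v1", "v2", "v3", "v5"}

by_stratum = stratified_recall(benchmark, called)
for stratum, (n_found, n_total, recall) in sorted(by_stratum.items()):
    print(stratum, n_found, n_total, recall)
```

Strata with only one or two benchmark variants give recall values of 0 or 1 that say little, which is the sample-size problem described above for small gene panels.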
21. Initial uses of high-confidence NIST-
GIAB genotypes for NA12878
• NIST has released
several versions of high-
confidence genotypes
for its pilot RM
• These data are
presently being used for
benchmarking
– prior to release of RMs
– SNPs & indels
• ~77% of the genome
22. Using Genome in a Bottle calls to
benchmark clinical exome sequencing
at Mount Sinai School of Medicine
“We evaluate a set of
NA12878 technical replicates
against GIAB for each new
pipeline version.”
24. Implications of Technical Accuracy in
Medical Genome Sequencing
• Collaboration with Euan
Ashley group at Stanford
• What is accuracy for
functional variants?
• How much of the exome
falls in high confidence
regions?
• “Black list” in databases
• Sensitivity
– WExS (95%) < WGS (98%)
• especially splicing
– genome < nonsyn < syn
– Most exome FNs caused by
low coverage
– Most WGS FNs caused by
filtering
• Only 81% of ClinVar
pathogenic or likely
pathogenic SNPs fall in
high-confidence regions
– Lots of work to do!
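A figure such as the share of ClinVar SNPs inside high-confidence regions comes down to counting positions that fall within a set of intervals. The sketch below shows that calculation for one chromosome, assuming sorted, non-overlapping, half-open intervals in the style of a BED file. The coordinates and the function name `fraction_in_regions` are invented, and this is not the analysis pipeline behind the 81% figure.

```python
from bisect import bisect_right


def fraction_in_regions(positions, regions):
    """Fraction of positions inside any region.

    regions -- sorted, non-overlapping half-open (start, end) intervals
               on a single chromosome
    """
    starts = [start for start, _ in regions]
    inside = 0
    for pos in positions:
        # Index of the last region whose start is <= pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < regions[i][1]:
            inside += 1
    return inside / len(positions)


# HYPOTHETICAL high-confidence regions and variant positions.
high_confidence = [(100, 200), (300, 450)]
variant_positions = [150, 250, 300, 449, 450]

fraction = fraction_in_regions(variant_positions, high_confidence)
print(fraction)
```

Here three of the five positions fall inside a region; position 450 is excluded because the interval end is exclusive.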
26. Ashkenazim Jewish PGP RM Trio
Dataset                  Characteristics   Coverage                 Availability        Good for…
Illumina Paired-end      150x150bp         ~300x/individual         Fastq on ftp        SNPs/indels/some SVs
Illumina Long Mate pair  ~6000 bp insert   ~40x/individual          Feb-Mar 2015        SVs
Illumina “moleculo”      Custom library    ~30x by long fragments   Feb-Mar 2015        SVs/phasing/assembly
Complete Genomics        -                 100x/individual          On ftp              SNPs/indels/some SVs
Complete Genomics        LFR               -                        ??                  SNPs/indels/phasing
Ion Proton               Exome             1000x/individual         On SRA              SNPs/indels in exome
BioNano Genomics         -                 -                        Feb 2015            SVs/assembly
PacBio                   ~10kb reads       ~120-150x on AJ trio     Finished ~Mar 2015  SVs/phasing/assembly/STRs
27. Asian PGP trio
• Similar sequencing to
Ashkenazim trio except
for PacBio
• Only son will be NIST
RM
28. Future Directions
Germline mutations
• Difficult regions/variants
– Long-read technologies
– Forming an analysis group
• Tools for assessing
performance
– How to stratify performance
and understand biases?
Somatic mutations
• Pilot interlaboratory study
to assess comparability of
spike-ins
• Commercial members
developing FFPE cell lines
• Participants interested in
mixing different RMs
29. How to get involved
• Use our integrated
SNP/indel genotypes for
NA12878 and give us
feedback
– Cells and DNA currently
available from Coriell
– NIST RM available April
2015
• Join our new Analysis
group
– Use Long-read
technologies
– Structural Variant calls
– De novo assembly
– Help create the best-ever
characterized trio
• Attend our biannual
workshops (January in CA,
August in MD)
• Develop tools/metrics
with Global Alliance for
Genomics and Health
Benchmarking Team
30. Acknowledgments
• FDA – Elizabeth Mansfield,
HPC staff
• HSPH
• GCAT - David Mittelman,
Jason Wang
• Francisco De La Vega
• Illumina - Mike Eberle
• Personalis - Deanna Church
• NCBI – Chunlin Xiao
• Celera - Andrew Grupe
• Genome in a Bottle
– www.genomeinabottle.org
– New members welcome!
– Sign up for email newsletters
– jzook@nist.gov