Aug2014 giab status update and wg charge

Genome in a Bottle: Reference
Materials to Enable Translation
August 2014
Justin Zook, Marc Salit, and the Genome
in a Bottle Consortium

NIST Human Genome RMs in the
pipeline
• All are 10 µg samples of DNA
isolated from multistage large
growth cell cultures
– all are intended to act as stable,
homogeneous references
suitable for use in regulated
applications
– all genomes also available from
Coriell repository
• Pilot Genome
– ~8400 tubes
• Ashkenazim Jewish Trio
– ~10000 son; ~2500 each parent
• Asian Trio
– ~10000 son; parents not yet
planned as NIST RM

Homogeneity Analysis
• Sequence multiple libraries from multiple vials
• Use somatic mutation callers to detect differences in SNPs and CNVs
• First and last vial: 3 libraries sequenced to ~33x each
– VarScan to detect differences in allele fraction of SNPs and indels between vials: significant differences only found in regions prone to alignment errors
– BIC-seq to detect differences in copy number between vials and libraries: no consistent differences between comparisons of different libraries between vials
• 4 random vials: 2 libraries sequenced to 12.5x each
– BIC-seq to detect differences in copy number between vials and libraries: only one difference with p < 10^-8, which is in a region prone to mapping errors
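A minimal sketch of how an allele-fraction difference between two vials could be tested at a single site. It uses a two-sided Fisher's exact test on alt/ref read counts as a plain stand-in; it is not VarScan itself, and the function name and read counts are made up for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Here a, b are alt/ref read counts in vial 1 and c, d are alt/ref read
    counts in vial 2 (hypothetical inputs for this sketch).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x alt reads in vial 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Illustrative counts at ~33x coverage per vial (not real data).
p_similar = fisher_exact_two_sided(15, 18, 16, 17)   # near-identical allele fractions
p_different = fisher_exact_two_sided(30, 3, 3, 30)   # strongly shifted allele fraction
```

In a genome-wide comparison the threshold would need to account for the number of sites tested, which is one reason a very small cutoff such as the p < 10^-8 quoted above is plausible.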

[Figure: gel images for the stability conditions. Lane labels: Time = 0, 2 weeks, 8 weeks. Treatments: Freeze Thaw 2x, Freeze Thaw 5x, Vortex (10 sec), Vigorous Pipetting (full vol 10x), Shipping cross-country]
• Run multiple gels for each condition
• Blinded qualitative analysis of gel by 5 NIST staff
• Consensus that only vials stored at 37 °C for 8 weeks had significantly decreased
size

Goals for Data to Accompany RM
• ~0 false positive AND false negative calls in
confident regions
• Include as much of the genome as possible in
the confident regions (i.e., don’t just take the
intersection)
• Avoid bias towards any particular platform
– take advantage of strengths of each platform
• Avoid bias towards any particular
bioinformatics algorithms

Integrate 14 Datasets from 5
platforms

Integration of Data to
Form Highly Confident Genotype Calls
• Candidate variants: find all possible variant sites
• Concordant variants: find concordant sites across multiple datasets
• Find characteristics of bias: identify sites with atypical characteristics signifying
sequencing, mapping, or alignment bias
• Arbitrate using evidence of bias: for each site, remove datasets with decreasingly atypical
characteristics until all datasets agree
• Confidence level: even if all datasets agree, identify sites as uncertain if
few datasets have typical characteristics, or if the sites fall in known
segmental duplications, SVs, or long repeats

Integration Methods to Establish
Reference Variant Calls
• Candidate variants
• Concordant variants
• Find characteristics of bias
• Arbitrate using evidence of bias
• Confidence Level
Zook et al., Nature Biotechnology, 2014.
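The arbitration and confidence steps can be sketched roughly as follows. This is an illustration of the idea only, not the published implementation: the per-dataset "atypicality" score, the `atypical_cutoff` and `min_typical` parameters, and the dataset names are all hypothetical.

```python
def arbitrate(calls, atypical_cutoff=2.0, min_typical=2):
    """Arbitrate one site.

    calls: non-empty list of (dataset, genotype, atypicality) tuples, where a
    larger atypicality score means stronger evidence of bias (assumed scale).
    Returns (genotype, is_confident, datasets_kept).
    """
    # Sort so the most atypical dataset is last and gets removed first.
    remaining = sorted(calls, key=lambda c: c[2])
    while len({genotype for _, genotype, _ in remaining}) > 1:
        remaining.pop()  # drop the dataset with the highest atypicality score
    genotype = remaining[0][1]
    # Even when datasets agree, require enough of them to look typical.
    typical = sum(1 for _, _, score in remaining if score < atypical_cutoff)
    return genotype, typical >= min_typical, [name for name, _, _ in remaining]


# Disagreement resolved by removing the most atypical dataset.
site_a = arbitrate([("ds_A", "0/1", 0.3), ("ds_B", "0/1", 0.8), ("ds_C", "1/1", 3.5)])
# Datasets agree, but none has typical characteristics, so the site is uncertain.
site_b = arbitrate([("ds_A", "0/1", 2.5), ("ds_B", "0/1", 3.1)])
```

A fuller version would also mark sites in known segmental duplications, SVs, or long repeats as uncertain; that check is left out here because it depends on external region annotations.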

Pedigree calls
• RTG and Illumina Platinum
Genomes developed these
• Sequence NA12878,
husband, and 11 children to
identify high confidence
variants
– Identify cross-over events
– Determine if genotypes are
consistent with inheritance
• Integrated these with NIST
high-confidence genotypes
• Should we find larger
families for future
genomes?
Source: Mike Eberle, Illumina

Assigning confidence to genotypes
High-confidence sites
• Sequencing/bioinformatics
methods agree or we
understand the biases
causing disagreement
• At least some methods have
no evidence of bias
• Inherited as expected
Less confident sites
• In a region known to be
difficult for current
technologies
• State reasons for lower
confidence
• If a site is near a low
confidence site, make it low
confidence
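The last rule, downgrading sites that sit near a low-confidence site, might look like the sketch below. The slide does not say how close "near" is, so the `window` distance in base pairs is an assumption, as are the function name and the input layout.

```python
def flag_near_low_confidence(sites, window=50):
    """Downgrade high-confidence sites that lie near a low-confidence site.

    sites: list of (position, is_high_confidence) for one chromosome.
    window: assumed distance in bp; not specified in the source.
    """
    low_positions = [pos for pos, high in sites if not high]
    result = []
    for pos, high in sites:
        near_low = any(abs(pos - lp) <= window for lp in low_positions)
        # A site stays high confidence only if no low-confidence site is nearby.
        result.append((pos, high and not near_low))
    return result


flagged = flag_near_low_confidence([(100, True), (120, False), (400, True)])
# flagged == [(100, False), (120, False), (400, True)]
```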

Performance Metrics Specification
• Goal is to standardize
performance metrics
measured with respect to
NIST RMs
• Licensing
• Definitions
• Input formats
• User interface
• Accuracy outputs
– FP, FN, Sens, Spec, etc.
– Stratification
• by variant type
• by genome context
• by functional regions
• Characteristics of FP/FN
• Working with Global
Alliance for Genomics and
Health
• See draft at
genomeinabottle.org
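As a rough illustration of stratified accuracy outputs, the sketch below compares a call set against a truth set and reports TP, FP, FN, sensitivity, and precision per variant type. It is not the draft specification: variant keys, the `variant_type` helper, and the example variants are hypothetical, and specificity is omitted because it needs a true-negative count that depends on the size of the confident regions.

```python
from collections import defaultdict


def variant_type(variant):
    """Classify a (chrom, pos, ref, alt) key as SNP or indel (simplified)."""
    _, _, ref, alt = variant
    return "SNP" if len(ref) == 1 and len(alt) == 1 else "indel"


def stratified_metrics(truth, calls, stratum_of=variant_type):
    """truth, calls: sets of variant keys restricted to the confident regions."""
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for v in truth | calls:
        if v in truth and v in calls:
            kind = "TP"
        elif v in calls:
            kind = "FP"
        else:
            kind = "FN"
        counts[stratum_of(v)][kind] += 1
    metrics = {}
    for stratum, c in counts.items():
        found, called = c["TP"] + c["FN"], c["TP"] + c["FP"]
        metrics[stratum] = dict(
            c,
            sensitivity=c["TP"] / found if found else None,
            precision=c["TP"] / called if called else None,
        )
    return metrics


truth = {("1", 100, "A", "G"), ("1", 200, "AT", "A"), ("1", 300, "C", "T")}
calls = {("1", 100, "A", "G"), ("1", 200, "AT", "A"), ("1", 400, "G", "GA")}
metrics = stratified_metrics(truth, calls)
```

Passing a different `stratum_of` function, for example one that looks up genome context or functional region for each variant, gives the other stratifications listed above.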

Working Group Charges
RM Selection and Design
• Derivative products based
on NIST RMs
• RMs for cancer and somatic
variant calling?
• Do we need another large
family and/or more
diversity?
• What is the priority of
transcriptome RMs?
Characterization/Bioinformatics
• What are the barriers to
submitting data via SRA?
• How should we use long read
technologies?
• How should we call structural
variants?
• Do we need targeted
confirmation/validation of
SNPs, indels, or SVs?
• Integration of data for PGP
trios

Working Group Charges
Performance Metrics
• How should we coordinate
with Global Alliance for
Genomics and Health
Benchmarking group?
• Feedback about Performance
Metrics Specification
– Stratification of performance
by type of error, variant type,
genome context, and
functional region
