150219 agbt giab_poster_marc

Bioinformatics, Data Integration, and Data
Representation Group
In 2012, NIST convened the Genome in a Bottle Consortium to develop
the metrology infrastructure needed to enable confidence in human whole
genome variant calls.
Consortium products will include:
• Well-characterized whole genome and synthetic DNA Reference
Materials (RMs)
• Reference data associated with the RMs
• Reference methods (Comparison tools, documentary standards)
These Genome in a Bottle products will help enable translation of whole
genome sequencing to clinical applications.
Expected use cases of these products include:
• Enable regulated applications
• Validation, QC, proficiency testing
• Identify and quantify sources of bias & variability
• Optimize measurement technologies
• Resolve structural variants
• Improve reference assembly
• Integrate data from multiple platforms
Overview
Reference Material Selection and Design Group
• Personal Genome Project samples –
consent for commercialization
• Ashkenazi Jewish trio
• East Asian trio
• Additional diversity and a large family?
• Supporting inter-laboratory analysis of potential commercial
reference materials - recruiting labs now
• Are synthetic spike-ins a good surrogate for real
somatic mutations?
• Spike-ins vs. FFPE engineered cell lines vs. FFPE tissue
www.personalgenomes.org
Genome in a Bottle: So you’ve sequenced a
genome, how well did you do?
Marc Salit, Justin Zook, Genome in a Bottle Consortium
Genome-scale Measurements Group, National Institute of Standards and Technology, Gaithersburg, MD 20899
Measurements for Reference Material
Characterization Group
Performance Metrics Group
Developing Benchmark Genotypes
• Performance Metrics Specification
• Available on GIAB blog
• Global Alliance for Genomics & Health
• Formed Benchmarking Task Team to
develop methods and tools for comparing
variant calls to a benchmark
• Developed standardized definitions for
performance metrics like TP, FP, and FN.
• Developing benchmarking tool in 3 parts:
Comparison, Reporting, and Visualization
• NCBI/CDC GeT-RM Genome Browser
• Visualization of data
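Once calls are classified as TP, FP, or FN against a benchmark, summary metrics follow directly from the counts. The sketch below is only an illustration: the function name `summarize_counts` and the example counts are hypothetical and are not part of any GIAB or GA4GH tool. It assumes the common definitions sensitivity = TP/(TP+FN) and precision = TP/(TP+FP).

```python
import math


def summarize_counts(tp: int, fp: int, fn: int) -> dict:
    """Derive sensitivity, precision, and F1 from benchmark comparison counts.

    tp: calls that match the benchmark
    fp: calls inside the benchmark regions that the benchmark does not contain
    fn: benchmark variants that the call set missed
    """
    # Guard against empty denominators so an empty comparison yields NaN
    # instead of raising ZeroDivisionError.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    precision = tp / (tp + fp) if (tp + fp) else math.nan
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else math.nan
    return {"sensitivity": sensitivity, "precision": precision, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(summarize_counts(tp=90, fp=10, fn=10))
```

Counts like these are only meaningful within the regions where the benchmark is confident, so the comparison step has to restrict both call sets to those regions before counting.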
[Figure: Point Mutation Control Plasmids, each carrying a Mutation of Interest and an Alien Barcode, from M. Williams et al., Frederick National Laboratory for Cancer Research]
• Developed data integration methods and
benchmark genotype calls for NA12878
• Multi-platform method
• Published by Zook et al. (2014) in
Nature Biotechnology
• Newest calls integrate pedigree-based methods
• Real Time Genomics (RTG)
• Illumina Platinum Genomes
• NCBI hosts FTP with raw data and calls
• ftp-trace.ncbi.nih.gov/giab
• Mirrored to AWS S3:
giab.s3.amazonaws.com
How you can get involved:
• Join Analysis Group for Personal Genome Project trios
• Help with Structural Variant calls and difficult regions of the genome
• Help with analyzing data from long-read technologies
• Attend our biannual workshops (January in CA, August in MD)
• Help develop definitions and methods to measure performance using our
well-characterized genomes with Global Alliance for Genomics & Health
Benchmarking Working Group (http://ga4gh.org/#/benchmarking-team)
• Use our integrated SNP/indel/homozygous reference genotypes for
NA12878 and give us feedback
[Figure: Reference Materials → Sample Preparation → Sequencing → Bioinformatics → Variant List, Performance metrics]
Genome in a Bottle Consortium
New members welcome!
Sign up for newsletters at www.genomeinabottle.org
Overlap of SNP calls among three variant call files, and the proposed method for arbitrating
between multiple datasets to produce high-confidence integrated SNP, indel, and
homozygous reference genotypes. A similar integration process has been applied to our pilot
genome, NA12878 (see Zook et al., Nat. Biotech., 2014), and we plan to use these
methods to produce high-confidence calls for the Ashkenazi Jewish and East Asian trios from the
Personal Genome Project.
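The three-way overlap described in this caption can be tabulated with plain set operations. The following is a minimal sketch, assuming each call set has already been normalized and reduced to (chromosome, position, ref, alt) tuples. The function name `venn_counts` and the toy variants are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def venn_counts(a: Iterable[Variant],
                b: Iterable[Variant],
                c: Iterable[Variant]) -> Dict[str, int]:
    """Count variants in each of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "A only": len(a - b - c),
        "B only": len(b - a - c),
        "C only": len(c - a - b),
        "A&B only": len((a & b) - c),
        "A&C only": len((a & c) - b),
        "B&C only": len((b & c) - a),
        "A&B&C": len(a & b & c),
    }


if __name__ == "__main__":
    # Toy call sets, for illustration only.
    set_a = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("1", 300, "G", "A")]
    set_b = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("1", 400, "T", "C")]
    set_c = [("1", 100, "A", "G"), ("1", 500, "G", "C")]
    for region, n in venn_counts(set_a, set_b, set_c).items():
        print(f"{region}: {n}")
```

Exact tuple matching only works when all three files represent each variant identically, which is why normalization comes first in the integration process.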
Structural Variants
• We are developing similar methods for SVs (see Zook et al. poster)
• Methodology development to annotate each SV using coverage, insert size,
discordant paired ends, mapping quality, soft-clipping …
• How to use long-read technologies?
[Flowchart: integration process]
Normalize and take union of calls
Simple SNPs/indels:
• Illumina/SOLiD – GATK HC force calls
• Ion – TVC force calls
• CG – use Ref file
• If all biased or low quality, uncertain
• Else if all concordant, high-confidence
• Else if all unbiased are concordant, high-confidence
• Else uncertain
Complex Variants:
• Use GA4GH methods for sequential pairwise comparison
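The arbitration rules for simple SNPs/indels in the flowchart above can be sketched as a small decision function. This is an illustrative reading of the four rules, not the consortium's implementation. The `Evidence` class, its field names, and the `arbitrate` function are hypothetical, and how a dataset gets flagged as biased or low quality is assumed to be decided upstream.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Evidence:
    dataset: str    # e.g. "Illumina", "Ion", "CG"
    genotype: str   # forced genotype call at this site, e.g. "0/1"
    flagged: bool   # True if this dataset looks biased or low quality here


def arbitrate(evidence: List[Evidence]) -> Tuple[str, Optional[str]]:
    """Apply the four flowchart rules in order; return (status, genotype)."""
    if not evidence:
        return ("uncertain", None)
    # Rule 1: if all datasets are biased or low quality, the site is uncertain.
    if all(e.flagged for e in evidence):
        return ("uncertain", None)
    # Rule 2: else if all datasets are concordant, the site is high-confidence.
    genotypes = {e.genotype for e in evidence}
    if len(genotypes) == 1:
        return ("high-conf", genotypes.pop())
    # Rule 3: else if all unbiased datasets are concordant, high-confidence.
    unbiased = {e.genotype for e in evidence if not e.flagged}
    if len(unbiased) == 1:
        return ("high-conf", unbiased.pop())
    # Rule 4: otherwise the site is uncertain.
    return ("uncertain", None)


if __name__ == "__main__":
    site = [
        Evidence("Illumina", "0/1", flagged=False),
        Evidence("CG", "0/1", flagged=False),
        Evidence("Ion", "1/1", flagged=True),
    ]
    print(arbitrate(site))
```

Because rule 1 is checked first, a site where every dataset is flagged stays uncertain even if the flagged calls happen to agree.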
www.bioplanet.com/gcat
www.ncbi.nlm.nih.gov/variation/tools/get-rm/
Dataset | Characteristics | Coverage | Availability | Good for…
Illumina Paired-end | 150x150bp | ~300x/individual | Fastq on FTP | SNPs/indels/some SVs
Illumina Long Mate pair | ~6000 bp insert | ~40x/individual | Feb-Mar 2015 | SVs
Illumina “moleculo” | Custom library | ~30x by long fragments | Feb-Mar 2015 | SVs/phasing/assembly
Complete Genomics | Paired end | ~100x/individual | On FTP | SNPs/indels/some SVs
Complete Genomics | LFR | | Mar 2015 | SNPs/indels/phasing
Ion Proton | Exome | 1000x/individual | On FTP/SRA | SNPs/indels in exome
BioNano Genomics | Optical mapping | | Feb 2015 | SVs/assembly
PacBio | ~10kb reads | ~120-150x on AJ trio | 50% on FTP; finished ~Mar 2015 | SVs/phasing/assembly/STRs
Forming an analysis group:
• Using long reads
• SV analysis
• De novo assembly
• Complex variants
• All data is public
• Now recruiting members
