The Genome in a Bottle Consortium is developing reference materials, reference methods, and reference data to assess confidence in human whole genome variant calls. The Consortium is characterizing several human genomes, including the NA12878 genome, an Ashkenazi Jewish trio, and a Chinese trio from the Personal Genome Project. Data for these genomes come from a range of sequencing technologies, including Illumina, Complete Genomics, PacBio, BioNano, and others. The Consortium is developing high-confidence calls for SNPs, indels, and structural variants, as well as phasing. Individual datasets and integrated variant calls will be made publicly available on the GIAB FTP site.
Giab jan2016 intro and update 160128
1. genomeinabottle.org
Genome in a Bottle Consortium
January 2016
Stanford University, Stanford, CA
Reference Materials for Human Genome
Sequencing
Marc Salit, Ph.D. and Justin Zook, Ph.D.
National Institute of Standards and Technology
5. genomeinabottle.org
GIAB Scope
• The Genome in a Bottle Consortium is
developing the reference materials, reference
methods, and reference data needed to
assess confidence in human whole genome
variant calls.
• Priority is authoritative characterization of
human genomes.
GIAB steering committee, Aug 2015
6. genomeinabottle.org
Genome in a Bottle
Consortium Development
• NIST met with sequencing
technology developers to assess
standards needs
– Stanford, June 2011
• Open, exploratory workshop
– ASHG, Montreal, Canada
– October 2011
• Small workshop at NIST to develop
consortium for human genome
reference materials
– FDA, NCBI, NHGRI, NCI, CDC, Wash
U, Broad, technology developers,
clinical labs, CAP, PGP, Partners,
ABRF, others
– developed draft work plan
– April 2012
• Open, public meetings of GIAB
– August 2012 at NIST
– March 2013 at Xgen
– August 2013 at NIST
– January 2014 at Stanford
– August 2014 at NIST
– January 2015 at Stanford
– August 2015 at NIST
– January 28-29, 2016 at Stanford
• Website
– www.genomeinabottle.org
7. genomeinabottle.org
Well-characterized, stable RMs
• Obtain metrics for
validation, QC, QA, PT
• Determine sources and
types of bias/error
• Learn to resolve difficult
structural variants
• Improve reference
genome assembly
• Optimization
• Enable regulated
applications
Comparison of SNP Calls for
NA12878 on 2 platforms, 3
analysis methods
8. Analytical Performance
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
• Use gDNA reference
materials to benchmark
performance
• Characterized Pilot
Genome NA12878
• Ashkenazim Trio, Asian
Trio from PGP in process
• Tools to facilitate their
use
– With the Global Alliance
Data Working Group
Benchmarking Team
Figure: generic measurement process
9. genomeinabottle.org
High-confidence SNP/indel calls
Zook et al., Nature Biotechnology, 2014.
• methods to develop
SNP/indel call set
described in manuscript
• broad and quick
adoption of call set for
benchmarking
– struck nerve
11. genomeinabottle.org
NIST Human Genome
Reference Materials (RMs)
• NIST RM 8398 is available!
– tinyurl.com/giabpilot
– DNA isolated from large
growth cell cultures
– Stable, homogeneous
– Best for regulated uses
– DNA from same cell line at
Coriell (NA12878)
• New AJ and Asian Samples
– Available from Coriell now
– NIST RM available in 2016
12. genomeinabottle.org
Jan 2016 Workshop
Thursday
• Update and Roadmap
• Breakouts
– Analyses for PGP GIAB Trios
– Reference Material Selection
and Development
• Breakout reports
• Roadmap discussion
Friday
• Using GIAB Products for
technology development,
optimization, and
demonstration
– Experiences from the
consortium
• Steering committee
13. genomeinabottle.org
Steering Committee Meeting
Topics
• Future workshops
• Format
• Program committee?
• Crafting a mission statement
• Defining scope
• Liaison with other efforts
Current members
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey
14. genomeinabottle.org
Agenda
Monday
• Breakfast and registration
• Welcome and Context Setting
• NIST RM Update and Status Report
• Charge to Working Groups
• Coffee Break
• Working Group Breakout Discussions
• Lunch (provided)
• Informal Working Group Reports
• Coffee Break
• Breakout Topical Discussions
– Topic #1: Moving beyond the 'easy'
variants and regions of the genome
– Topic #2: Selecting future genomes for
Reference Materials
Tuesday
• Breakfast and registration
• Use cases: Experiences using the pilot
Reference Material
• Discussion of plans to release pilot
Reference Material
• Coffee Break
• Working Group Breakout discussions
• Lunch (provided)
• Working Group leaders present plans
and discussion
• Steering committee Overview
• First meeting of the Steering
Committee (others adjourn)
Please Note
Slides will be made available on SlideShare after
the workshop (see genomeinabottle.org).
Tweets are welcome unless the speaker requests
otherwise. Please use #giab as the hashtag.
15. We are liaising with…
• Illumina Platinum Genomes
• CDC GeT-RM
• Korean Genome Project
• Genome Reference
Consortium
• 1000 Genomes SV group
• CAP/CLIA
• ABRF
• FDA
• SEQC
• Global metrology system
• Global Alliance for
Genomics and Health
Benchmarking Team
• NCBI/CDC GeT-RM Browser
• GCAT website
17. Association of Biomolecular Resource Facilities (ABRF)
www.abrf.org
Next Generation Sequencing Study
Phase 2: DNA sequencing platforms
Study Design and Launch Plan
Slides courtesy of Chris Mason
January 24, 2016
18. Aims
Create reference data sets - Sequence data from reference samples will be generated
with intra- and inter-lab replication to model the likely range of performance that
should be expected under normal service laboratory conditions.
Test and create reference samples - Designated reference samples will be easily
accessible to the community for self-evaluation by comparison to the reference data.
Samples should be standardized, able to be stably reproduced over time, and suitable
for development of new laboratory and bioinformatics methods.
Data release and Immediate utility - Performance metrics and data will be developed
for instrument platforms and sample preparation protocols that are deployed now or
will be in the near future in core sequencing facilities. After QC, data will be released
to the entire Genome in a Bottle (GIAB) and ABRF Consortia for use and preparation
for submission as publications.
ABRF NGS Phase II Study
19. Samples and Platforms – All tested in triplicate across three distinct sites
Platform                            Human DNA           Bacterial DNA
Illumina HiSeq X Ten                A, B, C, C2, C2f    -
Illumina HiSeq 4000                 A, B, C             -
Illumina HiSeq 2500 v4 1T           A, B, C             -
Illumina HiSeq 2500 v3 Rapid Run    C                   Ste, Eco, Mil, P
Illumina NextSeq 500 High Output    C                   -
Illumina MiSeq                      -                   Ste, Eco, Mil, P
Life Tech Proton                    A, B, C exomes      Ste, Eco, Mil, P
Life Tech S5                        A, B, C exomes      Ste, Eco, Mil, P
Life Tech PGM                       -                   Ste, Eco, Mil, P
Pacific Biosciences                 -                   Ste, Eco, Mil, P
Oxford Nanopore                     -                   Ste, Eco, Mil, P
Human Trio: A = maternal, B = paternal, C = son, C2 = son (Coriell)
Bacterial Isolates and Mixture: Ste, Eco, Mil, pool (P)
20. Illumina (ILMN) - Samples
Figure: sample and library layout
• Human trio (Personal Genome Project, NIST Reference Human Genomes): A (maternal), B (paternal), C (son), C2 (son, Coriell), C2f
• Reference bacterial genomes: Ste, Eco, Mil (%GC: 28, 50, 72) and pool
• Libraries: Reference DNA, TruSeq PCR-free 350; TruSeq PCR-free 550; FFPE DNA, TruSeq Nano; FFPE DNA, TruSeq PCR-free; KAPA libraries from sites a-b-c
22. Sequencing Quality Control
Phase II (SEQC2) – An
Introduction
Slides courtesy of Weida Tong, Ph.D.
Division of Bioinformatics and Biostatistics,
NCTR/FDA
23. SEQC2 Overview
• Variant calling (e.g., SNVs, CNVs, indels): Assess WGS accuracy and reproducibility for variant calling by investigating the joint effect of read alignment pipelines, variant calling methods, and coverage, and by comparing results from a personal genome versus the reference genome.
• Short reads vs long reads: Assess short reads alone, long reads alone, and their combination for genome assembly and subsequent variant calling in WGS.
• Detection power for rare mutations: Assess the detection power of ultra-deep sequencing (TGS) for subclonal mutations and its dependency on bioinformatics and coverage.
• Detection accuracy for difficult genes: Assess accuracy for difficult genes, which varies significantly with the complexity of their genomic regions (e.g., GC region), with a specific focus on HLA genes.
• Application scope of MiSeq: Assess the utility of MiSeq for (1) detection of subclonal mutations, (2) difficult genes (e.g., HLA), and (3) difficult variations (e.g., indels).
Datasets:
• Approaches: WGS and TGS
• Platforms: HiSeq, PacBio, MiSeq, etc.
• Samples: trios, NB, cell lines, etc.
Parameters:
• Personal vs reference genome
• Bioinformatics
• Coverage
24. Three Trio Datasets
Trio Study: coverage/platform (short reads, long reads) and notes
• SEQC2: HapMap Trio (European)
– Short reads: 80x; long reads: TBD
– Notes: Planned for both WGS and TGS; genotyping data and information from HapMap are available.
• GIAB: Trio (Ashkenazim)
– Short reads: Illumina 300x; long reads: 69x (son), 30x (parents)
– Other platforms: Complete Genomics, BioNano, Ion Torrent, Moleculo, SOLiD (WGS)
– Notes: This dataset was generated by the Genome in a Bottle (GIAB) consortium. We work closely with GIAB to obtain updated information on this trio, and the GIAB leaders also participate in SEQC2.
• SEQC2: Chinese Trio and test of LCL-germline
– Short reads: 100x; long reads: 50x
– Notes: Planned; the datasets will be provided by Dr. Leming Shi, who is part of the SEQC2 leadership team.
25. Candidate NIST Reference Materials
Genome                 PGP ID     Coriell ID   NIST ID   NIST RM #
CEPH Mother/Daughter   N/A        GM12878      HG001     RM8398
AJ Son                 huAA53E0   GM24385      HG002     RM8391 (son)/RM8392 (trio)
AJ Father              hu6E4515   GM24149      HG003     RM8392 (trio)
AJ Mother              hu8E87A9   GM24143      HG004     RM8392 (trio)
Asian Son              hu91BD69   GM24631      HG005     RM8393
Asian Father           huCA017E   GM24694      N/A       N/A
Asian Mother           hu38168C   GM24695      N/A       N/A
29. Datasets by genome
Columns: AJ Son, AJ Parents, Chinese son, Chinese parents, NA12878
Illumina paired-end: X X X X X
Illumina long mate pair: X X X X X
Illumina “moleculo”: X X X X X
Complete Genomics: X X X X X
Complete Genomics LFR: X X X
Ion exome: X X X X
BioNano: X X X X
10X: X X X
PacBio: X X X
SOLiD single end: X X X
Illumina exome: X X X X
Oxford Nanopore: X
31. GIAB Analysis Group – New Data Sets
Leaders
• Francisco de la Vega
– Stanford, TOMA Biosciences
• Chris Mason
– Weill Cornell Medical Center
• Tina Graves
– Washington University
• Valerie Schneider
– NCBI
• and Justin and Marc
Strategic Documents
• Analysis Group Responsibilities:
– https://docs.google.com/document/d/10eA0DwB4iYTSFM_LPO9_2LyyN2xEqH49OXHhtNH1uzw/edit?usp=sharing
• Analysis Milestones:
– https://docs.google.com/spreadsheets/d/1Pj4nSzH742g40wJz2fA6f8kFtZYAToZpSZYVPiC5st4/edit?usp=sharing
• Analysis Methods:
– https://docs.google.com/spreadsheets/d/1Je2g85H7oK6kMXbBOoqQ1FMNrvGnFuUJTJn7deyYiS8/edit?usp=sharing
• Analysis Plan:
– https://drive.google.com/file/d/0B7Ao1qqJJDHQdnVEaVdqbWdEdkE/view?usp=sharing
• Collecting Data and analyses on GIAB
FTP Site
• Recruiting people to help with the
work.
Goal: Establish and distribute a set of authoritative benchmark variant calls of all
types and sizes, as well as homozygous reference regions, on GIAB PGP trios
32. GIAB Analysis Group – New Data Sets
Types of analyses
• SNPs/indels
– NIST working on integration
– 10X/moleculo/PacBio for
difficult-to-map regions
• Assembly
– 2 de novo assemblies
– Being used for SV calling
Status
• Structural variants
– Candidate calls being generated
by 15+ groups with >20 different
algorithms and 6 datasets
– 3+ integration methods
– ~monthly calls
• Long-range Phasing
– 2 phased calls so far (CG LFR and
10X)
– Integration methods needed
• Methylation analyses
33. genomeinabottle.org
SNP/Indel Integration Method Update
• Implementing refined integration methods
– Developed so others can readily reproduce results
– Consistent results for all GIAB genomes
– Simpler process taking advantage of best practices
for each technology
• Validating with released NA12878 RM data
– Preliminary comparisons show minor changes
• Application to PGP trios
– Plan to analyze AJ trio by Q2 2016
– Release of NIST RMs in Q2 2016
– Develop calls for GRCh38
34. genomeinabottle.org
Data Release: Real-time, Open,
Public Release
Individual Datasets
• Uploaded to GIAB FTP site
as it is collected
• Includes raw reads, aligned
reads, and
variant/reference calls
Integrated High-confidence Calls
• First develop SNP, indel,
and homozygous reference
calls
• Then develop SV and non-
SV calls
• Released calls are versioned
• Preliminary callsets will be
made available to be
critiqued
35. GIAB AJ Trio Hybrid PacBio/BioNano Assembly
Hybrid (PacBio with BioNano)
Input Assembly   Notes                            # of Scaffolds   N50      Max      Total
HG002            Falcon                           248              22.7Mb   92.8Mb   2.38Gb
Trio             Falcon                           210              29.3Mb   87.6Mb   2.32Gb
Two Step Trio    celera (child) + falcon (trio)   187              34.3Mb   98.0Mb   2.6Gb
Credits: Ali Bashir, Jason Chin, Alex Hastie
Pendleton et al, Nature Methods, 2015
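The N50 column in the assembly table is a standard contiguity statistic: the length of the scaffold at which the running total of scaffold lengths, taken from longest to shortest, first reaches half of the assembly size. A minimal sketch of that calculation follows; the function name and the example lengths are illustrative assumptions, not values from the GIAB assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches at least half
    of the summed length of all scaffolds.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (arbitrary units), for illustration only.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

Because N50 depends on the total used as the denominator, values are only comparable between assemblies of similar total size.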
37. Proposed approach to form high-confidence SV (and non-SV) calls
1. Generate candidate calls (August 30, 2015)
2. Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manual inspection (January 2016)
3. Integrate new and revised calls; manual inspection (plan in January 2016)
4. Combine integrated calls; manual inspection; targeted experimental validation? (Feb 2016 and beyond)
38. Deletion overlap summary for son
By # of callsets
# of callsets # of calls
1 3780
2 1391
3 859
4 574
5+ 344
By Technology
Technology # of calls
Illumina 3277
PacBio 5177
BioNano 812
CG 1758
Illumina/CG+PacBio 2318
Illumina/CG+BioNano 518
PacBio+BioNano 467
2+ technologies 2661
All callsets were converted to BED and combined with bedtools multiinter; calls within 50 bp were merged.
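The counting behind the "By # of callsets" table can be illustrated with a simplified stand-in for the BED-based workflow. The sketch below is not bedtools multiinter (which reports every sub-interval and the files covering it); it assumes a single chromosome, merges calls whose gap is at most 50 bp across all callsets, and counts how many distinct callsets support each merged region. The callset names and coordinates are hypothetical.

```python
from collections import Counter


def callset_support(callsets, max_gap=50):
    """Merge nearby calls across callsets and count supporting callsets.

    callsets: dict mapping a callset name to a list of (start, end)
              intervals on one chromosome.
    Returns a list of (start, end, n_callsets) for each merged region.
    """
    tagged = sorted(
        (start, end, name)
        for name, intervals in callsets.items()
        for start, end in intervals
    )
    regions = []  # each entry: [start, end, set of callset names]
    for start, end, name in tagged:
        if regions and start - regions[-1][1] <= max_gap:
            # Close enough to the current region: extend it.
            regions[-1][1] = max(regions[-1][1], end)
            regions[-1][2].add(name)
        else:
            regions.append([start, end, {name}])
    return [(s, e, len(names)) for s, e, names in regions]


def support_histogram(regions):
    """Count merged regions by the number of callsets supporting them."""
    return Counter(n for _, _, n in regions)


# Hypothetical deletion calls from three technologies.
example = {
    "illumina": [(100, 200), (1000, 1100)],
    "pacbio": [(230, 300), (5000, 5100)],
    "bionano": [(90, 400)],
}
merged = callset_support(example)
print(merged)
print(support_histogram(merged))
```

Merging by proximity is transitive, so a long call can chain several shorter ones into one region; a reciprocal-overlap requirement is a stricter alternative.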
39. Preliminary Confirmation of SVs
Integration results from AJ son
Parliament: BMC Genomics, 2015, 16:286 (performed by Andrew Carroll, DNAnexus)
MetaSV: Bioinformatics, 2015, 31:2741 (performed by Marghoob Mohiyuddin, Bina/Roche)
• Parliament
– Candidates from Illumina
– Confirmed by PacBio and/or
Illumina
– ~50% in both technologies
– ~4.5k deletions, 1k insertions
– 85% of Genotypes consistent
within Trio
• MetaSV
– Multiple types of evidence
from Illumina
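The trio genotype consistency figure quoted for Parliament can be understood through a basic Mendelian check: a child's genotype is consistent if one allele can come from the mother and the other from the father. The sketch below assumes diploid autosomal genotypes written as allele pairs and ignores de novo events, missing genotypes, and sex chromosomes; it is not the procedure used by Parliament, and the example genotypes are made up.

```python
def mendelian_consistent(child, mother, father):
    """Return True if the child's genotype can be explained by the parents.

    Each genotype is a pair of allele indices, e.g. (0, 1) for a
    heterozygous call, where 0 is the reference allele.
    """
    c1, c2 = child
    return (c1 in mother and c2 in father) or (c2 in mother and c1 in father)


def fraction_consistent(trio_genotypes):
    """Fraction of sites whose (child, mother, father) genotypes are consistent."""
    if not trio_genotypes:
        raise ValueError("no genotypes supplied")
    consistent = sum(mendelian_consistent(c, m, f) for c, m, f in trio_genotypes)
    return consistent / len(trio_genotypes)


# Hypothetical genotypes at four SV sites: (child, mother, father).
sites = [
    ((0, 1), (0, 0), (0, 1)),  # consistent
    ((1, 1), (0, 1), (0, 1)),  # consistent
    ((1, 1), (0, 0), (1, 1)),  # inconsistent: mother has no alternate allele
    ((0, 0), (0, 1), (0, 1)),  # consistent
]
print(fraction_consistent(sites))
```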
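Venn diagram: overlap between MetaSV and Parliament calls (50 % reciprocal overlap)
• MetaSV total: 2809; 569 (20 %) overlapping Parliament, 2240 (80 %) MetaSV only
• Parliament total: 5467; 977 (18 %) overlapping MetaSV, 4490 (82 %) Parliament only
• Some overlap within Parliament calls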
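A 50 % reciprocal overlap criterion means that the shared span must cover at least half of each call, so a short call sitting inside a much longer one does not count as a match. A minimal sketch, assuming half-open (start, end) coordinates on the same chromosome; the function names and example intervals are illustrative only.

```python
def reciprocal_overlap(a, b, min_frac=0.5):
    """Return True if intervals a and b overlap reciprocally.

    a, b: (start, end) half-open intervals on the same chromosome.
    The shared span must be at least min_frac of the length of a
    and at least min_frac of the length of b.
    """
    a_start, a_end = a
    b_start, b_end = b
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return False
    return (shared >= min_frac * (a_end - a_start)
            and shared >= min_frac * (b_end - b_start))


def count_matched(query_calls, other_calls, min_frac=0.5):
    """Count query calls that reciprocally overlap at least one other call."""
    return sum(
        any(reciprocal_overlap(q, o, min_frac) for o in other_calls)
        for q in query_calls
    )


print(reciprocal_overlap((100, 200), (150, 250)))   # shared 50 bp of 100 and 100
print(reciprocal_overlap((100, 200), (150, 1000)))  # shared 50 bp of 100 and 850
```

Counting matches from each side separately can give different totals for the two callsets when several calls in one set match a single call in the other.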
40. genomeinabottle.org
GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of data underlying each call
41. genomeinabottle.org
Uses of GIAB NA12878
Oncology – Molecular and Cellular Tumor Markers
“Next Generation” Sequencing (NGS) guidelines for
somatic genetic variant detection
www.bioplanet.com/gcat
42. Global Alliance for Genomics and Health
Benchmarking Task Team
• Initial version of standardized
definitions for performance
metrics like TP, FP, and FN.
• Continued development of
sophisticated benchmarking tools
• vcfeval – Len Trigg
• hap.py – Peter Krusche
• vgraph – Kevin Jacobs
• Standardized intermediate and
final file formats
• Standardized bed files with
difficult genome contexts for
stratification
• Simulating reads with non-SNP
ClinVar variants to demonstrate
importance of these tools
• github.com/ga4gh/benchmarking
-tools
Next steps
• Further analysis to
demonstrate importance of
sophisticated tools
• Write manuscript about the
team’s tools
• Integrate vcfeval and hap.py to
take advantage of strengths of
each
• Recommend “Best Practices”
for benchmarking
• Explore venues for making the
team’s benchmarking process
easier to use
Progress
43. Proposed Performance Metrics
Definitions
• Define TP/FP/FN/TN in 4 ways depending on
required stringency of match:
• Loose match: TP if within x-bp of a true variant
• Allele match: TP if ALT allele matches
• Genotype match: TP if genotype and ALT allele
match
• Phasing match: TP if genotype, ALT allele, and
phasing with nearby variants all match
• True negatives are difficult to define because
an infinite number of potential alleles exist
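The four stringency levels above nest: a phasing match implies a genotype match, which implies an allele match, which implies a loose match. The sketch below returns the strictest level that a query call reaches against one truth call. The `Call` class, the 10 bp window standing in for "x-bp", and the treatment of phasing as an exact ordered-genotype comparison are simplifying assumptions made for illustration; they are not the GA4GH definitions or any tool's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Call:
    pos: int
    alt: str
    genotype: Tuple[int, int]  # allele indices; order matters only if phased
    phased: bool = False


def match_level(query: Call, truth: Call, window: int = 10) -> Optional[str]:
    """Return 'phasing', 'genotype', 'allele', 'loose', or None."""
    if abs(query.pos - truth.pos) > window:
        return None  # not even a loose match
    if query.pos != truth.pos or query.alt != truth.alt:
        return "loose"
    if sorted(query.genotype) != sorted(truth.genotype):
        return "allele"
    # Simplification: phasing agrees if both calls are phased and the
    # ordered genotypes are identical (same haplotype assignment).
    if query.phased and truth.phased and query.genotype == truth.genotype:
        return "phasing"
    return "genotype"


truth = Call(pos=100, alt="A", genotype=(0, 1), phased=True)
print(match_level(Call(100, "A", (0, 1), True), truth))
print(match_level(Call(100, "A", (1, 0), True), truth))
print(match_level(Call(100, "A", (1, 1)), truth))
print(match_level(Call(105, "A", (0, 1)), truth))
```

A real comparison also has to handle different representations of the same variant, which is the motivation for the haplotype-aware tools listed on the Benchmarking Task Team slide.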
44. genomeinabottle.org
Global Alliance for Genomics and Health
Benchmarking Task Team
Credit: Rebecca Truty, Complete Genomics
How should we interpret this complex variant on chr21?
45. GA4GH Benchmarking Tool Architecture
Truth VCF
Query
VCF
Comparison Engine
vcfeval / vgraph /
xcmp / bcftools / ...
VCF-I
Two-column VCF
with TP/FP/FN
annotations
Quantification
e.g. quantify / hap.py
Stratification BED
files
Confident Call
Regions
VCF-R
Two-column
VCF with
TP/FP/FN/UNK
annotations
Counts
Credit: Peter Krusche
https://github.com/ga4gh/benchmarking-tools
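The quantification stage of this architecture can be sketched as counting TP/FP/FN annotations within each stratification region and deriving precision and recall. The code below is a toy illustration under stated assumptions: records are plain (chrom, pos, label) tuples rather than VCF records, regions are BED-style half-open intervals, and the function names and data are hypothetical. It does not reproduce hap.py, quantify, or the intermediate VCF formats.

```python
from collections import Counter


def in_regions(chrom, pos, regions):
    """True if a 0-based position falls in any (chrom, start, end) region."""
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def quantify(records, strata):
    """Count TP/FP/FN per stratum and compute precision and recall.

    records: iterable of (chrom, pos, label) with label in {"TP", "FP", "FN"}.
    strata:  dict mapping a stratum name to a list of BED-style regions.
    """
    results = {}
    for name, regions in strata.items():
        counts = Counter(
            label for chrom, pos, label in records
            if in_regions(chrom, pos, regions)
        )
        tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
        results[name] = {
            "TP": tp,
            "FP": fp,
            "FN": fn,
            "precision": tp / (tp + fp) if tp + fp else None,
            "recall": tp / (tp + fn) if tp + fn else None,
        }
    return results


# Hypothetical annotated comparison output and two strata.
records = [("chr1", 10, "TP"), ("chr1", 20, "FP"),
           ("chr1", 150, "FN"), ("chr1", 160, "TP")]
strata = {"easy": [("chr1", 0, 100)], "hard": [("chr1", 100, 200)]}
for name, metrics in quantify(records, strata).items():
    print(name, metrics)
```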
46. Approaches to Benchmarking Variant
Calling
• Well-characterized whole genome Reference
Materials
• Many samples characterized in clinically relevant
regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over
time
47. Challenges in Benchmarking Small
Variant Calling
• It is difficult to do robust benchmarking of tests designed to
detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file,
but…
• Benchmark calls/regions tend to be biased towards easier
variants and regions
– Some clinical tests are enriched for difficult sites
• Challenges with benchmarking complex variants near
boundaries of high-confidence regions
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance
metrics
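For the last point, one common way to put a confidence interval on a proportion such as sensitivity (TP out of TP + FN) is the Wilson score interval, which behaves better than the normal approximation when counts are small or the proportion is near 0 or 1. The choice of the Wilson interval and the example counts are assumptions for illustration; the slide does not prescribe a method.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: e.g. number of true positives
    total:     e.g. true positives plus false negatives
    z:         normal quantile (1.96 for an approximate 95 % interval)
    Returns (lower, upper), clamped to [0, 1].
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: 95 of 100 benchmark variants detected.
low, high = wilson_interval(95, 100)
print(f"sensitivity 0.95, approx. 95% CI {low:.3f} to {high:.3f}")
```

With few variants in a stratum the interval is wide, which is a useful reminder when reporting performance for rare variant types or small difficult regions.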
48. Particular Challenges in Benchmarking
SV Calling
• How to establish benchmark calls for difficult
regions?
• How to establish non-SV regions to assess FP
rates?
• Multiple dimensions of accuracy:
– Predicted SV existence
– Predicted SV type
– Predicted size
– Predicted breakpoints
– Predicted exact sequence
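The dimensions listed above can be scored separately for a predicted SV against a benchmark SV, so that a call with the right type but imprecise breakpoints is not simply counted as wrong. The sketch below is one possible scoring under assumed tolerances (50 bp for breakpoints, 20 % for size); the `SV` class, tolerances, and coordinates are hypothetical and are not GIAB criteria.

```python
from dataclasses import dataclass


@dataclass
class SV:
    chrom: str
    start: int
    end: int
    svtype: str         # e.g. "DEL", "INS"
    sequence: str = ""  # resolved sequence, if known


def compare_sv(pred, truth, bp_tol=50, size_tol=0.2):
    """Score a predicted SV against a benchmark SV on each dimension."""
    # Existence: same chromosome and the calls lie within bp_tol of each other.
    existence = (
        pred.chrom == truth.chrom
        and pred.start <= truth.end + bp_tol
        and truth.start <= pred.end + bp_tol
    )
    pred_size = pred.end - pred.start
    truth_size = truth.end - truth.start
    return {
        "existence": existence,
        "type": existence and pred.svtype == truth.svtype,
        "size": existence and abs(pred_size - truth_size) <= size_tol * truth_size,
        "breakpoints": existence
        and abs(pred.start - truth.start) <= bp_tol
        and abs(pred.end - truth.end) <= bp_tol,
        # Exact sequence can only be scored when the benchmark provides one.
        "sequence": existence
        and bool(truth.sequence)
        and pred.sequence == truth.sequence,
    }


# Hypothetical deletion: right type and size, but one breakpoint is off.
pred = SV("chr1", 1000, 2000, "DEL")
truth = SV("chr1", 1010, 2100, "DEL")
print(compare_sv(pred, truth))
```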
49. Acknowledgments
• FDA – Elizabeth
Mansfield
• Many members of
Genome in a Bottle
– New members
welcome!
– Sign up on website
for email
newsletters
GIAB Steering Committee
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey
50. For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis
Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://biorxiv.org/content/early/2015/09/15/026468
Global Alliance Benchmarking Team
– https://github.com/ga4gh/benchmarking-tools
Twice yearly public workshops
– Winter at Stanford University, California, USA
– Summer at NIST, Maryland, USA
Justin Zook: jzook@nist.gov
Marc Salit: salit@nist.gov
51. GIAB Roadmap: Where are we,
Where are we going?
• Reference Materials
– Germline
– Somatic
• Informatics
– Analysis of GIAB data
– Benchmarking
• Documentary Standards/Publications
– Documentation of methods
– Supporting Use