3. Who are we?
• Justin Johnson
– Managing Director of Services
– Director of Bioinformatics
– 10 Years at JCVI before EdgeBio
– Project Manager - Archon Genomics XPrize
• EdgeBio
– CLIA Lab
– Illumina HiSeq & MiSeq, Ion Proton & PGM
4. Overview – GIAB as I See It.
• Which genomes?
• How do we sequence them?
• How do we analyze them?
• How do we enable their usage?
5. Overview
• Experimental Data
– Sequence Data & Variation
– Metadata
• Bioinformatics Data Integration / Representation
• Database
– RM vs. Reference
– Every Base
• Visualize and Filter
– Browser over DB
– Query by Experiment Data
• Compare and Report
– Single Genome Browser
– ValidationProtocol.org
• Refine and Feedback
Experimental Data = Combination of Prep / Sequencing / Analysis
6. Experimental Data
• GetRM Model for Collection
– http://www.ncbi.nlm.nih.gov/projects/variation/get-rm/
• Preparation
– Link to published prep protocol
– ROI in BED/GFF/GBK Format
• Sequencing
– Platform Information (Minimally - Name)
– Chemistry (Minimally - Version)
• Analysis
– Link to published analysis protocol or best practices
– Read Data (fastq, sra, hdf5, others)
– Alignment/Assembly Data (bam)
• Minimal Tag Set TBD
– Variation (VCF or gVCF)
• Minimal Tag Set TBD in INFO field of VCF or define external XSD
• https://sites.google.com/site/gvcftools/home/about-gvcf
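The collection model above can be sketched as a small record type. This is only an illustration: the slide leaves the minimal tag set TBD, so the class names, field names, and the choice of "required" fields below are assumptions, not part of GetRM or any agreed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Preparation:
    protocol_url: str                # link to the published prep protocol
    roi_file: Optional[str] = None   # ROI in BED/GFF/GBK format, if targeted


@dataclass
class Sequencing:
    platform: str                    # minimally, the platform name
    chemistry_version: str           # minimally, the chemistry version


@dataclass
class Analysis:
    protocol_url: str                # published analysis protocol or best practices
    read_files: List[str] = field(default_factory=list)       # fastq, sra, hdf5, ...
    alignment_files: List[str] = field(default_factory=list)  # bam
    variant_files: List[str] = field(default_factory=list)    # VCF or gVCF


@dataclass
class Experiment:
    """Experimental Data = combination of prep / sequencing / analysis."""
    sample: str
    preparation: Preparation
    sequencing: Sequencing
    analysis: Analysis

    def missing_fields(self) -> List[str]:
        """Return the names of (assumed) minimal fields that are still empty."""
        checks = {
            "preparation.protocol_url": self.preparation.protocol_url,
            "sequencing.platform": self.sequencing.platform,
            "sequencing.chemistry_version": self.sequencing.chemistry_version,
            "analysis.protocol_url": self.analysis.protocol_url,
            "analysis.variant_files": self.analysis.variant_files,
        }
        return [name for name, value in checks.items() if not value]


# Hypothetical submission: URLs and file names are placeholders.
example = Experiment(
    sample="NA12878",
    preparation=Preparation(protocol_url="https://example.org/prep"),
    sequencing=Sequencing(platform="Illumina HiSeq", chemistry_version=""),
    analysis=Analysis(protocol_url="https://example.org/analysis"),
)
print(example.missing_fields())
```

A submission front end could run a check like `missing_fields()` before accepting an upload, so that every experiment in the database carries at least the minimal tags.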
8. Metadata
• All required fields in VCF 4.1
• Others (Examples)
– AA : ancestral allele
– AC : allele count in genotypes, for each ALT allele, in the same order as listed
– AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
– AN : total number of alleles in called genotypes
– BQ : RMS base quality at this position
– CIGAR : cigar string describing how to align an alternate allele to the reference allele
– DB : dbSNP membership
– DP : combined depth across samples, e.g. DP=154
– END : end position of the variant described in this record (for use with symbolic alleles)
– H2 : membership in hapmap2
– VALIDATED : validated by follow-up experiment
• Reference Block Implementations
• Handle Indel Conflicts and Resolution
• Genotype Quality for non-variant sites (GQX)
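To make the INFO tags above concrete, here is a minimal sketch of reading them from one VCF data line. The function name `parse_info` and the example record are illustrative assumptions; values are left as strings rather than typed according to the VCF header.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO column into a dict.

    Flags (keys without '=') map to True; comma-separated values
    (one per ALT allele, e.g. AC or AF) become lists of strings.
    """
    result = {}
    if info in ("", "."):
        return result
    for item in info.split(";"):
        if not item:
            continue
        if "=" in item:
            key, value = item.split("=", 1)
            parts = value.split(",")
            result[key] = parts if len(parts) > 1 else value
        else:
            result[item] = True  # flag such as DB, H2, VALIDATED
    return result


# Illustrative record (tab-separated VCF columns); not real data.
record = "20\t14370\t.\tG\tA,T\t29\tPASS\tDP=154;AF=0.5,0.1;DB;H2"
info = parse_info(record.split("\t")[7])
print(info)
```

Keeping flags such as DB, H2, and VALIDATED as booleans makes them easy to use as filters later on.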
9. Database
• Store Each Base + Meta of RM versus Reference for each
Experiment from gVCF
– Distinguish missing versus homozygous reference
– Include copy number and phasing when available, not
required
• Engine that drives front end visualization (Genome Browser)
• Build on GetRM/NCBI Database Work
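One way to picture "every base" storage is to expand gVCF-style records into one call per position. The sketch below assumes a simplified input of `(pos, end, genotype)` tuples, where reference blocks span `pos..end` (as the END tag would describe); the function name and state labels are hypothetical, and real gVCF parsing, copy number, and phasing are left out.

```python
def per_base_calls(records, region_start, region_end):
    """Expand simplified gVCF-like records into one call per base.

    records: iterable of (pos, end, genotype), 1-based inclusive coordinates.
    Positions covered by no record, or with a no-call genotype such as "./.",
    are reported as "missing" rather than assumed homozygous reference.
    """
    calls = {pos: "missing" for pos in range(region_start, region_end + 1)}
    for pos, end, genotype in records:
        alleles = genotype.replace("|", "/").split("/")
        if "." in alleles:
            state = "missing"
        elif set(alleles) == {"0"}:
            state = "hom_ref"
        else:
            state = "variant"
        for p in range(max(pos, region_start), min(end, region_end) + 1):
            calls[p] = state
    return calls


# Illustrative input: two reference blocks, one heterozygous site, one gap.
records = [(100, 104, "0/0"), (105, 105, "0/1"), (108, 110, "0/0")]
calls = per_base_calls(records, 100, 110)
print(calls[104], calls[105], calls[106])
```

Starting every position as "missing" and only overwriting it from explicit records is what keeps an uncovered base from being mistaken for a confident reference call.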
10. Visualize and Filter
• Build on GetRM/NCBI Browser Work
• Single RM -> Many Experiments
• Not all metadata will be visual, but most/all will be filterable
• Filter data to generate ROI or VOI
– Canned: e.g. Intersection of All Platforms + Analysis, All OMIM SNPs,
Clinical Cert SNV List, etc.
– Dynamic: allowing people to explore prep, sequence, or analysis bias
• Slice, Dice, Export VOI to compare and reporting SW
• Allow user defined tracks
• A by-product is a community educational resource
– I have an ROI for a test and want to know which platform, prep, exome
kit version, etc. covers it best. What do I do?
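A "canned" filter such as the intersection of all platforms, and a simple ROI restriction, could look roughly like this. The variant and interval tuple layouts, the function names, and the experiment labels are assumptions made for illustration; a real implementation would query the database rather than in-memory sets.

```python
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)
Interval = Tuple[str, int, int]      # (chrom, start, end), 1-based inclusive


def intersect_all(call_sets: Dict[str, Set[Variant]]) -> Set[Variant]:
    """Canned filter: variants reported by every experiment."""
    if not call_sets:
        return set()
    return set.intersection(*call_sets.values())


def within_roi(variants: Iterable[Variant], roi: Iterable[Interval]) -> Set[Variant]:
    """Keep only variants whose position falls inside any ROI interval."""
    roi = list(roi)
    return {
        v for v in variants
        if any(chrom == v[0] and start <= v[1] <= end for chrom, start, end in roi)
    }


# Illustrative call sets keyed by a hypothetical experiment label.
call_sets = {
    "platform_a": {("1", 100, "A", "G"), ("1", 200, "C", "T")},
    "platform_b": {("1", 100, "A", "G"), ("1", 300, "G", "A")},
}
voi = intersect_all(call_sets)
roi_hits = {
    name: len(within_roi(variants, [("1", 150, 350)]))
    for name, variants in call_sets.items()
}
print(voi, roi_hits)
```

The per-experiment counts in `roi_hits` hint at the educational use case: given an ROI, compare how many calls each platform or prep contributes inside it.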
11. Parallel Database, Filter Effort (Gemini)
Quinlan Lab at UVA - https://github.com/arq5x/gemini
• Gemini – simple, flexible, and powerful
framework for exploring genetic variation
• Basic browser capabilities being developed
• Flexible custom annotation and metadata
addition to DB
• Leverage the expressive power of SQL while
overcoming fundamental challenges associated
with using databases for very large datasets
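To illustrate the kind of question SQL makes easy to ask of variant data, here is a small sketch using Python's built-in `sqlite3` module. The `variants` table, its columns, and the rows are hypothetical; this is not Gemini's schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants ("
    "experiment TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, depth INTEGER)"
)
# Illustrative rows only.
rows = [
    ("exp_a", "1", 100, "A", "G", 35),
    ("exp_a", "1", 200, "C", "T", 12),
    ("exp_b", "1", 100, "A", "G", 40),
    ("exp_b", "1", 300, "G", "A", 50),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", rows)

# Variants seen at depth >= 20 in every experiment in the table.
query = """
SELECT chrom, pos, ref, alt, COUNT(DISTINCT experiment) AS n_experiments
FROM variants
WHERE depth >= ?
GROUP BY chrom, pos, ref, alt
HAVING COUNT(DISTINCT experiment) = (SELECT COUNT(DISTINCT experiment) FROM variants)
ORDER BY chrom, pos
"""
shared = conn.execute(query, (20,)).fetchall()
print(shared)
conn.close()
```

Changing the `WHERE` clause (depth, experiment, region) is all it takes to turn this into a different canned or dynamic filter.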
15. Compare and Reporting
• Take in ROI or VOI from the visualize and filter stage
• Take in user defined VOI or VOI + ROI
• Leverage SW under ValidationProtocol.org to generate reports
and files including, but not limited to:
– Summary of completeness, accuracy, phasing
– Discordant variants in VCF
– Concordant variants in VCF
– Phasing errors in VCF
• Provide an intuitive way to feed these results into downstream
analysis SW (VariantViz, IO8) or back into the browser (User
Defined Track)
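The report categories above can be sketched as a comparison of two genotype maps. This is not the ValidationProtocol.org software: the function name and input layout are assumptions, and the phasing check is a deliberate simplification that assumes both call sets order haplotypes the same way (a real comparison would look at switch errors between neighboring sites).

```python
def compare_calls(truth, calls):
    """Split sites into concordant, discordant, and phasing-error lists.

    truth, calls: dict mapping (chrom, pos, ref, alt) -> genotype string,
    e.g. "0/1" (unphased) or "0|1" (phased).
    """
    def unphased(genotype):
        return tuple(sorted(genotype.replace("|", "/").split("/")))

    concordant, discordant, phasing_errors = [], [], []
    for key in sorted(set(truth) | set(calls)):
        t, c = truth.get(key), calls.get(key)
        if t is None or c is None or unphased(t) != unphased(c):
            discordant.append(key)       # missed, extra, or wrong genotype
        else:
            concordant.append(key)
            if "|" in t and "|" in c and t != c:
                phasing_errors.append(key)  # same alleles, different phase
    return {
        "concordant": concordant,
        "discordant": discordant,
        "phasing_errors": phasing_errors,
    }


# Illustrative genotypes only.
truth = {
    ("1", 100, "A", "G"): "0|1",
    ("1", 200, "C", "T"): "1|1",
    ("1", 300, "G", "A"): "0|1",
}
calls = {
    ("1", 100, "A", "G"): "1|0",
    ("1", 200, "C", "T"): "1/1",
    ("1", 400, "T", "C"): "0/1",
}
report = compare_calls(truth, calls)
print({name: len(sites) for name, sites in report.items()})
```

Each list could then be written out as its own VCF and loaded back into the browser as a user-defined track.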
16. Archon Genomics XPrize
• $10 million prize competition to showcase whole
genome sequencing technology
• Award to the team(s) who can most completely,
accurately and affordably sequence 100 human
genomes in 30 days or less
• Competing Teams will sequence the genomes of the
100 centenarians who have evaded the usual
diseases of aging such as heart disease, diabetes,
cancer and Alzheimer’s
18. AGXP Validation Study Analysis
• 2 Major Phases using NA19239 and NA12878
– Develop Reference Standards
• Fosmid Reconstruction, Variation Discovery
• Technology Comparison and Bias Removal
– Develop Performance Metrics
• Software Development
• Help labs use the data
19. Compare and Report
• The validationprotocol.org website provides a
simple way for anyone to compare their
variant calls against the public reference
genomes.
• Encourages submission and analysis in public
tools like Galaxy through transparent
interoperability with GenomeSpace.
23. Follow On
• Export different categories
(Concordant/Discordant/Phasing Error)
variants to VariantViz IO8
• Visualize Quality, Allele Frequencies, Depth,
etc Info to detect patterns in and between
variant categories
27. XPrize Team
• Justin H. Johnson and Team - EdgeBio
• Brad Chapman Harvard: automated high-throughput analysis pipelines
with custom visualization and processing tools
• Gabor Marth Boston College: Read mapping, single-nucleotide and
insertion-deletion polymorphism detection, and discovery of structural
variants.
• Aaron Quinlan University of Virginia: structural variation (SV)
• Granger Sutton JCVI: Oversight Committee
• Victor Jongeneel University of Illinois and NCSA: Oversight Committee
• Larry Kedes UCLA: Oversight Committee
28. EdgeBio Team
• LAB
– Joy Adigun
– Ryan Mease
– Jennifer Sheffield
– Aaron Johnson
– Jackie Jackson
• IFX
– David Jenkins
– Anju Varadarajan
– Vani Rajan
– Karthik Kota
– Phil Dagasto
• Adam Bennett
• Isabel Llorente
29. Thank You!
More info available at
http://bit.ly/agxpval
http://www.genomeinabottle.org