2. Users of Pipeline
• Running the integration pipeline
  • GIAB team at NIST
• Benchmark set users
  • Sequencing technology developers, method developers, consortium members, clinical laboratories
• Users are interested in why a variant was included in the benchmark set
3. Existing Pipeline
• Code available on GitHub: https://github.com/jzook/genome-data-integration
• Bash- and Perl-based DNAnexus applet
• Dependencies:
  • vcflib: vcfcombine, vcfannotate, vcfintersect, vcfallelicprimitives
  • rtg: vcfstats, vcffilter
  • bedtools: multiIntersectBed, subtractBed
[Figure: integration diagram. PASS variants and callable regions from input methods #1 and #2 are combined into benchmark calls and benchmark regions. Example sites (genotypes 0/1 and 1/1) fall into four categories: (1) concordant, (2) discordant unresolved, (3) discordant arbitrated, (4) concordant not callable.]
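The four categories named in the integration diagram can be illustrated with a small Python sketch. This is not the pipeline's actual logic: the function name `classify_site`, the input dictionaries, and the decision rules (trust only genotypes from methods that are callable at the site) are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def classify_site(
    genotypes: Dict[str, Optional[str]],
    callable_in: Dict[str, bool],
) -> Tuple[str, Optional[str]]:
    """Assign one site to an integration category (hypothetical rules).

    genotypes:   method name -> PASS genotype string, or None if no PASS call
    callable_in: method name -> True if the site is in that method's callable regions
    Returns (category, benchmark genotype or None).
    """
    calls = {m: gt for m, gt in genotypes.items() if gt is not None}
    if not calls:
        return "no_call", None

    all_genotypes = set(calls.values())
    # Genotypes reported by methods that consider the site callable.
    trusted = {gt for m, gt in calls.items() if callable_in.get(m, False)}

    if not trusted:
        # No method is callable here, so no benchmark call is made.
        if len(all_genotypes) == 1:
            return "concordant_not_callable", None
        return "discordant_unresolved", None
    if len(trusted) > 1:
        # Callable methods disagree and nothing arbitrates between them.
        return "discordant_unresolved", None

    genotype = next(iter(trusted))
    if len(all_genotypes) == 1:
        return "concordant", genotype
    # Methods disagree, but only one genotype comes from a callable method.
    return "discordant_arbitrated", genotype


if __name__ == "__main__":
    print(classify_site({"m1": "1/1", "m2": "0/1"}, {"m1": True, "m2": False}))
```

Genotype strings are compared verbatim here; a real implementation would need to normalize allele order and variant representation before comparing.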
4. Current development
• Rewritten in Python
  • Eases continued development, building visualization utilities, and a testing framework
• Building a standalone application alongside a DNAnexus applet
• Union of annotated VCF fields stored in an appropriate data structure for outlier detection
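One way to picture annotated VCF fields held in a structure suited to outlier detection is a list of per-variant dictionaries. The sketch below is an assumption, not the pipeline's design: `parse_info` and `mad_outliers` are hypothetical helpers, `DP` and `MQ` are example annotation names, and the modified z-score based on the median absolute deviation is a stand-in for whatever outlier method the pipeline uses.

```python
import statistics
from typing import Dict, List, Union

Value = Union[float, str, bool]


def parse_info(info: str) -> Dict[str, Value]:
    """Turn a VCF INFO string such as 'DP=30;MQ=60;DB' into a dictionary."""
    fields: Dict[str, Value] = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            try:
                fields[key] = float(value)
            except ValueError:
                fields[key] = value  # keep non-numeric or multi-valued fields as text
        elif item:
            fields[item] = True  # flag-style annotation with no value
    return fields


def mad_outliers(
    records: List[Dict[str, Value]], key: str, threshold: float = 3.5
) -> List[Dict[str, Value]]:
    """Return records whose numeric `key` has a modified z-score above `threshold`."""
    values = [r[key] for r in records if isinstance(r.get(key), float)]
    if not values:
        return []
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread, so the score is undefined
    return [
        r
        for r in records
        if isinstance(r.get(key), float)
        and 0.6745 * abs(r[key] - median) / mad > threshold
    ]


if __name__ == "__main__":
    records = [parse_info(f"DP={d};MQ=60") for d in (30, 31, 29, 32, 30, 300)]
    print(mad_outliers(records, "DP"))
```

Keeping each variant as a dictionary makes it straightforward to take the union of fields across callsets, since a field missing from one callset is simply an absent key.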
5. Development Plan
• Visualization of parameter selections for data-driven analysis
• Make it easier for users to find the support for a given variant across callsets and the criteria used to make the call during the integration process
  • Information is in *annotated.vcf from the integration pipeline
• Modular design for using the integration pipeline with different human genome assemblies and potentially other organisms
• Working on a utility to speed manual curation of variants of interest
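Looking up the support for a single variant in an annotated VCF could work roughly as follows. This is a minimal sketch under stated assumptions: `find_variant_support` is a hypothetical helper, and the INFO keys `callsets` and `arbitrated` in the example are placeholders, not a statement of what *annotated.vcf contains.

```python
from typing import Dict, Iterable, Optional


def find_variant_support(
    vcf_lines: Iterable[str], chrom: str, pos: int, ref: str, alt: str
) -> Optional[Dict[str, object]]:
    """Return FILTER and INFO annotations for the first matching record, else None."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        if (
            cols[0] == chrom
            and int(cols[1]) == pos
            and cols[3] == ref
            and alt in cols[4].split(",")
        ):
            info: Dict[str, str] = {}
            for item in cols[7].split(";"):
                key, _, value = item.partition("=")
                info[key] = value
            return {"filter": cols[6], "info": info}
    return None


if __name__ == "__main__":
    # Example record with placeholder INFO keys.
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "1\t1000\t.\tA\tG\t50\tPASS\tcallsets=2;arbitrated=FALSE\n",
    ]
    print(find_variant_support(example, "1", 1000, "A", "G"))
```

Because the function accepts any iterable of lines, it can be passed an open file handle as well as an in-memory list.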
6. Development Plan – Benchmarking Output
• Tie testing closely to the comparison framework to track variants lost and gained between GIAB callset versions
• Ease access to hap.py output for a user’s regions/variants of interest
• Identify stratifications where the user’s evaluation set performs best or has problems
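Tracking variants lost and gained between two callset versions can be reduced, in its simplest form, to set differences over variant keys. The sketch below assumes exact matching on (chromosome, position, REF, ALT); `variant_keys` and `compare_versions` are hypothetical names, and a real comparison would likely rely on a dedicated comparison tool to handle differing variant representations.

```python
from typing import Dict, Iterable, Set, Tuple

VariantKey = Tuple[str, int, str, str]


def variant_keys(vcf_lines: Iterable[str]) -> Set[VariantKey]:
    """Collect (chrom, pos, ref, alt) keys, one per ALT allele."""
    keys: Set[VariantKey] = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blank lines
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref = cols[0], int(cols[1]), cols[3]
        for alt in cols[4].split(","):
            keys.add((chrom, pos, ref, alt))
    return keys


def compare_versions(
    old_lines: Iterable[str], new_lines: Iterable[str]
) -> Dict[str, object]:
    """Report variants lost and gained going from the old to the new callset."""
    old_keys = variant_keys(old_lines)
    new_keys = variant_keys(new_lines)
    return {
        "lost": sorted(old_keys - new_keys),
        "gained": sorted(new_keys - old_keys),
        "shared": len(old_keys & new_keys),
    }


if __name__ == "__main__":
    old = ["1\t100\t.\tA\tG\t.\tPASS\t.\n", "1\t200\t.\tC\tT\t.\tPASS\t.\n"]
    new = ["1\t200\t.\tC\tT\t.\tPASS\t.\n", "1\t300\t.\tG\tA,C\t.\tPASS\t.\n"]
    print(compare_versions(old, new))
```

Running such a comparison as part of the test suite would flag unexpected losses whenever the integration logic changes.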
7. Future Development
• Explore machine learning to automate or accelerate the integration process
[Figure: machine learning workflow. Labels: Raw callsets; Filters/Annotations; Refined callsets; Benchmark callset; Outlier detection; Views 1 to n; Multi-view classification; Active learning + crowd-sourced curation.]