The document describes a scalable WES/WGS processing pipeline and variant interpretation tool. The pipeline uses a workflow programming model to process data in parallel across cloud resources for improved scalability and cost efficiency. The variant interpretation tool provides a simple interface for clinicians while maintaining a history of investigations and provenance for accountability and analytics across patient cases.
Invited cloud-e-Genome project talk at 2015 NGS Data Congress
1. NGS Data Congress, London, June 2015 (P. Missier)
Scalable WES Processing And Variant Interpretation
With Provenance Recording
Using Workflow On The Cloud
Paolo Missier, Jacek Cała, Yaobo Xu,
Eldarina Wijaya, Ryan Kirby
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
NGS Data Congress
London, June 15th, 2015
2. The Cloud-e-Genome project at Newcastle
Objectives:
1. NGS data processing:
• Implement a flexible WES/WGS pipeline
• Scalable deployment over a public cloud
2. Traceable variant interpretation:
• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain a history of past investigations for analytical purposes
With an aim to:
• Cost control
• Scalability
• Flexibility, of design and of maintenance
• Ensure accountability through traceability
• Enable analytics over past patient cases
Project facts:
• 2-year pilot project: 2013-2015
• Funded by the UK’s National Institute for Health Research (NIHR)
• Cloud resources from an Azure for Research Award
3. Part I: data processing
Objectives:
• Design and implement a flexible WES/WGS pipeline
• Using workflow technology as a high-level programming model
• Providing scalable deployment over a public cloud
4. Scripted NGS data processing pipeline
1. Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner.
2. Cleaning and duplicate elimination (Picard tools).
3. Recalibration (GATK): corrects for systematic bias in the quality scores assigned by the sequencer.
4. Coverage: computes the coverage of each read.
5. Variant calling: operates on multiple samples simultaneously and splits samples into chunks; the Haplotype Caller detects both SNVs and longer indels.
6. Variant recalibration: attempts to reduce the false-positive rate from the caller.
7. VCF subsetting by filtering, e.g. removing non-exomic variants.
8. Annotation: Annovar functional annotations (e.g. MAF, synonymy, SNPs…), followed by in-house annotations.
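The chunking used by the variant-calling stage (samples split into chunks that are then processed in parallel) can be illustrated with a minimal sketch; the function name and chunk size are hypothetical, not taken from the pipeline:

```python
def chunks(items, size):
    """Split a list of samples/regions into fixed-size chunks, as the
    variant-calling stage does before processing chunks in parallel."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# e.g. chunks(["s1", "s2", "s3", "s4", "s5"], 2)
# → [["s1", "s2"], ["s3", "s4"], ["s5"]]
```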
5. Scripts to workflow - Design
Design → Cloud deployment → Execution → Analysis
Theoretical advantages of using a workflow programming model:
• Better abstraction
• Easier to understand, share, maintain
• Better exploitation of data parallelism
• Extensible by wrapping new tools
6. Workflow Design
echo "Preparing directories $PICARD_OUTDIR and $PICARD_TEMP"
mkdir -p "$PICARD_OUTDIR"
mkdir -p "$PICARD_TEMP"

echo "Starting PICARD to clean BAM files..."
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED

echo "Starting PICARD to remove duplicates..."
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG \
  METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true

echo "Adding read group information to BAM file..."
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS \
  RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
  RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"

echo "Indexing BAM files..."
samtools index $SORTED_BAM_FILE_NODUPS
In the workflow re-design, each step of the script above becomes a “wrapper” block; utility blocks handle file import and export.
20. Lessons learnt
Design → Cloud deployment → Execution → Analysis

Design:
• Better abstraction: easier to understand, share, maintain
• Better exploitation of data parallelism
• Extensible by wrapping new tools

Cloud deployment:
• Scalability
• Fewer installation/deployment requirements and staff hours required
• Automated dependency management and packaging

Execution:
• Configurable to make the most efficient use of a cluster
• Runtime monitoring

Analysis:
• Provenance collection
• Reproducibility
• Accountability
21. Part II: SVI - simple, traceable variant interpretation
Objectives:
• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain a history of past investigations for analytical purposes
• Ensure accountability through traceability
• Enable analytics over past patient cases
Variant filtering (input: annotated patient variants from the NGS pipeline):
• MAF threshold
• Non-synonymous, stop/gain, frameshift variants
• Known polymorphisms
• Homozygous / heterozygous
• Pathogenicity predictors

Variant scoping:
• User-supplied disease keywords → HPO match → HPO to OMIM → OMIM match → OMIM to gene
• User-supplied genes list and user-defined preferred genes → gene union / gene intersect
• The resulting genes in scope are used to select the variants in scope, yielding the candidate variants

Variant classification (ClinVar/OMIM lookup on the candidate variants):
• RED: found, pathogenic
• AMBER: not found, or found with uncertain significance
• GREEN: found, benign
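The scoping and traffic-light classification steps can be sketched in a few lines. This is an illustration only, not the SVI implementation; all names and the `clinvar_status` values are hypothetical:

```python
def genes_in_scope(phenotype_genes, user_genes, mode="union"):
    """Combine genes derived from the phenotype (HPO -> OMIM -> gene)
    with a user-supplied gene list, by union or intersection."""
    a, b = set(phenotype_genes), set(user_genes)
    return a | b if mode == "union" else a & b

def classify(clinvar_status):
    """Traffic-light classification from a ClinVar lookup result."""
    if clinvar_status == "pathogenic":
        return "RED"
    if clinvar_status == "benign":
        return "GREEN"
    return "AMBER"  # not found, or found with uncertain significance
```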
24. Provenance of variant identification
• A provenance graph is generated for each investigation
• It accounts for the filtering process for each variant listed in the result
• It enables analytics over provenance graphs across many investigations, e.g. “which variants were identified independently on different cases, and how do they correlate with phenotypes?”
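The kind of cross-investigation query quoted above can be sketched over a flat record set. Real provenance graphs (e.g. W3C PROV) are far richer; the record structure and field names here are hypothetical:

```python
from collections import defaultdict

# One record per (investigation, identified variant); hypothetical structure.
records = [
    {"case": "case1", "variant": "chr1:g.100A>G", "phenotype": "HPO:0001"},
    {"case": "case2", "variant": "chr1:g.100A>G", "phenotype": "HPO:0001"},
    {"case": "case3", "variant": "chr2:g.200C>T", "phenotype": "HPO:0002"},
]

def variants_across_cases(records, min_cases=2):
    """Return variants identified independently in at least `min_cases`
    cases, together with the phenotypes of those cases."""
    by_variant = defaultdict(lambda: {"cases": set(), "phenotypes": set()})
    for r in records:
        by_variant[r["variant"]]["cases"].add(r["case"])
        by_variant[r["variant"]]["phenotypes"].add(r["phenotype"])
    return {v: d for v, d in by_variant.items() if len(d["cases"]) >= min_cases}
```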
25. Summary
What we are delivering to NIHR:
1. WES/WGS data processing to annotated variants:
• Scalable, cloud-based
• High-level
• Low cost per sample
2. Variant interpretation:
• Simple, targeted at clinicians
• Built-in accountability of genetic diagnosis
• Analytics over a database of past investigations
Editor's Notes
Objective 1: Implement a cloud-based, secure, scalable computing infrastructure that is capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnosis for health care professionals.
Objective 2: a front-end tool to facilitate clinical diagnosis.
2 year pilot project
Funded by the UK’s National Institute for Health Research (NIHR) through the Biomedical Research Centre (BRC)
Nov. 2013: Cloud resources from Azure for Research Award
1 year’s worth of data/network/computing resources
Current local implementation:
• Scripted pipeline requires expertise to maintain and evolve
• Deployed on a local department cluster
• Difficult to scale
• Cost per patient unknown
• Unable to take advantage of the decreasing cost of commodity cloud resources
Coverage information translates into confidence in a variant call.
Recalibration (quality score recalibration): the sequencer produces colour coding for the four nucleotide bases, along with a p-value indicating the highest-probability call; these are the Q scores. Different platforms introduce different systematic biases in the Q scores, and the bias also depends on the lane: each lane gives a different systematic bias. The point of recalibration is to correct for this type of bias.
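The Q scores mentioned here are Phred-scaled error probabilities, Q = -10 * log10(p); a minimal illustration:

```python
import math

def phred(p_error):
    """Phred-scale an error probability: Q = -10 * log10(p_error)."""
    return -10 * math.log10(p_error)

# e.g. an error probability of 0.001 corresponds to Q = 30
```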
Wrapper blocks, such as Picard-CleanSAM and Picard-MarkDuplicates, communicate via files in the local filesystem of the workflow engine, which is explicitly denoted as a connection between blocks. The workflow also includes utility blocks to import and export files, i.e. to transfer data from/to the shared data space (in this case, the Azure blob store).
These were complemented by e-SC shared libraries, which provide better efficiency in running the tools, as they are installed only once and cached by the workflow engine for any future use. Libraries also promote reproducibility because they eliminate dependencies on external data and services. For instance, to access the human reference genome we built and stored in the system a shared library that included the genome data in a specific version and flavour (precisely HG19 from UCSC).
Loops were used in stages (1) and (3) to iterate over the samples that the pipeline was configured to process.
Control blocks can start a number of sub-workflow invocations, one for each element on their input list. Using these two features, we were able to implement a pattern similar to “map” (in the functional sense), where the initial block generates a list of data samples to process, and then for each element in the list the following block starts a sub-workflow (the loop body).
Sync design: the sub-workflows of each step are executed in parallel but synchronously over a number of samples. That is, the top-level workflow submits N sub-workflow invocations for a particular step and waits until all of them complete before moving on.
The primary advantage of this synchronous design is that the structure of the pipeline is modular and clearly represented by the top-level orchestrating workflow, whilst parallelisation is managed by e-SC automatically. The top-level workflow mainly includes blocks that run sub-workflows, which are independent parts implementing only the actual work done by a particular step. The control blocks take care of the interaction with the system to submit the sub-workflows and also suspend the parent invocation until all of them complete.
The current execution model is synchronous.
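The synchronous “map” pattern described above (submit one sub-workflow per sample, then block until all complete) behaves like this sketch; it uses a thread pool as a stand-in for e-SC, and all names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subworkflow(sample):
    # Stand-in for submitting one sub-workflow invocation for one sample.
    return f"{sample}.processed"

def map_step(samples, max_workers=4):
    """Run one sub-workflow per sample in parallel, then block until all
    complete, mirroring the synchronous top-level orchestration."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_subworkflow, samples))
```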
Each sample included 2-lane, paired-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
A quick overview of the entered phenotype: it shows how many genes found in OMIM match genes found in the patient’s variants. The graph shows a quick summary of any results produced from ClinVar. The phenotypes section in the bottom right shows results from HPO. The report section on the left shows a collection of all the investigations created for the current case (including the one just created).