A paper presented at the annual Italian database conference (SEBD 2018): http://sisinflab.poliba.it/sebd/2018/
The full paper: http://sisinflab.poliba.it/sebd/2018/papers/June-27-Wednesday/1-Big-Data/SEBD_2018_paper_23.pdf
Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools
1. Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools
Nicholas Tucci1, Jacek Cala2, Jannetta Steyn2, Paolo Missier2
(1) Dipartimento di Ingegneria Elettronica, Università Roma Tre, Italy
(2) School of Computing, Newcastle University, UK
SEBD 2018, Italy
In collaboration with the Institute of Genetic Medicine,
Newcastle University
2. Motivation: genomics at scale
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
Current cost of whole-genome sequencing: < £1,000
https://www.genomicsengland.co.uk/the-100000-genomes-project/
(Our) processing time: about 40 minutes / GB
- at 11 GB / sample (exome): ~8 hours
- at 300-500 GB / sample (genome): …
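As a rough back-of-envelope check (assuming the quoted 40 minutes/GB rate scales linearly with input size):

```latex
% Back-of-envelope estimates at ~40 min/GB (assumption: linear scaling with size)
\begin{align*}
  \text{exome, } 11\ \mathrm{GB}: \;& 11 \times 40' = 440' \approx 7.3\ \mathrm{h} \;(\approx 8\ \mathrm{h})\\
  \text{genome, } 300\text{--}500\ \mathrm{GB}: \;& 300\text{--}500 \times 40' = 12\,000\text{--}20\,000' \approx 200\text{--}330\ \mathrm{h}
\end{align*}
```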
3. Genome Analysis Toolkit (GATK): best practices
Source: Broad Institute, https://software.broadinstitute.org/gatk/best-practices/
Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset
in VCF format.
4. Key points
1. Time and cost:
• Spark implementation at the cutting edge: still in beta but progressing rapidly
• Cluster deployment provides speedup but with limitations
• Azure Genomics Services is cheaper and faster but a black-box service
2. Quality of the analysis:
What is the relative impact of new versions on the variant output?
(how quickly do results become obsolete?)
http://recomp.org.uk/
5. Multi-sample WES pipeline
[Pipeline diagram] Three phases: PREPROCESSING, VARIANT DISCOVERY, CALLSET REFINEMENT.
Per-sample steps (Sample 1 … Sample N): raw reads → FastqToSam → BWA ("map to reference" alignment, against h19, h37, h38, …) → MarkDuplicates (flags up multiple paired reads) → BQSR ("Base Quality Score Recalibration": assigns confidence values to aligned reads) → HaplotypeCallerSpark (calls variants, SNPs and indels, per sample).
Joint steps: Genotype VCFs → Recalibration → Genotype Refinement → Select Variants.
Per-sample annotation: ANNOVAR → IGM Anno → Exonic Filter (filter for accuracy).
Two levels of data parallelism (see the sketch below):
- across samples (pre-processing)
- within single-sample processing
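Below is a minimal sketch of the outer, across-sample level of parallelism (plain Python; the run_preprocessing.sh script and sample names are hypothetical placeholders, not the authors' code). The inner level is provided by the Spark-enabled GATK tools themselves, which partition the reads of a single sample.

```python
# Sketch only: launch the independent per-sample pre-processing chains concurrently.
# "run_preprocessing.sh" stands in for FastqToSam -> BWA -> MarkDuplicates -> BQSR
# -> HaplotypeCallerSpark; it is a hypothetical wrapper, not part of the pipeline.
import subprocess
from concurrent.futures import ThreadPoolExecutor

samples = ["sample1", "sample2", "sample3"]  # hypothetical sample identifiers

def preprocess(sample: str) -> str:
    subprocess.run(["./run_preprocessing.sh", sample], check=True)
    return f"{sample}.g.vcf"  # each sample yields its own per-sample (g)VCF

# Samples are mutually independent, so their pre-processing can run in parallel.
with ThreadPoolExecutor(max_workers=len(samples)) as pool:
    gvcfs = list(pool.map(preprocess, samples))
print(gvcfs)
```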
6. Exploiting parallelism – state of the art
• Split-and-merge / wrapper approach, e.g. Gesall [1]:
1. Partition each exome by chromosome / auto-load balancing
2. “drive” standard BWA on each partition
3. Merge the partial results
• Heavy MapReduce stack required between HDFS and BWA
• See also [2,3]
• GATK releasing Spark implementations of BWA, BQSR, HC
• Natively exploits Spark infrastructure – HDFS data partitioning
[1] A. Roy et al., "Massively parallel processing of whole genome sequence data: an in-depth performance study," in Procs. SIGMOD 2017, pp. 187–202.
[2] H. Mushtaq and Z. Al-Ars, "Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline," in Procs. IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2015, pp. 1471–1477.
[3] X. Li, G. Tan, B. Wang, and N. Sun, "High-performance genomic analysis framework with in-memory computing," SIGPLAN Not., vol. 53, no. 1, pp. 317–328, Feb. 2018.
7. Spark hybrid implementation
[Pipeline diagram, as on slide 5, annotated] The Spark-enabled pre-processing tools (BWA/MarkDuplicates, BQSR, HaplotypeCaller) are natively ported to Spark; the remaining, non-Spark tools are wrapped using Spark's pipe() operator.
Single-node deployment:
- Pre-processing: one iteration / sample
- Discovery: single batch execution
8. The Spark pipe() operator
[Diagram] A partitioned RDD is streamed through a Bash/Perl shell script via stdin/stdout using pipe().
- Wraps local code through the shell
- Effective but inefficient: it breaks the RDD in-memory model (see the sketch below)
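A minimal PySpark sketch of the pipe() pattern; the external command here is just `tr`, a stand-in for the real wrapper scripts, and the data are toy strings:

```python
# Sketch only: each RDD partition is streamed to an external command via stdin,
# and its stdout is read back as a new RDD. This is effective for wrapping local
# tools, but it serialises data through the shell and leaves Spark's in-memory model.
from pyspark import SparkContext

sc = SparkContext(appName="pipe-sketch")

reads = sc.parallelize(["acgt", "ttga", "ccat", "gggc"], numSlices=2)

# pipe() runs the command once per partition; 'tr' here stands in for a
# Bash/Perl wrapper script around a non-Spark pipeline tool.
upper = reads.pipe("tr 'a-z' 'A-Z'")

print(upper.collect())   # ['ACGT', 'TTGA', 'CCAT', 'GGGC']
sc.stop()
```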
9. Spark cluster virtualisation using Swarm
• Automated distribution of Docker containers over a cluster of VMs.
• Swarm: nodes running Docker and joined in a cluster
• Swarm Manager executes Docker commands on the cluster
Transparency issues: reference data is mostly shared over HDFS, but:
1. non-Spark tools require local data → mount the HDFS data nodes as virtual Docker volumes
2. the reference genome is replicated to every node (Swarm global replication)
[Diagram] The Spark Master and HDFS Namenode are deployed on the Swarm Manager; containers communicate over a dedicated overlay network. A minimal sketch of this setup follows.
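The sketch below uses the Docker SDK for Python; the image name, service name and mount paths are hypothetical placeholders, not the actual deployment. It shows the two workarounds named above: a dedicated overlay network, plus a globally replicated service so the reference data is present on every Swarm node.

```python
# Sketch only: overlay network + globally replicated service on a Docker Swarm.
# Image, service name and paths are hypothetical placeholders.
import docker
from docker.types import ServiceMode

client = docker.from_env()  # assumes this host is (or can reach) the Swarm Manager

# Dedicated overlay network for Spark/HDFS traffic between containers.
client.networks.create("spark-overlay", driver="overlay")

# "Global" mode schedules one task per Swarm node, so the mounted reference
# data is available locally on every node for the non-Spark tools.
client.services.create(
    "example/gatk-worker:latest",              # hypothetical image
    name="gatk-worker",
    networks=["spark-overlay"],
    mounts=["/data/reference:/reference:ro"],  # host path mounted read-only
    mode=ServiceMode("global"),
)
```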
10. Pipeline execution flow in cluster mode
- Non-Spark tools remain centralised
- Data sharing still through HDFS (shallow integration across Spark tools): no in-memory optimisation
11. Evaluation: focus
Evaluation focused on pre-processing (BWA/MD, BQSRP, HC):
- the heaviest phase: BWA/MD (38%) + BQSRP (11%) + HC (39%) = 88% of total time, vs. 12% for discovery and refinement
- these are the Spark tools, the focus of the study
[Pie chart: share of total processing time — BWA/MD 38%, BQSRP 11%, HC 39%, discovery and refinement 12%]
[Pipeline diagram repeated from slide 5]
12. Evaluation: setup
Six exomes from the Institute of Genetic Medicine at Newcastle
Sample sizes: 10.8 GB – 15.9 GB, avg 13.5 GB (compressed)
Deployment modes:
- Single-node "pseudo-cluster" deployment
- Cluster mode with up to 4 nodes
All deployments on the Azure cloud: 8 cores, 55 GB RAM per node
14. Normalised pre-processing time / GB
[Charts] (a) Average time/GB for two configurations; (b) pre-processing time/GB (all three steps) across four configurations for a single sample (14.2 GB).
15. Speedup
[Charts] (a) Scale-out: minutes vs. number of nodes (1–4), for BWA/MD + BQSRP, BWA/MD, and BQSRP; (b) scale-up: minutes vs. number of cores (8, 16, 32), for BWA/MD + BQSRP.
Note: HC is not included due to technical issues running HC on 16 cores. Average HC time: 270 minutes (single sample).
Scale-up: single node (55 GB RAM)
Scale-out / cluster mode: 55 GB RAM, 8 cores per node
Cluster overhead (a worked comparison follows):
- 8 cores × 2 nodes: 229 min  vs.  16 cores × 1 node: 165 min
- but: 8 cores × 4 nodes: 137 min  vs.  32 cores × 1 node: 175 min
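One way to read the cluster-overhead numbers above (assuming equal total core counts are the right basis for comparison) is as a ratio of cluster to single-node time:

```latex
% Ratio of scale-out (cluster) to scale-up (single node) time at equal total core counts
\[
  \frac{t_{2\times 8\ \mathrm{cores}}}{t_{1\times 16\ \mathrm{cores}}} = \frac{229}{165} \approx 1.39,
  \qquad
  \frac{t_{4\times 8\ \mathrm{cores}}}{t_{1\times 32\ \mathrm{cores}}} = \frac{137}{175} \approx 0.78 .
\]
```

That is, at 16 total cores the cluster is roughly 39% slower than the single node, while at 32 total cores the 4-node cluster overtakes the single node (whose time no longer improves beyond 16 cores).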
16. Comparison: Microsoft Genomics Services
Fast, but opaque:
• Processing time for PFC 0028 sample: 77 minutes
• Cost: £0.217/GB (≈ £19 for six samples)
• Our best time: 446 minutes (7.5 hrs) on a single node(*)
• Our costs (8 cores, 55GB, six samples): £28
• Running on a single, high-end VM
• But: specs undisclosed
• Not open -- no flexibility at all
(*) This is 176’ (single node, 16 cores) + 270’ (average HC processing time)
17. What we are doing now
All pipeline components change (rapidly)
How sensitive are prior results to version changes (in data / software tools / libraries)?
- Re-processing is time-consuming: continuous refresh is not scalable
- Can we quantify the effect of changes on a cohort of cases and prioritise re-computing?
Approach:
• Generate multiple variations of the baseline pipeline by injecting version changes
• Assess quality (specificity / sensitivity) of each result (set of variants) across the cohort [1]
[1] D. T. Houniet et al., “Using population data for assessing next-generation sequencing performance,”
Bioinformatics, vol. 31, no. 1, pp. 56–61, Jan. 2015.
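For reference, a reminder of the standard definitions of the two quality measures named above, in terms of true/false positive (TP, FP) and true/false negative (TN, FN) variant calls; how [1] instantiates them using population data is described in that paper.

```latex
% Standard definitions of the quality measures used to compare variant call sets
\[
  \text{sensitivity} = \frac{TP}{TP + FN},
  \qquad
  \text{specificity} = \frac{TN}{TN + FP}
\]
```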
18. ReComp
ReComp is about preserving value from large-scale data analytics over time through selective re-computation.
More on this topic:
J. Cala and P. Missier, "Selective and recurring re-computation of Big Data analytics tasks: insights from a genomics case study," Journal of Big Data Research, 2018 (in press).
http://recomp.org.uk/
marks duplicates by flagging multiple paired reads that map to the same start and end positions. Such reads often originate erroneously from DNA preparation methods; they introduce biases that skew variant calling and should therefore be removed before downstream analysis.
As both Spark and HDFS adopt a master-slave architecture, the masters (the Spark Master and the HDFS Namenode) are deployed on the Swarm Manager.
However, we also note that scaling out (adding nodes) may incur an overhead that makes it less efficient than scaling up (adding cores and memory to a single-node configuration). For instance, two nodes with 8 cores each take 229 minutes, while a single node with 16 cores takes 165 minutes. This overhead is less noticeable when using 32 cores, which as we noted earlier does not improve processing time on a single host (175 minutes, Fig.~\ref{fig:scale-up}), while a 4-node cluster with 8 cores per node takes 137 minutes, a further improvement over the other configurations.
However, at the time of writing these services were only offered as a \textit{black box} that runs on a single, high-end virtual machine of undisclosed specifications. In terms of pricing, the current charge for using Genomics Services is \pounds0.217/GB, which translates to about \pounds18.61 for processing our six samples. For comparison, the cost of processing the same samples using our pipeline with an 8-core, 55GB configuration is estimated at \pounds28.