A paper presented at the annual Italian database conference (SEBD 2018): http://sisinflab.poliba.it/sebd/2018/
The full paper: http://sisinflab.poliba.it/sebd/2018/papers/June-27-Wednesday/1-Big-Data/SEBD_2018_paper_23.pdf
Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools
1. Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools
Nicholas Tucci1, Jacek Cala2, Jannetta Steyn2, Paolo Missier2
(1) Dipartimento di Ingegneria Elettronica, Università Roma Tre, Italy
(2) School of Computing, Newcastle University, UK
SEBD 2018, Italy
In collaboration with the Institute of Genetic Medicine,
Newcastle University
2. Motivation: genomics at scale
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
Current cost of whole-genome sequencing: < £1,000
https://www.genomicsengland.co.uk/the-100000-genomes-project/
(Our) processing time: about 40 minutes / GB
- at 11 GB / sample (exome): ~8 hours
- at 300-500 GB / sample (genome): …
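As a rough back-of-envelope check (assuming the quoted 40 minutes/GB rate scales linearly with input size):

```latex
% Back-of-envelope estimates at ~40 min/GB (assumption: linear scaling with size)
\begin{align*}
  \text{exome, } 11\ \mathrm{GB}: \;& 11 \times 40' = 440' \approx 7.3\ \mathrm{h} \;(\approx 8\ \mathrm{h})\\
  \text{genome, } 300\text{--}500\ \mathrm{GB}: \;& 300\text{--}500 \times 40' = 12\,000\text{--}20\,000' \approx 200\text{--}330\ \mathrm{h}
\end{align*}
```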
3. Genome Analysis Toolkit (GATK): best practices
Source: Broad Institute, https://software.broadinstitute.org/gatk/best-practices/
Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset
in VCF format.
4. Key points
1. Time and cost:
• Spark implementation at the cutting edge: still in beta but progressing rapidly
• Cluster deployment provides speedup but with limitations
• Azure Genomics Services is cheaper and faster but a black-box service
2. Quality of the analysis:
What is the relative impact of new versions on the variant output?
(how quickly do results become obsolete?)
http://recomp.org.uk/
5. Multi-sample WES pipeline
[Pipeline diagram] Three phases: PREPROCESSING, VARIANT DISCOVERY, CALLSET REFINEMENT.
Per-sample steps (Sample 1 … Sample N): raw reads → FastqToSam → BWA ("map to reference" alignment, against h19, h37, h38, …) → MarkDuplicates (flags up multiple paired reads) → BQSR ("Base Quality Score Recalibration": assigns confidence values to aligned reads) → HaplotypeCallerSpark (calls variants, SNPs and indels, per sample).
Joint steps: Genotype VCFs → Recalibration → Genotype Refinement → Select Variants.
Per-sample annotation: ANNOVAR → IGM Anno → Exonic Filter (filter for accuracy).
Two levels of data parallelism (see the sketch below):
- across samples (pre-processing)
- within single-sample processing
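Below is a minimal sketch of the outer, across-sample level of parallelism (plain Python; the run_preprocessing.sh script and sample names are hypothetical placeholders, not the authors' code). The inner level is provided by the Spark-enabled GATK tools themselves, which partition the reads of a single sample.

```python
# Sketch only: launch the independent per-sample pre-processing chains concurrently.
# "run_preprocessing.sh" stands in for FastqToSam -> BWA -> MarkDuplicates -> BQSR
# -> HaplotypeCallerSpark; it is a hypothetical wrapper, not part of the pipeline.
import subprocess
from concurrent.futures import ThreadPoolExecutor

samples = ["sample1", "sample2", "sample3"]  # hypothetical sample identifiers

def preprocess(sample: str) -> str:
    subprocess.run(["./run_preprocessing.sh", sample], check=True)
    return f"{sample}.g.vcf"  # each sample yields its own per-sample (g)VCF

# Samples are mutually independent, so their pre-processing can run in parallel.
with ThreadPoolExecutor(max_workers=len(samples)) as pool:
    gvcfs = list(pool.map(preprocess, samples))
print(gvcfs)
```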
6. Exploiting parallelism – state of the art
• Split-and-merge / wrapper approach, e.g. Gesall [1]:
1. Partition each exome by chromosome / auto-load balancing
2. “drive” standard BWA on each partition
3. Merge the partial results
• Heavy MapReduce stack required between HDFS and BWA
• See also [2,3]
• GATK releasing Spark implementations of BWA, BQSR, HC
• Natively exploits Spark infrastructure – HDFS data partitioning
[1] A. Roy et al., "Massively parallel processing of whole genome sequence data: an in-depth performance study," in Procs. SIGMOD 2017, pp. 187–202.
[2] H. Mushtaq and Z. Al-Ars, "Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline," in Procs. IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2015, pp. 1471–1477.
[3] X. Li, G. Tan, B. Wang, and N. Sun, "High-performance genomic analysis framework with in-memory computing," SIGPLAN Not., vol. 53, no. 1, pp. 317–328, Feb. 2018.
7. Spark hybrid implementation
[Pipeline diagram, as on slide 5, annotated] The Spark-enabled pre-processing tools (BWA/MarkDuplicates, BQSR, HaplotypeCaller) are natively ported to Spark; the remaining, non-Spark tools are wrapped using Spark's pipe() operator.
Single-node deployment:
- Pre-processing: one iteration / sample
- Discovery: single batch execution
8. The Spark pipe() operator
[Diagram] A partitioned RDD is streamed through a Bash/Perl shell script via stdin/stdout using pipe().
- Wraps local code through the shell
- Effective but inefficient: it breaks the RDD in-memory model (see the sketch below)
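A minimal PySpark sketch of the pipe() pattern; the external command here is just `tr`, a stand-in for the real wrapper scripts, and the data are toy strings:

```python
# Sketch only: each RDD partition is streamed to an external command via stdin,
# and its stdout is read back as a new RDD. This is effective for wrapping local
# tools, but it serialises data through the shell and leaves Spark's in-memory model.
from pyspark import SparkContext

sc = SparkContext(appName="pipe-sketch")

reads = sc.parallelize(["acgt", "ttga", "ccat", "gggc"], numSlices=2)

# pipe() runs the command once per partition; 'tr' here stands in for a
# Bash/Perl wrapper script around a non-Spark pipeline tool.
upper = reads.pipe("tr 'a-z' 'A-Z'")

print(upper.collect())   # ['ACGT', 'TTGA', 'CCAT', 'GGGC']
sc.stop()
```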
9. Spark cluster virtualisation using Swarm
• Automated distribution of Docker containers over a cluster of VMs.
• Swarm: nodes running Docker and joined in a cluster
• Swarm Manager executes Docker commands on the cluster
Transparency issues: reference data is mostly shared over HDFS, but:
1. non-Spark tools require local data → mount the HDFS data nodes as virtual Docker volumes
2. the reference genome is replicated to every node (Swarm global replication)
[Diagram] The Spark Master and HDFS Namenode are deployed on the Swarm Manager; containers communicate over a dedicated overlay network. A minimal sketch of this setup follows.
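The sketch below uses the Docker SDK for Python; the image name, service name and mount paths are hypothetical placeholders, not the actual deployment. It shows the two workarounds named above: a dedicated overlay network, plus a globally replicated service so the reference data is present on every Swarm node.

```python
# Sketch only: overlay network + globally replicated service on a Docker Swarm.
# Image, service name and paths are hypothetical placeholders.
import docker
from docker.types import ServiceMode

client = docker.from_env()  # assumes this host is (or can reach) the Swarm Manager

# Dedicated overlay network for Spark/HDFS traffic between containers.
client.networks.create("spark-overlay", driver="overlay")

# "Global" mode schedules one task per Swarm node, so the mounted reference
# data is available locally on every node for the non-Spark tools.
client.services.create(
    "example/gatk-worker:latest",              # hypothetical image
    name="gatk-worker",
    networks=["spark-overlay"],
    mounts=["/data/reference:/reference:ro"],  # host path mounted read-only
    mode=ServiceMode("global"),
)
```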
10. Pipeline execution flow in cluster mode
- Non-Spark tools remain centralised
- Data sharing still through HDFS (shallow integration across Spark tools): no in-memory optimisation
11. Evaluation: focus
Evaluation focused on pre-processing (BWA/MD, BQSRP, HC):
- the heaviest phase: BWA/MD (38%) + BQSRP (11%) + HC (39%) = 88% of total time, vs. 12% for discovery and refinement
- these are the Spark tools, the focus of the study
[Pie chart: share of total processing time — BWA/MD 38%, BQSRP 11%, HC 39%, discovery and refinement 12%]
[Pipeline diagram repeated from slide 5]
12. Evaluation: setup
Six exomes from the Institute of Genetic Medicine at Newcastle
Sample sizes: 10.8 GB – 15.9 GB, avg 13.5 GB (compressed)
Deployment modes:
- Single-node "pseudo-cluster" deployment
- Cluster mode with up to 4 nodes
All deployments on the Azure cloud: 8 cores, 55 GB RAM per node
14. Normalised pre-processing time / GB
[Charts] (a) Average time/GB for two configurations; (b) pre-processing time/GB (all three steps) across four configurations for a single sample (14.2 GB).
15. Speedup
[Charts] (a) Scale-out: minutes vs. number of nodes (1–4), for BWA/MD + BQSRP, BWA/MD, and BQSRP; (b) scale-up: minutes vs. number of cores (8, 16, 32), for BWA/MD + BQSRP.
Note: HC is not included due to technical issues running HC on 16 cores. Average HC time: 270 minutes (single sample).
Scale-up: single node (55 GB RAM)
Scale-out / cluster mode: 55 GB RAM, 8 cores per node
Cluster overhead (a worked comparison follows):
- 8 cores × 2 nodes: 229 min  vs.  16 cores × 1 node: 165 min
- but: 8 cores × 4 nodes: 137 min  vs.  32 cores × 1 node: 175 min
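One way to read the cluster-overhead numbers above (assuming equal total core counts are the right basis for comparison) is as a ratio of cluster to single-node time:

```latex
% Ratio of scale-out (cluster) to scale-up (single node) time at equal total core counts
\[
  \frac{t_{2\times 8\ \mathrm{cores}}}{t_{1\times 16\ \mathrm{cores}}} = \frac{229}{165} \approx 1.39,
  \qquad
  \frac{t_{4\times 8\ \mathrm{cores}}}{t_{1\times 32\ \mathrm{cores}}} = \frac{137}{175} \approx 0.78 .
\]
```

That is, at 16 total cores the cluster is roughly 39% slower than the single node, while at 32 total cores the 4-node cluster overtakes the single node (whose time no longer improves beyond 16 cores).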
16. Comparison: Microsoft Genomics Services
Fast, but opaque:
• Processing time for PFC 0028 sample: 77 minutes
• Cost: £0.217/GB (≈ £19 for six samples)
• Our best time: 446 minutes (7.5 hrs) on a single node(*)
• Our costs (8 cores, 55GB, six samples): £28
• Running on a single, high-end VM
• But: specs undisclosed
• Not open -- no flexibility at all
(*) This is 176’ (single node, 16 cores) + 270’ (average HC processing time)
17. What we are doing now
All pipeline components change (rapidly)
How sensitive are prior results to version changes (in data / software tools / libraries)?
- Re-processing is time-consuming: continuous refresh is not scalable
- Can we quantify the effect of changes on a cohort of cases and prioritise re-computing?
Approach:
• Generate multiple variations of the baseline pipeline by injecting version changes
• Assess quality (specificity / sensitivity) of each result (set of variants) across the cohort [1]
[1] D. T. Houniet et al., “Using population data for assessing next-generation sequencing performance,”
Bioinformatics, vol. 31, no. 1, pp. 56–61, Jan. 2015.
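For reference, a reminder of the standard definitions of the two quality measures named above, in terms of true/false positive (TP, FP) and true/false negative (TN, FN) variant calls; how [1] instantiates them using population data is described in that paper.

```latex
% Standard definitions of the quality measures used to compare variant call sets
\[
  \text{sensitivity} = \frac{TP}{TP + FN},
  \qquad
  \text{specificity} = \frac{TN}{TN + FP}
\]
```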
18. ReComp
ReComp is about preserving value from large-scale data analytics over time through selective re-computation.
More on this topic:
J. Cala and P. Missier, "Selective and recurring re-computation of Big Data analytics tasks: insights from a genomics case study," Journal of Big Data Research, 2018 (in press).
http://recomp.org.uk/
marks duplicates by flagging multiple paired reads that map to the same start and end positions. Such reads often originate erroneously from DNA preparation methods; they introduce biases that skew variant calling and should therefore be removed before downstream analysis.
As both Spark and HDFS adopt a master-slave architecture, the masters (the Spark Master and the HDFS Namenode) are deployed on the Swarm Manager.
However, we also note that scaling out (adding nodes) may incur an overhead that makes it less efficient than scaling up (adding cores and memory to a single-node configuration). For instance, two nodes with 8 cores each take 229 minutes, while a single node with 16 cores takes 165 minutes. This overhead is less noticeable when using 32 cores, which as we noted earlier does not improve processing time on a single host (175 minutes, Fig.~\ref{fig:scale-up}), while a 4-node cluster with 8 cores per node takes 137 minutes, a further improvement over the other configurations.
However, at the time of writing these services were only offered as a \textit{black box} that runs on a single, high-end virtual machine of undisclosed specifications. In terms of pricing, the current charge for using Genomics Services is \pounds0.217/GB, which translates to about \pounds18.61 for processing our six samples. For comparison, the cost of processing the same samples using our pipeline with an 8-core, 55GB configuration is estimated at \pounds28.