
Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools

A paper presented at the annual Italian Database conference (SEBD): http://sisinflab.poliba.it/sebd/2018/

The paper is available here: http://sisinflab.poliba.it/sebd/2018/papers/June-27-Wednesday/1-Big-Data/SEBD_2018_paper_23.pdf

Publicada em: Tecnologia
  • Seja o primeiro a comentar

  • Seja a primeira pessoa a gostar disto

Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools

  1. Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools. Nicholas Tucci (1), Jacek Cala (2), Jannetta Steyn (2), Paolo Missier (2). (1) Dipartimento di Ingegneria Elettronica, Universita’ Roma Tre, Italy; (2) School of Computing, Newcastle University, UK. SEBD 2018, Italy. In collaboration with the Institute of Genetic Medicine, Newcastle University.
  2. Motivation: genomics at scale <eventname> Image credits: Broad Institute https://software.broadinstitute.org/gatk/ Current cost of whole-genome sequencing: < £1,000 https://www.genomicsengland.co.uk/the-100000-genomes-project/ (Our) processing time: about 40’ / GB → @ 11GB / sample (exome): 8 hours → @ 300-500GB / sample (genome): …
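The per-sample times on this slide follow directly from the ~40 minutes/GB rate. A quick back-of-envelope check (the rate and sample sizes are the slide's own figures):

```python
# Back-of-envelope check of the per-sample processing times quoted on the slide,
# assuming a constant rate of ~40 minutes per GB of input.
RATE_MIN_PER_GB = 40

def processing_hours(sample_gb: float) -> float:
    """Estimated wall-clock hours to process a sample of the given size."""
    return sample_gb * RATE_MIN_PER_GB / 60

exome_hours = processing_hours(11)       # ~7.3 h, i.e. roughly the 8 hours quoted
genome_lo = processing_hours(300)        # ~200 h for a whole genome...
genome_hi = processing_hours(500)        # ...up to ~333 h
print(f"exome: {exome_hours:.1f} h, genome: {genome_lo:.0f}-{genome_hi:.0f} h")
```

This is why whole-genome samples (300-500GB) are out of reach without much better parallel scaling.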
  3. Genomics Analysis Toolkit: best practices <eventname> Source: Broad Institute, https://software.broadinstitute.org/gatk/best-practices/ Identify germline short variants (SNPs and indels) in one or more individuals to produce a joint callset in VCF format.
  4. Key points <eventname> 1. Time and cost: • The Spark implementation is at the cutting edge: still in beta, but progressing rapidly • Cluster deployment provides speedup, but with limitations • Azure Genomics Services is cheaper and faster, but a black-box service. 2. Quality of the analysis: what is the relative impact of new versions on the variant output? (How quickly do results become obsolete?) http://recomp.org.uk/
  5. Multi-sample WES pipeline <eventname> [Pipeline diagram, per sample:] PREPROCESSING: FastqToSam → BWA (“map to reference” alignment of raw reads, against h19, h37, h38…) → MarkDuplicates (flag up multiple pair reads). VARIANT DISCOVERY: BQSR (“Base Quality Score Recalibration”: assigns confidence values to aligned reads) → HaplotypeCallerSpark (call variants: SNPs and indels, per sample). CALLSET REFINEMENT (joint, across samples): Recalibration → GenotypeVCFs → Genotype Refinement → Select Variants (filter for accuracy) → ANNOVAR → IGM Anno → Exonic Filter. Two levels of data parallelism: across samples (pre-processing) and within single-sample processing.
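The two levels of parallelism can be sketched as follows. This is only an orchestration skeleton: each stage function is a trivial placeholder standing in for the corresponding GATK (Spark) tool, not a real invocation.

```python
# Sketch of the pipeline's two levels of data parallelism, with GATK stages
# replaced by placeholder functions. Pre-processing and per-sample variant
# discovery run across samples in parallel; callset refinement is a joint
# step over all samples.
from concurrent.futures import ThreadPoolExecutor

def preprocess(sample: str) -> str:           # FastqToSam -> BWA -> MarkDuplicates
    return f"{sample}.bam"

def discover_variants(bam: str) -> str:       # BQSR -> HaplotypeCallerSpark
    return bam.replace(".bam", ".g.vcf")

def refine_callset(gvcfs: list) -> str:       # joint genotyping + refinement + filters
    return "joint_callset.vcf"

samples = ["sample1", "sample2", "sample3"]
with ThreadPoolExecutor() as pool:
    bams = list(pool.map(preprocess, samples))           # across-sample parallelism
    gvcfs = list(pool.map(discover_variants, bams))      # still one task per sample
callset = refine_callset(gvcfs)                          # single joint step
print(callset)
```

The second level of parallelism (within a single sample) happens inside each stage, where the Spark tools partition one sample's reads across cores.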
  6. Exploiting parallelism – state of the art <eventname> • Split-and-merge / wrapper approach, e.g. Gesall [1]: 1. Partition each exome → by chromosome / auto load-balancing 2. “Drive” standard BWA on each partition 3. Merge the partial results • A heavy MapReduce stack is required between HDFS and BWA • See also [2,3] • GATK is releasing Spark implementations of BWA, BQSR, HC • These natively exploit the Spark infrastructure – HDFS data partitioning. [1] A. Roy et al., “Massively parallel processing of whole genome sequence data: an in-depth performance study,” in Procs. SIGMOD 2017, pp. 187–202. [2] H. Mushtaq and Z. Al-Ars, “Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline,” in Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, 2015, pp. 1471–1477. [3] X. Li, G. Tan, B. Wang, and N. Sun, “High-performance Genomic Analysis Framework with In-memory Computing,” SIGPLAN Not., vol. 53, no. 1, pp. 317–328, Feb. 2018.
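The split-and-merge / wrapper pattern can be illustrated with a toy example: partition reads by chromosome, "drive" an unmodified per-partition tool on each partition, and merge the partial results. The aligner here is a stand-in; in Gesall the wrapped tool is standard BWA.

```python
# Toy illustration of the split-and-merge / wrapper pattern.
from collections import defaultdict

reads = [("chr1", "ACGT"), ("chr2", "TTGA"), ("chr1", "GGCC")]

# 1. Partition the input by chromosome.
partitions = defaultdict(list)
for chrom, read in reads:
    partitions[chrom].append(read)

# 2. "Drive" the unmodified tool on each partition (placeholder alignment).
def align_partition(partition_reads):
    return [f"aligned:{r}" for r in partition_reads]

partials = {chrom: align_partition(rs) for chrom, rs in partitions.items()}

# 3. Merge the partial results back into one output.
merged = [a for chrom in sorted(partials) for a in partials[chrom]]
print(merged)
```

The drawback noted on the slide is that, in the MapReduce realisation of this pattern, every partition crosses the boundary between HDFS and the external tool, which is what the native GATK Spark tools avoid.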
  7. Spark hybrid implementation <eventname> [Same pipeline diagram:] the GATK Spark tools (BWA/MarkDuplicates, BQSR, HaplotypeCallerSpark) are natively ported to Spark; the remaining non-Spark tools (FastqToSam, ANNOVAR, IGM Anno, Exonic Filter, …) are wrapped using Spark.pipe(). Single-node deployment: - pre-processing: one iteration / sample - discovery: single batch execution.
  8. The Spark pipe operator <eventname> A partitioned RDD is piped through a Bash / Perl shell script via stdin/stdout. This wraps local code through the shell: effective, but inefficient → it breaks the RDD in-memory model.
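The pipe operator streams each RDD partition through an external command's stdin/stdout, one line per record. A minimal stand-in for that behaviour, using Python's subprocess module and the Unix `tr` command as the "wrapped tool":

```python
# Minimal simulation of what RDD.pipe() does to one partition: serialise the
# records to text, stream them through an external command, and read one
# transformed record back per stdout line.
import subprocess

def pipe_partition(records, cmd):
    """Mimic RDD.pipe() on a single partition."""
    out = subprocess.run(cmd,
                         input="\n".join(records) + "\n",
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

print(pipe_partition(["acgt", "ttga"], ["tr", "a-z", "A-Z"]))
```

Every record crosses a process boundary as text on the way out and back in, which is exactly why this approach forfeits Spark's in-memory model even though it lets unmodified tools participate in the pipeline.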
  9. Spark cluster virtualisation using Swarm <eventname> • Automated distribution of Docker containers over a cluster of VMs • Swarm: nodes running Docker and joined in a cluster • The Swarm Manager executes Docker commands on the cluster. Transparency issues: reference data is mostly shared over HDFS, but: 1. non-Spark tools require local data → mount HDFS Data nodes as virtual Docker volumes 2. the reference genome is replicated to every node (Swarm global replication). Spark master + HDFS Namenode → Swarm Manager; dedicated overlay network.
  10. Pipeline execution flow in cluster mode <eventname> - Non-Spark tools remain centralised - Data sharing is still through HDFS (shallow integration across Spark tools) → no in-memory optimisation.
  11. Evaluation: focus <eventname> [Runtime breakdown chart:] BWA/MD 38%, BQSRP 11%, HC 39% (38 + 11 + 39 = 88%); discovery and refinement 12%. The evaluation focuses on pre-processing (BWA/MD → BQSRP → HC): - the heaviest phase - the Spark tools → the focus of the study!
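The phase shares quoted on the slide add up as follows, which is the justification for restricting the evaluation to pre-processing:

```python
# Per-phase shares of total pipeline runtime, as quoted on the slide.
shares = {"BWA/MD": 38, "BQSRP": 11, "HC": 39, "discovery and refinement": 12}

preprocessing = shares["BWA/MD"] + shares["BQSRP"] + shares["HC"]
print(f"pre-processing: {preprocessing}% of total runtime")  # 88%
```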
  12. Evaluation: setup <eventname> 6 exomes from the Institute of Genetic Medicine at Newcastle. Sample sizes [10.8GB – 15.9GB], avg 13.5GB (compressed). Deployment modes: - single-node “pseudo-cluster” deployment - cluster mode with up to 4 nodes. All deployments on the Azure cloud: 8 cores, 55GB RAM / node.
  13. Pre-processing steps for single-node deployment <eventname> [Charts: time (minutes) vs sample size (GB: 10.8, 13, 13.2, 14.2, 14.4, 15.9) for BWA/MD, BQSRP and HC, under configurations 20/2/4/16 and 20/4/2/8. Configuration key: 1. driver process memory (GB) 2. executors 3. cores/executor 4. memory/executor (GB).] Configuration settings are not significant.
  14. Normalised pre-processing time/GB <eventname> Average time/GB for two configurations; pre-processing time/GB (all three steps) across four configurations for a single sample (14.2GB).
  15. Speedup <eventname> Scale-up (single node, 55GB RAM/core): 16 cores x 1 → 165’, 32 cores x 1 → 175’. Scale-out / cluster mode (55GB RAM, 8 cores / node): 8 cores x 2 → 229’, but: 8 cores x 4 → 137’. Cluster overhead: 8 cores x 2 (229’) vs 16 cores x 1 (165’). Note: HC not included due to technical issues running HC on 16 cores. Average HC time: 270 minutes (single sample).
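The two comparisons on this slide can be recomputed from its own numbers: at equal core counts, two small nodes are slower than one big node (cluster overhead), yet scale-out keeps improving where scale-up stalls.

```python
# BWA/MD + BQSRP wall-clock minutes, taken from the slide's plots.
times = {
    ("scale-up", 16): 165,   # 16 cores x 1 node
    ("scale-up", 32): 175,   # 32 cores x 1 node: no further gain
    ("scale-out", 16): 229,  # 8 cores x 2 nodes
    ("scale-out", 32): 137,  # 8 cores x 4 nodes
}

# Cluster overhead at 16 cores: 2 nodes vs 1 big node.
overhead_16 = times[("scale-out", 16)] / times[("scale-up", 16)]
# At 32 cores the picture reverses: 4 small nodes beat 1 big node.
ratio_32 = times[("scale-up", 32)] / times[("scale-out", 32)]
print(f"16-core cluster overhead: {overhead_16:.2f}x slower")
print(f"32 cores, 4 nodes vs 1 node: {ratio_32:.2f}x faster")
```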
  16. Comparison: Microsoft Genomics Services <eventname> Fast, but opaque: • Processing time for the PFC 0028 sample: 77 minutes • Cost: £0.217/GB → £19 for six samples • Our best time: 446 minutes (7.5 hrs) on a single node (*) • Our costs (8 cores, 55GB, six samples): £28 • Runs on a single, high-end VM, but: specs undisclosed • Not open – no flexibility at all. (*) This is 176’ (single node, 16 cores) + 270’ (average HC processing time).
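The cost comparison can be recomputed from the quoted per-GB rate and the six sample sizes given on the setup slide. Note the raw per-GB product comes to slightly under the ~£19 the slide quotes; the slide's figure is presumably rounded or includes small per-run charges.

```python
# Recompute the Genomics Services cost from the quoted £0.217/GB rate and the
# six sample sizes (10.8-15.9 GB) listed on the evaluation-setup slide.
sample_sizes_gb = [10.8, 13.0, 13.2, 14.2, 14.4, 15.9]   # total ~81.5 GB

genomics_services_cost = 0.217 * sum(sample_sizes_gb)    # ~£17.7 (slide: ~£19)
our_cost = 28.0                                          # our Azure run, six samples
print(f"Genomics Services: £{genomics_services_cost:.2f} vs ours: £{our_cost:.2f}")
```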
  17. What we are doing now <eventname> All pipeline components change (rapidly). How sensitive are prior results to version changes (in data / software tools / libraries)? - Re-processing is time-consuming → continuous refresh does not scale - Can we quantify the effect of changes on a cohort of cases and prioritise re-computation? Approach: • generate multiple variations of the baseline pipeline by injecting version changes • assess the quality (specificity / sensitivity) of each result (sets of variants) across the cohort [1]. [1] D. T. Houniet et al., “Using population data for assessing next-generation sequencing performance,” Bioinformatics, vol. 31, no. 1, pp. 56–61, Jan. 2015.
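Comparing a re-computed variant set against the baseline output can be sketched with simple set operations. This is an illustrative simplification: the variant identifiers are toy values, and precision is used alongside sensitivity because specificity proper would need a defined universe of negative sites.

```python
# Compare a re-computed callset against the baseline callset.
# Variants are represented as toy identifier strings.
def compare_callsets(baseline: set, recomputed: set):
    tp = len(baseline & recomputed)
    sensitivity = tp / len(baseline) if baseline else 1.0   # baseline variants recovered
    precision = tp / len(recomputed) if recomputed else 1.0 # recomputed variants confirmed
    return sensitivity, precision

baseline = {"chr1:123A>G", "chr2:456C>T", "chr7:789G>A"}
recomputed = {"chr1:123A>G", "chr2:456C>T", "chr9:111T>C"}
print(compare_callsets(baseline, recomputed))
```

Running such a comparison per case across the cohort gives the per-version quality signal used to decide which cases are worth re-computing.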
  18. ReComp <eventname> ReComp is about preserving value from large-scale data analytics over time through selective re-computation. More on this topic: J. Cala and P. Missier, “Selective and recurring re-computation of Big Data analytics tasks: insights from a Genomics case study,” Journal of Big Data Research, 2018 (in press). http://recomp.org.uk/
  19. Questions? Call for participation: July 12-13th, London (King’s College)
