Design and evaluation of a genomics variant analysis pipeline
using GATK Spark tools
Nicholas Tucci1, Jacek Cala2, Jannetta Steyn2, Paolo Missier2
(1) Dipartimento di Ingegneria Elettronica, Università Roma Tre, Italy
(2) School of Computing, Newcastle University, UK
SEBD 2018, Italy
In collaboration with the Institute of Genetic Medicine,
Newcastle University
Slide 2: Motivation: genomics at scale
Image credits: Broad Institute https://software.broadinstitute.org/gatk/
Current cost of whole-genome sequencing: < £1,000
https://www.genomicsengland.co.uk/the-100000-genomes-project/
(our) processing time: about 40 min/GB
- at 11GB/sample (exome): ~8 hours
- at 300-500GB/sample (genome): …
Slide 3: Genome Analysis Toolkit (GATK): best practices
Source: Broad Institute, https://software.broadinstitute.org/gatk/best-practices/
Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset
in VCF format.
Slide 4: Key points
1. Time and cost:
• Spark implementation at the cutting edge: still in beta but progressing rapidly
• Cluster deployment provides speedup but with limitations
• Azure Genomics Services is cheaper and faster but a black-box service
2. Quality of the analysis:
What is the relative impact of new versions on the variant output?
(how quickly do results become obsolete?)
http://recomp.org.uk/
Slide 5: Multi-sample WES pipeline

[Pipeline diagram.] Raw reads for each sample (1…N) are pre-processed independently: FastqToSam → Bwa ("map to reference": alignment against h19, h37, h38, …) → MarkDuplicates (flags duplicate read pairs) → BQSR ("Base Quality Score Recalibration": assigns confidence values to aligned reads) → HaplotypeCallerSpark (calls variants, SNPs and indels, per sample). Variant discovery then combines the per-sample outputs: Genotype VCFs → Recalibration → Genotype Refinement → Select Variants (filter for accuracy). Finally, callset refinement runs per sample: ANNOVAR → IGM Anno → Exonic Filter.

Phases: PREPROCESSING → VARIANT DISCOVERY → CALLSET REFINEMENT

Two levels of data parallelism:
- across samples (pre-processing)
- within single-sample processing
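The two levels of parallelism can be sketched as follows (a conceptual sketch with stub stage functions and hypothetical names, not the pipeline code itself): each sample runs the pre-processing chain independently, and the per-sample outputs are combined in the joint discovery phase. The second level, within-sample parallelism, is what the Spark tools provide internally via data partitioning.

```python
from concurrent.futures import ThreadPoolExecutor

def preprocess(sample):
    # Stub for FastqToSam -> Bwa -> MarkDuplicates -> BQSR -> HaplotypeCallerSpark
    return f"{sample}.g.vcf"

def joint_discovery(gvcfs):
    # Stub for Genotype VCFs -> Recalibration -> Genotype Refinement -> Select Variants
    return sorted(gvcfs)

samples = ["sample1", "sample2", "sampleN"]
with ThreadPoolExecutor() as pool:       # level 1: parallelism across samples
    gvcfs = list(pool.map(preprocess, samples))
callset = joint_discovery(gvcfs)         # joint step over all per-sample outputs
```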
Slide 6: Exploiting parallelism: state of the art
• Split-and-merge / wrapper approach, e.g. Gesall [1]:
1. Partition each exome by chromosome, with automatic load balancing
2. "Drive" standard BWA on each partition
3. Merge the partial results
• A heavy MapReduce stack is required between HDFS and BWA
• See also [2,3]
• GATK is releasing Spark implementations of BWA, BQSR, HC
• Natively exploits the Spark infrastructure and HDFS data partitioning
[1] A. Roy et al., "Massively parallel processing of whole genome sequence data: an in-depth performance study," in Procs. SIGMOD 2017, pp. 187–202.
[2] H. Mushtaq and Z. Al-Ars, “Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline,” in
Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, 2015, pp. 1471–1477.
[3] X. Li, G. Tan, B. Wang, and N. Sun, “High-performance Genomic Analysis Framework with In-memory Computing,”
SIGPLAN Not., vol. 53, no. 1, pp. 317–328, Feb. 2018.
Slide 7: Spark hybrid implementation

[Pipeline diagram, as on slide 5.] The GATK Spark tools (Bwa, MarkDuplicates, BQSR, HaplotypeCallerSpark) are natively ported to Spark; the remaining tools are wrapped using Spark.pipe().

Single-node deployment:
- Pre-processing: one iteration per sample
- Discovery: single batch execution
Slide 8: The Spark Pipe operator

[Diagram: a partitioned RDD is piped through a Bash/Perl shell script; each partition is written to the script's stdin, and the script's stdout is read back into a new RDD.]

- Wraps local code through the shell
- Effective but inefficient: breaks the RDD in-memory model
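The cost of Pipe is easiest to see from what it does per partition. A minimal stdlib sketch of the mechanism (not Spark's actual implementation; `pipe_partition` is a hypothetical helper), using `tr` as a stand-in for a wrapped tool:

```python
import subprocess

def pipe_partition(partition, cmd):
    """Conceptually what RDD.pipe() does for one partition: serialise
    every element to the command's stdin (one line per element) and
    read the command's stdout back as the elements of a new partition."""
    proc = subprocess.run(
        cmd, input="\n".join(str(x) for x in partition) + "\n",
        capture_output=True, text=True, shell=True, check=True)
    return proc.stdout.splitlines()

# Wrap a shell tool the way the pipeline wraps non-Spark tools
out = pipe_partition(["acgt", "ttag"], "tr 'a-z' 'A-Z'")
```

The round-trip through stdin/stdout is exactly why this breaks the in-memory model: every element leaves the JVM as text and comes back as text.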
Slide 9: Spark cluster virtualisation using Swarm
• Automated distribution of Docker containers over a cluster of VMs
• Swarm: nodes running Docker, joined in a cluster
• The Swarm Manager executes Docker commands on the cluster
• The Spark master and HDFS NameNode run on the Swarm Manager, connected over a dedicated overlay network

Transparency issues: reference data is mostly shared over HDFS. But:
1. Non-Spark tools require local data → mount HDFS DataNodes as virtual Docker volumes
2. The reference genome is replicated to every node (Swarm global replication)
Slide 10: Pipeline execution flow in cluster mode

- Non-Spark tools remain centralised
- Data sharing still goes through HDFS (shallow integration across Spark tools): no in-memory optimisation
Slide 11: Evaluation: focus

[Pie chart of total processing time: BWA/MD 38%, BQSRP 11%, HC 39%, discovery and refinement 12%.]

The evaluation focused on pre-processing (BWA/MD → BQSRP → HC):
- the heaviest phase: 38 + 11 + 39 = 88% of total time
- the Spark tools: the focus of the study
Slide 12: Evaluation: setup
6 exomes from the Institute of Genetic Medicine at Newcastle
Sample sizes 10.8GB–15.9GB, avg 13.5GB (compressed)
Deployment modes:
- single-node "pseudo-cluster" deployment
- cluster mode with up to 4 nodes
All deployments on the Azure cloud; 8 cores, 55GB RAM per node
Slide 13: Pre-processing steps for single-node deployment

[Two bar charts: time in minutes (0–900) for each step (BWA/MD, BQSRP, HC) per sample size (10.8, 13, 13.2, 14.2, 14.4, 15.9 GB), under configurations 20/2/4/16 and 20/4/2/8.]

Configuration notation:
1. Driver process memory (GB)
2. Executors
3. Cores/executor
4. Memory/executor (GB)

Configuration settings were not significant.
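For reference, a configuration tuple such as 20/2/4/16 maps onto standard Spark properties; a small sketch (property names are standard Spark settings, the helper is illustrative):

```python
def spark_conf(driver_gb, executors, cores_per_executor, executor_gb):
    """Translate the slide's d/e/c/m configuration notation into
    standard Spark configuration properties."""
    return {
        "spark.driver.memory": f"{driver_gb}g",
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{executor_gb}g",
    }

conf = spark_conf(20, 2, 4, 16)   # the "20/2/4/16" configuration
```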
Slide 14: Normalised pre-processing time per GB

[Charts: average time/GB for two configurations; pre-processing time/GB (all three steps) across four configurations for a single sample (14.2GB).]
Slide 15: Speedup

[Two charts of pre-processing time in minutes (0–350) for BWA/MD + BQSRP:
- scale-out / cluster mode: 1–4 nodes, each with 8 cores and 55GB RAM
- scale-up: a single node with 55GB RAM and 8, 16 or 32 cores]

Note: HC is not included due to technical issues running HC on 16 cores. Average HC time: 270 minutes (single sample).

Cluster overhead:
- 8 cores × 2 nodes → 229'; 16 cores × 1 node → 165'
- but: 8 cores × 4 nodes → 137'; 32 cores × 1 node → 175'
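The cluster-overhead point can be checked directly from the slide's numbers (times in minutes for BWA/MD + BQSRP):

```python
times = {
    ("cluster", 16): 229,   # 2 nodes x 8 cores
    ("single", 16): 165,    # 1 node, 16 cores
    ("cluster", 32): 137,   # 4 nodes x 8 cores
    ("single", 32): 175,    # 1 node, 32 cores
}
# At 16 cores total, the cluster is ~39% slower than one 16-core node...
overhead_16 = times[("cluster", 16)] / times[("single", 16)]
# ...but at 32 cores, scaling out beats scaling up by ~28%
gain_32 = times[("single", 32)] / times[("cluster", 32)]
```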
Slide 16: Comparison: Microsoft Genomics Services
Fast, but opaque:
- Processing time for the PFC 0028 sample: 77 minutes
- Cost: £0.217/GB → £19 for six samples
- Our best time: 446 minutes (7.5 hrs) on a single node(*)
- Our cost (8 cores, 55GB, six samples): £28
- Runs on a single, high-end VM, but its specs are undisclosed
- Not open: no flexibility at all

(*) 176' (single node, 16 cores) + 270' (average HC processing time)
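A quick sanity check on the comparison, using the average sample size from the setup slide (13.5GB; the quoted £19 presumably uses the actual per-sample sizes):

```python
avg_gb, samples = 13.5, 6
genomics_cost = 0.217 * avg_gb * samples   # ~£17.6, in line with the quoted ~£19
our_minutes = 176 + 270                    # best single-node time + average HC time
slowdown = our_minutes / 77                # vs Genomics Services' 77 minutes: ~5.8x
```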
Slide 17: What we are doing now
All pipeline components change (rapidly).
How sensitive are prior results to version changes (in data / software tools / libraries)?
- Re-processing is time-consuming → continuous refresh does not scale
- Can we quantify the effect of changes on a cohort of cases and prioritise re-computation?

Approach:
• Generate multiple variations of the baseline pipeline by injecting version changes
• Assess the quality (specificity / sensitivity) of each result (a set of variants) across the cohort [1]
[1] D. T. Houniet et al., “Using population data for assessing next-generation sequencing performance,”
Bioinformatics, vol. 31, no. 1, pp. 56–61, Jan. 2015.
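As an illustrative sketch of the comparison step (the actual method follows [1]; here each result is treated as a set of called variants and scored against a baseline callset, using precision rather than specificity, since true negatives are not well defined on call sets):

```python
def sensitivity_precision(baseline, callset):
    """Score one pipeline variation's callset against the baseline.
    Sensitivity = fraction of baseline variants recovered;
    precision   = fraction of called variants also in the baseline."""
    tp = len(baseline & callset)
    return tp / len(baseline), tp / len(callset)

# Variants as (chromosome, position, ref, alt) tuples -- example data only
baseline = {("chr1", 12345, "A", "G"), ("chr2", 67890, "C", "T")}
variation = {("chr1", 12345, "A", "G"), ("chr3", 222, "G", "A")}
sens, prec = sensitivity_precision(baseline, variation)
```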
Slide 19: ReComp

ReComp is about preserving value from large-scale data analytics over time through selective re-computation.

More on this topic: J. Cala and P. Missier, "Selective and recurring re-computation of Big Data analytics tasks: insights from a Genomics case study," Journal of Big Data Research, 2018 (in press). http://recomp.org.uk/
Slide 20: Questions?

Call for participation: July 12-13th, London (King's College)
Mais conteúdo relacionado

Mais procurados

Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
Anubhav Jain
 
pMatlab on BlueGene
pMatlab on BlueGenepMatlab on BlueGene
pMatlab on BlueGene
vsachde
 
Autonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionAutonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisition
aimsnist
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
Anubhav Jain
 

Mais procurados (20)

Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructure
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
 
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation AlgorithmA Scalable Dataflow Implementation of Curran's Approximation Algorithm
A Scalable Dataflow Implementation of Curran's Approximation Algorithm
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
 
The Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource ProvisioningThe Interplay of Workflow Execution and Resource Provisioning
The Interplay of Workflow Execution and Resource Provisioning
 
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
MAVRL Workshop 2014 - Python Materials Genomics (pymatgen)
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
 
Conducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials ProjectConducting and Enabling Data-Driven Research Through the Materials Project
Conducting and Enabling Data-Driven Research Through the Materials Project
 
Tree building 2
Tree building 2Tree building 2
Tree building 2
 
pMatlab on BlueGene
pMatlab on BlueGenepMatlab on BlueGene
pMatlab on BlueGene
 
Value-Based Allocation of Docker Containers
Value-Based Allocation of Docker ContainersValue-Based Allocation of Docker Containers
Value-Based Allocation of Docker Containers
 
Project Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster ReliefProject Matsu: Elastic Clouds for Disaster Relief
Project Matsu: Elastic Clouds for Disaster Relief
 
Autonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionAutonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisition
 
Computational materials design with high-throughput and machine learning methods
Computational materials design with high-throughput and machine learning methodsComputational materials design with high-throughput and machine learning methods
Computational materials design with high-throughput and machine learning methods
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
 
Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Ca...
Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Ca...Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Ca...
Contributions to the Efficient Use of General Purpose Coprocessors: KDE as Ca...
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
 

Semelhante a Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools

Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
Ian Foster
 
epscor_talk_2.pptx
epscor_talk_2.pptxepscor_talk_2.pptx
epscor_talk_2.pptx
ShadowCon
 

Semelhante a Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools (20)

20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final20211119 ntuh azure hpc workshop final
20211119 ntuh azure hpc workshop final
 
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
Butler - a framework for a large-scale scientific analysis on the cloud - EOS...
 
Introduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-SeqIntroduction to Galaxy and RNA-Seq
Introduction to Galaxy and RNA-Seq
 
3rd presentation
3rd presentation3rd presentation
3rd presentation
 
Introduction to Next-Generation Sequencing (NGS) Technology
Introduction to Next-Generation Sequencing (NGS) TechnologyIntroduction to Next-Generation Sequencing (NGS) Technology
Introduction to Next-Generation Sequencing (NGS) Technology
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
Initial steps towards a production platform for DNA sequence analysis on the ...
Initial steps towards a production platform for DNA sequence analysis on the ...Initial steps towards a production platform for DNA sequence analysis on the ...
Initial steps towards a production platform for DNA sequence analysis on the ...
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
 
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a...
 
Paper - Muhammad Gulraj
Paper - Muhammad GulrajPaper - Muhammad Gulraj
Paper - Muhammad Gulraj
 
Interactive Data Analysis for End Users on HN Science Cloud
Interactive Data Analysis for End Users on HN Science CloudInteractive Data Analysis for End Users on HN Science Cloud
Interactive Data Analysis for End Users on HN Science Cloud
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
 
epscor_talk_2.pptx
epscor_talk_2.pptxepscor_talk_2.pptx
epscor_talk_2.pptx
 
ECP Application Development
ECP Application DevelopmentECP Application Development
ECP Application Development
 
Poster (1)
Poster (1)Poster (1)
Poster (1)
 
Larry Smarr - NRP Application Drivers
Larry Smarr - NRP Application DriversLarry Smarr - NRP Application Drivers
Larry Smarr - NRP Application Drivers
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 

Mais de Paolo Missier

Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
Paolo Missier
 

Mais de Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
 
ReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for HealthReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for Health
 
algorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparencyalgorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparency
 
Provenance Annotation and Analysis to Support Process Re-Computation
Provenance Annotation and Analysis to Support Process Re-ComputationProvenance Annotation and Analysis to Support Process Re-Computation
Provenance Annotation and Analysis to Support Process Re-Computation
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools

  • 1. Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools Nicholas Tucci1, Jacek Cala2, Jannetta Steyn2, Paolo Missier2 (1) Dipartimento di Ingegneria Elettronica,Universita’ Roma Tre, Italy (2) School of Computing, Newcastle University, UK SEBD 2018, Italy In collaboration with the Institute of Genetic Medicine, Newcastle University
  • 2. 2 Motivation: genomics at scale <eventname> Image credits: Broad Institute https://software.broadinstitute.org/gatk/ Current cost of whole-genome sequencing: < £1,000 https://www.genomicsengland.co.uk/the-100000-genomes-project/ (our) processing time: about 40’ / GB  @ 11GB / sample (exome): 8 hours  @ 300-500GB / sample (genome): …
  • 3. 3 Genomics Analysis Toolkit: Best practices <eventname> Source: Broad Institute, https://software.broadinstitute.org/gatk/best-practices/ Identify germline short variants (SNPs and Indels) in one or more individuals to produce a joint callset in VCF format.
  • 4. 4 Key points <eventname> 1. Time and cost: • Spark implementation at the cutting edge: still in beta but progressing rapidly • Cluster deployment provides speedup but with limitations • Azure Genomics Services is cheaper and faster but a black-box service 2. Quality of the analysis: What is the relative impact of new versions on the variant output? (how quickly do results become obsolete?) http://recomp.org.uk/
  • 5. 5 Multi-sample WES pipeline <eventname> Bwa MarkDuplicates BQSR Haplotype CallerSpark Sample 1 FastqToSam BQSR Haplotype CallerSpark Sample 2 BQSR Haplotype CallerSpark Sample N Recalibration Genotype Refinement Select Variants Genotype VCFs ANNOVAR ANNOVAR ANNOVAR IGM Anno IGM Anno IGM Anno Exonic Filter Exonic Filter Exonic Filter PREPROCESSING VARIANT DISCOVERY CALLSET REFINEMENT . . . . . . FastqToSam FastqToSam Bwa MarkDuplicates Bwa MarkDuplicates Two levels data parallelism: - Across samples (pre-processing) - Within single sample processing Raw reads “map to reference” (Alignment) Against h19, h37, h38… Flag up multiple pair reads “Base Quality Scores Recalibration” assigns confidence values to aligned reads “Call Variants - SNPs - Indels Per Sample Filter for accuracy
  • 6. 6 Exploiting parallelism: state of the art • Split-and-merge / wrapper approach, e.g. Gesall [1]: 1. partition each exome → by chromosome / auto load balancing; 2. “drive” standard BWA on each partition; 3. merge the partial results. Requires a heavy MapReduce stack between HDFS and BWA. See also [2,3]. • GATK is releasing Spark implementations of BWA, BQSR, HC, which natively exploit the Spark infrastructure and HDFS data partitioning. [1] A. Roy et al., “Massively parallel processing of whole genome sequence data: an in-depth performance study,” in Procs. SIGMOD 2017, pp. 187–202. [2] H. Mushtaq and Z. Al-Ars, “Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline,” in Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, 2015, pp. 1471–1477. [3] X. Li, G. Tan, B. Wang, and N. Sun, “High-performance Genomic Analysis Framework with In-memory Computing,” SIGPLAN Not., vol. 53, no. 1, pp. 317–328, Feb. 2018.
  • 7. 7 Spark hybrid implementation [Pipeline diagram as in slide 5, with the GATK Spark tools marked as natively ported to Spark and the remaining tools wrapped using Spark.pipe().] Single-node deployment: pre-processing runs one iteration per sample; discovery runs as a single batch execution.
  • 8. 8 The Spark Pipe operator [Diagram: a partitioned RDD is piped through an external Bash/Perl shell script via stdin/stdout, producing a new RDD.] Wraps local code through the shell. Effective but inefficient → breaks the RDD in-memory model.
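The pipe contract can be illustrated without a cluster: each partition's records are written to an external command's stdin, one per line, and the command's stdout lines become the new partition. A pure-Python sketch of that contract (assuming a POSIX `tr` binary is on the PATH; `pipe_partition` is an illustrative stand-in for what Spark's `RDD.pipe` does per partition):

```python
import subprocess

def pipe_partition(records, command):
    """Mimic RDD.pipe(): feed records to an external command via stdin,
    one record per line, and return its stdout lines as the new
    partition. Spark applies this to every partition in parallel."""
    proc = subprocess.run(
        command,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()

# One "partition" of reads piped through a shell tool.
partition = ["acgt", "ttga"]
result = pipe_partition(partition, ["tr", "a-z", "A-Z"])
print(result)  # ['ACGT', 'TTGA']
```

The inefficiency noted on the slide is visible here: every record is serialised to text, crosses a process boundary twice, and loses Spark's in-memory representation.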
  • 9. 9 Spark cluster virtualisation using Swarm • Automated distribution of Docker containers over a cluster of VMs. • Swarm: nodes running Docker, joined in a cluster. • The Swarm Manager executes Docker commands on the cluster. Transparency issues: reference data is mostly shared over HDFS, but: 1. non-Spark tools require local data → mount HDFS Datanodes as virtual Docker volumes; 2. the reference genome is replicated to every node (Swarm global replication). Spark master + HDFS Namenode → Swarm Manager. Dedicated overlay network.
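The Swarm topology on this slide can be sketched with standard Docker CLI commands. A configuration sketch only: `<MANAGER-IP>`, `<WORKER-TOKEN>`, the network name `spark-net` and the image name are placeholders, and the real deployment additionally mounts HDFS Datanode directories as volumes for the non-Spark tools:

```shell
# On the manager node (which also hosts the Spark master + HDFS Namenode):
docker swarm init --advertise-addr <MANAGER-IP>

# On each worker node, join using the token printed by `init`:
docker swarm join --token <WORKER-TOKEN> <MANAGER-IP>:2377

# Dedicated overlay network so Spark/HDFS containers can talk across hosts:
docker network create --driver overlay --attachable spark-net

# A global service places exactly one container per node, mirroring the
# "global replication" used for per-node data such as the reference genome:
docker service create --name spark-worker --mode global \
  --network spark-net <spark-worker-image>
```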
  • 10. 10 Pipeline execution flow in cluster mode. Non-Spark tools remain centralised; data sharing still goes through HDFS (shallow integration across Spark tools) → no in-memory optimisation.
  • 11. 11 Evaluation: focus. The evaluation focuses on pre-processing (BWA/MD → BQSRP → HC): the heaviest phase, and the Spark tools are the focus of the study. [Pie chart of runtime share: BWA/MD 38%, BQSRP 11%, HC 39% (38 + 11 + 39 = 88%); discovery and refinement 12%. Pipeline diagram as in slide 5, highlighting the pre-processing stages.]
  • 12. 12 Evaluation: setup. 6 exomes from the Institute of Genetic Medicine at Newcastle. Sample size [10.8GB – 15.9GB], avg 13.5GB (compressed). Deployment modes: single-node “pseudo-cluster” deployment; cluster mode with up to 4 nodes. All deployments on the Azure cloud: 8 cores, 55GB RAM per node.
  • 13. 13 Pre-processing steps for single-node deployment [Two bar charts: time (minutes) per step (BWA/MD, BQSRP, HC) vs sample size (10.8, 13, 13.2, 14.2, 14.4, 15.9 GB), one for configuration 20/2/4/16 and one for configuration 20/4/2/8. Configuration key: 1. driver process memory (GB) / 2. executors / 3. cores per executor / 4. memory per executor (GB).] Configuration settings are not significant.
  • 14. 14 Normalised pre-processing time/GB [Charts: average time/GB for two configurations; pre-processing time/GB (all three steps) across four configurations for a single sample (14.2GB).]
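The normalisation on this slide is just a division of total runtime by sample size, which makes samples of different sizes comparable. A trivial sketch, recovering the ~40 min/GB rate from slide 2 for an 11GB exome taking roughly 440 minutes:

```python
def time_per_gb(total_minutes, size_gb):
    """Normalise a sample's pre-processing time to minutes per GB."""
    return total_minutes / size_gb

# Slide 2's rate: an 11GB exome at ~8 hours (about 440 minutes).
print(time_per_gb(440, 11.0))  # 40.0
```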
  • 15. 15 Speedup [Two charts of BWA/MD + BQSRP runtime. Scale-out / cluster mode (55GB RAM, 8 cores per node): minutes vs number of nodes (1–4), also broken out into BWA/MD and BQSRP. Scale-up, single node (55GB RAM): minutes vs number of cores (8, 16, 32).] Note: HC not included due to technical issues running HC on 16 cores; average HC time: 270 minutes (single sample). Cluster overhead: 8 cores × 2 nodes → 229 min vs 16 cores × 1 node → 165 min; but 8 cores × 4 nodes → 137 min vs 32 cores × 1 node → 175 min.
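The cluster-overhead comparison can be made explicit by taking the ratio of cluster time to single-node time at equal core counts, using the BWA/MD + BQSRP minutes quoted on the slide:

```python
def ratio(cluster_min, single_min):
    """Cluster runtime relative to a single node with the same core count.
    > 1.0 means the cluster is slower (scale-out overhead dominates)."""
    return cluster_min / single_min

# 2 nodes x 8 cores (229') vs 1 node x 16 cores (165'):
print(round(ratio(229, 165), 2))  # 1.39 -> cluster ~39% slower
# 4 nodes x 8 cores (137') vs 1 node x 32 cores (175'):
print(round(ratio(137, 175), 2))  # 0.78 -> cluster ~22% faster
```

At 32 cores the single node stops scaling, so the cluster wins despite its overhead; at 16 cores the opposite holds.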
  • 16. 16 Comparison: Microsoft Genomics Services. Fast, but opaque: • processing time for the PFC 0028 sample: 77 minutes • cost: £0.217/GB → about £19 for six samples • our best time: 446 minutes (about 7.5 hrs) on a single node(*) • our cost (8 cores, 55GB, six samples): £28 • runs on a single, high-end VM, but its specs are undisclosed • not open: no flexibility at all. (*) 176 min (single node, 16 cores) + 270 min (average HC processing time).
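As a sanity check on the quoted cost, multiply the per-GB rate by the total data volume. Using the compressed sample sizes from slide 12 (billing is presumably based on slightly different sizes, hence the small gap to the ~£19 on the slide):

```python
# Genomics Services charge quoted on the slide: £0.217 per GB.
RATE_GBP_PER_GB = 0.217
sizes_gb = [10.8, 13.0, 13.2, 14.2, 14.4, 15.9]  # the six exomes, slide 12

cost = RATE_GBP_PER_GB * sum(sizes_gb)
print(round(cost, 2))  # 17.69 -- close to the ~£19 quoted on the slide
```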
  • 17. 17 What we are doing now. All pipeline components change (rapidly). How sensitive are prior results to version changes (in data / software tools / libraries)? Re-processing is time-consuming → continuous refresh does not scale. Can we quantify the effect of changes on a cohort of cases and prioritise re-computation? Approach: • generate multiple variations of the baseline pipeline by injecting version changes • assess the quality (specificity / sensitivity) of each result (a set of variants) across the cohort [1]. [1] D. T. Houniet et al., “Using population data for assessing next-generation sequencing performance,” Bioinformatics, vol. 31, no. 1, pp. 56–61, Jan. 2015.
  • 18. 19 ReComp. ReComp is about preserving value from large-scale data analytics over time through selective re-computation. More on this topic: J. Cala and P. Missier, “Selective and recurring re-computation of Big Data analytics tasks: insights from a Genomics case study,” Big Data Research, 2018 (in press). http://recomp.org.uk/
  • 19. 20 Questions? Call for participation: July 12-13th, London (King’s College)

Editor's Notes

  1. MarkDuplicates flags multiple paired reads that are mapped to the same start and end positions. These reads often originate erroneously from DNA preparation methods; they cause biases that skew variant calling and should therefore be removed before downstream analysis.
  2. As both Spark and HDFS adopt a master-slave architecture, the masters (the Spark Master and the HDFS Namenode) are deployed on the Swarm Manager.
  3. However, we also note that scaling out (adding nodes) may incur an overhead that makes it less efficient than scaling up (adding cores and memory to a single-node configuration). For instance, 2 nodes with 8 cores each take 229 minutes, while a single node with 16 cores takes 165 minutes. The overhead is less noticeable at 32 cores, which, as noted earlier, does not improve processing time on a single host (175 minutes; see the scale-up chart on slide 15), while a 4-node, 8-core cluster takes 137 minutes, a further improvement over the other configurations.
  4. However, at the time of writing these services were only offered as a black box that runs on a single, high-end virtual machine of undisclosed specifications. In terms of pricing, the current charge for Genomics Services is £0.217/GB, which translates to about £18.61 for processing our six samples. For comparison, the cost of processing the same samples using our pipeline with an 8-core, 55GB configuration is estimated at £28.