Scalable WES Processing and Variant Interpretation with Provenance Recording, Using Workflow on the Cloud

Paolo Missier, Jacek Cała, Yaobo Xu, Eldarina Wijaya, Ryan Kirby
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK

NGS Data Congress
London, June 15th, 2015
The Cloud-e-Genome project at Newcastle

Objectives:
1. NGS data processing:
   • Implement a flexible WES/WGS pipeline
   • Scalable deployment over a public cloud
   • Cost control
2. Traceable variant interpretation:
   • Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
   • Maintain a history of past investigations for analytical purposes

With an aim to:
   • Scalability
   • Flexibility, of design and of maintenance
   • Ensure accountability through traceability
   • Enable analytics over past patient cases

• 2-year pilot project: 2013–2015
• Funded by the UK's National Institute for Health Research (NIHR)
• Cloud resources from an Azure for Research Award
Part I: data processing

Objectives:
• Design and implement a flexible WES/WGS pipeline
• Use workflow technology → high-level programming
• Provide scalable deployment over a public cloud
Scripted NGS data processing pipeline

• Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner
• Cleaning and duplicate elimination: Picard tools
• Recalibration (GATK): corrects for systematic bias in the quality scores assigned by the sequencer
• Coverage: computes the coverage of each read
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks; the haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce the false-positive rate from the caller
• VCF subsetting by filtering, e.g. non-exomic variants
• Annotation: Annovar functional annotations (e.g. MAF, synonymy, SNPs…) followed by in-house annotations
Scripts to workflow — Design

[Pipeline phases: Design → Cloud Deployment → Execution → Analysis]

Theoretical advantages of using a workflow programming model:
• Better abstraction
• Easier to understand, share, maintain
• Better exploitation of data parallelism
• Extensible by wrapping new tools
Workflow Design

echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP

echo Starting PICARD to clean BAM files...
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED

echo Starting PICARD to remove duplicates...
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG \
  METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true

echo Adding read group information to bam file...
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS \
  RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID \
  RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"

echo Indexing bam files...
samtools index $SORTED_BAM_FILE_NODUPS

The script maps onto "wrapper" blocks and utility blocks in the workflow.
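In e-Science Central terms, each shell step becomes a reusable block with named inputs and outputs. A minimal Python sketch of the idea (the interface here is hypothetical, not the actual e-Science Central block API):

```python
import subprocess

def wrapper_block(name, cmd_template):
    """Build a callable 'block' that runs a shell command over named inputs.
    Hypothetical sketch only -- not the real e-Science Central block API."""
    def run(**inputs):
        cmd = cmd_template.format(**inputs)
        result = subprocess.run(cmd, shell=True, check=True,
                                capture_output=True, text=True)
        return result.stdout
    run.__name__ = name
    return run

# A trivial echo stands in for the real Picard CleanSam invocation
clean_sam = wrapper_block("clean_sam", "echo cleaning {input} into {output}")
print(clean_sam(input="sorted.bam", output="sorted.clean.bam"))
```

Wrapping each tool this way is what makes the pipeline extensible: adding a tool means adding one block, not editing a monolithic script.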
Workflow design

[Diagram. Conceptual pipeline: raw sequences → align → clean → recalibrate alignments → calculate coverage → call variants → recalibrate variants → filter variants → annotate, producing coverage information and annotated variants. Actual pipeline: the same steps grouped into Stage 1 (align, clean, recalibrate alignments, calculate coverage), Stage 2 (call and recalibrate variants) and Stage 3 (filter and annotate variants).]
Anatomy of a complex parallel dataflow

e-Science Central: a simple dataflow model…

[Diagram: the three-stage pipeline with a sample-split in Stage 1 — parallel processing of the samples in a batch.]
NGSDataCongress
London,June2015
P.Misiser
Anatomy of a complex parallel dataflow
… with hierarchical structure
Phase II, top level

[Diagram: Stage 2 with a chromosome-split — parallel processing of each chromosome across all samples.]
Phase III

[Diagram: Stage 3 with a sample-split — parallel processing of samples.]
Implicit parallelism in the pipeline

[Diagram: Stage I runs align-clean-recalibrate-coverage per sample, for samples 1…n in parallel; a chromosome split then feeds Stage II, where variant calling and recalibration run per chromosome in parallel; Stage III runs variant filtering and annotation per sample in parallel.]

How does the workflow design exploit this parallelism?
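The scatter-gather pattern across the three stages can be simulated in a few lines of Python; the stage bodies below are placeholders standing in for the real tools, not the actual workflow code:

```python
from concurrent.futures import ThreadPoolExecutor

CHROMOSOMES = ["chr1", "chr2", "chrX"]  # illustrative subset

def stage1(sample):
    # Per-sample: align, clean, recalibrate, compute coverage
    return {"sample": sample, "bam": f"{sample}.bam"}

def stage2(chrom, bams):
    # Per-chromosome: joint variant calling over all samples' alignments
    return {"chrom": chrom, "vcf": f"{chrom}.vcf", "n_samples": len(bams)}

def stage3(sample, vcfs):
    # Per-sample: filter and annotate the called variants
    return f"{sample}.annotated.vcf"

def run_pipeline(samples):
    with ThreadPoolExecutor() as pool:
        bams = list(pool.map(stage1, samples))                          # sample-split
        vcfs = list(pool.map(lambda c: stage2(c, bams), CHROMOSOMES))   # chromosome-split
        return list(pool.map(lambda s: stage3(s, vcfs), samples))       # sample-split

print(run_pipeline(["s1", "s2"]))
```

The two `list(...)` calls act as synchronisation barriers: Stage II only starts once every sample's alignments exist, and Stage III only once every chromosome's calls exist — the same join points the workflow's dataflow edges enforce.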
Parallel processing over a batch of exomes

[Diagram: raw sequences fan out to per-lane alignment blocks, which merge into per-sample alignment, cleaning, recalibration and coverage blocks (Stage 1); variant calling with a chromosome split runs the haplotype caller and variant recalibration (Stage 2); per-sample filtering and annotation produce the annotated variants (Stage 3), alongside per-sample coverage information.]
Cloud Deployment

[Pipeline phases: Design → Cloud Deployment → Execution → Analysis]

• Scalability
• Fewer installation/deployment requirements and staff hours required
• Automated dependency management and packaging
• Configurable to make the most efficient use of a cluster
Workflow on the Azure cloud — modular configuration

[Architecture diagram: an Azure VM hosts the e-Science Central main server (web UI and REST API), backed by an e-SC database on a second Azure VM and a JMS queue; users connect via a web browser or a rich-client app. Workflow engines run in worker roles, pulling workflow invocations from the queue; e-SC control data flows through the server while workflow data moves through the e-SC blob store, backed by the Azure Blob store. Example engine configuration: 3 nodes, 24 cores.]

Modular architecture → indefinitely scalable!
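The reason the architecture scales is the queue in the middle: adding engines just adds consumers. A toy stand-in for the JMS dispatch using Python threads (not the actual e-SC code):

```python
import queue
import threading

def engine(worker_id, tasks, results):
    # Each workflow engine polls the shared queue until it receives a
    # None sentinel, mimicking workers consuming a JMS invocation queue.
    while True:
        invocation = tasks.get()
        if invocation is None:
            break
        results.append((worker_id, invocation))
        tasks.task_done()

tasks, results = queue.Queue(), []
workers = [threading.Thread(target=engine, args=(i, tasks, results))
           for i in range(3)]          # "3 nodes" in miniature
for w in workers:
    w.start()
for inv in ["wf-1", "wf-2", "wf-3", "wf-4"]:
    tasks.put(inv)                     # the main server enqueues invocations
tasks.join()                           # wait until every invocation is handled
for _ in workers:
    tasks.put(None)                    # shut the pool down
for w in workers:
    w.join()
```

Growing the pool changes only the worker count; nothing about the producer side needs to know how many engines exist.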
Scripts to workflow — Execution

[Pipeline phases: Design → Cloud Deployment → Execution → Analysis]

• Runtime monitoring
• Provenance collection
Performance

Three workflow engines perform better than our HPC benchmark at larger sample sizes.

Technical configuration for the 3-VM experiments:
• HPC cluster (dedicated nodes): 3× 8-core compute nodes, Intel Xeon E5640 at 2.67 GHz, 48 GiB RAM, 160 GB scratch space
• Azure workflow engines: D13 VMs with an 8-core CPU, 56 GiB of memory and a 400 GB SSD, Ubuntu 14.04
Scalability
There is little incentive to grow the VM pool beyond 6 engines
Cost

[Chart: cost in GBP (0–18) against number of samples (0–24), for configurations of 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores).]

Again, a 6-engine configuration achieves near-optimal cost per sample.
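The cost comparison reduces to simple arithmetic over billed VM-hours; the rate and runtime below are made-up placeholders, not the figures behind the chart:

```python
def cost_per_sample(n_engines, hourly_rate_gbp, runtime_hours, n_samples):
    # Total VM-hours billed for the batch, divided by the batch size.
    # Doubling engines only pays off if runtime drops close to half.
    return n_engines * hourly_rate_gbp * runtime_hours / n_samples

# Illustrative only: 6 engines at a hypothetical GBP 0.50/hour,
# a 10-hour run over a batch of 24 exomes
print(cost_per_sample(6, 0.50, 10, 24))
```

This is why the 6-engine point wins: beyond it, runtime no longer shrinks proportionally (the scalability plateau above), so the numerator grows faster than the speed-up.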
Lessons learnt

[Pipeline phases: Design → Cloud Deployment → Execution → Analysis]

Design:
✓ Better abstraction
• Easier to understand, share, maintain
✓ Better exploitation of data parallelism
✓ Extensible by wrapping new tools

Cloud deployment:
• Scalability
✓ Fewer installation/deployment requirements and staff hours required
✓ Automated dependency management and packaging
✓ Configurable to make the most efficient use of a cluster

Execution:
✓ Runtime monitoring
✓ Provenance collection

Analysis:
✓ Reproducibility
✓ Accountability
Part II: SVI — simple, traceable variant interpretation

Objectives:
• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain a history of past investigations for analytical purposes
• Ensure accountability through traceability
• Enable analytics over past patient cases

[Diagram. Annotated patient variants from the NGS pipeline feed three steps. Variant filtering: MAF threshold; non-synonymous, stop/gain and frameshift variants; known polymorphisms; homo-/heterozygosity; pathogenicity predictors. Variant scoping: user-supplied disease keywords are matched through HPO → OMIM → gene, then combined (union/intersection) with a user-supplied genes list and user-defined preferred genes to give the genes in scope; selecting the variants in scope yields the candidate variants. Variant classification: a ClinVar lookup assigns RED (found, pathogenic), GREEN (found, benign) or AMBER (not found, or uncertain).]
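The RED/AMBER/GREEN triage can be sketched as a small classifier over a ClinVar significance string; the matching rules below are a simplification for illustration, not the exact SVI logic:

```python
def classify_variant(clinvar_significance):
    """Traffic-light triage of a variant from its ClinVar significance.
    None means the variant was not found in ClinVar at all."""
    if clinvar_significance is None:
        return "AMBER"   # not found -> needs human review
    s = clinvar_significance.lower()
    if "pathogenic" in s and "benign" not in s:
        return "RED"     # found, pathogenic
    if "benign" in s:
        return "GREEN"   # found, benign
    return "AMBER"       # found, but uncertain significance

print(classify_variant("Pathogenic"))            # a RED case
print(classify_variant("Uncertain significance"))  # an AMBER case
```

The key design point the slide makes is that AMBER is the default: anything not positively known is routed to the clinician rather than silently dropped.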
A database of patient cases and investigations

Cases: [screenshot of the cases table]
Investigations
Provenance of variant identification

• A provenance graph is generated for each investigation
• It accounts for the filtering process for each variant listed in the result
• Enables analytics over provenance graphs across many investigations, e.g. "which variants were identified independently on different cases, and how do they correlate with phenotypes?"
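A query of that kind can be expressed even over a toy provenance store; the tuple schema below is invented for illustration, not the actual SVI database schema:

```python
from collections import defaultdict

# Toy provenance records: (investigation_id, variant_id, phenotype)
records = [
    ("inv1", "varA", "phen1"),
    ("inv2", "varA", "phen1"),
    ("inv3", "varB", "phen2"),
]

def recurrent_variants(records):
    """Variants identified independently in more than one investigation."""
    by_variant = defaultdict(set)
    for inv, var, _pheno in records:
        by_variant[var].add(inv)
    return {v: invs for v, invs in by_variant.items() if len(invs) > 1}

print(recurrent_variants(records))
```

In SVI the same question runs over the recorded provenance graphs, so the answer also carries the filtering steps that led to each identification.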
Summary

What we are delivering to NIHR:

1. WES/WGS data processing to annotated variants:
• Scalable, cloud-based
• High level
• Low cost per sample

2. Variant interpretation:
• Simple
• Targeted at clinicians
• Built-in accountability of genetic diagnosis
• Analytics over a database of past investigations
More Related Content

What's hot

HGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseHGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBase
Dan Han
 
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons LearnedHadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
DataWorks Summit
 
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metricsBenchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
Rim Moussa
 

What's hot (20)

GeoMesa LocationTech DC
GeoMesa LocationTech DCGeoMesa LocationTech DC
GeoMesa LocationTech DC
 
Globus Integrations (CHPC 2019 - South Africa)
Globus Integrations (CHPC 2019 - South Africa)Globus Integrations (CHPC 2019 - South Africa)
Globus Integrations (CHPC 2019 - South Africa)
 
HGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBaseHGrid A Data Model for Large Geospatial Data Sets in HBase
HGrid A Data Model for Large Geospatial Data Sets in HBase
 
Big Data with Modern R & Spark
Big Data with Modern R & SparkBig Data with Modern R & Spark
Big Data with Modern R & Spark
 
Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2Full stack analytics with Hadoop 2
Full stack analytics with Hadoop 2
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
Resilient Distributed Datasets
Resilient Distributed DatasetsResilient Distributed Datasets
Resilient Distributed Datasets
 
Genome-scale Big Data Pipelines
Genome-scale Big Data PipelinesGenome-scale Big Data Pipelines
Genome-scale Big Data Pipelines
 
OGCE TeraGrid 2010 ASTA Support
OGCE TeraGrid 2010 ASTA SupportOGCE TeraGrid 2010 ASTA Support
OGCE TeraGrid 2010 ASTA Support
 
Giving MongoDB a Way to Play with the GIS Community
Giving MongoDB a Way to Play with the GIS CommunityGiving MongoDB a Way to Play with the GIS Community
Giving MongoDB a Way to Play with the GIS Community
 
AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...
AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...
AWS Public Sector Symposium 2014 Canberra | Big Data in the Cloud: Accelerati...
 
Data-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud BurstData-intensive IceCube Cloud Burst
Data-intensive IceCube Cloud Burst
 
MongoDB + GeoServer
MongoDB + GeoServerMongoDB + GeoServer
MongoDB + GeoServer
 
SkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage SystemSkyhookDM - Towards an Arrow-Native Storage System
SkyhookDM - Towards an Arrow-Native Storage System
 
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons LearnedHadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
Hadoop for High-Performance Climate Analytics - Use Cases and Lessons Learned
 
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
 
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metricsBenchmarking data warehouse systems in the cloud: new requirements & new metrics
Benchmarking data warehouse systems in the cloud: new requirements & new metrics
 
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
Running a GPU burst for Multi-Messenger Astrophysics with IceCube across all ...
 
Academy PRO: Elasticsearch Misc
Academy PRO: Elasticsearch MiscAcademy PRO: Elasticsearch Misc
Academy PRO: Elasticsearch Misc
 
You might be paying too much for BigQuery
You might be paying too much for BigQueryYou might be paying too much for BigQuery
You might be paying too much for BigQuery
 

Similar to Invited cloud-e-Genome project talk at 2015 NGS Data Congress

Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
Databricks
 
Aastha Grover Resume (2)
Aastha Grover Resume (2)Aastha Grover Resume (2)
Aastha Grover Resume (2)
Aastha Grover
 
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay RaiConquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Databricks
 

Similar to Invited cloud-e-Genome project talk at 2015 NGS Data Congress (20)

Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A CloudScalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
 
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
 
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less...
 
How novel compute technology transforms life science research
How novel compute technology transforms life science researchHow novel compute technology transforms life science research
How novel compute technology transforms life science research
 
FULLTEXT02
FULLTEXT02FULLTEXT02
FULLTEXT02
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 
Satwik mishra resume
Satwik mishra resumeSatwik mishra resume
Satwik mishra resume
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
 
Satwik resume
Satwik resumeSatwik resume
Satwik resume
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
Satwik mishra resume
Satwik mishra resumeSatwik mishra resume
Satwik mishra resume
 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
 
Aastha Grover Resume (2)
Aastha Grover Resume (2)Aastha Grover Resume (2)
Aastha Grover Resume (2)
 
Scientific
Scientific Scientific
Scientific
 
Satwik Mishra resume
Satwik Mishra resumeSatwik Mishra resume
Satwik Mishra resume
 
Geospatial Sensor Networks and Partitioning Data
Geospatial Sensor Networks and Partitioning DataGeospatial Sensor Networks and Partitioning Data
Geospatial Sensor Networks and Partitioning Data
 
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay RaiConquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
Conquering Hadoop and Apache Spark with Operational Intelligence with Akshay Rai
 
VINEYARD Overview - ARC 2016
VINEYARD Overview - ARC 2016VINEYARD Overview - ARC 2016
VINEYARD Overview - ARC 2016
 
Machine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy Cross
 
Processing genetic data at scale
Processing genetic data at scaleProcessing genetic data at scale
Processing genetic data at scale
 

More from Paolo Missier

Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
Paolo Missier
 

More from Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke

Invited cloud-e-Genome project talk at 2015 NGS Data Congress

  • 1. NGS Data Congress, London, June 2015 - P. Missier
Scalable WES Processing and Variant Interpretation with Provenance Recording, Using Workflow on the Cloud
Paolo Missier, Jacek Cała, Yaobo Xu, Eldarina Wijaya, Ryan Kirby
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
NGS Data Congress, London, June 15th, 2015
  • 2. The Cloud-e-Genome project at Newcastle
Objectives:
1. NGS data processing:
• Implement a flexible WES/WGS pipeline
• Scalable deployment over a public cloud
2. Traceable variant interpretation:
• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain history of past investigations for analytical purposes
With an aim to:
• Cost control
• Scalability
• Flexibility of design and of maintenance
• Ensure accountability through traceability
• Enable analytics over past patient cases
• 2-year pilot project: 2013-2015
• Funded by the UK's National Institute for Health Research (NIHR)
• Cloud resources from an Azure for Research Award
  • 3. Part I: data processing
Objectives:
• Design and implement a flexible WES/WGS pipeline
• Using workflow technology → high-level programming
• Providing scalable deployment over a public cloud
  • 4. Scripted NGS data processing pipeline
• Alignment: aligns the sample sequence to the HG19 reference genome using the BWA aligner
• Cleaning and duplicate elimination: Picard tools
• Recalibration (GATK): corrects for systematic bias in the quality scores assigned by the sequencer
• Coverage: computes the coverage of each read
• Variant calling: operates on multiple samples simultaneously; splits samples into chunks. The haplotype caller detects both SNVs and longer indels
• Variant recalibration: attempts to reduce the false-positive rate from the caller
• VCF subsetting by filtering, e.g. non-exomic variants
• Annotation: Annovar functional annotations (e.g. MAF, synonymy, SNPs…) followed by in-house annotations
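The VCF subsetting step lends itself to a small illustration. The sketch below filters a simplified VCF down to exonic records; the `Func` INFO tag and the toy records are invented for the example, not the pipeline's actual annotation schema:

```python
# Hypothetical sketch of "VCF subsetting by filtering": keep header
# lines plus records whose INFO field carries the requested functional
# class. Field layout follows the VCF column order (INFO is column 8).

def subset_vcf(lines, keep_func="exonic"):
    kept = []
    for line in lines:
        if line.startswith("#"):      # preserve all header lines
            kept.append(line)
            continue
        info = line.rstrip("\n").split("\t")[7]
        tags = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        if tags.get("Func") == keep_func:
            kept.append(line)
    return kept

vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\tFunc=exonic;MAF=0.01",
    "chr1\t200\t.\tC\tT\t60\tPASS\tFunc=intronic;MAF=0.20",
]
print(subset_vcf(vcf))
```

The intronic record is dropped while both header lines and the exonic variant survive.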
  • 5. Scripts to workflow - Design
Design → Cloud Deployment → Execution → Analysis
Theoretical advantages of using a workflow programming model:
• Better abstraction: easier to understand, share and maintain
• Better exploitation of data parallelism
• Extensible by wrapping new tools
  • 6. Workflow design
"Wrapper" blocks and utility blocks, e.g.:
echo Preparing directories $PICARD_OUTDIR and $PICARD_TEMP
mkdir -p $PICARD_OUTDIR
mkdir -p $PICARD_TEMP
echo Starting PICARD to clean BAM files...
$Picard_CleanSam INPUT=$SORTED_BAM_FILE OUTPUT=$SORTED_BAM_FILE_CLEANED
echo Starting PICARD to remove duplicates...
$Picard_NoDups INPUT=$SORTED_BAM_FILE_CLEANED OUTPUT=$SORTED_BAM_FILE_NODUPS_NO_RG METRICS_FILE=$PICARD_LOG REMOVE_DUPLICATES=true ASSUME_SORTED=true
echo Adding read group information to bam file...
$Picard_AddRG INPUT=$SORTED_BAM_FILE_NODUPS_NO_RG OUTPUT=$SORTED_BAM_FILE_NODUPS RGID=$READ_GROUP_ID RGPL=illumina RGSM=$SAMPLE_ID RGLB="${SAMPLE_ID}_${READ_GROUP_ID}" RGPU="platform_Unit_${SAMPLE_ID}_${READ_GROUP_ID}"
echo Indexing bam files...
samtools index $SORTED_BAM_FILE_NODUPS
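The wrapper-block idea can be illustrated outside e-Science Central. The Python sketch below is a stand-in, not e-SC's actual block API: each block wraps one tool-like function, and blocks are chained so that one block's output becomes the next block's input:

```python
# Illustrative "wrapper block" pattern (block names borrowed from the
# Picard steps above; the wrapped functions are toy stand-ins for the
# real tools, which operate on files rather than in-memory lists).

def make_block(name, func):
    def block(payload):
        print(f"[{name}] running")   # stands in for runtime monitoring
        return func(payload)
    return block

clean = make_block("Picard-CleanSam", lambda reads: [r for r in reads if r])
dedup = make_block("Picard-MarkDuplicates", lambda reads: sorted(set(reads)))

pipeline = [clean, dedup]
data = ["readB", "readA", "", "readA"]
for blk in pipeline:
    data = blk(data)
print(data)   # → ['readA', 'readB']
```

The explicit chaining mirrors the file connections between blocks in the actual workflow.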
  • 7. Workflow design
Conceptual: raw sequences → align → clean → recalibrate alignments → calculate coverage → call variants → recalibrate variants → filter variants → annotate → coverage information and annotated variants
Actual: the same steps split into three stages; Stage 1 (per sample): align, clean, recalibrate alignments, calculate coverage; Stage 2: call variants, recalibrate variants; Stage 3 (per sample): filter variants, annotate
  • 8. Anatomy of a complex parallel dataflow
e-Science Central: simple dataflow model…
Sample-split: parallel processing of samples in a batch
(Diagram: the three-stage pipeline of the previous slide, replicated per sample.)
  • 9. Anatomy of a complex parallel dataflow
… with hierarchical structure
  • 10. Phase II, top level
Chromosome-split: parallel processing of each chromosome across all samples
  • 11. Phase III
Sample-split: parallel processing of samples
  • 12. Implicit parallelism in the pipeline
Stage I (per-sample parallel processing): align-clean-recalibrate-coverage, for Sample 1 … Sample n
Stage II (per-chromosome parallel processing): variant calling and recalibration, after a chromosome split
Stage III (per-sample parallel processing): variant filtering and annotation
How does the workflow design exploit this parallelism?
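The two levels of parallelism can be sketched with an ordinary thread pool standing in for e-Science Central's sub-workflow invocations; the stage functions here are placeholders, not the real pipeline steps:

```python
# Sketch of the pipeline's two-level parallelism: a per-sample map in
# Stage I and a per-chromosome map in Stage II. A thread pool stands in
# for e-SC's parallel sub-workflow submissions.
from concurrent.futures import ThreadPoolExecutor

def stage1(sample):                   # placeholder: align-clean-recalibrate-coverage
    return f"{sample}.bam"

def stage2(chromosome, bams):         # placeholder: joint calling per chromosome
    return (chromosome, len(bams))    # pretend result: inputs seen per chromosome

samples = ["S1", "S2", "S3"]
chromosomes = ["chr1", "chr2"]

with ThreadPoolExecutor() as pool:
    bams = list(pool.map(stage1, samples))                          # sample-split
    calls = list(pool.map(lambda c: stage2(c, bams), chromosomes))  # chromosome-split

print(bams, calls)
```

Stage II starts only once all Stage I results are available, matching the synchronous design described in the notes.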
  • 13. Parallel processing over a batch of exomes
Stage 1 (per sample): align each lane → clean sample → recalibrate sample → coverage per sample
Stage 2: variant calling with chromosome-split → haplotype caller → recalibrate variants
Stage 3 (per sample): filter sample → annotate sample → annotated variants
  • 14. Cloud Deployment
Design → Cloud Deployment → Execution → Analysis
• Scalability
• Fewer installation/deployment requirements and staff hours required
• Automated dependency management and packaging
• Configurable to make the most efficient use of a cluster
  • 15. Workflow on the Azure cloud - modular configuration
• e-Science Central main server (Azure VM): web UI, REST API and JMS queue, accessed from a web browser or a rich client app
• Azure Blob store: e-SC database backend and e-SC blob store
• Workflow engines (worker roles): run workflow invocations, exchanging e-SC control data and workflow data with the server
• Module configuration: 3 nodes, 24 cores
• Modular architecture → indefinitely scalable!
  • 16. Scripts to workflow
Design → Cloud Deployment → Execution → Analysis
3. Execution:
• Runtime monitoring
• Provenance collection
  • 17. Performance
3 workflow engines perform better than our HPC benchmark on larger sample sizes.
Technical configurations for the 3-VM experiments:
• HPC cluster (dedicated nodes): 3× 8-core compute nodes, Intel Xeon E5640 2.67 GHz CPU, 48 GiB RAM, 160 GB scratch space
• Azure workflow engines: D13 VMs with an 8-core CPU, 56 GiB of memory and a 400 GB SSD, Ubuntu 14.04
  • 18. Scalability
There is little incentive to grow the VM pool beyond 6 engines.
  • 19. Cost
Chart: cost in GBP against number of samples (0-24), for 3 engines (24 cores), 6 engines (48 cores) and 12 engines (96 cores).
Again, a 6-engine configuration achieves near-optimal cost per sample.
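The comparison behind the chart boils down to simple per-sample arithmetic. The run costs below are illustrative placeholders, not the measured figures from the experiment:

```python
# Back-of-envelope cost-per-sample comparison across engine pool sizes.
# Keys are (engines, samples); values are hypothetical total run costs
# in GBP, made up for illustration.
runs = {
    (3, 24): 12.0,
    (6, 24): 10.0,
    (12, 24): 11.5,
}
per_sample = {cfg: cost / cfg[1] for cfg, cost in runs.items()}
best = min(per_sample, key=per_sample.get)
print(best, round(per_sample[best], 3))
```

With these placeholder numbers the 6-engine configuration comes out cheapest per sample, mirroring the slide's conclusion: beyond a certain pool size, extra engines raise cost without a matching speed-up.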
  • 20. Lessons learnt
Design:
✓ Better abstraction: easier to understand, share and maintain
✓ Better exploitation of data parallelism
✓ Extensible by wrapping new tools
Cloud deployment:
• Scalability
✓ Fewer installation/deployment requirements and staff hours required
✓ Automated dependency management and packaging
✓ Configurable to make the most efficient use of a cluster
Execution and analysis:
✓ Runtime monitoring
✓ Provenance collection
✓ Reproducibility
✓ Accountability
  • 21. Part II: SVI - simple, traceable variant interpretation
Objectives:
• Design a simple-to-use tool to facilitate clinical diagnosis by clinicians
• Maintain history of past investigations for analytical purposes
• Ensure accountability through traceability
• Enable analytics over past patient cases
Variant filtering: MAF threshold; non-synonymous, stop/gain and frameshift variants; known polymorphisms; homo-/heterozygosity; pathogenicity predictors
Variant scoping: user-supplied disease keywords → HPO match → HPO to OMIM → OMIM match → OMIM to gene; combined (gene union/intersection) with a user-supplied genes list and user-defined preferred genes to give the genes in scope; candidate variants from the NGS pipeline are restricted to the variants in scope
Variant classification: ClinVar lookup on the variants in scope; RED: found, pathogenic; GREEN: found, benign; AMBER: not found or uncertain
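SVI's traffic-light classification after the ClinVar lookup can be sketched in a few lines; the lookup table here is invented for illustration, not real ClinVar data:

```python
# Minimal sketch of SVI's classification step: RED = found and
# pathogenic, GREEN = found and benign, AMBER = not found (or
# of uncertain significance). CLINVAR is a toy stand-in table.
CLINVAR = {"var1": "pathogenic", "var2": "benign"}

def classify(variant_id):
    status = CLINVAR.get(variant_id)
    if status == "pathogenic":
        return "RED"
    if status == "benign":
        return "GREEN"
    return "AMBER"

print([classify(v) for v in ["var1", "var2", "var3"]])
```

Variants absent from the lookup default to AMBER, so novel variants are surfaced for review rather than silently discarded.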
  • 22. A database of patient cases and investigations
Cases:
  • 24. Provenance of variant identification
• A provenance graph is generated for each investigation
• It accounts for the filtering process for each variant listed in the result
• Enables analytics over provenance graphs across many investigations, e.g. "which variants were identified independently on different cases, and how do they correlate with phenotypes?"
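The kind of cross-investigation analytics described above can be sketched as a query over per-investigation provenance records; the data model below is invented for illustration, not SVI's actual provenance graph format:

```python
# Sketch of analytics across investigations: index which cases flagged
# each variant, then ask which variants were identified independently
# in more than one case. Records are toy stand-ins for provenance graphs.
from collections import defaultdict

investigations = [
    {"case": "caseA", "flagged": {"var1", "var2"}},
    {"case": "caseB", "flagged": {"var2"}},
    {"case": "caseC", "flagged": {"var2", "var3"}},
]

by_variant = defaultdict(set)
for inv in investigations:
    for var in inv["flagged"]:
        by_variant[var].add(inv["case"])

recurrent = sorted(v for v, cases in by_variant.items() if len(cases) > 1)
print(recurrent)   # variants seen in more than one case
```

A real implementation would traverse the stored provenance graphs and join against phenotype data, but the query shape is the same inverted index.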
  • 25. Summary
What we are delivering to NIHR:
1. WES/WGS data processing to annotated variants:
• Scalable, cloud-based
• High-level
• Low cost per sample
2. Variant interpretation:
• Simple
• Targeted at clinicians
• Built-in accountability of genetic diagnosis
• Analytics over a database of past investigations

Editor's Notes

  1. Objective 1: Implement a cloud-based, secure, scalable computing infrastructure that is capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnosis for health care professionals. Obj 2: front-end tool to facilitate clinical diagnosis. 2-year pilot project funded by the UK's National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC). Nov. 2013: cloud resources from an Azure for Research Award; 1 year's worth of data/network/computing resources.
  2. Objective 1: Implement a cloud-based, secure, scalable computing infrastructure that is capable of translating the potential benefits of high-throughput sequencing into actual genetic diagnosis for health care professionals. Obj 2: front-end tool to facilitate clinical diagnosis. 2-year pilot project funded by the UK's National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC). Nov. 2013: cloud resources from an Azure for Research Award; 1 year's worth of data/network/computing resources.
  3. Current local implementation: - Scripted pipeline → requires expertise to maintain and evolve - Deployed on the local department cluster - Difficult to scale - Cost per patient unknown - Unable to take advantage of the decreasing cost of commodity cloud resources. Coverage information translates into confidence in the variant call. Recalibration: quality score recalibration. The machine produces colour coding for the 4 bases, along with a p-value indicating the highest-probability call; these are the Q scores. Different platforms give different systematic bias on Q scores, and the bias also depends on the lane: each lane gives a different systematic bias. The point of recalibration is to correct for this type of bias.
  4. Wrapper blocks, such as Picard-CleanSAM and Picard-MarkDuplicates, communicate via files in the local filesystem of the workflow engine, which is explicitly denoted as a connection between blocks. The workflow also includes utility blocks to import and export files, i.e. to transfer data from/to the shared data space (in this case, the Azure blob store). These were complemented by e-SC shared libraries, which provide better efficiency in running the tools, as they are installed only once and cached by the workflow engine for any future use. Libraries also promote reproducibility because they eliminate dependencies on external data and services. For instance, to access the human reference genome we built and stored in the system a shared library that included the genome data in a specific version and flavour (precisely, HG19 from UCSC).
  5. Loops were used in stages (1) and (3) to iterate over the samples that the pipeline was configured to process. Control blocks can start a number of sub-workflow invocations, one for each element on their input list. Using these two features, we were able to implement a pattern similar to "map" (in the functional sense), where the initial block generates a list of data samples to process and then, for each element in the list, the following block starts a sub-workflow (the loop body).
  6. Sync design: the subworkflows of each step are executed in parallel but synchronously over a number of samples. It means that the top-level workflow submits N subworkflow invocations for a particular step and waits until they complete. The primary advantage of the discussed synchronous design is that the structure of the pipeline is modular and clearly represented by the top-level orchestrating workflow, whilst the parallelisation is managed by e-SC automatically. The top-level workflow mainly includes blocks to run subworkflows, which are independent parts implementing only the actual work done by a particular step. The control blocks take care of the interaction with the system to submit the subworkflows and also suspend the parent invocation until all of them complete.
  7. The model currently is synchronous execution.
  8. Each sample included 2-lane, paired-end raw sequence reads (4 files per sample). The average size of compressed files was nearly 15 GiB per sample; file decompression was included in the pipeline as one of the initial tasks.
  9. A quick overview of the entered phenotype. Shows how many genes found in OMIM, match with genes found in the patients variants. The graph shows a quick summary of any results produces from ClinVar. The phenotypes section in the bottom right shows results from HPO. In the report sections, on the left hows a collection of all the investigations created for the current case (including the one just created).