SlideShare uma empresa Scribd logo
1 de 22
GenomeInformatics2013–P.Missier
From scripted HPC-based NGS pipelines to
workflows on the cloud
Jacek Cała, Yaobo Xu, Eldarina Azfar Wijaya, Paolo Missier
School of Computing Science and Institute of Genetic Medicine
Newcastle University, Newcastle upon Tyne, UK
C4Bio workshop @CCGrid 2014
Chicago, May 26th, 2014
C4Bio2014@CCGrid,-P.Missier
The Cloud-e-Genome project
NGS data processing:
provide mechanisms to rapidly and flexibly
create new exome sequence data processing
pipelines, and to deploy them in a scalable way;
Cost
Scalability
Flexibility
Data to insightHuman variant interpretation for clinical
diagnosis:
provide clinicians with a tool for analysis
and interpretation of human variants
• 2 year pilot project
• Funded by UK’s National Institute for Health Research (NIHR)
through the Biomedical Research Council (BRC)
• Nov. 2013: Cloud resources from Azure for Research Award
• 1 year’s worth of data/network/computing resources
Challenge:
to deliver the benefits of WES/WGS technology to clinical practice
C4Bio2014@CCGrid,-P.Missier
Key technical goals
• Scalability
• In the rate and number of patient sequence submissions
• In the density of sequence data (from whole exome to whole genome)
• Flexibility, Traceability, Comparability across versions
• Simplify experimenting with alternative pipelines (choice of tools, configuration
parameters)
• Trace each version and its executions
• Ability to compare results obtained using different pipelines and reason about
the differences
• Openness. Simplify the process of adding:
• New variant analysis tools
• New statistical methods for variant filtering, selection, and ranking
• Integration with third party databases
C4Bio2014@CCGrid,-P.Missier
Approach and testbed
Technical Approach:
• double- porting
• Infrastructure: HPC cluster to cloud (IaaS)
• Implementation: NGS pipelines from scripts to workflow
• Implement user tools for clinical diagnosis as cloud apps (SaaS)
Testbed and scale:
• Neurological patients from the North-East of England, focus on rare diseases
• Initial testing on about 300 sequences
• 2500-3000 sequences expected within 12 months
C4Bio2014@CCGrid,-P.Missier
Why port to workflow?
• Programming:
• Workflows provide better abstraction in the specification of pipelines
• Workflows directly executable by enactment engine
• Easier to understand, share, and maintain over time
• Flexible – relatively easy to introduce variations
• System: minimal installation/deployment requirements
• Fewer dedicated technical staff hours required
• Automated dependency management, packaging, deployment
• Extensible by wrapping new tools
• Exploits available data parallelism (but not automagically)
• Reproducibility
• Execution monitoring, provenance collection
• Persistence trace serves as evidence for data
• Amenable to automated analysis
C4Bio2014@CCGrid,-P.Missier
Scripted pipeline
Recalibration
Corrects for system
bias on quality
scores assigned by
sequencer
Computes coverage
of each read.
VCF Subsetting by filtering,
eg non-exomic variants
Annovar functional annotations (eg
MAF, synonimity, SNPs…)
followed by in house annotations
Aligns sample
sequence to HG19
reference genome
using BWA aligner
Cleaning, duplicate
elimination
Picard tools
Variant calling operates on
multiple samples
simultaneously
Splits samples into chunks.
Haplotype caller detects
both SNV as well as longer
indels
Variant recalibration
attempts to reduce
false positive rate
from caller
C4Bio2014@CCGrid,-P.Missier
From scripts to workflows
C4Bio2014@CCGrid,-P.Missier Workflow nesting
C4Bio2014@CCGrid,-P.Missier Pipeline evolution
Pipeline:
set C = { c1 … cn } of components -- tool wrappers
Each ci has a configuration conf(ci) and a version v(ci)
…and why
• Technology / algorithm evolution
• Traditional GATK variant caller 
GATK haplotype caller
• Does the interface change?
• Do the operational assumptions
change?
Eg. GATK Variant Recalibrator
requires large input data. Not suitable for
targeted sequencing
What can change
1 – Tool version:
v(ci)  v’(ci)
2 - Tool replacement / add / remove:
ci  c’I
3 – Configuration parameters
conf(ci)  conf’(ci)
(*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J.
Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.”
Briefings in bioinformatics, pp. bbs086–, Jan. 2013
Just for sequence alignment Pabinger et al. in their survey (*) list 17 aligners while
for variant annotation they refer over 70 tools
C4Bio2014@CCGrid,-P.Missier
Role of provenance
Provenance refers to the sources of information, including entities
and processes, involving in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how
they came to be in the state they are in today (*)
• Provenance is evidence in support of clinical diagnosis
1. Why do these variants appear in the output list?
2. Why have you concluded they are disease-causing?
• Requires ability to trace variants through workflow execution
• Simple scripting lacks this functionality
“Where do these variants come from?”
“Why do these results differ?”
C4Bio2014@CCGrid,-P.Missier Comparing results across pipeline configurations
Run pipeline version V1
V1  V2:
Replace BWA version
Modify Annovar configuration parameters
Variant list
VL1
Variant list
VL2Run pipeline version V2
??
Variant list
VL1
Variant list
VL2
DDIFF
(data differencing)
PDIFF
(provenance differencing)
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing
for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013):
doi:10.1002/cpe.3035.
C4Bio2014@CCGrid,-P.Missier PDIFF - overview
WA
WB
C4Bio2014@CCGrid,-P.Missier The corresponding provenance traces
d1
S0
S0'
w h
S3 S2
y z
S4
x
k
S1
d2
d1'
S0
k'h'
S3'
S2v2
w'
S3
S4
y' z'
x'
S5
d2
(i) Trace A (ii) Trace B
P0 P1
P0 P1
P0 P0 P1P1
S Sv2
d0 d0
C4Bio2014@CCGrid,-P.Missier Delta graph computed by PDIFF
x, x
y, y z, z
w, w
k, k
S0 , S3
S0'
S3'
S1, S5
(service repl.)
S2, S2v2
(version change)
h, h
S0'
P0 branch of S4 P1 branch of S4
P0 branch of S2 P1 branch of S2
S,Sv2
(version change)
S0, S0
d1, d1
PDIFF helps determine the impact of
variations in the pipeline
C4Bio2014@CCGrid,-P.Missier
HPC Cluster configuration
16compute nodes
48/96GB RAM / 250GB disk
19TB usable storage space
Gigabit Ethernet
Shared resource for Institute-wide research
Submission script specifies node
/ core requirements
Computation waits until
resources are available
Current config:
• BWA alignment: 2 cores
• GATK: 8 cores
C4Bio2014@CCGrid,-P.Missier The case for cloud in genome informatics (*)
(*) Stein, Lincoln D. “The Case for Cloud Computing in Genome Informatics.” Genome
Biology 11, no. 5 (January 2010): 207.
• Storage + computing resources co-located
in a cloud
• Privacy issues
• Public, private, or hybrid
• Fluctuating demand  benefits from
elasticity
• Web-based access to clinicians simplifies
adoption
C4Bio2014@CCGrid,-P.Missier Workflow on Azure Cloud - configuration
<<Azure VM>>
Azure Blob
store
e-SC db
backend
<<Azure VM>>
e-Science
Central
main server JMS queue
REST APIWeb UI
web
browser
rich client
app
workflow invocations
e-SC control data
workflow data
<<worker role>>
Workflow
engine
<<worker role>>
Workflow
engine
e-SC blob
store
<<worker role>>
Workflow
engine
Workflow engines
Top level workflow
Sub-workflows
Test configuration:
3 nodes, 24 cores
C4Bio2014@CCGrid,-P.Missier Workflow and sub-workflows execution
To e-SC queue To e-SC queue
Executable
Block
To e-SC queue
e-SC db
<<Azure VM>>
e-Science
Central
main server JMS queue
REST APIWeb UI
web
browser
rich client
app
workflow invocations
e-SC control data
workflow data
<<worker role>>
Workflow
engine
<<worker role>>
Workflow
engine
e-SC blob
store
<<worker role>>
Workflow
engine
Workflow invocation executing on one engine (fragment)
C4Bio2014@CCGrid,-P.Missier Multi-sample processing
Sample list
[S1…Sk]
Top level
workflow
Variant files
[VCF1…VCFk]
Map semantics:
push K new
workflow
invocations to
the e-SC queue
<<Azure VM>>
e-Science
Central
main server JMS queue
REST APIWeb UI
web
browser
rich client
app
workflow invocations
e-SC control data
workflow data
<<worker role>>
Workflow
engine
<<worker role>>
Workflow
engine
e-SC blob
store
<<worker role>>
Workflow
engine
BWA (S1)
BWA (S2)
…
C4Bio2014@CCGrid,-P.Missier Sub-workflows enqueued recursively
Exec block
Specifies threading
 OS maps
threads to
available cores
Sub-workflow
Gets added to
queue
Sub-workflow
One instance gets
added to queue for
each input sample
C4Bio2014@CCGrid,-P.Missier
Preliminary cost estimates
2 samples
1 x 8 core
30 hr @ £0.821 / h = £12.3 / sample
6 samples
3 x 8 core
47 hr @ £2.5 / h = £19 / sample
Cloud deployment makes cost easy to calculate
Trade-off:
• Better flexibility, scalability
• But loss of performance
Some tuning required
Cost model based on node uptime
Better node utilization
• Larger sample batches
Remove unnecessary wait time
• Make sub-workflows async
C4Bio2014@CCGrid,-P.Missier
Summary
• Whole-exome sequence processing on a cloud infrastructure
• Windows Azure – project sponsor
• Tracking provenance as evidence and for change analysis
• Porting HPC scripted pipeline to workflow model and technology
• Scalability, Flexibility, Evolvability

Mais conteúdo relacionado

Destaque

Ipaw12 datalog paper talk
Ipaw12 datalog paper talkIpaw12 datalog paper talk
Ipaw12 datalog paper talkPaolo Missier
 
Invited talk @ DCC09 workshop
Invited talk @ DCC09 workshopInvited talk @ DCC09 workshop
Invited talk @ DCC09 workshopPaolo Missier
 
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...Paolo Missier
 
Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005Paolo Missier
 
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010Paolo Missier
 
Session talk @ AGU09
Session talk @ AGU09Session talk @ AGU09
Session talk @ AGU09Paolo Missier
 
Invited talk at the GeoClouds Workshop, Indianapolis, 2009
Invited talk at the GeoClouds Workshop, Indianapolis, 2009Invited talk at the GeoClouds Workshop, Indianapolis, 2009
Invited talk at the GeoClouds Workshop, Indianapolis, 2009Paolo Missier
 
PDT: Personal Data from Things, and its provenance
PDT: Personal Data from Things,and its provenancePDT: Personal Data from Things,and its provenance
PDT: Personal Data from Things, and its provenancePaolo Missier
 
SWPM12 report on the dagstuhl seminar on Semantic Data Management
SWPM12 report on the dagstuhl seminar on Semantic Data Management SWPM12 report on the dagstuhl seminar on Semantic Data Management
SWPM12 report on the dagstuhl seminar on Semantic Data Management Paolo Missier
 
Structured Occurrence Network for provenance: talk for ipaw12 paper
Structured Occurrence Network for provenance: talk for ipaw12 paperStructured Occurrence Network for provenance: talk for ipaw12 paper
Structured Occurrence Network for provenance: talk for ipaw12 paperPaolo Missier
 
Invited talk @ Cardiff University, 2008: Approximate entity reconciliation fo...
Invited talk @ Cardiff University, 2008: Approximate entity reconciliation fo...Invited talk @ Cardiff University, 2008: Approximate entity reconciliation fo...
Invited talk @ Cardiff University, 2008: Approximate entity reconciliation fo...Paolo Missier
 
ProvAbs: model, policy, and tooling for abstracting PROV graphs
ProvAbs: model, policy, and tooling for abstracting PROV graphsProvAbs: model, policy, and tooling for abstracting PROV graphs
ProvAbs: model, policy, and tooling for abstracting PROV graphsPaolo Missier
 
Big Data Quality Panel : Diachron Workshop @EDBT
Big Data Quality Panel: Diachron Workshop @EDBTBig Data Quality Panel: Diachron Workshop @EDBT
Big Data Quality Panel : Diachron Workshop @EDBTPaolo Missier
 
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
Your data won’t stay smart forever:exploring the temporal dimension of (big ...Your data won’t stay smart forever:exploring the temporal dimension of (big ...
Your data won’t stay smart forever: exploring the temporal dimension of (big ...Paolo Missier
 
The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...Paolo Missier
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralPaolo Missier
 
Paper presentation @DILS'07
Paper presentation @DILS'07Paper presentation @DILS'07
Paper presentation @DILS'07Paolo Missier
 

Destaque (17)

Ipaw12 datalog paper talk
Ipaw12 datalog paper talkIpaw12 datalog paper talk
Ipaw12 datalog paper talk
 
Invited talk @ DCC09 workshop
Invited talk @ DCC09 workshopInvited talk @ DCC09 workshop
Invited talk @ DCC09 workshop
 
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
Paper talk @ EDBT'10: Fine-grained and efficient lineage querying of collecti...
 
Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005Paper presentations: UK e-science AHM meeting, 2005
Paper presentations: UK e-science AHM meeting, 2005
 
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
Paper talk (presented by Prof. Ludaescher), WORKS workshop, 2010
 
Session talk @ AGU09
Session talk @ AGU09Session talk @ AGU09
Session talk @ AGU09
 
Invited talk at the GeoClouds Workshop, Indianapolis, 2009
Invited talk at the GeoClouds Workshop, Indianapolis, 2009Invited talk at the GeoClouds Workshop, Indianapolis, 2009
Invited talk at the GeoClouds Workshop, Indianapolis, 2009
 
PDT: Personal Data from Things, and its provenance
PDT: Personal Data from Things,and its provenancePDT: Personal Data from Things,and its provenance
PDT: Personal Data from Things, and its provenance
 
SWPM12 report on the dagstuhl seminar on Semantic Data Management
SWPM12 report on the dagstuhl seminar on Semantic Data Management SWPM12 report on the dagstuhl seminar on Semantic Data Management
SWPM12 report on the dagstuhl seminar on Semantic Data Management
 
Structured Occurrence Network for provenance: talk for ipaw12 paper
Structured Occurrence Network for provenance: talk for ipaw12 paperStructured Occurrence Network for provenance: talk for ipaw12 paper
Structured Occurrence Network for provenance: talk for ipaw12 paper
 
Invited talk @ Cardiff University, 2008: Approximate entity reconciliation fo...
Invited talk @ Cardiff University, 2008: Approximate entity reconciliation fo...Invited talk @ Cardiff University, 2008: Approximate entity reconciliation fo...
Invited talk @ Cardiff University, 2008: Approximate entity reconciliation fo...
 
ProvAbs: model, policy, and tooling for abstracting PROV graphs
ProvAbs: model, policy, and tooling for abstracting PROV graphsProvAbs: model, policy, and tooling for abstracting PROV graphs
ProvAbs: model, policy, and tooling for abstracting PROV graphs
 
Big Data Quality Panel : Diachron Workshop @EDBT
Big Data Quality Panel: Diachron Workshop @EDBTBig Data Quality Panel: Diachron Workshop @EDBT
Big Data Quality Panel : Diachron Workshop @EDBT
 
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
Your data won’t stay smart forever:exploring the temporal dimension of (big ...Your data won’t stay smart forever:exploring the temporal dimension of (big ...
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
 
The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...The lifecycle of reproducible science data and what provenance has got to do ...
The lifecycle of reproducible science data and what provenance has got to do ...
 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
 
Paper presentation @DILS'07
Paper presentation @DILS'07Paper presentation @DILS'07
Paper presentation @DILS'07
 

Semelhante a C4Bio paper talk

QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuantUniversity
 
Recording and Reasoning Over Data Provenance in Web and Grid Services
Recording and Reasoning Over Data Provenance in Web and Grid ServicesRecording and Reasoning Over Data Provenance in Web and Grid Services
Recording and Reasoning Over Data Provenance in Web and Grid ServicesMartin Szomszor
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeCarole Goble
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
 
Principles of Reproducible Workflows (U-DAWS) nfcamp2019
Principles of Reproducible Workflows (U-DAWS) nfcamp2019Principles of Reproducible Workflows (U-DAWS) nfcamp2019
Principles of Reproducible Workflows (U-DAWS) nfcamp2019Venkat Malladi
 
FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational WorkflowsCarole Goble
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobus
 
Federating Infrastructure as a Service cloud computing systems to create a un...
Federating Infrastructure as a Service cloud computing systems to create a un...Federating Infrastructure as a Service cloud computing systems to create a un...
Federating Infrastructure as a Service cloud computing systems to create a un...David Wallom
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionChris Dwan
 
Executing Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataExecuting Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataeXascale Infolab
 
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...confluent
 
Standard Provenance Reporting and Scientific Software Management in Virtual L...
Standard Provenance Reporting and Scientific Software Management in Virtual L...Standard Provenance Reporting and Scientific Software Management in Virtual L...
Standard Provenance Reporting and Scientific Software Management in Virtual L...njcar
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemWarren Kibbe
 
Declarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsDeclarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsMonal Daxini
 
Evaluating Cloud vs On-Premises for NGS Clinical Workflows
Evaluating Cloud vs On-Premises for NGS Clinical WorkflowsEvaluating Cloud vs On-Premises for NGS Clinical Workflows
Evaluating Cloud vs On-Premises for NGS Clinical WorkflowsGolden Helix
 
(ATS6-APP01) Unleashing the Power of Your Data with Discoverant
(ATS6-APP01) Unleashing the Power of Your Data with Discoverant(ATS6-APP01) Unleashing the Power of Your Data with Discoverant
(ATS6-APP01) Unleashing the Power of Your Data with DiscoverantBIOVIA
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationIan Foster
 

Semelhante a C4Bio paper talk (20)

QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
QuTrack: Model Life Cycle Management for AI and ML models using a Blockchain ...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 
Recording and Reasoning Over Data Provenance in Web and Grid Services
Recording and Reasoning Over Data Provenance in Web and Grid ServicesRecording and Reasoning Over Data Provenance in Web and Grid Services
Recording and Reasoning Over Data Provenance in Web and Grid Services
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, RomeWorkflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
Workflows, provenance and reporting: a lifecycle perspective at BIH 2013, Rome
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
 
Principles of Reproducible Workflows (U-DAWS) nfcamp2019
Principles of Reproducible Workflows (U-DAWS) nfcamp2019Principles of Reproducible Workflows (U-DAWS) nfcamp2019
Principles of Reproducible Workflows (U-DAWS) nfcamp2019
 
FAIR Computational Workflows
FAIR Computational WorkflowsFAIR Computational Workflows
FAIR Computational Workflows
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 Keynote
 
Federating Infrastructure as a Service cloud computing systems to create a un...
Federating Infrastructure as a Service cloud computing systems to create a un...Federating Infrastructure as a Service cloud computing systems to create a un...
Federating Infrastructure as a Service cloud computing systems to create a un...
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on Production
 
Executing Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataExecuting Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web Data
 
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
 
Standard Provenance Reporting and Scientific Software Management in Virtual L...
Standard Provenance Reporting and Scientific Software Management in Virtual L...Standard Provenance Reporting and Scientific Software Management in Virtual L...
Standard Provenance Reporting and Scientific Software Management in Virtual L...
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
 
Declarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsDeclarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data models
 
Evaluating Cloud vs On-Premises for NGS Clinical Workflows
Evaluating Cloud vs On-Premises for NGS Clinical WorkflowsEvaluating Cloud vs On-Premises for NGS Clinical Workflows
Evaluating Cloud vs On-Premises for NGS Clinical Workflows
 
(ATS6-APP01) Unleashing the Power of Your Data with Discoverant
(ATS6-APP01) Unleashing the Power of Your Data with Discoverant(ATS6-APP01) Unleashing the Power of Your Data with Discoverant
(ATS6-APP01) Unleashing the Power of Your Data with Discoverant
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and Computation
 
Launch Elixir BE 2017
Launch Elixir BE 2017Launch Elixir BE 2017
Launch Elixir BE 2017
 

Mais de Paolo Missier

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...Paolo Missier
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...Paolo Missier
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...Paolo Missier
 
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...Paolo Missier
 
ReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for HealthReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for HealthPaolo Missier
 
algorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparencyalgorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparencyPaolo Missier
 

Mais de Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
A Customisable Pipeline for Continuously Harvesting Socially-Minded Twitter U...
 
ReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for HealthReComp and P4@NU: Reproducible Data Science for Health
ReComp and P4@NU: Reproducible Data Science for Health
 
algorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparencyalgorithmic-decisions, fairness, machine learning, provenance, transparency
algorithmic-decisions, fairness, machine learning, provenance, transparency
 

Último

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 

Último (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 

C4Bio paper talk

  • 1. GenomeInformatics2013–P.Missier From scripted HPC-based NGS pipelines to workflows on the cloud Jacek Cała, Yaobo Xu, Eldarina Azfar Wijaya, Paolo Missier School of Computing Science and Institute of Genetic Medicine Newcastle University, Newcastle upon Tyne, UK C4Bio workshop @CCGrid 2014 Chicago, May 26th, 2014
  • 2. C4Bio2014@CCGrid,-P.Missier The Cloud-e-Genome project NGS data processing: provide mechanisms to rapidly and flexibly create new exome sequence data processing pipelines, and to deploy them in a scalable way; Cost Scalability Flexibility Data to insightHuman variant interpretation for clinical diagnosis: provide clinicians with a tool for analysis and interpretation of human variants • 2 year pilot project • Funded by UK’s National Institute for Health Research (NIHR) through the Biomedical Research Council (BRC) • Nov. 2013: Cloud resources from Azure for Research Award • 1 year’s worth of data/network/computing resources Challenge: to deliver the benefits of WES/WGS technology to clinical practice
  • 3. C4Bio2014@CCGrid,-P.Missier Key technical goals • Scalability • In the rate and number of patient sequence submissions • In the density of sequence data (from whole exome to whole genome) • Flexibility, Traceability, Comparability across versions • Simplify experimenting with alternative pipelines (choice of tools, configuration parameters) • Trace each version and its executions • Ability to compare results obtained using different pipelines and reason about the differences • Openness. Simplify the process of adding: • New variant analysis tools • New statistical methods for variant filtering, selection, and ranking • Integration with third party databases
  • 4. C4Bio2014@CCGrid,-P.Missier Approach and testbed Technical Approach: • double- porting • Infrastructure: HPC cluster to cloud (IaaS) • Implementation: NGS pipelines from scripts to workflow • Implement user tools for clinical diagnosis as cloud apps (SaaS) Testbed and scale: • Neurological patients from the North-East of England, focus on rare diseases • Initial testing on about 300 sequences • 2500-3000 sequences expected within 12 months
  • 5. C4Bio2014@CCGrid,-P.Missier Why port to workflow? • Programming: • Workflows provide better abstraction in the specification of pipelines • Workflows directly executable by enactment engine • Easier to understand, share, and maintain over time • Flexible – relatively easy to introduce variations • System: minimal installation/deployment requirements • Fewer dedicated technical staff hours required • Automated dependency management, packaging, deployment • Extensible by wrapping new tools • Exploits available data parallelism (but not automagically) • Reproducibility • Execution monitoring, provenance collection • Persistence trace serves as evidence for data • Amenable to automated analysis
  • 6. C4Bio2014@CCGrid,-P.Missier Scripted pipeline Recalibration Corrects for system bias on quality scores assigned by sequencer Computes coverage of each read. VCF Subsetting by filtering, eg non-exomic variants Annovar functional annotations (eg MAF, synonimity, SNPs…) followed by in house annotations Aligns sample sequence to HG19 reference genome using BWA aligner Cleaning, duplicate elimination Picard tools Variant calling operates on multiple samples simultaneously Splits samples into chunks. Haplotype caller detects both SNV as well as longer indels Variant recalibration attempts to reduce false positive rate from caller
  • 9. C4Bio2014@CCGrid,-P.Missier Pipeline evolution Pipeline: set C = { c1 … cn } of components -- tool wrappers Each ci has a configuration conf(ci) and a version v(ci) …and why • Technology / algorithm evolution • Traditional GATK variant caller  GATK haplotype caller • Does the interface change? • Do the operational assumptions change? Eg. GATK Variant Recalibrator requires large input data. Not suitable for targeted sequencing What can change 1 – Tool version: v(ci)  v’(ci) 2 - Tool replacement / add / remove: ci  c’I 3 – Configuration parameters conf(ci)  conf’(ci) (*) S. Pabinger, A. Dander, M. Fischer, R. Snajder, M. Sperk, M. Efremova, B. Krabichler, M. R. Speicher, J. Zschocke, and Z. Trajanoski, “A survey of tools for variant analysis of next-generation genome sequencing data.” Briefings in bioinformatics, pp. bbs086–, Jan. 2013 Just for sequence alignment Pabinger et al. in their survey (*) list 17 aligners while for variant annotation they refer over 70 tools
  • 10. C4Bio2014@CCGrid,-P.Missier Role of provenance Provenance refers to the sources of information, including entities and processes, involving in producing or delivering an artifact (*) Provenance is a description of how things came to be, and how they came to be in the state they are in today (*) • Provenance is evidence in support of clinical diagnosis 1. Why do these variants appear in the output list? 2. Why have you concluded they are disease-causing? • Requires ability to trace variants through workflow execution • Simple scripting lacks this functionality “Where do these variants come from?” “Why do these results differ?”
  • 11. C4Bio2014@CCGrid,-P.Missier Comparing results across pipeline configurations Run pipeline version V1 V1  V2: Replace BWA version Modify Annovar configuration parameters Variant list VL1 Variant list VL2Run pipeline version V2 ?? Variant list VL1 Variant list VL2 DDIFF (data differencing) PDIFF (provenance differencing) Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013): doi:10.1002/cpe.3035.
  • 13. C4Bio2014@CCGrid,-P.Missier The corresponding provenance traces d1 S0 S0' w h S3 S2 y z S4 x k S1 d2 d1' S0 k'h' S3' S2v2 w' S3 S4 y' z' x' S5 d2 (i) Trace A (ii) Trace B P0 P1 P0 P1 P0 P0 P1P1 S Sv2 d0 d0
  • 14. C4Bio2014@CCGrid,-P.Missier Delta graph computed by PDIFF x, x y, y z, z w, w k, k S0 , S3 S0' S3' S1, S5 (service repl.) S2, S2v2 (version change) h, h S0' P0 branch of S4 P1 branch of S4 P0 branch of S2 P1 branch of S2 S,Sv2 (version change) S0, S0 d1, d1 PDIFF helps determine the impact of variations in the pipeline
  • 15. C4Bio2014@CCGrid,-P.Missier HPC Cluster configuration 16compute nodes 48/96GB RAM / 250GB disk 19TB usable storage space Gigabit Ethernet Shared resource for Institute-wide research Submission script specifies node / core requirements Computation waits until resources are available Current config: • BWA alignment: 2 cores • GATK: 8 cores
  • 16. C4Bio2014@CCGrid,-P.Missier The case for cloud in genome informatics (*) (*) Stein, Lincoln D. “The Case for Cloud Computing in Genome Informatics.” Genome Biology 11, no. 5 (January 2010): 207. • Storage + computing resources co-located in a cloud • Privacy issues • Public, private, or hybrid • Fluctuating demand  benefits from elasticity • Web-based access to clinicians simplifies adoption
  • 17. C4Bio2014@CCGrid,-P.Missier Workflow on Azure Cloud - configuration <<Azure VM>> Azure Blob store e-SC db backend <<Azure VM>> e-Science Central main server JMS queue REST APIWeb UI web browser rich client app workflow invocations e-SC control data workflow data <<worker role>> Workflow engine <<worker role>> Workflow engine e-SC blob store <<worker role>> Workflow engine Workflow engines Top level workflow Sub-workflows Test configuration: 3 nodes, 24 cores
  • 18. C4Bio2014@CCGrid,-P.Missier Workflow and sub-workflows execution To e-SC queue To e-SC queue Executable Block To e-SC queue e-SC db <<Azure VM>> e-Science Central main server JMS queue REST APIWeb UI web browser rich client app workflow invocations e-SC control data workflow data <<worker role>> Workflow engine <<worker role>> Workflow engine e-SC blob store <<worker role>> Workflow engine Workflow invocation executing on one engine (fragment)
  • 19. C4Bio2014@CCGrid,-P.Missier Multi-sample processing Sample list [S1…Sk] Top level workflow Variant files [VCF1…VCFk] Map semantics: push K new workflow invocations to the e-SC queue <<Azure VM>> e-Science Central main server JMS queue REST APIWeb UI web browser rich client app workflow invocations e-SC control data workflow data <<worker role>> Workflow engine <<worker role>> Workflow engine e-SC blob store <<worker role>> Workflow engine BWA (S1) BWA (S2) …
  • 20. C4Bio2014@CCGrid,-P.Missier Sub-workflows enqueued recursively Exec block Specifies threading  OS maps threads to available cores Sub-workflow Gets added to queue Sub-workflow One instance gets added to queue for each input sample
  • 21. C4Bio2014@CCGrid,-P.Missier Preliminary cost estimates 2 samples 1 x 8 core 30 hr @ £0.821 / h = £12.3 / sample 6 samples 3 x 8 core 47 hr @ £2.5 / h = £19 / sample Cloud deployment makes cost easy to calculate Trade-off: • Better flexibility, scalability • But loss of performance Some tuning required Cost model based on node uptime Better node utilization • Larger sample batches Remove unnecessary wait time • Make sub-workflows async
  • 22. C4Bio2014@CCGrid,-P.Missier Summary • Whole-exome sequence processing on a cloud infrastructure • Windows Azure – project sponsor • Tracking provenance as evidence and for change analysis • Porting HPC scripted pipeline to workflow model and technology • Scalability, Flexibility, Evolvability

Notas do Editor

  1. Implement a cloud-based, secure scalable, computing infrastructure that is capable of translating the potential benefits of high throughput sequencing into actual genetic diagnosis to health care professionals. Azure: 10 L instances/ 24h a day. / 30 TB/year. / 10 GB of SQL Azure space / 30-­‐100 TB
  2. Coverage information translates into confidence on variant call Recalibration: quality score recalibration -- machine produces colour coding for the 4 aminocids, along with a p-value indicating the highest prob call; these are the Q scores different platforms give differnst system bias on Q scores -- and also depending on the lane. Each lane gives a different systematic bias. The point of recalibration is to correct for this type of bias
  3. E-Science Central Integrate multiple runtime environments - R, Octave, Java, Javascript, (Perl)
  4. Traditional Variant Callers Go through the whole genome to identify locations where a number of non-reference bases appears to call SNPs Gapped mapping to identify INDELs Different algorithms to calculate SNP and INDELs likelihoods GATK HaplotypeCaller Haplotype-based calling Call SNPs and indels simultaneously by performing a local de-novo assembly Same algorithm for SNPs and Indels likelyhoods Artifacts caused by large INDELs recovered by assembly
  5. We have seen some examples of the look and feel of e-SC. Now we briefly go over the architecture. SaaS – Science as a Service
  6. Config is a trade off – shared resource has limited room for scalability. Peak times
  7. Model currently is sync execution
  8. Overall Azure/lampredi ratio is about 1.4 (Azure solution is 40% slower than lampredi) -- this is based on the pipeline run up GATK-phase3 (excluding), so BWA-picard-GATK1-VariantCalling. BWA-picard-GATK1-VariantCalling takes over 90% of the total time of the pipeline, so approximation should be ok. our input in compressed form is (13.85 GiB on average).