SlideShare uma empresa Scribd logo
1 de 31
Baixar para ler offline
From Scientific Workflow Patterns
to 5-star Linked Open Data
Alban Gaignard Hala Skaf-Molli Audrey Bihouée
Nantes Academic
Hospital (CHU de
Nantes), CNRS, France
LINA,
Nantes University,
CNRS,
France
Institut du Thorax,
Nantes University,
INSERM, CNRS,
France
8th USENIX workshop on Theory and Practice of Provenance (TaPP’16)
Washington DC
Needs for linked experiment reports
2
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
3
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
4
1 sample
Input data 2 x 17 Gb
1-core CPU 170 hours
32-cores CPU 32 hours
Output data 12 Gb
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
5
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
1 sample
Input data 2 x 17 Gb
1-core CPU 170 hours
32-cores CPU 32 hours
Output data 12 Gb
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing (massive) RNA-seq data
TopHat: algorithm to align multiple sequence reads to a reference
genome (known genes).
6
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
Challenges
Algorithmic performance, storage, preservation,
reuse (limit recompute) & share.
1 sample 300 samples
Input data 2 x 17 Gb 10.2 Tb
1-core CPU 170 hours 5.9 years
32-cores CPU 32 hours 14 months
Output data 12 Gb 3.6 Tb
1 sample
Input data 2 x 17 Gb
1-core CPU 170 hours
32-cores CPU 32 hours
Output data 12 Gb
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing experiment results
Scientific experiment: RNA sequencing to quantify gene expression
levels under multiple biological conditions.
7
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Motivations: reusing experiment results
Scientific experiment: RNA sequencing to quantify gene expression
levels under multiple biological conditions.
8
Need for scientific context : metadata
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Expected result: human+machine tractable reports
9
Annotated “Material & Methods”
Links to some workflow artifacts (algorithms, data)
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
5-star Linked Open Data
10
W3C standards for machine and human
readable data on the web.
⭑⭑⭑⭑⭑ : time and expertise !
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
5-star Linked Open Data
11
How to ease this process ?
● Workflow engines → automation
● PROV → workflow runs as linked data
W3C standards for machine and human
readable data on the web.
⭑⭑⭑⭑⭑ : time and expertise !
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PROV only
12
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PROV only
13
too fine-grained
no domain concepts
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Provenance as a Linked Experiment Report
14
few + meaningful statements
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Problem statement & objectives
15
Problem statement
Scientific workflows produce massive raw results. Their
publication into curated query-able linked data repositories
requires lot of time and expertise.
Can we exploit provenance traces to ease the publication of
scientific results as Linked Data ?
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Problem statement & objectives
16
Problem statement
Scientific workflows produce massive raw results. Their
publication into curated query-able linked data repositories
requires lot of time and expertise.
Can we exploit provenance traces to ease the publication of
scientific results as Linked Data ?
Objectives
(1) Leverage annotated workflow patterns to generate
provenance mining rules.
(2) Refine provenance traces into linked experiment reports.
Rules generation
17
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Approach
18
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Input domain-specific annotations (❶,❷)
Workflow patterns ❶
19
Sequence patterns, with possibly intermediate steps
● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar
● EDAM ontology: hasFunction, RNA sequence, Genome map
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Input domain-specific annotations (❶,❷)
Workflow patterns ❶
20
Sequence patterns, with possibly intermediate steps
● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar
● EDAM ontology: hasFunction, RNA sequence, Genome map
Experiment report template ❷
Link scientific claims, statements, material and methods
● MicroPublication ontology: Material, Method, Claims
● Experimental factor ontology: Transcriptome, Gene expression
● NCBI taxonomy: Homo Sapiens
● Open Annotation model: hasBody, hasTarget
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PoeM: generating PrOvEnance Mining rules ❸
21
(SPARQL Property path)
(SPARQL Basic graph pattern)
(SPARQL Construct query)
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
PoeM: sample generated rule ❸
22
<If> part
<Then> part
First experiments &
results
23
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment context
24
SyMeTRIC: systems medicine project (2015-2017, call “Connect
Talent”), funded by the french region Pays de la Loire.
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment
25
Material & methods
● Real-life RNA-seq workflow to study 3 mice populations
● WF implemented in Galaxy, run on 2 biological samples
● PROV traces exported from Galaxy Histories (API)
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Experiment
26
Material & methods
● Real-life RNA-seq workflow to study 3 mice populations
● WF implemented in Galaxy, run on 2 biological samples
● PROV traces exported from Galaxy Histories (API)
Results (for 1 biological sample)
● 60h CPU (12 cores for genome alignment), 21Gb storage
● 3s to export 81 PROV triples from the Galaxy history
● 2s to apply the rule and produce 35 Micropublication triples
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Conclusion & perspectives
27
Semi-automated approach
(1) PoeM generates semantic web rules
(2) PoeM rules applied on PROV traces to assemble
linked experiment reports (MicroPublication)
Limitations:
- Sequence workflow patterns only
- SPARQL property paths with complex WF patterns ?
- Syntactic matching between WF patterns and PROV labels
Usage scenarios:
→ Query workflow datasets with domain concepts
→ Populate RDF repositories with WF results
TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée
Conclusion & perspectives
28
Future works
(1) WF patterns: split-merge, “common motifs”
(2) Genericity: other domains / other reports (RO, Nanopub.)
(3) PROV heterogeneity: multi-systems PROV
(4) Evaluation: involving biologists, at larger scale
Questions ?
Demo: http://poem.univ-nantes.fr
Contact: alban.gaignard@univ-nantes.fr
Acknowledgments
BiRD bioinformatics facility Connect Talent Call
TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 30
Larger scale experiments (PROV traces)
1232 edges
TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 31
Larger scale experiments (PROV traces)
49 edges

Mais conteúdo relacionado

Mais procurados

Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsGenome Reference Consortium
 
Mapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome CoordinatesMapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome CoordinatesYasset Perez-Riverol
 
Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...Yasset Perez-Riverol
 
Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录insight-labs
 
The Lily RowLog library
The Lily RowLog libraryThe Lily RowLog library
The Lily RowLog libraryNGDATA
 
New methods hackathon mhc project
New methods   hackathon mhc projectNew methods   hackathon mhc project
New methods hackathon mhc projectGenomeInABottle
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysisYun Lung Li
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Genome Reference Consortium
 
Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Elsa von Licy
 
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...GENALICE
 
Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...GENALICE
 

Mais procurados (20)

Getting the most from the reference assembly
Getting the most from the reference assemblyGetting the most from the reference assembly
Getting the most from the reference assembly
 
Haplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long readsHaplotype resolved structural variation assembly with long reads
Haplotype resolved structural variation assembly with long reads
 
Mapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome CoordinatesMapping millions of peptidoforms to Genome Coordinates
Mapping millions of peptidoforms to Genome Coordinates
 
Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...Systematic integration of millions of peptidoform evidences into Ensembl and ...
Systematic integration of millions of peptidoform evidences into Ensembl and ...
 
Ashg grc workshop2015_tg
Ashg grc workshop2015_tgAshg grc workshop2015_tg
Ashg grc workshop2015_tg
 
agbt 2016 workshop lindsay
agbt 2016 workshop lindsayagbt 2016 workshop lindsay
agbt 2016 workshop lindsay
 
Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录Defcon 2011 network forensics 解题记录
Defcon 2011 network forensics 解题记录
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
The Lily RowLog library
The Lily RowLog libraryThe Lily RowLog library
The Lily RowLog library
 
New methods hackathon mhc project
New methods   hackathon mhc projectNew methods   hackathon mhc project
New methods hackathon mhc project
 
Big data solution for ngs data analysis
Big data solution for ngs data analysisBig data solution for ngs data analysis
Big data solution for ngs data analysis
 
Sept2016 sv nabsys
Sept2016 sv nabsysSept2016 sv nabsys
Sept2016 sv nabsys
 
AGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: FultonAGBT2017 Reference Workshop: Fulton
AGBT2017 Reference Workshop: Fulton
 
Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...Exploiting long read sequencing technology to build a substantially improved ...
Exploiting long read sequencing technology to build a substantially improved ...
 
AGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: SchneiderAGBT2017 Reference Workshop: Schneider
AGBT2017 Reference Workshop: Schneider
 
AGBT 2016 Workshop Magrini
AGBT 2016 Workshop MagriniAGBT 2016 Workshop Magrini
AGBT 2016 Workshop Magrini
 
Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013Epi tect chi pqpcr_2013
Epi tect chi pqpcr_2013
 
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
The Keygene Variome Analysis Platform and CropPedia: enabling and acceleratin...
 
Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...Population Calling: a powerful tool for novel mutation detection in larger sa...
Population Calling: a powerful tool for novel mutation detection in larger sa...
 
Agbt2015 workshop schneider
Agbt2015 workshop schneiderAgbt2015 workshop schneider
Agbt2015 workshop schneider
 

Semelhante a PoemTapp16

The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceRobert Grossman
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...Baptiste Mayjonade
 
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...Alejandra Gonzalez-Beltran
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009Ian Foster
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartAraport
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128GenomeInABottle
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizeAnn Loraine
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposalGenomeInABottle
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsJuan Antonio Vizcaino
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’ Paolo Missier
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...GenomeInABottle
 
2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngsDin Apellidos
 
Bic bellec 2016_small
Bic bellec 2016_smallBic bellec 2016_small
Bic bellec 2016_smallPierre Bellec
 
SHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenanceSHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenanceGaignard Alban
 
SHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenanceSHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenanceSyed Muhammad Ali Hasnain
 
RAMP Data Challenge
RAMP Data Challenge RAMP Data Challenge
RAMP Data Challenge Proto204
 

Semelhante a PoemTapp16 (20)

The Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data ScienceThe Transformation of Systems Biology Into A Large Data Science
The Transformation of Systems Biology Into A Large Data Science
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
BEST PRACTICE TO MAXIMIZE THROUGHPUT WITH NANOPORE TECHNOLOGY & DE NOVO SEQUE...
 
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
Seminario en CIFASIS, Rosario, Argentina - Seminar in CIFASIS, Rosario, Argen...
 
Thesis biobix
Thesis biobixThesis biobix
Thesis biobix
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009
 
ICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick ProvartICAR 2015 Workshop - Nick Provart
ICAR 2015 Workshop - Nick Provart
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualize
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal
 
Mining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasetsMining the hidden proteome using hundreds of public proteomics datasets
Mining the hidden proteome using hundreds of public proteomics datasets
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’
 
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...Bioinformatics, Data Integration, and Data Representation Working Group Summa...
Bioinformatics, Data Integration, and Data Representation Working Group Summa...
 
ISMB Workshop 2014
ISMB Workshop 2014ISMB Workshop 2014
ISMB Workshop 2014
 
2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs2011 jeroen vanhoudt_ngs
2011 jeroen vanhoudt_ngs
 
Bic bellec 2016_small
Bic bellec 2016_smallBic bellec 2016_small
Bic bellec 2016_small
 
SHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenanceSHARP: harmonizing Galaxy and Taverna worflow provenance
SHARP: harmonizing Galaxy and Taverna worflow provenance
 
SHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenanceSHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP: Harmonizing Galaxy and Taverna workflow provenance
 
BioSB meeting 2015
BioSB meeting 2015BioSB meeting 2015
BioSB meeting 2015
 
RAMP Data Challenge
RAMP Data Challenge RAMP Data Challenge
RAMP Data Challenge
 

Último

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 

Último (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

PoemTapp16

  • 1. From Scientific Workflow Patterns to 5-star Linked Open Data Alban Gaignard Hala Skaf-Molli Audrey Bihouée Nantes Academic Hospital (CHU de Nantes), CNRS, France LINA, Nantes University, CNRS, France Institut du Thorax, Nantes University, INSERM, CNRS, France 8th USENIX workshop on Theory and Practice of Provenance (TaPP’16) Washington DC
  • 2. Needs for linked experiment reports 2
  • 3. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 3
  • 4. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 4 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 5. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 5 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 6. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing (massive) RNA-seq data TopHat: algorithm to align multiple sequence reads to a reference genome (known genes). 6 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb Challenges Algorithmic performance, storage, preservation, reuse (limit recompute) & share. 1 sample 300 samples Input data 2 x 17 Gb 10.2 Tb 1-core CPU 170 hours 5.9 years 32-cores CPU 32 hours 14 months Output data 12 Gb 3.6 Tb 1 sample Input data 2 x 17 Gb 1-core CPU 170 hours 32-cores CPU 32 hours Output data 12 Gb
  • 7. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing experiment results Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions. 7
  • 8. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Motivations: reusing experiment results Scientific experiment: RNA sequencing to quantify gene expression levels under multiple biological conditions. 8 Need for scientific context : metadata
  • 9. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Expected result: human+machine tractable reports 9 Annotated “Material & Methods” Links to some workflow artifacts (algorithms, data)
  • 10. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée 5-star Linked Open Data 10 W3C standards for machine and human readable data on the web. ⭑⭑⭑⭑⭑ : time and expertise !
  • 11. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée 5-star Linked Open Data 11 How to ease this process ? ● Workflow engines → automation ● PROV → workflow runs as linked data W3C standards for machine and human readable data on the web. ⭑⭑⭑⭑⭑ : time and expertise !
  • 12. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PROV only 12
  • 13. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PROV only 13 too fine-grained no domain concepts
  • 14. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Provenance as a Linked Experiment Report 14 few + meaningful statements
  • 15. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Problem statement & objectives 15 Problem statement Scientific workflows produce massive raw results. Their publication into curated query-able linked data repositories requires lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data ?
  • 16. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Problem statement & objectives 16 Problem statement Scientific workflows produce massive raw results. Their publication into curated query-able linked data repositories requires lot of time and expertise. Can we exploit provenance traces to ease the publication of scientific results as Linked Data ? Objectives (1) Leverage annotated workflow patterns to generate provenance mining rules. (2) Refine provenance traces into linked experiment reports.
  • 18. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Approach 18
  • 19. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Input domain-specific annotations (❶,❷) Workflow patterns ❶ 19 Sequence patterns, with possibly intermediate steps ● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar ● EDAM ontology: hasFunction, RNA sequence, Genome map
  • 20. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Input domain-specific annotations (❶,❷) Workflow patterns ❶ 20 Sequence patterns, with possibly intermediate steps ● P-PLAN ontology: Step, Variable, hasInputVar, hasOutputVar ● EDAM ontology: hasFunction, RNA sequence, Genome map Experiment report template ❷ Link scientific claims, statements, material and methods ● MicroPublication ontology: Material, Method, Claims ● Experimental factor ontology: Transcriptome, Gene expression ● NCBI taxonomy: Homo Sapiens ● Open Annotation model: hasBody, hasTarget
  • 21. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PoeM: generating PrOvEnance Mining rules ❸ 21 (SPARQL Property path) (SPARQL Basic graph pattern) (SPARQL Construct query)
  • 22. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée PoeM: sample generated rule ❸ 22 <If> part <Then> part
  • 24. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment context 24 SyMeTRIC: systems medicine project (2015-2017, call “Connect Talent”), funded by the french region Pays de la Loire.
  • 25. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment 25 Material & methods ● Real-life RNA-seq workflow to study 3 mice populations ● WF implemented in Galaxy, run on 2 biological samples ● PROV traces exported from Galaxy Histories (API)
  • 26. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Experiment 26 Material & methods ● Real-life RNA-seq workflow to study 3 mice populations ● WF implemented in Galaxy, run on 2 biological samples ● PROV traces exported from Galaxy Histories (API) Results (for 1 biological sample) ● 60h CPU (12 cores for genome alignment), 21Gb storage ● 3s to export 81 PROV triples from the Galaxy history ● 2s to apply the rule and produce 35 Micropublication triples
  • 27. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Conclusion & perspectives 27 Semi-automated approach (1) PoeM generates semantic web rules (2) PoeM rules applied on PROV traces to assemble linked experiment reports (MicroPublication) Limitations: - Sequence workflow patterns only - SPARQL property paths with complex WF patterns ? - Syntactic matching between WF patterns and PROV labels Usage scenarios: → Query workflow datasets with domain concepts → Populate RDF repositories with WF results
  • 28. TaPP ‘16A. Gaignard - H. Skaf-Molli - A. Bihouée Conclusion & perspectives 28 Future works (1) WF patterns: split-merge, “common motifs” (2) Genericity: other domains / other reports (RO, Nanopub.) (3) PROV heterogeneity: multi-systems PROV (4) Evaluation: involving biologists, at larger scale
  • 29. Questions ? Demo: http://poem.univ-nantes.fr Contact: alban.gaignard@univ-nantes.fr Acknowledgments BiRD bioinformatics facility Connect Talent Call
  • 30. TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 30 Larger scale experiments (PROV traces) 1232 edges
  • 31. TaPP’16A. Gaignard, H. Skaf-Molli, A. Bihouée 31 Larger scale experiments (PROV traces) 49 edges