SlideShare uma empresa Scribd logo
1 de 41
Baixar para ler offline
SHARP: Harmonizing cross-workflow provenance
SeWeBMeDA’17
Alban Gaignard1
, Khalid Belhajjame2
, Hala Skaf-Molli3
May 28, 2017
1
Nantes Academic Hospital, France
2
LAMSADE Paris-Dauphine University, France
3
LS2N - Nantes University, France
Multi-scale biological observations ? data silos !
!
!
!
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
Multi-scale biological observations ? data silos !
!
!
!
Context
• poor interoperability
• data diversity + volume
• multi-site collaborations,
decentralized data
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
Multi-scale biological observations ? data silos !
!
!
!
Context
• poor interoperability
• data diversity + volume
• multi-site collaborations,
decentralized data
Workflows
Modeling complex analysis,
running at large scale
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
Multi-scale biological observations ? data silos !
!
!
!
Context
• poor interoperability
• data diversity + volume
• multi-site collaborations,
decentralized data
Workflows
Modeling complex analysis,
running at large scale
Provenance
Informing on data sources,
production and analysis chains
→ transparency, reproducibility
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
go to owl:sameAs
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 2
Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
“Which alignment algorithm was used when predicting these
effects ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
“Which alignment algorithm was used when predicting these
effects ?”
“A new version of a reference genome is available, which
genome was used when predicting these phenotypes ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
“Which alignment algorithm was used when predicting these
effects ?”
“A new version of a reference genome is available, which
genome was used when predicting these phenotypes ?”
Need for an overall tracking of provenance over both Galaxy
and Taverna workflows !
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
PROV-O ontology
https://www.w3.org/TR/prov-o
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 4
Heterogeneity in provenance capture !
Galaxy PROV predicates counts
prov:wasDerivedFrom 118
rdf:type 76
rdfs:label 62
prov:used 61
prov:wasAttributedTo 34
prov:wasGeneratedBy 33
prov:endedAtTime 26
prov:startedAtTime 26
prov:wasAssociatedWith 26
prov:generatedAtTime 1
https://www.w3.org/TR/prov-o
Taverna PROV predicates counts
rdf:type 54
rdfs:label 13
prov:atTime 8
wfprov:describedByParameter 6
rdfs:comment 6
prov:hadRole 6
prov:activity 5
dcterms:hasPart 4
prov:agent 4
prov:endedAtTime 4
prov:hadPlan 4
prov:qualifiedAssociation 4
prov:qualifiedEnd 4
prov:qualifiedStart 4
prov:startedAtTime 4
prov:wasAssociatedWith 4
tavernaprov:content 3
wfprov:usedInput 3
wfprov:wasEnactedBy 3
wfprov:wasOutputFrom 3
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 5
Problem statement
• Heterogeneity in PROV predicates: how to reconcile
different PROV traces ?
• Needs for human-oriented interpretation: tractable size,
domain concepts.
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 6
Approach
1. Applying graph saturation and PROV inferences to
overcome vocabulary heterogeneity.
2. Summarizing harmonized provenance graphs as
life-science experiment reports.
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 7
SHARP: Harmonizing multiple PROV graphs
owl:sameAs
inferred
PROV
PROV
trace
PROV
trace
nanopub
PROV interlinking PROV harmonization PROV summarization
11 12 13
…
14
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 8
Tackling multiple provenance
heterogeneity
– Multi-provenance linking: owl:sameAs
Idea
Stating that two PROV entities associated to the same file
content are the same. go to example
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 9
– Multi-provenance linking: owl:sameAs
Idea
Stating that two PROV entities associated to the same file
content are the same. go to example
Producing owl:sameAs
1. SHA-512 fingerprint of files
2. annotating PROV entities with the SHA-512 digest
3. producing owl:sameAs → SPARQL CONSTRUCT-WHERE query
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 9
— Reasoning: PROV inference regime
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 10
— Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
— Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
Example
wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a).
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
— Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
Example
wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a).
Translation in SPARQL
CONSTRUCT {
?e_2 prov:wasGeneratedBy _:blank_node .
_:blank_node prov:used ?e_1
} WHERE { ?e_2 prov:wasDerivedFrom ?e_1 }
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
— Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
Example
wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a).
Translation in SPARQL
CONSTRUCT {
?e_2 prov:wasGeneratedBy _:blank_node .
_:blank_node prov:used ?e_1
} WHERE { ?e_2 prov:wasDerivedFrom ?e_1 }
Issues
• may produce many non-informative blank nodes for
existential variables.
• may increase the size of the PROV graph.
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
˜ Equality-Generating Dependencies (EGDs)
Idea
• Two blank nodes have the same neighborhood → keep
only one BN
• One blank node and a resource have the same
neighborhood → keep only the resource
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 12
˜ Equality-Generating Dependencies (EGDs)
Idea
• Two blank nodes have the same neighborhood → keep
only one BN
• One blank node and a resource have the same
neighborhood → keep only the resource
Example
prov:qualifiedGeneration
e a
prov:activity
:_gen1
:_gen2
prov:qualifiedGeneration
prov:activity
prov:used a2
prov:qualifiedGeneration
e a
prov:activity
:_gen1
prov:used a2
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 12
˜ Equality-Generating Dependencies (EGDs)
Input : G : the provenance graph resulting from the application of TGD on G
Output: G : the provenance graph with substituted blank nodes, when possible.
1 begin
2 G ← G
3 substitutions ← new List < Pair < Node, Node >> ()
4 repeat
5 S ← findSubstitutions(G )
6 foreach (s ∈ S) do
7 source ← s[0]
8 target ← s[1]
9 foreach (in ∈ G .listStatements(∗, ∗, source)) do
10 G ← G .add(in.getSubject(), in.getPredicate(), target)
11 G ← G .del(in)
12 foreach (out ∈ G .listStatements(source, ∗, ∗)) do
13 G ← G .add(target, out.getPredicate(), out.getObject())
14 G ← G .del(out)
15 until (S.size() = 0)
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 13
Answering domain-specific
questions
™ From harmonized PROV to Nanopublications
Objectives
Linking data to scientific context and publishing them as
cite-able and exchangeable experiment reports.
→ Semantic science Integrated Ontology + NanoPublications
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 14
™ From harmonized PROV to Nanopublications
Objectives
Linking data to scientific context and publishing them as
cite-able and exchangeable experiment reports.
→ Semantic science Integrated Ontology + NanoPublications
reference
genome
sample_001
predicted
phenotypes
from exons
sio:has-phenotype
sio:is-variant-of
sio:is-supported-by
sio:is-supported-by
sio:is-supported-by
sio:is-supported-by
scientific
question :head {
ex:pub1 a np:Nanopublication .
ex:pub1 np:hasAssertion :assertion1 ;
np:hasAssertion :assertion2 .
ex:pub1 np:hasProvenance :provenance .
ex:pub1 np:hasPublicationInfo :pubInfo . }
:assertion1 {
ex:question a sio:Question ;
sio:has-value "What are the effects of SNPs
located in exons for study-Y samples" ;
sio:is-supported-by ex:referenceGenome ;
sio:is-supported-by ex:sample_001 ;
sio:is-supported-by ex:annotatedVariants . }
:assertion2 {
ex:referenceGenome a sio:Genome .
ex:sample_001 a sio:Sample ;
sio:is-variant-of ex:referenceGenome ;
sio:has-phenotype ex:annotatedVariants . }
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 14
™ From harmonized PROV to Nanopublications
How ?
By identifying a data lineage path in multiple PROV graphs,
beforehand harmonized (inferred prov:wasInfluencedBy).
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 15
™ From harmonized PROV to Nanopublications
How ?
By identifying a data lineage path in multiple PROV graphs,
beforehand harmonized (inferred prov:wasInfluencedBy).
CONSTRUCT {
GRAPH :assertion {
?ref_genome a sio:Genome .
?sample a sio:Sample ;
sio:is-variant-of ?ref_genome ;
sio:has-phenotype ?out .
?out rdfs:label ?out_label .
?out sio:is-supported-by ?ref_genome . }
} WHERE {
?sample rdfs:label ?sample_label.
FILTER (contains(lcase(str(?sample_label)), lcase("fastq"))) .
?ref_genome rdfs:label ?ref_genome_label.
FILTER (contains(lcase(str(?ref_genome_label)), lcase("GRCh"))) .
?out ( prov:wasInfluencedBy )+ ?sample
?out tavernaprov:content ?out_label .
FILTER (contains(lcase(str(?out_label)), lcase("exons"))) .
}
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 15
Experimental results
Time and space for multiple PROV harmonization ?
Material & methods
• 3 ProvStore graphs
• 10, 109 and 666 edges
• Java + Jena rule engine
• Laptop, 4 cores, 16GB, 5
runs
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 16
Time and space for multiple PROV harmonization ?
Material & methods
• 3 ProvStore graphs
• 10, 109 and 666 edges
• Java + Jena rule engine
• Laptop, 4 cores, 16GB, 5
runs
Results
• almost 5s for the whole harmonization process
• no new wasDerivedFrom relation (must be captured)
• inferred wasInfluencedBy → “common denominator” for
data lineage
∆ size ∆ BNodes ∆ wDerivedFrom ∆ wInfluencedBy mean time
PA + 2776 + 2 + 1 + 7 4835 ± 343 ms
PB + 3102 + 4 + 1 + 58 4759 ± 71 ms
PC + 5023 + 63 + 1 + 231 5304 ± 176 ms
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 16
Can SHARP produce human-oriented experiment reports ?
Target question
“Which reference genome was used when annotating the
genetic variants ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 17
Can SHARP produce human-oriented experiment reports ?
Target question
“Which reference genome was used when annotating the
genetic variants ?”
Material & methods
• Galaxy PROV trace
• Taverna PROV trace
• Manually provided
owl:sameAs
• Harmonization process
• Summarization query
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 17
Can SHARP produce human-oriented experiment reports ?
Harmonization results
Galaxy PROV Taverna PROV Harmonized PROV
predicates counts predicates counts predicates counts
prov:wasDer.From 118 rdf:type 54 owl:differentFrom 3617
rdf:type 76 rdfs:label 13 rdf:type 958
rdfs:label 62 prov:atTime 8 prov:wasInflu.By 515
prov:used 61 wfprov:descByParam. 6 prov:influenced 291
prov:wasAttr.To 34 rdfs:comment 6 rdfs:seeAlso 268
prov:wasGen.By 33 prov:hadRole 6 rdfs:subClassOf 223
prov:endedAtTime 26 prov:activity 5 owl:disjointWith 218
prov:startedAtTime 26 purl:hasPart 4 rdfs:range 208
prov:wasAsso.With 26 prov:agent 4 rdfs:domain 199
prov:gen.AtTime 1 prov:endedAtTime 4 prov:wasGen.By 172
all 463 all 177 all 8654
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 18
Can SHARP produce human-oriented experiment reports ?
Nanopublication as summarization result
“Which reference genome was used when predicting the phenotypes ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 19
Conclusion & future works
Conclusion & future works
Achievements
• Reasoning to address PROV capture heterogeneity
• Reconciliation of multi-site workflow provenance
• Human-oriented nanopublications: domain-specific
vocabulary, tractable size
• Open source implementation for all PROV inference rules
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 20
Conclusion & future works
Achievements
• Reasoning to address PROV capture heterogeneity
• Reconciliation of multi-site workflow provenance
• Human-oriented nanopublications: domain-specific
vocabulary, tractable size
• Open source implementation for all PROV inference rules
Perspectives
• size of inferred graphs: OWL/PROV entailment subsets ?
• not yet automated summarization query → work in progress
• towards decentralized provenance harmonization
(federated querying)
• “user study” to assess better interpretation, sharing, trust
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 20
Questions ?
alban.gaignard@univ-nantes.fr
Acknowledgments

Mais conteúdo relacionado

Semelhante a SHARP: harmonizing cross-workflow provenance

THoSP: an Algorithm for Nesting Property Graphs
THoSP: an Algorithm for Nesting Property GraphsTHoSP: an Algorithm for Nesting Property Graphs
THoSP: an Algorithm for Nesting Property GraphsGiacomo Bergami
 
SSB 2018 Comparative Phylogeography Workshop
SSB 2018 Comparative Phylogeography WorkshopSSB 2018 Comparative Phylogeography Workshop
SSB 2018 Comparative Phylogeography WorkshopJamie Oaks
 
Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadofnothaft
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
Peter krusche population based targeted validation of structural variant brea...
Peter krusche population based targeted validation of structural variant brea...Peter krusche population based targeted validation of structural variant brea...
Peter krusche population based targeted validation of structural variant brea...GenomeInABottle
 
Soft Shake Event / A soft introduction to Neo4J
Soft Shake Event / A soft introduction to Neo4JSoft Shake Event / A soft introduction to Neo4J
Soft Shake Event / A soft introduction to Neo4JFlorent Biville
 

Semelhante a SHARP: harmonizing cross-workflow provenance (6)

THoSP: an Algorithm for Nesting Property Graphs
THoSP: an Algorithm for Nesting Property GraphsTHoSP: an Algorithm for Nesting Property Graphs
THoSP: an Algorithm for Nesting Property Graphs
 
SSB 2018 Comparative Phylogeography Workshop
SSB 2018 Comparative Phylogeography WorkshopSSB 2018 Comparative Phylogeography Workshop
SSB 2018 Comparative Phylogeography Workshop
 
Fast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocadoFast Variant Calling with ADAM and avocado
Fast Variant Calling with ADAM and avocado
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Peter krusche population based targeted validation of structural variant brea...
Peter krusche population based targeted validation of structural variant brea...Peter krusche population based targeted validation of structural variant brea...
Peter krusche population based targeted validation of structural variant brea...
 
Soft Shake Event / A soft introduction to Neo4J
Soft Shake Event / A soft introduction to Neo4JSoft Shake Event / A soft introduction to Neo4J
Soft Shake Event / A soft introduction to Neo4J
 

Último

DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 

Último (20)

DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 

SHARP: harmonizing cross-workflow provenance

  • 1. SHARP: Harmonizing cross-workflow provenance SeWeBMeDA’17 Alban Gaignard1 , Khalid Belhajjame2 , Hala Skaf-Molli3 May 28, 2017 1 Nantes Academic Hospital, France 2 LAMSADE Paris-Dauphine University, France 3 LS2N - Nantes University, France
  • 2. Multi-scale biological observations ? data silos ! ! ! ! A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
  • 3. Multi-scale biological observations ? data silos ! ! ! ! Context • poor interoperability • data diversity + volume • multi-site collaborations, decentralized data A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
  • 4. Multi-scale biological observations ? data silos ! ! ! ! Context • poor interoperability • data diversity + volume • multi-site collaborations, decentralized data Workflows Modeling complex analysis, running at large scale A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
  • 5. Multi-scale biological observations ? data silos ! ! ! ! Context • poor interoperability • data diversity + volume • multi-site collaborations, decentralized data Workflows Modeling complex analysis, running at large scale Provenance Informing on data sources, production and analysis chains → transparency, reproducibility A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
  • 6. Multiple sites/scientists → multiple workflow engines ! Taverna workflow @research-lab Galaxy workflow @sequencing-facility Variant effect prediction VCF file Exon filtering output Merge Alignment sample 1.a.R1 sample 1.a.R2 Alignment sample 1.b.R1 sample 1.b.R2 Alignment sample 2.R1 sample 2.R2 Sort Sort Variant calling GRCh37 go to owl:sameAs A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 2
  • 7. Multiple sites/scientists → multiple workflow engines ! Taverna workflow @research-lab Galaxy workflow @sequencing-facility Variant effect prediction VCF file Exon filtering output Merge Alignment sample 1.a.R1 sample 1.a.R2 Alignment sample 1.b.R1 sample 1.b.R2 Alignment sample 2.R1 sample 2.R2 Sort Sort Variant calling GRCh37 “Which alignment algorithm was used when predicting these effects ?” A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
  • 8. Multiple sites/scientists → multiple workflow engines ! Taverna workflow @research-lab Galaxy workflow @sequencing-facility Variant effect prediction VCF file Exon filtering output Merge Alignment sample 1.a.R1 sample 1.a.R2 Alignment sample 1.b.R1 sample 1.b.R2 Alignment sample 2.R1 sample 2.R2 Sort Sort Variant calling GRCh37 “Which alignment algorithm was used when predicting these effects ?” “A new version of a reference genome is available, which genome was used when predicting these phenotypes ?” A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
  • 9. Multiple sites/scientists → multiple workflow engines ! Taverna workflow @research-lab Galaxy workflow @sequencing-facility Variant effect prediction VCF file Exon filtering output Merge Alignment sample 1.a.R1 sample 1.a.R2 Alignment sample 1.b.R1 sample 1.b.R2 Alignment sample 2.R1 sample 2.R2 Sort Sort Variant calling GRCh37 “Which alignment algorithm was used when predicting these effects ?” “A new version of a reference genome is available, which genome was used when predicting these phenotypes ?” Need for an overall tracking of provenance over both Galaxy and Taverna workflows ! A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
  • 10. PROV-O ontology https://www.w3.org/TR/prov-o A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 4
  • 11. Heterogeneity in provenance capture ! Galaxy PROV predicates counts prov:wasDerivedFrom 118 rdf:type 76 rdfs:label 62 prov:used 61 prov:wasAttributedTo 34 prov:wasGeneratedBy 33 prov:endedAtTime 26 prov:startedAtTime 26 prov:wasAssociatedWith 26 prov:generatedAtTime 1 https://www.w3.org/TR/prov-o Taverna PROV predicates counts rdf:type 54 rdfs:label 13 prov:atTime 8 wfprov:describedByParameter 6 rdfs:comment 6 prov:hadRole 6 prov:activity 5 dcterms:hasPart 4 prov:agent 4 prov:endedAtTime 4 prov:hadPlan 4 prov:qualifiedAssociation 4 prov:qualifiedEnd 4 prov:qualifiedStart 4 prov:startedAtTime 4 prov:wasAssociatedWith 4 tavernaprov:content 3 wfprov:usedInput 3 wfprov:wasEnactedBy 3 wfprov:wasOutputFrom 3 A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 5
  • 12. Problem statement • Heterogeneity in PROV predicates: how to reconcile different PROV traces ? • Needs for human-oriented interpretation: tractable size, domain concepts. A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 6
  • 13. Approach 1. Applying graph saturation and PROV inferences to overcome vocabulary heterogeneity. 2. Summarizing harmonized provenance graphs as life-science experiment reports. A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 7
  • 14. SHARP: Harmonizing multiple PROV graphs owl:sameAs inferred PROV PROV trace PROV trace nanopub PROV interlinking PROV harmonization PROV summarization 11 12 13 … 14 A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 8
  • 16. – Multi-provenance linking: owl:sameAs Idea Stating that two PROV entities associated to the same file content are the same. go to example A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 9
  • 17. – Multi-provenance linking: owl:sameAs Idea Stating that two PROV entities associated to the same file content are the same. go to example Producing owl:sameAs 1. SHA-512 fingerprint of files 2. annotating PROV entities with the SHA-512 digest 3. producing owl:sameAs → SPARQL CONSTRUCT-WHERE query A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 9
  • 18. — Reasoning: PROV inference regime A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 10
  • 19. — Tuple-Generating Dependencies (TGDs) Idea Repeatedly inferring new facts based on existing ones until saturation (production rules / forward chaining). A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
  • 20. — Tuple-Generating Dependencies (TGDs) Idea Repeatedly inferring new facts based on existing ones until saturation (production rules / forward chaining). Example wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a). A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
  • 21. — Tuple-Generating Dependencies (TGDs) Idea Repeatedly inferring new facts based on existing ones until saturation (production rules / forward chaining). Example wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a). Translation in SPARQL CONSTRUCT { ?e_2 prov:wasGeneratedBy _:blank_node . _:blank_node prov:used ?e_1 } WHERE { ?e_2 prov:wasDerivedFrom ?e_1 } A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
  • 22. — Tuple-Generating Dependencies (TGDs) Idea Repeatedly inferring new facts based on existing ones until saturation (production rules / forward chaining). Example wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a). Translation in SPARQL CONSTRUCT { ?e_2 prov:wasGeneratedBy _:blank_node . _:blank_node prov:used ?e_1 } WHERE { ?e_2 prov:wasDerivedFrom ?e_1 } Issues • may produce many non-informative blank nodes for existential variables. • may increase the size of the PROV graph. A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
  • 23. ˜ Equality-Generating Dependencies (EGDs) Idea • Two blank nodes have the same neighborhood → keep only one BN • One blank node and a resource have the same neighborhood → keep only the resource A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 12
  • 24. ˜ Equality-Generating Dependencies (EGDs) Idea • Two blank nodes have the same neighborhood → keep only one BN • One blank node and a resource have the same neighborhood → keep only the resource Example prov:qualifiedGeneration e a prov:activity :_gen1 :_gen2 prov:qualifiedGeneration prov:activity prov:used a2 prov:qualifiedGeneration e a prov:activity :_gen1 prov:used a2 A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 12
  • 25. ˜ Equality-Generating Dependencies (EGDs) Input : G : the provenance graph resulting from the application of TGD on G Output: G : the provenance graph with substituted blank nodes, when possible. 1 begin 2 G ← G 3 substitutions ← new List < Pair < Node, Node >> () 4 repeat 5 S ← findSubstitutions(G ) 6 foreach (s ∈ S) do 7 source ← s[0] 8 target ← s[1] 9 foreach (in ∈ G .listStatements(∗, ∗, source)) do 10 G ← G .add(in.getSubject(), in.getPredicate(), target) 11 G ← G .del(in) 12 foreach (out ∈ G .listStatements(source, ∗, ∗)) do 13 G ← G .add(target, out.getPredicate(), out.getObject()) 14 G ← G .del(out) 15 until (S.size() = 0) A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 13
  • 27. ™ From harmonized PROV to Nanopublications Objectives Linking data to scientific context and publishing them as cite-able and exchangeable experiment reports. → Semantic science Integrated Ontology + NanoPublications A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 14
  • 28. ™ From harmonized PROV to Nanopublications Objectives Linking data to scientific context and publishing them as cite-able and exchangeable experiment reports. → Semantic science Integrated Ontology + NanoPublications reference genome sample_001 predicted phenotypes from exons sio:has-phenotype sio:is-variant-of sio:is-supported-by sio:is-supported-by sio:is-supported-by sio:is-supported-by scientific question :head { ex:pub1 a np:Nanopublication . ex:pub1 np:hasAssertion :assertion1 ; np:hasAssertion :assertion2 . ex:pub1 np:hasProvenance :provenance . ex:pub1 np:hasPublicationInfo :pubInfo . } :assertion1 { ex:question a sio:Question ; sio:has-value "What are the effects of SNPs located in exons for study-Y samples" ; sio:is-supported-by ex:referenceGenome ; sio:is-supported-by ex:sample_001 ; sio:is-supported-by ex:annotatedVariants . } :assertion2 { ex:referenceGenome a sio:Genome . ex:sample_001 a sio:Sample ; sio:is-variant-of ex:referenceGenome ; sio:has-phenotype ex:annotatedVariants . } A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 14
  • 29. ™ From harmonized PROV to Nanopublications How ? By identifying a data lineage path in multiple PROV graphs, beforehand harmonized (inferred prov:wasInfluencedBy). A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 15
  • 30. ™ From harmonized PROV to Nanopublications How ? By identifying a data lineage path in multiple PROV graphs, beforehand harmonized (inferred prov:wasInfluencedBy). CONSTRUCT { GRAPH :assertion { ?ref_genome a sio:Genome . ?sample a sio:Sample ; sio:is-variant-of ?ref_genome ; sio:has-phenotype ?out . ?out rdfs:label ?out_label . ?out sio:is-supported-by ?ref_genome . } } WHERE { ?sample rdfs:label ?sample_label. FILTER (contains(lcase(str(?sample_label)), lcase("fastq"))) . ?ref_genome rdfs:label ?ref_genome_label. FILTER (contains(lcase(str(?ref_genome_label)), lcase("GRCh"))) . ?out ( prov:wasInfluencedBy )+ ?sample ?out tavernaprov:content ?out_label . FILTER (contains(lcase(str(?out_label)), lcase("exons"))) . } A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 15
  • 32. Time and space for multiple PROV harmonization ? Material & methods • 3 ProvStore graphs • 10, 109 and 666 edges • Java + Jena rule engine • Laptop, 4 cores, 16GB, 5 runs A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 16
  • 33. Time and space for multiple PROV harmonization ? Material & methods • 3 ProvStore graphs • 10, 109 and 666 edges • Java + Jena rule engine • Laptop, 4 cores, 16GB, 5 runs Results • almost 5s for the whole harmonization process • no new wasDerivedFrom relation (must be captured) • inferred wasInfluencedBy → “common denominator” for data lineage ∆ size ∆ BNodes ∆ wDerivedFrom ∆ wInfluencedBy mean time PA + 2776 + 2 + 1 + 7 4835 ± 343 ms PB + 3102 + 4 + 1 + 58 4759 ± 71 ms PC + 5023 + 63 + 1 + 231 5304 ± 176 ms A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 16
  • 34. Can SHARP produce human-oriented experiment reports ? Target question “Which reference genome was used when annotating the genetic variants ?” A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 17
  • 35. Can SHARP produce human-oriented experiment reports ? Target question “Which reference genome was used when annotating the genetic variants ?” Material & methods • Galaxy PROV trace • Taverna PROV trace • Manually provided owl:sameAs • Harmonization process • Summarization query A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 17
  • 36. Can SHARP produce human-oriented experiment reports ? Harmonization results Galaxy PROV Taverna PROV Harmonized PROV predicates counts predicates counts predicates counts prov:wasDer.From 118 rdf:type 54 owl:differentFrom 3617 rdf:type 76 rdfs:label 13 rdf:type 958 rdfs:label 62 prov:atTime 8 prov:wasInflu.By 515 prov:used 61 wfprov:descByParam. 6 prov:influenced 291 prov:wasAttr.To 34 rdfs:comment 6 rdfs:seeAlso 268 prov:wasGen.By 33 prov:hadRole 6 rdfs:subClassOf 223 prov:endedAtTime 26 prov:activity 5 owl:disjointWith 218 prov:startedAtTime 26 purl:hasPart 4 rdfs:range 208 prov:wasAsso.With 26 prov:agent 4 rdfs:domain 199 prov:gen.AtTime 1 prov:endedAtTime 4 prov:wasGen.By 172 all 463 all 177 all 8654 A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 18
  • 37. Can SHARP produce human-oriented experiment reports ? Nanopublication as summarization result “Which reference genome was used when predicting the phenotypes ?” A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 19
  • 39. Conclusion & future works Achievements • Reasoning to address PROV capture heterogeneity • Reconciliation of multi-site workflow provenance • Human-oriented nanopublications: domain-specific vocabulary, tractable size • Open source implementation for all PROV inference rules A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 20
  • 40. Conclusion & future works Achievements • Reasoning to address PROV capture heterogeneity • Reconciliation of multi-site workflow provenance • Human-oriented nanopublications: domain-specific vocabulary, tractable size • Open source implementation for all PROV inference rules Perspectives • size of inferred graphs: OWL/PROV entailment subsets ? • not yet automated summarization query → work in progress • towards decentralized provenance harmonization (federated querying) • “user study” to assess better interpretation, sharing, trust A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 20