SHARP: harmonizing cross-workflow provenance

SHARP: Harmonizing cross-workﬂow provenance
SeWeBMeDA’17
Alban Gaignard1
, Khalid Belhajjame2
, Hala Skaf-Molli3
May 28, 2017
1
Nantes Academic Hospital, France
2
LAMSADE Paris-Dauphine University, France
3
LS2N - Nantes University, France

Multi-scale biological observations ? data silos !
!
!
!
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1

!
!
!
Context
• poor interoperability
• data diversity + volume
• multi-site collaborations,
decentralized data

!
!
!
Context
decentralized data
Workﬂows
Modeling complex analysis,
running at large scale

!
!
!
Context
decentralized data
Workﬂows
Modeling complex analysis,
running at large scale
Provenance
Informing on data sources,
production and analysis chains
→ transparency, reproducibility

Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
go to owl:sameAs

Taverna workflow
@research-lab
Galaxy workflow
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
“Which alignment algorithm was used when predicting these
effects ?”

Taverna workflow
@research-lab
Galaxy workflow
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
effects ?”
“A new version of a reference genome is available, which
genome was used when predicting these phenotypes ?”

Taverna workflow
@research-lab
Galaxy workflow
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
effects ?”
“A new version of a reference genome is available, which
genome was used when predicting these phenotypes ?”
Need for an overall tracking of provenance over both Galaxy
and Taverna workflows !

PROV-O ontology
https://www.w3.org/TR/prov-o

Heterogeneity in provenance capture !
Galaxy PROV predicates counts
prov:wasDerivedFrom 118
rdf:type 76
rdfs:label 62
prov:used 61
prov:wasAttributedTo 34
prov:wasGeneratedBy 33
prov:endedAtTime 26
prov:startedAtTime 26
prov:wasAssociatedWith 26
prov:generatedAtTime 1
https://www.w3.org/TR/prov-o
Taverna PROV predicates counts
rdf:type 54
rdfs:label 13
prov:atTime 8
wfprov:describedByParameter 6
rdfs:comment 6
prov:hadRole 6
prov:activity 5
dcterms:hasPart 4
prov:agent 4
prov:endedAtTime 4
prov:hadPlan 4
prov:qualifiedAssociation 4
prov:qualifiedEnd 4
prov:qualifiedStart 4
prov:startedAtTime 4
prov:wasAssociatedWith 4
tavernaprov:content 3
wfprov:usedInput 3
wfprov:wasEnactedBy 3
wfprov:wasOutputFrom 3

Problem statement
• Heterogeneity in PROV predicates: how to reconcile
different PROV traces ?
• Needs for human-oriented interpretation: tractable size,
domain concepts.

Approach
1. Applying graph saturation and PROV inferences to
overcome vocabulary heterogeneity.
2. Summarizing harmonized provenance graphs as
life-science experiment reports.

SHARP: Harmonizing multiple PROV graphs
owl:sameAs
inferred
PROV
PROV
trace
PROV
trace
nanopub
PROV interlinking PROV harmonization PROV summarization
11 12 13
…
14

Tackling multiple provenance
heterogeneity

– Multi-provenance linking: owl:sameAs
Idea
Stating that two PROV entities associated to the same ﬁle
content are the same. go to example

– Multi-provenance linking: owl:sameAs
Idea
Stating that two PROV entities associated to the same file
content are the same. go to example
Producing owl:sameAs
1. SHA-512 fingerprint of files
2. annotating PROV entities with the SHA-512 digest
3. producing owl:sameAs → SPARQL CONSTRUCT-WHERE query

— Reasoning: PROV inference regime

— Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).

Idea
Example
wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a).

Idea
Example
Translation in SPARQL
CONSTRUCT {
?e_2 prov:wasGeneratedBy _:blank_node .
_:blank_node prov:used ?e_1
} WHERE { ?e_2 prov:wasDerivedFrom ?e_1 }

Idea
Example
Translation in SPARQL
CONSTRUCT {
?e_2 prov:wasGeneratedBy _:blank_node .
_:blank_node prov:used ?e_1
} WHERE { ?e_2 prov:wasDerivedFrom ?e_1 }
Issues
• may produce many non-informative blank nodes for
existential variables.
• may increase the size of the PROV graph.

˜ Equality-Generating Dependencies (EGDs)
Idea
• Two blank nodes have the same neighborhood → keep
only one BN
• One blank node and a resource have the same
neighborhood → keep only the resource

Idea
• Two blank nodes have the same neighborhood → keep
only one BN
• One blank node and a resource have the same
neighborhood → keep only the resource
Example
prov:qualiﬁedGeneration
e a
prov:activity
:_gen1
:_gen2
prov:activity
prov:used a2
e a
prov:activity
:_gen1
prov:used a2

Input : G : the provenance graph resulting from the application of TGD on G
Output: G : the provenance graph with substituted blank nodes, when possible.
1 begin
2 G ← G
3 substitutions ← new List < Pair < Node, Node >> ()
4 repeat
5 S ← ﬁndSubstitutions(G )
6 foreach (s ∈ S) do
7 source ← s[0]
8 target ← s[1]
9 foreach (in ∈ G .listStatements(∗, ∗, source)) do
10 G ← G .add(in.getSubject(), in.getPredicate(), target)
11 G ← G .del(in)
12 foreach (out ∈ G .listStatements(source, ∗, ∗)) do
13 G ← G .add(target, out.getPredicate(), out.getObject())
14 G ← G .del(out)
15 until (S.size() = 0)

Answering domain-speciﬁc
questions

™ From harmonized PROV to Nanopublications
Objectives
Linking data to scientiﬁc context and publishing them as
cite-able and exchangeable experiment reports.
→ Semantic science Integrated Ontology + NanoPublications

Objectives
Linking data to scientiﬁc context and publishing them as
cite-able and exchangeable experiment reports.
→ Semantic science Integrated Ontology + NanoPublications
reference
genome
sample_001
predicted
phenotypes
from exons
sio:has-phenotype
sio:is-variant-of
sio:is-supported-by
sio:is-supported-by
sio:is-supported-by
sio:is-supported-by
scientiﬁc
question :head {
ex:pub1 a np:Nanopublication .
ex:pub1 np:hasAssertion :assertion1 ;
np:hasAssertion :assertion2 .
ex:pub1 np:hasProvenance :provenance .
ex:pub1 np:hasPublicationInfo :pubInfo . }
:assertion1 {
ex:question a sio:Question ;
sio:has-value "What are the effects of SNPs
located in exons for study-Y samples" ;
sio:is-supported-by ex:referenceGenome ;
sio:is-supported-by ex:sample_001 ;
sio:is-supported-by ex:annotatedVariants . }
:assertion2 {
ex:referenceGenome a sio:Genome .
ex:sample_001 a sio:Sample ;
sio:is-variant-of ex:referenceGenome ;
sio:has-phenotype ex:annotatedVariants . }

How ?
By identifying a data lineage path in multiple PROV graphs,
beforehand harmonized (inferred prov:wasInﬂuencedBy).

How ?
By identifying a data lineage path in multiple PROV graphs,
beforehand harmonized (inferred prov:wasInﬂuencedBy).
CONSTRUCT {
GRAPH :assertion {
?ref_genome a sio:Genome .
?sample a sio:Sample ;
sio:is-variant-of ?ref_genome ;
sio:has-phenotype ?out .
?out rdfs:label ?out_label .
?out sio:is-supported-by ?ref_genome . }
} WHERE {
?sample rdfs:label ?sample_label.
FILTER (contains(lcase(str(?sample_label)), lcase("fastq"))) .
?ref_genome rdfs:label ?ref_genome_label.
FILTER (contains(lcase(str(?ref_genome_label)), lcase("GRCh"))) .
?out ( prov:wasInfluencedBy )+ ?sample
?out tavernaprov:content ?out_label .
FILTER (contains(lcase(str(?out_label)), lcase("exons"))) .
}

Time and space for multiple PROV harmonization ?
Material & methods
• 3 ProvStore graphs
• 10, 109 and 666 edges
• Java + Jena rule engine
• Laptop, 4 cores, 16GB, 5
runs

Time and space for multiple PROV harmonization ?
Material & methods
• 3 ProvStore graphs
• 10, 109 and 666 edges
• Java + Jena rule engine
• Laptop, 4 cores, 16GB, 5
runs
Results
• almost 5s for the whole harmonization process
• no new wasDerivedFrom relation (must be captured)
• inferred wasInﬂuencedBy → “common denominator” for
data lineage
∆ size ∆ BNodes ∆ wDerivedFrom ∆ wInﬂuencedBy mean time
PA + 2776 + 2 + 1 + 7 4835 ± 343 ms
PB + 3102 + 4 + 1 + 58 4759 ± 71 ms
PC + 5023 + 63 + 1 + 231 5304 ± 176 ms

Can SHARP produce human-oriented experiment reports ?
Target question
“Which reference genome was used when annotating the
genetic variants ?”

Target question
“Which reference genome was used when annotating the
genetic variants ?”
Material & methods
• Galaxy PROV trace
• Taverna PROV trace
• Manually provided
owl:sameAs
• Harmonization process
• Summarization query

Harmonization results
Galaxy PROV Taverna PROV Harmonized PROV
predicates counts predicates counts predicates counts
prov:wasDer.From 118 rdf:type 54 owl:differentFrom 3617
rdf:type 76 rdfs:label 13 rdf:type 958
rdfs:label 62 prov:atTime 8 prov:wasInﬂu.By 515
prov:used 61 wfprov:descByParam. 6 prov:inﬂuenced 291
prov:wasAttr.To 34 rdfs:comment 6 rdfs:seeAlso 268
prov:wasGen.By 33 prov:hadRole 6 rdfs:subClassOf 223
prov:endedAtTime 26 prov:activity 5 owl:disjointWith 218
prov:startedAtTime 26 purl:hasPart 4 rdfs:range 208
prov:wasAsso.With 26 prov:agent 4 rdfs:domain 199
prov:gen.AtTime 1 prov:endedAtTime 4 prov:wasGen.By 172
all 463 all 177 all 8654

Nanopublication as summarization result
“Which reference genome was used when predicting the phenotypes ?”

Conclusion & future works
Achievements
• Reasoning to address PROV capture heterogeneity
• Reconciliation of multi-site workﬂow provenance
• Human-oriented nanopublications: domain-speciﬁc
vocabulary, tractable size
• Open source implementation for all PROV inference rules

Conclusion & future works
Achievements
• Reasoning to address PROV capture heterogeneity
• Reconciliation of multi-site workﬂow provenance
• Human-oriented nanopublications: domain-speciﬁc
vocabulary, tractable size
• Open source implementation for all PROV inference rules
Perspectives
• size of inferred graphs: OWL/PROV entailment subsets ?
• not yet automated summarization query → work in progress
• towards decentralized provenance harmonization
(federated querying)
• “user study” to assess better interpretation, sharing, trust

Questions ?
alban.gaignard@univ-nantes.fr
Acknowledgments

SHARP: harmonizing cross-workflow provenance

Recomendados

Recomendados

Mais conteúdo relacionado

Semelhante a SHARP: harmonizing cross-workflow provenance

Semelhante a SHARP: harmonizing cross-workflow provenance (6)

Último

Último (20)

SHARP: harmonizing cross-workflow provenance