3. Multi-scale biological observations ? data silos !
!
!
!
Context
• poor interoperability
• data diversity + volume
• multi-site collaborations,
decentralized data
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
4. Multi-scale biological observations ? data silos !
!
!
!
Context
• poor interoperability
• data diversity + volume
• multi-site collaborations,
decentralized data
Workflows
Modeling complex analysis,
running at large scale
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
5. Multi-scale biological observations ? data silos !
!
!
!
Context
• poor interoperability
• data diversity + volume
• multi-site collaborations,
decentralized data
Workflows
Modeling complex analysis,
running at large scale
Provenance
Informing on data sources,
production and analysis chains
→ transparency, reproducibility
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 1
7. Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
“Which alignment algorithm was used when predicting these
effects ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
8. Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
“Which alignment algorithm was used when predicting these
effects ?”
“A new version of a reference genome is available, which
genome was used when predicting these phenotypes ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
9. Multiple sites/scientists → multiple workflow engines !
Taverna workflow
@research-lab
Galaxy workflow
@sequencing-facility
Variant effect
prediction
VCF file
Exon filtering
output
Merge
Alignment
sample
1.a.R1
sample
1.a.R2
Alignment
sample
1.b.R1
sample
1.b.R2
Alignment
sample
2.R1
sample
2.R2
Sort Sort
Variant calling
GRCh37
“Which alignment algorithm was used when predicting these
effects ?”
“A new version of a reference genome is available, which
genome was used when predicting these phenotypes ?”
Need for an overall tracking of provenance over both Galaxy
and Taverna workflows !
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 3
12. Problem statement
• Heterogeneity in PROV predicates: how to reconcile
different PROV traces ?
• Needs for human-oriented interpretation: tractable size,
domain concepts.
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 6
13. Approach
1. Applying graph saturation and PROV inferences to
overcome vocabulary heterogeneity.
2. Summarizing harmonized provenance graphs as
life-science experiment reports.
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 7
16. – Multi-provenance linking: owl:sameAs
Idea
Stating that two PROV entities associated to the same file
content are the same. go to example
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 9
17. – Multi-provenance linking: owl:sameAs
Idea
Stating that two PROV entities associated to the same file
content are the same. go to example
Producing owl:sameAs
1. SHA-512 fingerprint of files
2. annotating PROV entities with the SHA-512 digest
3. producing owl:sameAs → SPARQL CONSTRUCT-WHERE query
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 9
18. — Reasoning: PROV inference regime
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 10
19. — Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
20. — Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
Example
wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a).
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
21. — Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
Example
wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a).
Translation in SPARQL
CONSTRUCT {
?e_2 prov:wasGeneratedBy _:blank_node .
_:blank_node prov:used ?e_1
} WHERE { ?e_2 prov:wasDerivedFrom ?e_1 }
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
22. — Tuple-Generating Dependencies (TGDs)
Idea
Repeatedly inferring new facts based on existing ones until
saturation (production rules / forward chaining).
Example
wasDerivedFrom(e2, e1) → ∃ a used(a, e1), wasGeneratedFrom(e2, a).
Translation in SPARQL
CONSTRUCT {
?e_2 prov:wasGeneratedBy _:blank_node .
_:blank_node prov:used ?e_1
} WHERE { ?e_2 prov:wasDerivedFrom ?e_1 }
Issues
• may produce many non-informative blank nodes for
existential variables.
• may increase the size of the PROV graph.
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 11
23. ˜ Equality-Generating Dependencies (EGDs)
Idea
• Two blank nodes have the same neighborhood → keep
only one BN
• One blank node and a resource have the same
neighborhood → keep only the resource
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 12
24. ˜ Equality-Generating Dependencies (EGDs)
Idea
• Two blank nodes have the same neighborhood → keep
only one BN
• One blank node and a resource have the same
neighborhood → keep only the resource
Example
prov:qualifiedGeneration
e a
prov:activity
:_gen1
:_gen2
prov:qualifiedGeneration
prov:activity
prov:used a2
prov:qualifiedGeneration
e a
prov:activity
:_gen1
prov:used a2
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 12
25. ˜ Equality-Generating Dependencies (EGDs)
Input : G : the provenance graph resulting from the application of TGD on G
Output: G : the provenance graph with substituted blank nodes, when possible.
1 begin
2 G ← G
3 substitutions ← new List < Pair < Node, Node >> ()
4 repeat
5 S ← findSubstitutions(G )
6 foreach (s ∈ S) do
7 source ← s[0]
8 target ← s[1]
9 foreach (in ∈ G .listStatements(∗, ∗, source)) do
10 G ← G .add(in.getSubject(), in.getPredicate(), target)
11 G ← G .del(in)
12 foreach (out ∈ G .listStatements(source, ∗, ∗)) do
13 G ← G .add(target, out.getPredicate(), out.getObject())
14 G ← G .del(out)
15 until (S.size() = 0)
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 13
27. ™ From harmonized PROV to Nanopublications
Objectives
Linking data to scientific context and publishing them as
cite-able and exchangeable experiment reports.
→ Semantic science Integrated Ontology + NanoPublications
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 14
28. ™ From harmonized PROV to Nanopublications
Objectives
Linking data to scientific context and publishing them as
cite-able and exchangeable experiment reports.
→ Semantic science Integrated Ontology + NanoPublications
reference
genome
sample_001
predicted
phenotypes
from exons
sio:has-phenotype
sio:is-variant-of
sio:is-supported-by
sio:is-supported-by
sio:is-supported-by
sio:is-supported-by
scientific
question :head {
ex:pub1 a np:Nanopublication .
ex:pub1 np:hasAssertion :assertion1 ;
np:hasAssertion :assertion2 .
ex:pub1 np:hasProvenance :provenance .
ex:pub1 np:hasPublicationInfo :pubInfo . }
:assertion1 {
ex:question a sio:Question ;
sio:has-value "What are the effects of SNPs
located in exons for study-Y samples" ;
sio:is-supported-by ex:referenceGenome ;
sio:is-supported-by ex:sample_001 ;
sio:is-supported-by ex:annotatedVariants . }
:assertion2 {
ex:referenceGenome a sio:Genome .
ex:sample_001 a sio:Sample ;
sio:is-variant-of ex:referenceGenome ;
sio:has-phenotype ex:annotatedVariants . }
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 14
29. ™ From harmonized PROV to Nanopublications
How ?
By identifying a data lineage path in multiple PROV graphs,
beforehand harmonized (inferred prov:wasInfluencedBy).
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 15
30. ™ From harmonized PROV to Nanopublications
How ?
By identifying a data lineage path in multiple PROV graphs,
beforehand harmonized (inferred prov:wasInfluencedBy).
CONSTRUCT {
GRAPH :assertion {
?ref_genome a sio:Genome .
?sample a sio:Sample ;
sio:is-variant-of ?ref_genome ;
sio:has-phenotype ?out .
?out rdfs:label ?out_label .
?out sio:is-supported-by ?ref_genome . }
} WHERE {
?sample rdfs:label ?sample_label.
FILTER (contains(lcase(str(?sample_label)), lcase("fastq"))) .
?ref_genome rdfs:label ?ref_genome_label.
FILTER (contains(lcase(str(?ref_genome_label)), lcase("GRCh"))) .
?out ( prov:wasInfluencedBy )+ ?sample
?out tavernaprov:content ?out_label .
FILTER (contains(lcase(str(?out_label)), lcase("exons"))) .
}
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 15
32. Time and space for multiple PROV harmonization ?
Material & methods
• 3 ProvStore graphs
• 10, 109 and 666 edges
• Java + Jena rule engine
• Laptop, 4 cores, 16GB, 5
runs
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 16
33. Time and space for multiple PROV harmonization ?
Material & methods
• 3 ProvStore graphs
• 10, 109 and 666 edges
• Java + Jena rule engine
• Laptop, 4 cores, 16GB, 5
runs
Results
• almost 5s for the whole harmonization process
• no new wasDerivedFrom relation (must be captured)
• inferred wasInfluencedBy → “common denominator” for
data lineage
∆ size ∆ BNodes ∆ wDerivedFrom ∆ wInfluencedBy mean time
PA + 2776 + 2 + 1 + 7 4835 ± 343 ms
PB + 3102 + 4 + 1 + 58 4759 ± 71 ms
PC + 5023 + 63 + 1 + 231 5304 ± 176 ms
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 16
34. Can SHARP produce human-oriented experiment reports ?
Target question
“Which reference genome was used when annotating the
genetic variants ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 17
35. Can SHARP produce human-oriented experiment reports ?
Target question
“Which reference genome was used when annotating the
genetic variants ?”
Material & methods
• Galaxy PROV trace
• Taverna PROV trace
• Manually provided
owl:sameAs
• Harmonization process
• Summarization query
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 17
37. Can SHARP produce human-oriented experiment reports ?
Nanopublication as summarization result
“Which reference genome was used when predicting the phenotypes ?”
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 19
39. Conclusion & future works
Achievements
• Reasoning to address PROV capture heterogeneity
• Reconciliation of multi-site workflow provenance
• Human-oriented nanopublications: domain-specific
vocabulary, tractable size
• Open source implementation for all PROV inference rules
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 20
40. Conclusion & future works
Achievements
• Reasoning to address PROV capture heterogeneity
• Reconciliation of multi-site workflow provenance
• Human-oriented nanopublications: domain-specific
vocabulary, tractable size
• Open source implementation for all PROV inference rules
Perspectives
• size of inferred graphs: OWL/PROV entailment subsets ?
• not yet automated summarization query → work in progress
• towards decentralized provenance harmonization
(federated querying)
• “user study” to assess better interpretation, sharing, trust
A. Gaignard, K. Belhajjame, H. Skaff Molli – SeWeBMeDA’17 20