This document discusses Janus, a semantic provenance model for workflow provenance. Janus aims to move from domain-agnostic provenance graphs to domain-aware graphs through explicit annotations. It also aims to move from local provenance graphs to graphs published as linked open data to enable queries across the web of data. The document outlines Janus's structure and how annotations can be propagated through the graph. It also provides examples of extended queries that combine patterns on local provenance graphs with conditions on remote linked open data sources.
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Nesc invited presentation: Semantic Provenance and Linked Open Data
1. “looking back
to gaze forwards”
Janus
Provenance
LOP: Linked Open Provenance
(was: from Workfl
ows to Semantic Provenance
and Linked Open Data)
Paolo Missier Jun Zhao Satya S. Sahoo
Newcastle University Amit Sheth
(University of Manchester) University of Oxford, UK Wright State University, USA
Nesc workshop:
Understanding Provenance and Linked Open Data
Edinburgh, March 30th, 2011
2. Baseline provenance of a workflow run
QTL → mmu:12575
Ensembl Genes
v1 ... vn w
Ensembl Gene → Ensembl Gene →
Uniprot Gene Entrez Gene
path:mmu04012
exec
Uniprot Gene → Entrez Gene → a1 ... an b1 ... bm
Kegg Gene Kegg Gene
mmu:26416
merge gene IDs
path:mmu04010 y11 ymn
Gene → Pathway
...
path:mmu04010→derives_from→mmu:26416
path:mmu04012→derives_from→mmu:12575
• The graph encodes all direct data dependency relations
• Baseline query model: compute paths amongst sets of nodes
• Transitive closure over data dependency relations
2
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
3. Key ideas
• Janus:
– a semantic provenance model with domain-specific
extensions
– designed around the Taverna workflow model
• From domain-agnostic provenance graphs
• To domain-aware graphs through explicit
annotations
• From local provenance graphs and queries
scoped to the graph
• To
– Graphs published as Linked Data
– Queries extended into the Web of Data
3
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
4. Reference user questions
QTL →
Ensembl Genes Q0: Find all intermediate and initial input
values that contribute to the computation of a
certain output value.
Ensembl Gene → Ensembl Gene →
Uniprot Gene Entrez Gene
Q1. Find all those genes within the input QTL
Uniprot Gene → Entrez Gene →
region that are involved in a given KEGG
Kegg Gene Kegg Gene pathway.
merge gene IDs
Q2: Find all Uniprot-sourced genes
Gene → Pathway
Q3: Find all Entrez genes that encode proteins involved in ATP binding (go:
0005524).
Q4: List relevant PubMed publications for the pathways listed in the result set.
4
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
5. Query types and model capabilities
Query formulation effort Annotation requirements Query Scope
Q0 - Requires knowledge of
process structure and
data values
No annotations required Single run graph or
Multi-run graphs
- Graphical query
constructor may be
available
Q1 Q2
Requires domain annotations
Use of domain terms Single run graph or
on workflow tasks and on data
facilitates query formulation Multi-run graphs
values
Q3 Q4
- Use of domain terms - Requires domain annotations
facilitates query on workflow tasks and on
formulation. data values
The Web of Data
- Can be integrated with - Relies on completeness of
browsers for LoD sources Linked Data Sources
5
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
6. Query types and model capabilities
Query formulation effort Annotation requirements Query Scope
Q0 - Requires knowledge of
process structure and
data values
No annotations required Single run graph or
Multi-run graphs
- Graphical query
constructor may be
available
Q1 Q2
Requires domain annotations
Use of domain terms Single run graph or
on workflow tasks and on data
facilitates query formulation Multi-run graphs
values
Q3 Q4
- Use of domain terms - Requires domain annotations
facilitates query on workflow tasks and on
formulation. data values
The Web of Data
- Can be integrated with - Relies on completeness of
browsers for LoD sources Linked Data Sources
5
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
7. Janus structure
6
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
8. Janus structure
port
v1 v2 v3 value
X1 X2 X3 X1 X2 X3
port
P exec P_inst
Y1 Y2 Y1 Y2
processor
processor exec
spec w1 w2
6
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
10. Desiderata: from structure to values annotations
8
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
11. Desiderata: from structure to values annotations
8
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
12. Desiderata: from structure to values annotations
protein
sequence
X1 X2 X3
interpro
P match
report
Y1 Y2
processor
interpro spec
scan
8
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
13. Desiderata: from structure to values annotations
protein
sequence
X1 X2 X3
interpro
P match
report
Y1 Y2
processor
interpro spec
scan
protein
sequence
port
v1 v2 v3 value
X1
exec
X2 X3
port
P_inst
interpro Y1 Y2
processor interpro
scan exec
w1 w2 match
8 report
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
14. Annotations propagation rules
X rdf:type Port C = {c} X has value type c
X has value v v rdf:type PortValue
v rdf:type C denotes
data type
in the PL
protein sense
has_value_type sequence
X1 X2 X3
interpro
P match
report
Y1 Y2
processor
interpro spec
scan
9
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
15. Annotations propagation rules
X rdf:type Port C = {c} X has value type c
X has value v v rdf:type PortValue
v rdf:type C denotes
data type
in the PL
protein sense
has_value_type sequence
X1 X2 X3
interpro
P match
report
Y1 Y2
processor
?
interpro spec port
scan v1 v2 v3 value
X1 X2 X3
port
P_inst
interpro Y1 Y2
processor interpro
scan exec
w1 w2 match
9 report
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
16. Annotations propagation rules
X rdf:type Port C = {c} X has value type c
X has value v v rdf:type PortValue
v rdf:type C denotes
data type
in the PL
protein sense
has_value_type sequence
X1 X2 X3
interpro
P match
report
Y1 Y2
protein
processor sequence
interpro spec port
scan v1 v2 v3 value
X1 X2 X3
port
P_inst
interpro Y1 Y2
processor interpro
scan exec
w1 w2 match
9 report
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
17. Annotations as semantic overlay
v1 w1
X3
Y2
Provenance graph has_port_value has_port_value
X2
P
fragment
Y1
X1
vn wm
has-input-type Pathway has-output-type
search
service
instance-of
instance-of instance-of
Gene v1 w1 Pathway
has-source
X3
has-source
has_port_value Y2 has_port_value
X2
P
instance-of
instance-of
Y1
Kegg wm
X1
vn Kegg
has-source has-source
10
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
19. Extensions to Linked Data
Annotated workflow Annotated provenance graph
QTL →
Ensembl Genes
Ensembl Gene → Ensembl Gene →
Uniprot Gene Entrez Gene
exec ...
Uniprot Gene → Entrez Gene →
Kegg Gene Kegg Gene
merge gene IDs
Gene → Pathway
- Publish
- I - Map IDs
- II - query
11
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
20. I - Mapping data values to LoD URIs
In our prototype we map data values to Bio2RDF as follows:
Entrez Genes
Uniprot Genes
KEGG Genes
KEGG Pathways
<rdf:Description rdf:about="http://purl.org/net/taverna/janus/create_report/entrezGeneId">
<janus:has_value_binding rdf:resource="http://purl.org/net/taverna/janus/test18"/>
<rdf:Description rdf:about="http://purl.org/net/taverna/janus/test18">
<rdf:type rdf:resource="http://purl.org/net/taverna/janus#port_value"/>
<rdfs:comment>11835</rdfs:comment>
<rdf:type rdf:resource="http://purl.org/obo/owl/sequence#gene"/>
<janus:has_source rdf:resource="http://purl.org/net/taverna/janus#entrez_gene"/>
<rdf:Description rdf:about="http://purl.org/net/taverna/janus/test18">
<rdfs:seeAlso rdf:resource="http://bio2rdf.org/geneid:11835"/>
12
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
21. II - extended queries in the LoD setting
Strategy:
- use the SQUIN LoD query engine to query multiple “Web of Data” sources
- only Bio2RDF in our case
- combine graph patterns on local provenance with conditions on remote LoD
graphs
Q5: Find all Entrez genes that encode proteins involved in ATP binding
(GO:0005524).
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX : <http://www.taverna.org.uk/janus#>
SELECT distinct ?entrezgene
WHERE {
?protein uniprot:classifiedWith <http://bio2rdf.org/go:0005524> . Bio2RDF
?entrezgene <http://bio2rdf.org/bio2rdf_resource:xPath> ?protein .
?gene rdfs:seeAlso ?entrezgene
?gene rdf:type :port_gene
?gene :has_source :entrez_gene . }
local provenance graph
13
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
22. Current status
Current Taverna provenance architecture:
Lab prototype Production
– “Export as...” Janus RDF – “native” (relational) graphs
– currently only queried using – simple, efficient query language on
SPARQL native provenance
e
– manually published
– manually annotated
Taverna
Taverna <<Events stream>> Provenance
runtime events capture
query <scope ... /> Lineage
<select .../> relational
query
<focus .../>
processor DB
query response
OPM graph
complete Janus RDF
graph exporter
Provenance
API (Java)
14
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
23. Summary and References
• Janus: a semantic model for workflow provenance
– OWL ontology, extension of Provenir
– should include attribution + system level provenance
– alignment with OPM?
• Domain-aware graphs through annotations:
– automatically propagated from workflow annotations when possible
– but in practice no real workflows are annotated
• LoD integration:
– powerful provenance publishing and query broadening
– mapping rules currently limited
– no completeness guarantee -- all joins are outer joins!
P. Missier, S.S. Sahoo, J. Zhao, A. Sheth, and C. Goble, "Janus: from Workflows to Semantic
Provenance and Linked Open Data," Procs. IPAW 2010, Troy, NY: 2010
S.S. Sahoo, J. Zhao, and P. Missier, "Extending Semantic provenance into the Web of Data,"
Internet Computing, special issue on Provenance in Web Applications, vol. to appear, 2011.
15
P.Missier - Provenance and Linked Data workshop
Edinburgh March 2011
Notas do Editor
\n
\n
\n
On a typical QTL region with ~150k base pairs, one execution of this workflow finds about 50 Ensembl genes. \nThese could correspond to about 30 genes in the UniProt database, and 60 in the NCBI Etrez genes database. \nEach gene may be involved in a number of pathways, for example the mouse genes Mapk13 (mmu:26415) and \nCdkn1a (mmu:12575) participate to 13 and 9 pathways, respectively. \n\n
\n
\n
\n
\n
\n
\n
\n
\n
\n
comments\nhard for users to formulate a suitable provenance query on scenario I\ncan be done visually but still good understanding of the process and its relation to data is needed\n
comments\nhard for users to formulate a suitable provenance query on scenario I\ncan be done visually but still good understanding of the process and its relation to data is needed\n
comments\nhard for users to formulate a suitable provenance query on scenario I\ncan be done visually but still good understanding of the process and its relation to data is needed\n
comments\nhard for users to formulate a suitable provenance query on scenario I\ncan be done visually but still good understanding of the process and its relation to data is needed\n
\n
annotations done manually for this implementation but could have been inherited from service annotations -- we show this in the next slide\n