3. P.Missier
IDCC‘16–Feb.2016
Data publication and reuse: a potential virtuous cycle
[Figure: a cycle linking Publication, Reuse, Tracking, and Partial credit]
Article “reuse” == Article citation
• Easy, but limited semantics
Data reuse is more interesting /
complicated:
• Data derivation can take many forms
• Multiple programs, information systems
• Multiple generations
1. What happens to published datasets after their publication?
2. Can we follow their trajectory through transformations?
3. Can we use this knowledge to quantify credit to data contributors?
Measuring data impact (see e.g. [1])
[1] Alex Ball, Monica Duke (2015). ‘How to Track the Impact of Research Data with Metrics’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides
Data publication & reuse: a hypothetical scenario
Who gets credit for what?
How much credit should Alice, Bob, Charlie receive?
RO = “Research Object”
[Figure: scenario graph with agents Alice (1), Bob (2), and Charlie (3); research objects RO1 to RO5; processes P1 and P2; data repositories DR1 to DR3]
Assignment and transitive propagation of credit
Inductive definition of credit:
1. External credit:
• Can be assigned to any ROx in the graph at any time
• How? Don’t care: any (community-based) mechanism is ok
2. Transitively propagated partial credit:
• If ROy is reachable from ROx in the graph, then ROy should
receive a portion of the credit given to ROx
All of this assumes that the provenance graph can be constructed.
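The inductive definition can be sketched in a few lines of Python. This is a minimal sketch, not the model from the talk: the graph shape follows the Alice/Bob/Charlie scenario, while the `derived_from` mapping, the function name `propagate`, the uniform `fraction` of 0.5, and the additive rule are all assumptions made for illustration. Shares are not normalised across multiple inputs, so credit is not conserved here.

```python
# Hypothetical derivation graph: each RO maps to the ROs it was derived from.
derived_from = {
    "RO2": ["RO1"],
    "RO3": ["RO1"],
    "RO4": ["RO1", "RO3"],
    "RO5": ["RO2", "RO3"],
}


def propagate(external, derived_from, fraction=0.5):
    """Return total credit per RO: its external credit plus an assumed fixed
    fraction of the total credit of every RO directly derived from it."""
    # Invert the graph: source RO -> list of ROs derived from it.
    derivations = {}
    for derived, sources in derived_from.items():
        for src in sources:
            derivations.setdefault(src, []).append(derived)

    memo = {}

    def cr(ro):
        # Inductive step; terminates because the derivation graph is acyclic.
        if ro not in memo:
            memo[ro] = external.get(ro, 0.0) + sum(
                fraction * cr(child) for child in derivations.get(ro, [])
            )
        return memo[ro]

    nodes = set(external) | set(derived_from) | set(derivations)
    return {ro: cr(ro) for ro in sorted(nodes)}


# External credit is assigned only to RO4 and RO5 in this example.
credit = propagate({"RO4": 8.0, "RO5": 4.0}, derived_from)
```

With these made-up numbers RO1 ends up with credit even though nobody assigned any to it directly, which is the point of transitive propagation.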
Credit propagation patterns - 2
1. Single-input, single-output activity
We want RO to receive a fraction of the credit of RO’, the output that activity a derives from RO.
Credit transfer parameter through a: α
α models the value of the transformation a relative to its input data RO
• High-value transformation: low α value → low credit to RO
• Simple transformation: high α value → high credit to RO
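To make the role of α concrete, here is a small sketch. It assumes a linear transfer in which RO receives α times the credit of RO’; the slides leave the actual transfer function open, so the formula, the function name `transfer`, and the numbers are illustrative only.

```python
def transfer(credit_of_output: float, alpha: float) -> float:
    """Credit passed back from RO' to RO through activity a.

    Assumption: the transfer is linear in alpha, with alpha in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * credit_of_output


# Suppose RO' holds 10 units of credit.
high_value = transfer(10.0, alpha=0.25)  # a adds much value: RO receives little
simple = transfer(10.0, alpha=0.75)      # a is a simple transformation: RO receives most
```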
Credit propagation patterns - 4
3. Multi-input, multi-output activity: a generates m > 1 outputs
RO receives credit from each output RO’
These are all part of DT(RO)
Relative importance of derived data products RO’1 … RO’m:
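One way to read "RO receives credit from each output" is as a weighted sum over the outputs. The sketch below assumes exactly that: the weights, their normalisation to 1, and the function name `credit_from_outputs` are hypothetical and stand in for whatever relative-importance parameters the model actually uses.

```python
def credit_from_outputs(output_credit, importance):
    """Sum the credit flowing back to an input RO from every output RO'_j,
    each scaled by a hypothetical relative-importance weight."""
    total_weight = sum(importance.values())
    if total_weight <= 0:
        raise ValueError("importance weights must have a positive sum")
    return sum(
        (importance[name] / total_weight) * credit
        for name, credit in output_credit.items()
    )


# Two outputs of the same activity; RO'1 is judged three times as important as RO'2.
received = credit_from_outputs(
    {"RO'1": 8.0, "RO'2": 4.0},
    {"RO'1": 3.0, "RO'2": 1.0},
)
```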
Credit propagation patterns - unknown activity
When activity a is unknown, none of the parameters α, β, γ can be used
PROV-CONSTRAINTS (*): there exists some activity a such that:
Modelled using a derivation transfer parameter:
For n known derivations of RO:
(*) https://www.w3.org/TR/prov-constraints/#derivations
Credit from data to Agents
Agents are the actual people to whom the ROs are attributed
Each agent may be responsible for a set R of ROs.
The credit to this agent is simply:
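A minimal sketch of the agent-level step, assuming the agent's credit is the plain sum of cr(RO) over the ROs attributed to them. The function name `agent_credit`, the attribution table, and the credit values are hypothetical.

```python
def agent_credit(attributed_to, credit):
    """Total credit per agent: sum of cr(RO) over the ROs attributed to each agent
    (summation is an assumption of this sketch)."""
    totals = {}
    for ro, agent in attributed_to.items():
        totals[agent] = totals.get(agent, 0.0) + credit.get(ro, 0.0)
    return totals


# Hypothetical attribution and credit values for the scenario's ROs.
attributed_to = {"RO1": "Alice", "RO2": "Bob", "RO3": "Bob", "RO4": "Charlie"}
credit = {"RO1": 8.0, "RO2": 2.0, "RO3": 6.0, "RO4": 8.0}
totals = agent_credit(attributed_to, credit)
```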
Summary of credit model
RO reuse events → provenance statements about RO → complete provenance graph → DT(RO) → cr(RO)
Three elements to cr(RO):
1. External credit that is independent of reuse
- May follow any community-based scoring scheme of data
relevance
2. Credit propagation rules computed inductively from DT(RO)
- These formalise the notion of transitive credit
3. A collection of credit transfer parameters
- These account for the nature of the activities involved in DT(RO)
How it might work: a data reuse simulator
Events:
- Data re-use through an activity
- Adjustments to external credit
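The two event types suggest a very small event-driven loop. The following is a toy sketch only: the class `ReuseSimulator`, its method names, and the single linear parameter `alpha` are assumptions, not a description of an actual simulator.

```python
class ReuseSimulator:
    """Toy event-driven simulator (hypothetical names, assumed linear transfer)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.external = {}      # RO -> external credit
        self.derived_from = {}  # derived RO -> list of input ROs

    def reuse(self, inputs, outputs):
        """Event: an activity uses `inputs` and generates `outputs`."""
        for out in outputs:
            self.derived_from.setdefault(out, []).extend(inputs)
            self.external.setdefault(out, 0.0)
        for ro in inputs:
            self.external.setdefault(ro, 0.0)

    def adjust(self, ro, delta):
        """Event: the external credit of `ro` changes by `delta`."""
        self.external[ro] = self.external.get(ro, 0.0) + delta

    def credit(self, ro):
        """cr(ro): external credit plus alpha times the credit of each RO
        derived from it. Assumes the derivation graph is acyclic."""
        derived = [d for d, srcs in self.derived_from.items() if ro in srcs]
        return self.external.get(ro, 0.0) + sum(
            self.alpha * self.credit(d) for d in derived
        )


sim = ReuseSimulator(alpha=0.5)
sim.reuse(["RO1"], ["RO2", "RO3"])  # Bob reuses RO1 through P1
sim.reuse(["RO1", "RO3"], ["RO4"])  # Charlie reuses RO1 and RO3 through P2
sim.adjust("RO4", 8.0)              # the community assigns external credit to RO4
credit_ro1 = sim.credit("RO1")
```

Recomputing credit from scratch on every query keeps the sketch short; a real simulator would more likely update credit incrementally as events arrive.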
Next steps
1. Define a suitable credit transfer function f
• Credit transfer parameters
2. Build the provenance graph in practice
• Provlets and their composition
Issues in building a graph of reuse events:
1. Modelling reuse events using PROV [easy]
2. Detecting and reporting reuse events in practice [hard!!]
Modelling reuse using PROV
[Figure: the reuse scenario modelled in PROV, with agents Alice (1), Bob (2), and Charlie (3); research objects RO1 to RO5; processes P1 and P2; data repositories DR1 to DR3]
Observable events:
• Alice generates RO1
• Bob reuses RO1, generating RO2 and RO3 through P1
• Charlie reuses RO1 and RO3, generating RO4 through P2
• An unknown Agent reuses RO2 and RO3, generating RO5 through an unknown activity
Provlets are PROV document fragments generated by multiple,
independent, autonomous Information Systems
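To illustrate how provlets from separate systems might compose, the sketch below encodes each provlet as a set of (relation, subject, object) statements and joins them by set union. The encoding and the helper names `compose` and `derivations` are assumptions, not a real PROV serialisation. Treating every output of an activity as derived from every input is also a simplification that PROV itself does not license.

```python
# Each provlet is a set of PROV-style statements: (relation, subject, object).
provlet_alice = {("wasAttributedTo", "RO1", "Alice")}
provlet_bob = {
    ("used", "P1", "RO1"),
    ("wasGeneratedBy", "RO2", "P1"),
    ("wasGeneratedBy", "RO3", "P1"),
    ("wasAttributedTo", "RO2", "Bob"),
    ("wasAttributedTo", "RO3", "Bob"),
}
provlet_charlie = {
    ("used", "P2", "RO1"),
    ("used", "P2", "RO3"),
    ("wasGeneratedBy", "RO4", "P2"),
    ("wasAttributedTo", "RO4", "Charlie"),
}
provlet_unknown = {
    ("used", "Px", "RO2"),
    ("used", "Px", "RO3"),
    ("wasGeneratedBy", "RO5", "Px"),
}


def compose(*provlets):
    """Compose provlets by set union; shared identifiers join the fragments."""
    graph = set()
    for provlet in provlets:
        graph |= provlet
    return graph


def derivations(graph):
    """Pair each generated RO with every RO used by the generating activity
    (a simplification: not every input necessarily influenced every output)."""
    used = {(act, ent) for rel, act, ent in graph if rel == "used"}
    generated = {(ent, act) for rel, ent, act in graph if rel == "wasGeneratedBy"}
    return {(d, s) for d, act in generated for act2, s in used if act == act2}


graph = compose(provlet_alice, provlet_bob, provlet_charlie, provlet_unknown)
pairs = derivations(graph)
```

Composition only works because all four provlets refer to the same RO identifiers, which is why consistent PID usage matters so much in practice.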
Provlets generation and composition
[Figure: composed provenance graph over RO1 to RO5 and activities P1, P2, Px, with used and genBy edges between ROs and activities, and wasAttributedTo edges to Alice, Bob, and Charlie]
Is this really practical?
Provlets are generated by multiple, independent, autonomous Systems
• Not necessarily cooperative
• Especially in the long tail of science
No guarantee of
• Completeness
• Consistency, e.g. of RO PID usage
Alice misses out on credit due to the dependencies of RO2 and RO3 on RO1
[Figure: the complete provenance graph, with activities P1, P2, Px, research objects RO1 to RO5, and attributions to Alice, Bob, and Charlie]
Provenance and trajectories can be incomplete, partially disconnected
[Figure: the same graph with P1 and Bob's attributions missing; only Px, P2, RO1 to RO5, Alice, and Charlie remain]
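The effect of a missing provlet can be checked with a simple reachability test. This is a sketch under stated assumptions: the `derived_from` dictionaries are a hypothetical encoding of the scenario, and `sources` is an illustrative helper, not part of any PROV tooling.

```python
def sources(derived_from, ro):
    """All ROs reachable from `ro` by following derivation edges backwards."""
    seen, stack = set(), [ro]
    while stack:
        for src in derived_from.get(stack.pop(), []):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


complete = {
    "RO2": ["RO1"],
    "RO3": ["RO1"],
    "RO4": ["RO1", "RO3"],
    "RO5": ["RO2", "RO3"],
}
# The same graph when the system that ran P1 never reports its provlet.
incomplete = {
    "RO4": ["RO1", "RO3"],
    "RO5": ["RO2", "RO3"],
}

with_p1 = sources(complete, "RO5")       # includes RO1
without_p1 = sources(incomplete, "RO5")  # RO1 is no longer reachable
```

Any credit assigned to RO5 can therefore never reach Alice in the incomplete graph, although she still receives credit through RO4.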
Challenges: A research agenda
Vision: tracking data re-use in the wild
1. Community efforts
• Incrementally instrument key systems to be provenance-friendly and cooperative
  - Python (noWorkflow)
  - R
  - Workflows (Kepler, Taverna, Pegasus, VisTrails, …)
• Facilitate consistent use of PIDs
• Incentivise proactive reporting of re-use instances
2. Research into probabilistic provenance
• Can we estimate the likelihood of some of the missing derivations?
• Uncertain graph management offers a rich foundation
• Can we design robust credit models that incorporate uncertainty of derivation?
Editor's Notes
Re3Data.org: more than 1,130 data repositories that are accessed by over 5,000 unique visitors each month. On average, 10 new repositories are added every week.
How do I account for the complexity of its transformations?
Measuring the influence of a dataset on others’ research?
How do I give credit to original contributors?
The scenario involves an initial RO, $\mathit{RO}_1$, which is created and then published by Alice to data repository $DR_1$. This RO is later discovered, downloaded, and reused by Bob through a process $P_1$, and independently by Charlie through process $P_2$, resulting in derivative objects $\mathit{RO}_2$ and $\mathit{RO}_3$ (Bob) and $\mathit{RO}_4$ (Charlie). These new ROs may be published into different and separate data repositories, e.g. $DR_2$, $DR_3$ as in the figure. Here Alice, Bob, and Charlie are modelled as PROV Agents, and $P_1$, $P_2$ as Activities. The details of a derivation are not always available. For instance, in this example $\mathit{RO}_2$ and $\mathit{RO}_3$ are later themselves reused by some unknown Agent through some unknown Activity, generating $\mathit{RO}_5$ as a result.
Traverse the provenance graph … to obtain graphs DT(RO) of RO’s direct and indirect derivations:
$\mathit{RO}$ accrues a proportion of the total credit of $\mathit{RO}'$, which accounts for its perceived importance in computing $\mathit{RO}'$ using $a$.