P.Missier
IDCC‘16–Feb.2016
Data Trajectories:
tracking reuse of published data
for transitive credit attribution
Paolo Missier
Paolo.Missier@ncl.ac.uk
School of Computing Science
Newcastle University, UK
IDCC’16
Amsterdam, Feb 24, 2016
A crowded space in Open Research Data (Repositories)
Data publication and reuse: a potential virtuous cycle
[Figure: cycle diagram linking Publication, Reuse, Tracking, and Partial credit]
Article “reuse” == Article citation
• Easy, but limited semantics
Data reuse is more interesting / complicated:
• Data derivation can take many forms
• Multiple programs, information systems
• Multiple generations
1. What happens to published datasets after their publication?
2. Can we follow their trajectory through transformations?
3. Can we use this knowledge to quantify credit to data contributors?
Measuring data impact (see e.g. [1])
[1] Alex Ball, Monica Duke (2015). ‘How to Track the Impact of Research Data with Metrics’. DCC How-to Guides. Edinburgh: Digital Curation Centre.
Available online: http://www.dcc.ac.uk/resources/how-guides
Data publication & reuse: a hypothetical scenario
Who gets credit for what?
How much credit should Alice, Bob, Charlie receive?
RO = “Research Object”
[Figure: scenario diagram with agents Alice (1), Bob (2), Charlie (3), research objects RO1 to RO5, processes P1 and P2, and DR1 to DR3]
Recording reuse chains
Sequence of derivations viewed as a provenance graph
• W3C PROV compliant
[Figure: provenance graph of the scenario, with nodes RO1 to RO5, P1, P2, S2, DR1 to DR3, Alice (1), Bob (2), Charlie (3)]
Assignment and transitive propagation of credit
Inductive definition of credit:
1. External credit:
• Can be assigned to any ROx in the graph at any time
• How? Don’t care: any (community-based) mechanism is ok
2. Transitively propagated partial credit:
• If ROy is reachable from ROx in the graph, then ROy should receive a portion of the credit given to ROx
This assumes that the graph can be constructed.
Data trajectories
The trajectory DT(RO) of RO contains all RO’ on which RO has had an impact
For each RO, its credit is defined by induction on its trajectory graph, as the combination of two components (formula not reproduced in this transcript):
• External credit
• Transitive credit
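As a minimal sketch of the trajectory definition, assume that derivations are available as a mapping from each derived RO to the ROs it was derived from. The names `DERIVED_FROM` and `trajectory` are illustrative, and the edges are taken from the Alice/Bob/Charlie scenario of this talk. DT(RO) is then the set of all ROs that depend on RO, directly or transitively:

```python
from collections import defaultdict, deque

# Derivation edges from the talk's scenario: derived RO -> source ROs.
DERIVED_FROM = {
    "RO2": ["RO1"],
    "RO3": ["RO1"],
    "RO4": ["RO1", "RO3"],
    "RO5": ["RO2", "RO3"],
}


def trajectory(ro, derived_from):
    """Return DT(ro): every RO' that was directly or indirectly derived from ro."""
    # Invert the mapping so we can walk from a source to its derivatives.
    derivatives = defaultdict(set)
    for derived, sources in derived_from.items():
        for source in sources:
            derivatives[source].add(derived)

    # Breadth-first walk over the inverted edges.
    seen, queue = set(), deque([ro])
    while queue:
        current = queue.popleft()
        for nxt in derivatives[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(trajectory("RO1", DERIVED_FROM)))  # ['RO2', 'RO3', 'RO4', 'RO5']
print(sorted(trajectory("RO3", DERIVED_FROM)))  # ['RO4', 'RO5']
```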
Next steps
1. Define a suitable credit transfer function f
2. Build the provenance graph in practice
• Provlets and their composition
Credit propagation patterns - 1
Most general case:
RO has been reused r times, by activities a1 … ar:
Then, we consider patterns that involve a single activity a
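The general formula is not reproduced in this transcript. One possible reading, offered only as a sketch, is that cr(RO) is its external credit plus f applied to the credit of each RO’ directly derived from it. The constant-fraction `transfer`, the credit values, and the names `EXTERNAL` and `DERIVATIVES` are all assumptions made for illustration:

```python
from functools import lru_cache

# Hypothetical inputs: external credit per RO, and direct reuse edges
# (source RO -> ROs directly derived from it) from the talk's scenario.
EXTERNAL = {"RO1": 1.0, "RO2": 0.0, "RO3": 2.0, "RO4": 4.0, "RO5": 0.0}
DERIVATIVES = {"RO1": ["RO2", "RO3", "RO4"], "RO2": ["RO5"], "RO3": ["RO4", "RO5"]}


def transfer(credit):
    """Placeholder transfer function f: pass back a fixed fraction (assumed)."""
    return 0.5 * credit


@lru_cache(maxsize=None)
def cr(ro):
    """External credit plus credit propagated back from each direct reuse."""
    return EXTERNAL[ro] + sum(transfer(cr(d)) for d in DERIVATIVES.get(ro, []))


print(cr("RO3"))  # 4.0  (2.0 external + half of RO4's 4.0)
print(cr("RO1"))  # 5.0  (1.0 external + half of RO2, RO3 and RO4)
```

The recursion terminates only if the derivation graph is acyclic, which holds for this example.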
Credit propagation patterns - 2
1. Single-input, single-output activity
We want RO to receive a fraction of the credit of RO’, the output of a.
Credit transfer parameter through a: α
α models the value of the transformation a relative to its input data RO
High-value transformation: low α value → low credit to RO
Simple transformation: high α value → high credit to RO
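The formula for this pattern is not reproduced in this transcript. A minimal sketch, assuming α lies in [0, 1] and scales the output's credit multiplicatively (the function name `credit_single` is hypothetical):

```python
def credit_single(alpha, credit_of_output):
    """Credit passed back to RO through a single-input, single-output activity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * credit_of_output


# A simple transformation (high alpha) passes most of the credit back to RO ...
print(credit_single(0.75, 8.0))  # 6.0
# ... whereas a high-value transformation (low alpha) passes little back.
print(credit_single(0.25, 8.0))  # 2.0
```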
Credit propagation patterns -3
2. Multi-input activity: RO is only one of n>1 inputs to a
We account for the relative importance of each of a’s inputs RO1 … ROn
Modelled using n new factors:
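The n factors themselves are not shown in this transcript. The sketch below assumes they are weights β1 … βn that sum to 1 and that each input receives α · βi of the output's credit. Both the normalisation and the function name `credit_multi_input` are assumptions:

```python
def credit_multi_input(alpha, betas, credit_of_output):
    """Split the credit passed back through activity a among its n inputs."""
    if abs(sum(betas.values()) - 1.0) > 1e-9:
        raise ValueError("the beta factors are assumed to sum to 1")
    return {ro: alpha * beta * credit_of_output for ro, beta in betas.items()}


# P2 in the scenario uses RO1 and RO3 to produce RO4 (weights are illustrative).
shares = credit_multi_input(0.5, {"RO1": 0.75, "RO3": 0.25}, 8.0)
print(shares)  # {'RO1': 3.0, 'RO3': 1.0}
```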
Credit propagation patterns -4
3. Multi-input, multi-output activity: a generates m>1 outputs
RO receives credit from each output RO’
These are all part of DT(RO)
Relative importance of derived data products RO’1 … RO’m:
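As a sketch only, the relative importance of each output can be modelled as a weight γj, so that RO collects α · β · γj · cr(RO’j) from every output j. The multiplicative form and the name `credit_multi_output` are assumptions, and whether the γ weights should be normalised is left open here:

```python
def credit_multi_output(alpha, beta, gammas, output_credits):
    """Credit RO receives through one activity that produced several outputs."""
    return sum(
        alpha * beta * gammas[out] * credit
        for out, credit in output_credits.items()
    )


# P1 in the scenario: RO1 is the only input (beta = 1), outputs RO2 and RO3.
total = credit_multi_output(
    alpha=0.5,
    beta=1.0,
    gammas={"RO2": 0.25, "RO3": 0.75},
    output_credits={"RO2": 4.0, "RO3": 8.0},
)
print(total)  # 3.5
```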
Credit propagation patterns - unknown activity
When activity a is unknown, none of the parameters α, β, γ can be used
→ There exists some activity a such that (PROV-CONSTRAINTS (*)):
Modelled using a derivation transfer parameter:
For n known derivations of RO:
(*) https://www.w3.org/TR/prov-constraints/#derivations
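A minimal sketch of this fallback, assuming a single derivation transfer parameter (called `DELTA` here, a hypothetical name) is used whenever the activity behind a derivation is unknown, and the activity's own α otherwise. The numeric values are illustrative:

```python
from typing import Optional

ALPHA = {"P1": 0.5, "P2": 0.25}  # hypothetical per-activity parameters
DELTA = 0.125                    # hypothetical derivation transfer parameter


def transfer_factor(activity: Optional[str]) -> float:
    """Use the activity's alpha when it is known, else fall back to delta."""
    if activity is None:
        return DELTA
    return ALPHA[activity]


# RO5 is derived through an unknown activity, so only delta applies.
print(transfer_factor(None) * 8.0)  # 1.0
# RO2 is derived through the known activity P1.
print(transfer_factor("P1") * 8.0)  # 4.0
```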
Credit from data to Agents
Agents are the actual people to whom the ROs are attributed
Each agent may be responsible for a set R of ROs.
The credit to this agent is simply:
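The formula is not reproduced in this transcript. The sketch below assumes the agent's credit is the sum of cr(RO) over the set R of ROs attributed to that agent. The attribution table and credit values are illustrative:

```python
def agent_credit(agent, attributed_to, cr):
    """Sum cr(RO) over the set R of ROs attributed to the agent."""
    return sum(cr[ro] for ro, owner in attributed_to.items() if owner == agent)


# Illustrative attribution and credit values for the scenario.
ATTRIBUTED_TO = {"RO1": "Alice", "RO2": "Bob", "RO3": "Bob", "RO4": "Charlie"}
CR = {"RO1": 5.0, "RO2": 0.0, "RO3": 4.0, "RO4": 4.0}

print(agent_credit("Alice", ATTRIBUTED_TO, CR))  # 5.0
print(agent_credit("Bob", ATTRIBUTED_TO, CR))    # 4.0
```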
Summary of credit model
RO reuse events → provenance statements about RO → complete provenance graph → DT(RO) → cr(RO)
Three elements to cr(RO):
1. External credit that is independent of reuse
- May follow any community-based scoring scheme of data relevance
2. Credit propagation rules computed inductively from DT(RO)
- These formalise the notion of transitive credit
3. A collection of credit transfer parameters
- These account for the nature of the activities involved in DT(RO)
How it might work: a data reuse simulator
Events:
- Data re-use through an activity
- Adjustments to external credit
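The simulator itself is not described further in this transcript. The toy version below is only a sketch of the two event types, assuming a constant transfer fraction and an acyclic reuse graph. The class `ReuseSimulator` and its methods are hypothetical names:

```python
from collections import defaultdict


class ReuseSimulator:
    """Toy simulator: replays events and recomputes credit on demand."""

    def __init__(self, fraction=0.5):
        self.fraction = fraction             # assumed constant transfer fraction
        self.external = defaultdict(float)   # RO -> external credit
        self.derivatives = defaultdict(set)  # RO -> ROs directly derived from it

    def reuse(self, inputs, outputs):
        """Event: an activity used `inputs` and generated `outputs`."""
        for source in inputs:
            self.derivatives[source].update(outputs)

    def adjust_external(self, ro, amount):
        """Event: the external credit of `ro` changes by `amount`."""
        self.external[ro] += amount

    def credit(self, ro):
        """External credit plus a fraction of each direct derivative's credit."""
        return self.external[ro] + sum(
            self.fraction * self.credit(d) for d in self.derivatives[ro]
        )


sim = ReuseSimulator()
sim.reuse(["RO1"], ["RO2", "RO3"])   # Bob reuses RO1
sim.reuse(["RO1", "RO3"], ["RO4"])   # Charlie reuses RO1 and RO3
sim.adjust_external("RO4", 4.0)      # RO4 gains external credit
print(sim.credit("RO1"))  # 3.0
sim.adjust_external("RO1", 1.0)
print(sim.credit("RO1"))  # 4.0
```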
Next steps
1. Define a suitable credit transfer function f ✓
• Credit transfer parameters
2. Build the provenance graph in practice
• Provlets and their composition
Issues in building a graph of reuse events:
1. Modelling reuse events using PROV [easy]
2. Detecting and reporting reuse events in practice [hard!!]
Modelling reuse using PROV
[Figure: provenance graph of the scenario, with nodes RO1 to RO5, P1, P2, S2, DR1 to DR3, Alice (1), Bob (2), Charlie (3)]
Observable events:
• Alice generates RO1
• Bob reuses RO1, generating RO2, RO3
• Charlie reuses RO1 and RO3, generating RO4 through P2
• Unknown Agent reuses RO2 and RO3, generating RO5 through an unknown activity
Provlets are PROV document fragments generated by multiple, independent, autonomous Information Systems
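To make the idea concrete, here is a sketch in which each provlet is a small set of PROV statements held as plain tuples and printed in a PROV-N-like form. The tuple layout, the variable names, and the attribution of RO2 and RO3 to Bob are assumptions, and no PROV library is used:

```python
# Each provlet is a set of (relation, subject, object) statements using
# PROV relation names; argument order follows PROV-N.
PROVLET_ALICE = {("wasAttributedTo", "RO1", "Alice")}
PROVLET_BOB = {
    ("used", "P1", "RO1"),
    ("wasGeneratedBy", "RO2", "P1"),
    ("wasGeneratedBy", "RO3", "P1"),
    ("wasAttributedTo", "RO2", "Bob"),  # assumed: outputs attributed to Bob
    ("wasAttributedTo", "RO3", "Bob"),
}
PROVLET_CHARLIE = {
    ("used", "P2", "RO1"),
    ("used", "P2", "RO3"),
    ("wasGeneratedBy", "RO4", "P2"),
    ("wasAttributedTo", "RO4", "Charlie"),
}


def to_prov_n(statement):
    """Render one statement in a PROV-N-like form, without namespace prefixes."""
    relation, subject, obj = statement
    return f"{relation}({subject}, {obj})"


for statement in sorted(PROVLET_CHARLIE):
    print(to_prov_n(statement))
# used(P2, RO1)
# used(P2, RO3)
# wasAttributedTo(RO4, Charlie)
# wasGeneratedBy(RO4, P2)
```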
Provlets - I
[Figure: nodes Alice, RO1, DR1. Provlet: RO1 wasAttributedTo Alice]
Provlets - II
[Figure: nodes DR1, DR2, DR3, RO1, RO2, RO3, P1, Bob. Provlet: P1 used RO1; RO2 genBy P1; RO3 genBy P1; wasAttributedTo Bob]
Provlets - III
[Figure: nodes DR1, DR3, RO1, RO3, RO4, P2, Charlie. Provlet: P2 used RO1; P2 used RO3; RO4 genBy P2; wasAttributedTo Charlie]
Provlets - IV
Provlets generation and composition
[Figure: composed provenance graph with activities P1, P2, Px, entities RO1 to RO5, and agents Alice, Bob, Charlie, linked by used, genBy, and wasAttributedTo edges]
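One simple way to sketch composition is to take the union of the statements in each provlet and then read off reuse dependencies wherever an activity used one RO and generated another. That last step is an assumption made for credit purposes (PROV itself does not infer derivation from use plus generation), and the function names are hypothetical:

```python
def compose(*provlets):
    """Compose provlets by taking the union of their statements."""
    return set().union(*provlets)


def derivations(statements):
    """Assumed rule: e2 depends on e1 if some activity used e1 and generated e2."""
    used = {(a, e) for rel, a, e in statements if rel == "used"}
    generated = {(e, a) for rel, e, a in statements if rel == "wasGeneratedBy"}
    return {(e2, e1) for e2, a2 in generated for a1, e1 in used if a1 == a2}


# One provlet per reporting system, following the scenario.
P1 = {("used", "P1", "RO1"), ("wasGeneratedBy", "RO2", "P1"),
      ("wasGeneratedBy", "RO3", "P1")}
P2 = {("used", "P2", "RO1"), ("used", "P2", "RO3"),
      ("wasGeneratedBy", "RO4", "P2")}
PX = {("used", "Px", "RO2"), ("used", "Px", "RO3"),
      ("wasGeneratedBy", "RO5", "Px")}

graph = compose(P1, P2, PX)
print(sorted(derivations(graph)))
# [('RO2', 'RO1'), ('RO3', 'RO1'), ('RO4', 'RO1'),
#  ('RO4', 'RO3'), ('RO5', 'RO2'), ('RO5', 'RO3')]
```

Set union makes composition idempotent and order-independent, which suits provlets that arrive from independent systems at different times.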
Is this really practical?
Provlets are generated by multiple, independent, autonomous Systems
• Not necessarily cooperative
• Especially in the long tail of science
No guarantee of
• Completeness
• Consistency eg of RO PID usage
[Figure: the complete composed provenance graph, with activities P1, P2, Px, entities RO1 to RO5, and agents Alice, Bob, Charlie]
Provenance and trajectories can be incomplete, partially disconnected
[Figure: the same graph without P1 and Bob; only P2, Px, RO1 to RO5, Alice, and Charlie remain]
Alice misses out on the credit due to the dependencies RO2 → RO1, RO3 → RO1, which are no longer recorded
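The effect of a missing provlet can be sketched by computing the ROs downstream of RO1 with and without the dependencies reported for P1. The edge sets follow the scenario, and the function name `downstream` is illustrative:

```python
def downstream(ro, edges):
    """ROs reachable from `ro` by following (source, derived) edges."""
    found, frontier = set(), [ro]
    while frontier:
        current = frontier.pop()
        for source, derived in edges:
            if source == current and derived not in found:
                found.add(derived)
                frontier.append(derived)
    return found


BOB = {("RO1", "RO2"), ("RO1", "RO3")}      # dependencies through P1
CHARLIE = {("RO1", "RO4"), ("RO3", "RO4")}  # dependencies through P2
UNKNOWN = {("RO2", "RO5"), ("RO3", "RO5")}  # dependencies through Px

complete = BOB | CHARLIE | UNKNOWN
incomplete = CHARLIE | UNKNOWN              # P1's provlet is never reported

print(sorted(downstream("RO1", complete)))    # ['RO2', 'RO3', 'RO4', 'RO5']
print(sorted(downstream("RO1", incomplete)))  # ['RO4']
```

With P1's provlet missing, RO5 is no longer connected to RO1, so any credit RO5 earns cannot flow back to Alice.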
Challenges: A research agenda
Vision: tracking data re-use in the wild
1. Community efforts
• Incrementally instrument key systems to be provenance-friendly and cooperative
• Python → NoWorkflow
• R
• Workflows (Kepler, Taverna, Pegasus, VisTrails, …)
• Facilitate consistent use of PIDs
• Incentivise proactive reporting of re-use instances
2. Research into probabilistic provenance
• Can we estimate the likelihood of some of the missing derivations?
• Uncertain graph management → a rich foundation
• Can we design robust credit models that incorporate uncertainty of derivation?
Selected references
• Bechhofer, S., De Roure, D., Gamble, M., Goble, C. & Buchan, I. (2010).
Research Objects: Towards Exchange and Reuse of Digital Knowledge. Nature
Precedings.
• Callaghan, S., Donegan, S., Pepler, S., Thorley, M., Cunningham, N., Kirsch, P., …
Wright, D. (2012, May). Making Data a First Class Scientific Output: Data Citation
and Publication by NERC’s Environmental Data Centres (Vol. 7) (No. 1).
• Katz, D. S. (2014). Transitive credit as a means to address social and
technological concerns stemming from citation and attribution of digital products.
Journal of Open Research Software, 2(1), e20.
• Moreau, L. & Groth, P. (2013, Sep). Provenance: An Introduction to PROV.
Synthesis Lectures on the Semantic Web: Theory and Technology, 3(4), 1–129.
• Wallis, J. C., Rolando, E. & Borgman, C. L. (2013, Jul). If We Share Data, Will
Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and
Technology. PLoS ONE, 8(7), e67332.

More Related Content

What's hot

Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Raja Chiky
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Preserving the currency of genomics outcomes over time through selective re-c...
Preserving the currency of genomics outcomes over time through selective re-c...Preserving the currency of genomics outcomes over time through selective re-c...
Preserving the currency of genomics outcomes over time through selective re-c...Paolo Missier
 
LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013Luis Daniel Ibáñez
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisVincenzo Gulisano
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiersRim Moussa
 
Online Relation Alignment for Linked Datasets
Online Relation Alignment for Linked DatasetsOnline Relation Alignment for Linked Datasets
Online Relation Alignment for Linked DatasetsMaria Koutraki
 
Provenance for Reproducible Data Science
Provenance for Reproducible Data ScienceProvenance for Reproducible Data Science
Provenance for Reproducible Data ScienceAndreas Schreiber
 
Towards research data knowledge graphs
Towards research data knowledge graphsTowards research data knowledge graphs
Towards research data knowledge graphsStefan Dietze
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream ProcessingZbigniew Jerzak
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representationsMarco Quartulli
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
 
Using Substitutive Itemset Mining Framework for Finding Synonymous Properties...
Using Substitutive Itemset Mining Framework for Finding Synonymous Properties...Using Substitutive Itemset Mining Framework for Finding Synonymous Properties...
Using Substitutive Itemset Mining Framework for Finding Synonymous Properties...Agnieszka Ławrynowicz
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedPaco Nathan
 

What's hot (19)

Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014Introduction to Data streaming - 05/12/2014
Introduction to Data streaming - 05/12/2014
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Preserving the currency of genomics outcomes over time through selective re-c...
Preserving the currency of genomics outcomes over time through selective re-c...Preserving the currency of genomics outcomes over time through selective re-c...
Preserving the currency of genomics outcomes over time through selective re-c...
 
LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013LiveLinkedData - TransWebData - Nantes 2013
LiveLinkedData - TransWebData - Nantes 2013
 
parallel OLAP
parallel OLAPparallel OLAP
parallel OLAP
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
Hadoop ensma poitiers
Hadoop ensma poitiersHadoop ensma poitiers
Hadoop ensma poitiers
 
Online Relation Alignment for Linked Datasets
Online Relation Alignment for Linked DatasetsOnline Relation Alignment for Linked Datasets
Online Relation Alignment for Linked Datasets
 
Provenance for Reproducible Data Science
Provenance for Reproducible Data ScienceProvenance for Reproducible Data Science
Provenance for Reproducible Data Science
 
Towards research data knowledge graphs
Towards research data knowledge graphsTowards research data knowledge graphs
Towards research data knowledge graphs
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
18 Data Streams
18 Data Streams18 Data Streams
18 Data Streams
 
07 data structures_and_representations
07 data structures_and_representations07 data structures_and_representations
07 data structures_and_representations
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
A Survey of Entity Ranking over RDF Graphs
A Survey of Entity Ranking over RDF GraphsA Survey of Entity Ranking over RDF Graphs
A Survey of Entity Ranking over RDF Graphs
 
Using Substitutive Itemset Mining Framework for Finding Synonymous Properties...
Using Substitutive Itemset Mining Framework for Finding Synonymous Properties...Using Substitutive Itemset Mining Framework for Finding Synonymous Properties...
Using Substitutive Itemset Mining Framework for Finding Synonymous Properties...
 
A New Year in Data Science: ML Unpaused
A New Year in Data Science: ML UnpausedA New Year in Data Science: ML Unpaused
A New Year in Data Science: ML Unpaused
 

Viewers also liked

PDT: Personal Data from Things, and its provenance
PDT: Personal Data from Things,and its provenancePDT: Personal Data from Things,and its provenance
PDT: Personal Data from Things, and its provenancePaolo Missier
 
ReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case StudyReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case StudyPaolo Missier
 
ReComp: Preserving the value of large scale data analytics over time through...
ReComp:Preserving the value of large scale data analytics over time through...ReComp:Preserving the value of large scale data analytics over time through...
ReComp: Preserving the value of large scale data analytics over time through...Paolo Missier
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’ Paolo Missier
 
Big Data Quality Panel : Diachron Workshop @EDBT
Big Data Quality Panel: Diachron Workshop @EDBTBig Data Quality Panel: Diachron Workshop @EDBT
Big Data Quality Panel : Diachron Workshop @EDBTPaolo Missier
 

Viewers also liked (6)

PDT: Personal Data from Things, and its provenance
PDT: Personal Data from Things,and its provenancePDT: Personal Data from Things,and its provenance
PDT: Personal Data from Things, and its provenance
 
ReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case StudyReComp and the Variant Interpretations Case Study
ReComp and the Variant Interpretations Case Study
 
ReComp for genomics
ReComp for genomicsReComp for genomics
ReComp for genomics
 
ReComp: Preserving the value of large scale data analytics over time through...
ReComp:Preserving the value of large scale data analytics over time through...ReComp:Preserving the value of large scale data analytics over time through...
ReComp: Preserving the value of large scale data analytics over time through...
 
The data, they are a-changin’
The data, they are a-changin’The data, they are a-changin’
The data, they are a-changin’
 
Big Data Quality Panel : Diachron Workshop @EDBT
Big Data Quality Panel: Diachron Workshop @EDBTBig Data Quality Panel: Diachron Workshop @EDBT
Big Data Quality Panel : Diachron Workshop @EDBT
 

Similar to Data Trajectories: tracking the reuse of published data for transitive credit attribution

Rubrics for DMPs
Rubrics for DMPsRubrics for DMPs
Rubrics for DMPsJisc RDM
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSymeon Papadopoulos
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSotiris Beis
 
To trust, or not to trust: Highlighting the need for data provenance in mobil...
To trust, or not to trust: Highlighting the need for data provenance in mobil...To trust, or not to trust: Highlighting the need for data provenance in mobil...
To trust, or not to trust: Highlighting the need for data provenance in mobil...Jon Lazaro Aduna
 
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...John Makridis
 
Visualizing the Maturing Global API Ecosystem
Visualizing the Maturing Global API EcosystemVisualizing the Maturing Global API Ecosystem
Visualizing the Maturing Global API EcosystemSaeidHeshmatisafa1
 
poem_presentation_v5_linkedIn_version
poem_presentation_v5_linkedIn_versionpoem_presentation_v5_linkedIn_version
poem_presentation_v5_linkedIn_versionIliada Eleftheriou
 
Studying information behavior: The Many Faces of Digital Visitors and Residents
Studying information behavior: The Many Faces of Digital Visitors and ResidentsStudying information behavior: The Many Faces of Digital Visitors and Residents
Studying information behavior: The Many Faces of Digital Visitors and ResidentsLynn Connaway
 
Studying information behavior: The Many Faces of Digital Visitors and Residents
Studying information behavior: The Many Faces of Digital Visitors and ResidentsStudying information behavior: The Many Faces of Digital Visitors and Residents
Studying information behavior: The Many Faces of Digital Visitors and ResidentsOCLC
 
Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...Jisc RDM
 
Network Relationships and Job Changes of Software Developers at Sunbelt 2016
Network Relationships and Job Changes of Software Developers at Sunbelt 2016Network Relationships and Job Changes of Software Developers at Sunbelt 2016
Network Relationships and Job Changes of Software Developers at Sunbelt 2016Dawn Foster
 
OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.OCLC
 
Data Mining @ Information Age
Data Mining @ Information AgeData Mining @ Information Age
Data Mining @ Information AgeIIRindia
 
Tools for reflexivity and innovation platforms
Tools for reflexivity and innovation platformsTools for reflexivity and innovation platforms
Tools for reflexivity and innovation platformsILRI
 

Similar to Data Trajectories: tracking the reuse of published data for transitive credit attribution (20)

Rubrics for DMPs
Rubrics for DMPsRubrics for DMPs
Rubrics for DMPs
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
Observlets
Observlets Observlets
Observlets
 
To trust, or not to trust: Highlighting the need for data provenance in mobil...
To trust, or not to trust: Highlighting the need for data provenance in mobil...To trust, or not to trust: Highlighting the need for data provenance in mobil...
To trust, or not to trust: Highlighting the need for data provenance in mobil...
 
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
 
Transitive credit
Transitive creditTransitive credit
Transitive credit
 
Visualizing the Maturing Global API Ecosystem
Visualizing the Maturing Global API EcosystemVisualizing the Maturing Global API Ecosystem
Visualizing the Maturing Global API Ecosystem
 
poem_presentation_v5_linkedIn_version
poem_presentation_v5_linkedIn_versionpoem_presentation_v5_linkedIn_version
poem_presentation_v5_linkedIn_version
 
Carpenter/Lagace: NISO Recommended Practices to Support Adoption of Altmetric...
Carpenter/Lagace: NISO Recommended Practices to Support Adoption of Altmetric...Carpenter/Lagace: NISO Recommended Practices to Support Adoption of Altmetric...
Carpenter/Lagace: NISO Recommended Practices to Support Adoption of Altmetric...
 
Studying information behavior: The Many Faces of Digital Visitors and Residents
Studying information behavior: The Many Faces of Digital Visitors and ResidentsStudying information behavior: The Many Faces of Digital Visitors and Residents
Studying information behavior: The Many Faces of Digital Visitors and Residents
 
Studying information behavior: The Many Faces of Digital Visitors and Residents
Studying information behavior: The Many Faces of Digital Visitors and ResidentsStudying information behavior: The Many Faces of Digital Visitors and Residents
Studying information behavior: The Many Faces of Digital Visitors and Residents
 
Social Media, Linked'Data & The Context Question
Social Media, Linked'Data & The Context QuestionSocial Media, Linked'Data & The Context Question
Social Media, Linked'Data & The Context Question
 
Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...
 
Network Relationships and Job Changes of Software Developers at Sunbelt 2016
Network Relationships and Job Changes of Software Developers at Sunbelt 2016Network Relationships and Job Changes of Software Developers at Sunbelt 2016
Network Relationships and Job Changes of Software Developers at Sunbelt 2016
 
OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.OCLC Research Update at ALA Chicago. June 26, 2017.
OCLC Research Update at ALA Chicago. June 26, 2017.
 
Data Scientists
 Data Scientists Data Scientists
Data Scientists
 
Data Mining @ Information Age
Data Mining @ Information AgeData Mining @ Information Age
Data Mining @ Information Age
 
Sense, Report, Act
Sense, Report, ActSense, Report, Act
Sense, Report, Act
 
Tools for reflexivity and innovation platforms
Tools for reflexivity and innovation platformsTools for reflexivity and innovation platforms
Tools for reflexivity and innovation platforms
 

More from Paolo Missier

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...Paolo Missier
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...Paolo Missier
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...Paolo Missier
 

Data Trajectories: tracking the reuse of published data for transitive credit attribution

  • 1. P.Missier IDCC‘16–Feb.2016 Data Trajectories: tracking reuse of published data for transitive credit attribution Paolo Missier Paolo.Missier@ncl.ac.uk School of Computing Science Newcastle University, UK IDCC’16 Amsterdam, Feb 24, 2016
  • 2. A crowded space in Open Research Data (Repositories)
  • 3. Data publication and reuse: a potential virtuous cycle (Publication, Reuse, Tracking, Partial credit). Article “reuse” == article citation: easy, but limited semantics. Data reuse is more interesting and more complicated: data derivation can take many forms; it involves multiple programs and information systems; it spans multiple generations. 1. What happens to published datasets after their publication? 2. Can we follow their trajectory through transformations? 3. Can we use this knowledge to quantify credit to data contributors? Measuring data impact (see e.g. [1]). [1] Alex Ball, Monica Duke (2015). ‘How to Track the Impact of Research Data with Metrics’. DCC How-to Guides. Edinburgh: Digital Curation Centre. Available online: http://www.dcc.ac.uk/resources/how-guides
  • 4. Data publication & reuse: a hypothetical scenario. Who gets credit for what? How much credit should Alice, Bob, and Charlie receive? RO = “Research Object”. [Figure: reuse scenario with agents Alice, Bob, and Charlie; research objects RO1 to RO5; processes P1 and P2; data repositories DR1 to DR3; steps 1 to 3.]
  • 5. Recording reuse chains. The sequence of derivations is viewed as a provenance graph (W3C PROV compliant). [Figure: the same scenario drawn as a provenance graph over Alice, Bob, Charlie, RO1 to RO5, P1, P2, and DR1 to DR3.]
  • 6. Assignment and transitive propagation of credit. Inductive definition of credit: 1. External credit: can be assigned to any ROx in the graph at any time. How? Don’t care: any (community-based) mechanism is OK. 2. Transitively propagated partial credit: if ROy is reachable from ROx in the graph, then ROy should receive a portion of the credit given to ROx. Assuming this graph can be constructed:
  • 7. Data trajectories. The trajectory DT(RO) of RO contains all RO’ on which RO has had an impact. For each RO, its credit is defined by induction on its trajectory graph, with an external credit term and a transitive credit term: […]
  • 8. Next steps. 1. Define a suitable credit transfer function f. 2. Build the provenance graph in practice: provlets and their composition.
  • 9. Credit propagation patterns - 1. Most general case: RO has been reused r times, by activities a1 … ar. Then, we consider patterns that involve a single activity a.
  • 10. Credit propagation patterns - 2. 1. Single-input, single-output activity: we want RO to receive a fraction of the credit of RO’. The credit transfer parameter α through a models the value of the transformation a relative to its input data RO. High-value transformation: low α value → low credit to RO. Simple transformation: high α value → high credit to RO.
  • 11. Credit propagation patterns - 3. 2. Multi-input activity: RO is only one of n>1 inputs to a. We account for the relative importance of each of a’s inputs RO1 … ROn, modelled using n new factors: […]
  • 12. Credit propagation patterns - 4. 3. Multi-input, multi-output activity: a generates m>1 outputs. RO receives credit from each output RO’; these are all part of DT(RO). Relative importance of the derived data products RO’1 … RO’m: […]
  • 13. Credit propagation patterns - unknown activity. When activity a is unknown, none of the parameters α, β, γ can be used. PROV-CONSTRAINTS (*) → there exists some activity a such that: […] Modelled using a derivation transfer parameter. For n known derivations of RO: […] (*) https://www.w3.org/TR/prov-constraints/#derivations
  • 14. Credit from data to Agents. Agents are the actual people to whom the ROs are attributed. Each agent may be responsible for a set R of ROs. The credit to this agent is simply: […]
  • 15. Summary of credit model. RO reuse events → provenance statements about RO → complete provenance graph → DT(RO) → cr(RO). Three elements to cr(RO): 1. External credit that is independent of reuse; it may follow any community-based scoring scheme of data relevance. 2. Credit propagation rules computed inductively from DT(RO); these formalise the notion of transitive credit. 3. A collection of credit transfer parameters; these account for the nature of the activities involved.
  • 16. How it might work: a data reuse simulator. Events: data re-use through an activity; adjustments to external credit.
  • 17. Next steps. 1. Define a suitable credit transfer function f: credit transfer parameters. 2. Build the provenance graph in practice: provlets and their composition. Issues in building a graph of reuse events: 1. Modelling reuse events using PROV [easy]. 2. Detecting and reporting reuse events in practice [hard!!]
  • 18. Modelling reuse using PROV. Observable events: Alice generates RO1. Bob reuses RO1, generating RO2 and RO3. Charlie reuses RO1 and RO3, generating RO4 through P2. An unknown Agent reuses RO2 and RO3, generating RO5 through an unknown activity. Provlets are PROV document fragments generated by multiple, independent, autonomous information systems. [Figure: the scenario as a provenance graph over Alice, Bob, Charlie, RO1 to RO5, P1, P2, and DR1 to DR3.]
  • 20. Provlets - II. [Figure: Bob’s provlet. P1 used RO1; RO2 and RO3 were generated by P1 (genBy); a wasAttributedTo edge points to Bob.]
  • 21. Provlets - III. [Figure: Charlie’s provlet. P2 used RO1 and RO3; RO4 was generated by P2 (genBy); a wasAttributedTo edge points to Charlie.]
  • 23. Provlets generation and composition. [Figure: composed graph over activities P1, P2, Px and research objects RO1 to RO5, with used and genBy edges and wasAttributedTo edges to Alice, Bob, and Charlie.]
  • 24. Is this really practical? Provlets are generated by multiple, independent, autonomous systems: not necessarily cooperative, especially in the long tail of science. There is no guarantee of completeness, or of consistency, e.g. of RO PID usage. Provenance and trajectories can be incomplete and partially disconnected. Alice misses out on credit due to the missing dependencies RO2 → RO1, RO3 → RO1. [Figure: the composed provenance graph, shown first complete and then without P1 and Bob.]
  • 25. Challenges: a research agenda. Vision: tracking data re-use in the wild. 1. Community efforts: incrementally instrument key systems to be provenance-friendly and cooperative (Python → NoWorkflow; R; workflows: Kepler, Taverna, Pegasus, VisTrails, …); facilitate consistent use of PIDs; incentivise proactive reporting of re-use instances. 2. Research into probabilistic provenance: Can we estimate the likelihood of some of the missing derivations? Uncertain graph management → a rich foundation. Can we design robust credit models that incorporate uncertainty of derivation?
  • 26. A crowded space in Open Research Data (Repositories)
  • 27. Selected references • Bechhofer, S., De Roure, D., Gamble, M., Goble, C. & Buchan, I. (2010). Research Objects: Towards Exchange and Reuse of Digital Knowledge. Nature Precedings. • Callaghan, S., Donegan, S., Pepler, S., Thorley, M., Cunningham, N., Kirsch, P., … Wright, D. (2012). Making Data a First Class Scientific Output: Data Citation and Publication by NERC’s Environmental Data Centres (Vol. 7) (No. 1). • Katz, D. S. (2014). Transitive credit as a means to address social and technological concerns stemming from citation and attribution of digital products. Journal of Open Research Software, 2(1), e20. • Moreau, L. & Groth, P. (2013). Provenance: An Introduction to PROV. Synthesis Lectures on the Semantic Web: Theory and Technology, 3(4), 1–129. • Wallis, J. C., Rolando, E. & Borgman, C. L. (2013). If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology. PLoS ONE, 8(7), e67332.
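
The credit model summarised in slides 7 to 15 can be illustrated with a small Python sketch. The formulas themselves are not preserved in the transcript, so everything below is an assumption made for illustration: the `Activity` class, the function names, the multiplicative combination of α, β and γ, and all numeric values are hypothetical. The unknown activity (called `Px` here) is given placeholder parameters instead of the separate derivation transfer parameter mentioned on slide 13, and the attribution of RO2 and RO3 to Bob follows the scenario rather than any stated PROV record.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """A reuse activity. All parameter values used below are made up."""
    name: str
    alpha: float   # credit transfer parameter of the activity (assumed in [0, 1])
    inputs: dict   # input RO -> beta, relative importance among the inputs
    outputs: dict  # output RO -> gamma, relative importance among the outputs


def trajectory(ro, activities):
    """Node set of DT(ro): every RO derived directly or indirectly from ro."""
    reached, stack = set(), [ro]
    while stack:
        current = stack.pop()
        for act in activities:
            if current in act.inputs:
                for out in act.outputs:
                    if out not in reached:
                        reached.add(out)
                        stack.append(out)
    return reached


def credit(ro, activities, external, memo=None):
    """External credit plus credit propagated back from derived ROs.

    Assumed rule: each activity that used ro passes back
    alpha * beta(ro) * sum(gamma(out) * credit(out)).
    The recursion terminates only if the derivation graph is acyclic.
    """
    memo = {} if memo is None else memo
    if ro not in memo:
        total = external.get(ro, 0.0)
        for act in activities:
            if ro in act.inputs:
                derived = sum(gamma * credit(out, activities, external, memo)
                              for out, gamma in act.outputs.items())
                total += act.alpha * act.inputs[ro] * derived
        memo[ro] = total
    return memo[ro]


def agent_credit(agent, attributed_to, activities, external):
    """Sum of the credit of every RO attributed to the agent."""
    memo = {}
    return sum(credit(ro, activities, external, memo)
               for ro, owner in attributed_to.items() if owner == agent)


# The scenario from the slides, with hypothetical parameter values.
ACTIVITIES = [
    Activity("P1", 0.5, {"RO1": 1.0}, {"RO2": 0.5, "RO3": 0.5}),
    Activity("P2", 0.5, {"RO1": 0.5, "RO3": 0.5}, {"RO4": 1.0}),
    Activity("Px", 0.5, {"RO2": 0.5, "RO3": 0.5}, {"RO5": 1.0}),
]
EXTERNAL = {"RO4": 8.0, "RO5": 4.0}  # external credit; other ROs default to 0
ATTRIBUTED_TO = {"RO1": "Alice", "RO2": "Bob", "RO3": "Bob", "RO4": "Charlie"}

if __name__ == "__main__":
    print(sorted(trajectory("RO1", ACTIVITIES)))
    for agent in ("Alice", "Bob", "Charlie"):
        print(agent, agent_credit(agent, ATTRIBUTED_TO, ACTIVITIES, EXTERNAL))
```

With these made-up numbers, Alice receives credit 3.0 for RO1 even though RO1 has no external credit of its own: all of it flows back from RO4 and RO5 through the trajectory DT(RO1).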

Editor's Notes

  1. Re3Data.org: more than 1,130 data repositories that are accessed by over 5,000 unique visitors each month. On average, 10 new repositories are added every week.
  2. How do I account for the complexity of its transformations? Measuring the influence of a dataset on others’ research? How do I give credit to original contributors?
  3. The scenario involves an initial RO, $\mathit{RO}_1$, which is created and then published by Alice to data repository $DR_1$. This RO is later discovered, downloaded, and reused by Bob through a process $P_1$, and independently by Charlie through process $P_2$, resulting in derivative objects $\mathit{RO}_2$, $\mathit{RO}_3$, and $\mathit{RO}_4$, respectively. These new ROs may be published into different and separate data repositories, e.g. $DR_2$, $DR_3$ as in the figure. Here Alice, Bob, and Charlie are modelled as PROV Agents, and $P_1$, $P_2$ as Activities. Not all details about a derivation are always available. For instance, in this example $\mathit{RO}_2$ and $\mathit{RO}_3$ are later themselves reused by some unknown Agent through some unknown Activity, generating $\mathit{RO}_5$ as a result.
  4. Traverse the provenance graph to obtain the graphs DT(RO) of RO’s direct and indirect derivations.
  5. $\mathit{RO}$ accrues a proportion of the total credit of $\mathit{RO}'$, which accounts for its perceived importance in computing $\mathit{RO}'$ using $a$.
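
The scenario described in the notes above, together with the provlet slides, can be sketched in Python to show why a missing provlet matters. This is a minimal illustration under stated assumptions, not the author's implementation: provlets are modelled as plain sets of (subject, relation, object) triples using the PROV relation names `used`, `wasGeneratedBy` and `wasAttributedTo`; the helper names `compose` and `derived_from`, the activity name `Px` for the unknown activity, and the choice of which ROs are attributed to Bob are all hypothetical.

```python
def compose(*provlets):
    """Compose provlets by taking the union of their statements."""
    graph = set()
    for provlet in provlets:
        graph |= provlet
    return graph


def derived_from(graph, ro):
    """ROs reachable from ro: an activity used the source and generated the target."""
    reached, frontier = set(), [ro]
    while frontier:
        current = frontier.pop()
        # Activities that used the current RO.
        users = {s for (s, rel, o) in graph if rel == "used" and o == current}
        # ROs generated by any of those activities.
        for (s, rel, o) in graph:
            if rel == "wasGeneratedBy" and o in users and s not in reached:
                reached.add(s)
                frontier.append(s)
    return reached


# One provlet per reporting system (contents assumed from the scenario).
ALICE = {("RO1", "wasAttributedTo", "Alice")}
BOB = {
    ("P1", "used", "RO1"),
    ("RO2", "wasGeneratedBy", "P1"),
    ("RO3", "wasGeneratedBy", "P1"),
    ("RO2", "wasAttributedTo", "Bob"),
    ("RO3", "wasAttributedTo", "Bob"),
}
CHARLIE = {
    ("P2", "used", "RO1"),
    ("P2", "used", "RO3"),
    ("RO4", "wasGeneratedBy", "P2"),
    ("RO4", "wasAttributedTo", "Charlie"),
}
UNKNOWN = {
    ("Px", "used", "RO2"),
    ("Px", "used", "RO3"),
    ("RO5", "wasGeneratedBy", "Px"),
}

FULL = compose(ALICE, BOB, CHARLIE, UNKNOWN)
PARTIAL = compose(ALICE, CHARLIE, UNKNOWN)  # Bob's system never reported P1

if __name__ == "__main__":
    print(sorted(derived_from(FULL, "RO1")))
    print(sorted(derived_from(PARTIAL, "RO1")))
```

When Bob's provlet is absent, RO2, RO3 and RO5 drop out of the trajectory of RO1 and only RO4 remains, which is the situation in which Alice misses out on credit.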