The digital universe is booming, especially metadata and user-generated data. This raises significant challenges: identifying the portions of data that are relevant to a particular problem, and dealing with the lifecycle of data. Finer-grained problems include data evolution and the potential impact of change on the applications relying on the data, causing decay. The management of scientific data is especially sensitive to this. We present the Research Object concept as a means to identify and structure relevant data in scientific domains, treating data as first-class citizens. We also identify and formally represent the main reasons for decay in this domain and propose methods and tools for their diagnosis and repair, based on provenance information. Finally, we discuss the application of these concepts to the broader domain of the Web of Data: Data with a Purpose.
2. The data deluge
Some facts
» In 2010 the size of the digital
universe exceeded 1 zettabyte
(= 1 trillion GB)
» 1.8 ZB in 2011
» 35 ZB expected in 2020
» 90% unstructured data
» 70% user-generated
» 75% resulting from data copying,
merging, and transforming
» Metadata is the fastest growing
data category
» Much of such data is dynamic,
real-time, volatile
Source: IDC, The 2011 Digital Universe Study: Extracting Value from Chaos
3. Dealing with dynamicity
Two main challenges
» Challenge 1: Identifying and
structuring the relevant portions of
the data for the task at hand
› First-class data citizens
» Challenge 2: Managing the lifecycle
of data entities
› Preservation
› Evolution and versioning
› Decay
Both technical and social aspects involved
4. The Research Lifecycle
Workflows in the Scientific Method
[Figure: research lifecycle diagram. Labels: Background, Hypothesis, Assumptions, Input data, Method, Experiment, Results (data), Interpretation, Scientific Publication]
Example: Genome-Wide Association Studies
5. Workflow-based Science
What is a Scientific Workflow?
» A mechanism for coordinating the
execution of services and linking together
resources.
» The combination of data and processes
into a configurable, structured set of steps
that implement semi-automated
computational solutions in scientific
problem-solving
Scientific workflows are at the core of
scientific data management
› Enable automation
› Encourage best practices
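As a rough illustration of "a configurable, structured set of steps", the sketch below runs a list of named steps over shared data and keeps a simple execution trace. The step names, the toy data, and the `run_workflow` helper are all hypothetical; real workflow systems add service invocation, scheduling, and much richer provenance.

```python
from typing import Any, Callable, Dict, List, Tuple

# A step is a name plus a function from the shared data to newly produced values.
Step = Tuple[str, Callable[[Dict[str, Any]], Dict[str, Any]]]


def run_workflow(steps: List[Step], inputs: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Run the steps in order; return the final data and a simple trace."""
    data = dict(inputs)
    trace = []
    for name, func in steps:
        produced = func(data)
        data.update(produced)
        # Record which outputs each step produced (a very small provenance log).
        trace.append(f"{name} -> {sorted(produced)}")
    return data, trace


# Hypothetical toy steps: normalise, filter, summarise.
def normalise(d):
    top = max(d["values"])
    return {"normalised": [v / top for v in d["values"]]}


def filter_high(d):
    return {"selected": [v for v in d["normalised"] if v >= d["threshold"]]}


def summarise(d):
    return {"count": len(d["selected"])}


steps = [("normalise", normalise), ("filter_high", filter_high), ("summarise", summarise)]
result, trace = run_workflow(steps, {"values": [2.0, 4.0, 8.0], "threshold": 0.5})
```

Changing the `threshold` input or swapping one entry in `steps` reconfigures the analysis without touching the other steps, which is the property the definition above emphasises.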
6. Challenge 1
Identifying and structuring
the relevant portions of the
data for the task at hand
First-class data citizens
7. Questions for Scientific Data and Workflows
Question | Issue
Who are you? | Identity and Description
Where and when were you born? | Authenticity
Who were your parents (creators)? | Uniqueness
For which purpose were you conceived and have been used? | Reuse, Repurpose
What do you have inside? | Inspection, Visualization, Annotations
How is your content linked? | Graphical Representation
May I access all your parts? | Access Rights
Which parts can I replace? | Adaptability
What have they done to you? | Provenance
Who and When? | Versioning
Why did they do that? |
Why have you been recommended to me? | Information Quality
Can I believe what you are saying or trust your results? |
Do you still produce the same results? | Reproducibility
Are you still working? | Completeness
How could I repair you? | Stability
How could I thank you? | Credit
How could I talk about you? |
8. Challenge 1: Identifying and structuring the relevant data
Research Objects as Technical Objects
Carriers of Research Context
[Figure labels: Third Party, Alien, Distributed, Tenancy, Store, Dispersed]
» Referentiable
» Aggregation
› Heterogeneous
› Local and External
» Annotated metadata
› Provenance
› Structured: Manifests,
Recipes, Permissions,
Discourse
» Lifecycle
› Publishing, Evolution
› Versioning
» Mixed Stewardship
› Graceful Degradation
» Sharing
» Security & Privacy
Technical Objects Social Objects
» Stereotypical User Profiles
» Services
OAI-ORE
9. Research Objects as Social Objects
Package,
Explore, Inspect,
Review,
Exchange,
Share, Reuse,
Publish, Credit
10. http://purl.org/wf4ever/ro#
Research Object model core (simplified)
RO specification: http://wf4ever.github.com/ro
ore:aggregates
ro:ResearchObject
ro:Resource
ore:isDescribedBy
ro:Manifest
wfdesc:Workflow
ro:annotatesAggregatedResource ro:AggregatedAnnotation
› ro (aggregation and annotation)
› wfdesc (workflow description)
› Minim* (minimum info model)
› wfprov (workflow provenance)
› roprov (RO provenance)
› roevo (evolution model)
Note: This figure shows a simplified view of the RO core.
*Minim based on M. Gamble’s MIM
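To make the simplified core model more concrete, here is a minimal sketch that builds the triples named on the slide (`ore:aggregates`, `ore:isDescribedBy`, `ro:annotatesAggregatedResource`) for a toy Research Object, using plain Python tuples rather than an RDF library. The `ro` namespace is the one given in the slide title; the `wfdesc` namespace, the example.org URIs, the manifest location, and the `build_ro` helper are assumptions made for illustration, not part of the RO specification as shown here.

```python
RO = "http://purl.org/wf4ever/ro#"                 # from the slide title
ORE = "http://www.openarchives.org/ore/terms/"     # OAI-ORE vocabulary
WFDESC = "http://purl.org/wf4ever/wfdesc#"         # assumed namespace for wfdesc
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def build_ro(ro_uri, resources, annotations):
    """Return (subject, predicate, object) triples describing one RO.

    resources:   dict of resource URI -> extra type URI (or None)
    annotations: dict of annotation URI -> URI of the annotated resource
    """
    manifest = ro_uri + ".ro/manifest.rdf"  # hypothetical manifest location
    triples = {
        (ro_uri, RDF_TYPE, RO + "ResearchObject"),
        (ro_uri, ORE + "isDescribedBy", manifest),
        (manifest, RDF_TYPE, RO + "Manifest"),
    }
    for res, extra_type in resources.items():
        triples.add((ro_uri, ORE + "aggregates", res))
        triples.add((res, RDF_TYPE, RO + "Resource"))
        if extra_type:
            triples.add((res, RDF_TYPE, extra_type))
    for ann, target in annotations.items():
        triples.add((ro_uri, ORE + "aggregates", ann))
        triples.add((ann, RDF_TYPE, RO + "AggregatedAnnotation"))
        triples.add((ann, RO + "annotatesAggregatedResource", target))
    return triples


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in sorted(triples))


ro_uri = "http://example.org/ro/demo/"  # hypothetical RO
wf = ro_uri + "workflow.t2flow"
triples = build_ro(
    ro_uri,
    resources={wf: WFDESC + "Workflow", ro_uri + "input.csv": None},
    annotations={ro_uri + ".ro/ann1": wf},
)
print(to_ntriples(triples))
```

Because every aggregated item is identified by a URI, local and external resources can be mixed in the same aggregation.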
13. Challenge 2: Managing the lifecycle of data entities
RO Decay
Workflow Decay
• Component level
• flux/decay/unavailability
• Data level
• Infrastructure level
Experiment Decay
• Methodological changes
• New technologies
• New resources/components
• New data
14. Preservation, Conservation, Recreating
Preserving
Archived Record
Fixed Snapshots
Review
Rerun & Replay
Conserving
Active Instrument
Live
Rerun & Reuse
Repair & Restore
Recreating
Archived Record
Active Instrument
Live
Rebuild Recycle Repurpose
15. Challenge 2: Managing the lifecycle of data entities
Possible types of decay (an example)
16. Decay Analysis
A Taxonomy of RO decay
1. Service tool is missing
2. Service file descriptor disappeared
3. Service up but not contactable
4. Service up but functionality changed
5. Local software dependencies
6. Data unavailability
7. Changes in data formats
8. Chained dependency
9. Credentials deprecated
10. Input data superseded by other data
11. RO metadata outdated (upon versioning)
12. Old fashioned RO
13. External references lose credit
14. Execution framework no longer available
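One way such a taxonomy could be used in tooling is as an enumeration that diagnosis code returns. The sketch below encodes the fourteen categories listed above and adds a hypothetical `diagnose_service` helper; the member names and the mapping from probe observations to categories are assumptions for illustration, not the project's actual diagnosis logic.

```python
from enum import Enum
from typing import Optional


class Decay(Enum):
    """The fourteen decay categories from the taxonomy (member names are ours)."""
    SERVICE_TOOL_MISSING = 1
    SERVICE_DESCRIPTOR_DISAPPEARED = 2
    SERVICE_NOT_CONTACTABLE = 3
    SERVICE_FUNCTIONALITY_CHANGED = 4
    LOCAL_SOFTWARE_DEPENDENCIES = 5
    DATA_UNAVAILABLE = 6
    DATA_FORMAT_CHANGED = 7
    CHAINED_DEPENDENCY = 8
    CREDENTIALS_DEPRECATED = 9
    INPUT_DATA_SUPERSEDED = 10
    RO_METADATA_OUTDATED = 11
    OLD_FASHIONED_RO = 12
    EXTERNAL_REFERENCES_LOSE_CREDIT = 13
    EXECUTION_FRAMEWORK_UNAVAILABLE = 14


def diagnose_service(descriptor_found: bool, reachable: bool,
                     output_matches: bool) -> Optional[Decay]:
    """Map three probe observations to a decay category (assumed heuristic).

    Returns None when no service-level decay is detected.
    """
    if not descriptor_found:
        return Decay.SERVICE_DESCRIPTOR_DISAPPEARED
    if not reachable:
        return Decay.SERVICE_NOT_CONTACTABLE
    if not output_matches:
        # The service answers, but no longer reproduces a recorded result.
        return Decay.SERVICE_FUNCTIONALITY_CHANGED
    return None


finding = diagnose_service(descriptor_found=True, reachable=True, output_matches=False)
```

Comparing current output against a recorded result is where provenance information would come in: the earlier run supplies the expected value.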
18. Decay Analysis
1.0 Certificate – Evaluation of Stability and Completeness
1.0 Certificate of quality
Stability
Is the RO free from any form of decay preventing workflow execution?
» Focus on reproducibility
» Assisted detection of RO decay
» Active monitoring on decay forms
» RO and workflow provenance
» RO evolution
» Notification
» Explanation
Completeness
Is the minimal aggregation of resources encapsulated by the RO consistent?
» RO checklists
» Produced by scientists
» Automatically checked against minimal model (minim)
1.0 Certificate notion originally proposed by Yde de Jong
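The completeness idea, a checklist produced by scientists and checked automatically against the RO's aggregated resources, can be sketched as follows. The requirement levels ("MUST", "SHOULD"), the resource kinds, and the `evaluate` helper are hypothetical simplifications and do not reproduce the actual Minim model.

```python
from typing import Dict, List, Tuple

# A checklist item: (requirement level, kind of resource that must be present).
Checklist = List[Tuple[str, str]]


def evaluate(resources: Dict[str, str], checklist: Checklist):
    """Check aggregated resources (name -> kind) against a checklist.

    Returns (complete, missing): the RO counts as complete when no MUST item
    is missing; missing lists every unmet item, including SHOULD items.
    """
    kinds = set(resources.values())
    missing = [(level, kind) for level, kind in checklist if kind not in kinds]
    complete = not any(level == "MUST" for level, _ in missing)
    return complete, missing


# Hypothetical checklist for a workflow-centric RO.
checklist = [("MUST", "workflow"), ("MUST", "input-data"), ("SHOULD", "hypothesis")]

ro_ok = {"workflow.t2flow": "workflow", "input.csv": "input-data"}
ro_broken = {"input.csv": "input-data"}

ok_result = evaluate(ro_ok, checklist)
broken_result = evaluate(ro_broken, checklist)
```

Returning the list of unmet items, rather than only a boolean, supports the notification and explanation goals listed for stability.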
19. Recap
Lessons learnt
» Data with a Purpose
» Encapsulate & Conquer
› Goal-driven (purpose)
› Aggregation
› Community-managed
» Nothing is immutable, especially data.
› Foster evolution
› Monitor decay
[Figure labels: Scalability, Provenance]
20. Thanks for your Attention!
Any Questions?
http://www.wf4ever-project.org/
Editor's Notes
In this scenario, student Dennis has made a conceptual workflow that takes the result of a gene expression experiment (activity values of all genes under two conditions: with and without a chemical compound). The wet laboratory experiment was done by people other than Dennis. He makes a note of the origin (including a paper reference). The initial hypothesis is that the chemical compound disturbs gene expression. It is not yet known which genes and which biological processes are affected. The conceptual workflow first performs one of the standard data preprocessing steps for the type of data Dennis has (Affymetrix gene expression array), then uses a statistical test to filter those genes that are significantly differentially expressed between the two conditions, and finally performs an enrichment test to find the pathways that are most prominent among the filtered genes. The latter requires an annotation process, where each gene is coupled to the pathways it was implicated in by other experiments (there is a database for that: KEGG).
Dennis is new to workflows, so he wishes to start with an existing workflow. For each component he will search myExperiment by keyword. He then wishes to understand the workflows: look into them, perform test runs with test data and his own data, and see other people's logs. When he finds workflows he does not understand, Dennis is inclined to create his own workflow with his own scripts. He will receive scripts from colleagues and perform tests that his colleagues are familiar with. In this way he can learn what his workflow is doing, which will help him interpret his results.
Ultimately, the workflow may suggest, for instance, that the set of differentially expressed genes has the Wnt pathway as its most common denominator. This pathway is well known for its role in embryogenesis and cancer, information he finds on the internet. He makes a note of that. It leads to the hypothesis that the chemical compound may have effects on embryogenesis and/or cancer. This is now his interpretation of his experiment, which he wishes to link to the experiment and the processed data.
Dennis notes that in a next cycle he will want to run another workflow that specifically tests this hypothesis, rather than perform an enrichment test. He will then look for a workflow that performs a 'global test', and replace this part of his workflow with the global test workflow. He records this fact in his log. In this case he will link the result of this test (most likely a new hypothesis) to the previous experiment and in particular to the initial hypothesis. At some point, he wishes to be able to retrieve this past information and the interrelationships among his hypotheses.
Assuming his finding and new hypothesis are valuable and new, he will publish his results. The publication contains cleaned information, sufficient for evaluating his hypothesis and rerunning the one workflow and the one dataset that led to this result.
Dennis's Working Research Object will contain:
» A reference to the source of the data and the people to acknowledge for it
» The initial hypothesis
» The conceptual workflow or a summary of the experiment plan
» References to workflows that were tested, with comments on their applicability to Dennis's case
» A reference to the workflow(s) that Dennis eventually uses, including acknowledgement information (including a note on how these people want to be acknowledged)
» Dennis's workflow, possibly with a backlog of previous versions that Dennis wishes to keep for reference (with notes and comments)
» Dennis's workflow run, results, and the recorded steps that led to the results, in some cases with comments for later reference (e.g. 'here I used parameter A, next time I may try B')
» The final hypothesis, with comments
» A reference to the results of the workflow
» A Design log that records Dennis's considerations while making the workflow
» A Run log that records Dennis's considerations while running and interpreting the workflow
His Publication Research Object will contain:
» The workflow
» A caption for his workflow (filtered from his design and run logs: all the information a reviewer needs to run the experiment)
» A workflow run (results, and a caption filtered from the run log)
» His initial hypothesis
» His final hypothesis
» The data source
» Acknowledgements
In time, Dennis's workflow can be found on the basis of the metadata of his Published and Working ROs. This will create a rich and wide range of search capabilities for Dennis's successors.
The Working RO is kept at Dennis's local group, and is the most valuable resource for reusing the work. The Published RO is available for download and reuse. It is anticipated that interested parties will contact Dennis or his group for 'reuse in collaboration' (i.e. for the group's expertise).
Emphasise the use of Linked Data. Note: the figures here are not intended to be readable. They’re simply emphasising the existence of the models. Example user requirements being addressed by RO:
» UR1.3: aggregate existing resources to conveniently access related resources from a single place
» UR1.6: describe the relationships between aggregated resources so that other researchers can see how the resources fit together
» UR1.16: annotate experimental results using semantic models so that I can find/show links to other, relevant research objects