www.wf4ever-project.org




Scientific Data Management -
  From the Lab to the Web
      José Manuel Gómez Pérez, iSOCO

         Semantic Data Management
             Dagstuhl Seminar
             22-27 April 2012
The data deluge
                                                          Some facts

                             »    In 2010 the size of the digital
                                  universe exceeded 1 zettabyte
                                  (1 ZB = 1 trillion GB)
                             »    1.8 ZB in 2011
                             »    35 ZB expected in 2020

                             »    90% unstructured data
                             »    70% user-generated
                             »    75% resulting from data copying,
                                  merging, and transforming

                             »    Metadata is the fastest growing
                                  data category
                             »    Much of such data is dynamic,
                                  real-time, volatile

Source: IDC, The 2011 Digital Universe Study:
        Extracting Value from Chaos

Dealing with dynamicity
                                         Two main challenges


» Challenge 1: Identifying and
  structuring the relevant portions of
  the data for the task at hand
   › First-class data citizens
» Challenge 2: Managing the lifecycle
  of data entities
   › Preservation
   › Evolution and versioning
   › Decay

Both technical and social aspects are involved.

The Research Lifecycle
                                                Workflows in the Scientific Method


[Figure: Background, Hypothesis, Assumptions, Input data and Method feed the
Experiment, which produces Results (data), followed by Scientific
Interpretation and Publication.]


   Example: Genome-Wide Association Studies




Workflow-based Science
              What is a Scientific Workflow?


»    A mechanism for coordinating the
     execution of services and linking together
     resources.

»    The combination of data and processes
     into a configurable, structured set of steps
     that implement semi-automated
     computational solutions in scientific
     problem-solving


    Scientific workflows are at the core of
    scientific data management
        › Enable automation
        › Encourage best practices




Challenge 1

 Identifying and structuring
the relevant portions of the
  data for the task at hand

    First-class data citizens
Questions for Scientific Data and Workflows                  Issues
Who are you?                                                 Identity and Description
Where and when were you born?                                Authenticity
Who were your parents (creators)?                            Uniqueness
For which purpose were you conceived and used?               Reuse, Repurpose

What do you have inside?                                     Inspection
                                                             Visualization
                                                             Annotations
How is your content linked?                                  Graphical Representation
May I access all your parts?                                 Access Rights
Which parts can I replace?                                   Adaptability
What have they done to you?                                  Provenance
Who and when?                                                Versioning
Why did they do that?

Why have you been recommended to me?                         Information Quality
Can I believe what you are saying or trust your results?

Do you still produce the same results?                       Reproducibility
Are you still working?                                       Completeness
How could I repair you?                                      Stability

How could I thank you?                                       Credit
How could I talk about you?
Challenge 1: Identifying and structuring the relevant data
                                         Research Objects as Technical Objects

Carriers of Research Context
» Referentiable
» Aggregation, Dispersed
    › Heterogeneous
    › Local and External
» Annotated metadata
    › Provenance
    › Structured: Manifests,
      Recipes, Permissions,
      Discourse
» Lifecycle
    › Publishing, Evolution
    › Versioning
» Mixed Stewardship
    › Graceful Degradation
» Sharing
    › Security & Privacy
» Stereotypical User Profiles
» Services

[Figure labels: Distributed, Third Party Tenancy, Alien Store, Technical Objects, Social Objects, OAI-ORE]
Research Objects as Social Objects




                    Package,
                    Explore, Inspect,
                    Review,
                    Exchange,
                    Share, Reuse,
                    Publish, Credit




http://purl.org/wf4ever/ro#
                                                   Research Object model core (simplified)

    RO specification: http://wf4ever.github.com/ro


[Figure: simplified view of the RO core.
 Classes: ro:ResearchObject, ro:Resource, ro:Manifest, wfdesc:Workflow,
 ro:AggregatedAnnotation.
 Properties: ore:aggregates, ore:isDescribedBy, ro:annotatesAggregatedResource.]

›    ro (aggregation and annotation)
›    wfdesc (workflow description)
›    Minim* (minimum info model)
›    wfprov (workflow provenance)
›    roprov (RO provenance)
›    roevo (evolution model)

*Minim is based on M. Gamble's MIM
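The classes and properties named on this slide can be sketched as a handful of triples. This is only an illustration under stated assumptions: triples are plain Python tuples rather than a real RDF library, the `example.org` URIs and file names are made up, the `wfdesc` namespace URI is assumed by analogy with the `ro` one, and the direction of each property follows one plausible reading of the figure rather than a normative statement of the RO specification.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

RO = "http://purl.org/wf4ever/ro#"                  # namespace from the slide
ORE = "http://www.openarchives.org/ore/terms/"      # OAI-ORE terms namespace
WFDESC = "http://purl.org/wf4ever/wfdesc#"          # assumed wfdesc namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/ro/demo/"                  # hypothetical base URI

ro = EX
manifest = EX + ".ro/manifest.rdf"
workflow = EX + "workflow.t2flow"
dataset = EX + "input.csv"
annotation = EX + ".ro/annotations/1"

triples: List[Triple] = [
    (ro, RDF_TYPE, RO + "ResearchObject"),
    (ro, ORE + "isDescribedBy", manifest),
    (manifest, RDF_TYPE, RO + "Manifest"),
    (ro, ORE + "aggregates", workflow),
    (workflow, RDF_TYPE, WFDESC + "Workflow"),
    (ro, ORE + "aggregates", dataset),
    (ro, ORE + "aggregates", annotation),
    (annotation, RDF_TYPE, RO + "AggregatedAnnotation"),
    (annotation, RO + "annotatesAggregatedResource", workflow),
]


def aggregated_resources(graph: List[Triple], research_object: str) -> Set[str]:
    """Everything the research object aggregates."""
    return {o for s, p, o in graph
            if s == research_object and p == ORE + "aggregates"}


def annotations_about(graph: List[Triple], resource: str) -> List[str]:
    """Annotations that target the given aggregated resource."""
    return [s for s, p, o in graph
            if p == RO + "annotatesAggregatedResource" and o == resource]


print(sorted(aggregated_resources(triples, ro)))
print(annotations_about(triples, workflow))
```

The two helper functions answer two of the earlier questions, "What do you have inside?" and "What have they done to you?", by querying the aggregation and the annotations respectively.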
Challenge 2

Managing the lifecycle of
     data entities

   Evolution and Decay
Challenge 2: Managing the lifecycle of data entities
                 RO Evolution & Versioning




Challenge 2: Managing the lifecycle of data entities
                                                                       RO Decay



Workflow Decay
•   Component level
•   flux/decay/unavailability
•   Data level
•   Infrastructure level

Experiment Decay
•   Methodological changes
•   New technologies
•   New resources/components
•   New data




Preservation, Conservation, Recreating


Preserving
Archived Record
Fixed Snapshots
Review
Rerun & Replay

Conserving
Active Instrument
Live
Rerun & Reuse
Repair & Restore

Recreating
Archived Record
Active Instrument
Live
Rebuild Recycle Repurpose

Challenge 2: Managing the lifecycle of data entities
    Possible types of decay (an example)




Decay Analysis
                    A Taxonomy of RO decay



1. Service tool is missing
2. Service file descriptor disappeared
3. Service up but not contactable
4. Service up but functionality changed
5. Local software dependencies
6. Data unavailability
7. Changes in data formats
8. Chained dependency
9. Credentials deprecated
10. Input data superseded by other data
11. RO metadata outdated (upon versioning)
12. Old fashioned RO
13. External references lose credit
14. Execution framework no longer available
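A few entries of this taxonomy lend themselves to automated detection. The sketch below covers only items 2, 3, 4 and 9 and is purely illustrative: the `Probe` fields, the order of the checks, and the mapping from observations to decay types are all assumptions, not a description of any actual Wf4Ever tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decay(Enum):
    """Subset of the taxonomy above, keyed by its item number."""
    DESCRIPTOR_DISAPPEARED = 2
    NOT_CONTACTABLE = 3
    FUNCTIONALITY_CHANGED = 4
    CREDENTIALS_DEPRECATED = 9


@dataclass
class Probe:
    """Hypothetical result of probing one service used by a workflow."""
    descriptor_found: bool
    reachable: bool
    credentials_accepted: bool
    output_matches_reference: bool


def classify(probe: Probe) -> Optional[Decay]:
    """Return the first decay type the probe reveals, or None if healthy."""
    if not probe.descriptor_found:
        return Decay.DESCRIPTOR_DISAPPEARED
    if not probe.reachable:
        return Decay.NOT_CONTACTABLE
    if not probe.credentials_accepted:
        return Decay.CREDENTIALS_DEPRECATED
    if not probe.output_matches_reference:
        return Decay.FUNCTIONALITY_CHANGED
    return None


healthy = Probe(True, True, True, True)
changed = Probe(True, True, True, False)
print(classify(healthy), classify(changed))
```

The checks run from the most basic failure to the most subtle one, so a service whose descriptor is gone is never reported as merely having changed behaviour.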

A taxonomy of workflow decay
      Sample decay type




Decay Analysis
                                    1.0 Certificate – Evaluation of Stability and Completeness

                                               1.0 Certificate of quality

Stability
      Is the RO free from any form of decay
      preventing workflow execution?
      »    Focus on reproducibility
      »    Assisted detection of RO decay
      »    Active monitoring on decay forms
      »    RO and workflow provenance
      »    Notification
      »    Explanation

Completeness
      Is the minimal aggregation of resources
      encapsulated by the RO consistent?
      »    RO checklists
      »    Produced by scientists
      »    Automatically checked against
           minimal model (minim)
      »    RO evolution


1.0 Certificate notion originally proposed by Yde de Jong
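The completeness side of the certificate, a scientist-produced checklist checked automatically against the RO's aggregation, might look roughly like the sketch below. It does not use the actual Minim vocabulary: the `Requirement` class, the mandatory/optional split, the resource type labels, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Requirement:
    """One checklist item: the RO should aggregate a resource of this type."""
    label: str
    resource_type: str
    mandatory: bool = True


def check_completeness(aggregated: Dict[str, str],
                       checklist: List[Requirement]) -> Dict[str, object]:
    """Compare aggregated resources (name -> type) against the checklist."""
    present = set(aggregated.values())
    missing = [r.label for r in checklist
               if r.mandatory and r.resource_type not in present]
    warnings = [r.label for r in checklist
                if not r.mandatory and r.resource_type not in present]
    return {"complete": not missing, "missing": missing, "warnings": warnings}


# Hypothetical RO contents and checklist.
aggregated = {"workflow.t2flow": "workflow", "input.csv": "input-data"}
checklist = [
    Requirement("workflow definition", "workflow"),
    Requirement("input data", "input-data"),
    Requirement("provenance trace", "provenance"),
    Requirement("README", "documentation", mandatory=False),
]

report = check_completeness(aggregated, checklist)
print(report)
```

Separating mandatory items from optional ones lets the check distinguish an incomplete RO from one that is complete but could be better documented.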
Recap
                                      Lessons learnt


              » Data with a Purpose

              » Encapsulate & Conquer
                 › Goal-driven (purpose)
                 › Aggregation
                 › Community-managed

              » Nothing is immutable,
                especially data.
                 › Foster evolution
                 › Monitor decay

[Figure labels: Scalability, Provenance]

Thanks for your Attention!
                                               Questions




 Any Questions?

http://www.wf4ever-project.org/





Mais conteúdo relacionado

Mais procurados

Awareness Support for Knowledge Workers in Research Networks - Very brief PhD...
Awareness Support for Knowledge Workers in Research Networks - Very brief PhD...Awareness Support for Knowledge Workers in Research Networks - Very brief PhD...
Awareness Support for Knowledge Workers in Research Networks - Very brief PhD...Wolfgang Reinhardt
 
Albert Simard - Mobilizing Knowledge: Acquisition, Analysis, and Action
Albert Simard - Mobilizing Knowledge: Acquisition, Analysis, and ActionAlbert Simard - Mobilizing Knowledge: Acquisition, Analysis, and Action
Albert Simard - Mobilizing Knowledge: Acquisition, Analysis, and ActionInstitute for Knowledge Mobilization
 
The changing scholarly content and communication landscape
The changing scholarly content and communication landscapeThe changing scholarly content and communication landscape
The changing scholarly content and communication landscapeLaura Czerniewicz
 
Programming Education based on Jigsaw
Programming Education based on JigsawProgramming Education based on Jigsaw
Programming Education based on Jigsawyunjae jang
 
Digital Scholar
Digital Scholar Digital Scholar
Digital Scholar tanbob
 

Mais procurados (8)

Awareness Support for Knowledge Workers in Research Networks - Very brief PhD...
Awareness Support for Knowledge Workers in Research Networks - Very brief PhD...Awareness Support for Knowledge Workers in Research Networks - Very brief PhD...
Awareness Support for Knowledge Workers in Research Networks - Very brief PhD...
 
Albert Simard - Mobilizing Knowledge: Acquisition, Analysis, and Action
Albert Simard - Mobilizing Knowledge: Acquisition, Analysis, and ActionAlbert Simard - Mobilizing Knowledge: Acquisition, Analysis, and Action
Albert Simard - Mobilizing Knowledge: Acquisition, Analysis, and Action
 
Knowledge mobilization
Knowledge mobilization Knowledge mobilization
Knowledge mobilization
 
Qiagram
QiagramQiagram
Qiagram
 
The changing scholarly content and communication landscape
The changing scholarly content and communication landscapeThe changing scholarly content and communication landscape
The changing scholarly content and communication landscape
 
2012 Taiwan UX Summit 工作坊A 簡報
2012 Taiwan UX Summit 工作坊A 簡報2012 Taiwan UX Summit 工作坊A 簡報
2012 Taiwan UX Summit 工作坊A 簡報
 
Programming Education based on Jigsaw
Programming Education based on JigsawProgramming Education based on Jigsaw
Programming Education based on Jigsaw
 
Digital Scholar
Digital Scholar Digital Scholar
Digital Scholar
 

Semelhante a Scientific data management from the lab to the web

OAI7 Research Objects
OAI7 Research ObjectsOAI7 Research Objects
OAI7 Research Objectsseanb
 
Data Management for Librarians: An Introduction
Data Management for Librarians: An IntroductionData Management for Librarians: An Introduction
Data Management for Librarians: An IntroductionGarethKnight
 
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip finalDeborah McGuinness
 
2012 03-28 Wf4ever, preserving workflows as digital research objects
2012 03-28 Wf4ever, preserving workflows as digital research objects2012 03-28 Wf4ever, preserving workflows as digital research objects
2012 03-28 Wf4ever, preserving workflows as digital research objectsStian Soiland-Reyes
 
Research Shared: researchobject.org
Research Shared: researchobject.orgResearch Shared: researchobject.org
Research Shared: researchobject.orgNorman Morrison
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the partsCarole Goble
 
myExperiment and the Rise of Social Machines
myExperiment and the Rise of Social MachinesmyExperiment and the Rise of Social Machines
myExperiment and the Rise of Social MachinesDavid De Roure
 
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?Marco Roos: Newton's ideas and methods are preserved forever: how about yours?
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?GigaScience, BGI Hong Kong
 
If we build it will they come? BOSC2012 Keynote Goble
If we build it will they come? BOSC2012 Keynote GobleIf we build it will they come? BOSC2012 Keynote Goble
If we build it will they come? BOSC2012 Keynote GobleCarole Goble
 
Acs denver dirks potenzone 30 aug2011
Acs denver dirks potenzone 30 aug2011Acs denver dirks potenzone 30 aug2011
Acs denver dirks potenzone 30 aug2011Rudy Potenzone
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science Carole Goble
 
Making your data work for you: Scratchpads, publishing & the biodiversity dat...
Making your data work for you: Scratchpads, publishing & the biodiversity dat...Making your data work for you: Scratchpads, publishing & the biodiversity dat...
Making your data work for you: Scratchpads, publishing & the biodiversity dat...Vince Smith
 
Scratchpads training course introduction
Scratchpads training course introductionScratchpads training course introduction
Scratchpads training course introductionDimitrios Koureas
 
Argumentative discussions-on-the-web-2013-02-bretagne
Argumentative discussions-on-the-web-2013-02-bretagneArgumentative discussions-on-the-web-2013-02-bretagne
Argumentative discussions-on-the-web-2013-02-bretagnejodischneider
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research ObjectsDavid De Roure
 
ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012
ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012
ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012Lee Dirks
 

Semelhante a Scientific data management from the lab to the web (20)

Research Objects in Wf4Ever
Research Objects in Wf4EverResearch Objects in Wf4Ever
Research Objects in Wf4Ever
 
OAI7 Research Objects
OAI7 Research ObjectsOAI7 Research Objects
OAI7 Research Objects
 
Data Management for Librarians: An Introduction
Data Management for Librarians: An IntroductionData Management for Librarians: An Introduction
Data Management for Librarians: An Introduction
 
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
 
2012 03-28 Wf4ever, preserving workflows as digital research objects
2012 03-28 Wf4ever, preserving workflows as digital research objects2012 03-28 Wf4ever, preserving workflows as digital research objects
2012 03-28 Wf4ever, preserving workflows as digital research objects
 
Research Shared: researchobject.org
Research Shared: researchobject.orgResearch Shared: researchobject.org
Research Shared: researchobject.org
 
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
 
Research Objects: more than the sum of the parts
Research Objects: more than the sum of the partsResearch Objects: more than the sum of the parts
Research Objects: more than the sum of the parts
 
myExperiment and the Rise of Social Machines
myExperiment and the Rise of Social MachinesmyExperiment and the Rise of Social Machines
myExperiment and the Rise of Social Machines
 
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?Marco Roos: Newton's ideas and methods are preserved forever: how about yours?
Marco Roos: Newton's ideas and methods are preserved forever: how about yours?
 
Role of Semantic Web in Health Informatics
Role of Semantic Web in Health InformaticsRole of Semantic Web in Health Informatics
Role of Semantic Web in Health Informatics
 
If we build it will they come? BOSC2012 Keynote Goble
If we build it will they come? BOSC2012 Keynote GobleIf we build it will they come? BOSC2012 Keynote Goble
If we build it will they come? BOSC2012 Keynote Goble
 
2013-01-17 Research Object
2013-01-17 Research Object2013-01-17 Research Object
2013-01-17 Research Object
 
Acs denver dirks potenzone 30 aug2011
Acs denver dirks potenzone 30 aug2011Acs denver dirks potenzone 30 aug2011
Acs denver dirks potenzone 30 aug2011
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science
 
Making your data work for you: Scratchpads, publishing & the biodiversity dat...
Making your data work for you: Scratchpads, publishing & the biodiversity dat...Making your data work for you: Scratchpads, publishing & the biodiversity dat...
Making your data work for you: Scratchpads, publishing & the biodiversity dat...
 
Scratchpads training course introduction
Scratchpads training course introductionScratchpads training course introduction
Scratchpads training course introduction
 
Argumentative discussions-on-the-web-2013-02-bretagne
Argumentative discussions-on-the-web-2013-02-bretagneArgumentative discussions-on-the-web-2013-02-bretagne
Argumentative discussions-on-the-web-2013-02-bretagne
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research Objects
 
ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012
ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012
ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012
 

Mais de Jose Manuel Gómez-Pérez

Mais de Jose Manuel Gómez-Pérez (9)

Science religion-dsmeetupv1.0
Science religion-dsmeetupv1.0Science religion-dsmeetupv1.0
Science religion-dsmeetupv1.0
 
Trust and linked data jmgomez-v1.1
Trust and linked data jmgomez-v1.1Trust and linked data jmgomez-v1.1
Trust and linked data jmgomez-v1.1
 
Halo Pcs Kcap2007 V2
Halo Pcs Kcap2007 V2Halo Pcs Kcap2007 V2
Halo Pcs Kcap2007 V2
 
Acquisition And Understanding Of Process Knowledgev1 1
Acquisition And Understanding Of Process Knowledgev1 1Acquisition And Understanding Of Process Knowledgev1 1
Acquisition And Understanding Of Process Knowledgev1 1
 
NeOn: Lifecycle Support for Networked Ontologies - Case Studies in the Pharma...
NeOn: Lifecycle Support for Networked Ontologies - Case Studies in the Pharma...NeOn: Lifecycle Support for Networked Ontologies - Case Studies in the Pharma...
NeOn: Lifecycle Support for Networked Ontologies - Case Studies in the Pharma...
 
Next Challenges in Corporate Knowledge Management
Next Challenges in Corporate Knowledge ManagementNext Challenges in Corporate Knowledge Management
Next Challenges in Corporate Knowledge Management
 
Provenance: From e-Science to the Web Of Data
Provenance: From e-Science to the Web Of DataProvenance: From e-Science to the Web Of Data
Provenance: From e-Science to the Web Of Data
 
Tecnologías Semánticas en Salud
Tecnologías Semánticas en SaludTecnologías Semánticas en Salud
Tecnologías Semánticas en Salud
 
Provenance and Trust
Provenance and TrustProvenance and Trust
Provenance and Trust
 

Último

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Último (20)

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation

Scientific data management from the lab to the web

  • 1. www.wf4ever-project.org Scientific Data Management - From the Lab to the Web José Manuel Gómez Pérez, iSOCO Semantic Data Management Dagstuhl Seminar 22-27 April 2012
  • 2. The data deluge Some facts » In 2010 the size of the digital universe exceeded 1 zettabyte (= 1 trillion GB) » 1.8 ZB in 2011 » 35 ZB expected in 2020 » 90% unstructured data » 70% user-generated » 75% resulting from data copying, merging, and transforming » Metadata is the fastest growing data category » Much of such data is dynamic, real-time, volatile Source: IDC's The 2011 Digital Universe Study: Extracting Value from Chaos 2
  • 3. Dealing with dynamicity Two main challenges » Challenge 1: Identifying and structuring the relevant portions of the data for the task at hand › First-class data citizens » Challenge 2: Managing the lifecycle of data entities › Preservation › Evolution and versioning › Decay Both technical and social aspects involved 3
  • 4. The Research Lifecycle Workflows in the Scientific Method [Figure: Background, Hypothesis, Assumptions → Experiment (Input data, Method) → Results (data) → Scientific Interpretation → Publication] Example: Genome-Wide Association Studies 4
  • 5. Workflow-based Science What is a Scientific Workflow? » A mechanism for coordinating the execution of services and linking together resources. » The combination of data and processes into a configurable, structured set of steps that implement semi-automated computational solutions in scientific problem-solving Scientific workflows are at the core of scientific data management › Enable automation › Encourage best practices 5
  • 6. Challenge 1 Identifying and structuring the relevant portions of the data for the task at hand First-class data citizens
  • 7. Questions for Scientific Data and Workflows, and the Issues they raise » Who are you? › Identity and Description » Where and when were you born? › Authenticity » Who were your parents (creators)? › Uniqueness » For which purpose were you conceived and have you been used? › Reuse, Repurpose » What do you have inside? › Inspection, Visualization, Annotations » How is your content linked? › Graphical Representation » May I access all your parts? › Access Rights » Which parts can I replace? › Adaptability » What have they done to you? › Provenance » Who and when? › Versioning » Why did they do that? » Why have you been recommended to me? › Information Quality » Can I believe what you are saying or trust your results? » Do you still produce the same results? › Reproducibility » Are you still working? › Completeness » How could I repair you? › Stability » How could I thank you? › Credit » How could I talk about you? 7
  • 8. Challenge 1: Identifying and structuring the relevant data Research Objects as Technical Objects Carriers of Research Context Third Party Alien » Referentiable Distributed Tenancy Store » Aggregation, Dispersed › Heterogeneous › Local and External » Annotated metadata › Provenance › Structured: Manifests, Recipes, Permissions, Discourse » Lifecycle › Publishing, Evolution › Versioning » Mixed Stewardship › Graceful Degradation » Sharing » Security & Privacy Technical Objects Social Objects » Stereotypical User Profiles » Services OAI-ORE 8
  • 9. Research Objects as Social Objects Package, Explore, Inspect, Review, Exchange, Share, Reuse, Publish, Credit 9
  • 10. http://purl.org/wf4ever/ro# Research Object model core (simplified) RO specification: http://wf4ever.github.com/ro ore:aggregates ro:ResearchObject ro:Resource ore:isDescribedBy ro:Manifest wfdesc:Workflow ro:annotatesAggregatedResource ro:AggregatedAnnotation › ro (aggregation and annotation) Note: This figure shows a simplified view of the RO core. › wfdesc (workflow description) › Minim* (minimum info model) › wfprov (workflow provenance) › roprov (RO provenance) › roevo (evolution model) 10 *Minim based on M. Gamble’s MIM
  • 11. Challenge 2 Managing the lifecycle of data entities Evolution and Decay
  • 12. Challenge 2: Managing the lifecycle of data entities RO Evolution & Versioning 12
  • 13. Challenge 2: Managing the lifecycle of data entities RO Decay Workflow Decay • Component level • flux/decay/unavailability • Data level • Infrastructure level Experiment Decay • Methodological changes • New technologies • New resources/components • New data 13
  • 14. Preservation, Conservation, Recreating Preserving Archived Record Fixed Snapshots Review Rerun & Replay Conserving Active Instrument Live Rerun & Reuse Repair & Restore Recreating Archived Record Active Instrument Live Rebuild Recycle Repurpose 14
  • 15. Challenge 2: Managing the lifecycle of data entities Possible types of decay (an example) 15
  • 16. Decay Analysis A Taxonomy of RO decay 1. Service tool is missing 2. Service file descriptor disappeared 3. Service up but not contactable 4. Service up but functionality changed 5. Local software dependencies 6. Data unavailability 7. Changes in data formats 8. Chained dependency 9. Credentials deprecated 10. Input data superseded by other data 11. RO metadata outdated (upon versioning) 12. Old fashioned RO 13. External references lose credit 14. Execution framework no longer available 16
  • 17. A taxonomy of workflow decay Sample decay type 17
  • 18. Decay Analysis 1.0 Certificate: Evaluation of Stability and Completeness 1.0 Certificate of quality » Stability: Is the RO free from any form of decay preventing workflow execution? › Focus on reproducibility › Assisted detection of RO decay › Active monitoring on decay forms › RO and workflow provenance › RO evolution › Notification › Explanation » Completeness: Is the minimal aggregation of resources encapsulated by the RO consistent? › RO checklists › Produced by scientists › Automatically checked against minimal model (minim) 18 1.0 Certificate notion originally proposed by Yde de Jong
  • 19. Recap Lessons learnt » Data with a Purpose › Goal-driven (purpose) › Community-managed » Encapsulate & Conquer › Aggregation » Nothing is immutable, especially data. › Foster evolution › Monitor decay (side labels on the slide: Scalability, Provenance) 19
  • 20. Thanks for your Attention! Questions Any Questions? http://www.wf4ever-project.org/ 20
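
The aggregation idea on slide 10 (a Research Object aggregates resources and carries annotations about them) and the completeness check on slide 18 (a checklist evaluated against a minimal model) can be illustrated with a small sketch. This is plain Python, not the wf4ever tooling or an RDF implementation; the class names, the `kind` field, the `missing_kinds` helper, and the example.org URIs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A single aggregated resource (hypothetical stand-in for ro:Resource)."""
    uri: str
    kind: str  # e.g. "workflow", "data", "hypothesis"


@dataclass
class ResearchObject:
    """Toy Research Object: an aggregation plus annotations on its members."""
    uri: str
    aggregates: list = field(default_factory=list)   # loosely mirrors ore:aggregates
    annotations: dict = field(default_factory=dict)  # resource URI -> list of notes

    def aggregate(self, resource):
        self.aggregates.append(resource)

    def annotate(self, resource_uri, note):
        self.annotations.setdefault(resource_uri, []).append(note)


def missing_kinds(ro, required_kinds):
    """Return the required resource kinds that the RO does not aggregate."""
    present = {resource.kind for resource in ro.aggregates}
    return sorted(set(required_kinds) - present)


# Usage: a checklist that requires a workflow, input data and a hypothesis.
ro = ResearchObject("http://example.org/ro/1")
ro.aggregate(Resource("http://example.org/ro/1/workflow", "workflow"))
ro.aggregate(Resource("http://example.org/ro/1/input.csv", "data"))
ro.annotate("http://example.org/ro/1/workflow", "tested with sample data")
print(missing_kinds(ro, ["workflow", "data", "hypothesis"]))  # ['hypothesis']
```

An empty result would mean the checklist is satisfied. A real checklist in the minim model would likely express richer requirements than resource kinds; this sketch only shows the shape of the check.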

Editor's Notes

  1. In this scenario, a student, Dennis, has made a conceptual workflow that takes the result of a gene expression experiment (activity values of all genes under two conditions: with and without a chemical compound). The wet-laboratory experiment was done by people other than Dennis. He makes a note of the origin (including a paper reference). The initial hypothesis is that the chemical compound disturbs gene expression. It is not yet known which genes and which biological processes are affected. The conceptual workflow first performs one of the standard data preprocessing steps for the type of data Dennis has (Affymetrix gene expression array), then uses a statistical test to filter the genes that are significantly differentially expressed between the two conditions, and finally performs an enrichment test to find the pathways that are most prominent among the filtered genes. The latter requires an annotation process, in which each gene is coupled to the pathways it was implicated in by other experiments (there is a database for that: KEGG).

     Dennis is new to workflows, so he wishes to start with an existing workflow. For each component he will search myExperiment for keywords. He then wishes to understand the workflows: look into them, perform test runs with test data and his own data, and see other people's logs. When he finds workflows he does not understand, Dennis is inclined to create his own workflow with his own scripts. He will receive scripts from colleagues and perform tests that his colleagues are familiar with. In this way he can learn what his workflow is doing, which will help him interpret his results.

     Ultimately, the workflow may suggest, for instance, that the set of differentially expressed genes has the Wnt pathway as its most common denominator. This pathway is well known for its role in embryogenesis and cancer, information he finds on the internet. He makes a note of that. It leads to the hypothesis that the chemical compound may have effects on embryogenesis and/or cancer. This is now his interpretation of his experiment, which he wishes to link to his experiment and the processed data.

     Dennis notes that in a next cycle he will want to run another workflow that specifically tests this hypothesis, rather than perform an enrichment test. He will then look for a workflow that performs a 'global test', and replace this part of his workflow with the global test workflow. He records this in his log. In this case he will link the result of this test (most likely a new hypothesis) to the previous experiment and in particular to the initial hypothesis. At some point, he wishes to be able to retrieve this past information and the interrelationships among his hypotheses.

     Assuming his finding and new hypothesis are valuable and new, he will publish his results. The publication contains cleaned information, sufficient for evaluating his hypothesis and rerunning the one workflow and the one dataset that led to this result.

     Dennis's Working Research Object will contain:
     › A reference to the source of the data and the people to acknowledge for it
     › The initial hypothesis
     › The conceptual workflow or a summary of the experiment plan
     › References to workflows that were tested, with comments on their applicability to Dennis's case
     › A reference to the workflow(s) that Dennis eventually uses, including acknowledgement information (including a note on how these people want to be acknowledged)
     › Dennis's workflow, possibly with a backlog of previous versions that Dennis wishes to keep for reference (with notes and comments)
     › Dennis's workflow run, results and the recorded steps that led to the results, in some cases with comments for later reference (e.g. 'here I used parameter A, next time I may try B')
     › The final hypothesis, with comments
     › A reference to the results of the workflow
     › A Design log that records Dennis's considerations while making the workflow
     › A Run log that records Dennis's considerations while running and interpreting the workflow

     His Publication Research Object will contain:
     › The workflow
     › A caption for his workflow (filtered from his design and run logs: all information a reviewer needs to run the experiment)
     › A workflow run (results, and a caption filtered from the run log)
     › His initial hypothesis
     › His final hypothesis
     › The data source
     › Acknowledgements

     In time, Dennis's workflow can be found on the basis of the metadata of his Published and Working ROs. This will create a rich and wide range of search capabilities for Dennis's successors. The Working RO is kept at Dennis's local group, and is the most valuable resource for reusing the work. The Published RO is available for download and reuse. It is anticipated that interested parties will contact Dennis or his group for 'reuse in collaboration' (i.e. for the group's expertise).
  2. Emphasise the use of Linked Data. Note: the figures here are not intended to be readable. They’re simply emphasising the existence of the models. Example user requirements being addressed by RO:
     › UR1.3: aggregate existing resources to conveniently access related resources from a single place
     › UR1.6: describe the relationships between aggregated resources so that other researchers can see how the resources fit together
     › UR1.16: annotate experimental results using semantic models so that I can find/show links to other, relevant research objects
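
The last step of the conceptual workflow in note 1 is an enrichment test that asks which pathways are over-represented among the filtered genes. The notes do not say which test Dennis's workflow uses; the sketch below assumes one common choice, a one-sided hypergeometric (over-representation) test, and uses made-up gene identifiers and a made-up pathway membership purely for illustration.

```python
from math import comb


def enrichment_p_value(selected, pathway, background):
    """One-sided hypergeometric p-value for pathway over-representation.

    Probability of seeing at least the observed overlap between the
    selected genes and the pathway when drawing len(selected) genes
    at random from the background.
    """
    background = set(background)
    selected = set(selected) & background
    pathway = set(pathway) & background
    N, K, n = len(background), len(pathway), len(selected)
    k = len(selected & pathway)  # observed overlap
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Hypothetical data: 20 background genes, 5 of them in the pathway,
# 4 genes pass the differential-expression filter, 2 of those are in the pathway.
background = [f"g{i}" for i in range(20)]
pathway = ["g0", "g1", "g2", "g3", "g4"]
selected = ["g0", "g1", "g10", "g11"]
print(enrichment_p_value(selected, pathway, background))  # about 0.249
```

In practice the test is repeated for every annotated pathway (for example, those from KEGG) and the resulting p-values are corrected for multiple testing, which this sketch leaves out.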