Introductory class on techniques and tools to manage scientific data, focusing on sources of information and data analysis. Lecturer: Prof. Kelly Rosa Braghetto, a NeuroMat associate investigator and a professor at the University of São Paulo's Department of Computer Science.
Data Provenance and Scientific Workflow Management
1. Data Provenance and Scientific Workflow Management
Data Provenance
Neuroscience Data
Scientific Workflow Management
(and Questionnaires)
Kelly Rosa Braghetto
kellyrb@ime.usp.br
Departamento de Ciência da Computação
Instituto de Matemática e Estatística
Universidade de São Paulo
June 5, 2013
2. Data Provenance and Scientific Workflow Management
Agenda
1 Data Provenance
2 Neuroscience Data
CARMEN Project
NEMO Project
3 Scientific Workflow Management Systems (SWMS)
Taverna
4 Questionnaires
3. Data Provenance and Scientific Workflow Management
Data Provenance
Frequently asked questions for Scientists
Where was a document found?
How was this data set produced?
Were all facts included in this decision?
Were all the latest figures included in this diagram?
Can this scientific experiment be reproduced?
Source: http://openprovenance.org/
4. Data Provenance and Scientific Workflow Management
Data Provenance
What is Provenance?
Provenance refers to the sources of information, such as entities and
processes, involved in producing or delivering an artifact.
Why does Provenance matter?
The provenance of information is crucial in deciding whether
information is to be trusted, how it should be integrated with other
diverse information sources, and how to give credit to its originators
when reusing it.
In an open and inclusive environment such as the Web, users find
information that is often contradictory or questionable.
People make trust judgments based on provenance that may or may
not be explicitly offered to them. Problem: lack of a standard
model.
Source: http://www.w3.org/2011/prov/wiki/Main_Page
5. Data Provenance and Scientific Workflow Management
Data Provenance
Initiatives devoted to Data Provenance
Provenance Working Group, maintained by W3C
“Mission: to support the widespread publication and use of
provenance information of Web documents, data, and
resources.”
http://www.w3.org/2011/prov/wiki/Main_Page
Wf4Ever project
“Wf4Ever addresses some of the challenges associated to the
preservation of scientific experiments in data-intensive science.”
http://www.wf4ever-project.org/
Open Provenance Model (OPM)
http://openprovenance.org/
6. Data Provenance and Scientific Workflow Management
Data Provenance
Open Provenance Model (OPM)
The Open Provenance Model is a model of provenance that is
designed to meet the following requirements:
1 To allow provenance information to be exchanged between
systems, by means of a compatibility layer based on a shared
provenance model.
2 To allow developers to build and share tools that operate on
such a provenance model.
3 To define provenance in a precise, technology-agnostic manner.
4 To support a digital representation of provenance for any
’thing’, whether produced by computer systems or not.
5 To allow multiple levels of description to coexist.
6 To define a core set of rules that identify the valid inferences
that can be made on provenance representation.
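Requirement 6 can be illustrated with a small sketch. It assumes the edge names `used`, `wasGeneratedBy`, and `wasDerivedFrom` from the OPM specification; the artifact and process names are hypothetical, and the only inference shown is the multi-step (transitive) closure of `wasDerivedFrom`.

```python
from collections import defaultdict

# Each edge is (effect, relation, cause), following the OPM convention that
# edges point from an effect back to its cause. All node names are hypothetical.
EDGES = [
    ("filtered.dat", "wasGeneratedBy", "filter_run"),
    ("filter_run", "used", "raw.dat"),
    ("filtered.dat", "wasDerivedFrom", "raw.dat"),
    ("spikes.csv", "wasGeneratedBy", "detect_run"),
    ("detect_run", "used", "filtered.dat"),
    ("spikes.csv", "wasDerivedFrom", "filtered.dat"),
]


def derived_from_closure(edges):
    """Return, for each artifact, every artifact it was derived from in one or more steps."""
    direct = defaultdict(set)
    for effect, relation, cause in edges:
        if relation == "wasDerivedFrom":
            direct[effect].add(cause)

    closure = {}
    for artifact in direct:
        seen, stack = set(), list(direct[artifact])
        while stack:
            cause = stack.pop()
            if cause not in seen:
                seen.add(cause)
                stack.extend(direct.get(cause, ()))
        closure[artifact] = seen
    return closure


ancestors = derived_from_closure(EDGES)
print(sorted(ancestors["spikes.csv"]))  # ['filtered.dat', 'raw.dat']
```

Because the closure is computed from the shared edge vocabulary alone, any tool that exchanges provenance in this form could answer "which raw data does this result depend on?" without knowing how the edges were recorded.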
7. Data Provenance and Scientific Workflow Management
Neuroscience Data
Projects recording provenance of neuroscience
data
Code Analysis, Repository & Modelling for e-Neuroscience
(CARMEN)
http://www.carmen.org.uk/
“CARMEN is an e-Science Pilot Project funded by the Engineering
and Physical Sciences Research Council (UK). It will deliver a
virtual laboratory for neurophysiology, enabling sharing and
collaborative exploitation of data, analysis code and expertise.
Neural activity recordings (signals and image series) are the primary
data types.”
Neural ElectroMagnetic Ontologies (NEMO)
http://nemo.nic.uoregon.edu/wiki/NEMO
[More details in the next slides...]
8. Data Provenance and Scientific Workflow Management
Neuroscience Data
CARMEN Project
The CARMEN consortium
“A core part of our work is the development of minimum reporting
guidelines for annotation of data and other computational resources
for the purpose of sharing”
Result: a MINI module for Electrophysiology
MINI (Minimum Information about a Neuroscience
Investigation) is a family of reporting guideline documents
A module represents the minimum information that should be
reported about a dataset to:
facilitate computational access and analysis
allow a reader to interpret and critically evaluate the process
performed and the conclusions reached
support their experimental corroboration
9. Data Provenance and Scientific Workflow Management
Neuroscience Data
CARMEN Project
MINI module for Electrophysiology
The reporting recommendations cover both extracellular and
intracellular electrophysiology
Covered data:
date stamps and responsible persons
the subject under study
the subject task or stimulus if appropriate
the recording protocol
and the resulting description of time series data
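The categories above map naturally onto a structured record. The following is a minimal sketch, not the MINI schema itself: the class name, field names, and sample values are all hypothetical, and treating every category except the task or stimulus as required is an assumption of this example.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ElectrophysiologyRecord:
    """Hypothetical container mirroring the categories listed on the slide."""
    date: Optional[str] = None                 # date stamp
    responsible_person: Optional[str] = None
    subject: Optional[str] = None              # the subject under study
    task_or_stimulus: Optional[str] = None     # optional: "if appropriate"
    recording_protocol: Optional[str] = None
    time_series_description: Optional[str] = None


# Everything except the task/stimulus is treated as required in this sketch.
OPTIONAL_FIELDS = {"task_or_stimulus"}


def missing_fields(record):
    """List the required fields that have not been filled in."""
    return [
        f.name
        for f in fields(record)
        if f.name not in OPTIONAL_FIELDS and getattr(record, f.name) is None
    ]


# Invented sample values, for illustration only.
record = ElectrophysiologyRecord(
    date="2013-06-05",
    responsible_person="J. Doe",
    subject="adult rat, hippocampal slice",
    recording_protocol="intracellular, whole-cell patch clamp",
)
print(missing_fields(record))  # ['time_series_description']
```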
The entire module is described in:
http://www.carmen.org.uk/standards/mini.pdf
The module is registered in the MIBBI portal
(http://www.biosharing.org/standards/mibbi and
http://mibbi.sourceforge.net/legacy.shtml).
MIBBI – Minimum Information for Biological and Biomedical
Investigations – is a pioneering project that aims to coordinate
guidelines for reporting of metadata across domains
10. Data Provenance and Scientific Workflow Management
Neuroscience Data
NEMO Project
Neural ElectroMagnetic Ontologies (NEMO)
An NIH-funded project
Aims to create EEG and MEG ontologies and ontology-based
tools. These resources will be used to support representation,
classification, and meta-analysis of brain electromagnetic data.
Based on three pillars: DATA, ONTOLOGY, and DATABASE
Data – raw EEG, averaged EEG (ERPs), and ERP data
analysis results
Ontologies – include concepts related to ERP data (including
spatial and temporal features of ERP patterns), data
provenance, and the cognitive and linguistic paradigms that
were used to collect the data
Database – the NEMO database portal is a large repository
that stores NEMO consortium data, data analysis results, and
data provenance
Site: http://nemo.nic.uoregon.edu
11. Data Provenance and Scientific Workflow Management
Neuroscience Data
NEMO Project
Ontology (informal definition)
In both computer science and information science, an ontology
represents a set of concepts within a domain and the
relationships between those concepts. It is used to reason
about the objects within that domain.
Ontologies are used as a form of knowledge representation
about the world or some part of it.
Ontologies generally describe:
Individuals: the basic or “ground level” objects
Classes: sets, collections, or types of objects
Attributes: properties, features, characteristics, or parameters
that objects can have and share
Relations: ways that objects can be related to one another
Events: the changing of attributes or relations
Source: http://neurolex.org
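As a rough illustration of individuals, classes, attributes, and relations, the sketch below stores a toy ontology as subject-relation-object triples and reasons over the class hierarchy. The term names (`P300`, `ERP_component`, and so on) and relation names are illustrative choices, not taken from NEMO or NeuroLex.

```python
# Triples of (subject, relation, object). All terms are hypothetical.
TRIPLES = [
    ("ERP_component", "subclass_of", "EEG_feature"),      # class hierarchy
    ("P300", "subclass_of", "ERP_component"),
    ("peak_17", "instance_of", "P300"),                   # individual
    ("peak_17", "has_latency_ms", 312),                   # attribute
    ("peak_17", "recorded_at", "electrode_Pz"),           # relation
]


def superclasses(cls, triples):
    """Collect every class reachable from `cls` via subclass_of."""
    found = []
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in triples:
            if subject == current and relation == "subclass_of" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


def classes_of(individual, triples):
    """Return the asserted classes of an individual plus all inferred superclasses."""
    result = []
    for subject, relation, obj in triples:
        if subject == individual and relation == "instance_of":
            result.append(obj)
            result.extend(c for c in superclasses(obj, triples) if c not in result)
    return result


print(classes_of("peak_17", TRIPLES))  # ['P300', 'ERP_component', 'EEG_feature']
```

Only the `P300` membership is stated explicitly; the other two classes are inferred, which is the kind of reasoning an ontology is meant to support.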
12. Data Provenance and Scientific Workflow Management
Neuroscience Data
NEMO Project
MINEMO – an extension of the MINI module for
Electrophysiology
MINEMO = Minimal Information for Neural Electromagnetic
Ontologies
“A standards-compliant method for analysis and integration of
event-related potentials (ERP) data”; in other words: a
checklist for the description of ERP studies
The checklist comprises no more than 60 fields; 20 of these
fields are considered “mandatory”
MINEMO promotes the use of controlled vocabularies (or
lexicons) for data annotation. Aim: to conduct cross-lab
meta-analysis
Each MINEMO checklist item is linked to a term defined in
the NEMO ontology
13. Data Provenance and Scientific Workflow Management
Neuroscience Data
NEMO Project
Subset of “mandatory” MINEMO terms
1 Research lab (General features)
2 Experiment (General features)
3 Publication
4 Study subjects (Group characteristics)
5 Experiment condition
6 Stimulus representation
7 Behavioral data collection
8 EEG data collection
9 EEG/ERP data preprocessing
10 EEG/ERP data file
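A checklist like this lends itself to automated completeness checks. The sketch below is an illustration under stated assumptions: the category keys are shorthand for the ten items above, not official MINEMO field identifiers, and the sample study description is invented.

```python
# Shorthand keys for the ten categories listed above (not official MINEMO identifiers).
MANDATORY_CATEGORIES = [
    "research_lab",
    "experiment",
    "publication",
    "study_subjects",
    "experiment_condition",
    "stimulus_representation",
    "behavioral_data_collection",
    "eeg_data_collection",
    "eeg_erp_data_preprocessing",
    "eeg_erp_data_file",
]


def check_study(description):
    """Split the mandatory categories into those present and those missing or empty."""
    present = [c for c in MANDATORY_CATEGORIES if description.get(c)]
    missing = [c for c in MANDATORY_CATEGORIES if not description.get(c)]
    return present, missing


# Hypothetical, partially filled study description.
study = {
    "research_lab": "Example Lab",
    "experiment": "Visual oddball",
    "study_subjects": "20 healthy adults",
    "eeg_data_collection": "128-channel net, 250 Hz sampling",
    "publication": "",  # empty values count as missing
}

present, missing = check_study(study)
print(f"{len(present)} of {len(MANDATORY_CATEGORIES)} categories filled")
print("missing:", missing)
```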
The entire set of terms is defined in the article:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3235514/
They are also in the MIBBI portal:
14. Data Provenance and Scientific Workflow Management
Neuroscience Data
NEMO Project
More about NEMO...
Data in the NEMO Portal are aligned with the MINEMO
checklist and ontology
https://portal.nemo.nic.uoregon.edu
NIF (the Neuroscience Information Framework project –
http://www.neuinfo.org/) uses the NEMO ontology. NIF
aggregates online sources of neuroscience data, including
databases, web sites, and publications, and provides a search
interface across these disparate sources
The NEMO ontology can be browsed at:
http://bioportal.bioontology.org/ontologies/40522
15. Data Provenance and Scientific Workflow Management
Neuroscience Data
NEMO Project
A “detail” to worry about...
The MINI module for Electrophysiology and MINEMO do not cover
the description of image data
To be covered later:
MIfMRI – Minimum Information about an fMRI Study
http://www.fmrimethods.org/
16. Data Provenance and Scientific Workflow Management
Scientific Workflow Management Systems (SWMS)
Scientific Workflows
A data analysis (or processing) can generally be described as a
workflow, i.e., a set of computational tasks that “transform”
data
In Bioinformatics, a workflow is frequently called a pipeline
In a workflow, the output data of a task is generally used as
input data for other task(s). So, the flow of data defines
an execution order for the workflow's tasks
Frequently, the same task can appear in more than one
workflow
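To make the point about data flow concrete, the sketch below derives an execution order from data dependencies using the standard-library `graphlib` module (Python 3.9 or later). The task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks whose output it consumes.
# Task names are hypothetical.
DEPENDENCIES = {
    "load_recording": set(),
    "band_pass_filter": {"load_recording"},
    "detect_spikes": {"band_pass_filter"},
    "plot_signal": {"band_pass_filter"},
    "summarize": {"detect_spikes", "plot_signal"},
}

# static_order() yields every task after all of its predecessors.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(order)
```

`detect_spikes` and `plot_signal` do not depend on each other, so either may run first (or both in parallel); the data flow only constrains them to run after `band_pass_filter` and before `summarize`.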
17. Data Provenance and Scientific Workflow Management
Scientific Workflow Management Systems (SWMS)
Scientific Workflow Management System
(SWMS)
A computational tool that controls the execution of workflows
It provides mechanisms for a scientist to describe his/her
workflow using “intuitive” modeling languages
It can optimize the execution considering the characteristics of
the available computational resources
It helps to generate provenance data of an analysis process. In
addition, it improves the reproducibility of analyses
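How an engine can capture provenance as a side effect of execution can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Taverna or any other SWMS is implemented: the `run_workflow` function, the task functions, and the log format are all hypothetical, and the workflow is restricted to a linear chain for brevity.

```python
from datetime import datetime, timezone


def run_workflow(tasks, initial_input):
    """Run tasks in sequence, feeding each output to the next task,
    and log what every step consumed and produced."""
    provenance = []
    data = initial_input
    for task in tasks:
        result = task(data)
        provenance.append({
            "task": task.__name__,
            "input": data,
            "output": result,
            "finished_at": datetime.now(timezone.utc).isoformat(),
        })
        data = result
    return data, provenance


# Hypothetical analysis steps.
def remove_offset(samples):
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def rectify(samples):
    return [abs(s) for s in samples]


def peak(samples):
    return max(samples)


result, log = run_workflow([remove_offset, rectify, peak], [1.0, 2.0, 6.0])
for entry in log:
    print(entry["task"], entry["input"], "->", entry["output"])
```

Because every intermediate result is kept alongside the task that produced it, the log is enough to audit the final value or to rerun the analysis from any step.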
18. Data Provenance and Scientific Workflow Management
Scientific Workflow Management Systems (SWMS)
Most successful SWMSs
Taverna – http://www.taverna.org.uk
VisTrails – http://www.vistrails.org
Kepler – https://kepler-project.org
Galaxy – http://galaxyproject.org
19. Data Provenance and Scientific Workflow Management
Scientific Workflow Management Systems (SWMS)
Online workflow repositories – collaborative
science
myExperiment project (http://www.myexperiment.org/):
Users upload their workflow models
Models are categorized according to their research domain
Users can search and download models uploaded by other users
Site stores models from different SWMSs (Taverna, Kepler,
etc.)
20. Data Provenance and Scientific Workflow Management
Scientific Workflow Management Systems (SWMS)
Taverna
Features:
Graphical user interface for the description of the workflows
Easy installation and use
Recording of the “execution history” and intermediate results
(= provenance data of the entire analysis)
Provenance export capability to OPM
21. Data Provenance and Scientific Workflow Management
Questionnaires
Automatic Generation of Online Questionnaires
There are computational tools that automatically generate
electronic questionnaires.
One of the most widely used is LimeSurvey
(https://www.limesurvey.org/).
LimeSurvey features:
Generates online questionnaires
Offers a large set of question types
Stores questionnaire data in a database
Manages users
Creates a print version of questionnaires
Performs basic statistical analysis
...