Capturing Context in Scientific Experiments: Towards Computer-Driven Science

Daniel Garijo
Information Sciences Institute and
Department of Computer Science
https://w3id.org/people/dgarijo
@dgarijov
dgarijo@isi.edu

A prediction of the future… from the past
Useful for:
• Everyday tasks
• Organize agenda
• Calls
• Look for information
• Research features
• Summarize related work
• Reuse and comparison of work
• Highlights
• Do new data analyses
Source: https://www.businessinsider.com.au/apple-future-computer-knowledge-navigator-john-sculley-george-lucas-2017-10,
https://www.youtube.com/watch?v=QRH8eimU_20
The knowledge navigator (Apple, 1987)

Meeting expectations…
• In terms of Data
• Open datasets
• Open metadata portals
• In terms of Software
• Open Source repositories
• Containers and virtual machines
• In terms of Publications
• Open journals
• Open methods/protocols

What are we missing?
• Methods in publications are not designed for intelligent systems
• Objectives, hypotheses, methodology and conclusions are tailored for humans
• The link between data, software and publications is not clear (if it exists at all)
• Functionality and instructions for executing software require specific domain expertise
• Publications are difficult to reuse and reproduce
Retracted Scientific Studies: A Growing List (NYTimes.com, 5/29/15)
The retraction by Science of a study of changing attitudes about gay marriage is
the latest prominent withdrawal of research results from scientific literature.
And it very likely won't be the last. A 2011 study in Nature found a 10-fold
increase in retraction notices during the preceding decade.
Many retractions barely register outside of the scientific field. But in some
instances, the studies that were clawed back made major waves in societal
discussions of the issues they dealt with. This list recounts some prominent
retractions that have occurred since 1980.
In 1998, The Lancet, a British medical journal,
published a study by Dr. Andrew Wakefield
that suggested that autism in children was
caused by the combined vaccine for measles,
mumps and rubella. In 2010, The Lancet
retracted the study following a review of Dr.
Wakefield's scientific methods and financial
conflicts.
Despite challenges to the study, Dr.
Wakefield's research had a strong effect on
many parents. Vaccination rates tumbled in
Britain, and measles cases grew. American
antivaccine groups also seized on the research. The United States had more
cases of measles in the first month of 2015
than the number that is typically diagnosed in a full year.
Vaccines and Autism

The Cost of Reproducibility
• Necessary to fill in the gaps
• 2 months of effort in reproducing published method [Kinnings et al, PLOS 2010]
• Authors' expertise was required
Comparison of ligand binding sites
Comparison of dissimilar protein structures
Graph network generation
Molecular Docking
[Garijo et al PLOS]
Collaboration with UCSD

Scientist-Driven Science
Scientist
Scientists:
• Keep their own records
• Write their own software
• Data cleaning
• Reformatting
• Analysis
• Run the experiments
• Manually analyze results and compare to state of the art
Scientist + Automated Tools
Automated Tools help:
• Searching
• Setting up execution
• Visualizing
• Sharing
Requirements:
• Data/Dataset metadata
• Software/Software metadata
• Method description
• User/domain expertise
Scientist + Intelligent System
Intelligent Systems help:
• Comparing
• Reusing/Repurposing
• Testing new hypotheses
• Explaining results
Requirements:
• Functionality
• Relations between data, software and method
• Provenance
Context of a computational experiment

Outline
• Capturing and publishing context of computational experiments
• From scientific workflows to Linked Data
• Capturing software functionality
• Representing software metadata
• Using context to facilitate reusability and exploration of experiments
• Detecting commonalities among experiments
• Explaining computational results
• Using context in Intelligent Systems
• Hypothesis testing
• Environmental sciences modeling
• A vision for context capture in computer-driven science

Introduction
Lab book
Digital Log
Laboratory Protocol (recipe)
Scientific Workflow
Experiment
In silico experiment
Background: Computational Experiments


Workflow representation: Structures interchanged in the workflow lifecycle
Workflow Lifecycle: Design → Instantiation → Execution
• Workflow Template (Design): Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult
• Workflow Instance (Instantiation): File: Dataset123 → LovinsStemmer algorithm → Id:resultaa1 → IDF algorithm → Id:fresultaa2 (a second instance uses File: Dataset124 and the PorterStemmer algorithm)
• Workflow Execution Trace (Execution): File: Dataset123 → LovinsStemmer execution → Id:resultaa1 → IDF execution → Id:fresultaa2 (likewise for File: Dataset124 with a PorterStemmer execution; the same instance can be executed several times)
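The three lifecycle structures can be pictured as plain records that point back at each other. The sketch below is only illustrative: the class and field names (`Template`, `Instance`, `Trace`, `bindings`, `outputs`) and the template name are assumptions made for this example, not the data model of any workflow system.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Template:
    """Design: abstract steps, with no concrete data or code bound yet."""
    name: str
    steps: list[str]


@dataclass
class Instance:
    """Instantiation: each template step is bound to a concrete algorithm."""
    template: Template
    input_file: str
    bindings: dict[str, str]  # template step -> chosen algorithm


@dataclass
class Trace:
    """Execution: which instance ran and which result ids it produced."""
    instance: Instance
    outputs: dict[str, str] = field(default_factory=dict)  # step -> result id


template = Template("text-analysis", ["Stemmer", "Term weighting"])
instance = Instance(template, "Dataset123",
                    {"Stemmer": "LovinsStemmer", "Term weighting": "IDF"})
trace = Trace(instance, {"Stemmer": "resultaa1", "Term weighting": "fresultaa2"})

# Every trace can be followed back to the template it was derived from.
print(trace.instance.template.name)
```

Keeping the back-pointers explicit is what later makes it possible to ask which executions belong to which template.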

Requirements
Workflow template description
Workflow execution trace description
Workflow attribution
Workflow metadata
Link between templates and executions
Requirements for workflow Representation
[Garijo et al., 2017 FGCS]
Plan: P-Plan [Garijo et al 2012]
http://purl.org/net/p-plan
Provenance: PROV (W3C)
[Lebo et al 2013]
http://www.w3.org/ns/prov#
Dublin Core, PROV (W3C)

OPMW: Extending provenance standards and plan models
A Vocabulary for Workflow Representation: OPMW
http://www.opmw.org/ontology/
[Figure: OPMW class diagram. Legend: class, object property, instance, "instance of", "subclass of"]
• P-Plan extension: opmw:WorkflowTemplate, opmw:WorkflowTemplateProcess, opmw:WorkflowTemplateArtifact (extending p-plan:Plan, p-plan:Step, p-plan:Variable)
• PROV, OPM extension: opmw:WorkflowExecutionProcess, opmw:WorkflowExecutionArtifact, opmw:WorkflowExecutionAccount (extending prov:Activity, prov:Entity, prov:Bundle and opmv:Process, opmv:Artifact, opmo:Account)
• Object properties: opmw:isVariableOfTemplate, opmw:isStepOfTemplate, opmw:correspondsToTemplate, opmw:correspondsToTemplateArtifact, opmw:correspondsToTemplateProcess, p-plan:hasInputVar, p-plan:isOutputVarOf, prov:used, prov:wasGeneratedBy, opmo:account
• Example instances: template1 (Input Dataset, Term Weighting, Topics); execution1 (File: Dataset123, IDF (java), File: FResultaa2)
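To make the vocabulary more concrete, here is a minimal sketch that writes a few OPMW and PROV statements as N-Triples using only the standard library. The class and property names come from the slide; the instance names (`TEMPLATE1`, `EXECUTION1`, `IDF_RUN`, `DATASET123`, `FRESULTAA2`) are hypothetical, and a real export would normally use an RDF library rather than string formatting.

```python
OPMW = "http://www.opmw.org/ontology/"
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://www.opmw.org/export/resource/"

# Hypothetical instance URIs, invented for this example.
template = BASE + "WorkflowTemplate/TEMPLATE1"
account = BASE + "WorkflowExecutionAccount/EXECUTION1"
process = BASE + "WorkflowExecutionProcess/IDF_RUN"
dataset = BASE + "WorkflowExecutionArtifact/DATASET123"
result = BASE + "WorkflowExecutionArtifact/FRESULTAA2"

triples = [
    (template, RDF_TYPE, OPMW + "WorkflowTemplate"),
    (account, RDF_TYPE, OPMW + "WorkflowExecutionAccount"),
    (account, OPMW + "correspondsToTemplate", template),
    (process, RDF_TYPE, OPMW + "WorkflowExecutionProcess"),
    (process, PROV + "used", dataset),
    (result, PROV + "wasGeneratedBy", process),
]


def to_ntriples(rows):
    """Serialize (subject, predicate, object) URI triples, one per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in rows)


print(to_ntriples(triples))
```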

Publishing workflows as Linked Data
Specification
Why Linked Data?
• Facilitates exploitation of workflow resources in a homogeneous manner
Adapted methodology from [Villazón-Terrazas et al 2011]
Tested it for the WINGS workflow system
1
Base URI = http://www.opmw.org/
Ontology URI = http://www.opmw.org/ontology/
Assertion URI = http://www.opmw.org/export/resource/ClassName/instanceName
Examples:
http://www.opmw.org/export/resource/WorkflowTemplate/ABSTRACTSUBWFDOCKING
http://www.opmw.org/export/resource/WorkflowExecutionAccount/ACCOUNT1348629350796
Publishing scientific workflows as Linked Data
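The URI scheme in the specification step can be captured in a small helper. This is a sketch, assuming that class and instance names are simply appended to the base path; the percent-encoding of unusual characters is an added precaution, not something the slides specify.

```python
from urllib.parse import quote

BASE_URI = "http://www.opmw.org/"


def assertion_uri(class_name: str, instance_name: str) -> str:
    """Build BASE_URI/export/resource/ClassName/instanceName."""
    return (f"{BASE_URI}export/resource/"
            f"{quote(class_name, safe='')}/{quote(instance_name, safe='')}")


print(assertion_uri("WorkflowTemplate", "ABSTRACTSUBWFDOCKING"))
```

For the first example on the slide this reproduces the published URI.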

Why Linked Data?
Steps: 1. Specification, 2. Modeling
Modeling vocabularies: OPMW, P-Plan, OPM, DC, PROV

Why Linked Data?
Steps: 1. Specification, 2. Modeling, 3. Generation
Generation components: Workflow system (Workflow Template, Workflow execution), OPMW export, OPMW RDF

Why Linked Data?
Steps: 1. Specification, 2. Modeling, 3. Generation, 4. Publication
Publication components: OPMW RDF, RDF Upload Interface, RDF triple store, permanent web-accessible file store, SPARQL endpoint

Why Linked Data?
Steps: 1. Specification, 2. Modeling, 3. Generation, 4. Publication, 5. Exploitation
Exploitation: Curl, Linked Data Browser, SPARQL endpoint, Workflow explorer
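Exploitation through a SPARQL endpoint might look like the sketch below, which only builds the request URL and does not contact any server. The endpoint address is a placeholder (the slides do not give one); the `query` parameter follows the standard SPARQL protocol for GET requests, and `opmw:WorkflowTemplate` is the class named in the vocabulary.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ENDPOINT = "http://example.org/sparql"  # placeholder, not the real endpoint

QUERY = """\
PREFIX opmw: <http://www.opmw.org/ontology/>
SELECT ?template WHERE {
  ?template a opmw:WorkflowTemplate .
} LIMIT 10
"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again returns the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
```

The resulting URL could be passed to a tool such as curl, which is one of the exploitation routes listed on the slide.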

Outline
• Machine learning analysis

Capturing software functionality
[Garijo et al 2014a] (Collaboration with U. of Manchester)
Is it possible to generalize workflow steps based on their functionality in an
experiment?
• What kind of data manipulations are performed in a workflow?
• E.g.:
• Data retrieval
• Data preparation
• Data curation
• Data visualization
• etc.

Capturing software functionality
[Garijo et al 2014a] (Collaboration with U. of Manchester)
Analyzed software steps of 260 workflows from 4 different workflow systems
Created a catalog of workflow step functionalities (motifs)
Guidelines for annotating workflows
Catalog available at: http://purl.org/net/wf-motifs#
[Figure: workflows analyzed per workflow system: 89 + 125 + 26 + 20 = 260 workflows]
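Once steps are annotated with motifs, simple statistics over a corpus become easy to compute. The sketch below is a toy illustration: the two workflows and their step names are invented, and only the motif labels (data retrieval, preparation, curation, visualization) come from the slides.

```python
from collections import Counter

# Hypothetical annotations: workflow id -> {step name: motif}.
annotated = {
    "wf1": {"fetch_records": "Data retrieval",
            "clean": "Data preparation",
            "plot": "Data visualization"},
    "wf2": {"load": "Data retrieval",
            "normalize": "Data preparation",
            "filter": "Data preparation",
            "dedupe": "Data curation"},
}


def motif_frequencies(workflows):
    """Fraction of workflows in which each motif appears at least once."""
    counts = Counter()
    for steps in workflows.values():
        counts.update(set(steps.values()))  # count a motif once per workflow
    return {motif: n / len(workflows) for motif, n in counts.items()}


frequencies = motif_frequencies(annotated)
print(frequencies)
```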


Capturing Software Metadata
[Gil et al 2015]
• Scientific workflows capture some software metadata
• A large amount of software is not used in scientific workflows
• Software in open repositories often has missing metadata
• How to use it?
• What can I use it with?
• What are the dependencies?
• Is it still maintained?
• How can I contribute?
• …
• Ontology for scientific software metadata
• Described with scientists in mind:
• How can scientists contribute to populating it?
• What do scientists need in terms of software?

Software Metadata: Categories
Used in the OntoSoft metadata registry:
http://ontosoft.org/portals
http://ontosoft.org/software

Using the ontology in the OntoSoft software registry
• Software entries from distributed repositories are readily accessible
• Semantic search
• Comparison matrix of software entries (PIHM, PIHMgis, DrEICH, TauDEM, WBMsed)
• Metadata completion highlighted
• Software is contrasted by property
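The comparison matrix and the completion indicator can be approximated with a few lines of code. This is a sketch under stated assumptions: the property names are loosely derived from the questions scientists ask about software (how to use it, dependencies, maintenance, contribution), and the entries `ModelA` and `ModelB` are invented placeholders, not real registry records.

```python
# Hypothetical property names, not the actual ontology terms.
PROPERTIES = ["how_to_use", "dependencies", "maintained", "how_to_contribute"]

# Invented entries used only to exercise the functions.
entries = {
    "ModelA": {"how_to_use": "see README", "dependencies": "libfoo",
               "maintained": "yes"},
    "ModelB": {"how_to_use": "see manual"},
}


def completion(entry):
    """Fraction of the expected properties that have a value."""
    return sum(1 for p in PROPERTIES if entry.get(p)) / len(PROPERTIES)


def comparison_matrix(software):
    """One row per entry, one column per property, gaps marked explicitly."""
    return [[name] + [entry.get(p, "MISSING") for p in PROPERTIES]
            for name, entry in software.items()]


for row in comparison_matrix(entries):
    print(row)
print({name: completion(entry) for name, entry in entries.items()})
```

Marking gaps explicitly rather than hiding them is what lets a registry highlight which metadata still needs to be contributed.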


Detecting commonalities in computational experiments
[Garijo et al 2014b]
PROBLEMS to address:
• Workflows have many detailed steps and may be difficult to understand
• The general method may not be apparent
• How are different workflows related?
• What steps do they have in common?
[Figure: Workflow 1, Workflow 2 and Workflow 3, built from steps A to H, with their common workflow fragments highlighted]

A method for detecting reusable workflow fragments
[Figure: the workflow Dataset → Stemmer algorithm → Result → Term weighting algorithm → FinalResult, shown next to its steps Stemmer algorithm → Term weighting algorithm]
• Duplicated workflows are removed
• Single-step workflows are removed

Popular graph mining techniques
• Inexact FSM (frequent subgraph mining): uses heuristics to calculate the similarity between two graphs. The solution might not be complete
• Exact FSM: delivers all the possible fragments to be found in the dataset

Remove redundant fragments

Link fragments back to the workflows where they were found
http://purl.org/net/wf-fd
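The fragment detection steps (data preparation, mining, removal of redundant fragments, linking back to workflows) can be illustrated on a toy corpus. The sketch below is a deliberate simplification and not the method used in the cited work: workflows are assumed to be linear step sequences, and frequent contiguous subsequences stand in for frequent subgraph mining. The example workflows are invented.

```python
from collections import defaultdict


def prepare(workflows):
    """Drop single-step workflows and exact duplicates."""
    seen, kept = set(), {}
    for wf_id, steps in workflows.items():
        key = tuple(steps)
        if len(key) > 1 and key not in seen:
            seen.add(key)
            kept[wf_id] = key
    return kept


def mine(workflows, min_support=2):
    """Find fragments of 2+ steps and record the workflows containing them."""
    found = defaultdict(set)
    for wf_id, steps in workflows.items():
        for i in range(len(steps)):
            for j in range(i + 2, len(steps) + 1):
                found[steps[i:j]].add(wf_id)
    return {f: ids for f, ids in found.items() if len(ids) >= min_support}


def contains(big, small):
    """True if `small` occurs as a contiguous run inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def remove_redundant(fragments):
    """Drop a fragment when a longer one occurs in exactly the same workflows."""
    return {
        frag: ids for frag, ids in fragments.items()
        if not any(len(other) > len(frag) and other_ids == ids
                   and contains(other, frag)
                   for other, other_ids in fragments.items())
    }


corpus = {
    "wf1": ["A", "B", "C"],
    "wf2": ["A", "B", "C", "G"],
    "wf3": ["A", "B", "F"],
    "wf4": ["A", "B", "C"],  # duplicate of wf1
    "wf5": ["H"],            # single step
}
fragments = remove_redundant(mine(prepare(corpus)))
print(fragments)
```

The returned mapping already links each fragment to the workflows where it was found, which corresponds to the final step of the method.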

Research question: Are our proposed workflow fragments useful?
• A fragment is useful if it has been designed and (re)used by a user.
• Comparison between proposed fragments and user-designed fragments (groupings) and workflows
Workflow fragment assessment

Metrics: Precision and recall
Sets compared: Fragments (F), Workflows (W), Groupings (G)
Workflow corpora
User Corpus 1 (WC1)
• Designed mostly by a single user
• 790 workflows (475 after data preparation)
User Corpus 2 (WC2)
• Created by a user, in collaboration with others
Multi User Corpus 3 (WC3)
• Workflows submitted by 62 users during the month of Jan 2014
User Corpus 4 (WC4)
• Designed mostly by a single user

Result assessment
• 30%-60% of proposed fragments are equal to user-defined groupings or workflows
• 40%-80% of proposed fragments are equal or similar to user-defined groupings or workflows
Commonly occurring patterns are potentially useful for users designing workflows
What about the rest of the fragments? Are those useful?

User feedback: user survey
Q1: Would you consider the proposed fragment a valuable grouping?
• I would not select it as a grouping (0)
• I would use it as a grouping with major changes (i.e., adding/removing more than 30% of the steps) (1)
• I would use it as a grouping with minor changes (i.e., adding/removing less than 30% of the steps) (2)
• I would use it as a grouping as it is (3)
Q2: What do you think about the complexity of the fragment?
• The fragment is too simple (0)
• The fragment is fine as it is (1)
• The fragment has too many steps (2)
Not enough evidence to state that all proposed workflow fragments are useful


Using captured context to explain results
[Gil and Garijo 2016]
Current method descriptions in papers are ambiguous, incomplete and described at inconsistent levels of detail
Comparison of ligand binding sites
Comparison of dissimilar protein structures
Graph network generation
Molecular Docking
The SMAP software was used to compare the binding sites of the 749 M.tb protein structures plus 1,446 homology models (a total of 2,195 protein structures) with the 962 binding sites of 274 approved drugs, in an all-against-all manner. While the binding sites of the approved drugs were already defined by the bound ligand, the entire protein surface of each of the 2,195 M.tb protein structures was scanned in order to identify alternative binding sites. For each pairwise comparison, a P-value representing the significance of the binding site similarity was calculated.

Using captured context to explain results
[Gil and Garijo 2016]
Current method descriptions in papers are ambiguous, incomplete and described at inconsistent levels of detail
Goal: Automatically generate reports from computer-generated data analysis records
• Reports must:
• Be truthful to actual events
• Enable inspection
• Be human-understandable
• Abstract details
• Ideally:
• Become part of papers
• Have persistent evidence
• Be adapted to different audiences/expertise/purpose

Data Narratives
1. A record of events that describe a new result
• A workflow and/or provenance of all the computations executed
2. Persistent entries for key entities involved
• URIs/DOIs for data, software versions, workflow,…
3. Narrative account(s)
• Human-consumable rendering(s) that includes pointers to the detailed
records and entries
• Each account is generated for a different audience/purpose
• A casual reader, a close colleague, someone inspecting how the work
was done, someone reproducing the work

Data Narrative Accounts: An example
• Execution view
• Inputs, parameters and main outputs
“Topic modeling was run on the Reuters R8 dataset (10.6084/m9.figshare.776887), and English Words dataset (10.6084/m9.figshare.776888), with iterations set to 100, stop word size set to 3, number of topics set to 10 and batch size set to 10. The results are at 10.6084/m9.figshare.776856”
• Data view
• Just the data that influenced the results
“The topics at 10.6084/m9.figshare.776856 were found in the Reuters R8 dataset (10.6084/m9.figshare.776887) and English Words dataset (10.6084/m9.figshare.776888)”
• Method view
• Main steps based on their functionality
“Topic training was run on the input dataset. The results are product of PlotTopics, a visualization step”

Data Narrative Accounts: An example
• Dependency view
• How the steps depend on each other
“First, the input data is filtered by Stop Words, followed by Small Words, Format Dataset, and Train Topics. The final results are produced by Plot Topics”
• Implementation view
• How the steps were implemented in the execution
“Train topics was implemented using Latent Dirichlet allocation”
• Software view
• Details on the software used to implement the steps
“The train topics step was generated with Online LDA open source software, written in Java. Plot topics was generated with the Termite software.”
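The example accounts suggest that each view is a template filled in from the same execution record. The sketch below illustrates that idea for the execution and data views, using the dataset names, DOIs and parameter values quoted in the example. The record layout and function names are assumptions made for this illustration, not the actual generator.

```python
# Record layout is hypothetical; the values are those quoted in the example.
record = {
    "method": "Topic modeling",
    "inputs": [("Reuters R8 dataset", "10.6084/m9.figshare.776887"),
               ("English Words dataset", "10.6084/m9.figshare.776888")],
    "parameters": {"iterations": 100, "stop word size": 3,
                   "number of topics": 10, "batch size": 10},
    "result": "10.6084/m9.figshare.776856",
}


def join_phrases(items):
    """Join phrases as 'a, b and c'."""
    items = list(items)
    if len(items) < 2:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def describe_inputs(r):
    return join_phrases(f"the {name} ({doi})" for name, doi in r["inputs"])


def execution_view(r):
    """Inputs, parameters and main outputs."""
    params = join_phrases(f"{k} set to {v}" for k, v in r["parameters"].items())
    return (f"{r['method']} was run on {describe_inputs(r)}, with {params}. "
            f"The results are at {r['result']}")


def data_view(r):
    """Just the data that influenced the results."""
    return f"The topics at {r['result']} were found in {describe_inputs(r)}"


print(execution_view(record))
print(data_view(record))
```

Because both views read from one record, they cannot drift apart, which supports the requirement that reports stay truthful to the actual events.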

DANA: DAta NArratives
[Figure: DANA architecture. Components: Experiment Records, Provenance Repository, Software registry, Experiment-specific Knowledge Base, Query patterns, Data Narrative aggregator, DANA Generator, Narrative accounts. Interactions: input, resource request/response, get pattern, get query pattern result, output]
1. Identify which experiment records to describe
2. Generation of an Experiment-specific knowledge base
3. Creation of the Data Narrative from templates
4. Produce narrative accounts
https://knowledgecaptureanddiscovery.github.io/DataNarratives/

Formative evaluation
• Survey with 6 target scenarios
• Each scenario:
• Description of a situation where a user has to do a task
• A workflow sketch of the analysis done
• Six candidate narratives of that workflow sketch.
• 12 responses from users
• Results
• Each narrative is considered appropriate for describing some scenario
• Different users chose different narratives for each scenario


Using Context for Hypothesis Testing
[Gil et al 2016]
[Figure: hypothesis testing loop. Labels: data; hypothesis "Protein PRKCDBP is expressed in samples of patient P36"; revision "PRKCDBP mutation is expressed in P36"; workflows (Wf0, Wf1, Wf2); meta-workflows (comparison of simMetrics; hypothesisRevision, taking hypothesis and producing revisedHyp)]

Hypothesis Testing: My Contribution
[Garijo et al 2017]
[Figure: two versions of a hypothesis, HG1 and HG2]
• Statement: HS1 (Protein EGFR AssociatedWith Colon Cancer); HS2 (Protein EGFR AssociatedWith Colon Cancer SubtypeA)
• Qualifier: HQ1 (hasConfidenceReport C1, hasConfidenceLevel L1); HQ2 (hasConfidenceReport C1, hasConfidenceLevel L2)
• Evidence: HE1, HE2
• History: revisionOf; wasGeneratedBy Execution 1, Execution 2
The DISK Ontology: http://disk-project.org/ontology/disk/
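The idea of tracking how a hypothesis evolves can be sketched with a small linked structure. This is an illustration only: the class and field names below are invented for the example and are not the terms of the DISK ontology, and the assignment of confidence levels and executions to each version is an assumption based on the figure labels.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    statement: tuple[str, str, str]   # (subject, relation, object)
    confidence_level: str             # label only, e.g. "L1"
    generated_by: str                 # execution that produced this version
    revision_of: Optional[Hypothesis] = None


def history(h: Optional[Hypothesis]) -> list[Hypothesis]:
    """Follow revision links back to the original, oldest version first."""
    chain = []
    while h is not None:
        chain.append(h)
        h = h.revision_of
    return list(reversed(chain))


h1 = Hypothesis(("Protein EGFR", "associatedWith", "Colon Cancer"),
                "L1", "Execution 1")
h2 = Hypothesis(("Protein EGFR", "associatedWith", "Colon Cancer SubtypeA"),
                "L2", "Execution 2", revision_of=h1)

for version in history(h2):
    print(version.statement, version.confidence_level, version.generated_by)
```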

Using Context for Environmental Sciences Modeling
Work in progress
• Modeler wants to predict a situation
• E.g., impact of drought in the Amazon
• Intelligent system assists:
• Finding data of interest
• Connecting environmental models: hydrology, economy, agronomy, etc.
• Facilitating the execution of models
• Visualizing results
My contribution:
• Extending our software ontology to capture requirements of environmental models
• Relating variables to inputs, units, time, etc.
[Figure: an Intelligent System takes a Scenario, a Data Catalog and a Model Catalog, and connects a Climate Model, a Hydrology Model, an Economy model and others through variables and predictions. Variables shown: albedo, soil moisture, soil quality, precipitation, commodity prices, property rights, market access, crop/forest yields, land use, household type]

Where are we headed?
Scientist-Driven Science → Computer-Driven Science
Scientist → Scientist + Automated Tools → Scientist + Intelligent System → Intelligent System + Scientist
• Can an Intelligent System co-author a paper? Can it be an author?
• Can it win a Nobel prize? [Kitano, ISWC 2016]
• What do we need to capture (in Software, Data, Methods, Provenance)?
1. Functionality and abstraction
2. Granularity
3. Importance

Next steps for context capture in computational experiments
• Capturing different levels of abstraction in experiments
• Using user expertise to curate captured context
• What do users consider important?
• Improve explanation of details
• How can we identify the core function of a software step?
• Represent the goal and objectives of a computational experiment

Summing up
• Context is needed to understand and reuse computational experiments
• Sharing context from computational experiments
• Scientific workflows and their executions
• Software functionality and metadata
• Getting value out of context
• Reusability, exploration, explanation
• Used to power intelligent systems!
• Next steps
• Representing functionality and levels of abstraction
• Interact with users to curate context

Special thanks
• Yolanda Gil
• Varun Ratnakar
• Oscar Corcho
• Pinar Alper
• Khalid Belhajjame
• Asuncion Gomez Perez
• Idafen Santana Perez
• Felisa Verdejo
• Francisco Garijo

References
• [Kinnings et al, PLOS 2010]: Kinnings SL, Xie L, Fung KH, Jackson RM, Xie L, Bourne PE (2010) The Mycobacterium tuberculosis Drugome and Its Polypharmacological Implications. PLoS Comput Biol 6(11): e1000976. https://doi.org/10.1371/journal.pcbi.1000976
• [Garijo et al PLOS]: Garijo D, Kinnings S, Xie L, Xie L, Zhang Y, Bourne PE, et al. (2013) Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome. PLoS ONE 8(11): e80278. https://doi.org/10.1371/journal.pone.0080278
• [Garijo et al 2014a]: Garijo, D.; Alper, P.; Belhajjame, K.; Corcho, O.; Gil, Y.; and Goble, C. Common motifs in scientific workflows: An empirical analysis. Future Generation Computer Systems, 36: 338--351. 2014.
• [Garijo et al 2014b]: Garijo, D.; Corcho, O.; Gil, Y.; Gutman, B. A; Dinov, I. D; Thompson, P.; and Toga, A. FragFlow: Automated fragment detection in scientific workflows. In e-Science (e-Science), 2014 IEEE 10th International Conference on, volume 1, pages 281--289, 2014. IEEE
• [Gil and Garijo 2016]: Gil, Y.; and Garijo, D. Towards Automating Data Narratives. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, pages 565--576, 2017. ACM
• [Garijo et al 2017]: Garijo, D.; Gil, Y.; and Ratnakar, V. The DISK Hypothesis Ontology: Capturing Hypothesis Evolution for Automated Discovery. In Proceedings of the Workshop on Capturing Scientific Knowledge (SciKnow), held in conjunction with the ACM International Conference on Knowledge Capture (K-CAP), Austin, Texas, 2017.
• [Garijo et al 2017 FGCS]: Garijo, D.; Gil, Y.; and Corcho, O. Abstract, link, publish, exploit: An end to end framework for workflow sharing. Future Generation Computer Systems. 2017.
• [Gil et al 2015]: Gil, Y.; Ratnakar, V.; and Garijo, D. OntoSoft: Capturing scientific software metadata. In Proceedings of the 8th International Conference on Knowledge Capture, pages 32, 2015. ACM
• [Kitano ISWC 2016]: Kitano, H. Artificial Intelligence to Win the Nobel Prize and Beyond: Creating the Engine for Scientific Discovery. Keynote http://iswc2016.semanticweb.org/pages/program/keynote-kitano.html
