1. The Investigation/Study/Assay (ISA) metadata framework for reproducible and reusable bioscience research
Alejandra González-Beltrán, PhD, on behalf of the ISA Team, Oxford e-Research Centre, University of Oxford
Faculty of Technology, Environment and Engineering, Birmingham City University, 12th March 2013
2. Ioannidis et al., Repeatability of published microarray gene expression analyses. Nature Genetics 41(2), 149-55 (2009). doi:10.1038/ng.295
8. Need for a generic representation, applied to:
• microarray-based experiments (MAGE)
• sequencing-based experiments (SRA)
• flow cytometry-based experiments (FuGE-Flow Cyt)
• mass spectrometry and NMR spectroscopy experiments (MetaboLights and PRIDE)
9. Roadmap
Reproducible & Reusable Bioscience Research
10. Roadmap
reasoning, visualization, analysis, browsing, integration, exchange, retrieval
Well-annotated & Structured Data
Reproducible & Reusable Bioscience Research
11. Roadmap
reasoning, visualization, analysis, browsing, integration, exchange, retrieval
Well-annotated & Structured Data
Reproducible & Reusable Bioscience Research
User community
12. Roadmap
reasoning, visualization, analysis, browsing, integration, exchange, retrieval
Community Standards, Software Tools
Well-annotated & Structured Data
Reproducible & Reusable Bioscience Research
User community
15. Bioscience is multi-domain… health, env, agro, tox/pharma
Interdisciplinary and integrative in character
• need to deal with new and existing datasets
• deal with a variety of data types
Source of the figure: EBI website
16. Multiple communities, multiple norms and standards, e.g.:
• use the same term to refer to the same ‘thing’
• allow data to flow from one system to another
• report the same core, essential information
Challenges: lack of interaction and coordination, duplication of effort,
fragmentation and uneven coverage…hinders interoperability
18. But… what do we know about them and how they are related?
MAGE-Tab, AAO, MIAME, GCDML, MIAPA, CHEBI, GIATE, SRAxml, OBI, MIRIAM, VO, SOFT, MIQAS, FASTA, PATO, MIX, CML, ENVO, REMARK, DICOM, MIGEN, GELML, MOD, SBRML, MIAPE, MIQE, TEDDY, MITAB, MzML, XAO, CIMR, CONSORT, BTO, ISA-Tab, SEDML…, DO, PRO, IDO…, MIASE, MISFISHIE…
19. But… what do we know about them and how they are related?
• I use high throughput sequencing technologies, which ones are relevant to me?
• Which tools and databases implement which standards?
• How can I get involved to propose extensions or modifications?
• What are the criteria to evaluate their status and value?
• Which ones are mature enough for me to use or recommend?
• Which formats support specific minimum information guidelines?
• I work on plants, are these standards just for biomedical applications?
20. A coherent, curated and searchable catalogue of data sharing resources
• Bioscience standards and associated data-sharing policies, publications, tools and databases
• Assessment criteria for usability and popularity of standards
• Relationships among standards
• Encouragement for communication & interaction among groups
• Promoting interoperability & informed decisions about standards
21. Infrastructure
22. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level
Rocca-Serra et al., 2010, Bioinformatics
• Assist in the annotation and management of experimental metadata at source, supporting data provenance tracking
• Deal with high-throughput studies using one or a combination of omics and other technologies
• Empower users to take up community-defined checklists and ontologies
• Facilitate data sharing, re-use, comparison and reproducibility of experiments, and submission to international public repositories
27. faahKO dataset
• Available in Bioconductor
• Subset of the original data on global metabolite profiling (Saghatelian et al., Biochemistry, 2004)
• LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice
28. faahKO investigation
- Define key entities (e.g. factors, protocols, parameters)
- Grouping of studies
- Relate studies and assays
29. faahKO study
- Subjects studied: source(s), sampling methodology, characteristics
- treatments/manipulations performed to prepare the specimens
NEWT UniProt Taxonomy Database; Mouse Genome Informatics
30. faahKO study
- Subjects studied: source(s), sampling methodology, characteristics
- treatments/manipulations performed to prepare the specimens
Mouse Adult Gross Anatomy
31. faahKO assay
- measurement type, e.g. metabolite profiling
- technology, e.g. mass spectrometry
34. Create template(s) to fit the type of experiments to be described
Create templates detailing the steps to be reported for different investigations, complying with community standards, e.g. configuring the value(s) allowed for each field to be
• text (with/without regular expression testing),
• ontology terms,
• numbers etc.
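The idea of configuring allowed values per field can be illustrated with a minimal Python sketch. It is not the ISA configuration format or the ISA tools' API: the `FieldRule` class, the `is_valid` function and the example field rules are all hypothetical names introduced for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class FieldRule:
    """Hypothetical rule describing what one template field may contain."""
    name: str
    kind: str                                  # "text", "ontology" or "number"
    pattern: Optional[str] = None              # optional regular expression for text fields
    allowed_terms: Optional[Set[str]] = None   # accepted labels for ontology fields


def is_valid(rule: FieldRule, value: str) -> bool:
    """Return True if the value satisfies the rule configured for its field."""
    if rule.kind == "text":
        # Without a pattern any text is accepted; with one, the whole value must match.
        return rule.pattern is None or re.fullmatch(rule.pattern, value) is not None
    if rule.kind == "ontology":
        return rule.allowed_terms is not None and value in rule.allowed_terms
    if rule.kind == "number":
        try:
            float(value)
            return True
        except ValueError:
            return False
    raise ValueError(f"unknown field kind: {rule.kind}")


# Illustrative template with one field of each kind (assumed, not from the slides).
template = [
    FieldRule("Sample Name", "text", pattern=r"sample\d+"),
    FieldRule("Characteristics[organism]", "ontology",
              allowed_terms={"Homo sapiens", "Mus musculus"}),
    FieldRule("Factor Value[dose]", "number"),
]

row = {
    "Sample Name": "sample1",
    "Characteristics[organism]": "Mus musculus",
    "Factor Value[dose]": "12",
}

# Names of the fields whose values break their rule.
errors = [rule.name for rule in template if not is_valid(rule, row[rule.name])]
print(errors)
```

Keeping the rules as data rather than code is the point of the sketch: a curator could change what a field accepts without touching the validation logic.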
35. Describe and curate your experiment using a desktop-based tool
Report and edit the description using this tool (also customized using the templates), which has a spreadsheet-like look and feel and is packed with functionalities such as
• ontology search (access via )
• term-tagging features
• import from spreadsheets etc…
36. OntoMaton: a Bioportal powered Ontology widget for Google Spreadsheets
Maguire et al., 2013, Bioinformatics
• Ontology search and automated tagging (relying on NCBO Bioportal services) on Google Spreadsheets
• Collaborative annotation; support for distributed users
• Version control & history
41. • R package available in BioConductor 2.11: http://bioconductor.org/packages/release/bioc/html/Risa.html
• ISAtab class
• Read ISAtab files into ISAtab objects and write ISAtab files back to disk
• Increment metadata with definition factors/treatments/groups
• Build xcmsSet (xcms package) objects from mass spectrometry assays
• Augment the ISAtab dataset after analysis
• source & issues tracking: https://github.com/ISA-tools/Risa
42. • faahKO package v. 2.12 contains ISAtab files describing the experiment
library(Risa)
# read the ISA-Tab files shipped with the faahKO package
faahkoISA = readISAtab(find.package("faahKO"))
assay.filename <- faahkoISA["assay.filenames"][[1]]
# build an xcmsSet object from the mass spectrometry assay
xset = processAssayXcmsSet(faahkoISA, assay.filename)
…
# record the derived data file in the assay metadata
updateAssayMetadata(faahkoISA, assay.filename, "Derived Spectral Data File", "faahkoDSDF.txt")
• MTBLS2 processing and analysis using Risa, xcms and CAMERA BioConductor packages
Metabolights – an open access general-purpose repository for
metabolomics studies and associated meta-data
Haug et al, 2012
Nucleic Acids Research
44. Sample Name | Material Type | Hybridization Assay Name | Array Design REF | Array Data File | Protocol REF | Derived Array Data File
sample1 | genomic DNA | assay1 | A-AFFY-107 | assay1.cel | data normalization | assay1.txt
sample2 | genomic DNA | assay2 | A-AFFY-107 | assay2.cel | data normalization | assay2.txt
sample3 | genomic DNA | assay3 | A-AFFY-107 | assay3.cel | data normalization | assay3.txt
Figure: material transformations. Material nodes and Data File nodes are linked by Protocol (process) steps. Labels shown: Characteristics[…], Material Type, Comment[…], Factor Value[…] (independent variables), Parameter Value[…], Performer (operator effect), Date (day effect), Raw Data File, Derived Data File.
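The reading of an assay table as a chain of material and data nodes can be sketched in a few lines of Python. This is an illustration only, not how the ISA tools parse ISA-Tab: the choice of which columns count as nodes (`NODE_COLUMNS`) and the helper `row_to_edges` are assumptions made for this sketch, and the column headers follow the usual ISA-Tab naming. The cell values are the ones on this slide.

```python
import csv
import io

# Tab-delimited excerpt mirroring the slide's assay table.
ASSAY_TABLE = "\n".join([
    "\t".join(["Sample Name", "Material Type", "Hybridization Assay Name",
               "Array Design REF", "Array Data File", "Protocol REF",
               "Derived Array Data File"]),
    "\t".join(["sample1", "genomic DNA", "assay1", "A-AFFY-107",
               "assay1.cel", "data normalization", "assay1.txt"]),
    "\t".join(["sample2", "genomic DNA", "assay2", "A-AFFY-107",
               "assay2.cel", "data normalization", "assay2.txt"]),
    "\t".join(["sample3", "genomic DNA", "assay3", "A-AFFY-107",
               "assay3.cel", "data normalization", "assay3.txt"]),
])

# Assumption for this sketch: these columns are graph nodes, read left to right;
# every other column is treated as an attribute or a process label.
NODE_COLUMNS = ["Sample Name", "Hybridization Assay Name",
                "Array Data File", "Derived Array Data File"]


def row_to_edges(row):
    """Turn one table row into (source node, target node) pairs."""
    nodes = [(column, row[column]) for column in NODE_COLUMNS if row.get(column)]
    return list(zip(nodes, nodes[1:]))


reader = csv.DictReader(io.StringIO(ASSAY_TABLE), delimiter="\t")
edges = [edge for row in reader for edge in row_to_edges(row)]

for (src_col, src), (dst_col, dst) in edges:
    print(f"{src} ({src_col}) -> {dst} ({dst_col})")
```

Each row yields a path from the sample through the assay to the raw and derived data files, which is the provenance chain the figure depicts.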
45. Tagging: from free text to ontology-based
• single intervention representation, free text annotation
Source Name | Characteristics[organism] | Factor Value[perturbation agent] | Factor Value[dose] | Factor Value[duration]
individual1 | human | aspirin | high dose | 12 weeks
• single intervention, ontology-based annotation
Source Name | Characteristics[organism (obi:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354
Factor Value[dose (OBI_0000984)] | Term Source REF | Term Accession Number | Factor Value[time (PATO_0000165)] | Unit | Term Source REF | Term Accession Number
low dose | LNC | LP30872-3 | 12 | week | UO | 0000034
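The move from free text to ontology-based annotation can be sketched as follows. This is a minimal illustration, not the tagging logic of any ISA tool: the `TERM_LOOKUP` table and the `tag_row` function are hypothetical, and the lookup simply reuses the two term mappings shown on this slide, where a real tool would query an ontology service.

```python
# Hypothetical lookup: free-text label -> (preferred label, term source, accession).
# The entries reuse the values shown on the slide.
TERM_LOOKUP = {
    "human": ("Homo sapiens", "NCBITax", "9606"),
    "aspirin": ("aspirin", "CHEBI", "1231354"),
}

# Only these kinds of columns receive ontology annotation columns in this sketch.
ANNOTATED_PREFIXES = ("Characteristics[", "Factor Value[")


def tag_row(row):
    """Expand (header, value) pairs, adding term source and accession cells."""
    tagged = []
    for header, value in row:
        if not header.startswith(ANNOTATED_PREFIXES):
            tagged.append((header, value))
            continue
        # Unknown values keep their text and get empty annotation cells,
        # so a curator can see what still needs tagging.
        label, source, accession = TERM_LOOKUP.get(value.lower(), (value, "", ""))
        tagged.append((header, label))
        tagged.append(("Term Source REF", source))
        tagged.append(("Term Accession Number", accession))
    return tagged


free_text_row = [
    ("Source Name", "individual1"),
    ("Characteristics[organism]", "human"),
    ("Factor Value[perturbation agent]", "aspirin"),
]

tagged_row = tag_row(free_text_row)
for header, value in tagged_row:
    print(f"{header}\t{value}")
```

A list of pairs is used instead of a dictionary because the annotation headers repeat after every tagged column.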
46. ToxBank effort developed by Nina Jeliazkova
Health Care & Life Sciences Interest Group
Kohonen et al. The ToxBank Data Warehouse: a research cluster of 7 EU FP7 Health systems toxicology and toxicogenomics projects.
47. • Make the semantics of ISAtab explicit, including materials & data entities & processes & their relationships
• Provide incentives for provision of ontology-based annotations in ISA-TAB datasets; exploit those annotations
• Augment ISA syntax with new elements (e.g. groups), facilitating the understanding & querying of experimental design
• Facilitate data integration & knowledge discovery/reasoning
49. vocabularies: Chemical domain, Biomolecular domain, Information domain, Experimental domain
Source Name | Characteristics[organism (obi:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354
50. Open Biological and Biomedical Ontologies (OBO) Foundry: BFO, ChEBI, GO, IAO, OBI
Source Name | Characteristics[organism (obi:0100026)] | Term Source REF | Term Accession Number | Factor Value[chemical compound (CHEBI_37577)] | Term Source REF | Term Accession Number
individual1 | Homo sapiens | NCBITax | 9606 | aspirin | CHEBI | 1231354
53. faahKO dataset
Available in Bioconductor (with ISA-TAB metadata)
Global metabolite profiling
Data subset: LC/MS peaks from the spinal cords of 6 wild-type and 6 FAAH (fatty acid amide hydrolase) knockout mice
55. • support different conversion modes (different levels of granularity)
• querying for ISA-TAB datasets, across multiple experiment types
• reasoning exploiting ontology annotations
– semantic validation of ISA-TAB datasets
• augmented annotation over native ISA syntax
– identification of gaps in ontological representations
– feedback of findings to community ontologies
56. Increasing level of structure for experimental metadata
• Notes in Lab books
• Spreadsheets & Tables
• Facts as RDF statements (ISAtab metadata)
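The step from a spreadsheet row to facts as RDF statements can be shown with a small sketch that writes N-Triples by hand. It is not the ISA-to-RDF conversion itself: the `http://example.org/isa/` namespace, the helper names and the use of column headers as predicates are assumptions for illustration, and the row values are taken from the tagging example earlier in the deck.

```python
from urllib.parse import quote

BASE = "http://example.org/isa/"  # hypothetical namespace for this sketch


def iri(local_name):
    """Build an IRI in the example namespace, percent-encoding unsafe characters."""
    return f"<{BASE}{quote(local_name)}>"


def literal(text):
    """Build a plain N-Triples literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_triples(subject_column, row):
    """State every other cell of the row as a fact about the subject cell."""
    subject = iri(row[subject_column])
    return [
        f"{subject} {iri(column)} {literal(value)} ."
        for column, value in row.items()
        if column != subject_column
    ]


row = {
    "Source Name": "individual1",
    "Characteristics[organism]": "Homo sapiens",
    "Factor Value[chemical compound]": "aspirin",
}

triples = row_to_triples("Source Name", row)
print("\n".join(triples))
```

In a fuller conversion the predicates and values would be ontology term IRIs rather than encoded column headers and string literals, which is what makes querying and reasoning across datasets possible.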
58. Towards interoperable bioscience data
Sansone et al., 2012, Nature Genetics
A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains.