ISA tools presentation
1. The ISA software suite
Eamonn Maguire
Lead Software Engineer
eamonn.maguire@oerc.ox.ac.uk
Novartis, 21st October 2011
Tuesday, 8 November 2011
2. Who am I?
it’s rhetorical...
Irish
Formal background is Computer Science (Bachelors) and Bioinformatics (Masters)
Lead software engineer on the ISA project
DPhil Student at Oxford in Visualization in the Dept. of Computer Science
Have my own graphic design company (Antarctic Design)
Part of a small but productive and vibrant team at Oxford
headed by Susanna-Assunta Sansone.
Our work includes the ISA tools/infrastructure, MIBBI &
BioSharing.
3. What is ISA all about?
We want to enable better reporting of
experiments...
We want to make it easier for
submitters...
We want to provide tooling which
biologists will want to use...
6. What’s the problem?
Analogy time.
Each can is an experiment.
We have no labels, so no indication about what is in the can.
Could be beans. Could be peas. Could be soup.
- a different representation... a non-Latin language
Might be petit pois - a different terminology
(Tin can analogy borrowed from Norman Morrison & converted from ontologies to metadata transfer standards.)
In biology, things aren’t quite as bad as this: we have some labels, but they aren’t all in the same language. What do I mean by this? Well...
1. there is fragmentation: the formats used to describe experiments are different, e.g. MAGE-Tab, PRIDE-ML, SRA-XML, but in essence they capture much of the same information; and
2. the terminologies used to describe experiments are different, even though many concepts are shared, such as sample description. This applies to field names as well as values...
9. What’s the problem?
Can you imagine having to translate everything you write into a different language in
order to submit your data?
译 语 编 吗 转换
译 错
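An féidir leat a shamhlú go bhfuil gach rud a scríobh tú a aistriú isteach i
dteanga eile d'fhonn a chur isteach do chuid sonraí? Fiú uirlisí chomhshó,
cosúil le google translate a fháil sé mícheart.
(Irish, roughly: Can you imagine everything you write being translated into another language in order to submit your data? Even conversion tools, like Google Translate, get it wrong.)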
10. Take home point...
Repositories are making it difficult for biologists to submit data, and for others to use it.
This is particularly true for those performing multi-omic experiments: to submit, say, proteomic and
transcriptomic data, one must provide the same general information in two very different formats... why?
Well, people like to have their own formats... plus, ad hoc is easier in general.
Our solution is one general-purpose, flexible format, herein referred to as ISA-Tab:
a domain-agnostic format to capture experimental metadata in omic experiments
(transcriptomic, genomic, proteomic, metabolomic) as well as traditional experiments such as
clinical chemistry and histology.
...it works on lots of (I won’t dare say all) types of data... nutrigenomics, toxicogenomics, public
health... etc.
11. Tell me more...
investigation
high level concept to link related studies
study
the central unit, containing information on the subject under study, its characteristics and any treatments applied. A study has associated assays.
assay
test performed either on material taken from the subject or on the whole initial subject, which produces qualitative or quantitative measurements (data)
data
assays hold pointers to data file names/locations; the data stay in external files, in native or other formats
Biologists like tab. They don’t like XML. Through basic inference... ISA-Tab is good :)
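The investigation / study / assay nesting described above can be sketched as three small record types. This is only an illustration of the hierarchy, not the ISA-Tab specification or any ISA tool's object model; the class names, attribute names and file names are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """A test on material from the subject; it only points at data files."""
    measurement: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    """The central unit: the subject under study plus its assays."""
    subject: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """High level concept that links related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def data_files(self) -> List[str]:
        """Collect every data file pointer reachable from this investigation."""
        return [f for s in self.studies for a in s.assays for f in a.data_files]


# Hypothetical example content.
inv = Investigation(
    title="example investigation",
    studies=[Study(
        subject="H. Sapiens",
        assays=[
            Assay("transcription profiling", ["h1-s1.cel", "h1-s2.cel"]),
            Assay("protein expression profiling", ["h1-s1.raw"]),
        ],
    )],
)
print(inv.data_files())
```

The point of the sketch is that the metadata only references data files by name or location; the data themselves stay in their native formats.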
12. But we don’t want to do this...
http://xkcd.com/927/
13. A format on its own isn’t very much though...
Too true...the secret to adoption is to provide the tooling to enable biologists to get data
into the format, share it, convert and analyse it!
The ISAtools provide this tool support.
14. The ISA tools
Developed on top of the ISA-Tab format... modular, configurable, open source, Java based*
ISAcreator, ISAconverter & others being developed by the ISA community...
Perl parser for ISA by Bob MacCallum and Python parser for ISA by Brad Chapman
*apart from the R, Perl and Python packages of course...
15. The ISA tools... modular
Convert to ISA (ISAconverter): convert from MAGE-Tab to ISA-Tab. More formats coming soon...
Convert from ISA (ISAconverter): convert to MAGE-TAB, PRIDE-ML, SRA-XML for submission to international public repositories.
Configure: curator creates template.
Create (ISAcreator): experimentalist uses editor to report investigation.
Validate: check adherence to template.
Load: curator stores metadata in database using BII data management tool.
Browse: users browse investigations, query and view experimental metadata, and access associated data files.
Analyze: perform analysis of data in context with the metadata using the Galaxy or R analysis engines.
Requires Configuration XML
16. The ISA tools... configurable
Are you just using buzz words? Well we like buzz words as much as everyone else, but no.
We need to be configurable to support evolving checklists and requirements. Just check out
mibbi.org, lots of checklists! 32 in fact at the last count.
MIBBI is trying to harmonise these checklists to reduce redundancy and make them
interoperable.
17. Checklists...what are they?
When we report things, there are some things which are really important.
In a school report, we have the child’s name, their class, teacher, subjects taken and so on.
Well, in a biological experiment, the very same principles apply. We need information about the
sample (species, strain, age) and information about the protocols applied during the experiment
and subsequent parameters.
We have 32 checklists at present because there are differences in what is deemed important
depending on the experiment being performed.
Good reporting means that statistics can be applied better, experiments can be reproduced more
easily, and data mashups can occur in the future.
Experiments are expensive, we should make sure that their full value is realised.
18. On this point...
Helping to demystify the
unwieldy world of
standards...
Find out what standards are out
there...MI Checklists, ontologies
and formats plus what domains
they are suited to...
Find out about data sharing
policies from NIH for example.
19. Configurable...back to that
We need to support lots of different checklists,
and it should be easy for people to change their
requirements should they need to....
So, our infrastructure is built upon XML files.
These are created by the ISAConfigurator.
A configuration XML file describes the fields (or
checklist) required to describe a particular
experiment!
20. ISAconfigurator
The brick maker... a kiln (the ISAconfigurator)
The bricks... (the configuration XML)
24. The configuration xml...
This is an example of a field definition created by the
configurator. In this instance we are describing a label
field, in particular, one used to describe the label used in
a microarray experiment.
We have defined it to come from an ontology, and we
recommend the ChEBI ontology. It is also required.
25. The configuration xml...
Aside from strong ontology support, the configuration xml also allows for specification of regular
expressions which field contents should match, to specify if a field is an integer, double, list value,
boolean, string or a field which should accept a file location...
The configuration xml is an important part of the infrastructure and is utilised in various
components in differing capacities.
ISAcreator: used in content validation, but its main purpose here is to build the user interface... more on this later.
Validation component: used in content validation. It is also called in the ISAconverter and BII data manager before conversion and loading respectively.
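To make the idea of a field definition concrete, here is a minimal sketch of how such a definition could drive content validation. It is not the real configuration schema or the validator's code: the `FieldDefinition` class, its attribute names and the `validate` function are hypothetical, and only the kinds of constraint (required, data type, regular expression, recommended ontology) come from the slides.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldDefinition:
    # Attribute names are hypothetical, not the real configuration schema.
    name: str
    data_type: str = "string"   # "string", "integer", "double" or "boolean"
    required: bool = False
    pattern: Optional[str] = None               # regex the value should match
    recommended_ontology: Optional[str] = None  # e.g. "ChEBI" for a label field


def validate(field_def: FieldDefinition, value: str) -> List[str]:
    """Return a list of problems; an empty list means the value is acceptable."""
    problems: List[str] = []
    if value == "":
        if field_def.required:
            problems.append(f"{field_def.name}: value is required")
        return problems
    # Numeric types are checked by attempting the conversion.
    cast = {"integer": int, "double": float}.get(field_def.data_type)
    if cast is not None:
        try:
            cast(value)
        except ValueError:
            problems.append(
                f"{field_def.name}: '{value}' is not a valid {field_def.data_type}")
    if field_def.data_type == "boolean" and value.lower() not in ("true", "false"):
        problems.append(f"{field_def.name}: '{value}' is not a boolean")
    if field_def.pattern and not re.fullmatch(field_def.pattern, value):
        problems.append(f"{field_def.name}: '{value}' does not match the pattern")
    return problems


label = FieldDefinition("Label", required=True, recommended_ontology="ChEBI")
age = FieldDefinition("Age", data_type="integer")
print(validate(label, ""))
print(validate(age, "thirty"))
```

The same definition object could be handed to both the editor and the validator, which is the reuse the slide describes.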
26. isacreator
Create & Edit ISA-Tab
27. The ISAcreator...
Developed to be a user friendly way to enter standards-compliant metadata: it has lots of features...
- file chooser
- publication searcher
- visualization
- ontology search
- QR code generator
- spreadsheet-like interface
- automated ontology tagging (powered by NCBO Annotator)
But these are just some of them... we also have a data entry wizard and an import utility...
28. Use of the configuration xml
The configuration XML schema (XSD) is consumed by an XMLBeans goal in Maven, and Java stubs are
created which are then used to load the XML files into memory.
- XML definition(s)
- Import into Java object model using classes created by XMLBeans (Java object: TableReferenceObject)
- Construct spreadsheet model: columns, rows, etc.
- Assign cell editors. Ontology terms are given the ontology selection tool as a cell editor, file fields are given a file chooser, etc.
<xml>
<field>sample</field>
<field>protocol ref</field>
<field>extract name</field>
<field>label</field>
...
</xml>
The configuration is also used to define the form view using a similar mechanism....
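The flow above (XML definition, object model, spreadsheet columns, cell editors) can be sketched in a few lines. ISAcreator does this in Java with XMLBeans-generated classes; the Python below is only an analogy, and the XML layout, the `type` attribute and the `build_columns` function are assumptions made for the example rather than the real configuration format.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration snippet; the real schema is richer than this.
CONFIG = """
<config>
  <field name="Sample Name" type="string"/>
  <field name="Label" type="ontology"/>
  <field name="Raw Data File" type="file"/>
</config>
"""

# Which cell editor each field type receives in the spreadsheet.
EDITORS = {
    "ontology": "ontology selection tool",
    "file": "file chooser",
}


def build_columns(xml_text):
    """Parse field definitions and pair each column with a cell editor."""
    root = ET.fromstring(xml_text.strip())
    columns = []
    for element in root.findall("field"):
        field_type = element.get("type", "string")
        editor = EDITORS.get(field_type, "plain text editor")
        columns.append((element.get("name"), editor))
    return columns


for name, editor in build_columns(CONFIG):
    print(f"{name}: {editor}")
```

Because the editor is chosen from the field type alone, changing a checklist means editing the XML, not the application.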
29. Sounds good...what does it look like?...
41. Ontologies
We use the NCBO BioPortal and the EBI’s OLS to do searching and browsing on ontologies.
Features: ontology field restriction, ontology browsing & searching, ontology tagging.
Ontology Resource Manager
The resource manager provides seamless searching of ontology resources, regardless of their origins, their underlying
data schema or the mechanism (REST, SOAP or local file store) through which they are accessed.
- NCBO BioPortal: Search, Hierarchy and Annotator services
- Ontology Lookup Service (OLS)
- Plugin
ISAcreator manages ontology metadata such as version information as well as individual term accessions, source, URI and so forth.
Ontology search code is usable outside of ISAcreator. In fact, the ISAconfigurator imports ISAcreator as a Maven dependency and
reuses its components to do ontology restriction... plugins can also make use of our ontology search and browse functionalities.
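One way to picture the resource manager is as a thin layer that gives every ontology source the same search interface and merges the results. The sketch below is an assumption about the general shape, not ISAcreator's code: `OntologySource`, `InMemorySource` and `ResourceManager` are hypothetical names, the sources are in-memory stand-ins rather than real BioPortal or OLS clients, and the accessions are invented placeholders.

```python
from typing import Dict, List, Protocol


class OntologySource(Protocol):
    """Anything that can be searched, however it is reached (REST, SOAP, file)."""
    name: str

    def search(self, term: str) -> List[Dict[str, str]]: ...


class InMemorySource:
    """Stand-in for a remote or local ontology resource."""

    def __init__(self, name: str, terms: Dict[str, str]):
        self.name = name
        self._terms = terms  # accession -> label

    def search(self, term: str) -> List[Dict[str, str]]:
        wanted = term.lower()
        return [
            {"accession": acc, "label": label, "source": self.name}
            for acc, label in self._terms.items()
            if wanted in label.lower()
        ]


class ResourceManager:
    """Queries every registered source and records where each hit came from."""

    def __init__(self, sources: List[OntologySource]):
        self._sources = list(sources)

    def search(self, term: str) -> List[Dict[str, str]]:
        results: List[Dict[str, str]] = []
        for source in self._sources:
            results.extend(source.search(term))
        return results


# Placeholder sources and accessions, invented for the example.
manager = ResourceManager([
    InMemorySource("stub-a", {"EX:0001": "liver"}),
    InMemorySource("stub-b", {"EX:0002": "Liver tissue", "EX:0003": "kidney"}),
])
for hit in manager.search("liver"):
    print(hit["source"], hit["accession"], hit["label"])
```

A plugin that adds a new search space would, in this picture, simply be one more object with a `search` method.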
42. Ontologies...some more technical details
How do we browse so quickly without downloading and reasoning over ontologies?
(disclaimer: speed also depends on if you access OLS/BioPortal from Europe/America)
Ontologies are all accessed by web services...this part is clear.
But browsing over ontologies, especially those coming from 2 separate resources, in different parts
of the world with very different implementations isn’t easy.
ontology loaded: root -> level(root) + 1
root expanded: root, level 0 -> branch a -> level(a) + 1
node a expanded: root, level 0 -> branch a -> branch b -> level(b) + 1
To make the browsing experience not so slow and painful, we preload parts of the ontology tree
in advance of them being requested by the user.
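The preloading idea can be sketched as a cache that always stays one level ahead of what the user has opened. This is a simplified reading of the slide, not the ISAcreator implementation: `PreloadingTree` and `fetch_children` are hypothetical names, the "web service" is a local dictionary, and a real client would most likely do the look-ahead fetches in the background.

```python
from typing import Callable, Dict, List


class PreloadingTree:
    """Caches children per node and fetches one level ahead of the user."""

    def __init__(self, fetch_children: Callable[[str], List[str]], root: str):
        self._fetch = fetch_children
        self._cache: Dict[str, List[str]] = {}
        self.root = root
        self._load(root)  # on load, fetch level(root) + 1

    def _load(self, node: str) -> List[str]:
        # Only hit the (slow) service the first time a node is needed.
        if node not in self._cache:
            self._cache[node] = self._fetch(node)
        return self._cache[node]

    def expand(self, node: str) -> List[str]:
        children = self._load(node)  # normally already cached
        for child in children:       # look ahead: fetch the grandchildren now
            self._load(child)
        return children


# A local dictionary stands in for the remote ontology service.
ONTOLOGY = {"root": ["a", "b"], "a": ["a1"], "b": [], "a1": []}
calls: List[str] = []


def fetch_children(node: str) -> List[str]:
    calls.append(node)  # record each simulated web service call
    return ONTOLOGY.get(node, [])


tree = PreloadingTree(fetch_children, "root")
print(tree.expand("root"), calls)
print(tree.expand("a"), calls)
```

When the user expands a node, its children are already in memory, so the wait is moved to a moment when nobody is looking at it.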
43. Plugins
In ISAcreator, we use the Apache Felix implementation of the OSGi framework...it’s really good.
Plugins can be developed for 3 different purposes:
- Search (adds extra search space for ontology tool)
- Custom cell editors (for spreadsheet)
- Extra general functionality (which appears in a plugin menu)
44. Plugins...example Novartis Metastore Search
Search function on the Novartis
Metastore... integrates search
results on the metastore in the
Ontology search tool.
So, with the Novartis plugin in
your Plugin directory, you’ll be
able to search the Novartis
metastore directly within
ISAcreator, and it will handle all
the tasks involved with recording
term source, etc.
45. Make sure the ISA-Tab is correct
46. Checks:
the structure of the ISA-Tab to ensure it’s well formed;
the contents to ensure that they match what is defined in the configuration XML
Then:
maps the tab structure into a graph-based object model
Tab structure:
H1 | H. Sapiens | 35 Years | H1.sample1 | Labeling | H1.sample1.labeled | h1-s1.cel
H1 | H. Sapiens | 35 Years | H1.sample2 |          |                    | h1-s2.cel
H2 | H. Sapiens | 33 Years | H2.sample1 | Labeling | H2.sample1.labeled | h2-s1.cel
Graph:
H1 (H. Sapiens, 35 Years)
  -> H1.sample1 -> Labeling -> H1.sample1.labeled -> h1-s1.cel
  -> H1.sample2 -> h1-s2.cel
H2 (H. Sapiens, 33 Years)
  -> H2.sample1 -> Labeling -> H2.sample1.labeled -> h2-s1.cel
Actions such as conversion to other formats and persisting to the DB are performed on this object
model (called the BIIObjectStore).
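The tab-to-graph mapping can be sketched as follows. This is not the BIIObjectStore: `rows_to_graph` is a hypothetical helper, the column layout is reduced to source, sample, protocol, extract and data file (organism and age are left out), and protocols are treated as edge labels so that two separate "Labeling" steps are not merged into one node.

```python
from typing import Dict, List, Tuple

# The three example rows, reduced to: source, sample, protocol, extract, file.
ROWS = [
    ("H1", "H1.sample1", "Labeling", "H1.sample1.labeled", "h1-s1.cel"),
    ("H1", "H1.sample2", "", "", "h1-s2.cel"),
    ("H2", "H2.sample1", "Labeling", "H2.sample1.labeled", "h2-s1.cel"),
]

Graph = Dict[str, List[Tuple[str, str]]]


def rows_to_graph(rows) -> Graph:
    """Return {node: [(protocol, next_node), ...]}; repeated nodes are merged."""
    graph: Graph = {}

    def link(src: str, protocol: str, dst: str) -> None:
        outgoing = graph.setdefault(src, [])
        graph.setdefault(dst, [])
        if (protocol, dst) not in outgoing:
            outgoing.append((protocol, dst))

    for source, sample, protocol, extract, data_file in rows:
        link(source, "", sample)
        if extract:
            # A protocol was applied: sample -> extract -> data file.
            link(sample, protocol, extract)
            link(extract, "", data_file)
        else:
            # No intermediate material: the sample points at the file directly.
            link(sample, "", data_file)
    return graph


graph = rows_to_graph(ROWS)
for node, outgoing in graph.items():
    print(node, "->", outgoing)
```

H1 appears in two rows but becomes a single node with two outgoing edges, which is the difference between the flat table and the graph.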
48. or...
validate from the command line...
or...
within ISAcreator directly...
49. Convert to or from differing formats
50. The converters
Fully Endorsed by ArrayExpress, PRIDE and the European Nucleotide Archive (ENA)...
Converts MAGE-Tab to ISA-Tab.
This is still in beta; however, we are getting close to a fully working version. We’ve successfully
created validated ISA-Tab for ~90% of the 21k experiments in ArrayExpress.
Available as a web service and a web interface; the source is also available for running conversions locally.
http://isatab.sourceforge.net/magetoisa/
52. or...
convert from the command line...
or...
within ISAcreator directly...
53. Automagically filters
out the formats you
can’t export to...e.g., if I
have no sequencing
experiments, I won’t
need to export in SRA
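The filtering rule can be sketched as a lookup from the technologies present in a submission to the formats worth offering. Only the sequencing/SRA pairing is stated on the slide; the other two pairings, the technology strings and the `available_exports` function are assumptions made for illustration.

```python
# Hypothetical mapping from assay technology to the export format it needs.
FORMAT_FOR_TECHNOLOGY = {
    "DNA microarray": "MAGE-Tab",
    "mass spectrometry": "PRIDE-ML",
    "nucleotide sequencing": "SRA-XML",
}


def available_exports(assay_technologies):
    """List the export formats that apply, in first-seen order, no repeats."""
    formats = []
    for technology in assay_technologies:
        export_format = FORMAT_FOR_TECHNOLOGY.get(technology)
        if export_format and export_format not in formats:
            formats.append(export_format)
    return formats


# No sequencing assays here, so SRA-XML is never offered.
print(available_exports(["DNA microarray", "mass spectrometry", "DNA microarray"]))
```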
54. Get ISA-Tab into a database
Share it (or don’t) with the world
55. GUI & command line interface to get ISA-Tab into an instance of the BII (BioInvestigation Index)
Calls the validator first, then persists the BIIObjectStore object to the database via Hibernate
56. Lots of admin
functionalities available
from the GUI, these are
also available using the
command line or API
Disclaimer
Over X11, using such an
interface is slow...I’d suggest
making use of the API or
command line tools available...
57. Database
58. Database
The BioInvestigation Index term is an overloaded one.
It refers to the database & the web application
The database itself is quite complicated to describe in detail in a single presentation, but the key take
home message is that it is graph based...remember this?
H1 (H. Sapiens, 35 Years)
  -> H1.sample1 -> Labeling -> H1.sample1.labeled -> h1-s1.cel
  -> H1.sample2 -> h1-s2.cel
H2 (H. Sapiens, 33 Years)
  -> H2.sample1 -> Labeling -> H2.sample1.labeled -> h2-s1.cel
In the BII, we have Materials, Processes, Cross References and Annotations.
This makes things pretty generic... and the BII model is even more generic than ISA-Tab.
59. Database
One more word about the database, (and a few sentences)
then I’ll show the web application.
Scalable.
As far as we know... :)
ArrayExpress v2 makes use of all of the BII object model.
They just add a table for bio entities (or genes) and that’s it!
AE has >21,000 experiments and >500,000 hybridizations loaded
into its database.
60. Web Application
66. Web application
We created the web application as a lightweight solution enabling users to share their data.
(But it’s a J2EE solution so I think we’ve got an oxymoron on our hands)
But even though it’s enterprise level, it is at least light on maintenance. You’ll not have to do much with BII once it is
running. The EBI version, running across 2 servers (one as backup) has been live for 6 months so far without one
restart...and I only restarted to deploy a new instance.
67. Web application
We use JBoss Seam, mainly because we don’t have to worry about HTTP sessions, scope,
etc. It manages everything for us, which is useful... this is particularly important in highly
accessed systems and frees up time to be spent working on more interesting things...
But it’s also a really good “integration framework”, pulling in JSP, JSF, EJB, JPA, Hibernate, etc.
68. Web application
We use HQL instead of platform-specific SQL, so the database can be
Oracle, MySQL, PostgreSQL... a database-independent application.
We can deal directly with objects, straight from the database queries.
We construct the schema using POJOs and some XML.
69. Web application
Lucene creates a document-based index of the database contents
We use annotations to specify which fields should be indexed
This index can be accessed and queried very quickly,
so we use this to build the user interface
70. Being deployed on Cloud-enabled instance of the BioLinux VM
Will make it easier to create deployments of the BII database and
web application...
71. Last but not least...
Analysis
72. Package to read ISA-Tab into R, especially BioConductor to run analysis
scripts on your data...
It can automatically call microarray, mass spec and flow cytometry
analysis packages on appropriate datasets...
We still need to upload this to BioConductor...created by Audrey Kauffman
There is also a script to create Galaxy libraries from ISA-Tab
Brad Chapman is working on this at HSPH
73. Who’s using ISA?
Fortunately, lots of people are now taking ISA on board... people are realising that MAGE-TAB,
SOFT, PRIDE-ML and SRA-XML are an overhead which can be avoided, especially in multi-omic
experiments.
The National Center for
Toxicological Research (NCTR)
& others...see the case study section on the ISA tools web site
74. Who’s using ISA?
Case study: Metabolomics repository - Metabolights
Built on top of the ISA infrastructure with a custom front-end web interface...
Data entry tooling - ISAcreator, ISAvalidator and ISAconverter
Data management tools - BII data manager, BII database
Also developing their own plugins for ISAcreator (of type: custom cell editor)
to help users in reporting metabolite assignments.
76. Who’s using ISA?
Case study: SCDE
Curated stem cell informatics resource linked with the Galaxy analysis engine
Built on top of the ISA infrastructure in its entirety
Contributing automatic deployment scripts for the BII
(linked with the cloud BioLinux initiative)
Created the Python Parser for ISA-Tab
79. Who’s using ISA?
Case study: GeneData - InnoMed
Biggest public study of its kind. Only available in ISA-Tab.
720 animals
16 compounds
3 doses
~20,000 assays
Assay types:
- protein expression profiling by mass spectrometry
- transcription profiling by DNA microarray
- metabolite profiling by mass spectrometry
- metabolite profiling by NMR spectroscopy
- histology
- clinical chemistry
- hematology
81. Our next steps...as a community
Visualization, further adoption, analysis.
[Figure: experimental workflows (low dose aspirin; liver, kidney, blood serum and blood plasma; x5 samples each) drawn as sequences of steps: SAMP, EX, LABEL, HYB, SCAN, TRANS.]
One workflow shows a well described process from sample to data file; another has missing protocols and no information about what was being measured.
Making visual comparisons is straightforward using this approach. The longest path is constructed based on all other known datasets in the pool of workflows being compared.
82. We can’t do everything by ourselves...
ISA team
Susanna-Assunta Sansone
Philippe Rocca-Serra
Eamonn Maguire
Contributors
Marco Brandizi
Natalija Sklyar
Brad Chapman
Bob MacCallum
Kenneth Haug
Pablo Conesa
Audrey Kauffman
Collaborators at
The National Center for Toxicological Research (NCTR)
Funders
83. ISA software suite: supporting standards-compliant
experimental annotation and enabling curation at the
community level
Philippe Rocca-Serra; Marco Brandizi; Eamonn Maguire; Nataliya Sklyar; Chris Taylor; Kimberly Begley; Dawn
Field; Stephen Harris; Winston Hide; Oliver Hofmann; Steffen Neumann; Peter Sterk; Weida Tong; Susanna-
Assunta Sansone
Bioinformatics 2010 26: 2354-2356
84. Thanks for listening...
Questions??
You can email us...
isatools@googlegroups.com
View our website
http://www.isa-tools.org
View our Git repo & contribute
http://github.com/ISA-tools
View our blog
http://isatools.wordpress.com
Follow us on Twitter
@antarcticdesign