InterMine is open-source data warehouse software for integrating complex biological data. It provides parsers for common data formats and an extensible framework for adding custom data. The system uses a PostgreSQL database to store integrated data according to an object-oriented data model. It offers a customizable web interface for querying, as well as programmatic access via a web service API. Building an InterMine instance involves configuring data sources, performing data integration and post-processing, and deploying the web application. InterMine also facilitates data sharing across multiple biological "mines".
2. InterMine in a nutshell
• Open-source data warehouse software
• Integration of complex biological data
• Parsers for common biological data formats
• Extensible framework for custom data
• Cookie-cutter interface, highly customizable
• Interact using sophisticated web query tools
• Programmatic access using web-service API
3. Open-source Project
• Source code available online
• Distributed under the GNU LGPL license
• GitHub Repo:
https://github.com/intermine/intermine
• GitHub Organization:
https://github.com/intermine
intermine / intermine
> bio
> biotestmine
> config
> flymine
> humanmine
> imbuild
> intermine
> testmodel
.gitignore
.travis.yml
LICENSE
LICENSE.LIBS
README.md
RELEASE_NOTES
4. Richard N. Smith et al. Bioinformatics 2012;28:3163-3165
InterMine system architecture
5. InterMine system architecture
Web Application
• Java Server Pages (JSP), HTML, JS, CSS
• Interfaces with Java Servlets and IM web-services
Web Server
• Tomcat 7.0.x, serves the Web application ARchive (WAR) file
• Ant-based build system using the Java SDK
Database Server
• PostgreSQL 9.2 or above
• range query, btree and gist support enabled (refer to the docs linked below)
http://intermine.readthedocs.org/en/latest/system-requirements/
6. Data Model Overview
• Object-oriented data model
• Divided into classes, their attributes and their relationships; defined in XML
• Represented as Java classes (pure Java beans), auto-generated from the XML and automatically mapped to tables in the schema
• Core data model is based on the Sequence Ontology (SO); refer to bio/core/core.xml and bio/core/genomic_additions.xml
http://intermine.readthedocs.org/en/latest/data-model/overview/
7. Data Model Overview
<?xml version="1.0"?>
<model name="example" package="org.intermine.model.bio">
  <class name="Protein" is-interface="true" extends="SequenceFeature">
    <attribute name="name" type="java.lang.String"/>
    <attribute name="accession" type="java.lang.String"/>
    <collection name="features" referenced-type="NewFeature" reverse-reference="protein"/>
  </class>
  <class name="NewFeature" is-interface="true">
    <attribute name="identifier" type="java.lang.String"/>
    <attribute name="confidence" type="java.lang.Double"/>
    <reference name="protein" referenced-type="Protein" reverse-reference="features"/>
  </class>
</model>
Model expects standard Java names for classes and attributes
• classes: should start with an upper case letter and be CamelCase, no underscores or spaces.
• fields (attributes, references, collections): should start with a lower case letter and be lowerCamelCase, no underscores or spaces.
http://intermine.readthedocs.org/en/latest/data-model/model/
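The naming rules above can be checked mechanically. The following is a minimal sketch, not part of InterMine: it parses a model XML string with Python's standard library and reports names that break the rules. The function name `check_model_names` and the regular expressions are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Example model copied from the slide above.
MODEL_XML = """<model name="example" package="org.intermine.model.bio">
  <class name="Protein" is-interface="true" extends="SequenceFeature">
    <attribute name="name" type="java.lang.String"/>
    <attribute name="accession" type="java.lang.String"/>
    <collection name="features" referenced-type="NewFeature" reverse-reference="protein"/>
  </class>
  <class name="NewFeature" is-interface="true">
    <attribute name="identifier" type="java.lang.String"/>
    <attribute name="confidence" type="java.lang.Double"/>
    <reference name="protein" referenced-type="Protein" reverse-reference="features"/>
  </class>
</model>"""

# Assumed encodings of the rules: letters and digits only,
# so underscores and spaces are rejected.
CLASS_RE = re.compile(r"^[A-Z][A-Za-z0-9]*$")
FIELD_RE = re.compile(r"^[a-z][A-Za-z0-9]*$")
FIELD_TAGS = ("attribute", "reference", "collection")


def check_model_names(model_xml):
    """Return a list of human-readable naming problems (empty if none)."""
    problems = []
    root = ET.fromstring(model_xml)
    for cls in root.findall("class"):
        class_name = cls.get("name", "")
        if not CLASS_RE.match(class_name):
            problems.append(f"class '{class_name}' should be CamelCase")
        for field in cls:
            if field.tag not in FIELD_TAGS:
                continue
            field_name = field.get("name", "")
            if not FIELD_RE.match(field_name):
                problems.append(
                    f"{class_name}.{field_name} should be lowerCamelCase"
                )
    return problems


print(check_model_names(MODEL_XML))
```

Running a check like this before building the database is one way to catch a misnamed class or field early.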
8. Creating & configuring a mine
• Build out scaffold for mine
$ cd git/intermine
$ bio/scripts/make_mine legumine
• Configure the data to load and the post-processing steps to run by customizing project.xml
• Each data <source /> element corresponds to a directory under bio/sources/*, which defines the parsers that retrieve the data and encodes the rules for integration
intermine / intermine
> bio
> biotestmine
> config
> flymine
> legumine
> dbmodel
> integrate
> postprocess
> webapp
> default.intermine.integrate.properties
> default.intermine.webapp.properties
> project.xml
> humanmine
> imbuild
> intermine
> testmodel
.gitignore
.travis.yml
LICENSE
LICENSE.LIBS
README.md
RELEASE_NOTES
http://intermine.readthedocs.org/en/latest/get-started/tutorial/#creating-a-new-mine
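To make the link between project.xml and bio/sources/ concrete, here is a small sketch that lists the sources declared in a project file and the directory each one would map to. The XML fragment is hypothetical: the source names come from the build commands later in this deck, while the `type` values, property names and paths are assumptions. The sketch also assumes the `type` attribute names the directory directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical project.xml fragment; only the source names appear in the deck.
PROJECT_XML = """<project type="bio">
  <sources>
    <source name="legumine-gff" type="legumine-gff">
      <property name="src.data.dir" location="/data/legumine/gff"/>
    </source>
    <source name="legumine-chr-fasta" type="fasta">
      <property name="src.data.dir" location="/data/legumine/fasta"/>
    </source>
  </sources>
</project>"""


def list_sources(project_xml, sources_root="bio/sources"):
    """Return (source name, assumed source directory) pairs in file order."""
    root = ET.fromstring(project_xml)
    return [
        (src.get("name"), f"{sources_root}/{src.get('type')}")
        for src in root.iter("source")
    ]


for name, directory in list_sources(PROJECT_XML):
    print(f"{name} -> {directory}")
```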
10. Data Sources and Sets
• InterMine provides a vast library of data source parsers and loaders, covering data types including, but not restricted to:
genome sequence (fasta)
annotation (gff)
ontology (go, so)
proteins (uniprot)
interactions (psi-mi)
pathway (kegg, reactome)
homologs (panther, compara, homologene)
publications (pubmed)
chado (sequence, stock)
• Custom sources can be written by following the tutorial at http://intermine.readthedocs.org/en/latest/database/data-sources/custom/ or by referring to code from other mines
http://intermine.readthedocs.org/en/latest/database/data-sources/library/
11. Building a mine
• Each InterMine instance requires 3 PostgreSQL databases:
legumine: core db mapping to the data model
items-legumine: db for storing intermediate Items during load
userprofile-legumine: db for storing user-specific data
• Running the build requires a special config file in the user's home directory, containing db connection params and other mine-specific configs to override:
${HOME}/.intermine/legumine.properties
http://intermine.readthedocs.org/en/latest/get-started/tutorial/#properties-file
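As a sanity check before a build, the three database names can be derived from the mine name and compared against the properties file. This is a sketch under assumptions: the property keys in `SAMPLE` follow the pattern recalled from the InterMine tutorial and should be verified against the linked docs, and the helper names are made up for illustration.

```python
def mine_databases(mine_name):
    """Database names used on the slide above, derived from the mine name."""
    return {
        "production": mine_name,
        "items": f"items-{mine_name}",
        "userprofile": f"userprofile-{mine_name}",
    }


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_databases(mine_name, props):
    """Return expected database names that no *.databaseName key mentions."""
    configured = {v for k, v in props.items() if k.endswith(".databaseName")}
    return sorted(set(mine_databases(mine_name).values()) - configured)


# Hypothetical contents of ${HOME}/.intermine/legumine.properties;
# the key names are assumptions.
SAMPLE = """
db.production.datasource.serverName=localhost
db.production.datasource.databaseName=legumine
db.common-tgt-items.datasource.databaseName=items-legumine
db.userprofile-production.datasource.databaseName=userprofile-legumine
"""

print(missing_databases("legumine", parse_properties(SAMPLE)))
```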
12. Model Merging & Data Integration
Model Merging
• Each source contributes towards the data model
• bio/core/core.xml is always used as the base for model merging
• The ant build-db command consumes the SOURCE_additions.xml file of each source
• The merged model is used to generate tables, Java classes and the webapp
http://intermine.readthedocs.org/en/latest/database/database-building/model-merging/
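The idea of model merging can be illustrated with plain dictionaries: start from the core model and fold in each source's additions. This is a toy sketch, not InterMine's merging code; it ignores superclasses, references and type conflicts, and the class and field names in the example data are illustrative.

```python
def merge_models(core, *additions):
    """Merge class -> {field: type} mappings, starting from the core model."""
    merged = {cls: dict(fields) for cls, fields in core.items()}
    for addition in additions:
        for cls, fields in addition.items():
            # New classes are added; existing classes gain the new fields.
            merged.setdefault(cls, {}).update(fields)
    return merged


# Illustrative stand-ins for core.xml and one SOURCE_additions.xml.
core = {
    "Gene": {"primaryIdentifier": "java.lang.String"},
    "Protein": {"name": "java.lang.String"},
}
source_additions = {
    "Gene": {"symbol": "java.lang.String"},
    "NewFeature": {"confidence": "java.lang.Double"},
}

model = merge_models(core, source_additions)
for cls in sorted(model):
    print(cls, sorted(model[cls]))
```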
Data Integration
• The key(s) for a class of object define equivalence for objects of that class
• A primary key defines the field(s) used to search for equivalent objects
• Objects which share the same primary key have their fields merged and are stored as a single object
http://intermine.readthedocs.org/en/latest/database/database-building/primary-keys/
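A simplified picture of key-based integration: objects from different sources that agree on the primary key fields collapse into one object whose fields are the union of the inputs. The sketch below assumes every object carries all key fields, and it lets the first source win when two sources disagree on a field, which is a simplification rather than InterMine's actual conflict handling. The identifiers are made up.

```python
def integrate(objects, key_fields):
    """Merge dict-like objects that share the same primary key values."""
    merged = {}
    for obj in objects:
        key = tuple(obj[field] for field in key_fields)
        target = merged.setdefault(key, {})
        for field, value in obj.items():
            # Keep the first value seen for a field; fill in missing ones.
            if value is not None and field not in target:
                target[field] = value
    return list(merged.values())


# Two hypothetical sources describing the same gene, plus an unrelated gene.
from_gff = {"primaryIdentifier": "gene-0001", "length": 1200}
from_annotation = {"primaryIdentifier": "gene-0001", "symbol": "ABC1"}
other_gene = {"primaryIdentifier": "gene-0002", "length": 800}

genes = integrate([from_gff, from_annotation, other_gene], ["primaryIdentifier"])
for gene in genes:
    print(gene)
```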
13. Post processing
• Operations are performed on integrated data
• Calculate/set fields that are difficult to handle during data loading because they require 2 or more sources to be loaded already
• Order of steps is somewhat important
<post-processing>
  <post-process name="create-references"/>
  <post-process name="create-chromosome-locations-and-lengths"/>
  <post-process name="create-gene-flanking-features"/>
  <post-process name="do-sources"/>
  <post-process name="create-intron-features">
    <property name="organisms" value="3880"/>
  </post-process>
  <post-process name="transfer-sequences"/>
  <post-process name="populate-child-features"/>
  <post-process name="create-location-range-index"/>
  <post-process name="create-overlap-view"/>
  <post-process name="create-attribute-indexes"/>
  <post-process name="summarise-objectstore"/>
  <post-process name="create-search-index"/>
</post-processing>
http://intermine.readthedocs.org/en/latest/database/database-building/post-processing/
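Since step order matters, it can help to read the configured steps back in document order before a long build. A minimal sketch using the standard library, with a shortened copy of the block above as input; the function name `ordered_steps` is an assumption.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <post-processing> block from the slide.
POST_PROCESSING_XML = """<post-processing>
  <post-process name="create-references"/>
  <post-process name="do-sources"/>
  <post-process name="create-intron-features">
    <property name="organisms" value="3880"/>
  </post-process>
  <post-process name="create-search-index"/>
</post-processing>"""


def ordered_steps(xml_text):
    """Return (step name, properties) tuples in the order they are listed."""
    steps = []
    for node in ET.fromstring(xml_text).findall("post-process"):
        props = {p.get("name"): p.get("value") for p in node.findall("property")}
        steps.append((node.get("name"), props))
    return steps


for position, (name, props) in enumerate(ordered_steps(POST_PROCESSING_XML), 1):
    print(position, name, props)
```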
14. Building & deploying a mine
Two types of build mechanisms:
• Manual:
$ cd dbmodel && ant clean build-db      ## initialize db
$ cd ../integrate                       ## ant targets for loading data
$ ant -Dsource=legumine-gff             ## load data sources
$ ant -Dsource=legumine-chr-fasta       ## load more sources
$ cd ../postprocess && ant              ## run post-process steps
$ cd ../webapp                          ## build mine webapp
$ ant clean remove-webapp default release-webapp
• Automated:
$ ../bio/scripts/project_build -b -v localhost ~/legumine-dump
http://intermine.readthedocs.org/en/latest/database/database-building/build-script/
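The manual sequence can be written down as a dry-run plan, which is handy when the list of sources grows. This sketch only prints commands and runs nothing. It assumes sources are loaded from the integrate directory, as the directory descriptions in the speaker notes suggest, and the function name is made up.

```python
def manual_build_plan(sources):
    """Return (directory, command) pairs for a manual build, in order."""
    plan = [("dbmodel", "ant clean build-db")]
    # One load per source; assumed to run from the integrate directory.
    plan += [("integrate", f"ant -Dsource={source}") for source in sources]
    plan.append(("postprocess", "ant"))
    plan.append(("webapp", "ant clean remove-webapp default release-webapp"))
    return plan


plan = manual_build_plan(["legumine-gff", "legumine-chr-fasta"])
for directory, command in plan:
    # Printed as subshell commands so each step starts from the mine root.
    print(f"(cd {directory} && {command})")
```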
15. Lucene based search index
• Post-process "create-search-index" runs the database indexing, then zips the index and stores it in the db
• On (first) webapp load, the index is unpacked
• By default, all id and text fields are ignored by the indexer
• Uses the Apache Lucene whitespace analyzer to identify word boundaries
• Control the temp directory and the classes/fields to be ignored by altering the MINE_NAME/dbmodel/resources/keyword_search.properties file
http://intermine.readthedocs.org/en/latest/webapp/keyword-search/
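To show what whitespace-based word boundaries mean in practice, here is a toy inverted index in plain Python. It is an illustration of the tokenization idea only, not Lucene and not InterMine's indexer; the records, field names and the `ignored_fields` parameter are hypothetical.

```python
from collections import defaultdict


def build_index(records, ignored_fields=()):
    """Map each whitespace-delimited token to the ids of objects containing it."""
    index = defaultdict(set)
    for obj_id, fields in records.items():
        for field, text in fields.items():
            if field in ignored_fields:
                continue
            # str.split() with no arguments splits on runs of whitespace only,
            # so punctuation and hyphens stay inside the token.
            for token in str(text).split():
                index[token].add(obj_id)
    return dict(index)


records = {
    1: {"name": "MADS-box protein", "description": "transcription factor"},
    2: {"name": "kinase protein", "description": "signalling"},
}

index = build_index(records, ignored_fields=("description",))
print(sorted(index))
```

Note that "MADS-box" is kept as a single token, so a search for "MADS" alone would not match in this scheme.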
16. Alex Kalderimis et al. Nucl. Acids Res. 2014;42:W468-W472
InterMine web services
http://iodocs.labs.intermine.org
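For a feel of programmatic access, the sketch below builds (but does not send) a request URL for a query. Everything specific here is an assumption to verify against the web service documentation: the `/query/results` path, the `query` and `format` parameters, the query XML shape, and the host `mine.example.org`, which is a placeholder.

```python
from urllib.parse import urlencode


def query_results_url(service_root, path_query_xml, fmt="json"):
    """Build a GET URL carrying a query as URL-encoded XML (assumed endpoint)."""
    params = urlencode({"query": path_query_xml, "format": fmt})
    return f"{service_root.rstrip('/')}/query/results?{params}"


# Assumed query XML: select two Gene fields, constrained by symbol.
PATH_QUERY = (
    '<query model="genomic" view="Gene.symbol Gene.primaryIdentifier">'
    '<constraint path="Gene.symbol" op="=" value="zen"/>'
    "</query>"
)

url = query_results_url("https://mine.example.org/legumine/service/", PATH_QUERY)
print(url)
```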
17. Federated Authentication
• Apart from the standard login scheme (username/password), InterMine supports industry-standard OAuth2-based login flows, as implemented by Google, GitHub, Agave, etc.
• ThaleMine relies on this infrastructure to authenticate users against the araport.org tenant registered within the Agave infrastructure
• Documentation available here:
http://intermine.readthedocs.org/en/latest/webapp/properties/web-properties/#openauth2-settings-aka-openid-connect
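The first step of a generic OAuth2 authorization-code login is redirecting the user to the provider with a handful of standard parameters. The sketch below shows that step only, as defined by the OAuth2 specification; it is not InterMine's implementation, and the endpoint, client id and callback URL are placeholders.

```python
import secrets
from urllib.parse import urlencode


def authorization_url(authorize_endpoint, client_id, redirect_uri, scope, state=None):
    """Return (url, state) for an OAuth2 authorization-code request."""
    # The state value is echoed back by the provider and should be checked
    # on the callback to guard against cross-site request forgery.
    state = state or secrets.token_urlsafe(16)
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    }
    return f"{authorize_endpoint}?{urlencode(params)}", state


# All values below are hypothetical.
login_url, login_state = authorization_url(
    "https://auth.example.org/authorize",
    client_id="my-mine-client",
    redirect_uri="https://mine.example.org/legumine/callback",
    scope="openid email",
    state="fixed-state-for-demo",
)
print(login_url)
```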
19. Summary
• Advantages
InterMine is a powerful biological data warehouse
Performs complex data integration
Allows fast and flexible querying
Well-documented programmatic interface
Cookie-cutter, user-friendly web interface
Facilitates cross-talk between “mines”
• Caveats
Adding more data requires a full database rebuild (incremental loading is not possible) because of the integration step
• About InterMine:
Developed by the Micklem Lab at the University of Cambridge, UK
Written in Java, backed by a PostgreSQL database, deployed under Tomcat.
Documentation and downloads available at http://www.intermine.org
20. Acknowledgments
• InterMine Team
Gos Micklem
Julie Sullivan
Alex Kalderimis
Richard Smith
Sergio Contrino
Josh Heimbach
et al.
• Araport Team
Chris Town
Jason Miller
Matt Vaughn
Maria Kim
Svetlana Karamycheva
Erik Ferlanti
Chia-Yi Cheng
Benjamin Rosen
Irina Belyaeva
bio: code to deal with biological data, including data sources
flymine: config used to create FlyMine
testmodel: non-biological test data model used for testing core InterMine
imbuild: ant-based build system; do not edit anything here
intermine: the core (generic) InterMine code to work with any data model
ObjectStore: custom Java object/relational mapping system, optimized for read-only database performance
Query optimizer: pre-computed tables joining connected data from different tables, improves PostgreSQL performance
Attributes can be primitives (int, float, string), references to other objects in the database, or collections of other objects in the database
All elements of the model extend core InterMineObject, which has unique field ‘id’
Only the classes defined in the model are searchable
dbmodel: information about the data model and ant targets relating to the model and database creation
integrate: ant targets for loading data
postprocess: ant targets to run post-processing operations on data
webapp: basic configuration and commands for building and deploying the web application
Summary of web services available through InterMine.