Introduction to InterMine Infrastructure

Vivek Krishnakumar
LF Meeting 04/28/2015

InterMine in a nutshell
• Open-source data warehouse software
• Integration of complex biological data
• Parsers for common biological data formats
• Extensible framework for custom data
• Cookie-cutter interface, highly customizable
• Interact using sophisticated web query tools
• Programmatic access using web-service API

Open-source Project
• Source code available online
• Distributed under the GNU LGPL license
• GitHub Repo: https://github.com/intermine/intermine
• GitHub Organization: https://github.com/intermine
intermine / intermine
> bio
> biotestmine
> config
> flymine
> humanmine
> imbuild
> intermine
> testmodel
.gitignore
.travis.yml
LICENSE
LICENSE.LIBS
README.md
RELEASE_NOTES

InterMine system architecture
Figure source: Richard N. Smith et al. Bioinformatics 2012;28:3163-3165

InterMine system architecture
Web Application
• Java Server Pages (JSP), HTML, JS, CSS
• Interfaces with Java Servlets and IM web-services
Web Server
• Tomcat 7.0.x, serves the Web application ARchive (WAR) file
• Ant-based build system using the Java SDK
Database Server
• PostgreSQL 9.2 or above
• Range query, btree and gist support enabled (see the system requirements docs linked below)
http://intermine.readthedocs.org/en/latest/system-requirements/

Data Model Overview
• Object-oriented data model
• Divided into classes, their attributes and
their relationships; defined in XML
• Represented as Java classes (pure Java
beans); auto-generated from XML,
automatically map to tables in schema
• Core data model; based on Sequence
Ontology (SO); refer: bio/core/core.xml
and bio/core/genomic_additions.xml
http://intermine.readthedocs.org/en/latest/data-model/overview/

Data Model Overview
<?xml version="1.0"?>
<model name="example" package="org.intermine.model.bio">
<class name="Protein" is-interface="true" extends="SequenceFeature">
<attribute name="name" type="java.lang.String"/>
<attribute name="accession" type="java.lang.String"/>
<collection name="features" referenced-type="NewFeature" reverse-reference="protein"/>
</class>
<class name="NewFeature" is-interface="true">
<attribute name="identifier" type="java.lang.String"/>
<attribute name="confidence" type="java.lang.Double"/>
<reference name="protein" referenced-type="Protein" reverse-reference="features"/>
</class>
</model>
The model expects standard Java names for classes and fields:
• classes: start with an upper-case letter and use CamelCase, with no underscores or spaces.
• fields (attributes, references, collections): start with a lower-case letter and use lowerCamelCase, with no underscores or spaces.
http://intermine.readthedocs.org/en/latest/data-model/model/
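
The naming rules can be checked mechanically before a build. The following is a minimal sketch, not part of InterMine: the helper `naming_problems` and the two regular expressions are hypothetical, and they only test the first letter and the absence of underscores or spaces, using a trimmed copy of the example model from this slide.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the example model from the slide.
MODEL_XML = """<?xml version="1.0"?>
<model name="example" package="org.intermine.model.bio">
  <class name="Protein" is-interface="true" extends="SequenceFeature">
    <attribute name="name" type="java.lang.String"/>
    <collection name="features" referenced-type="NewFeature" reverse-reference="protein"/>
  </class>
  <class name="NewFeature" is-interface="true">
    <attribute name="identifier" type="java.lang.String"/>
    <reference name="protein" referenced-type="Protein" reverse-reference="features"/>
  </class>
</model>"""

# Hypothetical patterns: upper/lower-case first letter, then letters or digits only.
CLASS_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*$")
FIELD_NAME = re.compile(r"^[a-z][A-Za-z0-9]*$")


def naming_problems(model_xml):
    """Return a list of naming violations found in a model XML string."""
    problems = []
    for cls in ET.fromstring(model_xml).findall("class"):
        cls_name = cls.get("name", "")
        if not CLASS_NAME.match(cls_name):
            problems.append(f"class '{cls_name}' should be CamelCase")
        for field in cls:
            # Attributes, references and collections all count as fields.
            if field.tag in ("attribute", "reference", "collection"):
                field_name = field.get("name", "")
                if not FIELD_NAME.match(field_name):
                    problems.append(f"{cls_name}.{field_name} should be lowerCamelCase")
    return problems


if __name__ == "__main__":
    for problem in naming_problems(MODEL_XML):
        print(problem)
```

Run against the example model, the check reports nothing; renaming a class to `new_feature` or a field to `Identifier` would each produce one message.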

Creating & configuring a mine
• Build out scaffold for mine
$ cd git/intermine
$ bio/scripts/make_mine legumine
• Configure the data to load and the post-processing steps to run by customizing project.xml
• Each data <source /> element corresponds to a directory under bio/sources/*, which defines the parser that retrieves the data and encodes the rules for integration
intermine / intermine
> bio
> biotestmine
> config
> flymine
> legumine
> dbmodel
> integrate
> postprocess
> webapp
> default.intermine.integrate.properties
> default.intermine.webapp.properties
> project.xml
> humanmine
> imbuild
> intermine
> testmodel
.gitignore
.travis.yml
LICENSE
LICENSE.LIBS
README.md
RELEASE_NOTES
http://intermine.readthedocs.org/en/latest/get-started/tutorial/#creating-a-new-mine

Creating & configuring a mine
<project type="bio">
<property name="target.model" value="genomic"/>
<property name="source.location" location="../bio/sources/"/>
<property name="common.os.prefix" value="common"/>
<property name="intermine.properties.file" value="legumine.properties"/>
<property name="default.intermine.properties.file" location="../default.intermine.integrate.properties"/>
<sources>
<source name="legumine-gff" type="legumine-gff">
<property name="gff3.taxonId" value="3880"/>
<property name="gff3.seqDataSourceName" value="LF"/>
<property name="gff3.dataSourceName" value="LF"/>
<property name="gff3.seqClsName" value="Chromosome"/>
<property name="gff3.dataSetTitle" value="Genome Annotation"/>
<property name="src.data.dir" location="/path/to/legumine/genome/gff/" />
</source>
:
:
</sources>
<post-processing>
<post-process name="create-references" />
<post-process name="create-chromosome-locations-and-lengths"/>
<post-process name="create-gene-flanking-features" />
:
:
</post-processing>
</project>
project.xml
http://intermine.readthedocs.org/en/latest/get-started/tutorial/#project-xml
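
Since project.xml is plain XML, it can be inspected with a few lines of script to see which sources will be loaded and in which order the post-processing steps will run. This is a rough sketch using only the Python standard library; `summarize_project` is a hypothetical helper and the embedded XML is a shortened copy of the file on this slide.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the project.xml from the slide.
PROJECT_XML = """<project type="bio">
  <property name="target.model" value="genomic"/>
  <sources>
    <source name="legumine-gff" type="legumine-gff">
      <property name="gff3.taxonId" value="3880"/>
      <property name="src.data.dir" location="/path/to/legumine/genome/gff/"/>
    </source>
  </sources>
  <post-processing>
    <post-process name="create-references"/>
    <post-process name="create-chromosome-locations-and-lengths"/>
  </post-processing>
</project>"""


def summarize_project(xml_text):
    """Return (sources, steps): per-source properties and ordered post-process names."""
    root = ET.fromstring(xml_text)
    sources = {}
    for src in root.findall("./sources/source"):
        props = {}
        for prop in src.findall("property"):
            # A property carries either a value or a filesystem location.
            props[prop.get("name")] = prop.get("value") or prop.get("location")
        sources[src.get("name")] = {"type": src.get("type"), "properties": props}
    steps = [pp.get("name") for pp in root.findall("./post-processing/post-process")]
    return sources, steps


if __name__ == "__main__":
    sources, steps = summarize_project(PROJECT_XML)
    for name, info in sources.items():
        print(name, info["type"], info["properties"])
    print("post-processing order:", steps)
```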

Data Sources and Sets
• InterMine provides a large library of data source parsers and
loaders, covering data types including, but not restricted to:
genome sequence (fasta)
annotation (gff)
ontology (go, so)
proteins (uniprot)
interactions (psi-mi)
pathway (kegg, reactome)
homologs (panther, compara, homologene)
publications (pubmed)
chado (sequence, stock)
• Custom sources can be written by following the tutorial at http://intermine.readthedocs.org/en/latest/database/data-sources/custom/ or by referring to code from other mines
http://intermine.readthedocs.org/en/latest/database/data-sources/library/

Building a mine
• Each InterMine instance requires 3 PostgreSQL databases:
  ◦ legumine: core db mapping to the data model
  ◦ items-legumine: db for storing intermediate Items during load
  ◦ userprofile-legumine: db for storing user-specific data
• Running a build requires a config file in the user's home directory, containing db connection params and other mine-specific settings that override the defaults:
${HOME}/.intermine/legumine.properties
http://intermine.readthedocs.org/en/latest/get-started/tutorial/#properties-file
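
To illustrate how the three databases relate to the properties file, here is a small sketch that reads Java-style `key=value` properties and lists the database each entry points at. The property keys in `PROPERTIES_TEXT` are written from memory of the tutorial's sample file and should be treated as assumptions; the helpers `parse_properties` and `database_names` are hypothetical. Check the linked tutorial for the authoritative key names.

```python
# Assumed layout of ${HOME}/.intermine/legumine.properties (keys are assumptions).
PROPERTIES_TEXT = """\
# database the mine is built into and served from
db.production.datasource.serverName=localhost
db.production.datasource.databaseName=legumine
# intermediate Items store used during loading
db.common-tgt-items.datasource.serverName=localhost
db.common-tgt-items.datasource.databaseName=items-legumine
# user accounts and other user-specific data
db.userprofile-production.datasource.serverName=localhost
db.userprofile-production.datasource.databaseName=userprofile-legumine
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def database_names(props):
    """Map each db alias (the part between 'db.' and the suffix) to its database name."""
    prefix, suffix = "db.", ".datasource.databaseName"
    return {
        key[len(prefix):-len(suffix)]: value
        for key, value in props.items()
        if key.startswith(prefix) and key.endswith(suffix)
    }


if __name__ == "__main__":
    for alias, name in database_names(parse_properties(PROPERTIES_TEXT)).items():
        print(f"{alias}: {name}")
```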

Model Merging & Data Integration
Model Merging
• Each source contributes
towards the data model
• bio/core/core.xml is
always used as the base
for model merging
• The ant build-db
command consumes the
SOURCE_additions.xml
• Model is used to generate
tables, Java classes and
the webapp
http://intermine.readthedocs.org/en/latest/database/database-building/model-merging/
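
The idea of model merging can be sketched in a few lines: classes from an additions file are either added to the base model or, if the class already exists, contribute their extra fields to it. This is a simplified illustration only, not InterMine's actual merge code; it ignores inheritance and conflicting definitions, the `<classes>` root element of the additions file is an assumption, and `merge_models` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Stand-in for the base model (bio/core/core.xml in a real mine).
BASE = """<model name="genomic" package="org.intermine.model.bio">
  <class name="Protein" is-interface="true">
    <attribute name="name" type="java.lang.String"/>
  </class>
</model>"""

# Stand-in for a SOURCE_additions.xml file (root element is an assumption).
ADDITIONS = """<classes>
  <class name="Protein" is-interface="true">
    <attribute name="accession" type="java.lang.String"/>
  </class>
  <class name="NewFeature" is-interface="true">
    <attribute name="identifier" type="java.lang.String"/>
  </class>
</classes>"""


def merge_models(base_xml, additions_xml):
    """Return the base model element with classes and fields from additions merged in."""
    base = ET.fromstring(base_xml)
    by_name = {cls.get("name"): cls for cls in base.findall("class")}
    for extra in ET.fromstring(additions_xml).findall("class"):
        target = by_name.get(extra.get("name"))
        if target is None:
            # Class is new to the model: add it as-is.
            base.append(extra)
            by_name[extra.get("name")] = extra
            continue
        # Class already exists: add only the fields it does not have yet.
        existing = {field.get("name") for field in target}
        for field in extra:
            if field.get("name") not in existing:
                target.append(field)
                existing.add(field.get("name"))
    return base


if __name__ == "__main__":
    merged = merge_models(BASE, ADDITIONS)
    for cls in merged.findall("class"):
        print(cls.get("name"), [field.get("name") for field in cls])
```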
Data Integration
• Key(s) for class of object
defines equivalence for
objects of that class
• Primary key defines
field(s) used to search for
equivalence
• For objects which share
same primary key, fields
are merged and stored as
single object
http://intermine.readthedocs.org/en/latest/database/database-building/primary-keys/
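
Primary-key based integration can be pictured as grouping records by their key fields and folding the remaining fields into one object. The sketch below is a toy model under stated assumptions: records are plain dictionaries, the identifiers are made up, `integrate` is a hypothetical helper, and when two sources supply the same field the first value is kept. How InterMine itself resolves conflicting values is not covered on this slide and is not modeled here.

```python
def integrate(records, key_fields):
    """Merge records that share the same values for key_fields into single objects."""
    merged = {}
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        target = merged.setdefault(key, {})
        for field, value in record.items():
            # Keep the first non-empty value seen for each field (a simplification).
            if value is not None and field not in target:
                target[field] = value
    return list(merged.values())


if __name__ == "__main__":
    # Two hypothetical sources describing the same gene, plus one unrelated gene.
    from_gff = {"primaryIdentifier": "gene1", "length": 1200}
    from_annotation = {"primaryIdentifier": "gene1", "symbol": "abc1"}
    other = {"primaryIdentifier": "gene2", "length": 800}

    for obj in integrate([from_gff, from_annotation, other], ["primaryIdentifier"]):
        print(obj)
```

With `primaryIdentifier` as the key, the two `gene1` records collapse into one object carrying both `length` and `symbol`, while `gene2` stays separate.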

Post processing
• Operations are performed on integrated data
• Calculate/set fields that are difficult to handle during data loading because they require two or more sources to be loaded already
• The order of steps matters, since some steps rely on data produced by earlier ones
<post-processing>
<post-process name="create-references" />
<post-process name="create-chromosome-locations-and-lengths"/>
<post-process name="create-gene-flanking-features" />
<post-process name="do-sources" />
<post-process name="create-intron-features">
<property name="organisms" value="3880"/>
</post-process>
<post-process name="transfer-sequences"/>
<post-process name="populate-child-features"/>
<post-process name="create-location-range-index" />
<post-process name="create-overlap-view" />
<post-process name="create-attribute-indexes"/>
<post-process name="summarise-objectstore"/>
<post-process name="create-search-index"/>
</post-processing>
http://intermine.readthedocs.org/en/latest/database/database-building/post-processing/

Building & deploying a mine
Two types of build mechanisms:
• Manual:
$ cd dbmodel && ant clean build-db ## initialize db
$ cd ../integrate ## sources are loaded from the integrate directory
$ ant -Dsource=legumine-gff ## load data sources
$ ant -Dsource=legumine-chr-fasta ## load more sources
$ cd ../postprocess && ant ## run post-process steps
$ cd ../webapp ## build mine webapp
$ ant clean remove-webapp default release-webapp
• Automated:
$ ../bio/scripts/project_build -b -v localhost ~/legumine-dump
http://intermine.readthedocs.org/en/latest/database/database-building/build-script/

Lucene based search index
• Post-process "create-search-index" indexes the database, then zips the index and stores it in the db
• On (first) webapp load, the index is unpacked
• By default, all id and text fields are ignored by the indexer
• Uses the Apache Lucene whitespace analyzer to identify word boundaries
• Control the temp directory and the classes/fields to be ignored by editing the MINE_NAME/dbmodel/resources/keyword_search.properties file
http://intermine.readthedocs.org/en/latest/webapp/keyword-search/
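
What whitespace analysis means for search terms can be shown with a toy inverted index. This is an approximation in plain Python, not Lucene: `str.split()` stands in for the whitespace tokenizer, and the helpers `whitespace_tokens` and `build_index` as well as the sample texts are hypothetical. The point is that tokens are split on whitespace only, so punctuation and hyphens stay attached and case is preserved.

```python
def whitespace_tokens(text):
    """Split on runs of whitespace only; punctuation and case are left untouched."""
    return text.split()


def build_index(documents):
    """Build a toy inverted index: token -> set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for token in whitespace_tokens(text):
            index.setdefault(token, set()).add(doc_id)
    return index


if __name__ == "__main__":
    # Made-up descriptions keyed by made-up object ids.
    docs = {
        1: "nodule-specific protein, putative",
        2: "putative kinase",
    }
    index = build_index(docs)
    print(sorted(index))
    print(index["putative"])
```

In this toy index a lookup for `protein` finds nothing, because the stored token is `protein,` with the trailing comma.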

Alex Kalderimis et al. Nucl. Acids Res. 2014;42:W468-W472
InterMine web services
http://iodocs.labs.intermine.org
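
As a rough sketch of programmatic access, the snippet below only builds the URL for a query against a mine's web service and does not send it. Several details are assumptions to verify against the web-service documentation linked on this slide: the `/query/results` endpoint, the `query` and `format` parameters, and the shape of the query XML. The service root `https://example.org/legumine/service`, the gene symbol and the helper `build_query_url` are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_query_url(service_root, views, path, op, value):
    """Return a URL that asks the service for `views`, filtered by one constraint."""
    # Assumed query XML layout: output columns in 'view', one <constraint> child.
    query = ET.Element("query", {"model": "genomic", "view": " ".join(views)})
    ET.SubElement(query, "constraint", {"path": path, "op": op, "value": value})
    params = {
        "query": ET.tostring(query, encoding="unicode"),
        "format": "json",  # assumed parameter selecting JSON output
    }
    return service_root.rstrip("/") + "/query/results?" + urlencode(params)


# Hypothetical service root and constraint value, for illustration only.
URL = build_query_url(
    "https://example.org/legumine/service",
    ["Gene.primaryIdentifier", "Gene.symbol"],
    "Gene.symbol",
    "=",
    "example",
)

if __name__ == "__main__":
    print(URL)
```

Building the XML with ElementTree and encoding it with `urlencode` avoids hand-escaping quotes and angle brackets in the query string.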

Federated Authentication
• Apart from the standard login scheme (username/password), InterMine supports industry-standard OAuth2-based login flows, as implemented by Google, GitHub, Agave, etc.
• ThaleMine relies on this infrastructure to authenticate users against the araport.org tenant registered within the Agave infrastructure
• Documentation available here: http://intermine.readthedocs.org/en/latest/webapp/properties/web-properties/#openauth2-settings-aka-openid-connect

Friendly reference mines
• FlyMine: https://github.com/intermine/intermine/
• ThaleMine: https://github.com/Arabidopsis-Information-Portal/intermine/
• MedicMine: https://github.com/jcvi-plant-genomics/intermine/
• PhytoMine: https://github.com/JoeCarlson/intermine/

Summary
• Advantages
  ◦ InterMine is a powerful biological data warehouse
  ◦ Performs complex data integration
  ◦ Allows fast and flexible querying
  ◦ Well-documented programmatic interface
  ◦ Cookie-cutter, user-friendly web interface
  ◦ Facilitates cross-talk between "mines"
• Caveats
  ◦ Adding more data requires a full database rebuild (incremental loading is not possible) because of the integration step
• About InterMine:
  ◦ Developed by the Micklem Lab at the University of Cambridge, UK
  ◦ Written in Java, backed by PostgreSQL, deployed under Tomcat. Documentation and downloads available at http://www.intermine.org

Acknowledgments
• InterMine Team
  ◦ Gos Micklem
  ◦ Julie Sullivan
  ◦ Alex Kalderimis
  ◦ Richard Smith
  ◦ Sergio Contrino
  ◦ Josh Heimbach
  ◦ et al.
• Araport Team
  ◦ Chris Town
  ◦ Jason Miller
  ◦ Matt Vaughn
  ◦ Maria Kim
  ◦ Svetlana Karamycheva
  ◦ Erik Ferlanti
  ◦ Chia-Yi Cheng
  ◦ Benjamin Rosen
  ◦ Irina Belyaeva
