Bioinformatics databases: Current Trends and Future Perspectives

Assoc. Prof Dr Sarinder Kaur Kashmir Singh
(Sarinder K. Dhillon)
Data Science & Bioinformatics Lab
Faculty of Science
University of Malaya
sarinder@um.edu.my

In a nutshell….
Today almost every scientist realizes the need to store their data sets using relevant technology and models.
The architecture of a database system is greatly influenced by the underlying computer system on which the database system runs.
Accessing relevant data, combining myriad data sources and coping with distinct and heterogeneous systems is a tremendously difficult task.
Data Science & Bioinformatics Laboratory, Institute of Biological Sciences, Faculty of Science, University of Malaya
Advanced computing power, advanced informatics and the Internet have changed the way biological data are stored, located and disseminated. In the past 20 years, scientists have engaged themselves in the informatics discipline by:
• querying multiple remote or local heterogeneous data sources
• integrating manually received data
• manipulating the data with advanced data analysis and visualisation tools.

Today’s talk is about:
• Database fundamentals: models, architecture and uses
• Types of biological databases
• Examples of significant bioinformatics databases
• Issues pertaining to biological databases
• Future direction of biological databases

Database
Fundamentals
In its simplest form, a database can be defined as an organised and formal collection of information stored in a computer-readable format.
Common approaches
• Flat files: records without structured relationships
• Relational databases: first introduced in 1970; data in tabulated form
• Multidimensional (OLAP): created using inputs from a relational database, where data is stored in a cube or a multidimensional array
• Graph and ontology based databases (NoSQL databases): ontology based databases are in principle built on the object oriented approach, where the datasets are treated as objects with values and properties
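To make the difference between the first two approaches concrete, the following is a minimal sketch that loads a small flat file into two related tables using Python's built-in sqlite3 module. The table names, column names and the placeholder records (parasite_A, host_X and so on) are assumptions made for illustration and do not come from any real database.

```python
import csv
import io
import sqlite3

# Flat file: each record repeats the host name; relationships are only implied.
# All names below are made-up placeholders.
FLAT_FILE = """parasite,host
parasite_A,host_X
parasite_B,host_Y
parasite_C,host_Y
"""


def load_relational(flat_text):
    """Load the flat records into two related tables and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE host (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE parasite (
            id INTEGER PRIMARY KEY,
            name TEXT,
            host_id INTEGER REFERENCES host(id)
        );
        """
    )
    for row in csv.DictReader(io.StringIO(flat_text)):
        # Each host is stored once; parasites point to it through host_id.
        conn.execute("INSERT OR IGNORE INTO host (name) VALUES (?)", (row["host"],))
        host_id = conn.execute(
            "SELECT id FROM host WHERE name = ?", (row["host"],)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO parasite (name, host_id) VALUES (?, ?)",
            (row["parasite"], host_id),
        )
    return conn


def parasites_of(conn, host_name):
    """Follow the relationship with a join instead of scanning the flat text."""
    rows = conn.execute(
        "SELECT p.name FROM parasite p JOIN host h ON p.host_id = h.id "
        "WHERE h.name = ? ORDER BY p.name",
        (host_name,),
    )
    return [r[0] for r in rows]
```

In the relational form the host appears once and the link between parasite and host is explicit, which is what the tabulated model adds over the flat file.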

Example of Relational Database
An example of an Entity Relationship Diagram using a relational schema. Source: Dhillon, S.K., Shuhaimi, N., Hong, S.L.L., and Sidhu, A.S. (2013) Malaysian Parasite Database Infrastructure. In Sidhu, A.S., Dhillon, S.K. (eds) Advances in Biomedical Infrastructure 2013. Springer-Verlag, Heidelberg.

Example of an OLAP Model
Data is represented in a multidimensional cube model.
Source: Dhillon, S.K., Shuhaimi, N., Hong, S.L.L., and Sidhu, A.S. (2013) Malaysian Parasite Database Infrastructure. In Sidhu, A.S. Dhillon,
S. K. (eds) Advances in Biomedical Infrastructure 2013. Springer-Verlag, Heidelberg
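As a rough illustration of the cube idea, the sketch below stores counts in a dictionary keyed by dimension coordinates and then aggregates (rolls up) or slices along a dimension. The three dimensions and all fact rows are hypothetical; this is not the schema of the Malaysian Parasite Database.

```python
from collections import defaultdict

# Hypothetical fact rows: (species, location, year, specimen_count).
FACTS = [
    ("species_A", "site_1", 2011, 4),
    ("species_A", "site_2", 2011, 1),
    ("species_B", "site_1", 2012, 3),
    ("species_A", "site_1", 2012, 2),
]
DIMENSIONS = ("species", "location", "year")


def build_cube(facts):
    """Sum the measure for every (species, location, year) cell."""
    cube = defaultdict(int)
    for *coords, count in facts:
        cube[tuple(coords)] += count
    return dict(cube)


def roll_up(cube, keep):
    """Aggregate the cube down to the dimensions named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    out = defaultdict(int)
    for coords, count in cube.items():
        out[tuple(coords[i] for i in idx)] += count
    return dict(out)


def slice_cube(cube, dimension, value):
    """Keep only the cells whose coordinate on `dimension` equals `value`."""
    i = DIMENSIONS.index(dimension)
    return {c: n for c, n in cube.items() if c[i] == value}
```

For example, `roll_up(build_cube(FACTS), ("species",))` collapses location and year and leaves one total per species.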

Graph and Ontology Based Database
Graph data model representing the nodes (ovals) and relationships (arrows). The descriptive boxes show the main properties of nodes and edges. Source: Costa, R.L., Gadelha, L., Ribeiro-Alves, M., Porto, F., 2017. GeNNet: An integrated platform for unifying scientific workflows and graph databases for transcriptome data analysis. PeerJ, 5, p. e3509.
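The node, relationship and property vocabulary in the caption can be sketched with a small in-memory property graph. The class below is a plain Python illustration, not the GeNNet implementation and not the API of any graph database product; the node labels and relationship types are assumed for the example.

```python
from collections import deque


class PropertyGraph:
    """Nodes and directed edges, each carrying a dictionary of properties."""

    def __init__(self):
        self.nodes = {}  # node id -> {"label": str, "props": dict}
        self.edges = []  # (source id, relationship type, target id, props)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, source, rel_type, target, **props):
        self.edges.append((source, rel_type, target, props))

    def neighbours(self, node_id, rel_type=None):
        """Targets of outgoing edges, optionally filtered by relationship type."""
        return [
            t
            for s, r, t, _ in self.edges
            if s == node_id and (rel_type is None or r == rel_type)
        ]

    def reachable(self, start):
        """All nodes reachable from `start` (breadth-first traversal)."""
        seen, queue = {start}, deque([start])
        while queue:
            for nxt in self.neighbours(queue.popleft()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen - {start}


def build_example():
    """Hypothetical experiment -> gene -> pathway chain."""
    g = PropertyGraph()
    g.add_node("exp_1", "Experiment", platform="microarray")
    g.add_node("gene_1", "Gene", symbol="GENE1")
    g.add_node("pathway_1", "Pathway", name="pathway_P")
    g.add_edge("exp_1", "MEASURES", "gene_1", fold_change=2.0)
    g.add_edge("gene_1", "PARTICIPATES_IN", "pathway_1")
    return g
```

Traversal follows stored relationships directly, which is why connected queries such as "everything linked to this experiment" are natural in the graph model.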

Database Architectures
DISTRIBUTED
WAREHOUSE
FEDERATED
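To contrast two of these architectures, here is a minimal sketch under simple assumptions: a warehouse copies and transforms data from its sources into one central store ahead of time, while a federated system leaves data at the sources and merges answers at query time. The two sources, their field names and their records are hypothetical. A distributed database, which partitions one logical database across several nodes, is not shown.

```python
# Two hypothetical autonomous sources, each with its own record layout.
SOURCE_A = [{"gene": "gene_1", "organism": "org_X"}]
SOURCE_B = [
    {"symbol": "gene_1", "pathway": "pathway_P"},
    {"symbol": "gene_2", "pathway": "pathway_Q"},
]


def build_warehouse():
    """Warehouse: extract, transform and load every source into one store."""
    store = {}
    for rec in SOURCE_A:
        store.setdefault(rec["gene"], {})["organism"] = rec["organism"]
    for rec in SOURCE_B:
        store.setdefault(rec["symbol"], {})["pathway"] = rec["pathway"]
    return store


def federated_query(gene):
    """Federation: ask each source at query time and merge the answers."""
    result = {}
    for rec in SOURCE_A:
        if rec["gene"] == gene:
            result["organism"] = rec["organism"]
    for rec in SOURCE_B:
        if rec["symbol"] == gene:
            result["pathway"] = rec["pathway"]
    return result
```

The trade-off the sketch exposes is freshness against query cost: the warehouse answers from a local copy that can go stale until it is reloaded, while the federated query always sees current source data but must contact every source.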

Uses of Databases: Knowledge extraction
Databases are built to support front end applications, which range from simple query based systems to sophisticated Artificial Intelligence enabled systems.
With the concept of big data having taken hold of Information Technology, databases are extensively used for knowledge extraction using algorithms that are run through large volumes of data.
Thus, databases now have a bigger role to play in many data centric domains, functioning as platforms for machine readable and machine interpretable information.
Examples:
• Protein Information and Knowledge Extractor (PIKE)
• Lynx, a database and knowledge extraction engine for integrative medicine
• Encyclopedia of Life (Parr et al., 2014)

Uses of Databases: Data mining
• Databases are very useful for storing massive and ever growing amounts of data, which can be utilised for discovering patterns and rules, automatically or semi-automatically.
• This method is widely known as data mining and is conceived based on logics in database systems (Paramasivam et al., 2014).
• A clean set of integrated data is usually stored in a data warehouse before the data mining techniques are applied. Key techniques commonly used in data mining: classification, clustering, prediction, association, text mining, link analysis and regression.
• Data mining is seen as an important concept in databases as the process involves collecting data, creating and managing the database, analysing data and finally interpreting data (Han et al., 2000).

Uses of Databases: Visualisation
An effective visual design aids decision making and interpretation to generate knowledge.
Data visualisation software or a script can be used to find patterns, correlations or trends that could not be seen in the textual or numerical content of a database, and so generate knowledge.
One good example of data visualisation is the use of graph based databases, whereby the data is retrieved in a visual context.

Focus of Presentation - Biological databases
A database is a core element of any discipline that produces large amounts of data.
In Biology (particularly in genomics), massive amounts of data are being accumulated on a daily basis via experimental work, field work, software analysis, illustrations, microscopic photographs and image acquisition.
This scenario has led to the construction of not only single databases but also consortia managing large amounts of biological data.

Big data domains - 2025
Stephens et al. (2015) compared genomics with other Big Data domains such as astronomy, YouTube and Twitter in their study of genomics as a Big Data science. They estimated that genomics is leading or on par with the other big domains in terms of data acquisition, storage, distribution, and analysis. Hence, accommodating the data surge in biology is a challenge.
Source: Stephens, Z.D., Lee, S.Y., Faghri, F., et al., 2015. Big data: Astronomical or Genomical? PLOS Biology 13(7), e1002195. Available at: https://doi.org/10.1371/journal.pbio.1002195.

Categories & Types of Biological Databases
FOUR BROAD CATEGORIES:
(a) Primary databases
(b) Secondary databases
(c) Specialised databases
(d) Integrated databases
TYPES OF BIOLOGICAL DATA INCLUDE (but are not limited to):
(a) nucleotide and protein sequences
(b) protein structure
(c) microarray and gene expression
(d) metabolic pathways
(e) species profile (taxonomy)
(f) animal model
(g) human disease
(h) clinical database and others.

Primary Databases
In the early 80s, storing sequence data was not easy due to cost and infrastructure limitations. Nevertheless, many institutions started to store their data in primary databases, which are available on isolated platforms and are focused on a specific subject of study.
Nucleotide Sequence Databases
- European Nucleotide Archive (ENA), GenBank, EMBL Nucleotide Sequence Database
Protein Sequence Databases
- UniProt (UniProtKB, UniRef, UniParc), Swiss-Prot, TrEMBL and PIR-PSD, Protein Data Bank
Microarray Databases (microarray gene expression data)
- ArrayExpress, GEO (Gene Expression Omnibus), Stanford Microarray Database
Metabolic Pathway Databases
- BRaunschweig ENzyme DAtabase (BRENDA), KEGG PATHWAY Database, MetaCyc
Phylogenetics Databases
- PhylomeDB, TreeBASE

Secondary Databases
Contain derived data (obtained by analysing primary data) to address specific requirements.
- Eukaryotic Promoter Database: comprehensive organism-specific transcription start site (TSS) collections derived from NGS data
- NCBI Reference Sequence Database (RefSeq): a set of reference sequences including genomic, transcript, and protein sequences
- Pfam: a database of protein families
- PROSITE: a protein database describing protein families and domains, manually curated at the Swiss Institute of Bioinformatics
- PRoteomics IDEntifications (PRIDE): a data warehouse for proteomics data
- SCOP2: focuses on proteins that are structurally characterized and deposited in the PDB. Proteins are organised according to their structural and evolutionary relationships in a complex graph network.

Specialised Databases
• Phylogenomic Database - Plant Comparative Genomics (GreenPhylDB)
• Plant Secretome and Subcellular Proteome Knowledgebase (PlantSecKB)
• TreeBASE, an open-access database of phylogenetic trees and associated data
• Plant Alternative Splicing database (PASD)
• Comparative Toxicogenomics Database (CTD)
• Drosophila Genes & Genomes (FlyBase)
• WormBase
• AceDB
• Arabidopsis Information Resource (TAIR)
• Online Mendelian Inheritance in Man (OMIM)
• Saccharomyces Genome Database (SGD)

Integrated Databases
Integrated databases are databases unified under one platform, with the aim of offering a suite of applications for secondary uses of the data. These days integrated databases are preferred by researchers due to their usability as one-stop centres for knowledge extraction. These databases are more like consortia that manage and integrate sources of information to provide unified access to users.
• National Center for Biotechnology Information (NCBI)
• Kyoto Encyclopedia of Genes and Genomes
• GeneCards
• The InterPro Consortium
• The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database
• Global Biodiversity Information Facility (GBIF)

Other Biological Databases: Biodiversity databases
Taxonomy databases, also referred to as biodiversity databases, were developed long before sequence databases were introduced. Taxonomy, being a very old and mature branch of Biology, has produced abundant data, particularly on species profiling. Almost every country listed as a megabiodiversity country has produced a database of its own indigenous organisms.
• Global Biodiversity Information Facility (GBIF)
• Taxonomic Data Working Group (TDWG)
• Encyclopedia of Life (EoL)
• Catalogue of Life (CoL)
• Integrated Taxonomic Information System (ITIS)
• FishBase
• and many more

Other Biological Databases: Animal model
Animal models are mostly used in biomedical research, an important research area, to demonstrate biological significance in experiments. Because of this, animal model databases, particularly mouse model databases, have become common these days.
• The MUGEN mouse database (MMdb) is a repository of murine models of immune processes and immunological diseases.
• Mouse Genome Informatics (MGI) is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease.
• The Mouse Phenome Database (MPD) is one of the most widely used resources for primary experimental trait data and genotypic variation.
• The International Mouse Strains Resource (IMSR) is a database of mouse strains, stocks, and mutant ES cell lines available worldwide, including inbred, mutant, and genetically engineered strains.
• The European Mouse Mutant Archive
• Rat Resource and Research Center

Other Biological Databases: Clinical/Health databases
• The National Center for Biotechnology Information (NCBI) has produced two important clinical and health databases:
(i) ClinVar, a database of reports of the relationships among human variations and phenotypes;
(ii) MedGen, a database containing information related to human medical genetics, such as attributes of conditions with a genetic contribution.
• Database of Genotypes and Phenotypes (dbGaP) is an archive of data and results from studies that have investigated the interaction of genotype and phenotype in humans.
• PubMed Health is the world’s largest digital medical library.
• MEDLINE contains journal citations and abstracts for biomedical literature from around the world.
• Clinical and health databases focusing on diseases are also useful resources for biologists and medical scientists:
• MalaCards, a human disease database
• Autoimmune Disease Database
• Inflammatory Bowel Diseases
• KEGG Disease Database
• LiverWiki, a wiki-based database for human liver
• The Diseases Database ver 2.0 contains a whole range of databases in its portal

Current Trends
DATABASE MODELS
• Predominantly, the traditional relational model has been used in the design and development of biological databases.
  • GenBank, EMBL, SWISS-PROT, Protein Information Resource (PIR) and CBS Genome Atlas Database.
• With recent developments in semantic web technologies, many biological ontologies have been created that use object based database models built on ontologies and NoSQL database tools.
  • BioPortal: a repository for biomedical ontologies. It currently contains 704 ontologies.
  • BioPAX
  • Cell Cycle Ontology
  • Gene Expression Knowledgebase
  • Disease Ontology
  • Gene Ontology Consortium
  • Sequence Ontology
  • SNOMED CT
  • Fish Ontology
DATABASE ARCHITECTURE
• All three approaches: Distributed, Federated and Data
Warehouse have been utilised in biological
databases.
• Distributed Databases
• Ensembl, WormBase, and the Berkeley
Drosophila Genome Project
• Federated databases
• TwinNET (Muilu et al., 2007), ENCODE
(Blankenberg et al., 2007), EBI search (Park et al.,
2017), SPINE2 (Goh et al., 2003), Cancer
Biomedical Informatics Grid (Saltz et al., 2006),
NIF (Gardner et al., 2008), Biomedical Informatics
Research Network (Ashish et al., 2010),
Biomedical Investigations (Taylor et al., 2008),
EdgeExpressDB (Severin et al., 2009) and Minimal
Information About Neural Electromagnetic
Ontologies (Frishkoff et al., 2011).
• Data warehouse
• Pathway Commons (Cerami et al., 2010), String
(Szklarczyk et al., 2010), CBS Genome Atlas
Database (Hallin and Ussery, 2004), BioMart
(Haider et al., 2009), BrainMap & PubBrainure

Case studies of Biological Databases
Global Biological Database Efforts
• GeneCards
• TCGA
Database Efforts by Data Science & Bioinformatics Lab, UM
• Fish Ontology
• Patient Records Graph Database
• B Rotunda Database
• My Breast Cancer Cohort (MyBCC) database
• Breast Cancer Module in EMR

GeneCards
In order to gather data scattered across primary databases, the Crown Human Genome Center at the Weizmann Institute of Science (http://www.weizmann.ac.il) developed a database called GeneCards in 1997.
In its early stage, this database dealt mostly with human genome information, human genes, the functions of the encoded proteins and related diseases.
Currently it serves as a complete, authoritative compendium of annotative information about human genes that has been broadly used for almost 15 years (Safran et al., 2010).

The Cancer Genome Atlas (TCGA) aims to generate complete, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer (National Institutes of Health, 2017).
It began in 2005 to catalogue the genetic mutations responsible for cancer, using genome sequencing and bioinformatics.
TCGA has provided a large amount of publicly available data on cancer, highlighting candidate cancer biomarkers and drug targets.

Current Trends- Graph based database
The Fish Ontology (FO) model. A portion of the FO is shown here to illustrate how the classes are related to each other and to other ontology classes. The dark blue circles represent terms from other ontologies while light blue circles represent terms from the FO. Source: Ali, N.M., Khan, H.A., Then, A.Y., Ving Ching, C., Gaur, M., Dhillon, S.K. (2017) Fish Ontology framework for taxonomy-based fish recognition. PeerJ 5:e3811.

Graph Model of Electronic Medical Records
Source: Hong Yung Yip, Nur A. Taib, Haris A. Khan and Sarinder K. Dhillon (2019) Electronic Health Record Integration. In: Ranganathan, S., Gribskov, M., Nakai, K. and Schönbach, C. (eds.), Encyclopedia of Bioinformatics and Computational Biology, vol. 2, pp. 1063–1076. Oxford: Elsevier.

B Rotunda Genomic Database
[Figure: entity relationship diagram of the B Rotunda genomic database. Entities shown: CONTIG, GENE CODING, ANNOTATION, IPRSCAN, miRNA annotation, rRNA annotation, tRNA and rRNA, with annotation sources Swissprot, TrEMBL, KEGG and COG. CONTIG has contig as its primary key; the feature tables carry contig as a foreign key together with start_pos, end_pos, strand and id (PK). Other attributes shown include target, threshold, query_id (FK), subject_id (PK), subject_db, subject_anno, identity, mismatch, align_length, gap, query_start, query_end, subject_start, subject_end, e_value and score.]
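The contig-to-feature relationship in this diagram can be sketched as two tables, using the column names listed for CONTIG and tRNA (contig, start_pos, end_pos, strand, id). The column types, the lower-case table names and the use of SQLite are assumptions for illustration; the slides do not state how the actual database is implemented.

```python
import sqlite3

# Column names follow the diagram; the types are assumed.
SCHEMA = """
CREATE TABLE contig (
    contig TEXT PRIMARY KEY
);
CREATE TABLE trna (
    id        INTEGER PRIMARY KEY,
    contig    TEXT NOT NULL REFERENCES contig(contig),
    start_pos INTEGER,
    end_pos   INTEGER,
    strand    TEXT
);
"""


def open_db():
    """Create an in-memory database with foreign key checks switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


def features_on(conn, contig):
    """Return (id, start_pos, end_pos, strand) for tRNA features on a contig."""
    rows = conn.execute(
        "SELECT id, start_pos, end_pos, strand FROM trna "
        "WHERE contig = ? ORDER BY start_pos",
        (contig,),
    )
    return rows.fetchall()
```

With the foreign key enforced, a feature cannot be stored for a contig that does not exist, which is the integrity guarantee the FK annotations in the diagram express.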

Malaysian Breast Cancer Survivorship Cohort (MyBCC)
• MyBCC is a longitudinal cohort study that aims to determine the impact of lifestyle, mental and socio-cultural conditions on the overall survival and quality of life among multi-ethnic Malaysian women following a new diagnosis of breast cancer.
• The MyBCC database application includes data science techniques to help clinicians and researchers conduct outcome analysis on lifestyle factors affecting the survival time of patients.
Ganggayah, M.D. et al. (2018)

[Figure: i-Pesakit© Breast Cancer Module (BCM), a system used across multiple clinical departments: Radiology, Pharmacy, Oncology, Pathology and Surgery (Breast Unit).
Clinical workflow (data input): First Visit Clinic; Diagnostic MDT* with Radiology; Results Clinic; Treatment MDT* with Oncology; Follow-up Visit Clinic; Relapse MDT*; Relapse.
Clinical research (data output): the i-Pesakit© BCM database is mirrored (with identifiers) and then de-identified into the i-Research database, which is used for research analysis and reporting. Data sources from i-Pesakit© are integrated with related breast cancer research databases: MyBCC (Malaysian Breast Cancer Survivorship Cohort), Biobank and Breast Q (Patient Reported Outcomes Measure). Outputs include data analysis, breast cancer clinical audit and reporting, and reports for the Ministry. The National Registration Department also appears in the diagram.
*MDT: Multidisciplinary meeting. Items marked * in the diagram are implementation enhancements.]
Architecture System of the UMMC i-Pesakit© Breast Cancer Module
Source: Nurul Aqilah Mohd Nor, Nur Aishah Taib, Marniza Saad, Hana Salwani Zaini, Zahir Ahmad, Yamin Ahmad, Sarinder Kaur Dhillon
(2018). Development of Electronic Medical Records for Clinical and Research Purposes: The Breast Cancer Module Using an
Implementation Framework in a Middle Income Country- Malaysia, BMC Bioinformatics ( in press) (ISI-Indexed)
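The step from a mirrored database with identifiers to a de-identified research database can be sketched as follows. The slides do not describe how de-identification is actually performed in i-Research, so this is only one possible approach under stated assumptions: the field names (name, national_id) and the keyed-hash pseudonym are hypothetical, and a real deployment would also need to handle indirect identifiers such as dates and addresses.

```python
import hashlib
import hmac

# Assumed names of the directly identifying fields in a mirrored record.
IDENTIFYING_FIELDS = {"name", "national_id"}


def pseudonym(value, secret_key):
    """Derive a stable research ID from an identifier using a keyed hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def de_identify(record, secret_key):
    """Drop direct identifiers and attach a pseudonymous research ID."""
    research = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    research["research_id"] = pseudonym(record["national_id"], secret_key)
    return research
```

Because the same identifier always maps to the same research ID under a given key, records for one patient can still be linked across the research databases without exposing who the patient is.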

Journals
The most current online biological databases can be found in the annual database issue of the journal Nucleic Acids Research (https://academic.oup.com/nar) and in Database: The Journal of Biological Databases and Curation (https://academic.oup.com/database). Both journals are published open access by Oxford Academic.
Another important source of biological databases is BMC Bioinformatics (https://bmcbioinformatics.biomedcentral.com/).

Issues Pertaining to Biological Databases
HETEROGENEITY
Data is still in various forms, locations and diverse formats. One of the most intriguing phases in promoting the growth of biological information centres is to correlate, synthesize, disseminate, share and retrieve information in the form of databases. Linking data is challenging.
DATA INTEGRATION
The fundamental problem affecting data integration is the adoption of data standards. In Biology, data standards in terms of vocabularies and ontologies have facilitated data integration tremendously. However, an undesirable situation can arise when developers come from a computing background and do not really understand the available data standards in Biology, while biologists are not very keen to undertake technical tasks such as developing databases.
DATA SHARING
Publishing databases online is still not favoured by scientists, especially if they are the data owners, as it may give rise to copyright issues as well as misuse of the data by fraudulent parties. Scientists who do not want to share unpublished data keep their valuable data on personal computers, where it remains untapped most of the time. This kind of scattered data, both textual and image data, does not allow the discovery of new knowledge, and it is evident that correlated data can give rise to new discoveries in science.
DIGITIZATION
The development of databases requires data to be in digital form, and digitizing biological data is a daunting task. It requires extensive manpower and sufficient equipment, all of which require funds. However, this can be overcome if world class biological data centres make databasing accessible enough for underprivileged scientists.
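The role of shared vocabularies in data integration can be illustrated with a small sketch that maps source-specific field names and synonymous species names onto one standard form. The field mapping, the synonym table and all record values are hypothetical placeholders; real projects would draw these from a published data standard or ontology rather than a hand-written dictionary.

```python
# Hypothetical mapping from source-specific field names to a standard term.
FIELD_MAP = {
    "species": "scientific_name",
    "sci_name": "scientific_name",
    "taxon": "scientific_name",
    "loc": "locality",
    "locality": "locality",
}

# Hypothetical synonym table: outdated name -> accepted name.
SYNONYMS = {"old_name_a": "accepted_name_a"}


def harmonise(record):
    """Rewrite one source record using the shared vocabulary."""
    out = {}
    for field, value in record.items():
        standard = FIELD_MAP.get(field.lower())
        if standard is None:
            continue  # fields with no mapping are dropped in this sketch
        value = value.strip()
        if standard == "scientific_name":
            value = SYNONYMS.get(value.lower(), value)
        out[standard] = value
    return out
```

Two records that look different at the source, such as `{"Species": "old_name_a", "loc": "site_1"}` and `{"taxon": "accepted_name_a", "locality": "site_1"}`, become identical after harmonisation and can then be linked.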

Future Direction
1. Biological databases are an integral part of research in many scientific areas (from data extraction, knowledge discovery and biological simulations to advanced analysis).
2. It is important to relook at the models and infrastructure deployed in the construction of these databases.
3. Older database models such as the relational model, which have been the core technology behind biological databases, are losing their relevance with regard to handling the exponential increase of data.
4. In line with the exponential increase in size and heterogeneity of biological data, new database models have been proposed.
NoSQL Database
• Recent studies have explored practical implementations of NoSQL databases to help develop a viable and usable practice in information management.
• The growing data in genomics, metabolomics, proteomics and metagenomics can be regarded as Big Data if these data sources are harmonised.
• It is critical to explore back end data integration that is streamlined into resilient front end computer systems for performing seamless transactions, whether for simple search algorithms or high level data science.
• In order to achieve this vision, these data sources need to be mediated via ontologies and NoSQL databases by adopting parallel computing technology and Artificial Intelligence. To facilitate the new paradigm, automated methods are needed to map relational models (SQL based) into NoSQL models, which are the essence of big data applications.
• The NoSQL graph model has recently become very popular as it presents a solution to many of today’s challenges, such as the visualisation of data with complex connections and rich relationships.
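The idea of mapping a relational model into a NoSQL model can be sketched by folding joined rows into nested, JSON-style documents. This is a hand-written illustration rather than one of the automated mapping methods the slide calls for; the gene and annotation tables, their columns and the sample rows are hypothetical.

```python
import json
import sqlite3


def demo_connection():
    """Build a small hypothetical relational source: gene 1..n annotation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT);
        CREATE TABLE annotation (
            id INTEGER PRIMARY KEY,
            gene_id INTEGER REFERENCES gene(id),
            source TEXT,
            term TEXT
        );
        INSERT INTO gene VALUES (1, 'GENE1'), (2, 'GENE2');
        INSERT INTO annotation VALUES
            (1, 1, 'source_a', 'term_x'),
            (2, 1, 'source_b', 'term_y');
        """
    )
    return conn


def relational_to_documents(conn):
    """Fold each gene and its annotation rows into one nested document."""
    docs = {}
    rows = conn.execute(
        "SELECT g.id, g.symbol, a.source, a.term FROM gene g "
        "LEFT JOIN annotation a ON a.gene_id = g.id ORDER BY g.id, a.id"
    )
    for gene_id, symbol, source, term in rows:
        doc = docs.setdefault(
            gene_id, {"_id": gene_id, "symbol": symbol, "annotations": []}
        )
        if source is not None:  # genes without annotations keep an empty list
            doc["annotations"].append({"source": source, "term": term})
    return list(docs.values())


def to_json(docs):
    """Serialise the documents as they might be loaded into a document store."""
    return json.dumps(docs, indent=2)
```

The one-to-many relationship that needed a join in the relational form is embedded inside each document, so a document store can return a gene with all its annotations in a single read.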

Are NoSQL databases taking over traditional relational databases?

Advantages of NoSQL over Relational
• Cost: implementation and infrastructure costs of relational systems are high, whereas most NoSQL database systems are open-source and can run on inexpensive commodity hardware architectures.
• Performance: in contrast to relational databases, NoSQL distributed databases offer higher performance and are capable of spanning data across server nodes, racks, or even multiple data centers with no single point of failure.
• Availability: NoSQL databases provide high availability due to their distributed nature and data replication.
• Flexibility: the increasing heterogeneity of bioinformatics data, such as free-text notes, images and other complex data that are unstructured or semi-structured, requires new storage alternatives. Flexible data models or schemas powered by NoSQL databases allow complex data to be stored easily.
• Scalability: the size of bioinformatics data is escalating over time and eventually becomes a bottleneck for traditional relational systems. NoSQL databases are based on horizontal scalability, which permits effortless and automatic scaling.
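Horizontal scalability and replication can be illustrated with a minimal sketch of hash-based partitioning: each record key is hashed to pick a node, and copies are placed on the following nodes. The function names and the replica placement rule are assumptions for illustration. Plain modulo hashing as used here moves most keys when the number of nodes changes, which is why production systems typically use schemes such as consistent hashing instead.

```python
import hashlib


def shard_for(key, n_shards):
    """Map a record key to a shard number in the range 0..n_shards-1."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def replicas_for(key, n_shards, n_replicas=2):
    """Place the primary copy on the hashed shard and copies on the next ones."""
    first = shard_for(key, n_shards)
    return [(first + i) % n_shards for i in range(n_replicas)]
```

Because the placement depends only on the key and the number of shards, any client can locate a record without consulting a central index, and losing one node still leaves a replica available.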

Summary
THIS PRESENTATION COVERS A HOLISTIC OVERVIEW OF BIOLOGICAL DATABASES, WITH REGARD TO MODELLING AND THE ARCHITECTURE OF THE DATABASES. A COMPREHENSIVE LITERATURE REVIEW ON BIOLOGICAL DATABASES IS PRESENTED WITH A FOCUS ON A FEW CASE STUDIES.
THIS PRESENTATION, HOWEVER, BY NO MEANS COVERS THE WHOLE RANGE OF BIOLOGICAL DATABASES THAT ARE CURRENTLY AVAILABLE.
FINALLY, THE FUTURE DIRECTION OF BIOLOGICAL DATABASES IS DISCUSSED, FOCUSING ON NEW CONCEPTS IN THE DIGITAL WORLD SUCH AS BIG DATA, DATA SCIENCE AND NOSQL DATABASES, WHICH HOLD THE FUTURE OF BIOLOGICAL DATABASES.

THANK YOU
Main References:
Sarinder K. Dhillon (2019) Biological Databases. In: Ranganathan, S.,
Gribskov, M., Nakai, K. and Schönbach, C. (eds.), Encyclopedia of
Bioinformatics and Computational Biology, vol. 2, pp. 96–117. Oxford:
Elsevier.
Hong Yung Yip, Nur A. Taib, Haris A. Khan and Sarinder K. Dhillon (2019)
Electronic Health Record Integration. In: Ranganathan, S., Gribskov, M.,
Nakai, K. and Schönbach, C. (eds.), Encyclopedia of Bioinformatics and
Computational Biology, vol. 2, pp. 1063–1076. Oxford: Elsevier.
