Bioinformatics
Databases
Current Trends and
Future Perspectives
Assoc. Prof Dr Sarinder Kaur Kashmir Singh
(Sarinder K. Dhillon)
Data Science & Bioinformatics Lab
Faculty of Science
University of Malaya
sarinder@um.edu.my
In a nutshell…
Today almost every scientist realises the need to store their data sets using relevant technology and models.
The architecture of a database system is greatly influenced by the underlying computer system on which it runs.
Accessing relevant data, combining myriad data sources and coping with distinct and heterogeneous systems are tremendously difficult tasks.
Data Science & Bioinformatics Laboratory, Institute of Biological Sciences, Faculty of Science, University of Malaya
Advanced computing power, advanced informatics and the Internet have changed the way biological data are stored, located and disseminated. In the past 20 years, scientists have engaged with the informatics discipline by:
• querying multiple remote or local heterogeneous data sources
• manually integrating the received data
• manipulating it with advanced data analysis and visualisation tools.
Today’s talk is about:
• Database fundamentals: models, architecture and uses
• Types of biological databases
• Examples of significant bioinformatics databases
• Issues pertaining to biological databases
• Future direction of biological databases
Database Fundamentals
In its simplest form, a database can be defined as an organised and formal collection of information stored in a computer-readable format.
Common approaches
• Flat files: records without structured relationships.
• Relational databases: first introduced in 1970; data are held in tabular form.
• Multidimensional (OLAP): built from the contents of a relational database, with the data stored in a cube or multidimensional array.
• Graph and ontology-based databases (NoSQL databases): ontology-based databases in principle follow the object-oriented approach, where datasets are treated as objects with values and properties.
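As a rough illustration of the first three approaches, the sketch below keeps the same records as a flat file, loads them into a relational table, and then aggregates that table into a small cube. The table name, column names and sample counts are invented for illustration, and Python's built-in sqlite3 module stands in for a full relational database system.

```python
import csv
import io
import sqlite3

# Flat file: records with no structured relationships (hypothetical sample data).
FLAT_FILE = """species,site,year,individuals
Channa striata,Site A,2018,4
Channa striata,Site A,2019,6
Channa striata,Site B,2019,1
Clarias batrachus,Site A,2019,3
"""

def load_relational(flat_text):
    """Load the flat file into a relational table (data in tabular form)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observation "
        "(species TEXT, site TEXT, year INTEGER, individuals INTEGER)"
    )
    rows = csv.DictReader(io.StringIO(flat_text))
    conn.executemany(
        "INSERT INTO observation VALUES (:species, :site, :year, :individuals)",
        rows,
    )
    return conn

def build_cube(conn):
    """Aggregate the table into a cube keyed by (species, site, year)."""
    query = (
        "SELECT species, site, year, SUM(individuals) "
        "FROM observation GROUP BY species, site, year"
    )
    return {(sp, site, year): total for sp, site, year, total in conn.execute(query)}

def roll_up(cube, axis):
    """Collapse the cube onto one dimension (0 = species, 1 = site, 2 = year)."""
    totals = {}
    for key, value in cube.items():
        totals[key[axis]] = totals.get(key[axis], 0) + value
    return totals

conn = load_relational(FLAT_FILE)
cube = build_cube(conn)
by_year = roll_up(cube, 2)
```

The cube is simply a pre-aggregated view of the relational data, which is why OLAP models are normally fed from a relational source.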
Example of Relational Database
An example of an entity-relationship diagram using a relational schema. Source: Dhillon, S.K., Shuhaimi, N., Hong, S.L.L., and Sidhu, A.S. (2013) Malaysian Parasite Database Infrastructure. In Sidhu, A.S., Dhillon, S.K. (eds) Advances in Biomedical Infrastructure 2013. Springer-Verlag, Heidelberg.
Example of an OLAP Model
Data are represented in a multidimensional cube model.
Source: Dhillon, S.K., Shuhaimi, N., Hong, S.L.L., and Sidhu, A.S. (2013) Malaysian Parasite Database Infrastructure. In Sidhu, A.S., Dhillon, S.K. (eds) Advances in Biomedical Infrastructure 2013. Springer-Verlag, Heidelberg.
Graph and Ontology Based Database
Graph data model showing the nodes (ovals) and relationships (arrows). The descriptive boxes show the main properties of the nodes and edges. Source: Costa, R.L., Gadelha, L., Ribeiro-Alves, M., Porto, F. (2017) GeNNet: An integrated platform for unifying scientific workflows and graph databases for transcriptome data analysis. PeerJ 5:e3509.
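A property graph of this kind can be sketched with plain Python structures: nodes and edges both carry properties, and queries follow relationships instead of joining tables. The node labels, relationship names and values below are hypothetical and are not taken from GeNNet.

```python
from collections import deque

# Nodes and edges both carry properties (all values are hypothetical).
nodes = {
    "g1": {"label": "Gene", "symbol": "GENE_A"},
    "g2": {"label": "Gene", "symbol": "GENE_B"},
    "p1": {"label": "Pathway", "name": "Pathway X"},
    "e1": {"label": "Experiment", "platform": "microarray"},
}
edges = [
    ("g1", "PART_OF", "p1", {"evidence": "curated"}),
    ("g2", "PART_OF", "p1", {"evidence": "predicted"}),
    ("e1", "MEASURED", "g1", {"log_fc": 1.8}),
]

def neighbours(node_id, relation=None):
    """Ids reachable from node_id over outgoing edges, optionally of one type."""
    return [dst for src, rel, dst, _props in edges
            if src == node_id and (relation is None or rel == relation)]

def reachable(start):
    """Breadth-first traversal along outgoing edges."""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}
```

Here `reachable("e1")` walks from an experiment to the gene it measured and on to that gene's pathway, the kind of multi-hop question that would need several joins in a relational schema.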
Database Architectures
• Distributed
• Data warehouse
• Federated
Uses of Databases - Knowledge extraction
Databases are built to support front-end applications, which range from simple query-based systems to sophisticated Artificial Intelligence-enabled systems.
With big data now established in information technology, databases are extensively used for knowledge extraction, using algorithms that are run over large volumes of data.
Thus, databases now have a bigger role to play in many data-centric domains, functioning as platforms that hold information in machine-readable and machine-interpretable formats.
Examples:
• Protein Information and Knowledge Extractor (PIKE)
• Lynx, a database and knowledge extraction engine for integrative medicine
• Encyclopedia of Life (Parr et al., 2014)
Uses of Databases - Data mining
• Databases are very useful for storing massive and ever-growing amounts of data, which can be used to discover patterns and rules automatically or semi-automatically.
• This method is widely known as data mining and is founded on the logic of database systems (Paramasivam et al., 2014).
• A clean set of integrated data is usually stored in a data warehouse before data mining techniques are applied. Key techniques commonly used in data mining are classification, clustering, prediction, association, text mining, link analysis and regression.
• Data mining is seen as an important concept in databases because the process involves collecting data, creating and managing the database, analysing the data and finally interpreting it (Han et al., 2000).
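Of the techniques listed, association is the simplest to show in a few lines. The sketch below computes the support and confidence of a rule over a handful of annotation sets; the records are invented for illustration and stand in for rows pulled from a cleaned data warehouse.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical annotation sets for five proteins.
records = [
    {"kinase", "ATP-binding", "membrane"},
    {"kinase", "ATP-binding"},
    {"kinase", "ATP-binding", "nucleus"},
    {"transporter", "membrane"},
    {"kinase", "nucleus"},
]

rule_support = support(records, {"kinase", "ATP-binding"})          # 3 of 5 records
rule_confidence = confidence(records, {"kinase"}, {"ATP-binding"})  # 3 of 4 kinases
```

A rule is usually kept only when both values clear thresholds chosen by the analyst; those thresholds are a judgement call and are not specified here.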
Uses of Databases - Visualisation
An effective visual design aids decision making and interpretation to generate knowledge.
Data visualisation software or a script can be used to find patterns, correlations or trends that cannot be seen in the textual or numerical content of a database, and so generate knowledge.
One good example of data visualisation is the use of graph-based databases, where the data are retrieved in a visual context.
Focus of Presentation - Biological databases
A database is a core element of any discipline that produces large amounts of data.
In biology (particularly in genomics), massive amounts of data are accumulated daily through experimental work, field work, software analysis, illustrations, microscopic photographs and image acquisition.
This has led to the construction not only of single databases but also of consortia managing large amounts of biological data.
Big data domains - 2025
Stephens et al. (2015) compared genomics with other Big Data domains such as astronomy, YouTube and Twitter in their study of genomics as a Big Data science. They estimated that by 2025 genomics will be leading or on par with the other domains in terms of data acquisition, storage, distribution and analysis. Hence, accommodating the data surge in biology is a challenge.
Source: Stephens, Z.D., Lee, S.Y., Faghri, F., et al. (2015) Big data: Astronomical or Genomical? PLOS Biology 13(7), e1002195. Available at: https://doi.org/10.1371/journal.pbio.1002195.
Categories & Types of Biological Databases
FOUR BROAD CATEGORIES:
(a) Primary databases
(b) Secondary databases
(c) Specialised databases
(d) Integrated databases
TYPES OF BIOLOGICAL DATA INCLUDE (but are not limited to):
(a) nucleotide and protein sequences
(b) protein structure
(c) microarray and gene expression
(d) metabolic pathways
(e) species profile (taxonomy)
(f) animal model
(g) human disease
(h) clinical database and others.
Primary Databases
In the early 80s, storing sequence data was not easy due to cost and infrastructure limitations. Nevertheless, many institutions started to store their data in primary databases, which are available on isolated platforms and focus on a specific subject of study.
Nucleotide Sequence Databases
- European Nucleotide Archive (ENA), GenBank, EMBL Nucleotide Sequence Database
Protein Sequence Databases
- UniProt (UniProtKB, UniRef, UniParc), Swiss-Prot, TrEMBL and PIR-PSD, Protein Data Bank
Microarray Databases (microarray gene expression data)
- ArrayExpress, GEO (Gene Expression Omnibus), Stanford Microarray Database
Metabolic Pathway Databases
- BRaunschweig ENzyme DAtabase (BRENDA), KEGG PATHWAY Database, MetaCyc
Phylogenetics Databases
- PhylomeDB, TreeBASE
Secondary Databases
Contain derived data (obtained by analysing primary data) to address specific requirements.
- Eukaryotic Promoter Database: comprehensive organism-specific transcription start site (TSS) collections derived from NGS data
- NCBI Reference Sequence Database (RefSeq): a set of reference sequences including genomic, transcript and protein
- Pfam: a database of protein families
- PROSITE: a protein database describing protein families and domains, manually curated at the Swiss Institute of Bioinformatics
- PRIDE (PRoteomics IDEntifications): a data warehouse for proteomics data
- SCOP2: focuses on proteins that are structurally characterised and deposited in the PDB. Proteins are organised according to their structural and evolutionary relationships in a complex graph network.
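To make the idea of derived data concrete, the sketch below scans a primary protein sequence with a PROSITE-style pattern and reports where it matches. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site signature (PS00001); the converter handles only a subset of the pattern syntax (x, [..], {..} and repeat counts), and the sequence is made up for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a Python regex.

    Supported: single residues, x (any residue), [ABC] (one of),
    {ABC} (none of), and repeat counts such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)

def find_sites(sequence, pattern):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=(%s))" % prosite_to_regex(pattern))
    return [m.start() + 1 for m in regex.finditer(sequence)]

# Hypothetical sequence; positions 3 and 12 satisfy the pattern, position 8 does not
# because the asparagine there is followed by a proline.
sites = find_sites("MKNGSALNPTQNATW", "N-{P}-[ST]-{P}")
```

The list of match positions is the derived record; a secondary database stores results of this kind alongside a reference back to the primary entry.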
Specialised Databases
• Phylogenomic Database - Plant Comparative Genomics (GreenPhylDB)
• Plant Secretome and Subcellular Proteome Knowledgebase (PlantSecKB)
• TreeBASE, an open-access database of phylogenetic trees and associated data
• Plant Alternative Splicing database (PASD)
• Comparative Toxicogenomics Database (CTD)
• Drosophila Genes & Genomes (FlyBase)
• WormBase
• AceDB
• Arabidopsis Information Resource (TAIR)
• Online Mendelian Inheritance in Man (OMIM)
• Saccharomyces Genome Database (SGD)
Integrated Databases
Integrated databases are databases unified under one platform, with the aim of offering a suite of applications for secondary uses of the data. Researchers now prefer integrated databases because they serve as one-stop centres for knowledge extraction. These databases work more like consortia, managing and integrating sources of information to give users unified access.
• National Center for Biotechnology Information (NCBI)
• Kyoto Encyclopedia of Genes and Genomes
• GeneCards
• The InterPro Consortium
• The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database
• Global Biodiversity Information Facility (GBIF)
Other biological databases - Biodiversity databases
Taxonomy databases, also referred to as biodiversity databases, were developed well before sequence databases were introduced. Taxonomy, being a very old and mature branch of biology, has produced abundant data, particularly on species profiling. Almost every country listed as a megadiverse country has produced a database of its own indigenous organisms.
• Global Biodiversity Information Facility (GBIF)
• Taxonomic Data Working Group (TDWG)
• Encyclopedia of Life (EoL)
• Catalogue of Life (CoL)
• Integrated Taxonomic Information System (ITIS)
• FishBase
• and many more
Other biological databases - Animal model
Animal models are mostly used in biomedical research, an important research area for demonstrating biological significance in experiments. As a result, animal model databases, particularly mouse model databases, have become common.
• The MUGEN mouse database (MMdb) is a repository of murine models of immune processes and immunological diseases.
• Mouse Genome Informatics (MGI) is the international database resource for the laboratory mouse, providing integrated genetic, genomic and biological data to facilitate the study of human health and disease.
• The Mouse Phenome Database (MPD) is one of the most widely used resources for primary experimental trait data and genotypic variation.
• The International Mouse Strains Resource (IMSR) is a database of mouse strains, stocks and mutant ES cell lines available worldwide, including inbred, mutant and genetically engineered strains.
• The European Mouse Mutant Archive
• Rat Resource and Research Center
Other biological databases - Clinical/Health databases
• The National Center for Biotechnology Information (NCBI) has produced two important clinical and health databases:
(i) ClinVar, a database of reports on the relationships between human variations and phenotypes;
(ii) MedGen, a database containing information related to human medical genetics, such as attributes of conditions with a genetic contribution.
• The Database of Genotypes and Phenotypes (dbGaP) is an archive of data and results from studies that have investigated the interaction of genotype and phenotype in humans.
• PubMed Health is the world’s largest digital medical library.
• MEDLINE contains journal citations and abstracts for biomedical literature from around the world.
• Clinical and health databases focusing on diseases are also useful resources for biologists and medical scientists:
• MalaCards, a human disease database
• Autoimmune Disease Database
• Inflammatory Bowel Diseases
• KEGG Disease Database
• LiverWiki, a wiki-based database for the human liver
• The Diseases Database ver 2.0, which contains a whole range of databases in its portal
Current Trends
DATABASE MODELS
• The traditional relational model has predominantly been used in the design and development of biological databases.
• Examples: GenBank, EMBL, SWISS-PROT, Protein Information Resource (PIR) and CBS Genome Atlas Database.
• With recent developments in semantic web technologies, many biological ontologies have been created that use object-based database models built on ontologies and NoSQL database tools.
• BioPortal, a repository for biomedical ontologies. It currently contains 704 ontologies.
• BioPAX
• Cell Cycle Ontology
• Gene Expression Knowledgebase
• Disease Ontology
• Gene Ontology Consortium
• Sequence Ontology
• SNOMED CT
• Fish Ontology
DATABASE ARCHITECTURE
• All three approaches (distributed, federated and data warehouse) have been used in biological databases.
• Distributed databases: Ensembl, WormBase, and the Berkeley Drosophila Genome Project
• Federated databases: TwinNET (Muilu et al., 2007), ENCODE (Blankenberg et al., 2007), EBI search (Park et al., 2017), SPINE2 (Goh et al., 2003), Cancer Biomedical Informatics Grid (Saltz et al., 2006), NIF (Gardner et al., 2008), Biomedical Informatics Research Network (Ashish et al., 2010), Biomedical Investigations (Taylor et al., 2008), EdgeExpressDB (Severin et al., 2009) and Minimal Information About Neural Electromagnetic Ontologies (Frishkoff et al., 2011)
• Data warehouses: Pathway Commons (Cerami et al., 2010), STRING (Szklarczyk et al., 2010), CBS Genome Atlas Database (Hallin and Ussery, 2004), BioMart (Haider et al., 2009), BrainMap & PubBrainure
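The practical difference between the federated and warehouse approaches can be sketched as follows: a federated query is dispatched to each source at query time, whereas a warehouse answers from a copy loaded in advance. The two in-memory SQLite databases, the `gene` table and its contents are hypothetical stand-ins for independent source databases.

```python
import sqlite3

def make_source(rows):
    """Create one independent source database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (symbol TEXT, organism TEXT)")
    conn.executemany("INSERT INTO gene VALUES (?, ?)", rows)
    return conn

def federated_query(sources, organism):
    """Federated: ask every source at query time and merge the answers."""
    hits = []
    for conn in sources:
        cursor = conn.execute(
            "SELECT symbol FROM gene WHERE organism = ?", (organism,)
        )
        hits.extend(row[0] for row in cursor)
    return sorted(hits)

def build_warehouse(sources):
    """Warehouse: copy all source data into one central store ahead of time."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE gene (symbol TEXT, organism TEXT)")
    for conn in sources:
        rows = conn.execute("SELECT symbol, organism FROM gene").fetchall()
        warehouse.executemany("INSERT INTO gene VALUES (?, ?)", rows)
    return warehouse

def warehouse_query(warehouse, organism):
    """Answer from the central copy only; the sources are not contacted."""
    cursor = warehouse.execute(
        "SELECT symbol FROM gene WHERE organism = ?", (organism,)
    )
    return sorted(row[0] for row in cursor)

source_a = make_source([("BRCA1", "human"), ("Trp53", "mouse")])
source_b = make_source([("TP53", "human")])
warehouse = build_warehouse([source_a, source_b])
```

The trade-off is freshness against speed and availability: the federated query always sees the latest source data but depends on every source being reachable, while the warehouse is self-contained but stale until it is reloaded.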
Case studies of Biological Databases
Global Biological Database Efforts
• GeneCards
• TCGA
Database Efforts by Data Science & Bioinformatics Lab, UM
• Fish Ontology
• Patient Records Graph Database
• B Rotunda Database
• My Breast Cancer Cohort (MyBCC) database
• Breast Cancer Module in EMR
GeneCards
To gather data scattered across primary databases, the Weizmann Institute of Science Crown Human Genome Center (http://www.weizmann.ac.il) developed a database called GeneCards in 1997.
In its early stage, the database dealt mostly with human genome information, human genes, the functions of the encoded proteins and related diseases.
Currently it serves as a complete, authoritative compendium of annotative information about human genes that has been broadly used for almost 15 years (Safran et al., 2010).
The Cancer Genome Atlas (TCGA) aims to generate complete, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer (National Institutes of Health, 2017).
It began in 2005 to catalogue the genetic mutations responsible for cancer, using genome sequencing and bioinformatics.
TCGA has provided a large amount of publicly available data on cancer, highlighting candidate cancer biomarkers and drug targets.
Current Trends - Graph based database
The Fish Ontology (FO) model. A portion of the FO is shown here to illustrate how the classes are related to each other and to classes in other ontologies. The dark blue circles represent terms from other ontologies, while the light blue circles represent terms from the FO. Source: Ali, N.M., Khan, H.A., Then, A.Y., Ving Ching, C., Gaur, M., Dhillon, S.K. (2017) Fish Ontology framework for taxonomy-based fish recognition. PeerJ 5:e3811.
Graph Model of Electronic Medical Records
Source: Hong Yung Yip, Nur A. Taib, Haris A. Khan and Sarinder K. Dhillon (2019) Electronic Health Record Integration. In: Ranganathan, S., Gribskov, M., Nakai, K. and Schönbach, C. (eds.), Encyclopedia of Bioinformatics and Computational Biology, vol. 2, pp. 1063–1076. Oxford: Elsevier.
B Rotunda Genomic Database
[Figure: entity-relationship schema of the B Rotunda genomic database. Entities named in the diagram: CONTIG, GENE CODING, ANNOTATION, IPRSCAN, tRNA, rRNA, miRNA annotation and rRNA annotation. CONTIG is keyed by contig (PK). The feature tables reference it through contig (FK) and record start_pos, end_pos and strand under an id (PK); some also carry type, target, annotation and threshold. Columns listed for the annotation tables include query_id (FK), subject_id (PK), subject_db, identity, mismatch, align_length, gap, query_start, query_end, subject_start, subject_end, e_value, score and subject_anno. A further table keyed by unigene (PK) holds the columns nr, swissprot, trembl, kegg, cog, interpro and go.]
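A small part of such a schema can be sketched in SQLite. Only the CONTIG and tRNA tables are shown, using the column names and key roles visible in the diagram; the column types, the NOT NULL and CHECK constraints, the strand encoding and the sample rows are assumptions added for illustration.

```python
import sqlite3

# Column names follow the diagram; types and constraints are assumed.
SCHEMA = """
CREATE TABLE contig (
    contig TEXT PRIMARY KEY
);
CREATE TABLE trna (
    id        INTEGER PRIMARY KEY,
    contig    TEXT NOT NULL REFERENCES contig(contig),
    start_pos INTEGER NOT NULL,
    end_pos   INTEGER NOT NULL,
    strand    TEXT NOT NULL CHECK (strand IN ('+', '-'))
);
"""

def open_database():
    """Create an in-memory database with foreign-key enforcement switched on."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn

def features_on(conn, contig):
    """List the tRNA features annotated on one contig, ordered by position."""
    cursor = conn.execute(
        "SELECT id, start_pos, end_pos, strand FROM trna "
        "WHERE contig = ? ORDER BY start_pos",
        (contig,),
    )
    return cursor.fetchall()

conn = open_database()
conn.execute("INSERT INTO contig VALUES ('contig_0001')")          # hypothetical row
conn.execute("INSERT INTO trna VALUES (1, 'contig_0001', 120, 192, '+')")
```

With the foreign key enforced, a feature cannot be stored against a contig that does not exist, which is the integrity guarantee the (FK) markers in the diagram express.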
Malaysian Breast Cancer Survivorship Cohort (MyBCC)
• MyBCC is a longitudinal cohort study that aims to determine the impact of lifestyle, mental and socio-cultural conditions on overall survival and quality of life among multi-ethnic Malaysian women following a new diagnosis of breast cancer.
• The MyBCC database application includes data science techniques to help clinicians and researchers conduct outcome analysis on the lifestyle factors affecting the survival time of patients.
Source: Ganggayah, M.D. et al. (2018)
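The slides do not say which techniques the MyBCC application uses. As one standard method for this kind of survival outcome analysis, the sketch below implements the Kaplan-Meier estimator in plain Python; the follow-up times and event flags are invented and do not come from the cohort.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival estimate.

    durations: follow-up time for each patient.
    events:    1 if the event (death) was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) at each observed event time.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        time = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == time:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((time, survival))
        at_risk -= leaving  # both deaths and censored patients leave the risk set
    return curve

# Hypothetical follow-up in years; 0 marks a censored observation.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Comparing curves computed separately for groups defined by a lifestyle factor is the usual next step; a formal comparison would need a test such as the log-rank test, which is not shown here.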
Architecture of the UMMC i-Pesakit© Breast Cancer Module
[Figure: system architecture of the i-Pesakit© Breast Cancer Module (BCM). Elements shown in the diagram:
• Integration of data sources from i-Pesakit©: Radiology, Pharmacy, Oncology, Pathology, Surgery (Breast Unit) and the National Registration Department.
• i-Pesakit© Breast Cancer Module, a system used across multiple clinical departments: First Visit Clinic; Diagnostic MDT with Radiology; Results Clinic; Treatment MDT with Oncology; Follow-up Visit Clinic; Relapse MDT; Relapse. (MDT: multidisciplinary meeting.)
• Clinical workflow (data input) and clinical research (data output).
• Breast Cancer Module (i-Research): a mirrored i-Pesakit© BCM database with identifiers, and a de-identified i-Research database for research analysis and reporting.
• Related breast cancer research databases: MyBCC (Malaysian Breast Cancer Survivorship Cohort), Biobank, Breast Q (Patient Reported Outcomes Measure).
• Outputs: breast cancer clinical audit and reporting, reports for the Ministry, data analysis.
Items marked * in the figure are implementation enhancements.]
Source: Nurul Aqilah Mohd Nor, Nur Aishah Taib, Marniza Saad, Hana Salwani Zaini, Zahir Ahmad, Yamin Ahmad, Sarinder Kaur Dhillon (2018). Development of Electronic Medical Records for Clinical and Research Purposes: The Breast Cancer Module Using an Implementation Framework in a Middle Income Country - Malaysia. BMC Bioinformatics (in press) (ISI-indexed).
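The step from a mirrored database with identifiers to a de-identified research copy can be sketched as below. This is one common approach (keyed hashing of the identifier plus removal of direct identifiers) and is not a description of how i-Research actually does it; the field names, the key and the record are hypothetical.

```python
import hashlib
import hmac

# Assumed names of fields that identify a patient directly.
DIRECT_IDENTIFIERS = {"name", "national_id", "phone"}

def pseudonym(national_id, secret_key):
    """Derive a stable research id from the identifier with a keyed hash."""
    digest = hmac.new(secret_key, national_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def de_identify(record, secret_key):
    """Drop direct identifiers and attach a pseudonymous research id."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["research_id"] = pseudonym(record["national_id"], secret_key)
    return cleaned

# Hypothetical clinical record and key.
record = {
    "name": "Example Patient",
    "national_id": "000000-00-0000",
    "phone": "000-0000000",
    "stage": "II",
}
research_row = de_identify(record, b"demo-key-not-for-production")
```

Because the same identifier and key always give the same research id, records for one patient can still be linked across the research databases without exposing who the patient is. Real de-identification also has to deal with indirect identifiers such as dates and rare diagnoses, which this sketch ignores.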
Journals
The most current online biological databases can be found in the yearly database issue of the journal Nucleic Acids Research (https://academic.oup.com/nar) and in Database: The Journal of Biological Databases and Curation (https://academic.oup.com/database). Both journals are published by Oxford Academic and are freely accessible.
Another important source of biological databases is BMC Bioinformatics (https://bmcbioinformatics.biomedcentral.com/).
Issues Pertaining to Biological Databases
HETEROGENEITY
Data are still held in various forms and locations and in diverse formats. One of the most intriguing phases in promoting the growth of biological information centres is correlating, synthesising, disseminating, sharing and retrieving information in the form of databases. Linking data is challenging.
DATA INTEGRATION
The fundamental problem affecting data integration is the adoption of data standards. In biology, data standards in the form of vocabularies and ontologies have facilitated data integration tremendously. However, an undesirable situation can arise when developers come from a computing background and do not really understand the data standards available in biology, while biologists are not very keen to undertake technical tasks such as developing databases.
DATA SHARING
Publishing databases online is still not favoured by scientists, especially if they are the data owners, as it may give rise to copyright issues as well as misuse of the data by fraudulent parties. Scientists who do not want to share unpublished data keep their valuable data on personal computers, where it mostly goes untapped. This kind of scattered data, both textual and image data, does not allow the discovery of new knowledge, and it is evident that correlated data can give rise to new discoveries in science.
DIGITIZATION
The development of databases requires data to be in digital form, and digitizing biological data is a daunting task. It requires extensive manpower and sufficient equipment, all of which require funds. However, this can be overcome if world-class biological data centres make databasing accessible enough for underprivileged scientists.
Future Direction
1. Biological databases are an integral part of research in many scientific areas (from data extraction, knowledge discovery and biological simulations to advanced analysis).
2. It is important to re-examine the models and infrastructure deployed in the construction of these databases.
3. Older database models such as the relational model, which have been the core technology behind biological databases, are losing their relevance when it comes to handling the exponential increase in data.
4. In line with the exponential increase in the size and heterogeneity of biological data, new database models have been proposed.
NoSQL Database
• Recent studies have explored practical implementations of NoSQL databases to help develop viable and usable practice in information management.
• The growing data in genomics, metabolomics, proteomics and metagenomics can be regarded as Big Data if these data sources are harmonised.
• It is critical to explore back-end data integration that is streamlined into resilient front-end computer systems for performing seamless transactions, whether for simple search algorithms or high-level data science.
• To achieve this vision, these data sources need to be mediated via ontologies and NoSQL databases by adopting parallel computing technology and Artificial Intelligence. To facilitate the new paradigm, automated methods are needed to map relational (SQL-based) models into NoSQL models, which are the essence of big data applications.
• The NoSQL graph model has recently become very popular as it offers a solution to many of today’s challenges, such as the visualisation of data with complex connections and rich relationships.
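A minimal sketch of mapping a relational model onto a graph model: each row becomes a node, and each foreign-key value becomes an edge to the row it references. The function name, the table layout (every table has an `id` column) and the relationship label are assumptions made for illustration; real mapping tools also have to handle composite keys, join tables and data types.

```python
def rows_to_graph(tables, foreign_keys):
    """Map relational rows to graph nodes and foreign keys to edges.

    tables:       {table_name: [row_dict, ...]}, each row having an "id" column.
    foreign_keys: {(table, column): (target_table, relation_name)}.
    Returns (nodes, edges) where nodes maps (table, id) to properties and
    edges is a list of (source_node, relation_name, target_node).
    """
    nodes, edges = {}, []
    for table, rows in tables.items():
        for row in rows:
            node_id = (table, row["id"])
            # Foreign-key columns become edges, so keep them out of the properties.
            nodes[node_id] = {
                col: val for col, val in row.items()
                if (table, col) not in foreign_keys
            }
            for column, value in row.items():
                if (table, column) in foreign_keys and value is not None:
                    target_table, relation = foreign_keys[(table, column)]
                    edges.append((node_id, relation, (target_table, value)))
    return nodes, edges

# Hypothetical relational content.
tables = {
    "patient": [{"id": 1, "age_group": "50-59"}],
    "diagnosis": [{"id": 10, "patient_id": 1, "stage": "II"}],
}
foreign_keys = {("diagnosis", "patient_id"): ("patient", "OF_PATIENT")}
nodes, edges = rows_to_graph(tables, foreign_keys)
```

The join that a relational query would perform on `patient_id` is replaced by an explicit edge, which is what makes connected data easier to traverse and visualise in a graph store.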
Are NoSQL databases taking over traditional relational databases?
Advantages of NoSQL over Relational
• Cost: implementation and infrastructure costs for relational systems are high, whereas most NoSQL database systems are open-source and can run on inexpensive commodity hardware.
• Performance: in contrast to relational databases, distributed NoSQL databases offer higher performance and can span data across server nodes, racks or even multiple data centres with no single point of failure.
• Availability: NoSQL databases provide high availability because of their distributed nature and data replication.
• Flexibility: the increasing heterogeneity of bioinformatics data, such as free-text notes, images and other complex data that are unstructured or semi-structured, requires new storage alternatives. The flexible data models or schemas offered by NoSQL databases allow complex data to be stored easily.
• Scalability: the size of bioinformatics data is escalating over time and eventually becomes a bottleneck for traditional relational systems. NoSQL databases are based on horizontal scalability, which allows capacity to be added by adding nodes.
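Horizontal scaling and replication can be illustrated with a toy key-value store: each key is hashed to pick a node, a copy is written to the next node, and reads fall back to the replica when a node is down. The class and method names are invented for this sketch, and it leaves out what real systems need, such as rebalancing when nodes are added and keeping replicas consistent.

```python
import hashlib

class ShardedStore:
    """Toy key-value store that spreads keys over nodes and replicates them."""

    def __init__(self, n_nodes, replicas=2):
        self.nodes = [dict() for _ in range(n_nodes)]  # one dict per simulated node
        self.replicas = replicas
        self.down = set()                              # indices of failed nodes

    def _owners(self, key):
        """Pick the primary node by hash, then the following nodes as replicas."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        for index in self._owners(key):
            self.nodes[index][key] = value

    def get(self, key):
        for index in self._owners(key):
            if index not in self.down and key in self.nodes[index]:
                return self.nodes[index][key]
        raise KeyError(key)

    def fail(self, index):
        """Simulate the loss of one node."""
        self.down.add(index)

store = ShardedStore(n_nodes=3)
for accession in ["seq_001", "seq_002", "seq_003", "seq_004"]:  # hypothetical keys
    store.put(accession, "ACGT")
```

With two copies of every key on different nodes, any single node can fail without data becoming unreadable, which is the "no single point of failure" property claimed for distributed NoSQL systems.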
Summary
This presentation covers a holistic overview of biological databases with regard to their modelling and architecture. A comprehensive literature on biological databases is presented, with a focus on a few case studies.
This presentation, however, by no means covers the whole range of biological databases that are currently available.
Finally, the future direction of biological databases is discussed, focusing on new concepts in the digital world such as big data, data science and NoSQL databases, which hold the future of biological databases.
THANK YOU
Main References:
Sarinder K. Dhillon (2019) Biological Databases. In: Ranganathan, S.,
Gribskov, M., Nakai, K. and Schönbach, C. (eds.), Encyclopedia of
Bioinformatics and Computational Biology, vol. 2, pp. 96–117. Oxford:
Elsevier.
Hong Yung Yip, Nur A. Taib, Haris A. Khan and Sarinder K. Dhillon (2019)
Electronic Health Record Integration. In: Ranganathan, S., Gribskov, M.,
Nakai, K. and Schönbach, C. (eds.), Encyclopedia of Bioinformatics and
Computational Biology, vol. 2, pp. 1063–1076. Oxford: Elsevier.
Data Science & Bioinformatics Laboratory, Institute of Biological Sciences, Faculty of Science, University of Malaya

Mais conteúdo relacionado

Mais procurados (20)

Biological networks
Biological networksBiological networks
Biological networks
 
Est database
Est databaseEst database
Est database
 
Basics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsBasics of Data Analysis in Bioinformatics
Basics of Data Analysis in Bioinformatics
 
Comparative genomics 2
Comparative genomics 2Comparative genomics 2
Comparative genomics 2
 
COMPUTATIONAL BIOLOGY
COMPUTATIONAL BIOLOGYCOMPUTATIONAL BIOLOGY
COMPUTATIONAL BIOLOGY
 
Protein Structure Prediction
Protein Structure PredictionProtein Structure Prediction
Protein Structure Prediction
 
Bio153 microbial genomics 2012
Bio153 microbial genomics 2012Bio153 microbial genomics 2012
Bio153 microbial genomics 2012
 
Comparative genomics
Comparative genomicsComparative genomics
Comparative genomics
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
Sequence database
Sequence databaseSequence database
Sequence database
 
FASTA
FASTAFASTA
FASTA
 
Dotplots for Bioinformatics
Dotplots for BioinformaticsDotplots for Bioinformatics
Dotplots for Bioinformatics
 
Pathway and network analysis
Pathway and network analysisPathway and network analysis
Pathway and network analysis
 
Web based servers and softwares for genome analysis
Web based servers and softwares for genome analysisWeb based servers and softwares for genome analysis
Web based servers and softwares for genome analysis
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Protein Data Bank (PDB)
Protein Data Bank (PDB)Protein Data Bank (PDB)
Protein Data Bank (PDB)
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
Protein database
Protein databaseProtein database
Protein database
 
dot plot analysis
dot plot analysisdot plot analysis
dot plot analysis
 
Prosite
PrositeProsite
Prosite
 

Bioinformatics databases: Current Trends and Future Perspectives

  • 1. Bioinformatics Databases Current Trends and Future Perspectives Assoc. Prof Dr Sarinder Kaur Kashmir Singh (Sarinder K. Dhillon) Data Science & Bioinformatics Lab Faculty of Science University of Malaya sarinder@um.edu.my
  • 2. In a nutshell: Today almost every scientist realizes the need to store data sets using relevant technology and models. The architecture of a database system is greatly influenced by the underlying computer system on which it runs. Accessing relevant data, combining myriad data sources and coping with distinct, heterogeneous systems is a tremendously difficult task. Advanced computing power, advanced informatics and the Internet have changed the way biological data are stored, located and disseminated. In the past 20 years, scientists have engaged in the informatics discipline by: querying multiple remote or local heterogeneous data sources; manually integrating the received data; and manipulating it with advanced data analysis and visualization tools.
  • 3. Today's talk is about: database fundamentals (models, architecture and uses); types of biological databases; examples of significant bioinformatics databases; issues pertaining to biological databases; and the future direction of biological databases.
  • 4. Database Fundamentals. In its simplest form, a database can be defined as an organised and formal collection of information stored in a computer-readable format. Common approaches: • Flat files: records without structured relationships. • Relational databases: first introduced in 1970; data in tabulated form. • Multidimensional (OLAP): created using inputs from a relational database, where data is stored in a cube or a multidimensional array. • Graph and ontology based databases (NoSQL databases): ontology based databases are in principle built on the object-oriented approach, where the datasets are treated as objects with values and properties.
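The difference between a flat file and a relational database can be sketched in a few lines. The example below is only an illustration, not taken from the talk: the pipe-delimited records, the `host` and `parasite` tables and the helper names are all hypothetical, and Python's built-in `sqlite3` module stands in for a full relational database system.

```python
import sqlite3

# Hypothetical flat-file records: one line per record, no explicit relationships.
flat_records = [
    "P001|Plasmodium knowlesi|Macaca fascicularis",
    "P002|Plasmodium falciparum|Homo sapiens",
    "P003|Plasmodium vivax|Homo sapiens",
]


def build_relational(records):
    """Load the flat records into two related tables (host and parasite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE host (host_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    conn.execute(
        "CREATE TABLE parasite ("
        "parasite_id TEXT PRIMARY KEY, name TEXT, "
        "host_id INTEGER REFERENCES host(host_id))"
    )
    for line in records:
        parasite_id, parasite, host = line.split("|")
        # Each host is stored once; parasites point to it through host_id.
        conn.execute("INSERT OR IGNORE INTO host (name) VALUES (?)", (host,))
        (host_id,) = conn.execute(
            "SELECT host_id FROM host WHERE name = ?", (host,)
        ).fetchone()
        conn.execute(
            "INSERT INTO parasite VALUES (?, ?, ?)", (parasite_id, parasite, host_id)
        )
    return conn


def parasites_of(conn, host_name):
    """Follow the host-parasite relationship with a join."""
    rows = conn.execute(
        "SELECT p.name FROM parasite p JOIN host h ON p.host_id = h.host_id "
        "WHERE h.name = ? ORDER BY p.name",
        (host_name,),
    )
    return [row[0] for row in rows]


conn = build_relational(flat_records)
print(parasites_of(conn, "Homo sapiens"))
```

In the flat file the host name is repeated on every line; in the relational form it is stored once and referenced by key, which is what makes the join possible.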
  • 5. Example of a Relational Database. An example of an entity-relationship diagram using a relational schema. Source: Dhillon, S.K., Shuhaimi, N., Hong, S.L.L., and Sidhu, A.S. (2013) Malaysian Parasite Database Infrastructure. In: Sidhu, A.S., Dhillon, S.K. (eds) Advances in Biomedical Infrastructure 2013. Springer-Verlag, Heidelberg.
  • 6. Example of an OLAP Model. Data is represented in a cube (multidimensional) model. Source: Dhillon, S.K., Shuhaimi, N., Hong, S.L.L., and Sidhu, A.S. (2013) Malaysian Parasite Database Infrastructure. In: Sidhu, A.S., Dhillon, S.K. (eds) Advances in Biomedical Infrastructure 2013. Springer-Verlag, Heidelberg.
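The cube idea can be illustrated with a small sketch. It assumes a hypothetical fact table with three dimensions (species, region, year) and one measure (specimen count); none of these names or values come from the cited database. A roll-up aggregates the measure over the dimensions that are dropped, and a slice fixes one dimension to a single value.

```python
from collections import defaultdict

# Hypothetical fact rows: (species, region, year, specimen_count).
facts = [
    ("A", "North", 2011, 4),
    ("A", "South", 2011, 2),
    ("B", "North", 2012, 5),
    ("A", "North", 2012, 1),
]
DIMENSIONS = ("species", "region", "year")


def roll_up(rows, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    cube = defaultdict(int)
    for row in rows:
        cube[tuple(row[i] for i in positions)] += row[3]
    return dict(cube)


def slice_cube(rows, dimension, value):
    """Keep only the rows where one dimension equals a fixed value."""
    position = DIMENSIONS.index(dimension)
    return [row for row in rows if row[position] == value]


print(roll_up(facts, ("species",)))
print(roll_up(slice_cube(facts, "year", 2011), ("region",)))
```

A dedicated OLAP engine precomputes and indexes such aggregates; the sketch only shows the shape of the operations.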
  • 7. Graph and Ontology Based Database. Graph data model representing the nodes (ovals) and relationships (arrows). The descriptive boxes show the main properties of nodes and edges. Source: Costa, R.L., Gadelha, L., Ribeiro-Alves, M., Porto, F., 2017. GeNNet: An integrated platform for unifying scientific workflows and graph databases for transcriptome data analysis. PeerJ, 5, p. e3509.
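A property graph of this kind can be approximated with plain Python structures. The sketch below is an assumption-laden illustration, not the GeNNet schema: the node identifiers, the `MEASURED_IN` and `INTERACTS_WITH` relationship types and the `fold_change` property are hypothetical names chosen for the example.

```python
# Nodes carry a label and properties; edges carry a type and properties.
nodes = {
    "g1": {"label": "Gene", "symbol": "geneA"},
    "g2": {"label": "Gene", "symbol": "geneB"},
    "e1": {"label": "Experiment", "name": "exp1"},
}
edges = [
    ("g1", "MEASURED_IN", "e1", {"fold_change": 2.5}),
    ("g2", "MEASURED_IN", "e1", {"fold_change": -1.2}),
    ("g1", "INTERACTS_WITH", "g2", {}),
]


def neighbours(node_id, rel_type):
    """Follow outgoing edges of one type from a node."""
    return [dst for src, rel, dst, _ in edges if src == node_id and rel == rel_type]


def genes_measured_in(experiment_id, min_abs_fold_change=0.0):
    """Filter on an edge property, as a graph query language would."""
    return [
        nodes[src]["symbol"]
        for src, rel, dst, props in edges
        if rel == "MEASURED_IN"
        and dst == experiment_id
        and abs(props["fold_change"]) >= min_abs_fold_change
    ]


print(neighbours("g1", "INTERACTS_WITH"))
print(genes_measured_in("e1", min_abs_fold_change=2.0))
```

A real graph database adds indexing, persistence and a query language on top of the same node, edge and property model.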
  • 8. Database Architectures: distributed, federated and data warehouse.
  • 9. Uses of Databases: Knowledge extraction. Databases are built to support front-end applications, which range from simple query-based systems to sophisticated Artificial Intelligence enabled systems. With the rise of big data in information technology, databases are extensively used for knowledge extraction using algorithms that are run over large volumes of data. Thus, databases now have a bigger role to play in many data-centric domains, functioning as platforms for machine-readable and machine-interpretable information. Examples: Protein Information and Knowledge Extractor (PIKE); Lynx, a database and knowledge extraction engine for integrative medicine; Encyclopedia of Life (Parr et al., 2014).
  • 10. Uses of Databases: Data mining. • Databases are very useful for storing massive and ever-growing amounts of data, which can be used for discovering patterns and rules, automatically or semi-automatically. • This method is widely known as data mining and is based on the logic of database systems (Paramasivam et al., 2014). • A clean set of integrated data is usually stored in a data warehouse before the data mining techniques are applied. Key techniques commonly used in data mining: classification, clustering, prediction, association, text mining, link analysis and regression. • Data mining is seen as an important concept in databases, as the process involves collecting data, creating and managing a database, analysing data and finally interpreting data (Han et al., 2000).
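Of the techniques listed, clustering is the easiest to show in a few lines. The sketch below uses one-dimensional k-means (Lloyd's algorithm) as one possible instance of clustering; the talk does not name a specific algorithm, and the expression values and starting centroids are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group numbers around starting centroids using Lloyd's algorithm."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(value - centroids[i])
            )
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old position when a cluster is empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical expression levels already cleaned and integrated in a warehouse.
expression = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(expression, centroids=[0, 15])
print(centroids, clusters)
```

Production work would normally rely on an established library rather than a hand-written loop; the point here is only the assign-then-update pattern that finds groups without labelled data.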
  • 11. Uses of Databases: Visualisation. An effective visual design aids decision making and interpretation to generate knowledge. Data visualisation software or a script can be used to find patterns, correlations or trends that could not be seen in the textual or numerical content of a database. One good example of data visualisation is the use of graph based databases, whereby the data is retrieved in a visual context.
  • 12. Focus of Presentation: Biological databases. A database is a core element of any discipline that produces large amounts of data. In biology (particularly in genomics), massive amounts of data are accumulated daily via experimental work, field work, software analysis, illustrations, microscopic photographs and image acquisition. This scenario has led to the construction not only of single databases but also of consortia managing large amounts of biological data.
  • 13. Big data domains - 2025. Stephens et al. (2015) compared genomics with other Big Data domains such as astronomy, YouTube and Twitter in their study of genomics as a Big Data science. They estimated that genomics is leading or on par with the other big domains in terms of data acquisition, storage, distribution and analysis. Hence, accommodating the data surge in biology is a challenge. Source: Stephens, Z.D., Lee, S.Y., Faghri, F., et al., 2015. Big data: Astronomical or Genomical? PLOS Biology 13(7), e1002195. Available at: https://doi.org/10.1371/journal.pbio.1002195.
  • 14. Categories & Types of Biological Databases. Four broad categories: (a) primary databases, (b) secondary databases, (c) specialised databases, (d) integrated databases. Types of biological data include (but are not limited to): (a) nucleotide and protein sequences, (b) protein structure, (c) microarray and gene expression, (d) metabolic pathways, (e) species profile (taxonomy), (f) animal model, (g) human disease, (h) clinical databases and others.
  • 15. Primary Databases. In the early 80s, storing sequence data was not easy due to cost and infrastructure limitations. Nevertheless, many institutions started to store their data in primary databases, which are available on isolated platforms and are focused on a specific subject of study. • Nucleotide sequence databases: European Nucleotide Archive (ENA), GenBank, EMBL Nucleotide Sequence Database. • Protein sequence databases: UniProt (UniProtKB, UniRef, UniParc), Swiss-Prot, TrEMBL, PIR-PSD, Protein Data Bank. • Microarray databases (microarray gene expression data): ArrayExpress, GEO (Gene Expression Omnibus), Stanford Microarray Database. • Metabolic pathway databases: BRaunschweig ENzyme DAtabase (BRENDA), KEGG PATHWAY Database, MetaCyc. • Phylogenetics databases: PhylomeDB, TreeBASE. Secondary Databases. These contain derived data (obtained by analysing primary data) to address specific requirements. • Eukaryotic Promoter Database: comprehensive organism-specific transcription start site (TSS) collections derived from NGS data. • NCBI Reference Sequence Database (RefSeq): a set of reference sequences including genomic, transcript and protein sequences. • Pfam: a database of protein families. • Prosite: a protein database describing protein families and domains, manually curated at the Swiss Institute of Bioinformatics. • PRoteomics IDEntifications (PRIDE): a data warehouse of proteomics data. • SCOP2: focuses on proteins that are structurally characterised and deposited in the PDB; proteins are organised according to their structural and evolutionary relationships in a complex graph network.
  • 16. Specialised Databases • Phylogenomic Database - Plant Comparative Genomics (GreenPhylDB) • Plant Secretome and Subcellular Proteome Knowledgebase (PlantSecKB) • TreeBASE, an open-access database of phylogenetic trees and associated data • Plant Alternative Splicing Database (PASD) • Comparative Toxicogenomics Database (CTD) • Drosophila Genes & Genomes (FlyBase) • WormBase • AceDB • The Arabidopsis Information Resource (TAIR) • Online Mendelian Inheritance in Man (OMIM) • Saccharomyces Genome Database (SGD)
  • 17. Integrated Databases. Integrated databases are databases unified under one platform, with the aim of offering a suite of applications for secondary uses of the data. These days integrated databases are preferred by researchers because they serve as one-stop centres for knowledge extraction. They operate more like consortia, managing and integrating sources of information to provide unified access to users. Examples: National Center for Biotechnology Information (NCBI); Kyoto Encyclopedia of Genes and Genomes; GeneCards; The InterPro Consortium; The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database; Global Biodiversity Information Facility (GBIF).
  • 18. Other Biological Databases: Biodiversity databases. Taxonomy databases, also referred to as biodiversity databases, were developed long before sequence databases were introduced. Taxonomy, being a very old and mature branch of biology, has produced abundant data, particularly on species profiling. Almost every country listed as a megabiodiversity country has produced a database of its own indigenous organisms. Examples: Global Biodiversity Information Facility (GBIF); Taxonomic Data Working Group (TDWG); Encyclopedia of Life (EoL); Catalogue of Life (CoL); Integrated Taxonomic Information System (ITIS); FishBase; and many more.
  • 19. Other biological databases: Animal model. Animal models are mostly used in biomedical research, an important research area for demonstrating biological significance in experiments. As a result, animal model databases, particularly mouse model databases, have become common. • The MUGEN mouse database (MMdb) is a repository of murine models of immune processes and immunological diseases. • Mouse Genome Informatics (MGI) is the international database resource for the laboratory mouse, providing integrated genetic, genomic and biological data to facilitate the study of human health and disease. • The Mouse Phenome Database (MPD) is one of the most widely used resources for primary experimental trait data and genotypic variation. • The International Mouse Strain Resource (IMSR) is a database of mouse strains, stocks and mutant ES cell lines available worldwide, including inbred, mutant and genetically engineered strains. • The European Mouse Mutant Archive. • Rat Resource and Research Center.
  • 20. Other biological databases: Clinical/Health databases • The National Center for Biotechnology Information (NCBI) has produced two important clinical and health databases: (i) ClinVar, a database of reports on the relationships among human variations and phenotypes; (ii) MedGen, a database containing information related to human medical genetics, such as attributes of conditions with a genetic contribution. • The Database of Genotypes and Phenotypes (dbGaP) is an archive of data and results from studies that have investigated the interaction of genotype and phenotype in humans. • PubMed Health is the world's largest digital medical library. • MEDLINE contains journal citations and abstracts for biomedical literature from around the world. • Clinical and health databases focusing on diseases are also useful resources for biologists and medical scientists: MalaCards, a human disease database; Autoimmune Disease Database; Inflammatory Bowel Diseases; KEGG Disease Database; LiverWiki, a wiki-based database for the human liver; The Diseases Database ver 2.0, which contains a whole range of databases in its portal.
  • 21. Current Trends. DATABASE MODELS • The traditional relational model has predominantly been used in the design and development of biological databases. • Examples: GenBank, EMBL, SWISS-PROT, Protein Information Resource (PIR) and CBS Genome Atlas Database. • With recent developments in semantic web technologies, many biological ontologies have been created that use object-based database models built with ontologies and NoSQL database tools. • BioPortal - repository for biomedical ontologies. It currently contains 704 ontologies. • BioPAX • Cell Cycle Ontology • Gene Expression Knowledgebase • Disease Ontology • Gene Ontology Consortium • Sequence Ontology • SNOMED CT • Fish Ontology. DATABASE ARCHITECTURE • All three approaches (distributed, federated and data warehouse) have been used in biological databases. • Distributed databases: Ensembl, WormBase and the Berkeley Drosophila Genome Project. • Federated databases: TwinNET (Muilu et al., 2007), ENCODE (Blankenberg et al., 2007), EBI search (Park et al., 2017), SPINE2 (Goh et al., 2003), Cancer Biomedical Informatics Grid (Saltz et al., 2006), NIF (Gardner et al., 2008), Biomedical Informatics Research Network (Ashish et al., 2010), Biomedical Investigations (Taylor et al., 2008), EdgeExpressDB (Severin et al., 2009) and Minimal Information About Neural Electromagnetic Ontologies (Frishkoff et al., 2011). • Data warehouses: Pathway Commons (Cerami et al., 2010), String (Szklarczyk et al., 2010), CBS Genome Atlas Database (Hallin and Ussery, 2004), BioMart (Haider et al., 2009), BrainMap & PubBrainure
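The practical difference between the federated and data warehouse approaches can be sketched as follows. This is a toy illustration under stated assumptions: the two in-memory sources, their field names and the wrapper functions are hypothetical and do not describe any of the systems cited above.

```python
# Two hypothetical autonomous sources with different local schemas.
source_a = [{"acc": "X1", "organism": "Danio rerio"}]
source_b = [
    {"id": "Y7", "species": "Danio rerio"},
    {"id": "Y8", "species": "Mus musculus"},
]


def wrap_a(record):
    """Map source A's schema onto a shared schema."""
    return {"accession": record["acc"], "species": record["organism"], "source": "A"}


def wrap_b(record):
    """Map source B's schema onto the same shared schema."""
    return {"accession": record["id"], "species": record["species"], "source": "B"}


SOURCES = [(source_a, wrap_a), (source_b, wrap_b)]


def federated_query(species):
    """Federated: translate and query every source at request time."""
    results = []
    for data, wrap in SOURCES:
        results.extend(r for r in map(wrap, data) if r["species"] == species)
    return results


def build_warehouse():
    """Warehouse: extract, transform and load everything into one store."""
    return [wrap(record) for data, wrap in SOURCES for record in data]


def warehouse_query(warehouse, species):
    """Warehouse queries touch only the local integrated copy."""
    return [r for r in warehouse if r["species"] == species]


warehouse = build_warehouse()
# A source is updated after the warehouse was loaded.
source_a.append({"acc": "X2", "organism": "Danio rerio"})

print([r["accession"] for r in federated_query("Danio rerio")])
print([r["accession"] for r in warehouse_query(warehouse, "Danio rerio")])
```

The federated query sees the new record immediately but depends on every source being reachable; the warehouse answers from its local copy and stays stale until the next load.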
  • 22. Case studies of Biological Databases. Global Biological Database Efforts: GeneCards; TCGA. Database Efforts by Data Science & Bioinformatics Lab, UM: Fish Ontology; Patient Records Graph Database; B Rotunda Database; My Breast Cancer Cohort (MyBCC) database; Breast Cancer Module in EMR.
  • 23. GeneCards. To gather data scattered across primary databases, the Weizmann Institute of Science Crown Human Genome Center (http://www.weizmann.ac.il) developed a database called GeneCards in 1997. In its early stage, this database dealt mostly with human genome information, human genes, the functions of the encoded proteins and related diseases. Currently it serves as a complete, authoritative compendium of annotative information about human genes that has been broadly used for almost 15 years (Safran et al., 2010).
  • 24. The Cancer Genome Atlas (TCGA) - aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer (National Institutes of Health, 2017). It began in 2005 to catalogue the genetic mutations responsible for cancer, using genome sequencing and bioinformatics. TCGA has provided a large amount of publicly available data on cancer, highlighting candidate cancer biomarkers and drug targets.
  • 25. Current Trends - Graph-based database. The Fish Ontology (FO) model. A portion of the FO is shown here to illustrate how the classes are related to each other and to other ontology classes. The dark blue circles represent terms from other ontologies, while the light blue circles represent terms from the FO. Source: Ali, N.M., Khan, H.A., Then, A.Y., Ving Ching, C., Gaur, M., Dhillon, S.K. (2017) Fish Ontology framework for taxonomy-based fish recognition. PeerJ 5:e3811.
  • 26. Graph Model of Electronic Medical Records. Hong Yung Yip, Nur A. Taib, Haris A. Khan and Sarinder K. Dhillon (2019) Electronic Health Record Integration. In: Ranganathan, S., Gribskov, M., Nakai, K. and Schönbach, C. (eds.), Encyclopedia of Bioinformatics and Computational Biology, vol. 2, pp. 1063–1076. Oxford: Elsevier.
  • 27. B Rotunda Genomic Database [Schema diagram. Tables shown: CONTIG (primary key: contig), GENE, CODING, ANNOTATION, IPRSCAN, miRNA, rRNA and tRNA. Feature tables carry an id (PK), a contig (FK), start_pos, end_pos and strand; similarity and annotation tables carry query_id (FK), subject_id (PK), subject_db, subject_anno, identity, mismatch, align_length, gap, query_start, query_end, subject_start, subject_end, e_value and score. Annotation sources named in the diagram: Swissprot, TrEMBL, KEGG, COG, nr, unigene, InterPro and GO.]
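A small part of such a relational schema can be sketched with Python's built-in sqlite3 module. The table and column names (contig, id, start_pos, end_pos, strand) come from the diagram labels above, but the assignment of columns to a gene table, the data types and the sample row are assumptions made for illustration, not the actual B Rotunda schema.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript(
    """
    CREATE TABLE contig (
        contig TEXT PRIMARY KEY
    );
    CREATE TABLE gene (
        id        INTEGER PRIMARY KEY,
        contig    TEXT NOT NULL REFERENCES contig(contig),
        start_pos INTEGER NOT NULL,
        end_pos   INTEGER NOT NULL,
        strand    TEXT NOT NULL
    );
    """
)

# Made-up example rows.
conn.execute("INSERT INTO contig (contig) VALUES ('contig_1')")
conn.execute(
    "INSERT INTO gene (contig, start_pos, end_pos, strand) VALUES (?, ?, ?, ?)",
    ("contig_1", 100, 950, "+"),
)

# Join each gene back to the contig it sits on.
rows = conn.execute(
    """
    SELECT g.id, c.contig, g.start_pos, g.end_pos, g.strand
    FROM gene AS g
    JOIN contig AS c ON g.contig = c.contig
    """
).fetchall()
```

With foreign keys switched on, an attempt to insert a gene that points at a contig missing from the contig table is rejected, which is the kind of integrity guarantee the (FK) labels in the diagram express.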
  • 28. Malaysian Breast Cancer Survivorship Cohort (MyBCC) • MyBCC is a longitudinal cohort study that aims to determine the impact of lifestyle, mental and socio-cultural conditions on the overall survival and quality of life of multi-ethnic Malaysian women following a new diagnosis of breast cancer. • The MyBCC database application includes data science techniques to help clinicians and researchers conduct outcome analysis on lifestyle factors affecting the survival time of patients (Ganggayah, M.D. et al., 2018).
  • 29. Architecture System of the UMMC i-Pesakit© Breast Cancer Module [Architecture diagram. Clinical workflow (data input): the i-Pesakit© Breast Cancer Module (BCM), a system used across multiple clinical departments: Radiology, Pharmacy, Oncology, Pathology and Surgery (Breast Unit). Module stages: First Visit Clinic; Diagnostic MDT* with Radiology; Results Clinic; Treatment MDT* with Oncology; Follow-up Visit Clinic; Relapse MDT*; Relapse. Clinical research (data output): Breast Cancer Module (i-Research); i-Pesakit© BCM database mirroring; mirrored i-Pesakit© BCM with identifiers; de-identified i-Research database for research analysis and reporting; data analysis; breast cancer clinical audit and reporting; reports for the Ministry. Integration of data sources from i-Pesakit© with related breast cancer research databases: MyBCC (Malaysian Breast Cancer Survivorship Cohort), Biobank and Breast Q (Patient Reported Outcomes Measure). Also labelled: National Registration Department; implementation enhancement. *MDT: multidisciplinary meeting.] Source: Nurul Aqilah Mohd Nor, Nur Aishah Taib, Marniza Saad, Hana Salwani Zaini, Zahir Ahmad, Yamin Ahmad, Sarinder Kaur Dhillon (2018). Development of Electronic Medical Records for Clinical and Research Purposes: The Breast Cancer Module Using an Implementation Framework in a Middle Income Country - Malaysia, BMC Bioinformatics (in press) (ISI-Indexed).
  • 30. Journals. The most current online biological databases can be found in the yearly database issue of the journal Nucleic Acids Research (https://academic.oup.com/nar) and in Database: The Journal of Biological Databases and Curation (https://academic.oup.com/database). Both journals are published by Oxford Academic and are freely available. Another important source of biological databases is BMC Bioinformatics (https://bmcbioinformatics.biomedcentral.com/).
  • 31. Issues Pertaining to Biological Databases. HETEROGENEITY: Data still exist in various forms and locations and in diverse formats. One of the most intriguing challenges in promoting the growth of biological information centres is to correlate, synthesize, disseminate, share and retrieve information in the form of databases. Linking data is challenging. DATA INTEGRATION: The fundamental problem affecting data integration is the adoption of data standards. In biology, data standards in the form of vocabularies and ontologies have facilitated data integration tremendously. However, an undesirable situation can arise when developers come from a computing background and do not really understand the available data standards in biology, while biologists are not very keen to undertake technical tasks such as developing databases. DATA SHARING: Publishing databases online is still not favoured by scientists, especially if they are the data owners, as it may give rise to copyright issues as well as misuse of the data by fraudulent parties. Scientists who do not want to share unpublished data keep their valuable data on personal computers, where it mostly remains untapped. Such scattered data, both textual and image data, do not allow the discovery of new knowledge, and it is evident that correlated data can give rise to new discoveries in science. DIGITIZATION: The development of databases requires data to be in digital form, and digitizing biological data is a daunting task. It requires extensive manpower and sufficient equipment, all of which require funds. However, this can be overcome if world-class biological data centres make databasing accessible enough for underprivileged scientists.
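The role of a shared vocabulary in data integration, as described on this slide, can be illustrated with a minimal sketch. The vocabulary, field names and records below are all hypothetical and far simpler than a real biological ontology; the point is only that two sources using different field names become comparable once both are mapped to agreed terms.

```python
# Hypothetical shared vocabulary: maps each lab's local field name to an agreed term.
SHARED_VOCABULARY = {
    "species": "organism",
    "taxon": "organism",
    "organism": "organism",
    "seq_len": "sequence_length",
    "length": "sequence_length",
}


def harmonise(record):
    """Rename the fields of one record to the shared vocabulary terms."""
    harmonised = {}
    for field, value in record.items():
        key = field.strip().lower()
        # Fields without a mapping keep their (normalised) local name.
        harmonised[SHARED_VOCABULARY.get(key, key)] = value
    return harmonised


# Two made-up records describing the same kind of data with different field names.
lab_a = {"Species": "Example species", "seq_len": 1200}
lab_b = {"taxon": "Example species", "Length": 980}

integrated = [harmonise(lab_a), harmonise(lab_b)]
```

After harmonisation both records expose the same keys, so they can be stored and queried together. A real system would map values as well as field names, typically to ontology term identifiers.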
  • 32. Future Direction. 1 Biological databases are an integral part of research in many scientific areas (from data extraction, knowledge discovery and biological simulations to advanced analysis). 2 It is important to relook at the models and infrastructure deployed in the construction of these databases. 3 Older database models such as the relational model, which have been the core technology behind biological databases, are losing their relevance with regard to handling the exponential increase of data. 4 In line with the exponential increase in the size and heterogeneity of biological data, new database models have been proposed.
  • 33. NoSQL Databases • Recent studies have explored practical implementations of NoSQL databases to help develop viable and usable practices in information management. • The growing data in genomics, metabolomics, proteomics and metagenomics can be regarded as Big Data if these data sources are harmonised. • It is critical to explore back-end data integration that is streamlined into resilient front-end computer systems for performing seamless transactions, whether for simple search algorithms or high-level data science. • To achieve this vision, these data sources need to be mediated via ontologies and NoSQL databases by adopting parallel computing technology and Artificial Intelligence. To facilitate the new paradigm, automated methods are needed to map relational (SQL-based) models into NoSQL models, which are the essence of big data applications. • The NoSQL graph model has recently become very popular as it presents a solution to many of today’s challenges, such as the visualisation of data with complex connections and rich relationships.
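One simple way to picture the mapping from a relational model to a graph model mentioned on this slide is to turn every row into a node and every foreign key into an edge. The sketch below does this with plain Python data structures. It is an illustration under assumed inputs, not the automated method the slide calls for, and the function name, edge label and sample tables are all hypothetical.

```python
def relational_to_graph(tables, primary_keys, foreign_keys):
    """Turn relational rows into graph nodes and foreign keys into edges.

    tables:       {table_name: [row_dict, ...]}
    primary_keys: {table_name: primary_key_column}
    foreign_keys: [(child_table, fk_column, parent_table), ...]
    """
    nodes = {}
    edges = []
    # Every row becomes a node identified by (table name, primary key value).
    for table, rows in tables.items():
        pk = primary_keys[table]
        for row in rows:
            nodes[(table, row[pk])] = dict(row)
    # Every foreign key value becomes an edge from the child row to the parent row.
    for child, fk_column, parent in foreign_keys:
        pk = primary_keys[child]
        for row in tables[child]:
            edges.append(((child, row[pk]), "REFERS_TO", (parent, row[fk_column])))
    return nodes, edges


# Made-up example: one contig and two genes located on it.
tables = {
    "contig": [{"contig": "contig_1"}],
    "gene": [
        {"id": 1, "contig": "contig_1", "strand": "+"},
        {"id": 2, "contig": "contig_1", "strand": "-"},
    ],
}
primary_keys = {"contig": "contig", "gene": "id"}
foreign_keys = [("gene", "contig", "contig")]

nodes, edges = relational_to_graph(tables, primary_keys, foreign_keys)
```

The resulting nodes and edges could then be loaded into a graph database; relationships that require a join in SQL become directly traversable edges.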
  • 34. ARE NoSQL Databases taking over traditional relational databases?
  • 35. Advantages of NoSQL over Relational. Implementation and infrastructure costs of relational systems are high, whereas most NoSQL database systems are open-source and can run on inexpensive commodity hardware. In contrast to relational databases, NoSQL distributed databases offer higher performance and can spread data across server nodes, racks, or even multiple data centers with no single point of failure. NoSQL databases provide high availability because of their distributed nature and data replication. The increasing heterogeneity of bioinformatics data, such as free-text notes, images and other complex data that are unstructured or semi-structured, requires new storage alternatives; the flexible data models or schemas of NoSQL databases allow complex data to be stored easily. The size of bioinformatics data is escalating over time and eventually becomes a bottleneck for traditional relational systems; NoSQL databases are based on horizontal scalability, which permits effortless and automatic scaling.
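Two of the properties listed on this slide, flexible schemas and horizontal scalability, can be sketched with a toy key-document store that spreads records over several shards by hashing the key. This is a simplified illustration with hypothetical names: the "shards" are in-memory dictionaries rather than servers, and replication and automatic rebalancing, which real NoSQL systems provide, are deliberately left out.

```python
import hashlib


def shard_for(key, n_shards):
    """Pick a shard deterministically from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


class ShardedStore:
    """Toy document store: each shard is a dict standing in for one server node."""

    def __init__(self, n_shards):
        self.shards = [dict() for _ in range(n_shards)]

    def put(self, key, document):
        self.shards[shard_for(key, len(self.shards))][key] = document

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(n_shards=3)
# Documents need not share a schema: a free-text note and an image record sit side by side.
store.put("sample-1", {"type": "note", "text": "free-text clinical note"})
store.put("sample-2", {"type": "image", "path": "scan.png", "width": 512})
```

Because the shard is derived from the key alone, any client can locate a document without a central index. Production systems usually use consistent hashing instead of a plain modulo so that adding a node moves only a fraction of the keys.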
  • 36. Summary THIS PRESENTATION GIVES A HOLISTIC OVERVIEW OF BIOLOGICAL DATABASES, WITH REGARD TO THEIR MODELLING AND ARCHITECTURE. A COMPREHENSIVE LITERATURE REVIEW OF BIOLOGICAL DATABASES IS PRESENTED WITH A FOCUS ON A FEW CASE STUDIES. THIS PRESENTATION, HOWEVER, BY NO MEANS COVERS THE WHOLE RANGE OF BIOLOGICAL DATABASES THAT ARE CURRENTLY AVAILABLE. FINALLY, THE FUTURE DIRECTION OF BIOLOGICAL DATABASES IS DISCUSSED, FOCUSING ON NEW CONCEPTS IN THE DIGITAL WORLD SUCH AS BIG DATA, DATA SCIENCE AND NOSQL DATABASES, WHICH HOLD THE FUTURE OF BIOLOGICAL DATABASES.
  • 37. THANK YOU Main References: Sarinder K. Dhillon (2019) Biological Databases. In: Ranganathan, S., Gribskov, M., Nakai, K. and Schönbach, C. (eds.), Encyclopedia of Bioinformatics and Computational Biology, vol. 2, pp. 96–117. Oxford: Elsevier. Hong Yung Yip, Nur A. Taib, Haris A. Khan and Sarinder K. Dhillon (2019) Electronic Health Record Integration. In: Ranganathan, S., Gribskov, M., Nakai, K. and Schönbach, C. (eds.), Encyclopedia of Bioinformatics and Computational Biology, vol. 2, pp. 1063–1076. Oxford: Elsevier.