Basics of Bioinformatics
Part 1
Jayati Shrivastava

“Bioinformatics”
“The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information.” ~Fredj Tekaia.
• General definition: computational techniques for solving biological problems.
– data problems: representation (graphics), storage and retrieval (databases), analysis (statistics, artificial intelligence, optimization, etc.)
– biology problems: sequence analysis, structure or function prediction, data mining, etc.
• In essence, it conceptualizes biology in terms of molecules (in the sense of physical chemistry) and then applies “informatics” techniques derived from computer science, mathematics and statistics to understand the information associated with these molecules on a large scale.

Need for Bioinformatics
• When methods for DNA sequencing became widely available in the early 1980s, molecular sequence data started to grow exponentially. After the sequencing of the first microbial genome in 1995, the genomes of more than 100 organisms have been sequenced, and large-scale genome sequencing projects have evolved into routine, though still non-trivial, procedures (Janssen et al., 2003; Kanehisa and Bork, 2003). The need for efficient and powerful tools and databases became obvious during the Human Genome Project, which was completed several years ahead of schedule. The accumulated data was stored in the first genomic databases, such as GenBank, the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL), and the DNA Data Bank of Japan (DDBJ).
• As an example, the number of gene sequence entries in GenBank increased from 1,765,847 to 22,318,883 in the five years up to the cited report. These entries tend to double every 15 months (Benson et al., 2002).
• There are two major challenging areas in bioinformatics:
(1) data management and
(2) knowledge discovery.

Fig. 1. The growth of data in GenBank
(source: http://www.ncbi.nih.gov/Genbank/genbankstats.html)
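The growth figures quoted above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes steady exponential growth over exactly 60 months (the source does not give precise dates), and the helper name `doubling_time_months` is hypothetical.

```python
import math


def doubling_time_months(start_count, end_count, elapsed_months):
    """Return the doubling time implied by exponential growth between two counts."""
    if start_count <= 0 or end_count <= start_count:
        raise ValueError("counts must be positive and increasing")
    # Number of times the count doubled over the period.
    doublings = math.log2(end_count / start_count)
    return elapsed_months / doublings


# GenBank entry counts quoted in the text; the 60-month span is an assumption.
genbank_doubling = doubling_time_months(1_765_847, 22_318_883, 60)
print(f"Implied doubling time: {genbank_doubling:.1f} months")
```

Under these assumptions the implied doubling time is a little over 16 months, which is close to the figure of about 15 months quoted from Benson et al.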

• Our body is made up of trillions of cells. According to the Human Genome Project, each cell carries approximately 20,000 genes.
• Each microscopic cell has an even smaller command centre, the nucleus, within which about 2 m of DNA is elegantly packaged. The number of nucleotides is ~3 × 10^9. That is an enormous amount of data in a single cell! How can we store, access and analyze this data? This is where computers come in.
• Computers were developed and used for exactly this purpose: efficient data storage, retrieval and analysis. With advances in sequencing technology, thousands of nucleotides from different organisms are sequenced each day and submitted to databases worldwide.
• In bioinformatics, computers are used in the same way as before, but the data is biological data, the letters of life.
• We are now facing an information overload: there is plenty of sequence data, but the real challenge is to make sense of it.
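To make the scale of roughly 3 × 10^9 nucleotides concrete, the following sketch estimates the raw storage needed for one copy of the sequence. The two encodings compared (2 bits per base for the four unambiguous letters, and one 8-bit character per base) are assumptions chosen for illustration, not something the slides specify, and `storage_bytes` is a hypothetical helper.

```python
GENOME_BASES = 3 * 10**9  # approximate nucleotide count quoted in the text


def storage_bytes(n_bases, bits_per_base):
    """Bytes needed to store n_bases at a fixed number of bits per base (rounded up)."""
    total_bits = n_bases * bits_per_base
    return (total_bits + 7) // 8


two_bit = storage_bytes(GENOME_BASES, 2)   # A, C, G, T packed into 2 bits each
one_char = storage_bytes(GENOME_BASES, 8)  # one ASCII character per base

print(f"2-bit packing: {two_bit / 1e6:.0f} MB")
print(f"plain text:    {one_char / 1e9:.0f} GB")
```

The sequence itself fits comfortably on a modern disk; the harder problems are annotation, comparison across many genomes, and interpretation.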

History and Landmark Events in the Field of Bioinformatics
• 1965 Margaret Dayhoff’s Atlas of Protein Sequences
• 1970 Needleman-Wunsch algorithm
• 1977 DNA sequencing and software to analyze it
• 1981 Smith-Waterman algorithm developed
• 1981 The concept of a sequence motif
• 1982 GenBank release 3 made public
• 1982 Phage lambda genome sequenced
• 1983 Sequence database searching algorithm
• 1985 FASTA/FASTN: fast sequence similarity searching
• 1988 National Center for Biotechnology Information (NCBI) created at NIH/NLM
• 1988 EMBnet network for database distribution
• 1990 BLAST: fast sequence similarity searching
• 1991 EST: expressed sequence tag sequencing
• 1993 Sanger Centre, Hinxton, UK
• 1994 EMBL European Bioinformatics Institute, Hinxton, UK
• 1995 First bacterial genome completely sequenced
• 1996 Yeast genome completely sequenced
• 1997 PSI-BLAST
• 1998 Worm genome completely sequenced
• 1999 Fly genome completely sequenced

Founder of Bioinformatics: Margaret O. Dayhoff

Paulien Hogeweg and Ben Hesper, who coined the term “bioinformatics” in 1970

1. Pharmacology and CADD
Pharmacology is the science of how drugs act on biological systems and how the body responds to the drug. The study of pharmacology encompasses the sources, chemical properties, biological effects and therapeutic uses of drugs.
Computer-Aided Drug Design (CADD) emerged as an efficient means of identifying potential lead compounds and of aiding the development of possible drugs for a wide range of diseases [8, 9]. Today, a number of computational approaches are being used to identify potential lead molecules from huge compound libraries.

• Bioinformatics accelerates drug target identification and validation, drug discovery and the characterization of side effects, and also helps to predict drug resistance.
• It is also used in the development of biomarkers:
Toxicogenomics (how proteins act in response to toxic substances)
Pharmacogenomics (the role of the genome in drug response)
Both of these tools are used to maximize the therapeutic benefit of a drug.
In the next 10 years
We are likely to see quantum computing, which should be highly beneficial for CADD.
“Quantum computing is a rapidly-emerging technology that harnesses the laws of quantum mechanics to solve problems too complex for classical computers.”

2. Proteomics
• Proteomics is the large-scale study of proteins.
Proteomics is used to investigate:
•when and where proteins are expressed
•rates of protein production, degradation, and steady-state abundance
•how proteins are modified (for example, post-translational modifications (PTMs) such
as phosphorylation)
•the movement of proteins between subcellular compartments
•the involvement of proteins in metabolic pathways
•how proteins interact with one another
Proteomics can provide significant biological information for many biological problems,
such as:
•which proteins interact with a particular protein of interest (for example, the tumour
suppressor protein p53)?
•which proteins are localised to a subcellular compartment (for example, cell
membrane)?
•which proteins are involved in a biological process (for example, circadian rhythm)?
•These proteomics processes depend heavily on bioinformatics.
•This means that as the applications of proteomics expand, the field of bioinformatics will expand as well.

3. Centralized Data Analysis
• Bioinformatics provides globally accessible databases that enable scientists everywhere to search, submit and analyse information.
• This global collaboration will continue to grow by leaps and bounds.
• Learning bioinformatics can therefore put us on the global map of collaboration.
4. Cancer bioinformatics
Cancer bioinformatics provides a unique platform and opportunity for scientists to integrate omics science, bioinformatics tools and data, clinical research, disease-specific biomarkers and dynamic networks with precision medicine, working together to fight cancer and improve the quality of life of patients with cancer.

Role of Bioinformatics in Cancer Research and drug Development
Source: https://doi.org/10.1016/B978-0-323-89824-9.00011-2

5. Personalized Medicine
The concept of personalized medicine: the right medicine at the right dose for the right patient.
What is meant by personalized medicine?
A form of medicine that uses information about a person's own genes or proteins to prevent, diagnose, or treat disease.
Personalized medicine is an emerging practice of medicine with strong prospects for growth.
Figure. Example schematic of the use of personalised medicine. From ABPI 2016.
Figure. Future prospects of personalized medicine for each disease: summary diagrams for the treatment of patients with RA and psoriasis (A), Alzheimer’s disease (B), and the scheme of future personalized therapy (C)

6. Agriculture
Within the agricultural industry, bioinformatics has been used to expand the current understanding of various plant functions, protect plants against harmful stressors, and improve plant quality for human consumption.
Bioinformatics is playing an increasingly important role in the collection, storage, and analysis of genomic data.
The ways in which bioinformatics tools and methods are used in agriculture, collectively referred to as agri-informatics, primarily include improving plant resistance against both biotic and abiotic stressors and enhancing nutritional quality in depleted soils.
In addition to these purposes, gene discovery through the use of computer software has allowed researchers to develop targeted methods for improving seed quality, incorporate added micronutrients into plants for enhanced human health, and engineer plants with phytoremediation capabilities.

7. Systems Biology and Bioinformatics
Systems biology examines the interactions between several components rather than the individual features of the molecules, in order to understand the phenotype resulting from the components of the system. To this end, computational approaches are employed in systems biology to create possible in silico models that can also be verified experimentally in vivo or in vitro, thus allowing the analysis of large amounts of data. In the study of biological systems, various computational tools are used, including techniques for sequence alignment, for recording molecular dynamics and molecular interactions, and for discovering or predicting molecular structure.
Figure 1. Hypothesis-driven research in systems biology
Systems biology includes the computational analysis of extensive experimental data in the field of pharmacology, namely systems pharmacology. Systems pharmacology focuses on the study of drugs, identifying new drug targets, repurposing existing drugs and analyzing the properties and effects of known drugs at a systems level. Addressing the complexity of cellular networks and the mode of action of a drug can lead to a better understanding of its side effects and adverse events and to the identification of off-targets, improving the safety and effectiveness of disease treatment. In the past decade, systems-based applications have provided better insights into drug-drug interactions, drug-target networks, drug-target interactions, and drug side effects, leading to novel drug discovery.

8. Genetics and Genomics in Bioinformatics
Genetics is the scientific study of genes and heredity: how certain qualities or traits are passed from parents to offspring as a result of changes in DNA sequence. A gene is a segment of DNA that contains instructions for building one or more molecules that help the body work.
Genomics is the study of all of a person's genes (the genome), including interactions of those genes with each other and with the person's environment.
Both genetics and genomics apply bioinformatics and computational techniques to generate data from DNA and RNA sequences.

9. Immunoinformatics and vaccine discovery
Immunoinformatics is the intersection between experimental immunology and computational biology.
It is used to study host-pathogen interactions and to identify the functions of unknown genes.
During the COVID-19 pandemic this field grew rapidly.
Figure. Schematic illustration of the cases giving rise to the need for an immunoinformatics approach to vaccine development

Figure 1. Flow diagram of the design strategy, representing the steps in the construction of the multi-epitope subunit vaccine

10. Neuroinformatics
Neuroinformatics refers to a research field that focuses on organizing neuroscience data
through analytical tools and computational models. It combines data across all scales and
levels of neuroscience in order to understand the complex functions of the brain and work
toward treatments for brain-related illness. Neuroinformatics involves the techniques and
tools for acquiring, sharing, storing, publishing, analyzing, modeling, visualizing and
simulating data.
Neuroinformatics helps researchers to work together and share data across different facilities and different countries through the exchange of approaches and tools for integrating and analysing data. This field makes it possible to integrate any type of data across various levels of biological organization.
The benefits of neuroinformatics include:
•Advancement in neuroscience and improvement in the treatment of several neurological disorders.
•The enhancement of researchers' knowledge. Neuroinformatics enables them to understand how particular neurological functions work by allowing them to trace specific functions inside computerized models.
•The accumulation of huge volumes of new data for creating more sophisticated models for testing.

DNA and Protein sequences
The nomenclature system adopted in bioinformatics work is based on the International Union of Pure and Applied Chemistry (IUPAC) recommendations. It is useful to follow this nomenclature system so that data sets from different laboratories around the world can be compared easily and uniformly.
Figure. Summary of single-letter code IUPAC recommendations
For the first four bases, G, A, T and C, the symbols and the basis for the nomenclature are clear. When sequence data is determined experimentally, the identity of the base at a particular position is sometimes not clearly identifiable because of compression artifacts or other problems related to secondary structure. In most cases the problem can be solved by repeating the experiment and by sequencing the complementary strand. In a few cases, ambiguities may persist. In such cases, the most probable results are inferred from the chromatograms.
For instance, at a position where the ambiguity between 'G' and 'C' cannot be resolved, but one can be sure that neither 'A' nor 'T' is possible at that position, the symbol to be used is 'S'.
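The 'S' example generalizes to a simple lookup: given the set of bases that cannot be ruled out at a position, return the matching single-letter symbol. The table below follows the standard IUPAC nucleotide codes; the function name `ambiguity_symbol` is a hypothetical helper introduced only for this sketch.

```python
# Standard IUPAC single-letter nucleotide codes and the bases each one stands for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Reverse lookup: set of possible bases -> symbol.
_SYMBOL_FOR_BASES = {
    frozenset(bases): symbol for symbol, bases in IUPAC_NUCLEOTIDES.items()
}


def ambiguity_symbol(possible_bases):
    """Return the IUPAC symbol for the bases that cannot be ruled out at a position."""
    key = frozenset(possible_bases.upper())
    if key not in _SYMBOL_FOR_BASES:
        raise ValueError(f"not a valid set of DNA bases: {possible_bases!r}")
    return _SYMBOL_FOR_BASES[key]


print(ambiguity_symbol("GC"))  # G or C, but not A or T
```

Because the lookup is keyed on a set, the order in which the candidate bases are given does not matter.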

In most organisms, DNA is double-stranded. The two strands are anti-parallel and complementary to each other (following Watson-Crick base-pairing). A problem arises, however, when we encounter symbols that stand for more than one base at a given position. Again, the IUPAC system comes to our aid. The symbols to be used in the complementary strand, corresponding to the symbol at the same position in a given strand, are specified in the figure on complementary symbols. In certain cases, the complementary symbol is the same as the symbol in the given strand, because both stand for the same set of bases.
Figure. Definition of complementary symbols
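A complement that handles ambiguity symbols can be sketched with a single translation table. The pairs below follow the standard IUPAC complements (for example R pairs with Y and K with M, while S, W and N are their own complements); the function names are hypothetical and chosen only for this illustration.

```python
# Complement of every IUPAC nucleotide symbol, matched position by position.
_COMPLEMENT_TABLE = str.maketrans(
    "ACGTRYSWKMBDHVN",
    "TGCAYRSWMKVHDBN",
)


def complement(sequence):
    """Complement each symbol; characters outside the table are left unchanged."""
    return sequence.upper().translate(_COMPLEMENT_TABLE)


def reverse_complement(sequence):
    """Complement the sequence and reverse it, since the strands are anti-parallel."""
    return complement(sequence)[::-1]


print(reverse_complement("ATGS"))
```

S (G or C) and W (A or T) map to themselves because complementing every base in the set gives back the same set, which is the case described in the text.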

The symbols for protein sequences and their meanings are presented in the figure on amino acid symbol definitions. Evidently, very few symbols stand for more than one amino acid.
Figure. Symbol definitions for the amino acids.
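As a small illustration of how few ambiguous amino acid symbols there are, the sketch below expands a peptide containing B or Z into every sequence it could stand for. It relies on the standard one-letter conventions (B for aspartic acid or asparagine, Z for glutamic acid or glutamine); X, meaning any amino acid, is deliberately left unexpanded, and `expand_peptide` is a hypothetical helper name.

```python
from itertools import product

# Amino acid symbols that stand for more than one residue.
# X (any amino acid) is left out because it would expand to all 20 residues.
AMBIGUOUS_RESIDUES = {
    "B": "DN",  # aspartic acid (D) or asparagine (N)
    "Z": "EQ",  # glutamic acid (E) or glutamine (Q)
}


def expand_peptide(peptide):
    """List every unambiguous peptide consistent with the given symbols."""
    choices = [AMBIGUOUS_RESIDUES.get(residue, residue) for residue in peptide.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand_peptide("ABZ"))
```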

cDNA: A large number of the sequences deposited in the databases were determined from cDNA molecules. When filling in the sequence entry form, you must tick the appropriate box to indicate whether the sequence being deposited is a cDNA sequence. This information is also provided when a sequence is retrieved. With cDNA sequences, one is looking at the expressed part of the genome.
Genomic DNA: Sequencing of genomic DNA has become routine. Genomic DNA is the storehouse of information, and its expressed part is also represented in cDNA sequences.
ESTs: An abbreviation for Expressed Sequence Tags. Dr. Craig Venter initiated the sequencing of a large number of cDNA molecules by sequencing one end of each randomly picked cDNA clone. Millions of ESTs have been deposited in a dedicated database called dbEST. EST data is used to infer expression patterns by dividing the number of ESTs corresponding to each gene by the total number of ESTs.
GSTs: In Plasmodium falciparum, the enzyme Mung Bean Nuclease (MNase) cleaves between genes. A genomic DNA library generated by digestion with MNase was used for gene identification in P. falciparum. The approach was similar to that used for ESTs: one sequence read was obtained from either end. This data is referred to as genome sequence tags (GSTs). Usually, genomic DNA sequence refers to nuclear DNA.
Organelle DNA: Eukaryotic cells have organelles such as mitochondria and chloroplasts. These organelles have their own storehouse of information in the form of organelle DNA. Organelle DNA codes for a few genes; the coding information for the rest of the genes resides in the nuclear DNA of the same cell. If organelle DNA has been sequenced, this must be indicated at the appropriate position in the sequence submission form.
Other molecules: In addition to these molecules, the databases contain the sequences of other molecules such as tRNA and other small RNAs.
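The EST-based expression estimate described in the ESTs entry (ESTs per gene divided by the total number of ESTs) can be sketched as a simple count. The gene labels and the function name `est_expression_fractions` are hypothetical; a real analysis would first need to assign each EST to a gene, which is not shown here.

```python
from collections import Counter


def est_expression_fractions(est_gene_labels):
    """Fraction of all ESTs that map to each gene."""
    counts = Counter(est_gene_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: count / total for gene, count in counts.items()}


# Hypothetical data: each entry is the gene that one EST was matched to.
example_ests = ["geneA", "geneA", "geneB", "geneA", "geneC", "geneB", "geneA", "geneC"]
fractions = est_expression_fractions(example_ests)
print(fractions)
```

In this made-up sample, half of the ESTs come from geneA, which would suggest that it is the most highly expressed of the three in the sampled library.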

BIOTECHNOLOGY INFORMATION SYSTEM NETWORK (BTISnet)

•It is a national bioinformatics network.
•In 1987, India became the first country in the world to establish a Biotechnology Information System (BTIS) network, to create an infrastructure for biotechnology through the application of bioinformatics.
•The Department of Biotechnology (DBT), Ministry of Science and Technology, Government of India, took up this infrastructure development project and created a distributed network at very low cost. The network is run by the DBT.
•BTIS is today recognized as one of the major scientific networks in the world dedicated to providing state-of-the-art infrastructure, education, manpower and tools in bioinformatics.

Need for BTIS
•Research and development activities in modern biology and biotechnology are highly information-dependent.
•The growth of biotechnology has accelerated, particularly during the last decade, owing to path-breaking advances in biology and new technologies that produce large volumes of high-quality data.
•The rate of growth of these data has been estimated at more than 200 million bases per year.
•The content of the database itself is doubling in size approximately every year. The large amounts of data generated in various forms serve as a source of knowledge for scientists engaged in the field of biotechnology.
•The analysis of such large data sets and the extraction of knowledge from them is possible only with the help of new algorithms and compute intensive
The broad objectives of Biotechnology Information System Network programme are:
•To provide a national bioinformation network designed to bridge the inter-disciplinary gaps in biotechnology information and to establish links among scientists in organizations involved in R&D and manufacturing activities in the country.
•To build information resources, prepare databases on biotechnology and to develop relevant
information handling tools and techniques.
•To continuously assess information requirements, create and improve necessary infrastructure
and to provide informatics based support and services to the national community of users
working in biotechnology and allied areas.
•To coordinate efforts to access Biotechnology information worldwide including establishing
linkages with some of the international resources of Biotechnology information (e.g. Databanks
on genetic materials, published literature, patents, and other information of scientific and
commercial value).
•To undertake research into advanced methods of computer-based information processing for
analyzing the structure and function of biologically important molecules.
•To evolve and implement programmes on education of users and training of information
scientists responsible for handling of biotechnology information and its applications to
biotechnology research and development.
•To establish regional and international cooperation for exchange of scientific information and
expertise in biotechnology through the development of appropriate network arrangements.

Resources
•Databases of BTISnet: TRABAS by University of Calcutta, Kolkata; AgAbDb by University of Pune, Pune; PDB Goodies by Indian Institute of Science, Bangalore; etc.
•Open-source databases: The Gene Ontology (genome); DIANA LAB (RNA); Protein Data Bank/PDB (protein databases); etc.
•Software: Gene Ontology Based Prediction Analysis of MicroArray Suite/GOPAM; Spectral Repeat Finder/SRF; etc.

BTIS Centres in India
•Centres of Excellence: Bioinformatics Centre, DBT, New Delhi; University of Pune, Pune; Jawaharlal Nehru University (JNU), New Delhi; Madurai Kamaraj University (MKU), Madurai; Indian Institute of Science (IISc), Bangalore; Bose Institute, Kolkata.
•Super Computing Facility: (IIT) New Delhi.
•Distributed Information Centres (DICs) - 11: Anna University; Centre for Cellular & Molecular Biology; Indian Agricultural Research Institute; Institute of Microbial Technology; Kerala Agricultural University; M. S. University of Baroda; National Brain Research Centre; National Institute of Immunology; North Eastern Hill University (Shillong); Pondicherry University; University of Calcutta.
•Distributed Information Sub Centres (SubDICs) - 51: Institute of Life Sciences, Bhubaneswar; Indian Institute of Chemical Biology, Kolkata; Indian Institute of Technology, Kharagpur; etc.
•Bioinformatics Infrastructure Facility (BIF) for Biology Teaching Through Bioinformatics (BTBI) - 70: Vidyasagar University, Midnapur, West Bengal; West Bengal University of Technology, Kolkata, West Bengal; etc.
•North Eastern State Bioinformatics Infrastructure Facility (BIF) - 28: Institute of Advanced Study in Science and Technology, Ryan Path, Guwahati; Manipur University, Canchipur, Manipur; etc.

Achievements:
According to data for the last five years (2015-20), the Network has published more than 1200 research articles, created 200 databases and trained more than 8000 personnel, including students and scientists. Some of the most cited web servers developed by the network are VirulentPred, PredictBias, Bhageerath, Sanjeevini, ChemGenome 2.0 and CylinPred. Some of the key highlights (for the period 2015-20) are listed below:
Sr. No. | Software type | Numbers
1 | Databases | 206
2 | Web servers | 72
3 | Standalones (Databases: 5; Others: 35) | 40
4 | Applications developed | 7
Total software developed | 325
