1. Data mining in Bioinformatics:
Data Mining is the process of automatic discovery of novel and understandable models and
patterns from large amounts of data involving methods at the intersection of machine learning,
statistics and database systems. Bioinformatics is the science of storing, analyzing, and utilizing
information from biological data such as sequences, molecules, gene expressions, and
pathways. Development of novel data mining methods will play a fundamental role in
understanding these rapidly expanding sources of biological data.
Data mining is an interdisciplinary subfield of computer science and statistics with an overall
goal to extract information from a large set of data and transform the information into a
comprehensible structure for further use. Data mining is the analysis step of the "knowledge
discovery in databases" (KDD) process (Fig. 1). Aside from the raw analysis step, it also
involves database and data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, complexity considerations, post-processing of
discovered structures, visualization, and online updating.
Fig. 1: The process of KDD and the steps involved.
Data mining approaches are well suited to bioinformatics, where enormous volumes of data
are deposited continuously. The extensive databases of biological information create both
challenges and opportunities for developing novel data mining methods. The Workshop on
Data Mining in Bioinformatics (BIOKDD) has been held annually since 2001 with the goal of
encouraging KDD researchers worldwide to take on the numerous challenges that
bioinformatics offers.
The difference between data analysis and data mining is that data analysis is used to test models
and hypotheses on a dataset (e.g., analyzing the effectiveness of a marketing campaign),
regardless of the amount of data. In contrast, data mining uses machine learning and statistical
models to uncover hidden patterns in a large volume of data.
BOTMT:604
Bioinformatics and Biophysics
Prepared By-
Dr. Sangeeta Das.
Assistant Professor, Department of Botany, Bahona College, Jorhat, Assam, India.
2. Data mining Tools in Bioinformatics:
Various tools for data mining are used in bioinformatics. The following are the tools for
nucleotide sequence analysis:
1. BLAST:
The Basic Local Alignment Search Tool (BLAST) compares gene and protein sequences
against others in public databases. It now comes in several types, including PSI-BLAST,
PHI-BLAST, and BLAST 2 Sequences. Specialized BLASTs are also available for human,
microbial, malaria, and other genomes, as well as for vector contamination, immunoglobulins,
and tentative human consensus sequences.
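To make the idea of "local alignment" concrete, the following is a minimal sketch of Smith-Waterman scoring, the exact dynamic-programming method that BLAST approximates with faster heuristics. It is not BLAST's own algorithm. The function name and the scoring values (match, mismatch, gap) are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two sequences that share only the local region "ACGT".
score = local_alignment_score("TTACGTTT", "GGACGTGG")
```

A database search would, in principle, compute such a score between the query and every database sequence and report the highest-scoring hits; BLAST avoids this full computation by first looking for short exact word matches.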
2. Electronic PCR:
This tool searches a target DNA sequence for sequence-tagged sites (STSs) that have
been used as landmarks in various types of genomic maps. It compares the query sequence
against data in NCBI’s UniSTS, a unified, non-redundant view of STSs from a wide range of
sources.
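The principle behind an electronic PCR search can be sketched as follows: an STS is reported where a forward primer and the reverse complement of a reverse primer both occur in the sequence, separated by a distance that gives the expected product size. This is a simplified illustration using exact matching only; the function names, primers, and size limits are hypothetical and are not taken from the NCBI tool.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts(template, fwd_primer, rev_primer, min_size, max_size):
    """Return (start, end) spans where the primer pair brackets a product
    whose length lies between min_size and max_size (exact matches only)."""
    rev_site = reverse_complement(rev_primer)
    hits = []
    start = template.find(fwd_primer)
    while start != -1:
        site = template.find(rev_site, start + len(fwd_primer))
        while site != -1:
            product_len = site + len(rev_site) - start
            if product_len > max_size:
                break  # later sites only give longer products
            if product_len >= min_size:
                hits.append((start, start + product_len))
            site = template.find(rev_site, site + 1)
        start = template.find(fwd_primer, start + 1)
    return hits


# Hypothetical template containing one primer-bracketed site.
template = "GG" + "ACGTAC" + "TTTTTTTT" + reverse_complement("CCTAGA") + "AA"
hits = find_sts(template, "ACGTAC", "CCTAGA", min_size=15, max_size=30)
```

A practical search would also tolerate a few mismatches in the primers, which this sketch does not attempt.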
3. Entrez:
The Entrez Global Query Cross-Database Search System is a federated search engine, or
web portal, that allows users to search many discrete health sciences databases at the National
Center for Biotechnology Information (NCBI) website. The name "Entrez" (meaning "Come
in" in French) was chosen to reflect the spirit of welcoming the public to search the content
available from the National Library of Medicine (NLM).
Entrez Global Query is an integrated search and retrieval system that provides access to all
databases simultaneously with a single query string and user interface. Entrez can efficiently
retrieve related sequences, structures, and references. The Entrez system can provide views of
gene and protein sequences and chromosome maps. Some textbooks are also available online
through the Entrez system. Entrez searches databases such as PubMed, PubMed Central,
Site Search, online Books, Online Mendelian Inheritance in Man (OMIM), the Nucleotide
sequence database (GenBank), the Protein sequence database, Genome Project, UniGene,
NLM Catalog, etc.
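Besides the web interface described above, Entrez databases can be queried from a program through NCBI's E-utilities web service. The sketch below only builds an ESearch request URL and does not contact the server. The helper name `esearch_url` and the example search term are assumptions for illustration; `db`, `term`, and `retmax` are ESearch parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that searches one Entrez database for a term.

    esearch_url is a hypothetical helper; it performs no network access.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example: a query against the nucleotide database (term chosen for illustration).
url = esearch_url("nucleotide", "BRCA1", retmax=5)
```

Opening such a URL returns a list of matching record identifiers, which can then be passed to other E-utilities calls to retrieve the records themselves.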
Each Entrez Gene record encapsulates a wide range of information for a given gene and
organism. When possible, the information includes results of analyses that have been done on
the sequence data. The amount and type of information presented depend on what is available
for a particular gene and organism and includes:
(1) graphic summary of the genomic context, intron/exon structure, and flanking genes
(2) link to a graphic view of the mRNA sequence, which in turn shows biological features such
as CDS, SNPs, etc.
(3) links to gene ontology and phenotypic information
(4) links to corresponding protein sequence data and conserved domains
(5) links to related resources, such as mutation databases.
Entrez Gene is the successor to LocusLink.
4. Model Maker:
It allows users to view the evidence (mRNAs, ESTs, and gene predictions) that was aligned to
the assembled genomic sequence to build a gene model, and to edit the model by selecting or
removing putative exons. Model Maker is accessible from sequence maps that were analyzed
at NCBI and displayed in Map Viewer.
5. ORF (Open Reading Frame) Finder:
ORF Finder identifies all possible ORFs in a DNA sequence by locating the standard and
alternative stop and start codons. The deduced amino acid sequences can then be used to
BLAST against GenBank. ORF Finder is also packaged in the sequence submission software
Sequin.
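The basic idea of locating ORFs by their start and stop codons can be sketched as below. This is a simplified illustration and not the NCBI program: it assumes the standard genetic code with ATG as the only start codon, scans only the three forward reading frames, and uses a hypothetical `min_codons` threshold.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF in the three
    forward frames; end is the index just past the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# ATG AAA TTT TAG lies in the third forward frame (offset 2).
orfs = find_orfs("CCATGAAATTTTAGGG")
```

To cover the other three reading frames, the same function can be applied to the reverse complement of the sequence.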
6. SAGEMAP:
It is a tool for performing statistical tests designed specifically for differential-type analyses of
SAGE (Serial Analysis of Gene Expression) data. The data include SAGE libraries generated
by individual labs as well as those generated by the Cancer Genome Anatomy Project (CGAP),
which have been submitted to Gene Expression Omnibus (GEO). Gene expression profiles that
compare the expression in different SAGE libraries are also available on the Entrez GEO
Profiles pages. It is possible to enter a query sequence in the SAGEmap resource to determine
which SAGE tags are in the sequence, then map to the associated SAGE tag records and view the
expression of those tags in different CGAP SAGE libraries.
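Determining which SAGE tags occur in a query sequence can be illustrated with a short sketch. It assumes the common SAGE protocol in which the anchoring enzyme NlaIII cuts at CATG and the tag is the 10 bases immediately downstream; both values are parameters here, and the function name is hypothetical rather than part of SAGEmap.

```python
def sage_tags(seq, anchor="CATG", tag_len=10):
    """Return (position, tag) for every anchor site followed by a full-length tag.

    anchor and tag_len are assumptions reflecting a common SAGE protocol.
    """
    seq = seq.upper()
    tags = []
    pos = seq.find(anchor)
    while pos != -1:
        tag = seq[pos + len(anchor): pos + len(anchor) + tag_len]
        if len(tag) == tag_len:  # skip sites too close to the sequence end
            tags.append((pos, tag))
        pos = seq.find(anchor, pos + 1)
    return tags


# Two CATG sites; only the first is followed by 10 bases.
tags = sage_tags("AACATGACGTACGTACTTCATGGG")
```

In an actual SAGE experiment the tag normally comes from the anchor site nearest the 3' end of the transcript, so the candidates returned here would still need to be filtered accordingly.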
7. Spidey:
It aligns one or more mRNA sequences to a single genomic sequence. Spidey will try to
determine the exon/intron structure, returning one or more models of the genomic structure,
including the genomic/mRNA alignments for each exon.
8. VecScreen:
It is a tool for identifying segments of a nucleic acid sequence that may be of vector, linker, or
adapter origin prior to sequence analysis or submission. VecScreen was developed to combat
the problem of vector contamination in public sequence databases.
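The notion of flagging segments of possible vector origin can be illustrated with a simple shared-word search. This is a deliberately simplified stand-in: VecScreen itself relies on a BLAST search against a vector database, whereas the sketch below only reports regions of the query that share exact k-letter words with a single vector sequence. The function name, the value of k, and the sequences are hypothetical.

```python
def vector_segments(query, vector, k=8):
    """Return merged (start, end) intervals of query that share exact
    k-mers with vector. A toy substitute for a BLAST-based screen."""
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    segments = []
    for i in range(len(query) - k + 1):
        if query[i:i + k] in vector_kmers:
            if segments and i <= segments[-1][1]:
                # Overlaps or touches the previous hit: extend it.
                segments[-1] = (segments[-1][0], i + k)
            else:
                segments.append((i, i + k))
    return segments


# Hypothetical vector, and a query carrying 12 bases of it in the middle.
vector = "ACGTTGCAAGCTTGGCC"
query = "TTTTTTTT" + "GTTGCAAGCTTG" + "CCCCCCCC"
segments = vector_segments(query, vector)
```

Segments flagged in this way would be trimmed from the sequence before analysis or submission.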