Bioinformatics Guide

1 of 8
Bioinformatics
Bioinformatics are using of computers to store, organize and analyze biological information
(particularly that based on DNA and protein sequence).
A large number of different kinds of data have now been incorporated into biological
databases and can be said to be part of the domain of bioinformatics:
 DNA sequence, RNA sequence, Protein sequence
 Data from the analysis of transcription across whole genomes (transcriptomics), and
analysis of the expressed protein co mplement of cells or tissues under particular
circumstances (proteomics).
 Data from other ‘omics experiments, including metabolomics, proteomics, and
methods for studying interactions between molecules on a large scale, such as two-
hybrid analysis.
 Structural data from X-ray crystallography and nuclear magnetic resonance (NMR)
 Common Sequence Analyses
 compiling and editing sequences
 manipulating sequences
 complement, reverse, or both
 translating
 structural analysis
 searching for coding regions
 physical properties
 comparisons and homology searches
 aligning sequences
 sites and motifs
 gene identification
Basic information
o DNA contains the genetic information that encodes traits.
o Complementary: Complementary bases are two bases that are able to pair with each
other and create a base pair. In DNA, adenine (A) and thymine (T) are
complementary, and the A on one DNA strand interacts with the T on the opposite (or
complementary) DNA strand to form a bond to complete double stranded DNA.
Similarly, guanine (G) and cytosine (C) are complementary because they always
interact with one another. This complementary interaction happens between each
base on the DNA strands.
o DNA is double stranded and antiparallel. DNA strands are “anti-parallel” to one
another, meaning they are parallel (or side-by-side) but run in opposite directions, with
the beginning (or 5’ region) of one DNA strand found at the same location as the end
(or 3’ region) of the opposite DNA strand.
o 5’ and 3’: The 5’ region is the beginning of the DNA or RNA strand, while the 3’ region
is the end of the DNA or RNA strand. Scientists often say that DNA is “read 5’ to 3’”
which means that the sequence is read or examined from the beginning (5’) to the
end (3’), left to right. Genes are also transcribed and translated in a 5’ to 3’ direction.

2 of 8
o Read the sequence of DNA from 5’to 3’
o Read the sequence of mRNA from 5’to 3’
o Read the sequence of polypeptide from N-terminal to C-terminal
o Gene: The unit of heredity. A segment of DNA that codes for a specific protein.
o Coding strand or nontemplate strand or sense strand: Of the two DNA strands, the
coding strand is the strand that has the same sense as the messenger RNA. We can
use the DNA sequence from the coding strand to predict a protein sequence by
looking at the genetic code. This strand is the same sequence as the mRNA. Only
one strand is used for one gene, but different strands can encode different genes.
o Non-coding or template strand or Anti-sense strand: Of the two DNA strands, the anti-
sense strand is the one that is complementary to the strand that encodes a gene. This
strand is used to make the mRNA.
o Messenger RNA (mRNA): The template for protein synthesis. mRNA is created from a
gene through the process of transcription, and is used in the process of translation to
make proteins.
o Transcription: The process by which genetic information is copied from DNA into
mRNA.
o Transfer RNA (tRNA): small molecules of RNA that transfer specific amino acids to a
growing protein chain during protein synthesis (translation), using the information
encoded by the mRNA.
o Nucleotide triplets: Series of three nucleotides in a row, usually called a “codon,” that
specifies the genetic code information for a particular amino acid when translating a
gene into protein. For example, the nucleotide triplet or codon CCG codes for the
amino acid proline (P).
o Start codon: A three-nucleotide triplet or codon that tells the cell when to start protein
translation. The start codon is ATG (or AUG in mRNA).
o Stop codons: A three-nucleotide triplet or codon that tells the cell when to stop protein
translation. The stop codons are TAA, TAG, and TGA (or UAA, UAG, and UGA in
mRNA).
o Protein: A type of molecule found in cells formed from a chain of amino acids encoded
by genes. Proteins perform many of the functions needed in the cell.
o Translation: The process by which mRNA is decoded to produce a protein.
o Non-conservative: When amino acid changes observed between two or more
sequences do not share the same side chain or R-group chemistry.
o Conservative: When amino acid changes observed between two or more sequences
do share the same side chain or R-group chemistry.
o Because of the redundancy of the genetic code, in which multiple codons may encode
the same amino acid, the DNA and corresponding protein sequence may appear to
vary at different rates.]
o mRNA transcripts contain “start” and “stop” codons that initiate and terminate protein
translation, respectively.
o DNA encodes proteins through the processes of transcription and translation. Amino
acids are encoded by three nucleotide triplets called codons. The concepts of “start”

3 of 8
and “stop” codons. Proficiency in the use of the genetic code table, including the use
of single letter amino acid abbreviations, is helpful.
o Reading frame: A reading frame is a contiguous and non-overlapping sequence of
three-nucleotide codons in DNA or RNA. There are three possible reading frames in
an mRNA strand and six in a double-stranded DNA molecule (three reading frames
from each of the two DNA strands).
o Open reading frame: A reading frame that contains a start codon and a stop codon,
with multiple three-nucleotide codons in between. The open reading frame in a
particular region of DNA is the correct reading frame from which to translate the DNA
into protein. The longest open reading frame is the one that’s most likely to be correct.
o DNA sequences can be read in any of six possible reading frames, but only one of
these is usually translated by the cell into a protein. This is called the open reading
frame.
o Genetic analyses and multiple sequence alignments can be generated with either
DNA or protein sequences.
o Common Misconceptions: Translation always starts with the 1st letter of a DNA
sequence, or with the first start codon (AUG/ATG). All DNA codes for protein. Genes
are found only on one of the strands of DNA.
o In silico: An expression used to mean “performed on computer or via computer
simulation.”
Sequence alignment
Alignments provide a powerful way to compare sequences for either evolutionary
relatedness or structural/functional relatedness. Sequences can be compared by either
global or local alignments. Global alignment forces complete alignment of the input
sequences, whereas local alignments will only align their most similar segments. The choice
of method will depend on whether the sequences are presumed to be related over their entire
lengths or to only share isolated regions of homology. The algorithms to carry out global and
local alignments are similar, but the statistics associated with the output are different.
Homologs are structures or objects that share a common evolutionary origin. Objects with
similar structure or function, but no common ancestor are analogs. Homologs can be
classified as orthologs or paralogs. Orthologs are homologous genes from different species
that arose from a common ancestor gene. Paralogs are homologous genes that are the result
of gene duplication within an evolutionary lineage (eg., species) and may have different or
similar functions.
Searching databases
Comparing a sequence against a database to discover similarities is one of the most
frequently used and powerful tools in bioinformatics. In essence, searching a database is the
same as aligning two sequences. However, to perform a pair-wise alignment of the query
sequence with every sequence in the database would be too time consuming. Therefore
programs use heuristic strategies to speed up the analysis are usually used.
FASTA and BLAST are the two most commonly used programs for searching
databases. Both programs use rapid exact match procedures to first identify sequences
which are most likely to be related. These sequences are then subjected to further analyses

4 of 8
by alignment programs. BLAST (basic local alignment search tool) uses an approach based
on matching short sequence fragments and then finds the best local alignments between the
query sequence and the database sequences. Several BLAST programs are available for
different types of sequence matching
BLAST searches of the most current version of GenBank can be performed at the
National Center for Biotechnology Information (NCBI) by cutting and pasting the query
sequence onto their web page (http://www.ncbi.nlm.nih.gov/ BLAST/). The entire database or
subdivisions of the database can be searched. It is also desirable to 'filter' low complexity
sequences such as long runs of repetitive sequence (eg., tracts of poly-alanine, etc.) and
most BLAST programs filter by default. Such regions of sequence can give spuriously high
scores against unrelated proteins. Similarly, iterative searches are prone to contamination by
regions corresponding to coiled-coils or transmembrane domains. These characteristics
might match the initial search and the program then emphasizes these inappropriate
characteristics. The user can also choose cut-off values (i.e., E-values), the number of
matches to report, and the number of alignments to show.
The entire database(s) is searched and sequences with similarity to the query
sequence are identified and ranked accordingly to the alignment score derived from the pair-
wise alignment. In addition to this 'raw or bit score', the output includes a 'statistical score'
and the alignments. The statistical score, or expectation value (E-value), provides a measure
of the expected number of sequences in the database that would achieve a given alignment
score by chance. The E-values are more useful in terms of judging the significance of a
match than the alignment score. E-values for proteins and DNA should be less than 0.1 and
0.0001, respectively, to be considered significant. Examining the alignments will also help in
inferring whether the hit is significant. For example, protein global alignments which show
25% or greater sequence identity over a stretch of at least 80 amino acids should exhibit the
same basic structure. The identified sequences can be also retrieved and subjected to further
analyses. Database searches may assist in identifying an unknown gene or provide clues
about its function.
The normal approach is to compare the unknown sequence with a database of
previously determined sequences, to identify the most closely related sequence or
sequences from the database. This method relies on the fact that sequences of DNA and
proteins both within and between species are related by homology; that is through having
common evolutionary ancestors, and so proteins (and RNA and DNA motifs) tend to occur in
families with related sequences, structures and functions. Two DNA or two protein

5 of 8
sequences will always show some similarity, if only by chance. The principle of searching for
meaningful sequence similarity is to align two sequences, that is, array them one above the
other with instances of the same base or amino acid at the same position.
Multiple Sequence Alignments
multiple sequence alignments involve the alignment of more than two sequences.
The advantage of aligning multiple DNA or protein sequences is that evolutionary
relationships between individual genes or sequences become clearer, and it becomes
possible to identify important residues or motifs from their conservation between genes or
proteins from different organisms (orthologs), or proteins of similar, but not identical function
(homologs, paralogs, . The most commonly used multiple sequence alignment tool is called
Clustal, and, as with the other bioinformatics tools, is available from many centers as a web-
based application. The method works by initially carrying out pairwise alignments between all
the sequences, as in the BLAST method described above, but then uses this information to
gradually combine the sequences, beginning with most similar, and finishing with the most
distant, optimizing the overall match of the sequences as it goes.
Phylogenetic Trees
The usefulness of sequence similarity searching is based on the fact that similar sequences
are related through evolutionary descent. Hence, we can use the similarity between
sequences to infer something about these evolutionary relationships. Since individual species
have separated from each other at different times during evolution, their DNA and protein
sequences should have diverged from each other and show differences that are broadly
proportional to the time since divergence, in other words the sequences should form a
molecular clock. The amount of sequence similarity can therefore be used to generate
phylogenetic relationships, and phylogenetic trees .
Structural Bioinformatics
The mismatch between the number of known protein sequences and the number of
determined three-dimensional structures has led to the development of methods to derive
structure direct from sequence. These include secondary structure prediction and
comparative or homology modeling, which uses sequence similarity between an unknown
and a known structure to derive a plausible structure for the new protein.
Proteomics
The gene sequences provide a limited amount of information about the proteins that
are encoded by the genes. In addition, phenotype will be determined by the proteins and not
the genes (i.e., genotype). Efforts are now being directed towards the characterization of the
proteome (the complete set of proteins found in a cell, tissue or organism). Proteomics
encompasses many types of activities and analyses with a goal of having a complete
understanding of protein function and to apply this knowledge. One objective of proteomics is
to identify the complete complement of proteins. Genomics provides a first step in this
process since all of the genes within a genome can potentially be identified.

6 of 8
Coding or non-template DNA strand 5’ ATG CAT AGC ATG AGG ATC AT 3’
Complement 3’ TAC GTA TCG TAC TCC TAG TA5’
mRNA 5’ AUG CAU AGC AUG AGG AUC AU 3’
Protein N- M H S M R I -C
Reverse 5’ TAC TAG GAG TAC GAT ACG TA 3’
Reverse complement 5’ ATG ATC CTC CTG CTA TGC AT 3’
Composition A=7 : T=5: C=3: G=5
AT =60%: GC=40%:length 20

Bioinformatics Guide

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Bioinformatics Guide

Semelhante a Bioinformatics Guide (20)

Mais de Nawfal Aldujaily

Mais de Nawfal Aldujaily (20)

Último

Último (20)

Bioinformatics Guide