BLAST works by finding short matches ("words") between query and database sequences. It uses these matches to seed ungapped local alignments, which are then extended and evaluated. Significant alignments above a score threshold are reported, along with a statistical analysis to assess their reliability. The process removes low-complexity regions before alignment and uses substitution matrices to score matches between residues.
2. BLAST
A sequence comparison algorithm used to search databases for optimal local alignments to a query.
The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix.
Word hits are then extended in an attempt to generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity of the search.
query
The input sequence (or other type of search term) to which all of the entries in a database are to be compared.
Subject
The database sequences that BLAST returns as matches after searching the database.
3. Algorithm
A fixed procedure embodied in a computer program.
Alignment
The process or result of matching up the nucleotide or amino acid residues of two or more biological sequences to achieve maximal levels of identity and, in the case of amino acid sequences, conservation, for the purpose of assessing the degree of similarity and the possibility of homology.
Bit score
The bit score, S', is derived from the raw alignment score, S, taking the statistical properties
of the scoring system into account. Because bit scores are normalized with respect to the
scoring system, they can be used to compare alignment scores from different searches.
E value
The Expectation value or Expect value represents the number of different alignments with
scores equivalent to or better than S that is expected to occur in a database search by chance.
The lower the E value, the more significant the score and the alignment.
4. gap
A space introduced into an alignment to compensate for insertions and deletions in one
sequence relative to another.
H
H is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990).
A measure of the average information (in bits) available per position that distinguishes an alignment from chance.
At high values of H, short alignments can be distinguished from chance, whereas at lower H values a longer alignment may be necessary (Altschul, 1991).
similarity
The extent to which nucleotide or protein sequences are related.
Similarity between two sequences can be expressed as percent sequence identity and/or
percent positive substitutions.
6. identity
The extent to which two (nucleotide or amino acid) sequences have the same residues at
the same positions in an alignment, often expressed as a percentage.
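K
A statistical parameter that can be thought of as a natural scale for the search space size.
It is used in converting a raw score (S) to a bit score (S').
lambda
A statistical parameter that can be thought of as a natural scale for the scoring system.
It is used in converting a raw score (S) to a bit score (S').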
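As a minimal sketch of how K and lambda enter the conversion, the snippet below uses the standard Karlin-Altschul relation S' = (lambda × S − ln K) / ln 2. The function name and the numeric parameter values are illustrative assumptions (the values are roughly those reported for ungapped BLOSUM62 scoring), not output from any BLAST run.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw score S to a bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Illustrative parameters only (assumed, roughly ungapped BLOSUM62).
LAMBDA, K = 0.3176, 0.134

for s in (50, 100, 150):
    print(s, round(bit_score(s, LAMBDA, K), 1))
```

Because lambda and K are folded into S', two bit scores can be compared even when they come from different scoring systems.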
p value
The probability of a chance alignment occurring with a particular score or a better score
in a database search.
The most highly significant P values will be those close to 0.
P values and E values are different ways of representing the significance of the alignment.
7. What is the Expect (E) value?
The Expect value (E) is a parameter that describes the number of hits one can "expect" to
see by chance when searching a database of a particular size.
It decreases exponentially as the Score (S) of the match increases.
Essentially, the E value describes the random background noise.
For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a
database of the current size one might expect to see 1 match with a similar score simply
by chance.
The lower the E-value, or the closer it is to zero, the more "significant" the match is.
You can change the Expect value threshold on most BLAST search pages.
When the Expect value is increased from the default value of 10, a larger list with more
low-scoring hits can be reported.
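The behaviour described above can be sketched with the usual bit-score form of the Expect value, E = m × n × 2^(−S'), where m is the query length and n the database length. The function name and the lengths used are assumptions chosen for illustration, and real BLAST uses effective (edge-corrected) lengths.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score` bits."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Each extra bit of score halves E; a database twice as large doubles E.
for bits in (30, 40, 50):
    print(bits, expect_value(bits, query_len=250, db_len=1_000_000))
```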
9. HSP
A High-scoring Segment Pair (HSP) is a local alignment with no gaps that
achieves one of the highest alignment scores in a given search.
low complexity region
A region of biased composition in nucleic acid and protein sequences.
These include homopolymeric runs, short-period repeats, and subtler overrepresentation of one or a few residues.
The SEG program is used to mask or filter low complexity regions in amino acid
queries.
The DUST program is used to mask or filter such regions in nucleic acid queries.
masking
Also known as filtering.
The removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence.
10. What is "low-complexity" sequence?
Regions with low-complexity sequence have an unusual composition that can create
problems in sequence similarity searching.
For amino acid queries this compositional bias is determined by the SEG program (Wootton
and Federhen, 1996).
For nucleotide queries it is determined by the DustMasker program (Morgulis et al., 2006).
Low-complexity sequence can often be recognized by visual inspection. For example, the
protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the
nucleotide sequence AAATAAAAAAAATAAAAAAT.
Filters are used to remove low-complexity sequence because it can cause artifactual hits.
In BLAST searches performed without a filter, high scoring hits may be reported only
because of the presence of a low-complexity region.
Most often, it is inappropriate to consider this type of match as the result of shared
homology.
Rather, it is as if the low-complexity region is "sticky" and is pulling out many sequences
that are not truly related.
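To make "low complexity" concrete, the sketch below computes the Shannon entropy of the residue composition of the two example sequences. This is only a rough illustration of compositional bias. It is not the SEG or DustMasker algorithm, and the function name is an assumption.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Entropy (bits per residue) of the residue composition of `seq`."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

protein = "PPCDPPPPPKDKKKKDDGPP"   # example from the text
dna = "AAATAAAAAAAATAAAAAAT"       # example from the text

# Upper bounds: log2(20) ~ 4.32 bits for protein, 2 bits for DNA.
print(round(shannon_entropy(protein), 2), round(shannon_entropy(dna), 2))
```

Both values fall well below the maximum for their alphabet, which is the numerical counterpart of what visual inspection shows.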
11. local alignment
The alignment of a high-scoring region of two nucleic acid or protein sequences.
optimal alignment
An alignment of two sequences with the highest possible score.
raw score
The score of an alignment, S, calculated as the sum of substitution and gap scores.
Substitution scores are given by a look-up table (e.g. PAM, BLOSUM).
Gap scores are typically calculated as the sum of G, the gap opening penalty, and L, the gap extension penalty.
For a gap of length n, the gap cost would be G+Ln.
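The gap cost formula can be written as a short function. The parameter names are assumptions, and the values G = 11 and L = 1 are only an illustrative choice of opening and extension penalties.

```python
def gap_cost(length, gap_open, gap_extend):
    """Cost of a gap of `length` positions: G + L * n."""
    if length < 1:
        raise ValueError("a gap must span at least one position")
    return gap_open + gap_extend * length

# Illustrative penalties (assumed): G = 11, L = 1.
for n in (1, 3, 10):
    print(n, gap_cost(n, gap_open=11, gap_extend=1))
```

With these values one long gap is cheaper than several short ones of the same total length, because the opening penalty G is paid only once.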
12.
AACGTTTCCAGTCCAAATAGCTAGGC
***--***   *-***-**-******
AACCGTTC   TACAATTACCTAGGC
* = Positive match
- = Mismatch
Blank = gap between the sequences
Score for each positive match = 1, total number of positive matches = 18
Penalty for each mismatch = 2, total number of mismatches = 5
Penalty for each gap position = 2 (the gap penalty can be set by the user), total number of gap positions = 3
Score = (number of positive matches × score per match) − {(number of mismatches × penalty per mismatch) + (number of gap positions × penalty per gap position)}
Score = (18 × 1) − {(5 × 2) + (3 × 2)} = 2
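The worked example can be checked with a few lines of code. This is a sketch under the slide's simple scheme (match +1, mismatch penalty 2, penalty 2 per gap position). The three gap positions in the second sequence are written as '-' characters, and the function name is an assumption.

```python
def simple_score(seq_a, seq_b, match=1, mismatch_penalty=2, gap_penalty=2):
    """Count matches, mismatches and gap positions in two aligned strings."""
    matches = mismatches = gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches * match - (mismatches * mismatch_penalty + gaps * gap_penalty)
    return matches, mismatches, gaps, score

top    = "AACGTTTCCAGTCCAAATAGCTAGGC"
bottom = "AACCGTTC---TACAATTACCTAGGC"
print(simple_score(top, bottom))
```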
13. substitution
The presence of a non-identical amino acid at a given position in an alignment.
If the aligned residues have similar physico-chemical properties or have a positive score in
the governing scoring matrix the substitution is said to be conservative.
substitution scoring matrix
A scoring matrix containing values proportional to the probability that amino acid i
mutates into amino acid j for all pairs of amino acids.
Such matrices are constructed by assembling a large and diverse sample of verified
pairwise alignments of protein sequences.
If the sample is large enough, the resulting matrices should reflect the true probabilities of
mutations occurring through a period of evolution.
The BLOSUM matrices are examples of substitution scoring matrices.
14. PAM
Percent Accepted Mutation (PAM) is a unit introduced by Margaret Dayhoff and colleagues to quantify the amount of evolutionary change in a protein sequence.
1.0 PAM unit is the amount of evolution that will change, on average, 1% of amino acids in a
protein sequence.
A PAM(x) substitution matrix is a look-up table in which scores for each amino acid
substitution have been calculated based on the frequency of that substitution in closely related
proteins that have experienced a certain amount (x) of evolutionary divergence.
unitary matrix
Also known as an identity matrix. This is a scoring system in which only identical characters receive a positive score.
15. BLOSUM
A Blocks Substitution Matrix is a substitution scoring matrix in which scores for each position
are derived from observations of the frequencies of substitutions in blocks of local alignments
in related proteins.
Each matrix is tailored to a particular evolutionary distance.
In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity.
Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members (Henikoff and Henikoff, 1992).
16. profile
A table that lists the frequencies of each amino acid at each position of a protein sequence alignment. Frequencies are calculated from multiple alignments of sequences containing a domain of interest.
PSSM
A Position-Specific Scoring Matrix (PSSM) is a profile that gives the log-odds score for
finding a particular matching amino acid in a target sequence.
19. Using a heuristic method, BLAST finds similar sequences not by comparing either sequence in its entirety, but by locating short matches between the two sequences.
This process of finding initial words is called seeding.
When searching for similarity between sequences, sets of common letters, known as words, are very important.
For example, suppose that the sequence contains the following stretch of letters, GLKFA.
If a BLASTp search were run under default conditions, the word size would be 3 letters.
In this case, using the given stretch of letters, the searched words would be GLK, LKF,
KFA.
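The word list for the GLKFA example can be produced with a sliding window. This is a minimal sketch and the function name is an assumption.

```python
def query_words(seq, k=3):
    """Return every overlapping word of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(query_words("GLKFA"))
```

A sequence of length n yields n − k + 1 words, so the five-letter example gives three words.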
20. An overview of the BLASTP algorithm
(a protein to protein search)
21. Remove low-complexity regions or sequence repeats in the query sequence.
"Low-complexity region" means a region of a sequence composed of few kinds of elements.
These regions might give high scores that make it harder for the program to find the actual significant sequences in the database, so they should be filtered out.
The regions will be marked with an X (protein sequences) or N (nucleic acid sequences) and then be ignored by the BLAST program.
To filter out the low-complexity regions, the SEG program is used for protein sequences and the program DUST is used for DNA sequences.
In addition, the program XNU is used to mask tandem repeats in protein sequences.
22. Make a k-letter word list of the query sequence.
Taking k=3 as an example, we list every word of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) sequentially, until the last letter of the query sequence is included. The method is illustrated in figure 1.
23. List the possible matching words.
BLAST only cares about the high-scoring words.
The scores are created by comparing each word in the query word list with all possible 3-letter words.
By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 possible match scores for a 3-letter word.
For example, the scores obtained by comparing PQG with PEG and PQA are 15 and 12, respectively.
For DNA words, a match is scored as +5 and a mismatch as -4, or as +2 and -3.
After that, a neighborhood word score threshold T is used to reduce the number of possible
matching words.
The words whose scores are at least the threshold T will remain in the possible matching words list, while those with lower scores will be discarded.
For example, PEG is kept, but PQA is abandoned when T is 13.
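The PQG example can be reproduced with a handful of BLOSUM62 entries. The sketch below hard-codes only the five residue pairs needed here, and the helper names are assumptions. Following the glossary definition (a word must score at least T), the filter keeps words with score >= T.

```python
# Only the BLOSUM62 entries needed for this example; not the full matrix.
BLOSUM62_SUBSET = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("G", "A"): 0,
}

def pair_score(a, b):
    """Symmetric look-up in the partial matrix above."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(query_word, candidate):
    """Sum of the per-position substitution scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))

def neighborhood(query_word, candidates, threshold):
    """Keep the candidate words that score at least `threshold`."""
    return [w for w in candidates if word_score(query_word, w) >= threshold]

print(neighborhood("PQG", ["PEG", "PQA"], threshold=13))
```

A real implementation would score all 20^3 candidate words against the full matrix rather than a short candidate list.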
24. Organize the remaining high-scoring words into an efficient search tree.
This allows the program to rapidly compare the high-scoring words to the database
sequences.
Repeat the previous two steps for each k-letter word in the query sequence.
Scan the database sequences for exact matches with the remaining high-scoring words.
The BLAST program scans the database sequences for the remaining high-scoring words, such as PEG, at each position.
If an exact match is found, this match is used to seed a possible un-gapped alignment
between the query and database sequences.
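The scanning step can be sketched as follows. For brevity a dictionary stands in for the search tree, which is a swapped-in data structure rather than the one BLAST uses. The function names, the input layout and the subject sequence are hypothetical.

```python
from collections import defaultdict

def build_word_index(words_by_query_pos):
    """Map each high-scoring word to the query positions it came from."""
    index = defaultdict(list)
    for q_pos, words in words_by_query_pos.items():
        for word in words:
            index[word].append(q_pos)
    return index

def find_seeds(index, subject, k=3):
    """Return (query_pos, subject_pos) for every exact word match in `subject`."""
    seeds = []
    for s_pos in range(len(subject) - k + 1):
        for q_pos in index.get(subject[s_pos:s_pos + k], []):
            seeds.append((q_pos, s_pos))
    return seeds

# Query word PQG at position 0 and its retained neighbour PEG.
index = build_word_index({0: ["PQG", "PEG"]})
print(find_seeds(index, "MKPEGLL"))
```

Each returned pair is a seed from which an un-gapped extension can start.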
25. Extend the exact matches to high-scoring segment pairs (HSPs).
The original version of BLAST stretches a longer alignment between the query and the
database sequence in the left and right directions, from the position where the exact match
occurred.
The extension does not stop until the accumulated total score of the HSP begins to decrease.
A simplified example is presented in figure 2.
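A minimal sketch of un-gapped extension is given below. It scores DNA with +5 for a match and -4 for a mismatch, and stops extending in a direction once the running score falls more than `x_drop` below the best score seen. With `x_drop=0` this matches the simplified rule above (stop as soon as the score decreases). The `x_drop` parameter, the function names and the test sequences are assumptions for illustration.

```python
def dna_score(a, b, match=5, mismatch=-4):
    return match if a == b else mismatch

def _extend(pairs, score_fn, x_drop):
    """Walk along residue pairs; return (best score gain, pairs kept)."""
    best = running = 0
    best_len = 0
    for i, (a, b) in enumerate(pairs, start=1):
        running += score_fn(a, b)
        if running > best:
            best, best_len = running, i
        elif best - running > x_drop:
            break
    return best, best_len

def extend_seed(query, subject, q_pos, s_pos, k, score_fn=dna_score, x_drop=0):
    """Extend a k-letter seed left and right; return (score, q_start, q_end)."""
    seed = sum(score_fn(a, b)
               for a, b in zip(query[q_pos:q_pos + k], subject[s_pos:s_pos + k]))
    right, r_len = _extend(zip(query[q_pos + k:], subject[s_pos + k:]),
                           score_fn, x_drop)
    left, l_len = _extend(zip(reversed(query[:q_pos]), reversed(subject[:s_pos])),
                          score_fn, x_drop)
    return seed + right + left, q_pos - l_len, q_pos + k + r_len

score, q_start, q_end = extend_seed("TTACGTGG", "CCACGTGA", q_pos=2, s_pos=2, k=3)
print(score, "TTACGTGG"[q_start:q_end])
```

A larger `x_drop` lets the extension continue past an isolated mismatch, at the cost of more computation.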
26. List all of the HSPs in the database whose score is high enough to be considered.
We list the HSPs whose scores are greater than the empirically determined cutoff score S.
By examining the distribution of the alignment scores modeled by comparing random
sequences, a cutoff score S can be determined such that its value is large enough to
guarantee the significance of the remaining HSPs.
Evaluate the significance of the HSP score.
BLAST next assesses the statistical significance of each HSP score by exploiting the
Gumbel extreme value distribution (EVD).
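In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equation
$$p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - \mu)}\right)$$
where
$$\mu = \frac{\ln(\kappa\, m' n')}{\lambda}$$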
27. The statistical parameters λ and κ are estimated by fitting the distribution of the un-gapped local alignment scores of the query sequence and many shuffled versions (global or local shuffling) of a database sequence to the Gumbel extreme value distribution.
Note that λ and κ depend upon the substitution matrix, gap penalties, and sequence composition (the letter frequencies). m' and n' are the effective lengths of the query and database sequences, respectively.
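The Gumbel tail probability can be evaluated directly. The sketch below assumes the standard form p(S ≥ x) = 1 − exp(−e^(−λ(x − μ))) with μ = ln(κ m' n') / λ. The function name and all numeric values are illustrative assumptions, not fitted parameters.

```python
import math

def gumbel_p_value(x, lam, kappa, m_eff, n_eff):
    """P(S >= x) under the Gumbel extreme value distribution."""
    mu = math.log(kappa * m_eff * n_eff) / lam
    expected_hits = math.exp(-lam * (x - mu))   # equals kappa * m' * n' * e^(-lam * x)
    return -math.expm1(-expected_hits)          # 1 - exp(-expected_hits), numerically stable

# Illustrative values only (assumed).
p = gumbel_p_value(x=80, lam=0.3176, kappa=0.134, m_eff=250, n_eff=1_000_000)
print(p)
```

For high scores the inner term is small and p is nearly equal to it, which is why P values and E values agree closely for significant hits.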
28. Make two or more HSP regions into a longer alignment.
Sometimes, we find two or more HSP regions in one database sequence that can be made
into a longer alignment.
This provides additional evidence of the relation between the query and database sequence.
There are two methods, the Poisson method and the sum-of-scores method, to compare the
significance of the newly combined HSP regions.
Suppose that there are two combined HSP regions with the pairs of scores (65, 40) and (52,
45), respectively.
The Poisson method gives more significance to the set with the maximal lower score
(45>40).
However, the sum-of-scores method prefers the first set, because 65+40 (105) is greater than 52+45 (97). The original BLAST uses the Poisson method; gapped BLAST and WU-BLAST use the sum-of-scores method.
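The two preferences in this example reduce to different ranking keys. The sketch below covers only that ranking criterion, not the underlying significance statistics of either method, and the function names are assumptions.

```python
def prefer_poisson(hsp_sets):
    """Pick the set whose lowest HSP score is highest."""
    return max(hsp_sets, key=min)

def prefer_sum_of_scores(hsp_sets):
    """Pick the set whose HSP scores add up to the most."""
    return max(hsp_sets, key=sum)

sets = [(65, 40), (52, 45)]
print(prefer_poisson(sets), prefer_sum_of_scores(sets))
```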
29. Show the gapped Smith-Waterman local alignments of the query and each of the
matched database sequences.
The original BLAST only generates un-gapped alignments including the initially found
HSPs individually, even when there is more than one HSP found in one database sequence.
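For reference, a gapped Smith-Waterman local alignment score can be computed with a short dynamic program. This is a simplified sketch with a linear gap penalty and a flat match/mismatch score. The scoring values and function name are assumptions, and BLAST itself applies its own substitution matrix and affine gap costs.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Only two rows of the matrix are kept because the sketch returns the score alone. Recovering the alignment itself would require the full matrix and a traceback.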
Report every match whose Expect value is lower than the threshold parameter E.