2. Introduction:
BLAST is an acronym for Basic Local Alignment Search Tool
The BLAST program was developed by Stephen Altschul et al. at NCBI in 1990
Like FASTA, it is a heuristic method
It is one of the most popular programs for sequence analysis
3. BLAST enables a researcher to compare a query sequence with a library or database of sequences and to identify library sequences that resemble the query sequence above a certain threshold
The objective is to find high-scoring ungapped segments among related sequences
4. Using BLAST
http://www.ncbi.nlm.nih.gov/BLAST
1. Select BLAST program to use (blastn, blastp, blastx, tblastn, tblastx)
2. Select database to search
   (different BLAST programs have different databases)
3. Enter Query Sequence
4. Submit Search
5. Steps in BLAST
The query sequence is optionally filtered to remove low-complexity regions (e.g. AGAGAG…)
The next step is to create a list of words from the
query sequence.
Each word is typically 3 residues for protein
sequences and 11 residues for DNA sequences.
The list includes every possible word extracted from
the query sequence.
This step is also called seeding.
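The seeding step can be sketched in a few lines of Python. This is a minimal illustration rather than BLAST's actual implementation; the function name `make_words` is a hypothetical helper introduced here, and the two query strings are the ones used in the word examples of this deck.

```python
def make_words(sequence, word_size):
    """Return every overlapping word of length word_size, in query order."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


# Protein query with the default protein word size of 3.
protein_words = make_words("GTQITVEDLFYNIATRRKALKN", 3)
# DNA query with the default blastn word size of 11.
dna_words = make_words("GTACTGGACATGGACCCTACAGGAA", 11)

print(protein_words[:4])  # ['GTQ', 'TQI', 'QIT', 'ITV']
print(dna_words[0])       # GTACTGGACAT
```

A query of length L yields L - w + 1 words of size w, so a query shorter than the word size produces no words at all.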
6. PROTEIN WORDS
Query: GTQITVEDLFYNIATRRKALKN
Word Size = 3
Word Size can be 2 or 3 (default = 3)
Make a lookup table of words:
GTQ
TQI
QIT
ITV   Neighborhood Words: LTV, MTV, ISV, LSV, etc.
TVE
VED
EDL
DLF
...
7. NUCLEOTIDE WORDS
Query: GTACTGGACATGGACCCTACAGGAA
Word Size = 11
minimum word size = 7
blastn default = 11
megablast default = 28
Make a lookup table of words:
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
...........
8.
The third step is to search a sequence database for occurrences of these words.
The aim of this step is to identify database sequences that contain matching words
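The database scan can be sketched as a lookup table built from the query words, followed by one pass over each database sequence. This is a simplified sketch that finds exact word matches only; the names `build_lookup` and `find_hits` and the tiny database are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def build_lookup(query, word_size):
    """Map each query word to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table


def find_hits(query, database, word_size):
    """Return (sequence name, query position, subject position) for each word hit."""
    table = build_lookup(query, word_size)
    hits = []
    for name, subject in database.items():
        for j in range(len(subject) - word_size + 1):
            for i in table.get(subject[j:j + word_size], ()):
                hits.append((name, i, j))
    return hits


# Hypothetical two-sequence database, word size 4 to keep the example short.
toy_database = {"s1": "TTACGTTT", "s2": "GGGG"}
print(find_hits("ACGTAC", toy_database, 4))  # [('s1', 0, 2)]
```

Each hit records where the word sits in both sequences, which is the information a later extension step needs.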
9. Using a substitution score matrix, each query word is evaluated for matches with words in any database sequence, and the per-residue (log) scores are added
A cut-off score (T) is selected to reduce the number of matches to the most significant ones
This procedure is repeated for each word in the query sequence
The remaining high-scoring words are organised into an efficient search tree and rapidly compared to the database sequences
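The effect of the cut-off score T can be illustrated with a small sketch. It does not use a real substitution matrix: `toy_score` is a hypothetical stand-in (identical residues score 5, two residues from an assumed hydrophobic group I/L/M/V score 1, anything else scores -2), so the resulting word list differs from what BLOSUM62 would give.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("ILMV")  # assumed similarity group, for illustration only


def toy_score(a, b):
    """Hypothetical substitution score; a real search would use e.g. BLOSUM62."""
    if a == b:
        return 5
    if a in HYDROPHOBIC and b in HYDROPHOBIC:
        return 1
    return -2


def neighborhood(word, threshold):
    """Return all words whose summed score against `word` is at least threshold."""
    neighbors = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            neighbors.append("".join(candidate))
    return neighbors


print(sorted(neighborhood("ITV", 11)))
# ['ITI', 'ITL', 'ITM', 'ITV', 'LTV', 'MTV', 'VTV']
```

Raising the threshold shrinks the list toward the exact word alone, while lowering it admits more distant words and makes the search slower but more sensitive.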
10.
If a good match is found, an alignment is extended from the match area in both directions for as long as the score continues to grow.
The extension continues until the score of the alignment drops below a threshold due to mismatches
(the drop threshold is 22 for proteins and 20 for DNA).
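The ungapped extension can be sketched as follows. This is one reading of the drop rule, assuming that extension in a given direction stops once the running score has fallen more than `x_drop` below the best score seen so far; the match/mismatch scores of +5/-4 and the function name `extend_ungapped` are assumptions made for the example.

```python
def extend_ungapped(query, subject, q_start, s_start, word_size,
                    match=5, mismatch=-4, x_drop=20):
    """Extend a word hit in both directions without gaps.

    Returns (score, q_lo, q_hi): the segment score and the half-open
    query interval [q_lo, q_hi) covered by the extended segment.
    """
    def score(a, b):
        return match if a == b else mismatch

    # Score of the seed word itself.
    seed = sum(score(query[q_start + i], subject[s_start + i])
               for i in range(word_size))

    def extend(q_pos, s_pos, step):
        best = running = 0
        best_len = length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            running += score(query[q_pos], subject[s_pos])
            length += 1
            if running > best:
                best, best_len = running, length   # new best end point
            elif best - running > x_drop:
                break                              # dropped too far: stop
            q_pos += step
            s_pos += step
        return best, best_len

    right_gain, right_len = extend(q_start + word_size, s_start + word_size, +1)
    left_gain, left_len = extend(q_start - 1, s_start - 1, -1)
    return (seed + left_gain + right_gain,
            q_start - left_len,
            q_start + word_size + right_len)


# Seed "ACGT" at position 3 in both sequences; the match extends to the right.
print(extend_ungapped("GGGACGTACGTCCC", "TTTACGTACGTAAA", 3, 3, 4))
# (40, 3, 11)
```

The segment is trimmed back to the best-scoring end point in each direction, so trailing mismatches seen before the stop are not included in the reported interval.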
11. The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP)
In the original version of BLAST, the highest-scoring HSPs are presented as the final report
13. A more recent improvement in the implementation of BLAST is the ability to provide gapped alignments.
In gapped BLAST, the highest-scoring segment is chosen to be extended in both directions using dynamic programming, where gaps may be introduced.
The extension continues while the alignment score stays above a certain threshold; otherwise it is terminated
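To show how dynamic programming lets gaps enter an alignment, here is a minimal sketch of a plain Smith-Waterman local alignment score with a linear gap penalty. It is a swapped-in textbook technique, not the restricted extension that gapped BLAST performs around an HSP, and the scoring values (match +2, mismatch -1, gap -2) are assumptions for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either align the two residues, open a gap in one sequence,
            # or restart the local alignment at score 0.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The extra G in the second sequence is bridged by a single gap.
print(local_alignment_score("AAACCC", "AAAGCCC"))  # 10
```

Without the gap, the best ungapped segment of these two sequences scores 9 under the same values, so allowing one gap gives the higher-scoring alignment.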
14. BLAST Output
1. an introduction that tells where the search occurred and what database and query were compared
2. a list of the sequences in the database containing segment pairs whose scores were least likely to occur by chance
3. alignments of the high-scoring segment pairs showing identical and similar residues
4. a complete list of the parameter settings used for the search.
17. BLAST Variants
Program    Query sequence             Database sequence
BLASTP     protein                    protein
BLASTN     nucleic acid               nucleic acid
BLASTX     translated nucleic acid    protein
TBLASTN    protein                    translated nucleic acid
TBLASTX    translated nucleic acid    translated nucleic acid
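The variants table can be expressed as a small lookup keyed on the kind of sequence the user actually supplies (translation happens inside the program). This is an illustrative sketch; the dictionary and the helper `programs_for` are hypothetical names, not part of any BLAST tool.

```python
# program -> (query molecule supplied, database molecule searched)
BLAST_PROGRAMS = {
    "blastp":  ("protein", "protein"),
    "blastn":  ("nucleic acid", "nucleic acid"),
    "blastx":  ("nucleic acid", "protein"),        # query is translated
    "tblastn": ("protein", "nucleic acid"),        # database is translated
    "tblastx": ("nucleic acid", "nucleic acid"),   # both are translated
}


def programs_for(query_type, database_type):
    """List the BLAST variants that accept this query/database combination."""
    return [name for name, kinds in BLAST_PROGRAMS.items()
            if kinds == (query_type, database_type)]


print(programs_for("nucleic acid", "protein"))       # ['blastx']
print(programs_for("nucleic acid", "nucleic acid"))  # ['blastn', 'tblastx']
```

Two programs accept a nucleotide query against a nucleotide database: blastn compares the sequences directly, while tblastx compares their translations.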
18. Databases available on BLAST Web server
Database - Description
A. Peptide sequence databases
1. nr - translations of GenBank DNA sequences with redundancies removed, PDB, SwissProt, PIR, and PRF
2. month - new or revised entries or updates to nr in the previous 30 days
3. Swissprot - latest release of the SwissProt protein sequence database
4. Drosophila genome - provided by Celera and Berkeley Drosophila genome project
5. yeast - yeast (Saccharomyces cerevisiae) genomic sequences
6. E. coli - E. coli genomic sequences
7. pdb - sequences of proteins of known three-dimensional structure from the Brookhaven Protein Data Bank
8. yeast - yeast (S. cerevisiae) protein sequences
9. E. coli - E. coli genomic coding sequence translations
10. kabat [kabatpro] - Kabat's database of sequences of immunological interest
11. alu - translations of select Alu repeats from REPBASE, a database of sequence repeats
19. B. Nucleotide sequence databases
1. nr - GenBank, EMBL, DDBJ, and PDB sequences with redundancies removed (EST, STS, GSS, and HTGS sequences excluded)
2. month - new or revised entries or updates to nr in the previous 30 days
3. dbest - EST sequences from GenBank, EMBL, and DDBJ with redundancies removed
4. dbsts - STS sequences from GenBank, EMBL, and DDBJ with redundancies removed
5. htgs - high-throughput genomic sequences
6. kabat [kabatnuc] - Kabat's database of sequences of immunological interest
7. vector - vector subset of GenBank
8. mito - database of mitochondrial sequences
9. alu - select Alu repeats from REPBASE, a database of sequence repeats; suitable for masking Alu repeats from query sequences
10. epd - eukaryotic promoter database
11. gss - genome survey sequences, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences
20. Difference between BLAST and FASTA
BLAST                                      FASTA
Uses a substitution matrix to find         Uses the hashing procedure
matching words
Word size: protein = 3; DNA = 11           K-tuple: protein = 2; DNA = 4-6
Faster than FASTA                          Slower than BLAST
Higher specificity than FASTA due to       Lower specificity
low-complexity masking
21. E-value (expectation value)
Important statistical indicator in sequence alignment
It indicates how many alignments with a similar or better score are expected to arise from random chance in a database search
The E-value thus provides information about the likelihood that a given sequence match is purely by chance.
The lower the E-value, the less likely the database match is a result of random chance and therefore the more significant the match is
22. Formula
E-value is determined by the equation
E = m × n × P
Where
m is the total number of residues in a database
n is the number of residues in the query sequence
and
P is the probability that an HSP alignment is a result
of random chance.
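The formula above can be worked through numerically. The sketch below simply evaluates E = m × n × P; the database size, query length and probability are made-up values chosen for the arithmetic, and `e_value` is a hypothetical helper name.

```python
def e_value(db_residues, query_length, p_chance):
    """E = m * n * P, with m and n in residues and P a probability."""
    return db_residues * query_length * p_chance


# Hypothetical search: 1,000,000 database residues, 100-residue query.
small_db = e_value(1_000_000, 100, 1e-10)
# Same query and P against a database ten times larger.
large_db = e_value(10_000_000, 100, 1e-10)

print(small_db, large_db)
```

With these numbers E is about 0.01 for the smaller database and about 0.1 for the larger one: the same alignment becomes less significant as the search space m × n grows.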
23. Bit Score
A bit score is another prominent statistical indicator
used in addition to the E value in a BLAST output.
The bit score measures sequence similarity
independent of query sequence length and
database size and is normalized based on the raw
pairwise alignment score.
24. Formula
The bit score (S) is determined by the following formula:
S = (λ × s − ln K) / ln 2
Where
λ is the Gumbel distribution constant,
s is the raw alignment score, and
K is a constant associated with the scoring matrix used.
Thus, the bit score (S) is linearly related to the raw alignment score (s).
Hence, the higher the bit score, the more significant the match is.
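The bit-score formula can be evaluated directly. In the sketch below, `bit_score` is a hypothetical helper, and the values λ = 0.318 and K = 0.134 are illustrative assumptions; in practice both constants depend on the scoring matrix and gap penalties and are listed in the parameter section of the BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """S = (lambda * s - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative constants (assumed values, not taken from a specific search).
LAMBDA, K = 0.318, 0.134

for raw in (50, 100, 150):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```

Because the relationship is linear, each additional raw-score point adds the same λ / ln 2 bits, whatever the value of K.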