This document discusses using information compression to align biological sequences. It notes that traditional sequence alignment methods cannot handle low complexity regions or account for sequence biases. Information and probability are described as two sides of the same coin, with more similar sequences sharing more information. The document proposes aligning sequences by compressing them using a probabilistic model, asserting that homologous sequences will share information and compress better together. This alignment by compression approach is said to outperform traditional methods in aligning distantly related or biased sequences.
5. Traditional alignments can’t handle low complexity regions
NNNNNNNNNNNNNNNNNNNNNNNNNNNN AAGCAGAATTTAACATGTGGTTTGCTCA
NNNNNNNNNNNNNNNNNNNNNNNNNNNN TTTGTTCTTTATCGCATCTTTTGAAAAC
NNNNNNNNNNNNNNNNNNNNNNNNNNNN GCTATCGAAATAGCAGTACCTTCAGACT
NNNNNNNNNNNNNNNNNNNNNNNNNNNN TTTTCCGAATACAGTTTAGCCAAAAATA
NNNNNNNNNNNNNNNNNNNNNNNNNNNN TCAAGAAAAGCTTGAGCGCAAGTTCCTC
NNNNNNNNNNNNNNNNNNNNNNNNNNNN GAACTTTCTGGACACCCCATTAAACTTT
NNNNNNNNNNNNNNNNNNNNNNNNNNNN! TGTTTGCCGTTAAAAAAGGTACTTATCT!
50% of the human genome is masked
9. Information and probability are two sides of the same coin
$$I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$$
[Figure: information (bits) plotted against the probability that an event occurs. Marked points: 4.3 bits at p = .05, 2 bits at p = .25, 1 bit at p = .5.]
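As a quick check of the formula on this slide, the following minimal Python sketch evaluates the information content at the probabilities marked on the plot. The function name `information_bits` is only an illustrative choice, not something from the slides.

```python
import math


def information_bits(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    # log2(1 / p) is the same quantity as -log2(p).
    return math.log2(1.0 / p)


# Probabilities taken from the axis of the plot.
for p in (0.05, 0.25, 0.5, 1.0):
    print(f"p = {p:<4}  I = {information_bits(p):.1f} bits")
```

Rarer events carry more information: halving the probability adds one bit, and a certain event (p = 1) carries none.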
10. Information and probability are two sides of the same coin
$$I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$$
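[Figure: information (bits) plotted against the probability that an event occurs, with marks at 2 bits (the maximum in DNA) for p = .25, 1 bit for p = .5 and 0 bits for p = 1. Example sequences shown: AAAAAAAAAAAAAAAA…, AAAAAAATTTTTTTTT… and ATGCACTACTAACGGA…]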
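To make the 0, 1 and 2 bit levels concrete, here is a small Python sketch that estimates the average information per base from a sequence's own base frequencies. The sequences on the slide are truncated, so the sketch uses short toy strings instead; those strings and the function name `bits_per_symbol` are assumptions made for illustration.

```python
import math
from collections import Counter


def bits_per_symbol(seq: str) -> float:
    """Average information per symbol (empirical entropy), in bits."""
    counts = Counter(seq)
    n = len(seq)
    # Each symbol with count c has probability c / n and costs log2(n / c) bits.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Toy sequences (hypothetical, not the truncated ones from the slide).
for seq in ("AAAA", "AATT", "ACGT"):
    print(seq, bits_per_symbol(seq))
```

A single repeated base is fully predictable, two equally frequent bases need one bit each, and four equally frequent bases reach the 2-bit maximum for DNA.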
11. Compression encodes symbols using a probability distribution
[Figure: the sequence AAAAAACGGG encoded over the alphabet A, C, G, T. Bit strings shown: 00000000000001101010 and 11011010.]
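The idea can be sketched in Python under two assumptions that the slide does not state. First, a fixed 2-bit code (A=00, C=01, G=10, T=11) is assumed; it happens to reproduce the 20-bit string shown for AAAAAACGGG. Second, the probability distribution is taken from the sequence's own base frequencies. The sketch only computes the ideal code length under that distribution; how the slide's 8-bit string was produced is not stated, so no attempt is made to reproduce it.

```python
import math
from collections import Counter

# Assumed fixed-length code: 2 bits per base, ignoring base frequencies.
FIXED_CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def encode_fixed(seq: str) -> str:
    """Encode every base with the same 2-bit codeword."""
    return "".join(FIXED_CODE[base] for base in seq)


def ideal_code_length(seq: str, probs: dict) -> float:
    """Total bits an ideal coder would need when each base costs log2(1/p)."""
    return sum(math.log2(1.0 / probs[base]) for base in seq)


seq = "AAAAAACGGG"
fixed = encode_fixed(seq)

# Assumption: use the empirical base frequencies as the model.
counts = Counter(seq)
empirical = {base: c / len(seq) for base, c in counts.items()}

print(fixed, len(fixed), "bits with the fixed code")
print(ideal_code_length(seq, empirical), "bits under the empirical model")
```

Because A is frequent in this sequence, a model that assigns it a high probability spends less than 2 bits on each A, so the total falls below the 20 bits of the fixed code.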
13. Homologous sequences share information
[Figure: a DNA sequence written one base per line, with the labels "Markov Expert" and "I(Query)".]
14. Homologous sequences share information
[Figure: DNA sequences written one base per line, with the labels "Markov Expert", "Align Expert", "I(Query)", "I(Query | Reference)" and "Mutual Information".]
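The quantities named on this slide can be illustrated with a toy Python sketch. The slides do not describe how the Markov Expert or the Align Expert work, so this is not the XMAligner algorithm. It swaps in a single adaptive order-2 Markov model with add-one smoothing, and it approximates conditioning on the reference by letting the model read the reference before it scores the query. The mutual information is then estimated as I(Query) minus I(Query | Reference). The class name, function names and sequences are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # bases outside this alphabet raise KeyError in this sketch


class MarkovExpert:
    """Toy adaptive order-k Markov model with add-one smoothing."""

    def __init__(self, order: int = 2):
        self.order = order
        # One count table per context; every base starts with a count of 1.
        self.counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 1))

    def cost(self, seq: str) -> float:
        """Bits needed to encode seq; the model learns as it goes."""
        bits = 0.0
        for i, base in enumerate(seq):
            context = seq[max(0, i - self.order):i]
            table = self.counts[context]
            bits += math.log2(sum(table.values()) / table[base])
            table[base] += 1  # update after paying for the base
        return bits


def information(query: str) -> float:
    """I(Query): cost of the query with no prior knowledge."""
    return MarkovExpert().cost(query)


def conditional_information(query: str, reference: str) -> float:
    """I(Query | Reference): cost of the query after reading the reference."""
    model = MarkovExpert()
    model.cost(reference)  # prime the model; the returned cost is ignored
    return model.cost(query)


def mutual_information(query: str, reference: str) -> float:
    """Estimated shared information: I(Query) - I(Query | Reference)."""
    return information(query) - conditional_information(query, reference)


query = "ACGGTAAAATCCAAGTTGTT"  # toy sequence
print("identical reference:", mutual_information(query, query))
print("unrelated reference:", mutual_information(query, "T" * 20))
```

In this toy setting a reference identical to the query lowers the query's cost, so the estimate is positive, while an unrelated reference gives no such saving. A real implementation would presumably combine several experts rather than rely on one primed model.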
16. XMAligner wins on distantly related biased sequences
[Figure: plot with axes labelled Specificity and Sensitivity.]