Sequence Alignment by Information Compression
Nacho Caballero
Traditional Alignments
Probability and Information
Alignment by Compression
Traditional Alignments
Traditional alignments can’t handle
low-complexity regions


NNNNNNNNNNNNNNNNNNNNNNNNNNNN    AAGCAGAATTTAACATGTGGTTTGCTCA
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    TTTGTTCTTTATCGCATCTTTTGAAAAC
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    GCTATCGAAATAGCAGTACCTTCAGACT
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    TTTTCCGAATACAGTTTAGCCAAAAATA
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    TCAAGAAAAGCTTGAGCGCAAGTTCCTC
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    GAACTTTCTGGACACCCCATTAAACTTT
NNNNNNNNNNNNNNNNNNNNNNNNNNNN    TGTTTGCCGTTAAAAAAGGTACTTATCT

50% of the human genome is masked
Traditional scoring schemes don’t reflect
sequence bias

[Figure: GC content and GC skew]

Match    +8
Mismatch -4
Gap      -3
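
To make the contrast concrete, here is a minimal Python sketch of the quantities this slide names. It assumes the standard definitions of GC content, (G + C) / length, and GC skew, (G - C) / (G + C), and it assumes the gap penalty is charged once per gapped column, which the slide does not specify. The function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """(G - C) / (G + C); defined as 0.0 when the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def fixed_score(row_a, row_b, match=8, mismatch=-4, gap=-3):
    """Score two already-aligned rows of equal length ('-' marks a gap).

    Assumption: every gapped column costs `gap` (a linear gap penalty).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


print(gc_content("AAGC"), gc_skew("GGGC"))
print(fixed_score("ACGT-A", "ACCTGA"))
```

Note that `fixed_score` never looks at `gc_content` or `gc_skew`: a match earns the same +8 in an AT-rich genome as in a balanced one, which is the bias problem the slide points at.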
Traditional alignments lack an objective
function to measure quality
Probability and
Information
Information and probability are two sides
of the same coin
$$I(\text{event}) = \log_2 \frac{1}{p(\text{event})} = -\log_2 p(\text{event})$$
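[Plot: information (bits) against the probability that the event occurs, with axis ticks at .05, .25, .5 and 1. Marked values: 4.3 bits at p = .05, 2 bits at p = .25, 1 bit at p = .5]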
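
The values on the plot follow directly from the definition of information as the log of an inverse probability. A small Python sketch (the function name `information_bits` is a hypothetical helper, not something from the slides):

```python
import math


def information_bits(p):
    """Information, in bits, of an event that occurs with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)


for p in (0.05, 0.25, 0.5, 1.0):
    print(p, round(information_bits(p), 1))
```

Rare events carry more information: p = .05 gives about 4.3 bits, while an event that is certain to occur (p = 1) carries none.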
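[Plot: information per base against the probability that the event occurs, for three example sequences. ATGCACTACTAACGGA…: an A marked at p = .25, 2 bits (the maximum in DNA). AAAAAAATTTTTTTTT…: an A marked at p = .5, 1 bit. AAAAAAAAAAAAAAAA…: an A marked at p = 1, 0 bits]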
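
The same idea can be checked on whole sequences by measuring the average information per base under each sequence's own base frequencies (its Shannon entropy). The sketch below uses three hypothetical toy sequences chosen to have exact frequencies of 1, 1/2 and 1/4; they are stand-ins, not the truncated sequences from the slide.

```python
import math
from collections import Counter


def bits_per_base(seq):
    """Average information per base under the sequence's own base frequencies."""
    counts = Counter(seq)
    n = len(seq)
    # Each base with count c has probability c / n and carries log2(n / c) bits.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Hypothetical toy sequences with exact base frequencies.
for seq in ("A" * 16, "AT" * 8, "ACGT" * 4):
    print(seq, bits_per_base(seq))
```

A single repeated base costs 0 bits per base, two equally frequent bases cost 1 bit, and four equally frequent bases reach the 2-bit maximum for DNA.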
Compression encodes symbols using a
probability distribution

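Sequence: AAAAAACGGG

[Figure: two probability distributions over A, C, G and T, one uniform and one skewed toward A]

Encoding with the uniform distribution: 00000000000001101010 (20 bits)
Encoding with the skewed distribution: 11011010 (8 bits)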
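
A rough Python sketch of why the choice of distribution matters for the example sequence AAAAAACGGG. The fixed 2-bit code (A=00, C=01, G=10, T=11) is an assumption that appears consistent with the 20-bit string on the slide. The skewed distribution on the slide is not given numerically, so the second function uses the sequence's own base frequencies instead and does not try to reproduce the 8-bit string.

```python
import math
from collections import Counter

# Assumed fixed-length code: every base costs 2 bits (a uniform distribution).
FIXED_CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def fixed_encoding(seq):
    """Encode each base with the same 2-bit codeword."""
    return "".join(FIXED_CODE[base] for base in seq)


def ideal_cost_bits(seq):
    """Total bits an ideal coder would need under the sequence's own frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return sum(c * math.log2(n / c) for c in counts.values())


seq = "AAAAAACGGG"
print(fixed_encoding(seq), len(fixed_encoding(seq)))
print(round(ideal_cost_bits(seq), 2))
```

Under the uniform code the ten bases cost 20 bits; under a distribution that gives A a high probability, the ideal cost drops to roughly 13 bits, because the frequent symbol becomes cheap to encode.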
Alignment by
Compression
Homologous sequences share information
[Figure: two DNA sequences shown side by side as vertical columns of bases. Labels: Markov Expert with I(Query); Align Expert with I(Query | Reference); Mutual Information]
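
The labels in the figure can be illustrated with a toy Python sketch. This is not XMAligner's implementation: it assumes an adaptive order-2 Markov expert with add-one counts, an align expert that predicts the base at the same position of an ungapped, equal-length reference with a fixed probability `p_match`, and mutual information taken as the standard difference I(Query) - I(Query | Reference). All names and parameter values are hypothetical.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def markov_bits(seq, order=2):
    """I(Query): bits to encode seq with an adaptive order-k Markov expert."""
    # Every context starts with a count of 1 per base (add-one smoothing).
    counts = defaultdict(lambda: dict.fromkeys(BASES, 1))
    total = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        seen = counts[context]
        total += math.log2(sum(seen.values()) / seen[base])
        seen[base] += 1  # learn from the base just encoded
    return total


def align_bits(query, reference, p_match=0.9):
    """I(Query | Reference): bits to encode query given an ungapped reference."""
    p_other = (1.0 - p_match) / 3  # remaining mass shared by the other 3 bases
    total = 0.0
    for q, r in zip(query, reference):
        total += math.log2(1.0 / (p_match if q == r else p_other))
    return total


def mutual_information_bits(query, reference):
    """Bits saved by encoding the query with help from the reference."""
    return markov_bits(query) - align_bits(query, reference)


query = "TAGTAACAGTTTCCGAATCAAG"      # hypothetical example sequences
reference = "TAGTAACAGTATCCGAATCAAG"  # one substitution relative to query
print(round(mutual_information_bits(query, reference), 1))
```

When the reference helps predict the query, the conditional cost falls well below the stand-alone cost and the difference is positive; for unrelated sequences the align expert mispredicts often and the difference shrinks or turns negative.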
XMAligner wins on distantly related biased
sequences

[Plot: specificity against sensitivity]
XMAligner is the most sensitive at detecting
exons
XMAligner detecting a gene cluster
[Figure: Plasmodium gene cluster]
Alignment by compression overcomes the
limitations of traditional alignment,
producing better results in distantly related
or biased sequences
