SlideShare uma empresa Scribd logo
1 de 1
Baixar para ler offline
Maximum Likelihood Scaffold Assembly
                                              Anton Akhi, Alexey Sergushichev, Fedor Tsarev
                                St. Petersburg National Research University of Information Technologies,
                                                         Mechanics and Optics
                                               Genome Assembly Algorithms Laboratory

                           Scaffolding Problem                                                                           Preliminaries
· The process of DNA assembly is usually divided into two parts:                               · Normal insert size distribution N(m, s) with probability density
  assembling contigs and assembling scaffolds from contigs                                       function fm,s(x)
· Scaffolding is an important step of genome finishing                                         · Uniform position distribution
· Contig is a relatively long continuous fragment of DNA sequence                              · L – genome length
· Scaffold is an ordered set of oriented contigs with distance                                 · R – number of mate-pairs in library
  estimations between them                                                                     · Reads are aligned to contigs with Bowtie
· Scaffolds are usually assembled from contigs using mate-paired                               · Proposed algorithm (GAMLET) consists of two parts: distance
  libraries with large fragment size (usually greater than 1 Kbp)                                estimation and contig ordering and orientation



                           Distance Estimation                                                                   Ordering & Orientation
· Maximum likelihood principle                                                                 ·   Maximum parsimony principle
                                                 f                                             ·   Building graph with contigs as vertices and connections as edges
                       r              d1i                      d2i        r
                                                                                               ·   Removing high-degree vertices and short contigs
  A A C A T C C A C A G G A G T G                            A G G A C A T C A T G A G T
                 l1                                  d                    l2                   ·   Building draft scaffolds along shortest edges

· Taking into account connecting mate-pairs:

          P d1i , d 2 i | d   f m ,s d1i  d 2 i  d  2 r 
                                                                 1
                                                                 L
· The resulting likelihood is a product of the connecting reads
  probabilities and probability that all other reads do not connect
  contigs
                                            Mate-pairs                                         ·   Building graph with draft scaffolds as vertices
                                                                                               ·   Removing highly-covered vertices
                                                                                               ·   Searching for shortest paths between draft scaffolds
                                                                                               ·   Merging scaffolds along the shortest paths
               Connecting                                      Non-connecting


                                                                            wd s  
                                                                                        R n
                                                         
                                               
         n

         Pd     1i   , d 2i | d                       1   f m ,s s 
                                                                                   
                                                                                    
        i 1                                                 s               L 
· wd(s) – number of possible ways for a pair of reads with insert
  size s to connect a pair of contigs                                                          · Contig orientation is found using mate-pair alignment orientation.
· n – number of mate-pairs connecting contigs                                                    Only mate-pairs connecting contigs in the same scaffold are
· Finding the most likely distance using ternary search                                          considered.


                                  Experiments                                                                            Conclusions
· E.Coli genome                                                                                · Proposed distance estimation method is more precise on average
· Set of 502 contigs: N50 = 18047, minimum length 235, maximum                                   than SOPRA and GAPEST
  length 73908, average length 9126                                                            · Proposed algorithm assembles longer scaffolds with fewer number
· Three sets of 600000 mate-pairs generated by MetaSim: length                                   of misassembles
  36, mean insert size 3000, standard deviation 300                                            · Proposed algorithm is an order of magnitude faster than OPERA:
· Distance estimation algorithm was compared with SOPRA and                                      1.5 min. vs. 15 min. on E. Coli dataset
  GAPEST. For each algorithm average absolute difference between
  estimated and correct distances was computed:
                           Read set    SOPRA     GAPEST      GAMLET
                             Set1      209±15    175±14      149±13                                                Acknowledgements
                             Set2      139±13    217±17      135±13
                             Set3      215±15    172±14      153±13                                Research is supported by the federal program “Research and
                                                                                                   scientific-pedagogical personnel of innovative Russia in 2009-
· Scaffolding algorithm was compared with OPERA
                                                                                                   2013” (contract 16.740.11.0495, agreement 14.B37.21.0562)
                    OPERA                 GAMLET
    Read set
              N50 N50 (split) errors N50 N50 (split) errors
      Set1   366.5k 215.7k      11 311.7k  311.7k      1
      Set2   428.6k 215.7k      11 605.3k  392.0k      7                                           mailto:genome@mail.ifmo.ru
      Set3   465.0k 294.0k      11 578.2k  322.6k      9                                           http://genome.ifmo.ru/en/
                                                                                                                                                                    RECOMB-2013

Mais conteúdo relacionado

Semelhante a Maximum Likelihood Scaffold Assembly

NS-CUK Seminar:H.B.Kim, Review on "Asymmetric transitivity preserving graph ...
NS-CUK Seminar:H.B.Kim,  Review on "Asymmetric transitivity preserving graph ...NS-CUK Seminar:H.B.Kim,  Review on "Asymmetric transitivity preserving graph ...
NS-CUK Seminar:H.B.Kim, Review on "Asymmetric transitivity preserving graph ...ssuser4b1f48
 
5113jgraph02
5113jgraph025113jgraph02
5113jgraph02graphhoc
 
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...Anton Alexandrov
 
Financial Networks VI - Correlation Networks
Financial Networks VI - Correlation NetworksFinancial Networks VI - Correlation Networks
Financial Networks VI - Correlation NetworksKimmo Soramaki
 
Crossing patterns in Nonplanar Road networks
Crossing patterns in Nonplanar Road networksCrossing patterns in Nonplanar Road networks
Crossing patterns in Nonplanar Road networksAjinkya Ghadge
 
Sparse PDF Volumes for Consistent Multi-resolution Volume Rendering
Sparse PDF Volumes for Consistent Multi-resolution Volume RenderingSparse PDF Volumes for Consistent Multi-resolution Volume Rendering
Sparse PDF Volumes for Consistent Multi-resolution Volume RenderingSubhashis Hazarika
 
Adaptive Geographical Search in Networks
Adaptive Geographical Search in NetworksAdaptive Geographical Search in Networks
Adaptive Geographical Search in NetworksAndrea Wiggins
 
Connect-the-Dots in a Graph and Buffon's Needle on a Chessboard: Two Problems...
Connect-the-Dots in a Graph and Buffon's Needle on a Chessboard: Two Problems...Connect-the-Dots in a Graph and Buffon's Needle on a Chessboard: Two Problems...
Connect-the-Dots in a Graph and Buffon's Needle on a Chessboard: Two Problems...Vladimir Kulyukin
 
Wondimu mobility increases_capacity
Wondimu mobility increases_capacityWondimu mobility increases_capacity
Wondimu mobility increases_capacityWondimu K. Zegeye
 
240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx
240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx
240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptxthanhdowork
 
M ESH S IMPLIFICATION V IA A V OLUME C OST M EASURE
M ESH S IMPLIFICATION V IA A V OLUME C OST M EASUREM ESH S IMPLIFICATION V IA A V OLUME C OST M EASURE
M ESH S IMPLIFICATION V IA A V OLUME C OST M EASUREijcga
 
CPOSC Presentation 20091017
CPOSC Presentation 20091017CPOSC Presentation 20091017
CPOSC Presentation 20091017ericbeyeler
 
Presentation on Adaptive.pptx
Presentation on Adaptive.pptxPresentation on Adaptive.pptx
Presentation on Adaptive.pptxKaziNewajFaisal1
 

Semelhante a Maximum Likelihood Scaffold Assembly (17)

Aa04404164169
Aa04404164169Aa04404164169
Aa04404164169
 
Contel.final
Contel.finalContel.final
Contel.final
 
NS-CUK Seminar:H.B.Kim, Review on "Asymmetric transitivity preserving graph ...
NS-CUK Seminar:H.B.Kim,  Review on "Asymmetric transitivity preserving graph ...NS-CUK Seminar:H.B.Kim,  Review on "Asymmetric transitivity preserving graph ...
NS-CUK Seminar:H.B.Kim, Review on "Asymmetric transitivity preserving graph ...
 
5113jgraph02
5113jgraph025113jgraph02
5113jgraph02
 
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
Combining de Bruijn graph, overlap graph and microassembly for de novo genome...
 
Financial Networks VI - Correlation Networks
Financial Networks VI - Correlation NetworksFinancial Networks VI - Correlation Networks
Financial Networks VI - Correlation Networks
 
Crossing patterns in Nonplanar Road networks
Crossing patterns in Nonplanar Road networksCrossing patterns in Nonplanar Road networks
Crossing patterns in Nonplanar Road networks
 
Conv-TasNet.pdf
Conv-TasNet.pdfConv-TasNet.pdf
Conv-TasNet.pdf
 
Sparse PDF Volumes for Consistent Multi-resolution Volume Rendering
Sparse PDF Volumes for Consistent Multi-resolution Volume RenderingSparse PDF Volumes for Consistent Multi-resolution Volume Rendering
Sparse PDF Volumes for Consistent Multi-resolution Volume Rendering
 
Adaptive Geographical Search in Networks
Adaptive Geographical Search in NetworksAdaptive Geographical Search in Networks
Adaptive Geographical Search in Networks
 
Connect-the-Dots in a Graph and Buffon's Needle on a Chessboard: Two Problems...
Connect-the-Dots in a Graph and Buffon's Needle on a Chessboard: Two Problems...Connect-the-Dots in a Graph and Buffon's Needle on a Chessboard: Two Problems...
Connect-the-Dots in a Graph and Buffon's Needle on a Chessboard: Two Problems...
 
Wondimu mobility increases_capacity
Wondimu mobility increases_capacityWondimu mobility increases_capacity
Wondimu mobility increases_capacity
 
240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx
240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx
240325_JW_labseminar[node2vec: Scalable Feature Learning for Networks].pptx
 
Study on atome probe
Study on atome probe Study on atome probe
Study on atome probe
 
M ESH S IMPLIFICATION V IA A V OLUME C OST M EASURE
M ESH S IMPLIFICATION V IA A V OLUME C OST M EASUREM ESH S IMPLIFICATION V IA A V OLUME C OST M EASURE
M ESH S IMPLIFICATION V IA A V OLUME C OST M EASURE
 
CPOSC Presentation 20091017
CPOSC Presentation 20091017CPOSC Presentation 20091017
CPOSC Presentation 20091017
 
Presentation on Adaptive.pptx
Presentation on Adaptive.pptxPresentation on Adaptive.pptx
Presentation on Adaptive.pptx
 

Maximum Likelihood Scaffold Assembly

  • 1. Maximum Likelihood Scaffold Assembly Anton Akhi, Alexey Sergushichev, Fedor Tsarev St. Petersburg National Research University of Information Technologies, Mechanics and Optics Genome Assembly Algorithms Laboratory Scaffolding Problem Preliminaries · The process of DNA assembly is usually divided into two parts: · Normal insert size distribution N(m, s) with probability density assembling contigs and assembling scaffolds from contigs function fm,s(x) · Scaffolding is an important step of genome finishing · Uniform position distribution · Contig is a relatively long continuous fragment of DNA sequence · L – genome length · Scaffold is an ordered set of oriented contigs with distance · R – number of mate-pairs in library estimations between them · Reads are aligned to contigs with Bowtie · Scaffolds are usually assembled from contigs using mate-paired · Proposed algorithm (GAMLET) consists of two parts: distance libraries with large fragment size (usually greater than 1 Kbp) estimation and contig ordering and orientation Distance Estimation Ordering & Orientation · Maximum likelihood principle · Maximum parsimony principle f · Building graph with contigs as vertices and connections as edges r d1i d2i r · Removing high-degree vertices and short contigs A A C A T C C A C A G G A G T G A G G A C A T C A T G A G T l1 d l2 · Building draft scaffolds along shortest edges · Taking into account connecting mate-pairs: P d1i , d 2 i | d   f m ,s d1i  d 2 i  d  2 r  1 L · The resulting likelihood is a product of the connecting reads probabilities and probability that all other reads do not connect contigs Mate-pairs · Building graph with draft scaffolds as vertices · Removing highly-covered vertices · Searching for shortest paths between draft scaffolds · Merging scaffolds along the shortest paths Connecting Non-connecting wd s   R n   n  Pd 1i , d 2i | d  1   f m ,s s     i 1  s L  · wd(s) – number of possible ways for a pair of reads with insert size s to connect a pair of contigs · Contig orientation is found using mate-pair alignment orientation. · n – number of mate-pairs connecting contigs Only mate-pairs connecting contigs in the same scaffold are · Finding the most likely distance using ternary search considered. Experiments Conclusions · E.Coli genome · Proposed distance estimation method is more precise on average · Set of 502 contigs: N50 = 18047, minimum length 235, maximum than SOPRA and GAPEST length 73908, average length 9126 · Proposed algorithm assembles longer scaffolds with fewer number · Three sets of 600000 mate-pairs generated by MetaSim: length of misassembles 36, mean insert size 3000, standard deviation 300 · Proposed algorithm is an order of magnitude faster than OPERA: · Distance estimation algorithm was compared with SOPRA and 1.5 min. vs. 15 min. on E. Coli dataset GAPEST. For each algorithm average absolute difference between estimated and correct distances was computed: Read set SOPRA GAPEST GAMLET Set1 209±15 175±14 149±13 Acknowledgements Set2 139±13 217±17 135±13 Set3 215±15 172±14 153±13 Research is supported by the federal program “Research and scientific-pedagogical personnel of innovative Russia in 2009- · Scaffolding algorithm was compared with OPERA 2013” (contract 16.740.11.0495, agreement 14.B37.21.0562) OPERA GAMLET Read set N50 N50 (split) errors N50 N50 (split) errors Set1 366.5k 215.7k 11 311.7k 311.7k 1 Set2 428.6k 215.7k 11 605.3k 392.0k 7 mailto:genome@mail.ifmo.ru Set3 465.0k 294.0k 11 578.2k 322.6k 9 http://genome.ifmo.ru/en/ RECOMB-2013