Next Generation DNA Sequencing:
  Does the Read Length Matter?


             Pavel A. Pevzner
Department of Computer Science and Engineering,
      University of California at San Diego
Fragment Assembly

                                                                                                                     reads
atgcaatgcatgcatggatgcaatgcatgcaatgcatgcatggatgcaatgcatgcaatgcatgcatggatgcaatgcatgcaatgcatgcatggatgcaatgcatgcatgggg




              Cover region with (overlapping) reads
     Overlap reads and extend to reconstruct the
                original genomic region
Some puzzles are more difficult than others...

The puzzle has only
16 pieces and looks
      simple

  BUT there are
   repeats!!!

The repeats make it
  very difficult.
Does the Read Length Matter?




       Mark Chaisson                Dima Brinza
(now at Pacific Biosciences)   (now at Life Technologies)
EULER Short Reads assembler
(Chaisson et al, Bioinformatics 2004, Genome Res., 2008, 2009)
...history repeats itself:
   sequencing insulin




                Fred Sanger
              1958 (!) Nobel prize for
              sequencing insulin by Edman
              degradation


              Average read
              length = 5 aa!
Shotgun Protein Sequencing:
Mass Spectrometry vs. Edman degradation
Novel proteins are still determined by
laborious Edman degradation.

 – Integrilin, a blood clot prevention drug
   derived from rattlesnake venom.
 – Ziconotide, derived from cone snail venom: 20x more
   potent than morphine, with no addiction side effects.

 Many important proteins are not inscribed in
   genomes:

 – Fusion proteins in tumors
 – Antibodies (collaboration with Genentech)
 – Non-ribosomal peptides and other natural
   products, which represent 9 of the top 20
   bestselling drugs (collaborations with Pieter
   Dorrestein at UCSD School of Pharmacy)

 Challenge: Replace slow Edman degradation
   with a fast protein sequencing technique
   (Bandeira et al., MCP 2007; Bandeira et al., PNAS 2007)
Ribosomal Peptides May Be Equally Elusive
Short Read Sequencing and SBH
 Short read sequencing was first proposed in 1988 under
     the name Sequencing by Hybridization (SBH)

• 1988: SBH suggested as an alternative to Sanger sequencing.
  Nobody believed it would ever work.
• 1989: First microarray prototype
• 1991: Light-directed polymer synthesis developed
• 1994: Affymetrix develops first 64-kb DNA microarray;
  first commercial DNA microarray prototype with 16,000 features
• 2002: 500,000 features per chip
Fragment Assembly with (very) Short Reads (k-mers)
P.P. (1989) k-mer DNA sequencing.

Result: An optimal and fast Eulerian fragment assembly
algorithm for SBH.


Idury and Waterman (1995) Mimicking Sanger
sequencing as SBH reconstruction (first Eulerian
algorithm for fragment assembly)

De novo assembly with short reads is not unlike assembly
            with a virtual universal DNA array
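
To make the k-mer view concrete, here is a minimal sketch of how a k-mer spectrum can be turned into a graph in which every k-mer is an edge. The function names (`spectrum`, `de_bruijn_graph`, `path_endpoints`) are illustrative and not taken from the talk; the sketch assumes error-free k-mers. The degree check at the end reports where an Eulerian path would have to start and end.

```python
from collections import defaultdict


def spectrum(sequence, k):
    """All k-mers of the sequence, in order, with multiplicity."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(kmers):
    """Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def path_endpoints(graph):
    """Return (start, end) of an Eulerian path from degree balance, else None.

    A start node has one more outgoing than incoming edge, an end node has
    one more incoming than outgoing edge, and all other nodes are balanced.
    Connectivity is assumed here and not checked.
    """
    balance = defaultdict(int)
    for node, successors in graph.items():
        balance[node] += len(successors)
        for nxt in successors:
            balance[nxt] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    unbalanced = [n for n, b in balance.items() if b not in (-1, 0, 1)]
    if len(starts) == 1 and len(ends) == 1 and not unbalanced:
        return starts[0], ends[0]
    return None


graph = de_bruijn_graph(spectrum("atgcaatg", 3))
print(graph)
print(path_endpoints(graph))
```

The repeated 3-mer "atg" shows up as a doubled edge from "at" to "tg", which is how edge multiplicity enters the picture.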
Hamiltonian Cycle Problem

• Find a walk (cycle) in a
  network (graph) that
  visits every NODE
  exactly once

• Intractable problem
  (NP – complete)
The Bridges of Königsberg Problem
 Find a path crossing every bridge just once
 Leonhard Euler, 1735




               Bridges of Königsberg
Eulerian Cycle Problem

• Find a walk (cycle) that
  visits every EDGE
  exactly once

• Linear time
  algorithm!




                      More complicated version of Königsberg
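
The linear-time claim can be illustrated with Hierholzer's algorithm, a standard method for the Eulerian cycle problem. The sketch below is an assumption-laden illustration and not code from the talk: it takes a directed multigraph as a dictionary of successor lists and assumes the graph is connected and balanced (every node has as many incoming as outgoing edges). The bridges puzzle itself is undirected, so this shows the directed variant used for sequence graphs.

```python
def eulerian_cycle(adjacency, start):
    """Hierholzer's algorithm on a directed multigraph.

    adjacency maps node -> list of successor nodes. Each edge is pushed and
    popped once, so the running time is linear in the number of edges.
    """
    # Copy the successor lists so the caller's graph is not modified.
    remaining = {node: list(successors) for node, successors in adjacency.items()}
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            # Follow an unused edge out of the current node.
            stack.append(remaining[node].pop())
        else:
            # Dead end: this node is finished, emit it.
            cycle.append(stack.pop())
    return cycle[::-1]


example = {"a": ["b"], "b": ["c", "d"], "c": ["a"], "d": ["b"]}
print(eulerian_cycle(example, "a"))
```

The returned list of nodes starts and ends at `start`, and consecutive pairs in it use every edge exactly once.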
OVERLAP GRAPH
        Repeat                Repeat                   Repeat




Finding a path visiting every NODE exactly once: Hamiltonian path problem
REPEAT GRAPH versus OVERLAP GRAPH
    Repeat   Repeat                    Repeat




              Find a path visiting every EDGE exactly once:
              Eulerian path problem (taking into account
              multiplicity of edges – red edge is visited 3 times)
Fragment assembly: two approaches
Finding a path visiting every NODE exactly once in the OVERLAP graph:
                  Hamiltonian path problem (intractable)




  Find a path visiting every EDGE exactly once in the REPEAT graph:
                          Eulerian path problem




                         Easy to Solve!
N. meningitidis: repeat graph
Repeat Graph vs. Unordered Contigs
Generated by Traditional Assemblers
P.P. et al., PNAS 2001, Genome Res., 2004
NEWBLER (454 Life Sciences, 2006)
ALLPATHS (Broad Institute), Genome Res., 2008
VELVET (EBI), Genome Res., 2008
ABySS (UBC), Genome Res., 2008
The Eulerian approach works well for very
  accurate (nearly error free) reads but
    deteriorates for inaccurate reads
Error correction in reads: catch-22
     The Eulerian approach works well for error-free reads but
    quickly deteriorates even for reads with low error rates (1%).
     To assemble a genome we need to correct errors in reads first.
    But to correct errors in reads one has to assemble the genome first!
  Can we correct sequencing errors if the genome is unknown,
  before the assembly started?


 Result: a 50-fold reduction in sequencing errors PRIOR TO ASSEMBLY makes
   reads almost as accurate as the finished sequence (P.P. et al., PNAS 2001).


 A similar Spectrum Alignment approach (in a different context) was proposed by
 Pe'er and Shamir (RECOMB 2001, PNAS 2002). It is now used in nearly all assembly tools.
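
One way to see how errors can be corrected without knowing the genome is a k-mer counting sketch: k-mers seen in many reads are treated as trusted ("solid"), and a read is changed by one base if that makes all of its k-mers solid. This is a simplified illustration under stated assumptions, not the EULER procedure itself; the function names, the single-substitution limit, and the `min_count` threshold are all hypothetical choices.

```python
from collections import Counter


def kmers_of(read, k):
    """All k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def solid_kmers(reads, k, min_count):
    """k-mers that occur at least min_count times across all reads."""
    counts = Counter(kmer for read in reads for kmer in kmers_of(read, k))
    return {kmer for kmer, count in counts.items() if count >= min_count}


def correct_read(read, solid, k, alphabet="acgt"):
    """Return the first single-base substitution that makes every k-mer solid.

    Reads that are already solid, or that cannot be fixed with one
    substitution, are returned unchanged.
    """
    if all(kmer in solid for kmer in kmers_of(read, k)):
        return read
    for i in range(len(read)):
        for base in alphabet:
            if base == read[i]:
                continue
            candidate = read[:i] + base + read[i + 1:]
            if all(kmer in solid for kmer in kmers_of(candidate, k)):
                return candidate
    return read


reads = ["atgcatgg"] * 3 + ["atgcttgg"]  # the last read carries one error
solid = solid_kmers(reads, k=4, min_count=2)
print(correct_read("atgcttgg", solid, k=4))
```

The erroneous read produces four k-mers seen only once, and the single substitution that removes all of them restores the read shared by the other three copies.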
EULER vs VELVET (E. coli)

[Figure: total length of the k longest contigs (y-axis) versus k (x-axis).
 Benchmarking: SSAKE, SHARCGS, VCAKE, EDENA, VELVET]
Mosaic structure of human segmental duplications:
           from de Bruijn to A-Bruijn Graphs


                          A    B C        D    E F G H        I        J



                      A       B C     D       E F C   G H          I       J



                 A    B C D          E F C       G H      B C D            I   J


        A     B C     D       E F C       G H     B C D        I       F   C   G   J


• The mosaic structure of segmental duplications in the human genome is reconstructed using the
  A-Bruijn graph approach:

Jiang et al. Evolutionary reconstruction of human segmental duplications (Nature Genetics, 2007)
Algorithmic Challenge

• Problem: given a string, find all repeat elements
  and reveal the sub-repeat mosaic structure.
   – Perfect repeats: de Bruijn graph, suffix tree.
   – Imperfect repeats: OPEN PROBLEM
   – The A-Bruijn graphs generalize the de Bruijn
     graphs for imperfect repeats (P.P. et al., Genome
     Res, 2004)
De Novo Repeat Classification


     All pairwise similarities


                                                      De novo repeat compilation


Pairwise similarity
                                 ?

                                     Repeat Element 1 AGCCTACG
              Library of
                                                 … …
           repeat elements           Repeat Element 2 TGCATTTT
                                                 … …
                                     Repeat Element 3 GAACTCAC
                                                  ……
Mosaic Structure of Repeats:
           (small region from human Y chromosome)


          8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442
           1    2   3     4     5   6        7   8   9   10   11   12 13 14   15



    RECON (Bao and Eddy, 2002) does not reveal the mosaic structure




                                         ?
                              2 copies               2 copies
A-Bruijn representation
                              3 copies               4 copies
Repeat Gluing
(de Bruijn graph = Quotient space of all K-mers in the sequence)

[Figure: gluing instruction applied to segments labeled x and y]
Similarity matrix

 A   B C   D   E F   C   G H   B C D   I   F C   G   J

[Figure: similarity matrix of the sequence and the resulting repeat graph]

Sub-repeats (edges in the repeat graph):
  B: 2 copies
  C: 4 copies
  D: 2 copies
  F: 2 copies
  G: 2 copies
In reality, repeats are usually imperfect


8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442
 1    2   3   4    5    6    7    8   9   10   11   12 13 14   15



                                          … … AG-CCATCGACGTCACC … …
                                          … … AGTGCCTCG-CGTCTCC … …
Similarity
 matrix




 A   B C   D   E F C   G H   B C D   I   F   C   G   J
Repeat Gluing
(A-Bruijn graph = Quotient space of all ALIGNED POSITIONS)

[Figure: consistent gluing versus inconsistent gluing of segments labeled x and y]
Challenge: Generalize the Notion of De
   Bruijn Graph for Imperfect Repeats

• Input
  – a genomic sequence
  – all local pairwise alignments (pairs of aligned
    positions)


• Output
  – repeat graph representing all repeats as a
    mosaic of sub-repeats
Repeat Graph

    8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442
     1    2   3   4    5    6    7    8   9   10   11   12 13 14   15




    A-Bruijn graph



                        repeat graph
              x

y                          y

              x
Simplifying A-Bruijn Graph


A-Bruijn graph




 repeat graph
From A-Bruijn Graph to Repeat Graph:
             MSLG Problem

Maximum Subgraph with Large Girth (MSLG) Problem:

Input: a weighted graph and a parameter girth
Output: a maximum weight subgraph that does not contain short
cycles, i.e. cycles of length less than girth.

A solution is known only when the girth is infinite: this is the
Maximum Spanning Tree Problem (maximum weight
acyclic subgraph).
Maximum Spanning Tree Approximation
        to MSLG Problem
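
For the infinite-girth case, the maximum weight acyclic subgraph can be found with Kruskal's algorithm run on edges sorted by decreasing weight. The sketch below is a generic textbook version, not the A-Bruijn implementation; the edge format `(weight, u, v)` and integer node labels are assumptions made for illustration.

```python
def maximum_spanning_tree(num_nodes, edges):
    """Kruskal's algorithm on edges sorted by decreasing weight.

    edges is a list of (weight, u, v) with nodes numbered 0..num_nodes-1.
    Returns the kept edges; for a disconnected graph this is a maximum
    weight spanning forest.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for weight, u, v in sorted(edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge does not close a cycle
            parent[root_u] = root_v
            kept.append((weight, u, v))
    return kept


edges = [(5, 0, 1), (3, 1, 2), (4, 0, 2), (1, 2, 3)]
tree = maximum_spanning_tree(4, edges)
print(tree, sum(weight for weight, _, _ in tree))
```

In the example the edge of weight 3 is dropped because it would close the cycle 0-1-2, leaving a tree of total weight 10.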
A-Bruijn Graphs and Fragment Assembly

Genome
   A       B C   D       E F       C   G H   B C D    I       F C       G   J



Reads


       A   B C       D    I    F C     G H   B C D        E   F     C   G       J


                              H
                 A                     J           Every possible genome
                      B C G
                    F                          reconstruction corresponds to an
                           D                   Eulerian path in the repeat graph.
       repeat graph      E

                               I
Fragment Assembly = Building Repeat
     Graph from Concatenated Reads



Theorem (P.P. et al., Genome Res., 2004): The repeat graph built
from concatenated (in an arbitrary order!) reads is identical to the
repeat graph built from the genomic sequence if the reads
“cover” the genomic sequence.
EULER Algorithm (outline)


• Concatenate reads (in an arbitrary order) into a single sequence

• Compute the similarity matrix for this concatenated sequence

• Use this similarity matrix as a “glue” and apply MSLG
  algorithm to build the repeat graph with the A-Bruijn algorithm
  (in NGS applications, only k-mer based glues are practical).
EULER algorithm for NGS applications
       (Chaisson and PP, Genome Res., 2008)

    • de Bruijn step: Construct the de Bruijn graph of reads
    • A-Bruijn step: Remove bulges and whirls
    • Threading step: Thread each read through the resulting
      graph and form the consensus sequence from reads;
    • Mate-pair step: Utilize mate-pairs




Velvet, ALLPATHS, ABySS and other NGS de novo tools now use a similar framework
DNA Sequencing with mate-pairs
     genome

                          cut many times at
                         random into equally
                           sized fragments




                       Get mate-pairs:
                        two reads from
                        each fragment
 ~50 bp       ~50 bp   (separated by a
                        fixed distance)
E. coli assembly with 35 bp Illumina reads
    (N50 statistics with and without mate-pairs)




Assembler                  N50
EULER-USR                  19 KB
VELVET                     16 KB
EULER-USR (Mate-Paired)    68 KB
VELVET (Mate-Paired)       48 KB
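
For readers unfamiliar with the statistic, N50 is the contig length L such that contigs of length at least L account for at least half of the total assembled length. A minimal sketch of the computation follows; the contig lengths in the example are made up for illustration and are not from the E. coli assemblies above.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

Here the total is 290, and the two longest contigs (80 + 70 = 150) already cover half of it, so N50 is 70.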
Eulerian Assembly with Mate-Pairs
EULER transforms MATE-PAIRS:

“read1 - GAP of length d - read2”
into LONG MATE-READS:
“read1 - DNA SEQUENCE of length d – read2”



             P.P. and Tang, ISMB 2001
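
The transformation can be pictured as a path search: look for a path in the graph that leads from the first read to the second and whose length matches the expected distance d. The sketch below is only an illustration of that idea under simplifying assumptions, not the method of P.P. and Tang. The graph format (node mapped to a list of `(next_node, edge_length)` pairs), the `tolerance` parameter, and the rule of accepting a mate-pair only when exactly one path fits are all hypothetical. Edge lengths are assumed to be positive so that the search terminates.

```python
def paths_with_length(graph, source, target, distance, tolerance=0):
    """All paths from source to target whose total edge length is within
    tolerance of distance. graph maps node -> list of (next_node, length)."""
    results = []

    def extend(node, path, length):
        if node == target and abs(length - distance) <= tolerance:
            results.append(list(path))
        for nxt, edge_length in graph.get(node, []):
            # Prune branches that already exceed the allowed length.
            if length + edge_length <= distance + tolerance:
                path.append(nxt)
                extend(nxt, path, length + edge_length)
                path.pop()

    extend(source, [source], 0)
    return results


def mate_read_path(graph, source, target, distance, tolerance=0):
    """Return the path only if it is the unique one matching the distance."""
    paths = paths_with_length(graph, source, target, distance, tolerance)
    return paths[0] if len(paths) == 1 else None


graph = {"s": [("u", 100), ("v", 120)], "u": [("t", 200)], "v": [("t", 260)], "t": []}
print(mate_read_path(graph, "s", "t", distance=300, tolerance=10))
print(mate_read_path(graph, "s", "t", distance=340, tolerance=50))
```

The first call finds a single path of length 300 and returns it. The second call matches both routes (lengths 300 and 380), so the mate-pair is left untransformed, which corresponds to the "multiple paths" complication.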
Transforming Mate-Pairs into Mate-Reads
             Repeat   Repeat   Repeat



Mate-pairs
Repeat Graph (Unlike the Overlap Graph)
       Enables Easy Processing of Mate-Pairs
Repeat graph before and after Transforming Mate-Pairs
 into Mate-Reads (Sanger Reads from N. meningitidis)




            P.P. and Tang, ISMB 2001
Complications in Transforming Mate-Pairs into Mate-
Reads: Multiple Paths Matching the Distance Between
                    Mate-Pairs
   P.P. and Tang, ISMB 2001 described how to deal with such
  complications.
  VELVET (Breadcrumb) and ALLPATHS described similar
  approaches aimed at short-read assemblies (using multiple
  mate-pairs to transform a single mate-pair into a mate-read).
                      A          A'
                            R1

                        B        B'
                            R2
                        C        C'
EULER’s Utilization of Mate-Pairs


R1              R2       R1         R2




                              R2
       R1
EULER-USR with Mate-Pairs:
   Does the Read Length Matter?
• EULER provides an algorithmic solution for the experimental
  problem of increasing read lengths.
• Assuming that the read length is 50 bp and the insert length is 300
  bp, EULER generates mate-reads of length 300+50+50=400 bp.
• If all mate-pairs are transformed into mate-reads then the
  read length almost does not matter! The thing that matters is

             SPAN=InsertLength+2*ReadLength

• But is it possible to transform mate-pairs into mate-reads
  with nearly 100% efficiency?
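
The SPAN arithmetic can be checked in a few lines. The helper names are illustrative. The second helper simply inverts the formula to show which insert length keeps the span fixed as the read length changes, using a span of 300 as an example value.

```python
def span(insert_length, read_length):
    """SPAN = InsertLength + 2 * ReadLength."""
    return insert_length + 2 * read_length


def insert_length_for_span(target_span, read_length):
    """Insert length that yields the given span for a given read length."""
    return target_span - 2 * read_length


print(span(300, 50))  # prints 400
for read_length in (25, 35, 100):
    # prints 25 250, then 35 230, then 100 100
    print(read_length, insert_length_for_span(300, read_length))
```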
Read Length Does NOT Matter!
    (good news for short read technologies)
• EULER-USR was run with simulated (and real) reads
  varying from 25nt to 100nt and fixed-length span
  SPAN=InsertLength+2*ReadLength=300 (E.Coli genome)

• For read length 35, the efficiency is 98.8% and N50= 61K
• For read length 100, the efficiency is 98.9% and N50=61K
BUT Read Length Does Matter!
• EULER-USR was run with simulated (and real) read length varying
  from     25nt    to   100nt     and      fixed-length      span
  InsertLength+2*ReadLength=300 (E.Coli genome)

• For read length 35, the efficiency is 98.8% and N50= 61K
• For read length 100, the efficiency is 98.9% and N50= 61K

• For read length 25, the efficiency is 86.1% and N50= 41.3K

• A small drop in read length results in a dramatic drop in
  efficiency and N50
BUT Read Length Does Matter!
• EULER-USR was run with simulated (and real) read length varying
  from     30nt    to   100nt     and      fixed-length      span
  InsertLength+2*ReadLength=300 (E.Coli genome)

• For read length 35, the efficiency is 98.8% and N50= 61K
• For read length 100, the efficiency is 98.9% and N50= 61K

• For read length 26, the efficiency is 86.1% and N50= 41.3K

• A small drop in read length results in a dramatic drop in
  efficiency and N50

• For a BACTERIAL (E. coli) genome, 30nt is a BREAKPOINT
  separating the assemblies where the read length DOES NOT
  MATTER from the assemblies where the read length MATTERS.
Where is the Breakpoint for Assembling Yeast Genome?
      (bad news for Illumina, good news for 454)

• EULER-USR was run with simulated (and real) read length varying
  from     30nt    to   100nt     and      fixed-length      span
  InsertLength+2*ReadLength=300 (E.Coli genome)

• For read length 35, the efficiency is 98.8% and N50= 61K
• For read length 100, the efficiency is 98.9% and N50= 61K

• For read length 26, the efficiency is 86.1% and N50= 41.3K

• A small drop in read length results in a dramatic drop in
  efficiency and N50

• For the YEAST genome, 45nt is a BREAKPOINT separating the
  assemblies where the read length DOES NOT MATTER from the
  assemblies where the read length MATTERS.
OPEN PROBLEM:
WHERE IS THE BREAKPOINT FOR
  MAMMALIAN GENOMES?
Mass-Spectral Assembly
Shotgun DNA sequencing for whole-genome assembly:
   1. Randomly read small portions of the genome – reads
   2. Find pairwise overlaps between reads
   3. Assemble overlaps into long sequences - contigs
Can we also assemble spectra into whole-protein sequences?
   – Shotgun proteomics generates spectra of unknown peptides
      (short reads?)
   – Find spectral pairs formed by spectra from overlapping
      peptides (pairwise overlaps?)
   – Assemble overlapping spectra into long stretches of amino
      acids (contigs?)
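
The analogy with read assembly can be sketched on plain strings. The example below uses the overlapping peptide sequences that label the spectra in this talk, but it is only an analogy: real spectral assembly works on mass spectra of unknown peptides, not on known sequences. The function names and the `min_len` overlap threshold are assumptions made for illustration.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of left that is a prefix of right."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_in_order(fragments, min_len=3):
    """Extend a contig with each fragment in turn via suffix-prefix overlap."""
    contig = fragments[0]
    for fragment in fragments[1:]:
        if fragment in contig:
            continue  # fragment adds nothing new
        size = overlap(contig, fragment, min_len)
        if size == 0:
            raise ValueError("no overlap with " + fragment)
        contig += fragment[size:]
    return contig


peptides = [
    "KQGGTLDDLEEQAR",
    "KQGGTLDDLEEQARELYR",
    "GGTLDDLEEQARELYR",
    "GGTLDDLEEQARELYRR",
    "LDDLEEQARELYRRLR",
    "DLEEQARELYRRLREK",
    "EEQARELYRRLREK",
]
print(merge_in_order(peptides))  # prints KQGGTLDDLEEQARELYRRLREK
```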
Spectral Assembly via Overlap Graph

[Figure: overlap graph on spectra 1-7 of overlapping peptides]

1: KQGGTLDDLEEQAR
2: KQGGTLDDLEEQARELYR
3:   GGTLDDLEEQARELYR
4:   GGTLDDLEEQARELYRR
5:      LDDLEEQARELYRRLR
6:        DLEEQARELYRRLREK
7:          EEQARELYRRLREK

 A     T


 M
       E
Real samples contain modified peptides. Using an analogy with DNA sequencing, a modified peptide is not
unlike a polymorphism. Integrating modified peptides into the assembly pipeline is not unlike DNA assembly of
highly polymorphic genomes like sea squirt.

[Figure: spectral alignment of modified peptides; the aligned spectra (letters A, T, M, E) differ at a modified residue, T+80]

Spectral alignment of modified peptides: DIFFICULT ALGORITHMIC PROBLEM
Protein Sequencing with Eulerian Approach
Stage 1: Generate spectral pairs using approach in Bandeira et al., PNAS 2007

[Figure: three example spectra with their spectral graphs, each labeled in both reading directions: A M T E T / T E T M A, A M T E T / T E T M A, and A M T E T A V / V A T E T M A; one panel carries the modification T+80]

Stage 2: "Glue" peaks in spectral pairs using approach in P.P. et al., Genome Res., 2004
Mass differences between consecutive peaks in four aligned spectra (one spectrum per row):

  99.2 Da    71.0 Da    101.0 Da   129.1 Da   101.1 Da   131.1 Da
             71.1 Da    101.0 Da   129.3 Da   101.1 Da   131.0 Da   71.0 Da    71.1 Da    137.1 Da
                        101.1 Da   129.2 Da   101.0 Da   131.1 Da   71.1 Da
                        101.2 Da   129.0 Da   181.2 Da   131.0 Da   71.0 Da

Stage 3: Sequencing on the A-Bruijn graph using approach in Bandeira et al., MCP 2007

  V          A          T          E          T          M          A          A          H
                                              T+80
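
The step from the Stage 2 mass differences to the Stage 3 letters can be sketched as a nearest-mass lookup. This is only an illustration, not the method of the cited papers: the residue table is restricted to the six amino acids of the example (standard monoisotopic residue masses), and the function names, the 0.5 Da tolerance, and the treatment of "+80" as a flat 80.0 Da shift are all assumptions made here.

```python
# Monoisotopic residue masses (Da) for the six amino acids in the example.
RESIDUE_MASS = {
    "A": 71.037, "V": 99.068, "T": 101.048,
    "E": 129.043, "M": 131.040, "H": 137.059,
}

def call_residue(delta, tol=0.5, mods=(("+80", 80.0),)):
    """Return the label whose mass is closest to delta, or None if none is within tol.

    mods lists hypothetical modification tags and their mass shifts (assumption).
    """
    candidates = dict(RESIDUE_MASS)
    for tag, shift in mods:
        for aa, mass in RESIDUE_MASS.items():
            candidates[aa + tag] = mass + shift
    label, mass = min(candidates.items(), key=lambda kv: abs(kv[1] - delta))
    return label if abs(mass - delta) <= tol else None

def read_spectrum(deltas, tol=0.5):
    """Translate a list of peak-to-peak mass differences into residue labels."""
    return [call_residue(d, tol) for d in deltas]

first_row = read_spectrum([99.2, 71.0, 101.0, 129.1, 101.1, 131.1])
last_row = read_spectrum([101.2, 129.0, 181.2, 131.0, 71.0])
```

With all 20 amino acids in the table some masses would no longer be unambiguous (for example, leucine and isoleucine have the same mass), which is one reason real pipelines combine evidence from many overlapping spectra.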
28 aa protein contig, 24 spectra
   [271.1] F (SK) S G T E C R A S M S E C D P A E H C T G Q S

50 amino acids long protein contig of 92 assembled spectra
   GRHSLFHPEDTGKVFKVSHSFPHPLYDMSLLKNRFLRPGDDSSHDLMLLR

Figure legend: b-ions in each spectrum; mass difference between b-ions; oxidized methionine
Sequencing Snake Venoms

• Venom dataset from western diamondback
  rattlesnake generated by Karl Clauser at Broad
  Institute
   – Mixture of ~30 proteins
   – Digestion with: trypsin, chymotrypsin, Asp-N, Glu-C
Sequencing Catrocollastatin
EHQKYNPFRFVELFLVVDKAMVTKNNGDLDKIKTRMYEIVNTVNEIYRYMYIHVALVGLEIWSNEDKITVKPEAGYTLNAFGEWRKTDLL

TRKKHDNAQLLTAIDLDRVIGLAYVGSMCHPKRSTGIIQDYSEINLVVAVIMAHEMGHNLGINHDSGYCSCGDYACIMRPEISPEPSTFF

SNCSYFECWDFIMNHNPECILNEPLGTDIISPPVCGNELLEVGEECDCGTPENCQNECCDAATCKLKSGSQCGHGDCCEQCKFSKSGTEC

RASMSECDPAEHCTGQSSECPADVFHKNGQPCLDNYGYCYNGNCPIMYHQCYDLFGADVYEAEDSCFERNQKGNYYGYCRKENGNKIPCA

PEDVKCGRLYCKDNSPGQNNPCKMFYSNEDEHKGMVLPGTKCADGKVCSNGHCVDVATAY




  •    321 correct / 11 incorrect amino acid calls
  •    Longest contiguous stretch – 108 amino acids
  •    Over 2100 amino acids reconstructed
  •    Identified 15 SNP variants
Sequencing Antibodies
(collaboration with Genentech antibody sequencing group)
[Figure, three panels]
a) Graph of reconstructed SPS contigs (numbered nodes). Legend:
   - Contig order induced by homology to gi|148686583
   - Contiguous contig order induced by homology to gi|148540420
   - Contig order induced by homology to gi|148540420 but interrupted by non-contiguous coverage (sequence gaps)
b) Contig order induced by Comparative Shotgun Protein Sequencing
   (x-axis: amino acid position on Anti-BTLA Heavy chain, 100 to 400; y-axis: reconstructed SPS contigs)
c) Anti-BTLA Heavy Chain
   QVQLKESGPGLVAPSQSLSITCTVSGFSLTSYGVSWVR
   QPPGKGLEWLGVIWGDGSTNYHSALISRLSISKDNSKS
   QVFLKLNSLQTDDTATYYCAKGGYRFYYAMDYWGQGTS
   VTVSSAKTTPPSVYPLAPGSAAQTNSMVTLGCLVKGYF
   PEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVP
   SSTWPSETVTCNVAHPASSTKVDKKIVPRDCGCKPCIC
   TVPEVSSVFIFPPKPKDVLTITLTPKVTCVVVDISKDD
   PEVQFSWFVDDVEVHTAQTQPREEQFNSTFRSVSELPI
   MHQDWLNGKEFKCRVNSAAFPAPIEKTISKTKGRPKAP
   QVYTIPPPKEQMAKDKVSLTCMITDFFPEDITVEWQWN
   GQPAENYKNTQPIMDTDGSYFVYSKLNVQKSNWEAGNT
   FTCSVLHEGLHNHHTEKSLSHSPGK

Bandeira et al., Nature Biotech, 2008
Acknowledgements
      (short reads DNA sequencing)

Mark Chaisson (now at Pacific Biosciences)
Dima Brinza (now at Life Technologies)
Collaboration with Xiaohua Huang at UCSD Bioengineering (supported by NHGRI)
Collaborations with Joe Ecker lab at Salk (BAC sequencing data) and Illumina team (E. coli sequencing data)
Acknowledgements
• Rob Lipshutz, Affymetrix
  – SBH

• Haixu Tang (Indiana),
  Mike Waterman (USC) –
  EULER assembler

• Haixu Tang, Glenn Tesler
  (UCSD) - EULER+
  assembler

• Serafim Batzoglou
  (Stanford) – large
  assemblies with short reads

20101209 dnaseq pevzner

  • 1. Next Generation DNA Sequencing: Does the Read Length Matter? Pavel A. Pevzner Department of Computer Science and Engineering, University of California at San Diego
  • 2. Fragment Assembly reads atgcaatgcatgcatggatgcaatgcatgcaatgcatgcatggatgcaatgcatgcaatgcatgcatggatgcaatgcatgcaatgcatgcatggatgcaatgcatgcatgggg Cover region with (overlapping) reads Overlap reads and extend to reconstruct the original genomic region
  • 3. Some puzzles are more difficult than others... The puzzle has only 16 pieces and looks simple BUT there are repeats!!! The repeats make it very difficult.
  • 4. Does the Read Length Matter? Mark Chaisson Dima Brinza (now at Pacific Biosciences) (now at Life Technologies)
  • 5. EULER Short Reads assembler (Chaisson et al, Bioinformatics 2004, Genome Res., 2008, 2009)
  • 6.
  • 7. ...history repeats itself: sequencing insulin Fred Sanger 1958 (!) Nobel prize for sequencing insulin by Edman degradation Average read length = 5 aa!
  • 8. Shotgun Protein Sequencing: Mass Spectrometry vs. Edman degradation Novel proteins are still determined by laborious Edman degradation. – Integrilin, a blood clot prevention drug derived from rattlesnake venom. – Ziconotide, derived from cone snail venom, 20x more potent than morphine and with no addiction side effects. Many important proteins are not inscribed in genomes – Fusion proteins in tumors – Antibodies (collaboration with Genentech) – Non-ribosomal peptides and other natural products represent 9 out of top 20 bestselling drugs (collaborations with Pieter Dorrestein at UCSD School of Pharmacy) Challenge: Substitute slow Edman degradation by a fast protein sequencing technique (Bandeira et al, MCP 2007; Bandeira et al, PNAS 2007)
  • 9. Ribosomal Peptides May Be Equally Elusive
  • 10. Short Read Sequencing and SBH Short read sequencing was first proposed in 1988 under the name Sequencing by Hybridization (SBH) • 1988: SBH suggested as an alternative to Sanger sequencing. Nobody believed it would ever work • 1991: Light-directed polymer synthesis developed • 1994: Affymetrix develops first 64-kb DNA microarray. Figure captions: First microarray prototype (1989); First commercial DNA microarray prototype w/16,000 features (1994); 500,000 features per chip (2002)
  • 11. Fragment Assembly with Short Reads (k-mers) P.P. (1989) k-mer DNA sequencing. Result: An optimal Eulerian fragment assembly algorithm for SBH.
  • 12. Fragment Assembly with (very) Short Reads (k-mers) P.P. (1989) k-mer DNA sequencing. Result: An optimal Eulerian fragment assembly algorithm for SBH. Idury and Waterman (1995) Mimicking Sanger sequencing as SBH reconstruction (first Eulerian algorithm for fragment assembly)
  • 13. Fragment Assembly with (very) Short Reads (k-mers) P.P. (1989) k-mer DNA sequencing. Result: An optimal and fast Eulerian fragment assembly algorithm for SBH. Idury and Waterman (1995) Mimicking Sanger sequencing as SBH reconstruction (first Eulerian algorithm for fragment assembly) De novo assembly with short reads is not unlike assembly with virtual universal DNA array
  • 14. Hamiltonian Cycle Problem • Find a walk (cycle) in a network (graph) that visits every NODE exactly once • Intractable problem (NP-complete)
  • 15. The Bridges of Königsberg Problem Find a path crossing every bridge just once Leonhard Euler, 1735 Bridges of Königsberg
  • 16. Eulerian Cycle Problem • Find a walk (cycle) that visits every EDGE exactly once • Linear time algorithm! More complicated version of Königsberg
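
The linear-time claim on slide 16 can be illustrated with Hierholzer's algorithm. The sketch below is a generic textbook version for directed graphs, not code from EULER; it assumes the graph is connected and that every node has equal in-degree and out-degree, and the function name is chosen here for illustration.

```python
from collections import defaultdict

def eulerian_cycle(edges):
    """Return a node list forming a cycle that uses every directed edge exactly once.

    Assumes a connected graph in which each node has in-degree == out-degree.
    Runs in time linear in the number of edges.
    """
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)
    stack = [edges[0][0]]   # start anywhere; here, at the tail of the first edge
    cycle = []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())   # follow an unused edge
        else:
            cycle.append(stack.pop())    # dead end: emit node and backtrack
    return cycle[::-1]

example_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "d"), ("d", "a")]
example_cycle = eulerian_cycle(example_edges)
```

Each edge is pushed and popped once, which is where the linear running time comes from; no comparable shortcut is known for the Hamiltonian version on slide 14.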
  • 17. OVERLAP GRAPH Repeat Repeat Repeat Finding a path visiting every NODE exactly once: Hamiltonian path problem
  • 18. REPEAT GRAPH versus OVERLAP GRAPH Repeat Repeat Repeat Find a path visiting every EDGE exactly once: Eulerian path problem (taking into account multiplicity of edges – red edge is visited 3 times)
  • 19. Fragment assembly: two approaches Finding a path visiting every NODE exactly once in the OVERLAP graph: Hamiltonian path problem (intractable) Find a path visiting every EDGE exactly once in the REPEAT graph: Eulerian path problem Easy to Solve!
  • 21. Repeat Graph vs. Unordered Contigs Generated by Traditional Assemblers
  • 22. P.P. et al., PNAS 2001, Genome Res., 2004
  • 23. P.P. et al., PNAS 2001, Genome Res., 2004
  • 24. P.P. et al., Proc. National Academy of Sciences 2001, Genome Res., 2004
  • 25. NEWBLER (454 Life Sci.,06) ALLPATHS, Genome Res.08 (Broad Inst.) VELVET, Genome Res.08 (EBI) ABySS, Genome Res.08 (UBC) P.P. et al., PNAS 2001, Genome Res., 2004
  • 26. The Eulerian approach works well for very accurate (nearly error free) reads but deteriorates for inaccurate reads
  • 27. Error correction in reads: catch-22 The Eulerian approach works well for error-free reads but quickly deteriorates even for reads with low error rates (1%). To assemble a genome we need to correct errors in reads first. But to correct errors in reads one has to assemble the genome first! Can we correct sequencing errors if the genome is unknown, before the assembly has started? Result: a 50-fold reduction in sequencing errors PRIOR TO ASSEMBLY makes reads almost as accurate as the finished sequence (P.P. et al., PNAS 2001). A similar Spectrum Alignment approach (in a different context) was proposed in Peer & Shamir, RECOMB 01, PNAS 02. It is now used in nearly all assembly tools.
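
One common way to locate likely errors before assembly is to count k-mers across all reads and flag the rare ones. The sketch below only illustrates that general idea; it is not the EULER error-correction procedure, and the function names and the `min_count` threshold are assumptions made here.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_positions(read, counts, k, min_count=2):
    """Start positions of k-mers in `read` seen fewer than min_count times overall."""
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_count]

# Three error-free copies of a read and one copy with a substitution at index 4.
reads = ["ACGTACGGA"] * 3 + ["ACGTTCGGA"]
counts = kmer_counts(reads, k=4)
flagged = weak_positions("ACGTTCGGA", counts, k=4)
clean = weak_positions("ACGTACGGA", counts, k=4)
```

A single substitution makes every k-mer covering it rare, so the flagged start positions cluster around the error; a correction step would then look for the base change that turns those k-mers into frequent ones.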
  • 28. EULER vs VELVET (E. coli) Benchmarking: total length of k longest contigs as a function of k, for SSAKE, SHARCGS, VCAKE, EDENA, VELVET
  • 29. Mosaic structure of human segmental duplications: from de Bruijn to A-Bruijn Graphs A B C D E F G H I J / A B C D E F C G H I J / A B C D E F C G H B C D I J / A B C D E F C G H B C D I F C G J • The mosaic structure of segmental duplications in human genome is reconstructed using the A-Bruijn graph approach: Jiang et al. Evolutionary reconstruction of human segmental duplications (Nature Genetics, 2007)
  • 30. Algorithmic Challenge • Problem: given a string, find all repeat elements and reveal the sub-repeat mosaic structure. – Perfect repeats: de Bruijn graph, suffix tree. – Imperfect repeats: OPEN PROBLEM – The A-Bruijn graphs generalize the de Bruijn graphs for imperfect repeats (P.P. et al., Genome Res, 2004)
  • 31. De Novo Repeat Classification All pairwise similarities De novo repeat compilation Pairwise similarity ? Repeat Element 1 AGCCTACG Library of … … repeat elements Repeat Element 2 TGCATTTT … … Repeat Element 3 GAACTCAC ……
  • 32. Mosaic Structure of Repeats: (small region from human Y chromosome) Segments 1 to 15, labeled 8328, 140, 628, 1185, 2905, 628, 1185, 381, 140, 628, 1185, 381, 140, 628, 161442. RECON (Bao and Eddy, 2002) does not reveal the mosaic structure. A-Bruijn representation: sub-repeats with 2 copies, 2 copies, 3 copies, 4 copies
  • 33. Repeat Gluing (de Bruijn graph = Quotient space of all K-mers in the sequence) x y y y y x x y x y y y x x y x y
  • 34. Repeat Gluing (de Bruijn graph = Quotient space of all K-mers in the sequence) gluing instruction x y y y y x x y x y y y x x y x y
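
The gluing described on slides 33 and 34 can be sketched by building a de Bruijn graph in which identical k-mers collapse onto one edge. This is a minimal illustration under the usual convention that nodes are (k-1)-mers and edges are k-mers; the function name is an assumption made here.

```python
from collections import Counter

def de_bruijn_edges(sequence, k):
    """Map each edge (prefix (k-1)-mer, suffix (k-1)-mer) to its multiplicity.

    Every occurrence of the same k-mer is glued onto a single edge, so a
    repeated k-mer shows up as an edge with multiplicity greater than 1.
    """
    edges = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        edges[(kmer[:-1], kmer[1:])] += 1
    return edges

graph = de_bruijn_edges("ACGTTACGTA", k=4)
```

In this toy sequence the 4-mer ACGT occurs twice, so its edge has multiplicity 2, in the same way the slides mark a repeat edge as visited several times.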
  • 35. Similarity matrix A B C D E F C G H B C D I F C G J
  • 36. A B C D E F C G H B C D I F C G J Repeat graph with edges A, B, C, D, E, F, G, H, I, J. Sub-repeats (edges in the repeat graph): B 2 copies, C 4 copies, D 2 copies, F 2 copies, G 2 copies
  • 37. In reality, repeats are usually imperfect 8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 … … AG-CCATCGACGTCACC … … … … AGTGCCTCG-CGTCTCC … …
  • 38. Similarity matrix A B C D E F C G H B C D I F C G J
  • 39. Repeat Gluing (A-Bruijn graph = Quotient space of all ALIGNED POSITIONS) Consistent Gluing vs. Inconsistent Gluing (figure: aligned positions x and y)
  • 40. Challenge: Generalize the Notion of De Bruijn Graph for Imperfect Repeats • Input – a genomic sequence – all local pairwise alignments (pairs of aligned positions) • Output – repeat graph representing all repeats as a mosaic of sub-repeats
  • 41. Repeat Graph 8328 140 628 1185 2905 628 1185 381 140 628 1185 381 140 628 161442 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 A-Bruijn graph repeat graph x y y x
  • 43. From A-Bruijn Graph to Repeat Graph: MSLG Problem Maximum Subgraph with Large Girth (MSLG) Problem: Input: a weighted graph and a parameter girth. Output: a maximum weight subgraph that does not contain short cycles, i.e. cycles of length less than girth. Solution known only when the girth is infinite: the Maximum Spanning Tree Problem (maximum weight acyclic subgraph).
  • 44. Maximum Spanning Tree Approximation to MSLG Problem
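
The infinite-girth case mentioned on slide 43 is the classical maximum spanning tree problem, which Kruskal's algorithm solves by taking edges in order of decreasing weight. The sketch below is a generic textbook version, not the approximation used in EULER; the function name and the `(weight, u, v)` edge format are assumptions made here.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm with edges sorted by decreasing weight.

    weighted_edges is an iterable of (weight, u, v) tuples.
    Returns the kept edges, which form a maximum weight acyclic subgraph.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for w, u, v in sorted(weighted_edges, key=lambda e: e[0], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:            # adding this edge does not close a cycle
            parent[ru] = rv
            kept.append((w, u, v))
    return kept

tree = maximum_spanning_tree(
    "abcd",
    [(5, "a", "b"), (4, "b", "c"), (3, "a", "c"), (2, "c", "d"), (1, "b", "d")],
)
```

Here the edge of weight 3 is rejected because it would close the cycle a-b-c, which is exactly the kind of edge a finite girth parameter might still allow.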
  • 45. A-Bruijn Graphs and Fragment Assembly Genome: A B C D E F C G H B C D I F C G J Reads: A B C D I F C G H B C D E F C G J Every possible genome reconstruction corresponds to an Eulerian path in the repeat graph. (Figure: repeat graph with edges A to J)
  • 46. Fragment Assembly = Building Repeat Graph from Concatenated Reads Theorem (P.P. et al., Genome Res., 2004): The repeat graph built from concatenated (in an arbitrary order!) reads is identical to the repeat graph built from the genomic sequence if the reads “cover” the genomic sequence.
  • 47. EULER Algorithm (outline) • Concatenate reads (in an arbitrary order) into a single sequence • Compute the similarity matrix for this concatenated sequence • Use this similarity matrix as a “glue” and apply MSLG algorithm to build the repeat graph with the A-Bruijn algorithm (in NGS applications, only k-mer based glues are practical).
  • 48. EULER algorithm for NGS applications (Chaisson and PP, Genome Res., 2008) • de Bruijn step: Construct the de Bruijn graph of reads • A-Bruijn step: Remove bulges and whirls • Threading step: Thread each read through the resulting graph and form the consensus sequence from reads • Mate-pair step: Utilize mate-pairs. Velvet, ALLPATHS, ABySS and other NGS de novo tools now use a similar framework
  • 49. DNA Sequencing with mate-pairs genome cut many times at random into equally sized fragments Get mate-pairs: two reads from each fragment ~50 bp ~50 bp (separated by a fixed distance)
  • 50. E. coli assembly with 35 bp Illumina reads (N50 statistics with and without mate-pairs) EULER-USR: 19 KB; VELVET: 16 KB; EULER-USR (Mate-Paired): 68 KB; VELVET (Mate-Paired): 48 KB
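
For readers unfamiliar with the N50 statistic quoted on slide 50: it is the contig length L such that contigs of length at least L contain at least half of all assembled bases. A minimal sketch follows; the function name and the sample lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached half of the assembled bases
            return length
    return 0

sample_n50 = n50([80, 70, 50, 40, 30, 20])
```

In the sample, the total is 290 and the two longest contigs already cover 150 bases, so the N50 is the length of the second contig.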
  • 51. Eulerian Assembly with Mate-Pairs EULER transforms MATE-PAIRS: “read1 - GAP of length d - read2” into LONG MATE-READS: “read1 - DNA SEQUENCE of length d – read2” P.P. and Tang, ISMB 2001
  • 52. Transforming Mate-Pairs into Mate-Reads [Figure: mate-pairs spanning three repeats.]
  • 53. Repeat Graph (Unlike the Overlap Graph) Enables Easy Processing of Mate-pairs
  • 54. Repeat graph before and after Transforming Mate-Pairs into Mate-Reads (Sanger Reads from N. meningitidis) P.P. and Tang, ISMB 2001
  • 55. Complications in Transforming Mate-Pairs into Mate-Reads: Multiple Paths Matching the Distance Between Mate-Pairs. P.P. and Tang, ISMB 2001 described how to deal with such complications. VELVET (Breadcrumb) and ALLPATHS described similar approaches aimed at short-read assemblies (using multiple mate-pairs to transform a single mate-pair into a mate-read). [Figure labels: A, A′, R1, B, B′, R2, C, C′.]
  • 56. EULER’s Utilization of Mate-Pairs [Figure: arrangements of repeats R1 and R2.]
  • 57. EULER with Mate-Pairs: Does the Read Length Matter? • EULER provides an algorithmic solution for the problem of increasing the read lengths. • Assuming that the read length is 50 bp and the insert length is 300 bp, EULER generates mate-reads of length 300+50+50=400 bp. • If all mate-pairs are transformed into mate-reads then the read length does not matter! The thing that matters is SPAN=InsertLength+2*ReadLength
  • 58. EULER-USR with Mate-Pairs: Does the Read Length Matter? • EULER provides an algorithmic solution for the experimental problem of increasing read lengths. • Assuming that the read length is 50 bp and the insert length is 300 bp, EULER generates mate-reads of length 300+50+50=400 bp. • If all mate-pairs are transformed into mate-reads then the read length almost does not matter! The thing that matters is SPAN=InsertLength+2*ReadLength • But is it possible to transform mate-pairs into mate-reads with nearly 100% efficiency?
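The SPAN arithmetic on this slide can be written down directly. This is a small sketch with hypothetical helper names; it only restates SPAN=InsertLength+2*ReadLength and its inverse for a fixed span.

```python
def span(insert_length, read_length):
    """SPAN = InsertLength + 2 * ReadLength (length of the mate-read)."""
    return insert_length + 2 * read_length


def insert_for_span(target_span, read_length):
    """Insert length needed to keep the span fixed as the read length varies."""
    return target_span - 2 * read_length


# The slide's example: 50 bp reads with a 300 bp insert.
print(span(300, 50))

# Holding the span fixed, shorter reads must be paired with longer inserts.
for read_length in (25, 35, 100):
    print(read_length, insert_for_span(300, read_length))
```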
  • 59. Read Length Does NOT Matter! (good news for short read technologies) • EULER-USR was run with simulated (and real) reads varying from 25nt to 100nt and fixed-length span SPAN=InsertLength+2*ReadLength=300 (E. coli genome) • For read length 35, the efficiency is 98.8% and N50=61K • For read length 100, the efficiency is 98.9% and N50=61K
  • 60. BUT the Read Length Does Matter! • EULER-USR was run with simulated (and real) read length varying from 25nt to 100nt and fixed-length span InsertLength+2*ReadLength=300 (E. coli genome) • For read length 35, the efficiency is 98.8% and N50=61K • For read length 100, the efficiency is 98.9% and N50=61K • BUT for read length 25, the efficiency is 86.1% and N50=41K
  • 61. BUT Read Length Does Matter! • EULER-USR was run with simulated (and real) read length varying from 25nt to 100nt and fixed-length span InsertLength+2*ReadLength=300 (E. coli genome) • For read length 35, the efficiency is 98.8% and N50=61K • For read length 100, the efficiency is 98.9% and N50=61K • For read length 25, the efficiency is 86.1% and N50=41.3K • A small drop in read length results in a dramatic drop in efficiency and N50
  • 62. BUT Read Length Does Matter! • EULER-USR was run with simulated (and real) read length varying from 30nt to 100nt and fixed-length span InsertLength+2*ReadLength=300 (E. coli genome) • For read length 35, the efficiency is 98.8% and N50=61K • For read length 100, the efficiency is 98.9% and N50=61K • For read length 26, the efficiency is 86.1% and N50=41.3K • A small drop in read length results in a dramatic drop in efficiency and N50 • For a BACTERIAL (E. coli) genome, 30nt is a BREAKPOINT separating the assemblies where the read length DOES NOT MATTER from the assemblies where the read length MATTERS.
  • 63. Where is the Breakpoint for Assembling Yeast Genome? (bad news for Illumina, good news for 454) • EULER-USR was run with simulated (and real) read length varying from 30nt to 100nt and fixed-length span InsertLength+2*ReadLength=300 (E. coli genome) • For read length 35, the efficiency is 98.8% and N50=61K • For read length 100, the efficiency is 98.9% and N50=61K • For read length 26, the efficiency is 86.1% and N50=41.3K • A small drop in read length results in a dramatic drop in efficiency and N50 • For the YEAST genome, 45nt is a BREAKPOINT separating the assemblies where the read length DOES NOT MATTER from the assemblies where the read length MATTERS.
  • 64. OPEN PROBLEM: WHERE IS THE BREAKPOINT FOR MAMMALIAN GENOMES?
  • 65. Mass-Spectral Assembly. Shotgun DNA sequencing for whole-genome assembly: 1. Randomly read small portions of the genome – reads 2. Find pairwise overlaps between reads 3. Assemble overlaps into long sequences – contigs. Can we also assemble spectra into whole-protein sequences? – Shotgun proteomics generates spectra of unknown peptides (short reads?) – Find spectral pairs formed by spectra from overlapping peptides (pairwise overlaps?) – Assemble overlapping spectra into long stretches of amino acids (contigs?)
  • 66. Spectral Assembly via Overlap Graph. Overlapping peptides: 1: KQGGTLDDLEEQAR; 2: KQGGTLDDLEEQARELYR; 3: GGTLDDLEEQARELYR; 4: GGTLDDLEEQARELYRR; 5: LDDLEEQARELYRRLR; 6: DLEEQARELYRRLREK; 7: EEQARELYRRLREK. [Figure: spectra 1 to 7 connected in an overlap graph.]
  • 67. Spectral Assembly via Overlap Graph. Overlapping peptides: 1: KQGGTLDDLEEQAR; 2: KQGGTLDDLEEQARELYR; 3: GGTLDDLEEQARELYR; 4: GGTLDDLEEQARELYRR; 5: LDDLEEQARELYRRLR; 6: DLEEQARELYRRLREK; 7: EEQARELYRRLREK. [Figure: spectra 1 to 7 connected in an overlap graph; spectral alignment of modified peptides, with T+80 peaks.] Real samples contain modified peptides. Using an analogy with DNA sequencing, a modified peptide is not unlike a polymorphism. Integrating them into the assembly pipeline is not unlike DNA assembly of highly polymorphic genomes like sea squirt. Spectral alignment of modified peptides: DIFFICULT ALGORITHMIC PROBLEM
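To make the overlap-graph analogy concrete, the sketch below computes suffix-prefix overlaps between the peptide sequences listed on this slide. This is a deliberate simplification: in practice the peptide sequences are unknown and overlaps are detected between spectra (spectral pairs), not strings. The function name and the `min_len` threshold are illustrative assumptions.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


peptides = {
    1: "KQGGTLDDLEEQAR",
    3: "GGTLDDLEEQARELYR",
    5: "LDDLEEQARELYRRLR",
    6: "DLEEQARELYRRLREK",
}

# Peptide 1 ends with the 12 residues that peptide 3 starts with.
print(suffix_prefix_overlap(peptides[1], peptides[3]))
# Peptide 5 ends with the 14 residues that peptide 6 starts with.
print(suffix_prefix_overlap(peptides[5], peptides[6]))
```

An overlap graph would add an edge for every pair whose overlap exceeds the threshold; modified peptides break exact matching, which is where spectral alignment comes in.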
  • 68. Protein Sequencing with Eulerian Approach. Stage 1: Generate spectral pairs using the approach in Bandeira et al., PNAS 2007. Stage 2: “Glue” peaks in spectral pairs using the approach in P.P. et al., Genome Res., 2004. Stage 3: Sequencing on the A-Bruijn graph using the approach in Bandeira et al., MCP 2007. [Figure: aligned spectra with peak mass differences 99.2, 71.0, 101.0, 129.1, 101.1, 131.1, 71.1, 101.0, 129.3, 101.1, 131.0, 71.0, 71.1, 137.1, 101.1, 129.2, 101.0, 131.1, 71.1, 101.2, 129.0, 181.2, 131.0, 71.0 Da; sequence labels V A T E T M A, A H, T+80.]
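The mass differences in this figure can be read as amino acids by matching each difference to the nearest residue mass. The sketch below is an illustration only: the table holds rounded monoisotopic residue masses for the few residues that appear on the slide, the value for T+80 is approximated as 181.0 Da (an assumption about the exact offset), and the 0.3 Da tolerance and the function name are hypothetical.

```python
# Rounded monoisotopic residue masses in Da; "T+80" is an approximation.
RESIDUE_MASS = {
    "A": 71.037,
    "V": 99.068,
    "T": 101.048,
    "E": 129.043,
    "M": 131.040,
    "H": 137.059,
    "T+80": 181.0,
}


def decode_mass_differences(mass_differences, tolerance=0.3):
    """Label each mass difference with the closest residue, or '?'."""
    labels = []
    for mass in mass_differences:
        label, reference = min(
            RESIDUE_MASS.items(), key=lambda item: abs(item[1] - mass)
        )
        labels.append(label if abs(reference - mass) <= tolerance else "?")
    return labels


# First six mass differences listed on the slide.
print(decode_mass_differences([99.2, 71.0, 101.0, 129.1, 101.1, 131.1]))
# Differences involving histidine and the modified threonine.
print(decode_mass_differences([71.1, 137.1, 181.2]))
```

A full residue table would introduce ambiguities (for example leucine and isoleucine share a mass), which this reduced table sidesteps.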
  • 69. 28 aa protein contig, 24 spectra: [271.1] F (SK) S G T E C R A S M S E C D P A E H C T G Q S. 50 amino acids long protein contig of 92 assembled spectra: GRHSLFHPEDTGKVFKVSHSFPHPLYDMSLLKNRFLRPGDDSSHDLMLLR. [Figure legend: b-ions in each spectrum; mass difference between b-ions; oxidized methionine.]
  • 70. Sequencing Snake Venoms • Venom dataset from western diamondback rattlesnake generated by Karl Clauser at Broad Institute – Mixture of ~30 proteins – Digestion with: trypsin, chymotrypsin, Asp-N, Glu-C
  • 72. Sequencing Antibodies (collaboration with Genentech antibody sequencing group). a) [Figure: graph of numbered SPS contigs.] b) Contig order induced by Comparative Shotgun Protein Sequencing: reconstructed SPS contigs against amino acid position on Anti-BTLA Heavy chain (axis 100 to 400). c) Anti-BTLA Heavy Chain: QVQLKESGPGLVAPSQSLSITCTVSGFSLTSYGVSWVR QPPGKGLEWLGVIWGDGSTNYHSALISRLSISKDNSKS QVFLKLNSLQTDDTATYYCAKGGYRFYYAMDYWGQGTS VTVSSAKTTPPSVYPLAPGSAAQTNSMVTLGCLVKGYF PEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVP SSTWPSETVTCNVAHPASSTKVDKKIVPRDCGCKPCIC TVPEVSSVFIFPPKPKDVLTITLTPKVTCVVVDISKDD PEVQFSWFVDDVEVHTAQTQPREEQFNSTFRSVSELPI MHQDWLNGKEFKCRVNSAAFPAPIEKTISKTKGRPKAP QVYTIPPPKEQMAKDKVSLTCMITDFFPEDITVEWQWN GQPAENYKNTQPIMDTDGSYFVYSKLNVQKSNWEAGNT FTCSVLHEGLHNHHTEKSLSHSPGK. Legend: contig order induced by homology to gi|148686583; contiguous contig order induced by homology to gi|148540420; contig order induced by homology to gi|148540420 but interrupted by non-contiguous coverage (sequence gaps). Bandeira et al., Nature Biotech, 2008
  • 73. Acknowledgements (short reads DNA sequencing): Mark Chaisson (now at Pacific Biosciences), Dima Brinza (now at Life Technologies). Collaboration with Xiaohua Huang at UCSD Bioengineering (supported by NHGRI). Collaborations with Joe Ecker lab at Salk (BAC sequencing data) and Illumina team (E. coli sequencing data)
  • 74. Acknowledgements • Rob Lipshutz, Affymetrix – SBH • Haixu Tang (Indiana), Mike Waterman (USC) – EULER assembler • Haixu Tang, Glenn Tesler (UCSD) - EULER+ assembler • Serafim Batzoglou (Stanford) – large assemblies with short reads