Velvet / Curtain
Matthias Haimel




                   EBI is an Outstation of the European Molecular Biology Laboratory.



2   25.04.11   Velvet / Curtain
Overview
    • De Bruijn Graph
    • Velvet
               • Theory
               • Practice
    • Data formats and quality
    • Velvet
               • Simulation data
               • Multiple insert lengths
    • Curtain
               • Theory
               • Practice


De Bruijn graph
    • A concept in combinatorial mathematics
               • In combinatorics, a de Bruijn graph usually contains every possible word, so it is fully connected
               • http://en.wikipedia.org/wiki/De_Bruijn_graph
    • De Bruijn sequence
               • Related concept
               • Path through graph




    • Velvet
               • de Bruijn inspired graph structure




De Bruijn graph (Velvet)
    • Representation of
               • a sequence based on short words (k-mers)
               • overlaps between words
    • k-mer: word of length k
    • k = 5, sequence GCCTTCCA
               • Consecutive k-mers overlap by k-1 bases


    GCCTT                                   GCCTT           GCCTT
     CCTTC                                   CCTTC           CCTTC
                                               CTTCC           CTTCC
                                                                TTCCA
                                                                  ...
    GCCTTCCA                                GCCTTCCA        GCCTTCCA
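
The decomposition above can be sketched in a few lines of Python. This is an illustration only, not Velvet code; the function name `kmers` is an assumption made for this example.

```python
def kmers(sequence, k):
    """Return all overlapping words of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmers("GCCTTCCA", 5)

# Consecutive k-mers share k-1 bases: the suffix of one is the prefix of the next.
assert all(a[1:] == b[:-1] for a, b in zip(words, words[1:]))

print(words)  # ['GCCTT', 'CCTTC', 'CTTCC', 'TTCCA']
```

A sequence of length n yields n - k + 1 k-mers, here 8 - 5 + 1 = 4.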

De Bruijn graph (Velvet)
                            GCCTTCCAATTT
                            GCCTTCAAATTT


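    [Figure: de Bruijn graph (k = 5) of the two sequences. Both paths share GCCTT -> CCTTC,
    split into CTTCC -> TTCCA -> ... -> CAATT and CTTCA -> TTCAA -> ... -> AAATT,
    and rejoin at AATTT.]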




De Bruijn graph representations (Velvet)
    [Figures: graph shape for each of the following cases]
    • Error free, no repeat, no polymorphism
               • e.g. ATTC -> TTCA -> TCAG
    • Repeat > k-mer length
    • SNP, variant, < k-mer length
    • Structural variant, inversion
    • Structural variant, deletion…
    • …


Example
                      TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
                       AGTCGAG CTTTAGA CGATGAG CTTTAGA
                        GTCGAGG TTAGATC ATGAGGC      GAGACAG
                           GAGGCTC   ATCCGAT AGGCTTT GAGACAG
                       AGTCGAG    TAGATCC ATGAGGC TAGAGAA
                      TAGTCGA CTTTAGA CCGATGA     TTAGAGA
                          CGAGGCT AGATCCG TGAGGCT AGAGACA
                      TAGTCGA GCTTTAG TCCGATG GCTCTAG
                         TCGACGC    GATCCGA GAGGCTT AGAGACA
                      TAGTCGA    TTAGATC GATGAGG TTTAGAG
                        GTCGAGG TCTAGAT   ATGAGGC TAGAGAC
                            AGGCTTT ATCCGAT AGGCTTT GAGACAG
                       AGTCGAG   TTAGATT ATGAGGC    AGAGACA
                             GGCTTTA TCCGATG     TTTAGAG
                          CGAGGCT TAGATCC TGAGGCT    GAGACAG
                       AGTCGAG TTTAGATC ATGAGGC TTAGAGA
                           GAGGCTT GATCCGA GAGGCTT GAGACAG


Example

               Read: GTCGAGG




                   GTCG
                   (1x)




Example

                Read: GTCGAGG




                    GTCG     TCGA
                    (1x)     (1x)




Example

                Read: GTCGAGG




                    GTCG     TCGA      CGAG
                    (1x)     (1x)      (1x)




Example

                Read: GTCGAGG




                    GTCG     TCGA      CGAG   GAGG
                    (1x)     (1x)      (1x)   (1x)




Example

         New read: CGAGGCT




                GTCG     TCGA      CGAG   GAGG
                (1x)     (1x)      (2x)   (1x)




Example

                Read: CGAGGCT




                    GTCG     TCGA      CGAG   GAGG
                    (1x)     (1x)      (2x)   (2x)




Example

                Read: CGAGGCT




                    GTCG     TCGA      CGAG   GAGG   AGGC
                    (1x)     (1x)      (2x)   (2x)   (1x)




Example

                Read: CGAGGCT




                    GTCG     TCGA      CGAG   GAGG   AGGC   GGCT
                    (1x)     (1x)      (2x)   (2x)   (1x)   (1x)




Example

                New read: TCGACGC




                    GTCG     TCGA      CGAG   GAGG   AGGC
                    (1x)     (2x)      (2x)   (2x)   (1x)




Example

                Read: TCGACGC




                    GTCG     TCGA      CGAG   GAGG   AGGC
                    (1x)     (2x)      (2x)   (2x)   (1x)



                                       CGAC   GACG   ACGC
                                       (1x)   (1x)   (1x)
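
The read-by-read build-up shown on these slides amounts to counting how often each k-mer is seen. A minimal sketch with k = 4 and the three reads used so far; `count_kmers` is a name chosen for this example, and Velvet's own data structures are more involved.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


reads = ["GTCGAGG", "CGAGGCT", "TCGACGC"]
counts = count_kmers(reads, 4)

# Print in the slide's notation, e.g. "TCGA (2x)".
for kmer, n in counts.items():
    print(f"{kmer} ({n}x)")
```

K-mers shared by several reads (TCGA, CGAG, GAGG) reach a count of 2, while the k-mers unique to the third read (CGAC, GACG, ACGC) stay at 1 and form the side branch.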




Example

                   etc…
                                                                                                                GATT
                                                                                                                (1x)




                                                        TGAG     ATGA   GATG   CGAT   CCGA   TCCG     ATCC     GATC     AGAT
                                                        (9x)     (8x)   (5x)   (6x)   (7x)   (7x)     (7x)     (8x)     (8x)

                                                                                                                                              AGAA
                                                                                                                                              (1x)

                                                                                   GCTC   CTCT      TCTA     CTAG
                                                                                   (2x)   (1x)      (2x)     (2x)

                TAGT   AGTC   GTCG     TCGA      CGAG    GAGG      AGGC    GGCT                                       TAGA     AGAG   GAGA    AGAC   GACA   ACAG
                (3x)   (7x)   (9x)     (10x)     (8x)    (16x)     (16x)   (11x)                                      (16x)    (9x)   (12x)   (9x)   (8x)   (5x)
                                                                                   GCTT   CTTT      TTTA     TTAG
                                                                                   (8x)   (8x)      (8x)     (12x)
                                                 CGAC    GACG       ACGC
                                                 (1x)    (1x)       (1x)




Example

                  After simplification…


                                                      GATT
                                                                  AGAT

                                                  GATCCGATGAG                             AGAA
                                                                GCTCTAG
                TAGTCGA    CGAG

                                             GAGGCT    GGCT               TAGA   AGAGA   AGACAG
                                                                GCTTTAG
                          CGACGC
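
Simplification merges every unbranched chain of k-mers into a single node. The sketch below is a toy version under strong assumptions: it ignores the reverse strand and coverage, skips isolated cycles, and links k-mers only when they are adjacent in a read. The names `build_graph` and `simplify` are hypothetical and are not part of Velvet.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are k-mers; an edge joins two k-mers that are adjacent in a read."""
    succ, pred, nodes = defaultdict(set), defaultdict(set), set()
    for read in reads:
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(words)
        for a, b in zip(words, words[1:]):
            succ[a].add(b)
            pred[b].add(a)
    return nodes, succ, pred


def simplify(reads, k):
    """Merge each unbranched chain of k-mers into one sequence."""
    nodes, succ, pred = build_graph(reads, k)

    def starts_chain(node):
        # A node continues a chain only if it has exactly one predecessor
        # and that predecessor has exactly one successor.
        if len(pred[node]) != 1:
            return True
        (before,) = pred[node]
        return len(succ[before]) != 1

    contigs = []
    for node in sorted(nodes):
        if not starts_chain(node):
            continue
        sequence, current = node, node
        while len(succ[current]) == 1:
            (following,) = succ[current]
            if len(pred[following]) != 1:
                break  # the next node is a merge point, stop here
            sequence += following[-1]  # extend by the one new base
            current = following
        contigs.append(sequence)
    return contigs


print(simplify(["GTCGAGG", "CGAGGCT", "TCGACGC"], 4))
# ['CGACGC', 'CGAGGCT', 'GTCGA']
```

With only the first three reads, the chain stops at TCGA because it has two successors, which mirrors how TAGTCGA ends before the CGAG / CGACGC fork on the slide.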




Example

                  Tips removed…


                                                                  AGAT

                                                  GATCCGATGAG
                                                                GCTCTAG
                TAGTCGA    CGAG

                                             GAGGCT    GGCT               TAGA   AGAGA   AGACAG
                                                                GCTTTAG




De Bruijn graph biology extensions (Velvet)
     • Handling of reverse strand
                • DNA is read in two directions
                • Paired-end data
     • Handling small differences, which are “uninteresting”
                • Errors in sequencing technology
     • Memory
                • Regularly uses 80 to 100 GB of real memory
                • Can easily reach 1 TB of real memory




Read variety
     • Short reads                      ~75bp
                • Illumina / Solexa
                • SOLiD (colour space)
     • Long reads                     500-1000 bp
                • 454 reads
                • Sanger capillary reads
     • Paired-end reads
                • Short reads
                • short insert length
     • Mate pair reads
                • Short reads
                • long insert length

Paired-End




                                    Mate Pair




Short paired-end / mate pair reads


                                     ?

Velvet expects Illumina paired-end orientation: (L-> <-R)

                    L                              R       paired-end




Short paired-end / mate pair reads


Illumina mate-pair orientation: (<-L R->)

                     <-L                                   R->   mate pair

                                      reverse complement both reads

                     L->                                   <-R   paired-end
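
Converting a mate pair to the orientation Velvet expects means reverse complementing both reads. A minimal sketch, assuming plain sequence strings; the function names are chosen for this example. For FASTQ input the quality string of each read would also have to be reversed.

```python
# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


def mate_pair_to_paired_end(left, right):
    """Turn a (<-L R->) mate pair into the (L-> <-R) paired-end orientation."""
    return reverse_complement(left), reverse_complement(right)


print(reverse_complement("GCCTT"))            # AAGGC
print(mate_pair_to_paired_end("AAC", "GGT"))  # ('GTT', 'ACC')
```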




Velvet algorithms
     • Remove Bubbles
                • Tour Bus




     • Velvet parameters
                • -max_branch_length
                • -max_divergence
                • -max_gap_count




Example




                                                                          AGAT

                                                  GATCCGATGAG
                                                                        GCTCTAG
                TAGTCGA    CGAG

                                             GAGGCT    GGCT                               TAGA         AGAGA   AGACAG
                                                                        GCTTTAG



                                                                 GCTC    CTCT   TCTA   CTAG
                                                                 (2x)    (1x)   (2x)   (2x)

                                                         GGCT                                  TAGA
                                                         (11x)                                 (16x)
                                                                 GCTT    CTTT   TTTA   TTAG
                                                                 (8x)    (8x)   (8x)   (12x)


Example

                  Bubbles removed… by Tour Bus


                                                                 AGAT

                                                  GATCCGATGAG


                TAGTCGA    CGAG

                                             GAGGCT    GGCT     GCTTTAG   TAGA   AGAGA   AGACAG




Example

                Final simplification…


                                          AGATCCGATGAG



                      TAGTCGAG            GAGGCTTTAGA    AGAGACAG




Example
                              TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG


                Final simplification…


                                          AGATCCGATGAG



                      TAGTCGAG            GAGGCTTTAGA     AGAGACAG


     One possible walk through the graph ...
                              TAGTCGAG
                                   GAGGCTTTAGA
                                           AGATCCGATGAG
                                                    GAGGCTTTAGA
                                                            AGAGACAG

N50
     • Total
          • 4,295,113bp
     • N90
          • 439bp


     • N50
          • 3,119bp


     • N10
          • 13,519bp




N50
     • N50 is the length of the smallest contig in the set that
                • contains the fewest (largest) contigs
                • has a combined length of at least 50% of the assembly
     • N10
                • Same definition with a 10% threshold, so only the largest contigs count


           http://www.broadinstitute.org/crd/wiki/index.php/N50
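
The definition translates directly into code: sort the contigs from largest to smallest and accumulate lengths until the threshold is reached. A minimal sketch; the function name `nx` and the sample lengths are made up for illustration.

```python
def nx(lengths, x):
    """Length of the smallest contig in the smallest set of largest contigs
    whose combined length is at least x% of the total assembly length."""
    if not lengths:
        raise ValueError("no contigs given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


contigs = [2, 3, 4, 5, 6]  # total length 20
print(nx(contigs, 50))  # 5  (6 + 5 = 11 >= 10)
print(nx(contigs, 10))  # 6  (6 >= 2)
print(nx(contigs, 90))  # 3  (6 + 5 + 4 + 3 = 18 >= 18)
```

N10 is never smaller than N50, which is never smaller than N90, matching the ordering of the values on the previous slide.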




Velvet practical: Part 1
     • Compile
     • Single end (ERX001300)
                • K-mer length
                • Coverage cut-offs
     • Whole genome sequence as input???
                • Staphylococcus aureus MRSA252




Velvet algorithms
     • Long read information
                • Rock Band




     • Velvet parameters
                • -long_mult_cutoff




Velvet algorithms
     • Paired-end information
                • Pebble




     • Velvet parameters
                • -min_pair_count




                                         Once all distances and variances are computed,
                                         a simple greedy extension runs outwards from the main contigs



Paired-end in Velvet
     • Hugely improves quality of assembly
     • Insert length should exceed repeat length
                • i.e. the length of the most common genomic repeat
     • Mixed insert length improves results
                • Short: helps for local assembly
                • Long: get over repeats
     • Large genomes
                • Very memory intensive
                • Calculation intensive




Data formats and quality
     • Fasta
                • .fasta
                • .fa
                • ?
                • Record: header line, then sequence

                   >read_1
                   TATAATATTTAT...

     • Fastq
                • .fastq
                • .fq
                • ?
                • Record: header line, sequence, "+" separator, quality

                   @SEQ_ID
                   GATTTGGGGTTCAAAGC
                   +
                   !''*((((***+))%%%




FASTQ paired
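     Read-pair suffixes: .F / .R or /1 / /2

     One file, the two reads of a pair in consecutive records:
                             @SRR022863.1.F
                             ATATAGATGTACATAAATTAGTTGAAGTATATGAACG
                             +
                             IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII
                             @SRR022863.1.R
                             TTCACCCATTTTATCCATGATTTTGTTCTTTCTCTTC
                             +
                             IIIIIHIIIIIIII3III.,IIII&II6II-))&'I0

     Two files, forward reads (left) and reverse reads (right):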
                 @SRR022863.1.F                          @SRR022863.1.R
                 ATATAGATGTACATAAATTAGT...               TTCACCCATTTTATCCATGATTTTGTT...
                 +                                       +
                 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII   IIIIIHIIIIIIII3III.,IIII&II6II-))&'I0
                 @SRR022863.2.F                          @SRR022863.2.R
                 TTATGAATTATTAATAAGTGCT...               CATAAAAAAAGAAAATGTACTCTTTAC...
                 +                                       +
                 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII   IIII)0&A,%.&9$8I4+A;I)4II)&%-I$I%#)II
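
Going from two separate files to one file with each pair in consecutive records is a simple merge. A minimal sketch, assuming strict four-line FASTQ records with no blank lines and reads in the same order in both files; `read_fastq` and `interleave` are names chosen for this example.

```python
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def interleave(forward, reverse, out):
    """Write forward and reverse reads alternately, one pair after the other."""
    for f, r in zip_longest(read_fastq(forward), read_fastq(reverse)):
        if f is None or r is None:
            raise ValueError("files hold different numbers of reads")
        for header, sequence, quality in (f, r):
            out.write(f"{header}\n{sequence}\n+\n{quality}\n")


# Usage with in-memory files; real files opened with open() work the same way.
fwd = io.StringIO("@r1.F\nACGT\n+\nIIII\n@r2.F\nTTTT\n+\nIIII\n")
rev = io.StringIO("@r1.R\nGGCC\n+\nHHHH\n@r2.R\nAAAA\n+\nHHHH\n")
merged = io.StringIO()
interleave(fwd, rev, merged)
print(merged.getvalue())
```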



Quality score
     • Velvet does NOT use quality scores!
                • Errors are corrected on the de Bruijn graph instead
     • p
                • the probability that the corresponding base call is incorrect

     • Phred quality score
                • Q = -10 log10(p)
                • 10 -> 1 in 10
                • 40 -> 1 in 10,000

     • Odds ratio
                • Q = -10 log10(p / (1 - p))
                • Used by earlier versions of the Solexa pipeline
                • Differs from Phred mainly at low quality values
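
The two score definitions can be compared numerically. A minimal sketch using the standard formulas Q = -10 log10(p) for Phred and Q = -10 log10(p / (1 - p)) for the odds-based Solexa score; the function names are chosen for this example.

```python
import math


def phred_to_error_prob(q):
    """Phred score -> probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Error probability -> Phred score."""
    return -10 * math.log10(p)


def error_prob_to_solexa(p):
    """Error probability -> odds-based score of the early Solexa pipeline."""
    return -10 * math.log10(p / (1 - p))


# The two scales agree closely for good bases and drift apart for poor ones.
for p in (0.0001, 0.01, 0.1, 0.5):
    print(f"p={p}: Phred {error_prob_to_phred(p):.2f}, Solexa {error_prob_to_solexa(p):.2f}")
```

At p = 0.5 the Phred score is about 3 while the odds-based score is 0, which is why the two encodings differ mainly at the low end.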



Quality encoding
     • !''*((((***+))%%%
                • One value per base
                • Integer mapping based on ASCII encoding
                • Encodes the probability of an incorrect base call



     • Sanger format                           • Illumina 1.5+
                •   Phred score                   •   Phred score
                •   ASCII 33 – 126 -> 0 – 93      •   ASCII 59 – 126 -> -5 – 62
                •   Rarely exceeds 60             •   Only 2 – 40 expected
                •   ! = 33 -> 0                   •   ! = 33 -> (does not exist)
                •   B = 66 -> 33                  •   B = 66 -> 2
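
Decoding a quality string is a matter of subtracting the format's ASCII offset from each character code. A minimal sketch, with offset 33 for the Sanger format and offset 64 for Illumina 1.5+ as implied by the ranges above; the constant and function names are chosen for this example.

```python
SANGER_OFFSET = 33    # '!' (ASCII 33) encodes score 0
ILLUMINA_OFFSET = 64  # 'B' (ASCII 66) encodes score 2


def decode_quality(quality, offset):
    """Map each quality character to its integer score."""
    return [ord(char) - offset for char in quality]


print(decode_quality("!''*((((***+))%%%", SANGER_OFFSET))
# [0, 6, 6, 9, 7, 7, 7, 7, 9, 9, 9, 10, 8, 8, 4, 4, 4]

print(decode_quality("B", SANGER_OFFSET))    # [33]
print(decode_quality("B", ILLUMINA_OFFSET))  # [2]
```

Decoding with the wrong offset shifts every score by 31, so a file's encoding should be checked before any quality-based trimming.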

Quality encoding
     • wikipedia




Quality trimming

     [Figure: quality score plotted against bp position in read. Good / bad?]

Quality trimming
     • Fixed length trimming
                • Cut-off at position x
     • Adaptive trimming
                • Quality score cut-off
                • Minimum sequence length
     • Sliding window
                • Window size
                • Quality score cut-off
                • Use average quality value of window
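
The sliding-window variant can be sketched as follows: move a window along the read and cut at the first window whose average quality falls below the cut-off. This is one plausible reading of the bullets above, not the behaviour of any particular trimming tool; all names and parameter values are assumptions for this example.

```python
def sliding_window_cut(scores, window, cutoff):
    """Return how many leading bases to keep.

    The read is cut at the start of the first window whose mean quality
    is below the cut-off; a read without such a window is kept whole.
    """
    if not scores:
        return 0
    last_start = max(len(scores) - window, 0)
    for start in range(last_start + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / len(chunk) < cutoff:
            return start
    return len(scores)


def trim_read(sequence, scores, window, cutoff, min_length):
    """Trim a read; return None if the result is shorter than min_length."""
    keep = sliding_window_cut(scores, window, cutoff)
    return sequence[:keep] if keep >= min_length else None


scores = [40, 40, 40, 40, 10, 10, 10, 10]
print(trim_read("ACGTACGT", scores, window=4, cutoff=20, min_length=3))  # ACG
print(trim_read("ACGTACGT", scores, window=4, cutoff=20, min_length=4))  # None
```

Fixed-length trimming is the special case `sequence[:x]`, which needs no quality scores at all.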




Velvet practical: Part 2
     • Paired-end (SRX008042)
                • Explore parameters
                • Set cut-offs
     • Analyse quality score (SRX008042)
                • Trimming reads




Velvet modules
     • Columbus (since Velvet 1.0)
                •   use reference sequence
                •   assist with alignment information
                •   local re-sequencing
                •   structural variants




Velvet modules
     • Oases
                • De novo transcriptome assembler
                • uses preliminary Velvet assembly
                • clusters contigs into loci
                • construct transcript isoforms using paired-end / long read
                  information
                • confidence score: describes uniqueness of a transcript in a locus




Read Simulation - Why?
     • Controlling the data
                •   Contamination
                •   Coverage distribution
                •   Sequencing errors
                •   Genome size
                •   Insert length
                •   Insert length distribution




Read Simulation - Why?
     • Make results comparable
                •   Assemblers
                •   Parameters
                •   Algorithms
                •   Assembly strategies
                •   Genome specific “features”
     • Robust
                • Introduce errors
                • Simulate SNPs
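
A toy read simulator shows how such controls can be exposed as parameters. This is a sketch under simple assumptions (uniform start positions, single-end reads, substitution errors only, forward strand only); it is not the simulator used in the practical, and all names are hypothetical.

```python
import random


def simulate_reads(genome, read_length, coverage, error_rate, seed=0):
    """Sample reads uniformly from the genome and add substitution errors."""
    rng = random.Random(seed)  # fixed seed keeps runs comparable
    n_reads = coverage * len(genome) // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


genome = "TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG"
reads = simulate_reads(genome, read_length=7, coverage=10, error_rate=0.01)
print(len(reads), reads[:3])
```

Because the true genome and every parameter are known, assemblies of such reads can be scored exactly, which is the point of the comparison above.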




Real data vs. simulation




                                   Mario Caccamo

Velvet practical: Part 3
     • Velvet
                • Long Reads
                • Hybrid Assembly
                • Mixed insert length libraries




Curtain
     • Assembly pipeline
     • Paired-end assembly for large genomes
     • Groups related contigs
     • Uses Velvet to assemble groups of related reads
     • Iterative approach




Curtain

                   Genome assembly pipeline

     Contigs -> Curtain: Map Reads -> Group Contigs -> Fill Bins -> Assemble -> Collect




Curtain

     • Set of input Contigs
     • Use established assemblers
                •   Velvet unpaired
                •   Cortex
                •   SGA
                •   ...




Curtain

     • Map reads to input contigs
     • SAM file support
                • bwa
                • maq




Curtain


     • Group Contigs using Paired-end information

     [Figure: numbered contigs 1 to 5. Each mapping read is binned together with its read pair.]
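
One way to picture the grouping step is as connected components: contigs linked by enough read pairs end up in the same bin. The sketch below uses a union-find structure for this. It is an illustration of the idea only and makes no claim about how Curtain implements grouping; `group_contigs`, `pair_links` and `min_pairs` are hypothetical names.

```python
def group_contigs(contigs, pair_links, min_pairs=1):
    """Group contigs connected by at least min_pairs read pairs.

    pair_links maps a (contig_a, contig_b) tuple to the number of read
    pairs with one read mapped to each contig.
    """
    parent = {contig: contig for contig in contigs}

    def find(contig):
        # Follow parent links to the group representative, shortening the path.
        while parent[contig] != contig:
            parent[contig] = parent[parent[contig]]
            contig = parent[contig]
        return contig

    for (a, b), n_pairs in pair_links.items():
        if n_pairs >= min_pairs:
            parent[find(a)] = find(b)

    bins = {}
    for contig in contigs:
        bins.setdefault(find(contig), []).append(contig)
    return sorted(bins.values())


links = {(1, 2): 5, (2, 3): 4, (4, 5): 1}
print(group_contigs([1, 2, 3, 4, 5], links, min_pairs=2))
# [[1, 2, 3], [4], [5]]
```

Raising `min_pairs` makes the grouping stricter, so a single stray pair (here between contigs 4 and 5) does not merge two bins.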




Curtain

     • Assemble each bin
                • Run Velvet using paired-end information
                • Bin-specific parameters
     • Run each bin individually
     • Highly parallelizable
     • Collect results
     • Start next iteration


Curtain
     •     Low memory footprint
     •     Scalable for large genomes
     •     Make use of cluster
     •     Available
                • www.ebi.ac.uk/egt
                • http://code.google.com/p/curtain/
     • Future announcements
                • http://groups.google.com/group/curtain-assembler
     • Future work
                • Long read support




Curtain practical
     • Run Curtain for Staphylococcus
                • Simulation data




Thanks ...




62   25.04.11   Velvet / Curtain

Mais conteúdo relacionado

Último

Último (20)

Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 

Destaque

How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
ThinkNow
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

Destaque (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)

2011-04-26_01-velvet-curtain-presentation

  • 1. Velvet / Curtain Matthias Haimel EBI is an Outstation of the European Molecular Biology Laboratory.
  • 2.  2 25.04.11 Velvet / Curtain
  • 3. Overview • De Bruijn graph • Velvet: theory, practice • Data formats and quality • Velvet: simulation data, multiple insert lengths • Curtain: theory, practice
  • 4. De Bruijn graph • A concept in combinatorial mathematics • In combinatorics, a de Bruijn graph is usually fully connected • http://en.wikipedia.org/wiki/De_Bruijn_graph • De Bruijn sequence • Related concept • Path through graph • Velvet • de Bruijn inspired graph structure
  • 5. De Bruijn graph (Velvet) • Representation of • a sequence based on short words (k-mers) • overlaps between words • K-mer: word of length k • Example with k = 5: the sequence GCCTTCCA gives the k-mers GCCTT, CCTTC, CTTCC, TTCCA, ... • Consecutive k-mers overlap by k-1 bases
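The k-mer decomposition on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Velvet's implementation; the function name `kmers` is hypothetical.

```python
def kmers(sequence, k):
    """Return every word of length k in `sequence`, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = kmers("GCCTTCCA", 5)
print(words)

# Consecutive k-mers share k-1 bases: the suffix of one word
# equals the prefix of the next.
print(all(a[1:] == b[:-1] for a, b in zip(words, words[1:])))
```

A sequence shorter than k simply yields no k-mers, which is one reason very short reads contribute nothing at large k.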
  • 6. De Bruijn graph (Velvet) • Sequences GCCTTCCAATTT and GCCTTCAAATTT • [Figure: graph in which the two sequences share a start, diverge at the differing base (C or A) through CTTC, TTCC, ..., CAATT and CTTC, TTCA, ..., AAATT, and rejoin at AATTT]
  • 7. De Bruijn graph representations (Velvet) • [Figure node labels: TTCA, ATTC, TCAG] • Error free, no repeat, no polymorphism • Repeat > k-mer length • SNP, variant < k-mer length • Structural variant, inversion • Structural variant, deletion • …
  • 8. Example • Sequence: TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG • Reads: AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGAGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG
  • 9. Example • Read: GTCGAGG • GTCG (1x)
  • 10. Example • Read: GTCGAGG • GTCG (1x), TCGA (1x)
  • 11. Example • Read: GTCGAGG • GTCG (1x), TCGA (1x), CGAG (1x)
  • 12. Example • Read: GTCGAGG • GTCG (1x), TCGA (1x), CGAG (1x), GAGG (1x)
  • 13. Example • New read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (1x)
  • 14. Example • Read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x)
  • 15. Example • Read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x), AGGC (1x)
  • 16. Example • Read: CGAGGCT • GTCG (1x), TCGA (1x), CGAG (2x), GAGG (2x), AGGC (1x), GGCT (1x)
  • 17. Example • New read: TCGACGC • GTCG (1x), TCGA (2x), CGAG (2x), GAGG (2x), AGGC (1x)
  • 18. Example • Read: TCGACGC • GTCG (1x), TCGA (2x), CGAG (2x), GAGG (2x), AGGC (1x) • Branch: CGAC (1x), GACG (1x), ACGC (1x)
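The read-by-read counting in slides 9 to 18 can be reproduced with a short Python sketch. It assumes k = 4 as in the example, and it ignores the reverse strand and every other extension Velvet adds; `count_kmers` and `successors` are hypothetical helper names, not part of Velvet.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def successors(counts, kmer):
    """Observed k-mers whose first k-1 bases equal the last k-1 bases of `kmer`."""
    return sorted(other for other in counts if kmer[1:] == other[:-1])


counts = count_kmers(["GTCGAGG", "CGAGGCT", "TCGACGC"], 4)
for kmer, n in counts.items():
    print(f"{kmer} ({n}x)")

# TCGA is followed by two different k-mers, which is where the graph branches.
print(successors(counts, "TCGA"))
```

A k-mer with more than one successor marks a branch point; the low-coverage side (here the 1x path from the third read) is the kind of path that later simplification steps may prune.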
  • 19. Example • etc… • GATT (1x) • TGAG (9x), ATGA (8x), GATG (5x), CGAT (6x), CCGA (7x), TCCG (7x), ATCC (7x), GATC (8x), AGAT (8x) • AGAA (1x) • GCTC (2x), CTCT (1x), TCTA (2x), CTAG (2x) • TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), CGAG (8x), GAGG (16x), AGGC (16x), GGCT (11x), TAGA (16x), AGAG (9x), GAGA (12x), AGAC (9x), GACA (8x), ACAG (5x) • GCTT (8x), CTTT (8x), TTTA (8x), TTAG (12x) • CGAC (1x), GACG (1x), ACGC (1x)
  • 20. Example • After simplification… • GATT, AGAT, GATCCGATGAG, AGAA, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG, CGACGC
  • 21. Example • Tips removed… • AGAT, GATCCGATGAG, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG
  • 22. De Bruijn graph biology extensions (Velvet) • Handling of reverse strand • DNA is read in two directions • Paired-end data • Handling small differences, which are “uninteresting” • Errors in sequencing technology • Memory • regularly uses 80 to 100 GB of real memory • easily reaches 1 TB real memory requirements
  • 23. Read variety • Short reads ~75 bp • Illumina / Solexa • SOLiD (colour space) • Long reads 500-1000 bp • 454 reads • Sanger capillary reads • Paired-end reads • Short reads • short insert length • Mate pair reads • Short reads • long insert length
  • 24. Paired-End • Mate Pair
  • 25. Short paired-end / mate pair reads • Velvet expects Illumina paired-end orientation: (L-> <-R)
  • 26. Short paired-end / mate pair reads • Illumina mate-pair orientation: (<-L R->) • Reverse complement mate-pair reads to get the paired-end orientation (L-> <-R)
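Reverse complementing a read is a small string operation. The sketch below assumes plain A/C/G/T/N sequences and, for FASTQ input, that the quality string has to be reversed along with the bases so each score stays attached to its base; both function names are hypothetical.

```python
# Translation table for complementing bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(_COMPLEMENT)[::-1]


def flip_fastq_record(header, sequence, quality):
    """Reverse complement one FASTQ record, keeping qualities aligned to bases."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return header, reverse_complement(sequence), quality[::-1]


print(reverse_complement("GCCTT"))
print(flip_fastq_record("@read_1", "GATTT", "IIII#"))
```

Applying `flip_fastq_record` to each read of a mate pair turns the outward-facing (<-L R->) layout into the inward-facing (L-> <-R) layout.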
  • 27. Velvet algorithms • Remove Bubbles • Tour Bus • Velvet parameters • -max_branch_length • -max_divergence • -max_gap_count
  • 28. Example • AGAT, GATCCGATGAG, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG • Bubble between GGCT (11x) and TAGA (16x): GCTC (2x), CTCT (1x), TCTA (2x), CTAG (2x) versus GCTT (8x), CTTT (8x), TTTA (8x), TTAG (12x)
  • 29. Example • Bubbles removed… by TourBus • AGAT, GATCCGATGAG, TAGTCGA, CGAG, GAGGCT, GGCT, GCTTTAG, TAGA, AGAGA, AGACAG
  • 30. Example • Final simplification… • AGATCCGATGAG, TAGTCGAG, GAGGCTTTAGA, AGAGACAG
  • 31. Example • TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG • Final simplification… • AGATCCGATGAG, TAGTCGAG, GAGGCTTTAGA, AGAGACAG • One possible walk through the graph ... • TAGTCGAG, GAGGCTTTAGA, AGATCCGATGAG, GAGGCTTTAGA, AGAGACAG
  • 32. N50 • Total • N90 • N50 • N10
  • 33. N50 • Total: 4,295,113 bp • N90: 439 bp • N50: 3,119 bp • N10: 13,519 bp
  • 34. N50 • N50 is the length of the smallest contig in the set that • contains the fewest (largest) contigs • whose combined length represents at least 50% of the assembly • N10 • the same, with at least 10% of the assembly • http://www.broadinstitute.org/crd/wiki/index.php/N50
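The N50 definition above translates directly into code: sort the contigs from largest to smallest and add them up until the running total reaches the requested share of the assembly. The sketch below is a minimal illustration with made-up contig lengths; the function name `nx` is hypothetical.

```python
def nx(contig_lengths, percent):
    """Length of the smallest contig in the smallest set of largest contigs
    whose combined length is at least `percent` % of the total."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("no contigs given")


contigs = [10, 20, 30, 40]  # made-up lengths, 100 bp in total
print("N10:", nx(contigs, 10))
print("N50:", nx(contigs, 50))
print("N90:", nx(contigs, 90))
```

Because the contigs are taken largest first, N10 is never smaller than N50, and N50 is never smaller than N90, which matches the ordering of the figures on slide 33.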
  • 35. Velvet practical: Part 1 • Compile • Single end (ERX001300) • K-mer length • Coverage cut-offs • Whole genome sequence as input??? • Staphylococcus aureus MRSA252
  • 36. Velvet algorithms • Long read information • Rock Band • Velvet parameters • -long_mult_cutoff
  • 37. Velvet algorithms • Paired-end information • Pebble • Velvet parameters • -min_pair_count • Once all distances and variances are computed: simple greedy extension outwards from the main contigs
  • 38. Paired-end in Velvet • Hugely improves quality of assembly • Insert length greater than repeat • greater than the length of the most common genomic repeat • Mixed insert length improves results • Short: helps for local assembly • Long: get over repeats • Large genomes • Very memory intensive • Calculation intensive
  • 39. Data formats and quality • Fasta • Extensions: .fasta, .fa, ? • Header: >read_1 • Sequence: TATAATATTTAT... • Fastq • Extensions: .fastq, .fq, ? • Header: @SEQ_ID • Sequence: GATTTGGGGTTCAAAGC • Separator: + • Quality: !''*((((***+))%%%
  • 40. FASTQ paired • Read name suffixes: .F / .R or /1 / /2 • Forward and reverse reads interleaved: @SRR022863.1.F ATATAGATGTACATAAATTAGTTGAAGTATATGAACG + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII @SRR022863.1.R TTCACCCATTTTATCCATGATTTTGTTCTTTCTCTTC + IIIIIHIIIIIIII3III.,IIII&II6II-))&'I0 • Forward and reverse reads side by side: @SRR022863.1.F ATATAGATGTACATAAATTAGT... + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAIIII | @SRR022863.1.R TTCACCCATTTTATCCATGATTTTGTT... + IIIIIHIIIIIIII3III.,IIII&II6II-))&'I0 • @SRR022863.2.F TTATGAATTATTAATAAGTGCT... + IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII | @SRR022863.2.R CATAAAAAAAGAAAATGTACTCTTTAC... + IIII)0&A,%.&9$8I4+A;I)4II)&%-I$I%#)II
  • 41. Quality score • Velvet does NOT use quality score!!! • Error correction of de Bruijn graph • p • the probability that the corresponding base call is incorrect • Phred quality score: Q = -10 log10(p) • 10 -> 1 in 10 • 40 -> 1 in 10,000 • Odds ratio: Q = -10 log10(p / (1 - p)) • earlier versions of Solexa pipeline • differs mainly at lower levels
  • 42. Quality encoding • !''*((((***+))%%% • One value per base • Integer mapping based on ASCII encoding • probability of incorrect base call • Sanger format • Phred score • ASCII 33 – 126 -> 0 – 93 • Rarely exceeds 60 • ! = 33 -> 0 • B = 66 -> 33 • Illumina 1.5+ • Phred score • ASCII 59 – 126 -> -5 – 62 • Only 2 – 40 expected • ! = 33 -> (does not exist) • B = 66 -> 2
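Decoding a quality string amounts to subtracting an ASCII offset from each character. The sketch below assumes the two offsets implied by the examples on this slide (33 for Sanger, 64 for Illumina 1.5+) and the standard Phred relation between score and error probability; the constant and function names are hypothetical.

```python
SANGER_OFFSET = 33        # Sanger format: '!' (ASCII 33) maps to Phred 0
ILLUMINA_15_OFFSET = 64   # Illumina 1.5+: 'B' (ASCII 66) maps to Phred 2


def phred_scores(quality_string, offset=SANGER_OFFSET):
    """One integer score per base: ASCII code minus the format's offset."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


print(phred_scores("!''*((((***+))%%%"))
print(phred_scores("B", SANGER_OFFSET), phred_scores("B", ILLUMINA_15_OFFSET))
print(error_probability(10), error_probability(40))
```

The same character decodes to very different scores under the two offsets, so the encoding of a FASTQ file has to be known before any quality-based filtering.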
  • 43. Quality encoding • Wikipedia
  • 44. Quality trimming • Good / Bad? • [Plot: quality score against bp position in read]
  • 45. Quality trimming • Fixed length trimming • Cut-off at position x • Adaptive trimming • Quality score cut-off • Minimum sequence length • Sliding window • Window size • Quality score cut-off • Use average quality value of window
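Two of the trimming strategies listed above can be sketched as follows. This is one possible reading of the slide, not the procedure of any particular tool: the read is cut at the start of the first window whose average quality falls below the cut-off, and reads that end up shorter than the minimum length are discarded. The function names and default values are assumptions.

```python
def fixed_length_trim(sequence, scores, position):
    """Fixed length trimming: keep only the first `position` bases."""
    return sequence[:position], scores[:position]


def sliding_window_trim(sequence, scores, window=4, cutoff=20, min_length=0):
    """Cut the read where the mean quality of a window first drops below `cutoff`.

    Returns (sequence, scores) after trimming, or None if the trimmed read
    is shorter than `min_length`.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    end = len(sequence)
    for start in range(len(sequence) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < cutoff:
            end = start
            break
    if end < min_length:
        return None
    return sequence[:end], scores[:end]


read = "ACGTACGT"
quals = [30, 30, 30, 30, 30, 10, 10, 10]
print(fixed_length_trim(read, quals, 3))
print(sliding_window_trim(read, quals, window=4, cutoff=20))
print(sliding_window_trim(read, quals, window=4, cutoff=20, min_length=5))
```

Averaging over a window tolerates a single poor base inside an otherwise good stretch, whereas a per-base cut-off would truncate the read at the first low score.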
  • 46. Velvet practical: Part 2 • Paired-end (SRX008042) • Explore parameters • Set cut-offs • Analyse quality score (SRX008042) • Trimming reads
  • 47. Velvet modules • Columbus (since Velvet 1.0) • use reference sequence • assist with alignment information • local re-sequencing • structural variants
  • 48. Velvet modules • Oases • De novo transcriptome assembler • uses preliminary Velvet assembly • clusters contigs into loci • constructs transcript isoforms using paired-end / long read information • confidence score: describes uniqueness of a transcript in a locus
  • 49. Read Simulation - Why? • Controlling the data • Contamination • Coverage distribution • Sequencing errors • Genome size • Insert length • Insert length distribution
  • 50. Read Simulation - Why? • Make results comparable • Assemblers • Parameters • Algorithms • Assembly strategies • Genome specific “features” • Robust • Introduce errors • Simulate SNPs
  • 51. Real data vs. simulation • Mario Caccamo
  • 52. Real data vs. simulation • Mario Caccamo
  • 53. Velvet practical: Part 3 • Velvet • Long Reads • Hybrid Assembly • Mixed insert length libraries
  • 54. Curtain • Assembly pipeline • Paired-end assembly for large genomes • Group related contigs • Uses Velvet to assemble groups of related reads • Iterative approach
  • 55. Curtain • Genome assembly pipeline • [Diagram: Contigs, Map Reads, Group Contigs, Fill Bins, Assemble, Collect]
  • 56. Curtain • Set of input Contigs • Use established assemblers • Velvet unpaired • Cortex • SGA • ...
  • 57. Curtain • Map reads to input contigs • SAM file support • bwa • maq
  • 58. Curtain • Group Contigs using Paired-end information • [Figure labels: 1 2 3 4 5, bin, mapping read & read pair]
  • 59. Curtain • Assemble each bin • Run velvet using paired-end information • bin specific parameters • Run each bin individually • Highly parallelizable • Collect results • Start next iteration • [Figure: one velvet run per bin, feeding Results]
  • 60. Curtain • Low memory footprint • Scalable for large genomes • Make use of cluster • Available • www.ebi.ac.uk/egt • http://code.google.com/p/curtain/ • Future announcements • http://groups.google.com/group/curtain-assembler • Future work • Long read support
  • 61. Curtain practical • Run Curtain for Staphylococcus • Simulation data
  • 62. Thanks ...