3. Overview
• De Bruijn Graph
• Velvet
• Theory
• Practice
• Data formats and quality
• Velvet
• Simulation data
• Multiple insert lengths
• Curtain
• Theory
• Practice
3 25.04.11 Velvet / Curtain
4. De Bruijn graph
• A concept in combinatorial mathematics
• In combinatorics, a de Bruijn graph is usually fully connected
• http://en.wikipedia.org/wiki/De_Bruijn_graph
• De Bruijn sequence
• Related concept
• Path through graph
• Velvet
• de Bruijn inspired graph structure
5. De Bruijn graph (Velvet)
• Representation of
• a sequence based on short words (k-mers)
• overlaps between words
• K-mer: word of length k
• K=5
• Sequence: GCCTTCCA
• Consecutive k-mers overlap by k-1 bases
GCCTT
 CCTTC
  CTTCC
   TTCCA
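The decomposition of GCCTTCCA into words of length 5 can be sketched in a few lines of Python. The function name `kmers` is an illustrative choice and not part of Velvet.

```python
def kmers(seq, k):
    """Return all overlapping words of length k, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


words = kmers("GCCTTCCA", 5)
print(words)  # ['GCCTT', 'CCTTC', 'CTTCC', 'TTCCA']

# Neighbouring words share k-1 = 4 bases: the suffix of one is the prefix of the next.
overlaps_ok = all(a[1:] == b[:-1] for a, b in zip(words, words[1:]))
print(overlaps_ok)  # True
```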
6. De Bruijn graph (Velvet)
GCCTTCCAATTT
GCCTTCAAATTT
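[Figure: de Bruijn graph of the two sequences. Both paths share GCCTT and CCTTC, split into the branches CTTCC, TTCCA, ..., CAATT and CTTCA, TTCAA, ..., AAATT, and rejoin at AATTT.]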
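For the two sequences GCCTTCCAATTT and GCCTTCAAATTT, the points where the paths split and rejoin can be found with a small sketch. It assumes k=5 and a plain dictionary of successor sets. This is an illustration of the idea, not Velvet's internal data structure, and `build_graph` is a made-up name.

```python
from collections import defaultdict


def build_graph(sequences, k):
    """Map each k-mer to the set of k-mers that follow it in any input sequence."""
    successors = defaultdict(set)
    for seq in sequences:
        words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for a, b in zip(words, words[1:]):
            successors[a].add(b)
    return successors


graph = build_graph(["GCCTTCCAATTT", "GCCTTCAAATTT"], k=5)

# Nodes where the two paths split (more than one successor).
branch_points = {node: sorted(nxt) for node, nxt in graph.items() if len(nxt) > 1}

# Nodes where the paths rejoin (more than one predecessor).
predecessors = defaultdict(set)
for node, nxt in graph.items():
    for n in nxt:
        predecessors[n].add(node)
merge_points = {node: sorted(prev) for node, prev in predecessors.items() if len(prev) > 1}

print(branch_points)  # {'CCTTC': ['CTTCA', 'CTTCC']}
print(merge_points)   # {'AATTT': ['AAATT', 'CAATT']}
```

The single-base difference between the sequences produces two parallel branches between one branch point and one merge point.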
20. Example
After simplification…
[Figure: graph after simplification. Nodes: GATT, AGAT, GATCCGATGAG, AGAA, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG, CGACGC]
21. Example
Tips removed…
[Figure: graph after tip removal. Nodes: AGAT, GATCCGATGAG, GCTCTAG, TAGTCGA, CGAG, GAGGCT, GGCT, TAGA, AGAGA, AGACAG, GCTTTAG]
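Comparing this slide with the previous one, the nodes GATT, AGAA and CGACGC have disappeared. A minimal sketch of how such tips might be detected follows. The edges are inferred from 3-base overlaps between node labels and cover only part of the graph, the word length K=4 is assumed, and the rule (a dead end shorter than 2k hanging off a node that has another way out) is a simplification that ignores read coverage.

```python
K = 4  # assumed word length; connected node labels share K-1 = 3 bases

# Part of the graph before tip removal; edges are inferred from 3-base overlaps
# between node labels (an assumption, not taken from the slide itself).
successors = {
    "TAGTCGA": ["CGAG", "CGACGC"],
    "CGAG": ["GAGGCT"],
    "TAGA": ["AGAT", "AGAA", "AGAGA"],
    "AGAT": ["GATT", "GATCCGATGAG"],
    "GATCCGATGAG": ["GAGGCT"],
    "AGAGA": ["AGACAG"],
}


def find_tips(successors, k):
    """Return dead-end nodes shorter than 2k that hang off a node with another way out."""
    tips = []
    for parent, children in successors.items():
        if len(children) < 2:
            continue  # no competing branch, so a dead end here is not treated as a tip
        for child in children:
            is_dead_end = not successors.get(child)
            if is_dead_end and len(child) < 2 * k:
                tips.append(child)
    return sorted(tips)


tips = find_tips(successors, K)
print(tips)  # ['AGAA', 'CGACGC', 'GATT']
```

AGACAG and GAGGCT are also dead ends in this partial graph, but they are kept because the node leading to them has no alternative branch.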
22. De Bruijn graph biology extensions (Velvet)
• Handling of reverse strand
• DNA is read in two directions
• Paired-end data
• Handling small differences, which are “uninteresting”
• Errors in sequencing technology
• Memory
• regularly uses 80 to 100 GB of real memory
• can easily reach 1 TB of real memory
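Because DNA is read in two directions, a word and its reverse complement describe the same piece of sequence. One common convention, shown here as a sketch and not necessarily how Velvet stores its nodes, is to pick a single "canonical" representative for each pair. The names `reverse_complement` and `canonical` are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


print(reverse_complement("GCCTT"))               # AAGGC
print(canonical("GCCTT") == canonical("AAGGC"))  # True
```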
23. Read variety
• Short reads ~75bp
• Illumina / Solexa
• SOLiD (colour space)
• Long reads 500-1000 bp
• 454 reads
• Sanger capillary reads
• Paired-end reads
• Short reads
• short insert length
• Mate pair reads
• Short reads
• long insert length
29. Example
Bubbles removed… by TourBus
[Figure: graph after bubble removal. Nodes: AGAT, GATCCGATGAG, TAGTCGA, CGAG, GAGGCT, GGCT, GCTTTAG, TAGA, AGAGA, AGACAG]
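In this example the branch GCTCTAG has been merged away and GCTTTAG kept. The sketch below shows one simplified way to make such a decision: if two parallel branches are nearly identical, keep the one with more read support. The coverage numbers, the mismatch threshold and the function names are hypothetical, and the real TourBus procedure is more involved than this.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def resolve_bubble(path_a, path_b, coverage, max_mismatches=1):
    """Return (kept, removed) if the two branches are similar enough, else None."""
    if len(path_a) != len(path_b) or hamming(path_a, path_b) > max_mismatches:
        return None  # too different: treat both branches as real sequence
    kept, removed = sorted([path_a, path_b], key=lambda p: coverage[p], reverse=True)
    return kept, removed


# Hypothetical read coverage for the two parallel branches.
coverage = {"GCTTTAG": 12, "GCTCTAG": 2}
print(resolve_bubble("GCTTTAG", "GCTCTAG", coverage))  # ('GCTTTAG', 'GCTCTAG')
```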
30. Example
Final simplification…
[Figure: final graph. Nodes: AGATCCGATGAG, TAGTCGAG, GAGGCTTTAGA, AGAGACAG]
31. Example
TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
Final simplification…
[Figure: final graph. Nodes: AGATCCGATGAG, TAGTCGAG, GAGGCTTTAGA, AGAGACAG]
One possible walk through the graph ...
TAGTCGAG
GAGGCTTTAGA
AGATCCGATGAG
GAGGCTTTAGA
AGAGACAG
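The walk listed above can be turned back into a sequence by joining the nodes and dropping the bases each node shares with its predecessor. The sketch assumes an overlap of 3 bases, which is what the node labels in this example show. `merge_walk` is an illustrative name.

```python
def merge_walk(nodes, overlap):
    """Concatenate node sequences, dropping the bases shared with the previous node."""
    sequence = nodes[0]
    for node in nodes[1:]:
        if sequence[-overlap:] != node[:overlap]:
            raise ValueError(f"{node} does not overlap the sequence built so far")
        sequence += node[overlap:]
    return sequence


walk = ["TAGTCGAG", "GAGGCTTTAGA", "AGATCCGATGAG", "GAGGCTTTAGA", "AGAGACAG"]
print(merge_walk(walk, overlap=3))
# TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
```

The node GAGGCTTTAGA is visited twice, which is how the repeat in the original sequence is represented in the graph.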
34. N50
• N50 is the length of the smallest contig in the set that
• contains the fewest (largest) contigs
• whose combined length represents at least 50% of the assembly
• N10
• same definition, with at least 10% of the assembly
http://www.broadinstitute.org/crd/wiki/index.php/N50
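The definition translates directly into a short calculation: sort the contigs from largest to smallest and add them up until the running total reaches the requested share of the assembly. The contig lengths below are made up for illustration, and `nx` is an illustrative name.

```python
def nx(contig_lengths, percent):
    """Length of the smallest contig among the largest contigs that together
    cover at least `percent` percent of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 100 >= percent * total:
            return length
    raise ValueError("no contigs given")


lengths = [80, 70, 50, 40, 30, 20, 10]  # hypothetical contig lengths, total 300
print(nx(lengths, 50))  # 70
print(nx(lengths, 10))  # 80
```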
35. Velvet practical: Part 1
• Compile
• Single end (ERX001300)
• K-mer length
• Coverage cut-offs
• Whole genome sequence as input???
• Staphylococcus aureus MRSA252
36. Velvet algorithms
• Long read information
• Rock Band
• Velvet parameters
• -long_mult_cutoff
37. Velvet algorithms
• Paired-end information
• Pebble
• Velvet parameters
• -min_pair_count
Once all distances and variances are computed:
simple greedy extension outward from the main contigs
38. Paired-end in Velvet
• Hugely improves quality of assembly
• Insert length greater than repeat
• greater than the length of the most common genomic repeat
• Mixed insert length improves results
• Short: helps for local assembly
• Long: get over repeats
• Large genomes
• Very memory intensive
• Calculation intensive
41. Quality score
• Velvet does NOT use quality score!!!
• Error correction of de Bruijn graph
• p: the probability that the corresponding base call is incorrect
• Phred quality score: Q = -10 log10(p)
• 10 -> 1 in 10
• 40 -> 1 in 10,000
• Odds ratio: Q = -10 log10(p / (1 - p))
• used by earlier versions of the Solexa pipeline
• differs from Phred mainly at lower quality levels
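The two scales can be compared numerically. The sketch below uses the standard Phred definition and the odds-ratio form used by early Solexa pipelines. The function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Phred score Q -> probability p that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred scale: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def error_prob_to_solexa(p):
    """Odds-ratio scale of early Solexa pipelines: Q = -10 * log10(p / (1 - p))."""
    return -10 * math.log10(p / (1 - p))


# Print p, the Phred score and the odds-ratio score side by side.
for p in (0.5, 0.1, 0.0001):
    print(p, round(error_prob_to_phred(p), 2), round(error_prob_to_solexa(p), 2))
```

For small error probabilities the two scores are almost identical. They drift apart as p grows, which is the "lower levels" case mentioned on the slide.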
42. Quality encoding
• !''*((((***+))%%%
• One value per base
• Integer mapping based on ASCII encoding
• probability of incorrect base call
• Sanger format
• Phred score, ASCII offset 33
• ASCII 33 – 126 -> 0 – 93
• Rarely exceeds 60
• ! = 33 -> 0
• B = 66 -> 33
• Illumina 1.5+
• Phred score, ASCII offset 64
• ASCII 64 – 126 -> 0 – 62
• Only 2 – 40 expected
• ! = 33 -> (does not exist)
• B = 66 -> 2
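Decoding a quality string is a matter of subtracting the right offset from each character code. The sketch below assumes offset 33 for the Sanger format and offset 64 for Illumina 1.5+. The function and constant names are illustrative.

```python
def decode_quality(quality_string, offset):
    """Map each ASCII character to an integer score by subtracting the offset."""
    scores = [ord(ch) - offset for ch in quality_string]
    if scores and min(scores) < 0:
        raise ValueError("character below the offset: wrong encoding assumed?")
    return scores


SANGER_OFFSET = 33    # '!' (ASCII 33) -> 0
ILLUMINA_OFFSET = 64  # '@' (ASCII 64) -> 0

print(decode_quality("!''*((((***+))%%%", SANGER_OFFSET))
# [0, 6, 6, 9, 7, 7, 7, 7, 9, 9, 9, 10, 8, 8, 4, 4, 4]
print(decode_quality("B", SANGER_OFFSET), decode_quality("B", ILLUMINA_OFFSET))
# [33] [2]
```

The same character therefore means very different qualities in the two formats, so the encoding has to be known before the scores are interpreted.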
47. Velvet modules
• Columbus (since Velvet 1.0)
• use reference sequence
• assist with alignment information
• local re-sequencing
• structural variants
48. Velvet modules
• Oases
• De novo transcriptome assembler
• uses preliminary Velvet assembly
• clusters contigs into loci
• construct transcript isoforms using paired-end / long read information
• confidence score: describes uniqueness of a transcript in a locus
49. Read Simulation - Why?
• Controlling the data
• Contamination
• Coverage distribution
• Sequencing errors
• Genome size
• Insert length
• Insert length distribution
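A toy simulator makes these knobs concrete: genome size, coverage, insert length and its spread, and the sequencing error rate are all explicit parameters. This is a minimal sketch under simplifying assumptions (uniform coverage, normally distributed insert lengths, substitution errors only, both reads reported on the forward strand). All names and parameter values are illustrative.

```python
import random


def add_errors(read, error_rate, rng):
    """Substitute each base with a different one with probability error_rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


def simulate_pairs(genome, read_len, coverage, insert_mean, insert_sd, error_rate, seed=0):
    """Sample read pairs from both ends of randomly placed fragments."""
    rng = random.Random(seed)
    n_pairs = int(coverage * len(genome) / (2 * read_len))
    pairs = []
    while len(pairs) < n_pairs:
        insert = int(round(rng.gauss(insert_mean, insert_sd)))
        if insert < read_len or insert > len(genome):
            continue  # resample implausible insert lengths
        start = rng.randrange(len(genome) - insert + 1)
        fragment = genome[start:start + insert]
        reads = (fragment[:read_len], fragment[-read_len:])
        pairs.append(tuple(add_errors(r, error_rate, rng) for r in reads))
    return pairs


genome = "".join(random.Random(1).choice("ACGT") for _ in range(2000))  # toy genome
pairs = simulate_pairs(genome, read_len=36, coverage=20, insert_mean=200,
                       insert_sd=20, error_rate=0.01)
print(len(pairs))  # 555
```

Because the true genome and every parameter are known, the output of an assembler run on such data can be checked against the expected result.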
54. Curtain
• assembly pipeline
• Paired-end assembly for large genomes
• Group related contigs
• Uses Velvet to assemble groups of related reads
• Iterative approach
55. Curtain
[Figure: Curtain genome assembly pipeline. Steps: Map, Group, Fill, Assemble, Collect. Data: Reads, Contigs, Bins.]
56. Curtain
• Set of input Contigs
• Use established assemblers
• Velvet unpaired
• Cortex
• SGA
• ...
57. Curtain
• Map reads to input contigs
• SAM file support
• bwa
• maq
58. Curtain
• Group Contigs using Paired-end information
[Figure: contigs 1 to 5 grouped into bins. Legend: bin, mapping read and read pair.]
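One way to picture the grouping step is as a connected-components problem: contigs become linked when enough read pairs have one end mapped to each, and every connected group forms a bin. The sketch below uses a small union-find for this. It is an illustration of the idea only, not Curtain's actual implementation, and the link counts, the `min_pairs` threshold and the function name are hypothetical.

```python
def group_contigs(contigs, pair_links, min_pairs=2):
    """Put contigs into the same bin when enough read pairs map across them."""
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving keeps lookups short
            c = parent[c]
        return c

    for (a, b), n_pairs in pair_links.items():
        if n_pairs >= min_pairs:
            parent[find(a)] = find(b)

    bins = {}
    for c in contigs:
        bins.setdefault(find(c), []).append(c)
    return sorted(bins.values())


# Hypothetical counts of read pairs whose two ends map to different contigs.
links = {(1, 2): 14, (2, 3): 9, (3, 4): 1, (4, 5): 11}
print(group_contigs([1, 2, 3, 4, 5], links))  # [[1, 2, 3], [4, 5]]
```

The single pair between contigs 3 and 4 falls below the threshold, so the five contigs end up in two bins that can be assembled independently.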
59. Curtain
• Assemble each bin
• Run velvet using paired-end information
• bin specific parameters
• Run Velvet on each bin individually
• Highly parallelizable
• Collect results
• Start next iteration
60. Curtain
• Low memory footprint
• Scalable for large genomes
• Makes use of a compute cluster
• Available
• www.ebi.ac.uk/egt
• http://code.google.com/p/curtain/
• Future announcements
• http://groups.google.com/group/curtain-assembler
• Future work
• Long read support
61. Curtain practical
• Run Curtain for Staphylococcus
• Simulation data