SlideShare uma empresa Scribd logo
1 de 66
Baixar para ler offline
De novo Genome Assembly
A/Prof Torsten Seemann
Winter School in Mathematical & Computational Biology - Brisbane, Australia - 4 July 2016
Introduction
The human genome has 47 pieces
MT
(or XY)
The shortest piece is 48,000,000 bp
Total haploid size
3,200,000,000 bp
(3.2 Gbp)
Total diploid size
6,400,000,000 bp
(6.4 Gbp)
∑ = 6,400,000,000 bp
Human DNA iSequencer ™ 46 chromosomal
and1 mitochondrial
sequences
In an ideal world ...
AGTCTAGGATTCGCTATAG
ATTCAGGCTCTGATATATT
TCGCGGCATTAGCTAGAGA
TCTCGAGATTCGTCCCAGT
CTAGGATTCGCTAT
AAGTCTAAGATTC...
The real world ( for now )
Short
fragments
Reads are stored in “FASTQ” files
Genome
Sequencing
Assemble Compare
Genome assembly
(the red pill)
De novo genome assembly
Ideally, one sequence per replicon.
Millions of short
sequences
(reads)
A few long
sequences
(contigs)
Reconstruct the original genome
from the sequence reads only
De novo genome assembly
“From scratch”
An example
A small “genome”
Friends,
Romans,
countrymen,
lend me your ears;
I’ll return
them
tomorrow!
• Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
Shakespearomics
Whoops!
I dropped
them.
• Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
• Overlaps
Friends, Rom
ds, Romans, count
ns, countrymen, le
crymen, lend me
send me your ears;
Shakespearomics
I am good
with words.
• Reads
ds, Romans, count
ns, countrymen, le
Friends, Rom
send me your ears;
crymen, lend me
• Overlaps
Friends, Rom
ds, Romans, count
ns, countrymen, le
crymen, lend me
send me your ears;
• Majority consensus
Friends, Romans, countrymen, lend me your ears; (1 contig)
Shakespearomics
We have
reached a
consensus !
Overlap - Layout - Consensus
Amplified DNA
Shear DNA
Sequenced reads
Overlaps
Layout
Consensus ↠ “Contigs”
Assembly graphs
Overlap graph
Another example is this
Overlaps find
Do the graph traverse
Size matters not. Look at me. Judge me by my size, do you?
Size matters not. Look at me. Judge me by my soze, do you?
2 supporting reads
1 supporting reads
So far, so good.
Why is it so hard?
What makes a jigsaw puzzle hard?
Repetitive
regions
Lots of
pieces
Missing
pieces
Multiple
copies
No corners
(circular
genomes)
No box
Dirty
pieces
Frayed
pieces : .,
What makes genome assembly hard
Size of the human genome = 3.2 x 109
bp (3,200,000,000)
Typical short read length = 102
bp (100)
A puzzle with millions to billions of pieces
Storing in RAM is a challenge ⇢ “succinct data structure”
1. Many pieces (read length is very short compared to the genome)
What makes genome assembly hard
Finding overlaps means
examining every pair of reads
Comparisons = N×(N-1)/2
~ N2
Lots of smart tricks to reduce
this close to ~N
2. Lots of overlaps
What makes genome assembly hard
ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATA
TATATATA TATATATA TATATATA TATATATA TATATATA
TATATATA TATATATA TATATATA
TATATATA TATATATA TATATATA
3. Lots of sky (short repeats)
How could we possibly assemble this segment of DNA?
All the reads are the same!
What makes genome assembly hard
Read 1: GGAACCTTTGGCCCTGT
Read 2: GGCGCTGTCCATTTTAGAAACC
4. Dirty pieces (sequencing errors)
1. The error in Read 2 might prevent us from seeing that it
overlaps with Read 1
2. How would we know which is the correct sequence?
What makes genome assembly hard
5. Multiple copies (long repeats)
Our old nemesis the REPEAT !
Gene Copy 1 Gene Copy 2
Repeats
What is a repeat?
A segment of DNA
that occurs more than once
in the genome
Major classes of repeats in the human genome
Repeat Class Arrangement Coverage (Hg) Length (bp)
Satellite (micro, mini) Tandem 3% 2-100
SINE Interspersed 15% 100-300
Transposable elements Interspersed 12% 200-5k
LINE Interspersed 21% 500-8k
rDNA Tandem 0.01% 2k-43k
Segmental Duplications Tandem or
Interspersed
0.2% 1k-100k
Repeats
Repeat copy 1 Repeat copy 2
Collapsed repeat consensus
1 locus
4 contigs
Repeats are hubs in the graph
Long reads can span repeats
Repeat copy 1 Repeat copy 2
long
reads
1 contig
Draft vs Finished genomes (bacteria)
250 bp - Illumina - $250 12,000 bp - Pacbio - $2500
The two laws of repeats
1. It is impossible to resolve repeats of length L
unless you have reads longer than L.
2. It is impossible to resolve repeats of length L
unless you have reads longer than L.
How much data do we need?
Coverage vs Depth
Coverage
Fraction of the genome
sequenced by
at least one read
Depth
Average number of reads
that cover any given region
Intuitively more reads should increase coverage and depth
Example: Read length = ⅛ Genome Length
Coverage = ⅜
Depth = ⅜
Genome
Reads
Example: Read length = ⅛ Genome Length
Coverage = 4.5/8
Depth = 5/8
Genome
Reads
Newly Covered Regions = 1.5 Reads
Example: Read length = ⅛ Genome Length
Coverage = 5.2/8
Depth = 7/8
Genome
Reads
Newly Covered Regions = 0.7 Reads
Example: Read length = ⅛ Genome Length
Coverage = 6.7/8
Depth = 10/8
Genome
Reads
Newly Covered Regions = 1.5 Reads
Depth is easy to calculate
Depth = N × L / G
Depth = Y / G
N = Number of reads
L = Length of a read
G = Genome length
Y = Sequence yield (N x L)
Example: Coverage of the Tarantula genome
“The size estimate of the tarantula
genome based on k-mer analysis is 6 Gb
and we sequenced at 40x coverage from
a single female A. geniculata”
Sangaard et al 2014. Nature Comm (5) p1-11
Depth = N × L / G
40 = N x 100 / (6 Billion)
N = 40 * 6 Billion / 100
N = 2.4 Billion
Coverage and depth are related
Approximate
formula for
coverage
assuming
random reads
Coverage = 1 - e-Depth
Much more sequencing needed in reality
● Sequencing is not random
○ GC and AT rich regions
are under represented
○ Other chemistry quirks
● More depth needed for:
○ sequencing errors
○ polymorphisms
Assessing assemblies
Contiguity
Completeness
Correctness
Contiguity
● Desire
○ Fewer contigs
○ Longer contigs
● Metrics
○ Number of contigs
○ Average contig length
○ Median contig length
○ Maximum contig length
○ “N50”, “NG50”, “D50”
Contiguity: the N50 statistic
Completeness : Total size
Proportion of the original genome represented by the assembly
Can be between 0 and 1
Proportion of estimated genome size
… but estimates are not perfect
Completeness: core genes
Proportion of coding sequences can be estimated based on
known core genes thought to be present in a wide variety of
organisms.
Assumes that the proportion of assembled genes is equal to the
proportion of assembled core genes.
Number of Core Genes in
Assembly
Number of Core Genes in
Database
In the past this was done with a tool called CEGMA
There is a new tool for this called BUSCO
Correctness
Proportion of the assembly that is free from errors
Errors include
1. Mis-joins
2. Repeat compressions
3. Unnecessary duplications
4. Indels / SNPs caused by assembler
Correctness: check for self consistency
● Align all the reads back to the contigs
● Look for inconsistencies
Original Reads
Align
Assembly
Mapped Read
Assemble ALL the things
Why assemble genomes
● Produce a reference for new species
● Genomic variation can be studied by comparing
against the reference
● A wide variety of molecular biological tools become
available or more effective
● Coding sequences can be studied in the context of their
related non-coding (eg regulatory) sequences
● High level genome structure (number, arrangement of
genes and repeats) can be studied
Vast majority of genomes remain unsequenced
Estimated number of species ~ 9 Million
Whole genome sequences (including
incomplete drafts) ~ 3000
Estimated number of species Millions to
Billions
Genomic sequences ~ 150,000
Eukaryotes Bacteria & Archaea
Every individual has a different genome
Some diseases such as cancer give rise to massive genome level variation
Not just genomes
● Transcriptomes
○ One contig for every isoform
○ Do not expect uniform coverage
● Metagenomes
○ Mixture of different organisms
○ Host, bacteria, virus, fungi all at once
○ All different depths
● Metatranscriptomes
○ Combination of above!
Conclusions
Take home points
● De novo assembly is the process of reconstructing a long
sequence from many short ones
● Represented as a mathematical “overlap graph”
● Assembly is very challenging (“impossible”) because
○ sequencing bias under represents certain regions
○ Reads are short relative to genome size
○ Repeats create tangled hubs in the assembly graph
○ Sequencing errors cause detours and bubbles in the assembly graph
Contact
tseemann.github.io
torsten.seemann@gmail.com
@torstenseemann
The End

Mais conteúdo relacionado

Mais procurados

Yeast two hybrid system
Yeast two hybrid systemYeast two hybrid system
Yeast two hybrid systemiqraakbar8
 
2 whole genome sequencing and analysis
2 whole genome sequencing and analysis2 whole genome sequencing and analysis
2 whole genome sequencing and analysissaberhussain9
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing priyanka raviraj
 
Expressed sequence tag (EST), molecular marker
Expressed sequence tag (EST), molecular markerExpressed sequence tag (EST), molecular marker
Expressed sequence tag (EST), molecular markerKAUSHAL SAHU
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...VHIR Vall d’Hebron Institut de Recerca
 
Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Mrinal Vashisth
 
Next Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology OverviewNext Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology OverviewDominic Suciu
 
Rna seq and chip seq
Rna seq and chip seqRna seq and chip seq
Rna seq and chip seqJyoti Singh
 
Transcriptomics and metabolomics
Transcriptomics and metabolomicsTranscriptomics and metabolomics
Transcriptomics and metabolomicsSukhjinder Singh
 
Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques fikrem24yahoocom6261
 
Gene prediction and expression
Gene prediction and expressionGene prediction and expression
Gene prediction and expressionishi tandon
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1BITS
 
RNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewRNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewSean Davis
 
Genomics and bioinformatics
Genomics and bioinformatics Genomics and bioinformatics
Genomics and bioinformatics Senthil Natesan
 

Mais procurados (20)

Yeast two hybrid system
Yeast two hybrid systemYeast two hybrid system
Yeast two hybrid system
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
2 whole genome sequencing and analysis
2 whole genome sequencing and analysis2 whole genome sequencing and analysis
2 whole genome sequencing and analysis
 
Gene prediction method
Gene prediction method Gene prediction method
Gene prediction method
 
Est database
Est databaseEst database
Est database
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing
 
Expressed sequence tag (EST), molecular marker
Expressed sequence tag (EST), molecular markerExpressed sequence tag (EST), molecular marker
Expressed sequence tag (EST), molecular marker
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
BLAST
BLASTBLAST
BLAST
 
Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)Next generation sequencing methods (final edit)
Next generation sequencing methods (final edit)
 
Next Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology OverviewNext Gen Sequencing (NGS) Technology Overview
Next Gen Sequencing (NGS) Technology Overview
 
Rna seq and chip seq
Rna seq and chip seqRna seq and chip seq
Rna seq and chip seq
 
Transcriptomics and metabolomics
Transcriptomics and metabolomicsTranscriptomics and metabolomics
Transcriptomics and metabolomics
 
Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques Ion torrent and SOLiD Sequencing Techniques
Ion torrent and SOLiD Sequencing Techniques
 
Gene prediction and expression
Gene prediction and expressionGene prediction and expression
Gene prediction and expression
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
 
Genome assembly
Genome assemblyGenome assembly
Genome assembly
 
RNA-seq Data Analysis Overview
RNA-seq Data Analysis OverviewRNA-seq Data Analysis Overview
RNA-seq Data Analysis Overview
 
Overview of Single-Cell RNA-seq
Overview of Single-Cell RNA-seqOverview of Single-Cell RNA-seq
Overview of Single-Cell RNA-seq
 
Genomics and bioinformatics
Genomics and bioinformatics Genomics and bioinformatics
Genomics and bioinformatics
 

Destaque

Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Robert (Rob) Salomon
 
Project report-on-bio-informatics
Project report-on-bio-informaticsProject report-on-bio-informatics
Project report-on-bio-informaticsDaniela Rotariu
 
Bioinformatics A Biased Overview
Bioinformatics A Biased OverviewBioinformatics A Biased Overview
Bioinformatics A Biased OverviewPhilip Bourne
 
Bioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big DataBioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big DataPhilip Bourne
 
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura AdamMapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adammadalladam
 
DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification Senthil Natesan
 
Formal languages to map Genotype to Phenotype in Natural Genomes
Formal languages to map Genotype to Phenotype in Natural GenomesFormal languages to map Genotype to Phenotype in Natural Genomes
Formal languages to map Genotype to Phenotype in Natural Genomesmadalladam
 
Ap Chapter 21
Ap Chapter 21Ap Chapter 21
Ap Chapter 21smithbio
 
Molecular Markers: Major Applications in Insects
Molecular Markers: Major Applications in InsectsMolecular Markers: Major Applications in Insects
Molecular Markers: Major Applications in InsectsSaramita De Chakravarti
 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformaticianChristian Frech
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformaticsAbhishek Vatsa
 

Destaque (12)

Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1
 
Project report-on-bio-informatics
Project report-on-bio-informaticsProject report-on-bio-informatics
Project report-on-bio-informatics
 
Bioinformatics A Biased Overview
Bioinformatics A Biased OverviewBioinformatics A Biased Overview
Bioinformatics A Biased Overview
 
Bioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big DataBioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big Data
 
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura AdamMapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adam
 
DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification
 
Formal languages to map Genotype to Phenotype in Natural Genomes
Formal languages to map Genotype to Phenotype in Natural GenomesFormal languages to map Genotype to Phenotype in Natural Genomes
Formal languages to map Genotype to Phenotype in Natural Genomes
 
Ap Chapter 21
Ap Chapter 21Ap Chapter 21
Ap Chapter 21
 
Molecular Markers: Major Applications in Insects
Molecular Markers: Major Applications in InsectsMolecular Markers: Major Applications in Insects
Molecular Markers: Major Applications in Insects
 
How to be a bioinformatician
How to be a bioinformaticianHow to be a bioinformatician
How to be a bioinformatician
 
Gene concept
Gene conceptGene concept
Gene concept
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformatics
 

Semelhante a De novo Genome Assembly Guide

GENOME_STRUCTURE1.ppt
GENOME_STRUCTURE1.pptGENOME_STRUCTURE1.ppt
GENOME_STRUCTURE1.pptsherylbadayos
 
1_7_genome_1.ppt
1_7_genome_1.ppt1_7_genome_1.ppt
1_7_genome_1.pptOmerBushra4
 
Lecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsLecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsfmaumus
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-researchc.titus.brown
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorialc.titus.brown
 
Cot Curve_Dr. Sonia.pdf
Cot Curve_Dr. Sonia.pdfCot Curve_Dr. Sonia.pdf
Cot Curve_Dr. Sonia.pdfsoniaangeline
 
Apollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityApollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityMonica Munoz-Torres
 
NGSWorkflows.ppt
NGSWorkflows.pptNGSWorkflows.ppt
NGSWorkflows.pptRaulArias48
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityMonica Munoz-Torres
 
Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012Torsten Seemann
 
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...Integrated DNA Technologies
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08Computer Science Club
 

Semelhante a De novo Genome Assembly Guide (20)

GENOME_STRUCTURE1.ppt
GENOME_STRUCTURE1.pptGENOME_STRUCTURE1.ppt
GENOME_STRUCTURE1.ppt
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
1_7_genome_1.ppt
1_7_genome_1.ppt1_7_genome_1.ppt
1_7_genome_1.ppt
 
Lecture on the annotation of transposable elements
Lecture on the annotation of transposable elementsLecture on the annotation of transposable elements
Lecture on the annotation of transposable elements
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
Cot Curve_Dr. Sonia.pdf
Cot Curve_Dr. Sonia.pdfCot Curve_Dr. Sonia.pdf
Cot Curve_Dr. Sonia.pdf
 
Apolo Taller en BIOS
Apolo Taller en BIOS Apolo Taller en BIOS
Apolo Taller en BIOS
 
Sept2016 sv 10_x
Sept2016 sv 10_xSept2016 sv 10_x
Sept2016 sv 10_x
 
Apollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research communityApollo - A webinar for the Phascolarctos cinereus research community
Apollo - A webinar for the Phascolarctos cinereus research community
 
NGSWorkflows.ppt
NGSWorkflows.pptNGSWorkflows.ppt
NGSWorkflows.ppt
 
Genome structure
Genome structure Genome structure
Genome structure
 
Apollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research CommunityApollo Introduction for the Chestnut Research Community
Apollo Introduction for the Chestnut Research Community
 
Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012
 
DNA as Storage Medium
DNA as Storage MediumDNA as Storage Medium
DNA as Storage Medium
 
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
Characterizing Alzheimer’s Disease candidate genes and transcripts with targe...
 
DNA sequencing
DNA sequencing  DNA sequencing
DNA sequencing
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08
 
Genome Assembly 2018
Genome Assembly 2018Genome Assembly 2018
Genome Assembly 2018
 

Mais de Torsten Seemann

How to write bioinformatics software no one will use
How to write bioinformatics software no one will useHow to write bioinformatics software no one will use
How to write bioinformatics software no one will useTorsten Seemann
 
How to write bioinformatics software people will use and cite - t.seemann - ...
How to write bioinformatics software people will use and cite -  t.seemann - ...How to write bioinformatics software people will use and cite -  t.seemann - ...
How to write bioinformatics software people will use and cite - t.seemann - ...Torsten Seemann
 
Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016Torsten Seemann
 
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Torsten Seemann
 
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...Torsten Seemann
 
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...Torsten Seemann
 
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...Torsten Seemann
 
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...Torsten Seemann
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015Torsten Seemann
 
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015Torsten Seemann
 
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015Torsten Seemann
 
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012Torsten Seemann
 
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...Torsten Seemann
 
Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015Torsten Seemann
 
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015Torsten Seemann
 
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014Torsten Seemann
 
Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013Torsten Seemann
 
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014Torsten Seemann
 
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014Torsten Seemann
 
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...Torsten Seemann
 

Mais de Torsten Seemann (20)

How to write bioinformatics software no one will use
How to write bioinformatics software no one will useHow to write bioinformatics software no one will use
How to write bioinformatics software no one will use
 
How to write bioinformatics software people will use and cite - t.seemann - ...
How to write bioinformatics software people will use and cite -  t.seemann - ...How to write bioinformatics software people will use and cite -  t.seemann - ...
How to write bioinformatics software people will use and cite - t.seemann - ...
 
Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016Snippy - T.Seemann - Poster - Genome Informatics 2016
Snippy - T.Seemann - Poster - Genome Informatics 2016
 
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
 
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...What can we do with microbial WGS data?  - t.seemann - mc gill summer 2016 - ...
What can we do with microbial WGS data? - t.seemann - mc gill summer 2016 - ...
 
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...Bioinformatics tools for the diagnostic laboratory -  T.Seemann - Antimicrobi...
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobi...
 
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...Sequencing your poo with a usb stick -  Linux.conf.au 2016 miniconf  - mon 1 ...
Sequencing your poo with a usb stick - Linux.conf.au 2016 miniconf - mon 1 ...
 
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
Approaches to analysing 1000s of bacterial isolates - ICEID 2015 Atlanta, USA...
 
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
A peek inside the bioinformatics black box - DCAMG Symposium - mon 20 july 2015
 
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
 
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015Long read sequencing -  WEHI  bioinformatics seminar - tue 16 june 2015
Long read sequencing - WEHI bioinformatics seminar - tue 16 june 2015
 
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
 
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
Visualizing the pan genome - Australian Society for Microbiology - tue 8 jul ...
 
Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015
 
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
Snippy - Rapid bacterial variant calling - UK - tue 5 may 2015
 
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014Rapid outbreak characterisation  - UK Genome Sciences 2014 - wed 3 sep 2014
Rapid outbreak characterisation - UK Genome Sciences 2014 - wed 3 sep 2014
 
Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013Prokka - rapid bacterial genome annotation - ABPHM 2013
Prokka - rapid bacterial genome annotation - ABPHM 2013
 
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
Pipeline or pipe dream - Midlands Micro Meeting UK - mon 15 sep 2014
 
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
Decoding our bacterial overlords - Melbourne Knowledge Week - tue 28 oct 2014
 
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
 

Último

GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptxkhadijarafiq2012
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 

Último (20)

GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptx
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 

De novo Genome Assembly Guide

  • 1. De novo Genome Assembly A/Prof Torsten Seemann Winter School in Mathematical & Computational Biology - Brisbane, Australia - 4 July 2016
  • 3. The human genome has 47 pieces MT (or XY)
  • 4. The shortest piece is 48,000,000 bp Total haploid size 3,200,000,000 bp (3.2 Gbp) Total diploid size 6,400,000,000 bp (6.4 Gbp) ∑ = 6,400,000,000 bp
  • 5. Human DNA iSequencer ™ 46 chromosomal and1 mitochondrial sequences In an ideal world ... AGTCTAGGATTCGCTATAG ATTCAGGCTCTGATATATT TCGCGGCATTAGCTAGAGA TCTCGAGATTCGTCCCAGT CTAGGATTCGCTAT AAGTCTAAGATTC...
  • 6. The real world ( for now ) Short fragments Reads are stored in “FASTQ” files Genome Sequencing
  • 7.
  • 10. De novo genome assembly Ideally, one sequence per replicon. Millions of short sequences (reads) A few long sequences (contigs) Reconstruct the original genome from the sequence reads only
  • 11. De novo genome assembly “From scratch”
  • 13. A small “genome” Friends, Romans, countrymen, lend me your ears; I’ll return them tomorrow!
  • 14. • Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me Shakespearomics Whoops! I dropped them.
  • 15. • Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me • Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears; Shakespearomics I am good with words.
  • 16. • Reads ds, Romans, count ns, countrymen, le Friends, Rom send me your ears; crymen, lend me • Overlaps Friends, Rom ds, Romans, count ns, countrymen, le crymen, lend me send me your ears; • Majority consensus Friends, Romans, countrymen, lend me your ears; (1 contig) Shakespearomics We have reached a consensus !
  • 17. Overlap - Layout - Consensus Amplified DNA Shear DNA Sequenced reads Overlaps Layout Consensus ↠ “Contigs”
  • 22. Do the graph traverse Size matters not. Look at me. Judge me by my size, do you? Size matters not. Look at me. Judge me by my soze, do you? 2 supporting reads 1 supporting reads
  • 23. So far, so good.
  • 24.
  • 25. Why is it so hard?
  • 26. What makes a jigsaw puzzle hard? Repetitive regions Lots of pieces Missing pieces Multiple copies No corners (circular genomes) No box Dirty pieces Frayed pieces : .,
  • 27. What makes genome assembly hard Size of the human genome = 3.2 x 109 bp (3,200,000,000) Typical short read length = 102 bp (100) A puzzle with millions to billions of pieces Storing in RAM is a challenge ⇢ “succinct data structure” 1. Many pieces (read length is very short compared to the genome)
  • 28. What makes genome assembly hard Finding overlaps means examining every pair of reads Comparisons = N×(N-1)/2 ~ N2 Lots of smart tricks to reduce this close to ~N 2. Lots of overlaps
  • 29. What makes genome assembly hard ATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA TATATATA 3. Lots of sky (short repeats) How could we possibly assemble this segment of DNA? All the reads are the same!
  • 30. What makes genome assembly hard Read 1: GGAACCTTTGGCCCTGT Read 2: GGCGCTGTCCATTTTAGAAACC 4. Dirty pieces (sequencing errors) 1. The error in Read 2 might prevent us from seeing that it overlaps with Read 1 2. How would we know which is the correct sequence?
  • 31. What makes genome assembly hard 5. Multiple copies (long repeats) Our old nemesis the REPEAT ! Gene Copy 1 Gene Copy 2
  • 33. What is a repeat? A segment of DNA that occurs more than once in the genome
  • 34. Major classes of repeats in the human genome Repeat Class Arrangement Coverage (Hg) Length (bp) Satellite (micro, mini) Tandem 3% 2-100 SINE Interspersed 15% 100-300 Transposable elements Interspersed 12% 200-5k LINE Interspersed 21% 500-8k rDNA Tandem 0.01% 2k-43k Segmental Duplications Tandem or Interspersed 0.2% 1k-100k
  • 35. Repeats Repeat copy 1 Repeat copy 2 Collapsed repeat consensus 1 locus 4 contigs
  • 36. Repeats are hubs in the graph
  • 37. Long reads can span repeats Repeat copy 1 Repeat copy 2 long reads 1 contig
  • 38. Draft vs Finished genomes (bacteria) 250 bp - Illumina - $250 12,000 bp - Pacbio - $2500
  • 39. The two laws of repeats 1. It is impossible to resolve repeats of length L unless you have reads longer than L. 2. It is impossible to resolve repeats of length L unless you have reads longer than L.
  • 40. How much data do we need?
  • 41. Coverage vs Depth Coverage Fraction of the genome sequenced by at least one read Depth Average number of reads that cover any given region Intuitively more reads should increase coverage and depth
  • 42. Example: Read length = ⅛ Genome Length Coverage = ⅜ Depth = ⅜ Genome Reads
  • 43. Example: Read length = ⅛ Genome Length Coverage = 4.5/8 Depth = 5/8 Genome Reads Newly Covered Regions = 1.5 Reads
  • 44. Example: Read length = ⅛ Genome Length Coverage = 5.2/8 Depth = 7/8 Genome Reads Newly Covered Regions = 0.7 Reads
  • 45. Example: Read length = ⅛ Genome Length Coverage = 6.7/8 Depth = 10/8 Genome Reads Newly Covered Regions = 1.5 Reads
  • 46. Depth is easy to calculate Depth = N × L / G Depth = Y / G N = Number of reads L = Length of a read G = Genome length Y = Sequence yield (N x L)
  • 47. Example: Coverage of the Tarantula genome “The size estimate of the tarantula genome based on k-mer analysis is 6 Gb and we sequenced at 40x coverage from a single female A. geniculata” Sangaard et al 2014. Nature Comm (5) p1-11 Depth = N × L / G 40 = N x 100 / (6 Billion) N = 40 * 6 Billion / 100 N = 2.4 Billion
  • 48. Coverage and depth are related Approximate formula for coverage assuming random reads Coverage = 1 - e-Depth
  • 49. Much more sequencing needed in reality ● Sequencing is not random ○ GC and AT rich regions are under represented ○ Other chemistry quirks ● More depth needed for: ○ sequencing errors ○ polymorphisms
  • 52. Contiguity ● Desire ○ Fewer contigs ○ Longer contigs ● Metrics ○ Number of contigs ○ Average contig length ○ Median contig length ○ Maximum contig length ○ “N50”, “NG50”, “D50”
  • 53. Contiguity: the N50 statistic
  • 54. Completeness : Total size Proportion of the original genome represented by the assembly Can be between 0 and 1 Proportion of estimated genome size … but estimates are not perfect
  • 55. Completeness: core genes Proportion of coding sequences can be estimated based on known core genes thought to be present in a wide variety of organisms. Assumes that the proportion of assembled genes is equal to the proportion of assembled core genes. Number of Core Genes in Assembly Number of Core Genes in Database In the past this was done with a tool called CEGMA There is a new tool for this called BUSCO
  • 56. Correctness Proportion of the assembly that is free from errors Errors include 1. Mis-joins 2. Repeat compressions 3. Unnecessary duplications 4. Indels / SNPs caused by assembler
  • 57. Correctness: check for self consistency ● Align all the reads back to the contigs ● Look for inconsistencies Original Reads Align Assembly Mapped Read
  • 59. Why assemble genomes ● Produce a reference for new species ● Genomic variation can be studied by comparing against the reference ● A wide variety of molecular biological tools become available or more effective ● Coding sequences can be studied in the context of their related non-coding (eg regulatory) sequences ● High level genome structure (number, arrangement of genes and repeats) can be studied
  • 60. Vast majority of genomes remain unsequenced Estimated number of species ~ 9 Million Whole genome sequences (including incomplete drafts) ~ 3000 Estimated number of species Millions to Billions Genomic sequences ~ 150,000 Eukaryotes Bacteria & Archaea Every individual has a different genome Some diseases such as cancer give rise to massive genome level variation
  • 61.
  • 62. Not just genomes ● Transcriptomes ○ One contig for every isoform ○ Do not expect uniform coverage ● Metagenomes ○ Mixture of different organisms ○ Host, bacteria, virus, fungi all at once ○ All different depths ● Metatranscriptomes ○ Combination of above!
  • 64. Take home points ● De novo assembly is the process of reconstructing a long sequence from many short ones ● Represented as a mathematical “overlap graph” ● Assembly is very challenging (“impossible”) because ○ sequencing bias under represents certain regions ○ Reads are short relative to genome size ○ Repeats create tangled hubs in the assembly graph ○ Sequencing errors cause detours and bubbles in the assembly graph