This presentation discusses mapping RNA-seq reads to genes in order to construct a count table. There are two scenarios - mapping to a reference genome sequence or performing a de novo assembly. When a reference is available, reads are mapped using gapped alignment to account for reads spanning introns. It is important to use genome annotations and check mapping quality using tools like RSeQC and BamQC to visualize coverage and duplication rates.
1. Mapping to assign reads to genes
Joachim Jacob
20 and 27 January 2014
This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts hereof.
2. Goal
Assign reads to genes.
The result of the mapping will be used to construct a summary of the counts: the count table.
GeneA: 12
GeneB: 5
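As a minimal sketch of what "constructing a count table" means, the snippet below tallies reads per gene. The list `assignments` and the function name `build_count_table` are hypothetical; in practice the gene for each read comes from the mapping plus the annotation, not from a hand-written list.

```python
from collections import Counter

def build_count_table(assignments):
    """Count reads per gene; `assignments` holds one gene name per assigned read."""
    return Counter(assignments)

# Hypothetical input reproducing the numbers on the slide: 12 reads on GeneA, 5 on GeneB.
assignments = ["GeneA"] * 12 + ["GeneB"] * 5
count_table = build_count_table(assignments)

for gene, count in sorted(count_table.items()):
    print(f"{gene}: {count}")
```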
3. 2 scenarios
Reference genome sequence available
NO reference genome sequence available
● De novo assembly of the reads (Trinity) (transcriptome construction)
● Map the reads to the assembly (RSEM mapper)
● Extract count table
(Note: no removal of polyA is required. Computationally expensive!)
4. Reference genome sequence available
Preprocessed reads are mapped to the reference sequence:
1. Reference is a haplotype: a mixture of alleles, which leads to mismatches.
35 → for 2 alleles together
If we compare samples within the same specimen, this effect is similar for all samples.
2. Reads contain sequencing errors
3. Reads are derived from mRNA; the genome is DNA.
5. mRNA reads: some reads span introns
● Reads are derived from mRNA
[Figure labels: mRNA (one isoform!), exon, intron, etc.]
Many reads span introns: they need to be aligned with gaps. This can be used to detect intron-exon junctions.
http://www.ensembl.org
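To illustrate how a gapped alignment exposes intron-exon junctions, here is a minimal sketch that reads the CIGAR string of an aligned read, as stored in SAM files. In the SAM format the `N` operation marks a skipped region of the reference, which for RNA-seq reads is usually an intron. The function name `introns_from_cigar` and the example values are assumptions made for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def introns_from_cigar(pos, cigar):
    """Return (start, end) reference intervals, 1-based inclusive, skipped by N operations."""
    introns = []
    ref = pos  # 1-based leftmost reference position of the alignment
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in "MDN=X":  # only these operations consume reference bases
            ref += length
    return introns

# Hypothetical read: 30 bases aligned, a 500-base gap, then 20 more bases aligned.
print(introns_from_cigar(100, "30M500N20M"))
```

Collecting these intervals over all reads gives the set of junctions supported by the data.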
6. mRNA reads: multiple isoforms exist
● Isoforms are transcribed at different levels, contributing differently to the number of reads.
http://www.ensembl.org
8. Principle of gapped read mapping
● STAR: fast and suited for longer reads
STAR: ultrafast universal RNA-seq aligner. Alexander Dobin et al., Bioinformatics
9. Checklist for mapping to reference genome
1. A reference genome sequence (fasta), to be indexed by the alignment software.
2. A genome annotation file (GFF3 or GTF), with indication of currently known annotations (optional, but highly recommended)
3. The cleaned (preprocessed) reads (fastq)
10. Getting your reference genome sequence
● Genomes to be used by TopHat can be fetched from iGenomes and for STAR here
● If your genome is not listed above, check http://ensembl.org and http://ensemblgenomes.org; and follow indexing software
● If still no luck, try a specialized species website, e.g.
11. Indexing a genome
● Mapping reads is fairly fast because the heavy lifting is done beforehand: the reference genome sequence is preprocessed by indexing, which takes a lot of time.
● On Galaxy, the indexing has already been performed for you. Just choose your genome from the list.
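A toy sketch of why indexing pays off: the reference is scanned once to build a lookup table, after which each read is placed by a dictionary lookup instead of a scan of the whole genome. This is a deliberate simplification. Real aligners use more compact structures (STAR uses a suffix array, Bowtie/TopHat an FM-index), and the names `build_index` and `lookup` are hypothetical.

```python
from collections import defaultdict

def build_index(reference, k):
    """Slow step, done once: record every start position of every k-mer in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def lookup(index, read, k):
    """Fast step, done per read: candidate 0-based start positions from the read's first k-mer."""
    return index.get(read[:k], [])

reference = "ACGTACGTTAGC"  # toy reference sequence
index = build_index(reference, 4)

print(lookup(index, "TTAGC", 4))  # k-mer occurring once
print(lookup(index, "ACGTA", 4))  # k-mer occurring twice
```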
12. Using genome annotation information
● Annotation info is stored in text files formatted as GTF or GFF3 files.
● If sequencing is deep enough, the complete transcriptome structure can be derived from the mapping: splice junctions, isoforms, variants, ... Cufflinks, for example, reconstructs the annotation from an alignment and generates a GFF file that can then be used. Potentially novel transcripts are included in this file. But remember, this is NOT OUR GOAL.
● We will use a GTF file from a respected genome database to assist the mapping of reads.
http://cufflinks.cbcb.umd.edu/
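To make the GTF format concrete, the sketch below parses one annotation line. A GTF line has nine tab-separated columns (sequence name, source, feature, start, end, score, strand, frame, attributes). The example line, the gene identifiers and the function name `parse_gtf_line` are invented for illustration.

```python
import re

# Attributes look like: key "value"; key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')

def parse_gtf_line(line):
    """Split one GTF line into its nine columns and parse the attribute column."""
    fields = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": dict(ATTRIBUTE.findall(attributes)),
    }

# Hypothetical exon record.
line = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "GeneA"; transcript_id "GeneA.1";'
record = parse_gtf_line(line)
print(record["feature"], record["start"], record["end"], record["attributes"]["gene_id"])
```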
18. Mapping QC
● TIP: align a subsample of reads in Galaxy. Play with the settings, and determine the best outcome.
● Set the mapping fairly liberal: map as much as possible, and let the mapper assign mapping qualities. Ideally, every read maps once ('uniquely mapped'). In the following step, we will discard reads mapped to multiple locations ('multi reads').
● The outcome of the alignment is a file in SAM or BAM format, which you can visualize in Galaxy (or with a stand-alone viewer such as GenomeView or IGV).
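One way to separate uniquely mapped reads from multi reads is sketched below, assuming the mapper writes the optional SAM tag `NH` (the number of reported alignments for the read). Whether your mapper sets this tag, and how it encodes uniqueness in the mapping quality, should be checked in its documentation. The helper `make_line` and the records are hypothetical.

```python
def nh_tag(sam_line):
    """Return the integer value of the optional NH tag, or None when it is absent."""
    for field in sam_line.rstrip("\n").split("\t")[11:]:  # optional tags start at column 12
        if field.startswith("NH:i:"):
            return int(field[len("NH:i:"):])
    return None

def keep_unique(sam_lines):
    """Keep header lines and alignments whose NH tag equals 1 (uniquely mapped)."""
    return [line for line in sam_lines if line.startswith("@") or nh_tag(line) == 1]

def make_line(name, pos, mapq, nh):
    # Hypothetical helper that builds a minimal SAM record for this demo.
    return "\t".join([name, "0", "chr1", str(pos), str(mapq), "4M",
                      "*", "0", "0", "ACGT", "IIII", f"NH:i:{nh}"])

sam_lines = [
    "@HD\tVN:1.0",
    make_line("read1", 100, 50, 1),  # uniquely mapped
    make_line("read2", 200, 1, 3),   # multi read, first location
    make_line("read2", 900, 1, 3),   # multi read, second location
]
unique = keep_unique(sam_lines)
print(len(unique) - 1, "uniquely mapped alignment(s) kept")
```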
19. Mapping QC
The outcome of the alignment is a file in SAM or BAM format, which you can visualize in Galaxy (or with a stand-alone viewer such as GenomeView or IGV).
Check whether this visualization matches what you expect for:
- paired end
- splice junctions
- strandedness
- ...
Let's visualize
20. Practical tips
Position on the reference genome sequence
Add the GTF to the viz
These are the reads, 2 colours because of the sense and antisense strand. (Obviously this library was not stranded!)
Some reads span an intron
21. Mapping QC - RSeQC
After checking the mapping visually, determine more metrics with RSeQC.
http://rseqc.sourceforge.net/
22. Mapping QC - RSeQC
Duplication rate observed in the RNA-seq data.
http://rseqc.sourceforge.net/
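As a rough sketch of what a mapping-based duplication rate measures, the snippet below treats reads that share chromosome, start position and strand as duplicates and reports the fraction of reads beyond the first at each location. This is a simplification under that assumption and not necessarily the exact calculation RSeQC performs; the function name and positions are hypothetical.

```python
from collections import Counter

def duplication_rate(positions):
    """Fraction of reads that repeat an already seen (chromosome, start, strand) location."""
    counts = Counter(positions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total

# Hypothetical alignments: three reads stacked on one location, two reads elsewhere.
positions = [("chr1", 100, "+")] * 3 + [("chr1", 200, "+"), ("chr2", 50, "-")]
rate = duplication_rate(positions)
print(f"duplication rate: {rate:.2f}")
```

In RNA-seq a high rate is not automatically a problem, since highly expressed genes naturally produce many reads at identical positions.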
24. Mapping QC - RSeQC
Sequence depth saturation
Q1 → Q4: from low count genes to high count genes
Early flattening points to saturation
http://rseqc.sourceforge.net/
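The idea behind a saturation check can be sketched by resampling: draw increasing fractions of the reads, scale the count of a gene up to full depth, and see how far the estimate is from the count at full depth. If the error is already small at low fractions, extra depth adds little for that gene. This is only an illustration of the principle with made-up data; RSeQC's own procedure and metric may differ, and `relative_errors` is a hypothetical name.

```python
import random

def relative_errors(assignments, gene, fractions, seed=0):
    """Relative error of the scaled subsample count versus the full-depth count."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    full = assignments.count(gene)
    if full == 0:
        raise ValueError(f"no reads assigned to {gene}")
    errors = []
    for fraction in fractions:
        n = int(round(len(assignments) * fraction))
        sample = rng.sample(assignments, n)      # subsample without replacement
        estimate = sample.count(gene) / fraction  # scale up to full depth
        errors.append(abs(estimate - full) / full)
    return errors

# Hypothetical data: one high count gene and one low count gene.
assignments = ["GeneA"] * 900 + ["GeneB"] * 100
fractions = [0.25, 0.5, 0.75, 1.0]
errors_a = relative_errors(assignments, "GeneA", fractions)
errors_b = relative_errors(assignments, "GeneB", fractions)
for fraction, err_a, err_b in zip(fractions, errors_a, errors_b):
    print(f"{fraction:.2f}  GeneA: {err_a:.3f}  GeneB: {err_b:.3f}")
```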