RNA-seq: Mapping and quality control - part 3

Mapping to assign reads to genes
Joachim Jacob
20 and 27 January 2014

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to
http://www.bits.vib.be/ if you use this presentation or parts hereof.

Goal
Assign reads to genes.
The result of the mapping will be used to construct a summary of the counts: the count table.

GeneA: 12
GeneB: 5
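
As a minimal sketch of what a count table is, the toy example below tallies reads per gene. The input list is hypothetical; in a real analysis the gene assignment of each read comes out of the mapping step.

```python
from collections import Counter


def build_count_table(read_assignments):
    """Count how many reads were assigned to each gene.

    read_assignments: iterable of gene names, one entry per assigned read
    (hypothetical input, standing in for the result of the mapping).
    """
    return Counter(read_assignments)


# Toy data reproducing the example above: 12 reads on GeneA, 5 on GeneB.
assignments = ["GeneA"] * 12 + ["GeneB"] * 5
count_table = build_count_table(assignments)

for gene, count in sorted(count_table.items()):
    print(f"{gene}: {count}")
```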

2 scenarios
Reference genome sequence available
NO reference genome sequence available
● De novo assembly of the reads (Trinity) (transcriptome construction)
● Map the reads to the assembly (RSEM mapper)
● Extract count table
(Note: no removal of polyA is required. Computationally expensive!)

Reference genome sequence available
Preprocessed reads are mapped to the reference sequence:
1. The reference is a single haplotype; the sample is a mixture of alleles, which leads to mismatches.

[Figure: 35 → for 2 alleles together]

If we compare samples within the same specimen, this effect is similar for all samples.

2. Reads contain sequencing errors
3. Reads are derived from mRNA, whereas the genome is DNA.

mRNA reads: some reads span introns
● Reads are derived from mRNA

[Figure: one mRNA isoform with its exons and introns, etc.]

Many reads span introns: they need to be aligned with gaps. This can be used to detect intron-exon junctions.

http://www.ensembl.org
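
In SAM/BAM output, the gap of a read that spans an intron is recorded in the CIGAR string as an `N` operation (skipped region on the reference). The sketch below, with a hypothetical helper name, shows how intron coordinates could be read back from a mapping position and a CIGAR string. It is an illustration only, not the procedure of any particular mapper.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_cigar(pos, cigar):
    """Return the reference intervals skipped by 'N' operations.

    pos:   1-based leftmost mapping position (SAM column 4).
    cigar: CIGAR string (SAM column 6).
    Intervals are 1-based and inclusive.
    """
    introns = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref_pos, ref_pos + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref_pos += length
    return introns


# A 50 bp read: 30 bases in one exon, a 500 bp gap, 20 bases in the next exon.
example = introns_from_cigar(1000, "30M500N20M")
print(example)
```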

mRNA reads: multiple isoforms exist
● Isoforms are transcribed at different levels, contributing differently to the number of reads.

http://www.ensembl.org

Algorithm: gapped read mapping
● Exon-first approach: TopHat (popular)

A junction database is constructed to try to map the reads that remained unmapped.

TopHat: discovering splice junctions with RNA-Seq
Bioinformatics, Vol. 25 no. 9 2009, pages 1105–1111, doi:10.1093/bioinformatics/btp120
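
The exon-first idea can be sketched with a toy example: first map reads without gaps, then build candidate junction sequences and retry the reads that did not map. Everything below is a simplification under stated assumptions: exact string matching replaces a real aligner, and the exon coordinates are given as input, whereas TopHat infers them from the initial mapping. The function name and toy sequences are hypothetical.

```python
def exon_first_map(reads, genome, exons):
    """Toy exon-first mapping using exact string matching.

    genome: reference sequence as a string.
    exons:  list of (start, end) intervals, 0-based, end-exclusive, sorted.
    Returns {read: (status, location)}.
    """
    results = {}
    unmapped = []

    # Step 1: ungapped mapping of every read to the genome.
    for read in reads:
        pos = genome.find(read)
        if pos >= 0:
            results[read] = ("ungapped", pos)
        else:
            unmapped.append(read)

    # Step 2: junction database. Join the end of each exon to the start of
    # every downstream exon, keeping (read length - 1) bases on each side.
    flank = max((len(r) for r in reads), default=1) - 1
    junctions = {}
    for i, (start_i, end_i) in enumerate(exons):
        for j in range(i + 1, len(exons)):
            start_j, end_j = exons[j]
            left = genome[max(start_i, end_i - flank):end_i]
            right = genome[start_j:min(end_j, start_j + flank)]
            junctions[(i, j)] = left + right

    # Step 3: try the unmapped reads against the junction sequences.
    for read in unmapped:
        hit = next((k for k, seq in junctions.items() if read in seq), None)
        results[read] = ("spliced", hit) if hit is not None else ("unmapped", None)
    return results


exon1, intron, exon2 = "ATGGCCAAGT", "GTAAGTTTTTCAG", "CTGGAGTCCA"
toy_genome = exon1 + intron + exon2
toy_exons = [(0, 10), (23, 33)]
toy_reads = ["GGCCAAG", "CAAGTCTGGA", "TTTTTTTTTT"]
mapped = exon_first_map(toy_reads, toy_genome, toy_exons)
```

The second read matches neither exon on its own, but it is found in the sequence joining the end of the first exon to the start of the second.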

Principle of gapped read mapping
● STAR: fast and suited for longer reads

STAR: ultrafast universal RNA-seq aligner
Alexander Dobin et al. Bioinformatics

Checklist for mapping to reference genome
1. A reference genome sequence (fasta), to be indexed by the alignment software.
2. A genome annotation file (GFF3 or GTF), with indication of currently known annotations (optional, but highly recommended)
3. The cleaned (preprocessed) reads (fastq)
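
One sanity check that may be worth running on the first two items of this checklist, offered here as a suggestion rather than part of the original workflow: the sequence names in the annotation file should also occur in the reference fasta (for example `chr1` versus `1`). A minimal sketch with hypothetical helper names and toy data:

```python
def fasta_sequence_names(fasta_text):
    """Collect sequence names: the first word after '>' on each header line."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def gtf_sequence_names(gtf_text):
    """Collect the first column (sequence name) of every annotation line."""
    names = set()
    for line in gtf_text.splitlines():
        if line and not line.startswith("#"):
            names.add(line.split("\t")[0])
    return names


# Toy inputs: the second annotation line uses '1' instead of 'chr1'.
toy_fasta = ">chr1 toy chromosome\nACGTACGT\n>chr2\nGGCCGGCC\n"
toy_gtf = (
    'chr1\ttoy\texon\t1\t8\t.\t+\t.\tgene_id "GeneA";\n'
    '1\ttoy\texon\t1\t8\t.\t+\t.\tgene_id "GeneB";\n'
)

missing = gtf_sequence_names(toy_gtf) - fasta_sequence_names(toy_fasta)
print("Annotation sequences absent from the fasta:", sorted(missing))
```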

Getting your reference genome sequence
● Genomes to be used by TopHat can be fetched from iGenomes and for STAR here
● If your genome is not listed above, check http://ensembl.org and http://ensemblgenomes.org, and follow the instructions of the indexing software
● If still no luck, try a specialized species website, e.g.

Indexing a genome
● Mapping reads is fairly fast, because the heavy lifting is done beforehand: the reference genome sequence is preprocessed by indexing (taking a lot of time), making mapping fast.
● On Galaxy, the indexing has already been performed for you. Just choose your genome from the list.
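
To illustrate why an index makes mapping fast, here is a toy sketch that builds a k-mer lookup table once and then uses it to find candidate positions for a read without scanning the whole genome. This is an assumption-laden simplification: real aligners use far more compact structures (for example an FM-index or suffix arrays), and the function names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Slow step, done once: record every position of every k-mer."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def map_read(read, genome, index, k):
    """Fast step, done per read: look up the first k-mer as a seed,
    then verify the full read at each candidate position."""
    hits = []
    for start in index.get(read[:k], []):
        if genome[start:start + len(read)] == read:
            hits.append(start)
    return hits


toy_genome = "ACGTACGTTAGC"
toy_index = build_kmer_index(toy_genome, k=4)
positions = map_read("ACGTT", toy_genome, toy_index, k=4)
print(positions)
```

The seed `ACGT` occurs at two positions in the toy genome, but only one of them extends to the full read.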

Using genome annotation information
● Annotation info is stored in text files formatted as GTF or GFF3 files.
● If sequencing is deep enough, the complete transcriptome structure can be derived from the mapping: splice junctions, isoforms, variants, ... Cufflinks, for example, reconstructs the annotation from an alignment and generates a GFF file to be used. Potentially novel transcripts are included in this file. But remember, this is NOT OUR GOAL.
● We will use a GTF file from a respected genome database to assist the mapping of reads.

http://cufflinks.cbcb.umd.edu/
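
A GTF line has nine tab-separated columns: sequence name, source, feature, start, end, score, strand, frame and attributes. The sketch below parses one such line; the helper name and the example line are made up for illustration.

```python
import re


def parse_gtf_line(line):
    """Split one GTF line into its columns and parse the attribute pairs."""
    fields = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    # Attributes look like: key "value"; key "value";
    attrs = dict(re.findall(r'(\S+) "([^"]*)";', attributes))
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": attrs,
    }


toy_line = 'chr1\ttoy\texon\t1000\t1200\t.\t+\t.\tgene_id "GeneA"; transcript_id "GeneA.1";'
record = parse_gtf_line(toy_line)
print(record["attributes"]["gene_id"], record["start"], record["end"])
```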

Mapping in Galaxy
Basic settings

Mapping QC
● TIP: align a subsample of reads in Galaxy. Play with the settings, and determine the best outcome.
● Set the mapping parameters fairly liberally: map as much as possible, and let the mapper assign mapping qualities. Ideally, every read maps once ('uniquely mapped'). In the following step, we will discard reads mapped to multiple locations ('multi reads').
● The outcome of the alignment is a SAM or a BAM file, which you can visualize in Galaxy (or with a stand-alone viewer such as GenomeView or IGV).
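
Discarding multi reads can be sketched on SAM text lines. The example assumes the mapper writes the optional `NH` tag (number of reported alignments for the read), which is not guaranteed for every tool; reads without the tag are dropped here as a conservative, arbitrary choice. The helper name and the toy lines are hypothetical.

```python
def is_uniquely_mapped(sam_line):
    """True if the read is mapped and its NH tag reports a single alignment."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # flag bit 0x4: read is unmapped
        return False
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    return False  # no NH tag: cannot tell, so drop the read (assumption)


toy_sam = [
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = [line.split("\t")[0] for line in toy_sam if is_uniquely_mapped(line)]
print(kept)
```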

Mapping QC
The outcome of the alignment is a SAM or a BAM file, which you can visualize in Galaxy (or with a stand-alone viewer such as GenomeView or IGV).
Check whether this visualization matches what you expect for:
- paired end
- splice junctions
- strandedness
- ...

Let's visualize

Practical tips
Position on the reference genome sequence

Add the GTF to the visualization

These are the reads, in 2 colours because of the sense and antisense strand. (Obviously, this library was not stranded!)
Some reads span an intron
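
The two orientations seen in a viewer can also be counted directly: in SAM, flag bit 0x10 marks a read aligned to the reverse strand. The sketch below tallies forward and reverse alignments from SAM text lines. It only reports orientation relative to the reference; judging strandedness properly also requires the strand of the annotated gene. Names and toy lines are hypothetical.

```python
def strand_counts(sam_lines):
    """Count mapped reads aligned to the forward (+) and reverse (-) strand."""
    counts = {"+": 0, "-": 0}
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:  # unmapped read
            continue
        counts["-" if flag & 0x10 else "+"] += 1
    return counts


toy_sam = [
    "@HD\tVN:1.4",
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t120\t50\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t140\t50\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
orientation = strand_counts(toy_sam)
print(orientation)
```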

Mapping QC - RSeQC
After checking the mapping visually, determine more metrics with RSeQC.

http://rseqc.sourceforge.net/

Mapping QC - RSeQC
Duplication rate observed in the RNA-seq data.
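
As a rough idea of what a mapping-based duplication rate measures, the sketch below treats reads that share the same chromosome, start position and strand as duplicates of each other. This is a simplified stand-in under that assumption, not RSeQC's own calculation, and the input tuples are toy data.

```python
from collections import Counter


def duplication_rate(alignments):
    """Fraction of reads that repeat an already seen (chrom, pos, strand)."""
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)  # every copy beyond the first
    return duplicates / total


toy_alignments = [("chr1", 100, "+")] * 3 + [("chr1", 200, "+"), ("chr2", 50, "-")]
rate = duplication_rate(toy_alignments)
print(f"{rate:.2f}")
```

In RNA-seq, highly expressed genes naturally produce many identical reads, so a high duplication rate is not by itself a sign of a technical problem.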


Mapping QC - RSeQC
Read quality of aligned reads


Mapping QC - RSeQC
Sequence depth saturation
Q1 → Q4: from low count genes to high count genes

Early flattening points to saturation
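
The idea behind a saturation curve can be sketched by subsampling: draw increasing fractions of the reads and see whether the result still changes. The toy example below counts how many genes are detected at each fraction. It is a simplified illustration and not RSeQC's calculation; the function name, the `min_reads` threshold and the data are assumptions.

```python
import random


def genes_detected(read_genes, fraction, min_reads=1, seed=0):
    """Number of genes with at least min_reads reads in a random subsample.

    read_genes: list with one gene name per assigned read.
    fraction:   share of the reads to keep, between 0 and 1.
    """
    rng = random.Random(seed)
    n = round(len(read_genes) * fraction)
    sample = rng.sample(read_genes, n)
    counts = {}
    for gene in sample:
        counts[gene] = counts.get(gene, 0) + 1
    return sum(1 for c in counts.values() if c >= min_reads)


# Toy data: one high count gene, one medium, one low.
toy_reads = ["GeneA"] * 50 + ["GeneB"] * 5 + ["GeneC"] * 1
curve = [(f, genes_detected(toy_reads, f)) for f in (0.25, 0.5, 0.75, 1.0)]
for fraction, n_genes in curve:
    print(fraction, n_genes)
```

If the curve stops rising well before all reads are used, sequencing deeper would add little; low count genes are typically the last to be detected.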


Mapping QC - RSeQC
After checking visually, determine more metrics with RSeQC.
Deviating!


Mapping QC - BamQC
Another useful tool is BamQC of the Qualimap suite.
Be aware, however, that it is also useful for DNA-seq!

http://qualimap.bioinfo.cipf.es/

Mapping QC: BamQC

Fraction of genome sequence not covered
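
What this metric means can be shown with a small sketch: mark every reference position touched by at least one alignment and report the share that stays untouched. This is an illustration under simple assumptions (a single reference sequence, alignments given as plain intervals), not Qualimap's implementation.

```python
def fraction_not_covered(genome_length, alignments):
    """Share of reference positions covered by no alignment.

    alignments: list of (start, end) intervals, 0-based, end-exclusive.
    """
    covered = [False] * genome_length
    for start, end in alignments:
        for i in range(max(0, start), min(end, genome_length)):
            covered[i] = True
    return covered.count(False) / genome_length


# Two overlapping alignments cover positions 0..49 of a 100 bp reference.
uncovered = fraction_not_covered(100, [(0, 30), (20, 50)])
print(uncovered)
```

For RNA-seq a large uncovered fraction is expected, since only transcribed regions produce reads.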


Mapping QC: BamQC
Some examples to watch out for.


Keywords
haplotype
gapped mapping
GTF
duplication
isoforms
strandedness
coverage

Write in your own words what the terms mean

Exercise
→ Mapping exercise
