DESeq, voom and vst
Qiang Kou
qkou@umail.iu.edu
April 28, 2014

Background
Advantages of RNA-seq Compared to Microarray
Detecting novel transcripts and isoforms
High reproducibility, low background
Detection of gene fusions and SNPs

Background
Differential Expression Analysis
Steps
Normalization
Dispersion estimation
Statistical testing
Methods to be presented
DESeq: negative binomial distribution [1]
voom: variance modelling at the observational level [2]
vst: variance-stabilizing transformation [1, 3]

Background
Timeline
[Figure: timeline from 2002 to 2016 showing, in order of appearance: vst, limma, cufflinks, DESeq, edgeR, baySeq, voom]

Background
Why different models?

Background
RNA-seq is Discrete
Garber et al. (2011) Nature Methods 8:469-477

Background
Length Normalization
Within sample: gene length
Between samples: library size
RPKM and FPKM
Reads/fragments per kilobase per million mapped reads
Normalization for gene length and library size
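
A minimal sketch of the RPKM calculation described above. The count matrix and gene lengths are made-up values, and the total number of mapped reads per sample is assumed to equal the column sum of the matrix.

```r
# Toy counts: rows are genes, columns are samples (values are made up).
counts <- matrix(c(100, 200,
                   400, 800,
                    50,  25),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length_bp <- c(g1 = 1000, g2 = 4000, g3 = 500)

# RPKM = reads * 1e9 / (gene length in bp * total mapped reads in the sample)
rpkm <- function(counts, gene_length_bp) {
  lib_size <- colSums(counts)                 # assumed total mapped reads
  per_kb <- counts / (gene_length_bp / 1e3)   # within sample: gene length
  sweep(per_kb, 2, lib_size / 1e6, "/")       # between samples: library size
}

rpkm_values <- rpkm(counts, gene_length_bp)
print(rpkm_values)
```

In sample s1, g2 has four times the reads of g1 but is also four times as long, so both genes get the same RPKM.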

Background
Different Distribution
[Figure: density plots of expression values. (a) Microarray: density versus expression. (b) RNA-seq: density versus log10(fpkm) over genes, by condition (Untreated, CG8144_RNAi)]

Background
Differential Expression as a Function of Transcript Length
[Figure: %DE versus transcript length (bp) in seven panels: (a) Sequencing Data (Sultan), (b) Array Data (Sultan), (c) Sequencing Data (Cloonan), (d) Sequencing Data (Marioni), (e) Array Data (Marioni), (f) Sequencing Data (Marioni), (g) Array Data (Marioni)]
Oshlack et al. (2009) Biology Direct 4:14

Background
Poisson and Negative Binomial Distribution

Background
Poisson Distribution
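[Figure: Poisson probability mass function; graph from Wikipedia]
$\Pr(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$
$E(X) = \mathrm{Var}(X) = \lambda$
A list of genes $g_1, g_2, \ldots, g_n$
$X \sim \mathrm{Poisson}(\lambda)$, a random variable representing the number of reads falling in $g_i$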
Likelihood ratio test
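
The slide names a likelihood ratio test without spelling it out. The sketch below is one possible form for a single gene under a Poisson model, assuming one count per condition and known library sizes. The function name `poisson_lrt` and all input values are hypothetical, not taken from the slides.

```r
# Check the Poisson mass function against R's dpois().
lambda <- 4
k <- 0:10
manual_pmf <- lambda^k * exp(-lambda) / factorial(k)
stopifnot(isTRUE(all.equal(manual_pmf, dpois(k, lambda))))

# Hypothetical helper: LRT for one gene, one count per condition.
# H0: both conditions share one per-read rate; H1: each has its own rate.
poisson_lrt <- function(count_a, count_b, lib_a, lib_b) {
  loglik <- function(rate_a, rate_b) {
    dpois(count_a, rate_a * lib_a, log = TRUE) +
      dpois(count_b, rate_b * lib_b, log = TRUE)
  }
  rate_null <- (count_a + count_b) / (lib_a + lib_b)
  stat <- 2 * (loglik(count_a / lib_a, count_b / lib_b) -
               loglik(rate_null, rate_null))
  # Compared with a chi-square distribution with 1 degree of freedom
  c(statistic = stat, p.value = pchisq(stat, df = 1, lower.tail = FALSE))
}

res_equal <- poisson_lrt(50, 50, 1e6, 1e6)    # same rate in both conditions
res_diff  <- poisson_lrt(50, 150, 1e6, 1e6)   # threefold difference
print(rbind(res_equal, res_diff))
```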

Background
Negative Binomial Distribution
[Figure: negative binomial probability mass function; graph from Wikipedia]
$X \sim NB(r; p)$
$\Pr(X = k) = \binom{k + r - 1}{k} p^k (1 - p)^r$
p: probability of success
r: predefined number of failures
X: number of successes until r failures
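
A short check of the slide's parameterization against R's built-in `dnbinom()`. Note that `dnbinom()` labels the two outcome types the other way round, so its `prob` argument corresponds to 1 - p here. The values of r and p are arbitrary.

```r
r <- 3      # predefined number of failures
p <- 0.6    # probability of success
k <- 0:20   # number of successes observed before the r-th failure

# Mass function written out in the slide's convention
manual_pmf <- choose(k + r - 1, k) * p^k * (1 - p)^r

# dnbinom() counts the other outcome type, so its 'prob' is 1 - p here.
builtin_pmf <- dnbinom(k, size = r, prob = 1 - p)

# Mean and variance in this parameterization; the variance exceeds the mean.
nb_mean <- r * p / (1 - p)
nb_var  <- r * p / (1 - p)^2
print(c(mean = nb_mean, variance = nb_var))
```

Unlike the Poisson distribution, the variance is always larger than the mean, which is what makes the negative binomial attractive for overdispersed counts.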

Background
DESeq, voom and vst

DESeq, voom and vst
Normalization in DESeq
Assumption
Most genes are not differentially expressed
Differentially expressed genes are divided equally between up- and down-regulation
Steps
Compute the geometric mean of each gene's counts across all samples
Divide each gene's counts by that geometric mean
Normalization factor for a sample: the median of its ratios
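
The three steps can be sketched in base R as follows. This is an illustration on a made-up count matrix, not the DESeq implementation itself; the function name `median_of_ratios` is hypothetical, and genes with a zero count in any sample are dropped because their geometric mean is zero.

```r
# Toy counts: sample s2 is sequenced twice as deep as s1, and s3 four times.
counts <- matrix(c( 10,  20,  40,
                   100, 200, 400,
                     5,  10,  20,
                    50, 100, 200),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("g", 1:4), paste0("s", 1:3)))

median_of_ratios <- function(counts) {
  log_geo_mean <- rowMeans(log(counts))   # step 1: per-gene log geometric mean
  keep <- is.finite(log_geo_mean)         # genes with a zero count give -Inf
  apply(counts, 2, function(sample_counts) {
    # steps 2 and 3: ratio to the geometric mean, then the median per sample
    exp(median(log(sample_counts[keep]) - log_geo_mean[keep]))
  })
}

size_factors <- median_of_ratios(counts)
normalized <- sweep(counts, 2, size_factors, "/")
print(size_factors)
```

After dividing by the size factors, the three columns of `normalized` are identical, as expected when the samples differ only in sequencing depth.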

Model in DESeq
Read counts for gene i in sample j follow a negative binomial distribution
$K_{ij} \sim NB(\mu_{ij}, \sigma_{ij}^2)$
Why not a Poisson distribution?
In RNA-seq, the variance is larger than the mean
Very difficult to estimate $\mu_{ij}$ and $\sigma_{ij}^2$
Parameter estimation is the main difference between methods based on the NB distribution
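
The model above is written in terms of a mean and a variance, while R's `dnbinom()` takes a mean `mu` and a `size` parameter, related by variance = mu + mu^2 / size. A small sketch of the conversion, with arbitrary example values and a hypothetical helper name:

```r
# Hypothetical helper: convert (mean, variance) to dnbinom()'s 'size'.
# Only valid when the variance exceeds the mean (overdispersion).
nb_size_from_moments <- function(mu, sigma2) {
  stopifnot(all(sigma2 > mu))
  mu^2 / (sigma2 - mu)
}

mu <- 100
sigma2 <- 600
size <- nb_size_from_moments(mu, sigma2)

# Numerical check: recover the mean and variance from the mass function.
k <- 0:5000
pmf <- dnbinom(k, mu = mu, size = size)
check_mean <- sum(k * pmf)
check_var  <- sum((k - check_mean)^2 * pmf)
print(c(size = size, mean = check_mean, variance = check_var))
```

A Poisson model with the same mean would force the variance to be 100, which is why it underestimates the variability seen in RNA-seq replicates.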

Model in DESeq
Count sum for gene i in condition A: a
Count sum for gene i in condition B: b
Sum: $\kappa = a + b$
p(a), p(b) and p(a, b)
p-value:
$p = \frac{\sum_{i+j=\kappa,\ p(i,j) \le p(a,b)} p(i,j)}{\sum_{i+j=\kappa} p(i,j)}$
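
A sketch of this conditional test in base R, assuming the two count sums are independent negative binomial variables whose means and `size` parameters are already known. In DESeq these parameters are estimated from the data; here they are passed in as hypothetical arguments, and the function name `nb_exact_p` is also an assumption.

```r
nb_exact_p <- function(a, b, mu_a, mu_b, size_a, size_b) {
  kappa <- a + b
  i <- 0:kappa
  # p(i, j) = p(i) * p(j) with j = kappa - i, assuming independence
  p_ij <- dnbinom(i, mu = mu_a, size = size_a) *
    dnbinom(kappa - i, mu = mu_b, size = size_b)
  p_obs <- p_ij[a + 1]
  # Sum the outcomes that are no more likely than the observed one;
  # the small relative tolerance guards against floating-point ties.
  sum(p_ij[p_ij <= p_obs * (1 + 1e-7)]) / sum(p_ij)
}

p_balanced <- nb_exact_p(10, 10, mu_a = 10, mu_b = 10, size_a = 5, size_b = 5)
p_extreme  <- nb_exact_p(0, 40, mu_a = 20, mu_b = 20, size_a = 5, size_b = 5)
print(c(balanced = p_balanced, extreme = p_extreme))
```

A perfectly balanced split of the reads is the most likely outcome under the null, so it gets a p-value of 1, while a split that puts all reads in one condition gets a very small one.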

Model in DESeq
R code for DESeq
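library(DESeq)
# Build the count data set from the simulated counts and condition labels
DESeq.cds = newCountDataSet(countData = data.sim$counts,
                            conditions = factor(data.sim$treatment))
# Median-of-ratios size factors, then per-gene dispersion estimates
DESeq.cds = estimateSizeFactors(DESeq.cds)
DESeq.cds = estimateDispersions(DESeq.cds, fitType = "local")
# Negative binomial test between conditions "1" and "2"
DESeq.test = nbinomTest(DESeq.cds, "1", "2")
DESeq.pvalues = DESeq.test$pval
# Benjamini-Hochberg correction for multiple testing
DESeq.adjpvalues = p.adjust(DESeq.pvalues, method = "BH")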

Model in limma
Linear Models for Microarray Data: lmFit()
Classical t-test: $t_j = \frac{\mu_{1j} - \mu_{2j}}{\sqrt{\sigma_j^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$
Very hard to estimate $\sigma_j^2$ from a small sample
limma: moderated t-test
Use information from other genes
$\sigma_j^2 \sim \text{Inverse Gamma}(\alpha, \beta)$
Empirical Bayes for parameter estimation: eBayes()
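
To illustrate how borrowing information changes the statistic, the sketch below shrinks one gene's pooled variance toward a prior variance, weighting each by its degrees of freedom as in Smyth [3]. In limma the prior variance and prior degrees of freedom are estimated from all genes by eBayes(); here `prior_var` and `prior_df` are hypothetical fixed inputs, and the expression values are made up.

```r
moderated_t <- function(x1, x2, prior_var, prior_df) {
  n1 <- length(x1)
  n2 <- length(x2)
  df <- n1 + n2 - 2
  # Pooled sample variance for this gene
  pooled_var <- (sum((x1 - mean(x1))^2) + sum((x2 - mean(x2))^2)) / df
  # Posterior variance: weighted average of prior and sample variance
  post_var <- (prior_df * prior_var + df * pooled_var) / (prior_df + df)
  se_scale <- sqrt(1 / n1 + 1 / n2)
  diff <- mean(x1) - mean(x2)
  c(t_classical  = diff / (sqrt(pooled_var) * se_scale),
    t_moderated  = diff / (sqrt(post_var) * se_scale),
    df_moderated = prior_df + df)
}

# Three replicates per group with an unusually small sample variance
x1 <- c(5.1, 5.0, 5.2)
x2 <- c(4.0, 4.1, 3.9)
res <- moderated_t(x1, x2, prior_var = 0.25, prior_df = 4)
print(res)
```

With only three replicates per group the sample variance can be tiny by chance, which inflates the classical statistic; the moderated version is more conservative and gains degrees of freedom from the prior.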

Model in voom
voom: variance modelling at the observational level
Locally weighted regression (LOWESS) to estimate the relation between count size and variance
Moderated t-test in limma
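
A simplified base-R sketch of the idea, following the description in Law et al. [2]: transform counts to log2 counts per million, fit a linear model per gene, smooth the square root of the residual standard deviation against the average log2 count, and turn the predicted trend into precision weights. The data are simulated, the default `lowess()` span is used, and this is not a substitute for limma's `voom()`.

```r
set.seed(1)
n_genes <- 200
group <- factor(rep(c("A", "B"), each = 3))
n_samples <- length(group)

# Simulated counts (assumption): NB with gene means from 4 to 1024, no true DE
gene_mean <- 2^seq(2, 10, length.out = n_genes)
counts <- matrix(rnbinom(n_genes * n_samples,
                         mu = rep(gene_mean, n_samples), size = 5),
                 nrow = n_genes)

lib_size <- colSums(counts)
# log2 counts per million with the offsets 0.5 and 1 used by voom
log_cpm <- t(log2((t(counts) + 0.5) / (lib_size + 1) * 1e6))

# Gene-wise linear models (samples in rows for lm.fit)
design <- model.matrix(~group)
fit <- lm.fit(design, t(log_cpm))
resid_sd <- sqrt(colSums(fit$residuals^2) / fit$df.residual)

# Mean-variance trend: sqrt(sd) against average log2 count size
mean_log_count <- rowMeans(log_cpm) + mean(log2(lib_size + 1)) - log2(1e6)
trend <- lowess(mean_log_count, sqrt(resid_sd))
trend_fun <- approxfun(trend$x, trend$y, rule = 2, ties = mean)

# Observation-level weights: inverse predicted variance at the fitted count
fitted_log_count <- t(fit$fitted.values + log2(lib_size + 1) - log2(1e6))
weights <- matrix(1 / trend_fun(as.vector(fitted_log_count))^4,
                  nrow = n_genes)
```

The trend predicts the square root of the standard deviation, so raising it to the fourth power gives a variance, and its inverse is the weight that would be passed on to the linear model fit.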

Model in voom
[Figure: three panels illustrating voom. (a) Sqrt(standard deviation) versus average log2(count size + 0.5). (b) "voom: Mean-variance trend", against average log2(count size + 0.5). (c) The trend against fitted log2(count size + 0.5)]
Law et al. Genome Biology 2014, 15:R29

Model in voom
R code for voom
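library(limma)
library(edgeR)  # calcNormFactors() comes from edgeR, not DESeq
group = factor(conditions)
# TMM normalization factors, used to scale the library sizes
nf = calcNormFactors(data.matrix, method = "TMM")
# log2-cpm values with observation-level precision weights
voom.data = voom(data.matrix, design = model.matrix(~group),
                 lib.size = colSums(data.matrix) * nf)
voom.data$genes = rownames(data.matrix)
# Weighted linear model followed by the moderated t-test
voom.fitlimma = lmFit(voom.data, design = model.matrix(~group))
voom.fitbayes = eBayes(voom.fitlimma)
# Column 2 of the design is the group effect
voom.pvalues = voom.fitbayes$p.value[, 2]
voom.adjpvalues = p.adjust(voom.pvalues, method = "BH")
# Genes called differentially expressed at an adjusted p-value of 0.05
voom.genes <- data.matrix[which(voom.adjpvalues <= 0.05), ]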

Model in vst
Variance-stabilizing transformation
Find a simple function f such that the variance of the transformed values y = f(x) does not depend on the mean
A method used in microarray data analysis [4]
Moderated t-test in limma
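
The textbook example of such a function is the square root for Poisson counts, whose variance after transformation is close to 1/4 regardless of the mean. The simulation below illustrates this simpler case only. The transformation used by DESeq is derived from the fitted dispersion-mean relation and is not reproduced here.

```r
set.seed(42)
means <- c(20, 200, 2000)
n_draws <- 1e5

# Variance of raw Poisson counts grows with the mean ...
raw_var <- sapply(means, function(m) var(rpois(n_draws, m)))
# ... while the variance of sqrt(count) stays near 0.25.
stab_var <- sapply(means, function(m) var(sqrt(rpois(n_draws, m))))

print(rbind(mean = means, raw_var = raw_var, stab_var = stab_var))
```

Once the variance no longer depends on the mean, methods that assume a common variance structure, such as the linear models in limma, become applicable.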

Model in vst
R code for vst
library(DESeq)
library(limma)
group = factor(conditions)
DESeq.cds = newCountDataSet(countData = data.matrix,
                            conditions = group)
DESeq.cds = estimateSizeFactors(DESeq.cds)
# "blind" ignores the condition labels when estimating dispersions
DESeq.cds = estimateDispersions(DESeq.cds, method = "blind",
                                fitType = "local")
# Variance-stabilized expression values
DESeq.vst = getVarianceStabilizedData(DESeq.cds)
# Linear model and moderated t-test on the transformed data
DESeq.vst.fitlimma = lmFit(DESeq.vst, design = model.matrix(~group))
DESeq.vst.fitbayes = eBayes(DESeq.vst.fitlimma)
# Column 2 of the design is the group effect
DESeq.vst.pvalues = DESeq.vst.fitbayes$p.value[, 2]
DESeq.vst.adjpvalues = p.adjust(DESeq.vst.pvalues, method = "BH")
# Genes called differentially expressed at an adjusted p-value of 0.05
DESeq.vst.genes <- data.matrix[which(DESeq.vst.adjpvalues <= 0.05), ]

Results from Simulation
AUC Results
[Figure: AUC (0.5 to 0.8) versus number of samples per condition (5 to 15) for baySeq, DESeq, EBSeq, edgeR, NBPSeq, SAMseq, ShrinkSeq, TSPM, voom and vst]

Differential Expression Gene Number
[Figure: number of differentially expressed genes called by each software (baySeq, DESeq, NBPSeq, voom, vst, edgeR, ShrinkSeq, TSPM, EBSeq, SAMSeq), split into correct and incorrect calls]

Running Time
[Figure: running time in seconds (0 to 500) versus number of samples per condition (5 to 15) for baySeq, DESeq, EBSeq, edgeR, NBPSeq, SAMseq, ShrinkSeq, TSPM, voom and vst]

Running Time with 15 Samples per Condition
Software    AUC     Time (sec)
edgeR       0.810     0.630
DESeq       0.652    48.388
NBPSeq      0.767    24.942
baySeq      0.495   210.781
EBSeq       0.769    12.666
TSPM        0.836     7.486
SAMseq      0.827     1.801
voom        0.835     0.264
vst         0.830     0.138
ShrinkSeq   0.796   343.260

Venn Diagram for Drosophila melanogaster
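[Figure: three-way Venn diagram of the differentially expressed genes called by DESeq, voom and vst; region counts shown: 4, 7, 13, 11, 310, 178, 17]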

Some Conclusions
Each method makes many assumptions
The negative binomial model has relatively better specificity and sensitivity
voom and vst perform well in both accuracy and running time, with no clear difference between them
All methods perform better with larger samples; in practice, however, sample size is very limited
cuffdiff normalizes differently: it accounts for both alternative isoforms and transcript length

Some Conclusions
References
[1] Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Genome Biology, 11:R106, 2010.
[2] Charity W Law, Yunshun Chen, Wei Shi, and Gordon K Smyth. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15(2):R29, 2014.
[3] Gordon K Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3:Article 3, 2004.
[4] Blythe P Durbin, Johanna S Hardin, Douglas M Hawkins, and David M Rocke. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics, pages S105–S110, 2002.

Thanks
Thank you for your time!
Qiang Kou
qkou@umail.iu.edu
