1. DESeq, voom and vst
Qiang Kou
qkou@umail.iu.edu
April 28, 2014
Qiang Kou (qkou@umail.iu.edu) DESeq, voom and vst April 28, 2014 1 / 31
2. Background
Advantages of RNA-seq Compared to Microarray
Detecting novel transcripts and isoforms
High reproducibility, low background
Detection of gene fusions and SNPs
3. Background
Differential Expression Analysis
Steps
Normalization
Dispersion estimation
Statistical testing
Methods to be presented
DESeq: negative binomial distribution [1]
voom: variance modelling at the observational level [2]
vst: variance-stabilizing transformation [1, 3]
4. Background
Timeline
[Figure: timeline of methods along an axis from 2002 to 2016, in order: vst, limma, cufflinks, DESeq, edgeR, baySeq, voom]
7. Background
Length Normalization
Within sample: gene length
Between samples: library size
RPKM and FPKM
Reads/fragments per kilobase per million mapped reads
Normalization for gene length and library size
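As a minimal sketch of the RPKM definition above, the following R function divides counts by library size (in millions) and gene length (in kilobases). The function name `rpkm`, the argument names, and the use of column sums as the number of mapped reads are assumptions made for illustration.

```r
# counts: genes x samples matrix of read counts
# gene_length_bp: gene lengths in base pairs, one per row of counts
rpkm <- function(counts, gene_length_bp) {
  # Assumption: library size = total reads counted in each sample
  lib_size <- colSums(counts)
  # Between samples: scale each column to reads per million
  per_million <- t(t(counts) / lib_size) * 1e6
  # Within sample: scale each row by gene length in kilobases
  per_million / (gene_length_bp / 1000)
}

# Toy example: two genes, two samples
toy_counts <- matrix(c(10, 20, 30, 40), nrow = 2)
toy_rpkm <- rpkm(toy_counts, gene_length_bp = c(1000, 2000))
```

In the toy example, gene 2 has twice the reads and twice the length of gene 1 in the first sample, so both genes get the same RPKM there.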
8. Background
Different Distribution
[Figure: density plots. (a) Microarray: density against expression. (b) RNA-seq: density against log10(fpkm) of genes, by condition (Untreated, CG8144_RNAi)]
9. Background
Differential Expression as a Function of Transcript Length
[Figure: %DE against transcript length (bp). Panels: (a) Sequencing Data (Sultan), (b) Array Data (Sultan), (c) Sequencing Data (Cloonan), (d) Sequencing Data (Marioni), (e) Array Data (Marioni), (f) Sequencing Data (Marioni), (g) Array Data (Marioni)]
Oshlack et al. (2009) Biology Direct 4:14
10. Background
Poisson and Negative Binomial Distribution
11. Background
Poisson Distribution
Graph from Wikipedia
$\Pr(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$
$E(X) = \mathrm{Var}(X) = \lambda$
A list of genes $g_1, g_2, \ldots, g_n$
$X \sim \mathrm{Poisson}(\lambda)$, a random variable representing the number of reads falling in $g_i$
Likelihood ratio test
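The slide names the likelihood ratio test without detail. One way it might look for a single gene under the Poisson model is sketched below. The function name `poisson_lrt`, the two-condition setup, and the use of library sizes as exposure are assumptions, not taken from the slides.

```r
# a, b: read counts of one gene in conditions A and B
# lib_a, lib_b: library sizes of the two conditions (assumed exposure terms)
poisson_lrt <- function(a, b, lib_a, lib_b) {
  # Null hypothesis: same expression rate, so expected counts are
  # proportional to library size
  mu_a0 <- (a + b) * lib_a / (lib_a + lib_b)
  mu_b0 <- (a + b) * lib_b / (lib_a + lib_b)
  ll_null <- dpois(a, mu_a0, log = TRUE) + dpois(b, mu_b0, log = TRUE)
  # Alternative: separate rates, whose maximum likelihood estimates
  # reproduce the observed counts
  ll_alt <- dpois(a, a, log = TRUE) + dpois(b, b, log = TRUE)
  # The statistic is compared with a chi-squared distribution, 1 df
  stat <- 2 * (ll_alt - ll_null)
  pchisq(stat, df = 1, lower.tail = FALSE)
}

p_same <- poisson_lrt(100, 100, 1e6, 1e6)  # equal counts, equal depth
p_diff <- poisson_lrt(100, 200, 1e6, 1e6)  # two-fold difference
```

Because the Poisson model forces the variance to equal the mean, a test like this tends to give overly small p-values when biological replicates vary more than that.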
12. Background
Negative Binomial Distribution
Graph from Wikipedia
$X \sim \mathrm{NB}(r; p)$
$\Pr(X = k) = \binom{k+r-1}{k} p^k (1 - p)^r$
p: probability of success
r: predefined number of failures
X: number of successes until r failures
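Under the parametrization above (X counts successes until r failures, with success probability p), the standard moments show why the distribution can model counts whose variance exceeds the mean:

```latex
\begin{align*}
E(X) &= \frac{p\,r}{1 - p}, \\
\mathrm{Var}(X) &= \frac{p\,r}{(1 - p)^2} = \frac{E(X)}{1 - p} > E(X)
\quad \text{for } 0 < p < 1 .
\end{align*}
```

The ratio of variance to mean is 1 / (1 - p), so it is always greater than 1, unlike the Poisson case where it is fixed at 1.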
14. DESeq, voom and vst
Normalization in DESeq
Assumption
Most genes not expressed differentially
Differentially expressed genes divided equally between up- and down-regulation
Steps
Compute the geometric mean of each gene's counts across all samples
Divide each gene's counts by its geometric mean
Normalization factor of a sample: median of these ratios over all genes
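The steps above can be sketched in a few lines of base R. This is an illustrative re-implementation, not the DESeq source; the function name `size_factors` and the choice to skip genes with a zero count (whose geometric mean is zero) are assumptions.

```r
# counts: genes x samples matrix of raw read counts
size_factors <- function(counts) {
  log_counts <- log(counts)
  # Step 1: log of the geometric mean of each gene across samples
  log_geo_mean <- rowMeans(log_counts)
  # Genes with a zero count in any sample have a geometric mean of zero
  keep <- is.finite(log_geo_mean)
  # Steps 2 and 3: per sample, take the median ratio of count to geometric mean
  apply(log_counts[keep, , drop = FALSE], 2, function(col) {
    exp(median(col - log_geo_mean[keep]))
  })
}

# Toy example: sample s2 is sequenced exactly twice as deeply as s1
toy <- cbind(s1 = c(10, 20, 30, 400), s2 = c(20, 40, 60, 800))
sf <- size_factors(toy)
normalized <- t(t(toy) / sf)
```

In the toy example the two size factors differ by a factor of two, and the normalized columns are identical. Using the median makes the factor robust to a minority of strongly differentially expressed genes, which is where the stated assumptions come in.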
15. Model in DESeq
Read counts for gene i in sample j follow a negative binomial distribution
$K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma_{ij}^2)$
Why not Poisson distribution?
In RNA-seq, variance is larger than mean
Very difficult to estimate $\mu_{ij}$ and $\sigma_{ij}^2$
Parameter estimation is the main difference between methods based on the NB distribution
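A small simulation illustrates the overdispersion argument. The mean, the dispersion (`size = 5`) and the number of draws are arbitrary values chosen for illustration, not estimates from real data.

```r
set.seed(1)
n_draws <- 10000
mu <- 100

# Poisson counts: variance equals the mean
pois_counts <- rpois(n_draws, lambda = mu)
# Negative binomial counts: variance = mu + mu^2 / size
nb_counts <- rnbinom(n_draws, mu = mu, size = 5)

# Ratio of variance to mean for each model
dispersion_ratio <- c(
  poisson = var(pois_counts) / mean(pois_counts),
  negbin  = var(nb_counts) / mean(nb_counts)
)
```

The Poisson ratio stays close to 1, while the negative binomial ratio is much larger, which is the pattern seen across biological replicates in RNA-seq.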
16. Model in DESeq
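Count sum for gene i in condition A: a
Count sum for gene i in condition B: b
Sum: $\kappa = a + b$
p(a), p(b) and p(a, b), with p(a, b) = p(a) p(b) under the null hypothesis
p-value (here i and j run over all count pairs with the same total, not over genes and samples):
$p = \frac{\sum_{i+j=\kappa,\; p(i,j) \le p(a,b)} p(i,j)}{\sum_{i+j=\kappa} p(i,j)}$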
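The p-value above sums the probabilities of all count pairs with the same total that are no more likely than the observed pair (the observed pair itself is included, i.e. the comparison is "less than or equal"). A hedged base-R sketch follows. It takes the negative binomial means and dispersions as given inputs, whereas DESeq estimates them from the data; the function and argument names are hypothetical.

```r
# a, b: observed count sums in conditions A and B
# mu_a, mu_b: null-hypothesis means; size_a, size_b: NB size parameters
# (variance = mu + mu^2 / size); all four are assumed to be known here
nb_exact_pvalue <- function(a, b, mu_a, mu_b, size_a, size_b) {
  kappa <- a + b
  i <- 0:kappa
  # p(i, j) = p(i) * p(j) for every split of kappa into i + j
  joint <- dnbinom(i, mu = mu_a, size = size_a) *
    dnbinom(kappa - i, mu = mu_b, size = size_b)
  # Probability of the observed pair (a, b)
  p_obs <- joint[a + 1]
  # Outcomes as likely or less likely than the observed one, renormalized
  sum(joint[joint <= p_obs]) / sum(joint)
}

p_balanced <- nb_exact_pvalue(50, 50, mu_a = 50, mu_b = 50, size_a = 5, size_b = 5)
p_extreme  <- nb_exact_pvalue(100, 0, mu_a = 50, mu_b = 50, size_a = 5, size_b = 5)
```

With identical parameters in both conditions, an even split is the most likely outcome and gives a p-value of 1, while a completely one-sided split gives a much smaller value.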
17. Model in DESeq
R code for DESeq
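library(DESeq)
# data.sim: simulated data set with a count matrix ($counts) and condition labels ($treatment)
DESeq.cds <- newCountDataSet(countData = data.sim$counts,
                             conditions = factor(data.sim$treatment))
# Normalization: median-of-ratios size factors
DESeq.cds <- estimateSizeFactors(DESeq.cds)
# Dispersion estimation with a local regression fit
DESeq.cds <- estimateDispersions(DESeq.cds, fitType = "local")
# Statistical testing: negative binomial test between conditions "1" and "2"
DESeq.test <- nbinomTest(DESeq.cds, "1", "2")
DESeq.pvalues <- DESeq.test$pval
# Benjamini-Hochberg adjustment for multiple testing
DESeq.adjpvalues <- p.adjust(DESeq.pvalues, method = "BH")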
18. Model in limma
Linear Models for Microarray Data: lmFit()
Classical t-test: $t_j = \frac{\mu_{1j} - \mu_{2j}}{\sqrt{\sigma_j^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$
Very hard to estimate $\sigma_j^2$ from a small sample size
limma: moderated t-test
Use information from other genes
$\sigma_j^2 \sim \text{Inverse Gamma}(\alpha, \beta)$
Empirical Bayes for parameter estimation: eBayes()
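The moderated t-statistic can be written out as follows, using the notation of Smyth [3] rather than symbols defined on the slide: $s_j^2$ is the sample variance of gene j with $d_j$ residual degrees of freedom, and $s_0^2$, $d_0$ are the prior variance and prior degrees of freedom estimated from all genes. In the two-group setting here, $d_j = n_1 + n_2 - 2$.

```latex
\begin{align*}
\tilde{s}_j^2 &= \frac{d_0 s_0^2 + d_j s_j^2}{d_0 + d_j}, \\
\tilde{t}_j &= \frac{\mu_{1j} - \mu_{2j}}
  {\sqrt{\tilde{s}_j^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}
  \;\sim\; t_{d_0 + d_j} \quad \text{under the null hypothesis.}
\end{align*}
```

The gene-wise variance is shrunk towards the common prior value, and the extra $d_0$ degrees of freedom reflect the information borrowed from the other genes. The prior corresponds to the inverse gamma distribution above with $\alpha = d_0 / 2$ and $\beta = d_0 s_0^2 / 2$.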
19. Model in voom
voom: variance modelling at the observational level
Locally weighted regression to get the relation between count and variance
Moderated t-test in limma
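The idea behind voom can be sketched in base R. This is a simplified illustration, not the limma implementation: it assumes a single group (intercept-only model), and the function name `voom_weights_sketch` and its arguments are hypothetical. In practice the `voom()` function in limma does this for an arbitrary design matrix and passes the weights on to `lmFit()` and `eBayes()`.

```r
# counts: genes x samples matrix of raw read counts
voom_weights_sketch <- function(counts, span = 0.5) {
  lib_size <- colSums(counts)
  # log2 counts per million, with offsets to avoid taking the log of zero
  log_cpm <- t(log2((t(counts) + 0.5) / (lib_size + 1) * 1e6))
  # Assumption: intercept-only model, so the fitted value is the gene mean
  gene_mean <- rowMeans(log_cpm)
  gene_sd <- apply(log_cpm, 1, sd)
  # Average log2 count size per gene
  avg_log_count <- gene_mean + mean(log2(lib_size + 1)) - log2(1e6)
  # Locally weighted regression of sqrt(sd) on average log2 count size
  trend <- lowess(avg_log_count, sqrt(gene_sd), f = span)
  trend_fun <- approxfun(trend, rule = 2, ties = mean)
  # Fitted log2 count size for every observation (gene x sample)
  fitted_log_count <- outer(gene_mean, log2(lib_size + 1) - log2(1e6), "+")
  # Precision weight = inverse of the predicted variance = 1 / (sqrt(sd))^4
  weights <- 1 / trend_fun(fitted_log_count)^4
  dim(weights) <- dim(counts)
  list(E = log_cpm, weights = weights)
}

# Simulated counts: 200 genes, 6 samples, four expression levels
set.seed(1)
gene_mu <- rep(c(20, 200, 2000, 20000), each = 50)
sim_counts <- matrix(rnbinom(200 * 6, mu = gene_mu, size = 10), nrow = 200)
v <- voom_weights_sketch(sim_counts)
```

The weights are attached to individual observations rather than to genes, which is what "variance modelling at the observational level" refers to.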
20. Model in voom
[Figure: voom mean-variance trend, three panels. Panels a and b: Sqrt(standard deviation) against Average log2(count size + 0.5); panel c: against Fitted log2(count size + 0.5)]
Law et al. Genome Biology 2014, 15:R29
22. Model in vst
Variance-stabilizing transformation
Find a simple function f such that the variability of the transformed values y = f(x) does not depend on the mean
A method used in microarray data analysis [4]
Moderated t-test in limma
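To make the idea concrete, here is a sketch for a negative binomial variable whose variance is mu + alpha * mu^2. Integrating 1 / sqrt(variance) over the mean gives the closed form used below. A constant, known dispersion `alpha` is an assumption made for illustration; DESeq derives its transformation from the fitted dispersion trend instead.

```r
# f(x) = (2 / sqrt(alpha)) * asinh(sqrt(alpha * x)) has derivative
# 1 / sqrt(x + alpha * x^2), so by the delta method Var(f(X)) is about 1
vst_nb <- function(x, alpha) {
  (2 / sqrt(alpha)) * asinh(sqrt(alpha * x))
}

set.seed(1)
alpha <- 0.01
means <- c(100, 1000, 10000)
raw_var <- numeric(length(means))
vst_var <- numeric(length(means))
for (k in seq_along(means)) {
  # rnbinom with size = 1 / alpha has variance mu + alpha * mu^2
  x <- rnbinom(20000, mu = means[k], size = 1 / alpha)
  raw_var[k] <- var(x)
  vst_var[k] <- var(vst_nb(x, alpha))
}
vst_summary <- data.frame(mean = means, raw_var = raw_var, vst_var = vst_var)
```

The raw variance grows by orders of magnitude with the mean, while the variance after transformation stays roughly constant near 1, which is what allows methods that assume a constant variance, such as the moderated t-test, to be applied.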
27. Results from Simulation
Running Time with 15 Samples per Condition
Software    AUC     Time
edgeR       0.810     0.630
DESeq       0.652    48.388
NBPSeq      0.767    24.942
baySeq      0.495   210.781
EBSeq       0.769    12.666
TSPM        0.836     7.486
SAMseq      0.827     1.801
voom        0.835     0.264
vst         0.830     0.138
ShrinkSeq   0.796   343.260
28. Results from Simulation
Venn Diagram for Drosophila melanogaster
[Figure: Venn diagram comparing DESeq, voom and vst; region counts: 4, 7, 13, 11, 310, 178, 17]
29. Some Conclusions
Each method relies on many assumptions
The negative binomial model has relatively better specificity and sensitivity
voom and vst perform well in both accuracy and running time, with no substantial difference between them
All methods perform better with larger samples; however, sample size is very limited in practice
Different normalization in cuffdiff: accounts for both alternative isoforms and transcript length
30. Some Conclusions
References
[1] Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Genome Biology, 11:R106, 2010.
[2] Charity W Law, Yunshun Chen, Wei Shi, and Gordon K Smyth. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15(2):R29, 2014.
[3] Gordon K Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3:Article 3, 2004.
[4] Blythe P Durbin, Johanna S Hardin, Douglas M Hawkins, and David M Rocke. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics, pages S105–S110, 2002.
31. Thanks
Thank you for your time!
Qiang Kou
qkou@umail.iu.edu