2. Background
• Comparison of commonly used DE software packages
– Cuffdiff
– edgeR
– DESeq
– PoissonSeq
– baySeq
– limma
• Two benchmark datasets
– Sequencing Quality Control (SEQC) dataset
• Includes qRT-PCR for 1,000 genes
– Biological replicates from 3 cell lines as part of the ENCODE project
3. Focus of paper:
Comparison of relevant measures for DE detection
• Normalization of count data
• Sensitivity and specificity of DE detection
• Genes expressed in one condition but not in the other
• Sequencing depth and number of replicates
4. Theoretical background
• Count matrix: number of reads assigned to gene i in sequencing experiment j
• Length bias when measuring gene expression by RNA-seq
– Reduces the ability to detect differential expression among shorter genes
• Differential gene expression analysis consists of 3 components:
– Normalization of counts
– Parameter estimation of the statistical model
– Tests for differential expression
5. Normalization
• Commonly used
– RPKM
– FPKM
– Bias: the proportional representation of each gene depends on the expression levels of other genes
• DESeq: scaling-factor-based normalization
– Per-sample median, across genes, of the ratio of each gene's read count to its geometric mean across all samples
• Cuffdiff: extension of DESeq normalization
– Intra-condition library scaling
– Second scaling between conditions
– Also accounts for changes in isoform levels
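The median-of-ratios scaling described for DESeq can be sketched as below. This is a minimal illustration in plain Python, not the package's own code; the function name, the toy count matrix, and the choice to skip genes with a zero count in any sample are assumptions of this sketch.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """counts[i][j] = reads assigned to gene i in experiment j.

    Returns one scaling factor per sample (hypothetical helper).
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        # Assumption: genes with a zero count in any sample are skipped,
        # because their geometric mean across samples is zero.
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            # log of (count / geometric mean of this gene across samples)
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor of a sample is the median ratio over the retained genes.
    return [math.exp(median(r)) for r in log_ratios]

# Toy data: the second sample was sequenced twice as deeply as the first.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],  # skipped: zero count in the first sample
]
factors = median_of_ratios_size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
```

Dividing each column by its factor puts the samples on a common scale; in the toy data the two factors differ by exactly the twofold depth difference.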
6. Normalization
• edgeR
– Trimmed mean of M-values (TMM)
– Weighted average over a subset of genes (excluding genes with high average read counts and genes with large differences in expression)
• baySeq
– Sum gene counts up to the upper quartile (upper 25% quantile) to normalize library size
• PoissonSeq
– Goodness-of-fit estimate defines a gene set that is least differentiated between the 2 conditions; this set is then used to compute library normalization factors
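The TMM idea (trim extreme genes, then take a weighted average of the remaining log fold changes) can be sketched as follows. This is a simplified illustration, not edgeR's implementation: the function name, the default trim fractions (30% on log ratios, 5% on average abundance), the inverse-variance style weights, and the handling of zero counts are assumptions of this sketch.

```python
import math

def tmm_factor(sample, reference, trim_m=0.30, trim_a=0.05):
    """Scaling factor of `sample` relative to `reference` (hypothetical helper).

    Both arguments are lists of read counts for the same genes, in the same order.
    """
    n_s, n_r = sum(sample), sum(reference)
    genes = []
    for y_s, y_r in zip(sample, reference):
        if y_s <= 0 or y_r <= 0:
            continue  # assumption: genes with a zero count are ignored
        m = math.log2((y_s / n_s) / (y_r / n_r))        # M: log fold change
        a = 0.5 * math.log2((y_s / n_s) * (y_r / n_r))  # A: average log abundance
        # Weight: inverse of an approximate variance of M.
        w = 1.0 / ((n_s - y_s) / (n_s * y_s) + (n_r - y_r) / (n_r * y_r))
        genes.append((m, a, w))

    def central(key, trim):
        # Indices of genes left after trimming `trim` from both tails.
        order = sorted(range(len(genes)), key=lambda g: genes[g][key])
        lo = int(len(order) * trim)
        return set(order[lo:len(order) - lo])

    kept = central(0, trim_m) & central(1, trim_a)
    total_w = sum(genes[g][2] for g in kept)
    log2_factor = sum(genes[g][2] * genes[g][0] for g in kept) / total_w
    return 2.0 ** log2_factor

# Toy data: nine unchanged genes plus one very highly expressed gene.
reference = [100] * 10
sample = [100] * 9 + [10000]
factor = tmm_factor(sample, reference)
```

In the toy data the single dominant gene is trimmed away, so the factor reflects only the nine unchanged genes whose share of the library shrank.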
7. Normalization
• limma (2 normalization procedures)
– Quantile normalization: sorts the counts from each sample and sets the values equal to the quantile mean across all samples
– Voom: LOWESS regression to estimate the mean-variance relation; transforms read counts to log form for linear modeling
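Quantile normalization as described above can be sketched in a few lines. This is a minimal illustration, not limma's implementation; the function name is hypothetical, and ties are broken by gene order, which is a simplifying assumption.

```python
def quantile_normalize(samples):
    """samples[j] holds the values of sample j (same genes, same order).

    Returns a list of normalized samples of the same shape.
    """
    n_genes = len(samples[0])
    sorted_cols = [sorted(col) for col in samples]
    # Mean of the r-th smallest value across all samples.
    rank_means = [
        sum(col[r] for col in sorted_cols) / len(samples)
        for r in range(n_genes)
    ]
    normalized = []
    for col in samples:
        # Rank genes within the sample (ties broken by position).
        order = sorted(range(n_genes), key=lambda g: col[g])
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            out[g] = rank_means[rank]  # replace value with the quantile mean
        normalized.append(out)
    return normalized

samples = [[5, 2, 3], [4, 1, 6]]
result = quantile_normalize(samples)
```

After normalization every sample has the same set of values; only the assignment of values to genes differs between samples.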
8. Statistical modeling of gene expression
• edgeR and DESeq
– Negative binomial distribution (estimation of dispersion factor)
• edgeR
– Estimation of dispersion factor as a weighted combination of 2 components
• Gene-specific dispersion effect and common dispersion effect calculated for all genes
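For reference, the role of the dispersion factor can be seen in the standard mean-variance relation of the negative binomial distribution. The symbols are notation assumed here, not taken from the slides: Y_ij is the count for gene i in experiment j, mu_ij its mean, and phi_i the dispersion of gene i.

```latex
\operatorname{Var}(Y_{ij}) = \mu_{ij} + \phi_i \, \mu_{ij}^{2}
```

With phi_i = 0 this reduces to the Poisson case, where the variance equals the mean.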
9. Statistical modeling of gene expression
• DESeq
– Variance estimate decomposed into a Poisson estimate plus a second term that models biological variability
• Cuffdiff
– Separate variance models for single-isoform and multiple-isoform genes
• Single isoform: similar to DESeq
• Multiple isoform: mixed model of negative binomial and beta distributions
10. Statistical modeling of gene expression
• baySeq
– Full Bayesian model of negative binomial distributions
– Prior probability parameters are estimated by numerical sampling of the data
• PoissonSeq
– Models gene counts as a Poisson variable
– Mean of the distribution represented by a log-linear relationship of library size, expression of the gene, and correlation of the gene with condition
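One way to write the log-linear mean described for PoissonSeq is shown below. The symbols are assumptions introduced for illustration: d_j is the library size of experiment j, beta_i the expression level of gene i, y_j the condition label of experiment j, and gamma_i the term tying gene i to the condition.

```latex
Y_{ij} \sim \mathrm{Poisson}(\mu_{ij}), \qquad
\log \mu_{ij} = \log d_j + \log \beta_i + \gamma_i \, y_j
```

Under this form, gamma_i = 0 corresponds to gene i showing no association with the condition.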
11. Test for differential expression
• edgeR and DESeq
– Variation of the Fisher exact test, modified for the negative binomial distribution
– Returns an exact P value from the derived probabilities
• Cuffdiff
– Ratio of normalized counts between the 2 conditions (approximately normal after log transformation)
– t-test to calculate P value
12. Test for differential expression
• limma
– Moderated t-statistic with modified standard error and degrees of freedom
• baySeq
– Estimates 2 models for every gene
• No differential expression
• Differential expression
– Posterior likelihood of DE given the data is used to identify differentially expressed genes (no P value)
13. Test for differential expression
• PoissonSeq
– Test for significance of correlation term
– Evaluated by score statistics, which follow a chi-squared distribution (used to derive P values)
• Multiple hypothesis corrections
– Benjamini-Hochberg
– PoissonSeq: permutation-based FDR
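The Benjamini-Hochberg correction listed above can be sketched as follows. This is a minimal illustration of the standard step-up adjustment in plain Python; the function name and the example P values are made up for the sketch.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted P values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * m / rank)
        adjusted[k] = running_min
    return adjusted

# Hypothetical P values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
# Genes called differentially expressed at a 5% false discovery rate.
significant = [k for k, q in enumerate(adjusted) if q <= 0.05]
```

Each adjusted value is the smallest false discovery rate at which that gene would still be called significant.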
14. Results
• Normalization and log expression correlation
• Differential expression analysis
• Evaluation of type I errors
• Evaluation of genes expressed in only one condition
• Impact of sequencing depth and replication on DE detection