Background - DESeq
• Modelling the number of reads sequenced from a gene X
– Can use a Binomial B(n, p), n=total number of reads, p=prob. from gene X
– Can approximate with a Poisson(np) as n large, p small
– Poisson model works ok for a gene’s variation between technical replicates
– However, Poisson underestimates variation between biological replicates
– edgeR and DESeq use a negative binomial instead (for gene i in sample j)
Equation (1): Kij ~ NB(mu_ij, sigma^2_ij)
– Negative binomial has two parameters, mean mu and variance sigma^2
– Number of replicates is usually too small to estimate both for a gene X
• edgeR
– Assumes sigma^2 = mu + alpha*mu^2, where alpha is the same for all genes
– Just needs to estimate mu for a gene, then calculate sigma^2 from that
• DESeq
– For each condition, makes a local regression of sigma^2 versus mu
– Given mu for gene X, uses the local regression to estimate sigma^2
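The points on this slide can be checked numerically. The sketch below is only an illustration: the read depth, the gene fraction and alpha = 0.1 are made-up values, and the function names are hypothetical. It compares the Binomial with its Poisson approximation, and then the Poisson variance with the edgeR-style variance sigma^2 = mu + alpha*mu^2.

```python
import math


def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_variance(mu):
    """Poisson model: the variance equals the mean."""
    return mu


def nb_variance_common_alpha(mu, alpha):
    """edgeR-style assumption from the slide: sigma^2 = mu + alpha * mu^2."""
    return mu + alpha * mu ** 2


# Hypothetical numbers: 1,000,000 reads, gene X receives a fraction 1e-5 of them.
n_reads, p_gene = 1_000_000, 1e-5
for k in (5, 10, 15):
    print(k, binomial_pmf(k, n_reads, p_gene), poisson_pmf(k, n_reads * p_gene))

# Hypothetical alpha = 0.1: the gap to the Poisson variance grows with the mean.
for mu in (10, 100, 1000):
    print(mu, poisson_variance(mu), nb_variance_common_alpha(mu, 0.1))
```

With a fixed alpha the extra term alpha*mu^2 dominates for large mu, which is why the choice of variance model matters most for strongly expressed genes.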
Results & Discussion
• DESeq’s model - makes three assumptions
– Equation (2): mu_ij = qi,rho(j) * sj
mu_ij = expected count (mean number of reads) for gene i in sample j
qi,rho(j) = proportional to concentration of fragments from gene i in sample j
sj = coverage (sampling depth) of library j
– Equation (3): sigma^2_ij = mu_ij + sj^2 * vi,rho(j)
sigma^2_ij = variance of no. reads for gene i in sample j
mu_ij = variance due to Poisson model (technical variation) = “shot noise”
sj^2 * vi,rho(j) = variance due to biological variation(?) = “raw variance”
– Equation (4): vi,rho(j) = vrho ( qi,rho(j) )
i.e. vi,rho(j) is a smooth function of qi,rho(j)
So we can make a regression of vi,rho(j) against qi,rho(j) for lots of genes (i)
Then estimate vi,rho(j) for gene X, based on qi,rho(j) and the regression line
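Equations (2) and (3) can be written out as a few lines of code. This is a minimal sketch: `toy_raw_variance` is a hypothetical stand-in for the fitted regression v_rho, and the size factor 1.2 and alpha = 0.05 are invented for illustration.

```python
def expected_count(q, s):
    """Equation (2): mu_ij = q_i,rho(j) * s_j."""
    return q * s


def count_variance(q, s, v):
    """Equation (3): returns (shot_noise, raw_variance_term, total_variance)."""
    shot_noise = expected_count(q, s)   # Poisson part, equal to mu_ij
    raw_term = s ** 2 * v               # raw variance scaled by s_j^2
    return shot_noise, raw_term, shot_noise + raw_term


def toy_raw_variance(q, alpha=0.05):
    """Hypothetical smooth v_rho(q), standing in for the local regression."""
    return alpha * q ** 2


s = 1.2  # hypothetical size factor for library j
for q in (5.0, 50.0, 500.0):
    shot, raw, total = count_variance(q, s, toy_raw_variance(q))
    print(f"q={q:6.1f}  shot={shot:8.1f}  raw={raw:10.1f}  total={total:10.1f}")
```

Under this toy v_rho the shot noise dominates at low q and the raw variance dominates at high q.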
• DESeq’s model – estimating parameters
– sj : coverage (sampling depth) of library j
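The total number of reads in library j is not a good measure of depth, because a few highly expressed genes can dominate it.
Instead, take the median (over all genes i) of the ratio of each observed count to that gene's geometric mean across the m samples:
Equation (5): sj = median_over_i ( kij / [ Product_over_v kiv ]^(1/m) )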
– qi,rho(j) = “expression strength” parameter for gene i in condition rho
Proportional to concentration of fragments from gene i in sample j.
Use the average of the depth-normalised counts from the samples j of condition rho:
Equation (6): qi,rho = 1/m_rho * Sum_over_j (kij / sj), summing over the m_rho samples j of condition rho
– vrho = function describing how vi,rho(j) depends on qi,rho(j)
Estimate the sample variance for each gene i, wi(rho) (Equation 7)
Fit a local regression line to wi(rho) versus qi(rho)
For a particular qi(rho) value, predict w=wi(rho) from the regression line
Also calculate zi(rho) for gene i, the shot-noise contribution to wi(rho) (Equation 8)
Then use v = w - zi(rho) as an unbiased estimate of the raw variance vi,rho for gene i (Equation 9)
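The estimation steps above can be sketched in plain Python. Some assumptions apply. The slides do not spell out Equations (7) and (8), so the forms used here (sample variance of normalised counts with an m_rho - 1 denominator, and z = q/m_rho * Sum_over_j 1/sj) are recalled from the DESeq paper and should be checked against it. Skipping genes that have a zero count when computing size factors is also an assumption. The local regression step is left out, since it needs a fitting library.

```python
import math
from statistics import median


def size_factors(counts):
    """Equation (5). counts[i][j] = reads for gene i in sample j."""
    m = len(counts[0])
    ratios = [[] for _ in range(m)]
    for row in counts:
        if min(row) <= 0:
            continue  # assumption: skip genes whose geometric mean would be zero
        geo_mean = math.exp(sum(math.log(k) for k in row) / m)
        for j, k in enumerate(row):
            ratios[j].append(k / geo_mean)
    return [median(r) for r in ratios]


def expression_strength(row, s, cols):
    """Equation (6): mean of k_ij / s_j over the samples (cols) of one condition."""
    return sum(row[j] / s[j] for j in cols) / len(cols)


def sample_variance(row, s, cols):
    """Assumed Equation (7): variance of normalised counts (needs >= 2 samples)."""
    q = expression_strength(row, s, cols)
    return sum((row[j] / s[j] - q) ** 2 for j in cols) / (len(cols) - 1)


def shot_noise_bias(row, s, cols):
    """Assumed Equation (8): z = q / m_rho * sum over j of 1 / s_j."""
    q = expression_strength(row, s, cols)
    return q / len(cols) * sum(1.0 / s[j] for j in cols)


# Toy data (invented): 4 genes, 2 replicates of one condition.
toy_counts = [[12, 30], [95, 210], [7, 11], [400, 820]]
s_hat = size_factors(toy_counts)
for row in toy_counts:
    q = expression_strength(row, s_hat, [0, 1])
    w = sample_variance(row, s_hat, [0, 1])
    z = shot_noise_bias(row, s_hat, [0, 1])
    print(f"q={q:8.2f}  w={w:10.2f}  z={z:8.2f}  w-z={w - z:10.2f}")
```

Per-gene values of w - z can come out negative with so few replicates, which is one reason to subtract z from the regression prediction rather than from each gene's own w.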
• DESeq’s model – testing for differential expression
– Null hypothesis: qiA = qiB
qiA = expression strength parameter for gene i in the samples of condition A,
mA = number of samples for condition A
– Test statistic: total counts in each condition
Equation (10): KiA = counts in condition A = Sum_over_A ( Kij)
– P-value for test of null hypothesis
Under the null hypothesis, can compute prob(KiA = a, KiB = b) = p(a,b)
Equation (11): P-value for observed count (kiA, kiB) =
[ Sum of probabilities p(a,b) where a+b = kiA+kiB and p(a,b) ≤ p(kiA,kiB) ]
/ [ Sum of probabilities p(a,b) where a+b = kiA+kiB ]
– Computing p(a,b) values
p(a,b) = Prob(KiA = a) * Prob(KiB = b), assuming samples are independent
KiA is the sum of mA NB-distributed variables
We approximate its distribution by a NB(mu, sigma^2) distribution
whose parameters mu, sigma^2 are estimated using Equations 12, 13, 14
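The P-value in Equation (11) can be computed directly once the NB parameters of KiA and KiB are known. The sketch below takes those means and variances as given inputs (Equations 12 to 14 are not reproduced on the slides, so they are not implemented here). The function names and the example numbers are hypothetical. The NB is parametrised by mean and variance, which requires variance > mean.

```python
import math


def nb_pmf(k, mu, var):
    """NB probability of count k, parametrised by mean mu and variance var > mu."""
    p = mu / var                  # success probability
    r = mu * mu / (var - mu)      # size parameter
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)


def p_value(k_a, k_b, mu_a, var_a, mu_b, var_b):
    """Equation (11): share of probability, among all (a, b) with the same
    total a + b, carried by pairs no more likely than the observed pair."""
    total = k_a + k_b
    p_obs = nb_pmf(k_a, mu_a, var_a) * nb_pmf(k_b, mu_b, var_b)
    numerator = 0.0
    denominator = 0.0
    for a in range(total + 1):
        p_ab = nb_pmf(a, mu_a, var_a) * nb_pmf(total - a, mu_b, var_b)
        denominator += p_ab
        if p_ab <= p_obs:
            numerator += p_ab
    return numerator / denominator


# Hypothetical gene: both conditions have mean 10 and variance 20 under the null.
print(p_value(10, 10, 10.0, 20.0, 10.0, 20.0))  # balanced counts
print(p_value(0, 20, 10.0, 20.0, 10.0, 20.0))   # very unbalanced counts
```

The loop is linear in the total count, so a production implementation would need something faster for very high counts. The exact `<=` comparison on floating-point probabilities is also a simplification.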
Applications
• Variance estimation
– Use RNA-seq data from fly embryos: ‘A’ and ‘B’ samples, 2 replicates each
Figure 1: estimated variances wi(rho) plotted against qi(rho) for fly sample A
Distance between orange and purple lines is noise due to biological sampling
Figure 1 line labels: regression, edgeR, “shot noise” (technical variation)
• Testing for differential expression
– Compared the 2 replicates for fly sample A
Figure 2: the empirical cumulative distribution functions of the P-values
The ECDF curve (blue line) should be below the diagonal (gray line)
Type I error is controlled by edgeR & DESeq, but not by a Poisson-based test
edgeR has an excess of small P-values for low counts, but is more conservative for high counts
Figure 2 labels: DESeq, edgeR, Poisson; Low, High, All
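The check behind Figure 2 can be sketched as follows: when two replicates of the same condition are compared, every gene is a true null, so the fraction of P-values at or below a threshold t should not exceed t. The function names and the example P-value lists below are invented for illustration.

```python
def ecdf(p_values, threshold):
    """Fraction of P-values that are <= threshold."""
    return sum(1 for p in p_values if p <= threshold) / len(p_values)


def exceeds_diagonal(p_values, thresholds=(0.01, 0.05, 0.1)):
    """For each threshold t, report whether the ECDF lies above the diagonal,
    i.e. whether more than a fraction t of the null P-values fall below t."""
    return {t: ecdf(p_values, t) > t for t in thresholds}


# Hypothetical P-values from a replicate-versus-replicate comparison.
well_calibrated = [i / 100 for i in range(1, 101)]   # roughly uniform
anti_conservative = [0.001] * 50 + [0.5] * 50        # excess of small P-values

print(exceeds_diagonal(well_calibrated))
print(exceeds_diagonal(anti_conservative))
```

With real data the ECDF fluctuates around the diagonal, so a small tolerance or a formal test would be needed before calling a method anti-conservative.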
• Testing for differential expression
– Compared fly A & B samples
Figure 3: obtained fold changes and P-values
The ability to detect differential expression depends on overall counts
The strong shot noise (technical variation) for low counts causes the testing procedure to call only very high fold changes as significant
Figure 3 legend: red = significant P-value
• Comparison with edgeR
– Ran edgeR with 4 settings:
(i) “Common-dispersion” or “tagwise-dispersion” modes for estimating variance
(ii) Size factors estimated by DESeq, or total number of reads
Results were very similar for the 4 settings
edgeR’s single-value dispersion estimate gives a variance that is lower than DESeq’s for weakly expressed genes & higher for strongly expressed genes (Figure 1)
As a result, edgeR is anti-conservative for weakly expressed genes, but more conservative for strongly expressed genes
This biases the list of discoveries by edgeR
Figure 4 shows that weakly expressed genes seem to be over-represented
Few genes with high average level are called differentially expressed by edgeR
DESeq produced results which were more balanced over the dynamic range
Figure 4 panels: all fly data, DESeq hits, edgeR hits
• Working without replicates
– DESeq can work if there are no replicates in one or both conditions
If there are just replicates from one condition, fit regression line using that one
If there are no replicates, treat the samples as replicates to fit the regression
For neural cell data, variability between replicates ≈ variability between conditions
However, for fly data, variability between replicates << variability between conditions
• Variance-stabilising transformation (VST)
– Given a variance-mean regression, a VST transforms the values so the
variance is independent of the mean (Equation 15)
This yields (transformed) count values whose variances are approximately
the same throughout the dynamic range
This is useful for sample clustering, since clustering assumes all genes have
roughly the same variance
Figure 5 shows clustering for neural cell samples, using VST-transformed data
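A VST of this kind can be sketched numerically. Equation 15 is not reproduced on the slides; the sketch assumes it has the usual delta-method form tau(kappa) = integral of dq / sqrt(w(q)), where w(q) is the fitted variance-mean relation. The toy relation w(q) = q + alpha*q^2, alpha = 0.1 and the lower integration limit of 1 are all illustrative choices (the transformation is only defined up to an additive constant).

```python
import math


def toy_variance(q, alpha=0.1):
    """Hypothetical variance-mean relation w(q) = q + alpha * q^2."""
    return q + alpha * q * q


def vst(kappa, w=toy_variance, lower=1.0, steps=10_000):
    """Trapezoid-rule estimate of the integral of 1/sqrt(w(q)) from lower to kappa."""
    h = (kappa - lower) / steps
    total = 0.5 * (1 / math.sqrt(w(lower)) + 1 / math.sqrt(w(kappa)))
    for i in range(1, steps):
        total += 1 / math.sqrt(w(lower + i * h))
    return total * h


def vst_closed_form(kappa, alpha=0.1, lower=1.0):
    """Exact integral for the toy w(q): (2 / sqrt(alpha)) * asinh(sqrt(alpha * q))."""
    def antiderivative(q):
        return 2 / math.sqrt(alpha) * math.asinh(math.sqrt(alpha * q))
    return antiderivative(kappa) - antiderivative(lower)


for kappa in (10.0, 100.0, 1000.0):
    print(kappa, vst(kappa), vst_closed_form(kappa))
```

By the delta method, the variance of tau(K) is roughly w(q) * tau'(q)^2 = 1 for every mean q, which is the sense in which the variance becomes independent of the mean.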
• ChIP-Seq data
– Compared HapMap IDs GM12878 and GM12891
DESeq does not give false positives when comparing replicates for 1 individual
Using a Poisson-based model, you would get many false positives
Figure labels: DESeq, Poisson; panels: same individual, different individuals
Summary
• A Poisson model underestimates the variance between
biological samples; this leads to false positives in differential
expression analyses
• A Negative Binomial distribution is much better
• This is especially true for highly expressed genes
• DESeq and edgeR use the Negative Binomial
• However, DESeq estimates the sequencing depth differently
• Also DESeq estimates the variance for a gene by assuming it has similar variance to genes of similar expression level
• DESeq and edgeR have similar sensitivity, but edgeR calls a greater number of weakly expressed genes as significant, and fewer highly expressed genes as significant

DESeq Paper Journal club
