EiB Seminar from Antoni Miñarro, Ph.D
2. RNA-seq
RNA-seq, also called "Whole Transcriptome Shotgun Sequencing"
("WTSS"), refers to the use of high-throughput sequencing technologies to
sequence cDNA in order to get information about a sample's RNA content.
3. Analysis of RNA-seq data
Single nucleotide variation discovery: currently being applied to cancer
research and microbiology.
Fusion gene detection: fusion genes have gained attention because of
their relationship with cancer. The idea follows from the process of
aligning the short transcriptomic reads to a reference genome.
Most of the short reads will fall within one complete exon, and a
smaller but still large set would be expected to map to known
exon-exon junctions. The remaining unmapped short reads would
then be further analyzed to determine whether they match an
exon-exon junction where the exons come from different genes.
This would be evidence of a possible fusion.
Gene expression
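The fusion gene detection idea above can be sketched in a few lines. This is a toy illustration, not a production fusion caller: the gene names, exon sequences, and the `flank` parameter are hypothetical, and exact substring matching stands in for a real aligner. Reads are assumed to be longer than `flank`, so any read found inside a junction sequence must span the boundary.

```python
from itertools import permutations

# Hypothetical toy annotation: gene name -> exon sequences in 5' to 3' order.
EXONS = {
    "GENE_A": ["ACGTACGTAA", "TTGACCGGTA"],
    "GENE_B": ["GGCATCGATC", "CCATGGTTAA"],
}


def fusion_junctions(exons, flank=6):
    """Build candidate junction sequences joining exons of different genes."""
    junctions = {}
    for gene_5p, gene_3p in permutations(exons, 2):
        for i, donor in enumerate(exons[gene_5p]):
            for j, acceptor in enumerate(exons[gene_3p]):
                name = f"{gene_5p}:exon{i + 1}|{gene_3p}:exon{j + 1}"
                # Last `flank` bases of the donor exon + first `flank` of the acceptor.
                junctions[name] = donor[-flank:] + acceptor[:flank]
    return junctions


def fusion_candidates(unmapped_reads, exons, flank=6):
    """Return (read, junction name) pairs for reads found in a cross-gene junction."""
    junctions = fusion_junctions(exons, flank)
    hits = []
    for read in unmapped_reads:
        for name, seq in junctions.items():
            if read in seq:
                hits.append((read, name))
    return hits


# Reads that failed to map to single exons or known exon-exon junctions.
reads = ["CGTAAGGCAT", "TTTTTTTTTT"]
hits = fusion_candidates(reads, EXONS)
```

Here the first read matches the junction between exon 1 of GENE_A and exon 1 of GENE_B, which would count as evidence of a possible fusion; the second read matches nothing.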
7. Gene expression
Detect differences in gene level expression between samples. This
sort of analysis is particularly relevant for controlled experiments
comparing expression in wild-type and mutant strains of the same
tissue, comparing treated versus untreated cells, cancer versus
normal, and so on.
8. Differential expression (2)
RNA-seq gives a discrete measurement for each gene.
Transformed count data are not well approximated by continuous
distributions, especially in the lower count range and for small
samples. Therefore, statistical models appropriate for count data are
vital to extracting the most information from RNA-seq data.
In general, the Poisson distribution forms the basis for modeling
RNA-seq count data.
10. Mapping
The first step in this procedure is the read mapping or alignment: to find the
unique location where a short read is identical to the reference.
However, in reality the reference is never a perfect representation of the
actual biological source of RNA being sequenced: the sample may differ from
it by SNPs and indels, and the reads arise from a spliced transcriptome
rather than a genome.
Short reads can sometimes align perfectly to multiple locations and can
contain sequencing errors that have to be accounted for.
The real task is to find the location where each short read best matches the
reference, while allowing for errors and structural variation.
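As a rough illustration of "best match while allowing for errors", the sketch below scans a reference for the placements of a read with the fewest mismatches. It is a naive, substitution-only search written for clarity; the function names and the `max_mismatches` parameter are assumptions for this example, and real aligners use indexed data structures and also handle indels and splicing.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_alignments(read, reference, max_mismatches=2):
    """Return (fewest mismatches, start positions achieving it).

    Returns (None, []) when no placement is within max_mismatches.
    """
    best = None
    positions = []
    for start in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[start:start + len(read)])
        if d > max_mismatches:
            continue
        if best is None or d < best:
            best, positions = d, [start]
        elif d == best:
            positions.append(start)
    return best, positions


reference = "ACGTTGCAACGTAGCT"
unique_hit = best_alignments("ACGTA", reference)  # one exact placement
multi_hit = best_alignments("ACGT", reference)    # two equally good placements
```

A read with more than one best position, such as the second example, is what the next slide calls a multimap.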
11. Aligners
Aligners differ in how they handle ‘multimaps’ (reads that
map equally well to several locations). Most aligners
either discard multimaps, allocate them randomly or
allocate them on the basis of an estimate of local
coverage.
Paired-end reads reduce the problem of multi-mapping, as
both ends of the cDNA fragment from which the short
reads were generated should map nearby on the
transcriptome, allowing the ambiguity of multimaps to be
resolved in most circumstances.
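The three multimap policies mentioned above (discard, random allocation, coverage-based allocation) can be sketched as follows. The function name, the strategy labels, and the pseudocount of 1 added to the coverage are illustrative assumptions, not the behaviour of any particular aligner.

```python
import random


def allocate_multimap(locations, strategy, coverage=None, rng=None):
    """Pick a location for one read given its equally good candidate locations.

    strategy: "discard", "random", or "coverage" (hypothetical labels).
    coverage: dict mapping location -> local coverage estimate from unique reads.
    Returns the chosen location, or None if the read is discarded.
    """
    if not locations:
        return None
    if len(locations) == 1:
        return locations[0]  # uniquely mapped read, nothing to resolve
    if strategy == "discard":
        return None
    rng = rng or random.Random(0)
    if strategy == "random":
        return rng.choice(locations)
    if strategy == "coverage":
        coverage = coverage or {}
        # Pseudocount of 1 so that locations without unique reads stay possible.
        weights = [coverage.get(loc, 0) + 1 for loc in locations]
        return rng.choices(locations, weights=weights, k=1)[0]
    raise ValueError(f"unknown strategy: {strategy}")


candidates = ["chr1:1000", "chr7:52000"]
local_coverage = {"chr1:1000": 40, "chr7:52000": 2}
chosen = allocate_multimap(candidates, "coverage", coverage=local_coverage)
```

With coverage-based allocation the read is most likely, but not certain, to be assigned to the location that already has more uniquely mapped reads.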
12. Reference genome
The most commonly used approach is to use the genome itself as the
reference. This has the benefit of being easy and not biased towards any
known annotation. However, reads that span exon boundaries will not map to
this reference. Thus, using the genome as a reference will give greater
coverage (at the same true expression level) to transcripts with fewer exons,
as they will contain fewer exon junctions.
In order to account for junction reads, it is
common practice to build exon junction
libraries in which reference sequences are
constructed using boundaries between
annotated exons, a proxy genome generated
with known exonic sequences.
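A minimal sketch of building such an exon junction library is shown below. The data layout and names are hypothetical. Each junction sequence keeps `read_length - 1` bases on either side of the boundary, so that a read aligned to it must cross the junction; joining every ordered pair of exons within a gene (which also covers exon skipping) is a design choice of this sketch, not something the slides specify.

```python
def junction_library(exons, read_length):
    """Build junction reference sequences from annotated exons.

    exons: dict mapping gene name -> exon sequences in transcript order.
    read_length: read length in bases (assumed to be at least 2).
    Returns a dict mapping junction name -> junction sequence.
    """
    flank = read_length - 1
    library = {}
    for gene, seqs in exons.items():
        for i in range(len(seqs) - 1):
            for j in range(i + 1, len(seqs)):
                name = f"{gene}:e{i + 1}-e{j + 1}"
                # End of the upstream exon joined to the start of the downstream exon.
                library[name] = seqs[i][-flank:] + seqs[j][:flank]
    return library


toy_exons = {"G": ["AAAACC", "GGTTTT", "CCCCAA"]}  # hypothetical gene
library = junction_library(toy_exons, read_length=4)
```

Reads that fail to map to the genome can then be aligned against `library` as an extra reference.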
Another option is the de novo assembly of the
transcriptome, for use as a reference, using
genome assembly tools.
A commonly used approach for transcriptome
mapping is to progressively increase the
complexity of the mapping strategy to handle
the unaligned reads.
14. Normalization (2)
Within-library normalization allows quantification of expression levels of each gene
relative to other genes in the sample. Because longer transcripts have higher read
counts (at the same expression level), a common method for within-library
normalization is to divide the summarized counts by the length of the gene
[32,34]. The widely used RPKM (reads per kilobase of exon model per million
mapped reads) accounts for both library size and gene length effects in within-
sample comparisons.
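The RPKM definition above translates directly into a one-line formula: reads divided by (exon length in kilobases times mapped reads in millions). The function and argument names below are chosen for this example only.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# 500 reads on a 2 kb exon model in a library of 10 million mapped reads.
value = rpkm(read_count=500, exon_length_bp=2_000, total_mapped_reads=10_000_000)
```

In this example the gene has 250 reads per kilobase, which divided by 10 million-read units gives an RPKM of 25.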
When testing individual genes for DE between samples, technical biases, such as
gene length and nucleotide composition, will mainly cancel out because the
underlying sequence used for summarization is the same between samples.
However, between-sample normalization is still essential for comparing counts
from different libraries relative to each other. The simplest and most commonly
used normalization adjusts by the total number of reads in the library [34,51],
accounting for the fact that more reads will be assigned to each gene if a sample is
sequenced to a greater depth.
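Total-count normalization can be sketched as scaling each library to counts per million. The gene names and counts below are made up for illustration; the second library is the first one sequenced to twice the depth.

```python
def counts_per_million(counts):
    """Scale a library's gene counts by its total number of reads.

    counts: dict mapping gene name -> raw read count for one library.
    """
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


library_a = {"gene1": 100, "gene2": 900}
library_b = {"gene1": 200, "gene2": 1800}  # same composition, double the depth

cpm_a = counts_per_million(library_a)
cpm_b = counts_per_million(library_b)
```

After scaling, both libraries give the same value for each gene, so the difference in sequencing depth no longer looks like a difference in expression.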
17. NG-5045 (Diabetes)
Pool 1: 2, 4, 12, 16
Pool 2: 3, 9, 13, 14
Pool 3: 1, 5, 6, 7
Pool 4: 8, 10, 11, 15
Morbidly obese persons without insulin resistance: 2, 3, 4, 9, 12, 13,
14, 16.
Morbidly obese persons with high insulin resistance: 1, 5, 6, 7, 8, 10,
11, 15.
22. Differential expression
The goal of a DE analysis is to highlight genes that have changed
significantly in abundance across experimental conditions. In general,
this means taking a table of summarized count data for each library
and performing statistical testing between samples of interest.
Transformed count data are not well approximated by continuous
distributions, especially in the lower count range and for small
samples. Therefore, statistical models appropriate for count data are
vital to extracting the most information from RNA-seq data.
23. Poisson-based analysis
In an early RNA-seq study using a single source of RNA, goodness-of-fit
statistics suggested that, for the majority of genes, the distribution of
counts across lanes was indeed Poisson. This has been independently
confirmed using a technical experiment, and software tools are readily
available to perform these analyses.
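One common way to check the Poisson assumption for a single gene across lanes is a Pearson chi-square statistic, sketched below. This is a generic goodness-of-fit calculation under the assumption that each lane's expected count is proportional to its total reads; it is not claimed to be the exact statistic used in the study cited above, and the function name is hypothetical.

```python
def poisson_lane_statistic(gene_counts, lane_totals):
    """Pearson chi-square statistic for one gene across technical replicate lanes.

    gene_counts: reads assigned to the gene in each lane.
    lane_totals: total reads in each lane.
    Returns (statistic, degrees_of_freedom). Under the Poisson model the
    statistic is approximately chi-square with len(lanes) - 1 degrees of freedom.
    """
    total_gene = sum(gene_counts)
    total_reads = sum(lane_totals)
    statistic = 0.0
    for observed, lane_total in zip(gene_counts, lane_totals):
        expected = lane_total * total_gene / total_reads
        statistic += (observed - expected) ** 2 / expected
    return statistic, len(gene_counts) - 1


# Counts exactly proportional to lane depth: no evidence against Poisson.
stat_ok, df_ok = poisson_lane_statistic([10, 20, 30], [1000, 2000, 3000])
# Equal lane depth but very different counts: large statistic.
stat_bad, df_bad = poisson_lane_statistic([10, 30], [1000, 1000])
```

A statistic that is large relative to the chi-square distribution with the given degrees of freedom points to extra-Poisson variation for that gene.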
24. Each RNA sample was sequenced in seven lanes, producing 12.9–14.7 million reads per lane at the 3 pM concentration
and 8.4–9.3 million reads at the 1.5 pM concentration. We aligned all reads against the whole genome. 40% of reads mapped
uniquely to a genomic location, and of these, 65% mapped to autosomal or sex chromosomes (the remainder mapped
27. Alternative strategies
Biological variability is not captured well by the Poisson assumption.
Hence, Poisson-based analyses for datasets with biological replicates will
be prone to high false positive rates resulting from the underestimation
of sampling error.
Goodness-of-fit tests indicate that a small proportion of genes show clear
deviations from this model (extra-Poisson variation), and although we
found that these deviations did not lead to false-positive identification of
differentially expressed genes at a stringent FDR, there is nevertheless
room for improved models that account for the extra-Poisson variation.
One natural strategy would be to replace the Poisson distribution with
another distribution, such as the quasi-Poisson distribution (Venables and
Ripley 2002) or the negative binomial distribution (Robinson and Smyth
2007), which have an additional parameter that estimates over- (or under-)
dispersion relative to a Poisson model.
28. Poisson-Negative Binomial
The negative binomial distribution can be used as an alternative to the Poisson distribution. It is
especially useful for discrete data over an unbounded positive range whose sample variance
exceeds the sample mean. In such cases, the observations are overdispersed with respect to a
Poisson distribution, for which the mean is equal to the variance. Hence a Poisson distribution is not
an appropriate model. Since the negative binomial distribution has one more parameter than the
Poisson, the second parameter can be used to adjust the variance independently of the mean.
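To make the role of the second parameter concrete, the sketch below assumes one common parameterization of the negative binomial, in which the variance is mean + dispersion × mean², so that a dispersion of zero recovers the Poisson case. The simple method-of-moments estimate and the example counts are for illustration only; dedicated packages estimate dispersion with more refined methods.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion, assuming var = mean + dispersion * mean**2.

    Returns 0.0 when the sample variance does not exceed the sample mean,
    i.e. when there is no evidence of overdispersion relative to Poisson.
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    v = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (v - m) / (m * m))


# Made-up replicate counts for one gene: mean 6, sample variance 10.
overdispersed = moment_dispersion([2, 4, 6, 8, 10])
# Variance below the mean: the estimate is clamped to zero.
not_overdispersed = moment_dispersion([5, 5, 5, 5])
```

In the first example the variance (10) exceeds the mean (6), giving a dispersion estimate of 4/36, roughly 0.11.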
29. Negative-Binomial based analysis
In order to account for biological variability, methods that have been
developed for serial analysis of gene expression (SAGE) data have
recently been applied to RNA-seq data. The major difference between
SAGE and RNA-seq data is the scale of the datasets. To account for
biological variability, the negative binomial distribution has been used
as a natural extension of the Poisson distribution, requiring an
additional dispersion parameter to be estimated.
30. Description of SAGE
Serial analysis of gene expression (SAGE) is
a method for comprehensive analysis of
gene expression patterns.
Three principles underlie the SAGE
methodology:
1. A short sequence tag (10-14 bp) contains
sufficient information to uniquely identify a
transcript, provided that the tag is
obtained from a unique position within each
transcript;
2. Sequence tags can be linked together to
form long serial molecules that can be
cloned and sequenced; and
3. Quantitation of the number of times a
particular tag is observed provides the
expression level of the corresponding
transcript.
31. Robinson, McCarthy, Smyth (2010)
35. Negative-Binomial based software
R packages in Bioconductor:
• edgeR (Robinson et al., 2010): Exact test based on Negative
Binomial distribution.
• DESeq (Anders and Huber, 2010): Exact test based on
Negative Binomial distribution.
• baySeq (Hardcastle et al., 2010): Estimation of the posterior
likelihood of differential expression (or more complex
hypotheses) via empirical Bayesian methods using Poisson or
NB distributions.
36. CLC Genomics Workbench approach
19.4.2.1 Kal et al.'s test (Z-test)
Kal et al.'s test [Kal et al., 1999] compares a single sample against another
single sample, and thus requires that each group in your experiment has only
one sample. The test relies on an approximation of the binomial distribution
by the normal distribution [Kal et al., 1999]. Considering proportions rather
than raw counts the test is also suitable in situations where the sum of
counts is different between the samples.
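A test of this kind can be sketched as the standard pooled two-proportion Z-test shown below. This is assumed to correspond to the normal approximation described above rather than taken from the CLC implementation, and the function and argument names are hypothetical.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion Z-test for one gene in two single samples.

    x1, x2: counts for the gene in sample 1 and sample 2.
    n1, n2: total counts in sample 1 and sample 2.
    Returns (z, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # no counts (or all counts) for this gene in both samples
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Same proportion in libraries of different size: no evidence of a difference.
z_same, p_same = two_proportion_z_test(50, 1000, 100, 2000)
# Threefold difference in proportion at equal library size.
z_diff, p_diff = two_proportion_z_test(30, 1000, 10, 1000)
```

Because the statistic is built from proportions, the first example gives z = 0 even though the raw counts differ by a factor of two.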
19.4.2.2 Baggerly et al.'s test (Beta-binomial)
Baggerly et al.'s test [Baggerly et al., 2003] compares the proportions of
counts in a group of samples against those of another group of samples, and
is suited to cases where replicates are available in the groups. The samples
are given different weights depending on their sizes (total counts). The
weights are obtained by assuming a Beta distribution on the proportions in a
group, and estimating these, along with the proportion of a binomial
distribution, by the method of moments. The result is a weighted t-type test
statistic.
37. Baggerly, K., Deng, L., Morris, J., and Aldaz, C. (2003).
Differential expression in SAGE: accounting for normal between-library
variation.
Bioinformatics, 19(12):1477-1483.
42. Suggested pipeline ?
• Quality Control: fastQC, DNAA
• Mapping the reads:
  • Obtaining the reference
  • Aligning reads to the reference: BOWTIE
• Differential Expression
  • Summarization of reads
  • Differential Expression Testing: edgeR
• Gene Set testing (GO): goseq
43. Experimental design ?
Many of the current strategies for DE analysis of count data are
limited to simple experimental designs, such as pairwise or multiple
group comparisons. To the best of our knowledge, no general
methods have been proposed for the analysis of more complex
designs, such as paired samples or time course experiments, in the
context of RNA-seq data. In the absence of such methods,
researchers have transformed their count data and used tools
appropriate for continuous data. Generalized linear models provide
the logical extension to the count models presented above, and
clever strategies to share information over all genes will need to be
developed; software tools now provide these methods (such as
edgeR).
Auer, P.L., and Doerge, R.W. (2010). Statistical Design and
Analysis of RNA Sequencing Data. Genetics, 185:405-416.
44. Integration with other data
There is wide scope for integrating the results of RNA-seq data with
other sources of biological data to establish a more complete picture of
gene regulation [69]. For example, RNA-seq has been used in
conjunction with genotyping data to identify genetic loci responsible
for variation in gene expression between individuals (expression
quantitative trait loci or eQTLs) [35,70]. Furthermore, integration of
expression data with transcription factor binding, RNA interference,
histone modification and DNA methylation information has the
potential for greater understanding of a variety of regulatory
mechanisms. A few reports of these ‘integrative’ analyses have
emerged recently [71-73].