3. Common biostatistics tasks
● Cleaning and transforming data
● Data description
● Statistical testing
● Tabulation and visualization
● Bioinformatics (applied statistics for genomics)
● Post-hoc power calculations
● Complain they weren't consulted earlier
6. Post-hoc sample size / power analysis
● Due to convenience, we justify choices already made
● Find a similar effect size in the literature
● Use the posterior distribution as prior
● Set the desired power (80-100%)
● Adjust as needed for dropout, loss, margin-of-error
● Obtain the sample size you already have
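For contrast, the same steps run in the intended (a priori) direction can be sketched as below. This is a minimal sketch, assuming a two-sided, two-sample comparison of means under the normal approximation; the function name `n_per_group` and the example effect size, power and dropout values are illustrative assumptions, not values from the study.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80, dropout=0.0):
    """Approximate sample size per group for a two-sided, two-sample test.

    effect_size is the standardized difference (Cohen's d). Uses the normal
    approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n = math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)
    # Inflate for expected dropout so that enough subjects complete the study.
    return math.ceil(n / (1 - dropout))


# Hypothetical inputs: medium effect size taken from the literature.
n_complete = n_per_group(0.5)
n_enrolled = n_per_group(0.5, dropout=0.2)
print(n_complete, n_enrolled)
```

Under this formula a power of exactly 100% has no finite sample size, so the upper end of the "80-100%" range cannot be reached.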
12. Natural variability of RNA per gene
De Torrente et al. (2020)
Surprisingly, the expression of less than 50% of all genes
was Normally-distributed, with other distributions including
Gamma, Bimodal, Cauchy, and Lognormal also
represented.
Liu et al. (2019)
Based on the analysis of a group of real gene expression
profiles, this study reveal that the primary density
distributions of the real profiles are normal/log-normal and
t distributions, accounting for 80% and 19% respectively.
20K+ genes
13. Representing RNAs with fragments
Gamma-Poisson distribution
Count and normalize to quantify (TPM)
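The count-and-normalize step can be illustrated with a short sketch. It assumes the usual TPM definition (fragments per base, rescaled to sum to one million) and the common mean-dispersion parameterization of the Gamma-Poisson (negative binomial) variance; the function names and the example counts are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million from fragment counts and transcript lengths."""
    # Longer transcripts yield more fragments, so divide counts by length first.
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    # Rescale so that the values of one sample sum to one million.
    return [r / total * 1e6 for r in rates]


def gamma_poisson_variance(mean, dispersion):
    """Variance of a Gamma-Poisson count: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


# Hypothetical counts for three transcripts of different lengths.
values = tpm([100, 200, 300], [1000, 2000, 3000])
print(values)
print(gamma_poisson_variance(10, 0.1))
```

In this example the three transcripts have the same fragments-per-base rate, so they receive equal TPM values even though their raw counts differ.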
14. Overview of the pipeline
● Pipeline stages: tissue sample, chemical preparation, sequencing
● Sources of variability: effect between groups, inter-individual variation in RNA, batch effects, representation variability
18. RNA characterization of COVID-19 (2021) - Plan
● Total RNA – virus and host (human)
● Nasopharyngeal swabs and blood samples
● Paired design (on admittance and discharge from hospital)
● 18 individuals, 72 samples in total (two sample types at two time points each)
● Which biological pathways are affected? (DEG)
● What can we say about the viral load? (metagenomics)
19. Estimating sample size for RNA
● Theoretical models with assumed distributions
● Parameters inferred from previous datasets
● R-packages: RNASeqDesign, PROPER, powsimR, ssizeRNA
● Web tool: RNASeqSampleSize
● Results vary between tools
● If cost is not relevant, choose the most conservative (largest)
20. Proposed approach
● Perform one estimate and use it
● Remove unwanted variability (batch effect)
● Reduce variability with paired design
● Use meaningful metadata
● Filter the genes
21. Remove unwanted variability
A number of SVD-based methods remove high-level batch effects
without tracing them to specific interpretable variables.
Housekeeping or control genes can be used as markers.
● SVA
● RUVSeq
These methods produce new surrogate variables.
Colleague quote:
"Once I see batch effects, I can correct them mathematically, but I
never trust that dataset again."
23. Paired design
Control samples are taken from the same patients, after
resolution or before the event.
● Increases power
● Not all analysis frameworks can take advantage of it
● Sometimes biologically difficult
● Reduces degrees of freedom (DF) by half
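The trade-off between power and degrees of freedom can be seen in a small worked example. This is a minimal sketch with made-up expression values for five patients, comparing a paired t statistic with a pooled-variance two-sample t statistic on the same numbers; it is not the analysis framework used in the study.

```python
import math
from statistics import mean, stdev, variance


def paired_t(before, after):
    """Paired t statistic and its degrees of freedom (n - 1)."""
    diffs = [a - b for a, b in zip(after, before)]
    t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def unpaired_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    se = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(group_b) - mean(group_a)) / se, df


# Hypothetical values: large differences between patients, small shift within.
before = [10, 20, 30, 40, 50]
after = [11, 22, 32, 41, 53]

t_paired, df_paired = paired_t(before, after)
t_unpaired, df_unpaired = unpaired_t(before, after)
print(t_paired, df_paired)
print(t_unpaired, df_unpaired)
```

Pairing removes the between-patient spread from the error term, so the statistic is much larger even though only half the degrees of freedom remain.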
24. Meaningful metadata
Gender and age are potentially relevant in any study.
Collect metrics of sample quality (before and after
sequencing).
Disease subtypes can be a covariate or group variable.
Metadata helps choose which samples to sequence when only a subset is sequenced.
25. Filter genes
Multiple testing correction is applied across 20K+ genes.
Removing mostly unexpressed genes reduces the number of tests.
A priori removal is allowed.
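The interaction between filtering and multiple testing correction can be sketched as follows. The example assumes the Benjamini-Hochberg procedure as the correction (the slides do not name one) and a hypothetical expression filter with made-up thresholds; gene names and values are illustrative only.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_expressed(counts_by_gene, min_count=10, min_samples=2):
    """Keep genes with at least min_count reads in at least min_samples samples.

    The filter ignores group labels, so it can be applied before testing.
    """
    return {
        gene: counts
        for gene, counts in counts_by_gene.items()
        if sum(c >= min_count for c in counts) >= min_samples
    }


# Hypothetical counts: GENE_B is mostly unexpressed and is dropped.
kept = filter_expressed({"GENE_A": [50, 60, 55, 40], "GENE_B": [0, 1, 0, 0]})
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(sorted(kept))
print(adjusted)
```

Because each adjusted value scales with the number of tests, every gene removed before testing makes the correction less severe for the genes that remain.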
27. Annotation representation testing – Panther.db
● Annotation is a subset of genes
● Multiple available annotation sets (structure, function, pathway...)
● We only use significant genes
● Overrepresentation test – chi-square to compare observed and
expected frequencies
● Enrichment test – Mann-Whitney to test randomness of ranks
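Both tests can be sketched with the standard library alone. This is a minimal illustration, assuming a 2x2 table for the overrepresentation test and a plain U statistic for the enrichment test; it does not reproduce how Panther.db implements them, and all counts and scores below are made up.

```python
import math


def chi_square_2x2(sig_in, sig_out, nonsig_in, nonsig_out):
    """Chi-square statistic (1 df, no continuity correction) and its p-value."""
    n = sig_in + sig_out + nonsig_in + nonsig_out
    numerator = n * (sig_in * nonsig_out - sig_out * nonsig_in) ** 2
    denominator = (
        (sig_in + sig_out)
        * (nonsig_in + nonsig_out)
        * (sig_in + nonsig_in)
        * (sig_out + nonsig_out)
    )
    chi2 = numerator / denominator
    # With 1 df the statistic is a squared standard normal variable.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def mann_whitney_u(in_set, background):
    """U statistic for in_set, using average ranks for tied values."""
    pooled = sorted(in_set + background)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_sum = sum(rank_of[v] for v in in_set)
    return rank_sum - len(in_set) * (len(in_set) + 1) / 2


# Hypothetical: 30 of 100 significant genes carry the annotation,
# against 170 of 1900 non-significant genes.
chi2, p_value = chi_square_2x2(30, 70, 170, 1730)
# Hypothetical scores: annotated genes all rank below the background.
u_stat = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(chi2, p_value, u_stat)
```

The overrepresentation test needs a significance cutoff to build the table, while the rank-based test uses the scores of all genes.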
28. Molecular function in blood (PAIRED)
● Increased immunoglobulin binding
● Reduced smell (in blood!)
● Reduced oxygen binding and carrier activity
● We consider the result validated
29. Takeaways of the study
● Study rescued by pairing
● No batch to correct
● Almost no metadata
● Smaller signal in blood
● Specific tissue (nasal) more
robust
33. Easier to control for batches
● Pairing absorbs a proportion of
batch effects
● Usually 8 lanes in a flowcell
● Focus on pairs instead of whole
samples
● Aggregation of datasets easier
34. Technical downsides of pairing
● Loss of half the degrees of freedom (DF)
● Many frameworks cannot use it as easily as GLM-based ones
● RNA is used for other analyses:
○ SUPPA2 for alternative splicing
○ Building empirical distribution from all pairs of samples
○ If pairing were enforced, the number of observations would drop
drastically
36. Tissue implications
● Specific tissues have robust signatures without pairing
● Blood reflects many tissues:
○ Weaker signal
○ Local changes reflected
● Systemic effects are found only in blood
● Always available for sampling (minimally invasive)
● Blood analysis benefits from pairing
37. Utility implications
● Paired designs are easier to aggregate to meta-studies (robust to
batch effects)
● Blood controls can be used as unpaired controls for other studies (if
healthy enough)
● Solves the problem of finding controls
● If controls are taken after resolution, their health is questionable (e.g. long COVID)
● Some chronic diseases cannot be caught early or ever resolved, so
pairing is impossible
38. Example – cardiovascular events
● We are interested in markers of
plaque progression/instability
● Patient checkup and sampling every
X months
● Sequencing is expensive, sampling
and storing is not
● Sequence only the previous two
samples before the event
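The selection rule in the last bullet can be sketched as below. This is a minimal sketch, assuming each stored sample is identified by its collection date; the function name `samples_to_sequence` and all dates are hypothetical.

```python
from datetime import date


def samples_to_sequence(sample_dates, event_date, n_previous=2):
    """Return the last n_previous stored samples taken strictly before the event."""
    earlier = sorted(d for d in sample_dates if d < event_date)
    return earlier[-n_previous:]


# Hypothetical checkup dates for one patient, roughly six months apart.
stored = [date(2021, 1, 10), date(2021, 7, 12), date(2022, 1, 15), date(2022, 7, 11)]
chosen = samples_to_sequence(stored, event_date=date(2022, 3, 1))
print(chosen)
```

Samples collected after the event stay in storage, so only two libraries per patient are sequenced.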
39. Example – neurodegenerative disease (ALS)
● We cannot predict the disease (10% familial)
● Patient available for sampling once diseased
● Sequence samples taken sufficiently far apart in time
● We cannot find the root cause of ALS, as we
are not catching the initial event
● We can find signatures of neuronal suffering
and death, which is an actionable point
● Generalizes to all chronic diseases
40. Example – cancer
● For DNA, tumor is matched with blood
sample control
● For RNA, we need the normal
surrounding tissue
● Sampling the healthy normal target
tissue may be problematic
● Tissue margin – potential normal
sample
● Admixture of tumor in normal reduces
the signal (but not critically for RNA)
41. Many thanks to...
● Institute for Biocides and Medical Ecology
for providing the samples and sequencing
● HTEC Group for providing computational
resources and support
● School of Medicine, University of Belgrade
for supporting research
● Thanks to DSC organizers for the invite
● Last but not least...
Hello, my name is Ognjen Milicevic from Belgrade, Serbia. Because of my mixed medical and engineering background, today I chose to tackle an interdisciplinary subject -