[DigiHealth 22] Budget friendly sample sizes for genomics research - Ognjen Milicevic

Ognjen will be talking about decisions required for an expensive genomics study, and what tricks you can do to make sure you come out on top.

  1. Budget friendly sample sizes for genomics research
     Biostatistician, bioinformatician Ognjen Milicevic, MD
  2. Why do you need a biostatistician?
  3. Common biostatistics tasks
     ● Cleaning and transforming data
     ● Data description
     ● Statistical testing
     ● Tabulation and visualization
     ● Bioinformatics (applied statistics for genomics)
     ● Post-hoc power calculations
     ● ...
  4. Common biostatistics tasks
     ● Cleaning and transforming data
     ● Data description
     ● Statistical testing
     ● Tabulation and visualization
     ● Bioinformatics (applied statistics for genomics)
     ● Post-hoc power calculations
     ● Complain they weren't consulted earlier
  5. Post-hoc sample size / power analysis
     ● For convenience, we justify choices already made
     ● Find a similar effect size in the literature
     ● Use the posterior distribution as prior
     ● Set the desired power (80-100%)
     ● Adjust as needed for dropout, loss, margin-of-error
     ● Obtain the sample size you already have
  6. Make a wish, biostatistician
  7. Dear bioinformatician, how many samples do we need to sequence to investigate...
  8. NO CONVENIENCE!
     ● Not routinely done
     ● Effect size unknown
     ● Literature not helpful
     ● Multiple unknown genes
     ● Distribution is complex
     ● ...
  9. RNA sequencing around the internet
  10. DATA SCIENCE OF RNA SEQUENCING
  11. Natural variability of RNA per gene
      De Torrente et al. (2020): "Surprisingly, the expression of less than 50% of all genes was Normally-distributed, with other distributions including Gamma, Bimodal, Cauchy, and Lognormal also represented."
      Liu et al. (2019): "Based on the analysis of a group of real gene expression profiles, this study reveal that the primary density distributions of the real profiles are normal/log-normal and t distributions, accounting for 80% and 19% respectively."
      20K+ genes
  12. Representing RNAs with fragments
      Gamma-Poisson distribution
      Count and normalize to quantify (TPM)
  13. Overview of the pipeline
      [Diagram labels: Effect between groups; Inter-individual variation in RNA; Batch effects; Representation variability; Tissue sample; Chemical preparation; Sequencing]
  14. Count matrix and metadata
      Each gene is an independent outcome
  15. LAYERS UPON LAYERS OF VARIABILITY
      So, what about those sample sizes?
  16. COVID-19 RNA characterization
      Example project
  17. RNA characterization of COVID-19 (2021) - Plan
      ● Total RNA – virus and host (human)
      ● Nasopharyngeal swabs and blood samples
      ● Paired design (on admission to and discharge from hospital)
      ● 18 individuals, total of 72 samples
      ● Which biological pathways are affected? (DEG)
      ● What can we say about the viral load? (metagenomics)
  18. Estimating sample size for RNA
      ● Theoretical models with assumed distributions
      ● Parameters inferred from previous datasets
      ● R packages: RNASeqDesign, PROPER, powsimR, ssizeRNA
      ● Web tool: RNASeqSampleSize
      ● Results vary between tools
      ● If cost is not relevant, choose the most conservative (largest)
  19. Proposed approach
      ● Perform one estimate and use it
      ● Remove unwanted variability (batch effect)
      ● Reduce variability with paired design
      ● Use meaningful metadata
      ● Filter the genes
  20. ● Remove unwanted variability ● Paired design ● Meaningful metadata ● Filter genes
      A number of methods based on SVD remove high-level batch effects without specifically tracing them to interpretable variables. One can use housekeeping or control genes as markers.
      • SVA
      • RUVSeq
      These methods produce new surrogate variables.
      Colleague quote: "Once I see batch effects, I can correct them mathematically, but I never trust that dataset again."
  21. Batch effects work against collaborative science!
  22. ● Remove unwanted variability ● Paired design ● Meaningful metadata ● Filter genes
      Paired design: taking control samples from the same patients after resolution or before the event.
      ● Increases power
      ● Not all analysis frameworks can take advantage of it
      ● Sometimes biologically difficult
      ● Reduces DF by half
  23. ● Remove unwanted variability ● Paired design ● Meaningful metadata ● Filter genes
      Gender and age can always be relevant.
      Collect metrics of sample quality (before and after sequencing).
      Disease subtypes can be a covariate or group variable.
      Helps choose which samples to sequence when only a subset can be sequenced.
  24. ● Remove unwanted variability ● Paired design ● Meaningful metadata ● Filter genes
      Multiple testing correction for 20K+ genes.
      Remove mostly unexpressed genes.
      A priori removal is allowed.
  25. Results
      ● edgeR GLM
      ● Nasal DEG (p<0.05): 40 (paired) / 51 (unpaired)
      ● Blood DEG (p<0.05): 76 (paired) / 2 (unpaired)
      ● Every parameter choice changes results
      ● Validation?
  26. Annotation representation testing – Panther.db
      ● Annotation is a subset of genes
      ● Multiple available annotation sets (structure, function, pathway...)
      ● We only use significant genes
      ● Overrepresentation test – chi-square to compare observed and expected frequencies
      ● Enrichment test – Mann-Whitney to test randomness of ranks
  27. Molecular function in blood (PAIRED)
      ● Increased immunoglobulin binding
      ● Reduced smell (in blood!)
      ● Reduced oxygen binding and carrier activity
      ● We consider the result validated
  28. Takeaways of the study
      ● Study rescued by pairing
      ● No batch to correct
      ● Almost no metadata
      ● Smaller signal in blood
      ● Specific tissue (nasal) more robust
  29. WHAT HAPPENED?
      Data science implications
  30. Reduced individual variation
      [Diagram labels: Effect between groups; Inter-individual variation in RNA; Batch effects; Representation variability; Tissue sample; Chemical preparation; Sequencing; Intra]
  31. Reduced batch effects
      [Diagram labels: Effect between groups; Inter-individual variation in RNA; Batch effects; Representation variability; Tissue sample; Chemical preparation; Sequencing; Intra]
  32. Easier to control for batches
      ● Pairing absorbs a proportion of batch effects
      ● Usually 8 lanes in a flowcell
      ● Focus on pairs instead of whole samples
      ● Aggregation of datasets easier
  33. Technical downsides of pairing
      ● Loss of half DF
      ● Many frameworks cannot use it as easily as GLM-based ones
      ● RNA is used for other analyses:
        ○ SUPPA2 for alternative splicing
        ○ Building empirical distribution from all pairs of samples
        ○ If pairing were enforced, the number of observations would drop drastically
  34. SHOULD WE ALWAYS PAIR?
      Medical implications
  35. Tissue implications
      ● Specific tissues have robust signatures without pairing
      ● Blood reflects many tissues:
        ○ Weaker signal
        ○ Local changes reflected
      ● Systemic effects are found only in blood
      ● Always available for sampling (minimally invasive)
      ● Blood analysis benefits from pairing
  36. Utility implications
      ● Paired designs are easier to aggregate to meta-studies (robust to batch effects)
      ● Blood controls can be used as unpaired controls for other studies (if healthy enough)
      ● Solves the problem of finding controls
      ● If controls are after resolution, questionable health (long COVID)
      ● Some chronic diseases cannot be caught early or ever resolved, so pairing is impossible
  37. Example – cardiovascular events
      ● We are interested in markers of plaque progression/instability
      ● Patient checkup and sampling every X months
      ● Sequencing is expensive, sampling and storing is not
      ● Sequence only the last two samples taken before the event
  38. Example – neurodegenerative disease (ALS)
      ● We cannot predict the disease (10% familial)
      ● Patient available for sampling once diseased
      ● Sequence patients sufficiently apart
      ● We cannot find the root cause of ALS, as we are not catching the initial event
      ● We can find signatures of neuronal suffering and death, which is an actionable point
      ● Generalizes to all chronic diseases
  39. Example – cancer
      ● For DNA, tumor is matched with blood sample control
      ● For RNA, we need the normal surrounding tissue
      ● Sampling the healthy normal target tissue may be problematic
      ● Tissue margin – potential normal sample
      ● Admixture of tumor in normal reduces the signal (but not critically for RNA)
  40. Many thanks to...
      ● Institute for Biocides and Medical Ecology for providing the samples and sequencing
      ● HTEC Group for providing computational resources and support
      ● School of Medicine, University of Belgrade for supporting research
      ● Thanks to DSC organizers for the invite
      ● Last but not least...
  41. ...THANK YOU FOR LISTENING!
  42. ognjen.milicevic@med.bg.ac.rs
      ognjen.milicevic@htecgroup.com
      ognjen011@gmail.com
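
As a rough illustration of the pairing argument on slides 22, 28 and 30, the sketch below computes a sample size for a single gene on a log scale. It uses the textbook normal-approximation formula n = (z_{1-α/2} + z_{power})² · 2σ²(1−ρ) / δ², which the slides do not state. The function name `samples_per_group`, the effect size δ, the standard deviation σ and the within-patient correlation ρ are all illustrative assumptions. The dedicated tools listed on slide 18 model count data and will give different numbers.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(delta, sigma, alpha=0.05, power=0.80, rho=0.0):
    """Normal-approximation sample size for a two-sided, two-condition comparison.

    delta: assumed difference in mean log-expression between conditions
    sigma: assumed per-sample standard deviation (same in both conditions)
    rho:   assumed correlation between the two samples of one patient;
           rho=0 is the unpaired case, rho>0 a paired design
    Returns samples per group (unpaired) or number of pairs (paired).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    # Variance of the difference between the two conditions.
    var_diff = 2 * sigma**2 * (1 - rho)
    return ceil((z_alpha + z_power) ** 2 * var_diff / delta**2)


# Hypothetical gene: a 1-unit shift in log2 expression with an SD of 1.
unpaired = samples_per_group(delta=1.0, sigma=1.0)          # 16 per group
paired = samples_per_group(delta=1.0, sigma=1.0, rho=0.7)   # 5 pairs
# A stricter threshold (here Bonferroni over 20,000 genes) raises the requirement.
strict = samples_per_group(delta=1.0, sigma=1.0, alpha=0.05 / 20_000)
print(unpaired, paired, strict)
```

Under these assumptions the required number drops by the factor (1−ρ), so the benefit of pairing depends entirely on how strongly the two samples from one patient correlate.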
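
The gene-filtering point on slide 24 can be sketched with a plain Benjamini-Hochberg adjustment. The slides do not say which multiple testing correction was used, so Benjamini-Hochberg is an assumption here, and the p-values below are invented purely for illustration. The sketch shows how removing uninformative (mostly unexpressed) genes before testing reduces the correction burden.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values: 4 expressed genes with signal, 16 unexpressed genes without.
expressed = [0.001, 0.004, 0.012, 0.03]
unexpressed = [0.6] * 16

hits_all = sum(q < 0.05 for q in benjamini_hochberg(expressed + unexpressed))
hits_filtered = sum(q < 0.05 for q in benjamini_hochberg(expressed))
print(hits_all, hits_filtered)  # 2 4
```

In this toy case two genes pass at a 5% false discovery rate without filtering and four pass with it. The filter has to be chosen without looking at the group comparison, which is how the "a priori" remark on slide 24 is read here.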

Editor's notes

  • Hello, my name is Ognjen Milicevic from Belgrade, Serbia. Because of my mixed medical and engineering background, today I chose to tackle an interdisciplinary subject -
