Data analysis patterns, tools and data types in genomics

Simplified introduction for data analysis tasks in genomics

  1. 1. Data analysis patterns, tools and data types in genomics for the uninitiated. BIMSB Sys. Bio. Lectures, Jan 2019. Altuna Akalin
  2. 2. Last week today… • Talked about mindset for research • Most of the content is here as a blog post: https://medium.com/@aakalin
  3. 3. This week • Common data analysis patterns in genomics • Which tools and data types are relevant in which step? • Ideas on how to get started with learning bioinformatics • Programming languages used for data analysis. Slides will be at https://www.slideshare.net/altunaakalin
  4. 4. What can we do with high-throughput assays? • Which genes are expressed and how much? • Where does a transcription factor bind? • Which bases are methylated in the genome? • Which transcripts are translated? • Where do RNA-binding proteins bind? • Which microRNAs are expressed? • Which parts of the genome are in contact with each other? • Where are the mutations in the genome located? • Which parts of the genome are nucleosome-free? • Many more…
  5. 5. The general idea behind high-throughput techniques. From http://compgenomr.github.io/book
  6. 6. High-throughput sequencing • AKA massively parallel sequencing • collection of many methods and technologies • can sequence DNA, millions of fragments at a time.
  7. 7. From http://compgenomr.github.io/book
  8. 8. How do you go from here…
  9. 9. …to here ?
  10. 10. General genomics workflow From http://compgenomr.github.io/book
  11. 11. General data analysis workflow: Data collection → Quality check & cleaning → Processing → Exploratory analysis & modeling → Visualization & reporting
  12. 12. Data collection • Where and how you get your data • Includes publicly available data resources
  13. 13. Data collection for genomics • Which sequencing technology are you using? • What kind of experiments are you doing? • How many samples? • How many replicates? • Which public data will you include in your analysis? Output of data collection: Fastq files
  14. 14. Quality check and clean up, in general • Data clean up starts with the data set you get • Can include removing low quality data points • Can include removing missing values or incomplete data sets
  15. 15. Quality check and clean up, in genomics • Quality check is mostly about checking read quality • Can involve removing low quality bases from reads • Can involve removing adapter/barcode sequences from reads. Quality check & cleaning takes Fastq files in and gives Fastq files out. PS: You can also filter aligned reads based on how well they align; this is ignored here for simplicity
  16. 16. Quality check & cleaning, example tools: – Trimming reads: TrimGalore, cutadapt, trimmomatic – Read quality check: fastqc, multiQC. Input and output are both Fastq files
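The read quality check and trimming described on the two slides above can be sketched in a few lines of Python. This is only an illustration of the idea, not how FastQC, cutadapt or TrimGalore work internally. It assumes Phred+33 quality encoding and four-line Fastq records; the helper names (`parse_fastq`, `phred_scores`, `trim_3prime`), the cutoff of 20 and the two reads are made up for the example.

```python
from statistics import mean

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def parse_fastq(lines):
    """Yield (name, sequence, quality string) from four-line Fastq records."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual  # drop the leading "@" of the header


def phred_scores(qual):
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in qual]


def trim_3prime(seq, qual, min_q=20):
    """Naive trimming: drop trailing bases whose quality is below min_q."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Two made-up reads; "I" encodes Q40 and "#" encodes Q2 under Phred+33.
example = ["@read1", "ACGTACGT", "+", "IIIIII##",
           "@read2", "GGGG", "+", "IIII"]

for name, seq, qual in parse_fastq(example):
    trimmed_seq, _ = trim_3prime(seq, qual)
    print(name, "mean quality:", mean(phred_scores(qual)), "trimmed:", trimmed_seq)
```

Real trimmers use more careful algorithms and also remove adapters, so a sketch like this is useful for understanding the data, not for production cleaning.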
  17. 17. Data Processing, in general • Transforming raw data to a state where modeling or exploratory data analysis can start • Can include making a tabular data structure from raw data • Can include data transformations such as taking logs or normalization
  18. 18. Data Processing, in genomics • Alignment + quantification • Can include further processing/modeling such as calling peaks for ChIP-seq. Processing takes Fastq files and produces SAM/BAM files, BED files and text files
  19. 19. SAM/BAM files • These are produced by aligners such as, but not limited to, STAR, Bowtie and BWA • SAM is a tab-delimited text format that contains alignment information. Pavlopoulos, BioData Mining 2013, 6:13
  20. 20. SAM/BAM files • BAM is the compressed binary version of SAM; a sorted BAM file can be indexed • Indexing allows random access to the compressed file • samtools and friends filter/manipulate BAM files • More info @ http://samtools.github.io/hts-specs/
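Because SAM is tab-delimited text, a single alignment line can be taken apart with a plain string split. The sketch below relies on the 11 mandatory columns and the FLAG bits 0x4 (read unmapped) and 0x10 (read on the reverse strand) from the SAM specification linked above. The two records, the `SamRecord` type, the helper names and the MAPQ cutoff of 30 are illustrative assumptions; for real BAM files you would use samtools or a dedicated library instead.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """A subset of the 11 mandatory SAM columns."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference (chromosome) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read sequence


def parse_sam_line(line):
    """Split one alignment line (not a header line starting with '@')."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9])


def is_unmapped(rec):
    return bool(rec.flag & 0x4)


def is_reverse(rec):
    return bool(rec.flag & 0x10)


def keep_alignment(rec, min_mapq=30):
    """Keep mapped reads with mapping quality at or above the cutoff."""
    return not is_unmapped(rec) and rec.mapq >= min_mapq


# Made-up records: one reverse-strand alignment and one unmapped read.
sam_lines = [
    "read1\t16\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\tIIIIIIII",
]
records = [parse_sam_line(line) for line in sam_lines]
kept = [rec.qname for rec in records if keep_alignment(rec)]
print(kept)
```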
  21. 21. BED files • BED files are produced by aligners or, more frequently, by post-alignment processing • Example: ChIP-seq peak callers such as MACS2 • More info: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
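BED coordinates are 0-based and half-open (the start position is included, the end position is not), as described in the UCSC format page linked above, so the length of a feature is simply end minus start. Here is a minimal sketch of parsing BED lines and testing two intervals for overlap; the peak coordinates, the `BedInterval` type and the helper names are made up for illustration.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def parse_bed_line(line):
    """Parse the three required BED columns plus an optional name column."""
    fields = line.rstrip("\n").split("\t")
    name = fields[3] if len(fields) > 3 else "."
    return BedInterval(fields[0], int(fields[1]), int(fields[2]), name)


def interval_length(iv):
    """Half-open coordinates make the length a plain subtraction."""
    return iv.end - iv.start


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


# Made-up peaks: peak2 starts exactly where peak1 ends, so they do not overlap.
bed_lines = ["chr1\t100\t200\tpeak1", "chr1\t200\t300\tpeak2", "chr1\t150\t250\tpeak3"]
peak1, peak2, peak3 = (parse_bed_line(line) for line in bed_lines)
print(interval_length(peak1), overlaps(peak1, peak2), overlaps(peak1, peak3))
```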
  22. 22. Non-standard text files • Alignment quantification tools such as featureCounts or HTSeq-count can output text files • These contain the number of reads per transcript or gene across samples
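A reads-per-gene table like the one described above is usually the starting point for the normalization and log transformation mentioned under data processing. The sketch below reads a small tab-delimited table and computes counts per million (CPM) and log2(CPM + 1). The table is invented and deliberately simplified: the actual output of featureCounts or HTSeq-count has a different layout, and the function names here are hypothetical. CPM is used only as one simple example of normalization.

```python
import csv
import io
import math

# Made-up counts table: one row per gene, one column per sample.
COUNTS_TSV = (
    "gene\tsampleA\tsampleB\n"
    "geneX\t10\t0\n"
    "geneY\t90\t50\n"
    "geneZ\t900\t950\n"
)


def read_counts(text):
    """Return {gene: {sample: count}} from tab-delimited text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row["gene"]: {s: int(v) for s, v in row.items() if s != "gene"}
        for row in reader
    }


def counts_per_million(counts):
    """Scale each sample so that its counts sum to one million."""
    totals = {}
    for per_sample in counts.values():
        for sample, n in per_sample.items():
            totals[sample] = totals.get(sample, 0) + n
    return {
        gene: {s: n * 1e6 / totals[s] for s, n in per_sample.items()}
        for gene, per_sample in counts.items()
    }


def log2_transform(table, pseudocount=1.0):
    """Apply log2(x + pseudocount) so that zero counts stay defined."""
    return {
        gene: {s: math.log2(v + pseudocount) for s, v in per_sample.items()}
        for gene, per_sample in table.items()
    }


counts = read_counts(COUNTS_TSV)
cpm = counts_per_million(counts)
log_cpm = log2_transform(cpm)
print(cpm["geneX"], log_cpm["geneX"])
```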
  23. 23. General trend in genomics file formats • Text files • Tab-delimited • genomic location and other features such as names (gene or feature names) and scores • Many formats (such as BED and SAM) can be compressed and indexed
  24. 24. Exploratory analysis and modeling. In general: • How samples or variables relate to each other – clustering & dimension reduction (PCA, etc.) • Prediction of variable of interest: Y ~ X1 + X2 + X3 • Statistical models including hypothesis testing. In genomics: • All of the above • Annotation with gene sets/pathways • Looking at genomics data with special browsers, such as UCSC genome browser or IGV
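One simple way to look at how samples relate to each other is to compute a correlation-based distance between their expression profiles, which is a common input to clustering. The sketch below does this with plain Python rather than PCA or a clustering library. The three samples and their values are invented, and `pearson` and `correlation_distances` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_distances(samples):
    """Return {(sample_a, sample_b): 1 - r} for every pair of samples."""
    return {
        (a, b): 1.0 - pearson(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }


# Made-up expression values for four genes in three samples.
samples = {
    "ctrl1": [1.0, 2.0, 3.0, 4.0],
    "ctrl2": [1.1, 2.0, 3.2, 3.9],
    "treat1": [4.0, 3.0, 2.0, 1.0],
}

distances = correlation_distances(samples)
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

A hierarchical clustering would repeatedly merge the closest pair found this way; in practice you would hand the distance matrix to an existing clustering function.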
  25. 25. Final visualization and reporting. In general: • Final figures, tables and text that describe the outcome of your analysis • Jupyter notebook or Rmarkdown are the go-to tools for compiling reports these days. In genomics: • Same as above • Example reports from RNA-seq analysis
  26. 26. Example RNA-seq workflow • Data collection • Quality check & cleaning: Fastqc, trimGalore • Processing: STAR, featureCounts • Exploratory analysis & modeling: DESeq2, gProfiler • Visualization & reporting: rmarkdown
  27. 27. Example ChIP-seq workflow • Data collection • Quality check & cleaning: Fastqc, trimGalore • Processing: Bowtie2, MACS2 • Exploratory analysis & modeling: genomation, clustering • Visualization & reporting: Rmarkdown
  28. 28. First pass analysis • Running through your workflow with default or pre-defined parameters • Gives you an idea about data set quality and biology
  29. 29. Analysis/re-analysis cycle • The first-pass analysis often has to be repeated, looping back through: Data collection → Quality check & cleaning → Processing → Exploratory analysis & modeling → Visualization & reporting
  30. 30. Can’t we automate all this? • Yes, to some extent: http://bioinformatics.mdc-berlin.de/pigx/ (Wurmus et al. 2018, GigaScience)
  31. 31. This is not the end • More derivative analysis is required based on the research questions • This could lead to reprocessing data or different modeling and visualization
  32. 32. The most important part of data analysis is visualization • In each step there is some visualization involved • Intermediate results are VERY important. This applies to every stage: Data collection, Quality check & cleaning, Processing, Exploratory analysis & modeling, Visualization & reporting
  33. 33. Importance of data exploration with genome browsers: a walk-through
  34. 34. Look at your genes or regions of interest with processed data
  35. 35. Look through your genes or regions of interest with processed data Genes of interest Control genes
  36. 36. Form a hypothesis or observation based on limited data points. Based on the limited data points I looked at: • It seems my genes of interest have longer CpG islands • It seems my genes of interest have broader transcription initiation
  37. 37. Test hypothesis/observation with all the data. To be able to do that: • We need to get the features of interest for all genes (genes of interest and control genes) • We need to calculate lengths and numbers of features • We need to do hypothesis testing • We need to do visualization. The results of such an analysis are in Akalin et al. 2009, Genome Biology
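The hypothesis-testing step listed above can be sketched with a permutation test on feature lengths. The slide does not say which test was used, so the permutation test here is just one possible choice, and the CpG island lengths below are invented numbers, not data from Akalin et al. 2009. The function name and its parameters are hypothetical.

```python
import random


def mean_of(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Returns the observed difference and an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean_of(group_a) - mean_of(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign group labels
        diff = mean_of(pooled[:n_a]) - mean_of(pooled[n_a:])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (hits + 1) / (n_perm + 1)


# Made-up CpG island lengths (bp) for genes of interest and control genes.
interest_lengths = [1200, 1500, 1350, 1600, 1450, 1700]
control_lengths = [600, 750, 500, 800, 650, 700]

observed, p_value = permutation_test(interest_lengths, control_lengths)
print("difference in mean length:", observed, "p-value:", p_value)
```

With real data you would also plot the two length distributions, since the visualization step is what lets you judge whether the difference is meaningful.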
  38. 38. Give me some practical advice: how can I start analyzing my own data? Programming: • Terminal (Bash) • R • Python • Perl • … Click through (GUI): • Galaxy • KNIME • …
  39. 39. Galaxy or other GUIs • There is a tool for every step of the analysis, and you can chain them: https://usegalaxy.org/ • You still need to know how and where to use each tool • The only thing you are bypassing is the terminal/command line • GUIs are limited in their flexibility
  40. 40. Programming • Learning programming diversifies your skillset – Better for postdoc applications – Can do stuff outside science or academia • Learning a GUI does not give you the same edge
  41. 41. Where do I begin ? First, a motivating example
  42. 42. Where do I begin? Graduate A: • PhD – Genetics, Thesis: wet-lab genomics • M.Sc. – Molecular Biology, Thesis: wet-lab genomics • B.Sc. – Molecular Biology. Graduate B: • PhD – Genetics, Thesis: wet-lab genomics • M.Sc. – Biology, Thesis: wet-lab • Pharmacist in Training • B.Sc. – Pharmacy. Guess the first position after the PhD?
  43. 43. Why R ? • All of exploratory analysis, modeling, visualization and reporting can be done in R • Bioconductor has thousands of specialized bioinformatics algorithms/methods – You can even do alignments & quality check
  44. 44. Where do I begin? • Learn how to read text or csv tables • Learn how to manipulate data frames • And make simple plots (plot(), hist(), barplot()) • Repeat until you are comfortable. Then: • Write a function • Learn loops and control structures • Learn about other R data types
  45. 45. Get online/offline courses and material • Coursera courses: https://www.coursera.org/learn/r-programming • Computational genomics with R (book draft): http://compgenomr.github.io/book • Rstudio resources: https://www.rstudio.com/online-learning/#r-programming • Datacamp interactive learning (some free stuff): https://www.datacamp.com/courses/free-introduction-to-r
  46. 46. Buddy up • Find a colleague from your lab or a neighboring lab so that you can ask each other questions about programming
  47. 47. How do I get help? • Google it: most problems you will encounter have already been encountered by someone else. The answer is usually reachable with a well-formed Google query • If that fails, come to the “bioinfo. Clinics” or book a consultation at http://iris.mdc-berlin.de
  48. 48. Don’t be a perfectionist • Just do something that resembles what you want to do; you will iterate later to make it better • Ex: – Don’t worry about making the cutest plot, just make a plot – Don’t worry about which mapping algorithm is the best, just use one and get some results
  49. 49. Python vs R • If Python is the greatest thing that has happened to general-purpose programming languages, R/Bioconductor is the greatest thing that has happened to bioinformatics • If you learn Python first, you will regularly have to drop into R for any kind of statistics developed for HT-seq.
  50. 50. Convergence of data analysis/science languages: Python and R counterparts • pandas and data frames • statsmodels and stats • seaborn and ggplot
  51. 51. Convergence of data analysis/science languages https://ursalabs.org/
  52. 52. Convergence of data analysis/science languages Data & Machine-learning models Interface for data access and manipulation
  53. 53. @AltunaAkalin • http://bioinformatics.mdc-berlin.de • http://github.com/BIMSBbioinfo • Slides will be at: https://www.slideshare.net/altunaakalin
  54. 54. More References/Reading material • RNA-seqlopedia https://rnaseq.uoregon.edu/ • Computational genomics with R, http://compgenomr.github.io/book • Biostars tutorials: https://www.biostars.org/t/Tutorials/