Course on parsing methods for biologists with a focus on ChIP-seq data

This presentation covers data formats in bioinformatics and basic Linux tools, with a focus on the ChIP-seq experiment.


  1. Luca Cozzuto, Sarah Bonnin (Bioinformatics Core Facility). Additional topics (parsing methods) for biologists with a focus on ChIP-seq data
  2. ChIP-Seq experiment. Figure by Jkwchui: cell diagram adapted from LadyOfHats' Animal Cell diagram. Information based on Illumina data sheet, as well as ChIP and immunoprecipitation articles & references. CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17890854
  6. Raw data, reads in FASTQ format
          @HWI-ST227:389:C4WA2ACXX:7:1204:2272:59979
          GGAGGAAGGTCCTCGCTCCTCTTTCATATAAGGGAAATGGCTGAAT
          +
          FFFFHHHHHHJIJJJJJJJJIJJJIGIGIGGIJJIJIJJJJJJIII
          @HWI-ST227:389:C4WA2ACXX:7:1205:15214:42893
          GAGGATCCCAGGGAGGAAGGTCCTCGCTCCTCTTTCATCTAAGGGA
          +
          12BAFB?A:3<AE1@<FF;1*@EG*)?0?DBD>9BF9B*?######
          @HWI-ST227:389:C4WA2ACXX:8:2208:2467:44624
          AAAGAGGAGAGAGGACCATCCTCCCTGGGATCCTCAGAAGTCTACT
          +
          BDDA:DB?2AA@FC>F?EEGC<FED>GFD;?GBB?<?F99*/9?9?
  7. Raw data, reads in FASTQ format. Each read is a four-line record:
          Header:     @HWI-ST227:389:C4WA2ACXX:7:1204:2272:59979
          Sequence:   GGAGGAAGGTCCTCGCTCCTCTTTCATATAAGGGAAATGGCTGAAT
          Separator:  +
          Quality:    FFFFHHHHHHJIJJJJJJJJIJJJIGIGIGGIJJIJIJJJJJJIII
  8. Raw data, reads in FASTQ format. Counting FASTQ reads (the slow way):
          zcat B7_H3K4me1.fastq.gz | awk '{num++}END{print num/4}'
          41103741
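
The same count can be sketched in Python. This is a minimal sketch that assumes every record spans exactly four lines, as in the records shown on the slides; the function name `count_fastq_reads` and the demo file are hypothetical, and the two demo reads are made up for illustration.

```python
import gzip
import os
import tempfile


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Tiny made-up FASTQ file, written to a temporary directory for the demo.
demo_records = (
    "@read1\nGGAGGAAGG\n+\nFFFFHHHHH\n"
    "@read2\nGAGGATCCC\n+\n12BAFB?A:\n"
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(demo_records)

n_reads = count_fastq_reads(demo_path)
print(n_reads)
```

Like the awk one-liner, this reads the whole file line by line, so it is equally "slow" on a file with tens of millions of reads.
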
  9. Raw data, reads in FASTQ format. Phred quality score:
       • Q = -10 log10(p)
       • p = probability that the corresponding base call is incorrect
       • Example: p = 0.001 means a quality of 30
     Quality characters and the scores they encode:
          !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
          0.........................26...31........41
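
The Phred formula and the character ladder above can be checked with a short Python sketch. It assumes the ASCII offset of 33 that the ladder implies ("!" encodes 0); the function names are hypothetical.

```python
import math


def phred_from_error_prob(p):
    """Q = -10 * log10(p), where p is the base-call error probability."""
    return -10 * math.log10(p)


def error_prob_from_phred(q):
    """Inverse of the formula above: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption matching the ladder on the slide.
    """
    return [ord(char) - offset for char in quality_string]


print(phred_from_error_prob(0.001))   # the slide's example, p = 0.001
print(decode_quality("!;@J"))         # characters labelled on the ladder
print(decode_quality("FFFFHHHHHHJI"))  # start of the first read's quality line
```
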
  10. Raw data, reads in FASTQ format. Analyzing the quality with FastQC (figure: example reports for GOOD and BAD data). https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  11. Alignment
       • Align 20-30 million reads per sample to the reference genome.
       • The reference genome can be very long (human is about 3 gigabases).
       • We need ultra-fast mappers:
           • Bowtie (http://bowtie-bio.sourceforge.net/index.shtml)
           • BWA (http://bio-bwa.sourceforge.net/)
           • GEM (https://github.com/smarco/gem3-mapper)
           • …
  12. Reference genome (FASTA file)
          >1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
  13. Reference genome (FASTA file). The line starting with ">" is the header; the sequence lines follow it (here, runs of N):
          Header:  >1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
  14. Reference genome (FASTA file). Listing the sequence headers:
          zcat GRCm38.primary_assembly.genome.fa.gz | grep ">"
          >chr1 1
          >chr2 2
          >chr3 3
          >chr4 4
          >chr5 5
          >chr6 6
          >chr7 7
          >chr8 8
          >chr9 9
          >chr10 10
          >chr11 11
          >chr12 12
          >chr13 13
          >chr14 14
          >chr15 15
          >chr16 16
          >chr17 17
          >chr18 18
          >chr19 19
          >chrX X
          >chrY Y
          >chrM MT
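
Beyond listing headers with grep, a FASTA file can be walked record by record. The following is a minimal Python sketch that collects each sequence name (the first word after ">") and its length; the function name `fasta_lengths` and the short demo sequences are made up for illustration.

```python
import io


def fasta_lengths(handle):
    """Return {sequence name: sequence length} for a FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep only the first word after ">".
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            # Sequence line: add its length to the current record.
            lengths[name] += len(line)
    return lengths


# Made-up miniature FASTA with two records.
demo = io.StringIO(">chr1 1\nNNNNNNNNNN\nACGTACGTAC\n>chrM MT\nGTTAATGTAG\n")
lengths = fasta_lengths(demo)
print(lengths)
```

For a real genome the handle would come from `gzip.open(path, "rt")` instead of `io.StringIO`.
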
  15. Annotations (GTF format) https://www.ensembl.org/info/website/upload/gff.html
          #!genome-build GRCm38.p5
          #!genome-version GRCm38
          #!genome-date 2012-01
          #!genome-build-accession NCBI:GCA_000001635.7
          #!genebuild-last-updated 2017-01
          1 havana gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1";
  16. Annotations (GTF format). The lines starting with "#!" are the header. Each record has nine fields, shown here with the values of the example gene:
          Reference sequence:                      1
          Source:                                  havana
          Feature (gene, transcript, exon, etc.):  gene
          Start:                                   3073253
          End:                                     3074322
          Score:                                   .
          Strand:                                  +
          Frame (0, 1, 2):                         .
          Attributes, separated by ";":            gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; ...
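
The nine GTF fields can be split in Python as sketched below. This assumes tab-separated columns and attribute values that contain no ";" themselves; repeated attribute keys would overwrite each other in this simple version. The function name `parse_gtf_line` is hypothetical, and the demo line reuses the gene record from the slide.

```python
def parse_gtf_line(line):
    """Split one GTF record into its nine fields and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        # Each attribute looks like: key "value"
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "frame": frame, "attributes": attributes,
    }


demo_line = "\t".join([
    "1", "havana", "gene", "3073253", "3074322", ".", "+", ".",
    'gene_id "ENSMUSG00000102693"; gene_version "1"; '
    'gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC";',
])
record = parse_gtf_line(demo_line)
# GTF coordinates are 1-based and inclusive, so length = end - start + 1.
gene_length = record["end"] - record["start"] + 1
print(record["attributes"]["gene_name"], gene_length)
```
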
  18. Alignment
       • Align 20-30 million reads per sample to the reference genome.
       • The reference genome has to be indexed.
       • Problems with repetitive sequences.
       • Problems with PCR artifacts (marking duplicates).
  19. Alignment (SAM / BAM format) https://samtools.github.io/hts-specs/SAMv1.pdf
          @HD VN:1.5 SO:coordinate
          @SQ SN:1 LN:195471971
          @SQ SN:2 LN:182113224
          @SQ SN:3 LN:160039680
          …
          @PG ID:bowtie2 PN:bowtie2 VN:2.3.2 CL:"/usr/local/bin/bowtie2-align-s --wrapper basic-0 --non-deterministic -x bowtie2genome -p 8 -U B7_H3K4me1.fastq.gz"
          NS500454:71:H3TV7BGXY:4:22608:3293:16569 16 1 3000101 7 75M * 0 0 TTTTTTTTTTTTTTTTTTTTTTTGGTTTTGAGACTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTT E/EEEAAEEEEEEE6E6EAEE/E6EEE//<6/E/EAEE/EE/E/EE66E6E6EEEEEEE/EAAA/E/EE/AAAAA MD:Z:25G1G0G46 XG:i:0 NM:i:3 XM:i:3 XN:i:0 XO:i:0 AS:i:-11 XS:i:-20 YT:Z:UU PG:Z:MarkDuplicates
  20. Alignment (SAM / BAM format). The lines starting with "@" are the header:
       • @HD: header line. VN: format version. SO: sorting order of alignments.
       • @SQ: reference sequence dictionary. SN: sequence name. LN: length.
       • @PG: program. ID: program record identifier. PN: program name. VN: version. CL: command line.
  21. Alignment (SAM / BAM format). Each alignment line has eleven mandatory fields, shown here with the values of the example read:
          Query name:                     NS500454:71:H3TV7BGXY:4:22608:3293:16569
          FLAG:                           16
          Reference name:                 1
          Leftmost mapping position:      3000101
          Mapping quality:                7 (p = 0.2 that the mapping position is wrong)
          CIGAR string:                   75M
          Reference name for mate read:   *
          Position of the mate:           0
          Template length:                0
          Sequence:                       TTTTTTTTTTTTTTTTTTTTTTTGGTTTTGAGACTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTT
          Quality:                        E/EEEAAEEEEEEE6E6EAEE/E6EEE//<6/E/EAEE/EE/E/EE66E6E6EEEEEEE/EAAA/E/EE/AAAAA
      In this case FLAG 16 means "read being reverse complemented". The remaining columns (MD:Z:25G1G0G46, NM:i:3, ...) are optional TAG:TYPE:VALUE fields.
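
The alignment fields above can be pulled apart with a few lines of Python. This is a sketch under stated assumptions: columns are tab-separated, only a handful of FLAG bits are listed (not the full table from the SAM specification), and the names `parse_sam_line`, `decode_flag` and `mapq_to_error_prob` are hypothetical. The demo line is the read shown on the slide.

```python
# A few FLAG bits from the SAM specification (deliberately incomplete).
FLAG_BITS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse complemented",
    0x400: "PCR or optical duplicate",
}
MANDATORY = ["qname", "flag", "rname", "pos", "mapq", "cigar",
             "rnext", "pnext", "tlen", "seq", "qual"]
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}


def decode_flag(flag):
    """List the labels of the known bits that are set in FLAG."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def mapq_to_error_prob(mapq):
    """MAPQ is Phred-scaled: p = 10 ** (-MAPQ / 10)."""
    return 10 ** (-mapq / 10)


def parse_sam_line(line):
    """Parse one alignment line into mandatory fields plus optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = {}
    for name, value in zip(MANDATORY, columns[:11]):
        record[name] = int(value) if name in INT_FIELDS else value
    tags = {}
    for column in columns[11:]:
        tag, tag_type, value = column.split(":", 2)  # TAG:TYPE:VALUE
        tags[tag] = int(value) if tag_type == "i" else value
    record["tags"] = tags
    return record


demo_line = "\t".join([
    "NS500454:71:H3TV7BGXY:4:22608:3293:16569", "16", "1", "3000101", "7",
    "75M", "*", "0", "0",
    "TTTTTTTTTTTTTTTTTTTTTTTGGTTTTGAGACTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTT",
    "E/EEEAAEEEEEEE6E6EAEE/E6EEE//<6/E/EAEE/EE/E/EE66E6E6EEEEEEE/EAAA/E/EE/AAAAA",
    "MD:Z:25G1G0G46", "NM:i:3", "AS:i:-11", "PG:Z:MarkDuplicates",
])
record = parse_sam_line(demo_line)
print(decode_flag(record["flag"]))
print(round(mapq_to_error_prob(record["mapq"]), 2))
```
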
  22. Alignment (SAM / BAM format), figure: alignments displayed in IGV. https://software.broadinstitute.org/software/igv/
  23. Quality control of the enrichment (figure). https://deeptools.readthedocs.io/en/develop/index.html
  24. Distribution of the signal (wiggle format) https://deeptools.readthedocs.io/en/develop/index.html
          variableStep chrom=chr2
          300701 12.5
          300702 12.5
          300703 12.5
          300704 12.5
          300705 12.5
          ...
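
A variableStep wiggle block like the one above can be read with the Python sketch below. It handles only variableStep declarations (fixedStep is rejected), skips track and comment lines, and ignores the optional span parameter; the function name `parse_variable_step` is hypothetical.

```python
def parse_variable_step(lines):
    """Return (chrom, position, value) tuples from variableStep wiggle lines."""
    records = []
    chrom = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith("variableStep"):
            # Declaration line: key=value pairs after the keyword.
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
        elif line.startswith("fixedStep"):
            raise NotImplementedError("only variableStep is handled here")
        else:
            position, value = line.split()
            records.append((chrom, int(position), float(value)))
    return records


demo_lines = [
    "variableStep chrom=chr2",
    "300701 12.5",
    "300702 12.5",
    "300703 12.5",
    "300704 12.5",
    "300705 12.5",
]
records = parse_variable_step(demo_lines)
print(records[0], len(records))
```
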
  25. Peak calling (figure). https://software.broadinstitute.org/software/igv/
  26. Peak calling. It is possible to infer the fragment size and use it for extending the reads to get more reliable peaks (i.e. binding sites). The peak is in the middle.
      Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137.
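
The idea of extending reads to the fragment size can be illustrated with a toy Python sketch. This is not the MACS algorithm itself, only the extension step under simple assumptions: coordinates are 0-based and half-open, the fragment size of 200 is a made-up value, and the function name `extend_read` is hypothetical.

```python
def extend_read(start, end, strand, fragment_size):
    """Extend a read towards its 3' end until it spans fragment_size bases."""
    if strand == "+":
        return start, max(end, start + fragment_size)
    if strand == "-":
        return max(0, min(start, end - fragment_size)), end
    raise ValueError(f"unknown strand: {strand!r}")


# Two made-up 50 bp reads flanking a binding site, one on each strand.
forward = extend_read(1000, 1050, "+", 200)
reverse = extend_read(1300, 1350, "-", 200)

# The region covered by both extended reads lies between the two reads.
shared = (max(forward[0], reverse[0]), min(forward[1], reverse[1]))
print(forward, reverse, shared)
```

In this toy case the extended reads overlap only in the middle of the region, which is where the slide places the peak.
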
  27. Peak coordinates (BED format) https://genome.ucsc.edu/FAQ/FAQformat.html#format1
       • 3-field BED: Chromosome, Start, End
       • 6-field BED: + Name, Score, Strand
       • + thickStart, thickEnd, itemRgb
       • 12-field BED: + blockCount, blockSizes, blockStarts
          track name=chipseq description="IP of Ring1B TF"
          1 3444977 3445551 peak_1 31 .
          1 4773116 4774454 peak_2 114 .
          1 4774530 4777431 peak_3 108 .
          1 4786374 4786850 peak_4 80 .
          1 4806806 4807288 peak_5 66 .
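
The 6-field BED lines above can be parsed as sketched below in Python. The sketch assumes whitespace-separated columns and skips track, browser and comment lines; BED starts are 0-based and ends are exclusive, so a peak's width is simply end minus start. The function name `parse_bed6` is hypothetical.

```python
def parse_bed6(lines):
    """Parse 6-field BED lines into dictionaries, skipping track lines."""
    peaks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name, score, strand = line.split()[:6]
        peaks.append({
            "chrom": chrom, "start": int(start), "end": int(end),
            "name": name, "score": int(score), "strand": strand,
        })
    return peaks


demo_lines = [
    'track name=chipseq description="IP of Ring1B TF"',
    "1 3444977 3445551 peak_1 31 .",
    "1 4773116 4774454 peak_2 114 .",
]
peaks = parse_bed6(demo_lines)
# 0-based, half-open coordinates: width = end - start.
widths = [peak["end"] - peak["start"] for peak in peaks]
print(widths)
```
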
  28. bigBed and bigWig format: indexed binary formats generated from BED and wiggle files. https://genome.ucsc.edu/goldenpath/help/bigWig.html https://genome.ucsc.edu/goldenpath/help/bigBed.html
  29. Annotating peaks. Crossing information from GTF files and BED files (BedTools):
          intersectBed -a Peaks/B7_H3K4me1_vs_B7_input-macs-narrow--q_0_peaks.bed -b gencode.vM17.annotation.gtf -wa -wb -nonamecheck | awk '{if ($9 == "gene") print }'
      https://bedtools.readthedocs.io/en/latest/
      Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. 2014 Sep 8;47:11.12.1-34.
  30. Annotating peaks. Crossing information from GTF files and BED files (BedTools). Each output line of the intersectBed and awk pipeline holds the BED columns of a peak followed by the overlapping GTF gene record, for example:
          chr1 3444977 3445551 peak_15 31 . chr1 HAVANA gene 3205901 3671498 . - . gene_id "ENSMUSG00000051951.5"; gene_type "protein_coding"; gene_name "Xkr4"; level 2; havana_gene "OTTMUSG00000026353.2";
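
What "overlap" means when the two files use different coordinate conventions can be sketched in Python. BED intervals are 0-based and half-open, GTF intervals are 1-based and inclusive, so the GTF start is shifted by one before comparing. This is a naive illustration, not how BedTools is implemented; the function name is hypothetical and the demo uses the peak and the Xkr4 gene from the output above.

```python
def bed_overlaps_gtf(bed_start, bed_end, gtf_start, gtf_end):
    """True if a BED interval and a GTF interval share at least one base."""
    # Convert GTF [start, end] (1-based, inclusive) to 0-based half-open.
    gene_start = gtf_start - 1
    gene_end = gtf_end
    return bed_start < gene_end and gene_start < bed_end


# Peak and gene coordinates taken from the example output line.
peak_in_gene = bed_overlaps_gtf(3444977, 3445551, 3205901, 3671498)
print(peak_in_gene)

# Boundary cases: a GTF feature at bases 21..30 starts at 0-based index 20.
print(bed_overlaps_gtf(10, 20, 21, 30))  # BED covers indices 10..19
print(bed_overlaps_gtf(10, 21, 21, 30))  # BED covers indices 10..20
```
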
  31. Annotating peaks. Crossing information from GTF files and BED files (BedTools), finding the closest gene to each peak:
          awk '{if ($3 == "gene") print }' gencode.vM17.annotation.gtf | closestBed -a Peaks/B7_H3K4me1_vs_B7_input-macs-narrow--q_0_peaks.bed -d -b -
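
The closest-gene search can be illustrated with a brute-force Python sketch. It is not the BedTools implementation, and its distance convention is an assumption chosen here: overlapping intervals get distance 0 and directly adjacent ones get distance 1. All intervals are taken as 0-based and half-open. The function names and the gene `demo_gene` are made up; Xkr4 and peak_2 come from the slides, with the Xkr4 start converted from its GTF coordinate.

```python
def interval_distance(a_start, a_end, b_start, b_end):
    """Distance between two half-open intervals (0 if they overlap)."""
    if a_start < b_end and b_start < a_end:
        return 0
    if a_end <= b_start:
        return b_start - a_end + 1
    return a_start - b_end + 1


def closest_gene(peak, genes):
    """Return (gene, distance) for the nearest gene on the peak's chromosome."""
    chrom, start, end = peak[:3]
    candidates = [
        (gene, interval_distance(start, end, gene[1], gene[2]))
        for gene in genes
        if gene[0] == chrom
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[1])


genes = [
    ("chr1", 3205900, 3671498, "Xkr4"),
    ("chr1", 4780000, 4790000, "demo_gene"),  # hypothetical gene
    ("chr2", 4774000, 4775000, "other_chromosome_gene"),  # hypothetical gene
]
peak = ("chr1", 4773116, 4774454, "peak_2")
gene, distance = closest_gene(peak, genes)
print(gene[3], distance)
```

Scanning every gene for every peak is quadratic; tools such as BedTools work on sorted input to avoid that cost.
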
