Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Basics of RNA-seq data analysis
Revision
Drawback of dynamic programming: very slow
Need for faster alignment strategies
Fast sequence alignment strategies
• Hash table based indexing: seed-extend paradigm, space allowance
• Suffix/prefix tree based indexing: suffix array, Burrows-Wheeler transform and FM index
• Merge sorting
Strategy: making a dictionary (index) – An example of 4-nt index
AAAA: 235, 783, 10083,......
AAAC: 132, 236, 832, 932, ...
TTTT: 327, 1328, 5523,......
Algorithms
Hashing reads - Eland, MAQ, Mosaik...
Hashing reference genome - BFAST, Mosaik, SOAP, ...
Hash table based indexing
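The dictionary idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_kmer_index` and the toy reference string are hypothetical, positions are reported 1-based, and real aligners use far more compact data structures.

```python
from collections import defaultdict

def build_kmer_index(reference, k=4):
    """Map every k-mer of the reference to the 1-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start + 1)
    return dict(index)

# Toy reference sequence (hypothetical, not real genome data).
reference = "AAAACGTAAAAC"
index = build_kmer_index(reference, k=4)

print("AAAA:", index["AAAA"])
print("AAAC:", index["AAAC"])
```

Looking up a read's k-mer in such an index returns candidate positions immediately, instead of scanning the whole reference.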
Burrows-Wheeler transform and FM index
Criteria for choosing an aligner
• Global or local
• Aligning short sequences to long sequences such as short
reads to a reference
• Aligning long sequences to long sequences such as long
reads or contigs to a reference
• Handles small gaps (insertions and deletions)
• Handles large gaps (introns)
• Handles split alignments (chimera)
• Speed and ease of use
Short read aligners
Aligner     Purpose
Bowtie      Fast
BWA         Small gaps (indels)
GSNAP       Large gaps (introns)
Bowtie 2    Takes care of gaps
Long sequence aligners
Aligner     Purpose
BLAST       Many reference genomes
BLAT        Large gaps (introns)
BWA         Small gaps (indels)
Exonerate   Ease of use
GMAP        Large gaps (introns)
MUMmer      Align two genomes
RNA-seq alignment strategy
[Figure: the exon-first approach (exon read mapping followed by spliced read mapping across Exon 1 and Exon 2) compared with the seed-extend approach (seed matching with k-mer seeds followed by seed extension)]
Exon-first approach
• Available tools: MapSplice, SpliceMap, TopHat
• Two-step procedure
• Map reads contiguously using unspliced read aligners
• Unmapped reads are split into shorter segments and aligned independently
• Efficient when not too many reads span splice junctions
• Second step is computationally intensive
• Can miss reads across exon-intron junctions
Seed-extend approach
• Representative algorithms
• Genomic short-read nucleotide alignment program (GSNAP)
• Computing accurate spliced alignments (QPALMA)
• Steps
• Break reads into short seeds
• Candidate regions are combined and examined with a more sensitive alignment method (such as Smith-Waterman)
• Increased sensitivity
• One arm may not provide enough specificity for alignment
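The seed-extend steps can be sketched as follows. This is a simplified, hypothetical illustration: the function name `seed_and_extend` and the toy sequences are assumptions, and the extension step counts mismatches without gaps rather than running Smith-Waterman, so it cannot place spliced or gapped reads.

```python
def seed_and_extend(read, reference, k=4, max_mismatches=1):
    """Return sorted 0-based reference positions where the read aligns without gaps."""
    # Index every k-mer of the reference.
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], []).append(pos)

    hits = set()
    # Seed: look up each k-mer of the read in the index.
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            start = ref_pos - offset  # implied start of the whole read
            if start < 0 or start + len(read) > len(reference):
                continue
            # Extend: compare the full read against the candidate region.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)

# Toy data (hypothetical): the read differs from reference[6:14] at one base.
reference = "GGCATTACGGATCCATGA"
read = "ACGGTTCC"
print(seed_and_extend(read, reference))
```

Only one seed of this read matches the reference exactly, which is enough to locate the candidate region; the extension step then tolerates the single mismatch.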
TopHat
A software package that identifies splice sites ab initio by large-scale mapping of RNA-Seq reads.
• A splice junction mapping algorithm must be able to identify reads that may have only a few bases on one side of a junction, or else that junction will be missed
Pipeline:
1. Map reads to whole genome with Bowtie
2. Collect initially unmappable reads
3. Assemble consensus of covered regions
4. Generate possible splices between neighbouring exons
5. Build seed table index from unmappable reads
6. Map reads to possible splices via seed-and-extend
Normalization of read count
RPKM/FPKM (Mortazavi et al., 2008): reads/fragments per kilobase of exon per million mappable reads
• Corrects for: differences in sequencing depth and transcript length
• Aiming to: compare a gene across samples and different genes within a sample
TMM (Robinson and Oshlack, 2010): trimmed mean of M values
• Corrects for: differences in transcript pool composition; extreme outliers
• Aiming to: provide better across-sample comparability
Limma voom (logCPM) (Law et al., 2013): counts per million
• Aiming to: stabilize variance by removing the dependence of the variance on the mean
TPM (Li et al., 2010; Wagner et al., 2012): transcripts per million
• Corrects for: transcript length distribution in RNA pool
• Aiming to: provide better across-sample comparability
RPKM/FPKM (Mortazavi et al., 2008)
• FPKM is used for paired-end reads and RPKM for single-end reads
• Fragment means a fragment of DNA, so the two reads that make up a paired-end read count as one.
• Per kilobase of exon means the fragment counts are then normalized by dividing by the total length of all exons in the gene.
• This bit of magic makes it possible to compare Gene A to Gene B even if they are of different lengths.
• Per million reads means this value is then normalized against the library size.
• This bit of magic makes it possible to compare Gene A in Sample 1 to Gene A in Sample 2.
A quantification measurement for gene expression
R = C / (L × N), with L in kilobases and N in millions
• R: expression level of the gene
• L: length of the gene
• N: depth of the sequencing
• C: total number of reads falling into the gene region
Example: the total exon size of a gene is 3,000 nt. Calculate the expression level of this gene in RPKM in an RNA-seq experiment that contained 50 million mappable reads, with 600 reads falling into exon regions of this gene.
R = 600 / (3.0 × 50) = 4.00
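The worked example can be checked with a small Python helper. The function name `rpkm` and its parameter names are illustrative assumptions; the arithmetic follows the definition above, with the gene length converted to kilobases and the sequencing depth to millions.

```python
def rpkm(read_count, exon_length_nt, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    length_kb = exon_length_nt / 1_000
    depth_millions = total_mapped_reads / 1_000_000
    return read_count / (length_kb * depth_millions)

# Values from the example: 600 reads, 3,000 nt of exon, 50 million mappable reads.
print(rpkm(600, 3_000, 50_000_000))
```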
Calculation of FPKM/RPKM
Genes       Sample 1   Sample 2   Sample 3
1 (2 kb)    20         24         60
2 (4 kb)    40         50         120
3 (1 kb)    10         16         30
4 (10 kb)   0          0          2
Total       70         90         212
Total reads for samples 1, 2 and 3: 7M, 9M and 21.2M
(millions of reads equated to a scale of tens of reads)
Step 1. Divide the reads of each gene by the total reads of the sample
Genes       Sample 1 (RPM)   Sample 2 (RPM)   Sample 3 (RPM)
1 (2 kb)    2.86             2.67             2.83
2 (4 kb)    5.71             5.56             5.66
3 (1 kb)    1.43             1.78             1.42
4 (10 kb)   0                0                0.09
Fragments/reads per kilobase per million reads
Reads are scaled for both depth and length
Step 2. Divide the values obtained after step 1 by the gene lengths
Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
1 (2 kb)                 1.43              1.33              1.42
2 (4 kb)                 1.43              1.39              1.42
3 (1 kb)                 1.43              1.78              1.42
4 (10 kb)                0                 0                 0.009
Total normalized reads   4.29              4.5               4.25
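The two RPKM steps can be replayed in Python on the same read counts. The names `READ_COUNTS`, `GENE_LENGTHS_KB` and `rpkm_values` are illustrative assumptions, and the `reads_per_unit=10` default only mirrors the slide's convention of treating tens of reads as millions.

```python
GENE_LENGTHS_KB = [2, 4, 1, 10]  # genes 1 to 4
READ_COUNTS = {
    "Sample 1": [20, 40, 10, 0],
    "Sample 2": [24, 50, 16, 0],
    "Sample 3": [60, 120, 30, 2],
}

def rpkm_values(counts, lengths_kb, reads_per_unit=10):
    """RPKM per gene: scale by depth first, then by gene length."""
    depth = sum(counts) / reads_per_unit          # e.g. 70 reads -> 7 "million"
    rpm = [count / depth for count in counts]     # step 1: reads per million
    return [value / length for value, length in zip(rpm, lengths_kb)]  # step 2

for sample, counts in READ_COUNTS.items():
    values = rpkm_values(counts, GENE_LENGTHS_KB)
    print(sample, [round(v, 3) for v in values], "total:", round(sum(values), 2))
```

The script prints the per-gene RPKM values and the column total for each sample, which makes it easy to see that the totals are not the same across samples.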
Calculation of TPM
Genes       Sample 1   Sample 2   Sample 3
1 (2 kb)    20         24         60
2 (4 kb)    40         50         120
3 (1 kb)    10         16         30
4 (10 kb)   0          0          2
Step 1. Divide the reads of each gene by the length of each gene
Genes       Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
1 (2 kb)    10               12               30
2 (4 kb)    10               12.5             30
3 (1 kb)    10               16               30
4 (10 kb)   0                0                0.2
Total       30               40.5             90.2
Total reads per kb of gene for samples 1, 2 and 3: 3M, 4.05M and 9.02M
(millions of reads equated to a scale of tens of reads)
Step 2. Divide the values obtained after step 1 by the total reads per kb of the sample (3M, 4.05M and 9.02M)
Genes       Sample 1 (TPM)   Sample 2 (TPM)   Sample 3 (TPM)
1 (2 kb)    3.33             2.96             3.326
2 (4 kb)    3.33             3.09             3.326
3 (1 kb)    3.33             3.95             3.326
4 (10 kb)   0                0                0.02
Total       10               10               10
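A minimal Python sketch of the TPM calculation on the same read counts follows. The names `tpm_values`, `READ_COUNTS` and `GENE_LENGTHS_KB` are illustrative assumptions. The `per=10` default reproduces the slide's toy scale; with real data the per-sample scaling uses one million, so that the values of each sample sum to 1,000,000.

```python
GENE_LENGTHS_KB = [2, 4, 1, 10]  # genes 1 to 4
READ_COUNTS = {
    "Sample 1": [20, 40, 10, 0],
    "Sample 2": [24, 50, 16, 0],
    "Sample 3": [60, 120, 30, 2],
}

def tpm_values(counts, lengths_kb, per=10):
    """TPM per gene: scale by gene length first, then by the sample's total RPK."""
    rpk = [count / length for count, length in zip(counts, lengths_kb)]  # step 1
    scaling = sum(rpk) / per                   # e.g. total RPK of 30 -> 3
    return [value / scaling for value in rpk]  # step 2

for sample, counts in READ_COUNTS.items():
    values = tpm_values(counts, GENE_LENGTHS_KB)
    print(sample, [round(v, 3) for v in values], "total:", round(sum(values), 2))
```

Because the length correction happens before the depth correction, every sample ends up with the same total, which is what makes TPM values comparable as proportions across samples.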
RPKM vs TPM
Genes                    Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
1 (2 kb)                 1.43              1.33              1.42
2 (4 kb)                 1.43              1.39              1.42
3 (1 kb)                 1.43              1.78              1.42
4 (10 kb)                0                 0                 0.009
Total normalized reads   4.29              4.5               4.25
Genes                    Sample 1 (TPM)    Sample 2 (TPM)    Sample 3 (TPM)
1 (2 kb)                 3.33              2.96              3.326
2 (4 kb)                 3.33              3.09              3.326
3 (1 kb)                 3.33              3.95              3.326
4 (10 kb)                0                 0                 0.02
Total normalized reads   10                10                10
Sums of total normalized reads: the TPM columns add up to the same total in every sample, whereas the RPKM totals differ between samples.
TMM: Trimmed Mean of M Values
Attempts to correct for differences in RNA composition between samples
E.g. if certain genes are very highly expressed in one tissue but not another, there will be less "sequencing real estate" left for the less expressed genes in that tissue, and RPKM normalization (or similar) will give biased expression values for them compared to the other sample.
[Figure: RNA population 1 and RNA population 2]
Equal sequencing depth -> yellow and green will get lower RPKM in RNA population 1, although the expression levels are actually the same in populations 1 and 2
Robinson and Oshlack, Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
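The composition effect can be made concrete with a toy calculation. This sketch does not implement TMM itself; it only shows the bias that TMM is designed to correct. The gene names, the abundances and the extra "red" gene are hypothetical, all genes are assumed to have equal length, and reads are assumed to be drawn in proportion to RNA abundance.

```python
def counts_per_million(abundance):
    """Expected counts per million when reads are drawn in proportion to abundance."""
    total = sum(abundance.values())
    return {gene: 1_000_000 * amount / total for gene, amount in abundance.items()}

# Hypothetical abundances: "yellow" and "green" are expressed equally in both
# populations; "red" is an assumed gene that is highly expressed only in population 1.
population_1 = {"yellow": 10, "green": 10, "red": 80}
population_2 = {"yellow": 10, "green": 10}

cpm_1 = counts_per_million(population_1)
cpm_2 = counts_per_million(population_2)
for gene in ("yellow", "green"):
    print(gene, "population 1:", cpm_1[gene], "population 2:", cpm_2[gene])
```

Although yellow and green have identical abundance in both populations, their depth-normalized values are lower in population 1 because the highly expressed gene takes up most of the reads.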
Identification of Differentially expressed genes - I
(using Cufflinks)
Identification of differentially expressed genes
Input: quality filtered/trimmed RNA-Seq short reads
Reference genome available (Y/N)?
• Y: mapping to the reference (GMAP-GSNAP, TopHat, Bowtie, etc.)
• N: de novo transcriptome assembly (Trinity), then mapping and detection of DEGs (RSEM)
FPKM based strategy: calculate transcript abundances (Cufflinks), then detection of DEGs (cuffdiff2)
Count based strategy: generate count data (RSEM), then detection of DEGs (DESeq, edgeR, EBSeq)
Identification of differentially expressed genes: FPKM based strategy
1. Quality filtered/trimmed RNA-Seq short reads
2. Downloading the reference genome and GTF from the UCSC genome browser
3. Mapping to the reference genome (GMAP-GSNAP)
4. Calculate transcript abundances (Cufflinks)
5. Detection of DEGs (cuffdiff2)
Requirements
• For running gmap-gsnap: FASTA file of the genome and the reads file
• For running samtools: SAM file generated by gmap-gsnap
• For running cufflinks: BAM files of all samples and GTF file of the genome
Downloading the FASTA file and GTF from the UCSC genome
browser (https://genome.ucsc.edu)
Clicking on bosTau8.fa.gz downloads a file of 866.1 MB, which on extraction gives a file of 2.72 GB.
Genome mapping and alignment using GMAP-GSNAP
GMAP: Genomic Mapping and Alignment Program
• GMAP is a standalone program for mapping and aligning cDNA sequences to a genome.
• The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets.
• The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models.
• It initially used a hashing scheme but later adopted a much more efficient double lookup scheme.
Step 1. Command for indexing the genome:
gmap_build -d btau8 bosTau8.fa
The index files are created in the folder btau8.
Step 2. Mapping the reads to the genome
gsnap -d btau8 -t 4 -A sam control_R1.fastq > control_R1.sam
-d: genome database built in Step 1. -t: number of threads. -A sam: write the output in SAM format.
• The end product of the GSNAP aligner is a SAM file, which needs to be converted into a BAM file for further analysis in cufflinks.
• Repeat the same for the other replicates and samples by changing the input file name.
• A total of four SAM files are generated separately.
The BAM files generated can be analysed in two ways:
1. The BAM files can be used to generate a merged assembly of transcripts via cufflinks and cuffmerge. This merged assembly (i.e. merged.gtf) is used in cuffdiff to identify differentially expressed genes.
2. Cuffdiff can be used directly on the BAM files to identify differentially expressed genes.
Step 3. Converting SAM to BAM using samtools
./samtools view -bSh aln.sam > aln.bam
-b: output in the BAM format. -S: input is in the SAM format (detected automatically by samtools 1.x). -h: include the header in the output.
For the Control sample:
./samtools view -bSh control_R1.sam > control_R1.bam
For the Infected sample:
./samtools view -bSh infected_R1.sam > infected_R1.bam
Step 4. Sorting BAM using samtools
Command for sorting: ./samtools sort aln.bam aln.sorted
This is the legacy samtools 0.1.x syntax, which writes aln.sorted.bam; with samtools 1.x the equivalent is: samtools sort -o aln.sorted.bam aln.bam
Example:
For the Control sample:
./samtools sort control_R1.bam control_R1_sorted
For the Infected sample:
./samtools sort infected_R1.bam infected_R1_sorted
Step 5. (Option 1) Differential expression using cufflinks, cuffmerge and cuffdiff
Command for running cufflinks on a BAM file
For the Control sample:
cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN control_R1_sorted.bam
For the Infected sample:
cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN infected_R1_sorted.bam
These commands generate a transcripts.gtf file for each replicate. These files are then used in cuffmerge to generate a merged assembly, which in turn is used in cuffdiff to identify differentially expressed genes.
Command for running cuffmerge
cuffmerge -g btau8refflat.gtf -s bosTau8.fa -p 8 assemblies.txt
assemblies.txt is the file with the list of all the GTFs.
This generates a merged.gtf in the merged_asm folder. This file is used in the next cuffdiff command.
Command for running cuffdiff
cuffdiff merged_asm/merged.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam
Replicates of the same condition are separated by commas; the two conditions are separated by a space.
This command generates many files, of which gene_exp.diff is the one of interest.
Step 5. (Option 2) Differential expression using CuffDiff directly
CuffDiff computes differentially expressed genes in the set. For computing differential expression, at least two samples (infected and control) are required. CuffDiff should always be run on replicates, i.e. N infected vs N control.
Command:
cuffdiff -p <int> -N transcripts.gtf <condition 1 BAM files> <condition 2 BAM files>
-p: number of threads (--num-threads <int>). -N: upper-quartile normalization (--upper-quartile-norm).
Running cuffdiff for our BAM files
cuffdiff -p 3 -N -o cuffdiff_out btau8refflat.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam
gene_exp.diff
Columns of the output file, in order:
• A unique identifier describing the object (gene, transcript, CDS, primary transcript)
• Gene ID
• Gene name
• Genomic coordinates for easy browsing to the genes or transcripts being tested
• Sample 1 (here: Control)
• Sample 2 (here: Infected)
• Test status: OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing
• FPKM in Sample 1
• FPKM in Sample 2
• The (base 2) log of the fold change y/x
• The value of the test statistic used to compute significance of the observed change in FPKM
• The uncorrected p-value of the test statistic
Log2 fold change = Log2(FPKM of infected / FPKM of control)
= Log2(0.576748/3.92513) = -2.76673
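The fold-change arithmetic can be reproduced with Python's standard library. The function name `log2_fold_change` is an illustrative assumption; the two FPKM values are the ones quoted above. A control FPKM of zero would need separate handling, which this sketch does not attempt.

```python
import math

def log2_fold_change(fpkm_infected, fpkm_control):
    """Base-2 log of the ratio infected/control; negative means lower in infected."""
    return math.log2(fpkm_infected / fpkm_control)

value = log2_fold_change(0.576748, 3.92513)
print(round(value, 3))
```

A value near -2.77 means the gene's FPKM in the infected sample is roughly one seventh of its FPKM in the control.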

Daily Lesson Plan in Mathematics Quarter 4
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture hons
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 

Cufflinks

  • 1. Computational Biology and Genomics Facility, Indian Veterinary Research Institute. Basics of RNA-seq data analysis: revision
  • 2. Fast sequence alignment strategies
       Drawback of dynamic programming: very slow, hence the need for faster alignment strategies.
       • Using hash table based indexing: seed-extend paradigm, space allowance
       • Using suffix/prefix tree based indexing: suffix array, Burrows-Wheeler transformation and FM index
       • Merge sorting
       Hash table based indexing
       Strategy: making a dictionary (index). An example of a 4-nt index:
         AAAA: 235, 783, 10083, ...
         AAAC: 132, 236, 832, 932, ...
         TTTT: 327, 1328, 5523, ...
       Algorithms
       • Hashing reads: Eland, MAQ, Mosaik, ...
       • Hashing the reference genome: BFAST, Mosaik, SOAP, ...
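The dictionary idea on slide 2 can be sketched in a few lines of Python. This is only an illustration of the seed lookup, assuming a toy reference string and 1-based positions; the function names `build_kmer_index` and `seed_hits` are hypothetical and are not taken from any of the aligners listed.

```python
from collections import defaultdict


def build_kmer_index(reference, k=4):
    """Map every k-mer of the reference to the 1-based positions where it starts."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start + 1)
    return dict(index)


def seed_hits(read, index, k=4):
    """Collect candidate read start positions implied by each k-mer seed of the read."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for position in index.get(read[offset:offset + k], []):
            # A seed found at `position` implies the read starts `offset` bases earlier.
            candidates.add(position - offset)
    return sorted(candidates)


reference = "ACGTACGTTTTTACGA"  # toy reference, not real data
index = build_kmer_index(reference)
acgt_positions = index["ACGT"]
candidate_starts = seed_hits("CGTTTT", index)
print(acgt_positions, candidate_starts)
```

A real aligner would then extend each candidate with a more sensitive comparison; the lookup above only narrows down where to look.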
  • 3. Burrows-Wheeler transformation and FM index
  • 4. Criteria for choosing an aligner
       • Global or local
       • Aligning short sequences to long sequences, such as short reads to a reference
       • Aligning long sequences to long sequences, such as long reads or contigs to a reference
       • Handles small gaps (insertions and deletions)
       • Handles large gaps (introns)
       • Handles split alignments (chimera)
       • Speed and ease of use
  • 5. Short read aligners
         Aligner      Purpose
         Bowtie       Fast
         BWA          Small gaps (indels)
         GSNAP        Large gaps (introns)
         Bowtie 2     Takes care of gaps
       Long sequence aligners
         Aligner      Purpose
         BLAST        Many reference genomes
         BLAT         Large gaps (introns)
         BWA          Small gaps (indels)
         Exonerate    Ease of use
         GMAP         Large gaps (introns)
         MUMmer       Align two genomes
  • 6. RNA-seq alignment strategy
       • Exon-first approach: exon read mapping, then spliced read mapping
       • Seed-extend approach: seed matching with k-mer seeds, then seed extension
  • 7. Exon-first approach
       • Available tools: MapSplice, SpliceMap, TopHat
       • Two-step procedure
           • Map reads continuously using unspliced read aligners
           • Unmapped reads are split into shorter segments and aligned independently
       • Efficient when not too many reads fall across junctions
       • Second step is computationally intensive
       • Can miss reads across exon-intron junctions
  • 8. Seed-extend approach
       • Representative algorithms
           • Genomic short read nucleotide alignment program (GSNAP)
           • Computing accurate spliced alignments (QPALMA)
       • Steps
           • Break reads into short seeds
           • Candidate regions are combined and examined with a more sensitive method (such as Smith-Waterman)
       • Increased sensitivity
       • One arm may not provide enough specificity for alignment
  • 9. TopHat
       A software package that identifies splice sites ab initio by large-scale mapping of RNA-Seq reads.
       • A splice junction mapping algorithm must be able to identify reads that may have only a few bases on one side of a junction, or else that junction will be missed
       Workflow
       1. Map reads to the whole genome with Bowtie
       2. Collect initially unmappable reads
       3. Assemble a consensus of covered regions
       4. Generate possible splices between neighbouring exons
       5. Build a seed table index from the unmappable reads
       6. Map reads to possible splices via seed-and-extend
  • 10. Normalization of read count
       R/FPKM (Mortazavi et al., 2008): reads/fragments per kilobase of exon per million mappable reads
       • Corrects for: differences in sequencing depth and transcript length
       • Aiming to: compare a gene across samples and different genes within samples
       TMM (Robinson and Oshlack, 2010): trimmed mean of M values
       • Corrects for: differences in transcript pool composition; extreme outliers
       • Aiming to: provide better across-sample comparability
  • 11. Normalization of read count
       Limma voom (logCPM) (Law et al., 2013): log counts per million
       • Aiming to: stabilize variance; removes the dependence of the variance on the mean
       TPM (Li et al., 2010; Wagner et al., 2012): transcripts per million
       • Corrects for: transcript length distribution in the RNA pool
       • Aiming to: provide better across-sample comparability
  • 12. R/FPKM (Mortazavi et al., 2008)
       • FPKM is used for paired-end reads and RPKM for single-end reads
       • Fragment means fragment of DNA, so the two reads that comprise a paired-end read count as one.
       • Per kilobase of exon means the counts of fragments are then normalized by dividing by the total length of all exons in the gene.
           • This bit of magic makes it possible to compare Gene A to Gene B even if they are of different lengths.
       • Per million reads means this value is then normalized against the library size.
           • This bit of magic makes it possible to compare Gene A in Sample 1 to Gene A in Sample 2.
  • 13. R/FPKM (Mortazavi et al., 2008): a quantification measurement for gene expression
       R = C / (L × N), with L in kb and N in millions
       • R: expression level of the gene
       • L: length of the gene
       • N: depth of the sequencing
       • C: total number of reads that fall into the gene region
       Example: the total exon size of a gene is 3,000 nt. Calculate the expression level of this gene in RPKM for an RNA-seq experiment that contained 50 million mappable reads, with 600 reads falling into the exon regions of this gene.
       R = 600 / (3.0 × 50) = 4.00
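The worked RPKM example on slide 13 can be checked with a short Python function. This is a minimal sketch; the function name `rpkm` and its argument names are chosen here for illustration, with the gene length given in nucleotides and the depth as a raw read count.

```python
def rpkm(read_count, exon_length_nt, mappable_reads):
    """Reads per kilobase of exon per million mappable reads: R = C / (L x N)."""
    length_kb = exon_length_nt / 1_000            # L in kilobases
    depth_millions = mappable_reads / 1_000_000   # N in millions of reads
    return read_count / (length_kb * depth_millions)


# Slide example: 600 reads on 3,000 nt of exon, 50 million mappable reads.
example_value = rpkm(600, 3_000, 50_000_000)
print(example_value)
```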
  • 14. Calculation of FPKM/RPKM
       Read counts (millions of reads equated to a scale of tens of reads):
         Genes       Sample 1   Sample 2   Sample 3
         1 (2kb)     20         24         60
         2 (4kb)     40         50         120
         3 (1kb)     10         16         30
         4 (10kb)    0          0          2
         Total       70         90         212
       Total reads for samples 1, 2 and 3: 7M, 9M and 21.2M.
       Step 1. Divide the reads of each gene by the total reads of the sample (in millions) to get reads per million (RPM):
         Genes       Sample 1 (RPM)   Sample 2 (RPM)   Sample 3 (RPM)
         1 (2kb)     2.86             2.67             2.83
         2 (4kb)     5.71             5.56             5.66
         3 (1kb)     1.43             1.78             1.42
         4 (10kb)    0                0                0.09
  • 15. Fragments/reads per kilobase per million reads: reads are scaled for both depth and length
       Step 2. Divide the values obtained after step 1 by the gene lengths (in kb):
         Genes                     Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
         1 (2kb)                   1.43              1.33              1.42
         2 (4kb)                   1.43              1.39              1.42
         3 (1kb)                   1.43              1.78              1.42
         4 (10kb)                  0                 0                 0.009
         Total normalized reads    4.29              4.5               4.25
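The two RPKM steps can be reproduced with the counts from slide 14. The sketch below assumes the slide's scale, where the library sizes are 7, 9 and 21.2 million reads; the dictionary layout and the name `rpkm_table` are illustrative only.

```python
# Read counts per gene for samples 1, 2 and 3 (values from slide 14).
counts = {
    "gene1": [20, 24, 60],
    "gene2": [40, 50, 120],
    "gene3": [10, 16, 30],
    "gene4": [0, 0, 2],
}
lengths_kb = {"gene1": 2, "gene2": 4, "gene3": 1, "gene4": 10}
depth_millions = [7.0, 9.0, 21.2]


def rpkm_table(counts, lengths_kb, depth_millions):
    """Step 1: divide by depth in millions (RPM). Step 2: divide by gene length in kb."""
    table = {}
    for gene, row in counts.items():
        rpm = [count / depth for count, depth in zip(row, depth_millions)]
        table[gene] = [value / lengths_kb[gene] for value in rpm]
    return table


rpkm_values = rpkm_table(counts, lengths_kb, depth_millions)
column_totals = [sum(column) for column in zip(*rpkm_values.values())]
for gene, row in rpkm_values.items():
    print(gene, [round(value, 3) for value in row])
print("totals", [round(total, 2) for total in column_totals])
```

The column totals differ from sample to sample, which is the property the later RPKM vs TPM slide draws attention to.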
  • 16. Calculation of TPM
       Read counts (millions of reads equated to a scale of tens of reads):
         Genes       Sample 1   Sample 2   Sample 3
         1 (2kb)     20         24         60
         2 (4kb)     40         50         120
         3 (1kb)     10         16         30
         4 (10kb)    0          0          2
       Step 1. Divide the reads of each gene by the length of the gene (in kb) to get reads per kilobase (RPK):
         Genes       Sample 1 (RPK)   Sample 2 (RPK)   Sample 3 (RPK)
         1 (2kb)     10               12               30
         2 (4kb)     10               12.5             30
         3 (1kb)     10               16               30
         4 (10kb)    0                0                0.2
         Total       30               40.5             90.2
       Total reads per kb of gene for samples 1, 2 and 3: 3M, 4.05M and 9.02M.
  • 17. Calculation of TPM
       Step 2. Divide the RPK values obtained after step 1 by the total RPK of the sample (in millions: 3, 4.05 and 9.02):
         Genes       Sample 1 (TPM)   Sample 2 (TPM)   Sample 3 (TPM)
         1 (2kb)     3.33             2.96             3.326
         2 (4kb)     3.33             3.09             3.326
         3 (1kb)     3.33             3.95             3.326
         4 (10kb)    0                0                0.02
         Total       10               10               10
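The TPM calculation can be sketched the same way. The `scale=10` default below is an assumption made only to reproduce the slide's "tens of reads" scale; with real data the scale is one million, so that each sample sums to 1,000,000. The name `tpm_table` is illustrative.

```python
counts = {
    "gene1": [20, 24, 60],
    "gene2": [40, 50, 120],
    "gene3": [10, 16, 30],
    "gene4": [0, 0, 2],
}
lengths_kb = {"gene1": 2, "gene2": 4, "gene3": 1, "gene4": 10}


def tpm_table(counts, lengths_kb, scale=10):
    """Step 1: reads per kilobase (RPK). Step 2: divide by the sample's RPK total."""
    rpk = {gene: [count / lengths_kb[gene] for count in row]
           for gene, row in counts.items()}
    rpk_totals = [sum(column) for column in zip(*rpk.values())]
    # scale=10 mirrors the slide; use scale=1_000_000 for real transcripts per million.
    return {gene: [value / total * scale for value, total in zip(row, rpk_totals)]
            for gene, row in rpk.items()}


tpm_values = tpm_table(counts, lengths_kb)
tpm_totals = [sum(column) for column in zip(*tpm_values.values())]
for gene, row in tpm_values.items():
    print(gene, [round(value, 3) for value in row])
print("totals", [round(total, 2) for total in tpm_totals])
```

Because length normalization happens before depth normalization, every sample ends up with the same total.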
  • 18. RPKM vs TPM: sums of total normalized reads
         Genes                     Sample 1 (RPKM)   Sample 2 (RPKM)   Sample 3 (RPKM)
         1 (2kb)                   1.43              1.33              1.42
         2 (4kb)                   1.43              1.39              1.42
         3 (1kb)                   1.43              1.78              1.42
         4 (10kb)                  0                 0                 0.009
         Total normalized reads    4.29              4.5               4.25

         Genes                     Sample 1 (TPM)    Sample 2 (TPM)    Sample 3 (TPM)
         1 (2kb)                   3.33              2.96              3.326
         2 (4kb)                   3.33              3.09              3.326
         3 (1kb)                   3.33              3.95              3.326
         4 (10kb)                  0                 0                 0.02
         Total normalized reads    10                10                10
       The RPKM totals differ between samples, whereas the TPM totals are identical in every sample.
  • 19. TMM: trimmed mean of M values
       Attempts to correct for differences in RNA composition between samples.
       E.g., if certain genes are very highly expressed in one tissue but not another, there will be less "sequencing real estate" left for the less expressed genes in that tissue, and RPKM normalization (or similar) will give biased expression values for them compared to the other sample.
       Figure (RNA population 1 vs RNA population 2): with equal sequencing depth, yellow and green will get lower RPKM in RNA population 1 although the expression levels are actually the same in populations 1 and 2.
       Robinson and Oshlack, Genome Biology 2010, 11:R25, http://genomebiology.com/2010/11/3/R25
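The composition effect described on slide 19 can be illustrated numerically. The sketch below uses made-up abundances and a deliberately simplified, unweighted trimmed mean of log ratios; it is not the full TMM procedure of Robinson and Oshlack (which also trims on overall abundance and weights each gene), and the gene names and function names are hypothetical.

```python
import math


def expected_counts(abundance, depth):
    """Share a fixed number of reads among genes in proportion to their abundance."""
    total = sum(abundance.values())
    return {gene: depth * value / total for gene, value in abundance.items()}


def simplified_tmm_factor(sample, reference, trim_percent=30):
    """Unweighted trimmed mean of M values (log2 ratios of read proportions)."""
    n_sample, n_reference = sum(sample.values()), sum(reference.values())
    m_values = sorted(
        math.log2((sample[g] / n_sample) / (reference[g] / n_reference))
        for g in sample
        if sample[g] > 0 and reference.get(g, 0) > 0
    )
    cut = len(m_values) * trim_percent // 100   # genes dropped at each end
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))


# Eight genes with identical expression in both populations, plus two genes
# that are ten times higher in population 1 (toy values, not real data).
population_2 = {f"g{i}": 100 for i in range(1, 9)} | {"h1": 100, "h2": 100}
population_1 = {f"g{i}": 100 for i in range(1, 9)} | {"h1": 1000, "h2": 1000}

depth = 1_000_000  # equal sequencing depth for both populations
counts_1 = expected_counts(population_1, depth)
counts_2 = expected_counts(population_2, depth)

rpm_1_naive = counts_1["g1"] / (depth / 1e6)
rpm_2 = counts_2["g1"] / (depth / 1e6)
factor = simplified_tmm_factor(counts_1, counts_2)
rpm_1_adjusted = counts_1["g1"] / (depth * factor / 1e6)
print(round(rpm_1_naive), round(rpm_2), round(factor, 4), round(rpm_1_adjusted))
```

Gene g1 is equally expressed in both populations, yet its naive per-million value is lower in population 1; scaling the library size by the trimmed-mean factor brings the two values back in line.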
  • 20. Identification of differentially expressed genes - I (using Cufflinks)
  • 21. Identification of differentially expressed genes
       Input: quality filtered/trimmed RNA-Seq short reads. Is a reference genome available (Y/N)?
       • Y: Mapping to the reference (GMAP-GSNAP, TopHat, Bowtie, etc.), followed by one of:
           • FPKM based strategy: calculate transcript abundances (Cufflinks), then detection of DEGs (cuffdiff2)
           • Count based strategy: generate count data (RSEM), then detection of DEGs (DESeq, edgeR, EBSeq)
       • N: De novo transcriptome assembly (Trinity), then mapping and detection of DEGs (RSEM)
  • 22. Identification of differentially expressed genes: FPKM based strategy
       Quality filtered/trimmed RNA-Seq short reads, then mapping to the reference genome (GMAP-GSNAP), then calculation of transcript abundances (Cufflinks), then detection of DEGs (cuffdiff2).
       The reference genome and GTF are downloaded from the UCSC genome browser.
       Requirements
       • For running gmap-gsnap: FASTA file of the genome and the reads file
       • For running samtools: SAM file generated by gmap-gsnap
       • For running cufflinks: BAM files of all samples and GTF file of the genome
  • 23. Downloading the FASTA file and GTF from the UCSC genome browser (https://genome.ucsc.edu)
  • 26. Clicking on bosTau8.fa.gz downloads a compressed file of 866.1 MB, which gives a file of 2.72 GB when decompressed.
  • 29. Genome mapping and alignment using GMAP-GSNAP
       GMAP: Genomic Mapping and Alignment Program
       • GMAP is a standalone program for mapping and aligning cDNA sequences to a genome.
       • The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets.
       • The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models.
       • It initially used a hashing scheme but later used a much more efficient double lookup scheme.
       Step 1. Command for indexing the genome:
         gmap_build -d btau8 bosTau8.fa
  • 30. Step 2. Mapping the reads to the genome
       The index files created by gmap_build are in the folder btau8.
         gsnap -d btau8 -t 4 control_R1.fastq > control_R1.sam
  • 31. The end product of the GSNAP aligner is a SAM file, which needs to be converted into a BAM file for further analysis in cufflinks.
       • Repeat the same for the other replicate by changing the input file name.
       • A total of four SAM files are generated separately.
       The BAM files generated can be analysed in two ways:
       1. The BAM files can be used to generate a merged assembly of transcripts via cufflinks and cuffmerge. This merged assembly (i.e. merged.gtf) is used in cuffdiff to identify differentially expressed genes.
       2. Cuffdiff can be used directly on the BAM files to identify differentially expressed genes.

  • 32. Step 3. Converting SAM to BAM using samtools
         ./samtools view -bSh aln.sam > aln.bam
       • -b: output in the BAM format
       • -S: input in the SAM format (capital S; lowercase -s is the subsampling option, and recent samtools versions detect the input format automatically)
       • -h: include the header in the output
       For the control sample:
         ./samtools view -bSh control_R1.sam > control_R1.bam
       For the infected sample:
         ./samtools view -bSh infected_R1.sam > infected_R1.bam
  • 33. Step 4. Sorting BAM using samtools
       Command for sorting:
         ./samtools sort aln.bam aln.sorted
       This is the legacy syntax, which writes aln.sorted.bam; samtools 1.3 and later expect samtools sort -o aln.sorted.bam aln.bam instead.
       For the control sample:
         ./samtools sort control_R1.bam control_R1_sorted
       For the infected sample:
         ./samtools sort infected_R1.bam infected_R1_sorted
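Steps 2 to 4 are repeated for every sample, so the command lines can be generated rather than typed. The Python sketch below only builds the command strings and does not run anything. The sample names are inferred from the BAM files used later in the cuffdiff step, `-bSh` assumes the capital-S SAM input flag of older samtools, and the sort command keeps the slides' legacy syntax; `per_sample_commands` is a hypothetical helper name.

```python
def per_sample_commands(sample, genome_db="btau8", threads=4, samtools="./samtools"):
    """Return the gsnap, SAM-to-BAM and sort command lines for one sample."""
    return [
        # Step 2: map the reads to the indexed genome.
        f"gsnap -d {genome_db} -t {threads} {sample}.fastq > {sample}.sam",
        # Step 3: convert SAM to BAM, keeping the header.
        f"{samtools} view -bSh {sample}.sam > {sample}.bam",
        # Step 4: sort the BAM (legacy syntax; writes <sample>_sorted.bam).
        f"{samtools} sort {sample}.bam {sample}_sorted",
    ]


samples = ["control_R1", "control_R2", "infected_R1", "infected_R2"]
commands = [command for sample in samples for command in per_sample_commands(sample)]
for command in commands:
    print(command)
```

Printing the commands first makes it easy to review them before pasting them into a shell script.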
  • 34. Step 5 (Option 1). Differential expression using cufflinks, cuffmerge and cuffdiff
       Command for running cufflinks on a BAM file. For the control sample:
         cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN control_R1_sorted.bam
  • 35. For the infected sample:
         cufflinks -G btau8refflat.gtf -g btau8refflat.gtf -b bosTau8.fa -u -L CN infected_R1_sorted.bam
       These commands generate a transcripts.gtf file for each replicate; these files are then passed to cuffmerge to generate a merged assembly. The merged assembly is then used in cuffdiff to identify differentially expressed genes.
  • 36. Command for running cuffmerge:
         cuffmerge -g btau8refflat.gtf -s bosTau8.fa -p 8 assemblies.txt
       assemblies.txt is the file with the list of all the GTFs. This generates a merged.gtf in the merged_asm folder, which is used in the next cuffdiff command.
       Command for running cuffdiff:
         cuffdiff merged.gtf control_R1_sorted.bam control_R2_sorted.bam infected_R1_sorted.bam infected_R2_sorted.bam
       This command generates many files, of which gene_exp.diff is the one of interest.
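The assemblies.txt file passed to cuffmerge is a plain list of GTF paths, one per line. The sketch below writes such a file with Python. The per-sample directories (`cufflinks_control_R1/` and so on) are hypothetical: they assume each cufflinks run was given its own output directory, which the commands on the slides do not show. The demo writes into a temporary directory so that nothing in the working folder is touched.

```python
import tempfile
from pathlib import Path


def write_assemblies_list(gtf_paths, list_path):
    """Write one GTF path per line, the layout cuffmerge expects for its input list."""
    list_path = Path(list_path)
    list_path.write_text("".join(f"{path}\n" for path in gtf_paths))
    return list_path


samples = ["control_R1", "control_R2", "infected_R1", "infected_R2"]
# Hypothetical locations of the per-sample transcripts.gtf files.
gtf_paths = [f"cufflinks_{sample}/transcripts.gtf" for sample in samples]

with tempfile.TemporaryDirectory() as workdir:
    listing = write_assemblies_list(gtf_paths, Path(workdir) / "assemblies.txt")
    written_lines = listing.read_text().splitlines()

print(written_lines)
```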
  • 37. Step 5 (Option 2). Differential expression using CuffDiff directly
       CuffDiff computes differentially expressed genes in the set. Computing differential expression requires at least two samples, infected and control. CuffDiff should always be run on replicates, i.e. N infected vs N control.
       Command:
         cuffdiff -p -N transcripts.gtf
       • -p: num-threads <int>
       • -N
       Running cuffdiff for our BAM files (replicates of one condition are separated by commas, conditions by spaces):
         cuffdiff -p 3 -N btau8refflat.gtf control_R1_sorted.bam,control_R2_sorted.bam infected_R1_sorted.bam,infected_R2_sorted.bam -o cuffdiff_out
  • 38. gene_exp.diff
       Columns of the output file:
       • test_id: a unique identifier describing the object (gene, transcript, CDS, primary transcript)
       • gene_id: gene ID
       • gene: gene name
       • locus: genomic coordinates for easy browsing to the genes or transcripts being tested
       • sample_1: Control
       • sample_2: Infected
       • status: OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing
       • value_1: FPKM in Sample 1
       • value_2: FPKM in Sample 2
       • log2(fold_change): the (base 2) log of the fold change y/x
       • test_stat: the value of the test statistic used to compute significance of the observed change in FPKM
       • p_value: the uncorrected p-value of the test statistic
       Example: log2 fold change = log2(FPKM of infected / FPKM of control) = log2(0.576748 / 3.92513) = -2.76673
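The log2 fold change on the last slide can be recomputed directly from the two FPKM values. This is a minimal sketch that assumes both values are greater than zero; the function name `log2_fold_change` is illustrative.

```python
import math


def log2_fold_change(fpkm_control, fpkm_infected):
    """Base-2 log of infected over control; both FPKM values must be positive."""
    return math.log2(fpkm_infected / fpkm_control)


# FPKM values from the slide: control 3.92513, infected 0.576748.
example_log2fc = log2_fold_change(3.92513, 0.576748)
print(round(example_log2fc, 5))
```

A negative value means the gene is expressed at a lower level in the infected sample than in the control.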