Computational Biology and Genomics Facility, Indian Veterinary Research Institute
Overview of the previous lecture
Current Leading Technologies
Instrument | Coverage needed | Throughput/run | Human genomes/run | Cost per run** | Total cost | $/Genome
HiSeq 2500 | 30x | 1 Tb | 11 | 29.000 | 29,000 | 2,536
HiSeq X 10 | 30x | 1.8 Tb | 16 | 15.700 | 12,000 | 950
PacBio RSII | 54x | 1 Gb | 0.000 | 212 | 34,374 | 34,374
PacBio Sequel | 50x | 5-10 Gb | 0.06 | 700 | 10,300 | 10,500
Long-Range Sequencers Comparison
Platform | Average read length | Advantages | Limitation | Material recommended
Illumina MiSeq | 2 x 300 bp | Accurate | Short length | 100-200 ng
Illumina Moleculo | 5 kbp | Accurate | Coverage may fluctuate | 10 ug
Pacific Biosciences | 15 kbp | Long reads | Relatively expensive | 10-100 ug
Oxford Nanopore | 5 kbp | Low ownership cost | High error rate | 1 ug
Basics of RNA - seq data analysis
• Sequence DNA
• De novo sequencing
• Reference based re-sequencing - SNP, CNV and indels
• Metagenomics - Identify who is there in a mixture of microbes
• Sequence RNA
• RNA-Seq (Transcriptome wide sequencing)
• miRNA-Seq
• Novel ncRNAs
• Study Protein-DNA/RNA interactions
• ChIP-Seq (TFs)
• CLIP - Seq (For RNA binding proteins)
• Epigenetics
• DNA methylation
• Histone modification (ChIP-seq)
• Nucleosome positioning
• Chromosome looping
Applications of NGS
Certain important key words to be remembered
from the previous chapter
• Sequencing
• both DNA and RNA (with modified protocol)
• Short reads
• 35, 50, 75, and 100-bp (Solexa and SOLiD)
• 400-bp (454)
• Ultra-high throughput
• 1 to 1.5 billion reads (Solexa and SOLiD)
• 2- 4 million reads (454)
• The transcriptome : Complete set of transcripts in a cell, both in terms
of type and quantity.
• Transcriptome analysis -
• in understanding the pattern of gene expression to address basic
biological questions.
• greater insights into biological pathways and molecular mechanisms
that regulate cell fate, development, and disease progression.
What is a transcriptome?
Transcriptome can be studied through RNA-seq/microarray
RNA sequencing
Aim of RNA-seq experiment
• To quantify RNA abundance 

• To determine the transcriptional structure
of genes: start sites, 5’ and 3’ ends,
splicing patterns 

• To quantify the changing expression levels
of each transcript during development and
under different conditions 

• To identify variants on the transcripts 

Basic work flow of Next generation sequencing
Wang et al., 2009
What can we get out of RNA-seq experiment?
[Figure: outputs of an RNA-seq experiment]
• Gene expression
• Alternative splicing
• Transcript variation
• Non-coding RNAs
• Differential expression
• Non-synonymous SNPs
• Synonymous SNPs
• SNPs in 3'-UTRs
• Allele-specific expression
• Protein changes
• RNA binding proteins
• microRNA binding sites
Microarray
Normal RNA
Cy3 Labelling
Reverse transcription
Experimental RNA
Cy5 labelling
Hybridize Wash & Scan
RNA - Seq vs Microarray
“RNA-seq … is expected to revolutionise the manner in
which eukaryotic transcriptomes are analysed”
Wang et al. Nat Rev Gen, 2009
Limitations of microarray platforms
• High background levels owing to cross-hybridization;
• Limited dynamic range of detection owing to both background and
saturation of signals.
• Reliance upon existing knowledge about genome sequence;
RNA - seq has higher dynamic range
For microarrays:
• Low signal end: high background noise
• High signal end: signals will be saturated
With proper depth, NGS can solve this problem well
Comparison of Microarray with RNA - seq
Experimental considerations
Sample preparation
• Amount requirement
• 100 ug total RNA (several Wold’s studies)
• 2 ug total RNA (Center for Medical Genomics, SOLiD)
• 50 ng total RNA (collaboration with Pourmand, SOLiD)
• Single cell (Tang et al. Nat. Methods, 2009)
(when RNA is limiting, approaches to amplify small quantities of RNA exist)
• rRNA removal
• rRNAs are highly abundant (>90% of total RNA)
Solutions:
• rRNA depletion kits
• Poly-A selection
• Using enzymes that selectively degrade uncapped RNA
Sample preparation
• Reverse transcription (RNA -> cDNA)
• oligo dT
• Pros: Focuses on polyadenylated transcripts, so the library is cleaner
• Cons: Bias towards the 3’-end of transcripts
• Random primers
• Pros: Even coverage; can be used to study non-polyadenylated transcripts
• Cons: higher proportion of rRNA
Single end reads vs paired ends reads
[Figure: poly-A mRNA is converted to cDNA, the cDNA is fragmented and adapters are ligated, and each fragment is sequenced from both ends as reads R1 and R2]
R1 will run in the same direction of the reference
R2 will run in the opposite direction of the reference
Statistical considerations
Number of conditions?
• Dynamics?
• Dose-effect?
• Tissue-specificity?
Number of replicates?
• Biological variation, …
• Statistical power, …
Depth of sequencing?
• Gene expression
• Alternative splicing
• Allele specific expression
Basics of RNA - seq data analysis - II
Data processing
Data analysis workflow
Primary analysis
• Image acquisition / semiconductor based detection
• Base calling
• Quality metrics
Secondary analysis
• Sequence alignment
• Sequence stats
• Consensus calling
• Sequence assembly
Tertiary analysis
• Application specific analysis
• DNA-Seq
• Re-sequencing
• De novo sequencing
• ChIP sequencing
• RNA-Seq
• Epigenetics
Sequence alignment
▪ Sequence alignment is a way of arranging the sequences of DNA, RNA,
or protein to identify regions of similarity
▪ Aligned sequences of nucleotide or amino acid residues are typically
represented as rows within a matrix.
▪ Gaps are inserted between the residues so that identical or similar
characters are aligned in successive columns.
Purpose of sequence alignment
1. To find whether two (or more) genes or proteins are evolutionarily
related to each other
2. To observe patterns of conservation (or variability).
3. To find structurally or functionally similar regions within proteins, i.e. to
find the common motifs present in both sequences.
4. To find out which sequences from the database are similar to the
sequence at hand
1. Although identity, similarity and homology are often used interchangeably,
they have quite different meanings.
2. Sequence identity refers to the occurrence of exactly the same
nucleotide or amino acid in the same position in aligned sequences.
3. The term ‘sequence homology’ is the most important (and the most
abused) of the three.
• When we say that sequence A has high homology to sequence B,
then we are making two distinct claims:
• not only are we saying that sequences A and B look much the
same, but also that all of their ancestors also looked the same,
going all the way back to a common ancestor.
Identity vs Similarity vs Homology
 | Sequence identity | Sequence similarity | Sequence homology
Definition | Proportion of identical residues between two sequences. | Proportion of similar residues between two sequences. Two residues are similar if their substitution score is higher than 0. | Sequences derived from a common ancestor
Expressed as | % identity | % similarity | Yes or No
Rule-of-thumb: If two sequences are more than 100 amino acids long
(or 100 nucleotides long) they are considered homologues if 25% of the
amino acids are identical (70% of nucleotide for DNA).
Twilight zone = protein sequence similarity between ~0-20%
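The "% identity" entry can be illustrated with a minimal sketch. It assumes the two sequences are already aligned (equal-length rows with "-" for gaps) and that the denominator is the number of alignment columns; other conventions exist, and the function name `percent_identity` is only illustrative.

```python
def percent_identity(row_a, row_b):
    """Percent identity over the columns of a pairwise alignment.

    Assumption: the denominator is the number of alignment columns;
    some tools divide by the length of the shorter sequence instead.
    """
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    # A column counts as identical only if both residues match and neither is a gap.
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return 100.0 * identical / len(row_a)


print(round(percent_identity("ATGGCGT", "ATG-AGT"), 1))
```

Here five of the seven columns are identical, so the value lies a little above 70%.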
Global alignment
• assumes that the two sequences are basically similar over the entire
length of one another.
• forces to match the sequences from end to end, even though parts of the
alignment are not very convincingly matching.
• most suitable when the two sequences are of similar length and are with a
significant degree of similarity throughout.
Computational approaches
• Global alignment
• Local alignment
Local alignment
• Identifies segments of the two sequences that match well with no
attempt to force the entire sequences into alignment
• Parts that appear to have good similarity, according to some criterion
are aligned.
• Suitable when comparing substantially different sequences, which
possibly differ significantly in length, and have only short patches of
similarity
Given two sequences, we need a number to associate with each possible
alignment (i.e. the alignment score = goodness of alignment).
The scoring scheme is a set of rules which assigns the alignment
score to any given alignment of two sequences.
• The scoring scheme is residue based: it consists of residue substitution
scores (i.e. score for each possible residue alignment), plus penalties for
gaps.
• The alignment score is the sum of substitution scores and gap penalties.
Substitution scores are given by :
For DNA : Substitution matrix for DNA (purine/purine or purine/pyrimidine
substitutions)
For proteins : Substitution matrix based on Polarity, Size, Charge or
Hydrophobicity
Evolutionary distance matrices :- PAM and BLOSUM for
protein sequences
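As a minimal sketch of "alignment score = sum of substitution scores and gap penalties", the function below scores an alignment that is already given. It assumes a simple match/mismatch scheme instead of a full substitution matrix; the name `alignment_score` and the default values (taken from the worked examples in this deck) are illustrative.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores for two aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap        # gap penalty
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


print(alignment_score("ATGC", "-TGA"))  # -2 + 1 + 1 - 1
```

For protein sequences the match/mismatch branch would be replaced by a lookup in a PAM or BLOSUM matrix.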
Types of alignment
Pairwise alignment Multiple Sequence alignment
Can be Global or Local
• Dot Matrix method
• The dynamic programming method
• Needleman and Wunsch
• Smith and Waterman
• Heuristic methods - FASTA ; BLAST
Methods of pairwise alignment
• It is a visual graphical representation of similarities between two
sequences.
• Each axis represents one of the two sequences to be compared.
• In the dot matrix method when two sequences are similar over their entire
length a line will extend from one corner of the dot plot to the diagonally
opposite corner.
• If two sequences share only patches of similarity then it will be revealed
by diagonal stretches.
Dot Matrix method
Interpretation of Dot Matrix
• Regions of similarity appear as diagonal runs of dots.
• Reverse diagonals (perpendicular to diagonal) indicate inversions.
• Reverse crossing diagonals (Xs) indicate palindromes.
Limitation:-
• The dot matrix computer programs do not show an actual alignment.
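A dot matrix can be sketched in a few lines. This is an illustrative version that places a dot wherever two residues are identical; the `window` parameter (compare short words instead of single residues to reduce noise) and the function names are assumptions, not part of the slides.

```python
def dot_matrix(seq_a, seq_b, window=1):
    """Boolean grid: cell [i][j] is True when the words starting at
    seq_a[i] and seq_b[j] (of length `window`) are identical."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        word_a = seq_a[i:i + window]
        rows.append([word_a == seq_b[j:j + window]
                     for j in range(len(seq_b) - window + 1)])
    return rows


def render(seq_a, seq_b):
    """Text rendering: seq_b across the top, seq_a down the side."""
    lines = ["  " + " ".join(seq_b)]
    for base, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        lines.append(base + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("ATGC", "ATGC"))
```

Two identical sequences give an unbroken main diagonal; as the limitation above notes, the plot shows where similarity lies but not the alignment itself.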
The dynamic programming
• Dynamic programming reduces the massive number of possibilities that
need to be considered in aligning sequences.
• This method was first used for global alignment of sequences in the
Needleman-Wunsch algorithm (1970) and for local alignment in the
Smith-Waterman algorithm (1981).
• Both the algorithms involve initialization, matrix filling (scoring) and trace
back steps. The algorithms use either PAM or BLOSUM matrices in the
scoring step to fill the score matrix.
Global alignment
The three main steps in this algorithm are :
1. Initialization
2. Matrix filling
3. Traceback for alignment
Initialization
1. Place the two sequences one across the row and other down the
column
2. The first column and first row should be a gap
3. Add the cumulative gap cost across the row and other down the
column to fill the first column and first row
Global alignment (Needleman and Wunsch)
• Matrix filling
• Rules :-
• Check the side, top and diagonal values of the box
• Box Beside - (add gap cost)
• Box top - ( add gap cost)
• Diagonal box - (match/mismatch)
• Put the highest value in the respective boxes
• Proceed to the end of the scoring matrix
• Trace back
• Start from the end of the matrix and reach the start by tracing back
the value obtained in the box
• if diagonal - Place the characters
• if vertical or horizontal - place a gap in the sequence being pointed
by the arrow
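The initialization, matrix filling and trace back steps can be sketched as follows. This is a minimal illustration with a linear gap cost and a match/mismatch score instead of a PAM or BLOSUM matrix; the function name and defaults are assumptions, and when several paths tie the trace back prefers the diagonal, so only one optimal alignment is returned.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment; seq_a runs down the rows, seq_b across the columns."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: cumulative gap cost along the first column and first row.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Matrix filling: keep the highest of the diagonal, top and side candidates.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ATGC", "TGA"))
```

Filling the matrix takes time proportional to the product of the two sequence lengths, which is why faster strategies are needed for large read sets.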
Matrix Filling - Gap -2; Mismatch -1; Match +1
Box Beside : +gap; Box Top : +gap; Diagonal box : match or mismatch
Each cell holds the highest of its three candidate values.
      gap    T    G    A
gap     0   -2   -4   -6
A      -2   -1   -3   -3
T      -4   -1   -2   -4
G      -6   -3    0   -2
C      -8   -5   -2   -1
Trace back
Path from the bottom-right cell to the top-left cell (rows A T G C, columns T G A):
(C, A) -1 → (G, G) 0 → (T, T) -1 → (A, gap) -2 → (gap, gap) 0
ATGC
-TGA
-2+1+1-1 = -1
Matrix Filling - Gap -1; Mismatch 0; Match +1
      gap    G    A    C    T    A    C
gap     0   -1   -2   -3   -4   -5   -6
A      -1    0    0   -1   -2   -3   -4
C      -2   -1    0   +1    0   -1   -2
G      -3   -1   -1    0   +1    0   -1
C      -4   -2   -1    0    0   +1   +1
-ACG-C
GACTAC
-1+1+1+0-1+1 = +1

-AC-GC
GACTAC
-1+1+1-1+0+1 = +1
gap T G A
gap 0 -2 -4 -6
A -2
T -4
G -6
C -8
Matrix Filling - Gap -2; Mismatch -1; Match +1 Box Beside : +gap; Box Bottom : +gap; Diagonal box : match or mismatch
gap G A C T A C
gap 0
A -2
C -4
G -6
C -8
Matrix Filling - Gap -2; Mismatch -1; Match +1
gap G A C T A C
gap
A
C
G
C
Matrix Filling - Gap -1; Mismatch - 0; Match +1
gap T G A
gap
A
T
G
C
Matrix Filling - Gap -2; Mismatch -1; Match +1 Box Beside : +gap; Box Bottom : +gap; Diagonal box : match or mismatch
The three main steps in this algorithm are :
1. Initialization
2. Matrix filling
3. Traceback for alignment
Initialization
1. Place the two sequences one across the row and other down the
column
2. The first column and first row should be a gap
3. Place zeros in first column and first row
Matrix filling
1. The value of each box thereon depends on the top, diagonal and
side boxes (Box Beside - (add gap cost); Box top - ( add gap
cost); Diagonal box - (match/mismatch)
2. If the value is negative - put the value as zero
3. The highest of the three values is placed in the box
4. The same is continued till the end of the matrix
Smith and Waterman algorithm (Local alignment)
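The local alignment steps can be sketched in the same style. This is a minimal illustration with a linear gap cost; the trace back rule used here (start at the highest-scoring cell and stop at the first zero) is the standard one for Smith-Waterman and is stated as an assumption because the slide lists only initialization and matrix filling. Function name and defaults are illustrative.

```python
def smith_waterman(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Local alignment; returns (best score, aligned part of a, aligned part of b)."""
    n, m = len(seq_a), len(seq_b)
    # Initialization: zeros in the first row and first column.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    # Matrix filling: negative candidates are replaced by zero.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("ATGC", "TGA"))
```

Only the best-scoring local segment is reported; ties are resolved in favour of the first maximum encountered.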
Matrix Filling - Gap -2; Mismatch -1; Match +1
      gap    T    G    A
gap     0    0    0    0
A       0    0    0   +1
T       0   +1    0    0
G       0    0   +2    0
C       0    0    0   +1
TG
TG
+1+1 = +2
Matrix Filling - Gap -2; Mismatch -1; Match +1
      gap   G   C   C   T   A   C   C   C   G   A   A   T
gap     0   0   0   0   0   0   0   0   0   0   0   0   0
G       0  +1   0   0   0   0   0   0   0  +1   0   0   0
A       0   0   0   0   0  +1   0   0   0   0  +2  +1   0
A       0   0   0   0   0  +1   0   0   0   0  +1  +3  +1
T       0   0   0   0  +1   0   0   0   0   0   0  +1  +4
GAAT
GAAT
1+1+1+1 = +4
Global alignment - example
Matrix Filling - Gap -2; Mismatch -1; Match +1
      gap    A    T    G    G    C    G    T
gap     0   -2   -4   -6   -8  -10  -12  -14
A      -2   +1   -1   -3   -5   -7   -9  -11
T      -4   -1   +2    0   -2   -4   -6   -8
G      -6   -3    0   +3   +1   -1   -3   -5
A      -8   -5   -2   +1   +2    0   -2   -4
G     -10   -7   -4   -1   +2   +1   +1   -1
T     -12   -9   -6   -3    0   +1    0   +2
Three alignments share the optimal score:

ATGGCGT
AT-GAGT
1+1-2+1-1+1+1 = +2

ATGGCGT
ATGA-GT
1+1+1-1-2+1+1 = +2

ATGGCGT
ATG-AGT
1+1+1-2-1+1+1 = +2
Local alignment
Matrix Filling - Gap -2; Mismatch -1; Match +1
      gap   A   T   G   G   C   G   T
gap     0   0   0   0   0   0   0   0
A       0  +1   0   0   0   0   0   0
T       0   0  +2   0   0   0   0  +1
G       0   0   0  +3  +1   0  +1   0
A       0  +1   0  +1  +2   0   0   0
G       0   0   0  +1  +2  +1  +1   0
T       0   0  +1   0   0  +1   0  +2
ATG
ATG
1+1+1 = +3
gap A T G G C G T
gap 0 -2 -4 -6 -8 -10 -12 -14
A -2
T -4
G -6
A -8
G -10
T -12
Matrix Filling - Gap -2; Mismatch -1; Match +1
Note: Always give the gap cost and the mismatch cost negative values, and
the two values should be different
Drawback of Dynamic programming - Very slow
Need for - faster alignment strategies
Fast Sequence alignment strategies
• Using hash table based indexing - seed extend paradigm, space allowance
• Using suffix/prefix tree based indexing - suffix array, Burrows-Wheeler
transform and FM index
• Merge sorting
Strategy: making a dictionary (index) – An example of 4-nt index
AAAA: 235, 783, 10083,......
AAAC: 132, 236, 832, 932, ...
TTTT: 327, 1328, 5523,......
Algorithms
Hashing reads - Eland, MAQ, Mosaik...
Hashing reference genome - BFAST, Mosaik, SOAP, ...
Hash table based indexing
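The dictionary idea can be sketched as below: every k-mer of the reference is mapped to the positions where it occurs, and a read is placed by looking up its k-mers (seeds) and checking the surrounding bases (extend). This is an illustrative toy that hashes the reference and extends by counting mismatches only; the function names, the `max_mismatches` parameter and the toy sequences are assumptions, and tools such as BFAST or SOAP are far more elaborate.

```python
from collections import defaultdict


def build_index(reference, k=4):
    """Map each k-mer to the list of 0-based positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k=4, max_mismatches=1):
    """Return sorted (start, mismatches) placements of the read, without gaps."""
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset            # implied start of the whole read
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


reference = "ACGTACGTTAGC"
index = build_index(reference, k=4)
print(index["ACGT"])
print(seed_and_extend("CGTTAG", reference, index))
```

Lookups take roughly constant time per seed, at the cost of the memory needed to hold the index.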
Burrows-Wheeler transform and FM index
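Only the slide title survives here, so the following is a hedged sketch of the transform itself, not of the original figure. It builds the Burrows-Wheeler transform naively from sorted rotations and inverts it again; the "$" end marker and the function names are assumptions, and a real FM index adds rank and occurrence tables on top of this string so that reads can be searched without inverting it.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Rebuild the text by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))
print(inverse_bwt(bwt("GATTACA")))
```

The transform groups identical characters into runs, which is what makes the indexed reference compact.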
Criteria for choosing an aligner
• Global or local
• Aligning short sequences to long sequences such as short
reads to a reference
• Aligning long sequences to long sequences such as long
reads or contigs to a reference
• Handles small gaps (insertions and deletions)
• Handles large gaps (introns)
• Handles split alignments (chimera)
• Speed and ease of use
Short read aligner
Aligner | Purpose
Bowtie | Fast
BWA | Small gaps (indels)
GSNAP | Large gaps (introns)
Bowtie 2 | Takes care of gaps
Long sequence aligner
Aligner | Purpose
BLAST | Many reference genomes
BLAT | Large gaps (introns)
BWA | Small gaps (indels)
Exonerate | Ease of use
GMAP | Large gaps (introns)
MUMmer | Align two genomes
Three major challenges
• Short reads (36-125 nt)
• Error rates are considerable
• Many reads span exon-exon junctions
Alignment should be conducted on
• Genome
• Reference transcriptome
Short read aligners either
• allow no gaps, or
• allow only small gaps
• Neither kind will work across intron regions
RNA - seq alignment
RNA-seq alignment strategy
[Figure: two panels. Exon-first approach: exon read mapping, then spliced read mapping across Exon 1 and Exon 2. Seed-extend approach: seed matching with k-mer seeds, then seed extension.]
• Available tools:
• MapSplice, SpliceMap, TopHat
• Two step procedure
• Map reads continuously using
unspliced read aligners
• Unmapped reads are split into shorter
segments and aligned independently
• Efficient when not too many reads span
junctions
• Second step is computationally intensive
• Can miss reads across exon-intron
junctions
Exon - first approach
• Representative algorithms
• Genomic short read nucleotide
alignment program (GSNAP)
• Computing accurate spliced
alignments (QPALMA)
• Steps
• Break reads into short seeds
• Candidate regions are combined using a more
sensitive alignment method (such as Smith-Waterman)
• Increased sensitivity
• One arm may not provide enough
specificity for alignment
Seed - extend approach
Sequence Assembly
• Overlap, layout, consensus
• De Bruijn Graph or k-mer
• Burrows-Wheeler transform and FM-index
Source (Genome, Exome, Clones and amplicons,Transcriptome)
Assembly (Reference-based assembly, de novo assembly)
Assembly Algorithms -
Assembly algorithms
Algorithm | Purpose
OLC | Small genome; long reads; handles indels
De Bruijn graph | Large genome; short high-quality reads; no indels
BWT and Ferragina-Manzini index | Large genome; short or long reads; no indels (currently)
Assemblers
Algorithm | Assemblers
OLC | ARACHNE, CAP3, Celera Assembler, MIRA, Newbler, Phrap
De Bruijn graph | ABySS, ALLPATHS, SOAPdenovo, Velvet
BWT and Ferragina-Manzini index | String Graph Assembler (SGA)
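The de Bruijn graph (k-mer) approach can be sketched as below. This is a minimal illustration that only builds the graph: every k-mer in the reads becomes an edge from its first k-1 bases to its last k-1 bases. The function name and the toy read are assumptions; assemblers such as Velvet or ABySS additionally correct errors, simplify the graph and walk paths to produce contigs.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph as {prefix (k-1)-mer: [suffix (k-1)-mers]}."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix node -> suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


print(de_bruijn(["ATGGC"], 3))
```

Because reads are broken into k-mers, no pairwise overlap computation is needed, which is why this approach scales to large numbers of short reads.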
Greedy
• Find two sequences with the largest overlap and merge them; repeat
• Flaw: prone to mis-assembly
Overlap, Layout, Consensus
• Overlap
• Find all pairs of sequences that overlap
• Layout
• Remove redundant and weak overlaps.
• Merge pairs of sequences that overlap unambiguously; that is, pairs of
sequences that overlap only with each other and no other sequence
• Consensus
• Call the consensus base at positions where reads overlap
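The greedy strategy can be sketched as follows. This is an illustrative toy that assumes error-free reads from one strand and exact suffix-prefix overlaps; the function names, the `min_len` threshold and the toy reads are assumptions. It also shows why the approach is prone to mis-assembly: the largest overlap is always taken, even when a repeat makes it the wrong one.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=1):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    reads = list(dict.fromkeys(reads))      # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break                           # no remaining overlaps: stop merging
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


print(greedy_assemble(["ATGGC", "GGCGT", "CGTAA"], min_len=3))
```

An OLC assembler differs in that it first computes all overlaps, prunes the ambiguous ones in the layout step, and only then calls a consensus.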

Aug2015 analysis team spiral genetics
 
CSU Next Generation Sequencing Core 06/09/2015
CSU Next Generation Sequencing Core 06/09/2015CSU Next Generation Sequencing Core 06/09/2015
CSU Next Generation Sequencing Core 06/09/2015
 
NGS overview
NGS overviewNGS overview
NGS overview
 
Bioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisBioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysis
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualize
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
 
140127 platinum genomes pedigree analyses
140127 platinum genomes pedigree analyses140127 platinum genomes pedigree analyses
140127 platinum genomes pedigree analyses
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Biz model for ion proton dna sequencer
Biz model for ion proton dna sequencerBiz model for ion proton dna sequencer
Biz model for ion proton dna sequencer
 

Semelhante a RNA Seq Data Analysis

DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification Senthil Natesan
 
Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030GenomeInABottle
 
Bioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxBioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxRanjan Jyoti Sarma
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGLong Pei
 
Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015Kim D. Pruitt
 
Alvis Brazma, Array Express Gene Expression Atlas, fged_seattle_2013
Alvis Brazma, Array Express Gene Expression Atlas, fged_seattle_2013Alvis Brazma, Array Express Gene Expression Atlas, fged_seattle_2013
Alvis Brazma, Array Express Gene Expression Atlas, fged_seattle_2013Functional Genomics Data Society
 
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-Seq
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-SeqNUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-Seq
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-SeqHimanshu Sethi
 
160627 giab for festival sv workshop
160627 giab for festival sv workshop160627 giab for festival sv workshop
160627 giab for festival sv workshopGenomeInABottle
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016GenomeInABottle
 
ECCMID 2015 - So I have sequenced my genome ... what now?
ECCMID 2015 - So I have sequenced my genome ... what now?ECCMID 2015 - So I have sequenced my genome ... what now?
ECCMID 2015 - So I have sequenced my genome ... what now?Nick Loman
 
GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GenomeInABottle
 
Analytical Validation of the Oncomine™ Comprehensive Assay v3 with FFPE and C...
Analytical Validation of the Oncomine™ Comprehensive Assay v3 with FFPE and C...Analytical Validation of the Oncomine™ Comprehensive Assay v3 with FFPE and C...
Analytical Validation of the Oncomine™ Comprehensive Assay v3 with FFPE and C...Thermo Fisher Scientific
 
2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issuesDongyan Zhao
 
Informal presentation on bioinformatics
Informal presentation on bioinformaticsInformal presentation on bioinformatics
Informal presentation on bioinformaticsAtai Rabby
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GenomeInABottle
 
2007. stephen chanock. technologic issues in gwas and follow up studies
2007. stephen chanock. technologic issues in gwas and follow up studies2007. stephen chanock. technologic issues in gwas and follow up studies
2007. stephen chanock. technologic issues in gwas and follow up studiesFOODCROPS
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing priyanka raviraj
 

Semelhante a RNA Seq Data Analysis (20)

DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification
 
Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030
 
Cufflinks
CufflinksCufflinks
Cufflinks
 
Bioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptxBioinformaatics for M.Sc. Biotecchnology.pptx
Bioinformaatics for M.Sc. Biotecchnology.pptx
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEG
 
Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015
 
Alvis Brazma, Array Express Gene Expression Atlas, fged_seattle_2013
Alvis Brazma, Array Express Gene Expression Atlas, fged_seattle_2013Alvis Brazma, Array Express Gene Expression Atlas, fged_seattle_2013
Alvis Brazma, Array Express Gene Expression Atlas, fged_seattle_2013
 
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-Seq
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-SeqNUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-Seq
NUGEN-X-Gen_2011_poster_trancriptome_sequencing_RNA-Seq
 
Iplant pag
Iplant pagIplant pag
Iplant pag
 
160627 giab for festival sv workshop
160627 giab for festival sv workshop160627 giab for festival sv workshop
160627 giab for festival sv workshop
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016
 
ECCMID 2015 - So I have sequenced my genome ... what now?
ECCMID 2015 - So I have sequenced my genome ... what now?ECCMID 2015 - So I have sequenced my genome ... what now?
ECCMID 2015 - So I have sequenced my genome ... what now?
 
GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015GIAB update for GRC GIAB workshop 191015
GIAB update for GRC GIAB workshop 191015
 
Analytical Validation of the Oncomine™ Comprehensive Assay v3 with FFPE and C...
Analytical Validation of the Oncomine™ Comprehensive Assay v3 with FFPE and C...Analytical Validation of the Oncomine™ Comprehensive Assay v3 with FFPE and C...
Analytical Validation of the Oncomine™ Comprehensive Assay v3 with FFPE and C...
 
2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues2015.04.08-Next-generation-sequencing-issues
2015.04.08-Next-generation-sequencing-issues
 
Informal presentation on bioinformatics
Informal presentation on bioinformaticsInformal presentation on bioinformatics
Informal presentation on bioinformatics
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
 
SNp mining in crops
SNp mining in cropsSNp mining in crops
SNp mining in crops
 
2007. stephen chanock. technologic issues in gwas and follow up studies
2007. stephen chanock. technologic issues in gwas and follow up studies2007. stephen chanock. technologic issues in gwas and follow up studies
2007. stephen chanock. technologic issues in gwas and follow up studies
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing
 

Mais de Ravi Gandham

Functional annotation
Functional annotationFunctional annotation
Functional annotationRavi Gandham
 
RSEM and DE packages
RSEM and DE packagesRSEM and DE packages
RSEM and DE packagesRavi Gandham
 
FastQC and Prinseqlite
FastQC and PrinseqliteFastQC and Prinseqlite
FastQC and PrinseqliteRavi Gandham
 
NGS data analysis Overview
NGS data analysis Overview NGS data analysis Overview
NGS data analysis Overview Ravi Gandham
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence AlignmentRavi Gandham
 

Mais de Ravi Gandham (7)

Functional annotation
Functional annotationFunctional annotation
Functional annotation
 
RSEM and DE packages
RSEM and DE packagesRSEM and DE packages
RSEM and DE packages
 
Data formats
Data formatsData formats
Data formats
 
FastQC and Prinseqlite
FastQC and PrinseqliteFastQC and Prinseqlite
FastQC and Prinseqlite
 
NGS data analysis Overview
NGS data analysis Overview NGS data analysis Overview
NGS data analysis Overview
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
Primer designing
Primer designingPrimer designing
Primer designing
 

Último

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701bronxfugly43
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - Englishneillewis46
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 

Último (20)

Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 

RNA Seq Data Analysis

  • 1. Computational Biology and Genomics Facility, Indian Veterinary Research Institute Overview of the previous lecture
  • 2. Overview of the previous lecture
  • 3. Current Leading Technologies
      Instrument | Needed Coverage | Throughput/Run | Human Genome/Run | Cost per Run** | Total Cost | $/Genome
      HiSeq 2500 | 30x | 1Tb | 11 | 29.000 | 29,000 | 2,536
      HiSeq X Ten | 30x | 1 STb | 16 | 15.700 | 12,000 | 950
      PacBio RSII | 54x | 1 Gb | 0.000 | 212 | 34,374 | 34,374
      PacBio Sequel | 50x | 5-10 Gb | 0.06 | 700 | 10,300 | 10,500
      Long-Range Sequencers Comparison
      Platform | Average Read Length | Advantages | Limitation | Material Recommended
      Illumina MiSeq | 2 x 300 bp | Accurate | Short length | 100-200 ng
      Illumina Moleculo | 5 kbp | Accurate | Coverage may fluctuate | 10 ug
      Pacific Biosciences | 15 kbp | Long reads | Relatively expensive | 10-100 ug
      Oxford Nanopore | 5 kbp | Low ownership cost | High error rate | 1 ug
  • 4. Basics of RNA-seq data analysis
  • 5. Applications of NGS
      • Sequence DNA
          • De novo sequencing
          • Reference-based re-sequencing: SNPs, CNVs and indels
          • Metagenomics: identify which organisms are present in a mixture of microbes
      • Sequence RNA
          • RNA-Seq (transcriptome-wide sequencing)
          • miRNA-Seq novo sequencing
          • Novel ncRNAs
      • Study protein-DNA/RNA interactions
          • ChIP-Seq (for transcription factors)
          • CLIP-Seq (for RNA-binding proteins)
      • Epigenetics
          • DNA methylation
          • Histone modification (ChIP-seq)
          • Nucleosome positioning
          • Chromosome looping
  • 6. Certain important key words to be remembered from the previous chapter
      • Sequencing
          • Both DNA and RNA (with modified protocol)
      • Short reads
          • 35, 50, 75, and 100 bp (Solexa and SOLiD)
          • 400 bp (454)
      • Ultra-high throughput
          • 1 to 1.5 billion reads (Solexa and SOLiD)
          • 2-4 million reads (454)
  • 7. What is a transcriptome?
      • The transcriptome: the complete set of transcripts in a cell, in terms of both type and quantity.
      • Transcriptome analysis
          • helps in understanding the pattern of gene expression to address basic biological questions.
          • gives greater insight into the biological pathways and molecular mechanisms that regulate cell fate, development, and disease progression.
      • The transcriptome can be studied through RNA-seq or microarrays.
  • 8. RNA sequencing
  • 9. Aim of an RNA-seq experiment
      • To quantify RNA abundance
      • To determine the transcriptional structure of genes: start sites, 5' and 3' ends, splicing patterns
      • To quantify the changing expression levels of each transcript during development and under different conditions
      • To identify variants in the transcripts
  • 10. Basic workflow of next-generation sequencing (figure from Wang et al., 2009)
  • 11. What can we get out of an RNA-seq experiment?
      Figure labels: RNA-seq; gene expression; alternative splicing; transcript variation; non-coding RNAs; differential expression; non-synonymous SNPs; synonymous SNPs; SNPs in 3'-UTRs; allele-specific expression; protein changes; RNA-binding proteins; microRNA binding sites
  • 12. Microarray
      Figure labels: normal RNA (Cy3 labelling); experimental RNA (Cy5 labelling); reverse transcription; hybridize; wash and scan
  • 13. RNA-Seq vs Microarray
      "RNA-seq ... is expected to revolutionise the manner in which eukaryotic transcriptomes are analysed" (Wang et al., Nat Rev Genet, 2009)
  • 14. Limitations of microarray platforms
      • High background levels owing to cross-hybridization
      • Limited dynamic range of detection owing to both background and saturation of signals
      • Reliance upon existing knowledge about the genome sequence
  • 15. RNA-seq has a higher dynamic range
      For microarrays:
          • Low-signal end: high background noise
          • High-signal end: signals saturate
      With sufficient sequencing depth, NGS largely avoids both problems.
  • 16. Comparison of microarray with RNA-seq
  • 17. Experimental considerations
  • 18. Sample preparation
      • Amount requirement
          • 100 ug total RNA (several of Wold's studies)
          • 2 ug total RNA (Center for Medical Genomics, SOLiD)
          • 50 ng total RNA (collaboration with Pourmand, SOLiD)
          • Single cell (Tang et al., Nat. Methods, 2009)
          (When RNA is limiting, approaches exist to amplify small quantities of RNA.)
      • rRNA removal
          • rRNAs are highly abundant (>90% of total RNA)
          Solutions:
          • rRNA depletion kits
          • Poly-A selection
          • Enzymes that selectively degrade uncapped RNA
  • 19. Sample preparation
      • Reverse transcription (RNA -> cDNA)
          • Oligo dT
              • Pros: focuses on polyadenylated transcripts, giving a cleaner library
              • Cons: bias towards the 3' end of transcripts
          • Random primers
              • Pros: more even coverage; can be used to study non-polyadenylated transcripts
              • Cons: higher proportion of rRNA
  • 20. Single-end reads vs paired-end reads
      Figure labels: cDNA conversion (poly-A tail primed with oligo dT); cDNA fragments and adapter ligation; sequencing of each fragment (R1, R2)
      • R1 runs in the same direction as the reference
      • R2 runs in the opposite direction to the reference
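The orientation of the two mates can be illustrated with a minimal Python sketch. It assumes a simplified model in which the fragment is given in reference orientation, R1 is read from its start and R2 from the opposite strand at the other end. The function names and the example fragment are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Simulate R1 and R2 from a fragment given in reference orientation.

    Simplifying assumption: R1 is read from the 5' end of the fragment,
    R2 from the 5' end of the opposite strand.
    """
    r1 = fragment[:read_length]
    r2 = reverse_complement(fragment)[:read_length]
    return r1, r2


fragment = "ATGGCCATTGTAATGGGCCGC"  # made-up cDNA fragment
r1, r2 = paired_end_reads(fragment, 6)
print("R1:", r1)
print("R2:", r2)
# R2 only matches the reference after reverse-complementing it.
print("R2 on reference strand:", reverse_complement(r2))
```

A single-end run would yield only R1 in this model; the second mate adds information about the far end of the same fragment.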
  • 21. Statistical considerations
      • Number of conditions?
          • Dynamics?
          • Dose-effect?
          • Tissue-specificity?
      • Number of replicates?
          • Biological variation, ...
          • Statistical power, ...
      • Depth of sequencing?
          • Gene expression
          • Alternative splicing
          • Allele-specific expression
  • 22. Basics of RNA-seq data analysis - II: Data processing
  • 23. Data analysis workflow
      • Primary analysis
          • Image acquisition / semiconductor-based detection
          • Base calling
          • Quality metrics
      • Secondary analysis
          • Sequence alignment
          • Sequence stats
          • Consensus calling
          • Sequence assembly
      • Tertiary analysis
          • Application-specific analysis
              • DNA-Seq
              • Re-sequencing
              • De novo sequencing
              • ChIP sequencing
              • RNA-Seq
              • Epigenetics
  • 24. Sequence alignment
  • 25. Sequence alignment
      ▪ Sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity.
      ▪ Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.
      ▪ Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
      Purpose of sequence alignment
          1. To find whether two (or more) genes or proteins are evolutionarily related to each other.
          2. To observe patterns of conservation (or variability).
          3. To find structurally or functionally similar regions within proteins, i.e. the common motifs present in both sequences.
          4. To find out which sequences from the database are similar to the sequence at hand.
  • 26. Identity vs Similarity vs Homology
      1. Although these terms are often used interchangeably, they have quite different meanings.
      2. Sequence identity refers to the occurrence of exactly the same nucleotide or amino acid at the same position in aligned sequences.
      3. The term 'sequence homology' is the most important (and the most abused) of the three.
          • When we say that sequence A has high homology to sequence B, we are making two distinct claims:
          • not only that sequences A and B look much the same, but also that all of their ancestors looked the same, going all the way back to a common ancestor.
  • 27. Property | Sequence identity | Sequence similarity | Sequence homology
      Definition | Proportion of identical residues between two sequences. | Proportion of similar residues between two sequences. Two residues are similar if their substitution score is higher than 0. | Sequences derived from a common ancestor.
      Expressed as | % identity | % similarity | Yes or No
      Rule of thumb: if two sequences are more than 100 amino acids (or 100 nucleotides) long, they are considered homologues if at least 25% of the amino acids are identical (70% of the nucleotides for DNA).
      Twilight zone = protein sequence identity of roughly 20-35%, where homology cannot be reliably inferred from sequence alone.
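As an illustration of the two percentages, here is a minimal Python sketch. It assumes the sequences are already aligned (equal length, '-' for gaps) and that percentages are taken over ungapped columns only, which is one of several conventions in use. The residue groups and `toy_score` are a made-up stand-in for a real substitution matrix such as BLOSUM.

```python
# Hypothetical residue groups used only to illustrate "similar" residues.
TOY_GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ"]


def toy_score(x, y):
    """Toy substitution score: 2 if identical, 1 if same group, else -1."""
    if x == y:
        return 2
    same_group = any(x in group and y in group for group in TOY_GROUPS)
    return 1 if same_group else -1


def ungapped_columns(aligned_a, aligned_b):
    """Return the aligned residue pairs, skipping columns that contain a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]


def percent_identity(aligned_a, aligned_b):
    """Percentage of ungapped columns holding identical residues."""
    columns = ungapped_columns(aligned_a, aligned_b)
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def percent_similarity(aligned_a, aligned_b, score=toy_score):
    """Percentage of ungapped columns whose substitution score is above 0."""
    columns = ungapped_columns(aligned_a, aligned_b)
    if not columns:
        return 0.0
    return 100.0 * sum(score(x, y) > 0 for x, y in columns) / len(columns)


a = "MKV-LD"
b = "MRVALE"
print(percent_identity(a, b), percent_similarity(a, b))
```

Homology has no such percentage: it is a yes-or-no inference drawn from values like these, not a quantity computed from the alignment.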
  • 28. Computational approaches: global alignment and local alignment
      Global alignment
          • Assumes that the two sequences are basically similar over their entire length.
          • Forces the sequences to match from end to end, even where parts of the alignment are not convincing.
          • Most suitable when the two sequences are of similar length and show a significant degree of similarity throughout.
  • 29. Local alignment
      • Identifies segments of the two sequences that match well, with no attempt to force the entire sequences into alignment.
      • Parts that appear to have good similarity, according to some criterion, are aligned.
      • Suitable when comparing substantially different sequences, which may differ significantly in length and have only short patches of similarity.
  • 30. Given two sequences, we need a number to associate with each possible alignment (i.e. the alignment score = goodness of alignment). The scoring scheme is a set of rules that assigns the alignment score to any given alignment of two sequences.
      • The scoring scheme is residue based: it consists of residue substitution scores (i.e. a score for each possible residue pairing), plus penalties for gaps.
      • The alignment score is the sum of substitution scores and gap penalties.
      Substitution scores are given by:
          • For DNA: a substitution matrix for DNA (purine/purine or purine/pyrimidine substitutions)
          • For proteins: a substitution matrix based on polarity, size, charge or hydrophobicity
          • Evolutionary distance matrices: PAM and BLOSUM for protein sequences
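The rule "alignment score = sum of substitution scores and gap penalties" can be sketched in Python as follows. The numeric values (match 2, transition -1, transversion -2, gap -3) and the linear gap penalty are illustrative assumptions, not values taken from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_substitution_score(x, y, match=2, transition=-1, transversion=-2):
    """Score one aligned pair of DNA bases.

    A transition stays within a class (purine<->purine or
    pyrimidine<->pyrimidine); a transversion switches class.
    """
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def alignment_score(aligned_a, aligned_b, gap_penalty=-3):
    """Sum substitution scores and gap penalties over all alignment columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            total += gap_penalty  # assumed linear penalty: same cost per gap column
        else:
            total += dna_substitution_score(x, y)
    return total


# Columns: A/A, C/T (transition), G/G, T/T, gap, A/A
print(alignment_score("ACGT-A", "ATGTCA"))
```

For proteins, `dna_substitution_score` would be replaced by a lookup in a PAM or BLOSUM matrix; the summation itself stays the same.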
  • 31. Types of alignment
      • Pairwise alignment
      • Multiple sequence alignment
      Each can be global or local.
  • 32. Methods of pairwise alignment
      • Dot matrix method
      • The dynamic programming method
          • Needleman and Wunsch
          • Smith and Waterman
      • Heuristic methods: FASTA, BLAST
      Dot matrix method
      • It is a visual, graphical representation of the similarities between two sequences.
      • Each axis represents one of the two sequences to be compared.
      • When two sequences are similar over their entire length, a line extends from one corner of the dot plot to the diagonally opposite corner.
      • If two sequences share only patches of similarity, these show up as shorter diagonal stretches.
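A dot matrix can be sketched in a few lines: a cell is marked wherever the residue on one axis equals the residue on the other. This is a bare illustration with hypothetical function names; practical dot plot programs usually add a sliding window and a threshold to suppress noise.

```python
def dot_matrix(seq_x: str, seq_y: str):
    """grid[i][j] is True when seq_y[i] equals seq_x[j]."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x: str, seq_y: str) -> str:
    """Draw the dot matrix as text: seq_x across the top, seq_y down the side."""
    lines = ["  " + " ".join(seq_x)]
    for residue, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)
```

Calling `print(render("ATGGCGT", "ATGAGT"))` shows the shared `ATG` prefix as a diagonal run of `*` starting in the top-left corner.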
  • 33. Interpretation of dot matrix
      • Regions of similarity appear as diagonal runs of dots.
      • Reverse diagonals (perpendicular to the main diagonal) indicate inversions.
      • Reverse crossing diagonals (Xs) indicate palindromes.
      Limitation: dot matrix programs do not show an actual alignment.
  • 34. The dynamic programming method
      • Dynamic programming reduces the massive number of possibilities that need to be considered when aligning sequences.
      • The method was first used for global alignment in the Needleman-Wunsch algorithm (1970) and for local alignment in the Smith-Waterman algorithm (1981).
      • Both algorithms involve initialization, matrix filling (scoring) and trace back steps. For protein sequences, the scoring step can use a PAM or BLOSUM matrix to fill the score matrix.
  • 35. Global alignment
  • 36. Global alignment (Needleman and Wunsch)
      The three main steps in this algorithm are:
      1. Initialization
      2. Matrix filling
      3. Traceback for alignment
      Initialization
      1. Place one sequence across the top row and the other down the first column.
      2. The first column and first row correspond to a gap.
      3. Fill the first row and first column with the cumulative gap cost.
  • 37. Matrix filling and trace back
      Matrix filling rules
      • For each box, check the side, top and diagonal boxes.
      • Box beside: its value plus the gap cost
      • Box on top: its value plus the gap cost
      • Diagonal box: its value plus the match or mismatch score
      • Put the highest of the three values in the box.
      • Proceed to the end of the scoring matrix.
      Trace back
      • Start from the last box of the matrix and work back to the start, following the box from which each value was obtained.
      • If the step is diagonal, place both characters.
      • If the step is vertical or horizontal, place a gap in the sequence the arrow points to.
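The initialization, matrix filling and trace back rules above can be sketched as one function. This is a minimal version under stated assumptions: a fixed penalty per gap position, a plain match/mismatch score instead of a substitution matrix, and ties in the trace back resolved in favour of the diagonal, so only one of several equally good alignments is returned. The function name and parameters are illustrative.

```python
def needleman_wunsch(seq_a: str, seq_b: str,
                     match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment; seq_a runs across the top, seq_b down the side.

    Returns (score matrix, aligned seq_a, aligned seq_b).
    """
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initialization: cumulative gap cost along the first row and column.
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap

    def sub(i: int, j: int) -> int:
        return match if seq_b[i - 1] == seq_a[j - 1] else mismatch

    # Matrix filling: keep the highest of diagonal, top and side candidates.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # diagonal
                              score[i - 1][j] + gap,            # top
                              score[i][j - 1] + gap)            # side

    # Trace back from the last box to the first.
    top, side = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(seq_a[j - 1]); side.append(seq_b[i - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append("-"); side.append(seq_b[i - 1])   # gap in seq_a
            i -= 1
        else:
            top.append(seq_a[j - 1]); side.append("-")   # gap in seq_b
            j -= 1
    return score, "".join(reversed(top)), "".join(reversed(side))
```

For TGA across the top and ATGC down the side, `needleman_wunsch("TGA", "ATGC")` ends with a bottom-right score of -1 and returns the pair `-TGA` / `ATGC`.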
  • 38. Matrix filling - Gap -2; Mismatch -1; Match +1
      Box beside: +gap; box top: +gap; diagonal box: match or mismatch. Each box keeps the highest of the three candidates. For example, box (A, T): diagonal 0-1 = -1, top -2-2 = -4, beside -2-2 = -4, so -1 is kept.
           gap    T    G    A
      gap    0   -2   -4   -6
      A     -2   -1   -3   -3
      T     -4   -1   -2   -4
      G     -6   -3    0   -2
      C     -8   -5   -2   -1
  • 39. Trace back (matrix of slide 38; Gap -2; Mismatch -1; Match +1): start at the bottom-right box (C, A) = -1 and follow the box each value came from: diagonal to (G, G) = 0, diagonal to (T, T) = -1, diagonal to (A, gap) = -2, then up to the starting 0.
  • 40. Resulting global alignment
      ATGC
      -TGA
      -2+1+1-1 = -1
  • 41. Matrix filling - Gap -1; Mismatch 0; Match +1
           gap    G    A    C    T    A    C
      gap    0   -1   -2   -3   -4   -5   -6
      A     -1    0    0   -1   -2   -3   -4
      C     -2   -1    0   +1    0   -1   -2
      G     -3   -1   -1    0   +1    0   -1
      C     -4   -2   -1    0    0   +1   +1
  • 43. Two equally scoring global alignments (Gap -1; Mismatch 0; Match +1)
      -ACG-C
      GACTAC
      -1+1+1+0-1+1 = +1

      -AC-GC
      GACTAC
      -1+1+1-1+0+1 = +1
  • 44. Initialised grid: columns gap, T, G, A (first row 0 -2 -4 -6); rows gap, A, T, G, C (first column 0 -2 -4 -6 -8). Matrix filling - Gap -2; Mismatch -1; Match +1. Box beside: +gap; box top: +gap; diagonal box: match or mismatch.
  • 45. Partly initialised grid: columns gap, G, A, C, T, A, C; rows gap, A, C, G, C (first column 0 -2 -4 -6 -8). Matrix filling - Gap -2; Mismatch -1; Match +1.
  • 46. Empty grid: columns gap, G, A, C, T, A, C; rows gap, A, C, G, C. Matrix filling - Gap -1; Mismatch 0; Match +1.
  • 47. Empty grid: columns gap, T, G, A; rows gap, A, T, G, C. Matrix filling - Gap -2; Mismatch -1; Match +1. Box beside: +gap; box top: +gap; diagonal box: match or mismatch.
  • 48. Smith and Waterman algorithm (local alignment)
      The three main steps in this algorithm are:
      1. Initialization
      2. Matrix filling
      3. Traceback for alignment
      Initialization
      1. Place one sequence across the top row and the other down the first column.
      2. The first column and first row correspond to a gap.
      3. Place zeros in the first column and first row.
      Matrix filling
      1. The value of each box depends on the top, diagonal and side boxes (box beside: add gap cost; box on top: add gap cost; diagonal box: add match or mismatch score).
      2. The highest of the three values is placed in the box.
      3. If that value is negative, put zero in the box instead.
      4. Continue in the same way to the end of the matrix.
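The local alignment steps above differ from the global case in two places: the borders are zero and negative values are replaced by zero. A minimal sketch follows, assuming a fixed penalty per gap position, a plain match/mismatch score, and a trace back that starts at the first highest-scoring box and stops at the first zero; the function name and parameters are illustrative.

```python
def smith_waterman(seq_a: str, seq_b: str,
                   match: int = 1, mismatch: int = -1, gap: int = -2):
    """Local alignment; returns (best score, aligned seq_a part, aligned seq_b part)."""
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    score = [[0] * cols for _ in range(rows)]   # first row and column stay 0

    def sub(i: int, j: int) -> int:
        return match if seq_b[i - 1] == seq_a[j - 1] else mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,                                  # never negative
                              score[i - 1][j - 1] + sub(i, j),    # diagonal
                              score[i - 1][j] + gap,              # top
                              score[i][j - 1] + gap)              # side
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest box until a zero is reached.
    top, side = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(seq_a[j - 1]); side.append(seq_b[i - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append("-"); side.append(seq_b[i - 1])
            i -= 1
        else:
            top.append(seq_a[j - 1]); side.append("-")
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(side))
```

For example, `smith_waterman("TGA", "ATGC")` returns `(2, "TG", "TG")`, and `smith_waterman("GCCTACCCGAAT", "GAAT")` returns `(4, "GAAT", "GAAT")`.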
  • 49. Matrix filling (local alignment) - Gap -2; Mismatch -1; Match +1
           gap    T    G    A
      gap    0    0    0    0
      A      0    0    0   +1
      T      0   +1    0    0
      G      0    0   +2    0
      C      0    0    0   +1
  • 50. Resulting local alignment
      TG
      TG
      +1+1 = +2
  • 51. Matrix filling (local alignment) - Gap -2; Mismatch -1; Match +1
           gap    G    C    C    T    A    C    C    C    G    A    A    T
      gap    0    0    0    0    0    0    0    0    0    0    0    0    0
      G      0   +1    0    0    0    0    0    0    0   +1    0    0    0
      A      0    0    0    0    0   +1    0    0    0    0   +2   +1    0
      A      0    0    0    0    0   +1    0    0    0    0   +1   +3   +1
      T      0    0    0    0   +1    0    0    0    0    0    0   +1   +4
  • 52. Resulting local alignment
      GAAT
      GAAT
      1+1+1+1 = +4
  • 53. Global alignment - example
  • 54. Matrix filling - Gap -2; Mismatch -1; Match +1
           gap    A    T    G    G    C    G    T
      gap    0   -2   -4   -6   -8  -10  -12  -14
      A     -2   +1   -1   -3   -5   -7   -9  -11
      T     -4   -1   +2    0   -2   -4   -6   -8
      G     -6   -3    0   +3   +1   -1   -3   -5
      A     -8   -5   -2   +1   +2    0   -2   -4
      G    -10   -7   -4   -1   +2   +1   +1   -1
      T    -12   -9   -6   -3    0   +1    0   +2
  • 58. Global alignment 1
      ATGGCGT
      AT-GAGT
      1+1-2+1-1+1+1 = +2
  • 59. Global alignment 2
      ATGGCGT
      ATGA-GT
      1+1+1-1-2+1+1 = +2
  • 60. Global alignment 3
      ATGGCGT
      ATG-AGT
      1+1+1-2-1+1+1 = +2
  • 62. Summary: three global alignments with the same score
      ATGGCGT / AT-GAGT: 1+1-2+1-1+1+1 = +2
      ATGGCGT / ATGA-GT: 1+1+1-1-2+1+1 = +2
      ATGGCGT / ATG-AGT: 1+1+1-2-1+1+1 = +2
  • 63. Local alignment
  • 64. Matrix filling (local alignment) - Gap -2; Mismatch -1; Match +1
      A box next to a high score can stay positive after a gap or mismatch, for example box (A, second G): diagonal +3-1 = +2.
           gap    A    T    G    G    C    G    T
      gap    0    0    0    0    0    0    0    0
      A      0   +1    0    0    0    0    0    0
      T      0    0   +2    0    0    0    0   +1
      G      0    0    0   +3   +1    0   +1    0
      A      0   +1    0   +1   +2    0    0    0
      G      0    0    0   +1   +2   +1   +1    0
      T      0    0   +1    0    0   +1    0   +2
  • 65. Resulting local alignment
      ATG
      ATG
      1+1+1 = +3
  • 66. Initialised grid: columns gap, A, T, G, G, C, G, T (first row 0 -2 -4 -6 -8 -10 -12 -14); rows gap, A, T, G, A, G, T (first column 0 -2 -4 -6 -8 -10 -12). Matrix filling - Gap -2; Mismatch -1; Match +1.
  • 67. Note: always give the gap cost and the mismatch cost negative values, and make the two values different.
  • 68. Fast sequence alignment strategies
      Drawback of dynamic programming: the time needed grows with the product of the two sequence lengths, which is very slow for large numbers of reads against a genome. Hence the need for faster alignment strategies.
      • Hash table based indexing: seed-extend paradigm, space allowance
      • Suffix/prefix tree based indexing: suffix array, Burrows-Wheeler transform and FM index
      • Merge sorting
      Hash table based indexing
      Strategy: make a dictionary (index). An example of a 4-nt index:
          AAAA: 235, 783, 10083, ...
          AAAC: 132, 236, 832, 932, ...
          TTTT: 327, 1328, 5523, ...
      Algorithms
      • Hashing reads: Eland, MAQ, Mosaik, ...
      • Hashing the reference genome: BFAST, Mosaik, SOAP, ...
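The dictionary idea on this slide can be sketched as a k-mer index of the reference followed by a very simple seed-and-extend lookup. This is an illustration only: the function names are hypothetical, positions are 0-based, the seed is just the first k bases of the read, and the extension step counts mismatches without allowing gaps, which is far simpler than what the aligners named above do.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int = 4):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read: str, reference: str, index, k: int = 4,
                    max_mismatches: int = 1):
    """Look up the read's first k-mer, then check the rest of the read at each hit."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits
```

With `reference = "ACGTACGTTAGC"`, the index entry for `ACGT` is `[0, 4]`, and the read `ACGTTA` is placed at position 4 with no mismatches; the candidate at position 0 is rejected because it needs two mismatches.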
  • 69. Burrows-Wheeler transform and FM index
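As the slide carries only a title, the following is a small, hedged sketch of what the two terms refer to. The transform is built naively from sorted rotations, and pattern counting uses the backward search that an FM index supports. The function names are hypothetical, and real aligners use precomputed rank tables and suffix array samples rather than the slow string counting shown here.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count matches of pattern in the original text by backward search."""
    # first[c]: index of the first sorted rotation that starts with c.
    first = {}
    for idx, c in enumerate(sorted(bwt_text)):
        first.setdefault(c, idx)

    def occ(c: str, i: int) -> int:
        """Occurrences of c in bwt_text[:i] (a real FM index tabulates this)."""
        return bwt_text[:i].count(c)

    lo, hi = 0, len(bwt_text)      # current range of matching rotations
    for c in reversed(pattern):    # extend the match one character to the left
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For instance, `bwt("banana")` gives `annb$aa`, and `count_occurrences(bwt("banana"), "ana")` gives 2.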
  • 70. Criteria for choosing an aligner
      • Global or local
      • Aligning short sequences to long sequences, such as short reads to a reference
      • Aligning long sequences to long sequences, such as long reads or contigs to a reference
      • Handles small gaps (insertions and deletions)
      • Handles large gaps (introns)
      • Handles split alignments (chimeras)
      • Speed and ease of use
  • 71. Short read aligners
      Aligner      Purpose
      Bowtie       Fast
      BWA          Small gaps (indels)
      GSNAP        Large gaps (introns)
      Bowtie 2     Takes care of gaps
      Long sequence aligners
      Aligner      Purpose
      BLAST        Many reference genomes
      BLAT         Large gaps (introns)
      BWA          Small gaps (indels)
      Exonerate    Ease of use
      GMAP         Large gaps (introns)
      MUMmer       Align two genomes
  • 72. RNA-seq alignment
      Three major challenges
      • Short reads (36-125 nt)
      • Error rates are considerable
      • Many reads span exon-exon junctions
      Alignment should be conducted on
      • Genome
      • Reference transcriptome
      Short read aligners either
      • allow no gaps, or
      • allow small gaps
      • Neither will work across intron regions
  • 73. RNA-seq alignment strategy: exon-first approach and seed-extend approach
      [Figure: reads from an RNA with Exon 1 and Exon 2. Exon-first panel: exon read mapping, then spliced read mapping. Seed-extend panel: seed matching with k-mer seeds, then seed extension.]
  • 74. Exon-first approach
      • Available tools: MapSplice, SpliceMap, TopHat
      • Two step procedure
          • Map reads contiguously using unspliced read aligners
          • Unmapped reads are split into shorter segments and aligned independently
      • Efficient when not too many reads fall across junctions
      • Second step is computationally intensive
      • Can miss reads across exon-intron junctions
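The two step procedure can be illustrated with a toy mapper: first try to place the whole read contiguously, and only if that fails split it into two segments and place them separately, with the second segment downstream of the first. This is a deliberately simplified sketch with exact matching only; the function name and the `min_anchor` parameter are hypothetical and do not describe how MapSplice, SpliceMap or TopHat work internally.

```python
def map_read_exon_first(read: str, genome: str, min_anchor: int = 3):
    """Return a list of (genome position, segment length) pairs, or [] if unmapped."""
    # Step 1: unspliced mapping of the whole read.
    pos = genome.find(read)
    if pos != -1:
        return [(pos, len(read))]

    # Step 2: split the read at every allowed cut point and map both segments.
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        # The right segment must start after the end of the left segment.
        right_pos = genome.find(right, left_pos + cut)
        if right_pos != -1:
            return [(left_pos, cut), (right_pos, len(right))]
    return []
```

With a toy genome `GATTACA` + `GTCCCCAG` + `TGCATGC` (exon, intron, exon), the read `ATTACA` maps contiguously at position 1, while the junction read `TACATGCA` is split into `TACA` at position 3 and `TGCA` at position 15.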
  • 75. Seed-extend approach
      • Representative algorithms
          • Genomic short read nucleotide alignment program (GSNAP)
          • Computing accurate spliced alignments (QPALMA)
      • Steps
          • Break reads into short seeds
          • Candidate regions are combined (for example with Smith-Waterman)
      • Increased sensitivity
      • One arm may not provide enough specificity for alignment
  • 76. Sequence assembly
  • 77. Assembly algorithms
      • Overlap, layout, consensus
      • De Bruijn graph (k-mer)
      • Burrows-Wheeler transform and FM index
      Source: genome, exome, clones and amplicons, transcriptome
      Assembly: reference-based assembly, de novo assembly
  • 78. Assembly algorithms
      Algorithm                           Purpose
      OLC                                 Small genome; long reads; handles indels
      De Bruijn graph                     Large genome; short high-quality reads; no indels
      BWT and Ferragina-Manzini index     Large genome; short or long reads; no indels (currently)
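To make the De Bruijn graph row concrete, the sketch below builds the graph from reads: every k-mer becomes an edge from its first k-1 bases to its last k-1 bases. It is an illustration with a hypothetical function name; real assemblers also count k-mer coverage, handle reverse complements and simplify the graph before reading contigs from it.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k: int = 3):
    """Build {prefix (k-1)-mer: [suffix (k-1)-mers]} from all k-mers in the reads."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])   # edge prefix -> suffix
    return graph
```

For the single read `ATGGC` with k = 3, the k-mers ATG, TGG and GGC give the edges AT -> TG, TG -> GG and GG -> GC; walking those edges spells the read back out.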
  • 79. Assemblers
      Algorithm                           Assemblers
      OLC                                 ARACHNE, CAP3, Celera Assembler, MIRA, Newbler, Phrap
      De Bruijn graph                     ABySS, ALLPATHS, SOAPdenovo, Velvet
      BWT and Ferragina-Manzini index     String Graph Assembler (SGA)
  • 80. Greedy
      • Find the two sequences with the largest overlap and merge them; repeat.
      • Flaw: prone to mis-assembly
      Overlap, layout, consensus
      • Overlap
          • Find all pairs of sequences that overlap.
      • Layout
          • Remove redundant and weak overlaps.
          • Merge pairs of sequences that overlap unambiguously; that is, pairs of sequences that overlap only with each other and with no other sequence.
      • Consensus
          • Call the consensus base at positions where reads overlap.
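The greedy strategy described on this slide can be sketched as a loop that repeatedly merges the pair of reads with the largest suffix-prefix overlap. This is a toy version under stated assumptions: overlaps must be exact, reads are taken from one strand only, and the function names and the `min_len` threshold are illustrative. Because it always takes the largest overlap first, repeats can lead it to the mis-assemblies the slide warns about.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 2):
    """Merge the best-overlapping pair until no overlap of min_len or more remains."""
    reads = list(dict.fromkeys(reads))          # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break                               # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads
```

For example, `greedy_assemble(["ATGGC", "GGCGT", "CGTAA"])` merges the reads into the single contig `ATGGCGTAA`.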