1. Gene Prediction Using Hidden Markov Model & Recurrent Neural Network
Ahmed Hani AlGhidani
MSc Student in Computer Science at Cairo University
Research and SDE at RDI Egypt
ahmed.hani@rdi-eg.com
2. Agenda
• DNA Structure
- Eukaryotic and Prokaryotic Cells
• Gene Prediction Methods
- Empirical Methods
- Ab initio Methods
• Hidden Markov Model
- Existing HMM-based systems
• Recurrent Neural Network
• Other Methods
7. Gene Prediction
• Identify the exon regions that are
translated into Amino Acids (Protein)
8. Gene Prediction (Cont.)
• Empirical methods are used specifically for
Prokaryotic cells
• Most Prokaryotic DNA is coding regions,
with no introns
• Feature Engineering approach
• Open Reading Frames (ORFs)
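The ORF idea above can be illustrated with a minimal sketch. It assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, scans only the forward strand, and uses a hypothetical `min_codons` threshold; none of these choices come from the slides.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    codons from the start codon up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # close it and keep scanning
    return sorted(orfs)

print(find_orfs("CCATGAAATTTTAGGG"))
```

A fuller version would also scan the reverse complement, which gives the usual six reading frames.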
10. Gene Prediction (Cont.)
• Pros
- Simple and easy to implement
- Works well with Prokaryotic DNA
because of its simplicity
• Cons
- Poor performance on long sequences
- Performs poorly on complex DNA such as
Eukaryotic DNA
11. Gene Prediction (Cont.)
• Ab initio methods are used for Eukaryotic cells
• Rely on statistical methods and
computational models
• Feature Engineering may be involved in
the computations
• Examples: Hidden Markov Model and Recurrent
Neural Networks
12. Hidden Markov Model
• Built on the idea of Markov Chains
• Finite set of states
• Transition Matrix
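A small sketch of a Markov chain over the four nucleotides may make the two ingredients above concrete. The transition probabilities are invented for illustration, and the function names are hypothetical.

```python
import random

STATES = ["A", "C", "G", "T"]

# Hypothetical transition matrix: TRANSITIONS[i][j] = P(next = j | current = i).
# Each row sums to 1.
TRANSITIONS = [
    [0.4, 0.2, 0.2, 0.2],
    [0.2, 0.4, 0.2, 0.2],
    [0.2, 0.2, 0.4, 0.2],
    [0.2, 0.2, 0.2, 0.4],
]

def sample_chain(length, start="A", seed=0):
    """Sample a state sequence; each step depends only on the current state."""
    rng = random.Random(seed)
    state = STATES.index(start)
    path = [STATES[state]]
    for _ in range(length - 1):
        state = rng.choices(range(len(STATES)), weights=TRANSITIONS[state])[0]
        path.append(STATES[state])
    return "".join(path)

def sequence_probability(seq):
    """Probability of seq given its first symbol: product of transitions."""
    p = 1.0
    for a, b in zip(seq, seq[1:]):
        p *= TRANSITIONS[STATES.index(a)][STATES.index(b)]
    return p

print(sample_chain(12))
print(sequence_probability("AAC"))  # P(A->A) * P(A->C)
```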
14. Hidden Markov Model (Cont.)
• In practice, the patterns or classes that we
want to predict often cannot be observed directly
• We need indicators (visible states) to
infer the hidden patterns
16. Hidden Markov Model (Cont.)
• Observation Probability Estimation (problem 1)
- Estimate the probability of an observation
sequence given the model
• Optimal Hidden State Sequence (problem 2)
- Determine the most likely sequence of
hidden states for the given observations
• HMM Parameter Estimation (problem 3)
- Find the model parameters that maximize
the probability of the observations
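The first problem in the list above (observation probability) is commonly solved with the forward algorithm, which the slides do not name. The sketch below assumes a toy two-state model; every probability in it is invented for illustration.

```python
STATES = ("exon", "intron")

# Hypothetical model parameters (not from the slides).
START_P = {"exon": 0.5, "intron": 0.5}
TRANS_P = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}
EMIT_P = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def forward(obs):
    """Return P(obs | model), summed over all hidden state paths."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: START_P[s] * EMIT_P[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {
            s: EMIT_P[s][symbol] * sum(alpha[r] * TRANS_P[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())

print(forward("ACGT"))
```

For long sequences the products underflow, so a practical version would work with log probabilities or rescale `alpha` at each step.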
17. Hidden Markov Model (Cont.)
• In Gene Prediction, the observations are
the A, C, G, T sequences, and the hidden
states are Exons, Introns and Other
• Use the training data to set the model
parameters (problem 3) using the
Baum-Welch algorithm
• For the given observations, we predict the
states (problem 2) using the Viterbi algorithm
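The decoding step can be sketched as follows, using the three hidden states named above. The parameters are set by hand and are purely hypothetical; in the workflow described on this slide they would come from Baum-Welch training, which is not shown here.

```python
import math

STATES = ("exon", "intron", "other")

# Hypothetical parameters: exons GC-rich, introns AT-rich, other uniform.
START_P = {"exon": 0.2, "intron": 0.2, "other": 0.6}
TRANS_P = {
    "exon": {"exon": 0.8, "intron": 0.15, "other": 0.05},
    "intron": {"exon": 0.15, "intron": 0.8, "other": 0.05},
    "other": {"exon": 0.1, "intron": 0.05, "other": 0.85},
}
EMIT_P = {
    "exon": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "other": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

def path_log_prob(obs, path):
    """Joint log-probability of an observation sequence and a state path."""
    lp = math.log(START_P[path[0]]) + math.log(EMIT_P[path[0]][obs[0]])
    for prev, cur, symbol in zip(path, path[1:], obs[1:]):
        lp += math.log(TRANS_P[prev][cur]) + math.log(EMIT_P[cur][symbol])
    return lp

def viterbi(obs):
    """Return the most likely hidden state path for obs."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START_P[s]) + math.log(EMIT_P[s][obs[0]]) for s in STATES}
    backpointers = []
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS_P[r][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(TRANS_P[best][s])
                        + math.log(EMIT_P[s][symbol]))
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

print(viterbi("GCGCGCATATATAT"))
```

Log probabilities are used so that long sequences do not underflow; this requires every probability in the model to be non-zero.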
20. Neural Network (Cont.)
• Relatively unexplored area in Bioinformatics
• No need for feature engineering
• Often outperforms classical Machine Learning
• Based on a biological philosophy!
25. Recurrent Neural Networks (Cont.)
• Exon/Intron prediction is still in progress
• Dataset size is 800K sequences
• Sequences aren’t fixed-size
• LSTM is used instead of a Vanilla RNN
• Implemented in TensorFlow
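The slides do not describe how the variable-length sequences are prepared for the network. One common approach, assumed here, is to one-hot encode each nucleotide and pad every batch to its longest sequence while keeping the true lengths so that padded steps can be masked. The helper names below are hypothetical and the sketch deliberately uses plain Python rather than any framework API.

```python
NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot vectors."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES]
            for base in seq.upper()]

def pad_batch(seqs):
    """Pad one-hot sequences with zero vectors up to the longest sequence.

    Returns (batch, lengths). `lengths` holds the true length of each
    sequence so a recurrent layer can ignore the padded time steps.
    """
    max_len = max(len(s) for s in seqs)
    batch, lengths = [], []
    for s in seqs:
        encoded = one_hot(s)
        lengths.append(len(encoded))
        padding = [[0.0] * len(NUCLEOTIDES) for _ in range(max_len - len(encoded))]
        batch.append(encoded + padding)
    return batch, lengths

batch, lengths = pad_batch(["ACG", "T"])
print(lengths)
print(batch[1])
```

The resulting batch has shape (number of sequences, longest length, 4), which is the layout recurrent layers typically expect.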
26. Other Methods
• Naive Bayes + Statistical Features
• Hidden Markov Model + Support Vector
Machine (HMM-SVM)
• Open Reading Frames + Hidden Markov
Model
• Open Reading Frames + Statistical
Features + Hidden Markov Model