2. Phylogeny is:
• the evolutionary history of a kind of organism
• the evolution of a genetically related
group of organisms, as distinguished
from the development of the
individual organism
• the history or course of the
development of something
3. • Phylogeny is the inference of evolutionary
relationships
• All forms of life share a common origin.
– here the goal is to deduce the correct trees for all
species of life
– to estimate the time of divergence between
organisms since the time they last shared a
common ancestor
6. Comparison of Speciation with the Genetic change
Species tree versus gene tree
• In a species tree an internal node represents a
speciation event
• In a gene tree an internal node represents the
divergence of an ancestral gene into two new
genes with distinct sequences
• Species tree ≠ gene tree, for example because of:
– horizontal gene transfer
– gene duplications
8. Phylogenetic Tree development steps
1. Selection of sequences or any parameter for
analysis
2. Multiple sequence alignment
3. Tree building
4. Tree evaluation
9. Selection of sequences for analysis
DNA:
– Higher phylogenetic signal:
• Synonymous vs. non-synonymous substitutions
(detect negative and positive selection)
Protein:
– Phylogenetic signal less predominant than in DNA
– Better for constructing a tree of evolutionarily distant
species or genes
RNA: rRNA is often used for constructing species
trees
10. Multiple sequence alignment
This is a critical step in the analysis as in many cases the alignment of
amino acids or nucleotides in a column implies that they share a common
ancestor
If you misalign a group of sequences you will still be able to produce a
tree. However, it is not likely to be biologically meaningful.
Crap in is crap out!
Inspect the alignment to be sure that all sequences are homologous
Sometimes ClustalW does not align distantly related sequences
well. Try different gap opening and extension parameters to improve the
alignment
Only use those columns of the multiple alignment for which you have data
for all organisms or sequences. Delete the columns for which this is not
the case.
Delete columns with gaps
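The column filtering described above can be sketched in Python. This is a minimal illustration, assuming the alignment is held as a list of equal-length strings and that gaps are written as "-" or "."; the function name drop_gapped_columns is hypothetical and not part of any package.

```python
def drop_gapped_columns(alignment, gap_chars="-."):
    """Return the alignment with every column that contains a gap removed.

    alignment: list of aligned sequences (strings of equal length).
    gap_chars: characters treated as gaps (an assumption of this sketch).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Keep only the column indices where no sequence has a gap.
    keep = [i for i in range(length)
            if all(seq[i] not in gap_chars for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]


aligned = ["ACG-TA", "AC-GTA", "ACGGTT"]
print(drop_gapped_columns(aligned))
```

Columns 3 and 4 of the toy alignment each contain a gap, so only four columns remain.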
12. Distance based methods
Distance based methods:
– calculate the distances between molecular sequences using
some distance measure
– A clustering method (UPGMA, neighbor joining) is used to infer
the tree from the pairwise distance matrix
– treat the sequences from a horizontal (parental) perspective, by
calculating a single distance between entire sequences
Advantage:
• Fast
• Allow using evolutionary models
Disadvantage:
• sequences reduced to one number
13. Character based methods
Character based methods:
– treat the sequences from a vertical (evolutionary)
perspective.
– they search, for each column of the alignment, for the
simplest explanation of how the characters
evolved.
– For instance, MP (Maximum Parsimony) involves a
search for the tree with the fewest amino
acid (or nucleotide) character changes that account
for the observed differences between the protein
(gene) sequences.
14. Tree evaluation: bootstrapping
• sampling technique for estimating the statistical
error in situations where the underlying sampling
distribution is unknown
• evaluating the reliability of the inferred tree - or
better the reliability of specific branches
How to proceed:
• From the original alignment, columns in the sequence alignment are
chosen at random ‘sampling with replacement’
• a new alignment is constructed with the same size as the original one
• a tree is constructed
This process is repeated hundreds of times.
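A minimal Python sketch of the resampling step, assuming the alignment is a list of equal-length strings. The function names and the seed parameter are illustrative choices, and the tree-building step applied to each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by sampling columns with replacement."""
    length = len(alignment[0])
    # Draw as many column indices as the original alignment has columns.
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in cols) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return n_replicates resampled alignments of the original size."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


aln = ["ACGT", "ACGA", "TCGA"]
replicates = bootstrap_replicates(aln, n_replicates=100)
print(len(replicates), replicates[0])
```

Each replicate keeps the rows (sequences) intact and only reshuffles which columns are present, so some original columns appear several times and others not at all.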
15. Evaluation
Show bootstrap values on Phylogenetic trees
• majority-rule consensus tree
• map bootstrap values on the original tree
• when evaluating bootstrap values, we check how
often a given branch (grouping) occurs among the
replicate trees. Rule of thumb used here: 60 out of
100 is significant, more than 50 is acceptable,
and below 50 is definitely rejected.
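Mapping bootstrap values onto the original tree amounts to counting how often each of its groupings recurs. The sketch below assumes each tree has already been reduced to a set of clades, with each clade a frozenset of taxon names; that representation and the function name clade_support are assumptions made for illustration.

```python
def clade_support(reference_clades, replicate_trees):
    """Percentage of replicate trees that contain each reference clade.

    reference_clades: iterable of frozensets of taxon names.
    replicate_trees: list of sets of frozensets, one set per bootstrap tree.
    """
    n = len(replicate_trees)
    return {
        clade: 100.0 * sum(clade in tree for tree in replicate_trees) / n
        for clade in reference_clades
    }


reference = [frozenset("AB"), frozenset("ABC")]
replicates = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AC"), frozenset("ABC")},
]
for clade, pct in clade_support(reference, replicates).items():
    print(sorted(clade), pct)
```

In this toy input the grouping of A with B appears in three of four replicates (75%), while the clade containing A, B and C appears in all of them.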
18. Pairwise distance methods
Approach:
• align pairs of sequences and count the number of
differences (Hamming distance).
• For an alignment of length N with n sites at which there
are differences: D = (n/N) × 100.
Problem:
• observed differences ≠ actual genetic distances between
the sequences.
=> dissimilarity underestimates the true
evolutionary distance, because some
sequence positions have undergone multiple substitution events
Solution:
• Use an evolutionary model that corrects for multiple
mutations
Distance calculation
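As an illustration of the problem and its fix, the sketch below computes the observed proportion of differing sites and then applies the Jukes-Cantor (JC69) correction for nucleotides. JC69 is chosen here only as the simplest example of an evolutionary model; the slides do not name a specific one.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq_a, seq_b))
    return diffs / len(seq_a)


def jukes_cantor(p):
    """JC69 distance: expected substitutions per site given proportion p."""
    if p >= 0.75:
        raise ValueError("JC69 is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the observed proportion, which reflects the underestimation described above.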
22. Pairwise distance methods
UPGMA Method (Unweighted Pair Group Method with
Arithmetic Mean):
This method is generally attributed to Sokal and Michener
• assumes a molecular clock, i.e. that all sequences
evolve at a similar rate
• distance = twice node height
• forces distances to be ultrametric (for any three
species, the two largest distances are equal)
• produces rooted tree (in this case root is incorrect
but topology is otherwise correct)
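The ultrametric condition can be checked directly on a distance matrix. A minimal sketch, assuming distances are stored in a dict keyed by frozenset pairs of taxon names (a representation chosen for this example):

```python
from itertools import combinations


def is_ultrametric(dist, taxa, tol=1e-9):
    """True if, for every triple of taxa, the two largest distances are equal."""
    for a, b, c in combinations(taxa, 3):
        d = sorted([dist[frozenset((a, b))],
                    dist[frozenset((a, c))],
                    dist[frozenset((b, c))]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True


clock_like = {frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6}
not_clock_like = {frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 8}
print(is_ultrametric(clock_like, "ABC"), is_ultrametric(not_clock_like, "ABC"))
```

Data that fail this check violate the molecular clock assumption, so a UPGMA tree built from them should be read with caution.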
23. Pairwise distance methods
• when two OTUs are grouped, we treat them as a new single OTU
• when OTUs A, B (which have been grouped before) and C are grouped into a
new node ‘u’, then the distance from node ‘u’ to any other node ‘k’ (e.g. grouping
D and E) is simply computed as follows:
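For reference, the standard UPGMA update is a size-weighted average: d(u,k) = (N_AB · d(AB,k) + N_C · d(C,k)) / (N_AB + N_C), where N_X is the number of OTUs in cluster X. The Python sketch below applies that rule; the dict-of-frozensets distance format and the nested-tuple tree output are assumptions made for illustration.

```python
def upgma(dist, taxa):
    """Cluster taxa with UPGMA.

    dist: dict mapping frozenset({a, b}) -> distance between a and b.
    Returns (tree, heights): tree is a nested tuple, heights maps each
    internal node to its height (half the distance at which it was joined).
    """
    sizes = {t: 1 for t in taxa}          # cluster label -> number of OTUs
    d = dict(dist)                        # working copy of the distances
    heights = {}
    while len(sizes) > 1:
        pair = min(d, key=d.get)          # closest pair of current clusters
        a, b = sorted(pair, key=str)      # fixed order for repeatable output
        new = (a, b)
        heights[new] = d.pop(pair) / 2    # distance = twice node height
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        for k in sizes:
            dak = d.pop(frozenset((a, k)))
            dbk = d.pop(frozenset((b, k)))
            # Size-weighted average of the two merged clusters' distances.
            d[frozenset((new, k))] = (size_a * dak + size_b * dbk) / (size_a + size_b)
        sizes[new] = size_a + size_b
    (root,) = sizes
    return root, heights


distances = {
    frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 4,
    frozenset("AD"): 6, frozenset("BD"): 6, frozenset("CD"): 6,
}
tree, heights = upgma(distances, "ABCD")
print(tree, heights[tree])
```

Weighting by cluster size means every original OTU contributes equally to the new distance, which is what "unweighted" refers to in the method's name.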
Tree inference: UPGMA
27. Pairwise distance methods
Advantages:
• Fast
• Allows incorporation of evolutionary models
Disadvantages:
• Assumption of a molecular clock
• Unrealistic evolutionary model, as all tips are
forced to be equally distant from the root.
28. Neighbor Joining
• Very popular method
• Does not assume a molecular clock: a
modified distance matrix is constructed to adjust
for differences in the evolutionary rate of each taxon
• Produces un-rooted tree
• Assumes additivity: distance between pairs of
leaves = sum of lengths of edges connecting them
• Like UPGMA, constructs tree by sequentially
joining sub-trees
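One joining step of neighbor joining can be sketched as follows. The criterion used is the standard one from Saitou and Nei's method in its common Q-matrix form, Q(i,j) = (n − 2)·d(i,j) − Σ d(i,k) − Σ d(j,k); the function name, the input format and the five-taxon example distances are assumptions of this sketch.

```python
from itertools import combinations


def nj_pick_pair(dist, taxa):
    """Choose the pair to join in one NJ step and its two branch lengths.

    dist: dict mapping frozenset({a, b}) -> distance.
    Returns ((i, j), q_value, (length_i, length_j)).
    """
    n = len(taxa)
    # Row sums: total distance from each taxon to all others.
    total = {t: sum(dist[frozenset((t, k))] for k in taxa if k != t)
             for t in taxa}

    def q(i, j):
        # Rate-adjusted criterion: small for pairs that are close to each
        # other but far from everything else.
        return (n - 2) * dist[frozenset((i, j))] - total[i] - total[j]

    i, j = min(combinations(taxa, 2), key=lambda p: q(*p))
    dij = dist[frozenset((i, j))]
    length_i = dij / 2 + (total[i] - total[j]) / (2 * (n - 2))
    return (i, j), q(i, j), (length_i, dij - length_i)


d = {frozenset(p): v for p, v in {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}.items()}
print(nj_pick_pair(d, ["a", "b", "c", "d", "e"]))
```

In this example d and e have the smallest raw distance, yet a and b are joined first, which shows how the adjusted matrix differs from simple nearest-pair clustering as in UPGMA.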
29. Pairwise distance methods
• Additive distances can be fitted to an unrooted
tree such that the evolutionary distance
between a pair of OTUs equals the sum of the
lengths of the branches connecting them, rather
than being an average as in the case of cluster
analysis
• Tree construction methods: the neighbour
joining (NJ) method, developed by Saitou and
Nei (1987), offers a heuristic approach to solve
this problem
Tree inference: neighbor joining
33. Pairwise distance methods
Advantages:
• Fast
• Allows incorporation of evolutionary models
• No assumption of a molecular clock
Disadvantages:
• The constructed tree is a heuristic estimate
and may not correspond to the
true tree
34. Maximum parsimony
Principle
• Select the tree that minimizes the total tree length, i.e. the
number of nucleic acid substitutions or amino acid replacements
required to explain a given set of data.
Method
• a particular topology is considered
• for this topology, the ancestral sequences at each branching point
are reconstructed
• the minimum number of events to explain the sequence differences
over the whole tree is computed: the minimum number of
substitutions is computed for each nucleotide (or amino acid) site,
and the numbers for all sites are added.
• another tree topology is chosen
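The per-site minimum number of changes for a fixed topology can be computed with Fitch's algorithm, sketched below for rooted binary trees. The nested-tuple tree format and the dict-based alignment are assumptions made for this example; searching over topologies is not shown.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping leaf name -> character observed at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch_site(left, states)
    rset, rcost = fitch_site(right, states)
    common = lset & rset
    if common:
        # The children can share a state, so no change is needed here.
        return common, lcost + rcost
    # No shared state: one substitution must occur below this node.
    return lset | rset, lcost + rcost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site minimum changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


aln = {"A": "AAC", "B": "AAC", "C": "GTC", "D": "GTA"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, aln), parsimony_score(tree_2, aln))
```

The topology with the lower score is preferred; here grouping A with B and C with D needs fewer changes than the alternative.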
35. Maximum parsimony
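Number of rooted tree topologies for n OTUs:
$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
Number of unrooted tree topologies for n OTUs:
$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
OTUs   rooted tree topologies   unrooted tree topologies
3      3                        1
4      15                       3
5      105                      15
6      945                      105
7      10395                    945
8      135135                   10395
9      2027025                  135135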
• Exhaustive search impossible
• Heuristics needed
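The growth in the number of topologies can be reproduced with a short Python sketch. It uses the product form (2n − 5)!! = 3 × 5 × … × (2n − 5) for unrooted binary trees, and the fact that rooting adds one more branch choice so that the rooted count for n OTUs equals the unrooted count for n + 1; the function names are illustrative.

```python
def count_unrooted(n):
    """Number of unrooted binary tree topologies for n >= 3 OTUs."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


def count_rooted(n):
    """Number of rooted binary tree topologies for n >= 2 OTUs."""
    return count_unrooted(n + 1)


for n in range(3, 10):
    print(n, count_rooted(n), count_unrooted(n))
print(20, count_unrooted(20))
```

The count for 20 OTUs already has more than twenty digits, which is why tree search relies on heuristics rather than enumeration.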
37. Maximum Parsimony
Assumptions
• Equal rate of evolution in all branches
Advantages
• sequence information is not reduced to one number (such
as for example in pairwise distance methods)
Disadvantages of maximum parsimony methods
• can be slow for very large datasets
• no correction for multiple mutations, i.e. no substitution
model can be applied
• sensitive to unequal rates of evolution in different lineages
41. Comparison for the Character based Methods
Parsimony vs. Maximum Likelihood
There is an efficient algorithm to calculate the parsimony score for a
given topology, therefore parsimony is faster than ML.
Parsimony is an approximation to ML when mutations are rare
events.
Weighted parsimony schemes can be used to treat most of the
different evolutionary models used with ML.
Parsimony throws away information from non-informative sites
that is informative in ML and distance matrix methods.
Parsimony gives little information about branch lengths.
Parsimony is inconsistent in certain cases (Felsenstein zone), and
suffers badly from long branch attraction.
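To make the point about non-informative sites concrete: under the usual definition, a site is parsimony-informative only if at least two different character states each occur in at least two sequences. A minimal Python sketch of that test, with a hypothetical function name:

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two states each appear in at least two sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


for column in ["AAGG", "AAAG", "AAAA", "ACGT"]:
    print(column, is_parsimony_informative(column))
```

Only the first column is informative for parsimony. The second still records one substitution, which likelihood and distance methods use when estimating branch lengths.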
43. Commonly used Phylogeny packages
• 369 phylogeny packages
(http://evolution.gs.washington.edu/phylip/software.html) and 54 free
servers (as of Sep 30, 2011)
– Phylip (general package, protdist, NJ, parsimony, maximum likelihood,
etc)
– PAUP (parsimony)
– PAML (maximum likelihood)
– TreePuzzle (quartet based)
– PhyML (maximum likelihood)
– MrBayes
– MEGA (biologist-centric)