Phylogenetic analysis

PRESENTATION BY
Mr. Nitin Maruti Naik
(M. Sc. SET, PGDBI, PGDGC & CP.)

INTRODUCTION
• A phylogenetic tree, also known as a
phylogeny, is a diagram that depicts the lines
of evolutionary descent of different species,
organisms, or genes from a common ancestor.
– Attempt to reconstruct evolutionary ancestors
– Estimate time of divergence from ancestor

• Can be used to solve a number of interesting
problems
– Forensics
• HIV virus mutates rapidly
– Predicting evolution of influenza viruses
– Predicting functions of uncharacterized genes -
orthologue detection
– Drug discovery
– Vaccine development
• Target inferred common ancestor

Objectives
• Evolution,
• Elements of phylogeny,
• Methods of phylogenetic analysis,
• Phylogenetic tree of life,
• Comparison of genetic sequence of
organisms,
• Phylogenetic analysis tools-
– Phylip,
– ClustalW.

Evolution
• Speciation
– Evolution of new organisms is driven by
• Mutations
– The DNA sequence can be changed due to single base changes,
deletion/insertion of DNA segments, etc.
• Selection bias
– Speciation events lead to creation of different species.
– Speciation caused by physical separation into groups where
different genetic variants become dominant
• Any two species share a (possibly distant) common ancestor
• The molecular clock hypothesis

A phylogenetic tree
• A phylogenetic tree is a graph reflecting the
approximate distances between a set of objects
(species, genes, proteins, families) in a hierarchical
fashion
▪ Leaves - current species, or sequences in current
species
▪ Internal nodes - hypothetical common ancestors
▪ Branch (edge) lengths - “time” from one
speciation to the next (each branching represents
a speciation into two new species)
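
As a minimal sketch of this structure, a rooted tree can be stored as nested nodes, with leaves standing for current species and internal nodes for hypothetical ancestors. The `Node` class, the taxon names and the branch lengths below are made up for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a rooted tree; `length` is the branch leading to it."""
    name: Optional[str] = None      # set for leaves, None for ancestors
    length: float = 0.0             # branch length from the parent node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaf_depths(node: Node, depth: float = 0.0) -> dict:
    """Return {leaf name: summed branch length from the root}."""
    depth += node.length
    if node.is_leaf():
        return {node.name: depth}
    result = {}
    for child in node.children:
        result.update(leaf_depths(child, depth))
    return result


# Hypothetical 4-taxon tree: ((A:1, B:1):2, (C:2, D:2):1)
tree = Node(children=[
    Node(length=2.0, children=[Node("A", 1.0), Node("B", 1.0)]),
    Node(length=1.0, children=[Node("C", 2.0), Node("D", 2.0)]),
])

print(leaf_depths(tree))
```

In this made-up tree every leaf is the same distance from the root, which is what a constant rate of change along all branches would look like.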

Fig. Example of a Rooted Tree for 4 Taxa (Taxon A to Taxon D), labelling the root of the tree, interior nodes (vertices), terminal nodes (leaves), branches (edges) and a split (bipartition), with branch lengths shown on the edges

Fig. Example of an Unrooted Tree for 4 Taxa (branch lengths shown on the edges)

Fig. Terms Used in representing a phylogenetic Tree: root of tree, interior node (vertex), terminal node (leaf), branch (edge) and split (bipartition), shown on a tree with taxa A to E

Methods of phylogenetic analysis

Methods for constructing phylogenetic
trees
Distance Methods
• Also called Phenetic
• Trees are constructed by
similarity of sequences.
• Tree is called Dendrogram
• Does not necessarily reflect
evolutionary relationship.
• E.g.
– UPGMA clustering,
– Neighbour Joining
– Fitch-Margoliash
Character Methods
• Also called Cladistic
• Trees are calculated by considering
the various possible pathways of
evolution.
• Based on parsimony or likelihood
methods
• Tree is called Cladogram.
• Use each alignment position as
evolutionary information to build a
tree.
• E.g.
– Maximum parsimony
– Maximum likelihood
– Bayesian

Distance Methods
• UPGMA clustering,
• Neighbour Joining
• Fitch-Margoliash

UPGMA
(Unweighted Pair Group Method with Arithmetic mean)
➢ Originally developed for numerical taxonomy in 1957-1958 by
Sokal and Michener.
➢ Uses a sequential clustering algorithm.
➢ The oldest distance method.
➢ Produces rooted trees.
➢ Assumes that the tree is ultrametric, meaning that
it assumes a constant rate of substitution in all branches
of the tree.

This method follows a clustering procedure:
(1) Assume that initially each species is a cluster
on its own.
(2) Join the two closest clusters and recalculate
the distance from the joined pair to every other
cluster by taking the average.
(3) Repeat this process until all species are
connected in a single cluster.

Employs a sequential clustering algorithm
I. Identify the two operational taxonomic units (OTUs), from
among all the OTUs, that are most similar to each other and
then treat these as a new single OTU.
II. Subsequently, from among the new group of OTUs,
identify the pair with the highest similarity, and so on.
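
The clustering procedure above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not code from any package: the function name `upgma`, the nested-tuple tree representation and the distance values are all made up for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa with UPGMA.

    names: list of taxon labels
    dist:  dict mapping (a, b) tuples to the distance between taxa a and b
    Returns (tree, merges): the tree as nested tuples, and the list of
    joins as (cluster_a, cluster_b, height_of_new_node).
    """
    sizes = {n: 1 for n in names}            # cluster label -> number of leaves
    d = {frozenset(k): v for k, v in dist.items()}
    merges = []
    new = names[0]
    while len(sizes) > 1:
        # Find the closest pair of clusters
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2    # ultrametric: node sits halfway
        na, nb = sizes.pop(a), sizes.pop(b)
        new = (a, b)
        # Distance from the joined pair to every other cluster: the
        # size-weighted mean equals the arithmetic mean over all leaf pairs
        for c in sizes:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[new] = na + nb
        merges.append((a, b, height))
    return new, merges


# Hypothetical distance matrix for four taxa
distances = {
    ("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0,
    ("A", "D"): 6.0, ("B", "D"): 6.0, ("C", "D"): 6.0,
}
tree, merges = upgma(["A", "B", "C", "D"], distances)
print(tree)
for a, b, height in merges:
    print(a, b, height)
```

With these made-up distances, A and B are joined first, then C, then D, at node heights 1, 2 and 3.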

• Advantage
– Fast
– Can handle many sequences
• Disadvantage
– Cannot be used when rates of substitutions are
unequal
– Does not consider multiple substitutions.

NEIGHBOUR JOINING METHOD
• Neighbor-joining methods apply general data
clustering techniques to sequence analysis
using genetic distance as a clustering metric.
• Developed in 1987 by Saitou and Nei.
• The simple neighbor-joining method produces
unrooted trees and, unlike UPGMA, does not
assume a constant rate of evolution (i.e., a
molecular clock) across lineages.

• It begins with an unresolved, star-like tree.
• Each pair of taxa is evaluated for being joined, and the
sum of all branch lengths of the resulting tree is
calculated.
• The pair that yields the smallest sum is considered
the closest neighbors and is thus joined.
• A new branch is inserted between the joined pair and the
rest of the tree, and the branch lengths are
recalculated.
• This process is repeated until all taxa have been
joined and the tree is fully resolved.
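
One joining step can be sketched as follows. The slides describe choosing the pair that minimizes the total branch length; the sketch below uses the Q-criterion, which is the commonly used equivalent way of making that choice and is an assumption here rather than something stated above. The function name `nj_pick_pair` and the distance matrix are hypothetical.

```python
def nj_pick_pair(names, d):
    """One neighbour-joining step: choose the pair of taxa to join.

    d is a symmetric matrix (list of lists) of pairwise distances.
    Returns (name_i, name_j, branch_i, branch_j), where the branch
    lengths run from each taxon to the new internal node.
    """
    n = len(names)
    r = [sum(row) for row in d]              # total distance of each taxon
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q-criterion: small when i and j are close to each other
            # but far from everything else
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    branch_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    branch_j = d[i][j] - branch_i
    return names[i], names[j], branch_i, branch_j


# Hypothetical distances between five taxa
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(names, matrix))
```

Note that the pair chosen (a and b) is not the pair with the smallest raw distance (d and e); the two branch lengths also differ, which is how the method avoids assuming a molecular clock.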

DRAWBACKS
• NJ produces only one tree and neglects
other possible trees, which might be as good
as the NJ tree, if not significantly better.
• Moreover, since errors in distance estimates
are exponentially larger for longer distances,
under some conditions this method will yield a
biased tree.

FITCH – MARGOLIASH METHOD
• Proposed in 1967
• Produces unrooted trees
• Criteria for fitting trees to distance matrices
• Uses a weighted least squares method for
clustering based on genetic distance.
• Closely related sequences are given more weight
in the tree construction process to correct for the
increased inaccuracy in measuring distances
between distantly related sequences.
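
The effect of the weighting can be illustrated with a small sketch. It assumes the usual form of the criterion, in which each squared difference between an observed distance and the corresponding tree distance is divided by the square of the observed distance; the function name and all numbers are made up.

```python
def fitch_margoliash_error(observed, fitted):
    """Weighted least-squares misfit between observed distances D and
    tree (path-length) distances d: the sum of (D - d)^2 / D^2 over pairs."""
    return sum(
        (observed[pair] - fitted[pair]) ** 2 / observed[pair] ** 2
        for pair in observed
    )


# Hypothetical observed distances: A and B are close, C is distant
observed = {("A", "B"): 2.0, ("A", "C"): 10.0, ("B", "C"): 10.0}

# Two candidate fits, each off by 1.0 on a single pair
off_on_short = {("A", "B"): 3.0, ("A", "C"): 10.0, ("B", "C"): 10.0}
off_on_long = {("A", "B"): 2.0, ("A", "C"): 11.0, ("B", "C"): 10.0}

print(fitch_margoliash_error(observed, off_on_short))
print(fitch_margoliash_error(observed, off_on_long))
```

The same absolute error costs far more on the short A-B distance than on the long A-C distance, so the fit is dominated by the closely related sequences.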

Character Based methods
• Maximum Parsimony
• Maximum Likelihood (ML)

Maximum Parsimony
(Fitch, 1977)
• Parsimony – carefulness in the use of resources.
• The basic underlying principle behind parsimony is
given by Occam’s Razor:
• “Given a choice between a hard and an easy way of
doing things, nature will always pick the easiest way, i.e.
simple is always preferred over complex.”
• Parsimony assumes that the relationship that requires
the fewest number of mutations to explain the current
state of sequences being considered is the relationship
that is most likely to be correct.

Concept of parsimony
• The concept of parsimony is at the heart of all character-
based methods of phylogenetic reconstruction.
• The 2 fundamental ideas of biological parsimony are:
– Mutations are exceedingly rare events;
– The more unlikely events a model invokes, the less likely the
model is to be correct.
• As a result, the relationship that requires the fewest
number of mutations to explain the current state of
the sequences being considered, is the
relationship that is most likely to be
correct.

Example
• For a parsimony approach, the positions of a multiple sequence alignment
fall into two categories in terms of their information content: those that
carry information (informative) and those that do not (uninformative).
Example:
  Sequence   Position:  1 2 3 4 5 6
  1                     G G G G G G
  2                     G G G A G T
  3                     G G A T A G
  4                     G A T C A T
• Position 1 is invariant and therefore uninformative, because all trees
invoke the same number of mutations (0).
• Position 2 is uninformative because 1 mutation occurs in all three possible
trees.
• Positions 3 and 4 are also uninformative: position 3 requires 2 mutations
and position 4 requires 3 mutations in all possible trees.
• Positions 5 and 6 are informative, because one of the trees invokes only one
mutation and the other 2 alternative trees both require 2 mutations.
• In general, for a position to be informative regardless of how many sequences
are aligned, it has to have at least 2 different nucleotides, and each of these
nucleotides has to be present at least twice.
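
The rule in the last bullet can be checked mechanically. The sketch below applies it to the four example sequences; the function name `informative_sites` is made up for illustration.

```python
from collections import Counter

# The four example sequences, one string per sequence
alignment = ["GGGGGG", "GGGAGT", "GGATAG", "GATCAT"]


def informative_sites(seqs):
    """Return the 1-based positions that are parsimony-informative."""
    sites = []
    for pos, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        # At least two different nucleotides, each present at least twice
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites


print(informative_sites(alignment))  # [5, 6]
```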

• The maximum parsimony algorithm searches
for the minimum number of genetic events
(nucleotide substitutions or amino acids
changes) to infer the most parsimonious tree
from a set of sequences.
• The best tree is one which needs fewest
changes.
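
For a fixed tree, the minimum number of changes at each site can be counted with Fitch's algorithm. The sketch below is a minimal illustration, assuming rooted binary trees written as nested tuples of sequence indices (a made-up representation); it scores the three possible groupings of the four example sequences used earlier.

```python
def fitch(tree, column):
    """Fitch's algorithm at one site.

    tree:   nested 2-tuples of sequence indices, e.g. ((0, 1), (2, 3))
    column: the nucleotide carried by each sequence at this site
    Returns (possible states at this node, number of changes below it).
    """
    if isinstance(tree, int):
        return {column[tree]}, 0
    left_states, left_cost = fitch(tree[0], column)
    right_states, right_cost = fitch(tree[1], column)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed at this node
    return left_states | right_states, left_cost + right_cost + 1


def tree_length(tree, seqs):
    """Total number of changes over all alignment positions."""
    return sum(fitch(tree, column)[1] for column in zip(*seqs))


alignment = ["GGGGGG", "GGGAGT", "GGATAG", "GATCAT"]
topologies = {
    "(1,2),(3,4)": ((0, 1), (2, 3)),
    "(1,3),(2,4)": ((0, 2), (1, 3)),
    "(1,4),(2,3)": ((0, 3), (1, 2)),
}
for label, topology in topologies.items():
    print(label, tree_length(topology, alignment))
```

For this alignment the first two groupings need 9 changes each and the third needs 10, so parsimony rejects the third but cannot choose between the first two: position 5 favours one and position 6 the other.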

• Maximum Parsimony (positive points):
– Does not reduce sequence information to a single number
– Tries to provide information on the ancestral sequences
– Evaluates different trees
• Maximum Parsimony (negative points):
– Is slow in comparison with distance methods
– Does not use all the sequence information (only
informative sites are used)
– Does not correct for multiple mutations (does not imply a
model of evolution)
– Does not provide information on the branch lengths

Bootstrapping
• Bootstrap analysis:
– is a statistical method for obtaining an
estimate of error;
– is used to evaluate the reliability of a tree;
– is used to examine how often a particular
cluster in a tree appears when nucleotide or
amino acid positions are resampled.
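
The resampling step can be sketched as follows: alignment columns are drawn with replacement until the replicate is as long as the original alignment. This is a minimal illustration only; the function name, the fixed random seed and the number of replicates are arbitrary choices, and building a tree from each replicate is left out.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    n_sites = len(seqs[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in seqs]


alignment = ["GGGGGG", "GGGAGT", "GGATAG", "GATCAT"]
rng = random.Random(42)          # fixed seed so the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]

print(replicates[0])
```

A tree would then be built from every replicate, and the support for a cluster is the fraction of replicate trees in which that cluster appears.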

Maximum likelihood
• This approach is a purely statistical method.
• Probabilities are considered for every individual
nucleotide substitution in a set of aligned sequences.
• Since transitions are observed roughly 3 times as often as
transversions, it can be reasonably argued that, at a site
where three sequences carry C, T and G, the sequences with
C and T are more likely to be closely related to each other
than either is to the sequence with G.
• Calculation of probabilities is complicated by the fact that
the sequence of the common ancestor of the sequences
considered is unknown.
• Furthermore, multiple substitutions may have occurred at
one or more sites, and all sites are not necessarily
independent or equivalent.
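
The idea of assigning a probability to every site can be shown in the simplest possible setting: two aligned sequences and a single branch between them. The sketch below assumes the Jukes-Cantor model, which treats all substitutions as equally likely and so ignores the transition/transversion bias mentioned above; it is chosen only because it is short. The sequences and function name are made up.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences separated by distance d
    (expected substitutions per site) under the Jukes-Cantor model."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e     # probability that a site shows no change
    p_diff = 0.25 - 0.25 * e     # probability of one specific other base
    logl = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the assumed frequency of the base in the first sequence
        logl += math.log(0.25 * (p_same if x == y else p_diff))
    return logl


seq_a = "ACGTACGTAC"
seq_b = "ACGTACGTAT"             # one difference in ten sites

# Crude search for the distance that maximises the likelihood
grid = [i / 1000 for i in range(1, 1001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))

# Known closed-form Jukes-Cantor distance for comparison
p = 0.1
closed_form = -0.75 * math.log(1 - 4 * p / 3)
print(best_d, closed_form)
```

The estimated distance is slightly larger than the observed proportion of differences (0.1), because the model allows for multiple substitutions at the same site. A full ML tree search repeats this kind of calculation over many branches and many candidate trees, which is why it is slow.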

• Notes :
• 1. This is the best justified method from a
theoretical viewpoint;
• 2. ML estimates the branch lengths of the final
tree ;
• 3. ML methods are usually consistent ;
• 4. Sequence simulation experiments have shown
that this method works better than all others in
most cases.
• Drawback: ML methods need a long computation time to
construct a tree.

Advantages and disadvantages of character
based methods
• Advantages
– MP tries to provide information on the ancestral
sequences
– ML tends to outperform alternative methods such as
parsimony or distance methods even with very short
sequences
• Disadvantages
– Slow in comparison with distance methods
– MP does not use all the sequence information
– ML result is dependent on the model of evolution
used

Applications
• There is a wide array of applications of
phylogenetic analysis, which include:
– Evolution studies
– Medical research and epidemiology
– Ecology
– Forensics (criminal investigations)
– Finding orthologues and paralogues

Comparison of genetic sequence of organisms

Phylogenetic analysis tools-
• Phylip,
• ClustalW/X.

PHYLIP (Phylogeny Inference Package)
http://evolution.genetics.washington.edu/phylip.html
• Available free for Windows, Mac OS and Linux
systems
• Parsimony, distance matrix and likelihood
methods (bootstrapping and consensus trees)
• Data can be molecular sequences, gene
frequencies, restriction sites and fragments,
distance matrices and discrete characters

• PHYLIP (the PHYLogeny Inference Package) is a package of programs for
inferring phylogenies (evolutionary trees).
• It is available free over the Internet, and written to work on as many
different kinds of computer systems as possible.
• The source code is distributed (in C), and executables are also distributed.
• In particular, already-compiled executables are available for Windows
(95/98/NT/2000/me/xp/Vista), Mac OS X, and Linux systems.
• Older executables are also available for Mac OS 8 or 9 systems.
• Complete documentation is available in the documentation files that come
with the package.

The Phylip Manual
• is an excellent source of information.
• Brief one-line descriptions of the programs are also available.
• The easiest way to run PHYLIP programs is via a command
line menu (similar to clustalw).
• The program is invoked by clicking on an icon, or by typing the
program name at the command line:
    > protdist
    > neighbor
• If there is no file called infile, the program responds with:
    [gogarten@carrot gogarten]$ seqboot
    seqboot: can't find input file "infile"
    Please enter a new file name>

Methods
• Methods that are available in the package include
– parsimony,
– distance matrix, and
– likelihood methods, including
• bootstrapping and
• consensus trees.
• Data types that can be handled include
– molecular sequences,
– gene frequencies,
– restriction sites and fragments,
– distance matrices, and
– discrete characters.

Programs
• The programs are controlled through a menu, which asks the users
which options they want to set, and allows them to start the
computation.
• The data are read into the program from a text file, which the user
can prepare using any word processor or text editor (but it is
important that this text file not be in the special format of that
word processor -- it should instead be in "flat ASCII" or "Text Only"
format).
• Some sequence analysis programs such as the ClustalW alignment
program can write data files in the PHYLIP format.
• Most of the programs look for the data in a file called "infile" -- if
they do not find this file they then ask the user to type in the file
name of the data file.
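
As a hedged sketch of what such a text file looks like for sequence data, the snippet below formats a small alignment in the sequential PHYLIP layout: a first line with the number of sequences and the number of sites, then one line per sequence with the name padded to 10 characters. The function name `to_phylip` and the sequence names are made up, and the program documentation remains the authority on the exact format each program accepts.

```python
def to_phylip(records):
    """Format (name, sequence) pairs as a sequential PHYLIP alignment."""
    n_sites = len(records[0][1])
    if any(len(seq) != n_sites for _, seq in records):
        raise ValueError("all sequences must have the same aligned length")
    lines = [f" {len(records)} {n_sites}"]
    for name, seq in records:
        # The name field is exactly 10 characters wide, padded with spaces
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


records = [
    ("seq1", "GGGGGG"),
    ("seq2", "GGGAGT"),
    ("seq3", "GGATAG"),
    ("seq4", "GATCAT"),
]
text = to_phylip(records)
print(text)
# Saving this text as plain ASCII in a file named "infile" lets the
# programs find it without prompting for a file name.
```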

• Output is written onto special files with names like "outfile" and
"outtree".
• Trees written onto "outtree" are in the Newick format, an informal
standard agreed to in 1986 by authors of a number of major
phylogeny packages.
• At this stage we do not have a mouse-windows interface for PHYLIP.
• PHYLIP is probably the most widely-distributed phylogeny package.
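
To show what a Newick string looks like, here is a minimal sketch that writes a small tree in that notation: nested parentheses for groupings, a colon before each branch length and a closing semicolon. The `(content, branch_length)` tuple representation and the example tree are made up for illustration.

```python
def to_newick(root):
    """Serialise a tree of (content, branch_length) pairs to Newick.

    `content` is a taxon name for a leaf, or a list of child pairs
    for an internal node. The root's own branch length is ignored.
    """
    def fmt(node):
        content, length = node
        if isinstance(content, str):
            label = content
        else:
            label = "(" + ",".join(fmt(child) for child in content) + ")"
        return f"{label}:{length:g}"

    children, _ = root
    return "(" + ",".join(fmt(child) for child in children) + ");"


# Hypothetical 4-taxon tree
tree = (
    [
        ([("A", 1.0), ("B", 1.0)], 2.0),
        ([("C", 2.0), ("D", 2.0)], 1.0),
    ],
    0.0,
)
print(to_newick(tree))  # ((A:1,B:1):2,(C:2,D:2):1);
```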

• It is the sixth most frequently cited phylogeny package,
after MrBayes, PAUP*, RAxML, Phyml, and MEGA.
• PHYLIP is also the oldest widely-distributed package.
• It has been in distribution since October, 1980, and has
over 30,000 registered users.
• It is still being updated.

ClustalW
• www.ebi.ac.uk/clustalw/
• Clustal is a progressive multiple sequence
alignment (MSA) program available either as a
stand-alone or as an online program.
• Clustal is a widely used multiple sequence
alignment computer program.

The latest version is 2.0. There are two
main variations:
• ClustalW: command line interface
• ClustalX: This version has a graphical user
interface.
• It is available for Windows, Mac OS, and
Unix/Linux.
• This program is available from the Clustal
Homepage or European Bioinformatics Institute
ftp server.

There are three main steps:
• Perform pairwise alignments of all the sequences
• Create a phylogenetic tree (or use a user-defined
tree)
• Use the phylogenetic tree to carry out a
multiple alignment
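
The first step can be illustrated with a simplified pairwise scorer. This is a sketch of plain Needleman-Wunsch global alignment with made-up match, mismatch and gap values and a linear gap penalty; it is not Clustal's actual scoring scheme, which relies on substitution matrices and more elaborate gap penalties.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


sequences = {"s1": "GATTACA", "s2": "GATTACA", "s3": "GACTATA"}
names = list(sequences)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        print(x, y, nw_score(sequences[x], sequences[y]))
```

Scores like these, converted to distances, are what the second step uses to build the guide tree that sets the order in which sequences are added to the multiple alignment.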
