1. Introduction to Multivariate Data Analysis (MVA)
o Introduction to exploring data with MVA
o Tutorial on using Excel to perform multivariate analysis
2. What is Multivariate analysis?
• ‘Multivariate’ means data represented by two or more variables
  e.g. height, weight and gender of a person
• The majority of datasets collected in biomedical research are multivariate
• These datasets nearly always contain ‘noise’
• The aim of exploratory MVA is to discover patterns that exist within the data despite the noise
  e.g. patterns may be subgroups of patients with a certain disease
• When we apply MV methods we study:
  • Variation in each of these variables
  • Similarity or distance between variables
• In MVA we work in multidimensional space
4. Data types
Data in a variable can be:

Numerical:            0, 1, 2, 3… or 0.1, 0.2, 0.3…   e.g. height, gene expression level
Categorical (factor): A, B, AB, O…                     e.g. blood group
                      0, 1, 2, 3…                      e.g. immunohistochemistry score
                      0 or 1                           e.g. survival (0 = dead; 1 = alive)

Multivariate datasets can contain mixed data types:

                  P1    P2    P3    P4    P5
Numerical    V1  77.2  74.2  66.6  28.9   3.5
             V2  91.6  66.9  49.6   0.2   3.9
             V3  41.9  21.2  71.2  17.7   4.1
Categorical  V4   0     1     0     1     1
             V5   A     A     C     E     B
5. There are different categories of MVA methods
MVA methods fall into two broad families: multivariate statistics and machine learning.

Multivariate statistical methods divide into:
• Exploratory – find underlying patterns in the data, determine groups (e.g. similar genes), generate hypotheses
• Modelling & classification – create models (e.g. predict cancer), classify groups (e.g. a new cancer subgroup)

We will look at multivariate statistical methods for exploratory analysis.
6. Main categories of Exploratory MVA methods that we will look at
Exploratory multivariate analysis methods

• Clustering
  • Tree based: Hierarchical Cluster Analysis (HCA)
  • Partition: K-Means, Partition Around Medoids (PAM)
• Data Reduction
  • Principal Components Analysis (PCA)

All these methods allow good visualization of patterns in your data.
7. Commonly used software for multivariate analysis in academia
Commercial:
SPSS - Limited
Minitab - Limited
Matlab - Comprehensive
Free & open source:
R - Comprehensive
Octave - Comprehensive
WEKA - Comprehensive
Many other (more limited) free software packages available here:
http://www.freestatistics.info/en/stat.php
This lecture focuses on how we can use R directly from within Microsoft Excel
8. R Statistical Analysis & Programming Environment
Download here: http://cran.r-project.org/
Introductory book: http://cran.r-project.org/doc/manuals/R-intro.pdf
Recommended book: R for Medicine and Biology, Jones & Bartlett, 2009
10. You can use R directly from Excel
Excel and R can be linked by installing a piece of ‘middleware’ called RCom (see next slide).
Combining Excel and R provides you with an environment for complete data processing and analysis:
1. Use Excel to put your data together
2. Use a menu in Excel to analyse your data in R
3. Open the Demo Workbook
4. Use this workbook to analyze your data
11. Full instructions for downloading and installing R for Excel
1. Download and install R and other software you need to use R in Microsoft Excel:
http://cancerinformatics.swansea.ac.uk/pathology/pmm23/rexcel.htm
** PM-M23 Students – You should already have installed this software in Week 2 **
2. Download the Excel Workbook that accompanies the lecture:
http://cancerinformatics.swansea.ac.uk/pathology/pmm23/Demo.zip
12. If you encounter the following error during installation:
Then you will need to download, unzip and install the Office Service Pack 1 file:
http://cancerinformatics.swansea.ac.uk/pathology/pmm23/officesp1.zip
If an error occurs during that installation you will need to download, unzip and install the Office Update file:
http://cancerinformatics.swansea.ac.uk/pathology/pmm23/officeupdate.zip
16. Hierarchical Cluster Analysis
Objective:
We have a dataset of DVs (columns) and IVs (rows):

              Patients
           A    B    C    D
      S1   42   18    4   37
      S2   35   23   10   48
genes S3   39   25    7   22
      ...  ...  ...  ...  ...
      S10  27   22   16   41

We want to VISUALIZE how DVs group together according to how similar they are across the IV scores, or vice versa. So we measure similarity as distance.

What does HCA give you? A tree (or dendrogram).

Steps:
1. Data → distance matrix
2. Build tree
3. Visualize how many groups there are
17. What do we mean by distance?
Think of your data as being points in multidimensional space.
[Figure: two points, A and B, joined by a line]
The distance between two points is the length of the path connecting them. The closer together two points (i.e. your variables) are, the more similar they are in what is being measured.
18. 1. Create a distance matrix – measure similarity between column variables

              Patients
           A    B    C    D
      S1   42   18    4   37
      S2   35   23   10   48
genes S3   39   25    7   22
      ...  ...  ...  ...  ...
      S10  27   22   16   41

How similar are variables A & B across all cases S1…Sn?
Taking the first two cases: on S1 the difference between A and B is 42 − 18 = 24, and on S2 it is 35 − 23 = 12, so

AB = √((24)² + (12)²) = 26.8

[Figure: variables A and B plotted as points against the case axes S1 and S2 (both 0–50), with the legs 24 and 12 and the distance 26.8 marked]
19. Measure similarity between variables

              Patients
           A    B    C    D
      S1   42   18    4   37
      S2   35   23   10   48
genes S3   39   25    7   22
      ...  ...  ...  ...  ...
      S10  27   22   16   41

Extending the sum to every case gives the distance between A and B:

AB = √((24)² + (12)² + (14)² + …… + (5)²)

[Figures: A and B plotted against successive pairs of case axes (S1 v S2, S1 v S3, … S1 v S10), with the corresponding distances 26.8, 25.3 and 26.4 marked]

And so on for every pair of variables……
20. The distance matrix
The distance matrix holds the similarity measures for ALL pairs of variables across ALL cases:

    A   B   C   D
A   0
B  26   0
C  18  32   0
D  31  22   9   0
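The matrix above can be reproduced conceptually in a few lines of code. This is a minimal sketch in plain Python (the lecture itself drives R from Excel); it uses only the four cases listed explicitly in the table, so its values differ from the slide's matrix, which was computed over all ten cases.

```python
import math

# Variables A-D (patients) across the four cases (genes) shown
# explicitly in the slide's table: S1, S2, S3 and S10.
data = {
    "A": [42, 35, 39, 27],
    "B": [18, 23, 25, 22],
    "C": [4, 10, 7, 16],
    "D": [37, 48, 22, 41],
}

def euclidean(x, y):
    """Euclidean distance between two variables across all cases."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

names = list(data)
# Print the lower-triangular distance matrix, one row per variable.
for i, vi in enumerate(names):
    row = [euclidean(data[vi], data[vj]) for vj in names[: i + 1]]
    print(vi, " ".join(f"{d:5.1f}" for d in row))
```

The same `euclidean` helper on just cases S1 and S2 gives the 26.8 worked out two slides back.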
21. Tree building from the distance matrix
1. Find the smallest distance value between a pair
2. Take the average and create a new matrix combining the pair

The smallest distance is C–D = 9, so C and D merge first. Distances to the new pair are averaged, e.g. A to C&D = (18 + 31) / 2 = 24.5:

    A   B   C   D              A     B    C&D
A   0                     A    0
B  26   0           →     B   26     0
C  18  32   0             C&D 24.5  27    0
D  31  22   9   0

The smallest distance is now A–C&D = 24.5, so A joins the pair:

       B    A&C&D
B      0
A&C&D 26.5   0

Finally B joins at 26.5, giving a dendrogram in which C and D merge at height 9, A joins at 24.5, and B joins at 26.5.
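The merge steps above can be sketched in code. A minimal Python version, assuming the same simple averaging of matrix entries as the slide (R's `hclust` offers this and several other linkage rules):

```python
# Distance matrix from the slide, stored as an upper-triangular dict.
dist = {
    ("A", "B"): 26, ("A", "C"): 18, ("A", "D"): 31,
    ("B", "C"): 32, ("B", "D"): 22, ("C", "D"): 9,
}

def d(m, x, y):
    """Look up a distance regardless of pair order."""
    return m[(x, y)] if (x, y) in m else m[(y, x)]

clusters = ["A", "B", "C", "D"]
matrix = dict(dist)
merges = []                      # (cluster, cluster, merge height)
while len(clusters) > 1:
    # 1. Find the smallest distance between a pair of clusters.
    x, y = min(((i, j) for i in clusters for j in clusters if i < j),
               key=lambda p: d(matrix, *p))
    merges.append((x, y, d(matrix, x, y)))
    merged = x + "&" + y
    # 2. Average the pair's distances to every remaining cluster.
    for z in clusters:
        if z not in (x, y):
            matrix[(merged, z)] = (d(matrix, x, z) + d(matrix, y, z)) / 2
    clusters = [c for c in clusters if c not in (x, y)] + [merged]

for x, y, h in merges:
    print(f"merge {x} and {y} at height {h}")
```

The printed heights (9, then 24.5, then 26.5) match the dendrogram built on the slide.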
22. Some common distance measures

Euclidean distance. This is probably the most commonly chosen type of distance. It is simply the geometric distance in multidimensional space. (This is what was just used.)

Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart.

City-block (Manhattan) distance. This distance is simply the average difference across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared).

Correlation.

Gower's distance – allows you to use mixed numerical and categorical data.
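On the A and B values from the earlier table (cases S1, S2, S3 and S10 only), the three numeric measures compare as follows. A small Python sketch, with the city-block measure computed as an average difference across dimensions, matching the definition above:

```python
import math

a = [42, 35, 39, 27]          # variable A on cases S1, S2, S3, S10
b = [18, 23, 25, 22]          # variable B on the same cases
diffs = [p - q for p, q in zip(a, b)]

euclidean = math.sqrt(sum(d ** 2 for d in diffs))
squared_euclidean = sum(d ** 2 for d in diffs)       # big gaps weigh more
# City-block, averaged across dimensions as described above.
manhattan = sum(abs(d) for d in diffs) / len(diffs)

print(round(euclidean, 1), squared_euclidean, manhattan)
```

Note how the single large difference on S1 (24) dominates the squared measure but contributes only linearly to the city-block measure.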
23. Some common tree building algorithms

Single linkage (nearest neighbour). The distance between two clusters is determined by the distance of the two closest objects (nearest neighbours) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains".

Complete linkage (furthest neighbour). In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbours"). This method usually performs quite well in cases when the objects actually form naturally distinct "clumps". If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate.

Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain" type clusters. (This is what was just used.)
24. Using Hierarchical Cluster Analysis in Excel
Step 1: Start R…
1. Click on the Add-ins tab
2. Click on the RExcel menu
3. Click on ‘Connect R’
These steps are always used to start R in Excel.
25. Using Hierarchical Cluster Analysis in Excel
Step 2: Install libraries in R…
1. Highlight the cell A2
2. Right click the selection
3. Click on ‘Run Code’ to install
27. Using Hierarchical Cluster Analysis in Excel
Step 4: Load the necessary libraries (Setup worksheet)…
1. Highlight the cells with the code
2. Right click the selection
3. Click on ‘Run Code’ to load the libraries in R
28. Using Hierarchical Cluster Analysis in Excel
Step 5: Select data (Data worksheet)…
1. Highlight the dataset with column/row names
2. Right click the selection
3. Click on ‘Put R Var’
4. Type ‘dat’ into the ‘Array name in R’ box
5. Tick the ‘with rownames’ and ‘with colnames’ boxes
6. Click OK
30. To plot a dendrogram for DVs with distance matrix = ‘correlation’ and tree building = ‘complete’:
- Right click the cell A19 and click on ‘Run code’ (the dendrogram should appear)
- The tree shows the similarities between patients according to gene expression levels
31. To plot a dendrogram for IVs with distance matrix = ‘correlation’ and tree building = ‘complete’:
- Right click the cell A22 and click on ‘Run code’
- The tree shows similarities for gene expression across patients
32. To plot a dendrogram and HEATMAP for IVs and DVs:
- Highlight and right click the cells C18:C23 and click on ‘Run code’
- The trees are now visualized together, and the heatmap colours are relative to the expression levels of each gene in each patient (green = high; red = low; black = intermediate)
33. Summary of what HCA has shown us
HCA…
• Provides an overall feel for how our data groups
• In the example, there might be:
  • 2 clusters of patients
  • 2 large clusters of genes
  • 4 or 5 smaller sub-clusters of genes
• Genes cluster according to patterns of expression across patients
35. Partition Clustering
              Patients
           A    B    C    D
      S1   42   18    4   37
      S2   35   23   10   48
genes S3   39   25    7   22
      ...  ...  ...  ...  ...
      S10  27   22   16   41

Objective:
We have a dataset of DVs (columns) and IVs (rows). After using HCA we have a feel for how many clusters there are in our dataset. We now want to assign our variables into distinct clusters – so we use a partition clustering method.

What does partition clustering give you?
A table showing the hard assignment of your variables to discrete clusters.
36. Steps in Partition Clustering
1. Choose a partition clustering method suitable for your data
e.g. K-Means, Partition Around Medoids
2. Tell the method how many clusters you think there are in the dataset
e.g. 2, 3, 4…..
3. Read output table to see which cluster each variable has been assigned to
4. Try to assess the ‘fit’ of each variable in a cluster
i.e. how well has clustering worked?
5. Repeat with a different cluster number until you get the best fit
37. Partition Clustering Algorithm Overview….
All this will be explained pictorially in the next few slides
1. You have to define the number of clusters
2. A distance matrix is created between variables
3. Random cluster ‘centres’ are created in multidimensional space
4. Method then assigns samples to nearest cluster centre
5. Cluster centres are then moved to better fit the samples
6. Samples are reassigned to cluster centres
7. Process repeated until best fit is achieved
The most widely used method is K-Means clustering.
K-Means uses Euclidean distance to create the distance matrix.
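The assign/re-centre loop described above can be sketched in a few lines. A toy Python version with made-up 2-D points (the lecture uses R's built-in implementation; this is illustrative only):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-Means: random centres, assign, re-centre, repeat."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # random cluster centres
    for _ in range(iters):
        # Assign each point to its nearest centre (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[i].append(p)
        # Move each centre to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centres[i] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return centres, clusters

# Two obvious clumps; K-Means should place one centre in each.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, clusters = kmeans(pts, 2)
print(sorted(centres))
```

Because the starting centres are random, real K-Means runs are usually repeated from several starts and the best fit kept.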
38. An Example … are there 4 clusters in this dataset?
[Data space: the gray dots represent data and the red squares possible cluster ‘centres’]
39. Using the interactive tool at the URL below we can follow how K-Means partitions our data
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
41. Boundaries are drawn around the nearest data points that K-Means thinks should group with the cluster centre. The cluster centre is then shifted towards the centre of these data points.
42. The boundary lines are then redrawn around the data points that are closest to the new cluster centres. This means that some data points better fit a new cluster.
50. Can Partition Clustering methods be used on categorical data?
Yes!
• You just need to use a different method to create the distance matrix
• Do not use K-Means!
• Use Partition Around Medoids (PAM) instead of K-Means, with Gower's distance measure.
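To show how Gower's distance copes with mixed columns, here is a minimal Python sketch using two rows of the mixed-type example from the data-types slide. The ranges are those of V1–V3 across P1–P5; real implementations (e.g. `daisy()` in R's cluster package) may use slightly different weighting conventions.

```python
def gower(x, y, ranges):
    """Gower's distance for mixed data: numeric columns contribute a
    range-scaled absolute difference; categorical columns contribute
    0 on a match and 1 on a mismatch. Result is the average."""
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if r is None:                 # categorical column
            total += 0 if a == b else 1
        else:                         # numeric column, r = its range
            total += abs(a - b) / r
    return total / len(x)

# Patients P1 and P2 from the mixed-type table: V1-V3 numeric,
# V4 a 0/1 factor, V5 a letter-coded factor.
p1 = (77.2, 91.6, 41.9, 0, "A")
p2 = (74.2, 66.9, 21.2, 1, "A")
ranges = (73.7, 91.4, 67.1, None, None)   # ranges of V1-V3 over P1-P5
print(round(gower(p1, p2, ranges), 3))
```

The 0/1 mismatch on V4 and the match on V5 enter the average on the same 0–1 scale as the scaled numeric differences, which is why the method tolerates mixed data.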
51. An alternative method to K-Means is…K-Medoids Clustering
The most common K-Medoids method is:
Partition Around Medoids (PAM)
PAM measures the average DISSIMILARITY between variables in a cluster.
Why use PAM?
PAM is more robust than K-Means as…
• It gives a better approximation of the centre of a cluster
• It can use any type of distance matrix (not just Euclidean distance)
• It uses a novel visualization tool, the silhouette plot, to help you decide the optimal number of clusters
52. Evaluating how well our clustering has worked
How good is the fit of clusters across variables? What is the optimal number of clusters? The silhouette plot provides these answers.

[Silhouette plot: clusters = 4, n = 75, average silhouette width = 0.74]
• Each bar = the fit of a sample in its cluster
• Bar length = goodness of fit
• Each cluster has an average length (Si)

Rough rule of thumb: an average silhouette width > 0.4 is good; anything greater than 0.5 is a decent fit.
53. Keep trying different cluster numbers (k) to see how the average silhouette width changes

If clusters = 5, the average silhouette width decreases. Look at cluster 3: it is not a very good fit – one sample has a poor fit, and the other samples have a not-so-good fit.

Choose the k that has the highest average silhouette width.
56. Change the value of K (no. clusters) and observe the average silhouette width
K = 3: average silhouette width = 0.45
K = 4: average silhouette width = 0.49
K = 5: average silhouette width = 0.59
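The silhouette widths above come from R, but the calculation itself is simple. A sketch in plain Python on a toy two-cluster dataset (the values are illustrative, not from the lecture's data):

```python
import math

def silhouette_width(point, own, others):
    """Silhouette of one point: (b - a) / max(a, b), where a is its mean
    distance to its own cluster and b its mean distance to the nearest
    other cluster. Values near +1 mean a good fit."""
    a = sum(math.dist(point, q) for q in own) / len(own)
    b = min(sum(math.dist(point, q) for q in grp) / len(grp)
            for grp in others)
    return (b - a) / max(a, b)

# Two tight, well-separated clusters: every point should score near +1.
c1 = [(1, 1), (1, 2), (2, 1)]
c2 = [(8, 8), (9, 8), (8, 9)]
scores = [silhouette_width(p, [q for q in c1 if q != p], [c2]) for p in c1]
scores += [silhouette_width(p, [q for q in c2 if q != p], [c1]) for p in c2]
avg = sum(scores) / len(scores)
print(round(avg, 2))
```

A poor clustering (e.g. splitting one clump in half) would drag individual scores towards 0 or below, which is what the short bars in the k = 5 plot are showing.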
57. Getting output to show cluster assignment
1. Click on a new worksheet
2. Right click a cell
3. Click ‘Get R Output’
58. Summary of what PAM has shown us
• PAM told us that it is most likely that there are 5 clusters of genes in our dataset
• PAM assigned each gene to a definite cluster
60. Principal Components Analysis (PCA)
What does it do…
• It is a data reduction technique
• It seeks a linear combination of variables such that the maximum variance is extracted from the variables
• PCA produces uncorrelated factors (components)
What does it give you…
• The components might represent underlying groups within the data
• By finding a small number of components you have reduced the dimensionality of your data
61. PCA – The Concepts
     X   Y
1   42  18
2   35  23
3   39  25
... ... ...
N   27  22

If we take data for two variables and plot them as a scatter plot, we can draw a line of best fit through the data (the length of which runs between the two furthest data points). By summing the distances between the points and the line we can determine how much of the variation in the data each line captures. We can then draw a second line at right angles, between the two furthest data points in that direction, and this line captures more of the remaining variation.
62. PCA – The Concepts
[Figure: data points in multidimensional space with two ‘lines of best fit’ labelled eigenvector and eigenvalue; each data point has a score on each component, like a correlation]
• In multivariate data we have many variables plotted in multidimensional space
• So we draw many ‘lines of best fit’ – each line is called an eigenvector
• The variables have a score on each eigenvector depending on how much variation is explained by that line (the eigenvalue)
• We refer to the eigenvectors as components
• Different variables will have similar or different correlations on each component
• Therefore we can group together variables according to these similarities
63. How many groups are there?
Each component explains a different amount of variation in the data.

Importance of components:
                        Comp.1  Comp.2  Comp.3  Comp.4
Proportion of Variance    0.62    0.24    0.08    0.04
Cumulative Proportion     0.62    0.86    0.95    1.00

Why is this important?
- It tells us how many components to retain (i.e. we throw out minor components)
- The number of components we retain is the number of groups in the data

Rough rule of thumb: retain components explaining >= 5% of the variation.
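For two variables, the proportions in such a table can be computed by hand from the covariance matrix. A sketch in plain Python with made-up numbers (real analyses use R's princomp/prcomp); each eigenvalue is the variance captured by one component, i.e. one ‘line of best fit’:

```python
import math

# Two correlated variables (made-up values for illustration).
x = [42, 35, 39, 30, 27, 44, 33, 38, 29, 36]
y = [18, 23, 25, 14, 22, 28, 19, 24, 15, 21]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Sample covariance matrix entries.
sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
syy = sum((b - my) ** 2 for b in y) / (n - 1)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Eigenvalues of the 2x2 symmetric covariance matrix, in closed form.
avg_var = (sxx + syy) / 2
half = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
l1, l2 = avg_var + half, avg_var - half

total = l1 + l2
print(f"Comp.1 explains {l1 / total:.2f}, Comp.2 explains {l2 / total:.2f}")
```

The eigenvalues always sum to the total variance (sxx + syy), which is why the ‘cumulative proportion’ row of the table ends at 1.00.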
64. How many groups are there?
Eigenvalues help us decide how many components to retain. A scree plot shows the eigenvalue (variance) of each component.
Rough rule of thumb: look to see where the curve levels off.
The Kaiser criterion: retain components having an eigenvalue > 1.
66. Getting output to show scores of IV’s on components
1. Click on a new worksheet
2. Right click a cell
3. Click ‘Get R Output’
67. Generate a Variance Table & a Scree Plot
The optimal number of components is 4, where the variance explained is >= 5%.
68. Visualizing the scores of IV’s on components using a scatterplot
This plot shows Component 1 (PC1) v. Component 2 (PC2).
• PC1 & PC2 separate groups of genes and patients
• You can see that P1 and P2 are similar due to levels of gene g9
• P5 is clearly different to the other patients according to gene expression levels
• P3 and P4 are similar
69. Visualizing the scores of IV’s on components using a scatterplot
This plot shows Component 1 (PC1) v. Component 3 (PC3). It gives another view of the data groups and of the relationship between variables and components.
70. Putting it all together…A whole map of the patterns in our data….
[Figure: the dendrograms, cluster assignments and PCA plots from the previous slides shown side by side]
…We have a consensus of how our variables group. We could generate new hypotheses from our data.
71. Typical MVA workflow you can apply to your data in research projects
Dataset
1. Estimate the number of groups with tree based clustering: Hierarchical Cluster Analysis
2. Confirm the number of groups with partition clustering: K-Means, PAM
3. Visualize the relationships between variables with data reduction: Principal Components Analysis (PCA)