1. Introduction to Multivariate Data Analysis (MVA)
o Introduction to exploring data with MVA
o Tutorial on using Excel to perform multivariate analysis
2. What is Multivariate analysis?
• ‘Multivariate’ means data represented by two or more variables
  e.g. height, weight and gender of a person
• The majority of datasets collected in biomedical research are multivariate
• These datasets nearly always contain ‘noise’
• The aim of exploratory MVA is to discover patterns that exist within the data despite the noise
  e.g. patterns may be subgroups of patients with a certain disease
• When we apply MV methods we study:
  • Variation in each of these variables
  • Similarity or distance between variables
• In MVA we work in multidimensional space
4. Data types
Data in a variable can be:

Numerical:            0, 1, 2, 3… or 0.1, 0.2, 0.3…   e.g. height, gene expression level
Categorical (factor): A, B, AB, O…                     e.g. blood group
                      0, 1, 2, 3…                      e.g. immunohistochemistry score
                      0 or 1                           e.g. survival (0 = dead; 1 = alive)

Multivariate datasets can contain mixed data types:

                  P1    P2    P3    P4    P5
Numerical    V1  77.2  74.2  66.6  28.9   3.5
             V2  91.6  66.9  49.6   0.2   3.9
             V3  41.9  21.2  71.2  17.7   4.1
Categorical  V4   0     1     0     1     1
             V5   A     A     C     E     B
5. There are different categories of MVA methods
MVA methods fall into two broad families: multivariate statistics and machine learning.

Multivariate statistical methods divide into:
• Exploratory – find underlying patterns in the data, determine groups (e.g. similar genes), generate hypotheses
• Modelling & classification – create models (e.g. predict cancer), classify groups (e.g. a new cancer subgroup)

We will look at multivariate statistical methods for exploratory analysis.
6. Main categories of Exploratory MVA methods that we will look at
Exploratory multivariate analysis methods

• Clustering
  • Tree based: Hierarchical Cluster Analysis (HCA)
  • Partition: K-Means, Partition Around Medoids (PAM)
• Data Reduction
  • Principal Components Analysis (PCA)

All these methods allow good visualization of patterns in your data.
7. Commonly used software for multivariate analysis in academia
Commercial:
SPSS - Limited
Minitab - Limited
Matlab - Comprehensive
Free & open source:
R - Comprehensive
Octave - Comprehensive
WEKA - Comprehensive
Many other (more limited) free software packages available here:
http://www.freestatistics.info/en/stat.php
This lecture focuses on how we can use R directly from within Microsoft Excel
8. R Statistical Analysis & Programming Environment
Download here: http://cran.r-project.org/
Introductory book: http://cran.r-project.org/doc/manuals/R-intro.pdf
Recommended book: R for Medicine and Biology, Jones & Bartlett, 2009
10. You can use R directly from Excel
Excel and R can be linked by installing a piece of ‘middleware’ called RCom (see next slide).
Combining Excel and R provides you with an environment for complete data processing and analysis:
1. Use Excel to put your data together
2. Use a menu in Excel to analyse your data in R
3. Open the Demo Workbook
4. Use this workbook to analyze your data
11. Full instructions for downloading and installing R for Excel
1. Download and install R and other software you need to use R in Microsoft Excel:
http://cancerinformatics.swansea.ac.uk/pathology/pmm23/rexcel.htm
** PM-M23 Students – You should already have installed this software in Week 2 **
2. Download the Excel Workbook that accompanies the lecture:
http://cancerinformatics.swansea.ac.uk/pathology/pmm23/Demo.zip
12. If you encounter the following error during installation:
Then you will need to download, unzip and install the Office Service Pack 1 file:
http://cancerinformatics.swansea.ac.uk/pathology/pmm23/officesp1.zip
If an error occurs during that installation you will need to download, unzip and install the Office Update file:
http://cancerinformatics.swansea.ac.uk/pathology/pmm23/officeupdate.zip
16. Hierarchical Cluster Analysis
Objective:
We have a dataset of DVs (columns) and IVs (rows):

              Patients
           A    B    C    D
      S1   42   18    4   37
      S2   35   23   10   48
genes S3   39   25    7   22
      ...  ...  ...  ...  ...
      S10  27   22   16   41

We want to VISUALIZE how DVs group together according to how similar they are across the IV scores, or vice versa. So we measure similarity as distance.

What does HCA give you? A tree (or dendrogram).

Steps:
1. Data → distance matrix
2. Build tree
3. Visualize how many groups there are
17. What do we mean by distance?
Think of your data as being points in multidimensional space.
[Figure: two points, A and B, joined by a line]
The distance between two points is the length of the path connecting them. The closer together two points (i.e. your variables) are, the more similar they are in what is being measured.
18. 1. Create a distance matrix – measure similarity between column variables

              Patients
           A    B    C    D
      S1   42   18    4   37
      S2   35   23   10   48
genes S3   39   25    7   22
      ...  ...  ...  ...  ...
      S10  27   22   16   41

How similar are variables A & B across all cases S1…Sn?
Taking the first two cases: on S1 the difference between A and B is 42 − 18 = 24, and on S2 it is 35 − 23 = 12, so

AB = √((24)² + (12)²) = 26.8

[Figure: variables A and B plotted as points against the case axes S1 and S2 (both 0–50), with the legs 24 and 12 and the distance 26.8 marked]
19. Measure similarity between variables

              Patients
           A    B    C    D
      S1   42   18    4   37
      S2   35   23   10   48
genes S3   39   25    7   22
      ...  ...  ...  ...  ...
      S10  27   22   16   41

Extending the sum to every case gives the distance between A and B:

AB = √((24)² + (12)² + (14)² + …… + (5)²)

[Figures: A and B plotted against successive pairs of case axes (S1 v S2, S1 v S3, … S1 v S10), with the corresponding distances 26.8, 25.3 and 26.4 marked]

And so on for every pair of variables……
20. The distance matrix
The distance matrix holds the similarity measures for ALL pairs of variables across ALL cases:

    A   B   C   D
A   0
B  26   0
C  18  32   0
D  31  22   9   0
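The matrix above can be reproduced conceptually in a few lines of code. This is a minimal sketch in plain Python (the lecture itself drives R from Excel); it uses only the four cases listed explicitly in the table, so its values differ from the slide's matrix, which was computed over all ten cases.

```python
import math

# Variables A-D (patients) across the four cases (genes) shown
# explicitly in the slide's table: S1, S2, S3 and S10.
data = {
    "A": [42, 35, 39, 27],
    "B": [18, 23, 25, 22],
    "C": [4, 10, 7, 16],
    "D": [37, 48, 22, 41],
}

def euclidean(x, y):
    """Euclidean distance between two variables across all cases."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

names = list(data)
# Print the lower-triangular distance matrix, one row per variable.
for i, vi in enumerate(names):
    row = [euclidean(data[vi], data[vj]) for vj in names[: i + 1]]
    print(vi, " ".join(f"{d:5.1f}" for d in row))
```

The same `euclidean` helper on just cases S1 and S2 gives the 26.8 worked out two slides back.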
21. Tree building from the distance matrix
1. Find the smallest distance value between a pair
2. Take the average and create a new matrix combining the pair

The smallest distance is C–D = 9, so C and D merge first. Distances to the new pair are averaged, e.g. A to C&D = (18 + 31) / 2 = 24.5:

    A   B   C   D              A     B    C&D
A   0                     A    0
B  26   0           →     B   26     0
C  18  32   0             C&D 24.5  27    0
D  31  22   9   0

The smallest distance is now A–C&D = 24.5, so A joins the pair:

       B    A&C&D
B      0
A&C&D 26.5   0

Finally B joins at 26.5, giving a dendrogram in which C and D merge at height 9, A joins at 24.5, and B joins at 26.5.
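The merge steps above can be sketched in code. A minimal Python version, assuming the same simple averaging of matrix entries as the slide (R's `hclust` offers this and several other linkage rules):

```python
# Distance matrix from the slide, stored as an upper-triangular dict.
dist = {
    ("A", "B"): 26, ("A", "C"): 18, ("A", "D"): 31,
    ("B", "C"): 32, ("B", "D"): 22, ("C", "D"): 9,
}

def d(m, x, y):
    """Look up a distance regardless of pair order."""
    return m[(x, y)] if (x, y) in m else m[(y, x)]

clusters = ["A", "B", "C", "D"]
matrix = dict(dist)
merges = []                      # (cluster, cluster, merge height)
while len(clusters) > 1:
    # 1. Find the smallest distance between a pair of clusters.
    x, y = min(((i, j) for i in clusters for j in clusters if i < j),
               key=lambda p: d(matrix, *p))
    merges.append((x, y, d(matrix, x, y)))
    merged = x + "&" + y
    # 2. Average the pair's distances to every remaining cluster.
    for z in clusters:
        if z not in (x, y):
            matrix[(merged, z)] = (d(matrix, x, z) + d(matrix, y, z)) / 2
    clusters = [c for c in clusters if c not in (x, y)] + [merged]

for x, y, h in merges:
    print(f"merge {x} and {y} at height {h}")
```

The printed heights (9, then 24.5, then 26.5) match the dendrogram built on the slide.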
22. Some common distance measures

Euclidean distance. This is probably the most commonly chosen type of distance. It is simply the geometric distance in multidimensional space. (This is what was just used.)

Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart.

City-block (Manhattan) distance. This distance is simply the average difference across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared).

Correlation.

Gower's distance – allows you to use mixed numerical and categorical data.
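On the A and B values from the earlier table (cases S1, S2, S3 and S10 only), the three numeric measures compare as follows. A small Python sketch, with the city-block measure computed as an average difference across dimensions, matching the definition above:

```python
import math

a = [42, 35, 39, 27]          # variable A on cases S1, S2, S3, S10
b = [18, 23, 25, 22]          # variable B on the same cases
diffs = [p - q for p, q in zip(a, b)]

euclidean = math.sqrt(sum(d ** 2 for d in diffs))
squared_euclidean = sum(d ** 2 for d in diffs)       # big gaps weigh more
# City-block, averaged across dimensions as described above.
manhattan = sum(abs(d) for d in diffs) / len(diffs)

print(round(euclidean, 1), squared_euclidean, manhattan)
```

Note how the single large difference on S1 (24) dominates the squared measure but contributes only linearly to the city-block measure.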
23. Some common tree building algorithms

Single linkage (nearest neighbour). The distance between two clusters is determined by the distance of the two closest objects (nearest neighbours) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains".

Complete linkage (furthest neighbour). In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbours"). This method usually performs quite well in cases when the objects actually form naturally distinct "clumps". If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate.

Unweighted pair-group average. In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain" type clusters. (This is what was just used.)
24. Using Hierarchical Cluster Analysis in Excel
Step 1: Start R…
1. Click on the Add-ins tab
2. Click on the RExcel menu
3. Click on ‘Connect R’
These steps are always used to start R in Excel.
25. Using Hierarchical Cluster Analysis in Excel
Step 2: Install libraries in R…
1. Highlight the cell A2
2. Right click the selection
3. Click on ‘Run Code’ to install
27. Using Hierarchical Cluster Analysis in Excel
Step 4: Load the necessary libraries (Setup worksheet)…
1. Highlight the cells with the code
2. Right click the selection
3. Click on ‘Run Code’ to load the libraries in R
28. Using Hierarchical Cluster Analysis in Excel
Step 5: Select data (Data worksheet)…
1. Highlight the dataset with column/row names
2. Right click the selection
3. Click on ‘Put R Var’
4. Type ‘dat’ into the ‘Array name in R’ box
5. Tick the ‘with rownames’ and ‘with colnames’ boxes
6. Click OK
30. To plot a dendrogram for DVs with distance matrix = ‘correlation’ and tree building = ‘complete’:
- Right click the cell A19 and click on ‘Run code’ (the dendrogram should appear)
- The tree shows the similarities between patients according to gene expression levels
31. To plot a dendrogram for IVs with distance matrix = ‘correlation’ and tree building = ‘complete’:
- Right click the cell A22 and click on ‘Run code’
- The tree shows similarities for gene expression across patients
32. To plot a dendrogram and HEATMAP for IVs and DVs:
- Highlight and right click the cells C18:C23 and click on ‘Run code’
- The trees are now visualized together, and the heatmap colours are relative to the expression levels of each gene in each patient (green = high; red = low; black = intermediate)
33. Summary of what HCA has shown us
HCA…
• Provides an overall feel for how our data groups
• In the example, there might be:
  • 2 clusters of patients
  • 2 large clusters of genes
  • 4 or 5 smaller sub-clusters of genes
• Genes cluster according to patterns of expression across patients
35. Partition Clustering
              Patients
           A    B    C    D
      S1   42   18    4   37
      S2   35   23   10   48
genes S3   39   25    7   22
      ...  ...  ...  ...  ...
      S10  27   22   16   41

Objective:
We have a dataset of DVs (columns) and IVs (rows). After using HCA we have a feel for how many clusters there are in our dataset. We now want to assign our variables into distinct clusters – so we use a partition clustering method.

What does partition clustering give you?
A table showing the hard assignment of your variables to discrete clusters.
36. Steps in Partition Clustering
1. Choose a partition clustering method suitable for your data
e.g. K-Means, Partition Around Medoids
2. Tell the method how many clusters you think there are in the dataset
e.g. 2, 3, 4…..
3. Read output table to see which cluster each variable has been assigned to
4. Try to assess the ‘fit’ of each variable in a cluster
i.e. how well has clustering worked?
5. Repeat with a different cluster number until you get the best fit
37. Partition Clustering Algorithm Overview….
All this will be explained pictorially in the next few slides
1. You have to define the number of clusters
2. A distance matrix is created between variables
3. Random cluster ‘centres’ are created in multidimensional space
4. Method then assigns samples to nearest cluster centre
5. Cluster centres are then moved to better fit the samples
6. Samples are reassigned to cluster centres
7. Process repeated until best fit is achieved
The most widely used method is K-Means clustering.
K-Means uses Euclidean distance to create the distance matrix.
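The assign/re-centre loop described above can be sketched in a few lines. A toy Python version with made-up 2-D points (the lecture uses R's built-in implementation; this is illustrative only):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-Means: random centres, assign, re-centre, repeat."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # random cluster centres
    for _ in range(iters):
        # Assign each point to its nearest centre (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[i].append(p)
        # Move each centre to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centres[i] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return centres, clusters

# Two obvious clumps; K-Means should place one centre in each.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, clusters = kmeans(pts, 2)
print(sorted(centres))
```

Because the starting centres are random, real K-Means runs are usually repeated from several starts and the best fit kept.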
38. An Example … are there 4 clusters in this dataset?
[Data space: the gray dots represent data and the red squares possible cluster ‘centres’]
39. Using the interactive tool at the URL below we can follow how K-Means partitions our data
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
41. Boundaries are drawn around the nearest data points that K-Means thinks should group with the cluster centre. The cluster centre is then shifted towards the centre of these data points.
42. The boundary lines are then redrawn around the data points that are closest to the new cluster centres. This means that some data points better fit a new cluster.
50. Can Partition Clustering methods be used on categorical data?
Yes!
• You just need to use a different method to create the distance matrix
• Do not use K-Means!
• Use Partition Around Medoids (PAM) instead of K-Means, with Gower's distance measure.
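To show how Gower's distance copes with mixed columns, here is a minimal Python sketch using two rows of the mixed-type example from the data-types slide. The ranges are those of V1–V3 across P1–P5; real implementations (e.g. `daisy()` in R's cluster package) may use slightly different weighting conventions.

```python
def gower(x, y, ranges):
    """Gower's distance for mixed data: numeric columns contribute a
    range-scaled absolute difference; categorical columns contribute
    0 on a match and 1 on a mismatch. Result is the average."""
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if r is None:                 # categorical column
            total += 0 if a == b else 1
        else:                         # numeric column, r = its range
            total += abs(a - b) / r
    return total / len(x)

# Patients P1 and P2 from the mixed-type table: V1-V3 numeric,
# V4 a 0/1 factor, V5 a letter-coded factor.
p1 = (77.2, 91.6, 41.9, 0, "A")
p2 = (74.2, 66.9, 21.2, 1, "A")
ranges = (73.7, 91.4, 67.1, None, None)   # ranges of V1-V3 over P1-P5
print(round(gower(p1, p2, ranges), 3))
```

The 0/1 mismatch on V4 and the match on V5 enter the average on the same 0–1 scale as the scaled numeric differences, which is why the method tolerates mixed data.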
51. An alternative method to K-Means is…K-Medoids Clustering
The most common K-Medoids method is:
Partition Around Medoids (PAM)
PAM measures the average DISSIMILARITY between variables in a cluster.
Why use PAM?
PAM is more robust than K-Means as…
• It gives a better approximation of the centre of a cluster
• It can use any type of distance matrix (not just Euclidean distance)
• It uses a novel visualization tool, the silhouette plot, to help you decide the optimal number of clusters
52. Evaluating how well our clustering has worked
How good is the fit of clusters across variables? What is the optimal number of clusters? The silhouette plot provides these answers.

[Silhouette plot: clusters = 4, n = 75, average silhouette width = 0.74]
• Each bar = the fit of a sample in its cluster
• Bar length = goodness of fit
• Each cluster has an average length (Si)

Rough rule of thumb: an average silhouette width > 0.4 is good; anything greater than 0.5 is a decent fit.
53. Keep trying different cluster numbers (k) to see how the average silhouette width changes

If clusters = 5, the average silhouette width decreases. Look at cluster 3: it is not a very good fit – one sample has a poor fit, and the other samples have a not-so-good fit.

Choose the k that has the highest average silhouette width.
56. Change the value of K (no. clusters) and observe the average silhouette width
K = 3: average silhouette width = 0.45
K = 4: average silhouette width = 0.49
K = 5: average silhouette width = 0.59
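The silhouette widths above come from R, but the calculation itself is simple. A sketch in plain Python on a toy two-cluster dataset (the values are illustrative, not from the lecture's data):

```python
import math

def silhouette_width(point, own, others):
    """Silhouette of one point: (b - a) / max(a, b), where a is its mean
    distance to its own cluster and b its mean distance to the nearest
    other cluster. Values near +1 mean a good fit."""
    a = sum(math.dist(point, q) for q in own) / len(own)
    b = min(sum(math.dist(point, q) for q in grp) / len(grp)
            for grp in others)
    return (b - a) / max(a, b)

# Two tight, well-separated clusters: every point should score near +1.
c1 = [(1, 1), (1, 2), (2, 1)]
c2 = [(8, 8), (9, 8), (8, 9)]
scores = [silhouette_width(p, [q for q in c1 if q != p], [c2]) for p in c1]
scores += [silhouette_width(p, [q for q in c2 if q != p], [c1]) for p in c2]
avg = sum(scores) / len(scores)
print(round(avg, 2))
```

A poor clustering (e.g. splitting one clump in half) would drag individual scores towards 0 or below, which is what the short bars in the k = 5 plot are showing.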
57. Getting output to show cluster assignment
1. Click on a new worksheet
2. Right click a cell
3. Click ‘Get R Output’
58. Summary of what PAM has shown us
• PAM told us that it is most likely that there are 5 clusters of genes in our dataset
• PAM assigned each gene to a definite cluster
60. Principal Components Analysis (PCA)
What does it do…
• It is a data reduction technique
• It seeks a linear combination of variables such that the maximum variance is extracted from the variables
• PCA produces uncorrelated factors (components)
What does it give you…
• The components might represent underlying groups within the data
• By finding a small number of components you have reduced the dimensionality of your data
61. PCA – The Concepts
     X   Y
1   42  18
2   35  23
3   39  25
... ... ...
N   27  22

If we take data for two variables and plot them as a scatter plot, we can draw a line of best fit through the data (the length of which runs between the two furthest data points). By summing the distances between the points and the line we can determine how much of the variation in the data each line captures. We can then draw a second line at right angles, between the two furthest data points in that direction, and this line captures more of the remaining variation.
62. PCA – The Concepts
[Figure: data points in multidimensional space with two ‘lines of best fit’ labelled eigenvector and eigenvalue; each data point has a score on each component, like a correlation]
• In multivariate data we have many variables plotted in multidimensional space
• So we draw many ‘lines of best fit’ – each line is called an eigenvector
• The variables have a score on each eigenvector depending on how much variation is explained by that line (the eigenvalue)
• We refer to the eigenvectors as components
• Different variables will have similar or different correlations on each component
• Therefore we can group together variables according to these similarities
63. How many groups are there?
Each component explains a different amount of variation in the data.

Importance of components:
                        Comp.1  Comp.2  Comp.3  Comp.4
Proportion of Variance    0.62    0.24    0.08    0.04
Cumulative Proportion     0.62    0.86    0.95    1.00

Why is this important?
- It tells us how many components to retain (i.e. we throw out minor components)
- The number of components we retain is the number of groups in the data

Rough rule of thumb: retain components explaining >= 5% of the variation.
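For two variables, the proportions in such a table can be computed by hand from the covariance matrix. A sketch in plain Python with made-up numbers (real analyses use R's princomp/prcomp); each eigenvalue is the variance captured by one component, i.e. one ‘line of best fit’:

```python
import math

# Two correlated variables (made-up values for illustration).
x = [42, 35, 39, 30, 27, 44, 33, 38, 29, 36]
y = [18, 23, 25, 14, 22, 28, 19, 24, 15, 21]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Sample covariance matrix entries.
sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
syy = sum((b - my) ** 2 for b in y) / (n - 1)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Eigenvalues of the 2x2 symmetric covariance matrix, in closed form.
avg_var = (sxx + syy) / 2
half = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
l1, l2 = avg_var + half, avg_var - half

total = l1 + l2
print(f"Comp.1 explains {l1 / total:.2f}, Comp.2 explains {l2 / total:.2f}")
```

The eigenvalues always sum to the total variance (sxx + syy), which is why the ‘cumulative proportion’ row of the table ends at 1.00.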
64. How many groups are there?
Eigenvalues help us decide how many components to retain. A scree plot shows the eigenvalue (variance) of each component.
Rough rule of thumb: look to see where the curve levels off.
The Kaiser criterion: retain components having an eigenvalue > 1.
66. Getting output to show scores of IV’s on components
1. Click on a new worksheet
2. Right click a cell
3. Click ‘Get R Output’
67. Generate a Variance Table & a Scree Plot
The optimal number of components is 4, where the variance explained is >= 5%.
68. Visualizing the scores of IV’s on components using a scatterplot
This plot shows Component 1 (PC1) v. Component 2 (PC2).
• PC1 & PC2 separate groups of genes and patients
• You can see that P1 and P2 are similar due to levels of gene g9
• P5 is clearly different to the other patients according to gene expression levels
• P3 and P4 are similar
69. Visualizing the scores of IV’s on components using a scatterplot
This plot shows Component 1 (PC1) v. Component 3 (PC3). It gives another view of the data groups and of the relationship between variables and components.
70. Putting it all together…A whole map of the patterns in our data….
[Figure: the dendrograms, cluster assignments and PCA plots from the previous slides shown side by side]
…We have a consensus of how our variables group. We could generate new hypotheses from our data.
71. Typical MVA workflow you can apply to your data in research projects
Dataset
1. Estimate the number of groups with tree based clustering: Hierarchical Cluster Analysis
2. Confirm the number of groups with partition clustering: K-Means, PAM
3. Visualize the relationships between variables with data reduction: Principal Components Analysis (PCA)