Introduction to Multivariate Data Analysis (MVA)



o Introduction to exploring data with MVA


o Tutorial on using Excel to perform multivariate analysis




                                                             1
What is Multivariate analysis?

•‘Multivariate’ means data represented by two or more variables
  e.g. height, weight and gender of a person

• Majority of datasets collected in biomedical research are multivariate

• These datasets nearly always contain ‘noise’

• Aim of exploratory MVA is to discover patterns that exist within the data
despite noise
 e.g. patterns may be subgroups of patients with a certain disease

• When we apply MV methods we study:

     • Variation in each of these variables
     • Similarity or distance between variables

• In MVA we work in multidimensional space

                                                                              2
A Typical Multivariate Dataset has Independent and Dependent Variables

e.g. The expression levels for 20 genes in 5 patients
Dependent Variables (DV’s): columns p1 to p5 (patients)
Independent Variables (IV’s): rows g1 to g20 (genes)

          p1       p2       p3       p4       p5
 g1      77.2     91.6     41.9     37.2     68.5
 g2      74.2     66.9     21.2     31.4     57.1
 g3      66.6     49.6     71.2     27.8     72.6
 g4      28.9      0.2     17.7      1.4      8.1
 g5       3.5      3.9      4.1      8.2      6.4
 g6      18       47.4     94       59        7
 g7      73.1     42.8     34.9     96.3     25
 g8      66.7     34.3     48.2     44.3     51
 g9      98.2     82.7     28.1     17.7     47.6
 g10     20.3     61.6     45.5     83.5     70.9
 g11      0.3      0.9      2.1      4.1      1.1
 g12     34.1     12.3     90.6     73.4     90.9
 g13     68       48.2      5.2     10.1     66.7
 g14      5.3     74.6     64.1     19.4     16.8
 g15     73.5     67.8     13.6     12.5     81.6
 g16      4       14       16.5     22       16.5
 g17     69.5     61.3     53.3     78.7     73.3
 g18      0.9      7.4     12.5      1.4     15.9
 g19      1.7     16.2     32.5     37.4     79.4
 g20     49.8     52.4     85.7     47.7     84.8

An expression level in a patient is dependent on the gene                              3
Data types

Data in a variable can be:

Numerical                    0,1,2,3…
                             0.1,0.2,0.3…           e.g. height, gene expression level

Categorical (factor)         A, B, AB, O…           e.g. blood group
                             0,1,2,3…                    immunohistochemistry score
                             0 or 1                      survival: 0 = dead; 1 = alive


Multivariate datasets can contain mixed data types :

                                      P1       P2          P3        P4        P5
      Numerical               V1     77.2     74.2        66.6      28.9      3.5
      Numerical               V2     91.6     66.9        49.6      0.2       3.9
      Numerical               V3     41.9     21.2        71.2      17.7      4.1
      Categorical             V4       0       1           0         1         1
      Categorical             V5       A       A           C         E         B
                                                                                         4
There are different categories of MVA methods


We will look at multivariate statistical methods for exploratory analysis

MVA methods:
       • Multivariate statistics
       • Machine learning

Exploratory:
       -Find underlying patterns in the data
       -Determine groups e.g. similar genes
       -Generate hypotheses

Modelling & Classification:
       -Create models e.g. predict cancer
       -Classify groups e.g. new cancer subgroup
                                                                                     5
Main categories of Exploratory MVA methods that we will look at


Exploratory multivariate analysis methods

   Clustering
      Tree based
         • Hierarchical Cluster Analysis (HCA)
      Partition
         • K-Means
         • Partition Around Medoids (PAM)

   Data Reduction
         • Principal Components Analysis (PCA)

All these methods allow good visualization of patterns in your data
                                                                                   6
Commonly used software for multivariate analysis in academia

         Commercial:

                                     SPSS                         -         Limited
                                     Minitab                      -         Limited
                                     Matlab                       -         Comprehensive

         Free & open source:

                                     R                            -         Comprehensive
                                     Octave                       -         Comprehensive
                                     WEKA                         -         Comprehensive


         Many other (more limited) free software packages available here:
         http://www.freestatistics.info/en/stat.php




This lecture focuses on how we can use R directly from within Microsoft Excel

                                                                                            7
R Statistical Analysis & Programming Environment




Download here:       http://cran.r-project.org/
Introductory book:   http://cran.r-project.org/doc/manuals/R-intro.pdf
Recommended book:    R for Medicine and Biology, Jones & Bartlett, 2009
                                                                          8
R can be your ‘hub’ for data analysis




                                        9
You can use R directly from Excel



                                       RCom




Excel and R can be linked by installing a piece of ‘middleware’ called RCom (see next slide)

Combining Excel and R provides you with an environment for complete data processing and
analysis:

                        1. Use Excel to put your data together

                        2. Use a menu in Excel to analyse your data in R

                        3. Open the Demo Workbook

                        4. Use this workbook to analyze your data
                                                                                        10
Full instructions for downloading and installing R for Excel


  1. Download and install R and other software you need to use R in Microsoft Excel:




      http://cancerinformatics.swansea.ac.uk/pathology/pmm23/rexcel.htm



** PM-M23 Students – You should already have installed this software in Week 2 **



   2. Download the Excel Workbook that accompanies the lecture:


       http://cancerinformatics.swansea.ac.uk/pathology/pmm23/Demo.zip

                                                                                       11
If you encounter the following error during installation :




Then you will need to download, unzip and install the Office Service Pack 1 file:


http://cancerinformatics.swansea.ac.uk/pathology/pmm23/officesp1.zip


If an error occurs during that installation you will need to download, unzip and
install the Office Update file:


http://cancerinformatics.swansea.ac.uk/pathology/pmm23/officeupdate.zip

                                                                                    12
The Excel Workbook for MVA – Demo.xlsx




Select Worksheet
                   Select Code                          13
The rest of the lecture explores our data using these methods in Excel, with examples:

 1       Hierarchical Cluster Analysis

 2       Partition Clustering

 3       PCA



                                                    14
1


Hierarchical Cluster Analysis



                                15
Hierarchical Cluster Analysis

                        Patients

                  A       B        C      D
        S1        42      18       4      37
        S2        35      23       10     48
genes   S3        39      25       7      22
        ...       ...     ...      ...    ...
        S10       27      22       16     41

Objective:

We have a dataset of DV’s (columns) and IV’s (rows)

We want to VISUALIZE how DV’s group together according to how similar they are
across the IV scores or vice versa

So we measure Similarity = Distance

What does HCA give you?

A tree (or dendrogram)

Steps:
   Data -(1)-> distance matrix -(2)-> Build tree -(3)-> Visualize how many groups there are
                                                                                                       16
What do we mean by distance?
          Think of your data as being points in multidimensional space

Point B




                                           Point A



      The distance between two points is the length of the path connecting them.

      The closer together two points (i.e. your variables) are the more similar
      they are in what is being measured
                                                                                   17
1. Create a distance matrix

Measure similarity between column variables

                       Patients

                 A       B        C     D
        S1       42      18       4     37
        S2       35      23       10    48
genes   S3       39      25       7     22
        ...      ...     ...      ...   ...
        S10      27      22       16    41

How similar are variables A & B across all cases S1....Sn?

[Figure: A and B plotted on axes S1 and S2 (each 0 to 50); they differ by 24 on S1 and by 12 on S2, and the line AB has length 26.8]

$AB = \sqrt{(24)^2 + (12)^2} = 26.8$
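The same calculation as a minimal Python sketch. The function name `euclidean` and the variable names are illustrative choices, not part of the lecture's Excel workbook.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Patients A and B measured on genes S1 and S2 (values from the table on this slide).
patient_a = [42, 35]
patient_b = [18, 23]

distance_ab = euclidean(patient_a, patient_b)
print(round(distance_ab, 1))  # 26.8
```

Adding more genes only adds more squared differences under the square root.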

                                                                                              18
Measure similarity between variables

                        Patients

                  A         B      C     D
        S1        42        18     4     37
        S2        35        23     10    48
genes   S3        39        25     7     22
        ...       ...       ...    ...   ...
        S10       27        22     16    41

[Figure: A and B plotted on pairs of gene axes, each axis 0 to 50: S1 against S2 (AB = 26.8), S1 against S3 (AB = 25.3) and S1 against S10 (AB = 26.4)]

Distance between AB:

$AB = \sqrt{(24)^2 + (12)^2 + (8)^2 + \dots + (5)^2}$

And so on ......
                                                                                                           19
The distance matrix


The distance represents similarity measures for ALL pairs of variables across ALL cases




                     A               B             C             D
              A      0
              B      26              0
              C      18              32            0
              D      31              22            9             0
                                                                                          20
Tree Building from distance matrix

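1. Find smallest distance value between a pair
2. Take average and create a new matrix combining the pair

Starting matrix:

                 A         B        C       D
            A    0
            B    26        0
            C    18        32       0
            D    31        22       9       0

After combining C and D (smallest distance = 9):

                 A         B        C&D
            A    0
            B    26        0
            C&D  24.5      27       0

After combining A and C&D (smallest distance = 24.5):

                 B         A&C&D
            B    0
            A&C&D 26.5     0

[Figure: dendrogram with leaves B, A, C, D; C and D join at height 9, A joins at 24.5 and B joins at 26.5]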
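The two steps can be sketched in Python as below. This is only an illustration of the arithmetic on this slide: `merge_closest` is a hypothetical helper, and it assumes the new distance is the plain average of the two merged clusters' distances, which is what reproduces 24.5, 27 and 26.5.

```python
from itertools import combinations

def merge_closest(dist):
    """Merge the closest pair of clusters; return (new_label, height, new_dist).

    dist maps frozenset({x, y}) to the distance between cluster labels x and y.
    The merged cluster's distance to every other cluster is the plain average
    of the two merged clusters' distances (an assumption matching the slide).
    """
    pair = min(dist, key=dist.get)          # step 1: smallest distance
    height = dist[pair]
    x, y = sorted(pair)
    merged = x + "&" + y
    others = {label for key in dist for label in key} - {x, y}
    new_dist = {}
    for p, q in combinations(sorted(others), 2):
        new_dist[frozenset((p, q))] = dist[frozenset((p, q))]
    for other in others:                    # step 2: average and combine
        avg = (dist[frozenset((x, other))] + dist[frozenset((y, other))]) / 2
        new_dist[frozenset((merged, other))] = avg
    return merged, height, new_dist

# Distance matrix from the previous slide.
dist = {
    frozenset("AB"): 26, frozenset("AC"): 18, frozenset("BC"): 32,
    frozenset("AD"): 31, frozenset("BD"): 22, frozenset("CD"): 9,
}

heights = []
while dist:
    label, height, dist = merge_closest(dist)
    heights.append(height)

print(heights)  # [9, 24.5, 26.5]
```

If the last step instead averaged over all original pairs (26, 32 and 22), the result would be about 26.7; the sketch follows the slide's numbers.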


                                                                                           21
Some common distance measures



  Euclidean distance (this is what I just used). This is probably the most commonly chosen type of distance. It is simply the geometric distance
  in the multidimensional space.


  Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place
  progressively greater weight on objects that are further apart.


  City-block (Manhattan) distance. This distance is simply the sum of the absolute differences across dimensions. In most
  cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the
  effect of single large differences (outliers) is dampened (since they are not squared).



  Correlation


  Gower's distance – allows you to use mixed numerical and categorical data
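A small Python sketch comparing the first three measures. It uses only the four rows of patients A and B that are visible in the example table (S1, S2, S3, S10), so the numbers are illustrative and will not match a distance computed over all ten genes. The function names are my own.

```python
import math

def squared_euclidean(u, v):
    """Sum of squared differences across dimensions."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def euclidean(u, v):
    """Geometric (straight-line) distance."""
    return math.sqrt(squared_euclidean(u, v))

def manhattan(u, v):
    """City-block distance: sum of absolute differences across dimensions."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Patients A and B on the visible rows S1, S2, S3 and S10 only.
patient_a = [42, 35, 39, 27]
patient_b = [18, 23, 25, 22]

print(squared_euclidean(patient_a, patient_b))  # 941
print(manhattan(patient_a, patient_b))          # 55
print(euclidean(patient_a, patient_b))          # square root of 941
```

Squaring makes the single largest difference (24 on S1) dominate the Euclidean measures, while the city-block distance counts it only once in proportion.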




                                                                                                                                  22
Some common tree building algorithms


         Single linkage (nearest neighbor).                    The distance between two clusters is determined
         by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a
         sense, string objects together to form clusters, and the resulting clusters tend to represent long
         "chains."

         Complete linkage (furthest neighbor). In this method, the distances between
         clusters are determined by the greatest distance between any two objects in the different clusters (i.e.,
         by the "furthest neighbors"). This method usually performs quite well in cases when the objects actually
         form naturally distinct "clumps." If the clusters tend to be somehow elongated or of a "chain" type
         nature, then this method is inappropriate.

         Unweighted pair-group average (this is what I just used). In this method, the distance between two clusters is
         calculated as the average distance between all pairs of objects in the two different clusters. This method
         is also very efficient when the objects form natural distinct "clumps"; however, it performs equally well
         with elongated, "chain" type clusters.
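The three rules differ only in how they summarize the pairwise distances between two clusters. A minimal Python sketch, using the distance matrix from the earlier slide and a hypothetical helper `cluster_distance`:

```python
def cluster_distance(cluster_x, cluster_y, dist, rule):
    """Distance between two clusters under a given linkage rule.

    dist maps frozenset({p, q}) to the distance between objects p and q.
    """
    pair_distances = [dist[frozenset((p, q))] for p in cluster_x for q in cluster_y]
    if rule == "single":      # nearest neighbours
        return min(pair_distances)
    if rule == "complete":    # furthest neighbours
        return max(pair_distances)
    if rule == "average":     # mean over all pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError("unknown linkage rule: " + rule)

# Distance matrix for A, B, C, D from the earlier slide.
dist = {
    frozenset("AB"): 26, frozenset("AC"): 18, frozenset("BC"): 32,
    frozenset("AD"): 31, frozenset("BD"): 22, frozenset("CD"): 9,
}

for rule in ("single", "complete", "average"):
    print(rule, cluster_distance(["A", "B"], ["C", "D"], dist, rule))
# single 18
# complete 32
# average 25.75
```

The clusters {A, B} and {C, D} are chosen here only to show the three rules side by side; they are not the clusters formed in the worked example.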


                                                                                                                          23
Using Hierarchical Cluster Analysis in Excel




                                                                 1
                                                      Start R…

                                                      1. Click on the Add-ins tab
                                                      2. Click on the RExcel Menu
                                                      3. Click on ‘Connect R’

                                    These steps are always used to start R in Excel




                                                                            24
Using Hierarchical Cluster Analysis in Excel




                                                          2


                                               Install libraries in R…

                                               1. Highlight the cell A2
                                               2. Right click the selection
                                               3. Click on Run Code to install




                                                                         25
Using Hierarchical Cluster Analysis in Excel




                                                         3

                                               Select a Download Source…

                                               1. Choose Bristol or London




                                                                     26
Using Hierarchical Cluster Analysis in Excel




                                                              4

                                                      Setup Worksheet

                                               Load the necessary Libraries…

                                               1. Highlight the cells with the
                                                  code
                                               2. Right click the selection
                                               3. Click on Run Code to load the
                                                  libraries in R



                                                                          27
Using Hierarchical Cluster Analysis in Excel


                                       Data Worksheet

                                                                          5
                                       Select data…

                                       1.   Highlight the dataset with column/row names
                                       2.   Right click the selection
                                       3.   Click on ‘Put R Var’
                                       4.   Type in ‘dat’ into the ‘Array name in R’ box
                                       5.   Click the ‘with rownames’ and ‘with columnames’
                                            boxes
                                       6.   Click OK




                                                                                    28
Using Hierarchical Cluster Analysis in Excel




                                  Click on the HCA tab in the workbook


                                                                         29
To Plot a dendrogram for DV’s with: Distance matrix= ‘correlation’, Tree building = ‘complete’
 -   Right click the cell A19 and Click on ‘Run code’ (the dendrogram should appear)
 -   The tree shows the similarities between patients according to gene expression levels




                                                                                         30
To Plot a dendrogram for IV’s with: Distance matrix= ‘correlation’, Tree building = ‘complete’
 -   Right click the cell A22 and click on ‘Run code’
 -   The tree shows similarities for gene expression across patients




                                                                                          31
To plot a dendrogram and HEATMAP for IV’s and DV’s
-    Highlight and right click the cells C18:C23 and click on ‘Run code’
-    The trees are now visualized together and the heatmap colours are relative to the
     expression levels of each gene in each patient (green = high; red = low; black = intermediate)




                                                                                            32
Summary of what HCA has shown us



HCA...

•Provides an overall feel for how our data
groups
• In the example, there might be:
      •2 clusters of patients
      •2 large clusters of genes
      • 4 or 5 smaller sub-clusters of
      genes
•Genes cluster according to patterns of
expression across patients




                                                    33
2

Confirm the number of groups in our data using


    Partition Clustering


                                                 34
Partition Clustering
                    Patients

              A       B        C     D
        S1    42      18       4     37
        S2    35      23       10    48
genes   S3    39      25       7     22
        ...   ...     ...      ...   ...
        S10   27      22       16    41

Objective:

We have a dataset of DV’s (columns) and IV’s (rows)

We have a feel for how many clusters there are in our dataset after using HCA

We want to assign our variables into distinct clusters, so we use a partition
clustering method

What does Partition clustering give you?

A table showing the hard assignment of your variables into discrete clusters

                                                                                           35
Steps in Partition Clustering

1. Choose a partition clustering method suitable for your data
   e.g. K-Means, Partition Around Medoids

2. Tell the method how many clusters you think there are in the dataset
   e.g. 2, 3, 4…..

3. Read output table to see which cluster each variable has been assigned to

4. Try to assess the ‘fit’ of each variable in a cluster
   i.e. how well has clustering worked?

5. Repeat with a different cluster number until you get the best fit




                                                                               36
Partition Clustering Algorithm Overview….

     All this will be explained pictorially in the next few slides

    1. You have to define the number of clusters

    2. A distance matrix is created between variables

    3. Random cluster ‘centres’ are created in multidimensional space

    4. Method then assigns samples to nearest cluster centre

    5. Cluster centres are then moved to better fit the samples

    6. Samples are reassigned to cluster centres

    7. Process repeated until best fit is achieved

    Most widely used method is K-Means clustering

    K-Means uses euclidean distance to create the distance matrix
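The steps above can be sketched in plain Python. This is a simplified illustration, not the routine the workbook runs in R: it assumes the random starting centres are picked from the data points themselves, and `k_means`, `seed` and `max_iter` are names chosen here.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    """Assign each point to one of k clusters; return (labels, centres)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)      # step 3: random starting centres
    labels = None
    for _ in range(max_iter):
        # steps 4 and 6: (re)assign every sample to its nearest centre
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if new_labels == labels:         # step 7: stop when nothing changes
            break
        labels = new_labels
        # step 5: move each centre to the mean of its assigned samples
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres

# Two obvious groups in two dimensions (made-up data for illustration).
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centres = k_means(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points share a number.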
                                                                        37
An Example … are there 4 clusters in this dataset?




Data Space...




         The gray dots represent data and red squares possible cluster ‘centres’
Using the interactive tool at the URL below we can follow how K-Means partitions our data




   http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
                                                                                     39
K-Means starts by RANDOMLY assigning cluster centres to the data




                                                                   40
Boundaries are drawn around the nearest data points that K-Means thinks should group with the cluster
centre. The cluster centre is then shifted towards the centre of these data points




                                                                                               41
The boundary lines are then redrawn around the data points that are closest to the new cluster centres
This means that some data points better fit a new cluster




                                                                                                42
It keeps doing this….




                        43
…and on….




            44
…until….   A best fit is achieved – it cannot get a better fit by moving centres around
                                                                                          48
Variables are then listed according to cluster

 Variable     Cluster        Variable     Cluster   Variable   Cluster
    1           3               11           2        21         2
    2           4               12           4        22         4
    3           4               13           1        23         1
    4           1               14           3        24         2
    5           2               15           4        25         1
    6           4               16           4
    7           2               17           2
    8           4               18           3
    9           3               19           2
   10           1               20           1


                                                                         49
Can Partition Clustering methods be used on categorical data?

 Yes!


•You just need to use a different method to create the distance matrix

•Do not use K-Means!

•Use Partition Around Medoids (PAM) instead of K-Means with
Gower’s Distance measure.




                                                                         50
An alternative method to K-Means is…K-Medoids Clustering

   The most common K-Medoids method is:

   Partition Around Medoids (PAM)

   PAM measures the average DISSIMILARITY between variables in a cluster



    Why use PAM?

   PAM is more robust than K-Means as…

   • It gives a better approximation of the centre of a cluster

   • It can use any type of distance matrix (not just euclidean distance)

   • It uses a novel visualization tool, the silhouette plot, to help you decide the
   optimal number of clusters
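To illustrate the idea of a medoid (this is not the full PAM algorithm), the sketch below picks the cluster member with the smallest total dissimilarity to the other members. It works from a precomputed distance matrix, so any distance measure could be plugged in. The helper name `medoid` is hypothetical and the matrix is the A to D example from the HCA section.

```python
def medoid(members, dist):
    """Return the member with the smallest total dissimilarity to the others.

    dist maps frozenset({p, q}) to the dissimilarity between objects p and q.
    """
    def total_dissimilarity(candidate):
        return sum(dist[frozenset((candidate, m))] for m in members if m != candidate)
    return min(members, key=total_dissimilarity)

# Distance matrix for A, B, C, D from the HCA example.
dist = {
    frozenset("AB"): 26, frozenset("AC"): 18, frozenset("BC"): 32,
    frozenset("AD"): 31, frozenset("BD"): 22, frozenset("CD"): 9,
}

print(medoid(["A", "B", "C", "D"], dist))  # C
```

Unlike a K-Means centre, which is an average, the medoid is always one of the actual objects in the cluster.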
                                                                                       51
Evaluating how well our clustering has worked
How good is the fit of clusters across variables?
What is the optimal number of clusters?
The silhouette plot provides these answers

Clusters = 4

N = 75

Bars = fit of sample in cluster

Bar Length = goodness of fit

Each cluster has an average
length (Si)

Average Silhouette
Width = 0.74

Rough rule of thumb:

Average Silhouette
Width > 0.4 is good                             Anything greater than 0.5
                                                is a decent fit             52
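A minimal Python sketch of how a silhouette width can be computed, using the standard definition s = (b - a) / max(a, b), where a is the average distance to the other members of a sample's own cluster and b is the average distance to the nearest other cluster. The function name and the toy data are assumptions made for illustration, and Euclidean distance is used here only for convenience.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width of each point for a given hard cluster assignment."""
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [math.dist(p, q)
                for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            widths.append(0.0)  # usual convention for a one-member cluster
            continue
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) - {own}
        )
        widths.append((b - a) / max(a, b))
    return widths

# Made-up data: two tight, well separated groups.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]

widths = silhouette_widths(points, labels)
average_width = sum(widths) / len(widths)
print(round(average_width, 2))
```

Each width corresponds to one bar in the silhouette plot, and the average over all samples is the Average Silhouette Width.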
Keep trying different cluster numbers (k) to see how the average
 silhouette width changes


If Clusters = 5 then:

Average Silhouette Width
decreases
                                                                         Not very
Look at cluster 3                                                        good fit

One sample has a poor fit

Other samples have not so
good a fit




                        Choose K that has the highest Silhouette Width
                                                                              53
The K-Means & PAM Worksheet




                              54
Running PAM in Excel
Clustering IV’s




                       55
Change the value of K (no. clusters) and observe the average silhouette width




     K = 3:   Average Silhouette Width = 0.45
     K = 4:   Average Silhouette Width = 0.49
     K = 5:   Average Silhouette Width = 0.59
                                                                                    56
Getting output to show cluster assignment




               1. Click on a new worksheet
               2. Right click a cell
               3. Click ‘Get R Output’




                                             57
Summary of what PAM has shown us




•PAM told us that it is most likely that
there are 5 clusters of genes in our
dataset

•PAM assigned each gene to a definite
cluster




                                                 58
3

Visualize the relationship between variables in groups with


Principal Components Analysis


                                                              59
Principal Components Analysis (PCA)

What does it do…


• It is a data reduction technique

•It seeks a linear combination of variables such that the maximum variance is extracted
from the variables.

• PCA produces uncorrelated factors (components).




What does it give you…

• The components might represent underlying groups within the data

• By finding a small number of components you have reduced the dimensionality of
your data

                                                                                    60
PCA – The Concepts
      X     Y
1     42    18
2     35    23
3     39    25
...   ...   ...
N     27    22

If we take data for two variables and plot them as a scatter plot, we can draw a
line of best fit through the data (its length runs between the two furthest
data points).

By summing the distances between the points and the line we can determine
how much variation in the data the line captures.

We can then draw a second line at right angles to the first, between the two
furthest data points in that direction; this line captures further variation.




                                                                                      61
PCA – The Concepts

                  Each data point has a score on
                  each component like a
                  correlation



                                                                              eigenvector


                                       eigenvalue




•In multivariate data we have many variables plotted in multidimensional space
•So we draw many ‘lines of best fit’ – each line is called an eigenvector
•The variables have a score on each eigenvector depending on how much variation is
explained by that line (eigenvalue)
•We refer to the eigenvectors as components
•Different variables will have similar or different correlations on each component
•Therefore we can group together variables according to these similarities
                                                                                     62
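For two variables the eigenvalues and eigenvectors can be worked out by hand from the covariance matrix. The Python sketch below does this for the four visible rows of the X and Y table on the concepts slide. It is an illustration under those assumptions only: `pca_2d` is a name chosen here and it is not how the workbook runs PCA in R.

```python
import math

def pca_2d(xs, ys):
    """PCA of two variables; return (eigenvalues, direction of PC1, PC1 scores)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    var_x = sum(d * d for d in dx) / (n - 1)
    var_y = sum(d * d for d in dy) / (n - 1)
    cov_xy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    # Eigenvalues of the 2 x 2 covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    centre = (var_x + var_y) / 2
    spread = math.hypot((var_x - var_y) / 2, cov_xy)
    eigenvalues = (centre + spread, centre - spread)
    # Eigenvector belonging to the larger eigenvalue (the first 'line of best fit').
    if cov_xy != 0:
        vx, vy = eigenvalues[0] - var_y, cov_xy
    elif var_x >= var_y:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)
    # Score of each data point on the first component.
    scores = [a * direction[0] + b * direction[1] for a, b in zip(dx, dy)]
    return eigenvalues, direction, scores

# Visible rows of the X and Y table (rows 1, 2, 3 and N).
xs = [42, 35, 39, 27]
ys = [18, 23, 25, 22]

eigenvalues, direction, scores = pca_2d(xs, ys)
proportion_pc1 = eigenvalues[0] / sum(eigenvalues)
print(eigenvalues, direction, round(proportion_pc1, 2))
```

The variance of the scores along the first component equals the first eigenvalue, which is why eigenvalues can be read as "variation explained".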
How many groups are there?

    Each component explains different amounts of variation in the data

Importance of components:
                         Comp.1   Comp.2   Comp.3   Comp.4
Proportion of Variance   0.62     0.24     0.08     0.04
Cumulative Proportion    0.62     0.86     0.95     1.00


 Why is this important?

 - It tells us how many components to retain (i.e. we throw out minor components)

 - The number of components we retain is the number of groups in the data


  Rough rule of thumb:

  Retain components explaining >= 5% of the variation
                                                                                    63
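The 5% rule of thumb can be applied mechanically to the variance table above. The following is a small sketch, with `components_to_retain` as a hypothetical helper name. It uses the rounded proportions exactly as printed, so the running totals it computes may differ slightly from the table's cumulative row, which was presumably rounded from unrounded values.

```python
from itertools import accumulate

# Proportion of variance per component, copied from the table above.
proportion = [0.62, 0.24, 0.08, 0.04]

def components_to_retain(proportions, threshold=0.05):
    """Return 1-based numbers of components explaining at least `threshold`."""
    return [i for i, p in enumerate(proportions, start=1) if p >= threshold]

cumulative = list(accumulate(proportion))
retained = components_to_retain(proportion)

print("retained components:", retained)
print("variation kept:", round(cumulative[len(retained) - 1], 2))
```

With these figures the fourth component falls below the 5% threshold, so the rule keeps the first three.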
How many groups are there?
   Eigenvalues help us decide how many components to retain

A scree plot will show you the eigenvalues
for each component

[Figure: scree plot showing the variance of each component]



Rough rule of thumb:
Look to see where the curve levels off

The Kaiser criterion:
Retain components having an eigenvalue > 1
                                                                          64
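The Kaiser criterion can be sketched in a few lines. The eigenvalues here are not given in the slides: they are derived from the earlier variance table under the loud assumption that the PCA was run on four standardized variables, in which case the eigenvalues sum to roughly 4 and an eigenvalue of 1 means "as much variation as one original variable". `kaiser_retain` is a hypothetical helper name.

```python
# ASSUMPTION: the four components came from a PCA of four standardized
# variables, so the eigenvalues sum to about 4. The slides do not state this.
n_variables = 4
proportion = [0.62, 0.24, 0.08, 0.04]   # from the variance table earlier
eigenvalues = [p * n_variables for p in proportion]

def kaiser_retain(values):
    """Return 1-based numbers of components with eigenvalue > 1."""
    return [i for i, v in enumerate(values, start=1) if v > 1]

print("eigenvalues:", [round(v, 2) for v in eigenvalues])
print("retained by Kaiser criterion:", kaiser_retain(eigenvalues))
```

Under that assumption only the first component has an eigenvalue above 1, which is stricter than the 5% rule on the same figures. The rules of thumb can disagree, so it is worth checking more than one.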
The PCA Worksheet




                    65
Getting output to show scores of IV’s on components



                                                1. Click on a new worksheet
                                                2. Right click a cell
                                                3. Click ‘Get R Output’




                                                                              66
Generate a Variance Table & a Scree Plot




                                  Optimal number of components is 4,
                                  where variance explained is >= 5%
                                                                67
Visualizing the scores of IV’s on components using a scatterplot




   This plot shows:

   Component 1 (PC1)
       v.
   Component 2 (PC2)

   • PC1 & PC2 separate groups
   of genes and patients

   You can see that P1 and P2 are similar due to levels of gene g9

   P5 is clearly different to the other patients according to gene expression levels

   P3 and P4 are similar
                                                                      68
Visualizing the scores of IV’s on components using a scatterplot




   This plot shows:

   Component 1 (PC1)
       v.
   Component 3 (PC3)


   This plot gives
   another view on the
   data groups and the
   relationship between
   variables and
   components


                                                                   69
Putting it all together… A whole map of the patterns in our data…

[Figure: groups labelled A, B, C, D and E]

…We have a consensus of how our variables group

We could generate new hypotheses from our data


                                                                                    70
Typical MVA workflow you can apply to your data in research projects



                Dataset

1. Estimate number of groups with tree-based clustering: Hierarchical Cluster Analysis
2. Confirm number of groups with partition clustering: K-Means, PAM
3. Visualize relationship between variables with data reduction: Principal Components Analysis (PCA)


                                                                           71
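The first stage of this workflow, tree-based clustering, starts from the distances between the items being grouped. As a rough sketch of that starting point only (not a full hierarchical cluster analysis, and not the R routine called from Excel), the code below computes Euclidean distances between the five patients and finds the closest pair, which is the pair an agglomerative method would merge first. It assumes only genes g1 to g9 from the example dataset, so the full 20-gene analysis may give a different picture.

```python
import math
from itertools import combinations

# Expression levels of genes g1..g9 for each patient, from the example
# dataset earlier in the slides (only the first nine genes are used here).
patients = {
    "p1": [77.2, 74.2, 66.6, 28.9, 3.5, 18.0, 73.1, 66.7, 98.2],
    "p2": [91.6, 66.9, 49.6, 0.2, 3.9, 47.4, 42.8, 34.3, 82.7],
    "p3": [41.9, 21.2, 71.2, 17.7, 4.1, 94.0, 34.9, 48.2, 28.1],
    "p4": [37.2, 31.4, 27.8, 1.4, 8.2, 59.0, 96.3, 44.3, 17.7],
    "p5": [68.5, 57.1, 72.6, 8.1, 6.4, 7.0, 25.0, 51.0, 47.6],
}

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Distance for every pair of patients.
distances = {
    (a, b): euclidean(patients[a], patients[b])
    for a, b in combinations(patients, 2)
}

# The pair an agglomerative (tree-based) method would join first.
closest_pair = min(distances, key=distances.get)

print("closest pair:", closest_pair, round(distances[closest_pair], 1))
```

On this nine-gene subset the closest pair is p1 and p2, which agrees with the similarity between P1 and P2 noted on the PC1 v. PC2 plot.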

PMM23 Week 3 Lectures

  • 1. Introduction to Multivariate Data Analysis (MVA) o Introduction to exploring data with MVA o Tutorial on using Excel to perform multivariate analysis 1
  • 2. What is Multivariate analysis? •‘Multivariate’ means data represented by two or more variables e.g. height, weight and gender of a person • Majority of datasets collected in biomedical research are multivariate • These datasets nearly always contain ‘noise’ • Aim of exploratory MVA is to discover patterns that exist within the data despite noise e.g. patterns maybe subgroups of patients with a certain disease • When we apply MV methods we study: • Variation in each of these variables • Similarity or distance between variables • in MVA we work in multidimensional space 2
  • 3. A Typical Multivariate Dataset has Independent and Dependent Variables e.g. The expression levels for 20 genes in 5 patients Dependent Variables (DV’s) p1 p2 p3 p4 p5 g1 77.2 91.6 41.9 37.2 68.5 g2 74.2 66.9 21.2 31.4 57.1 g3 66.6 49.6 71.2 27.8 72.6 Independent Variables (IV’s) g4 28.9 0.2 17.7 1.4 8.1 g5 3.5 3.9 4.1 8.2 6.4 g6 18 47.4 94 59 7 g7 73.1 42.8 34.9 96.3 25 g8 66.7 34.3 48.2 44.3 51 g9 98.2 82.7 28.1 17.7 47.6 g10 20.3 61.6 45.5 83.5 70.9 g11 0.3 0.9 2.1 4.1 1.1 g12 34.1 12.3 90.6 73.4 90.9 g13 68 48.2 5.2 10.1 66.7 g14 5.3 74.6 64.1 19.4 16.8 g15 73.5 67.8 13.6 12.5 81.6 g16 4 14 16.5 22 16.5 g17 69.5 61.3 53.3 78.7 73.3 g18 0.9 7.4 12.5 1.4 15.9 g19 1.7 16.2 32.5 37.4 79.4 g20 49.8 52.4 85.7 47.7 84.8 An expression level in a patient is dependent on the gene 3
  • 4. Data types Data in a variable can be: Numerical 0,1,2,3… 0.1,0.2,0.3… e.g. height, gene expression level Categorical (factor) A, B, AB, O… e.g. blood group 0,1,2,3… immunohistochemistry score 0 or 1 survival 0= dead; 1 alive Multivariate datasets can contain mixed data types : P1 P2 P3 P4 P5 V1 77.2 74.2 66.6 28.9 3.5 Numerical V2 91.6 66.9 49.6 0.2 3.9 V3 41.9 21.2 71.2 17.7 4.1 V4 0 1 0 1 1 categorical V5 A A C E B 4
  • 5. There are different categories of MVA methods. MVA methods divide into: Multivariate statistics (Exploratory): find underlying patterns in the data; determine groups, e.g. similar genes; generate hypotheses. Machine learning (Modelling & Classification): create models, e.g. predict cancer; classify groups, e.g. new cancer subgroup. We will look at multivariate statistical methods for exploratory analysis
  • 6. Main categories of Exploratory MVA methods that we will look at. Clustering, tree based: • Hierarchical Cluster Analysis (HCA). Clustering, partition: • K-Means • Partition Around Medoids (PAM). Data Reduction: • Principal Components Analysis (PCA). All these methods allow good visualization of patterns in your data
  • 7. Commonly used software for multivariate analysis in academia. Commercial: SPSS (Limited); Minitab (Limited); Matlab (Comprehensive). Free & open source: R (Comprehensive); Octave (Comprehensive); WEKA (Comprehensive). Many other (more limited) free software packages are available here: http://www.freestatistics.info/en/stat.php This lecture focuses on how we can use R directly from within Microsoft Excel
  • 8. R Statistical Analysis & Programming Environment Download here: http://cran.r-project.org/ Introductory book: http://cran.r-project.org/doc/manuals/R-intro.pdf Recommended book: R for Medicine and Biology, Jones & Bartlett, 2009
  • 9. R can be your ‘hub’ for data analysis
  • 10. You can use R directly from Excel. Excel and R can be linked by installing a piece of ‘middleware’ called RCom (see next slide). Combining Excel and R provides you with an environment for complete data processing and analysis: 1. Use Excel to put your data together 2. Use a menu in Excel to analyse your data in R 3. Open the Demo Workbook 4. Use this workbook to analyze your data
  • 11. Full instructions for downloading and installing R for Excel 1. Download and install R and the other software you need to use R in Microsoft Excel: http://cancerinformatics.swansea.ac.uk/pathology/pmm23/rexcel.htm ** PM-M23 Students: you should already have installed this software in Week 2 ** 2. Download the Excel Workbook that accompanies the lecture: http://cancerinformatics.swansea.ac.uk/pathology/pmm23/Demo.zip
  • 12. If you encounter an error during installation (shown as a screenshot on the slide), you will need to download, unzip and install the Office Service Pack 1 file: http://cancerinformatics.swansea.ac.uk/pathology/pmm23/officesp1.zip If an error occurs during that installation you will need to download, unzip and install the Office Update file: http://cancerinformatics.swansea.ac.uk/pathology/pmm23/officeupdate.zip
  • 13. The Excel Workbook for MVA: Demo.xlsx [Screenshot callouts: Select Worksheet; Select Code]
  • 14. The rest of the lecture is… exploring our data using these methods in Excel: 1. Hierarchical Cluster Analysis 2. Partition Clustering 3. PCA, plus examples
  • 16. Hierarchical Cluster Analysis. Objective: We have a dataset of DV’s (columns) and IV’s (rows). We want to VISUALIZE how DV’s group together according to how similar they are across the IV scores, or vice versa. So we measure Similarity = Distance. Example data (columns = patients, rows = genes):
             A     B     C     D
      S1    42    18     4    37
      S2    35    23    10    48
      S3    39    25     7    22
      ...   ...   ...   ...   ...
      S10   27    22    16    41
    What does HCA give you? A tree (or dendrogram). Steps: 1. Data to distance matrix 2. Build tree 3. Visualize how many groups there are
  • 17. What do we mean by distance? Think of your data as being points in multidimensional space (the slide shows two points, Point A and Point B). The distance between two points is the length of the path connecting them. The closer together two points (i.e. your variables) are, the more similar they are in what is being measured
  • 18. 1. Create a distance matrix. Measure similarity between column variables (patients A to D, scored on genes S1 to S10). How similar are variables A & B across all cases S1…Sn? [Figure: A and B plotted on axes S1 and S2, each running 0 to 50; they differ by 24 on S1 and by 12 on S2.] With these two cases: $d_{AB} = \sqrt{24^2 + 12^2} = 26.8$
  • 19. Measure similarity between variables. [Figure: A and B plotted for three pairs of cases: S1 v. S2 (distance 26.8), S1 v. S3 (distance 25.3), S1 v. S10 (distance 26.4).] Distance between A and B across all cases: $d_{AB} = \sqrt{24^2 + 12^2 + 8^2 + \dots + 5^2}$ And so on…
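    The two-case calculation on slide 18 can be reproduced with a few lines of Python. This is a minimal sketch rather than the workbook's R code: the function name euclidean_distance is illustrative, and only the S1 and S2 scores of patients A and B are used because the full table is not shown on the slide.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equally long lists of scores."""
    if len(x) != len(y):
        raise ValueError("both variables need a score for every case")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Scores of patients A and B on cases S1 and S2 (values from the slide).
patient_a = [42, 35]
patient_b = [18, 23]

dist_ab = euclidean_distance(patient_a, patient_b)
print(round(dist_ab, 1))  # 26.8
```

    Adding more cases simply adds more squared differences under the square root, which is the "and so on" step on slide 19.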
  • 20. The distance matrix. The distance matrix holds similarity measures for ALL pairs of variables across ALL cases:
            A     B     C     D
      A     0
      B    26     0
      C    18    32     0
      D    31    22     9     0
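  • 21. Tree Building from distance matrix 1. Find the smallest distance value between a pair 2. Take the average and create a new matrix combining the pair. Starting matrix (smallest value is C to D = 9):
            A     B     C     D
      A     0
      B    26     0
      C    18    32     0
      D    31    22     9     0
    After combining C & D (smallest value is A to C&D = 24.5):
             A      B     C&D
      A      0
      B     26      0
      C&D   24.5   27      0
    After combining A with C&D:
                B     A&C&D
      B         0
      A&C&D    26.5     0
    [Figure: dendrogram with leaves in the order B, A, C, D; C and D join at 9, A joins at 24.5, B joins at 26.5]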
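    The merging steps on slide 21 can be sketched in Python as below. This is an illustration, not the workbook's R code: merge_closest is a hypothetical helper, and the distance to a newly merged cluster is taken as the simple average of the two merged rows, which is what reproduces the slide's 24.5, 27 and 26.5. Averaging over all pairs of original objects instead would give (26 + 32 + 22) / 3 ≈ 26.7 for the last step.

```python
def merge_closest(dist):
    """Merge the two closest clusters and return (new matrix, name, height).

    dist maps frozenset({name1, name2}) to the distance between two clusters.
    """
    pair = min(dist, key=dist.get)          # step 1: smallest distance
    first, second = sorted(pair)
    merged = first + "&" + second
    height = dist[pair]
    others = set().union(*dist.keys()) - pair

    # Keep every distance that does not involve the merged pair.
    new_dist = {k: v for k, v in dist.items() if not (k & pair)}
    # Step 2: average the two old rows to get distances to the new cluster.
    for other in others:
        avg = (dist[frozenset({first, other})]
               + dist[frozenset({second, other})]) / 2
        new_dist[frozenset({merged, other})] = avg
    return new_dist, merged, height


# Distance matrix from slide 20.
dist = {
    frozenset({"A", "B"}): 26, frozenset({"A", "C"}): 18,
    frozenset({"B", "C"}): 32, frozenset({"A", "D"}): 31,
    frozenset({"B", "D"}): 22, frozenset({"C", "D"}): 9,
}

merges = []
while dist:
    dist, merged, height = merge_closest(dist)
    merges.append((merged, height))

print(merges)  # [('C&D', 9), ('A&C&D', 24.5), ('A&C&D&B', 26.5)]
```

    The merge heights are the levels at which the branches join in the dendrogram.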
  • 22. Some common distance measures. Euclidean distance (this is what I just used). This is probably the most commonly chosen type of distance. It is simply the geometric distance in the multidimensional space. Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. City-block (Manhattan) distance. This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). Correlation. Gower's distance: allows you to use mixed numerical and categorical data
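    To see why single large differences weigh less in the city-block measure, the sketch below compares the measures on made-up points. The function names and the data are illustrative assumptions, and correlation distance is taken here as 1 minus the Pearson correlation, which is one common definition among several.

```python
import math


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(squared_euclidean(x, y))


def manhattan(x, y):
    """City-block distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def correlation_distance(x, y):
    """Assumed definition: 1 - Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)


origin = [0, 0, 0, 0]
spread = [2, 2, 2, 2]    # four small differences
outlier = [8, 0, 0, 0]   # one large difference

# Same city-block distance, but the outlier doubles the Euclidean distance.
print(manhattan(origin, spread), manhattan(origin, outlier))  # 8 8
print(euclidean(origin, spread), euclidean(origin, outlier))  # 4.0 8.0
```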
  • 23. Some common tree building algorithms. Single linkage (nearest neighbor). The distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains". Complete linkage (furthest neighbor). In this method, the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors"). This method usually performs quite well in cases when the objects actually form naturally distinct "clumps". If the clusters tend to be somehow elongated or of a "chain" type nature, then this method is inappropriate. Unweighted pair-group average (this is what I just used). In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain" type clusters.
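    The three rules differ only in how they summarise the distances between the members of two clusters. A minimal sketch, assuming one-dimensional made-up points and absolute difference as the distance (all names here are illustrative):

```python
from itertools import product


def pairwise(cluster_a, cluster_b, dist):
    """All distances between one member of each cluster."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b, dist):
    return min(pairwise(cluster_a, cluster_b, dist))   # nearest neighbours


def complete_linkage(cluster_a, cluster_b, dist):
    return max(pairwise(cluster_a, cluster_b, dist))   # furthest neighbours


def average_linkage(cluster_a, cluster_b, dist):
    d = pairwise(cluster_a, cluster_b, dist)
    return sum(d) / len(d)                             # mean over all pairs


def abs_diff(p, q):
    return abs(p - q)


left, right = [1, 2], [5, 9]
print(single_linkage(left, right, abs_diff))    # 3
print(complete_linkage(left, right, abs_diff))  # 8
print(average_linkage(left, right, abs_diff))   # 5.5
```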
  • 24. Using Hierarchical Cluster Analysis in Excel. Step 1: Start R… 1. Click on the Add-ins tab 2. Click on the RExcel Menu 3. Click on ‘Connect R’ These steps are always used to start R in Excel
  • 25. Using Hierarchical Cluster Analysis in Excel. Step 2: Install libraries in R… 1. Highlight the cell A2 2. Right click the selection 3. Click on Run Code to install
  • 26. Using Hierarchical Cluster Analysis in Excel. Step 3: Select a Download Source… 1. Choose Bristol or London
  • 27. Using Hierarchical Cluster Analysis in Excel. Step 4 (Setup Worksheet): Load the necessary Libraries… 1. Highlight the cells with the code 2. Right click the selection 3. Click on Run Code to load the libraries in R
  • 28. Using Hierarchical Cluster Analysis in Excel. Step 5 (Data Worksheet): Select data… 1. Highlight the dataset with column/row names 2. Right click the selection 3. Click on ‘Put R Var’ 4. Type ‘dat’ into the ‘Array name in R’ box 5. Click the ‘with rownames’ and ‘with columnames’ boxes 6. Click OK
  • 29. Using Hierarchical Cluster Analysis in Excel. Click on the HCA tab in the workbook
  • 30. To plot a dendrogram for DV’s with: Distance matrix = ‘correlation’, Tree building = ‘complete’ - Right click the cell A19 and click on ‘Run code’ (the dendrogram should appear) - The tree shows the similarities between patients according to gene expression levels
  • 31. To plot a dendrogram for IV’s with: Distance matrix = ‘correlation’, Tree building = ‘complete’ - Right click the cell A22 and click on ‘Run code’ - The tree shows similarities for gene expression across patients
  • 32. To plot a dendrogram and HEATMAP for IV’s and DV’s - Highlight and right click the cells C18:C23 and click on ‘Run code’ - The trees are now visualized together and the heatmap colours are relative to the expression levels of each gene in each patient (green = high; red = low; black = intermediate)
  • 33. Summary of what HCA has shown us. HCA... • Provides an overall feel for how our data groups • In the example, there might be: • 2 clusters of patients • 2 large clusters of genes • 4 or 5 smaller sub-clusters of genes • Genes cluster according to patterns of expression across patients
  • 34. Part 2: Confirm the number of groups in our data using Partition Clustering
  • 35. Partition Clustering. Objective: We have a dataset of DV’s (columns) and IV’s (rows). We have a feel for how many clusters there are in our dataset after using HCA. We want to assign our variables into distinct clusters, so we use a partition clustering method. Example data (columns = patients, rows = genes):
             A     B     C     D
      S1    42    18     4    37
      S2    35    23    10    48
      S3    39    25     7    22
      ...   ...   ...   ...   ...
      S10   27    22    16    41
    What does Partition clustering give you? A table showing the hard assignment of your variables into discrete clusters
  • 36. Steps in Partition Clustering 1. Choose a partition clustering method suitable for your data e.g. K-Means, Partition Around Medoids 2. Tell the method how many clusters you think there are in the dataset e.g. 2, 3, 4… 3. Read the output table to see which cluster each variable has been assigned to 4. Try to assess the ‘fit’ of each variable in a cluster i.e. how well has clustering worked? 5. Repeat with a different cluster number until you get the best fit
  • 37. Partition Clustering Algorithm Overview… All this will be explained pictorially in the next few slides 1. You have to define the number of clusters 2. A distance matrix is created between variables 3. Random cluster ‘centres’ are created in multidimensional space 4. The method then assigns samples to the nearest cluster centre 5. Cluster centres are then moved to better fit the samples 6. Samples are reassigned to cluster centres 7. The process is repeated until the best fit is achieved. The most widely used method is K-Means clustering. K-Means uses Euclidean distance to create the distance matrix
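    The assign-and-move loop in steps 3 to 7 can be sketched in plain Python as below. This is a teaching sketch under stated assumptions, not the implementation the workbook calls in R: the function k_means and the toy points are made up, starting centres are drawn at random from the data points themselves, and distances to the centres are computed on the fly rather than stored in a matrix.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Return (cluster label per point, final centres)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)        # step 3: random starting centres
    assignment = None
    for _ in range(max_iter):
        # Steps 4 and 6: give each sample to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:   # step 7: nothing moved, so stop
            break
        assignment = new_assignment
        # Step 5: move each centre to the mean of its samples.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centres


# Two obvious groups of made-up 2-D samples.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centres = k_means(points, k=2)
print(labels)
```

    The label numbers themselves are arbitrary; what matters is which samples share a label.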
  • 38. An example… are there 4 clusters in this dataset? [Figure: data space; the gray dots represent data and the red squares possible cluster ‘centres’]
  • 39. Using the interactive tool at the URL below we can follow how K-Means partitions our data http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
  • 40. K-Means starts by RANDOMLY assigning cluster centres to the data
  • 41. Boundaries are drawn around the nearest data points that K-Means thinks should group with the cluster centre. The cluster centre is then shifted towards the centre of these data points
  • 42. The boundary lines are then redrawn around the data points that are closest to the new cluster centres. This means that some data points better fit a new cluster
  • 43. It keeps doing this…
  • 48. …until a best fit is achieved: it cannot get a better fit by moving the centres around
  • 49. Variables are then listed according to cluster:
      Variable  Cluster    Variable  Cluster    Variable  Cluster
         1         3          11        2          21        2
         2         4          12        4          22        4
         3         4          13        1          23        1
         4         1          14        3          24        2
         5         2          15        4          25        1
         6         4          16        4
         7         2          17        2
         8         4          18        3
         9         3          19        2
        10         1          20        1
  • 50. Can Partition Clustering methods be used on categorical data? Yes! • You just need to use a different method to create the distance matrix • Do not use K-Means! • Use Partition Around Medoids (PAM) with Gower’s distance measure instead of K-Means.
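    Gower's distance handles mixed data by scoring each variable on a 0 to 1 scale and averaging: a numerical variable contributes the absolute difference divided by that variable's range, and a categorical variable contributes 0 for a match and 1 for a mismatch. The sketch below applies this basic, unweighted form to the mixed table on slide 4; the function name and the data layout are illustrative assumptions, and R's implementation offers more options than shown here.

```python
def gower_distance(x, y, ranges):
    """Unweighted Gower distance between two records.

    ranges[i] is max - min for numerical variable i (assumed non-zero),
    or None when variable i is categorical.
    """
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if r is None:
            total += 0.0 if a == b else 1.0   # categorical: match or mismatch
        else:
            total += abs(a - b) / r           # numerical: range-scaled gap
    return total / len(x)


# Patients P1..P5 scored on V1..V5 (mixed table from slide 4).
patients = {
    "P1": (77.2, 91.6, 41.9, 0, "A"),
    "P2": (74.2, 66.9, 21.2, 1, "A"),
    "P3": (66.6, 49.6, 71.2, 0, "C"),
    "P4": (28.9, 0.2, 17.7, 1, "E"),
    "P5": (3.5, 3.9, 4.1, 1, "B"),
}
is_numeric = (True, True, True, False, False)
ranges = [
    max(col) - min(col) if numeric else None
    for col, numeric in zip(zip(*patients.values()), is_numeric)
]

d_p1_p2 = gower_distance(patients["P1"], patients["P2"], ranges)
print(round(d_p1_p2, 2))
```

    A matrix of such distances could then be handed to a medoid-based method such as PAM.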
  • 51. An alternative method to K-Means is… K-Medoids Clustering. The most common K-Medoids method is Partition Around Medoids (PAM). PAM measures the average DISSIMILARITY between variables in a cluster. Why use PAM? PAM is more robust than K-Means as… • It gives a better approximation of the centre of a cluster • It can use any type of distance matrix (not just Euclidean distance) • It uses a novel visualization tool, the silhouette plot, to help you decide the optimal number of clusters
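    A medoid is the actual member of a cluster whose total dissimilarity to the other members is smallest. The sketch below uses made-up one-dimensional values to show why this is less sensitive to an outlier than a mean-based centre; the helper name medoid is illustrative, and the code covers only this one ingredient, not the full PAM procedure.

```python
def medoid(members, dist):
    """Member with the smallest summed dissimilarity to all other members."""
    return min(members, key=lambda m: sum(dist(m, other) for other in members))


def abs_diff(p, q):
    return abs(p - q)


cluster = [1, 2, 3, 4, 20]          # 20 is an outlier

centre_as_medoid = medoid(cluster, abs_diff)
centre_as_mean = sum(cluster) / len(cluster)

print(centre_as_medoid)  # 3
print(centre_as_mean)    # 6.0
```

    Because only pairwise dissimilarities are needed, the same helper works with any distance function, including one for mixed data.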
  • 52. Evaluating how well our clustering has worked. How good is the fit of clusters across variables? What is the optimal number of clusters? The silhouette plot provides these answers. [Figure: silhouette plot for Clusters = 4, N = 75; Average Silhouette Width = 0.74] Bars = fit of sample in cluster. Bar length = goodness of fit. Each cluster has an average length (Si). Rough rule of thumb: Average Silhouette Width > 0.4 is good. Anything greater than 0.5 is a decent fit
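    Each bar in a silhouette plot is a per-sample width s = (b - a) / max(a, b), where a is the sample's mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. A minimal sketch on made-up points, assuming Euclidean distance, at least two clusters and no duplicate points (the function name is illustrative):

```python
import math


def silhouette_widths(points, labels):
    """Silhouette width of every sample; needs at least two clusters."""
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        same = [math.dist(p, q)
                for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:                     # a cluster of one gets width 0
            widths.append(0.0)
            continue
        a = sum(same) / len(same)        # fit within own cluster
        b = min(                         # mean distance to nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        widths.append((b - a) / max(a, b))
    return widths


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
good = [0, 0, 0, 1, 1, 1]   # matches the two visible groups
bad = [0, 0, 1, 1, 1, 1]    # third point put in the wrong cluster

avg_good = sum(silhouette_widths(points, good)) / len(points)
avg_bad = sum(silhouette_widths(points, bad)) / len(points)
print(round(avg_good, 2), round(avg_bad, 2))
```

    The misassigned sample gets a negative width, which pulls the average down in the same way the poorly fitting bars do on the slide.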
  • 53. Keep trying different cluster numbers (k) to see how the average silhouette width changes. If Clusters = 5 then the Average Silhouette Width decreases: not a very good fit. Look at cluster 3: one sample has a poor fit and the other samples have not so good a fit. Choose the K that has the highest Average Silhouette Width
  • 54. The K-Means & PAM Worksheet
  • 55. Running PAM in Excel: Clustering IV’s
  • 56. Change the value of K (no. of clusters) and observe the average silhouette width. K=3: Average Silhouette Width = 0.45. K=4: Average Silhouette Width = 0.49. K=5: Average Silhouette Width = 0.59
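    Choosing K then amounts to picking the value with the highest average silhouette width. A small sketch using the three values reported on slide 56 (the variable names are illustrative):

```python
# Average silhouette widths reported on the slide for each K.
avg_width_by_k = {3: 0.45, 4: 0.49, 5: 0.59}

best_k = max(avg_width_by_k, key=avg_width_by_k.get)
print(best_k)  # 5
```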
  • 57. Getting output to show cluster assignment 1. Click on a new worksheet 2. Right click a cell 3. Click ‘Get R Output’
  • 58. Summary of what PAM has shown us • PAM told us that it is most likely that there are 5 clusters of genes in our dataset • PAM assigned each gene to a definite cluster
  • 59. Part 3: Visualize the relationship between variables in groups with Principal Components Analysis
  • 60. Principal Components Analysis (PCA) What does it do… • It is a data reduction technique • It seeks a linear combination of variables such that the maximum variance is extracted from the variables • PCA produces uncorrelated factors (components). What does it give you… • The components might represent underlying groups within the data • By finding a small number of components you have reduced the dimensionality of your data
  • 61. PCA: The Concepts. Take data for two variables, X and Y:
              X     Y
      1      42    18
      2      35    23
      3      39    25
      ...    ...   ...
      N      27    22
    If we plot the two variables as a scatter plot, we can draw a line of best fit through the data (its length runs between the two furthest data points). By summing the distances between the points and the line we can determine how much variation in the data each line captures. We can then draw a second line at right angles to the first, between the two furthest data points in that direction, and this line captures additional variation
  • 62. PCA: The Concepts. [Figure labels: eigenvector; eigenvalue; each data point has a score on each component, like a correlation] • In multivariate data we have many variables plotted in multidimensional space • So we draw many ‘lines of best fit’; each line is called an eigenvector • The variables have a score on each eigenvector depending on how much variation is explained by that line (eigenvalue) • We refer to the eigenvectors as components • Different variables will have similar or different correlations on each component • Therefore we can group together variables according to these similarities
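    The first ‘line of best fit’ can be computed as the leading eigenvector of the covariance matrix. The sketch below does this in plain Python with power iteration, a swapped-in numerical technique chosen here for brevity and not necessarily what R uses; the function names are illustrative, and only the four X, Y rows visible on slide 61 are used.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov by power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


# The four (X, Y) rows that are visible on the slide.
rows = [(42, 18), (35, 23), (39, 25), (27, 22)]

cov = covariance_matrix(rows)
v, eigenvalue = first_component(cov)

total_variance = cov[0][0] + cov[1][1]
proportion = eigenvalue / total_variance   # share of variation on component 1
print([round(x, 3) for x in v], round(proportion, 2))
```

    The eigenvector gives the direction of the line and the eigenvalue gives the amount of variation captured along it.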
  • 63. How many groups are there? Each component explains a different amount of the variation in the data. Importance of components:
                                Comp.1   Comp.2   Comp.3   Comp.4
      Proportion of Variance     0.62     0.24     0.08     0.04
      Cumulative Proportion      0.62     0.86     0.95     1.00
    Why is this important? - It tells us how many components to retain (i.e. we throw out minor components) - The number of components we retain is the number of groups in the data. Rough rule of thumb: Retain components explaining >= 5% of the variation
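    Applying the 5% rule of thumb to the table on slide 63 is a one-line filter. A sketch with illustrative variable names; because the proportions on the slide are rounded, the recomputed cumulative values can differ slightly from the slide's third and fourth entries.

```python
from itertools import accumulate

# Proportion of variance per component, as printed on the slide.
proportions = [0.62, 0.24, 0.08, 0.04]
threshold = 0.05

cumulative = list(accumulate(proportions))
retained = [i + 1 for i, p in enumerate(proportions) if p >= threshold]

print(retained)  # [1, 2, 3]
```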
  • 64. How many groups are there? Eigenvalues help us decide how many components to retain. A scree plot will show you the eigenvalues for each component. [Figure: scree plot showing the variance of each component] Rough rule of thumb: Look to see where the curve levels off. The Kaiser criterion: Retain components having an eigenvalue > 1
  • 66. Getting output to show scores of IV’s on components 1. Click on a new worksheet 2. Right click a cell 3. Click ‘Get R Output’
  • 67. Generate a Variance Table & a Scree Plot. Optimal number of components is 4, where variance explained is >= 5%
  • 68. Visualizing the scores of IV’s on components using a scatterplot. This plot shows: Component 1 (PC1) v. Component 2 (PC2) • PC1 & PC2 separate groups of genes and patients. You can see that P1 and P2 are similar due to levels of gene g9. P5 is clearly different to the other patients according to gene expression levels. P3 and P4 are similar
  • 69. Visualizing the scores of IV’s on components using a scatterplot. This plot shows: Component 1 (PC1) v. Component 3 (PC3). This plot gives another view on the data groups and the relationship between variables and components
  • 70. Putting it all together… a whole map of the patterns in our data… We have a consensus of how our variables group. We could generate new hypotheses from our data. [Figure: combined map of the analyses with groups labelled A to E]
  • 71. Typical MVA workflow you can apply to your data in research projects: 1. Dataset 2. Estimate number of groups with tree based clustering: Hierarchical Cluster Analysis 3. Confirm number of groups with Partition Clustering: K-Means, PAM 4. Visualize relationship between variables with data reduction: Principal Components Analysis (PCA)