Multivariate Samples
Objectives: recall some very basic concepts of univariate and bivariate statistics; describe multivariate samples; analyze multivariate samples from a geometrical perspective; describe distances in the Euclidean space.
The data we will consider
Example 1. Innovation and Research in Europe (Source: Eurostat)

| Variable | Description |
|---|---|
| Geo | Country code |
| Country | Country name |
| Region | Country European region |
| E_gov_avail | E-government on-line availability: online availability of 20 basic public services |
| HT_Exports | Exports of high technology products as a share of total exports |
| Y_Educ__Lev_m | % of males aged 20-24 having completed at least upper secondary education |
| Y_Educ_Lev_f | % of females aged 20-24 having completed at least upper secondary education |
| Y_Educ_Lev | Youth education attainment level, total: % of the population aged 20-24 who completed at least upper secondary education |
| Telec_Expenditure | Expenditure on Telecommunications as a % of GDP |
| IT_Expenditure | Expenditure on Information Technology as a % of GDP |
| USTPO | Number of patents granted by the US Patent and Trademark Office per million inhabitants |
| EPO | Number of patent applications to the European Patent Office per million inhabitants |
| ST_grad_m | Male tertiary graduates in S&T per 1000 males aged 20-29 |
| ST_grad_f | Female tertiary graduates in S&T per 1000 females aged 20-29 |
| ST_grad | Science and technology: tertiary graduates in S&T per 1000 persons aged 20-29 |
| Internet_Acc | Level of Internet access: % of households who have Internet access at home |
| GERD_abroad | % of GERD financed from abroad |
| GERD_govern | % of GERD financed by government |
| GERD_industry | % of GERD financed by industry |
| GERD | Gross domestic expenditure on R&D (GERD) as a % of GDP |
| Educ_Exp | Spending on Human Resources (total public expenditure on education) as a % of GDP |
Some basic concepts of Univariate and Bivariate statistics
Back to basics… Considering one variable
Let us consider one variable of interest, say EPO. In statistics, a commonly used position measure is the arithmetic (sample) mean, obtained by summing all the observed values and dividing the result by the number of observations. Here, Mean = 127.6987. (Plot: observed EPO values, with Netherlands and Spain labelled.)

| country | region | Internet_Acc | EPO |
|---|---|---|---|
| Romania | Eastern | 6.00 | 1.31 |
| Czech Republic | Eastern | 19.00 | 12.04 |
| Lithuania | Northern | 12.00 | 2.78 |
| Ireland | Northern | 40.00 | 79.87 |
| Norway | Northern | 60.00 | 135.77 |
| UK | Northern | 56.00 | 124.19 |
| Finland | Northern | 51.00 | 309.09 |
| Sweden | Northern | 73.00 | 293.32 |
| Greece | Southern | 17.00 | 9.87 |
| Italy | Southern | 34.00 | 84.14 |
| Spain | Southern | 34.00 | 30.64 |
| Netherlands | Western | 67.00 | 246.15 |
| Belgium | Western | 50.00 | 141.80 |
| Germany | Western | 60.00 | 299.99 |
| France | Western | 34.00 | 144.52 |
Back to basics… Considering one variable
Variable of interest: EPO
The mean can be used to make a “prediction” about EPO for a generic country without any further information. To evaluate the reliability of the mean as a synthesis of the observed data, we can consider, for each observed value, the error incurred when substituting it with the sample mean (Mean = 127.6987).
In the plot: errors incurred when substituting the mean for the values observed for Netherlands and Spain, respectively.
The TOTAL SUM OF SQUARES is the sum of the squared errors.
Back to basics… Considering one variable
Variable of interest: EPO
A synthesis of the errors, and a measure of the reliability of the mean as a synthesis of the observed data, is the (sample) variance. This is the average of the squared errors we incur when substituting the observed values with the sample mean. It is obtained by dividing the Total SS by the number of observations minus 1.
The variance of EPO turns out to be 12646.5814. Hence the error we can expect to incur for a generic observation is the square root of the variance, which is called the standard deviation (here about 112.46).
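The quantities introduced so far (mean, Total SS, variance, standard deviation) can be checked with a few lines of code. This is only a minimal sketch in plain Python; the variable names are illustrative, and the EPO values are those listed for the 15 countries in the slides.

```python
import math

# EPO values for the 15 countries (Romania ... France), as listed in the slides
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean = sum(epo) / n                            # arithmetic (sample) mean
total_ss = sum((x - mean) ** 2 for x in epo)   # Total Sum of Squares
variance = total_ss / (n - 1)                  # sample variance (divisor n - 1)
std_dev = math.sqrt(variance)                  # standard deviation

print(f"mean={mean:.4f}  total SS={total_ss:.4f}")
print(f"variance={variance:.4f}  std dev={std_dev:.2f}")
```

Up to rounding, the printed values should agree with the figures reported in the slides (mean 127.6987, Total SS 177052.1395, variance 12646.5814).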
Back to basics… Considering one variable
Variable of interest: EPO
In statistics we are mainly concerned with the explanation of variance, i.e., we are interested in explaining why a phenomenon varies, and we look for predictive tools characterized by low prediction errors.
So the question now is: can we do better than the mean? That is, can we use external information (other variables) related to EPO, and hence useful to predict the values of EPO with a lower error?
In the following we will consider two supporting variables having different characteristics: Region (a categorical variable) and Internet_Access (a numerical variable). We will show how to evaluate the extent to which an external variable provides information about the variable of interest.
Back to basics… Considering one variable
If we consider the region, can our prediction of EPO be better? General Mean = 127.6987.
We can use the conditional means (the mean of EPO within each region) rather than the general one. This is worthwhile only if the prediction error is considerably lower (it can be shown that, by construction, it cannot be higher).
(Plot: EPO values observed within the regions, with Netherlands and Spain labelled.)
Back to basics… Considering one variable
Consider the region to improve the prediction of EPO: use the conditional means.
To evaluate the reliability of the conditional means as syntheses of the observed EPO data, we can consider the squared difference between each value and the proper conditional mean. In the plot: errors for Netherlands and Spain.
The WITHIN SUM OF SQUARES of EPO given Region is the sum of the squared errors incurred when using the conditional means (by region) to predict EPO.
Back to basics… Considering one variable
Compare general mean / conditional means as predictors of EPO (general mean = 127.6987).

| country | region | EPO | Squared error (general mean) | Conditional mean | Squared error (conditional mean) |
|---|---|---|---|---|---|
| Romania | Eastern | 1.31 | 15974.1035 | 6.675 | 28.7832 |
| Czech Republic | Eastern | 12.04 | 13376.9349 | 6.675 | 28.7832 |
| Lithuania | Northern | 2.78 | 15604.6816 | 157.5033 | 23939.3 |
| Ireland | Northern | 79.87 | 2287.5845 | 157.5033 | 6026.929 |
| Norway | Northern | 135.77 | 65.1459 | 157.5033 | 472.3363 |
| UK | Northern | 124.19 | 12.311 | 157.5033 | 1109.776 |
| Finland | Northern | 309.09 | 32902.8037 | 157.5033 | 22978.53 |
| Sweden | Northern | 293.32 | 27430.415 | 157.5033 | 18446.18 |
| Greece | Southern | 9.87 | 13883.6025 | 41.55 | 1003.622 |
| Italy | Southern | 84.14 | 1897.3603 | 41.55 | 1813.908 |
| Spain | Southern | 30.64 | 9420.3912 | 41.55 | 119.0281 |
| Netherlands | Western | 246.15 | 14030.7105 | 208.115 | 1446.661 |
| Belgium | Western | 141.80 | 198.8467 | 208.115 | 4397.679 |
| Germany | Western | 299.99 | 29684.2921 | 208.115 | 8441.016 |
| France | Western | 144.52 | 282.9561 | 208.115 | 4044.324 |
| Sum | | | TOTAL SS EPO = 177052.1395 | | WITHIN SS EPO given REGION = 94296.85 |

If we use the region, our improvement as compared to the general mean is $R^2 = (\text{Total SS} - \text{Within SS}) / \text{Total SS} = (177052.1395 - 94296.85) / 177052.1395 \approx 0.467$, i.e., the % of variance of EPO accounted for by Region.
The $R^2$ ranges from 0 to 1. It measures the ability of the categorical variable as a predictor of the numerical one.
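The comparison between the general mean and the conditional means can be reproduced as follows. This is a minimal sketch in plain Python, assuming the (country, region, EPO) values listed in the slides; the names `cond_mean`, `within_ss` and `r2` are illustrative.

```python
from collections import defaultdict

# (country, region, EPO) as reported in the slides
data = [
    ("Romania", "Eastern", 1.31), ("Czech Republic", "Eastern", 12.04),
    ("Lithuania", "Northern", 2.78), ("Ireland", "Northern", 79.87),
    ("Norway", "Northern", 135.77), ("UK", "Northern", 124.19),
    ("Finland", "Northern", 309.09), ("Sweden", "Northern", 293.32),
    ("Greece", "Southern", 9.87), ("Italy", "Southern", 84.14),
    ("Spain", "Southern", 30.64), ("Netherlands", "Western", 246.15),
    ("Belgium", "Western", 141.80), ("Germany", "Western", 299.99),
    ("France", "Western", 144.52),
]

# Conditional means: mean of EPO within each region
by_region = defaultdict(list)
for _, region, epo in data:
    by_region[region].append(epo)
cond_mean = {r: sum(v) / len(v) for r, v in by_region.items()}

# Total SS (errors around the general mean) and Within SS (around conditional means)
general_mean = sum(epo for _, _, epo in data) / len(data)
total_ss = sum((epo - general_mean) ** 2 for _, _, epo in data)
within_ss = sum((epo - cond_mean[region]) ** 2 for _, region, epo in data)

# Share of the variance of EPO accounted for by Region
r2 = (total_ss - within_ss) / total_ss
print(cond_mean)
print(f"within SS={within_ss:.2f}  R^2={r2:.4f}")
```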
Back to basics… Considering one variable
If we consider Internet_Access, can our prediction of EPO be better?
When considering numerical variables, we are interested in evaluating the existence of a linear association between them. To evaluate whether a linear relationship exists, and to determine its direction, we refer to the sample covariance (an absolute measure of linear association), computed on the Internet_Acc and EPO values of the 15 countries.
Back to basics… Considering one variable
If we consider Internet_Access, can our prediction of EPO be better?
The covariance between the two variables is Cov(EPO, Int_Acc) = 1868.5152. (This value corresponds to dividing the sum of cross-products by $n$ = 15; dividing by $n - 1$, as done for the variance, gives about 2001.98.)
This measure only indicates that a linear relationship exists and that it is direct (an inspection of the scatter plot confirms this). However, the value of the covariance depends on the units of measurement of the variables considered.
A relative measure of linear association is the correlation coefficient, which ranges from –1 to +1. Values close to +1 indicate strong direct linear association; values close to –1 denote strong inverse linear association; values close to zero indicate no linear relationship.
Here we have Corr(EPO, Int_Acc) = 0.8527 (strong association).
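As a check on the figures above, here is a minimal sketch in plain Python that computes the covariance with both divisors and the correlation coefficient. The data are the Internet_Acc and EPO values listed in the slides; the names `cov_n` and `cov_n1` are illustrative.

```python
import math

# Values for the 15 countries (Romania ... France), as listed in the slides
internet = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mx, my = sum(internet) / n, sum(epo) / n

# Sums of squares and cross-products of the deviations from the means
sxx = sum((x - mx) ** 2 for x in internet)
syy = sum((y - my) ** 2 for y in epo)
sxy = sum((x - mx) * (y - my) for x, y in zip(internet, epo))

cov_n = sxy / n          # divisor n: matches the value reported in the slides
cov_n1 = sxy / (n - 1)   # divisor n - 1: consistent with the sample variance
corr = sxy / math.sqrt(sxx * syy)  # the divisor cancels out in the correlation

print(f"cov (n)={cov_n:.4f}  cov (n-1)={cov_n1:.4f}  corr={corr:.4f}")
```

The correlation does not depend on which divisor is used, which is one reason it is the more convenient measure to compare across pairs of variables.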
Back to basics… Considering one variable
If we consider Internet_Access, can our prediction of EPO be better?
The high value of the correlation tells us that the observations tend to cluster around a line having a positive slope. This line, highlighted in the scatterplot, is called the regression line. Its analytical expression can be easily determined:
EPO = –60.018 + 4.5934*Int_Acc
Back to basics… Considering one variable
Consider Internet_Access to improve the prediction of EPO: use the regression line, EPO = –60.018 + 4.5934*Int_Acc.
For each observation we can calculate the difference between the observed EPO value and the value predicted using the regression line. In the plot the error is highlighted for Spain.
The MODEL SUM OF SQUARES of EPO given Int_Acc, as the term is used in these slides, is the sum of the squared errors incurred when using the line to predict EPO (this quantity is more commonly called the residual sum of squares).
Back to basics… Considering one variable
Compare general mean / regression line as predictors of EPO (general mean = 127.6987; prediction using the line = 4.5934*Int_Acc – 60.018).

| country | Int_Acc | EPO | Squared error (general mean) | Prediction using the line | Squared error (line) |
|---|---|---|---|---|---|
| Romania | 6.00 | 1.31 | 15974.1035 | -32.4576 | 1140.251 |
| Czech Republic | 19.00 | 12.04 | 13376.9349 | 27.2566 | 231.5449 |
| Lithuania | 12.00 | 2.78 | 15604.6816 | -4.8972 | 58.9394 |
| Ireland | 40.00 | 79.87 | 2287.5845 | 123.718 | 1922.647 |
| Norway | 60.00 | 135.77 | 65.1459 | 215.586 | 6370.594 |
| UK | 56.00 | 124.19 | 12.311 | 197.2124 | 5332.271 |
| Finland | 51.00 | 309.09 | 32902.8037 | 174.2454 | 18183.07 |
| Sweden | 73.00 | 293.32 | 27430.415 | 275.3002 | 324.7132 |
| Greece | 17.00 | 9.87 | 13883.6025 | 18.0698 | 67.2367 |
| Italy | 34.00 | 84.14 | 1897.3603 | 96.1576 | 144.4227 |
| Spain | 34.00 | 30.64 | 9420.3912 | 96.1576 | 4292.556 |
| Netherlands | 67.00 | 246.15 | 14030.7105 | 247.7398 | 2.5275 |
| Belgium | 50.00 | 141.80 | 198.8467 | 169.652 | 775.7339 |
| Germany | 60.00 | 299.99 | 29684.2921 | 215.586 | 7124.035 |
| France | 34.00 | 144.52 | 282.9561 | 96.1576 | 2338.922 |
| Sum | | | TOTAL SS EPO = 177052.1395 | | MODEL SS EPO given Int_Acc = 48309.46 |

Notice that we have a considerable decrease in the prediction errors.
Back to basics… Considering one variable
If we consider Internet_Access, can our prediction of EPO be better?
If we use the line (a function of Int_Acc), our improvement as compared to the general mean is $R^2 = (177052.1395 - 48309.46) / 177052.1395 \approx 0.727$, i.e., the % of variance of EPO accounted for by Int_Acc.
The $R^2$ index ranges from 0 to 1 and measures the ability of the numerical variable to predict the other one. It can be shown that the index coincides with the squared correlation coefficient ($0.8527^2 \approx 0.727$). Hence the correlation measures the extent of linear association, whereas its square measures the percentage of the variance of one variable which can be explained by the other (numerical) variable.
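The regression line and the associated $R^2$ can be recomputed from the data. The following is a minimal sketch in plain Python using the usual least-squares formulas (slope = cross-product sum / sum of squares of the predictor); the names are illustrative, and the data are the values listed in the slides.

```python
import math

# Values for the 15 countries (Romania ... France), as listed in the slides
internet = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mx, my = sum(internet) / n, sum(epo) / n
sxx = sum((x - mx) ** 2 for x in internet)
syy = sum((y - my) ** 2 for y in epo)          # Total SS of EPO
sxy = sum((x - mx) * (y - my) for x, y in zip(internet, epo))

# Least-squares regression line: EPO = intercept + slope * Int_Acc
slope = sxy / sxx
intercept = my - slope * mx

# Sum of squared errors incurred when predicting EPO with the line
predicted = [intercept + slope * x for x in internet]
error_ss = sum((y - yhat) ** 2 for y, yhat in zip(epo, predicted))

r2 = (syy - error_ss) / syy
corr = sxy / math.sqrt(sxx * syy)

print(f"EPO = {intercept:.3f} + {slope:.4f} * Int_Acc")
print(f"error SS={error_ss:.2f}  R^2={r2:.4f}  corr^2={corr ** 2:.4f}")
```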
Data Matrices (Numerical variables only)
Data matrices
Example 1 (continued). Innovation and Research in Europe.
For the sake of simplicity, we limit attention to a few variables and a few observations. The country variable is useful to identify the statistical units but it is not an object of analysis. At the moment we consider only numerical variables.
For each observation we have information collected on p variables. For each variable we have information collected on n observations. The data matrix contains the information available for the n cases (rows) on the p variables (columns). Here we have 15 rows (cases, n) and 7 columns (vars, p).

| country | region | GERD | GERD_industry | GERD_govern | Internet_Acc | ST_grad | EPO | E_gov_avail |
|---|---|---|---|---|---|---|---|---|
| Romania | Eastern | 0.39 | 47.60 | 43.00 | 6.00 | 5.80 | 1.31 | 25.00 |
| Czech Republic | Eastern | 1.20 | 52.50 | 43.60 | 19.00 | 6.00 | 12.04 | 30.00 |
| Lithuania | Northern | 0.67 | 37.10 | 56.30 | 12.00 | 14.60 | 2.78 | 40.00 |
| Ireland | Northern | 1.10 | 66.70 | 25.60 | 40.00 | 20.50 | 79.87 | 50.00 |
| Norway | Northern | 1.60 | 51.60 | 39.80 | 60.00 | 7.70 | 135.77 | 56.00 |
| UK | Northern | 1.83 | 45.60 | 28.80 | 56.00 | 20.30 | 124.19 | 59.00 |
| Finland | Northern | 3.30 | 70.80 | 25.50 | 51.00 | 17.40 | 309.09 | 67.00 |
| Sweden | Northern | 4.25 | 71.50 | 21.30 | 73.00 | 13.30 | 293.32 | 74.00 |
| Greece | Southern | 0.64 | 33.00 | 46.60 | 17.00 | 8.00 | 9.87 | 32.00 |
| Italy | Southern | 1.09 | 47.20 | 46.80 | 34.00 | 7.40 | 84.14 | 53.00 |
| Spain | Southern | 0.91 | 47.20 | 39.90 | 34.00 | 11.90 | 30.64 | 55.00 |
| Netherlands | Western | 1.80 | 51.90 | 35.80 | 67.00 | 6.60 | 246.15 | 32.00 |
| Belgium | Western | 2.08 | 63.40 | 22.00 | 50.00 | 10.50 | 141.80 | 35.00 |
| Germany | Western | 2.46 | 65.70 | 31.40 | 60.00 | 8.10 | 299.99 | 47.00 |
| France | Western | 2.20 | 54.20 | 36.90 | 34.00 | 19.50 | 144.52 | 50.00 |
Data matrices
Example 1 (continued). Innovation and Research in Europe (subset; same 15 × 7 data matrix as above, without the region).
To each observation a collection of p values is associated. These values are the realizations of the p variables observed for that observation. Similarly, to each variable a collection of n values can be associated (the values observed for all the cases).
A collection of k values is usually called a vector. To avoid confusion, we will only consider column vectors, with dimension ($k \times 1$), i.e., a collection of values arranged in k rows and 1 column. A row ($1 \times k$) vector can always be seen as the transpose of a column ($k \times 1$) vector.
Data matrices
$x_i$ = vector ($p \times 1$) containing the measurements on the $p$ variables for the $i$-th case.
$x_{(j)}$ = vector ($n \times 1$) containing the $n$ measurements on the $j$-th variable.
Data matrix ($n$ individuals and $p$ variables): $X = [x_{ij}]$, of dimension $n \times p$.
Transposition operation: a data matrix can be seen as a collection of $n$ row (transposed) vectors $x_i'$ (cases) and/or as a collection of $p$ column vectors $x_{(j)}$ (variables).
Data matrices
Example 1 (continued). Innovation and Research in Europe (subset; same 15 × 7 data matrix as above).
Highlighted in the data matrix: the row vector associated with “Belgium” (measurements on 7 vars) and the column vector associated with EPO (measurements on 15 obs).
The element in the $i$-th row and $j$-th column, $x_{ij}$, is the value observed for the $i$-th case on the $j$-th variable. In this simple example, $x_{13,6}$ = 141.80 is the value of EPO (6th variable) for Belgium (13th observation).
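The row/column view of the data matrix can be illustrated with a small sketch. It assumes plain Python lists, rows ordered Romania ... France and columns ordered GERD, GERD_industry, GERD_govern, Internet_Acc, ST_grad, EPO, E_gov_avail (the ordering under which Belgium is the 13th observation and EPO the 6th variable); the names `X`, `x_belgium` and `x_epo` are illustrative. Note that Python indices start at 0, so $x_{13,6}$ is `X[12][5]`.

```python
# Data matrix X: n = 15 cases (rows), p = 7 variables (columns)
X = [
    [0.39, 47.60, 43.00,  6.00,  5.80,   1.31, 25.00],  # Romania
    [1.20, 52.50, 43.60, 19.00,  6.00,  12.04, 30.00],  # Czech Republic
    [0.67, 37.10, 56.30, 12.00, 14.60,   2.78, 40.00],  # Lithuania
    [1.10, 66.70, 25.60, 40.00, 20.50,  79.87, 50.00],  # Ireland
    [1.60, 51.60, 39.80, 60.00,  7.70, 135.77, 56.00],  # Norway
    [1.83, 45.60, 28.80, 56.00, 20.30, 124.19, 59.00],  # UK
    [3.30, 70.80, 25.50, 51.00, 17.40, 309.09, 67.00],  # Finland
    [4.25, 71.50, 21.30, 73.00, 13.30, 293.32, 74.00],  # Sweden
    [0.64, 33.00, 46.60, 17.00,  8.00,   9.87, 32.00],  # Greece
    [1.09, 47.20, 46.80, 34.00,  7.40,  84.14, 53.00],  # Italy
    [0.91, 47.20, 39.90, 34.00, 11.90,  30.64, 55.00],  # Spain
    [1.80, 51.90, 35.80, 67.00,  6.60, 246.15, 32.00],  # Netherlands
    [2.08, 63.40, 22.00, 50.00, 10.50, 141.80, 35.00],  # Belgium
    [2.46, 65.70, 31.40, 60.00,  8.10, 299.99, 47.00],  # Germany
    [2.20, 54.20, 36.90, 34.00, 19.50, 144.52, 50.00],  # France
]

n, p = len(X), len(X[0])

x_belgium = X[12]                  # case vector x_i for i = 13 (p values)
x_epo = [row[5] for row in X]      # variable vector x_(j) for j = 6 (n values)

# Transposition: rows become columns, so X_t has p rows and n columns
X_t = [list(col) for col in zip(*X)]

print(n, p, X[12][5])
```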
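Data matrices – Vectors
A ($K \times 1$) vector can be represented as an oriented line (an arrow from the origin) in a $K$-dimensional space.
(Figure panels: a one-dimensional vector (scalar), with coordinate $v_1$; a two-dimensional vector, with coordinates $v_1$, $v_2$; a three-dimensional vector, with coordinates $v_1$, $v_2$, $v_3$.)
Vectors of higher dimension cannot be represented in this way.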
Data matrices – Vectors (length)
For a given vector $v$ in the $k$-dimensional space, we define its length as $\|v\| = \sqrt{v_1^2 + v_2^2 + \dots + v_k^2}$.
It is the length of the line connecting $v$ to the origin, $0$. (Figure panels: the one-, two- and three-dimensional cases.)
Data matrices – Vectors (Distance)
Given two vectors, $v$ and $u$, in the $k$-dimensional space, we define the Euclidean distance between $v$ and $u$ as the length of the line connecting $v$ to $u$: $d(v, u) = \sqrt{\sum_{j=1}^{k} (v_j - u_j)^2}$.
Example in the two-dimensional space: the line connecting $v$ to $u$ is the hypotenuse of a right triangle with sides $|v_1 - u_1|$ and $|v_2 - u_2|$.
Note: the length of a vector $v$ coincides with its distance from the origin, $0$.
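The two definitions translate directly into code. A minimal sketch in plain Python follows; the function names `length` and `euclidean_distance` are illustrative.

```python
import math

def length(v):
    """Length of a vector: square root of the sum of its squared elements."""
    return math.sqrt(sum(x ** 2 for x in v))

def euclidean_distance(v, u):
    """Euclidean distance: length of the difference vector v - u."""
    return length([a - b for a, b in zip(v, u)])

v = [3.0, 4.0]
u = [0.0, 0.0]
print(length(v))                  # 5.0
print(euclidean_distance(v, u))   # 5.0: the length is the distance from the origin
```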
Analyze multivariate samples in a geometrical perspective Describe distances in the Euclidean Space
Data matrices
A data matrix can be seen as a collection of two kinds of vectors:
Row vectors: the $x_i$ lie in the $p$-dimensional space.
Column vectors: the $x_{(j)}$ lie in the $n$-dimensional space.
Hence two spaces can be considered to analyze/describe a data matrix. Of course, these spaces are related to each other. For the sake of simplicity, we will analyze in depth only the space of the observations.
Syntheses of variables
The position. The sample mean (an unbiased estimator of the population mean) for the $j$-th variable (column) is $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$.
Remember: the mean is not robust (it is sensitive to extreme values).
How to arrange the syntheses of $p$ variables, i.e., how to synthesize the elements of the column vectors? In the vector of the sample means (centroid), $\bar{x} = (\bar{x}_1, \dots, \bar{x}_p)'$. It may be seen as the vector associated with the “artificial case” “mean”: an unobserved case that is average with respect to all the variables.
The space of the observations
Consider a graphical representation we are used to: the 2-dimensional space. (Scatter plot with reference lines at the mean of E_gov_indiv and at the mean of Internet_Acc. Note: axes adjusted to have the same scale.)
The centroid (vector whose elements are the sample means) is the centre of gravity of the cloud. It is the point which is globally least distant from all the points: it minimizes the sum of the squared Euclidean distances from the observations.
Synthesis of variables
The sample variance is the average of the squared distances between the observed values and the sample mean, i.e., the average of the squared errors we incur when substituting the observed values with the sample mean.
The Std. Dev has the same unit of measurement as the variable taken into account. It measures the expected error (below or above the mean) we incur when substituting the mean for a generic case. It can also be regarded as the average distance between a generic value and the mean, i.e., the expected distance from the mean.
Being based upon averages, both the variance and the standard deviation are not robust (they are sensitive to extreme values).
The space of the observations Consider again the 2-dimensional space Let us consider the distance from Iceland (IS) to the centroid  Note: axes adjusted to have the same scale. Absolute Difference between the Iceland E_gov_Indiv value and the mean of E_gov_Indiv Absolute Difference between the Iceland Internet_Acc  value and the mean of Internet_Acc
The space of the observations
Consider, in the 2-dimensional space, ALL THE DISTANCES FROM THE POINTS TO THE CENTROID. Note: axes adjusted to have the same scale.
Var(E_gov_indiv) + Var(Internet_Acc), i.e., the SUM of the variances of THE TWO VARIABLES, is proportional to the sum of the squared distances from the observations to the centroid.
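This proportionality can be verified numerically. The sketch below is only an illustration under stated assumptions: it uses plain Python and, since the full data set behind the plot is not listed in the slides, it uses the Internet_Acc and E_gov_avail columns of the 15-country subset instead. With variances computed with divisor $n - 1$, the proportionality constant is $n - 1$.

```python
# 15-country subset (Romania ... France): Internet_Acc and E_gov_avail
internet = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
e_gov = [25, 30, 40, 50, 56, 59, 67, 74, 32, 53, 55, 32, 35, 47, 50]

n = len(internet)
centroid = (sum(internet) / n, sum(e_gov) / n)

def variance(values, mean):
    """Sample variance with divisor n - 1."""
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

var_x = variance(internet, centroid[0])
var_y = variance(e_gov, centroid[1])

# Sum of the squared Euclidean distances from each point to the centroid
sum_sq_dist = sum((x - centroid[0]) ** 2 + (y - centroid[1]) ** 2
                  for x, y in zip(internet, e_gov))

print(sum_sq_dist, (n - 1) * (var_x + var_y))  # the two values coincide
```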
Synthesis of association between vars
The linear association. The sample covariance for the $j$-th and the $h$-th variables (columns) is $s_{jh} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ih} - \bar{x}_h)$ (absolute measure of linear association).
The sample correlation coefficient for the $j$-th and the $h$-th variables is $r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}\, s_{hh}}}$ (relative measure of linear association; it ranges from –1 to +1).
Remember: being based upon averages, the covariance and the correlation coefficient are not robust (they are sensitive to extreme values).
The space of the observations Consider again the 2-dimensional space Since the covariance and the correlations are actually measuring the concentration of points around a line, both the indices give us information about the ORIENTATION of the scatter. Note: axes adjusted to have the same scale.
Variance and Covariance Matrix
Variances and covariances are arranged in the so-called variance and covariance matrix, $S$.
$S$ is a square matrix (the number of rows equals the number of columns).
The diagonal elements of $S$, $s_{jj}$, are the variances (notice that the variance can be regarded as the covariance between one variable and itself).
The off-diagonal elements of $S$, $s_{jh}$, are the covariances.
Since $s_{jh} = s_{hj}$, $S$ is a symmetric matrix.
Correlation Matrix
Correlations are arranged in the correlation matrix, $R$.
$R$ is also a square matrix, and its diagonal elements are 1’s (the correlation between one variable and itself is 1).
Its off-diagonal elements, $r_{jh}$, are the correlations and, of course, $R$ is a symmetric matrix.
Due to the relationship between covariances and correlations, $r_{jh} = s_{jh} / \sqrt{s_{jj}\, s_{hh}}$, $R$ can be simply obtained from the variance and covariance matrix.
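To make the two matrices concrete, here is a minimal sketch in plain Python that builds $S$ (divisor $n - 1$) and then derives $R$ from it. As an illustration it uses three columns of the 15-country subset (EPO, Internet_Acc, E_gov_avail); the helper name `cov` is hypothetical.

```python
import math

# Columns (Romania ... France): EPO, Internet_Acc, E_gov_avail
columns = [
    [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
     9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52],
    [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34],
    [25, 30, 40, 50, 56, 59, 67, 74, 32, 53, 55, 32, 35, 47, 50],
]
n, p = len(columns[0]), len(columns)
means = [sum(c) / n for c in columns]

def cov(j, h):
    """Sample covariance (divisor n - 1) between variables j and h."""
    return sum((a - means[j]) * (b - means[h])
               for a, b in zip(columns[j], columns[h])) / (n - 1)

# Variance and covariance matrix S: variances on the diagonal
S = [[cov(j, h) for h in range(p)] for j in range(p)]

# Correlation matrix R obtained from S: r_jh = s_jh / sqrt(s_jj * s_hh)
R = [[S[j][h] / math.sqrt(S[j][j] * S[h][h]) for h in range(p)]
     for j in range(p)]

for row in R:
    print([round(r, 4) for r in row])
```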
The space of the observations The centroid  (vector whose elements are the sample means)   is the  centre of gravity  of the  p -dimensional cloud The  elements of the variance and covariance matrix  give us information about the  dispersion around the centroid ( remember the 2-dimension example)  and on the orientation of the cloud
Measuring dispersion
The Total Variance is the sum of the diagonal elements of the var/cov matrix, $S$. The sum of the diagonal elements of a square matrix is defined to be its trace. Hence, we have: Total Variance $= \mathrm{tr}(S) = s_{11} + s_{22} + \dots + s_{pp}$.
Notice that we are not taking into account the interrelationships between vars, i.e., the orientation of the cloud.
The space of the observations
To motivate the second measure of multivariate dispersion, consider the “portion” of the space which is occupied by the data (the area of the ellipse). We will come back to this concept later, but we can intuitively understand that the area of the ellipse (in higher-dimensional space, the volume of an ellipsoid) is somehow related to the variances and to the covariances, i.e., to all the entries of the var/cov matrix, $S$.
Measuring dispersion
The space of the observations The variances and covariance matrix contains relevant information to describe the points in a  p -dimensional space, and, also information about their distances. We now consider different measures of  distances between cases  in the  p -dimensional space, related to particular  transformations  of the original vars. Notice first that if the variables are centred on their mean nothing changes as concerns the dispersion of the points. This operation only consists in a change of the origin
Multivariate Samples - Transformations
TRANSFORMATION: VARS CENTRED ON THEIR MEANS
Original Data Matrix: Centroid = $\bar{x}$; Var/Cov Matrix: $S$; Corr Matrix: $R$.
Centred Data Matrix: Centroid = Origin = $0$; Var/Cov Matrix: $S$; Corr Matrix: $R$.
The centred matrix is obtained by subtracting from each observation on a given variable the mean of that variable. That is, from all the observations in a given column, say the $j$-th, the mean of the $j$-th variable is subtracted.
A closer look at the distance The Euclidean distance is the length of the line connecting a point to the origin.  Consider, in the plot of the centred variables, Cyprus and Italy: their distance from the origin, 0, is (almost) the same.  This similar distance is due to  different combinations of  x-  and  y-   deviations  from 0.   Should the  x-  and  y-   deviations  be   evaluated in the same manner ?  Notice that the distance of Slovakia from the origin is higher.  We will consider this later
A closer look at the distance
Remember: the standard deviation of a variable is the typical deviation from the mean. Here Std.Dev.(E_gov_Avail) = 15 and Std.Dev.(Int_Acc) = 21.31.
To compare the deviations from the origin adequately (the data are centred), we should take into account the Std.Dev (of course, squared deviations should be compared with variances).
Internet_Acc has a higher Std.Dev. Hence, a deviation D from the origin along the horizontal axis should “count less” than a deviation D from the origin along the vertical axis.
A closer look at the distance
In the Euclidean distance, the deviations are considered in absolute terms. When we are considering variables having different Std.Dev, we should consider relative deviations. To remove the effect of the Std.Dev, thus obtaining comparable deviations, we have to standardize the variables.
Standardization of the $j$-th variable: $z_{ij} = (x_{ij} - \bar{x}_j) / \sqrt{s_{jj}}$.
The Euclidean distance between two standardized observations, $i$ and $l$, is $d(z_i, z_l) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_{lj})^2 / s_{jj}}$.
Statistical Distance: a different weight ($1/s_{jj}$) is assigned to the squared deviation of each variable in the calculation of the distance. The statistical distance is proportional to the Euclidean one only if the variances are all equal.
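The equivalence between the statistical distance and the Euclidean distance on standardized data can be checked numerically. This is a minimal sketch in plain Python, under the assumption that the Internet_Acc and E_gov_avail columns of the 15-country subset are used (the countries shown in the plots belong to a larger data set not listed in the slides); Sweden and Greece are chosen arbitrarily as the two observations.

```python
import math

# 15-country subset (Romania ... France): Internet_Acc and E_gov_avail
internet = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
e_gov = [25, 30, 40, 50, 56, 59, 67, 74, 32, 53, 55, 32, 35, 47, 50]
columns = [internet, e_gov]

n = len(internet)
means = [sum(c) / n for c in columns]
variances = [sum((v - m) ** 2 for v in c) / (n - 1)
             for c, m in zip(columns, means)]

sweden = (73, 74)   # (Internet_Acc, E_gov_avail)
greece = (17, 32)

# Statistical distance: squared deviations weighted by 1 / s_jj
stat_dist = math.sqrt(sum((a - b) ** 2 / s
                          for a, b, s in zip(sweden, greece, variances)))

def standardize(obs):
    """z_j = (x_j - mean_j) / std_dev_j for each variable."""
    return [(x - m) / math.sqrt(s) for x, m, s in zip(obs, means, variances)]

# Euclidean distance between the standardized observations
z_s, z_g = standardize(sweden), standardize(greece)
eucl_on_z = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_s, z_g)))

print(stat_dist, eucl_on_z)  # the two distances coincide
```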
A closer look at the distance
The statistical distance (visualization in the original/centred space). (In the plot: ellipse of points having the same statistical distance from the origin.)
x-deviations are penalized less than y-deviations, since the x-axis is characterized by a higher dispersion. Hence Cyprus, which shows a higher y-deviation from the origin than Italy, has a higher statistical distance from the origin than Italy.
Notice that Slovakia has a statistical distance from 0 which is now similar to that of Cyprus.
Multivariate Samples - Transformations
TRANSFORMATION: STANDARDIZED VARS
Original Data Matrix: Centroid = $\bar{x}$; Var/Cov Matrix: $S$; Corr Matrix: $R$.
Standardized Data Matrix: Centroid = Origin = $0$; Var/Cov Matrix: $R$; Corr Matrix: $R$.
The standardized matrix is obtained by subtracting from each observation on a given variable the mean of that variable and by dividing this difference by the Std.Dev. Like the centred vars, the standardized vars have null mean; in addition, their variances are all equal to 1 (the unit of measurement is removed). Since Variance = Std.Dev = 1 for each variable, the covariances coincide with the correlations (Corr = Cov / product of the Std.Dev’s).
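The properties of standardized variables stated above (null means, unit variances, covariance equal to the correlation of the original variables) can be verified with a short sketch. It assumes plain Python and, as an illustration, the Internet_Acc and EPO columns of the 15-country subset; the helper names are hypothetical.

```python
import math

# 15-country subset (Romania ... France): Internet_Acc and EPO
internet = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]
n = len(epo)

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divisor n - 1); cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def standardize(v):
    """Subtract the mean and divide by the standard deviation."""
    m, sd = mean(v), math.sqrt(cov(v, v))
    return [(x - m) / sd for x in v]

z_int, z_epo = standardize(internet), standardize(epo)

corr_original = cov(internet, epo) / math.sqrt(cov(internet, internet) * cov(epo, epo))
cov_standardized = cov(z_int, z_epo)

print(round(mean(z_int), 10), round(cov(z_int, z_int), 10))
print(corr_original, cov_standardized)  # covariance of z's = correlation of x's
```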
A closer look at the distance
Euclidean distance in the standardized space. The standardization makes all the differences comparable, so now the Euclidean distance coincides with the statistical distance calculated in the original space. Notice that the cloud still has an orientation.
(Companion panels: Euclidean distance in the original space; statistical distance in the original space.)
A closer look at the distance
In the statistical distance, deviations are adjusted by taking into account the dispersions of the variables. But no attention is paid to the “coherence” between each point and the cloud of points (standardization does not involve correlations).
Slovakia and Cyprus are equally statistically distant from the origin. Notice that Lithuania is more statistically distant from the origin.
Consider the orientation of the cloud: the line connecting Lithuania to 0 has the same direction as the cloud. This is less true for Slovakia. The line connecting Cyprus to the origin is in countertendency.
A closer look at the distance
In the statistical distance, the coherence with the orientation of the cloud is not considered. A transformation of the data which removes the effect of the Std.Dev, and also penalizes deviations by considering the orientation of the cloud of points, is the so-called Mahalanobis transformation. We do not enter into details here.
The so-called Mahalanobis distance is defined as the Euclidean distance calculated on Mahalanobis-transformed observations. Each Mahalanobis-transformed variable is a particular linear combination of the considered variables.
Multivariate Samples - Transformations
TRANSFORMATION: MAHALANOBIS
Original Data Matrix: Centroid = $\bar{x}$; Var/Cov Matrix: $S$; Corr Matrix: $R$.
Mahalanobis Data Matrix: Centroid = Origin = $0$; Var/Cov Matrix: $I$; Corr Matrix: $I$.
The Mahalanobis distance is the Euclidean distance evaluated after transforming the data according to the Mahalanobis transformation.
The variables transformed according to the Mahalanobis transformation have null means, variances all equal to 1 (the unit of measurement is removed), and null correlations (the orientation of the cloud is removed).
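Since the slides do not define the Mahalanobis transformation explicitly, the following is only a hedged sketch of one way to obtain data with the stated properties (null means, unit variances, null correlations). It assumes a Cholesky-based whitening, $z = L^{-1}(x - \bar{x})$ with $S = L L'$, which is not necessarily the transformation the author has in mind (a symmetric inverse square root of $S$ is another common choice), but which yields the same distances. It also uses the standard quadratic form $(x - \bar{x})' S^{-1} (x - \bar{x})$ for the squared Mahalanobis distance from the centroid. Data: Internet_Acc and E_gov_avail for the 15-country subset.

```python
import math

# 15-country subset (Romania ... France): Internet_Acc and E_gov_avail
x1 = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
x2 = [25, 30, 40, 50, 56, 59, 67, 74, 32, 53, 55, 32, 35, 47, 50]
n = len(x1)

def cov(a, b):
    """Sample covariance (divisor n - 1)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)

# Centre the data on the centroid
m1, m2 = sum(x1) / n, sum(x2) / n
c1 = [x - m1 for x in x1]
c2 = [x - m2 for x in x2]

# Cholesky factor of the 2 x 2 matrix S = L L' (L lower triangular)
l11 = math.sqrt(s11)
l21 = s12 / l11
l22 = math.sqrt(s22 - l21 ** 2)

# Whitening: solve L z = c for each observation (forward substitution)
z1 = [a / l11 for a in c1]
z2 = [(b - l21 * za) / l22 for b, za in zip(c2, z1)]

# Euclidean distance from the origin in the transformed space ...
d_transformed = [math.sqrt(a ** 2 + b ** 2) for a, b in zip(z1, z2)]

# ... versus the quadratic form c' S^{-1} c in the original (centred) space
det = s11 * s22 - s12 ** 2
d_quadratic = [math.sqrt((s22 * a * a - 2 * s12 * a * b + s11 * b * b) / det)
               for a, b in zip(c1, c2)]

print(round(cov(z1, z1), 6), round(cov(z2, z2), 6), round(cov(z1, z2), 6))
```

The printed variances should be 1 and the covariance 0 (up to rounding), and the two lists of distances should coincide observation by observation.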
A closer look at the distance  Mahalanobis Distance:  deviations from the origin are adjusted by taking into account both the dispersions of variables  and  their correlations (orientation).  Now Cyprus, being in countertendency with respect to the orientation of the cloud is characterized by a Mahalanobis distance from  0  which is higher than that characterizing Slovakia.  Notice that Lithuania has a Mahalan. distance from 0 similar to that of Slovakia. Points having the same Mahalanobis distance from the origin
A closer look at the distance
Euclidean distance in the Mahalanobis space. By removing both dispersion and correlation, differences are comparable also with respect to their orientation, so now the Euclidean distance coincides with the Mahalanobis distance calculated in the original space. Notice that the cloud has no orientation.
(Companion panels: Euclidean distance (original space); statistical distance (original space); Mahalanobis distance (original space).)
Multivariate samples – Transformations
Conclusion: by transforming the data via standardization or the Mahalanobis transformation we are simply defining a new space such that the Euclidean distance calculated on the transformed points coincides, respectively, with:
Statistical distance (standardization): deviations are evaluated differently depending on their Std.Dev.
Mahalanobis distance (Mahalanobis transformation): deviations are evaluated differently depending on the Std.Dev.’s and on the orientation of the cloud (correlations/covariances).
So far the latter transformation has not been explicitly defined, due to its analytical complexity, but we will see later how to obtain Mahalanobis-transformed data.

| | ORIGINAL ($X$) | CENTRED ON MEAN | STANDARDIZATION ($Z$) | MAHALANOBIS ($Z_M$) |
|---|---|---|---|---|
| Means | $\bar{x}_j$ | 0 | 0 | 0 |
| Variances | $s_{jj}$ | $s_{jj}$ | 1 | 1 |
| Covariances | $s_{jk}$ | $s_{jk}$ | $r_{jk}$ | 0 |
| Correlations | $r_{jk}$ | $r_{jk}$ | $r_{jk}$ | 0 |
| Euclidean distance | Euclidean | Euclidean | Statistical | Mahalanobis |

8323 Stats - Lesson 1 - 04 Multivariate Vectors And Samples 2008

  • 1. Multivariate Samples. Goals: recall some very basic concepts of univariate and bivariate statistics; describe multivariate samples; analyze multivariate samples from a geometrical perspective; describe distances in Euclidean space.
  • 2. The data we will consider. Example 1: Innovation and Research in Europe (source: Eurostat).
    | Variable | Description |
    |---|---|
    | Geo | Country code |
    | Country | Country name |
    | Region | European region of the country |
    | E_gov_avail | E-government online availability: online availability of 20 basic public services |
    | HT_Exports | Exports of high-technology products as a share of total exports |
    | Y_Educ__Lev_m | % of males aged 20-24 having completed at least upper secondary education |
    | Y_Educ_Lev_f | % of females aged 20-24 having completed at least upper secondary education |
    | Y_Educ_Lev | Youth education attainment level, total: % of the population aged 20-24 who completed at least upper secondary education |
    | Telec_Expenditure | Expenditure on telecommunications as a % of GDP |
    | IT_Expenditure | Expenditure on information technology as a % of GDP |
    | USTPO | Number of patents granted by the US Patent and Trademark Office per million inhabitants |
    | EPO | Number of patent applications to the European Patent Office per million inhabitants |
    | ST_grad_m | Male tertiary graduates in S&T per 1000 males aged 20-29 |
    | ST_grad_f | Female tertiary graduates in S&T per 1000 females aged 20-29 |
    | ST_grad | Science and technology: tertiary graduates in S&T per 1000 persons aged 20-29 |
    | Internet_Acc | Level of Internet access: % of households with Internet access at home |
    | GERD_abroad | % of GERD financed from abroad |
    | GERD_govern | % of GERD financed by government |
    | GERD_industry | % of GERD financed by industry |
    | GERD | Gross domestic expenditure on R&D (GERD) as a % of GDP |
    | Educ_Exp | Spending on human resources (total public expenditure on education) as a % of GDP |
  • 3. Some basic concepts of Univariate and Bivariate statistics
  • 4. Back to basics… Considering one variable. Let us consider one variable of interest, say EPO. In statistics, a commonly used measure of position is the arithmetic (sample) mean, obtained by summing all the observed values and dividing the result by the number of observations. Here, mean = 127.6987.
    [Plot labels: Netherlands, Spain, Mean = 127.6987]
    | country | region | Internet_Acc | EPO |
    |---|---|---|---|
    | Romania | Eastern | 6.00 | 1.31 |
    | Czech Republic | Eastern | 19.00 | 12.04 |
    | Lithuania | Northern | 12.00 | 2.78 |
    | Ireland | Northern | 40.00 | 79.87 |
    | Norway | Northern | 60.00 | 135.77 |
    | UK | Northern | 56.00 | 124.19 |
    | Finland | Northern | 51.00 | 309.09 |
    | Sweden | Northern | 73.00 | 293.32 |
    | Greece | Southern | 17.00 | 9.87 |
    | Italy | Southern | 34.00 | 84.14 |
    | Spain | Southern | 34.00 | 30.64 |
    | Netherlands | Western | 67.00 | 246.15 |
    | Belgium | Western | 50.00 | 141.80 |
    | Germany | Western | 60.00 | 299.99 |
    | France | Western | 34.00 | 144.52 |
  • 5. Back to basics… Considering one variable. Variable of interest: EPO. The mean can be used to make a "prediction" about EPO for a generic country when no further information is available. To evaluate the reliability of the mean as a synthesis of the observed data, we can consider, for each observed value, the error incurred when replacing it with the sample mean (mean = 127.6987). In the plot: the errors incurred when replacing the values observed for the Netherlands and Spain with the mean. The TOTAL SUM OF SQUARES is the sum of the squared errors, $\sum_{i=1}^{n} (x_i - \bar{x})^2$.
  • 6. Back to basics… Considering one variable. Variable of interest: EPO. A synthesis of the errors, and a measure of the reliability of the mean as a synthesis of the observed data, is the (sample) variance, $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$. This is the average of the squared errors we incur when replacing the observed values with the sample mean. It is obtained by dividing the Total SS by the number of observations minus 1. The variance of EPO turns out to be 12646.5814. Hence the error we can expect to incur for a generic observation is the square root of the variance, which is called the standard deviation.
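The quantities introduced so far (mean, Total SS, variance, standard deviation) can be recomputed from the EPO column. This is a minimal sketch using only the values listed in the slides; the variable names are illustrative.

```python
import math

# EPO values from the slides (patent applications per million inhabitants),
# listed from Romania to France.
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean = sum(epo) / n

# Total sum of squares: squared errors incurred when each value
# is replaced by the sample mean.
total_ss = sum((x - mean) ** 2 for x in epo)

# Sample variance (divisor n - 1) and standard deviation.
variance = total_ss / (n - 1)
std_dev = math.sqrt(variance)

print(round(mean, 4), round(total_ss, 2), round(variance, 2), round(std_dev, 2))
```

Up to rounding, the results match the figures quoted in the slides (mean 127.6987, Total SS 177052.14, variance 12646.58).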
  • 7. Back to basics… Considering one variable. Variable of interest: EPO. In statistics we are mainly concerned with the explanation of variance: we want to explain why a phenomenon varies, and we look for predictive tools with low prediction errors. So the question now is: can we do better than the mean? That is, can we use external information (other variables) related to EPO to predict its values with a lower error? In the following we consider two supporting variables with different characteristics, the Region (a categorical variable) and Internet access, Internet_Acc (a numerical variable), and we show how to evaluate the extent to which an external variable provides information about the variable of interest.
  • 8. Back to basics… Considering one variable. Can our prediction of EPO improve if we consider the region? We can use the conditional means (the mean of EPO within each region) rather than the general mean (127.6987). This is worthwhile only if the prediction error is considerably lower (it can be shown that, by construction, it is never higher). [Plot: values observed within the regions, with the Netherlands and Spain labelled.] [Table: EPO by country and region, as in slide 4.]
  • 9. Back to basics… Considering one variable. Consider the region to improve the prediction of EPO: use the conditional means. To evaluate the reliability of the conditional means as syntheses of the observed EPO data, we can consider the squared difference between each value and the conditional mean of its region. In the plot: the errors for the Netherlands and Spain. The WITHIN SUM OF SQUARES of EPO given Region is the sum of the squared errors incurred when using the conditional means (by region) to predict EPO.
  • 10. Back to basics… Considering one variable. Compare the general mean and the conditional means as predictors of EPO. If we use the region, our improvement as compared to the general mean is $R^2 = \frac{\text{Total SS} - \text{Within SS}}{\text{Total SS}}$, the % of the variance of EPO accounted for by Region. The R² ranges from 0 to 1. It measures the ability of the categorical variable to predict the numerical one. TOTAL SS EPO = 177052.1395; WITHIN SS EPO | REGION = 94296.85.
    | country | region | EPO | General mean | Squared error (general mean) | Conditional mean | Squared error (conditional mean) |
    |---|---|---|---|---|---|---|
    | Romania | Eastern | 1.31 | 127.6987 | 15974.1035 | 6.675 | 28.7832 |
    | Czech Republic | Eastern | 12.04 | 127.6987 | 13376.9349 | 6.675 | 28.7832 |
    | Lithuania | Northern | 2.78 | 127.6987 | 15604.6816 | 157.5033 | 23939.3 |
    | Ireland | Northern | 79.87 | 127.6987 | 2287.5845 | 157.5033 | 6026.929 |
    | Norway | Northern | 135.77 | 127.6987 | 65.1459 | 157.5033 | 472.3363 |
    | UK | Northern | 124.19 | 127.6987 | 12.311 | 157.5033 | 1109.776 |
    | Finland | Northern | 309.09 | 127.6987 | 32902.8037 | 157.5033 | 22978.53 |
    | Sweden | Northern | 293.32 | 127.6987 | 27430.415 | 157.5033 | 18446.18 |
    | Greece | Southern | 9.87 | 127.6987 | 13883.6025 | 41.55 | 1003.622 |
    | Italy | Southern | 84.14 | 127.6987 | 1897.3603 | 41.55 | 1813.908 |
    | Spain | Southern | 30.64 | 127.6987 | 9420.3912 | 41.55 | 119.0281 |
    | Netherlands | Western | 246.15 | 127.6987 | 14030.7105 | 208.115 | 1446.661 |
    | Belgium | Western | 141.8 | 127.6987 | 198.8467 | 208.115 | 4397.679 |
    | Germany | Western | 299.99 | 127.6987 | 29684.2921 | 208.115 | 8441.016 |
    | France | Western | 144.52 | 127.6987 | 282.9561 | 208.115 | 4044.324 |
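The comparison between the general mean and the conditional means can be reproduced as follows. This is a sketch based on the EPO and region values in the slides; the grouping code and variable names are illustrative, not the author's procedure.

```python
from collections import defaultdict

# Region and EPO for the 15 countries, listed from Romania to France.
regions = ["Eastern", "Eastern",
           "Northern", "Northern", "Northern", "Northern", "Northern", "Northern",
           "Southern", "Southern", "Southern",
           "Western", "Western", "Western", "Western"]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
general_mean = sum(epo) / n
total_ss = sum((x - general_mean) ** 2 for x in epo)

# Conditional means: the mean of EPO within each region.
groups = defaultdict(list)
for region, value in zip(regions, epo):
    groups[region].append(value)
cond_means = {region: sum(v) / len(v) for region, v in groups.items()}

# Within SS: squared errors when each value is predicted
# by the conditional mean of its own region.
within_ss = sum((x - cond_means[r]) ** 2 for r, x in zip(regions, epo))

# Share of the variance of EPO accounted for by Region.
r_squared = (total_ss - within_ss) / total_ss

print(cond_means)
print(round(within_ss, 2), round(r_squared, 4))
```

With these data the index is about 0.47, i.e. Region accounts for roughly 47% of the variance of EPO.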
  • 11. Back to basics… Considering one variable. Can our prediction of EPO improve if we consider Internet access (Internet_Acc)? When considering numerical variables, we are interested in evaluating whether a linear association exists between them. To evaluate whether a linear relationship exists and to determine its direction, we refer to the sample covariance (an absolute measure of linear association). [Table: Internet_Acc and EPO by country, as in slide 4.]
  • 12. Back to basics… Considering one variable. Can our prediction of EPO improve if we consider Internet access (Internet_Acc)? The covariance between the two variables is Cov(EPO, Int_Acc) = 1868.5152 (this value corresponds to the divisor n = 15; with the divisor n − 1 used for the variance in slide 6, it is 2001.98). This measure only indicates that a linear relationship exists and that it is direct (an inspection of the scatter plot confirms this). However, the value of the covariance depends on the units of measurement of the variables. A relative measure of linear association is the correlation coefficient, the covariance divided by the product of the two standard deviations. It ranges from –1 to +1: values close to +1 indicate a strong direct linear association, values close to –1 a strong inverse association, and values close to zero no linear relationship. Here we have Corr(EPO, Int_Acc) = 0.8527 (strong association).
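The covariance and the correlation can be recomputed from the two columns. The sketch below uses the values in the slides and computes the covariance with both divisors, n and n − 1, since the figure quoted above matches the former; the correlation does not depend on this choice. Variable names are illustrative.

```python
import math

# Internet_Acc and EPO for the 15 countries, listed from Romania to France.
internet_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean_x = sum(internet_acc) / n
mean_y = sum(epo) / n

# Sum of cross-products of deviations, and the two sums of squares.
cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(internet_acc, epo))
ss_x = sum((x - mean_x) ** 2 for x in internet_acc)
ss_y = sum((y - mean_y) ** 2 for y in epo)

cov_divisor_n = cross / n               # matches the value quoted in the slides
cov_divisor_n_minus_1 = cross / (n - 1)  # consistent with the sample variance

# The divisor cancels in the correlation coefficient.
corr = cross / math.sqrt(ss_x * ss_y)

print(round(cov_divisor_n, 4), round(cov_divisor_n_minus_1, 2), round(corr, 4))
```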
  • 13. Back to basics… Considering one variable. Can our prediction of EPO improve if we consider Internet access (Internet_Acc)? The high value of the correlation tells us that the observations tend to cluster around a line with a positive slope. This line, shown in the scatterplot, is called the regression line. Its analytical expression can be easily determined: EPO = –60.018 + 4.5934*Int_Acc.
  • 14. Back to basics… Considering one variable. Consider Internet access (Internet_Acc) to improve the prediction of EPO: use the regression line, EPO = –60.018 + 4.5934*Int_Acc. For each observation we can calculate the difference between the observed EPO value and the value predicted using the regression line. In the plot the error is shown for Spain. The MODEL SUM OF SQUARES of EPO given Int_Acc is the sum of the squared errors incurred when using the line to predict EPO (this quantity is more commonly called the residual, or error, sum of squares).
  • 15. Back to basics… Considering one variable. Compare the general mean and the regression line as predictors of EPO. Notice that we have a considerable decrease in the prediction errors. TOTAL SS EPO = 177052.1395; MODEL SS EPO | Int_Acc = 48309.46. Prediction using the line = 4.5934*Int_Acc – 60.018.
    | country | Int_Acc | EPO | General mean | Squared error (general mean) | Prediction using the line | Squared error (line) |
    |---|---|---|---|---|---|---|
    | Romania | 6.00 | 1.31 | 127.6987 | 15974.1035 | -32.4576 | 1140.251 |
    | Czech Republic | 19.00 | 12.04 | 127.6987 | 13376.9349 | 27.2566 | 231.5449 |
    | Lithuania | 12.00 | 2.78 | 127.6987 | 15604.6816 | -4.8972 | 58.9394 |
    | Ireland | 40.00 | 79.87 | 127.6987 | 2287.5845 | 123.718 | 1922.647 |
    | Norway | 60.00 | 135.77 | 127.6987 | 65.1459 | 215.586 | 6370.594 |
    | UK | 56.00 | 124.19 | 127.6987 | 12.311 | 197.2124 | 5332.271 |
    | Finland | 51.00 | 309.09 | 127.6987 | 32902.8037 | 174.2454 | 18183.07 |
    | Sweden | 73.00 | 293.32 | 127.6987 | 27430.415 | 275.3002 | 324.7132 |
    | Greece | 17.00 | 9.87 | 127.6987 | 13883.6025 | 18.0698 | 67.2367 |
    | Italy | 34.00 | 84.14 | 127.6987 | 1897.3603 | 96.1576 | 144.4227 |
    | Spain | 34.00 | 30.64 | 127.6987 | 9420.3912 | 96.1576 | 4292.556 |
    | Netherlands | 67.00 | 246.15 | 127.6987 | 14030.7105 | 247.7398 | 2.5275 |
    | Belgium | 50.00 | 141.8 | 127.6987 | 198.8467 | 169.652 | 775.7339 |
    | Germany | 60.00 | 299.99 | 127.6987 | 29684.2921 | 215.586 | 7124.035 |
    | France | 34.00 | 144.52 | 127.6987 | 282.9561 | 96.1576 | 2338.922 |
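The regression line and its sum of squared errors can be obtained from the same two columns. This sketch assumes the usual least-squares formulas (slope = cross-products divided by the sum of squares of the predictor), which the slides do not spell out; the variable names are illustrative.

```python
# Internet_Acc and EPO for the 15 countries, listed from Romania to France.
internet_acc = [6, 19, 12, 40, 60, 56, 51, 73, 17, 34, 34, 67, 50, 60, 34]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

n = len(epo)
mean_x = sum(internet_acc) / n
mean_y = sum(epo) / n

cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(internet_acc, epo))
ss_x = sum((x - mean_x) ** 2 for x in internet_acc)

# Least-squares slope and intercept of the line EPO = intercept + slope * Int_Acc.
slope = cross / ss_x
intercept = mean_y - slope * mean_x

# Predictions from the line and the sum of the squared prediction errors.
predicted = [intercept + slope * x for x in internet_acc]
residual_ss = sum((y - p) ** 2 for y, p in zip(epo, predicted))

# For comparison: squared errors when every value is predicted by the mean.
total_ss = sum((y - mean_y) ** 2 for y in epo)

print(round(intercept, 3), round(slope, 4))
print(round(residual_ss, 2), round(total_ss, 2))
```

Up to rounding of the coefficients, this reproduces the line and the two sums of squares reported in the slides.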
  • 16. Back to basics… Considering one variable. Can our prediction of EPO improve if we consider Internet access (Internet_Acc)? If we use the line (a function of Int_Acc), our improvement as compared to the general mean is $R^2 = \frac{\text{Total SS} - \text{Model SS}}{\text{Total SS}}$, the % of the variance of EPO accounted for by Int_Acc. The R² index ranges from 0 to 1 and measures the ability of one numerical variable to predict the other. It can be shown that the index coincides with the squared correlation coefficient. Hence the correlation measures the extent of linear association, whereas its square measures the percentage of the variance of one variable that can be explained by the other (numerical) variable.
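As a worked check with the figures quoted in the previous slides (Total SS = 177052.1395, Model SS = 48309.46, correlation 0.8527), the index and the squared correlation agree up to rounding:

```latex
\[
R^2 \;=\; \frac{177052.1395 - 48309.46}{177052.1395}
    \;=\; \frac{128742.6795}{177052.1395}
    \;\approx\; 0.7271,
\qquad
r^2 \;=\; 0.8527^2 \;\approx\; 0.7271 .
\]
```

So Int_Acc accounts for roughly 73% of the variance of EPO.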
  • 17. Data Matrices (Numerical variables only)
  • 18. Data matrices. Example 1 (continued): Innovation and Research in Europe. For the sake of simplicity, we limit attention to a few variables and a few observations. The country variable is useful to identify the statistical units but it is not an object of analysis. For the moment we consider only numerical variables. For each observation we have information collected on p variables; for each variable we have information collected on n observations. The data matrix contains the information available for the n cases (rows) on the p variables (columns). Here we have 15 rows (cases, n) and 7 columns (variables, p).
    | country | region | GERD | GERD_industry | GERD_govern | Internet_Acc | ST_grad | EPO | E_gov_avail |
    |---|---|---|---|---|---|---|---|---|
    | Romania | Eastern | 0.39 | 47.60 | 43.00 | 6.00 | 5.80 | 1.31 | 25.00 |
    | Czech Republic | Eastern | 1.20 | 52.50 | 43.60 | 19.00 | 6.00 | 12.04 | 30.00 |
    | Lithuania | Northern | 0.67 | 37.10 | 56.30 | 12.00 | 14.60 | 2.78 | 40.00 |
    | Ireland | Northern | 1.10 | 66.70 | 25.60 | 40.00 | 20.50 | 79.87 | 50.00 |
    | Norway | Northern | 1.60 | 51.60 | 39.80 | 60.00 | 7.70 | 135.77 | 56.00 |
    | UK | Northern | 1.83 | 45.60 | 28.80 | 56.00 | 20.30 | 124.19 | 59.00 |
    | Finland | Northern | 3.30 | 70.80 | 25.50 | 51.00 | 17.40 | 309.09 | 67.00 |
    | Sweden | Northern | 4.25 | 71.50 | 21.30 | 73.00 | 13.30 | 293.32 | 74.00 |
    | Greece | Southern | 0.64 | 33.00 | 46.60 | 17.00 | 8.00 | 9.87 | 32.00 |
    | Italy | Southern | 1.09 | 47.20 | 46.80 | 34.00 | 7.40 | 84.14 | 53.00 |
    | Spain | Southern | 0.91 | 47.20 | 39.90 | 34.00 | 11.90 | 30.64 | 55.00 |
    | Netherlands | Western | 1.80 | 51.90 | 35.80 | 67.00 | 6.60 | 246.15 | 32.00 |
    | Belgium | Western | 2.08 | 63.40 | 22.00 | 50.00 | 10.50 | 141.80 | 35.00 |
    | Germany | Western | 2.46 | 65.70 | 31.40 | 60.00 | 8.10 | 299.99 | 47.00 |
    | France | Western | 2.20 | 54.20 | 36.90 | 34.00 | 19.50 | 144.52 | 50.00 |
  • 19. Data matrices. Example 1 (continued): Innovation and Research in Europe (subset). To each observation a collection of p values is associated: the values observed on each variable for that observation. Similarly, to each variable a collection of n values can be associated (the values observed for all the cases). A collection of k values is usually called a vector. To avoid confusion, we will only consider column vectors, with dimension (k × 1), i.e., a collection of values arranged in k rows and 1 column. A row (1 × k) vector can always be seen as the transpose of a column (k × 1) vector. [Table: the data matrix of slide 18, without the region column.]
  • 20. Data matrices. Data matrix (n individuals and p variables): $X = (x_{ij})$, of dimension $(n \times p)$. $x_i$ = $(p \times 1)$ vector containing the measurements on the p variables for the i-th case, $x_i = (x_{i1}, \dots, x_{ip})'$. $x_{(j)}$ = $(n \times 1)$ vector containing the n measurements on the j-th variable, $x_{(j)} = (x_{1j}, \dots, x_{nj})'$. The prime denotes the transposition operation. A data matrix can be seen as a collection of n row (transposed) vectors $x_i'$ (cases) and/or as a collection of p column vectors $x_{(j)}$ (variables).
  • 21. Data matrices. Example 1 (continued): Innovation and Research in Europe (subset). The row of the table for "Belgium" is the row vector associated with that case (measurements on 7 variables); the EPO column is the column vector associated with that variable (measurements on 15 observations). The element in the i-th row and j-th column, $x_{ij}$, is the value observed for the i-th case on the j-th variable. In this simple example, $x_{13,6}$ is the value of EPO (6th variable) for Belgium (13th observation). [Table: the data matrix of slide 18, without the region column.]
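The indexing convention can be illustrated by storing the data matrix as a list of rows. This is a sketch: the row order (Romania first, France last) and the column order (GERD first, E_gov_avail last) are assumptions, chosen so that Belgium is the 13th observation and EPO the 6th variable, as stated in the slide. Python indices start at 0, so $x_{13,6}$ is `X[12][5]`.

```python
variables = ["GERD", "GERD_industry", "GERD_govern", "Internet_Acc",
             "ST_grad", "EPO", "E_gov_avail"]
countries = ["Romania", "Czech Republic", "Lithuania", "Ireland", "Norway",
             "UK", "Finland", "Sweden", "Greece", "Italy", "Spain",
             "Netherlands", "Belgium", "Germany", "France"]

# Data matrix X: one row per case (n = 15), one column per variable (p = 7).
X = [
    [0.39, 47.60, 43.00, 6.00, 5.80, 1.31, 25.00],
    [1.20, 52.50, 43.60, 19.00, 6.00, 12.04, 30.00],
    [0.67, 37.10, 56.30, 12.00, 14.60, 2.78, 40.00],
    [1.10, 66.70, 25.60, 40.00, 20.50, 79.87, 50.00],
    [1.60, 51.60, 39.80, 60.00, 7.70, 135.77, 56.00],
    [1.83, 45.60, 28.80, 56.00, 20.30, 124.19, 59.00],
    [3.30, 70.80, 25.50, 51.00, 17.40, 309.09, 67.00],
    [4.25, 71.50, 21.30, 73.00, 13.30, 293.32, 74.00],
    [0.64, 33.00, 46.60, 17.00, 8.00, 9.87, 32.00],
    [1.09, 47.20, 46.80, 34.00, 7.40, 84.14, 53.00],
    [0.91, 47.20, 39.90, 34.00, 11.90, 30.64, 55.00],
    [1.80, 51.90, 35.80, 67.00, 6.60, 246.15, 32.00],
    [2.08, 63.40, 22.00, 50.00, 10.50, 141.80, 35.00],
    [2.46, 65.70, 31.40, 60.00, 8.10, 299.99, 47.00],
    [2.20, 54.20, 36.90, 34.00, 19.50, 144.52, 50.00],
]

n, p = len(X), len(X[0])

# Row vector for the 13th case (Belgium): p measurements.
i = countries.index("Belgium")
x_belgium = X[i]

# Column vector for the 6th variable (EPO): n measurements.
j = variables.index("EPO")
x_epo = [row[j] for row in X]

# Transposition swaps the roles of rows and columns.
X_transposed = [list(column) for column in zip(*X)]

print(n, p, i + 1, j + 1, X[i][j])
```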
  • 22. Data matrices – Vectors. A (k × 1) vector can be seen as an oriented line segment in a k-dimensional space. [Figures: a one-dimensional vector (scalar) with component $v_1$; a two-dimensional vector with components $v_1, v_2$; a three-dimensional vector with components $v_1, v_2, v_3$.] Vectors of higher dimension cannot be represented in this way.
  • 23. Data matrices – Vectors (length). For a given vector $v$ in the k-dimensional space, we define its length as $\|v\| = \sqrt{v_1^2 + v_2^2 + \dots + v_k^2}$. It is the length of the line segment connecting $v$ to the origin, $0$. [Figures: the length of a one-, two- and three-dimensional vector, measured from the origin 0.]
  • 24. Data matrices – Vectors (distance). Given two vectors $v$ and $u$ in the k-dimensional space, we define the Euclidean distance between $v$ and $u$ as the length of the line segment connecting $v$ to $u$: $d(v, u) = \sqrt{\sum_{j=1}^{k} (v_j - u_j)^2}$. Note that the length of a vector $v$ coincides with its distance from the origin, $0$. [Figure: example in the two-dimensional space, with sides $|v_1 - u_1|$ and $|v_2 - u_2|$.]
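The two definitions translate directly into code. The following is a minimal sketch with hypothetical function names and made-up example vectors; it also checks that the length of a vector is its distance from the origin.

```python
import math


def length(v):
    """Length of a vector: square root of the sum of its squared components."""
    return math.sqrt(sum(component ** 2 for component in v))


def euclidean_distance(v, u):
    """Euclidean distance between two vectors of the same dimension."""
    if len(v) != len(u):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, u)))


v = [3.0, 4.0]
u = [6.0, 8.0]
origin = [0.0, 0.0]

print(length(v))                      # length of v
print(euclidean_distance(v, origin))  # same value: distance of v from the origin
print(euclidean_distance(v, u))       # distance between v and u
```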
  • 25. Analyze multivariate samples in a geometrical perspective Describe distances in the Euclidean Space
  • 26. Data matrices A data matrix can be see as a collection of two kind of vectors: Row vectors: x i lie in the p -dimensional space Column vectors: x ( j) lie in the n -dimensional space Hence two dimensional spaces can be considered to analyze/describe a data matrix. Of course, these spaces will be related one to each other. For the sake of simplicity, we will analyze in depth only the space of the observations.
  • 27. Syntheses of variables How to arrange syntheses of p variables, i.e., how to synthesize the elements of the column vectors? The position. The sample mean (unbiased estimator for the population mean) for the j-th variable (column) is $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$. Vector of the sample means (centroid): $\bar{x} = (\bar{x}_1, \dots, \bar{x}_p)'$. It may be seen as the vector associated to the “artificial case” “mean”: an unobserved case that is average with respect to all the variables. Remember: the mean is not robust (it is sensitive to extreme values).
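As a small illustration of the centroid, the sketch below averages two of the columns from Example 1 (Internet_Acc and EPO, 15 countries). The function name `centroid` is an assumption made for the example.

```python
# Two column vectors from the Example 1 subset (15 countries, same order as the data matrix).
internet_acc = [6.00, 19.00, 12.00, 40.00, 60.00, 56.00, 51.00, 73.00,
                17.00, 34.00, 34.00, 67.00, 50.00, 60.00, 34.00]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

def centroid(columns):
    """Vector of the sample means: one mean per column vector."""
    return [sum(col) / len(col) for col in columns]

x_bar = centroid([internet_acc, epo])
print(x_bar)  # [mean of Internet_Acc, mean of EPO]
```

The second element reproduces the EPO mean reported earlier in the deck (127.6987).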
  • 28. The space of the observations Consider a graphical representation we are used to: the 2-dimensional space. [Figure: scatter plot of E_gov_indiv and Internet_Acc, with reference lines at the mean of E_gov_indiv and at the mean of Internet_Acc. Note: axes adjusted to have the same scale.] The centroid (vector whose elements are the sample means) is the centre of gravity of the cloud. It is the point that minimizes the sum of the squared distances from all the points.
  • 30. The space of the observations Consider again the 2-dimensional space. Let us consider the distance from Iceland (IS) to the centroid. [Figure: scatter plot marking the absolute difference between the Iceland E_gov_indiv value and the mean of E_gov_indiv, and the absolute difference between the Iceland Internet_Acc value and the mean of Internet_Acc. Note: axes adjusted to have the same scale.]
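  • 31. The space of the observations Consider, in the 2-dimensional space, ALL THE DISTANCES FROM POINTS TO THE CENTROID. Note: axes adjusted to have the same scale. Var(E_gov_indiv) + Var(Internet_Acc), the SUM of the variances of THE TWO VARIABLES, is proportional to the sum of the squared distances from the observations to the centroid: each squared distance is the sum of the two squared deviations from the means, so summing over the observations gives the two sums of squares, i.e. the two variances multiplied by their common divisor.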
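The proportionality can be verified numerically. The sketch below is an illustration only: it uses E_gov_avail and Internet_Acc from the Example 1 subset as stand-ins for the two plotted variables, and it assumes the sample variance with divisor n − 1 (as computed by `statistics.variance`), so the proportionality constant is n − 1.

```python
import statistics

# Stand-in columns from the Example 1 subset (15 countries).
e_gov_avail = [25.0, 30.0, 40.0, 50.0, 56.0, 59.0, 67.0, 74.0,
               32.0, 53.0, 55.0, 32.0, 35.0, 47.0, 50.0]
internet_acc = [6.0, 19.0, 12.0, 40.0, 60.0, 56.0, 51.0, 73.0,
                17.0, 34.0, 34.0, 67.0, 50.0, 60.0, 34.0]

n = len(e_gov_avail)
cx = sum(e_gov_avail) / n     # first coordinate of the centroid
cy = sum(internet_acc) / n    # second coordinate of the centroid

# Sum of the squared Euclidean distances from each observation to the centroid.
sum_sq_dist = sum((x - cx) ** 2 + (y - cy) ** 2
                  for x, y in zip(e_gov_avail, internet_acc))

# Sum of the two sample variances (divisor n - 1).
total_variance = statistics.variance(e_gov_avail) + statistics.variance(internet_acc)

print(sum_sq_dist, (n - 1) * total_variance)  # the two quantities agree
```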
  • 32. Synthesis of association between vars The linear association. The sample covariance for the j-th and the h-th variables (columns) is $s_{jh} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ih} - \bar{x}_h)$ (absolute measure of linear association; $n-1$ is the usual unbiased divisor, some texts use $n$). The sample correlation coefficient for the j-th and the h-th variables is $r_{jh} = \frac{s_{jh}}{\sqrt{s_{jj}\, s_{hh}}}$ (relative measure of linear association; it ranges from –1 to +1). Remember: being based upon averages, the correlation coefficient is not robust (it is sensitive to extreme values).
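A minimal sketch of the two indices, assuming the n − 1 divisor for the covariance. The function names are illustrative; the data are the Internet_Acc and EPO columns of Example 1.

```python
import math

internet_acc = [6.00, 19.00, 12.00, 40.00, 60.00, 56.00, 51.00, 73.00,
                17.00, 34.00, 34.00, 67.00, 50.00, 60.00, 34.00]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

def sample_cov(x, y):
    """Sample covariance with divisor n - 1 (an assumption of this sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Correlation: covariance divided by the product of the standard deviations."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

s = sample_cov(internet_acc, epo)   # depends on the units of measurement
r = sample_corr(internet_acc, epo)  # unit-free, between -1 and +1
print(s, r)
```

Note that `sample_cov(x, x)` is the variance of `x`, in line with the remark that a variance is the covariance of a variable with itself.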
  • 33. The space of the observations Consider again the 2-dimensional space. Since the covariance and the correlation actually measure the concentration of points around a line, both indices give us information about the ORIENTATION of the scatter. Note: axes adjusted to have the same scale.
  • 34. Variance and Covariance Matrix Variances and covariances are arranged in the so-called variance and covariance matrix, $S = [s_{jh}]$ with $j, h = 1, \dots, p$. S is a square matrix (the number of rows equals the number of columns). The diagonal elements of S, $s_{jj}$, are the variances (notice that the variance can be regarded as the covariance between one variable and itself). The extra-diagonal elements of S, $s_{jh}$, are the covariances. Since $s_{jh} = s_{hj}$, S is a symmetric matrix.
  • 35. Correlation Matrix Correlations are arranged in the correlation matrix, $R = [r_{jh}]$. R is also a square matrix, and its diagonal elements are 1’s (the correlation between one variable and itself is 1). Its extra-diagonal elements, $r_{jh}$, are the correlations and, of course, R is a symmetric matrix. Due to the relationship between covariances and correlations, $r_{jh} = s_{jh} / \sqrt{s_{jj}\, s_{hh}}$, R can be simply obtained from the variance and covariance matrix: $R = D^{-1/2}\, S\, D^{-1/2}$, where $D = \mathrm{diag}(s_{11}, \dots, s_{pp})$.
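The step from S to R can be sketched as follows. The function name and the 2 × 2 matrix are hypothetical, chosen only so that the result is easy to check by hand (standard deviations 2 and 3, covariance 3).

```python
import math

def correlation_from_covariance(S):
    """Divide each s_jh by the product of the standard deviations sqrt(s_jj) and sqrt(s_hh)."""
    p = len(S)
    sd = [math.sqrt(S[j][j]) for j in range(p)]
    return [[S[j][h] / (sd[j] * sd[h]) for h in range(p)] for j in range(p)]

# Hypothetical variance and covariance matrix (not taken from the slides).
S = [[4.0, 3.0],
     [3.0, 9.0]]

R = correlation_from_covariance(S)
print(R)  # unit diagonal, symmetric off-diagonal correlations
```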
  • 36. The space of the observations The centroid (vector whose elements are the sample means) is the centre of gravity of the p-dimensional cloud. The elements of the variance and covariance matrix give us information about the dispersion around the centroid (remember the 2-dimensional example) and about the orientation of the cloud.
  • 38. The space of the observations To motivate the second measure of multivariate dispersion, consider the “portion” of the space which is occupied by the data (area of the ellipse). We will come back to this concept later, but we can intuitively understand that the area of the ellipse (in higher-dimensional space, the volume of an ellipsoid) is somehow related to the variances and to the covariances, i.e., to all the entries of the var/cov matrix, S.
  • 40. The space of the observations The variance and covariance matrix contains relevant information to describe the points in a p-dimensional space, and also information about their distances. We now consider different measures of distance between cases in the p-dimensional space, related to particular transformations of the original variables. Notice first that if the variables are centred on their means, nothing changes as concerns the dispersion of the points: this operation only consists in a change of the origin.
  • 41. Multivariate Samples - Transformations TRANSFORMATION: VARS CENTRED ON THEIR MEANS
    Original Data Matrix: Centroid = $\bar{x}$; Var/Cov Matrix: S; Corr Matrix: R
    Centred Data Matrix: Centroid = Origin = 0; Var/Cov Matrix: S; Corr Matrix: R
    The centred matrix is obtained by subtracting from each observation on a given variable the mean of the variable itself. This means that the mean of the j-th variable is subtracted from all the observations in the j-th column.
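A short check that centring moves the centroid to the origin while leaving variances and distances between cases unchanged. This is a sketch under stated assumptions: the helper names are illustrative, the variance uses the n − 1 divisor, and the two columns come from the Example 1 subset.

```python
import math

e_gov_avail = [25.0, 30.0, 40.0, 50.0, 56.0, 59.0, 67.0, 74.0,
               32.0, 53.0, 55.0, 32.0, 35.0, 47.0, 50.0]
internet_acc = [6.0, 19.0, 12.0, 40.0, 60.0, 56.0, 51.0, 73.0,
                17.0, 34.0, 34.0, 67.0, 50.0, 60.0, 34.0]

def mean(col):
    return sum(col) / len(col)

def centre(col):
    """Subtract the column mean from every observation in the column."""
    m = mean(col)
    return [x - m for x in col]

def sample_var(col):
    """Sample variance with divisor n - 1 (an assumption of this sketch)."""
    m = mean(col)
    return sum((x - m) ** 2 for x in col) / (len(col) - 1)

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

e_c, i_c = centre(e_gov_avail), centre(internet_acc)

# Distance between two cases (index 0 = Romania, index 7 = Sweden), before and after centring.
d_before = distance((e_gov_avail[0], internet_acc[0]), (e_gov_avail[7], internet_acc[7]))
d_after = distance((e_c[0], i_c[0]), (e_c[7], i_c[7]))
print(mean(e_c), mean(i_c), d_before, d_after)
```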
  • 42. A closer look at the distance The Euclidean distance of a point from the origin is the length of the line connecting the point to the origin. Consider, in the plot of the centred variables, Cyprus and Italy: their distance from the origin, 0, is (almost) the same. This similar distance is due to different combinations of x- and y-deviations from 0. Should the x- and y-deviations be evaluated in the same manner? Notice that the distance of Slovakia from the origin is higher. We will consider this later.
  • 43. A closer look at the distance Remember: the standard deviation of a variable is the typical deviation from the mean. Here Std.Dev.(E_gov_Avail) = 15, Std.Dev.(Int_Acc) = 21.31. To compare the deviations from the origin adequately (the data are centred), we should take into account the Std.Dev. (of course, squared deviations should be compared with variances). Internet_Acc has a higher Std.Dev. Hence, a deviation D from the origin along the horizontal axis (Internet_Acc) should “count less” than a deviation D from the origin along the vertical axis (E_gov_Avail).
  • 44. A closer look at the distance In the Euclidean distance, the deviations are considered in absolute terms. When we are considering variables having different Std.Dev., we should consider relative deviations. To remove the effect of the Std.Dev., thus obtaining comparable deviations, we have to standardize the variables. Standardization of the j-th variable: $z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{s_{jj}}}$. The Euclidean distance between two standardized observations is $d(z_i, z_k) = \sqrt{\sum_{j=1}^{p} (z_{ij} - z_{kj})^2} = \sqrt{\sum_{j=1}^{p} \frac{(x_{ij} - x_{kj})^2}{s_{jj}}}$. Statistical distance: a different weight ($1/s_{jj}$) is assigned to the squared deviation of each variable in the calculation of the distance. The statistical distance is proportional to the Euclidean one only if the variances are all equal.
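The equivalence stated on this slide can be checked numerically: the statistical distance computed on the original values should match the Euclidean distance computed on the standardized values. The sketch assumes the n − 1 divisor for the variance; function names are illustrative and the data are two columns of Example 1.

```python
import math

e_gov_avail = [25.0, 30.0, 40.0, 50.0, 56.0, 59.0, 67.0, 74.0,
               32.0, 53.0, 55.0, 32.0, 35.0, 47.0, 50.0]
internet_acc = [6.0, 19.0, 12.0, 40.0, 60.0, 56.0, 51.0, 73.0,
                17.0, 34.0, 34.0, 67.0, 50.0, 60.0, 34.0]

def mean(col):
    return sum(col) / len(col)

def sample_var(col):
    """Sample variance with divisor n - 1 (an assumption of this sketch)."""
    m = mean(col)
    return sum((x - m) ** 2 for x in col) / (len(col) - 1)

def standardize(col):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(col), math.sqrt(sample_var(col))
    return [(x - m) / s for x in col]

def statistical_distance(p, q, variances):
    """Each squared deviation is weighted by 1 / s_jj."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(p, q, variances)))

variances = [sample_var(e_gov_avail), sample_var(internet_acc)]
z_e, z_i = standardize(e_gov_avail), standardize(internet_acc)

romania, sweden = 0, 7  # row positions in the Example 1 subset
d_stat = statistical_distance((e_gov_avail[romania], internet_acc[romania]),
                              (e_gov_avail[sweden], internet_acc[sweden]),
                              variances)
d_eucl_z = math.dist((z_e[romania], z_i[romania]), (z_e[sweden], z_i[sweden]))
print(d_stat, d_eucl_z)  # the two values agree up to rounding
```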
  • 45. A closer look at the distance The statistical distance (visualization in the original/centred space). x-deviations are penalized less than y-deviations, since the x-axis is characterized by a higher dispersion. Hence Cyprus, which shows a higher y-deviation from the origin than Italy, has a statistical distance from the origin which is higher than that of Italy. Notice that Slovakia has a statistical distance from 0 which is now similar to that of Cyprus. [Figure: points having the same statistical distance from the origin.]
  • 46. Multivariate Samples - Transformations TRANSFORMATION: STANDARDIZED VARS
    Original Data Matrix: Centroid = $\bar{x}$; Var/Cov Matrix: S; Corr Matrix: R
    Standardized Data Matrix: Centroid = Origin = 0; Var/Cov Matrix: R; Corr Matrix: R
    The standardized matrix is obtained by subtracting from each observation on a given variable the mean of the variable itself and by dividing this difference by the Std.Dev. Like the centred variables, the standardized variables have null means; in addition, their variances are all equal to 1 (the unit of measurement is removed). Since Variance = Std.Dev. = 1 for each variable, the covariances coincide with the correlations (Corr = Cov / product of the Std.Dev.’s).
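The claim that the var/cov matrix of the standardized data equals R can be illustrated on two columns. As before, this is a sketch: the function names are assumptions, the covariance uses the n − 1 divisor, and the data are the Internet_Acc and EPO columns of Example 1.

```python
import math

internet_acc = [6.00, 19.00, 12.00, 40.00, 60.00, 56.00, 51.00, 73.00,
                17.00, 34.00, 34.00, 67.00, 50.00, 60.00, 34.00]
epo = [1.31, 12.04, 2.78, 79.87, 135.77, 124.19, 309.09, 293.32,
       9.87, 84.14, 30.64, 246.15, 141.80, 299.99, 144.52]

def sample_cov(x, y):
    """Sample covariance with divisor n - 1 (an assumption of this sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def standardize(col):
    m = sum(col) / len(col)
    s = math.sqrt(sample_cov(col, col))
    return [(v - m) / s for v in col]

# Correlation of the original variables.
r = sample_cov(internet_acc, epo) / math.sqrt(
    sample_cov(internet_acc, internet_acc) * sample_cov(epo, epo))

z_x, z_y = standardize(internet_acc), standardize(epo)
var_zx = sample_cov(z_x, z_x)   # variance of a standardized variable
cov_z = sample_cov(z_x, z_y)    # covariance of the standardized variables
print(var_zx, cov_z, r)
```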
  • 47. A closer look at the distance [Figure panels: Euclidean distance in the original space; statistical distance in the original space; Euclidean distance in the standardized space.] The standardization makes all the differences comparable, so now the Euclidean distance coincides with the statistical distance calculated in the original space. Notice that the cloud still has an orientation.
  • 48. A closer look at the distance In the statistical distance, deviations are adjusted by taking into account the dispersions of the variables. But no attention is paid to the “coherence” between each point and the cloud of points (standardization does not involve correlations). Slovakia and Cyprus are equally statistically distant from the origin. Notice that Lithuania is more statistically distant from the origin. Consider the orientation of the cloud: the line connecting Lithuania to 0 has the same direction as the cloud. This is less true for Slovakia. The line connecting Cyprus to the origin runs against the orientation of the cloud (countertendency).
  • 49. A closer look at the distance In the statistical distance, the coherence with the orientation of the cloud is not considered. A transformation of the data which removes the effect of the Std.Dev., and also penalizes deviations by considering the orientation of the cloud of points, is the so-called Mahalanobis transformation. We do not enter into details here: the Mahalanobis transformation of the j-th variable is a particular linear combination of the considered variables. The so-called Mahalanobis distance is defined as the Euclidean distance calculated on the Mahalanobis-transformed observations; in terms of the original observations it is $d_M(x_i, x_k) = \sqrt{(x_i - x_k)'\, S^{-1}\, (x_i - x_k)}$.
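A minimal two-variable sketch of the Mahalanobis distance, computed directly from the inverse of a 2 × 2 var/cov matrix rather than through the transformation itself (which the slides do not define). The function name and the matrix below are hypothetical: unit variances and covariance 0.8, i.e. a cloud oriented along the main diagonal.

```python
import math

def mahalanobis_2d(x, y, S):
    """Mahalanobis distance between two 2-dimensional points, given a 2 x 2 var/cov matrix S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    if det <= 0:
        raise ValueError("S must be positive definite")
    # Explicit inverse of a 2 x 2 matrix.
    inv = [[d / det, -b / det],
           [-c / det, a / det]]
    dx = [x[0] - y[0], x[1] - y[1]]
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# Hypothetical var/cov matrix: positively correlated variables with unit variances.
S = [[1.0, 0.8],
     [0.8, 1.0]]
origin = (0.0, 0.0)

d_along = mahalanobis_2d((1.0, 1.0), origin, S)     # deviation coherent with the cloud
d_against = mahalanobis_2d((1.0, -1.0), origin, S)  # deviation in countertendency
print(d_along, d_against)
```

Both points have the same Euclidean and statistical distance from the origin, yet the one lying against the orientation of the cloud gets the larger Mahalanobis distance. With a diagonal S the function reduces to the statistical distance, and with the identity matrix to the Euclidean one.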
  • 50. Multivariate Samples - Transformations TRANSFORMATION: MAHALANOBIS
    Original Data Matrix: Centroid = $\bar{x}$; Var/Cov Matrix: S; Corr Matrix: R
    Mahalanobis Data Matrix: Centroid = Origin = 0; Var/Cov Matrix: I; Corr Matrix: I
    The Mahalanobis distance is the Euclidean distance evaluated after transforming the data according to the Mahalanobis transformation. The variables transformed according to the Mahalanobis transformation have null means, variances all equal to 1 (the unit of measurement is removed), and null correlations (the orientation of the cloud is removed).
  • 51. A closer look at the distance Mahalanobis distance: deviations from the origin are adjusted by taking into account both the dispersions of the variables and their correlations (orientation). Now Cyprus, being in countertendency with respect to the orientation of the cloud, has a Mahalanobis distance from 0 which is higher than that of Slovakia. Notice that Lithuania has a Mahalanobis distance from 0 similar to that of Slovakia. [Figure: points having the same Mahalanobis distance from the origin.]
  • 52. A closer look at the distance [Figure panels: Euclidean distance (original space); statistical distance (original space); Mahalanobis distance (original space); Euclidean distance in the Mahalanobis space.] By removing both dispersion and correlation, differences become comparable also with respect to their orientation, so now the Euclidean distance coincides with the Mahalanobis distance calculated in the original space. Notice that the cloud has no orientation.
  • 53. Multivariate samples – Transformations Conclusion: By transforming data via standardization or the Mahalanobis transformation we are simply defining a new space such that the Euclidean distance calculated on the transformed points coincides respectively with: Statistical distance (standardization): deviations are differently evaluated depending on their Std.Dev. Mahalanobis distance (Mahalanobis transformation): deviations are differently evaluated depending on the Std.Dev.’s and on the orientation of the cloud (correlations/covariances). As for now the latter transformation was not explicitly defined due to its analytical complexity, but we will see later how to obtain Mahalanobis-transformed data.
                                  ORIGINAL (X)   CENTRED ON MEAN   STANDARDIZATION (Z)   MAHALANOBIS (Z_M)
      Means                       x̄_j            0                 0                     0
      Variances                   s_jj           s_jj              1                     1
      Covariances                 s_jk           s_jk              r_jk                  0
      Correlations                r_jk           r_jk              r_jk                  0
      Euclidean distance equals   Euclidean      Euclidean         Statistical           Mahalanobis