1. Copyright © 2012, SAS Institute Inc. All rights reserved.
EXPLORING VARIABLE CLUSTERING
AND IMPORTANCE IN JMP
CHRIS GOTWALT AND RYAN PARKER
2. VARIABLE CLUSTERING
INTRODUCTION
• Variable clustering is a method that performs dimension reduction on the
number of input variables to be used in a predictive model.
• Reduces inputs by finding groups of similar variables so that a single variable
can represent each group.
• Helps reduce the effects of collinearity among the input variables.
• Developed by SAS/STAT Development Director Warren Sarle.
3. VARIABLE CLUSTERING
AN ITERATIVE ALGORITHM
• Iteratively splits and assigns variables to clusters.
• Sample iterations for variables in the Wine Quality data set:
Iteration 1: [Alcohol, Citric Acid, pH, Sugar, Sulfur Dioxide]
Iteration 2: [Alcohol, Citric Acid, Sulfur Dioxide] [pH, Sugar]
Iteration 3: [Alcohol, Sugar] [pH, Sulfur Dioxide] [Citric Acid]
4. VARIABLE CLUSTERING
ALGORITHM DETAILS
• At each iteration the cluster with the largest second eigenvalue is split.
• Variables within this cluster are assigned to two new clusters based on each
variable’s correlation with the first two orthoblique rotated principal
components.
• After the split, variables from other clusters are reassigned to one of the new
clusters if they correlate more highly with it than with their current cluster.
• The algorithm ends when the second eigenvalue of every cluster is less than one.
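The split rule and the stopping rule above can be sketched in Python as follows. This is a minimal sketch, not JMP's implementation: the function names, the power-iteration eigenvalue routine, and the example correlation matrices are assumptions made for illustration, and the orthoblique rotation and reassignment steps are omitted.

```python
import math


def _matvec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def top_two_eigenvalues(corr, iterations=500):
    """Two largest eigenvalues of a correlation matrix.

    Uses power iteration followed by deflation, which is adequate for the
    small symmetric positive semidefinite matrices used here.
    """
    n = len(corr)
    work = [list(row) for row in corr]
    found = []
    for k in range(min(2, n)):
        # Asymmetric start vector, so it is unlikely to be orthogonal
        # to the dominant eigenvector.
        vec = [1.0 + 0.1 * (i + k) for i in range(n)]
        for _ in range(iterations):
            nxt = _matvec(work, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # nothing left after deflation
                break
            vec = [x / norm for x in nxt]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        value = sum(x * y for x, y in zip(vec, _matvec(work, vec)))
        found.append(value)
        # Deflate: remove the component just found.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return found


def cluster_to_split(cluster_corrs, threshold=1.0):
    """Name of the cluster with the largest second eigenvalue.

    Returns None when every second eigenvalue is below the threshold,
    which is the stopping rule stated on the slide.
    """
    second = {
        name: (top_two_eigenvalues(corr) + [0.0])[1]
        for name, corr in cluster_corrs.items()
    }
    name = max(second, key=second.get)
    return name if second[name] >= threshold else None


# Hypothetical correlation matrices for two clusters.
clusters = {
    "A": [
        [1.0, 0.9, 0.0, 0.0],
        [0.9, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.8],
        [0.0, 0.0, 0.8, 1.0],
    ],
    "B": [[1.0, 0.7], [0.7, 1.0]],
}
print(cluster_to_split(clusters))  # A (its second eigenvalue is about 1.8)
```

Cluster A contains two unrelated pairs of variables, so its second eigenvalue is well above one and it is the next cluster to split.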
5. VARIABLE CLUSTERING
REDUCING EACH CLUSTER TO A SINGLE VARIABLE
[Figure: diagram with variable labels pH, Sugar, and Citric Acid]
• Each cluster can be reduced to a single
variable for modeling.
• There are two ways to do this:
1. We can use the most representative
variable from each cluster.
2. Alternatively, the cluster component from
each cluster can be used.
6. VARIABLE CLUSTERING
MOST REPRESENTATIVE VARIABLES
• These are variables that best represent each cluster.
• Each has the highest correlation with the other variables in its cluster.
• Most representative variables provide a clear interpretation when used.
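One way to pick such a variable is sketched below. The scoring rule (mean squared correlation with the other members of the cluster), the function name, and the correlation values are assumptions for illustration; the slides do not state the exact criterion JMP uses.

```python
def most_representative(names, corr):
    """Pick the variable with the highest mean squared correlation
    with the other variables in the same cluster (assumed criterion)."""
    n = len(names)
    if n == 1:
        return names[0]

    def mean_r2(i):
        return sum(corr[i][j] ** 2 for j in range(n) if j != i) / (n - 1)

    return names[max(range(n), key=mean_r2)]


# Hypothetical within-cluster correlations.
names = ["Alcohol", "Sugar", "pH"]
corr = [
    [1.0, 0.8, 0.7],
    [0.8, 1.0, 0.3],
    [0.7, 0.3, 1.0],
]
print(most_representative(names, corr))  # Alcohol
```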
7. VARIABLE CLUSTERING
CLUSTER COMPONENTS
• New variables created using the first principal component of each cluster.
• Provide a way to combine variables in each cluster into a single variable.
• Similar to traditional principal components analysis (PCA) except that each
cluster component only uses variables from that cluster.
• Interpretation not as clear when compared to most representative variables.
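A cluster component can be sketched as the first principal component of the standardized variables in one cluster. The code below is an illustrative sketch under that reading; the function names and the sample data are hypothetical, it assumes no column is constant, and it uses plain power iteration rather than whatever JMP does internally.

```python
import math
import statistics


def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]


def cluster_component(columns, iterations=200):
    """First principal component of the variables in one cluster.

    columns: one list of observations per variable, all the same length.
    Returns (weights, scores): one weight per variable and one score
    per observation.
    """
    z = [standardize(col) for col in columns]
    p, n = len(z), len(z[0])
    corr = [
        [sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(p)]
        for i in range(p)
    ]
    # Power iteration for the leading eigenvector of the correlation matrix.
    w = [1.0 + 0.1 * i for i in range(p)]
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        w = [x / norm for x in nxt]
    scores = [sum(w[i] * z[i][k] for i in range(p)) for k in range(n)]
    return w, scores


# Hypothetical measurements for a two-variable cluster.
ph = [3.0, 3.2, 3.1, 3.4, 3.3]
sugar = [1.9, 2.4, 2.1, 2.9, 2.6]
weights, scores = cluster_component([ph, sugar])
print(weights)
```

Only the variables passed in contribute to the component, which is the difference from a PCA run on all inputs at once.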
8. VARIABLE CLUSTERING
DEMO: IMPORTANT TERMS
• RSquare with Own Cluster
• The RSquare a variable has with variables in its cluster.
• RSquare with Next Closest
• The RSquare a variable has with variables in the next most similar cluster.
• 1-RSquare Ratio
• Relative similarity between a variable’s own cluster and the next closest cluster.
• Values should always be less than 1.
• Values greater than 1 indicate variable should be moved to the next closest cluster.
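The ratio can be illustrated with a short calculation. The slides do not give the formula; the definition used here, (1 - RSquare with Own Cluster) / (1 - RSquare with Next Closest), is the one commonly used for variable clustering and is an assumption about what JMP reports. The function names are hypothetical.

```python
def one_minus_rsquare_ratio(r2_own, r2_next):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)."""
    if r2_next >= 1.0:
        raise ValueError("r2_next must be below 1")
    return (1.0 - r2_own) / (1.0 - r2_next)


def should_move(r2_own, r2_next):
    """True when the ratio exceeds 1, i.e. the variable fits the
    next closest cluster better than its own."""
    return one_minus_rsquare_ratio(r2_own, r2_next) > 1.0


# A well-placed variable: strong fit to its own cluster, weak fit elsewhere.
print(round(one_minus_rsquare_ratio(0.8, 0.2), 2))  # 0.25
# A misplaced variable: it fits the next closest cluster better.
print(should_move(0.3, 0.6))  # True
```

Small ratios indicate a variable that is well explained by its own cluster and poorly explained by any other.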
9. VARIABLE IMPORTANCE
INTRODUCTION
• Provides a general way to assess the importance of variables for predictive
models in JMP.
• Insight is in terms of practical significance of input variables.
• Based on functional decomposition ideas of I. M. Sobol.
10. VARIABLE IMPORTANCE
FUNCTIONAL DECOMPOSITION
• I. M. Sobol showed that we can decompose a function $f(X_1, \dots, X_p)$ into a
sum of lower-dimensional functions:
• $f(X_1, \dots, X_p) = f_0 + f_1(X_1) + \cdots + f_p(X_p) + f_{12}(X_1, X_2) + \cdots$
• The decomposition has a function for each $X_i$, each pair $(X_i, X_j)$, etc.
• The variability of these lower-dimensional functions assesses the importance of
the input variables.
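To make the link between variability and importance concrete, the standard variance form of this decomposition can be written as below. It assumes independent inputs, and the symbols $V$, $V_i$, $V_{ij}$, $S_i$, and $S_{Ti}$ are notation introduced here, not taken from the slides.

```latex
\begin{aligned}
V &= \operatorname{Var}\bigl(f(X_1, \dots, X_p)\bigr)
   = \sum_{i} V_i + \sum_{i<j} V_{ij} + \cdots, \\
V_i &= \operatorname{Var}\bigl(f_i(X_i)\bigr), \qquad
V_{ij} = \operatorname{Var}\bigl(f_{ij}(X_i, X_j)\bigr), \\
S_i &= \frac{V_i}{V}, \qquad
S_{Ti} = \frac{1}{V}\Bigl(V_i + \sum_{j \ne i} V_{ij} + \cdots\Bigr).
\end{aligned}
```

Here $S_i$ is the share of prediction variance attributable to $X_i$ alone, and $S_{Ti}$ adds every interaction term that involves $X_i$.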
11. VARIABLE IMPORTANCE
IMPORTANCE EFFECTS
• Assessment of variable importance is in terms of effect indices.
• These indices are numbers between 0 and 1 indicating relative importance.
• Main effect indices measure variability of predictions due to a single input.
• They do not account for interaction effects.
• Total effect indices measure the total variability of predictions due to the input.
• Combines all main and higher order interaction effects.
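Main and total effect indices can be estimated by Monte Carlo when the inputs are independent. The sketch below assumes independent uniform inputs on [0, 1] and uses two well-known estimators (Saltelli's form for main effects, Jansen's for total effects); it is an illustration, not a description of JMP's internal procedure, and the function name and test function are hypothetical.

```python
import random
import statistics


def sobol_indices(f, p, n=20000, seed=1):
    """Monte Carlo estimates of main and total effect indices.

    f takes a list of p inputs; inputs are assumed independent and
    uniform on [0, 1]. Returns (main, total), each a list of length p.
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(p)] for _ in range(n)]
    b = [[rng.random() for _ in range(p)] for _ in range(n)]
    fa = [f(row) for row in a]
    fb = [f(row) for row in b]
    mean = statistics.fmean(fa + fb)
    var = statistics.pvariance(fa + fb)
    main, total = [], []
    for i in range(p):
        # Rows of A with input i replaced by the value from B.
        fab = [f(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        # Centering f(B) leaves the expectation unchanged and lowers noise.
        v_main = statistics.fmean(
            [(yb - mean) * (yab - ya) for ya, yb, yab in zip(fa, fb, fab)]
        )
        v_total = 0.5 * statistics.fmean(
            [(ya - yab) ** 2 for ya, yab in zip(fa, fab)]
        )
        main.append(v_main / var)
        total.append(v_total / var)
    return main, total


# Additive model: no interactions, so main and total effects agree
# (about 0.2 for the first input and 0.8 for the second).
main, total = sobol_indices(lambda x: x[0] + 2.0 * x[1], p=2)
print([round(v, 2) for v in main], [round(v, 2) for v in total])
```

For a model with an interaction, such as the product of the two inputs, the total effect of each input exceeds its main effect, because the interaction is counted in the total effect of both inputs.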
12. VARIABLE IMPORTANCE
DISTRIBUTION OF INPUT VARIABLES
• Variability in predictions is due to the distribution of input variables.
• JMP 11 provides three input variable distribution options:
1. Independent Uniform
2. Independent Resampled
3. Dependent Resampled
• Monte Carlo estimation procedure used for independent cases.
• 𝐾-nearest neighbors estimation used for dependent case.
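The two independent options can be contrasted with a small sampling sketch. It reflects one reading of the option names: uniform draws over each input's observed range versus draws resampled from each input's observed values, one column at a time. The function names and the data are hypothetical, and the dependent case is not shown.

```python
import random


def independent_uniform(data, n, rng):
    """Draw each input uniformly between its observed minimum and maximum."""
    columns = list(zip(*data))
    bounds = [(min(col), max(col)) for col in columns]
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]


def independent_resampled(data, n, rng):
    """Draw each input from its own observed values, ignoring the
    relationship between columns."""
    columns = list(zip(*data))
    return [[rng.choice(col) for col in columns] for _ in range(n)]


# Hypothetical observations: rows are samples, columns are (Alcohol, pH).
observed = [[9.4, 3.51], [9.8, 3.20], [10.5, 3.26], [12.8, 3.16]]
rng = random.Random(0)
uniform_draws = independent_uniform(observed, 3, rng)
resampled_draws = independent_resampled(observed, 3, rng)
print(uniform_draws)
print(resampled_draws)
```

Resampling preserves the shape of each input's observed distribution, while uniform sampling spreads draws evenly across the range even where no data were observed.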
13. VARIABLE IMPORTANCE
USE RESAMPLED INPUTS?
[Figure: two panels, "Uniform Acceptable" and "Resampled Needed"]
14. VARIABLE IMPORTANCE
MARGINAL INFERENCE
[Figure: main effect values 0.16 and 0.03]
15. VARIABLE IMPORTANCE
DEMO