This document describes an automated clustering and outlier detection program. The program normalizes data, performs principal component analysis to select important components, compares clustering algorithms, selects the best model using silhouette coefficient values, and produces outputs labeling clusters and outliers. It is demonstrated on a sample of 5,000 credit card customer records, identifying a small cluster of three accounts as outliers based on features such as new account status and a high incidence of late payments.
2. Motivation
The goal is a program that automatically performs clustering and outlier detection
on a wide variety of numerically represented data.
3. Outline of program features
Normalizes all data to be clustered
Creates normalized principal components from the normalized data
Automatically selects the necessary normalized principal components for use in actual
clustering and outlier detection
Compares a variety of algorithms based upon the selected set of normalized principal
components
Adopts the top performing model based upon silhouette coefficient values to perform
the final clustering and outlier detection procedures
Produces relevant information and outputs throughout the process
4. Data normalization
Normalization converts each numerically represented dimension to be clustered into the range [0, 1].
This is a desirable step for preparing numeric attributes for clustering because it prevents
attributes with large raw scales (e.g., credit limit) from dominating the distance calculations.
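The source does not specify an implementation, so the following is a minimal sketch of min-max normalization in Python with NumPy; the function name and the three-record example are illustrative only.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column of a numeric array into the range [0, 1].

    Constant columns (max == min) are mapped to 0 to avoid division by zero.
    """
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # guard against constant columns
    return (X - col_min) / col_range

# Illustrative example: three records with two numeric attributes
# (account age in months, credit limit)
data = np.array([[1.0, 2500.0],
                 [12.0, 8500.0],
                 [6.0, 2200.0]])
normalized = min_max_normalize(data)
```

After this transformation, both attributes contribute on the same [0, 1] scale regardless of their original units.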
5. Principal component analysis
Principal component analysis (PCA) is a statistical procedure that uses an
orthogonal transformation to convert a set of observations of possibly correlated
variables into a set of values of linearly uncorrelated variables called principal
components.
In this way, PCA can both reduce dimensionality and mitigate the problems
associated with clustering data whose attributes are correlated.
In the following slides, a random sample of 5,000 credit card customers is used to
demonstrate the automated clustering and outlier detection program.
6. Principal component analysis
PCA initially results in four principal
components being generated from
the original data
Using a cumulative data variability
threshold of 80% (the default
specification), three principal
components are automatically
selected for analysis; together they
explain the vast majority of the data
variability.
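The selection step described above can be sketched as follows, assuming a scikit-learn implementation (the source does not name a library); the synthetic four-attribute data set stands in for the normalized customer records and is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for normalized customer data: four attributes,
# two of which are strongly correlated with the other two
base = rng.random((200, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] * 0.9 + rng.random(200) * 0.1,
                     base[:, 1] * 0.9 + rng.random(200) * 0.1])

pca = PCA()                      # keep all components initially
scores = pca.fit_transform(X)

# Automatically select the leading components that together explain
# at least 80% of the variance (the program's default threshold)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_selected = int(np.searchsorted(cumulative, 0.80) + 1)
selected_scores = scores[:, :n_selected]
```

Only `selected_scores` is passed on to the clustering stage, which is how the program discards low-variance components automatically.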
7. Principal component analysis
Scatter plot of PC1 and PC2
In this view, the top 2 principal
components are plotted for each object in
two-dimensional space.
As can be seen, a small subset of records
appears significantly more distant from
the vast majority of objects.
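A two-dimensional view like the one described can be produced as below; this is a hedged sketch using matplotlib (the source does not state its plotting tool), and the stand-in principal-component scores, with a small distant subset, are fabricated solely to illustrate the plot.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
# Illustrative PC scores: a dense majority plus a small distant subset
pcs = np.vstack([rng.normal(0.0, 0.1, (100, 2)),
                 rng.normal(2.0, 0.1, (3, 2))])

fig, ax = plt.subplots()
ax.scatter(pcs[:, 0], pcs[:, 1], s=10)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
fig.savefig("pc_scatter.png")
```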
8. Clustering exploration/simulation process - examples
Ward method
Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for
choosing the pair of clusters to merge at each step is based on the optimal value of an objective function.
Complete link method
This method is also known as farthest-neighbor clustering. The result of the clustering can be visualized
as a dendrogram, which shows the sequence of cluster fusions and the distance at which each fusion took
place.
PAM (partitioning around medoids)
The k-medoids algorithm is a clustering method related to the k-means algorithm and the medoid-shift
algorithm. It is considered more robust than k-means because it uses actual data points (medoids) as
cluster centers rather than means, making it less sensitive to outliers.
K-means
k-means clustering aims to partition n observations into k clusters in which each observation belongs to
the cluster with the nearest mean, serving as a prototype of the cluster.
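The four methods above can be run side by side as sketched below, assuming scipy for the hierarchical methods and scikit-learn for k-means (the source names no library); PAM is omitted here because it typically requires an extra package such as scikit-learn-extra. The two-blob data set is an illustrative stand-in for the selected principal-component scores.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Illustrative data: a large dense blob plus a small distant one,
# standing in for the selected principal-component scores
X = np.vstack([rng.normal(0.0, 0.05, (50, 3)),
               rng.normal(1.0, 0.05, (5, 3))])

# Agglomerative hierarchical clustering: Ward and complete link
labels = {}
for method in ("ward", "complete"):
    Z = linkage(X, method=method)
    labels[method] = fcluster(Z, t=2, criterion="maxclust")

# k-means with k = 2
labels["kmeans"] = KMeans(n_clusters=2, n_init=10,
                          random_state=0).fit_predict(X)
```

Each method returns one cluster label per record, which the program can then score and compare.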
9. Clustering exploration results
The result shown below is based upon a simulation exercise, whereby all four
algorithms are automatically compared on the data set (i.e., a random sample of 5,000
records from the credit card customer data). In this particular case, the best model is
found to be a two-cluster solution using the complete link hierarchical method. This is
the final model and is used for subsequent clustering and outlier detection.
Best clustering result:
The silhouette value can theoretically range from -1 to +1, with higher values indicative
of better cluster quality in terms of both cohesion and separation.
Best Method                  Number of Clusters   Silhouette Value
complete link hierarchical   2                    0.753754205720575
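The model-selection loop described above can be sketched as follows; this is an assumption-laden illustration (scipy linkage methods, scikit-learn's `silhouette_score`, and a synthetic two-blob data set), not the program's actual code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Illustrative data: two well-separated groups of records
X = np.vstack([rng.normal(0.0, 0.05, (60, 3)),
               rng.normal(1.0, 0.05, (6, 3))])

# Score every (method, k) combination and keep the highest silhouette value
best = None
for method in ("ward", "complete", "average"):
    Z = linkage(X, method=method)
    for k in range(2, 6):
        labels = fcluster(Z, t=k, criterion="maxclust")
        score = silhouette_score(X, labels)
        if best is None or score > best[2]:
            best = (method, k, score)

best_method, best_k, best_score = best
```

The winning (method, k) pair is then adopted for the final clustering and outlier detection, mirroring the "best model" table above.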
10. Complete-link Hierarchical clustering (1/2)
The 5,000 instances are on the
x-axis. In moving vertically from
the x-axis, one can begin to see
how the actual clusters are
formed.
11. Plot of PCs with cluster assignment labels (1/3)
In this view, the top two principal
components (i.e., PC1 and PC2) are
plotted for each object in two-
dimensional space.
In the graph, there are two clusters, one
dark blue and the other light blue.
The small subset of three records appears
substantially more distant from the
majority of objects.
12. Plot of PCs with cluster assignment labels (2/3)
In this view, PC1 and PC3 are plotted for
each object in two-dimensional space.
In the graph, the two clusters are again
shown.
It is once again evident that the small
subset of three records appears more
distant from the majority of other
objects.
13. Plot of PCs with cluster assignment labels (3/3)
In this view, PC2 and PC3 are
plotted for each object in two-
dimensional space.
Cluster differences appear less
prominent from this perspective.
14. Principal components 3D scatterplot
Cluster one represents the majority
class (black) while cluster two
represents the rare class (red).
In this view, one can clearly see the
subset of three records (in red)
appearing more isolated from the other
objects.
15. Cluster 1 outlier plot
In this view, an arbitrary cutoff is
inserted at the 99.9th percentile (red
horizontal line) to allow efficient
identification of very irregular
records.
Objects farther from the x-axis are
more questionable.
While all objects distant from the x-
axis might be worth investigating,
points above the cutoff should be
viewed as particularly suspicious.
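The distance-and-cutoff procedure above can be sketched as follows; the Mahalanobis computation is standard, but the helper name and the 5,000-point synthetic data are illustrative assumptions.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row from the column means."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # Quadratic form diff[i] @ cov_inv @ diff[i] for every row i
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

rng = np.random.default_rng(3)
# Illustrative stand-in for one cluster's principal-component scores
X = rng.normal(size=(5000, 3))

d2 = mahalanobis_sq(X)
cutoff = np.percentile(d2, 99.9)       # the red horizontal line
suspicious = np.flatnonzero(d2 > cutoff)
```

Records above the cutoff are flagged for investigation first; the rest are ranked by distance for any follow-up review.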
16. Conclusion of Process
At the conclusion of outlier detection, an output file for each cluster containing the unique
record identifier, original variables, normalized variables, principal components, normalized
principal components, cluster assignments, and Mahalanobis distance information can be
exported to facilitate further analyses and investigations.
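The export step can be sketched with pandas as below (an assumption; the source does not name a tool). The frame is populated with a subset of fields from the cluster 2 table shown in this document, and the file name is hypothetical.

```python
import pandas as pd

# Per-cluster output frame; field names mirror the report's output file
# (record id, original variables, cluster assignment, Mahalanobis distance)
cluster2 = pd.DataFrame({
    "Record": [32430, 65470, 78772],
    "AccountAge": [1, 1, 1],
    "CreditLimit": [2500, 8500, 2200],
    "LatePayments": [3, 4, 3],
    "model.cluster": [2, 2, 2],
    "md": [5.83e-05, 0.002371778, 0.000442305],
})
cluster2.to_csv("cluster_2_output.csv", index=False)
roundtrip = pd.read_csv("cluster_2_output.csv")
```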
Cluster 2 – final output file of a subset of fields:
Distinguishing features of cluster 2 records: 1) New accounts (age = 1 month), 2)
Very high incidence of late payments, and 3) Relatively high credit limits,
particularly given the account age and late payment issues.
Record   AccountAge   CreditLimit   AdditionalAssets   LatePayments   model.cluster   md
32430    1            2500          1                  3              2               5.83E-05
65470    1            8500          1                  4              2               0.002371778
78772    1            2200          0                  3              2               0.000442305