Hierarchical Clustering is a process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another and as similar as possible within each group. This technique can help an enterprise organize data into groups to identify similarities and, equally important, dissimilarities and distinguishing characteristics, so the business can target pricing, products, services, marketing messages and more.
What is Hierarchical Clustering and How Can an Organization Use it to Analyze Data?
1. Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics
3. Introduction with example
Standard tuning parameters
Sample UI for input variables & parameters selection & output
Other sample output formats
Interpretation of Output
Limitations
Business use cases
What's All Covered
4. What Is It Used For…
It's a process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another and as similar as possible within each group.
Thus it's simply a grouping of similar things/data points.
For example, objects within group 1 (cluster C1) shown in the image above should be as similar as possible in terms of the attributes of the objects.
Objects in group 1 and group 2 should be as dissimilar as possible.
5. Some Examples
Let's take a few examples for more clarity:
A movie ticket booking website's users can be grouped into movie freaks / moderate watchers / rare watchers based on their past ticket purchase behavior, such as number of days since the last movie seen, average number of tickets booked each time, frequency of ticket bookings per month, etc.
Retail customers can be grouped into loyal / infrequent / rare customer groups based on their retail outlet/website visits per month, purchase amount per month, purchase frequency per month, etc.
It is used to find natural groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets.
Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the relevant group.
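One common way to assign a new record to an existing group is to place it in the cluster whose centroid it is closest to. A minimal sketch, assuming hypothetical, made-up centroids (in practice these come from the fitted clusters):

```python
import numpy as np

# Hypothetical centroids of two already-built clusters (2 predictors each)
centroids = np.array([[1.0, 2.0],    # cluster 1 centre
                      [8.0, 9.0]])   # cluster 2 centre

# A new data point to be assigned
new_point = np.array([7.5, 8.5])

# Euclidean distance from the new point to each centroid
dists = np.linalg.norm(centroids - new_point, axis=1)

# Assign the cluster with the smallest distance (1-based cluster number)
assigned = int(np.argmin(dists)) + 1
print(assigned)  # → 2
```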
6. How It Works
We divide the data into two clusters such that the data points in the same cluster are as similar/close as possible, while data points in different clusters are as dissimilar/far apart as possible.
All observations start in one cluster.
For each cluster we repeat the process until the specified number of clusters is reached.
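The steps above can be sketched in a few lines. Note that SciPy (like most libraries) implements the bottom-up, agglomerative variant of hierarchical clustering rather than the top-down splitting described above; both stop once the requested number of clusters is reached. The data points here are made up for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D data: two visibly separate groups of points
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# Build the hierarchy of merges using Ward linkage
# (minimises within-cluster variance at each merge)
Z = linkage(X, method="ward")

# Cut the tree so that exactly 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # two groups, e.g. [1 1 1 2 2 2]
```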
8. Standard Tuning Parameters
Number of clusters:
Suggested range: 3 to 5, though this depends on the business demand.
The number of clusters must be less than the number of data points in the data set.
Iterations:
The number of iterations used to generate the clusters.
The default value should be set to 20.
10. Sample UI for Selecting Predictors and Applying Tuning Parameters: For Two Predictors
The silhouette score is another useful criterion for assessing the natural and optimal number of clusters, as well as for checking the overall quality of the partition.
The largest silhouette score, over different K, indicates the best number of clusters.
[UI: select the variables you would like to use as predictors to build clusters (Height, Weight, BMI); tuning parameters: number of clusters, maximum iterations]
11. Height (cm)  Weight (kg)  BMI  Cluster Number
    158          60           23   1
    160          65           25   2
    170          70           26   2
    149          50           21   1
    180          80           27   3
    165          80           28   3
    200          90           23   1
Each customer is assigned a cluster membership as shown in the table on the left.
[Scatter plot of Height vs. Weight; silhouette = 0.7, indicating good quality clusters]
Output UI: For Two Predictors
As clusters are built using only 2 predictors here, the scatter plot axes will reflect the actual predictors, i.e. Height and Weight, instead of principal components.
Again, the less the cluster outlines overlap, the better the cluster assignment.
Alternatively, the silhouette score can be checked to evaluate how the clusters are partitioned: the closer this value is to 1, the better the partition quality.
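Checking the silhouette score for a given cluster assignment can be sketched as below, using hypothetical Height/Weight data with two well-separated groups (values close to 1 indicate good-quality clusters):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical Height (cm) / Weight (kg) data with two obvious groups
X = np.array([[158, 60], [160, 65], [149, 50],
              [180, 80], [182, 83], [178, 79]])
labels = [1, 1, 1, 2, 2, 2]  # cluster assignment for each row

# Mean silhouette over all points: near 1 means each point sits
# much closer to its own cluster than to the neighbouring one
score = silhouette_score(X, labels)
print(round(score, 2))  # high (close to 1) for this well-separated data
```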
12. Sample UI for Selecting Predictors and Applying Tuning Parameters: For Four Predictors
[UI: select the variables you would like to use as predictors to build clusters (Purchase amount, Purchase frequency, Total purchase quantity, Annual income, Website visits); tuning parameters: number of clusters, maximum iterations]
The silhouette score is another useful criterion for assessing the natural and optimal number of clusters, as well as for checking the overall quality of the partition.
The largest silhouette score, over different K, indicates the best number of clusters.
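Picking K by the largest silhouette score, as described above, can be sketched as follows. The data here is synthetic (three well-separated blobs), so the best K should come out as 3:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three tight, well-separated 2-D blobs of 20 points each
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2)) for loc in (0, 5, 10)])

# Try several cluster counts and score each partition
scores = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The K with the largest silhouette score is the suggested cluster count
best_k = max(scores, key=scores.get)
print(best_k)  # → 3
```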
13. Customer ID  Purchase amount (per month)  Purchase frequency (per month)  Total purchase quantity  Annual income slab  Website visits (per month)  Cluster Number
    1            5k to 10k                    2                               6                        2 lac to 4 lac      3 to 6                      2
    2            1k to 5k                     1                               4                        1 lac to 2 lac      <3                          1
    3            10k to 15k                   3                               8                        4 lac to 6 lac      6 to 10                     3
    4            5k to 10k                    2                               6                        2 lac to 4 lac      3 to 6                      2
    5            10k to 15k                   3                               8                        4 lac to 6 lac      6 to 10                     3
    6            1k to 5k                     1                               2                        1 lac to 2 lac      <3                          1
    7            5k to 10k                    2                               6                        2 lac to 4 lac      3 to 6                      2
    8            1k to 5k                     1                               2                        1 lac to 2 lac      <3                          1
    9            1k to 5k                     1                               2                        1 lac to 2 lac      <3                          1
    10           >20k                         4                               32                       10 lac to 15 lac    >10                         3
Output UI: For Four Predictors
[2-D cluster plot; axes: first and second principal components]
• In the 2D cluster plot shown on the right, the cluster distribution is plotted.
• In this case, the axes will reflect the first two principal components (see definition below) instead of the actual predictors, as the number of predictors is >2. In the case of a 3D plot, the first three principal components would be shown. The less the clusters overlap, the better the cluster assignment.
• Also, the silhouette score can be checked to evaluate how the clusters are partitioned: the closer this value is to 1, the better the partition quality.
Whenever there are more than 2 predictors as input, the axes will reflect principal components instead of the actual predictors.
Principal components are linear combinations of the original predictors which capture the maximum variance in the data set.
Typically most of the variance in the data is explained by the first three principal components, so we can ignore the remaining components.
Each customer is assigned a cluster membership as shown in the table on the left.
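Projecting more than two predictors down to the first two principal components for the cluster plot can be sketched as below; the data is made up for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))  # 50 hypothetical customers, 4 predictors

# Keep the first two principal components: the two linear
# combinations of predictors that capture the most variance
pca = PCA(n_components=2)
coords = pca.fit_transform(X)  # 2-D coordinates for the scatter plot

print(coords.shape)                   # (50, 2)
print(pca.explained_variance_ratio_)  # share of variance per component
```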
15. Other Sample Output Formats
Clusters with a silhouette score closer to 1 are more desirable, so this index can be used to measure the goodness of the clusters.
[Plot: silhouette = -0.5, indicating poor quality clusters]
17. Limitations
• Relatively time consuming compared to other clustering methods.
• The distance measure used to separate the data points in a cluster is affected by the scale of the variables; hence scaling/normalization of the variables is necessary prior to clustering.
• Depending on the distance calculation method chosen, the algorithm may sometimes be influenced by outliers.
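The scaling step mentioned in the limitations can be sketched with standardisation (zero mean, unit variance); the two predictors here are hypothetical and chosen to have very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical predictors on very different scales:
# annual income (tens of thousands) vs. purchase frequency (single digits)
X = np.array([[66000.0, 2.0],
              [94000.0, 6.0],
              [60000.0, 7.0]])

# Rescale each column to mean 0 and standard deviation 1 so distance
# calculations are no longer dominated by the income column
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0).round(6))  # ~[0. 0.]
print(X_scaled.std(axis=0).round(6))   # ~[1. 1.]
```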
19. General Applications
Some examples of use cases are:
Behavioral segmentation:
• Segment customers by purchase history
• Segment users by activities on an application, website, or platform
• Define personas based on interests
• Create consumer profiles based on activity monitoring
Some other general applications:
Pattern recognition, market segmentation, classification analysis, artificial intelligence, agriculture, and many others.
20. Use Case 1
Business problem:
• Grouping loan applicants into high/medium/low risk applicants based on attributes such as loan amount, monthly installment, employment tenure, times delinquent, annual income, debt-to-income ratio, etc.
Business benefit:
• Once segments are identified, the bank will have a loan applicants' dataset with each applicant labeled as high/medium/low risk.
• Based on these labels, the bank can easily decide whether to give a loan to an applicant and, if so, what credit limit and interest rate each applicant is eligible for based on the amount of risk involved.
22. Use Case 1: Output: Each record will have the cluster (segment) assignment as shown below

Customer ID  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Cluster
1039153      21000        701.73               105000         9                     5                 4                  Medium
1069697      15000        483.38               92000          11                    5                 2                  Medium
1068120      25600        824.96               110000         10                    9                 2                  Medium
563175       23000        534.94               80000          9                     2                 12                 Low
562842       19750        483.65               57228          11                    3                 21                 Low
562681       25000        571.78               113000         10                    0                 9                  Low
562404       21250        471.2                31008          12                    1                 12                 Low
700159       14400        448.99               82000          20                    6                 6                  High
696484       10000        241.33               45000          18                    8                 2                  High
702598       11700        381.61               45192          20                    7                 3                  High
702470       10000        243.29               38000          17                    9                 7                  High
702373       4800         144.77               54000          19                    8                 2                  High
701975       12500        455.81               43560          15                    8                 4                  High
23. Use Case 1: Output: Cluster profiles

Cluster  Loan amount  Monthly installment  Annual income  Debt to income ratio  Times delinquent  Employment tenure  Risk Segment
1        10447.30     304.87               66467.74       9.58                  1.69              16.82              Low
2        21391.58     598.54               94912.59       12.37                 5.98              4.58               Medium
3        7521.32      227.43               60935.28       16.55                 6.91              4.01               High

• As can be seen in the table above, the high, medium, and low risk segments have distinctive characteristics.
• The high risk segment has the highest likelihood of delinquency, the highest debt-to-income ratio, and the lowest employment tenure compared to the other two segments.
• The low risk segment exhibits exactly the opposite pattern, i.e. the lowest debt-to-income ratio, the lowest delinquency, and the highest employment tenure compared to the other two segments.
• Hence, delinquency, employment tenure, and debt-to-income ratio are the determining factors when it comes to segmenting loan applicants.
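Cluster profiles of this kind can be produced by averaging each variable within each cluster. A minimal sketch using pandas with hypothetical, made-up values:

```python
import pandas as pd

# Hypothetical loan applicants with their assigned cluster numbers
df = pd.DataFrame({
    "cluster":          [1, 1, 2, 2, 3, 3],
    "loan_amount":      [10000, 11000, 21000, 22000, 7000, 8000],
    "times_delinquent": [1, 2, 5, 7, 6, 8],
})

# Average each variable per cluster to build the profile table
profiles = df.groupby("cluster").mean()
print(profiles)
```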
24. Use Case 1: Output: Cluster distribution
• In the cluster distribution plot, there is negligible overlap in the cluster outlines, so we can say that the cluster assignment is good in our case.
• Clusters with an average silhouette width closer to 1 are more desirable, so this index can be used to test the quality of the cluster distribution.
25. Use Case 2
Business problem:
• Organizing customers into groups/segments based on similar traits, product preferences, and expectations.
• Segments are constructed on the basis of customers' demographic characteristics, psychographics, past behavior, and product use behaviors.
Business benefit:
• Once segments are identified, marketing messages and even products can be customized for each segment.
• The better the segment(s) chosen for targeting by a particular organization, the more successful it is assumed to be in the marketplace.
26. Use Case 3
Business problem:
• Discount analysis and customer retention: visualize 'segments of the sales group based on discount behavior' and 'customer churn: segments of customers on the verge of leaving'.
Business benefit:
• The business marketing team can focus on risky customer segments in an efficient way in order to prevent them from churning/leaving.
• Sales team segments which are facing challenges with the current discounting strategy can be identified, and the deal negotiation strategy can be improved/optimized for them.
27. Want to Learn More?
Get in touch with us @ support@Smarten.com
And do check out the Learning section on Smarten.com
June 2018