Term Paper on WEKA
2012
Regression Analysis
and Cluster Analysis
Using WEKA
Kanishka Chakraborty (10BM60036)
VGSoM, IIT Kharagpur
2010-2012
Table of Contents
Introduction
Scope of this term paper
    Data Used
    Analysis Done
Analysis
    Regression Analysis
    Cluster Analysis
References
Introduction
The amount of data being generated is huge and is growing at an exponential rate. Data, however,
is of little use in itself: it must be converted into information that can be interpreted and
acted upon. There are several ways to do this, and data mining is one of them. It helps in
deducing meaningful patterns and facts from data and has applications in almost every field.
Organizations rely on data mining to obtain the insights on which their decisions are based.
Many data mining tools are available in the market. WEKA (Waikato Environment for Knowledge
Analysis) is one such tool, and it is among the most widely used.
WEKA is a free, Java-based tool available under the GNU General Public License. It offers a
large collection of visualization tools, algorithms, and preprocessing and modeling techniques
for data mining, which has made it popular. It provides the user with both a GUI (Graphical
User Interface) and a CLI (Command Line Interface).
The applications available:
Explorer: An environment to analyze data in WEKA
Experimenter: Environment for conducting statistical tests
KnowledgeFlow: Offers the same functionality as Explorer through a drag-and-drop interface
Simple CLI: Provides command line interface for WEKA
The tool requires the data to be in .arff format. ARFF stands for Attribute-Relation File
Format. It is an ASCII text file that lists the relation name, the attributes with their types,
and the attribute values of each instance. It consists of three parts: Relation, Attribute and
Data.
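As an illustration of the three parts, the following Python sketch holds a tiny ARFF document in a string and splits it into its relation name, attribute declarations and data rows. The relation name, the attribute names and the two data rows are made up for illustration and are not taken from the survey. The parser is an assumption-laden sketch that handles only this simple case (no quoted values, no sparse data).

```python
# A tiny, hypothetical ARFF document (names and rows are illustrative only).
ARFF_TEXT = """\
@relation vas_survey

@attribute mobile_internet {Yes,No}
@attribute handset_price numeric

@data
Yes,4
No,2
"""


def split_arff(text):
    """Split a simple ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []   # (name, type) pairs taken from @attribute lines
    rows = []         # value lists taken from the @data section
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append(line.split(","))
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = split_arff(ARFF_TEXT)
print(relation)
print(attributes)
print(rows)
```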
Scope of this term paper
This paper analyzes telecom customers' usage of, and experience with, Value Added Services
(VAS). The aim is to identify customers who are likely to adopt a service such as 3G, and to
identify which factors matter most in predicting adoption. This information plays a major role
in shaping the marketing strategy for 3G.
DATA USED
The data used for this paper was collected through a survey conducted in Guwahati, Assam. The
survey was designed to identify the factors that differentiate potential 3G customers from
customers who are unlikely to adopt 3G. The sample size is 206 and covers the following
demographic segments:
Students
Young Professionals (<35 years of age)
Working Professionals (>35 years of age)
Housewives
Defense personnel
Low Income Group (Rickshaw drivers, Auto rickshaw drivers, Shopkeepers etc.)
Variable                      Description                            Categories
Monthly expenditure on VAS    How much the customer spends on        <100, 100-300, 300-500, >500
                              VAS in a month
Mobile Internet               Whether the customer uses              Yes, No
                              internet on their mobile
Internet speed usage          What has been the mobile internet      Satisfied, Neither Satisfied nor Dissatisfied,
experience                    usage satisfaction level of the        Dissatisfied, Not used
                              customers
3G Awareness                  How aware is the customer              Using 3G, Fully Aware, Partially Aware,
                              regarding the 3G services              Not Aware
Handset Price                 What is the price of the handset       <3000, 3000-5000, 5000-7000, 7000-10000,
                              the customer is using                  10000-15000, 15000-20000, 20000-30000,
                                                                     >30000
3G usage plan                 Whether the customer is planning       Yes, No
                              to use 3G in the near future
Demography                    The age-occupation combination of      Low income group, Housewives, Defense,
                              the customer                           Young Professionals, Working Professionals,
                                                                     Students
To be usable in WEKA, the data was first converted to .arff format. This involves declaring the
following:
Relation: The name of the dataset is declared under the relation header.
Attribute: Each variable is defined as an attribute. The data type (numeric, string etc.) is
also defined for each attribute.
Data: The instances are entered under the data header, one per line, with a value for each
attribute.
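The conversion described above can be sketched in Python as follows. This is only an illustration, not the procedure actually used for the survey: the relation name and attribute identifiers are hypothetical, the categories are copied from the variable table, the two rows are invented respondents, and declaring the variables as nominal attributes is an assumption (the paper does not say whether they were stored as categories or as numeric codes).

```python
# Attribute identifiers are hypothetical; the categories follow the survey table.
ATTRIBUTES = [
    ("mobile_internet", ["Yes", "No"]),
    ("awareness_3g", ["Using 3G", "Fully Aware", "Partially Aware", "Not Aware"]),
    ("usage_plan_3g", ["Yes", "No"]),
]


def quote(value):
    """Wrap a value in single quotes when it contains a space."""
    return "'" + value + "'" if " " in value else value


def to_arff(relation, attributes, rows):
    """Build ARFF text from nominal attribute definitions and data rows."""
    lines = ["@relation " + relation, ""]
    for name, categories in attributes:
        allowed = ",".join(quote(c) for c in categories)
        lines.append("@attribute " + name + " {" + allowed + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(quote(v) for v in row))
    return "\n".join(lines) + "\n"


# Two made-up respondents, not actual survey records.
rows = [["Yes", "Fully Aware", "Yes"], ["No", "Not Aware", "No"]]
arff_text = to_arff("vas_survey", ATTRIBUTES, rows)
print(arff_text)
```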
ANALYSIS DONE
The following analysis will be conducted using the tool:
Regression
Clustering
Regression is carried out to understand the relationship between the variables in the data, so
that the value of one variable can be predicted from one or more others. Clustering is a
technique that forms groups and assigns each instance to one of them, such that the instances
within a group are similar to each other. It is widely used to segment customers according to
their characteristics and preferences.
Analysis
Regression Analysis
Regression analysis is used to understand the relationship between a particular variable (the
dependent variable) and others (the independent variables). For this paper the variables studied
are as follows:
Dependent Variable: Plan to use 3G
Independent Variables:
o Mobile internet user
o 3G awareness
o Price of the handset used
STEPS TO FOLLOW
I. Select Classify tab
II. Click on the Choose button
III. Go to functions
IV. Select LinearRegression from the list
V. Under Test options, select Percentage split and enter the % of data to be used for
training (the rest will be used for testing)
VI. Click on Start to perform the analysis
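Conceptually, the steps above hold out part of the data and fit a least-squares linear model on the rest. The following pure-Python sketch shows that idea only. It is not WEKA's implementation (WEKA's LinearRegression has further options, such as attribute selection), the function names are hypothetical, and the rows are synthetic values generated from a known line rather than survey data.

```python
import random


def percentage_split(rows, train_percent, seed=1):
    """Shuffle the rows and split them into (train, test) by percentage."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(rows):
    """Least-squares fit. Each row is (x1, ..., xk, y); returns [w1, ..., wk, intercept]."""
    design = [list(r[:-1]) + [1.0] for r in rows]       # trailing 1.0 is the intercept term
    targets = [r[-1] for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)                              # normal equations


# Synthetic rows generated from y = 2*x1 - x2 + 3 (not survey data).
data = [(x1, x2, 2 * x1 - x2 + 3) for x1 in range(1, 6) for x2 in range(1, 5)]
train, test = percentage_split(data, 66)
weights = fit_linear(train)
print([round(w, 4) for w in weights])
```

Because the synthetic rows contain no noise, the fitted weights should recover the generating coefficients, which makes the sketch easy to check by eye.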
OUTPUT
The regression analysis conducted on the data gives us the following equation:
3G Planner = 0.4599 * (Internet Mobile User) + 0.0891 * (3G awareness) - 0.1325 * (Handset price) + 0.9421
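To make the equation concrete, the sketch below evaluates it for two hypothetical respondents. The coefficients are the ones reported above; the numeric codes are an assumption, since the paper does not document them. The sketch assumes that Yes = 1 and No = 2 and that ordered categories are numbered in the order they appear in the variable table, which is consistent with the cluster centroids reported later. Under that assumption a score near 1 points towards "Yes" and a score near 2 towards "No".

```python
def planner_score(internet_user, awareness, handset_price):
    """Evaluate the fitted regression equation reported in the paper."""
    return (0.4599 * internet_user
            + 0.0891 * awareness
            - 0.1325 * handset_price
            + 0.9421)


# Hypothetical respondents; the numeric codes below are assumed, not documented.
# A: uses mobile internet (1), fully aware (2), handset in the 7000-10000 band (4).
score_a = planner_score(internet_user=1, awareness=2, handset_price=4)
# B: no mobile internet (2), not aware (4), handset in the 3000-5000 band (2).
score_b = planner_score(internet_user=2, awareness=4, handset_price=2)
print(round(score_a, 4), round(score_b, 4))
```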
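ANALYSIS OF THE OUTPUT
The signs of the coefficients have to be read against the numeric coding of the responses. The
cluster centroids reported later in this paper suggest that Yes is coded as 1 and No as 2, and
that ordered categories are numbered in the order in which they are listed, so a lower value of
3G Planner indicates a stronger intention to use 3G. On that reading, the output leads to the
following interpretations:
Whether a person is planning to use 3G depends to a great extent on whether that person
uses internet on their mobile. A person who uses internet on their mobile is more likely
to try 3G.
The 3G trial plan also depends on the price of the handset the respondent is currently
using. The higher the price, the higher the likelihood that the person will try 3G.
The plan for 3G usage also depends on the 3G awareness level, but the dependence is
weak. According to the output, the higher the awareness of 3G, the more likely it is
that the person will try 3G.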
Cluster Analysis
Before creating a marketing strategy for any product it is very important to identify particular
segments present in the market. These segments can then be studied in order to select the one
which is best suited for targeting. For identifying the segments present in the market clustering
can be used. For this paper, K Means Clustering has been used.
STEPS TO FOLLOW
I. Select Cluster tab
II. Click on the Choose button
III. Select SimpleKMeans from the list
IV. Click on the text box beside the Choose button. Enter the number of clusters you want
in the numClusters field
V. Click on Start to perform the analysis
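For readers who want to see what the clustering step does, the following is a minimal pure-Python sketch of the basic k-means procedure (Lloyd's algorithm). It is not WEKA's SimpleKMeans: it assumes purely numeric attributes and Euclidean distance, the function names and the `seed` parameter are hypothetical, and the six points are toy data rather than survey responses.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=10, max_iterations=100):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    centroids = random.Random(seed).sample(points, k)   # random initial centroids
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:               # converged
            break
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments


# Toy data: two visibly separate groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```

As in the GUI, the number of clusters `k` is chosen by the user; the algorithm only decides which instance goes into which cluster.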
OUTPUT
The outputs obtained are as follows:
Cluster centroids
The centroids obtained by clustering help in understanding the characteristics of each
segment. They give the average value of each variable within each cluster.
Attribute                                     Cluster 0   Cluster 1   Cluster 2   Cluster 3
Membership                                    (65)        (61)        (41)        (39)
Monthly Expense on VAS                        1.1692      1           1.1951      1.3333
Mobile Internet user                          0.6923      2           1.7317      0.9487
Satisfaction level of mobile internet usage   1.9385      0           0           1.2462
3G Awareness                                  2.4769      2.6885      2.6098      2.1538
Demography                                    4.18154     2.3607      4.6829      5.2821
Price of Handset used                         2.3385      2.1311      3.3415      3.8769
3G usage plan                                 2           2           1.9268      0.8974
Clustered Instances
The clustered instances give the number of instances that belong to each cluster. This helps
in estimating what percentage of the total population is likely to belong to each cluster.
Cluster 0: 65 (32%)
Cluster 1: 61 (30%)
Cluster 2: 41 (20%)
Cluster 3: 39 (19%)
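The percentages follow directly from the cluster counts and the sample size of 206. The short Python check below recomputes them; because each share is rounded to a whole number, the four values add up to 101% rather than 100%. The variable names are illustrative only.

```python
# Cluster sizes as reported in the output above.
cluster_sizes = {0: 65, 1: 61, 2: 41, 3: 39}
total = sum(cluster_sizes.values())   # should equal the sample size

# Share of each cluster, rounded to a whole percentage.
shares = {c: round(100 * n / total) for c, n in cluster_sizes.items()}
for c, n in cluster_sizes.items():
    print(f"Cluster {c}: {n} ({shares[c]}%)")
```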
ANALYSIS OF THE OUTPUT
In K Means Clustering the number of clusters to be formed is entered by the user. Here the
number of clusters has been set to 4. WEKA describes each cluster in terms of the centroid
value of each variable for that cluster. The cluster descriptions are as follows:
Attribute                                     Cluster 0               Cluster 1       Cluster 2               Cluster 3
Membership                                    (65)                    (61)            (41)                    (39)
Monthly Expense on VAS                        <100                    <100            <100                    0-300
Mobile Internet user                          Yes                     No              No                      Yes
Satisfaction level of mobile internet usage   Not Satisfied           Haven’t used    Haven’t used            Satisfied
3G Awareness                                  Low awareness           Low awareness   Low awareness           Fully Aware
Demography                                    Working Professionals   Housewives      Working Professionals   Young Professionals & Students
Price of Handset used                         3000-5000               3000-5000       5000-7000               7000-10000
3G usage plan                                 No                      No              No                      Yes
Thus the segment to be targeted initially is cluster 3. It consists of young professionals
(<35 years of age) and students. This segment is the most likely to go for 3G services. Its
awareness level is fairly high, its members use handsets in the 7000-10000 price band, and
they are satisfied with the internet speed they receive on their handsets. The segment
accounts for 19% of the respondents. Thus, to the extent that the sample is representative,
around 19% of the total population consists of customers who are likely to go for a service
like 3G.
References
http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
http://en.wikipedia.org/wiki/Weka_%28machine_learning%29
http://sourceforge.net/projects/weka/files/documentation/3.6.x/WekaManual-3-6-2.pdf/download