Data Mining Techniques using WEKA_Saurabh Singh_10BM60082
IT for Business Intelligence
Term paper on Weka
Submitted by:
Saurabh Singh 10BM60082
Introduction
Weka contains a collection of visualization tools and algorithms for data analysis and predictive
modeling, together with graphical user interfaces for easy access to this functionality. The original
non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms implemented
in other programming languages, plus data preprocessing utilities in C and a Makefile-based system for
running machine learning experiments. This original version was primarily designed as a tool for
analyzing data from agricultural domains, but the more recent fully Java-based version (Weka 3), for
which development started in 1997, is now used in many different application areas, in particular for
educational purposes and research. Advantages of Weka include:
free availability under the GNU General Public License
portability, since it is fully implemented in the Java programming language and thus runs on
almost any modern computing platform
a comprehensive collection of data preprocessing and modeling techniques
ease of use due to its graphical user interfaces
Weka primarily consists of the following four screens:
K-means clustering in WEKA
Suppose a company wants to cluster the market based on the attributes collected by its research team.
This can be done effectively and efficiently by using K-means clustering in Weka.
The attributes used are as follows:
ID
AGE
SEX
RELIGION
INCOME
MARRIED
CHILDREN
CAR
SAVING A/C
CURRENT A/C
LOAN
PENSION PLAN
Weka accepts a few input file formats, such as .csv and .arff. We will use a .csv file as the input file
in our example. The data file consists of 1600 instances and the 12 attributes described above.
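To make the two file formats concrete, here is a minimal sketch of how CSV text could be turned into ARFF text. It assumes a column is numeric when every value parses as a number and nominal otherwise. The function name `csv_to_arff`, the relation name `bank`, and the three sample rows are illustrative assumptions and are not taken from the actual data file.

```python
import csv
import io

def quote(name):
    # ARFF attribute names that contain spaces must be quoted.
    return "'" + name + "'" if " " in name else name

def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

def csv_to_arff(csv_text, relation="bank"):
    """Convert CSV text (with a header row) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute " + quote(name) + " numeric")
        else:
            # Nominal attribute: list the distinct values in sorted order.
            labels = ",".join(sorted(set(values)))
            lines.append("@attribute " + quote(name) + " {" + labels + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)

# Hypothetical three-row sample; the real file has 1600 instances.
sample = "AGE,SEX,SAVING A/C\n48,FEMALE,NO\n40,MALE,YES\n51,FEMALE,YES\n"
arff = csv_to_arff(sample)
print(arff)
```

Weka's Explorer can open the .csv file directly, so this conversion is not needed for the steps that follow. It only shows what the .arff format adds: an explicit type declaration for every attribute.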
Steps in K-means analysis:
Step 1:
Weka startup screen
Step 2:
Choose the Explorer option from the menu. This option is sufficient to perform all the required
operations on the data.
Step 3:
Load the .csv file containing the bank account data.
Step 4:
Since we intend to find clusters within the data, click the Cluster tab and choose SimpleKMeans
from the choices that appear. The following screen appears.
Step 5:
Click the text box next to the Choose button; the following menu appears.
Step 6:
Enter the value 4 in the ‘numClusters’ box.
Step 7:
Click Start to begin the clustering process. The following screen appears.
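For readers who want to see what the clustering process does, the following is a minimal sketch of the basic k-means procedure on numeric data. It is not Weka's implementation. As simplifying assumptions, it seeds the centroids with the first k points (Weka chooses them using a random seed), uses squared Euclidean distance, and ignores nominal attributes. The function name `kmeans` and the six (age, income) points are hypothetical, and k is set to 2 rather than 4 to keep the example small.

```python
def kmeans(points, k, max_iter=100):
    """Basic k-means on numeric tuples; returns (centroids, labels)."""
    # Assumption: seed the centroids with the first k points so the
    # run is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Stop once neither the labels nor the centroids change.
        if new_labels == labels and new_centroids == centroids:
            break
        labels, centroids = new_labels, new_centroids
    return centroids, labels

# Hypothetical (age, income in thousands) pairs, for illustration only.
points = [(25, 20), (60, 80), (27, 22), (62, 78), (24, 19), (58, 82)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Each pass alternates between assigning points to the nearest centroid and recomputing the centroids, and the loop ends when a pass changes nothing.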
Step 8:
The result can be viewed in a separate window. The following screen appears.
From the results above, we can interpret the following:
Cluster 0:
Is centered on the male population.
Mainly lives in towns.
Is mostly unmarried.
Does not own a car or have a previous loan.
Owns a savings a/c and a current a/c.
Does not yet have a pension plan.
Hence we can conclude that cluster 0 is the cluster most likely to buy a pension plan. A similar
interpretation can be applied to the other clusters as required.
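A cluster description of this kind amounts to reading off the most common value of each attribute within the cluster. The sketch below shows one way to do that, assuming the rows and their cluster labels are already available. The function name `cluster_profile`, the five rows, and the labels are hypothetical and do not come from the actual output.

```python
from collections import Counter

def cluster_profile(rows, labels, attributes):
    """Return {cluster: {attribute: most common value in that cluster}}."""
    profile = {}
    for cluster in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == cluster]
        profile[cluster] = {
            attr: Counter(row[attr] for row in members).most_common(1)[0][0]
            for attr in attributes
        }
    return profile

# Hypothetical rows and cluster labels, for illustration only.
rows = [
    {"SEX": "MALE", "MARRIED": "NO", "PENSION PLAN": "NO"},
    {"SEX": "MALE", "MARRIED": "NO", "PENSION PLAN": "NO"},
    {"SEX": "MALE", "MARRIED": "YES", "PENSION PLAN": "NO"},
    {"SEX": "FEMALE", "MARRIED": "YES", "PENSION PLAN": "YES"},
    {"SEX": "FEMALE", "MARRIED": "YES", "PENSION PLAN": "YES"},
]
labels = [0, 0, 0, 1, 1]
profile = cluster_profile(rows, labels, ["SEX", "MARRIED", "PENSION PLAN"])
print(profile)
```

In this toy data, cluster 0 comes out as mostly male, unmarried, and without a pension plan, which is the kind of profile that would mark it as a target for pension plan marketing.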
Step 9:
We can use the Visualize All button to see the distribution of all the variables in the population.
Linear Regression using WEKA
Regression
A regression model can answer questions such as how much should be charged for a given model of
car with a certain set of features. It uses past data on car sales, car prices, the features provided, and
other attributes to estimate the price of future models.
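As a rough illustration of the idea, the sketch below fits a straight line to a single feature by ordinary least squares and uses it to price a new model. This is a simplification, since the car example involves several features at once. The function name `fit_line` and the displacement and price figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: engine displacement (litres) vs. price (thousands).
displacement = [1.0, 1.5, 2.0, 2.5, 3.0]
price = [10.0, 13.0, 16.0, 19.0, 22.0]
intercept, slope = fit_line(displacement, price)

# Estimate the price of a hypothetical new model with a 1.8 litre engine.
predicted = intercept + slope * 1.8
print(intercept, slope, predicted)
```

A positive slope means the price rises with the feature, and the fitted equation can then be applied to feature values that were not in the past data.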
Regression in WEKA
Suppose a company wants to regress the price of a car on the various features associated with it. It can
run the regression in WEKA by choosing appropriate independent variables and then obtaining a
regression equation that describes the relationship between the independent variables and the dependent
variable. The following example illustrates this procedure.
Step 1:
Weka startup screen
Step 2:
Choose the Explorer option from the menu. This option is sufficient to perform all the required
operations on the data.
Step 3:
Load the .csv file containing the car specification data.
Step 4:
Click the Classify tab, click the Choose button, and then select LinearRegression from Functions. The
following screen appears.
Step 5:
After clicking the Start button, the following output is generated.
Interpretation of the output: from the above output, we can observe that the selling price is positively
correlated with the engine displacement and with none of the other factors.
Step 6:
Right-click the entry in the result list and select Visualize classifier errors for the following screen.
Step 7:
If we click any point on the plot, Weka displays a summary of that data point. For example:
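The errors shown in such a plot are the differences between the actual and the predicted values. A minimal sketch of how they can be computed and summarized follows, using the mean absolute error and the root mean squared error. The function name `error_summary` and the four actual and predicted values are hypothetical and are not taken from the Weka output.

```python
import math

def error_summary(actual, predicted):
    """Return (residuals, mean absolute error, root mean squared error)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return residuals, mae, rmse

# Hypothetical actual and predicted prices, for illustration only.
actual = [10.0, 14.0, 15.0, 20.0]
predicted = [11.0, 13.0, 15.0, 18.0]
residuals, mae, rmse = error_summary(actual, predicted)
print(residuals, mae, rmse)
```

Points with large residuals are the ones worth clicking on in the plot, since they are the instances the model predicts worst.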