1. Data Mining Techniques Using R and WEKA
IT for Business Intelligence
Term paper
Utsav Mone (10BM60094)
This term paper explains two techniques:
1) Linear Modelling using R
2) Clustering using WEKA
2. Linear Modelling using R
Here I have tried to analyse the relation between a company's bid-ask spread and the volatility of its
prices. I tested three different hypotheses to fit the model: first a linear relation, then logarithmic
and exponential relations with volatility.
We have order-book data from which the bid-ask spread can be calculated at different hours. From the
trade data I calculated the daily price volatility of the stock and examined the relation between the two.
Bid Ask Spread
A Measure of liquidity
• The amount by which the ask price exceeds the bid: essentially the difference between the highest
price a buyer is willing to pay for an asset and the lowest price at which a seller is willing to
sell it.
• Ask - The price a seller is willing to accept for a security, also known as the offer price. Along with
the price, the ask quote will generally also stipulate the amount of the security on offer.
• Bid - An offer made by an investor, a trader or a dealer to buy a security. The bid stipulates both
the price at which the buyer is willing to purchase the security and the quantity.
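As a quick illustration (the quote values below are hypothetical, not from the NSE data), the absolute and proportional spread can be computed as follows; the proportional form, spread divided by the mid price, is the one used later in the R code:

```python
# Hypothetical best quotes for one snapshot (illustrative values only)
best_bid = 754.90   # highest price any buyer is currently willing to pay
best_ask = 755.00   # lowest price any seller is currently willing to accept

spread = best_ask - best_bid        # absolute bid-ask spread
mid = (best_bid + best_ask) / 2     # mid-quote price
proportional_spread = spread / mid  # spread relative to the mid price

print(round(spread, 2))               # 0.1
print(round(proportional_spread, 6))  # 0.000132
```

The proportional spread makes different price levels comparable, which is why the R code later divides by the mid price.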
Factors affecting Bid-Ask Spread
1) Volatility (higher volatility widens the spread)
Standard deviation
Variance between returns from that same security or market index
2) Volumes (higher volume narrows the spread)
Absolute number of shares under transaction
Percentage of free floating shares
Number of orders
3) Others
Tick size
Price of Share
All the above measures (except Others) can be further classified into two categories:
Executed
Requested
In our case I have looked only at the relation between volatility and the bid-ask spread.
I have used the daily volatility of Tata Motors share prices for February 2008. Daily volatility of
the share prices is calculated on the basis of hourly data instead of the usual approach of using each
day's closing price.
This is required since we want to see changes in daily volatility.
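The calculation described above can be sketched as follows (in Python for illustration, with made-up hourly average prices):

```python
import statistics

# Made-up average prices for the six hours 10:00 .. 15:00 of one day
hourly_prices = [708.0, 712.5, 710.0, 715.2, 713.8, 716.0]

# Hour-to-hour change ratios, mirroring (Hr10 - Hr11)/Hr10 in the R code
returns = [(p0 - p1) / p0
           for p0, p1 in zip(hourly_prices, hourly_prices[1:])]

# Daily volatility = standard deviation of the five hourly returns
daily_volatility = statistics.stdev(returns)
print(len(returns))          # 5
print(daily_volatility > 0)  # True
```

With only a daily closing price there would be a single observation per day, so no within-day standard deviation could be computed; hourly prices give five returns per day.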
3. Data –
Dr Devlina Chatterjee of VGSoM has purchased a large amount of data from NSE for her research. I have
used a few files from her data set.
There are three types of files.
1) Snapshots
2) Trade Data
3) Price Volume Data
Price Volume Data
I have used February 2008 share data of Tata Motors. Except for the trade data, all of it is available
in the public domain.
The file contains the following items:
i) Symbol
ii) Series
iii) Date
iv) Prev Close
v) Open Price
vi) High Price
vii) Low Price
viii) Last Price
ix) Close Price
x) Average Price
xi) Total Traded Quantity
xii) Turnover in Lacs
This text file is available at this link- http://bit.ly/TM_PVD
TATAMOTORS,EQ,03-Dec-2007,732.45,736,749,733.35,737,736.15,741,481721,3569.5399
TATAMOTORS,EQ,04-Dec-2007,736.15,737,746,728.35,746,741.3,738.2,631272,4660.0808995,
TATAMOTORS,EQ,05-Dec-2007,741.3,744,783.9,744,773,772.4,769.92,1410714,10861.311993,
TATAMOTORS,EQ,06-Dec-2007,772.4,775.5,782,763.25,778,775.45,774.13,807793,6253.379844,
TATAMOTORS,EQ,10-Dec-2007,767.3,772,777.7,745.05,775,766.45,757.78,521361,3950.7440285,
TATAMOTORS,EQ,11-Dec-2007,766.45,770,777.3,761,777.3,775.2,770.04,676097,5206.1990345,
TATAMOTORS,EQ,12-Dec-2007,775.2,776.9,780,762,769,770.05,768.88,665743,5118.7625105,
4. Snapshots Data
In this type of data we have snapshots of the order book at four hours of the day: 11 hrs, 12 hrs,
13 hrs and 14 hrs. Here we look at snapshot data of Tata Motors for different months and hours of the day.
A sample of the data is shown below; since there are so many files, it is difficult to upload them all.
A look at Snapshot data –
1) Order Number
2) Company
3) Trade Type
4) No of shares in Order
5) Quote
6) Time Stamp
7) Buy Sell
8) Flags
A Sample Snapshot Data of Tata Motors on 1 Feb 11 Hr -
2008020150046719 TATAMOTORS EQ 500 559.60 09:55:48 B ynnn nnn nnn RL 0
2008020150716321 TATAMOTORS EQ 10 560.00 10:35:56 B ynnn nnn nnn RL 0
2008020150034116 TATAMOTORS EQ 100 575.00 09:55:22 B ynnn nnn nnn RL 0
2008020150067971 TATAMOTORS EQ 824 576.65 09:56:38 B ynnn nny nnn RL 0
2008020100283272 TATAMOTORS EQ 100 582.00 10:09:10 B ynnn nnn nnn RL 0
2008020150233325 TATAMOTORS EQ 25000 585.00 10:04:34 B ynnn nny nnn RL 0
Detail of Flags can be seen at –
https://docs.google.com/document/d/1pW0Fou2VzSiacEn0OKR5rKeKBRZTzw5USk7HOBVQRjs/edit
5. Trade Data
This is daily trade data, which lists all the trades that took place in a day.
A look at Trade data –
1) Trade Number
2) Name of Company
3) Type of Trade
4) Time of Trading
5) Price
6) Volume of shares traded
Opening price of the day = 708
2475593 TATAMOTORS EQ 09:55:16 708 3713
2475830 TATAMOTORS EQ 09:55:20 708 800
2475871 TATAMOTORS EQ 09:55:21 708 200
2475872 TATAMOTORS EQ 09:55:21 708 1
2475873 TATAMOTORS EQ 09:55:21 708 1
2475874 TATAMOTORS EQ 09:55:21 708 210
2475935 TATAMOTORS EQ 09:55:22 708 800
Note the price variation over about 3 seconds, from 755 down to 754.55 and back to 755:
3843007 TATAMOTORS EQ 13:33:37 755 5
3843008 TATAMOTORS EQ 13:33:37 755 453
3843021 TATAMOTORS EQ 13:33:38 754.9 1
3843022 TATAMOTORS EQ 13:33:38 754.55 9
3843037 TATAMOTORS EQ 13:33:38 755 1
3843050 TATAMOTORS EQ 13:33:38 754.9 1
3843051 TATAMOTORS EQ 13:33:38 754.9 9
3843052 TATAMOTORS EQ 13:33:38 754.9 1
3843069 TATAMOTORS EQ 13:33:39 755 1
More detail of the data is available at –
https://docs.google.com/document/d/1pW0Fou2VzSiacEn0OKR5rKeKBRZTzw5USk7HOBVQRjs/edit
6. R Program
Data Location
We need to set the working directory in R; R looks for all files in the assigned directory.
Packages Required
chron
zoo
fda
MASS
proto
DBI
RSQLite
RSQLite.extfuns
stats4
sde
tcltk
sqldf
7. Program Understanding
Reading File
In this program I have first read the different files using a for loop.
1) Reading Trade Data
File name = company name + day + month/year + ".txt" (e.g. TATAMOTORS_01Feb08.txt, per the paste() call in the code)
2) Reading Snapshot Data
File name = company name + "_" + day + month/year + "_" + hour + ".txt" (e.g. TATAMOTORS_01Feb08_11.txt)
The data is read using SQL queries, which is why several of the packages above are required.
Since the data set is large, we have to decide which trades to read. I first found the average price of
the trades at the start of each hour: e.g. all trades whose time stamp begins with 10:00 (the first
minute of the hour, matching the LIKE '10:00%' filter in the code) are taken into consideration.
Then I found the gain or loss for every hour, from which the day's volatility is computed.
Then for each hour I found the bid-ask spread and averaged it over the day.
For the linear model I used the February series of daily volatility and bid-ask spread.
The program has comments to aid understanding.
Almost the same program was run for the exponential and logarithmic relations, with only a small change
in the last six lines of code; that code is given in the cases explained below.
Since I am a new user of R, the program is not very efficient, but it runs correctly.
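The sqldf calls in the R code run SQL against in-memory data. The same idea can be sketched with Python's built-in sqlite3 module; the trade rows below are hypothetical, and the column names V4/V5 mimic the default names read.table assigns:

```python
import sqlite3

# In-memory database standing in for the trade data frame
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trade (V4 TEXT, V5 REAL)")  # V4 = time, V5 = price
rows = [("10:00:01", 708.0), ("10:00:05", 708.5), ("11:00:02", 712.0)]
con.executemany("INSERT INTO trade VALUES (?, ?)", rows)

# Average price of trades time-stamped in the 10:00 minute, mirroring
# sqldf("select avg(V5) from Trade where V4 like '10:00%'")
(avg_10,) = con.execute(
    "SELECT avg(V5) FROM trade WHERE V4 LIKE '10:00%'"
).fetchone()
print(avg_10)  # 708.25
```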
8. Code
Name<-'TATAMOTORS_'
#/* This is name of company*/
MY<-'Feb08'
#/* This is month and year*/
Day<-c('01','04','06','07','08','11','12','13','14','15','18','19','20','21','22')
#/* These are Working days of Feb Month, This is hardcoded as of now*/
#/* Pre-allocate PSpread and Dailystdev as numeric vectors of length 15 */
PSpread<-numeric(15)
Dailystdev<-numeric(15)
BD<-numeric(4)
9. #/* Reading Trade Data*/
for(i in 1:15)
{
TFile<-paste(Name,Day[i],MY,".txt",sep = "")
Trade<-read.table(TFile)
summary(Trade)
#/* SQL queries to find the average price of all trades in the first minute of each hour */
Hr10Price<-sqldf("select avg(V5) from Trade where V4 like '10:00%'")
Hr11Price<-sqldf("select avg(V5) from Trade where V4 like '11:00%'")
Hr12Price<-sqldf("select avg(V5) from Trade where V4 like '12:00%'")
Hr13Price<-sqldf("select avg(V5) from Trade where V4 like '13:00%'")
Hr14Price<-sqldf("select avg(V5) from Trade where V4 like '14:00%'")
Hr15Price<-sqldf("select avg(V5) from Trade where V4 like '15:00%'")
#/* This is to find returns at each hour*/
R1 = ((Hr10Price[1,1] - Hr11Price[1,1])/Hr10Price[1,1])
R2 = (Hr11Price[1,1] -Hr12Price[1,1])/Hr11Price[1,1]
R3 = (Hr12Price[1,1] -Hr13Price[1,1])/Hr12Price[1,1]
R4 = (Hr13Price[1,1] -Hr14Price[1,1])/Hr13Price[1,1]
R5 = (Hr14Price[1,1] -Hr15Price[1,1])/Hr14Price[1,1]
R<-c(R1,R2,R3,R4,R5)
Dailystdev[i]<-sd(R, na.rm = FALSE)
#/* Dailystdev variable have standard deviation of daily returns*/
#/******************************************/
#/* Code below is for reading snapshot data */
#/******************************************/
Company<-'TATAMOTORS'
Month<-'Feb'
Year<-"08"
Time<-c(11,12,13,14)
h<-"_"
i
for(j in 1:4)
{
File<-paste(Company,h,Day[i],Month,Year,h,Time[j],".txt",sep = "")
X<-read.table(File)
#/* SQL and formulas find the Bid and Ask value of the hour */
MaxBuyP<-sqldf("select max(V5) from X where V10 = 'nnn' and V7 = 'B' ")
MinSellP<-sqldf("select min(V5) from X where V10 = 'nnn' and V7 = 'S' ")
MinSell = MinSellP[1,1]
MaxBuy = MaxBuyP[1,1]
#/* This is done to bring array variable to regular variable */
BidAsk = MinSell - MaxBuy
BD[j] =BidAsk/((MaxBuy+MinSell)/2)
}
PSpread[i]<- mean(BD)
}
PSpread
Dailystdev
#/* DF is the data frame for modelling */
DF <- data.frame(PSpread,Dailystdev)
Result<-lm(PSpread ~ Dailystdev,DF)
Result
summary(Result)
#/*******************END***********************/
11. Analysis
The analysis shows that the intercept on the Y axis is significant but the coefficient is not.
The adjusted R-squared also shows that the model does not fit.
The F statistic has a very high p-value, which gives the overall indication that the bid-ask spread does
not have any linear relation with the daily volatility of prices.
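For reference, the quantities reported in an lm() summary for a simple y ~ x fit can be computed by hand; this sketch (in Python, with made-up volatility/spread pairs) shows the slope, intercept and R-squared:

```python
# Hypothetical (daily volatility, proportional spread) pairs, for illustration
xs = [0.004, 0.006, 0.005, 0.009, 0.007]
ys = [0.0015, 0.0021, 0.0018, 0.0024, 0.0019]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

slope = sxy / sxx            # the Dailystdev coefficient in the lm() output
intercept = my - slope * mx  # the (Intercept) term
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - my) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot  # Multiple R-squared in the summary
```

A low R-squared and an insignificant slope, as in the actual output, mean the fitted line explains almost none of the variation in the spread.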
So I changed the hypothesis to the following cases:
The bid-ask spread is exponentially related to volatility
or
The bid-ask spread is logarithmically related to volatility
12. Exponential Case
Dailystdevexp<-exp(Dailystdev)
DFexp <- data.frame(PSpread,Dailystdevexp)
Resultexp <-lm(PSpread ~ Dailystdevexp,DFexp)
Resultexp
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.04463 0.08037 -0.555 0.588
Dailystdevexp 0.04582 0.07970 0.575 0.575
13. Log Case
Dailystdevln<-log(Dailystdev, base = exp(1))
DFln <- data.frame(PSpread,Dailystdevln)
Resultln <-lm(PSpread ~ Dailystdevln,DFln)
summary(Resultln)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0021993 0.0027177 0.809 0.433
Dailystdevln 0.0001246 0.0005400 0.231 0.821
We see that neither the exponential nor the logarithmic model fits either.
14. Clustering Using WEKA
Clustering groups data instances. It especially helps marketers identify patterns in data and
segment their customers.
The Dataset
The data used here is obtained from the CD accompanying Naresh Malhotra's book on marketing
research. It can be downloaded from the following link - bit.ly/HVwPEP
The example illustrates the use of a clustering method to segment customers based on their
attitudes towards shopping. Customers were asked to express their degree of agreement with the
following statements on a 7-point scale:
V1 - Shopping is fun
V2 - Shopping is bad for your budget
V3 - I combine shopping with eating out
V4 - I try to get the best buys when shopping
V5 - I don’t care about shopping
V6 - You can save a lot of money by comparing prices
Clustering Procedure
Load the data using the open file option in Weka. You will get the window shown in figure 1.
Click on the Cluster tab, then click Choose and select SimpleKMeans. You will get the window
shown in figure 2. By default the number of clusters created is 2. To change the number of
clusters, click on SimpleKMeans; you will get the window shown in figure 3. In the numClusters
field specify the number of clusters to be created; for this example it is 3. Click Start, and
you will get the output shown in figure 4.
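SimpleKMeans implements the standard k-means procedure: assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat. A minimal sketch of that algorithm (in plain Python, with made-up 7-point ratings for V1..V6 rather than Malhotra's data set) is:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's-algorithm k-means over tuples of equal length."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign the point to its nearest centroid (squared Euclidean)
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # move each centroid to the mean of its cluster (keep it if empty)
        centroids = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                     else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids, clusters

# Six made-up respondents rated on V1..V6 (7-point scale)
points = [(6, 2, 6, 3, 2, 3), (7, 3, 7, 2, 1, 4),
          (2, 6, 2, 7, 2, 6), (1, 7, 1, 6, 3, 7),
          (2, 3, 2, 3, 6, 3), (3, 2, 3, 2, 7, 2)]
centroids, clusters = kmeans(points, 3)
print(len(centroids))  # 3
```

Weka's output is the equivalent of these centroids: the per-cluster mean of each variable, which is what the interpretation in the next section is based on.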
15. Interpreting The Results
Each cluster tells us a type of behavior in our customers, from which we can begin to draw
some conclusions:
Cluster 0 - High values on V2, V4 and V6. These can be called economical shoppers.
Cluster 1 - High values on V1 and V3 and low values on V5. They could be labelled fun-loving
and concerned shoppers.
Cluster 2 - The opposite of cluster 1. These can be termed apathetic shoppers.
To visually inspect the clusters, right-click in the Result List section. One of the options in
the pop-up menu is Visualize Cluster Assignments. A window will pop up that lets you explore the
results visually (see figure 5).
Figure 1 The window after loading the dataset