Data Mining Techniques Using R and WEKA
                    IT for Business Intelligence




                                      Term paper
                                   Utsav Mone (10BM60094)




This term paper explains two techniques:

   1) Linear Modelling using R
   2) Clustering using WEKA
Linear Modelling using R

Here I analyse the relation between the bid-ask spread of a company's stock and the volatility of its
prices. I tested three hypotheses: first a linear model, then logarithmic and exponential relations
with volatility.

The bid-ask spread can be calculated from the order book data at different hours. From the trade data I
calculated the daily price volatility of the stock and looked for a relation between the two.

Bid Ask Spread
A Measure of liquidity
    • Bid Ask Spread - The amount by which the ask price exceeds the bid price. This is essentially the
         difference between the highest price that a buyer is willing to pay for an asset and the
         lowest price at which a seller is willing to sell it.
    • Ask - The price a seller is willing to accept for a security, also known as the offer price.
         Along with the price, the ask quote will generally also stipulate the amount of the security.
    • Bid - An offer made by an investor, a trader or a dealer to buy a security. The bid will stipulate
         both the price at which the buyer is willing to purchase the security and the quantity of the
         security.
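
The definitions above can be turned into a small calculation. The following is a minimal sketch in R; the quotes are hypothetical values invented for illustration, and the variable names are not taken from the program used later in this paper.

```r
# Hypothetical resting orders (prices invented for illustration)
buy_quotes  <- c(559.60, 575.00, 582.00, 585.00)  # bids
sell_quotes <- c(586.50, 588.00, 590.00)          # asks

best_bid <- max(buy_quotes)   # highest price a buyer is willing to pay
best_ask <- min(sell_quotes)  # lowest price a seller is willing to accept

spread      <- best_ask - best_bid
mid_price   <- (best_ask + best_bid) / 2
prop_spread <- spread / mid_price  # spread as a proportion of the mid price

print(c(spread = spread, prop_spread = prop_spread))
```

Dividing by the mid price makes spreads comparable across days on which the share trades at different price levels.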


Factors affecting Bid Ask Spread

    1) Volatility (higher volatility widens the spread)

        Standard deviation
        Variance between returns from that same security or market index

    2) Volumes (higher volumes reduce the spread)
       Absolute number of shares under transaction
       Percentage of free floating shares
       Number of orders

    3) Others
       Tick size
       Price of Share

All of the above measures (except Others) can be further classified into two categories:
Executed
Requested


In this paper I only examine the relation between volatility and the bid-ask spread.
I have used the daily volatility of Tata Motors share prices for February 2008. Daily volatility is
calculated from hourly prices within each day, instead of the usual approach based on the closing
price of each day.
This is required because we want to see how volatility changes from day to day.
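
As a rough illustration of this measure, the sketch below computes one day's volatility as the standard deviation of hourly returns. The six hourly prices are hypothetical, and the variable names are only for this example.

```r
# Hypothetical hourly average prices for one trading day (10 Hr to 15 Hr)
hourly_price <- c(708.0, 712.5, 709.8, 715.2, 713.0, 716.4)

# Simple return between consecutive hours: (P[t] - P[t-1]) / P[t-1]
hourly_return <- diff(hourly_price) / head(hourly_price, -1)

# Daily volatility = sample standard deviation of the hourly returns
daily_volatility <- sd(hourly_return)

print(hourly_return)
print(daily_volatility)
```

Six hourly prices give five returns, so each daily figure rests on a very small sample.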
Data –
Dr Devlina Chatterjee of VGSoM purchased a large amount of data from NSE for her research. I have used
a few files from her data.
There are three types of files.
    1) Snapshots
    2) Trade Data
    3) Price Volume Data




Price Volume Data
I have used February 2008 share data of Tata Motors. Except for the trade data, all of this data is
available in the public domain.

The file contains the following items
   i)        Symbol,
   ii)       Series,
   iii)      Date,
   iv)       Prev Close,
   v)        Open Price,
   vi)       High Price,
   vii)      Low Price,
   viii)     Last Price,
   ix)       Close Price,
   x)        Average Price,
   xi)       Total Traded Quantity,
   xii)      Turnover in Lacs


This text file is available at http://bit.ly/TM_PVD. Sample rows:


TATAMOTORS,EQ,03-Dec-2007,732.45,736,749,733.35,737,736.15,741,481721,3569.5399
TATAMOTORS,EQ,04-Dec-2007,736.15,737,746,728.35,746,741.3,738.2,631272,4660.0808995,
TATAMOTORS,EQ,05-Dec-2007,741.3,744,783.9,744,773,772.4,769.92,1410714,10861.311993,
TATAMOTORS,EQ,06-Dec-2007,772.4,775.5,782,763.25,778,775.45,774.13,807793,6253.379844,
TATAMOTORS,EQ,10-Dec-2007,767.3,772,777.7,745.05,775,766.45,757.78,521361,3950.7440285,
TATAMOTORS,EQ,11-Dec-2007,766.45,770,777.3,761,777.3,775.2,770.04,676097,5206.1990345,
TATAMOTORS,EQ,12-Dec-2007,775.2,776.9,780,762,769,770.05,768.88,665743,5118.7625105,
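
A file in this layout can be loaded into R with read.csv. The sketch below is only an illustration: it reads three of the sample rows shown above from a character vector rather than from the file, the column names are my own shortened versions of the item list, and the trailing comma found on some rows is stripped first so that every row has twelve fields.

```r
# Three of the sample rows shown above
raw <- c(
  "TATAMOTORS,EQ,03-Dec-2007,732.45,736,749,733.35,737,736.15,741,481721,3569.5399",
  "TATAMOTORS,EQ,04-Dec-2007,736.15,737,746,728.35,746,741.3,738.2,631272,4660.0808995,",
  "TATAMOTORS,EQ,05-Dec-2007,741.3,744,783.9,744,773,772.4,769.92,1410714,10861.311993,"
)

# Some rows end with a comma; remove it so all rows have 12 fields
raw <- sub(",$", "", raw)

pv <- read.csv(
  text = raw, header = FALSE, stringsAsFactors = FALSE,
  col.names = c("Symbol", "Series", "Date", "PrevClose", "OpenPrice",
                "HighPrice", "LowPrice", "LastPrice", "ClosePrice",
                "AveragePrice", "TotalTradedQuantity", "TurnoverInLacs")
)

# Close-to-close return of each day, using the Prev Close field
pv$DailyReturn <- (pv$ClosePrice - pv$PrevClose) / pv$PrevClose

print(pv[, c("Date", "PrevClose", "ClosePrice", "DailyReturn")])
```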
Snapshots Data

This data contains snapshots of the order book at four hours of the day: 11 Hr, 12 Hr, 13 Hr and
14 Hr. Snapshot data of Tata Motors is available for different months and hours of the day.


Here is a look at the data. Because there are too many files, it is difficult to upload them all.




A look at Snapshot data –
    1) Order Number
    2) Company
    3) Trade Type
    4) No of shares in Order
    5) Quote
    6) Time Stamp
    7) Buy Sell
    8) Flags


Sample snapshot data of Tata Motors for 1 Feb, 11 Hr -
2008020150046719 TATAMOTORS EQ 500 559.60 09:55:48 B ynnn nnn nnn RL 0
2008020150716321 TATAMOTORS EQ 10 560.00 10:35:56 B ynnn nnn nnn RL 0
2008020150034116 TATAMOTORS EQ 100 575.00 09:55:22 B ynnn nnn nnn RL 0
2008020150067971 TATAMOTORS EQ 824 576.65 09:56:38 B ynnn nny nnn RL 0
2008020100283272 TATAMOTORS EQ 100 582.00 10:09:10 B ynnn nnn nnn RL 0
2008020150233325 TATAMOTORS EQ 25000 585.00 10:04:34 B ynnn nny nnn RL 0

Details of the flags can be seen at –
https://docs.google.com/document/d/1pW0Fou2VzSiacEn0OKR5rKeKBRZTzw5USk7HOBVQRjs/edit
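
The snapshot rows are whitespace separated, so read.table can parse them directly. The sketch below reads the six sample rows shown above. The names Flag1 to Flag5 for the last five columns are placeholders of my own, since the meaning of each flag is documented only at the link above, and filtering on the third flag column being 'nnn' is an assumption for this sketch.

```r
snapshot_lines <- c(
  "2008020150046719 TATAMOTORS EQ 500 559.60 09:55:48 B ynnn nnn nnn RL 0",
  "2008020150716321 TATAMOTORS EQ 10 560.00 10:35:56 B ynnn nnn nnn RL 0",
  "2008020150034116 TATAMOTORS EQ 100 575.00 09:55:22 B ynnn nnn nnn RL 0",
  "2008020150067971 TATAMOTORS EQ 824 576.65 09:56:38 B ynnn nny nnn RL 0",
  "2008020100283272 TATAMOTORS EQ 100 582.00 10:09:10 B ynnn nnn nnn RL 0",
  "2008020150233325 TATAMOTORS EQ 25000 585.00 10:04:34 B ynnn nny nnn RL 0"
)

snap <- read.table(
  text = snapshot_lines, stringsAsFactors = FALSE,
  # Order numbers have 16 digits, so they are kept as text
  colClasses = c("character", "character", "character", "integer", "numeric",
                 "character", "character", rep("character", 5)),
  col.names = c("OrderNumber", "Company", "TradeType", "Shares", "Quote",
                "TimeStamp", "BuySell",
                "Flag1", "Flag2", "Flag3", "Flag4", "Flag5")  # placeholder names
)

# Highest quote among buy orders whose third flag column is 'nnn'
is_plain_buy <- snap$BuySell == "B" & snap$Flag3 == "nnn"
best_bid <- max(snap$Quote[is_plain_buy])

print(best_bid)
```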
Trade Data
This is daily trade data, which lists all the trades that took place in a day.




A look at Trade data –
    1) Trade Number
    2) Name of Company
    3) Type of Trade
    4) Time of Trading
    5) Price
    6) Volume of shares traded

Opening trades, price = 708
2475593 TATAMOTORS EQ 09:55:16 708 3713
2475830 TATAMOTORS EQ 09:55:20 708 800
2475871 TATAMOTORS EQ 09:55:21 708 200
2475872 TATAMOTORS EQ 09:55:21 708 1
2475873 TATAMOTORS EQ 09:55:21 708 1
2475874 TATAMOTORS EQ 09:55:21 708 210
2475935 TATAMOTORS EQ 09:55:22 708 800

Price variation within 3 seconds, from 755 and back to 755
3843007 TATAMOTORS EQ 13:33:37 755 5
3843008 TATAMOTORS EQ 13:33:37 755 453
3843021 TATAMOTORS EQ 13:33:38 754.9 1
3843022 TATAMOTORS EQ 13:33:38 754.55 9
3843037 TATAMOTORS EQ 13:33:38 755 1
3843050 TATAMOTORS EQ 13:33:38 754.9 1
3843051 TATAMOTORS EQ 13:33:38 754.9 9
3843052 TATAMOTORS EQ 13:33:38 754.9 1
3843069 TATAMOTORS EQ 13:33:39 755 1

More details of the data are available at –
https://docs.google.com/document/d/1pW0Fou2VzSiacEn0OKR5rKeKBRZTzw5USk7HOBVQRjs/edit
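
To see how much the price moves within a few seconds, the trade rows can be summarised per time stamp. The sketch below parses the nine sample trades shown above (13:33:37 to 13:33:39); the column names are my own, based on the field list given earlier.

```r
trade_lines <- c(
  "3843007 TATAMOTORS EQ 13:33:37 755 5",
  "3843008 TATAMOTORS EQ 13:33:37 755 453",
  "3843021 TATAMOTORS EQ 13:33:38 754.9 1",
  "3843022 TATAMOTORS EQ 13:33:38 754.55 9",
  "3843037 TATAMOTORS EQ 13:33:38 755 1",
  "3843050 TATAMOTORS EQ 13:33:38 754.9 1",
  "3843051 TATAMOTORS EQ 13:33:38 754.9 9",
  "3843052 TATAMOTORS EQ 13:33:38 754.9 1",
  "3843069 TATAMOTORS EQ 13:33:39 755 1"
)

trade <- read.table(
  text = trade_lines, stringsAsFactors = FALSE,
  col.names = c("TradeNumber", "Company", "TradeType", "Time", "Price", "Volume")
)

# Lowest and highest traded price within each second
low  <- tapply(trade$Price, trade$Time, min)
high <- tapply(trade$Price, trade$Time, max)
per_second <- data.frame(Time = names(low),
                         Low  = as.numeric(low),
                         High = as.numeric(high))

print(per_second)
```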
R Program

Data Location
We need to set the working directory in R (for example with setwd()).
R looks for all the files in that directory.




Packages Required
chron
zoo
fda
MASS
proto
DBI
RSQLite
RSQLite.extfuns
stats4
sde
tcltk
sqldf
Program Understanding

Reading File

In this program I first read the different files using a for loop.
1) Reading Trade Data
         File name = Company Name_ + Day + Month + Year + .txt (e.g. TATAMOTORS_01Feb08.txt)
2) Reading Snapshot Data
         File name = Company Name_ + Day + Month + Year + _Hour + .txt (e.g. TATAMOTORS_01Feb08_11.txt)
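
The file names can be built with paste0, which is vectorised over the days. This is a small sketch of the naming pattern only: it uses the first three working days as an example, and the check with file.exists is an addition of mine to catch missing files before the main loop runs.

```r
company    <- "TATAMOTORS"
days       <- c("01", "04", "06")   # first three working days, as an example
month_year <- "Feb08"
hours      <- c(11, 12, 13, 14)     # snapshot hours

# One trade file per day
trade_files <- paste0(company, "_", days, month_year, ".txt")

# One snapshot file per day and hour (rows = days, columns = hours)
snapshot_files <- outer(days, hours, function(d, h) {
  paste0(company, "_", d, month_year, "_", h, ".txt")
})

# List any trade files that are not in the working directory
missing_files <- trade_files[!file.exists(trade_files)]

print(trade_files)
print(snapshot_files[1, ])
```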


The data is read using SQL queries, for which several packages are required.
Since there is a lot of data, we have to decide which data to read. First I found the average price in
the 1st minute of each hour.
E.g. for 10 Hr, the prices of all trades whose time stamp starts with 10:00 are taken into
consideration, whatever the seconds.
Then I found the gain or loss for every hour and used these hourly returns to find the volatility
for the day.

Then for each hour I found the bid-ask spread and averaged it over the day.
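
The selection of first-minute trades can also be sketched in base R, without SQL. The trades below are hypothetical, and first_minute_avg is a helper name introduced only for this example; it plays the role of a query with a LIKE '10:00%' condition on the time stamp.

```r
# Hypothetical trades (times and prices invented for illustration)
trade <- data.frame(
  Time  = c("10:00:03", "10:00:41", "10:07:15",
            "11:00:02", "11:00:30", "11:25:00"),
  Price = c(708.0, 709.0, 711.5, 712.0, 713.0, 715.0),
  stringsAsFactors = FALSE
)

# Average price of the trades whose time stamp starts with "HH:00"
first_minute_avg <- function(trade, hour) {
  prefix <- sprintf("%02d:00", hour)
  in_first_minute <- startsWith(trade$Time, prefix)
  mean(trade$Price[in_first_minute])
}

hours <- 10:11
hourly_price <- vapply(hours, function(h) first_minute_avg(trade, h), numeric(1))

print(hourly_price)
```

If an hour has no trade in its first minute, mean() returns NaN, so such hours would need separate handling before returns are computed.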

For the linear model I used the February data of daily volatility and bid-ask spread.
The program has comments to help in understanding it.

Almost the same program was run for the exponential and logarithmic relations; only the last 6 lines
of code change. That code is given in the respective cases below.

Since I am a new user of R, the program is not very efficient, but the code runs correctly.
Code




library(sqldf)
# sqldf lets us run SQL queries on R data frames

Name <- 'TATAMOTORS_'
# Name of the company

MY <- 'Feb08'
# Month and year

Day <- c('01','04','06','07','08','11','12','13','14','15','18','19','20','21','22')
# Working days of February; hardcoded as of now

# Pre-allocate PSpread, Dailystdev and BD as numeric vectors
PSpread <- numeric(length(Day))
Dailystdev <- numeric(length(Day))
BD <- numeric(4)

# Parts of the snapshot file names
Company <- 'TATAMOTORS'
Month <- 'Feb'
Year <- "08"
Time <- c(11,12,13,14)
h <- "_"

for (i in seq_along(Day))
{
  # Reading trade data
  TFile <- paste(Name, Day[i], MY, ".txt", sep = "")
  Trade <- read.table(TFile)

  # SQL queries to find the average price of all trades in the first minute of each hour
  Hr10Price <- sqldf("select avg(V5) from Trade where V4 like '10:00%'")
  Hr11Price <- sqldf("select avg(V5) from Trade where V4 like '11:00%'")
  Hr12Price <- sqldf("select avg(V5) from Trade where V4 like '12:00%'")
  Hr13Price <- sqldf("select avg(V5) from Trade where V4 like '13:00%'")
  Hr14Price <- sqldf("select avg(V5) from Trade where V4 like '14:00%'")
  Hr15Price <- sqldf("select avg(V5) from Trade where V4 like '15:00%'")

  # Return for each hour relative to the previous hour
  R1 <- (Hr11Price[1,1] - Hr10Price[1,1])/Hr10Price[1,1]
  R2 <- (Hr12Price[1,1] - Hr11Price[1,1])/Hr11Price[1,1]
  R3 <- (Hr13Price[1,1] - Hr12Price[1,1])/Hr12Price[1,1]
  R4 <- (Hr14Price[1,1] - Hr13Price[1,1])/Hr13Price[1,1]
  R5 <- (Hr15Price[1,1] - Hr14Price[1,1])/Hr14Price[1,1]
  R <- c(R1,R2,R3,R4,R5)

  # Dailystdev holds the standard deviation of the hourly returns of each day
  Dailystdev[i] <- sd(R, na.rm = FALSE)

  # Reading snapshot data
  for (j in seq_along(Time))
  {
    File <- paste(Company, h, Day[i], Month, Year, h, Time[j], ".txt", sep = "")
    X <- read.table(File)

    # SQL queries find the best bid and ask of the hour
    MaxBuyP <- sqldf("select max(V5) from X where V10 = 'nnn' and V7 = 'B' ")
    MinSellP <- sqldf("select min(V5) from X where V10 = 'nnn' and V7 = 'S' ")

    # Extract the single value from each query result
    MinSell <- MinSellP[1,1]
    MaxBuy <- MaxBuyP[1,1]

    BidAsk <- MinSell - MaxBuy

    # Spread as a proportion of the mid price
    BD[j] <- BidAsk/((MaxBuy+MinSell)/2)
  }

  # Average spread over the four snapshots of the day
  PSpread[i] <- mean(BD)
}

PSpread
Dailystdev

# DF is the data frame for modelling
DF <- data.frame(PSpread, Dailystdev)
Result <- lm(PSpread ~ Dailystdev, DF)
Result

summary(Result)

# END
Analysis

The analysis shows that the intercept on the Y axis is significant but the coefficient of volatility
is not significant.

The adjusted R-squared also shows that the model does not fit.

The F statistic also has a very high p-value, which gives an overall indication that the bid-ask
spread does not have any linear relation with the daily volatility of prices.
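
The quantities discussed here can be read directly from the object returned by summary() on a fitted lm model. The sketch below shows where each one lives. It uses synthetic data generated only so that the example runs, so its numbers say nothing about the Tata Motors results.

```r
set.seed(42)

# Synthetic data, generated only for illustration (not the paper's data)
DF <- data.frame(Dailystdev = runif(15, 0.001, 0.01))
DF$PSpread <- 0.002 + rnorm(15, sd = 0.0005)

fit <- lm(PSpread ~ Dailystdev, data = DF)
s   <- summary(fit)

coef_table <- s$coefficients   # Estimate, Std. Error, t value, Pr(>|t|)
slope_p    <- coef_table["Dailystdev", "Pr(>|t|)"]
adj_r2     <- s$adj.r.squared

# p-value of the overall F test, from the F statistic and its degrees of freedom
f   <- s$fstatistic            # named vector: value, numdf, dendf
f_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

print(coef_table)
print(c(adj_r2 = adj_r2, f_p = unname(f_p)))
```

With a single explanatory variable, the p-value of the F test equals the p-value of the slope, so the two checks carry the same information.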




So I changed the hypothesis to one of the following cases.

Bid ask spread is exponentially related to the volatility

or

Bid ask spread is logarithmically related to the volatility
Exponential Case

Dailystdevexp <- exp(Dailystdev)
DFexp <- data.frame(PSpread, Dailystdevexp)
Resultexp <- lm(PSpread ~ Dailystdevexp, DFexp)
summary(Resultexp)


Coefficients:

                Estimate    Std. Error   t value   Pr(>|t|)

(Intercept)     -0.04463    0.08037      -0.555    0.588

Dailystdevexp    0.04582    0.07970       0.575    0.575
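
A side note, not part of the original analysis: for numbers as small as hourly return volatilities, exp(x) is almost exactly 1 + x, so a regression on exp(Dailystdev) can be expected to behave much like the linear model. The sketch below checks this on hypothetical volatility values, assumed here to be of the order of 0.002 to 0.01.

```r
# Hypothetical daily volatilities (assumed magnitudes, not the paper's data)
x <- c(0.002, 0.004, 0.006, 0.008, 0.010)

# Largest gap between exp(x) and its linear approximation 1 + x
max_gap <- max(abs(exp(x) - (1 + x)))

# Correlation between the raw and the exponentiated regressor
r <- cor(x, exp(x))

print(max_gap)
print(r)
```

When the two regressors are this close to being linearly related, the exponential case mostly re-expresses the linear fit with a shifted intercept.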
Log Case

# log() gives the natural logarithm by default
Dailystdevln <- log(Dailystdev)
DFln <- data.frame(PSpread, Dailystdevln)
Resultln <- lm(PSpread ~ Dailystdevln, DFln)
summary(Resultln)

Coefficients:

                Estimate     Std. Error   t value   Pr(>|t|)

(Intercept)     0.0021993    0.0027177    0.809     0.433

Dailystdevln    0.0001246    0.0005400    0.231     0.821

We see that neither the exponential nor the logarithmic model fits the data.
Clustering Using WEKA

Clustering helps to group data instances. It is especially useful for marketers, who can
identify patterns in data and segment their customers.


The Dataset
The data used here is taken from the CD accompanying the book on marketing research by Naresh
Malhotra. The data can be downloaded from the following link - bit.ly/HVwPEP
The example illustrates the use of a clustering method to segment customers based on their
attitudes towards shopping. Customers were asked to express their degree of agreement
with the following statements on a 7-point scale

V1 - Shopping is fun

V2 - Shopping is bad for your budget

V3 - I combine shopping with eating out

V4 - I try to get the best buys when shopping

V5 - I don’t care about shopping

V6 - You can save a lot of money by comparing prices



Clustering Procedure

Load the data using the Open file option in Weka. You will get the window shown in
figure 1.

Click on the Cluster tab. Then click Choose and select SimpleKMeans. You will get the
window shown in figure 2. By default, the number of clusters created is 2. To change
the number of clusters, click on SimpleKMeans. You will get the window shown
in figure 3. In the numClusters field, specify the number of clusters to be created. For this
example, the number of clusters is 3. Click on Start. You will get the output shown in
figure 4.
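
For comparison, the same kind of segmentation can be sketched in R with the base function kmeans(). This is a different implementation from WEKA's SimpleKMeans and may not give identical clusters or cluster numbering. The nine respondents below are hypothetical answers invented for illustration; they are not the Malhotra dataset.

```r
# Hypothetical answers on the 7-point scale (invented for illustration)
responses <- data.frame(
  V1 = c(6, 7, 6, 2, 1, 2, 4, 3, 4),
  V2 = c(3, 2, 3, 3, 4, 3, 6, 7, 6),
  V3 = c(6, 7, 6, 1, 2, 1, 3, 4, 3),
  V4 = c(4, 3, 4, 4, 3, 4, 7, 6, 7),
  V5 = c(1, 2, 1, 6, 7, 6, 3, 2, 3),
  V6 = c(3, 4, 3, 3, 4, 3, 7, 6, 7)
)

set.seed(1)
# k-means with 3 clusters; nstart repeats the random start and keeps the best run
fit <- kmeans(responses, centers = 3, nstart = 25)

print(fit$size)               # number of respondents in each cluster
print(round(fit$centers, 1))  # mean answer per variable in each cluster
print(fit$cluster)            # cluster assigned to each respondent
```

The cluster centres play the same role as the centroid table in the WEKA output: a cluster is described by the variables on which its centre is high or low.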
Interpreting The Results

Each cluster tells us a type of behaviour in our customers, from which we can begin to draw
some conclusions:
    Cluster 0 - High values on V2, V4 and V6. They can be called economical
       shoppers
    Cluster 1 - High values on V1 and V3 and low values on V5. They could
       be labelled fun-loving and concerned shoppers
    Cluster 2 - Opposite of cluster 1. They can be termed apathetic shoppers

To inspect the clusters visually, right-click on the Result list section. One of the options
in the pop-up menu is Visualize cluster assignments. A window will pop up that
lets you play with the results and see them visually (see figure 5).




Figure 1 The window after loading the dataset
Figure 2 Window after choosing the SimpleKMeans procedure
Figure 3 Changing the number of clusters
Figure 4 The result of cluster analysis
Figure 5 Visually viewing the cluster

Data Mining Techniques Using R and WEKA

  • 1. Data Mining Techniques Using R and WEKA IT for Business Intelligence Term paper Utsav Mone (10BM60094) This Term paper explained two Techniques - 1) Linear Modelling using R 2) Clustering using WEKA
  • 2. Linear Modelling using R Here I have tried to analyse the relation of bid ask spread of the company with the vitality of the prices. I have used three different hypotheses to fit the model. First I tried to see linear modelling then tried to fit logarithmic and exponential relation of volatility. We have data from where bid ask spread can be calculated at different hours. Through trade data I calculated daily price volatility of the stock and tried to see relation between them. Bid Ask Spread A Measure of liquidity • The amount by which the ask price exceeds the bid. This is essentially the difference in price between the highest price that a buyer is willing to pay for an asset and the lowest price for which a seller is willing to sell it • Ask - The price a seller is willing to accept for a security, also known as the offer price. Along with the price, the ask quote will generally also stipulate the amount of the security Bid - An offer made by an investor, a trader or a dealer to buy a security. The bid will stipulate both the price at which the buyer is willing to purchase the security and the quantity of the security Factors effecting Bid Ask Spread 1) Volatility (With more volatility the spread is high) Standard deviation Variance between returns from that same security or market index 2) Volumes (More volumes reduce the spread) Absolute number of shares under transaction Percentage of free floating shares Number of orders 3) Others Tick size Price of Share All above measures can re classified again in two categories (except Others) Executed Requested In Our case I have used only identifying the relation of volatility and Bid Ask Spread. I have used daily volatility of share prices of Tata Motors for the month of Feb 2008. Daily volatility of the share prices are calculated on the basis of hourly data instead of regular way of finding closing prices of each day. This is required since we want to see changes in daily volatility.
  • 3. Data
    Dr Devlina Chatterjee of VGSoM purchased a large amount of data from NSE for her research. I have used a few files from her data. There are three types of files: 1) Snapshots 2) Trade Data 3) Price Volume Data
    Price Volume Data
    I have used the February 2008 share data of Tata Motors. Apart from the trade data, the rest of the data is available in the public domain. The file contains the following fields: i) Symbol, ii) Series, iii) Date, iv) Prev Close, v) Open Price, vi) High Price, vii) Low Price, viii) Last Price, ix) Close Price, x) Average Price, xi) Total Traded Quantity, xii) Turnover in Lacs.
    This text file is available at this link: http://bit.ly/TM_PVD
    TATAMOTORS,EQ,03-Dec-2007,732.45,736,749,733.35,737,736.15,741,481721,3569.5399
    TATAMOTORS,EQ,04-Dec-2007,736.15,737,746,728.35,746,741.3,738.2,631272,4660.0808995,
    TATAMOTORS,EQ,05-Dec-2007,741.3,744,783.9,744,773,772.4,769.92,1410714,10861.311993,
    TATAMOTORS,EQ,06-Dec-2007,772.4,775.5,782,763.25,778,775.45,774.13,807793,6253.379844,
    TATAMOTORS,EQ,10-Dec-2007,767.3,772,777.7,745.05,775,766.45,757.78,521361,3950.7440285,
    TATAMOTORS,EQ,11-Dec-2007,766.45,770,777.3,761,777.3,775.2,770.04,676097,5206.1990345,
    TATAMOTORS,EQ,12-Dec-2007,775.2,776.9,780,762,769,770.05,768.88,665743,5118.7625105,
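The price volume rows can be loaded into a data frame along the following lines. This is only a sketch: the column names are my own shorthand for the field list, the two rows are copied from the sample, and the trailing comma seen on some rows is stripped first because it would otherwise be parsed as an extra empty column.

```r
# Two rows copied from the sample; some rows end with a trailing comma
raw <- c(
  "TATAMOTORS,EQ,03-Dec-2007,732.45,736,749,733.35,737,736.15,741,481721,3569.5399",
  "TATAMOTORS,EQ,04-Dec-2007,736.15,737,746,728.35,746,741.3,738.2,631272,4660.0808995,"
)
raw <- sub(",$", "", raw)   # drop the trailing comma where present

# Hypothetical short column names for the twelve fields
cols <- c("Symbol", "Series", "Date", "PrevClose", "Open", "High", "Low",
          "Last", "Close", "Average", "TotalTradedQty", "TurnoverLacs")

pvd <- read.csv(text = raw, header = FALSE, col.names = cols,
                stringsAsFactors = FALSE)

# Date parsing with %b assumes an English locale for month abbreviations
pvd$Date <- as.Date(pvd$Date, format = "%d-%b-%Y")

print(pvd[, c("Date", "Close", "TotalTradedQty")])
```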
  • 4. Snapshots Data
    This type of data contains a snapshot of the order book at four hours of the day: 11 Hr, 12 Hr, 13 Hr and 14 Hr. We have snapshot data of Tata Motors for different months and hours of the day. Because there are too many files, they have not been uploaded; a sample is shown instead.
    Fields in the snapshot data: 1) Order Number 2) Company 3) Trade Type 4) No of shares in Order 5) Quote 6) Time Stamp 7) Buy/Sell 8) Flags
    A sample of the snapshot data of Tata Motors for 1 Feb 2008, 11 Hr:
    2008020150046719 TATAMOTORS EQ 500 559.60 09:55:48 B ynnn nnn nnn RL 0
    2008020150716321 TATAMOTORS EQ 10 560.00 10:35:56 B ynnn nnn nnn RL 0
    2008020150034116 TATAMOTORS EQ 100 575.00 09:55:22 B ynnn nnn nnn RL 0
    2008020150067971 TATAMOTORS EQ 824 576.65 09:56:38 B ynnn nny nnn RL 0
    2008020100283272 TATAMOTORS EQ 100 582.00 10:09:10 B ynnn nnn nnn RL 0
    2008020150233325 TATAMOTORS EQ 25000 585.00 10:04:34 B ynnn nny nnn RL 0
    Details of the flags can be seen at: https://docs.google.com/document/d/1pW0Fou2VzSiacEn0OKR5rKeKBRZTzw5USk7HOBVQRjs/edit
  • 5. Trade Data
    This is daily trade data, which lists all the trades that took place in a day.
    Fields in the trade data: 1) Trade Number 2) Name of Company 3) Type of Trade 4) Time of Trading 5) Price 6) Volume of shares traded
    Opening trades (price = 708):
    2475593 TATAMOTORS EQ 09:55:16 708 3713
    2475830 TATAMOTORS EQ 09:55:20 708 800
    2475871 TATAMOTORS EQ 09:55:21 708 200
    2475872 TATAMOTORS EQ 09:55:21 708 1
    2475873 TATAMOTORS EQ 09:55:21 708 1
    2475874 TATAMOTORS EQ 09:55:21 708 210
    2475935 TATAMOTORS EQ 09:55:22 708 800
    Note how the price varies within 3 seconds, moving away from 755 and back to 755:
    3843007 TATAMOTORS EQ 13:33:37 755 5
    3843008 TATAMOTORS EQ 13:33:37 755 453
    3843021 TATAMOTORS EQ 13:33:38 754.9 1
    3843022 TATAMOTORS EQ 13:33:38 754.55 9
    3843037 TATAMOTORS EQ 13:33:38 755 1
    3843050 TATAMOTORS EQ 13:33:38 754.9 1
    3843051 TATAMOTORS EQ 13:33:38 754.9 9
    3843052 TATAMOTORS EQ 13:33:38 754.9 1
    3843069 TATAMOTORS EQ 13:33:39 755 1
    More detail on the data is available at: https://docs.google.com/document/d/1pW0Fou2VzSiacEn0OKR5rKeKBRZTzw5USk7HOBVQRjs/edit
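To illustrate how an average price for a given minute can be pulled out of trade data of this shape, here is a small base-R sketch. It uses a handful of rows copied from the sample; the column names are hypothetical labels for the six fields, and this is not the author's program, which uses SQL queries instead.

```r
# A few rows copied from the trade data sample
trade_lines <- c(
  "2475593 TATAMOTORS EQ 09:55:16 708 3713",
  "3843007 TATAMOTORS EQ 13:33:37 755 5",
  "3843008 TATAMOTORS EQ 13:33:37 755 453",
  "3843021 TATAMOTORS EQ 13:33:38 754.9 1",
  "3843022 TATAMOTORS EQ 13:33:38 754.55 9",
  "3843069 TATAMOTORS EQ 13:33:39 755 1"
)

# Hypothetical column names for the six whitespace-separated fields
trade <- read.table(text = trade_lines, header = FALSE,
                    col.names = c("TradeNo", "Company", "Series",
                                  "Time", "Price", "Volume"),
                    stringsAsFactors = FALSE)

# Select every trade whose time stamp starts with the minute of interest
in_minute <- startsWith(trade$Time, "13:33")
avg_price <- mean(trade$Price[in_minute])

print(avg_price)
```

The prefix match on the time stamp plays the same role as a SQL `like '13:33%'` condition.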
  • 6. R Program
    Data Location: We need to set the working directory in R. R looks for all the files in the assigned directory.
    Package Requirements: chron, zoo, fda, MASS, proto, DBI, RSQLite, RSQLite.extfuns, stats4, sde, tcltk, sqldf
  • 7. Program Understanding
    Reading Files: The program first reads the different files using a for loop.
    1) Reading Trade Data: the file name is made up of Company Name_DayMonthYear.txt
    2) Reading Snapshot Data: the file name is made up of Company Name_DayMonthYear_Hour.txt
    The data is read using SQL queries, for which several packages are required.
    Since there is a lot of data, we have to decide which of it to read. First I found the average price at the start of each hour. E.g. for 10:00, the prices of all trades whose time stamp begins with 10:00 are taken into consideration, whatever the seconds. Then I found the gain or loss every hour to obtain the volatility of the trades. Then for each hour I found the bid-ask spread and averaged it over the day. For linear modelling I used the February data of daily volatility and bid-ask spread.
    The program has comments to help in understanding it. Almost the same program was run for the exponential and logarithmic relations, with a small change in the last 6 lines of code. That code is given in the cases explained later. Since I am a new user of R, the program is not very efficient, but it runs.
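The file naming scheme described here can be sketched with vectorised string pasting. The company name, the `Feb08` suffix and the hours follow the program in this paper; the three days are just a short illustrative subset, and the variable names are hypothetical.

```r
company    <- "TATAMOTORS"
month_year <- "Feb08"
days       <- c("01", "04", "06")   # illustrative subset of trading days
hours      <- c(11, 12, 13, 14)     # snapshot hours

# One trade file per day, e.g. TATAMOTORS_01Feb08.txt
trade_files <- paste0(company, "_", days, month_year, ".txt")

# One snapshot file per day and hour, e.g. TATAMOTORS_01Feb08_11.txt
# expand.grid varies its first argument fastest, so hours cycle within each day
grid <- expand.grid(hour = hours, day = days, stringsAsFactors = FALSE)
snapshot_files <- paste0(company, "_", grid$day, month_year, "_", grid$hour, ".txt")

print(trade_files)
print(head(snapshot_files, 4))
```

Building the names up front makes it easy to check with `file.exists()` that every expected file is present before any data is read.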
  • 8. Code
    library(sqldf)           # needed for the SQL queries used below
    Name <- 'TATAMOTORS_'    # name of the company
    MY <- 'Feb08'            # month and year
    # Working days of February; hardcoded as of now
    Day <- c('01','04','06','07','08','11','12','13','14','15','18','19','20','21','22')
    # Define PSpread and Dailystdev as numeric vectors, one entry per day
    PSpread <- numeric(length(Day))
    Dailystdev <- PSpread
    # BD holds the proportional spread for each of the 4 snapshot hours
    BD <- numeric(4)
  • 9. # Reading trade data
    for (i in 1:15) {
      TFile <- paste(Name, Day[i], MY, ".txt", sep = "")
      Trade <- read.table(TFile)
      summary(Trade)   # not printed inside a loop unless wrapped in print()
      # SQL queries to find the average price of all trades in the first minute of each hour
      Hr10Price <- sqldf("select avg(V5) from Trade where V4 like '10:00%'")
      Hr11Price <- sqldf("select avg(V5) from Trade where V4 like '11:00%'")
      Hr12Price <- sqldf("select avg(V5) from Trade where V4 like '12:00%'")
      Hr13Price <- sqldf("select avg(V5) from Trade where V4 like '13:00%'")
      Hr14Price <- sqldf("select avg(V5) from Trade where V4 like '14:00%'")
      Hr15Price <- sqldf("select avg(V5) from Trade where V4 like '15:00%'")
      # Hourly relative price changes. The sign is reversed relative to the usual
      # return convention, which does not affect the standard deviation.
      R1 <- (Hr10Price[1,1] - Hr11Price[1,1]) / Hr10Price[1,1]
      R2 <- (Hr11Price[1,1] - Hr12Price[1,1]) / Hr11Price[1,1]
      R3 <- (Hr12Price[1,1] - Hr13Price[1,1]) / Hr12Price[1,1]
      R4 <- (Hr13Price[1,1] - Hr14Price[1,1]) / Hr13Price[1,1]
      R5 <- (Hr14Price[1,1] - Hr15Price[1,1]) / Hr14Price[1,1]
      R <- c(R1, R2, R3, R4, R5)
      # Dailystdev holds the standard deviation of each day's hourly returns
      Dailystdev[i] <- sd(R, na.rm = FALSE)
      # Code below is for reading snapshot data
      Company <- 'TATAMOTORS'
      Month <- 'Feb'
      Year <- "08"
      Time <- c(11, 12, 13, 14)
      h <- "_"
  • 10.   for (j in 1:4) {
        File <- paste(Company, h, Day[i], Month, Year, h, Time[j], ".txt", sep = "")
        X <- read.table(File)
        # SQL queries find the best bid and best ask of the hour
        MaxBuyP <- sqldf("select max(V5) from X where V10 = 'nnn' and V7 = 'B' ")
        MinSellP <- sqldf("select min(V5) from X where V10 = 'nnn' and V7 = 'S' ")
        # Extract the single value from each one-cell query result
        MinSell <- MinSellP[1,1]
        MaxBuy <- MaxBuyP[1,1]
        # Proportional spread: absolute spread divided by the quote midpoint
        BidAsk <- MinSell - MaxBuy
        BD[j] <- BidAsk / ((MaxBuy + MinSell) / 2)
      }
      PSpread[i] <- mean(BD)
    }
    PSpread
    Dailystdev
    # DF is the data frame for modelling
    DF <- data.frame(PSpread, Dailystdev)
    Result <- lm(PSpread ~ Dailystdev, DF)
    Result
    summary(Result)
    # END
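The same best bid, best ask and proportional spread can also be computed without SQL. The sketch below is an alternative in base R, not the author's method. The buy orders are copied from the snapshot sample; the two sell orders are invented for this illustration, since the sample shows only buy orders. The function name `prop_spread` is hypothetical.

```r
snap_lines <- c(
  # Buy orders copied from the snapshot sample
  "2008020150046719 TATAMOTORS EQ 500 559.60 09:55:48 B ynnn nnn nnn RL 0",
  "2008020100283272 TATAMOTORS EQ 100 582.00 10:09:10 B ynnn nnn nnn RL 0",
  "2008020150233325 TATAMOTORS EQ 25000 585.00 10:04:34 B ynnn nny nnn RL 0",
  # Hypothetical sell orders, invented for this illustration
  "2008020150999991 TATAMOTORS EQ 200 586.00 10:20:00 S ynnn nnn nnn RL 0",
  "2008020150999992 TATAMOTORS EQ 50 590.00 10:30:00 S ynnn nnn nnn RL 0"
)
snap <- read.table(text = snap_lines, header = FALSE, stringsAsFactors = FALSE)

# V5 = quote, V7 = buy/sell indicator, V10 = flag group filtered on 'nnn'
prop_spread <- function(snap) {
  ok <- snap$V10 == "nnn"
  best_bid <- max(snap$V5[ok & snap$V7 == "B"])
  best_ask <- min(snap$V5[ok & snap$V7 == "S"])
  (best_ask - best_bid) / ((best_ask + best_bid) / 2)
}

print(prop_spread(snap))
```

Wrapping the calculation in a function lets it be applied to each of the four hourly snapshots with `sapply()` and then averaged for the day.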
  • 11. Analysis
    The analysis shows that the intercept is significant but the coefficient of volatility is not. The adjusted R-squared also shows that the model does not fit. The F statistic has a very high p-value, which gives an overall indication that the bid-ask spread does not have a linear relation with the daily volatility of prices. So I changed the hypothesis to the following cases: the bid-ask spread is exponentially related to volatility, or the bid-ask spread is logarithmically related to volatility.
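The quantities referred to in this analysis (coefficient p-values, adjusted R-squared and the p-value of the F statistic) can be extracted from a fitted model programmatically. The sketch below uses synthetic data in which the spread is unrelated to volatility by construction; it does not reproduce the Tata Motors results, and the variable names are hypothetical.

```r
set.seed(1)
# Synthetic data: 15 "days", spread generated independently of volatility
vol    <- runif(15, 0.002, 0.02)
spread <- 0.002 + rnorm(15, sd = 0.0005)

fit <- lm(spread ~ vol)
s   <- summary(fit)

coefs   <- coef(s)                    # Estimate, Std. Error, t value, Pr(>|t|)
slope_p <- coefs["vol", "Pr(>|t|)"]   # p-value of the volatility coefficient
adj_r2  <- s$adj.r.squared            # adjusted R-squared

# p-value of the overall F test, computed from the stored F statistic
f   <- s$fstatistic
f_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

print(coefs)
print(c(adj_r2 = adj_r2, f_p = unname(f_p)))
```

With a single regressor the F test and the t test on the slope are equivalent, so `f_p` and `slope_p` coincide.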
  • 12. Exponential Case
    Dailystdevexp <- exp(Dailystdev)
    DFexp <- data.frame(PSpread, Dailystdevexp)
    Resultexp <- lm(PSpread ~ Dailystdevexp, DFexp)
    summary(Resultexp)
    Coefficients:
                    Estimate   Std. Error   t value   Pr(>|t|)
    (Intercept)     -0.04463   0.08037      -0.555    0.588
    Dailystdevexp    0.04582   0.07970       0.575    0.575
  • 13. Log Case
    Dailystdevln <- log(Dailystdev, base = exp(1))
    DFln <- data.frame(PSpread, Dailystdevln)
    Resultln <- lm(PSpread ~ Dailystdevln, DFln)
    summary(Resultln)
    Coefficients:
                    Estimate    Std. Error   t value   Pr(>|t|)
    (Intercept)     0.0021993   0.0027177    0.809     0.433
    Dailystdevln    0.0001246   0.0005400    0.231     0.821
    We see that neither the exponential nor the logarithmic model fits.
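The three functional forms can be compared side by side in a few lines. The sketch below again uses synthetic data, not the Tata Motors series, and its variable names are hypothetical. It also prints the correlation between the volatility and its exponential: for values as small as hourly-return standard deviations, exp(x) is approximately 1 + x, so the exponential regressor is close to a shifted copy of the linear one and the two fits can be expected to behave alike.

```r
set.seed(42)
# Synthetic data: spread generated independently of volatility
vol    <- runif(15, 0.002, 0.02)
spread <- 0.002 + rnorm(15, sd = 0.0005)

# Candidate transformations of the regressor (log requires vol > 0)
forms <- list(linear      = vol,
              exponential = exp(vol),
              logarithmic = log(vol))

# Adjusted R-squared of a simple regression for each form
adj_r2 <- sapply(forms, function(x) summary(lm(spread ~ x))$adj.r.squared)

print(adj_r2)
print(cor(vol, exp(vol)))   # very close to 1 for small volatilities
```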
  • 14. Clustering Using WEKA
    Clustering groups similar data instances together. It especially helps marketers to identify patterns in data and segment their customers.
    The Dataset
    The data used here is taken from the CD accompanying the book on marketing research by Naresh Malhotra. The data can be downloaded from the following link: bit.ly/HVwPEP
    The example illustrates the use of clustering to segment customers based on their attitudes towards shopping. Customers were asked to express their degree of agreement with the following statements on a 7-point scale:
    V1 - Shopping is fun
    V2 - Shopping is bad for your budget
    V3 - I combine shopping with eating out
    V4 - I try to get the best buys when shopping
    V5 - I don't care about shopping
    V6 - You can save a lot of money by comparing prices
    Clustering Procedure
    Load the data using the Open file option in WEKA. You will get the window shown in figure 1. Click on the Cluster tab. Then click Choose and select SimpleKMeans. You will get the window shown in figure 2. By default the number of clusters created is 2. To change the number of clusters, click on SimpleKMeans. You will get the window shown in figure 3. In the numClusters field, specify the number of clusters to be created. For this example the number of clusters is 3. Click on Start. You will get the output shown in figure 4.
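For readers who prefer to stay in R, a comparable three-cluster k-means run can be sketched with the built-in `kmeans()` function. This is an analogue of the WEKA procedure, not a reproduction of it: the responses below are randomly generated stand-ins for the survey data, and differences in initialisation and algorithm variant mean the assignments need not match WEKA's output.

```r
set.seed(123)
# Hypothetical 7-point responses of 30 respondents to statements V1..V6
survey <- as.data.frame(matrix(sample(1:7, 30 * 6, replace = TRUE), ncol = 6))
names(survey) <- paste0("V", 1:6)

# k-means with 3 clusters; nstart repeats the run from several random starts
km <- kmeans(survey, centers = 3, nstart = 10)

print(km$size)               # number of respondents in each cluster
print(round(km$centers, 2))  # cluster centroids: mean response per statement
```

The centroid table is what one reads to label the segments: a cluster with high means on V2, V4 and V6, for example, would correspond to price-conscious shoppers.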
  • 15. Interpreting The Results
    Each cluster describes a type of behaviour among our customers, from which we can begin to draw some conclusions:
    • Cluster 0: High values on V2, V4 and V6. These can be called economical shoppers.
    • Cluster 1: High values on V1 and V3 and low values on V5. These could be labelled fun-loving and concerned shoppers.
    • Cluster 2: Opposite of cluster 1. These can be termed apathetic shoppers.
    To inspect the clusters visually, right-click on the Result List section. One of the options in the pop-up menu is Visualize Cluster Assignments. A window will pop up that lets you explore the results visually (see figure 5).
    Figure 1 The window after loading the dataset
  • 16. Figure 2 Window after choosing the SimpleKMeans procedure
  • 17. Figure 3 Changing the number of clusters
  • 18. Figure 4 The result of cluster analysis
  • 19. Figure 5 Visually viewing the cluster