Data mining
‘REGRESSION: CPU Performance’




        Visualized data with WEKA
        COMPUTER ASSIGNMENT 1

        BARRY KOLLEE

        10349863
Regression | CPU performance
1. Do you think that ERP should be at least partially predictable from the input attributes?

Not in all cases. It is only possible if there is correlation between ERP and the input attribute we
compare it with. When two attributes correlate with each other, we can predict values of one, at
least partially, from the other.

2. Do any attributes exhibit significant correlations?

I loaded the supplied data file into WEKA. Visualising the data as a scatter-plot matrix (which shows
the relationship between every pair of attributes) gives the plots described below. For two attributes
to correlate, the points should follow a roughly linear pattern. The following attributes appear to
correlate with ERP: MYCT, MMIN and MMAX:

       •   Green MMAX: with MMAX plotted on the x-axis I see a pattern that increases slowly at
           first and rapidly afterwards. If we swap the x and y axes we see the opposite result: a
           rapid increase followed by a slow one.
       •   Blue MYCT: with MYCT plotted on the x-axis, the relationship between ERP and MYCT
           resembles a 1/x curve: it starts at a high value, drops steeply towards the zero point of
           the y-axis as x increases, and then flattens out, the slope no longer decreasing. Swapping
           the axes gives a similar pattern.
       •   Red MMIN: the pattern I see for MMIN is similar to that of MMAX.
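The visual check above can also be done numerically. A minimal sketch with numpy's correlation
coefficient; the MMAX and ERP values here are made-up stand-ins for illustration, since the actual
cpu.arff rows are not reproduced in this report:

```python
import numpy as np

# Hypothetical stand-in values; in practice these columns would be
# loaded from cpu.arff.
mmax = np.array([256, 512, 1000, 2000, 4000, 8000, 16000, 32000], dtype=float)
erp  = np.array([ 20,  30,   45,   70,  120,  200,   380,   700], dtype=float)

# Pearson correlation coefficient between MMAX and ERP; values near 1
# indicate the roughly linear pattern seen in the scatter plot.
r = np.corrcoef(mmax, erp)[0, 1]
print(f"corr(MMAX, ERP) = {r:.3f}")
```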





3. Now we have a feel for the data and we will try fitting a simple linear regression model to
the data. On the Classify tab, select Choose > functions > LinearRegression.

        •      Use the default options and click Start. This will use 10-fold cross-validation to fit the linear
               regression model. Examine the results:
        •      Record the Root relative squared error and the Relative absolute error. The Relative squared
               error is computed by dividing (normalizing) the sum of the squared prediction errors by the sum
               of the squared errors obtained by always predicting the mean. The Root relative squared
               error is obtained by taking the square root of the Relative squared error. The Relative absolute
               error is similar to the Relative squared error, but uses absolute values rather than squares.
               Therefore, if we have a relative error of 100%, the learned model is no better than this very
               dumb predictor.
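The two measures described above can be written down directly. A small numpy sketch (for
simplicity it uses the mean of the test values as the baseline, where WEKA's cross-validation uses
the mean of the training data):

```python
import numpy as np

def relative_absolute_error(y_true, y_pred):
    """Sum of |error|, normalised by the error of always predicting the mean."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    baseline = np.abs(y_true - y_true.mean()).sum()
    return np.abs(y_true - y_pred).sum() / baseline

def root_relative_squared_error(y_true, y_pred):
    """Square root of the squared errors, normalised the same way."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    baseline = ((y_true - y_true.mean()) ** 2).sum()
    return np.sqrt(((y_true - y_pred) ** 2).sum() / baseline)

# The 'very dumb predictor' (always the mean) scores exactly 100 % on both.
y = np.array([10.0, 20.0, 30.0, 40.0])
mean_pred = np.full_like(y, y.mean())
print(relative_absolute_error(y, mean_pred))      # 1.0, i.e. 100 %
print(root_relative_squared_error(y, mean_pred))  # 1.0
```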

Running the linear regression scheme with ERP as the class attribute produces the following
output; the ‘Root relative squared error’ appears near the bottom of the summary.



       Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
       Relation:     cpu-weka.filters.unsupervised.attribute.Remove-R1
       Instances:    209
       Attributes:   7
                     MYCT
                     MMIN
                     MMAX
                     CACH
                     CHMIN
                     CHMAX
                     ERP
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===
       Linear Regression Model

       ERP =

              0.0661    *   MYCT +
              0.0142    *   MMIN +
              0.0066    *   MMAX +
              0.4871    *   CACH +
              1.1868    *   CHMAX +
            -66.5968

       Time taken to build model: 0 seconds

       === Cross-validation ===
       === Summary ===

       Correlation coefficient                               0.928
       Mean absolute error                                   35.4878
       Root mean squared error                               57.5296
       Relative absolute error                               40.4842 %
       Root relative squared error                           37.1725 %
       Total Number of Instances                             209



The Root relative squared error looks fairly high. That is because all of the remaining attributes are
taken into account in the fit: WEKA kept five of the six input attributes (CHMIN was dropped). The
model still has the familiar form y = ax + b, except that the single slope a is replaced by one weight
per attribute, so the prediction is a weighted sum of the five attributes plus an intercept. With all
five attributes included we obtain a correlation coefficient of 0.928. Written out, the model is:


       ERP = (weighted sum of attributes) + b

       weighted sum = 0.0661 * MYCT + 0.0142 * MMIN + 0.0066 * MMAX + 0.4871 * CACH + 1.1868 * CHMAX

       b = -66.5968
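As a sanity check, the fitted model can be applied by hand. A minimal Python sketch using the
coefficients from the WEKA output above; the example machine's attribute values are made up
purely for illustration:

```python
# Weights reported by WEKA's LinearRegression; CHMIN was dropped by the
# attribute selection, so it effectively gets weight 0.
weights = {"MYCT": 0.0661, "MMIN": 0.0142, "MMAX": 0.0066,
           "CACH": 0.4871, "CHMIN": 0.0, "CHMAX": 1.1868}
intercept = -66.5968

def predict_erp(machine):
    """ERP = sum of (weight * attribute value) plus the intercept."""
    return sum(w * machine[name] for name, w in weights.items()) + intercept

# Hypothetical machine (illustrative attribute values only).
example = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
           "CACH": 16, "CHMIN": 4, "CHMAX": 24}
print(round(predict_erp(example), 2))  # -> 21.18
```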




I expect that we can build a better-fitting linear regression model by taking only the attributes into
account that correlate best with ERP, as identified in answer 2. Judging from those plots, MMIN and
MMAX correlate best, so I made another linear regression model using only these two attributes.
Its output is given below:

           === Run information ===

           Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
           Relation:     cpu-weka.filters.unsupervised.attribute.Remove-R1-
           weka.filters.unsupervised.attribute.Remove-R1,4-6
           Instances:    209
           Attributes:   3
                         MMIN
                         MMAX
                         ERP
           Test mode:10-fold cross-validation

           === Classifier model (full training set) ===


           Linear Regression Model

           ERP =

                 0.0128 * MMIN +
                 0.0087 * MMAX +
               -39.814

           Time taken to build model: 0 seconds

           === Cross-validation ===
           === Summary ===

           Correlation coefficient                      0.9022
           Mean absolute error                          39.8811
           Root mean squared error                      66.584
           Relative absolute error                      45.4961 %
           Root relative squared error                  43.023 %
            Total Number of Instances                    209
My assumption was actually wrong. When taking only MMIN and MMAX into account, the
correlation coefficient is lower and the errors are higher: the Mean absolute error, which gives the
average difference between the actual and predicted values over all test cases, has gone up, and the
Root relative squared error has increased by about 6 percentage points.

4. Did you expect such a performance given your earlier observations? Hint: We are fitting a
linear model.

Because we are fitting a linear model, we look for the attributes that correlate most linearly with
ERP. The good performance is clearly visible in the correlation coefficient: about 0.93, which is
close to 1, the best possible value.

However, the Root relative squared error is fairly high. I expected that taking only the attributes
that correlate best with ERP into account would give a better-fitting model, with a correlation
coefficient closer to 1 and an error rate closer to 0%. My observations with only MMIN and MMAX
show otherwise: individual errors apparently average out when more attributes are included, so
using more attributes seems to decrease the error rate.

On the other hand, I would have expected that including more attributes makes the model more
sensitive to errors in those attributes.




5. Above we deleted the vendor variable. However, we can use nominal attributes in
regression by converting them to numeric. The standard way of doing so is to replace the
nominal variable with a bunch of binary variables of the form "is_first_nominal_value",
"is_second_nominal_value" and so on. Reload the unmodified data file cpu.arff.
    • On the Preprocess tab select Choose > filters > unsupervised > attribute >
        NominalToBinary and click Apply. This replaces the vendor variable with 30 binary
        variables and we now have 37 attributes (we started with 8).
        Now train a linear regression model as in (4) and examine the results.
    • Record the Relative absolute error and the Root relative squared error
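What the NominalToBinary filter does can be sketched in a few lines: each distinct nominal value
becomes its own 0/1 indicator attribute. A minimal stand-alone version (the vendor names are just
examples from the attribute list):

```python
def nominal_to_binary(values):
    """Replace a nominal column with one 0/1 indicator column per
    distinct value, like WEKA's NominalToBinary filter."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

rows, cols = nominal_to_binary(["ibm", "dec", "ibm", "hp"])
print(cols)  # ['dec', 'hp', 'ibm']
print(rows)  # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Exactly one indicator is 1 in each row, which is why 30 distinct vendors yield 30 new attributes.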



       Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
       Relation:     cpu-weka.filters.unsupervised.attribute.NominalToBinary-Rfirst-last
       Instances:    209
       Attributes:   37
                     vendor=adviser
                     vendor=amdahl
                     vendor=apollo
                     vendor=basf
                     vendor=bti
                     vendor=burroughs
                     vendor=c.r.d
                     vendor=cdc
                     vendor=cambex
                     vendor=dec
                     vendor=dg
                     vendor=formation
                     vendor=four-phase
                     vendor=gould
                     vendor=hp
                     vendor=harris
                     vendor=honeywell
                     vendor=ibm
                     vendor=ipl
                     vendor=magnuson
                     vendor=microdata
                     vendor=nas
                     vendor=ncr
                     vendor=nixdorf
                     vendor=perkin-elmer
                     vendor=prime
                     vendor=siemens
                     vendor=sperry
                     vendor=sratus
                     vendor=wang
                     MYCT
                     MMIN
                     MMAX
                     CACH
                     CHMIN
                     CHMAX
                     ERP
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       Linear Regression Model

       ERP =

           -132.1272 * vendor=adviser +
           -34.3319 * vendor=burroughs +
           -52.3128 * vendor=gould +
           -35.8202 * vendor=honeywell +
           -16.7597 * vendor=ibm +
           -144.1856 * vendor=microdata +
           -22.7172 * vendor=nas +
           41.5185 * vendor=sperry +
           0.0696 * MYCT +
           0.0167 * MMIN +
           0.0055 * MMAX +
           0.6304 * CACH +
           -1.5416 * CHMIN +
           1.6106 * CHMAX +
          -57.432

       Time taken to build model: 0.02 seconds

       === Cross-validation ===
       === Summary ===





       Correlation coefficient                          0.9252
       Mean absolute error                              35.9725
       Root mean squared error                          58.5821
       Relative absolute error                          41.0372 %
       Root relative squared error                      37.8525 %
       Total Number of Instances                        209
6. Compare the performance to the one we had previously. Did adding the binarized vendor
variable help?

The errors of the first linear model were:

Relative absolute error                     40.4842 %
Root relative squared error                 37.1725 %


The errors of the latest linear regression model are:

Relative absolute error                     41.0372 %
Root relative squared error                 37.8525 %


It looks like the error rates have only increased. I think that is because we now take many more
attributes into account, which makes the slope (the a in y = ax + b) more complex and more
sensitive to errors. I predict that the increase in error would be smaller if we took only the
attributes into account that correlate best with ERP.





Mais conteúdo relacionado

Mais procurados

Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionGene Chang
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFBrendan Gregg
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
 
Tutorial: Cross-compiling Linux Kernels on x86_64
Tutorial: Cross-compiling Linux Kernels on x86_64Tutorial: Cross-compiling Linux Kernels on x86_64
Tutorial: Cross-compiling Linux Kernels on x86_64Samsung Open Source Group
 
Part-2: Mastering microcontroller with embedded driver development
Part-2: Mastering microcontroller with embedded driver developmentPart-2: Mastering microcontroller with embedded driver development
Part-2: Mastering microcontroller with embedded driver developmentFastBit Embedded Brain Academy
 
Assembly Language Programming
Assembly Language ProgrammingAssembly Language Programming
Assembly Language ProgrammingNiropam Das
 
Zynqで始めるUSB開発-FPGAとARMで動く USBオーディオデバイスの実例とともに-
Zynqで始めるUSB開発-FPGAとARMで動くUSBオーディオデバイスの実例とともに-Zynqで始めるUSB開発-FPGAとARMで動くUSBオーディオデバイスの実例とともに-
Zynqで始めるUSB開発-FPGAとARMで動く USBオーディオデバイスの実例とともに-mmitti
 
Analysis of Open-Source Drivers for IEEE 802.11 WLANs
Analysis of Open-Source Drivers for IEEE 802.11 WLANsAnalysis of Open-Source Drivers for IEEE 802.11 WLANs
Analysis of Open-Source Drivers for IEEE 802.11 WLANsDanh Nguyen
 
Computer architecture the pentium architecture
Computer architecture the pentium architectureComputer architecture the pentium architecture
Computer architecture the pentium architectureMazin Alwaaly
 
HC 05藍芽模組連線
HC 05藍芽模組連線HC 05藍芽模組連線
HC 05藍芽模組連線Chen-Hung Hu
 
Advanced debugging  techniques in different environments
Advanced debugging  techniques in different environmentsAdvanced debugging  techniques in different environments
Advanced debugging  techniques in different environmentsAndrii Soldatenko
 
Operand and Opcode | Computer Science
Operand and Opcode | Computer ScienceOperand and Opcode | Computer Science
Operand and Opcode | Computer ScienceTransweb Global Inc
 
Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerPlatonov Sergey
 
Solution manual of assembly language programming and organization of the ibm ...
Solution manual of assembly language programming and organization of the ibm ...Solution manual of assembly language programming and organization of the ibm ...
Solution manual of assembly language programming and organization of the ibm ...Tayeen Ahmed
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveNetronome
 

Mais procurados (20)

Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introduction
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
To connect two jframe
To connect two jframeTo connect two jframe
To connect two jframe
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
 
Embedded C - Lecture 2
Embedded C - Lecture 2Embedded C - Lecture 2
Embedded C - Lecture 2
 
Tutorial: Cross-compiling Linux Kernels on x86_64
Tutorial: Cross-compiling Linux Kernels on x86_64Tutorial: Cross-compiling Linux Kernels on x86_64
Tutorial: Cross-compiling Linux Kernels on x86_64
 
Part-2: Mastering microcontroller with embedded driver development
Part-2: Mastering microcontroller with embedded driver developmentPart-2: Mastering microcontroller with embedded driver development
Part-2: Mastering microcontroller with embedded driver development
 
CO by Rakesh Roshan
CO by Rakesh RoshanCO by Rakesh Roshan
CO by Rakesh Roshan
 
Assembly Language Programming
Assembly Language ProgrammingAssembly Language Programming
Assembly Language Programming
 
Embedded C - Lecture 4
Embedded C - Lecture 4Embedded C - Lecture 4
Embedded C - Lecture 4
 
Zynqで始めるUSB開発-FPGAとARMで動く USBオーディオデバイスの実例とともに-
Zynqで始めるUSB開発-FPGAとARMで動くUSBオーディオデバイスの実例とともに-Zynqで始めるUSB開発-FPGAとARMで動くUSBオーディオデバイスの実例とともに-
Zynqで始めるUSB開発-FPGAとARMで動く USBオーディオデバイスの実例とともに-
 
Analysis of Open-Source Drivers for IEEE 802.11 WLANs
Analysis of Open-Source Drivers for IEEE 802.11 WLANsAnalysis of Open-Source Drivers for IEEE 802.11 WLANs
Analysis of Open-Source Drivers for IEEE 802.11 WLANs
 
Computer architecture the pentium architecture
Computer architecture the pentium architectureComputer architecture the pentium architecture
Computer architecture the pentium architecture
 
HC 05藍芽模組連線
HC 05藍芽模組連線HC 05藍芽模組連線
HC 05藍芽模組連線
 
Advanced debugging  techniques in different environments
Advanced debugging  techniques in different environmentsAdvanced debugging  techniques in different environments
Advanced debugging  techniques in different environments
 
Linux networking
Linux networkingLinux networking
Linux networking
 
Operand and Opcode | Computer Science
Operand and Opcode | Computer ScienceOperand and Opcode | Computer Science
Operand and Opcode | Computer Science
 
Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory Sanitizer
 
Solution manual of assembly language programming and organization of the ibm ...
Solution manual of assembly language programming and organization of the ibm ...Solution manual of assembly language programming and organization of the ibm ...
Solution manual of assembly language programming and organization of the ibm ...
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep Dive
 

Semelhante a Data mining Computerassignment 1

House Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN ApproachHouse Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN ApproachYusuf Uzun
 
Parameter Estimation User Guide
Parameter Estimation User GuideParameter Estimation User Guide
Parameter Estimation User GuideAndy Salmon
 
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Chakkrit (Kla) Tantithamthavorn
 
KnowledgeFromDataAtScaleProject
KnowledgeFromDataAtScaleProjectKnowledgeFromDataAtScaleProject
KnowledgeFromDataAtScaleProjectMarciano Moreno
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONIRJET Journal
 
Phase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMIPhase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMIVikas Virani
 
Working with the data for Machine Learning
Working with the data for Machine LearningWorking with the data for Machine Learning
Working with the data for Machine LearningMehwish690898
 
Gradient boosting for regression problems with example basics of regression...
Gradient boosting for regression problems with example   basics of regression...Gradient boosting for regression problems with example   basics of regression...
Gradient boosting for regression problems with example basics of regression...prateek kumar
 
Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...DineshRaj Goud
 
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...Geon-Hong Kim
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPiyush Srivastava
 
PVS-Studio team is about to produce a technical breakthrough, but for now let...
PVS-Studio team is about to produce a technical breakthrough, but for now let...PVS-Studio team is about to produce a technical breakthrough, but for now let...
PVS-Studio team is about to produce a technical breakthrough, but for now let...PVS-Studio
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson ChallengeRaouf KESKES
 
Data_Mining_Exploration
Data_Mining_ExplorationData_Mining_Exploration
Data_Mining_ExplorationBrett Keim
 
Scientific calculator project in c language
Scientific calculator project in c languageScientific calculator project in c language
Scientific calculator project in c languageAMIT KUMAR
 
Steady state CFD analysis of C-D nozzle
Steady state CFD analysis of C-D nozzle Steady state CFD analysis of C-D nozzle
Steady state CFD analysis of C-D nozzle Vishnu R
 

Semelhante a Data mining Computerassignment 1 (20)

House Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN ApproachHouse Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN Approach
 
Parameter Estimation User Guide
Parameter Estimation User GuideParameter Estimation User Guide
Parameter Estimation User Guide
 
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
 
KnowledgeFromDataAtScaleProject
KnowledgeFromDataAtScaleProjectKnowledgeFromDataAtScaleProject
KnowledgeFromDataAtScaleProject
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
 
Phase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMIPhase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMI
 
Working with the data for Machine Learning
Working with the data for Machine LearningWorking with the data for Machine Learning
Working with the data for Machine Learning
 
Week 4
Week 4Week 4
Week 4
 
C++ Homework Help
C++ Homework HelpC++ Homework Help
C++ Homework Help
 
Gradient boosting for regression problems with example basics of regression...
Gradient boosting for regression problems with example   basics of regression...Gradient boosting for regression problems with example   basics of regression...
Gradient boosting for regression problems with example basics of regression...
 
Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...
 
Telecom Churn Analysis
Telecom Churn AnalysisTelecom Churn Analysis
Telecom Churn Analysis
 
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an Organization
 
Chap 5 c++
Chap 5 c++Chap 5 c++
Chap 5 c++
 
PVS-Studio team is about to produce a technical breakthrough, but for now let...
PVS-Studio team is about to produce a technical breakthrough, but for now let...PVS-Studio team is about to produce a technical breakthrough, but for now let...
PVS-Studio team is about to produce a technical breakthrough, but for now let...
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
 
Data_Mining_Exploration
Data_Mining_ExplorationData_Mining_Exploration
Data_Mining_Exploration
 
Scientific calculator project in c language
Scientific calculator project in c languageScientific calculator project in c language
Scientific calculator project in c language
 
Steady state CFD analysis of C-D nozzle
Steady state CFD analysis of C-D nozzle Steady state CFD analysis of C-D nozzle
Steady state CFD analysis of C-D nozzle
 

Mais de BarryK88

Data mining test notes (back)
Data mining test notes (back)Data mining test notes (back)
Data mining test notes (back)BarryK88
 
Data mining test notes (front)
Data mining test notes (front)Data mining test notes (front)
Data mining test notes (front)BarryK88
 
Data mining Computerassignment 3
Data mining Computerassignment 3Data mining Computerassignment 3
Data mining Computerassignment 3BarryK88
 
Data mining assignment 2
Data mining assignment 2Data mining assignment 2
Data mining assignment 2BarryK88
 
Data mining assignment 4
Data mining assignment 4Data mining assignment 4
Data mining assignment 4BarryK88
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3BarryK88
 
Data mining assignment 5
Data mining assignment 5Data mining assignment 5
Data mining assignment 5BarryK88
 
Data mining assignment 6
Data mining assignment 6Data mining assignment 6
Data mining assignment 6BarryK88
 
Data mining assignment 1
Data mining assignment 1Data mining assignment 1
Data mining assignment 1BarryK88
 
Data mining Computerassignment 2
Data mining Computerassignment 2Data mining Computerassignment 2
Data mining Computerassignment 2BarryK88
 
Semantic web final assignment
Semantic web final assignmentSemantic web final assignment
Semantic web final assignmentBarryK88
 
Semantic web assignment 3
Semantic web assignment 3Semantic web assignment 3
Semantic web assignment 3BarryK88
 
Semantic web assignment 2
Semantic web assignment 2Semantic web assignment 2
Semantic web assignment 2BarryK88
 
Semantic web assignment1
Semantic web assignment1Semantic web assignment1
Semantic web assignment1BarryK88
 

Mais de BarryK88 (14)

Data mining test notes (back)
Data mining test notes (back)Data mining test notes (back)
Data mining test notes (back)
 
Data mining test notes (front)
Data mining test notes (front)Data mining test notes (front)
Data mining test notes (front)
 
Data mining Computerassignment 3
Data mining Computerassignment 3Data mining Computerassignment 3
Data mining Computerassignment 3
 
Data mining assignment 2
Data mining assignment 2Data mining assignment 2
Data mining assignment 2
 
Data mining assignment 4
Data mining assignment 4Data mining assignment 4
Data mining assignment 4
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3
 
Data mining assignment 5
Data mining assignment 5Data mining assignment 5
Data mining assignment 5
 
Data mining assignment 6
Data mining assignment 6Data mining assignment 6
Data mining assignment 6
 
Data mining assignment 1
Data mining assignment 1Data mining assignment 1
Data mining assignment 1
 
Data mining Computerassignment 2
Data mining Computerassignment 2Data mining Computerassignment 2
Data mining Computerassignment 2
 
Semantic web final assignment
Semantic web final assignmentSemantic web final assignment
Semantic web final assignment
 
Semantic web assignment 3
Semantic web assignment 3Semantic web assignment 3
Semantic web assignment 3
 
Semantic web assignment 2
Semantic web assignment 2Semantic web assignment 2
Semantic web assignment 2
 
Semantic web assignment1
Semantic web assignment1Semantic web assignment1
Semantic web assignment1
 

Data mining Computerassignment 1

  • 1. Data mining ‘REGRESSION: CPU Performance’ Visualized data with WEKA COMPUTER ASSIGNMENT 1 BARRY KOLLEE 10349863
  • 2. Regression  |  CPU  performance     1. Do you think that ERP should be at least partially predictable from the input attributes? Not in all cases. This is only possible if we’re able to see correlation between the two attributes that we compare. In case both values correlate with each other we can state that we can predict certain values from the input attribute. 2. Do any attributes exhibit significant correlations? I’ve loaded up the delivered database file into WEKA. With visualising the data as a graph (which shows the correlation between all attributes) I’m seeing the plotted graphs which is listed below. To see correlation between all ‘dots’ it is necessary to see a linear pattern. The following correlated graphs seems to correlate with ERP; respectively MYCT, MMIN and MMAX: • Green MMAX, with MMAX plotted on the X-axis I see a pattern which is increasing slowly at first and after words it increases rapidly. If we swap the y and x axis we see the opposite result. It starts with increasing rapidly and after words it increases slowly. • Blue MYCT, with MYCT plotted on my x-axis I see a pattern within the correlation between ERP and MYCT. The pattern look like a (1/n) math graph where we start of with a high value. When increasing the x-axis you see a direct decrease in the pattern which is going to the ‘zeropoint’ of the Y-axis. When increasing the x-axis even more we don’t see the slope decreasing anymore. If we swap the x and y axis we see a similar pattern. • Red MMIN, the pattern which I see within MMIN is similar to the one of MMAX. 2
  • 3. Regression  |  CPU  performance     3. Now we have a feel for the data and we will try fitting a simple linear regression model to the data. On the Classify tab, select Choose > functions > LinearRegression. • Use the default options and click Start. This will use 10-fold cross-validation to fit the linear regression model. Examine the results: • Record the Root relative squared error and the Relative absolute error. The Relative squared error is computed by dividing (normalizing) the sum of the squared prediction errors by the sum of the prediction errors obtained by always predicting the mean. The Root relative squared error is obtained by taking the square root of the Relative squared error. The Relative absolute error is similar to the Relative squared error, but uses absolute values rather than squares. Therefore, if we have a relative error of 100%, the learned model is no better than this very dumb predictor. When I perform the linear regression function onto the ERP attribute I’m getting the following information about this attribute. The ‘Root relative squared error’ is given in red. Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8 Relation: cpu-weka.filters.unsupervised.attribute.Remove-R1 Instances: 209 Attributes: 7 MYCT MMIN MMAX CACH CHMIN CHMAX ERP Test mode:10-fold cross-validation === Classifier model (full training set) === Linear Regression Model ERP = 0.0661 * MYCT + 0.0142 * MMIN + 0.0066 * MMAX + 0.4871 * CACH + 1.1868 * CHMAX + -66.5968 Time taken to build model: 0 seconds === Cross-validation === === Summary === Correlation coefficient 0.928 Mean absolute error 35.4878 Root mean squared error 57.5296 Relative absolute error 40.4842 % Root relative squared error 37.1725 % Total Number of Instances 209       The Root relative squared error looks pretty high. That’s because we take all of the attributes into account and we fit that into our calculation. 
You can also see that five attributes are taken into account in the scope of our linear regression model (CHMIN does not appear in it). Below the model is written in the form y = ax + b, which represents our linear regression graph; with all these attributes included we eventually obtain a correlation coefficient of 0.928. The calculation looks like:

y  = ax + b
ax = 0.0661 * MYCT + 0.0142 * MMIN + 0.0066 * MMAX + 0.4871 * CACH + 1.1868 * CHMAX
b  = -66.5968
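Applying the fitted model is just evaluating this weighted sum. A small Python sketch using the coefficients above (the machine's attribute values are hypothetical, invented for illustration):

```python
# Coefficients of the Weka linear regression model for ERP.
WEIGHTS = {"MYCT": 0.0661, "MMIN": 0.0142, "MMAX": 0.0066,
           "CACH": 0.4871, "CHMAX": 1.1868}
INTERCEPT = -66.5968

def predict_erp(instance):
    """Weighted sum of the attribute values plus the intercept."""
    return sum(w * instance[name] for name, w in WEIGHTS.items()) + INTERCEPT

# Hypothetical machine: 125 ns cycle time, 256/6000 KB min/max main
# memory, 16 KB cache, 24 maximum channels.
machine = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 16, "CHMAX": 24}
print(round(predict_erp(machine), 2))  # -> 21.18
```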
I expect that we can build a better-fitting linear regression model if we only take the attributes into account that correlate best with ERP, as found in answer 2. To achieve this we take only MMIN and MMAX into account, because these attributes appear to correlate best given the output in answer 2. I built another linear regression model using only the MMIN and MMAX attributes; its output is given below:

=== Run information ===

Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     cpu-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R1,4-6
Instances:    209
Attributes:   3
              MMIN
              MMAX
              ERP
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

ERP =
      0.0128 * MMIN +
      0.0087 * MMAX +
    -39.814

Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient          0.9022
Mean absolute error             39.8811
Root mean squared error         66.584
Relative absolute error         45.4961 %
Root relative squared error     43.023  %
Total Number of Instances      209

My assumption was actually wrong. When taking only MMIN and MMAX into account, the correlation coefficient is lower and the error rates are higher; e.g. the Mean absolute error, which gives the average difference between the actual and predicted values over all test cases, has increased. The Root relative squared error has also increased, by about 6 percentage points.

4. Did you expect such a performance given your earlier observations? Hint: We are fitting a linear model.

Because we are fitting a linear model, we are searching for the attributes that correlate best with ERP. The quality of the fit is clearly visible in the correlation coefficient: a value of ca. 0.93 is really close to 1, which is the best value possible. However, the root relative squared error is fairly high. To obtain a better-fitting linear regression model we should only take attributes into account that correlate best with ERP; this should result in a correlation coefficient closer to 1 and an error rate closer to 0 %. However, my observations when using only MMIN and MMAX were not that hopeful. Perhaps that is because the errors are averaged out when we include more attributes: using more attributes seems to decrease the error rate. On the other hand, I would have expected that including more attributes would make the model more error-sensitive.
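For intuition on what LinearRegression actually computes, the one-attribute case has a simple closed form: the slope is the covariance of x and y divided by the variance of x. A minimal Python sketch with invented points (Weka's multi-attribute version solves the analogous normal equations, plus attribute selection and the small ridge value shown in the scheme line, -R 1.0E-8):

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance of x and y over variance of x.
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    # Intercept: the line passes through the point of means.
    b = my - a * mx
    return a, b

# Invented points lying exactly on y = 2x + 1.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # -> 2.0 1.0
```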
5. Above we deleted the vendor variable. However, we can use nominal attributes in regression by converting them to numeric. The standard way of doing so is to replace the nominal variable with a bunch of binary variables of the form "is_first_nominal_value, is_second_nominal_value" and so on. Reload the unmodified data file cpu.arff.

• On the Preprocess tab select Choose > filters > unsupervised > attribute > NominalToBinary and click Apply. This replaces the vendor variable with 30 binary variables and we now have 37 attributes (we started with 8). Now train a linear regression model as in (4) and examine the results.
• Record the Relative absolute error and the Root relative squared error.

Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     cpu-weka.filters.unsupervised.attribute.NominalToBinary-Rfirst-last
Instances:    209
Attributes:   37
              vendor=adviser
              vendor=amdahl
              vendor=apollo
              vendor=basf
              vendor=bti
              vendor=burroughs
              vendor=c.r.d
              vendor=cdc
              vendor=cambex
              vendor=dec
              vendor=dg
              vendor=formation
              vendor=four-phase
              vendor=gould
              vendor=hp
              vendor=harris
              vendor=honeywell
              vendor=ibm
              vendor=ipl
              vendor=magnuson
              vendor=microdata
              vendor=nas
              vendor=ncr
              vendor=nixdorf
              vendor=perkin-elmer
              vendor=prime
              vendor=siemens
              vendor=sperry
              vendor=sratus
              vendor=wang
              MYCT
              MMIN
              MMAX
              CACH
              CHMIN
              CHMAX
              ERP
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

ERP =
   -132.1272 * vendor=adviser +
    -34.3319 * vendor=burroughs +
    -52.3128 * vendor=gould +
    -35.8202 * vendor=honeywell +
    -16.7597 * vendor=ibm +
   -144.1856 * vendor=microdata +
    -22.7172 * vendor=nas +
     41.5185 * vendor=sperry +
      0.0696 * MYCT +
      0.0167 * MMIN +
      0.0055 * MMAX +
      0.6304 * CACH +
     -1.5416 * CHMIN +
      1.6106 * CHMAX +
    -57.432

Time taken to build model: 0.02 seconds

=== Cross-validation ===
=== Summary ===
Correlation coefficient          0.9252
Mean absolute error             35.9725
Root mean squared error         58.5821
Relative absolute error         41.0372 %
Root relative squared error     37.8525 %
Total Number of Instances      209

6. Compare the performance to the one we had previously. Did adding the binarized vendor variable help?

The errors of the first linear model were:

Relative absolute error         40.4842 %
Root relative squared error     37.1725 %

The errors of the latest linear regression model are:

Relative absolute error         41.0372 %
Root relative squared error     37.8525 %

It looks like the error rate has only increased. I think that is because we now take many more attributes into account, which makes our slope (the a in y = ax + b) more complex and error-sensitive. I predict that the error rate would be lower if we only took attributes into account that correlate best with ERP.
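The NominalToBinary transformation used in question 5 amounts to a one-hot encoding, which can be sketched directly. The vendor list below is a hypothetical subset of the 30 values, for illustration only:

```python
def nominal_to_binary(value, domain):
    """One-hot encode a nominal value over its domain, as Weka's
    NominalToBinary filter does for each 'vendor=<name>' column."""
    return [1 if value == d else 0 for d in domain]

# Small invented subset of the 30 vendor values.
vendors = ["amdahl", "ibm", "nas", "sperry"]
print(nominal_to_binary("ibm", vendors))  # -> [0, 1, 0, 0]
```

Exactly one of the binary columns is 1 per instance, so the regression learns one additive offset per vendor, which is what the vendor terms in the model above express.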