Data mining
‘REGRESSION: CPU Performance’




        Visualized data with WEKA
        COMPUTER ASSIGNMENT 1

        BARRY KOLLEE

        10349863
Regression | CPU performance
1. Do you think that ERP should be at least partially predictable from the input attributes?

Not in all cases. It is only possible if there is correlation between ERP and the input attribute we
compare it with. When two attributes correlate with each other, we can predict values of one, at
least partially, from the other.

2. Do any attributes exhibit significant correlations?

I loaded the supplied data file into WEKA. Visualising the data as a scatter-plot matrix (which shows
the relationship between every pair of attributes) gives the plots described below. For two attributes
to correlate, the points should follow a roughly linear pattern. The following attributes appear to
correlate with ERP: MYCT, MMIN and MMAX:

       •   Green MMAX: with MMAX plotted on the x-axis I see a pattern that increases slowly at
           first and rapidly afterwards. If we swap the x and y axes we see the opposite result: a
           rapid increase followed by a slow one.
       •   Blue MYCT: with MYCT plotted on the x-axis, the relationship between ERP and MYCT
           resembles a 1/x curve: it starts at a high value, drops steeply towards the zero point of
           the y-axis as x increases, and then flattens out, the slope no longer decreasing. Swapping
           the axes gives a similar pattern.
       •   Red MMIN: the pattern I see for MMIN is similar to that of MMAX.
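The visual check above can also be done numerically. A minimal sketch with numpy's correlation
coefficient; the MMAX and ERP values here are made-up stand-ins for illustration, since the actual
cpu.arff rows are not reproduced in this report:

```python
import numpy as np

# Hypothetical stand-in values; in practice these columns would be
# loaded from cpu.arff.
mmax = np.array([256, 512, 1000, 2000, 4000, 8000, 16000, 32000], dtype=float)
erp  = np.array([ 20,  30,   45,   70,  120,  200,   380,   700], dtype=float)

# Pearson correlation coefficient between MMAX and ERP; values near 1
# indicate the roughly linear pattern seen in the scatter plot.
r = np.corrcoef(mmax, erp)[0, 1]
print(f"corr(MMAX, ERP) = {r:.3f}")
```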





3. Now we have a feel for the data and we will try fitting a simple linear regression model to
the data. On the Classify tab, select Choose > functions > LinearRegression.

        •      Use the default options and click Start. This will use 10-fold cross-validation to fit the linear
               regression model. Examine the results:
        •      Record the Root relative squared error and the Relative absolute error. The Relative squared
               error is computed by dividing (normalizing) the sum of the squared prediction errors by the sum
               of the squared errors obtained by always predicting the mean. The Root relative squared
               error is obtained by taking the square root of the Relative squared error. The Relative absolute
               error is similar to the Relative squared error, but uses absolute values rather than squares.
               Therefore, if we have a relative error of 100%, the learned model is no better than this very
               dumb predictor.
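The two measures described above can be written down directly. A small numpy sketch (for
simplicity it uses the mean of the test values as the baseline, where WEKA's cross-validation uses
the mean of the training data):

```python
import numpy as np

def relative_absolute_error(y_true, y_pred):
    """Sum of |error|, normalised by the error of always predicting the mean."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    baseline = np.abs(y_true - y_true.mean()).sum()
    return np.abs(y_true - y_pred).sum() / baseline

def root_relative_squared_error(y_true, y_pred):
    """Square root of the squared errors, normalised the same way."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    baseline = ((y_true - y_true.mean()) ** 2).sum()
    return np.sqrt(((y_true - y_pred) ** 2).sum() / baseline)

# The 'very dumb predictor' (always the mean) scores exactly 100 % on both.
y = np.array([10.0, 20.0, 30.0, 40.0])
mean_pred = np.full_like(y, y.mean())
print(relative_absolute_error(y, mean_pred))      # 1.0, i.e. 100 %
print(root_relative_squared_error(y, mean_pred))  # 1.0
```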

Running the linear regression scheme with ERP as the class attribute produces the following
output; the ‘Root relative squared error’ appears near the bottom of the summary.



       Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
       Relation:     cpu-weka.filters.unsupervised.attribute.Remove-R1
       Instances:    209
       Attributes:   7
                     MYCT
                     MMIN
                     MMAX
                     CACH
                     CHMIN
                     CHMAX
                     ERP
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===
       Linear Regression Model

       ERP =

              0.0661    *   MYCT +
              0.0142    *   MMIN +
              0.0066    *   MMAX +
              0.4871    *   CACH +
              1.1868    *   CHMAX +
            -66.5968

       Time taken to build model: 0 seconds

       === Cross-validation ===
       === Summary ===

       Correlation coefficient                               0.928
       Mean absolute error                                   35.4878
       Root mean squared error                               57.5296
       Relative absolute error                               40.4842 %
       Root relative squared error                           37.1725 %
       Total Number of Instances                             209



The Root relative squared error looks fairly high. That is because all of the remaining attributes are
taken into account in the fit: WEKA kept five of the six input attributes (CHMIN was dropped). The
model still has the familiar form y = ax + b, except that the single slope a is replaced by one weight
per attribute, so the prediction is a weighted sum of the five attributes plus an intercept. With all
five attributes included we obtain a correlation coefficient of 0.928. Written out, the model is:


       ERP = (weighted sum of attributes) + b

       weighted sum = 0.0661 * MYCT + 0.0142 * MMIN + 0.0066 * MMAX + 0.4871 * CACH + 1.1868 * CHMAX

       b = -66.5968
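As a sanity check, the fitted model can be applied by hand. A minimal Python sketch using the
coefficients from the WEKA output above; the example machine's attribute values are made up
purely for illustration:

```python
# Weights reported by WEKA's LinearRegression; CHMIN was dropped by the
# attribute selection, so it effectively gets weight 0.
weights = {"MYCT": 0.0661, "MMIN": 0.0142, "MMAX": 0.0066,
           "CACH": 0.4871, "CHMIN": 0.0, "CHMAX": 1.1868}
intercept = -66.5968

def predict_erp(machine):
    """ERP = sum of (weight * attribute value) plus the intercept."""
    return sum(w * machine[name] for name, w in weights.items()) + intercept

# Hypothetical machine (illustrative attribute values only).
example = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
           "CACH": 16, "CHMIN": 4, "CHMAX": 24}
print(round(predict_erp(example), 2))  # -> 21.18
```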




I expect that we can build a better-fitting linear regression model by taking only the attributes into
account that correlate best with ERP, as identified in answer 2. Judging from those plots, MMIN and
MMAX correlate best, so I made another linear regression model using only these two attributes.
Its output is given below:

           === Run information ===

           Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
           Relation:     cpu-weka.filters.unsupervised.attribute.Remove-R1-
           weka.filters.unsupervised.attribute.Remove-R1,4-6
           Instances:    209
           Attributes:   3
                         MMIN
                         MMAX
                         ERP
           Test mode:10-fold cross-validation

           === Classifier model (full training set) ===


           Linear Regression Model

           ERP =

                 0.0128 * MMIN +
                 0.0087 * MMAX +
               -39.814

           Time taken to build model: 0 seconds

           === Cross-validation ===
           === Summary ===

           Correlation coefficient                      0.9022
           Mean absolute error                          39.8811
           Root mean squared error                      66.584
           Relative absolute error                      45.4961 %
           Root relative squared error                  43.023 %
            Total Number of Instances                    209
My assumption was actually wrong. When taking only MMIN and MMAX into account, the
correlation coefficient is lower and the errors are higher: the Mean absolute error, which gives the
average difference between the actual and predicted values over all test cases, has gone up, and the
Root relative squared error has increased by about 6 percentage points.

4. Did you expect such a performance given your earlier observations? Hint: We are fitting a
linear model.

Because we are fitting a linear model, we look for the attributes that correlate most linearly with
ERP. The good performance is clearly visible in the correlation coefficient: about 0.93, which is
close to 1, the best possible value.

However, the Root relative squared error is fairly high. I expected that taking only the attributes
that correlate best with ERP into account would give a better-fitting model, with a correlation
coefficient closer to 1 and an error rate closer to 0%. My observations with only MMIN and MMAX
show otherwise: individual errors apparently average out when more attributes are included, so
using more attributes seems to decrease the error rate.

On the other hand, I would have expected that including more attributes makes the model more
sensitive to errors in those attributes.




5. Above we deleted the vendor variable. However, we can use nominal attributes in
regression by converting them to numeric. The standard way of doing so is to replace the
nominal variable with a bunch of binary variables of the form "is_first_nominal_value",
"is_second_nominal_value" and so on. Reload the unmodified data file cpu.arff.
    • On the Preprocess tab select Choose > filters > unsupervised > attribute >
        NominalToBinary and click Apply. This replaces the vendor variable with 30 binary
        variables and we now have 37 attributes (we started with 8).
        Now train a linear regression model as in (4) and examine the results.
    • Record the Relative absolute error and the Root relative squared error
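What the NominalToBinary filter does can be sketched in a few lines: each distinct nominal value
becomes its own 0/1 indicator attribute. A minimal stand-alone version (the vendor names are just
examples from the attribute list):

```python
def nominal_to_binary(values):
    """Replace a nominal column with one 0/1 indicator column per
    distinct value, like WEKA's NominalToBinary filter."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

rows, cols = nominal_to_binary(["ibm", "dec", "ibm", "hp"])
print(cols)  # ['dec', 'hp', 'ibm']
print(rows)  # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Exactly one indicator is 1 in each row, which is why 30 distinct vendors yield 30 new attributes.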



       Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
       Relation:     cpu-weka.filters.unsupervised.attribute.NominalToBinary-Rfirst-last
       Instances:    209
       Attributes:   37
                     vendor=adviser
                     vendor=amdahl
                     vendor=apollo
                     vendor=basf
                     vendor=bti
                     vendor=burroughs
                     vendor=c.r.d
                     vendor=cdc
                     vendor=cambex
                     vendor=dec
                     vendor=dg
                     vendor=formation
                     vendor=four-phase
                     vendor=gould
                     vendor=hp
                     vendor=harris
                     vendor=honeywell
                     vendor=ibm
                     vendor=ipl
                     vendor=magnuson
                     vendor=microdata
                     vendor=nas
                     vendor=ncr
                     vendor=nixdorf
                     vendor=perkin-elmer
                     vendor=prime
                     vendor=siemens
                     vendor=sperry
                     vendor=sratus
                     vendor=wang
                     MYCT
                     MMIN
                     MMAX
                     CACH
                     CHMIN
                     CHMAX
                     ERP
       Test mode:10-fold cross-validation

       === Classifier model (full training set) ===

       Linear Regression Model

       ERP =

           -132.1272 * vendor=adviser +
           -34.3319 * vendor=burroughs +
           -52.3128 * vendor=gould +
           -35.8202 * vendor=honeywell +
           -16.7597 * vendor=ibm +
           -144.1856 * vendor=microdata +
           -22.7172 * vendor=nas +
           41.5185 * vendor=sperry +
           0.0696 * MYCT +
           0.0167 * MMIN +
           0.0055 * MMAX +
           0.6304 * CACH +
           -1.5416 * CHMIN +
           1.6106 * CHMAX +
          -57.432

       Time taken to build model: 0.02 seconds

       === Cross-validation ===
       === Summary ===





       Correlation coefficient                          0.9252
       Mean absolute error                              35.9725
       Root mean squared error                          58.5821
       Relative absolute error                          41.0372 %
       Root relative squared error                      37.8525 %
       Total Number of Instances                        209
6. Compare the performance to the one we had previously. Did adding the binarized vendor
variable help?

The errors of the first linear model were:

Relative absolute error                     40.4842 %
Root relative squared error                 37.1725 %


The errors of the latest linear regression model are:

Relative absolute error                     41.0372 %
Root relative squared error                 37.8525 %


It looks like the error rates have only increased. I think that is because we now take many more
attributes into account, which makes the slope (the a in y = ax + b) more complex and more
sensitive to errors. I predict that the increase in error would be smaller if we took only the
attributes into account that correlate best with ERP.





Mais conteúdo relacionado

Mais procurados

Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionGene Chang
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFBrendan Gregg
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
 
Tutorial: Cross-compiling Linux Kernels on x86_64
Tutorial: Cross-compiling Linux Kernels on x86_64Tutorial: Cross-compiling Linux Kernels on x86_64
Tutorial: Cross-compiling Linux Kernels on x86_64Samsung Open Source Group
 
Part-2: Mastering microcontroller with embedded driver development
Part-2: Mastering microcontroller with embedded driver developmentPart-2: Mastering microcontroller with embedded driver development
Part-2: Mastering microcontroller with embedded driver developmentFastBit Embedded Brain Academy
 
Assembly Language Programming
Assembly Language ProgrammingAssembly Language Programming
Assembly Language ProgrammingNiropam Das
 
Zynqで始めるUSB開発-FPGAとARMで動く USBオーディオデバイスの実例とともに-
Zynqで始めるUSB開発-FPGAとARMで動くUSBオーディオデバイスの実例とともに-Zynqで始めるUSB開発-FPGAとARMで動くUSBオーディオデバイスの実例とともに-
Zynqで始めるUSB開発-FPGAとARMで動く USBオーディオデバイスの実例とともに-mmitti
 
Analysis of Open-Source Drivers for IEEE 802.11 WLANs
Analysis of Open-Source Drivers for IEEE 802.11 WLANsAnalysis of Open-Source Drivers for IEEE 802.11 WLANs
Analysis of Open-Source Drivers for IEEE 802.11 WLANsDanh Nguyen
 
Computer architecture the pentium architecture
Computer architecture the pentium architectureComputer architecture the pentium architecture
Computer architecture the pentium architectureMazin Alwaaly
 
HC 05藍芽模組連線
HC 05藍芽模組連線HC 05藍芽模組連線
HC 05藍芽模組連線Chen-Hung Hu
 
Advanced debugging  techniques in different environments
Advanced debugging  techniques in different environmentsAdvanced debugging  techniques in different environments
Advanced debugging  techniques in different environmentsAndrii Soldatenko
 
Operand and Opcode | Computer Science
Operand and Opcode | Computer ScienceOperand and Opcode | Computer Science
Operand and Opcode | Computer ScienceTransweb Global Inc
 
Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerPlatonov Sergey
 
Solution manual of assembly language programming and organization of the ibm ...
Solution manual of assembly language programming and organization of the ibm ...Solution manual of assembly language programming and organization of the ibm ...
Solution manual of assembly language programming and organization of the ibm ...Tayeen Ahmed
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveNetronome
 

Mais procurados (20)

Linux MMAP & Ioremap introduction
Linux MMAP & Ioremap introductionLinux MMAP & Ioremap introduction
Linux MMAP & Ioremap introduction
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
To connect two jframe
To connect two jframeTo connect two jframe
To connect two jframe
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
 
Embedded C - Lecture 2
Embedded C - Lecture 2Embedded C - Lecture 2
Embedded C - Lecture 2
 
Tutorial: Cross-compiling Linux Kernels on x86_64
Tutorial: Cross-compiling Linux Kernels on x86_64Tutorial: Cross-compiling Linux Kernels on x86_64
Tutorial: Cross-compiling Linux Kernels on x86_64
 
Part-2: Mastering microcontroller with embedded driver development
Part-2: Mastering microcontroller with embedded driver developmentPart-2: Mastering microcontroller with embedded driver development
Part-2: Mastering microcontroller with embedded driver development
 
CO by Rakesh Roshan
CO by Rakesh RoshanCO by Rakesh Roshan
CO by Rakesh Roshan
 
Assembly Language Programming
Assembly Language ProgrammingAssembly Language Programming
Assembly Language Programming
 
Embedded C - Lecture 4
Embedded C - Lecture 4Embedded C - Lecture 4
Embedded C - Lecture 4
 
Zynqで始めるUSB開発-FPGAとARMで動く USBオーディオデバイスの実例とともに-
Zynqで始めるUSB開発-FPGAとARMで動くUSBオーディオデバイスの実例とともに-Zynqで始めるUSB開発-FPGAとARMで動くUSBオーディオデバイスの実例とともに-
Zynqで始めるUSB開発-FPGAとARMで動く USBオーディオデバイスの実例とともに-
 
Analysis of Open-Source Drivers for IEEE 802.11 WLANs
Analysis of Open-Source Drivers for IEEE 802.11 WLANsAnalysis of Open-Source Drivers for IEEE 802.11 WLANs
Analysis of Open-Source Drivers for IEEE 802.11 WLANs
 
Computer architecture the pentium architecture
Computer architecture the pentium architectureComputer architecture the pentium architecture
Computer architecture the pentium architecture
 
HC 05藍芽模組連線
HC 05藍芽模組連線HC 05藍芽模組連線
HC 05藍芽模組連線
 
Advanced debugging  techniques in different environments
Advanced debugging  techniques in different environmentsAdvanced debugging  techniques in different environments
Advanced debugging  techniques in different environments
 
Linux networking
Linux networkingLinux networking
Linux networking
 
Operand and Opcode | Computer Science
Operand and Opcode | Computer ScienceOperand and Opcode | Computer Science
Operand and Opcode | Computer Science
 
Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory Sanitizer
 
Solution manual of assembly language programming and organization of the ibm ...
Solution manual of assembly language programming and organization of the ibm ...Solution manual of assembly language programming and organization of the ibm ...
Solution manual of assembly language programming and organization of the ibm ...
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep Dive
 

Semelhante a Data mining Computerassignment 1

House Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN ApproachHouse Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN ApproachYusuf Uzun
 
Parameter Estimation User Guide
Parameter Estimation User GuideParameter Estimation User Guide
Parameter Estimation User GuideAndy Salmon
 
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Chakkrit (Kla) Tantithamthavorn
 
KnowledgeFromDataAtScaleProject
KnowledgeFromDataAtScaleProjectKnowledgeFromDataAtScaleProject
KnowledgeFromDataAtScaleProjectMarciano Moreno
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONIRJET Journal
 
Phase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMIPhase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMIVikas Virani
 
Working with the data for Machine Learning
Working with the data for Machine LearningWorking with the data for Machine Learning
Working with the data for Machine LearningMehwish690898
 
Gradient boosting for regression problems with example basics of regression...
Gradient boosting for regression problems with example   basics of regression...Gradient boosting for regression problems with example   basics of regression...
Gradient boosting for regression problems with example basics of regression...prateek kumar
 
Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...DineshRaj Goud
 
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...Geon-Hong Kim
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPiyush Srivastava
 
PVS-Studio team is about to produce a technical breakthrough, but for now let...
PVS-Studio team is about to produce a technical breakthrough, but for now let...PVS-Studio team is about to produce a technical breakthrough, but for now let...
PVS-Studio team is about to produce a technical breakthrough, but for now let...PVS-Studio
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson ChallengeRaouf KESKES
 
Data_Mining_Exploration
Data_Mining_ExplorationData_Mining_Exploration
Data_Mining_ExplorationBrett Keim
 
Scientific calculator project in c language
Scientific calculator project in c languageScientific calculator project in c language
Scientific calculator project in c languageAMIT KUMAR
 
Steady state CFD analysis of C-D nozzle
Steady state CFD analysis of C-D nozzle Steady state CFD analysis of C-D nozzle
Steady state CFD analysis of C-D nozzle Vishnu R
 

Semelhante a Data mining Computerassignment 1 (20)

House Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN ApproachHouse Price Estimation as a Function Fitting Problem with using ANN Approach
House Price Estimation as a Function Fitting Problem with using ANN Approach
 
Parameter Estimation User Guide
Parameter Estimation User GuideParameter Estimation User Guide
Parameter Estimation User Guide
 
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
 
KnowledgeFromDataAtScaleProject
KnowledgeFromDataAtScaleProjectKnowledgeFromDataAtScaleProject
KnowledgeFromDataAtScaleProject
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
 
Phase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMIPhase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMI
 
Working with the data for Machine Learning
Working with the data for Machine LearningWorking with the data for Machine Learning
Working with the data for Machine Learning
 
Week 4
Week 4Week 4
Week 4
 
C++ Homework Help
C++ Homework HelpC++ Homework Help
C++ Homework Help
 
Gradient boosting for regression problems with example basics of regression...
Gradient boosting for regression problems with example   basics of regression...Gradient boosting for regression problems with example   basics of regression...
Gradient boosting for regression problems with example basics of regression...
 
Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...Modified monte carlo technique for confidence limits of system reliability us...
Modified monte carlo technique for confidence limits of system reliability us...
 
Telecom Churn Analysis
Telecom Churn AnalysisTelecom Churn Analysis
Telecom Churn Analysis
 
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
 
Predict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an OrganizationPredict Backorder on a supply chain data for an Organization
Predict Backorder on a supply chain data for an Organization
 
Chap 5 c++
Chap 5 c++Chap 5 c++
Chap 5 c++
 
PVS-Studio team is about to produce a technical breakthrough, but for now let...
PVS-Studio team is about to produce a technical breakthrough, but for now let...PVS-Studio team is about to produce a technical breakthrough, but for now let...
PVS-Studio team is about to produce a technical breakthrough, but for now let...
 
Higgs Boson Challenge
Higgs Boson ChallengeHiggs Boson Challenge
Higgs Boson Challenge
 
Data_Mining_Exploration
Data_Mining_ExplorationData_Mining_Exploration
Data_Mining_Exploration
 
Scientific calculator project in c language
Scientific calculator project in c languageScientific calculator project in c language
Scientific calculator project in c language
 
Steady state CFD analysis of C-D nozzle
Steady state CFD analysis of C-D nozzle Steady state CFD analysis of C-D nozzle
Steady state CFD analysis of C-D nozzle
 

Mais de BarryK88

Data mining test notes (back)
Data mining test notes (back)Data mining test notes (back)
Data mining test notes (back)BarryK88
 
Data mining test notes (front)
Data mining test notes (front)Data mining test notes (front)
Data mining test notes (front)BarryK88
 
Data mining Computerassignment 3
Data mining Computerassignment 3Data mining Computerassignment 3
Data mining Computerassignment 3BarryK88
 
Data mining assignment 2
Data mining assignment 2Data mining assignment 2
Data mining assignment 2BarryK88
 
Data mining assignment 4
Data mining assignment 4Data mining assignment 4
Data mining assignment 4BarryK88
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3BarryK88
 
Data mining assignment 5
Data mining assignment 5Data mining assignment 5
Data mining assignment 5BarryK88
 
Data mining assignment 6
Data mining assignment 6Data mining assignment 6
Data mining assignment 6BarryK88
 
Data mining assignment 1
Data mining assignment 1Data mining assignment 1
Data mining assignment 1BarryK88
 
Data mining Computerassignment 2
Data mining Computerassignment 2Data mining Computerassignment 2
Data mining Computerassignment 2BarryK88
 
Semantic web final assignment
Semantic web final assignmentSemantic web final assignment
Semantic web final assignmentBarryK88
 
Semantic web assignment 3
Semantic web assignment 3Semantic web assignment 3
Semantic web assignment 3BarryK88
 
Semantic web assignment 2
Semantic web assignment 2Semantic web assignment 2
Semantic web assignment 2BarryK88
 
Semantic web assignment1
Semantic web assignment1Semantic web assignment1
Semantic web assignment1BarryK88
 

Mais de BarryK88 (14)

Data mining test notes (back)
Data mining test notes (back)Data mining test notes (back)
Data mining test notes (back)
 
Data mining test notes (front)
Data mining test notes (front)Data mining test notes (front)
Data mining test notes (front)
 
Data mining Computerassignment 3
Data mining Computerassignment 3Data mining Computerassignment 3
Data mining Computerassignment 3
 
Data mining assignment 2
Data mining assignment 2Data mining assignment 2
Data mining assignment 2
 
Data mining assignment 4
Data mining assignment 4Data mining assignment 4
Data mining assignment 4
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3
 
Data mining assignment 5
Data mining assignment 5Data mining assignment 5
Data mining assignment 5
 
Data mining assignment 6
Data mining assignment 6Data mining assignment 6
Data mining assignment 6
 
Data mining assignment 1
Data mining assignment 1Data mining assignment 1
Data mining assignment 1
 
Data mining Computerassignment 2
Data mining Computerassignment 2Data mining Computerassignment 2
Data mining Computerassignment 2
 
Semantic web final assignment
Semantic web final assignmentSemantic web final assignment
Semantic web final assignment
 
Semantic web assignment 3
Semantic web assignment 3Semantic web assignment 3
Semantic web assignment 3
 
Semantic web assignment 2
Semantic web assignment 2Semantic web assignment 2
Semantic web assignment 2
 
Semantic web assignment1
Semantic web assignment1Semantic web assignment1
Semantic web assignment1
 

Data mining Computerassignment 1

  • 1. Data mining ‘REGRESSION: CPU Performance’ Visualized data with WEKA COMPUTER ASSIGNMENT 1 BARRY KOLLEE 10349863
  • 2. Regression  |  CPU  performance     1. Do you think that ERP should be at least partially predictable from the input attributes? Not in all cases. This is only possible if we’re able to see correlation between the two attributes that we compare. In case both values correlate with each other we can state that we can predict certain values from the input attribute. 2. Do any attributes exhibit significant correlations? I’ve loaded up the delivered database file into WEKA. With visualising the data as a graph (which shows the correlation between all attributes) I’m seeing the plotted graphs which is listed below. To see correlation between all ‘dots’ it is necessary to see a linear pattern. The following correlated graphs seems to correlate with ERP; respectively MYCT, MMIN and MMAX: • Green MMAX, with MMAX plotted on the X-axis I see a pattern which is increasing slowly at first and after words it increases rapidly. If we swap the y and x axis we see the opposite result. It starts with increasing rapidly and after words it increases slowly. • Blue MYCT, with MYCT plotted on my x-axis I see a pattern within the correlation between ERP and MYCT. The pattern look like a (1/n) math graph where we start of with a high value. When increasing the x-axis you see a direct decrease in the pattern which is going to the ‘zeropoint’ of the Y-axis. When increasing the x-axis even more we don’t see the slope decreasing anymore. If we swap the x and y axis we see a similar pattern. • Red MMIN, the pattern which I see within MMIN is similar to the one of MMAX. 2
  • 3. Regression  |  CPU  performance     3. Now we have a feel for the data and we will try fitting a simple linear regression model to the data. On the Classify tab, select Choose > functions > LinearRegression. • Use the default options and click Start. This will use 10-fold cross-validation to fit the linear regression model. Examine the results: • Record the Root relative squared error and the Relative absolute error. The Relative squared error is computed by dividing (normalizing) the sum of the squared prediction errors by the sum of the prediction errors obtained by always predicting the mean. The Root relative squared error is obtained by taking the square root of the Relative squared error. The Relative absolute error is similar to the Relative squared error, but uses absolute values rather than squares. Therefore, if we have a relative error of 100%, the learned model is no better than this very dumb predictor. When I perform the linear regression function onto the ERP attribute I’m getting the following information about this attribute. The ‘Root relative squared error’ is given in red. Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8 Relation: cpu-weka.filters.unsupervised.attribute.Remove-R1 Instances: 209 Attributes: 7 MYCT MMIN MMAX CACH CHMIN CHMAX ERP Test mode:10-fold cross-validation === Classifier model (full training set) === Linear Regression Model ERP = 0.0661 * MYCT + 0.0142 * MMIN + 0.0066 * MMAX + 0.4871 * CACH + 1.1868 * CHMAX + -66.5968 Time taken to build model: 0 seconds === Cross-validation === === Summary === Correlation coefficient 0.928 Mean absolute error 35.4878 Root mean squared error 57.5296 Relative absolute error 40.4842 % Root relative squared error 37.1725 % Total Number of Instances 209       The Root relative squared error looks pretty high. That’s because we take all of the attributes into account and we fit that into our calculation. 
You can also see that five attributes are taken into account in the scope of our linear regression model (CHMIN does not appear in it). Below the model is written in the form y = ax + b, which represents our linear regression graph; with all these attributes included we eventually obtain a correlation coefficient of 0.928. The calculation looks like:

y  = ax + b
ax = 0.0661 * MYCT + 0.0142 * MMIN + 0.0066 * MMAX + 0.4871 * CACH + 1.1868 * CHMAX
b  = -66.5968
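Applying the fitted model is just evaluating this weighted sum. A small Python sketch using the coefficients above (the machine's attribute values are hypothetical, invented for illustration):

```python
# Coefficients of the Weka linear regression model for ERP.
WEIGHTS = {"MYCT": 0.0661, "MMIN": 0.0142, "MMAX": 0.0066,
           "CACH": 0.4871, "CHMAX": 1.1868}
INTERCEPT = -66.5968

def predict_erp(instance):
    """Weighted sum of the attribute values plus the intercept."""
    return sum(w * instance[name] for name, w in WEIGHTS.items()) + INTERCEPT

# Hypothetical machine: 125 ns cycle time, 256/6000 KB min/max main
# memory, 16 KB cache, 24 maximum channels.
machine = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 16, "CHMAX": 24}
print(round(predict_erp(machine), 2))  # -> 21.18
```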
I expect that we can build a better-fitting linear regression model if we only take the attributes into account that correlate best with ERP, as found in answer 2. To achieve this we take only MMIN and MMAX into account, because these attributes appear to correlate best given the output in answer 2. I built another linear regression model using only the MMIN and MMAX attributes; its output is given below:

=== Run information ===

Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     cpu-weka.filters.unsupervised.attribute.Remove-R1-weka.filters.unsupervised.attribute.Remove-R1,4-6
Instances:    209
Attributes:   3
              MMIN
              MMAX
              ERP
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

ERP =
      0.0128 * MMIN +
      0.0087 * MMAX +
    -39.814

Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===

Correlation coefficient          0.9022
Mean absolute error             39.8811
Root mean squared error         66.584
Relative absolute error         45.4961 %
Root relative squared error     43.023  %
Total Number of Instances      209

My assumption was actually wrong. When taking only MMIN and MMAX into account, the correlation coefficient is lower and the error rates are higher; e.g. the Mean absolute error, which gives the average difference between the actual and predicted values over all test cases, has increased. The Root relative squared error has also increased, by about 6 percentage points.

4. Did you expect such a performance given your earlier observations? Hint: We are fitting a linear model.

Because we are fitting a linear model, we are searching for the attributes that correlate best with ERP. The quality of the fit is clearly visible in the correlation coefficient: a value of ca. 0.93 is really close to 1, which is the best value possible. However, the root relative squared error is fairly high. To obtain a better-fitting linear regression model we should only take attributes into account that correlate best with ERP; this should result in a correlation coefficient closer to 1 and an error rate closer to 0 %. However, my observations when using only MMIN and MMAX were not that hopeful. Perhaps that is because the errors are averaged out when we include more attributes: using more attributes seems to decrease the error rate. On the other hand, I would have expected that including more attributes would make the model more error-sensitive.
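For intuition on what LinearRegression actually computes, the one-attribute case has a simple closed form: the slope is the covariance of x and y divided by the variance of x. A minimal Python sketch with invented points (Weka's multi-attribute version solves the analogous normal equations, plus attribute selection and the small ridge value shown in the scheme line, -R 1.0E-8):

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance of x and y over variance of x.
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    # Intercept: the line passes through the point of means.
    b = my - a * mx
    return a, b

# Invented points lying exactly on y = 2x + 1.
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # -> 2.0 1.0
```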
5. Above we deleted the vendor variable. However, we can use nominal attributes in regression by converting them to numeric. The standard way of doing so is to replace the nominal variable with a bunch of binary variables of the form "is_first_nominal_value, is_second_nominal_value" and so on. Reload the unmodified data file cpu.arff.

• On the Preprocess tab select Choose > filters > unsupervised > attribute > NominalToBinary and click Apply. This replaces the vendor variable with 30 binary variables and we now have 37 attributes (we started with 8). Now train a linear regression model as in (4) and examine the results.
• Record the Relative absolute error and the Root relative squared error.

Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     cpu-weka.filters.unsupervised.attribute.NominalToBinary-Rfirst-last
Instances:    209
Attributes:   37
              vendor=adviser
              vendor=amdahl
              vendor=apollo
              vendor=basf
              vendor=bti
              vendor=burroughs
              vendor=c.r.d
              vendor=cdc
              vendor=cambex
              vendor=dec
              vendor=dg
              vendor=formation
              vendor=four-phase
              vendor=gould
              vendor=hp
              vendor=harris
              vendor=honeywell
              vendor=ibm
              vendor=ipl
              vendor=magnuson
              vendor=microdata
              vendor=nas
              vendor=ncr
              vendor=nixdorf
              vendor=perkin-elmer
              vendor=prime
              vendor=siemens
              vendor=sperry
              vendor=sratus
              vendor=wang
              MYCT
              MMIN
              MMAX
              CACH
              CHMIN
              CHMAX
              ERP
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

ERP =
   -132.1272 * vendor=adviser +
    -34.3319 * vendor=burroughs +
    -52.3128 * vendor=gould +
    -35.8202 * vendor=honeywell +
    -16.7597 * vendor=ibm +
   -144.1856 * vendor=microdata +
    -22.7172 * vendor=nas +
     41.5185 * vendor=sperry +
      0.0696 * MYCT +
      0.0167 * MMIN +
      0.0055 * MMAX +
      0.6304 * CACH +
     -1.5416 * CHMIN +
      1.6106 * CHMAX +
    -57.432

Time taken to build model: 0.02 seconds

=== Cross-validation ===
=== Summary ===
Correlation coefficient          0.9252
Mean absolute error             35.9725
Root mean squared error         58.5821
Relative absolute error         41.0372 %
Root relative squared error     37.8525 %
Total Number of Instances      209

6. Compare the performance to the one we had previously. Did adding the binarized vendor variable help?

The errors of the first linear model were:

Relative absolute error         40.4842 %
Root relative squared error     37.1725 %

The errors of the latest linear regression model are:

Relative absolute error         41.0372 %
Root relative squared error     37.8525 %

It looks like the error rate has only increased. I think that is because we now take many more attributes into account, which makes our slope (the a in y = ax + b) more complex and error-sensitive. I predict that the error rate would be lower if we only took attributes into account that correlate best with ERP.
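The NominalToBinary transformation used in question 5 amounts to a one-hot encoding, which can be sketched directly. The vendor list below is a hypothetical subset of the 30 values, for illustration only:

```python
def nominal_to_binary(value, domain):
    """One-hot encode a nominal value over its domain, as Weka's
    NominalToBinary filter does for each 'vendor=<name>' column."""
    return [1 if value == d else 0 for d in domain]

# Small invented subset of the 30 vendor values.
vendors = ["amdahl", "ibm", "nas", "sperry"]
print(nominal_to_binary("ibm", vendors))  # -> [0, 1, 0, 0]
```

Exactly one of the binary columns is 1 per instance, so the regression learns one additive offset per vendor, which is what the vendor terms in the model above express.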