IT for Business Intelligence
Term Paper on Data Mining Techniques




 Prepared By:
 Niloy Ghosh

 Roll No: 10BM60054

 Second Year, MBA

 Vinod Gupta School of Management (VGSOM)

 IIT Kharagpur
Introduction
The purpose of this term paper is to demonstrate data mining techniques using the software tool
WEKA. Data mining aims to transform large amounts of data into meaningful patterns and rules.
Deriving meaning from vast amounts of data has numerous business applications and is generating
a tremendous amount of interest.

Waikato Environment for Knowledge Analysis (WEKA) is free, open-source software that can be
used to mine data and generate useful information. WEKA's native input format is the
Attribute-Relation File Format (ARFF), a flat file format in which the attributes and their types
are declared first, followed by the data itself.
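
To make the ARFF layout concrete, the following is a minimal sketch in Python that renders a
header followed by data rows. The function name to_arff, the relation name diagnosis_demo and the
two sample rows are assumptions made up for illustration; they are not taken from the datasets
used in this paper.

```python
def to_arff(relation, attributes, rows):
    """Render a minimal ARFF document.

    attributes: list of (name, kind) pairs, where kind is either the string
    "numeric" or a list of the allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            # Numeric (or other built-in) attribute type.
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attribute: allowed values are listed in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    # The data section follows the attribute declarations.
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-row example; the values are invented for illustration.
example = to_arff(
    "diagnosis_demo",
    [("temperature", "numeric"), ("nausea", ["yes", "no"])],
    [(36.6, "no"), (40.1, "yes")],
)
print(example)
```

The header declares each attribute and its type before the @data marker, which mirrors the
"type first, data second" structure described above.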

In this paper, two techniques, Linear Regression and Decision Tree, are discussed with examples.
The sources of the data used to demonstrate the techniques are listed in the References section.



Technique I
Linear Regression

Linear regression is used to predict the value of a dependent variable from the values of a
number of independent variables. In this example, the model tries to predict housing prices in
the Boston area.



Description of dataset

The dataset contains details about housing in the Boston area. It contains 14 variables, which
are defined as follows.

    1.    CRIM: per capita crime rate by town
    2.    ZN:    proportion of residential land zoned for lots over 25,000 sq.ft.
    3.    INDUS: proportion of non-retail business acres per town
    4.    CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    5.    NOX: nitric oxides concentration (parts per 10 million)
    6.    RM:    average number of rooms per dwelling
    7.    AGE:   proportion of owner-occupied units built prior to 1940
    8.    DIS:   weighted distances to five Boston employment centres
    9.    RAD: index of accessibility to radial highways
    10.   TAX: full-value property-tax rate per $10,000
    11.   PTRATIO: pupil-teacher ratio by town
    12.   B:     1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13.   LSTAT: Percentage of lower status of the population
    14.   MEDV: Median value of owner-occupied homes in $1000's

The objective is to predict the housing values (i.e. the variable MEDV, which appears as CLASS in
the WEKA output) using Linear Regression.
Output

On running the model in WEKA, the following output was obtained.



=== Run information ===



Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8

Relation: housing

Instances: 506

Attributes: 14

        CRIM

        ZN

        INDUS

        CHAS

        NOX

        RM

        AGE

        DIS

        RAD

        TAX

        PTRATIO

        B

        LSTAT

        CLASS

Test mode:split 70.0% train, remainder test



=== Classifier model (full training set) ===
Linear Regression Model

CLASS = -0.1084 * CRIM + 0.0458 * ZN + 2.7188 * CHAS + -17.3768 * NOX + 3.8016 * RM + -1.4927 * DIS + 0.2996 * RAD +
-0.0118 * TAX + -0.9466 * PTRATIO + 0.0093 * B + -0.5225 * LSTAT + 36.342



Time taken to build model: 0.05 seconds



=== Evaluation on test split ===

=== Summary ===



Correlation coefficient            0.8547

Mean absolute error                3.3219

Root mean squared error              4.6107

Relative absolute error            52.2759 %

Root relative squared error         51.9447 %

Total Number of Instances           152



The experiment was conducted using a 70-30 split of the data: 70% of the 506 instances were used
to build the model and the remaining 30% (152 instances) were used to test it.
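
The split and the summary statistics can be sketched in plain Python as follows. This is an
illustration under stated assumptions, not WEKA's implementation: the function names, the seed
and the shuffling are hypothetical, so the rows selected will differ from those WEKA picked.

```python
import math
import random


def split_70_30(rows, seed=1):
    """Shuffle a copy of rows and return (train, test) with a 70/30 split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]


def regression_summary(actual, predicted):
    """Return correlation, MAE and RMSE for paired actual/predicted values."""
    n = len(actual)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    # Pearson correlation between actual and predicted values.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    corr = cov / math.sqrt(var_a * var_p)
    return {"correlation": corr, "mae": mae, "rmse": rmse}


# With 506 instances, a 70/30 split leaves 152 instances for testing.
train, test = split_70_30(range(506))
print(len(train), len(test))
```

The test-set size of 152 agrees with the "Total Number of Instances" line in the evaluation
summary.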



Interpretation

The predicted and actual values on the test split have a correlation coefficient of 0.8547
(about 85%), so the model is reasonably acceptable. The error values are quite high (a relative
absolute error of about 52%), but other methods have yielded only slightly better results.

The following conclusions can be made:

        •  The proportion of non-retail business (INDUS) and the age of the buildings (AGE) do not
           appear in the fitted equation, so they are not factors in the evaluation.
        •  As expected, crime rates, air pollution and (high) tax rates have a negative effect on the
           house value.
        •  The proportion of lower status population has a negative effect. Thus, low income
           neighbourhoods will have lower house values than affluent neighbourhoods.
        •  Interestingly, the pupil-teacher ratio has a negative and quite prominent effect
           (coefficient -0.9466). This suggests that educational facilities are a big concern when
           looking for a home and that people are ready to pay more for areas with better
           educational facilities.
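
The conclusions above can be read directly off the coefficients. The sketch below copies the
coefficients and intercept from the WEKA model shown earlier; the function names predict_medv and
effect_of_change are assumptions introduced for illustration.

```python
# Coefficients copied from the WEKA model in this paper (target: CLASS, i.e. MEDV).
# INDUS and AGE are absent because they do not appear in the fitted equation.
COEFFICIENTS = {
    "CRIM": -0.1084, "ZN": 0.0458, "CHAS": 2.7188, "NOX": -17.3768,
    "RM": 3.8016, "DIS": -1.4927, "RAD": 0.2996, "TAX": -0.0118,
    "PTRATIO": -0.9466, "B": 0.0093, "LSTAT": -0.5225,
}
INTERCEPT = 36.342


def predict_medv(tract):
    """Predicted median home value in $1000's for one tract (a dict of attributes)."""
    return INTERCEPT + sum(w * tract[name] for name, w in COEFFICIENTS.items())


def effect_of_change(name, delta):
    """Change in the prediction when one attribute moves by delta, others held fixed."""
    return COEFFICIENTS[name] * delta


# One extra pupil per teacher lowers the prediction by about $947.
print(effect_of_change("PTRATIO", 1))
# One extra room per dwelling raises it by about $3,800.
print(effect_of_change("RM", 1))
```

Because the variables are measured on different scales, the size of a coefficient shows the
effect per unit of that variable and is not by itself a ranking of importance.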
Technique II
Decision Tree

In data mining, a decision tree is a predictive model which maps observations about an item to
conclusions about the item's target value. In such trees, also known as classification trees, the
leaves represent class labels and the branches represent conjunctions of features that lead to
those class labels.

The WEKA classifier used in the example is J48. The model tries to make a diagnosis of urinary
system disease.



Description of dataset

The dataset contains the following variables.

    1.    Temperature of patient
    2.   Occurrence of nausea { yes, no }
    3.   Lumbar pain { yes, no }
    4.   Urine pushing (continuous need for urination) { yes, no }
    5.   Micturition pains { yes, no }
    6.   Burning of urethra, itch, swelling of urethra outlet { yes, no }
    7.   Decision: Inflammation of urinary bladder { yes, no }
    8.   Decision: Nephritis of renal pelvis origin { yes, no }

For the purpose of the demonstration, the variable ‘Nephritis of renal pelvis origin’ is removed
first. The analysis then creates a decision tree for the prediction of the inflammation of urinary
bladder.

Next, the variable ‘Inflammation of urinary bladder’ is removed instead and a new decision tree is
created for the prediction of Nephritis of renal pelvis origin.
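
Removing one decision variable before each run can be sketched as below. This is a plain-Python
illustration, not WEKA's filter; the helper remove_attribute and the sample row are hypothetical.
The attribute names follow the WEKA output, and the 1-based indices 8 and 7 match the Remove-R8
and Remove-R7 suffixes in the relation names reported in the output.

```python
def remove_attribute(header, rows, index):
    """Drop the attribute at a 1-based index from the header and from every row."""
    keep = [i for i in range(len(header)) if i != index - 1]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


HEADER = [
    "temperature", "nausea", "Lumbar_pain", "Urine_pushing",
    "Micturition_pains", "Burning_of_urethra",
    "Inflammation_of_urinary_bladder", "Nephritis_of_renal_pelvis_origin",
]
# A single invented row, for illustration only.
ROWS = [[35.5, "no", "yes", "no", "no", "no", "no", "no"]]

# Model 1: drop attribute 8 so that bladder inflammation is the last (class) attribute.
model1_header, model1_rows = remove_attribute(HEADER, ROWS, 8)
# Model 2: drop attribute 7 so that nephritis is the last (class) attribute.
model2_header, model2_rows = remove_attribute(HEADER, ROWS, 7)
print(model1_header[-1], model2_header[-1])
```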
Output

The WEKA output for prediction of the inflammation of urinary bladder was obtained as follows.



Model 1



=== Run information ===



Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2

Relation: diagnosis-weka.filters.unsupervised.attribute.Remove-R8

Instances: 120

Attributes: 7

         temperature

         nausea

         Lumbar_pain

         Urine_pushing

         Micturition_pains

         Burning_of_urethra

         Inflammation_of_urinary_bladder

Test mode:10-fold cross-validation



=== Classifier model (full training set) ===



J48 pruned tree

------------------



Urine_pushing = yes

| Micturition_pains = yes: yes (49.0)

| Micturition_pains = no
| | Lumbar_pain = yes: no (21.0)

| | Lumbar_pain = no: yes (10.0)

Urine_pushing = no: no (40.0)



Number of Leaves : 4



Size of the tree :    7



Time taken to build model: 0.01 seconds



=== Stratified cross-validation ===

=== Summary ===



Correctly Classified Instances            120        100       %

Incorrectly Classified Instances          0          0     %

Kappa statistic                           1

Mean absolute error                       0

Root mean squared error                   0

Relative absolute error                   0     %

Root relative squared error               0     %

Total Number of Instances                 120



=== Detailed Accuracy By Class ===




                          TP Rate     FP Rate       Precision      Recall   F-Measure   ROC Area   Class
                             1           0              1            1          1          1        yes
                             1           0              1            1          1          1        no
 Weighted Avg.               1           0              1            1          1          1
=== Confusion Matrix ===

                               a         b             <-- classified as
                               59        0             a = yes
                               0         61           b = no
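
The accuracy and Kappa statistic in the summary follow directly from the confusion matrix. As a
minimal sketch (the function name accuracy_and_kappa is an assumption, and the formula used is the
standard Cohen's kappa):

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Confusion matrix for Model 1: 59 'yes' and 61 'no', none misclassified.
model_1 = [[59, 0], [0, 61]]
print(accuracy_and_kappa(model_1))  # prints (1.0, 1.0)
```

With no off-diagonal entries, both the accuracy and the Kappa statistic equal 1, as reported in
the summary.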




The tree is visualised as shown below.




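The experiment was repeated for predicting the occurrence of Nephritis of renal pelvis origin.
As the test mode in the output shows, this run was evaluated on the training data rather than by
10-fold cross-validation.

The following results were obtained.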
Model 2



=== Run information ===



Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2

Relation: diagnosis-weka.filters.unsupervised.attribute.Remove-R7

Instances: 120

Attributes: 7

         temperature

         nausea

         Lumbar_pain

         Urine_pushing

         Micturition_pains

         Burning_of_urethra

         Nephritis_of_renal_pelvis_origin

Test mode:evaluate on training data



=== Classifier model (full training set) ===



J48 pruned tree

------------------



temperature <= 37.9: no (60.0)

temperature > 37.9

| Lumbar_pain = yes: yes (50.0)

| Lumbar_pain = no: no (10.0)



Number of Leaves : 3
Size of the tree :    5



Time taken to build model: 0 seconds



=== Evaluation on training set ===

=== Summary ===



Correctly Classified Instances          120        100       %

Incorrectly Classified Instances        0          0     %

Kappa statistic                         1

Mean absolute error                     0

Root mean squared error                 0

Relative absolute error                 0     %

Root relative squared error             0     %

Total Number of Instances               120



=== Detailed Accuracy By Class ===

                              TP Rate    FP Rate       Precision   Recall       F-Measure   ROC Area   Class
                                 1          0              1         1              1          1        yes
                                 1          0              1         1              1          1        no
      Weighted Avg.              1          0              1         1              1          1


=== Confusion Matrix ===



                                             a             b       <-- classified as
                                            50             0            a = yes
                                            0             70            b = no
The tree is visualised as shown below.




Interpretation

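In both models, all 120 instances were classified correctly. For Model 1 this was measured by
10-fold cross-validation; for Model 2 it was measured on the training data itself, which is a
less demanding test.

In Model 1, the differentiating factors were Urine pushing, Micturition pains and Lumbar pain.

In Model 2, the differentiating factors were Temperature and Lumbar pain.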
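
The two pruned trees are small enough to write out as plain rules. The sketch below transcribes
the branches from the J48 output above into Python; the function names are assumptions introduced
for illustration.

```python
def predict_bladder_inflammation(urine_pushing, micturition_pains, lumbar_pain):
    """Model 1: rules transcribed from the first J48 pruned tree."""
    if urine_pushing == "no":
        return "no"
    if micturition_pains == "yes":
        return "yes"
    # Urine pushing without micturition pains: lumbar pain decides.
    return "no" if lumbar_pain == "yes" else "yes"


def predict_nephritis(temperature, lumbar_pain):
    """Model 2: rules transcribed from the second J48 pruned tree."""
    if temperature <= 37.9:
        return "no"
    return "yes" if lumbar_pain == "yes" else "no"


print(predict_bladder_inflammation("yes", "no", "no"))
print(predict_nephritis(40.0, "yes"))
```

Lumbar pain appears as a test in both functions, although it points in opposite directions: it
argues against bladder inflammation in Model 1 and for nephritis in Model 2.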



Lumbar pain appears in both trees, which suggests that it is an important factor in diagnosing
both urinary conditions.



Conclusion
This paper barely scratches the surface of the possible applications of data mining. This powerful
technique can have unique applications in business as well as in academic research. It may
provide clues to numerous questions by allowing us to make sense of the ever-growing volume of
data.
References
  1. http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html

  2. http://archive.ics.uci.edu/ml/datasets/Acute+Inflammations

  3. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html

  4. http://en.wikipedia.org/wiki/Decision_tree_learning

 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

IT for Business Intelligence Term Paper

  • 1. IT for Business Intelligence
    Term Paper on Data Mining Techniques

    Prepared By: Niloy Ghosh
    Roll No: 10BM60054
    Second Year, MBA
    Vinod Gupta School of Management (VGSOM)
    IIT Kharagpur
  • 2. Introduction

    The purpose of this term paper is to demonstrate data mining techniques using the software tool WEKA. Data mining aims at transforming large amounts of data into meaningful patterns and rules. Deriving meaning from vast amounts of data has numerous business applications and is generating a tremendous amount of interest.

    Waikato Environment for Knowledge Analysis (WEKA) is free and open-source software that can be used to mine data and generate useful information. WEKA expects data in its native Attribute-Relation File Format (ARFF), a flat file format in which the type of data being loaded is defined first, followed by the data itself.

    In this paper two techniques, Linear Regression and Decision Tree, are discussed with examples. The source of the data used to demonstrate the techniques is provided in the reference section.

    Technique I
    Linear Regression

    Linear regression is used to predict the value of an unknown dependent variable based on the values of a number of independent variables. In this example, the model tries to predict housing prices in the Boston area.

    Description of dataset

    The dataset contains details about housing in the Boston area. The data contains 14 variables, which are defined as follows.

    1.  CRIM: per capita crime rate by town
    2.  ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
    3.  INDUS: proportion of non-retail business acres per town
    4.  CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    5.  NOX: nitric oxides concentration (parts per 10 million)
    6.  RM: average number of rooms per dwelling
    7.  AGE: proportion of owner-occupied units built prior to 1940
    8.  DIS: weighted distances to five Boston employment centres
    9.  RAD: index of accessibility to radial highways
    10. TAX: full-value property-tax rate per $10,000
    11. PTRATIO: pupil-teacher ratio by town
    12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13. LSTAT: percentage of lower status of the population
    14. MEDV: median value of owner-occupied homes in $1000's

    The objective is to predict the housing values (i.e. the variable MEDV) using Linear Regression.
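To illustrate the ARFF layout described in the introduction (attribute declarations first, then the data), the following minimal Python sketch builds an ARFF document for a handful of the housing attributes. The helper name to_arff and the two data rows are made up for illustration; they are not taken from the actual dataset.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text: the header (relation name and attribute types)
    comes first, followed by the @data section with one row per line."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# A subset of the housing attributes; CHAS is a nominal 0/1 dummy variable.
attributes = [
    ("CRIM", "numeric"),
    ("CHAS", "{0,1}"),
    ("RM", "numeric"),
    ("MEDV", "numeric"),
]

# Hypothetical rows, for illustration only (not real dataset values).
rows = [(0.1, 0, 6.0, 22.5), (2.3, 1, 5.5, 17.0)]

arff_text = to_arff("housing", attributes, rows)
print(arff_text)
```

The same pattern extends to all 14 variables; only the attribute list grows.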
  • 3. Output

    On running the model in WEKA, the following output was obtained. The target variable MEDV appears under the name CLASS.

    === Run information ===

    Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
    Relation:     housing
    Instances:    506
    Attributes:   14
                  CRIM
                  ZN
                  INDUS
                  CHAS
                  NOX
                  RM
                  AGE
                  DIS
                  RAD
                  TAX
                  PTRATIO
                  B
                  LSTAT
                  CLASS
    Test mode:    split 70.0% train, remainder test

    === Classifier model (full training set) ===
  • 4. Linear Regression Model

    CLASS =

         -0.1084 * CRIM +
          0.0458 * ZN +
          2.7188 * CHAS +
        -17.3768 * NOX +
          3.8016 * RM +
         -1.4927 * DIS +
          0.2996 * RAD +
         -0.0118 * TAX +
         -0.9466 * PTRATIO +
          0.0093 * B +
         -0.5225 * LSTAT +
         36.342

    Time taken to build model: 0.05 seconds

    === Evaluation on test split ===
    === Summary ===

    Correlation coefficient                  0.8547
    Mean absolute error                      3.3219
    Root mean squared error                  4.6107
    Relative absolute error                 52.2759 %
    Root relative squared error             51.9447 %
    Total Number of Instances              152

    The experiment was conducted using a 70-30 split of the data (70% used to build the model, 30% used to test it).

    Interpretation

    The results show a correlation coefficient of about 0.85 between predicted and actual values, so the model is reasonably acceptable. Though the error values are quite high (a relative absolute error of about 52%), other methods have yielded only slightly better results. The following conclusions can be made:

    - The proportion of non-retail business acres (INDUS) and the age of the buildings (AGE) do not appear in the fitted model, so they play no part in the prediction.
    - As expected, crime rates, air pollution and (high) tax rates have a negative effect on the house value.
    - The proportion of lower status population has a negative effect. Thus, low income neighbourhoods will have lower house values than affluent neighbourhoods.
    - Interestingly, the pupil-teacher ratio (PTRATIO) has a noticeably negative effect. This suggests that educational facilities are a major concern for home buyers, who are willing to pay more for areas with better educational facilities.
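As a rough illustration of how the fitted equation is applied, the sketch below encodes the coefficients reported by WEKA and evaluates them for one census tract. The coefficients and intercept come from the output above; the function name predict_medv and the tract values are hypothetical and chosen only to show the arithmetic.

```python
# Coefficients and intercept as reported in the WEKA output.
# INDUS and AGE are absent because they do not appear in the fitted model.
COEFFICIENTS = {
    "CRIM": -0.1084,
    "ZN": 0.0458,
    "CHAS": 2.7188,
    "NOX": -17.3768,
    "RM": 3.8016,
    "DIS": -1.4927,
    "RAD": 0.2996,
    "TAX": -0.0118,
    "PTRATIO": -0.9466,
    "B": 0.0093,
    "LSTAT": -0.5225,
}
INTERCEPT = 36.342


def predict_medv(tract):
    """Return the predicted median home value (in $1000's) for one tract."""
    return INTERCEPT + sum(coef * tract[name] for name, coef in COEFFICIENTS.items())


# Hypothetical tract, for illustration only (not a row from the dataset).
tract = {
    "CRIM": 0.1, "ZN": 0.0, "CHAS": 0, "NOX": 0.5, "RM": 6.0, "DIS": 4.0,
    "RAD": 4, "TAX": 300, "PTRATIO": 18.0, "B": 390.0, "LSTAT": 10.0,
}

base = predict_medv(tract)

# Raising the pupil-teacher ratio by one, with everything else fixed,
# shifts the prediction by exactly the PTRATIO coefficient.
crowded = predict_medv({**tract, "PTRATIO": tract["PTRATIO"] + 1})

print(round(base, 2))
print(round(crowded - base, 4))
```

Comparing the two predictions makes the PTRATIO bullet concrete: each additional pupil per teacher lowers the predicted value by about $950, all else being equal.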
  • 5. Technique II
    Decision Tree

    In data mining, a decision tree is a predictive model which maps observations about an item to conclusions about the item's target value. In these trees, also known as classification trees, the leaves represent class labels and the branches represent conjunctions of features that lead to those class labels. The WEKA classifier used in the example is J48. The model tries to make a diagnosis of urinary system disease.

    Description of dataset

    The dataset contains the following variables.

    1. Temperature of patient
    2. Occurrence of nausea { yes, no }
    3. Lumbar pain { yes, no }
    4. Urine pushing (continuous need for urination) { yes, no }
    5. Micturition pains { yes, no }
    6. Burning of urethra, itch, swelling of urethra outlet { yes, no }
    7. Decision: Inflammation of urinary bladder { yes, no }
    8. Decision: Nephritis of renal pelvis origin { yes, no }

    For the purpose of the demonstration, the variable ‘Nephritis of renal pelvis origin’ was first removed, and a decision tree was created to predict inflammation of the urinary bladder. Next, the variable ‘Inflammation of urinary bladder’ was removed instead, and a new decision tree was created to predict nephritis of renal pelvis origin.
  • 6. Output

    The WEKA output for prediction of the inflammation of urinary bladder was obtained as follows.

    Model 1

    === Run information ===

    Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     diagnosis-weka.filters.unsupervised.attribute.Remove-R8
    Instances:    120
    Attributes:   7
                  temperature
                  nausea
                  Lumbar_pain
                  Urine_pushing
                  Micturition_pains
                  Burning_of_urethra
                  Inflammation_of_urinary_bladder
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    Urine_pushing = yes
    |   Micturition_pains = yes: yes (49.0)
    |   Micturition_pains = no
  • 7.
    |   |   Lumbar_pain = yes: no (21.0)
    |   |   Lumbar_pain = no: yes (10.0)
    Urine_pushing = no: no (40.0)

    Number of Leaves  : 4
    Size of the tree  : 7

    Time taken to build model: 0.01 seconds

    === Stratified cross-validation ===
    === Summary ===

    Correctly Classified Instances         120      100 %
    Incorrectly Classified Instances         0        0 %
    Kappa statistic                          1
    Mean absolute error                      0
    Root mean squared error                  0
    Relative absolute error                  0 %
    Root relative squared error              0 %
    Total Number of Instances              120

    === Detailed Accuracy By Class ===

                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                   1         0         1           1        1           1          yes
                   1         0         1           1        1           1          no
    Weighted Avg.  1         0         1           1        1           1
  • 8. === Confusion Matrix ===

      a  b   <-- classified as
     59  0 |  a = yes
      0 61 |  b = no

    The tree is visualised as shown below.

    [Figure: J48 decision tree for Model 1]

    The same experiment was repeated for predicting the occurrence of Nephritis of renal pelvis origin. The following results were obtained.
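The Model 1 tree can be read as a short chain of if/else tests. The sketch below is a hand translation of the pruned tree printed by J48, not code produced by WEKA; the function name and the boolean encoding of the yes/no attributes are assumptions made for illustration. It also checks that the instance counts at the leaves add up to the class totals in the confusion matrix.

```python
def predict_bladder_inflammation(urine_pushing, micturition_pains, lumbar_pain):
    """Follow the Model 1 tree from root to leaf and return 'yes' or 'no'.

    Arguments are booleans standing in for the yes/no attribute values.
    """
    if not urine_pushing:
        return "no"                      # Urine_pushing = no: no (40.0)
    if micturition_pains:
        return "yes"                     # Micturition_pains = yes: yes (49.0)
    # Urine pushing without micturition pains: lumbar pain decides.
    return "no" if lumbar_pain else "yes"  # no (21.0) / yes (10.0)


# Instances reaching each leaf, as printed in the tree, grouped by class.
LEAF_COUNTS = {"yes": [49, 10], "no": [21, 40]}

total_yes = sum(LEAF_COUNTS["yes"])
total_no = sum(LEAF_COUNTS["no"])

print(predict_bladder_inflammation(True, False, True))
print(total_yes, total_no, total_yes + total_no)
```

The leaf totals (59 yes, 61 no, 120 overall) agree with the confusion matrix, which is consistent with every instance being classified correctly.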
  • 9. Model 2

    === Run information ===

    Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
    Relation:     diagnosis-weka.filters.unsupervised.attribute.Remove-R7
    Instances:    120
    Attributes:   7
                  temperature
                  nausea
                  Lumbar_pain
                  Urine_pushing
                  Micturition_pains
                  Burning_of_urethra
                  Nephritis_of_renal_pelvis_origin
    Test mode:    evaluate on training data

    === Classifier model (full training set) ===

    J48 pruned tree
    ------------------

    temperature <= 37.9: no (60.0)
    temperature > 37.9
    |   Lumbar_pain = yes: yes (50.0)
    |   Lumbar_pain = no: no (10.0)

    Number of Leaves  : 3
  • 10.
    Size of the tree  : 5

    Time taken to build model: 0 seconds

    === Evaluation on training set ===
    === Summary ===

    Correctly Classified Instances         120      100 %
    Incorrectly Classified Instances         0        0 %
    Kappa statistic                          1
    Mean absolute error                      0
    Root mean squared error                  0
    Relative absolute error                  0 %
    Root relative squared error              0 %
    Total Number of Instances              120

    === Detailed Accuracy By Class ===

                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                   1         0         1           1        1           1          yes
                   1         0         1           1        1           1          no
    Weighted Avg.  1         0         1           1        1           1

    === Confusion Matrix ===

      a  b   <-- classified as
     50  0 |  a = yes
      0 70 |  b = no
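The accuracy and Kappa statistic in the summary can be derived directly from the confusion matrix. The sketch below computes both using the standard definition of Cohen's kappa (observed agreement corrected for chance agreement). The function names are illustrative, and the second matrix is a made-up example included only to show what an imperfect classifier would score.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohen_kappa(matrix):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(row[j] for row in matrix) for j in range(len(matrix))]
    # Agreement expected by chance, given the row and column totals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1 - expected)


# Confusion matrix reported for Model 2 (rows: actual, columns: predicted).
model_2 = [[50, 0], [0, 70]]

# Hypothetical matrix with ten misclassifications, for comparison only.
imperfect = [[45, 5], [5, 65]]

print(accuracy(model_2), cohen_kappa(model_2))
print(round(accuracy(imperfect), 4), round(cohen_kappa(imperfect), 4))
```

A kappa of 1 means agreement is perfect even after discounting the agreement that the class proportions alone would produce.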
  • 11. The visual tree is as below.

    [Figure: J48 decision tree for Model 2]

    Interpretation

    In both models, 100% of the data was classified correctly. Note that Model 1 was evaluated with 10-fold cross-validation, whereas Model 2 was evaluated on the training data itself, so its figure measures fit rather than performance on unseen data. In Model 1, the differentiating factors were Urine pushing, Micturition pains and Lumbar pain. In Model 2, the differentiating factors were Temperature and Lumbar pain. Lumbar pain appears in both trees, which suggests that it is an important factor in diagnosing these urinary system conditions.

    Conclusion

    The paper barely scratches the surface of the possible applications of data mining. This powerful technique can have unique applications in business as well as in academic research. It may provide clues to numerous questions by allowing us to make sense of the ever-growing volume of data.
  • 12. References

    1. http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html
    2. http://archive.ics.uci.edu/ml/datasets/Acute+Inflammations
    3. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
    4. http://en.wikipedia.org/wiki/Decision_tree_learning