CMP: Data Mining and Statistics within the Health Services                                                                                                                                        19/02/2010




        Data Mining and Statistics Within the Health Services

        Tutorial for Weka, a data mining tool

        Dr. Wenjia Wang
        School of Computing Sciences
        University of East Anglia

        Data    Pre-processing    Data Mining    Knowledge

        Content
        1. Introduction to Weka
        2. Data Mining Functions and Tools
        3. Data Format
        4. Hands-on Demos
             4.1 Weka Explorer
                 • Classification
                 • Attribute (feature) Selection
             4.2 Weka Experimenter
             4.3 Weka KnowledgeFlow
        5. Summary


       Data Mining & Statistics within the Health Services
                                                                                                               Data Mining & Statistics within the Health Services   Weka Tutorial (Dr. Wenjia Wang)      2




        1. Introduction to WEKA
        • An open-source collection of many data mining and machine learning
          algorithms, including
              – data pre-processing
              – classification
              – clustering
              – association rule extraction
        • Created by researchers at the University of Waikato in New Zealand
        • Java-based (also open source).

        Weka Main Features
        • 49 data preprocessing tools
        • 76 classification/regression algorithms
        • 8 clustering algorithms
        • 15 attribute/subset evaluators + 10 search algorithms for feature selection
        • 3 algorithms for finding association rules
        • 3 graphical user interfaces
              – “The Explorer” (exploratory data analysis)
              – “The Experimenter” (experimental environment)
              – “The KnowledgeFlow” (new process-model-inspired interface)




        Weka: Download and Installation
        • Download Weka (the stable version) from
               http://www.cs.waikato.ac.nz/ml/weka/
               – Choose a self-extracting executable (including Java VM)
               – (If you are interested in modifying or extending Weka, there
                 is a developer version that includes the source code)
        • After the download is completed, run the self-extracting file to
          install Weka, and use the default settings.

        Starting Weka
        • From the Windows desktop,
               – Click “Start”, choose “All programs”,
               – Choose “Weka 3.6” to start Weka
               – Then the first interface window appears: Weka GUI Chooser.





Dr. Wenjia Wang: Tutorial for DM tool Weka                                                                                                                                                                    1




        WEKA Application Interfaces
        • Explorer
            – preprocessing, attribute selection, learning, visualization
        • Experimenter
            – testing and evaluating machine learning algorithms
        • Knowledge Flow
            – visual design of KDD process
            – Explorer
        • Simple Command-line
            – A simple interface for typing commands






        2. Weka Functions and Tools
        •    Preprocessing Filters
        •    Attribute selection
        •    Classification/Regression
        •    Clustering
        •    Association discovery
        •    Visualization

        Load data file and Preprocessing
        • Load data file in formats: ARFF, CSV, C4.5, binary
        • Import from URL or SQL database (using JDBC)
        • Preprocessing filters
              –    Adding/removing attributes
              –    Attribute value substitution
              –    Discretization
              –    Time series filters (delta, shift)
              –    Sampling, randomization
              –    Missing value management
              –    Normalization and other numeric transformations
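To make three of the preprocessing filters listed above more concrete, here is a minimal sketch in plain Python of mean imputation for missing values, min-max normalization and equal-width discretization. It is not Weka's implementation; the function names and the `ages` values are hypothetical and chosen only for illustration.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale numeric values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, 30, None, 50, 60]      # hypothetical numeric attribute
filled = fill_missing_with_mean(ages)
print(filled)
print(min_max_normalize(filled))
print(equal_width_bins(filled, 4))
```

In Weka itself these steps are applied by choosing a filter in the Explorer's Preprocess panel rather than by writing code.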




        Feature Selection
        • Very flexible: arbitrary combination of search and evaluation methods
        • Search methods
             – best-first
             – genetic
             – ranking ...
        • Evaluation measures
             – ReliefF
             – information gain
             – gain ratio
        • Demo data: weather.nominal.arff

        Classification
        • Predicted target must be categorical
        • Implemented methods
             –    decision trees (J48, etc.) and rules
             –    Naïve Bayes
             –    neural networks
             –    instance-based classifiers …
        • Evaluation methods
             – test data set
             – cross-validation
        • Demo data: iris, contact lenses, labor, soybeans, etc.
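Information gain, one of the evaluation measures listed under Feature Selection, can be worked through by hand on the weather demo data. The sketch below is plain Python, not Weka code. It assumes the `outlook` and `play` columns hold the values of the commonly distributed 14-instance weather data set, so check them against your own copy of the file.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Class entropy minus the weighted class entropy within each attribute value."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [c for a, c in zip(attribute_values, labels) if a == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# 'outlook' and 'play' columns, assumed to match the standard weather data
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(round(information_gain(outlook, play), 3))  # 0.247
```

An attribute with a higher gain separates the classes better, which is why a ranking search can simply sort the attributes by this score.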








        Clustering
        • Implemented methods
            –    k-Means
            –    EM
            –    Cobweb
            –    X-means
            –    FarthestFirst…
        • Clusters can be visualized and compared to “true” clusters (if given)
        • Demo data:
            – any classification data may be used for clustering when
              its class attribute is filtered out.

        Regression
        • Predicted target is continuous
        • Methods
            – linear regression
            – neural networks
            – regression trees …
        • Demo data: cpu.arff
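For the clustering methods listed above, k-means is the simplest to sketch. The following plain Python version alternates the assignment and update steps on 2-D points. It is an illustrative sketch rather than Weka's implementation, and the points and starting centroids are made up.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                xs, ys = zip(*cluster)
                new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

With classification data, the same idea applies once the class attribute has been removed, and the resulting clusters can then be compared with the held-back class labels.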







        Weka: Pros and cons
        • Pros
               – Open source
                     • Free
                     • Extensible
                     • Can be integrated into other Java packages
               – GUIs (Graphical User Interfaces)
                     • Relatively easy to use
               – Features
                     • Run individual experiments, or
                     • Build KDD phases
        • Cons
               – Lack of proper and adequate documentation
               – Systems are updated constantly (Kitchen Sink Syndrome)

        3. WEKA data formats
        • Data can be imported from a file in various formats:
               – ARFF (Attribute Relation File Format) has two sections:
                     • the Header section defines attribute names, types and
                       relations.
                     • the Data section lists the data records.
               – CSV: Comma Separated Values (text file)
               – C4.5: the format used by the decision tree induction algorithm
                 C4.5; requires two separate files
                     • Names file: defines the names of the attributes
                     • Data file: lists the records (samples)
               – binary
        • Data can also be read from a URL or from an SQL database (using JDBC)




        Attribute Relation File Format (ARFF)

        An ARFF file consists of two distinct sections:
        • the Header section defines attribute names, types and relations.
          Each declaration starts with a keyword:
               @relation <data-name>
               @attribute <attribute-name> <type> or {range}
        • the Data section lists the data records. It starts with
               @data
               list of data instances
        • Any line starting with % is a comment.

        Breast Cancer data in ARFF

        % Breast Cancer data*: 286 instances (no-recurrence-events: 201, recurrence-events: 85)
        % Part 1: Definitions of attribute name, types and relations
        @relation breast-cancer
        @attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
        @attribute menopause {'lt40','ge40','premeno'}
        @attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59'}
        @attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30-32','33-35','36-39'}
        @attribute node-caps {'yes','no'}
        @attribute deg-malig {'1','2','3'}
        @attribute breast {'left','right'}
        @attribute breast-quad {'left_up','left_low','right_up','right_low','central'}
        @attribute 'irradiat' {'yes','no'}
        @attribute 'Class' {'no-recurrence-events','recurrence-events'}

        % Part 2: data section
        @data
        '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
        '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
        '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
        ……

        * source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer
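The two-section layout of an ARFF file is simple enough to read with a few lines of code. The sketch below is a minimal plain Python reader for the subset of ARFF shown above: `%` comments, `@relation`, nominal `@attribute` declarations and comma-separated `@data` rows. The function name `parse_arff` is hypothetical and the sample is cut down to two attributes of the breast cancer file. Numeric attributes, quoted values containing commas and sparse data are not handled.

```python
import re


def parse_arff(text):
    """Parse a small ARFF document with nominal attributes into a dict."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()  # ARFF keywords are case-insensitive
        if in_data:
            rows.append([v.strip().strip("'") for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip("'")
        elif lower.startswith("@attribute"):
            match = re.match(r"@attribute\s+(\S+)\s+\{(.*)\}", line, re.IGNORECASE)
            name = match.group(1).strip("'")
            values = [v.strip().strip("'") for v in match.group(2).split(",")]
            attributes.append((name, values))
        elif lower.startswith("@data"):
            in_data = True
    return {"relation": relation, "attributes": attributes, "data": rows}


# Shortened, two-attribute version of the breast cancer example
sample = """% Shortened breast cancer example
@relation breast-cancer
@attribute node-caps {'yes','no'}
@attribute 'Class' {'no-recurrence-events','recurrence-events'}
@data
'yes','recurrence-events'
'no','no-recurrence-events'
"""

parsed = parse_arff(sample)
print(parsed["relation"])
print(parsed["attributes"])
print(parsed["data"])
```

Weka's own loaders handle the full format, so a reader like this is only useful for inspecting small files or for understanding the structure.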









        4.1 WEKA Explorer
            • Click the Explorer on the Weka GUI Chooser
            • In the Explorer window,
                  – click the “Open File” button to open a data file from
                         • the folder where your data files are stored,
                           e.g. Breast Cancer data: breast_cancer.arff
                         Or (if you don’t have this data set),
                         • the data folder provided by the Weka package,
                           e.g. C:\Program Files\Weka-3-6\data
                                 using “iris.arff” or “weather.nominal.arff”

        Weka Explorer: open data file
            • Open the Breast Cancer data
            • Click an attribute, e.g. age; its distribution will then be
              displayed in a histogram.






        Weka Explorer: training classifiers

        After loading a data file, click “Classify”
        • Choose a classifier,
              – Under “Classifier”: click “Choose”, then a drop-down
                menu appears,
              – Click “trees” and select “J48”, a decision tree algorithm
        • Select a test option
              – Select “Percentage split”
                     • with default ratio 66% for training and 34% for testing
        • Click “Start” to train and test the classifier.
              – The training and testing information will be displayed
                in the classifier output window.

        Results
        • Testing results:
              – 97 cases used in the test
              – Correct: 66 (68%)
              – Wrong: 31 (32%)
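The figures on the Results slide follow from the percentage split. The short Python calculation below reproduces them from the 286 instances of the breast cancer data. It assumes the training size is the rounded 66% share of the instances, which is consistent with the 97 test cases reported but is not confirmed by the slides.

```python
def percentage_split(n_instances, train_percent):
    """Return (train_size, test_size) for a percentage split.

    Assumption: the training size is the rounded share of the instances
    and the remainder is used for testing.
    """
    train_size = round(n_instances * train_percent / 100)
    return train_size, n_instances - train_size


train_size, test_size = percentage_split(286, 66)
correct = 66                      # correct predictions reported on the slide
accuracy = correct / test_size

print(train_size, test_size)      # 189 97
print(f"{accuracy:.0%}")          # 68%
```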




        Options for results and model
        •    Point to the result list window and right-click.
        •    A menu will pop up showing all the options available for the model.

        View the tree
        •    Point to the result list window and right-click.
        •    Choose “Visualize tree”; the tree will then be displayed in
             another window.








         View classifier errors
         •    Right-click the result list.
         •    Choose “Visualize classifier errors”; a new window will then pop
              up to display the classifier’s errors.
               – Correctly predicted cases
               – Wrong cases

         Save the model and results
         •    Right-click the result list.
         •    Choose “Save model” and “Save result buffer” to save the
              classifier and the results to a folder on disk.




         Train a neural net
         •    Click “Choose” to select another function, e.g. “MultilayerPerceptron”,
              a type of neural net.
         •    Then click “Start” to train and test it. (Note: the training may
              take much longer.)
         •    The results appear better than those of the tree classifier.

         View the model’s ROC curve
         •    Right-click the result: “MultilayerPerceptron”
         •    Choose “Visualize threshold curve” and “recurrence-events”;
         •    The ROC curve will be displayed.
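To show what the threshold curve for the class “recurrence-events” represents, the sketch below builds ROC points by lowering the decision threshold one prediction at a time and then computes the area under the curve with the trapezoidal rule. It is plain Python, not Weka code, and the scores and labels are hypothetical.

```python
def roc_points(scores, labels, positive):
    """Return ROC points (FPR, TPR) obtained by lowering the decision threshold.

    Assumes every score is distinct; tied scores would need to be grouped.
    """
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical predicted probabilities for the class 'recurrence-events'
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = ["recurrence-events", "recurrence-events", "no-recurrence-events",
          "recurrence-events", "no-recurrence-events", "no-recurrence-events"]

curve = roc_points(scores, labels, positive="recurrence-events")
print(curve)
print(auc(curve))
```

An area close to 1 means the classifier ranks most positive cases above the negative ones; an area near 0.5 is no better than random ranking.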




         Select Attributes
       • Click “Select Attributes”
       • Choose an “attribute evaluator”
             – e.g. chiSquare
       • Choose a “Search Method”
       • Then click “Start”
       • The selected attributes are listed.

         4.2 Weka Experimenter
       • You can use the Experimenter to carry out experiments on multiple data sets using multiple methods, e.g. classifying
       • two data sets
             – Breast cancer
             – Iris
       • using two methods
             – Decision Tree: J48
             – Logistic
       • The experiment is configured under “Setup”, as shown in the screenshot.
       • Then click “Run”
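
To make the “attribute evaluator” step on the Select Attributes slide more concrete: a chi-squared evaluator scores each nominal attribute by how far its joint counts with the class depart from what independence would predict, and the attributes are then ranked by that score. The sketch below is plain Python written under that assumption, not Weka's code. The eight instances are toy values invented for illustration (only the attribute names are borrowed from the Breast Cancer data), and `chi_squared` is a hypothetical helper name.

```python
from collections import Counter


def chi_squared(attribute_values, class_values):
    """Chi-squared statistic of a nominal attribute against the class."""
    n = len(class_values)
    observed = Counter(zip(attribute_values, class_values))
    attr_totals = Counter(attribute_values)
    class_totals = Counter(class_values)
    stat = 0.0
    for a, a_count in attr_totals.items():
        for c, c_count in class_totals.items():
            # Count expected in this cell if attribute and class were independent.
            expected = a_count * c_count / n
            stat += (observed[(a, c)] - expected) ** 2 / expected
    return stat


# Toy data (invented): two candidate attributes and the class label.
node_caps = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
breast = ["left", "right", "left", "right", "left", "right", "left", "right"]
cls = ["rec", "rec", "no-rec", "no-rec", "no-rec", "no-rec", "no-rec", "rec"]

# Rank attributes by decreasing score, as a ranking search method would.
ranking = sorted(
    [("node-caps", chi_squared(node_caps, cls)),
     ("breast", chi_squared(breast, cls))],
    key=lambda item: -item[1],
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A higher score suggests a stronger association with the class. With so few instances the statistic is unreliable, so the ranking here is for illustration only.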








        Analysis of the results
        • Click “Analyse” to analyse the results,
          e.g. with a paired t-test for significance
        • Click “Experiment” to load the results of the experiment just run
        • Configure test: choose an appropriate test and parameters
        • Click “Perform test” and the test results are listed.

        4.3 KnowledgeFlow
        • Click KnowledgeFlow on the Weka GUI Chooser
        • A new window opens for building a KDD process.
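
The paired t-test named on the analysis slide compares two methods on the same runs by testing whether the mean of the per-run differences is distinguishable from zero. Below is a minimal sketch of the plain (uncorrected) statistic in Python. The accuracy figures are invented for illustration and are not results from this tutorial. The Experimenter also provides a corrected variant of the test, which this sketch does not reproduce.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Invented percent-correct values for two methods over the same five runs.
j48_accuracy = [70, 72, 68, 74, 71]
logistic_accuracy = [68, 69, 69, 70, 69]

t, df = paired_t_statistic(j48_accuracy, logistic_accuracy)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

For 4 degrees of freedom the two-sided 5% critical value is about 2.776. A statistic whose absolute value falls below that would not be reported as a significant difference at that level.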




        Steps for building a KDD process
       Major steps for building a process
       1. Add the required nodes
            1) Add a data source node from “DataSources”
                   – Right click to configure it with a data set
            2) Add a ClassAssigner node from “Evaluation” and a CrossValidationFoldMaker node
            3) Add a classifier, e.g. J48, from Classifiers
            4) Add a ClassifierPerformanceEvaluator node from “Evaluation”
            5) Add a TextViewer from “Visualisation”
       2. Connect the nodes
            – Right click the “DataSource” node and choose DataSet, then connect it to the
              ClassAssigner node,
            – do the same or similar to connect the other nodes.
       3. Run the process (using the default setups for each node)
            – Right click the DataSource node and choose “Start loading”; the process should run and
              the “Status” window should indicate whether the run completed correctly.
       4. View the results:
            – If the run completed correctly, right click the “Text Viewer” node and choose “Show
              results”; another window then pops up to show the results.

        A KDD process for Breast Cancer
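
The data flow that the connected nodes implement can be sketched in a few lines of Python: load data, mark the class column, split into cross-validation folds, train and test a classifier on each fold, then report the pooled result. This is an illustration of the flow only. The classifier is a majority-class stand-in rather than J48, the ten rows are invented, the folds are neither shuffled nor stratified (which Weka's fold maker may do), and every function name is hypothetical.

```python
from collections import Counter


def make_folds(instances, k):
    """Fold-maker stand-in: yield (train, test) pairs for k folds."""
    for fold in range(k):
        test = [row for i, row in enumerate(instances) if i % k == fold]
        train = [row for i, row in enumerate(instances) if i % k != fold]
        yield train, test


def train_majority(train, class_index):
    """Classifier stand-in: always predict the most frequent training class."""
    return Counter(row[class_index] for row in train).most_common(1)[0][0]


def evaluate(instances, k, class_index=-1):
    """Evaluator stand-in: accuracy pooled over all k test folds."""
    correct = 0
    for train, test in make_folds(instances, k):
        predicted = train_majority(train, class_index)
        correct += sum(1 for row in test if row[class_index] == predicted)
    return correct / len(instances)


# "Data source": invented rows of (node-caps, class); the class is the last column.
N, R = "no-recurrence-events", "recurrence-events"
data = [("no", N), ("no", N), ("yes", R), ("no", N), ("no", N),
        ("yes", R), ("no", N), ("no", N), ("yes", R), ("no", N)]

accuracy = evaluate(data, k=5)
# "Text viewer": print the summary.
print(f"Correctly classified: {accuracy:.0%}")
```

Each function plays the role of one node, and the function calls play the role of the connections drawn between nodes in the KnowledgeFlow window.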







        Results of the KDD process
        • Right click the “Text Viewer” node and choose “Show results”; another window then pops up to show the results.

        5. Weka Tutorial Summary
        Weka is open source data mining software that offers
        • Several graphical user interfaces for data mining
              – Explorer
              – Experimenter
              – KnowledgeFlow
        • Many functions and tools that include
              – Methods for classification:
                    decision trees, rule learners, naive Bayes, decision tables, locally weighted regression, SVMs, instance-based learners, logistic regression, multi-layer perceptron
              – Methods for regression/prediction:
                    linear regression, model tree generators, locally weighted regression, instance-based learners, decision tables, multi-layer perceptron
              – Ensemble schemes
                    • Bagging, boosting, stacking, RandomForest
              – Methods for clustering:
                    • K-means, EM and Cobweb
              – Methods for feature selection






Wekatutorial

  • 1. CMP: Data Mining and Statistics within the Health Services 19/02/2010 Data Mining and Statistics Content Within the Health Services 1. Introduction to Weka Tutorial for Weka 2. Data Mining Functions and Tools 3. Data Format a data mining tool 4. Hands-on Demos 4.1 Weka Explorer Dr. Wenjia Wang • Classification • Attribute( feature) Selection School of Computing Sciences 4.2 Weka Experimenter University of East Anglia 4.3 Weka KnowledgeFlow 5. Summary Data Pre-processing Data Mining Knowledge Data Mining & Statistics within the Health Services Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 2 1. Introduction to WEKA Weka Main Features • A collection of open source of many data • 49 data preprocessing tools mining and machine learning algorithms, • 76 classification/regression algorithms including • 8 clustering algorithms – pre-processing on data • 15 attribute/subset evaluators + 10 search – Classification: algorithms for feature selection. – clustering • 3 algorithms for finding association rules – association rule extraction • 3 graphical user interfaces – “The Explorer” (exploratory data analysis) • Created by researchers at the University of – “The Experimenter” (experimental environment) Waikato in New Zealand – “The KnowledgeFlow” (new process model inspired • Java based (also open source). interface) Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 3 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 4 Weka: Download and Installation Start the Weka • Download Weka (the stable version) from • From windows desktop, http://www.cs.waikato.ac.nz/ml/weka/ – click “Start”, choose “All programs”, – Choose a self-extracting executable (including Java VM) – Choose “Weka 3.6” to start Weka – Then the first interface – (If you are interested in modifying/extending weka there window appears: is a developer version that includes the source code) Weka GUI Chooser. 
• After download is completed, run the self- extracting file to install Weka, and use the default set-ups. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 5 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 6 Dr. Wenjia Wang: Tutorial for DM tool Weka 1
  • 2. CMP: Data Mining and Statistics within the Health Services 19/02/2010 WEKA Application Interfaces Weka Application Interfaces • Explorer – preprocessing, attribute selection, learning, visualiation • Experimenter – testing and evaluating machine learning algorithms • Knowledge Flow – visual design of KDD process – Explorer • Simple Command-line – A simple interface for typing commands Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 7 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 8 Load data file and 2. Weka Functions and Tools Preprocessing • Preprocessing Filters • Load data file in formats: ARFF, CSV, C4.5, binary • Attribute selection • Import from URL or SQL database (using JDBC) • Classification/Regression • Preprocessing filters • Clustering – Adding/removing attributes • Association discovery – Attribute value substitution – Discretization • Visualization – Time series filters (delta, shift) – Sampling, randomization – Missing value management – Normalization and other numeric transformations Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 9 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 10 Feature Selection Classification • Very flexible: arbitrary combination of search and • Predicted target must be categorical evaluation methods • Implemented methods • Search methods – decision trees(J48, etc.) and rules – best-first – Naïve Bayes – genetic – neural networks – ranking ... – instance-based classifiers … • Evaluation measures • Evaluation methods – ReliefF – test data set – information gain – gain ratio – crossvalidation • Demo data: weather_nominal.arff • Demo data: iris, contact lenses, labor, soybeans, etc. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 11 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 12 Dr. 
Wenjia Wang: Tutorial for DM tool Weka 2
  • 3. CMP: Data Mining and Statistics within the Health Services 19/02/2010 Clustering Regression • Implemented methods – k-Means • Predicted target is continuous – EM • Methods – Cobweb – X-means – linear regression – FarthestFirst… – neural networks • Clusters can be visualized and compared to “true” – regression trees … clusters (if given) • Demo data: • Demo data: cpu.arff, – any classification data may be used for clustering when its class attribute is filtered out. Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 13 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 14 Weka: Pros and cons 3. WEKA data formats • pros • Data can be imported from a file in various – Open source, formats: • Free – ARFF (Attribute Relation File Format) has two sections: • Extensible • the Header information defines attribute name, type and • Can be integrated into other java packages relations. – GUIs (Graphic User Interfaces) • the Data section lists the data records. • Relatively easier to use – CSV: Comma Separated Values (text file) – Features – C4.5: A format used by a decision induction algorithm • Run individual experiment, or C4.5, requires two separated files • Build KDD phases • Name file: defines the names of the attributes • Cons • Date file: lists the records (samples) – Lack of proper and adequate documentations – binary – Systems are updated constantly (Kitchen Sink Syndrome) • Data can also be read from a URL or from an SQL database (using JDBC) Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 15 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. 
Wenjia Wang) 16 Attribute Relation File Format (arff) Breast Cancer data in ARFF % Breast Cancer data*: 286 instances (no-recurrence-events: 201, recurrence- An ARFF file consists of two distinct sections: events: 85) % Part 1: Definitions of attribute name, types and relations • the Header section defines attribute name, type @relation breast-cancer @attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'} @attribute menopause {'lt40','ge40','premeno'} and relations, start with a keyword. @attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45- 49','50-54','55-59'} @Relation <data-name> @attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30- 32','33-35','36-39'} @attribute <attribute-name> <type> or {range} @attribute node-caps {'yes','no'} @attribute deg-malig {'1','2','3'} @attribute breast {'left','right'} • the Data section lists the data records, starts with @attribute breast-quad {'left_up','left_low','right_up','right_low','central'} @attribute 'irradiat' {'yes','no'} @Data @attribute 'Class' {'no-recurrence-events','recurrence-events'} list of data instances % Part 2: data section @data • Any line start with % is the comments. '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events' '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events' '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events' …… * source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 17 Data Mining & Statistics within the Health Services Weka Tutorial (Dr. Wenjia Wang) 18 Dr. Wenjia Wang: Tutorial for DM tool Weka 3
  • 4. Slides 19-24

    4.1 WEKA Explorer
    • Click the Explorer on the Weka GUI Chooser
    • On the Explorer window,
      – click the “Open File” button to open a data file from
        • the folder where your data files are stored,
          e.g. Breast Cancer data: breast_cancer.arff
        Or (if you don’t have this data set),
        • the data folder provided by the Weka package:
          e.g. C:\Program Files\Weka-3-6\data
          using “iris.arff” or “weather.nominal.arff”

    Weka Explorer: open data file
    • Open the Breast Cancer data
    • Click an attribute, e.g. age, and its distribution will be displayed in a histogram.

    Weka Explorer: training classifiers
    After loading a data file, click “Classify”
    • Choose a classifier,
      – Under “Classifier”: click “Choose”, then a drop-down menu appears,
      – Click “trees” and select “J48” – a decision tree algorithm
    • Select a test option
      – Select “Percentage split”
        • with the default ratio of 66% for training and 34% for testing
    • Click “Start” to train and test the classifier.
      – The training and testing information will be displayed in the classifier output window.

    Results
    • Testing results:
      • 97 cases used in the test.
        Correct: 66 (68%)
        Wrong: 31 (32%)

    Options for results and model
    • Point to the result list window and right click the mouse.
    • A menu will pop up to show all the options available for the model.

    View the tree
    • Point to the result list window and right click the mouse,
    • Choose “Visualize tree”, then the tree will be displayed in another window.
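The figures on the Results slide can be checked with a few lines of arithmetic. The sketch below is in Python and assumes that the training-set size is obtained by rounding 66% of the 286 instances to the nearest integer. That rounding rule is an assumption about how the split is computed, not something stated on the slides, and the helper name `percentage_split` is hypothetical.

```python
def percentage_split(n_instances, train_percent):
    """Return (n_train, n_test) for a percentage split.

    Assumption: the training size is rounded to the nearest integer.
    """
    n_train = round(n_instances * train_percent / 100)
    return n_train, n_instances - n_train


# Breast cancer data: 286 instances, default 66% training split
n_train, n_test = percentage_split(286, 66)

# Counts reported on the Results slide
correct, wrong = 66, 31
accuracy = correct / n_test
error_rate = wrong / n_test

print(n_train, n_test)
print(f"accuracy = {accuracy:.1%}, error = {error_rate:.1%}")
```

Under that assumption the split leaves 97 test cases, which agrees with the slide, and 66 of 97 correct gives the reported 68%.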
  • 5. Slides 25-30

    View classifier errors
    • Right click the result list,
    • Choose “Visualize classifier errors”; a new window will pop up to display the classifier’s errors.
      – Correctly predicted cases
      – Wrong cases

    Save the model and results
    • Right click on the result list
    • Choose “Save model” and “Save result buffer” to save the classifier and the results to a disk folder.

    Train a neural net
    Click “Choose” to select another function, e.g. “MultilayerPerceptron” - a type of neural net.
    Then click “Start” to train and test it. (Note: the training may take much longer.)
    The results seem better than those of the tree classifier.

    View the model’s ROC curve
    • Right click the result: “MultilayerPerceptron”
    • Choose “Visualize threshold curve” and “recurrence-events”;
    • The ROC curve will be displayed.

    Select Attributes
    • Click “Select Attributes”
    • Choose an “attribute evaluator”
      – e.g. chiSquare
    • Choose a “Search Method”
    • Then click “Start”
    • The selected attributes are listed.

    4.2 Weka Experimenter
    • You can use the Experimenter to carry out experiments on multiple data sets using multiple methods, e.g. classifying
      • two data sets
        – Breast cancer
        – Iris
      • using two methods
        – Decision Tree: J48
        – Logistic
    • The experiment is set up under “Setup” as shown in the screenshot.
    • Then click “Run”
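To give a feel for what a chi-squared attribute evaluator such as the chiSquare option under Select Attributes measures, the following Python sketch computes the chi-squared statistic between one nominal attribute and the class. It is a textbook formulation rather than Weka's implementation, and the function name and the four-record toy data are hypothetical.

```python
from collections import Counter


def chi_squared(values, labels):
    """Chi-squared statistic of a nominal attribute against the class.

    Sums (observed - expected)^2 / expected over every cell of the
    attribute-value by class contingency table.
    """
    n = len(values)
    observed = Counter(zip(values, labels))
    value_totals = Counter(values)
    label_totals = Counter(labels)
    stat = 0.0
    for v, v_count in value_totals.items():
        for c, c_count in label_totals.items():
            expected = v_count * c_count / n  # count expected under independence
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat


# Hypothetical toy records (not taken from the real data set)
classes = ["recurrence", "recurrence", "no-recurrence", "no-recurrence"]
node_caps = ["yes", "yes", "no", "no"]       # lines up perfectly with the class
breast = ["left", "right", "left", "right"]  # unrelated to the class

score_node_caps = chi_squared(node_caps, classes)
score_breast = chi_squared(breast, classes)
print(score_node_caps, score_breast)
```

An attribute whose values track the class gets a high score and one that is independent of the class scores zero, so ranking attributes by this statistic puts the more informative ones first.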
  • 6. Slides 31-36

    Analysis of the results
    • Click “Analyse” to analyse the results, e.g. with a paired t-test for significance
    • Click “Experiment”
    • Configure the test: choose an appropriate test and its parameters
    • Click “Perform test” and the test results are listed.

    4.3 KnowledgeFlow
    • Click KnowledgeFlow on the Weka GUI Chooser
    • A new window opens for building a KDD process.

    Steps for building a KDD process
    Major steps for building a process
    1. Adding required nodes
       1) Add nodes
       2) Add a data source node from “DataSources”
          1) Right click to configure it with a data set
       3) Add a ClassAssigner node from “Evaluation” and a CrossValidationFoldMaker node
       4) Add a classifier, e.g. J48, from Classifiers
       5) Add a ClassifierPerformanceEvaluator node from “Evaluation”
       6) Add a text viewer from “Visualisation”
    2. Connect the nodes
       – Right click the “DataSource” node and choose DataSet, then connect it to the ClassAssigner node,
       – do the same or similar to connect the other nodes.
    3. Run the process (using the default setups for each node)
       – Right click the DataSource node and choose “Start loading”; the process should run and the “Status” window should indicate whether the run completed correctly.
    4. View the results:
       – If the run completed correctly, right click the “Text Viewer” node and choose “Show results”; another window pops up to show the results.

    A KDD process for Breast Cancer

    Results of the KDD process
    • Right click the “Text Viewer” node and choose “Show results”; another window pops up to show the results.

    5. Weka Tutorial Summary
    Weka is open source data mining software that offers
    • Several GUIs for data mining
      – Explorer
      – Experimenter
      – KnowledgeFlow
    • Many functions and tools that include
      – Methods for classification:
        decision trees, rule learners, naive Bayes, decision tables, locally weighted regression, SVMs, instance-based learners, logistic regression, multi-layer perceptron
      – Methods for regression/prediction:
        linear regression, model tree generators, locally weighted regression, instance-based learners, decision tables, multi-layer perceptron
      – Ensemble schemes
        • Bagging, boosting, stacking, RandomForest
      – Methods for clustering:
        • K-means, EM and Cobweb
      – Methods for feature selection
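As a companion to the paired t-test mentioned in the Experimenter's analysis step (section 4.2), the Python sketch below computes a plain paired t statistic from per-run accuracies of two classifiers. The accuracy figures are hypothetical and are not results from the tutorial. The Experimenter may apply a corrected variant of the test, so treat this only as an illustration of the basic idea.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Plain paired t statistic for two equally long lists of scores."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Mean difference divided by its standard error
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Hypothetical per-run accuracies (in percent) for two classifiers
j48_scores = [72, 75, 70, 74]
logistic_scores = [70, 72, 71, 70]

t_stat = paired_t_statistic(j48_scores, logistic_scores)
degrees_of_freedom = len(j48_scores) - 1
print(f"t = {t_stat:.3f} with {degrees_of_freedom} degrees of freedom")
```

The statistic is then compared with the t distribution for the given degrees of freedom to decide whether the difference between the two methods is significant at the chosen level.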