Data Processing Program
Student name: Neil B Dahlqvist
Title: Data processing program
File name: Project_csci.java
Professor: Dr Ray Hashemi
Contents
About the program .................................... 2
Looking at the Work Area ............................. 3
Using the File menu .................................. 4
Setting the program .................................. 5
Using the tools menu ................................. 6
Building an ID3 tree ................................. 9
Using testing tools .................................. 16
Entering new dataset ................................. 18
Options .............................................. 19
Explaining program files ............................. 20
Notes about the code ................................. 21
About the program
The data processing program creates a set of rules for a given dataset using
ID3, a decision tree learning algorithm developed by Ross Quinlan (1983), and
then tests the accuracy of those rules.
The application also selects the best group of data and removes
inconsistencies and conflicting values using data cleansing techniques. The
program offers many functions that facilitate data management and data
visualization. The graphical user interface is friendly, and the resulting
decision tree gives a representation of the concept that appeals to humans
because it makes the classification process self-evident.
Looking at the Work Area
The data processing program offers a friendly and intuitive graphical user
interface. This interface is equipped with a variety of menus and windows that
help the user work with the data with ease.
The GUI is divided into three main components:
Menu bar: It contains all the functions and tools necessary to manipulate the
dataset. You will find three tabs, File, Tools, and Options, which are
explained later.
Project Window: This window displays a spreadsheet view of the main dataset at
startup. The Project window's main task is to show the current data after it
has been processed by some function.
Figure 1. Data processing graphic user interface.
Output Window: The bottom panel is called the output window. It displays the
result of an operation, such as an entropy or correlation calculation, or the
description of an element such as a test file or the ID3 tree itself.
Output System Console: This is not part of the GUI. This window is used only
to enter user input after the Open file function is executed.
Using the File menu
The menu bar located at the top of the Project window (see Figure 1) contains
three tabs: File, Tools, and Options.
The File menu is the first item in the menu bar and contains commands for
handling files and for initializing the main dataset.
Figure 2 shows the items found in the File menu.
Data set: This item initializes the main dataset located in the src folder
inside the project folder
C:\Users\Username\Documents\NetBeansProjects\Project_csci\src. (See page 20,
Explaining program files.)
The data will be displayed as a spreadsheet in the Project window. You should
execute this function every time you start the program.
Note: You can start using the data processing program without running this
function, but it is not recommended.
Open: As in any other application, the Open function allows the user to open a
working file. You can browse for a file on your system and the application
will display it in the Project window. The data processing program accepts
only text files, so make sure to convert your dataset to .txt format.
(Figure 3)
Figure 2. File Menu items
Save: With the Save function you can save your cleansed data anywhere on your
system. The program lets you save the file in any format, including .xls
(Excel format). The data displayed in the Project and output windows will be
saved.
The program automatically saves data every time you process it. Note that the
temporary files located in the project folder are overwritten all the time, so
make sure to save the data manually.
Setting the program
Before you use the program for the first time, you may need to slightly modify
the source code. You can use NetBeans, Eclipse, or even Notepad to open the
Java file and change the code.
Follow these steps to avoid errors.
Figure 3. Open file window.
Setting Path file name and variables
The data processing program has a default dataset called data.txt, with 242
attributes and 999 records, located in the src folder. To use your own dataset
you must change the path name and the number of rows and columns.
- Reformat your dataset file as a text file. If it is an .xls file, Excel has
options for file conversion. It is a little tedious, but you must delete the
attribute names and keep only the data values; otherwise the program will
throw an exception.
- Rename the file to data.txt and place it in the src folder located in the
NetBeansProjects/Project_csci folder.
- Locate the Strings called path and path_n, which are at the beginning of the
source code below the line public class Project_csci extends JFrame{, and
change them as needed.
- Change the values of the rows and columns variables. They are below the path
file name line. The columns indicate the number of attributes and the rows
the number of records.
- The default names for the attributes are CerN, with N being the attribute
number. To change this, go to the fillnames function and change the names of
the attributes.
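The steps above amount to editing a handful of declarations near the top of Project_csci.java. A minimal sketch of what those fields might look like (illustrative only; the real class contains much more, and whether `path` includes the file name is an assumption):

```java
import javax.swing.JFrame;

// Sketch of the configuration fields described above. The actual source also
// declares a second string, path_n, which is set in the same way.
public class Project_csci extends JFrame {
    // Path to the default dataset; change this to match your system.
    // (Whether the file name is part of the string is an assumption.)
    static String path =
        "C:\\Users\\Username\\Documents\\NetBeansProjects\\Project_csci\\src\\data.txt";
    static int rows = 999;     // number of records in the dataset
    static int columns = 242;  // number of attributes

    public static void main(String[] args) {
        System.out.println(rows + " records x " + columns + " attributes");
    }
}
```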
Using the tools menu
The Tools menu provides functions to manage and process data. The first tools
are visual tools. They let you organize the information according to entropy,
correlation, or both. The data will be displayed in the Project window.
Show results: This function shows the correlation and entropy of the
attributes in the Project window.
Entropy and Correlation have a similar function: each displays its
corresponding data.
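For a binary decision attribute, the entropy the program reports can be computed with the standard Shannon formula. A sketch (illustrative code, not taken from Project_csci.java):

```java
public class EntropyDemo {
    // Shannon entropy (base 2) of a binary decision split:
    // p records with decision 1, n records with decision 0.
    static double entropy(int p, int n) {
        if (p == 0 || n == 0) return 0.0; // a pure split carries no uncertainty
        double total = p + n;
        double pp = p / total, pn = n / total;
        return -pp * Math.log(pp) / Math.log(2) - pn * Math.log(pn) / Math.log(2);
    }

    public static void main(String[] args) {
        System.out.println(entropy(5, 5)); // maximally uncertain split, ≈ 1.0
        System.out.println(entropy(9, 1)); // mostly pure split, ≈ 0.469
    }
}
```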
Figure 5 shows the Project window after the Show results function is executed.
Figure 4. Tools menu
Figure 5. Results show Attributes names, entropies and correlations.
Find: This tool allows the user to find an attribute or value according to a
criterion. You can search by attribute name, entropy, correlation, or
attribute number. The results are displayed in the output window, in the
following order: attribute name, correlation, and entropy.
The next tools let the user discriminate data, cleanse it and build decision trees in
order to formulate a set of rules to predict a decision based on the dataset.
Figure 6. Find tool searching for the Attribute called Cer12
Figure 7. Functions submenu items.
Sort collection, Sort entropy, Entropy, and Correlation are self-explanatory
functions. The main goal of the data processing system is the creation of a
set of rules based on your training data. To do this the program builds an ID3
decision tree. Before doing so, however, the user should cleanse the data to
get rid of any conflict or inconsistency that could lead to incongruent rules.
Building an ID3 Decision Learning Tree
The following steps explain the data cleaning process and the creation of a
pruned tree and a set of rules.
Intersection function: This tool selects the best data using thresholds,
picking the attributes with lower entropy and greater correlation than the
values indicated. This step is crucial for reducing the number of attributes
to work with.
Figure 8. Intersection function threshold selector.
It is a good idea to use the Show results function or the visual tools
discussed above when you pick the entropy and correlation thresholds.
The result_i.txt file is created automatically when you use the intersection
function; you can find it in the src folder. This is a temporary file, so do
not rely on it for future reference. You can make a copy of the file or use
the Save function in the File menu.
Note: If you need to select all the data, use 1.0 as the entropy threshold and
-1.0 as the correlation threshold. That way all the attributes will be
selected.
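The intersection rule described above can be sketched as a simple filter over the per-attribute entropy and correlation values. Illustrative only (class and method names are not from the actual source, and the strict less-than/greater-than comparisons follow the wording above):

```java
import java.util.ArrayList;
import java.util.List;

public class IntersectionDemo {
    // Keep the indices of attributes whose entropy is below entMax AND whose
    // correlation is above corMin.
    static List<Integer> intersect(double[] entropy, double[] corr,
                                   double entMax, double corMin) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < entropy.length; i++) {
            if (entropy[i] < entMax && corr[i] > corMin) kept.add(i);
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] ent = {0.30, 0.95, 0.10};
        double[] cor = {0.80, 0.20, 0.05};
        // Thresholds 0.5 / 0.4 keep only attribute 0.
        System.out.println(intersect(ent, cor, 0.5, 0.4));  // [0]
        // 1.0 / -1.0 selects everything, as the note above describes.
        System.out.println(intersect(ent, cor, 1.0, -1.0)); // [0, 1, 2]
    }
}
```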
Clean Data: Even though Clean Data is not in the Tools menu, it is one of the
most important tools. Once we have the intersection of attributes, the data
should be cleansed. Many of the records probably contain conflicting or
redundant values. These can cause a stack overflow error while creating the
ID3 decision tree, making the program run forever and eventually crash. Click
the Clean Data tab and you will see the records in the Project window as shown
in Figure 9.
Figure 10 shows the result of the Clean Data function.
Note: Remember to always use the Clean Data tool; otherwise there could be
complications while creating the ID3 decision learning tree.
Figure 9. Options menu (Clean Data)
The clean text file is automatically saved as r.txt, and you can find this
file in the src folder. If you need this file, make sure to save it using the
Save tool in the File menu. The r.txt file is a temporary file; it is
overwritten every time you cleanse new data.
As shown in Figure 10, the output window displays info about the new dataset.
This info includes the number of attributes and rows and the number of records
with decision 1 and decision 0.
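The cleansing step boils down to two checks: a record that repeats an earlier record is redundant, and records that share attribute values but disagree on the decision are conflicting. A sketch (illustrative, not the program's actual code), where the decision is the last column of each record:

```java
import java.util.*;

public class CleanDemo {
    // Drop exact duplicates and conflicting records (same attribute values,
    // different decision). The decision is the last column of each record.
    static List<int[]> clean(List<int[]> records) {
        // Map each attribute-value pattern to the set of decisions seen for it.
        Map<String, Set<Integer>> decisions = new LinkedHashMap<>();
        for (int[] r : records) {
            String key = Arrays.toString(Arrays.copyOf(r, r.length - 1));
            decisions.computeIfAbsent(key, k -> new HashSet<>()).add(r[r.length - 1]);
        }
        // Keep the first occurrence of each non-conflicting pattern.
        List<int[]> kept = new ArrayList<>();
        Set<String> emitted = new HashSet<>();
        for (int[] r : records) {
            String key = Arrays.toString(Arrays.copyOf(r, r.length - 1));
            if (decisions.get(key).size() == 1 && emitted.add(key)) kept.add(r);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> data = List.of(
            new int[]{1, 0, 1},  // kept
            new int[]{1, 0, 1},  // duplicate -> dropped
            new int[]{0, 1, 1},  // conflicts with the next record -> dropped
            new int[]{0, 1, 0}); // conflicting -> dropped
        System.out.println(clean(data).size()); // 1
    }
}
```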
Now we are ready to create our ID3 decision tree.
Build ID3: This is perhaps the most powerful function in the entire program.
Build ID3 creates a decision tree from the selected data and displays it in an
external window. Figure 11 shows the external window where the ID3 tree is
displayed.
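ID3 grows the tree by choosing, at each node, the attribute whose split yields the highest information gain, i.e. the largest drop in entropy. A sketch of that computation for binary attributes (illustrative; not taken from Project_csci.java):

```java
public class Id3Demo {
    // Shannon entropy (base 2) of a binary decision split.
    static double entropy(int p, int n) {
        if (p == 0 || n == 0) return 0.0;
        double t = p + n, pp = p / t, pn = n / t;
        return -pp * Math.log(pp) / Math.log(2) - pn * Math.log(pn) / Math.log(2);
    }

    // Information gain of splitting a node holding (p, n) records on a binary
    // attribute that sends (p1, n1) of them down its 1-branch.
    static double gain(int p, int n, int p1, int n1) {
        int total = p + n, t1 = p1 + n1, t0 = total - t1;
        double remainder = (double) t1 / total * entropy(p1, n1)
                         + (double) t0 / total * entropy(p - p1, n - n1);
        return entropy(p, n) - remainder;
    }

    public static void main(String[] args) {
        // A perfect split recovers the full entropy of the node...
        System.out.println(gain(5, 5, 5, 0)); // ≈ 1.0
        // ...while an uninformative attribute gains nothing.
        System.out.println(gain(5, 5, 2, 2)); // ≈ 0.0
    }
}
```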
Figure 10. r.txt file shows data without redundant or conflicted records.
The ID3 tree has the following components:
Internal node: This element contains a certain number of records with decision
1 and decision 0, along with other properties such as the name of the node,
its weight, etc. The internal node is represented as a green circle.
Link: This is a virtual link between branches. It is represented by a green
circle like the internal node; however, links have no properties.
Figure 11. ID3 decision tree components.
To display the properties of a node, hover the mouse over the internal node
and after a few seconds a panel will show up with all the info you need. Note
that if you do this with a link, nothing happens.
Leaf: The last element of the ID3 tree is the leaf. A leaf has no branches and
it can have a value of 1 or 0. Leaves have properties too, but they are not
displayed dynamically.
Every leaf has a function, or rule, which is shown the same way the properties
are. By clicking the leaf, you can read its corresponding rule.
Figure 12. Properties of internal node Cer66.
Figure 13. Leaf function shows up dynamically.
The ID3 decision tree has its own set of tools, which are placed at the top
left of the window. The first three tools manipulate the tree directly. The
user can expand the tree at once or step by step, and of course collapse it.
This is really helpful when dealing with a tree with many branches.
Figure 14. ID3 tree tool menu
The Create table option generates a table with all the properties of every
node and leaf of the tree. Figure 15 shows the ID3 tree data table.
Prune ID3: The next step is to prune the tree. By doing this, the tree becomes
less complex. This is our last step before we can test the generated rules.
Notice that the new tree has impure leaves. These leaves have replaced whole
sets of branches.
Now you can use the Generate rules tool. This will give us our set of rules,
which hopefully will predict the decision values.
Figure 15. ID3 tree table
Using testing tools
After the rules have been generated, it is imperative to test their
efficiency, that is, how well our rules predict the decision value.
The data processing program offers several testing tools, which can be found
in the Testing submenu of the Tools menu.
The user can choose among three test functions: EQTraining, NEQTraining, and
K-1 Testing. For more detailed explanations of these testing methods, please
consult any data mining textbook.
Figure 16. Set of rules displayed
EQ/NEQ Training test method: Click the tool item and a new panel will show up.
The user has to enter a percentage for the test file. After this is done, the
testing file manager will appear. Here we can view the info of all the test
and training files generated, along with their success rates.
The tool automatically saves all the test and training files in the project
folder. You can find these text files in the test and training folders located
in the Project_csci folder. (See page 20 for additional information.)
The Show table button displays a list of all the records, their
classification, and the match rate.
A success rate of 0.85 or higher means that the rules are reliable.
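The success rate itself is just the fraction of test records whose rule-predicted decision matches the actual one. A sketch (illustrative names, not the program's code):

```java
public class SuccessRateDemo {
    // Success rate: fraction of test records whose predicted decision
    // matches the actual decision.
    static double successRate(int[] predicted, int[] actual) {
        int hits = 0;
        for (int i = 0; i < predicted.length; i++)
            if (predicted[i] == actual[i]) hits++;
        return (double) hits / predicted.length;
    }

    public static void main(String[] args) {
        int[] predicted = {1, 1, 0, 1, 0, 1, 1, 0, 1, 1};
        int[] actual    = {1, 1, 0, 1, 1, 1, 1, 0, 1, 0};
        // 8 of 10 records match: below the 0.85 reliability bar.
        System.out.println(successRate(predicted, actual)); // 0.8
    }
}
```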
Figure 17. Testing tools
Figure 18. Test file input
K-1 Testing: This may be the best way to test the efficiency of our rule set.
Depending on the number of records, it may take several minutes to process the
data. K-1 testing does not generate any test files; it only displays the
record classification table and the information in the output window.
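K-1 (leave-one-out) testing holds out one record at a time, builds rules from all the others, and classifies the held-out record. The loop can be sketched as follows, with a trivial majority-vote classifier standing in for the real rule set (illustrative only):

```java
public class LeaveOneOutDemo {
    // Leave-one-out over the decision column: each record is classified by a
    // model built from the remaining records. Here a majority vote over the
    // other decisions stands in for the real ID3 rule set.
    static double leaveOneOut(int[] decisions) {
        int hits = 0;
        for (int i = 0; i < decisions.length; i++) {
            int ones = 0;
            for (int j = 0; j < decisions.length; j++)
                if (j != i && decisions[j] == 1) ones++;
            int predicted = (ones * 2 >= decisions.length - 1) ? 1 : 0;
            if (predicted == decisions[i]) hits++;
        }
        return (double) hits / decisions.length;
    }

    public static void main(String[] args) {
        // The lone minority record is misclassified: 4 of 5 correct.
        System.out.println(leaveOneOut(new int[]{1, 1, 1, 1, 0})); // 0.8
    }
}
```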
Entering new dataset
Sometimes we need to use another dataset, completely independent of the
default one. If the new dataset does not have too many attributes, the user
won't need to set up the program again (see page 5, Setting the program, for
more details).
We can use the Open function in the File menu.
Click the Open tab and browse to the text file that contains the dataset. Make
sure the file contains only the values, not the attribute names.
Once you click Open, go to the output system console and type the number of
attributes and records. You need this information beforehand.
Then the program will ask for the names of the attributes.
Once you are done, the new data will show up in the Project window.
Figure 19. Testing file manager
Figure 20. Record classification table.
After that, the new dataset will become your default dataset. Feel free to create
an ID3 tree or to clean the data.
If you want to go back to your original default dataset, click the Data set tool in
the File menu.
Options
The data processing program offers two options, located in the Options tab.
Format: This option allows the user to change the rounding of the numbers. By
default the correlation is expressed with 5 digits after the decimal point and
the entropy with 3. Note that when you change the format, the Find function is
affected. For instance, the value 0.877 is different from 0.87.
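In Java, this kind of rounding-for-display is typically done with format strings, which shows why Find is affected by the format setting: the same number produces different text at different precisions. (Whether the program uses String.format internally is an assumption.)

```java
import java.util.Locale;

public class FormatDemo {
    public static void main(String[] args) {
        double value = 0.8765432;
        // Default display: 5 decimal places for correlation, 3 for entropy.
        System.out.println(String.format(Locale.US, "%.5f", value)); // 0.87654
        System.out.println(String.format(Locale.US, "%.3f", value)); // 0.877
        // After changing the format, Find must match the newly rounded text.
        System.out.println(String.format(Locale.US, "%.2f", value)); // 0.88
    }
}
```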
Log file: Every time we use a function, the program automatically registers
and writes a description of the event. These events can be found in log.txt,
located in the Project_csci folder. Make sure to use the Exit tab in the File
menu; otherwise the log file won't be generated.
Figure 21. Entering new dataset specifications.
Explaining Program Files
All the program files are located in the Project_csci folder. Make sure to
place your dataset in it. The project folder can be placed anywhere as long as
you change the path in the source code. (See page 5, Setting the program.)
The following is a brief description of the program folders and files:
- Project_csci: The main folder, which contains the src and java files along
with the classes, images, and notes.
- src folder: This folder contains the classes: Project_csci.java,
Attribute.java (the structure of the ID3 tree nodes), ID3_tree.java,
Monitor.java, k.java, and ID3_Graph.java. The default dataset, a text file
named data, is located here as well. Make sure to change the name of your
default data file. You can find the icon folder here too.
- Test / Training folders: These contain the test and training files generated
by the EQ/NEQ training tool. These files are temporary, so they are
overwritten all the time.
- Log file: The history file where all the events are recorded.
Figure 22. log.txt file
- result_i.txt: The intersection file, which contains only the attributes
selected.
- r.txt: The records free of inconsistencies.
Notes about the code
The data processing program is an application that lets the user work with the
data smoothly to get the best results.
There are many points I would like to improve, such as the limitation on
decision values. The application is limited to values of 1 and 0 for the
decision attribute.
Setting up the program could be made more user-friendly. Some of the setup
steps involve rewriting the code, which can be intimidating for a
non-programmer. Another inconvenience is the format of the dataset. It must be
a text file and should contain only integer values. Many datasets have data
expressed in words or decimal numbers. This could be solved by extending the
ID3_tree class using generics.

Mais conteúdo relacionado

Mais procurados

Vb.net session 05
Vb.net session 05Vb.net session 05
Vb.net session 05
Niit Care
 
Ado.net session07
Ado.net session07Ado.net session07
Ado.net session07
Niit Care
 
Ado.net session04
Ado.net session04Ado.net session04
Ado.net session04
Niit Care
 
Ado.net session02
Ado.net session02Ado.net session02
Ado.net session02
Niit Care
 
Ado.net session05
Ado.net session05Ado.net session05
Ado.net session05
Niit Care
 

Mais procurados (20)

Sql ppt
Sql pptSql ppt
Sql ppt
 
2310 b 09
2310 b 092310 b 09
2310 b 09
 
Visual Basic.Net & Ado.Net
Visual Basic.Net & Ado.NetVisual Basic.Net & Ado.Net
Visual Basic.Net & Ado.Net
 
Unit4
Unit4Unit4
Unit4
 
Disconnected Architecture and Crystal report in VB.NET
Disconnected Architecture and Crystal report in VB.NETDisconnected Architecture and Crystal report in VB.NET
Disconnected Architecture and Crystal report in VB.NET
 
Data Warehouse and Business Intelligence - Recipe 2
Data Warehouse and Business Intelligence - Recipe 2Data Warehouse and Business Intelligence - Recipe 2
Data Warehouse and Business Intelligence - Recipe 2
 
Vb.net session 05
Vb.net session 05Vb.net session 05
Vb.net session 05
 
社會網絡分析UCINET Quick Start Guide
社會網絡分析UCINET Quick Start Guide社會網絡分析UCINET Quick Start Guide
社會網絡分析UCINET Quick Start Guide
 
Ado.net
Ado.netAdo.net
Ado.net
 
Data structures and algorithms short note (version 14).pd
Data structures and algorithms short note (version 14).pdData structures and algorithms short note (version 14).pd
Data structures and algorithms short note (version 14).pd
 
Recipes 8 of Data Warehouse and Business Intelligence - Naming convention tec...
Recipes 8 of Data Warehouse and Business Intelligence - Naming convention tec...Recipes 8 of Data Warehouse and Business Intelligence - Naming convention tec...
Recipes 8 of Data Warehouse and Business Intelligence - Naming convention tec...
 
Data structures
Data structuresData structures
Data structures
 
Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...
Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...
Recipe 5 of Data Warehouse and Business Intelligence - The null values manage...
 
Ado.net session07
Ado.net session07Ado.net session07
Ado.net session07
 
Ado.net session04
Ado.net session04Ado.net session04
Ado.net session04
 
Ado.net
Ado.netAdo.net
Ado.net
 
Recipes 6 of Data Warehouse and Business Intelligence - Naming convention tec...
Recipes 6 of Data Warehouse and Business Intelligence - Naming convention tec...Recipes 6 of Data Warehouse and Business Intelligence - Naming convention tec...
Recipes 6 of Data Warehouse and Business Intelligence - Naming convention tec...
 
ADO.NET difference faqs compiled- 1
ADO.NET difference  faqs compiled- 1ADO.NET difference  faqs compiled- 1
ADO.NET difference faqs compiled- 1
 
Ado.net session02
Ado.net session02Ado.net session02
Ado.net session02
 
Ado.net session05
Ado.net session05Ado.net session05
Ado.net session05
 

Destaque

Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
Sohail Patel
 
Mobilizing Local Government Tax Revenue for Adequate Service Delivery in Nige...
Mobilizing Local Government Tax Revenue for Adequate Service Delivery in Nige...Mobilizing Local Government Tax Revenue for Adequate Service Delivery in Nige...
Mobilizing Local Government Tax Revenue for Adequate Service Delivery in Nige...
Oghenovo Egbegbedia
 

Destaque (20)

Indirect tax (1)
Indirect tax (1)Indirect tax (1)
Indirect tax (1)
 
Amar 38 final
Amar 38 finalAmar 38 final
Amar 38 final
 
Project data analysis
Project data analysisProject data analysis
Project data analysis
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Data analysis and statistical inference project
Data analysis and statistical inference projectData analysis and statistical inference project
Data analysis and statistical inference project
 
Final goods & service tax
Final goods & service taxFinal goods & service tax
Final goods & service tax
 
Property Tax Assessment Services
Property Tax Assessment ServicesProperty Tax Assessment Services
Property Tax Assessment Services
 
Research project report sumit b
Research project report sumit bResearch project report sumit b
Research project report sumit b
 
Research project on packaged drinking water industry
Research project on packaged drinking water industryResearch project on packaged drinking water industry
Research project on packaged drinking water industry
 
Standards of Auditing - Introduction and Application in the Indian Context
Standards of Auditing - Introduction and Application in the Indian ContextStandards of Auditing - Introduction and Application in the Indian Context
Standards of Auditing - Introduction and Application in the Indian Context
 
Auditing Standards- IndusInd Bank
Auditing Standards- IndusInd BankAuditing Standards- IndusInd Bank
Auditing Standards- IndusInd Bank
 
Project-Student Financial Service System
Project-Student Financial Service SystemProject-Student Financial Service System
Project-Student Financial Service System
 
Mobilizing Local Government Tax Revenue for Adequate Service Delivery in Nige...
Mobilizing Local Government Tax Revenue for Adequate Service Delivery in Nige...Mobilizing Local Government Tax Revenue for Adequate Service Delivery in Nige...
Mobilizing Local Government Tax Revenue for Adequate Service Delivery in Nige...
 
STANDARDS ON AUDIT
STANDARDS  ON AUDITSTANDARDS  ON AUDIT
STANDARDS ON AUDIT
 
Project Report on e banking
Project Report on e bankingProject Report on e banking
Project Report on e banking
 
Company audit & accounts
Company audit & accounts  Company audit & accounts
Company audit & accounts
 
E-banking project
E-banking projectE-banking project
E-banking project
 
Demonetization of Indian Currency
Demonetization of Indian CurrencyDemonetization of Indian Currency
Demonetization of Indian Currency
 
A study on understanding the concept of demonetization with reference to MBA ...
A study on understanding the concept of demonetization with reference to MBA ...A study on understanding the concept of demonetization with reference to MBA ...
A study on understanding the concept of demonetization with reference to MBA ...
 
Service tax-Negative list
Service tax-Negative listService tax-Negative list
Service tax-Negative list
 

Semelhante a Data_Processing_Program

Software Systems Modularization
Software Systems ModularizationSoftware Systems Modularization
Software Systems Modularization
chiao-fan yang
 
systems labOnce the Application has started up and you are at the .docx
systems labOnce the Application has started up and you are at the .docxsystems labOnce the Application has started up and you are at the .docx
systems labOnce the Application has started up and you are at the .docx
perryk1
 
Once the Application has started up and you are at the Start Page, s.docx
Once the Application has started up and you are at the Start Page, s.docxOnce the Application has started up and you are at the Start Page, s.docx
Once the Application has started up and you are at the Start Page, s.docx
arnit1
 
2015 Luminant Energy Process Guide
2015 Luminant Energy Process Guide2015 Luminant Energy Process Guide
2015 Luminant Energy Process Guide
Kelly Stark
 
Dynamic Web Pages Ch 4 V1.0
Dynamic Web Pages Ch 4 V1.0Dynamic Web Pages Ch 4 V1.0
Dynamic Web Pages Ch 4 V1.0
Cathie101
 
STATA_Training_for_data_science_juniors.pdf
STATA_Training_for_data_science_juniors.pdfSTATA_Training_for_data_science_juniors.pdf
STATA_Training_for_data_science_juniors.pdf
AronMozart1
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
rohithprabhas1
 

Semelhante a Data_Processing_Program (20)

Lab 1 Essay
Lab 1 EssayLab 1 Essay
Lab 1 Essay
 
Software Systems Modularization
Software Systems ModularizationSoftware Systems Modularization
Software Systems Modularization
 
SAP data archiving
SAP data archivingSAP data archiving
SAP data archiving
 
Tutorial on how to load images in crystal reports dynamically using visual ba...
Tutorial on how to load images in crystal reports dynamically using visual ba...Tutorial on how to load images in crystal reports dynamically using visual ba...
Tutorial on how to load images in crystal reports dynamically using visual ba...
 
data binding.docx
data binding.docxdata binding.docx
data binding.docx
 
Manual orange
Manual orangeManual orange
Manual orange
 
UNIT 3.2 GETTING STARTED WITH IDA.ppt
UNIT 3.2 GETTING STARTED WITH IDA.pptUNIT 3.2 GETTING STARTED WITH IDA.ppt
UNIT 3.2 GETTING STARTED WITH IDA.ppt
 
PATTERNS07 - Data Representation in C#
PATTERNS07 - Data Representation in C#PATTERNS07 - Data Representation in C#
PATTERNS07 - Data Representation in C#
 
Tableau Basic Questions
Tableau Basic QuestionsTableau Basic Questions
Tableau Basic Questions
 
systems labOnce the Application has started up and you are at the .docx
systems labOnce the Application has started up and you are at the .docxsystems labOnce the Application has started up and you are at the .docx
systems labOnce the Application has started up and you are at the .docx
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
Once the Application has started up and you are at the Start Page, s.docx
Once the Application has started up and you are at the Start Page, s.docxOnce the Application has started up and you are at the Start Page, s.docx
Once the Application has started up and you are at the Start Page, s.docx
 
2015 Luminant Energy Process Guide
2015 Luminant Energy Process Guide2015 Luminant Energy Process Guide
2015 Luminant Energy Process Guide
 
Dynamic Web Pages Ch 4 V1.0
Dynamic Web Pages Ch 4 V1.0Dynamic Web Pages Ch 4 V1.0
Dynamic Web Pages Ch 4 V1.0
 
Sas UTR How To Create Your UTRs Sep2009
Sas UTR How To Create Your UTRs Sep2009Sas UTR How To Create Your UTRs Sep2009
Sas UTR How To Create Your UTRs Sep2009
 
A Novel Method For Making Cut-Copy-Paste Operations Using Clipboard
A Novel Method For Making Cut-Copy-Paste Operations Using ClipboardA Novel Method For Making Cut-Copy-Paste Operations Using Clipboard
A Novel Method For Making Cut-Copy-Paste Operations Using Clipboard
 
Markinng schme ICT questions.pdf
Markinng schme ICT questions.pdfMarkinng schme ICT questions.pdf
Markinng schme ICT questions.pdf
 
STATA_Training_for_data_science_juniors.pdf
STATA_Training_for_data_science_juniors.pdfSTATA_Training_for_data_science_juniors.pdf
STATA_Training_for_data_science_juniors.pdf
 
001.general
001.general001.general
001.general
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
 

Data_Processing_Program

  • 1. Data Processing Program Student name: Neil B Dahlqvist Title: Data processing program File name: Project_csci.java Professor: Dr Ray Hashemi
  • 2. 1 Contents About the program…………………….…………………………... 2 Looking at the Work Area ……………………………………..……3 Using the File menu ….………...…………………………..……… 4 Setting the program……………………….………...……………….5 Using the tools menu………………………………..………….….. 6 Building an ID3 tree…………………………………………….…... 9 Using testing tools………………………….…………….…………16 Entering new dataset...................................................................18 Options…………………………………….………….……………..19 Explaining program files…………………………………………...20 Notes about the code………………………………………………21
  • 3. 2 About the program The data processing program will create a set of rules for a given dataset using a decision tree learning algorithm ID3 developed by Ross Quinlan (1983) and then it will test the accuracy of these rules. This application also selects the best group of data and removes any kind of inconsistencies or conflicted values using data cleansing techniques. The program counts with many functions that facilitate data management and data visualization. The graphic user interface provided is friendly and the resulting decision tree provides the representation of the concept that appeal to human because it renders the classification process self-evident.
  • 4. 3 Looking at the Work Area The Data processing program offers a friendly and intuitive graphic user interface. This interface is equipped with a variety of menus and windows that will help the user to work with the data with ease. The GUI is divided in three main components: Menu bar: It contains all the functions and tools necessary to manipulate the dataset. You will find three tabs, File, Tools and Options which will be explained later. Project Window: This window will display the excel spreadsheet of the main dataset at the startup. The project window main task is to show the current data after it has been processed for some function. Figure 1. Data processing graphic user interface.
  • 5. 4 Output Window: The bottom panel is called the output window. It is in charge of displaying the result of some operation, like entropy or correlation calculation or the description of some element like a test file or the id3 tree itself. Output System Console: This is not part of the GUI. This window is only use to enter input user. After the open file function is executed. Using the File menu The menu bar located at the top of the Project window (see Figure 1) contains three tabs. File Tools and Options. The File menu is the first item in the menu bar and possesses commands relating to the handling of files and the initialization of main dataset. Figure 2 shows the items found in File menu. Data set: This item initializes the main dataset located in the src file in the project folder C:UsersUsernameDocumen tsNetBeansProjectsProject_c scisrc. (See page 20, explaining program files). The data will be display as a spreadsheet in the Project window. It is imperative that you execute this method every time you start the program. Note: You can start using the data processing program without running this function but it is not recommended. Open: As in any other application the open function allow the user to open a working file. You can browse a file in your system and the application will display Figure 2. File Menu items
  • 6. 5 it in the Project window. The data processing program accepts only text files. So make sure to convert your dataset to txt format. (Figure 3) Save: With save function you can save your cleansed data anywhere in your system. The program let you save the file to any format including .xls (excel format). The data displayed in the Project and output window will be saved. The program automatically saves data every time you process data. Note that the temporary files located in the project folder are overwritten all the time so make sure to save the data manually. Setting the program Before you use the program for the first time, you may need to slightly modify the source code. You can use netbeans, eclipse or even notepad to open the java file and change the code. Make sure to follow the following steps to avoid any kind of errors Figure 3. Open file window.
  • 7. 6 Setting Path file name and variables Data processing program has a default dataset called data.txt which has 242 attributes and 999 records located in the src folder. To use your own dataset you must change the path name and the number of rows and columns. - Reformat your dataset file as a text file. If is an xls file, Excel has options for file conversion. It is a little bit tedious but you must delete the attributes names and just keep the data values otherwise the program will have an exception error. - Rename the file to data.txt and placed in the src folder located in the NetBeansProjects/Project_csci folder. - Locate the String called path and path_n which is at the beginning of the source code below the line public class Project_csci extends JFrame{ and change it to your convenience - Change the value of rows and columns variables. They are below path file name line. The columns indicate the number of Attributes and the rows are the number of records. - The default names for the Attributes are CerN with N being the number of attribute. To change this, go to the fillnames function and change the names of the Attributes Using the tools menu The tools menu provides functions to manage and process data. The first tools are visual tools. They let you organize the information according to Entropy, correlation or both. The data will be displayed in the Project window. Show results: This function will show the correlation and entropy of the attributes in the Project window Entropy and correlation have a similar function. They will display their corresponding data.
Figure 5 shows the Project window after the Show results function is executed.

Figure 4. Tools menu
Figure 5. Results showing attribute names, entropies, and correlations.
Find: This tool lets the user find an attribute or value according to a criterion. You can search by attribute name, entropy, correlation, or attribute number. The results are displayed in the output window in the following order: attribute name, correlation, entropy.

The next tools let the user discriminate data, cleanse it, and build decision trees in order to formulate a set of rules that predict a decision based on the dataset.

Figure 6. Find tool searching for the attribute called Cer12
Figure 7. Functions submenu items.
Sort collection, Sort entropy, Entropy, and Correlation are self-explanatory functions.

The main goal of the data processing system is the creation of a set of rules based on your training data. To do this, the program builds an ID3 decision tree. Before doing so, however, the user should cleanse the data to remove any conflicts or inconsistencies that could produce incongruent rules.

Building an ID3 Decision Learning Tree

The following steps explain the data cleaning process and the creation of a pruned tree and a set of rules.

Intersection function: This tool selects the best data using a threshold, picking attributes with lower entropy and greater correlation than the values indicated. This step is crucial for reducing the number of attributes to work with.

Figure 8. Intersection function threshold selector.
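The selection rule just described can be sketched as follows: keep an attribute when its entropy is below the entropy threshold and its correlation is above the correlation threshold. Whether the program's comparisons are strict is an assumption, and the names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class IntersectionDemo {
    // Return the indices of attributes passing both thresholds:
    // entropy below eT and correlation above cT (strictness assumed).
    static List<Integer> intersect(double[] entropy, double[] corr,
                                   double eT, double cT) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < entropy.length; i++) {
            if (entropy[i] < eT && corr[i] > cT) keep.add(i);
        }
        return keep;
    }

    public static void main(String[] args) {
        double[] e = {0.2, 0.9, 0.5};
        double[] c = {0.8, 0.1, 0.6};
        // Thresholds 1.0 (entropy) and -1.0 (correlation) keep every attribute,
        // matching the "select all" tip in the manual.
        System.out.println(intersect(e, c, 1.0, -1.0)); // [0, 1, 2]
    }
}
```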
When you pick entropy and correlation thresholds, it is a good idea to consult the Show results function or one of the visual tools discussed above. The result_i.txt file is created automatically in the src folder when you use the intersection function. This is a temporary file, so do not rely on it for future reference; make a copy of it or use the save function in the File menu.

Note: If you need to select all the data, use 1.0 as the entropy threshold and -1.0 as the correlation threshold. This way all the attributes will be selected.

Clean Data: Although Clean data is not in the Tools menu, it is one of the most important tools. Once we have the intersection of attributes, the data should be cleansed, since many records probably contain conflicting or redundant values. These can cause a stack overflow error while creating the ID3 decision tree, making the program run forever and eventually crash. Click the Clean data tab and you will see the records in the Project window as shown in Figure 9. Figure 10 shows the result of the Clean Data function.

Note: Always use the Clean data tool; otherwise, there could be complications while creating the ID3 decision learning tree.

Figure 9. Options menu (Clean Data)
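The cleansing step can be sketched as follows, assuming each record is a row of integers whose last column is the 0/1 decision (the program's actual data structures may differ): drop exact duplicates, and drop every record whose attribute values occur with both decisions.

```java
import java.util.*;

public class CleanDemo {
    // Remove conflicted records (same attributes, different decisions)
    // and redundant records (exact duplicates).
    static List<int[]> clean(List<int[]> records) {
        // Map each attribute pattern to the set of decisions seen for it.
        Map<String, Set<Integer>> decisions = new HashMap<>();
        for (int[] r : records) {
            String key = Arrays.toString(Arrays.copyOf(r, r.length - 1));
            decisions.computeIfAbsent(key, k -> new HashSet<>())
                     .add(r[r.length - 1]);
        }
        List<int[]> out = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (int[] r : records) {
            String key = Arrays.toString(Arrays.copyOf(r, r.length - 1));
            if (decisions.get(key).size() > 1) continue; // conflicted: drop
            if (!seen.add(Arrays.toString(r))) continue; // duplicate: drop
            out.add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> data = List.of(
            new int[]{1, 0, 1}, new int[]{1, 0, 1},  // duplicate pair
            new int[]{0, 1, 1}, new int[]{0, 1, 0}); // conflicting pair
        System.out.println(clean(data).size()); // 1 record survives
    }
}
```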
The cleansed data is automatically saved as r.txt in the src folder. If you need this file, make sure to save it using the save tool in the File menu: r.txt is a temporary file, overwritten every time you cleanse new data.

As shown in Figure 10, the output window displays information about the new dataset, including the number of attributes, the number of rows, and the number of records with decision 1 and decision 0. Now we are ready to create our ID3 decision tree.

Build ID3: This is perhaps the most powerful function of the entire program. Build ID3 creates a decision tree from the selected data and displays it in an external window. Figure 11 shows the external window where the ID3 tree is displayed.

Figure 10. r.txt file shows data without redundant or conflicted records.
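Build ID3 follows Quinlan's textbook criterion: at each node, split on the attribute with the highest information gain, i.e. the parent's entropy minus the weighted entropy of the children. A minimal sketch for binary attributes follows; the names are illustrative and the program's internals may differ.

```java
public class GainDemo {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    static double entropy(int ones, int zeros) {
        int n = ones + zeros;
        if (n == 0) return 0;
        double p = (double) ones / n;
        if (p == 0 || p == 1) return 0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // records: each row's last column is the 0/1 decision; attributes are 0/1.
    static int bestAttribute(int[][] records, int nAttrs) {
        int d = records[0].length - 1;
        int ones = 0;
        for (int[] r : records) ones += r[d];
        double parent = entropy(ones, records.length - ones);
        int best = 0;
        double bestGain = -1;
        for (int a = 0; a < nAttrs; a++) {
            int[][] cnt = new int[2][2]; // cnt[attrValue][decision]
            for (int[] r : records) cnt[r[a]][r[d]]++;
            double weighted = 0;
            for (int v = 0; v < 2; v++) {
                int n = cnt[v][0] + cnt[v][1];
                weighted += (double) n / records.length * entropy(cnt[v][1], cnt[v][0]);
            }
            double gain = parent - weighted; // information gain of splitting on a
            if (gain > bestGain) { bestGain = gain; best = a; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Attribute 0 perfectly predicts the decision; attribute 1 is noise.
        int[][] data = {{1, 0, 1}, {1, 1, 1}, {0, 0, 0}, {0, 1, 0}};
        System.out.println(bestAttribute(data, 2)); // 0
    }
}
```

ID3 then recurses on each branch with the chosen attribute removed, stopping when a node's records are pure, which produces the leaves described below.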
The ID3 tree has the following components:

Internal node: This element contains a certain number of records with decision 1 and decision 0, along with other properties such as the node name, weight, etc. An internal node is represented as a green circle.

Link: This is a virtual link between branches. It is represented by a green circle like an internal node; however, links have no properties.

Figure 11. ID3 decision tree components.
To display the properties of a node, hover the mouse over an internal node; after a few seconds a panel will show all the information you need. Note that if you do this with a link, nothing happens.

Leaf: The last element of the ID3 tree is the leaf. A leaf has no branches and carries a value of 1 or 0. Leaves have properties too, but they are not displayed dynamically. Every leaf has a function, or rule, which is shown the same way its properties are: by clicking the leaf, you can read its corresponding rule.

Figure 12. Properties of internal node Cer66.
Figure 13. Leaf function shows up dynamically.
The ID3 decision tree has its own set of tools, placed at the top left of the window. The first three tools manipulate the tree directly: the user can expand the tree all at once or step by step, and of course collapse it. This is very helpful when dealing with a tree with many branches.

Figure 14. ID3 tree tool menu
The Create table option generates a table with all the properties of every node and leaf of the tree. Figure 15 shows the ID3 tree data table.

Prune ID3: The next step is to prune the tree. This makes the tree less complex, and it is our last step before we can test the generated rules. Notice that the new tree has impure leaves; each of these leaves has replaced a whole set of branches.

Now you can use the Generate rules tool, which produces our set of rules that will hopefully predict the decision values.

Figure 15. ID3 tree table
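Generate rules follows the standard scheme: each root-to-leaf path becomes one IF/THEN rule. The sketch below uses a toy node class; the program's Attribute and ID3_tree classes are more elaborate, and the attribute names are only examples.

```java
import java.util.ArrayList;
import java.util.List;

public class RulesDemo {
    static class Node {
        String attr;      // null marks a leaf
        int decision;     // used only when the node is a leaf
        Node left, right; // branches for attribute value 0 and 1
        Node(String a, Node l, Node r) { attr = a; left = l; right = r; }
        Node(int d) { decision = d; }
    }

    // Walk every root-to-leaf path, accumulating conditions into a rule.
    static void rules(Node n, String cond, List<String> out) {
        if (n.attr == null) {
            out.add("IF " + cond + " THEN decision=" + n.decision);
            return;
        }
        String pre = cond.isEmpty() ? "" : cond + " AND ";
        rules(n.left,  pre + n.attr + "=0", out);
        rules(n.right, pre + n.attr + "=1", out);
    }

    public static void main(String[] args) {
        Node tree = new Node("Cer66", new Node(0),
                             new Node("Cer12", new Node(0), new Node(1)));
        List<String> out = new ArrayList<>();
        rules(tree, "", out);
        out.forEach(System.out::println);
        // IF Cer66=0 THEN decision=0
        // IF Cer66=1 AND Cer12=0 THEN decision=0
        // IF Cer66=1 AND Cer12=1 THEN decision=1
    }
}
```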
Using testing tools

After the rules have been generated, it is imperative to test their efficiency, that is, how well they predict the decision value. The data processing program offers several testing tools, found in the Testing submenu of the Tools menu. The user can choose among three test functions: EQTraining, NEQTraining, and K-1 Testing. For a more detailed explanation of these testing methods, please consult any data mining textbook.

Figure 16. Set of rules displayed
EQ/NEQTraining test method: Click the tool item and a new panel will appear, where the user enters a percentage for the test file. After this is done, the testing file manager appears, showing information about all the test and training files generated, along with the success rate. The tool automatically saves all test and training files in the project folder; you can find these text files in the test and training folders located in the Project_csci folder. (See page 20 for additional information.)

The Show table button displays a list of all the records, their classification, and the match rate. A success rate of 0.85 or higher means the rules are reliable.

Figure 17. Testing tools
Figure 18. Test file input
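The percentage split behind these tests is an ordinary holdout evaluation: reserve part of the records for testing and report the fraction the rules classify correctly. A sketch follows, with the whole rule set reduced to a single classifier function for illustration; how the program partitions the records is an assumption.

```java
import java.util.function.Function;

public class HoldoutDemo {
    // Hold out the last testPct of the records; return the success rate of
    // the classifier on that held-out portion (last column = 0/1 decision).
    static double successRate(int[][] records, double testPct,
                              Function<int[], Integer> classify) {
        int d = records[0].length - 1;
        int start = (int) Math.round(records.length * (1 - testPct));
        int hits = 0, total = 0;
        for (int i = start; i < records.length; i++) {
            if (classify.apply(records[i]) == records[i][d]) hits++;
            total++;
        }
        return total == 0 ? 0 : (double) hits / total;
    }

    public static void main(String[] args) {
        int[][] data = {{1, 1}, {0, 0}, {1, 1}, {0, 0}};
        // Toy "rule set": predict the decision equals attribute 0.
        double rate = successRate(data, 0.5, r -> r[0]);
        System.out.println(rate); // 1.0: every held-out record matched
    }
}
```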
K-1 Testing: This may be the best way to test the efficiency of our rule set. Depending on the number of records, it may take several minutes to process the data. K-1 testing does not generate any test files; it only displays the record classification table and the information in the output window.

Entering a new dataset

Sometimes we will need to use another dataset, completely independent of the default one. If the new dataset does not have too many attributes, the user won't need to set up the program again (see page 5, Setting the program, for more details). We can use the Open function in the File menu: click the Open tab and browse to the text file containing the dataset. Make sure the file contains only the values, not the attribute names. Once you click Open, go to the output system console and type the number of attributes and records; you need this information beforehand. The program will then ask for the attribute names. Once you are done, the new data will show up in the Project window.

Figure 19. Testing file manager
Figure 20. Record classification table.
After that, the new dataset becomes your default dataset; feel free to create an ID3 tree or clean the data. If you want to go back to your original default dataset, click the Data set tool in the File menu.

Options

The data processing program offers two options, located in the Options tab.

Format: This option lets the user change the rounding of the numbers. By default, correlation is expressed with 5 digits after the decimal point and entropy with 3. Note that changing the format affects the Find function: for instance, the value 0.877 is different from 0.87.

Log file: Every time a function is used, the program automatically registers and writes a description of the event. These events can be found in log.txt, located in the Project_csci folder. Make sure to use the Exit tab in the File menu; otherwise the log file won't be generated.

Figure 21. Entering new dataset specifications.
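The interaction between Format and Find can be reproduced with plain Java number formatting, which rounds half-up; the format strings here are assumptions about how the program renders its values, not its actual code.

```java
import java.util.Locale;

public class FormatDemo {
    // Render a value with the given number of digits after the decimal point,
    // as the Format option does for entropy (3 digits) and correlation (5).
    static String round(double v, int digits) {
        return String.format(Locale.US, "%." + digits + "f", v);
    }

    public static void main(String[] args) {
        // The same underlying value produces different searchable strings,
        // which is why changing the format affects Find.
        System.out.println(round(0.87654, 3)); // 0.877
        System.out.println(round(0.87654, 2)); // 0.88
    }
}
```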
Explaining Program Files

All the program files are located in the Project_csci folder; make sure to place your dataset in it. The project folder itself can be placed anywhere, as long as you change the path in the source code (see page 5, Setting the program).

The following is a brief description of the program folders and files:

- Project_csci: The main folder, which contains the src and Java files along with the classes, images, and notes.
- src folder: This folder contains the classes: Project_csci.java, Attribute.java (the structure of the ID3 tree nodes), ID3_tree.java, Monitor.java, k.java, and ID3_Graph.java. The default dataset, a text file named data, is located here as well, as is the icon folder. Make sure to change the name of your default data.
- Test/Training folders: These contain the test and training files generated by the EQ/NEQ training tools. These files are temporary, so they are overwritten all the time.
- Log file: The history file where all the events are recorded.

Figure 22. log.txt file
- r.txt: The cleansed data file, containing the records free of inconsistencies.
- result_i.txt: The intersection file, which contains only the attributes selected.

Notes about the code

The data processing program is an application that lets the user work with the data smoothly to get the best results. There are several points I would like to improve, such as the limitation on the decision values: the application only supports the values 1 and 0 for the decision attribute.

Setting up the program could also be made more user-friendly. Some of the setup steps involve rewriting the code, which can be intimidating for a non-programmer.

Another inconvenience is the format of the dataset: it must be a text file and should contain only integer values, while many datasets have data expressed as words or decimal numbers. This could be solved by extending the ID3_tree class using generics.