1. Data Processing Program
Student name: Neil B Dahlqvist
Title: Data processing program
File name: Project_csci.java
Professor: Dr Ray Hashemi
Contents
About the program .............................................. 2
Looking at the Work Area ....................................... 3
Using the File menu ............................................ 4
Setting the program ............................................ 5
Using the Tools menu ........................................... 6
Building an ID3 tree ........................................... 9
Using testing tools ........................................... 16
Entering new dataset .......................................... 18
Options ....................................................... 19
Explaining program files ...................................... 20
Notes about the code .......................................... 21
About the program
The data processing program creates a set of rules for a given dataset
using ID3, a decision tree learning algorithm developed by Ross Quinlan
(1983), and then tests the accuracy of these rules.
This application also selects the best group of data and removes
inconsistencies and conflicting values using data cleansing techniques. The
program includes many functions that facilitate data management and
data visualization. The graphical user interface is friendly, and the
resulting decision tree provides a representation of the concept that
appeals to humans because it renders the classification process self-evident.
Looking at the Work Area
The Data processing program offers a friendly and intuitive graphical user interface.
This interface is equipped with a variety of menus and windows that help the
user work with the data with ease.
The GUI is divided into three main components:
Menu bar: It contains all the functions and tools necessary to manipulate the
dataset. You will find three tabs, File, Tools, and Options, which will be explained
later.
Project Window: This window displays the main dataset as a spreadsheet at
startup. The Project window's main task is to show the current data
after it has been processed by some function.
Figure 1. Data processing graphic user interface.
Output Window: The bottom panel is called the Output window. It is in charge of
displaying the result of an operation, like an entropy or correlation calculation, or
the description of an element like a test file or the ID3 tree itself.
Output System Console: This is not part of the GUI. This window is only used to
enter user input after the Open file function is executed.
Using the File menu
The menu bar located at the top of the Project window (see Figure 1) contains
three tabs: File, Tools, and Options.
The File menu is the first item in the menu bar and contains commands relating
to the handling of files and the initialization of the main dataset.
Figure 2 shows the items found in File menu.
Data set: This item initializes the main dataset located in the src folder inside the
project folder C:\Users\Username\Documents\NetBeansProjects\Project_csci\src.
(See page 20, Explaining program files.)
The data will be displayed as a spreadsheet in the Project window. It is imperative
that you execute this function every time you start the program.
Note: You can start using the data processing program without running this
function, but it is not recommended.
Open: As in any other application, the Open function allows the user to open a
working file. You can browse for a file on your system and the application will
display it in the Project window. The data processing program accepts only text
files, so make sure to convert your dataset to .txt format. (Figure 3)
Figure 2. File Menu items
Save: With the Save function you can save your cleansed data anywhere on your
system. The program lets you save the file in any format, including .xls (Excel
format). The data displayed in the Project and Output windows will be saved.
The program automatically saves data every time you process it. Note that the
temporary files located in the project folder are overwritten all the time, so make
sure to save the data manually.
Setting the program
Before you use the program for the first time, you may need to slightly modify the
source code. You can use NetBeans, Eclipse, or even Notepad to open the Java file
and change the code.
Follow these steps to avoid errors.
Figure 3. Open file window.
Setting Path file name and variables
The Data processing program has a default dataset called data.txt, with 242
attributes and 999 records, located in the src folder. To use your own dataset you
must change the path name and the number of rows and columns.
- Reformat your dataset file as a text file. If it is an .xls file, Excel has options for
file conversion. It is a little tedious, but you must delete the attribute
names and keep only the data values; otherwise the program will throw an
exception.
- Rename the file to data.txt and place it in the src folder located in the
NetBeansProjects/Project_csci folder.
- Locate the Strings called path and path_n, which are at the beginning of the
source code below the line public class Project_csci extends JFrame{, and
change them to your convenience.
- Change the values of the rows and columns variables. They are below the path
file name line. The columns indicate the number of attributes and the rows the
number of records.
- The default names for the attributes are CerN, with N being the attribute
number. To change this, go to the fillnames function and change the
names of the attributes.
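The fields the steps above tell you to edit can be pictured roughly as follows. This is a hypothetical sketch, not the actual source of Project_csci.java: the real class extends JFrame, which is omitted here so the sketch stays self-contained, and path_n is left out because its exact role is not documented.

```java
// Hypothetical sketch of the configuration fields described above.
// Edit path, rows, and columns to match your own dataset.
public class Project_csci {
    String path = "src/data.txt"; // location of the main dataset file
    int rows = 999;               // number of records in the dataset
    int columns = 242;            // number of attributes per record
}
```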
Using the tools menu
The Tools menu provides functions to manage and process data. The first tools are
visual tools. They let you organize the information according to entropy,
correlation, or both. The data will be displayed in the Project window.
Show results: This function shows the correlation and entropy of the
attributes in the Project window.
The Entropy and Correlation items have a similar function; each displays its
corresponding data.
Figure 5 shows the Project window after the Show results function is executed.
Figure 4. Tools menu
Figure 5. Results show Attributes names, entropies and correlations.
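As a rough sketch of what the entropy column represents (assuming binary 0/1 decision values, as the rest of this manual does), the entropy of a set of records can be computed from the counts of each decision. This is an illustrative helper, not the actual code in Project_csci.java:

```java
// Shannon entropy for a binary decision: 0.0 when all records agree,
// 1.0 when the decisions are split evenly.
static double entropy(int ones, int zeros) {
    int total = ones + zeros;
    if (total == 0 || ones == 0 || zeros == 0) return 0.0;
    double p1 = (double) ones / total;
    double p0 = (double) zeros / total;
    return -p1 * Math.log(p1) / Math.log(2)
           - p0 * Math.log(p0) / Math.log(2);
}
```

For example, entropy(5, 5) is 1.0 and entropy(10, 0) is 0.0.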
Find: This tool allows the user to find an attribute or value according to a
criterion. You can search by attribute name, entropy, correlation, or attribute
number. The results are displayed in the Output window, in the following order:
attribute name, correlation, and entropy.
The next tools let the user discriminate data, cleanse it, and build decision trees in
order to formulate a set of rules to predict a decision based on the dataset.
Figure 6. Find tool searching for the attribute called Cer12
Figure 7. Functions submenu items.
Sort collection, Sort entropy, Entropy, and Correlation are self-explanatory
functions. The main goal of the data processing system is the creation of a set of
rules based on your training data. To do this the program builds an ID3 decision
tree. However, before doing this, the user should cleanse the data to get rid of any
conflict or inconsistency that may cause the creation of incongruent rules.
Building an ID3 Decision Learning Tree
The following steps explain the data cleansing process and the creation of a
pruned tree and a set of rules.
Intersection function: This tool selects the best data, using thresholds to pick
attributes with lower entropy and higher correlation than the values indicated.
This step is crucial to reduce the number of attributes to work with.
Figure 8. Intersection function threshold selector.
It is a good idea to use the Show results function or the visual tools discussed
above when you pick an entropy and correlation threshold.
The result_i.txt file is created automatically when you use the Intersection
function; you can find the file in the src folder. This is a temporary file, so do
not rely on it for future reference. You can make a copy of the file or use the
Save function in the File menu.
Note: If you need to select all the data, use 1.0 as the entropy threshold and -1.0
as the correlation threshold. This way all the attributes will be selected.
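The selection rule of the Intersection function can be sketched like this. The method and parameter names are illustrative, not the actual ones in the source; the thresholds are treated as inclusive so that 1.0 entropy and -1.0 correlation select every attribute, matching the note above.

```java
import java.util.ArrayList;
import java.util.List;

// Keep only the attributes whose entropy is at or below the entropy
// threshold and whose correlation is at or above the correlation
// threshold.
static List<Integer> intersect(double[] entropy, double[] correlation,
                               double maxEntropy, double minCorrelation) {
    List<Integer> selected = new ArrayList<>();
    for (int i = 0; i < entropy.length; i++) {
        if (entropy[i] <= maxEntropy && correlation[i] >= minCorrelation) {
            selected.add(i); // attribute i passes both thresholds
        }
    }
    return selected;
}
```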
Clean Data: Even though Clean Data is not in the Tools menu, it is one of the most
important tools. Once we have the intersection of attributes, the data should be
cleansed. Many of the records probably contain conflicting or redundant values.
This can cause a stack overflow error while creating the ID3 decision tree, making
the program run indefinitely and eventually crash. Click the Clean Data tab and
you will see the records in the Project window as shown in Figure 9.
Figure 10 shows the result of the Clean Data function.
Note: Remember to always use the Clean Data tool; otherwise there could be
complications while creating the ID3 decision tree.
Figure 9. Options menu (Clean Data)
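The cleansing step can be sketched as dropping duplicate records and discarding records whose attribute values conflict (identical values but different decisions). This is an assumption about what "conflicted or redundant" means here; the actual Clean Data implementation may differ.

```java
import java.util.*;

// Remove redundant (duplicate) records and conflicting records:
// records with identical attribute values but different decisions.
// The decision is assumed to be the last column of each record.
static List<int[]> cleanData(List<int[]> records) {
    // key = attribute values without the decision; value = decisions seen
    Map<String, Set<Integer>> decisions = new LinkedHashMap<>();
    for (int[] r : records) {
        String key = Arrays.toString(Arrays.copyOf(r, r.length - 1));
        decisions.computeIfAbsent(key, k -> new HashSet<>())
                 .add(r[r.length - 1]);
    }
    List<int[]> cleaned = new ArrayList<>();
    Set<String> emitted = new HashSet<>();
    for (int[] r : records) {
        String key = Arrays.toString(Arrays.copyOf(r, r.length - 1));
        // keep one copy of each record whose decision is consistent
        if (decisions.get(key).size() == 1 && emitted.add(key)) {
            cleaned.add(r);
        }
    }
    return cleaned;
}
```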
The cleaned text file is automatically saved as r.txt, and you can find this file in the
src folder. If you need this file make sure to save it using the Save tool in the File
menu. The r.txt file is a temporary file; it is overwritten every time you cleanse
new data.
As shown in Figure 10, the Output window displays info about the new dataset.
This new info includes the number of attributes and rows and the number of
records with decision 1 and decision 0.
Now we are ready to create our ID3 decision tree.
Build ID3: This is maybe the most powerful function of the entire program. Build
ID3 creates a decision tree from the selected data and displays it in an external
window. Figure 11 shows the external window where the ID3 tree is
displayed.
Figure 10. r.txt file shows data without redundant or conflicted records.
The ID3 tree has the following components:
Internal node: This element contains a certain number of records with decision 1
and decision 0, along with other properties like the name of the node, its weight,
etc. The internal node is represented as a green circle.
Link: This is a virtual link between branches. It is represented by a green circle
like the internal node; however, links have no properties.
Figure 11. ID3 decision tree components.
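This manual does not spell out how an internal node's attribute is chosen, but ID3 classically picks the attribute whose split leaves the lowest weighted entropy (the highest information gain). A sketch under that assumption, with illustrative names; the real logic lives in ID3_tree.java:

```java
import java.util.*;

// Entropy of the binary decision (last column) over a record set.
static double entropyOf(List<int[]> recs) {
    int ones = 0;
    for (int[] r : recs) if (r[r.length - 1] == 1) ones++;
    int zeros = recs.size() - ones;
    if (ones == 0 || zeros == 0) return 0.0;
    double p1 = (double) ones / recs.size(), p0 = 1.0 - p1;
    return -p1 * Math.log(p1) / Math.log(2)
           - p0 * Math.log(p0) / Math.log(2);
}

// Pick the attribute whose split yields the lowest weighted entropy.
static int bestAttribute(List<int[]> recs, List<Integer> attrs) {
    int best = -1;
    double bestH = Double.MAX_VALUE;
    for (int a : attrs) {
        // partition the records by the value of attribute a
        Map<Integer, List<int[]>> parts = new HashMap<>();
        for (int[] r : recs)
            parts.computeIfAbsent(r[a], k -> new ArrayList<>()).add(r);
        double h = 0.0;
        for (List<int[]> p : parts.values())
            h += (double) p.size() / recs.size() * entropyOf(p);
        if (h < bestH) { bestH = h; best = a; }
    }
    return best;
}
```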
To display the properties of a node, hover the mouse over an internal node and
after a few seconds a panel will show up with all the info you need. Note that if
you do this with a link, nothing happens.
Leaf: The last element of the ID3 tree is the leaf. A leaf has no branches and it can
have a value of 1 or 0. Leaves have properties too, but they are not displayed
dynamically.
Every leaf has a function, or rule, which is shown the same way properties are. By
clicking the leaf, you can read its corresponding rule.
Figure 12. Properties of internal node Cer66.
Figure 13. Leaf function shows up dynamically.
The ID3 decision tree has its own set of tools, which are placed at the top left of
the window. The first three tools manipulate the tree directly. The user can
expand the tree at once or step by step, and of course collapse it. This is really
helpful when dealing with a tree with many branches.
Figure 14. ID3 tree tool menu
The Create table option generates a table with all the properties of every node
and leaf of the tree. Figure 15 shows the ID3 tree data table.
Prune ID3: The next step is to prune the tree. By doing this, the tree becomes less
complex. This is our last step before we can test the generated rules.
Notice that the new tree has impure leaves. These leaves have replaced whole
sets of branches.
Now you can use the Generate rules tool. This will give us our beloved set of
rules, which hopefully will predict the decision values.
Figure 15. ID3 tree table
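One common way to generate rules from a decision tree is to read each root-to-leaf path as one IF/THEN rule: the conjunction of attribute tests along the path implies the leaf's decision. The Node shape below is illustrative, not the actual Attribute.java class:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal tree node for the sketch: an internal node tests a binary
// attribute; a leaf (name == null) carries a decision value.
static class Node {
    String name;    // attribute tested here; null for a leaf
    int decision;   // leaf decision (0 or 1), used when name == null
    Node zero, one; // branches followed when the attribute is 0 or 1
}

// Walk every root-to-leaf path, accumulating the tests into a rule.
static void rules(Node n, String path, List<String> out) {
    if (n.name == null) { // leaf: emit the accumulated rule
        out.add("IF " + path + " THEN decision = " + n.decision);
        return;
    }
    String prefix = path.isEmpty() ? "" : path + " AND ";
    rules(n.zero, prefix + n.name + " = 0", out);
    rules(n.one,  prefix + n.name + " = 1", out);
}
```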
Using testing tools
After the rules have been generated, it is imperative to test their efficiency, that
is, how well our rules predict the decision value.
The Data processing program offers some testing tools, which can be found in the
Tools submenu called Testing.
The user can choose among three test functions: EQTraining, NEQTraining, and
K-1 Testing. For more detailed explanations of these testing methods please
consult any data mining textbook.
Figure 16. Set of rules displayed
EQ/NEQTraining test method: Click the tool item and a new panel will show up.
The user has to enter a percentage for the test file. After this is done the
testing file manager will appear. We can view the info of all the test and training
files generated, along with the success rate.
The tool automatically saves all the test and training files in the project folder.
You can find these text files in the test and training folders located in the
Project_csci folder. (See page 20 for additional information.)
The Show table button will display a list of all the records, their classification,
and the match rate.
A success rate of 0.85 or higher means that the rules are reliable.
Figure 17. Testing tools
Figure 18. Test file input
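The percentage prompt can be pictured as a simple holdout split: the given fraction of records becomes the test file and the rest the training file. Whether the actual tool shuffles before splitting is not documented, so the shuffle here is an assumption, and the method name is illustrative.

```java
import java.util.*;

// Split records into [training, test], where testPct (e.g. 0.2) is the
// fraction of records held out for the test file.
static List<List<int[]>> split(List<int[]> records, double testPct, long seed) {
    List<int[]> copy = new ArrayList<>(records);
    Collections.shuffle(copy, new Random(seed)); // assumed, not documented
    int nTest = (int) Math.round(copy.size() * testPct);
    List<int[]> test  = new ArrayList<>(copy.subList(0, nTest));
    List<int[]> train = new ArrayList<>(copy.subList(nTest, copy.size()));
    return Arrays.asList(train, test);
}
```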
K-1 Testing: This may be the best way to test the efficiency of our rule set.
Depending on the number of records, it may take several minutes to process the
data. K-1 testing does not generate any test file; it only displays the record
classification table and the information in the Output window.
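K-1 testing follows the leave-one-out idea: each record in turn is held out, a classifier is built from the remaining records, and the held-out record is classified. In this sketch a trivial majority vote stands in for the program's actual rule-based classifier, which is an assumption made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Leave-one-out loop: hold out each record, predict its decision from
// the rest (here via majority vote), and report the success rate.
static double leaveOneOut(List<int[]> records) {
    int correct = 0;
    for (int i = 0; i < records.size(); i++) {
        List<int[]> train = new ArrayList<>(records);
        int[] held = train.remove(i); // the single held-out record
        int ones = 0;
        for (int[] r : train) if (r[r.length - 1] == 1) ones++;
        int predicted = (2 * ones >= train.size()) ? 1 : 0; // majority vote
        if (predicted == held[held.length - 1]) correct++;
    }
    return (double) correct / records.size();
}
```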
Entering new dataset
Sometimes we will need to use another dataset, completely independent of the
one we have by default. If the new dataset does not have too many attributes,
the user won't need to set up the program again (see page 5, Setting the program,
for more details).
We can use the Open function in the File menu.
Click the Open tab and browse to the text file which contains the dataset. Make
sure the file only contains the values, not the attribute names.
Once you click Open, go to the output system console and type the number of
attributes and records. You need this information beforehand.
Then the program will ask for the names of the attributes.
Once you are done, the new data will show up in the Project window.
Figure 19. Testing file manager
Figure 20. Record classification table.
After that, the new dataset will become your default dataset. Feel free to create
an ID3 tree or to clean the data.
If you want to go back to your original default dataset, click the Data set tool in
the File menu.
Options
The Data processing program includes two options, located in the Options tab.
Format: This option allows the user to change the rounding of the numbers. By
default the correlation is expressed with 5 digits after the decimal point and the
entropy with 3. Note that when you change the format, the Find function is
affected. For instance, the value 0.877 is different from 0.87.
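The default rounding described above can be reproduced with String.format. The method names are illustrative, not the program's own; Locale.US keeps the decimal point regardless of the system locale.

```java
import java.util.Locale;

// Correlation rounded to 5 decimal places, entropy to 3, mirroring the
// program's default Format settings.
static String formatCorrelation(double c) {
    return String.format(Locale.US, "%.5f", c);
}
static String formatEntropy(double e) {
    return String.format(Locale.US, "%.3f", e);
}
```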
Log file: Every time we use a function, the program automatically registers and
writes a description of the event. These events can be found in log.txt, located
in the Project_csci folder. Make sure to use the Exit tab in the File menu;
otherwise the log file won't be generated.
Figure 21. Entering new dataset specifications.
Explaining Program Files
All the program files are located in the Project_csci folder. Make sure to place
your dataset in it. The project folder can be placed anywhere, as long as you
change the path in the source code. (See page 5, Setting the program.)
The following is a brief description of the program folders and files:
- Project_csci: The main folder, which contains the src and java files along
with the classes, images, and notes.
- src folder: This folder contains the classes, which are Project_csci.java,
Attribute.java (the structure of the ID3 tree nodes), ID3_tree.java,
Monitor.java, k.java, and ID3_Graph.java. The default dataset is located
here as well; the name of the text file is data. Make sure to rename your own
default data accordingly. You can find the icon folder here too.
- Test/Training folders: These contain the test and training files generated by the
EQ/NEQ training tool. These files are temporary, so they are overwritten all
the time.
- Log file: The history file where all the events are recorded.
Figure 22. log.txt file
- r.txt: The cleansed data file, containing the records free of inconsistencies.
- result_i.txt: The intersection file, which contains only the attributes selected
by the Intersection function.
Notes about the code
The Data processing program is an application that lets the user work with the data
smoothly to get the best results.
There are many points that I would like to improve, such as the limitation of the
decision values. The application is limited to values of 1 and 0 for the
decision attribute.
Setting up the program could be made more user-friendly. Some of the steps to set
up the program involve rewriting the code, which can be intimidating for a
non-programmer. Another inconvenience is the format of the dataset. It must be a
text file and should only contain integer values. Many datasets have data
expressed in words or decimal numbers. This could be solved by extending the
ID3_tree class using generics.