SEMESTRAL WORK FOR THE COURSE 336VD (DATA MINING), CZECH TECHNICAL UNIVERSITY IN PRAGUE, 2008/2009

Classification of Newborn's Sleeping Phases from Their EEG

Dominik Franěk
Abstract—Correct classification of a newborn's sleeping phases from their EEG can help to predict brain problems or other mental defects. This semestral work set out to find the optimal k for a k-nearest-neighbor (kNN) classifier. The choice of kNN is motivated by its simplicity, its flexibility to incorporate different data types, and its adaptability to irregular feature spaces. The best k was found to be 3, with an accuracy of 83.69%. This means that whenever a newborn's EEG is given, the algorithm can classify the newborn's sleeping phase by looking at the 3 nearest EEG records.
I. ASSIGNMENT

Use the method of k nearest neighbors for classification of the target attribute of the chosen dataset. Choose one of the classes as the target (positive) class. Find the best classifier with a False Positive rate (FPr) < 0.3, and compute the accuracy and True Positive rate (TPr) of this classifier.

Fig. 1. Graph showing original values of attributes; x-axis: attributes, y-axis: values of attributes (−5 to 543)
II. INTRODUCTION

The problem is to find the optimal k for a nearest-neighbor classifier (written as NN below) on the given dataset. The algorithm can be briefly summarized as follows: in the training phase, it computes similarity measures from all rows in the training set and combines them into a global similarity measure using the XValidation method. In the testing phase, for rows with "unknown" classes, it chooses their k nearest neighbors in the training set according to the trained similarity measure and then uses a customized voting scheme to generate a list of predictions with confidence scores [4].

Fig. 2. Graph showing normalized values of attributes; x-axis: attributes, y-axis: values of attributes (0.0 to 1.0)
The dataset is in *.arff format and each row has 55 attributes. The attribute called "class" has 4 nominal values (0, 1, 2, 3) and represents the classified newborn's sleeping phases. I did not find anywhere what exactly these values mean, but from my observations I expect that the kind of sleeping phase can be computed from the given attributes (EEG c1 alpha, ...). [5]

The given dataset comes slightly preprocessed: there are no rows with empty attributes, and all attributes are numerical values. Fig. 1 shows all attributes of the dataset and their values. These values are not normalized, so the range of attribute values goes from −5 to 543. The normalized dataset is shown in Fig. 2, where all values lie in the range from 0.0 to 1.0; each class (0, 1, 2, 3) has a different color. The dataset is too big to process at once, because it consists of 42027 rows, each with 55 regular attributes.

The dataset clearly has to be preprocessed further before starting the experiments. First the Normalization operator is applied (shown in Fig. 5, page 3), which normalizes all numerical values to the range from 0.0 to 1.0. Extreme values get no special treatment, because the next preprocessing step selects only about 7% of all rows (2942 rows), which largely "eliminates" them. This subset is chosen with the Stratified Sampling method, with the attribute named "class" set as the label. Of the 2942 chosen rows, 2210 are labeled as class 0 and 732 as class 1 (Tab. I). Class 0 is merged from the original classes 1, 2 and 3; class 1 is renamed from the original class 0. The normalized data subset is shown in Fig. 3.

After attribute normalization, the training of the model begins. As shown in Fig. 5 (right side), the normalized subset is divided into 2 parts: 1/5 of the subset goes to the training phase and 4/5 is used for testing.

III. EXPERIMENTS

The chosen positive class of the original data is class 0 (renamed to class 1 in the normalized subset). The other classes (1, 2, 3) are set as negative classes.

In the training phase, the Parameter Iteration operator is used to iterate the k of NN; k runs from 1 to 15 in steps of +1.
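The two preprocessing steps just described (min-max normalization to the range 0.0–1.0, and a stratified sample that preserves the class ratio) might be sketched in plain Python as follows. This is an illustrative re-implementation, not RapidMiner's own API; the function names and the fixed seed are my assumptions.

```python
import random
from collections import defaultdict

def min_max_normalize(rows):
    """Rescale every numeric column to the range 0.0-1.0 (min-max)."""
    n_cols = len(rows[0])
    mins = [min(r[c] for r in rows) for c in range(n_cols)]
    maxs = [max(r[c] for r in rows) for c in range(n_cols)]
    return [
        [(r[c] - mins[c]) / (maxs[c] - mins[c]) if maxs[c] > mins[c] else 0.0
         for c in range(n_cols)]
        for r in rows
    ]

def stratified_sample(rows, labels, fraction, seed=2001):
    """Draw the same fraction from every class, preserving class ratios."""
    rng = random.Random(seed)          # fixed seed for reproducible sampling
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    sample = []
    for label, members in by_class.items():
        rng.shuffle(members)
        take = round(len(members) * fraction)
        sample.extend((row, label) for row in members[:take])
    return sample
```

Sampling per class (rather than from the pooled rows) is what keeps the 2210:732 class ratio of the subset close to the ratio in the full dataset.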
To avoid overfitting of NN, a method called K-fold cross validation (CV) is used. For each iteration of k, CV is run 10 times: CV divides the training set 10 times into 2 parts, trains kNN on the first part and validates kNN on the second part. After the 10 iterations of K-fold cross validation, the average accuracy of kNN over these 10 CV runs is computed. Once k has been iterated from 1 to 15, the k with the highest average accuracy is selected and is used in the testing phase. A graph of the average accuracies for each k is shown in Fig. 4.

Fig. 4. Average accuracy for kNN; x-axis: k; y-axis: accuracy

Attr. name         Statistics         Range
class              label              0.0 (2210), 1.0 (732)
PNG                0.427 +/- 0.111    [0.000 ; 0.919]
PNG filtered       0.359 +/- 0.166    [0.000 ; 1.000]
EMG std            0.114 +/- 0.090    [0.033 ; 0.766]
EMG std filtered   0.126 +/- 0.138    [0.004 ; 0.874]
ECG beat           0.427 +/- 0.135    [0.212 ; 0.993]
ECG beat filtered  0.444 +/- 0.138    [0.225 ; 0.987]
EEG fp1 delta      0.216 +/- 0.065    [0.081 ; 0.964]
EEG fp2 delta      0.218 +/- 0.067    [0.071 ; 0.958]
EEG t3 delta       0.202 +/- 0.074    [0.064 ; 0.906]
EEG t4 delta       0.232 +/- 0.089    [0.062 ; 0.956]
EEG c3 delta       0.243 +/- 0.072    [0.091 ; 0.961]
EEG c4 delta       0.244 +/- 0.070    [0.089 ; 0.968]
EEG o1 delta       0.212 +/- 0.077    [0.066 ; 0.958]
EEG o2 delta       0.211 +/- 0.083    [0.046 ; 0.933]
EEG fp1 theta      0.188 +/- 0.072    [0.068 ; 0.976]
EEG fp2 theta      0.216 +/- 0.075    [0.090 ; 0.972]
EEG t3 theta       0.222 +/- 0.065    [0.077 ; 0.970]
EEG t4 theta       0.264 +/- 0.079    [0.082 ; 0.938]
EEG c3 theta       0.308 +/- 0.061    [0.101 ; 0.962]
EEG c4 theta       0.299 +/- 0.060    [0.098 ; 0.960]
EEG o1 theta       0.219 +/- 0.067    [0.080 ; 0.922]
EEG o2 theta       0.271 +/- 0.079    [0.080 ; 0.931]
EEG fp1 alpha      0.112 +/- 0.077    [0.043 ; 0.981]
EEG fp2 alpha      0.124 +/- 0.081    [0.046 ; 0.956]
EEG t3 alpha       0.158 +/- 0.080    [0.055 ; 0.946]
EEG t4 alpha       0.181 +/- 0.082    [0.055 ; 0.928]
EEG c3 alpha       0.249 +/- 0.070    [0.088 ; 0.943]
EEG c4 alpha       0.246 +/- 0.069    [0.085 ; 0.957]
EEG o1 alpha       0.116 +/- 0.066    [0.039 ; 0.910]
EEG o2 alpha       0.151 +/- 0.066    [0.048 ; 0.935]
EEG fp1 beta1      0.114 +/- 0.079    [0.043 ; 0.985]
EEG fp2 beta1      0.123 +/- 0.083    [0.046 ; 0.943]
EEG t3 beta1       0.152 +/- 0.084    [0.045 ; 0.957]
EEG t4 beta1       0.168 +/- 0.087    [0.053 ; 0.930]
EEG c3 beta1       0.234 +/- 0.077    [0.092 ; 0.942]
EEG c4 beta1       0.226 +/- 0.074    [0.079 ; 0.949]
EEG o1 beta1       0.091 +/- 0.070    [0.028 ; 0.916]
EEG o2 beta1       0.129 +/- 0.070    [0.041 ; 0.970]
EEG fp1 beta2      0.217 +/- 0.081    [0.086 ; 0.990]
EEG fp2 beta2      0.211 +/- 0.076    [0.083 ; 0.958]
EEG t3 beta2       0.189 +/- 0.070    [0.063 ; 0.927]
EEG t4 beta2       0.226 +/- 0.083    [0.065 ; 0.922]
EEG c3 beta2       0.248 +/- 0.066    [0.092 ; 0.960]
EEG c4 beta2       0.246 +/- 0.065    [0.090 ; 0.966]
EEG o1 beta2       0.230 +/- 0.085    [0.076 ; 0.958]
EEG o2 beta2       0.220 +/- 0.080    [0.055 ; 0.932]
EEG fp1 gama       0.154 +/- 0.073    [0.058 ; 0.976]
EEG fp2 gama       0.172 +/- 0.076    [0.075 ; 0.956]
EEG t3 gama        0.196 +/- 0.069    [0.067 ; 0.958]
EEG t4 gama        0.227 +/- 0.078    [0.071 ; 0.897]
EEG c3 gama        0.289 +/- 0.063    [0.097 ; 0.959]
EEG c4 gama        0.281 +/- 0.061    [0.095 ; 0.959]
EEG o1 gama        0.168 +/- 0.065    [0.062 ; 0.915]
EEG o2 gama        0.237 +/- 0.077    [0.072 ; 0.912]

TABLE I
STATISTICS OF ATTRIBUTES OF THE NORMALIZED SUBSET

Fig. 3. Graph showing the normalized data subset with positive class = 1; x-axis: attributes, y-axis: values of attributes (0.0 to 1.0)

IV. METHODOLOGY

A. Used tool

The tool used is RapidMiner (v4.0) [1]. RapidMiner lets the user carry out all phases of data mining inside one tool, which reduces the need to get familiar with more than one environment. All operators used in this work are accessible from the basic version of RapidMiner.

B. Configuration

The project is built by combining many operators in RapidMiner. The complete tree view of the operators used to find the best k in nearest-neighbor classification is shown in Fig. 5.

• All operators have the local random seed set to -1. Only the Root operator has the value 2001, so that the random operations generate the same values on every run. Wherever an operator has a sampling type, it is set to stratified sampling.
• The SplitChain operator has the split ratio set to 0.2.
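The parameter search described above (iterate k from 1 to 15, score each k by 10-fold cross validation, keep the best) can be sketched in plain Python. This is an illustrative re-implementation, not the RapidMiner operators themselves; `knn_predict`, the fold layout and the reuse of seed 2001 are my assumptions chosen to mirror the text.

```python
import random
from collections import Counter

def knn_predict(train, xq, k):
    """Majority vote among the k training examples nearest to xq."""
    nearest = sorted(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], xq)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_validated_accuracy(train_set, k, n_folds=10, seed=2001):
    """Average kNN accuracy over n_folds cross-validation folds."""
    rng = random.Random(seed)          # fixed seed: identical folds on every run
    data = train_set[:]
    rng.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i in range(n_folds):
        held_out = folds[i]            # validate on one fold ...
        rest = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        correct = sum(1 for x, label in held_out if knn_predict(rest, x, k) == label)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / n_folds

def best_k(train_set, k_values=range(1, 16)):
    """Try each candidate k and keep the one with the highest CV accuracy."""
    return max(k_values, key=lambda k: cross_validated_accuracy(train_set, k))
```

Because every k is scored on held-out folds rather than on the data the model memorized, a k that merely overfits the training rows cannot win the comparison.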
• The XValidation operator has the number of validations set to 10 and the measure set to Euclidean Distance.
• The "NearestNeighbor trying k" operator has k set to 15, but this parameter is overridden by the "Iterating k - training" operator.
• The ClassificationPerformance (1) operator has accuracy checked.
• The "NearestNeighbor defined k" operator has k set to 3 and the measure set to Euclidean Distance.
• The ClassificationPerformance (2) operator has accuracy checked.
• The BinominalClassificationPerformance operator has fallout checked.
• The ProcessLog operator logs the accuracy from ClassificationPerformance.

C. Experiments setup

The nearest-neighbor classification uses Euclidean distance to compute the kNN. In plain words it can be translated as "find the point x_i closest to x_j". The Euclidean distance between x_i and x_j over the n attributes is defined as:

d(x_i, x_j) = sqrt( (x_i1 − x_j1)^2 + (x_i2 − x_j2)^2 + ... + (x_in − x_jn)^2 )

The NN algorithm can be stated as follows [6]:

• Training phase: build the set of training examples T.
• Testing phase:
  – A query instance x_q to be classified is given.
  – Let x_1, ..., x_k denote the k instances from T that are nearest to x_q.
  – Return the majority class among these neighbors:

    F(x_q) = argmax_v Σ_{i=1}^{k} δ(v, f(x_i))

    where f(x_i) is the class of x_i and δ(a, b) = 1 if a = b, else 0.

The best k in the nearest-neighbor classifier is found by iterating k from 1 to 15; the top value of 15 was judged to be enough. In each iteration the ClassificationPerformance operator computes the accuracy for the given k. The ProcessLog operator records the results of ClassificationPerformance and generates the report (Fig. 4). From the report it stands to reason that the best k is 3.
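A minimal Python sketch of the distance and voting scheme above, assuming a training set of `(attributes, class)` pairs; the function names are mine, not from [6]:

```python
import math
from collections import Counter

def euclidean_distance(xi, xj):
    """d(x_i, x_j) = sqrt of the sum of squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_classify(training_set, xq, k):
    """Return F(x_q): the class with the most votes among the k nearest neighbors."""
    # Training phase: the set T of (attributes, class) pairs is simply stored.
    # Testing phase: find the k instances of T nearest to the query x_q ...
    nearest = sorted(training_set, key=lambda ex: euclidean_distance(ex[0], xq))[:k]
    # ... and let each of them cast one vote (the delta function) for its class.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

An odd k such as the chosen k = 3 also avoids exact 50/50 vote ties in the two-class setting used here.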
          True 0    True 1
Pred 0    1609      225
Pred 1    159       361

TABLE II
NN CLASSIFICATION FOR k = 3; accuracy = 83.69%, FPr = 8.99%, TPr = 61.60%
The positive class is the class with value = 1. For k = 3 the accuracy was 83.69%, as shown in Tab. II. The False Positive rate (FPr) of this classifier is 8.99%:

FPr = 159 / (159 + 1609) = 0.0899

The True Positive rate (TPr) is evidently 61.60%, because there are 586 examples with class = 1 and just 361 of them were classified correctly:

TPr = 361 / (361 + 225) = 0.616
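The two hand computations above follow directly from the confusion matrix in Tab. II; a small script reproduces them:

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Accuracy, FPr and TPr from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fpr = fp / (fp + tn)   # fraction of true negatives wrongly flagged positive
    tpr = tp / (tp + fn)   # fraction of true positives correctly recognized
    return accuracy, fpr, tpr

# Values from Tab. II: tp = 361 (Pred 1, True 1), fp = 159 (Pred 1, True 0),
# fn = 225 (Pred 0, True 1), tn = 1609 (Pred 0, True 0).
accuracy, fpr, tpr = metrics_from_confusion(tp=361, fp=159, fn=225, tn=1609)
print(round(accuracy, 4), round(fpr, 4), round(tpr, 4))  # prints: 0.8369 0.0899 0.616
```

Note that the four cells sum to 2354, which matches the 4/5 testing share of the 2942-row subset.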
Fig. 5. "Box view" of the complete project in RapidMiner

V. DISCUSSION

The False Positive rate seems to be very good. It may even look suspiciously low, but that is probably due to the large subset of training data.

Open to discussion is whether the ratio of 0.2 for splitting the data subset into training and testing parts was set correctly. With a faster computer the opposite ratio of 0.8 could be used. In my opinion the 584 training examples were enough, and the FPr suggests the ratio was not chosen badly. On the other hand, the TPr of 61.60% is not much, and it could easily be pushed higher or lower at the cost of influencing the FPr.

The next question is whether the algorithm should use weighted kNN. I tried to find dependencies between the 55 attributes but was not successful, so I do not think that setting weights on the attributes would be helpful.
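For completeness: besides weighting attributes, a common kNN variant (not used in this work) weights each neighbor's vote by inverse squared distance, so that closer neighbors count for more. A sketch under that assumption, with names of my own choosing:

```python
from collections import defaultdict

def weighted_knn_predict(train, xq, k):
    """Distance-weighted kNN: each neighbor votes with weight 1/d^2."""
    # Sort training examples by squared distance to the query.
    scored = sorted(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], xq)))
    votes = defaultdict(float)
    for attrs, label in scored[:k]:
        d2 = sum((a - b) ** 2 for a, b in zip(attrs, xq))
        votes[label] += 1.0 / (d2 + 1e-9)   # small epsilon guards exact matches
    return max(votes, key=votes.get)
```

With plain majority voting, a query at 0.9 against the training set `[([0.0], 0), ([0.3], 0), ([1.0], 1)]` and k = 3 would get class 0 (two of its three neighbors are class 0); the 1/d² weighting flips the decision to the single, much closer class-1 neighbor.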
VI. CONCLUSION

In my opinion I found a very good classifier for the subset of the dataset. Some improvements could still be made to make the algorithm work better. Such improvements would matter if the classifier were used in practice, but for a school work they are not so important.

The hardest part of the work was exploring the operators in RapidMiner and finding the right ones. I know some of them could still be replaced by better operators, but this solution worked and, what is more, gave good results. Most of the time I spent waiting for RapidMiner to process all the operators on the given dataset. Unfortunately the program is written in Java, which is not a language for scientific computing, and I had to restart Java quite often because it ran out of memory. The most interesting part for me was generating the graphs and writing this report.

I am very satisfied that I finished the work, and I can say that I learnt a lot about data mining and about classifying a dataset. I am afraid anybody can tell from this work that my future specialization will be Software Engineering and that such scientific work is not my cup of tea.
REFERENCES

[1] CENTRAL QUEENSLAND UNIVERSITY. RapidMiner GUI Manual [online]. May 29, 2007 [cit. 2008-02-08]. Available from WWW: <http://os.cqu.edu.au/oswins/datamining/rapidminer/rapidminer-4.0beta-guimanual.pdf>.
[2] FARKASOVA, Blanka; KRCAL, Martin. Project Bibliographic Citations [online]. c2004-2008 [cit. 2008-05-08]. CZ. Available from WWW: <http://www.citace.com/>.
[3] LAURIKKALA, Jorma. Improving Identification of Difficult Small Classes by Balancing Class Distribution. Department of Computer and Information Sciences, University of Tampere, 2001. 14 p. Report. Available from WWW: <http://www.cs.uta.fi/reports/pdf/A-2001-2.pdf>. ISBN 951-44-5093-0.
[4] TEKNOMO, Kardi. K-Nearest Neighbors Tutorial [online]. c2006 [cit. 2008-05-08]. Available from WWW: <http://people.revoledu.com/kardi/tutorial/KNN/>.
[5] POBLANO, Adrian; GUTIERREZ, Roberto. Correlation between the neonatal EEG and the neurological examination in the first year of life in infants with bacterial meningitis. Arq. Neuro-Psiquiatr. [online]. 2007, vol. 65, no. 3a [cit. 2008-05-10], pp. 576-580. Available from WWW: <http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0004-282X2007000400005&lng=en&nrm=iso>. ISSN 0004-282X. doi:10.1590/S0004-282X2007000400005.
[6] SOLOMATINE, D. P. Instance-Based Learning and k-Nearest Neighbor Algorithm [online]. c1988-2003 [cit. 2008-05-10]. EN. Available from WWW: <http://www.xs4all.nl/~dpsol/data-machine/nmtutorial/instancebasedlearningandknearestneighboralgorithm.htm>.
[7] VAYATIS, Nicolas; CLÉMENÇON, Stéphan. Advanced Machine Learning Course [online]. [2008] [cit. 2008-05-08]. EN. Available from WWW: <http://www.cmla.ens-cachan.fr/Membres/vayatis/teaching/cours-de-machine-learning-ecp.html>.
[8] ZHU, Xiaojin. K-Nearest-Neighbor: An Introduction to Machine Learning. CS 540: Introduction to Artificial Intelligence [online]. 2005 [cit. 2008-05-08]. Available from WWW: <http://pages.cs.wisc.edu/~jerryzhu/cs540/knn.pdf>.
[9] VAN DEN BOSCH, Antal. Video: K-Nearest Neighbor Classification [online]. Tilburg University, c2007 [cit. 2008-05-10]. EN. Available from WWW: <http://videolectures.net/aaai07_bosch_knnc/>.