A Comparative Study of Machine Learning Techniques for Caries Prediction in Children
Robson D. Montenegro, Adriano L. I. Oliveira, George G. Cabral
Department of Computing and Systems, Polytechnic School, Pernambuco State University
Rua Benfica, 455, Madalena, Recife PE, Brazil, 50.750-410
{adriano,rdm,ggc}@dsc.upe.br
Cintia R. T. Katz, Aronita Rosenblatt
Department of Preventive and Social Dentistry, Faculty of Dentistry, Pernambuco State University
Av. Gal. Newton Cavalcanti, 1.650 - Camaragibe, PE, Brazil, 54.753-220
cintiakatz@uol.com.br, rosen@reitoria.upe.br
Abstract
There are striking disparities in the prevalence of dental disease by income. Poor children suffer twice as much dental caries as their more affluent peers, but are less likely to receive treatment. This paper presents an experimental study of the application of machine learning methods to the problem of caries prediction. For this paper, a data set was built from interviews conducted in 2006 with children under five years of age in Recife, the capital of Pernambuco, a state in northeast Brazil. Four different data mining techniques were applied to this problem and their results were compared in terms of the classification error and the area under the ROC curve (AUC). Results showed that the MLP neural network classifier outperformed the other machine learning methods employed in the experiments, followed by the support vector machine (SVM) predictor. In addition, the results also show that some rules (extracted by decision trees) may be useful for understanding the most important factors that influence the occurrence of caries in children.
1 Introduction
Early childhood caries is a disease that occurs in young children and is associated with malnutrition and inadequate eating habits during weaning. Dental caries is the single most common chronic childhood disease: five times more common than asthma and seven times more common than hay fever. This disease is considered a public health problem due to its impact on quality of life; it affects, almost exclusively, children of less privileged socio-economic groups in developed and developing countries. Preceded by enamel defects, early childhood caries may have its progress limited if detected early [22][21].
The increasingly widespread use of information systems in health and the considerable growth of databases require traditional manual data analyses to give way to new, efficient computational models [13]; manual processes easily break down as the size of the data grows and the number of dimensions increases. Data mining is a research method that has been used to benefit a large number of fields of medicine, including diagnosis, prognosis, and the treatment of diseases [2][3][17]. It encompasses techniques such as machine learning and artificial neural networks (ANNs), which have been successfully applied to medical problems to predict clinical results [2][17].
In recent years, there has been a significant increase in the use of technology in medicine and related areas. The complexity and sophistication of these technologies often require the solution of decision problems using combinatorics and optimization methods [3]. Despite the importance of data mining and machine learning techniques, there remains little application of these techniques to the field of dentistry. Recently, Oliveira et al. applied machine learning techniques in this field, aiming to predict the success of dental implants [6][18].
The purpose of this paper is to build robust models to predict the presence of caries in preschool children under five years of age in state schools (attended by the low-income population) in Recife, the capital of Pernambuco, in the northeastern region of Brazil. This paper also aims to extract and display, in a more friendly form, the rules, or factors, associated with caries prediction in this specific case.
2 Data Set Characteristics
A databank was constructed with information collected from 3864 Brazilian preschool children under five years of age. A cross-sectional study was conducted in state schools (attended by the low-income population) in Recife, the capital of Pernambuco, in the northeastern region of Brazil.
Recife is one of the three most important urban centers of the northeastern region of Brazil. The population of the city and its surrounding area is over 3 million people. The city is divided into six administrative regions and has 153 schools run by the municipality, attended by 4,787 four-year-old children.
The questionnaires were completed during personal in-
terviews with each child’s mother. In every case, the ex-
aminer was blind to the child’s questionnaire data. Exam-
inations were performed under natural light, in the class-
room environment, using tongue blades, gloves and masks,
in compliance with the infection control protocol (Ministry
of Health, Brazil).
For each child, 193 (one hundred and ninety-three) features were collected in the questionnaire. Of this total, only sixteen features were considered significant to the problem of caries prediction.
As shown in Table 1, there is a significantly greater occurrence of healthy samples, making the data set unbalanced [14]. For this reason, only 998 samples were considered in the caries prediction experiments. These 998 samples are equally divided into caries and healthy samples.
Table 1. Distribution of caries in the whole dataset.

Class     Number of samples
Caries    499
Healthy   3365
Total     3864
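The undersampling step described above can be sketched as follows; the labels are synthetic stand-ins mirroring the paper's class counts, and the random selection is illustrative rather than the authors' actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels mirroring the paper's class distribution:
# 499 caries samples (1) and 3365 healthy samples (0), 3864 in total.
y = np.array([1] * 499 + [0] * 3365)

# Undersample the majority (healthy) class so both classes contribute
# 499 samples each, yielding the balanced 998-sample subset.
caries_idx = np.flatnonzero(y == 1)
healthy_idx = rng.choice(np.flatnonzero(y == 0), size=len(caries_idx),
                         replace=False)
balanced_idx = np.concatenate([caries_idx, healthy_idx])

print(len(balanced_idx))  # 998
```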
The input variables (attributes) considered in our problem are:

1. Gender (male/female).
2. Age in months.
3. Parent's opinion about the oral health of the child (excellent, good, regular, bad, very bad).
4. Has the child already had a toothache? (yes/no)
5. Family income in minimum wages (1 to 7, or more).
6. Child has already gone to the dentist and a caries was diagnosed (yes/no).
7. Child has never gone to the dentist for another reason (yes/no).
8. Child has already gone to the dentist (yes/no).
9. Child has already visited the dentist for having a toothache (yes/no).
10. Presence of failure in the enamel (yes/no).
11. Presence of fistula (yes/no).
12. Political-administrative region (from 1 to 6).
13. Child has never gone to the dentist for access reasons (yes/no).
14. Child has already gone to the dentist for prevention reasons (yes/no).
15. Child has never gone to the dentist for financial reasons (yes/no).

The output variable is:

1. Presence of caries (yes/no).
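As an illustration of how such questionnaire answers could be turned into a numeric feature vector for the classifiers, here is a minimal sketch; the field names and ordinal encoding are hypothetical, not taken from the paper's data set.

```python
# Hypothetical encoding of one questionnaire record into a numeric
# feature vector; the field names and scales below are illustrative only.
OPINION_SCALE = {"excellent": 0, "good": 1, "regular": 2, "bad": 3,
                 "very bad": 4}
YES_NO = {"no": 0, "yes": 1}

record = {
    "gender": "female",
    "age_months": 48,
    "oral_health_opinion": "regular",
    "had_toothache": "yes",
    # ... the remaining yes/no attributes would be encoded the same way
}

features = [
    1 if record["gender"] == "female" else 0,       # attribute 1
    record["age_months"],                           # attribute 2
    OPINION_SCALE[record["oral_health_opinion"]],   # attribute 3
    YES_NO[record["had_toothache"]],                # attribute 4
]
print(features)  # [1, 48, 2, 1]
```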
3 The Classifiers Evaluated
In this section we briefly review the four classification
techniques used in this work, namely, (1) decision trees, (2)
MLP neural networks, (3) kNN, and (4) support vector ma-
chines.
Decision trees are statistical models for classification and data prediction. These models take a "divide-and-conquer" approach: a complex problem is decomposed into simpler sub-problems and, recursively, the technique is applied to each sub-problem [10].
For this work we have chosen one of the most popular algorithms for building decision trees, C4.5 [20]. C4.5 is a software extension of the basic ID3 algorithm designed by Quinlan to address some issues not dealt with by ID3, such as avoiding overfitting the data, determining how deeply to grow a decision tree, and improving computational efficiency. Quinlan's C4.5 has a parameter named the confidence factor, denoted by C, that is used for pruning. In general, smaller values of C yield more pruning. For the experiments we varied the value of the confidence factor to obtain a more accurate classification model.
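C4.5 itself is not available in scikit-learn, but the effect of the pruning parameter can be illustrated with scikit-learn's CART trees, whose cost-complexity parameter ccp_alpha plays a loosely analogous role (larger values prune harder); the data here are synthetic stand-ins for the caries data set.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 998-sample, 15-attribute caries data.
X, y = make_classification(n_samples=998, n_features=15, random_state=0)

# scikit-learn implements CART rather than C4.5, but ccp_alpha is
# loosely analogous to the confidence factor: more pruning yields a
# smaller, simpler tree (compare the 78 vs. 5 nodes reported below).
light = DecisionTreeClassifier(ccp_alpha=0.0, random_state=0).fit(X, y)
heavy = DecisionTreeClassifier(ccp_alpha=0.05, random_state=0).fit(X, y)

print(light.tree_.node_count, heavy.tree_.node_count)
```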
The MLP (Multi-Layer Perceptron) neural network derives from the Perceptron model of neural networks. Unlike the basic perceptron, MLPs are able to solve non-linearly separable problems. For this work we have chosen the backpropagation learning algorithm for training MLP neural networks.
The MLP network is trained by adapting the weights. During training, the network output is compared with a desired output. The error, that is, the difference between these two signals, is used to adapt the weights. The rate of adaptation is controlled by the learning rate. A high learning rate will make the network adapt its weights quickly, but will make it potentially unstable. Therefore, it is recommended to use small learning rates in practical applications.
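A small MLP echoing the paper's configuration (2 hidden units, learning rate 0.01, up to 500 epochs) can be sketched with scikit-learn; note that its optimizer differs from Weka's plain backpropagation, and the data are synthetic, so results are only indicative.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the caries data.
X, y = make_classification(n_samples=998, n_features=15, random_state=0)

# 2 hidden units and learning rate 0.01, mirroring the paper's setup.
mlp = MLPClassifier(hidden_layer_sizes=(2,), learning_rate_init=0.01,
                    max_iter=500, random_state=0)
mlp.fit(X, y)

# Two weight matrices: input -> hidden and hidden -> output.
print(len(mlp.coefs_), round(mlp.score(X, y), 2))
```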
kNN is a classical prototype-based (or memory-based)
classifier, which is often used in real-world applications due
to its simplicity [24]. Despite its simplicity, it has achieved
considerable classification accuracy on a number of tasks
and is therefore quite often used as a basis for comparison
with novel classifiers.
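A kNN classifier in the spirit of the one evaluated here can be sketched as follows; k = 19 echoes the value reported for kNN without feature selection, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the caries data.
X, y = make_classification(n_samples=998, n_features=15, random_state=0)

# kNN stores the training prototypes; each query point is assigned the
# majority class among its k = 19 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=19).fit(X, y)
pred = knn.predict(X[:5])
print(pred)
```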
Support vector machine (SVM) is a recent technique for
classification and regression which has achieved remarkable
accuracy in a number of important problems [4], [23], [5],
[1]. SVM is based on the principle of structural risk mini-
mization (SRM), which states that, in order to achieve good
generalization performance, a machine learning algorithm
should attempt to minimize the structural risk instead of
the empirical risk [9], [1]. The empirical risk is the error in
the training set, whereas the structural risk considers both
the error in the training set and the complexity of the class
of functions used to fit the data. Despite its popularity in
the machine learning and pattern recognition communities,
a recent study has shown that simpler methods, such as kNN
and neural networks, can achieve performance comparable
to or even better than SVMs in some classification and re-
gression problems [16].
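An RBF-kernel SVM along these lines can be sketched with scikit-learn; C = 1 echoes the value used in the experiments without feature selection, while gamma stands in for the paper's sigma (Weka parameterizes the RBF width differently, so the value is illustrative, and the data are synthetic).

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the caries data.
X, y = make_classification(n_samples=998, n_features=15, random_state=0)

# RBF-kernel SVM: the regularization constant C trades off training
# error (empirical risk) against the complexity of the fitted function.
svm = SVC(C=1.0, kernel="rbf", gamma=0.1).fit(X, y)
print(round(svm.score(X, y), 2))
```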
4 Experiments
The simulations were carried out using the Weka data
mining tool, which includes several pre-processing and
classification methods [25].
We have used 10-fold cross-validation to assess generalization performance and to compare the classifiers considered in this article. In 10-fold cross-validation (CV), a given dataset is divided into ten subsets. A classifier is trained using a subset formed by joining nine of these subsets and tested using the one left aside [7]. This is done ten times, each time employing a different subset as the test set and computing the test set error Ei. Finally, the cross-validation error is computed as the mean over the ten errors Ei, 1 ≤ i ≤ 10. It is important to emphasize that all the simulations reported here used stratified CV, whereby the subsets are formed with the same frequency distribution of patterns as the original data set [25].
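The stratified 10-fold procedure described above can be sketched as follows, using synthetic data and an illustrative kNN classifier in place of the paper's Weka setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the balanced 998-sample caries data.
X, y = make_classification(n_samples=998, n_features=15, random_state=0)

# Stratified 10-fold CV: each fold preserves the class proportions
# of the full data set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=19), X, y, cv=cv)

# The CV error is one minus the mean accuracy over the ten folds.
cv_error = 1.0 - scores.mean()
print(len(scores), round(cv_error, 3))
```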
The performance measures used to compare the classifiers are (1) the classification error and (2) the area under the ROC curve (AUC) [9], [11], [12]. ROC curves originated in signal detection theory and are most frequently used in the case of one-class classification or classification with two classes, which is the case of our problem [18][8]. In the ROC curve, the x-axis represents the PFA (Probability of False Alarm), which identifies normal patterns wrongly classified as novelties; the y-axis represents the PD (Probability of Detection), which identifies the likelihood of patterns of the novelty class being recognized correctly. The area under the ROC curve (AUC) summarizes the ROC curve and is a way to compare classifiers other than by accuracy, according to Huang and Ling [11]. When comparing classifiers, the best classifier is the one whose AUC is closest to 1.
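The interpretation of AUC as a ranking measure can be illustrated with a tiny example: a classifier that scores every positive above every negative attains the maximal AUC of 1, while reversing the ranking attains the minimal AUC of 0. The scores below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])

# Illustrative predicted scores: every positive is ranked above every
# negative, so the ROC curve hugs the top-left corner and AUC = 1.
perfect = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
reversed_scores = perfect[::-1]

print(roc_auc_score(y_true, perfect))          # 1.0
print(roc_auc_score(y_true, reversed_scores))  # 0.0
```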
Aiming to select the attributes from the dataset with the greatest significance to the problem, we used InfoGainAttributeEval as the attribute evaluator, with the search method Ranker. InfoGainAttributeEval evaluates the worth of an attribute by measuring the information gain with respect to the class. Ranker ranks attributes by their individual evaluations, using a threshold below which attributes are discarded. For our experiments we varied this threshold from 10^-4 to 10^-1.
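Weka's InfoGainAttributeEval has no direct Python counterpart, but a similar gain-based ranking with a discard threshold can be sketched with scikit-learn's mutual information estimator; the data and threshold here are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data with a few informative attributes among 15.
X, y = make_classification(n_samples=998, n_features=15, n_informative=3,
                           random_state=0)

# mutual_info_classif estimates each attribute's information gain with
# respect to the class, similar in spirit to InfoGainAttributeEval.
gain = mutual_info_classif(X, y, random_state=0)

# Ranker-style selection: keep attributes whose gain exceeds a threshold.
threshold = 1e-4
kept = np.flatnonzero(gain > threshold)
print(len(kept))
```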
4.1 Results and Discussion
We carried out experiments aiming to analyze the performance for the different selected attributes. Table 2 shows the results obtained using the whole input feature vector (15 input variables), that is, without feature selection. In this experiment we achieved the best result with the MLP method, followed by the SVM (in terms of 10-fold cross-validation error). In terms of AUC, MLP achieved the best result, followed by kNN.
For the decision trees, the results demonstrate that the parameter C has a great influence on the performance of the classifier: the error increased by 5.01% from C = 0.25 to C = 0.001. For C = 0.25, the decision tree created 78 nodes, while the decision tree using C = 0.001 created only 5 nodes. Fig. 1 shows the simple model created by the C4.5 algorithm for C = 0.001 without feature selection. These results are consistent with the AUC results: the AUC for C = 0.25 is better than the AUC for C = 0.001.
Among all the experiments carried out using feature selection, the best results were obtained with the InfoGainAttributeEval threshold of 10^-4, which means using only two input attributes. The two attributes selected by InfoGainAttributeEval were the age in months and the parent's opinion about the oral health of the child.
Table 2. Caries prediction results without feature selection (15 input attributes).

Classifier                                                        10-fold CV error   AUC
kNN (k = 19)                                                      26.75%             0.8178
C4.5 (C = 0.25)                                                   25.95%             0.7985
C4.5 (C = 0.001)                                                  30.96%             0.7193
MLP (hidden layer units = 2, learning rate = 0.01, epochs = 500)  22.75%             0.8452
SVM (C = 1, σ = 0.1)                                              23.65%             0.7635

Figure 1. Decision tree for C = 0.001.

Table 3 shows the results obtained using the InfoGainAttributeEval threshold of 10^-4. With only two attributes, we improved the results obtained by kNN and the decision trees. Conversely, the results of the MLP and SVM methods were inferior to those with 15 input variables. In these experiments, as in the experiments without feature selection, we achieved the best result with the MLP method, followed by kNN in terms of both performance criteria, namely the classification error and the AUC value.
Using feature selection, both decision tree models achieved a slight performance improvement. As a multidisciplinary work, this paper chose decision trees as one of the methods for this problem because of their ability to extract rules from the data. For a dentist, it is easier to use the results provided by decision trees than the results of classifiers such as MLPs, which are harder to interpret.
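The kind of readable if-then rules that make decision trees attractive to clinicians can be rendered, for example, with scikit-learn's export_text; the data, tree, and feature names below are purely illustrative stand-ins for the selected attributes.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-attribute data; the feature names below are hypothetical
# stand-ins for the two attributes selected by feature selection.
X, y = make_classification(n_samples=998, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the tree as nested if-then rules that can be
# read directly, without any machine learning background.
rules = export_text(tree, feature_names=["age_months",
                                         "oral_health_opinion"])
print(rules)
```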
5 Conclusion
Early childhood caries is considered a public health problem that occurs frequently in children of less privileged socio-economic groups. In this work we compared the performance of four different classifiers applied to the problem of caries prediction. For this problem, we also performed feature selection on the dataset, aiming to retrieve the attributes most relevant to the task of caries prediction.

The results have shown that the best model for caries prediction was obtained by the MLP neural network, which achieved a 10-fold cross-validation error rate of 22.75% without feature selection. Using InfoGainAttributeEval as the feature selection method, the MLP and SVM methods had a slight performance loss, whereas the decision trees (C = 0.001 and C = 0.25) and kNN achieved a slight improvement in their performance.
From the results obtained in this work we can see that children aged twenty-three months or older are more caries prone. The results also show that the family income, whether the child has already had a toothache, and whether the child has already had a caries diagnosis influence the occurrence of the disease. The results also show that children already diagnosed with caries presented recurrence; this leads us to conclude that treatment is not achieving the needed efficiency in the re-education of the child's oral hygiene.
References
[1] V. D. Sánchez A. Advanced support vector machines and kernel methods. Neurocomputing, 55(1-2):5–20, 2003.
[2] S. R. Bhatikar, C. DeGroff, and R. L. Mahajan. A classi-
fier based on the artificial neural network approach for car-
diologic auscultation in pediatrics. Artificial Intelligence in
Medicine, 33(3):251–260, 2005.
[3] T.-C. Chen and T.-C. Hsu. A GAs based approach for min-
ing breast cancer pattern. Expert Syst. Appl, 30(4):674–681,
2006.
[4] C. Cortes and V. Vapnik. Support vector networks. Machine
Learning, 20:1–25, 1995.
[5] N. Cristianini and J. Shawe-Taylor. An Introduction to Sup-
port Vector Machines. Cambridge University Press, 2000.
[6] A. L. I. de Oliveira, C. Baldisserotto, and J. Baldisserotto. A
comparative study on machine learning techniques for pre-
diction of success of dental implants. In A. F. Gelbukh,
A. de Albornoz, and H. Terashima-Mar´ın, editors, MICAI,
volume 3789 of Lecture Notes in Computer Science, pages
939–948. Springer, 2005.
[7] D. Delen, G. Walker, and A. Kadam. Predicting breast can-
cer survivability: a comparison of three data mining meth-
ods. Artificial Intelligence in Medicine, 34(2):113–127,
2005.
[8] N. M. Farsi and F. S. Salama. Sucking habits in Saudi children: prevalence, contributing factors and effects on the primary dentition. Pediatr Dent, 19(1):28–33, 1997.
[9] T. Fawcett. An introduction to ROC analysis. Pattern Recog-
nition Letters, 27(8):861–874, June 2006.
[10] J. Gama. Functional trees. Machine Learning, 55(3):219–
250, 2004.
Table 3. Caries prediction results for InfoGainAttributeEval threshold = 10^-4 (2 input attributes).

Classifier                                                        10-fold CV error   AUC
kNN (k = 11)                                                      24.65%             0.8136
C4.5 (C = 0.25)                                                   25.15%             0.8011
C4.5 (C = 0.001)                                                  29.76%             0.7458
MLP (hidden layer units = 2, learning rate = 0.01, epochs = 500)  24.75%             0.8223
SVM (C = 100, σ = 0.1)                                            25.05%             0.7495

Figure 2. Decision tree for C = 0.25 with feature selection and InfoGainAttributeEval threshold = 10^-4.
[11] J. Huang and C. X. Ling. Using AUC and accuracy in eval-
uating learning algorithms. IEEE Trans. Knowl. Data Eng,
17(3):299–310, 2005.
[12] T. A. Lasko, J. G. Bhagwat, K. H. Zou, and L. Ohno-
Machado. The use of receiver operating characteristic
curves in biomedical informatics. Journal of Biomedical In-
formatics, 38(5):404–415, 2005.
[13] N. Lavraˇc. Machine learning for data mining in medicine. In
W. Horn, Y. Shahar, G. Lindberg, S. Andreassen, and J. Wy-
att, editors, Proceedings of the Joint European Conference
on Artificial Intellingence in Medicine and Medical Decision
Making (AIMDM-99), volume 1620 of LNAI, pages 47–64,
Berlin, June 20–24 1999. Springer.
[14] Y. Lu, H. Guo, and L. Feldkamp. Robust neural learning
from unbalanced data samples. In IEEE International Con-
ference on Neural Networks (IJCNN’98), volume III, pages
III–1816–III–1821, Anchorage, AK, July 1998. IEEE.
[15] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
[16] D. Meyer, F. Leisch, and K. Hornik. The support vector ma-
chine under test. Neurocomputing, 55(1-2):169–186, 2003.
[17] B. A. Mobley, E. Schechter, W. E. Moore, P. A. McKee,
and J. E. Eichner. Neural network predictions of significant
coronary artery stenosis in men. Artificial Intelligence in
Medicine, 34(2):151–161, 2005.
[18] A. L. I. Oliveira, C. Baldisserotto, and J. Baldisserotto. A
comparative study on support vector machine and construc-
tive RBF neural network for prediction of success of den-
tal implants. In A. Sanfeliu and M. Lazo-Cort´es, editors,
CIARP, volume 3773 of Lecture Notes in Computer Science,
pages 1015–1026. Springer, 2005.
[19] J. R. Quinlan. Induction of decision trees. In J. W. Shavlik
and T. G. Dietterich, editors, Readings in Machine Learning.
Morgan Kaufmann, 1990. Originally published in Machine
Learning 1:81–106, 1986.
[20] J. R. Quinlan. C4.5: Programs for Machine Learning. Mor-
gan Kaufmann, San Mateo, CA., 1993.
[21] S. Reisine and J. M. Douglass. Psychosocial and behavioral issues in early childhood caries, 1998.
[22] A. Rosenblatt and A. Zarzar. Breast feeding and early childhood caries: an assessment among Brazilian infants. International Journal of Paediatric Dentistry, 14:439–450, 2004.
[23] J. S. Taylor and N. Cristianini. Kernel Methods for Pattern
Analysis. Cambridge University Press, 2004.
[24] A. Webb. Statistical Pattern Recognition. Wiley, 2002.
[25] I. H. Witten and E. Frank. Data Mining: Practical Machine
Learning Tools and Techniques with Java Implementations.
Morgan Kaufmann, San Francisco, 2000.
[26] I. H. Witten and E. Frank. Data mining: practical machine
learning tools and techniques with Java implementations.
SIGMOD, 31(1):76–77, Mar. 2002.