Integrated Intelligent Research (IIR) International Journal of Business Intelligent
Volume: 04, Issue: 01, June 2015, Pages 26-29
ISSN: 2278-2400
Machine Learning Approaches and its Challenges
Raj Mohan Kumaravel¹, Ilango Paramasivam²
¹Research Scholar, School of Information Technology & Engineering, VIT University, Vellore
²Professor, School of Computing Science & Engineering, VIT University, Vellore
E-Mail: k.rajmohan90@gmail.com, pilango@vit.ac.in
Abstract – Real-world data sets are rarely in a clean, complete form; they often contain incomplete or missing values, and identifying the missing attributes is a challenging task. Before missing data can be imputed, data preprocessing has to be carried out: preprocessing is the data mining step that cleanses the data, and handling missing data is a crucial part of any data mining technique. Major industries and many real-time applications are deeply concerned about their data, because data loss can drag a company's growth down. For example, the health care industry holds a great deal of data about patients, and diagnosing a particular patient requires exact data; when attribute values are missing, the records become very difficult to use reliably. Given the drawbacks that missing values introduce into the data mining process, many techniques and algorithms have been implemented, but many of them are not efficient. This paper elaborates the various techniques and machine learning approaches for handling missing attribute values and presents a comparative analysis to identify the most efficient method.
Keywords— Data mining, data set, impute, missing attributes,
preprocessing
I. INTRODUCTION
Incomplete data are very common in large databases. When attribute values are missing, the database is left in an inconsistent state, so data preprocessing is an essential step for addressing missing attribute values. Typically the missing values can be replaced through several possible approaches, and some domain knowledge is needed to judge whether a value is actually missing. [1] Many real-world applications must take complicated decisions to handle missing data. For example, in the health care industry a doctor examining a patient has to check the patient's history to predict a result; beyond health care, many corporate concerns also worry about their missing data. Several approaches and techniques exist for handling incomplete data. Missing attribute values cause many problems, including loss of efficiency, complications in managing and analyzing the data, and bias resulting from differences between the missing and the complete data. [2] To avoid these negative effects on data mining algorithms, different approaches are employed to prepare and cleanse the data when missing values are present. This is critical, as much existing industrial and research data contains missing values. Missing data introduce imperfection, so a preprocessing stage is needed in which the data can be cleaned completely; this step improves the extraction process and, in consequence, the results obtained by any data mining algorithm. The simplest way of dealing with the problem is to discard the examples with missing values, and analyzing only the complete examples does not lead to serious problems during inference as long as few values are missing. In this paper, we compare various machine learning approaches for handling incomplete data.
Types of missing data

MCAR (Missing Completely At Random): Values are said to be MCAR when the probability of a data item being missing depends on neither the observed nor the unobserved parameters; the gaps occur purely at random.

MAR (Missing At Random): Missingness occurs when the probability of a value being missing is related to other observed variables, but not to the value of the particular variable that is itself missing. The MAR assumption guides the decision about which attributes or variables in the data set can be used to model the missingness.

NMAR (Not Missing At Random): Data are missing for a specific reason, i.e., the probability of missingness depends on the unobserved value itself. This is the hardest type of missingness to handle.
ACTION        MAR                  NMAR
ASSUMPTION    Weaker               Violated
PARAMETER     Partial / distinct   Good
DATA          Information          No information
TEST          Not fit              Not fit
RESULT        Plausible            Sensitive
Table 1.1 Comparison of MAR and NMAR
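The three mechanisms above can be made concrete with a short simulation (an illustrative sketch, not part of the original study; the age/income variables and all thresholds are assumed for the example):

```python
import random

random.seed(0)

# Hypothetical toy records (age, income); we delete income under each mechanism.
data = [(random.randint(20, 70), random.randint(20, 100)) for _ in range(1000)]

# MCAR: every income has the same 20% chance of being deleted,
# regardless of any observed or unobserved value.
mcar = [(a, None if random.random() < 0.2 else inc) for a, inc in data]

# MAR: deletion depends only on the OBSERVED age, not on income itself.
mar = [(a, None if a > 50 and random.random() < 0.5 else inc) for a, inc in data]

# NMAR: deletion depends on the missing value itself (high incomes withheld).
nmar = [(a, None if inc > 80 and random.random() < 0.8 else inc) for a, inc in data]

def missing_rate(rows):
    return sum(inc is None for _, inc in rows) / len(rows)
```

Under NMAR the observed incomes are systematically biased low, which is why Table 1.1 marks NMAR results as "sensitive": no analysis of the observed values alone can recover the deleted ones.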
II. MACHINE LEARNING TECHNIQUES
Missing data is a major problem in many real-time applications. Missingness can arise in many ways, including respondents failing to answer a questionnaire, and many new approaches have been proposed and developed for handling incomplete data. [3] Generally, the simplest treatment of missing data is the ignoring technique, which omits the cases that contain missing values. Rather than removing the missing data, we can instead impute it, replacing the gaps with accurately estimated values. Many imputation techniques and methods have been proposed for data with missing values, including regression imputation and multiple imputation.
A. Regression Imputation: This is a useful imputation technique, especially as a single-imputation method for regression-based analyses. Here, predicted values replace as much of the missing data as possible. The method is thoroughly prediction-based and assumes a linear relationship between attributes. [4] In practice we cannot always expect the relationship to be strictly linear. We use the complete cases ("completers") to compute the regression of each incomplete variable on the other, complete variables, which makes regression imputation a good technique for handling incomplete data. The approach splits into classification and regression: predicting a missing categorical attribute from the complete attributes is classification, while for continuous incomplete data we regress on all the complete attributes to reconstruct the data set. Choosing the right algorithm here is what lets us fill the missing attributes and present efficient, complete output.
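Regression imputation can be sketched in a few lines (an illustrative example, not from the study; the synthetic data, coefficients, and missingness rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)                    # complete predictor variable
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, 200)   # true linear relationship
y_obs = y.copy()
y_obs[rng.random(200) < 0.25] = np.nan         # knock out roughly 25% of y

complete = ~np.isnan(y_obs)
# Fit y = b1*x + b0 on the complete cases ("completers") only.
b1, b0 = np.polyfit(x[complete], y_obs[complete], 1)

y_imputed = y_obs.copy()
y_imputed[~complete] = b1 * x[~complete] + b0  # predicted value fills each gap
```

Because every gap is filled with its conditional mean, single regression imputation understates the variability of the imputed variable, which is one motivation for the multiple imputation method described next.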
B. Multiple Imputation: As the name implies, this technique produces multiple imputed data sets. [5] Each missing value is replaced by a set of n plausible values drawn from its predictive distribution. Each completed data set is then analyzed with standard complete-data methods and the estimates are pooled, which avoids the problems faced with single imputation. The technique relieves the distortion of the sample variance and produces unbiased estimates, but the data must meet normal-distribution assumptions and the storage requirements grow with the number of imputations. Many other machine learning methods have been proposed in the respective studies, several of them quite efficient.
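A minimal multiple-imputation sketch follows (illustrative only; the data, the number of imputations m, and the use of the sample mean as the analysis of interest are all assumptions, and the pooling shown is a simplified form of the usual rules):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 300)
y_obs = y.copy()
y_obs[rng.random(300) < 0.3] = np.nan          # ~30% of y missing

obs = ~np.isnan(y_obs)
b1, b0 = np.polyfit(x[obs], y_obs[obs], 1)
resid_sd = np.std(y_obs[obs] - (b1 * x[obs] + b0))

m = 5                                           # number of completed data sets
estimates = []
for _ in range(m):
    filled = y_obs.copy()
    # Draw each missing value from its predictive distribution (mean + noise),
    # so the m completed data sets differ in a plausible way.
    filled[~obs] = b1 * x[~obs] + b0 + rng.normal(0, resid_sd, (~obs).sum())
    estimates.append(filled.mean())             # analyse each completed set

pooled = np.mean(estimates)                     # pool the m analyses
between_var = np.var(estimates, ddof=1)         # between-imputation variability
```

The added noise is what distinguishes this from single regression imputation: the spread of the m estimates (between_var) carries the extra uncertainty caused by the missing values.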
C. K-Nearest Neighbour Imputation (KNNI): KNNI is an instance-based algorithm: every record with missing data is treated as an instance, and a value is computed from the nearest complete records. For nominal attributes, KNNI fills the gap with the most common value among the nearest neighbours, so a proximity measure is required. KNNI handles a missing attribute by locating the complete or nearest records in the data set; once the neighbouring records are found, the missing attribute is replaced with the most probable or plausible value derived from them. The approach is not always efficient, because it can produce replicated data: since it substitutes the maximum-likelihood value present in the particular data set, it can bias the attribute toward the most plausible values.
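A compact numeric-attribute version of KNNI (for numeric columns the neighbours' mean replaces the most-common-value rule; the function name and data are illustrative assumptions):

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill each NaN using the k complete rows nearest on the observed columns."""
    X = np.asarray(X, dtype=float).copy()
    complete = X[~np.isnan(X).any(axis=1)]      # rows with no gaps at all
    for i, row in enumerate(X):
        gaps = np.isnan(row)
        if not gaps.any():
            continue
        # Euclidean distance computed on the observed coordinates only.
        d = np.sqrt(((complete[:, ~gaps] - row[~gaps]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]
        X[i, gaps] = nearest[:, gaps].mean(axis=0)  # neighbours' average fills gap
    return X
```

For example, with rows (1, 2), (1.1, 2.1), (5, 6) and an incomplete row (1, ?), the two nearest rows supply (2 + 2.1) / 2 = 2.05, which illustrates the replication concern above: the filled values can only echo values already present in the data set.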
D. Fuzzy K-Means Clustering: Clustering is the technique of grouping the data into various clusters. In fuzzy clustering, each data object has a membership function that describes the degree to which it belongs to each cluster. [6] Fuzzy K-Means clustering is used to update these membership functions. In this process a data object is not assigned to one concrete cluster represented by a single centroid; instead, the non-reference (missing) attributes of each incomplete data object are replaced using the information carried by its membership degrees across the cluster centroids.
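The membership-update loop can be sketched as follows (a minimal sketch; the fuzzifier m = 2, the synthetic two-cluster data, and the function name are assumptions):

```python
import numpy as np

def fuzzy_kmeans(X, c=2, m=2.0, iters=50, seed=0):
    """Fuzzy k-means: returns memberships U (n x c) and centroids V (c x d)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)           # each point's memberships sum to 1
    for _ in range(iters):
        W = U ** m                               # fuzzified weights
        V = (W.T @ X) / W.sum(axis=0)[:, None]   # membership-weighted centroids
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))         # closer centroid -> larger degree
        U /= U.sum(axis=1, keepdims=True)
    return U, V
```

For imputation, a missing attribute j of object i would then be filled with the membership-weighted combination of centroids, sum over k of U[i, k] * V[k, j], rather than the single value of one hard cluster.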
E. Expectation-Maximization Technique: By estimating the mean and covariance matrix, we can impute the missing data with the Expectation-Maximization (EM) technique. [7] The steps are: first, compute the mean and covariance matrix, and estimate the regression parameters of the variables with missing values on the observed variables; second, replace each missing value with its expected value, the product of the available values and the estimated regression coefficients; third, re-estimate the mean and covariance matrix from the completed data set, and use the sample statistics of the completed data to estimate the imputation error. These steps repeat until the estimates converge.
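The three steps just described can be sketched for a bivariate case with gaps in one variable (an illustrative simulation; the data, the fixed iteration count, and the Gaussian assumption are all assumed for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(5.0, 2.0, n)                        # always observed
y = 1.5 * x + rng.normal(0.0, 1.0, n)              # partially observed
y_obs = y.copy()
miss = rng.random(n) < 0.3
y_obs[miss] = np.nan

y_hat = np.where(miss, np.nanmean(y_obs), y_obs)   # crude starting fill
for _ in range(25):
    # Step 1 and 3: mean and covariance of the completed data.
    mu_x, mu_y = x.mean(), y_hat.mean()
    cov = np.cov(x, y_hat)
    slope = cov[0, 1] / cov[0, 0]                  # regression of y on x
    # Step 2: replace each gap with its expected value given x.
    y_hat = np.where(miss, mu_y + slope * (x - mu_x), y_obs)
```

Each pass re-estimates the moments from the freshly completed data and then refreshes the fills, so the slope estimate settles close to the true regression coefficient.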
F. Support Vector Machine: SVMs analyze data and recognize patterns, and are used for classification and regression-based analysis. Given a set of training examples, the training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. Formally, an SVM constructs a hyperplane in a high- or infinite-dimensional space, which can be used for various tasks. [8] To keep the computational load reasonable, the mappings used by SVM schemes are designed so that dot products can be computed easily in terms of the variables in the original space, by defining a kernel function K(x, y).
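A bare-bones linear SVM trained by hinge-loss subgradient descent illustrates the classifier (a sketch, not a production solver; the learning rate, regularization constant, and toy data are assumptions, and a kernelized variant would replace the dot products with K(x, y)):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100, seed=0):
    """Hinge-loss linear SVM trained by subgradient descent; labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * (X[i] @ w + b) < 1:       # inside the margin: hinge is active
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:                               # correct side: only shrink w
                w -= lr * lam * w
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)
```

On linearly separable data the learned hyperplane (w, b) separates the two classes, which is the property the imputation-oriented uses of SVMs rely on.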
G. Outlier Detection: Data analyses involve a large number of variables being recorded or sampled, and one step toward a coherent analysis is the detection of outlying observations. An outlier is an observation that appears to deviate markedly from the other observations in the sample, and it may indicate bad data: the value may have been coded incorrectly, or an experiment may not have been run correctly. If an outlying point can be determined to be erroneous, it should be deleted from the analysis; in some cases, however, it may not be possible to determine whether an outlying point is bad data. [9] Outliers may be due to random variation, or may indicate something scientifically interesting. Three issues arise in outlier detection: labeling, accommodation, and identification. Outlier detection is a key technique for deciding which values fit the data set, and thereby avoiding spurious attribute values.

Labeling flags potential outliers for further investigation; this amounts to identifying the unusual values in the data set.

Accommodation uses robust statistical techniques that are not affected by outliers, which is appropriate when we cannot determine whether the potential outliers are erroneous observations.

Identification formally tests whether observations are outliers, and can pinpoint the attribute values that are actually in error. Outlier detection then ignores or eliminates the values and attributes that do not fit the data set.
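A standard labeling rule is Tukey's fences on the interquartile range (an illustrative choice of rule; the paper does not prescribe a specific test, and the k = 1.5 multiplier is the conventional default):

```python
import numpy as np

def label_outliers(x, k=1.5):
    """Flag values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)
```

Because the rule is built from quartiles rather than the mean, it also illustrates accommodation: the fences themselves are barely moved by the outlying values they flag.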
H. Iterative Linear Fitting (ILF) Method: This method belongs to the category of regression-based methods, which substitute the missing data based on the maximum-likelihood function under specific modeling assumptions. [10] For simplicity, a linear regression model is assumed for the data set. The method predicts the missing attributes for each variable in turn. An iterative method is a mathematical procedure that generates a sequence of approximate values for a class of problems; when the sequence converges from the initial approximation, the method is called convergent. Iterative linear fitting is an efficient algorithm for predicting the missing values and completing the data set. The machine learning approaches above are thus firmly aimed at resolving the missing attributes and cleansing the data.
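The "each variable in turn" idea can be sketched as a round-robin of least-squares fits (a minimal sketch under the paper's linear-model assumption; the mean-fill initialization, fixed iteration count, and function name are assumptions):

```python
import numpy as np

def iterative_linear_fit(X, iters=10):
    """Regress each incomplete column on the others; refit until the fills settle."""
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])    # initial approximation
    for _ in range(iters):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            ok = ~miss[:, j]
            # Least-squares fit of column j on the other columns, observed rows only.
            coef, *_ = np.linalg.lstsq(A[ok], X[ok, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ coef    # refreshed predictions
    return X
```

Each sweep re-predicts every gap from the current completed matrix, so the sequence of fills converges when the columns are approximately linearly related.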
Technique           Variable type   Substitution     Possibility
Regression          Incomplete      Prediction       Yes
Multiple            Incomplete      Estimation       Yes
KNNI                Incomplete      Prediction       Yes
Fuzzy k-means       Probable        Initialization   Yes
E-M                 Iterative       Initialization   No
SVM                 Incomplete      Prediction       Yes
Outlier detection   Incomplete      Distribution     Yes
ILF                 Numerical       Iteration        Yes
Table 2.1 Comparison of machine learning techniques
III. REAL WORLD APPLICATIONS
A. Bioinformatics: Bioinformatics has become an important part of many areas of biology. Data that include images and signals allow the extraction of useful results from large amounts of raw material once processed; the interpretation of biological data using computational tools is what we call bioinformatics. In the medical domain it spans databases, analysis, and statistical algorithms. Biological data include DNA, RNA, proteins, 3D structures, genomic DNA, metabolic data, and so on. Applications in the bioinformatics industry include molecular medicine, personalized medicine, gene therapy, drug development, waste cleanup, biotechnology, and antibiotic resistance. Maintaining databases such as protein sequence databases, secondary databases, protein pattern databases, and structural classification databases is a major challenge: if any of these databases has missing attributes, data mining algorithms should be applied to recover the missing data efficiently. Many new machine learning approaches have been proposed to make maintaining such data easier, and bioinformatics is a major application area for them. Many real-world data sets are available in various repositories, and many sample data sets have been tested with machine learning techniques in order to recover the actual attribute values.
B. Database Marketing: Database marketing is a major trend that has grown out of direct marketing. DBM is an interactive approach to marketing that uses individually addressable marketing media and channels to reach a company's target audience, estimate their demand, and maintain the customer database electronically. Marketing draws on many sources of data, including consumer data, business data, analytics, and modeling. As the name implies, DBM can be used by any organization that holds data about its customers. It is an important real-world application because it supports marketing across many needs, and companies maintain their strategies through many different techniques, each differing across the corporate world. Users often build elaborate databases of customer information, which may include a variety of data such as names and addresses. In B2B (business-to-business) marketing, the customers are themselves companies, which likewise maintain such databases.
C. Pattern Recognition: Pattern recognition systems are generally categorized according to the type of learning procedure used to generate the output value. In supervised settings, a set of training data is provided, consisting of instances that have been properly labeled by hand with the correct output. Within medical science, pattern recognition is the basis of computer-aided diagnosis systems. Many machine learning algorithms have been proposed, including clustering, neural networks, regression-based methods, and sequence-labeling algorithms, to raise data quality so that no missing data remain. In the health care industry, various parameters and data sets are available, so a missing attribute can be identified by considering the patterns best suited to the data set, in the form of well-matched cluster groups; these cluster groups can then be used to identify the various missing attributes in the data set. Pattern recognition is a particularly important real-world application in the medical industry.
D. Robot Locomotion: The word robot suggests replacing human intervention in a task. Robot locomotion work was developed especially to give robots the capability to decide autonomously how to move. Many types of robots have been developed for human needs, each requiring prediction of how a task gets completed. How does machine learning help here? These techniques are fed with data, often huge amounts of it, so if any direction or sensor reading is missing, it causes widespread problems; many new machine learning approaches can be used to make robots work well despite this. Locomotion is simply the movement the robot must make to carry out a task. Robots perform various movements and actions, and the many dimensions and approximate values involved can be identified; this process becomes easier with machine learning approaches. With these techniques, if any of the dimensional values are missing, we can identify them by prediction using regression and classification. Various supervised and unsupervised learning methods have been developed for such identification in many real-world applications.
IV. CHALLENGES
As data grow larger, and even though many machine learning approaches and techniques exist, some loss of quality may still occur. Formally, many challenges arise from missing attributes, and they are mainly reflected in the quality of the data. Many real-world applications work with huge amounts of data, and if any of the data are missing, this becomes a major concern. Filling the missing values with the closest probable value, simply eliminating the affected group, or ignoring the missing data altogether may each lead to a loss of efficiency, so the data should be examined for missingness before the preprocessing stage begins. Although many new techniques have impressed companies, and some of these techniques are being adopted, drawbacks remain. The main challenge in addressing missing attribute values is the loss of quality, which degrades the data. To make the data more complete, we aim to predict and replace the missing values exactly; replacement, however, is not always satisfactory, and other techniques may be preferable. We have identified the major challenges faced by many real-time applications, as well as some drawbacks of the present machine learning approaches.
V. CONCLUSION
In this paper we have briefly discussed the various techniques for handling missing attributes, the applications that broadly face missing attribute values, and the many machine learning approaches available, since many mechanisms exist for handling missing attribute values. Many techniques come with added advantages, yet drawbacks remain in many machine learning approaches. Taking this perspective, researchers often move on to evaluating the missing data by calculating it manually with mathematical formulae, or with statistical software that can recover the values actually missing from the data set; however, many of the algorithms do not meet the metrics needed for efficient data recovery. Considering this, we can implement new algorithms and techniques that eradicate missing attributes completely, and propose new, more efficient methods to achieve quality data, because the presence of missing attributes can leave the database in an inconsistent state. To avoid this we need to process and clean the data accordingly: cleansing the data is the most effective way to eradicate missing attribute values, and with the approaches that have been proposed we can achieve quality data with no missing attributes.
REFERENCES
[1] Y. S. Su, A. Gelman, J. Hill, and M. Yajima, "Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box," Journal of Statistical Software, 2014.
[2] R. J. A. Little and D. B. Rubin, "Statistical Analysis with Missing Data," Wiley, New Jersey, 2013.
[3] A. Misrli, A. Benes, and R. Kale, "Artificial based software defect predictors: Applications and benefits in a case study," AI Magazine, 2013.
[4] S. Y. Wang and C. C. Lin, "NCTUns 5.0: A Network Simulator for IEEE 802.11(p) and 1609 Wireless Vehicular Network Researches," Second IEEE Int. Symp. on Wireless Vehicular Communications, Calgary, Canada, 2013.
[5] E. Acar and B. Yener, "Unsupervised Multiway Data Analysis: A Literature Survey," 2012.
[6] E. Acuna and C. Rodriguez, "Classification, Clustering and Data Mining Applications," Springer, Berlin, pp. 639-648, 2011.
[7] J. Alcalá-Fdez, L. Sánchez, S. Garcia, M. J. D. Jesus, S. Ventura, J. M. Garrell, J. Otero, J. Bacardit, V. M. Rivas, J. C. Fernandez, and F. Herrera, "KEEL: A software tool to assess evolutionary algorithms for data mining problems," Soft Computing, 13(3):307-318, 2011.
[8] J. Luengo, S. Garcia, and F. Herrera, "A study on the use of imputation methods for experimentation with Radial Basis Function Network classifiers handling missing attribute values: The good synergy between RBFNs and Event Covering method," Neural Networks, 23(3):406-418, 2010.
[9] B. Qin, Y. Xia, and S. Prabhakar, "Rule induction for uncertain data," Knowledge and Information Systems, doi:10.1007/s10115-010-0335-7, pp. 1-2, 2010.
[10] H. Wang and S. Wang, "Mining incomplete survey data through classification," Knowledge and Information Systems, 24(2):221-233, 2010.
[11] C. Peng and J. Zhu, "Comparison of two approaches for handling missing covariates in logistic regression," 68(1):58-77, 2008.
[12] A. Farhangfar, et al., "A novel framework for imputation of missing values in databases," IEEE.