Use of classifiers in research problems
Classifiers are algorithms that map input data to one of a set of output categories.
They can be used to build predictive models with high precision and accuracy, such that the
resulting model can be used to predict or classify previously unseen data points. Classifiers
have found wide use in data science applications across various domains. For instance,
classifying a new tumour as malignant or benign, identifying an email as spam or ham, and
marking an insurance claim as possibly fraudulent or genuine are all instances of
classification. Classification algorithms use training data, i.e., they learn from example data
and build a model or procedure to identify a new data point as belonging to a particular
category. They therefore belong to the class of supervised learning methods.
A number of classifiers can be used to classify data on the basis of existing, historical
data. A very short description of these methods is given here, just to introduce the concepts.
Logistic Regression
As a simple case, consider a logistic model with two predictors x₁ and x₂, and one binary
response variable Y, whose success probability we denote p = P(Y = 1). We assume a linear
relationship between the predictor variables and the log-odds of the event. This relationship
can be expressed as

log(p / (1 − p)) = β₀ + β₁x₁ + β₂x₂

By simple algebraic manipulation, the probability that Y = 1 is

p = e^(β₀ + β₁x₁ + β₂x₂) / (e^(β₀ + β₁x₁ + β₂x₂) + 1)

The above formula shows that once the β's are estimated, we can compute the probability that
Y = 1 for a given observation, or its complement Y = 0.
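In code, prediction with a fitted logistic model reduces to evaluating this expression. The sketch below assumes the β's have already been estimated; the coefficient values shown are hypothetical, not fitted to any data:

```python
import math

def logistic_prob(x1, x2, b0, b1, b2):
    """Probability that Y = 1 under the two-predictor logistic model."""
    z = b0 + b1 * x1 + b2 * x2              # the log-odds (linear predictor)
    return math.exp(z) / (math.exp(z) + 1)  # equivalently 1 / (1 + e^(-z))

# Illustrative coefficients only:
p = logistic_prob(1.0, 2.0, b0=-1.0, b1=0.5, b2=0.25)
```

When the linear predictor is 0, the log-odds are even and the probability is exactly 0.5.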
Decision Trees
In this technique, we split the population or sample into two or more homogeneous sets (or
sub-populations) based on the most significant splitter/differentiator among the input
variables. The end result of the algorithm is a tree-like structure with root, branch and leaf
nodes (the target variable). Decision trees use multiple algorithms to decide whether to split
a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the
resultant sub-nodes. Although several criteria such as the Gini index, chi-square and reduction
in variance are available for identifying the nodes, one popular measure used for splitting is
the information gain. This is equivalent to selecting the split with the maximum reduction in
entropy as measured by Shannon's index (H),

H = − Σᵢ pᵢ log pᵢ,   i = 1, …, s

where s is the number of groups at a node and pᵢ indicates the proportion of individuals in
the i-th group.
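The entropy of a node and the information gain of a candidate split can be sketched as follows (a minimal illustration; the class labels are arbitrary):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Reduction in entropy from splitting `parent` into `children` subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Splitting a perfectly mixed node into two pure nodes gives the maximal gain:
gain = information_gain(list("aabb"), [list("aa"), list("bb")])
```

A pure node has zero entropy, so the split above recovers the full 1 bit of uncertainty in the parent.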
Random Forests
Ensemble learning is a type of supervised learning technique in which the basic idea is to
generate multiple models on a training dataset and then combine (for example, average) their
outputs to produce a stronger model that performs well. Random forest is a classic case of
ensemble learning. Decision trees are considered very simple and easily interpretable, but a
major drawback is their poor predictive performance and poor generalization on the test set,
and so they are sometimes called weak learners. In the context of decision trees, a random
forest is a model based on multiple trees. Rather than simply averaging the predictions of
individual trees (which we could call a 'forest'), this model uses two key concepts that give
it the name 'random': (i) random sampling of training data points when building trees, and
(ii) random subsets of features considered when splitting nodes. The idea is that instead of
producing a single complex model, which might have high variance leading to overfitting, or
be too simple and have high bias leading to underfitting, we generate many models using the
training set and combine them at the end.
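The two randomization ideas and the final combination step might be sketched as below. The three constant 'trees' are hypothetical stand-ins for real fitted decision trees, purely to show the majority-vote combination:

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Idea (i): random sampling of training points, with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def random_feature_subset(n_features, rng):
    """Idea (ii): random subset of features considered at a split."""
    k = max(1, int(n_features ** 0.5))  # sqrt(n_features) is a common default
    return rng.sample(range(n_features), k)

def forest_predict(trees, x):
    """Combine the individual trees' predictions by majority vote."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Stand-in 'trees' (constant classifiers, for illustration only):
trees = [lambda x: "A", lambda x: "A", lambda x: "B"]
```

Each real tree would be fitted on its own bootstrap sample, considering only a random feature subset at every split.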
Support Vector Machines
Given a set of training examples, each marked as belonging to one or the other of two categories,
a Support Vector Machine (SVM) training algorithm builds a model that assigns new examples
to one category or the other. In theory, SVM is a discriminative classifier formally defined by a
separating hyperplane. In other words, given labelled training data, the algorithm outputs an
optimal hyperplane which categorizes new examples. Thus, the hyperplanes are decision
boundaries that help classify the data points. Data points falling on either side of the hyperplane
can be attributed to different classes. Also, the dimension of the hyperplane depends upon the
number of features. If the number of input features is 2, then the hyperplane is just a line. If the
number of input features is 3, then the hyperplane becomes a two-dimensional plane. In practice,
there are many hyperplanes that might classify the data. One reasonable choice as the best
hyperplane is the one that represents the largest separation, or margin, between the two classes.
So, we choose the hyperplane such that the distance from it to the nearest data point on each
side is maximized.
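Classifying by a hyperplane and measuring the margin can be sketched as follows, assuming a hyperplane w·x + b = 0 has already been found (the weights used below are illustrative, not the result of SVM training):

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return (dot + b) / math.sqrt(sum(wi * wi for wi in w))

def classify(w, b, x):
    """Points on either side of the hyperplane get different classes."""
    return 1 if signed_distance(w, b, x) >= 0 else -1

def margin(w, b, points):
    """Distance from the hyperplane to the nearest data point."""
    return min(abs(signed_distance(w, b, x)) for x in points)
```

SVM training chooses w and b so that this margin, over the training points, is as large as possible.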
Naïve Bayes Classifier
The Naive Bayes algorithm is a logic-based technique which is simple yet so powerful that it
is often known to outperform complex algorithms on very large datasets. Its foundation is
Bayes' theorem, which relates the conditional probabilities of two events A and B:

P(B|A) = P(B) P(A|B) / P(A)

The algorithm is called naive not because it is simple, but because it makes a very strong
assumption that the features of the data are independent of each other. In other words, it
assumes that the presence of one feature in a class is completely unrelated to the presence
of all other features. If this assumption of independence holds, Naive Bayes performs
extremely well, often better than other models.
Mathematically,

P(X₁, …, Xₙ | Y) = ∏ᵢ P(Xᵢ | Y)

To build a classifier model, we find the probability of a given set of inputs for all possible
values of the class variable Y and pick the output with the maximum probability. This can be
expressed as

Ŷ = argmax over Y of  P(Y) ∏ᵢ P(Xᵢ | Y)
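The argmax rule can be sketched as below. The spam/ham priors and likelihood tables are made-up numbers, purely for illustration:

```python
def naive_bayes_predict(priors, likelihoods, x):
    """Pick the class y maximising P(y) * prod_i P(x_i | y).

    priors:      {y: P(y)}
    likelihoods: {y: [{value: P(X_i = value | y)} for each feature i]}
    """
    def score(y):
        p = priors[y]
        for i, xi in enumerate(x):
            p *= likelihoods[y][i].get(xi, 0.0)  # independence assumption
        return p
    return max(priors, key=score)

# Hypothetical two-feature spam example (probabilities are made up):
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{"offer": 0.8, "hello": 0.2}, {"free": 0.7, "meeting": 0.3}],
    "ham":  [{"offer": 0.1, "hello": 0.9}, {"free": 0.2, "meeting": 0.8}],
}
label = naive_bayes_predict(priors, likelihoods, ["offer", "free"])
```

Here "spam" scores 0.4 × 0.8 × 0.7 = 0.224 against 0.6 × 0.1 × 0.2 = 0.012 for "ham", so the message is classified as spam.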
Neural Networks
A neural network is a series of algorithms that endeavours to recognize underlying
relationships in a set of data through a process that mimics the way the human brain operates.
The basic computational unit of the brain is a neuron. By analogy, a 'neuron' in a neural
network, also called a perceptron, is a mathematical function that collects and classifies
information according to a specific architecture. The perceptron receives input from some
other nodes, or from an external source, and computes an output. Each input has an associated
weight (w), assigned on the basis of its relative importance to the other inputs. The node
applies a nonlinear function to the weighted sum of its inputs to produce the output. The idea
is that the synaptic strengths (the weights w) are revised based on learning from the training
data, which in turn controls the strength and direction of their influence.
Learning happens in two steps: forward propagation and back propagation. In simple words,
forward propagation is making a guess about the answer, and back propagation is minimising the
error between the actual answer and the guessed answer. The process of updating the weights is
continued through multiple iterations to arrive at a decision.
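A single forward pass through one perceptron might look like the following sketch, using a sigmoid as the nonlinear function; the weights and bias are arbitrary illustrative values, and back propagation is not shown:

```python
import math

def perceptron(inputs, weights, bias):
    """One forward pass: a nonlinear function of the weighted sum of inputs."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid nonlinearity

out = perceptron([1.0, 0.5], weights=[0.4, -0.2], bias=0.1)
```

During training, back propagation would adjust `weights` and `bias` to reduce the error between `out` and the true answer.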
K Nearest Neighbour Technique
K-nearest neighbours (KNN) is a simple algorithm that stores all available cases and classifies
new cases based on a similarity measure (e.g., a distance function). A case is classified by a
majority vote of its neighbours, i.e., the case is assigned to the most common class amongst
its K nearest neighbours as measured by a distance function. Below is a step-by-step procedure
to compute the K-nearest neighbours.
1. Determine the parameter K, the number of neighbours to be used.
2. Calculate the distance between the query instance (the item to be identified as belonging
to a pre-identified category) and all the training samples.
3. Sort the distances and determine the nearest neighbours based on the K-th minimum distance.
4. Gather the categories of these nearest neighbours.
5. Use a simple majority of the categories of the nearest neighbours as the prediction for
the query instance.
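The steps above can be sketched as follows, using Euclidean distance as the distance function (the toy points and labels are illustrative):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Steps 1-5: distances, sort, take the k nearest, majority vote."""
    # Step 2: Euclidean distance from the query to every training sample.
    dists = [(math.dist(query, x), y) for x, y in zip(train_X, train_y)]
    # Step 3: sort by distance and keep the k nearest neighbours.
    neighbours = sorted(dists, key=lambda d: d[0])[:k]
    # Steps 4-5: gather their categories and take a simple majority.
    return Counter(y for _, y in neighbours).most_common(1)[0][0]

X = [(0, 0), (0, 1), (5, 5), (6, 5)]
y = ["red", "red", "blue", "blue"]
label = knn_predict(X, y, query=(1, 0), k=3)
```

With k = 3, the query's nearest neighbours are two "red" points and one "blue" point, so it is classified "red"; with k = 1 this reduces to the 1-nearest-neighbour classifier.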
The most intuitive nearest neighbour type classifier is the 1-nearest neighbour classifier that
assigns a point x to the class of its closest neighbour in the feature space.
Finally, the choice of a particular classifier for a given situation will depend on its
relative performance with respect to accuracy, sensitivity and specificity. There are deeper
issues involved in the use of all these techniques, and considerable developments have taken
place in both theory and programming related to the topic.
--- Jayaraman