Working With Python
Algorithm Implementations In Python
The algorithms used in machine learning and data science fall into two main types of
implementation:
• Classification
• Regression
We will study some algorithms of both types and understand how they speed up the process of
working with data and drawing important insights from it.
Linear Regression
Linear regression is a form of predictive analysis used to find the relationship between two
variables: the target variable and the predictor variable. The target variable is the dependent
variable and the predictor variable is the independent variable. Both variables are columns
that already exist in the dataset.
The overall idea of regression is to examine two things. First, does the given set of predictor
variables do a satisfactory job of predicting the dependent variable? Second, which variables in
particular are the real predictors of the dependent variable, and what is their impact on the
outcome variable?
Linear regression is represented by a simple equation:
Y = b*x + c
where Y is the dependent variable, x is the independent variable, b is the regression coefficient
(the slope of the line) and c is the constant (the intercept).
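As a minimal sketch of how b and c in the equation above can be estimated, the closed-form least-squares solution is shown below. The function name fit_line and the data values are made up for illustration and are not part of the original text.

```python
def fit_line(xs, ys):
    """Return slope b and intercept c of the least-squares line Y = b*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    c = mean_y - b * mean_x
    return b, c

# Illustrative data lying exactly on Y = 2*x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
b, c = fit_line(xs, ys)
print(b, c)
```

On real data the points will not lie exactly on a line, and b and c are then the values that make the squared errors as small as possible.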
The Line of Best Fit
The line of best fit is the line that shows the relationship between the observed (actual)
values and the predicted ones. After applying the linear regression algorithm to our data, we
use this line to check how close the predicted values are to the actual ones. The algorithm
chooses the line that minimizes the distance between these two sets of values. These distances
are known as the error values, or residuals, and are drawn as vertical lines between the
predicted and actual values.
For example, we can see that the weight of a person increases with an increase in their age.
The blue line represents our line of best fit, which is also known as the regression line.
To calculate the distance between the line and the points, we use the following formula:
SS(residual) = ∑[h(x) - y]^2
where h(x) is the predicted value and y is the actual value.
The Cost Function
Let us consider an example to understand this. The sales department of a company invests some
capital to increase its sales over the next 6 months. However, it misses its targets and incurs
a loss. The cost function plays the role of measuring that loss: it represents and calculates
the error of the model, which we then try to minimize.
The cost function is J(Θ0, Θ1) = (1/(2m)) * ∑[h(x) - y]^2, where m is the number of rows in the
training set.
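The residual sum of squares and the cost function can be computed directly from their definitions. The following is a small sketch; the function names and the predicted and actual values are assumptions chosen only for illustration.

```python
def residual_sum_of_squares(predicted, actual):
    """SS(residual): sum of squared differences between h(x) and y."""
    return sum((h - y) ** 2 for h, y in zip(predicted, actual))

def cost(predicted, actual):
    """J = SS(residual) / (2m), where m is the number of training rows."""
    m = len(actual)
    return residual_sum_of_squares(predicted, actual) / (2 * m)

# Illustrative predictions h(x) and actual values y
predicted = [2.5, 4.0, 6.5]
actual = [3.0, 4.0, 6.0]
ss_res = residual_sum_of_squares(predicted, actual)
j = cost(predicted, actual)
print(ss_res, j)
```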
Gradient Descent
Gradient descent is an optimization algorithm used to find the minimum of a cost function. It
is one of the most widely used optimization algorithms in machine learning and deep learning.
Starting from an initial guess, it iteratively makes small changes to the parameters in the
direction that reduces the function, until it reaches a local minimum. When the function is
convex, as the cost function of linear regression is, that local minimum is also the global
minimum.
Gradient descent can be imagined as climbing down to the bottom of a mountain rather than
climbing up, because it is a minimization technique.
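To make the idea concrete, here is a minimal sketch of gradient descent applied to the linear regression cost function J(Θ0, Θ1). The learning rate, the number of iterations and the data are assumptions picked for this toy example; real problems usually need them tuned.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_iterations=5000):
    """Minimize J(theta0, theta1) = (1/(2m)) * sum((theta0 + theta1*x - y)^2)."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initial guess
    for _ in range(n_iterations):
        # Error of the current line on every training row
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Take a small step downhill
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

# Illustrative data lying on y = 1 + 2*x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))
```

A learning rate that is too large makes the steps overshoot and diverge, while one that is too small makes the descent very slow.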
Code in python
# Importing the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Retrieving the dataset
dataset = pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
# Splitting the dataset into the training and test set
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 1/3, random_state = 0)
# Performing feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
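sc_y = StandardScaler()
# StandardScaler expects a 2-D array, so reshape y before scaling and flatten it afterwards
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()
# Fitting the Simple Linear Regression model to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
# Test set results prediction (converted back to the original units of y)
y_pred = sc_y.inverse_transform(regressor.predict(x_test).reshape(-1, 1)).ravel()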
Logistic Regression
Logistic regression is a statistical technique that comes under classification rather than
regression, despite its name. Like other regression techniques, it is a form of predictive
analysis. Logistic regression is used to describe data and to explain the relationship between
a binary dependent variable and one or more independent variables.
It is suited to predicting binary outcomes such as 1/0, yes/no or true/false, depending on the
dataset and the output required. Logistic regression can also be viewed as a special case of
linear regression for a categorical outcome variable, in which the log of odds is used as the
dependent variable. In simple words, it predicts the probability of occurrence of an event by
fitting data to a logit function.
This type of regression is characterized by the following quantities:
Odds = p/(1-p) = probability of event occurring / probability of event not occurring
Ln(odds) = ln(p/(1-p))
Logit(p) = ln(p/(1-p))
Here, p/(1-p) is the odds. If the log of the odds is positive, the probability of success is
higher than 50%. A typical logistic model plot is shown below. It is observed that the
probability never goes below 0 or above 1.
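The relationship between probability and log of odds can be checked with a few lines of code. This is a small sketch; the function names logit and sigmoid are our own, and sigmoid is simply the inverse of the logit, which maps any real number back to a probability between 0 and 1.

```python
import math

def logit(p):
    """Log of odds for a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: turns a log of odds back into a probability."""
    return 1 / (1 + math.exp(-z))

# The log of odds is negative below 50%, zero at 50% and positive above 50%
for p in (0.2, 0.5, 0.8):
    print(p, round(logit(p), 3))
```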
We can check the performance of this regression using the following measures.
Akaike Information Criterion- AIC is a measure of fit that penalizes a model for the number of
its coefficients. Therefore, we prefer the model with the minimum AIC value.
Null Deviance- Null deviance represents the response predicted by a model with nothing but the
intercept. The lower the null deviance, the better the model.
Residual Deviance- Residual deviance represents the response predicted by a model once the
independent variables are added. As with null deviance, the lower the value, the better the
model.
Confusion Matrix- The confusion matrix is a tabular representation of actual vs predicted
values. It helps in evaluating the performance of a classification model.
                     Actual Positive    Actual Negative
Predicted Positive   True Positive      False Positive
Predicted Negative   False Negative     True Negative
The accuracy of a model can be calculated by
Accuracy = (TP + TN) / (TP + TN + FP + FN)
where TP is True Positive, TN is True Negative, FP is False Positive and FN is False Negative.
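The four counts and the accuracy can be worked out by hand from a list of actual and predicted labels. The sketch below uses made-up labels, with 1 as the positive class; the function names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative labels
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(actual, predicted)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, acc)
```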
ROC curve
The receiver operating characteristic (ROC) curve shows how well the model can distinguish
between the two classes by plotting the true positive rate against the false positive rate at
different classification thresholds. A good model distinguishes accurately between the two
classes, whereas a poor model has difficulty telling them apart.
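The points of a ROC curve can be computed by sweeping a threshold over the predicted scores. The following is a minimal sketch; the scores, labels and thresholds are invented for illustration, and the function name roc_points is our own.

```python
def roc_points(actual, scores, thresholds):
    """Return a list of (false positive rate, true positive rate), one per threshold."""
    points = []
    for t in thresholds:
        # Predict the positive class whenever the score reaches the threshold
        predicted = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
        fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Illustrative labels and model scores (higher score = more likely positive)
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(actual, scores, thresholds=[0.0, 0.3, 0.5, 1.1])
print(points)
```

The closer the curve through these points hugs the top-left corner (low false positive rate, high true positive rate), the better the model separates the two classes.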
Code in python
# Importing the necessary libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import scipy
from scipy import stats
from scipy.stats import spearmanr
# Retrieving the dataset
t1= 'C:/Users/ml/datasets/train.csv'
train=pd.read_csv(t1)
t2= 'C:/Users/ml/datasets/test.csv'
test=pd.read_csv(t2)
x= train.iloc[:, [2,4,5,6,7,9]].values
y= train.iloc[:, 1].values
# Splitting the dataset into the training and test set
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x,y,test_size= 0.25, random_state=0)
# Performing feature scaling
from sklearn.preprocessing import StandardScaler
sc_x=StandardScaler()
x_train=sc_x.fit_transform(x_train)
x_test=sc_x.transform(x_test)
# Fitting the Logistic Regression model to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state = 0)
classifier.fit(x_train,y_train)
# Test set results prediction
y_pred=classifier.predict(x_test)
# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test, y_pred)
Support Vector Machines
Support Vector Machines (SVMs) are supervised learning models that find the hyperplane which
best separates the classes in a set of data points. Suppose we have two columns, x and y, that
contain some data points. These points are plotted in a two-dimensional plane. Our aim is to
derive a line that separates these points into their classes.
The line that separates these points, whether horizontally, vertically or diagonally, is known
as a hyperplane. The algorithm calculates the distance between each candidate hyperplane and
the data points nearest to it in order to determine the hyperplane that classifies the points
best. This distance is known as the margin.
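As a small sketch of what the margin means, the code below measures the distance from each point to a hyperplane written as w·x + b = 0 and takes the smallest one. The points and the two candidate lines are made-up values, and the function names are hypothetical.

```python
import math

def distance_to_hyperplane(point, w, b):
    """Distance from a point to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    norm = math.sqrt(sum(wi ** 2 for wi in w))
    return abs(dot + b) / norm

def margin(points, w, b):
    """Margin of a hyperplane: distance to its nearest data point."""
    return min(distance_to_hyperplane(p, w, b) for p in points)

# Illustrative points: two on the left, two on the right
points = [(1.0, 2.0), (2.0, 3.0), (6.0, 1.0), (7.0, 2.5)]
# Two candidate vertical lines: x = 4 and x = 2.5
margin_a = margin(points, w=(1.0, 0.0), b=-4.0)
margin_b = margin(points, w=(1.0, 0.0), b=-2.5)
print(margin_a, margin_b)
```

Both lines separate the two groups, but the line x = 4 leaves more room on either side and would therefore be preferred.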
SVM supports both regression and classification tasks and can handle multiple continuous and
categorical variables. For categorical variables, dummy variables are created with values of
either 0 or 1. Thus, a categorical dependent variable consisting of three levels, say A, B and
C, is represented by a set of three dummy variables:
A: {1 0 0}
B: {0 1 0}
C: {0 0 1}
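The dummy variables listed above can be generated with a few lines of code. This is only a sketch with a hypothetical helper named one_hot; libraries such as pandas and scikit-learn provide their own encoders.

```python
def one_hot(values):
    """Map each distinct level to a list of 0/1 dummy variables."""
    levels = sorted(set(values))
    return {level: [1 if level == other else 0 for other in levels]
            for level in levels}

# Illustrative categorical column with three levels
encoding = one_hot(["A", "B", "C", "A"])
for level, dummies in encoding.items():
    print(level, dummies)
```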
Now that we know what a hyperplane is, the question is how to identify the right one.
We can reach a conclusion by considering the following cases.
CASE 1
There are three hyperplanes, x1, x2 and x3, in our n-dimensional space, and we need to identify
the right one. x1 and x3 cut through the points, while x2 separates the two groups of points
perfectly. Hence, x2 is our ideal hyperplane.
CASE 2
The three hyperplanes x1, x2 and x3 all segregate the points well, as they are vertical and
parallel to each other. So, how can we identify the right hyperplane in this situation? x1 and
x2 lie nearer to the points, which means their margins are small compared to that of x3. x3 has
the largest margin and is therefore the ideal hyperplane.
CASE 3
In the third case, the points of both classes lie very close to each other around the center of
the plane, with little or no room for a hyperplane to pass between them. What can we do in such
a case? This problem can be dealt with by adding a third axis, the z-axis, with z = x^2 + y^2.
All values of z are non-negative, since z is the sum of the squares of x and y, and points near
the center get small z values while points farther out get large ones. Choosing such a
transformation by hand is not always possible, and this is where the kernel trick comes into
play. Functions called kernels convert a problem that is not linearly separable (such as the
scenario above) into a separable one, which makes them useful for non-linear separation
problems. Simply put, a kernel performs some complex data transformations and then finds a way
to separate the data based on the labels or outputs that have been defined.
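The effect of adding the z-axis can be seen on a toy example. The sketch below applies the explicit transformation z = x^2 + y^2 to made-up points; note that this is the hand-written lifting described above, while a kernel achieves a comparable effect without computing the new coordinates explicitly.

```python
def lift(point):
    """Third coordinate z = x^2 + y^2 for a 2-D point."""
    x, y = point
    return x ** 2 + y ** 2

# Illustrative classes: one clustered at the center, one surrounding it,
# so no straight line in the x-y plane can separate them
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

z_inner = [lift(p) for p in inner]
z_outer = [lift(p) for p in outer]

# Along the z-axis a single threshold now separates the two classes
separable = max(z_inner) < min(z_outer)
print(z_inner, z_outer, separable)
```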
Code in python
# Importing the important libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Retrieving the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the training and test set
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
# Performing Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Fitting the SVM model to the training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(x_train, y_train)
# Test set results prediction
y_pred = classifier.predict(x_test)
# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Decision Trees
Decision trees are among the most popular classification techniques in machine learning. They
not only help with prediction but are also an efficient way to understand the characteristics
of the variables involved. They are a supervised learning algorithm with a predefined target
variable that is to be determined, and they are suited to both categorical and continuous
output variables.
The basic functioning of decision trees goes this way: a set of points is plotted on a plane.
These points cannot be separated easily by a single line because of their heterogeneous
properties. Hence, a decision tree divides the points into different groups, or leaves, based
on some criteria and handles each group individually.
There are two types of decision trees, classified by the type of target variable.
Binary Variable Decision Tree- A decision tree that has a binary target variable is known as a
Binary Variable Decision Tree. In this case, the output will be either “yes” or “no”.
Continuous Variable Decision Tree- A decision tree that has a continuous target variable is
known as a Continuous Variable Decision Tree. In this case, the output can be any numeric
value, such as the salary of a person.
Let us go through some of the key terms commonly used in decision trees.
Root Node- It represents the entire population or sample, which is then divided into two or
more homogeneous sets.
Splitting- The division of a node into two or more sub-nodes.
Decision Node- A sub-node that splits into further sub-nodes.
Leaf/Terminal Node- A node with no sub-nodes, that is, a node that is not split further.
Pruning- The process of reducing the size of a decision tree by removing nodes.
Branch/Subtree- A subsection of a decision tree is called a branch or a sub-tree.
Parent and Child Node- A node that is divided into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are the children of the parent node.
There are some important terms that we need to understand before we can implement decision
trees in python.
Impurity
Impurity is a measure of how mixed the classes are within a node: a node is impure when it
contains traces of one class among members of another. There are two common reasons for its
existence. The decision tree can run out of attributes on which to divide a node any further.
Alternatively, we may choose to allow some percentage of impurity in our model in exchange for
better performance.
Entropy
Entropy is the degree of randomness of the elements or, in other terms, a measure of impurity.
Mathematically, it can be calculated from the probabilities of the items as:
H = -Σ p(x) * log[p(x)]
It is the negative sum, over each item x, of the probability of x times the log of that
probability.
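The entropy formula can be tried out on a small list of class labels. This sketch assumes a base-2 logarithm, which the text does not specify; with base 2, an evenly mixed two-class node has an entropy of 1 and a pure node has an entropy of 0.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p(x) * log2(p(x))) over the distinct labels (base 2 is an assumption)."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Illustrative nodes
mixed = entropy(["yes", "yes", "no", "no"])    # evenly mixed node
pure = entropy(["yes", "yes", "yes", "yes"])   # node with a single class
print(mixed, pure)
```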
Information Gain
Information gain is the main ingredient in the construction of a decision tree. Constructing a
decision tree from scratch is all about finding, at each step, the attribute that returns the
highest information gain, so that the resulting splits are as pure as possible.
Therefore, IG = entropy(parent) - [weighted average] * entropy(children), where each child is
weighted by the fraction of the parent's items it contains.
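Information gain can be computed for a candidate split by comparing the entropy of the parent with the weighted entropy of its children. The sketch below is self-contained and uses made-up labels; the function names and the base-2 logarithm are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, using a base-2 logarithm."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent, children):
    """entropy(parent) minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "yes", "no", "no", "no"]
# A split that separates the classes perfectly
gain_perfect = information_gain(parent, [["yes", "yes", "yes"], ["no", "no", "no"]])
# A split that leaves both children mixed
gain_mixed = information_gain(parent, [["yes", "yes", "no"], ["yes", "no", "no"]])
print(gain_perfect, gain_mixed)
```

The tree-building procedure would choose the first split, since it returns the higher information gain.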
Code in python
# Importing the important libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Retrieving the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the training and test set
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Performing Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting the Decision Tree Classification model to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
# Test set results prediction
y_pred = classifier.predict(X_test)
# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Random Forest
Random forest is another supervised classification algorithm and a natural extension of
decision trees. There is a relationship between the number of trees in the forest and the
quality of its results: in general, the higher the number of trees, the more stable and
accurate the result, although the improvement levels off beyond a point.
Random forests are an ensemble learning technique for both classification and regression. A
random forest reduces the problem of overfitting by combining the predictions of many trees.
Another advantage is that the random forest classifier can handle missing values in some
implementations. It can also be modeled for categorical values.
Working
The working of a random forest has 2 stages: one is creating the random forest, and the other
is making predictions with the random forest classifier created in the first stage.
These are the steps used in the creation of a random forest.
• Step 1: Select “k” features at random out of the total “m” features, where k is less than m.
• Step 2: Among the selected “k” features, calculate the node “d” using the best split point.
• Step 3: Split the node into further nodes using the best split.
• Step 4: Repeat steps 1 to 3 until “l” nodes have been reached.
• Step 5: Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
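The steps above can be sketched in a heavily simplified form. In the code below every tree has a single split (so l = 1), the best split is the one with the fewest misclassified training rows, and the forest predicts by majority vote. All names and data are made up for illustration; real random forests also train each tree on a random sample of the rows and grow deeper trees.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def best_split(rows, labels, feature_ids):
    """Among the chosen features, find the split with the fewest errors.
    Assumes at least one chosen feature takes more than one value."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not right:
                continue  # a split must leave points on both sides
            left_label, right_label = majority(left), majority(right)
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def build_forest(rows, labels, k, n_trees, seed=0):
    rng = random.Random(seed)
    m = len(rows[0])
    forest = []
    for _ in range(n_trees):                                  # repeat for n trees
        feature_ids = rng.sample(range(m), k)                 # k random features out of m
        forest.append(best_split(rows, labels, feature_ids))  # one split per tree
    return forest

def predict(forest, row):
    """Each tree votes for a label; the majority wins."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

# Illustrative data: features 0 and 1 separate the classes, feature 2 is noise
rows = [(1.0, 5.0, 0.3), (1.5, 6.0, 0.9), (2.0, 5.5, 0.1),
        (6.0, 1.0, 0.8), (6.5, 0.5, 0.2), (7.0, 1.5, 0.7)]
labels = [0, 0, 0, 1, 1, 1]
forest = build_forest(rows, labels, k=2, n_trees=5)
print(predict(forest, (1.2, 5.8, 0.5)), predict(forest, (6.8, 0.8, 0.5)))
```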
Applications
Stock market- A random forest can be used to identify stocks that are likely to be profitable
for the user.
E-commerce- It can predict the products a customer is likely to buy in future, based on their
past choices.
Banking- It can distinguish defaulters from non-defaulters by analyzing customers' past
records.
Code in python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the training and test set
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Fitting the Random Forest Classification model to the Training set
from sklearn.ensemble import RandomForestClassifier
# 'impurity' is not a valid criterion; 'entropy' (or 'gini') selects the impurity measure
classifier = RandomForestClassifier(n_estimators = 7, criterion = 'entropy', random_state = 0)
classifier.fit(x_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(x_test)
# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
K-means clustering
Clustering is the process of dividing the given data points into a number of groups such that
the data points in the same group are similar to each other in terms of features and
characteristics. In simple words, k-means segregates points with similar properties and
assigns them to clusters.
Working
It starts with specifying the desired number of clusters ‘k’. Let’s take k as 2 for five random
data points in 2-D space.
Then, we randomly assign each data point to a cluster. We assign three points to cluster 1,
shown in red, and two points to cluster 2, shown in grey.
Next, we compute the centroids of these clusters. The centroid of the data points in the red
cluster is marked by a red cross, and that of the grey cluster by a grey cross.
Then comes the step of re-assigning each data point to the closest cluster centroid. The data
point at the bottom was assigned to the red cluster even though it is closer to the centroid of
the grey cluster. Hence, we re-assign that data point to the grey cluster.
In the end, we recompute the centroids of both clusters. The re-assignment and recomputation
steps are repeated until the assignments no longer change.
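The procedure walked through above can be written as a short loop. This is a minimal sketch with five made-up points and a fixed starting assignment (three points in cluster 0, two in cluster 1) instead of a random one, so that the result is reproducible; the function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, assignments, k, max_iterations=100):
    """Alternate between computing centroids and re-assigning points."""
    for _ in range(max_iterations):
        # Compute the centroid (mean point) of every cluster
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if not members:
                raise ValueError("a cluster became empty")
            centroids.append(tuple(sum(coord) / len(members)
                                   for coord in zip(*members)))
        # Re-assign every point to its closest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clusters are stable
        assignments = new_assignments
    return assignments, centroids

# Illustrative data: the third point starts in cluster 0 but lies closer to cluster 1
points = [(1.0, 1.0), (1.5, 2.0), (5.0, 3.0), (5.0, 4.0), (6.0, 5.0)]
assignments, centroids = kmeans(points, assignments=[0, 0, 0, 1, 1], k=2)
print(assignments, centroids)
```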
Feature engineering is the process of using domain knowledge and expertise to choose which data
variables to input as features before building a machine learning model. Feature engineering
plays a key role in k-means clustering; using meaningful features that capture the variability
and essence of the data is essential before inputting the selected features to k-means.
Feature transformations, particularly those that represent rates rather than raw measurements,
help in normalizing the data. At times, it is observed that such engineering can remove as much
as 80% of the error in a dataset. It helps maintain the accuracy of the machine learning model
that is used to draw insights from the data.
Code in python
# Importing the required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Retrieving the dataset
dataset = pd.read_csv('customers.csv')
x = dataset.iloc[:, [3, 4]].values
# Clustering is unsupervised, so no target variable or train/test split is needed
# Performing Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x = sc_x.fit_transform(x)
# Finding the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []  # within-cluster sum of squares for each number of clusters
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
# Visualising the results using plots
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
# Fitting the K-Means algorithm to our dataset
# (set n_clusters to the value found at the elbow of the plot above)
kmeans = KMeans(n_clusters = 10, init = 'k-means++', random_state = 55)
y_kmeans = kmeans.fit_predict(x)
K-nearest Neighbor(K-NN)
K-nearest neighbor (KNN) can be used for both classification and regression problems. A KNN
model is considered when n points need to be classified into groups whose members, described by
the features of a dataset, are similar to one another and lie close together. When a new point
is introduced in the plane, it is assigned to the group or class whose members it most closely
resembles.
It is a non-parametric approach, meaning it makes no assumption that the data follows a
particular distribution such as the normal distribution. It is also referred to as a lazy
classification model, because it builds no explicit model in advance and predicts classes from
the stored observations whose features match most closely.
Selecting the number of nearest neighbors, that is, the value of k, plays a significant role in
determining the performance of our model. A large value of k generally reduces the variance
caused by noisy data, but it introduces a bias that can make the model overlook smaller
patterns in the data that may be useful.
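The way a new point is classified from its k nearest neighbors can be sketched as follows. The training points, labels and the function name knn_predict are made up for illustration, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Return the majority label among the k training points nearest to new_point."""
    # Pair every training point's distance to the new point with its label
    distances = sorted(
        (math.dist(p, new_point), label)
        for p, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Illustrative data: two well-separated groups
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]
label_near_a = knn_predict(train_points, train_labels, (2, 2), k=3)
label_near_b = knn_predict(train_points, train_labels, (6, 5), k=3)
print(label_near_a, label_near_b)
```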
The distance between data points in the plane can be calculated using the following
techniques.
Euclidean Distance: Euclidean distance is the square root of the sum of the squared differences
between a new point (x) and an existing point (y).
ED = √Σ(x-y)^2
Manhattan Distance: Manhattan distance is the distance between two vectors calculated as the
sum of their absolute differences.
MD = Σ|x-y|
Hamming Distance: It is used for categorical variables. For each feature, if the value (x) and
the value (y) are the same, the distance D is zero; otherwise it is one.
HD = ΣD
where D=0 when x=y and D=1 when x≠y
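The three distance measures can be written directly from their definitions. This is a small sketch with made-up points; the function names are our own.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of positions at which two categorical vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

ed = euclidean((1, 2), (4, 6))
md = manhattan((1, 2), (4, 6))
hd = hamming(("red", "small", "round"), ("red", "large", "round"))
print(ed, md, hd)
```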
KNN is often used for search applications, where it finds the items nearest to a customer's
interests. It can also be used for building recommender systems, where it finds similar items
based on the user's personal taste or preference. However, KNN is often not preferred over SVMs
or neural networks, as it is slow at prediction time: every new point has to be compared
against the stored training data.
Code in python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the data into the training and test set
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Fitting our K-nearest neighbor model to the Training data
from sklearn.neighbors import KNeighborsClassifier
# Minkowski distance with p = 2 is the Euclidean distance, which suits scaled numeric features
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(x_train, y_train)
# Test data result prediction
y_pred = classifier.predict(x_test)
# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Naive Bayes
Naive Bayes is a basic technique for building classifiers. These models assign class labels to
problem instances, represented as vectors of features. It is a family of classification
techniques based on Bayes’ theorem, with the assumption that the predictor variables are
independent of each other.
In plain terms, a naive Bayes classifier calculates the probability of the outcome by assuming
that the presence of a particular feature in a class is unrelated to the presence of any other
feature in that class. For instance, a knife may be considered to have features like sharpness,
being made of stainless steel and a size of 20 inches. These features are treated as not
depending on each other. A naive Bayes approach takes each of these properties to contribute
independently to the probability that the object is a knife.
Naive Bayes classifiers can be trained efficiently in a supervised learning setting for
different sorts of probability models. In many practical applications, parameter estimation for
naive Bayes models uses the method of maximum likelihood, which means that one can work with
the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.
P(c|x) = P(x|c) * P(c) / P(x)
where P(c|x) is the posterior probability of the class (target) given the predictor x (the
features), P(c) is the prior probability of the class, P(x|c) is the likelihood, which is the
probability of the predictor given the class, and P(x) is the prior probability of the
predictor.
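A worked example helps to see how the terms of Bayes' theorem fit together. The probabilities below are invented for illustration, and P(x) is obtained by adding up the probability of seeing x in the class and outside the class.

```python
# Illustrative (made-up) probabilities
p_c = 0.3              # P(c): prior probability of the class
p_x_given_c = 0.8      # P(x|c): likelihood of the predictor given the class
p_x_given_not_c = 0.1  # probability of the predictor outside the class

# P(x): total probability of the predictor over both cases
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)
posterior = p_x_given_c * p_c / p_x
print(round(p_x, 4), round(posterior, 4))
```

Although the class has a prior probability of only 30%, observing the predictor raises its probability to roughly 77%, because the predictor is far more likely inside the class than outside it.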
Code in python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('SN_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
# Splitting the data into the training and test set
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
# Performing Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Fitting the Naive Bayes model to the Training data
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(x_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)