Department of Computer Engineering,
Sinhgad Institute of Technology and Science, Pune
2019-20
Project Stage II SPPU Examination
Name PRN No.
Venkatesh T. Mainalli 71622707L
Gaurav J. Bade 71832932F
Akshay R. Dhole 71833043K
Shivraj Deshmukh 71833027H
Name of the Guide: Prof. G. S. Navale
Classification of Imbalanced Dataset
Agenda
• Introduction of Group Members
• Introduction of Project Topic
• Motivation
• Literature Survey
• Gap Analysis
• Problem Statement
• Objectives
• Activities per Objective
• Feasibility Study
• Schedule
• System Architecture
• Mathematical Model
• SRS
• UML/DFD
• Algorithms
• Testing
• Results
• Conclusion
• Future Scope
• References
25-04-2023 Classification of Imbalanced Dataset 2
Introduction of Group Member 1
25-04-2023 Classification of Imbalanced Dataset 3
Bio – Data : Personal Information
Roll No B17COA77
Name of Student Venkatesh Tukkappa Mainalli
Date of Birth 25-03-1998
Address House No Bldg No Anil Villa, Kolhewadi, Sinhgad Road
Lane, Appt,Society
City District Pune
Tal Dist State Pin Maharashtra- 411024
Email Id venkateshmainalli55555@gmail.com
Mobile No. 7798311297
Introduction of Group Member 1
25-04-2023 Classification of Imbalanced Dataset 4
Class Sem Year of Adm % Marks Remark
F. E. I 2016-2017 62.0 Pass
II 64.0 Pass
S. E. I 2017-2018 54.0 Pass
II 58.0 Pass
T. E. I 2018-2019 60.0 Pass
II 68.0 Pass
B. E. I 74.16 Pass
Academic Track Record
Introduction of Group Member 2
25-04-2023 Classification of Imbalanced Dataset 5
Bio – Data : Personal Information
Roll No B17COA14
Name of Student Shivraj Deshmukh
Date of Birth 17-01-1999
Address House No Bldg No Room no. 201
Lane, Appt,Society Balaji Hostel, Narhe
City District Pune
Tal Dist State Pin Maharashtra- 411041
Email Id deshmukh.shivraj95@gmail.com
Mobile No. 7972326595
Introduction of Group Member 2
25-04-2023 Classification of Imbalanced Dataset 6
Class Sem Year of Adm % Marks Remark
S. E. I 2017-2018 74.16 Pass
II 70.46 Pass
T. E. I 2018-2019 69.72 Pass
II 64.24 Pass
B. E. I 70.16 Pass
Academic Track Record
Introduction of Group Member 3
25-04-2023 Classification of Imbalanced Dataset 7
Bio – Data : Personal Information
Roll No B17COA17
Name of Student Akshay Ramchandra Dhole
Date of Birth 21-09-1998
Address House No Bldg No Flat No.1,
Lane, Appt,Society Shantiyuga Apartment, Manaji Nagar, Narhe
City District Pune
Tal Dist State Pin Maharashtra - 411041
Email Id akshaydholead10@gmail.com
Mobile No. 7768902644
Introduction of Group Member 3
25-04-2023 Classification of Imbalanced Dataset 8
Class Sem Year of Adm % Marks Remark
S. E. I 2017-2018 72.68 Pass
II 67.50 Pass
T. E. I 2018-2019 61.06 Pass
II 66.53 Pass
B. E. I 78.23 Pass
Academic Track Record
Introduction of Group Member 4
25-04-2023 Classification of Imbalanced Dataset 9
Bio – Data : Personal Information
Roll No B17COA04
Name of Student Gaurav Jayant Bade
Date of Birth 03-01-1998
Address House No Bldg No Flat No.420, E-Wing
Lane, Appt,Society Samarth Vishwa Society, Khandala
City District Khandala, Satara
Tal Dist State Pin Maharashtra- 412802
Email Id Gauravbade11@gmail.com
Mobile No. 9960503007
Introduction of Group Member 4
25-04-2023 Classification of Imbalanced Dataset 10
Class Sem Year of Adm % Marks Remark
S. E. I 2017-2018 69.72 Pass
II 69.12 Pass
T. E. I 2018-2019 64.46 Pass
II 64.09 Pass
B. E. I 2019-2020 73.19 Pass
Academic Track Record
Introduction
• What is Machine Learning ?
 A branch of artificial intelligence, concerned with
the design and development of algorithms that allow
computers to evolve behaviors based on empirical
data.
 As intelligence requires knowledge, it is necessary for
the computers to acquire knowledge.
25-04-2023 Classification of Imbalanced Dataset 11
• What is an Imbalanced Dataset ?
 Handling imbalanced data distributions is an important part of the machine learning workflow.
 An imbalanced dataset is one in which the instances of one of the two classes outnumber those of the other; in other words, the number of observations is not the same for all the classes in a classification dataset.
 This problem is faced not only in binary-class data but also in multi-class data.
25-04-2023 Classification of Imbalanced Dataset 12
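As a minimal illustration (a hypothetical toy dataset, not the project's data), class imbalance is visible simply by counting the labels:

```python
# Minimal sketch: a synthetic two-class dataset with roughly a 95/5 split.
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # e.g. Counter({0: ~950, 1: ~50}) -- far from equal
```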
Motivation
 Many real-world data sets have an imbalanced distribution of instances. Learning from such data sets results in the classifier being biased towards the majority class, thereby tending to misclassify the minority class samples.
 For example – fraud detection, cancer detection, manufacturing defects, online ad conversion, etc.
25-04-2023 Classification of Imbalanced Dataset 13
Sr. No.: 1
Paper Title: "ARCID: A New Approach to Deal with Imbalanced Datasets Classification" (2018), Abdellatif, Safa, et al.
Theme/Idea: A new approach to deal with the classification of imbalanced datasets.
Pros: ARCID performs slightly better than Fitcare and RIPPER, and it outperforms all standard rule-based and non-rule-based approaches by many ranks.
Cons: Managing the overwhelming number of CARs generated from real-life datasets, and removing redundant rules that convey the same information.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 14
Sr. No.: 2
Paper Title: "Applying Map-Reduce to Imbalanced Data Classification" (2017), Jedrzejowicz, Joanna, et al.
Theme/Idea: Using Map-Reduce to deal with imbalanced data classification.
Pros: 1. EDBC uses three methods to classify the imbalanced data in a proper manner. 2. SplitBal works in parallel, so its processing time is lower than that of EDBC.
Cons: 1. EDBC does not work in parallel, so its processing time is higher. 2. In SplitBal the dataset is divided into bins, so working on the bins is lengthy.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 15
Sr. No.: 3
Paper Title: "A Study on Classifying Imbalanced Datasets" (2014), T. Lakshmi and S. Prasad.
Theme/Idea: A study of classifying imbalanced datasets.
Pros: 1. The combination of SMOTE + BAG + RF effectively improves the performance of classifying an imbalanced dataset. 2. The use of ensemble and sampling techniques gives a drastic improvement in performance in terms of AUROC.
Cons: 1. Execution time was reduced by using under-sampling, but the classification accuracy suffered. 2. There was a loss of necessary and useful information due to under-sampling.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 16
Sr. No.: 4
Paper Title: "Improved SMOTEBagging and its Application in Imbalanced Data Classification" (2013), Yongqing, Z. & Min & Zhu, Danling & Zhang, M. & Gang and M. Daichuan.
Theme/Idea: An approach to deal with imbalanced data using oversampling and bagging.
Pros: 1. The SMOTE method generates synthetic minority examples to over-sample the minority class. 2. The bagging method improves the performance of the overall system. 3. The Support Vector Machine is one of the most effective machine learning algorithms for many complex binary classification problems. 4. The SMOTEBagging method overcomes the over-fitting problems of the traditional algorithm.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 17
Sr. No.: 4 (continued)
Paper Title: "Improved SMOTEBagging and its Application in Imbalanced Data Classification" (2013), Yongqing, Z. & Min & Zhu, Danling & Zhang, M. & Gang and M. Daichuan.
Theme/Idea: An approach to deal with imbalanced data using oversampling and bagging.
Cons: 1. The SMOTE method only works on the minority class, not on the majority class. 2. Combining decisions in the bagging method sometimes adds overhead.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 18
Sr. No.: 5
Paper Title: "Class Imbalance Problem in Data Mining: Review" (2013), M. R. Longadge, M. S. S. Dongre, D. L. Malik, et al.
Theme/Idea: Using data mining to deal with imbalanced data classification.
Pros: The hybrid approach provides a better solution for class imbalance; the paper suggests several algorithms and techniques that solve the issue of imbalanced sample distribution.
Cons: The method improves the classification accuracy of the minority class, but because of infinite data streams and continuous concept drift it is not suitable for skewed data-stream classification.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 19
Sr. No.: 6
Paper Title: "Skew Boost: An Algorithm for Classifying Imbalanced Datasets" (2011), S. Hukerikar, A. Tumma, A. Nikam, and V. Attar.
Theme/Idea: An approach for handling imbalanced datasets using a boosting technique.
Pros: SkewBoost performs better than DataBoost-IM, Inverse Random UnderSampling (IRUS), EasyEnsemble, BalanceCascade and Cost-Sensitive Boosting (CSB2).
Cons: For some datasets, SkewBoost remains behind some existing algorithms on metrics related to majority-class accuracy.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 20
Sr. No.: 7
Paper Title: "RUSBoost: A Hybrid Approach to Alleviating Class Imbalance" (2010), Seiffert, Chris & Khoshgoftaar, Taghi & Van Hulse, Jason & Napolitano, Amri.
Theme/Idea: A hybrid approach for handling imbalanced datasets using a boosting technique.
Pros: RUSBoost presents a simpler, faster and less complex alternative to SMOTEBoost for learning from imbalanced data.
Cons: The complexity and model training time increase.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 21
Sr. No.: 8
Paper Title: "Learning Pattern Classification Tasks with Imbalanced Datasets" (2009), S. L. Phung.
Theme/Idea: Recognizing learning patterns in the classification of imbalanced data.
Pros: The approach proposed by the author can effectively improve the classification accuracy of minority classes while maintaining the overall classification performance.
Cons: The combination of supervised and unsupervised algorithms tends to have higher values of G-mean and F-measure when compared with different training algorithms.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 22
Sr. No.: 9
Paper Title: "Z-SVM: An SVM for Improved Classification of Imbalanced Data" (2006), Imam, Tasadduq & Ting, Kai & Kamruzzaman, Joarder.
Theme/Idea: An SVM approach to handle imbalanced datasets.
Pros: Z-SVM aims at finding the discriminating hyperplane that maintains an optimal margin from the boundary examples called support vectors.
Cons: It is not very clear how a particular parameter value affects the SVM hyperplane and the generalization capability of the learned model.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 23
Sr. No.: 10
Paper Title: "Editorial: Special Issue on Learning from Imbalanced Data Sets" (2004), Chawla, Nitesh & Japkowicz, Nathalie & Kołcz, Aleksander.
Theme/Idea: An overview of learning from imbalanced datasets.
Pros: The use of cluster-based oversampling counters the effect of class imbalance and small disjuncts.
Cons: Feature selection can often be too expensive to apply for handling imbalanced data.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 24
Sr. No.: 11
Paper Title: "Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach" (2004), Guo, Hongyu & Viktor, Herna.
Theme/Idea: Handling imbalanced data using a boosting approach.
Pros: DataBoost-IM produces a series of high-quality classifiers that are better able to predict examples on which the previous classifier performed poorly.
Cons: The execution gets complicated when a tedious approach is used.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 25
Sr. No.: 12
Paper Title: "Synthetic Minority Oversampling Technique" (2002), N. Chawla.
Theme/Idea: An oversampling approach to deal with imbalanced datasets.
Pros: For almost all the ROC curves, the SMOTE approach dominates; adhering to the definition of the ROC convex hull, most of the potentially optimal classifiers are the ones generated with SMOTE.
Cons: A minority-class sample could possibly have a majority-class sample as its nearest neighbour rather than a minority-class sample, and this crowding will likely contribute to the redrawing of the decision surfaces in favour of the minority class.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 26
Gap Analysis
25-04-2023 Classification of Imbalanced Dataset 27
• The study of these papers gave an idea of how imbalanced data can lead to inappropriate classification results.
• Some papers also gave an idea of how to deal with imbalanced data.
Problem Statement
• To classify an imbalanced data set so as to improve the prediction accuracy of the minority class from a given structured data set, using diversified ensemble and clustering methods.
25-04-2023 Classification of Imbalanced Dataset 28
Objectives
1. To develop a balanced data set from an imbalanced multiclass dataset by applying sampling techniques.
2. To improve the classification precision and achieve good balance among the class samples.
3. To develop an efficient method for solving the class imbalance problem in the classification process.
25-04-2023 Classification of Imbalanced Dataset 29
25-04-2023 Classification of Imbalanced Dataset 30
Activities per Objective
• Develop a balanced data set from an imbalanced multiclass dataset by applying sampling techniques.
• Activities (a minimal resampling sketch follows this list):
1. Input the imbalanced dataset.
2. Apply the SMOTE algorithm for over-sampling.
3. Obtain the resulting balanced dataset.
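A minimal sketch of these three activities, assuming the imbalanced-learn package and a hypothetical CSV file whose last column is the class label:

```python
import pandas as pd
from collections import Counter
from imblearn.over_sampling import SMOTE

df = pd.read_csv("imbalanced_dataset.csv")      # 1. input imbalanced dataset (assumed file name)
X, y = df.iloc[:, :-1], df.iloc[:, -1]

print("Before resampling:", Counter(y))
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)   # 2. apply SMOTE over-sampling
print("After resampling: ", Counter(y_bal))                # 3. balanced dataset
```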
Activities per Objective
25-04-2023 Classification of Imbalanced Dataset 31
• To improve the classification precision and achieve good balance among the class samples.
• Activities:
1. Test with any one classification algorithm, then proceed with the combined algorithm.
2. Perform proper cleansing and pre-processing before testing.
3. Use the learning curve approach to find the performance of the algorithm (sketched below).
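A minimal sketch of the learning-curve check in activity 3, using scikit-learn on a hypothetical toy dataset (the data, names and scoring choice are assumptions, not the project's exact setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42), X, y,
    cv=5, scoring="f1", train_sizes=np.linspace(0.1, 1.0, 5))

# Validation F1 per training-set size shows whether more data still helps.
print(dict(zip(sizes, val_scores.mean(axis=1).round(3))))
```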
Activities per Objective
25-04-2023
• To develop an efficient method for solving the class imbalance problem in the classification process.
• Activities:
1. Evaluate classifiers in imbalanced domains.
2. Use statistical measures for evaluating classification (a sketch of such measures follows).
Classification of Imbalanced Dataset 32
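A minimal sketch (toy labels, assumed values) of statistical measures that stay informative under class imbalance: the confusion matrix, balanced accuracy, and the geometric mean of per-class recall (G-mean).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # toy ground truth (8 majority, 2 minority)
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
g_mean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))   # sqrt(minority recall * majority recall)

print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("G-mean:", round(g_mean, 3))
```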
Feasibility Study
25-04-2023
The following points were analyzed to check the feasibility of Classification of Imbalanced Dataset:
1. Dataset for training and testing
2. SMOTE for synthetic oversampling
3. Random forest for ensemble classification
4. Tkinter for creating the GUI
5. XGBoost for performing boosting
Classification of Imbalanced Dataset 33
Schedule
25-04-2023 Classification of Imbalanced Dataset 34
• - To develop a blockchain manager and integrate it with the
web portal.
• 2.1 Building a blockchain manager interface that can
support dynamic addition of blockchains.
• 2.2 Authorizing the blockchain dynamically to create
a private blockchain network.
• 2.3 Creating methods to support image checking and
uploading to the blockchain via blockchain
manager.
• 2.4 Creating method for restoration after the blockchain
failure.
• 2.5 Calling the methods from web interface.
System Architecture
25-04-2023 Classification of Imbalanced Dataset 35
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 36
• The math behind XGBoost is as follows.
– The following steps are involved in gradient boosting:
• F0(x), with which the boosting algorithm is initialized:
• The gradient of the loss function is computed iteratively:
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 37
• Each hm(x) is fit on the gradient obtained at each step.
• The multiplicative factor γm for each terminal node is derived, and the boosted model Fm(x) is defined:
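The formulas referenced on the two preceding slides were embedded as images; a standard reconstruction of the gradient-boosting steps (the usual formulation, assumed to match the slides' intent) is:

```latex
\begin{align*}
F_0(x) &= \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) \\[4pt]
r_{im} &= -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}}
   \qquad \text{(pseudo-residuals on which } h_m(x) \text{ is fit)} \\[4pt]
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr) \\[4pt]
F_m(x) &= F_{m-1}(x) + \gamma_m\, h_m(x)
\end{align*}
```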
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 38
• The math behind the working of RFC is as follows:
• RFC does both row sampling and column sampling, with a decision tree as the base learner.
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 39
• As the number of base learners (k) increases, the variance decreases; when k is decreased, the variance increases, but the bias remains constant for the whole process (see the sketch below).
• It can be expressed as:
– Random forest = DT (base learner) + bagging (row sampling with replacement) + feature bagging (column sampling) + aggregation (mean/median, majority vote)
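A short supporting sketch of the variance claim, under the standard textbook assumption (not stated on the slide) that each tree has variance σ² and pairwise correlation ρ:

```latex
\operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} h_i(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{k}\,\sigma^2,
\qquad
\mathbb{E}\!\left[\frac{1}{k}\sum_{i=1}^{k} h_i(x)\right] = \mathbb{E}\bigl[h_1(x)\bigr]
```

So the ensemble's variance shrinks towards ρσ² as k grows, while the expectation, and hence the bias, is unchanged.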
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 40
• The math behind RFC is:
• Assume a training set of examples
– D = {(x1, y1), …, (xn, yn)}
• drawn randomly from a (possibly unknown) probability distribution
– (xi, yi) ~ (X, Y)
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 41
• Goal: to build a classifier which predicts y from x based on the data set of examples D.
• Given: an ensemble of (possibly weak) classifiers
– h = { h1(x), …, hk(x) }.
• If each hk(x) is a decision tree, then the ensemble is a random forest. We define the parameters of the decision tree for classifier hk(x) to be
– Qk = (qk1, qk2, …, qkp)
• (these parameters include the structure of the tree, which variables are split in which nodes, etc.)
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 42
• We sometimes write
hk(x) = h(x |qk)
• Thus decision tree k leads to a classifier
hk(x) = h(x |qk)
SRS
• SRS Hyperlink
System Requirements:
25-04-2023 Classification of Imbalanced Dataset 43
Hardware:
• Pentium IV
• 40 GB HDD
• 4 GB RAM
Software:
• Windows 10
• Python
• PyCharm
Non-Functional Requirements
• Performance
• Extendibility
• Reusability
• Reliability
25-04-2023 Classification of Imbalanced Dataset 44
UML Diagrams / DFD Diagrams
25-04-2023 Classification of Imbalanced Dataset 45
• Data Flow Diagram:
DFD Level 0
25-04-2023
DFD Level 1
Classification of Imbalanced Dataset 46
25-04-2023
DFD Level 2
Classification of Imbalanced Dataset 47
25-04-2023
Activity Diagram
Classification of Imbalanced Dataset 48
25-04-2023
Class Diagram
Classification of Imbalanced Dataset 49
25-04-2023
Sequence Diagram
Classification of Imbalanced Dataset 50
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 51
• Algorithms used are as follows:
– Random Forest Classifier
– eXtreme Gradient Boosting (XGBoost)
– SMOTE
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 52
• Working of RFC
– We can understand the working of the Random Forest algorithm with the help of the following steps −
– Step 1 − First, start with the selection of random
samples from a given dataset.
– Step 2 − Next, this algorithm will construct a
decision tree for every sample. Then it will get the
prediction result from every decision tree.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 53
• Working of RFC
– Step 3 − In this step, voting will be performed for
every predicted result.
– Step 4 − At last, select the most voted prediction
result as the final prediction result.
The working of RFC is illustrated in Fig. 1.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 54
Fig. 1 Working of Random Forest Classification
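A minimal runnable sketch of the four steps above using scikit-learn (toy data and parameter values are assumptions, not the project's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

rfc = RandomForestClassifier(
    n_estimators=100,   # Steps 1-2: 100 bootstrap samples, one decision tree each
    bootstrap=True,     # row sampling with replacement
    random_state=42)
rfc.fit(X_tr, y_tr)

# Steps 3-4: each tree votes; the majority vote is the final prediction.
print(classification_report(y_te, rfc.predict(X_te)))
```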
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 55
• Working of XGBoost
– XGBoost is an implementation of Gradient
Boosted decision trees.
– It is a software library that was designed primarily to improve speed and model performance.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 56
• Working of XGBoost
– In this algorithm, decision trees are created in
sequential form.
– Weights play an important role in XGBoost. Weights are assigned to all the independent variables, which are then fed into the decision tree that predicts results.
– The weight of variables predicted wrongly by the tree is increased, and these variables are then fed to the second decision tree.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 57
• Working of XGBoost
– These individual classifiers/predictors are then ensembled to give a strong and more precise model.
The working of XGBoost is illustrated in Fig. 2.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 58
Fig. 2 Working of XGBoost
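A minimal runnable sketch of sequentially boosted trees with the xgboost package (toy data; scale_pos_weight is one common knob for imbalanced classes and is an assumption, not necessarily the project's setting):

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

xgb = XGBClassifier(
    n_estimators=200,        # trees are added sequentially
    learning_rate=0.1,
    scale_pos_weight=9,      # rough majority/minority ratio to re-weight errors
    eval_metric="logloss")
xgb.fit(X_tr, y_tr)

print("F1 score:", round(f1_score(y_te, xgb.predict(X_te)), 3))
```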
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 59
• Working of SMOTE
– Step 1: Set the minority class set A; for each x ∈ A, the k-nearest neighbours of x are obtained by calculating the Euclidean distance between x and every other sample in set A.
– Step 2: The sampling rate N is set according to the imbalance proportion. For each x ∈ A, N examples (i.e. x1, x2, …, xN) are randomly selected from its k nearest neighbours, and they construct the set A1.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 60
• Working of SMOTE
– Step 3: For each example xk ∈ A1 (k = 1, 2, 3, …, N), the following formula is used to generate a new example:
x’ = x + rand(0,1) * |x − xk|
where rand(0,1) represents a random number between 0 and 1.
The working of SMOTE is illustrated in Fig. 3.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 61
Fig. 3: Working of SMOTE
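A minimal NumPy sketch of the three SMOTE steps above (toy minority samples; the |x − xk| term on the slide is interpreted here as the difference vector between x and its neighbour):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
A = rng.normal(size=(20, 2))       # minority-class set A (toy 2-D samples)
k, N = 5, 3                        # k nearest neighbours, N new examples per x

nn = NearestNeighbors(n_neighbors=k + 1).fit(A)   # +1 because x is its own neighbour
_, idx = nn.kneighbors(A)                         # Step 1: k-NN by Euclidean distance

synthetic = []
for i, x in enumerate(A):
    for _ in range(N):                            # Step 2: N neighbours per sample
        xk = A[rng.choice(idx[i][1:])]            # a random neighbour of x
        synthetic.append(x + rng.random() * (xk - x))   # Step 3: x' = x + rand(0,1)*(xk - x)

print(np.array(synthetic).shape)                  # (60, 2) synthetic minority examples
```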
Project Implementation : Testing
25-04-2023 Classification of Imbalanced Dataset 62
Test case ID 1: Execute the implemented system in Windows 10
Test Case Summary To verify if the system is able to execute on Win 10.
Prerequisites User must have python with packages installed.
Test Procedure
1. Open the System
2. Run the System
3. Execute the Algorithm
Expected Result All the algorithms should be successfully executed.
Actual Result All Algorithms are executed Successfully.
Status Pass
Project Implementation : Testing
25-04-2023 Classification of Imbalanced Dataset 63
Test case ID 2: Execute the implemented system in Macintosh
Test Case Summary To verify if the system is able to execute on Mac.
Prerequisites User must have python with packages installed.
Test Procedure
1. Open the System
2. Run the System
3. Execute the Algorithm
Expected Result All the algorithms should be successfully executed.
Actual Result All Algorithms are executed Successfully.
Status Pass
Project Implementation : Testing
25-04-2023 Classification of Imbalanced Dataset 64
Test case ID 3: Do the components of the UI work as expected
Test Case Summary To verify if components of UI are working in every
system.
Prerequisites User must have python with packages installed.
User must have Ubuntu
Test Procedure
1. Open the System
2. Run the System
3. Execute the Algorithm
Expected Result All the components of the UI should perform their tasks as assigned.
Actual Result All components of the UI perform their tasks as assigned.
Status Pass
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 65
• We have evaluated some of the Classifiers
along with the proposed algorithm.
• The Classifiers we used are:
– Random Forest Classifier
– eXtreme Gradient Boosting (XGBoost)
– Bagging Classifier
– SMOTE + XGBoost + RFC
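For reference, the numbers on the following slides correspond to standard scikit-learn metrics; a minimal sketch (assuming arrays y_true, y_pred and probability scores y_score from any of the classifiers above) is:

```python
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score)

def report(y_true, y_pred, y_score):
    """Print the four metrics reported on the results slides."""
    print("Recall:    %.3f" % recall_score(y_true, y_pred))
    print("Precision: %.3f" % precision_score(y_true, y_pred))
    print("F1 Score:  %.3f" % f1_score(y_true, y_pred))
    print("Average precision-recall: %.3f" % average_precision_score(y_true, y_score))
```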
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 66
• For RFC we got performance as:
– Recall: 0.788
– Precision: 0.947
– F1 Score: 0.860
– Average Precision recall: 0.838
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 67
• After tuning RFC we got performance as:
– Recall: 0.788
– Precision: 0.947
– F1 Score: 0.860
– Average Precision recall: 0.854
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 68
• Using XGB we got performance as:
– Recall: 0.805
– Precision: 0.948
– F1 Score: 0.871
– Average Precision recall: 0.868
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 69
• After tuning XGB we got performance as:
– Recall: 0.000
– Precision: 0.nan
– F1 Score: 0.nan
– Average Precision recall: 0.669
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 70
• Using XGB+Smote we got performance as:
– Recall: 1.000
– Precision: 0.999
– F1 Score: 1.000
– Average Precision recall: 1.000
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 71
• Using Smote+RFC we got performance as:
– Recall: 1.000
– Precision: 1.000
– F1 Score: 1.000
– Average Precision recall: 1.000
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 72
• From the evaluation of each algorithm's performance, we came to the conclusion that the SMOTE-resampled data along with XGB and RFC gave better results than the other algorithms.
Conclusion
• Classification is one of the most common problems in machine learning. One of the common issues found in datasets during classification is the class imbalance problem. Various researchers have put their best efforts into solving this issue. The literature review has yielded some useful approaches to face this problem. The best solution to face this problem is to make use of ensemble methods. Ensemble methods combine multiple machine learning techniques into one model to decrease variance (bagging) or bias (boosting), or to improve predictions (stacking). Other methods also exist but are not capable of performing best in the classification of imbalanced datasets. In the future, the proposed work will be implemented using Python programming, with a target performance of more than 95% so that the ensemble classifier can be used reliably.
25-04-2023 Classification of Imbalanced Dataset 73
References
25-04-2023 Classification of Imbalanced Dataset 74
[1] Abdellatif, Safa & Ben Hassine, Mohamed Ali & Ben Yahia, Sadok & Bouzeghoub, Amel. (2018).
“ARCID: A New Approach to Deal with Imbalanced Datasets Classification.”, SOFSEM 2018: Theory
and Practice of Computer Science, pp.569-580 doi:10.1007/978-3-319-73117-9_40.
[2] Jedrzejowicz, Joanna & Neumann, Jakub & Synowczyk, Piotr & Zakrzewska, Magdalena.
(2017). “Applying Map-Reduce to imbalanced data classification”. 2017 IEEE International
Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 29-33.
doi:10.1109/INISTA.2017.8001127.
[3] T. Lakshmi and S. Prasad, “A study on classifying imbalanced datasets,” IEEE Conference
Publication, 2014.
[4] Yongqing, Z. & Min, & Zhu, Danling, & Zhang, M. & Gang, and M. Daichuan, “Improved
SMOTEBagging and its application in imbalanced data classification,” IEEE Conference Anthology,
2013.
[5] M. R. Longadge, M. S. S. Dongre, D. L. Malik et al., “Class Imbalance Problem in Data Mining:
Review,” International Journal of Computer Science and Network (IJCSN), vol. 2, no. 1, 2013.
[Online]. Available: www.ijcsn.org
[6] S. Hukerikar, A. Tumma, A. Nikam, and V. Attar, “Skew Boost: An algorithm for classifying
imbalanced datasets,” in 2011 2nd International Conference on Computer and Communication
Technology (ICCCT-2011), 2011, pp. 46–52.
[7] Seiffert, Chris & Khoshgoftaar, Taghi & Van Hulse, Jason & Napolitano, Amri. (2010).
“RUSBoost: A Hybrid Approach to Alleviating Class Imbalance,” IEEE Transactions on
Systems, Man, and Cybernetics, Part A: Systems and Humans, 2010.
[8] S. L. Phung, “Learning pattern classification tasks with imbalanced datasets,” in
Pattern Recognition. InTech, 2009.
[9] Imam, Tasadduq & Ting, Kai & Kamruzzaman, Joarder. “Z-SVM: An SVM for
improved classification of imbalanced data,” Advances in Artificial
Intelligence, vol. 4304, 2006.
[10] Chawla, Nitesh & Japkowicz, Nathalie & Kołcz, Aleksander. (2004). “Editorial:
Special Issue on Learning from Imbalanced Data Sets,” SIGKDD
Explorations, 6, 1-6. doi:10.1145/1007730.1007733.
[11]Guo, Hongyu & Viktor, Herna. (2004). “Learning from imbalanced data sets with
boosting and data generation: The DataBoost-IM approach.” SIGKDD
Explorations. 6. 30-39. 10.1145/1007730.1007736.
[12]N. Chawla, “Synthetic Minority Oversampling Technique,” Journal of Artificial
Intelligence Research, vol. 16, pp. 321–357, 2002.
25-04-2023 Classification of Imbalanced Dataset 75
Thank You!!!
  • 1. Department of Computer Engineering, Sinhgad Institute of Technology and Science, Pune 2019-20 Project Stage II SPPU Examination Name. PRNNO. Venkatesh T. Mainalli 71622707L Gaurav J. Bade 71832932F Akshay R. Dhole 71833043K Shivraj Deshmukh 71833027H Name of the guide :- Prof. G. S. Navale Classification of Imbalanced Dataset
  • 2. Agenda • Introduction of Group Members • Introduction of Project Topic • Motivation • Literature Survey • Gap Analysis • Problem Statement • Objectives • Activities per Objective • Feasibility Study • Schedule • System Architecture • Mathematical Model • SRS • UML/DFD • Algorithms • Testing • Results • Conclusion • Future Scope • References 25-04-2023 Classification of Imbalanced Dataset 2
  • 3. Introduction of Group Member 1 25-04-2023 Classification of Imbalanced Dataset 3 Bio – Data : Personal Information Roll No B17COA77 Name of Student Venkatesh Tukkappa Mainalli Date of Birth 25-03-1998 Address House No Bldg No Anil Villa, Kolhewadi, Sinhgad Road Lane, Appt,Society City District Pune Tal Dist State Pin Maharashtra- 411024 Email Id venkateshmainalli55555@gmail.com Mobile No. 7798311297
  • 4. Introduction of Group Member 1 25-04-2023 Classification of Imbalanced Dataset 4 Class Sem Year of Adm % Marks Remark F. E. I 2016-2017 62.0 Pass II 64.0 Pass S. E. I 2017-2018 54.0 Pass II 58.0 Pass T. E. I 2018-2019 60.0 Pass II 68.0 Pass B. E. I 74.16 Pass Academic Track Record
  • 5. Introduction of Group Member 2 25-04-2023 Classification of Imbalanced Dataset 5 Bio – Data : Personal Information Roll No B17COA14 Name of Student Shivraj Deshmukh Date of Birth 17-01-1999 Address House No Bldg No Room no. 201 Lane, Appt,Society Balaji Hostel, Narhe City District Pune Tal Dist State Pin Maharashtra- 411041 Email Id deshmukh.shivraj95@gmail.com Mobile No. 7972326595 Recent Photo 4.05cm ht
  • 6. Introduction of Group Member 2 25-04-2023 Classification of Imbalanced Dataset 6 Class Sem Year of Adm % Marks Remark S. E. I 2017-2018 74.16 Pass II 70.46 Pass T. E. I 2018-2019 69.72 Pass II 64.24 Pass B. E. I 70.16 Pass Academic Track Record
  • 7. Introduction of Group Member 3 25-04-2023 Classification of Imbalanced Dataset 7 Bio – Data : Personal Information Roll No B17COA17 Name of Student Akshay Ramchandra Dhole Date of Birth 21-09-1998 Address House No Bldg No Flat No.1, Lane, Appt,Society Shantiyuga Apartment, Manaji Nagar, Narhe City District Pune Tal Dist State Pin Maharashtra - 411041 Email Id akshaydholead10@gmail.com Mobile No. 7768902644 Recent Photo 4.05cm ht
  • 8. Introduction of Group Member 3 25-04-2023 Classification of Imbalanced Dataset 8 Class Sem Year of Adm % Marks Remark S. E. I 2017-2018 72.68 Pass II 67.50 Pass T. E. I 2018-2019 61.06 Pass II 66.53 Pass B. E. I 78.23 Pass Academic Track Record
  • 9. Introduction of Group Member 4 25-04-2023 Classification of Imbalanced Dataset 9 Bio – Data : Personal Information Roll No B17COA04 Name of Student Gaurav Jayant Bade Date of Birth 03-01-1998 Address House No Bldg No Flat No.420, E-Wing Lane, Appt,Society Samarth Vishwa Society, Khandala City District Khandala, Satara Tal Dist State Pin Maharashtra- 412802 Email Id Gauravbade11@gmail.com Mobile No. 9960503007
  • 10. Introduction of Group Member 4 25-04-2023 Classification of Imbalanced Dataset 10 Class Sem Year of Adm % Marks Remark S. E. I 2017-2018 69.72 Pass II 69.12 Pass T. E. I 2018-2019 64.46 Pass II 64.09 Pass B. E. I 2019-2020 73.19 Pass Academic Track Record
  • 11. Introduction • What is Machine Learning ?  A branch of artificial intelligence, concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.  As intelligence requires knowledge, it is necessary for the computers to acquire knowledge. 25-04-2023 Classification of Imbalanced Dataset 11
  • 12. • What is Imbalanced Dataset ? Imbalance data distribution is an important part of machine learning workflow.  An imbalanced dataset means instances of one of the two classes is higher than the other, in another way, the number of observations is not the same for all the classes in a classification dataset. This problem is faced not only in the binary class data but also in the multi-class data. 25-04-2023 Classification of Imbalanced Dataset 12
  • 13. Motivation  Many real-world data sets have an imbalanced distribution of the instances. Learning from such data sets results in the classifier being biased towards the majority class, thereby tending to misclassify the minority class samples For example – fraud detection, cancer detection, manufacturing defects, online ads conversion etc. 25-04-2023 Classification of Imbalanced Dataset 13
  • 14. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 1 “ARCID: A New Approach to Deal with Imbalanced Datasets Classification” Year:- 2018 Authors:- Abdellatif, Safa, et.al New Approach to deal with Classification of Imbalanced Datasets Pros: ARCID performs slightly better than Fitcare and RIPPER. However, it outperforms all standard rule-based and non- rule based approaches by many ranks Cons: Managing the overwhelming number of CARs generated from real- life datasets and Removing redundant rules conveys the same information. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 14
  • 15. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 2 Paper Title:-Applying Map- Reduce to imbalanced data classification Year:- 2017 Authors:-Jedrzejowicz, Joanna , et.al Using Map-Reduce to deal with Imbalanced data classification Pros: 1.EDBC uses the three methods to classify the imbalance data in proper in manner. 2.SplitBal works parallelly so the processing time is less as compare to EDBC. Cons: 1.EDBC does not works parallelly so the processing time is more. 2. In SplitBal dataset is divided into bins so working on bin is lengthy. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 15
  • 16. Sr. Paper Title Paper Theme/Idea Pros / Cons 3 Paper Title:-A study on classifying imbalanced datasets Year:- 2014 Authors:-T. Lakshmi and S. Prasad Study of classifying imbalanced dataset. Pros: 1.The combination of SMOTE + BAG + RF effectively improves the performance of classifying of imbalanced dataset. 2.The use of ensemble and sampling techniques gives drastic improvement in performance in terms of AUROC Cons: 1.Execution time was reduced by using the under- sampling, but the accuracy of classifying dataset was distracted. 2.There was loss of necessary and useful information due to under-sampling. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 16
  • 17. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 4 Paper Title:-Improved SMOTEBagging and its application in imbalanced data classification Year:- 2013 Authors:-Yongqing, Z. & Min, & Zhu, Danling, & Zhang, M. & Gang, and M. Daichuan An approach to deal imabalanced data using oversampling and bagging approach. Pros: 1.SMOTE Method generates synthetic minority examples to over-sample the minority class. 2.Bagging Method improves the performance of the overall system. 3.Support Vector Machine one of the most effective machine learning algorithms for many complex binary classification problems. 4.SMOTE Bagging Method overcomes over-fitting problems of traditional algorithm. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 17
  • 18. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 4 Paper Title:-Improved SMOTEBagging and its application in imbalanced data classification Year:- 2013 Authors:-Yongqing, Z. & Min, & Zhu, Danling, & Zhang, M. & Gang, and M. Daichuan An approach to deal imabalanced data using oversampling and bagging approach. Cons: 1.SMOTE Method it only work on minority class not on majority class. 2.Bagging Method combing of decision sometimes over-head Literature Survey 25-04-2023 Classification of Imbalanced Dataset 18
  • 19. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 5 Paper Title:-Class Imbalance Problem in Data Mining: Review Year:- 2013 Authors:-M. R. Longadge, M. S. S. Dongre, D. L. Malik et al Using Data Mining to deal with Imbalanced data classification Pros: Hybrid approach provides better solution for class imbalance. Suggests several algorithm and techniques that solve the issue of imbalance distribution of sample. Cons: This method improves the classification accuracy of minority class but, because of infinite data streams and continuous concept drifting, this method cannot suitable for skewed data stream classification. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 19
  • 20. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 6 Paper Title:-Skew Boost: An algorithm for classifying imbalanced datasets Year:- 2011 Authors:-S. Hukerikar, A. Tumma, A. Nikam, and V. Attar, An approach of handling imbalanced dataset using boosting tech Pros: SkewBoost performs better than DataBoost-IM, Inverse Random UnderSampling (IRUS), EasyEnsemble and Balance Cascade and Cost Sensitive Boosting (CSB2). Cons: In case of some datasets, SkewBoost remains behind of some existing algorithms in metrics related to the majority class accuracy Literature Survey 25-04-2023 Classification of Imbalanced Dataset 20
  • 21. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 7 Paper Title:-RUSBoost: A Hybrid Approach to Alleviating Class Imbalance Year:- 2010 Authors:-Seiffert, Chris & Khoshgoftaar, Taghi & Van Hulse, Jason & Napolitano, Amri. Hybrid approach of handling imbalanced dataset using boosting tech. Pros: RUSBoost presents an easier, faster, and fewer advanced alternative to SMOTEBoost for learning from imbalanced data. Cons: The complexity and model training time get increased. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 21
• 22. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 8 Paper Title:- Learning pattern classification tasks with imbalanced datasets Year:- 2009 Authors:- S. L. Phung Recognizing learning patterns in the classification of imbalanced data. Pros: The approach proposed by the author can effectively improve the classification accuracy of minority classes while maintaining the overall classification performance. Cons: The combination of supervised and unsupervised algorithms tends to have higher values of G-mean and F-measure when compared with different training algorithms. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 22
• 23. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 9 Paper Title:- Z-SVM: An SVM for improved classification of imbalanced data Year:- 2006 Authors:- Imam, Tasadduq & Ting, Kai & Kamruzzaman, Joarder An SVM approach to handling imbalanced datasets. Pros: Z-SVM aims at finding the discriminating hyperplane that maintains an optimal margin from the boundary examples called support vectors. Cons: It is not clear how a particular parameter value affects the SVM hyperplane and the generalization capability of the learned model. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 23
• 24. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 10 Paper Title:- Editorial: Special Issue on Learning from Imbalanced Data Sets Year:- 2004 Authors:- Chawla, Nitesh & Japkowicz, Nathalie & Kołcz, Aleksander An overview of learning from imbalanced datasets. Pros: The use of cluster-based oversampling counters the effect of class imbalance and small disjuncts. Cons: Feature selection can often be too expensive to apply for handling imbalanced data. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 24
• 25. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 11 Paper Title:- Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach Year:- 2004 Authors:- Guo, Hongyu & Viktor, Herna Handling imbalanced data using a boosting approach. Pros: DataBoost-IM produces a series of high-quality classifiers that are better able to predict examples on which the previous classifier performed poorly. Cons: Execution becomes complicated because the approach is tedious. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 25
• 26. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 12 Paper Title:- Synthetic Minority Oversampling Technique Year:- 2002 Authors:- N. Chawla An oversampling approach to deal with imbalanced datasets. Pros: For almost all the ROC curves, the SMOTE approach dominates; adhering to the definition of the ROC convex hull, most of the potentially optimal classifiers are the ones generated with SMOTE. Cons: A minority class sample could have a majority class sample as its nearest neighbor rather than a minority class sample; this crowding is likely to contribute to the redrawing of the decision surfaces in favor of the minority class. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 26
• 27. Gap Analysis 25-04-2023 Classification of Imbalanced Dataset 27 • The study of these papers gave an idea of how imbalanced data can lead to misleading results. • Some papers also gave ideas about how to deal with imbalanced data.
• 28. Problem Statement • To classify an imbalanced dataset and improve the prediction accuracy of the minority class in a given structured dataset using diversified ensemble and clustering methods. 25-04-2023 Classification of Imbalanced Dataset 28
• 29. Objectives 1. To develop a balanced dataset from an imbalanced multiclass dataset by applying sampling techniques. 2. To improve the classification precision and achieve good balance among the class samples. 3. To develop an efficient method for solving the class imbalance problem in the classification process. 25-04-2023 Classification of Imbalanced Dataset 29
• 30. 25-04-2023 Classification of Imbalanced Dataset 30 Activities per Objective • Develop a balanced dataset from an imbalanced multiclass dataset by applying sampling techniques. • Activities: 1. Input the imbalanced dataset. 2. Apply the SMOTE algorithm for oversampling. 3. Obtain the resulting balanced dataset.
• 31. Activities per Objective 25-04-2023 Classification of Imbalanced Dataset 31 • To improve the classification precision and achieve good balance among the class samples. • Activities: 1. Test with a single classification algorithm, then proceed with the combined algorithm. 2. Perform proper cleansing and pre-processing before testing. 3. Use the learning-curve approach to measure the performance of the algorithm.
• 32. Activities per Objective 25-04-2023 • To develop an efficient method for solving the class imbalance problem in the classification process. • Activities: 1. Evaluate classifiers in imbalanced domains. 2. Use statistical measures for evaluating the classification. Classification of Imbalanced Dataset 32
• 33. Feasibility Study 25-04-2023 The following points were analyzed to check the feasibility of classifying imbalanced datasets: 1. Dataset for training and testing 2. SMOTE for synthetic oversampling 3. Random Forest for ensemble classification 4. Tkinter for creating the GUI 5. XGBoost for performing boosting Classification of Imbalanced Dataset 33
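As a quick feasibility check, the stack listed above can be verified simply by importing it. A minimal sketch follows, assuming the usual PyPI distributions scikit-learn, imbalanced-learn and xgboost (Tkinter ships with CPython); this is not the project's actual setup script.

    # Minimal feasibility check: confirm the assumed packages are importable.
    # Assumes: pip install scikit-learn imbalanced-learn xgboost
    import importlib

    for module in ("sklearn", "imblearn", "xgboost", "tkinter"):
        try:
            importlib.import_module(module)
            print(f"{module}: available")
        except ImportError as err:
            print(f"{module}: missing ({err})")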
  • 34. Schedule 25-04-2023 Classification of Imbalanced Dataset 34 • - To develop a blockchain manager and integrate it with the web portal. • 2.1 Building a blockchain manager interface that can support dynamic addition of blockchains. • 2.2 Authorizing the blockchain dynamically to create a private blockchain network. • 2.3 Creating methods to support image checking and uploading to the blockchain via blockchain manager. • 2.4 Creating method for restoration after the blockchain failure. • 2.5 Calling the methods from web interface.
• 36. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 36 • The math behind XGBoost is as follows; the following steps are involved in gradient boosting: • F0(x), with which we initialize the boosting algorithm. • The gradient of the loss function is then computed iteratively.
• 37. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 37 • Each hm(x) is fit on the gradient obtained at each step. • The multiplicative factor γm for each terminal node is derived, and the boosted model Fm(x) is defined.
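The formulas on these two slides appear only as images in the original deck; a standard gradient-boosting formulation consistent with the description above (L is the loss, m indexes boosting rounds, h_m is the weak learner fitted at round m) would be:

    % Standard gradient boosting steps (a sketch; not reproduced from the slide images)
    \begin{aligned}
    F_0(x) &= \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
        && \text{(initialisation)} \\
    r_{im} &= -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}
        && \text{(negative gradient at each step)} \\
    \gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr)
        && \text{(multiplicative factor)} \\
    F_m(x) &= F_{m-1}(x) + \gamma_m\, h_m(x)
        && \text{(boosted model)}
    \end{aligned}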
• 38. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 38 • The math behind the working of RFC is as follows: • RFC does both row sampling and column sampling, with a decision tree as the base learner.
• 39. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 39 • As the number of base learners (k) increases, the variance decreases; when k decreases, the variance increases, while the bias remains roughly constant throughout. • It can be expressed as: – Random forest = DT (base learner) + bagging (row sampling with replacement) + feature bagging (column sampling) + aggregation (mean/median or majority vote)
• 40. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 40 • The math behind RFC is: • Assume a training set – D = {(x1,y1),…,(xn,yn)} • drawn randomly from a (possibly unknown) probability distribution – (xi,yi) ~ (X,Y)
• 41. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 41 • Goal: to build a classifier that predicts y from x based on the dataset of examples D. • Given: an ensemble of (possibly weak) classifiers – h = {h1(x),…,hk(x)}. • If each hk(x) is a decision tree, then the ensemble is a random forest. We define the parameters of the decision tree for classifier hk(x) to be – Qk = (qk1,qk2,…,qkp) • (these parameters include the structure of the tree, which variables are split at which nodes, etc.)
  • 42. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 42 • We sometimes write hk(x) = h(x |qk) • Thus decision tree k leads to a classifier hk(x) = h(x |qk)
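To make the variance argument on slide 39 concrete, the standard bagging results below (quoted from the ensemble-learning literature, not from the slide images) give the majority-vote rule over K trees and the variance of an average of K base learners with pairwise correlation ρ and per-tree variance σ²:

    % Majority vote over K trees and variance of the bagged average (standard result)
    H(x) = \arg\max_{y} \sum_{k=1}^{K} \mathbf{1}\bigl[\, h(x \mid Q_k) = y \,\bigr]
    \qquad
    \operatorname{Var}\!\left[ \frac{1}{K} \sum_{k=1}^{K} h(x \mid Q_k) \right]
      = \rho\,\sigma^2 + \frac{1-\rho}{K}\,\sigma^2

As K grows the second term vanishes, which is why adding trees reduces variance while the bias of each individual tree is left unchanged.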
• 43. SRS • SRS Hyperlink • System Requirements: 25-04-2023 Classification of Imbalanced Dataset 43 Hardware: • Pentium IV • 40 GB HDD • 4 GB RAM Software: • Windows 10 • Python • PyCharm
• 44. Non-Functional Requirements • Performance • Extendibility • Reusability • Reliability 25-04-2023 Classification of Imbalanced Dataset 44
  • 45. UML Diagrams / DFD Diagrams 25-04-2023 Classification of Imbalanced Dataset 45 • Data Flow Diagram: DFD Level 0
  • 46. 25-04-2023 DFD Level 1 Classification of Imbalanced Dataset 46
  • 47. 25-04-2023 DFD Level 2 Classification of Imbalanced Dataset 47
• 51. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 51 • The algorithms used are as follows: – Random Forest Classifier – eXtreme Gradient Boosting (XGBoost) – SMOTE
• 52. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 52 • Working of RFC – We can understand the working of the Random Forest algorithm with the help of the following steps: – Step 1 − First, start with the selection of random samples from a given dataset. – Step 2 − Next, the algorithm will construct a decision tree for every sample and get the prediction result from every decision tree.
• 53. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 53 • Working of RFC – Step 3 − In this step, voting is performed over the predicted results. – Step 4 − At last, the most voted prediction result is selected as the final prediction result. The working of RFC is as illustrated in Fig. 1.
  • 54. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 54 Fig. 1 Working of Random Forest Classification
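A minimal sketch of Steps 1–4 using scikit-learn is given below; the synthetic toy dataset and hyperparameter values are illustrative assumptions, not the project's actual data or configuration.

    # Sketch: Random Forest classification with scikit-learn (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Imbalanced toy dataset: roughly 95% majority class, 5% minority class.
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.95, 0.05], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=42)

    # Each tree is trained on a bootstrap sample (row sampling) and a random
    # subset of features at each split (column sampling); predictions are
    # combined by majority vote, as described in Steps 1-4 above.
    rfc = RandomForestClassifier(n_estimators=100, random_state=42)
    rfc.fit(X_train, y_train)
    print(classification_report(y_test, rfc.predict(X_test)))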
• 55. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 55 • Working of XGBoost – XGBoost is an implementation of gradient-boosted decision trees. – It is a software library designed primarily to improve speed and model performance.
• 56. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 56 • Working of XGBoost – In this algorithm, decision trees are created sequentially. – Weights play an important role in XGBoost. Weights are assigned to all the training instances, which are then fed into the decision tree that predicts results. – The weights of instances predicted wrongly by the tree are increased, and these instances are then fed to the second decision tree.
• 57. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 57 • Working of XGBoost – These individual classifiers/predictors are then ensembled to give a stronger and more precise model. The working of XGBoost is as illustrated in Fig. 2.
  • 58. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 58 Fig. 2 Working of XGBoost
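A comparable sketch with the xgboost library follows; again the toy dataset and parameter values are assumptions chosen for illustration rather than the project's settings.

    # Sketch: XGBoost classification on an imbalanced toy dataset (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.95, 0.05], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=42)

    # Trees are built sequentially; each new tree focuses on the examples the
    # previous trees predicted poorly (via the gradient of the loss).
    # scale_pos_weight re-weights the minority class (ratio of negatives to positives).
    xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4,
                        scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
                        eval_metric="logloss", random_state=42)
    xgb.fit(X_train, y_train)
    print(classification_report(y_test, xgb.predict(X_test)))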
• 59. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 59 • Working of SMOTE – Step 1: Set the minority class set A; for each x ∈ A, the k-nearest neighbours of x are obtained by calculating the Euclidean distance between x and every other sample in set A. – Step 2: The sampling rate N is set according to the imbalance proportion. For each x ∈ A, N examples (i.e. x1, x2, x3, …, xN) are randomly selected from its k nearest neighbours, and they construct the set A1.
• 60. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 60 • Working of SMOTE – Step 3: For each example xk ∈ A1 (k = 1, 2, 3, …, N), the following formula is used to generate a new example: x' = x + rand(0,1) * |x − xk|, in which rand(0,1) represents a random number between 0 and 1. The working of SMOTE is as illustrated in Fig. 3.
• 61. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 61 Fig. 3: Working of SMOTE
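A minimal sketch of Steps 1–3 is given below. This is an illustrative implementation, not the project's code (in practice imblearn.over_sampling.SMOTE provides the same behaviour); the interpolation uses the difference vector (xk − x), which is what the |x − xk| term in the slide formula denotes.

    # Sketch of the SMOTE steps described above (illustrative only).
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_sketch(minority, n_new, k=5, seed=0):
        """Generate n_new synthetic samples from the minority class set A."""
        rng = np.random.default_rng(seed)
        # Step 1: k-nearest neighbours within the minority class (Euclidean distance).
        nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
        _, idx = nn.kneighbors(minority)            # idx[:, 0] is the point itself
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(minority))         # pick a minority sample x
            j = rng.choice(idx[i, 1:])              # Step 2: pick one of its k neighbours xk
            x, xk = minority[i], minority[j]
            # Step 3: x' = x + rand(0,1) * (xk - x), i.e. interpolate between x and xk.
            synthetic.append(x + rng.random() * (xk - x))
        return np.asarray(synthetic)

    minority = np.random.default_rng(1).normal(size=(30, 2))   # toy minority-class samples
    print(smote_sketch(minority, n_new=60).shape)              # -> (60, 2)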
• 62. Project Implementation : Testing 25-04-2023 Classification of Imbalanced Dataset 62 Test case ID 1: Execute the implemented system on Windows 10 Test Case Summary To verify that the system is able to execute on Windows 10. Prerequisites User must have Python with the required packages installed. Test Procedure 1. Open the System 2. Run the System 3. Execute the Algorithm Expected Result All the algorithms should be executed successfully. Actual Result All algorithms are executed successfully. Status Pass
• 63. Project Implementation : Testing 25-04-2023 Classification of Imbalanced Dataset 63 Test case ID 2: Execute the implemented system on Macintosh Test Case Summary To verify that the system is able to execute on Mac. Prerequisites User must have Python with the required packages installed. Test Procedure 1. Open the System 2. Run the System 3. Execute the Algorithm Expected Result All the algorithms should be executed successfully. Actual Result All algorithms are executed successfully. Status Pass
• 64. Project Implementation : Testing 25-04-2023 Classification of Imbalanced Dataset 64 Test case ID 3: Do the components of the UI work as expected Test Case Summary To verify that the components of the UI work on every system. Prerequisites User must have Python with the required packages installed. User must have Ubuntu. Test Procedure 1. Open the System 2. Run the System 3. Execute the Algorithm Expected Result All the components of the UI should perform their tasks as assigned. Actual Result All components of the UI perform their tasks as assigned. Status Pass
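The manual test cases above could also be automated. Below is a hedged sketch using pytest, where train_and_evaluate is a hypothetical stand-in for the project's real entry point (not its actual API):

    # test_smoke.py -- sketch of an automated smoke test (run with: pytest test_smoke.py)
    from sklearn.datasets import make_classification

    def train_and_evaluate(X, y):
        # Hypothetical stand-in for the project's training routine.
        from imblearn.over_sampling import SMOTE
        from sklearn.ensemble import RandomForestClassifier
        X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
        return RandomForestClassifier(n_estimators=10, random_state=0).fit(X_res, y_res)

    def test_pipeline_runs_end_to_end():
        # Mirrors test cases 1-3: the same code path should execute on
        # Windows, macOS or Ubuntu as long as the Python packages are installed.
        X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
        model = train_and_evaluate(X, y)
        assert model.predict(X).shape == (500,)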
• 65. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 65 • We have evaluated several classifiers along with the proposed algorithm. • The classifiers we used are: – Random Forest Classifier – eXtreme Gradient Boosting (XGBoost) – Bagging Classifier – SMOTE + XGBoost + RFC
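The metrics reported on the following slides can be computed with scikit-learn; a sketch with placeholder predictions is shown below (the "average precision-recall" figure presumably corresponds to average_precision_score, i.e. the area under the precision-recall curve).

    # Sketch: computing the reported metrics with scikit-learn.
    # y_test, y_pred and y_score are placeholders, not the project's outputs.
    import numpy as np
    from sklearn.metrics import (precision_score, recall_score, f1_score,
                                 average_precision_score)

    y_test  = np.array([0, 0, 1, 1, 0, 1, 0, 1])                 # true labels
    y_pred  = np.array([0, 0, 1, 0, 0, 1, 0, 1])                 # predicted labels
    y_score = np.array([0.1, 0.2, 0.9, 0.4, 0.3, 0.8, 0.2, 0.7]) # predicted P(class 1)

    print("Recall               :", recall_score(y_test, y_pred))
    print("Precision            :", precision_score(y_test, y_pred))
    print("F1 Score             :", f1_score(y_test, y_pred))
    print("Average precision PR :", average_precision_score(y_test, y_score))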
• 66. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 66 • For RFC we got the following performance: – Recall: 0.788 – Precision: 0.947 – F1 Score: 0.860 – Average precision-recall score: 0.838
• 67. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 67 • After tuning RFC we got the following performance: – Recall: 0.788 – Precision: 0.947 – F1 Score: 0.860 – Average precision-recall score: 0.854
• 68. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 68 • Using XGB we got the following performance: – Recall: 0.805 – Precision: 0.948 – F1 Score: 0.871 – Average precision-recall score: 0.868
• 69. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 69 • After tuning XGB we got the following performance: – Recall: 0.000 – Precision: NaN – F1 Score: NaN – Average precision-recall score: 0.669
• 70. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 70 • Using SMOTE + XGB we got the following performance: – Recall: 1.000 – Precision: 0.999 – F1 Score: 1.000 – Average precision-recall score: 1.000
• 71. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 71 • Using SMOTE + RFC we got the following performance: – Recall: 1.000 – Precision: 1.000 – F1 Score: 1.000 – Average precision-recall score: 1.000
• 72. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 72 • From the evaluation of each algorithm's performance, we came to the conclusion that SMOTE-resampled data along with XGB and RFC gave better results than the other algorithms.
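A sketch of the best-performing combinations using an imbalanced-learn Pipeline is given below. Placing SMOTE inside the pipeline applies resampling only to the training folds during cross-validation, so scores are computed on original (non-synthetic) samples; the dataset and hyperparameters are illustrative assumptions, not the project's configuration.

    # Sketch: SMOTE + RFC and SMOTE + XGB evaluated with cross-validation.
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)

    for name, clf in [("SMOTE + RFC", RandomForestClassifier(random_state=42)),
                      ("SMOTE + XGB", XGBClassifier(eval_metric="logloss", random_state=42))]:
        # SMOTE runs inside the pipeline, so resampling never touches the test folds.
        pipe = Pipeline([("smote", SMOTE(random_state=42)), ("clf", clf)])
        scores = cross_val_score(pipe, X, y, scoring="f1", cv=5)
        print(f"{name}: mean F1 = {scores.mean():.3f}")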
• 73. Conclusion • Classification is one of the most common problems in machine learning. One of the common issues found in datasets during classification is the class imbalance issue. Various researchers have put their best efforts into solving this issue, and the literature review has yielded some useful approaches to face this problem. The best solution to this problem is to make use of ensemble methods. Ensemble methods combine multiple machine learning techniques into one model to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking). Other methods also exist but are not capable of performing as well in the classification of imbalanced datasets. The proposed work was implemented in Python with the aim that the target performance exceeds 95%, so that the ensemble classifier can be used reliably. 25-04-2023 Classification of Imbalanced Dataset 73
• 74. References 25-04-2023 Classification of Imbalanced Dataset 74 [1] Abdellatif, Safa & Ben Hassine, Mohamed Ali & Ben Yahia, Sadok & Bouzeghoub, Amel. (2018). "ARCID: A New Approach to Deal with Imbalanced Datasets Classification," SOFSEM 2018: Theory and Practice of Computer Science, pp. 569-580. doi: 10.1007/978-3-319-73117-9_40. [2] Jedrzejowicz, Joanna & Neumann, Jakub & Synowczyk, Piotr & Zakrzewska, Magdalena. (2017). "Applying Map-Reduce to imbalanced data classification," 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 29-33. doi: 10.1109/INISTA.2017.8001127. [3] T. Lakshmi and S. Prasad, "A study on classifying imbalanced datasets," IEEE Conference Publication, 2014. [4] Yongqing, Z. & Min, & Zhu, Danling, & Zhang, M. & Gang, and M. Daichuan, "Improved SMOTEBagging and its application in imbalanced data classification," IEEE Conference Anthology, 2013. [5] M. R. Longadge, M. S. S. Dongre, D. L. Malik et al., "Class Imbalance Problem in Data Mining: Review," International Journal of Computer Science and Network (IJCSN), vol. 2, no. 1, 2013. [Online]. Available: www.ijcsn.org [6] S. Hukerikar, A. Tumma, A. Nikam, and V. Attar, "Skew Boost: An algorithm for classifying imbalanced datasets," in 2011 2nd International Conference on Computer and Communication Technology (ICCCT-2011), 2011, pp. 46–52.
• 75. [7] Seiffert, Chris & Khoshgoftaar, Taghi & Van Hulse, Jason & Napolitano, Amri. (2010). "RUSBoost: A Hybrid Approach to Alleviating Class Imbalance," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 2010. [8] S. L. Phung, "Learning pattern classification tasks with imbalanced datasets," in Pattern Recognition, InTech, 2009. [9] Imam, Tasadduq & Ting, Kai & Kamruzzaman, Joarder, "Z-SVM: An SVM for improved classification of imbalanced data," Advances in Artificial Intelligence, vol. 4304, 2006. [10] Chawla, Nitesh & Japkowicz, Nathalie & Kołcz, Aleksander. (2004). "Editorial: Special Issue on Learning from Imbalanced Data Sets," SIGKDD Explorations, vol. 6, pp. 1-6. doi: 10.1145/1007730.1007733. [11] Guo, Hongyu & Viktor, Herna. (2004). "Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach," SIGKDD Explorations, vol. 6, pp. 30-39. doi: 10.1145/1007730.1007736. [12] N. Chawla, "Synthetic Minority Oversampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002. 25-04-2023 Classification of Imbalanced Dataset 75