Department of Computer Engineering,
Sinhgad Institute of Technology and Science, Pune
2019-20
Project Stage II SPPU Examination
Name PRN No.
Venkatesh T. Mainalli 71622707L
Gaurav J. Bade 71832932F
Akshay R. Dhole 71833043K
Shivraj Deshmukh 71833027H
Name of the Guide: Prof. G. S. Navale
Classification of Imbalanced Dataset
Agenda
• Introduction of Group Members
• Introduction of Project Topic
• Motivation
• Literature Survey
• Gap Analysis
• Problem Statement
• Objectives
• Activities per Objective
• Feasibility Study
• Schedule
• System Architecture
• Mathematical Model
• SRS
• UML/DFD
• Algorithms
• Testing
• Results
• Conclusion
• Future Scope
• References
25-04-2023 Classification of Imbalanced Dataset 2
Introduction of Group Member 1
25-04-2023 Classification of Imbalanced Dataset 3
Bio – Data : Personal Information
Roll No B17COA77
Name of Student Venkatesh Tukkappa Mainalli
Date of Birth 25-03-1998
Address House No Bldg No Anil Villa, Kolhewadi, Sinhgad Road
Lane, Appt,Society
City District Pune
Tal Dist State Pin Maharashtra- 411024
Email Id venkateshmainalli55555@gmail.com
Mobile No. 7798311297
Introduction of Group Member 1
25-04-2023 Classification of Imbalanced Dataset 4
Class Sem Year of Adm % Marks Remark
F. E. I 2016-2017 62.0 Pass
II 64.0 Pass
S. E. I 2017-2018 54.0 Pass
II 58.0 Pass
T. E. I 2018-2019 60.0 Pass
II 68.0 Pass
B. E. I 74.16 Pass
Academic Track Record
Introduction of Group Member 2
25-04-2023 Classification of Imbalanced Dataset 5
Bio – Data : Personal Information
Roll No B17COA14
Name of Student Shivraj Deshmukh
Date of Birth 17-01-1999
Address House No Bldg No Room no. 201
Lane, Appt,Society Balaji Hostel, Narhe
City District Pune
Tal Dist State Pin Maharashtra- 411041
Email Id deshmukh.shivraj95@gmail.com
Mobile No. 7972326595
Introduction of Group Member 2
25-04-2023 Classification of Imbalanced Dataset 6
Class Sem Year of Adm % Marks Remark
S. E. I 2017-2018 74.16 Pass
II 70.46 Pass
T. E. I 2018-2019 69.72 Pass
II 64.24 Pass
B. E. I 70.16 Pass
Academic Track Record
Introduction of Group Member 3
25-04-2023 Classification of Imbalanced Dataset 7
Bio – Data : Personal Information
Roll No B17COA17
Name of Student Akshay Ramchandra Dhole
Date of Birth 21-09-1998
Address House No Bldg No Flat No.1,
Lane, Appt,Society Shantiyuga Apartment, Manaji Nagar, Narhe
City District Pune
Tal Dist State Pin Maharashtra - 411041
Email Id akshaydholead10@gmail.com
Mobile No. 7768902644
Introduction of Group Member 3
25-04-2023 Classification of Imbalanced Dataset 8
Class Sem Year of Adm % Marks Remark
S. E. I 2017-2018 72.68 Pass
II 67.50 Pass
T. E. I 2018-2019 61.06 Pass
II 66.53 Pass
B. E. I 78.23 Pass
Academic Track Record
Introduction of Group Member 4
25-04-2023 Classification of Imbalanced Dataset 9
Bio – Data : Personal Information
Roll No B17COA04
Name of Student Gaurav Jayant Bade
Date of Birth 03-01-1998
Address House No Bldg No Flat No.420, E-Wing
Lane, Appt,Society Samarth Vishwa Society, Khandala
City District Khandala, Satara
Tal Dist State Pin Maharashtra- 412802
Email Id Gauravbade11@gmail.com
Mobile No. 9960503007
Introduction of Group Member 4
25-04-2023 Classification of Imbalanced Dataset 10
Class Sem Year of Adm % Marks Remark
S. E. I 2017-2018 69.72 Pass
II 69.12 Pass
T. E. I 2018-2019 64.46 Pass
II 64.09 Pass
B. E. I 2019-2020 73.19 Pass
Academic Track Record
Introduction
• What is Machine Learning ?
 A branch of artificial intelligence, concerned with
the design and development of algorithms that allow
computers to evolve behaviors based on empirical
data.
 As intelligence requires knowledge, it is necessary for
the computers to acquire knowledge.
25-04-2023 Classification of Imbalanced Dataset 11
• What is an Imbalanced Dataset ?
 Handling imbalanced data distributions is an important part of the machine learning workflow.
 An imbalanced dataset is one in which the instances of one of the two classes outnumber those of the other; in other words, the number of observations is not the same for all the classes in a classification dataset.
 This problem is faced not only in binary-class data but also in multi-class data.
25-04-2023 Classification of Imbalanced Dataset 12
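As a minimal illustration (a hypothetical toy dataset, not the project's data), class imbalance is visible simply by counting the labels:

```python
# Minimal sketch: a synthetic two-class dataset with roughly a 95/5 split.
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.95, 0.05], random_state=42)
print(Counter(y))  # e.g. Counter({0: ~950, 1: ~50}) -- far from equal
```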
Motivation
 Many real-world data sets have an imbalanced distribution of instances. Learning from such data sets results in the classifier being biased towards the majority class, thereby tending to misclassify the minority class samples.
 For example – fraud detection, cancer detection, manufacturing defects, online ad conversion, etc.
25-04-2023 Classification of Imbalanced Dataset 13
Sr. No.: 1
Paper Title: "ARCID: A New Approach to Deal with Imbalanced Datasets Classification" (2018), Abdellatif, Safa, et al.
Theme/Idea: A new approach to deal with the classification of imbalanced datasets.
Pros: ARCID performs slightly better than Fitcare and RIPPER, and it outperforms all standard rule-based and non-rule-based approaches by many ranks.
Cons: Managing the overwhelming number of CARs generated from real-life datasets, and removing redundant rules that convey the same information.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 14
Sr. No.: 2
Paper Title: "Applying Map-Reduce to Imbalanced Data Classification" (2017), Jedrzejowicz, Joanna, et al.
Theme/Idea: Using Map-Reduce to deal with imbalanced data classification.
Pros: 1. EDBC uses three methods to classify the imbalanced data in a proper manner. 2. SplitBal works in parallel, so its processing time is lower than that of EDBC.
Cons: 1. EDBC does not work in parallel, so its processing time is higher. 2. In SplitBal the dataset is divided into bins, so working on the bins is lengthy.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 15
Sr. No.: 3
Paper Title: "A Study on Classifying Imbalanced Datasets" (2014), T. Lakshmi and S. Prasad.
Theme/Idea: A study of classifying imbalanced datasets.
Pros: 1. The combination of SMOTE + BAG + RF effectively improves the performance of classifying an imbalanced dataset. 2. The use of ensemble and sampling techniques gives a drastic improvement in performance in terms of AUROC.
Cons: 1. Execution time was reduced by using under-sampling, but the classification accuracy suffered. 2. There was a loss of necessary and useful information due to under-sampling.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 16
Sr. No.: 4
Paper Title: "Improved SMOTEBagging and its Application in Imbalanced Data Classification" (2013), Yongqing, Z. & Min & Zhu, Danling & Zhang, M. & Gang and M. Daichuan.
Theme/Idea: An approach to deal with imbalanced data using oversampling and bagging.
Pros: 1. The SMOTE method generates synthetic minority examples to over-sample the minority class. 2. The bagging method improves the performance of the overall system. 3. The Support Vector Machine is one of the most effective machine learning algorithms for many complex binary classification problems. 4. The SMOTEBagging method overcomes the over-fitting problems of the traditional algorithm.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 17
Sr. No.: 4 (continued)
Paper Title: "Improved SMOTEBagging and its Application in Imbalanced Data Classification" (2013), Yongqing, Z. & Min & Zhu, Danling & Zhang, M. & Gang and M. Daichuan.
Theme/Idea: An approach to deal with imbalanced data using oversampling and bagging.
Cons: 1. The SMOTE method only works on the minority class, not on the majority class. 2. Combining decisions in the bagging method sometimes adds overhead.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 18
Sr. No.: 5
Paper Title: "Class Imbalance Problem in Data Mining: Review" (2013), M. R. Longadge, M. S. S. Dongre, D. L. Malik, et al.
Theme/Idea: Using data mining to deal with imbalanced data classification.
Pros: The hybrid approach provides a better solution for class imbalance; the paper suggests several algorithms and techniques that solve the issue of imbalanced sample distribution.
Cons: The method improves the classification accuracy of the minority class, but because of infinite data streams and continuous concept drift it is not suitable for skewed data-stream classification.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 19
Sr. No.: 6
Paper Title: "Skew Boost: An Algorithm for Classifying Imbalanced Datasets" (2011), S. Hukerikar, A. Tumma, A. Nikam, and V. Attar.
Theme/Idea: An approach for handling imbalanced datasets using a boosting technique.
Pros: SkewBoost performs better than DataBoost-IM, Inverse Random UnderSampling (IRUS), EasyEnsemble, BalanceCascade and Cost-Sensitive Boosting (CSB2).
Cons: For some datasets, SkewBoost remains behind some existing algorithms on metrics related to majority-class accuracy.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 20
Sr. No.: 7
Paper Title: "RUSBoost: A Hybrid Approach to Alleviating Class Imbalance" (2010), Seiffert, Chris & Khoshgoftaar, Taghi & Van Hulse, Jason & Napolitano, Amri.
Theme/Idea: A hybrid approach for handling imbalanced datasets using a boosting technique.
Pros: RUSBoost presents a simpler, faster and less complex alternative to SMOTEBoost for learning from imbalanced data.
Cons: The complexity and model training time increase.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 21
Sr. No.: 8
Paper Title: "Learning Pattern Classification Tasks with Imbalanced Datasets" (2009), S. L. Phung.
Theme/Idea: Recognizing learning patterns in the classification of imbalanced data.
Pros: The approach proposed by the author can effectively improve the classification accuracy of minority classes while maintaining the overall classification performance.
Cons: The combination of supervised and unsupervised algorithms tends to have higher values of G-mean and F-measure when compared with different training algorithms.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 22
Sr. No.: 9
Paper Title: "Z-SVM: An SVM for Improved Classification of Imbalanced Data" (2006), Imam, Tasadduq & Ting, Kai & Kamruzzaman, Joarder.
Theme/Idea: An SVM approach to handle imbalanced datasets.
Pros: Z-SVM aims at finding the discriminating hyperplane that maintains an optimal margin from the boundary examples called support vectors.
Cons: It is not very clear how a particular parameter value affects the SVM hyperplane and the generalization capability of the learned model.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 23
Sr. No.: 10
Paper Title: "Editorial: Special Issue on Learning from Imbalanced Data Sets" (2004), Chawla, Nitesh & Japkowicz, Nathalie & Kołcz, Aleksander.
Theme/Idea: An overview of learning from imbalanced datasets.
Pros: The use of cluster-based oversampling counters the effect of class imbalance and small disjuncts.
Cons: Feature selection can often be too expensive to apply for handling imbalanced data.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 24
Sr. No.: 11
Paper Title: "Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach" (2004), Guo, Hongyu & Viktor, Herna.
Theme/Idea: Handling imbalanced data using a boosting approach.
Pros: DataBoost-IM produces a series of high-quality classifiers that are better able to predict examples on which the previous classifier performed poorly.
Cons: The execution gets complicated when a tedious approach is used.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 25
Sr. No.: 12
Paper Title: "Synthetic Minority Oversampling Technique" (2002), N. Chawla.
Theme/Idea: An oversampling approach to deal with imbalanced datasets.
Pros: For almost all the ROC curves, the SMOTE approach dominates; adhering to the definition of the ROC convex hull, most of the potentially optimal classifiers are the ones generated with SMOTE.
Cons: A minority-class sample could possibly have a majority-class sample as its nearest neighbour rather than a minority-class sample, and this crowding will likely contribute to the redrawing of the decision surfaces in favour of the minority class.
Literature Survey
25-04-2023 Classification of Imbalanced Dataset 26
Gap Analysis
25-04-2023 Classification of Imbalanced Dataset 27
• The study of these papers gave an idea of how imbalanced data can lead to inappropriate classification results.
• Some papers also gave an idea of how to deal with imbalanced data.
Problem Statement
• To classify an imbalanced data set so as to improve the prediction accuracy of the minority class from a given structured data set, using diversified ensemble and clustering methods.
25-04-2023 Classification of Imbalanced Dataset 28
Objectives
1. To develop a balanced data set from an imbalanced multiclass dataset by applying sampling techniques.
2. To improve the classification precision and achieve good balance among the class samples.
3. To develop an efficient method for solving the class imbalance problem in the classification process.
25-04-2023 Classification of Imbalanced Dataset 29
25-04-2023 Classification of Imbalanced Dataset 30
Activities per Objective
• Develop a balanced data set from an imbalanced multiclass dataset by applying sampling techniques.
• Activities (a minimal resampling sketch follows this list):
1. Input the imbalanced dataset.
2. Apply the SMOTE algorithm for over-sampling.
3. Obtain the resulting balanced dataset.
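A minimal sketch of these three activities, assuming the imbalanced-learn package and a hypothetical CSV file whose last column is the class label:

```python
import pandas as pd
from collections import Counter
from imblearn.over_sampling import SMOTE

df = pd.read_csv("imbalanced_dataset.csv")      # 1. input imbalanced dataset (assumed file name)
X, y = df.iloc[:, :-1], df.iloc[:, -1]

print("Before resampling:", Counter(y))
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)   # 2. apply SMOTE over-sampling
print("After resampling: ", Counter(y_bal))                # 3. balanced dataset
```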
Activities per Objective
25-04-2023 Classification of Imbalanced Dataset 31
• To improve the classification precision and achieve good balance among the class samples.
• Activities:
1. Test with any one classification algorithm, then proceed with the combined algorithm.
2. Perform proper cleansing and pre-processing before testing.
3. Use the learning curve approach to find the performance of the algorithm (sketched below).
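A minimal sketch of the learning-curve check in activity 3, using scikit-learn on a hypothetical toy dataset (the data, names and scoring choice are assumptions, not the project's exact setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(random_state=42), X, y,
    cv=5, scoring="f1", train_sizes=np.linspace(0.1, 1.0, 5))

# Validation F1 per training-set size shows whether more data still helps.
print(dict(zip(sizes, val_scores.mean(axis=1).round(3))))
```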
Activities per Objective
25-04-2023
• To develop an efficient method for solving the class imbalance problem in the classification process.
• Activities:
1. Evaluate classifiers in imbalanced domains.
2. Use statistical measures for evaluating classification (a sketch of such measures follows).
Classification of Imbalanced Dataset 32
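A minimal sketch (toy labels, assumed values) of statistical measures that stay informative under class imbalance: the confusion matrix, balanced accuracy, and the geometric mean of per-class recall (G-mean).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # toy ground truth (8 majority, 2 minority)
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]   # toy predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
g_mean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))   # sqrt(minority recall * majority recall)

print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("G-mean:", round(g_mean, 3))
```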
Feasibility Study
25-04-2023
The following points were analyzed to check the feasibility of Classification of Imbalanced Dataset:
1. Dataset for training and testing
2. SMOTE for synthetic oversampling
3. Random forest for ensemble classification
4. Tkinter for creating the GUI
5. XGBoost for performing boosting
Classification of Imbalanced Dataset 33
Schedule
25-04-2023 Classification of Imbalanced Dataset 34
• - To develop a blockchain manager and integrate it with the
web portal.
• 2.1 Building a blockchain manager interface that can
support dynamic addition of blockchains.
• 2.2 Authorizing the blockchain dynamically to create
a private blockchain network.
• 2.3 Creating methods to support image checking and
uploading to the blockchain via blockchain
manager.
• 2.4 Creating method for restoration after the blockchain
failure.
• 2.5 Calling the methods from web interface.
System Architecture
25-04-2023 Classification of Imbalanced Dataset 35
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 36
• The math behind XGBoost is as follows.
– The following steps are involved in gradient boosting:
• F0(x), with which the boosting algorithm is initialized:
• The gradient of the loss function is computed iteratively:
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 37
• Each hm(x) is fit on the gradient obtained at each step.
• The multiplicative factor γm for each terminal node is derived, and the boosted model Fm(x) is defined:
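The formulas referenced on the two preceding slides were embedded as images; a standard reconstruction of the gradient-boosting steps (the usual formulation, assumed to match the slides' intent) is:

```latex
\begin{align*}
F_0(x) &= \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma) \\[4pt]
r_{im} &= -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}}
   \qquad \text{(pseudo-residuals on which } h_m(x) \text{ is fit)} \\[4pt]
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr) \\[4pt]
F_m(x) &= F_{m-1}(x) + \gamma_m\, h_m(x)
\end{align*}
```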
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 38
• The math behind the working of RFC is as follows:
• RFC does both row sampling and column sampling, with a decision tree as the base learner.
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 39
• As the number of base learners (k) increases, the variance decreases; when k is decreased, the variance increases, but the bias remains constant for the whole process (see the sketch below).
• It can be expressed as:
– Random forest = DT (base learner) + bagging (row sampling with replacement) + feature bagging (column sampling) + aggregation (mean/median, majority vote)
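A short supporting sketch of the variance claim, under the standard textbook assumption (not stated on the slide) that each tree has variance σ² and pairwise correlation ρ:

```latex
\operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} h_i(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{k}\,\sigma^2,
\qquad
\mathbb{E}\!\left[\frac{1}{k}\sum_{i=1}^{k} h_i(x)\right] = \mathbb{E}\bigl[h_1(x)\bigr]
```

So the ensemble's variance shrinks towards ρσ² as k grows, while the expectation, and hence the bias, is unchanged.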
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 40
• The math behind RFC is:
• Assume a training set of examples
– D = {(x1, y1), …, (xn, yn)}
• drawn randomly from a (possibly unknown) probability distribution
– (xi, yi) ~ (X, Y)
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 41
• Goal: to build a classifier which predicts y from x based on the data set of examples D.
• Given: an ensemble of (possibly weak) classifiers
– h = { h1(x), …, hk(x) }.
• If each hk(x) is a decision tree, then the ensemble is a random forest. We define the parameters of the decision tree for classifier hk(x) to be
– Qk = (qk1, qk2, …, qkp)
• (these parameters include the structure of the tree, which variables are split in which nodes, etc.)
Mathematical model
25-04-2023 Classification of Imbalanced Dataset 42
• We sometimes write
hk(x) = h(x |qk)
• Thus decision tree k leads to a classifier
hk(x) = h(x |qk)
SRS
• SRS Hyperlink
System Requirements:
25-04-2023 Classification of Imbalanced Dataset 43
Hardware:
• Pentium IV
• 40 GB HDD
• 4 GB RAM
Software:
• Windows 10
• Python
• PyCharm
Non-Functional Requirements
• Performance
• Extendibility
• Reusability
• Reliability
25-04-2023 Classification of Imbalanced Dataset 44
UML Diagrams / DFD Diagrams
25-04-2023 Classification of Imbalanced Dataset 45
• Data Flow Diagram:
DFD Level 0
25-04-2023
DFD Level 1
Classification of Imbalanced Dataset 46
25-04-2023
DFD Level 2
Classification of Imbalanced Dataset 47
25-04-2023
Activity Diagram
Classification of Imbalanced Dataset 48
25-04-2023
Class Diagram
Classification of Imbalanced Dataset 49
25-04-2023
Sequence Diagram
Classification of Imbalanced Dataset 50
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 51
• Algorithms used are as follows:
– Random Forest Classifier
– eXtreme Gradient Boosting (XGBoost)
– SMOTE
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 52
• Working of RFC
– We can understand the working of the Random Forest algorithm with the help of the following steps −
– Step 1 − First, start with the selection of random
samples from a given dataset.
– Step 2 − Next, this algorithm will construct a
decision tree for every sample. Then it will get the
prediction result from every decision tree.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 53
• Working of RFC
– Step 3 − In this step, voting will be performed for
every predicted result.
– Step 4 − At last, select the most voted prediction
result as the final prediction result.
The working of RFC is illustrated in Fig. 1.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 54
Fig. 1 Working of Random Forest Classification
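A minimal runnable sketch of the four steps above using scikit-learn (toy data and parameter values are assumptions, not the project's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

rfc = RandomForestClassifier(
    n_estimators=100,   # Steps 1-2: 100 bootstrap samples, one decision tree each
    bootstrap=True,     # row sampling with replacement
    random_state=42)
rfc.fit(X_tr, y_tr)

# Steps 3-4: each tree votes; the majority vote is the final prediction.
print(classification_report(y_te, rfc.predict(X_te)))
```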
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 55
• Working of XGBoost
– XGBoost is an implementation of Gradient
Boosted decision trees.
– It is a software library that was designed primarily to improve speed and model performance.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 56
• Working of XGBoost
– In this algorithm, decision trees are created in
sequential form.
– Weights play an important role in XGBoost. Weights are assigned to all the independent variables, which are then fed into the decision tree that predicts results.
– The weight of variables predicted wrongly by the tree is increased, and these variables are then fed to the second decision tree.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 57
• Working of XGBoost
– These individual classifiers/predictors are then ensembled to give a strong and more precise model.
The working of XGBoost is illustrated in Fig. 2.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 58
Fig. 2 Working of XGBoost
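A minimal runnable sketch of sequentially boosted trees with the xgboost package (toy data; scale_pos_weight is one common knob for imbalanced classes and is an assumption, not necessarily the project's setting):

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

xgb = XGBClassifier(
    n_estimators=200,        # trees are added sequentially
    learning_rate=0.1,
    scale_pos_weight=9,      # rough majority/minority ratio to re-weight errors
    eval_metric="logloss")
xgb.fit(X_tr, y_tr)

print("F1 score:", round(f1_score(y_te, xgb.predict(X_te)), 3))
```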
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 59
• Working of SMOTE
– Step 1: Set the minority class set A; for each x ∈ A, the k-nearest neighbours of x are obtained by calculating the Euclidean distance between x and every other sample in set A.
– Step 2: The sampling rate N is set according to the imbalance proportion. For each x ∈ A, N examples (i.e. x1, x2, …, xN) are randomly selected from its k nearest neighbours, and they construct the set A1.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 60
• Working of SMOTE
– Step 3: For each example xk ∈ A1 (k = 1, 2, 3, …, N), the following formula is used to generate a new example:
x’ = x + rand(0,1) * |x − xk|
where rand(0,1) represents a random number between 0 and 1.
The working of SMOTE is illustrated in Fig. 3.
Project Implementation : Algorithms
25-04-2023 Classification of Imbalanced Dataset 61
Fig. 3: Working of SMOTE
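A minimal NumPy sketch of the three SMOTE steps above (toy minority samples; the |x − xk| term on the slide is interpreted here as the difference vector between x and its neighbour):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
A = rng.normal(size=(20, 2))       # minority-class set A (toy 2-D samples)
k, N = 5, 3                        # k nearest neighbours, N new examples per x

nn = NearestNeighbors(n_neighbors=k + 1).fit(A)   # +1 because x is its own neighbour
_, idx = nn.kneighbors(A)                         # Step 1: k-NN by Euclidean distance

synthetic = []
for i, x in enumerate(A):
    for _ in range(N):                            # Step 2: N neighbours per sample
        xk = A[rng.choice(idx[i][1:])]            # a random neighbour of x
        synthetic.append(x + rng.random() * (xk - x))   # Step 3: x' = x + rand(0,1)*(xk - x)

print(np.array(synthetic).shape)                  # (60, 2) synthetic minority examples
```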
Project Implementation : Testing
25-04-2023 Classification of Imbalanced Dataset 62
Test case ID 1: Execute the implemented system in Windows 10
Test Case Summary To verify if the system is able to execute on Win 10.
Prerequisites User must have python with packages installed.
Test Procedure
1. Open the System
2. Run the System
3. Execute the Algorithm
Expected Result All the algorithms should be successfully executed.
Actual Result All Algorithms are executed Successfully.
Status Pass
Project Implementation : Testing
25-04-2023 Classification of Imbalanced Dataset 63
Test case ID 2: Execute the implemented system in Macintosh
Test Case Summary To verify if the system is able to execute on Mac.
Prerequisites User must have python with packages installed.
Test Procedure
1. Open the System
2. Run the System
3. Execute the Algorithm
Expected Result All the algorithms should be successfully executed.
Actual Result All Algorithms are executed Successfully.
Status Pass
Project Implementation : Testing
25-04-2023 Classification of Imbalanced Dataset 64
Test case ID 3: Do the components of the UI work as expected
Test Case Summary To verify if components of UI are working in every
system.
Prerequisites User must have python with packages installed.
User must have Ubuntu
Test Procedure
1. Open the System
2. Run the System
3. Execute the Algorithm
Expected Result All the components of the UI should perform their tasks as assigned.
Actual Result All components of the UI perform their tasks as assigned.
Status Pass
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 65
• We have evaluated some of the Classifiers
along with the proposed algorithm.
• The Classifiers we used are:
– Random Forest Classifier
– eXtreme Gradient Boosting (XGBoost)
– Bagging Classifier
– SMOTE + XGBoost + RFC
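For reference, the numbers on the following slides correspond to standard scikit-learn metrics; a minimal sketch (assuming arrays y_true, y_pred and probability scores y_score from any of the classifiers above) is:

```python
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score)

def report(y_true, y_pred, y_score):
    """Print the four metrics reported on the results slides."""
    print("Recall:    %.3f" % recall_score(y_true, y_pred))
    print("Precision: %.3f" % precision_score(y_true, y_pred))
    print("F1 Score:  %.3f" % f1_score(y_true, y_pred))
    print("Average precision-recall: %.3f" % average_precision_score(y_true, y_score))
```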
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 66
• For RFC we got performance as:
– Recall: 0.788
– Precision: 0.947
– F1 Score: 0.860
– Average Precision recall: 0.838
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 67
• After tuning RFC we got performance as:
– Recall: 0.788
– Precision: 0.947
– F1 Score: 0.860
– Average Precision recall: 0.854
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 68
• Using XGB we got performance as:
– Recall: 0.805
– Precision: 0.948
– F1 Score: 0.871
– Average Precision recall: 0.868
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 69
• After tuning XGB we got performance as:
– Recall: 0.000
– Precision: 0.nan
– F1 Score: 0.nan
– Average Precision recall: 0.669
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 70
• Using XGB+Smote we got performance as:
– Recall: 1.000
– Precision: 0.999
– F1 Score: 1.000
– Average Precision recall: 1.000
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 71
• Using Smote+RFC we got performance as:
– Recall: 1.000
– Precision: 1.000
– F1 Score: 1.000
– Average Precision recall: 1.000
Project Implementation : Results
25-04-2023 Classification of Imbalanced Dataset 72
• From the evaluation of each algorithm's performance, we came to the conclusion that the SMOTE-resampled data along with XGB and RFC gave better results than the other algorithms.
Conclusion
• Classification is one of the most common problems in machine learning. One of the common issues found in datasets during classification is the class imbalance problem. Various researchers have put their best efforts into solving this issue. The literature review has yielded some useful approaches to face this problem. The best solution to face this problem is to make use of ensemble methods. Ensemble methods combine multiple machine learning techniques into one model to decrease variance (bagging) or bias (boosting), or to improve predictions (stacking). Other methods also exist but are not capable of performing best in the classification of imbalanced datasets. In the future, the proposed work will be implemented using Python programming, with a target performance of more than 95% so that the ensemble classifier can be used reliably.
25-04-2023 Classification of Imbalanced Dataset 73
References
25-04-2023 Classification of Imbalanced Dataset 74
[1] Abdellatif, Safa & Ben Hassine, Mohamed Ali & Ben Yahia, Sadok & Bouzeghoub, Amel. (2018).
“ARCID: A New Approach to Deal with Imbalanced Datasets Classification.”, SOFSEM 2018: Theory
and Practice of Computer Science, pp.569-580 doi:10.1007/978-3-319-73117-9_40.
[2] Jedrzejowicz, Joanna & Neumann, Jakub & Synowczyk, Piotr & Zakrzewska, Magdalena.
(2017). “Applying Map-Reduce to imbalanced data classification”. 2017 IEEE International
Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 29-33.
doi:10.1109/INISTA.2017.8001127.
[3] T. Lakshmi and S. Prasad, “A study on classifying imbalanced datasets,” IEEE Conference
Publication, 2014.
[4] Yongqing, Z. & Min, & Zhu, Danling, & Zhang, M. & Gang, and M. Daichuan, “Improved
SMOTEBagging and its application in imbalanced data classification,” IEEE Conference Anthology,
2013.
[5] M. R. Longadge, M. S. S. Dongre, D. L. Malik et al., “Class Imbalance Problem in Data Mining:
Review,” International Journal of Computer Science and Network (IJCSN), vol. 2, no. 1, 2013.
[Online]. Available: www.ijcsn.org
[6] S. Hukerikar, A. Tumma, A. Nikam, and V. Attar, “Skew Boost: An algorithm for classifying
imbalanced datasets,” in 2011 2nd International Conference on Computer and Communication
Technology (ICCCT-2011), 2011, pp. 46–52.
[7] Seiffert, Chris & Khoshgoftaar, Taghi & Van Hulse, Jason & Napolitano, Amri. (2010).
“RUSBoost: A Hybrid Approach to Alleviating Class Imbalance,” IEEE Transactions on
Systems, Man, and Cybernetics, Part A: Systems and Humans, 2010.
[8] S. L. Phung, “Learning pattern classification tasks with imbalanced datasets,” in
Pattern Recognition. InTech, 2009.
[9] Imam, Tasadduq & Ting, Kai & Kamruzzaman, Joarder. “Z-SVM: An SVM for
improved classification of imbalanced data,” Advances in Artificial
Intelligence, vol. 4304, 2006.
[10] Chawla, Nitesh & Japkowicz, Nathalie & Kołcz, Aleksander. (2004). “Editorial:
Special Issue on Learning from Imbalanced Data Sets,” SIGKDD
Explorations, 6, 1-6. doi:10.1145/1007730.1007733.
[11]Guo, Hongyu & Viktor, Herna. (2004). “Learning from imbalanced data sets with
boosting and data generation: The DataBoost-IM approach.” SIGKDD
Explorations. 6. 30-39. 10.1145/1007730.1007736.
[12]N. Chawla, “Synthetic Minority Oversampling Technique,” Journal of Artificial
Intelligence Research, vol. 16, pp. 321–357, 2002.
25-04-2023 Classification of Imbalanced Dataset 75
Thank You!!!
  • 1. Department of Computer Engineering, Sinhgad Institute of Technology and Science, Pune 2019-20 Project Stage II SPPU Examination Name. PRNNO. Venkatesh T. Mainalli 71622707L Gaurav J. Bade 71832932F Akshay R. Dhole 71833043K Shivraj Deshmukh 71833027H Name of the guide :- Prof. G. S. Navale Classification of Imbalanced Dataset
  • 2. Agenda • Introduction of Group Members • Introduction of Project Topic • Motivation • Literature Survey • Gap Analysis • Problem Statement • Objectives • Activities per Objective • Feasibility Study • Schedule • System Architecture • Mathematical Model • SRS • UML/DFD • Algorithms • Testing • Results • Conclusion • Future Scope • References 25-04-2023 Classification of Imbalanced Dataset 2
  • 3. Introduction of Group Member 1 25-04-2023 Classification of Imbalanced Dataset 3 Bio – Data : Personal Information Roll No B17COA77 Name of Student Venkatesh Tukkappa Mainalli Date of Birth 25-03-1998 Address House No Bldg No Anil Villa, Kolhewadi, Sinhgad Road Lane, Appt,Society City District Pune Tal Dist State Pin Maharashtra- 411024 Email Id venkateshmainalli55555@gmail.com Mobile No. 7798311297
  • 4. Introduction of Group Member 1 25-04-2023 Classification of Imbalanced Dataset 4 Class Sem Year of Adm % Marks Remark F. E. I 2016-2017 62.0 Pass II 64.0 Pass S. E. I 2017-2018 54.0 Pass II 58.0 Pass T. E. I 2018-2019 60.0 Pass II 68.0 Pass B. E. I 74.16 Pass Academic Track Record
  • 5. Introduction of Group Member 2 25-04-2023 Classification of Imbalanced Dataset 5 Bio – Data : Personal Information Roll No B17COA14 Name of Student Shivraj Deshmukh Date of Birth 17-01-1999 Address House No Bldg No Room no. 201 Lane, Appt,Society Balaji Hostel, Narhe City District Pune Tal Dist State Pin Maharashtra- 411041 Email Id deshmukh.shivraj95@gmail.com Mobile No. 7972326595 Recent Photo 4.05cm ht
  • 6. Introduction of Group Member 2 25-04-2023 Classification of Imbalanced Dataset 6 Class Sem Year of Adm % Marks Remark S. E. I 2017-2018 74.16 Pass II 70.46 Pass T. E. I 2018-2019 69.72 Pass II 64.24 Pass B. E. I 70.16 Pass Academic Track Record
  • 7. Introduction of Group Member 3 25-04-2023 Classification of Imbalanced Dataset 7 Bio – Data : Personal Information Roll No B17COA17 Name of Student Akshay Ramchandra Dhole Date of Birth 21-09-1998 Address House No Bldg No Flat No.1, Lane, Appt,Society Shantiyuga Apartment, Manaji Nagar, Narhe City District Pune Tal Dist State Pin Maharashtra - 411041 Email Id akshaydholead10@gmail.com Mobile No. 7768902644 Recent Photo 4.05cm ht
  • 8. Introduction of Group Member 3 25-04-2023 Classification of Imbalanced Dataset 8 Class Sem Year of Adm % Marks Remark S. E. I 2017-2018 72.68 Pass II 67.50 Pass T. E. I 2018-2019 61.06 Pass II 66.53 Pass B. E. I 78.23 Pass Academic Track Record
  • 9. Introduction of Group Member 4 25-04-2023 Classification of Imbalanced Dataset 9 Bio – Data : Personal Information Roll No B17COA04 Name of Student Gaurav Jayant Bade Date of Birth 03-01-1998 Address House No Bldg No Flat No.420, E-Wing Lane, Appt,Society Samarth Vishwa Society, Khandala City District Khandala, Satara Tal Dist State Pin Maharashtra- 412802 Email Id Gauravbade11@gmail.com Mobile No. 9960503007
  • 10. Introduction of Group Member 4 25-04-2023 Classification of Imbalanced Dataset 10 Class Sem Year of Adm % Marks Remark S. E. I 2017-2018 69.72 Pass II 69.12 Pass T. E. I 2018-2019 64.46 Pass II 64.09 Pass B. E. I 2019-2020 73.19 Pass Academic Track Record
  • 11. Introduction • What is Machine Learning ?  A branch of artificial intelligence, concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.  As intelligence requires knowledge, it is necessary for the computers to acquire knowledge. 25-04-2023 Classification of Imbalanced Dataset 11
  • 12. • What is Imbalanced Dataset ? Imbalance data distribution is an important part of machine learning workflow.  An imbalanced dataset means instances of one of the two classes is higher than the other, in another way, the number of observations is not the same for all the classes in a classification dataset. This problem is faced not only in the binary class data but also in the multi-class data. 25-04-2023 Classification of Imbalanced Dataset 12
  • 13. Motivation  Many real-world data sets have an imbalanced distribution of the instances. Learning from such data sets results in the classifier being biased towards the majority class, thereby tending to misclassify the minority class samples For example – fraud detection, cancer detection, manufacturing defects, online ads conversion etc. 25-04-2023 Classification of Imbalanced Dataset 13
  • 14. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 1 “ARCID: A New Approach to Deal with Imbalanced Datasets Classification” Year:- 2018 Authors:- Abdellatif, Safa, et.al New Approach to deal with Classification of Imbalanced Datasets Pros: ARCID performs slightly better than Fitcare and RIPPER. However, it outperforms all standard rule-based and non- rule based approaches by many ranks Cons: Managing the overwhelming number of CARs generated from real- life datasets and Removing redundant rules conveys the same information. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 14
  • 15. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 2 Paper Title:-Applying Map- Reduce to imbalanced data classification Year:- 2017 Authors:-Jedrzejowicz, Joanna , et.al Using Map-Reduce to deal with Imbalanced data classification Pros: 1.EDBC uses the three methods to classify the imbalance data in proper in manner. 2.SplitBal works parallelly so the processing time is less as compare to EDBC. Cons: 1.EDBC does not works parallelly so the processing time is more. 2. In SplitBal dataset is divided into bins so working on bin is lengthy. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 15
  • 16. Sr. Paper Title Paper Theme/Idea Pros / Cons 3 Paper Title:-A study on classifying imbalanced datasets Year:- 2014 Authors:-T. Lakshmi and S. Prasad Study of classifying imbalanced dataset. Pros: 1.The combination of SMOTE + BAG + RF effectively improves the performance of classifying of imbalanced dataset. 2.The use of ensemble and sampling techniques gives drastic improvement in performance in terms of AUROC Cons: 1.Execution time was reduced by using the under- sampling, but the accuracy of classifying dataset was distracted. 2.There was loss of necessary and useful information due to under-sampling. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 16
  • 17. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 4 Paper Title:-Improved SMOTEBagging and its application in imbalanced data classification Year:- 2013 Authors:-Yongqing, Z. & Min, & Zhu, Danling, & Zhang, M. & Gang, and M. Daichuan An approach to deal imabalanced data using oversampling and bagging approach. Pros: 1.SMOTE Method generates synthetic minority examples to over-sample the minority class. 2.Bagging Method improves the performance of the overall system. 3.Support Vector Machine one of the most effective machine learning algorithms for many complex binary classification problems. 4.SMOTE Bagging Method overcomes over-fitting problems of traditional algorithm. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 17
  • 18. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 4 Paper Title:-Improved SMOTEBagging and its application in imbalanced data classification Year:- 2013 Authors:-Yongqing, Z. & Min, & Zhu, Danling, & Zhang, M. & Gang, and M. Daichuan An approach to deal imabalanced data using oversampling and bagging approach. Cons: 1.SMOTE Method it only work on minority class not on majority class. 2.Bagging Method combing of decision sometimes over-head Literature Survey 25-04-2023 Classification of Imbalanced Dataset 18
  • 19. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 5 Paper Title:-Class Imbalance Problem in Data Mining: Review Year:- 2013 Authors:-M. R. Longadge, M. S. S. Dongre, D. L. Malik et al Using Data Mining to deal with Imbalanced data classification Pros: Hybrid approach provides better solution for class imbalance. Suggests several algorithm and techniques that solve the issue of imbalance distribution of sample. Cons: This method improves the classification accuracy of minority class but, because of infinite data streams and continuous concept drifting, this method cannot suitable for skewed data stream classification. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 19
  • 20. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 6 Paper Title:-Skew Boost: An algorithm for classifying imbalanced datasets Year:- 2011 Authors:-S. Hukerikar, A. Tumma, A. Nikam, and V. Attar, An approach of handling imbalanced dataset using boosting tech Pros: SkewBoost performs better than DataBoost-IM, Inverse Random UnderSampling (IRUS), EasyEnsemble and Balance Cascade and Cost Sensitive Boosting (CSB2). Cons: In case of some datasets, SkewBoost remains behind of some existing algorithms in metrics related to the majority class accuracy Literature Survey 25-04-2023 Classification of Imbalanced Dataset 20
  • 21. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 7 Paper Title:-RUSBoost: A Hybrid Approach to Alleviating Class Imbalance Year:- 2010 Authors:-Seiffert, Chris & Khoshgoftaar, Taghi & Van Hulse, Jason & Napolitano, Amri. Hybrid approach of handling imbalanced dataset using boosting tech. Pros: RUSBoost presents an easier, faster, and fewer advanced alternative to SMOTEBoost for learning from imbalanced data. Cons: The complexity and model training time get increased. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 21
• 22. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 8 Paper Title:- Learning pattern classification tasks with imbalanced datasets Year:- 2009 Authors:- S. L. Phung Recognizing learning patterns in the classification of imbalanced data. Pros: The approach proposed by the author can effectively improve the classification accuracy of minority classes while maintaining the overall classification performance. Cons: The combination of supervised and unsupervised algorithms tends to have higher values of G-mean and F-measure when compared with different training algorithms. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 22
• 23. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 9 Paper Title:- Z-SVM: An SVM for improved classification of imbalanced data Year:- 2006 Authors:- Imam, Tasadduq & Ting, Kai & Kamruzzaman, Joarder An SVM approach to handling imbalanced datasets. Pros: Z-SVM aims at finding the discriminating hyperplane that maintains an optimal margin from the boundary examples called support vectors. Cons: It is not clear how a particular parameter value affects the SVM hyperplane and the generalization capability of the learned model. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 23
• 24. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 10 Paper Title:- Editorial: Special Issue on Learning from Imbalanced Data Sets Year:- 2004 Authors:- Chawla, Nitesh & Japkowicz, Nathalie & Kołcz, Aleksander An overview of learning from imbalanced datasets. Pros: The use of cluster-based oversampling counters the effect of class imbalance and small disjuncts. Cons: Feature selection can often be too expensive to apply for handling imbalanced data. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 24
• 25. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 11 Paper Title:- Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach Year:- 2004 Authors:- Guo, Hongyu & Viktor, Herna Handling imbalanced data using a boosting approach. Pros: DataBoost-IM produces a series of high-quality classifiers that are better able to predict examples on which the previous classifier performed poorly. Cons: Execution becomes complicated because the approach is tedious. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 25
• 26. Sr. No. Paper Title Paper Theme/Idea Pros / Cons 12 Paper Title:- Synthetic Minority Oversampling Technique Year:- 2002 Authors:- N. Chawla An oversampling approach to deal with imbalanced datasets. Pros: For almost all the ROC curves, the SMOTE approach dominates; adhering to the definition of the ROC convex hull, most of the potentially optimal classifiers are the ones generated with SMOTE. Cons: A minority class sample could have a majority class sample as its nearest neighbor rather than a minority class sample; this crowding is likely to contribute to the redrawing of the decision surfaces in favor of the minority class. Literature Survey 25-04-2023 Classification of Imbalanced Dataset 26
• 27. Gap Analysis 25-04-2023 Classification of Imbalanced Dataset 27 • The study of these papers gave an idea of how imbalanced data can lead to misleading results. • Some papers also gave ideas about how to deal with imbalanced data.
• 28. Problem Statement • To classify an imbalanced dataset and improve the prediction accuracy of the minority class in a given structured dataset using diversified ensemble and clustering methods. 25-04-2023 Classification of Imbalanced Dataset 28
• 29. Objectives 1. To develop a balanced dataset from an imbalanced multiclass dataset by applying sampling techniques. 2. To improve the classification precision and achieve good balance among the class samples. 3. To develop an efficient method for solving the class imbalance problem in the classification process. 25-04-2023 Classification of Imbalanced Dataset 29
• 30. 25-04-2023 Classification of Imbalanced Dataset 30 Activities per Objective • Develop a balanced dataset from an imbalanced multiclass dataset by applying sampling techniques. • Activities: 1. Input the imbalanced dataset. 2. Apply the SMOTE algorithm for oversampling. 3. Obtain the resulting balanced dataset.
• 31. Activities per Objective 25-04-2023 Classification of Imbalanced Dataset 31 • To improve the classification precision and achieve good balance among the class samples. • Activities: 1. Test with a single classification algorithm, then proceed with the combined algorithm. 2. Perform proper cleansing and pre-processing before testing. 3. Use the learning-curve approach to measure the performance of the algorithm.
• 32. Activities per Objective 25-04-2023 • To develop an efficient method for solving the class imbalance problem in the classification process. • Activities: 1. Evaluate classifiers in imbalanced domains. 2. Use statistical measures for evaluating the classification. Classification of Imbalanced Dataset 32
• 33. Feasibility Study 25-04-2023 The following points were analyzed to check the feasibility of classifying imbalanced datasets: 1. Dataset for training and testing 2. SMOTE for synthetic oversampling 3. Random Forest for ensemble classification 4. Tkinter for creating the GUI 5. XGBoost for performing boosting Classification of Imbalanced Dataset 33
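As a quick feasibility check, the stack listed above can be verified simply by importing it. A minimal sketch follows, assuming the usual PyPI distributions scikit-learn, imbalanced-learn and xgboost (Tkinter ships with CPython); this is not the project's actual setup script.

    # Minimal feasibility check: confirm the assumed packages are importable.
    # Assumes: pip install scikit-learn imbalanced-learn xgboost
    import importlib

    for module in ("sklearn", "imblearn", "xgboost", "tkinter"):
        try:
            importlib.import_module(module)
            print(f"{module}: available")
        except ImportError as err:
            print(f"{module}: missing ({err})")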
  • 34. Schedule 25-04-2023 Classification of Imbalanced Dataset 34 • - To develop a blockchain manager and integrate it with the web portal. • 2.1 Building a blockchain manager interface that can support dynamic addition of blockchains. • 2.2 Authorizing the blockchain dynamically to create a private blockchain network. • 2.3 Creating methods to support image checking and uploading to the blockchain via blockchain manager. • 2.4 Creating method for restoration after the blockchain failure. • 2.5 Calling the methods from web interface.
• 36. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 36 • The math behind XGBoost is as follows; the following steps are involved in gradient boosting: • F0(x), with which we initialize the boosting algorithm. • The gradient of the loss function is then computed iteratively.
• 37. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 37 • Each hm(x) is fit on the gradient obtained at each step. • The multiplicative factor γm for each terminal node is derived, and the boosted model Fm(x) is defined.
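The formulas on these two slides appear only as images in the original deck; a standard gradient-boosting formulation consistent with the description above (L is the loss, m indexes boosting rounds, h_m is the weak learner fitted at round m) would be:

    % Standard gradient boosting steps (a sketch; not reproduced from the slide images)
    \begin{aligned}
    F_0(x) &= \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
        && \text{(initialisation)} \\
    r_{im} &= -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}
        && \text{(negative gradient at each step)} \\
    \gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr)
        && \text{(multiplicative factor)} \\
    F_m(x) &= F_{m-1}(x) + \gamma_m\, h_m(x)
        && \text{(boosted model)}
    \end{aligned}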
• 38. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 38 • The math behind the working of RFC is as follows: • RFC does both row sampling and column sampling, with a decision tree as the base learner.
• 39. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 39 • As the number of base learners (k) increases, the variance decreases; when k decreases, the variance increases, while the bias remains roughly constant throughout. • It can be expressed as: – Random forest = DT (base learner) + bagging (row sampling with replacement) + feature bagging (column sampling) + aggregation (mean/median or majority vote)
• 40. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 40 • The math behind RFC is: • Assume a training set – D = {(x1,y1),…,(xn,yn)} • drawn randomly from a (possibly unknown) probability distribution – (xi,yi) ~ (X,Y)
• 41. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 41 • Goal: to build a classifier that predicts y from x based on the dataset of examples D. • Given: an ensemble of (possibly weak) classifiers – h = {h1(x),…,hk(x)}. • If each hk(x) is a decision tree, then the ensemble is a random forest. We define the parameters of the decision tree for classifier hk(x) to be – Qk = (qk1,qk2,…,qkp) • (these parameters include the structure of the tree, which variables are split at which nodes, etc.)
  • 42. Mathematical model 25-04-2023 Classification of Imbalanced Dataset 42 • We sometimes write hk(x) = h(x |qk) • Thus decision tree k leads to a classifier hk(x) = h(x |qk)
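To make the variance argument on slide 39 concrete, the standard bagging results below (quoted from the ensemble-learning literature, not from the slide images) give the majority-vote rule over K trees and the variance of an average of K base learners with pairwise correlation ρ and per-tree variance σ²:

    % Majority vote over K trees and variance of the bagged average (standard result)
    H(x) = \arg\max_{y} \sum_{k=1}^{K} \mathbf{1}\bigl[\, h(x \mid Q_k) = y \,\bigr]
    \qquad
    \operatorname{Var}\!\left[ \frac{1}{K} \sum_{k=1}^{K} h(x \mid Q_k) \right]
      = \rho\,\sigma^2 + \frac{1-\rho}{K}\,\sigma^2

As K grows the second term vanishes, which is why adding trees reduces variance while the bias of each individual tree is left unchanged.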
• 43. SRS • SRS Hyperlink • System Requirements: 25-04-2023 Classification of Imbalanced Dataset 43 Hardware: • Pentium IV • 40 GB HDD • 4 GB RAM Software: • Windows 10 • Python • PyCharm
• 44. Non-Functional Requirements • Performance • Extendibility • Reusability • Reliability 25-04-2023 Classification of Imbalanced Dataset 44
  • 45. UML Diagrams / DFD Diagrams 25-04-2023 Classification of Imbalanced Dataset 45 • Data Flow Diagram: DFD Level 0
  • 46. 25-04-2023 DFD Level 1 Classification of Imbalanced Dataset 46
  • 47. 25-04-2023 DFD Level 2 Classification of Imbalanced Dataset 47
• 51. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 51 • The algorithms used are as follows: – Random Forest Classifier – eXtreme Gradient Boosting (XGBoost) – SMOTE
• 52. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 52 • Working of RFC – We can understand the working of the Random Forest algorithm with the help of the following steps: – Step 1 − First, start with the selection of random samples from a given dataset. – Step 2 − Next, the algorithm will construct a decision tree for every sample and get the prediction result from every decision tree.
• 53. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 53 • Working of RFC – Step 3 − In this step, voting is performed over the predicted results. – Step 4 − At last, the most voted prediction result is selected as the final prediction result. The working of RFC is as illustrated in Fig. 1.
  • 54. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 54 Fig. 1 Working of Random Forest Classification
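A minimal sketch of Steps 1–4 using scikit-learn is given below; the synthetic toy dataset and hyperparameter values are illustrative assumptions, not the project's actual data or configuration.

    # Sketch: Random Forest classification with scikit-learn (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Imbalanced toy dataset: roughly 95% majority class, 5% minority class.
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.95, 0.05], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=42)

    # Each tree is trained on a bootstrap sample (row sampling) and a random
    # subset of features at each split (column sampling); predictions are
    # combined by majority vote, as described in Steps 1-4 above.
    rfc = RandomForestClassifier(n_estimators=100, random_state=42)
    rfc.fit(X_train, y_train)
    print(classification_report(y_test, rfc.predict(X_test)))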
• 55. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 55 • Working of XGBoost – XGBoost is an implementation of gradient-boosted decision trees. – It is a software library designed primarily to improve speed and model performance.
• 56. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 56 • Working of XGBoost – In this algorithm, decision trees are created sequentially. – Weights play an important role in XGBoost. Weights are assigned to all the training instances, which are then fed into the decision tree that predicts results. – The weights of instances predicted wrongly by the tree are increased, and these instances are then fed to the second decision tree.
• 57. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 57 • Working of XGBoost – These individual classifiers/predictors are then ensembled to give a stronger and more precise model. The working of XGBoost is as illustrated in Fig. 2.
  • 58. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 58 Fig. 2 Working of XGBoost
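A comparable sketch with the xgboost library follows; again the toy dataset and parameter values are assumptions chosen for illustration rather than the project's settings.

    # Sketch: XGBoost classification on an imbalanced toy dataset (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.95, 0.05], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=42)

    # Trees are built sequentially; each new tree focuses on the examples the
    # previous trees predicted poorly (via the gradient of the loss).
    # scale_pos_weight re-weights the minority class (ratio of negatives to positives).
    xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4,
                        scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),
                        eval_metric="logloss", random_state=42)
    xgb.fit(X_train, y_train)
    print(classification_report(y_test, xgb.predict(X_test)))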
• 59. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 59 • Working of SMOTE – Step 1: Set the minority class set A; for each x ∈ A, the k-nearest neighbours of x are obtained by calculating the Euclidean distance between x and every other sample in set A. – Step 2: The sampling rate N is set according to the imbalance proportion. For each x ∈ A, N examples (i.e. x1, x2, x3, …, xN) are randomly selected from its k nearest neighbours, and they construct the set A1.
• 60. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 60 • Working of SMOTE – Step 3: For each example xk ∈ A1 (k = 1, 2, 3, …, N), the following formula is used to generate a new example: x' = x + rand(0,1) * |x − xk|, in which rand(0,1) represents a random number between 0 and 1. The working of SMOTE is as illustrated in Fig. 3.
• 61. Project Implementation : Algorithms 25-04-2023 Classification of Imbalanced Dataset 61 Fig. 3: Working of SMOTE
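A minimal sketch of Steps 1–3 is given below. This is an illustrative implementation, not the project's code (in practice imblearn.over_sampling.SMOTE provides the same behaviour); the interpolation uses the difference vector (xk − x), which is what the |x − xk| term in the slide formula denotes.

    # Sketch of the SMOTE steps described above (illustrative only).
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_sketch(minority, n_new, k=5, seed=0):
        """Generate n_new synthetic samples from the minority class set A."""
        rng = np.random.default_rng(seed)
        # Step 1: k-nearest neighbours within the minority class (Euclidean distance).
        nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
        _, idx = nn.kneighbors(minority)            # idx[:, 0] is the point itself
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(minority))         # pick a minority sample x
            j = rng.choice(idx[i, 1:])              # Step 2: pick one of its k neighbours xk
            x, xk = minority[i], minority[j]
            # Step 3: x' = x + rand(0,1) * (xk - x), i.e. interpolate between x and xk.
            synthetic.append(x + rng.random() * (xk - x))
        return np.asarray(synthetic)

    minority = np.random.default_rng(1).normal(size=(30, 2))   # toy minority-class samples
    print(smote_sketch(minority, n_new=60).shape)              # -> (60, 2)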
• 62. Project Implementation : Testing 25-04-2023 Classification of Imbalanced Dataset 62 Test case ID 1: Execute the implemented system on Windows 10 Test Case Summary To verify that the system is able to execute on Windows 10. Prerequisites User must have Python with the required packages installed. Test Procedure 1. Open the System 2. Run the System 3. Execute the Algorithm Expected Result All the algorithms should be executed successfully. Actual Result All algorithms are executed successfully. Status Pass
• 63. Project Implementation : Testing 25-04-2023 Classification of Imbalanced Dataset 63 Test case ID 2: Execute the implemented system on Macintosh Test Case Summary To verify that the system is able to execute on Mac. Prerequisites User must have Python with the required packages installed. Test Procedure 1. Open the System 2. Run the System 3. Execute the Algorithm Expected Result All the algorithms should be executed successfully. Actual Result All algorithms are executed successfully. Status Pass
• 64. Project Implementation : Testing 25-04-2023 Classification of Imbalanced Dataset 64 Test case ID 3: Do the components of the UI work as expected Test Case Summary To verify that the components of the UI work on every system. Prerequisites User must have Python with the required packages installed. User must have Ubuntu. Test Procedure 1. Open the System 2. Run the System 3. Execute the Algorithm Expected Result All the components of the UI should perform their tasks as assigned. Actual Result All components of the UI perform their tasks as assigned. Status Pass
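The manual test cases above could also be automated. Below is a hedged sketch using pytest, where train_and_evaluate is a hypothetical stand-in for the project's real entry point (not its actual API):

    # test_smoke.py -- sketch of an automated smoke test (run with: pytest test_smoke.py)
    from sklearn.datasets import make_classification

    def train_and_evaluate(X, y):
        # Hypothetical stand-in for the project's training routine.
        from imblearn.over_sampling import SMOTE
        from sklearn.ensemble import RandomForestClassifier
        X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
        return RandomForestClassifier(n_estimators=10, random_state=0).fit(X_res, y_res)

    def test_pipeline_runs_end_to_end():
        # Mirrors test cases 1-3: the same code path should execute on
        # Windows, macOS or Ubuntu as long as the Python packages are installed.
        X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
        model = train_and_evaluate(X, y)
        assert model.predict(X).shape == (500,)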
• 65. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 65 • We have evaluated several classifiers along with the proposed algorithm. • The classifiers we used are: – Random Forest Classifier – eXtreme Gradient Boosting (XGBoost) – Bagging Classifier – SMOTE + XGBoost + RFC
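The metrics reported on the following slides can be computed with scikit-learn; a sketch with placeholder predictions is shown below (the "average precision-recall" figure presumably corresponds to average_precision_score, i.e. the area under the precision-recall curve).

    # Sketch: computing the reported metrics with scikit-learn.
    # y_test, y_pred and y_score are placeholders, not the project's outputs.
    import numpy as np
    from sklearn.metrics import (precision_score, recall_score, f1_score,
                                 average_precision_score)

    y_test  = np.array([0, 0, 1, 1, 0, 1, 0, 1])                 # true labels
    y_pred  = np.array([0, 0, 1, 0, 0, 1, 0, 1])                 # predicted labels
    y_score = np.array([0.1, 0.2, 0.9, 0.4, 0.3, 0.8, 0.2, 0.7]) # predicted P(class 1)

    print("Recall               :", recall_score(y_test, y_pred))
    print("Precision            :", precision_score(y_test, y_pred))
    print("F1 Score             :", f1_score(y_test, y_pred))
    print("Average precision PR :", average_precision_score(y_test, y_score))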
• 66. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 66 • For RFC we got the following performance: – Recall: 0.788 – Precision: 0.947 – F1 Score: 0.860 – Average precision-recall score: 0.838
• 67. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 67 • After tuning RFC we got the following performance: – Recall: 0.788 – Precision: 0.947 – F1 Score: 0.860 – Average precision-recall score: 0.854
• 68. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 68 • Using XGB we got the following performance: – Recall: 0.805 – Precision: 0.948 – F1 Score: 0.871 – Average precision-recall score: 0.868
• 69. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 69 • After tuning XGB we got the following performance: – Recall: 0.000 – Precision: NaN – F1 Score: NaN – Average precision-recall score: 0.669
• 70. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 70 • Using SMOTE + XGB we got the following performance: – Recall: 1.000 – Precision: 0.999 – F1 Score: 1.000 – Average precision-recall score: 1.000
• 71. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 71 • Using SMOTE + RFC we got the following performance: – Recall: 1.000 – Precision: 1.000 – F1 Score: 1.000 – Average precision-recall score: 1.000
• 72. Project Implementation : Results 25-04-2023 Classification of Imbalanced Dataset 72 • From the evaluation of each algorithm's performance, we came to the conclusion that SMOTE-resampled data along with XGB and RFC gave better results than the other algorithms.
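A sketch of the best-performing combinations using an imbalanced-learn Pipeline is given below. Placing SMOTE inside the pipeline applies resampling only to the training folds during cross-validation, so scores are computed on original (non-synthetic) samples; the dataset and hyperparameters are illustrative assumptions, not the project's configuration.

    # Sketch: SMOTE + RFC and SMOTE + XGB evaluated with cross-validation.
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)

    for name, clf in [("SMOTE + RFC", RandomForestClassifier(random_state=42)),
                      ("SMOTE + XGB", XGBClassifier(eval_metric="logloss", random_state=42))]:
        # SMOTE runs inside the pipeline, so resampling never touches the test folds.
        pipe = Pipeline([("smote", SMOTE(random_state=42)), ("clf", clf)])
        scores = cross_val_score(pipe, X, y, scoring="f1", cv=5)
        print(f"{name}: mean F1 = {scores.mean():.3f}")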
• 73. Conclusion • Classification is one of the most common problems in machine learning. One of the common issues found in datasets during classification is the class imbalance issue. Various researchers have put their best efforts into solving this issue, and the literature review has yielded some useful approaches to face this problem. The best solution to this problem is to make use of ensemble methods. Ensemble methods combine multiple machine learning techniques into one model to decrease variance (bagging), decrease bias (boosting), or improve predictions (stacking). Other methods also exist but are not capable of performing as well in the classification of imbalanced datasets. The proposed work was implemented in Python with the aim that the target performance exceeds 95%, so that the ensemble classifier can be used reliably. 25-04-2023 Classification of Imbalanced Dataset 73
• 74. References 25-04-2023 Classification of Imbalanced Dataset 74 [1] Abdellatif, Safa & Ben Hassine, Mohamed Ali & Ben Yahia, Sadok & Bouzeghoub, Amel. (2018). "ARCID: A New Approach to Deal with Imbalanced Datasets Classification," SOFSEM 2018: Theory and Practice of Computer Science, pp. 569-580. doi: 10.1007/978-3-319-73117-9_40. [2] Jedrzejowicz, Joanna & Neumann, Jakub & Synowczyk, Piotr & Zakrzewska, Magdalena. (2017). "Applying Map-Reduce to imbalanced data classification," 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 29-33. doi: 10.1109/INISTA.2017.8001127. [3] T. Lakshmi and S. Prasad, "A study on classifying imbalanced datasets," IEEE Conference Publication, 2014. [4] Yongqing, Z. & Min, & Zhu, Danling, & Zhang, M. & Gang, and M. Daichuan, "Improved SMOTEBagging and its application in imbalanced data classification," IEEE Conference Anthology, 2013. [5] M. R. Longadge, M. S. S. Dongre, D. L. Malik et al., "Class Imbalance Problem in Data Mining: Review," International Journal of Computer Science and Network (IJCSN), vol. 2, no. 1, 2013. [Online]. Available: www.ijcsn.org [6] S. Hukerikar, A. Tumma, A. Nikam, and V. Attar, "Skew Boost: An algorithm for classifying imbalanced datasets," in 2011 2nd International Conference on Computer and Communication Technology (ICCCT-2011), 2011, pp. 46–52.
• 75. [7] Seiffert, Chris & Khoshgoftaar, Taghi & Van Hulse, Jason & Napolitano, Amri. (2010). "RUSBoost: A Hybrid Approach to Alleviating Class Imbalance," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 2010. [8] S. L. Phung, "Learning pattern classification tasks with imbalanced datasets," in Pattern Recognition, InTech, 2009. [9] Imam, Tasadduq & Ting, Kai & Kamruzzaman, Joarder, "Z-SVM: An SVM for improved classification of imbalanced data," Advances in Artificial Intelligence, vol. 4304, 2006. [10] Chawla, Nitesh & Japkowicz, Nathalie & Kołcz, Aleksander. (2004). "Editorial: Special Issue on Learning from Imbalanced Data Sets," SIGKDD Explorations, vol. 6, pp. 1-6. doi: 10.1145/1007730.1007733. [11] Guo, Hongyu & Viktor, Herna. (2004). "Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach," SIGKDD Explorations, vol. 6, pp. 30-39. doi: 10.1145/1007730.1007736. [12] N. Chawla, "Synthetic Minority Oversampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002. 25-04-2023 Classification of Imbalanced Dataset 75