Classification Technique KNN in
           Data Mining
        --- on the dataset “Iris”


      Comp722 data mining
        Kaiwen Qi, UNC
          Spring 2012
Outline
   • Dataset introduction
   • Data processing
   • Data analysis
   • KNN & Implementation
   • Testing
Dataset
   • Raw dataset: Iris (http://archive.ics.uci.edu/ml/datasets/Iris)

       (a) Raw data: 150 total records
       (b) Data organization:
              50 records Iris Setosa
              50 records Iris Versicolour
              50 records Iris Virginica
       (c) Data attributes (5 attributes):
              Sepal length in cm (continuous number)
              Sepal width in cm (continuous number)
              Petal length in cm (continuous number)
              Petal width in cm (continuous number)
              Class (nominal data: Iris Setosa, Iris Versicolour, Iris Virginica)
Classification Goal
   • Task
Data Processing
   • Original data
Data Processing
• Balanced distribution
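
The preprocessing itself appears only as screenshots in this export. Below is a minimal sketch, in Python, of one way to get the balanced distribution described here: the raw UCI file is sorted by class, so the records are reordered so that the three classes are evenly interleaved. The file name iris.data and the helper names are assumptions for illustration, not taken from the slides.

    import csv
    from collections import defaultdict

    def load_iris(path="iris.data"):
        """Read the raw UCI Iris file: 4 numeric attributes plus a class label per row."""
        with open(path) as f:
            rows = [row for row in csv.reader(f) if row]          # skip blank lines
        return [([float(v) for v in row[:4]], row[4]) for row in rows]

    def balance_order(records):
        """Reorder class-sorted records so every block of 3 holds one tuple of each class."""
        by_class = defaultdict(list)
        for features, label in records:
            by_class[label].append((features, label))
        groups = list(by_class.values())                          # three lists of 50 records
        return [record for trio in zip(*groups) for record in trio]

    records = balance_order(load_iris())
    print([label for _, label in records[:6]])                    # classes now alternate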
Data Analysis
   • Statistics
Data Analysis
   • Histogram
Data Analysis
   • Histogram
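
The statistics and histogram slides are images in this export, so the numbers themselves are not recoverable here. A minimal sketch of how the per-attribute statistics and a simple histogram could be computed, assuming the records list from the loading sketch above:

    from collections import Counter
    from statistics import mean, stdev

    ATTRIBUTES = ["sepal length", "sepal width", "petal length", "petal width"]

    def attribute_statistics(records):
        """Print min / max / mean / standard deviation for each numeric attribute (in cm)."""
        for i, name in enumerate(ATTRIBUTES):
            column = [features[i] for features, _ in records]
            print(f"{name}: min={min(column):.1f}  max={max(column):.1f}  "
                  f"mean={mean(column):.2f}  stdev={stdev(column):.2f}")

    def histogram(records, attr_index, bin_width=0.5):
        """Print a text histogram of one attribute, bucketed into bins of bin_width cm."""
        column = [features[attr_index] for features, _ in records]
        counts = Counter(round(value / bin_width) * bin_width for value in column)
        for edge in sorted(counts):
            print(f"{edge:4.1f} cm | {'#' * counts[edge]}")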
KNN
   • KNN algorithm

    The unknown tuple, the green circle, is classified as a square when
    K is 5. The distance between two points is calculated with the Euclidean
    distance d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2).
    In this example, squares are the majority among the 5 nearest neighbors.
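
The diagram itself is not in this export. A minimal sketch of the two steps the slide describes (Euclidean distance and majority vote), with a made-up 5-neighbor example standing in for the green-circle figure:

    import math
    from collections import Counter

    def euclidean_distance(p, q):
        """d(p, q) = sqrt((p1-q1)^2 + (p2-q2)^2 + ... + (pn-qn)^2)."""
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    def majority_vote(labels):
        """The class that occurs most often among the K nearest neighbors."""
        return Counter(labels).most_common(1)[0][0]

    print(euclidean_distance((0, 0), (3, 4)))                                      # 5.0
    print(majority_vote(["square", "triangle", "square", "square", "triangle"]))   # square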
KNN
   • Advantages
       • Simplicity of implementation; it is good at dealing with numeric
         attributes.
       • Does not build a model; it just imports the dataset, with very low
         computational overhead.
       • Does not need to compute a useful attribute subset. Compared with
         naïve Bayesian, we do not need to worry about a lack of available
         probability data.
Implementation of KNN
   • Algorithm
        Algorithm: KNN. Assign a classification label from the training data to an
        unlabeled tuple.
        Input: K, the number of neighbors, and a dataset that includes the training data.
        Output: A string that indicates the unknown tuple's classification.

    Method:
     (1) Create a distance array whose size is K.
     (2) Initialize the array with the distances between the unlabeled tuple and the
         first K records in the dataset.
     (3) Let i = K+1.
     (4) Calculate the distance between the unlabeled tuple and the i-th record in the
         dataset; if the distance is smaller than the biggest distance in the array,
         replace the old maximum distance with the new distance; i = i+1.
     (5) Repeat step (4) until i is greater than the dataset size (150).
     (6) Count the class occurrences in the array; the class with the biggest count is
         the mining result.
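
The slides give the method as pseudocode and a UML diagram rather than source code. A minimal Python sketch of the same steps, reusing the hypothetical euclidean_distance and majority_vote helpers from the earlier sketch and keeping (distance, label) pairs so that step (6) can vote:

    def knn_classify(training_records, unlabeled, k=7):
        """Classify one unlabeled tuple by majority vote among its k nearest training tuples.

        training_records: list of (features, label) pairs, e.g. from load_iris().
        unlabeled:        a bare feature vector (4 floats for Iris).
        """
        # Steps (1)-(2): seed the neighbor array with the first k training records.
        neighbors = [(euclidean_distance(unlabeled, features), label)
                     for features, label in training_records[:k]]

        # Steps (3)-(5): scan the remaining records, replacing the current farthest
        # neighbor whenever a closer record is found.
        for features, label in training_records[k:]:
            d = euclidean_distance(unlabeled, features)
            farthest = max(range(k), key=lambda i: neighbors[i][0])
            if d < neighbors[farthest][0]:
                neighbors[farthest] = (d, label)

        # Step (6): the class with the biggest count among the k nearest is the result.
        return majority_vote([label for _, label in neighbors])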
Implementation of KNN
   • UML
Testing
   • Testing (K=7, all 150 tuples)
Testing
   • Testing (K=7, 60% of the data as training data)
Testing
   • Input: randomly distributed dataset

       [Random dataset]

   • Accuracy test:
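
The actual accuracy figures are only visible in the slide images. A minimal sketch of the test setup described on these slides (60% of a randomly ordered dataset as training data, the rest held out), reusing the hypothetical helpers from the earlier sketches:

    def accuracy_test(records, k=7, train_fraction=0.6):
        """Train on the first train_fraction of the records, classify the rest, report accuracy."""
        split = int(len(records) * train_fraction)
        training, testing = records[:split], records[split:]
        correct = sum(1 for features, label in testing
                      if knn_classify(training, features, k) == label)
        return correct / len(testing)

    records = balance_order(load_iris())            # evenly mixed input, as on the slide
    print(f"Accuracy with K=7: {accuracy_test(records):.1%}")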
Performance
   • Comparison

    Decision tree
    Advantages
    • Comprehensibility
    • Constructs a decision tree without any domain knowledge
    • Handles high-dimensional data
    • By eliminating unrelated attributes and pruning the tree, it simplifies the
      classification calculation
    Disadvantages
    • Requires good-quality training data
    • Usually runs in memory
    • Not good at handling continuous number features

    Naïve Bayesian
    Advantages
    • Relatively simple
    • Simply calculates attribute frequencies from the training data, without any
      other operations (e.g. sort, search)
    Disadvantages
    • The assumption of attribute independence often does not hold
    • There may be no available probability data to calculate a probability
Conclusion
   • KNN is a simple algorithm with high classification accuracy for datasets
     with continuous attributes.
   • It shows high performance with balanced-distribution training data as
     input.
Thanks
Questions?
