Canopy Clustering and K-Means Clustering
Machine Learning Big Data at Hacker Dojo
Anandha L Ranganathan (Anand), analog76@gmail.com
Movie Dataset. Download the movie dataset from http://www.grouplens.org/node/73. The data is in the format UserID::MovieID::Rating::Timestamp, for example:
1::1193::5::978300760
2::1194::4::978300762
7::1123::1::978300760
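A minimal parsing sketch for lines in that format; the Rating class and its field names are my own illustration, not part of the original deck.

public class Rating {
    public final long userId;
    public final long movieId;
    public final int rating;
    public final long timestamp;

    public Rating(long userId, long movieId, int rating, long timestamp) {
        this.userId = userId;
        this.movieId = movieId;
        this.rating = rating;
        this.timestamp = timestamp;
    }

    // Split on the "::" delimiter used by the GroupLens dump.
    public static Rating parse(String line) {
        String[] parts = line.split("::");
        return new Rating(Long.parseLong(parts[0]),
                          Long.parseLong(parts[1]),
                          Integer.parseInt(parts[2]),
                          Long.parseLong(parts[3]));
    }
}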
Similarity Measures: Jaccard similarity coefficient and cosine similarity.
Jaccard Index. Similarity = (# of movies watched by both user A and user B) / (total # of movies watched by either user). In other words, |A ∩ B| / |A ∪ B|. For our application I am going to compare the subsets of users z₁ and z₂, where z₁, z₂ ∈ Z. http://en.wikipedia.org/wiki/Jaccard_index
Jaccard Similarity Coefficient.
double similarity(String[] s1, String[] s2) {
    List<String> lstSx = Arrays.asList(s1);
    List<String> lstSy = Arrays.asList(s2);
    // Union of the two users' movie sets.
    Set<String> unionSxSy = new HashSet<String>(lstSx);
    unionSxSy.addAll(lstSy);
    // Intersection of the two users' movie sets.
    Set<String> intersectionSxSy = new HashSet<String>(lstSx);
    intersectionSxSy.retainAll(lstSy);
    return intersectionSxSy.size() / (double) unionSxSy.size();
}
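A quick usage sketch; the movie IDs below are made up for illustration.

String[] userA = {"1193", "1194", "2355"};
String[] userB = {"1193", "2355", "3408", "1287"};
double sim = similarity(userA, userB);   // intersection = 2, union = 5, so sim = 0.4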
Cosine Similarity. similarity = dot product(A, B) / (||A|| * ||B||). The simpler distance calculation (Jaccard) will be used for canopy clustering; the more expensive calculation (cosine) will be used for K-means clustering.
Canopy Clustering - Mapper. A canopy cluster is a subset of the total population; the points in that cluster are movies. If a subset z₁ of the whole population rated movie M1 and the same subset also rated M2, then movies M1 and M2 belong to the same canopy cluster.
Canopy Cluster - Mapper.
1. The first point received becomes the center of a canopy; call it P1.
2. Receive the second point P2; if its distance from the canopy center is less than T2, it is a point of that canopy.
3. If d(P1, P2) > T2, then P2 becomes a new canopy center.
4. If d(P1, P2) < T2, then P2 is a point of the canopy centered at P1.
Continue steps 2-4 until the mapper completes its job. Distances are measured between 0 and 1. The T2 value is 0.005 and I expect around 200 canopy clusters; the T1 value is 0.0010.
Canopy Cluster - Mapper. Pseudo code:
boolean pointStronglyBoundToCanopyCenter = false;
for (Canopy canopy : canopies) {
    double centerPoint = canopy.getPoint();
    if (distanceMeasure.similarity(centerPoint, movieId) > T1) {
        pointStronglyBoundToCanopyCenter = true;
    }
}
if (!pointStronglyBoundToCanopyCenter) {
    // Not close to any existing center, so this point starts a new canopy.
    canopies.add(new Canopy(movieId));
}
Data Massaging. Convert the data into the required format. In this case the converted data is keyed by movie: <MovieId, List of Users>, i.e. <MovieId, List<userId, rating>>.
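A minimal sketch of this massaging step as a Hadoop map/reduce pair, assuming the raw UserID::MovieID::Rating::Timestamp lines as input. The class names and the "userId:rating" output encoding are my own; the two classes are shown together for brevity but would normally live in separate files (or as public static nested classes of a driver).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (movieId, "userId:rating") for every rating line.
class MovieGroupingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("::");   // UserID::MovieID::Rating::Timestamp
        context.write(new Text(parts[1]), new Text(parts[0] + ":" + parts[2]));
    }
}

// Reducer: collect every "userId:rating" pair for one movie into a single comma-separated list.
class MovieGroupingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text movieId, Iterable<Text> users, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text user : users) {
            if (sb.length() > 0) sb.append(",");
            sb.append(user.toString());
        }
        context.write(movieId, new Text(sb.toString()));
    }
}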
Canopy Cluster - Mapper A
Threshold value
Note: the T1 and T2 values above are wrong; the inner circle is T2 and the outer circle is T1, so T1 should be the larger threshold.
Reducer. Mapper A - red centers; Mapper B - green centers.
Redundant centers within the threshold of each other.
Add a small error: Threshold + ξ.
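A minimal reducer-side sketch of dropping redundant centers that fall within Threshold + ξ of a center already kept. It reuses the deck's Canopy and DistanceMeasure style of names, but the method itself is my own illustration, not Mahout's implementation.

List<Canopy> mergeRedundantCenters(List<Canopy> mapperCenters, DistanceMeasure distanceMeasure,
                                   double threshold, double epsilon) {
    List<Canopy> merged = new ArrayList<Canopy>();
    for (Canopy candidate : mapperCenters) {
        boolean redundant = false;
        for (Canopy kept : merged) {
            // Two centers closer than (threshold + ξ) describe the same canopy,
            // typically reported by two different mappers.
            if (distanceMeasure.distance(kept.getPoint(), candidate.getPoint()) < threshold + epsilon) {
                redundant = true;
                break;
            }
        }
        if (!redundant) {
            merged.add(candidate);
        }
    }
    return merged;
}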
So far we have found only the canopy centers. Run another MR job to find the points that belong to each canopy center; the canopy clusters are ready when that job is completed. What would it look like?
Canopy Cluster - before the MR job (sparse matrix).
Canopy Cluster - after the MR job.
Cells with value 1 are grouped together, and users are moved from their original locations.
K-Means Clustering. The output of canopy clustering becomes the input of K-means clustering. Apply the cosine similarity metric to find similar users. To compute cosine similarity, create a vector per user in the format <UserId, List<Movies>>, e.g. <UserId, {m1, m2, m3, m4, m5}>.
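A minimal sketch of turning one user's movie list into the 0/1 vector used in the example below, assuming a fixed index position for every movie ID (the movieIndex map and method name are illustrative).

// Map a user's movie list to a 0/1 vector over the full movie vocabulary.
double[] toBinaryVector(List<String> userMovies, Map<String, Integer> movieIndex) {
    double[] vector = new double[movieIndex.size()];
    for (String movieId : userMovies) {
        Integer pos = movieIndex.get(movieId);
        if (pos != null) {
            vector[pos] = 1.0;   // 1 = this user rated/watched the movie
        }
    }
    return vector;
}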
Vector(A) = 1111000, Vector(B) = 0100111, Vector(C) = 1110010.
similarity(A, B) = Vector(A) · Vector(B) / (||A|| * ||B||)
Vector(A) · Vector(B) = 1; ||A|| * ||B|| = 2 * 2 = 4; 1/4 = 0.25.
Similarity(A, B) = 0.25.
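A minimal cosine similarity sketch over such 0/1 vectors, assuming they are stored as double arrays; this is my own illustration, not code from the original deck.

// Cosine similarity of two equal-length vectors: dot(A, B) / (||A|| * ||B||).
double cosineSimilarity(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Reproduces the example above:
// cosineSimilarity(new double[]{1,1,1,1,0,0,0}, new double[]{0,1,0,0,1,1,1}) == 0.25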
Find the k neighbors from within the same canopy cluster; do not take any point from another canopy cluster. If you want a small number of neighbors, the number of K-means clusters should be greater than the number of canopy clusters. After a couple of map-reduce jobs the K-means clusters are ready.
Find Nearest Cluster of a Point - Map.
public void addPointToCluster(Point point, Iterable<KMeansCluster> lstKMeansCluster) {
    KMeansCluster closestCluster = null;
    // Initial bound derived from the canopy threshold, as in the original slide.
    double closestDistance = canopyThresholdT1 / 3;
    for (KMeansCluster cluster : lstKMeansCluster) {
        double distance = distance(cluster.getCenter(), point);
        if (closestCluster == null || distance < closestDistance) {
            closestCluster = cluster;
            closestDistance = distance;
        }
    }
    closestCluster.add(point);
}
Compute the centroid until it converges.
public void computeConvergence(Iterable<KMeansCluster> clusters) {
    for (KMeansCluster cluster : clusters) {
        Point newCentroid = cluster.computeCentroid();
        // Value comparison of the old and new centroids, not reference equality.
        if (cluster.getCentroid().equals(newCentroid)) {
            cluster.converged = true;
        } else {
            cluster.setCentroid(newCentroid);
        }
    }
}
Run the nearest-cluster step and the centroid computation repeatedly until the centroids become static.
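A minimal sketch of that outer iteration loop, combining the two methods above; the clearPoints() call and the maxIterations guard are my own assumptions, not from the original deck.

// Iterate assignment and centroid update until every cluster reports convergence.
void runKMeans(Iterable<Point> points, List<KMeansCluster> clusters, int maxIterations) {
    for (int i = 0; i < maxIterations; i++) {
        for (KMeansCluster cluster : clusters) {
            cluster.clearPoints();                  // hypothetical helper: reset assignments
        }
        for (Point p : points) {
            addPointToCluster(p, clusters);         // nearest-cluster step from the previous slide
        }
        computeConvergence(clusters);               // centroid update from this slide
        boolean allConverged = true;
        for (KMeansCluster cluster : clusters) {
            if (!cluster.converged) { allConverged = false; break; }
        }
        if (allConverged) {
            break;                                  // centroids are static, K-means is done
        }
    }
}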
All points - before clustering.
Canopy clustering.
Canopy clustering and K-means clustering.
Questions?
References
Apache Mahout - https://cwiki.apache.org/MAHOUT/canopy-clustering.html
Canopy Clustering - http://code.google.com/p/canopy-clustering/
Google Lectures - http://www.youtube.com/watch?v=1ZDybXl212Q
http://cs.boisestate.edu/~amit/research/makho_ngazimbi_project.pdf
