Parallel K-means Clustering using CUDA
Lan Liu
Pritha D N
12/06/2016
Outline
● Performance Analysis
● Overview of CUDA
● Future Work
Review of Parallelization
● Complexity of the sequential K-means algorithm: O(N*D*K*T)
N: number of data points
D: number of dimensions
K: number of clusters
T: number of iterations
● Complexity of each iteration:
Part 1: for each data point, compute the distance to the K cluster centers and assign the point to the nearest one. O(N*D*K). Parallelized on CUDA (SIMD: single instruction, multiple data).
Part 2: compute each new center as the mean of the data points in its cluster. O((N+K)*D) --> O((2*delta+K)*D)
delta: number of data points that change membership.
Modify Part 2
● Complexity of the sequential K-means algorithm: O(N*D*K*T)
N: number of data points
D: number of dimensions
K: number of clusters
T: number of iterations
● Complexity of each iteration:
Part 1: compute the distance to the K cluster centers and assign each point to the nearest one. O(N*D*K)
Part 2: compute each new center as the mean of the data points in its cluster. O(N+K) --> O(2*delta+K) per dimension
delta: number of data points that change membership (a sketch of this incremental update follows).
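A minimal CPU-side sketch of this incremental update, assuming running per-cluster sums and sizes are kept between iterations and that every point already has a valid membership from the previous pass; the array names (clusterSum, clusterSize, newMembership) are illustrative, not the project's actual identifiers:

    /* Incremental center update: only points that changed membership do D-dimensional work. */
    int delta = 0;
    for (int i = 0; i < N; i++) {
        int oldc = membership[i];
        int newc = newMembership[i];
        if (oldc == newc) continue;              /* unchanged points: no D-dimensional work */
        delta++;
        for (int d = 0; d < D; d++) {
            clusterSum[oldc][d] -= data[i][d];   /* remove the point from its old cluster   */
            clusterSum[newc][d] += data[i][d];   /* add the point to its new cluster        */
        }
        clusterSize[oldc]--;
        clusterSize[newc]++;
        membership[i] = newc;
    }
    /* Recompute the K centers from the running sums: O(K*D). */
    for (int k = 0; k < K; k++)
        if (clusterSize[k] > 0)
            for (int d = 0; d < D; d++)
                clusters[k][d] = clusterSum[k][d] / clusterSize[k];

The per-dimension cost is 2*delta for the changed points plus K for the final averaging, which is where the O(2*delta+K) term comes from.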
Performance Analysis
Experiment 1: K=128, D=1000, varying the data size N.
Performance Analysis
Experiment 2: N=51200, D=1000, varying the number of clusters K.
Speedup (N >> K, D > K)
Performance Analysis
Experiment 3: N=51200, D=1000, T=30, varying the number of clusters K.
Performance Analysis
Experiment 3 (continued)
Slope ~ 1: doubling K doubles the running time.
CUDA - Execution of a CUDA program
(Figure in the original slide; the kernel launched in this project is find_nearest_cluster.)
CUDA - Memory Organisation
(Figure in the original slide.)
CUDA - Thread Organization
Execution resources are organized into Streaming Multiprocessors (SMs).
Blocks are assigned to Streaming Multiprocessors in arbitrary order.
Blocks are further partitioned into warps.
An SM executes only one of its resident warps at a time.
The goal is to keep each SM maximally occupied.
Threads per warp: 32
Max warps per multiprocessor: 64
Max thread blocks per multiprocessor: 16
NVProf
Profiling configuration: N = 51200, Dimension = 1000, K = 128, loop iterations = 21

GPU activities (kernel and memory copies):
Time(%)   Time       Calls   Avg        Min        Max        Name
98.99%    4.11982s   21      196.18ms   195.58ms   197.96ms   find_nearest_cluster(int, int, int, float*, float*, int*, int*)
 0.98%    40.635ms   23      1.7668ms   30.624us   39.102ms   [CUDA memcpy HtoD]
 0.03%    1.2578ms   42      29.946us   28.735us   31.104us   [CUDA memcpy DtoH]

CUDA API calls:
Time(%)   Time       Calls   Avg        Min        Max        Name
93.06%    4.12058s   21      196.22ms   195.62ms   198.00ms   cudaDeviceSynchronize
 5.79%    256.47ms    4      64.117ms   4.9510us   255.97ms   cudaMalloc
 1.02%    45.072ms   65      693.42us   82.267us   39.230ms   cudaMemcpy
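The profile above comes from nvprof; a typical invocation looks like the following, where the executable name is a placeholder for the project's k-means binary and its input arguments are omitted:

    nvprof ./cuda_kmeans <input options>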
Future work:
1. Compare the time taken with implementations in OpenMP, MPI, and standard libraries (SkLearn, Matlab, etc.).
2. Apply the MapReduce methodology.
3. Efficiently parallelize Part 2.
Thank You!
Questions?
Introduction to K-Means Clustering
● Clustering algorithm used in data mining.
● Aims to partition N data points into K clusters, where each data point belongs to the cluster with the nearest mean.
● Objective function:
J = \sum_{i=1}^{K} \sum_{j=1}^{n_i} \| x_j^{(i)} - \mu_i \|^2
where K is the number of clusters, \mu_i is the center (mean) of cluster i, and n_i is the number of data points in cluster i.
Parallelization: CUDA C Implementation
Step 0: The HOST initializes the cluster centers and copies the N data coordinates to the DEVICE.
Step 1: Copy the data membership and the K cluster centers from HOST to DEVICE.
Step 2: On the DEVICE, each thread processes a single data point: it computes the distances and updates that point's membership.
Step 3: Copy the new membership back to the HOST and recompute the cluster centers.
Step 4: Check convergence; if not converged, go back to Step 1.
Step 5: The HOST frees the allocated memory at the end.
(A host-side sketch of these steps follows.)
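A host-side sketch of Steps 0-5, assuming the data is stored as a flat row-major N x D array and following the four device allocations described in the speaker notes; the variable names, the kernel-argument order, and the recompute_centers helper are illustrative rather than the project's actual code:

    /* data, clusters, membership, membershipChanged are host arrays allocated elsewhere. */

    /* Step 0: allocate device memory and copy the N*D data coordinates once. */
    float *d_data, *d_clusters;
    int   *d_membership, *d_membershipChanged;
    cudaMalloc(&d_data,              N * D * sizeof(float));
    cudaMalloc(&d_clusters,          K * D * sizeof(float));
    cudaMalloc(&d_membership,        N * sizeof(int));
    cudaMalloc(&d_membershipChanged, N * sizeof(int));
    cudaMemcpy(d_data, data, N * D * sizeof(float), cudaMemcpyHostToDevice);

    int threadsPerBlock = 128;
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    int delta, iter = 0;

    do {
        /* Step 1: copy current memberships and the K cluster centers to the device. */
        cudaMemcpy(d_membership, membership, N * sizeof(int),       cudaMemcpyHostToDevice);
        cudaMemcpy(d_clusters,   clusters,   K * D * sizeof(float), cudaMemcpyHostToDevice);

        /* Step 2: one thread per data point assigns it to the nearest center. */
        find_nearest_cluster<<<numBlocks, threadsPerBlock>>>(D, N, K, d_data, d_clusters,
                                                             d_membership, d_membershipChanged);
        cudaDeviceSynchronize();

        /* Step 3: copy the results back and recompute the centers on the host. */
        cudaMemcpy(membership,        d_membership,        N * sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemcpy(membershipChanged, d_membershipChanged, N * sizeof(int), cudaMemcpyDeviceToHost);
        delta = 0;
        for (int i = 0; i < N; i++) delta += membershipChanged[i];
        recompute_centers(data, membership, clusters, N, D, K);  /* e.g. the incremental update shown earlier */

        /* Step 4: convergence test (d/N < 0.001), with an arbitrary safety cap on iterations. */
    } while ((float)delta / N >= 0.001f && ++iter < 500);

    /* Step 5: free the device memory. */
    cudaFree(d_data);       cudaFree(d_clusters);
    cudaFree(d_membership); cudaFree(d_membershipChanged);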
Step 2 in detail: on the DEVICE, each thread processes a single data point, computes its distance to every cluster center, and updates that point's membership. A minimal kernel sketch follows.
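nvprof reports the kernel signature find_nearest_cluster(int, int, int, float*, float*, int*, int*); the parameter order, the row-major layout, and the use of global memory below are assumptions, and the actual kernel (based on the Northwestern reference implementation) may differ, for example by staging the cluster centers in shared memory:

    #include <float.h>   /* FLT_MAX */

    __global__ void find_nearest_cluster(int D, int N, int K,
                                         float *data,       /* N x D, row-major */
                                         float *clusters,   /* K x D, row-major */
                                         int *membership,
                                         int *membershipChanged)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;        /* one thread per data point  */
        if (i >= N) return;

        int   best     = 0;
        float bestDist = FLT_MAX;
        for (int k = 0; k < K; k++) {
            float dist = 0.0f;
            for (int d = 0; d < D; d++) {                     /* squared Euclidean distance */
                float diff = data[i * D + d] - clusters[k * D + d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = k; }
        }
        membershipChanged[i] = (membership[i] != best);       /* 1 if this point moved      */
        membership[i] = best;
    }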
Sequential Algorithm Steps [1]
1. Pick the first K data points as the initial cluster centers.
2. Assign each data point to the nearest cluster.
3. For each reassigned data point, increment d (d = d + 1).
4. Set each new cluster center to the mean of all data points belonging to that cluster.
5. Repeat steps 2-4 until convergence.
➔ How is convergence defined?
Stop condition: d/N < 0.001
d: number of data points that changed membership
N: total number of data points
[1] Reference: http://users.eecs.northwestern.edu/~wkliao/Kmeans/
Sequential Algorithm
(Figure in the original slide.)
GPU-based: CUDA C
If a CPU has p cores and n data points, each core processes n/p data points.
A GPU processes one element per thread (the number of threads is very large, ~1000 or more).
A GPU is more effective than a CPU when processing large blocks of data in parallel.
CUDA C: Host --> CPU, Device --> GPU. The host launches a kernel that executes on the device.
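As a toy illustration of the host/device split and the one-element-per-thread model (unrelated to k-means; the names are illustrative):

    /* Each GPU thread scales exactly one element of the array. */
    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    /* On the host (CPU): launch the kernel on the device (GPU). */
    /* scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);            */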
Editor's Notes
1-2. To compute the new centers, we previously added all the data points in each cluster together and averaged them, which costs O(N+K). Since only a small fraction of the data changes membership in an iteration, whenever a point changes membership we instead add it to its new cluster and remove it from its old one. This improves the efficiency of Part 2.
3. The GPU on Comet has 11 GB of memory; the largest input we tried was 8 GB. The CUDA code took about 1.7 hours versus roughly 3 days for the sequential code, so the CUDA code outperforms the sequential code by a large margin.
4. If N is kept the same and K is increased, each data point is assigned more work to do.
5. The GPU on Comet has compute capability 3.7, with a maximum of 2048 threads per SM. Once a grid is launched, its blocks are assigned to Streaming Multiprocessors in arbitrary order, giving transparent scalability. The limitation is that threads in different blocks cannot synchronize with each other; the only safe way to synchronize across blocks is to terminate the kernel and start a new kernel after the synchronization point. Threads are assigned to SMs block by block. For K80-class processors, each SM can accommodate up to 16 blocks or 2048 threads, whichever limit is reached first. Once a block is assigned to an SM, it is partitioned into warps, and at any time the SM executes only one of its resident warps. With 64 threads per block, each block has only 2 warps where 4 could be resident, which wastes capacity. With 128 threads per block, 128/32 = 4 warps per block; with a maximum of 16 blocks and 64 warps per SM, 64/16 = 4 warps per block, so the limits match. A legitimate question is why an SM needs so many resident warps when it executes only one at a time: this is how the processor hides long-latency operations such as global-memory accesses. When a warp is waiting on the result of a long-latency operation it is set aside, and another resident warp that is ready to run is selected; if more than one warp is ready, a priority mechanism chooses among them.
6. Four cudaMalloc calls: the 2-D input data, the 2-D cluster data, the 1-D membership array, and the 1-D memChanged array. Two device-to-host copies: membership and memChanged. Host-to-device: the data and the clusters; the new cluster centers are then copied every iteration.