SlideShare uma empresa Scribd logo
1 de 67
A Practical Quicksort Algorithmfor Graphics Processors Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology
System Model The Algorithm Experiments Overview
System Model The Algorithm Experiments System Model
CPU vs. GPU Normal processors Graphics processors Multi-core Large cache Few threads Speculative execution Many-core Small cache Wide and fast memory bus Thousands of threads hides memory latency
Designed for general purpose computation Extensions to C/C++ CUDA
System Model – Global Memory Global Memory
System Model – Shared Memory Global Memory Multiprocessor 0 Multiprocessor 1 Shared Memory Shared Memory
System Model – Thread Blocks Global Memory Multiprocessor 0 Multiprocessor 1 Thread Block 0 Thread Block 1 Thread Block x Thread Block y Shared Memory Shared Memory
Minimal, manual cache 32-word SIMD instruction Coalesced memory access No block synchronization Expensive synchronization primitives Challenges
System Model The Algorithm Experiments The Algorithm
Quicksort
Quicksort Data Parallel!
Quicksort NOT Data Parallel!
Quicksort Phase One ,[object Object]
Block synchronization on the CPUThread Block 1 Thread Block 2
Quicksort Phase Two ,[object Object]
Run entirely on GPUThread Block 1 Thread Block 2
Quicksort – Phase One 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
Quicksort – Phase One Assign Thread Blocks 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
Quicksort – Phase One Uses an auxiliary array In-place too expensive 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8
Quicksort – Synchronization 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8 LeftPointer = 0 1 Thread Block ONE Fetch-And-Add on LeftPointer 0 Thread Block TWO
Quicksort – Synchronization 22 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8 3 2 LeftPointer = 2 2 Thread Block ONE Fetch-And-Add on LeftPointer 3 Thread Block TWO
Quicksort – Synchronization 220 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8 3 2 2 3 LeftPointer = 4 5 Thread Block ONE Fetch-And-Add on LeftPointer 4 Thread Block TWO
Quicksort – Synchronization 2 2 0 7 4 9 3 3 3 7 8 48 7 2 5 0 7 6 3 4 2 9 8 3 2 2 3 3 0 2 3 9 7 7 8 7 8 5 7 6 9 8 0 2 LeftPointer = 10
Quicksort – Synchronization 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8 3 2 2 3 3 0 2 3 9 4 7 7 8 7 8 5 7 6 9 8 4 4 0 2 Three way partition
Two Pass Partitioning 2 2 0 7 4 9 3 3 3 7 8 4 8 7 2 5 0 7 6 3 4 2 9 8 < < < < < < < < < < = = = > > > > > > > > > > > = = =
Recurse on smallest sequence Max depth log(n) Explicit stack Alternative sorting Bitonic sort Second Phase
System Model The Algorithm Experiments The Algorithm
GPU-Quicksort		Cederman and Tsigas, ESA08 GPUSort			Govindaraju et. al., SIGMOD05 Radix-Merge		Harris et. al., GPU Gems 3 ’07 Global radix		Sengupta et. al., GH07 Hybrid			Sintorn and Assarsson, GPGPU07 Introsort 			David Musser, Software: 						Practice and Experience Set of Algorithms
Uniform Gaussian Bucket Sorted Zero StanfordModels Distributions
Uniform Gaussian Bucket Sorted Zero StanfordModels Distributions
Uniform Gaussian Bucket Sorted Zero StanfordModels Distributions
Uniform Gaussian Bucket Sorted Zero StanfordModels Distributions
Uniform Gaussian Bucket Sorted Zero StanfordModels Distributions
Uniform Gaussian Bucket Sorted Zero StanfordModels Distributions x1 x2 x3 P x4
First Phase Average of maximum and minimum element Second Phase Median of first, middle and last element Pivot selection
8800GTX 16 multiprocessors (128 cores) 86.4 GB/s bandwidth 8600GTS 4 multiprocessors (32 cores) 32 GB/s bandwidth CPU-Reference AMD Dual-Core Opteron 265 / 1.8 GHz Hardware
8800GTX – Uniform Distribution
8800GTX – Uniform Distribution
8800GTX – Uniform – 16M
8800GTX – Uniform – 16M CPU-Reference ~4s
8800GTX – Uniform – 16M GPUSort and Radix-Merge ~1.5s
8800GTX – Uniform – 16M GPU-Quicksort ~380ms Global Radix ~510ms
8800GTX – Sorted Distribution
8800GTX – Sorted Distribution
8800GTX – Sorted – 8M
8800GTX – Sorted – 8M Reference is faster than GPUSort and Radix-Merge
8800GTX – Sorted – 8M GPU-Quicksort takes less than 200ms
8800GTX – Zero Distribution
8800GTX – Zero Distribution
8800GTX – Zero – 16M
8800GTX – Zero – 16M Bitonic is unaffected by distribution
8800GTX – Zero – 16M Three-way partition
8600GTS – Uniform Distribution
8600GTS – Uniform Distribution
8600GTS – Uniform – 8M
8600GTS – Uniform – 8M Reference equal to Radix-Merge
8600GTS – Uniform – 8M Quicksort and hybrid ~4 times faster
8600GTS – Sorted Distribution
8600GTS – Sorted Distribution
8600GTS – Sorted – 8M
8600GTS – Sorted – 8M Hybrid needs randomization
8600GTS – Sorted – 8M Reference better
Visibility Ordering – 8800GTX
Visibility Ordering – 8800GTX Similar to uniform distribution
Visibility Ordering – 8800GTX Sequence needs to be a power of two in length
Minimal, manual cache Used only for prefix sum and bitonic 32-word SIMD instruction Main part executes same instructions Coalesced memory access All reads coalesced No block synchronization Only required in first phase Expensive synchronization primitives Two passes amortizes cost Conclusions

Mais conteúdo relacionado

Mais procurados

Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit Ganesan Narayanasamy
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDAprithan
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Fisnik Kraja
 
hajer
hajerhajer
hajerra na
 
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...Ganesan Narayanasamy
 
1.meena tushir finalpaper-1-12
1.meena tushir finalpaper-1-121.meena tushir finalpaper-1-12
1.meena tushir finalpaper-1-12Alexander Decker
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCIgor Sfiligoi
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
Scheduling in Linux and Web Servers
Scheduling in Linux and Web ServersScheduling in Linux and Web Servers
Scheduling in Linux and Web ServersDavid Evans
 
Design and implementation of modified clock generation
Design and implementation of modified clock generationDesign and implementation of modified clock generation
Design and implementation of modified clock generationeSAT Journals
 
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solverS4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solverPraveen Narayanan
 
A complete fingerprint matching algorithm on gpu
A complete fingerprint matching algorithm on gpuA complete fingerprint matching algorithm on gpu
A complete fingerprint matching algorithm on gpuHai le Hong
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learningAmgad Muhammad
 

Mais procurados (20)

Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Real time-embedded-system-lec-07
Real time-embedded-system-lec-07Real time-embedded-system-lec-07
Real time-embedded-system-lec-07
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Early Application experiences on Summit
Early Application experiences on Summit Early Application experiences on Summit
Early Application experiences on Summit
 
Parallel K means clustering using CUDA
Parallel K means clustering using CUDAParallel K means clustering using CUDA
Parallel K means clustering using CUDA
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...
 
hajer
hajerhajer
hajer
 
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...Targeting GPUs using OpenMP  Directives on Summit with  GenASiS: A Simple and...
Targeting GPUs using OpenMP Directives on Summit with GenASiS: A Simple and...
 
1.meena tushir finalpaper-1-12
1.meena tushir finalpaper-1-121.meena tushir finalpaper-1-12
1.meena tushir finalpaper-1-12
 
A03530107
A03530107A03530107
A03530107
 
Aa sort-v4
Aa sort-v4Aa sort-v4
Aa sort-v4
 
Paralell
ParalellParalell
Paralell
 
Accelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACCAccelerating microbiome research with OpenACC
Accelerating microbiome research with OpenACC
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Scheduling in Linux and Web Servers
Scheduling in Linux and Web ServersScheduling in Linux and Web Servers
Scheduling in Linux and Web Servers
 
nn network
nn networknn network
nn network
 
Design and implementation of modified clock generation
Design and implementation of modified clock generationDesign and implementation of modified clock generation
Design and implementation of modified clock generation
 
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solverS4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
S4495-plasma-turbulence-sims-gyrokinetic-tokamak-solver
 
A complete fingerprint matching algorithm on gpu
A complete fingerprint matching algorithm on gpuA complete fingerprint matching algorithm on gpu
A complete fingerprint matching algorithm on gpu
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 

Semelhante a GPU-Quicksort

Theta and the Future of Accelerator Programming
Theta and the Future of Accelerator ProgrammingTheta and the Future of Accelerator Programming
Theta and the Future of Accelerator Programminginside-BigData.com
 
HPE ProLiant DL380 Gen9 Server Data Sheet
HPE ProLiant DL380 Gen9 Server Data SheetHPE ProLiant DL380 Gen9 Server Data Sheet
HPE ProLiant DL380 Gen9 Server Data Sheet美兰 曾
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Rakib Hossain
 
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...Kinson Chan
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
RISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentorRISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentorRISC-V International
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computingrinnocente
 
Esp32 datasheet
Esp32 datasheetEsp32 datasheet
Esp32 datasheetMoises .
 
ExtraV - Boosting Graph Processing Near Storage with a Coherent Accelerator
ExtraV - Boosting Graph Processing Near Storage with a Coherent AcceleratorExtraV - Boosting Graph Processing Near Storage with a Coherent Accelerator
ExtraV - Boosting Graph Processing Near Storage with a Coherent AcceleratorJinho Lee
 
Argonne's Theta Supercomputer Architecture
Argonne's Theta Supercomputer ArchitectureArgonne's Theta Supercomputer Architecture
Argonne's Theta Supercomputer Architectureinside-BigData.com
 
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance FuckupsNETFest
 
Moving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsMoving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsHPCC Systems
 
osdi20-slides_zhao.pptx
osdi20-slides_zhao.pptxosdi20-slides_zhao.pptx
osdi20-slides_zhao.pptxCive1971
 
Dell Technologies Dell EMC POWERMAX Storage On One Single Page - POSTER - v1a...
Dell Technologies Dell EMC POWERMAX Storage On One Single Page - POSTER - v1a...Dell Technologies Dell EMC POWERMAX Storage On One Single Page - POSTER - v1a...
Dell Technologies Dell EMC POWERMAX Storage On One Single Page - POSTER - v1a...Dell Technologies
 

Semelhante a GPU-Quicksort (20)

Theta and the Future of Accelerator Programming
Theta and the Future of Accelerator ProgrammingTheta and the Future of Accelerator Programming
Theta and the Future of Accelerator Programming
 
HPE ProLiant DL380 Gen9 Server Data Sheet
HPE ProLiant DL380 Gen9 Server Data SheetHPE ProLiant DL380 Gen9 Server Data Sheet
HPE ProLiant DL380 Gen9 Server Data Sheet
 
BURA Supercomputer
BURA SupercomputerBURA Supercomputer
BURA Supercomputer
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
 
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
Adaptive Thread Scheduling Techniques for Improving Scalability of Software T...
 
Brkdct 3101
Brkdct 3101Brkdct 3101
Brkdct 3101
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Memory Mapping Cache
Memory Mapping CacheMemory Mapping Cache
Memory Mapping Cache
 
Tridiagonal solver in gpu
Tridiagonal solver in gpuTridiagonal solver in gpu
Tridiagonal solver in gpu
 
CUDA
CUDACUDA
CUDA
 
RISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentorRISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentor
 
Nodes and Networks for HPC computing
Nodes and Networks for HPC computingNodes and Networks for HPC computing
Nodes and Networks for HPC computing
 
Esp32 datasheet
Esp32 datasheetEsp32 datasheet
Esp32 datasheet
 
ExtraV - Boosting Graph Processing Near Storage with a Coherent Accelerator
ExtraV - Boosting Graph Processing Near Storage with a Coherent AcceleratorExtraV - Boosting Graph Processing Near Storage with a Coherent Accelerator
ExtraV - Boosting Graph Processing Near Storage with a Coherent Accelerator
 
Tvs x82 range
Tvs x82 range Tvs x82 range
Tvs x82 range
 
Argonne's Theta Supercomputer Architecture
Argonne's Theta Supercomputer ArchitectureArgonne's Theta Supercomputer Architecture
Argonne's Theta Supercomputer Architecture
 
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
.NET Fest 2019. Łukasz Pyrzyk. Daily Performance Fuckups
 
Moving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC SystemsMoving Toward Deep Learning Algorithms on HPCC Systems
Moving Toward Deep Learning Algorithms on HPCC Systems
 
osdi20-slides_zhao.pptx
osdi20-slides_zhao.pptxosdi20-slides_zhao.pptx
osdi20-slides_zhao.pptx
 
Dell Technologies Dell EMC POWERMAX Storage On One Single Page - POSTER - v1a...
Dell Technologies Dell EMC POWERMAX Storage On One Single Page - POSTER - v1a...Dell Technologies Dell EMC POWERMAX Storage On One Single Page - POSTER - v1a...
Dell Technologies Dell EMC POWERMAX Storage On One Single Page - POSTER - v1a...
 

GPU-Quicksort