AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors
By: H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani
Presented By: M. Edirisinghe, H. Nawarathna
Content

• Introduction
• SIMD instruction set
• AA-sort algorithm
• In-core algorithm
• Out-of-core algorithm
• Sorting scheme in AA-sort
• Experimental results
Introduction
• High-performance processors provide multiple
hardware threads within one physical
processor with multiple cores and
simultaneous multithreading
• Many processors provide Single Instruction
Multiple Data (SIMD) instructions

SIMD Instructions
• Advantages:
– Data parallelism
– Reduces the number of conditional branches in programs (vector compare and vector select can be used instead)
SIMD Instruction Set
• Uses Vector Multimedia eXtension (VMX, also known as AltiVec) instructions
• Provides a set of 128-bit vector registers
– Each register holds four 32-bit values

• Useful VMX instructions for sorting:
– Vector Compare
– Vector Select
– Vector Permute
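A minimal Python sketch of how a vector compare followed by vector selects can produce element-wise minima and maxima with no data-dependent branch in the vector code. The 4-lane registers are modeled as tuples, and `lane_cmpgt`, `lane_select` and `vec_minmax` are hypothetical helper names chosen for this illustration, not VMX intrinsics.

```python
# Emulates 4-lane SIMD compare/select with tuples (hypothetical helpers).
# On real hardware each helper would be a single vector instruction.

def lane_cmpgt(a, b):
    """Lane-wise compare: mask[i] is True where a[i] > b[i]."""
    return tuple(x > y for x, y in zip(a, b))

def lane_select(a, b, mask):
    """Lane-wise select: take b[i] where mask[i] is set, otherwise a[i]."""
    return tuple(y if m else x for x, y, m in zip(a, b, mask))

def vec_minmax(a, b):
    """Return (lane-wise minima, lane-wise maxima) with one compare and two selects."""
    mask = lane_cmpgt(a, b)
    lo = lane_select(a, b, mask)  # where a > b, the smaller value is b
    hi = lane_select(b, a, mask)  # where a > b, the larger value is a
    return lo, hi

lo, hi = vec_minmax((5, 1, 7, 3), (2, 4, 6, 8))
print(lo, hi)  # (2, 1, 6, 3) (5, 4, 7, 8)
```

This compare-and-swap on whole vectors is the basic building block used in the sketches of the sorting steps.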
Sorting Algorithms and SIMD
• Many sorting algorithms require unaligned or element-wise memory accesses (e.g., quicksort)
• These accesses incur additional overhead and attenuate the benefits of SIMD instructions
Paper’s Contribution
• Proposes Aligned-Access sort (AA-sort), a new parallel sorting algorithm that exploits both SIMD instructions and the thread-level parallelism available on today's multi-core processors, with a computational complexity of O(N log(N))
AA-Sort Algorithm
• Assumptions:
– The first element of the array to be sorted is aligned on a 128-bit boundary
– The number of elements in the array, N, is a multiple of four
AA-Sort Algorithm
• Array of integer values a[N] is equivalent to an
array of vector integers va[N/4]
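A small Python sketch of this equivalence, under the stated assumption that N is a multiple of four. `as_vectors` and `as_flat` are hypothetical names; in a real SIMD implementation this is a reinterpretation of the same aligned memory rather than a copy.

```python
LANES = 4  # four 32-bit values per 128-bit vector

def as_vectors(a):
    """View a flat list a[N] as va[N/4], where va[i] holds a[4i] .. a[4i+3]."""
    if len(a) % LANES != 0:
        raise ValueError("N must be a multiple of four")
    return [tuple(a[i:i + LANES]) for i in range(0, len(a), LANES)]

def as_flat(va):
    """Inverse view: concatenate the vectors back into a flat list."""
    return [x for v in va for x in v]

print(as_vectors([1, 2, 3, 4, 5, 6, 7, 8]))  # [(1, 2, 3, 4), (5, 6, 7, 8)]
```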

AA-Sort Algorithm
• Consists of two algorithms:
1. In-core sorting algorithm
2. Out-of-core sorting algorithm

• Phases of execution:
– Divide all of the data into blocks that fit into the cache of the processor
– Sort each block with the in-core sorting algorithm
– Merge the sorted blocks with the out-of-core sorting algorithm
Combsort
• Extension of bubble sort that eliminates "turtles" (small values near the end of the array)
• Compares and swaps non-adjacent elements
• Improves performance over bubble sort
• Average computational complexity of O(N log(N))
• Problems when using SIMD instructions:
– Unaligned memory accesses
– Loop-carried dependencies
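For reference, a plain scalar combsort in Python. This is a sketch, not the vectorized variant used by AA-sort; the shrink factor of 1.3 is a commonly used value and an assumption here. The comments mark where the two SIMD problems listed above come from.

```python
def combsort(a, shrink=1.3):
    """Sort the list a in place with combsort and return it."""
    n = len(a)
    gap = n
    swapped = True
    # Keep going until a full pass with gap == 1 makes no swap.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(n - gap):
            # a[i] and a[i + gap] are compared for arbitrary gaps, so a
            # vectorized load of a[i + gap ..] is generally unaligned.
            if a[i] > a[i + gap]:
                # a[i + gap] is written here and read again in iteration
                # i + gap: a loop-carried dependency.
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a

print(combsort([5, 2, 9, 1, 7, 3, 8, 6]))  # [1, 2, 3, 5, 6, 7, 8, 9]
```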
In-Core Algorithm
• Execution steps:
1. Sort the values within each vector in ascending order
2. Execute combsort to sort the values into the transposed order
3. Reorder the values from the transposed order into the original order
• Step 2 uses an extended combsort
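The three steps can be sketched as a scalar Python emulation in which each 4-element list stands for one vector register. The helper names `cmpswap` and `cmpswap_skew`, the shrink factor of 1.3, and the wrap-around comparison against the next lane are assumptions of this sketch, not details taken from the slide text.

```python
LANES = 4  # elements per vector

def cmpswap(va, i, j):
    """Lane-wise compare-and-swap: va[i] receives the minima, va[j] the maxima."""
    lo = [min(x, y) for x, y in zip(va[i], va[j])]
    hi = [max(x, y) for x, y in zip(va[i], va[j])]
    changed = lo != va[i]
    va[i], va[j] = lo, hi
    return changed

def cmpswap_skew(va, i, j):
    """Compare lane k of va[i] with lane k+1 of va[j] (wraps into the next lane)."""
    changed = False
    for k in range(LANES - 1):
        if va[i][k] > va[j][k + 1]:
            va[i][k], va[j][k + 1] = va[j][k + 1], va[i][k]
            changed = True
    return changed

def in_core_sort(a, shrink=1.3):
    """Return a sorted copy of a; len(a) must be a multiple of four."""
    if len(a) % LANES != 0:
        raise ValueError("N must be a multiple of four")
    n = len(a) // LANES
    if n == 0:
        return []
    # Step 1: sort the values within each vector in ascending order.
    va = [sorted(a[i * LANES:(i + 1) * LANES]) for i in range(n)]
    # Step 2: combsort on whole vectors. The result is in transposed order:
    # lane 0 of va[0..n-1] holds the smallest n values, then lane 1, and so on.
    gap, swapped = n, True
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(n - gap):
            swapped |= cmpswap(va, i, i + gap)
        for i in range(n - gap, n):
            swapped |= cmpswap_skew(va, i, i + gap - n)
    # Step 3: reorder from the transposed order into the original order.
    return [va[i][k] for k in range(LANES) for i in range(n)]

print(in_core_sort([8, 7, 6, 5, 4, 3, 2, 1]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Every comparison in step 2 involves whole vectors `va[i]` and `va[j]`, which is why a SIMD version needs only aligned loads and stores.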
In-Core Algorithm
• All three steps can be executed using SIMD instructions without unaligned memory accesses
• Computational complexity is dominated by step 2
– Average: O(N log(N))
– Worst case: O(N^2)

• Poor memory access locality
– Performance degrades if the data cannot fit into the cache of the processor
Out-of-Core Algorithm
• Used to merge two sorted vectors
– a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are sorted
– c = [b:a] = vector_merge(a, b) is the merged and sorted result
[Figure: sorted vectors a (a0 to a3) and b (b0 to b3) are merged by vector_merge(a, b) into the sorted sequence c0 to c7]
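Dataflow of Merge
[Figure: comparison network that merges the sorted inputs a0 to a3 and b0 to b3; each comparison (<) outputs a minimum and a maximum, with intermediate values labeled min00/max00 through min33/max33]
• The merge takes lg(P) + 1 stages, where P is the number of elements in a vector
• Here P = 4, so lg(P) + 1 = 3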
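A Python sketch of a three-stage merge for P = 4, using the comparator layout of Batcher's odd-even merge network. That this layout matches the slide's figure is an assumption; `STAGES` and the wire numbering (wires 0 to 3 hold a, wires 4 to 7 hold b) are introduced here for illustration.

```python
# Comparator stages of an odd-even merge network for two sorted
# 4-element inputs. Comparators within one stage touch disjoint wires,
# so each stage could run as one vector min/max operation.
STAGES = [
    [(0, 4), (1, 5), (2, 6), (3, 7)],  # stage 1: a_i against b_i
    [(2, 4), (3, 5)],                  # stage 2
    [(1, 2), (3, 4), (5, 6)],          # stage 3
]

def vector_merge(a, b):
    """Merge sorted 4-element sequences a and b; return (low half, high half)."""
    w = list(a) + list(b)
    for stage in STAGES:
        for i, j in stage:
            if w[i] > w[j]:
                w[i], w[j] = w[j], w[i]
    return w[:4], w[4:]

print(vector_merge([1, 4, 6, 7], [2, 3, 5, 8]))  # ([1, 2, 3, 4], [5, 6, 7, 8])
```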
Merge Operation

Out-of-Core Algorithm
• No unaligned memory accesses
• Better memory access locality than the in-core sorting algorithm
– Higher performance when the data cannot fit in the cache
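One way a vector merge can drive the merge of two long sorted arrays while reading each input strictly one whole vector at a time is sketched below. This is an illustrative reconstruction under assumptions, not the slide's own code: `merge_sorted_arrays` is a hypothetical name, both inputs are assumed non-empty with lengths that are multiples of four, and `vector_merge` uses `sorted` as a stand-in for the SIMD merge network.

```python
LANES = 4  # elements per vector

def vector_merge(a, b):
    """Stand-in for the SIMD merge network: merge two sorted 4-element lists."""
    w = sorted(a + b)
    return w[:LANES], w[LANES:]

def merge_sorted_arrays(a, b):
    """Merge sorted lists a and b using only whole-vector, sequential reads."""
    va = [a[i:i + LANES] for i in range(0, len(a), LANES)]
    vb = [b[i:i + LANES] for i in range(0, len(b), LANES)]
    ai = bi = 1
    vmin, vmax = va[0], vb[0]
    out = []
    while True:
        vmin, vmax = vector_merge(vmin, vmax)
        out.extend(vmin)  # the four smallest values are final
        a_left, b_left = ai < len(va), bi < len(vb)
        # Load the next vector from the input whose next element is smaller.
        if a_left and (not b_left or va[ai][0] <= vb[bi][0]):
            vmin = va[ai]
            ai += 1
        elif b_left:
            vmin = vb[bi]
            bi += 1
        else:
            break  # both inputs are consumed
    out.extend(vmax)  # flush the four largest values
    return out

print(merge_sorted_arrays([1, 10, 20, 30], [2, 3, 4, 5, 6, 7, 8, 9]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30]
```

Both inputs and the output are traversed front to back, which is the access pattern behind the locality advantage noted above.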
Overall AA Sort Scheme
• Divide all of the data to be sorted into blocks
that fit in the cache or the local memory of
the processor
• Sort each block with the in-core sorting
algorithm in parallel using multiple threads,
where each thread processes an independent
block.
• Merge the sorted blocks with the out-of-core
sorting algorithm using multiple threads
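The overall scheme can be sketched structurally in Python. This shows only the control flow: `sorted` stands in for the in-core sort, `heapq.merge` stands in for the out-of-core merge, and `aa_sort_scheme`, `block_size` and `threads` are hypothetical names. In CPython the threads illustrate the structure rather than deliver a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

def aa_sort_scheme(data, block_size, threads=4):
    """Sort blocks independently in parallel, then merge them pairwise in rounds."""
    # Phase 1: divide the data into blocks of at most block_size elements.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Phase 2: each task sorts one independent block.
        blocks = list(pool.map(sorted, blocks))
        # Phase 3: merge neighbouring sorted blocks until one block remains.
        while len(blocks) > 1:
            pairs = [blocks[i:i + 2] for i in range(0, len(blocks), 2)]
            blocks = list(pool.map(lambda p: list(merge(*p)), pairs))
    return blocks[0] if blocks else []

data = [(i * 7919) % 1000 for i in range(1000)]
print(aa_sort_scheme(data, block_size=128) == sorted(data))  # True
```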
Overall AA Sort Scheme Contd.
Number of data elements = N
Number of elements per block = B
Number of blocks = N/B

Considering the in-core sorting phase:
Computational time for the in-core sorting of each block is proportional to B log(B)
Total over N/B blocks is proportional to (N/B) * B log(B) = N log(B); B is a constant fixed by the cache size, so
Complexity of in-core sorting = O(N)

Considering the out-of-core sorting phase:
Merging the sorted blocks involves log(N/B) stages
Computational complexity of each stage = O(N)
Complexity of out-of-core sorting = O(N log(N/B)) = O(N log(N))

Hence,
Computational complexity of entire AA-sort = O(N log(N))
Overall AA Sort Scheme Contd.

An example of the entire AA-sort process, where the number of blocks (N/B) = 8 and the number of threads = 4
Experimental Setup
• PowerPC 970MP System
– Two 2.5 GHz dual-core processors
– 8GB system memory
– Each core had 1MB L2 cache memory
– Linux kernel 2.6.20

• System with Cell BE processors
– Two 2.4 GHz processors
– 1GB system memory
– Only SPE cores were used (16 SPE cores with
256KB local memory each)
– Linux kernel 2.6.15
Implementation
• Block size set to half of the L2 cache size (PowerPC 970MP) or half of the local memory (SPE)
– 512KB (128K 32-bit values) on the PowerPC 970MP
– 128KB (32K 32-bit values) on the SPE

• Combsort shrink factor: 1.28
• Multiway merge technique in the out-of-core sorting
– 4-way merge
– Number of merging stages reduced from log2(N/B) to log4(N/B)
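A short worked example of the stage counts, written as Python arithmetic. It assumes that "32 million" integers means 32 * 2^20 elements and uses the PowerPC 970MP block size of 128K values; `merge_stages` is a hypothetical helper.

```python
def merge_stages(num_blocks, ways):
    """Number of merge stages needed to reduce num_blocks sorted blocks to one."""
    stages = 0
    while num_blocks > 1:
        num_blocks = -(-num_blocks // ways)  # ceiling division
        stages += 1
    return stages

N = 32 * 2**20   # assumption: 32 million read as 32 Mi elements
B = 128 * 2**10  # 128K 32-bit values per block
blocks = N // B

print(blocks)                   # 256
print(merge_stages(blocks, 2))  # 8 stages with 2-way merging
print(merge_stages(blocks, 4))  # 4 stages with 4-way merging
```

Under these assumptions the 4-way merge halves the number of passes over the data, from 8 to 4.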
Effects of Using SIMD Instructions

Branch misprediction rate.

Acceleration by SIMD instructions for sorting 16K random integers on one core of PowerPC 970MP
Performance for 32-bit Integers

Performance of the sequential version of each algorithm on a PowerPC 970MP core for sorting random 32-bit integers with various data sizes.
Performance for 32-bit Integers Contd.

Performance comparison on one PowerPC 970MP core for various input datasets with 32 million integers.
Performance for 32-bit Integers Contd.

The execution time of parallel versions of AA-sort and GPUTeraSort on up to 4 cores of PowerPC 970MP.
Performance for 32-bit Integers Contd.

Scalability with increasing number of cores on Cell BE for 32 million integers
Conclusions
• Describes a new parallel sorting algorithm
called Aligned Access Sort
• The algorithm does not involve any unaligned
memory accesses
• Evaluated on PowerPC 970MP and Cell
Broadband Engine Processors
• Demonstrated better scalability and
performance in both sequential and parallel
versions
Conclusions Contd.
• Evaluation was performed only on 32-bit integers
• Performance comparison was performed on a limited number of architectures
– Jatin Chhugani et al., "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture", Applications Research Lab, Corporate Technology Group, Intel Corporation, August 2008, Auckland, New Zealand

• Does not discuss how multiple threads cooperate on one merge operation when the number of blocks becomes smaller than the number of threads
Thank You.
