Aa sort-v4

AA-Sort: A New Parallel Sorting
Algorithm for Multi-Core SIMD
Processors
By: H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani
Presented By: M. Edirisinghe, H. Nawarathna

Content

• Introduction
• SIMD instruction set
• AA-sort algorithm
• In-core algorithm
• Out-of-core algorithm
• Sorting scheme in AA-sort
• Experimental results

Introduction
• High-performance processors provide multiple
hardware threads within one physical
processor with multiple cores and
simultaneous multithreading
• Many processors provide Single Instruction
Multiple Data (SIMD) instructions


SIMD Instructions
• Advantages:
– Data parallelism
– Fewer conditional branches in programs
(vector compare and vector select can be
used instead)
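
The compare-and-select idea can be sketched in plain Python. This is only a lane-by-lane emulation, assuming unsigned 32-bit values; the helper names `vec_cmpgt`, `vec_sel` and `vec_min_max` are illustrative stand-ins loosely modeled on the AltiVec operations, not code from the paper. On real hardware each helper would be a single instruction with no branch.

```python
MASK32 = 0xFFFFFFFF  # assumed lane width: unsigned 32-bit values


def vec_cmpgt(a, b):
    """Lane-wise compare: all-ones mask where a > b, zero elsewhere."""
    return [MASK32 if x > y else 0 for x, y in zip(a, b)]


def vec_sel(a, b, mask):
    """Lane-wise select: bits of b where the mask is set, bits of a elsewhere."""
    return [(x & ~m & MASK32) | (y & m) for x, y, m in zip(a, b, mask)]


def vec_min_max(a, b):
    """Compare-and-swap of two vectors without a data-dependent branch."""
    mask = vec_cmpgt(a, b)
    lo = vec_sel(a, b, mask)  # take b in the lanes where a > b
    hi = vec_sel(b, a, mask)  # take a in the lanes where a > b
    return lo, hi


lo, hi = vec_min_max([5, 1, 7, 3], [2, 4, 6, 8])
print(lo, hi)
```

The mask is computed once and reused for both selects, which is what lets a sorting kernel exchange four pairs of values per step without any conditional jump.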


SIMD Instruction Set
• The paper uses Vector Multimedia eXtension
(VMX, also known as AltiVec) instructions
• VMX provides a set of 128-bit vector registers
– Each register can hold four 32-bit values

• Useful VMX instructions for sorting:
– Vector compare
– Vector select
– Vector permute

Sorting Algorithms and SIMD
• Many sorting algorithms require unaligned or
element-wise memory access (e.g. quicksort)
• This incurs additional overhead and attenuates
the benefits of SIMD instructions


Paper’s Contribution
• Proposes Aligned-Access sort (AA-sort), a new
parallel sorting algorithm that exploits both
SIMD instructions and the thread-level
parallelism available on today’s multi-core
processors, with a computational
complexity of O(N log(N))


AA-Sort Algorithm
• Assumptions:
– First element of the array to be sorted is
aligned on a 128 bit boundary
– Number of elements in the array, N, is a
multiple of four


AA-Sort Algorithm
• Array of integer values a[N] is equivalent to an
array of vector integers va[N/4]
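
A minimal sketch of this equivalence, assuming four 32-bit lanes per vector as on the earlier slides; `as_vectors` is a hypothetical helper name. Element a[i] ends up in lane i % 4 of va[i // 4].

```python
def as_vectors(a, width=4):
    """View a flat array a[N] as N/width vectors of `width` lanes each."""
    if len(a) % width != 0:
        # The slides assume N is a multiple of four.
        raise ValueError("length must be a multiple of the vector width")
    return [a[i:i + width] for i in range(0, len(a), width)]


va = as_vectors([10, 11, 12, 13, 14, 15, 16, 17])
print(va)
```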


AA-Sort Algorithm
• Consists of two algorithms:
1. In-core sorting algorithm
2. Out-of-core sorting algorithm

• Phases of execution:
– Divide all of the data into blocks that fit into the
cache of the processor
– Sort each block with the in-core sorting algorithm
– Merge the sorted blocks with the out-of-core
sorting algorithm

Combsort
• Extension of bubble sort that eliminates "turtles"
(small values near the end of the array)
• Compares and swaps non-adjacent elements
• Improves performance over bubble sort
• Average computational complexity O(N log(N))
• Problems with SIMD instructions:
– Unaligned memory access
– Loop-carried dependencies
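
For reference, a scalar combsort can be sketched as follows. This is the textbook form, not the paper's SIMD variant; the shrink factor of 1.3 is a commonly used default and is an assumption here.

```python
def combsort(a, shrink=1.3):
    """Scalar combsort: bubble sort with a gap that shrinks toward 1."""
    a = list(a)
    gap = len(a)
    swapped = True
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(len(a) - gap):
            # a[i] and a[i + gap] sit at arbitrary offsets from each other,
            # so a vectorized load of a[i + gap .. i + gap + 3] is unaligned
            # in general; when gap < 4 an iteration also reads values that
            # the previous iteration may have just written.
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a


print(combsort([8, 3, 9, 1, 7, 2, 6, 4]))
```

The comment in the inner loop marks where the two SIMD problems listed above arise.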

In-Core Algorithm
• Execution steps:
1. Sort values within each vector in ascending
order
2. Execute combsort to sort the values into the
transposed order


In-Core Algorithm
• Use extended Combsort


In-Core Algorithm
3. Reorder the values from the transposed order
into the original order
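
The three steps can be sketched in Python as a lane-by-lane emulation. This is an illustrative reading of the slides, not the authors' code: the helper names `cmpswap` and `cmpswap_skew` are hypothetical, "transposed order" is assumed to mean that lane j of vector i holds the value of rank j * (N/4) + i, and the shrink factor 1.28 is the one the slides report for the implementation.

```python
SHRINK = 1.28  # shrink factor reported on the implementation slide


def cmpswap(va, i, j):
    """Lane-wise compare-and-swap: va[i] keeps the minima, va[j] the maxima."""
    a, b = va[i], va[j]
    swapped = any(x > y for x, y in zip(a, b))
    va[i] = [min(x, y) for x, y in zip(a, b)]
    va[j] = [max(x, y) for x, y in zip(a, b)]
    return swapped


def cmpswap_skew(va, i, j):
    """Skewed compare-and-swap: lane k of va[i] against lane k + 1 of va[j]."""
    a, b = va[i], va[j]
    swapped = False
    for k in range(3):
        if a[k] > b[k + 1]:
            a[k], b[k + 1] = b[k + 1], a[k]
            swapped = True
    return swapped


def in_core_sort(a):
    assert len(a) % 4 == 0, "N is assumed to be a multiple of four"
    n = len(a) // 4
    if n == 0:
        return []
    # Step 1: sort the four values inside each vector.
    va = [sorted(a[4 * i:4 * i + 4]) for i in range(n)]
    # Step 2: combsort on whole vectors, producing the transposed order.
    gap = int(n / SHRINK)
    while gap > 1:
        for i in range(n - gap):
            cmpswap(va, i, i + gap)
        for i in range(n - gap, n):           # wrap around into the next lane
            cmpswap_skew(va, i, i + gap - n)
        gap = int(gap / SHRINK)
    swapped = True
    while swapped:                            # finish with gap-1 passes
        swapped = False
        for i in range(n - 1):
            swapped |= cmpswap(va, i, i + 1)
        swapped |= cmpswap_skew(va, n - 1, 0)
    # Step 3: read lane by lane to go from transposed to original order.
    return [va[i][lane] for lane in range(4) for i in range(n)]


print(in_core_sort([8, 7, 6, 5, 4, 3, 2, 1]))
```

Every comparison here pairs whole vectors at vector-aligned positions, which is the property that removes the unaligned accesses of scalar combsort.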


In-Core Algorithm
• All 3 steps can be executed using SIMD
instructions without unaligned memory access
• Computational complexity dominated by step
2
– Average O(N log N)
– Worst case O(N^2)

• Poor memory access locality
– Performance degrades if the data cannot fit into
the cache of the processor

Out of core Algorithm
• Uses a merge of two sorted vectors as its basic operation
– a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are each sorted
– [b:a] = vector_merge(a, b) holds all eight values
in sorted order (c0 ... c7)

[Figure: sorted vectors a (a0..a3) and b (b0..b3) are combined by vector_merge(a, b) into the sorted sequence c0..c7]

Dataflow of Merge
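[Figure: dataflow of vector_merge. The sorted inputs a0..a3 and b0..b3 pass through stages of compare (<) operations, producing intermediate values min00/max00 ... min33/max33 and finally the sorted output]

• The merge takes lg(P) + 1 stages,
where P is the number of elements in a vector
• Here P = 4, so lg(P) + 1 = 3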
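
A three-stage merge of two sorted four-element vectors can be sketched with Batcher's odd-even merge network. That the slide's dataflow corresponds to exactly this network is an assumption; the comparator positions in `STAGES` are the standard ones for merging two sorted runs of four, and the scalar loop only emulates what would be vector min/max and permute instructions.

```python
# Comparator stages of an odd-even merge network for two sorted runs of 4.
# Indices 0-3 hold a, indices 4-7 hold b.
STAGES = [
    [(0, 4), (1, 5), (2, 6), (3, 7)],
    [(2, 4), (3, 5)],
    [(1, 2), (3, 4), (5, 6)],
]


def vector_merge(a, b):
    """Merge sorted 4-vectors a and b; return (lower four, upper four)."""
    x = list(a) + list(b)
    for stage in STAGES:
        # Comparators within one stage touch disjoint positions, so a SIMD
        # implementation can evaluate a whole stage at once.
        for lo, hi in stage:
            if x[lo] > x[hi]:
                x[lo], x[hi] = x[hi], x[lo]
    return x[:4], x[4:]


print(vector_merge([1, 3, 5, 7], [2, 4, 6, 8]))
```

The sequence of comparisons is fixed and independent of the data, which is why the network needs no conditional branches.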


Out of core Algorithm
• No unaligned memory accesses
• Better memory access locality compared with
in-core sorting algorithm
– Higher performance when data cannot fit in the
cache
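
One way to merge two sorted arrays using only whole-vector loads and stores is sketched below. It is an illustration under stated assumptions, not the paper's code: both inputs are non-empty with lengths that are multiples of four, and `vector_merge` is a stand-in that uses Python's `sorted` where a real implementation would use a SIMD merge network.

```python
def vector_merge(a, b):
    """Stand-in for the SIMD merge of two sorted 4-vectors."""
    x = sorted(a + b)
    return x[:4], x[4:]


def merge_sorted_arrays(a, b):
    """Merge sorted arrays a and b, reading and writing four values at a time."""
    va = [a[i:i + 4] for i in range(0, len(a), 4)]
    vb = [b[i:i + 4] for i in range(0, len(b), 4)]
    ia = ib = 1
    vmin, vmax = va[0], vb[0]
    out = []
    while True:
        vmin, vmax = vector_merge(vmin, vmax)
        out.extend(vmin)  # the four smallest values in hand are final
        # Load the next vector from the input whose next head is smaller.
        if ia < len(va) and (ib == len(vb) or va[ia][0] <= vb[ib][0]):
            vmin = va[ia]
            ia += 1
        elif ib < len(vb):
            vmin = vb[ib]
            ib += 1
        else:
            break
    out.extend(vmax)  # the largest four values remain in vmax
    return out


print(merge_sorted_arrays([1, 2, 3, 100], [4, 5, 6, 7, 8, 9, 10, 11]))
```

Both inputs and the output are traversed strictly front to back, which is the access pattern behind the better locality noted above.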


Overall AA Sort Scheme
• Divide all of the data to be sorted into blocks
that fit in the cache or the local memory of
the processor
• Sort each block with the in-core sorting
algorithm in parallel using multiple threads,
where each thread processes an independent
block.
• Merge the sorted blocks with the out-of-core
sorting algorithm using multiple threads
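
The overall structure can be sketched with a thread pool. This shows only the shape of the scheme: `sorted` stands in for the in-core algorithm, `heapq.merge` stands in for the out-of-core merge, and the function name `aa_sort_scheme` and its parameters are hypothetical. Python threads will not give real parallel speedup for this work, so the pool only illustrates how blocks and merges are distributed.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def aa_sort_scheme(data, block_size, threads=4):
    # Phase 1: divide the data into blocks of at most block_size elements.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Phase 2: each block is sorted independently of the others.
        runs = list(pool.map(sorted, blocks))
        # Phase 3: merge pairs of sorted runs, stage by stage.
        while len(runs) > 1:
            pairs = [runs[i:i + 2] for i in range(0, len(runs), 2)]
            runs = list(pool.map(lambda pair: list(merge(*pair)), pairs))
    return runs[0] if runs else []


print(aa_sort_scheme([9, 3, 7, 1, 8, 2, 6, 4, 5, 0], block_size=3))
```

In the last stages there are fewer pairs than threads, so some workers sit idle in this sketch.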

Overall AA Sort Scheme Contd.
Number of data elements = N
Number of elements per block = B
Number of blocks = N/B

Considering the in-core sorting phase:
Computational time for the in-core sorting of each block is proportional
to B log(B)
Complexity of in-core sorting over all N/B blocks
= O(N), treating the block size B as a constant
Considering the out-of-core sorting phase:
Merging the sorted blocks in out-of-core sorting involves log(N/B) stages
Computational complexity of each stage = O(N)
Complexity of out-of-core sorting
= O(N log(N))
Hence,
Computational complexity of entire AA-sort = O(N log(N))
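
Written out as a derivation, under the assumption that the block size B is fixed by the cache size and so does not grow with N (the T symbols are introduced here for illustration only):

```latex
\begin{align*}
T_{\text{in-core}}     &= \frac{N}{B}\cdot O(B\log B) = O(N\log B) = O(N)
                          && \text{($B$ constant)}\\
T_{\text{out-of-core}} &= \log\!\left(\frac{N}{B}\right)\cdot O(N) = O(N\log N)\\
T_{\text{AA-sort}}     &= T_{\text{in-core}} + T_{\text{out-of-core}} = O(N\log N)
\end{align*}
```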

Overall AA Sort Scheme Contd.

An example of the entire AA-sort process,
where number of blocks (N/B) = 8 and the number of threads = 4


Experimental Setup
• PowerPC 970MP System
– Two 2.5 GHz dual-core processors
– 8GB system memory
– Each core had 1MB L2 cache memory
– Linux kernel 2.6.20

• System with Cell BE processors
– Two 2.4 GHz processors
– 1GB system memory
– Only SPE cores were used (16 SPE cores with
256KB local memory each)
– Linux kernel 2.6.15

Implementation
• Block size: half of the L2 cache on the PowerPC
970MP, half of the local memory on the SPE
– 512KB (128K 32-bit values) on PowerPC 970MP
– 128KB (32K 32-bit values) on the SPE

• Shrink factor: 1.28
• Multiway merge technique for out-of-core
sorting
– 4-way merge
– Number of merging stages reduced from log2(N/B)
to log4(N/B)
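
The effect on the number of stages can be checked with a small calculation. The data size N = 2^25 below is a hypothetical example; B = 128K values is the PowerPC 970MP block size given above, and `merge_stages` is an illustrative helper.

```python
def merge_stages(num_blocks, ways):
    """Count the merge stages needed to reduce num_blocks sorted runs to one."""
    stages = 0
    runs = num_blocks
    while runs > 1:
        runs = -(-runs // ways)  # ceiling division: runs left after one stage
        stages += 1
    return stages


N = 2 ** 25          # hypothetical data size
B = 128 * 1024       # 128K values per block
blocks = N // B      # 256 blocks
print(merge_stages(blocks, 2), merge_stages(blocks, 4))
```

Each stage streams all N values through memory once, so halving the number of stages roughly halves the memory traffic of the merge phase.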

Effects of Using SIMD Instructions

Branch misprediction rate.

Acceleration by SIMD instructions for sorting 16 K random integers
on one core of PowerPC 970MP.


Performance for 32 bit Integers

Performance of sequential version of each algorithm on a PowerPC
970MP core for sorting random 32-bit integers with various data sizes.

Performance for 32 bit Integers Contd.

Performance comparison on one PowerPC 970MP core for various input
datasets with 32 million integers.



The execution time of parallel versions of AA-sort and GPUTeraSort on
up to 4 cores of PowerPC 970MP.


Scalability with increasing number of cores on Cell BE for 32 million
integers

Conclusions
• Describes a new parallel sorting algorithm
called Aligned-Access sort (AA-sort)
• The algorithm does not involve any unaligned
memory accesses
• Evaluated on PowerPC 970MP and Cell
Broadband Engine processors
• Demonstrated better performance and
scalability than the algorithms it was compared
with, in both sequential and parallel versions

Conclusions Contd.
• Evaluation was performed only on 32-bit integers
• Performance comparison was performed on a
limited number of architectures
– Jatin Chhugani et al., "Efficient Implementation of Sorting on Multi-Core
SIMD CPU Architecture", Applications Research Lab, Corporate Technology
Group, Intel Corporation, August 2008, Auckland, New Zealand

• Does not discuss how multiple threads cooperate
on one merge operation when the number of blocks
becomes smaller than the number of threads
