The document discusses programming models for heterogeneous chips that combine CPUs and GPUs. It motivates using both device types through examples of hardware with integrated GPUs that could benefit from CPU-GPU collaboration, outlines hardware features of chips from Intel, AMD, Samsung and Qualcomm, and reviews programming models such as OpenCL, HSA, C++AMP, SYCL, StarPU, Qualcomm MARE and Intel Concord that aim to support programming across heterogeneous devices.
1. Programming Models for
Heterogeneous Chips
Rafael Asenjo
Dept. of Computer Architecture
University of Malaga, Spain.
2. Agenda
• Motivation
• Hardware
– Heterogeneous chips
– Integrated GPUs
– Advantages
• Software
– Programming models for heterogeneous systems
– Programming models for heterogeneous chips
– Our approach based on TBB
3. Motivation
• A new mantra: Power and Energy saving
• In all domains
4. Motivation
• GPUs came to the rescue:
– Massive data-parallel code at a low price in terms of power
– Supercomputers and servers: NVIDIA
• GREEN500 Top 15
• TOP500:
– 45 systems w. NVIDIA
– 19 systems w. Xeon Phi
9. Motivation
• Plenty of integrated GPUs in mobile devices.
Samsung Exynos 5 Octa (2–6 W): Samsung Galaxy S5 SM-G900H, Samsung Galaxy Note Pro 12
http://www.samsung.com/us/showcase/galaxy-smartphones-and-tablets/
10. Motivation
• Plenty of integrated GPUs in mobile devices.
Qualcomm Snapdragon 800 (2–6 W): Nexus 5, Nokia Lumia, Sony Xperia
https://www.qualcomm.com/products/snapdragon/processors/800
11. Motivation
• Plenty of room for improvements
– Want to make the most out of the CPU and the GPU
– Lack of programming models
– “Heterogeneous exec., but homogeneous programming”
– Huge potential impact
• Servers and supercomputing market
– Google: porting the search engine to ARM and PowerPC
– AMD Seattle Server-on-a-Chip based on Cortex-A57 (v8)
– Mont Blanc project: supercomputer made of ARM
• Commodity processors took over supercomputing once
• Be prepared for when mobile processors do the same
– E4’s EK003 Servers: X-Gene ARM A57 (8 cores) + K20
12. Agenda
• Motivation
• Hardware
– Heterogeneous chips
– Integrated GPUs
– Advantages
• Software
– Programming models for heterogeneous systems
– Programming models for heterogeneous chips
– Our approach based on TBB
15. Intel Haswell
• Three frequency domains
– Cores
– GPU
– LLC and Ring
• On the older Ivy Bridge
– Only 2 domains
– Cores and LLC share one domain
– GPU-only work → CPU frequency ↑
• OpenCL driver only for Windows
• PCM as power monitor
http://www.anandtech.com/show/7744/intel-reveals-new-haswell-details-at-isscc-2014
17. Intel Iris Graphics
• GPU slice
– 2 sub-slices
– 20 EUs (GPU cores)
– Local L3 cache (256 KB)
– 16 barriers per sub-slice
– 2 × 64 KB local memory
• 2 GPU slices = 40 EUs
• Up to 7 in-flight EU threads
• 8-, 16- or 32-wide SIMD per EU thread
• In flight: 7 × 40 × 32 = 8960 work-items
• Each EU → 2 × 4-wide FPUs
– 40 × 8 × 2 (fmadd) = 640 simultaneous ops
– at 1.3 GHz → 832 GFLOPS
20. AMD Kaveri
• Steamroller microarchitecture
– Each module → 2 "cores"
– 2 threads, each with
• 4× superscalar INT
• 2× SIMD4 FP
– 3.7 GHz
• Max GFLOPS: 3.7 GHz × 4 threads × 4-wide × 2 fmadd = 118 GFLOPS
21. AMD Graphics Core Next (GCN)
• In Kaveri, the GCN GPU takes 47% of the die
– 8 Compute Units (CUs)
– Each CU: 4 × SIMD16
– Each SIMD16: 16 lanes
– Total: 512 FPUs
– 720 MHz
• Max GFLOPS: 0.72 GHz × 512 FPUs × 2 fmadd = 737 GFLOPS
• CPU + GPU → 855 GFLOPS
22. OpenCL execution on GCN
Work-group → wavefronts (64 work-items) → pools
[Diagram: a work-group's wavefronts distributed over SIMD0–SIMD3 of CU0, one pool per row]
• 4 pools: 4 wavefronts in flight per SIMD
• 4 clock cycles to execute each wavefront
23. HSA (Heterogeneous System Architecture)
• HSA Foundation's goal: productivity on heterogeneous HW
– CPU, GPU, DSPs, …
• Scheduled in three phases
• Second phase: Kaveri
– hUMA
– Same pointers used on CPU and GPU
– Cache coherency
24. Kaveri’s main HSA features
• hUMA
– Shared and coherent view of up to 32GB
• Heterogeneous queuing (hQ)
– CPU and GPU can create and dispatch work
25. HSA Motivation
• Too many steps to get the job done
[Diagram: job dispatch steps across Application / OS / GPU lanes: transfer buffer to GPU → copy/map memory → queue job → schedule job → start job → finish job → schedule application → get buffer → copy/map memory]
http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
26. Requirements
• Lower-overhead job dispatch requires four mechanisms:
– Shared Virtual Memory
• Send pointers (not data) back and forth between HSA agents.
– System Coherency
• Data accesses to global memory segment from all HSA Agents
shall be coherent without the need for explicit cache maintenance
– Signaling
• HSA Agents can directly create/access signal objects.
– Signaling a signal object (this will wake up HSA agents waiting
upon the object)
– Query current object
– Wait on the current object (various conditions supported).
– User mode queueing
• Enables user space applications to directly, without OS intervention,
enqueue jobs (“Dispatch Packets”) for HSA agents.
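A minimal sketch of how two of these mechanisms (signaling and user-mode queueing) surface in the HSA runtime API, assuming hsa_init() has been called and gpu_agent was obtained via hsa_iterate_agents; the AQL packet write itself is elided:

#include <stdint.h>
#include "hsa.h"

void dispatch_sketch(hsa_agent_t gpu_agent) {
  /* Signaling: create a signal object with initial value 1 */
  hsa_signal_t done;
  hsa_signal_create(1, 0, NULL, &done);

  /* User-mode queueing: a queue whose ring buffer lives in user space */
  hsa_queue_t *q;
  hsa_queue_create(gpu_agent, 4096, HSA_QUEUE_TYPE_SINGLE,
                   NULL, NULL, UINT32_MAX, UINT32_MAX, &q);

  /* ... write an AQL "Dispatch Packet" (completion_signal = done)
     into the queue at packet_index, without any OS call ... */
  uint64_t packet_index = 0;                                   /* placeholder */
  hsa_signal_store_relaxed(q->doorbell_signal, packet_index);  /* ring doorbell */

  /* block until the GPU decrements the signal to 0 */
  hsa_signal_wait_acquire(done, HSA_SIGNAL_CONDITION_EQ, 0,
                          UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
}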
28. HSA Shared Virtual Memory
• Common Virtual Memory for all HSA agents
[Diagram: CPU0 and the GPU share one virtual address space; each translates VA→PA to the same physical memory]
29. After adding SVM
• With SVM we get rid of copy/map memory back and forth
30. After adding coherency
• If the CPU stores a value through a global pointer, the GPU sees that value
31. After adding signaling
• The CPU can wait on a signal object
32. After adding user-level enqueuing
• The user directly enqueues the job without OS intervention
33. Success!!
• That's definitely simpler, and with less overhead
[Diagram: only three steps remain: the application queues the job, the GPU starts and finishes it]
34. OpenCL 2.0
• OpenCL 2.0 will contain most of the features of HSA
– Intel's version supports HSA for Core M (Broadwell), on Windows.
– AMD's version does not support fine-grain SVM.
• AMD 1.2 beta driver
– http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-1-2-beta-driver/
– Only for Windows 8.1
– Example: allocating “Coherent Host Memory” on Kaveri:
#include <CL/cl_ext.h>   // Implements SVM
#include "hsa_helper.h"  // AMD helper functions
…
cl_svm_mem_flags_amd flags = CL_MEM_READ_WRITE |
                             CL_MEM_SVM_FINE_GRAIN_BUFFER_AMD |
                             CL_MEM_SVM_ATOMICS_AMD;
volatile std::atomic_int *data;
data = (volatile std::atomic_int *)
       clSVMAlloc(context, flags, MAX_DATA * sizeof(volatile std::atomic_int), 0);
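A hedged usage sketch (not from the talk) of what fine-grain SVM plus atomics buys: the CPU can spin on a flag that a running GPU kernel updates, with no map/unmap round-trips. It assumes a valid queue and a kernel whose OpenCL-C code ends with atomic_store(&data[0], 1):

size_t gsize = MAX_DATA;
data[0].store(0);
clSetKernelArgSVMPointer(kernel, 0, (void *)data);   // pass the SVM pointer
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
clFlush(queue);
while (data[0].load(std::memory_order_acquire) == 0)
  ;   // the GPU's store becomes visible through the coherent SVM buffer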
35. Samsung Exynos 5
• Odroid XU-E and XU3 bare boards (~$180)
• Sports an Exynos 5 Octa
– big.LITTLE architecture
– big: Cortex-A15 quad
– LITTLE: Cortex-A7 quad
• Exynos 5 Octa 5410
– Only 4 CPU cores active at a time
– GPU: PowerVR SGX544MP3 (Imagination Technologies)
• 3 GPU cores at 533 MHz → 51 GFLOPS
• Exynos 5 Octa 5422
– All 8 CPU cores can work simultaneously
– GPU: ARM Mali-T628 MP6
• 6 GPU cores at 533 MHz → 102 GFLOPS
36. PowerVR SGX544MP3
• OpenCL 1.1 for Android
• Some limitations:
– Compute units: 1
– Max work-group size: 1
– Local memory: 1 KB
– Peak MFLOPS:
• 12 ops per clock cycle
• 3 SIMD ALUs × 4-wide
• Power monitor:
– 4 × INA231 monitors
• A15, A7, GPU, Mem.
• Instantaneous power
• Sampled every 260 ms
[Diagram: SGX architecture]
41. Snapdragon 800
• CPU: Quad-core Krait 400 up to 2.26GHz (ARMv7 ISA)
– Similar to Cortex-A15. 11 stage integer pipeline with 3-way
decode and 4-way out-of-order speculative issue superscalar
execution
– Pipelined VFPv4 and 128-bit wide NEON (SIMD)
– 4 KB + 4 KB direct mapped L0 cache
– 16 KB + 16 KB 4-way set associative L1 cache
– 2 MB (quad-core) L2 cache
• GPU: Adreno 330, 450MHz
– OpenGL ES 3.0, DirectX, OpenCL 1.2, RenderScript
– 32 Execution Units. Each with 2 x SIMD4 units
• DSP: Hexagon 600MHz
42. Measuring power
• Snapdragon Performance Visualizer
• Trepn Profiler
• Power Tutor
– Tuned for Nexus One
– Power model with 5% precision
– Open source
43. More development boards
• Jetson TK1 board
– Tegra K1
– Kepler GPU with 192 CUDA cores
– 4-Plus-1 quad-core ARM Cortex A15
– Linux + CUDA
– $180
• Arndale
– Exynos 5420
– big.LITTLE (A15 + A7)
– GPU Mali T628 MP6
– Linux + OpenCL
– $200
• …
44. Advantages of integrated GPUs
• Discrete and integrated GPUs: different goals
– NVIDIA Kepler: 2880 CUDA cores, 235W, 4.3 TFLOPS
– Intel Iris 5200: 40 EU x 8 SIMD, 15-28W, 0.83 TFLOPS
– PowerVR: 3 EU x 16 SIMD, < 1W, 0.051 TFLOPS
• Higher bandwidth between CPU and GPU.
– Shared DRAM
• Avoid PCI data transfer
– Shared LLC (Last Level Cache)
• Data coherence in some cases…
• CPU and GPU may have similar performance
– It’s more likely that they can collaborate
• Cheaper!
46. Agenda
• Motivation
• Hardware
– Heterogeneous chips
– Integrated GPUs
– Advantages
• Software
– Programming models for heterogeneous systems
– Programming models for heterogeneous chips
– Our approach based on TBB
47. Programming models for heterogeneous systems
• Targeted at a single device
– CUDA (NVIDIA)
– OpenCL (Khronos Group standard)
– OpenACC (C, C++ or Fortran + directives → OpenMP 4.0)
– C++AMP (Microsoft's extension of C++; HSA recently announced its own version)
– RenderScript (Google's Java API for Android)
– ParallDroid (Java + directives, from ULL, Spain)
– Many more (SYCL, Numba Python, IBM Java, Matlab, R, JavaScript, …)
• Targeted at several devices (discrete GPUs)
– Qilin (C++ and Qilin API compiled to TBB+CUDA)
– OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler)
– XKaapi
– StarPU
• Targeted at several devices (integrated GPUs)
– Qualcomm MARE
– Intel Concord
48. OpenCL on mobile devices
http://streamcomputing.eu/blog/2014-06-30/opencl-support-recent-android-smartphones/
49. OpenCL running on CPU
[Chart: execution time (ms) of an object-detection kernel on a 3.3 GHz Ivy Bridge CPU for the Base, Auto, T-Auto, SSE, AVX-SSE, AVX and OpenCL code versions, with the percentage differences between versions annotated]
The AVX code version is
- 1.8× faster than OpenCL
- 1.8× more Halstead effort
“Easy, Fast and Energy Efficient Object Detection on Heterogeneous On-Chip
Architectures”, E. Totoni, M. Dikmen, M. J. Garzaran, ACM Transactions on Architecture
and Code Optimization (TACO),10(4), December 2013.
50. Complexities of AVX Intrinsics
__m256 image_cache0 = _mm256_broadcast_ss(&fr_ptr[pixel_offsets[0]]);  // Load
curr_filter = _mm256_load_ps(&fb_array[fi]);
temp_sum = _mm256_add_ps(_mm256_mul_ps(image_cache7, curr_filter),
                         temp_sum);                                    // Multiply-add
temp_sum2 = _mm256_insertf128_ps(temp_sum,
                                 _mm256_extractf128_ps(temp_sum, 1), 0);  // Copy high to low
cpm = _mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS);                   // Compare
r = _mm256_movemask_ps(cpm);

if (r & (1<<1)) {
  best_ind = filter_ind + 2;                        // Store index
  int control = 1 | (1<<2) | (1<<4) | (1<<6);
  max_fil = _mm256_permute_ps(temp_sum2, control);  // Store max
  r = _mm256_movemask_ps(_mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS));
}
56. C++AMP
• C++ Accelerated Massive Parallelism
• Pioneered by Microsoft
– Requirements: Windows 7 + Visual Studio 2012
• Followed by Intel's experimental implementation
– C++ AMP on Clang/LLVM and OpenCL (AWOL since 2013)
• Now HSA Foundation taking the lead
• Keywords: restrict(amp), array_view, parallel_for_each, …
– Example: SUM = A + B (2D arrays); a sketch follows below
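A minimal sketch of the SUM = A + B example in C++ AMP (illustrative code, not the talk's; assumes row-major M×N matrices flattened into vectors):

#include <amp.h>
#include <vector>
using namespace concurrency;

void sum_2d(const std::vector<float>& A, const std::vector<float>& B,
            std::vector<float>& SUM, int M, int N) {
  array_view<const float, 2> a(M, N, A);   // wrap host data
  array_view<const float, 2> b(M, N, B);
  array_view<float, 2> s(M, N, SUM);
  s.discard_data();                        // no need to copy SUM in
  parallel_for_each(s.extent, [=](index<2> idx) restrict(amp) {
    s[idx] = a[idx] + b[idx];              // runs on the accelerator
  });
  s.synchronize();                         // copy the result back
}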
58. SYCL’s flavour: A[i]=B[i]*2
Work-in-progress implementations:
- AMD: triSYCL → https://github.com/amd/triSYCL
- Codeplay: http://www.codeplay.com/
Advantages:
1. Easy to understand the concept of work-groups
2. Performance-portable between CPU and GPU
3. Barriers are automatically deduced
A sketch of the kernel follows below.
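A minimal sketch of the A[i] = B[i]*2 kernel in SYCL 1.2 style (illustrative; the kernel name tag and default queue selection are assumptions):

#include <CL/sycl.hpp>
#include <vector>

int main() {
  const size_t N = 1024;
  std::vector<float> A(N), B(N, 1.0f);
  {
    cl::sycl::queue q;   // default device
    cl::sycl::buffer<float, 1> bufA(A.data(), cl::sycl::range<1>(N));
    cl::sycl::buffer<float, 1> bufB(B.data(), cl::sycl::range<1>(N));
    q.submit([&](cl::sycl::handler& cgh) {
      auto a = bufA.get_access<cl::sycl::access::mode::write>(cgh);
      auto b = bufB.get_access<cl::sycl::access::mode::read>(cgh);
      cgh.parallel_for<class scale>(cl::sycl::range<1>(N),
          [=](cl::sycl::id<1> i) { a[i] = b[i] * 2.0f; });
    });
  }   // buffers synchronize back to A and B here
  return 0;
}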
59. StarPU
• A runtime system for heterogeneous architectures
• Dynamically schedules tasks on all processing units
– Sees a pool of heterogeneous cores
• Avoids unnecessary data transfers between accelerators
– Software SVM for heterogeneous machines
[Diagram: CPUs and GPUs, each with its own memory (M), cooperating on A = A + B]
60. Overview of StarPU
• Maximizing PU occupancy, minimizing data transfers
• Ideas:
– Accept tasks that may have multiple implementations
• Together with potential inter-dependencies
– Leads to a dynamic acyclic graph of tasks
– Provide a high-level data management layer (Virtual Shared Memory, VSM)
• Application should only describe
– which data may be accessed by tasks
– how data may be divided
[Diagram: StarPU software stack: Applications / Parallel Compilers / Parallel Libraries / StarPU / Drivers (CUDA, OpenCL) / CPU, GPU, …]
61. Task scheduling
• Dealing with heterogeneous hardware accelerators
• Tasks =
– Data input & output
– Dependencies with other tasks
– Multiple implementations
• E.g. CUDA + CPU
• Scheduling hints
• StarPU provides an Open Scheduling platform
– Scheduling algorithm = plug-ins
– Predefined set of popular policies
[Diagram: software stack as above; a task f(A_RW, B_R) with cpu, gpu and spu implementations is submitted to StarPU]
62. Task scheduling
• Predefined set of popular policies
• Eager scheduler
– First come, first served policy
– Only one queue
• Work-stealing scheduler
– Load-balancing policy
– One queue per worker
• Priority scheduler
– Describes the relative importance of tasks
– One queue per priority
[Diagrams: Eager (one shared queue), work stealing (one queue per worker) and priority (queues prio0–prio2) feeding three CPUs and two GPUs]
63. Task scheduling
• Predefined set of popular policies
• Dequeue Model (DM) scheduler
– Uses codelet performance models
• Kernel calibration on each available computing device
– Raw history model of kernels' past execution times
– Refined models using regression on kernels' execution-time history
• Dequeue Model Data Aware (DMDA) scheduler
– Data transfer cost vs. kernel offload benefit
– Transfer cost modelling
– Bus calibration
[Diagrams: DM and DMDA schedulers placing a task on the worker (cpu1–cpu3, gpu1–gpu2) with the earliest predicted finish time; DMDA also accounts for transfer time]
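Which of these policies runs is selected at initialization time; a minimal sketch assuming StarPU's standard STARPU_SCHED environment variable (values include eager, ws, prio, dm, dmda):

#include <stdlib.h>
#include <starpu.h>

int main(void) {
  setenv("STARPU_SCHED", "dmda", 1);   /* pick Dequeue Model Data Aware */
  if (starpu_init(NULL) != 0) return 1;
  /* ... register data and insert tasks as in the example below ... */
  starpu_shutdown();
  return 0;
}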
65. Terminology
• A Codelet. . .
– . . . relates an abstract computation kernel to its implementation(s)
– . . . can be instantiated into one or more tasks
– . . . defines characteristics common to a set of tasks
• A Task. . .
– . . . is an instantiation of a Codelet
– . . . atomically executes a kernel from its beginning to its end
– . . . receives some input
– . . . produces some output
• A Data Handle. . .
– . . . designates a piece of data managed by StarPU
– . . . is typed (vector, matrix, etc.)
– . . . can be passed as input/output for a Task
66. Basic Example: Scaling a Vector
Declaring a Codelet
struct starpu_codelet scal_cl = {
  .cpu_funcs = { scal_cpu_f, NULL },   /* kernel functions */
  .cuda_funcs = { scal_cuda_f, NULL },
  .nbuffers = 1,                       /* number of data pieces */
  .modes = { STARPU_RW },              /* data access mode */
};

Kernel functions
void scal_cpu_f(void *buffers[], void *cl_arg) {   /* kernel function prototype */
  struct starpu_vector_interface *vector_handle = buffers[0];   /* retrieve data handle */
  float *vector = STARPU_VECTOR_GET_PTR(vector_handle);         /* get pointer from data handle */
  float *ptr_factor = cl_arg;                                   /* get small-size inline data */
  unsigned i;
  for (i = 0; i < NX; i++)                                      /* do the computation */
    vector[i] *= *ptr_factor;
}
void scal_cuda_f(void *buffers[], void *cl_arg) { …
}
67. Basic Example: Scaling a Vector
Main code
float factor = 3.14;
float vector1[NX];
float vector2[NX];
starpu_data_handle_t vector_handle1;   /* declare data handles */
starpu_data_handle_t vector_handle2;
/* ..... */
/* register pieces of data and get the handles (now under StarPU control) */
starpu_vector_data_register(&vector_handle1, 0, (uintptr_t)vector1,
                            NX, sizeof(vector1[0]));
starpu_vector_data_register(&vector_handle2, 0, (uintptr_t)vector2,
                            NX, sizeof(vector2[0]));
/* non-blocking task submits (params: codelet, StarPU-managed data, small-size inline data) */
starpu_task_insert(&scal_cl, STARPU_RW, vector_handle1,
                   STARPU_VALUE, &factor, sizeof(factor), 0);
starpu_task_insert(&scal_cl, STARPU_RW, vector_handle2,
                   STARPU_VALUE, &factor, sizeof(factor), 0);
/* wait for all tasks submitted so far */
starpu_task_wait_for_all();
/* unregister pieces of data (the handles are destroyed;
   the vectors are now back under user control) */
starpu_data_unregister(vector_handle1);
starpu_data_unregister(vector_handle2);
/* ..... */
68. Qualcomm MARE
• MARE is a programming model and a runtime system that
provides simple yet powerful abstractions for parallel, power-efficient
software
– Simple C++ API allows developers to express concurrency
– User-level library that runs on any Android device, and on Linux,
Mac OS X, and Windows platforms
• The goal of MARE is to reduce the effort required to write
apps that fully utilize heterogeneous SoCs
• Concepts:
– Tasks are units of work that can be asynchronously executed
– Groups are sets of tasks that can be canceled or waited on
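MARE's own calls are not reproduced here; as a rough analogue only, the same two concepts look like this with TBB's task_group (an illustration of tasks and groups, not MARE code):

#include "tbb/task_group.h"

int main() {
  tbb::task_group g;                  // a "group" of asynchronous tasks
  g.run([] { /* ... task 1 ... */ }); // tasks execute asynchronously
  g.run([] { /* ... task 2 ... */ });
  // g.cancel();                      // a group can be canceled as a unit...
  g.wait();                           // ...or waited on as a whole
  return 0;
}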
72. MARE departures
• Similarities with TBB
– Based on tasks and 2-level API (task level and templates)
• pfor_each, ptransform, pscan, …
• Synchronous Dataflow classes ≈ TBB’s Flow Graphs
– Concurrent data structures: queue, stack, …
• Departures
– Expression of dependencies is first class
– Flexible group membership and work or group cancellation
– Optimized for some Qualcomm chips
• Power classes:
– Static: mare::power::mode {efficient, saver, …}
– Dynamic: mare::power::set_goal(desired, tolerance)
• Aware of the mobile architecture: aggressive power management
– Cores can be shut down or affected by DVFS
73. MARE results
• Zoomm web browser implemented on top of MARE
C. Cascaval et al. ZOOMM: a parallel web browser engine for multicore mobile devices. In Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 271–280, 2013.
74. MARE results
• Bullet Physics parallelized with MARE
Courtesy: Calin Cascaval
75. Intel Concord
• C++ heterogeneous programming framework for integrated
CPU and GPU processors
– Shared Virtual Memory (SVM) in software
– Adapts existing data-parallel C++ constructs to heterogeneous
computing using TBB
– Available open source as Intel Heterogeneous Research
Compiler (iHRC) at https://github.com/IntelLabs/iHRC/
• Papers:
– Rajkishore Barik, Tatiana Shpeisman, et al. Efficient mapping of
irregular C++ applications to integrated GPUs. CGO 2014.
– Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian
Lewis, Chunling Hu, and Keshav Pingali. Adaptive
heterogeneous scheduling on integrated GPUs. PACT 2014.
80. SVM translation in OpenCL code
• svm_const is a runtime constant and is computed once
• Every CPU pointer before dereference on the GPU is
converted into GPU address space using AS_GPU_PTR
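A hedged sketch of the translation just described (an illustrative macro; the code iHRC actually generates may differ). The GPU rebases each shared CPU pointer by the runtime constant svm_const before dereferencing it:

/* OpenCL C */
struct node { ulong next; };   /* `next` holds a CPU virtual address */

#define AS_GPU_PTR(T, p) ((__global T *)((ulong)(p) - svm_const))

__kernel void walk(ulong head, ulong svm_const, __global int *count) {
  int n = 0;
  ulong p = head;
  while (p != 0) {
    __global struct node *gp = AS_GPU_PTR(struct node, p);
    p = gp->next;   /* follow a CPU pointer on the GPU */
    n++;
  }
  *count = n;
}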
83. Heterogeneous execution on both devices
• Iteration space distributed among available devices
• Problem: find the best data partition
• Example: Barnes-Hut and Facedetect relative execution time
– Varying the amount of work offloaded to the GPU
– For BH the optimum is 40% of the work carried out on the GPU
– For FD the optimum is 0% of the work carried out on the GPU
84. Partitioning based on on-line profiling
Naïve profiling:
– Assign a chunk to the CPU and a chunk to the GPU
– Compute the chunk on the CPU, compute the chunk on the GPU
– Barrier
– Partition the rest of the iteration space according to relative speeds
Asymmetric profiling:
– Assign a chunk just to the GPU
– The CPU computes; the GPU computes its chunk
– When the GPU is done, partition the rest of the iteration space according to relative speeds
A sketch of the asymmetric scheme follows below.
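A minimal sketch of the asymmetric scheme (run_on_gpu() and cpu_rate() are assumed helpers; the CPU workers draining iterations concurrently are not shown):

#include <chrono>
#include <cstddef>

void run_on_gpu(std::size_t begin, std::size_t end);   // assumed offload helper
double cpu_rate();                                     // CPU items/s measured meanwhile

void asymmetric_partition(std::size_t n, std::size_t gpu_chunk) {
  auto t0 = std::chrono::steady_clock::now();
  run_on_gpu(0, gpu_chunk);                            // profile one GPU chunk
  std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
  double gpu_speed = gpu_chunk / dt.count();           // items per second
  // split the remaining iterations according to the relative speeds
  std::size_t remaining = n - gpu_chunk;
  std::size_t to_gpu = static_cast<std::size_t>(
      remaining * gpu_speed / (gpu_speed + cpu_rate()));
  run_on_gpu(n - to_gpu, n);   // GPU takes the tail; CPU workers take the rest
}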
85. Agenda
• Motivation
• Hardware
– Heterogeneous chips
– Integrated GPUs
– Advantages
• Software
– Programming models for heterogeneous systems
– Programming models for heterogeneous chips
– Our approach based on TBB
86. Our heterogeneous parallel_for
Angeles Navarro, Antonio Vilches, Francisco Corbera and Rafael Asenjo
Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures,
The Journal of Supercomputing, May 2014
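The scheduler in that paper is TBB-based; a heavily simplified sketch of the idea (illustrative names, not the actual implementation): one TBB task feeds the GPU with chunks taken from a shared counter while the remaining cores process CPU chunks of the same iteration space:

#include <algorithm>
#include <atomic>
#include <cstddef>
#include "tbb/task_group.h"
#include "tbb/parallel_for.h"

void offload_to_gpu(std::size_t begin, std::size_t end);  // assumed helper
void body(std::size_t i);                                 // assumed loop body

void hetero_parallel_for(std::size_t n, std::size_t gpu_chunk,
                         std::size_t cpu_chunk) {
  std::atomic<std::size_t> next(0);
  tbb::task_group tg;
  tg.run([&] {                                   // GPU feeder task
    std::size_t i;
    while ((i = next.fetch_add(gpu_chunk)) < n)
      offload_to_gpu(i, std::min(i + gpu_chunk, n));
  });
  std::size_t i;                                 // CPU side shares the counter
  while ((i = next.fetch_add(cpu_chunk)) < n)
    tbb::parallel_for(i, std::min(i + cpu_chunk, n),
                      [](std::size_t j) { body(j); });
  tg.wait();
}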
87. Comparison with StarPU
• MxV benchmark
– Three schedulers tested: greedy, work-stealing, HEFT
– Static chunk size: 2000, 200 and 20 matrix rows
88. Choosing the GPU block size
• Belviranli, M. E., Bhuyan, L. N., & Gupta, R. (2013). A dynamic self-scheduling scheme for
heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim., 9(4), 57:1–57:20.
89. GPU block size for irregular codes
• Adapt between time-steps and inside the time-step
[Chart: Barnes-Hut average throughput per chunk size (0–81960), comparing static vs. adaptive chunks at time steps 0 and 30]
90. GPU block size for irregular codes
• Throughput variation along the iteration space
– For two different time-steps
– Different GPU chunk-sizes
[Charts: Barnes-Hut throughput variation across the iteration space at time steps 0 and 5, for GPU chunk sizes 320, 640, 1280 and 2560]
91. Adapting the GPU chunk-size
• Assumption:
– Irregular behavior as a sequence of regimes of regular behavior
[Diagram: the GPU throughput λG is modeled as λG(x) ≈ a·ln(x) + b; while throughput keeps growing the chunk size is doubled (G(t−1)·2), if it drops it is halved (G(t−1)/2), and near the flat region it is set from the fitted slope, G = a/thld]
[Charts: GPU throughput and LogFit chunk size across the iteration space for two inputs]
A sketch of this adaptation rule follows below.
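A minimal sketch of the adaptation rule reconstructed from the diagram above (the ±5% stability band is an assumption; `a` is the slope of the fitted model thr(x) = a·ln(x) + b, and thld is the flatness threshold):

#include <cstddef>

std::size_t next_chunk(double thr_prev, double thr_now,
                       double a, double thld, std::size_t g_prev) {
  if (thr_now > thr_prev * 1.05)   // throughput still rising: explore upwards
    return g_prev * 2;
  if (thr_now < thr_prev * 0.95)   // throughput falling: back off
    return g_prev / 2;
  return static_cast<std::size_t>(a / thld);   // flat regime: set from the fit
}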
92. Preliminary results: Energy-Performance
On Haswell
• Static: oracle-like static partition of the work based on profiling
• Concord: Intel's approach: GPU share computed once
• HDSS: Belviranli et al.'s approach: GPU share computed once
• LogFit: our dynamic CPU and GPU chunk-size partitioner
[Charts: Barnes-Hut energy per iteration (J) vs. performance (iterations/ms) on Haswell for Static, Concord, HDSS and LogFit; and the offline search for the static partition: execution time in seconds vs. percentage of the iteration space offloaded to the GPU (0%–100%)]
93. Preliminary results: Energy-Performance
• w.r.t. Static: improvements of up to 52% (18% on average)
• w.r.t. Concord and HDSS: improvements of up to 94% and 69% (28% and 27% on average)
[Charts: energy per iteration (J) vs. performance (iterations/ms) for CFD, SpMV, Nbody and a fourth benchmark, comparing Static, Concord, HDSS and LogFit]
94. Our heterogeneous pipeline
• ViVid, an object-detection application
• Contains three main kernels that form a pipeline
• We would like to answer the following questions:
– Granularity: coarse- or fine-grained parallelism?
– Mapping of stages: where do we run them (CPU/GPU)?
– Number of cores: how many of them when running on the CPU?
– Optimum: what metric do we optimize (time, energy, both)?
Pipeline: input frame → Stage 1: Filter (response mtx.) → Stage 2: Histogram (index mtx., histograms) → Stage 3: Classifier (detection response)
A TBB sketch of this pipeline follows below.
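Our approach is TBB-based, so a minimal sketch with tbb::parallel_pipeline (the types and stage functions are placeholders, not the actual ViVid code):

#include "tbb/pipeline.h"

struct Frame; struct ResponseMtx; struct Histograms;
Frame* next_frame(tbb::flow_control& fc);        // assumed input stage
ResponseMtx* filter_stage(Frame*);               // Stage 1
Histograms* histogram_stage(ResponseMtx*);       // Stage 2
void classifier_stage(Histograms*);              // Stage 3 + output

void run_vivid_pipeline(int tokens) {
  tbb::parallel_pipeline(tokens,
    tbb::make_filter<void, Frame*>(tbb::filter::serial_in_order,
        [](tbb::flow_control& fc) { return next_frame(fc); }) &
    tbb::make_filter<Frame*, ResponseMtx*>(tbb::filter::parallel, filter_stage) &
    tbb::make_filter<ResponseMtx*, Histograms*>(tbb::filter::parallel, histogram_stage) &
    tbb::make_filter<Histograms*, void>(tbb::filter::serial_in_order, classifier_stage));
}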
95. Granularity
• Coarse grain: CG
• Medium grain: MG
• Fine grain: FG
– Fine grain also on the CPU via AVX intrinsics
[Diagrams: the three granularities as mappings of items onto the Input Stage, Stages 1–3 and the Output Stage, using CPU cores (C = core) and the GPU]
96. More mappings
[Diagrams: additional stage-to-device mappings, assigning each of Stages 1–3 to the CPU (one or more cores) or the GPU in different combinations]
97. Accounting for all alternatives
• In general: nC CPU cores, 1 GPU and p pipeline stages
• # alternatives = 2^p × (nC + 2)
• For Rodinia's SRAD benchmark (p = 6, nC = 4) → 2^6 × 6 = 384 alternatives
[Diagram: each of the p stages runs on the CPU (nC cores) or the GPU]
98. Framework and Model
• Key idea:
1. Run only on the GPU
2. Run only on the CPU
3. Analytically extrapolate for heterogeneous execution
4. Find the best configuration → RUN
[Diagram: the DP-MG configuration; the homogeneous runs collect λ and E (homogeneous values)]
99. Environmental Setup: Benchmarks
• Four benchmarks
– ViVid (Low and High Definition inputs): Input → Filter → Histogram → Classifier → Output
– SRAD: Input → Extract → Prep → Reduct → Comp.1 → Comp.2 → Statist. → Output
– Tracking: Input → Track → Output
– Scene Recognition: Input → Feature → SVM → Output
100. ViVid: throughput/energy on Ivy Bridge
[Charts: throughput/energy vs. number of threads (CG) for the LD (600x416) and HD (1920x1080) inputs on Ivy Bridge; the best configurations are CP-CG, which keeps a GPU+CPU path and a CPU-only path, and CP-MG, which also splits a stage across cores (C = core)]
101. ViVid: throughput/energy on Haswell
[Charts: throughput/energy vs. number of threads (CG) for the LD and HD inputs on Haswell; the best configuration is CP-MG (C = core)]
103. SRAD on Ivy Bridge
[Charts: SRAD throughput/energy vs. number of threads (CG) on Ivy Bridge; the best mappings, CP-MG and DP-MG, assign each of the six pipeline stages to the CPU, the GPU, or both]
104. SRAD on Haswell
[Charts: SRAD throughput/energy vs. number of threads (CG) on Haswell; the best mapping is DP-CG across the six pipeline stages]
105. On-going work
• Test the model on other heterogeneous chips
• ViVid LD running on Odroid XU-E
– Ivy Bridge: 65 fps, 0.7 J/frame and 92 fps/J with CP-CG(5)
– Exynos 5: 0.7 fps, 11 J/frame and 0.06 fps/J with CP-CG(4)
[Charts: estimated vs. measured throughput and throughput/energy for CP-CG, DP-CG and CPU-CG with 1–4 threads]
106. Future Work
• Consider energy for parallel loops scheduling/partitioning
• Consider MARE as an alternative to our TBB-based implementation
• Apply DVFS to reduce energy consumption
– For video applications: no need to go faster than 33 fps
• Also explore other parallel patterns:
– Reduce
– Parallel_do
– …
107. Conclusions
• Plenty of heterogeneous on-chip architectures out there
• It is important to use both devices
– Need to find the best mapping/distribution/scheduling out of the
many possible alternatives.
• Programming models and runtimes aimed at this goal are in their infancy; they may have a huge impact on the mobile market
• Challenges:
– Hide hardware complexity
– Consider energy in the partition/scheduling decisions
– Minimize overhead of adaptation policies
108. Collaborators
• Mª Ángeles González Navarro (UMA, Spain)
• Francisco Corbera (UMA, Spain)
• Antonio Vilches (UMA, Spain)
• Andrés Rodríguez (UMA, Spain)
• Alejandro Villegas (UMA, Spain)
• Rubén Gran (U. Zaragoza, Spain)
• Maria Jesús Garzarán (UIUC, USA)
• Mert Dikmen (UIUC, USA)
• Kurt Fellows (UIUC, USA)
• Ehsan Totoni (UIUC, USA)