Programming Models for 
Heterogeneous Chips 
Rafael Asenjo 
Dept. of Computer Architecture 
University of Malaga, Spain.
Agenda 
• Motivation 
• Hardware 
– Heterogeneous chips 
– Integrated GPUs 
– Advantages 
• Software 
– Programming models for heterogeneous systems 
– Programming models for heterogeneous chips 
– Our approach based on TBB 
2
Motivation 
• A new mantra: Power and Energy saving 
• In all domains 
3
Motivation 
• GPUs came to rescue: 
– Massive Data Parallel Code at a 
low price in terms of power 
– Supercomputers and servers: 
NVIDIA 
• GREEN500 Top 15: 
• TOP500: 
– 45 systems w. NVIDIA 
– 19 systems w. Xeon Phi 
4
Motivation 
• There is (parallel) life beyond supercomputers: 
5
Motivation 
• Plenty of GPUs elsewhere: 
– Integrated GPUs on more than 90% of shipped processors 
6
Motivation 
• Plenty of GPUs on desktops and laptops: 
– Desktops (35 – 130W) and laptops (15– 57 W): 
7 
Intel Haswell AMD APU Kaveri 
http://techguru3d.com/4th-gen-intel-haswell-processors-architecture-and-lineup/ 
http://www.techspot.com/photos/article/770-amd-a8-7600-kaveri/
Motivation 
8
Motivation 
• Plenty of integrated GPUs in mobile devices. 
Samsung 
Galaxy S5 
SM-G900H 
9 
Samsung Exynos 5 Octa (2 - 6 W) 
Samsung 
Galaxy Note 
Pro 12 
http://www.samsung.com/us/showcase/galaxy-smartphones-and-tablets/
Motivation 
• Plenty of integrated GPUs in mobile devices. 
10 
Qualcomm Snapdragon 800 (2 - 6 W) 
https://www.qualcomm.com/products/snapdragon/processors/800 
Nexus 5 
Nokia Lumia 
Sony Xperia
Motivation 
• Plenty of room for improvements 
– Want to make the most out of the CPU and the GPU 
– Lack of programming models 
– “Heterogeneous exec., but homogeneous programming” 
– Huge potential impact 
• Servers and supercomputing market 
– Google: porting the search engine for ARM and PowerPC 
– AMD Seattle Server-on-a-Chip based on Cortex-A57 (v8) 
– Mont Blanc project: supercomputer made of ARM 
• Once commodity processors took over 
• Be prepared for when mobile processors do so 
– E4’s EK003 Servers: X-Gene ARM A57 (8 cores) + K20 
11
Agenda 
• Motivation 
• Hardware 
– Heterogeneous chips 
– Integrated GPUs 
– Advantages 
• Software 
– Programming models for heterogeneous systems 
– Programming models for heterogeneous chips 
– Our approach based on TBB 
12
Hardware 
13 
Intel Haswell AMD Kaveri 
Samsung Exynos 5 Octa Qualcomm Snapdragon 800
Intel Haswell 
• Modular design 
14 
– 2 or 4 cores 
– GPU 
• GT-1: 10 EU 
• GT-2: 20 EU 
• GT-3: 40 EU 
• TSX: HW transac. mem. 
– HLE (HW lock elis.) 
• XACQUIRE 
• XRELEASE 
– RTM (Restrtd. TM) 
• XBEGIN 
• XEND 
http://www.anandtech.com/show/6355/intels-haswell-architecture
Intel Haswell 
• Three frequency domains 
– Cores 
– GPU 
– LLC and Ring 
• On old Ivy Bridge 
– Only 2 domains 
– Cores and LLC together 
– Using only the GPU → CPU frequency goes up 
• OpenCL driver only for Win. 
• PCM as power monitor 
15 
http://www.anandtech.com/show/7744/intel-reveals-new-haswell-details-at-isscc-2014
Intel Iris Graphics 
https://software.intel.com/en-us/articles/opencl-fall-webinar-series 
16
Intel Iris Graphics 
• GPU slice 
17 
– 2 sub slices 
– 20 EU (GPU cores) 
– Local L3 cache (256KB) 
– 16 barriers per sub slice 
– 2 x 64KB Local mem. 
• 2 GPU slices = 40 EU 
• Up to 7 in-flight EU-threads 
• 8, 16 or 32 SIMD work-items per EU-thread 
• In flight: 7 x 40 x 32 = 8960 work-items 
• Each EU → 2 x 4-wide FPUs 
– 40 x 8 x 2 (fmadd) = 640 simultaneous ops 
– 1.3 GHz → 832 GFLOPS
Intel Iris GPU 
18 
Matrix work-group ≈ block 
EU-threads (SIMD16) ≈ warp ≈ wavefront
AMD Kaveri 
• Steamroller microarch (2 – 4 “Cores”) + 8 GCN Cores. 
19 
http://wccftech.com/
AMD Kaveri 
• Steamroller microarch. 
– Each module → 2 “Cores”. 
– 2 threads, each with 
• 4x superscalar INT 
• 2x SIMD4 FP 
– 3.7GHz 
• Max GFLOPS: 
• 3.7 GHz x 
• 4 threads x 
• 4 wide x 
• 2 fmad = 
• 118 GFLOPS 
20
AMD Graphics Core Next (GCN) 
• In Kaveri, the GCN takes 47% of the die 
– 8 Compute Units (CU) 
– Each CU: 4 SIMD16 
– Each SIMD16: 16 lines 
– Total: 512 FPUs 
– 720 MHz 
• Max GFLOPS= 
• 0.72 GHz x 
• 512 FPUs x 
• 2 fmad = 
• 737 GFLOPS 
• CPU+GPU → 855 GFLOPS 
21
OpenCL execution on GCN 
Work-group → wavefronts (64 work-items) → pools 
22 
[Diagram: a work-group's wavefronts are distributed among the 4 SIMD16 units of CU0, cycling through 4 pools] 
– 4 pools: 4 wavefronts in flight per SIMD 
– 4 clock cycles to execute each wavefront
HSA (Heterogeneous System Architecture) 
• HSA Foundation’s goal: Productivity on heterogeneous HW 
– CPU, GPU, DSPs.. 
• Scheduled in three phases → 
• Second phase: Kaveri 
– hUMA 
– Same pointers used on CPU 
and GPU 
– Cache coherency 
23
Kaveri’s main HSA features 
• hUMA 
– Shared and coherent view of up to 32GB 
• Heterogeneous queuing (hQ) 
– CPU and GPU can create and dispatch work 
24
HSA Motivation 
• Too many steps to get the job done 
25 
[Diagram — job dispatch steps across Application, OS and GPU: transfer buffer to GPU, copy/map memory, queue job, schedule job, start job, finish job, schedule application, get buffer, copy/map memory] 
http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
Requirements 
• Enabling lower-overhead job dispatch requires four mechanisms: 
– Shared Virtual Memory 
• Send pointers (not data) back and forth between HSA agents. 
– System Coherency 
• Data accesses to global memory segment from all HSA Agents 
shall be coherent without the need for explicit cache maintenance 
– Signaling 
• HSA Agents can directly create/access signal objects. 
– Signaling a signal object (this will wake up HSA agents waiting 
upon the object) 
– Query current object 
– Wait on the current object (various conditions supported). 
– User mode queueing 
• Enables user space applications to directly, without OS intervention, 
enqueue jobs (“Dispatch Packets”) for HSA agents. 
26
Non-HSA Shared Virtual Memory 
• Multiple Virtual memory address spaces 
27 
[Diagram: CPU0 uses VIRTUAL MEMORY1 (VA1→PA1) and the GPU uses VIRTUAL MEMORY2 (VA2→PA1), separate virtual address spaces over physical memory]
HSA Shared Virtual Memory 
• Common Virtual Memory for all HSA agents 
28 
[Diagram: CPU0 and the GPU share a single virtual memory space; both translate VA→PA into the same physical memory]
After adding SVM 
• With SVM we get rid of copy/map memory back and forth 
29 
[Same job-dispatch diagram, now without the copy/map memory steps] 
http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
After adding coherency 
• If the CPU allocates a global pointer, the GPU sees that value 
30 
[Same job-dispatch diagram] 
http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
After adding signaling 
• The CPU can wait on a signal object 
31 
[Same job-dispatch diagram] 
http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
After adding user-level enqueuing 
• The user directly enqueues the job without OS intervention 
32 
[Same job-dispatch diagram] 
http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
Success!! 
• That’s definitely way simpler, and with less overhead 
33 
Application OS GPU 
Queue Job 
Start Job 
Finish Job
OpenCL 2.0 
• OpenCL 2.0 will contain most of the features of HSA 
– Intel’s version supports HSA for Core M (Broadwell), Windows only. 
– AMD’s version does not support fine-grain SVM. 
• AMD 1.2 beta driver 
– http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-1-2-beta-driver/ 
– Only for Windows 8.1 
– Example: allocating “Coherent Host Memory” on Kaveri: 
34 
#include <CL/cl_ext.h>   // Implements SVM
#include "hsa_helper.h"  // AMD helper functions
…
cl_svm_mem_flags_amd flags = CL_MEM_READ_WRITE |
                             CL_MEM_SVM_FINE_GRAIN_BUFFER_AMD |
                             CL_MEM_SVM_ATOMICS_AMD;
volatile std::atomic_int *data;
data = (volatile std::atomic_int *) clSVMAlloc(context, flags,
                                               MAX_DATA * sizeof(volatile std::atomic_int), 0);
Samsung Exynos 5 
• Odroid XU-E and XU3 bareboards 
• Sports a Exynos 5 Octa 
35 
– big.LITTLE architecture 
– big: Cortex-A15 quad 
– LITTLE: Cortex-A7 quad 
• Exynos 5 Octa 5410 
– Only 4 CPU-cores active at a time 
– GPU: Power VR SGX544MP3 (Imagination Technologies) 
• 3 GPU-cores at 533MHz → 51 GFLOPS 
• Exynos 5 Octa 5422 
– All 8 CPU-cores can be working simultaneously 
– GPU: ARM Mali-T628 MP6 
• 6 GPU-cores at 533MHz → 102 GFLOPS 
180$
Power VR SGX544MP3 
• OpenCL 1.1 for Android 
• Some limitations: 
– Compute units: 1 
– Max WG Size: 1 
– Local Mem: 1KB 
– Peak MFLOPS: 
• 12 Ops per ck 
• 3 SIMD-ALUs x 4-wide 
• Power monitor: 
– 4 x INA231 monitors 
• A15, A7, GPU, Mem. 
• Instant Power 
• Every 260ms 
36 
SGX architecture
Texas Instruments INA231 
37
ARM Mali-T628 MP6 
• Supporting: 
– OpenGL® ES 3.0 
– OpenCL™ 1.1 
– DirectX® 11 
– Renderscript™ 
• Cache L2 size 
– 32 – 256KB per core 
• 6 Cores 
– 16 FP units 
– 2 SIMD4 each 
• Other 
– Built-in MMU 
– Standard ARM Bus 
• AMBA 4 ACE-Lite 
38 
Mali architecture
Qualcomm Snapdragon 
39
Snapdragon 800 
40
Snapdragon 800 
• CPU: Quad-core Krait 400 up to 2.26GHz (ARMv7 ISA) 
– Similar to Cortex-A15. 11 stage integer pipeline with 3-way 
decode and 4-way out-of-order speculative issue superscalar 
execution 
– Pipelined VFPv4 and 128-bit wide NEON (SIMD) 
– 4 KB + 4 KB direct mapped L0 cache 
– 16 KB + 16 KB 4-way set associative L1 cache 
– 2 MB (quad-core) L2 cache 
• GPU: Adreno 330, 450MHz 
– OpenGL ES 3.0, DirectX, OpenCL 1.2, RenderScript 
– 32 Execution Units. Each with 2 x SIMD4 units 
• DSP: Hexagon 600MHz 
41
Measuring power 
• Snapdragon 
Performance 
Visualizer 
• Trepn Profiler 
• Power Tutor 
– Tuned for Nexus One 
– Model with 5% 
precision 
– Open Source 
42
More development boards 
• Jetson TK1 board 
– Tegra K1 
– Kepler GPU with 192 CUDA cores 
– 4-Plus-1 quad-core ARM Cortex A15 
– Linux + CUDA 
– 180$ 
• Arndale 
– Exynos 5420 
– big.LITTLE (A15 + A7) 
– GPU Mali T628 MP6 
– Linux + OpenCL 
– 200$ 
• … 
43
Advantages of integrated GPUs 
• Discrete and integrated GPUs: different goals 
– NVIDIA Kepler: 2880 CUDA cores, 235W, 4.3 TFLOPS 
– Intel Iris 5200: 40 EU x 8 SIMD, 15-28W, 0.83 TFLOPS 
– PowerVR: 3 EU x 16 SIMD, < 1W, 0.051 TFLOPS 
• Higher bandwidth between CPU and GPU. 
– Shared DRAM 
• Avoid PCI data transfer 
– Shared LLC (Last Level Cache) 
• Data coherence in some cases… 
• CPU and GPU may have similar performance 
– It’s more likely that they can collaborate 
• Cheaper! 
44
Integrated GPUs are also improving 
45
Agenda 
• Motivation 
• Hardware 
– Heterogeneous chips 
– Integrated GPUs 
– Advantages 
• Software 
– Programming models for heterogeneous systems 
– Programming models for heterogeneous chips 
– Our approach based on TBB 
46
Programming models for heterogeneous 
• Targeted at single device 
– CUDA (NVIDIA) 
– OpenCL (Khronos Group Standard) 
– OpenACC (C, C++ or Fortran + Directives → OpenMP 4.0) 
– C++ AMP (Microsoft’s C++ extension; HSA recently announced its own version) 
– RenderScript (Google’s Java API for Android) 
– ParallDroid (Java + Directives from ULL, Spain) 
– Many more (Sycl, Numba Python, IBM Java, Matlab, R, JavaScript, …) 
• Targeted at several devices (discrete GPUs) 
– Qilin (C++ and Qilin API compiled to TBB+CUDA) 
– OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler) 
– XKaapi 
– StarPU 
• Targeted at several devices (integrated GPUs) 
– Qualcomm MARE 
– Intel Concord 
47
OpenCL on mobile devices 
48 
http://streamcomputing.eu/blog/2014-06-30/opencl-support-recent-android-smartphones/
OpenCL running on CPU 
49 
CPU: Ivy Bridge at 3.3 GHz 
[Bar chart: execution time (ms) of the Base, Auto, T-Auto, SSE, AVX-SSE, AVX and OpenCL versions] 
The AVX code version is 
- 1.8x faster than OpenCL 
- 1.8x more Halstead effort 
“Easy, Fast and Energy Efficient Object Detection on Heterogeneous On-Chip 
Architectures”, E. Totoni, M. Dikmen, M. J. Garzaran, ACM Transactions on Architecture 
and Code Optimization (TACO),10(4), December 2013.
Complexities of AVX Intrinsics 
__m256 image_cache0 = _mm256_broadcast_ss(&fr_ptr[pixel_offsets[0]]);      // Load
curr_filter = _mm256_load_ps(&fb_array[fi]);
temp_sum = _mm256_add_ps(_mm256_mul_ps(image_cache7, curr_filter),         // Multiply-add
                         temp_sum);
temp_sum2 = _mm256_insertf128_ps(temp_sum,                                 // Copy high to low
                                 _mm256_extractf128_ps(temp_sum, 1), 0);
cpm = _mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS);                       // Compare
r = _mm256_movemask_ps(cpm);

if (r & (1<<1)) {
    best_ind = filter_ind + 2;                                             // Store index
    int control = 1 | (1<<2) | (1<<4) | (1<<6);
    max_fil = _mm256_permute_ps(temp_sum2, control);                       // Store max
    r = _mm256_movemask_ps(_mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS));
}
50
OpenCL doesn’t have to be tough 
Courtesy: Khronos Group 
51
Libraries and languages using OpenCL 
Courtesy: AMD 
52
Libraries and languages using OpenCL 
Courtesy: AMD 
53
Libraries and languages using OpenCL (cont.) 
Courtesy: AMD 
54
Libraries and languages using OpenCL (cont.) 
Courtesy: AMD 
55
C++AMP 
• C++ Accelerated Massive Parallelism 
• Pioneered by Microsoft 
– Requirements: Windows 7 + Visual Studio 2012 
• Followed by Intel's experimental implementation 
– C++ AMP on Clang/LLVM and OpenCL (AWOL since 2013) 
• Now HSA Foundation taking the lead 
• Keywords: restrict(amp), array_view, parallel_for_each, … 
– Example: SUM = A + B; // (2D arrays) 
56
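A minimal sketch of what the SUM = A + B example could look like in C++ AMP (illustrative only; the array dimensions and names are assumptions, not taken from the slide):

#include <amp.h>
using namespace concurrency;

void add(const float *A, const float *B, float *SUM, int N, int M) {
    array_view<const float, 2> a(N, M, A), b(N, M, B);
    array_view<float, 2> sum(N, M, SUM);
    sum.discard_data();                       // no need to copy SUM to the accelerator
    parallel_for_each(sum.extent,             // one GPU thread per element
        [=](index<2> idx) restrict(amp) {
            sum[idx] = a[idx] + b[idx];
        });
    sum.synchronize();                        // copy the result back to SUM
}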
OpenCL Ecosystem 
Courtesy: Khronos Group 
57
SYCL’s flavour: A[i]=B[i]*2 
58 
Work in progress developments: 
- AMD: triSYCL → https://github.com/amd/triSYCL 
- Codeplay: http://www.codeplay.com/ 
Advantages: 
1. Easy to understand the concept of work-groups 
2. Performance-portable between CPU and GPU 
3. Barriers are automatically deduced
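For comparison, a minimal SYCL 1.2-style sketch of the A[i] = B[i]*2 kernel (illustrative; this is not the code shown on the slide):

#include <CL/sycl.hpp>
using namespace cl::sycl;

void times_two(float *A, float *B, size_t N) {
    queue q;                                         // default device selection
    buffer<float, 1> bufA(A, range<1>(N));
    buffer<float, 1> bufB(B, range<1>(N));
    q.submit([&](handler &h) {
        auto a = bufA.get_access<access::mode::write>(h);
        auto b = bufB.get_access<access::mode::read>(h);
        h.parallel_for<class times2>(range<1>(N),
            [=](id<1> i) { a[i] = b[i] * 2.0f; });   // work-groups and barriers deduced by the runtime
    });                                              // results copied back when the buffers are destroyed
}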
StarPU 
• A runtime system for 
heterogeneous architectures 
• Dynamically schedule tasks on 
all processing units 
– See a pool of heterogeneous 
cores 
• Avoid unnecessary data 
transfers between accelerators 
– Software SVM for 
heterogeneous machines 
59 
[Diagram: a pool of CPU cores and GPUs, each with associated memory (M), computing A = A + B; data movement is handled by StarPU's software SVM]
Overview of StarPU 
• Maximizing PU occupancy, minimizing data transfers 
• Ideas: 
– Accept tasks that may have multiple 
implementations 
60 
• Together with potential inter-dependencies 
– Leads to a dynamic acyclic graph of 
tasks 
– Provide a high-level data management 
layer (Virtual Shared Memory VSM) 
• Application should only describe 
– which data may be accessed by tasks 
– how data may be divided 
[Software stack: Applications, Parallel Compilers and Parallel Libraries sit on top of StarPU, which drives the CPUs and GPUs through its drivers (CUDA, OpenCL)]
Tasks scheduling 
• Dealing with heterogeneous hardware accelerators 
• Tasks = 
61 
– Data input & output 
– Dependencies with other tasks 
– Multiple implementations 
• E.g. CUDA + CPU 
• Scheduling hints 
• StarPU provides an Open Scheduling 
platform 
– Scheduling algorithm = plug-ins 
– Predefined set of popular policies 
[Same software stack; a task f(A RW, B R) can carry several implementations: cpu, gpu, spu]
Tasks scheduling 
• Predefined set of popular policies 
• Eager Scheduler 
62 
– First come, first served policy 
– Only one queue 
• Work Stealing Scheduler 
– Load balancing policy 
– One queue per worker 
• Priority Scheduler 
– Describe the relative importance 
of tasks 
– One queue per priority 
[Diagrams: Eager scheduler — one shared task queue for all CPU and GPU workers; Work-Stealing scheduler — one queue per worker; Priority scheduler — one queue per priority level (prio0, prio1, prio2)]
Tasks scheduling 
• Predefined set of popular policies 
• Dequeue Model (DM) Scheduler 
63 
– Using codelet performance models 
• Kernel calibration on each 
available computing device 
– Raw history model of kernels’ 
past execution times 
– Refined models using 
regression on kernels’ execution 
times history 
• Dequeue Model Data Aware (DMDA) 
Scheduler 
– Data transfer cost vs kernel offload 
benefit 
– Transfer cost modelling ( ) 
– Bus calibration 
[Diagrams: the DM scheduler places each task on the worker (cpu1..cpu3, gpu1, gpu2) with the earliest predicted completion time; the DMDA scheduler additionally accounts for the data-transfer time]
Some results (MxV, 4 CPUs, 1 GPU) 
StarPU config: Eager, 3 CPUs, 1 GPU | StarPU config: DMDA, 3 CPUs, 1 GPU 
StarPU config: Eager, 4 CPUs, 1 GPU | StarPU config: DMDA, 4 CPUs, 1 GPU 
64
Terminology 
• A Codelet. . . 
– . . . relates an abstract computation kernel to its implementation(s) 
– . . . can be instantiated into one or more tasks 
– . . . defines characteristics common to a set of tasks 
• A Task. . . 
– . . . is an instantiation of a Codelet 
– . . . atomically executes a kernel from its beginning to its end 
– . . . receives some input 
– . . . produces some output 
• A Data Handle. . . 
– . . . designates a piece of data managed by StarPU 
– . . . is typed (vector, matrix, etc.) 
– . . . can be passed as input/output for a Task 
65
Basic Example: Scaling a Vector 
66 
Declaring a Codelet

struct starpu_codelet scal_cl = {
    .cpu_funcs  = { scal_cpu_f, NULL },    /* kernel functions */
    .cuda_funcs = { scal_cuda_f, NULL },
    .nbuffers   = 1,                       /* number of data pieces */
    .modes      = { STARPU_RW },           /* data access mode */
};

Kernel functions

void scal_cpu_f(void *buffers[], void *cl_arg) {                   /* kernel function prototype */
    struct starpu_vector_interface *vector_handle = buffers[0];    /* retrieve data handle */
    float *vector = STARPU_VECTOR_GET_PTR(vector_handle);          /* get pointer from data handle */
    float *ptr_factor = cl_arg;                                     /* get small-size inline data */
    for (int i = 0; i < NX; i++)                                    /* do computation */
        vector[i] *= *ptr_factor;
}
void scal_cuda_f(void *buffers[], void *cl_arg) { … }
Basic Example: Scaling a Vector 
67 
Main code

float factor = 3.14;
float vector1[NX];
float vector2[NX];
starpu_data_handle_t vector_handle1;               /* declare data handles */
starpu_data_handle_t vector_handle2;
/* ..... */
/* register pieces of data and get the handles (now under StarPU control) */
starpu_vector_data_register(&vector_handle1, 0, (uintptr_t)vector1, NX, sizeof(vector1[0]));
starpu_vector_data_register(&vector_handle2, 0, (uintptr_t)vector2, NX, sizeof(vector2[0]));
/* non-blocking task submits (params: codelet, StarPU-managed data, small-size inline data) */
starpu_task_insert(&scal_cl, STARPU_RW, vector_handle1,
                   STARPU_VALUE, &factor, sizeof(factor), 0);
starpu_task_insert(&scal_cl, STARPU_RW, vector_handle2,
                   STARPU_VALUE, &factor, sizeof(factor), 0);
/* wait for all tasks submitted so far */
starpu_task_wait_for_all();
/* unregister pieces of data (handles destroyed; vectors back under user control) */
starpu_data_unregister(vector_handle1);
starpu_data_unregister(vector_handle2);
/* ..... */
Qualcomm MARE 
• MARE is a programming model and a runtime system that 
provides simple yet powerful abstractions for parallel, power-efficient 
software 
– Simple C++ API allows developers to express concurrency 
– User-level library that runs on any Android device, and on Linux, 
Mac OS X, and Windows platforms 
• The goal of MARE is to reduce the effort required to write 
apps that fully utilize heterogeneous SoCs 
• Concepts: 
– Tasks are units of work that can be asynchronously executed 
– Groups are sets of tasks that can be canceled or waited on 
68
Basic Example: Hello World 
69
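The slide's code is not reproduced in this transcript; as an illustration only, a MARE-style hello world might look roughly as follows (the API names — mare::runtime::init, mare::create_task, mare::launch, mare::wait_for — are recalled from the public MARE documentation and should be treated as approximate):

#include <mare/mare.h>
#include <cstdio>

int main() {
    mare::runtime::init();                                     // start the MARE runtime
    auto hello = mare::create_task([] { std::printf("Hello World!\n"); });
    mare::launch(hello);                                       // asynchronous execution
    mare::wait_for(hello);                                     // block until the task finishes
    mare::runtime::shutdown();
    return 0;
}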
More complex example: C=A+B on GPU 
70
More complex example: C=A+B on GPU 
71
MARE departures 
• Similarities with TBB 
– Based on tasks and 2-level API (task level and templates) 
• pfor_each, ptransform, pscan, … 
• Synchronous Dataflow classes ≈ TBB’s Flow Graphs 
– Concurrent data structures: queue, stack, … 
• Departures 
– Expression of dependencies is first class 
– Flexible group membership and work or group cancelation 
– Optimized for some Qualcomm chips 
• Power classes: 
– Static: mare::power::mode {efficient, saver, …} 
– Dynamic: mare::power::set_goal(desired, tolerance) 
• Aware of the mobile architecture: aggressive power management 
– Cores can be shut down or affected by DVFS 
72
MARE results 
• Zoomm web browser implemented on top of MARE 
C. Cascaval, et al.. ZOOMM: a parallel web browser engine for multicore mobile devices. In 
Symposium on Principles and practice of parallel programming, PPoPP ’13, pages 271–280, 2013. 
73
MARE results 
• Bullet Physics parallelized with MARE 
Courtesy: Calin Cascaval 
74
Intel Concord 
• C++ heterogeneous programming framework for integrated 
CPU and GPU processors 
– Shared Virtual Memory (SVM) in software 
– Adapts existing data-parallel C++ constructs to heterogeneous 
computing using TBB 
– Available open source as Intel Heterogeneous Research 
Compiler (iHRC) at https://github.com/IntelLabs/iHRC/ 
• Papers: 
– Rajkishore Barik, Tatiana Shpeisman, et al. Efficient mapping of 
irregular C++ applications to integrated GPUs. CGO 2014. 
– Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian 
Lewis, Chunling Hu, and Keshav Pingali. Adaptive 
heterogeneous scheduling on integrated GPUs. PACT 2014. 
75
Intel Concord 
• Extend TBB API: 
– parallel_for_hetero(int numiters, const Body &B, bool device); 
– parallel_reduce_hetero(int numiters, const Body &B, bool device); 
Courtesy: Intel 
76
Example: parallel_for_hetero 
• Concord compiler generates OpenCL version 
– Automatically takes care of the data thanks to SVM 
Courtesy: Intel 
77
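A sketch of how such a loop might be written with the parallel_for_hetero API above (the Body interface and the meaning of the boolean device flag are assumptions, not taken from the iHRC sources):

// Illustrative only: vector addition expressed as a Concord-style Body
class VecAdd {
public:
    float *a, *b, *c;
    VecAdd(float *a_, float *b_, float *c_) : a(a_), b(b_), c(c_) {}
    void operator()(int i) const { c[i] = a[i] + b[i]; }   // one iteration of the loop body
};

void add(float *a, float *b, float *c, int n) {
    // Thanks to software SVM, a, b and c can be dereferenced on the GPU as-is
    parallel_for_hetero(n, VecAdd(a, b, c), /*device=*/true);   // true assumed to mean "run on the GPU"
}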
Concord framework 
Courtesy: Intel 
78
SVM SW implementation on Haswell 
79
SVM translation in OpenCL code 
• svm_const is a runtime constant and is computed once 
• Every CPU pointer before dereference on the GPU is 
converted into GPU address space using AS_GPU_PTR 
80
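An illustrative OpenCL C sketch of the translation described above (AS_GPU_PTR, svm_const and the linked-list layout are assumptions made for this example, not Concord's actual generated code):

/* svm_const = (GPU base of the SVM heap) - (CPU base), computed once per dispatch */
#define AS_GPU_PTR(T, p) ((__global T *)((ulong)(p) + svm_const))

typedef struct Node { ulong next; float val; } Node;   /* 'next' stores a CPU pointer value */

__kernel void sum_list(ulong svm_const, ulong head, __global float *out) {
    float acc = 0.0f;
    for (ulong p = head; p != 0; ) {
        __global Node *n = AS_GPU_PTR(Node, p);        /* translate before every dereference */
        acc += n->val;
        p = n->next;
    }
    out[get_global_id(0)] = acc;
}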
Concord results 
81
Speedup & Energy savings vs multicore CPU 
82
Heterogeneous execution on both devices 
• Iteration space distributed among available devices 
• Problem: find the best data partition 
• Example: Barnes Hut and Facedetect relative execution time 
– Varying the amount of work offloaded to the GPU 
– For BH the optimum is 40% of the work carried out on the GPU 
– For FD the optimum is 0% of the work carried out on the GPU 
83
Partitioning based on on-line profiling 
84 
Naïve profiling: assign a chunk to the CPU and a chunk to the GPU, compute both, barrier, then partition the rest of the iteration space according to relative speeds. 
Asymmetric profiling: assign a chunk just to the GPU while the CPU keeps computing; when the GPU is done, partition the rest of the iteration space according to relative speeds.
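As a concrete illustration of the final step shared by both schemes, a minimal sketch of the relative-speed partitioning rule (the throughput numbers are made up):

#include <cstddef>
#include <cstdio>

// After profiling, the remaining iterations are split according to relative speeds.
std::size_t gpu_share(std::size_t remaining, double lambda_gpu, double lambda_cpu) {
    return static_cast<std::size_t>(remaining * lambda_gpu / (lambda_gpu + lambda_cpu));
}

int main() {
    std::size_t rest = 100000;            // iterations left after the profiling phase
    double lambda_gpu = 300.0;            // measured GPU throughput (iterations/ms)
    double lambda_cpu = 4 * 50.0;         // 4 CPU cores at 50 iterations/ms each
    std::size_t to_gpu = gpu_share(rest, lambda_gpu, lambda_cpu);
    std::printf("GPU: %zu iterations, CPU: %zu iterations\n", to_gpu, rest - to_gpu);
    return 0;
}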
Agenda 
• Motivation 
• Hardware 
– Heterogeneous chips 
– Integrated GPUs 
– Advantages 
• Software 
– Programming models for heterogeneous systems 
– Programming models for heterogeneous chips 
– Our approach based on TBB 
85
Our heterogeneous parallel_for 
Angeles Navarro, Antonio Vilches, Francisco Corbera and Rafael Asenjo 
Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures, 
The Journal of Supercomputing, May 2014 
86
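The paper above describes the actual scheduler; purely as an illustration of the general structure (not the authors' implementation), a TBB-based heterogeneous parallel_for could combine a GPU-feeding task with CPU workers sharing one iteration space:

#include <tbb/parallel_for.h>
#include <tbb/task_group.h>
#include <algorithm>
#include <atomic>
#include <cstddef>

// offload_to_gpu() and cpu_body() are placeholders for the real OpenCL dispatch and CPU kernel.
void hetero_parallel_for(std::size_t n, std::size_t gpu_chunk, std::size_t cpu_chunk,
                         void (*offload_to_gpu)(std::size_t, std::size_t),
                         void (*cpu_body)(std::size_t)) {
    std::atomic<std::size_t> next(0);
    tbb::task_group tg;
    tg.run([&] {                                         // GPU feeder: grabs chunks until exhausted
        for (std::size_t b = next.fetch_add(gpu_chunk); b < n; b = next.fetch_add(gpu_chunk))
            offload_to_gpu(b, std::min(b + gpu_chunk, n));
    });
    for (std::size_t b = next.fetch_add(cpu_chunk); b < n; b = next.fetch_add(cpu_chunk))
        tbb::parallel_for(b, std::min(b + cpu_chunk, n), // CPU workers process their chunk
                          [&](std::size_t i) { cpu_body(i); });
    tg.wait();                                           // wait for the GPU side to finish
}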
Comparison with StarPU 
• MxV benchmark 
– Three schedulers tested: greedy, work-stealing, HEFT 
– Static chunk size: 2000, 200 and 20 matrix rows 
91
Choosing the GPU block size 
• Belviranli, M. E., Bhuyan, L. N., & Gupta, R. (2013). A dynamic self-scheduling scheme for 
heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim., 9(4), 57:1–57:20. 
92
GPU block size for irregular codes 
• Adapt between time-steps and inside the time-step 
93 
[Plot: Barnes-Hut, average throughput per chunk size (0–81960), comparing static and adaptive chunk sizes at time steps 0 and 30]
GPU block size for irregular codes 
• Throughput variation along the iteration space 
– For two different time-steps 
– Different GPU chunk-sizes 
94 
[Plots: Barnes-Hut, throughput variation along the iteration space at time steps 0 and 5, for GPU chunk sizes 320, 640, 1280 and 2560]
Adapting the GPU chunk-size 
• Assumption: 
– Irregular behavior as a sequence of regimes of regular behavior 
95 
[Diagram: the GPU throughput λG is modeled as a·ln(x)+b; while λG keeps growing the chunk size is doubled (G(t-1)*2), when it drops the chunk size is halved (G(t-1)/2); near the threshold the sampled model gives G = a/thld] 
[Plots: GPU throughput and LogFit chunk size along the iteration space for two cases]
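A simplified sketch of the adaptation rule suggested by the diagram (illustrative only; the real LogFit policy fits the logarithmic model a·ln(x)+b to the sampled throughputs):

#include <cstddef>

// Adjust the GPU chunk size from the throughput measured on the last two chunks.
std::size_t next_gpu_chunk(std::size_t prev_chunk, double prev_thr, double new_thr, double thld) {
    double change = (new_thr - prev_thr) / prev_thr;
    if (change > thld)   return prev_chunk * 2;   // throughput still rising: try bigger chunks
    if (change < -thld)  return prev_chunk / 2;   // throughput dropping: back off
    return prev_chunk;                            // stable regime: keep the current size
}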
Preliminary results: Energy-Performance 
On Haswell 
[Plot: Barnes-Hut, offline search for the static partition — execution time in seconds vs. percentage of the iteration space offloaded to the GPU] 
• Static: Oracle-like static partition of the work based on profiling 
• Concord: Intel approach: GPU size computed once 
• HDSS: Belviranli et al. approach: GPU size computed once 
• LogFit: our dynamic CPU and GPU chunk-size partitioner 
96 
[Plot: Barnes-Hut, energy per iteration (Joules) vs. performance (iterations per ms) for Static, Concord, HDSS and LogFit]
Preliminary results: Energy-Performance 
LogFit energy-performance gains: 
• w.r.t. Static: up to 52% (18% on average) 
• w.r.t. Concord and HDSS: up to 94% and 69%, respectively (28% and 27% on average) 
97 
[Plots: energy per iteration vs. performance (iterations per ms) for the CFD, SpMV, Nbody and remaining benchmarks, comparing Static, Concord, HDSS and LogFit]
Our heterogeneous pipeline 
• ViVid, an object detection application 
• Contains three main kernels that form a pipeline 
• Would like to answer the following questions: 
– Granularity: coarse or fine grained parallelism? 
– Mapping of stages: where do we run them (CPU/GPU)? 
– Number of cores: how many of them when running on CPU? 
– Optimum: what metric do we optimize (time, energy, both)? 
98 
[Pipeline: input frame → Stage 1: Filter → Stage 2: Histogram → Stage 3: Classifier → output; intermediate data: response matrix, index matrix, histograms, detection response]
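A minimal TBB sketch of such a three-stage pipeline (illustrative, not the ViVid code; the Frame type and the per-stage work are placeholders, and each parallel stage could internally dispatch its item to the CPU or to the GPU):

#include <tbb/pipeline.h>
#include <cstddef>

struct Frame { int id; };
static int frames_left = 100;                                      // frames still to be produced

void run_vivid_like_pipeline(std::size_t tokens /* items in flight */) {
    tbb::parallel_pipeline(tokens,
        tbb::make_filter<void, Frame*>(tbb::filter::serial_in_order,   // Input stage
            [](tbb::flow_control &fc) -> Frame* {
                if (frames_left == 0) { fc.stop(); return NULL; }
                return new Frame{ frames_left-- };
            })
      & tbb::make_filter<Frame*, Frame*>(tbb::filter::parallel,        // Stage 1: Filter
            [](Frame *f) -> Frame* { /* filter kernel */ return f; })
      & tbb::make_filter<Frame*, Frame*>(tbb::filter::parallel,        // Stage 2: Histogram
            [](Frame *f) -> Frame* { /* histogram kernel */ return f; })
      & tbb::make_filter<Frame*, void>(tbb::filter::parallel,          // Stage 3: Classifier
            [](Frame *f) { /* classifier kernel */ delete f; }));
}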
Granularity 
• Coarse grain: 
– CG 
• Medium grain: 
– MG 
• Fine grain: 
– Fine grain also in the CPU via AVX intrinsics 
99 
[Diagrams: mappings of the pipeline (Input Stage, Stages 1–3, Output Stage) onto CPU cores (C = core) and the GPU for the coarse-, medium- and fine-grain configurations]
More mappings 
100 
[Diagrams: additional stage-to-device mappings, combining CPU cores and the GPU per stage]
Accounting for all alternatives 
• In general: nC CPU cores, 1 GPU and p pipeline stages 
# alternatives = 2^p x (nC + 2) 
• For Rodinia’s SRAD benchmark (p=6, nC=4) → 2^6 x 6 = 384 alternatives 
101 
[Diagram: a p-stage pipeline in which each stage may run on the GPU or on the nC CPU cores]
Framework and Model 
• Key idea: 
1. Run only on GPU 
2. Run only on CPU 
3. Analytically extrapolate for heterogeneous execution 
4. Find out the best configuration → RUN 
102 
[Diagram: the DP-MG configuration — each pipeline stage may run on the GPU or on the CPU cores; λ (throughput) and E (energy) are collected from the homogeneous CPU-only and GPU-only runs]
Environmental Setup: Benchmarks 
• Four Benchmarks 
– ViVid (Low and High Definition inputs) 
– SRAD 
– Tracking 
– Scene Recognition 
103 
[Pipelines: ViVid: Input → Filter → Histogram → Classifier; SRAD: Input → Extraction → Preparation → Reduction → Comp.1 → Comp.2 → Statistics; Tracking: Input → Tracking; Scene Recognition: Input → Feature extraction → SVM]
ViVid: throughput/energy on Ivy Bridge 
106 
LD (600x416) and HD (1920x1080) inputs on Ivy Bridge 
[Plots: throughput and throughput/energy vs. number of threads (CG); diagrams of the CP-CG mapping (GPU-CPU path vs. CPU path) and the CP-MG mapping (C = core)]
ViVid: throughput/energy on Haswell 
107 
LD and HD inputs on Haswell 
[Plots: throughput and throughput/energy vs. number of threads (CG); diagram of the CP-MG mapping (C = core)]
Does Higher Throughput imply Lower Energy? 
108 
[Plots for the HD input on Haswell: three metrics vs. number of threads (CG)]
SRAD on Ivy Bridge 
109 
[Plots: three metrics vs. number of threads (CG) on Ivy Bridge; diagrams of the CP-MG and DP-MG mappings of SRAD's six stages onto CPU cores and the GPU]
SRAD on Haswell 
110 
[Plots: three metrics vs. number of threads (CG) on Haswell; diagram of the DP-CG mapping of SRAD's six stages]
On-going work 
• Test the model in other heterogeneous chips 
• ViVid LD running on Odroid XU-E 
– Ivy Bridge: 65 fps, 0.7 J/frame and 92 fps/J with CP-CG(5) 
– Exynos 5: 0.7 fps, 11 J/frame and 0.06 fps/J with CP-CG(4) 
112 
[Plots: estimated vs. measured throughput and throughput/energy for CP-CG, DP-CG and CPU-CG with 1–4 threads]
Future Work 
• Consider energy for parallel loops scheduling/partitioning 
• Consider MARE as an alternative to our TBB-based implem. 
• Apply DVFS to reduce energy consumption 
– For video applications: no need to go faster than 33 fps 
• Also explore other parallel patterns: 
– Reduce 
– Parallel_do 
– … 
113
Conclusions 
• Plenty of heterogeneous on-chip architectures out there 
• It is important to use both devices 
– Need to find the best mapping/distribution/scheduling out of the 
many possible alternatives. 
• Programming models and runtimes aimed at this goal are in 
their infancy: they may have a huge impact in the mobile market 
• Challenges: 
– Hide hardware complexity 
– Consider energy in the partition/scheduling decisions 
– Minimize overhead of adaptation policies 
114
Collaborators 
• Mª Ángeles González Navarro (UMA, Spain) 
• Francisco Corbera (UMA, Spain) 
• Antonio Vilches (UMA, Spain) 
• Andrés Rodríguez (UMA, Spain) 
• Alejandro Villegas (UMA, Spain) 
• Rubén Gran (U. Zaragoza, Spain) 
• Maria Jesús Garzarán (UIUC, USA) 
• Mert Dikmen (UIUC, USA) 
• Kurt Fellows (UIUC, USA) 
• Ehsan Totoni (UIUC, USA) 
115
Questions 
asenjo@uma.es 
http://www.ac.uma.es/~asenjo 
116

More Related Content

What's hot

Computação acelerada – a era das ap us roberto brandão, ciência
Computação acelerada – a era das ap us   roberto brandão,  ciênciaComputação acelerada – a era das ap us   roberto brandão,  ciência
Computação acelerada – a era das ap us roberto brandão, ciência
Campus Party Brasil
 
RedGateWebinar - Where did my CPU go?
RedGateWebinar - Where did my CPU go?RedGateWebinar - Where did my CPU go?
RedGateWebinar - Where did my CPU go?
Kristofferson A
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
Kohei KaiGai
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Akihiro Hayashi
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
Angela Mendoza M.
 

What's hot (20)

Computação acelerada – a era das ap us roberto brandão, ciência
Computação acelerada – a era das ap us   roberto brandão,  ciênciaComputação acelerada – a era das ap us   roberto brandão,  ciência
Computação acelerada – a era das ap us roberto brandão, ciência
 
Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?
 
Available HPC resources at CSUC
Available HPC resources at CSUCAvailable HPC resources at CSUC
Available HPC resources at CSUC
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
 
GPU
GPUGPU
GPU
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
RedGateWebinar - Where did my CPU go?
RedGateWebinar - Where did my CPU go?RedGateWebinar - Where did my CPU go?
RedGateWebinar - Where did my CPU go?
 
CUDA
CUDACUDA
CUDA
 
AI Hardware Landscape 2021
AI Hardware Landscape 2021AI Hardware Landscape 2021
AI Hardware Landscape 2021
 
BKK16-TR08 How to generate power models for EAS and IPA
BKK16-TR08 How to generate power models for EAS and IPABKK16-TR08 How to generate power models for EAS and IPA
BKK16-TR08 How to generate power models for EAS and IPA
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
Easy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java ProgrammersEasy and High Performance GPU Programming for Java Programmers
Easy and High Performance GPU Programming for Java Programmers
 
Cuda
CudaCuda
Cuda
 
GPU for DL
GPU for DLGPU for DL
GPU for DL
 
GPU: Understanding CUDA
GPU: Understanding CUDAGPU: Understanding CUDA
GPU: Understanding CUDA
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
Cuda
CudaCuda
Cuda
 

Viewers also liked

Qualcomm Snapdragon 820 Product and Infographics
Qualcomm Snapdragon 820 Product and InfographicsQualcomm Snapdragon 820 Product and Infographics
Qualcomm Snapdragon 820 Product and Infographics
Mark Shedd
 
Qualcomm Snapdragon 600-based SmartPhone
Qualcomm Snapdragon 600-based SmartPhoneQualcomm Snapdragon 600-based SmartPhone
Qualcomm Snapdragon 600-based SmartPhone
JJ Wu
 
Snapdragon processors
Snapdragon processorsSnapdragon processors
Snapdragon processors
Deepak Mathew
 

Viewers also liked (10)

Qualcomm Snapdragon Processor
Qualcomm Snapdragon ProcessorQualcomm Snapdragon Processor
Qualcomm Snapdragon Processor
 
Qualcomm Snapdragon S4 Pro-based Smart Phone(Simple)
Qualcomm Snapdragon S4 Pro-based Smart Phone(Simple)Qualcomm Snapdragon S4 Pro-based Smart Phone(Simple)
Qualcomm Snapdragon S4 Pro-based Smart Phone(Simple)
 
Qualcomm Snapdragon 820 Product and Infographics
Qualcomm Snapdragon 820 Product and InfographicsQualcomm Snapdragon 820 Product and Infographics
Qualcomm Snapdragon 820 Product and Infographics
 
Snapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 ArchitectureSnapdragon SoC and ARMv7 Architecture
Snapdragon SoC and ARMv7 Architecture
 
Android Tools for Qualcomm Snapdragon Processors
Android Tools for Qualcomm Snapdragon Processors Android Tools for Qualcomm Snapdragon Processors
Android Tools for Qualcomm Snapdragon Processors
 
Qualcomm Snapdragon 600-based SmartPhone
Qualcomm Snapdragon 600-based SmartPhoneQualcomm Snapdragon 600-based SmartPhone
Qualcomm Snapdragon 600-based SmartPhone
 
SNAPDRAGON SoC Family and ARM Architecture
SNAPDRAGON SoC Family and ARM Architecture SNAPDRAGON SoC Family and ARM Architecture
SNAPDRAGON SoC Family and ARM Architecture
 
Snapdragon Processor
Snapdragon ProcessorSnapdragon Processor
Snapdragon Processor
 
Snapdragon processors
Snapdragon processorsSnapdragon processors
Snapdragon processors
 
Snapdragon
SnapdragonSnapdragon
Snapdragon
 

Similar to Programming Models for Heterogeneous Chips

CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
byteLAKE
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
Haris456
 

Similar to Programming Models for Heterogeneous Chips (20)

Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 
Available HPC resources at CSUC
Available HPC resources at CSUCAvailable HPC resources at CSUC
Available HPC resources at CSUC
 
A Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural NetworksA Dataflow Processing Chip for Training Deep Neural Networks
A Dataflow Processing Chip for Training Deep Neural Networks
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Stream Processing
Stream ProcessingStream Processing
Stream Processing
 
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese..."Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
"Making Computer Vision Software Run Fast on Your Embedded Platform," a Prese...
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
HSA Features
HSA FeaturesHSA Features
HSA Features
 
GPU enablement for data science on OpenShift | DevNation Tech Talk
GPU enablement for data science on OpenShift | DevNation Tech TalkGPU enablement for data science on OpenShift | DevNation Tech Talk
GPU enablement for data science on OpenShift | DevNation Tech Talk
 
Infrastructure optimization for seismic processing (eng)
Infrastructure optimization for seismic processing (eng)Infrastructure optimization for seismic processing (eng)
Infrastructure optimization for seismic processing (eng)
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
 
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
[OpenStack Days Korea 2016] Track3 - OpenStack on 64-bit ARM with X-Gene
 
Ac922 cdac webinar
Ac922 cdac webinarAc922 cdac webinar
Ac922 cdac webinar
 
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Using FPGA in Embedded Devices
Using FPGA in Embedded DevicesUsing FPGA in Embedded Devices
Using FPGA in Embedded Devices
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 

More from Facultad de Informática UCM

More from Facultad de Informática UCM (20)

¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?¿Por qué debemos seguir trabajando en álgebra lineal?
¿Por qué debemos seguir trabajando en álgebra lineal?
 
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
TECNOPOLÍTICA Y ACTIVISMO DE DATOS: EL MAPEO COMO FORMA DE RESILIENCIA ANTE L...
 
DRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation ComputersDRAC: Designing RISC-V-based Accelerators for next generation Computers
DRAC: Designing RISC-V-based Accelerators for next generation Computers
 
uElectronics ongoing activities at ESA
uElectronics ongoing activities at ESAuElectronics ongoing activities at ESA
uElectronics ongoing activities at ESA
 
Tendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura ArmTendencias en el diseño de procesadores con arquitectura Arm
Tendencias en el diseño de procesadores con arquitectura Arm
 
Formalizing Mathematics in Lean
Formalizing Mathematics in LeanFormalizing Mathematics in Lean
Formalizing Mathematics in Lean
 
Introduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented ComputingIntroduction to Quantum Computing and Quantum Service Oriented Computing
Introduction to Quantum Computing and Quantum Service Oriented Computing
 
Computer Design Concepts for Machine Learning
Computer Design Concepts for Machine LearningComputer Design Concepts for Machine Learning
Computer Design Concepts for Machine Learning
 
Inteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuroInteligencia Artificial en la atención sanitaria del futuro
Inteligencia Artificial en la atención sanitaria del futuro
 
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 Design Automation Approaches for Real-Time Edge Computing for Science Applic... Design Automation Approaches for Real-Time Edge Computing for Science Applic...
Design Automation Approaches for Real-Time Edge Computing for Science Applic...
 
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
Estrategias de navegación para robótica móvil de campo: caso de estudio proye...
 
Fault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error CorrectionFault-tolerance Quantum computation and Quantum Error Correction
Fault-tolerance Quantum computation and Quantum Error Correction
 
Cómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intentoCómo construir un chatbot inteligente sin morir en el intento
Cómo construir un chatbot inteligente sin morir en el intento
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPC
 
Type and proof structures for concurrency
Type and proof structures for concurrencyType and proof structures for concurrency
Type and proof structures for concurrency
 
Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...Hardware/software security contracts: Principled foundations for building sec...
Hardware/software security contracts: Principled foundations for building sec...
 
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
Jose carlossancho slidesLa seguridad en el desarrollo de software implementad...
 
Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?Do you trust your artificial intelligence system?
Do you trust your artificial intelligence system?
 
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.Redes neuronales y reinforcement learning. Aplicación en energía eólica.
Redes neuronales y reinforcement learning. Aplicación en energía eólica.
 
Challenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore windChallenges and Opportunities for AI and Data analytics in Offshore wind
Challenges and Opportunities for AI and Data analytics in Offshore wind
 

Recently uploaded

Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
SanaAli374401
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
MateoGardella
 

Recently uploaded (20)

ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 

Programming Models for Heterogeneous Chips

  • 19. AMD Kaveri • Steamroller microarch (2 – 4 “Cores”) + 8 GCN Cores. 19 http://wccftech.com/
  • 20. AMD Kaveri • Steamroller microarch. – Each module → 2 “Cores”. – 2 threads, each with • 4x superscalar INT • 2x SIMD4 FP – 3.7GHz • Max GFLOPS: 3.7 GHz x 4 threads x 4 wide x 2 fmad = 118 GFLOPS 20
  • 21. AMD Graphics Core Next (GCN) • In Kaveri, the GCN part takes 47% of the die – 8 Compute Units (CU) – Each CU: 4 SIMD16 – Each SIMD16: 16 lanes – Total: 512 FPUs – 720 MHz • Max GFLOPS = 0.72 GHz x 512 FPUs x 2 fmad = 737 GFLOPS • CPU+GPU → 855 GFLOPS 21
  • 22. OpenCL execution on GCN • Work-groups are split into wavefronts (64 work-items), which are assigned to the wavefront pools of the SIMD units (diagram: a WG mapped onto the SIMD0–SIMD3 pools of CU0) – 4 pools: 4 wavefronts in flight per SIMD – 4 ck to execute each wavefront 22
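  As a worked check of these numbers (not from the slide): a 256 work-item work-group maps to 4 wavefronts of 64 work-items; since each of the 4 SIMD16 units can keep 4 wavefronts in flight, one CU can hold up to 16 wavefronts (1024 work-items) in flight, and a SIMD16 unit needs 64/16 = 4 clocks to step a wavefront through.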
  • 23. HSA (Heterogeneous System Architecture) • HSA Foundation’s goal: productivity on heterogeneous HW – CPU, GPU, DSPs… • Scheduled in three phases → • Second phase: Kaveri – hUMA – Same pointers used on CPU and GPU – Cache coherency 23
  • 24. Kaveri’s main HSA features • hUMA – Shared and coherent view of up to 32GB • Heterogeneous queuing (hQ) – CPU and GPU can create and dispatch work 24
  • 25. HSA Motivation • Too many steps to get the job done (Application ↔ OS ↔ GPU flow): transfer buffer to GPU → copy/map memory → queue job → schedule job → start job → finish job → schedule application → get buffer → copy/map memory 25 http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
  • 26. Requirements • Lower-overhead job dispatch requires four mechanisms: – Shared Virtual Memory • Send pointers (not data) back and forth between HSA agents. – System Coherency • Data accesses to the global memory segment from all HSA agents shall be coherent without the need for explicit cache maintenance – Signaling • HSA agents can directly create/access signal objects. – Signaling a signal object (this will wake up HSA agents waiting upon the object) – Query the current object – Wait on the current object (various conditions supported). – User mode queueing • Enables user space applications to enqueue jobs (“Dispatch Packets”) for HSA agents directly, without OS intervention. 26
  • 27. Non-HSA Shared Virtual Memory • Multiple virtual memory address spaces: the CPU and the GPU each have their own virtual space (VA1→PA1 on the CPU, VA2→PA1 on the GPU) mapped onto physical memory 27
  • 28. HSA Shared Virtual Memory • Common virtual memory for all HSA agents: CPU and GPU share the same VA→PA mapping onto physical memory 28
  • 29. After adding SVM • With SVM we get rid of copying/mapping memory back and forth (those steps disappear from the dispatch flow above) 29
  • 30. After adding coherency • If the CPU allocates a global pointer, the GPU sees that value (no explicit cache maintenance in the flow) 30
  • 31. After adding signaling • The CPU can wait on a signal object (the corresponding OS scheduling steps drop out of the flow) 31
  • 32. After adding user-level enqueuing • The user directly enqueues the job without OS intervention 32
  • 33. Success!! • That’s definitely way simpler and with less overhead: the flow reduces to queue job → start job → finish job 33
  • 34. OpenCL 2.0 • OpenCL 2.0 will contain most of the features of HSA – Intel’s version supports HSA for Core M (Broadwell). Windows. – AMD’s version does not support fine-grain SVM. • AMD 1.2 beta driver – http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-1-2-beta-driver/ – Only for Windows 8.1 – Example: allocating “Coherent Host Memory” on Kaveri: 34
    #include <CL/cl_ext.h>     // Implements SVM
    #include "hsa_helper.h"    // AMD helper functions
    …
    cl_svm_mem_flags_amd flags = CL_MEM_READ_WRITE |
                                 CL_MEM_SVM_FINE_GRAIN_BUFFER_AMD |
                                 CL_MEM_SVM_ATOMICS_AMD;
    volatile std::atomic_int * data;
    data = (volatile std::atomic_int *) clSVMAlloc(context, flags,
                MAX_DATA * sizeof(volatile std::atomic_int), 0);
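  A brief sketch of how such an SVM allocation is then used (illustrative, not from the slide): the same pointer is handed to a kernel and dereferenced on both sides. In standard OpenCL 2.0 this is done with clSetKernelArgSVMPointer; the kernel and queue names below are made up:
    clSetKernelArgSVMPointer(kernel, 0, (void *) data);   // the GPU receives the raw CPU pointer
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    // Meanwhile the CPU can keep operating on the same buffer, e.g.:
    data[0].fetch_add(1);   // coherent with GPU atomics thanks to fine-grain SVM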
  • 35. Samsung Exynos 5 • Odroid XU-E and XU3 bare boards (180$) • Sports an Exynos 5 Octa 35 – big.LITTLE architecture – big: Cortex-A15 quad – LITTLE: Cortex-A7 quad • Exynos 5 Octa 5410 – Only 4 CPU-cores active at a time – GPU: PowerVR SGX544MP3 (Imagination Technologies) • 3 GPU-cores at 533MHz → 51 GFLOPS • Exynos 5 Octa 5422 – All 8 CPU-cores can be working simultaneously – GPU: ARM Mali-T628 MP6 • 6 GPU-cores at 533MHz → 102 GFLOPS
  • 36. PowerVR SGX544MP3 • OpenCL 1.1 for Android • Some limitations: – Compute units: 1 – Max WG size: 1 – Local mem: 1KB – Peak MFLOPS: • 12 ops per ck • 3 SIMD-ALUs x 4-wide • Power monitor: – 4 x INA231 monitors • A15, A7, GPU, Mem. • Instantaneous power • Every 260ms 36 (SGX architecture diagram)
  • 38. ARM Mali-T628 MP6 • Supporting: – OpenGL® ES 3.0 – OpenCL™ 1.1 – DirectX® 11 – Renderscript™ • L2 cache size – 32–256KB per core • 6 Cores – 16 FP units – 2 SIMD4 each • Other – Built-in MMU – Standard ARM Bus • AMBA 4 ACE-Lite 38 (Mali architecture diagram)
  • 41. Snapdragon 800 • CPU: Quad-core Krait 400 up to 2.26GHz (ARMv7 ISA) – Similar to Cortex-A15. 11-stage integer pipeline with 3-way decode and 4-way out-of-order speculative-issue superscalar execution – Pipelined VFPv4 and 128-bit wide NEON (SIMD) – 4 KB + 4 KB direct-mapped L0 cache – 16 KB + 16 KB 4-way set-associative L1 cache – 2 MB (quad-core) L2 cache • GPU: Adreno 330, 450MHz – OpenGL ES 3.0, DirectX, OpenCL 1.2, RenderScript – 32 Execution Units, each with 2 x SIMD4 units • DSP: Hexagon 600MHz 41
  • 42. Measuring power • Snapdragon Performance Visualizer • Trepn Profiler • Power Tutor – Tuned for Nexus One – Model with 5% precision – Open Source 42
  • 43. More development boards • Jetson TK1 board – Tegra K1 – Kepler GPU with 192 CUDA cores – 4-Plus-1 quad-core ARM Cortex A15 – Linux + CUDA – 180$ • Arndale – Exynos 5420 – big.LITTLE (A15 + A7) – GPU Mali T628 MP6 – Linux + OpenCL – 200$ • … 43
  • 44. Advantages of integrated GPUs • Discrete and integrated GPUs: different goals – NVIDIA Kepler: 2880 CUDA cores, 235W, 4.3 TFLOPS – Intel Iris 5200: 40 EU x 8 SIMD, 15-28W, 0.83 TFLOPS – PowerVR: 3 EU x 16 SIMD, < 1W, 0.051 TFLOPS • Higher bandwidth between CPU and GPU. – Shared DRAM • Avoid PCI data transfer – Shared LLC (Last Level Cache) • Data coherence in some cases… • CPU and GPU may have similar performance – It’s more likely that they can collaborate • Cheaper! 44
  • 45. Integrated GPUs are also improving 45
  • 46. Agenda • Motivation • Hardware – Heterogeneous chips – Integrated GPUs – Advantages • Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB 46
  • 47. Programming models for heterogeneous • Targeted at single device – CUDA (NVIDIA) – OpenCL (Khronos Group Standard) – OpenACC (C, C++ or Fortran + Directives à OpenMP 4.0) – C++AMP (Windows’ extension of C++. Recently HSA announced own ver.) – RenderScript (Google’s Java API for Android) – ParallDroid (Java + Directives from ULL, Spain) – Many more (Sycl, Numba Python, IBM Java, Matlab, R, JavaScript, …) • Targeted at several devices (discrete GPUs) – Qilin (C++ and Qilin API compiled to TBB+CUDA) – OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler) – XKaapi – StarPU • Targeted at several devices (integrated GPUs) – Qualcomm MARE – Intel Concord 47
  • 48. OpenCL on mobile devices 48 http://streamcomputing.eu/blog/2014-06-30/opencl-support-recent-android-smartphones/
  • 49. OpenCL running on CPU • Execution time (ms) of several code versions (Base, Auto, T-Auto, SSE, AVX-SSE, AVX, OpenCL) on a 3.3 GHz Ivy Bridge CPU • The AVX code version is – 1.8x faster than OpenCL – but takes 1.8x more Halstead effort 49 “Easy, Fast and Energy Efficient Object Detection on Heterogeneous On-Chip Architectures”, E. Totoni, M. Dikmen, M. J. Garzaran, ACM Transactions on Architecture and Code Optimization (TACO), 10(4), December 2013.
  • 50. Complexities of AVX Intrinsics 50
    __m256 image_cache0 = _mm256_broadcast_ss(&fr_ptr[pixel_offsets[0]]);           // Load
    curr_filter = _mm256_load_ps(&fb_array[fi]);
    temp_sum = _mm256_add_ps(_mm256_mul_ps(image_cache7, curr_filter), temp_sum);   // Multiply-add
    temp_sum2 = _mm256_insertf128_ps(temp_sum,
                _mm256_extractf128_ps(temp_sum, 1), 0);                             // Copy high to low
    cpm = _mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS);                            // Compare
    r = _mm256_movemask_ps(cpm);
    if (r & (1<<1)) {
      best_ind = filter_ind + 2;                                                    // Store index
      int control = 1 | (1<<2) | (1<<4) | (1<<6);
      max_fil = _mm256_permute_ps(temp_sum2, control);                              // Store max
      r = _mm256_movemask_ps(_mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS));
    }
  • 51. OpenCL doesn’t have to be tough Courtesy: Khronos Group 51
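  To illustrate the point of this slide, the device side of a simple data-parallel operation is only a few lines of OpenCL C (a hypothetical sketch; the kernel and argument names are made up here, not taken from the talk):
    // Scale every element of a buffer by a constant factor.
    __kernel void scale(__global float *a, const float factor) {
        int i = get_global_id(0);   // one work-item per element
        a[i] *= factor;
    }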
  • 52. Libraries and languages using OpenCL Courtesy: AMD 52
  • 53. Libraries and languages using OpenCL Courtesy: AMD 53
  • 54. Libraries and languages using OpenCL (cont.) Courtesy: AMD 54
  • 55. Libraries and languages using OpenCL (cont.) Courtesy: AMD 55
  • 56. C++AMP • C++ Accelerated Massive Parallelism • Pioneered by Microsoft – Requirements: Windows 7 + Visual Studio 2012 • Followed by Intel's experimental implementation – C++ AMP on Clang/LLVM and OpenCL (AWOL since 2013) • Now HSA Foundation taking the lead • Keywords: restrict(device), array_view, parallel_for_each,… – Example: SUM = A + B; // (2D arrays) 56
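  A minimal sketch of that SUM = A + B example in Microsoft’s C++ AMP syntax (illustrative only; note that Microsoft’s published keyword is restrict(amp), and the function and variable names here are made up):
    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    void sum_2d(const std::vector<float> &a, const std::vector<float> &b,
                std::vector<float> &sum, int rows, int cols) {
      array_view<const float, 2> A(rows, cols, a);   // wraps host data, copied on demand
      array_view<const float, 2> B(rows, cols, b);
      array_view<float, 2> SUM(rows, cols, sum);
      SUM.discard_data();                            // no need to copy SUM to the device
      extent<2> e(rows, cols);
      parallel_for_each(e,                           // one GPU thread per 2D element
                        [=](index<2> idx) restrict(amp) {
        SUM[idx] = A[idx] + B[idx];
      });
      SUM.synchronize();                             // copy the result back to the host
    }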
  • 57. OpenCL Ecosystem Courtesy: Khronos Group 57
  • 58. SYCL’s flavour: A[i]=B[i]*2 58 Work-in-progress developments: - AMD: triSYCL → https://github.com/amd/triSYCL - Codeplay: http://www.codeplay.com/ Advantages: 1. Easy to understand the concept of work-groups 2. Performance-portable between CPU and GPU 3. Barriers are automatically deduced
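  For reference, a minimal sketch of the A[i] = B[i]*2 kernel in SYCL 1.2-style syntax (illustrative; the API has evolved since this talk and the names below are made up):
    #include <CL/sycl.hpp>
    #include <vector>
    using namespace cl::sycl;

    void scale_by_two(std::vector<float> &A, std::vector<float> &B) {
      buffer<float, 1> bufA(A.data(), range<1>(A.size()));
      buffer<float, 1> bufB(B.data(), range<1>(B.size()));
      queue q;                                       // default device (CPU or GPU)
      q.submit([&](handler &cgh) {
        auto a = bufA.get_access<access::mode::write>(cgh);
        auto b = bufB.get_access<access::mode::read>(cgh);
        cgh.parallel_for<class scale2>(range<1>(A.size()), [=](id<1> i) {
          a[i] = b[i] * 2.0f;
        });
      });
    }                                                // buffers copy back when they go out of scope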
  • 59. StarPU • A runtime system for heterogeneous architectures • Dynamically schedule tasks on all processing units – See a pool of heterogeneous cores • Avoid unnecessary data transfers between accelerators – Software SVM for heterogeneous machines 59 (diagram: A = A+B, with the pieces of A and B spread across CPU and GPU memories)
  • 60. Overview of StarPU • Maximizing PU occupancy, minimizing data transfers • Ideas: – Accept tasks that may have multiple implementations • Together with potential inter-dependencies – Leads to a dynamic acyclic graph of tasks – Provide a high-level data management layer (Virtual Shared Memory, VSM) • Application should only describe – which data may be accessed by tasks – how data may be divided 60 (software stack: Applications / Parallel Compilers / Parallel Libraries → StarPU → Drivers (CUDA, OpenCL) → CPU, GPU, …)
  • 61. Tasks scheduling • Dealing with heterogeneous hardware accelerators • Tasks = – Data input & output – Dependencies with other tasks – Multiple implementations • E.g. CUDA + CPU • Scheduling hints • StarPU provides an Open Scheduling platform – Scheduling algorithm = plug-ins – Predefined set of popular policies 61 (diagram: a task f(A RW, B R) with cpu/gpu/spu implementations submitted to the StarPU stack)
  • 62. Tasks scheduling • Predefined set of popular policies • Eager Scheduler – First come, first served policy – Only one queue • Work Stealing Scheduler – Load balancing policy – One queue per worker • Priority Scheduler – Describe the relative importance of tasks – One queue per priority 62
  • 63. Tasks scheduling • Predefined set of popular policies • Dequeue Model (DM) Scheduler – Using codelet performance models • Kernel calibration on each available computing device – Raw history model of kernels’ past execution times – Refined models using regression on kernels’ execution-time history • Dequeue Model Data Aware (DMDA) Scheduler – Data transfer cost vs kernel offload benefit – Transfer cost modelling – Bus calibration 63 (diagrams: per-worker timelines showing where DM and DMDA place a new task)
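  As a usage note (from StarPU’s documentation, not this talk): the scheduling policy is normally selected at run time through the STARPU_SCHED environment variable, so the same binary can be tried with the different plug-ins, e.g. STARPU_SCHED=eager ./app, STARPU_SCHED=ws ./app or STARPU_SCHED=dmda ./app.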
  • 64. Some results (MxV, 4 CPUs, 1 GPU) SPU config: Eager, 3 CPUs, 1GPU SPU config: DMDA, 3 CPUs, 1GPU SPU config: Eager, 4 CPUs, 1GPU SPU config: DMDA, 4 CPUs, 1GPU 64
  • 65. Terminology • A Codelet. . . – . . . relates an abstract computation kernel to its implementation(s) – . . . can be instantiated into one or more tasks – . . . defines characteristics common to a set of tasks • A Task. . . – . . . is an instantiation of a Codelet – . . . atomically executes a kernel from its beginning to its end – . . . receives some input – . . . produces some output • A Data Handle. . . – . . . designates a piece of data managed by StarPU – . . . is typed (vector, matrix, etc.) – . . . can be passed as input/output for a Task 65
  • 66. Basic Example: Scaling a Vector 66
    Declaring a Codelet (kernel functions, number of data pieces, data access mode):
    struct starpu_codelet scal_cl = {
      .cpu_funcs  = { scal_cpu_f, NULL },
      .cuda_funcs = { scal_cuda_f, NULL },
      .nbuffers   = 1,
      .modes      = { STARPU_RW },
    };

    Kernel functions:
    void scal_cpu_f(void *buffers[], void *cl_arg) {                 // kernel function prototype
      struct starpu_vector_interface *vector_handle = buffers[0];    // retrieve data handle
      float *vector = STARPU_VECTOR_GET_PTR(vector_handle);          // get pointer from data handle
      float *ptr_factor = cl_arg;                                    // get small-size inline data
      for (unsigned i = 0; i < NX; i++)                              // do computation
        vector[i] *= *ptr_factor;
    }
    void scal_cuda_f(void *buffers[], void *cl_arg) { … }
  • 67. Basic Example: Scaling a Vector 67
    Main code:
    float factor = 3.14;
    float vector1[NX];
    float vector2[NX];
    starpu_data_handle_t vector_handle1;                 /* declare data handles */
    starpu_data_handle_t vector_handle2;
    /* ..... */
    /* register pieces of data and get the handles (now under StarPU control) */
    starpu_vector_data_register(&vector_handle1, 0, (uintptr_t)vector1, NX, sizeof(vector1[0]));
    starpu_vector_data_register(&vector_handle2, 0, (uintptr_t)vector2, NX, sizeof(vector2[0]));
    /* non-blocking task submits (params: codelet, StarPU-managed data, small-size inline data) */
    starpu_task_insert(&scal_cl, STARPU_RW, vector_handle1,
                       STARPU_VALUE, &factor, sizeof(factor), 0);
    starpu_task_insert(&scal_cl, STARPU_RW, vector_handle2,
                       STARPU_VALUE, &factor, sizeof(factor), 0);
    /* wait for all tasks submitted so far */
    starpu_task_wait_for_all();
    /* unregister pieces of data (the handles are destroyed,
       the vectors are now back under user control) */
    starpu_data_unregister(vector_handle1);
    starpu_data_unregister(vector_handle2);
    /* ..... */
  • 68. Qualcomm MARE • MARE is a programming model and a runtime system that provides simple yet powerful abstractions for parallel, power-efficient software – Simple C++ API allows developers to express concurrency – User-level library that runs on any Android device, and on Linux, Mac OS X, and Windows platforms • The goal of MARE is to reduce the effort required to write apps that fully utilize heterogeneous SoCs • Concepts: – Tasks are units of work that can be asynchronously executed – Groups are sets of tasks that can be canceled or waited on 68
  • 70. More complex example: C=A+B on GPU 70
  • 71. More complex example: C=A+B on GPU 71
  • 72. MARE departures • Similarities with TBB – Based on tasks and 2-level API (task level and templates) • pfor_each, ptransform, pscan, … • Synchronous Dataflow classes ≈ TBB’s Flow Graphs – Concurrent data structures: queue, stack, … • Departures – Expression of dependencies is first class – Flexible group membership and work or group cancellation – Optimized for some Qualcomm chips • Power classes: – Static: mare::power::mode {efficient, saver, …} – Dynamic: mare::power::set_goal(desired, tolerance) • Aware of the mobile architecture: aggressive power management – Cores can be shut down or affected by DVFS 72
  • 73. MARE results • Zoomm web browser implemented on top of MARE C. Cascaval, et al.. ZOOMM: a parallel web browser engine for multicore mobile devices. In Symposium on Principles and practice of parallel programming, PPoPP ’13, pages 271–280, 2013. 73
  • 74. MARE results • Bullet Physics parallelized with MARE Courtesy: Calin Cascaval 74
  • 75. Intel Concord • C++ heterogeneous programming framework for integrated CPU and GPU processors – Shared Virtual Memory (SVM) in software – Adapts existing data-parallel C++ constructs to heterogeneous computing using TBB – Available open source as Intel Heterogeneous Research Compiler (iHRC) at https://github.com/IntelLabs/iHRC/ • Papers: – Rajkishore Barik, Tatiana Shpeisman, et al. Efficient mapping of irregular C++ applications to integrated GPUs. CGO 2014. – Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian Lewis, Chunling Hu, and Keshav Pingali. Adaptive heterogeneous scheduling on integrated GPUs. PACT 2014. 75
  • 76. Intel Concord • Extend TBB API: – parallel_for_hetero (int numiters, const Body &B, bool device); – parallel_reduce_hetero (int numiters, const Body &B, bool device); Courtesy: Intel 76
  • 77. Example: parallel_for_hetero • Concord compiler generates OpenCL version – Automatically takes care of the data thanks to SVM Courtesy: Intel 77
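  By analogy with TBB’s parallel_for, a call to the extended API above might look like the following sketch (the body class is illustrative, not Concord’s actual example):
    // Hypothetical body functor: scale one element per iteration.
    class ScaleBody {
      float *a; float factor;
    public:
      ScaleBody(float *a_, float f_) : a(a_), factor(f_) {}
      void operator()(int i) const { a[i] *= factor; }   // also compiled to OpenCL for the GPU
    };

    // Run the loop on the GPU (true) or on the CPU cores (false):
    parallel_for_hetero(n, ScaleBody(a, 3.14f), /*device=*/true);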
  • 79. SVM SW implementation on Haswell 79
  • 80. SVM translation in OpenCL code • svm_const is a runtime constant and is computed once • Every CPU pointer before dereference on the GPU is converted into GPU address space using AS_GPU_PTR 80
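  Schematically, the generated OpenCL code can be thought of as applying a fixed offset to every shared pointer before it is dereferenced (a sketch of the idea only; the macro Concord actually emits may differ):
    // svm_const = offset between the CPU and GPU views of the shared heap,
    // computed once at startup.
    #define AS_GPU_PTR(T, p) ((__global T *)((uintptr_t)(p) + svm_const))

    // CPU-side code:        node->next->value
    // GPU-side OpenCL code: AS_GPU_PTR(Node, AS_GPU_PTR(Node, node)->next)->value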
  • 82. Speedup & Energy savings vs multicore CPU 82
  • 83. Heterogeneous execution on both devices • Iteration space distributed among available devices • Problem: find the best data partition • Example: Barnes Hut and Facedetect relative exect. time – Varying the amount of work offloaded to the GPU – For BH the optimum is 40% of the work carried out on the GPU – For FD the optimum is 0% of the work carried out on the GPU 83
  • 84. Partitioning based on on-line profiling • Naïve profiling: assign a profiling chunk to both the CPU and the GPU → compute the chunk on the CPU and on the GPU → barrier → partition the rest of the iteration space according to the relative speeds • Asymmetric profiling: assign a profiling chunk only to the GPU while the CPU starts computing → when the GPU chunk is done, partition the rest of the iteration space according to the relative speeds (see the sketch below) 84
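  A minimal sketch of partitioning by relative speeds (illustrative C++, not the authors’ implementation; names are made up):
    // After profiling a chunk on each device we have two throughputs
    // (iterations per second); split the remaining iterations proportionally.
    struct Split { long gpu_iters; long cpu_iters; };

    Split partition_rest(long remaining, double thr_gpu, double thr_cpu) {
      double f_gpu = thr_gpu / (thr_gpu + thr_cpu);   // fraction that keeps both devices busy
      Split s;
      s.gpu_iters = static_cast<long>(remaining * f_gpu);
      s.cpu_iters = remaining - s.gpu_iters;
      return s;
    }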
  • 85. Agenda • Motivation • Hardware – Heterogeneous chips – Integrated GPUs – Advantages • Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB 85
  • 86. Our heterogeneous parallel_for Angeles Navarro, Antonio Vilches, Francisco Corbera and Rafael Asenjo Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures, The Journal of Supercomputing, May 2014 86
  • 87. Comparison with StarPU • MxV benchmark – Three schedulers tested: greedy, work-stealing, HEFT – Static chunk size: 2000, 200 and 20 matrix rows 91
  • 88. Choosing the GPU block size • Belviranli, M. E., Bhuyan, L. N., & Gupta, R. (2013). A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim., 9(4), 57:1–57:20. 92
  • 89. GPU block size for irregular codes • Adapt between time-steps and inside the time-step 93 (plot: Barnes-Hut, average throughput per chunk size for chunk sizes from 40 to 81960, comparing static and adaptive chunk selection at time-steps 0 and 30)
  • 90. GPU block size for irregular codes • Throughput variation along the iteration space – For two different time-steps – Different GPU chunk sizes 94 (plots: Barnes-Hut throughput along the iteration space at time-steps 0 and 5 for chunk sizes 320, 640, 1280 and 2560)
  • 91. Adapting the GPU chunk-size • Assumption: – Irregular behavior seen as a sequence of regimes of regular behavior • The GPU throughput λG is fitted as a logarithmic function of the chunk size x, λG ≈ a·ln(x)+b; the chunk size is increased (up to doubling the previous one, G(t-1)*2) while the measured λG grows and decreased (down to halving it, G(t-1)/2) when λG drops 95 (plots: GPU throughput and LogFit chunk size along the Barnes-Hut iteration space)
  • 92. Preliminary results: Energy-Performance on Haswell • Static: Oracle-like static partition of the work based on profiling • Concord: Intel approach: GPU size computed once • HDSS: Belviranli et al. approach: GPU size computed once • LogFit: our dynamic CPU and GPU chunk-size partitioner 96 (plots: Barnes-Hut energy per iteration vs performance for the four partitioners, and offline search for the best static partition as a function of the percentage of the iteration space offloaded to the GPU)
  • 93. Preliminary results: Energy-Performance • w.r.t. Static: up to 52% (18% on average) • w.r.t. Concord and HDSS: up to 94% and 69% (28% and 27% on average) 97 (plots: energy per iteration vs performance for CFD, SpMV, Nbody and one further benchmark, comparing Static, Concord, HDSS and LogFit)
  • 94. Our heterogeneous pipeline • ViVid, an object detection application • Contains three main kernels that form a pipeline • Would like to answer the following questions: – Granularity: coarse or fine grained parallelism? – Mapping of stages: where do we run them (CPU/GPU)? – Number of cores: how many of them when running on CPU? – Optimum: what metric do we optimize (time, energy, both)? 98 (pipeline: input frame → Stage 1 Filter → index mtx. → Stage 2 Histogram → histograms → Stage 3 Classifier → detection response mtx.)
  • 95. Granularity • Coarse grain: CG • Medium grain: MG • Fine grain – Fine grain also in the CPU via AVX intrinsics 99 (diagrams: item flow through the Input Stage, Stages 1–3 and the Output Stage for each granularity, on CPU cores (C) and the GPU)
  • 96. More mappings 100 (diagrams: further combinations of the pipeline stages mapped to the CPU cores and the GPU)
  • 97. Accounting for all alternatives • In general: nC CPU cores, 1 GPU and p pipeline stages → # alternatives = 2^p x (nC + 2) • For Rodinia’s SRAD benchmark (p=6, nC=4) → 2^6 x 6 = 384 alternatives 101
  • 98. Framework and Model • Key idea: 1. Run only on GPU 2. Run only on CPU 3. Analytically extrapolate for heterogeneous execution 4. Find out the best configuration → RUN 102 (diagram: a DP-MG mapping used to collect λ and E, the homogeneous throughput and energy values)
  • 99. Environmental Setup: Benchmarks • Four benchmarks – ViVid (Low and High Definition inputs): 3 stages (Filter, Histogram, Classifier) – SRAD: 6 stages (Extrac., Prep., Reduct., Comp. 1, Comp. 2, Statist.) – Tracking: 1 stage (Track.) – Scene Recognition: 2 stages (Feature, SVM) 103
  • 100. ViVid: throughput/energy on Ivy Bridge 106 (plots: results vs number of threads (CG) for the LD (600x416) and HD (1920x1080) inputs on Ivy Bridge; diagrams: the CP-CG mapping, with a GPU-CPU path and a CPU path, and the CP-MG mapping)
  • 101. ViVid: throughput/energy on Haswell 107 (plots: results vs number of threads (CG) for the LD and HD inputs on Haswell; diagram: the CP-MG mapping)
  • 102. Does Higher Throughput imply Lower Energy? 108 (plots for Haswell, HD input, as a function of the number of threads (CG))
  • 103. SRAD on Ivy Bridge 109 (plots: results vs number of threads (CG) on Ivy Bridge; diagrams: the CP-MG and DP-MG mappings of the six SRAD stages across CPU cores and GPU)
  • 104. SRAD on Haswell 110 (plots: results vs number of threads (CG) on Haswell; diagram: the DP-CG mapping of the six SRAD stages)
  • 105. On-going work • Test the model in other heterogeneous chips • ViVid LD running on Odroid XU-E – Ivy Bridge: 65 fps, 0.7 J/frame and 92 fps/J with CP-CG(5) – Exynos 5: 0.7 fps, 11 J/frame and 0.06 fps/J with CP-CG(4) 112 (plots: estimated vs measured throughput and throughput/energy for CP-CG, DP-CG and CPU-CG with 1–4 threads)
  • 106. Future Work • Consider energy for parallel loops scheduling/partitioning • Consider MARE as an alternative to our TBB-based implem. • Apply DVFS to reduce energy consumption – For video applications: no need to go faster than 33 fps • Also explore other parallel patterns: – Reduce – Parallel_do – … 113
  • 107. Conclusions • Plenty of heterogeneous on-chip architectures out there • It is important to use both devices – Need to find the best mapping/distribution/scheduling out of the many possible alternatives. • Programming models and runtimes aimed at this goal are in their infancy: they may have a huge impact on the mobile market • Challenges: – Hide hardware complexity – Consider energy in the partition/scheduling decisions – Minimize overhead of adaptation policies 114
  • 108. Collaborators • Mª Ángeles González Navarro (UMA, Spain) • Francisco Corbera (UMA, Spain) • Antonio Vilches (UMA, Spain) • Andrés Rodríguez (UMA, Spain) • Alejandro Villegas (UMA, Spain) • Rubén Gran (U. Zaragoza, Spain) • Maria Jesús Garzarán (UIUC, USA) • Mert Dikmen (UIUC, USA) • Kurt Fellows (UIUC, USA) • Ehsan Totoni (UIUC, USA) 115