2. DISCLAIMER:
This presentation is not an Official HSA Foundation
presentation.
Most of the material is taken from the HSA presentations at HotChips 2013
Some slides contain my own insights / opinions
5. HISTORIC PERSPECTIVE
Accelerated System
Program runs on CPU
API to access Accelerators
ASIC or Firmware
Configurable, but operation is fixed
Heterogeneous System
Program runs on CPU
Offloads work to accelerators (GPU, DSP, etc.)
Offloaded work is JITed (compiled at runtime)
Distributed → SoC based
6. HSA FOUNDATION
Originated from AMD’s FSA – Fusion System Architecture
HSA Foundation Founded in June 2012
8. WHAT IS HSA ALL ABOUT? (MY TAKE)
“Bring Accelerators forward as a first class processor”
Unified address space, pageable memory, coherency
Eliminate drivers from dispatch path (user mode queues)
Standardized SW stack built on top of a set of HW requirements
Improve interoperability between IP vendors
Unified Architecture for Accelerators
Start from GPUs, extend to DSPs / FPGAs / fixed-function accelerators, etc.
SoC Centric
Major features are optimal for a SoC environment (same memory/die)
Support of distributed systems is possible, yet inefficient (PCI atomics, others)
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
9. HSA WORKING GROUPS
HSA Systems Architecture
hUMA – Unified Memory Model
hQ – HSA Queuing Model
HSA Programmer Reference Specification
HSAIL – HSA Intermediate Language
HSA System Runtime
HSA Compliance
HSA Tools
http://hsafoundation.com/standards/
10. OPENCL™ AND HSA
HSA is an optimized platform architecture for OpenCL™
Not an alternative to OpenCL™
OpenCL™ on HSA will benefit from:
Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU
OpenCL™ 2.0 shows considerable alignment with HSA
Many HSA member companies are also active with Khronos in the OpenCL™ working group
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
12. hUMA
HSA Unified Memory Architecture
Evolution of CPU / GPU memory systems:
1. CPU uses virtual addresses, GPU uses physical addresses
Memory had to be pinned
GPU can access only a limited area of CPU memory (the aperture)
Requires a copy from system memory to GPU-visible memory
Pointer-based data structures can't be shared
2. CPU uses virtual addresses, GPU uses virtual addresses (but not the same ones)
Memory still has to be pinned
GPU can access the entire system memory
Copy is not required
Pointer-based data structures still can't be shared
3. hUMA
13. hUMA
HSA Unified Memory Architecture
Shared Virtual Memory
CPU & GPU see the same addresses
Pageable Memory
GPU can (somehow) initiate a page fault
Cache coherency
14. SHARED VIRTUAL MEMORY
Advantages
No mapping tricks, no copying back and forth between different physical addresses
Send pointers (not data) back and forth between HSA agents.
Note the hardware implications…
Common page tables (and a common interpretation of architectural semantics such as shareability, protection, etc.)
Common mechanisms for address translation (and for servicing address translation faults)
Concept of a process address space ID (PASID) to allow multiple per-process virtual address spaces within the system.
Slide Taken from Ian Bratt HSA QUEUEING, HotChips 2013
15. CACHE COHERENCY DOMAINS
Advantages
Composability
Reduced SW complexity when communicating between agents
Lower barrier to entry when porting software
Note the Hardware Implications …
Hardware coherency support between all HSA agents
Can take many forms
Stand-alone Snoop Filters / Directories
Combined L3/Filters
Snoop-based systems (no filter)
Etc …
Slide Taken from Ian Bratt HSA QUEUEING, HotChips 2013
17. hQ Motivation
1. GPU dispatch has a lot of overhead
SW/driver stack overhead
User-mode to kernel-mode switch
18. hQ Motivation
2. The master/slave pattern is limiting (and has a lot of overhead)
CPU schedules work for the GPU
Communication overhead (reporting results, setting the next kernel's grid size)
Slide from “Introduction to Dynamic Parallelism”, Stephen Jones, NVIDIA Corporation
19. hQ
HSA QUEUING MODEL
User mode queuing for low latency dispatch
Application dispatches directly
No OS or driver in the dispatch path
Architected Queuing Layer
Single compute dispatch path for all hardware
No driver translation, direct to hardware
Allows for dispatch to a queue from any agent
CPU or GPU
GPU can spawn its own work
Picture from AMD Blog: "hQ: From Master/Slave to Masterpiece"
20. ARCHITECTED QUEUEING LANGUAGE
HSA Queues look just like standard shared memory queues, supporting multi-producer, single-consumer
Support is allowed for single-producer, single-consumer
Queues consist of storage, read/write indices, ID, etc.
Queues are created/destroyed via calls to the HSA runtime
"Packets" are placed in queues directly from user mode, via an architected protocol
Packet format is architected
[Diagram: two producers and one consumer sharing a queue; read and write indices; packet storage in coherent, shared memory]
Slide Taken from Ian Bratt HSA QUEUEING, HotChips 2013
22. WHAT IS HSAIL?
HSAIL is the intermediate language for parallel compute in HSA
Generated by a high level compiler (LLVM, gcc, Java VM, etc)
Low-level IR, close to machine ISA level
Compiled down to target ISA by an IHV “Finalizer”
Finalizer may execute at run time, install time, or build time
Example: OpenCL™ Compilation Stack using HSAIL
High-Level Compiler Flow (Developer): OpenCL™ Kernel → EDG or CLANG → SPIR → LLVM → HSAIL
Finalizer Flow (Runtime): HSAIL → Finalizer → Hardware ISA
Slide Taken from Ben Sander’s HSAIL: Portable Compiler IR FOR HSA, HotChips 2013
23. HSAIL INSTRUCTION SET HIGHLIGHTS
"SIMT" – Single Instruction, Multiple Threads
ISA is scalar and describes one serial work-item; parallelism is provided by the HW
RISC-like
Load-store architecture
136 opcodes
Fixed number of registers
Control (1-bit) registers
A pool of 512 bytes shared by single (32-bit), double (64-bit), and quad (128-bit) registers
7 segments of memory
global, readonly, group, spill, private, arg, kernarg
version 0:95: $full : $large;
// static method HotSpotMethod<Main.lambda$2(Player)>
kernel &run (
    kernarg_u64 %_arg0 // Kernel signature for lambda method
) {
    ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register
    workitemabsid_u32 $s2, 0; // Read the work-item global "X" coord
    cvt_u64_s32 $d2, $s2; // Convert X gid to long
    mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref
    add_u64 $d2, $d2, 24; // Adjust for actual elements start
    add_u64 $d2, $d2, $d6; // Add to array ref ptr
    ld_global_u64 $d6, [$d2]; // Load from array element into reg
@L0:
    ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()
    mov_b64 $d3, $d0;
    ld_global_s32 $s3, [$d6 + 40]; // p.getScores()
    cvt_f32_s32 $s16, $s3;
    ld_global_s32 $s0, [$d0 + 24]; // Team.getScores()
    cvt_f32_s32 $s17, $s0;
    div_f32 $s16, $s16, $s17; // p.getScores()/teamScores
    st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
    ret;
};
25. HIGH-LEVEL SOFTWARE STACK
Programming Languages
OpenCL 2.0
C++ AMP
Java (Aparapi/Sumatra)
HSA Runtime (User Mode Driver)
System Query
Access to JIT Compilers
Access to Queues
JIT Compilers
Offline or online (JIT)
LLVM compiler (LLVM IR → HSAIL)
HSAIL Finalizer (HSAIL → binary)
Kernel Mode Driver
http://www.hsafoundation.com/hsa-developer-tools/
26. HSA OPEN SOURCE SOFTWARE
HSA will feature an open source Linux execution and compilation stack
Allows a single shared implementation for many components
Enables university research and collaboration in all areas
Because it’s the right thing to do
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in Linux distros
Slide Taken from Phil Rogers “Heterogeneous System Architecture Overview”, HotChips 2013
27. JAVA HETEROGENEOUS ENABLEMENT ROADMAP
[Diagram: four roadmap stages. (1) Application → APARAPI → OpenCL™ → CPU/GPU ISAs, under the JVM. (2) APARAPI generates HSAIL; an HSA Finalizer produces the GPU ISA. (3) As (2), with the HSA Runtime and an LLVM optimizer (consuming IR) in the path. (4) A Sumatra-enabled JVM generates HSAIL directly, finalized to the GPU ISA.]
Slide Taken from Phil Rogers “Heterogeneous System Architecture Overview”, HotChips 2013
29. HSA CHALLENGES – VENDOR SUPPORT
[Member logos grouped by tier: Founders, Promoters, Supporters, Contributors, Academic]
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
Missing some key players:
Intel, NVIDIA, Apple, Microsoft, Google, …
30. HSA CHALLENGES – LANGUAGE SUPPORT
HSAIL (or LLVM) is not an attractive level to code at…
Leverage existing parallel languages/paradigms to exploit HSA features:
C++ AMP
OpenCL 2.0 (done!)
OpenMP
Add your favorite …
Extend popular languages to exploit HSA:
Scripting languages: Python
Web languages: HTML5, RoR, JavaScript, …
DSL languages
31. HSA CHALLENGES – SECURITY
HSA was designed with some security measures in mind:
Accelerators support privilege levels, with user and privileged memory
Execute, read, and write permissions are protected by page table entries
Support for fixed-time context scheduling (DoS protection)
But:
Advanced features such as hUMA & hQ are a potential back door
OS & security apps currently do not monitor the accelerators
Monitoring may require OS changes
The detailed specification can be used to find attack vectors
Some accelerator architectures may introduce a security flaw
Example: local memory on the GPU
34. HSA AVAILABILITY
Simulators:
HSAEMU – A full system emulator for HSA platforms
Work done by System SW Lab at NTHU (National Tsing Hua University)
http://hsaemu.org/
Code available on GitHub - https://github.com/SSLAB-HSA/HSAemu
HSAIL Simulator
Code available on GitHub - https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator
38. hUMA & Discrete GPUs
hUMA can be extended beyond the SoC, if the proper HW exists (such as the Hawaii GPU…)
Slide from "IOMMUv2: the Ins and Outs of Heterogeneous GPU use", AFDS 2012
39. HSAIL AND SPIR
Feature: Intended users
  HSAIL: Compiler developers who want to control their own code generation.
  SPIR: Compiler developers who want a fast path to acceleration across a wide variety of devices.
Feature: IR level
  HSAIL: Low-level, just above the machine instruction set.
  SPIR: High-level, just below LLVM IR.
Feature: Back-end code generation
  HSAIL: Thin, fast, robust.
  SPIR: Flexible. Can include many optimizations and compiler transformations, including register allocation.
Feature: Where are compiler optimizations performed?
  HSAIL: Most done in the high-level compiler, before HSAIL generation.
  SPIR: Most done in the back-end code generator, between SPIR and the device machine instruction set.
Feature: Registers
  HSAIL: Fixed-size register pool.
  SPIR: Infinite.
Feature: SSA form
  HSAIL: No.
  SPIR: Yes.
Feature: Binary format
  HSAIL: Yes.
  SPIR: Yes.
Feature: Code generator for LLVM
  HSAIL: Yes.
  SPIR: Yes.
Feature: Back-end device targets
  HSAIL: Modern GPU architectures supported by members of the HSA Foundation.
  SPIR: Any OpenCL device, including GPUs, CPUs, FPGAs.
Feature: Memory model
  HSAIL: Relaxed consistency with acquire/release, barriers, and fine-grained barriers.
  SPIR: Flexible. Can support the OpenCL 1.2 memory model.
Slide Taken from Ben Sander’s HSAIL: Portable Compiler IR FOR HSA, HotChips 2013
41. OPENCL™ AND HSA
HSA is an optimized platform architecture for OpenCL™
Not an alternative to OpenCL™
OpenCL™ on HSA will benefit from:
Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU
OpenCL™ 2.0 shows considerable alignment with HSA
Many HSA member companies are also active with Khronos in the OpenCL™ working group
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
42. BOLT — PARALLEL PRIMITIVES LIBRARY FOR HSA
Easily leverage the inherent power efficiency of GPU computing
Common routines such as scan, sort, reduce, transform
More advanced routines like heterogeneous pipelines
Bolt library works with OpenCL and C++ AMP
Enjoy the unique advantages of the HSA platform
Move the computation not the data
Finally a single source code base for the CPU and GPU!
Developers can focus on core algorithms
Bolt version 1.0 for OpenCL and C++ AMP is available now at https://github.com/HSA-Libraries/Bolt
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
43. HSA OPEN SOURCE SOFTWARE
HSA will feature an open source Linux execution and compilation stack
Allows a single shared implementation for many components
Enables university research and collaboration in all areas
Because it’s the right thing to do
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in Linux distros
44. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
[Chart: lines of code (broken down into init, compile, copy, launch, algorithm, and copy-back) and relative performance for an exemplary ISV "Hessian" kernel, implemented as serial CPU, TBB, intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt]