2. DISCLAIMER:
This presentation is not an Official HSA Foundation
presentation.
Most of the material is taken from the HSA presentations at HotChips 2013
Some slides contain my own insights / opinions
5. HISTORIC PERSPECTIVE
Accelerated System
Program runs on CPU
API to access Accelerators
ASIC or Firmware
Configurable, but operation is fixed
Heterogeneous System
Program runs on CPU
Offloads work to accelerators (GPU, DSP, etc.)
Offloaded work is JITed (compiled at runtime)
Distributed → SoC based
6. HSA FOUNDATION
Originated from AMD’s FSA – Fusion System Architecture
HSA Foundation Founded in June 2012
8. WHAT IS HSA ALL ABOUT? (MY TAKE)
“Bring Accelerators forward as a first class processor”
Unified address space, pageable memory, coherency
Eliminate drivers from dispatch path (user mode queues)
Standardized SW stack built on top of a set of HW requirements
Improve interoperability between IP vendors
Unified Architecture for Accelerators
Start from GPUs, extend to DSPs / FPGAs / fixed-function accelerators, etc.
SoC Centric
Major features are optimal for a SoC environment (same memory/die)
Support of distributed systems is possible, yet inefficient (PCI atomics, others)
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
9. HSA WORKING GROUPS
HSA Systems Architecture
hUMA – Unified Memory Model
hQ – HSA Queuing Model
HSA Programmer Reference Specification
HSAIL – HSA Intermediate Language
HSA System Runtime
HSA Compliance
HSA Tools
http://hsafoundation.com/standards/
10. OPENCL™ AND HSA
HSA is an optimized platform architecture for OpenCL™
Not an alternative to OpenCL™
OpenCL™ on HSA will benefit from:
Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU
OpenCL™ 2.0 shows considerable alignment with HSA
Many HSA member companies are also active with Khronos in the OpenCL™ working group
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
12. hUMA
HSA Unified Memory Architecture
Evolution of CPU / GPU memory systems:
1. CPU uses virtual addresses, GPU uses physical addresses
Memory had to be pinned
GPU can access only a limited area of CPU memory (the aperture)
Requires a copy from system memory to GPU-visible memory
Pointer-based data structures can't be shared
2. CPU uses virtual addresses, GPU uses virtual addresses (but not the same ones)
Memory still has to be pinned
GPU can access the entire system memory
Copy is not required
Pointer-based data structures still can't be shared
3. hUMA
13. hUMA
HSA Unified Memory Architecture
Shared Virtual Memory
CPU & GPU see the same addresses
Pageable Memory
GPU can (somehow) initiate a page fault
Cache coherency
14. SHARED VIRTUAL MEMORY
Advantages
No mapping tricks, no copying back and forth between different physical addresses
Send pointers (not data) back and forth between HSA agents.
Note the hardware implications…
Common page tables (and a common interpretation of architectural semantics such as shareability, protection, etc.)
Common mechanisms for address translation (and for servicing address translation faults)
Concept of a process address space ID (PASID) to allow multiple per-process virtual address spaces within the system.
Slide Taken from Ian Bratt HSA QUEUEING, HotChips 2013
15. CACHE COHERENCY DOMAINS
Advantages
Composability
Reduced SW complexity when communicating between agents
Lower barrier to entry when porting software
Note the Hardware Implications …
Hardware coherency support between all HSA agents
Can take many forms
Stand-alone Snoop Filters / Directories
Combined L3/Filters
Snoop-based systems (no filter)
Etc …
Slide Taken from Ian Bratt HSA QUEUEING, HotChips 2013
17. hQ Motivation
1. GPU dispatch has a lot of overhead
SW/driver stack overhead
User-mode to kernel-mode switch
18. hQ Motivation
2. The master/slave pattern is limiting (and has a lot of overhead)
CPU schedules work for the GPU
Communication overhead (reporting results, setting the next kernel's grid size)
Slide from “Introduction to Dynamic Parallelism”, Stephen Jones, NVIDIA Corporation
19. hQ
HSA QUEUING MODEL
User mode queuing for low latency dispatch
Application dispatches directly
No OS or driver in the dispatch path
Architected Queuing Layer
Single compute dispatch path for all hardware
No driver translation, direct to hardware
Allows for dispatch to a queue from any agent
CPU or GPU
GPU can spawn its own work
Picture from AMD Blog: "hQ: From Master/Slave to Masterpiece"
20. ARCHITECTED QUEUEING LANGUAGE
HSA Queues look just like standard shared memory queues, supporting multi-producer, single-consumer
Support is allowed for single-producer, single-consumer
Queues consist of storage, read/write indices, ID, etc.
Queues are created/destroyed via calls to the HSA runtime
"Packets" are placed in queues directly from user mode, via an architected protocol
Packet format is architected
[Diagram: two producers and one consumer sharing a queue; read and write indices; packet storage in coherent, shared memory]
Slide Taken from Ian Bratt HSA QUEUEING, HotChips 2013
22. WHAT IS HSAIL?
HSAIL is the intermediate language for parallel compute in HSA
Generated by a high level compiler (LLVM, gcc, Java VM, etc)
Low-level IR, close to machine ISA level
Compiled down to target ISA by an IHV “Finalizer”
Finalizer may execute at run time, install time, or build time
Example: OpenCL™ Compilation Stack using HSAIL
High-Level Compiler Flow (Developer): OpenCL™ Kernel → EDG or CLANG → SPIR → LLVM → HSAIL
Finalizer Flow (Runtime): HSAIL → Finalizer → Hardware ISA
Slide Taken from Ben Sander’s HSAIL: Portable Compiler IR FOR HSA, HotChips 2013
23. HSAIL INSTRUCTION SET HIGHLIGHTS
"SIMT" – Single Instruction, Multiple Threads
ISA is scalar and describes one serial work-item; parallelism is provided by the HW
RISC-like
Load-store architecture
136 opcodes
Fixed number of registers
Control (1-bit) registers
A pool of 512 bytes shared by single (32-bit), double (64-bit), and quad (128-bit) registers
7 segments of memory
global, readonly, group, spill, private, arg, kernarg
version 0:95: $full : $large;
// static method HotSpotMethod<Main.lambda$2(Player)>
kernel &run (
    kernarg_u64 %_arg0 // Kernel signature for lambda method
) {
    ld_kernarg_u64 $d6, [%_arg0]; // Move arg to an HSAIL register
    workitemabsid_u32 $s2, 0; // Read the work-item global "X" coord
    cvt_u64_s32 $d2, $s2; // Convert X gid to long
    mul_u64 $d2, $d2, 8; // Adjust index for sizeof ref
    add_u64 $d2, $d2, 24; // Adjust for actual elements start
    add_u64 $d2, $d2, $d6; // Add to array ref ptr
    ld_global_u64 $d6, [$d2]; // Load from array element into reg
@L0:
    ld_global_u64 $d0, [$d6 + 120]; // p.getTeam()
    mov_b64 $d3, $d0;
    ld_global_s32 $s3, [$d6 + 40]; // p.getScores()
    cvt_f32_s32 $s16, $s3;
    ld_global_s32 $s0, [$d0 + 24]; // Team.getScores()
    cvt_f32_s32 $s17, $s0;
    div_f32 $s16, $s16, $s17; // p.getScores()/teamScores
    st_global_f32 $s16, [$d6 + 100]; // p.setPctOfTeamScores()
    ret;
};
25. HIGH-LEVEL SOFTWARE STACK
Programming Languages
OpenCL 2.0
C++ AMP
Java (Aparapi/Sumatra)
HSA Runtime (User Mode Driver)
System Query
Access to JIT Compilers
Access to Queues
JIT Compilers
Offline or online (JIT)
LLVM compiler (LLVM IR → HSAIL)
HSAIL Finalizer (HSAIL → binary)
Kernel Mode Driver
http://www.hsafoundation.com/hsa-developer-tools/
26. HSA OPEN SOURCE SOFTWARE
HSA will feature an open source Linux execution and compilation stack
Allows a single shared implementation for many components
Enables university research and collaboration in all areas
Because it’s the right thing to do
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in Linux distros
Slide Taken from Phil Rogers “Heterogeneous System Architecture Overview”, HotChips 2013
27. JAVA HETEROGENEOUS ENABLEMENT ROADMAP
[Diagram: four roadmap stages. (1) Application → APARAPI → OpenCL™ → CPU/GPU ISAs, under the JVM. (2) APARAPI generates HSAIL; an HSA Finalizer produces the GPU ISA. (3) As (2), with the HSA Runtime and an LLVM optimizer (consuming IR) in the path. (4) A Sumatra-enabled JVM generates HSAIL directly, finalized to the GPU ISA.]
Slide Taken from Phil Rogers “Heterogeneous System Architecture Overview”, HotChips 2013
29. HSA CHALLENGES – VENDOR SUPPORT
[Member logos grouped by tier: Founders, Promoters, Supporters, Contributors, Academic]
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
Missing some key players:
Intel, NVIDIA, Apple, Microsoft, Google, …
30. HSA CHALLENGES – LANGUAGE SUPPORT
HSAIL (or LLVM) is not an attractive level to code at…
Leverage existing parallel languages/paradigms to exploit HSA features:
C++ AMP
OpenCL 2.0 (done!)
OpenMP
Add your favorite …
Extend popular languages to exploit HSA:
Scripting languages: Python
Web languages: HTML5, RoR, JavaScript, …
DSL languages
31. HSA CHALLENGES – SECURITY
HSA was designed with some security measures in mind:
Accelerators support privilege levels, with user and privileged memory
Execute, read, and write permissions are protected by page table entries
Support for fixed-time context scheduling (DoS protection)
But:
Advanced features such as hUMA & hQ are a potential back door
OS & security apps currently do not monitor the accelerators
Monitoring may require OS changes
The detailed specification can be used to find attack vectors
Some accelerator architectures may introduce a security flaw
Example: local memory on the GPU
34. HSA AVAILABILITY
Simulators:
HSAEMU – A full system emulator for HSA platforms
Work done by System SW Lab at NTHU (National Tsing Hua University)
http://hsaemu.org/
Code available on GitHub - https://github.com/SSLAB-HSA/HSAemu
HSAIL Simulator
Code available on GitHub - https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator
38. hUMA & Discrete GPUs
hUMA can be extended beyond the SoC, if the proper HW exists (such as the Hawaii GPU…)
Slide from "IOMMUv2: the Ins and Outs of Heterogeneous GPU use", AFDS 2012
39. HSAIL AND SPIR
Feature: Intended users
  HSAIL: Compiler developers who want to control their own code generation.
  SPIR: Compiler developers who want a fast path to acceleration across a wide variety of devices.
Feature: IR level
  HSAIL: Low-level, just above the machine instruction set.
  SPIR: High-level, just below LLVM IR.
Feature: Back-end code generation
  HSAIL: Thin, fast, robust.
  SPIR: Flexible. Can include many optimizations and compiler transformations, including register allocation.
Feature: Where are compiler optimizations performed?
  HSAIL: Most done in the high-level compiler, before HSAIL generation.
  SPIR: Most done in the back-end code generator, between SPIR and the device machine instruction set.
Feature: Registers
  HSAIL: Fixed-size register pool.
  SPIR: Infinite.
Feature: SSA form
  HSAIL: No.
  SPIR: Yes.
Feature: Binary format
  HSAIL: Yes.
  SPIR: Yes.
Feature: Code generator for LLVM
  HSAIL: Yes.
  SPIR: Yes.
Feature: Back-end device targets
  HSAIL: Modern GPU architectures supported by members of the HSA Foundation.
  SPIR: Any OpenCL device, including GPUs, CPUs, FPGAs.
Feature: Memory model
  HSAIL: Relaxed consistency with acquire/release, barriers, and fine-grained barriers.
  SPIR: Flexible. Can support the OpenCL 1.2 memory model.
Slide Taken from Ben Sander’s HSAIL: Portable Compiler IR FOR HSA, HotChips 2013
41. OPENCL™ AND HSA
HSA is an optimized platform architecture for OpenCL™
Not an alternative to OpenCL™
OpenCL™ on HSA will benefit from:
Avoidance of wasteful copies
Low latency dispatch
Improved memory model
Pointers shared between CPU and GPU
OpenCL™ 2.0 shows considerable alignment with HSA
Many HSA member companies are also active with Khronos in the OpenCL™ working group
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
42. BOLT — PARALLEL PRIMITIVES LIBRARY FOR HSA
Easily leverage the inherent power efficiency of GPU computing
Common routines such as scan, sort, reduce, transform
More advanced routines like heterogeneous pipelines
Bolt library works with OpenCL and C++ AMP
Enjoy the unique advantages of the HSA platform
Move the computation not the data
Finally a single source code base for the CPU and GPU!
Developers can focus on core algorithms
Bolt version 1.0 for OpenCL and C++ AMP is available now at https://github.com/HSA-Libraries/Bolt
Slide Taken from Phil Rogers HSA Overview, HotChips 2013
43. HSA OPEN SOURCE SOFTWARE
HSA will feature an open source Linux execution and compilation stack
Allows a single shared implementation for many components
Enables university research and collaboration in all areas
Because it’s the right thing to do
Component Name | IHV or Common | Rationale
HSA Bolt Library | Common | Enable understanding and debug
HSAIL Code Generator | Common | Enable research
LLVM Contributions | Common | Industry and academic collaboration
HSAIL Assembler | Common | Enable understanding and debug
HSA Runtime | Common | Standardize on a single runtime
HSA Finalizer | IHV | Enable research and debug
HSA Kernel Driver | IHV | For inclusion in Linux distros
44. LINES-OF-CODE AND PERFORMANCE FOR DIFFERENT PROGRAMMING MODELS
AMD A10-5800K APU with Radeon™ HD Graphics – CPU: 4 cores, 3800MHz (4200MHz Turbo); GPU: AMD Radeon HD 7660D, 6 compute units, 800MHz; 4GB RAM.
Software – Windows 7 Professional SP1 (64-bit OS); AMD OpenCL™ 1.2 AMD-APP (937.2); Microsoft Visual Studio 11 Beta
[Chart: lines of code (broken down into init, compile, copy, launch, algorithm, and copy-back) and relative performance for an exemplary ISV "Hessian" kernel, implemented as serial CPU, TBB, intrinsics+TBB, OpenCL™-C, OpenCL™-C++, C++ AMP, and HSA Bolt]