8. EXISTING APUS AND SOCS
[Diagram: a physically integrated APU. CPUs 1..N and GPU compute units CU 1..M sit on one die; the CPUs use coherent System Memory while the GPU keeps its own non-coherent GPU Memory pool.]
- Good first step
- Some copies gone
- Two memory pools remain
- Still queue through the OS
- Still requires expert programmers
- Need to finish the job
9. AN HSA ENABLED SOC
- Unified Coherent Memory enables data sharing across all processors
- Processors architected to operate cooperatively
- Designed to enable the application to run on different processors at different times
[Diagram: an HSA-enabled SoC. CPUs 1..N and GPU compute units CU 1..M all share a single pool of Unified Coherent Memory.]
We will be open on this.
We will reach out to partners and collaborate to bring this to market in the right form
Let's take a deeper dive here into the details of the architecture …
The memory model for a new architecture is key
Key Points:
Writing optimal CPU implementations requires complex development too.
Programmers have to use both intrinsics for vector parallelism and TBB for multicore parallelism.
OpenCL C
OpenCL C is a widely known, fairly verbose C-based API, and it shows in the boilerplate initialization code, runtime compilation code, and kernel launch.
OpenCL C++:
Removes initialization code by providing sensible defaults for platform, context, device, and command-queue. No need to set these up, and no need to save them and drag them around for later OpenCL API calls.
Reduces compilation code by using C++ exceptions for error-checking and automatic memory allocation (rather than calling the API to determine the size of return arguments).
Default arguments and type-checking mean the code focuses on the relevant parameters.
The host-side C++ support is available in a "cl.hpp" header which runs on any OpenCL implementation (including NVIDIA, Intel, etc.).
In addition, the AMD OpenCL implementation supports a "static" C++ kernel language with classes, namespaces, and templates. (Not used in this implementation.)
C++ AMP
Initialization is handled through sensible defaults. C++ AMP eliminates the platform and context; accelerator_view combines device and queue.
Single-source model: eliminates runtime-compile code; the kernel is compiled at compile time along with the host code.
Single-source model: streamlined kernel call convention (eliminates clSetKernelArg).
The implementation here uses a C++11 lambda to reduce the boilerplate code for functor construction (the kernel can directly access local variables).
Data transfer is reduced by the implicit transfers performed by array_view.
BOLT
Moves the reduction code into the library; what remains is the reduction operator.
Removes data transfer and copy-back; the interface works directly with host data structures.
Bolt-for-C++AMP uses lambda syntax; Bolt-for-OCL does not (not supported).
The Bolt-for-OCL implementation relies on the C++ static kernel language, recently introduced in AMD APP SDK 2.6 (beta) and 2.7 (production?).
Other notes:
The serial CPU version integrates the algorithm and the reduction, and we call it just "algorithm"; later implementations separate these for performance.
Launch is the argument setup and the call into the kernel or library routine.
Copy-back includes the code to copy data back to the host and run a host-side final reduction step.
LOC includes appropriate spaces and comments. We attempted to use a similar coding style across all implementations.
TBB init is one line to initialize the scheduler (tbb::task_scheduler_init).