GUIDE TO HETEROGENEOUS
SYSTEM ARCHITECTURE (HSA)
DIBYENDU DAS, PRAKASH RAGHAVENDRA
DEC 16TH 2013
OUTLINE
 Introduction to HSA
 Unified Memory Access
 Power Management
 HSA Programming Languages
 Workloads

WHAT IS HSA?
An intelligent computing architecture that enables the CPU, GPU and other
processors to work in harmony on a single piece of silicon by seamlessly
moving the right tasks to the best-suited processing element

Diagram: the APU (Accelerated Processing Unit) pairs CPU cores for serial workloads with GPU cores for parallel workloads over shared hUMA memory.
HSA EVOLUTION

Benefits:
 Unified power efficiency
 Improved compute efficiency
 Simplified data sharing

Capabilities:
 Integrate CPU and GPU in silicon
 GPU can access CPU memory
 Uniform memory access for CPU and GPU

STATE-OF-THE-ART HETEROGENEOUS PROCESSOR
 Accelerated processing unit (APU): multi-threaded CPU cores plus a graphics processing unit (GPU) with 384 AMD Radeon™ cores
 Shared Northbridge  access to overlapping CPU/GPU physical address spaces
 Many resources are shared between the CPU and GPU
‒ For example, memory hierarchy, power, and thermal capacity

A NEW ERA OF PROCESSOR PERFORMANCE
 Single-Core Era (single-thread performance vs. time)
‒ Enabled by: Moore’s Law, voltage scaling
‒ Constrained by: power, complexity
‒ Programming: Assembly  C/C++  Java …
 Multi-Core Era (throughput performance vs. time, i.e. # of processors)
‒ Enabled by: Moore’s Law, SMP architecture
‒ Constrained by: power, parallel SW, scalability
‒ Programming: pthreads  OpenMP / TBB …
 Heterogeneous Systems Era (modern application performance vs. time, i.e. data-parallel exploitation)
‒ Enabled by: abundant data parallelism, power-efficient GPUs
‒ Temporarily constrained by: programming models, communication overhead
‒ Programming: Shader  CUDA  OpenCL  C++ AMP …
In each era, “we are here” marks how far along its performance curve we currently are.

EVOLUTION OF HETEROGENEOUS COMPUTING
(Vertical axis: architecture maturity & programmer accessibility, from poor to excellent)

Proprietary Drivers Era (2002 - 2008): graphics & proprietary driver-based APIs
 “Adventurous” programmers
 Exploit early programmable “shader cores” in the GPU
 Make your program look like “graphics” to the GPU
 CUDA™, Brook+, etc.

Standards Drivers Era (2009 - 2011): OpenCL™, DirectCompute driver-based APIs
 Expert programmers
 C and C++ subsets
 Compute-centric APIs, data types
 Multiple address spaces with explicit data movement
 Specialized work-queue-based structures
 Kernel mode dispatch

Architected Era (2012 - 2020): AMD Heterogeneous System Architecture, GPU as a peer processor
 Mainstream programmers
 Full C++
 GPU as a co-processor
 Unified coherent address space
 Task parallel runtimes
 Nested data parallel programs
 User mode dispatch
 Pre-emption and context switching

HETEROGENEOUS PROCESSORS - EVERYWHERE
SMARTPHONES TO SUPER-COMPUTERS
From phones and tablets to notebooks, workstations, dense servers, and supercomputers, a single scalable architecture is now demanded for the world’s programmers.

HOW DOES HSA MAKE THIS ALL WORK?
 Enables acceleration of languages like Java, C++ AMP and
Python
 All processors use the same addresses, and can share data
structures in place
 Heterogeneous computing can use all of virtual and physical
memory
 Extends multicore coherency to the GPU and other processors
 Pass work quickly between the processors
 Enables quality of service

HSA FOUNDATION –
BUILDING THE ECOSYSTEM

HSA FOUNDATION AT LAUNCH
BORN IN JUNE 2012

Founders

HSA FOUNDATION TODAY – DECEMBER 2013
A GROWING AND POWERFUL FAMILY

Founders, promoters, supporters, contributors, and universities – including Oracle, the NTHU Programming Language Lab, and the NTHU System Software Lab.

Unified Memory Access
UNDERSTANDING UMA

Original meaning of UMA is Uniform Memory Access
• Refers to how processing cores in a system view and access memory
• All processing cores in a true UMA system share a single memory address space

Introduction of GPU compute created Non-Uniform Memory Access (NUMA)
• Requires data to be managed across multiple heaps with different address spaces
• Adds programming complexity due to frequent copies, synchronization, and address translation

HSA restores the GPU to Uniform Memory Access
• Heterogeneous computing replaces GPU computing

INTRODUCING hUMA
 UMA: multiple CPU cores share a single memory.
 NUMA: an APU in which the CPU cores use CPU memory and the GPU cores use a separate GPU memory.
 hUMA: an APU with HSA, in which CPU cores and GPU cores share a single unified memory.

hUMA KEY FEATURES
 Coherent memory: hardware cache coherency ensures that CPU and GPU caches both see an up-to-date view of data.
 Pageable memory: the GPU can seamlessly access virtual memory addresses that are not (yet) present in physical memory.
 Entire memory space: both CPU and GPU can access and allocate any location in the system’s virtual memory space.

WITHOUT POINTERS AND DATA SHARING
Without hUMA:
• CPU explicitly copies data to GPU memory
• GPU completes computation
• CPU explicitly copies result back to CPU memory
Only the data array can be copied, since the GPU cannot follow embedded data-structure links.

WITH POINTERS AND DATA SHARING
With hUMA, the CPU can pass a pointer to the entire data structure, since the GPU can now follow embedded links through the shared CPU/GPU uniform memory. A minimal sketch of the difference is shown below.

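To make the contrast concrete, here is a minimal, CPU-only Java sketch (class and method names are illustrative, not from the talk): the first helper mimics the copy model, where only a flat array of values can be staged and the links between nodes are lost, while the second simply shares a reference to the head of the linked structure and walks it in place, which is what hUMA lets the GPU do.

import java.util.ArrayList;
import java.util.List;

public class PointerSharingSketch {
    /** A node with a payload and an embedded link to the next node. */
    static class Node {
        int value;
        Node next;
        Node(int value, Node next) { this.value = value; this.next = next; }
    }

    /** Copy model: only the raw values can be staged; the embedded links are lost. */
    static int[] flattenForCopy(Node head) {
        List<Integer> staged = new ArrayList<>();
        for (Node n = head; n != null; n = n.next) staged.add(n.value);
        int[] buffer = new int[staged.size()];
        for (int i = 0; i < buffer.length; i++) buffer[i] = staged.get(i);
        return buffer; // what gets copied to "GPU memory" -- no pointers survive
    }

    /** Shared-pointer model: pass the head reference and follow links in place. */
    static long sumInPlace(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) sum += n.value;
        return sum;
    }

    public static void main(String[] args) {
        Node head = new Node(1, new Node(2, new Node(3, null)));
        System.out.println("copied values: " + flattenForCopy(head).length);
        System.out.println("in-place sum:  " + sumInPlace(head));
    }
}
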
hUMA FEATURES
 Access to the entire memory space
 Pageable memory
 Bi-directional coherency
 Fast GPU access to system memory
 Dynamic memory allocation






Power Management
KEY OBSERVATIONS
 Applications exhibit varying degrees of CPU and GPU frequency sensitivities due to
‒ Control divergence
‒ Interference at shared resources
‒ Performance coupling between CPU and GPU

 Efficient energy management requires metrics that can predict frequency
sensitivity (power) in heterogeneous processors
 Sensitivity metrics drive the coordinated setting of CPU and GPU power states

STATE-OF-THE-ART: BI-DIRECTIONAL APPLICATION POWER MANAGEMENT (BAPM)
The chip is divided into BAPM-controlled thermal entities (TEs): CU0 TE, CU1 TE, and GPU TE.
 Power management algorithm (a toy sketch follows below):
1. Calculate a digital estimate of power consumption
2. Convert power to temperature (RC network model for heat transfer)
3. Assign new power budgets to the TEs based on temperature headroom
4. TEs locally control (boost) their own DVFS states to maximize performance

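The following is a deliberately simplified, illustrative Java sketch of that loop under assumed constants (first-order RC thermal model, headroom-proportional budgets, a 35 W chip budget); it is not AMD's firmware algorithm, just a way to see how the four steps fit together.

public class BapmSketch {
    // Illustrative constants -- assumptions, not real silicon parameters.
    static final double R_TH = 1.5;    // thermal resistance, degC per watt
    static final double C_TH = 20.0;   // thermal capacitance, joules per degC
    static final double T_AMB = 45.0;  // ambient/baseline temperature, degC
    static final double T_MAX = 95.0;  // temperature limit per thermal entity
    static final double DT = 0.01;     // control interval, seconds

    public static void main(String[] args) {
        String[] te = {"CU0 TE", "CU1 TE", "GPU TE"};
        double[] power = {12.0, 10.0, 18.0};   // step 1: estimated power per TE (W)
        double[] temp = {60.0, 58.0, 70.0};    // current TE temperatures (degC)

        for (int step = 0; step < 3; step++) {
            double totalHeadroom = 0;
            for (int i = 0; i < te.length; i++) {
                // Step 2: first-order RC model, dT/dt = (P*R - (T - Tamb)) / (R*C).
                temp[i] += DT * (power[i] * R_TH - (temp[i] - T_AMB)) / (R_TH * C_TH);
                totalHeadroom += Math.max(0, T_MAX - temp[i]);
            }
            if (totalHeadroom <= 0) break; // fully thermally saturated
            for (int i = 0; i < te.length; i++) {
                // Step 3: share the chip budget out in proportion to thermal headroom.
                double budget = 35.0 * Math.max(0, T_MAX - temp[i]) / totalHeadroom;
                // Step 4: each TE boosts/throttles its DVFS state toward its budget;
                // here we just clamp the modeled power to the budget.
                power[i] = Math.min(power[i] * 1.05, budget);
                System.out.printf("%s: T=%.1fC budget=%.1fW power=%.1fW%n",
                                  te[i], temp[i], budget, power[i]);
            }
        }
    }
}
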
DYNACO: RUN-TIME SYSTEM FOR COORDINATED ENERGY MANAGEMENT
Run-time flow: Performance Metric Monitor  CPU-GPU Frequency Sensitivity Computation  CPU-GPU Power State Decision

GPU Frequency Sensitivity | CPU Frequency Sensitivity | Decision
High | Low  | Shift power to GPU
High | High | Proportional power allocation
Low  | High | Shift power to CPU
Low  | Low  | Reduce power of both CPU and GPU

 DynaCo is implemented as a run-time software policy overlaid on top of BAPM in real hardware (a sketch of the decision rule follows below)

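A minimal Java sketch of that decision table; the threshold, the sensitivity estimate, and the names are illustrative, not the actual DynaCo policy.

public class DynaCoDecision {
    enum Sensitivity { LOW, HIGH }
    enum Action { SHIFT_TO_GPU, PROPORTIONAL, SHIFT_TO_CPU, REDUCE_BOTH }

    /** Map measured CPU/GPU frequency sensitivity to a power-shifting action. */
    static Action decide(Sensitivity gpu, Sensitivity cpu) {
        if (gpu == Sensitivity.HIGH && cpu == Sensitivity.LOW)  return Action.SHIFT_TO_GPU;
        if (gpu == Sensitivity.HIGH && cpu == Sensitivity.HIGH) return Action.PROPORTIONAL;
        if (gpu == Sensitivity.LOW  && cpu == Sensitivity.HIGH) return Action.SHIFT_TO_CPU;
        return Action.REDUCE_BOTH; // both LOW: neither benefits from more frequency
    }

    /** Classify a sensitivity estimate (e.g., d(performance)/d(frequency)) against a threshold. */
    static Sensitivity classify(double sensitivity, double threshold) {
        return sensitivity >= threshold ? Sensitivity.HIGH : Sensitivity.LOW;
    }

    public static void main(String[] args) {
        // Example: GPU-bound phase -- GPU very frequency sensitive, CPU not.
        Action a = decide(classify(0.8, 0.5), classify(0.2, 0.5));
        System.out.println(a); // SHIFT_TO_GPU
    }
}
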
Programming Languages
PROGRAMMING LANGUAGES PROLIFERATING ON HSA
 Applications: OpenCL™ apps, Java apps, C++ AMP apps, Python apps
 Language runtimes: OpenCL runtime, Java JVM (Sumatra), various runtimes, Fabric Engine RT
 Common compiler target: HSAIL (HSA Intermediate Language)
 System software: HSA helper libraries, HSA core runtime, HSA finalizer, Kernel Fusion Driver (KFD)

PROGRAMMING MODELS EMBRACING HSAIL AND HSA
THE RIGHT LEVEL OF ABSTRACTION
 Under development: Java (Project Sumatra, OpenJDK 9), OpenMP from SuSE, C++ AMP (based on CLANG/LLVM), Python and KL from Fabric Engine
 Next: DSLs (Halide, Julia, Rust), Fortran, JavaScript, Open Shading Language, R

HSAIL
HSAIL (HSA Intermediate Language) as the SW interface:
‒ A virtual ISA for parallel programs
‒ Finalized to a native ISA by a finalizer/JIT
‒ Accommodates rapid innovation in native GPU architectures
‒ Expected to be stable and backward compatible across implementations
‒ Enables multiple hardware vendors to support HSA

Key design points and benefits for HSA compilers:
‒ Adopt a thin finalizer approach
‒ Enable fast translation time and robustness in the finalizer
‒ Drive performance optimizations through high-level compilers (HLCs)
‒ Take advantage of the strength and compilation-time budget of HLCs for aggressive optimizations

High-level compiler flow: OpenCL™ kernel  EDG or CLANG  SPIR  LLVM  HSAIL
Finalizer flow (runtime): HSAIL  finalizer  hardware ISA

EDG – Edison Design Group; CLANG – LLVM front end; SPIR – Standard Portable Intermediate Representation

HSA ENABLEMENT OF JAVA

JAVA 7 – OpenCL-ENABLED APARAPI
 AMD-initiated open source project
 APIs for data-parallel algorithms
‒ GPU-accelerate Java applications
‒ No need to learn OpenCL™
 Active community captured mindshare
‒ ~20 contributors
‒ >7000 downloads
‒ ~150 visits per day
 Stack: Java application  APARAPI API  OpenCL™ compiler  OpenCL™  CPU / GPU

JAVA 8 – HSA-ENABLED APARAPI
 Java 8 brings the Stream + Lambda API
‒ A more natural way of expressing data-parallel algorithms
‒ Initially targeted at multi-core
 APARAPI will:
‒ Support Java 8 lambdas
‒ Dispatch code to HSA-enabled devices at runtime via HSAIL
 Stack: Java application  APARAPI + Lambda API  HSAIL  HSA finalizer & runtime  CPU / GPU

JAVA 9 – HSA-ENABLED JAVA (SUMATRA)
 Adds native GPU acceleration to the Java Virtual Machine (JVM)
 Developer uses the JDK Lambda and Stream APIs
 JVM uses the GRAAL compiler to generate HSAIL
 JVM decides at runtime whether to execute on the CPU or GPU depending on workload characteristics (see the stream sketch below)
 Stack: Java application  JDK Stream + Lambda API  Java GRAAL JIT backend  HSAIL  HSA finalizer & runtime  CPU / GPU

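For reference, a minimal sketch of the kind of data-parallel Java 8 Stream + Lambda code this path is meant to offload (class name and sizes are illustrative). On a stock JVM it runs on CPU worker threads; the Sumatra/Graal idea is that the same source could be compiled to HSAIL and dispatched to the GPU.

import java.util.stream.IntStream;

public class SaxpyStreams {
    public static void main(String[] args) {
        int n = 1_000_000;
        float a = 2.0f;
        float[] x = new float[n];
        float[] y = new float[n];
        for (int i = 0; i < n; i++) { x[i] = i; y[i] = n - i; }

        // Data-parallel SAXPY expressed with the Java 8 Stream + Lambda API.
        // An HSA-enabled JVM could generate HSAIL for this lambda and run it
        // on the GPU without any change to the source.
        IntStream.range(0, n).parallel()
                 .forEach(i -> y[i] = a * x[i] + y[i]);

        System.out.println("y[42] = " + y[42]);
    }
}
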
Workloads
OVERVIEW OF B+ TREES
 B+ Trees are a special case of B-Trees
 A B+ Tree …
‒ is a dynamic, multi-level index
‒ is efficient for retrieval of data stored in a block-oriented context
 Fundamental data structure used in several popular database management systems
‒ SQLite
‒ CouchDB
 The order (b) of a B+ Tree measures the capacity of its nodes
Figure: an example B+ Tree whose leaf nodes hold the keys 1 through 8 with associated data records d1 through d8.

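As a concrete reference point, here is a minimal Java sketch of a B+ Tree node and a single-key lookup (field names and the separator convention are illustrative, not the implementation used for the results later in this deck).

import java.util.Arrays;

/** Minimal B+ tree node sketch: internal nodes route by key, leaves hold values. */
class BPlusNode {
    boolean leaf;
    int[] keys;            // sorted keys
    BPlusNode[] children;  // internal node: children.length == keys.length + 1
    long[] values;         // leaf node: record ids / payload, parallel to keys

    /** Search a single key, descending from this node to a leaf. */
    Long search(int key) {
        BPlusNode node = this;
        while (!node.leaf) {
            int idx = Arrays.binarySearch(node.keys, key);
            // binarySearch returns (-(insertion point) - 1) when the key is absent;
            // with the usual convention, keys equal to a separator go to the right child.
            int child = (idx >= 0) ? idx + 1 : -(idx + 1);
            node = node.children[child];
        }
        int idx = Arrays.binarySearch(node.keys, key);
        return (idx >= 0) ? node.values[idx] : null;
    }
}
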
HOW WE ACCELERATE
 Utilize coarse-grained parallelism in B+ Tree searches
‒ Perform many queries in parallel
‒ Increase memory bandwidth utilization with parallel reads
‒ Increase throughput (transactions per second for OLTP)

 B+ Tree searches on an HSA-enabled APU
‒ Allow much larger B+ Trees to be searched than with traditional GPU compute
‒ Eliminate data copies, since CPU and GPU cores can access the same memory (the batching sketch below shows the query-level parallelism)

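Below is a hedged sketch of that coarse-grained batching, reusing the BPlusNode class from the previous sketch. On a stock JVM the parallel stream fans the independent lookups out to CPU worker threads; an OpenCL-on-HSA implementation would instead hand the same shared, pointer-rich tree to GPU work-items, with no copies.

import java.util.stream.IntStream;

public class BatchedBTreeSearch {
    /** Run many independent key lookups in parallel over one shared tree. */
    static Long[] searchAll(BPlusNode root, int[] queries) {
        Long[] results = new Long[queries.length];
        // Coarse-grained parallelism: one independent lookup per query.
        IntStream.range(0, queries.length).parallel()
                 .forEach(q -> results[q] = root.search(queries[q]));
        return results;
    }

    public static void main(String[] args) {
        // Tiny single-leaf tree: keys 1..8 map to record ids 101..108,
        // mirroring the example figure in the B+ Tree overview.
        BPlusNode leaf = new BPlusNode();
        leaf.leaf = true;
        leaf.keys = new int[]{1, 2, 3, 4, 5, 6, 7, 8};
        leaf.values = new long[]{101, 102, 103, 104, 105, 106, 107, 108};

        Long[] out = searchAll(leaf, new int[]{3, 7, 42});
        System.out.println(java.util.Arrays.toString(out)); // [103, 107, null]
    }
}
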
RESULTS
 1M search queries executed in parallel
 Input B+ Tree contains 112 million keys and uses 6 GB of memory
 Hardware: AMD “Kaveri” APU with a quad-core CPU and 8 GCN Compute Units at 35 W TDP
 Software: OpenCL on HSA
 Baseline: 4-core OpenMP + hand-tuned SSE CPU implementation
REVERSE TIME MIGRATION (RTM)
 A technique for creating images based on sensor data to improve the seismic interpretations done by geophysicists
 A memory-intensive and highly parallel algorithm that is run on massive data sets
 A natural scale-out algorithm, often run today on 100K-node CPU systems
 Bringing RTM to HSA- and APU-based supercomputing will increase performance for current sensor arrays and allow more sensors and accuracy in the future
Figure: marine and land acquisition crews collecting the sensor data.
HOWEVER, SPEED OF PROCESSING AND INTERPRETATION IS A CRITICAL BOTTLENECK IN MAKING FULL USE OF ACQUISITION ASSETS
TEXT ANALYTICS – HADOOP TERASORT AND BIG DATA SEARCH
MINING BIG DATA
 Multi-stage pipeline of parallel processing stages: input HDFS (Hadoop Distributed File System)  split  map  sort/copy  merge  reduce  output HDFS  HDFS replication
 Traditional GPU compute is challenged by copies
 APU with HSA accelerates each stage in place (a plain-JVM sketch of these stages follows below)
‒ Sort
‒ Compression
‒ Regular expression parsing
‒ CRC generation
 Acceleration of large data search scales out across the cluster of APU nodes
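To ground the per-stage operations named above, here is a small Java sketch using only standard JVM APIs (the record strings are made up); on a stock JVM these stages run on the CPU, and the HSA argument is that an APU could accelerate the same stages in place, without copying the records out of shared memory.

import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.zip.CRC32;

public class StageKernels {
    public static void main(String[] args) {
        String[] records = {"2024-01-01 GET /index", "2024-01-02 POST /login", "2024-01-01 GET /img"};

        // Sort stage (TeraSort-style key ordering).
        Arrays.parallelSort(records);

        // Regular-expression parsing stage.
        Pattern get = Pattern.compile("GET\\s+(\\S+)");
        long hits = Arrays.stream(records).filter(r -> get.matcher(r).find()).count();

        // CRC generation stage (e.g., block checksums).
        CRC32 crc = new CRC32();
        for (String r : records) crc.update(r.getBytes());

        System.out.println("GET records: " + hits + ", crc=" + Long.toHexString(crc.getValue()));
    }
}
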
DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update
or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from
time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR
ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO
EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM
THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may
be trademarks of their respective owners.
BACKUP

Programming Tools

AMD developer tool suite, v1.3
 AMD’s comprehensive heterogeneous developer tool suite including:
‒ CPU and GPU profiling
‒ GPU kernel debugging
‒ GPU kernel analysis
 New features in version 1.3:
‒ Supports Java
‒ Integrated static kernel analysis
‒ Remote debugging/profiling
‒ Supports the latest AMD APU and GPU products

OPEN SOURCE LIBRARIES ACCELERATED BY AMD

OpenCV
 Most popular computer vision library
 Now with many OpenCL™-accelerated functions

Bolt
 C++ template library
 Provides GPU offload for common data-parallel algorithms
 Now with cross-OS support and improved performance/functionality

clMath
 AMD released APPML as open source to create clMath
 Accelerated BLAS and FFT libraries
 Accessible from Fortran, C and C++

Aparapi
 OpenCL™-accelerated Java 7
 Java APIs for data-parallel algorithms (no need to learn OpenCL™)


Editor's Notes

  1. So, here is what we will explore today
  2. HSA will empower software developers to easily innovate and unleash new levels of performance and functionality on all your modern devices and lead to powerful new experiences such as visually rich, intuitive, human-like interactivity.   
  3. Trinity contains two dual-core x86 modules or compute units (CU), and Radeon™ GPU cores along with miscellaneous other logic components such as a NorthBridge and a Unified Video Decoder (UVD). Each CU is composed of two, out-of-order cores that share the front-end and floating-point units. In addition, each CU is paired with a 2MB L2 cache that is shared between the cores. The GPU consists of 384 Radeon ™ cores, each capable of one single-precision fused multiply-add computation (FMAC) operation per cycle. The GPU is organized as six SIMD units, each containing sixteen processing units that are 4-way VLIW. The memory controller is shared between the CPU and the GPU.
  4. Punch this one out. Big emphasis on the last era. … and now the unprecedented step … four year roadmap
  5. http://acg.cis.upenn.edu/papers/cacm12_why_coherence.pdf.   Includes this quote: Continued coherence support lets programmers concentrate on what matters for parallel speedups: finding work to do in parallel with no undue communication and synchronization.  
  6. Field data is sent to a center (a cluster of nodes) for processing and interpretation by geophysicists. Gaps in the data typically require further acquisition, either holding up crews or causing redeployment. The problem is magnified because multiple field crews depend on the same processing center. It cannot be trivially solved by a “truck full of discrete GPUs” on site, because RTM is a memory-bound problem that is better solved by APUs.