CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Java applications, by Gary Frost and Vignesh Ravi

HSA ENABLEMENT OF APARAPI
EASING THE DEVELOPER PATH TO APU/GPU ACCELERATED JAVA APPLICATIONS

VIGNESH RAVI – SOFTWARE DEVELOPER, HSA TEAM, AMD
GARY FROST – SOFTWARE FELLOW, AMD

HSA ENABLEMENT OF APARAPI: AGENDA

!  Java GPU enablement via Aparapi
‒  Why Java?
‒  Aparapi
‒  What is it and how is it used?

!  Introduction to HSA
!  How HSA simplifies Java GPU programming with Aparapi
‒  Simpler programming model using lambda expressions
‒  Removal of previous constraints thanks to SVM (Shared Virtual Memory)

!  The nuts and bolts of our current HSA enablement
‒  HSAIL generation
‒  Dispatch via HSA Runtime APIs

!  Summary
!  Q&A
2 | HSA ENABLEMENT OF APARAPI | NOVEMBER 2013

WHY JAVA?

!  Java by the numbers
‒  9 million developers
‒  1 billion Java downloads per year
‒  97% of enterprise desktops run Java
‒  100% of Blu-ray players ship with Java
http://oracle.com.edgesuite.net/timeline/java/

!  Java 7 language & libraries already include concurrency features
‒  primitives (threads, locks, monitors, atomic ops)
‒  libraries (fork/join, thread pools, executors, futures)

!  Upcoming Java 8 includes stream processing enhancements
‒  support for 'lambda' expressions
‒  Lambda-centric concurrent stream processing libs/APIs (java.util.stream.*)


INITIAL APARAPI PROJECT OVERVIEW (2011)

!  Open Source framework
!  Allows Java developers access to GPU compute
!  Aparapi Java API for expressing data parallel workloads

Kernel kernel = new Kernel(){
  @Override public void run(){
    int i=getGlobalId();
    square[i]=in[i]*in[i];
  }
};
kernel.execute(size);

!  Aparapi runtime capable of converting bytecode to OpenCL™
‒  Execution on OpenCL™ 1.1+ capable devices (GPUs and APUs)
Or…
‒  Execute via a thread pool if OpenCL™ is unavailable.

[Diagram: Java Application; Override Aparapi Kernel base class's run() method; Aparapi converts bytecode to OpenCL™; OpenCL™; OpenCL™ compiler & Runtime; JVM; CPU ISA; CPU; GPU ISA; GPU]
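The thread pool fallback mentioned on this slide can be pictured with plain JDK code. The sketch below is not Aparapi's implementation; it assumes a JDK 8 parallel stream is an acceptable stand-in for "one task per global id" and uses the hypothetical class name SquaresFallbackSketch.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class SquaresFallbackSketch {
    // Computes square[i] = in[i] * in[i], one index per task,
    // on the JVM's common fork/join pool (stand-in for Aparapi's thread pool mode).
    static int[] squares(int[] in) {
        int[] square = new int[in.length];
        IntStream.range(0, in.length)
                 .parallel()
                 .forEach(i -> square[i] = in[i] * in[i]); // each i writes a distinct slot
        return square;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(squares(new int[] {1, 2, 3, 4})));
    }
}
```

Each index writes a distinct array element, so the parallel loop needs no locking, which is the same property that makes the kernel suitable for a GPU.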

MEET HSA AND HSAIL

!  Heterogeneous System Architecture standardizes CPU/GPU functionality
‒  Be ISA-agnostic for both CPUs and accelerators
‒  Support high-level programming languages
‒  Provide the ability to access pageable system memory from the GPU
‒  Maintain cache coherency for system memory between CPU and GPU

!  Specifications and simulator from HSA Foundation
‒  HSAIL portable ISA is "finalized" to particular hardware ISA at runtime
‒  Runtime specification for job launch and control
‒  HSAIL™ simulator for development and testing before hardware availability


APARAPI HSA ENABLEMENT (2013-2014)

!  Open Source project sponsored
!  Enhanced to support HSA and Java 8 lambda expressions

Device.hsa().forEach(size,
  i -> square[i]=in[i]*in[i]
);

!  Allow developers to efficiently represent data parallel algorithms using new Java 8 Lambda expressions
!  APIs have same look & feel as proposed Java 8 stream API features
!  No modifications to the JVM.
‒  We provide external JNI/Java libraries.

[Diagram: Java Application; Aparapi Lambda based API; Aparapi converts bytecode to HSAIL; HSAIL; HSA Finalizer & Runtime; JVM; CPU ISA; CPU; GPU ISA; GPU]

HSA AND LAMBDA ENABLED APARAPI EXECUTION EXAMPLE

Device.hsa().forEach(size,
  i -> square[i]=in[i]*in[i]
);

[Flowchart, reconstructed from its labels]
‒  Does the platform support HSA?
   N: Execute kernel using Java thread pool
   Y: continue
‒  Is this the first execution of this lambda instance?
   Y: Can bytecode be converted to HSAIL?
      Y: Convert bytecode to HSAIL, then execute HSAIL kernel on GPU/APU
      N: Execute kernel using Java thread pool
   N: Do we have HSAIL for this lambda?
      Y: Execute HSAIL kernel on GPU/APU
      N: Execute kernel using Java thread pool
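The decision flow on this slide can be sketched as a small cache keyed by lambda instance. This is a reading of the flowchart, not Aparapi source: the class DispatchDecisionSketch, the Target enum and the toHsail translator function are all hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

final class DispatchDecisionSketch {
    enum Target { HSA_DEVICE, JAVA_THREAD_POOL }

    private final boolean platformSupportsHsa;
    // Hypothetical translator: HSAIL text, or empty if the bytecode cannot be converted.
    private final Function<Object, Optional<String>> toHsail;
    // One entry per lambda seen so far; an empty Optional records a failed conversion.
    private final Map<Object, Optional<String>> hsailCache = new HashMap<>();

    DispatchDecisionSketch(boolean platformSupportsHsa,
                           Function<Object, Optional<String>> toHsail) {
        this.platformSupportsHsa = platformSupportsHsa;
        this.toHsail = toHsail;
    }

    Target choose(Object lambda) {
        if (!platformSupportsHsa) {
            return Target.JAVA_THREAD_POOL;
        }
        // First execution: attempt the conversion once and remember the outcome.
        // Later executions: reuse the cached HSAIL (or the cached failure).
        Optional<String> hsail = hsailCache.computeIfAbsent(lambda, toHsail);
        return hsail.isPresent() ? Target.HSA_DEVICE : Target.JAVA_THREAD_POOL;
    }
}
```

Caching the failure as well as the success means a lambda that cannot be converted is only analysed once, which matches the "Do we have HSAIL for this lambda?" branch falling back to the thread pool.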

SUMATRA PROJECT: NATIVE SUPPORT FOR GPU OFFLOAD ADDED TO JAVA

!  AMD/Oracle sponsored Open Source (OpenJDK) project
!  Targeted at OpenJDK Java 9 (2015)
!  Allow developers to efficiently represent data parallel algorithms in Java using Stream API + Lambda expressions
!  Sumatra is not pushing a new 'programming model'
!  Instead we 'repurpose' Stream API + Lambda to enable either CPU or GPU computing
!  A Sumatra enabled Java Virtual Machine™ will dispatch 'selected' constructs to HSA enabled devices at runtime.
!  Developers already refactoring JDK to use stream & lambda APIs
‒  So anyone using existing JDK should see GPU acceleration without any code changes.
!  Links:
‒  http://openjdk.java.net/projects/sumatra
‒  https://wikis.oracle.com/display/HotSpotInternals/Sumatra
‒  http://mail.openjdk.java.net/pipermail/sumatra-dev

[Diagram: Java Application; Java JDK Stream + Lambda API; Java GRAAL JIT backend; HSAIL; HSA Finalizer & Runtime; JVM; CPU ISA; CPU; GPU ISA; GPU]

HSA ENABLEMENT OF JAVA

Java 7 – OpenCL enabled Aparapi
•  AMD initiated Open Source project
•  APIs for data parallel algorithms
‒  GPU accelerate Java applications
‒  No need to learn OpenCL
•  Active community captured mindshare
‒  ~20 contributors
‒  >7000 downloads
‒  ~150 visits per day
[Diagram: Java Application; APARAPI API; OpenCL™; OpenCL™ Compiler and Runtime; JVM; CPU ISA; CPU; GPU ISA; GPU]

Java 8 – HSA enabled Aparapi
•  Java 8 brings Stream + Lambda API.
‒  More natural way of expressing data parallel algorithms
‒  Initially targeted at multi-core.
•  APARAPI will:
‒  Support Java 8 Lambdas
‒  Dispatch code to HSA enabled devices at runtime via HSAIL
[Diagram: Java Application; APARAPI + Lambda API; HSAIL™; HSA™ Finalizer & Runtime; JVM; CPU ISA; CPU; GPU ISA; GPU]

Java 9 – HSA enabled Java (Sumatra)
•  Adds native GPU compute support to Java Virtual Machine (JVM)
•  Developer uses JDK provided Lambda + Stream API
•  JVM uses GRAAL compiler to generate HSAIL
•  JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics.
[Diagram: Java Application; Java JDK Stream + Lambda API; Java GRAAL JIT backend; HSAIL™; HSA Finalizer & Runtime; JVM; CPU ISA; CPU; GPU ISA; GPU]

We plan to provide HSA Enabled Aparapi (Java 8) as a bridge technology between OpenCL based Aparapi (Java 7) and HSA Enabled Sumatra (Java 9)

A CASE STUDY CENTERED ON NBODY

!  A Java developer implementing a sequential version of NBody would probably…
‒  Create a class to represent each body

class Body{
  float x,y,z,m,vx,vy,vz;
  // Include method to update position and display
  void updateAndShow(Screen screen, Body[] bodies){
    for (Body other:bodies){
      // accumulate forces between other and this
    }
    // update vx,vy,vz,x,y and z from accumulated data
    screen.paint(x,y,z);
  }
}

!  Loop through each Body (in array of bodies[]) to update and display

for (Body b: bodies)
  b.updateAndShow(screen, bodies);


WITHOUT HSA WE CAN'T (EFFICIENTLY) USE OBJECTS

!  In Java, allocated objects are scattered on the heap.
‒  There is no way to allocate an array of objects in contiguous memory (as with C++)
‒  We force the developer to resort to using parallel arrays of primitives (which are contiguous)

float x[], y[], z[], m[], vx[], vy[], vz[];

‒  And to infer that x[n], y[n] and z[n] hold the state for bodies[n].

Kernel kernel = new Kernel(){
  public void run(){
    int i = getGlobalId(0);
    for (int j=0; j<bodies.length; j++){
      // accum forces between (x,y,z)[j] and (x,y,z)[i]
    }
    // update vx[i],vy[i],vz[i],x[i],y[i] and z[i]
  }
};

‒  Then the kernel can be used to execute the code on the GPU

kernel.execute(bodies.length);
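The "parallel arrays of primitives" workaround described above amounts to copying each field out of the scattered objects into its own contiguous array. A minimal sketch, assuming a hypothetical immutable Body holder and the helper name toParallelArrays (neither is part of Aparapi):

```java
final class ParallelArraysSketch {
    // Hypothetical body holder; only position and mass are copied in this sketch.
    static final class Body {
        final float x, y, z, m;
        Body(float x, float y, float z, float m) {
            this.x = x; this.y = y; this.z = z; this.m = m;
        }
    }

    // Returns {x[], y[], z[], m[]}: element n of each array holds the state of bodies[n].
    static float[][] toParallelArrays(Body[] bodies) {
        int n = bodies.length;
        float[] x = new float[n], y = new float[n], z = new float[n], m = new float[n];
        for (int i = 0; i < n; i++) {
            x[i] = bodies[i].x;
            y[i] = bodies[i].y;
            z[i] = bodies[i].z;
            m[i] = bodies[i].m;
        }
        return new float[][] {x, y, z, m};
    }
}
```

The cost of this approach is visible even in the sketch: the object model and the kernel's data layout diverge, and the developer has to keep the index correspondence in mind.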


HSA ENABLED APARAPI (AND SUMATRA) ALLOWS USE OF OBJECTS

!  So we code our Body class exactly as we would if executing in Java.

class Body{
  float x,y,z,m,vx,vy,vz;
  // Include method to update position and display
  void updateAndShow(Screen screen, Body[] bodies){
    for (Body other:bodies){
      // accumulate forces between other and this
    }
    // update vx,vy,vz,x,y and z from accumulated data
    screen.paint(x,y,z);
  }
}

!  Then use new Aparapi lambda enabled API to coordinate dispatch to the GPU

Device.hsa().forEach(bodies, b -> {
  b.updateAndShow(screen, bodies);
});
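For readers without an HSA device, the same object-based shape can be tried with the JDK 8 stream API alone. The sketch below is an assumption-laden illustration, not the presenters' NBody code: it drops the Screen, fills in a simple softened gravity formula, and splits the update into two passes so that no body's position changes while others are still reading it.

```java
import java.util.Arrays;

final class NBodyObjectsSketch {
    static final class Body {
        float x, y, z, m, vx, vy, vz;
        float ax, ay, az; // acceleration accumulated in pass 1

        Body(float x, float y, float z, float m) {
            this.x = x; this.y = y; this.z = z; this.m = m;
        }

        // Pass 1: reads other bodies' positions, writes only this body's acceleration.
        void accumulate(Body[] bodies) {
            float fx = 0f, fy = 0f, fz = 0f;
            for (Body other : bodies) {
                if (other == this) continue;
                float dx = other.x - x, dy = other.y - y, dz = other.z - z;
                float distSq = dx * dx + dy * dy + dz * dz + 1e-3f; // softening term
                float scale = other.m / (distSq * (float) Math.sqrt(distSq));
                fx += dx * scale; fy += dy * scale; fz += dz * scale;
            }
            ax = fx; ay = fy; az = fz;
        }

        // Pass 2: touches only this body's own fields.
        void integrate(float dt) {
            vx += ax * dt; vy += ay * dt; vz += az * dt;
            x += vx * dt; y += vy * dt; z += vz * dt;
        }
    }

    static void step(Body[] bodies, float dt) {
        Arrays.stream(bodies).parallel().forEach(b -> b.accumulate(bodies));
        Arrays.stream(bodies).parallel().forEach(b -> b.integrate(dt));
    }
}
```

The lambda passed to forEach has the same form as the one given to Device.hsa().forEach on the slide, which is the point of the bridge: the code that describes the work does not change when the dispatch target does.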


OVERVIEW OF HSA ENABLED APARAPI

Development time
‒  MyLambda.java, compiled by javac (compiler) to MyLambda.class

Runtime
!  HSA enabled Aparapi, at runtime:
‒  Step 0: Generate HSAIL from bytecode
‒  Step 1: Generate host HSA Runtime calls
‒  Step 1.1: Initialize HSA runtime, device, queue …
‒  Step 1.2: Finalize HSAIL to generate GPU ISA
‒  Step 1.3: Bind Java args to HSA args
‒  Step 1.4: Dispatch the kernel
‒  Step 1.5: Wait for completion
‒  Repeat steps 1.3 - 1.5 for next iteration of same kernel
‒  Repeat steps 0 - 1 for each new kernel

[Diagram: Application (contains MyLambda.class, the input); Aparapi (Generate HSAIL; Generate HSA RT calls: Initialize, Finalize, Bind Args, Dispatch); JVM; CPU ISA; CPU; GPU ISA; GPU]

HIGH LEVEL HSA FEATURES

!  Features currently being defined in the HSA Working Groups**
‒  Unified addressing across all processors
‒  Operation into pageable system memory
‒  Full memory coherency
‒  Platform atomics
‒  User mode dispatch
‒  Enables fast dispatch with no driver involvement
‒  Architected queuing language
‒  Flexible compute dispatch, easier GPU self-enqueue
‒  High level language support for GPU compute processors
‒  Preemption and context switching

** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups

© Copyright 2012 HSA Foundation. All Rights Reserved.

HSA INTERMEDIATE LANGUAGE (HSAIL)**

!  HSAIL is a virtual ISA for parallel programs
‒  Finalized to vendor-specific ISA by a JIT compiler or "Finalizer"
‒  ISA independent by design for CPU & GPU
!  Explicitly parallel
‒  Designed for data parallel programming
!  Support for exceptions, virtual functions, and other high level language features
!  Lower level than OpenCL™ SPIR
‒  Fits naturally in the OpenCL™ compilation stack
!  Suitable to support additional high level languages and programming models:
‒  Java, C++, OpenMP, etc.

** Subject to change, pending completion and ratification of specifications in the HSA Working Groups


HSAIL OVERVIEW**

INSTRUCTION SET

!  Similar to assembly language for a RISC CPU
‒  Load-store architecture

ld_global_u64 $d0, [$d6 + 120];   // $d0 = load($d6+120)
add_u64 $d1, $d2, 24;             // $d1 = $d2+24

!  136 opcodes (Java™ bytecode has 200)
‒  Floating point (single, double, half (f16))
‒  Integer (32-bit, 64-bit)
‒  Some packed operations
‒  Branches
‒  Function calls
‒  Platform Atomic Operations: and, or, xor, exch, add, sub, inc, dec, max, min, cas
‒  Synchronize host CPU and HSA Component!
!  Text and Binary formats ("BRIG")

REGISTERS

!  Four classes of registers
‒  C: 1-bit, Control Registers
‒  S: 32-bit, Single-precision FP or Int
‒  D: 64-bit, Double-precision FP or Long Int
‒  Q: 128-bit, Packed data.
!  Fixed number of registers:
‒  8 C
‒  S, D, Q share a single pool of resources

S + 2*D + 4*Q <= 128
Up to 128 S or 64 D or 32 Q (or a blend)

!  Register allocation done in high-level compiler
‒  Finalizer doesn't have to perform expensive register allocation

** Subject to change, pending completion and ratification of specifications in the HSA Working Groups
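The shared register pool rule (S + 2*D + 4*Q <= 128) is simple enough to check mechanically. The helper below is only an illustration of the arithmetic on this slide; the class and method names are hypothetical and do not come from any HSA tool.

```java
final class HsailRegisterBudgetSketch {
    static final int POOL_SIZE = 128; // pool size in S-register units, as stated on the slide

    // True if s 32-bit, d 64-bit and q 128-bit registers fit in the shared pool.
    static boolean fits(int s, int d, int q) {
        if (s < 0 || d < 0 || q < 0) {
            throw new IllegalArgumentException("register counts must be non-negative");
        }
        return s + 2 * d + 4 * q <= POOL_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(fits(32, 16, 16)); // a blend: 32 + 32 + 64 = 128
    }
}
```

A high-level compiler that performs register allocation itself, as the slide describes, would apply a check of this kind before emitting HSAIL.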


SEGMENTS AND MEMORY**

!  7 segments of memory
‒  global, readonly, group, spill, private, arg, kernarg
‒  Memory instructions can (optionally) specify a segment

ld_global_u64 $d0, [$d6]
ld_group_u64 $d0, [$d6+24]
st_spill_f32 $s1, [$d6+4]
ld_kernarg_u64 $d6, [%_arg0]
ld_u64 $d0, [$d6+24] ; flat

!  Global Segment
‒  Visible to all HSA agents (including host CPU)
!  Group Segment
‒  Provides high-performance memory shared in the work-group.
‒  Group memory can be read and written by any work-item in the work-group
‒  HSAIL provides sync operations to control visibility of group memory
!  Spill, Private, Arg Segments
‒  Represent different regions of a per-work-item stack
‒  Typically generated by compiler, not specified by programmer
!  Kernarg Segment
‒  Programmer writes kernarg segment to pass arguments to a kernel
!  Read-Only Segment
‒  Remains constant during execution of kernel
!  Flat Addressing
‒  Each segment mapped into virtual address space
‒  Flat addresses can map to segments based on virtual address
‒  Instructions with no explicit segment use flat addressing
‒  Very useful for high-level language support (ie classes, libraries)
‒  Aligns well with OpenCL 2.0 "generic" addressing feature

** Subject to change, pending completion and ratification of specifications in the HSA Working Groups


EXAMPLE – BYTECODE TO HSAIL GENERATION

int in[], out[];
Device.hsa().forEach(len, i->
  out[i] = in[i] * in[i]
);

javac -g squares.java

0: aload_0 //out[]
1: iload_2 //i
2: aload_1 //in[]
3: iload_2
4: iaload
5: aload_1
6: iload_2
7: iaload
8: imul
9: iastore
10: return

Generated HSAIL

version 0:95: $full : $large;
kernel &run(
  kernarg_u64 %_arg0, //out[]
  kernarg_u64 %_arg1, //in[]
  kernarg_s32 %_arg2
){
  ld_kernarg_u64 $d0, [%_arg0];
  ld_kernarg_u64 $d1, [%_arg1];
  ld_kernarg_s32 $s2, [%_arg2];
  workitemabsid_u32 $s2, 0; //i
  mov_b64 $d3, $d0;
  mov_b32 $s4, $s2;
  mov_b64 $d5, $d1;
  mov_b32 $s6, $s2;
  cvt_u64_s32 $d6, $s6;
  mad_u64 $d6, $d6, 4, $d5;
  ld_global_s32 $s5, [$d6+24];
  mov_b64 $d6, $d1;
  mov_b32 $s7, $s2;
  cvt_u64_s32 $d7, $s7;
  mad_u64 $d7, $d7, 4, $d6;
  ld_global_s32 $s6, [$d7+24];
  mul_s32 $s5, $s5, $s6;
  cvt_u64_s32 $d4, $s4;
  mad_u64 $d4, $d4, 4, $d3;
  st_global_s32 $s5, [$d4+24];
  ret;
};
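Each iaload and iastore in the bytecode turns into the same three-instruction address pattern in the HSAIL: widen the index (cvt_u64_s32), multiply by 4 and add the array reference (mad_u64), then access memory at a +24 displacement. The sketch below restates that arithmetic in Java. Reading the 24 as the offset of the first int element past the array header on the JVM used for the listing is an assumption, and the names are hypothetical.

```java
final class ArrayAddressSketch {
    // Assumption: displacement of element 0 from the array reference, as seen in the listing.
    static final long ARRAY_DATA_OFFSET = 24;
    static final long INT_BYTES = 4;

    // Mirrors: cvt_u64_s32 idx; mad_u64 addr, idx, 4, base; ld/st [addr + 24].
    static long elementAddress(long arrayBase, int index) {
        long widened = index;                      // sign-extending conversion, like cvt_u64_s32
        long scaled = widened * INT_BYTES + arrayBase; // multiply-add, like mad_u64
        return scaled + ARRAY_DATA_OFFSET;         // displacement applied by the load or store
    }

    public static void main(String[] args) {
        System.out.println(elementAddress(1000L, 3));
    }
}
```

Because the GPU dereferences the same virtual addresses the JVM uses, no copy of in[] or out[] is needed; only the array references travel through the kernarg segment.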

APARAPI JNI CALL -> HSA RUNTIME API

Device Discovery & Queue Creation APIs**

!  Discover HSA Device
‒  Both count and device_list are out params
‒  User can iterate over HSA devices in the list

HsaStatus HsaGetDevices(unsigned int *count,
                        const HsaDevice **device_list);

!  User-Mode Queue Creation
‒  User can provide pre-allocated buffer
‒  If not, API will allocate a buffer
‒  queue is the user-mode queue

HsaStatus HsaCreateUserModeQueue(const HsaDevice *device,
                                 void *buffer, size_t buffer_size,
                                 HsaQueuePriority queue_priority,
                                 HsaQueueFraction queue_fraction,
                                 HsaQueue **queue);

** All APIs subject to change, pending completion and ratification of specifications in the HSA Working Groups


APARAPI JNI -> HSA RUNTIME API

Finalize HSAIL to GPU ISA**

!  Translating HSAIL text to Binary (BRIG)
‒  BRIG is a binary container for several sections
   ‒  Code
   ‒  String
   ‒  Directive
   ‒  …
‒  libHsail is an assembler/disassembler
   ‒  This is a standalone compiler library
   ‒  Not part of Runtime

Status Assemble(const char* hsail_text, HsaBrig *brig);

!  Finalize Brig to IHV specific GPU ISA
‒  Input: Brig
‒  Output: HsaKernelCode which contains ISA

HsaStatus HsaFinalizeBrig(const HsaDevice *device,
                          HsaBrig *brig,
                          const char *kernel_name,
                          const char *options,
                          HsaKernelCode **kernel);

** All APIs subject to change, pending completion and ratification of specifications in the HSA Working Groups


APARAPI JNI -> POPULATION OF AQL DISPATCH PACKET

!  AQL Dispatch Packet**
‒  Header enables:
   ‒  Different packet types
   ‒  Specify if this packet should wait for all previous to complete
   ‒  Control visibility of data and memory fences before and after dispatch
‒  Body enables:
   ‒  Specify the problem fan out using launch config related fields
   ‒  How much workgroup memory?
   ‒  Location of IHV specific GPU ISA
   ‒  Location of where kernelargs can be found
   ‒  A signal mechanism to wait on kernel completion

!  Only populating Kernel info and signal are opaque, so require runtime APIs
‒  Other fields are open, so simple assignments

typedef struct HsaAqlDispatchPacket {
  // Header Fields
  uint32_t format : 8;
  uint32_t barrier : 1;
  uint32_t acquire_fence_scope : 2;
  uint32_t release_fence_scope : 2;
  uint32_t invalidate_instruction_cache : 1;
  uint32_t invalidate_roi_image_cache : 1;
  uint32_t dimensions : 2;
  uint32_t reserved : 15;
  // Launch Config
  uint16_t workgroup_size[3];
  uint16_t reserved2;
  uint32_t grid_size[3];
  // Kernel Info
  uint32_t private_segment_size_bytes;
  uint32_t group_segment_size_bytes;
  uint64_t kernel_object_address;
  uint64_t kernel_arg_address;
  uint64_t reserved3;
  // Kernel Synchronization
  uint64_t completion_signal;
} HsaAqlDispatchPacket;

** Subject to change, pending completion and ratification of specifications in the HSA Working Groups
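The header bit-fields in the struct add up to exactly one 32-bit word (8 + 1 + 2 + 2 + 1 + 1 + 2 + 15). The Java sketch below packs them by hand to make the layout concrete. It assumes the fields are laid out from the least significant bit upward in declaration order, which is typical for little-endian C compilers but is not stated in the slides; the class and method names are hypothetical.

```java
final class AqlHeaderSketch {
    // Packs the header fields into one 32-bit word, first-declared field in the lowest bits.
    static int pack(int format, boolean barrier, int acquireFenceScope, int releaseFenceScope,
                    boolean invalidateInstructionCache, boolean invalidateRoiImageCache,
                    int dimensions) {
        check(format, 8);
        check(acquireFenceScope, 2);
        check(releaseFenceScope, 2);
        check(dimensions, 2);
        int word = format;                                   // bits 0-7
        word |= (barrier ? 1 : 0) << 8;                      // bit 8
        word |= acquireFenceScope << 9;                      // bits 9-10
        word |= releaseFenceScope << 11;                     // bits 11-12
        word |= (invalidateInstructionCache ? 1 : 0) << 13;  // bit 13
        word |= (invalidateRoiImageCache ? 1 : 0) << 14;     // bit 14
        word |= dimensions << 15;                            // bits 15-16
        return word;                                         // bits 17-31 reserved, left zero
    }

    // Reads the 2-bit dimensions field back out of a packed header word.
    static int dimensions(int word) {
        return (word >>> 15) & 0x3;
    }

    private static void check(int value, int bits) {
        if (value < 0 || value >= (1 << bits)) {
            throw new IllegalArgumentException("value does not fit in " + bits + " bits");
        }
    }
}
```

These are the "open" fields the slide refers to: they are filled by plain assignment, unlike the kernel object and completion signal, which come from runtime calls.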


POPULATING KERNEL INFO AND SIGNAL USING HSA RT API**

HsaStatus HsaFinalizeBrig(const HsaDevice *device,
                          HsaBrig *brig,
                          const char *kernel_name,
                          const char *options,
                          HsaKernelCode **kernel);

typedef struct HsaKernelCode {
  …
  uint32_t workitem_private_segment_byte_size;
  uint32_t workgroup_group_segment_byte_size;
  uint64_t kernarg_segment_byte_size;
  …
} HsaKernelCode;

typedef struct HsaAqlDispatchPacket {
  …
  uint32_t private_segment_size_bytes;
  uint32_t group_segment_size_bytes;
  uint64_t kernel_object_address;
  uint64_t kernel_arg_address;
  …
  uint64_t completion_signal;
} HsaAqlDispatchPacket;

‒  Pack Java args into a vector in JNI
‒  Register vector data address

HsaStatus HsaRegisterSystemMemory(void *address, size_t size);

HsaStatus HsaCreateSignal(HsaSignal *signal);

** Subject to change, pending completion and ratification of specifications in the HSA Working Groups

DISPATCH AND WAIT ON KERNEL COMPLETION

!  Dispatch
‒  Submit AQL Packet into the HsaQueue
‒  Thread safe API

HsaStatus HsaSubmitAql(HsaQueue *queue, HsaAqlDispatchPacket *aql_packet);

!  Wait on Kernel Completion

bool is_done = false;
while (!is_done) {
  status = HsaQuerySignal(signal, &is_done);
  assert(status == kHsaStatusSuccess);
}

** Subject to change, pending completion and ratification of specifications in the HSA Working Groups
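The dispatch-then-poll pattern above can be mimicked in pure Java to see how the host and the device side interact. In this sketch a plain thread stands in for the GPU and an AtomicBoolean stands in for the opaque HSA signal; none of the names correspond to the HSA runtime API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class CompletionSignalSketch {
    // Stand-in for the completion signal: the "device" sets it, the host polls it.
    static final class Signal {
        private final AtomicBoolean done = new AtomicBoolean(false);
        void complete() { done.set(true); }
        boolean query() { return done.get(); }
    }

    static int[] dispatchAndWait(int[] in) {
        int[] out = new int[in.length];
        Signal signal = new Signal();

        // "Dispatch": hand the work to another thread that plays the role of the device.
        Thread device = new Thread(() -> {
            for (int i = 0; i < in.length; i++) {
                out[i] = in[i] * in[i];
            }
            signal.complete(); // set last, so results are visible once the host sees it
        });
        device.start();

        // "Wait on kernel completion": poll the signal, as in the loop on the slide.
        while (!signal.query()) {
            Thread.yield();
        }
        return out;
    }
}
```

Spinning keeps latency low for short kernels; a host that expects long-running kernels would probably prefer a blocking wait or a back-off between polls.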


!  After completion, dispose of HSA resources
‒  Release queue
‒  Release signal
‒  Release Kernel object
‒  Deregister kernel args related memory

HsaStatus HsaDestroyUserModeQueue(HsaQueue *queue);
HsaStatus HsaDestroySignal(HsaSignal signal);
HsaStatus HsaFreeKernelCode(HsaKernelCode *kernel);
HsaStatus HsaDeregisterSystemMemory(void *address);

DEMO


SUMMARY

!  Aparapi is already an established framework for simplifying execution of Java on GPU devices
!  HSA enabled Aparapi further simplifies GPU acceleration of Java applications
‒  Aligns with Java 8 features to support 'lambda' expressions for compactness
‒  Enables 'large unified' system memory for GPU acceleration
‒  Eases programming by enabling direct access to Java objects on heap
‒  Enables fast offload of Java kernels through User-mode queue and AQL
!  HSA enabled Aparapi lends itself to more interesting future possibilities
‒  Simplified communication and workload balancing across both CPU and GPU
‒  Exploit new computation patterns and recursions through kernel self-enqueue


QUESTIONS & ANSWERS?

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a trademark of Apple Inc. HSA is a trademark of the Heterogeneous System Architecture Foundation. Other names are for informational purposes only and may be trademarks of their respective owners.
