SlideShare uma empresa Scribd logo
1 de 38
Baixar para ler offline
Implement Runtime Environments
for HSA using LLVM
Jim Huang ( 黃敬群 ) <jserv.tw@gmail.com>
Oct 17, 2013 / ITRI, Taiwan
About this presentation
• Transition to heterogeneous
– mobile phone to data centre
• LLVM and HSA
• HSA driven computing environment
Transition to Heterogeneous
Environments
Implement Runtime Environments for HSA using LLVM
Herb Sutter’s new outlook
http://herbsutter.com/welcome-to-the-jungle /
“In the twilight of Moore’s Law, the transitions to multicore
processors, GPU computing, and HaaS cloud computing are
not separate trends, but aspects of a single trend –
mainstream computers from desktops to ‘smartphones’ are
being permanently transformed into heterogeneous
supercomputer clusters. Henceforth, a single computeintensive application will need to harness different kinds of
cores, in immense numbers, to get its job done.”

“The free lunch is over.
Now welcome to the hardware jungle.”
Four causes of heterogeneity
• Multiple types of programmable core
•
•
•

CPU (lightweight, heavyweight)
GPU
Accelerators and application-specific

• Interconnect asymmetry
• Memory hierarchies
• Service oriented software demanding
High Performance vs. Embedded
Embedded
Type of
processors
Size
Memory

HPC

Heterogeneous

Homogeneous

Small

Massive

Shared

Distributed

but getting closer day by day...
Heterogeneity is mainstream

Dual-core ARM 1.4GHz, ARMv7s CPU
Quad-core ARM Cortex A9 CPU
Quad-core SGX543MP4+ Imagination GPU Triple-core SGX554MP4 Imagination GPU

Most tablets and smartphones are already
powered by heterogeneous processors.
Source: Heterogeneous Computing in ARM
Implement Runtime Environments for HSA using LLVM
Current limitations
• Disjoint view of memory spaces between
CPUs and GPUs
• Hard partition between “host” and
“devices” in programming models
• Dynamically varying nested parallelism
almost impossible to support
• Large overheads in scheduling
heterogeneous, parallel tasks
Background Facilities
OpenCL Framework overview
Host Program
int main(int argc, char **argv)
{
...
clBuildProgram(program, ...);
clCreateKernel(program, “dot”...);
...
}

OpenCL Kernels
__kernel void dot(__global const float4 *a
__global const float4 *b
__global float4 *c)
{
int tid = get_global_id(0);
c[tid] = a[tid] * b[tid];
}

Application
Separate programs into
host-side and kernel-side
code fragment

OpenCL Framework
OpenCL Runtime

Runtime APIs

Platform APIs

OpenCL Compiler
Compiler
• compile OpenCL C
language just-in-time
Runtime
• allow host program to
manipulate context

Front-end
Front-end
Back-end
Back-end

Platform

MPU

GPU

MPU : host, kernel program
GPU : kernel program

Source: OpenCL_for_Halifux.pdf, OpenCL overview, Intel Visual Adrenaline
OpenCL Execution Scenario
kernel
{
code fragment 1
code fragment 2
code fragment 3
}
Task-level parallelism
1

2

3

X O

Data-level parallelism
kernel A
{
code 1
code 2
code 3
}

A:1,2,3
data(0)

A:1,2,3
data(2)

A:1,2,3
data(3)

kernel A
{
code 1
}

A:1,2,3
data(1)
A:1,2,3
data(4)

A:1,2,3
data(5)

kernel B
{
code 2
}

kernel C
{
code 3
}

OpenCL Runtime
A:1

B:2

C:3

A:1

B:2

C:3

14
Supporting OpenCL
• Syntax parsing by compiler
–
–
–
–

qualif i r
e
vector
built-in function
Optimizations on
single core

__kernel void add( __global float4 *a,
__global float4 *b,
__global float4 *c)
{
int gid = get_global_id(0);
float4 data = (float4) (1.0, 2.0, 3.0, 4.0);
c[gid] = a[gid] + b[gid] + data;
}

• Runtime implementation
– handle multi-core issues
– Co-work with device vendor
Platform APIs

MPU

Runtime APIs

GPU
LLVM-based compilation toolchain
Data in

Module 1
.
.
.

opt
&
link

Front
ends

Linked
module

Profiling
?

yes

lli

Module N
no
Back
end 1

Exe
1
.
.
.

Exe
M

.
.
.
Back
end
M

Module 1
.
.
.

Module
M

Partitioning
& mapping

Estimation

Profile
info
HSA: Heterogeneous System
Architecture
HSA overview
• open architecture specification
• HSAIL virtual (parallel) instruction set
• HSA memory model
• HSA dispatcher and run-time

• Provides an optimized platform
architecture for heterogeneous
programming models such as OpenCL,
C++AMP, et al
18
HSA Runtime Stack
Heterogeneous Programming
• Unified virtual address space for all cores
• CPU and GPU
• distributed arrays

• Hardware queues per code with
lightweight user mode task dispatch
• Enables GPU context switching, preemption,
efficient heterogeneous scheduling

• First class barrier objects
• Aids parallel program composability
HSA Intermediate Layer (HSAIL)
Virtual ISA for parallel programs
Similar to LLVM IR and OpenCL SPIR
Finalised to specific ISA by a JIT compiler
Make late decisions on which core should run a
task
• Features:
•
•
•
•

• Explicitly parallel
• Support for exceptions, virtual functions and other high-level
features
• syscall methods (I/O, printf etc.)
• Debugging support
HSA stack for Android (conceptual)
HSA memory model
• Compatible with C++11, OpenCL, Java and .NET
memory models
• Relaxed consistency
• Designed to support both managed language (such
as Java) and unmanaged languages (such as C)
• Will make it much easier to develop 3rd party
compilers for a wide range of heterogeneous
products
• E.g. Fortran, C++, C++AMP, Java et al
HSA dispatch
• HSA designed to enable heterogeneous
task queuing
• A work queue per core (CPU, GPU, …)
• Distribution of work into queues
• Load balancing by work stealing

• Any core can schedule work for any other,
including itself
• Significant reduction in overhead of
scheduling work for a core
Today’s Command and Dispatch Flow
Command Flow

Application
A

Direct3D

Data Flow

User
Mode
Driver

Soft
Queue
Command Buffer

Kernel
Mode
Driver
DMA Buffer

GPU
HARDWARE

A

Hardware
Queue
Today’s Command and Dispatch Flow
Command Flow

Application
A

Direct3D

Data Flow

User
Mode
Driver

Soft
Queue

Kernel
Mode
Driver

Command Buffer
Command Flow

DMA Buffer

Data Flow

B

A

B

C
Application
B

Direct3D

User
Mode
Driver

Soft
Queue

Kernel
Mode
Driver

Command Buffer
Command Flow

Application
C

Direct3D

GPU
HARDWARE

A

DMA Buffer

Data Flow

User
Mode
Driver

Soft
Queue
Command Buffer

Hardware
Queue

Kernel
Mode
Driver
DMA Buffer
HSA Dispatch Flow

✔
✔
✔
✔
✔
✔

Software View
User-mode dispatch to hardware
No kernel mode driver overhead
Low dispatch latency

✔
✔
✔
✔
✔
✔

Hardware View
HW / microcode controlled
HW scheduling
HW-managed protection
HSA enabled dispatch
Application / Runtime

CPU1

CPU2

GPU
HSA Memory Model
• compatible with C++11/Java/.NET memory models
• Relaxed consistency memory model
• Loads and stores can be re-ordered by the finalizer
Data Flow of HSA
HSAIL
• intermediate language for parallel compute in
HSA
–

Generated by a high level compiler (LLVM, gcc, Java VM, etc)

–

Compiled down to GPU ISA or parallel ISAFinalizer

–

Finalizer may execute at run time, install time or build time

• low level instruction set designed for parallel
compute in a shared virtual memory environment.
• designed for fast compile time, moving most
optimizations to HL compiler
–

Limited register set avoids full register allocation in finalizer
GPU based Languages
• Dynamic Language for exploration of
heterogeneous parallel runtimes
• LLVM-based compilation
– Java, Scala, JavaScript, OpenMP
• Project Sumatra:GPU Acceleration for Java in
OpenJDK
• ThorScript
Accelerating Java

Source: HSA, Phil Rogers
Open Source software stack for HSA
A Linux execution and compilation stack is open-sourced
by AMD
• Jump start the ecosystem
• Allow a single shared implementation where appropriate
• Enable university research in all areas
Component Name

Purpose

HSA Bolt Library

Enable understanding and debug

OpenCL HSAIL Code Generator

Enable research

LLVM Contributions

Industry and academic collaboration

HSA Assembler

Enable understanding and debug

HSA Runtime

Standardize on a single runtime

HSA Finalizer

Enable research and debug

HSA Kernel Driver

For inclusion in Linux distros
Open Source Tools
•
•
•
•

Hosted at GitHub: https://github.com/HSAFoundation
HSAIL-Tools: Assembler/Disassembler
Instruction Set Simulator
HSA ISS Loader Library for Java and C++ for
creation and dispatch HSAIL kernel
HSAIL example
ld_kernarg_u64 $d0, [%_out];
ld_kernarg_u64 $d1, [%_in];

sqrt((float
(i*i +
(i+1)*(i+1) +
(i+2)*(i+2)))

@block0:
workitemabsid_u32 $s2, 0;
cvt_s64_s32 $d2, $s2;
mad_u64 $d3, $d2, 8, $d1;
ld_global_u64 $d3, [$d3];
//pointer
ld_global_f32 $s0, [$d3+0]; // x
ld_global_f32 $s1, [$d3+4]; // y
ld_global_f32 $s2, [$d3+8]; // z
mul_f32 $s0, $s0, $s0;
// x*x
mul_f32 $s1, $s1, $s1;
// y*y
add_f32 $s0, $s0, $s1; // x*x + y*y
mul_f32 $s2, $s2, $s2;
// z*z
add_f32 $s0, $s0, $s2;// x*x + y*y + z*z
sqrt_f32 $s0, $s0;
mad_u64 $d4, $d2, 4, $d0;
st_global_f32 $s0, [$d4];
ret;
Conclusions
• Heterogeneity is an increasingly important trend
• The market is finally starting to create and adopt
the necessary open standards
• HSA should enable much more dynamically
heterogeneous nested parallel programs and
programming models
Reference
• Heterogeneous Computing in ARM (2013)
• Project Sumatra: GPU Acceleration for Java in
OpenJDK
– http://openjdk.java.net/projects/sumatra/
• HETEROGENEOUS SYSTEM ARCHITECTURE,
Phil Rogers

Mais conteúdo relacionado

Mais procurados

Shorten Device Boot Time for Automotive IVI and Navigation Systems
Shorten Device Boot Time for Automotive IVI and Navigation SystemsShorten Device Boot Time for Automotive IVI and Navigation Systems
Shorten Device Boot Time for Automotive IVI and Navigation SystemsNational Cheng Kung University
 
A tour of F9 microkernel and BitSec hypervisor
A tour of F9 microkernel and BitSec hypervisorA tour of F9 microkernel and BitSec hypervisor
A tour of F9 microkernel and BitSec hypervisorLouie Lu
 
Develop Your Own Operating Systems using Cheap ARM Boards
Develop Your Own Operating Systems using Cheap ARM BoardsDevelop Your Own Operating Systems using Cheap ARM Boards
Develop Your Own Operating Systems using Cheap ARM BoardsNational Cheng Kung University
 
Microkernel-based operating system development
Microkernel-based operating system developmentMicrokernel-based operating system development
Microkernel-based operating system developmentSenko Rašić
 
Introduction to Microkernels
Introduction to MicrokernelsIntroduction to Microkernels
Introduction to MicrokernelsVasily Sartakov
 
Gnu linux for safety related systems
Gnu linux for safety related systemsGnu linux for safety related systems
Gnu linux for safety related systemsDTQ4
 

Mais procurados (20)

Shorten Device Boot Time for Automotive IVI and Navigation Systems
Shorten Device Boot Time for Automotive IVI and Navigation SystemsShorten Device Boot Time for Automotive IVI and Navigation Systems
Shorten Device Boot Time for Automotive IVI and Navigation Systems
 
Faults inside System Software
Faults inside System SoftwareFaults inside System Software
Faults inside System Software
 
Android Virtualization: Opportunity and Organization
Android Virtualization: Opportunity and OrganizationAndroid Virtualization: Opportunity and Organization
Android Virtualization: Opportunity and Organization
 
Hints for L4 Microkernel
Hints for L4 MicrokernelHints for L4 Microkernel
Hints for L4 Microkernel
 
A tour of F9 microkernel and BitSec hypervisor
A tour of F9 microkernel and BitSec hypervisorA tour of F9 microkernel and BitSec hypervisor
A tour of F9 microkernel and BitSec hypervisor
 
olibc: Another C Library optimized for Embedded Linux
olibc: Another C Library optimized for Embedded Linuxolibc: Another C Library optimized for Embedded Linux
olibc: Another C Library optimized for Embedded Linux
 
Embedded Virtualization for Mobile Devices
Embedded Virtualization for Mobile DevicesEmbedded Virtualization for Mobile Devices
Embedded Virtualization for Mobile Devices
 
Develop Your Own Operating Systems using Cheap ARM Boards
Develop Your Own Operating Systems using Cheap ARM BoardsDevelop Your Own Operating Systems using Cheap ARM Boards
Develop Your Own Operating Systems using Cheap ARM Boards
 
Explore Android Internals
Explore Android InternalsExplore Android Internals
Explore Android Internals
 
Making Linux do Hard Real-time
Making Linux do Hard Real-timeMaking Linux do Hard Real-time
Making Linux do Hard Real-time
 
Xvisor: embedded and lightweight hypervisor
Xvisor: embedded and lightweight hypervisorXvisor: embedded and lightweight hypervisor
Xvisor: embedded and lightweight hypervisor
 
Plan 9: Not (Only) A Better UNIX
Plan 9: Not (Only) A Better UNIXPlan 9: Not (Only) A Better UNIX
Plan 9: Not (Only) A Better UNIX
 
Microkernel-based operating system development
Microkernel-based operating system developmentMicrokernel-based operating system development
Microkernel-based operating system development
 
Priority Inversion on Mars
Priority Inversion on MarsPriority Inversion on Mars
Priority Inversion on Mars
 
Introduction to Microkernels
Introduction to MicrokernelsIntroduction to Microkernels
Introduction to Microkernels
 
seL4 intro
seL4 introseL4 intro
seL4 intro
 
Android Optimization: Myth and Reality
Android Optimization: Myth and RealityAndroid Optimization: Myth and Reality
Android Optimization: Myth and Reality
 
Gnu linux for safety related systems
Gnu linux for safety related systemsGnu linux for safety related systems
Gnu linux for safety related systems
 
Understanding the Dalvik Virtual Machine
Understanding the Dalvik Virtual MachineUnderstanding the Dalvik Virtual Machine
Understanding the Dalvik Virtual Machine
 
Learn C Programming Language by Using GDB
Learn C Programming Language by Using GDBLearn C Programming Language by Using GDB
Learn C Programming Language by Using GDB
 

Destaque

Lecture notice about Embedded Operating System Design and Implementation
Lecture notice about Embedded Operating System Design and ImplementationLecture notice about Embedded Operating System Design and Implementation
Lecture notice about Embedded Operating System Design and ImplementationNational Cheng Kung University
 
中輟生談教育: 完全用開放原始碼軟體進行 嵌入式系統教學
中輟生談教育: 完全用開放原始碼軟體進行 嵌入式系統教學中輟生談教育: 完全用開放原始碼軟體進行 嵌入式系統教學
中輟生談教育: 完全用開放原始碼軟體進行 嵌入式系統教學National Cheng Kung University
 
給自己更好未來的 3 個練習:嵌入式作業系統設計、實做,與移植 (2015 年春季 ) 課程說明
給自己更好未來的 3 個練習:嵌入式作業系統設計、實做,與移植 (2015 年春季 ) 課程說明給自己更好未來的 3 個練習:嵌入式作業系統設計、實做,與移植 (2015 年春季 ) 課程說明
給自己更好未來的 3 個練習:嵌入式作業系統設計、實做,與移植 (2015 年春季 ) 課程說明National Cheng Kung University
 
進階嵌入式系統開發與實做 (2014 年秋季 ) 課程說明
進階嵌入式系統開發與實做 (2014 年秋季 ) 課程說明進階嵌入式系統開發與實做 (2014 年秋季 ) 課程說明
進階嵌入式系統開發與實做 (2014 年秋季 ) 課程說明National Cheng Kung University
 
PyPy's approach to construct domain-specific language runtime
PyPy's approach to construct domain-specific language runtimePyPy's approach to construct domain-specific language runtime
PyPy's approach to construct domain-specific language runtimeNational Cheng Kung University
 

Destaque (9)

Lecture notice about Embedded Operating System Design and Implementation
Lecture notice about Embedded Operating System Design and ImplementationLecture notice about Embedded Operating System Design and Implementation
Lecture notice about Embedded Operating System Design and Implementation
 
Making Linux do Hard Real-time
Making Linux do Hard Real-timeMaking Linux do Hard Real-time
Making Linux do Hard Real-time
 
中輟生談教育: 完全用開放原始碼軟體進行 嵌入式系統教學
中輟生談教育: 完全用開放原始碼軟體進行 嵌入式系統教學中輟生談教育: 完全用開放原始碼軟體進行 嵌入式系統教學
中輟生談教育: 完全用開放原始碼軟體進行 嵌入式系統教學
 
How A Compiler Works: GNU Toolchain
How A Compiler Works: GNU ToolchainHow A Compiler Works: GNU Toolchain
How A Compiler Works: GNU Toolchain
 
給自己更好未來的 3 個練習:嵌入式作業系統設計、實做,與移植 (2015 年春季 ) 課程說明
給自己更好未來的 3 個練習:嵌入式作業系統設計、實做,與移植 (2015 年春季 ) 課程說明給自己更好未來的 3 個練習:嵌入式作業系統設計、實做,與移植 (2015 年春季 ) 課程說明
給自己更好未來的 3 個練習:嵌入式作業系統設計、實做,與移植 (2015 年春季 ) 課程說明
 
進階嵌入式系統開發與實做 (2014 年秋季 ) 課程說明
進階嵌入式系統開發與實做 (2014 年秋季 ) 課程說明進階嵌入式系統開發與實做 (2014 年秋季 ) 課程說明
進階嵌入式系統開發與實做 (2014 年秋季 ) 課程說明
 
從線上售票看作業系統設計議題
從線上售票看作業系統設計議題從線上售票看作業系統設計議題
從線上售票看作業系統設計議題
 
Virtual Machine Constructions for Dummies
Virtual Machine Constructions for DummiesVirtual Machine Constructions for Dummies
Virtual Machine Constructions for Dummies
 
PyPy's approach to construct domain-specific language runtime
PyPy's approach to construct domain-specific language runtimePyPy's approach to construct domain-specific language runtime
PyPy's approach to construct domain-specific language runtime
 

Semelhante a Implement Runtime Environments for HSA using LLVM

Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systemsdairsie
 
Large Scale Computing Infrastructure - Nautilus
Large Scale Computing Infrastructure - NautilusLarge Scale Computing Infrastructure - Nautilus
Large Scale Computing Infrastructure - NautilusGabriele Di Bernardo
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLNordic APIs
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsHPCC Systems
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyPeter Clapham
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
ISCA Final Presentation - Intro
ISCA Final Presentation - IntroISCA Final Presentation - Intro
ISCA Final Presentation - IntroHSA Foundation
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...KRamasamy2
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...VAISHNAVI MADHAN
 
"The Vision API Maze: Options and Trade-offs," a Presentation from the Khrono...
"The Vision API Maze: Options and Trade-offs," a Presentation from the Khrono..."The Vision API Maze: Options and Trade-offs," a Presentation from the Khrono...
"The Vision API Maze: Options and Trade-offs," a Presentation from the Khrono...Edge AI and Vision Alliance
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OSri Ambati
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...OpenStack
 
A Source-To-Source Approach to HPC Challenges
A Source-To-Source Approach to HPC ChallengesA Source-To-Source Approach to HPC Challenges
A Source-To-Source Approach to HPC ChallengesChunhua Liao
 
HSA From A Software Perspective
HSA From A Software Perspective HSA From A Software Perspective
HSA From A Software Perspective HSA Foundation
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...Rogue Wave Software
 
Heterogeneous computing
Heterogeneous computingHeterogeneous computing
Heterogeneous computingRashid Ansari
 

Semelhante a Implement Runtime Environments for HSA using LLVM (20)

Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated SystemsPetapath HP Cast 12 - Programming for High Performance Accelerated Systems
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems
 
Large Scale Computing Infrastructure - Nautilus
Large Scale Computing Infrastructure - NautilusLarge Scale Computing Infrastructure - Nautilus
Large Scale Computing Infrastructure - Nautilus
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Neptune @ SoCal
Neptune @ SoCalNeptune @ SoCal
Neptune @ SoCal
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of ML
 
OpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC SystemsOpenPOWER Acceleration of HPCC Systems
OpenPOWER Acceleration of HPCC Systems
 
HSA Features
HSA FeaturesHSA Features
HSA Features
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
ISCA Final Presentation - Intro
ISCA Final Presentation - IntroISCA Final Presentation - Intro
ISCA Final Presentation - Intro
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
 
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
Floating Point Operations , Memory Chip Organization , Serial Bus Architectur...
 
"The Vision API Maze: Options and Trade-offs," a Presentation from the Khrono...
"The Vision API Maze: Options and Trade-offs," a Presentation from the Khrono..."The Vision API Maze: Options and Trade-offs," a Presentation from the Khrono...
"The Vision API Maze: Options and Trade-offs," a Presentation from the Khrono...
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
 
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
The Why and How of HPC-Cloud Hybrids with OpenStack - Lev Lafayette, Universi...
 
A Source-To-Source Approach to HPC Challenges
A Source-To-Source Approach to HPC ChallengesA Source-To-Source Approach to HPC Challenges
A Source-To-Source Approach to HPC Challenges
 
HSA From A Software Perspective
HSA From A Software Perspective HSA From A Software Perspective
HSA From A Software Perspective
 
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo... Debugging Numerical Simulations on Accelerated Architectures  - TotalView fo...
Debugging Numerical Simulations on Accelerated Architectures - TotalView fo...
 
Heterogeneous computing
Heterogeneous computingHeterogeneous computing
Heterogeneous computing
 

Mais de National Cheng Kung University

Mais de National Cheng Kung University (9)

2016 年春季嵌入式作業系統課程說明
2016 年春季嵌入式作業系統課程說明2016 年春季嵌入式作業系統課程說明
2016 年春季嵌入式作業系統課程說明
 
Interpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratchInterpreter, Compiler, JIT from scratch
Interpreter, Compiler, JIT from scratch
 
進階嵌入式作業系統設計與實做 (2015 年秋季 ) 課程說明
進階嵌入式作業系統設計與實做 (2015 年秋季 ) 課程說明進階嵌入式作業系統設計與實做 (2015 年秋季 ) 課程說明
進階嵌入式作業系統設計與實做 (2015 年秋季 ) 課程說明
 
The Internals of "Hello World" Program
The Internals of "Hello World" ProgramThe Internals of "Hello World" Program
The Internals of "Hello World" Program
 
Open Source from Legend, Business, to Ecosystem
Open Source from Legend, Business, to EcosystemOpen Source from Legend, Business, to Ecosystem
Open Source from Legend, Business, to Ecosystem
 
Summer Project: Microkernel (2013)
Summer Project: Microkernel (2013)Summer Project: Microkernel (2013)
Summer Project: Microkernel (2013)
 
進階嵌入式系統開發與實作 (2013 秋季班 ) 課程說明
進階嵌入式系統開發與實作 (2013 秋季班 ) 課程說明進階嵌入式系統開發與實作 (2013 秋季班 ) 課程說明
進階嵌入式系統開發與實作 (2013 秋季班 ) 課程說明
 
LLVM 總是打開你的心:從電玩模擬器看編譯器應用實例
LLVM 總是打開你的心:從電玩模擬器看編譯器應用實例LLVM 總是打開你的心:從電玩模擬器看編譯器應用實例
LLVM 總是打開你的心:從電玩模擬器看編譯器應用實例
 
Develop Your Own Operating System
Develop Your Own Operating SystemDevelop Your Own Operating System
Develop Your Own Operating System
 

Último

How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 

Último (20)

How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 

Implement Runtime Environments for HSA using LLVM

  • 1. Implement Runtime Environments for HSA using LLVM Jim Huang ( 黃敬群 ) <jserv.tw@gmail.com> Oct 17, 2013 / ITRI, Taiwan
  • 2. About this presentation • Transition to heterogeneous – mobile phone to data centre • LLVM and HSA • HSA driven computing environment
  • 5. Herb Sutter’s new outlook http://herbsutter.com/welcome-to-the-jungle / “In the twilight of Moore’s Law, the transitions to multicore processors, GPU computing, and HaaS cloud computing are not separate trends, but aspects of a single trend – mainstream computers from desktops to ‘smartphones’ are being permanently transformed into heterogeneous supercomputer clusters. Henceforth, a single computeintensive application will need to harness different kinds of cores, in immense numbers, to get its job done.” “The free lunch is over. Now welcome to the hardware jungle.”
  • 6. Four causes of heterogeneity • Multiple types of programmable core • • • CPU (lightweight, heavyweight) GPU Accelerators and application-specific • Interconnect asymmetry • Memory hierarchies • Service oriented software demanding
  • 7. High Performance vs. Embedded Embedded Type of processors Size Memory HPC Heterogeneous Homogeneous Small Massive Shared Distributed but getting closer day by day...
  • 8. Heterogeneity is mainstream Dual-core ARM 1.4GHz, ARMv7s CPU Quad-core ARM Cortex A9 CPU Quad-core SGX543MP4+ Imagination GPU Triple-core SGX554MP4 Imagination GPU Most tablets and smartphones are already powered by heterogeneous processors.
  • 11. Current limitations • Disjoint view of memory spaces between CPUs and GPUs • Hard partition between “host” and “devices” in programming models • Dynamically varying nested parallelism almost impossible to support • Large overheads in scheduling heterogeneous, parallel tasks
  • 13. OpenCL Framework overview Host Program int main(int argc, char **argv) { ... clBuildProgram(program, ...); clCreateKernel(program, “dot”...); ... } OpenCL Kernels __kernel void dot(__global const float4 *a __global const float4 *b __global float4 *c) { int tid = get_global_id(0); c[tid] = a[tid] * b[tid]; } Application Separate programs into host-side and kernel-side code fragment OpenCL Framework OpenCL Runtime Runtime APIs Platform APIs OpenCL Compiler Compiler • compile OpenCL C language just-in-time Runtime • allow host program to manipulate context Front-end Front-end Back-end Back-end Platform MPU GPU MPU : host, kernel program GPU : kernel program Source: OpenCL_for_Halifux.pdf, OpenCL overview, Intel Visual Adrenaline
  • 14. OpenCL Execution Scenario kernel { code fragment 1 code fragment 2 code fragment 3 } Task-level parallelism 1 2 3 X O Data-level parallelism kernel A { code 1 code 2 code 3 } A:1,2,3 data(0) A:1,2,3 data(2) A:1,2,3 data(3) kernel A { code 1 } A:1,2,3 data(1) A:1,2,3 data(4) A:1,2,3 data(5) kernel B { code 2 } kernel C { code 3 } OpenCL Runtime A:1 B:2 C:3 A:1 B:2 C:3 14
  • 15. Supporting OpenCL • Syntax parsing by compiler – – – – qualif i r e vector built-in function Optimizations on single core __kernel void add( __global float4 *a, __global float4 *b, __global float4 *c) { int gid = get_global_id(0); float4 data = (float4) (1.0, 2.0, 3.0, 4.0); c[gid] = a[gid] + b[gid] + data; } • Runtime implementation – handle multi-core issues – Co-work with device vendor Platform APIs MPU Runtime APIs GPU
  • 16. LLVM-based compilation toolchain Data in Module 1 . . . opt & link Front ends Linked module Profiling ? yes lli Module N no Back end 1 Exe 1 . . . Exe M . . . Back end M Module 1 . . . Module M Partitioning & mapping Estimation Profile info
  • 18. HSA overview • open architecture specification • HSAIL virtual (parallel) instruction set • HSA memory model • HSA dispatcher and run-time • Provides an optimized platform architecture for heterogeneous programming models such as OpenCL, C++AMP, et al 18
  • 20. Heterogeneous Programming • Unified virtual address space for all cores • CPU and GPU • distributed arrays • Hardware queues per code with lightweight user mode task dispatch • Enables GPU context switching, preemption, efficient heterogeneous scheduling • First class barrier objects • Aids parallel program composability
  • 21. HSA Intermediate Layer (HSAIL) Virtual ISA for parallel programs Similar to LLVM IR and OpenCL SPIR Finalised to specific ISA by a JIT compiler Make late decisions on which core should run a task • Features: • • • • • Explicitly parallel • Support for exceptions, virtual functions and other high-level features • syscall methods (I/O, printf etc.) • Debugging support
  • 22. HSA stack for Android (conceptual)
  • 23. HSA memory model • Compatible with C++11, OpenCL, Java and .NET memory models • Relaxed consistency • Designed to support both managed language (such as Java) and unmanaged languages (such as C) • Will make it much easier to develop 3rd party compilers for a wide range of heterogeneous products • E.g. Fortran, C++, C++AMP, Java et al
  • 24. HSA dispatch • HSA designed to enable heterogeneous task queuing • A work queue per core (CPU, GPU, …) • Distribution of work into queues • Load balancing by work stealing • Any core can schedule work for any other, including itself • Significant reduction in overhead of scheduling work for a core
  • 25. Today’s Command and Dispatch Flow Command Flow Application A Direct3D Data Flow User Mode Driver Soft Queue Command Buffer Kernel Mode Driver DMA Buffer GPU HARDWARE A Hardware Queue
  • 26. Today’s Command and Dispatch Flow Command Flow Application A Direct3D Data Flow User Mode Driver Soft Queue Kernel Mode Driver Command Buffer Command Flow DMA Buffer Data Flow B A B C Application B Direct3D User Mode Driver Soft Queue Kernel Mode Driver Command Buffer Command Flow Application C Direct3D GPU HARDWARE A DMA Buffer Data Flow User Mode Driver Soft Queue Command Buffer Hardware Queue Kernel Mode Driver DMA Buffer
  • 27. HSA Dispatch Flow ✔ ✔ ✔ ✔ ✔ ✔ Software View User-mode dispatch to hardware No kernel mode driver overhead Low dispatch latency ✔ ✔ ✔ ✔ ✔ ✔ Hardware View HW / microcode controlled HW scheduling HW-managed protection
  • 28. HSA enabled dispatch Application / Runtime CPU1 CPU2 GPU
  • 29. HSA Memory Model • compatible with C++11/Java/.NET memory models • Relaxed consistency memory model • Loads and stores can be re-ordered by the finalizer
  • 31. HSAIL • intermediate language for parallel compute in HSA – Generated by a high level compiler (LLVM, gcc, Java VM, etc) – Compiled down to GPU ISA or parallel ISAFinalizer – Finalizer may execute at run time, install time or build time • low level instruction set designed for parallel compute in a shared virtual memory environment. • designed for fast compile time, moving most optimizations to HL compiler – Limited register set avoids full register allocation in finalizer
  • 32. GPU based Languages • Dynamic Language for exploration of heterogeneous parallel runtimes • LLVM-based compilation – Java, Scala, JavaScript, OpenMP • Project Sumatra:GPU Acceleration for Java in OpenJDK • ThorScript
  • 34. Open Source software stack for HSA A Linux execution and compilation stack is open-sourced by AMD • Jump start the ecosystem • Allow a single shared implementation where appropriate • Enable university research in all areas Component Name Purpose HSA Bolt Library Enable understanding and debug OpenCL HSAIL Code Generator Enable research LLVM Contributions Industry and academic collaboration HSA Assembler Enable understanding and debug HSA Runtime Standardize on a single runtime HSA Finalizer Enable research and debug HSA Kernel Driver For inclusion in Linux distros
  • 35. Open Source Tools • • • • Hosted at GitHub: https://github.com/HSAFoundation HSAIL-Tools: Assembler/Disassembler Instruction Set Simulator HSA ISS Loader Library for Java and C++ for creation and dispatch HSAIL kernel
  • 36. HSAIL example ld_kernarg_u64 $d0, [%_out]; ld_kernarg_u64 $d1, [%_in]; sqrt((float (i*i + (i+1)*(i+1) + (i+2)*(i+2))) @block0: workitemabsid_u32 $s2, 0; cvt_s64_s32 $d2, $s2; mad_u64 $d3, $d2, 8, $d1; ld_global_u64 $d3, [$d3]; //pointer ld_global_f32 $s0, [$d3+0]; // x ld_global_f32 $s1, [$d3+4]; // y ld_global_f32 $s2, [$d3+8]; // z mul_f32 $s0, $s0, $s0; // x*x mul_f32 $s1, $s1, $s1; // y*y add_f32 $s0, $s0, $s1; // x*x + y*y mul_f32 $s2, $s2, $s2; // z*z add_f32 $s0, $s0, $s2;// x*x + y*y + z*z sqrt_f32 $s0, $s0; mad_u64 $d4, $d2, 4, $d0; st_global_f32 $s0, [$d4]; ret;
  • 37. Conclusions • Heterogeneity is an increasingly important trend • The market is finally starting to create and adopt the necessary open standards • HSA should enable much more dynamically heterogeneous nested parallel programs and programming models
  • 38. Reference • Heterogeneous Computing in ARM (2013) • Project Sumatra: GPU Acceleration for Java in OpenJDK – http://openjdk.java.net/projects/sumatra/ • HETEROGENEOUS SYSTEM ARCHITECTURE, Phil Rogers