Keynote presentation, Is There Anything New in Heterogeneous Computing, by Mike Muller, Chief Technology Officer, ARM, at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
6. Printing and Imprinting Thin Film Transistors (TFT)
Can be transparent, bio-degradable and even ingestible
Unit cost 1000x less than mainstream CMOS
CMOS @ $40,000/m2 vs. TFT @ $10/m2
Printing CAPEX can be less than $1,000
350dpi = 200um @ 20 m/s
Can print batteries, antenna
Mainly organic at ~20 volts
Imprint CAPEX: a $2M DVD press is high volume
Better controllability hence higher density and performance
1um today, scaling to 50nm features as used today for Blu-ray discs
Mainly Inorganic NMOS only at ~2 volts
9. Is There Anything New in Heterogeneous Computing?
                       Vector Add   Reduction   Matrix Mul
GPU OpenCL on GPU      1.00         1.00        1.00
GPU OpenCL on FPGA     0.14         0.02        0.89
FPGA OpenCL on FPGA    1.71         1.62        31.85
1998: Manual Partitioning, C & Assembler, ARM + DSP
2013: Manual Partitioning, C++ & OpenCL/RenderScript, ARM + GPU
10. How Do People Program?
~20M Programmers
Web
Mobile
Embedded
~200k
Desktop
Simple, old-school ray tracer
Start with C++ code and accelerate the code with Heterogeneous Systems
// Sequential version: trace one ray per pixel
void traceScreen()
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            Ray ray = generateRay(x, y);
            IntersectableObject *obj = traceRay(ray);
            framebuffer[y][x] = colorPixelForObject(obj);
        }
    }
}

// Parallel version: the same loop body expressed as a 2D parallel-for
void traceScreen()
{
    par_for_2D(height, width, [&](int y, int x) {
        Ray ray = generateRay(x, y);
        IntersectableObject *obj = traceRay(ray);
        framebuffer[y][x] = colorPixelForObject(obj);
    });
}
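The slides call par_for_2D but never define it. The following is a minimal sketch of what such a helper could look like, assuming it simply splits the rows into contiguous bands and hands each band to a std::thread worker. The band-splitting strategy and the template signature are assumptions made for illustration, not the speaker's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper (not defined in the slides): calls body(y, x) once for
// every pixel, with rows divided into contiguous bands, one band per thread.
template <typename Body>
void par_for_2D(int height, int width, Body body)
{
    assert(height >= 0 && width >= 0);
    if (height == 0 || width == 0) return;

    unsigned hw = std::thread::hardware_concurrency();  // may report 0
    int workers = std::min<int>(height, hw == 0 ? 1 : static_cast<int>(hw));
    int rowsPerWorker = (height + workers - 1) / workers;  // ceiling division

    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        int yBegin = w * rowsPerWorker;
        int yEnd = std::min(height, yBegin + rowsPerWorker);
        if (yBegin >= yEnd) break;  // no rows left for this worker
        pool.emplace_back([=, &body] {
            for (int y = yBegin; y < yEnd; ++y)
                for (int x = 0; x < width; ++x)
                    body(y, x);
        });
    }
    for (auto &t : pool) t.join();  // wait until every band is finished
}
```

This is only safe when each call to body writes to its own pixel and does not modify shared state, which holds for the ray tracer loop above as long as traceRay only reads the scene.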
11. Moving the Code onto OpenCL 1.x
Need to make the following changes
a) Get rid of all the pointers, both in scene vector and internally in CSGObject
b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
c) Get rid of the virtual function calls
d) Change the classes to structs
e) Get rid of recursion in CSGObject
f) Avoid accessing the global scene variable in accelerated code
g) Port the code base to OpenCL C
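To make the list concrete, here is a hedged sketch of what changes a) to f) might look like for a CSG tree. The slides do not show the fields of CSGObject, so FlatNode, the CSG_* tags, the sphere primitive and containsPoint are all hypothetical. The sketch is plain C++ so it can run on a host; the final port to OpenCL C in step g) is not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical node kinds; a tag replaces the virtual call (change c).
enum NodeKind : std::int32_t { CSG_SPHERE, CSG_UNION, CSG_INTERSECTION, CSG_DIFFERENCE };

// Plain struct instead of a class hierarchy (change d).
struct FlatNode {
    NodeKind kind;
    std::int32_t left;         // child indices replace pointers (change a)
    std::int32_t right;        // -1 means "no child"
    float cx, cy, cz, radius;  // only meaningful for CSG_SPHERE
};

// Tests whether a point lies inside the solid rooted at `root`.
// The scene is passed in as a flat array rather than read from a global
// (change f) or from a std::vector (change b), and the tree is walked with
// explicit fixed-size stacks instead of recursion (change e).
bool containsPoint(const FlatNode *nodes, std::int32_t root,
                   float px, float py, float pz)
{
    const int kMaxStack = 64;  // assumption: trees in this sketch stay shallow
    struct Frame { std::int32_t node; bool expanded; };
    Frame work[kMaxStack];
    bool vals[kMaxStack];
    int wTop = 0, vTop = 0;

    work[wTop++] = Frame{root, false};
    while (wTop > 0) {
        Frame f = work[--wTop];
        const FlatNode n = nodes[f.node];
        if (n.kind == CSG_SPHERE) {
            float dx = px - n.cx, dy = py - n.cy, dz = pz - n.cz;
            assert(vTop < kMaxStack);
            vals[vTop++] = dx * dx + dy * dy + dz * dz <= n.radius * n.radius;
        } else if (!f.expanded) {
            // First visit: revisit this node after both children are done.
            assert(wTop + 3 <= kMaxStack);
            work[wTop++] = Frame{f.node, true};
            work[wTop++] = Frame{n.right, false};
            work[wTop++] = Frame{n.left, false};  // left is evaluated first
        } else {
            // Second visit: combine the two child results.
            bool r = vals[--vTop];
            bool l = vals[--vTop];
            bool v = (n.kind == CSG_UNION)          ? (l || r)
                     : (n.kind == CSG_INTERSECTION) ? (l && r)
                                                    : (l && !r);
            vals[vTop++] = v;
        }
    }
    return vals[0];
}
```

On the host the flat array can still live in a std::vector<FlatNode>; only the pointer from data() and the root index would be handed to the accelerated code.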
12. Moving the Code onto OpenCL 2
Need to make the following changes
a) Get rid of all the pointers, both in scene vector and internally in CSGObject
b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
c) Get rid of the virtual function calls
d) Change the classes to structs
e) Get rid of recursion in CSGObject
f) Avoid accessing the global scene variable in accelerated code
g) Port the code base to OpenCL C
OpenCL 2 solves point a) with shared address space, but not the rest
13. Moving the Code onto C++ AMP
Need to make the following changes
a) Get rid of all the pointers, both in scene vector and internally in CSGObject
b) Rewrite the use of std::vector, as C++ AMP cannot call into C++ standard library
c) Get rid of the virtual function calls
d) Change the classes to structs
e) Get rid of recursion in CSGObject
f) Avoid accessing the global scene variable in accelerated code
g) Port the code base to OpenCL C
C++ AMP solves points d), f) and g), but not the rest
14. Moving the Code onto HSA
Need to make the following changes
a) Get rid of all the pointers, both in scene vector and internally in CSGObject
b) Rewrite the use of std::vector, as HSAIL does not understand C++ data type internals
c) Get rid of the virtual function calls
d) Change the classes to structs
e) Get rid of recursion in CSGObject
f) Avoid accessing the global scene variable in accelerated code
g) Port the code base to a language on top of HSAIL
HSA solves points a), c), d), e) and soon f)
15. What Makes GPUs Good For Power Efficient Compute?
Relaxed single-threaded performance
No dynamic scheduling
No branch prediction
No register renaming, no result forwarding
Longer pipelines
Lower clock frequencies
Multi-threading
Tolerate long latencies to memory
Increasing the ALU/control ratio
Short-vectors exposed to programmers
SIMT/Warp/VLIW/Wavefront based execution
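The SIMT point can be illustrated with a small host-side emulation: one instruction stream drives several lanes, and a branch is handled by running each side under a lane mask. This is a simplified model under stated assumptions, not a description of any particular GPU. The warp width of 8, the WarpResult type and collatzStepWarp are all hypothetical names chosen for the sketch.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpWidth = 8;  // hypothetical; real widths vary by GPU
using Lanes = std::array<int, kWarpWidth>;

struct WarpResult {
    Lanes out;        // per-lane results
    int issuedSteps;  // instructions issued once for the whole warp
};

// Emulates out[l] = (in[l] is even) ? in[l] / 2 : 3 * in[l] + 1 for one warp.
// Control logic is shared: each step is issued once and applied to all lanes
// whose mask bit is set.
WarpResult collatzStepWarp(const Lanes &in)
{
    WarpResult r{};
    std::array<bool, kWarpWidth> even{};
    bool anyEven = false, anyOdd = false;

    // Step 1: evaluate the branch predicate on every lane.
    for (int l = 0; l < kWarpWidth; ++l) {
        assert(in[l] >= 0);
        even[l] = (in[l] % 2 == 0);
        anyEven = anyEven || even[l];
        anyOdd = anyOdd || !even[l];
    }
    r.issuedSteps = 1;

    // Step 2: "then" side, issued only if at least one lane takes it.
    if (anyEven) {
        for (int l = 0; l < kWarpWidth; ++l)
            if (even[l]) r.out[l] = in[l] / 2;
        ++r.issuedSteps;
    }

    // Step 3: "else" side, run with the complementary mask.
    if (anyOdd) {
        for (int l = 0; l < kWarpWidth; ++l)
            if (!even[l]) r.out[l] = 3 * in[l] + 1;
        ++r.issuedSteps;
    }
    return r;
}
```

In this model a warp whose lanes all agree issues two steps, while a warp with mixed lanes issues three and leaves some lanes idle in each branch. That is the price paid for sharing one set of control logic across many ALUs.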
16. Heterogeneous Compute, Homogeneous Architecture
big.LITTLE
How about a SIMTish ARM?
Familiar programming model, C++ and OpenMP
Fewer seams
Sharing data structures and function pointers/vtables
[Diagram: Integer Pipe, FP Pipe, Load/Store Pipe, Write, SIMT, Queue; labelled RESEARCH, Throughput]
17. Moving the Code onto a Warped ARM
Need to make the following changes
Get rid of all the pointers, both in scene vector and internally in CSGObject
Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
Get rid of the virtual function calls
Change the classes to structs
Get rid of recursion in CSGObject
Avoid accessing the global scene variable in accelerated code
Port the code base to OpenCL C
18. Performance vs Effort
We’ve implemented SGEMM, a matrix-matrix multiplication benchmark, in various ways, to investigate the tradeoff between programmer effort and performance payoff.

SGEMM version                                 Speedup   Effort
ARM in C                                      1x        Low
ARM in C with NEON intrinsics, prefetching    15x       Medium - High
ARM in assembly with NEON, prefetching        26x       High
SIMTish ARM in C                              35x       Low
SIMTish ARM in C, unrolled                    44x       Low - Medium
Mali GPU x 4 way                              136x      High
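The benchmark source is not part of the slides. As a rough idea of what the low-effort starting point could look like, here is a sketch of a plain SGEMM loop nest and a variant with the inner loop unrolled by four. The function names, row-major layout and unroll factor are assumptions made for illustration, and the sketch makes no claim about reproducing the speedups in the table.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = alpha * A * B + beta * C, all matrices row-major.
// A is M x K, B is K x N, C is M x N.
void sgemmNaive(std::size_t M, std::size_t N, std::size_t K, float alpha,
                const float *A, const float *B, float beta, float *C)
{
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}

// Same computation with the k loop manually unrolled by four.
// Products are accumulated in the same order as in sgemmNaive.
void sgemmUnrolled4(std::size_t M, std::size_t N, std::size_t K, float alpha,
                    const float *A, const float *B, float beta, float *C)
{
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            std::size_t k = 0;
            for (; k + 4 <= K; k += 4) {
                acc += A[i * K + k] * B[k * N + j];
                acc += A[i * K + k + 1] * B[(k + 1) * N + j];
                acc += A[i * K + k + 2] * B[(k + 2) * N + j];
                acc += A[i * K + k + 3] * B[(k + 3) * N + j];
            }
            for (; k < K; ++k)  // remainder when K is not a multiple of 4
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

Both functions take raw pointers and sizes rather than containers, which keeps the loop body close to what the earlier porting lists require of accelerated code.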
20. Works for geeks…
No proper orchestration
Battle for the apps platform
Needs home IT support
Or only single manufacturer
IPv4
Sonosnet
IPv6
Imagine that there were 1000 of these connected devices…
24. IOT Medical Devices
First implantable Pacemaker 1958
Can a pacemaker be hacked to kill?
Or is it just a plot line in a US TV series?
RF interface for adjusting settings
First hacked in 2008
“Sustained effort by a team of specialists” – The New York Times
Range a few cm
Today
MIT grad students
One weekend
Range 50 feet
26. It’s a Heterogeneous Future
Reach
The future
Open Data
and Objects
Scale Needs Standards
Sharing Needs Trust
Trust Needs Security
Applications
Mobile internet
Internet / broadband
M2M
SaaS
Fixed Telephony Networks
Smart
Everything
Sensors & Actuators
Networks
Today
Mobile Telephony