Mike Muller
CTO
Is there anything new in heterogeneous computing?
Evolution
[Figure: timeline of computing eras on an axis from 1960 to 2020. Labels: Cloud, Server, PC, Mobile Computing, Wearable Intelligence, Embedded, Consumer, Smart Appliances, IOT. Year markers: 77, 82, 89, 93, 97, 07, 10, 13.]
What’s the Innovation?
Wireless

3G

MEMS
CCD
Media
Social Media?
Semiconductor Process?

GPS
Mobility Trends: CMOS
[Figure: NMOS and PMOS mobility in cm2/(V·s), axis ticks 10 to 10,000, plotted against year, 1990 to 2025. Row headings: Switches, Interconnect, Patterning. Other labels: Planar CMOS, Strain, HKMG, FinFET, HNW, VNW, III-V, Ge, NEMS, spintronics, 2D: C, MoS; 14nm, 10nm, 7nm, 5nm, 3.5nm; Al wires, Cu wires, 3DIC, Opto I/O, Opto int, Graphene wire, CNT via; LELE, SADP, LELELE, SAQP, EUV, EUV LELE, EUV + DWEB, EUV + DSA, Seq. 3D.]
Printing: Moore’s Law and Ink Jets
[Figure: two series plotted against year, 1980 to 2020: Drops/Second (axis ticks 1E3 to 1E11) and 1/Size in pL-1 (axis ticks 1E-3 to 1E1). Annotations: 10 nozzles, 10,000 nozzles, 100’s microns, 10’s microns.]
Printing and Imprinting Thin Film Transistors (TFT)
• Can be transparent, bio-degradable and even ingestible
• Unit cost 1000x less than mainstream CMOS
  - CMOS @ $40,000/m2 vs. TFT @ $10/m2
• Printing CAPEX can be less than $1,000
  - 350dpi = 200um @ 20 m/s
  - Can print batteries, antenna
  - Mainly organic at ~20 volts
• Imprint CAPEX: a $2M DVD press is high volume
  - Better controllability, hence higher density and performance
  - 1um today, scaling to the 50nm features used today for BluRay discs
  - Mainly inorganic, NMOS only, at ~2 volts
Mobility Trends: CMOS & Thin Film Transistors
[Figure: mobility in cm2/(V·s), axis ticks 0.00001 to 10000, plotted against year, 1990 to 2025. Series: Conventional NMOS, Conventional PMOS, TFT. CPU annotations: ARM1 (3µ, 6MHz) and CortexM0 (2µ, 20kHz).]
Top Right and Bottom Left
Is There Anything New in Heterogeneous Computing?

                       Vector Add   Reduction   Matrix Mul
GPU OpenCL on GPU            1.00        1.00         1.00
GPU OpenCL on FPGA           0.14        0.02         0.89
FPGA OpenCL on FPGA          1.71        1.62        31.85

1998: Manual Partitioning, C & Assembler, ARM + DSP
2013: Manual Partitioning, C++ & OpenCL/RenderScript, ARM + GPU
How Do People Program?
[Figure labels: ~20M Programmers; Web, Mobile, Embedded, Desktop; ~200k]

• Simple, old-school ray tracer
• Start with C++ code and accelerate the code with Heterogeneous Systems

// Serial version: trace one ray per pixel, row by row.
void traceScreen()
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            Ray ray = generateRay(x, y);
            IntersectableObject *obj = traceRay(ray);
            framebuffer[y][x] = colorPixelForObject(obj);
        }
    }
}

// Parallel version: the same loop body handed to par_for_2D as a lambda.
void traceScreen()
{
    par_for_2D(height, width, [&](int y, int x) {
        Ray ray = generateRay(x, y);
        IntersectableObject *obj = traceRay(ray);
        framebuffer[y][x] = colorPixelForObject(obj);
    });
}
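
The slides do not show how par_for_2D is implemented. As a minimal sketch only, assuming a plain CPU fallback built on std::thread (the row-striding scheme and the template signature are assumptions, not the speaker's code), it could look like this:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical helper: calls body(y, x) once for every (y, x) in
// [0, height) x [0, width), splitting rows across CPU threads.
template <typename Body>
void par_for_2D(int height, int width, Body body)
{
    if (height <= 0 || width <= 0) return;

    // hardware_concurrency() may return 0 when the count is unknown.
    unsigned hw = std::thread::hardware_concurrency();
    int workers = std::max(1, std::min(height, static_cast<int>(hw == 0 ? 1 : hw)));

    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (int w = 0; w < workers; ++w) {
        // Worker w takes rows w, w + workers, w + 2 * workers, ...
        pool.emplace_back([=, &body] {
            for (int y = w; y < height; y += workers)
                for (int x = 0; x < width; ++x)
                    body(y, x);
        });
    }
    // Wait for every row to finish before returning to the caller.
    for (std::thread &t : pool) t.join();
}
```

The body must be safe to call concurrently for different pixels, which holds for the ray tracer as long as each call writes only its own framebuffer[y][x]. Some toolchains need -pthread when linking.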
Moving the Code onto OpenCL 1.x
• Need to make the following changes
  a) Get rid of all the pointers, both in scene vector and internally in CSGObject
  b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
  c) Get rid of the virtual function calls
  d) Change the classes to structs
  e) Get rid of recursion in CSGObject
  f) Avoid accessing the global scene variable in accelerated code
  g) Port the code base to OpenCL C
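
To make points a), c) and d) concrete, here is a rough sketch of what such a flattening might look like. Every name in it (Shape, ShapeKind, contains, firstContaining) is hypothetical and not taken from the original ray tracer. It is written as plain C++ so it can be compiled and checked on a CPU, but it sticks to constructs that carry over to a C-like kernel language: a tagged struct instead of a class hierarchy, a switch instead of a virtual call, and array indices instead of object pointers.

```cpp
// Hypothetical flattened scene for illustration only.
enum ShapeKind { SHAPE_SPHERE = 0, SHAPE_PLANE = 1 };

// A plain struct with a tag replaces a class hierarchy with virtual methods.
struct Shape {
    int kind;          // one of ShapeKind
    float a, b, c, d;  // sphere: centre (a, b, c), radius d
                       // plane: normal (a, b, c), offset d
};

// A switch on the tag replaces the virtual call.
// True when the point is inside the sphere, or on or below the plane.
inline bool contains(const Shape &s, float px, float py, float pz)
{
    switch (s.kind) {
    case SHAPE_SPHERE: {
        float dx = px - s.a, dy = py - s.b, dz = pz - s.c;
        return dx * dx + dy * dy + dz * dz <= s.d * s.d;
    }
    case SHAPE_PLANE:
        return s.a * px + s.b * py + s.c * pz <= s.d;
    default:
        return false;
    }
}

// Objects are identified by index into a flat array rather than by pointer.
// Returns the index of the first shape containing the point, or -1.
inline int firstContaining(const Shape *shapes, int count,
                           float px, float py, float pz)
{
    for (int i = 0; i < count; ++i)
        if (contains(shapes[i], px, py, pz)) return i;
    return -1;
}
```

The cost is visible even at this size: adding a shape type means editing the enum, the struct layout comment and every switch, which is the kind of effort the list above is pointing at.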
Moving the Code onto OpenCL 2
• Need to make the following changes
  a) Get rid of all the pointers, both in scene vector and internally in CSGObject
  b) Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
  c) Get rid of the virtual function calls
  d) Change the classes to structs
  e) Get rid of recursion in CSGObject
  f) Avoid accessing the global scene variable in accelerated code
  g) Port the code base to OpenCL C

• OpenCL 2 solves point a) with shared address space, but not the rest
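
Point e) is one of the items that remains. The slides do not show CSGObject, so the following is only a sketch under an assumption: that it is a constructive-solid-geometry tree evaluated recursively. One common way to remove such recursion is to store the nodes in post-order and evaluate them with a small fixed-size stack. The names CsgNode, CsgOp, kMaxDepth and evalCsg are hypothetical.

```cpp
// Hypothetical CSG node layout, stored in post-order so that children
// always appear before their parent.
enum CsgOp { CSG_LEAF = 0, CSG_UNION = 1, CSG_INTERSECT = 2 };

struct CsgNode {
    int op;    // one of CsgOp
    int leaf;  // for CSG_LEAF: index into the per-leaf hit results
};

const int kMaxDepth = 32;  // assumed bound on the evaluation stack

// Evaluates the tree without recursion. leafHit[i] says whether the
// point is inside leaf i. Returns false for malformed or too-deep input.
inline bool evalCsg(const CsgNode *nodes, int count, const bool *leafHit)
{
    bool stack[kMaxDepth];
    int top = 0;
    for (int i = 0; i < count; ++i) {
        if (nodes[i].op == CSG_LEAF) {
            if (top == kMaxDepth) return false;  // stack would overflow
            stack[top++] = leafHit[nodes[i].leaf];
        } else {
            if (top < 2) return false;  // operator without two operands
            bool rhs = stack[--top];
            bool lhs = stack[--top];
            stack[top++] = (nodes[i].op == CSG_UNION) ? (lhs || rhs)
                                                      : (lhs && rhs);
        }
    }
    return top == 1 && stack[0];
}
```

For example, (A union B) intersect C is stored as leaf 0, leaf 1, UNION, leaf 2, INTERSECT.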
Moving the Code onto C++ AMP
• Need to make the following changes
  a) Get rid of all the pointers, both in scene vector and internally in CSGObject
  b) Rewrite the use of std::vector, as C++ AMP cannot call into C++ standard library
  c) Get rid of the virtual function calls
  d) Change the classes to structs
  e) Get rid of recursion in CSGObject
  f) Avoid accessing the global scene variable in accelerated code
  g) Port the code base to OpenCL C

• C++ AMP solves points d), f) and g), but not the rest
Moving the Code onto HSA
• Need to make the following changes
  a) Get rid of all the pointers, both in scene vector and internally in CSGObject
  b) Rewrite the use of std::vector, as HSAIL does not understand C++ data type internals
  c) Get rid of the virtual function calls
  d) Change the classes to structs
  e) Get rid of recursion in CSGObject
  f) Avoid accessing the global scene variable in accelerated code
  g) Port the code base to a language on top of HSAIL

• HSA solves points a), c), d), e) and soon f)
What Makes GPUs Good For Power Efficient Compute?
• Relaxed single-threaded performance
  - No dynamic scheduling
  - No branch prediction
  - No register renaming, no result forwarding
  - Longer pipelines
  - Lower clock frequencies
• Multi-threading
  - Tolerate long latencies to memory
• Increasing the ALU/control ratio
  - Short-vectors exposed to programmers
  - SIMT/Warp/VLIW/Wavefront based execution
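
As a rough illustration of warp-style SIMT execution, the sketch below models one warp on the CPU. It is a deliberately simplified model, not a description of any particular GPU: the four-lane width, the function name runWarp and the notion of a "pass" are assumptions made for the example. All lanes share one instruction stream, so when lanes disagree on a branch, both sides are issued in turn with the non-participating lanes masked off.

```cpp
const int kLanes = 4;  // assumed warp width, for illustration only

// Models one warp computing out = (in < 0) ? -in : in * 2 in lockstep.
// Returns the number of branch sides issued: 1 if all lanes agree,
// 2 if the warp diverges.
inline int runWarp(const int *in, int *out)
{
    bool takeThen[kLanes];
    bool anyThen = false, anyElse = false;
    for (int lane = 0; lane < kLanes; ++lane) {
        takeThen[lane] = in[lane] < 0;
        anyThen = anyThen || takeThen[lane];
        anyElse = anyElse || !takeThen[lane];
    }

    int passes = 0;
    if (anyThen) {  // lanes on the else side are masked off
        ++passes;
        for (int lane = 0; lane < kLanes; ++lane)
            if (takeThen[lane]) out[lane] = -in[lane];
    }
    if (anyElse) {  // lanes on the then side are masked off
        ++passes;
        for (int lane = 0; lane < kLanes; ++lane)
            if (!takeThen[lane]) out[lane] = in[lane] * 2;
    }
    return passes;
}
```

One control decision drives several ALU lanes, which is the higher ALU/control ratio named above; the price is that a divergent warp pays for both sides of the branch.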
Heterogeneous Compute Homogeneous Architecture
• How about a SIMTish ARM?
  - Familiar programming model, C++ and OpenMP
• Fewer seams
  - Sharing data structures and function pointers/vtables
[Diagram, marked RESEARCH. Labels: big, LITTLE, Throughput; Integer Pipe, FP Pipe, Load/Store Pipe, Write, SIMT Queue.]
Moving the Code onto a Warped ARM
• Need to make the following changes
  - Get rid of all the pointers, both in scene vector and internally in CSGObject
  - Rewrite the use of std::vector, as OpenCL C does not understand C++ data type internals
  - Get rid of the virtual function calls
  - Change the classes to structs
  - Get rid of recursion in CSGObject
  - Avoid accessing the global scene variable in accelerated code
  - Port the code base to OpenCL C
Performance vs Effort
• We’ve implemented SGEMM, a matrix-matrix multiplication benchmark, in various ways, to investigate the tradeoff between programmer effort and performance payoff

SGEMM version                                Speedup   Effort
ARM in C                                     1x        Low
ARM in C with NEON intrinsics, prefetching   15x       Medium - High
ARM in assembly with NEON, prefetching       26x       High
SIMTish ARM in C                             35x       Low
SIMTish ARM in C, unrolled                   44x       Low - Medium
Mali GPU x 4 way                             136x      High
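
For orientation, a low-effort "in C" starting point of the kind the first row describes might look like the sketch below. This is not the benchmarked code, which the slides do not show. The function name sgemmNaive and the row-major layout are assumptions, and the alpha and beta scaling factors of the full BLAS SGEMM interface are left out for brevity.

```cpp
// Naive single-precision matrix multiply: C = A * B, all row-major.
// A is m x k, B is k x n, C is m x n.
inline void sgemmNaive(int m, int n, int k,
                       const float *A, const float *B, float *C)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            // Dot product of row i of A with column j of B.
            for (int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
    }
}
```

The NEON, assembly and GPU rows of the table correspond to progressively heavier rewrites of these three loops.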
Scale Needs Standards
Works for geeks…
No proper orchestration
Battle for the apps platform
Needs home IT support
Or only single manufacturer

IPv4
Sonosnet

IPv6

Imagine that there were 1,000 of these connected devices…
Functional Becomes the Internet of things
Functional

Little Data
Mike

My Data

X

Gym

X
Life
Insurance

!
Their Data

Car
Insurance

Rob Curtis Haymakers Cambridge
Picture by Keith Jones
Sharing Needs Trust
IOT Medical Devices
• First implantable pacemaker 1958
  - Can a pacemaker be hacked to kill?
  - Or just a plot line in US TV series
• RF interface for adjusting settings
• First hacked in 2008
  - “Sustained effort by a team of specialists” (The New York Times)
  - Range a few cm
• Today
  - MIT grad students
  - One weekend
  - Range 50 feet
Trust Needs Security
It’s a Heterogeneous Future
[Diagram labels: Reach; Today; The future; Open Data and Objects; Scale Needs Standards; Sharing Needs Trust; Trust Needs Security; Applications; Mobile internet; Internet / broadband; M2M; SaaS; Fixed Telephony Networks; Mobile Telephony; Smart Everything; Sensors & Actuators; Networks.]

Keynote (Mike Muller) - Is There Anything New in Heterogeneous Computing - by Mike Muller, Chief Technology Officer, ARM
