SlideShare uma empresa Scribd logo
1 de 38
Building affordable and
programmable exascale capable
computers
SURFsara, 18 April 2019
Georgi Gaydadjiev, Director of Maxeler IoT-Labs BV, Delft
Honorary Visiting Professor at the Department of Computing, Imperial College London
Think Ångströms not nanometers in 2019
Ideally we should steer movements of almost each individual electron to solve our problems
0.1 nm  1 Å
14 nm  140 Å
DNA
C-C
bond
1Å 10Å 102Å 103Å 104Å
glucose
hemoglobin
ribosome
…
cells
100s of Si atoms in 14nm
very few atoms
(e.g., 3nm / 30Å  6 to 12 atoms)
light microscope
resolution
2
Moving data on-chip will use as much energy as computing with it
Moving data off-chip will use 200x more energy!
and is much slower as well
The power challenge
Today* 2020
Double precision Float Op ~20pJ <10pJ
Moving data on-chip: 1mm 6pJ
Moving data on-chip: 20mm 120pJ
Moving data to off-chip
memory
4,000pJ 2,000pJ
3
The data movement challenge
Next generation computer systems should take data movements at all levels very seriously
3
Data movement challenge confirmed
4
Wires that carry the data (and instructions) become more and more important
(Courtesy: NVIDIA and ITRS)
BUT
4
5
“… without dramatic increases in
efficiency, ICT industry could
use 20% of all electricity and
emit up to 5.5% of the world’s
carbon emissions by 2025.”
“We have a tsunami of data
approaching. Everything which
can be is being digitalised. It is
a perfect storm.”
“ … a single $1bn Apple data
centre planned for Athenry in Co
Galway, expects to eventually
use 300MW of electricity, or over
8% of the national capacity and
more than the daily entire usage
of Dublin. It will require 144 large
diesel generators as back up for
when the wind does not blow.”
Why is all of this important?
6
Computing in Time:
Follow a recipe step by step
one at the time
Computing in Space:
Build a “recipe specific” factory with multiple
paths performed simultaneously
One result per clock cycle
Efficient, predictable, reliable “mass production” of huge data amounts
Build Computers for your Problem and Data
1. Describe Conjugate Gradient
as dataflow graph
3. Stream data through the
Custom Accelerator
2. Compile dataflow structure and load to hardware
Create customized mega accelerators with massive inherent throughputs
Programming a Dataflow “mass production” Engine
7
Program in HL Language
Machine Architecture
Implementation
Circuits
Algorithm
Devices
Problem
Solutions
Solutions
Co-optimise the HW and the SW
stack for the performance critical
areas of the application
8
Solving Computing Problems Vertically
8
From Equations to Dataflow Hardware
u
x
s
x
vd
x
F
ah
p
ah
TRuu
a
vu
ah
u
t
u





















1ln

9
Real data flow graph as
generated by MaxCompiler
4,866 nodes;
10,000s of stages/cycles
Full Customization in:
Space, Value and Time
(SVT)
1010
Easy it is not (and not really new)
Slotnick’s law (of effort):
“The parallel approach to computing does require that
some original thinking be done about numerical
analysis and data management in order to secure
efficient use.
In an environment which has represented the
absence of the need to think as the highest virtue this
is a decided disadvantage.”
Daniel Slotnick (1931-1985)
Chief Architect of Illiac IV
11
Programing in Space basics
12
• Control and Data-flows are decoupled
– Both are fully programmable
• Operations exist in space and by default run in parallel
– Their number is limited only by the available space
• All operations can be customized at various levels
– e.g., from algorithm down to the number representation
• Multiple operations constitute kernels
• Data streams through the operations / kernels
• The data transport and processing can be balanced
• All resources work all of the time for max performance
• The In/Out data rates determine the operating frequency
Equally spread the available “forces” and move no faster than required by the application
12
The Computational Model
13
• Dataflow sub-system (DataFlow Engine- DFE)
– Spatial arithmetic chip “hardware” technology with flexible arithmetic units
and programmable interconnect (looks like FPGAs but is not limited to)
– Programmable Static Dataflow
– Systolic Execution at kernel level
– Streaming Custom Computing at system level
– Implicit GALS* IO and kernel-to-kernel communication
• Dedicated software (MaxCompiler, MaxelerOS and SLiC)
– compilation toolchain and design methodology
– Incorporated simulation and debug environment for rapid development
– Linux fully integrated runtime system and low level software support
– Help designer focus on the data/algorithm and the system architecture
• Only three basic memory types (explicitly exposed)
– Scalars (exposed to the CPU)
– Fast Memory (FMEM): small and fast (on-chip)
– Large Memory (LMEM): large and slow (off-chip)
* GALS – Globally Asynchronous Locally Synchronous
13
Maxeler’s DataFlow Engine (DFE, MAX4)
14
MaxRing
Interconnect
Dataflow Engine (DFE)LMEM
(Large Memory)
4-96GB
Reconfigurable
compute fabric
Dataflow cores &
FMEM (Fast Memory)
High bandwidth
memory link
Link to main data network
(e.g., PCIe, Infiniband)
MaxRing links
• 48GB DRAM (LMEM)
• Stratix V D8
• MaxRing interconnect
• 4,000 multipliers
• 700K logic cells
• 6.25MB of FMEM
14
Application Level Components
SLiC
MaxelerOS
Memory
CPU
DFE
Memory
Kernels (MaxJ)
(instantiate the arithmetic
structure)
*+
+
Manager (MaxJ)
(arrange the data
orchestration)
Host application (C, Python, Matlab..)
15
PCI Express
or
Infiniband
15
MaxJ: Moving Average of three numbers
Dataflow computing in hardware using a language you know
16
x
x
+
30
y
DFEVar x = io.input("x", dfeFloat(10,31));
DFEVar result = x * x + 30;
io.output("y", result, dfeFloat(10,31));
17
Simple example: y = x2 + 30
17
MaxJ example: Control in Space
18
x
+
1
y
-
1
>
10
class SimpleKernel extends Kernel {
SimpleKernel() {
DFEVar x = io.input(“x”, dfeInt(24));
DFEVar result = (x>10) ? x+1 : x-1;
io.output(“y”, result, dfeInt(25));
}
}
18
19
SIMULATE AND DEBUG
GENERATE DATAFLOWPROGRAMARCHITECTANALYSE
Used to build real systems, however, very difficult to learn/educate
Non Traditional Design Process
OK?many hours …
Custom
HW
Multiple scales of
computing
Important features for
optimization
complete system level  balance compute, storage
and IO
parallel node level  maximize utilization of
compute and interconnect
microarchitecture level  minimize data movement
arithmetic level  tradeoff range, precision
and accuracy
= discretize in Time, Space
and Value
bit level  encode and add
redundancy
transistor level => manipulate ‘0’ and ‘1’
and more, e.g., trade/hide Communication (Time) for/behind Computation (Space)
20
Optimizations at all levelsFlow/Time
Space
20
21
1. Higher chip / system price compared to microprocessors
2. Lead design times (3 months in the best case)
a. Complex numerical transformations
b. Non-trivial area and data movement optimizations
3. “Painful” Place & Route times (12 hours to 24 hours)
a. Expensive Vendor Specific tools
b. Serious developments ask for dedicated build clusters
4. Need to compete at 200MHz with processors at 3GHz
5. Current HW technology is sub-optimal
• On-chip memory not built for stream processing
• On-chip interconnect overdesigned for Dataflow
6. Long learning curve (Tools and Methods needed)
7. Designer’s productivity should improve (Tools and Methods)
8. …
Some of the challenges
Ongoing effort on improved methodologies and tools
MaxRing
Interconnect
Dataflow Engine (DFE)
LMEM
(Large Memory)
4-96GB
Reconfigurable
compute fabric
Dataflow cores &
FMEM (Fast Memory)
High bandwidth
memory link
Link to main data network
(e.g., PCIe, Infiniband)
MaxRing links
Multiple platforms, single DFE abstraction
+
{
Application and MaxJ
gen4 gen5
Performance Portable Migration
(Intel based) (Xilinx based)
22
• MaxCompiler generates VHDL
ready for FPGA vendor tools
• Synthesis transforms VHDL into
logical “netlist” – sets of basic logic
expressions
• Map fits basic logic into N-input
look-up tables
• Place puts LUTs, DSPs, RAMs etc
at specific locations on chip
• Route sets up wiring between
blocks
23
Substrate Agnostic Compilation
MaxCompiler compilation
Synthesis
Map
Place
Route
Generate Maxfile
VHDL
Complete FPGA
Netlist
LUTs
Placed FPGA
23
DFE Place and Route example
24
Mon 16:27: MaxCompiler version: 2012.2
Mon 16:27: Build “MyKernel" start time: Mon Apr 08 16:27:24 BST 2013
Mon 16:27: Main build process running as user training1 on host Maxworkstation7478
Mon 16:27: Build location: /home/training1/maxcompiler-builds/MyKernel
Mon 16:27: Instantiating manager
Mon 16:27: Instantiating kernel “MyKernel"
Mon 16:27: Compiling manager (CPU I/O Only)
Mon 16:27: Compiling kernel "MyKernel"
Mon 16:27: Generating input files (VHDL, netlists, CoreGen)
Mon 16:27: Running back-end build (12 phases)
Mon 16:27: (1/12) - Prepare MaxFile Data (GenerateMaxFileDataFile)
Mon 16:27: (2/12) - Synthesize DFE Modules (XST)
Mon 16:30: (3/12) - Link DFE Modules (NGCBuild)
Mon 16:30: (4/12) - Prepare for Resource Analysis (EDIF2MxruBuildPass)
Mon 16:30: (5/12) - Generate Preliminary Annotated Source Code
Mon 16:30: (6/12) - Report Resource Usage (ResourceCounter)
Mon 16:30: About to start chip vendor Map/Place/Route toolflow. This will take some time.
Mon 16:30: (7/12) - Prepare for Placement (NGDBuild)
Mon 16:30: (8/12) - Place and Route DFE (MPPR)
Mon 16:30: Executing MPPR with 1 cost tables and 1 threads.
Mon 16:30: MPPR: Starting 1 cost table
Mon 16:43: MPPR: Cost table 1 met timing with score 0 (best score 0)
Mon 16:43: (9/12) - Prepare for Resource Analysis (XDLBuild)
Mon 16:44: (10/12) - Generate Resource Report (ResourceUsageBuildPass)
Mon 16:44: (11/12) - Generate Annotated Source Code (ResourceAnnotationBuildPass)
Mon 16:44: (12/12) - Generate MaxFile (GenerateMaxFile)
Mon 16:45:
Mon 16:45: FINAL RESOURCE USAGE
Mon 16:45: LUTs: 9503 / 149760 (6.35%)
Mon 16:45: FFs: 12749 / 149760 (8.51%)
Mon 16:45: BRAMs: 34 / 516 (6.59%)
Mon 16:45: DSPs: 0 / 1056 (0.00%)
Mon 16:45:
Mon 16:45: MaxFile: /home/training1/maxcompiler-builds/MyKernel/results/MyKernel.max
(MD5Sum: e564cd922aeeda04acfa2f4ecce8236d)
Mon 16:45: Build completed: Mon Apr 08 16:45:58 BST 2013 (took 18 mins, 33 secs)
FPGA vendor specific
back-end tool flow
Abstracted by MaxCompiler
24
• Allows you to see what lines of code are
• using what resources and focus optimization
• Separate reports for each kernel and for the manager
DFE Resource Usage Reporting
LUTs FFs BRAMs DSPs : MyKernel.java
727 871 1.0 2 : resources used by this file
0.24% 0.15% 0.09% 0.10% : % of available
71.41% 61.82% 100.00% 100.00% : % of total used
94.29% 97.21% 100.00% 100.00% : % of user resources
:
: public class MyKernel extends Kernel {
: public MyKernel (KernelParameters parameters) {
: super(parameters);
1 31 0.0 0 : DFEVar p = io.input("p", dfeFloat(8,24));
2 9 0.0 0 : DFEVar q = io.input("q", dfeUInt(8));
: DFEVar offset = io.scalarInput("offset", dfeUInt(8));
8 8 0.0 0 : DFEVar addr = offset + q;
18 40 1.0 0 : DFEVar v = mem.romMapped("table", addr,
: dfeFloat(8,24), 256);
139 145 0.0 2 : p = p * p;
401 541 0.0 0 : p = p + v;
: io.output("r", p, dfeFloat(8,24));
: }
: }
DSP Blocks
Block RAMs
IO Blocks
LUT/FFs
? ?
Different operations
use different
resources
25
• MaxCompiler gives detailed latency and area annotation
back to the programmer
• Evaluate precise effect of code
on latency and chip area
26
Optimization Feedback
12.8ns 6.4ns+ = 19.2ns (total compute latency)
26
27
Small pilot system deployed in Oct 2017
• one 1U MPC-X with 8 MAX5 DFEs
• one 1U AMD EPYC based server
• one 1U login head node
Scaling using Amazon AWS cloud
• MAX5 fully compatible with F1 instances
• Elastic scaling between private and public
MPC-X node
Remote users
MAX5 DFE EPYC CPU
1TB DDR4
Head/Build node
ipmi
56 Gbps 2x Infiniband @ 56 Gbps
10 Gbps
10 Gbps
Supermicro EPYC node
Pilot System Deployed at Jülich
http://www.prace-ri.eu/pcp/
27
28
PRACE-PCP: SpecFEM3D on DFE
28
The BQCD Chip - AERIAL VIEW
Scalable Conjugate Gradient Design for the CG step of BQCD
Problem
(Small/Large)
System (composition) (size) TTS
[sec]
ETS
[kWh]
DTS
(F1)
BQCD 32x32x32x32 PRACE pilot (8 DFEs, 64 EPYC cores) (2U) 1,054 0.44 -
64x64x64x64 1PF equivalent (48 DFEs, 512 EPYC cores) (14U) 1,703.8 4.26 $39.93
NEMO GYRE6 PRACE pilot (8 DFEs, 64 EPYC cores) (2U) 388 0.164 -
GYRE144 1PF equivalent (48 DFEs, 92 EPYC cores) (8U) 1,942 3.77 $42.72
SFM3D 1 chunk x64x64 PRACE pilot (8 DFEs, 64 EPYC cores) (2U) 232 0.096 -
6 chunks x1,440x1,440 1PF equivalent (384 DFEs, 768 EPYC cores) (60U) 5,150 70.1 $1,267.2
QE Al2O3 PRACE pilot (8 DFEs, 64 EPYC cores) (2U) 32 0.013 -
Ta2O5 1PF equivalent (64 DFEs, 64 EPYC cores) (9U) 3,210 7.58 $94.16
Achieved Performance PRACE workloads
30
Global Weather Simulation with DFEs in China
⬥L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X.
Huang, Y. Zhang, and G. Yang, Accelerating
solvers for global atmospheric equations
through mixed-precision data flow engine,
published at FPL 2013
⬥Joint research with Imperial College and
Tsinghua University
⬥Simulating the atmosphere using the
shallow water equation
An order of magnitude improvement over the Linpack-driven supercomputer technology
Platform Speedup Efficiency
6 Core CPU 1x 1x
Tianhe-1A Node 23x 15x
Maxeler MPC-X 330x 145x
31
32
• A (fancy) name does not help with solving the problem at hand
• Cloud, (Intelligent) Edge, Fog are just names like … Maria
• FPGA is just a technology that can help bridging the gap to something
better (Spatial Computing Acceleration HW, Quantum Processing, …)
• just focus on building the best computer for the given job
• Learn, think, pioneer and stay always critical
• abstraction is powerful but quite often not needed
• use it with great care and remember Dan Slotnick
• We are turning Earth into a heterogeneous, planet-wide computer
• so we should try to not kill it in the process
• There is a lot of interest in this topic 
Conclusions
?
Questions
Contact me at: georgi@maxeler.com
Or find me on Google: “Georgi computer” should do
33
Some links with more information
Maxeler Multiscale Dataflow Computing:
https://www.maxeler.com/technology/dataflow-computing/
Computing in Space explained by Mike Flynn:
http://www.openspl.org/what-is-openspl/
Computing in Space Course at Imperial College:
http://cc.doc.ic.ac.uk/openspl16/
Exciting Applications for DFEs (and JDFEs):
http://appgallery.maxeler.com
Maxeler DFEs on AWS EC2 F1:
https://aws.amazon.com/marketplace/seller-profile?id=2780c6ec-d326-47fc-9ff6-
c66ab2ba202a
Maxeler and Xilinx Alveo collaboration:
https://www.xilinx.com/products/boards-and-kits/alveo.html
34
Maxeler Applications Gallery
Dataflow Engine (DFE) Ecosystem
⬥ With over 150 universities in our university program, we
decided to create an app gallery to enable the community
to share applications, examples, demos, …
⬥ The App Gallery is complemented by a teaching program,
with the first successful course taught at Imperial College in
2014. see
http://cc.doc.ic.ac.uk/openspl14
⬥ Top 10 APPS:
➢ Correlation: in real-time, pairwise, on 6,000 streams
➢ 100% Guaranteed Packet Capture
➢ Webserver, cache and load balancing
➢ HESTON Option pricer
➢ N-body simulation
➢ Regex matching (e.g. for Security)
➢ Brain network simulation
➢ Quantum Chromo-Dynamics kernel
➢ Seismic Imaging
➢ Realtime Classification
Dataflow Apps and Analytics for Machine Learning http://appgallery.maxeler.com/
35
Peer Reviewed Dataflow Publications
2008: Seismic Imaging with Dataflow Engines 25x faster, An Implementation of the Acoustic
Wave Equation, T. Nemeth et al, Chevron, Society of Exploration Geophysicists, Nov 2008.
2010: Credit Derivatives Valuation and Risk, from 8 hours to 2 minutes, American Finance
Technology Award, with JP Morgan.
2011: Modeling and Imaging with Schlumberger, Beyond Traditional Microprocessors for
Geoscience High-Performance Computing Applications, O. Lindtjörn et al, Schlumberger,
IEEE Micro, vol. 31, no. 2, March/April 2011.
2012: Weather Imaging with CRS4, 60x faster, Acceleration of a Meteorological Limited Area
Model with dataflow Engines, Diego Oriato†, Simon Tilbury†, Marino Marrocu§, Gabrielle
Pusceddu§†Maxeler, §CRS4, 2012 Symposium on Application Accelerators in HPC.
2013: Convergence of Risk and Trading in partnership with CME Group and birth of OpenSPL
industry standard (www.openspl.org), In Cloud Computing it’s the Era of Convergence, Open
Markets Magazine, Ari Studnitzer, CME Group.
2014: Brain Simulation with Erasmus, Real-Time Olivary Neuron Simulations on Dataflow
Computing Machines, Georgios Smaragdos, Craig Davies, Christos Strydis, Ioannis Sourdis,
Catalin Ciobanu, Oskar Mencer, and Chris I. De Zeeuw, Supercomputing; Springer, 487-497
2017: High Energy Physics with Imperial, Using MaxCompiler for the high level synthesis
of trigger algorithms, S. Summers, A. Rose and P. Sanders, Journal of Instrumentation,
Volume 12, IOP Publishing.
36
Maxeler University Program
37
Maxeler Trophy Cabinet
Academic History since 2005
• Imperial College Research Excellence Award
• Top EPSRC Advanced Fellowship
• Two Best Paper Awards
• Early Dataflow paper by Maxeler’s Founder has been recognized as one of the most
influential papers at the FPL conference in the last 25 years.
Recent Commercial Awards
• HPCwire Editors Choice Award, November 2011.
• American Finance Technology Awards, New York, winner,
“Most Cutting Edge IT Initiative,” December 2011.
• Golden Arrow, “...for revolutionizing Computers, ”
COM-SULT, January 2012.
• Gartner “Cool Vendor of the Year,” March 2012.
• Frost and Sullivan “Most innovative IT vendor, ” Dec 2013.
• CIO Review, 20 Most Promising Networking Companies, March 2014.
• CIO Review, 20 Most Promising HPC Companies, March 2015.
38

Mais conteúdo relacionado

Mais procurados

HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...AMD Developer Central
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningRenaldas Zioma
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)Ryousei Takano
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015Karel Ha
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionNVIDIA Taiwan
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDASavith Satheesh
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...Ryousei Takano
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Shien-Chun Luo
 
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...AMD Developer Central
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wallugur candan
 
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)inside-BigData.com
 

Mais procurados (20)

HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Mat...
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
Lec04 gpu architecture
Lec04 gpu architectureLec04 gpu architecture
Lec04 gpu architecture
 
Trip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine LearningTrip down the GPU lane with Machine Learning
Trip down the GPU lane with Machine Learning
 
Nbvtalkatjntuvizianagaram
NbvtalkatjntuvizianagaramNbvtalkatjntuvizianagaram
Nbvtalkatjntuvizianagaram
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)USENIX NSDI 2016 (Session: Resource Sharing)
USENIX NSDI 2016 (Session: Resource Sharing)
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
Google warehouse scale computer
Google warehouse scale computerGoogle warehouse scale computer
Google warehouse scale computer
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)Tensor Processing Unit (TPU)
Tensor Processing Unit (TPU)
 
HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015HTCC poster for CERN Openlab opendays 2015
HTCC poster for CERN Openlab opendays 2015
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server Solution
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...AIST Super Green Cloud: lessons learned from the operation and the performanc...
AIST Super Green Cloud: lessons learned from the operation and the performanc...
 
GPU Computing
GPU ComputingGPU Computing
GPU Computing
 
Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)Lightweight DNN Processor Design (based on NVDLA)
Lightweight DNN Processor Design (based on NVDLA)
 
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
 
Gpu and The Brick Wall
Gpu and The Brick WallGpu and The Brick Wall
Gpu and The Brick Wall
 
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
Microsoft Project Olympus AI Accelerator Chassis (HGX-1)
 

Semelhante a Programmable Exascale Supercomputer

Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale FrontierMultiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontierinside-BigData.com
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceLEGATO project
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...EUDAT
 
Cluster Computing
Cluster ComputingCluster Computing
Cluster ComputingNIKHIL NAIR
 
Parallel_and_Cluster_Computing.ppt
Parallel_and_Cluster_Computing.pptParallel_and_Cluster_Computing.ppt
Parallel_and_Cluster_Computing.pptMohmdUmer
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3mustafa sarac
 
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...Facultad de Informática UCM
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersRyousei Takano
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05Rajesh Gupta
 
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)Jose Antonio Coarasa Perez
 
Flink Forward Berlin 2018: George Theodorakis - "Hardware-efficient Stream Pr...
Flink Forward Berlin 2018: George Theodorakis - "Hardware-efficient Stream Pr...Flink Forward Berlin 2018: George Theodorakis - "Hardware-efficient Stream Pr...
Flink Forward Berlin 2018: George Theodorakis - "Hardware-efficient Stream Pr...Flink Forward
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialmadhuinturi
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaFacultad de Informática UCM
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsHPCC Systems
 
Data-Centric Parallel Programming
Data-Centric Parallel ProgrammingData-Centric Parallel Programming
Data-Centric Parallel Programminginside-BigData.com
 

Semelhante a Programmable Exascale Supercomputer (20)

Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale FrontierMultiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
Multiscale Dataflow Computing: Competitive Advantage at the Exascale Frontier
 
DATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe ConferenceDATE 2020: Design, Automation and Test in Europe Conference
DATE 2020: Design, Automation and Test in Europe Conference
 
Distributed Computing
Distributed ComputingDistributed Computing
Distributed Computing
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
 
Anegdotic Maxeler (Romania)
  Anegdotic Maxeler (Romania)  Anegdotic Maxeler (Romania)
Anegdotic Maxeler (Romania)
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
 
Cluster Computing
Cluster ComputingCluster Computing
Cluster Computing
 
Parallel_and_Cluster_Computing.ppt
Parallel_and_Cluster_Computing.pptParallel_and_Cluster_Computing.ppt
Parallel_and_Cluster_Computing.ppt
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3
 
Par com
Par comPar com
Par com
 
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
A Glass Half Full: Using Programmable Hardware Accelerators in Analytical Dat...
 
From Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computersFrom Rack scale computers to Warehouse scale computers
From Rack scale computers to Warehouse scale computers
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05
 
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)
The CMS openstack, opportunistic, overlay, online-cluster Cloud (CMSooooCloud)
 
Flink Forward Berlin 2018: George Theodorakis - "Hardware-efficient Stream Pr...
Flink Forward Berlin 2018: George Theodorakis - "Hardware-efficient Stream Pr...Flink Forward Berlin 2018: George Theodorakis - "Hardware-efficient Stream Pr...
Flink Forward Berlin 2018: George Theodorakis - "Hardware-efficient Stream Pr...
 
Maxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorialMaxwell siuc hpc_description_tutorial
Maxwell siuc hpc_description_tutorial
 
Barcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de RiquezaBarcelona Supercomputing Center, Generador de Riqueza
Barcelona Supercomputing Center, Generador de Riqueza
 
chameleon chip
chameleon chipchameleon chip
chameleon chip
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
Data-Centric Parallel Programming
Data-Centric Parallel ProgrammingData-Centric Parallel Programming
Data-Centric Parallel Programming
 

Último

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Último (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Programmable Exascale Supercomputer

  • 1. Building affordable and programmable exascale capable computers SURFsara, 18 April 2019 Georgi Gaydadjiev, Director of Maxeler IoT-Labs BV, Delft Honorary Visiting Professor at the Department of Computing, Imperial College London
  • 2. Think Ångströms not nanometers in 2019 Ideally we should steer movements of almost each individual electron to solve our problems 0.1 nm  1 Å 14 nm  140 Å DNA C-C bond 1Å 10Å 102Å 103Å 104Å glucose hemoglobin ribosome … cells 100s of Si atoms in 14nm very few atoms (e.g., 3nm / 30Å  6 to 12 atoms) light microscope resolution 2
  • 3. Moving data on-chip will use as much energy as computing with it Moving data off-chip will use 200x more energy! and is much slower as well The power challenge Today* 2020 Double precision Float Op ~20pJ <10pJ Moving data on-chip: 1mm 6pJ Moving data on-chip: 20mm 120pJ Moving data to off-chip memory 4,000pJ 2,000pJ 3 The data movement challenge Next generation computer systems should take data movements at all levels very seriously 3
  • 4. Data movement challenge confirmed 4 Wires that carry the data (and instructions) become more and more important (Courtesy: NVIDIA and ITRS) BUT 4
  • 5. 5 “… without dramatic increases in efficiency, ICT industry could use 20% of all electricity and emit up to 5.5% of the world’s carbon emissions by 2025.” “We have a tsunami of data approaching. Everything which can be is being digitalised. It is a perfect storm.” “ … a single $1bn Apple data centre planned for Athenry in Co Galway, expects to eventually use 300MW of electricity, or over 8% of the national capacity and more than the daily entire usage of Dublin. It will require 144 large diesel generators as back up for when the wind does not blow.” Why is all of this important?
  • 6. 6 Computing in Time: Follow a recipe step by step one at the time Computing in Space: Build a “recipe specific” factory with multiple paths performed simultaneously One result per clock cycle Efficient, predictable, reliable “mass production” of huge data amounts Build Computers for your Problem and Data
  • 7. 1. Describe Conjugate Gradient as dataflow graph 3. Stream data through the Custom Accelerator 2. Compile dataflow structure and load to hardware Create customized mega accelerators with massive inherent throughputs Programming a Dataflow “mass production” Engine 7
  • 8. Program in HL Language Machine Architecture Implementation Circuits Algorithm Devices Problem Solutions Solutions Co-optimise the HW and the SW stack for the performance critical areas of the application 8 Solving Computing Problems Vertically 8
  • 9. From Equations to Dataflow Hardware u x s x vd x F ah p ah TRuu a vu ah u t u                      1ln  9
  • 10. Real data flow graph as generated by MaxCompiler 4,866 nodes; 10,000s of stages/cycles Full Customization in: Space, Value and Time (SVT) 1010
  • 11. Easy it is not (and not really new) Slotnick’s law (of effort): “The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.” Daniel Slotnick (1931-1985) Chief Architect of Illiac IV 11
  • 12. Programing in Space basics 12 • Control and Data-flows are decoupled – Both are fully programmable • Operations exist in space and by default run in parallel – Their number is limited only by the available space • All operations can be customized at various levels – e.g., from algorithm down to the number representation • Multiple operations constitute kernels • Data streams through the operations / kernels • The data transport and processing can be balanced • All resources work all of the time for max performance • The In/Out data rates determine the operating frequency Equally spread the available “forces” and move no faster than required by the application 12
  • 13. The Computational Model 13 • Dataflow sub-system (DataFlow Engine- DFE) – Spatial arithmetic chip “hardware” technology with flexible arithmetic units and programmable interconnect (looks like FPGAs but is not limited to) – Programmable Static Dataflow – Systolic Execution at kernel level – Streaming Custom Computing at system level – Implicit GALS* IO and kernel-to-kernel communication • Dedicated software (MaxCompiler, MaxelerOS and SLiC) – compilation toolchain and design methodology – Incorporated simulation and debug environment for rapid development – Linux fully integrated runtime system and low level software support – Help designer focus on the data/algorithm and the system architecture • Only three basic memory types (explicitly exposed) – Scalars (exposed to the CPU) – Fast Memory (FMEM): small and fast (on-chip) – Large Memory (LMEM): large and slow (off-chip) * GALS – Globally Asynchronous Locally Synchronous 13
  • 14. Maxeler’s DataFlow Engine (DFE, MAX4) 14 MaxRing Interconnect Dataflow Engine (DFE)LMEM (Large Memory) 4-96GB Reconfigurable compute fabric Dataflow cores & FMEM (Fast Memory) High bandwidth memory link Link to main data network (e.g., PCIe, Infiniband) MaxRing links • 48GB DRAM (LMEM) • Stratix V D8 • MaxRing interconnect • 4,000 multipliers • 700K logic cells • 6.25MB of FMEM 14
  • 15. Application Level Components SLiC MaxelerOS Memory CPU DFE Memory Kernels (MaxJ) (instantiate the arithmetic structure) *+ + Manager (MaxJ) (arrange the data orchestration) Host application (C, Python, Matlab..) 15 PCI Express or Infiniband 15
  • 16. MaxJ: Moving Average of three numbers Dataflow computing in hardware using a language you know 16
  • 17. x x + 30 y DFEVar x = io.input("x", dfeFloat(10,31)); DFEVar result = x * x + 30; io.output("y", result, dfeFloat(10,31)); 17 Simple example: y = x2 + 30 17
  • 18. MaxJ example: Control in Space 18 x + 1 y - 1 > 10 class SimpleKernel extends Kernel { SimpleKernel() { DFEVar x = io.input(“x”, dfeInt(24)); DFEVar result = (x>10) ? x+1 : x-1; io.output(“y”, result, dfeInt(25)); } } 18
  • 19. 19 SIMULATE AND DEBUG GENERATE DATAFLOWPROGRAMARCHITECTANALYSE Used to build real systems, however, very difficult to learn/educate Non Traditional Design Process OK?many hours … Custom HW
  • 20. Multiple scales of computing Important features for optimization complete system level  balance compute, storage and IO parallel node level  maximize utilization of compute and interconnect microarchitecture level  minimize data movement arithmetic level  tradeoff range, precision and accuracy = discretize in Time, Space and Value bit level  encode and add redundancy transistor level => manipulate ‘0’ and ‘1’ and more, e.g., trade/hide Communication (Time) for/behind Computation (Space) 20 Optimizations at all levelsFlow/Time Space 20
  • 21. 21 1. Higher chip / system price compared to microprocessors 2. Lead design times (3 months in the best case) a. Complex numerical transformations b. Non-trivial area and data movement optimizations 3. “Painful” Place & Route times (12 hours to 24 hours) a. Expensive Vendor Specific tools b. Serious developments ask for dedicated build clusters 4. Need to compete at 200MHz with processors at 3GHz 5. Current HW technology is sub-optimal • On-chip memory not built for stream processing • On-chip interconnect overdesigned for Dataflow 6. Long learning curve (Tools and Methods needed) 7. Designer’s productivity should improve (Tools and Methods) 8. … Some of the challenges Ongoing effort on improved methodologies and tools
  • 22. MaxRing Interconnect Dataflow Engine (DFE) LMEM (Large Memory) 4-96GB Reconfigurable compute fabric Dataflow cores & FMEM (Fast Memory) High bandwidth memory link Link to main data network (e.g., PCIe, Infiniband) MaxRing links Multiple platforms, single DFE abstraction + { Application and MaxJ gen4 gen5 Performance Portable Migration (Intel based) (Xilinx based) 22
  • 23. • MaxCompiler generates VHDL ready for FPGA vendor tools • Synthesis transforms VHDL into logical “netlist” – sets of basic logic expressions • Map fits basic logic into N-input look-up tables • Place puts LUTs, DSPs, RAMs etc at specific locations on chip • Route sets up wiring between blocks 23 Substrate Agnostic Compilation MaxCompiler compilation Synthesis Map Place Route Generate Maxfile VHDL Complete FPGA Netlist LUTs Placed FPGA 23
  • 24. DFE Place and Route example 24 Mon 16:27: MaxCompiler version: 2012.2 Mon 16:27: Build “MyKernel" start time: Mon Apr 08 16:27:24 BST 2013 Mon 16:27: Main build process running as user training1 on host Maxworkstation7478 Mon 16:27: Build location: /home/training1/maxcompiler-builds/MyKernel Mon 16:27: Instantiating manager Mon 16:27: Instantiating kernel “MyKernel" Mon 16:27: Compiling manager (CPU I/O Only) Mon 16:27: Compiling kernel "MyKernel" Mon 16:27: Generating input files (VHDL, netlists, CoreGen) Mon 16:27: Running back-end build (12 phases) Mon 16:27: (1/12) - Prepare MaxFile Data (GenerateMaxFileDataFile) Mon 16:27: (2/12) - Synthesize DFE Modules (XST) Mon 16:30: (3/12) - Link DFE Modules (NGCBuild) Mon 16:30: (4/12) - Prepare for Resource Analysis (EDIF2MxruBuildPass) Mon 16:30: (5/12) - Generate Preliminary Annotated Source Code Mon 16:30: (6/12) - Report Resource Usage (ResourceCounter) Mon 16:30: About to start chip vendor Map/Place/Route toolflow. This will take some time. Mon 16:30: (7/12) - Prepare for Placement (NGDBuild) Mon 16:30: (8/12) - Place and Route DFE (MPPR) Mon 16:30: Executing MPPR with 1 cost tables and 1 threads. Mon 16:30: MPPR: Starting 1 cost table Mon 16:43: MPPR: Cost table 1 met timing with score 0 (best score 0) Mon 16:43: (9/12) - Prepare for Resource Analysis (XDLBuild) Mon 16:44: (10/12) - Generate Resource Report (ResourceUsageBuildPass) Mon 16:44: (11/12) - Generate Annotated Source Code (ResourceAnnotationBuildPass) Mon 16:44: (12/12) - Generate MaxFile (GenerateMaxFile) Mon 16:45: Mon 16:45: FINAL RESOURCE USAGE Mon 16:45: LUTs: 9503 / 149760 (6.35%) Mon 16:45: FFs: 12749 / 149760 (8.51%) Mon 16:45: BRAMs: 34 / 516 (6.59%) Mon 16:45: DSPs: 0 / 1056 (0.00%) Mon 16:45: Mon 16:45: MaxFile: /home/training1/maxcompiler-builds/MyKernel/results/MyKernel.max (MD5Sum: e564cd922aeeda04acfa2f4ecce8236d) Mon 16:45: Build completed: Mon Apr 08 16:45:58 BST 2013 (took 18 mins, 33 secs) FPGA vendor specific back-end tool flow Abstracted by MaxCompiler 24
  • 25. • Allows you to see what lines of code are • using what resources and focus optimization • Separate reports for each kernel and for the manager DFE Resource Usage Reporting LUTs FFs BRAMs DSPs : MyKernel.java 727 871 1.0 2 : resources used by this file 0.24% 0.15% 0.09% 0.10% : % of available 71.41% 61.82% 100.00% 100.00% : % of total used 94.29% 97.21% 100.00% 100.00% : % of user resources : : public class MyKernel extends Kernel { : public MyKernel (KernelParameters parameters) { : super(parameters); 1 31 0.0 0 : DFEVar p = io.input("p", dfeFloat(8,24)); 2 9 0.0 0 : DFEVar q = io.input("q", dfeUInt(8)); : DFEVar offset = io.scalarInput("offset", dfeUInt(8)); 8 8 0.0 0 : DFEVar addr = offset + q; 18 40 1.0 0 : DFEVar v = mem.romMapped("table", addr, : dfeFloat(8,24), 256); 139 145 0.0 2 : p = p * p; 401 541 0.0 0 : p = p + v; : io.output("r", p, dfeFloat(8,24)); : } : } DSP Blocks Block RAMs IO Blocks LUT/FFs ? ? Different operations use different resources 25
  • 26. • MaxCompiler gives detailed latency and area annotation back to the programmer • Evaluate precise effect of code on latency and chip area 26 Optimization Feedback 12.8ns 6.4ns+ = 19.2ns (total compute latency) 26
  • 27. 27 Small pilot system deployed in Oct 2017 • one 1U MPC-X with 8 MAX5 DFEs • one 1U AMD EPYC based server • one 1U login head node Scaling using Amazon AWS cloud • MAX5 fully compatible with F1 instances • Elastic scaling between private and public MPC-X node Remote users MAX5 DFE EPYC CPU 1TB DDR4 Head/Build node ipmi 56 Gbps 2x Infiniband @ 56 Gbps 10 Gbps 10 Gbps Supermicro EPYC node Pilot System Deployed at Jülich http://www.prace-ri.eu/pcp/ 27
  • 29. The BQCD Chip - AERIAL VIEW Scalable Conjugate Gradient Design for the CG step of BQCD
  • 30. Problem (Small/Large) System (composition) (size) TTS [sec] ETS [kWh] DTS (F1) BQCD 32x32x32x32 PRACE pilot (8 DFEs, 64 EPYC cores) (2U) 1,054 0.44 - 64x64x64x64 1PF equivalent (48 DFEs, 512 EPYC cores) (14U) 1,703.8 4.26 $39.93 NEMO GYRE6 PRACE pilot (8 DFEs, 64 EPYC cores) (2U) 388 0.164 - GYRE144 1PF equivalent (48 DFEs, 92 EPYC cores) (8U) 1,942 3.77 $42.72 SFM3D 1 chunk x64x64 PRACE pilot (8 DFEs, 64 EPYC cores) (2U) 232 0.096 - 6 chunks x1,440x1,440 1PF equivalent (384 DFEs, 768 EPYC cores) (60U) 5,150 70.1 $1,267.2 QE Al2O3 PRACE pilot (8 DFEs, 64 EPYC cores) (2U) 32 0.013 - Ta2O5 1PF equivalent (64 DFEs, 64 EPYC cores) (9U) 3,210 7.58 $94.16 Achieved Performance PRACE workloads 30
  • 31. Global Weather Simulation with DFEs in China ⬥L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, Accelerating solvers for global atmospheric equations through mixed-precision data flow engine, published at FPL 2013 ⬥Joint research with Imperial College and Tsinghua University ⬥Simulating the atmosphere using the shallow water equation An order of magnitude improvement over the Linpack-driven supercomputer technology Platform Speedup Efficiency 6 Core CPU 1x 1x Tianhe-1A Node 23x 15x Maxeler MPC-X 330x 145x 31
  • 32. 32 • A (fancy) name does not help with solving the problem at hand • Cloud, (Intelligent) Edge, Fog are just names like … Maria • FPGA is just a technology that can help bridging the gap to something better (Spatial Computing Acceleration HW, Quantum Processing, …) • just focus on building the best computer for the given job • Learn, think, pioneer and stay always critical • abstraction is powerful but quite often not needed • use it with great care and remember Dan Slotnick • We are turning Earth into a heterogeneous, planet-wide computer • so we should try to not kill it in the process • There is a lot of interest in this topic  Conclusions
  • 33. ? Questions Contact me at: georgi@maxeler.com Or find me on Google: “Georgi computer” should do 33
  • 34. Some links with more information Maxeler Multiscale Dataflow Computing: https://www.maxeler.com/technology/dataflow-computing/ Computing in Space explained by Mike Flynn: http://www.openspl.org/what-is-openspl/ Computing in Space Course at Imperial College: http://cc.doc.ic.ac.uk/openspl16/ Exciting Applications for DFEs (and JDFEs): http://appgallery.maxeler.com Maxeler DFEs on AWS EC2 F1: https://aws.amazon.com/marketplace/seller-profile?id=2780c6ec-d326-47fc-9ff6- c66ab2ba202a Maxeler and Xilinx Alveo collaboration: https://www.xilinx.com/products/boards-and-kits/alveo.html 34
  • 35. Maxeler Applications Gallery Dataflow Engine (DFE) Ecosystem ⬥ With over 150 universities in our university program, we decided to create an app gallery to enable the community to share applications, examples, demos, … ⬥ The App Gallery is complemented by a teaching program, with the first successful course taught at Imperial College in 2014. see http://cc.doc.ic.ac.uk/openspl14 ⬥ Top 10 APPS: ➢ Correlation: in real-time, pairwise, on 6,000 streams ➢ 100% Guaranteed Packet Capture ➢ Webserver, cache and load balancing ➢ HESTON Option pricer ➢ N-body simulation ➢ Regex matching (e.g. for Security) ➢ Brain network simulation ➢ Quantum Chromo-Dynamics kernel ➢ Seismic Imaging ➢ Realtime Classification Dataflow Apps and Analytics for Machine Learning http://appgallery.maxeler.com/ 35
  • 36. Peer Reviewed Dataflow Publications 2008: Seismic Imaging with Dataflow Engines 25x faster, An Implementation of the Acoustic Wave Equation, T. Nemeth et al, Chevron, Society of Exploration Geophysicists, Nov 2008. 2010: Credit Derivatives Valuation and Risk, from 8 hours to 2 minutes, American Finance Technology Award, with JP Morgan. 2011: Modeling and Imaging with Schlumberger, Beyond Traditional Microprocessors for Geoscience High-Performance Computing Applications, O. Lindtjörn et al, Schlumberger, IEEE Micro, vol. 31, no. 2, March/April 2011. 2012: Weather Imaging with CRS4, 60x faster, Acceleration of a Meteorological Limited Area Model with dataflow Engines, Diego Oriato†, Simon Tilbury†, Marino Marrocu§, Gabrielle Pusceddu§†Maxeler, §CRS4, 2012 Symposium on Application Accelerators in HPC. 2013: Convergence of Risk and Trading in partnership with CME Group and birth of OpenSPL industry standard (www.openspl.org), In Cloud Computing it’s the Era of Convergence, Open Markets Magazine, Ari Studnitzer, CME Group. 2014: Brain Simulation with Erasmus, Real-Time Olivary Neuron Simulations on Dataflow Computing Machines, Georgios Smaragdos, Craig Davies, Christos Strydis, Ioannis Sourdis, Catalin Ciobanu, Oskar Mencer, and Chris I. De Zeeuw, Supercomputing; Springer, 487-497 2017: High Energy Physics with Imperial, Using MaxCompiler for the high level synthesis of trigger algorithms, S. Summers, A. Rose and P. Sanders, Journal of Instrumentation, Volume 12, IOP Publishing. 36
  • 38. Maxeler Trophy Cabinet Academic History since 2005 • Imperial College Research Excellence Award • Top EPSRC Advanced Fellowship • Two Best Paper Awards • Early Dataflow paper by Maxeler’s Founder has been recognized as one of the most influential papers at the FPL conference in the last 25 years. Recent Commercial Awards • HPCwire Editors Choice Award, November 2011. • American Finance Technology Awards, New York, winner, “Most Cutting Edge IT Initiative,” December 2011. • Golden Arrow, “...for revolutionizing Computers, ” COM-SULT, January 2012. • Gartner “Cool Vendor of the Year,” March 2012. • Frost and Sullivan “Most innovative IT vendor, ” Dec 2013. • CIO Review, 20 Most Promising Networking Companies, March 2014. • CIO Review, 20 Most Promising HPC Companies, March 2015. 38