This document provides a historical overview of the evolution of FPGA technology and programming approaches over several decades. It discusses early theoretical foundations in the 1930s-40s and the development of integrated circuits, hardware description languages, and high-level synthesis tools from the 1950s onwards. More recently, it describes the rise of heterogeneous computing using GPUs, FPGAs and other accelerators, and the ongoing challenges around programming such systems at a suitable level of abstraction.
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote)
1. FPGAs as Components in Heterogeneous High-Performance Computing Systems: Raising the Abstraction Level
Wim Vanderbauwhede
School of Computing Science, University of Glasgow
3. 80 Years Ago: The Theory
Turing, Alan Mathison. "On computable numbers, with an application to the Entscheidungsproblem." Proceedings of the London Mathematical Society, ser. 2, vol. 42 (1936): 230-265.
4. 1936: Universal machine (Alan Turing)
1936: Lambda calculus (Alonzo Church)
1936: Stored-program concept (Konrad Zuse)
1937: Church-Turing thesis
1945: The von Neumann architecture

Church, Alonzo. "A set of postulates for the foundation of logic." Annals of Mathematics (1932): 346-366.
6. 1957: Fortran, John Backus, IBM
1958: First IC, Jack Kilby, Texas Instruments
1965: Moore's law
1971: First microprocessor, Texas Instruments
1972: C, Dennis Ritchie, Bell Labs
1977: Fortran-77
1977: von Neumann bottleneck, John Backus
8. 1984: Verilog
1984: First reprogrammable logic device, Altera
1985: First FPGA, Xilinx
1987: VHDL standard IEEE 1076-1987
1989: Algotronix CAL1024, the first FPGA to offer random access to its control memory
9. 20 Years Ago: High-Level Synthesis
Page, Ian. "Closing the gap between hardware and software: hardware-software cosynthesis at Oxford." (1996): 2-2.
17. High-Level Synthesis
• For many years, Verilog/VHDL were good enough.
• Then the complexity gap created the need for HLS.
• This reflects the rationale behind VHDL: "a language with a wide range of descriptive capability that was independent of technology or design methodology."
• What is lacking in this requirement is "capability for scalable abstraction".
18. "C to Gates"
• "C-to-Gates" offered that higher abstraction level.
• But it was in a way a return to the days before standardised VHDL/Verilog: the various components making up a system were designed and verified using a wide range of different and incompatible languages and tools.
19. The Choice of C
• C was designed by Ritchie for the specific purpose of writing the UNIX operating system,
• i.e. to create a control system for a RAM-based single-threaded system.
• It is basically a syntactic layer over assembly language.
• Very different semantics from HDLs.
• But it became the lingua franca for engineers, and hence the de facto language for HLS tools.
20. Really C?
• None of them was ever really C, though:
• "C with restrictions and pragmas" (e.g. DIME-C)
• "C with restrictions and a CSP API" (e.g. Impulse-C)
• "C-syntax language with parallel and CSP semantics" (e.g. Handel-C)
• Typically no recursion or function pointers (no stack) and no dynamic allocation (no OS).
21. The Odd Ones Out
• Mitrion-C is a functional dataflow language, very different from C in abstraction level and semantics;
• Bluespec too was a radical departure from "C-to-Gates".
• Both were inspired by functional languages like Haskell.
24. GPUs, Manycores and FPGAs
• Accelerators attached to host systems have become increasingly popular:
• Mainly GPUs,
• But increasingly manycores (MIC, Tilera),
• And FPGAs.
26. Heterogeneous Programming
• State of affairs today:
• The programmer must decide what to offload,
• Write host-accelerator control and data movement code using a dedicated API,
• Write accelerator code using a dedicated language.
• Many approaches (CUDA, OpenCL, MaxJ, C++ AMP).
27. Programming Model
• All solutions assume data parallelism:
• Each kernel is single-threaded and works on a portion of the data.
• The programmer must identify these portions and the amount of parallelism.
• So not ideal for FPGAs.
• Recent OpenCL specifications have kernel pipes, allowing construction of pipelines.
• Also support for a unified memory space.
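The data-parallel model described above can be sketched as follows. This is an illustrative Python sketch, not code from the talk: the function names (`kernel`, `run_data_parallel`) and the trivial squaring kernel are invented for illustration; a real CUDA/OpenCL runtime would launch the kernel instances concurrently on the accelerator rather than in a loop.

```python
# Sketch of the data-parallel model assumed by CUDA/OpenCL-style frameworks:
# the host partitions the data, and each single-threaded kernel instance
# processes one portion independently of the others.

def kernel(portion):
    # Trivial stand-in kernel: square each element of its portion.
    return [x * x for x in portion]

def run_data_parallel(data, num_kernels):
    # The programmer must choose the partitioning and the degree of parallelism.
    chunk = (len(data) + num_kernels - 1) // num_kernels
    portions = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # The instances run sequentially here; a real runtime launches them
    # concurrently on the accelerator.
    results = [kernel(p) for p in portions]
    return [x for r in results for x in r]

print(run_data_parallel([1, 2, 3, 4, 5, 6], 3))  # [1, 4, 9, 16, 25, 36]
```

Note that no instance depends on another's output: this independence is what the model assumes, and it is exactly what pipeline-oriented FPGA designs do not require.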
28. Performance
Speed-up of OpenCL-FPGA vs OpenCL-CPU:
• Nearest neighbour: kernel 5.35x, total 1.60x
• LavaMD: kernel 4.32x, total 4.23x
• Document classification: kernel 1.31x, total 1.04x

Segal, Oren, Nasibeh Nasiri, Martin Margala, and Wim Vanderbauwhede. "High-level programming of FPGAs for HPC and data centric applications." Proc. IEEE HPEC 2014, pp. 1-3.
29. Power Savings
CPU/FPGA power consumption ratio:
• Nearest neighbour: kernel 5.24x, total 1.57x
• LavaMD: kernel 4.23x, total 4.15x
• Document classification: kernel 1.28x, total 1.02x

Segal, Oren, Nasibeh Nasiri, Martin Margala, and Wim Vanderbauwhede. "High-level programming of FPGAs for HPC and data centric applications." Proc. IEEE HPEC 2014, pp. 1-3.
31. Heterogeneous HPC Systems
• A modern HPC cluster node:
• Multicore/manycore host
• Accelerators: GPGPU, MIC and, increasingly, FPGAs
• HPC workloads:
• Very complex codebase
• Legacy code
33. Example: WRF
• Weather Research and Forecasting Model
• Fortran-90, support for MPI and OpenMP
• 1,263,320 lines of code
• So about ten thousand pages of code listings
• Parts of it have been accelerated manually on GPU (a few thousand lines)
• Changing the code for a GPU/FPGA system would be a huge task, and the result would not be portable.
34. FPGAs in HPC
• FPGAs are good at some tasks, e.g.:
• Bit-level, integer and string operations
• Pipeline parallelism rather than data parallelism
• Superior internal memory bandwidth
• Streaming dataflow computations
• But not so good at others:
• Double-precision floating-point computations
• Random memory access computations
35. "On the Capability and Achievable Performance of FPGAs for HPC Applications"
Wim Vanderbauwhede
School of Computing Science, University of Glasgow, UK
http://www.slideshare.net/WimVanderbauwhede
38. One Codebase, Many Components
• For complex HPC applications, FPGAs will never be optimal for the whole codebase.
• But neither will multicores or GPUs.
• So we need to be able to split the codebase automatically over the different components in the heterogeneous system.
• Therefore, we need to raise the abstraction level beyond "heterogeneous programming" and "high-level synthesis".
40. • Device-specific high-level abstraction is no longer good enough.
• OpenCL is relatively high-level and device-independent, but it is still not good enough.
• High-level synthesis languages and heterogeneous programming frameworks should be compilation targets!
• Just like assembly/IR languages and HDLs.
43. • Starting from a complete, unoptimised program
• Compiler-based program transformations
• Correct-by-construction
• Component-based, hierarchical cost model for the full system
• Optimisation problem: find the optimal program variant given the system cost model
44. A Functional-Programming Approach
• For the particular case of scientific HPC codes
• Focus on array computations
• Express the program using higher-order functions
• Type-transformation-based program transformation
45. Functional Programming
• There are only functions.
• Functions can operate on functions.
• Functions can return functions.
• Syntactic sugar over the λ-calculus.
46. Types in Functional Programming
• Types are just labels to help us reason about the values in a computation.
• More general than types in e.g. C.
• For our purpose, we focus on the types of functions that perform array operations.
• Functions are values, so they need a type.
47. Examples of Types
-- a function f taking a vector of n values of type a
-- and returning a vector of m values of type b
f : Vec a n -> Vec b m
-- a function map taking a function from a to b and a vector of type a,
-- and returning a vector of type b
map : (a -> b) -> Vec a n -> Vec b n
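As a concrete illustration of the `map` type above, here is a Python sketch (not from the talk; the name `vec_map` is invented): a `Vec a n` corresponds to a list of n values, and mapping a function over it changes the element type but never the length n, exactly as the type `(a -> b) -> Vec a n -> Vec b n` promises.

```python
# Sketch: map applies a function elementwise and preserves the vector
# length, matching the type (a -> b) -> Vec a n -> Vec b n.

def vec_map(f, vec):
    # The output vector has exactly as many elements as the input.
    return [f(x) for x in vec]

v = [1.0, 2.0, 3.0]              # a Vec Float 3
w = vec_map(lambda x: x * x, v)  # still a vector of length 3
print(w)                 # [1.0, 4.0, 9.0]
print(len(w) == len(v))  # True
```

It is this length-preservation in the type that later makes the array reshaping transformations provably correct.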
48. Type Transformations
• Transform the type of a function into another type.
• The function transformation can be derived automatically from the type transformation.
• The type transformations are provably correct.
• Thus the transformed program is correct by construction!
49. Array Type Transformations
• For this talk, focus on:
• Vector (array) types
• The FPGA cost model
• Programs must be composed using particular higher-order functions (correctness conditions).
• Transformations essentially reshape the arrays.
50. Higher-Order Functions
• map: perform a computation on all elements of an array independently, e.g. square all values.
• Can be done sequentially, in parallel, or using a pipeline if the computation is pipelined.
• foldl: reduce an array to a value using an accumulator, e.g. sum all values.
• Can be done sequentially or, if the computation is associative, using a binary tree.
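The two evaluation strategies for foldl can be sketched as follows. This is an illustrative Python sketch, not code from the talk (the names `foldl` and `tree_reduce` are mine): for an associative operation, the sequential accumulator fold and the pairwise binary-tree reduction compute the same result, which is what allows an FPGA to implement the fold as a log-depth reduction tree.

```python
# Sketch: sequential left fold vs binary-tree reduction.

def foldl(op, acc, xs):
    # Sequential reduction: one op per element, carried in an accumulator.
    for x in xs:
        acc = op(acc, x)
    return acc

def tree_reduce(op, xs):
    # Pairwise reduction: each pass halves the list, so an associative
    # op over n elements needs only log2(n) levels of parallel ops.
    while len(xs) > 1:
        xs = [op(xs[i], xs[i + 1]) if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
    return xs[0]

data = [1, 2, 3, 4, 5, 6, 7, 8]
add = lambda a, b: a + b
print(foldl(add, 0, data))     # 36
print(tree_reduce(add, data))  # 36 -- same result, because + is associative
```

For a non-associative operation (e.g. floating-point subtraction) the two strategies can differ, which is why associativity is one of the correctness conditions mentioned above.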
51. Example: SOR
• Successive Over-Relaxation (SOR) kernel from a Large-Eddy simulator (weather simulation), in Fortran:
52. Example: SOR Using map
• The Fortran code rewritten as a map of a function over a 1-D vector.
53. Example: Type Transformation
• Transform the 1-D vector into a 2-D vector.
• The program transformation is derived.
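The 1-D-to-2-D transformation can be sketched as follows. This is an illustrative Python sketch, not the TyTra implementation (the helpers `reshape` and `flatten` and the sample function are invented): the type transformation `Vec a (m*n) -> Vec (Vec a n) m` induces a program transformation from `map f` to `map (map f)`, and the two variants compute the same values.

```python
# Sketch: reshaping a 1-D vector into a 2-D vector, with the derived
# program transformation map f  ==>  map (map f).

def reshape(vec, n):
    # Vec a (m*n) -> Vec (Vec a n) m
    return [vec[i:i + n] for i in range(0, len(vec), n)]

def flatten(vec2d):
    # The inverse transformation.
    return [x for row in vec2d for x in row]

f = lambda x: 2 * x + 1
v1 = list(range(6))                                # 1-D program: map f
out1 = [f(x) for x in v1]
v2 = reshape(v1, 3)                                # 2-D program: map (map f),
out2 = flatten([[f(x) for x in row] for row in v2])  # e.g. one row per lane
print(out1 == out2)  # True: the variants are equivalent by construction
```

On an FPGA, different reshapings of the same computation correspond to different mixes of pipeline depth and parallel lanes, which is what gives the compiler a space of correct variants to cost.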
55. Cost Calculation
• Uses an intermediate representation language, the TyTra-IR.
• TyTra-IR uses LLVM syntax but can express sequential, parallel and pipeline semantics.
• Thus a direct mapping to the higher-order functions.
• And the cost of computations and communication can be computed directly from the TyTra-IR program.
• No need for synthesis.
57. Cost Space and Cost Estimation
[Diagram: the cost space spans logic and memory resources, communication bandwidth (local memory, global memory, host), and performance (throughput); designs are bounded by the Resource Wall (computation-bound) and the Bandwidth Wall (communication-bound).]
60. Full-Program Transformation
• Type transformations are not FPGA-specific.
• The compiler can create variants for the full program,
• Then separate out parts of the program based on minimal cost on the given components of the system.
• For parallelisation over a cluster, use Multi-Party Session Types to transform the program into communicating processes.
61. Conclusions
• FPGAs have reached maturity as HPC platforms.
• High-Level Synthesis and Heterogeneous Programming are both very important steps forward, and performance is already impressive.
• But we need to raise the abstraction level even more:
• Full-system compilers for heterogeneous systems.
• FPGAs are merely components in such systems.
• Type Transformations are one possible way.