The document discusses the technical evolution of supercomputing and programming models from a perspective of punctuated equilibrium theory. It argues that supercomputing evolution consists of long periods of gradual change interrupted by periods of rapid revolution, such as the shift from vector machines to microprocessor clusters in the 1990s and the recent rise of accelerators. However, revolutions in supercomputing have had varying degrees of success in becoming the dominant paradigm, and key obstacles remain around the long lifespan of scientific codes and rapid changes in hardware.
3. Theory of Punctuated Equilibrium
(Eldredge, Gould, Mayr…)
§ Evolution consists of long periods of equilibrium, with little change, interspersed with short periods of rapid change.
– Mutations are diluted in large populations in equilibrium; this homogenizing effect prevents the accumulation of multiple changes
– Small, isolated populations under heavy natural-selection pressure evolve rapidly, and new species can appear
– Major cataclysms can be a cause of rapid change
§ Punctuated equilibrium is a good model for technology evolution:
– Revolutions are hard in large markets with network effects and technology that evolves gradually
– Changes can be much faster when small, isolated product markets are created, or when the current technology hits a wall (cataclysm)
§ (Not a new idea: e.g., Levinthal 1998, The Slow Pace of Rapid Technological Change: Gradualism and Punctuation in Technological Change)
4. Why it Matters to SPAA (and PODC)
§ Periods of paradigm shift generate a rich set of new problems (new low-hanging fruit?)
– It is a time when good theory can help
§ E.g., Internet, wireless, big data
– Punctuated evolution due to the appearance of new markets
§ Hypothesis: HPC now and, ultimately, much of IT are entering a period of fast evolution. Please prepare.
5. Where Analogy with Biological Evolution Breaks Down
§ Technology evolution can be accelerated by genetic engineering
– Technology developed in one market is exploited in another market
– E.g., the Internet and wireless were enabled by cheap microprocessors, telephony technology, etc.
§ "Genetic engineering" has been essential for HPC in the last 25 years:
– Progress enabled by reuse of technologies from other markets (micros, GPUs…)
7. Evidence of Punctuated Equilibrium in HPC
[Figure: core count of the leading Top500 system over time, plotted on a logarithmic scale from 1 to 10,000,000, with annotations marking the attack of the killer micros, multicore, and accelerators.]
8. 1990: The Attack of the Killer Micros
(Eugene Brooks, 1990)
§ Shift from ECL vector machines to clusters of MOS micros
– Cataclysm: bipolar evolution reached its limits (nitrogen cooling, gallium arsenide…); MOS was on a fast evolution path
– MOS had its niche markets: controllers, workstations, PCs
– Classical example of "good enough, cheaper technology" (Christensen, The Innovator's Dilemma)
9. 2002: Multicore
§ Clock speed stopped increasing; very little return on added CPU complexity; chip density continued to increase
– Technology push, not market pull
– Still has limited success
10. 2010: Accelerators
§ New market (graphics) created an ecological niche
§ Technology transplanted into other markets (signal processing/vision, scientific computing)
– Advantage of a better power/performance ratio (less logic)
§ Technology still changing rapidly: integration with the CPU and an evolving ISA
11. Were the (R)evolutions Successful in HPC?
§ Killer micros: Yes
– Totally replaced vector machines
– All HPC codes enabled for message passing (MPI)
– Took > 10 years and > $1B of government investment (DARPA)
§ Multicore: Incomplete
– Many codes still use one MPI process per core, using shared memory for message passing
– Using two programming models (MPI+OpenMP) is burdensome (see the sketch after this list)
– PGAS is not used, and does not (so far) provide a real advantage over MPI
– Many open issues on scaling multithreading models (OpenMP, TBB, Cilk…) and combining them with message passing
– (See the history of large-scale NUMA, which did not become a viable species)
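To make the two-models burden concrete, here is a minimal hybrid MPI+OpenMP sketch (mine, not from the talk): MPI ranks across nodes, OpenMP threads within a node. Even this toy mixes two runtimes with different semantics; real codes must do so across millions of lines.

```c
/* Minimal hybrid MPI+OpenMP sketch (illustrative only).
 * Build e.g.: mpicc -fopenmp hybrid.c -o hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Ask for an MPI library that tolerates OpenMP threads */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    enum { N = 1 << 20 };
    static double x[N];
    double local = 0.0, global = 0.0;

    /* Node-level parallelism: OpenMP threads share the array x */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        x[i] = (double)(rank + i);
        local += x[i] * x[i];
    }

    /* System-level parallelism: message passing between ranks */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d, threads=%d, global sum=%e\n",
               nranks, omp_get_max_threads(), global);
    MPI_Finalize();
    return 0;
}
```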
12. Were the (R)evolutions Successful? (2)
§ Accelerators: Just beginning
– Few HPC codes have been converted to use GPUs
§ Obstacles:
– Technology still changing fast (integration of the GPU with the CPU, continued changes in the ISA)
– No good non-proprietary programming systems are available, and their long-term viability is uncertain
13. Key Obstacles
§ Scientific codes live much longer than computer systems (two decades or more); they need to be ported across successive HW generations
§ The amount of code to be ported continuously increases (major scientific codes each have > 1 MLOC)
§ Need very efficient, well-tuned codes (HPC platforms are expensive)
§ Need portability across platforms (HPC programmers are expensive)
§ Squaring the circle?
§ The lack of performant, portable programming models has become the major impediment to the evolution of HPC hardware
14. Did Theory Help?
§ Killer micros: Helped by work on scalable algorithms and on interconnects
§ Multicore: Helped by work on communication complexity (efficient use of caches)
– Very little use of work on coordination algorithms or transactional memory
§ Accelerators: Cannot think of relevant work
– Interesting questions: the power of branching and the power of indirection
– Surprising result: the AKS sorting network
§ Too often, theory follows practice rather than preceding it.
16. The End of Moore’s Law is Coming
§ Moore’s
Law:
The
number
of
transistors
per
chip
doubles
every
two
years
§ Stein’s
Law:
If
something
cannot
go
forever,
it
will
stop
§ Ques.on
is
not
whether
but
when
will
Moore’s
Law
stop?
– It
is
difficult
to
make
predic.ons,
especially
about
the
future
(Yogi
Berra)
17. Current Obstacle: Current Leakage
§ Transistors do not shut off completely
"While power consumption is an urgent challenge, its leakage or static component will become a major industry crisis in the long term, threatening the survival of CMOS technology itself, just as bipolar technology was threatened and eventually disposed of decades ago." (International Technology Roadmap for Semiconductors (ITRS), 2011)
§ The ITRS "long term" is the 2017-2024 timeframe.
§ No "good enough" technology is waiting in the wings
18. Longer-Term Obstacle
§ Quantum effects totally change the behavior of transistors as they shrink
– A 7-5 nm feature size is predicted to be the lower limit for CMOS devices
– ITRS predicts 7.5 nm will be reached in 2024
19. The 7nm Wall
(courtesy S. Dosanjh)
20. The Future Is Not What It Was
(courtesy S. Dosanjh)
21. Progress Does Not Stop
§ It becomes more expensive and slows down
– New materials (e.g., III-V, germanium thin channels, nanowires, nanotubes, or graphene)
– New structures (e.g., 3D transistor structures)
– Aggressive cooling
– New packages
§ More invention at the architecture level
§ Seeking value from features other than speed ("More than Moore")
– System on a chip: integration of analog and digital
– MEMS…
§ Beyond Moore? (quantum, biological…): beyond my horizon
23. Supercomputer Evolution
§ ×1,000 performance increase every 11 years
– ×50 faster than Moore's Law
§ Extrapolation predicts exaflop/s (10^18 floating-point operations per second) before 2020 (see the check after this list)
– We are now at 50 Pflop/s
§ Extrapolation may not work if Moore's Law slows down
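A quick check of the extrapolation (my arithmetic, using only the figures on this slide: ×1,000 per 11 years, 50 Pflop/s today):

```latex
\underbrace{1000^{1/11}}_{\text{growth per year}} \approx 1.87, \qquad
\frac{1\ \text{Eflop/s}}{50\ \text{Pflop/s}} = 20, \qquad
\frac{\ln 20}{\ln 1.87} \approx 4.8\ \text{years},
```

so the trend line crosses an exaflop/s around 2018, i.e. before 2020; any slow-down in the underlying growth rate pushes the crossing further out.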
24. Do We Care?
§ It’s
all
about
Big
Data
Now,
simula.ons
are
passé.
§ B***t
§ All
science
is
either
physics
or
stamp
collec;ng.
(Ernest
Rutherford)
– In
Physical
Sciences,
experiments
and
observa.ons
exist
to
validate/refute/mo.vate
theory.
“Data
Mining”
not
driven
by
a
scien.fic
hypothesis
is
“stamp
collec.on”.
§ Simula.on
is
needed
to
go
from
a
mathema.cal
model
to
predic.ons
on
observa.ons.
– If
system
is
complex
(e.g.,
climate)
then
simula.on
is
expensive
– Predic.ons
are
oWen
sta.s.cal
–
complica.ng
both
simula.on
and
data
analysis
25. Observation Meets Data: Cosmology
[Diagram: "Computation Meets Data: The Argonne View" (courtesy Salman Habib). Mapping the sky with survey instruments (LSST weak lensing); statistical error bars on observations will "disappear" soon. A supercomputer simulation campaign, an emulator based on Gaussian-process interpolation in high-dimensional spaces, and Markov chain Monte Carlo together provide a "precision oracle" and "cosmic calibration". HACC = Hardware/Hybrid Accelerated Cosmology Code(s); CCF = Cosmic Calibration Framework; HACC+CCF combines domain science, CS, math, statistics, and machine learning.]
Record-breaking application: 3.6 trillion particles, 14 Pflop/s
26. Exascale Design Point: 202x with a cap of $200M and 20 MW

Systems | 2012 BG/Q Computer | 2020-2024 | Difference Today & 2019
System peak | 20 Pflop/s | 1 Eflop/s | O(100)
Power | 8.6 MW | ~20 MW |
System memory | 1.6 PB (16*96*1024) | 32-64 PB | O(10)
Node performance | 205 GF/s (16*1.6 GHz*8) | 1.2 or 15 TF/s | O(10) - O(100)
Node memory BW | 42.6 GB/s | 2-4 TB/s | O(1000)
Node concurrency | 64 threads | O(1k) or 10k | O(100) - O(1000)
Total node interconnect BW | 20 GB/s | 200-400 GB/s | O(10)
System size (nodes) | 98,304 (96*1024) | O(100,000) or O(1M) | O(100) - O(1000)
Total concurrency | 5.97 M | O(billion) | O(1,000)
MTTI | 4 days | O(<1 day) | -O(10)

Both price and power envelopes may be too aggressive!
27. Identified Issues
§ Scale (a billion threads)
§ Power (tens of MW)
– Communication: > 99% of power is consumed by moving operands across the memory hierarchy and across nodes
– Reduced memory size (communication in time)
§ Resilience: Something fails every hour; the machine is never "whole"
– Trade-off between power and resilience
§ Asynchrony: Equal work ≠ equal time
– Power management
– Error recovery
28. Other Issues
§ Uncertainty about the underlying HW architecture
– Fast evolution of architecture (accelerators, 3D memory and processing near memory, NVRAM)
– Uncertainty about the market that will supply components to HPC
– Possible divergence from commodity markets
§ Increased complexity of software
– Simulations of complex systems + uncertainty quantification + optimization…
– Software management of power and failures
– Scale and tight coupling (the tail of the distribution matters!)
30. Scale
§ HPC algorithms are being designed for a 2-level hierarchy (node, global); can they be designed for a multi-level hierarchy? Can they be "hierarchy-oblivious"?
§ Can we have a programming model that abstracts the specific HW mechanisms at each level (message passing, shared memory) yet can leverage these mechanisms efficiently?
– Global shared object space + caching + explicit communication
– Multilevel programming (compilation with a human in the loop)
31. Communication
§ Communication-efficient algorithms
§ A better understanding of fundamental communication-computation tradeoffs for PDE solvers (getting away from DAG-based lower bounds; tradeoffs between communication and convergence rate); an example of the existing DAG-based bounds follows this list
§ Programming models, libraries, and languages where communication is a first-class citizen (other than MPI)
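For context (my addition, not the speaker's), the kind of DAG-based lower bound the slide wants to move beyond: for n×n matrix multiplication with a fast memory of size M, and its distributed analogue on p processors with M words each, the number of words moved satisfies

```latex
W \;=\; \Omega\!\left(\frac{n^3}{\sqrt{M}}\right)
\quad\text{(Hong--Kung)}, \qquad
W_{\text{per processor}} \;=\; \Omega\!\left(\frac{n^3}{p\sqrt{M}}\right).
```

Bounds of this form fix the computation DAG; they cannot capture algorithms that change the numerics, e.g. communication-avoiding Krylov or multigrid variants that trade extra flops or slower convergence for fewer messages, which is exactly the tradeoff the slide asks to quantify.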
32. Resilient Distributed Systems
§ E.g., a parallel file system with 768 I/O nodes and > 50K disks
– Systems are built to tolerate disk and node failures
– However, most failures in the field are due to "performance bugs": e.g., time-outs due to thrashing
§ How do we build feedback mechanisms that ensure stability? (control theory for large-scale, discrete systems)
§ How do we provide quality of service?
§ What is a quantitative theory of resilience? (e.g., the impact of failure rate on overall performance)
– Focus on systems where failures are not exceptional
33. Resilient Parallel Algorithms – Overcoming Silent Data Corruptions
§ SDCs may be unavoidable in future large systems (due to flips in computation logic)
§ Intuition: an SDC either
– Type 1: grossly violates the computation model (e.g., a jump to the wrong address, a message sent to the wrong node), or
– Type 2: introduces noise in the data (a bit flip in a large array)
§ Many iterative algorithms can tolerate infrequent type 2 errors (see the sketch after this list)
§ Type 1 errors are often catastrophic and easy to detect in software
§ Can we build systems that avoid or correct easy-to-detect (type 1) errors and tolerate hard-to-detect (type 2) errors?
§ What is the general theory of fault-tolerant numerical algorithms?
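A minimal sketch of the type-2 intuition (my illustration, not from the talk): a Jacobi iteration on a diagonally dominant system is a contraction, so a single corrupted entry of the iterate is damped out over later sweeps, at the cost of extra iterations. The matrix and the injected corruption are made up for the example.

```c
/* Illustration: Jacobi iteration on the diagonally dominant tridiagonal
 * system A = tridiag(-1, 4, -1), b = 1. A "silent" corruption of one
 * entry of x is injected mid-run; the iteration still converges. */
#include <stdio.h>
#include <math.h>

#define N 100

int main(void) {
    double x[N] = {0}, xn[N], b[N];
    for (int i = 0; i < N; i++) b[i] = 1.0;

    for (int it = 0; it < 2000; it++) {
        if (it == 20) x[N / 2] += 1.0e6;      /* injected type-2 error */

        double diff = 0.0;
        for (int i = 0; i < N; i++) {
            double left  = (i > 0)     ? x[i - 1] : 0.0;
            double right = (i < N - 1) ? x[i + 1] : 0.0;
            xn[i] = (b[i] + left + right) / 4.0;   /* Jacobi update */
            diff = fmax(diff, fabs(xn[i] - x[i]));
        }
        for (int i = 0; i < N; i++) x[i] = xn[i];

        if (diff < 1e-12) {   /* converges despite the flip, just later */
            printf("converged at iteration %d\n", it);
            break;
        }
    }
    return 0;
}
```

A type-1 error, say the update writing to the wrong array, would instead break the computation model itself and has to be caught rather than absorbed.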
34. Asynchrony
§ What is a measure of asynchrony tolerance?
– Moving away from the qualitative (e.g., wait-free) to the quantitative:
– How much do intermittently slow processes slow down the entire computation, on average? (a toy calculation follows this list)
§ What are the trade-offs between synchrony and computation work?
§ Load balancing driven not by uncertainty about the computation, but by uncertainty about the computer
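A toy model (my addition) of why the quantitative question matters at scale: if a bulk-synchronous step takes each of p processes time 1 + X_i, with the X_i independent exponential delays of mean ε, the step ends only when the slowest process does, so

```latex
\mathbb{E}\Big[\max_{1 \le i \le p} (1 + X_i)\Big]
  \;=\; 1 + \varepsilon H_p \;\approx\; 1 + \varepsilon \ln p ,
```

where H_p is the p-th harmonic number. With p between 10^6 and 10^9 threads, per-step jitter is amplified by a factor of roughly 14 to 21 at every global synchronization.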
36. Portable Performance
§ Can we redefine compilation so that:
– It supports well a human in the loop (manual high-level decisions vs. automated low-level transformations)
– It integrates auto-tuning and profile-guided compilation
– It preserves high-level code semantics
– It preserves high-level code "performance semantics"
[Diagram] In principle: a single high-level code is "compiled" into low-level, platform-specific codes. In practice: Code A, Code B, Code C, kept in sync by manual conversion and "ifdef" spaghetti.
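A caricature (mine) of the "practice" column: one logical kernel, several hand-maintained platform variants behind preprocessor guards. The guarded function names are hypothetical; the point is that each variant must be ported and tuned by hand, which is what a redefined compilation pipeline would aim to eliminate.

```c
/* "ifdef spaghetti": the same logical kernel, three hand-maintained
 * platform variants selected at compile time (variant names hypothetical). */
void axpy(int n, double a, const double *x, double *y) {
#if defined(USE_GPU_VARIANT)
    axpy_gpu(n, a, x, y);          /* hypothetical hand-written accelerator version */
#elif defined(USE_AVX_VARIANT)
    axpy_avx(n, a, x, y);          /* hypothetical hand-vectorized version */
#else
    for (int i = 0; i < n; i++)    /* portable scalar fallback */
        y[i] += a * x[i];
#endif
}
```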
37. Conclusion
§ Moore’s
Law
is
slowing
down;
the
slow-‐down
has
many
fundamental
consequences
–
only
a
few
of
them
explored
in
this
talk
§ HPC
is
the
“canary
in
the
mine”:
– issues
appear
earlier
because
of
size
and
.ght
coupling
§ Op.mis.c
view
of
the
next
decades:
A
frenzy
of
innova.on
to
con.nue
pushing
current
ecosystem,
followed
by
frenzy
of
innova.on
to
use
totally
different
compute
technologies
§ Pessimis.c
view:
The
end
is
coming