What’s Next in computing & the role of cloud FPGAs

Dr. Dionysios Diamantopoulos
Research Staff Member,
Cloud FPGAs & Tape Group,
Cloud & AI Systems Research Department,
IBM Research Europe
Guest lecture at the Harokopio University of Athens, as part of the MSc program
of the Informatics & Telematics Department, invited by Prof. Sotiris Xydis.
21 Jan. 2021

IBM Legal Disclaimer
This content was provided for informational purposes only. The opinions and insights discussed are
those of the presenter and guests and do not necessarily represent those of the IBM Corporation.
Nothing contained in these materials or the products discussed is intended to, nor shall have the effect
of, creating any warranties or representations from IBM or its suppliers, or altering the terms and
conditions of any agreement you have with IBM.
The information presented is not intended to imply that any actions taken by you will result in any
specific result or benefit and should not be relied on in making a purchasing decision. IBM does not
warrant that any systems, products or services are immune from, or will make your enterprise immune
from, the malicious or illegal conduct of any party.
All product plans, directions and intent are subject to change or withdrawal without notice. References
to IBM products, programs or services do not imply that they will be available in all countries in which
IBM operates. IBM, the IBM logo, and other IBM products and services are trademarks of the
International Business Machines Corporation, in the United States, other countries or both. Other
company, product, or services names may be trademarks or service marks of others.
For copyright and trademark information go to: http://www.ibm.com/legal/us/en/copytrade.shtml

Beijing
Tokyo
Shin-Kawasaki
Delhi
Bangalore
Singapore
Nairobi
Haifa
Zurich
Warrington
Dublin
Cambridge
Albany
Yorktown
Almaden
Rio de Janeiro
Sao Paulo
Johannesburg
Melbourne
3000 Researchers
19 Locations
6 Continents
6 Nobel Laureates
10 Medals of Technology
5 National Medals of Science
6 Turing Awards

A legacy of world-class research
For 75 years, IBM Research has been propelling innovation for IBM, from the first
programmable computers to the quantum computers of today. More than anything,
our goal is to catalyze and drive the advancements that shape our world.
With more than 3,000 researchers across the globe, we are anticipating, examining, and
inventing What’s Next in science and technology every single day.
2019 IBM Project Debater
2018 Summit and Sierra: World’s Fastest Supercomputers
2017 Commercial Quantum Computing
2016 World’s first quantum computer on the cloud
2015 Watson Genomic Analytics for Personalized Cancer Treatment
2014 SyNAPSE: Biologically Inspired Neural Architecture
2013 Antimicrobial Polymers
2012 Atomic Imaging (Charge Distribution, Bond Order)
2011 Watson Wins Jeopardy!
2009 Nanoscale Magnetic Resonance Imaging (MRI)
2008 World’s First Petaflop Supercomputer
2007 Web-scale Mining
2005 Cell Broadband Engine
2004 Blue Gene/L
2003 Five-Stage Carbon Nanotube Ring Oscillator
2000 Java Performance
1998 Silicon on Insulator (SOI)
1997 Copper Interconnect Wiring
1997 Deep Blue
1994 Silicon Germanium (SiGe)
1990 Chemically Amplified Photoresists
1987 High-Temperature Superconductivity (Nobel Prize)
1986 Scanning Tunneling Microscope (Nobel Prize)
1980 Reduced Instruction Set Computing (RISC)
1979 Thin Film Recording Heads
1973 Winchester Disk Drive
1971 Speech Recognition
1970 Relational Database
1967 Fractals
1966 One-Device Memory Cell
1957 FORTRAN
1956 Random Access Method of Accounting and Control (RAMAC)

© 2020 IBM Corporation
IBM Research Europe
Dublin, Daresbury, Hursley, Zurich

IBM Research – Zurich
Established in 1956
45+ different nationalities
Open Collaboration:
o Horizon 2020: 50+ funded projects
and 500+ partners
Two Nobel Prizes:
o 1986: Nobel Prize in Physics for
the invention of the scanning
tunneling microscope by Heinrich
Rohrer and Gerd K. Binnig
o 1987: Nobel Prize in Physics for
the discovery of high-temperature
superconductivity by K. Alex
Müller and J. Georg Bednorz
European Physical Society Historic Site
Binnig and Rohrer Nanotechnology
Centre (Public Private Partnership with
ETH Zürich and EMPA)
7 European Research Council Grants
My office

# who am I
v.0.1
1985 2009
Ph.D. @ ECE, NTUA.
“Cross-Layer Rapid Prototyping and
Synthesis of Application-Specific and
Reconfigurable Many-accelerator
Platforms”
2015
Military service
IT Engineer @
Hellenic Army
General Staff
2016
R&D Engineer,
Startup, LN2
2016 2017
Postdoc Researcher,
Heterogeneous Cognitive
Computing Systems Group,
Cloud & Computing
Infrastructure Department,
IBM Research – Zurich,
“Transprecision Computing”
PhD Researcher and R&D engineer in
ESA, EU and national funded projects
Postdoc Researcher,
Cloud FPGAs and Tape
Group, Cloud & AI Systems
Research Department,
IBM Research Europe,
“Transprecision Computing”,
“Near Memory Computing”,
“cloudFPGA”
2019 2021
Research Scientist,
Cloud FPGAs and Tape Group,
Cloud & AI Systems Research
Department,
IBM Research Europe
Not necessarily linear scale
Time is relative (to your frame of reference)
Childhood & school @ Pylos, Greece
Met Prof. Sotirios Xydis
Enjoying our collaboration and
friendship thereafter
D.Eng. @ CEID, Univ. Patras
“Design and Implementation of a dual-processor (RISC) System-on-Chip targeting machine vision algorithms on FPGAs and eASICs”

We’re Inventing What’s Next in:
Hybrid Cloud
AI
Quantum
Science
IBM’s innovation: Topping the US patent list for 28
years running
https://www.ibm.com/blogs/research/2021/01/ibm-patent-leadership-2020/
From the automated teller machine (ATM), speech recognition technology and DRAM to a novel way to search multilingual documents using NLP: 2,300 AI patents!
What’s Next in computing & the role of cloud FPGAs / © D. Diamantopoulos 2021 IBM Research

Maslow's hierarchy of needs:
Basic needs or physiological needs
A bit of motivation
“The basic need is a concept that was derived to explain and cultivate
the foundation for motivation.
This concept is the main physical requirement for human survival. This
means that basic needs are universal human needs. Basic needs, being
primal, are by default, a governor on the attainment of the "higher"
needs.
Efforts to accomplish higher needs may be interrupted temporarily by a
deficit of primal needs, such as a lack of food or air. Basic needs are
considered in internal motivation according to Maslow's hierarchy of
needs.
Maslow's idea is that humans are compelled to fulfill these basic needs
first to pursue intrinsic satisfaction on a higher level. If these needs
are not achieved, it leads to an increase in displeasure within an
individual. In return, when individuals feel this increase in displeasure,
the motivation to decrease these discrepancies increases.”
Food, Water, Health, Breathing, Rest, Warmth
Abraham Harold Maslow was a psychology professor at Alliant International University,
Brandeis University, Brooklyn College, New School for Social Research, and Columbia University.
Quoted text and image source: Wikipedia
Horizontal Needs: Physiological
Vertical Needs

Safety needs
“Once a person's physiological needs are relatively satisfied, their safety
needs take precedence and dominate behavior.
In the absence of physical safety – due to war, natural disaster, family
violence, childhood abuse, etc. and/or in the absence of economic safety
– (due to an economic crisis and lack of work opportunities) these safety
needs manifest themselves in ways such as a preference for job
security, grievance procedures for protecting the individual from
unilateral authority, savings accounts, insurance policies, disability
accommodations, etc.
This level is more likely to predominate in children as they generally
have a greater need to feel safe. It includes shelter, job security, health,
and safe environments. If a person does not feel safe in an environment,
they will seek safety before attempting to meet any higher level of
survival”.
Vertical Needs
Physiological Needs
Horizontal Needs: Safety
House, Security, Care, Financial

Social needs
“After physiological and safety needs are fulfilled, the third level of
human needs is interpersonal and involves feelings of belongingness.
According to Maslow, humans possess an affective need for a sense of
belonging and acceptance among social groups, regardless of whether
these groups are large or small. For example, some large social groups
may include clubs, co-workers, religious groups, professional
organizations, sports teams, gangs, and online communities.
Some examples of small social connections include family members,
intimate partners, mentors, colleagues, and confidants. Humans need to
love and be loved – both sexually and non-sexually – by others.
Many people become susceptible to loneliness, social anxiety, and
clinical depression in the absence of this love or belonging element.
Deficiencies due to hospitalism, neglect, shunning, ostracism, etc. can
adversely affect the individual's ability to form and maintain emotionally
significant relationships in general.”
Vertical Needs
Physiological Needs
Safety Needs
Horizontal Needs: Social
Friendship, Family, Intimacy

Esteem needs
“Esteem needs are ego needs or status needs. People develop a concern
with getting recognition, status, importance, and respect from others.
Most humans need to feel respected; this includes the need to have self-
esteem and self-respect. Esteem presents the typical human desire to
be accepted and valued by others. People often engage in a profession
or hobby to gain recognition. These activities give the person a sense of
contribution or value.
Low self-esteem or an inferiority complex may result from imbalances
during this level in the hierarchy. Psychological imbalances such as
depression can distract the person from obtaining a higher level of self-
esteem.
Most people have a need for stable self-respect and self-esteem.
Maslow noted two versions of esteem needs: a "lower" version and a
"higher" version. This means that esteem and the subsequent levels are
not strictly separated; instead, the levels are closely related.”
Vertical Needs
Physiological Needs
Safety Needs
Social Needs
Horizontal Needs: Esteem
Recognition, Trust, Respect

Self-actualization needs
“This level of need refers to the realization of one's full potential.
Maslow describes this as the desire to accomplish everything that one
can, to become the most that one can be. People may have a strong,
particular desire to become an ideal parent, succeed athletically, or
create paintings, pictures, or inventions.
To understand this level of need, a person must not only succeed in the
previous needs but master them. Self-actualization can be described as
a value-based system when discussing its role in motivation. Self-
actualization is understood as the goal or explicit motive, and the
previous stages in Maslow's Hierarchy fall in line to become the step-by-
step process by which self-actualization is achievable; an explicit motive
is the objective of a reward-based system that is used to intrinsically
drive completion of certain values or goals. Individuals who are
motivated to pursue this goal seek and understand how their needs,
relationships, and sense of self are expressed through their behavior.
Self-actualization can include: Partner Acquisition, Parenting, Utilizing &
Developing Talents & Abilities, Pursuing goals.”
Vertical Needs
Physiological Needs
Safety Needs
Social Needs
Esteem Needs
Horizontal Needs: Self-actualization
Parenting, Goals, Talents

Transcendence needs
“In his later years, Abraham Maslow explored a further dimension of
motivation, while criticizing his original vision of self-actualization.
By these later ideas, one finds the fullest realization in giving oneself to
something beyond oneself—for example, in altruism or spirituality. He
equated this with the desire to reach the infinite.
Transcendence refers to the very highest and most inclusive or holistic
levels of human consciousness, behaving and relating, as ends rather
than means, to oneself, to significant others, to human beings in general,
to other species, to nature, and to the cosmos” Maslow 1971, p. 269
Vertical Needs
Physiological Needs
Safety Needs
Social Needs
Esteem Needs
Self-actualization Needs
Transcendence

Computing hierarchy of needs:
Physical needs
Information Representation, Materials, Power, Thermal
Horizontal Needs: Physical
Vertical Needs
Claude Shannon
The origins of information theory
Image source: Wikipedia
He founded digital circuit design
theory in 1937, when his master's
thesis demonstrated that electrical
applications of Boolean algebra could
construct any logical, numerical
relationship.
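To make the claim concrete, here is a minimal sketch (Python, chosen only for illustration; the function names are hypothetical): a full adder expressed purely with the Boolean operators XOR, AND and OR, chained bit by bit into a ripple-carry adder, much as a gate or relay circuit would implement it.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits using only Boolean operators; return (sum, carry_out)."""
    s = a ^ b ^ carry_in                          # XOR gives the sum bit
    carry_out = (a & b) | (carry_in & (a ^ b))    # AND/OR give the carry
    return s, carry_out


def ripple_carry_add(x: int, y: int, width: int = 8) -> int:
    """Add two unsigned integers by chaining full adders (result mod 2**width)."""
    result, carry = 0, 0
    for i in range(width):
        bit_x = (x >> i) & 1
        bit_y = (y >> i) & 1
        s, carry = full_adder(bit_x, bit_y, carry)
        result |= s << i
    return result


print(ripple_carry_add(23, 19))  # 42
```

Numerical addition emerges here from nothing but logic operations on 0s and 1s, which is the essence of the result summarized above.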
Assumption: information can be separated
from physics. That separation is being
“challenged” by quantum computing today.
Before Shannon, a punched card and a strand of DNA had nothing in
common.
Today we get to see them both as processors
or carriers of information.
12-row/80-column IBM punched card from the mid-twentieth century, Image source: Wikipedia
A section of DNA. The bases lie vertically between
the two spiraling strands, Image source: Wikipedia

Computing hierarchy of needs: Technological needs
Horizontal Needs: Technological
Vertical Needs
Physical Needs
Transistor, Variability, Yield, Aging
H.-S. P. Wong et al., "A Density Metric for Semiconductor Technology [Point of View]," in Proceedings of the IEEE, vol. 108, no. 4, pp. 478-482, April 2020, doi: 10.1109/JPROC.2020.2981715.

Moore’s Law End? Really?
— “medium-K, oxide-minimized, semi-strained, anti-dielectric half-pitch.” ?
Transistor scaling
Intel, IEDM 2019,
Germanium-based
GAAFET PMOS
device layer on top
of a more traditional
silicon FinFET NMOS
System scaling: Beyond the
transistor, e.g. Intel’s EMIB
(Embedded Multi-die
Interconnect Bridge) and
Foveros to connect chiplets
in both 2 and 3 dimensions
(HBM in CPU-GPU)
• 5 chipmakers/foundries in the 16nm/14nm market: GlobalFoundries, Intel,
Samsung, TSMC and UMC, plus SMIC (14nm finFETs).
• GlobalFoundries and UMC halted their respective 7nm process
efforts in 2018.
• Currently, TSMC's 7nm process is at its peak (orders from AMD for its Ryzen
3000-series CPUs and Navi graphics cards), with huge investment in 5nm.
• Compared to 7nm, Samsung’s 5nm finFET technology provides up to a 25%
increase in logic area efficiency with 20% lower power or 10% higher performance.
• TSMC expects mass 3nm production in 2022.
• A nanosheet FET is a type of gate-all-around (GAA) architecture, but that’s not
the only possible scenario. “The industry is very conservative. They will try to
extend the finFET as much as possible,” IMEC’s Naoto Horiguchi said. “At
3nm, we have a window to use a finFET. But we need several process
innovations for finFET in terms of overall improvement.”
• TSMC announced starting 2nm development (Apr. 2020).
https://semiengineering.com/5nm-vs-3nm/
The Future of Computing: Bits + Neurons + Qubits, Dario Gil and William M. J. Green, ISSCC2020: Plenary
https://semiwiki.com/eda/synopsys/294205-what-might-the-1nm-node-look-like/
Nadine Collaert, "1.3 Future Scaling: Where Systems and Technology Meet," 2020 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2020, pp. 25-29, doi: 10.1109/ISSCC19947.2020.9063033.
o “10 micron” in 1972 through “0.35 micron” in 1995, an impressive 23-year run where the node name matched gate length.
o Then, in 1997 with the “0.25 micron/250 nm” node they started over-achieving with an actual Lg of 200 nm – 20% better
than the name would imply.
o This “sandbagging” continued through the next 12 years, with one node (130nm) having Lg of only 70nm – almost a 2x
buffer. Then, in 2011, Intel jumped over to the other side of the ledger, ushering in what we might call the “overstating
decade” with the “22nm” node sporting an Lg of 26 nm. Since then, things have continued to slide further in that direction,
with the current “10nm” node measuring in with an Lg of 18 nm – almost 2x on the other side of the “named” dimension.
o Most industry folks understand that Intel’s “10nm” process is roughly equivalent to TSMC and Samsung’s “7nm” processes.
https://www.eejournal.com/article/no-more-nanometers/ July 23, 2020
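The drift between node names and physical gate length can be quantified in a few lines. This Python sketch uses only the (node name, Lg) pairs quoted above: a ratio Lg/name below 1 means the node over-delivered ("sandbagging"), above 1 that the name overstates the feature size.

```python
# Node name (nm) -> physical gate length Lg (nm), as quoted above.
gate_length_nm = {
    250: 200,  # 1997 "0.25 micron" node: Lg 20% better than the name
    130: 70,   # sandbagging era: almost a 2x buffer
    22: 26,    # 2011: start of the "overstating decade"
    10: 18,    # Intel "10nm": almost 2x the named dimension
}

ratios = {name: lg / name for name, lg in gate_length_nm.items()}

for name, ratio in ratios.items():
    if ratio < 1:
        verdict = "Lg smaller than the name (sandbagging)"
    else:
        verdict = "Lg larger than the name (overstating)"
    print(f'"{name}nm": Lg/name = {ratio:.2f} -> {verdict}')
```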

Moore’s Law End? Really?
— I prefer to respect that it is aging fairly gracefully!
Transistor scaling Cost scaling
• The cost to design a 28nm planar device ranges from $10 million to
$35 million (Gartner).
• The cost to design a 7nm system-on-a-chip (SoC) ranges from $120
million to $420 million (Gartner).
• 5nm is a completely new process with updated EDA tools and IP. The
cost to design a 5nm device ranges from $210 million to $680 million
(Gartner).
$20B per fab at 3nm (IBS)
• The NRE costs of R&D in optical proximity correction, multi-patterning and
extreme ultraviolet (EUV) lithography. https://www.eejournal.com/article/no-more-nanometers/
• After 5nm, the next full node is 3nm. But 3nm is not for the faint of heart.
• The cost to design a 3nm device ranges from $500 million to $1.5 billion,
according to IBS.
• Process development costs range from $4 billion to $5 billion, while a fab
runs $15 billion to $20 billion, according to IBS.
• “Transistor costs at 3nm are expected to be 20% to 25% higher than at
5nm based on same level of maturity,” IBS’ Jones said. “Expect 15% more
performance and with 25% less power consumption compared to 5nm
finFETs.” https://semiengineering.com/5nm-vs-3nm/
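The quoted ranges are wide, but a back-of-the-envelope calculation shows the trend. This Python sketch takes the midpoint of each range as a crude summary (an assumption, not part of the sources) and mixes Gartner (28nm, 7nm, 5nm) and IBS (3nm) figures exactly as the slide does.

```python
# Design-cost ranges quoted above, in millions of USD.
design_cost_musd = {
    "28nm": (10, 35),    # Gartner
    "7nm": (120, 420),   # Gartner
    "5nm": (210, 680),   # Gartner
    "3nm": (500, 1500),  # IBS
}


def midpoint(cost_range):
    """Crude single-number summary of a (low, high) range."""
    low, high = cost_range
    return (low + high) / 2


baseline = midpoint(design_cost_musd["28nm"])
multiples = {node: midpoint(r) / baseline for node, r in design_cost_musd.items()}

for node, multiple in multiples.items():
    cost = midpoint(design_cost_musd[node])
    print(f"{node}: ~${cost:.1f}M, {multiple:.1f}x the 28nm midpoint")
```

By this crude measure, a 3nm design costs roughly 44x more than a 28nm one.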

1. Computer performance was driven by clock speeds.
2. Facing several physical walls, the driver shifted from clock speed to parallelism (Amdahl’s
law: parallelism will soon be limited for non-scientific computations).
3. An unavoidable path leads towards specialized devices (such as ASICs,
DPUs, TPUs, IPUs, ...). Thus, computer performance will probably need to
seek another driving factor.
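Amdahl’s law, cited in point 2, explains why adding cores stops paying off: if a fraction p of the work parallelizes perfectly over n cores, the speedup is S(n) = 1 / ((1 - p) + p/n), which can never exceed 1 / (1 - p). A minimal Python sketch; the 95% parallel fraction is an arbitrary illustrative value.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Amdahl's law: S(n) = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


P = 0.95  # illustrative parallel fraction
print("upper bound:", round(1.0 / (1.0 - P), 2))
for cores in (4, 16, 64, 1024):
    print(cores, "cores ->", round(amdahl_speedup(P, cores), 2))
```

Going from 64 to 1024 cores adds only a few percent of speedup here, because the 5% serial part dominates.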
Architectural needs
Horizontal Needs: Architectural
Vertical Needs
Physical Needs
Technological Needs
(μ)-architecture, Edge-to-cloud, HPC, Storage
Latency, Cost, …, Energy efficiency
Gordon Moore’s law and its derivatives; T: Transistor total, Klauer, Bernd. “The Convey Hybrid-Core Architecture.” (2013).

Software needs
Horizontal Needs: Software
Vertical Needs
Physical Needs
Technological Needs
Architectural Needs
Languages, Libraries, Virtualization

Computing hierarchy of needs: Cloud needs
Horizontal Needs: Cloud
Vertical Needs
Physical Needs
Technological Needs
Architectural Needs
Software Needs
PaaS, IaaS, …, FaaS; 5G/6G; Edge, IoT, V2X
https://amulya-bhatia.medium.com/iaas-vs-caas-vs-paas-vs-faas-vs-saas-whats-the-difference-ee84ecc2d519
Comparison of the 4G (IMT-Advanced) and
5G (IMT-2020) specifications. Source: ETSI

General AI
Revolutionary
True neuro-AI
Cross-domain learning
and reasoning
Broad autonomy with
moral reasoning
Wetware?
Transcendence?
(personal view)
Broad AI
Disruptive and Pervasive
Neuro-symbolic AI
Multi-task, multi-domain, multi-modal
Trusted AI capable of learning with
much less data
Reduced Precision and Analog HW
AI needs
. . .
Physical Needs
. . .
Technological Needs
Architectural Needs
Software Needs
AI
Narrow AI
Emerging
Deep Learning
Single-task, single-domain, with
superhuman accuracy
Requires large amounts of labeled
data
CPU & GPU
We are here now
Cloud Needs
The Future of Computing: Bits + Neurons + Qubits, Dario Gil and William M. J. Green, ISSCC2020: Plenary
. . .
. . .

My view on motivation for human-centric trusted AI
. . .
Physical Needs
. . .
Technological Needs
Architectural Needs
Software Needs
AI
Cloud Needs
. . .
. . .
Physiological Needs
Safety Needs
Social Needs
Esteem Needs
Self-actualization Needs
Transcendence
Augment human2human & human2cosmos consciousness
Computing hierarchy of needs Maslow's hierarchy of needs

https://aif360.mybluemix.net/
https://www.ibm.com/blogs/research/2018/04/ai-adversarial-robustness-toolbox/
http://aix360-dev.mybluemix.net/?_ga=2.230889183.1995265854.1610364654-99329142.1609856291
https://www.research.ibm.com/artificial-intelligence/trusted-ai/

Bits
Mathematics + Information
Today’s Computers and
Supercomputers
Neurons
Biology + Information
Today’s AI Systems
Qubits
Physics + Information
Today’s Quantum Systems
The Future of Computing: Bits + Neurons + Qubits, Dario Gil and William M. J. Green, ISSCC2020: Plenary
How we get there

Attainable performance: G(FL)OPS
Computation to communication ratio: (FL)OP/Byte
CPU Computation Roof CPU
Memory
“In 1945, while consulting for the Moore School
of Electrical Engineering on the EDVAC project,
von Neumann wrote an incomplete set of notes,
titled the First Draft of a Report on the EDVAC.
This widely distributed paper laid foundations of a
computer architecture in which the data and the
program are both stored in the computer’s
memory in the same address space, which will be
described later as von Neumann Architecture
(drawing at right). This architecture became the
de facto standard for a long time and is still used
today (until technology enabled more advanced
architectures).”
https://history-computer.com/john-von-neumann-biography-history-and-inventions/

Memory
Compute-bound
Memory-bound
Optimal operation point
(Bandwidth and CPU are not under-utilized)
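The figure is the roofline model: attainable performance is the lower of the compute roof (peak G(FL)OPS) and the memory roof (bandwidth × arithmetic intensity), and the optimal operation point is the ridge where both meet, at intensity = peak / bandwidth. A minimal Python sketch with hypothetical CPU numbers (1000 GFLOP/s peak, 100 GB/s), not measurements of any real part:

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline model: min(compute roof, memory roof); intensity in FLOP/Byte."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOP/Byte) at which the two roofs intersect."""
    return peak_gflops / bandwidth_gbs


PEAK_GFLOPS, BANDWIDTH_GBS = 1000.0, 100.0  # hypothetical CPU
ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)  # 10.0 FLOP/Byte

for intensity in (1.0, 10.0, 50.0):
    if intensity < ridge:
        regime = "memory-bound"
    elif intensity > ridge:
        regime = "compute-bound"
    else:
        regime = "optimal operation point"
    print(intensity, attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity), regime)
```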

Memory
Compute-bound
Memory-bound
Future CPU Computation Roof
Amdahl’s Law &
Dark Silicon: The
future is not 1000s of
conventional cores
“Amdahl’s Law of specialization”:
is it better to speed up 1% of
apps by 100×
or
all apps by 1%?
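Reading "1% of apps" as 1% of the total runtime (an assumption), Amdahl’s law answers the question directly: the overall speedup is 1 / ((1 - f) + f/s) when a fraction f of the runtime is sped up by a factor s. A short Python check:

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: only `fraction` of the runtime is accelerated by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


niche = overall_speedup(0.01, 100.0)  # 1% of the runtime, 100x faster
broad = overall_speedup(1.00, 1.01)   # all of the runtime, 1% faster
print(f"niche: {niche:.4f}, broad: {broad:.4f}")  # niche: 1.0100, broad: 1.0100
```

Both options land at roughly 1.0100: a 100x accelerator covering 1% of the workload buys about as much as a 1% gain everywhere, so a specialized accelerator only pays off when it covers a large fraction of the workload.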

Memory
Compute-bound
Memory-bound
System specialization
using accelerators:
Architectures designed
with a specific class of
computations in mind
Accelerator
Memory

Memory
Application A
Arithmetic intensity of A (depends only on application’s characteristics)
Application A
(specifications)

Memory
Application A
Arithmetic intensity of A
Computing performance “needs” for Application A
Application A
needs (perf., power)

Memory
Application A
Baseline performance for A
Coding App. A
(C, C++, Java, Python)

Memory
Application A
Compiler
Optimizations
(gcc -O3 …)
A w/ comp. opt.

Memory
Application A
Multi-core
(Pthreads,
OpenMP, …)
A w/ comp. opt.
A w/ multi-core

Memory
Application A
SIMD
(SSE, AVX, …)
A w/ comp. opt.
A w/ multi-core
A w/ SIMD

Memory
Application A
DVFS
(freq. boost, …)
A w/ comp. opt.
A w/ multi-core
A w/ SIMD
A w/ DVFS

Memory
Application A
Manual Code
optimization
(profiling and fun…)
A w/ comp. opt.
A w/ multi-core
A w/ SIMD
A w/ DVFS
A w/ manual code opt.
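The sequence of slides above can be read as a roofline with successively higher compute ceilings. The Python sketch below uses made-up ceiling values and a hypothetical 100 GB/s bandwidth (none of them measurements) to show the consequence: each CPU-side optimization lifts a compute-bound kernel, while a memory-bound kernel stays pinned to the bandwidth roof however much it is tuned.

```python
BANDWIDTH_GBS = 100.0  # hypothetical memory bandwidth

# Hypothetical compute ceilings (GFLOP/s) after each optimization step.
ceilings = [
    ("baseline", 50.0),
    ("w/ comp. opt.", 100.0),
    ("w/ multi-core", 400.0),
    ("w/ SIMD", 1600.0),
    ("w/ DVFS", 2000.0),
    ("w/ manual code opt.", 2500.0),
]


def attainable(ceiling_gflops: float, intensity: float) -> float:
    """Roofline: the lower of the current compute ceiling and the memory roof."""
    return min(ceiling_gflops, BANDWIDTH_GBS * intensity)


for label, ceiling in ceilings:
    memory_bound = attainable(ceiling, 0.5)    # kernel with 0.5 FLOP/Byte
    compute_bound = attainable(ceiling, 50.0)  # kernel with 50 FLOP/Byte
    print(f"{label:20s} AI=0.5: {memory_bound:7.1f}  AI=50: {compute_bound:7.1f}")
```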

Accelerator I
Memory
Buy accelerator I
(GPU, TPU, ASIC…)
A w/ comp. opt.
A w/ multi-core
A w/ SIMD
A w/ DVFS
Memory
Application A
A w/ accelerator I
Assuming CPU-Acc. BW is sufficient!

Accelerator I
Memory
Use vendor libs of
accelerator I
(cuBLAS, cuDNN, …)
A w/ comp. opt.
A w/ multi-core
A w/ SIMD
A w/ DVFS
Memory
Application A
A w/ accelerator I (HW)
A w/ accelerator I (HW+SW)
Assuming CPU-Acc. BW is sufficient

Accelerator I
Memory
?
A w/ comp. opt.
A w/ multi-core
A w/ SIMD
A w/ DVFS
Memory
Application A
Arithmetic intensity of B
Application B
Computing performance “needs” for Application B

Accelerator II
Accelerator II
Memory
Buy Acc. II with
better memory
(HBM2, …)
A w/ comp. opt.
A w/ multi-core
A w/ SIMD
A w/ DVFS
Memory
Application A
Application B

Accelerator II
Accelerator II
Accelerator II
Accelerator II
Accelerator II
Memory
Buy Accelerators
III, IV, V …
A w/ comp. opt.
A w/ multi-core
A w/ SIMD
A w/ DVFS
Memory
Application A
Application B
Application C
Application D
Application E

Memory
Buy FPGA
o custom logic,
o custom memory,
o custom interconnects
A w/ comp. opt.
A w/ multi-core
A w/ SIMD
A w/ DVFS
FPGA
Application A
Application B
Application C
Application D
Application E
Not only custom,
but also
reconfigurable
within seconds!

✓ Reconfigurable
logic
✓ Reconfigurable
memory
✓ Reconfigurable
interconnects
ASICs
▪ A GPU is effective at processing the same set of operations in parallel – single instruction, multiple data (SIMD).
A GPU has a well-defined instruction set and fixed word sizes – for example single-, double- or half-precision
integer and floating-point values.
▪ An FPGA is effective at processing the same or different operations in parallel – multiple instructions, multiple data (MIMD).
An FPGA does not have a predefined instruction set or a fixed data width.
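Because an FPGA datapath can use any operand width, designers often pick a custom fixed-point format per signal. The Python sketch below emulates that choice in software: it rounds a value to a signed fixed-point code of arbitrary total and fractional width, saturating at the format's limits. The split between integer and fractional bits is an arbitrary illustrative choice, and the function names are hypothetical.

```python
def quantize(x: float, total_bits: int, frac_bits: int) -> int:
    """Map x to a signed fixed-point integer code (two's complement range), saturating."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = round(x * (1 << frac_bits))  # Python's round(): round-half-to-even
    return max(lo, min(hi, code))


def dequantize(code: int, frac_bits: int) -> float:
    """Convert a fixed-point code back to a real value."""
    return code / (1 << frac_bits)


x = 0.7071
for total_bits in (16, 8, 5, 3):
    frac_bits = total_bits - 2  # 1 sign bit + 1 integer bit (arbitrary split)
    code = quantize(x, total_bits, frac_bits)
    value = dequantize(code, frac_bits)
    print(total_bits, code, value, abs(x - value))
```

On a GPU the nearest choices would be the fixed 16- or 8-bit types; on an FPGA a 5-bit or 3-bit datapath is equally valid if the application tolerates the error.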
Figures source: AWS - Announcing Amazon EC2 F1 Instances with Custom FPGAs, Bringing Hardware Acceleration closer to the programmer, Ecoscale-ExaNest workshop, 2017
Silicon alternatives for rapid enterprise-ready specialization

ASICs vs
GPUs vs
TPUs vs
DPUs vs
FPGAs vs
Apples vs …
WP492 (v1.0.1) June 13, 2017, Xilinx All Programmable Devices: A Superior Platform for Compute-Intensive Systems
Why FPGAs ?
GPUs
FPGAs
How does the comparable raw
performance of FPGAs
and GPUs bring growth?

TPUs vs
GPUs vs
FPS/TOPS
FPGAs vs
GPUs vs
ASICs vs
After Google announced the scaling capabilities of TPUv4 [1], Nvidia adopted the "per-chip" metric
for A100 [2]. So what is the “right” granularity? Who defines it? (No one!) MLPerf has a diverse set
of benchmarks that unveil various system bottlenecks [3], but it favors FLOPS, a game in which
FPGAs are not at their strongest. There are a zillion companies out there doing inference [4], and
they make a lot of claims, but who is going to have the biggest ROI for improving results?
[1] https://cloud.google.com/blog/products/ai-machine-learning/google-breaks-ai-performance-records-in-mlperf-with-worlds-
fastest-training-supercomputer
[2] https://www.eetimes.com/nvidia-google-both-claim-mlperf-training-crown/
[3] https://ieeexplore.ieee.org/document/9238612
[4] https://basicmi.github.io/AI-Chip/

“If you’re not sure of the optimal algorithms for
say compression or encryption for the data you’re
processing, or the data shape is going to be
changing over time so you don’t want to take the
risk of burning it to the silicon, you can
experiment and be agile on FPGAs”
Azure Chief Technology Officer, Mark Russinovich
Microsoft’s cloud strategy
favors FPGAs
The impact of FPGAs on query latency for Bing; even at double the query load FPGA-accelerated ranking has
lower latency than software-powered ranking at any load.
A. M. Caulfield et al., "A cloud-scale acceleration architecture," 2016 49th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), Taipei, 2016, pp. 1-13, doi: 10.1109/MICRO.2016.7783710.
… from a research vision in 2015 …

(Microsoft’s) Doug Burger's talk at FPL, Sep. 1st 2020
… to industry adoption in 2020!

Live Video Transcoding Launch, Aaron Behman, Director of Video Product Marketing, Data Center Group
https://www.xilinx.com/publications/presentations/video-transcoding-media-deck.pdf
Definition: Capital expenditures
(CapEx) are funds used by a
company to acquire, upgrade, and
maintain physical assets such as
property, plants, buildings,
technology, or equipment. CapEx is
often used to undertake new
projects or investments by a
company. Capital Expenditure (CapEx)
Definition - Investopedia
www.investopedia.com

Hype source: Gartner, 2020, https://www.gartner.com/en/documents/3988006/hype-cycle-for-artificial-intelligence-2020
What’s Next in
FPGAs for
cloud is
agility
How do we envision
increasing FPGA agility?
Many enterprises experience a
steady decline in their ability to
coordinate & operationalize FPGA
projects — particularly at scale.

Our visionary journey of FPGAs for cloud
1. Infra
2. Software
3. Automation
4. Composability

1. Infra

Accelerator I
Memory
Use FPGAs with
high CPU-FPGA
interconnect BW
Memory
Application A
1 A w/ accelerator I (HW+SW)
Assuming CPU-Acc.
2
Performance drops
due to inefficient
CPU-Accelerator BW
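The drop shown here can be modelled by adding a third roof to the roofline: data that must cross the CPU-accelerator interconnect is limited by the link bandwidth times the FLOPs performed per byte transferred. A hedged Python sketch; every number below is hypothetical and chosen only to contrast a slow link with a fast one.

```python
def attainable_with_link(peak_gflops, device_bw_gbs, device_intensity,
                         link_bw_gbs, link_intensity):
    """Roofline with an extra roof for the host-accelerator link.

    device_intensity: FLOPs per byte of accelerator-local memory traffic.
    link_intensity:   FLOPs per byte moved between CPU and accelerator.
    """
    return min(peak_gflops,
               device_bw_gbs * device_intensity,
               link_bw_gbs * link_intensity)


# Hypothetical accelerator: 10 TFLOP/s peak, 900 GB/s local memory.
ACC = dict(peak_gflops=10_000.0, device_bw_gbs=900.0, device_intensity=20.0)

slow = attainable_with_link(**ACC, link_bw_gbs=16.0, link_intensity=100.0)
fast = attainable_with_link(**ACC, link_bw_gbs=64.0, link_intensity=100.0)
print(slow, fast)  # 1600.0 6400.0
```

With the slow link the accelerator reaches only 16% of its peak even though its local memory is not the bottleneck, which is the motivation for high-bandwidth CPU-FPGA interconnects.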

cloudFPGA and OpenCAPI-FPGAs in a nutshell
FPGA as a Co-Processor
POWER9 AC922 + V100 + 9V3/9H7
Up to 8 OpenCAPI FPGAs per 2U chassis.

Weather modeling
Research in OpenCAPI-attached FPGAs
Kaan Kara, Christoph Hagleitner, Dionysios Diamantopoulos, Dimitris Syrivelis,
Gustavo Alonso: High Bandwidth Memory on FPGAs: A Data Analytics
Perspective. FPL 2020
In-Memory Data Analytics
http://www.cosmo-model.org
Image source: payodsoft.com
Genomics
o Gagandeep Singh, Dionysios Diamantopoulos, Christoph Hagleitner,
Sander Stuijk, Henk Corporaal: NARMADA: Near-Memory Horizontal
Diffusion Accelerator for Scalable Stencil Computations. FPL 2019
o Gagandeep Singh, Dionysios Diamantopoulos, Christoph Hagleitner, Juan
Gómez-Luna, Sander Stuijk, Onur Mutlu, Henk Corporaal: NERO: A Near
High-Bandwidth Memory Stencil Accelerator for Weather Prediction
Modeling. FPL 2020
Abbas Haghi, Lluc Alvarez, Jordà Polo, Dionysios
Diamantopoulos, Christoph Hagleitner, Miquel
Moretó: A Hardware/Software Co-Design of K-mer
Counting Using a CAPI-Enabled FPGA. FPL 2020:

Memory
Buy FPGA
o custom logic,
o custom memory,
o custom interconnects
FPGA
Application A
Application B
Application C
Application D
Application L
I have 12
applications, but
in one 2U node I
can only attach up
to 8 FPGAs
Well, multi-tenancy is also
available in FPGAs, but the
scaling problem remains
in the long-term…
Not only custom,
but also
reconfigurable
within seconds!

Use cloudFPGA
o End CPU slavery!
o Deploy FPGAs at large
scale in hyperscale DCs
FPGA
Application A
Application B
Application C
Application D
Application E
~1000 FPGAs / rack
FPGA
Excuse my artistic and
simplifying vision that
ignores:
o the bindings of an
application to SW
libraries,
o the runtime options,
o the scheduler policies,
o the resource-manager
policies,
o the control-plane and
data-plane management,
o security,
o and many more …
THINK BIG!

FPGA as a Co-Processor FPGA as a Peer-Processor
POWER9 AC922 + V100 + 9V3/9H7
64 FPGAs into one 19"×2U chassis (64-port 10GbE = 640 Gb/s BW).
In all, 16 such chassis fit into a 42U rack for a total of 1024 FPGAs
and 16 TB of DRAM.
Up to 8 OpenCAPI FPGAs per 2U chassis.

FPGA as a Co-Processor FPGA as a Peer-Processor
https://github.com/...<STAY_TUNED>
https://github.com/OpenCAPI/oc-accel

cloudFPGA
concept
Highlights
• dense
→ chassis w/ 64 compute units
→ ~1000 FPGAs / rack
• integration of 1st level switch
→ full cross-sectional BW
→ low cost (cables / rack space)
• energy efficient
→ no SW/FW overhead
→ no CPU overhead
→ (hot) water cooling
• self-hosted / network-attached
→ bare-metal support
→ scalable
IP Address: 10.10.1.9
DRAM: 8GB, BRAM: 38Mb
CLBs: 660,000,
DSPs: 2760
IP Address: 10.10.1.50
DRAM: 32GB, Cores: 4
The FPGA becomes the node !
Goal → Deploy FPGAs at large scale in hyperscale DCs
1-10s of thousands per DC

Standalone network-attached FPGA
1. Replace PCIe I/F with integrated NIC (iNIC).
2. Turn FPGA card into a self-contained appliance.
3. Replace transceivers w/ backplane connectivity.

One carrier sled = 32 FPGA modules
1. Our first FPGA module uses a Xilinx Kintex UltraScale KU060
o A mid-range FPGA with high performance/price and low wattage

Two carrier sleds per chassis = 64 FPGAs

Sixteen chassis per rack = 1024 FPGAs
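The packaging hierarchy above (32 modules per sled, 2 sleds per chassis, 16 chassis per rack) multiplies out to the quoted totals. The sketch below recomputes them. The one-10GbE-port-per-FPGA mapping is an assumption read off the "64-port 10GbE = 640 Gb/s" figure, not a documented wiring detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackLayout:
    """Packaging hierarchy of a cloudFPGA-style rack (counts from the slides)."""
    fpgas_per_sled: int = 32     # one carrier sled = 32 FPGA modules
    sleds_per_chassis: int = 2   # two carrier sleds per chassis
    chassis_per_rack: int = 16   # sixteen 2U chassis per 42U rack
    port_gbps: int = 10          # assumption: one 10GbE switch port per FPGA

    @property
    def fpgas_per_chassis(self) -> int:
        return self.fpgas_per_sled * self.sleds_per_chassis

    @property
    def fpgas_per_rack(self) -> int:
        return self.fpgas_per_chassis * self.chassis_per_rack

    @property
    def chassis_bandwidth_gbps(self) -> int:
        # Aggregate bandwidth of all FPGA-facing ports of one chassis switch
        return self.fpgas_per_chassis * self.port_gbps


if __name__ == "__main__":
    rack = RackLayout()
    print(rack.fpgas_per_chassis)       # 64
    print(rack.fpgas_per_rack)          # 1024
    print(rack.chassis_bandwidth_gbps)  # 640
```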

cloudFPGA

Compute density: S822LC (aka Minsky) vs. FPGA chassis
~2x INT8 TOPS
~4x INT4 TOPS
~8x INT2 TOPS
~16x binary TOPS

Core themes
1. Infra
2. Software

Who we built this for
cF developer #1: “I want to create or reuse my RTL/HLS designs while designing HW and SW middleware.”
❑ cFDK
✓ Host: Custom API with TCP/UDP
✓ Kernel: VHDL, Verilog, C, C++, SystemC, OpenCL with AXI I/F
cF developer #2: “I want to create or reuse RTL/HLS kernels while using standard APIs whenever possible.”
❑ cFDK
✓ Host: ZRLMPI, cFPy, OpenROLE (SW)
✓ Kernel: VHDL, Verilog, C, C++, SystemC, OpenCL with OpenROLE (HW)
cF developer #3: “I need to accelerate an application. I don’t know RTL/HLS and hardware design.”
❑ cloudFPGA Studio
✓ Host: cFPy, Jupyter Lab
✓ Kernel: VHDL, Verilog, C, C++, SystemC, OpenCL
cF (Nirvana) developer #4: “I wasn’t aware the service I am using involved cloudFPGA.”
❑ User’s front-end application integrates with cloud-native cF software that leverages cF nodes transparently.
✓ Host: gRPC, REST API, ...
✓ Kernel: …, SystemC, OpenCL
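Developer #1 drives the FPGA from the host over plain TCP/UDP, which works because each cloudFPGA node has its own IP address. Below is a minimal host-side sketch using Python's standard `socket` module. It assumes, hypothetically, that the FPGA role answers each request datagram with exactly one reply datagram; the port and message format are placeholders, not the cFDK's actual protocol. A local thread stands in for the FPGA so the example runs anywhere.

```python
import socket
import threading


def udp_request(ip: str, port: int, payload: bytes, timeout_s: float = 1.0,
                max_reply: int = 65507) -> bytes:
    """Send one UDP datagram to a network-attached node and return its reply.

    Hypothetical protocol: one request datagram -> one reply datagram.
    There is no retransmission, so a lost packet raises socket.timeout.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        sock.sendto(payload, (ip, port))
        reply, _addr = sock.recvfrom(max_reply)
        return reply


def start_local_echo_node() -> tuple[str, int]:
    """Stand-in for an FPGA node: answers one datagram with its upper-cased payload."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    srv.bind(("127.0.0.1", 0))  # let the OS pick a free port
    addr = srv.getsockname()

    def serve_once() -> None:
        data, client = srv.recvfrom(65507)
        srv.sendto(data.upper(), client)
        srv.close()

    threading.Thread(target=serve_once, daemon=True).start()
    return addr


if __name__ == "__main__":
    # Against real hardware this would be udp_request("10.10.1.9", <role port>, ...)
    ip, port = start_local_echo_node()
    print(udp_request(ip, port, b"hello fpga"))  # b'HELLO FPGA'
```

UDP keeps the host side thin, but a production client would need retries or sequence numbers, because datagrams can be lost or reordered.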

Software action #1
cFDK REST API
cF developer #1
Disclaimer: Hardware in the FPGA world can be software too! ☺
e.g. “FPGAs for Software Programmers”, Dirk Koch, Frank Hannig, and Daniel Ziener. 2016. Springer Publishing Company, Incorporated.
DONE

Software action #2
cFDK REST API
cF developer #1
cF developer #2
On track
FCCM2020 Workshop: THE FUTURE OF FPGA-ACCELERATION IN CLOUD AND DATA CENTERS
openRole: Do we need a POSIX for FPGAs? Burkhard Ringlein, IBM Research Europe
http://www.fccm.org/proceedings/2020/Workshops/Future_of_FPGA_Workshop/2020-04-29_openRole_workshop_public-Burkhard%20Ringlein.pdf
FPL2020 Workshop: DevOps support for Cloud FPGA platforms
openRole: Can we bring ‘Design once, run everywhere’ to FPGAs? Burkhard Ringlein, IBM Research Europe

Software action #3
Kernels: Quantitative Finance, Weather Modeling, Computer Vision, Database Acceleration, DSP, Data Security, Linear Algebra, AI Inference, Data Analytics
Abstraction levels: Domain-specific Languages, Accelerated Libraries, Custom Accelerators, cFDK REST API
cF developer #1
cF developer #2
cF developer #3: “I need to accelerate an application. I don’t know RTL/HLS and hardware design.”
On track

Software 1.0 Software 2.0
ML
The road to Software 2.0, M. Loukides and B. Lorica, O’Reilly, December 10, 2019
https://www.oreilly.com/radar/the-road-to-software-2-0/

Deploy Deep Learning Everywhere: Limitations
(Figure: new operator introduced by operator fusion, and the potential benefit of the optimization)
TVM For Fun and Profit Tutorial, at FCRC 2019

Build intelligent systems with learning (offline and online)
TVM For Fun and Profit Tutorial, at FCRC 2019

End-to-end compilation flow for transprecision FPGAs
Agile Autotuning of a Transprecision Tensor
Accelerator Overlay D.Diamantopoulos et al., FPL2020

Core themes
1. Infra
2. Software
3. Automation

Who we built this for
Container-Savvy Developer
I can run my containerized application without having to worry about sizing, creating or managing a cluster. “Run my container” vs. “Give me a cluster that I can then run my container on”.
Functions Developer
I love Functions-as-a-Service and can now run them with almost no limits. I now have a single platform to securely combine Functions with Apps and other containerized workloads.
cF developer #1
cF developer #2
cF developer #3: “I need to accelerate an application. I don’t know RTL/HLS and hardware design.”
cF (Nirvana) developer #4: “I wasn’t aware the service I am using involved cloudFPGA.”
PaaS/IaaS Developer
I can start utilizing a new powerful
platform/infra and:
- keep using a “push source code”
experience
- do not have to worry about
containers
- can easily connect my code to
backing services

Software-defined multi-FPGA fabric

Core themes
1. Infra
2. Software
3. Automation
4. Composability

Composable systems with FPGAs
Hybrid cloud
GPU
▪ Thousands of tiny CPUs using high parallelization
▪ Compute-intensive applications
▪ SIMD-oriented workloads
FPGA
▪ Logic + IOs are customized
▪ Very low and predictable latency
▪ MIMD-oriented workloads
New AI HW
cloudFPGA (https://www.zurich.ibm.com/cci/cloudFPGA)
▪ 64 FPGAs in one 19"×2U chassis (64-port 10GbE = 640 Gb/s BW).
▪ In all, 16 such chassis fit into a 42U rack, for a total of 1024 FPGAs and 16 TB of DRAM.

HelmGemm: AI HW fractionalization
➢ Docker container service: multi-tenant
environment with a high-level API to
provide lightweight containers that run
processes in isolation
➢ Kubernetes management: deploy,
maintain, and scale applications
➢ HelmGemm extension: hardware,
middleware and software
➢ Hardware support: 4x GPUs, 2x FPGAs
HelmGemm: Managing GPUs and FPGAs for Transprecision GEMM Workloads in
Containerized Environments

HelmGEMM overview
(Figure: CPU P9, GPU V100 and FPGA 9V3 with system memory, GPU memory and FPGA memory, and the accelerators' view of memory)
o CPU-GPU: NVLink, 3 bricks x 50 GBps
o CPU-FPGA: CAPI2 (32 GBps) or OpenCAPI (<50 GBps)
o System memory: 120 GBps/socket
➢ (Open)CAPI technology enables an HPC node with unified memory for accelerators

HelmGEMM case study of Yolov3
On CPU: single precision
Mapped to GPU V100: half precision, 140W
Mapped to FPGA 9V3: 4-13 bits, 32W
➢ Heterogeneous execution of Yolov3 CNN on P9+V100+AD9V3 for energy efficiency

HelmGEMM evaluation
POWER9 AC922: $21,878; 83 G(fl)Ops/sec; 0.16 G(fl)Ops/sec/Watt
POWER9 AC922 + V100 + 9V3: $30,587; 4.94 T(fl)Ops/sec; 4.78 G(fl)Ops/sec/Watt
→ 59.3x more performance, 28.7x higher energy efficiency
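As a sanity check, the headline ratios can be recomputed from the per-system figures on the HelmGEMM evaluation slide. This assumes 4.94 T(fl)Ops/sec = 4,940 G(fl)Ops/sec. Recomputing from the displayed, presumably rounded, values gives about 59.5x performance and 29.9x energy efficiency, close to the reported 59.3x and 28.7x. The performance-per-dollar line is a derived metric that the slide does not show.

```python
# Figures as displayed on the HelmGEMM evaluation slide (likely rounded).
baseline = {"cost_usd": 21_878, "gops": 83.0, "gops_per_watt": 0.16}   # POWER9 AC922
hetero = {"cost_usd": 30_587, "gops": 4_940.0, "gops_per_watt": 4.78}  # AC922 + V100 + 9V3


def ratio(metric: str) -> float:
    """How many times larger the heterogeneous system's metric is."""
    return hetero[metric] / baseline[metric]


def perf_per_dollar(system: dict) -> float:
    """Throughput (G(fl)Ops/sec) per dollar of system cost."""
    return system["gops"] / system["cost_usd"]


if __name__ == "__main__":
    print(f"performance:       {ratio('gops'):.1f}x")           # 59.5x (slide: 59.3x)
    print(f"energy efficiency: {ratio('gops_per_watt'):.1f}x")  # 29.9x (slide: 28.7x)
    print(f"cost:              {ratio('cost_usd'):.2f}x")
    print(f"perf per dollar:   {perf_per_dollar(hetero) / perf_per_dollar(baseline):.1f}x")
```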

Composable systems with FPGAs
GPU
▪ Thousands of tiny CPUs using high parallelization
▪ Compute-intensive applications
▪ SIMD-oriented workloads
FPGA
▪ Logic + IOs are customized exactly for the application's needs.
▪ Applications with very low and predictable latency
▪ MIMD-oriented workloads
New AI HW
Byte-addressable
Byte-addressable
External: Byte-addressable
Internal: >Bit-addressable
fp16, fp8, int4, int2

Communication for low-precision AI HW
PHRYCTORIA motivation:
Traditional communication mechanisms for modern low-precision data types (e.g., brain-float16, int5) cannot exploit the bandwidth of emerging communication links for FPGA accelerators (e.g., OpenCAPI, PCIe Gen4).
Heterogeneous system: IBM IC922, 2 POWER9 CPUs, AlphaData ADM-9H7 (Xilinx VU37P FPGA), OpenCAPI 3.0, 25Gbps x8.
-2x BW utilization for brain-float16
-6x BW utilization for int5
Name inspired by the ancient Greek communication system “ΦΡΥΚΤΩΡΙΑ”, 1900 B.C.
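The bandwidth problem is easy to see at bit level: a byte-oriented message carries an int5 value in 8 bits (3 wasted), and a bfloat16 value carried as a float32 doubles the payload. The sketch below shows generic dense bit-packing and a bfloat16 conversion with the standard library. It is an illustration under stated assumptions (unsigned int5 values, LSB-first packing, bfloat16 by truncation rather than round-to-nearest), not the PHRYCTORIA implementation.

```python
import struct


def pack_uint5(values: list[int]) -> bytes:
    """Densely pack unsigned 5-bit integers (0..31), LSB-first, into bytes."""
    acc = 0    # bit accumulator
    nbits = 0  # number of valid bits currently held in acc
    out = bytearray()
    for v in values:
        if not 0 <= v < 32:
            raise ValueError(f"{v} does not fit in 5 bits")
        acc |= v << nbits
        nbits += 5
        while nbits >= 8:  # emit every complete byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # flush the final partial byte
    return bytes(out)


def unpack_uint5(data: bytes, count: int) -> list[int]:
    """Inverse of pack_uint5 for the first `count` values."""
    acc = int.from_bytes(data, "little")
    return [(acc >> (5 * i)) & 0x1F for i in range(count)]


def to_bfloat16_bits(x: float) -> int:
    """bfloat16 by truncation: keep the upper 16 bits of the float32 encoding."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


if __name__ == "__main__":
    vals = list(range(32)) * 4  # 128 int5 values
    packed = pack_uint5(vals)
    print(len(vals), len(packed))  # 128 80  (one byte per value vs. densely packed)
    assert unpack_uint5(packed, len(vals)) == vals
    print(hex(to_bfloat16_bits(1.0)))  # 0x3f80
```

Dense packing alone shrinks an int5 payload by 8/5 = 1.6x compared with one value per byte. The gains reported for PHRYCTORIA come from the full messaging stack and are not reproduced by this sketch.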

Bits
Supercomputers
Neurons
Qubits
How we get there

https://analog-ai-demo.mybluemix.net/
o The Future of Computing: Bits + Neurons + Qubits, Dario Gil and William M. J. Green, ISSCC2020: Plenary
o E. Eleftheriou et al., "Deep learning acceleration based on in-memory computing," in IBM Journal of Research
and Development, vol. 63, no. 6, pp. 7:1-7:16, 1 Nov.-Dec. 2019
o https://www.research.ibm.com/artificial-intelligence/ai-hardware-center/
https://www.ibm.com/blogs/research/2019/02/ai-hardware-center/ , Feb. 2019
o Extending performance by 2.5x per year through 2025
o Approximate computing principles applied to digital AI cores with reduced precision and analog AI cores (non-von Neumann HW)
o PCM devices can store synaptic weights in their analog conductance state. When PCM devices are arranged in a crossbar configuration, they can perform an analog matrix-vector multiplication in a single time step, exploiting their multi-level storage capability and Kirchhoff’s circuit laws.
https://www.technologyreview.com/2020/12/11/1014102/ai-trains-on-4-bit-computers/
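The crossbar statement reduces to two circuit laws. Each device contributes a current I = G·V (Ohm's law), and the currents on a column wire add up (Kirchhoff's current law), so column j collects I_j = Σ_i G[i][j]·V[i], which is a matrix-vector product. The sketch below emulates an idealized crossbar. It assumes no wire resistance, device noise or nonlinearity, and omits the differential device pairs that are typically needed for signed weights. The conductance and voltage values are arbitrary.

```python
def crossbar_column_currents(G: list[list[float]], V: list[float]) -> list[float]:
    """Idealized analog crossbar: current collected on each column wire.

    G[i][j] is the conductance (siemens) of the device at row i, column j;
    V[i] is the read voltage (volts) applied to row i. Each device passes
    I = G * V (Ohm's law) and Kirchhoff's current law sums those currents
    per column. In hardware all columns settle at once; this loop only
    emulates that result sequentially.
    """
    rows, cols = len(G), len(G[0])
    if len(V) != rows:
        raise ValueError("one voltage per row is required")
    currents = [0.0] * cols
    for i in range(rows):
        for j in range(cols):
            currents[j] += G[i][j] * V[i]
    return currents


if __name__ == "__main__":
    # Weights stored as conductances (arbitrary example values, in siemens)
    G = [[1e-6, 2e-6],
         [3e-6, 4e-6],
         [5e-6, 6e-6]]
    V = [0.1, 0.2, 0.3]  # input vector encoded as row voltages
    print(crossbar_column_currents(G, V))  # approx [2.2e-06, 2.8e-06] amperes
```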


https://qiskit.org/ https://www.ibm.com/quantum-computing/experience/
At CES 2020, IBM research director Dario Gil gave the audience a primer on quantum computing and predicted that the industry will achieve quantum
advantage this decade.

Accelerated discovery: Bits + Neurons + Qubits
Deep Search: Bits + Neurons
Intelligent Simulation: Bits + Qubits
Generative Models: Bits + Neurons
Autonomous Labs: Bits + Neurons

How we get there
The role of cloud FPGAs
1. FPGAs are eligible to become first-class citizens
➢ Standalone approach sets the FPGA free from the CPU
o Large-scale deployment of FPGAs independent of #servers
o Significantly lowers the entry barrier
➢ Promotes the use of medium and low-cost FPGAs
2. The network-attachment model
➢ Makes FPGAs IP-addressable and scalable in DCs
o Users can rent and link them in any type of topology
➢ Opens the path for the use of FPGAs in large-scale applications
o Serverless computing, HPC, DNN inference, Signal Processing, …
3. The hyperscale infrastructure
➢ Integrates FPGAs at the chassis (aka drawer) level
➢ Combines passive and active water cooling
➢ Key enabler for FPGAs to become plentiful in DCs
FCCM2020 Workshop: THE FUTURE OF FPGA-ACCELERATION IN CLOUD AND DATA CENTERS
cloudFPGA: Promote FPGAs to 1st Citizen in the Cloud, Francois Abel, IBM Research Europe
http://www.fccm.org/proceedings/2020/Workshops/Future_of_FPGA_Workshop/cloudFPGA.pdf

Research Ecosystem
Team Collaboration
Burkhard Ringlein, Francois Abel, Beat Weiss, Mitra Purandare,
Florian Auernhammer (OCAPI), Raphael Polig (ZYC2),
Christoph Hagleitner, Mark Lantz
ZRL Collaboration
Florian Scheidegger, Cristiano Malossi (H2020 OPRECOMP)
Eindhoven University of Technology
Gagandeep Singh, Sander Stuijk, Henk Corporaal
Former NeMeCo ZRL colleagues: Jan van Lunteren, Ronald Luijten
• COOLCHIPS2020, WiP: Automated precision-tuning methods
for deep learning models on FPGAs and IoT devices.
• ISCAS2018, COOLCHIPS2018, FPT2018, ASAP2019,
SAMOS2019, DATE2019, RAW-IPDPS2020, FPL2019,
FPL2020, FCCM2020, H2RC2020
• FPL2019, SAMOS2019, DATE2019, FPL2020. NeMeCo H2020
project, near-memory accelerators for weather modeling
ETH
Stefan Mach, Fabian Schuiki, Germain Haugou,
Michael Schaffner, Frank K. Gurkaynak, and Luca Benini
• COOLCHIPS2020, Transprecision PULP on IBM-ZRL Cloud
ETH
Kaan Kara (now Oracle), Dimitris Syrivelis (former IBM-Ireland colleague, now Nvidia), Gustavo Alonso
• FPL2020, In-memory database acceleration (FPGA +
OpenCAPI + HBM)
Barcelona Supercomputing Center
Abbas Haghi, Lluc Alvarez, Santiago Marco, Miquel Moretó
• FPL2020, Genomics acceleration with CAPI2-FPGA
IBM France – IBM China
Bruno Mesnet, Alexandre Castellane, Yong Lu (now cyansemi)
• SNAP(CAPI1/2) and OC-Accel(OpenCAPI), OpenPOWER
Summit 2018, 2019
There is no one-man show
ETH
Gagandeep Singh, Juan Gómez-Luna, Onur Mutlu
• FPGA2021, Optimizing Near-memory accelerators with ML

What’s Next in FPGAs for cloud is agility
Infrastructure | Software | Automation | Composability

Transprecision Computing DL: Constrained model synthesis for IoT applications
USER INPUT
o IoT budget & requirements: inference time, memory size, device type
o Dataset: images, sensors, audio
AUTOMATED ML MODEL SYNTHESIS FOR GIVEN EDGE DEVICE
o CASE #1, CASE #2, CASE #3, …, FPGA?
OUTPUT
o Visual inspection, anomaly detection
Sood, A. et al. “NeuNetS: An Automated Synthesis Engine for Neural Network Design.” ArXiv abs/1901.06261 (2019)
F. Scheidegger et al. “Constrained deep neural network architecture search for IoT devices accounting for hardware calibration”, NeurIPS2019
Call for EU-funded PhD: Deep Learning Algorithms for Budget Constrained Applications in the IoT Domain
https://tuni.rekrytointi.com/paikat/?o=A_RJ&jgid=3&jid=794

PHRYCTORIA overview
(Figure: Protocol Buffer serialization/deserialization between a byte-addressable enterprise system and a low-precision FPGA accelerator, evaluated on a synthetic FloatX dataset and an NLP dataset)
PHRYCTORIA: A Messaging System for Transprecision OpenCAPI-attached FPGA Accelerators, D.Diamantopoulos et al., RAW-IPDPS2020
Results: 6.3x-7.4x goodput BW; -4.8x MB; 6.9x goodput BW
Compatible with any gRPC-supported device/service

Survey and Benchmarking of Machine Learning Accelerators, 2019, MIT Lincoln Laboratory Supercomputing Center
https://arxiv.org/abs/1908.11348
“Best” choice depends on requirements for
o Throughput (fps),
o Latency (ms),
o Energy efficiency (fps/watt),
o Cost efficiency (fps/$),
o Accuracy
is agility, operationalized
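The "best choice depends on requirements" point can be made concrete as constrained selection: discard devices that violate hard limits (here latency and accuracy), then rank the remaining ones by whichever objective matters (fps, fps/watt or fps/$). The device names and numbers below are hypothetical placeholders, not data from the survey.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Device:
    name: str
    fps: float         # throughput
    latency_ms: float
    watts: float
    cost_usd: float
    accuracy: float    # e.g. top-1, as a fraction

    @property
    def fps_per_watt(self) -> float:
        return self.fps / self.watts

    @property
    def fps_per_dollar(self) -> float:
        return self.fps / self.cost_usd


def pick(devices: list[Device], max_latency_ms: float, min_accuracy: float,
         objective: Callable[[Device], float]) -> Optional[Device]:
    """Keep devices meeting the hard constraints, then maximize `objective`."""
    feasible = [d for d in devices
                if d.latency_ms <= max_latency_ms and d.accuracy >= min_accuracy]
    return max(feasible, key=objective, default=None)


if __name__ == "__main__":
    # Hypothetical placeholder devices, not measurements.
    devices = [
        Device("device_A", fps=2000, latency_ms=15.0, watts=250, cost_usd=9000, accuracy=0.76),
        Device("device_B", fps=600, latency_ms=2.0, watts=30, cost_usd=4000, accuracy=0.74),
        Device("device_C", fps=150, latency_ms=8.0, watts=5, cost_usd=300, accuracy=0.70),
    ]
    print(pick(devices, 20.0, 0.72, lambda d: d.fps).name)            # device_A
    print(pick(devices, 5.0, 0.72, lambda d: d.fps).name)             # device_B
    print(pick(devices, 20.0, 0.0, lambda d: d.fps_per_dollar).name)  # device_C
```

The same device can win or lose depending on which constraint binds, which is why a cloud offering several kinds of AI HW benefits from choosing per workload.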

Autotuning of a Transprecision Tensor Accelerator
Agile Autotuning of a Transprecision Tensor
Accelerator Overlay D.Diamantopoulos et al., FPL2020
Instead of eliminating the hardware design space
with pruning, we propose a technique that builds
a prediction model which quantifies the impact
of a hardware design choice towards an
optimization goal.
By using the most important features to generate an overlay, we perform auto-tuning that achieves up to 2.5x higher performance and up to 8.1x faster convergence.
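The idea of quantifying how much each hardware design choice matters, instead of pruning the design space, can be illustrated with a deliberately simple main-effects model: group the measured objective by each knob's setting and treat the spread between the group means as that knob's importance. This is a swapped-in technique for illustration, not the prediction model of the FPL2020 paper, and the knob names and measurements are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def knob_importance(samples: list[tuple[dict, float]]) -> dict[str, float]:
    """Main-effect importance of each design knob.

    samples: (configuration, measured objective) pairs, e.g. GFLOP/s.
    For every knob, measurements are grouped by the knob's value; the
    spread (max - min) of the group means is the knob's importance.
    """
    by_knob: dict = defaultdict(lambda: defaultdict(list))
    for config, score in samples:
        for knob, value in config.items():
            by_knob[knob][value].append(score)
    return {knob: max(map(mean, groups.values())) - min(map(mean, groups.values()))
            for knob, groups in by_knob.items()}


if __name__ == "__main__":
    # Hypothetical measurements of a tensor-accelerator overlay (fractional factorial).
    samples = [
        ({"pe_rows": 8,  "bitwidth": 8, "buffer_kb": 64},  100.0),
        ({"pe_rows": 16, "bitwidth": 8, "buffer_kb": 128}, 190.0),
        ({"pe_rows": 8,  "bitwidth": 4, "buffer_kb": 128}, 210.0),
        ({"pe_rows": 16, "bitwidth": 4, "buffer_kb": 64},  400.0),
    ]
    ranking = sorted(knob_importance(samples).items(), key=lambda kv: -kv[1])
    print(ranking)  # [('bitwidth', 160.0), ('pe_rows', 140.0), ('buffer_kb', 50.0)]
```

An auto-tuner can then spend its search budget on the top-ranked knobs and fix the low-impact ones, which is one way to converge faster without discarding parts of the design space outright.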

-53x
GPU memory sharing in containerized systems can lead to
GPU performance inefficiencies that fall within the performance
envelope of FPGAs, which operate on a power budget one
order of magnitude lower.
The case of AI HW with
GPUs & FPGAs
Is your neural network memory-bound or compute-bound
on your NEW AI HW?
Why does it matter for the cloud?
What does “AI HW diverseness” mean for an enterprise system?
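The memory-bound vs. compute-bound question is commonly answered with the roofline model: attainable throughput = min(peak compute, arithmetic intensity × memory bandwidth). The sketch below applies it to a GEMM layer. The byte count assumes each matrix crosses the memory interface exactly once (ideal on-chip reuse), and the 10 TFLOP/s / 100 GB/s device is a hypothetical placeholder, not any device from these slides.

```python
def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: float) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], assuming each matrix is
    read or written exactly once (ideal on-chip reuse)."""
    flops = 2 * m * n * k  # one multiply and one add per MAC
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved


def roofline(peak_gflops: float, bw_gbs: float, intensity: float) -> tuple[float, str]:
    """Attainable GFLOP/s and which roof limits it."""
    mem_roof = intensity * bw_gbs
    if mem_roof < peak_gflops:
        return mem_roof, "memory-bound"
    return peak_gflops, "compute-bound"


if __name__ == "__main__":
    peak, bw = 10_000.0, 100.0  # hypothetical device: 10 TFLOP/s, 100 GB/s
    for m, n, k in [(1, 1024, 1024), (1024, 1024, 1024)]:  # batch-1 vs. batched layer
        ai = gemm_intensity(m, n, k, bytes_per_elem=2)      # fp16 operands
        perf, bound = roofline(peak, bw, ai)
        print(f"{m}x{n}x{k}: {ai:.1f} FLOP/B -> {perf:.0f} GFLOP/s, {bound}")
```

Halving `bytes_per_elem` (e.g. fp16 to int8) doubles the intensity, which is one reason narrow-precision datapaths can move a layer from the memory roof toward the compute roof.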

Selected ML workloads

Aggressive bit-width optimization for every AI HW device
Simulations establish which parts of an application can be mapped to lower precision without degrading their accuracy.
• DeepSpeech and Language Modeling (Euclidean distance compared to fp32 for 100% accuracy)
Distribution of the workloads to their lower-precision counterparts, as a code-coverage percentage.
Yolov3: aggressive bit-width optimization such that ImageNet classification accuracy is not less than 72.9% top-1 and 91.2% top-5.
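The precision-mapping simulations can be sketched as a search: emulate an n-bit datapath, compare its output with the full-precision result by Euclidean distance, and keep the smallest bit-width within tolerance. The symmetric uniform quantizer, the toy matrix-vector kernel and the tolerance values below are assumptions; the slides do not specify the number formats used. Python floats are float64 here and stand in for the fp32 reference.

```python
import math
from typing import Optional


def quantize(xs: list[float], bits: int) -> list[float]:
    """Emulate a signed `bits`-bit fixed-point datapath: symmetric uniform
    quantization (one scale per vector), then dequantization back to float."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed symmetric grid")
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax or 1.0  # all-zero input: any scale works
    return [round(x / scale) * scale for x in xs]


def matvec(W: list[list[float]], x: list[float]) -> list[float]:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def min_bits(W: list[list[float]], x: list[float], tolerance: float,
             candidates: range = range(2, 17)) -> Optional[int]:
    """Smallest bit-width whose output stays within `tolerance` (Euclidean
    distance) of the full-precision result, or None if none qualifies."""
    reference = matvec(W, x)  # float64 reference standing in for fp32
    for bits in candidates:
        Wq = [quantize(row, bits) for row in W]  # one scale per weight row
        out = matvec(Wq, quantize(x, bits))
        if euclidean(reference, out) <= tolerance:
            return bits
    return None


if __name__ == "__main__":
    W = [[0.8, -0.3, 0.5], [0.1, 0.9, -0.7]]  # toy layer weights
    x = [1.0, -2.0, 0.5]                      # toy activation vector
    for tol in (1e-1, 1e-2, 1e-3):
        print(tol, min_bits(W, x, tol))
```

Running the same search per kernel of an application is one way to obtain the kind of per-workload precision distribution reported above.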

HelmGEMM measurements
