5. Datacenter Challenges
• Workload diversity
- Software services change monthly
- The number of applications keeps increasing
• Maintenance
- Little opportunity for hardware maintenance; limited physical accessibility
- Machines last ~3 years and can be repurposed during their lifetime
- Homogeneity is critical to reduce cost
• Specialization
- Slowing of Moore's law performance scaling
- Compute requirements grow beyond conventional CPU-only systems
[Figure: breakdown of cycles in the 50 hottest binaries (%); S. Kanev et al., "Profiling a Warehouse-Scale Computer," ISCA 2015]
6. FPGA vs ASIC
[Diagram: Xeon CPU + NIC with a search accelerator implemented as an FPGA vs. as an ASIC. When the search algorithm moves to v2, the fixed-function ASIC becomes wasted power and holds back software; a dedicated math-accelerator ASIC is also one more thing that can break, whereas the FPGA can simply be reconfigured.]
7. Catapult Gen1 (2014)
• Altera Stratix V D5
• 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs
• PCIe Gen 3 x8
• 8GB DDR3-1333
• Powered by PCIe slot
• 6x8 Torus Network
[Board photo: Stratix V FPGA, 8 GB DDR3, PCIe Gen3 x8, 4 x 20 Gbps transceivers]
"Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," ISCA 2014
8. Open Compute Server
• Two 8-core Xeon 2.1 GHz CPUs
• 64 GB DRAM
• 4 HDDs @ 2 TB, 2 SSDs @ 512 GB
• 10 Gb Ethernet
• Plug-in FPGA via mezzanine connector
[Server photo: FPGA card on the mezzanine connector; measured temperature ~68 °C]
9. Rack Design
• High density
- 1U (1.75-inch height), half-width servers
- Homogeneous design
- 1 FPGA per server; the power and space budget is not enough for a GPU
- Half rack: 2 x 24 servers
[Diagram: half-rack of servers connected to a Top-of-Rack switch (TOR)]
• Local torus network
- Dedicated 6x8 torus enables multi-FPGA accelerators
- Requires additional cabling to map the physical 2x24 layout onto the logical 6x8 torus
10. Shell and Role
• Shell
- Operating system for FPGA
- Handles all I/O & management tasks
- Exposes simple FIFOs
• Role
- Only application logic
- Partial reconfiguration boundary
• Debug support
- Flight data recorder
- JTAG cable
[Block diagram of shell and role: the role region hosts the application (Bing, Azure, DNN, etc.); shell blocks include the x8 PCIe core with DMA engine, two DDR3-1333 ECC SO-DIMM controllers (4 GB each), four SerialLite III (SLIII) links (North/South/East/West) feeding the inter-FPGA router, 256 Mb QSPI config flash with remote system update (RSU), JTAG, I2C, transceiver reconfiguration, LEDs, temperature sensors, SEU scrubbing, and the host CPU interface.]
11. Catapult Gen2 (2016)
• From torus to Ethernet
- Bump-in-the-wire placement: NIC, then FPGA, then TOR switch
[Diagram: Gen2 card sits between the 40G NIC and the ToR switch; FPGA with 4 GB DDR, 2 x PCIe Gen3 x8 to the host, 35 W power budget]
“A Cloud-Scale Acceleration Architecture,” Micro 2016
12. Integration into DC Infrastructure
Any FPGA can communicate with any other FPGA in the datacenter
13. Network Coverage and Latencies
[Plot: round-trip latency (µs) vs. number of reachable hosts/FPGAs, with example latency histograms. The Catapult Gen1 6x8 torus reaches at most 48 FPGAs; LTL over Ethernet covers L0 (same TOR), L1, and L2 scopes, shown with average and 99.9th-percentile latencies.]
15. Bing Ranking Acceleration
[Plot: normalized load and latency over five days: 99.9th-percentile SW vs. FPGA latency, and average SW vs. FPGA query load]
• Lower latency than software even at 2x the query load
• More consistent 99.9th-percentile tail latency
16. AI Chip Market
• AI semiconductors: ~$65B in 2025, 19% of the semiconductor market, 18-19% growth per year
[Charts (McKinsey report, 2019): required performance for AI vs. existing chips and GPUs over 2000-2030, with AI chips targeting >1000x energy efficiency after the deep learning revolution (AlexNet, 2012); AI semiconductor total market of $240B (2017), $256B (2020E), and $370B (2025E), with the AI share rising from $17B (7%) to $32B (11%) to $65B (19%); CAGR 2017-25 of 18-19% for AI vs. 3-4% for non-AI, i.e. about 5x growth for AI semiconductors]
17. AI Chip Industry
• Google: TPU deployment (2016)
• Facebook: Open Compute initiative (2011), AI chip dev team (2019)
• Microsoft: Catapult FPGA deployment (2014), Brainwave deployment (2018)
• Baidu: Kunlun in production (2020)
• Tesla: Full Self-Driving (FSD) chip for autonomous vehicles (2019)
• Habana Labs: AI training processor Gaudi (2019)
• And more (Graphcore, Cerebras, Intel, Groq, Wave Computing, ...)
18. Google TPU
• Simple architecture to support MLP, CNN, and RNN models while enabling fast development
- Host interface
- Unified buffer (24 MB), weight FIFO
- Matrix multiply unit (256 x 256)
- Accumulators (256, 4 MB buffers)
- 8-bit integer multiplication
• Systolic array architecture
- Systolic execution saves energy by reducing reads
and writes of the Unified Buffer
- Activation data flows in from the left and weights
are pre-loaded from the top
- A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront (see the sketch below)
- Throughput-oriented: control and data are pipelined
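To make the diagonal-wavefront idea concrete, below is a minimal Python/NumPy sketch (not Google's implementation) of a weight-stationary systolic matrix multiply in the spirit of the TPU v1 MXU, with 8-bit operands and 32-bit accumulation; the array size and the skewed injection schedule are illustrative assumptions.

```python
import numpy as np

def systolic_matmul(activations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """activations: (M, K) int8, weights: (K, N) int8 -> (M, N) int32 result.

    Weights stay resident in the K x N grid of MAC cells ("weight stationary");
    activation rows are injected with a skew so partial sums advance one cell
    per cycle and results emerge as a diagonal wavefront.
    """
    M, K = activations.shape
    _, N = weights.shape
    acts = activations.astype(np.int32)
    wts = weights.astype(np.int32)
    out = np.zeros((M, N), dtype=np.int32)

    # Cycle-by-cycle emulation: at cycle t, cell (k, n) consumes activation
    # element acts[m, k] with m = t - k - n, so each (m, k, n) is hit exactly once.
    for t in range(M + K + N - 2):
        for k in range(K):
            for n in range(N):
                m = t - k - n
                if 0 <= m < M:
                    out[m, n] += acts[m, k] * wts[k, n]
    return out

a = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)
w = np.random.randint(-128, 128, size=(8, 3), dtype=np.int8)
assert np.array_equal(systolic_matmul(a, w), a.astype(np.int32) @ w.astype(np.int32))
```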
19. TPU v1 Performance
• CPU vs GPU vs TPU
[Roofline plot: Teraops/sec vs. operational intensity (MAC ops per weight byte) for the Google TPU, NVIDIA K80, and Intel Haswell; GM = geometric mean, WM = weighted mean over the benchmark set]
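For reference, this comparison follows the standard roofline model (my formulation, not text from the slide): attainable throughput is capped either by peak compute or by memory bandwidth times operational intensity, so low-intensity MLP/LSTM layers sit on the bandwidth slope while high-intensity CNN layers approach the compute roof.

\[ \text{Attainable ops/s} \;=\; \min\big(\text{Peak ops/s},\;\; \text{Operational Intensity} \times \text{Memory BW}\big) \]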
20. TPU v2 & TPU v3
• 128 x 128 systolic array (22.5 TFLOPS per core)
• bfloat16 multiplies with float32 accumulation (see the sketch below)
• 2 cores + 2 HBMs per chip / 4 chips per board
[Photos: TPU v2 and TPU v3 boards]
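A small Python/NumPy sketch (illustrative only, not TPU microcode) of the v2/v3 numeric scheme: operands truncated to bfloat16 precision for the multiplies, with products accumulated in float32.

```python
import numpy as np

def bf16_trunc(x):
    """Keep only the top 16 bits of each float32 (sign, 8 exponent, 7 mantissa bits)."""
    return (np.asarray(x, np.float32).view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def mxu_matmul(a, b):
    """Multiply bfloat16-truncated operands; products accumulate in float32."""
    return bf16_trunc(a) @ bf16_trunc(b)   # operands live in float32 containers, so the sum is float32

a = np.random.randn(128, 128).astype(np.float32)
b = np.random.randn(128, 128).astype(np.float32)
err = np.max(np.abs(mxu_matmul(a, b) - a @ b))   # small: bfloat16 keeps roughly 3 decimal digits
```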
21. Cloud TPU v2 Pod
• Single board: 180 TFLOPS + 64 GB HBM
• Single pod (64 boards): 11.5 PFLOPS + 4 TB HBM (see the check below)
• 2D torus topology
[Photo: a single TPU v2 board]
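As a quick check that the board and pod numbers are consistent:

\[ 64 \times 180~\text{TFLOPS} = 11.52~\text{PFLOPS} \approx 11.5~\text{PFLOPS}, \qquad 64 \times 64~\text{GB HBM} = 4096~\text{GB} = 4~\text{TB} \]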
23. Habana Labs Gaudi
• Built for AI training performance at scale
- High throughput at low batch size
- High power efficiency
• Enable native Ethernet scale-out
- On-chip RDMA over Converged Ethernet (RoCE v2)
- Reduced system complexity, cost and power
- Leverage widely used standard Ethernet switches
• Promote standard form factors
- Open Compute Project Accelerator Module (OAM)
• SW infrastructure and tools
- Frameworks and ML compilers support
- TPC kernel library and user-friendly dev tools to enable
optimization/customization
24. Gaudi Processor Architecture
• 500mm2 die @ TSMC 16nm
• TPC 2.0 (Tensor Processing Core)
- Support DL training & inference
- VLIW SIMD (C-programmable)
• GEMM operation engine
- Highly configurable
• PCIe Gen4.0 x16
• 4 HBMs
- 2GT/s, 1TB/s BW, 32GB capacity
• RoCE v2
- 10 ports of 100Gb or 20 ports of 50Gb
• Mixed-precision data types
- FP32, BF16, INT32/16/8, UINT32/16/8
Heterogeneous compute architecture
[Die diagram: 4 HBM stacks, 256 GB/s and 8 GB per HBM]
26. Training System with Gaudi
Various network configurations & systems possible for scale-out training
Habana Labs System-1 (HLS-1)
High-performance system with 16 Gaudi cards
27. Data & Model Parallelism using Gaudi
• Topology for data parallelism
- 3 reduction levels (see the sketch below)
- 8 x 11 x 12 = 1056 Gaudi cards
• Topology for model parallelism
- Model parallelism requires more bandwidth
- Large-scale systems are built with all-to-all connectivity using a single networking hop, thanks to Ethernet integration
- 8 x 8 = 64 Gaudi cards
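A minimal Python sketch of the multi-level reduction behind the data-parallel topology; the group sizes (8, 11, 12) mirror the 8 x 11 x 12 = 1056 example, and only the math is modeled, not the RoCE transport.

```python
import numpy as np

def hierarchical_reduce(grads, levels=(8, 11, 12)):
    """grads: one gradient array per worker (len = 8*11*12), all the same shape.
    Sum within groups of 8, then across 11 groups, then across 12 super-groups;
    the result equals the global sum every worker would receive after all-reduce."""
    assert len(grads) == int(np.prod(levels))
    g = np.stack(grads).reshape(*levels[::-1], *grads[0].shape)   # (12, 11, 8, ...)
    group_axis = -grads[0].ndim - 1                                # innermost group axis
    for _ in levels:                                               # one reduction per level
        g = g.sum(axis=group_axis)
    return g

workers = [np.random.randn(4) for _ in range(8 * 11 * 12)]
assert np.allclose(hierarchical_reduce(workers), np.sum(workers, axis=0))
```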
28. Gaudi Training Performance
• ResNet-50 training throughput
[Charts: images per second (thousands) vs. number of processors, Gaudi vs. V100; throughput scaling vs. number of Gaudi chips used]
• A single Gaudi dissipates 140 W and processes 1650 images/second (efficiency worked out below)
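From the single-chip figures above, the implied training efficiency is:

\[ 1650~\text{images/s} \div 140~\text{W} \approx 11.8~\text{images/s per watt} \]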
29. Gaudi Mezzanine card & System

Processor: Gaudi HL-2000 (HL-205 mezzanine card)
- Host interface: PCIe Gen 4.0 x16
- Memory: 32 GB HBM2
- Memory BW: 1 TB/s
- ECC protected: yes
- Max power consumption: 300 W
- Interconnect: 2 Tbps (20 x 56 Gbps PAM4 Tx/Rx SerDes; RoCE RDMA, 10 x 100 GbE or 20 x 50 GbE/25 GbE)

System: HLS-1 (8 x Gaudi HL-205)
- Host interface: 4 ports of x16 PCIe Gen 4.0
- Memory: 256 GB HBM2
- Memory BW: 8 TB/s
- ECC protected: yes
- Max power consumption: 3 kW
- Interconnect: 24 x 100 Gbps RoCE v2 RDMA Ethernet ports (6 x QSFP-DD)

[Photos: HL-205 mezzanine card and HLS-1 system]
30. Graphcore IPU Processor
• MIMD architecture for fine-grained parallelism
• 23.6B transistors, 800mm2 die @ 16nm
• 124.5 TFlops @ 120W (FP16 mul + FP32 acc)
• 1216 tiles (each tile = core + scratchpad)
- 7296 hardware threads in total (6 per tile)
- 304 MB total memory (256 KB per tile)
- 45 TB/s memory BW & 6 cycle latency
- No shared memory
• PCIe Gen4 x16
- 64 GB/s bidirectional BW to host
• IPU-Exchange
- 8 TB/s all-to-all exchange among IPU tiles
- Non-blocking, any communication pattern
• IPU-Links
- 80 IPU-Links
- 320 GB/s chip-to-chip BW
[Die diagram: IPU tiles, IPU-Links, PCIe, IPU-Exchange]
31. IPU Tile
• Tile = computing core + 256 KB scratch pad
• Specialized pipelines called Accumulating
Matrix Product (AMP)
• The AMP unit can accelerate matrix multiplication and convolution operations
• The IPU tiles can be used for MIMD parallelism
[Diagram: codelet execution phases: compute, exchange, waiting]
37. XPU Core Architecture
• Many tiny cores
- Instruction set based software-programmable
- Domain specific ISA
- No operating system & no cache
- Flexible to serve diverse workloads
• Customized logic
- Hardware-reconfigurable
- Achieve high performance efficiency
- SDA-II accelerator can be used for DL
• Resource allocation is reconfigurable
- Set the ratio of cores vs. custom logic
depending on application’s requirement
38. XPU: Architecture of Tiny Cores
• 32 cores are clustered and share
- 32KB multi-bank memory
- SFA (special function accelerator)
39. XPU: Architecture of Tiny Cores
• MIPS-like instruction set
• Private scratchpad memory
- 16 KB or 32 KB
• 4-stage pipeline
- Designed for low latency
- Branch history table (BHT)
41. Cerebras Wafer Scale Engine (WSE)
• TSMC 16nm technology
• 1.2T transistors on 46,225 mm2 silicon wafer
• 400,000 AI optimized cores
• 18 GB on-chip memory (SRAM)
- 9.6 PB/s memory BW
• Memory architecture optimized for DL
- Memory uniformly distributed across cores
• High-bandwidth low-latency interconnect
- 2D mesh topology
- Hardware based communication
- 100 Pbit/s fabric bandwidth
• 1 GHz clock speed & 15kW power consumption
• Largest chip ever built
42. Cerebras Wafer Scale Engine Core
• Fully programmable compute core
• Full array of general instructions with ML extensions
• Flexible general operations for control processing
- E.g. arithmetic, logical, load/store, branch
• Optimized for tensor operations
- Tensors as first class operands
• Sparsity harvesting technology
- SLA cores intelligently skip the zeros (see the sketch below)
- All zeros are filtered out
[Diagram: Sparse Linear Algebra (SLA) core]
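A toy Python sketch of the zero-skipping idea (purely illustrative; the SLA core does this in hardware on fine-grained dataflow): only nonzero activations trigger multiply-accumulate work.

```python
import numpy as np

def sparse_mac(weights: np.ndarray, activations: np.ndarray):
    """Dot product that skips zero activations; also counts the MACs actually issued."""
    acc, macs = 0.0, 0
    for w, a in zip(weights, activations):
        if a != 0.0:              # zeros are filtered out and trigger no work
            acc += w * a
            macs += 1
    return acc, macs

acts = np.random.randn(1024)
acts[np.random.rand(1024) < 0.7] = 0.0        # ~70% sparsity, e.g. post-ReLU activations
w = np.random.randn(1024)
result, macs_issued = sparse_mac(w, acts)     # macs_issued is roughly 30% of 1024
assert np.isclose(result, w @ acts)
```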
43. Programming the Wafer Scale Engine
• Neural network (NN) models expressed in common ML frameworks
• Cerebras interface to framework extracts the NN
• Performs placement and routing to map NN layers to fabric
• The entire wafer operates on a single neural network
44. Challenges of WSE
• Cross die connectivity
- Add cross-die wires across scribe lines of wafer in partnership with TSMC
• Yield
- Have redundant cores and reconnect fabric
• Wafer-wide package assembly technology
• Power and cooling
45. Intel NNP-T
• Intel Nervana Neural Network Processor
for Training (NNP-T)
• Train a network as fast as possible within
a given power budget, targeting larger
models and datasets
• Balance between compute, communication,
and memory for system performance
• Reuse on-die data as much as possible
• Optimize for batched workloads
• Built-in scale-out support
46. Intel NNP-T Architecture
• 27B transistors, 680mm2 die @
TSMC 16nm (2.5D packaging)
• 24 Tensor Processor Clusters
- Up to 119 TOPS
• 60MB on-chip distributed
memory
• 4 x HBM2
- 1.22 TB/s BW, 8GB capacity
• PCIe Gen 4.0 x16
• Up to 1.1GHz core frequency
• 64 lanes SerDes
- Inter-chip communication
48. Compute Core
• Bfloat16 matrix multiply core (32x32)
• FP32 & BF16 support for all other
operations
• 2x multiply cores per TPC to amortize other
SoC resources (control, memory, network)
• Vector operations for non-GEMM
- Compound pipeline
- DL specific optimizations
• Activation functions, reductions, random-
number generation & accumulations
• Programmable FP32 look-up tables (illustrated below)
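A hedged Python sketch of one use of a programmable FP32 look-up table: approximating an activation such as sigmoid by table lookup plus linear interpolation (the table size and input range here are illustrative assumptions, not NNP-T parameters).

```python
import numpy as np

LO, HI, ENTRIES = -8.0, 8.0, 256                          # assumed table range and size
xs = np.linspace(LO, HI, ENTRIES, dtype=np.float32)
table = (1.0 / (1.0 + np.exp(-xs))).astype(np.float32)    # FP32 sigmoid samples

def lut_sigmoid(x: np.ndarray) -> np.ndarray:
    """Approximate sigmoid via table lookup with linear interpolation between entries."""
    pos = (np.clip(x, LO, HI) - LO) / (HI - LO) * (ENTRIES - 1)
    i = np.minimum(pos.astype(np.int32), ENTRIES - 2)     # left neighbor index
    frac = (pos - i).astype(np.float32)                   # distance toward the right neighbor
    return table[i] * (1 - frac) + table[i + 1] * frac

x = np.linspace(-10, 10, 1001, dtype=np.float32)
max_err = np.max(np.abs(lut_sigmoid(x) - 1.0 / (1.0 + np.exp(-x))))   # well under 1e-3
```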
49. NNP-T On-Die Communication
• Bidirectional 2-D mesh architecture
to allow any to any communication
• Cut-through forwarding and multi-
cast support
• 2.6 TB/s total cross-sectional BW
• HBM & SerDes are shared through
the mesh
• Support for direct peer-to-peer
communications between TPCs
50. NNP-T Software Stack
• Full software stack built with open components
• Direct integration with DL frameworks
• nGraph
- Hardware agnostic DL library & compiler
- Provides common set of optimizations for NNP-T
• Argon
- NNP-T DNN compute & communication kernel
library
• Low-level programmability
- NNP-T kernel development toolchain w/ tensor
compiler
[Stack diagram: Argon DNN kernel library, kernel-mode driver, board firmware, chip firmware]
51. Benchmark Performance
Convolution operation (C = # input channels, H = height, W = width, K = # filters, R = filter X, S = filter Y, ST = stride, N = minibatch size)

Description                           Utilization
c64xh56xw56_k64xr3xs3_st1_n128        86%
c128xh28xw28_k128xr3xs3_st1_n128      71%
c512xh28xw28_k128xr1xs1_st1_n128      65%
c128xh28xw28_k512xr1xs1_st1_n128      59%
c256xh14xw14_k1024xr1xs1_st1_n128     62%
c256xh28xw28_k512xr1xs1_st2_n128      71%
c32xh120xw120_k64xr5xs5_st1_n128      87%

GEMM operation

GEMM Size             Utilization
1024 x 700 x 512      31.1%
1760 x 7133 x 1760    44.5%
2048 x 7133 x 2048    46.7%
2560 x 7133 x 2560    57.1%
4096 x 7133 x 4096    57.4%
5124 x 9124 x 2048    55.5%
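Reading these as fractions of the chip's peak throughput (~119 TOPS from the architecture slide, which is an assumption about how utilization is normalized here), the extremes correspond to roughly:

\[ 0.86 \times 119~\text{TOPS} \approx 102~\text{TOPS (best convolution)}, \qquad 0.311 \times 119~\text{TOPS} \approx 37~\text{TOPS (smallest GEMM)} \]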
52. Summary
• Cloud AI accelerators' goals
- Higher cost-performance than GPUs, scalability, programmability
• Compute
- Specialized cores for tensor processing (matrix multiplication, convolution)
• Memory
- HBM
- Distributed on-chip memory & scratchpads
- No hardware caches
• Communications
- High bandwidth on-chip networks
- Custom inter-chip links
- PCIe Gen 4.0 to host
• Software
- Compatibility to existing frameworks (ONNX, TensorFlow, PyTorch)
- Graph compiler + device-oriented optimization
53. References
- https://www.hotchips.org/hc31/HC31_1.14_HabanaLabs.Eitan_Medina.v9.pdf
- https://habana.ai/wp-content/uploads/2019/06/Habana-Gaudi-Training-Platform-whitepaper.pdf
- https://www.graphcore.ai/products/ipu
- https://www.servethehome.com/hands-on-with-a-graphcore-c2-ipu-pcie-card-at-dell-tech-world/
- Z. Jia, “Dissecting the Graphcore IPU Architecture via Microbenchmarking,” Citadel technical report, 2019
- V. Rege, “Graphcore, the Need for New Hardware for Artificial Intelligence,” AI Hardware Summit 2019
- https://www.graphcore.ai/posts/new-graphcore-ipu-benchmarks
- https://m.itbiznews.com/news/newsview.php?ncode=1065569594387854
- https://www.hotchips.org/wp-content/uploads/hc_archives/hc29/HC29.21-Monday-Pub/HC29.21.40-Processors-Pub/HC29.21.410-XPU-FPGA-Ouyang-Baidu.pdf
- https://www.firstxw.com/view/254356.html
- https://www.hotchips.org/hc31/HC31_1.13_Cerebras.SeanLie.v02.pdf
- https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
- https://www.hotchips.org/hc31/HC31_1.12_Intel_Intel.AndrewYang.v0.92.pdf
- https://www.businesswire.com/news/home/20191112005277/en/Intel-Speeds-AI-Development-Deployment-Performance-New