This document provides an overview of CPU, GPU, and TPU architectures for artificial intelligence. It discusses the historical context of the Harvard and von Neumann architectures. It describes key aspects of CPU architecture including CISC/RISC designs. GPU architecture is summarized as being well-suited for data parallelism. The document outlines the TPU architecture including its block diagram and use of matrix operations. Finally, it presents some next technological steps such as analog processors and distributed inference.
AMLI Lviv R&D Lab - Confidential
Table of contents
1. Historical context
2. CPU architecture
3. GPU architecture - Genie in a bottle for artificial intelligence
4. TPU architecture - AI-dedicated processing chip
5. Next technological step
von Neumann architecture
A SISD (Single Instruction, Single Data) architecture built on six principles:
1. The principle of duality.
2. The principle of program control.
3. The principle of memory homogeneity.
4. The principle of memory addressability.
5. The principle of sequential program control.
6. The principle of conditional transition.
Storing programs and data in the same memory makes "programs that write programs" possible.
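The principles above can be sketched as a toy stored-program machine. This is a minimal illustration with a hypothetical instruction set (SET/ADD/JNZ/HALT are invented for this sketch, not a real ISA): because instructions and data live in one memory, the program can rewrite one of its own instructions before it runs.

```python
# Toy von Neumann machine: instructions and data share one memory list,
# so a SET that targets an instruction cell is self-modifying code.
def run(memory):
    pc = 0  # sequential program control: execute cells in order
    while True:
        op, a, b = memory[pc]
        pc += 1
        if op == "HALT":
            return memory
        elif op == "SET":                 # memory[a] = b
            memory[a] = b
        elif op == "ADD":                 # memory[a] += memory[b]
            memory[a] = memory[a] + memory[b]
        elif op == "JNZ":                 # conditional transition
            if memory[a] != 0:
                pc = b

# Cells 0-3 hold instructions, cells 4-5 hold data: one address space.
mem = [
    ("SET", 5, 7),               # data cell 5 := 7
    ("SET", 2, ("ADD", 4, 5)),   # rewrite instruction 2 before it executes
    ("SET", 4, 0),               # replaced above; the ADD runs instead
    ("HALT", 0, 0),
    10,                          # data cell 4
    0,                           # data cell 5
]
final = run(mem)                 # final[4] == 10 + 7 == 17
```

The second instruction overwrites the third, so the machine executes code that did not exist when the program was loaded: the "programs that write programs" property in miniature.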
Flynn data / instruction classification models
● Single instruction stream, single data stream (SISD): traditional uniprocessor machines like older personal computers.
● Single instruction stream, multiple data streams (SIMD): the most common style of parallel programming.
● Multiple instruction streams, single data stream (MISD): an uncommon architecture, generally used for fault tolerance.
● Multiple instruction streams, multiple data streams (MIMD): distributed and multi-core systems.
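The SISD/SIMD distinction can be made concrete by counting instruction issues. A schematic sketch (pure Python, not real hardware; the 4-wide `lanes` value is an arbitrary assumption standing in for a vector width):

```python
# SISD: one instruction operates on one data element per step.
def sisd_add(xs, ys):
    out, instructions = [], 0
    for x, y in zip(xs, ys):
        out.append(x + y)
        instructions += 1            # one instruction per element
    return out, instructions

# SIMD: one instruction operates on a whole vector of `lanes` elements.
def simd_add(xs, ys, lanes=4):
    out, instructions = [], 0
    for i in range(0, len(xs), lanes):
        # conceptually a single vector instruction covering `lanes` elements
        out.extend(x + y for x, y in zip(xs[i:i + lanes], ys[i:i + lanes]))
        instructions += 1
    return out, instructions
```

Both produce identical results, but for 8 elements the SISD version issues 8 instructions while the 4-lane SIMD version issues 2, which is exactly why SIMD dominates data-parallel workloads.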
CISC / RISC / MISC / VLIW CPUs

CISC                                          | RISC
Emphasis on hardware                          | Emphasis on software
Multiple instruction sizes and formats        | Instructions of one size with few formats
Fewer registers                               | More registers
More addressing modes                         | Few addressing modes
Extensive use of microprogramming             | Complexity pushed into the compiler
Instructions take a variable number of cycles | Instructions take one cycle
Pipelining is difficult                       | Pipelining is easy
CPU architecture
● The main task of the CPU is to execute a chain of instructions in the shortest possible time.
● The CPU may execute several chains at the same time.
● After executing them separately, it merges them back into one chain, in the correct order.
● Each instruction in the stream may depend on the instructions preceding it.
Task vs Data parallelism
Task parallel:
– Independent processes with little communication
– Easy to use
Data parallel:
– Lots of data on which the same computation is executed
– No dependencies between data elements in each computation step
– Can saturate many ALUs
– But often requires a redesign of traditional algorithms
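The two styles can be contrasted in a short sketch (illustrative only; `fetch_report` is a hypothetical independent job invented for this example):

```python
from concurrent.futures import ThreadPoolExecutor

# Task parallelism: independent jobs with little communication.
def fetch_report(name):
    # hypothetical stand-alone task; each call ignores the others
    return "report:" + name

with ThreadPoolExecutor() as pool:
    reports = list(pool.map(fetch_report, ["sales", "ops", "hr"]))

# Data parallelism: the same computation applied to every element,
# with no dependencies between elements in a step.
data = list(range(8))
squared = [x * x for x in data]    # each element computed independently
```

The task-parallel half scales with the number of independent jobs; the data-parallel half scales with the number of elements, which is the pattern GPUs exploit.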
CPU vs GPU (GPGPU)
CPU
– Really fast caches (great for data reuse)
– Fine branching granularity
– Lots of different processes/threads
– High performance on a single thread of execution
GPU
– Lots of math units
– Fast access to onboard memory
– Run a program on each fragment/vertex
– High throughput on parallel tasks
● CPUs are great for task parallelism
● GPUs are great for data parallelism
Ideal apps to target GPGPU
– Large data sets
– High parallelism
– Minimal dependencies between data elements
– High arithmetic intensity
– Lots of work to do without CPU intervention
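Arithmetic intensity (arithmetic operations per byte of memory traffic) is one way to check an app against this list. A back-of-the-envelope sketch with assumed numbers (float32 elements, a naive traffic model that ignores caches and tiling):

```python
# Arithmetic intensity of an N x N float32 matrix multiply:
# roughly 2*N^3 multiply-add flops over 3*N^2 elements of traffic
# (read A and B, write C). Ignores caches; illustrative only.
def matmul_intensity(n, bytes_per_elem=4):
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved   # simplifies to n / 6 for float32
```

Intensity grows linearly with N under this model, so large dense matrix multiplies comfortably clear the "high arithmetic intensity" bar, while element-wise operations (a few flops per byte regardless of size) do not.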
Kernels
– Functions applied to each element in a stream
• transforms, PDEs, ...
– No dependencies between stream elements
• Encourages high arithmetic intensity
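The stream model above can be sketched in a few lines: a kernel is a pure per-element function, and a launch maps it over the streams. This is an illustrative sketch, not a GPU API; `saxpy_kernel` and `launch` are names invented here.

```python
# A "kernel" in the stream model: a pure function applied to each
# element, with no dependencies between elements.
def saxpy_kernel(a, x, y):
    return a * x + y   # more math per element fetched raises intensity

def launch(kernel, *streams):
    # On a GPU, each element would get its own thread; here we just map.
    return [kernel(*elems) for elems in zip(*streams)]

result = launch(lambda x, y: saxpy_kernel(2.0, x, y),
                [1.0, 2.0, 3.0],
                [10.0, 20.0, 30.0])
```

Because no element reads another element's result, the launch can process all elements in any order or all at once, which is exactly the independence the slide requires.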
Deep Learning Neural Network
Three kinds of NNs are popular today:
1. Multi-Layer Perceptrons (MLP): Each new layer is a set of nonlinear functions of weighted sums of all outputs (fully connected) from the prior layer.
2. Convolutional Neural Networks (CNN): Each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer, which reuses the weights.
3. Recurrent Neural Networks (RNN): Each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state. The most popular RNN is Long Short-Term Memory (LSTM). The art of the LSTM is in deciding what to forget and what to pass on as state to the next layer. The weights are reused across time steps.
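The three layer types reduce to three core operations. A minimal pure-Python sketch with ReLU as the nonlinearity (1-D, tiny, illustrative; real layers are multi-dimensional and batched):

```python
def relu(v):
    return max(0.0, v)

def mlp_layer(x, w):
    # Fully connected: each output is a nonlinear function of a
    # weighted sum of ALL inputs.
    return [relu(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

def cnn_layer(x, kernel):
    # Convolutional: each output sees only spatially nearby inputs,
    # and the SAME kernel weights are reused at every position.
    k = len(kernel)
    return [relu(sum(kernel[j] * x[i + j] for j in range(k)))
            for i in range(len(x) - k + 1)]

def rnn_step(x, h, wx, wh):
    # Recurrent: the output mixes the current input with the previous
    # state; the same weights are reused across time steps.
    return relu(wx * x + wh * h)
```

The weight-reuse pattern differs in each case (none across positions for the MLP, across space for the CNN, across time for the RNN), which is what drives their very different compute/memory ratios on accelerators.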
FPGA architecture
● CLB (Configurable Logic Block): These are the basic cells of an FPGA. Each CLB consists of one 8-bit function generator, two 16-bit function generators, two registers (flip-flops or latches), and reprogrammable routing controls (multiplexers). The CLBs are used to implement user-designed functions and macros. Each CLB has inputs on every side, which makes it flexible for the mapping and partitioning of logic.
● I/O Pads or Blocks: The input/output pads let outside peripherals access the functions of the FPGA and communicate with it in different applications.
● Switch Matrix / Interconnection Wires: The switch matrix connects the long and short interconnection wires together in flexible combinations. It also contains the transistors that turn connections between different lines on and off.
TPU block diagram
● Instructions come through the PCIe Gen3 x16 bus
● Matrix Multiply Unit: 256x256 8-bit integer multiply-add units
● Accumulators: 4 MiB = 4K x 256 x 32b
● The matrix unit produces one 256-element partial sum per clock cycle
● PCIe functionality:
○ reads data from CPU host memory into the Unified Buffer (UB)
○ reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit
○ orders the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators
● The Activation unit performs the nonlinear function of the artificial neuron, with options for ReLU and Sigmoid. It can also perform pooling operations.
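The datapath described above (8-bit multiplies feeding wide accumulators) can be sketched functionally. This is a behavioral model only, with tiny matrices instead of 256x256, and the int8 range enforced by assertion rather than by hardware types:

```python
# Behavioral sketch of the TPU-style integer datapath: int8 activations
# times int8 weights, products summed into wide (int32-class) accumulators
# so long dot products don't overflow the narrow operand width.
def int8_matmul(acts, weights):
    # acts: M x K, weights: K x N, all values in the int8 range
    assert all(-128 <= v <= 127 for row in acts for v in row)
    assert all(-128 <= v <= 127 for row in weights for v in row)
    m, k, n = len(acts), len(weights), len(weights[0])
    out = [[0] * n for _ in range(m)]     # the wide accumulators
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += acts[i][p] * weights[p][j]   # 8b x 8b product
            out[i][j] = acc                          # accumulated wide
    return out
```

The key point the block diagram encodes: operands stay narrow (cheap multipliers, dense storage) while accumulation stays wide, which is why the accumulators are 32-bit even though the multiply units are 8-bit.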
Matrix operation
(Figure: weights and data flowing through the matrix unit)
● A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront.
● Weights are preloaded, and take effect with the advancing wave alongside the first data of a new block.
● Control and data are pipelined to give the illusion that the 256 inputs are read at once.
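The diagonal wavefront can be shown with a schedule-level sketch (not a cycle-accurate hardware model). Assume a weight-stationary array where activation `acts[i]` enters row i at cycle i and moves one column per cycle, so cell (i, j) fires at cycle i + j; the sketch computes out[j] = sum_i weights[i][j] * acts[i]:

```python
# Schedule-level sketch of a weight-stationary systolic array.
# Cell (i, j) holds weights[i][j] and fires at cycle i + j: the set of
# active cells at each cycle is a diagonal, i.e. the advancing wavefront.
def systolic_matvec(weights, acts):
    n = len(weights)
    out = [0] * n
    finished = []                        # (cycle, column) completion order
    for cycle in range(2 * n - 1):       # time for the wave to cross
        for j in range(n):
            i = cycle - j                # which row's input reaches col j
            if 0 <= i < n:
                out[j] += weights[i][j] * acts[i]
                if i == n - 1:           # last contribution to column j
                    finished.append((cycle, j))
    return out, finished
```

After the pipeline fills (cycle n - 1), exactly one column finishes per cycle, matching the "one partial sum per clock" behavior, even though no cycle ever reads all inputs at once.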
Analog memory matrix
● No chip size limitation
● Fixed NN graph
● Each weight represented by analog memory
● Nonlinear function of memorization and forgetting
(Figure: NN graph with a graph anchor; resistor network states R0/R1 shown during training and forgetting)