Lviv
R&D Lab
Confidential
CPU / GPU / TPU architecture
Dov Nimratz
Senior Solution Architect
Table of contents
1. Historical context
2. CPU architecture
3. GPU architecture - Genie in a bottle for artificial intelligence
4. TPU architecture - AI dedicated processing chip
5. Next technological step
Historical context
Harvard architecture
● Named after the Harvard Mark I, designed by Howard Aiken (proposed in the 1930s)
● Separate storage and signal pathways for instructions and data
● Frequently used in DSPs
von Neumann architecture
SISD (Single Instruction, Single Data) architecture
1. The principle of binary encoding.
2. The principle of stored-program control.
3. The principle of homogeneity of memory (instructions and data share one memory).
4. The principle of memory addressability.
5. The principle of sequential program control.
6. The principle of conditional branching.
Storing instructions and data in the same memory makes "programs that write programs" possible.
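The principles above can be sketched as a toy stored-program machine. This is a minimal illustration, not a model of any real CPU: the instruction names (`LOAD`, `ADD`, `STORE`, `JNZ`, `HALT`), the single accumulator, and the memory layout are all assumptions made up for the example, and instructions are kept as Python tuples rather than binary words. What it shows is one addressable memory holding both code and data, sequential control by default, and a conditional jump.

```python
def run(memory):
    """Execute the program stored in `memory`; code and data share it."""
    pc = 0      # program counter: address of the next instruction
    acc = 0     # single accumulator register (hypothetical design)
    while True:
        op, arg = memory[pc]
        pc += 1                     # sequential control: default is the next address
        if op == "LOAD":
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        elif op == "JNZ":           # conditional jump: taken only if acc != 0
            if acc != 0:
                pc = arg
        elif op == "HALT":
            return memory

# Addresses 0-7 hold instructions, 8-10 hold data, all in the same list.
program = [
    ("LOAD", 9),    # 0: acc = total
    ("ADD", 8),     # 1: acc += counter
    ("STORE", 9),   # 2: total = acc
    ("LOAD", 8),    # 3: acc = counter
    ("ADD", 10),    # 4: acc += -1
    ("STORE", 8),   # 5: counter = acc
    ("JNZ", 0),     # 6: loop while counter != 0
    ("HALT", 0),    # 7: stop
    3,              # 8: counter
    0,              # 9: total
    -1,             # 10: constant -1
]

final_memory = run(list(program))
print(final_memory[9])  # 3 + 2 + 1 = 6
```

Because `STORE` can target any address, a program in this toy machine could also overwrite its own instructions.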
Flynn's data / instruction classification models
● Single instruction stream, single data stream (SISD) - Traditional uniprocessor machines like older personal computers.
● Single instruction stream, multiple data streams (SIMD) - Most common style of parallel programming.
● Multiple instruction streams, single data stream (MISD) - Uncommon architecture which is generally used for fault tolerance.
● Multiple instruction streams, multiple data streams (MIMD) - Distributed and multi-core systems.
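The difference between the first two classes can be sketched in plain Python. This is only a model: real SIMD happens in hardware vector units, and the lane width of 4 is an assumed value chosen for illustration. The function names are hypothetical.

```python
LANES = 4  # assumed vector width, for illustration only

def add_sisd(a, b):
    """SISD: each step applies one instruction to one pair of elements."""
    out = []
    for i in range(len(a)):
        out.append(a[i] + b[i])
    return out

def add_simd(a, b, lanes=LANES):
    """SIMD (modelled): each step applies one instruction to a group of lanes."""
    out = []
    steps = 0
    for start in range(0, len(a), lanes):
        # Conceptually a single vector add over up to `lanes` elements.
        out.extend(x + y for x, y in zip(a[start:start + lanes],
                                         b[start:start + lanes]))
        steps += 1
    return out, steps

a = [1, 2, 3, 4, 5, 6]
b = [10, 20, 30, 40, 50, 60]
sisd_result = add_sisd(a, b)
simd_result, simd_steps = add_simd(a, b)
print(sisd_result)   # [11, 22, 33, 44, 55, 66]
print(simd_steps)    # 2 vector steps instead of 6 scalar steps
```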
CPU architecture
CISC / RISC / MISC / VLIW CPUs
CISC                                          | RISC
Emphasis on hardware                          | Emphasis on software
Multiple instruction sizes and formats        | Instructions of the same size with few formats
Fewer registers                               | More registers
More addressing modes                         | Fewer addressing modes
Extensive use of microprogramming             | Complexity in the compiler
Instructions take a varying number of cycles  | Instructions take one cycle
Pipelining is difficult                       | Pipelining is easy
CPU architecture
● The main task of the CPU is to execute a chain of instructions in the shortest possible time.
● The CPU may split the chain and execute several chains at the same time.
● After executing them separately, it merges the results back into one stream, in the correct order.
● An instruction in the stream may depend on the instructions preceding it.
ALU architecture
GPU architecture -
Genie in a bottle for artificial intelligence
Task vs Data parallelism
Task parallel:
– Independent processes with little communication
– Easy to use
Data parallel:
– Lots of data on which the same computation is being executed
– No dependencies between data elements in each computation step
– Can saturate many ALUs
– But often requires redesign of traditional algorithms
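A minimal sketch of the two styles using Python's standard `concurrent.futures` module. The helper functions (`word_count`, `checksum`, `square`) are hypothetical workloads invented for the example, and the sketch shows only the structure of each style: Python threads are not expected to speed up CPU-bound work like this.

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(text):
    return len(text.split())

def checksum(values):
    return sum(values) % 256

def square(x):
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # Task parallelism: two unrelated jobs that do not communicate.
    count_future = pool.submit(word_count, "the quick brown fox")
    sum_future = pool.submit(checksum, [200, 100, 7])
    task_results = (count_future.result(), sum_future.result())

    # Data parallelism: the same computation on every element,
    # with no dependencies between elements.
    data_results = list(pool.map(square, range(8)))

print(task_results)  # (4, 51)
print(data_results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```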
CPU vs GPU (GPGPU)
CPU
– Really fast caches (great for data reuse)
– Fine branching granularity
– Lots of different processes/threads
– High performance on a single thread of execution
GPU
– Lots of math units
– Fast access to onboard memory
– Run a program on each fragment/vertex
– High throughput on parallel tasks
● CPUs are great for task parallelism
● GPUs are great for data parallelism
Ideal apps to target GPGPU
– Large data sets
– High parallelism
– Minimal dependencies between data elements
– High arithmetic intensity
– Lots of work to do without CPU intervention
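Arithmetic intensity is the number of arithmetic operations performed per byte of memory traffic. The sketch below compares two workloads under simplifying assumptions that are not from the slides: 32-bit floats, and ideal data reuse where each input and output moves through memory exactly once. The function names are hypothetical.

```python
BYTES_PER_FLOAT = 4  # assumption: 32-bit floating point values

def arithmetic_intensity(operations, bytes_moved):
    """Operations performed per byte of memory traffic."""
    return operations / bytes_moved

def vector_add_intensity(n):
    # n additions; two input vectors are read and one output vector is written.
    return arithmetic_intensity(n, 3 * n * BYTES_PER_FLOAT)

def matmul_intensity(n):
    # An n x n matrix multiply needs 2 * n**3 operations (multiply and add);
    # with ideal reuse, each of the three n x n matrices moves only once.
    return arithmetic_intensity(2 * n**3, 3 * n * n * BYTES_PER_FLOAT)

low = vector_add_intensity(1024)   # about 0.083 operations per byte
high = matmul_intensity(1024)      # about 170.7 operations per byte
print(low, high)
```

Under these assumptions, vector addition stays at 1/12 operation per byte regardless of size, while matrix multiplication grows linearly with n, which is why it can keep many math units busy.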
Graphics pipeline in GPU
Kernels
– Functions applied to each element in stream
  • transforms, PDE, ...
– No dependencies between stream elements
  • Encourage high Arithmetic Intensity
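A kernel in this sense can be sketched as a function that touches only the element at its own index. The example below is plain Python standing in for a GPU launch: `saxpy_kernel` and `launch` are hypothetical names, and the loop runs sequentially where a GPU would run the calls in parallel.

```python
def saxpy_kernel(i, alpha, x, y, out):
    """Kernel body: reads and writes only element i of the stream."""
    out[i] = alpha * x[i] + y[i]

def launch(kernel, indices, *args):
    """Sequential stand-in for a GPU launch: one kernel call per element."""
    for i in indices:
        kernel(i, *args)

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

forward = [0.0] * len(x)
launch(saxpy_kernel, range(len(x)), 2.0, x, y, forward)

# With no dependencies between elements, execution order does not matter.
backward = [0.0] * len(x)
launch(saxpy_kernel, reversed(range(len(x))), 2.0, x, y, backward)

print(forward)  # [12.0, 24.0, 36.0, 48.0]
```

The order independence shown by the second launch is the property that lets hardware schedule the elements across many ALUs.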
GPGPU block diagram
● SIMD (single instruction, multiple data)
● 8-16 stream cores in each processor
● PE (processing element) / ALU
Patterns
Deep Learning Neural Network
Three kinds of NNs are popular today:
1. Multi-Layer Perceptrons (MLP): Each new layer is a set of
nonlinear functions of weighted sum of all outputs (fully
connected) from a prior one, which reuses the weights.
2. Convolutional Neural Networks (CNN): Each ensuing layer is
a set of nonlinear functions of weighted sums of spatially
nearby subsets of outputs from the prior layer, which also
reuses the weights.
3. Recurrent Neural Networks (RNN): Each subsequent layer is
a collection of nonlinear functions of weighted sums of
outputs and the previous state. The most popular RNN
is Long Short-Term Memory (LSTM). The art of the LSTM is
in deciding what to forget and what to pass on as state to the
next layer. The weights are reused across time steps.
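The common building block in all three descriptions, a nonlinear function of a weighted sum, can be sketched for the fully connected case. This is a minimal illustration: ReLU is chosen as the nonlinearity, bias terms are left out, and the weight values are made up.

```python
def relu(value):
    """A common nonlinear function: pass positives, clamp negatives to zero."""
    return max(0.0, value)

def dense_layer(inputs, weights, activation=relu):
    """Fully connected layer: each output is the activation of a weighted
    sum of all inputs. weights[j] holds the weights of output neuron j."""
    return [activation(sum(w * x for w, x in zip(row, inputs)))
            for row in weights]

inputs = [1.0, 2.0]
weights = [
    [0.5, -1.0],   # neuron 0: 0.5*1.0 - 1.0*2.0 = -1.5, clamped to 0.0
    [1.0, 1.0],    # neuron 1: 1.0*1.0 + 1.0*2.0 = 3.0
]
hidden = dense_layer(inputs, weights)
print(hidden)  # [0.0, 3.0]
```

Stacking several such layers gives an MLP. The weighted sums are matrix-vector products, which is the operation the accelerators discussed here are built around.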
FPGA architecture
● CLB (Configurable Logic Block): CLBs are the basic cells of an FPGA. Each CLB consists of one 8-bit function generator, two 16-bit function generators, two registers (flip-flops or latches), and reprogrammable routing controls (multiplexers). CLBs are used to implement user-designed functions and macros. Each CLB has inputs on each side, which makes it flexible for the mapping and partitioning of logic.
● I/O Pads or Blocks: The input/output pads let outside peripherals access the functions of the FPGA, and let the FPGA communicate with different peripherals for different applications.
● Switch Matrix / Interconnection Wires: The switch matrix connects the long and short interconnection wires together in flexible combinations. It also contains the transistors that turn connections between different lines on and off.
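A function generator of this kind is a small lookup table (LUT): a 16-bit generator stores one output bit for each combination of 4 inputs, and an 8-bit generator does the same for 3 inputs. The sketch below models that idea in Python. `make_lut` and the two configured functions are hypothetical examples, not taken from any vendor tool.

```python
def make_lut(truth_bits):
    """Build a lookup-table function. truth_bits[k] is the output when the
    inputs, read as a binary number (first input = most significant), equal k."""
    def lut(*inputs):
        index = 0
        for bit in inputs:
            index = (index << 1) | (1 if bit else 0)
        return truth_bits[index]
    return lut

# 16 configuration bits = any function of 4 inputs; here, 4-input parity (XOR).
parity4 = make_lut([bin(k).count("1") % 2 for k in range(16)])

# 8 configuration bits = any function of 3 inputs; here, majority vote.
majority3 = make_lut([1 if bin(k).count("1") >= 2 else 0 for k in range(8)])

print(parity4(1, 0, 1, 1))   # 1 (odd number of ones)
print(majority3(1, 1, 0))    # 1 (two of three inputs set)
```

Reprogramming the FPGA amounts to loading different truth-table bits and routing settings, with no change to the silicon.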
TPU architecture -
AI dedicated processing chip
TPU block diagram
● Instructions arrive from the host over a PCIe Gen3 x16 bus
● MMU (Matrix Multiply Unit) - 256x256 multiply-accumulate units operating on 8-bit integers
● Accumulators - 4 MiB = 4K x 256 x 32b
● The Matrix Unit produces one 256-element partial sum per clock cycle
● Instructions received over PCIe:
○ Read data from the CPU host memory into the Unified Buffer (UB).
○ Read weights from Weight Memory into the Weight FIFO as input to the Matrix Unit.
○ Direct the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators.
● Activate performs the nonlinear function of the artificial neuron, with options such as ReLU and Sigmoid. It can also perform pooling operations.
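The sizes on this slide can be checked with a few lines of arithmetic. One assumption is made explicit here: the accumulator entries are taken to be 32 bits wide, since that is the width at which 4K x 256 entries add up to 4 MiB, while 8-bit entries would give only 1 MiB. The helper name is hypothetical.

```python
MATRIX_DIM = 256  # the matrix unit is 256 x 256

# One multiply-accumulate cell per matrix position.
macs_per_cycle = MATRIX_DIM * MATRIX_DIM

def accumulator_mib(entries, width, bits_per_value):
    """Storage in MiB for `entries` rows of `width` values each."""
    return entries * width * bits_per_value / 8 / 2**20

acc_32bit = accumulator_mib(4096, 256, 32)  # assumed 32-bit accumulators
acc_8bit = accumulator_mib(4096, 256, 8)    # for comparison only

print(macs_per_cycle)  # 65536
print(acc_32bit)       # 4.0
print(acc_8bit)        # 1.0
```

Wide accumulators are also what one would expect for summing 256 products of 8-bit values, since each product alone can need 16 bits.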
Matrix operation
[Figure labels: Weights, Data]
● A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront.
● Weights are preloaded, and take effect with the advancing wave alongside the first data of a new block.
● Control and data are pipelined to give the illusion that the 256 inputs are read at once.
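The diagonal wavefront can be reproduced with a small cycle-by-cycle simulation. This is a sketch under stated assumptions, not the actual TPU design: weights stay fixed in an n x n grid of cells, inputs enter from the left with row i delayed by i cycles, and partial sums flow downward. The function name and register layout are hypothetical, and a tiny 2 x 2 array stands in for 256 x 256.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a simulated n x n weight-stationary systolic array.
    X is a list of input vectors (length n each); W is an n x n weight grid."""
    n = len(W)
    batch = len(X)
    a_reg = [[0] * n for _ in range(n)]  # input values moving left to right
    p_reg = [[0] * n for _ in range(n)]  # partial sums moving top to bottom
    out = [[0] * n for _ in range(batch)]
    for t in range(batch + 2 * (n - 1)):          # total pipeline cycles
        new_a = [[0] * n for _ in range(n)]
        new_p = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:
                    b = t - i                     # row i is fed i cycles late
                    a_in = X[b][i] if 0 <= b < batch else 0
                else:
                    a_in = a_reg[i][j - 1]        # value from the left neighbour
                p_in = p_reg[i - 1][j] if i > 0 else 0  # sum from the cell above
                new_a[i][j] = a_in
                new_p[i][j] = p_in + W[i][j] * a_in
        a_reg, p_reg = new_a, new_p
        for j in range(n):
            b = t - (n - 1) - j                   # vector finishing in column j
            if 0 <= b < batch:
                out[b][j] = p_reg[n - 1][j]
    return out

X = [[1, 2], [3, 4], [5, 6]]
W = [[1, 2], [3, 4]]
result = systolic_matmul(X, W)
print(result)  # [[7, 10], [15, 22], [23, 34]]
```

In this model, input vector b is being processed at cycle t by exactly the cells with i + j = t - b, which is the diagonal wave. After the pipeline fills, one finished output emerges per column per cycle.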
Memory subsystem Architecture
Next technological step
AI Technology Trend
[Figure: chart with axes "Number of semiconductor elements per process module" and "Number of process modules per system". Labeled points: CPU, GPU, TPU, analog processors, distributed inference, cognitive computing. Other labels: NN, CNN, optimized models, "Today".]
Analog memory matrix
● No chip size limitation
● Fixed NN graph
● Each weight represented by analog memory
● Nonlinear function of memorization and forgetting
[Figure labels: N, R, Ro, R1; Training; Forgetting; NN Graph; Graph Anchor]
Questions?
Skype: dovnmr
E-mail: dov.nimratz@globallogic.com
Thank You
