MaPU – A Novel Mathematical Computing Architecture
Donglin Wang, Shaolin Xie et al.
shaolin.xie@ia.ac.cn
Outline
1 Introduction
2 Architecture Highlight
3 The First MaPU Chip
4 Test Result & Analysis
Introduction—Initial Ideas
As transistor count increases, more complex computations can be implemented as dedicated hardware.
[Timeline: on-chip transistor count increase under Moore's law, from the Intel 4004 (1971) through the Motorola 6809, AMD Am9511, TI TMS32010, IBM POWER1, UltraSPARC I, PowerPC AltiVec and Pentium III SSE, to the Nvidia GeForce (2006)]
What's next?
Introduction—Initial Ideas
We try to map the mathematical equations to hardware!
Computing: massive cascading function units with simple controller logic, to reduce the power consumed by control logic.
Storage: a novel multi-granularity parallel memory, to provide various memory access patterns.
[Figure: (a) data path mapping for FFT, Y = A ± B·W, with function units FU0–FU3 cascaded between memories Mem0/Mem1 and Mem2; (b) data path mapping for matrix multiply, Y = A·B]
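For reference, the butterfly the FFT mapping above refers to, Y = A ± B·W, combines two complex inputs A and B with a twiddle factor W. The following is a plain C rendering of one radix-2 butterfly, given only as a textbook reference point for the data flow being mapped; it is not MaPU code.

#include <complex.h>
#include <stdio.h>

/* One radix-2 FFT butterfly: Y0 = A + B*W, Y1 = A - B*W (complex values). */
static void butterfly(double complex a, double complex b, double complex w,
                      double complex *y0, double complex *y1)
{
    double complex t = b * w;   /* multiply by the twiddle factor */
    *y0 = a + t;
    *y1 = a - t;
}

int main(void)
{
    double complex y0, y1;
    butterfly(1.0 + 2.0*I, 3.0 - 1.0*I, 0.0 + 1.0*I, &y0, &y1);
    printf("Y0 = %.1f%+.1fi, Y1 = %.1f%+.1fi\n",
           creal(y0), cimag(y0), creal(y1), cimag(y1));
    return 0;
}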
Introduction—Instruction Set Architecture Overview
Scalar pipeline & microcode pipeline
[Block diagram: scalar pipeline (scalar register file, scalar controller), microcode fetch, microcode memory, microcode controller, CSU, FU0–FU9 over the multi-granularity parallel storage, and three bus interfaces]
• Massive function units (e.g. ALU, MAC, LD/SD): each FU is controlled by a microcode; relatively simple hardware; wide SIMD, e.g. 16 × 32 bits
• Forwarding matrix between FUs: highly structured; enables functional unit cascading
• Multi-Granularity Parallel (MGP) storage system: simultaneous matrix row and column access
• Parallel microcode emission: tens of microcodes at each cycle; microcode can be updated dynamically
• VLIW scalar pipeline: controls the microcode pipeline; handles communication with the SoC
Outline
1 Introduction
2 Architecture Highlight
3 The First MaPU Chip
4 Test Result & Analysis
Architecture Highlight
Highlight 1: Multi-granularity parallel (MGP) storage system
A system that enables W-byte parallel access to matrix row or column elements with the same data layout:
• Requires a granularity parameter G = 2^g, where g ranges from 0 to log2 W
• Requires W physical memory banks, each of which can read/write W bytes in parallel
• Physical memory banks cascade and group into logic banks according to the granularity parameter G
• All logic banks are accessed with the same address
Donglin Wang, Shaolin Xie, et al., "Multi-granularity parallel storage system and storage," US patent application US 14/117,792.
Architecture Highlight
Highlight 1: Multi-granularity parallel storage system
Suppose W = 4. G consecutive physical memory banks cascade into a logic bank, and each logic bank reads/writes G bytes:
• granularity = 1: 4 logic banks, each logic bank reads 1 byte
• granularity = 2: 2 logic banks, each logic bank reads 2 bytes
• granularity = 4: 1 logic bank, each logic bank reads 4 bytes
[Figure: byte addresses 0–63 spread over the four physical memory banks, shown regrouped into logic banks at address = 0 for granularity 1, 2 and 4]
Architecture Highlight
Highlight 1: Multi-granularity parallel storage system
Simultaneous matrix row and column access:
• Suppose the storage bit width W = 4 bytes
• The i-th row is placed in the logic bank labeled "i % W"
• The matrix can then be accessed in parallel by row (G = 4) or by column (G = 1)
[Figure: a 5×5 matrix (elements 0–24) laid out across Logic Bank 0–3; with G = 4 a whole row is read in one access, and with G = 1 a whole column is read in one access]
Donglin Wang, Shaolin Xie, et al., "Multi-granularity parallel storage system and storage," US patent application US 14/117,792.
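To make the placement rule above concrete, here is a minimal C sketch of the property behind the simultaneous row/column access. It models only the "row i goes to the logic bank labeled i % W" rule from the slide; the chip's actual intra-bank address generation is not described here, so the helper bank_of and the loop bounds are illustrative assumptions.

#include <stdio.h>

#define W 4   /* interface width in bytes = number of physical memory banks */

/* Logic bank holding element (row, col) under the rule "row i -> bank i % W".
   The column index does not affect the bank in this model. */
static int bank_of(int row, int col)
{
    (void)col;
    return row % W;
}

int main(void)
{
    /* Column access (G = 1): W consecutive rows of one column land in W
       distinct logic banks, so the column is read in one parallel cycle. */
    int col = 2;
    printf("column %d, rows 0..%d -> banks:", col, W - 1);
    for (int i = 0; i < W; i++)
        printf(" %d", bank_of(i, col));
    printf("   (all distinct, so the access is conflict-free)\n");

    /* Row access (G = W): every element of a row sits in the same logic bank,
       which spans all W physical banks and delivers W bytes per access. */
    int row = 3;
    printf("row %d, cols 0..4 -> banks:", row);
    for (int j = 0; j < 5; j++)
        printf(" %d", bank_of(row, j));
    printf("   (one logic bank, W-byte-wide access)\n");
    return 0;
}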
Architecture Highlight
Highlight 2: High-Dimension Data Model
• Data has up to 4 dimensions; each dimension is described by configuration registers KB (base address), KS (address stride) and KI (total number of elements), i.e. (KB0, KS0, KI0) through (KB3, KS3, KI3)
• Data is accessed through the Load/Store Unit (BIU); access is controlled by microcode and addresses are calculated automatically
• Address calculation for the matrix elements mentioned before is greatly simplified at the program level, e.g.:
BIU0->DM(A++, K++);
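As a rough illustration of how the (KB, KS, KI) triples could drive automatic address generation, here is a hedged C sketch. The nested-loop semantics, the use of KB0 alone as the overall base, and the example strides are assumptions made for illustration; the slide only states that the BIU calculates addresses automatically from these configuration registers.

#include <stdio.h>

#define NDIM 4

typedef struct {
    unsigned base;    /* KB: base address            */
    int      stride;  /* KS: address stride in bytes */
    unsigned count;   /* KI: number of elements      */
} dim_cfg;

/* Emit every address described by the 4-dimensional configuration.
   Only cfg[0].base is used as the starting address in this model. */
static void walk(const dim_cfg cfg[NDIM])
{
    for (unsigned i3 = 0; i3 < cfg[3].count; i3++)
    for (unsigned i2 = 0; i2 < cfg[2].count; i2++)
    for (unsigned i1 = 0; i1 < cfg[1].count; i1++)
    for (unsigned i0 = 0; i0 < cfg[0].count; i0++) {
        unsigned addr = cfg[0].base
                      + i0 * cfg[0].stride + i1 * cfg[1].stride
                      + i2 * cfg[2].stride + i3 * cfg[3].stride;
        printf("%u\n", addr);
    }
}

int main(void)
{
    /* Example: walk a 4x4 byte matrix stored row-major with a 16-byte row
       pitch, column by column: the inner dimension steps by the row pitch,
       the next dimension steps by one element. */
    dim_cfg cfg[NDIM] = {
        { 0, 16, 4 },   /* dim 0: 4 rows,    stride = row pitch */
        { 0,  1, 4 },   /* dim 1: 4 columns, stride = 1 byte    */
        { 0,  0, 1 },   /* dims 2 and 3 unused                  */
        { 0,  0, 1 },
    };
    walk(cfg);
    return 0;
}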
Architecture Highlight
Highlight 3: Cascading pipeline with state-machine-based program model
Program each function unit as a state machine:
• Easy to map the mathematical equations
• Timing of the pipeline is highly predictable
[Figure: state machines of four cascaded functional units, running for 10, 7, 20 and 22 cycles and started with delays of 2, 5 and 4 cycles between them]
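Because each FU runs a state machine of fixed length started after a fixed delay, the cycle at which every stage begins and ends can be worked out statically, which is what makes the pipeline timing predictable. The short C sketch below uses the cycle counts from the figure; the assumption that each delay is counted from the previous FU's start is ours, for illustration only.

#include <stdio.h>

int main(void)
{
    int length[4] = { 10, 7, 20, 22 };  /* cycles each FU state machine runs */
    int delay[4]  = {  0, 2,  5,  4 };  /* start delay after the previous FU */

    int start = 0, finish = 0;
    for (int fu = 0; fu < 4; fu++) {
        start += delay[fu];
        int end = start + length[fu];
        if (end > finish)
            finish = end;
        printf("FU%d: starts at cycle %2d, ends at cycle %2d\n",
               fu + 1, start, end);
    }
    printf("the whole cascade finishes at cycle %d\n", finish);
    return 0;
}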
Architecture Highlight
Highlight 3: Cascading pipeline with state-machine-based program model
[Figure: data path for FFT computation. BIU0/BIU1 load input data from Memory 0/1, SHU0, SHU1, FALU and FMAC process it through the forwarding matrix, and BIU2 writes the result to Memory 2; conceptually, the FU1–FU4 pipelines cascade through forwarding pipelines from input data to result]
.hmacro FU1SM
LPTO (1f ) @ (KI12); //Loop to label 1, loop counts stored in KI12 register
//load data to Register file and calculate the next load address
BIU0.DM(A++,K++) -> M[0];
NOP; //idle for one cycle
BIU0.DM(A++,K++) -> M[0];
NOP; //idle for one cycle
1:
.endhmacro
FU1 State Machine
.hmacro MainSM
FU1SM; //start FU1 state machine
REPEAT@(6); // wait 6 cycles
FU2SM || FU3SM; // start FU2 and FU3 state machine
REPEAT@(6) ; // wait 6 cycles
FU4SM ; // Start FU4 state machine
.endhmacro
Top State Machine
Outline
1 Introduction
2 Architecture Highlight
3 The First MaPU Chip
4 Test Result & Analysis
The First MaPU Chip
SoC Architecture
SoC features:
• 4 MaPU cores (APE: Algebraic Process Engine) with an ARM core
• 3-level bus connection
• Dedicated co-processors
[Block diagram: four APEs with local memories on high-speed networks, two DDR3 controllers, PCIe 0/1, RapidIO 0/1, a Cortex-A8 with shared memory, Ethernet, a third DDR3 controller, GPU and CODEC, and low-speed peripherals (GPIO, UART, external bus interface, I2C/SPI, IIS, timer, watchdog, interrupt, JTAG, reset) on the L1 and L2 buses]
The First MaPU Chip
APE Core Architecture
Architecture features:
• 10 functional units / 14 microcode slots
• 6 MGP memories (MGP0–MGP5), 2 Mbit each
• Large matrix register file: 4 read / 4 write ports, circular-buffer and sliding-window auto-index modes, each read port controlled by microcode (M.r0–M.r3)
• 512-bit data path
[Block diagram: scalar pipeline (scalar register file, scalar controller), microcode fetch/memory/controller, CSU, the FUs SHU0, SHU1, IALU, IMAC, FALU, FMAC, the matrix register file, BIU0–BIU2, the multi-granularity parallel storage, and three AXI interfaces]
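The slide does not spell out how the circular-buffer and sliding-window auto-index mode works, so the following C sketch is only a rough illustration of the general concept: a read port whose index advances automatically within a window and wraps around. The structure, sizes and register names here are invented for the example.

#include <stdio.h>

typedef struct {
    int base;   /* first matrix register of the window */
    int size;   /* window length (circular buffer)     */
    int index;  /* current offset inside the window    */
} read_port;

/* Return the register currently selected by the port, then auto-advance
   the index and wrap it around the window (circular-buffer behaviour). */
static int next_reg(read_port *p)
{
    int reg = p->base + p->index;
    p->index = (p->index + 1) % p->size;
    return reg;
}

int main(void)
{
    read_port r0 = { .base = 8, .size = 4, .index = 0 };  /* window M[8..11] */
    for (int cycle = 0; cycle < 6; cycle++)
        printf("cycle %d: port reads M[%d]\n", cycle, next_reg(&r0));
    return 0;
}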
The First MaPU Chip
Tool Chain (tool: open-source framework it is based on)
• Compiler for the state-machine-based language: Ragel & Bison & LLVM
• C compiler for the scalar pipeline: Clang & LLVM
• Assembler/disassembler for both scalar & microcode pipelines: Ragel & Bison & LLVM
• Linker for both scalar & microcode pipelines: Binutils Gold
• Debugger for the scalar pipeline: GDB
• Simulator (scalar & microcode): Gem5
• Emulator: OpenOCD
The First MaPU Chip
SoC Implementation
Features:
• 40 nm LP process
• 363.468 mm²
• 1681 pads
[Die layout: APE0–APE3 around a central bus matrix, with SRIO0/1, PCIe0/1, the DDR3 controllers and PHYs, GMAC, the Cortex-A8 and other IPs at the periphery]
The First MaPU Chip
APE Implementation
Implementation features:
• 35.992 mm²
• 1.0 GHz frequency (typical case)
[APE layout: the multi-granularity parallel memory system, microcode memory, scalar pipeline, SHU0/SHU1, bus & microcode controller, load/store & forwarding logic, the FMAC/IMAC/FALU/IALU/MReg tiles, turbo decoders, and a 64-bit data path block]
Content
1 Introduction
2 Architecture Highlight
3 The First MaPU Chip
4 Test Result & Analysis
Test Result
The Physical Chip
[Photo: the packaged chip and test board; the packaged chip measures 42.50 × 42.50 mm]
Test Result
APE Core Performance Compared to the TI C66x Core
Average speedup of APE vs. the TI C66x core: Cplx SP FFT 2.00x, Cplx FP FFT 1.89x, Matrix mul 4.77x, 2D filter 6.94x, SP FIR 6.55x
• APE: real chip @ 1 GHz
• TI C66x core: CCSv5 cycle-accurate simulator with DSPLIB and IMGLIB @ 1.25 GHz
• The C66x core has a similar process node (40 nm), a similar VLIW micro-architecture, and similar logic resources (SIMD fixed & float)
• The results can be improved further through microcode optimization
Test Result
Average power of APE @ 1 GHz (watts), estimated vs. tested:
• Cplx SP FFT: 2.81 / 2.95
• Cplx FP FFT: 2.63 / 2.85
• Matrix mul: 3.05 / 3.10
• 2D filter: 4.13 / 4.15
• SP FIR: 2.19 / 2.20
• Table lookup: 2.75 / 2.95
• Matrix Trans: 2.28 / 2.45
• Idle: 1.51 / 1.55 (needs improvement)
Estimated power: PrimeTime with switching activity and final SDF annotation
Tested power: recorded as the current increase after invoking the APE core
The difference is < 8%, so the following power analysis is reliable
Test Result
Instruction Statistics and Power Efficiency
Algorithm | Tested power (W) | Size | Overall time (µs) | Instruction count (MR0 MR1 MR2 MR3 SHU0 SHU1 IALU IMAC FALU FMAC BIU0 BIU1 BIU2)
Cplx SP FFT | 2.95 | 1024 | 2.692 | 1908 1841 1888 12 1878 1836 0 0 1771 1847 807 746 832
Cplx FP FFT | 2.85 | 1024 | 1.500 | 942 930 897 10 894 894 0 885 0 0 255 193 288
Matrix mul | 3.10 | 65*66*67 | 29.884 | 29478 0 0 1650 29478 0 0 0 12854 29476 1650 488 7451
2D Filter | 4.15 | 508*508, 5*5 | 106.064 | 20336 20336 0 0 105740 105740 0 105737 0 0 8703 20344 7599
SP FIR | 2.20 | 4096, 128 | 35.082 | 2048 2049 0 0 34817 34817 0 0 1792 34817 265 2048 511
Table lookup | 2.95 | 4096 | 0.758 | 258 192 128 0 257 0 320 0 0 0 4 64 64
Matrix trans | 2.45 | 512*256 | 8.386 | 0 0 0 4098 0 0 0 0 0 0 4099 0 4096
[Chart: tested power efficiency of the micro benchmarks (computation GOPS/W and total GOPS/W) for Cplx SP FFT, Cplx FP FFT, Matrix Mul, 2D Filter, FIR, Table Lookup and Matrix Trans; values shown: 18.49, 26.63, 17.49, 61.50, 29.24, 16.51, 45.69 and 66.19, 36.93, 103.49, 45.48, 45.99, 19.15, with the 2D filter reaching 103.49 GOPS/W]
[Chart: power efficiency of current processors [7], in GFLOPS/W, at 22–40 nm process nodes; values shown: 1.2, 2.6, 5, 5.4, 7, 7.4, 8.2, 10, 26 and 45.7]
[7] M. H. Ionica and D. Gregg, "The Movidius Myriad architecture's potential for scientific computing," IEEE Micro, vol. 35, no. 3, pp. 6–14, 2015.
Analysis
Where does the power efficiency come from?
• Hardware effort plus software effort yields a matched datapath
• Little effort was spent on circuit optimization, but the energy-inefficient memory accesses are minimized, so little energy is consumed by memory
• Most energy is consumed by computation; little energy is consumed by control logic: only 0.24% goes to microcode fetching, dispatching and storage
[Chart: instruction composition of the different benchmarks (MReg access / computation / load-store), as percentages from 0% to 100%, for Cplx SP FFT, Cplx FP FFT, Matrix Mul, 2D Filter, FIR, Table lookup and Matrix Trans]
[Chart: average energy consumed by different FUs (512-bit SIMD), in pJ: Register R/W 133.25, Load/Store 609.2, FALU 345.65, IALU 335.18, FMAC 387.23, IMAC 788.77, SHU 213.04]
Conclusion—Contribution
Proposed and verified a highly customizable architecture with a real chip:
• For architecture designers: FU sets are highly customizable for different workloads.
• For programmers: the data path is highly customizable for different workloads through microcode.
Proposed and verified a novel memory system that supports row- and column-wise matrix access:
• Parallel access without conflict; potential usage in applications with regular access patterns.
• Can be integrated into other architectures.
Conclusion—Future work
High-level programming model
• Great effort is needed to program MaPU
• About 1 month per micro benchmark
Circuit-level optimization
• Only the register file is customized
• Standby power is high (1.5 W; the clock network needs improvement)
Customize FU sets for different workloads
• FU sets for communication applications
• FU sets for multimedia processing
• FU sets for machine learning
• …
Thanks & Questions?
Feel free to contact shaolin.xie@ia.ac.cn or Shawnless.xie@gmail.com
Editor's Notes

  1. Our work is a novel mathematical computing architecture called MaPU ( Ma PU for short)
  2. First I would like to introduce the MaPU architecture briefly, and then some interesting features of MaPU. Next I will introduce the chip and the test results.
  3. Here is the initial idea of MaPU. As Moore's law is still effective, a tremendous number of transistors can be used for computing, and more complex computation can be supported directly by hardware. For example, in the early days only fixed-point addition was supported, but now wide SIMD vectors and CGRAs are common in processors. We try to push this trend one step further.
  4. Therefore, we started our work with a very simple idea: try to map the mathematical primitives directly to the programmable hardware, so as to boost performance while reducing power. For example, map the FFT and matrix multiply onto a configurable data path. To achieve this goal, we have worked on two aspects: one is computing, and the other is storage.
  5. OK, this is the simplified diagram of the MaPU architecture. It is made up of three main components: a scalar pipeline, a microcode pipeline, and a multi-granularity parallel storage system. The scalar pipeline is used for controlling the microcode pipeline. The microcode pipeline contains massive function units, and all of the FUs are connected by a forwarding matrix; they operate in a SIMD manner and are controlled by microcodes. The multi-granularity parallel storage system supports simultaneous matrix row and column access with the same layout; we will explore this feature in more detail later on. The microcode pipeline has many features in common with coarse-grained reconfigurable architectures, but it uses a tightly coupled forwarding matrix instead of dedicated routing units to forward data. With this forwarding matrix, FUs can cascade into a compact data path that resembles the data flow of the algorithm. Therefore, it may provide performance and power efficiency comparable with that of an ASIC.
  6. First I would like to introduce the MaPU architecture briefly,
  7. OK, next I would like to introduce some interesting features of the MaPU. The first feature of MaPU is the MGP storage system. It supports simultaneous matrix row- or column-wise access with the same layout. An MGP memory with a W-byte interface has the following requirements. First, it requires a granularity parameter G, which should be a power of 2, with the exponent ranging from 0 to log2 W. Second, it requires W physical memory banks, each of which can read/write W bytes in parallel. When accessed, the physical memory banks cascade and group into logic banks according to the granularity parameter G, and all logic banks have the same address map and are accessed with the same address.
  8. Here is an illustration of the MGP memory. We suppose W is 4. When accessed, G consecutive physical memory banks cascade into a logic bank, and each logic bank reads/writes G bytes. For example, when G=1, there are 4 logic banks and each logic bank reads 1 byte. When G=2, 2 physical memory banks cascade into a logic bank, and each logic bank accesses 2 bytes. When G=4, there is only one logic bank and the MGP memory falls back to an ordinary memory system with a W-byte interface.
  9. The MGP memory system supports simultaneous matrix row- and column-wise access with the same layout, but the matrix has to be initialized properly. A simple rule is that the i-th row is placed in the logic bank labeled i%W, and the rows in the same logic bank should be consecutive. For example, the first row is placed in logic bank 0, the next row is placed in logic bank 1, and the fifth row is placed in logic bank 0 again. When accessing row elements, set G to 4; then the row elements can be accessed in parallel just as in an ordinary memory system. When accessing column elements, set G to 1; there are then four logic banks in the system, each providing 1-byte access, so the column elements can be accessed in parallel. For example, the elements 0, 5, 10, 15 of the first column can be accessed in one cycle.
  10. The second feature of MaPU is the high-dimension data model. In the microcode pipeline, only high-dimension vector data are supported. Each dimension of the data is represented by a set of configuration registers, which contains the base address, the address stride and the total number of elements. Addresses are calculated automatically, so the address calculation for the matrix elements mentioned before is greatly simplified at the program level.
  11. The third feature of MaPU is the cascading pipeline with a state-machine-based program model. In this model, each function unit is programmed as a simple state machine. Using these state machines it is easier to map mathematical equations to the hardware, and all the timing of the pipeline is highly predictable.
  12. With the dedicated state machines, all the FUs can cascade into a complete data path that suits the algorithm. For example, the FFT dataflow can be represented as state machines and then mapped into microcode lines. Here is the dataflow of the FFT, here is the conceptual microcode of the state machine, and here is the conceptually mapped pipeline.
  13. An SoC with 4 MaPU cores was designed and tested to prove the advantages of this architecture. Next I would like to introduce this chip.
  14. The MaPU core is called APE, which stands for Algebraic Process Engine. An ARM core is used as the host processor, and high-speed I/Os such as PCI Express, RapidIO and the DDR3 controllers are also included. Low-speed I/Os such as GPIO and UART are used for debug/test. All of these modules are connected by a three-level bus.
  15. The APE core was designed with 10 FUs and 14 microcode slots. Each FU supports 512-bit operations. These FUs include an integer ALU, an integer MAC, a floating-point ALU and a floating-point MAC. Two shuffle units and 3 load/store units were designed for data movement. A large matrix register file with self-indexing capability is also included as a high-speed buffer. There are six MGP memories which can be used in a ping-pong style; each memory is 2 megabits.
  16. Most of the MaPU tool chain was based on open-source frameworks. The compiler, assembler & disassembler were based on LLVM, the linker was based on Gold, the debugger was based on GDB and the simulator was based on Gem5. The emulator was also based on the open-source OpenOCD project.
  17. The SoC was implemented in a 40 nm low-power process. Here is the layout of the SoC. These are the four APE cores; between them is the bus matrix, and here is the high-speed I/O controller. And here are the ARM core and other IPs. The total area is 363 square millimeters. The APEs occupy less than half of the area.
  18. This is the layout of the APE core. Here are the computation units, here are the shuffle units, and here is the forwarding matrix, which is spread over this area. Here are the MGP memory and the scalar pipeline. The total area is 36 square millimeters. Most of the area is occupied by the MGP memory, whose size is 12 megabits. The microcode pipeline runs at 1 GHz in the typical case; other modules in the APE run at only 500 MHz.
  19. OK, next I will introduce the test results and analysis of the real chip.
  20. Here is a picture of the real chip and the test board. The size of the packaged chip is about 4 centimeters by 4 centimeters.
  21. We have optimized some micro benchmarks on the APE core and compared its performance with the TI C66x core. We chose this core because it has a similar VLIW micro-architecture, process node and logic resources to the APE core. The statistics for APE were collected from the real chip running at 1 GHz, and the statistics for the TI core were collected from the official cycle-accurate simulator with the official DSP and image libraries. The TI core ran at 1.25 GHz during the simulation. We can see that APE runs at a lower frequency but outperforms the TI core multiple times over in the FFT, matrix multiplication and filter algorithms. Furthermore, the performance of APE can be improved through microcode optimization if needed.
  22. Before taping out the chip, the power of APE was estimated while running different benchmarks. These benchmarks include fixed-point & floating-point FFT, matrix multiplication, 2D filter, FIR, table lookup and matrix transpose. After the chip was returned from the foundry, we tested the power of the chip running at 1 GHz. The typical power consumption of APE is about 2 to 4 watts. We can see that the simulated power and the tested power are almost the same; the maximum difference is only 8%. This indicates that the following power analysis, which is based on simulation, is highly reliable. We also noted that the standby power of APE is as high as 1.5 watts. This should be improved in the future.
  23. We also collected the instruction statistics of the different micro benchmarks, as shown in this table. Based on the power, execution time and instruction counts, we can calculate the dynamic power efficiency of APE, as shown in this diagram. We can see that the maximum power efficiency for floating-point algorithms is 45 GFLOPS per watt with the FFT, and the maximum power efficiency for fixed point is 103 GOPS per watt with the 2D filter. Comparing the power efficiency with contemporary processors at similar process nodes, which include the Core i7 and Xeon Phi from Intel and the Tegra GPU from Nvidia, we can see that the APE core has about a two to four times improvement on the optimized benchmarks.
  24. Though little effort has been made on circuit optimization, MaPU has demonstrated impressive power efficiency. The power efficiency comes from two parts. One is the hardware effort: this includes the cascading pipeline and the MGP memory system, which give the programmer the opportunity to customize the data path to the algorithm. On the software side, we try to minimize energy consumption based on these hardware features. Combining the hardware and software efforts, the MaPU processor forms a matched data path for each algorithm. In this matched data path, most energy is consumed by computation. This diagram shows the average energy consumed by different operations. We can see that load/store consumes much more energy than computation and register operations, except for the IMAC, which should be improved. So memory access operations should be minimized. This effort can be seen in this diagram, which shows the operation composition of the different benchmarks: the purple bars stand for register file accesses, the orange ones for computation, and the yellow ones for load/store operations. Thanks to the MGP memory system, we can see that memory operations are minimized in all micro benchmarks. In fact, the energy consumed by the memory is only 8%, 6%, 3%, 3%, 2%, 3% and 15% for the different benchmarks respectively, and the energy consumed by control logic is only 0.24%, which includes microcode storage, fetching and dispatching.
  25. We think our work has the following contributions. First, we proposed and verified a highly customizable architecture with a real chip. The customization can be made in two respects: for the architecture designer, the FU sets of MaPU are highly customizable for different workloads; for the programmer, the data path is highly customizable for different workloads through microcode. The second contribution is the proposed and verified memory system that supports row- and column-wise matrix access with the same layout. It provides parallel access without conflict, has potential usage in applications with regular access patterns, and can be integrated into other architectures.
  26. We think the following work can be done to improve MaPU. First and most important is the high-level programming model. Though the state-machine-based program model provides great performance, great effort is needed to program MaPU; it takes about 1 month to optimize a micro benchmark, even for those who are familiar with MaPU. Second, we need circuit-level optimization of the chip. When we implemented the SoC, only the register file was customized, so the standby power is very high and this should be improved. Third, as a proof-of-concept chip, the SoC was designed for general computation. But MaPU itself is a highly customizable architecture, so it is very promising to customize FU sets for different workloads to get better performance, for example FU sets for communication applications, FU sets for machine learning, etc.