3. Introduction—Initial Ideas
As transistor count increases, more complex computations can be implemented as dedicated hardware.
[Figure: timeline of on-chip transistor count growth under Moore's law, with tick marks at 1971, 1977, 1979, 1983, 1990, 1995, 1999 and 2006, annotated with the Intel 4004, AMD Am9511, Motorola 6809, TI TMS32010, IBM POWER1, UltraSPARC I, PowerPC AltiVec, Pentium III SSE and Nvidia GeForce. What's next?]
4. Introduction—Initial Ideas
We try to map the mathematical equations to hardware!
Computing: massive cascading function units with simple controller logic, which reduces the power consumed by control logic.
Storage: a novel multi-granularity parallel memory, which provides various memory access patterns.
[Figure: (a) data path mapping for FFT (Y = A ± B·W): FU0–FU3 and Mem0–Mem2 arranged along the dataflow; (b) data path mapping for matrix multiply (Y = A·B): FU0–FU4 and Mem0–Mem2 arranged along the dataflow.]
5. Introduction—Instruction Set Architecture Overview
[Figure: block diagram of the architecture: a scalar pipeline (scalar register file, scalar controller) and a microcode pipeline (microcode fetch, microcode memory, microcode controller, FU0–FU9) on top of the multi-granularity parallel storage, with a CSU and bus interfaces to the SoC.]
• Massive function units, e.g. ALU, MAC, LD/ST
  – Each FU is controlled by a microcode
  – Relatively simple hardware
  – Wide SIMD, e.g. 16×32 bits
• Forwarding matrix between FUs
  – Highly structured
  – Enables function unit cascading
• Multi-granularity parallel (MGP) storage system
  – Simultaneous matrix row and column access
• Parallel microcode emission
  – Tens of microcodes at each cycle
  – Microcode can be updated dynamically
• VLIW scalar pipeline
  – Controls the microcode pipeline
  – Communication with the SoC
7. Highlight 1: Multi-granularity parallel (MGP) storage system
Architecture Highlight
A storage system that enables W-byte parallel access to matrix row or column elements with the same data layout:
• Requires a granularity parameter (G), ranging from 0 to log2(W)
• Requires W physical memory banks, each able to read/write W bytes in parallel
• Physical memory banks cascade and group into logic banks according to the granularity parameter (G)
• All logic banks are accessed with the same address
Dongling Wang, Shaolin Xie, et al. "Multi-granularity parallel storage system and storage." US patent application 14/117,792.
9. Highlight 1: Multi-granularity parallel storage system
Architecture Highlight
Suppose the storage bit width W = 4 bytes.
The ith row is placed in the logic bank labeled "i % W", so the matrix can be accessed in parallel by row or by column: simultaneous matrix row and column access.
[Figure: a 5×5 matrix with elements 0–24 laid out across Logic Bank 0–3. With G = 4 a whole row is accessed in parallel; with G = 1 a column (e.g. elements 0, 5, 10, 15) is accessed in parallel.]
Dongling Wang, Shaolin Xie, et al. "Multi-granularity parallel storage system and storage." US patent application 14/117,792.
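The placement rule above can be checked with a minimal Python sketch. The matrix and W = 4 come from the slide's example; the variable and function names are mine, and this models only the bank mapping, not the hardware.

```python
# Sketch of the MGP placement rule: row i of the matrix is placed in
# logic bank i % W.  With G = 1 (W logic banks of 1 byte each), any W
# consecutive rows of one column land in W distinct banks, so a column
# can be read in one cycle; with G = W a single wide bank returns a row.

W = 4  # storage interface width in bytes, as in the slide example

# The 5x5 matrix 0..24 from the slide, row-major
matrix = [[5 * i + j for j in range(5)] for i in range(5)]

def bank_of_row(i, w=W):
    """Logic bank that holds row i under the i % W placement rule."""
    return i % w

# Column access with G = 1: the first W elements of column 0
col0 = [(matrix[i][0], bank_of_row(i)) for i in range(W)]
# elements 0, 5, 10, 15 land in banks 0, 1, 2, 3: one bank each,
# hence one parallel (conflict-free) access
```

Since each of the W banks is hit exactly once, no two column elements compete for the same bank port, which is the conflict-free property the slide claims.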
10. Highlight 2: High Dimension Data Model
Architecture Highlight
Data is represented by configuration registers in the load/store unit (BIU), with up to 4 dimensions. Each dimension has:
• KB: base address
• KS: address stride
• KI: total number of elements
(i.e. register sets KB0/KS0/KI0 through KB3/KS3/KI3)
Access is controlled by microcode and addresses are calculated automatically, so address calculation for the matrix elements mentioned before is greatly simplified at the program level:
BIU0->DM(A++, K++);
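The automatic address calculation can be sketched in a few lines of Python. The function name and the single-base, nested-loop reading of the KB/KS/KI registers are my assumptions for illustration, not the actual BIU logic.

```python
from itertools import product

def address_stream(base, dims):
    """Model of the descriptor-driven address generation: `dims` is a
    list of up to four (stride, count) pairs, outermost dimension first,
    standing in for the KS/KI register sets.  Each emitted address is
    base + sum(i_d * stride_d) over the dimension indices."""
    for idx in product(*(range(count) for _, count in dims)):
        yield base + sum(i * stride for (stride, _), i in zip(dims, idx))

# A 3x4 byte matrix at base 0: a row-major walk and a column-major walk
# of the same data, obtained just by swapping the (stride, count) pairs
rows = list(address_stream(0, [(4, 3), (1, 4)]))  # 0, 1, 2, ..., 11
cols = list(address_stream(0, [(1, 4), (4, 3)]))  # 0, 4, 8, 1, 5, 9, ...
```

Swapping the dimension descriptors switches between row and column traversal without touching the inner-loop code, which is the simplification the slide refers to.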
11. Highlight 3: Cascading pipeline with state-machine-based program model
Architecture Highlight
Each function unit is programmed as a state machine:
• Easy to map the mathematical equation to hardware
• Timing of the pipeline is highly predictable
[Figure: state machines of the functional units, with stages running for 10, 7, 20 and 22 cycles, started with delays of 2, 5 and 4 cycles between them.]
12. Highlight 3: Cascading pipeline with state-machine-based program model
Architecture Highlight
[Figure: data path for FFT computation: FUs (FMAC, FALU, SHU0, SHU1, BIU0–BIU2 among FU0–FU9) cascaded between Memory 0–2; input data flows through the FU1, FU2, FU3 and FU4 pipelines, with forwarding pipelines between them, to the result.]

FU1 state machine:
.hmacro FU1SM
LPTO (1f) @ (KI12);       // loop to label 1; the loop count is stored in the KI12 register
// Load data to the register file and calculate the next load address
BIU0.DM(A++, K++) -> M[0];
NOP;                      // idle for one cycle
BIU0.DM(A++, K++) -> M[0];
NOP;                      // idle for one cycle
1:
.endhmacro

Top state machine:
.hmacro MainSM
FU1SM;                    // start the FU1 state machine
REPEAT@(6);               // wait 6 cycles
FU2SM || FU3SM;           // start the FU2 and FU3 state machines
REPEAT@(6);               // wait 6 cycles
FU4SM;                    // start the FU4 state machine
.endhmacro
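MainSM starts each FU's state machine at a fixed cycle offset, which is what makes the cascaded pipeline's timing statically predictable. A toy Python model makes the schedule explicit; the assumption that issuing a start occupies one cycle, like any other microcode line, is mine, not the exact hardware semantics.

```python
def start_cycles(program):
    """Return the cycle at which each FU state machine is started.
    `program` is a list of ('start', [fu names]) and ('wait', n) steps,
    mirroring the MainSM macro above."""
    cycle, starts = 0, {}
    for op, arg in program:
        if op == 'start':
            for fu in arg:
                starts[fu] = cycle
            cycle += 1        # the start itself occupies a cycle (assumed)
        else:                 # 'wait'
            cycle += arg
    return starts

main_sm = [('start', ['FU1']), ('wait', 6),
           ('start', ['FU2', 'FU3']), ('wait', 6),
           ('start', ['FU4'])]
schedule = start_cycles(main_sm)
# every run of the program yields the same, statically known offsets
```

Because there is no dynamic arbitration, the compiler (or programmer) knows at which cycle each FU's data arrives at the forwarding matrix.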
14. SoC Architecture
The First MaPU Chip
SoC features:
• 4 MaPU cores (APE: Algebraic Process Engine) with an ARM Cortex-A8
• 3-level bus connection (L1 bus, L2 buses)
• Dedicated co-processors (GPU, CODEC)
[Figure: SoC block diagram: four APEs with local memories on a high-speed network (4x links), shared memory, DDR3 controllers 0/1, PCIe 0/1, RapidIO 0/1, Ethernet, an external bus interface, and low-speed peripherals (GPIO, UART, I2C/SPI, IIS, timer, watchdog, interrupt, JTAG, reset).]
15. APE Core Architecture
The First MaPU Chip
Architecture features:
• 10 functional units / 14 microcode slots
• 6 MGP memories (MGP0–MGP5), 2 Mbit each
• Large matrix register file
  – 4 read / 4 write ports
  – Circular-buffer and sliding-window auto-index modes
  – Each read port controlled by microcode: M.r0–M.r3
• 512-bit data path
[Figure: APE block diagram: scalar pipeline (scalar register file, scalar controller) and microcode pipeline (microcode fetch, microcode memory, microcode controller; SHU0, SHU1, IALU, IMAC, FALU, FMAC, BIU0–BIU2; matrix register file) on top of the multi-granularity parallel storage, with a CSU and AXI interfaces.]
16. Tool Chain
The First MaPU Chip
Tools and the open-source frameworks they are based on:
• Compiler for the state-machine-based language: Ragel & Bison & LLVM
• C compiler for the scalar pipeline: Clang & LLVM
• Assembler/disassembler for both scalar & microcode pipelines: Ragel & Bison & LLVM
• Linker for both scalar & microcode pipelines: Binutils Gold
• Debugger for the scalar pipeline: GDB
• Simulator (scalar & microcode): Gem5
• Emulator: OpenOCD
17. SoC Implementation
The First MaPU Chip
Features:
• 40 nm LP process
• 363.468 mm²
• 1681 pads
[Figure: SoC layout: APE0–APE3 in the corners, the bus matrix between them, SRIO0/1, PCIe0/1, DDR3 controllers 0/1 with Phy0/1, GMAC, and the Cortex-A8 with other IPs.]
18. APE Implementation
The First MaPU Chip
Implementation features:
• 35.992 mm²
• 1.0 GHz frequency (typical case)
[Figure: APE layout: the multi-granularity parallel memory system, load/store & forwarding logic, microcode memory, scalar pipeline, SHU0/SHU1, bus & microcode controller, eight FMAC/IMAC/FALU/IALU/MReg clusters, and two Turbo decoders with a 64-bit data path.]
21. APE Core Performance Compared to TI C66x Core
Test Result
Average speed-up of APE vs. the TI C66x core:
• Cplx SP FFT: 2.00
• Cplx FP FFT: 1.89
• Matrix mul: 4.77
• 2D filter: 6.94
• SP FIR: 6.55
APE: real chip @ 1 GHz.
TI C66x core: CCSv5 cycle-accurate simulator with DSPLIB and IMGLIB @ 1.25 GHz.
Why the C66x core: similar process node (40 nm), similar VLIW micro-architecture, similar logic resources (SIMD fixed & float).
Results can be improved further through microcode optimization.
24. Where Does the Power Efficiency Come From?
Analysis
Hardware effort + software effort = a matched datapath.
[Figure: instruction composition (MReg access / computation / load-store) of the benchmarks Cplx SP FFT, Cplx FP FFT, Matrix Mul, 2D Filter, FIR, Table Lookup and Matrix Trans, each normalized to 100%.]
Average energy consumed by different FUs (512-bit SIMD), in pJ:
• Register R/W: 133.25
• Load/Store: 609.2
• FALU: 345.65
• IALU: 335.18
• FMAC: 387.23
• IMAC: 788.77
• SHU: 213.04
Little effort was spent on circuit optimization, but energy-inefficient memory accesses are minimized:
• Most energy is consumed by computation
• Little energy is consumed by control logic: only 0.24% for microcode fetching, dispatching & storage
25. Conclusion—Contribution
Proposed and verified a highly customizable architecture with a real chip
• For the architecture designer: FU sets are highly customizable for different workloads.
• For the programmer: the data path is highly customizable for different workloads through microcode.
Proposed and verified a novel memory system that supports row- and column-wise matrix access
• Parallel access without conflict; potential usage in applications with regular access patterns.
• Can be integrated into other architectures.
26. Conclusion—Future Work
High-level programming model
• Great effort is needed to program MaPU
• About 1 month per micro-benchmark
Circuit-level optimization
• Only the register file is custom-designed
• Standby power is high (1.5 W; the clock network needs improvement)
Customized FU sets for different workloads
• FU sets for communication applications
• FU sets for multimedia processing
• FU sets for machine learning
• …
27. Thanks & Questions?
Feel free to contact
shaolin.xie@ia.ac.cn or
Shawnless.xie@gmail.com
Editor's Notes
Our work is a novel mathematical computing architecture called MaPU (pronounced "Ma-PU").
First I would like to introduce the MaPU architecture briefly, then some interesting features of MaPU. Next I will introduce the chip and the test results.
Here is the initial idea of MaPU. As Moore's law is still effective, tremendous numbers of transistors can be used for computing, and more complex computation can be supported directly by hardware. For example, in the early days only fixed-point addition was supported, but now wide SIMD vector units and CGRAs are common in processors. We try to push this trend a step further.
Therefore, we started our work with a very simple idea: try to map mathematical primitives directly onto programmable hardware, so as to boost performance while reducing power. For example, map the FFT and matrix multiply onto a configurable data path. To achieve this goal, we have worked on two aspects: one is computing, and the other is storage.
OK, this is the simplified diagram of the MaPU architecture. It is made up of three main components: a scalar pipeline, a microcode pipeline, and a multi-granularity parallel storage system.
The scalar pipeline is used to control the microcode pipeline. The microcode pipeline contains massive function units, and all of the FUs are connected by a forwarding matrix. They operate in SIMD manner and are controlled by microcodes.
The multi-granularity parallel storage system supports simultaneous matrix row and column access with the same layout; we will explore this feature in more detail later on.
The microcode pipeline has many features in common with coarse-grained reconfigurable architectures, but it uses a tightly coupled forwarding matrix instead of dedicated routing units to forward data. With this forwarding matrix, FUs can cascade into a compact data path that resembles the data flow of the algorithm. Therefore, it may provide performance and power efficiency comparable with that of an ASIC.
OK, next I would like to introduce some interesting features of MaPU. The first is the MGP storage system. It supports simultaneous matrix row- or column-wise access with the same layout.
An MGP memory with a W-byte interface has the following requirements:
First, it requires a granularity parameter G, which should be a power of 2 between 1 and W (i.e. 2^g with g ranging from 0 to log2(W)).
Second, it requires W physical memory banks, each able to read/write W bytes in parallel.
When accessed, the physical memory banks cascade and group into logic banks according to the granularity parameter G, and all logic banks have the same address map and are accessed with the same address.
Here is an illustration of the MGP memory. We suppose W is 4. When accessed, G consecutive physical memory banks cascade into a logic bank, and each logic bank reads/writes G bytes.
For example, when G = 1, there are 4 logic banks and each logic bank reads 1 byte. When G = 2, 2 physical memory banks cascade into a logic bank, and each logic bank accesses 2 bytes. When G = 4, there is only one logic bank, and the MGP memory degenerates into an ordinary memory system with a W-byte interface.
The MGP memory system supports simultaneous matrix row- and column-wise access with the same layout, but the matrix has to be initialized properly. A simple rule is that the ith row is placed in the logic bank labeled i % W, and the rows in the same logic bank should be consecutive.
For example, the first row is placed in logic bank 0, the next row in logic bank 1, and the fifth row in logic bank 0 again.
When accessing row elements, set G to 4; then the row elements can be accessed in parallel as in an ordinary memory system. When accessing column elements, set G to 1; there are then four logic banks in the system, each providing 1-byte access, so the column elements can be accessed in parallel. For example, elements 0, 5, 10, 15 of the first column can be accessed in one cycle.
The second feature of MaPU is the high-dimension data model. In the microcode pipeline, only high-dimension vector data are supported. Each dimension of the data is represented by a configuration register set, which contains the base address, the address stride and the total number of elements.
The address is calculated automatically, so the address calculation for the matrix elements mentioned before is greatly simplified at the program level.
The third feature of MaPU is the cascading pipeline with a state-machine-based program model.
In this model, each function unit is programmed as a simple state machine.
Using these state machines it is easier to map the mathematical equations to the hardware, and all the timing of the pipeline is highly predictable.
With the dedicated state machines, all the FUs can cascade into a complete data path that suits the algorithm. For example, the FFT dataflow can be represented as state machines and then mapped into microcode lines. Here is the dataflow of the FFT, here is the conceptual microcode of the state machine, and here is the conceptually mapped pipeline.
A SoC with 4 MaPU cores was designed and tested to prove the advantages of this architecture. Next I would like to introduce this chip.
The MaPU core is called APE, which stands for Algebraic Process Engine. An ARM core is used as the host processor, and high-speed I/Os such as PCI Express, RapidIO and DDR3 controllers are also included. Low-speed I/Os such as GPIO and UART are used for debug and test.
All of these modules are connected by a three-level bus.
The APE core was designed with 10 FUs and 14 microcode slots. Each FU supports 512-bit operations. These FUs include an integer ALU, an integer MAC, a floating-point ALU and a floating-point MAC. Two shuffle units and 3 load/store units were designed for data movement. A large matrix register file with self-indexing capability is also included as a high-speed buffer. There are six MGP memories, 2 Mbit each, which can be used in a ping-pong style.
Most of the MaPU tool chain is based on open-source frameworks. The compiler, assembler and disassembler are based on LLVM, the linker on Gold, the debugger on GDB and the simulator on Gem5. The emulator is based on the open-source hardware tool OpenOCD.
The SoC was implemented in a 40 nm low-power process. Here is the layout of the SoC: these are the four APE cores, between them is the bus matrix, here are the high-speed I/O controllers, and here are the ARM core and other IPs. The total area is 363 mm²; the APEs occupy less than half of it.
This is the layout of the APE core. Here are the computation units, here are the shuffle units, and here is the forwarding matrix, which is spread over this area. Here are the MGP memory and the scalar pipeline.
The total area is 36 mm². Most of the area is occupied by the MGP memory, whose size is 12 Mbit.
The microcode pipeline runs at 1 GHz in the typical case. The other modules in APE run at only 500 MHz.
OK, next, I will introduce the test results and the analysis of the real chip.
Here is a picture of the real chip and the test board. The packaged chip is about 4 cm × 4 cm.
We optimized some micro-benchmarks on the APE core and compared its performance with the TI C66x core. We chose this core because it has a similar VLIW microarchitecture, process node and logic resources to the APE core. The statistics for APE were collected from the real chip running at 1 GHz, and the statistics for the TI core were collected from the official cycle-accurate simulator with the official DSP and image libraries. The TI core ran at 1.25 GHz during the simulation.
We can see that APE runs at a lower frequency but outperforms the TI core several times over on the FFT, matrix multiplication and filter algorithms. Furthermore, the performance of APE can be improved through microcode optimization if needed.
Before taping out the chip, the power of APE was estimated while it ran different benchmarks, including fixed-point & floating-point FFT, matrix multiplication, 2D filter, FIR, table lookup and matrix transpose.
After the chip returned from the foundry, we measured its power while it ran at 1 GHz.
The typical power consumption of APE is about 2 to 4 W. We can see that the simulated power and the measured power are almost the same; the maximum difference is only 8%. This indicates that the following power analysis, which is based on simulation, is highly reliable.
We also noted that the standby power of APE is as high as 1.5 W. This should be improved in the future.
We also collected the instruction statistics of the different micro-benchmarks, as shown in this table. Based on the power, execution time and instruction counts, we can calculate the dynamic power efficiency of APE, as shown in this diagram.
We can see that the maximum power efficiency for floating-point algorithms is 45 GFLOPS/W, with the FFT, and the maximum power efficiency for fixed point is 103 GOPS/W, with the 2D filter.
Comparing the power efficiency with contemporary processors at a similar process node, including the Core i7 and Xeon Phi from Intel and a Tegra GPU from Nvidia, we can see that the APE core achieves about a two- to four-times improvement on the optimized benchmarks.
Though little effort was made on circuit optimization, MaPU has shown impressive power efficiency. The power efficiency comes from two parts. One is the hardware effort: this includes the cascading pipeline and the MGP memory system, which give the programmer the opportunity to customize the data path to the algorithm. The other is the software effort: we try to minimize the energy consumption based on these hardware features.
Combining the hardware and software efforts, the MaPU processor forms a matched data path for each algorithm. In this matched data path, most energy is consumed by computation. This diagram shows the average energy consumed by different operations. We can see that load/store consumes much more energy than computation and register operations (except the IMAC, which should be improved), so memory access operations should be minimized.
This effort can be seen in this diagram, which shows the operation composition of the different benchmarks. The purple bars stand for register file access, the orange for computation, and the yellow for load/store operations. Thanks to the MGP memory system, memory operations are minimized in all micro-benchmarks. In fact, the energy consumed by the memory is only 8%, 6%, 3%, 3%, 2%, 3% and 15% for the different benchmarks respectively, and the energy consumed by control logic, including microcode storage, fetching and dispatching, is only 0.24%.
We think our work makes the following contributions:
First, we proposed and verified a highly customizable architecture with a real chip. The customization can be made in two aspects: for the architecture designer, the FU sets of MaPU are highly customizable for different workloads; for the programmer, the data path is highly customizable for different workloads through microcode.
The second contribution is the proposed and verified memory system that supports row- and column-wise matrix access with the same layout. It provides parallel access without conflict, has potential usage in applications with regular access patterns, and can be integrated into other architectures.
We think the following work can be done to improve MaPU.
First, and most important, is a high-level programming model.
Though the state-machine-based program model provides great performance, great effort is needed to program MaPU. It takes about 1 month to optimize a micro-benchmark, even for those who are familiar with MaPU.
Second, we need circuit-level optimization of the chip. When we implemented the SoC, only the register file was custom-designed, so the standby power is very high; this should be improved.
Third, as a proof-of-concept chip, the SoC was designed for general computation. But MaPU itself is a highly customizable architecture, so it is very promising to customize FU sets for different workloads to get better performance, for example FU sets for communication applications, FU sets for machine learning, etc.