3. Introduction—Initial Ideas
As transistor count increases, more complex computations can be implemented as dedicated hardware.
[Figure: timeline of on-chip transistor count growth under Moore's law, with tick marks at 1971, 1977, 1979, 1983, 1990, 1995, 1999 and 2006, annotated with the Intel 4004, AMD Am9511, Motorola 6809, TI TMS32010, IBM POWER1, UltraSPARC I, PowerPC AltiVec, Pentium III SSE and Nvidia GeForce. What's next?]
4. Introduction—Initial Ideas
We try to map the mathematical equations to hardware!
Computing: massive cascading function units with simple controller logic, which reduces the power consumed by control logic.
Storage: a novel multi-granularity parallel memory, which provides various memory access patterns.
[Figure: (a) data path mapping for FFT (Y = A ± B·W): FU0–FU3 and Mem0–Mem2 arranged along the dataflow; (b) data path mapping for matrix multiply (Y = A·B): FU0–FU4 and Mem0–Mem2 arranged along the dataflow.]
5. Introduction—Instruction Set Architecture Overview
[Figure: block diagram of the architecture: a scalar pipeline (scalar register file, scalar controller) and a microcode pipeline (microcode fetch, microcode memory, microcode controller, FU0–FU9) on top of the multi-granularity parallel storage, with a CSU and bus interfaces to the SoC.]
• Massive function units, e.g. ALU, MAC, LD/ST
  – Each FU is controlled by a microcode
  – Relatively simple hardware
  – Wide SIMD, e.g. 16×32 bits
• Forwarding matrix between FUs
  – Highly structured
  – Enables function unit cascading
• Multi-granularity parallel (MGP) storage system
  – Simultaneous matrix row and column access
• Parallel microcode emission
  – Tens of microcodes at each cycle
  – Microcode can be updated dynamically
• VLIW scalar pipeline
  – Controls the microcode pipeline
  – Communication with the SoC
7. Highlight 1: Multi-granularity parallel (MGP) storage system
Architecture Highlight
A storage system that enables W-byte parallel access to matrix row or column elements with the same data layout:
• Requires a granularity parameter (G), ranging from 0 to log2(W)
• Requires W physical memory banks, each able to read/write W bytes in parallel
• Physical memory banks cascade and group into logic banks according to the granularity parameter (G)
• All logic banks are accessed with the same address
Dongling Wang, Shaolin Xie, et al. "Multi-granularity parallel storage system and storage." US patent application 14/117,792.
9. Highlight 1: Multi-granularity parallel storage system
Architecture Highlight
Suppose the storage bit width W = 4 bytes.
The ith row is placed in the logic bank labeled "i % W", so the matrix can be accessed in parallel by row or by column: simultaneous matrix row and column access.
[Figure: a 5×5 matrix with elements 0–24 laid out across Logic Bank 0–3. With G = 4 a whole row is accessed in parallel; with G = 1 a column (e.g. elements 0, 5, 10, 15) is accessed in parallel.]
Dongling Wang, Shaolin Xie, et al. "Multi-granularity parallel storage system and storage." US patent application 14/117,792.
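The placement rule above can be checked with a minimal Python sketch. The matrix and W = 4 come from the slide's example; the variable and function names are mine, and this models only the bank mapping, not the hardware.

```python
# Sketch of the MGP placement rule: row i of the matrix is placed in
# logic bank i % W.  With G = 1 (W logic banks of 1 byte each), any W
# consecutive rows of one column land in W distinct banks, so a column
# can be read in one cycle; with G = W a single wide bank returns a row.

W = 4  # storage interface width in bytes, as in the slide example

# The 5x5 matrix 0..24 from the slide, row-major
matrix = [[5 * i + j for j in range(5)] for i in range(5)]

def bank_of_row(i, w=W):
    """Logic bank that holds row i under the i % W placement rule."""
    return i % w

# Column access with G = 1: the first W elements of column 0
col0 = [(matrix[i][0], bank_of_row(i)) for i in range(W)]
# elements 0, 5, 10, 15 land in banks 0, 1, 2, 3: one bank each,
# hence one parallel (conflict-free) access
```

Since each of the W banks is hit exactly once, no two column elements compete for the same bank port, which is the conflict-free property the slide claims.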
10. Highlight 2: High Dimension Data Model
Architecture Highlight
Data is represented by configuration registers in the load/store unit (BIU), with up to 4 dimensions. Each dimension has:
• KB: base address
• KS: address stride
• KI: total number of elements
(i.e. register sets KB0/KS0/KI0 through KB3/KS3/KI3)
Access is controlled by microcode and addresses are calculated automatically, so address calculation for the matrix elements mentioned before is greatly simplified at the program level:
BIU0->DM(A++, K++);
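The automatic address calculation can be sketched in a few lines of Python. The function name and the single-base, nested-loop reading of the KB/KS/KI registers are my assumptions for illustration, not the actual BIU logic.

```python
from itertools import product

def address_stream(base, dims):
    """Model of the descriptor-driven address generation: `dims` is a
    list of up to four (stride, count) pairs, outermost dimension first,
    standing in for the KS/KI register sets.  Each emitted address is
    base + sum(i_d * stride_d) over the dimension indices."""
    for idx in product(*(range(count) for _, count in dims)):
        yield base + sum(i * stride for (stride, _), i in zip(dims, idx))

# A 3x4 byte matrix at base 0: a row-major walk and a column-major walk
# of the same data, obtained just by swapping the (stride, count) pairs
rows = list(address_stream(0, [(4, 3), (1, 4)]))  # 0, 1, 2, ..., 11
cols = list(address_stream(0, [(1, 4), (4, 3)]))  # 0, 4, 8, 1, 5, 9, ...
```

Swapping the dimension descriptors switches between row and column traversal without touching the inner-loop code, which is the simplification the slide refers to.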
11. Highlight 3: Cascading pipeline with state-machine-based program model
Architecture Highlight
Each function unit is programmed as a state machine:
• Easy to map the mathematical equation to hardware
• Timing of the pipeline is highly predictable
[Figure: state machines of the functional units, with stages running for 10, 7, 20 and 22 cycles, started with delays of 2, 5 and 4 cycles between them.]
12. Highlight 3: Cascading pipeline with state-machine-based program model
Architecture Highlight
[Figure: data path for FFT computation: FUs (FMAC, FALU, SHU0, SHU1, BIU0–BIU2 among FU0–FU9) cascaded between Memory 0–2; input data flows through the FU1, FU2, FU3 and FU4 pipelines, with forwarding pipelines between them, to the result.]

FU1 state machine:
.hmacro FU1SM
LPTO (1f) @ (KI12);       // loop to label 1; the loop count is stored in the KI12 register
// Load data to the register file and calculate the next load address
BIU0.DM(A++, K++) -> M[0];
NOP;                      // idle for one cycle
BIU0.DM(A++, K++) -> M[0];
NOP;                      // idle for one cycle
1:
.endhmacro

Top state machine:
.hmacro MainSM
FU1SM;                    // start the FU1 state machine
REPEAT@(6);               // wait 6 cycles
FU2SM || FU3SM;           // start the FU2 and FU3 state machines
REPEAT@(6);               // wait 6 cycles
FU4SM;                    // start the FU4 state machine
.endhmacro
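MainSM starts each FU's state machine at a fixed cycle offset, which is what makes the cascaded pipeline's timing statically predictable. A toy Python model makes the schedule explicit; the assumption that issuing a start occupies one cycle, like any other microcode line, is mine, not the exact hardware semantics.

```python
def start_cycles(program):
    """Return the cycle at which each FU state machine is started.
    `program` is a list of ('start', [fu names]) and ('wait', n) steps,
    mirroring the MainSM macro above."""
    cycle, starts = 0, {}
    for op, arg in program:
        if op == 'start':
            for fu in arg:
                starts[fu] = cycle
            cycle += 1        # the start itself occupies a cycle (assumed)
        else:                 # 'wait'
            cycle += arg
    return starts

main_sm = [('start', ['FU1']), ('wait', 6),
           ('start', ['FU2', 'FU3']), ('wait', 6),
           ('start', ['FU4'])]
schedule = start_cycles(main_sm)
# every run of the program yields the same, statically known offsets
```

Because there is no dynamic arbitration, the compiler (or programmer) knows at which cycle each FU's data arrives at the forwarding matrix.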
14. SoC Architecture
The First MaPU Chip
SoC features:
• 4 MaPU cores (APE: Algebraic Process Engine) with an ARM Cortex-A8
• 3-level bus connection (L1 bus, L2 buses)
• Dedicated co-processors (GPU, CODEC)
[Figure: SoC block diagram: four APEs with local memories on a high-speed network (4x links), shared memory, DDR3 controllers 0/1, PCIe 0/1, RapidIO 0/1, Ethernet, an external bus interface, and low-speed peripherals (GPIO, UART, I2C/SPI, IIS, timer, watchdog, interrupt, JTAG, reset).]
15. APE Core Architecture
The First MaPU Chip
Architecture features:
• 10 functional units / 14 microcode slots
• 6 MGP memories (MGP0–MGP5), 2 Mbit each
• Large matrix register file
  – 4 read / 4 write ports
  – Circular-buffer and sliding-window auto-index modes
  – Each read port controlled by microcode: M.r0–M.r3
• 512-bit data path
[Figure: APE block diagram: scalar pipeline (scalar register file, scalar controller) and microcode pipeline (microcode fetch, microcode memory, microcode controller; SHU0, SHU1, IALU, IMAC, FALU, FMAC, BIU0–BIU2; matrix register file) on top of the multi-granularity parallel storage, with a CSU and AXI interfaces.]
16. Tool Chain
The First MaPU Chip
Tools and the open-source frameworks they are based on:
• Compiler for the state-machine-based language: Ragel & Bison & LLVM
• C compiler for the scalar pipeline: Clang & LLVM
• Assembler/disassembler for both scalar & microcode pipelines: Ragel & Bison & LLVM
• Linker for both scalar & microcode pipelines: Binutils Gold
• Debugger for the scalar pipeline: GDB
• Simulator (scalar & microcode): Gem5
• Emulator: OpenOCD
17. SoC Implementation
The First MaPU Chip
Features:
• 40 nm LP process
• 363.468 mm²
• 1681 pads
[Figure: SoC layout: APE0–APE3 in the corners, the bus matrix between them, SRIO0/1, PCIe0/1, DDR3 controllers 0/1 with Phy0/1, GMAC, and the Cortex-A8 with other IPs.]
18. APE Implementation
The First MaPU Chip
Implementation features:
• 35.992 mm²
• 1.0 GHz frequency (typical case)
[Figure: APE layout: the multi-granularity parallel memory system, load/store & forwarding logic, microcode memory, scalar pipeline, SHU0/SHU1, bus & microcode controller, eight FMAC/IMAC/FALU/IALU/MReg clusters, and two Turbo decoders with a 64-bit data path.]
21. APE Core Performance Compared to TI C66x Core
Test Result
Average speed-up of APE vs. the TI C66x core:
• Cplx SP FFT: 2.00
• Cplx FP FFT: 1.89
• Matrix mul: 4.77
• 2D filter: 6.94
• SP FIR: 6.55
APE: real chip @ 1 GHz.
TI C66x core: CCSv5 cycle-accurate simulator with DSPLIB and IMGLIB @ 1.25 GHz.
Why the C66x core: similar process node (40 nm), similar VLIW micro-architecture, similar logic resources (SIMD fixed & float).
Results can be improved further through microcode optimization.
24. Where Does the Power Efficiency Come From?
Analysis
Hardware effort + software effort = a matched datapath.
[Figure: instruction composition (MReg access / computation / load-store) of the benchmarks Cplx SP FFT, Cplx FP FFT, Matrix Mul, 2D Filter, FIR, Table Lookup and Matrix Trans, each normalized to 100%.]
Average energy consumed by different FUs (512-bit SIMD), in pJ:
• Register R/W: 133.25
• Load/Store: 609.2
• FALU: 345.65
• IALU: 335.18
• FMAC: 387.23
• IMAC: 788.77
• SHU: 213.04
Little effort was spent on circuit optimization, but energy-inefficient memory accesses are minimized:
• Most energy is consumed by computation
• Little energy is consumed by control logic: only 0.24% for microcode fetching, dispatching & storage
25. Conclusion—Contribution
Proposed and verified a highly customizable architecture with a real chip
• For the architecture designer: FU sets are highly customizable for different workloads.
• For the programmer: the data path is highly customizable for different workloads through microcode.
Proposed and verified a novel memory system that supports row- and column-wise matrix access
• Parallel access without conflict; potential usage in applications with regular access patterns.
• Can be integrated into other architectures.
26. Conclusion—Future Work
High-level programming model
• Great effort is needed to program MaPU
• About 1 month per micro-benchmark
Circuit-level optimization
• Only the register file is custom-designed
• Standby power is high (1.5 W; the clock network needs improvement)
Customized FU sets for different workloads
• FU sets for communication applications
• FU sets for multimedia processing
• FU sets for machine learning
• …
27. Thanks & Questions?
Feel free to contact
shaolin.xie@ia.ac.cn or
Shawnless.xie@gmail.com
Editor's Notes
Our work is a novel mathematical computing architecture called MaPU (pronounced "Ma-PU").
First I would like to introduce the MaPU architecture briefly, then some interesting features of MaPU. Next I will introduce the chip and the test results.
Here is the initial idea of MaPU. As Moore's law is still effective, tremendous numbers of transistors can be used for computing, and more complex computation can be supported directly by hardware. For example, in the early days only fixed-point addition was supported, but now wide SIMD vector units and CGRAs are common in processors. We try to push this trend a step further.
Therefore, we started our work with a very simple idea: try to map mathematical primitives directly onto programmable hardware, so as to boost performance while reducing power. For example, map the FFT and matrix multiply onto a configurable data path. To achieve this goal, we have worked on two aspects: one is computing, and the other is storage.
OK, this is the simplified diagram of the MaPU architecture. It is made up of three main components: a scalar pipeline, a microcode pipeline, and a multi-granularity parallel storage system.
The scalar pipeline is used to control the microcode pipeline. The microcode pipeline contains massive function units, and all of the FUs are connected by a forwarding matrix. They operate in SIMD manner and are controlled by microcodes.
The multi-granularity parallel storage system supports simultaneous matrix row and column access with the same layout; we will explore this feature in more detail later on.
The microcode pipeline has many features in common with coarse-grained reconfigurable architectures, but it uses a tightly coupled forwarding matrix instead of dedicated routing units to forward data. With this forwarding matrix, FUs can cascade into a compact data path that resembles the data flow of the algorithm. Therefore, it may provide performance and power efficiency comparable with that of an ASIC.
OK, next I would like to introduce some interesting features of MaPU. The first is the MGP storage system. It supports simultaneous matrix row- or column-wise access with the same layout.
An MGP memory with a W-byte interface has the following requirements:
First, it requires a granularity parameter G, which should be a power of 2 between 1 and W (i.e. 2^g with g ranging from 0 to log2(W)).
Second, it requires W physical memory banks, each able to read/write W bytes in parallel.
When accessed, the physical memory banks cascade and group into logic banks according to the granularity parameter G, and all logic banks have the same address map and are accessed with the same address.
Here is an illustration of the MGP memory. We suppose W is 4. When accessed, G consecutive physical memory banks cascade into a logic bank, and each logic bank reads/writes G bytes.
For example, when G = 1, there are 4 logic banks and each logic bank reads 1 byte. When G = 2, 2 physical memory banks cascade into a logic bank, and each logic bank accesses 2 bytes. When G = 4, there is only one logic bank, and the MGP memory degenerates into an ordinary memory system with a W-byte interface.
The MGP memory system supports simultaneous matrix row- and column-wise access with the same layout, but the matrix has to be initialized properly. A simple rule is that the ith row is placed in the logic bank labeled i % W, and the rows in the same logic bank should be consecutive.
For example, the first row is placed in logic bank 0, the next row in logic bank 1, and the fifth row in logic bank 0 again.
When accessing row elements, set G to 4; then the row elements can be accessed in parallel as in an ordinary memory system. When accessing column elements, set G to 1; there are then four logic banks in the system, each providing 1-byte access, so the column elements can be accessed in parallel. For example, elements 0, 5, 10, 15 of the first column can be accessed in one cycle.
The second feature of MaPU is the high-dimension data model. In the microcode pipeline, only high-dimension vector data are supported. Each dimension of the data is represented by a configuration register set, which contains the base address, the address stride and the total number of elements.
The address is calculated automatically, so the address calculation for the matrix elements mentioned before is greatly simplified at the program level.
The third feature of MaPU is the cascading pipeline with a state-machine-based program model.
In this model, each function unit is programmed as a simple state machine.
Using these state machines it is easier to map the mathematical equations to the hardware, and all the timing of the pipeline is highly predictable.
With the dedicated state machines, all the FUs can cascade into a complete data path that suits the algorithm. For example, the FFT dataflow can be represented as state machines and then mapped into microcode lines. Here is the dataflow of the FFT, here is the conceptual microcode of the state machine, and here is the conceptually mapped pipeline.
A SoC with 4 MaPU cores was designed and tested to prove the advantages of this architecture. Next I would like to introduce this chip.
The MaPU core is called APE, which stands for Algebraic Process Engine. An ARM core is used as the host processor, and high-speed I/Os such as PCI Express, RapidIO and DDR3 controllers are also included. Low-speed I/Os such as GPIO and UART are used for debug and test.
All of these modules are connected by a three-level bus.
The APE core was designed with 10 FUs and 14 microcode slots. Each FU supports 512-bit operations. These FUs include an integer ALU, an integer MAC, a floating-point ALU and a floating-point MAC. Two shuffle units and 3 load/store units were designed for data movement. A large matrix register file with self-indexing capability is also included as a high-speed buffer. There are six MGP memories, 2 Mbit each, which can be used in a ping-pong style.
Most of the MaPU tool chain is based on open-source frameworks. The compiler, assembler and disassembler are based on LLVM, the linker on Gold, the debugger on GDB and the simulator on Gem5. The emulator is based on the open-source hardware tool OpenOCD.
The SoC was implemented in a 40 nm low-power process. Here is the layout of the SoC: these are the four APE cores, between them is the bus matrix, here are the high-speed I/O controllers, and here are the ARM core and other IPs. The total area is 363 mm²; the APEs occupy less than half of it.
This is the layout of the APE core. Here are the computation units, here are the shuffle units, and here is the forwarding matrix, which is spread over this area. Here are the MGP memory and the scalar pipeline.
The total area is 36 mm². Most of the area is occupied by the MGP memory, whose size is 12 Mbit.
The microcode pipeline runs at 1 GHz in the typical case. The other modules in APE run at only 500 MHz.
OK, next, I will introduce the test results and the analysis of the real chip.
Here is a picture of the real chip and the test board. The packaged chip is about 4 cm × 4 cm.
We optimized some micro-benchmarks on the APE core and compared its performance with the TI C66x core. We chose this core because it has a similar VLIW microarchitecture, process node and logic resources to the APE core. The statistics for APE were collected from the real chip running at 1 GHz, and the statistics for the TI core were collected from the official cycle-accurate simulator with the official DSP and image libraries. The TI core ran at 1.25 GHz during the simulation.
We can see that APE runs at a lower frequency but outperforms the TI core several times over on the FFT, matrix multiplication and filter algorithms. Furthermore, the performance of APE can be improved through microcode optimization if needed.
Before taping out the chip, the power of APE was estimated while it ran different benchmarks, including fixed-point & floating-point FFT, matrix multiplication, 2D filter, FIR, table lookup and matrix transpose.
After the chip returned from the foundry, we measured its power while it ran at 1 GHz.
The typical power consumption of APE is about 2 to 4 W. We can see that the simulated power and the measured power are almost the same; the maximum difference is only 8%. This indicates that the following power analysis, which is based on simulation, is highly reliable.
We also noted that the standby power of APE is as high as 1.5 W. This should be improved in the future.
We also collected the instruction statistics of the different micro-benchmarks, as shown in this table. Based on the power, execution time and instruction counts, we can calculate the dynamic power efficiency of APE, as shown in this diagram.
We can see that the maximum power efficiency for floating-point algorithms is 45 GFLOPS/W, with the FFT, and the maximum power efficiency for fixed point is 103 GOPS/W, with the 2D filter.
Comparing the power efficiency with contemporary processors at a similar process node, including the Core i7 and Xeon Phi from Intel and a Tegra GPU from Nvidia, we can see that the APE core achieves about a two- to four-times improvement on the optimized benchmarks.
Though little effort was made on circuit optimization, MaPU has shown impressive power efficiency. The power efficiency comes from two parts. One is the hardware effort: this includes the cascading pipeline and the MGP memory system, which give the programmer the opportunity to customize the data path to the algorithm. The other is the software effort: we try to minimize the energy consumption based on these hardware features.
Combining the hardware and software efforts, the MaPU processor forms a matched data path for each algorithm. In this matched data path, most energy is consumed by computation. This diagram shows the average energy consumed by different operations. We can see that load/store consumes much more energy than computation and register operations (except the IMAC, which should be improved), so memory access operations should be minimized.
This effort can be seen in this diagram, which shows the operation composition of the different benchmarks. The purple bars stand for register file access, the orange for computation, and the yellow for load/store operations. Thanks to the MGP memory system, memory operations are minimized in all micro-benchmarks. In fact, the energy consumed by the memory is only 8%, 6%, 3%, 3%, 2%, 3% and 15% for the different benchmarks respectively, and the energy consumed by control logic, including microcode storage, fetching and dispatching, is only 0.24%.
We think our work makes the following contributions:
First, we proposed and verified a highly customizable architecture with a real chip. The customization can be made in two aspects: for the architecture designer, the FU sets of MaPU are highly customizable for different workloads; for the programmer, the data path is highly customizable for different workloads through microcode.
The second contribution is the proposed and verified memory system that supports row- and column-wise matrix access with the same layout. It provides parallel access without conflict, has potential usage in applications with regular access patterns, and can be integrated into other architectures.
We think the following work can be done to improve MaPU.
First, and most important, is a high-level programming model.
Though the state-machine-based program model provides great performance, great effort is needed to program MaPU. It takes about 1 month to optimize a micro-benchmark, even for those who are familiar with MaPU.
Second, we need circuit-level optimization of the chip. When we implemented the SoC, only the register file was custom-designed, so the standby power is very high; this should be improved.
Third, as a proof-of-concept chip, the SoC was designed for general computation. But MaPU itself is a highly customizable architecture, so it is very promising to customize FU sets for different workloads to get better performance, for example FU sets for communication applications, FU sets for machine learning, etc.