A NoC Based Distributed Memory Architecture with Programmable and Partitionable Capabilities

Mohammad Adeel Tajammul, M. A. Shami, A. Hemani
School of ICT, Royal Institute of Technology, Stockholm, Sweden
{tajammul, shami, hemani}@kth.se

S. Moorthi
Dept. of EEE, NIT, Tiruchirappalli, India
srimoorthi@nitt.edu
Abstract: The paper focuses on the design of a Network-on-Chip based programmable and partitionable distributed memory architecture which can be integrated with a Coarse Grain Reconfigurable Architecture (CGRA). The proposed interconnect enables better interaction between the computation fabric and the memory fabric. The system can modify its memory-to-computation element ratio at runtime. The extensive capabilities of the memory system are analyzed by interfacing it with a Dynamically Reconfigurable Resource Array (DRRA), a CGRA. The interconnect can provide multiple interfaces, which support up to 8 GB/s per interface.

Keywords: Distributed Memory System; DSM; Network on Chip; NoC; Coarse Grain Reconfigurable Architecture; CGRA

I. INTRODUCTION

With the increasing integration of distributed processing elements, the Network-on-Chip (NoC) is considered a promising interconnect scheme [1]. An on-chip interconnect network can provide high-bandwidth, low-latency, scalable and reliable communication. Usually, NoCs are accompanied by large amounts of memory to support parallel transactions [2]. The exploration of new memory architectures is therefore a need for any Network-on-Chip.

This paper presents an on-chip distributed memory architecture for the PHY layer of DRRA, with the following salient features:

Distributed: DRRA being a fabric, the computation is distributed across the chip [3, 4] and runs several applications in parallel. With distributed memory, the proposed design enables multiple private and parallel execution environments (PREX).

Partitioning: Partitioning is also a distributed exercise and happens in parallel. The proposed system also enables runtime re-partitioning.

Streaming: Individual partitions composed of memory banks (mBanks) should be able to act as a unit and stream data to the computational units. Elasticity can be introduced in the streaming by adjusting the delay values.

Performance and Energy: The distributed nature of the memory architecture and the concept of private execution environments enable a short distance between storage and computation, which in turn contributes to low latency. Further, it enables effective power management by allowing unused mBanks to be shut down or put into a low-power mode.

Scalability: The DiMArch is scalable with the size of memory partitions and clock frequency. The circuit-switched segments of the data Network-on-Chip (dNoC) can be optionally pipelined.

II. RELATED WORK

Reconfigurable architectures of the past decade are investigated for memory organization in [5]. Lambrechts et al. [6] investigate power and performance for multiple forms of interconnects. The design proposed in this paper is very similar to the b-neg design in [6]; the programmable pipelining and private partitioning differentiate the proposed design from [6]. Further, the control traffic is routed over the bus network, as it has lower traffic, while the data traffic is routed over the crossbar with programmable pipelined interconnects. Memory systems for MP-SoCs can be either cache or scratch-pad based. Marescaux [7] provides a case where scratch-pad memory systems behave superior to cache-based systems.

SMARTCELL [8] describes an innovative reconfigurable architecture with distributed data and instruction memory. These memory units are directly connected to processing elements. Each mBank is 1 K and works as a scratch-pad memory. The MONTIUM tile processor [9] has a local memory for each tile processor. Each tile processor is then connected to a Network-on-Chip (NoC) which can bring data from an on-chip memory via an AHB-bridge interface. The local memory per tile remains fixed in each MONTIUM implementation [9][10].

The DRRA architecture with the integrated memory system is discussed in sections III and IV. The proposed memory architecture can alter its memory-to-computational fabric ratio by partitioning, as discussed in section V. Furthermore, if the system is running with a variable clock, it can alter its critical path accordingly, as discussed in section VI.

This research is funded by the Higher Education Commission of Pakistan.
III. DRRA ARCHITECTURE

DRRA is a Coarse Grain Reconfigurable Architecture (CGRA) capable of hosting multiple, complete radio and multimedia applications. It has resources for the physical layer (PHY layer), protocol processing layers (PP layer), application and system control, and runtime management. The DRRA fabric for the PHY layer has been implemented in [3][4] and is shown in Figure 1 along with the proposed memory system. A single DRRA cell is composed of a morphable DataPath Unit (mDPU), a Register File (RFile), a sequencer and an interconnect scheme gluing these elements together. Presently, the storage of the DRRA fabric is restricted to the RFiles, which are 64 words of 16 bits.

mDPUs are native 16-bit integer units with four 16-bit inputs corresponding to two complex numbers and two 16-bit outputs corresponding to one complex number. The mDPU also has two comparators, one for each output, and a counter. The results of the comparators, the counter and the overflow/underflow flags are logged in a status word read by the sequencer. The mDPU can do saturation, truncation/rounding, and overflow/underflow checks. The end result bit-width can be configured to be anything from 8 to 16 bits. RFile, the DRRA Register File, is a 64-word, 16-bit register file with dual read and write ports. RFile has a DSP-style AGU (Address Generation Unit) with vectorised, circular-buffer and bit-reverse addressing that is useful in implementing FFT.

The Sequencer is a micro-coded sequencing machine that controls a single mDPU, an RFile and the switchbox. Sequencers can be daisy-chained to allow a single sequencer to control adjoining sequencers within the sliding-window reach. This concept is used to implement a hierarchy of controllers, for instance to implement the Rx/Tx FSMs of a MODEM or the encode/decode FSMs of a CODEC. With the elastic streaming capability of the RFile together with the proposed memory architecture (described in section IV), the sequencers provide the capability to implement chained elastic streaming functionalities that match very well the nature of most PHY layers for radio and multimedia applications. The fabric can be as large as the die allows; several thousand DRRA cells can be accommodated in a 45 nm, 300 mm2 die. Figure 1 shows only a fragment for clarity.

Fig. 1 DRRA Architecture with memory fabric
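For illustration, the bit-reverse addressing mode that such an AGU uses for FFT can be sketched in a few lines of Python; this is a minimal model of the addressing pattern only, and the function and the 8-point example are ours, not DRRA code.

```python
def bit_reverse(addr: int, bits: int) -> int:
    """Reverse the low 'bits' bits of addr, e.g. 0b001 -> 0b100 for bits=3."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (addr & 1)
        addr >>= 1
    return result

# For an 8-point FFT held in the 64-word RFile, the AGU would emit the
# input samples in bit-reversed order:
print([bit_reverse(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```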
IV. DISTRIBUTED MEMORY ARCHITECTURE

The proposed Distributed Memory Architecture (DiMArch) for DRRA is composed of (a) a set of distributed memory banks (mBanks), (b) a circuit-switched data Network-on-Chip (dNoC) that transports data between mBanks and RFiles (DRRA Register Files), and (c) a packet-switched instruction Network-on-Chip (iNoC), a NoC and bus hybrid used to create partitions, program mBanks to stream data, and transport instructions from the sequencers to the instruction switches (iSwitch).

A. Memory Banks (mBank): The distributed memory banks are SRAM macros, typically 2 to 4 KB, a design-time decision, as the goal is to align mBanks with the columns of the DRRA fabric. mBanks are controlled by mFSMs, state machines that also act as the interface between the mBanks and the data switches (dSwitch). mFSMs act as programmable address generation units with a general timing model. They implement single reads/writes and vectorized reads/writes with programmable address offset, circular-buffer and bit-reversed addressing. mFSMs also provide a general-purpose timing model using three delays: an initial delay before a loop, an intermittent delay before every read/write within a loop, and an end delay at the end of the loop before repeating the next iteration. These delays are used to
synchronize the memory-to-register-file streams with the computation. Individual delays can be changed depending on the intermediate results of the computation, which makes the streaming elastic. mFSMs are programmed via the iNoC with special instructions.
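For illustration, the three-delay timing model described above can be sketched as follows; this is a minimal Python model, and the generator and its parameter names are assumptions made for the sketch, not the actual mFSM instruction format.

```python
def mfsm_vector_access(base, count, offset, init_delay, mid_delay, end_delay,
                       iterations=1):
    """Yield (cycle, address) pairs for a vectorized read/write under the
    three-delay timing model: an initial delay before the loop, an
    intermittent delay before every access within the loop, and an end
    delay at the end of the loop before the next iteration."""
    cycle = init_delay                    # initial delay before the loop
    for _ in range(iterations):
        addr = base
        for _ in range(count):
            cycle += mid_delay            # intermittent delay per access
            yield cycle, addr
            addr += offset                # programmable address offset
        cycle += end_delay                # end delay before repeating

# A 4-word stream with stride 2 and one stall cycle between accesses:
print(list(mfsm_vector_access(base=0, count=4, offset=2,
                              init_delay=3, mid_delay=1, end_delay=2)))
```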
B. Data Network-on-Chip (dNoC): The dNoC is a half-duplex, circuit-switched mesh Network-on-Chip. The streaming nature of the applications, the inherent QoS guarantees and the improved latency compared to a packet-switched network were the motivations for using a circuit-switched network. A memory partition together with a computation partition is called a private execution environment (PREX). The interface between a memory and a computational partition is as wide as the number of RFiles involved; the width here implies the number of dNoC connections. Each dNoC connection is 256 bits wide; this width can be changed, as it is a generic VHDL parameter in a template. Since the data traffic at each RFile/dNoC (RFMI) interface can only be read or written, half-duplex interconnects are proposed. The dNoC is realized as a mesh network of dSwitches. As shown in Figure 2, each dSwitch is made up of five dSwitch cells (dCells) serving the N, E, W, S and mBank directions. Each dCell has four inputs coming from the other four directions; one of these four inputs is multiplexed out in the output mode, while in the input mode data from the associated direction enters the dCell. The bidirectional I/O is optionally buffered to cope with long wires and to provide the flexibility to implement the planned Dynamic Voltage Frequency Scaling. cFSMs control the temporal behaviour of the dSwitch. They are essential to make multiple mBanks behave as a contiguous memory. Figure 3 shows an example of a memory partition made up of three mBanks A, B and C that bring data to a single Register File (RFile) for processing and also take the data back to the mBanks once processed.

Fig. 2 dSwitch (five dCells for the mBank, South, West, East and North directions, each with an input mux IMUX/ISEL, a pipeline register REG, a path mux PMUX/PSEL and an I/O select IOSEL)

The cFSMs associated with each dSwitch are programmed to time-multiplex the path to and from the register file in a coordinated way, so that it appears as if the RFile is reading from/writing to one large contiguous memory. The compiler ensures that the computation is synchronised with the behaviour of the cFSMs controlling the memory transactions. This works fine for signal processing applications with deterministic, cyclo-stationary behaviour. The ability to partially reprogram these streams allows them to be elastic as well. The DRRA sequencers have the hooks to chain these elastic streams, but the present DiMArch does not support chained elastic streams. The architecture can also deal with the degenerate case of nondeterministic, random individual memory transactions, like a normal processor; this case will obviously not benefit from the efficiency of the autonomic (elastic) streaming capability of the cFSMs in DiMArch. cFSMs are programmed by special instructions via the iNoC.

C. Instruction Network-on-Chip (iNoC): The iNoC is a packet-switched network used in DiMArch to program the cFSMs and mFSMs, as packet-switched networks are well suited to short programming messages where the life of a certain path is very short. It also includes the ability of a packetized network to reach any node of the DiMArch from any sequencer. The agility of programming or reconfiguring DiMArch's partitions and behaviours is a key goal of the DRRA architecture, to make it dynamically reconfigurable. To achieve this agility,
while retaining the generality of a packet-switched network, two architectural measures have been taken. The first is that the horizontal and vertical segments of the iNoC are a hybrid of bus and NoC behaviours. Any message asserted on an iSwitch is broadcast along the entire length of its vertical segment, behaving like a bus, as the broadcast happens in a single cycle. Every iSwitch on the vertical segment analyzes the message in parallel to check whether the message address is on its associated horizontal segment; if it is, a second broadcast happens on the horizontal segment. Again, every iSwitch on the horizontal segment listens to the broadcast and analyzes whether the message is addressed to it; if it is, it forwards the message to the zFSM, which analyzes it and acts on it appropriately. By having a bus-like behaviour, the message is broadcast in a single cycle, i.e., each iSwitch can be reached in two cycles. The second measure is the partitioning capability, which is explained in section V.
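For illustration, the two-phase broadcast can be modelled in a few lines of Python; the function below is a sketch of the addressing scheme only, with assumed grid coordinates, not the iSwitch logic itself.

```python
def inoc_broadcast(src_col, dest, width, height):
    """Model of the two-phase iNoC broadcast.  Cycle 1: the message is
    asserted on the source column and every iSwitch on that vertical
    segment sees it (bus-like, single cycle).  Cycle 2: the iSwitch
    whose row matches the destination re-broadcasts on its horizontal
    segment, so any iSwitch is reachable in two cycles."""
    dest_col, dest_row = dest
    cycle1 = {(src_col, row) for row in range(height)}   # vertical bus
    cycle2 = {(col, dest_row) for col in range(width)}   # horizontal bus
    assert dest in cycle2       # the addressed iSwitch accepts the message
    return cycle1, cycle2

# On a 3x3 grid (cf. Figure 4), a message from the sequencer above
# column 0 to iSwitch (2, 1) is seen by column 0 in cycle 1 and by
# row 1 in cycle 2:
listeners_c1, listeners_c2 = inoc_broadcast(src_col=0, dest=(2, 1),
                                            width=3, height=3)
```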
Fig. 3 Single memory partition (dSwitches + cFSMs streaming mBanks A, B and C to a single RFile)

Fig. 4 Private Partitioning (a 3x3 grid of iSwitches (0,0) to (2,2), with Sequencers 0, 1 and 2 attached above columns 0, 1 and 2)

V. PRIVATE PARTITIONING

As in Figure 4, consider three sequencers which need access to multiple memory banks (mBanks). mBanks in the first row have dedicated access from the sequencer in the same column. All splitters are open at the start. Sequencers 1 and 2 issue an instruction for their respective instruction switch (iSwitch) to close the vertical splitter for top-to-down access; the horizontal splitters are set to remain open. An instruction switch takes one cycle to process this instruction. After a wait of one cycle, Sequencers 1 and 2 can now access the iSwitches in the second row (row 1). This one-cycle wait can be used to configure the mFSMs/cFSMs for the required traffic patterns. Sequencer 1 issues another instruction to iSwitch (1, 1) to get access to iSwitch (1, 2), closing the vertical splitter top-down. Then Sequencer 1 instructs iSwitch (1, 2) to close the horizontal splitter right-left between iSwitch (0, 2) and iSwitch (1, 2). At this point all sequencers have access to their desired private partitions. At run-time, Sequencer 2 can gain access to iSwitch (2, 2) to allocate more memory for additional memory requirements. However, since the proposed architecture does not provide memory locks, all access conflicts are resolved at compile time. When two PREX need a shared memory space, a shared iSwitch is specified; e.g., Sequencer 0 and Sequencer 1 can specify iSwitch (0, 1) as a shared space. In that case, iSwitch (0, 1) will receive instructions from both PREX.
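For concreteness, the walk-through above can be written down as the ordered list of splitter instructions the sequencers issue; the tuple format and instruction names in this Python sketch are hypothetical, only the sequence of splitter closures follows the text.

```python
# The partitioning walk-through as an ordered instruction trace
# (coordinates as in Figure 4).
partition_program = [
    # Sequencers 1 and 2 claim the mBanks below them (1 cycle each).
    ("Seq1", "close_vertical_splitter",   ("iSwitch(1,0)", "iSwitch(1,1)")),
    ("Seq2", "close_vertical_splitter",   ("iSwitch(2,0)", "iSwitch(2,1)")),
    # One-cycle wait, usable to configure the mFSMs/cFSMs, then
    # Sequencer 1 extends its partition one row further down ...
    ("Seq1", "close_vertical_splitter",   ("iSwitch(1,1)", "iSwitch(1,2)")),
    # ... and seals it off from the neighbouring column.
    ("Seq1", "close_horizontal_splitter", ("iSwitch(0,2)", "iSwitch(1,2)")),
]
for seq, instr, splitter in partition_program:
    print(f"{seq}: {instr}{splitter}")
```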
VI. PROGRAMMABLE PIPELINING

Consider an example where three RFiles are connected to nine MTiles in a 3x3 configuration. For a single block transfer from mBank to RFile, the mBank data is first sent to a dSwitch. The pipelined mode is used when the PMUX is programmed to use the register in its path (see Figure 2); the pipelined path is omitted in Single Cycle Multi-Hop Transfer (SCMHT) mode. The concerned dSwitch routes the data to the neighbouring dSwitch, and at the destination the data is directly loaded into the RFile from the neighbouring dSwitch. The number of cycles of each transfer is therefore not always equal to the number of hops: the cycle count may be reduced if any dSwitch is in SCMHT mode. The critical path for a single-cycle transfer for any given wireload model is variable; an increase in the number of hops will increase the critical path exponentially. So, increasing the maximum number of hops in SCMHT mode will reduce the clock speed of the system at the same rate. Hence, the number of hops should only be increased in cases where the gain of SCMHT mode is more than the degradation due to the lower clock frequency.
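This trade-off can be made concrete with a small back-of-the-envelope model; the clock period and path-growth figures in the following Python sketch are illustrative assumptions, not measured values, and the comparison is per transfer only (in the real system the slower clock affects the whole fabric).

```python
def transfer_time_ns(hops, max_scmht_hops, base_period_ns, growth_per_hop):
    """Compare pipelined mode (one cycle per hop at the base clock) with
    SCMHT mode (one cycle per max_scmht_hops hops, but with a clock
    period stretched by the longer combinational path)."""
    pipelined = hops * base_period_ns
    scmht_period = base_period_ns * growth_per_hop ** (max_scmht_hops - 1)
    scmht_cycles = -(-hops // max_scmht_hops)        # ceiling division
    return pipelined, scmht_cycles * scmht_period

# Assuming a 2.5 ns base clock and 30% path growth per extra hop,
# 4-hop SCMHT still pays off for an 8-hop transfer (20.0 vs ~11.0 ns):
print(transfer_time_ns(hops=8, max_scmht_hops=4,
                       base_period_ns=2.5, growth_per_hop=1.3))
```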
VII. SYSTEM COSTS AND OVERHEADS

The system has three types of costs and overheads in terms of cycles. 1. Computational Cost (C_Computation), the time spent (in cycles) processing data in the DPU; it depends on the mode of the mDPU and the data width and is given by the equation:

C_Computation = (N_Sample / 2) + O_Computation    (1)

where N_Sample is the number of samples and O_Computation is the overhead of computation.

O_Computation = C_Pipeline + C_LoadStore    (2)

where C_LoadStore is the RFile load/store cost and C_Pipeline is the cost of the pipeline, which changes with each mDPU mode. 2. Reconfiguration Overhead, the time required to reconfigure the interconnect partitioning. This time is directly proportional to the number of splitters to be programmed; programming a single splitter takes three cycles (instruction identification, decoding, and partition set/reset). 3. Interconnect Overhead, the amount of time (in cycles) it takes for the data to move between mBank and RFile.

The computation interconnect reconfiguration cost (C_CIR) is the cost of reconfiguration for the computational fabric, which directly depends on the number of interconnects to reconfigure and the cost of reconfiguration. This cost can have a maximum value of six cycles.

C_CIR = N_Interconnect * C_Comp.Reconf    (3)

The mBank interconnect reconfiguration cost (C_MIR) is directly proportional to the number of instructions used to reconfigure the memory partitions (C_PR).

C_MIR = N_Instruction * C_PR    (4)

Reconfiguration is performed when 1. a new mBank is to be allocated, 2. the instruction partitioning interconnects are reconfigured to change the direction of instruction flow, or 3. the computational fabric interconnects are reconfigured. The interconnect cost (C_Interconnect) depends on the data transfers and is represented as:

C_Interconnect = N_Transfers * C_NumOfHops    (5)

where N_Transfers is the number of transfers and C_NumOfHops is the cost in cycles per data transfer.
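Equations (1) to (5) can be collected into a small cost model; the Python sketch below simply restates them in code, and the operand values in the example are illustrative assumptions rather than measured parameters.

```python
def computation_cost(n_samples, c_pipeline, c_load_store):
    """Eqs. (1)-(2): cycles spent processing a block in the mDPU."""
    o_computation = c_pipeline + c_load_store    # Eq. (2)
    return n_samples // 2 + o_computation        # Eq. (1)

def comp_interconnect_reconf_cost(n_interconnect, c_comp_reconf):
    """Eq. (3): computation interconnect reconfiguration cost (C_CIR)."""
    return n_interconnect * c_comp_reconf

def mbank_interconnect_reconf_cost(n_instruction, c_pr):
    """Eq. (4): mBank interconnect reconfiguration cost (C_MIR)."""
    return n_instruction * c_pr

def interconnect_cost(n_transfers, c_num_of_hops):
    """Eq. (5): cycles to move the data between mBank and RFile."""
    return n_transfers * c_num_of_hops

# Illustrative example: a 64-sample block with assumed overheads.
print(computation_cost(64, c_pipeline=6, c_load_store=2))   # 40 cycles
```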
VIII. CASE STUDIES

A. Mapping a one-dimensional point FFT

An implementation of the radix-2 FFT butterfly was carried out with four mDPUs used to perform two butterfly operations (one real and one imaginary) [1]. The operations are pipelined and performed in six cycles. Depending on the data access patterns, the butterfly can be reused to implement various stages of the FFT. Hence, an algorithm can be defined for the traffic pattern between RFile, mBank and mDPU. In this case, the FFT traffic is manually mapped. A maximum of sixteen butterflies is used in parallel. A single butterfly is fed with data samples and twiddle factors from RFiles and can perform 32 operations in 40 pipelined cycles. Between each FFT stage, reordering and reconfiguration is performed by a common three-stage data transfer. If the number of butterfly operations per stage is more than the number of available butterflies, the additional butterflies are processed serially. The interconnect reconfiguration is performed if re-ordering is required between neighbouring RFiles [2]. During reordering the mBank truly behaves as a scratch-pad memory, where intermediate data is stored.

The first step in this mapping is loading the correct data into the correct RFile. Data is loaded from the memory to all the RFiles; by keeping the correct order at the instruction level, the data is picked up by the correct RFile. Twiddle factors are also loaded into the RFile, which acts as a look-up table (LUT).

Figure 5 FFT Throughput

Figure 5 extends the results of [11] to the DRRA architecture. To keep the comparison fair, an equal number of computational elements is used. The last two results for SMARTCELL and FPGA are interpolated based on [11]. The proposed system outperforms the others by an order of magnitude more than the expected error in the interpolation. For FFTs larger than 512 points, SmartCell is 1.24 to 1.36 times slower than DRRA. Implementations of FFTs smaller than 512 points on such parallel systems do not exploit locality. Figure 6 illustrates that the data for small-sized FFTs (less than 512 points) spends more time in motion than in computation: the memory-data interconnect overhead takes more than, or comparable, time to the time spent in computation. This can be further elaborated by the fact that it takes marginally fewer cycles to compute the 64-point FFT using a single butterfly (real and imaginary) set; such a case uses 16 times fewer resources compared to the case presented (16 butterflies).

Figure 6 FFT Overhead
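As a sanity check on the butterfly scheduling described above, the following Python sketch counts the serial passes per FFT stage, assuming a radix-2 FFT (N/2 butterfly operations per stage over log2 N stages) and the sixteen parallel butterflies mentioned in the text; the function itself is ours, not the DRRA mapper.

```python
import math

def fft_stage_passes(n_points, parallel_butterflies=16):
    """Radix-2 FFT: each of the log2(N) stages needs N/2 butterfly
    operations; with a limited number of butterfly units the surplus
    operations are processed serially in extra passes."""
    ops_per_stage = n_points // 2
    passes = math.ceil(ops_per_stage / parallel_butterflies)
    return int(math.log2(n_points)), ops_per_stage, passes

# A 64-point FFT: 6 stages of 32 butterfly operations, i.e. 2 serial
# passes per stage on 16 units; small enough that data movement rather
# than computation dominates (cf. Figure 6).
print(fft_stage_passes(64))   # (6, 32, 2)
```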
B. 2D Mapping vs McNoC [12]

The two-dimensional FFT is performed in two steps. First, row-wise FFTs are calculated; this is followed by the column-wise FFT calculation. Hence, a two-dimensional FFT is broken down into multiple one-dimensional FFTs. It is mapped using the same principles as in section VIII-A. In this experiment, the size of the FFT remains constant and the number of resources is increased. Furthermore, an extra step is added where a horizontal-to-vertical translation is performed (the decomposition is sketched below). The results of this mapping are given in Figure 7.

Fig. 7 Cycle count for 2D-FFT
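The decomposition can be stated compactly; the NumPy sketch below shows the principle (row-wise FFTs, a transpose standing in for the horizontal-to-vertical translation, then column-wise FFTs). It illustrates the mathematics only, not the DRRA mapping itself.

```python
import numpy as np

def fft2d_by_rows_and_columns(x):
    """2-D FFT decomposed as in section VIII-B: 1-D FFTs over the rows,
    a horizontal-to-vertical translation (here a transpose), then 1-D
    FFTs over what were the columns."""
    row_ffts = np.fft.fft(x, axis=1)         # step 1: row-wise FFTs
    translated = row_ffts.T                  # extra step: reorder rows/columns
    return np.fft.fft(translated, axis=1).T  # step 2: column-wise FFTs

x = np.random.rand(8, 8)
assert np.allclose(fft2d_by_rows_and_columns(x), np.fft.fft2(x))
```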
IX. CONCLUSION

This paper proposes a programmable and partitionable interconnect interface as a method of communication between mBanks and RFiles. The programmable interconnect supports pipelined or bufferless modes. The instruction interconnect also has a private partitioning capability, where multiple sequencers can communicate with different mBanks using different segments of the network. The controllers within the system help in providing patterns of data which can be routed based on the instructions.

The FFT and 2D experiments show the overhead of the interconnect compared to the computational cost. The results show that, for the given programmable interconnect, the best throughput is obtained when reasonable computational resources are utilized with good locality to the data. Moreover, gate-level synthesis results show that the system can run up to 400 MHz on 90 nm technology.

As a part of future work, another sequencer will be added which will act as a dedicated main controller for memory. A compiler to automate the process of mapping is also under development.

REFERENCES

[1] A. Jantsch and H. Tenhunen, "Networks-on-Chip", Kluwer Academic Publishers, 2003.

[2] W. J. Dally and B. Towles, "Principles and Practices of Interconnection Networks", Morgan Kaufmann Publishers, 2004.

[3] M. A. Shami and A. Hemani, "Morphable DPU: Smart and Efficient Data Path for Signal Processing Applications," in IEEE Workshop on Signal Processing Systems (SiPS'09), 2009, pp. 167-172.

[4] M. A. Shami and A. Hemani, "Partially Reconfigurable Interconnection Network for Dynamically Reprogrammable Resource Array," in IEEE 8th International Conference on ASIC (ASICON'09), 2009, pp. 122-125.

[5] M. Herz, R. Hartenstein, M. Miranda, and E. Brockmeyer, "Memory Addressing Organization for Streaming-based Reconfigurable Computing," vol. 2, 2002, pp. 813-817.

[6] A. Lambrechts, P. Raghavan, M. Jayapala, B. Mei, F. Catthoor, and D. Verkest, "Interconnect Exploration for Energy versus Performance Tradeoffs for Coarse Grained Reconfigurable Architectures," vol. 17, no. 1, January 2009, pp. 151-155.

[7] T. Marescaux, E. Brockmeyer, and H. Corporaal, "The Impact of Higher Communication Layers on NoC Supported MP-SoCs," in Proceedings of the First International Symposium on Networks-on-Chip (NOCS'07), 2007, pp. 107-116.

[8] C. Liang and X. Huang, "SMARTCELL: A Power-Efficient Reconfigurable Architecture for Data Streaming Applications," in IEEE Workshop on Signal Processing Systems (SiPS'08), 2008, pp. 257-262.

[9] G. Rauwerda, P. Heysters, and G. Smit, "Towards Software Defined Radios using Coarse-Grained Reconfigurable Hardware," vol. 16, no. 1, 2008, pp. 3-13.

[10] L. Smit, A. Molderink, P. Wolkotte, and G. Smit, "Implementation of a 2-D 8x8 IDCT on the Reconfigurable MONTIUM Core," in International Conference on Field Programmable Logic and Applications, 2007, pp. 562-566.

[11] C. Liang and X. Huang, "Mapping Parallel FFT Algorithm onto SMARTCELL Coarse-Grained Reconfigurable Architecture," in IEEE 20th International Conference on Application-specific Systems, Architectures and Processors, 2009, pp. 231-234.

[12] X. Chen, Z. Lu, A. Jantsch, and S. Chen, "Supporting Distributed Shared Memory on Multi-core Network-on-Chip using a Dual Micro-coded Controller", 2010, pp. 39-44.