SlideShare uma empresa Scribd logo
1 de 6
Baixar para ler offline
A NoC Based Distributed Memory
              Architecture with Programmable and
                   Partitionable Capabilities.
     Mohammad Adeel Tajammul, M. A. Shami, A. Hemani                                        S. Moorthi
          School of ICT, Royal Institute of Technology,                                     Dept. of EEE,
                  Stockholm, Sweden.                                                NIT, Tiruchirappalli, INDIA.
              (tajammul, shami, hemani)@kth.se                                         srimoorthi@nitt.edu

     Abstract: The paper focuses on the design of a                   Performance and Energy: The distributed nature of
     Network-on-chip based programmable and                           memory architecture and the concept of private
     partitionable distributed memory architecture                    execution environments enable a short distance
     which can be integrated with a Coarse Grain                      between storage and computation, which in turn
     Reconfigurable Architecture (CGRA). The                          contributes to low latency. Further, it enables effective
     proposed interconnect enables better interaction                 power management by allowing the unused mBanks
     between computation fabric and memory fabric.                    to shutdown or put into low power mode.
     The system can modify its memory to computation                  Scalability: The DiMArch is scalable with the size of
     element ratio at runtime. The extensive capabilities             memory partitions and clock frequency. The circuit-
     of the memory system are analyzed by interfacing                 switched segments of the data Network-on-Chip
     it with a Dynamically Reconfigurable Resource                    (dNoC) can be optionally pipelined.
     Array (DRRA), a CGRA. The interconnect can                                        II. RELATED WORK
     provide multiple interfaces which supports upto 8
                                                                          Reconfigurable architectures of the past decade are
     GB/s per interface.
        Distributed Memory System; DSM; Network on Chip;
                                                                      investigated for memory organization in [5].
     NoC; Coarse Grain Reconfigurable Architecture; CGRA;             Lambrechts et al. [6] investigate power and
                                                                      performance for multiple forms of interconnects. The
                       I. INTRODUCTION                                design proposed in this paper is very similar to b-neg
         With the advent increase in the integration of the           design in [6]. The programmable pipelined and
     distributed processing elements, Network-on-Chip                 private partitioning differentiates the proposed design
     (NoC) is considered to be a promising scheme [1].                from [6]. Further, the control traffic is routed over bus
     On-Chip interconnect network can provide high                    network as it has lower traffic. Data traffic is routed
     bandwidth, low-latency, scalable and reliable                    over the crossbar with programmable pipelined
     communication. Usually, the NoCs will have large                 interconnects. Memory systems for MP-SoCs can be
     amount of memories which supports the parallel                   either caches or scratch-pad based memory systems.
     transactions [2]. The exploration of new memory                  Marescaux [7] provides a case where scratch-pad
     architecture is the need for any Network-on-Chip.                memory systems behave superior to cache based
     This paper presents on-chip distributed memory                   systems.
     architecture for the PHY layer of DRRA, with the                     SMART CELL [8] describes an innovative
     following salient features:                                      reconfigurable architecture with distributed data and
     Distributed: DRRA being a fabric, the computation                instruction memory architecture. These memory units
     is distributed across the chip [3, 4] which runs several         are directly connected to processing elements. Each
     applications in parallel. With distributed memory, the           mBank is 1K and works as a scratch pad memory.
     proposed design enables multiple private and parallel                MONTIUM Tile processor [9] has a local
     execution environments (PREX).                                   memory for each tile processor. Each tile processor is
     Partitioning: Partitioning is also a distributed                 then connected to a Network-on-Chip (NoC) which
     exercise and it happens in parallel. The proposed                can bring data from an on-chip memory via AHB-
     system also enables runtime re-partitioning.                     bridge interface. The local memory per tile remains
     Streaming: Individual partitions composed of                     fixed in each MONTIUM implementations [9][10].
     memory banks (mBanks) should be able to act as a                     The DRRA architecture with integrated memory
     unit and stream data to the computational units. It can          system is discussed in sections III & IV. The proposed
     introduce elasticity in streaming by adjusting the               memory architecture can alter its memory to
     delay values.                                                    computational fabric ratio by partitioning, as
                                                                      discussed in section V. Further-more, if the system is
                                                                      running with variable clock, then it can alter its
         This Research is funded by the Higher Education Commission   critical path accordingly as discussed in section VI.
     of Pakistan



978-1-4244-8971-8/10$26.00 c 2010 IEEE
mDPUs are native 16-bit i   integer units with four
                                                         16-bit inputs corresponding to two complex numbers
                                                         and two 16-bit output correspo  onding to one complex
                                                         number. mDPU also has two comparators, one for
                                                         each output and a counter. The results of comparators,
                                                         counter and overflow, underflow are logged in a status
                                                                                         w
                                                         word read by sequencer. mDP can do saturation,
                                                                                          PU
                                                         truncation/ rounding, overflow, underflow check. The
                                                                                         ,
                                                         end result bit-width can be con nfigured to be anything
                                                         from 8 to 16 bits. RFile - the D DRRA Register File is
                                                         64 word 16 bit register file wi dual read and write
                                                                                         ith
                                                         ports. RFile has a DSP s        style AGU (Address
                                                         Generation Unit) with vectoris sed, circular buffer and
                                                         bit reverse addressing that is uuseful in implementing
                                                         FFT.
                                                             Sequencer is a micro-code sequencing machine
                                                                                         ed
                                                         that controls a single mDPU and a RFile and the
                                                         switchbox. Sequencer can be daisy chained to allow a
                                                         single sequencer to control adjoining Sequencers
                                                         within the sliding-window reach. This concept is used
                                                         to implement a hierarchy of con ntrollers, for instance to
                                                         implement Rx/Tx FSMs of a MODEM or
                                                         encode/decode FSMs of a C       CODEC. With elastic
                                                         streaming capability of RFile together with the
                                                         proposed memory architecture described in section
                                                                                        e(
                                                         IV) the sequencers provide the capability to
                                                         implement chained elastic str   reaming functionalities
                                                         that matches very well the natu of most PHY layers
                                                                                        ure
                                                         for radio and multi-media appli ications. The fabric can
                                                         be as large as the die allows; se
                                                                                         everal thousand DRRA
                                                         cells can be accommodated in a 45 nm, 300 mm2 die.
                                                         Figure 1 shows only a fragment for clarity.
                                                                                         t
                                                          IV. DISTRIBUTED MEMOR ARCHITECTURE
                                                                              RY
                                                             The proposed Distributed Memory Architecture
                                                         (DiMArch) for DRRA is com      mposed of (a) a set of
                                                         distributed memory banks (m    mBanks), (b) a circuit-
                                                         switched data Network-on-       -Chip (dNoC) (that
                                                         transports data between mBan and RFiles (DRRA
                                                                                        nks
                                                         Register Files)), (c) a packe   et-switched instruction
Fig. 1 DRRA Architecture with Mem
                                mory fabric              Network-on-Chip (iNoC), a No and bus hybrid used
                                                                                         oC
                                                         to create partitions, program m mBanks to stream data
                                                         and transport instructions fro sequencer to the
                                                                                          om
            III. DRRA ARCHITECTURE                       instruction Switch (iSwitch).
    DRRA is a Coarse Grain Reconfigurable
                                  n                      A. Memory Banks (mBan           nk): The distributed
Architecture (CGRA) capable of hosting multiple,         memory banks are SRAM ma        acros, typically 2 to 4
complete Radio and Multimedia a   applications. It has   KB, a design time decision, a the goal is to align
                                                                                         as
resources for physical layer (PHY layer), Protocol
                                  Y                      mBanks with the columns o the DRRA fabric.
                                                                                         of
Processing layers (PP layer), appli
                                  ication and system     mBanks are controlled by mF    FSMs - state machines
control and runtime management. The DRRA fabric          that also acts as interface betw ween mBanks and the
for the PHY layer has been imple  emented in [1] [2]     data Switches (dSwitch). mFSMs act as
and is shown in Figure 1 along w   with the proposed     programmable address generati unit with a general
                                                                                          ion
memory system. A single DRRA cell is composed of         timing model. They implement single read/write;
a morphable DataPath Units (mDPU a Register Files
                                  U),                    vectorized read/writes with p   programmable address
(RFile), a sequencer and an int   terconnect scheme      offset, circular buffer and bit reversed addressing.
                                                                                          t
gluing these elements together. Pre
                                  esently, the storage   mFSMs also provide a general purpose timing model
of DRRA fabric is restricted to RFFiles which are 64     using three delays, an initial ddelay before a loop, an
words of 16 bits.                                        intermittent delay before every read/ write within a
                                                                                          y
                                                         loop and an end delay at the e of the loop before
                                                                                          end
                                                         repeating the next iterations. Thhese delays are used to
synchronize the memory to register file streams with        processing and also take the data back to the mBanks
the computation. Individual delays can be changed           once processed.
depending on the intermediate results of the                cFSMs associated with each dSwitch are programmed
computation which makes streaming elastic. mFSMs            to time multiplex the path to and from register file in a
are programmed via iNoC with special instructions.          co-ordinated way so that it appears as if RFile is
    B. Data Network-on-Chip (dNoC): dNoC is a               reading from/writing to one large contiguous memory.
half-duplex circuit-switched mesh Network on-Chip.          Compiler ensures that the computation is
The streaming nature of applications, the inherent          synchronised with the behaviour of cFSMs controlling
QoS guarantees and improved latency compared to             the memory transactions. This works fine for the
packet-switched network were the motivations for            signal processing application with deterministic,
using circuit-switched network. A memory partition          cyclo-stationary behaviour. The ability to partially
together with a computation partition is called private     reprogram these streams, allows these streams to be
execution environment (PREX). The interface                 elastic as well. The DRRA sequencers have the hooks
between memory and computational partition is as            to chain these elastic streams but the present
wide as the number of RFiles involved; the width here       DiMArch does not support chained elastic streams
implies the number of dNoC connections, each dNoC
                                                            The architecture can deal with the degenerate case of
being 256 bit wide, can be changed as it is a
                                                            nondeterministic     random     individual     memory
GENERIC VHDL parameter in a template. Since the
                                                            transactions as well like a normal processor; this case
data traffic at each RFile/dNoC (RFMI) interface can
                                                            will obviously not benefit from the efficiency of
only be read or written, half-duplex interconnects are
                                                            autonomic (elastic) streaming capability of cFSMs in
proposed. dNoC is realized as a mesh network of
                                                            DiMArch. cFSMs are programmed by special
dSwitches. As shown in Figure 2, each dSwitch is
                                                            instructions via iNoC.
made up of five dSwitch cells (dCell) serving the N,
E, W, S and the mBank directions. Each dCell has                C. Instruction Network-on-Chip (iNoC): iNoC
four inputs coming from the other four directions; one      is a packet-switched network used in DiMArch to
of these four inputs is multiplexed out in the output       program the cFSMs and mFSMs, as the packet-
mode; in the input mode, data from the associated           switched networks are primarily used for short
direction enters the dCell. The bidirectional I/O is        programming messages and life of a certain path is
optionally buffered to cope with long wires and             very short. Also, it includes the feature of packetized
provide flexibility to implement the planned Dynamic        network to reach any node of the DiMArch from any
Voltage Frequency Scaling. cFSMs control the                sequencer. The agility of programming or
temporal behaviour of dSwitch. They are essential to        reconfiguring DiMArchs partitions and behaviors is a
make multiple mBanks behave as a contiguous                 key goal of the DRRA architecture to make it
memory. Figure 3 shows an example of a memory               dynamically reconfigurable. To achieve this agility,
partition made up of three mBanks A, B and C that
bring data to a single Register File (RFile) for




         IMUX      ISEL       IMUX       ISEL       IMUX      ISEL          IMUX                     IMUX
                                                                                       ISEL                      ISEL




                 REG                  REG                    REG                      REG                       REG



         PMUX     PSEL        PMUX     PSEL         PMUX      PSEL          PMUX                     PMUX       PSEL
                                                                                       PSEL



                IOSEL                IOSEL                  IOSEL                      IOSEL                  IOSEL



        mBank                South                   West                    East                    North
                                                 Fig. 2 dSwitch
while retaining the generality of packet-switching
network, two architectural measures have been taken.           Sequencer 0           Sequencer 1      Sequencer 2
The first is that the horizontal and vertical segments of
the iNoC are a hybrid of bus and NoC behaviours.
                                                                        iSwitch            iSwitch          iSwitch
Any message asserted on an iSwitch is broadcast
                                                                          (0,0)              (1,0)            (2,0)
along its entire length of vertical segment, behaving
like a bus as the broadcast happens in a single cycle.
Every iSwitch on the vertical segment analyzes the
message in parallel to check if the message address is                  iSwitch            iSwitch          iSwitch
on its associated horizontal segment and if it is, a                      (0,1)              (1,1)            (2,1)
second broadcast happens on the horizontal segment.
Again, every iSwitch on the horizontal segment
listens to the broadcast and analyzes if the message is
addressed to it and if it is, it forwards it to zFSM that               iSwitch            iSwitch          iSwitch
                                                                          (0,2)              (1,2)            (2,2)
analyzes it and appropriately acts on it. By having a
bus like behaviour, the message is broadcast in a
single cycle, i.e., each iSwitch can be reached in two
cycles. The second measure is the partitioning
capability which is explained in forthcoming section
                                                                             Fig. 4 Private Partitioning
V.
                           dSwitches + CFSMs
                                                                Then Sequencer 1 instructs iSwitch (1, 2) to close
            mBank                                           horizontal splitter right-left between iSwitch (0, 2)
             A                                              and iSwitch (1, 2). At this point all sequencers have
                                                            access to their desired private partitions. At run-time
                                                            Sequencer 2 can gain access to iSwitch (2, 2) to
                                                            allocate more memory for additional memory
            mBank                       mBank               requirements. However, since the proposed
             B                            C                 architecture does not provide memory locks, all
                                                            access conflicts are resolved at compile time. When
                                                            two PREX needs a shared memory space, then a
                                                            shared iSwitch is specified. e.g. Sequencer 0 and
                                                            Sequencer 1 can specify iSwitch (0, 1) as a shared
                                                            space. In that case, iSwitch(0,1) will receive
                       RFile                                instructions from both PREX.

            Fig. 3 Single memory partition                         VI. PROGRAMMABLE PIPELINING
                                                                Consider an example where three RFiles
           V. PRIVATE PARTITIONING                          connected to nine MTiles in 3x3 configuration. For a
                                                            single block transfer from mBank to RFile, mBank
    As in figure 4, consider three Sequencers which         data is first sent to dSwitch. The pipelined mode is
need access to multiple memory banks (mBanks).              used when PMUX is programmed to use the register
mBanks in the first row have dedicated access from          in its path (see Figure 2). The pipelined path is
Sequencer in the same column. All splitters are open        omitted for Single Cycle Multi-Hop Transfer
at the start. Sequencers 1 and 2 issue an instruction for   (SCMHT) mode. The concerned dSwitch routes the
respective instruction switch (iSwitch) to close            data to neighbouring dSwitch. At destination, data is
vertical splitter for top-to-down access. Horizontal        directly loaded into RFile from neighbouring dSwitch.
splitters are set to remain open. Instruction Switch        The number of cycles of each transfer is not always
takes one cycle to process this instruction. After a        equal to the number of hops. The cycles may be
wait of one cycle, Sequencers 1 and 2 can now access        reduced if any dSwitch is in SCMHT mode. The
iSwitches in second row (row 1). This one cycle wait        critical path for single cycle transfer for any given
can be used to configure mFSM/ cFSM for required            wireload model is variable; increase in number of
traffic patterns. Sequencer 1 issues another instruction    hops will increase the critical path exponentially. So,
for iSwitch (1, 1) to get access to iSwitch (1, 2).         increasing the maximum number of hops in SCMHT
Sequencer 1 issues instruction to close the vertical        mode will reduce the clock speed of the system by the
splitter top-down.                                          same rate. Hence, the number of hops should only be
    ‘                                                       increased for such cases when the gain of SCMHT
                                                            mode is more than degradation due to lower clock
                                                            frequency.
VII. SYSTEM COSTS AND O
                           OVERHEADS                                 reordering and reconfiguratio is performed by
                                                                                                      on
                                                                     common three stage data tran   nsfer. If the number of
    The system has three types of co and overheads
                                   osts
                                                                     butterfly operations per stage is more than the number
                                                                                                     s
interms of cycles: 1. Com         mputational Cost
                                                                     of available butterflies, then ad
                                                                                                     dditional butterflies are
(CComputation), the time spent (cyycles) in processing
                                                                     processed serially. The interconnnect reconfiguration is
data by the DPU and is dependen on the mode of
                                   nt
                                                                     performed if re-ordering i required between
                                                                                                      is
mDPU and data width which is give by the equation:
                                   en
                                                                     neighbouring RFiles [2]. D     During reordering the
    CComputation = (NSample/ 2) + OCompu
                                       utation          (1)          mBank truly behaves as a scrat  tch-pad memory, where
where NSample is Number of Sample and  es            OComputation    intermediate data is stored.
is Overhead of Computation.
    OComputation = CPipeline + CLoad Store                 (2)
where CLoad Store is RFile load store C    Cost, CPipeline is cost
of pipeline which changes with each mDPU mode.
                                           h
2. Reconfiguration Overhead, th time which is
                                          he
required to reconfigure the interco       onnect partitioning.
This time is directly proportional to the number of
splitters to be programmed. Programming a single
splitter takes three cycles (Instruct       tion identification,
Decoding and Partition set/reset) 3. Interconnect
                                           ).
Overhead, deals with the amount o time (cycles) it
                                            of
                                                                                                  roughput.
                                                                                   Figure 5 FFT Thr
takes for the data to move between m       mBank and RFile.
    Computation Interconnect Rec           configuration cost        First step in this mapping is loa
                                                                                                     ading the correct data in
(CCIR) is the cost of reconfig              guration for the         the correct RFile. Data is loade from the memory to
                                                                                                      ed
computational fabric which direct depend on the
                                           tly                       all the RFiles. By keeping a correct order at
                                                                                                     g
number of interconnects to reconfig       gure and the cost of       instruction level, data is pick up by the correct
                                                                                                     ked
reconfiguration. This cost can ha          ave the maximum           RFile.
value of six cycles.
    CCIR = NInterconnect * CComp. Reconf                   (3)
mBank Interconnect Reconfigura             ation cost (CMIR)
directly proportional to the numb of instructions
                                           ber
used to reconfigure the memory part         titions (CPR).
         CMIR = NInstruction * CPR                   (4)
    Reconfiguration is performed when 1. New
                                   d
mBank is to be allocated, 2. The instruction
partitioning interconnects are reconnfigured to change
direction of instruction flow (or) 3. Computational
                                                                                     Figure 6 FFT Ov
                                                                                                   verhead.
fabric    interconnects     are   rec
                                    configured.    The
interconnect cost (CInterconnect) d depends on data                      Twiddle factors are also loa aded to RFile which act
transfers and is represented as:                                     as a Look Up Table (LUT). Figure 5 extends the
                                                                     results of [11] for DRRA arch     hitecture. To keep the
          CInterconnect = NTransfers * CNum of hops (5)
                                          m
                                                                     comparison fair, equal numb       ber of computational
where NTransfers is Number of Transfers and CNumof hops
                                                                     elements is used. The la         ast two results for
is cost in cycles/ data transfer.
                                                                     SMARTCELL and FPGA are interpolated based on
                   VIII. CASE STUDIES                                [11]. The proposed system out     tperforms others by an
                                                                     order of magnitude more than the expected error in
                                                                                                     n
A. Mapping one dimensional poin FFT
                              nt                                     interpolation. For FFT larger th 512, the SmartCell
                                                                                                       han
    An implementation of radix-2 FFT butterfly was                   is 1.24 to 1.36 times slow than DRRA. The
                                                                                                     wer
carried out with four mDPUs use to perform two
                                   ed                                implementations of FFT small than 512 points on
                                                                                                       ler
butterfly operations (one real and o imaginary) [1].
                                   one                               such parallel system do not ex  xploit locality. Figure 6
The operations are pipelined and performed in six                    illustrates that the data for smalll-sized FFTs (less than
cycles. Depending on data acc      cess patterns, the                512) spends more time in m       motion rather than in
butterfly can be reused to implemen various stages of
                                   nt                                computation. The memor           ry-data interconnects
FFT. Hence, an algorithm can be defined for the                      overhead takes more or comp      parable times than the
traffic pattern between RFile, mBa and mDPU. In
                                   ank                               time spent in computation. This can be further
this case, FFT traffics are man    nually mapped. A                  elaborated by the fact that it takes marginally less
maximum of sixteen butterflies are used in parallel. A               cycles to compute the 64 poin FFT using a single
                                                                                                       nt
single butterfly is fed with data sam
                                    mples and twiddle                butterfly (real and imaginary) set. Such case uses 16
factors from RFiles that can perform 32 operations in
                                    m                                times less resources compared to the case presented
                                                                                                     d
40 pipelined cycles. Between each FFT stage,                         (16 butterflies).
B. 2D Mapping vs McNoC[12]                                                      REFERENCES
    The two dimensional FFT is performed in two            [1] Axel Jantsch and Hannu Tenhunen, “Networks-on-
steps. First, row-wise FFTs are calculated. It is          Chip”, Kluwer Academic Publishers, 2003.
followed by column-wise FFT calculation. Hence two
                                                           [2] William James Dally and Brian Towles, “Principles and
dimensional FFTs are broken down into multiple one-        Practices of Interconnection Networks”, Morgan Kaufmann
dimensional FFTs. It is mapped using the same              Publishers, 2004.
principles as in section VIII-A. In this experiment, the
size of FFT remains constant and number of resources       [3] M. A. Shami and A. Hemani, “Morphable DPU: Smart
is increased. Furthermore, an extra step is added          and Efficient Data Path for Signal Processing Applications,”
where horizontal to vertial translation is performed.      vol. SiPS, 2009, pp. 167–172.
The results for such mapping are given in Figure 7.
                                                           [4] ——, “Partially Reconfigurable Interconnection
          x 105                                            Network for Dynamically Reprogrammable Resource
                                                           Array,” in IEEE 8th International Conference on ASIC
                                                           (ASICON’09), 2009, pp. 122–125.

                                                           [5] M. Herz, R. Hartenstein, M. Miranda, and E.
                                                           Brockmeyer, “Memory Addressing Organization for
                                                           Streaming-based Reconfigurable Computing,” vol. 2, 2002,
                                                           pp. 813–817.

                                                           [6] A. Lambrechts, P. Raghavan, M. Jayapala, B. Mei, F.
            Fig. 7 Cycle count for 2D-FFT                  Catthoor, and D. Verkest, “Interconnect Exploration for
                                                           Energy versus Performance tradeoffs for Coarse Grained
                  IX. CONCLUSION                           Reconfigurable Architectures,” vol. 17, no. 1, JANUARY
    This paper proposes a programmable and                 2009, pp. 151–155.
partitionable interconnect interface as a method of        [7] T. Marescaus, E. Brockmeyer, and H. Corporaal, “The
communication between mBanks and RFiles. The               Impact of Higher Communication Layers on NoC supported
programmable interconnect supports pipelined or            MPSoCs,” in Proceeding of the First International
bufferless modes. The instruction interconnect also        Symposium on Network-on-Chip (NOCS’07), 2007, pp.
has private partitioning capability where multiple         107–116.
sequencers can communicate with different mBanks
using different segments of the network. The               [8] C. Liang and X. Huang, “SMARTCELL: A Power
controllers within the system help in providing            Efficient Reconfigurable Architecture for Data Streaming
                                                           application,” in IEEE Workshop on Signal Processing
patterns of data which can be routed based on the          Systems (SiPS’08), 2008, pp. 257–262.
instructions.
    The FFT and 2D experiments show the overhead           [9] G. Rauwerda, P. Heysters, and G. Smit, “Towards
of the interconnect, compared to the computational         Software   Defined    Radios      using     Coarse-Grained
cost. The results show that for the given                  Reconfigurable Hardware,” vol. 16, no. 1, 2008, pp. 3–13.
programmable interconnect the best throughput is
obtained when reasonable computational resources are       [10] L. Smit, A. Molclerink, P. Wolkotte, and G. Smit,
                                                           “Implementation of 2-D 8x8 IDCT on Reconfigurable
utilized with good locality to the data. Moreover, gate    MONTIUM core,” in International Conference on Field
level synthesis results show that the system can run       Programmable Logic and Applications, 2007, pp. 562–566.
up-to 400 MHz on 90nm technology.
    As a part of future work, another sequencer will be    [11] C. Liang and X. Huang, “Mapping parallel FFT
added which will act as a dedicated main controller        algorithm     onto      SMARTCELL        Coarse-Grained
for memory. A compiler to automate the process of          Reconfigurable Architecture,” in IEEE 20th International
mapping is also under development.                         Conference on Application-specific Systems, Architectures
                                                           and Processors, 2009, pp. 231–234.

                                                           [12] X. Chen, Z. Lu, A. Jantsch, and S. Chen, “Supporting
                                                           Distributed Shared Memory on Multi-core Network-on-chip
                                                           using a Dual Micro-coded Controller”, 2010, pp. 39–44.

Mais conteúdo relacionado

Mais procurados

Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Fisnik Kraja
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingeSAT Journals
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Effective Sparse Matrix Representation for the GPU Architectures
 Effective Sparse Matrix Representation for the GPU Architectures Effective Sparse Matrix Representation for the GPU Architectures
Effective Sparse Matrix Representation for the GPU ArchitecturesIJCSEA Journal
 
Low Memory Low Complexity Image Compression Using HSSPIHT Encoder
Low Memory Low Complexity Image Compression Using HSSPIHT EncoderLow Memory Low Complexity Image Compression Using HSSPIHT Encoder
Low Memory Low Complexity Image Compression Using HSSPIHT EncoderIJERA Editor
 
Simulating the triba noc architecture
Simulating the triba noc architectureSimulating the triba noc architecture
Simulating the triba noc architectureijmnct
 
IRJET- Stock Market Cost Forecasting by Recurrent Neural Network on Long Shor...
IRJET- Stock Market Cost Forecasting by Recurrent Neural Network on Long Shor...IRJET- Stock Market Cost Forecasting by Recurrent Neural Network on Long Shor...
IRJET- Stock Market Cost Forecasting by Recurrent Neural Network on Long Shor...IRJET Journal
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...eSAT Publishing House
 
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...IJECEIAES
 
CCNxCon2012: Session 5: CCN Location Sharing System
CCNxCon2012: Session 5: CCN Location Sharing SystemCCNxCon2012: Session 5: CCN Location Sharing System
CCNxCon2012: Session 5: CCN Location Sharing SystemPARC, a Xerox company
 

Mais procurados (18)

61
6161
61
 
47 50
47 5047 50
47 50
 
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
Effective Sparse Matrix Representation for the GPU Architectures
 Effective Sparse Matrix Representation for the GPU Architectures Effective Sparse Matrix Representation for the GPU Architectures
Effective Sparse Matrix Representation for the GPU Architectures
 
International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)
 
ThePaper (1)
ThePaper (1)ThePaper (1)
ThePaper (1)
 
Low Memory Low Complexity Image Compression Using HSSPIHT Encoder
Low Memory Low Complexity Image Compression Using HSSPIHT EncoderLow Memory Low Complexity Image Compression Using HSSPIHT Encoder
Low Memory Low Complexity Image Compression Using HSSPIHT Encoder
 
20 24
20 2420 24
20 24
 
Simulating the triba noc architecture
Simulating the triba noc architectureSimulating the triba noc architecture
Simulating the triba noc architecture
 
IRJET- Stock Market Cost Forecasting by Recurrent Neural Network on Long Shor...
IRJET- Stock Market Cost Forecasting by Recurrent Neural Network on Long Shor...IRJET- Stock Market Cost Forecasting by Recurrent Neural Network on Long Shor...
IRJET- Stock Market Cost Forecasting by Recurrent Neural Network on Long Shor...
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
 
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...Convolutional Neural Network and Feature Transformation for Distant Speech Re...
Convolutional Neural Network and Feature Transformation for Distant Speech Re...
 
CCNxCon2012: Session 5: CCN Location Sharing System
CCNxCon2012: Session 5: CCN Location Sharing SystemCCNxCon2012: Session 5: CCN Location Sharing System
CCNxCon2012: Session 5: CCN Location Sharing System
 
Evaluation aodv
Evaluation aodvEvaluation aodv
Evaluation aodv
 
358 365
358 365358 365
358 365
 
Es24906911
Es24906911Es24906911
Es24906911
 

Destaque

Ensayo jcz-crisis financiera 2008
Ensayo jcz-crisis financiera 2008Ensayo jcz-crisis financiera 2008
Ensayo jcz-crisis financiera 2008juancarloszea2012
 
Análisis de la pélicula crisis económica del 2008 en estados unidos
Análisis de la pélicula crisis económica del 2008 en estados unidosAnálisis de la pélicula crisis económica del 2008 en estados unidos
Análisis de la pélicula crisis económica del 2008 en estados unidosJaime Quintana
 
La crisis economica mundial 2008
La crisis economica mundial 2008La crisis economica mundial 2008
La crisis economica mundial 2008Jaime Santamaria
 
(La crisis del petróleo de los años 1970)
(La crisis del petróleo de los años 1970)(La crisis del petróleo de los años 1970)
(La crisis del petróleo de los años 1970)korvac
 
Crisis del petróleo
Crisis del petróleoCrisis del petróleo
Crisis del petróleoEveSegovi
 

Destaque (6)

Ensayo jcz-crisis financiera 2008
Ensayo jcz-crisis financiera 2008Ensayo jcz-crisis financiera 2008
Ensayo jcz-crisis financiera 2008
 
Análisis de la pélicula crisis económica del 2008 en estados unidos
Análisis de la pélicula crisis económica del 2008 en estados unidosAnálisis de la pélicula crisis económica del 2008 en estados unidos
Análisis de la pélicula crisis económica del 2008 en estados unidos
 
Ensayo crisis financiera
Ensayo crisis financieraEnsayo crisis financiera
Ensayo crisis financiera
 
La crisis economica mundial 2008
La crisis economica mundial 2008La crisis economica mundial 2008
La crisis economica mundial 2008
 
(La crisis del petróleo de los años 1970)
(La crisis del petróleo de los años 1970)(La crisis del petróleo de los años 1970)
(La crisis del petróleo de los años 1970)
 
Crisis del petróleo
Crisis del petróleoCrisis del petróleo
Crisis del petróleo
 

Semelhante a 73

Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentEricsson
 
Coarse grained hybrid reconfigurable architecture
Coarse grained hybrid reconfigurable architectureCoarse grained hybrid reconfigurable architecture
Coarse grained hybrid reconfigurable architectureDhiraj Chaudhary
 
Coarse grained hybrid reconfigurable architecture with no c router
Coarse grained hybrid reconfigurable architecture with no c routerCoarse grained hybrid reconfigurable architecture with no c router
Coarse grained hybrid reconfigurable architecture with no c routerDhiraj Chaudhary
 
Coarse grained hybrid reconfigurable architecture with noc router for variabl...
Coarse grained hybrid reconfigurable architecture with noc router for variabl...Coarse grained hybrid reconfigurable architecture with noc router for variabl...
Coarse grained hybrid reconfigurable architecture with noc router for variabl...Dhiraj Chaudhary
 
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Dhiraj Chaudhary
 
Design and Implementation of Quintuple Processor Architecture Using FPGA
Design and Implementation of Quintuple Processor Architecture Using FPGADesign and Implementation of Quintuple Processor Architecture Using FPGA
Design and Implementation of Quintuple Processor Architecture Using FPGAIJERA Editor
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009Léia de Sousa
 
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPA ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPijaceeejournal
 
Performance analysis and implementation of modified sdm based noc for mpsoc o...
Performance analysis and implementation of modified sdm based noc for mpsoc o...Performance analysis and implementation of modified sdm based noc for mpsoc o...
Performance analysis and implementation of modified sdm based noc for mpsoc o...eSAT Journals
 
Network on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A surveyNetwork on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A surveyIJRES Journal
 
Blue gene detail journal
Blue gene detail journalBlue gene detail journal
Blue gene detail journalVivek Jha
 
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...CSCJournals
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
A NoC-Based Infrastructure To Enable Dynamic Self Reconfigurable Systems
A NoC-Based Infrastructure To Enable Dynamic Self Reconfigurable SystemsA NoC-Based Infrastructure To Enable Dynamic Self Reconfigurable Systems
A NoC-Based Infrastructure To Enable Dynamic Self Reconfigurable SystemsLisa Muthukumar
 

Semelhante a 73 (20)

Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
 
Nehalem
NehalemNehalem
Nehalem
 
Smart memories
Smart memoriesSmart memories
Smart memories
 
Coarse grained hybrid reconfigurable architecture
Coarse grained hybrid reconfigurable architectureCoarse grained hybrid reconfigurable architecture
Coarse grained hybrid reconfigurable architecture
 
Coarse grained hybrid reconfigurable architecture with no c router
Coarse grained hybrid reconfigurable architecture with no c routerCoarse grained hybrid reconfigurable architecture with no c router
Coarse grained hybrid reconfigurable architecture with no c router
 
Coarse grained hybrid reconfigurable architecture with noc router for variabl...
Coarse grained hybrid reconfigurable architecture with noc router for variabl...Coarse grained hybrid reconfigurable architecture with noc router for variabl...
Coarse grained hybrid reconfigurable architecture with noc router for variabl...
 
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
 
Design and Implementation of Quintuple Processor Architecture Using FPGA
Design and Implementation of Quintuple Processor Architecture Using FPGADesign and Implementation of Quintuple Processor Architecture Using FPGA
Design and Implementation of Quintuple Processor Architecture Using FPGA
 
Address Interleaving in NoCs
Address Interleaving in NoCsAddress Interleaving in NoCs
Address Interleaving in NoCs
 
Cache performance-x86-2009
Cache performance-x86-2009Cache performance-x86-2009
Cache performance-x86-2009
 
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPA ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIP
 
Performance analysis and implementation of modified sdm based noc for mpsoc o...
Performance analysis and implementation of modified sdm based noc for mpsoc o...Performance analysis and implementation of modified sdm based noc for mpsoc o...
Performance analysis and implementation of modified sdm based noc for mpsoc o...
 
Ax24329333
Ax24329333Ax24329333
Ax24329333
 
Network on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A surveyNetwork on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A survey
 
Blue gene detail journal
Blue gene detail journalBlue gene detail journal
Blue gene detail journal
 
chameleon chip
chameleon chipchameleon chip
chameleon chip
 
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Design of fault tolerant algorithm for network on chip router using field pr...
Design of fault tolerant algorithm for network on chip router  using field pr...Design of fault tolerant algorithm for network on chip router  using field pr...
Design of fault tolerant algorithm for network on chip router using field pr...
 
A NoC-Based Infrastructure To Enable Dynamic Self Reconfigurable Systems
A NoC-Based Infrastructure To Enable Dynamic Self Reconfigurable SystemsA NoC-Based Infrastructure To Enable Dynamic Self Reconfigurable Systems
A NoC-Based Infrastructure To Enable Dynamic Self Reconfigurable Systems
 

Mais de srimoorthi (20)

94
9494
94
 
87
8787
87
 
84
8484
84
 
83
8383
83
 
82
8282
82
 
75
7575
75
 
72
7272
72
 
70
7070
70
 
69
6969
69
 
68
6868
68
 
63
6363
63
 
62
6262
62
 
60
6060
60
 
59
5959
59
 
57
5757
57
 
56
5656
56
 
50
5050
50
 
55
5555
55
 
52
5252
52
 
53
5353
53
 

73

  • 1. A NoC Based Distributed Memory Architecture with Programmable and Partitionable Capabilities. Mohammad Adeel Tajammul, M. A. Shami, A. Hemani S. Moorthi School of ICT, Royal Institute of Technology, Dept. of EEE, Stockholm, Sweden. NIT, Tiruchirappalli, INDIA. (tajammul, shami, hemani)@kth.se srimoorthi@nitt.edu Abstract: The paper focuses on the design of a Performance and Energy: The distributed nature of Network-on-chip based programmable and memory architecture and the concept of private partitionable distributed memory architecture execution environments enable a short distance which can be integrated with a Coarse Grain between storage and computation, which in turn Reconfigurable Architecture (CGRA). The contributes to low latency. Further, it enables effective proposed interconnect enables better interaction power management by allowing the unused mBanks between computation fabric and memory fabric. to shutdown or put into low power mode. The system can modify its memory to computation Scalability: The DiMArch is scalable with the size of element ratio at runtime. The extensive capabilities memory partitions and clock frequency. The circuit- of the memory system are analyzed by interfacing switched segments of the data Network-on-Chip it with a Dynamically Reconfigurable Resource (dNoC) can be optionally pipelined. Array (DRRA), a CGRA. The interconnect can II. RELATED WORK provide multiple interfaces which supports upto 8 Reconfigurable architectures of the past decade are GB/s per interface. Distributed Memory System; DSM; Network on Chip; investigated for memory organization in [5]. NoC; Coarse Grain Reconfigurable Architecture; CGRA; Lambrechts et al. [6] investigate power and performance for multiple forms of interconnects. The I. INTRODUCTION design proposed in this paper is very similar to b-neg With the advent increase in the integration of the design in [6]. The programmable pipelined and distributed processing elements, Network-on-Chip private partitioning differentiates the proposed design (NoC) is considered to be a promising scheme [1]. from [6]. Further, the control traffic is routed over bus On-Chip interconnect network can provide high network as it has lower traffic. Data traffic is routed bandwidth, low-latency, scalable and reliable over the crossbar with programmable pipelined communication. Usually, the NoCs will have large interconnects. Memory systems for MP-SoCs can be amount of memories which supports the parallel either caches or scratch-pad based memory systems. transactions [2]. The exploration of new memory Marescaux [7] provides a case where scratch-pad architecture is the need for any Network-on-Chip. memory systems behave superior to cache based This paper presents on-chip distributed memory systems. architecture for the PHY layer of DRRA, with the SMART CELL [8] describes an innovative following salient features: reconfigurable architecture with distributed data and Distributed: DRRA being a fabric, the computation instruction memory architecture. These memory units is distributed across the chip [3, 4] which runs several are directly connected to processing elements. Each applications in parallel. With distributed memory, the mBank is 1K and works as a scratch pad memory. proposed design enables multiple private and parallel MONTIUM Tile processor [9] has a local execution environments (PREX). memory for each tile processor. Each tile processor is Partitioning: Partitioning is also a distributed then connected to a Network-on-Chip (NoC) which exercise and it happens in parallel. The proposed can bring data from an on-chip memory via AHB- system also enables runtime re-partitioning. bridge interface. The local memory per tile remains Streaming: Individual partitions composed of fixed in each MONTIUM implementations [9][10]. memory banks (mBanks) should be able to act as a The DRRA architecture with integrated memory unit and stream data to the computational units. It can system is discussed in sections III & IV. The proposed introduce elasticity in streaming by adjusting the memory architecture can alter its memory to delay values. computational fabric ratio by partitioning, as discussed in section V. Further-more, if the system is running with variable clock, then it can alter its This Research is funded by the Higher Education Commission critical path accordingly as discussed in section VI. of Pakistan 978-1-4244-8971-8/10$26.00 c 2010 IEEE
  • 2. mDPUs are native 16-bit i integer units with four 16-bit inputs corresponding to two complex numbers and two 16-bit output correspo onding to one complex number. mDPU also has two comparators, one for each output and a counter. The results of comparators, counter and overflow, underflow are logged in a status w word read by sequencer. mDP can do saturation, PU truncation/ rounding, overflow, underflow check. The , end result bit-width can be con nfigured to be anything from 8 to 16 bits. RFile - the D DRRA Register File is 64 word 16 bit register file wi dual read and write ith ports. RFile has a DSP s style AGU (Address Generation Unit) with vectoris sed, circular buffer and bit reverse addressing that is uuseful in implementing FFT. Sequencer is a micro-code sequencing machine ed that controls a single mDPU and a RFile and the switchbox. Sequencer can be daisy chained to allow a single sequencer to control adjoining Sequencers within the sliding-window reach. This concept is used to implement a hierarchy of con ntrollers, for instance to implement Rx/Tx FSMs of a MODEM or encode/decode FSMs of a C CODEC. With elastic streaming capability of RFile together with the proposed memory architecture described in section e( IV) the sequencers provide the capability to implement chained elastic str reaming functionalities that matches very well the natu of most PHY layers ure for radio and multi-media appli ications. The fabric can be as large as the die allows; se everal thousand DRRA cells can be accommodated in a 45 nm, 300 mm2 die. Figure 1 shows only a fragment for clarity. t IV. DISTRIBUTED MEMOR ARCHITECTURE RY The proposed Distributed Memory Architecture (DiMArch) for DRRA is com mposed of (a) a set of distributed memory banks (m mBanks), (b) a circuit- switched data Network-on- -Chip (dNoC) (that transports data between mBan and RFiles (DRRA nks Register Files)), (c) a packe et-switched instruction Fig. 1 DRRA Architecture with Mem mory fabric Network-on-Chip (iNoC), a No and bus hybrid used oC to create partitions, program m mBanks to stream data and transport instructions fro sequencer to the om III. DRRA ARCHITECTURE instruction Switch (iSwitch). DRRA is a Coarse Grain Reconfigurable n A. Memory Banks (mBan nk): The distributed Architecture (CGRA) capable of hosting multiple, memory banks are SRAM ma acros, typically 2 to 4 complete Radio and Multimedia a applications. It has KB, a design time decision, a the goal is to align as resources for physical layer (PHY layer), Protocol Y mBanks with the columns o the DRRA fabric. of Processing layers (PP layer), appli ication and system mBanks are controlled by mF FSMs - state machines control and runtime management. The DRRA fabric that also acts as interface betw ween mBanks and the for the PHY layer has been imple emented in [1] [2] data Switches (dSwitch). mFSMs act as and is shown in Figure 1 along w with the proposed programmable address generati unit with a general ion memory system. A single DRRA cell is composed of timing model. They implement single read/write; a morphable DataPath Units (mDPU a Register Files U), vectorized read/writes with p programmable address (RFile), a sequencer and an int terconnect scheme offset, circular buffer and bit reversed addressing. t gluing these elements together. Pre esently, the storage mFSMs also provide a general purpose timing model of DRRA fabric is restricted to RFFiles which are 64 using three delays, an initial ddelay before a loop, an words of 16 bits. intermittent delay before every read/ write within a y loop and an end delay at the e of the loop before end repeating the next iterations. Thhese delays are used to
  • 3. synchronize the memory to register file streams with processing and also take the data back to the mBanks the computation. Individual delays can be changed once processed. depending on the intermediate results of the cFSMs associated with each dSwitch are programmed computation which makes streaming elastic. mFSMs to time multiplex the path to and from register file in a are programmed via iNoC with special instructions. co-ordinated way so that it appears as if RFile is B. Data Network-on-Chip (dNoC): dNoC is a reading from/writing to one large contiguous memory. half-duplex circuit-switched mesh Network on-Chip. Compiler ensures that the computation is The streaming nature of applications, the inherent synchronised with the behaviour of cFSMs controlling QoS guarantees and improved latency compared to the memory transactions. This works fine for the packet-switched network were the motivations for signal processing application with deterministic, using circuit-switched network. A memory partition cyclo-stationary behaviour. The ability to partially together with a computation partition is called private reprogram these streams, allows these streams to be execution environment (PREX). The interface elastic as well. The DRRA sequencers have the hooks between memory and computational partition is as to chain these elastic streams but the present wide as the number of RFiles involved; the width here DiMArch does not support chained elastic streams implies the number of dNoC connections, each dNoC The architecture can deal with the degenerate case of being 256 bit wide, can be changed as it is a nondeterministic random individual memory GENERIC VHDL parameter in a template. Since the transactions as well like a normal processor; this case data traffic at each RFile/dNoC (RFMI) interface can will obviously not benefit from the efficiency of only be read or written, half-duplex interconnects are autonomic (elastic) streaming capability of cFSMs in proposed. dNoC is realized as a mesh network of DiMArch. cFSMs are programmed by special dSwitches. As shown in Figure 2, each dSwitch is instructions via iNoC. made up of five dSwitch cells (dCell) serving the N, E, W, S and the mBank directions. Each dCell has C. Instruction Network-on-Chip (iNoC): iNoC four inputs coming from the other four directions; one is a packet-switched network used in DiMArch to of these four inputs is multiplexed out in the output program the cFSMs and mFSMs, as the packet- mode; in the input mode, data from the associated switched networks are primarily used for short direction enters the dCell. The bidirectional I/O is programming messages and life of a certain path is optionally buffered to cope with long wires and very short. Also, it includes the feature of packetized provide flexibility to implement the planned Dynamic network to reach any node of the DiMArch from any Voltage Frequency Scaling. cFSMs control the sequencer. The agility of programming or temporal behaviour of dSwitch. They are essential to reconfiguring DiMArchs partitions and behaviors is a make multiple mBanks behave as a contiguous key goal of the DRRA architecture to make it memory. Figure 3 shows an example of a memory dynamically reconfigurable. To achieve this agility, partition made up of three mBanks A, B and C that bring data to a single Register File (RFile) for IMUX ISEL IMUX ISEL IMUX ISEL IMUX IMUX ISEL ISEL REG REG REG REG REG PMUX PSEL PMUX PSEL PMUX PSEL PMUX PMUX PSEL PSEL IOSEL IOSEL IOSEL IOSEL IOSEL mBank South West East North Fig. 2 dSwitch
  • 4. while retaining the generality of packet-switching network, two architectural measures have been taken. Sequencer 0 Sequencer 1 Sequencer 2 The first is that the horizontal and vertical segments of the iNoC are a hybrid of bus and NoC behaviours. iSwitch iSwitch iSwitch Any message asserted on an iSwitch is broadcast (0,0) (1,0) (2,0) along its entire length of vertical segment, behaving like a bus as the broadcast happens in a single cycle. Every iSwitch on the vertical segment analyzes the message in parallel to check if the message address is iSwitch iSwitch iSwitch on its associated horizontal segment and if it is, a (0,1) (1,1) (2,1) second broadcast happens on the horizontal segment. Again, every iSwitch on the horizontal segment listens to the broadcast and analyzes if the message is addressed to it and if it is, it forwards it to zFSM that iSwitch iSwitch iSwitch (0,2) (1,2) (2,2) analyzes it and appropriately acts on it. By having a bus like behaviour, the message is broadcast in a single cycle, i.e., each iSwitch can be reached in two cycles. The second measure is the partitioning capability which is explained in forthcoming section Fig. 4 Private Partitioning V. dSwitches + CFSMs Then Sequencer 1 instructs iSwitch (1, 2) to close mBank horizontal splitter right-left between iSwitch (0, 2) A and iSwitch (1, 2). At this point all sequencers have access to their desired private partitions. At run-time Sequencer 2 can gain access to iSwitch (2, 2) to allocate more memory for additional memory mBank mBank requirements. However, since the proposed B C architecture does not provide memory locks, all access conflicts are resolved at compile time. When two PREX needs a shared memory space, then a shared iSwitch is specified. e.g. Sequencer 0 and Sequencer 1 can specify iSwitch (0, 1) as a shared space. In that case, iSwitch(0,1) will receive RFile instructions from both PREX. Fig. 3 Single memory partition VI. PROGRAMMABLE PIPELINING Consider an example where three RFiles V. PRIVATE PARTITIONING connected to nine MTiles in 3x3 configuration. For a single block transfer from mBank to RFile, mBank As in figure 4, consider three Sequencers which data is first sent to dSwitch. The pipelined mode is need access to multiple memory banks (mBanks). used when PMUX is programmed to use the register mBanks in the first row have dedicated access from in its path (see Figure 2). The pipelined path is Sequencer in the same column. All splitters are open omitted for Single Cycle Multi-Hop Transfer at the start. Sequencers 1 and 2 issue an instruction for (SCMHT) mode. The concerned dSwitch routes the respective instruction switch (iSwitch) to close data to neighbouring dSwitch. At destination, data is vertical splitter for top-to-down access. Horizontal directly loaded into RFile from neighbouring dSwitch. splitters are set to remain open. Instruction Switch The number of cycles of each transfer is not always takes one cycle to process this instruction. After a equal to the number of hops. The cycles may be wait of one cycle, Sequencers 1 and 2 can now access reduced if any dSwitch is in SCMHT mode. The iSwitches in second row (row 1). This one cycle wait critical path for single cycle transfer for any given can be used to configure mFSM/ cFSM for required wireload model is variable; increase in number of traffic patterns. Sequencer 1 issues another instruction hops will increase the critical path exponentially. So, for iSwitch (1, 1) to get access to iSwitch (1, 2). increasing the maximum number of hops in SCMHT Sequencer 1 issues instruction to close the vertical mode will reduce the clock speed of the system by the splitter top-down. same rate. Hence, the number of hops should only be ‘ increased for such cases when the gain of SCMHT mode is more than degradation due to lower clock frequency.
  • 5. VII. SYSTEM COSTS AND O OVERHEADS reordering and reconfiguratio is performed by on common three stage data tran nsfer. If the number of The system has three types of co and overheads osts butterfly operations per stage is more than the number s interms of cycles: 1. Com mputational Cost of available butterflies, then ad dditional butterflies are (CComputation), the time spent (cyycles) in processing processed serially. The interconnnect reconfiguration is data by the DPU and is dependen on the mode of nt performed if re-ordering i required between is mDPU and data width which is give by the equation: en neighbouring RFiles [2]. D During reordering the CComputation = (NSample/ 2) + OCompu utation (1) mBank truly behaves as a scrat tch-pad memory, where where NSample is Number of Sample and es OComputation intermediate data is stored. is Overhead of Computation. OComputation = CPipeline + CLoad Store (2) where CLoad Store is RFile load store C Cost, CPipeline is cost of pipeline which changes with each mDPU mode. h 2. Reconfiguration Overhead, th time which is he required to reconfigure the interco onnect partitioning. This time is directly proportional to the number of splitters to be programmed. Programming a single splitter takes three cycles (Instruct tion identification, Decoding and Partition set/reset) 3. Interconnect ). Overhead, deals with the amount o time (cycles) it of roughput. Figure 5 FFT Thr takes for the data to move between m mBank and RFile. Computation Interconnect Rec configuration cost First step in this mapping is loa ading the correct data in (CCIR) is the cost of reconfig guration for the the correct RFile. Data is loade from the memory to ed computational fabric which direct depend on the tly all the RFiles. By keeping a correct order at g number of interconnects to reconfig gure and the cost of instruction level, data is pick up by the correct ked reconfiguration. This cost can ha ave the maximum RFile. value of six cycles. CCIR = NInterconnect * CComp. Reconf (3) mBank Interconnect Reconfigura ation cost (CMIR) directly proportional to the numb of instructions ber used to reconfigure the memory part titions (CPR). CMIR = NInstruction * CPR (4) Reconfiguration is performed when 1. New d mBank is to be allocated, 2. The instruction partitioning interconnects are reconnfigured to change direction of instruction flow (or) 3. Computational Figure 6 FFT Ov verhead. fabric interconnects are rec configured. The interconnect cost (CInterconnect) d depends on data Twiddle factors are also loa aded to RFile which act transfers and is represented as: as a Look Up Table (LUT). Figure 5 extends the results of [11] for DRRA arch hitecture. To keep the CInterconnect = NTransfers * CNum of hops (5) m comparison fair, equal numb ber of computational where NTransfers is Number of Transfers and CNumof hops elements is used. The la ast two results for is cost in cycles/ data transfer. SMARTCELL and FPGA are interpolated based on VIII. CASE STUDIES [11]. The proposed system out tperforms others by an order of magnitude more than the expected error in n A. Mapping one dimensional poin FFT nt interpolation. For FFT larger th 512, the SmartCell han An implementation of radix-2 FFT butterfly was is 1.24 to 1.36 times slow than DRRA. The wer carried out with four mDPUs use to perform two ed implementations of FFT small than 512 points on ler butterfly operations (one real and o imaginary) [1]. one such parallel system do not ex xploit locality. Figure 6 The operations are pipelined and performed in six illustrates that the data for smalll-sized FFTs (less than cycles. Depending on data acc cess patterns, the 512) spends more time in m motion rather than in butterfly can be reused to implemen various stages of nt computation. The memor ry-data interconnects FFT. Hence, an algorithm can be defined for the overhead takes more or comp parable times than the traffic pattern between RFile, mBa and mDPU. In ank time spent in computation. This can be further this case, FFT traffics are man nually mapped. A elaborated by the fact that it takes marginally less maximum of sixteen butterflies are used in parallel. A cycles to compute the 64 poin FFT using a single nt single butterfly is fed with data sam mples and twiddle butterfly (real and imaginary) set. Such case uses 16 factors from RFiles that can perform 32 operations in m times less resources compared to the case presented d 40 pipelined cycles. Between each FFT stage, (16 butterflies).
  • 6. B. 2D Mapping vs McNoC[12] REFERENCES The two dimensional FFT is performed in two [1] Axel Jantsch and Hannu Tenhunen, “Networks-on- steps. First, row-wise FFTs are calculated. It is Chip”, Kluwer Academic Publishers, 2003. followed by column-wise FFT calculation. Hence two [2] William James Dally and Brian Towles, “Principles and dimensional FFTs are broken down into multiple one- Practices of Interconnection Networks”, Morgan Kaufmann dimensional FFTs. It is mapped using the same Publishers, 2004. principles as in section VIII-A. In this experiment, the size of FFT remains constant and number of resources [3] M. A. Shami and A. Hemani, “Morphable DPU: Smart is increased. Furthermore, an extra step is added and Efficient Data Path for Signal Processing Applications,” where horizontal to vertial translation is performed. vol. SiPS, 2009, pp. 167–172. The results for such mapping are given in Figure 7. [4] ——, “Partially Reconfigurable Interconnection x 105 Network for Dynamically Reprogrammable Resource Array,” in IEEE 8th International Conference on ASIC (ASICON’09), 2009, pp. 122–125. [5] M. Herz, R. Hartenstein, M. Miranda, and E. Brockmeyer, “Memory Addressing Organization for Streaming-based Reconfigurable Computing,” vol. 2, 2002, pp. 813–817. [6] A. Lambrechts, P. Raghavan, M. Jayapala, B. Mei, F. Fig. 7 Cycle count for 2D-FFT Catthoor, and D. Verkest, “Interconnect Exploration for Energy versus Performance tradeoffs for Coarse Grained IX. CONCLUSION Reconfigurable Architectures,” vol. 17, no. 1, JANUARY This paper proposes a programmable and 2009, pp. 151–155. partitionable interconnect interface as a method of [7] T. Marescaus, E. Brockmeyer, and H. Corporaal, “The communication between mBanks and RFiles. The Impact of Higher Communication Layers on NoC supported programmable interconnect supports pipelined or MPSoCs,” in Proceeding of the First International bufferless modes. The instruction interconnect also Symposium on Network-on-Chip (NOCS’07), 2007, pp. has private partitioning capability where multiple 107–116. sequencers can communicate with different mBanks using different segments of the network. The [8] C. Liang and X. Huang, “SMARTCELL: A Power controllers within the system help in providing Efficient Reconfigurable Architecture for Data Streaming application,” in IEEE Workshop on Signal Processing patterns of data which can be routed based on the Systems (SiPS’08), 2008, pp. 257–262. instructions. The FFT and 2D experiments show the overhead [9] G. Rauwerda, P. Heysters, and G. Smit, “Towards of the interconnect, compared to the computational Software Defined Radios using Coarse-Grained cost. The results show that for the given Reconfigurable Hardware,” vol. 16, no. 1, 2008, pp. 3–13. programmable interconnect the best throughput is obtained when reasonable computational resources are [10] L. Smit, A. Molclerink, P. Wolkotte, and G. Smit, “Implementation of 2-D 8x8 IDCT on Reconfigurable utilized with good locality to the data. Moreover, gate MONTIUM core,” in International Conference on Field level synthesis results show that the system can run Programmable Logic and Applications, 2007, pp. 562–566. up-to 400 MHz on 90nm technology. As a part of future work, another sequencer will be [11] C. Liang and X. Huang, “Mapping parallel FFT added which will act as a dedicated main controller algorithm onto SMARTCELL Coarse-Grained for memory. A compiler to automate the process of Reconfigurable Architecture,” in IEEE 20th International mapping is also under development. Conference on Application-specific Systems, Architectures and Processors, 2009, pp. 231–234. [12] X. Chen, Z. Lu, A. Jantsch, and S. Chen, “Supporting Distributed Shared Memory on Multi-core Network-on-chip using a Dual Micro-coded Controller”, 2010, pp. 39–44.