The document discusses different computational architectures including scalar, SIMD, CGRA, and neuromorphic systems. It focuses on IMAX, a dataflow-centric coarse-grained reconfigurable array (CGRA), and its scalability: with four 64-unit IMAX modules per lane and 30 HBM2 ports, 307,200 operations can be mapped at once. The document also discusses micro, medium, and macro pipelining strategies for combining IMAX with HBM2 memory.
The Computing Architecture Laboratory at Nara Institute of Science and Technology is now targeting power-efficient computers to help suppress global warming. I present this video to all hungry engineers who are tired of CPUs, GPUs, FPGAs, tensor cores, and AI cores, who want a challenge with no black box inside, and who want to improve things by themselves. This video follows episode 11 and focuses on the scalability of IMAX.
Let's scale it up. A scalar processor also has SIMD instructions for about 32 elements. Increasing the number of cores increases performance; however, if you don't program it well, you will get many cache misses and poor performance. SIMD units with 256 or more elements are called vector units, and there are two types. Vector type 1 is connected to cache memory; since the cache memory is small, the number of elements per vector operation is only about 256. Vector type 2 is directly connected to the main memory, and the number of elements can be increased up to about 2,048. The CGRA has various configurations. This diagram shows a sandwich structure of ALUs and 64 kilobytes of local memory. The number of elements that can be handled at once is now 16,000. By absorbing irregular memory references in local memory, the main memory can keep running at high speed with only regular accesses. You can also concatenate multiple memory spaces into the pipeline to build a longer pipeline.
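The idea of absorbing irregular references in local memory can be sketched in software. This is a minimal model, not the actual IMAX hardware: a hypothetical gather stage handles the irregular index pattern once, into a small local buffer, and the compute stage then sees only unit-stride access, so the main-memory side is touched only with regular, streaming reads.

```python
def gather(main_mem, indices):
    """Stage 1: absorb the irregular reference pattern into local memory."""
    return [main_mem[i] for i in indices]

def compute(local_mem):
    """Stage 2: regular, unit-stride compute over local memory (the ALU row)."""
    return [2 * x + 1 for x in local_mem]

main_mem = list(range(16))               # data in main memory
lmem = gather(main_mem, [7, 0, 3, 1])    # irregular accesses land here only
out = compute(lmem)                      # downstream access is fully regular
```

The point of the split is that only `gather` ever sees the scattered index pattern; everything after it streams sequentially, which is what keeps the main-memory channel fast.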
As in episode 1, HBM2 can provide multiple AXI buses for scaling up IMAX. This is the case of simply increasing the number of lanes.
In addition to micro pipelining, medium pipelining is available within each lane. Double buffering in local memory can isolate the stages of FFT, merge sort, and so on.
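The double-buffering scheme can be sketched as follows. This is a minimal software model under my own assumptions, not IMAX code: two local-memory buffers alternate roles each step, so while one buffer is being filled, the compute stage drains the other, and the two stages never touch the same buffer in the same step.

```python
def double_buffered(blocks, stage_fn):
    """Run stage_fn over a stream of blocks using two alternating buffers."""
    bufs = [None, None]
    out = []
    for step, block in enumerate(blocks):
        fill = step % 2          # buffer being loaded this step
        drain = 1 - fill         # buffer being computed this step
        bufs[fill] = block       # load stage fills one buffer...
        if step > 0:
            out.append(stage_fn(bufs[drain]))  # ...while compute drains the other
    # drain the final buffer after the input stream ends
    out.append(stage_fn(bufs[(len(blocks) - 1) % 2]))
    return out
```

In hardware the two roles run concurrently rather than in sequence, which is exactly how the buffering isolates one pipeline stage from the next.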
Furthermore, multiple lanes can be concatenated through HBM2 in this way. This configuration combines micro, medium, and macro pipelining all together.
One lane can support four IMAX modules, each with 64 units, so 10,240 operations can be mapped onto one lane. If 30 ports are available in HBM2, 307,200 operations can be mapped at once. One IMAX module will occupy 1.2 square millimeters. If we can fabricate with 8-nanometer technology, 120 IMAX modules will occupy 144 square millimeters, a quarter of a high-end GPGPU.
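The scaling arithmetic from the narration can be written out as a quick check. The per-unit figure of 40 operations is my own inference from the stated numbers (10,240 operations over 4 × 64 = 256 units per lane), not something the narration states directly.

```python
units_per_module = 64
modules_per_lane = 4
ops_per_lane = 10240                      # figure stated in the narration
hbm2_ports = 30

units_per_lane = units_per_module * modules_per_lane   # 256 units
ops_per_unit = ops_per_lane // units_per_lane          # inferred: 40 ops/unit
total_ops = ops_per_lane * hbm2_ports                  # 307,200 ops at once

module_area_mm2 = 1.2
total_area_mm2 = 120 * module_area_mm2                 # 144 mm^2 for 120 modules
```

So the 307,200-operation figure follows directly from 30 lanes of 10,240 operations, and 120 modules at 1.2 mm² each give the quoted 144 mm².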
This is a top-down approach from the view of the application. Suppose that various data are located in the main memory and processed in a pipelined manner. To avoid interference among multiple data flows in the main memory, the intermediate data should be stored outside the main memory. The ring structure and the local memory of IMAX can be employed to build pipelines outside the main memory.
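The idea of keeping intermediate data out of main memory can be modeled with a chained pipeline. In this sketch (my own illustration, not IMAX code), generator stages pass values directly stage-to-stage, playing the role of the ring and local memory, so only the input and the final result ever reside in the main-memory array.

```python
def stage_scale(src, k):
    """First pipeline stage: scale each element."""
    for x in src:
        yield x * k          # intermediate value flows onward, never stored back

def stage_offset(src, b):
    """Second pipeline stage: add an offset."""
    for x in src:
        yield x + b

main_memory = list(range(8))                            # input in main memory
pipe = stage_offset(stage_scale(iter(main_memory), 2), 1)
result = list(pipe)                                     # only the output lands back
```

Because the stages are chained, there is no shared buffer between them that a second data flow could interfere with, which is the property the ring structure provides in hardware.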
This is an example. A CPU can also be employed to cover complicated functions with many conditional branches.
IMAX is ready on some FPGA boards. The bin files and SD-card images on our web site provide you with real CGRAs. The HBM2 version and the VMK version will appear soon.
Our web site has links to the documents and tools. Note that only the Verilog code is not included. I hope IMAX can contribute to stopping global warming. Thank you for your attention.