Yushi KAMIYA, Tomoaki TSUMURA, Hiroshi MATSUO, Yasuhiko NAKASHIMA:
"A Speculative Technique for Auto-Memoization Processor with Multithreading" (presentation slides)
Proc. 10th Intl. Conf. on Parallel and Distributed Computing, Applications and Technologies (PDCAT'09), Higashi-Hiroshima, Japan, pp.160-166 (Dec. 2009)
1. A Speculative Technique for Auto-Memoization Processor with Multithreading. Yushi KAMIYA†, Tomoaki TSUMURA†, Hiroshi MATSUO†, Yasuhiko NAKASHIMA‡ (†Nagoya Institute of Technology, ‡Nara Institute of Science and Technology). The 10th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Hiroshima, Japan, December 9, 2009
5. Auto-Memoization Processor (diagram: registers, ALU, D$1/D$2 caches, MemoBuf as a temporary buffer, and MemoTbl; the processor detects a function or a loop, saves the input/output sequence to MemoBuf during computation, stores it into MemoTbl at the end of the computation, and writes stored outputs back on an input match)
6. Registration of an input sequence (diagram: MemoTbl tables RF (RAM), RB (CAM), RA (RAM), and W1 (RAM) with W1 pointers, MemoBuf, and memory/cache contents for the sample code: int x, y[5]; ... opr(4); ... opr(int a) { int v; v = x + a; v = v * y[1]; return (v); })
7. Input Matching (diagram: the same MemoTbl tables and the sample function opr(), showing how the input values are matched against RB entries via the RA address chain)
8. Reuse Overhead (diagram: comparing the input sequence with the values of RB entries, and writing the output sequence back to the registers and D$1; both steps are marked as reuse overheads)
13. Execution model (diagram: timelines of the former model and the proposal model for the main thread, the preceding thread, and the no-memoization thread on cores (A), (B), and (C), with phases for execution, search, and write back; the cases where the first several input values match RB entries, match completely, or do not match, and the resulting reduction of reuse overhead; sample code: int sum(a, b) { int i, sum = 0; for (i = 0; i < a; i++) sum += i + b; return (sum); })
14. Prediction Pointer (diagram: MemoTbl with a prediction pointer added beside the W1 pointer of each RA entry, for the same sample function opr())
16. Architecture – the proposal model (diagram: three cores for the main, preceding, and no-memoization threads, each with registers, ALU, D$1, and an additional register file set SpRF; SpMT cores with their own MemoBuf and input prediction, which do not use the shared MemoBuf; MemoTbl and D$2 shared with all cores)
23. Register copy overhead (chart: SPEC CPU95 programs 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 147.vortex, 101.tomcatv, 102.swim, 104.hydro2d, 110.applu, 141.apsi, 145.fpppp, 146.wave5, comparing copying all values with the proposal model; copy latency: 32 bits/cycle)
Editor's Notes
☆ : mouse-click timing. Thank you, Mr. Chairman. Good afternoon, ladies and gentlemen. In this presentation, I'd like to talk about an auto-memoization processor and its improvement using multithreading.
This is the agenda of my presentation. First, I'm going to talk about the background of our study. Next, I'd like to talk about our proposal model and its hardware implementation. After that, I'm going to discuss its evaluation. Finally, I'd like to finish with the conclusions. ☆ First, I'd like to talk about the research background.
Microprocessors are now facing a crossroads of speedup techniques. Techniques based on instruction-level parallelism, such as superscalar execution or SIMD instruction sets, and on thread-level parallelism, such as auto-parallelizing compilers, have been counted on. However, the effect of these techniques has proved to be limited. One reason is that many programs have little distinct parallelism. Other reasons are memory throughput, the difficulty of finding thread-level parallelism, and more. Meanwhile, in the software field, memoization is a widely used programming technique for speedup. It stores the results of functions for later reuse and avoids re-computing them. However, memoization incurs a certain overhead because it is implemented in software. ☆ So we have proposed an auto-memoization processor, which can run binary programs faster without any software assist. Now I'd like to talk about how hardware can skip the execution of instructions.
The auto-memoization processor memoizes some instruction regions automatically in hardware. Input matching in hardware reduces the overheads of memoization. The targets of memoization are not only functions but also loop iterations. ☆ A region between a callee label and a return instruction is detected as a function. (pointing) ☆ A region between a backward branch and its target label is detected as a loop iteration. (pointing)
Here is the brief structure of the auto-memoization processor. ☆ There are two memories for memoization, MemoBuf and MemoTbl. ☆ During the execution of an instruction region, the processor stores the memory addresses and values of the input and output sequences in MemoBuf. ☆ At the end of the region, the input and output sequences in MemoBuf are stored into MemoTbl. ☆ The next time the processor encounters the same region, it tests whether the current input sequence completely matches one of the past input sequences. ☆ If it matches, the processor writes the output sequence back from MemoTbl to the registers and caches, and skips the execution of the region.
MemoTbl contains four tables. RF stores the start addresses of instruction regions, RB stores input data sequences, RA stores input address sequences, and W1 stores output data sequences. RF, RA, and W1 are implemented in RAM, and RB is implemented in CAM. Now, let's look at this sample program. (pointing) ☆ First, when the function call opr() is detected, ☆ the processor searches for the address of opr() in the RF table, and the address is not stored in RF. ☆ So the processor stores the address of opr() in the RF table ☆ and stores the value of the argument "a" in MemoBuf. ☆ Next, the processor stores the memory addresses and values of "x" and "y[1]" in MemoBuf. ☆ When the processor detects the return instruction of the function opr(), it finishes storing the input sequence. ☆ The input sequence in MemoBuf is divided into blocks, each of which has an address and a value. ☆ After that, the input sequence is stored into empty RB and RA entries block by block. ☆ The output sequence is then stored in the W1 entry "01", so the processor stores the value "01" in the W1 pointer of the terminal RA entry "05".
In this slide, I will explain the behavior of input matching. ☆ First, when the function call opr() is detected, ☆ the processor searches for the address of opr() in the RF table. ☆ After that, the processor reads the value of the argument "a", and the value 4 matches the RB entry "02". ☆ The next address is determined to be "1000", which is the memory address of "x". ☆ Then the processor reads the value from address "1000" and searches for the value in RB again. ☆ This process is applied repeatedly until all input values are confirmed. If all inputs of a reuse target block match one of the input sequences stored in MemoTbl, input matching succeeds. ☆ If input matching succeeds, the processor reads the output sequence from W1 by using the W1 pointer of the terminal RA entry "05".
Meanwhile, accessing MemoTbl inevitably causes overhead. ☆ First, searching RB, referring to RA, and reading registers and caches take a certain time. ☆ Second, when input matching has succeeded, the output sequence must be written back from W1, which also takes some time. We call these two kinds of overheads "reuse overheads".
Meanwhile, the auto-memoization processor provides speculative multithreading, which improves the effect of computation reuse. ☆ We add SpMT cores, which have the same structure as the main core, to the processor. ☆ In this example, the main core executes the function fact() and stores its input and output sequences in MemoTbl. ☆ The processor predicts the input sequence of the function fact() by stride value prediction. ☆ After that, the SpMT cores execute the function fact() with the predicted input sequences and store the input and output sequences in MemoTbl. ☆ In this example, although the main core has not yet executed the function fact(4), ☆ it can omit the execution of the region by using the input and output sequence that the second SpMT core stored.
Next, I'm going to talk about our new model.
However, as the number of SpMT cores increases, speculative multithreading reaches its performance limit. One of the causes is that with more SpMT cores, MemoTbl is filled with many input and output sequences, and sequences that are never used waste MemoTbl entries. Accordingly, we should propose other techniques besides speculative multithreading. ☆ So we propose reducing the reuse overhead with multithreading and making effective use of multiple cores.
In the proposal model, the processor runs two additional threads that reduce the reuse overhead. ☆ First, the preceding thread assumes that input matching will succeed and speculatively executes the code following the reuse target region. (pointing) ☆ Second, the no-memoization thread assumes that input matching will fail and executes the reuse target region normally. (pointing) In the next slide, I will explain the behavior of these threads in detail.
Now, let's look at this sample program. (pointing) ☆ In the former model, input matching for the function sum(5, 3) succeeds and the processor can omit the execution of the region. (pointing) ☆ Then input matching for the function sum(3, 6) fails and the processor executes the region normally. (pointing) ----- Explanation of the proposal model ----- ☆ Next, I will explain the proposal model. In this example, there are three cores, (A), (B), and (C). ☆ At the beginning of the program, cores (A), (B), and (C) are assigned to the main thread, the preceding thread, and the no-memoization thread, respectively. ☆ When core (A) detects the function sum(), it starts input matching. Simultaneously, cores (B) and (C) copy the value of the program counter of core (A), and core (C) executes the function sum() normally. ☆ When core (A) finds that the first several input values match RB entries in MemoTbl, ☆ core (B) executes the code following sum(). (pointing) ☆ After input matching finishes, the preceding thread on core (B) turns into the main thread, and the threads on cores (A) and (C) are squashed. ☆ Next, core (B) starts input matching on detecting the function sum(). ☆ When core (B) detects that input matching has failed, ☆ the no-memoization thread on core (C) turns into the main thread and the other threads are squashed. ☆ So the two threads can hide these amounts of reuse overhead. (pointing) ☆ And the proposal model can reduce this amount of reuse overhead in total.
By the way, the preceding thread has to pick up an output sequence of the reuse target region in order to execute the code that follows the block. ☆ So we add a "prediction pointer" to every RA entry. In this case, the input sequence of the function opr() is stored in the RB entries "02", "04", and "05". ☆ In the proposal model, the value of the W1 pointer is copied to the prediction pointers of all RA entries in which the input sequence was stored. ☆ Now I will explain how these prediction pointers are used. ☆ When the first several input values match RB entries, ☆ the preceding thread reads the output sequence from W1 by using the prediction pointer and executes the following region. ☆ The main thread continues input matching. ☆ In this case, the value of the W1 pointer is equal to the value of the prediction pointer that was used, so the preceding thread can continue executing the code after the block.
Next, I'm going to talk about the implementation of our model.
Here is the brief structure of the proposal model. ☆ MemoBuf is shared by the three cores. ☆ The three cores are assigned to the main thread, the preceding thread, and the no-memoization thread. ☆ SpMT cores have their own MemoBuf and do not use the shared MemoBuf. ☆ MemoTbl and the second-level data cache are shared by all cores. ☆ In addition, each of the three cores has an additional register file set, which we call SpRF. ☆ The ALU, the register file, and the SpRF of all cores are connected to each other, so each core can write its output to the register file and SpRF of every core.
The preceding thread and the no-memoization thread use the SpRF instead of the register file. ☆ The register mask is a bitmask that monitors accesses to the SpRF. Each bit of the register mask corresponds to a register number. When a write access to the SpRF is detected, the corresponding bit is set. A set bit means that the value stored in the corresponding SpRF entry is live. ☆ Now I'll show how the SpRF and the register mask work. Three cores are now assigned to the main thread, the preceding thread, and the no-memoization thread. The processor tries to keep the values of the register file and the SpRF synchronized. ☆ However, the preceding thread and the no-memoization thread write values to their own SpRF, so the register files and SpRFs of the cores cannot stay synchronized. ☆ After input matching fails, the new main thread uses its old SpRF as the register file. ☆ The values stored in the SpRF are then synchronized to the register files of all cores. This synchronization is executed in the background, so some of the overhead is concealed. ☆ However, the processor has to synchronize some register values, and this takes a certain time.
Next, I'm going to talk about performance evaluation and the conclusion of this research.
We have developed a single-issue, simple SPARC-V8 simulator with the auto-memoization structures and evaluated the performance of the processor. Here are the simulation parameters. In the next slide, I'll show the result chart.
This is the result for the SPEC CPU95 suite. Each benchmark is represented by five bars. The leftmost bar plots the baseline, that is, the execution cycles the original benchmark costs. The second bar plots the cycles using the auto-memoization structures with no speculative cores. The third bar plots the cycles using parallel speculative execution with two SpMT cores. The fourth bar plots the cycles of the overhead-concealing model we propose. The fifth bar plots the cycles of the hybrid model of parallel speculative execution and the proposal model, with five cores. ☆ The legend shows the itemized breakdown of cycles: executed cycles, reuse overhead, register copy overhead, cache miss penalties, and register window miss penalties. ☆ The execution cycles of some benchmark programs are reduced by memoization. Parallel speculative execution works very well on the CFP benchmarks, and the proposal model reduces the reuse overhead on the CINT benchmarks. The hybrid model achieves the best performance on almost all benchmarks. ☆ Now, I show the cycles reduced by each model. Above all, the hybrid model reduced the cycles by up to 36%, and by 9% on average.
Now, I would like to finish with the following conclusions. We have proposed an auto-memoization processor with multithreading that can reduce the reuse overhead. The hybrid model achieves good performance through the synergistic effect of the two techniques. Our future work is to change the assignment of cores to threads dynamically. In the current implementation, the cores for parallel speculative execution and the three cores for concealing overheads do not exchange their threads with each other, so a further improvement of the processor model will be required. ========== time over ========== This is the conclusion of my presentation. Thank you for your attention.