A Speculative Technique for Auto-Memoization Processor with Multithreading
Yushi KAMIYA†, Tomoaki TSUMURA†, Hiroshi MATSUO†, Yasuhiko NAKASHIMA‡
† Nagoya Institute of Technology  ‡ Nara Institute of Science and Technology
The 10th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Hiroshima, Japan, 9 December 2009
Outline: research background; proposal model; hardware implementation; performance evaluation; conclusion
Research background. Speedup techniques based on instruction-level parallelism (superscalar, SIMD instruction sets) and on thread-level parallelism (auto-parallelizing compilers) are reaching their limits, since many programs have little distinct parallelism. In software, memoization, storing the results of functions for later reuse, is a widely used speedup technique, but it costs a certain overhead because it is implemented in software. The Auto-Memoization Processor performs memoization in hardware, letting it skip the execution of memoized regions without any software assist.
Memoization for functions and loops. Memoizable instruction regions are detected in the binary: (A) functions, the region between a callee label and its return instruction (func: ... return %x, reached by call func in main); (B) loops, the region between a backward branch and its branch-target label (.LL3: ... ba .LL3).
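Function-level memoization, which the processor above performs in hardware, can be sketched in software. The following C sketch is illustrative only and is not the processor's mechanism: the table layout and the names memo_lookup, memo_store, and heavy are all hypothetical, and a single int input stands in for a full input sequence.

```c
#define MEMO_SLOTS 16

/* One memoized result: an input value and the output it produced. */
struct memo_entry { int valid; int input; int output; };
static struct memo_entry memo_tbl[MEMO_SLOTS];

/* Look up a past input; on a hit, reuse the stored output. */
static int memo_lookup(int input, int *output) {
    for (int i = 0; i < MEMO_SLOTS; i++)
        if (memo_tbl[i].valid && memo_tbl[i].input == input) {
            *output = memo_tbl[i].output;   /* reuse: skip recomputation */
            return 1;
        }
    return 0;
}

/* Store a new input/output pair into the first empty slot. */
static void memo_store(int input, int output) {
    for (int i = 0; i < MEMO_SLOTS; i++)
        if (!memo_tbl[i].valid) {
            memo_tbl[i] = (struct memo_entry){1, input, output};
            return;
        }
}

static int calls = 0;   /* counts real executions of the region */

/* The "memoizable region": a pure function of its argument. */
int heavy(int a) {
    int out;
    if (memo_lookup(a, &out)) return out;   /* region skipped */
    calls++;
    out = a * a + 1;   /* stands in for an expensive computation */
    memo_store(a, out);
    return out;
}
```

Calling heavy(4) twice executes the body only once; the second call is answered from the table, which is exactly the skip the processor achieves in hardware.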
Auto-Memoization Processor (structure). Besides the ordinary Regs, ALU, and D$1/D$2 caches, the processor has two memoization structures: MemoBuf, a temporary buffer that saves the input/output sequence while a detected function or loop is computing, and MemoTbl, into which the sequence is stored at the end of computation. On a later input match, the outputs are written back from MemoTbl and execution of the region is skipped.
Registration of an input sequence. MemoTbl consists of four tables: RF (RAM: start addresses of regions), RB (CAM: input values), RA (RAM: input addresses), and W1 (RAM: output values). Sample program:

int x, y[5];
...
opr(4);
...
int opr(int a) { int v; v = x + a; v = v * y[1]; return (v); }

While opr() executes, MemoBuf logs the argument (%i0 = 00000004) and each input read: x at address 00001000 (value 00000002) and y[1] at address 00001008 (value 00000001). At the return, the sequence is divided into (address, value) blocks, stored into empty RB and RA entries, and the W1 pointer of the terminal RA entry is set to the W1 entry holding the output (v = 6).
Input Matching. On the next call to opr(), the processor finds opr()'s entry in RF and walks the stored sequence: the argument value 4 matches RB entry 02; the corresponding RA entry gives the next address, 00001000 (x), whose current value is read from memory (cache) and searched in RB again. This repeats until every input value is confirmed; on a complete match, the W1 pointer of the terminal RA entry locates the output sequence (v = 6) to reuse.
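The register-then-match flow above can be modeled in a few lines of C. This is a deliberately simplified software analogy: memory is an int array indexed by address, a stored input sequence is a flat list of (address, value) pairs, and all names are invented; the real MemoTbl splits this across the RB (CAM) and RA (RAM) tables and matches in hardware.

```c
#include <stddef.h>

/* One recorded input: the address read inside the region and the value
 * observed there (analogous to an RA/RB entry pair). */
struct input_rec { size_t addr; int val; };

/* Registration: while the region executes, log every input read. */
static void record_input(struct input_rec *seq, int *n,
                         size_t addr, int val) {
    seq[(*n)++] = (struct input_rec){ addr, val };
}

/* Input matching: replay the stored address sequence against current
 * memory; reuse is allowed only if every stored value still matches. */
static int inputs_match(const struct input_rec *seq, int n,
                        const int *mem) {
    for (int i = 0; i < n; i++)
        if (mem[seq[i].addr] != seq[i].val)
            return 0;   /* an input changed: matching fails */
    return 1;           /* complete match: outputs may be reused */
}
```

As in the opr() example, matching fails as soon as any recorded input (the argument, x, or y[1]) has changed since registration, so stale results are never reused.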
Reuse Overhead. Accessing MemoTbl is not free. First, comparing the input sequence with the values of RB entries (searching RB, referring to RA, and reading registers and caches) takes time. Second, on a successful match the output sequence must be written back from W1 to the registers and D$1. These two costs are the reuse overheads.
Speculative Multithreading (fact = factorial, n!). SpMT cores with the same structure as the main core share MemoTbl. While the main core executes fact(1), fact(2), ... and stores their input/output sequences, stride value prediction guesses future arguments, and the SpMT cores calculate in advance: fact(3) = 6, fact(4) = 24, fact(5) = 120. When the main core reaches fact(4), it reuses the entry an SpMT core stored instead of executing the function.
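The stride prediction driving the SpMT cores can be illustrated with a small C sketch. The predictor here is the simplest possible (last value plus last stride), and predict_next and fact are illustrative names for the sketch, not the processor's interface.

```c
/* Last-value + constant-stride prediction: given the two most recent
 * argument values, guess the next one. */
static int predict_next(int prev, int last) {
    return last + (last - prev);
}

/* The reuse-target region from the slide: factorial. An SpMT core
 * would execute this with the predicted argument and store the
 * input/output pair into MemoTbl before the main core needs it. */
static long fact(int n) {
    return (n <= 1) ? 1L : (long)n * fact(n - 1);
}
```

With calls observed at fact(1), fact(2), the predictor yields 3, 4, 5, ..., so spare cores can precompute exactly the entries (fact(4) = 24, for example) that the main core later reuses.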
Outline
Memoization and Multithreading. Simply increasing the number of SpMT cores soon reaches a performance limit: MemoTbl fills with predicted input/output sequences, and entries that are never reused waste its capacity. Our proposal uses the additional cores differently, to reduce the reuse overhead itself.
Reduction of Reuse Overhead. The proposal runs two additional threads beside the main thread. The preceding thread assumes that the input matching will succeed and speculatively executes the code after the reuse-target region (B). The no-memoization thread assumes that the input matching will fail and executes the reuse-target region itself normally (A).

... v = u / w;
sum();      (A) executed normally by the no-memoization thread
y = x + 4;  (B) executed speculatively by the preceding thread
...
Execution model. Sample program:

int sum(a, b) { int i, sum = 0; for (i = 0; i < a; i++) sum += i + b; return (sum); }
...
v = u / w;
x = sum(5, 3);    first several input values match RB entries, then completely match: reuse
y = x + 4;
z = x + y;
...
x = sum(3, 6);    inputs do not match: normal execution
z = x + y;
...

In the former model, the main thread alone performs the search and, on a hit, the write-back, so it pays the full reuse overhead before continuing. In the proposal model, while the main thread searches, the preceding thread speculatively executes the code after sum() and the no-memoization thread executes sum() normally. On a complete match the preceding thread becomes the main thread, hiding overhead (α); on a mismatch the no-memoization thread becomes the main thread, hiding overhead (β). In total, the proposal reduces the reuse overhead by (α + β).
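The overhead reductions (α) and (β) can be put into a toy latency model in C. The cycle counts and function names here are purely illustrative assumptions for the sketch, not measured values from this work.

```c
/* Former model: the main thread always pays the search, plus either
 * the write-back (hit) or a normal execution of the region (miss). */
static int former_cost(int hit, int search, int writeback, int exec) {
    return hit ? search + writeback : search + exec;
}

/* Proposal model: on a hit, the preceding thread overlaps the reuse
 * overhead with execution of the code after the region; on a miss,
 * the no-memoization thread has been executing the region all along,
 * so only the longer of (search, exec) is visible. */
static int proposal_cost(int hit, int search, int writeback, int exec,
                         int follow_exec) {
    if (hit) {
        int ovh = search + writeback;
        /* overhead hidden behind speculative following-code execution */
        return ovh > follow_exec ? ovh - follow_exec : 0;
    }
    return search > exec ? search : exec;
}
```

With, say, search = 10, writeback = 5, exec = 100, and 8 cycles of following code, the visible cost drops from 15 to 7 on a hit and from 110 to 100 on a miss, mirroring (α) and (β).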
Prediction Pointer. The MemoTbl from the earlier example, extended with a prediction pointer beside the W1 pointer: during input matching on opr(4), the prediction pointer (01) indicates the entry expected to match next, so the expected sequence can be anticipated before all the inputs (x = 2, y[1] = 1, output v = 6) are confirmed.
Outline
Architecture – the proposal model. The cores running the main, preceding, and no-memoization threads each have Regs, D$1, an ALU, and an additional register file set (SpRF). The SpMT cores have their own MemoBuf with input prediction and do not use the shared MemoBuf. MemoTbl, D$2, and the shared MemoBuf are shared with all cores.
Register Synchronization. Each thread's register file (RF) is paired with a speculative register file (SpRF), and a register mask records which registers have been updated. (A) When the search for sum() starts, the main thread's register values (e.g. g0 = 0FFF1000, g1 = 00000040) are copied to the preceding and no-memoization threads. (B) A register updated afterwards (e.g. by a = b * c, writing 00000050) sets its mask bit, and only masked registers are written back between RF and SpRF. (C) Registers whose mask bits are clear are not synchronized (e.g. before min(a, b, c)), which keeps copy traffic small.
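The mask-based synchronization can be sketched in C. Everything here (the 8-register file, the bitmask encoding, the name sync_regs) is a simplification invented for illustration; the real mechanism tracks per-register updates between RF and SpRF in hardware.

```c
#define NREGS 8

/* Copy only the registers whose mask bit is set from src (RF) to dst
 * (SpRF), instead of copying the whole register file. Returns the
 * number of registers copied, i.e. the copy cost at 32 bits/cycle. */
static int sync_regs(unsigned mask, const int *src, int *dst) {
    int copied = 0;
    for (int r = 0; r < NREGS; r++) {
        if (mask & (1u << r)) {
            dst[r] = src[r];  /* updated register: synchronize it */
            copied++;
        }                     /* clear bit: leave dst unchanged */
    }
    return copied;
}
```

If only two of eight registers were updated, the copy takes 2 cycles instead of 8, which is the saving the appendix slide on register copy overhead quantifies.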
Outline
Performance Evaluation. Simulation parameters:
MemoBuf (shared + local, RAM): 160 KBytes
MemoTbl: 128 KBytes (CAM) + 448 KBytes (RAM)
Comparison (register and CAM): 9 cycles / 32 bytes
Comparison (cache and CAM): 10 cycles / 32 bytes
Write back (MemoTbl to register or cache): 1 cycle / 32 bytes
Register copy: 1 cycle / 32 bits
Performance – SPEC CPU95. Benchmarks: CINT (099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 134.perl, 147.vortex) and CFP (101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d, 107.mgrid, 110.applu, 125.turb3d, 141.apsi, 145.fpppp, 146.wave5). Cycle breakdown: exec, D$1, D$2, window, regcopy, reuse_ovh. Cycles reduced relative to (N) w/o memoization:
(M) Memoization: max 13.9%, avg -0.1%
(P) Memoization + Proposal: max 21.7%, avg 2.1%
(S) Memoization + SpMT: max 35.2%, avg 5.6%
(A) Memoization + SpMT + Proposal: max 36.0%, avg 9.0%
Conclusion
 
Appendix: Register copy overhead per benchmark (099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg, 147.vortex, 101.tomcatv, 102.swim, 104.hydro2d, 110.applu, 141.apsi, 145.fpppp, 146.wave5), comparing copying all register values against the proposal model's masked copy; copy latency: 32 bits/cycle.

Editor's Notes

  1. ☆ : Mouse click timing Thank you Mr. Chairman. Good afternoon ladies and gentlemen. In this presentation, I&apos;d like to talk about an auto-memoization processor and it&apos;s improvement using multithreading.
  2. This is the( ジ ) agenda of my presentation. First, I&apos;m going to talk about the background of our study. Next, I&apos;d like to talk about our proposal model and its hardware implementation. After that, I&apos;m going to discuss the (ジ ) evaluation( 後にアクセント ) of it. Finally, I&apos;d like to finish by making the conclusions. ☆ First, I&apos;d like to talk about research background.
  3. Now, microprocessors are facing to the crossroads of speedup techniques. Speedup techniques based on Instruction-Level Parallelism, such as the superscalar( スケーラ ) or SIMD( シムディー ) instruction sets and thread level parallelism such as auto-parallelization compiler have been counted on. However, the( ジ ) effect of these techniques has proved to be limited. One reason is that many programs have little distinct parallelism. Other reasons are memory throughput, the difficulty of finding thread level parallelism, and more. Meanwhile, in the software field, memoization is a widely used programming technique for speedup. It is storing the results of functions for later reuse, and avoids re-computing them. However, Memoization costs a certain overhead because it is implemented by software. ☆ So we have proposed an auto-memoization processor it can run binary programs faster without any software assist. Okay, now I&apos;d like to talk about how hardware can skip execution of instructions.
  4. The auto-memoization processor memoizes some instruction regions automatically( ティカリー ) by hardware. Input matching by hardware reduces overheads for memoization. The targets of memoization are not only functions but also loop iterations.(ra にアクセント ) ☆ A region between(v) a callee label (v) and return instruction will be detected as a function.( 指しながら ) ☆ A region between(v) a backward branch(v) and its target label will be detected as a loop iteration. ( 指しながら )
  5. Here is the brief structure of auto-memoization processor. ☆ There are two memories for memoization, MemoBuf and MemoTbl. ☆ Through the( ジ ) execution of an instruction region, the processor stores the memory addresses and values of input and output sequence to MemoBuf. ☆ At the( ジ ) end of the region, the( ジ ) input and output sequence in MemoBuf is stored into MemoTbl. ☆ Next time the processor encounters(co にアクセント ) the same region, the processor tests whether the current input sequence completely matches with one of the past input sequence. ☆ If matches, the processor writes back the( ジ ) output sequence from MemoTbl to registers and caches. And the processor skips the( ジ ) execution of the region.
  6. MemoTbl has four tables in it. RF stores start addresses of instruction regions, RB stores input data sequences, RA stores input address sequences, and W1 stores output data sequences. RF, RA, and W1 are implemented by RAM. And, RB is implemented by CAM. Now, let&apos;s see this sample program.( 指しながら ) ☆ First, when the function call opr() is detected, ☆ The processor searches the( ジ ) address of opr() through the RF table and the( ジ ) address is not stored in RF. ☆ So the processor stores the( ジ ) address of opr() to RF table ☆ and stores the value of argument &amp;quot;a&amp;quot; to MemoBuf. ☆ Next, the processor stores the memory address and the value of &amp;quot;x&amp;quot; and &amp;quot;y[1]&amp;quot; to MemoBuf. ☆ When the processor detects return instruction of the function opr(), the processor finishes storing the( ジ ) input sequence. ☆ The( ジ ) input sequence in MemoBuf is divided into some blocks which have an address and a value. ☆ After that, the input sequence is stored into the empty RB and RA entries in blocks. ☆ Then the( ジ ) output sequence is stored in the W1 entry &amp;quot;01&amp;quot; so the processor stores the value &amp;quot;01&amp;quot; to the W1 pointer of the terminal RA entry &amp;quot;05&amp;quot;.
  7. In this slide, I will explain the behavior of input matching. ☆ First, when the function call opr() is detected, ☆ the processor searches the( ジ ) address of opr() through the RF table. ☆ After that, the processor reads the value of argument &amp;quot;a&amp;quot; and the value 4 matches the RB entry &amp;quot;02&amp;quot;. ☆ The next address is decided as &amp;quot;1000&amp;quot; which is the memory address &amp;quot;x&amp;quot;. ☆ Then, the processor reads the value from the address &amp;quot;1000&amp;quot;, and searches the value through RB again. ☆ This process is applied repeatedly until all input values are confirmed. If all inputs of a reuse target block have matched with one of the stored input sequence on MemoTbl, input matching succeeds. ☆ If input matching succeeds, the processor reads the output sequence from W1 by using the W1 pointer of the terminal RA entry &amp;quot;05&amp;quot;.
  8. Meanwhile, accessing MemoTbl causes overhead inevitably. ☆ First, searching RB, referring RA, and reading registers and caches cost a certain time. ☆ Second, when the( ジ ) input matching has succeeded, the( ジ ) output sequence should be written back from W1. This also costs some time. We call these two kinds of overheads &amp;quot;Reuse Overheads&amp;quot;.
  9. Meanwhile, the auto-memoization processor provides speculative multithreading which improves the( ジ ) effect of computation reuse. ☆ We append SpMT cores which have the same structure of the main core to the processor. ☆ In this example, the main core executes the function fact() and stores its input and output sequence to MemoTbl all together. ☆ The processor predicts the( ジ ) input sequence of the function fact() by stride value prediction. ☆ After that, SpMT cores execute the function fact() with predicted input sequence and store the( ジ ) input and output sequence to MemoTbl. ☆ In this example, although the main core has not executed the function fact(4), ☆ the main core can omit the( ジ ) execution of the region by using the( ジ ) input and output sequence (v) the second SpMT core stored.
  10. Next, I'm going to talk about our new model.
  11. However, as the number of SpMT cores increases, speculative multithreading reaches its performance limit. One of the causes is that with more SpMT cores, MemoTbl is filled with many input and output sequences, and sequences that are never reused waste MemoTbl entries. Accordingly, we need techniques other than speculative multithreading. ☆ So we propose reducing the reuse overhead with multithreading, realizing effective use of multiple cores.
  12. In the proposal model, the processor runs two additional threads which reduce the reuse overhead. ☆ First, the preceding thread assumes that input matching will succeed, and speculatively executes the code following the reuse target region. (pointing) ☆ Second, the no-memoization thread assumes that input matching will fail, and executes the reuse target region normally. (pointing) In the next slide, I will explain the behavior of these threads in detail.
  13. Now, let's see this sample program. (pointing) ☆ In the former model, input matching for the function sum(5, 3) succeeds and the processor can omit the execution of the region. (pointing) ☆ Then input matching for the function sum(3, 6) fails and the processor executes the region normally. (pointing) ----- Explanation of the proposal model ----- ☆ Next, I will explain the proposal model. In this example, there are three cores: (A), (B), and (C). ☆ At the beginning of the program, cores (A), (B), and (C) are assigned to the main thread, the preceding thread, and the no-memoization thread, respectively. ☆ When core (A) detects the function sum(), it starts input matching. Simultaneously, cores (B) and (C) copy the value of the program counter of core (A), and core (C) executes the function sum() normally. ☆ When core (A) finds that the first several input values match RB entries on MemoTbl, ☆ core (B) executes the code following sum(). (pointing) ☆ After input matching finishes, the preceding thread on core (B) turns into the main thread, and the threads on cores (A) and (C) are squashed. ☆ Next, core (B) starts input matching on detecting the function sum(). ☆ When core (B) detects that input matching has failed, ☆ the no-memoization thread on core (C) turns into the main thread and the other threads are squashed. ☆ So the two threads can reduce these amounts of reuse overhead. (pointing) ☆ And the proposal model can reduce this amount of reuse overhead in total.
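The cycle savings can be modeled very roughly. In this sketch the cycle counts, and the assumption that the winning thread's progress fully overlaps the matching latency, are illustrative only:

```python
# A toy cost model of how the preceding and no-memoization threads hide
# reuse overhead. All cycle counts below are illustrative assumptions.

def cycles_former_model(match_cost, exec_cost, writeback_cost, match_ok):
    # Former model: match first, then either write back the outputs
    # (on success) or execute the region from scratch (on failure).
    return match_cost + (writeback_cost if match_ok else exec_cost)

def cycles_proposal(match_cost, exec_cost, writeback_cost, match_ok):
    # Proposal: while core (A) matches inputs, core (B) speculatively
    # runs ahead and core (C) re-executes the region, so the matching
    # latency overlaps with whichever thread becomes the new main thread.
    if match_ok:
        return max(match_cost, writeback_cost)   # (B) wins the race
    return max(match_cost, exec_cost)            # (C) wins the race

for ok in (True, False):
    print(ok,
          cycles_former_model(10, 50, 8, ok),
          cycles_proposal(10, 50, 8, ok))
```

With these example numbers the proposal is cheaper in both outcomes (10 vs. 18 cycles on a match, 50 vs. 60 on a mismatch), mirroring the "reduce this amount of reuse overhead in total" claim.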
  14. By the way, the preceding thread should pick up an output sequence of the reuse target region to execute the code following the block. ☆ So we append a "prediction pointer" to every RA entry. In this case, the input sequence of the function opr() is stored in the RB entries "02", "04", and "05". ☆ In the proposal model, the value of the W1 pointer is copied to the prediction pointer of every RA entry in which the input sequence was stored. ☆ Now, I will explain how these prediction pointers are used. ☆ When the first several input values match RB entries, ☆ the preceding thread reads the output sequence from W1 by using the prediction pointer and executes the following region. ☆ The main thread continues input matching. ☆ In this case, the value of the W1 pointer is equal to the value of the prediction pointer which was used, so the preceding thread can continue executing the code after the block.
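The point of the prediction pointer is that the outputs become available after only a partial match. In this sketch the entry layout and the helper name are assumptions; it only shows the early-fetch idea:

```python
# A sketch of the prediction pointer: every RA entry the stored input
# sequence passes through carries a copy of the W1 pointer, so the
# preceding thread can locate the outputs after only the first input
# matches. Entry layout and names are illustrative assumptions.

# Simplified opr() sequence, with a prediction pointer on every entry.
ra = {
    0: {"next_addr": 0x1000, "pred": 140},
    1: {"next_addr": 0x1008, "pred": 140},
    2: {"next_addr": None,   "pred": 140},  # terminal entry: W1 pointer
}
rb = {("root", 4): 0, (0, 2): 1, (1, 1): 2}

def early_outputs(first_value):
    """After the first input matches, hand the preceding thread a
    speculative W1 pointer while the main thread keeps matching."""
    entry = rb.get(("root", first_value))
    return None if entry is None else ra[entry]["pred"]

print(early_outputs(4))   # 140: speculation can start immediately
print(early_outputs(7))   # None: no stored sequence starts with 7
```

If the full match later completes with the same W1 pointer, the preceding thread's speculation was correct and it simply continues.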
  15. Next, I'm going to talk about the implementation of our model.
  16. Here is the brief structure of the proposal model. ☆ MemoBuf is shared by the three cores. ☆ The three cores are assigned to the main thread, the preceding thread, and the no-memoization thread. ☆ The SpMT cores have their own MemoBuf and do not use the shared MemoBuf. ☆ MemoTbl and the second-level data cache are shared by all cores. ☆ In addition, each of these three cores has an additional register file set, which we call SpRF. ☆ The ALU, the register file, and the SpRF of all cores are connected to each other, so each core can write its output to the register file and SpRF of all cores.
  17. The preceding thread and the no-memoization thread use the SpRF instead of the register file. ☆ The register mask is a bitmask that monitors accesses to SpRF. Each bit of the register mask corresponds to a register number. When a write access to SpRF is detected, the corresponding bit is set. A set bit means that the value stored in the corresponding SpRF entry is active. ☆ Now, I'll show how SpRF and the register mask work. Three cores are now assigned to the main thread, the preceding thread, and the no-memoization thread. The processor aims to keep the values of the register file and SpRF synchronized. ☆ However, the preceding thread and the no-memoization thread write values to their own SpRF, so the register files and SpRFs of the cores cannot stay synchronized. ☆ After input matching fails, the new main thread uses its old SpRF as the register file. ☆ The values stored in SpRF are then synchronized to the register files of all cores. This synchronization is executed in the background, so some of the overhead is concealed. ☆ However, the processor still has to synchronize some register values, and this takes a certain amount of time.
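The register mask bookkeeping can be sketched as follows. The register count and the class interface are assumptions for illustration; the point is that the mask identifies exactly which registers must be copied back after a thread switch:

```python
# A sketch of the register mask: a bitmask marking which SpRF entries
# hold values newer than the shared register file. The register count
# and the API are assumptions for illustration.

NUM_REGS = 32

class SpRF:
    def __init__(self, regfile):
        self.values = list(regfile)  # starts as a copy of the register file
        self.mask = 0                # no speculative writes yet

    def write(self, reg, value):
        self.values[reg] = value
        self.mask |= 1 << reg        # mark this register as modified

    def dirty_regs(self):
        """Registers that must be synchronized to the other cores'
        register files when this thread becomes the main thread."""
        return [r for r in range(NUM_REGS) if self.mask & (1 << r)]

regfile = [0] * NUM_REGS
sp = SpRF(regfile)
sp.write(3, 42)
sp.write(10, 7)
print(sp.dirty_regs())   # [3, 10]: only these need copying after the switch
```

Keeping the dirty set small is what lets the background synchronization conceal most of the copy overhead.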
  18. Next, I'm going to talk about the performance evaluation and the conclusion of this research.
  19. We have developed a single-issue, simple SPARC-V8 simulator with the auto-memoization structures and evaluated the performance of the processor. Here are the simulation parameters. In the next slide, I'll show the result chart.
  20. This is the result for the SPEC CPU95 suite. Each benchmark is represented by five bars. The leftmost bar plots the baseline, that is, the execution cycles the original benchmark costs. The second bar plots the cycles using the auto-memoization structures with no speculative cores. The third bar plots the cycles using parallel speculative execution with two SpMT cores. The fourth bar plots the cycles of the overhead-concealing model we proposed. The fifth bar plots the cycles of the hybrid model of parallel speculative execution and the proposal model with five cores. ☆ The legend shows the breakdown of cycles: executed cycles, reuse overhead, register copy overhead, cache miss penalties, and register window miss penalties. ☆ The execution cycles of some benchmark programs were reduced by memoization. Parallel speculative execution works very well with the CFP benchmarks, and the proposal model reduced the reuse overhead in the CINT benchmarks. The hybrid model achieves the best performance in almost all benchmarks. ☆ Now, I show the cycles reduced by each model. Above all, the hybrid model reduced up to 36% of cycles, and 9% on average.
  21. Now, I would like to finish by making the following conclusions. We have proposed an auto-memoization processor with multithreading that can reduce the reuse overhead. The hybrid model can achieve good performance through their synergistic effect. Our future work is to change the assignment of cores to threads dynamically. In the current implementation, the cores for parallel speculative execution and the three cores for concealing overhead do not exchange their threads with each other. Therefore, a further improvement of the processor model will be required. ==========time over========== This is the conclusion of my presentation. Thank you for your attention.