Performance evaluation with Arm HPC tools for SVE, by Miwako Tsuji (RIKEN) and Yuetsu Kodama (RIKEN)
"Co-design" is a bidirectional approach in which a system is designed according to the demands of applications, and the applications are in turn optimized for the system. Performance estimation and evaluation of applications are important for co-design. In this talk, we focus on performance evaluation with Arm HPC tools for SVE.
Miwako Tsuji received master's and PhD degrees in Information Science and Technology from Hokkaido University. From 2007 to 2013, she worked at Hokkaido University, the University of Tokyo, the University of Tsukuba and Universite de Versailles Saint-Quentin-en-Yvelines. She has been a research scientist at the RIKEN Advanced Institute for Computational Science since 2013. She has been a member of the architecture development team of the Flagship 2020 project, i.e. the post-K computer project, since the project started in 2014. She is a coauthor of work that received the ACM Gordon Bell Prize in 2011.
2. Gem5 simulator
Processor simulator
Supports multiple ISAs: Alpha, SPARC, x86, ARM
CPU model
• Atomic: instruction level simulation
• O3: Out of Order pipeline simulation
• Can estimate execution cycles
Development of “gem5-sve”
The atomic mode for SVE was developed by ARM.
gem5 with SVE support (atomic and O3) will be uploaded to
the mainline by ARM soon.
RIKEN has also independently developed an O3 mode for SVE,
based on ARM's atomic model of SVE.
ARM HPC workshop 2017
http://gem5.org
3. Gem5 | CPU model
• Atomic: instruction level simulation
○ Number of dynamically executed instructions
○ Instruction mix (ratio of arithmetic vs. memory, ratio of
vectorization, etc.)
× Execution cycles
× Cache hit ratio (some instructions are divided into
micro-operations)
○ Simulation speed is several millions of insts/sec
• O3: Out of Order pipeline simulation
○ Execution cycles
○ Cache hit ratio, L1/L2/Memory bandwidth/latency
× Simulation speed is less than 1/10 of atomic mode.
4. Gem5 | O3 pipeline
Based on the Alpha 21264
7-stage pipeline: Fetch, Decode, Rename, Issue, Execute,
Write Back, Commit
Parameter file
Several parameters can be specified, as shown on the next slide.
These are based on O3_ARM_v7a.py, a preset parameter file
in gem5.
Instruction latencies for SVE were added with reference to those of NEON.
5. Gem5 | architecture parameters
Based on O3_ARM_v7a.py, a preset parameter file in
gem5.
Hardware parameters
  Clock frequency                 2.0 GHz   | # of cores          1
  L1 Dcache, Icache size          32 kB     | L2 cache size       2 MB
  # of integer pipelines          2         | Load/Store units    1/1
  # of floating-point pipelines   2         | Fetch width         3
OoO resource parameters
  IQ (Reservation Station)        64 (←32)
  ROB (Re-order Buffer)           64 (←48)
  LQ (Load Queue)                 16
  SQ (Store Queue)                16
  Physical Vector Register        96 (new)
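A parameter set like the one above is easier to compare and share when it is written down as data. The following is a minimal sketch in plain Python, not a gem5 configuration file: the key names are hypothetical (they are not gem5 parameter names), and reading the "←" values as the preset values that were overridden is an assumption.

```python
import json

# Values the slide marks with "←"; treating them as the preset values
# that were overridden is an assumption.
PRESET = {"iq_entries": 32, "rob_entries": 48}

# Settings listed on the slide. Key names are hypothetical, not gem5 names.
USED = {
    "clock_ghz": 2.0,
    "cores": 1,
    "l1_cache_kb": 32,          # "L1 Dcache, Icache size" on the slide
    "l2_cache_mb": 2,
    "int_pipelines": 2,
    "fp_pipelines": 2,
    "load_store_units": "1/1",  # kept as written on the slide
    "fetch_width": 3,
    "iq_entries": 64,
    "rob_entries": 64,
    "lq_entries": 16,
    "sq_entries": 16,
    "phys_vector_regs": 96,     # marked "(new)" on the slide
}


def changed_from_preset(used, preset):
    """Return {name: (preset_value, used_value)} for each overridden entry."""
    return {k: (preset[k], used[k]) for k in preset if used[k] != preset[k]}


if __name__ == "__main__":
    # A JSON dump is one simple form in which the set could be exchanged.
    print(json.dumps(USED, indent=2))
    print(changed_from_preset(USED, PRESET))
```

Keeping the overrides separate from the full set makes it clear which values differ from the preset file and which were simply inherited.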
6. Gem5 | statistics (atomic)
sim_seconds                            0.000620   # Number of seconds simulated
host_inst_rate                         714344     # Simulator instruction rate (inst/s)
host_seconds                           1.73       # Real time elapsed on the host
sim_insts                              1239041    # Number of instructions simulated
system.mem_ctrls.bytes_read::cpu.data  71168      # Number of bytes read from this memory
system.cpu.vector_ext_num_insts        1055097    # Number of vector instructions executed
system.cpu.vector_ext_num_mem_insts    768768     # Number of vector memory instructions executed
system.cpu.Branches                    30653      # Number of branches fetched
system.cpu.op_class::IntAlu            180739   14.56%   14.56%   # Class of executed instruction
system.cpu.op_class::MemRead           1978      0.16%   14.76%   # Class of executed instruction
system.cpu.op_class::MemWrite          2000      0.16%   14.92%   # Class of executed instruction
system.cpu.op_class::VectorExtVFp      256768   20.69%   35.71%   # Class of executed instruction
system.cpu.op_class::VectorExtMread    512000   41.26%   79.31%   # Class of executed instruction
system.cpu.op_class::VectorExtMwrite   256768   20.69%  100.00%   # Class of executed instruction
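The instruction mix mentioned for atomic mode can be derived from these counters. The sketch below is one possible way to do it: it parses text in a "name value # description" layout and computes a few ratios. The sample text is copied from the counters above; the function names and the choice of ratios are illustrative assumptions, not part of gem5.

```python
# Counters copied from the atomic-mode statistics on this slide.
STATS_TEXT = """\
sim_insts 1239041 # Number of instructions simulated
system.cpu.vector_ext_num_insts 1055097 # Number of vector instructions executed
system.cpu.vector_ext_num_mem_insts 768768 # Number of vector memory instructions executed
system.cpu.op_class::VectorExtMread 512000 41.26% 79.31% # Class of executed instruction
system.cpu.op_class::VectorExtMwrite 256768 20.69% 100.00% # Class of executed instruction
"""


def parse_stats(text):
    """Map each statistic name to its first numeric field."""
    stats = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the description
        if len(fields) >= 2:
            stats[fields[0]] = float(fields[1])
    return stats


def instruction_mix(stats):
    """Compute illustrative ratios from the parsed counters."""
    total = stats["sim_insts"]
    vec = stats["system.cpu.vector_ext_num_insts"]
    vec_mem = stats["system.cpu.vector_ext_num_mem_insts"]
    loads = stats["system.cpu.op_class::VectorExtMread"]
    stores = stats["system.cpu.op_class::VectorExtMwrite"]
    return {
        "vector_ratio": vec / total,         # share of vector instructions
        "vector_mem_ratio": vec_mem / vec,   # memory share among vector insts
        "load_store_ratio": loads / stores,  # vector loads per vector store
    }


if __name__ == "__main__":
    for name, value in instruction_mix(parse_stats(STATS_TEXT)).items():
        print(f"{name}: {value:.3f}")
```

For this run, about 85% of the executed instructions are vector instructions, and about 73% of those are vector memory accesses, with roughly two vector loads per vector store.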
7. Gem5 | statistics (o3)
sim_seconds                                   0.000319    # Number of seconds simulated
host_inst_rate                                200521      # Simulator instruction rate (inst/s)
host_seconds                                  6.18        # Real time elapsed on the host
sim_insts                                     1239028     # Number of instructions simulated
system.mem_ctrls.bw_total::total              665233479   # Total bandwidth to/from this memory (bytes/s)
system.cpu.rename.ROBFullEvents               12          # Number of times rename has blocked due to ROB full
system.cpu.rename.IQFullEvents                1           # Number of times rename has blocked due to IQ full
system.cpu.rename.LQFullEvents                5979        # Number of times rename has blocked due to LQ full
system.cpu.rename.SQFullEvents                180574      # Number of times rename has blocked due to SQ full
system.cpu.rename.FullRegisterEvents          79          # Number of times there has been no free registers
system.cpu.ipc                                1.944260    # IPC: Instructions Per Cycle
system.cpu.dcache.ReadReq_miss_rate::total    0.000204    # miss rate for ReadReq accesses
system.cpu.dcache.WriteReq_miss_rate::total   0.002887    # miss rate for WriteReq accesses
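As a sanity check, the execution cycles and IPC can be recomputed from sim_seconds and sim_insts. The sketch below assumes the simulated core ran at the 2.0 GHz clock listed on the architecture parameters slide (the statistics themselves do not state the clock); the function name is illustrative.

```python
CLOCK_HZ = 2.0e9          # assumed: 2.0 GHz from the parameters slide
SIM_SECONDS = 0.000319    # sim_seconds (o3)
SIM_INSTS = 1239028       # sim_insts (o3)
REPORTED_IPC = 1.944260   # system.cpu.ipc
BW_TOTAL = 665233479      # system.mem_ctrls.bw_total::total (bytes/s)


def derived_metrics(clock_hz, sim_seconds, sim_insts, bw_bytes_per_s):
    """Recompute cycles, IPC and memory traffic from the raw statistics."""
    cycles = sim_seconds * clock_hz
    return {
        "cycles": cycles,
        "ipc": sim_insts / cycles,
        "bytes_moved": bw_bytes_per_s * sim_seconds,
    }


if __name__ == "__main__":
    m = derived_metrics(CLOCK_HZ, SIM_SECONDS, SIM_INSTS, BW_TOTAL)
    print(f"cycles      : {m['cycles']:.0f}")
    print(f"ipc         : {m['ipc']:.4f} (reported {REPORTED_IPC})")
    print(f"bytes moved : {m['bytes_moved']:.0f}")
```

The recomputed IPC agrees with the reported value to within about 0.1%; the small gap is consistent with sim_seconds being printed with only three significant digits.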
8. Discussion
gem5 O3 can simulate a program precisely, but
it takes a long time. In the example on the previous slide, 300 us
of execution takes 6 seconds to simulate, i.e. about 20,000 times longer.
A multithreaded program can be simulated, but it takes several
times as long as a single-core simulation.
-> Simulating a whole application program is infeasible,
so we must extract kernels from the application. Tool chains
are required.
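The slowdown figure and its consequence for whole-application runs can be checked with simple arithmetic. The sketch below uses the host_seconds, sim_seconds and host_inst_rate values from the two statistics slides; the application size of 10^12 instructions is a hypothetical number chosen only for illustration.

```python
ATOMIC_RATE = 714344      # host_inst_rate, atomic mode (inst/s)
O3_RATE = 200521          # host_inst_rate, O3 mode (inst/s)
O3_HOST_SECONDS = 6.18    # host_seconds, O3 mode
O3_SIM_SECONDS = 0.000319 # sim_seconds, O3 mode

SECONDS_PER_DAY = 86400


def slowdown(host_seconds, sim_seconds):
    """How many times longer the simulation runs than the simulated time."""
    return host_seconds / sim_seconds


def host_days(n_insts, host_inst_rate):
    """Estimated wall-clock days to simulate n_insts at a given rate."""
    return n_insts / host_inst_rate / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"O3 slowdown: {slowdown(O3_HOST_SECONDS, O3_SIM_SECONDS):.0f}x")
    app_insts = 1e12  # hypothetical whole-application instruction count
    print(f"atomic: {host_days(app_insts, ATOMIC_RATE):.1f} days")
    print(f"O3    : {host_days(app_insts, O3_RATE):.1f} days")
```

With these inputs the O3 slowdown comes out a little over 19,000, in line with the rounded 20,000 quoted above, and the hypothetical 10^12-instruction run would need roughly 58 days in O3 mode, which illustrates why kernels have to be extracted.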
gem5 O3 allows pipeline parameters to be set flexibly.
In other words, we must specify such parameters for the target
processor, but processor vendors will not disclose the
parameters.
Which parameters are used is a big problem, especially if we
compare performance with others.
-> We want base parameters for HPC performance
comparison that anyone can share.