Performance evaluation with Arm HPC tools for SVE

Gem5 simulator
RIKEN AICS
(Advanced Institute for Computational Science)
2017/12/13
Y. Kodama
ARM HPC workshop 2017

Gem5 simulator
 Processor simulator
 Supports multiple ISAs: Alpha, SPARC, x86, ARM
 CPU model
• Atomic: instruction level simulation
• O3: Out of Order pipeline simulation
• Can estimate execution cycles
 Development of “gem5-sve”
 Atomic mode for SVE was developed by ARM.
 Gem5 with SVE support (atomic and o3) will be uploaded to the
mainline by ARM soon.
 RIKEN also developed its own o3 mode for SVE, based on
ARM's atomic model of SVE.
http://gem5.org

Gem5 | CPU model
• Atomic: instruction level simulation
○ Number of dynamic executed instructions
○ Instruction MIX (ratio of arithmetic vs memory, ratio of
vectorization, etc.)
× Execution cycles
× Cache hit ratio (some instructions are divided into
micro-operations)
○ Simulation speed is several million insts/sec
• O3: Out of Order pipeline simulation
○ Execution cycles
○ Cache hit ratio, L1/L2/Memory bandwidth/latency
× Simulation speed is less than 1/10 of atomic mode.

Gem5 | O3 pipeline
 Based on the Alpha 21264
 7-stage pipeline: Fetch, Decode, Rename, Issue, Execute,
Write Back, Commit
 Parameter file
 Several parameters can be specified, as shown on the next slide.
 These are based on O3_ARM_v7a.py, a preset parameter file
in gem5.
 Instruction latencies for SVE were added with reference to NEON.

Gem5 | architecture parameters
 Based on O3_ARM_v7a.py, a preset parameter file in
gem5.
Hardware parameters
  Clock frequency                 2.0 GHz    # of cores         1
  L1 Dcache, Icache size          32 kB      L2 cache size      2 MB
  # of integer pipelines          2          Load/Store units   1/1
  # of floating-point pipelines   2          Fetch width        3
OoO resource parameters
  IQ (Reservation Station)        64 (←32)
  ROB (Re-order Buffer)           64 (←48)
  LQ (Load Queue)                 16
  SQ (Store Queue)                16
  Physical Vector Register        96 (new)
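
The parameters above can be read as a small set of overrides on top of the preset. The sketch below is plain Python, not a gem5 configuration script: the dictionary keys are hypothetical labels chosen for this example rather than verified gem5 parameter names, the preset values are only the ones the slide gives after the arrows, and "1/1" is assumed to mean one load unit and one store unit.

```python
# Preset values as stated on the slide (the numbers after the "←" arrows).
# Key names are hypothetical labels for this sketch, not verified gem5 parameters.
PRESET = {"iq_entries": 32, "rob_entries": 48}

# Values used for the SVE evaluation, copied from the parameter slide.
SVE_CONFIG = {
    "clock_ghz": 2.0,
    "num_cores": 1,
    "l1_cache_kb": 32,
    "l2_cache_mb": 2,
    "int_pipelines": 2,
    "fp_pipelines": 2,
    "load_units": 1,      # assumption: "1/1" means 1 load unit / 1 store unit
    "store_units": 1,
    "fetch_width": 3,
    "iq_entries": 64,
    "rob_entries": 64,
    "lq_entries": 16,
    "sq_entries": 16,
    "phys_vec_regs": 96,  # marked "new" on the slide, so it has no preset value
}


def changed_from_preset(config, preset):
    """Return {name: (preset_value, new_value)} for every overridden entry."""
    return {
        name: (preset[name], value)
        for name, value in config.items()
        if name in preset and preset[name] != value
    }


overrides = changed_from_preset(SVE_CONFIG, PRESET)
for name, (old, new) in overrides.items():
    print(f"{name}: {old} -> {new}")
```

Keeping the overrides separate from the preset makes it easy to report exactly which values differ when results are compared with other groups.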

Gem5 | statistics (atomic)
sim_seconds                            0.000620   # Number of seconds simulated
host_inst_rate                         714344     # Simulator instruction rate (inst/s)
host_seconds                           1.73       # Real time elapsed on the host
sim_insts                              1239041    # Number of instructions simulated
system.mem_ctrls.bytes_read::cpu.data  71168      # Number of bytes read from this memory
system.cpu.vector_ext_num_insts        1055097    # Number of vector instructions executed
system.cpu.vector_ext_num_mem_insts    768768     # Number of vector memory instructions executed
system.cpu.Branches                    30653      # Number of branches fetched
system.cpu.op_class::IntAlu            180739   14.56%   14.56%   # Class of executed instruction
system.cpu.op_class::MemRead           1978      0.16%   14.76%
system.cpu.op_class::MemWrite          2000      0.16%   14.92%
system.cpu.op_class::VectorExtVFp      256768   20.69%   35.71%
system.cpu.op_class::VectorExtMread    512000   41.26%   79.31%
system.cpu.op_class::VectorExtMwrite   256768   20.69%  100.00%
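
The instruction mix that atomic mode is good for can be derived directly from these counters. The following is a minimal sketch, assuming the statistics are available as plain text with one "name value" pair per line and optional "#" comments; the parser and variable names are illustrative and not part of gem5.

```python
# Excerpt of the atomic-mode statistics (values copied from the slide).
STATS_TEXT = """
sim_insts 1239041  # Number of instructions simulated
system.cpu.vector_ext_num_insts 1055097
system.cpu.vector_ext_num_mem_insts 768768
system.cpu.Branches 30653
"""


def parse_stats(text):
    """Map each statistic name to its first numeric value, ignoring '#' comments."""
    stats = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            try:
                stats[fields[0]] = float(fields[1])
            except ValueError:
                pass  # skip lines whose second field is not a number
    return stats


stats = parse_stats(STATS_TEXT)

# Ratio of vectorization: vector instructions over all simulated instructions.
vector_ratio = stats["system.cpu.vector_ext_num_insts"] / stats["sim_insts"]

# Share of vector instructions that access memory.
vector_mem_ratio = (stats["system.cpu.vector_ext_num_mem_insts"]
                    / stats["system.cpu.vector_ext_num_insts"])

print(f"vector / all instructions:           {vector_ratio:.1%}")
print(f"vector memory / vector instructions: {vector_mem_ratio:.1%}")
```

For this run, roughly 85% of the instructions are vector instructions, and about 73% of those are vector memory accesses.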

Gem5 | statistics (o3)
sim_seconds                                 0.000319    # Number of seconds simulated
host_inst_rate                              200521      # Simulator instruction rate (inst/s)
host_seconds                                6.18        # Real time elapsed on the host
sim_insts                                   1239028     # Number of instructions simulated
system.mem_ctrls.bw_total::total            665233479   # Total bandwidth to/from this memory (bytes/s)
system.cpu.rename.ROBFullEvents             12          # Number of times rename has blocked due to ROB full
system.cpu.rename.IQFullEvents              1           # Number of times rename has blocked due to IQ full
system.cpu.rename.LQFullEvents              5979        # Number of times rename has blocked due to LQ full
system.cpu.rename.SQFullEvents              180574      # Number of times rename has blocked due to SQ full
system.cpu.rename.FullRegisterEvents        79          # Number of times there has been no free registers
system.cpu.ipc                              1.944260    # IPC: Instructions Per Cycle
system.cpu.dcache.ReadReq_miss_rate::total  0.000204    # miss rate for ReadReq accesses
system.cpu.dcache.WriteReq_miss_rate::total 0.002887    # miss rate for WriteReq accesses
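
As a rough cross-check of these o3 numbers, the cycle count can be recovered from sim_seconds and the 2.0 GHz clock given on the architecture parameters slide, and the rename counters can be ranked to see which resource blocks most often. This is a sketch with the values typed in by hand; the variable names are illustrative.

```python
CLOCK_HZ = 2.0e9         # clock frequency from the architecture parameters slide
SIM_SECONDS = 0.000319   # sim_seconds
SIM_INSTS = 1239028      # sim_insts
REPORTED_IPC = 1.944260  # system.cpu.ipc

# Rename-stage blocking counters from the o3 statistics.
RENAME_BLOCKS = {
    "ROBFullEvents": 12,
    "IQFullEvents": 1,
    "LQFullEvents": 5979,
    "SQFullEvents": 180574,
    "FullRegisterEvents": 79,
}

# Execution cycles = simulated time x clock frequency.
cycles = SIM_SECONDS * CLOCK_HZ

# IPC derived from the rounded sim_seconds; expected to be close to,
# but not exactly equal to, the reported value.
derived_ipc = SIM_INSTS / cycles

# Resource that blocked rename most often.
worst = max(RENAME_BLOCKS, key=RENAME_BLOCKS.get)

print(f"cycles      : {cycles:.0f}")
print(f"derived IPC : {derived_ipc:.3f} (reported {REPORTED_IPC})")
print(f"most frequent rename block: {worst} ({RENAME_BLOCKS[worst]} events)")
```

The derived IPC agrees with the reported one to within the rounding of sim_seconds, and the store queue (16 entries in this configuration) is by far the most frequent cause of rename blocking in this run.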

Discussion
 Gem5 o3 can simulate a program precisely, but
 it takes a long time. In the example on the previous slide, about
300 us of execution takes 6 seconds to simulate, i.e. about 20,000 times longer.
 A multithreaded program can be simulated, but it takes several
times as long as a single-core simulation.
 -> Simulating a whole application program is not feasible,
so we must extract kernels from the application. Tool chains
are required.
 Gem5 o3 can flexibly set pipeline parameters, but
 in other words, we must specify such parameters for the target
processor, and processor vendors will not disclose the
parameters.
 Which parameters we use is a big problem, especially when we
compare performance with others.
 -> We want base parameters for HPC performance
comparison that anyone can share.
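
The slowdown quoted in the first point follows from the o3 statistics (host_seconds divided by sim_seconds). A minimal sketch of that arithmetic is shown here; the one-second target run is a hypothetical figure used only to illustrate why whole applications are out of reach, and it assumes the slowdown factor stays constant.

```python
def slowdown(host_seconds, sim_seconds):
    """How many times longer the simulation runs than the simulated program."""
    return host_seconds / sim_seconds


def host_time_for(target_seconds, factor):
    """Estimated host time (seconds) to simulate target_seconds of execution."""
    return target_seconds * factor


# o3 run from the statistics slide: host_seconds 6.18, sim_seconds 0.000319.
o3_factor = slowdown(6.18, 0.000319)

# Hypothetical example: an application that runs for 1 second on the target.
one_second_run = host_time_for(1.0, o3_factor)

print(f"o3 slowdown: about {o3_factor:,.0f}x")
print(f"1 s of target time: about {one_second_run / 3600:.1f} hours on the host")
```

Under these assumptions, even one second of single-core target execution costs several hours of host time, which is the motivation for extracting kernels.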
