By Masaki Arai, Fujitsu Laboratories Ltd.
For numerical calculation programs on supercomputers, the kernel part occupies 80% or more of the execution time in many cases. Therefore, the quality of the code generated by the compiler for these kernel parts is significant. We created a tool, which is called HCQC, to aid in the investigation of the quality of the code generated by the compiler for the kernel part. In this presentation, we report the details of HCQC and the results of evaluating the quality of GCC and LLVM when compiling the kernel part of benchmark programs using HCQC.
Masaki Arai Bio
In 1992, He joined Fujitsu Laboratories Ltd. His research interests are in the area of compiler optimizations and computer architectures. He joined Linaro as member engineer in 2017.
Email
itaru.kitayama@riken.jp
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
2. Background and Purpose
• The quality of the kernel part is important in HPC
applications(number crunching on supercomputers).
• Make it easy to check the quality of compiler
optimizations and acquire data to improve them
HCQC:HPC compiler quality checker
2
4. Metrics for Quality Evaluation
• HCQC has the following metrics:
op : # of mnemonics in an assembly code
kind : The kind of mnemonics in an assembly code(memory, branch,
other)
regalloc : The quality of register allocation(# of spill in/out instructions)
height : The height of instruction dependence graph
ilp :Instruction level parallelism by instruction scheduler
vectorize : Vectrization/SIMDization situation
swpl : # of initiation interval by software pipelining
4
These data are basically static data at compile time.
6. Quality Evaluation by Comparison
• One investigation result has little meaning.
• Typical comparison examples:
GCC vs. LLVM(on AArch64)
LLVM 4.0.0 vs. LLVM 5.0.0(on AArch64)
LLVM with –O2 vs. LLVM with –O3(on AArch64)
LLVM on AArch64 vs. LLVM on x86_64
Missing optimizations on AArch64
LLVM on AArch64 vs. ICC on x86_84
Optimization hints for SVE from AVX codes
6
10. Workflow of HCQC
① Compile one test program
② Run the executable file and verify its result by
comparing output and answer data
③ Generate the assembly code file
④ Make the control flow graph of the kernel part from
the assembly code
⑤ Get result data using metric programs
⑥ Make the report file from data
10
% hcqc config test metric+
% hcqc-report config test metric+
11. Workflow of HCQC
11
COMMAND:/usr/bin/clang
OPT_FLAGS:-O2
Configuration FileTest Program
Executable File
out.data resut.data
in.data Assembly Code File
cfg.py
CFG+DEPTH
hcqc-report
diff check DISTRIBUTION : OpenSUSE Tumbleweed
ARCH : aarch64
CPU : AMD Opteron A1100 Cortex A57
LANGUAGE : C
COMPILER : ClangLLVM
COMMAND : /usr/bin/clang
VERSION :4.0.1
OPT_FLAGS : -O2
TEST_PROGRAM: sample
KERNEL=FUNCTION=NAME : kernel
DATE: 2017/11/07
ilp swpl
mem branch other spill in spill out IPC II kind mem arith other
BB0 cond LBB0_11 0 3 1 3 0 0 0.5 0 0 0
BB1 0 0 0 4 0 0 0.5 0 0 0
LBB0_2 1 3 0 5 0 1 0.7 0 0 0
LBB0_3 cond LBB0_5 2 2 1 1 2 0 0.9 SLP 0 1 0
BB4 LBB0_7 LBB0_9 2 1 2 5 0 0 0.9 SLP 0 1 1
LBB0_5 cond LBB0_9 2 3 1 5 0 0 1.3 SLP 0 1 1
LBB0_7 2 3 0 5 0 3 1.7 SLP 0 1 2
LBB0_8 cond LBB0_8 3 7 1 5 0 0 2.5 5 LOOP,SLP 2 2 2
LBB0_9 cond LBB0_3 2 2 1 3 0 2 1.3 SLP 0 1 1
BB10 cond LBB0_2 1 1 1 2 0 0 0.8 0 0 0
LBB0_11 0 0 0 1 0 0 0.2 0 0 0
*SUMMARY* 25 8 39 2 6 2 7 7
vectorizekind regalloc
CFG DEPTH
Report file(csv)
Result Data
Result Data
Result Data
Metric Program
Metric Program
Metric Program
Test Program
Info File
①
②
③
④
⑤
⑥
JSON format file
Debug Information?
12. Test Programs for HCQC
• Generate from programs that were problematic in Fujitsu's
production compilers in the past
• Extract kernel parts and modify them to use under HCQC
– Extract hot spots
– If it is Fortran program, then convert them to C
language(for comparison between GCC and Clang/LLVM)
– Prepare the data to run and check those kernel parts
12
13. Test Programs for HCQC
• All original benchmarks are
publically available.
• I/O data for HCQC is being prepared.
– Some data file sizes are very large.
– For different architectures or different
optimization levels, error tolerance is required.
13
benchark name kernel name
HimenoBMT-dynamic jacobi
HimenoBMT-static jacobi
hpcg-3.0 ComputeSYMGS_ref
ccs-qcd bicgstab_hmc
ccs-qcd clover
ffb-mini CALAX3
ffb-mini FLD3X2
ffb-mini GRAD3X
ffvc-mini poi_residual
ffvc-mini psor2sma_core
mVMC-mini calculateNewPfMTwo_child
mVMC-mini updateMAllTwo_child
mVMC-mini updateMAll_child
ngsa-mini bwt_match_exact_alt
ngsa-mini bwt_match_gap
nicam-dc-mini vi_path2
14. Future Work
• Add supports for SVE(if available in GCC or LLVM)
• Implement metric programs:
– vectorization(vectorize)
– software pipelining(swpl)
– instruction level parallelism(ilp)
• Add features for comparing with x86_64(SVE vs. AVX)
• Add tools for automatic and intelligent comparison
14