1. Revisiting Power5/Power5+ Performance Differences
Mike Page
ScicomP 14
Poughkeepsie, New York
May 23, 2008
NCAR/CISL/HSS/CSG
Consulting Services Group
mpage@ucar.edu
4. A Graphical Look at the Data
[Chart: Blueice/Bluevista performance. Ratio of ICESS benchmark times (Blueice/Bluevista) vs. processor count (0-256) for cam_waccm, POP, hd3D, and WRF. The y-axis spans 0.800-1.400; ratios above 1.000 mean Bluevista is faster, ratios below 1.000 mean Blueice is faster.]
5. Huang/Ghosh
• Analysis of POP performance
• Varied run configuration (ptile) for POP
• Conclusions
  • POP performance on Blueice improves by 13% when nodes are undersubscribed
  • Undersubscription uses only 8 of the 16 processors on a Blueice node
  • Undersubscription avoids sharing L3 cache
  • POP performance on Blueice exceeds that on Bluevista if the Blueice nodes are undersubscribed
  • The POP performance difference between Blueice and fully subscribed Bluevista is mainly due to L2 cache misses
6. But here’s what caught my interest!
Model Performance: Bluevista vs. Blueice
[Slide repeats the Blueice/Bluevista ratio chart from slide 4, annotated: hd3D shows the largest performance variation of all the apps in the ICESS suite.]

MODEL      PROCS      BV        BL     FAST   100*BL/BV   DIFF %
cam_waccm    256     5.61      5.96    BL       106.23     -6.23
cam_waccm    128     3.34      3.12    BV        93.41      6.58
cam_waccm     64     2.13      2.13    SAME     100.00      0.00
cam_waccm     32     1.13      1.15    BL       101.76     -1.76
POP          128    22.29     21.67    BL        97.21     -2.78
POP           64    34.96     36.24    BV       103.66      3.66
POP           48    42.20     43.76    BV       103.69      3.69
POP           32    56.80     65.34    BV       115.03     15.03
POP           24    71.83     79.96    BV       111.31     11.31
POP           16   103.96    112.75    BV       108.45      8.45
POP            8   197.11    231.32    BV       117.35     17.35
hd3D         128   0.1857    0.2188    BV       117.82     17.82
hd3D          64   0.3046    0.3961    BV       130.03     30.03
hd3D          32   0.5261    0.5655    BV       107.48      7.48
hd3D          16   0.9408    0.9459    BV       100.54      0.54
hd3D           8   1.7917    1.7028    BL        95.03     -4.96
WRF          256   0.1580    0.1466    BL        92.78     -7.21
WRF          128   0.2316    0.2390    BV       103.19      3.19
WRF           64   0.3849    0.3934    BV       102.20      2.20
WRF           32   0.6736    0.7274    BV       107.98      7.98
WRF           16   1.3584    1.4070    BV       103.57      3.57
WRF            8   2.6308    2.5783    BL        98.00     -1.99
WRF            4   4.9209    4.8376    BL        98.30     -1.69
WRF            2   9.8538    9.5474    BL        96.89     -3.10
WRF            1  19.8319   17.6361    BL        88.92    -11.07
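
The derived columns follow directly from the raw BV/BL numbers. A minimal Python sketch (values copied from a few rows above; note that the cam_waccm rows invert the sign of DIFF %, which is consistent with cam_waccm's BV/BL being a higher-is-better metric rather than a time; that reading is an assumption, not stated in the talk):

rows = [
    # model, procs, bv, bl, higher_is_better (assumed True only for cam_waccm)
    ("cam_waccm", 256, 5.61, 5.96, True),
    ("POP", 128, 22.29, 21.67, False),
    ("hd3D", 64, 0.3046, 0.3961, False),
    ("WRF", 256, 0.158, 0.1466, False),
]

for model, procs, bv, bl, higher_is_better in rows:
    ratio = 100.0 * bl / bv                                          # 100*BL/BV column
    diff = (100.0 - ratio) if higher_is_better else (ratio - 100.0)  # DIFF % column
    fast = "SAME" if bv == bl else ("BL" if (bl > bv) == higher_is_better else "BV")
    print(f"{model:10s} {procs:4d} {ratio:7.2f} {diff:7.2f}  {fast}")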
7. hd3D needed to be studied too
With the new Power5+ system and an AIX upgrade, there were many new factors that could affect performance:
• SMT
• Varied page sizes
• Processor binding
8. hd3D Details
hd3D is a pseudospectral three-dimensional periodic hydrodynamic/magnetohydrodynamic/Hall-MHD turbulence model.
The results presented here are derived from a numerical solution of the incompressible Navier-Stokes equations in three dimensions with periodic boundary conditions on a 256 x 256 x 256 grid.
hd3D uses a pseudospectral method to compute spatial derivatives, while an adjustable-order Runge-Kutta method is used to evolve the system in time.
This benchmark runs a free-decay simulation of Taylor-Green vortices.
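
To make the method concrete, here is a minimal Python/NumPy sketch of the core pseudospectral operation (this is not the hd3D source, which the talk does not show; the grid is reduced to 32^3 and only a single x-derivative of a Taylor-Green field is computed):

import numpy as np

n = 32                                   # the benchmark itself uses 256^3
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")

# Taylor-Green initial condition for the x-velocity component
u = np.sin(X) * np.cos(Y) * np.cos(Z)

# Pseudospectral derivative: transform, multiply by i*k_x, transform back
kx = np.fft.fftfreq(n, d=1.0 / n)        # integer wavenumbers for a 2*pi box
u_hat = np.fft.fftn(u)
dudx = np.real(np.fft.ifftn(1j * kx[:, None, None] * u_hat))

# Agrees with the analytic derivative cos(x)cos(y)cos(z) to round-off
print(np.max(np.abs(dudx - np.cos(X) * np.cos(Y) * np.cos(Z))))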
9. A Closer Look at hd3D - Bluevista runs
[Chart: Bluevista, single core, private L3. ICESS benchmark time (sec) vs. processor count (32-128) for non-SMT, SMT, and Huang/Ghosh configurations; SMT degrades performance.]
11. Huang/Ghosh Run Configurations
Model Performance: Bluevista vs. Blueice

MODEL      PROCS      BV        BL    SMT?            FAST   DIFF %
cam_waccm    128     3.34      3.12   2TPP (16 OMP)   BV       6.58
cam_waccm     64     2.13      2.13   2TPP (16 OMP)   SAME     0.00
cam_waccm     32     1.13      1.15   2TPP (16 OMP)   BL      -1.76
POP          128    22.29     21.67   1TPP            BL      -2.78
POP           64    34.96     36.24   2TPP            BV       3.66
POP           48    42.20     43.76   2TPP            BV       3.69
POP           32    56.80     65.34   2TPP            BV      15.03
hd3D         128   0.1857    0.2188   1TPP            BV      17.82
hd3D          64   0.3046    0.3961   1TPP            BV      30.03
hd3D          32   0.5261    0.5655   1TPP            BV       7.48
WRF          128   0.2316    0.2390   2TPP            BV       3.19
WRF           64   0.3849    0.3934   2TPP            BV       2.20
WRF           32   0.6736    0.7274   2TPP            BV       7.98

Give up on SMT for hd3D; look at shared-L3 effects instead.
12. A Closer Look at hd3D - Blueice runs
[Chart: Blueice hd3D, no SMT. ICESS benchmark time (sec) vs. processor count (32-128) for private-L3, shared-L3, and Huang/Ghosh configurations; private L3 improves performance.]
This supports the conclusion that undersubscribing Blueice nodes, which makes the L3 cache private, improves Blueice performance.
13. A Closer Look at hd3D
[Chart: Bluevista (no SMT) vs. Blueice (no SMT, private L3). ICESS benchmark time (sec) vs. processor count (32-128).]
• Undersubscribing Blueice improves hd3D performance. It is close to, but still slower than, Bluevista.
• For hd3D, Bluevista still outperforms Blueice by ~6% at 32 and 64 processors and by ~2% at 128.
• While Blueice POP is 13% faster than Bluevista POP on 16 logical(?) CPUs, hd3D shows the opposite behavior.
14. A Closer Look at hd3D - memory
[Chart: hd3D memory footprint. Memory required per task (MB) vs. processor count (32-128); roughly 20-70 MB per task.]
15. A Closer Look at hd3D
Two issues to investigate:
- 1TPP/2TPP differences
[Repeats the slide 9 chart: Bluevista, single core, private L3; SMT degrades performance.]
- Blueice/Bluevista 1TPP differences
[Repeats the slide 13 chart: Bluevista (no SMT) vs. Blueice (no SMT, private L3).]
16. A Closer Look at hd3D
CPI Breakdown Analysis
• Uses multiple hardware performance counters on the processor to:
  • Track the processor cycles required to complete a given workload (hd3D computational kernel, instrumented via the hpmcount API)
  • Track events in the processor core
  • Track events in the memory subsystem
• 17 counters are required for the Power5/Power5+ CPI breakdown:
  PM_IOPS_CMPL             PM_CMPLU_STALL_LSU
  PM_INST_CMPL             PM_CMPLU_STALL_REJECT
  PM_RUN_CYC               PM_CMPLU_STALL_DCACHE_MISS
  PM_GRP_CMPL              PM_CMPLU_STALL_ERAT_MISS
  PM_GCT_NOSLOT_CYC        PM_CMPLU_STALL_FXU
  PM_GCT_NOSLOT_IC_MISS    PM_CMPLU_STALL_DIV
  PM_GCT_NOSLOT_SRQ_FULL   PM_CMPLU_STALL_FDIV
  PM_GCT_NOSLOT_BR_MPRED   PM_CMPLU_STALL_FPU
  PM_1PLUS_PPC_CMPL
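
Once the 17 counter totals have been collected, the breakdown itself is arithmetic: overall CPI, plus each stall or no-slot class expressed as a fraction of run cycles. A minimal Python sketch with placeholder counter values (not measurements from the ICESS runs):

counters = {
    "PM_RUN_CYC": 1.0e9,          # processor run cycles (placeholder value)
    "PM_INST_CMPL": 6.0e8,        # completed instructions (placeholder value)
    "PM_GCT_NOSLOT_CYC": 1.5e8,   # cycles with an empty global completion table
    "PM_CMPLU_STALL_LSU": 2.0e8,  # completion stalls charged to the LSU
    "PM_CMPLU_STALL_FPU": 1.0e8,  # completion stalls charged to the FPU
}

cpi = counters["PM_RUN_CYC"] / counters["PM_INST_CMPL"]
print(f"CPI = {cpi:.3f}")

for name, value in counters.items():
    if name.startswith(("PM_CMPLU_STALL", "PM_GCT_NOSLOT")):
        pct = 100.0 * value / counters["PM_RUN_CYC"]
        print(f"{name:24s} {pct:6.2f}% of run cycles")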
21. A Closer Look at hd3D
Blueice/Bluevista 1TPP differences
[Radar chart: Ratio of PM Counters. Blueice/Bluevista ratios of the 17 PM counters for the 1TPP hd3D runs, plotted on a 0.000-6.000 scale.]

pmlist -p POWER5 -d -c 4,8
PM_CMPLU_STALL_ERAT_MISS,Completion stall caused by ERAT miss
Following a completion stall (any period when no groups completed), the last instruction to finish before completion resumes suffered an ERAT miss. This is a subset of PM_CMPLU_STALL_REJECT.
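
Producing the radar chart's input is the same kind of bookkeeping: one counter total per machine from the hpmcount output, then the ratio. A minimal Python sketch (placeholder values, not the measured hd3D counters):

bluevista = {"PM_RUN_CYC": 1.00e9, "PM_CMPLU_STALL_ERAT_MISS": 2.0e6}
blueice   = {"PM_RUN_CYC": 1.05e9, "PM_CMPLU_STALL_ERAT_MISS": 9.0e6}

for name in sorted(bluevista):
    ratio = blueice[name] / bluevista[name]   # Blueice/Bluevista, 1.0 = parity
    flag = "  <-- investigate" if ratio > 2.0 else ""
    print(f"{name:28s} {ratio:6.3f}{flag}")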
23. Conclusions
> Nothing new <
Sharing cache can degrade performance.
But lots of questions remain:
• Gathering and processing the data from performance counters was extremely tedious.
  • Is there an easier way?
  • Does difficulty increase exponentially with the level of detail?
• Are the Power5/5+ performance counters accurate?
  • Some say not
    • Eyerman et al. (ASPLOS, 2004)
• What do the counters mean?
  • Are there expanded references besides pmlist?
  • What is an ERAT miss?
  • What does it say about code performance?
• Will ACTC tools give more info than what’s available via pmlist?