1. Revisiting Power5/Power5+ Performance Differences
Mike Page
ScicomP 14
Poughkeepsie, New York
May 23, 2008
NCAR/CISL/HSS/CSG
Consulting Services Group
mpage@ucar.edu
4. A Graphical Look at the Data
[Chart: Blueice/Bluevista performance. Ratio of ICESS benchmark times (Blueice/Bluevista) vs. processor count (0-256) for cam_waccm, POP, hd3D, and WRF. The y-axis spans 0.800-1.400; ratios above 1.000 mean Bluevista is faster, ratios below 1.000 mean Blueice is faster.]
5. Huang/Ghosh
• Analysis of POP performance
• Varied run configuration (ptile) for POP
• Conclusions
  • POP performance on Blueice improves by 13% when nodes are undersubscribed
  • Undersubscription uses only 8 of the 16 processors on a Blueice node
  • Undersubscription avoids sharing L3 cache
  • POP performance on Blueice exceeds that on Bluevista if the Blueice nodes are undersubscribed
  • The POP performance difference between Blueice and fully subscribed Bluevista is mainly due to L2 cache misses
6. But here’s what caught my interest!
Model Performance: Bluevista vs. Blueice
[Slide repeats the Blueice/Bluevista ratio chart from slide 4, annotated: hd3D shows the largest performance variation of all the apps in the ICESS suite.]

MODEL      PROCS      BV        BL     FAST   100*BL/BV   DIFF %
cam_waccm    256     5.61      5.96    BL       106.23     -6.23
cam_waccm    128     3.34      3.12    BV        93.41      6.58
cam_waccm     64     2.13      2.13    SAME     100.00      0.00
cam_waccm     32     1.13      1.15    BL       101.76     -1.76
POP          128    22.29     21.67    BL        97.21     -2.78
POP           64    34.96     36.24    BV       103.66      3.66
POP           48    42.20     43.76    BV       103.69      3.69
POP           32    56.80     65.34    BV       115.03     15.03
POP           24    71.83     79.96    BV       111.31     11.31
POP           16   103.96    112.75    BV       108.45      8.45
POP            8   197.11    231.32    BV       117.35     17.35
hd3D         128   0.1857    0.2188    BV       117.82     17.82
hd3D          64   0.3046    0.3961    BV       130.03     30.03
hd3D          32   0.5261    0.5655    BV       107.48      7.48
hd3D          16   0.9408    0.9459    BV       100.54      0.54
hd3D           8   1.7917    1.7028    BL        95.03     -4.96
WRF          256   0.1580    0.1466    BL        92.78     -7.21
WRF          128   0.2316    0.2390    BV       103.19      3.19
WRF           64   0.3849    0.3934    BV       102.20      2.20
WRF           32   0.6736    0.7274    BV       107.98      7.98
WRF           16   1.3584    1.4070    BV       103.57      3.57
WRF            8   2.6308    2.5783    BL        98.00     -1.99
WRF            4   4.9209    4.8376    BL        98.30     -1.69
WRF            2   9.8538    9.5474    BL        96.89     -3.10
WRF            1  19.8319   17.6361    BL        88.92    -11.07
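
The derived columns follow directly from the raw BV/BL numbers. A minimal Python sketch (values copied from a few rows above; note that the cam_waccm rows invert the sign of DIFF %, which is consistent with cam_waccm's BV/BL being a higher-is-better metric rather than a time; that reading is an assumption, not stated in the talk):

rows = [
    # model, procs, bv, bl, higher_is_better (assumed True only for cam_waccm)
    ("cam_waccm", 256, 5.61, 5.96, True),
    ("POP", 128, 22.29, 21.67, False),
    ("hd3D", 64, 0.3046, 0.3961, False),
    ("WRF", 256, 0.158, 0.1466, False),
]

for model, procs, bv, bl, higher_is_better in rows:
    ratio = 100.0 * bl / bv                                          # 100*BL/BV column
    diff = (100.0 - ratio) if higher_is_better else (ratio - 100.0)  # DIFF % column
    fast = "SAME" if bv == bl else ("BL" if (bl > bv) == higher_is_better else "BV")
    print(f"{model:10s} {procs:4d} {ratio:7.2f} {diff:7.2f}  {fast}")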
7. hd3D needed to be studied too
With the new Power5+ system and an AIX upgrade, there were many new factors that could affect performance:
• SMT
• Varied page sizes
• Processor binding
8. hd3D Details
hd3D is a pseudospectral three-dimensional periodic hydrodynamic/magnetohydrodynamic/Hall-MHD turbulence model.
The results presented here are derived from a numerical solution of the incompressible Navier-Stokes equations in three dimensions with periodic boundary conditions on a 256 x 256 x 256 grid.
hd3D uses a pseudospectral method to compute spatial derivatives, while an adjustable-order Runge-Kutta method is used to evolve the system in time.
This benchmark runs a free-decay simulation of Taylor-Green vortices.
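
To make the method concrete, here is a minimal Python/NumPy sketch of the core pseudospectral operation (this is not the hd3D source, which the talk does not show; the grid is reduced to 32^3 and only a single x-derivative of a Taylor-Green field is computed):

import numpy as np

n = 32                                   # the benchmark itself uses 256^3
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")

# Taylor-Green initial condition for the x-velocity component
u = np.sin(X) * np.cos(Y) * np.cos(Z)

# Pseudospectral derivative: transform, multiply by i*k_x, transform back
kx = np.fft.fftfreq(n, d=1.0 / n)        # integer wavenumbers for a 2*pi box
u_hat = np.fft.fftn(u)
dudx = np.real(np.fft.ifftn(1j * kx[:, None, None] * u_hat))

# Agrees with the analytic derivative cos(x)cos(y)cos(z) to round-off
print(np.max(np.abs(dudx - np.cos(X) * np.cos(Y) * np.cos(Z))))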
9. A Closer Look at hd3D - Bluevista runs
[Chart: Bluevista, single core, private L3. ICESS benchmark time (sec) vs. processor count (32-128) for non-SMT, SMT, and Huang/Ghosh configurations; SMT degrades performance.]
11. Huang/Ghosh Run Configurations
Model Performance: Bluevista vs. Blueice

MODEL      PROCS      BV        BL    SMT?            FAST   DIFF %
cam_waccm    128     3.34      3.12   2TPP (16 OMP)   BV       6.58
cam_waccm     64     2.13      2.13   2TPP (16 OMP)   SAME     0.00
cam_waccm     32     1.13      1.15   2TPP (16 OMP)   BL      -1.76
POP          128    22.29     21.67   1TPP            BL      -2.78
POP           64    34.96     36.24   2TPP            BV       3.66
POP           48    42.20     43.76   2TPP            BV       3.69
POP           32    56.80     65.34   2TPP            BV      15.03
hd3D         128   0.1857    0.2188   1TPP            BV      17.82
hd3D          64   0.3046    0.3961   1TPP            BV      30.03
hd3D          32   0.5261    0.5655   1TPP            BV       7.48
WRF          128   0.2316    0.2390   2TPP            BV       3.19
WRF           64   0.3849    0.3934   2TPP            BV       2.20
WRF           32   0.6736    0.7274   2TPP            BV       7.98

Give up on SMT for hd3D; look at shared-L3 effects instead.
12. A Closer Look at hd3D - Blueice runs
[Chart: Blueice hd3D, no SMT. ICESS benchmark time (sec) vs. processor count (32-128) for private-L3, shared-L3, and Huang/Ghosh configurations; private L3 improves performance.]
This supports the conclusion that undersubscribing Blueice nodes, which makes the L3 cache private, improves Blueice performance.
13. A Closer Look at hd3D
[Chart: Bluevista (no SMT) vs. Blueice (no SMT, private L3). ICESS benchmark time (sec) vs. processor count (32-128).]
• Undersubscribing Blueice improves hd3D performance. It is close to, but still slower than, Bluevista.
• For hd3D, Bluevista still outperforms Blueice by ~6% at 32 and 64 processors and by ~2% at 128.
• While Blueice POP is 13% faster than Bluevista POP on 16 logical(?) CPUs, hd3D shows the opposite behavior.
14. A Closer Look at hd3D - memory
[Chart: hd3D memory footprint. Memory required per task (MB) vs. processor count (32-128); roughly 20-70 MB per task.]
15. A Closer Look at hd3D
Two issues to investigate:
- 1TPP/2TPP differences
[Repeats the slide 9 chart: Bluevista, single core, private L3; SMT degrades performance.]
- Blueice/Bluevista 1TPP differences
[Repeats the slide 13 chart: Bluevista (no SMT) vs. Blueice (no SMT, private L3).]
16. A Closer Look at hd3D
CPI Breakdown Analysis
• Uses multiple hardware performance counters on the processor to:
  • Track the processor cycles required to complete a given workload (hd3D computational kernel, instrumented via the hpmcount API)
  • Track events in the processor core
  • Track events in the memory subsystem
• 17 counters are required for the Power5/Power5+ CPI breakdown:
  PM_IOPS_CMPL             PM_CMPLU_STALL_LSU
  PM_INST_CMPL             PM_CMPLU_STALL_REJECT
  PM_RUN_CYC               PM_CMPLU_STALL_DCACHE_MISS
  PM_GRP_CMPL              PM_CMPLU_STALL_ERAT_MISS
  PM_GCT_NOSLOT_CYC        PM_CMPLU_STALL_FXU
  PM_GCT_NOSLOT_IC_MISS    PM_CMPLU_STALL_DIV
  PM_GCT_NOSLOT_SRQ_FULL   PM_CMPLU_STALL_FDIV
  PM_GCT_NOSLOT_BR_MPRED   PM_CMPLU_STALL_FPU
  PM_1PLUS_PPC_CMPL
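
Once the 17 counter totals have been collected, the breakdown itself is arithmetic: overall CPI, plus each stall or no-slot class expressed as a fraction of run cycles. A minimal Python sketch with placeholder counter values (not measurements from the ICESS runs):

counters = {
    "PM_RUN_CYC": 1.0e9,          # processor run cycles (placeholder value)
    "PM_INST_CMPL": 6.0e8,        # completed instructions (placeholder value)
    "PM_GCT_NOSLOT_CYC": 1.5e8,   # cycles with an empty global completion table
    "PM_CMPLU_STALL_LSU": 2.0e8,  # completion stalls charged to the LSU
    "PM_CMPLU_STALL_FPU": 1.0e8,  # completion stalls charged to the FPU
}

cpi = counters["PM_RUN_CYC"] / counters["PM_INST_CMPL"]
print(f"CPI = {cpi:.3f}")

for name, value in counters.items():
    if name.startswith(("PM_CMPLU_STALL", "PM_GCT_NOSLOT")):
        pct = 100.0 * value / counters["PM_RUN_CYC"]
        print(f"{name:24s} {pct:6.2f}% of run cycles")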
21. A Closer Look at hd3D
Blueice/Bluevista 1TPP differences
[Radar chart: Ratio of PM Counters. Blueice/Bluevista ratios of the 17 PM counters for the 1TPP hd3D runs, plotted on a 0.000-6.000 scale.]

pmlist -p POWER5 -d -c 4,8
PM_CMPLU_STALL_ERAT_MISS,Completion stall caused by ERAT miss
Following a completion stall (any period when no groups completed), the last instruction to finish before completion resumes suffered an ERAT miss. This is a subset of PM_CMPLU_STALL_REJECT.
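
Producing the radar chart's input is the same kind of bookkeeping: one counter total per machine from the hpmcount output, then the ratio. A minimal Python sketch (placeholder values, not the measured hd3D counters):

bluevista = {"PM_RUN_CYC": 1.00e9, "PM_CMPLU_STALL_ERAT_MISS": 2.0e6}
blueice   = {"PM_RUN_CYC": 1.05e9, "PM_CMPLU_STALL_ERAT_MISS": 9.0e6}

for name in sorted(bluevista):
    ratio = blueice[name] / bluevista[name]   # Blueice/Bluevista, 1.0 = parity
    flag = "  <-- investigate" if ratio > 2.0 else ""
    print(f"{name:28s} {ratio:6.3f}{flag}")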
23. Conclusions
> Nothing new <
Sharing cache can degrade performance.
But lots of questions remain:
• Gathering and processing the data from performance counters was extremely tedious.
  • Is there an easier way?
  • Does difficulty increase exponentially with the level of detail?
• Are the Power5/5+ performance counters accurate?
  • Some say not
    • Eyerman et al. (ASPLOS, 2004)
• What do the counters mean?
  • Are there expanded references besides pmlist?
  • What is an ERAT miss?
  • What does it say about code performance?
• Will ACTC tools give more info than what’s available via pmlist?