1. High Performance of Finite-Volume Methods through Increased Arithmetic Intensity
J. Loffeld and J.A.F. Hittinger
SIAM CSE 2015, 3/17/2015
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC. LLNL-PRES-668437.
2. To get a high flop rate, you need high arithmetic intensity
[Roofline plot: Performance (GFlop/s) vs. Arithmetic Intensity (flop/byte), showing the machine-peak and machine-balance lines and lowered ceilings for no FMA, no AVX, and reduced concurrency. Low-order PDE stencils sit at low AI; FFTs and dense matrix multiply sit at high AI. Annotation: increasing the order of FV methods improves AI.]
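As background for the plot, the roofline model bounds attainable performance by the lesser of machine peak and memory bandwidth times AI; a minimal statement of the bound (generic symbols, ours rather than the slide's):

```latex
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; B_{\text{mem}} \cdot \text{AI}\right)
```

Below the machine balance point $\text{AI} = P_{\text{peak}} / B_{\text{mem}}$, a kernel is memory bound; raising AI moves it toward the compute-bound regime.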
3. Higher-order finite-volume methods require high-order flux approximations
Update formula for conservation laws (a standard form is sketched below):
• We are considering AI for one time step
Approximating the flux averages gives the FV method
• High-order approximations give a high-order method
High-order flux approximations use more flops
[Figure: flux update stencil]
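The update formula itself did not survive extraction; a standard one-dimensional form consistent with the description here, for a conservation law $u_t + f(u)_x = 0$, is:

```latex
\bar{u}_i^{\,n+1} = \bar{u}_i^{\,n} - \frac{\Delta t}{\Delta x}\left(\hat{F}_{i+1/2} - \hat{F}_{i-1/2}\right)
```

where $\bar{u}_i$ is the cell average and $\hat{F}_{i\pm 1/2}$ are face-averaged fluxes; approximating those flux averages to high order is what makes the overall method high order.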
4. High-order flux approximations include more neighbor information
Eighth-order central flux: [equation not recovered from the slide image; a generic stencil sketch follows]
Incorporating information from neighbors gives a high flop count
Derived upwind and central high-order schemes for 5th through 8th order
[McCorquodale et al., CAMCS (2011)], [Colella et al., J. Comput. Phys. (2011)]
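As a sketch of why wider central stencils raise the flop count, here is a generic C routine that reconstructs a face value from 2p neighboring cell averages; the weight array is supplied by the caller, and only the fourth-order weights in the usage example are actual published coefficients (the eighth-order ones are not reproduced here):

```c
#include <stdio.h>
#include <stddef.h>

/* Reconstruct the value at face i+1/2 from the 2p neighboring cell
 * averages u[i-p+1 .. i+p]. An eighth-order central scheme has p = 4,
 * so each face touches 8 cells: more neighbors means more
 * multiply-adds per face, hence higher arithmetic intensity. */
double face_value(const double *u, ptrdiff_t i, const double *w, int p)
{
    double v = 0.0;
    for (int k = -p + 1; k <= p; ++k)
        v += w[k + p - 1] * u[i + k];   /* 2p multiply-adds per face */
    return v;
}

int main(void)
{
    /* Usage with the known fourth-order face-average weights (p = 2):
     * u_{i+1/2} = 7/12 (u_i + u_{i+1}) - 1/12 (u_{i-1} + u_{i+2}). */
    double u[8]  = {0, 1, 2, 3, 4, 5, 6, 7};
    double w4[4] = {-1.0/12, 7.0/12, 7.0/12, -1.0/12};
    printf("u at face 3+1/2: %f\n", face_value(u, 3, w4, 2));  /* 3.5 */
    return 0;
}
```

Each face costs 2p multiply-adds, so going from fourth order (p = 2) to eighth order (p = 4) doubles the per-face arithmetic while the data fetched grows far more slowly.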
5. AI is most easily calculated in the limit of infinite cache size
Assume unlimited cache
• Useful as a baseline for later refinement
• Gives the target AI
Load and store data only once per cell
Temporaries between stencils are absorbed by cache
Re-use of data allows high AI (a worked form follows)
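A minimal worked form of the infinite-cache AI under these assumptions (8-byte doubles, c components per cell; notation ours rather than the slide's):

```latex
\text{AI}_\infty \;=\; \frac{F_{\text{cell}}}{8c + 8c} \;=\; \frac{F_{\text{cell}}}{16\,c}\ \ \text{flop/byte}
```

where $F_{\text{cell}}$ is the flop count per cell update, and the denominator counts each cell's state loaded once and stored once. Since $F_{\text{cell}}$ grows with the order of the scheme while the traffic term stays fixed, higher order raises AI directly.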
6. Theoretical maximum AI reaches the target for sixth and eighth order
[Chart: theoretical maximum AI by order, against the modern machine balance line]
We would see these results in practice if machines had infinite cache space.
Flops for the example step: $c\left(8\,\frac{D-1}{2}+1\right)D\,(N+1)(N+2)^{D-1}$ (evaluated in the snippet below)
The formulas are parameterized by
• Dimension
• Number of components
• Domain size
• Flop cost of the flux function
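Taking the flop count above at face value (it is our reading of a garbled slide equation), a short C snippet makes the parameterization concrete; the structure, a polynomial in N parameterized by D and c, matters more than the exact digits:

```c
#include <stdio.h>
#include <math.h>

/* Evaluate the reconstructed flop count
 *   c * (8*(D-1)/2 + 1) * D * (N+1) * (N+2)^(D-1)
 * for dimension D, domain size N, and flux-function cost c.
 * The expression is our reading of the slide; the point is the
 * parameterization (dimension, components, domain size, flux cost). */
static double flops_per_step(double c, int D, int N)
{
    return c * (8.0 * (D - 1) / 2.0 + 1.0) * D
             * (N + 1) * pow(N + 2, D - 1);
}

int main(void)
{
    /* Example: 3D domain, N = 128, unit flux cost. */
    printf("flops ~ %.3g\n", flops_per_step(1.0, 3, 128));
    return 0;
}
```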
7. In reality, machines have finite-size caches
Overhead comes from re-fetching halo cells
• Sixth-order halo width is 4
• Eighth-order halo width is 6
• Halo cells limit the minimum block size
Each block stores a number of values per component given by the halo-padded volume (sketched below)
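The storage expression on this slide was an image; a hedged reconstruction from the surrounding discussion: a block of interior width B in D dimensions with per-side halo width h stores

```latex
(B + 2h)^{D}
```

values per component, so halo padding inflates the block by a factor of $\left((B+2h)/B\right)^{D}$; for small B or wide halos this overhead dominates.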
8. To verify the predictions, we used the hardware counters on Vulcan
Vulcan is an IBM Blue Gene/Q:
• 32 MB L2 cache (last level)
• Cache line is 128 bytes
BGPM for hardware counters
• Flop counts are highly accurate
• DRAM transfers are overcounted:
— Prefetching turned off
— Overhead from the API
— Aliasing error from the large cache line
— Random noise
Machine peak: 205 GFlop/s
Machine balance: 4.8 flop/byte
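These two figures fix the node's memory bandwidth implicitly, since machine balance is peak flop rate divided by bandwidth:

```latex
B_{\text{mem}} \;=\; \frac{205\ \text{GFlop/s}}{4.8\ \text{flop/byte}} \;\approx\; 42.7\ \text{GB/s}
```

so a kernel needs AI above 4.8 flop/byte before Vulcan's cores, rather than its memory system, become the bottleneck.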
9. Measured AI with ND cache blocking compares well to theory
• Higher-order methods have wider stencils
• Blocks need wide halos
• Less efficient cache reuse
[Chart: theoretical vs. measured AI for the fourth-, sixth-, and eighth-order methods, plotted against the modern machine balance]
10. Because of the halo, 3D blocking requires too much cache space
Need a block length of about 32 to keep overhead modest (on a 128³ domain)
• For eighth order: 1.55x
• For sixth order: 1.34x
Each block requires cache space proportional to its halo-padded volume (a worked estimate follows)
For a 5-component system (e.g. Euler), a 32-wide block needs 5 MB of cache
• Current processors have ~2 to 2.5 MB of last-level cache per core
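As a rough check on the 5 MB figure, using the halo-padded volume with the eighth-order per-side halo width of 6 (our accounting, which counts only one copy of the state):

```latex
(32 + 2\cdot 6)^3 \times 5\ \text{components} \times 8\ \text{B} \;\approx\; 3.4\ \text{MB}
```

Additional working storage (for example, a second copy of the interior for the update) plausibly accounts for the rest of the quoted ~5 MB; either way the requirement exceeds the ~2 to 2.5 MB of cache available per core.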
11. However, vertical iteration of rectangular cache blocks can improve cache usage
Successively evaluate blocks in columns
• No re-fetching of halo in the z direction (see the C sketch below)
Storage per block: [formula not recovered from the slide image]
For 8 × 32² blocks in a 128³ domain:

Order  Overhead  Size    AI
6      1.21x     1.5 MB  13.6
8      1.33x     2.1 MB  21.8

High AI with a realistic cache size
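A hedged C sketch of the column iteration described above; the domain and block sizes mirror the slide's 8 × 32² blocks in a 128³ domain, but process_block and all names are illustrative stand-ins rather than the talk's code:

```c
#include <stdio.h>
#include <stdlib.h>

/* Domain and block sizes mirroring the slide: 8 x 32^2 blocks in a
 * 128^3 domain. All names here are illustrative, not from the talk. */
#define NX 128
#define NY 128
#define NZ 128
#define BX 32
#define BY 32
#define BZ 8

#define IDX(i, j, k) (((size_t)(k) * NY + (j)) * NX + (i))

/* Stand-in for the FV flux/update kernel on one block; a real kernel
 * would apply the wide high-order stencil to the halo-padded block. */
static void process_block(double *u, int x0, int y0, int z0)
{
    for (int k = z0; k < z0 + BZ; ++k)
        for (int j = y0; j < y0 + BY; ++j)
            for (int i = x0; i < x0 + BX; ++i)
                u[IDX(i, j, k)] += 1.0;
}

/* Sweep the domain in vertical columns of rectangular blocks.
 * Because z0 is the innermost loop, consecutive blocks in a column
 * overlap in z: the z-direction halo planes are still cache-resident
 * from the previous block, so only the x-y halos are refetched. */
static void column_sweep(double *u)
{
    for (int y0 = 0; y0 < NY; y0 += BY)
        for (int x0 = 0; x0 < NX; x0 += BX)
            for (int z0 = 0; z0 < NZ; z0 += BZ)
                process_block(u, x0, y0, z0);
}

int main(void)
{
    double *u = calloc((size_t)NX * NY * NZ, sizeof *u);
    if (!u) return 1;
    column_sweep(u);
    printf("u[0] = %f\n", u[0]);  /* 1.0 after one sweep */
    free(u);
    return 0;
}
```

The design point is the loop order: walking z innermost keeps each block's bottom halo planes hot in cache from the block just processed below it, which is exactly why the z-direction halo incurs no re-fetch traffic.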
12. Summary and current/future work
Summary:
• Derived high-order finite-volume schemes
• Conducted an AI analysis that shows high AI can be obtained with realistic cache sizes
[Chart: roofline with machine peak and machine balance]
Current and future work:
• AI is an important metric for on-node utilization, but it does not equal performance
— Latency, concurrency, cache blocking
— [Olschanowsky et al., SC (2014)] for 4th order
• Need to consider ways to reduce halo width to further reduce overhead
• Include nonlinear limiting in the flux AI analysis
— Will further increase ops without increasing data transfers
Editor's Notes
Kernels are memory bound; cores are 90% idle. The roofline model explains this phenomenon by relating performance to AI and CPU features.