1. High Performance of Finite-Volume Methods through Increased Arithmetic Intensity
J. Loffeld and J.A.F. Hittinger
SIAM CSE 2015, 3/17/2015
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC. LLNL-PRES-668437.
2. To get a high flop rate, you need high arithmetic intensity
[Roofline plot: Performance (GFlop/s) vs. Arithmetic Intensity (flop/byte), showing the machine-peak and machine-balance lines and lowered ceilings for no FMA, no AVX, and reduced concurrency. Low-order PDE stencils sit at low AI; FFTs and dense matrix multiply sit at high AI. Annotation: increasing the order of FV methods improves AI.]
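As background for the plot, the roofline model bounds attainable performance by the lesser of machine peak and memory bandwidth times AI; a minimal statement of the bound (generic symbols, ours rather than the slide's):

```latex
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; B_{\text{mem}} \cdot \text{AI}\right)
```

Below the machine balance point $\text{AI} = P_{\text{peak}} / B_{\text{mem}}$, a kernel is memory bound; raising AI moves it toward the compute-bound regime.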
3. Higher-order finite-volume methods require high-order flux approximations
Update formula for conservation laws (a standard form is sketched below):
• We are considering AI for one time step
Approximating the flux averages gives the FV method
• High-order approximations give a high-order method
High-order flux approximations use more flops
[Figure: flux update stencil]
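The update formula itself did not survive extraction; a standard one-dimensional form consistent with the description here, for a conservation law $u_t + f(u)_x = 0$, is:

```latex
\bar{u}_i^{\,n+1} = \bar{u}_i^{\,n} - \frac{\Delta t}{\Delta x}\left(\hat{F}_{i+1/2} - \hat{F}_{i-1/2}\right)
```

where $\bar{u}_i$ is the cell average and $\hat{F}_{i\pm 1/2}$ are face-averaged fluxes; approximating those flux averages to high order is what makes the overall method high order.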
4. High-order flux approximations include more neighbor information
Eighth-order central flux: [equation not recovered from the slide image; a generic stencil sketch follows]
Incorporating information from neighbors gives a high flop count
Derived upwind and central high-order schemes for 5th through 8th order
[McCorquodale et al., CAMCS (2011)], [Colella et al., J. Comput. Phys. (2011)]
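As a sketch of why wider central stencils raise the flop count, here is a generic C routine that reconstructs a face value from 2p neighboring cell averages; the weight array is supplied by the caller, and only the fourth-order weights in the usage example are actual published coefficients (the eighth-order ones are not reproduced here):

```c
#include <stdio.h>
#include <stddef.h>

/* Reconstruct the value at face i+1/2 from the 2p neighboring cell
 * averages u[i-p+1 .. i+p]. An eighth-order central scheme has p = 4,
 * so each face touches 8 cells: more neighbors means more
 * multiply-adds per face, hence higher arithmetic intensity. */
double face_value(const double *u, ptrdiff_t i, const double *w, int p)
{
    double v = 0.0;
    for (int k = -p + 1; k <= p; ++k)
        v += w[k + p - 1] * u[i + k];   /* 2p multiply-adds per face */
    return v;
}

int main(void)
{
    /* Usage with the known fourth-order face-average weights (p = 2):
     * u_{i+1/2} = 7/12 (u_i + u_{i+1}) - 1/12 (u_{i-1} + u_{i+2}). */
    double u[8]  = {0, 1, 2, 3, 4, 5, 6, 7};
    double w4[4] = {-1.0/12, 7.0/12, 7.0/12, -1.0/12};
    printf("u at face 3+1/2: %f\n", face_value(u, 3, w4, 2));  /* 3.5 */
    return 0;
}
```

Each face costs 2p multiply-adds, so going from fourth order (p = 2) to eighth order (p = 4) doubles the per-face arithmetic while the data fetched grows far more slowly.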
5. AI is most easily calculated in the limit of infinite cache size
Assume unlimited cache
• Useful as a baseline for later refinement
• Gives the target AI
Load and store data only once per cell
Temporaries between stencils are absorbed by cache
Re-use of data allows high AI (a worked form follows)
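A minimal worked form of the infinite-cache AI under these assumptions (8-byte doubles, c components per cell; notation ours rather than the slide's):

```latex
\text{AI}_\infty \;=\; \frac{F_{\text{cell}}}{8c + 8c} \;=\; \frac{F_{\text{cell}}}{16\,c}\ \ \text{flop/byte}
```

where $F_{\text{cell}}$ is the flop count per cell update, and the denominator counts each cell's state loaded once and stored once. Since $F_{\text{cell}}$ grows with the order of the scheme while the traffic term stays fixed, higher order raises AI directly.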
6. Theoretical maximum AI reaches the target for sixth and eighth order
[Chart: theoretical maximum AI by order, against the modern machine balance line]
We would see these results in practice if machines had infinite cache space.
Flops for the example step: $c\left(8\,\frac{D-1}{2}+1\right)D\,(N+1)(N+2)^{D-1}$ (evaluated in the snippet below)
The formulas are parameterized by
• Dimension
• Number of components
• Domain size
• Flop cost of the flux function
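Taking the flop count above at face value (it is our reading of a garbled slide equation), a short C snippet makes the parameterization concrete; the structure, a polynomial in N parameterized by D and c, matters more than the exact digits:

```c
#include <stdio.h>
#include <math.h>

/* Evaluate the reconstructed flop count
 *   c * (8*(D-1)/2 + 1) * D * (N+1) * (N+2)^(D-1)
 * for dimension D, domain size N, and flux-function cost c.
 * The expression is our reading of the slide; the point is the
 * parameterization (dimension, components, domain size, flux cost). */
static double flops_per_step(double c, int D, int N)
{
    return c * (8.0 * (D - 1) / 2.0 + 1.0) * D
             * (N + 1) * pow(N + 2, D - 1);
}

int main(void)
{
    /* Example: 3D domain, N = 128, unit flux cost. */
    printf("flops ~ %.3g\n", flops_per_step(1.0, 3, 128));
    return 0;
}
```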
7. In reality, machines have finite-size caches
Overhead comes from re-fetching halo cells
• Sixth-order halo width is 4
• Eighth-order halo width is 6
• Halo cells limit the minimum block size
Each block stores a number of values per component given by the halo-padded volume (sketched below)
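The storage expression on this slide was an image; a hedged reconstruction from the surrounding discussion: a block of interior width B in D dimensions with per-side halo width h stores

```latex
(B + 2h)^{D}
```

values per component, so halo padding inflates the block by a factor of $\left((B+2h)/B\right)^{D}$; for small B or wide halos this overhead dominates.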
8. To verify the predictions, we used the hardware counters on Vulcan
Vulcan is an IBM Blue Gene/Q:
• 32 MB L2 cache (last level)
• Cache line is 128 bytes
BGPM for hardware counters
• Flop counts are highly accurate
• DRAM transfers are overcounted:
— Prefetching turned off
— Overhead from the API
— Aliasing error from the large cache line
— Random noise
Machine peak: 205 GFlop/s
Machine balance: 4.8 flop/byte
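These two figures fix the node's memory bandwidth implicitly, since machine balance is peak flop rate divided by bandwidth:

```latex
B_{\text{mem}} \;=\; \frac{205\ \text{GFlop/s}}{4.8\ \text{flop/byte}} \;\approx\; 42.7\ \text{GB/s}
```

so a kernel needs AI above 4.8 flop/byte before Vulcan's cores, rather than its memory system, become the bottleneck.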
9. Measured AI with ND cache blocking compares well to theory
• Higher-order methods have wider stencils
• Blocks need wide halos
• Less efficient cache reuse
[Chart: theoretical vs. measured AI for the fourth-, sixth-, and eighth-order methods, plotted against the modern machine balance]
10. Because of the halo, 3D blocking requires too much cache space
Need a block length of about 32 to keep overhead modest (on a 128³ domain)
• For eighth order: 1.55x
• For sixth order: 1.34x
Each block requires cache space proportional to its halo-padded volume (a worked estimate follows)
For a 5-component system (e.g. Euler), a 32-wide block needs 5 MB of cache
• Current processors have ~2 to 2.5 MB of last-level cache per core
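As a rough check on the 5 MB figure, using the halo-padded volume with the eighth-order per-side halo width of 6 (our accounting, which counts only one copy of the state):

```latex
(32 + 2\cdot 6)^3 \times 5\ \text{components} \times 8\ \text{B} \;\approx\; 3.4\ \text{MB}
```

Additional working storage (for example, a second copy of the interior for the update) plausibly accounts for the rest of the quoted ~5 MB; either way the requirement exceeds the ~2 to 2.5 MB of cache available per core.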
11. However, vertical iteration of rectangular cache blocks can improve cache usage
Successively evaluate blocks in columns
• No re-fetching of halo in the z direction (see the C sketch below)
Storage per block: [formula not recovered from the slide image]
For 8 × 32² blocks in a 128³ domain:

Order  Overhead  Size    AI
6      1.21x     1.5 MB  13.6
8      1.33x     2.1 MB  21.8

High AI with a realistic cache size
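A hedged C sketch of the column iteration described above; the domain and block sizes mirror the slide's 8 × 32² blocks in a 128³ domain, but process_block and all names are illustrative stand-ins rather than the talk's code:

```c
#include <stdio.h>
#include <stdlib.h>

/* Domain and block sizes mirroring the slide: 8 x 32^2 blocks in a
 * 128^3 domain. All names here are illustrative, not from the talk. */
#define NX 128
#define NY 128
#define NZ 128
#define BX 32
#define BY 32
#define BZ 8

#define IDX(i, j, k) (((size_t)(k) * NY + (j)) * NX + (i))

/* Stand-in for the FV flux/update kernel on one block; a real kernel
 * would apply the wide high-order stencil to the halo-padded block. */
static void process_block(double *u, int x0, int y0, int z0)
{
    for (int k = z0; k < z0 + BZ; ++k)
        for (int j = y0; j < y0 + BY; ++j)
            for (int i = x0; i < x0 + BX; ++i)
                u[IDX(i, j, k)] += 1.0;
}

/* Sweep the domain in vertical columns of rectangular blocks.
 * Because z0 is the innermost loop, consecutive blocks in a column
 * overlap in z: the z-direction halo planes are still cache-resident
 * from the previous block, so only the x-y halos are refetched. */
static void column_sweep(double *u)
{
    for (int y0 = 0; y0 < NY; y0 += BY)
        for (int x0 = 0; x0 < NX; x0 += BX)
            for (int z0 = 0; z0 < NZ; z0 += BZ)
                process_block(u, x0, y0, z0);
}

int main(void)
{
    double *u = calloc((size_t)NX * NY * NZ, sizeof *u);
    if (!u) return 1;
    column_sweep(u);
    printf("u[0] = %f\n", u[0]);  /* 1.0 after one sweep */
    free(u);
    return 0;
}
```

The design point is the loop order: walking z innermost keeps each block's bottom halo planes hot in cache from the block just processed below it, which is exactly why the z-direction halo incurs no re-fetch traffic.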
12. Summary and current/future work
Summary:
• Derived high-order finite-volume schemes
• Conducted an AI analysis that shows high AI can be obtained with realistic cache sizes
[Chart: roofline with machine peak and machine balance]
Current and future work:
• AI is an important metric for on-node utilization, but it does not equal performance
— Latency, concurrency, cache blocking
— [Olschanowsky et al., SC (2014)] for 4th order
• Need to consider ways to reduce halo width to further reduce overhead
• Include nonlinear limiting in the flux AI analysis
— Will further increase ops without increasing data transfers
Editor's Notes
Kernels are memory bound; cores are 90% idle. The roofline model explains this phenomenon by relating performance to AI and CPU features.