1. science + computing ag
IT Service and Software Solutions for Complex Computing Environments
Tuebingen | Muenchen | Berlin | Duesseldorf
Runtime Performance Optimizations for an OpenFOAM Simulation
Dr. Oliver Schröder
HPC Services and Software
21 June 2014
HPC Advisory Council European Conference 2014
2. Many Thanks to…
Applications and Performance Team
Madeleine Richards
Rafael Eduardo Escovar Mendez
HPC Services and Software Team
Fisnik Kraja
Josef Hellauer
3. Overview
Introduction
The Test Case
The Benchmarking Environment
Initial Results (the starting point)
Dependency Analysis
• CPU Frequency
• Interconnect
• Memory Hierarchy
Performance Improvements
• MPI Communications
Tuning MPI Routines
Intel MPI vs. Bull MPI
• Domain Decomposition
Conclusions
4. Introduction
How can we improve application performance?
• By using the resources efficiently
• By selecting the appropriate resources
In order to do this we have to:
• Analyze the behavior of the application
• And then tune the runtime environment accordingly
In this study we:
• Analyze the dependencies of an OpenFOAM Simulation
• Improve the performance of OpenFOAM by
• Tuning MPI communications
• Selecting the appropriate domain decomposition
5. The Test Environment
Compute Cluster: Sid | Robin
Hardware
Processor: Intel E5-2680 v2 (Ivy Bridge) | Intel E5-2697 v2 (Ivy Bridge)
Frequency: 2.80 GHz | 2.70 GHz
Cores per processor: 10 | 12
Sockets per node: 2 | 2
Cores per node: 20 | 24
Cache: 25 MB L3, 10 x 256 KB L2 | 30 MB L3, 12 x 256 KB L2
Memory: 64 GB | 64 GB
Memory frequency: 1866 MHz | 1866 MHz
I/O file system: NFS over IB | NFS over IB
Interconnect: IB FDR | IB FDR
Software
OpenFOAM: 2.2.2 | 2.2.2
Intel C Compiler: 13.1.3.192 | 13.1.3.192
GNU C Compiler: 4.7.3 | 4.7.3
Compiler options: -O3 -fPIC -m64 | -O3 -fPIC -m64
Intel MPI: 4.1.0.030 | 4.1.0.030
Bull MPI: 1.2.4.1 | 1.2.4.1
SLURM: 2.6.0 | 2.5.0
OFED: 1.5.4.1 | 1.5.4.1
6. Initial Results on SID (E5-2680v2) (1)
GCC (Intel MPI) vs. ICC (Intel MPI)
Better performance is obtained with the OpenFOAM build compiled with GCC.
OpenFOAM is mainly developed and optimized with GCC, so this is not surprising. (A sketch of switching the compiler follows the charts below.)
8 vs. 10 Tasks per CPU
Better performance with 8 tasks per CPU. Possible reasons:
With fewer tasks per CPU, each core gets a larger share of the cache and of the memory bandwidth.
The domain decomposition is different; the impact of the domain decomposition on performance is shown later.
GCC vs. ICC (ClockTime [s])
         256 Tasks / 16 Nodes   512 Tasks / 32 Nodes   768 Tasks / 48 Nodes
GCC      1767                   1019                   869
ICC      1806                   1080                   884
8 vs. 10 Tasks/CPU (ClockTime [s])
                16 Nodes   32 Nodes   48 Nodes
8 Tasks/CPU     1767       1019       869
10 Tasks/CPU    1799       1116       1010
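Both choices are made outside the solver. A minimal sketch, assuming a stock OpenFOAM-2.2.2 source tree and Intel MPI's hydra launcher; the installation path and the solver name are placeholders:

# Build OpenFOAM with the chosen compiler: WM_COMPILER=Gcc (GNU) or Icc (Intel).
# Recent 2.x bashrc versions accept key=value overrides; otherwise edit etc/bashrc directly.
source $HOME/OpenFOAM/OpenFOAM-2.2.2/etc/bashrc WM_COMPILER=Gcc
./Allwmake                                # run from the OpenFOAM source directory

# Under-populate the nodes: 8 tasks per CPU (16 per 20-core node) instead of 10.
mpirun -ppn 16 -np 256 <solver> -parallel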
7. Initial Results on SID (E5-2680v2) (2)
OpenFOAM: CPU Binding (ClockTime [s])
                   320 Tasks / 16 Nodes   640 Tasks / 32 Nodes   960 Tasks / 48 Nodes
No CPU binding     1801                   1121                   1007
With CPU binding   1799                   1116                   1010
OpenFOAM tests on 16 Nodes (Clock Time [s])
DAPL fabric (320 Tasks): 1804
OFA fabric (320 Tasks): 1845
zone reclaim (320 Tasks): 1802
huge pages (320 Tasks): 1807
hyperthreading (640 Tasks): 2229
turbo boost (320 Tasks): 1750
Chart: Example results with Star-CCM+, Time [s] on 6/12/24/48 nodes (120/240/480/960 tasks), comparing the optimization combinations Opt: none; cb; cb,zr; cb,zr,thp; cb,zr,thp,trb; cb,zr,thp,ht (cb=cpu_bind, zr=zone reclaim, thp=transparent huge pages, trb=turbo, ht=hyperthreading).
8. Initial Results on SID (E5-2680v2) (3)
CPU Binding
With other applications we often observe performance improvements when we bind the tasks to specific CPUs (especially with hyperthreading enabled).
This does not seem to be the case for this OpenFOAM simulation.
We also applied the other optimization techniques, but only Turbo Boost brought an improvement, and at roughly 3% it was smaller than expected. (The knobs involved are sketched below.)
We therefore need to analyze the dependencies of OpenFOAM in order to understand this behavior.
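For reference, a minimal sketch of the node-level knobs tested above. The exact sysfs paths vary with kernel and driver (RHEL 6 uses a redhat_transparent_hugepage directory, intel_pstate systems expose turbo differently), and most settings require root, so treat these as illustrative assumptions rather than the exact commands used:

export I_MPI_PIN=1                                           # Intel MPI: pin ranks to cores
echo 1 > /proc/sys/vm/zone_reclaim_mode                      # reclaim memory on the local NUMA node first
echo always > /sys/kernel/mm/transparent_hugepage/enabled    # transparent huge pages
echo 1 > /sys/devices/system/cpu/cpufreq/boost               # turbo boost (acpi-cpufreq systems)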
9. Overview
Introduction
The Test Case
The Benchmarking Environment
Initial Results (the starting point)
Dependency Analysis
• CPU Frequency
• Interconnect
• Memory Hierarchy
Performance Improvements
• MPI Communications
Tuning MPI Routines
Intel MPI vs. Bull MPI
• Domain Decomposition
Conclusions
10. CPU Comparison
E5-2680v2 vs. E5-2697v2
Core-to-Core Comparison (both CPUs at 2.4 GHz)
16 nodes: better on the E5-2697v2, consistent with its additional 5 MB of L3 cache
24 nodes: very small differences – MPI overhead hides the benefit of the larger cache
Node-to-Node Comparison (fully populated nodes)
16 nodes: 2% better on the E5-2697v2
24 nodes: <1% better on the E5-2697v2
Core-to-Core Comparison: OpenFOAM @ 2.4 GHz (ClockTime [s])
                     256 Tasks / 16 Nodes   320 Tasks / 16 Nodes   384 Tasks / 24 Nodes   480 Tasks / 24 Nodes
E5-2680v2 on Sid     1912                   1925                   1364                   1345
E5-2697v2 on Robin   1849                   1803                   1355                   1344

Node-to-Node Comparison: OpenFOAM on fully populated nodes (ClockTime [s])
                            16 Nodes   24 Nodes
10c @ 2.8 GHz (E5-2680v2)   1799       1294
12c @ 2.7 GHz (E5-2697v2)   1765       1284
11. OpenFOAM Dependency on CPU Frequency
Tests on SID B71010c (E5-2680v2)
The dependency on the CPU frequency is only modest: run time scales roughly as f^-0.45 rather than the ideal f^-1 (about 45% of perfect frequency scaling).
Beyond 2.5 GHz there is not much further improvement. (The sweep setup is sketched after the chart below.)
Chart: Wall clock time [s] vs. CPU frequency [MHz], with power-law trend lines:
16 Nodes – 320 Tasks: y = 69878 x^-0.463, R² = 0.962
32 Nodes – 640 Tasks: y = 36256 x^-0.441, R² = 0.970
48 Nodes – 960 Tasks: y = 25120 x^-0.408, R² = 0.970
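A sweep like this has to set the frequency on every compute node before each run. A minimal sketch, assuming cpupower from kernel-tools and pdsh for fan-out; the node names are placeholders and root rights are required:

for f in 1200MHz 1600MHz 2000MHz 2500MHz 2800MHz; do
    # switch to the userspace governor and pin the clock on all nodes of the job
    pdsh -w node[001-016] "cpupower frequency-set -g userspace && cpupower frequency-set -f $f"
    # ... run the OpenFOAM case and record ClockTime for this frequency ...
done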
12. OpenFOAM Dependency on Interconnect
Compact Task Placement vs. Fully Populated
With the compact task placement we bind all 10 tasks in each node to one of the two CPUs, so that cache size and memory bandwidth per task stay the same as in the fully populated case while each task gets more interconnect bandwidth. (A launch sketch follows the chart below.)
For the tests with 320 tasks we obtain a 3% improvement.
For the tests with 640 tasks we obtain an 11% improvement.
Conclusion:
The dependency on interconnect bandwidth increases with the number of tasks (nodes).
Test setup:
Fully populated: 20 Tasks/Node (320 Tasks on 16 Nodes, 640 Tasks on 32 Nodes)
Compact: 10 Tasks/Node, i.e. twice the number of nodes (320 Tasks on 32 Nodes, 640 Tasks on 64 Nodes)
Speedup of compact vs. fully populated: 1.03 (320 Tasks), 1.11 (640 Tasks)
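A minimal launch sketch for the compact placement, assuming Intel MPI 4.x pinning syntax and that cores 0-9 belong to socket 0 on the E5-2680v2 nodes (the core numbering is BIOS/OS dependent); the solver name is a placeholder:

export I_MPI_PIN=1
export I_MPI_PIN_PROCESSOR_LIST=0-9          # all ten ranks per node on one socket
mpirun -ppn 10 -np 320 <solver> -parallel    # 320 tasks spread over 32 nodes instead of 16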
13. OpenFOAM Dependency on Memory Hierarchy
Scatter vs. Compact Task Placement
The dependency on the memory hierarchy is larger when running OpenFOAM on a small number of nodes; on more nodes the bottleneck is no longer the memory hierarchy but the interconnect.
The contribution of the L3 cache stays almost the same when doubling the number of nodes, whereas the contribution of the memory throughput drops. (A pinning sketch follows the chart below.)
Scatter vs. compact placement: speedup and its breakdown
                                    320 Tasks / 32 Nodes   640 Tasks / 64 Nodes
Improvement from memory bandwidth   28%                    4%
Improvement from the L3 cache       22%                    24%
Overall speedup                     1.50                   1.28
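The two placements can be reproduced with Intel MPI's pinning map keyword; a minimal sketch under that assumption (the solver name is a placeholder):

export I_MPI_PIN=1
export I_MPI_PIN_PROCESSOR_LIST=allcores:map=scatter   # scatter: spread the 10 ranks over both sockets
# export I_MPI_PIN_PROCESSOR_LIST=allcores:map=bunch   # compact: pack them onto one socket
mpirun -ppn 10 -np 320 <solver> -parallel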
14. OpenFOAM Dependency on the File System
Lustre Striping
We tested OpenFOAM on Lustre with various stripe counts and stripe sizes (a striping sketch follows the chart below).
We could not obtain any improvement.
Chart: ClockTime [s] vs. number of nodes (16, 32, 48; 20 Tasks/Node) for the tested Lustre layouts: 1, 2, 4 and 6 stripes of 1 MB, and 1 stripe of 2, 4 and 6 MB. All layouts perform essentially the same.
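A minimal sketch of how such stripe settings are applied to a case directory before it is written; the path is a placeholder and option spellings differ slightly between Lustre versions (-c stripe count, -S or -s stripe size):

lfs setstripe -c 4 -S 1M /lustre/work/openfoam_case   # e.g. 4 stripes of 1 MB
lfs getstripe /lustre/work/openfoam_case              # verify the layout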
15. Overview
Introduction
The Test Case
The Benchmarking Environment
Initial Results (the starting point)
Dependency Analysis
• CPU Frequency
• Interconnect
• Memory Hierarchy
Performance Improvements
• MPI Communications
Tuning MPI Routines
Intel MPI vs. Bull MPI
• Domain Decomposition
Conclusions
16. MPI Profiling (with Intel MPI)
Tests on SID B71010c (E5-2680v2)
The portion of time spent in MPI increases with the number of nodes.
Looking at the individual MPI calls:
~50% of the MPI time is spent in MPI_Allreduce
~30% of the MPI time is spent in MPI_Waitall
(A sketch of how such a profile can be collected follows the tables below.)
MPI vs. user time (share of total run time)
        320 Tasks / 16 Nodes   640 Tasks / 32 Nodes   960 Tasks / 48 Nodes
User    76%                    58%                    42%
MPI     23.5%                  41.6%                  57.9%

Share of MPI time per routine
                320 Tasks / 16 Nodes   640 Tasks / 32 Nodes   960 Tasks / 48 Nodes
MPI_Allreduce   48.6%                  53.5%                  51.2%
MPI_Waitall     31.9%                  29.6%                  28.4%
MPI_Recv        15.1%                  11.7%                  14.0%
Other           4.4%                   5.2%                   6.4%
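A breakdown like this can be collected with Intel MPI's built-in statistics; a minimal sketch, assuming a 4.x release that supports the IPM-style summary (the solver name is a placeholder):

export I_MPI_STATS=ipm            # per-routine MPI time summary, written at MPI_Finalize
mpirun -np 640 <solver> -parallel
# alternatively, the native statistics format via I_MPI_STATS=<level> and I_MPI_STATS_FILE=<file>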
17. MPI All Reduce Tuning (with Intel MPI)
Tests on SID B71010c (E5-2680v2)
The best Allreduce algorithm: (5) binomial gather + scatter
Overall improvement obtained (a sketch of selecting the algorithm follows the results below):
none on 16 Nodes – 320 Tasks
11% on 32 Nodes – 640 Tasks
1. Recursive doubling algorithm
2. Rabenseifner's algorithm
3. Reduce + Bcast algorithm
4. Topology aware Reduce + Bcast algorithm
5. Binomial gather + scatter algorithm
6. Topology aware binomial gather + scatter algorithm
7. Shumilin's ring algorithm
8. Ring algorithm
MPI Allreduce algorithms, ClockTime [s] (default setting vs. algorithms 1–8 as listed above)
                       default   1      2      3      4      5      6      7      8
16 Nodes – 320 Tasks   1797      1893   2075   2232   1864   1801   1860   2208   2813
32 Nodes – 640 Tasks   1117      1210   1597   4096   1490   1004   1526   3985   7373
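With Intel MPI the algorithm is chosen through the collective-tuning environment variable; a minimal sketch for pinning MPI_Allreduce to algorithm 5 (the solver name is a placeholder):

export I_MPI_ADJUST_ALLREDUCE=5   # 5 = binomial gather + scatter
mpirun -np 640 <solver> -parallel
# message-size ranges can also be attached to the algorithm id, see the Intel MPI reference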
18. Finding the Right Domain Decomposition
Tests on Robin with E5-2697v2 CPUs
We obtain
10–14% improvement on 16 nodes
3–5% improvement on 24 nodes
A good decomposition improves data locality and thus intra-node communication, but it cannot have a large impact once the inter-node communication overhead dominates. (An example decomposeParDict follows the charts below.)
Chart: Domain Decomposition (bottom up: Y Z X), OpenFOAM on 16 Nodes with 24 Tasks/Node (384 subdomains). 16 decompositions (n_x x n_y x n_z) were tested; ClockTime drops from 1765 s for the baseline (speedup 1.000) to 1546–1592 s for the best decompositions (speedup 1.11–1.14).
Chart: Domain Decomposition (bottom up: Y Z X), OpenFOAM on 24 Nodes with 24 Tasks/Node (576 subdomains). 19 decompositions were tested; ClockTime drops from 1284 s for the baseline to 1224–1249 s for the best ones (speedup 1.03–1.05).
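The decomposition is set in the case's system/decomposeParDict. A minimal sketch for one of the 384-way decompositions tested above, assuming OpenFOAM 2.2.x's simple method (the concrete n values are only an illustration):

# write a decomposeParDict with an explicit nx x ny x nz split (here 2 x 32 x 6 = 384)
cat > system/decomposeParDict <<'EOF'
FoamFile
{
    version     2.0;
    format      ascii;
    class       dictionary;
    object      decomposeParDict;
}
numberOfSubdomains 384;
method          simple;
simpleCoeffs
{
    n           (2 32 6);   // illustrative split: X=2, Y=32, Z=6
    delta       0.001;
}
EOF
decomposePar -force         # redo the decomposition before the parallel run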
19. Bull MPI vs. Intel MPI
Tests on Robin with E5-2697v2 CPUs
Bull MPI is based on Open MPI.
It has been enhanced with features such as:
effective abnormal pattern detection
network-aware collective operations
multi-path network failover
It is designed to boost the performance of MPI applications through full integration with:
bullx PFS (based on Lustre)
bullx BM (based on SLURM)
Chart: ClockTime [s], Intel MPI vs. Bull MPI, for two decompositions each on 16 Nodes (384 Tasks: X=2 Z=2 Y=96 and X=2 Z=6 Y=32) and on 24 Nodes (576 Tasks: X=2 Z=2 Y=144 and X=2 Z=9 Y=32). The per-configuration differences between the two MPI libraries are 6.2%, 3.2%, 1.4% and 3.9%, respectively.
20. Conclusions
1. CPU Frequency
• OpenFOAM shows a dependency of less than 50% on the CPU frequency (run time scales roughly as f^-0.45)
2. Cache Size
• OpenFOAM is dependent on the cache size per core
3. The limiting factor changes with the number of nodes:
1. Memory Bandwidth:
OpenFOAM is memory-bandwidth bound on a small number of nodes
2. Communication:
MPI time increases substantially with the number of nodes
4. Tuning MPI Allreduce Operations
• Up to 11% improvement can be obtained
5. Domain Decomposition
• Finding the right decomposition of the mesh is very important
6. File System
• We could not obtain better results with different stripe counts or sizes on Lustre
21. Thank you for your attention.
science + computing ag
www.science-computing.de
Oliver Schröder
Email: o.schroeder@science-computing.de