1) Performance tuning methods for HPC Cloud include PCI passthrough, NUMA affinity, and reducing VMM noise to improve performance and close the gap with bare metal machines.
2) Evaluation of MPI and HPC applications on a 16-node cluster showed that PCI passthrough brings MPI bandwidth close to that of bare metal, and that NUMA affinity (pinning vCPUs on the host and binding threads in the guest) narrows the single-node gap to bare metal to about 1-2%.
3) Parallel efficiency of coarse-grained applications was comparable to bare metal, while the fine-grained Bloss eigensolver saw a degradation of 8% on KVM and 22% on Amazon EC2, due to communication and virtualization overhead.
1. Toward a practical “HPC Cloud”:
Performance tuning of a virtualized HPC cluster
Ryousei Takano
Information Technology Research Institute,
National Institute of Advanced Industrial Science and Technology (AIST),
Japan
SC2011@Seattle, Nov. 15, 2011
2. Outline
• What is HPC Cloud?
• Performance tuning methods for HPC Cloud
– PCI passthrough
– NUMA affinity
– VMM noise reduction
• Performance evaluation
3. HPC Cloud
HPC Cloud utilizes cloud resources for High Performance Computing (HPC) applications.
[Figure: users request resources according to their needs; the provider allocates each user a dedicated virtual cluster on demand, hosted on a shared physical cluster.]
4. HPC Cloud (cont’d)
• Pros:
– User side: easy deployment
– Provider side: high resource utilization
• Cons:
– Performance degradation?
Performance tuning methods for virtualized environments are not yet established.
5. Toward a practical HPC Cloud
To reduce the overhead and move from the current HPC Cloud to a "true" HPC Cloud:
• Use PCI passthrough
• Set NUMA affinity
• Reduce VMM noise (not completed): interrupt virtualization; disable unnecessary services on the host OS (i.e., ksmd)
[Figure: in the current HPC Cloud, a VM with a guest driver sits on top of the VMM, the host's physical driver, and the NIC; its performance is not good and unstable. In the "true" HPC Cloud, the VM is a QEMU process whose guest OS threads run on VCPU threads scheduled by the Linux kernel/KVM onto the physical CPU sockets, and the physical driver runs inside the guest; its performance is closing in on that of bare metal machines.]
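As a concrete illustration of the "disable unnecessary services" point above, the sketch below stops the KSM daemon through its sysfs control file on the KVM host. This is an assumption for illustration, not the tuning script actually used in this work; the usual shell equivalent is "echo 0 > /sys/kernel/mm/ksm/run".

    /* Minimal sketch (illustration only): stop ksmd on the KVM host by
     * writing 0 to the KSM sysfs control file.  Must be run as root. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/kernel/mm/ksm/run", "w");
        if (f == NULL) {
            perror("/sys/kernel/mm/ksm/run");
            return 1;
        }
        fputs("0\n", f);   /* 0 = stop scanning; 1 = run; 2 = stop and unmerge */
        fclose(f);
        return 0;
    }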
6. PCI passthrough
[Figure: three I/O virtualization models compared side by side.
• I/O emulation: the guest driver hands I/O to a virtual switch (vSwitch) in the VMM, which drives the NIC through the host's physical driver.
• PCI passthrough: the guest runs the physical driver and accesses the NIC directly, bypassing the VMM.
• SR-IOV: each VM's physical driver talks to the NIC directly; an embedded switch (VEB) inside the NIC connects the VMs.
The figure contrasts the three approaches in terms of VM sharing versus performance.]
7. Virtual CPU scheduling
[Figure: virtual CPU scheduling under Xen and KVM, compared with bare metal.
• Xen: a VM (Dom0 or DomU) runs guest OS threads on VCPUs (V0-V3); the Xen hypervisor's domain scheduler maps the VCPUs onto the physical CPUs (P0-P3).
• KVM: a VM is a QEMU process; guest OS threads run on VCPU threads (V0-V3), and the Linux process scheduler (the Linux kernel with KVM acts as the VMM) maps those threads onto the physical CPUs (P0-P3) of each CPU socket.
Note on the figure: a guest OS cannot run numactl.]
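Because a KVM vCPU is just a thread of the QEMU process scheduled by the ordinary Linux process scheduler, the host can place it with the standard Linux affinity API. The sketch below is a minimal illustration of what the taskset pinning on the next slide boils down to; the vCPU thread ID (found, e.g., under /proc/<qemu-pid>/task/) is an input supplied by the user, not something given on the slide.

    /* Minimal sketch (illustration only): pin a QEMU vCPU thread to one
     * physical CPU, roughly what "taskset -p -c <cpu> <tid>" does. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <vcpu-thread-id> <physical-cpu>\n", argv[0]);
            return 1;
        }
        pid_t tid = (pid_t)atoi(argv[1]);
        int cpu = atoi(argv[2]);

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(tid, sizeof(set), &set) != 0) {  /* pin that thread */
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned thread %d to physical CPU %d\n", (int)tid, cpu);
        return 0;
    }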
8. NUMA affinity
[Figure: NUMA affinity on bare metal and on KVM.
• Bare metal: numactl binds threads to a CPU socket; the Linux process scheduler keeps them on that socket's physical CPUs (P0-P3) and memory.
• KVM: inside the guest, numactl binds threads to a vSocket (VCPU threads V0-V3); on the host, taskset pins each vCPU thread to a physical CPU (Vn = Pn), so the guest placement lines up with the physical CPU sockets and their memory.]
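Inside the guest, the "bind threads to vSocket" step corresponds to running the application under numactl (e.g. numactl --cpunodebind=N --membind=N). The fragment below is a minimal sketch of the same binding expressed with libnuma, an illustration under that assumption rather than code from the talk; link with -lnuma.

    /* Minimal sketch (illustration only): restrict the calling process to the
     * CPUs and memory of one (virtual) NUMA node before starting its threads. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int node = (argc > 1) ? atoi(argv[1]) : 0;   /* vSocket / NUMA node id */

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support visible in this (virtual) machine\n");
            return 1;
        }

        numa_run_on_node(node);                      /* like --cpunodebind=node */

        struct bitmask *nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, node);
        numa_set_membind(nodes);                     /* like --membind=node */
        numa_free_nodemask(nodes);

        /* ... launch the OpenMP / solver threads here ... */
        printf("bound to NUMA node %d\n", node);
        return 0;
    }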
9. Evaluation
Evaluation of HPC applications on a 16-node cluster (part of the AIST Green Cloud Cluster)

Compute node: Dell PowerEdge M610
  CPU: Intel quad-core Xeon E5540 / 2.53 GHz x2
  Chipset: Intel 5520
  Memory: 48 GB DDR3
  InfiniBand: Mellanox ConnectX (MT26428)

Blade switch:
  InfiniBand: Mellanox M3601Q (QDR, 16 ports)

Host machine environment:
  OS: Debian 6.0.1
  Linux kernel: 2.6.32-5-amd64
  KVM: 0.12.50
  Compiler: gcc/gfortran 4.4.5
  MPI: Open MPI 1.4.2

VM environment:
  VCPU: 8
  Memory: 45 GB
10. MPI Point-to-Point communication performance
[Plot: bandwidth (MB/sec, log scale, higher is better) vs. message size (1 byte to 1 GB) for Bare Metal and KVM. Bare Metal = non-virtualized cluster.]
With PCI passthrough, MPI communication throughput is close to that of bare metal machines.
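For reference, a point-to-point bandwidth curve like the one above is typically measured with a ping-pong loop such as the minimal sketch below (an assumption for illustration; the slide does not name the exact benchmark used). Compile with mpicc and run two ranks, one per node.

    /* Minimal sketch (illustration only): ping-pong bandwidth between
     * ranks 0 and 1 for a single message size given in bytes. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        const int size  = (argc > 1) ? atoi(argv[1]) : 1 << 20;  /* bytes */
        const int iters = 100;
        int rank, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(size);
        memset(buf, 0, size);

        double t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = MPI_Wtime() - t0;

        if (rank == 0)   /* 2 * iters messages of `size` bytes crossed the link */
            printf("%d bytes: %.1f MB/s\n", size, 2.0 * iters * size / t / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }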
11. NUMA affinity
Execution time on a single node: NPB multi-zone (Computational Fluid Dynamics) and Bloss (non-linear eigensolver). Numbers in parentheses are relative to Bare Metal.

               SP-MZ [sec]     BT-MZ [sec]     Bloss [min]
Bare Metal     94.41 (1.00)    138.01 (1.00)   21.02 (1.00)
KVM            104.57 (1.11)   141.69 (1.03)   22.12 (1.05)
KVM (w/ bind)  96.14 (1.02)    139.32 (1.01)   21.28 (1.01)

NUMA affinity is an important performance factor not only on bare metal machines but also on virtual machines.
12. NPB BT-MZ: Parallel efficiency
[Plot: NPB BT-MZ total performance (Gop/s, left axis, higher is better) and parallel efficiency (%, right axis) vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, and Amazon EC2. Degradation of PE: KVM 2%, EC2 14%.]
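For reference, parallel efficiency here follows the usual definition (our reading; the slides do not spell it out): PE(n) = Perf(n) / (n * Perf(1)) * 100%, where Perf(n) is the total performance (Gop/s for BT-MZ) on n nodes; equivalently, in terms of execution time, PE(n) = T(1) / (n * T(n)) * 100%. 100% corresponds to ideal linear scaling.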
13. Bloss: Parallel efficiency
Bloss: non-linear internal eigensolver
– Hierarchical parallel program using MPI and OpenMP
[Plot: parallel efficiency (%) vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, Amazon EC2, and Ideal. The overhead of communication and virtualization degrades PE: KVM 8%, EC2 22%.]
14. Summary
HPC Cloud is promising!
• The performance of coarse-grained parallel applications is comparable to that of bare metal machines
• We plan to operate a private cloud service
“AIST Cloud” for HPC users
• Open issues
– VMM noise reduction
– VMM-bypass device-aware VM scheduling
– Live migration with VMM-bypass devices
16. Bloss: Parallel efficiency
Bloss: non-linear internal eigensolver
– Hierarchical parallel program using MPI and OpenMP
[Plot: parallel efficiency (%) vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, KVM (w/ bind), Amazon EC2, and Ideal.]
Binding threads to physical CPUs can make an application sensitive to VMM noise and degrade its performance.