Thought Leadership White Paper
IBM Systems and Technology Group November 2012
Could the “C” in HPC
stand for Cloud?
By Christopher N. Porter, IBM Corporation
porterc@us.ibm.com
Introduction
Most IaaS (infrastructure as a service) vendors such as
Rackspace, Amazon and Savvis use various virtualization
technologies to manage the underlying hardware they
build their offerings on. Unfortunately the virtualization
technologies used vary from vendor to vendor and are
sometimes kept secret. Therefore, the question about
virtual machines versus physical machines for high
performance computing (HPC) applications is germane
to any discussion of HPC in the cloud.
This paper examines aspects of computing important in HPC
(compute and network bandwidth, compute and network
latency, memory size and bandwidth, I/O, and so on) and how
they are affected by various virtualization technologies. The
benchmark results presented will illuminate areas where cloud
computing, as a virtualized infrastructure, is sufficient for some
workloads and inappropriate for others. In addition, the paper
provides a quantitative assessment of the performance
differences between a sample of applications running on
various hypervisors so that data-based decisions can be made
for datacenter and technology adoption planning.
A business case for HPC clouds
HPC architects have been slow to adopt virtualization
technologies for two reasons:
1. The common assumption that virtualization impacts
application performance so severely that any gains in
flexibility are far outweighed by the loss of application
throughput.
2. Utilization on traditional HPC infrastructure is very high
(between 80 and 95 percent). Therefore, the typical driving
business cases for virtualization (for example, utilization of
hardware, server consolidation or license utilization) simply
did not hold significant enough merit to justify the added
complexity and expense of running workload in virtualized
resources.
In many cases, however, HPC architects would be willing to
lose some small percentage of application performance to
achieve the flexibility and resilience that virtual machine based
computing would allow. There are several reasons architects
may make this compromise, including:
• Security: Some HPC environments require data and host
isolation between groups of users or even between the users
themselves. In these situations VMs and VLANs can be used
in concert to isolate users from each other and restrict data to
the users who should have access to it.
• Application stack control: In a mixed application
environment where multiple applications share the same
physical hardware, it can be difficult to satisfy the
configuration requirements of each application, including OS
versions, updates and libraries. Using virtualization makes that
task easier since the whole stack can be deployed as part of the
application.
• High value asset maximization: In a heterogeneous HPC
system the newest machines are often in highest demand. To
manage this demand, some organizations use a reservation
system to minimize conflicts between users. When using VMs
for computing, however, the migration facility available within
most hypervisors allows opportunistic workloads to use high
value assets even after a reservation window opens for a
different user. If the reserving user submits workload against a
reservation, then the opportunistic workload can be migrated
to other assets to continue processing without losing any CPU
cycles.
• Utilization improvement: If the losses in application
performance are very small (single digit percentages), then
adoption of virtualization technology may enable incremental
steps forward in overall utilization in some cases. In these
cases, virtualization may offer an increase in overall HPC
throughput for the HPC environment.
• Large execution time jobs: Several HPC applications offer
no checkpoint restart capability. VM technology can capture
and checkpoint the entire state of the virtual machine,
however, allowing these applications to be checkpointed. If job
run times approach the mean time between failures (MTBF) of
the solution as a whole, then the checkpoint facility of virtual
machines may be very attractive. Additionally, if server
maintenance is a common or predictable occurrence, then
checkpoint migration or suspension of a long running job
within a VM could prevent loss of compute time.
• Increases in job reliability: Virtual machines, if used on a 1:1
basis with batch jobs (meaning each job runs within a VM
container), provide a barrier between their own environment,
the host environment and any other virtual machine
environments running on the hypervisor. As such, “rogue”
jobs that try to access more memory or CPU cores than
expected can be isolated from well-behaved jobs that were
allocated resources as expected. Without virtual machine
containment, jobs sharing a physical host often cause
problems in the form of slowdowns, swapping or even
OS crashes.
Management tools
Achieving HPC in a cloud environment requires a few well
chosen tools including a hypervisor platform, workload
manager and an infrastructure management toolkit. The
management toolkit provides the policy definition,
enforcement, provisioning management, resource reservation
and reporting. The hypervisor platform provides the
foundation for the virtual portion of cloud resources and the
workload manager provides the task management.
The cloud computing management tools of IBM® Platform
Computing™—IBM® Platform™ Cluster Manager –
Advanced Edition and IBM® Platform™ Dynamic Cluster—
turn static clusters, grids and datacenters into dynamic shared
computing environments. The products can be used to create
private internal clouds or hybrid private clouds, which use
external public clouds for peak demand. This is commonly
referred to as “cloud bursting” or “peak shaving.”
Platform Cluster Manager – Advanced Edition creates a cloud
computing infrastructure to efficiently manage application
workloads applied to multiple virtual and physical platforms. It
does this by uniting diverse hypervisor and physical
environments into a single dynamically shared infrastructure.
Although this document describes the properties of virtual
machines, Platform Cluster Manager – Advanced Edition is
not in any way limited to managing virtual machines. It
unlocks the full computing potential lying dormant in existing
heterogeneous virtual and physical resources according to
workload-intelligent and resource-aware policies.
Platform Cluster Manager – Advanced Edition optimizes
infrastructure resources dynamically based on perceived
demand and critical resource availability using an API or a web
interface. This allows users to enjoy the following business
benefits:
• By eliminating silos, resource utilization can be improved
• Batch job wait times are reduced because of additional
resource availability or flexibility
• Users perceive a larger resource pool
• Administrator workload is reduced through multiple layers of
automation
• Power consumption and server proliferation are reduced
Subsystem benchmarks
Hardware environment and settings
KVM and OVM testing
Physical hardware: (2) HP ProLiant BL465cG5 with Dual
Socket Quad Core AMD 2382 + AMD-V and 16 GB RAM
OS Installed: RHEL 5.5 x86_64
Hypervisor(s): KVM in RHEL 5.5, Oracle VM (OVM) 2.2, RHEL 5.5 Xen
(para-virtualized)
Number of VMs per physical node: Unless otherwise noted,
benchmarks were run on a 4 GB memory VM.
Interconnects: The interconnect between VMs or hypervisors
was never used to run the benchmarks. The hypervisor hosts
were connected to a 1000baseT network.
Citrix Xen testing
Physical hardware: (2) HP ProLiant BL2x220c in a c3000
chassis with dual socket quad core 2.83 GHz Intel® CPUs and
8 GB RAM
OS Installed: CentOS Linux 5.3 x86_64
Storage: Local Disk
Hypervisor: Citrix Xen 5.5
VM Configuration: (Qty 1) 8 GB VM with 8 cores, (Qty 2) 4
GB VMs with 4 cores, (Qty 4) 2 GB VMs with 2 cores, (Qty 8)
1 GB VMs with 1 core
NetPIPE
NetPIPE is an acronym that stands for Network Protocol
Independent Performance Evaluator.1
It is a useful tool for
measuring two important characteristics of networks: latency
and bandwidth. HPC application performance is becoming
increasingly dependent on the interconnect between compute
servers. Because of this trend, not only does parallel application
performance need to be examined, but also the performance
level of the network alone from both the latency and the
bandwidth standpoints.
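The core of such a measurement can be sketched with a simple
TCP ping-pong test. The sketch below is an illustration only,
not NetPIPE's actual implementation; the peer address, port and
message sizes are assumptions, and the peer is assumed to run a
matching echo server.

import socket
import time

HOST, PORT = "10.0.0.2", 5001      # peer address/port: assumptions

def pingpong(sock, size, trials):
    """Send `size` bytes and wait for the full echo, `trials` times."""
    buf = b"x" * size
    start = time.time()
    for _ in range(trials):
        sock.sendall(buf)
        seen = 0
        while seen < size:         # read until the complete echo returns
            seen += len(sock.recv(65536))
    return time.time() - start

sock = socket.create_connection((HOST, PORT))
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # no batching

# Small messages: half the average round-trip time approximates latency.
t = pingpong(sock, 1, 1000)
print(f"latency  ~ {t / 1000 / 2 * 1e6:.1f} usec")

# Large messages: bytes moved per second approximates bandwidth.
size, reps = 4 * 1024 * 1024, 10
t = pingpong(sock, size, reps)
print(f"bandwidth ~ {2 * size * reps / t / 1e6:.1f} MB/s")

NetPIPE itself sweeps a range of message sizes between these two
extremes; the pattern above is the essence of the measurement.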
The terms used for each data series in this section are defined
as follows:
• no_bkpln: Refers to communications happening over a
1000baseT Ethernet network
• same_bkpln: Refers to communications traversing a
backplane within a blade enclosure
• diff_hyp: Refers to virtual machine to virtual machine
communication occurring between two separate physical
hypervisors
• pm2pm: Physical machine to physical machine
• vm2pm: Virtual machine to physical machine
• vm2vm: Virtual machine to virtual machine
Figures 1 and 2 illustrate that the closer the two entities
communicating are, the higher the bandwidth and lower the
latency between them. Additionally they show that when there
is a hypervisor layer between the entities, the communication is
slowed only slightly, and latencies stay in the expected range
for 1000baseT communication (60 - 80 µsec). When two
different VMs on separate hypervisors communicate—even
when the backplane is within the blade chassis—the latency is
more than double. The story gets even worse (by about 50
percent) when the two VMs do not share a backplane and
communicate over TCP/IP.
This benchmark illustrates that not all HPC workloads are
suitable for a virtualized environment. When applications run
in parallel and are latency sensitive (as many MPI based
applications are), virtualized resources are best avoided. If
there is no choice but to use
virtualized resources, then the scheduler must have the ability
to choose resources that are adjacent to each other on the
network or the performance is likely to be unacceptable. This
conclusion also applies to transactional applications where
latency can be the largest part of the ‘submit to receive cycle
time.’
Figure 1: Network bandwidth between machines
Figure 2: Network latency between machines
IOzone
IOzone is a file system benchmarking tool, which generates
and measures a variety of file operations.2
In this benchmark,
IOzone was only run for write, rewrite, read and reread to
mimic the most popular functions an I/O subsystem performs.
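These four operations can be mimicked with the following sketch
(an illustration only, not IOzone itself; the file path, file
size and record size are assumptions):

import os
import time

PATH = "/scratch/iotest.dat"   # test file location: an assumption
SIZE = 1 << 30                 # 1 GB file (illustrative; the runs below used 32 GB)
REC = 1 << 20                  # 1 MB record size

def timed(label, flags):
    """Time one pass over the file, writing or reading REC-sized records."""
    data = b"\0" * REC
    start = time.time()
    fd = os.open(PATH, flags)
    if flags & os.O_WRONLY:
        for _ in range(SIZE // REC):
            os.write(fd, data)
        os.fsync(fd)           # make sure the writes actually reach disk
    else:
        while os.read(fd, REC):
            pass
    os.close(fd)
    print(f"{label:8s}: {SIZE / (time.time() - start) / 1e6:.0f} MB/s")

timed("write",   os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
timed("rewrite", os.O_WRONLY)  # second write pass over the existing file
timed("read",    os.O_RDONLY)
timed("reread",  os.O_RDONLY)  # may be served from the page cache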
This steady-state I/O test clearly demonstrates that the KVM
hypervisor is severely lacking when it comes to disk I/O, in
both reads and writes. Even in the best case, OVM I/O
performance approaches 40 percent degradation. Write
performance for Citrix Xen is also limited. However, read
performance exceeds that of the physical machine by over
7 percent. This can only be attributed to a read-ahead function
in Xen, which worked better than the native Linux read-ahead
algorithm.
Figure 3: IOzone 32 GB file (Local disk)
Figure 4: IOzone 32 GB file (Local disk)
Regardless, this benchmark, more than others, provides a
warning to early HPC cloud adopters of the performance risks
of virtual technologies. HPC users running I/O bound
applications (Nastran, Gaussian, certain types of ABAQUS
jobs, and so on) should steer clear of virtualization until these
issues are resolved.
Application benchmarks
Software compilation
Compiler used: gcc-4.1.2
Compilation target: Linux kernel 2.6.34 (with the ‘defconfig’ target).
All transient files were put in a run-specific subdirectory using
the ‘O’ option in make. Thus the source tree is kept read-only
and all writes go into the run-specific subdirectory.
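As a rough illustration of this build setup, a timing harness
might look like the following sketch (the paths are hypothetical;
‘defconfig’ and the ‘O=’ option are standard kernel build
mechanics):

import os
import subprocess
import time

SRC = "/nfs/src/linux-2.6.34"   # read-only source tree (hypothetical path)
OBJ = "/tmp/kbuild-run1"        # run-specific output directory (hypothetical)

os.makedirs(OBJ, exist_ok=True)

# Generate the default configuration into OBJ, leaving SRC untouched.
subprocess.run(["make", "-C", SRC, f"O={OBJ}", "defconfig"], check=True)

# Time the build itself; all transient files land under OBJ.
start = time.time()
subprocess.run(["make", "-C", SRC, f"O={OBJ}", "-j8"], check=True)
print(f"elapsed: {time.time() - start:.1f} s")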
Figure 5 shows the difference in compilation performance for a
physical machine running a compile on an NFS volume
compared to Citrix Xen doing the same thing on the same
NFS volume. Citrix Xen is roughly 11 percent slower than the
physical machine performing the task. Also included is the
difference between compiling to a local disk target versus
compiling to the NFS target on the physical machine. The
results illustrate how NFS performance can significantly affect
a job’s elapsed time. This is of crucial importance because most
virtualized private cloud implementations utilize NFS as the
file system instead of using local drives to facilitate migration.
SIMULIA® Abaqus
SIMULIA® Abaqus3
is the standard of the manufacturing
industry for implicit and explicit non-linear finite element
solutions. SIMULIA publishes a benchmark suite that
hardware vendors use to distinguish their products.4 The
“e2” and “s6” models were used for these benchmarks.
Figure 5: Compilation of kernel 2.6.34
Figure 6: Parallel ABAQUS explicit (e2.inp)
The ABAQUS explicit distributed parallel runs were
performed using HP MPI (2.03.01) and scratch files were
written to local scratch disk. This comparison, unlike the
others presented in this paper, was done in two different ways:
1. The data series called “Citrix” is for a single 8 GB RAM VM
with 8 cores where the MPI ranks communicated within a
single VM.
2. The data series called “Citrix – Different VMs” represents
multiple separate VMs defined on the hypervisor host
intercommunicating.
Figure 7: Parallel ABAQUS standard (s6.inp)
As expected, the additional layers of virtualized networking
slowed the communication speeds (also shown in the NetPIPE
results) and reduced scalability when the job had higher rank
counts. In addition, when all communication stayed within a
single VM, virtual machine performance was almost identical
to that of the physical machine.
ABAQUS has a different algorithm for solving implicit Finite
Element Analysis (FEA) problems, called “ABAQUS Standard.”
This method does not run distributed parallel, but can be run
SMP parallel, which was done for the “s6” benchmark.
Typically ABAQUS Standard does considerably more I/O to
scratch disk than its explicit counterpart. However, this is
dependent upon the amount of memory available in the
execution environment. It is clear again that when an
application is only CPU or memory constrained, a virtual
machine has almost no detectable performance impact.
ANSYS® FLUENT
ANSYS® FLUENT5
belongs to a large class of HPC
applications referred to as computational fluid dynamics (CFD)
codes. The “aircraft_2m” FLUENT model was selected based
on size and run for 25 iterations. The “sedan_4m” model was
chosen as a suitably sized model for running in parallel; one
hundred iterations were performed using this model.
Figure 8: Serial FLUENT 12.1
Figure 9: Distributed parallel FLUENT 12.1 (sedan_4m - 100 iterations)
Though CFD codes such as FLUENT are rarely run serially
because of memory requirements or solution time
requirements, the comparison in Figure 8 shows that the
solution times for a physical machine and a virtual machine
differ by only 1.9 percent, with the virtual machine being the
slower of the two. The “aircraft_2m” model was simply too
small to scale well in parallel, and provided strangely varying
results, so the sedan_4m model was used.6
The results for the parallel case (Figure 9) illustrate that at two
CPUs the virtual machine outperforms the physical machine.
This is most likely caused by the native Linux scheduler
moving processes around on the physical host. If the
application had been bound to particular cores, then this effect
would disappear. In the four and eight CPU runs the difference
between physical and virtual machines is negligible. This
supports the theory that the Linux CPU scheduler is impacting
the two CPU job.
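A minimal sketch of such core binding on Linux follows; the
RANK environment variable is a hypothetical stand-in for
whatever rank identifier the launcher actually provides.

import os

# Choose a core from an illustrative per-process rank and pin the
# calling process (pid 0) to it, so the scheduler cannot migrate it.
rank = int(os.environ.get("RANK", "0"))   # hypothetical variable name
core = rank % os.cpu_count()
os.sched_setaffinity(0, {core})

print(f"pid {os.getpid()} pinned to core {core}")
# ...solver runs here without being moved between cores...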
LS-DYNA®
LS-DYNA®7
is a transient dynamic finite element analysis
program capable of solving complex real world time domain
problems on serial, SMP parallel, and distributed parallel
computational engines. The “refined_neon_30ms” model was
chosen for the benchmarks reviewed in this section. HP MPI
2.03.01, now owned by IBM Platform Computing, was the
message-passing library used.
Figure 10: LS-DYNA - MPP971 - Refined Neon
The MPP-DYNA application responds well when run in a low
latency environment. This benchmark supports the notion that
distributed parallel LS-DYNA jobs are still very sensitive to
network latency, even when all message traffic stays within a
single VM. A serial run shows a virtual machine is 1 percent
slower. Introduce message passing, however, and at eight CPUs
the virtual machine is nearly 40 percent slower than the physical
machine. The expectation is that if the same job were run across
multiple VMs, as was done for the ABAQUS Explicit parallel
jobs, the effect would be even greater, with physical machines
significantly outperforming virtual machines.
Conclusion
As with most legends, there is some truth to the notion that
VMs are inappropriate for HPC applications. The benchmark
results demonstrate that latency sensitive and I/O bound
applications would perform at levels unacceptable to HPC
users. However, the results also show that CPU and memory
bound applications and parallel applications that are not
latency sensitive perform well in a virtual environment. HPC
architects who dismiss virtualization technology entirely may
therefore be missing an enormous opportunity to inject
flexibility and even a performance edge into their HPC
designs.
The power of Platform Cluster Manager – Advanced Edition
and IBM® Platform™ LSF® is their ability to work in concert
to manage both of these types of workload simultaneously in a
single environment. These tools allow their users to maximize
resource utilization and flexibility through provisioning and
control at the physical and virtual levels. Only IBM Platform
Computing technology allows for environment optimization at
the job-by-job level, and only Platform Cluster Manager –
Advanced Edition continues to optimize that environment
after jobs have been scheduled and new jobs have been
submitted. Such an environment could realize orders of
magnitude increases in efficiency and throughput while
reducing the overhead of IT maintenance.
Significant results
• The KVM hypervisor significantly outperforms the OVM hypervisor on AMD servers, especially when several VMs run
simultaneously.
• Citrix Xen I/O read and rereads are very fast on Intel servers.
• OVM outperforms KVM by a significant margin for I/O intensive applications running on AMD servers.
• I/O intensive and latency sensitive parallel applications are not a good fit for virtual environments today.
• Memory and CPU bound applications are at performance parity between physical and virtual machines.