DSMP

Addressing the limitations of Message Passing Interface
with a unique Distributed Shared Memory Application

By Peter Robinson
Symmetric Computing
Venture Development Center
University of Massachusetts - Boston
Boston MA 02125
Overview
Today, the language-independent communications protocol Message Passing Interface (MPI) is the de
facto standard for most supercomputers. However, problems solved on these clusters must be decomposed
to fit within the physical limitations of the individual nodes and modified to accommodate the cluster's
hierarchy and messaging scheme. As of 1Q10, the quad-socket Symmetric Multiprocessing (SMP) nodes
that make up these MPI clusters can support 24 x86 cores and 128GB of memory. Even so, addressing
problems with big data-sets remains impractical under the MPI-1 model, which has no shared-memory
concept, and MPI-2 offers only a limited distributed shared memory (DSM) concept. Even an MPI-2
cluster built from these state-of-the-art SMP nodes cannot support problems with very large data-sets
without significant restructuring of the application and its associated data-sets, and even after a
successful port, many programs suffer poor performance due to MPI hierarchy and message latency.
The Symmetric Computing Distributed Symmetric Multiprocessing™ (DSMP™) Linux kernel
enhancement enables Distributed Shared Memory (DSM), also called a distributed global address space
(DGAS), across an InfiniBand-connected cluster of Symmetric Multiprocessing (SMP) nodes at
breakthrough price/performance. The enhancement transforms a homogeneous cluster into a DSM/DGAS
supercomputer that can service very large data-sets, or accommodate legacy MPI applications with
increased efficiency and throughput via application utilities that support MPI over shared memory.
DSMP is poised to displace message passing as the protocol of choice for a wide range of
memory-intensive applications because it services a wider class of problems with greater efficiency.
DSMP comprises two distinct operating-system images: a host OS that runs on the head-node, and a
lightweight micro-kernel OS that runs on all other servers in the cluster. The host OS consists of a
Linux image plus a new DSMP kernel, creating a new derivative work. The micro-kernel is a non-Linux
operating system that extends the function of the host OS over the entire cluster. Both images are
designed to run on commodity SMP servers based on either the AMD or Intel direct-connect
architecture, i.e., the AMD Opteron™ or Intel Nehalem™ processor. At the kernel level, DSMP enables
a shared-memory software architecture that scales to hundreds of thousands of cores on commodity
hardware and InfiniBand. The key features that enable this scalable DSM architecture are:
• A DSMP kernel-level enhancement that delivers significantly lower latency and improved
bandwidth, making a DSM/DGAS architecture both practical and possible;
• A transactional distributed shared-memory system, which allows the architecture to scale to
thousands of nodes with little to no impact on global-memory access times;
• An intelligent, optimized InfiniBand driver that leverages the HCA's native Remote Direct
Memory Access (RDMA) logic, reducing global memory-page access times to under 5 microseconds;
• An application-driven, memory-page coherency scheme that simplifies porting applications and
lets the programmer focus on optimizing performance rather than decomposing the application to
accommodate the limitations of the message-passing interface.
MPI vs. DSM Supercomputer vs. the DSMP™ Cluster
As stated earlier, although MPI clusters are the de facto platform of choice, data-sets in bioinformatics,
Oil & Gas, atmospheric modeling, etc. are becoming too large for the SMP nodes that make up the cluster,
and in many cases it is impractical and inefficient to decompose the large data-sets. The alternatives are to
proceed anyway with a full restructuring of the problem and suffer the inefficiencies, or to purchase time
on a University or National Lab DSM supercomputer. The problem with the DSM supercomputer
approach is the prohibitive cost of the hardware and the lengthy queue-time incurred to access an NSF/DoE
DSM supercomputer. Additionally, the system requirements of a researcher looking to model a physical
system or assemble/align a DNA sequence are quite different from enterprise computing. In short, they
don't need a hardened, enterprise-class, nine-9's-reliable platform with virtualization support, because
their applications are primarily single-process, multiple-thread. In addition, they are more than willing to
optimize their applications for the target hardware to get the most out of the run. Ultimately they want
unencumbered 24/7 access to an affordable DSM supercomputer – just like their MPI cluster. The table
below summarizes the differences between the three approaches.

                                          MPI       DSM            DSMP™
                                          Cluster   Supercomputer  Cluster
  Commodity Hardware                      Yes       No             Yes
  Support for DSM                         Limited   Yes            Yes
  Intelligent Platform Mngt. Interface    Yes       Yes            Yes
  Virtualization support                  No        Yes            No
  Static partition for multi-app support  No        Yes            Planned
  Affordability factor                    $         $$$$$          $
  Incrementally Expandable                Yes       No             Yes
  Support for >10K cores                  Yes       No             Yes
Enter Symmetric Computing
The design team of Symmetric Computing came out of the research community. As such, they are very
aware of the problems researchers face today and in the future. This awareness drove the development of
DSMP™ and the need to leverage commodity hardware to implement a DSM/DGAS supercomputer. Our
intent is nothing short of having DSMP do for shared-memory supercomputing what the Beowulf project
(MPI) did for cluster-computing, which enabled thousands of researchers and Universities to solve
massively complex problems on commercially available hardware – Supercomputing for the Masses.
How DSMP™ works
As stated in the introduction, DSMP™ is software that transforms an InfiniBand-connected cluster of
homogeneous 1U/4P commodity SMP servers into a shared-memory supercomputer. Although there are
two distinct kernels (a host-kernel and a micro-kernel), for this discussion we will ignore the difference
between them because, from the programmer's perspective, there is only one OS image and one kernel.
The DSMP kernel provides six (6) enhancements that enable this transformation; they are:
1. A transactional distributed shared-memory system;
2. Optimized InfiniBand drivers;
3. An application-driven, memory-page coherency scheme;
4. An enhanced multi-threading service;
5. Support for a distributed MuTeX;
6. A memory-based distributed disk-queue.
The transactional distributed shared-memory system: The centerpiece of DSMP is its transactional
distributed shared-memory architecture, which is based on a two-tier memory-page scheme (Local and
Global) and tables that support the transactional memory architecture. Just two years ago, such an
approach would have seemed inefficient and a poor use of memory. Today, however, memory is the most
abundant resource within the modern server, and leveraging this resource to implement a shared-memory
architecture is not only practical but very cost effective.

The transactional global shared memory is concurrently accessible and available to all the processors in
the system. This memory is uniformly distributed over each server in the cluster and is indexed via tables
that maintain a linear view of the global memory. Local memory contains copies of what is in global
memory – acting as a memory-page cache that provides temporal locality for program and data. All
processor code execution and data reads occur only from the local memory (cache).
[Figure: Data-flow view of the Trio™ departmental supercomputer. Three SMP nodes (SMP 0, SMP 1,
SMP 2) each contribute 128GB to a global transactional shared memory managed in 4,096-byte pages.
Each node holds a 16-32GB local memory (cache) managed in 64-byte cache lines, connects to its
neighbors via InfiniBand links (IB 1, IB 2, IB 3), and has GbE plus 3 x 1TB HDD per server.]
Shown above is a data-flow view of the Trio™ departmental supercomputer, which is based on three 1U-4P
servers with 48 or 72 processor cores and up to 128 gigabytes of physical memory per node. The
three nodes in Trio are connected via 40Gb InfiniBand and there is no switch. The size of the local
memory is set at boot time but is typically between one (1) and two (2) GB per core or greater
(application driven). When there is a page-fault in local memory, the DSMP kernel finds an appropriate
least recently used (LRU) 4K memory-page in local memory and swaps in the missing global-memory
page; this happens in just under 5 microseconds. The large local memory (cache) provides all the
performance benefits (STREAMS and Linpack) of local memory in a legacy SMP server, with the added
benefit of spatial locality to a large globally shared memory that is less than 5 microseconds away. Not
only is this architecture unique and extremely powerful, it can scale to hundreds and even thousands of
nodes with no appreciable loss in performance; so long as a globally shared memory-page remains under
5 microseconds away, performance continues to scale.
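The page-fault path above can be pictured with a small sketch. This is purely an illustrative user-space model, not DSMP's kernel code: page numbers stand in for 4K global pages, and a tiny slot array with use-stamps stands in for the real local-memory tables and the RDMA swap.

```c
/* Illustrative model of the local-memory page cache described above.
 * On a miss, the least recently used slot is evicted and the missing
 * "global" page is swapped in (the <5us fetch in the real system). */
#define SLOTS 4  /* tiny cache, just for illustration */

struct page_cache {
    long page[SLOTS];           /* global page held by each slot (-1 = empty) */
    unsigned long stamp[SLOTS]; /* last-use time, for LRU selection */
    unsigned long clock;
};

void cache_init(struct page_cache *c) {
    for (int i = 0; i < SLOTS; i++) { c->page[i] = -1; c->stamp[i] = 0; }
    c->clock = 0;
}

/* Access one global page; returns 1 on a local hit, 0 on a swap-in. */
int page_access(struct page_cache *c, long page) {
    int lru = 0;
    c->clock++;
    for (int i = 0; i < SLOTS; i++) {
        if (c->page[i] == page) { c->stamp[i] = c->clock; return 1; }
        if (c->stamp[i] < c->stamp[lru]) lru = i;
    }
    c->page[lru] = page;        /* evict the least recently used slot */
    c->stamp[lru] = c->clock;
    return 0;
}
```

As in the real system, a working set that fits within the cache runs entirely out of local memory; only the first touch of each page pays the swap cost.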
The Optimized InfiniBand Drivers: The success of DSMP™ revolves around the availability of a
low-latency, commercial network fabric. It wasn't that long ago, with the exit of Intel from InfiniBand,
that industry experts were forecasting its demise. Today InfiniBand is the fabric of choice for most High
Performance Computing (HPC) clusters, due to its low latency and high bandwidth.

To squeeze every last nanosecond of performance out of the fabric, the designers of DSMP bypassed the
Linux InfiniBand protocol stack and wrote their own low-level drivers. In addition, they developed a set
of drivers that leverage the native RDMA capabilities of the InfiniBand host channel adapter (HCA).
This allows the HCA to service and move memory-page requests without processor intervention. RDMA
thus eliminates the overhead of message construction and deconstruction, reducing system-wide latency.
An application driven, memory page coherency scheme: As stated in the introduction, proprietary
supercomputers maintain memory consistency and/or coherency via a hardware extension of the host
processor's coherency scheme. DSMP, being a software solution based on a local and global memory
resource, had to take a different approach. Coherency within the local memory of each individual
SMP server is maintained by the AMD64 Memory Management Unit (MMU) on a cache-line basis.
Global-memory page coherency and consistency is maintained under program control, shifting
responsibility from hardware to the programmer or the application. This approach may seem
counterintuitive at first, but it is the most efficient way to implement system-wide coherency on
commodity hardware. Again, the target market segment for DSMP is technical computing, not enterprise
computing, and in most cases the end user is familiar with their algorithm and how to optimize it for the
target platform - in the same way code was optimized for a Beowulf MPI cluster. The fact that most
applications are open-source, combined with the high skill level of the end users, drove system-level
decisions that have kept DSMP-based clusters affordable, fast and scalable. In cases where the end user
is not computer literate, or does not have access to a staff of computer scientists, Symmetric Computing
can either provide access to open-source algorithms already optimized for DSMP or work with your team
to modify the algorithm for you.
To assist the programmer in maintaining memory consistency, a set of DSMP-specific primitives was
developed. These primitives, combined with some simple, intuitive programming rules, make porting an
application to a multi-node DSMP platform simple and manageable. Those rules are as follows:
• Be sensitive to the fact that memory-pages are swapped into and out of local memory (cache)
from global memory in 4K pages and that it takes under 5 microseconds to complete the swap.
• Be careful not to overlap or allocate multiple data-sets within the same memory-page. To help
prevent this, a new malloc( ) function, a_malloc( ), is provided to assure alignment on a 4K
boundary and to avoid false sharing.
• Because of the way local and global memory are partitioned within the physical memory, care
should be taken to distribute processes/threads and their associated data-sets evenly over the four
processors while maintaining temporal locality to data. How you do this is a function of the target
application. In short, take care not only to distribute threads but to ensure some level of
data locality. DSMP™ supports full POSIX conformance to simplify parallelization of threads.
• If a data-structure is "modified-shared" and is accessed by multiple processes/threads on an
adjacent server, it will be necessary to use a set of new primitives to maintain memory
consistency: Sync( ), Lock( ) and Release( ). These three primitives are provided to simplify the
implementation of system-wide memory consistency.
- Sync( ) synchronizes a local data-structure with its parent in global memory.
- Lock( ) prevents any other process/thread from accessing and subsequently modifying the
noted data-structure. Lock( ) also invalidates all other copies of the data-structure (memory-
pages) within the computing system. If a process/thread on an adjacent computing device
accesses a memory-page associated with a locked data-structure, execution is suspended until
the structure (memory-page) is released.
- Release( ) unlocks a previously locked data-structure.
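The rules above can be sketched in code. Note the heavy caveat: this paper does not give the actual signatures of a_malloc( ), Sync( ), Lock( ) or Release( ), so the definitions below are hypothetical stand-ins written only so the usage pattern compiles; on a real DSMP system the kernel-provided primitives would replace them.

```c
/* HYPOTHETICAL sketch of the porting rules. a_malloc, Lock, Sync and
 * Release are stand-ins with invented signatures, not the DSMP APIs. */
#include <stdlib.h>

#define PAGE 4096

/* Stand-in for DSMP's a_malloc( ): round up to whole 4K pages so two
 * data-sets never share a memory-page (avoids false sharing). */
void *a_malloc(size_t n) {
    return aligned_alloc(PAGE, (n + PAGE - 1) / PAGE * PAGE);
}

/* No-op stand-ins for the coherency primitives. */
void Lock(void *d)    { (void)d; }  /* would invalidate remote copies   */
void Sync(void *d)    { (void)d; }  /* would publish pages to global mem */
void Release(void *d) { (void)d; }  /* would wake suspended threads      */

/* Updating a "modified-shared" structure follows the paper's pattern:
 * lock, mutate the local (cached) copy, sync to global, release. */
void update_shared(double *table, size_t n, double value) {
    Lock(table);
    for (size_t i = 0; i < n; i++)
        table[i] = value;
    Sync(table);
    Release(table);
}
```

The page-aligned allocation is the part that matters even on one node: it guarantees each data-set starts on its own 4K boundary, so a swap of one structure's pages never drags along a neighbor's data.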
NOTE: If your application is single-thread, or is already parallelized for OpenMP or Pthreads, it will
run on a DSMP™ system (such as Trio™) without modification. The only limitation is that a
multi-thread application may not be able to take advantage of all the processor cores on the adjacent
worker nodes (only the processor cores on the head-node). Hence, in the case of Trio™ you will have
access to all 16/24 processor cores on the head-node and up to 336GB of global shared memory with full
coherency support. This ability to run your single-thread or OpenMP/Pthreads applications, taking full
advantage of DSMP™ transactional distributed shared memory without modifying your source, provides
a systematic approach to full parallelization. It should be noted that the current implementation of DSMP
does not support MPI applications. Sometime in 2H10, we plan to release a wrapper that will support MPI
over a DSMP shared-memory architecture.
Multi-Threading: The "gold standard" for parallelizing C/C++ or Fortran source code is OpenMP and
the POSIX thread library, or Pthreads. POSIX is an acronym for Portable Operating System Interface.
The latest version, POSIX.1 (IEEE Std 1003.1, 2004 Edition), was developed by the Austin Common
Standards Revision Group (CSRG). To ensure that Pthreads would work with DSMP, each of the two
dozen or so POSIX threading routines was tested and/or modified for DSMP and the Trio™ platform, so
that the POSIX.1 standard is supported in its entirety.
Distributed MuTeX: A MuTeX, or mutual-exclusion lock, is used in concurrent programming to avoid
the simultaneous use of a common resource, such as a global variable or a critical section. A distributed
MuTeX is nothing more than a DSMP kernel enhancement that ensures a MuTeX functions as expected
within the DSMP multi-node system. From a programmer's point of view, there are no changes or
modifications to the MuTeX – it just works.
Memory based distributed disk-queue: DSMP provides a high-bandwidth, low-latency elastic queue for
data that is intended to be written to a low-bandwidth interface, such as a Hard Disk Drive (HDD) or
the network. This distributed input/output queue is a memory (DRAM) based elastic storage buffer
that effectively eliminates the bottlenecks that occur when multiple threads compete for a low-bandwidth
resource.
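The idea behind such an elastic buffer can be sketched with a classic bounded producer/consumer queue. This is only an illustration of the concept, assuming a fixed ring buffer and Pthreads condition variables; DSMP's actual queue is DRAM-backed, distributed, and elastic rather than fixed-capacity.

```c
/* Concept sketch: fast producers enqueue blocks destined for a slow
 * device; a writer thread drains them later, so producers are not
 * serialized behind the HDD. Fixed-size ring, for illustration only. */
#include <pthread.h>

#define QCAP 8

struct disk_queue {
    const char *item[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
};

void dq_init(struct disk_queue *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* Producer side: returns as soon as the block is buffered in memory. */
void dq_put(struct disk_queue *q, const char *block) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)                  /* only blocks when full */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->item[q->tail] = block;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: the writer thread drains blocks to the slow device. */
const char *dq_get(struct disk_queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    const char *block = q->item[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return block;
}
```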
A Perfect-Storm
Every so often, there are advancements in technology that impact a broad swath of society, either
directly or indirectly. These advancements emerge and take hold due to three events:
1. A brilliant idea is set-upon and implemented which solves a “real-world” problem;
2. A set of technologies have evolved to a point that enable the idea and;
3. The market segment which directly benefits from the idea is ready to accept it.
The enablement of a high performance shared-memory architecture over a cluster is such an idea. As
implied in the previous section, the technologies that have allowed DSMP to be realized are:
• The adoption of Linux as the operating system of choice for technical computing;
• The commercial availability of a high bandwidth, low latency network fabric – InfiniBand;
• The adoption of the x86-64 processor as the architecture of choice for technical computing;
• The integration of the DRAM controller and I/O complex onto the x86-64 processor;
• The sudden and rapid increase of DRAM density with the corresponding drop in memory cost;
• The commercial availability of high-performance small form-factor SMP nodes.
If any of the above six advancements had not existed, Distributed Symmetric Multiprocessing would not
have succeeded.
DSMP Performance
Performance of a supercomputer is a function of two metrics:
1) Processor performance (computational throughput), which can be assessed with the HPCC Linpack or
FFT benchmarks, and
2) Global memory read/write performance, which can be assessed with the HPCC STREAM or
RandomAccess benchmarks.
The extraordinary thing about DSMP™ is that it is based on commodity components. That's
important because, as with MPI, DSMP performance scales with the performance of the commodity
components on which it depends. As an example, random read/write latency for DSMP went
down 40% when we moved from a 20Gb to a 40Gb fabric, with no changes to the DSMP software.
Within the same timeframe, AMD64 processor density went from quad-core to six-core,
again with only a small increase in the cost of the total system. Therefore, over time the performance gap
between a DSMP™ system and a proprietary SMP supercomputer of equivalent processor/memory density
will continue to narrow – and eventually disappear.
Looking back to the birth of the Beowulf project, an SMP supercomputer of equivalent processor/memory
density outperformed a Beowulf cluster by a factor of 100, yet MPI clusters came to
dominate supercomputing – why? The reasons are twofold:
First – price/performance: MPI clusters are affordable, from both a hardware and a software
perspective (Linux based), and they can grow in small ways, by adding a few additional PCs/servers, or
in large ways, by adding entire racks of servers. Hence, if a researcher needs more computing power,
they simply add more commodity PCs/servers.
Second – accessibility: MPI clusters are inexpensive enough that almost any University or College can
afford them, making them readily accessible for a wide range of applications. As a result, an enormous
academic resource is focused on improving MPI and the applications that run on it.
DSMP™ brings the same level of value, accessibility and grass-roots innovation to the same audience
that embraced MPI. However, the performance gap between a DSMP cluster and a legacy shared-memory
supercomputer is small, and in many applications a DSMP cluster outperforms machines that
cost 10x its price. As an example, if your application (program and data) can fit within the local memory
(cache) of Trio, you can apply 48 or 72 processors concurrently to solve your problem with no
penalty for memory accesses (all memory is local for all 48 or 72 processors). Even if your algorithm
does access global shared memory, it is only 5 microseconds away.
Today, Symmetric Computing is offering four turn-key DSMP proof-of-concept platforms, coined the
Trio™ Departmental Supercomputer line, and one dual-node system, coined Duet™. The table below lists
the various Giga-flop/memory configurations available in 2010.

  Trio™ P/N     Peak theoretical    Linpack score in     Total system  Total shared
                FP performance      local memory (75%)   memory        memory*
  SCA241604-3   749 Giga-flops      562 Giga-flops       192 GB        120 GB
  SCA241604-2   538 Giga-flops      404 Giga-flops       256 GB        208 GB
  SCA241608-3   749 Giga-flops      562 Giga-flops       384 GB        312 GB
  SCA161608-3   538 Giga-flops      404 Giga-flops       384 GB        336 GB

  * Note: In this table, local memory (cache) is set to 1GB/core – a total of 16GB for quad-core and 24GB for six-core processors.
Looking forward to 2Q10, the Symmetric Computing engineering staff will introduce a multi-node,
InfiniBand-switched system delivering almost 2 tera-flops of peak throughput with more than a
terabyte of global memory in a 10-blade chassis. In addition, we are working with our partners to deliver
turn-key platforms tuned for application-specific missions – such as HMMER, BLAST and de novo
align/assembly algorithms.
Challenges and Opportunities
Symmetric Computing's most significant challenge is to convince users that there is real value in its
DSMP Linux kernel enhancement over MPI. Much of the HPC market has become complacent with MPI
and looks for incremental improvement, i.e., MPI-1 to MPI-2, which continues to be the path of least
resistance. In addition, as MPI nodes convert to the direct-connect architectures of the AMD Opteron™
and Intel Nehalem™ processors, combined with memory densities of >256GB/node, MPI-2 with
DSM support will gradually service a wider range of applications.
Symmetric Computing will address these issues first by supplying affordable, turn-key, DSMP-enabled
hardware – such as Trio™. These platforms will be optimized to solve problems in narrow vertical
markets such as Bioinformatics at an unprecedented price-performance level. We will then enable the
wider market with larger turn-key platforms, a version of OpenMP optimized for DSMP, and APIs that
allow MPI applications to run on these turn-key shared-memory platforms with improved performance.
In the short term, Symmetric Computing will continue to focus on academic and research supercomputing.
We will continue to release turn-key platforms focused on specific verticals where we have an advantage.
Our long-term strategy is to continue to displace MPI with DSMP for applications where DSM/DGAS is
the architecture of choice – delivering Supercomputing for the Masses.
About Symmetric Computing
Symmetric Computing is a Boston-based software company with offices at the Venture
Development Center on the campus of the University of Massachusetts – Boston. We design software to
accelerate the use and application of shared-memory computing systems for Bioinformatics, Oil & Gas,
Post Production Editing, Financial analysis and related fields. Symmetric Computing is dedicated to
delivering standards-based, customer-focused technical computing solutions for users, ranging from
Universities to enterprises. For more information, visit www.symmetriccomputing.com.