



                  Addressing the limitations of
                   Message Passing Interface
                         with a unique
            Distributed Shared Memory Application




                            By Peter Robinson
                           Symmetric Computing
                        Venture Development Center
                     University of Massachusetts - Boston
                              Boston MA 02125






Overview
Today, the language-independent communications protocol Message Passing Interface (MPI) is the de
facto standard for most supercomputers. However, problems solved on these clusters must be decomposed
to fit within the physical limitations of the individual nodes and modified to accommodate the cluster's
hierarchy and messaging scheme. As of 1Q10, the quad-socket Symmetric Multiprocessing (SMP) processing
nodes which make up these MPI clusters can support 24 x86 cores and 128GB of memory. However,
addressing problems with big data-sets is still impractical for the MPI-1 model, which has no shared
memory concept, and MPI-2 has only a limited distributed shared memory (DSM) concept. Even an MPI-2
cluster based upon these state-of-the-art SMP processing nodes cannot support problems with very large
data-sets without significant restructuring of the application and associated data-sets. Even after a
successful port, many programs suffer poor performance due to MPI hierarchy and message latency.
The Symmetric Computing Distributed Symmetric Multiprocessing™ (DSMP™) Linux kernel
enhancement enables Distributed Shared Memory (DSM), or distributed global address space (DGAS),
across an InfiniBand-connected cluster of Symmetric Multiprocessing (SMP) nodes with breakthrough
price/performance. This kernel enhancement transforms a homogeneous cluster into a DSM/DGAS
supercomputer which can service very large data-sets or accommodate legacy MPI applications with
increased efficiency and throughput via application utilities that support MPI over shared memory.
DSMP is poised to displace message passing as the protocol of choice for a wide range of
memory-intensive applications because of its ability to service a wider class of problems
with greater efficiency.
DSMP comprises two operating systems: the host OS, which runs on the head node, and a lightweight
micro-kernel OS, which runs on all other servers that make up the cluster. The host
OS consists of a Linux image plus a new DSMP kernel, creating a new derivative work. The micro-kernel
is a non-Linux operating system that extends the function of the host OS over the entire cluster.
These two OS images (host and micro-kernel) are designed to run on commodity Symmetric
Multiprocessing (SMP) servers based on either the AMD or Intel direct-connect architecture, i.e., the AMD
Opteron™ processor or Intel Nehalem™ processor. DSMP enables, at the kernel level, a shared-memory
software architecture that scales to hundreds of thousands of cores based on commodity hardware and
InfiniBand. The key features that enable this scalable DSM architecture are:
     •   A DSMP kernel-level enhancement that results in significantly lower latency and improved
         bandwidth, making a DSM/DGAS architecture both practical and possible;
     •   A transactional distributed shared-memory system, which allows the architecture to scale to
         thousands of nodes with little to no impact on global-memory access times;
     •   An intelligent, optimized InfiniBand driver which leverages the HCA's native Remote Direct
         Memory Access (RDMA) logic, reducing global-memory page access times to under 5µ-seconds;
     •   An application-driven, memory-page coherency scheme that simplifies porting applications and
         allows the programmer to focus on optimizing performance rather than decomposing the application
         to accommodate the limitations of the message-passing interface.




MPI vs. DSM Supercomputer vs. the DSMP™ Cluster
As stated earlier, although MPI clusters are the de facto platform of choice, data-sets in bioinformatics,
Oil & Gas, atmospheric modeling, etc. are becoming too large for the SMP nodes that make up the cluster,
and in many cases it is impractical and inefficient to decompose the large data-sets. The alternatives are to
proceed anyway with a full restructuring of the problem and suffer the inefficiencies, or to purchase time
on a University or National Laboratory DSM supercomputer. The problem with the DSM supercomputer
approach is the prohibitive cost of the hardware or the lengthy queue time incurred to access an NSF/DoE
DSM supercomputer. Additionally, the system requirements of a researcher looking to model a physical
system or assemble/align a DNA sequence are quite different from those of enterprise computing. In short,
researchers don't need a hardened, enterprise-class, nine-9's-reliable platform with virtualization support,
because their applications are primarily single-process, multiple-thread. In addition, they are more than
willing to optimize their applications for the target hardware to get the most out of the run. Ultimately,
they want unencumbered 24/7 access to an affordable DSM supercomputer, just like their MPI cluster.
The table below summarizes the differences between the three approaches.

                                                 MPI          DSM               DSMP™
                                                 Cluster      Supercomputer     Cluster
    Commodity hardware                           Yes          No                Yes
    Support for DSM                              Limited      Yes               Yes
    Intelligent Platform Mngt. Interface         Yes          Yes               Yes
    Virtualization support                       No           Yes               No
    Static partition for multi-app. support      No           Yes               Planned
    Affordability factor                         $            $$$$$             $
    Incrementally expandable                     Yes          No                Yes
    Support for >10K cores                       Yes          No                Yes
Enter Symmetric Computing
The design team of Symmetric Computing came out of the research community. As such, they are very
aware of the problems researchers face today and in the future. This awareness drove the development of
DSMP™ and the need to leverage commodity hardware to implement a DSM/DGAS supercomputer. Our
intent is nothing short of having DSMP do for shared-memory supercomputing what the Beowulf project
(MPI) did for cluster-computing, which enabled thousands of researchers and Universities to solve
massively complex problems on commercially available hardware – Supercomputing for the Masses.

How DSMP™ works
As stated in the introduction, DSMP™ is software that transforms an InfiniBand-connected cluster of
homogeneous 1U/4P commodity SMP servers into a shared-memory supercomputer. Although there are
two unique kernels (a host-kernel and a micro-kernel), for this discussion we will ignore the difference
between them because, from the programmer's perspective, there is only one OS image and one kernel.
The DSMP kernel provides six (6) enhancements that enable this transformation:





     1. A transactional distributed shared-memory system;
     2. Optimized InfiniBand drivers;
     3. An application-driven, memory-page coherency scheme;
     4. An enhanced multi-threading service;
     5. Support for a distributed MuTeX;
     6. A memory-based distributed disk-queue.


The transactional distributed shared-memory system: The centerpiece of DSMP is its transactional
distributed shared-memory architecture, which is based on a two-tier memory-page scheme (local and global)
and tables that support the transactional memory architecture. Just two years ago, such an approach
would have seemed inefficient and a poor use of memory. However, today memory is the most abundant
resource within the modern server, and leveraging this resource to implement a shared-memory
architecture is not only practical but very cost effective.
The transactional global shared memory is concurrently accessible and available to all the processors in
the system. This memory is uniformly distributed over each server in the cluster and is indexed via tables,
which maintain a linear view of the global memory. Local memory contains copies of what is in global
memory, acting as a memory-page cache and providing temporal locality for program and data. All
processor code execution and data reads occur only from the local memory (cache).
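As a rough illustration of the indexing just described, the sketch below decomposes a linear global address into a 4,096-byte page number, an owning node, and an offset within that node, assuming pages are striped round-robin across the nodes. The striping policy and structure names are assumptions made for illustration, not the kernel's actual tables.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u            /* global pages are 4,096 bytes          */
#define NUM_NODES 3u               /* e.g., the three nodes of a Trio system */

/* Hypothetical descriptor produced by a global page-table lookup. */
struct global_page {
    uint32_t owner_node;           /* node whose DRAM holds the master copy */
    uint64_t node_offset;          /* byte offset within that node's memory */
};

/* Decompose a linear global address, assuming round-robin page striping. */
static struct global_page locate(uint64_t global_addr)
{
    uint64_t page_no = global_addr / PAGE_SIZE;
    struct global_page gp = {
        .owner_node  = (uint32_t)(page_no % NUM_NODES),
        .node_offset = (page_no / NUM_NODES) * PAGE_SIZE
                       + (global_addr % PAGE_SIZE),
    };
    return gp;
}

int main(void)
{
    struct global_page gp = locate(0x12345678ull);
    printf("node %u, offset %llu\n", gp.owner_node,
           (unsigned long long)gp.node_offset);
    return 0;
}
```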
[Figure: Data-flow view of the Trio™ departmental supercomputer. Three SMP nodes (SMP 0, SMP 1, SMP 2) each contribute 128GB to the global transactional shared memory, managed in 4,096B pages, and each hold a 16-32GB local memory (cache) managed in 64B cache lines. The nodes are directly interconnected by InfiniBand links (IB1, IB2, IB3), with GbE and 3 x 1TB HDD per server.]
Shown above is a data-flow view of the Trio™ departmental supercomputer, which is based on three 1U/4P
servers with 48 or 72 processor cores and up to 128 gigabytes of physical memory per node. The
three nodes in Trio are connected via 40Gb InfiniBand and there is no switch. The size of the local
memory is set at boot time but is typically between one (1) and two (2) GB per core or greater
(application driven). When there is a page fault in local memory, the DSMP kernel finds an appropriate
least recently used (LRU) 4K memory-page in local memory and swaps in the missing global-memory
page; this happens in just under 5µ-seconds. The large, temporally local memory (cache) provides all the
performance benefits (STREAMS and Linpack) of local memory in a legacy SMP server, with the added
benefit of spatial locality to a large, globally shared memory which is <5µ-seconds away. Not only is this
architecture unique and extremely powerful, but it can scale to hundreds and even thousands of nodes
with no appreciable loss in performance; so long as a globally shared-memory page is <5µ-seconds away,
performance continues to scale.
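To make the fault path concrete, here is a heavily simplified sketch of the logic just described: on a local miss, select a least recently used slot, write it back if dirty, and pull in the missing global page. The structures, function names and stubbed transfer routines are illustrative only; the real mechanism lives inside the DSMP kernel and uses RDMA.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NSLOTS 8u                  /* tiny cache, just for illustration */

/* Hypothetical bookkeeping for one 4K slot of local memory (cache). */
struct local_slot {
    uint64_t global_page;          /* which global page currently lives here */
    uint64_t last_used;            /* logical timestamp for LRU selection    */
    int      dirty;                /* must be written back before eviction   */
};

/* Placeholders for the RDMA transfers the DSMP kernel would issue. */
static void write_back_to_global(const struct local_slot *s)
{ printf("write back page %llu\n", (unsigned long long)s->global_page); }

static void fetch_from_global(struct local_slot *s, uint64_t page)
{ (void)s; printf("fetch page %llu into cache\n", (unsigned long long)page); }

/* Conceptual fault path: evict the least recently used local slot and
 * pull in the missing global page (the swap is reported at < 5 us). */
static void handle_local_miss(struct local_slot *cache, uint64_t missing_page,
                              uint64_t now)
{
    size_t victim = 0;
    for (size_t i = 1; i < NSLOTS; i++)
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;

    if (cache[victim].dirty)
        write_back_to_global(&cache[victim]);
    fetch_from_global(&cache[victim], missing_page);

    cache[victim].global_page = missing_page;
    cache[victim].last_used   = now;
    cache[victim].dirty       = 0;
}

int main(void)
{
    struct local_slot cache[NSLOTS] = {{0}};
    handle_local_miss(cache, 42, 1);   /* simulate one local miss */
    return 0;
}
```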

The Optimized InfiniBand Drivers: The entire success of DSMP™ revolves around the availability of a
low-latency, commercial network fabric. It wasn't that long ago, with the exit of Intel from InfiniBand,
that industry experts were forecasting its demise. Today, InfiniBand is the fabric of choice for most High
Performance Computing (HPC) clusters due to its low latency and high bandwidth.
To squeeze every last nanosecond of performance out of the fabric, the designers of DSMP bypassed the
Linux InfiniBand protocol stack and wrote their own low-level drivers. In addition, they developed a set
of drivers that leverage the native RDMA capabilities of the InfiniBand host channel adapter (HCA).
This allows the HCA to service and move memory-page requests without processor intervention.
Hence, RDMA eliminates the overhead of message construction and deconstruction, reducing system-wide
latency.

An application-driven, memory-page coherency scheme: As stated in the introduction, all proprietary
supercomputers maintain memory consistency and/or coherency via a hardware extension of the host
processor's coherency scheme. DSMP, being a software solution based on a local and global memory
resource, had to take a different approach. Coherency within the local memory of each of the individual
SMP servers is maintained by the AMD64 Memory Management Unit (MMU) on a cache-line basis.
Global-memory page coherency and consistency are maintained under program control, shifting
responsibility from hardware to the programmer or the application. This approach may seem
counterintuitive at first, but it is the most efficient way to implement system-wide coherency based on
commodity hardware. Again, the target market segment for DSMP is technical computing, not enterprise,
and in most cases the end user is familiar with their algorithm and how to optimize it for the target
platform, in the same way code was optimized for a Beowulf MPI cluster. The fact that most applications
are open source, combined with the high skill level of the end users, drove system-level decisions that
have kept DSMP-based clusters affordable, fast and scalable. In cases where the end user is not computer
literate, or does not have access to a staff of computer scientists, Symmetric Computing can either provide
access to open-source algorithms already optimized for DSMP or work with your team to modify the
algorithm for you.

To assist the programmer in maintaining memory consistency, a set of DSMP-specific primitives was
developed. These primitives, combined with some simple, intuitive programming rules, make porting an
application to a multi-node DSMP platform simple and manageable. Those rules are as follows:
    • Be sensitive to the fact that memory-pages are swapped into and out of local memory (cache)
       from global memory in 4K pages, and that it takes <5µ-seconds to complete the swap.
    • Be careful not to overlap or allocate multiple data-sets within the same memory page. To help
       prevent this, a new malloc( ) function [a_malloc( )] is provided to assure alignment on a 4K
       boundary and to avoid false sharing.
    • Because of the way local and global memory are partitioned within the physical memory, care
       should be taken to distribute processes/threads and their associated data-sets evenly over the four
       processors while maintaining temporal locality to data. How you do this is a function of the target
       application. In short, care should be taken to not only distribute threads but also ensure some level
       of data locality. DSMP™ supports full POSIX conformance to simplify parallelization of threads.
    • If there is a data-structure which is "modified-shared", and that data structure is accessed by
       multiple processes/threads on an adjacent server, then it will be necessary to use a set of
       new primitives to maintain memory consistency, i.e., Sync( ), Lock( ) and Release( ). These three
       primitives are provided to simplify the implementation of system-wide memory consistency; a
       minimal usage sketch follows this list.
       - Sync( ) synchronizes a local data-structure with its parent in global memory.
       - Lock( ) prevents any other process thread from accessing and subsequently modifying the
            noted data-structure. Lock( ) also invalidates all other copies of the data structure (memory-
            pages) within the computing system. If a process thread on an adjacent computing device
            accesses a memory-page associated with a locked data structure, execution is suspended until
            the structure (memory page) is released.
       - Release( ) unlocks a previously locked data structure.
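The fragment below sketches how these rules and primitives might look in application code. The a_malloc( ), Sync( ), Lock( ) and Release( ) calls are the ones named above, but their exact prototypes are not given in this paper, so the declarations here are assumptions and the fragment will only link against the actual DSMP library.

```c
#include <stddef.h>
#include <string.h>

/* Prototypes assumed for illustration; the real DSMP headers define them. */
void *a_malloc(size_t size);            /* 4K-aligned allocation              */
void  Sync(void *data, size_t size);    /* push local copy to global memory   */
void  Lock(void *data, size_t size);    /* invalidate remote copies           */
void  Release(void *data, size_t size); /* let suspended threads proceed      */

struct result_block {                   /* a "modified-shared" structure      */
    double values[512];                 /* 4KB, exactly one global page       */
};

/* Update a shared structure following the rules above. */
void update_shared(struct result_block *shared, const double *src)
{
    Lock(shared, sizeof *shared);       /* remote accesses now suspend        */
    memcpy(shared->values, src, sizeof shared->values);
    Sync(shared, sizeof *shared);       /* publish the change to global memory */
    Release(shared, sizeof *shared);    /* unlock the structure               */
}

int main(void)
{
    /* a_malloc() keeps the data-set on its own 4K page (no false sharing). */
    struct result_block *shared = a_malloc(sizeof *shared);
    double sample[512] = {0};
    update_shared(shared, sample);
    return 0;
}
```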
NOTE: If your application is single-threaded, or is already parallelized for OpenMP or Pthreads, then it
will run on a DSMP™ system (such as Trio™) without modification. The only limitation is that, for a
multi-threaded application, you may not be able to take advantage of all the processor cores on the adjacent
worker nodes (only the processor cores on the head node). Hence, in the case of Trio™ you will have
access to all 16/24 processor cores on the head node and up to 336GB of global shared memory with full
coherency support. This ability to run your single-threaded or OpenMP/Pthreads applications, and take full
advantage of DSMP™ transactional distributed shared memory without modifying your source, provides
a systematic approach to full parallelization. It should be noted that the current implementation of DSMP
will not support MPI applications. Sometime in 2H10, we plan to release a wrapper that will support MPI
over a DSMP shared-memory architecture.
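As a point of reference, the kind of already-parallelized code this note covers is ordinary OpenMP. The reduction below uses nothing DSMP-specific; per the note, a program like this would run on the head node without source changes.

```c
#include <omp.h>
#include <stdio.h>

/* A plain OpenMP reduction; per the note above, code like this runs on
 * the head node of a DSMP system without any source changes. */
int main(void)
{
    const long n = 100000000L;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += 1.0 / (double)(i + 1);

    printf("harmonic(%ld) = %f using %d threads\n",
           n, sum, omp_get_max_threads());
    return 0;
}
```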

Multi-Threading: The "gold standard" for parallelizing C/C++ or Fortran source code is OpenMP and
the POSIX thread library, or Pthreads. POSIX is an acronym for Portable Operating System Interface.
The latest version, POSIX.1 (IEEE Std 1003.1, 2004 Edition), was developed by the Austin Common
Standards Revision Group (CSRG). To ensure that Pthreads would work with DSMP, each of the two
dozen or so POSIX threading routines was either tested with or modified for DSMP and the Trio™ platform,
so that the POSIX.1 standard is supported in its entirety.
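Accordingly, standard Pthreads code runs unchanged; the minimal create/join example below is plain POSIX and contains nothing DSMP-specific.

```c
#include <pthread.h>
#include <stdio.h>

/* Minimal POSIX threads example; per the text, the full POSIX.1 thread
 * API is supported, so standard code like this needs no changes. */
static void *say_hello(void *arg)
{
    long id = (long)arg;
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, say_hello, (void *)i);
    for (long i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```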

Distributed MuTeX: A MuTeX, or mutual-exclusion lock, is used in concurrent programming to avoid the
simultaneous use of a common resource, such as a global variable or a critical section. A distributed
MuTeX is nothing more than a DSMP kernel enhancement that ensures a MuTeX functions as expected
within the DSMP multi-node system. From a programmer's point of view, there are no changes or
modifications to the MuTeX; it just works.
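Because the distributed MuTeX is transparent, ordinary POSIX mutex code such as the following requires no modification; the cross-node behavior is supplied by the DSMP kernel, and the example itself is plain Pthreads.

```c
#include <pthread.h>
#include <stdio.h>

/* Ordinary POSIX mutex code; under DSMP the same mutex semantics are
 * claimed to hold across nodes with no source changes. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* critical section */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* expect 400000 */
    return 0;
}
```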
Memory-based distributed disk-queue: DSMP provides a high-bandwidth, low-latency elastic queue for
data which is intended to be written to a low-bandwidth interface, such as a Hard Disk Drive (HDD) or
the network. This distributed input/output queue is a memory (DRAM) based elastic storage buffer
which effectively eliminates the bottlenecks that occur when multiple threads compete for a low-bandwidth
resource.
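Conceptually, this queue behaves like a bounded in-memory producer/consumer buffer: many threads append blocks at DRAM speed while a single drainer thread writes them to the slow device. The generic POSIX sketch below illustrates the idea only; it is not the DSMP kernel's own implementation.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define QCAP  64                               /* elastic, but bounded        */
#define BLKSZ 4096                             /* one block per 4K page       */

static char            queue[QCAP][BLKSZ];
static int             head, tail, count;
static pthread_mutex_t qlock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

void enqueue_block(const char *block)          /* called by many threads      */
{
    pthread_mutex_lock(&qlock);
    while (count == QCAP)                      /* only block when buffer full */
        pthread_cond_wait(&not_full, &qlock);
    memcpy(queue[tail], block, BLKSZ);
    tail = (tail + 1) % QCAP;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&qlock);
}

void *drain_to_disk(void *arg)                 /* single writer thread        */
{
    FILE *out = (FILE *)arg;
    char  block[BLKSZ];
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &qlock);
        memcpy(block, queue[head], BLKSZ);
        head = (head + 1) % QCAP;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&qlock);
        fwrite(block, 1, BLKSZ, out);          /* slow device access, outside the lock */
    }
    return NULL;
}
```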

  A Perfect-Storm
Every so often, there are advancements in technology that impact a broad swath of society, either
directly or indirectly. These advancements emerge and take hold due to three events:
    1. A brilliant idea is conceived and implemented which solves a "real-world" problem;
    2. A set of technologies has evolved to the point that it enables the idea; and
    3. The market segment which directly benefits from the idea is ready to accept it.
  The enablement of a high performance shared-memory architecture over a cluster is such an idea. As
  implied in the previous section, the technologies that have allowed DSMP to be realized are:
      • The adoption of Linux as the operating system of choice for technical computing;
      • The commercial availability of a high bandwidth, low latency network fabric – InfiniBand;
      • The adoption of the x86-64 processor as the architecture of choice for technical computing;
      • The integration of the DRAM controller and I/O complex onto the x86-64 processor;
      • The sudden and rapid increase of DRAM density with the corresponding drop in memory cost;
      • The commercial availability of high-performance small form-factor SMP nodes.
  If any of the above six advancements had not existed, Distributed Symmetric Multiprocessing would not
  have succeeded.

  DSMP Performance
 Performance of a supercomputer is a function of two metrics:
1) Processor performance (computational throughput), which can be assessed with the HPCC Linpack or FFT
     algorithms; and
2) Global-memory read/write performance, which can be assessed with HPCC STREAM or RandomAccess (a
     simplified STREAM kernel is sketched below).
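For readers unfamiliar with the memory benchmark, the sketch below is a stripped-down, single-node version of the STREAM "triad" kernel, the style of measurement that HPCC STREAM performs far more carefully.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* A bare-bones version of the STREAM "triad" kernel, illustrating the
 * kind of memory-bandwidth measurement referred to above. The official
 * HPCC STREAM benchmark is more elaborate; this is only a sketch. */
#define N 20000000L

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];             /* triad: a = b + s*c          */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * N * sizeof(double);  /* two reads + one write       */
    printf("triad bandwidth: %.2f GB/s\n", bytes / secs / 1e9);

    free(a); free(b); free(c);
    return 0;
}
```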
The extraordinary thing about DSMP™ is that it is based on commodity components. That is important
because, as with MPI, DSMP performance scales with the performance of the commodity components on
which it depends. As an example, random read/write latency for DSMP went down 40% when we moved
from a 20Gb to a 40Gb fabric, and no changes to the DSMP software were needed. Also, within this same
timeframe, AMD64 processor density went from quad-core to six-core, again with only a small increase in
the cost of the total system. Therefore, over time the performance gap between a DSMP™ cluster and a
proprietary SMP supercomputer of equivalent processor/memory density will continue to narrow, and
eventually disappear.

Looking back to the birth of the Beowulf project, an SMP supercomputer of equivalent processor/memory
density outperformed a Beowulf cluster by a factor of 100, yet MPI clusters continued to dominate
supercomputing. Why? The reasons are twofold:
First, price/performance: MPI clusters are affordable, both from a hardware and a software perspective
(Linux based), and they can grow in small ways, by adding a few additional PC/servers, or in large ways,
by adding entire racks of servers. Hence, if a researcher needs more computing power, they simply add
more commodity PC/servers.
Second, accessibility: MPI clusters are inexpensive enough that almost any University or College can
afford them, making them readily accessible for a wide range of applications. As a result, an enormous
academic resource is focused on improving MPI and the applications that run on it.
DSMP™ brings the same level of value, accessibility and grass-roots innovation to the same audience
that embraced MPI. However, the performance gap between a DSMP cluster and a legacy shared-memory
supercomputer is small, and in many applications a DSMP cluster outperforms machines that cost 10x its
price. As an example, if your application can fit (program and data) within the local memory (cache) of
Trio, then you can apply 48 or 72 processors concurrently to solve your problem with no penalty for
memory accesses (all memory is local for all 48 or 72 processors). Even if your algorithm does access
global shared memory, it's only 5µ-seconds away.
Today, Symmetric Computing is offering four turn-key DSMP proof-of-concept platforms, coined the
Trio™ Departmental Supercomputer line, and one dual-node system coined Duet™. The table below lists
the various Gigaflop/memory configurations available in 2010.


    Trio™ P/N       Peak theoretical         Linpack score in         Total system      Total shared
                    floating-point           local memory (75%)       memory            memory*
                    performance
    SCA241604-3     749 Gigaflops            562 Gigaflops            192 GB            120 GB
    SCA241604-2     538 Gigaflops            404 Gigaflops            256 GB            208 GB
    SCA241608-3     749 Gigaflops            562 Gigaflops            384 GB            312 GB
    SCA161608-3     538 Gigaflops            404 Gigaflops            384 GB            336 GB
    * Note: In this table, local memory (cache) is set to 1GB/core – a total of 16GB for quad-core and 24GB for six-core processors.
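For reference, the peak figures above are consistent with the standard peak-flops formula, assuming four double-precision floating-point operations per core per clock (typical of this Opteron generation); the core counts and clock rates below are inferred for illustration, not values stated in the table:

    Peak = cores x clock x flops/cycle
    72 cores x 2.6 GHz x 4 = 748.8 ≈ 749 Gigaflops
    48 cores x 2.8 GHz x 4 = 537.6 ≈ 538 Gigaflops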







Looking forward to 2Q10, the Symmetric Computing engineering staff will introduce a multi-node,
InfiniBand-switched system delivering almost 2 teraflops of peak throughput with more than a
terabyte of global memory in a 10-blade chassis. In addition, we are working with our partners to deliver
turn-key platforms tuned for application-specific missions, such as HMMER, BLAST and de novo
alignment/assembly algorithms.

Challenges and Opportunities
Symmetric Computing's most significant challenge is to convince users that there is real value in its
DSMP Linux kernel enhancement over MPI. Much of the HPC market has become complacent with MPI
and looks for incremental improvements, i.e., MPI-1 to MPI-2, which continues to be the path of least
resistance. In addition, as MPI nodes convert to the direct-connect architecture of the AMD Opteron™ and
Intel Nehalem™ processors, combined with memory densities of >256GB/node, MPI-2 with DSM support
will gradually service a wider range of applications.
Symmetric Computing will address these issues by first supplying affordable turn-key DSMP-enabled
hardware, such as Trio™. These platforms will be optimized to solve problems in narrow vertical
markets such as Bioinformatics at an unprecedented price-performance level. We will then enable the
wider market with the availability of larger turn-key platforms with a version of OpenMP optimized for
DSMP and APIs that allow MPI applications to run on these turn-key shared-memory platforms with
improved performance.
In the short-term, Symmetric Computing will continue to focus on academic & research supercomputing.
We will continue to release turn-key platforms focused on specific verticals where we have an advantage.
Our long-term strategy is to continue to displace MPI with DSMP for applications where DSM/DGAS is
the architecture of choice, delivering Supercomputing for the Masses.

About Symmetric Computing
        Symmetric Computing is a Boston-based software company with offices at the Venture
Development Center on the campus of the University of Massachusetts – Boston. We design software to
accelerate the use and application of shared-memory computing systems for Bioinformatics, Oil & Gas,
Post Production Editing, Financial analysis and related fields. Symmetric Computing is dedicated to
delivering standards-based, customer-focused technical computing solutions for users, ranging from
Universities to enterprises. For more information, visit www.symmetriccomputing.com.



