Experiences in Scaling Scientific Applications on Current-generation Quad-core Processors

Kevin Barker, Kei Davis, Adolfy Hoisie, Darren Kerbyson, Mike Lang, Scott Pakin, José Carlos Sancho
          Performance and Architecture Lab (PAL), Los Alamos National Laboratory, USA
{kjbarker,kei,hoisie,djk,mlang,pakin,jcsancho}@lanl.gov


Abstract

In this work we present an initial performance evaluation of AMD's and Intel's first quad-core processor offerings: the AMD Barcelona and the Intel Xeon X7350. We examine the suitability of these processors in quad-socket compute nodes as building blocks for large-scale scientific computing clusters. Our analysis of the intra-processor and intra-node scalability of microbenchmarks and a range of large-scale scientific applications indicates that quad-core processors can deliver up to a 4x performance improvement per processor, but that the improvement is heavily dependent on the workload being processed. While the Intel processor has a higher clock rate and peak performance, the AMD processor has higher memory bandwidth and better intra-node scalability. The scientific applications we analyzed exhibit a range of performance improvements, from only 3x up to the full 16x speed-up over a single core. We also note that the maximum node performance is not necessarily achieved by using all 16 cores.

1. Introduction

The advancing level of transistor integration is producing increasingly complex processor solutions, ranging from mainstream multi-cores to heterogeneous many-cores and special-purpose processors including GPUs. There is no doubt that this will continue into the future until Moore's Law can no longer be satisfied. This increasing integration will require increases in the performance of the memory hierarchy to feed the processors. Innovations such as putting memory on top of processors, putting processors on top of memory (PIMs), or a combination of both may be a future way forward. However, the utility of future processor generations will be determined by demonstrable increases in achievable performance on real workloads.

In this work we examine the performance of two state-of-the-art quad-core processors: the quad-core AMD Opteron 8350 (Barcelona) and the quad-core Intel Xeon X7350 (Tigerton). Both are based on 65nm process technology. The Barcelona is fabricated as a single die whereas the Tigerton incorporates two dual-core dies into a single package. In this study we compare the performance of two 16-core nodes, one with four Barcelona processors and the other with four Tigerton processors. Our analysis relies on performance measurements of application-independent tests (microbenchmarks) and a suite of scientific applications taken from existing workloads within the U.S. Department of Energy that represent various scientific domains and program structures.

The performance and scaling behavior of each application was measured on one core, when scaling from one to four cores on a single processor, and also when using all four processors in a node. In addition, we determined the best achievable performance of each application on each node, which is not necessarily obtained when using all processing cores within a socket, or all cores within a node; this is heavily dependent on the application characteristics.

Though much of our work is focused on large-scale system performance, including examining the largest systems available (for instance Blue Gene/L and Blue Gene/P, ASC Purple, and ASC Red Storm, e.g. [2]), we note that performance at large scale results from both the performance of the computational nodes and their integration into the system as a whole.

This paper is organized as follows. An overview of the Barcelona and Xeon nodes is given in Section 2. Low-level microbenchmarks are described in Section 3, together with measured results for both nodes. Section 4 describes the suite of applications and input decks used, and the methodology used to undertake the scalability analysis. Results are presented in Section 5 for the three types of analysis described. Conclusions from this work are discussed in Section 6.

The contribution of this work is the analysis of empirical performance data from a large suite of complete scientific applications on the first generation of quad-core processors from both AMD and Intel in a quad-socket environment. These data are obtained using a strict measurement methodology to ensure that conclusions drawn from the scalability analysis are fair. Note that in the present work we do not consider physical or economic issues such as hardware cost, power, or physical node size. The process that we follow is directly applicable to other multi-core studies.


2. Processor and Node Descriptions

The Intel Xeon X7350 (Tigerton) and the AMD Opteron 8350 (Barcelona) represent competing first-generation quad-core processor designs, both initially made available in September 2007. Both are detailed below and illustrate different implementations in terms of processor configuration and connectivity to memory.

2.1. The Intel X7350 quad-core (Tigerton)

The Intel Tigerton processor contains two dual-core dies that are packaged into a single dual-chip module (DCM) seated within a single socket. Each core contains a private 64KB L1 cache (32KB data + 32KB instruction), and the two cores on each die share a 4MB L2 cache. Thus the total amount of L2 cache is 8MB within the DCM. The processor implements the 128-bit SSE3 instruction set for SIMD operations, allowing each core to perform 4 double-precision floating-point operations per cycle. The processor is clocked at 2.93GHz, so the DCM has a theoretical peak performance of 46.9 Gflops/s.

Each node contains four processors for a total of 16 cores, as shown in Figure 1, and contains a total of 16GB of main memory using fully-buffered DIMMs (FBDIMMs). Central to the node is a single memory controller hub (MCH). This hub interconnects the front-side bus (FSB) of each processor to four FBDIMM memory channels. The MCH contains a 64MB snoop buffer and a Dedicated High Speed Interconnect (DHSI) as well as PCI Express channels. The purpose of the snoop buffer is to minimize main memory accesses, while the DHSI provides a point-to-point link between each processor and the memory channels. The front-side bus of each processor runs at 1066MHz. The memory speed is 667MHz, providing a peak memory bandwidth of 10.7GB/s per processor which is shared among the four cores.

Figure 1. Overview of the Intel Tigerton: (a) quad-core processor (two dual-core dies, each pair of cores sharing an L2 cache, on a 1066MHz FSB); (b) quad-processor node (four sockets connected through the Memory Controller Hub to four FBDIMM 667 channels, 8GB/s per link).

2.2. The AMD 8350 quad-core Opteron processor (Barcelona)

Barcelona, the latest generation of the Opteron, combines four Opteron cores onto a single die. Each die contains a single integrated memory controller and uses a HyperTransport (HT) network for point-to-point connections between processors. Each core has a private 64KB L1 cache (32KB data + 32KB instruction) and a private 512KB L2 cache, and each processor has a shared 2MB L3 cache. The shared L3 cache is new to the Opteron architecture. The new 128-bit SSE4a instructions enable each core to execute 4 double-precision floating-point operations per clock. The clock speed of each core is 2.0GHz, giving each chip a peak performance of 32Gflops/s.

Each node contains four quad-core processors as shown in Figure 2. Because each processor contains a separate memory controller, a key difference from the Xeon node is that memory is connected directly to each processor in a non-uniform memory access (NUMA) configuration, versus the Xeon's symmetric multiprocessor (SMP) configuration. DDR2 667MHz memory is used, and thus the memory bandwidth per processor is 10.7GB/s. The total memory capacity of the node is 16GB (4GB per processor). The HT links connect the four processors in a 2×2 mesh; further HT links provide PCI Express I/O capability. Each HT link has a theoretical peak of 8GB/s for data transfer.

A summary of the two processor architectures and nodes is presented in Table 1. It is interesting to note the differences in power consumption and transistor count per processor. The lower transistor count of the Barcelona is due mainly to its reduced cache capacity.

Figure 2. Overview of the AMD Barcelona: (a) quad-core processor (four cores with private L2 caches, a shared L3 cache, a crossbar, three HT links, and two memory controllers); (b) quad-processor node (four sockets in a 2×2 HT mesh, 8GB/s per link, each socket with a local DDR2 667 memory channel).
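The peak figures quoted above, and summarized in Table 1, follow directly from cores per package, SIMD flops per cycle, and clock speed; as a quick cross-check of the arithmetic:

\[
\begin{aligned}
\text{Tigerton:}\quad & 4~\text{cores} \times 4~\tfrac{\text{flops}}{\text{cycle}} \times 2.93~\text{GHz} \approx 46.9~\text{Gflops/s per DCM}, \quad 4 \times 46.9 \approx 187.6~\text{Gflops/s per node};\\
\text{Barcelona:}\quad & 4~\text{cores} \times 4~\tfrac{\text{flops}}{\text{cycle}} \times 2.0~\text{GHz} = 32~\text{Gflops/s per chip}, \quad 4 \times 32 = 128~\text{Gflops/s per node}.
\end{aligned}
\]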
Table 1. Characteristics of the Intel and AMD processor, memory, and node organization.

                 ---------------------------- Chip -----------------------------   --------- Memory ----------   - Node -
                 Speed  Peak        L1    L2 (MB)    L3 (MB)    Power  Transistor   Type    Speed  Memory         Peak
                 (GHz)  (Gflops/s)  (KB)                        (W)    count (M)            (MHz)  controllers    (Gflops)
Intel Tigerton   2.93   46.9        64    4 (shared  None       130    582          FBDIMM  667    1              187.6
                                          2-core)
AMD Barcelona    2.0    32          64    0.25       2 (shared  75     463          DDR2    667    4              128.0
                                                     4-core)

3. Low-level performance characteristics

3.1. Memory bandwidth

To examine the memory bandwidth per core we ran the MPI version of the University of Virginia's Streams benchmark [11]. This benchmark is a memory stress test for a number of different operations; we report here the performance of the "triad" test. In Figure 3(a) the aggregate memory bandwidth is shown for both the Xeon and Barcelona nodes for two cases: when using a single processor and when using all four processors in the node. The number of cores per processor used in both cases is varied from one to four. As shown by the figure, the Barcelona node outperforms the Xeon node in all cases. The measured single-core memory bandwidth is 4.4GB/s on the Barcelona and 3.7GB/s on the Xeon. The aggregate memory bandwidth, using all 16 cores, is 17.4GB/s and 10.2GB/s respectively.

Figure 3(b) is based on the same data as Figure 3(a) but presents the observed memory bandwidth per core. As shown, the per-core bandwidth decreases from 4.4GB/s to 1.1GB/s for the Barcelona (a factor of four decrease), and from 3.7GB/s to 0.63GB/s for the Xeon (a factor of six decrease). This decrease is significant, and there is clearly room for improvement on both architectures. The aggregate achievable memory bandwidth is important to memory-intensive applications, and the Barcelona node has a clear advantage over the Xeon node for such applications.
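To make the measurement concrete, the heart of the triad test is the loop sketched below. This is a minimal serial sketch, not the MPI benchmark actually run; STREAM_N is our choice here, sized so that the three arrays (48MB total) overflow the caches of both processors, as a valid STREAM measurement requires.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define STREAM_N 2000000L   /* 3 arrays x 16MB each, well beyond any cache */

static double a[STREAM_N], b[STREAM_N], c[STREAM_N];

int main(void)
{
    const double scalar = 3.0;

    for (long i = 0; i < STREAM_N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < STREAM_N; i++)
        a[i] = b[i] + scalar * c[i];            /* the "triad" operation */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* STREAM counts 24 bytes of traffic per iteration: two loads, one store. */
    printf("triad: %.2f GB/s (check value %.1f)\n",
           24.0 * STREAM_N / sec / 1e9, a[STREAM_N - 1]);
    return 0;
}

The full benchmark repeats each kernel several times and reports the best rate; running one copy of this loop per core is what exposes the contention effects seen in Figure 3(b).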

Figure 3. Streams bandwidth: (a) aggregate memory bandwidth and (b) memory bandwidth per core, for the AMD Barcelona and Intel Tigerton nodes using one socket and four sockets, as a function of cores per socket (1-4).
3.2. Processor locality

The mapping of application processes to cores within a node affects the memory contention that the application induces. In our experience, the core-to-processor ordering is not always obvious and should at a minimum be verified. Linux determines core numbering via information provided by the BIOS.

In this testing we used an MPI benchmark to measure the latency between each core and all of the others. The latency varies depending on whether the two communicating cores are on the same die, on different dies in the same processor (in the case of the Intel Xeon), or on different processors (single- or multiple-hop remote processors for the AMD Barcelona).

From our latency measurements we were able to determine the arrangement of the cores as seen by an application. Figure 4 shows the observed latency from this test in the form of a matrix: the vertical axis indicates the sending core, the horizontal axis the receiving core, and shading denotes the different latencies. No test was performed for a core sending to itself (the major diagonal in the matrix). It can be seen that the Barcelona node uses a linear ordering of cores to processors; that is, the first four cores reside on the first processor, as indicated by the lowest-latency (black shaded) 4x4 block in Figure 4(b), and so on. In the case of the Xeon, a round-robin ordering across the dies is used: the first two dies reside on the first processor, and so on. Cores on the same die are an MPI logical task distance of eight apart, as shown by the black diagonal lines in Figure 4(a).

Figure 4. Observed latency from any core to any other core in a node. (a) Intel Xeon: 0.43-0.44µs (same die), 0.84-0.85µs (same processor), 1.63-1.64µs (remote processor). (b) AMD Barcelona: 1.20-1.21µs (same die/processor), 1.47-1.49µs (HT 1-hop), 1.55-1.56µs (HT 2-hops).

The latency from one core to another falls into very distinct ranges depending on their relationship (same die, remote die, etc.). We observe that the maximum latency is similar on both nodes, but the Xeon enjoys a much lower intra-die and intra-processor latency. The core and processor orderings shown in Figure 1 and Figure 2 are in fact based on the observed processor locality map shown in Figure 4. To handle the differences in the processor numbering between the two nodes we implemented a small software shim that uses the Linux sched_setaffinity() system call to allow user-defined mappings between MPI rank and physical cores. This shim gives us the ability to map processes to cores identically across the two nodes and thereby perform fair comparisons of application performance.
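The sketch below combines the two mechanisms just described: pinning each MPI rank to a physical core with sched_setaffinity(), then timing small-message ping-pongs to expose the locality map. It is a minimal illustration, not the authors' benchmark or shim; the MAP table is a hypothetical rank-to-core mapping (identity here) that would be adjusted per node to equalize mappings, and only the latencies from rank 0 are measured rather than the full matrix of Figure 4.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

#define NCORES 16
#define REPS   10000

/* Hypothetical user-defined MPI-rank-to-physical-core mapping (identity). */
static const int MAP[NCORES] = { 0, 1, 2,  3,  4,  5,  6,  7,
                                 8, 9, 10, 11, 12, 13, 14, 15 };

int main(int argc, char **argv)
{
    int rank, size;
    char buf = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The "shim": bind the calling process to its assigned physical core. */
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(MAP[rank % NCORES], &cpus);
    sched_setaffinity(0, sizeof(cpus), &cpus);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        /* Ping-pong a 1-byte message with every other rank; half the
         * average round-trip time approximates the one-way latency. */
        for (int peer = 1; peer < size; peer++) {
            double t0 = MPI_Wtime();
            for (int i = 0; i < REPS; i++) {
                MPI_Send(&buf, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(&buf, 1, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
            printf("core %2d -> core %2d: %.2f us one-way\n",
                   MAP[0], MAP[peer % NCORES],
                   (MPI_Wtime() - t0) * 1e6 / REPS / 2.0);
        }
    } else {
        for (int i = 0; i < REPS; i++) {
            MPI_Recv(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}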

4. Application testing process

The suite of applications utilized includes many large-scale production applications that are currently used within the U.S. Department of Energy. The testing for each application consisted of:

1) comparing the performance on a single core of the Barcelona and Xeon;
2) examining the scaling behavior for two cases: (a) using only one processor, and (b) using all 4 processors in a node;
3) determining the configuration (processors and cores/processor) that yields the best performance.

All of the applications are typically run in a weak-scaling mode; that is, the global problem size grows with the number of nodes in the system. All available memory is typically used, either for increased fidelity in the physical simulations or for simulating larger physical systems. Our approach mimics typical usage by fixing the sub-problem per processor no matter how many cores per processor are used; the global problem grows in proportion to the number of processors used. This can be stated succinctly as doing strong scaling within a processor and weak scaling across processors. For example, SAGE always processes 140K cells per processor: however many of a processor's cores are enabled, they share those 140K cells (strong scaling), while a four-processor run grows the problem to 560K cells (weak scaling).

4.1. The application suite

An overview of each application is given below. Each is typically run on high-performance parallel systems utilizing many thousands of processors at a time. A summary of the application input decks used here is given in Table 2. The input decks used are typical of problems processed on large-scale systems.
Table 2. Summary of the input decks used for each application.

         Input deck   Problem per      Problem per      Memory use     Processing characteristic
                      processor        node             per processor
GTC      1Dwedge      6.2M particles   24.8M particles  320MB          Particle based
Milagro  Doublebend   0.5M particles   2M particles     50MB           Particle based with replicated mesh
Partisn  Pencil       20x10x400        40x20x400        80MB           Compute intensive
S3D      typical      50x50x50         100x100x50       140MB          Memory/compute intensive
SAGE     timing_h     140K cells       560K cells       280MB          Memory intensive
SPaSM    BCC          64x64x64         128x128x64       150MB          Compute intensive
Sweep3D  Pencil       20x10x400        40x20x400        8MB            Kernel, small memory footprint
VH1      Shock_tube   200x200x200      400x400x200      900MB          Memory/compute intensive
VPIC     3D-HOT       4M particles     16M particles    256MB          SSE optimized, memory intensive

GTC – Gyrokinetic Toroidal Code, a particle-in-cell (PIC) code from Princeton [12]. It was developed to study energy transport in fusion devices.
Milagro – an implicit Monte Carlo (IMC) code for thermal radiative transfer from Los Alamos [5].
Partisn – an SN transport code from Los Alamos [1] solving the Boltzmann equation using the discrete ordinates method on structured meshes.
S3D – a high-fidelity 3D simulation of turbulent combustion that includes detailed chemistry. It originates from Sandia National Laboratories [6].
SAGE – an adaptive mesh refinement (AMR) hydrodynamics code used for the simulation of shock-waves. Developed jointly by Los Alamos and SAIC [9].
SPaSM – the Scalable Parallel Short-range Molecular dynamics code from Los Alamos, used to study material fracture and deformation properties [13].
Sweep3D – a code kernel from Los Alamos [7,10] that implements deterministic SN transport. The computation is in the form of wavefronts that originate at the corners of a 3-D physical space.
VH1 – the Virginia Hydrodynamics code, which simulates ideal inviscid compressible gas flow and is capable of simulating three-dimensional turbulent stellar flows [3].
VPIC – a particle-in-cell code from Los Alamos used to model particle flow within a plasma [4].

5. Application Performance Analysis

5.1. Comparison of single-core performance

The performance of each application on a single core of both processor types is shown in Figure 5(a). The metric of "time" denotes the iteration time of the main computational loop (without I/O) for all applications except Sweep3D and Partisn, which are run for 10 iterations, and VPIC, which is run for 100 iterations to aid visual comparison. The reduction in application runtime on the Xeon core relative to the Barcelona core is shown in Figure 5(b): 50% indicates that the Tigerton has a runtime reduction of 50% (halving the iteration time); similarly, a value of -50% indicates that the Barcelona has a 50% reduction in runtime relative to the Tigerton core.

The advantage of the Tigerton core over the Barcelona core is between 25% and 44% depending on the application. Recall that the Tigerton has a clock speed of 2.93GHz compared with the Barcelona's 2.0GHz, and that the memory bandwidth for a single core was 3.7GB/s and 4.4GB/s respectively. Hence up to a 50% advantage could be expected for computationally intensive codes executing the same instruction mix over the same number of cycles without memory stalls.
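Stated as a formula (our formulation of the metric in Figure 5(b), consistent with the halving example above, where T denotes single-core iteration time):

\[
\text{advantage} =
\begin{cases}
\left(1 - T_{\mathrm{Intel}}/T_{\mathrm{AMD}}\right) \times 100\% & \text{if } T_{\mathrm{Intel}} \le T_{\mathrm{AMD}},\\[0.5ex]
-\left(1 - T_{\mathrm{AMD}}/T_{\mathrm{Intel}}\right) \times 100\% & \text{otherwise.}
\end{cases}
\]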
Figure 5. Single-core performance comparison: (a) application iteration time on the Intel Tigerton and AMD Barcelona, and (b) runtime advantage of the Xeon core over the Barcelona core, for GTC, Milagro, Partisn, S3D, SAGE, SPaSM, Sweep3D, VH1, and VPIC.
Figure 6. Single-node (16 cores) application performance comparison: (a) application iteration time on the Intel Tigerton and AMD Barcelona nodes, and (b) runtime advantage of the Xeon node over the Barcelona node, for the same nine applications.

5.2. Comparison of node performance

The best performance of each application on the Xeon node (16 cores) and the Barcelona node (16 cores) is shown in Figure 6. In the same way as for the single-core comparison, Figure 6(a) depicts the iteration time, and Figure 6(b) depicts the runtime advantage of the Xeon node over the Barcelona node. Several observations are clear in this comparison: i) the iteration time is lower in all cases compared to using only one core per socket; ii) the runtime advantage of the Xeon is much reduced from the single-core comparison of Section 5.1; and iii) the Xeon node no longer outperforms the Barcelona node for all applications; the runtime on the Barcelona node is lower by as much as 60%.

The results shown in Figure 6 are in line with the differences in the memory bandwidth measurements of Section 3.1. The per-core bandwidth when using all cores within a node is 0.63GB/s on the Xeon and 1.1GB/s on the Barcelona (an advantage of almost a factor of 2). The 60% runtime advantage of the Barcelona at the node level is directly in line with this for the memory-intensive applications SAGE and VH1.

Note however that the performance presented above is based on the best observed node performance, which does not necessarily result from using all 16 cores in the node. In fact, the best performance observed on VPIC and Partisn was when using 2 cores per processor (8 cores total), and for SAGE when using 3 cores per processor (12 cores total) on the Xeon node. The best performance in all other cases was observed when using all 16 cores. In Section 5.3 below we analyze the performance as a function of the number of cores and number of processors used for each application.

5.3. Quad-core application scalability analysis

To analyze the performance of using multiple cores we followed a strict process in which the problem size per processor was constant for all tests, as described in Section 4. For each application we show the performance when using between one and four cores per processor, both when using one processor and when using all four processors in a node. Note that the performance shown is relative to the single-core performance for each processor type; this is the speedup when using multiple cores. Figure 7 shows this data for all applications. Note that the legend for all graphs is shown in Figure 7(i).

The first observation is that the scalability is higher on the Barcelona quad-core than on the Xeon quad-core. This explains why the performance advantage of the Xeon at the node level is less than its advantage for a single core. The applications are ordered in terms of their observed scalability in Figure 7. The applications with the best scaling behavior on both nodes are Milagro, SPaSM and Sweep3D (Figures 7(a)-7(c)). Both Milagro and SPaSM are compute bound, and Sweep3D has a small memory footprint resulting in high cache utilization. In contrast, SAGE and Partisn are memory bound and show the slowest scalability on both nodes. Codes that are neither compute nor memory bound scale better on the Barcelona than on the Xeon: VH1, GTC, VPIC and S3D.

Note that even if an application has a higher speedup on the Barcelona node in comparison to the Xeon node, it does not necessarily have higher performance, as was evident in Figure 6(b).
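The relative-performance metric of Figure 7 can be written as follows (our formulation of the description above). Let T(p, c) be the iteration time using c cores on each of p processors, with the global problem scaled in proportion to p as described in Section 4:

\[
S(p, c) = \frac{p \cdot T(1, 1)}{T(p, c)},
\]

so that perfect scaling gives S(p, c) = p·c, i.e. 16 when all four cores of all four processors are used.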


Figure 7. Application speedup when using multiple cores, relative to one core, as a function of cores per socket (1-4) for one socket and four sockets on each node type: (a) Milagro, (b) SPaSM, (c) Sweep3D, (d) VH1, (e) GTC, (f) VPIC, (g) S3D, (h) SAGE, (i) Partisn. The legend (AMD Barcelona 4-sockets, Intel Tigerton 4-sockets, AMD Barcelona 1-socket, Intel Tigerton 1-socket) appears in panel (i).
6. Conclusions                                            References
     Using a suite of applications we have evaluated      [1] R.S. Baker. “A Block Adaptive Mesh Refinement
                                                          Algorithm for the Neutral Particle Transport Equation”,
the first generation of quad-core processors available
                                                          Nuclear Science & Engineering, 141(1), pp. 1-12, 2002.
from AMD and Intel. Data was obtained using a strict
measurement methodology that used a shim to control       [2] A. Hoisie, G. Johnson, D.J. Kerbyson, M. Lang, S. Pakin.
the mapping of application processes to processors,       “A Performance Comparison Through Benchmarking and
and compared the per-core performance as well as          Modeling of Three Leading Supercomputers: Blue Gene/L,
scaling up to 16 cores in a node. The process followed    Red Storm, and Purple”, in proc. IEEE/ACM Conf. on
is directly applicable to other multi-core studies.       Supercomputing (SC06), Tampa, FL, 2006.
     When considering the performance of a single
                                                          [3] J.M. Blondin. VH-1 User’s Guide. North Carolina State
core, where there are no memory bottlenecks, the          University, 1999.
higher clock speed of Intel’s Xeon gives applications a
measured 25-44% reduction in runtime compared with        [4] K. Bowers. “Speed optimal implementation of a fully
the AMD’s Barcelona. When using all of the cores in a     relativistic 3d particle push with charge conserving current
processor the results are more dependent on the way       accumulation on modern processors”, in proc. 18th int. conf.
each application uses memory. Barcelona has the edge      Numerical Simul. Plasmas,2003, p.383
in memory bandwidth available on a single processor.      [5] T. M. Evans, T. J. Urbatsch. “MILAGRO: A parallel
     Finally, when examining scaling across the entire 16-core node, the results are somewhat mixed. In general, applications with small memory footprints perform better on the Xeon and see almost perfect scaling. Memory-bandwidth-intensive applications, on the other hand, scale better on the Barcelona because of the reduced memory contention. For many of the applications, the better scaling behavior results in higher achievable performance on the Barcelona.
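     The contention argument follows directly from the Streams measurements of Section 3.1: with all 16 cores active, per-core bandwidth is 1.1 GB/s on the Barcelona node versus 0.63 GB/s on the Xeon node, i.e.

$$ \frac{1.1\,\text{GB/s}}{0.63\,\text{GB/s}} \approx 1.7, $$

so memory-bound codes have roughly 70% more bandwidth per core on the Barcelona once every core is busy.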
     While this study represents a snapshot of current processors and node architectures, it also represents a snapshot of current application structures. All of the applications we ran use the "one MPI rank per core" model. Although this is an extremely portable way to structure an application, it may be possible to gain more performance by exploiting the properties of multi-core processors, such as the fact that physically proximate processes can benefit from sharing cached data. We have shown that for applications as they exist today, it is important to consider the balance between compute rate and memory rate when selecting a processor from which to build a cluster. Neither the Barcelona nor the Xeon is unambiguously faster than the other. The decision of which to use must be made on a per-application (or per-workload) basis and can benefit from the results we presented in this paper.
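     One way to exploit that physical proximity, not evaluated in this study, is a hybrid decomposition: one MPI rank per processor with one thread per core, so that the threads of a rank share the socket's caches and memory controller. The sketch below illustrates the structure only; compute_chunk() is a hypothetical per-thread work routine, and the four threads mirror the quad-core parts studied here.

/* Illustrative hybrid MPI+OpenMP structure (one rank per socket,
 * one OpenMP thread per core) -- an alternative to the one-MPI-rank-
 * per-core model used throughout this study. */
#include <mpi.h>
#include <omp.h>

/* Hypothetical per-thread work routine. */
extern void compute_chunk(int rank, int thread, int nthreads);

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel num_threads(4)   /* one thread per core */
    {
        compute_chunk(rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Barrier(MPI_COMM_WORLD);          /* exchange/synchronize here */
    MPI_Finalize();
    return 0;
}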
Acknowledgements

     We thank AMD and Intel for providing early systems for this performance evaluation. This work was funded in part by the Accelerated Strategic Computing program and the Office of Science of the Department of Energy. Los Alamos National Laboratory is operated by Los Alamos National Security LLC for the US Department of Energy under contract DE-AC52-06NA25396.

References

[1] R.S. Baker. "A Block Adaptive Mesh Refinement Algorithm for the Neutral Particle Transport Equation", Nuclear Science & Engineering, 141(1), pp. 1-12, 2002.

[2] A. Hoisie, G. Johnson, D.J. Kerbyson, M. Lang, S. Pakin. "A Performance Comparison Through Benchmarking and Modeling of Three Leading Supercomputers: Blue Gene/L, Red Storm, and Purple", in Proc. IEEE/ACM Conf. on Supercomputing (SC06), Tampa, FL, 2006.

[3] J.M. Blondin. VH-1 User's Guide. North Carolina State University, 1999.

[4] K. Bowers. "Speed optimal implementation of a fully relativistic 3d particle push with charge conserving current accumulation on modern processors", in Proc. 18th Int. Conf. on Numerical Simulation of Plasmas, 2003, p. 383.

[5] T.M. Evans, T.J. Urbatsch. "MILAGRO: A parallel Implicit Monte Carlo code for 3-d radiative transfer (U)", in Proc. of the Nuclear Explosives Code Development Conference, Las Vegas, NV, Oct. 1998.

[6] E.R. Hawkes, R. Sankaran, J.C. Sutherland, J.H. Chen. "Direct numerical simulation of turbulent combustion: fundamental insights towards predictive models", J. of Physics: Conference Series, 16:65-79, 2005.

[7] A. Hoisie, O. Lubeck, H.J. Wasserman. "Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures using Multidimensional Wavefront Applications", Int. J. of High Performance Applications, vol. 14, no. 4, pp. 330-346, 2000.

[8] Intel Corporation. Quad-core Intel Xeon Processor 7300 Series. Product Brief, 2007.

[9] D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, M.L. Gittings. "Predictive Performance and Scalability Modeling of a Large-scale Application", in Proc. IEEE/ACM Conf. on Supercomputing (SC01), Denver, CO, 2001.

[10] K.R. Koch, R.S. Baker, R.E. Alcouffe. "Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor", Trans. of the American Nuclear Soc., 65:198-199, 1992.

[11] J. McCalpin. "Memory bandwidth and machine balance in current high performance computers", IEEE Comp. Soc. Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19-25, Dec. 1995.

[12] N. Wichmann, M. Adams, S. Ethier. "New Advances in the Gyrokinetic Toroidal Code and Their Impact on Performance on the Cray XT Series", in Proc. Cray User Group (CUG), Seattle, WA, 2007.

[13] S.J. Zhou, D.M. Beazley, P.S. Lomdahl, B.L. Holian. "Large-scale molecular dynamics simulations of fracture and deformation", J. of Computer-Aided Materials Design, 3(1-3), pp. 183-186, 1995.
