2. Processor and Node Descriptions

The Intel Xeon X7350 (Tigerton) and AMD 8350 Opteron (Barcelona) represent competing first-generation quad-core processor designs, both initially made available in September 2007. Both are detailed below and illustrate different implementations, in terms of both processor configuration and connectivity to memory.

2.1. The Intel X7350 quad-core (Tigerton)

The Intel Tigerton processor contains two dual-core dies packaged into a single dual-chip module (DCM) that is seated within a single socket. Each core contains a private 64KB L1 cache (32KB data + 32KB instruction), and the two cores on each die share a 4MB L2 cache; the total amount of L2 cache within the DCM is thus 8MB. The processor implements the 128-bit SSE3 instruction set for SIMD operations and can therefore perform 4 double-precision floating-point operations per cycle. The processor is clocked at 2.93GHz, so the DCM has a theoretical peak performance of 46.9 Gflops/s.

Each node contains four processors for a total of 16 cores, as shown in Figure 1, and a total of 16GB of main memory using fully-buffered DIMMs (FBDIMMs). Central to the node is a single memory controller hub (MCH). This hub interconnects the front-side bus (FSB) of each processor to four FBDIMM memory channels. The MCH contains a 64MB snoop buffer and a Dedicated High Speed Interconnect (DHSI) as well as PCI Express channels. The purpose of the snoop buffer is to minimize main memory accesses, while the DHSI provides a point-to-point link between each processor and the memory channels. The front-side bus of each processor runs at 1066MHz. The memory speed is 667MHz and thus provides a peak memory bandwidth of 10.7GB/s per processor, which is shared among the four cores.

2.2. The AMD 8350 quad-core Opteron processor (Barcelona)

Barcelona, the latest generation of the Opteron, combines four Opteron cores onto a single die. Each die contains a single integrated memory controller and uses a HyperTransport (HT) network for point-to-point connections between processors. Each core has a private 64KB L1 cache (32KB data + 32KB instruction) and a private 512KB L2 cache, and each processor has a shared 2MB L3 cache. The shared L3 cache is new to the Opteron architecture. The new 128-bit SSE4a instructions enable each processor to execute 4 double-precision floating-point operations per clock. The clock speed of each core is 2.0GHz, giving each chip a peak performance of 32Gflops/s.

Each node contains four quad-core processors, as shown in Figure 2. Because each processor contains a separate memory controller, a key difference with the Xeon node is that memory is connected directly to each processor in a non-uniform memory access (NUMA) configuration, versus the Xeon's symmetric-multiprocessor (SMP) configuration.
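As a quick check, the peak figures quoted above follow directly from the clock rates and the 4 double-precision flops per cycle per core; the 10.7GB/s memory bandwidth per processor is consistent with two 667 MT/s channels of 8 bytes each (the two-channel reading is our assumption, inferred from the quoted figure):

\[ P_{chip} = f_{clock} \times 4~\mathrm{flops/cycle} \times 4~\mathrm{cores}, \qquad P_{node} = 4 \times P_{chip} \]

Tigerton: 2.93 x 4 x 4 = 46.9 Gflops/s (node: 187.6 Gflops/s); Barcelona: 2.0 x 4 x 4 = 32 Gflops/s (node: 128 Gflops/s).

\[ BW_{proc} \approx 2~\mathrm{channels} \times 667~\mathrm{MT/s} \times 8~\mathrm{B} \approx 10.7~\mathrm{GB/s} \]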
Figure 1. Overview of the Intel Tigerton: (a) quad-core processor, two dual-core dies with a shared L2 cache per die on a 1066MHz FSB; (b) quad-processor node, four sockets (cores 0/1/8/9, 2/3/10/11, 4/5/12/13, 6/7/14/15) connected through the Memory Controller Hub (MCH) to four FBDIMM 667 memory channels over 8GB/s links.

Figure 2. Overview of the AMD Barcelona: (a) quad-core processor, four cores with private L2 caches, a shared L3, a cross-bar, three HT links and two memory controllers; (b) quad-processor node, four sockets (cores 0-3, 4-7, 8-11, 12-15) connected by 8GB/s HT links, each with locally attached DDR2 667 memory.
Table 1. Characteristics of the Intel and AMD processor, memory, and node organization.

                 | Chip: Speed (GHz) | Peak (Gflops/s) | L1 (KB) | L2 (MB)            | L3 (MB)            | Power (W) | Transistor count (M) | Memory: Type | Speed (MHz) | Memory controllers | Node: Peak (Gflops/s)
Intel Tigerton   | 2.93              | 46.9            | 64      | 4 (shared, 2-core) | None               | 130       | 582                  | FBDIMM       | 667         | 1                  | 187.6
AMD Barcelona    | 2.0               | 32              | 64      | 0.5 (private)      | 2 (shared, 4-core) | 75        | 463                  | DDR2         | 667         | 4                  | 128.0
DDR2 667MHz memory is used and thus the memory bandwidth per processor is 10.7GB/s. The total memory capacity of the node is 16GB (4GB per processor). The HT links connect the four processors in a 2×2 mesh; further HT links provide PCI Express I/O capability. Each HT link has a theoretical peak of 8GB/s for data transfer.

A summary of the two processor architectures and nodes is presented in Table 1. It is interesting to note the differences in power consumption and transistor count per processor. The lower transistor count of the Barcelona is due mainly to its smaller cache capacity.

3. Low-level performance characteristics

3.1. Memory bandwidth

To examine the memory bandwidth per core we ran the MPI version of the University of Virginia's Streams benchmark [11]. This benchmark is a memory stress test for a number of different operations; we report here the performance of the "triad" test (a sketch of the kernel is given at the end of this subsection). In Figure 3(a) the aggregate memory bandwidth is shown for both the Xeon and Barcelona nodes for two cases: when using a single processor and when using all four processors in the node. The number of cores per processor used in both cases is varied from one to four. As shown by the figure, the Barcelona node outperforms the Xeon node in all cases. The measured single-core memory bandwidth is 4.4GB/s on the Barcelona and 3.7GB/s on the Xeon. The aggregate memory bandwidth, using all 16 cores, is 17.4GB/s and 10.2GB/s respectively.

Figure 3(b) is based on the same data as Figure 3(a) but presents the observed memory bandwidth per core. As shown, the per-core bandwidth decreases from 4.4GB/s to 1.1GB/s for the Barcelona (a factor-of-four decrease), and from 3.7GB/s to 0.63GB/s for the Xeon (a factor-of-six decrease). This decrease is significant, and there is clearly room for improvement on both architectures. The aggregate achievable memory bandwidth is important to memory-intensive applications, and the Barcelona node has a clear advantage over the Xeon node for such applications.
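For context, the "triad" operation at the heart of this measurement is a simple scaled vector add. The sketch below is our illustration (not the benchmark's actual source) of the per-rank kernel and how a bandwidth figure is derived from it, assuming 8-byte doubles and three arrays touched per element.

/* Minimal sketch of a Streams-style "triad" measurement (illustrative only).
 * Each MPI rank streams through three private arrays; per-rank bandwidths
 * are summed to give the aggregate figure reported in Figure 3. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 20000000L   /* large enough to defeat the caches */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    const double scalar = 3.0;
    for (long i = 0; i < N; i++)            /* triad: a = b + scalar * c */
        a[i] = b[i] + scalar * c[i];
    double t = MPI_Wtime() - t0;

    double gbps = 3.0 * 8.0 * (double)N / t / 1e9;  /* three 8-byte arrays moved */
    double total;
    MPI_Reduce(&gbps, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("aggregate triad bandwidth: %.1f GB/s\n", total);

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}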
Figure 3. Streams bandwidth as a function of cores used per socket (1-4), for the AMD Barcelona and Intel Tigerton in 1-socket and 4-socket configurations: (a) aggregate memory bandwidth (GB/s); (b) memory bandwidth per core (GB/s).
Figure 4. Observed latency from any core to any other core in a node, shown as a 16x16 source-core by destination-core matrix. (a) Intel Xeon: 0.43-0.44µs same die, 0.84-0.85µs same processor, 1.63-1.64µs remote processor. (b) AMD Barcelona: 1.20-1.21µs same die/processor, 1.47-1.49µs one HT hop, 1.55-1.56µs two HT hops.
3.2. Processor locality

The mapping of application processes to cores within a node affects the memory contention that the application induces. In our experience the core-to-processor ordering is not always obvious and should at a minimum be verified. Linux determines core numbering via information provided by the BIOS.

In this testing we used an MPI benchmark to measure the latency between each core and all of the others. The latency varies depending on whether the two communicating cores are on the same die, on different dies in the same processor (in the case of the Intel Xeon), or on different processors (single- or multiple-hop remote processors for the AMD Barcelona). From our latency measurements we were able to determine the arrangement of the cores as seen by an application. Figure 4 shows the observed latency from this test in the form of a matrix: the vertical axis indicates the sending core, the horizontal axis the receiving core, and shading is used to denote the different latencies. No test was performed for a core sending to itself (the major diagonal of the matrix).
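A typical way to collect such a matrix is a short ping-pong between every ordered pair of ranks. The sketch below is our illustration of the idea (the benchmark actually used in the study may differ), reporting the one-way latency for each source/destination pair.

/* Sketch of a core-to-core latency probe: every ordered pair of MPI ranks
 * exchanges a small message many times; half the round-trip time gives the
 * one-way latency plotted in Figure 4.  Illustrative only. */
#include <mpi.h>
#include <stdio.h>

#define REPS 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    char byte = 0;

    for (int src = 0; src < size; src++) {
        for (int dst = 0; dst < size; dst++) {
            if (src == dst) continue;                 /* skip the diagonal */
            MPI_Barrier(MPI_COMM_WORLD);
            if (rank == src) {
                double t0 = MPI_Wtime();
                for (int i = 0; i < REPS; i++) {
                    MPI_Send(&byte, 1, MPI_CHAR, dst, 0, MPI_COMM_WORLD);
                    MPI_Recv(&byte, 1, MPI_CHAR, dst, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                }
                double one_way_us = (MPI_Wtime() - t0) / REPS / 2.0 * 1e6;
                printf("%2d -> %2d : %.2f us\n", src, dst, one_way_us);
            } else if (rank == dst) {
                for (int i = 0; i < REPS; i++) {
                    MPI_Recv(&byte, 1, MPI_CHAR, src, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(&byte, 1, MPI_CHAR, src, 0, MPI_COMM_WORLD);
                }
            }
        }
    }
    MPI_Finalize();
    return 0;
}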
It can be seen that the Barcelona node uses a linear ordering of cores to processors – that is, the first four cores reside on the first processor, as indicated by the lowest-latency (black shaded) 4x4 block in Figure 4(b), and so on. In the case of the Xeon a round-robin ordering across the dies is used: the first two dies reside on the first processor, and so on. Cores on the same die are an MPI logical task distance of eight apart, as shown by the black diagonal lines in Figure 4(a).

The latency from one core to another falls into very distinct ranges depending on the relationship between the cores (same die, remote die, etc.). We observe that the maximum latency is similar on both nodes, but the Xeon enjoys a much lower intra-die and intra-processor latency. The core and processor ordering shown in Figure 1 and Figure 2 is in fact based on the observed processor locality map of Figure 4. To handle the differences in processor numbering between the two nodes we implemented a small software shim that uses the Linux sched_setaffinity() system call to allow user-defined mappings between MPI ranks and physical cores. This shim gives us the ability to map processes to cores identically across the two nodes and thereby perform fair comparisons of application performance. A sketch of such a shim is given below.
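The following is a minimal sketch of how such a shim can pin each MPI rank to a user-chosen core via sched_setaffinity(); the mapping table shown is our own illustration of the Xeon's round-robin core numbering, not the shim actually used in the study.

/* Illustrative affinity shim: pin each MPI rank to a physical core chosen
 * from a user-supplied map, so that rank-to-core placement is identical on
 * both nodes regardless of the BIOS core numbering. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

/* Example map placing ranks 0-3 on socket 0 of the Xeon node (cores 0,1,8,9),
 * and so on; the identity map would suffice on the Barcelona node. */
static const int rank_to_core[16] =
    { 0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15 };

static void bind_rank_to_core(int rank)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(rank_to_core[rank % 16], &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)   /* 0 = calling process */
        perror("sched_setaffinity");
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    bind_rank_to_core(rank);      /* called before any application work */
    /* ... application proper runs here ... */
    MPI_Finalize();
    return 0;
}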
4. Application testing process

The suite of applications utilized includes many large-scale production applications that are currently used within the U.S. Department of Energy. The testing for each application consisted of:

1) comparing the performance on a single core of the Barcelona and Xeon;
2) examining the scaling behavior for two cases: (a) using only one processor, and (b) using all 4 processors in a node;
3) determining the configuration (processors and cores/processor) that yields the best performance.

All of the applications are typically run in a weak-scaling mode; that is, the global problem size grows with the number of nodes in the system. All available memory is typically used, for increased fidelity in the physical simulations or for simulating larger physical systems. Our approach mimics typical usage by fixing the sub-problem per processor no matter how many cores per processor are used. The global problem grows in proportion to the number of processors used. This can be stated succinctly as doing strong scaling within a processor and weak scaling across processors; a worked example is given below.
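As an illustration, using the SAGE figures from Table 2 (140K cells per processor):

\[ \textrm{cells per core} = \frac{\textrm{cells per processor}}{\textrm{cores used per processor}}, \qquad \textrm{global cells} = \textrm{cells per processor} \times \textrm{processors used} \]

so each core works on 140K, 70K, 47K or 35K cells as 1, 2, 3 or 4 cores per processor are used (strong scaling within the processor), while the node-level problem grows from 140K cells with one processor to 560K cells with all four (weak scaling across processors).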
4.1. The application suite

An overview of each application is given below. Each is typically run on high-performance parallel systems utilizing many thousands of processors at a time. A summary of the application input decks used here is given in Table 2. The input decks used are typical of problems processed on large-scale systems.

GTC – the Gyrokinetic Toroidal Code, a Particle-in-Cell (PIC) code from Princeton [12]. It was developed to study energy transport in fusion devices.

Milagro – an implicit Monte Carlo (IMC) code for thermal radiative transfer from Los Alamos [5].
Table 2. Summary of the input decks used for each application

Application  Input deck   Problem per processor  Problem per node  Memory use per processor  Processing characteristic
GTC          1Dwedge      6.2M particles         24.8M particles   320MB                     Particle based
Milagro      Doublebend   0.5M particles         2M particles      50MB                      Particle based with replicated mesh
Partisn      Pencil       20x10x400              40x20x400         80MB                      Compute intensive
S3D          typical      50x50x50               100x100x50        140MB                     Memory/Compute intensive
SAGE         timing_h     140K cells             560K cells        280MB                     Memory intensive
SPaSM        BCC          64x64x64               128x128x64        150MB                     Compute intensive
Sweep3D      Pencil       20x10x400              40x20x400         8MB                       Kernel, small memory footprint
VH1          Shock_tube   200x200x200            400x400x200       900MB                     Memory/Compute intensive
VPIC         3D-HOT       4M particles           16M particles     256MB                     SSE optimized, memory intensive
Partisn – an SN transport code from Los Alamos [1] that solves the Boltzmann equation on structured meshes using the discrete-ordinates method.

S3D – a high-fidelity 3D simulation of turbulent combustion that includes detailed chemistry. It originates from Sandia National Laboratories [6].

SAGE – an adaptive mesh refinement (AMR) hydrodynamics code used for the simulation of shock waves. Developed jointly by Los Alamos and SAIC [9].

SPaSM – the Scalable Parallel Short-range Molecular dynamics code from Los Alamos, used to study material fracture and deformation properties [13].

Sweep3D – a code kernel from Los Alamos [7,10] that implements deterministic SN transport. The computation is in the form of wavefronts that originate at the corners of a 3-D physical space.

VH1 – the Virginia Hydrodynamics code simulates ideal, inviscid, compressible gas flow and is capable of simulating three-dimensional turbulent stellar flows [3].

VPIC – a Particle-in-Cell code from Los Alamos used to model particle flow within a plasma [4].

5. Application Performance Analysis

5.1. Comparison of single-core performance

The performance of each application on a single core of both processor types is shown in Figure 5(a). The metric of "time" denotes the iteration time of the main computational loop (without I/O) for all applications, except that Sweep3D and Partisn are run for 10 iterations and VPIC for 100 iterations to aid visual comparison. The reduction in application runtime on the Xeon core relative to the Barcelona core is shown in Figure 5(b): 50% indicates that the Tigerton has a runtime reduction of 50% (halving the iteration time); similarly, a value of -50% indicates that the Barcelona has a 50% reduction in runtime relative to the Tigerton core.
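Expressed as a formula (our reading of the definition above), the quantity plotted in Figure 5(b) is

\[ \textrm{advantage} = \left(1 - \frac{T_{Tigerton}}{T_{Barcelona}}\right) \times 100\% \quad \textrm{when the Tigerton core is faster,} \]
\[ \textrm{and} \quad -\left(1 - \frac{T_{Barcelona}}{T_{Tigerton}}\right) \times 100\% \quad \textrm{when the Barcelona core is faster.} \]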
The advantage of the Tigerton core over the Barcelona core is between 25% and 44%, depending on the application.
Figure 5. Single-core performance comparison across the nine applications: (a) application iteration time in seconds for the Intel Tigerton and AMD Barcelona; (b) runtime advantage of the Xeon core (Intel vs. AMD, 1 core), plotted from -100% to 100%.
Figure 6. Single-node (16 cores) application performance comparison: (a) application iteration time in seconds for the Intel Tigerton and AMD Barcelona; (b) runtime advantage of the Xeon node (Intel vs. AMD, 16 cores), plotted from -100% to 100%.
Recall that the Tigerton has a clock speed of 2.93GHz compared with the Barcelona's 2.0GHz, and that the memory bandwidth for a single core was 3.7GB/s and 4.4GB/s respectively. Hence up to a 50% advantage could be expected for computationally intensive codes executing the same instruction mix over the same number of cycles without memory stalls.

5.2. Comparison of node performance

The best performance of each application on the Xeon node (16 cores) and the Barcelona node (16 cores) is shown in Figure 6. In the same way as for the single-core comparison, Figure 6(a) depicts the iteration time and Figure 6(b) depicts the runtime advantage of the Xeon node over the Barcelona node. Several observations are clear in this comparison: i) the iteration time is lower in all cases compared to using only one core per socket; ii) the run-time advantage of the Xeon is much reduced from the single-core comparison of Section 5.1; and iii) the Xeon node no longer outperforms the Barcelona node for all applications – the run-time on the Barcelona node is lower by as much as 60%.

The results shown in Figure 6 are in line with the differences in the memory bandwidth measurements of Section 3.1. The per-core bandwidth when using all cores within a node is 0.63GB/s on the Xeon and 1.1GB/s on the Barcelona (an advantage of almost a factor of 2). The 60% runtime advantage of the Barcelona for node performance is directly in line with this for the memory-intensive applications SAGE and VH1.

Note, however, that the performance presented above is based on the best observed node performance – this does not necessarily result from using all 16 cores in the node. In fact the best performance observed on VPIC and Partisn was when using 2 cores per processor (8 cores total), and for SAGE when using 3 cores per processor (12 cores total) on the Xeon node. The best performance in all other cases was observed when using all 16 cores. In Section 5.3 below we analyze the performance as a function of the number of cores and number of processors used for each application.

5.3. Quad-core application scalability analysis

To analyze the performance of using multiple cores we followed a strict process in which the problem size per processor was constant for all tests, as described in Section 4. For each application we show the performance when using between one and four cores per processor, for the cases of using one processor and four processors in a node. Note that the performance relative to the single-core performance for each processor type is shown – this is the speedup when using multiple cores. Figure 7 shows this data for all applications; the legend for all graphs is shown in Figure 7(i).

The first observation is that the scalability is higher on the Barcelona quad-core than on the Xeon quad-core. This explains why the performance advantage of the Xeon quad-core node is smaller than its advantage for a single core. The applications are ordered in terms of their observed scalability in Figure 7. The applications with the best scaling behavior on both nodes are Milagro, SPaSM and Sweep3D (Figures 7(a)-7(c)). Both Milagro and SPaSM are compute bound, and Sweep3D has a small memory footprint resulting in high cache utilization. In contrast, SAGE and Partisn are memory bound and show the slowest scalability on both nodes. Codes that are neither compute nor memory bound scale better on the Barcelona than on the Xeon – that is, VH1, GTC, VPIC and S3D.

Note that even if an application has a higher speedup on the Barcelona node in comparison to the Xeon node, it does not necessarily have higher performance, as was evident in Figure 6(b).
Figure 7. Application speedup when using multiple cores. Each panel plots performance relative to a single core against cores per socket (1-4) for the AMD Barcelona and Intel Tigerton in 1-socket and 4-socket configurations: (a) Milagro, (b) SPaSM, (c) Sweep3D, (d) VH1, (e) GTC, (f) VPIC, (g) S3D, (h) SAGE, (i) Partisn.
6. Conclusions

Using a suite of applications we have evaluated the first generation of quad-core processors available from AMD and Intel. Data was obtained using a strict measurement methodology that used a shim to control the mapping of application processes to processors, and we compared per-core performance as well as scaling up to the 16 cores in a node. The process followed is directly applicable to other multi-core studies.

When considering the performance of a single core, where there are no memory bottlenecks, the higher clock speed of Intel's Xeon gives applications a measured 25-44% reduction in runtime compared with AMD's Barcelona. When using all of the cores in a processor, the results depend more on the way each application uses memory; the Barcelona has the edge in memory bandwidth available on a single processor.

Finally, when examining scaling across the entire 16-core node, the results are somewhat mixed. In general, applications with small memory footprints perform better on the Xeon and see almost perfect scaling. Memory-bandwidth-intensive applications, on the other hand, scale better on the Barcelona because of the reduced memory contention. For many of the applications, the better scaling behavior results in higher achievable performance on the Barcelona.

While this study represents a snapshot of current processors and node architectures, it also represents a snapshot of current application structures. All of the applications we ran use the "one MPI rank per core" model. Although this is an extremely portable way to structure an application, it may be possible to gain more performance by exploiting the properties of multi-core processors, such as the fact that physically proximate processes can benefit from sharing cached data. We have shown that, for applications as they exist today, it is important to consider the balance between compute rate and memory rate when selecting a processor from which to build a cluster. Neither the Barcelona nor the Xeon is unambiguously faster than the other. The decision of which to use must be made on a per-application (or per-workload) basis and can benefit from the results we presented in this paper.

Acknowledgements

We thank AMD and Intel for providing early systems for this performance evaluation. This work was funded in part by the Accelerated Strategic Computing program and the Office of Science of the Department of Energy. Los Alamos National Laboratory is operated by Los Alamos National Security LLC for the US Department of Energy under contract DE-AC52-06NA25396.

References

[1] R.S. Baker. "A Block Adaptive Mesh Refinement Algorithm for the Neutral Particle Transport Equation", Nuclear Science & Engineering, 141(1), pp. 1-12, 2002.

[2] A. Hoisie, G. Johnson, D.J. Kerbyson, M. Lang, S. Pakin. "A Performance Comparison Through Benchmarking and Modeling of Three Leading Supercomputers: Blue Gene/L, Red Storm, and Purple", in Proc. IEEE/ACM Conf. on Supercomputing (SC06), Tampa, FL, 2006.

[3] J.M. Blondin. VH-1 User's Guide. North Carolina State University, 1999.

[4] K. Bowers. "Speed optimal implementation of a fully relativistic 3d particle push with charge conserving current accumulation on modern processors", in Proc. 18th Int. Conf. on Numerical Simulation of Plasmas, 2003, p. 383.

[5] T.M. Evans, T.J. Urbatsch. "MILAGRO: A parallel Implicit Monte Carlo code for 3-D radiative transfer (U)", in Proc. of the Nuclear Explosives Code Development Conference, Las Vegas, NV, Oct. 1998.

[6] E.R. Hawkes, R. Sankaran, J.C. Sutherland, J.H. Chen. "Direct numerical simulation of turbulent combustion: fundamental insights towards predictive models", J. of Physics: Conference Series, 16:65-79, 2005.

[7] A. Hoisie, O. Lubeck, H.J. Wasserman. "Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures using Multidimensional Wavefront Applications", Int. J. of High Performance Computing Applications, vol. 14, no. 4, pp. 330-346, 2000.

[8] Intel Corporation. Quad-core Intel Xeon Processor 7300 Series. Product Brief, 2007.

[9] D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, M.L. Gittings. "Predictive Performance and Scalability Modeling of a Large-scale Application", in Proc. IEEE/ACM SC, Denver, 2001.

[10] K.R. Koch, R.S. Baker, R.E. Alcouffe. "Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor", Trans. of the American Nuclear Soc., 65:198-199, 1992.

[11] J. McCalpin. "Memory bandwidth and machine balance in current high performance computers", IEEE Comp. Soc. Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19-25, Dec. 1995.

[12] N. Wichmann, M. Adams, S. Ethier. "New Advances in the Gyrokinetic Toroidal Code and Their Impact on Performance on the Cray XT Series", in Proc. Cray User Group (CUG), Seattle, WA, 2007.

[13] S.J. Zhou, D.M. Beazley, P.S. Lomdahl, B.L. Holian. "Large-scale molecular dynamics simulations of fracture and deformation", J. of Computer-Aided Materials Design, 3(1-3), pp. 183-186, 1995.