This document summarizes the Texas Advanced Computing Center's (TACC) experience using DDN's Infinite Memory Engine (IME) as a burst buffer for three HPC applications on the Stampede supercomputer. Initial testing showed I/O bottlenecks that were addressed by improving the InfiniBand topology. Performance testing found the IME provided significant acceleration over the Lustre parallel filesystem, with speedups ranging from 3.7x to 28x for the HACC cosmology code, 6.8x to 22.3x for the S3D combustion code, and 6.2x to 10.1x for the MADBench mini-app. The IME demonstrated its ability to scale and to accelerate application I/O on a large production system.
DDN IME Evaluation Shows Significant Performance Boost for HPC Workloads
1. Site-Wide Storage Use Case and Early User Experience with Infinite Memory Engine
Tommy Minyard
Texas Advanced Computing Center
DDN User Group Meeting
November 17, 2014
2. TACC Mission & Strategy
The mission of the Texas Advanced Computing Center is to enable
scientific discovery and enhance society through the application of
advanced computing technologies.
To accomplish this mission, TACC:
– Evaluates, acquires & operates
advanced computing systems
– Provides training, consulting, and
documentation to users
– Collaborates with researchers to
apply advanced computing techniques
– Conducts research & development to
produce new computational technologies
(Diagram: Resources & Services; Research & Development)
3. TACC Storage Needs
• Cluster-specific storage
– High performance (tens to hundreds of GB/s bandwidth)
– Large capacity (~2 TB per teraflop), purged frequently
– Very scalable to thousands of clients
• Center-wide persistent storage
– Global filesystem available on all systems
– Very large capacity, quota enabled
– Moderate performance, very reliable, high availability
• Permanent archival storage
– Maximum capacity, tens of PBs
– Slow performance, tape-based offline storage with a spinning-disk cache
4. History of DDN at TACC
• 2006 – Lonestar 3 with DDN S2A9500
controllers and 120TB of disk
• 2008 – Corral with DDN S2A9900 controller
and 1.2PB of disk
• 2010 – Lonestar 4 with DDN SFA10000
controllers with 1.8PB of disk
• 2011 – Corral upgrade with DDN SFA10000
controllers and 5PB of disk
5. Global Filesystem Requirements
• User requests for persistent storage available
on all production systems
– Corral limited to UT System users only
• RFP issued for storage system capable of:
– At least 20PB of usable storage
– At least 100GB/s aggregate bandwidth
– High availability and reliability
• DDN proposal selected for project
6. Stockyard: Design and Setup
• A Lustre 2.4.2 based global filesystem, with scalability for future upgrades
• Scalable Unit (SU): 16 OSS nodes providing access to 168 RAID6 OSTs from two SFA12K couplets, corresponding to 5PB capacity and 25+ GB/s throughput per SU
• Four SUs provide 25PB raw with >100GB/s (aggregates sketched below)
• 16 initial LNET routers for external mounts
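The system-wide totals follow directly from the per-SU figures; a minimal Python sketch of that arithmetic (the per-SU numbers come from the bullets above, and the usable/raw distinction follows the capacities quoted on this and the next slide):

```python
# Illustrative check of the Stockyard Scalable Unit (SU) aggregates
# quoted above; per-SU figures come from the bullet points.
SU_COUNT = 4
OSS_PER_SU = 16
OSTS_PER_SU = 168
USABLE_PB_PER_SU = 5        # usable capacity per SU, PB (25PB raw in total)
THROUGHPUT_GBS_PER_SU = 25  # sustained throughput per SU, GB/s

print(f"OSS nodes : {SU_COUNT * OSS_PER_SU}")
print(f"OSTs      : {SU_COUNT * OSTS_PER_SU}")
print(f"Capacity  : {SU_COUNT * USABLE_PB_PER_SU} PB usable")
print(f"Bandwidth : {SU_COUNT * THROUGHPUT_GBS_PER_SU}+ GB/s aggregate")
```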
11. Stockyard: Capabilities and Features
• 20PB usable capacity with 100+ GB/s aggregate bandwidth
• Client systems can add LNET routers to connect to the Stockyard core IB switches, or connect to the built-in LNET routers over IB (FDR14) or TCP (10GigE)
• Automatic failover with Corosync and Pacemaker
12. Stockyard: Performance
• Local storage testing surpassed 100GB/s
• Initial bandwidth from Stampede compute clients using Lustre 2.1.6 and 16 routers: 65GB/s with 256 clients (IOR, POSIX, file-per-process, 8 tasks per node; a minimal sketch of this access pattern follows)
• After upgrading Stampede clients to Lustre 2.5.2: 75GB/s
• Added 8 LNET routers to connect the Maverick visualization system: 38GB/s
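The IOR runs above used POSIX I/O with one file per process. A minimal sketch of that access pattern in Python with mpi4py; the scratch path, transfer size, and per-task volume are illustrative placeholders, not the benchmark's actual parameters:

```python
# Minimal file-per-process POSIX write test in the spirit of the IOR
# runs above (not IOR itself); parameters below are illustrative only.
import os
from mpi4py import MPI

SCRATCH = "/stockyard/iotest"   # placeholder path
BLOCK = 4 * 1024 * 1024         # assumed 4 MiB transfer size
BLOCKS_PER_TASK = 256           # 1 GiB written per MPI task

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = bytearray(os.urandom(BLOCK))

comm.Barrier()
t0 = MPI.Wtime()

# Each task writes its own file (file-per-process), like IOR's -F mode.
path = os.path.join(SCRATCH, f"ior_like.{rank:05d}")
with open(path, "wb", buffering=0) as f:
    for _ in range(BLOCKS_PER_TASK):
        f.write(buf)
    os.fsync(f.fileno())

comm.Barrier()
elapsed = MPI.Wtime() - t0

total_bytes = comm.allreduce(BLOCK * BLOCKS_PER_TASK, op=MPI.SUM)
if rank == 0:
    print(f"aggregate write bandwidth: {total_bytes / elapsed / 1e9:.1f} GB/s")
```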
13. Failover Testing
• OSS failover test setup and results
• Procedure:
– Identify the OSTs for the test pair
– Initiate write processes targeted to those OSTs, each about 67GB in size so that it does not finish before the failover
– Interrupt one of the OSS servers with a shutdown via ipmitool (a hedged sketch of this step follows)
– Record the individual write process outputs as well as server- and client-side Lustre messages
– Compare and confirm the recovery and operation of the failover pair with all OSTs
• All I/O completes within 2 minutes of failover
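A sketch of the interrupt step, assuming IPMI over LAN; the BMC hostname and credentials are placeholders and the timing is illustrative, not the procedure actually scripted at TACC:

```python
# Sketch of interrupting one OSS of the failover pair via its BMC,
# as in the procedure above. BMC address/credentials are placeholders.
import subprocess
import time

OSS_BMC = "oss-test-bmc.example"   # placeholder BMC hostname
IPMI_USER = "admin"                # placeholder
IPMI_PASS = "secret"               # placeholder

def ipmi(*args):
    """Run an ipmitool command against the target OSS BMC."""
    cmd = ["ipmitool", "-I", "lanplus",
           "-H", OSS_BMC, "-U", IPMI_USER, "-P", IPMI_PASS, *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

# Wait until the ~67GB writes targeting the OST pair are in flight,
# then power the server off to force the Corosync/Pacemaker failover.
time.sleep(30)
ipmi("chassis", "power", "off")

# ... observe client/server Lustre logs, then restore the node:
# ipmi("chassis", "power", "on")
```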
14. Failover Testing (cont’d)
• Similarly for the MDS pair: the same sequence of interrupted I/O and collection of Lustre messages on both servers and clients; the client-side log shows the recovery.
– Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 x1448277242593528/t0(0) o250->MGC192.168.200.10@o2ib100@192.168.200.10@o2ib100:26/25 lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
– Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
– Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at MGC192.168.200.10@o2ib100_1) after server handle changed from 0xb9929a99b6d258cd to 0x6282da9e97a66646
– Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: Connection restored to MGS (at 192.168.200.11@o2ib100)
15. Infinite Memory Engine Evaluation
• As with most HPC filesystems, applications rarely sustain the full bandwidth capability of the filesystem
• Still need the capacity of many disk spindles, plus a way to absorb bursts of I/O activity
• Stampede used to evaluate IME at scale, with the old /work filesystem as backend storage
16. IME Evaluation Hardware
• Old Stampede /work filesystem hardware
– Eight storage servers, 64 drives each
– Lustre 2.5.2 server version
– Capable of 24GB/s peak performance
– At ~50% of capacity from previous use
• IME hardware configuration
– Eight DDN IME servers fully populated with SSDs
– Two FDR IB connections per server
– 80GB/s peak performance
17. Initial IME Evaluation
• First testing showed bottlenecks, with write performance reaching only 40GB/s
• IB topology identified as the culprit: 12 of the IME IB ports connected to a single switch with only 8 uplinks to the core switches
• Redistributing the IME IB links across switches without oversubscription resolved the bottleneck (see the sketch below)
• Performance increased to almost 80GB/s after moving the IB connections
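A back-of-the-envelope look at that oversubscription, assuming roughly 6.8 GB/s of usable data rate per FDR link (the link counts come from the bullets above; the per-link rate and the congestion commentary are assumptions):

```python
# Rough look at the leaf-switch oversubscription behind the initial
# 40GB/s write ceiling described above. Per-link rate is an assumption.
FDR_GBS = 6.8            # assumed usable GB/s per FDR 4x link
ports_on_leaf = 12       # IME IB ports on the congested switch
uplinks_to_core = 8      # uplinks from that switch to the core

print(f"oversubscription ratio : {ports_on_leaf / uplinks_to_core:.1f}:1")
print(f"IME demand on the leaf : ~{ports_on_leaf * FDR_GBS:.0f} GB/s")
print(f"uplink capacity        : ~{uplinks_to_core * FDR_GBS:.0f} GB/s")
# With static IB routing, congestion on the shared uplinks typically costs
# more than the raw numbers suggest, consistent with the observed ~40GB/s.
# After redistributing, all 16 IME ports (8 servers x 2 FDR links) reach
# the core without oversubscription, consistent with the ~80GB/s observed.
```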
20. MADBench @ TACC
(Diagram: compute cluster writing to the IME burst buffer at 70+ GB/s vs. the Lustre PFS at 8.7 GB/s)
Phase   IME Read (GB/s)   IME Write (GB/s)   PFS Read (GB/s)   PFS Write (GB/s)
S              –               71.9               –                 7.1
W             74.6             75.5              7.8                8.7
C             74.7               –              11.9                 –
IME acceleration: 6.2x-9.6x (reads), 8.7x-10.1x (writes); recomputed in the sketch below
Application configuration: NP = 3136, #Bins = 8, #pix = 265K
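The acceleration range follows directly from the table; a small Python sketch recomputing it (values copied from the table above, nothing else assumed):

```python
# Recompute the MADBench IME vs. Lustre PFS acceleration from the table.
# Values are GB/s; None marks phases that only read or only write.
phases = {
    #     (IME read, IME write, PFS read, PFS write)
    "S": (None, 71.9, None, 7.1),
    "W": (74.6, 75.5, 7.8, 8.7),
    "C": (74.7, None, 11.9, None),
}

read_speedups = [ir / pr for ir, _, pr, _ in phases.values() if ir and pr]
write_speedups = [iw / pw for _, iw, _, pw in phases.values() if iw and pw]

print(f"read  acceleration: {min(read_speedups):.1f}x-{max(read_speedups):.1f}x")
print(f"write acceleration: {min(write_speedups):.1f}x-{max(write_speedups):.1f}x")
# -> reads ~6.3x-9.6x (the slide rounds down to 6.2x), writes ~8.7x-10.1x
```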
21. Summary
• Storage capacity and performance needs growing at an exponential rate
• High-performance and reliable filesystems critical for HPC productivity
• Current best solution for cost, performance and scalability is a Lustre-based filesystem
• Initial IME testing demonstrated scalability and capability on a large-scale system