Designing Software Libraries and Middleware for
Exascale Systems: Opportunities and Challenges
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
Talk at HPC Advisory Council Brazil Conference (2014)
High-End Computing (HEC): PetaFlop to ExaFlop
HPC Advisory Council Brazil Conference, May '14 2
Expected to have an ExaFlop system in 2020-2022!
(Chart annotations: 100 PFlops in 2015; 1 EFlops in 2018?)
• Scientific Computing
– Message Passing Interface (MPI), including MPI + OpenMP, is the
Dominant Programming Model
– Many discussions towards Partitioned Global Address Space
(PGAS)
• UPC, OpenSHMEM, CAF, etc.
– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
– Focuses on large data and data analysis
– Hadoop (HDFS, HBase, MapReduce)
– Spark is emerging for in-memory computing
– Memcached is also used for Web 2.0
3
Two Major Categories of Applications
HPC Advisory Council Brazil Conference, May '14
HPC Advisory Council Brazil Conference, May '14 4
Trends for Commodity Computing Clusters in the Top 500
List (http://www.top500.org)
(Chart: Number of Clusters and Percentage of Clusters in the Top500 over time)
Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing
(Figure: Multi-core Processors; Accelerators/Coprocessors with high compute density and high performance/watt, >1 TFlop DP on a chip; High-Performance Interconnects (InfiniBand) with <1 µs latency and >100 Gbps bandwidth; example systems: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10))
5
HPC Advisory Council Brazil Conference, May '14
• 207 IB Clusters (41%) in the November 2013 Top500 list
(http://www.top500.org)
• Installations in the Top 40 (19 systems):
HPC Advisory Council Brazil Conference, May '14
Large-scale InfiniBand Installations
– 519,640 cores (Stampede) at TACC (7th)
– 147,456 cores (SuperMUC) in Germany (10th)
– 74,358 cores (Tsubame 2.5) at Japan/GSIC (11th)
– 194,616 cores (Cascade) at PNNL (13th)
– 110,400 cores (Pangea) at France/Total (14th)
– 96,192 cores (Pleiades) at NASA/Ames (16th)
– 73,584 cores (Spirit) at USA/Air Force (18th)
– 77,184 cores (Curie thin nodes) at France/CEA (20th)
– 120,640 cores (Nebulae) at China/NSCS (21st)
– 72,288 cores (Yellowstone) at NCAR (22nd)
– 70,560 cores (Helios) at Japan/IFERC (24th)
– 138,368 cores (Tera-100) at France/CEA (29th)
– 60,000 cores (iDataPlex DX360M4) at Germany/Max-Planck (31st)
– 53,504 cores (PRIMERGY) at Australia/NCI (32nd)
– 77,520 cores (Conte) at Purdue University (33rd)
– 48,896 cores (MareNostrum) at Spain/BSC (34th)
– 222,072 cores (PRIMERGY) at Japan/Kyushu (36th)
– 78,660 cores (Lomonosov) in Russia (37th)
– 137,200 cores (Sunway Blue Light) in China (40th)
– and many more!
6
Towards Exascale System (Today and Target)
Systems: Tianhe-2 (2014) vs. projected Exascale (2020-2022), with the difference between today and Exascale:
– System peak: 55 PFlop/s → 1 EFlop/s (~20x)
– Power: 18 MW (3 Gflops/W) → ~20 MW (50 Gflops/W) (O(1), ~15x)
– System memory: 1.4 PB (1.024 PB CPU + 0.384 PB CoP) → 32-64 PB (~50x)
– Node performance: 3.43 TF/s (0.4 CPU + 3 CoP) → 1.2 or 15 TF (O(1))
– Node concurrency: 24-core CPU + 171-core CoP → O(1k) or O(10k) (~5x - ~50x)
– Total node interconnect BW: 6.36 GB/s → 200-400 GB/s (~40x - ~60x)
– System size (nodes): 16,000 → O(100,000) or O(1M) (~6x - ~60x)
– Total concurrency: 3.12M (12.48M threads, 4/core) → O(billion) for latency hiding (~100x)
– MTTI: Few/day → Many/day (O(?))
Courtesy: Prof. Jack Dongarra
7HPC Advisory Council Brazil Conference, May '14
HPC Advisory Council Brazil Conference, May '14 8
Parallel Programming Models Overview
(Figure: three abstract machine models, each with processes P1-P3)
– Shared Memory Model (a single shared memory): SHMEM, DSM
– Distributed Memory Model (a private memory per process): MPI (Message Passing Interface)
– Partitioned Global Address Space (PGAS) Model (private memories presented as a logical shared memory): Global Arrays, UPC, Chapel, X10, CAF, ...
• Programming models provide abstract machine models
• Models can be mapped on different types of systems
– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• Power required for data movement operations is one of
the main challenges
• Non-blocking collectives
– Overlap computation and communication
• Much improved One-sided interface
– Reduce synchronization of sender/receiver
• Manage concurrency
– Improved interoperability with PGAS (e.g. UPC, Global Arrays,
OpenSHMEM)
• Resiliency
– New interface for detecting failures
HPC Advisory Council Brazil Conference, May '14 9
How does MPI Plan to Meet Exascale Challenges?
• Major features
– Non-blocking Collectives
– Improved One-Sided (RMA) Model
– MPI Tools Interface
• Specification is available from: http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
HPC Advisory Council Brazil Conference, May '14 10
Major New Features in MPI-3
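As an illustration of the non-blocking collectives feature above, here is a minimal sketch (not from the talk) of overlapping computation with an MPI-3 MPI_Iallreduce; the do_independent_work() routine is a placeholder for application work that does not depend on the reduction result.

```c
#include <mpi.h>
#include <stdio.h>

/* Placeholder for computation that is independent of the reduction. */
static void do_independent_work(void) { /* ... */ }

int main(int argc, char **argv)
{
    int rank, in, out = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in = rank;
    /* Start the reduction without blocking the caller (MPI-3). */
    MPI_Iallreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &req);

    /* Work that does not depend on 'out' proceeds here, overlapping
       with the progress of the collective. */
    do_independent_work();

    /* Complete the collective before using the result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0)
        printf("sum of ranks = %d\n", out);

    MPI_Finalize();
    return 0;
}
```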
11
Partitioned Global Address Space (PGAS) Models
HPC Advisory Council Brazil Conference, May '14
• Key features
- Simple shared memory abstractions
- Light weight one-sided communication
- Easier to express irregular communication
• Different approaches to PGAS (a minimal OpenSHMEM sketch follows this list)
- Languages
• Unified Parallel C (UPC)
• Co-Array Fortran (CAF)
• X10
• Chapel
- Libraries
• OpenSHMEM
• Global Arrays
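To make the light-weight one-sided, symmetric-memory style concrete, here is a minimal illustrative sketch (not from the talk) written against the OpenSHMEM 1.0-era C API referenced later in this deck (start_pes/shmalloc; later versions rename these to shmem_init/shmem_malloc).

```c
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    start_pes(0);                      /* OpenSHMEM 1.0-style initialization */
    int me   = _my_pe();
    int npes = _num_pes();

    /* Symmetric allocation: the same remotely accessible buffer on every PE. */
    int *dst = (int *)shmalloc(sizeof(int));
    int src  = me;

    shmem_barrier_all();

    /* One-sided put: write this PE's rank into the next PE's buffer,
       with no matching receive on the target side. */
    shmem_int_put(dst, &src, 1, (me + 1) % npes);

    shmem_barrier_all();               /* ensure delivery before reading */
    printf("PE %d received %d\n", me, *dst);

    shfree(dst);
    return 0;
}
```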
• Hierarchical architectures with multiple address spaces
• (MPI + PGAS) Model
– MPI across address spaces
– PGAS within an address space
• MPI is good at moving data between address spaces
• Within an address space, MPI can interoperate with other shared
memory programming models
• Can co-exist with OpenMP for offloading computation
• Applications can have kernels with different communication patterns
• Can benefit from different models
• Re-writing complete applications can be a huge effort
• Port critical kernels to the desired model instead
HPC Advisory Council Brazil Conference, May '14 12
MPI+PGAS for Exascale Architectures and Applications
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based
on communication characteristics
• Benefits:
– Best of Distributed Computing Model
– Best of Shared Memory Computing Model
• Exascale Roadmap*:
– “Hybrid Programming is a practical way to
program exascale systems”
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011,
International Journal of High Performance Computer Applications, ISSN 1094-3420
(Figure: an HPC application composed of Kernels 1 to N, all initially written in MPI; Kernel 2 and Kernel N are re-written in PGAS while the other kernels remain in MPI)
HPC Advisory Council Brazil Conference, May '14 13
Designing Software Libraries for Multi-Petaflop
and Exaflop Systems: Challenges
(Figure: co-design opportunities and challenges across various layers, targeting performance, scalability and fault-resilience)
– Application Kernels/Applications
– Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
– Middleware: Communication Library or Runtime for Programming Models, providing Point-to-point Communication (two-sided & one-sided), Collective Communication, Synchronization & Locks, I/O & File Systems, and Fault Tolerance
– Hardware: Networking Technologies (InfiniBand, 40/100GigE, Aries, BlueGene), Multi/Many-core Architectures, Accelerators (NVIDIA and MIC)
14
HPC Advisory Council Brazil Conference, May '14
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Balancing intra-node and inter-node communication for next generation
multi-core (128-1024 cores/node)
– Multiple end-points per node
• Support for efficient multi-threading
• Support for GPGPUs and Accelerators
• Scalable Collective communication
– Offload
– Non-blocking
– Topology-aware
– Power-aware
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC,
MPI + OpenSHMEM, …)
Challenges in Designing (MPI+X) at Exascale
15
HPC Advisory Council Brazil Conference, May '14
• Extreme Low Memory Footprint
– Memory per core continues to decrease
• D-L-A Framework
– Discover
• Overall network topology (fat-tree, 3D, …)
• Network topology for processes for a given job
• Node architecture
• Health of network and node
– Learn
• Impact on performance and scalability
• Potential for failure
– Adapt
• Internal protocols and algorithms
• Process mapping
• Fault-tolerance solutions
– Low overhead techniques while delivering performance, scalability and fault-
tolerance
Additional Challenges for Designing Exascale Middleware
16
HPC Advisory Council Brazil Conference, May '14
• High Performance open-source MPI Library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
– MVAPICH2-X (MPI + PGAS), Available since 2012
– Support for GPGPUs and MIC
– Used by more than 2,150 organizations in 72 countries
– More than 212,000 downloads from OSU site directly
– Empowering many TOP500 clusters
• 7th ranked 519,640-core cluster (Stampede) at TACC
• 11th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
• 16th ranked 96,192-core cluster (Pleiades) at NASA
• 75th ranked 16,896-core cluster (Keeneland) at GaTech and many others . . .
– Available with software stacks of many IB, HSE, and server vendors including
Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede System
MVAPICH2/MVAPICH2-X Software
17HPC Advisory Council Brazil Conference, May '14
• Released on 05/25/14
• Major Features and Enhancements
– Based on MPICH-3.1
– CMA support is default
– Optimization of collectives with CMA support
– RMA optimizations for shared memory and atomics operations
– MPI-T support for additional performance and control variables
– Large message transfer support for PSM interface
– Optimization of collectives for PSM interface
– Updated hwloc to version 1.9
• MVAPICH2-X 2.0RC2 supports hybrid MPI + PGAS (UPC and OpenSHMEM)
programming models
– Based on MVAPICH2 2.0RC2 including MPI-3 features;
– Compliant with UPC 2.18.0 and OpenSHMEM v1.0f
18
MVAPICH2 2.0RC2 and MVAPICH2-X 2.0RC2
HPC Advisory Council Brazil Conference, May '14
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
Unified Runtime
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
19
HPC Advisory Council Brazil Conference, May '14
HPC Advisory Council Brazil Conference, May '14 20
One-way Latency: MPI over IB with MVAPICH2
(Chart: Small Message Latency in µs vs. message size; data labels: 1.66, 1.56, 1.64, 1.82, 0.99, 1.09, 1.12 µs)
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
(Chart: Large Message Latency in µs vs. message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR)
HPC Advisory Council Brazil Conference, May '14 21
Bandwidth: MPI over IB with MVAPICH2
(Chart: Unidirectional Bandwidth in MBytes/sec vs. message size; data labels: 3280, 3385, 1917, 1706, 6343, 12485, 12810 MB/s)
(Chart: Bidirectional Bandwidth in MBytes/sec vs. message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR; data labels: 3341, 3704, 4407, 11643, 6521, 21025, 24727 MB/s)
DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch
FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
(Chart: intra-node latency in µs vs. message size (0-1K), Intra-Socket and Inter-Socket)
MVAPICH2 Two-Sided Intra-Node Performance
(Shared memory and Kernel-based Zero-copy Support (LiMIC and CMA))
22
Latest MVAPICH2 2.0rc2
Intel Ivy-bridge
0.18 us
0.45 us
(Charts: intra-node bandwidth in MB/s vs. message size for inter-socket and intra-socket transfers using CMA, Shmem and LiMIC; peaks of 14,250 MB/s and 13,749 MB/s)
HPC Advisory Council Brazil Conference, May '14
MPI-3 RMA Get/Put with Flush Performance
23
Latest MVAPICH2 2.0rc2, Intel Sandy-bridge with Connect-IB (single-port)
(Chart: Inter-node Get/Put latency in µs vs. message size (1B-4K); callouts: 2.04 µs, 1.56 µs)
(Charts: Intra-socket Get/Put bandwidth in MB/s vs. message size (1B-256K) and Intra-socket Get/Put latency in µs vs. message size (1B-8K); intra-socket latency down to 0.08 µs)
HPC Advisory Council Brazil Conference, May '14
(Chart: Inter-node Get/Put bandwidth in MB/s vs. message size (1B-256K); data labels across the bandwidth charts: 6881, 6876, 14926, 15364 MB/s)
• Memory usage for 32K processes with 8-cores per node can be 54 MB/process (for connections)
• NAMD performance improves when there is frequent communication to many peers
HPC Advisory Council Brazil Conference, May '14
Minimizing memory footprint with XRC and Hybrid Mode
Memory Usage Performance on NAMD (1024 cores)
24
• Both UD and RC/XRC have benefits
• Hybrid for the best of both
• Available since MVAPICH2 1.7 as integrated interface
• Runtime Parameters: RC - default;
• UD - MV2_USE_ONLY_UD=1
• Hybrid - MV2_HYBRID_ENABLE_THRESHOLD=1
M. Koop, J. Sridhar and D. K. Panda, “Scalable MPI Design over InfiniBand using eXtended Reliable
Connection,” Cluster ‘08
(Charts: memory per process in MB vs. number of connections (1-16K) for MVAPICH2-RC and MVAPICH2-XRC; normalized NAMD time on the apoa1, er-gre, f1atpase and jac datasets for MVAPICH2-RC and MVAPICH2-XRC; and time in µs vs. number of processes (128-1024) for UD, Hybrid and RC modes; callouts: 26%, 40%, 30%, 38%)
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
Unified Runtime
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
25
HPC Advisory Council Brazil Conference, May '14
• Before CUDA 4: Additional copies
– Low performance and low productivity
• After CUDA 4: Host-based pipeline
– Unified Virtual Address
– Pipeline CUDA copies with IB transfers
– High performance and high productivity
• After CUDA 5.5: GPUDirect-RDMA support
– GPU to GPU direct transfer
– Bypass the host memory
– Hybrid design to avoid PCI bottlenecks
HPC Advisory Council Brazil Conference, May '14 26
MVAPICH2-GPU: CUDA-Aware MPI
(Figure: data path between GPU memory, GPU, CPU, chipset and the InfiniBand network)
MVAPICH2 1.8, 1.9, 2.0a-2.0rc2 Releases
• Support for MPI communication from NVIDIA GPU device
memory
• High performance RDMA-based inter-node point-to-point
communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point
communication for multi-GPU adapters/node (GPU-GPU,
GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in
intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective
communication from GPU device buffers
27HPC Advisory Council Brazil Conference, May '14
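To make the "CUDA-aware MPI" idea concrete, here is a minimal sketch (not from the talk) in which device pointers are passed directly to MPI calls; it assumes an MPI library built with CUDA support, such as the MVAPICH2 releases listed above, and the buffer size is an arbitrary illustrative value.

```c
#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)   /* arbitrary element count for illustration */

/* Run with at least two ranks. With a CUDA-aware MPI library, device
   pointers can be passed directly to MPI; the library stages/pipelines
   the copies or uses GPUDirect RDMA internally. */
int main(int argc, char **argv)
{
    int rank;
    float *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, N * sizeof(float));   /* GPU device memory */

    if (rank == 0)
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```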
• Hybrid design using GPUDirect RDMA
– GPUDirect RDMA and Host-based
pipelining
– Alleviates P2P bandwidth bottlenecks
on SandyBridge and IvyBridge
• Support for communication using multi-rail
• Support for Mellanox Connect-IB and
ConnectX VPI adapters
• Support for RoCE with Mellanox ConnectX
VPI adapters
HPC Advisory Council Brazil Conference, May '14 28
GPUDirect RDMA (GDR) with CUDA
(Figure: IB adapter, system memory, GPU memory, GPU, CPU and chipset; PCIe P2P write/read bandwidth: 5.2 GB/s / <1.0 GB/s on SNB E5-2670 and 6.4 GB/s / 3.5 GB/s on IVB E5-2680 v2)
S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient
Inter-node MPI Communication using GPUDirect RDMA for InfiniBand
Clusters with NVIDIA GPUs, Int'l Conference on Parallel Processing (ICPP '13)
29
Performance of MVAPICH2 with GPUDirect-RDMA: Latency
GPU-GPU Internode MPI Latency
(Chart: Large Message Latency in µs vs. message size (8K-2M) for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR)
Based on MVAPICH2-2.0b
Intel Ivy Bridge (E5-2680 v2) node with 20 cores
NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA
CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
(Chart: Small Message Latency in µs vs. message size (1B-4K) for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; callouts: 10%, 67%, 5.49 µs)
HPC Advisory Council Brazil Conference, May '14
30
Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth
GPU-GPU Internode MPI Uni-Directional Bandwidth
(Charts: small-message (1B-4K) and large-message (8K-2M) GPU-GPU internode unidirectional bandwidth in MB/s for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; callouts: 5x, 9.8 GB/s)
HPC Advisory Council Brazil Conference, May '14
Based on MVAPICH2-2.0b
Intel Ivy Bridge (E5-2680 v2) node with 20 cores
NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA
CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
31
Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth
GPU-GPU Internode MPI Bi-directional Bandwidth
(Charts: small-message (1B-4K) and large-message (8K-2M) GPU-GPU internode bi-directional bandwidth in MB/s for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; callouts: 4.3x, 19%, 19 GB/s)
HPC Advisory Council Brazil Conference, May '14
Based on MVAPICH2-2.0b
Intel Ivy Bridge (E5-2680 v2) node with 20 cores
NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA
CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
32
MPI-3 RMA Support with GPUDirect RDMA
MPI-3 RMA provides flexible synchronization and completion primitives
(Charts: small message latency in µs and small message rate in messages/sec vs. message size, comparing Send-Recv, Put+ActiveSync and Put+Flush; callout: < 2 µs)
Based on MVAPICH2-2.0b + enhancements
Intel Ivy Bridge (E5-2630 v2) node with 12 cores
NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA
CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
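As an illustration of the Put+Flush pattern benchmarked above, here is a minimal host-memory sketch (not from the talk) using MPI-3 passive-target RMA; communication from GPU buffers would additionally require a CUDA-aware, GPUDirect-RDMA-enabled build.

```c
#include <mpi.h>

/* Run with at least two ranks. Rank 0 writes into rank 1's window and
   completes the operation remotely with MPI_Win_flush, with no matching
   receive on the target. */
int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process exposes one int through the window. */
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        int payload = 42;
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);     /* passive target epoch */
        MPI_Put(&payload, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_flush(1, win);                        /* remote completion */
        MPI_Win_unlock(1, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```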
33
Communication Kernel Evaluation: 3Dstencil and Alltoall
(Charts: latency in µs vs. message size (1B-4K) for Send/Recv and One-sided, for a 3D Stencil communication kernel and AlltoAll, each with 16 GPU nodes; callouts: 42%, 50%)
HPC Advisory Council Brazil Conference, May '14
Based on MVAPICH2-2.0b + enhancements
Intel Ivy Bridge (E5-2630 v2) node with 12 cores
NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA
CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA Plug-in
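For reference, the two-sided (Send/Recv) exchange that such a stencil kernel performs looks roughly like the following simplified 1-D sketch (not from the talk); a real 3-D stencil exchanges six faces and may use GPU-resident buffers, and the face size is arbitrary.

```c
#include <mpi.h>

#define HALO 1024   /* arbitrary halo face size for illustration */

int main(int argc, char **argv)
{
    int rank, size, i;
    double send_l[HALO], send_r[HALO], recv_l[HALO], recv_r[HALO];
    MPI_Request req[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < HALO; i++)           /* fill faces with dummy data */
        send_l[i] = send_r[i] = (double)rank;

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    /* Post receives first, then sends, then wait: classic halo exchange. */
    MPI_Irecv(recv_l, HALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_r, HALO, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(send_r, HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(send_l, HALO, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
```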
How can I get Started with GDR Experimentation?
• MVAPICH2-2.0b with GDR support can be downloaded from
https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
– Mellanox OFED 2.1 or later
– NVIDIA Driver 331.20 or later
– NVIDIA CUDA Toolkit 5.5 or later
– Plugin for GPUDirect RDMA
(http://www.mellanox.com/page/products_dyn?product_family=116)
• Has optimized designs for point-to-point communication using GDR
• Work under progress for optimizing collective and one-sided communication
• Contact the MVAPICH help list with any questions related to the package: mvapich-help@cse.ohio-state.edu
• MVAPICH2-GDR-RC2 with additional optimizations coming soon!!
34HPC Advisory Council Brazil Conference, May '14
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
Unified Runtime
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
35
HPC Advisory Council Brazil Conference, May '14
MPI Applications on MIC Clusters
(Figure: MPI job configurations on Xeon + Xeon Phi nodes, ranging from multi-core centric to many-core centric)
– Host-only: MPI processes on the Xeon host only
– Offload (/reverse Offload): MPI program on the host with computation offloaded to the Xeon Phi
– Symmetric: MPI processes on both the host and the Xeon Phi
– Coprocessor-only: MPI processes on the Xeon Phi only
• Flexibility in launching MPI jobs on clusters with Xeon Phi
36HPC Advisory Council Brazil Conference, May '14
Data Movement on Intel Xeon Phi Clusters
(Figure: two nodes, each with CPUs connected by QPI, MICs attached over PCIe, and an IB adapter; each MPI process may run on a CPU or a MIC)
Communication paths:
1. Intra-Socket
2. Inter-Socket
3. Inter-Node
4. Intra-MIC
5. Intra-Socket MIC-MIC
6. Inter-Socket MIC-MIC
7. Inter-Node MIC-MIC
8. Intra-Socket MIC-Host
9. Inter-Socket MIC-Host
10. Inter-Node MIC-Host
11. Inter-Node MIC-MIC with IB adapter on remote socket
and more . . .
37
• Critical for runtimes to optimize data movement, hiding the complexity
• Connected as PCIe devices – Flexibility but Complexity
HPC Advisory Council Brazil Conference, May '14
38
MVAPICH2-MIC Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
• Coprocessor-only Mode
• Symmetric Mode
• Internode Communication
• Coprocessors-only
• Symmetric Mode
• Multi-MIC Node Configurations
HPC Advisory Council Brazil Conference, May '14
Direct-IB Communication
39
(Figure: IB FDR HCA (MT4099) performing P2P writes to and reads from Xeon Phi memory over PCIe, intra-socket and inter-socket, on SandyBridge E5-2670 and IvyBridge E5-2680 v2; achieved bandwidths range from 247 MB/s (4%) to 6396 MB/s (100%) of the 6397 MB/s peak IB FDR bandwidth)
• IB-verbs based communication from Xeon Phi
• No porting effort for MPI libraries with IB support
• Data movement between IB HCA and Xeon Phi: implemented as peer-to-peer transfers
HPC Advisory Council Brazil Conference, May '14
MIC-Remote-MIC P2P Communication
40HPC Advisory Council Brazil Conference, May '14
(Charts: MIC-to-remote-MIC P2P latency for large messages (µs, 8K-2M) and bandwidth (MB/s, 1B-1M), for intra-socket P2P and inter-socket P2P; peak bandwidths of 5236 MB/s and 5594 MB/s)
InterNode – Symmetric Mode - P3DFFT Performance
41HPC Advisory Council Brazil Conference, May '14
• To explore different threading levels for host and Xeon Phi
(Charts: P3DFFT (512x512x4K) time per step for 4+4 and 8+8 processes on Host+MIC (32 nodes), and 3DStencil communication-kernel time per step for 2+2 and 2+8 Host+MIC processes with 8 threads/process (8 nodes), comparing MV2-INTRA-OPT and MV2-MIC; callouts: 50.2%, 16.2%)
42
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
HPC Advisory Council Brazil Conference, May '14
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance
Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014
(Charts: 32-node Allgather latency for small messages (16H + 16M) and large messages (8H + 8M), and 32-node Alltoall latency for large messages (8H + 8M), comparing MV2-MIC and MV2-MIC-Opt; P3DFFT performance on 32 nodes (8H + 8M, size 2K*2K*1K), split into communication and computation, for MV2-MIC-Opt and MV2-MIC; callouts: 76%, 58%, 55%)
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
Unified Runtime
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
43
HPC Advisory Council Brazil Conference, May '14
• Component failures are common in large-scale clusters
• Imposes the need for reliability and fault tolerance
• Multiple challenges:
– Checkpoint-Restart vs. Process Migration
– Benefits of Scalable Checkpoint Restart (SCR) Support
• Application guided
• Application transparent
HPC Advisory Council Brazil Conference, May '14 44
Fault Tolerance
HPC Advisory Council Brazil Conference, May '14 45
Checkpoint-Restart vs. Process Migration
Low Overhead Failure Prediction with IPMI
X. Ouyang, R. Rajachandrasekar, X. Besseron, D. K. Panda, High Performance Pipelined Process Migration with RDMA,
CCGrid 2011
X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, RDMA-Based Job Migration Framework for MPI over
InfiniBand, Cluster 2010
(Chart: time to migrate 8 MPI processes; callouts: 10.7X, 2.3X, 4.3X)
• Job-wide Checkpoint/Restart is not scalable
• Job-pause and Process Migration framework can deliver pro-active fault-tolerance
• Also allows for cluster-wide load-balancing by means of job compaction
(Chart: execution time in seconds of the LU Class C benchmark (64 processes), broken into Job Stall, Checkpoint (Migration), Resume and Restart phases, comparing Migration w/o RDMA, CR (ext3) and CR (PVFS); callouts: 2.03x, 4.49x)
Multi-Level Checkpointing with ScalableCR (SCR)
46HPC Advisory Council Brazil Conference, May '14
(Figure: checkpoint cost and resiliency range from low to high across the storage hierarchy listed below)
• LLNL’s Scalable Checkpoint/Restart
library
• Can be used for application guided and
application transparent checkpointing
• Effective utilization of storage hierarchy
– Local: Store checkpoint data on node’s local
storage, e.g. local disk, ramdisk
– Partner: Write to local storage and on a
partner node
– XOR: Write file to local storage and small sets
of nodes collectively compute and store parity
redundancy data (RAID-5)
– Stable Storage: Write to parallel file system
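A minimal application-guided sketch (not from the talk) against the SCR C API is shown below; the checkpoint file name and the write_state() helper are placeholders, and error checking is omitted.

```c
#include <mpi.h>
#include <stdio.h>
#include "scr.h"

/* Placeholder: write this rank's application state to the given path. */
static void write_state(const char *path)
{
    FILE *f = fopen(path, "w");
    if (f) { fprintf(f, "application state\n"); fclose(f); }
}

int main(int argc, char **argv)
{
    int rank, need;
    char name[256], path[SCR_MAX_FILENAME];

    MPI_Init(&argc, &argv);
    SCR_Init();                      /* initialize SCR after MPI_Init */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... inside the application's time-step loop ... */
    SCR_Need_checkpoint(&need);      /* let SCR decide when to checkpoint */
    if (need) {
        SCR_Start_checkpoint();
        snprintf(name, sizeof(name), "rank_%d.ckpt", rank);
        SCR_Route_file(name, path);  /* SCR picks Local/Partner/XOR/PFS target */
        write_state(path);
        SCR_Complete_checkpoint(1);  /* 1 = this rank's files are valid */
    }

    SCR_Finalize();                  /* before MPI_Finalize */
    MPI_Finalize();
    return 0;
}
```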
Application-guided Multi-Level Checkpointing
47HPC Advisory Council Brazil Conference, May '14
(Chart: checkpoint writing time in seconds of a representative SCR-enabled application, comparing PFS, MVAPICH2+SCR (Local), MVAPICH2+SCR (Partner) and MVAPICH2+SCR (XOR))
• Checkpoint writing phase times of representative SCR-enabled MPI application
• 512 MPI processes (8 procs/node)
• Approx. 51 GB checkpoints
Transparent Multi-Level Checkpointing
48HPC Advisory Council Brazil Conference, May '14
• ENZO Cosmology application – Radiation Transport workload
• Using MVAPICH2’s CR protocol instead of the application’s in-built CR mechanism
• 512 MPI processes running on the Stampede Supercomputer@TACC
• Approx. 12.8 GB checkpoints: ~2.5X reduction in checkpoint I/O costs
• Scalability for million to billion processors
– Support for highly-efficient inter-node and intra-node communication (both two-sided
and one-sided)
– Extremely small memory footprint
• Support for GPGPUs
• Support for Intel MICs
• Fault-tolerance
• Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with
Unified Runtime
Overview of A Few Challenges being Addressed by
MVAPICH2/MVAPICH2-X for Exascale
49
HPC Advisory Council Brazil Conference, May '14
MVAPICH2-X for Hybrid MPI + PGAS Applications
HPC Advisory Council Brazil Conference, May '14 50
(Figure: MPI, OpenSHMEM, UPC and Hybrid (MPI + PGAS) applications issue OpenSHMEM, UPC and MPI calls into the Unified MVAPICH2-X Runtime, which runs over InfiniBand, RoCE and iWARP)
• Unified communication runtime for MPI, UPC, OpenSHMEM available with
MVAPICH2-X 1.9 onwards!
– http://mvapich.cse.ohio-state.edu
• Feature Highlights
– Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM,
MPI(+OpenMP) + UPC
– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard
compliant
– Scalable Inter-node and Intra-node communication – point-to-point and collectives
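The following minimal sketch (not from the talk) illustrates the hybrid model enabled by a unified runtime such as MVAPICH2-X, mixing OpenSHMEM one-sided operations and MPI collectives in one program; the initialization order shown is illustrative only, and the exact requirements are implementation-specific (see the MVAPICH2-X user guide).

```c
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                      /* OpenSHMEM 1.0-style initialization */

    int me   = _my_pe();
    int npes = _num_pes();
    int *counter = (int *)shmalloc(sizeof(int));   /* symmetric allocation */
    *counter = 0;
    shmem_barrier_all();

    /* PGAS-style one-sided update: every PE increments a counter on PE 0. */
    shmem_int_add(counter, 1, 0);
    shmem_barrier_all();

    /* MPI-style collective in the same program. */
    int sum = 0, one = 1;
    MPI_Reduce(&one, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (me == 0)
        printf("shmem counter = %d, mpi sum = %d (npes = %d)\n",
               *counter, sum, npes);

    shfree(counter);
    MPI_Finalize();
    return 0;
}
```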
OpenSHMEM Collective Communication Performance
51
Reduce (1,024 processes) Broadcast (1,024 processes)
Collect (1,024 processes) Barrier
(Charts: Reduce, Broadcast and Collect latency in µs vs. message size at 1,024 processes, and Barrier latency in µs vs. number of processes (128-2048), comparing MV2X-SHMEM, Scalable-SHMEM and OMPI-SHMEM; callouts: 180X, 18X, 12X, 3X)
HPC Advisory Council Brazil Conference, May '14
OpenSHMEM Application Evaluation
52
(Charts: execution time in seconds at 256 and 512 processes for the Heat Image and Daxpy applications, comparing UH-SHMEM, OMPI-SHMEM (FCA), Scalable-SHMEM (FCA) and MV2X-SHMEM)
• Improved performance for OMPI-SHMEM and Scalable-SHMEM with FCA
• Execution time for 2DHeat Image at 512 processes (sec):
- UH-SHMEM – 523, OMPI-SHMEM – 214, Scalable-SHMEM – 193, MV2X-SHMEM – 169
• Execution time for DAXPY at 512 processes (sec):
- UH-SHMEM – 57, OMPI-SHMEM – 56, Scalable-SHMEM – 9.2, MV2X-SHMEM – 5.2
J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, A Comprehensive Performance Evaluation of
OpenSHMEM Libraries on InfiniBand Clusters, OpenSHMEM Workshop (OpenSHMEM’14), March 2014
HPC Advisory Council Brazil Conference, May '14
HPC Advisory Council Brazil Conference, May '14 53
Hybrid MPI+OpenSHMEM Graph500 Design
Execution Time
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models,
International Supercomputing Conference (ISC’13), June 2013
(Charts: billions of traversed edges per second (TEPS) vs. scale (26-29) and vs. number of processes (1024-8192), comparing MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM); strong and weak scalability)
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance
Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
(Chart: execution time in seconds at 4K, 8K and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM); callouts: 13X, 7.6X)
• Performance of Hybrid (MPI+OpenSHMEM)
Graph500 Design
• 8,192 processes
- 2.4X improvement over MPI-CSR
- 7.6X improvement over MPI-Simple
• 16,384 processes
- 1.5X improvement over MPI-CSR
- 13X improvement over MPI-Simple
54
Hybrid MPI+OpenSHMEM Sort Application
Execution Time Strong Scalability
• Performance of Hybrid (MPI+OpenSHMEM) Sort Application
• Execution Time
- 4TB Input size at 4,096 cores: MPI – 2408 seconds, Hybrid: 1172 seconds
- 51% improvement over MPI-based design
• Strong Scalability (configuration: constant input size of 500GB)
- At 4,096 cores: MPI – 0.16 TB/min, Hybrid – 0.36 TB/min
- 55% improvement over MPI based design
(Charts: sort rate in TB/min vs. number of processes (512-4,096) and execution time in seconds for input sizes 500GB-512 through 4TB-4K, comparing MPI and Hybrid; callouts: 51%, 55%)
HPC Advisory Council Brazil Conference, May '14
• Performance and Memory scalability toward 900K-1M cores
– Dynamically Connected Transport (DCT) service with Connect-IB
• Enhanced Optimization for GPGPU and Coprocessor Support
– Extending the GPGPU support (GPU-Direct RDMA) with CUDA 6.5 and Beyond
– Support for Intel MIC (Knights Landing)
• Taking advantage of Collective Offload framework
– Including support for non-blocking collectives (MPI 3.0)
• RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Support for MPI Tools Interface (as in MPI 3.0)
• Checkpoint-Restart and migration support with in-memory checkpointing
• Hybrid MPI+PGAS programming support with GPGPUs and Accelerators
MVAPICH2/MVAPICH2-X – Plans for Exascale
55
HPC Advisory Council Brazil Conference, May '14
• Scientific Computing
– Message Passing Interface (MPI), including MPI + OpenMP, is the
Dominant Programming Model
– Many discussions towards Partitioned Global Address Space
(PGAS)
• UPC, OpenSHMEM, CAF, etc.
– Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC)
• Big Data/Enterprise/Commercial Computing
– Focuses on large data and data analysis
– Hadoop (HDFS, HBase, MapReduce)
– Spark is emerging for in-memory computing
– Memcached is also used for Web 2.0
56
Two Major Categories of Applications
HPC Advisory Council Brazil Conference, May '14
Introduction to Big Data Applications and Analytics
• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities
for enterprise information management
and decision making
• The amount of data is exploding;
companies are capturing and digitizing
more information than ever
• The rate of information growth appears
to be exceeding Moore’s Law
57HPC Advisory Council Brazil Conference, May '14
• Commonly accepted 3V’s of Big Data
• Volume, Velocity, Variety
Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf
• 5V’s of Big Data – 3V + Value, Veracity
Can High-Performance Interconnects Benefit Big Data
Middleware?
• Most of the current Big Data middleware use Ethernet
infrastructure with Sockets
• Concerns for performance and scalability
• Usage of high-performance networks is beginning to draw
interest from many companies
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar
to the designs adopted for MPI)?
• Can HPC Clusters with high-performance networks be used
for Big Data middleware?
• Initial focus: Hadoop, HBase, Spark and Memcached
58HPC Advisory Council Brazil Conference, May '14
Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges
(Figure: layered stack)
– Applications
– Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached); upper-level changes?
– Programming Models (Sockets); other protocols?
– Communication and I/O Library: Point-to-Point Communication, Threaded Models and Synchronization, QoS, Fault-Tolerance, I/O and File Systems, Virtualization, Benchmarks, RDMA Protocol
– Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs), Storage Technologies (HDD and SSD), Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators)
HPC Advisory Council Brazil Conference, May '14
59
Big Data Processing with Hadoop
• The open-source implementation
of MapReduce programming
model for Big Data Analytics
• Major components
– HDFS
– MapReduce
• Underlying Hadoop Distributed
File System (HDFS) used by both
MapReduce and HBase
• Model scales but high amount of
communication during
intermediate phases can be
further optimized
60
(Figure: Hadoop Framework: User Applications on top of MapReduce and HBase, over Hadoop Common (RPC) and HDFS)
HPC Advisory Council Stanford Conference, Feb '14
Two Broad Approaches for Accelerating Hadoop
• All components of Hadoop (HDFS, MapReduce, RPC, HBase,
etc.) working on native IB verbs while exploiting RDMA
• Hadoop Map-Reduce over verbs for existing HPC clusters with
Lustre (no HDFS)
61HPC Advisory Council Brazil Conference, May '14
• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance design with native InfiniBand and RoCE support at the verbs-
level for HDFS, MapReduce, and RPC components
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-
based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.9 (03/31/14)
– Based on Apache Hadoop 1.2.1
– Compliant with Apache Hadoop 1.2.1 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR and FDR)
• RoCE
• Various multi-core platforms
• Different file systems with disks and SSDs
– http://hadoop-rdma.cse.ohio-state.edu
RDMA for Apache Hadoop Project
62HPC Advisory Council Brazil Conference, May '14
Reducing Communication Time with HDFS-RDMA Design
• Cluster with HDD DataNodes
– 30% improvement in communication time over IPoIB (QDR)
– 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes
Reduced
by 30%
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,
High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012
63HPC Advisory Council Brazil Conference, May '14
(Chart: communication time in seconds vs. file size (2-10 GB) for 10GigE, IPoIB (QDR) and OSU-IB (QDR))
Evaluations using TestDFSIO for HDFS-RDMA Design
• Cluster with 8 HDD DataNodes, single disk per node
– 24% improvement over IPoIB (QDR) for 20GB file size
• Cluster with 4 SSD DataNodes, single SSD per node
– 61% improvement over IPoIB (QDR) for 20GB file size
Cluster with 8 HDD Nodes, TestDFSIO with 8 maps Cluster with 4 SSD Nodes, TestDFSIO with 4 maps
Increased by 24%
Increased by 61%
64HPC Advisory Council Brazil Conference, May '14
(Charts: total TestDFSIO throughput in MBps vs. file size (5-20 GB), comparing 10GigE, IPoIB (QDR) and OSU-IB (QDR))
Evaluations using Sort for MapReduce-RDMA Design
8 HDD DataNodes
• With 8 HDD DataNodes for 40GB sort
– 41% improvement over IPoIB (QDR)
– 39% improvement over UDA-IB (QDR)
65
M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramon, H. Wang, and D. K. Panda, High-Performance RDMA-
based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May
2013.
HPC Advisory Council Brazil Conference, May '14
8 SSD DataNodes
• With 8 SSD DataNodes for 40GB sort
– 52% improvement over IPoIB (QDR)
– 44% improvement over UDA-IB
(QDR)
(Charts: Sort job execution time in seconds vs. data size (25-40 GB), comparing 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR), for the HDD and SSD DataNode clusters)
• For 240GB Sort in 64 nodes
– 40% improvement over IPoIB (QDR)
with HDD used for HDFS
66
Performance Evaluation on Larger Clusters
HPC Advisory Council Brazil Conference, May '14
(Charts: job execution time in seconds for Sort on the OSU cluster (60/120/240 GB on 16/32/64 nodes; IPoIB (QDR), UDA-IB (QDR), OSU-IB (QDR)) and for TeraSort on TACC Stampede (80/160/320 GB on 16/32/64 nodes; IPoIB (FDR), UDA-IB (FDR), OSU-IB (FDR)))
• For 320GB TeraSort in 64 nodes
– 38% improvement over IPoIB
(FDR) with HDD used for HDFS
• For 500GB Sort in 64 nodes
– 44% improvement over IPoIB (FDR)
67
Performance Improvement of MapReduce over Lustre on
TACC-Stampede
HPC Advisory Council Brazil Conference, May '14
• For 640GB Sort in 128 nodes
– 48% improvement over IPoIB (FDR)
(Chart: job execution time in seconds vs. data size (300-500 GB) for IPoIB (FDR) and OSU-IB (FDR))
• Local disk is used as the intermediate data directory
(Chart: job execution time in seconds for 20 GB-640 GB data sizes on 4-128 node clusters, comparing IPoIB (FDR) and OSU-IB (FDR))
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-
based Approach Benefit?, Euro-Par, August 2014.
• For 160GB Sort in 16 nodes
– 35% improvement over IPoIB (FDR)
68
Performance Improvement of MapReduce over Lustre on
TACC-Stampede
HPC Advisory Council Brazil Conference, May '14
• For 320GB Sort in 32 nodes
– 33% improvement over IPoIB (FDR)
• Lustre is used as the intermediate data directory
(Charts: job execution time in seconds vs. data size (80-160 GB), and for 80/160/320 GB on 8/16/32-node clusters, comparing IPoIB (FDR) and OSU-IB (FDR))
(Figure: Spark architecture: a Driver hosting the SparkContext, a Master coordinated through Zookeeper, and Worker nodes backed by HDFS)
Spark Architecture
• An in-memory data-processing
framework
– Iterative machine learning jobs
– Interactive data analytics
– Scala based Implementation
– Standalone, YARN, Mesos
• Scalable and communication intensive
– Wide dependencies between Resilient
Distributed Datasets (RDDs)
– MapReduce-like shuffle operations to
repartition RDDs
– Sockets based communication
• Native IB-verbs-level Design and
evaluation with
– Spark: 0.9.1
– Scala: 2.10.4
69HPC Advisory Council Brazil Conference, May '14
HPC Advisory Council Brazil Conference, May '14 70
Preliminary Results of Spark-RDMA design - GroupBy
(Charts: GroupBy time in seconds vs. data size for 10GigE, IPoIB and RDMA, on the 4-node HDD cluster with 32 cores (4-10 GB) and the 8-node HDD cluster with 64 cores (8-20 GB))
• Cluster with 4 HDD Nodes, single disk per node, 32 concurrent tasks
– 18% improvement over IPoIB (QDR) for 10GB data size
• Cluster with 8 HDD Nodes, single disk per node, 64 concurrent tasks
– 20% improvement over IPoIB (QDR) for 20GB data size
Memcached Architecture
• Distributed Caching Layer
– Allows to aggregate spare memory from multiple nodes
– General purpose
• Typically used to cache database queries, results of API calls
• Scalable model, but typical usage very network intensive
• Native IB-verbs-level Design and evaluation with
– Memcached Server: 1.4.9
– Memcached Client: (libmemcached) 0.52
71
(Figure: web frontend servers acting as Memcached clients connect over the Internet and high-performance networks to Memcached servers, each with main memory, CPUs, SSD and HDD, which in turn connect over high-performance networks to database servers)
HPC Advisory Council Brazil Conference, May '14
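For context, here is a minimal Memcached client sketch (not from the talk) using the libmemcached C API named above; the host name and port are placeholders, and the RDMA-enabled client/server designs discussed next keep this application-level API unchanged.

```c
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);

    /* "memcached-host" and port 11211 are placeholder values. */
    memcached_server_st *servers =
        memcached_server_list_append(NULL, "memcached-host", 11211, &rc);
    memcached_server_push(memc, servers);

    const char *key = "k1", *val = "hello";
    memcached_set(memc, key, strlen(key), val, strlen(val),
                  (time_t)0, (uint32_t)0);                    /* SET */

    size_t len;
    uint32_t flags;
    char *out = memcached_get(memc, key, strlen(key), &len, &flags, &rc);
    if (rc == MEMCACHED_SUCCESS && out != NULL) {             /* GET */
        printf("got %.*s\n", (int)len, out);
        free(out);
    }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}
```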
72
• Memcached Get latency
– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
• Almost factor of four improvement over 10GE (TOE) for 2K bytes on
the DDR cluster
Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR)
(Charts: Memcached Get latency in µs vs. message size (1B-2K) on the DDR and QDR clusters, comparing IPoIB, 10GigE, OSU-RC-IB and OSU-UD-IB)
Memcached Get Latency (Small Message)
HPC Advisory Council Brazil Conference, May '14
(Charts: Memcached GET latency in µs vs. message size (1B-4K) for OSU-IB (FDR) and IPoIB (FDR), and throughput in thousands of transactions per second vs. number of clients (16-4080))
73
• Memcached Get latency
– 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us
– 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us
• Memcached Throughput (4bytes)
– 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/s
– Nearly 2X improvement in throughput
Memcached GET Latency Memcached Throughput
Memcached Performance (FDR Interconnect)
Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)
Latency Reduced
by nearly 20X
2X
HPC Advisory Council Brazil Conference, May '14
HPC Advisory Council Brazil Conference, May '14 74
Application Level Evaluation –
Workload from Yahoo
• Real Application Workload
– RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients
• 12X times better than IPoIB for 8 clients
• Hybrid design achieves comparable performance to that of pure RC design
(Charts: time in ms vs. number of clients (1-8 and 64-1024), comparing SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB)
J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K.
Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11
J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached
design for InfiniBand Clusters using Hybrid Transport, CCGrid’12
• InfiniBand with its RDMA capability is gaining momentum in HPC systems, offering the best performance and seeing growing usage
• As the HPC community moves to Exascale, new solutions are needed in the
MPI and Hybrid MPI+PGAS stacks for supporting GPUs and Accelerators
• Demonstrated how such solutions can be designed with
MVAPICH2/MVAPICH2-X and their performance benefits
• New solutions are also needed to re-design software libraries for Big Data
environments to take advantage of RDMA
• Such designs will allow application scientists and engineers to take advantage
of upcoming exascale systems
75
Concluding Remarks
HPC Advisory Council Brazil Conference, May '14
HPC Advisory Council Brazil Conference, May '14
Funding Acknowledgments
Funding Support by
Equipment Support by
76
HPC Advisory Council Brazil Conference, May '14
Personnel Acknowledgments
Current Students
– S. Chakraborthy (Ph.D.)
– N. Islam (Ph.D.)
– J. Jose (Ph.D.)
– M. Li (Ph.D.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekhar (Ph.D.)
Past Students
– P. Balaji (Ph.D.)
– D. Buntinas (Ph.D.)
– S. Bhagvat (M.S.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
77
Past Research Scientist
– S. Sur
Current Post-Docs
– X. Lu
– K. Hamidouche
Current Programmers
– M. Arnold
– J. Perkins
Past Post-Docs
– H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– R. Shir (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Senior Research Associate
– H. Subramoni
Past Programmers
– D. Bureddy
• August 25-27, 2014; Columbus, Ohio, USA
• Keynote Talks, Invited Talks, Contributed Presentations, Tutorial on MVAPICH2,
MVAPICH2-X and MVAPICH2-GDR optimization and tuning
• Speakers (Confirmed so far):
– Keynote: Dan Stanzione (TACC)
– Sadaf Alam (CSCS, Switzerland)
– Jens Glaser (Univ. of Michigan)
– Jeff Hammond (Intel)
– Adam Moody (LLNL)
– Davide Rossetti (NVIDIA)
– Gilad Shainer (Mellanox)
– Mahidhar Tatineni (SDSC)
– Jerome Vienne (TACC)
• Call for Contributed Presentations (Deadline: July 1, 2014)
• More details at: http://mug.mvapich.cse.ohio-state.edu
78
Upcoming 2nd Annual MVAPICH User Group (MUG) Meeting
HPC Advisory Council Brazil Conference, May '14
• Looking for Bright and Enthusiastic Personnel to join as
– Visiting Scientists
– Post-Doctoral Researchers
– PhD Students
– MPI Programmer/Software Engineer
– Hadoop/Big Data Programmer/Software Engineer
• If interested, please contact me at this conference and/or send
an e-mail to panda@cse.ohio-state.edu
79
Multiple Positions Available in My Group
HPC Advisory Council Brazil Conference, May '14
HPC Advisory Council Brazil Conference, May '14
Pointers
http://nowlab.cse.ohio-state.edu
panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda
80
http://mvapich.cse.ohio-state.edu
http://hadoop-rdma.cse.ohio-state.edu
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
 

Recently uploaded

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Designing Software Libraries and Middleware for Exascale Systems: Opportunities and Challenges

  • 8. Parallel Programming Models Overview – Three broad families: the Shared Memory model (SHMEM, DSM), the Distributed Memory model (MPI, the Message Passing Interface), and the Partitioned Global Address Space (PGAS) model (Global Arrays, UPC, Chapel, X10, CAF, …). Programming models provide abstract machine models, and these models can be mapped onto different types of systems (e.g., Distributed Shared Memory (DSM), MPI within a node, etc.).
  • 9. How does MPI Plan to Meet Exascale Challenges? The power required for data-movement operations is one of the main challenges. Non-blocking collectives – overlap computation and communication. Much-improved one-sided interface – reduce synchronization between sender and receiver. Manage concurrency – improved interoperability with PGAS (e.g., UPC, Global Arrays, OpenSHMEM). Resiliency – new interface for detecting failures.
  • 10. Major New Features in MPI-3 – Non-blocking collectives, an improved one-sided (RMA) model, and the MPI Tools interface. The specification is available from http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
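The non-blocking collectives listed above are the MPI-3 feature most directly aimed at overlapping computation and communication. As a minimal illustration (not taken from the slides, and assuming any MPI-3 library such as MVAPICH2 2.0, compiled with mpicc), the sketch below starts a global sum with MPI_Iallreduce, leaves room for independent work, and completes it with MPI_Wait:

    /* nonblocking_allreduce.c: overlap a global sum with local work using the
     * MPI-3 non-blocking collective MPI_Iallreduce. Illustrative sketch only;
     * error checking is omitted. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, n;
        double local, global = 0.0;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &n);

        local = (double) rank;                 /* this rank's contribution */

        /* Start the reduction; it progresses in the background. */
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* ... independent computation can run here while the collective
         *     makes progress ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);     /* complete the collective */

        if (rank == 0)
            printf("sum of ranks = %.0f (expected %d)\n",
                   global, n * (n - 1) / 2);

        MPI_Finalize();
        return 0;
    }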
  • 11. Partitioned Global Address Space (PGAS) Models – Key features: simple shared-memory abstractions, lightweight one-sided communication, and easier expression of irregular communication. Different approaches to PGAS – Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel; Libraries: OpenSHMEM, Global Arrays.
  • 12. MPI+PGAS for Exascale Architectures and Applications – Hierarchical architectures expose multiple address spaces. In the (MPI + PGAS) model, MPI is used across address spaces and PGAS within an address space. MPI is good at moving data between address spaces, and within an address space it can interoperate with other shared-memory programming models; it can also co-exist with OpenMP for offloading computation. Applications can have kernels with different communication patterns and can benefit from different models; since re-writing complete applications can be a huge effort, the critical kernels are ported to the desired model instead.
  • 13. Hybrid (MPI+PGAS) Programming – Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics, combining the best of the distributed-memory and shared-memory computing models within one HPC application (some kernels in MPI, others in PGAS). Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems." (*The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computer Applications, Volume 25, Number 1, 2011, ISSN 1094-3420.)
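To make the hybrid model concrete, here is a minimal sketch (not from the deck) of a single program mixing MPI and OpenSHMEM calls, assuming a unified runtime such as MVAPICH2-X in which MPI ranks and OpenSHMEM PEs refer to the same processes; the symmetric variable, the ring exchange, and the final reduction are illustrative only:

    /* hybrid_ring.c: sketch of mixing MPI (collectives) with OpenSHMEM
     * one-sided puts in the same program, assuming a unified MPI+PGAS
     * runtime such as MVAPICH2-X. */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    static int ring_val = -1;          /* symmetric data object (one per PE) */

    int main(int argc, char **argv)
    {
        int rank, size, right, sum = 0;

        MPI_Init(&argc, &argv);
        start_pes(0);                  /* OpenSHMEM 1.0-style initialization */

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        right = (rank + 1) % size;

        /* One-sided phase: deposit my rank into the right neighbor's
         * symmetric variable. */
        shmem_int_put(&ring_val, &rank, 1, right);
        shmem_barrier_all();           /* make all puts visible */

        /* Two-sided/collective phase handled with MPI. */
        MPI_Reduce(&ring_val, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of received values = %d\n", sum);

        MPI_Finalize();
        return 0;
    }

Because the runtime is unified, both families of calls can share the same connections and registered memory, which is the point of the unified-runtime approach rather than stacking two independent runtimes.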
  • 14. Designing Software Libraries for Multi-Petaflop and Exaflop Systems: Challenges – Application kernels and applications run over middleware programming models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.), which in turn rely on a communication library or runtime providing point-to-point communication (two-sided and one-sided), collective communication, synchronization and locks, I/O and file systems, and fault tolerance, on top of networking technologies (InfiniBand, 40/100GigE, Aries, BlueGene), multi-/many-core architectures, and accelerators (NVIDIA and MIC). Co-design opportunities and challenges exist across these layers for performance, scalability, and fault-resilience.
  • 15. Challenges in Designing (MPI+X) at Exascale – Scalability for million to billion processors: support for highly efficient inter-node and intra-node communication (both two-sided and one-sided) with an extremely small memory footprint. Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node), including multiple end-points per node. Support for efficient multi-threading. Support for GPGPUs and accelerators. Scalable collective communication: offloaded, non-blocking, topology-aware, and power-aware. Fault-tolerance/resiliency. QoS support for communication and I/O. Support for hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, …).
  • 16. Additional Challenges for Designing Exascale Middleware – Extremely low memory footprint, since memory per core continues to decrease. A D-L-A framework: Discover (overall network topology (fat-tree, 3D, …), the network topology of the processes in a given job, the node architecture, and the health of network and nodes), Learn (the impact on performance and scalability, and the potential for failure), and Adapt (internal protocols and algorithms, process mapping, and fault-tolerance solutions) – all using low-overhead techniques while delivering performance, scalability, and fault-tolerance.
  • 17. MVAPICH2/MVAPICH2-X Software – A high-performance, open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE). MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0) have been available since 2002, MVAPICH2-X (MPI + PGAS) since 2012, with support for GPGPUs and MIC. Used by more than 2,150 organizations in 72 countries, with more than 212,000 direct downloads from the OSU site. Empowering many TOP500 clusters: the 7th-ranked 519,640-core cluster (Stampede) at TACC, the 11th-ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, the 16th-ranked 96,192-core cluster (Pleiades) at NASA, the 75th-ranked 16,896-core cluster (Keeneland) at Georgia Tech, and many others. Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE). http://mvapich.cse.ohio-state.edu. Partner in the U.S. NSF-TACC Stampede system.
  • 18. MVAPICH2 2.0RC2 and MVAPICH2-X 2.0RC2 – Released on 05/25/14. Major features and enhancements: based on MPICH-3.1; CMA support is the default; optimization of collectives with CMA support; RMA optimizations for shared-memory and atomic operations; MPI-T support for additional performance and control variables; large-message transfer support for the PSM interface; optimization of collectives for the PSM interface; hwloc updated to version 1.9. MVAPICH2-X 2.0RC2 supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models: based on MVAPICH2 2.0RC2 including MPI-3 features, compliant with UPC 2.18.0 and OpenSHMEM v1.0f.
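The MPI-T support mentioned above refers to the MPI-3 tools interface, which lets a program or tool enumerate the control and performance variables a library exposes. A small hedged sketch using only standard MPI-3 calls is shown below; which variables are actually reported depends entirely on the MPI library in use:

    /* list_cvars.c: enumerate MPI-T control variables (MPI-3 tools interface).
     * The set of variables reported is implementation-specific. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, ncvars, i;

        MPI_Init(&argc, &argv);
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

        MPI_T_cvar_get_num(&ncvars);
        printf("%d control variables exposed\n", ncvars);

        for (i = 0; i < ncvars; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, binding, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;

            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &binding, &scope);
            printf("  [%d] %s\n", i, name);    /* print the variable name */
        }

        MPI_T_finalize();
        MPI_Finalize();
        return 0;
    }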
  • 19. Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale – Scalability for million to billion processors (highly efficient inter-node and intra-node communication, both two-sided and one-sided, with an extremely small memory footprint); support for GPGPUs; support for Intel MICs; fault-tolerance; and hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, …) with a unified runtime.
  • 20. One-way Latency: MPI over IB with MVAPICH2 – Small-message latencies range from 0.99 us to 1.82 us across Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; large-message latencies are reported for the same adapters. Platforms: DDR/QDR – 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, IB switch; FDR – 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, IB switch; ConnectIB-Dual FDR – 2.6 GHz octa-core (SandyBridge) and 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, IB switch.
  • 21. Bandwidth: MPI over IB with MVAPICH2 – Unidirectional bandwidth reaches up to 12,810 MB/s and bidirectional bandwidth up to 24,727 MB/s with dual-FDR Connect-IB, measured across the same set of adapters and platforms as the latency results.
  • 22. MVAPICH2 Two-Sided Intra-Node Performance (shared-memory and kernel-based zero-copy support with LiMIC and CMA) – With the latest MVAPICH2 2.0rc2 on Intel Ivy Bridge: 0.18 us intra-socket and 0.45 us inter-socket small-message latency, and peak intra-node bandwidths of about 14,250 MB/s and 13,749 MB/s across the intra-socket and inter-socket paths with the CMA, shared-memory, and LiMIC designs.
  • 23. MPI-3 RMA Get/Put with Flush Performance – With the latest MVAPICH2 2.0rc2 on Intel Sandy Bridge with Connect-IB (single port): inter-node small-message Get/Put latencies of about 2.04 us and 1.56 us, intra-socket Get/Put latency as low as 0.08 us, inter-node Get/Put bandwidth of about 6,881/6,876 MB/s, and intra-socket Get/Put bandwidth of about 14,926/15,364 MB/s.
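The Get/Put-with-flush results above exercise the MPI-3 passive-target RMA path. The following sketch shows the basic pattern being measured (a Put completed with MPI_Win_flush under a shared lock); it is an illustrative example assuming the unified memory model, not the OSU benchmark code:

    /* put_flush.c: MPI-3 passive-target Put completed with MPI_Win_flush.
     * Rank 0 writes into rank 1's window; sketch only, two processes assumed. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf = 0.0, value = 42.0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Expose one double per process as an RMA window. */
        MPI_Win_create(&buf, sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        if (rank == 0) {
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);   /* passive target */
            MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
            MPI_Win_flush(1, win);   /* remote completion without unlocking */
            MPI_Win_unlock(1, win);
        }

        MPI_Barrier(MPI_COMM_WORLD); /* let rank 1 observe the update */
        if (rank == 1)
            printf("rank 1 received %.1f\n", buf);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

MPI_Win_flush completes the Put at the target without giving up the lock, which is what enables the low one-sided latencies and high message rates reported above.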
  • 24. Minimizing Memory Footprint with XRC and Hybrid Mode – Memory usage for 32K processes with 8 cores per node can be 54 MB/process for connections; NAMD performance improves when there is frequent communication to many peers. Both UD and RC/XRC have benefits, and the hybrid mode delivers the best of both (available since MVAPICH2 1.7 as an integrated interface), with improvements of roughly 26-40% observed on the NAMD datasets (apoa1, er-gre, f1atpase, jac) and at 128-1024 processes. Runtime parameters: RC is the default; UD only – MV2_USE_ONLY_UD=1; Hybrid – MV2_HYBRID_ENABLE_THRESHOLD=1. M. Koop, J. Sridhar and D. K. Panda, "Scalable MPI Design over InfiniBand using eXtended Reliable Connection," Cluster '08.
  • 25. Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale – (outline repeated from slide 19).
  • 26. MVAPICH2-GPU: CUDA-Aware MPI – Before CUDA 4: additional copies between GPU and host, giving low performance and low productivity. After CUDA 4: a host-based pipeline using Unified Virtual Addressing that overlaps CUDA copies with IB transfers, giving high performance and high productivity. After CUDA 5.5: GPUDirect RDMA support – GPU-to-GPU direct transfers that bypass host memory, with a hybrid design to avoid PCIe bottlenecks.
  • 27. MVAPICH2 1.8, 1.9, and 2.0a-2.0rc2 Releases – Support for MPI communication from NVIDIA GPU device memory; high-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU); high-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host, and Host-GPU), taking advantage of CUDA IPC (available since CUDA 4.1); optimized and tuned collectives for GPU device buffers; and MPI datatype support for point-to-point and collective communication from GPU device buffers.
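From the application's point of view, "MPI communication from NVIDIA GPU device memory" simply means passing device pointers to the usual MPI calls. A minimal sketch follows, assuming a CUDA-aware MPI build (for MVAPICH2, typically a build configured with CUDA support and run with MV2_USE_CUDA=1); the buffer size and tag are arbitrary:

    /* cuda_aware_sendrecv.c: pass CUDA device pointers directly to MPI.
     * Assumes a CUDA-aware MPI library; compile against the CUDA runtime. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)

    int main(int argc, char **argv)
    {
        int rank;
        float *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **) &d_buf, N * sizeof(float));   /* GPU buffer */
        cudaMemset(d_buf, 0, N * sizeof(float));

        if (rank == 0) {
            /* Send directly from device memory; the library stages through
             * host memory or uses GPUDirect RDMA underneath. */
            MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d floats into device memory\n", N);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }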
  • 28. GPUDirect RDMA (GDR) with CUDA – A hybrid design that combines GPUDirect RDMA with host-based pipelining, alleviating the P2P bandwidth bottlenecks of the SandyBridge and IvyBridge chipsets (SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680 v2: P2P write 6.4 GB/s, P2P read 3.5 GB/s). Also supports communication using multi-rail, Mellanox Connect-IB and ConnectX VPI adapters, and RoCE with Mellanox ConnectX VPI adapters. S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, "Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs," Int'l Conference on Parallel Processing (ICPP '13).
  • 29. Performance of MVAPICH2 with GPUDirect RDMA: Latency – GPU-GPU inter-node MPI latency improves by 67% for small messages (down to 5.49 us) and by about 10% for large messages with the GDR designs (1-rail-GDR and 2-rail-GDR) over the host-based pipeline. Testbed: MVAPICH2-2.0b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with the GPUDirect RDMA plug-in.
  • 30. Performance of MVAPICH2 with GPUDirect RDMA: Bandwidth – GPU-GPU inter-node MPI unidirectional bandwidth improves by about 5x for small messages and reaches 9.8 GB/s for large messages with 2-rail GDR. Same testbed as the latency results.
  • 31. Performance of MVAPICH2 with GPUDirect RDMA: Bi-Bandwidth – GPU-GPU inter-node MPI bidirectional bandwidth improves by about 4.3x for small messages and by 19% for large messages, reaching 19 GB/s with 2-rail GDR. Same testbed as the latency results.
  • 32. MPI-3 RMA Support with GPUDirect RDMA – MPI-3 RMA provides flexible synchronization and completion primitives: Put with Flush achieves small-message latencies below 2 us and higher small-message rates than Send/Recv or Put with active synchronization. Testbed: MVAPICH2-2.0b plus enhancements, Intel Ivy Bridge (E5-2630 v2) node with 12 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB FDR HCA, CUDA 5.5, Mellanox OFED 2.0 with the GPUDirect RDMA plug-in.
  • 33. Communication Kernel Evaluation: 3D Stencil and Alltoall – On 16 GPU nodes, one-sided communication reduces small-message latency by 42% and 50% for the 3D stencil and Alltoall kernels, respectively, compared with Send/Recv. Same testbed as the MPI-3 RMA results.
  • 34. How Can I Get Started with GDR Experimentation? – MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/. System software requirements: Mellanox OFED 2.1 or later, NVIDIA driver 331.20 or later, NVIDIA CUDA Toolkit 5.5 or later, and the plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116). The release has optimized designs for point-to-point communication using GDR; work is in progress on optimizing collective and one-sided communication. Contact the MVAPICH help list (mvapich-help@cse.ohio-state.edu) with any questions related to the package. MVAPICH2-GDR-RC2 with additional optimizations is coming soon.
  • 35. Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale – (outline repeated from slide 19).
  • 36. MPI Applications on MIC Clusters – Flexibility in launching MPI jobs on clusters with Xeon Phi, spanning multi-core-centric to many-core-centric execution: host-only, offload (and reverse offload) of computation from an MPI program on the host, symmetric (MPI processes on both host and Xeon Phi), and coprocessor-only modes.
  • 37. Data Movement on Intel Xeon Phi Clusters – Coprocessors are connected as PCIe devices, which brings flexibility but also complexity, so it is critical for runtimes to optimize data movement while hiding that complexity. The communication paths include: (1) intra-socket, (2) inter-socket, and (3) inter-node host transfers; (4) intra-MIC; (5) intra-socket MIC-MIC; (6) inter-socket MIC-MIC; (7) inter-node MIC-MIC; (8) intra-socket MIC-Host; (9) inter-socket MIC-Host; (10) inter-node MIC-Host; (11) inter-node MIC-MIC with the IB adapter on a remote socket; and more.
  • 38. MVAPICH2-MIC Design for Clusters with IB and MIC – Intranode communication for offload, coprocessor-only, and symmetric modes; internode communication for coprocessor-only and symmetric modes; and multi-MIC node configurations.
  • 39. Direct-IB Communication – IB-verbs-based communication directly from the Xeon Phi, requiring no porting effort for MPI libraries with IB support; data movement between the IB HCA and the Xeon Phi is implemented as peer-to-peer transfers. Against a peak IB FDR bandwidth of 6,397 MB/s (HCA MT4099), achievable bandwidth ranges from 6,396 MB/s (100%) down to 247 MB/s (4%) depending on the direction (IB write to vs. IB read from the Xeon Phi), socket locality (intra-socket vs. inter-socket), and processor generation (SandyBridge E5-2670 vs. IvyBridge E5-2680 v2); the best intra-socket IB write to the Xeon Phi reaches 5,280 MB/s (83%).
  • 40. MIC-Remote-MIC P2P Communication – Latency and bandwidth for the intra-socket and inter-socket MIC-to-remote-MIC paths; peak large-message bandwidths of about 5,236 MB/s and 5,594 MB/s are achieved on the two paths.
  • 41. InterNode – Symmetric Mode – P3DFFT Performance – Exploring different threading levels for the host and the Xeon Phi with a 3D stencil communication kernel and P3DFFT (512x512x4K): the optimized intra-node designs (MV2-INTRA-OPT) deliver improvements of up to 50.2% and 16.2% over MV2-MIC on the configurations shown (4+4 and 8+8 processes on host+MIC over 32 nodes, and 2+2 and 2+8 processes with 8 threads per process over 8 nodes).
  • 42. Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall) – The optimized designs (MV2-MIC-Opt) reduce 32-node Allgather and Alltoall latencies by up to 76%, 58%, and 55% for the configurations shown (16H+16M small messages, 8H+8M large messages), and cut the communication time of P3DFFT on 32 nodes (8H + 8M, 2K x 2K x 1K problem size). A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, "High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters," IPDPS '14, May 2014.
  • 43. Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale – (outline repeated from slide 19).
  • 44. Fault Tolerance – Component failures are common in large-scale clusters, which imposes requirements on reliability and fault tolerance. The challenges include choosing between checkpoint-restart and process migration, and realizing the benefits of Scalable Checkpoint-Restart (SCR) support in both application-guided and application-transparent forms.
  • 45. Checkpoint-Restart vs. Process Migration – Job-wide checkpoint/restart is not scalable; a job-pause and process-migration framework, combined with low-overhead failure prediction via IPMI, can deliver proactive fault tolerance and also enables cluster-wide load balancing through job compaction. For the LU Class C benchmark (64 processes), the migration-based approach reduces execution time by 2.03x and 4.49x compared with checkpoint/restart to ext3 and PVFS, and the time to migrate 8 MPI processes is a small fraction of the corresponding checkpoint/restart costs (relative factors of 10.7x, 2.3x, and 4.3x). X. Ouyang, R. Rajachandrasekar, X. Besseron, D. K. Panda, "High Performance Pipelined Process Migration with RDMA," CCGrid 2011; X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, "RDMA-Based Job Migration Framework for MPI over InfiniBand," Cluster 2010.
  • 46. Multi-Level Checkpointing with Scalable Checkpoint/Restart (SCR) – LLNL's Scalable Checkpoint/Restart library can be used for both application-guided and application-transparent checkpointing, making effective use of the storage hierarchy: Local (store checkpoint data on the node's local storage, e.g., local disk or ramdisk), Partner (write to local storage and to a partner node), XOR (write to local storage while small sets of nodes collectively compute and store RAID-5-style parity redundancy data), and Stable Storage (write to the parallel file system). Checkpoint cost and resiliency both increase from Local toward Stable Storage.
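For the application-guided case, SCR wraps the application's own checkpoint writes so the library can place them at the Local, Partner, or XOR level. The sketch below follows SCR's documented checkpoint API (SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint); the file names, step loop, and checkpoint contents are placeholders, and exact usage may differ across SCR versions:

    /* scr_checkpoint.c: application-guided checkpointing with LLNL's SCR
     * library, in the spirit of the MVAPICH2+SCR results above. Sketch only;
     * the real checkpoint payload and the restart path are omitted. */
    #include <mpi.h>
    #include <scr.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, step, need;

        MPI_Init(&argc, &argv);
        SCR_Init();                              /* after MPI_Init */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (step = 0; step < 100; step++) {
            /* ... compute one timestep ... */

            SCR_Need_checkpoint(&need);          /* SCR decides the frequency */
            if (!need)
                continue;

            SCR_Start_checkpoint();

            char name[256], path[SCR_MAX_FILENAME];
            snprintf(name, sizeof(name), "ckpt.%d.step%d", rank, step);
            SCR_Route_file(name, path);          /* SCR picks the storage level */

            FILE *fp = fopen(path, "w");
            int valid = (fp != NULL);
            if (fp) {
                fprintf(fp, "step=%d\n", step);  /* write checkpoint data */
                fclose(fp);
            }
            SCR_Complete_checkpoint(valid);      /* collective over all ranks */
        }

        SCR_Finalize();
        MPI_Finalize();
        return 0;
    }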
  • 47. Application-Guided Multi-Level Checkpointing – Checkpoint writing-phase times for a representative SCR-enabled MPI application with 512 MPI processes (8 processes/node) and approximately 51 GB checkpoints: writing through MVAPICH2+SCR to the Local, Partner, or XOR levels is substantially faster than writing to the parallel file system (PFS).
  • 48. Transparent Multi-Level Checkpointing – The ENZO cosmology application (radiation-transport workload) using MVAPICH2's CR protocol instead of the application's built-in CR mechanism, with 512 MPI processes on the Stampede supercomputer at TACC and approximately 12.8 GB checkpoints, sees roughly a 2.5x reduction in checkpoint I/O costs.
  • 49. Overview of a Few Challenges being Addressed by MVAPICH2/MVAPICH2-X for Exascale – (outline repeated from slide 19).
  • 50. MVAPICH2-X for Hybrid MPI + PGAS Applications – A unified communication runtime for MPI, UPC, and OpenSHMEM calls, available with MVAPICH2-X 1.9 onwards (http://mvapich.cse.ohio-state.edu), serving MPI, OpenSHMEM, UPC, and hybrid (MPI + PGAS) applications over InfiniBand, RoCE, and iWARP. Feature highlights: supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM, and MPI(+OpenMP) + UPC; MPI-3 compliant, OpenSHMEM v1.0 standard compliant, and UPC v1.2 standard compliant; scalable inter-node and intra-node communication for both point-to-point and collectives.
  • 51. OpenSHMEM Collective Communication Performance – For Reduce, Broadcast, and Collect at 1,024 processes and Barrier at 128-2,048 processes, MV2X-SHMEM delivers improvements of up to 180x, 18x, 12x, and 3x over Scalable-SHMEM and OMPI-SHMEM.
  • 52. OpenSHMEM Application Evaluation – Heat Image and Daxpy at 256 and 512 processes, with improved performance for OMPI-SHMEM and Scalable-SHMEM when FCA is enabled. Execution time for the 2D Heat Image kernel at 512 processes (sec): UH-SHMEM 523, OMPI-SHMEM 214, Scalable-SHMEM 193, MV2X-SHMEM 169. Execution time for DAXPY at 512 processes (sec): UH-SHMEM 57, OMPI-SHMEM 56, Scalable-SHMEM 9.2, MV2X-SHMEM 5.2. J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, "A Comprehensive Performance Evaluation of OpenSHMEM Libraries on InfiniBand Clusters," OpenSHMEM Workshop (OpenSHMEM '14), March 2014.
  • 53. Hybrid MPI+OpenSHMEM Graph500 Design – The hybrid (MPI+OpenSHMEM) Graph500 design improves execution time by 7.6x over MPI-Simple and 2.4x over MPI-CSR at 8,192 processes, and by 13x over MPI-Simple and 1.5x over MPI-CSR at 16,384 processes, with strong- and weak-scalability results reported in billions of traversed edges per second (TEPS) at scales 26-29 and 1,024-8,192 processes. J. Jose, S. Potluri, K. Tomko and D. K. Panda, "Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models," International Supercomputing Conference (ISC '13), June 2013; J. Jose, K. Kandalla, M. Luo and D. K. Panda, "Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation," Int'l Conference on Parallel Processing (ICPP '12), September 2012.
  • 54. Hybrid MPI+OpenSHMEM Sort Application – Execution time for a 4 TB input at 4,096 cores: MPI 2,408 seconds vs. hybrid 1,172 seconds, a 51% improvement over the MPI-based design. Strong scalability with a constant input size of 500 GB: at 4,096 cores, MPI sorts at 0.16 TB/min vs. 0.36 TB/min for the hybrid design, a 55% improvement.
  • 55. MVAPICH2/MVAPICH2-X – Plans for Exascale – Performance and memory scalability toward 900K-1M cores, including the Dynamically Connected Transport (DCT) service with Connect-IB; enhanced optimization for GPGPU and coprocessor support, extending GPGPU support (GPUDirect RDMA) to CUDA 6.5 and beyond and supporting Intel MIC (Knights Landing); taking advantage of the collective offload framework, including support for non-blocking collectives (MPI 3.0); RMA support (as in MPI 3.0); extended topology-aware collectives; power-aware collectives; support for the MPI Tools interface (as in MPI 3.0); checkpoint-restart and migration support with in-memory checkpointing; and hybrid MPI+PGAS programming support with GPGPUs and accelerators.
  • 56. • Scientific Computing – Message Passing Interface (MPI), including MPI + OpenMP, is the Dominant Programming Model – Many discussions towards Partitioned Global Address Space (PGAS) • UPC, OpenSHMEM, CAF, etc. – Hybrid Programming: MPI + PGAS (OpenSHMEM, UPC) • Big Data/Enterprise/Commercial Computing – Focuses on large data and data analysis – Hadoop (HDFS, HBase, MapReduce) – Spark is emerging for in-memory computing – Memcached is also used for Web 2.0 56 Two Major Categories of Applications HPC Advisory Council Brazil Conference, May '14
  • 57. Introduction to Big Data Applications and Analytics • Big Data has become the one of the most important elements of business analytics • Provides groundbreaking opportunities for enterprise information management and decision making • The amount of data is exploding; companies are capturing and digitizing more information than ever • The rate of information growth appears to be exceeding Moore’s Law 57HPC Advisory Council Brazil Conference, May '14 • Commonly accepted 3V’s of Big Data • Volume, Velocity, Variety Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf • 5V’s of Big Data – 3V + Value, Veracity
  • 58. Can High-Performance Interconnects Benefit Big Data Middleware? • Most of the current Big Data middleware use Ethernet infrastructure with Sockets • Concerns for performance and scalability • Usage of high-performance networks is beginning to draw interest from many companies • What are the challenges? • Where do the bottlenecks lie? • Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)? • Can HPC Clusters with high-performance networks be used for Big Data middleware? • Initial focus: Hadoop, HBase, Spark and Memcached 58HPC Advisory Council Brazil Conference, May '14
  • 59. Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) Networking Technologies (InfiniBand, 1/10/40GigE and Intelligent NICs) Storage Technologies (HDD and SSD) Programming Models (Sockets) Applications Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators) Other Protocols? Communication and I/O Library Point-to-Point Communication QoS Threaded Models and Synchronization Fault-ToleranceI/O and File Systems Virtualization Benchmarks RDMA Protocol Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges HPC Advisory Council Brazil Conference, May '14 Upper level Changes? 59
  • 60. Big Data Processing with Hadoop • The open-source implementation of MapReduce programming model for Big Data Analytics • Major components – HDFS – MapReduce • Underlying Hadoop Distributed File System (HDFS) used by both MapReduce and HBase • Model scales but high amount of communication during intermediate phases can be further optimized 60 HDFS MapReduce Hadoop Framework User Applications HBase Hadoop Common (RPC) HPC Advisory Council Stanford Conference, Feb '14
  • 61. Two Broad Approaches for Accelerating Hadoop • All components of Hadoop (HDFS, MapReduce, RPC, HBase, etc.) working on native IB verbs while exploiting RDMA • Hadoop MapReduce over verbs for existing HPC clusters with Lustre (no HDFS) 61 HPC Advisory Council Brazil Conference, May '14
  • 62. • High-Performance Design of Hadoop over RDMA-enabled Interconnects – High performance design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components – Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB) • Current release: 0.9.9 (03/31/14) – Based on Apache Hadoop 1.2.1 – Compliant with Apache Hadoop 1.2.1 APIs and applications – Tested with • Mellanox InfiniBand adapters (DDR, QDR and FDR) • RoCE • Various multi-core platforms • Different file systems with disks and SSDs – http://hadoop-rdma.cse.ohio-state.edu RDMA for Apache Hadoop Project 62 HPC Advisory Council Brazil Conference, May '14
  • 63. Reducing Communication Time with HDFS-RDMA Design • Cluster with HDD DataNodes – 30% improvement in communication time over IPoIB (QDR) – 56% improvement in communication time over 10GigE • Similar improvements are obtained for SSD DataNodes N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012 63 [Chart: HDFS communication time (s) vs. file size (2–10 GB) for 10GigE, IPoIB (QDR) and OSU-IB (QDR); communication time reduced by 30% with OSU-IB] HPC Advisory Council Brazil Conference, May '14
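As a rough illustration of where the measured communication occurs, the sketch below writes a file through the standard Hadoop FileSystem Java API (not taken from the slides; the path and buffer size are hypothetical). Each write() call streams data from the client to the DataNode pipeline, which is the communication path the verbs-level HDFS design accelerates.

    // Illustrative HDFS write through the standard Hadoop FileSystem API (sketch only).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        byte[] block = new byte[4 * 1024 * 1024];   // 4 MB buffer (illustrative size)
        try (FSDataOutputStream out = fs.create(new Path("/bench/output.dat"))) {
          for (int i = 0; i < 256; i++) {
            out.write(block);                       // streamed to the DataNode pipeline (~1 GB total)
          }
        }
        fs.close();
      }
    }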
  • 64. Evaluations using TestDFSIO for HDFS-RDMA Design • Cluster with 8 HDD DataNodes, single disk per node – 24% improvement over IPoIB (QDR) for 20GB file size • Cluster with 4 SSD DataNodes, single SSD per node – 61% improvement over IPoIB (QDR) for 20GB file size [Charts: TestDFSIO total throughput (MBps) vs. file size (5–20 GB); 8 HDD nodes with 8 maps – throughput increased by 24% with OSU-IB (QDR); 4 SSD nodes with 4 maps – throughput increased by 61% with OSU-IB (QDR); compared against 10GigE and IPoIB (QDR)] 64 HPC Advisory Council Brazil Conference, May '14
  • 65. Evaluations using Sort for MapReduce-RDMA Design • With 8 HDD DataNodes for 40GB sort – 41% improvement over IPoIB (QDR) – 39% improvement over UDA-IB (QDR) • With 8 SSD DataNodes for 40GB sort – 52% improvement over IPoIB (QDR) – 44% improvement over UDA-IB (QDR) M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013. [Charts: Sort job execution time (sec) vs. data size (25–40 GB) on 8 HDD and 8 SSD DataNodes for 10GigE, IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR)] 65 HPC Advisory Council Brazil Conference, May '14
  • 66. Performance Evaluation on Larger Clusters • For 240GB Sort in 64 nodes – 40% improvement over IPoIB (QDR) with HDD used for HDFS • For 320GB TeraSort in 64 nodes – 38% improvement over IPoIB (FDR) with HDD used for HDFS [Charts: Job execution time (sec) – Sort in OSU cluster (60/120/240 GB on cluster sizes 16/32/64) for IPoIB (QDR), UDA-IB (QDR) and OSU-IB (QDR); TeraSort in TACC Stampede (80/160/320 GB on cluster sizes 16/32/64) for IPoIB (FDR), UDA-IB (FDR) and OSU-IB (FDR)] 66 HPC Advisory Council Brazil Conference, May '14
  • 67. Performance Improvement of MapReduce over Lustre on TACC-Stampede • Local disk is used as the intermediate data directory • For 500GB Sort in 64 nodes – 44% improvement over IPoIB (FDR) • For 640GB Sort in 128 nodes – 48% improvement over IPoIB (FDR) M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014. [Charts: Job execution time (sec) vs. data size (300–500 GB) for IPoIB (FDR) and OSU-IB (FDR); scaling runs from 20 GB on 4 nodes up to 640 GB on 128 nodes] 67 HPC Advisory Council Brazil Conference, May '14
  • 68. Performance Improvement of MapReduce over Lustre on TACC-Stampede • Lustre is used as the intermediate data directory • For 160GB Sort in 16 nodes – 35% improvement over IPoIB (FDR) • For 320GB Sort in 32 nodes – 33% improvement over IPoIB (FDR) [Charts: Job execution time (sec) vs. data size (80–160 GB) for IPoIB (FDR) and OSU-IB (FDR); scaling runs of 80/160/320 GB on cluster sizes 8/16/32] 68 HPC Advisory Council Brazil Conference, May '14
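A minimal sketch of how the intermediate-data placement compared in the two slides above could be selected, assuming a Hadoop 1.x setup: mapred.local.dir is the standard Hadoop 1.x property that controls where map outputs (shuffle data) are staged. The Lustre mount path is hypothetical, and in practice this key is set cluster-wide in mapred-site.xml for the TaskTrackers; it is shown programmatically here only to illustrate the key/value pair.

    // Illustrative sketch (not from the slides): choosing the MapReduce intermediate
    // data directory. The property name is the standard Hadoop 1.x key; the paths are hypothetical.
    import org.apache.hadoop.conf.Configuration;

    public class IntermediateDirConfig {
      public static Configuration lustreIntermediate() {
        Configuration conf = new Configuration();
        // Lustre as the intermediate data directory (as in slide 68):
        conf.set("mapred.local.dir", "/lustre/scratch/hadoop-intermediate");
        // Local disk as the intermediate data directory (as in slide 67) would instead
        // list one or more local paths, e.g. "/tmp/hadoop-local".
        return conf;
      }
    }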
  • 69. Spark Architecture • An in-memory data-processing framework – Iterative machine learning jobs – Interactive data analytics – Scala-based implementation – Standalone, YARN, Mesos • Scalable and communication intensive – Wide dependencies between Resilient Distributed Datasets (RDDs) – MapReduce-like shuffle operations to repartition RDDs – Sockets-based communication • Native IB-verbs-level design and evaluation with – Spark: 0.9.1 – Scala: 2.10.4 [Diagram: Driver with SparkContext, Master, Zookeeper, and Worker nodes over HDFS] 69 HPC Advisory Council Brazil Conference, May '14
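Spark itself is Scala-based, as the slide notes; purely to keep one language across the examples in this write-up, the minimal sketch below uses Spark's Java API to show a groupBy-style operation. The groupByKey() call introduces a wide dependency between RDDs and forces a shuffle, which is the sockets-based communication an IB-verbs-level design would replace. The master URL, application name and data are illustrative.

    // Illustrative Spark groupBy sketch using the Java API (not from the slides).
    import java.util.Arrays;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class GroupBySketch {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[4]", "groupby-sketch");
        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("a", 1), new Tuple2<>("b", 2),
            new Tuple2<>("a", 3), new Tuple2<>("b", 4)));
        // groupByKey() repartitions the RDD: a wide dependency that triggers a shuffle.
        System.out.println(pairs.groupByKey().collectAsMap());
        sc.stop();
      }
    }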
  • 70. Preliminary Results of Spark-RDMA Design – GroupBy • Cluster with 4 HDD Nodes, single disk per node, 32 concurrent tasks – 18% improvement over IPoIB (QDR) for 10GB data size • Cluster with 8 HDD Nodes, single disk per node, 64 concurrent tasks – 20% improvement over IPoIB (QDR) for 20GB data size [Charts: GroupBy time (sec) vs. data size – 4 HDD nodes with 32 cores (4–10 GB) and 8 HDD nodes with 64 cores (8–20 GB) for 10GigE, IPoIB and RDMA] 70 HPC Advisory Council Brazil Conference, May '14
  • 71. Memcached Architecture • Distributed caching layer – Allows spare memory from multiple nodes to be aggregated – General purpose • Typically used to cache database queries and results of API calls • Scalable model, but typical usage is very network intensive • Native IB-verbs-level design and evaluation with – Memcached Server: 1.4.9 – Memcached Client: (libmemcached) 0.52 [Diagram: Internet-facing web frontend servers (Memcached clients) connected over high-performance networks to Memcached servers (main memory, CPUs, SSD, HDD) and database servers] 71 HPC Advisory Council Brazil Conference, May '14
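The measurements on the following slides use the C libmemcached client listed above; purely for illustration, and to keep one language across the examples, the sketch below shows the same set/get pattern with the widely used spymemcached Java client. The server host, port and key names are hypothetical.

    // Illustrative Memcached set/get using the spymemcached Java client (sketch only;
    // the slides' measurements use the C libmemcached client).
    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    public class MemcachedSketch {
      public static void main(String[] args) throws Exception {
        MemcachedClient client =
            new MemcachedClient(new InetSocketAddress("memcached-server.example.com", 11211));
        client.set("user:1001", 3600, "cached-db-row").get();  // store with 1-hour expiry
        Object value = client.get("user:1001");                // the Get path measured on the next slides
        System.out.println(value);
        client.shutdown();
      }
    }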
  • 72. Memcached Get Latency (Small Message) • Memcached Get latency – 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us – 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us • Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster [Charts: Get latency (us) vs. message size (1 byte – 2K) on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR) for IPoIB, 10GigE, OSU-RC-IB and OSU-UD-IB] 72 HPC Advisory Council Brazil Conference, May '14
  • 73. Memcached Performance (FDR Interconnect) • Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR) • Memcached Get latency – 4 bytes OSU-IB: 2.84 us; IPoIB: 75.53 us – 2K bytes OSU-IB: 4.49 us; IPoIB: 123.42 us – Latency reduced by nearly 20X • Memcached Throughput (4 bytes) – 4080 clients OSU-IB: 556 Kops/sec, IPoIB: 233 Kops/sec – Nearly 2X improvement in throughput [Charts: Get latency (us) vs. message size (1 byte – 4K) and throughput (thousands of transactions per second) vs. number of clients (16–4080) for OSU-IB (FDR) and IPoIB (FDR)] 73 HPC Advisory Council Brazil Conference, May '14
  • 74. Application Level Evaluation – Workload from Yahoo • Real application workload – RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients • 12X better than IPoIB for 8 clients • Hybrid design achieves performance comparable to that of the pure RC design [Charts: Time (ms) vs. number of clients (1–8 and 64–1024) for SDP, IPoIB, OSU-RC-IB, OSU-UD-IB and OSU-Hybrid-IB] J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11 J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters using Hybrid Transport, CCGrid’12 74 HPC Advisory Council Brazil Conference, May '14
  • 75. Concluding Remarks • InfiniBand with its RDMA feature is gaining momentum in HPC systems, delivering the best performance and seeing wider usage • As the HPC community moves to Exascale, new solutions are needed in the MPI and Hybrid MPI+PGAS stacks for supporting GPUs and Accelerators • Demonstrated how such solutions can be designed with MVAPICH2/MVAPICH2-X and their performance benefits • New solutions are also needed to re-design software libraries for Big Data environments to take advantage of RDMA • Such designs will allow application scientists and engineers to take advantage of upcoming exascale systems 75 HPC Advisory Council Brazil Conference, May '14
  • 76. HPC Advisory Council Brazil Conference, May '14 Funding Acknowledgments Funding Support by Equipment Support by 76
  • 77. HPC Advisory Council Brazil Conference, May '14 Personnel Acknowledgments Current Students – S. Chakraborthy (Ph.D.) – N. Islam (Ph.D.) – J. Jose (Ph.D.) – M. Li (Ph.D.) – S. Potluri (Ph.D.) – R. Rajachandrasekhar (Ph.D.) Past Students – P. Balaji (Ph.D.) – D. Buntinas (Ph.D.) – S. Bhagvat (M.S.) – L. Chai (Ph.D.) – B. Chandrasekharan (M.S.) – N. Dandapanthula (M.S.) – V. Dhanraj (M.S.) – T. Gangadharappa (M.S.) – K. Gopalakrishnan (M.S.) – G. Santhanaraman (Ph.D.) – A. Singh (Ph.D.) – J. Sridhar (M.S.) – S. Sur (Ph.D.) – H. Subramoni (Ph.D.) – K. Vaidyanathan (Ph.D.) – A. Vishnu (Ph.D.) – J. Wu (Ph.D.) – W. Yu (Ph.D.) 77 Past Research Scientist – S. Sur Current Post-Docs – X. Lu – K. Hamidouche Current Programmers – M. Arnold – J. Perkins Past Post-Docs – H. Wang – X. Besseron – H.-W. Jin – M. Luo – W. Huang (Ph.D.) – W. Jiang (M.S.) – S. Kini (M.S.) – M. Koop (Ph.D.) – R. Kumar (M.S.) – S. Krishnamoorthy (M.S.) – K. Kandalla (Ph.D.) – P. Lai (M.S.) – J. Liu (Ph.D.) – M. Luo (Ph.D.) – A. Mamidala (Ph.D.) – G. Marsh (M.S.) – V. Meshram (M.S.) – S. Naravula (Ph.D.) – R. Noronha (Ph.D.) – X. Ouyang (Ph.D.) – S. Pai (M.S.) – M. Rahman (Ph.D.) – D. Shankar (Ph.D.) – R. Shir (Ph.D.) – A. Venkatesh (Ph.D.) – J. Zhang (Ph.D.) – E. Mancini – S. Marcarelli – J. Vienne Current Senior Research Associate – H. Subramoni Past Programmers – D. Bureddy
  • 78. • August 25-27, 2014; Columbus, Ohio, USA • Keynote Talks, Invited Talks, Contributed Presentations, Tutorial on MVAPICH2, MVAPICH2-X and MVAPICH2-GDR optimization and tuning • Speakers (Confirmed so far): – Keynote: Dan Stanzione (TACC) – Sadaf Alam (CSCS, Switzerland) – Jens Glaser (Univ. of Michigan) – Jeff Hammond (Intel) – Adam Moody (LLNL) – Davide Rossetti (NVIDIA) – Gilad Shainer (Mellanox) – Mahidhar Tatineni (SDSC) – Jerome Vienne (TACC) • Call for Contributed Presentations (Deadline: July 1, 2014) • More details at: http://mug.mvapich.cse.ohio-state.edu 78 Upcoming 2nd Annual MVAPICH User Group (MUG) Meeting HPC Advisory Council Brazil Conference, May '14
  • 79. • Looking for Bright and Enthusiastic Personnel to join as – Visiting Scientists – Post-Doctoral Researchers – PhD Students – MPI Programmer/Software Engineer – Hadoop/Big Data Programmer/Software Engineer • If interested, please contact me at this conference and/or send an e-mail to panda@cse.ohio-state.edu 79 Multiple Positions Available in My Group HPC Advisory Council Brazil Conference, May '14
  • 80. HPC Advisory Council Brazil Conference, May '14 Pointers http://nowlab.cse.ohio-state.edu panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda 80 http://mvapich.cse.ohio-state.edu http://hadoop-rdma.cse.ohio-state.edu