Harvard HPC Seminar Series
Teresa Kaltz, PhD, High Performance Technical Computing, FAS, Harvard
Due to the wide availability and low cost of high speed networking, commodity clusters have become the de facto standard for building high performance parallel computing systems. This talk will introduce Infiniband, the leading high speed interconnect technology, and compare its deployment and performance to Ethernet. In addition, some emerging interconnect technologies and trends in cluster networking will be discussed.
1. To Infiniband and Beyond: High Speed Interconnects in Commodity HPC Clusters
Teresa Kaltz, PhD
Research Computing
December 3, 2009
2. Interconnect Types on Top 500
On the latest TOP500 list, there is exactly one 10 GigE deployment, compared to 181 InfiniBand-connected systems.
Michael Feldman, HPCwire Editor
4. What is Infiniband Anyway?
• Open, standard interconnect architecture
– http://www.infinibandta.org/index.php
– Complete specification available for download
• Complete "ecosystem"
– Both hardware and software
• High bandwidth, low latency, switch-based
• Allows remote direct memory access (RDMA)
5. Why Remote DMA?
• TCP offload engines reduce overhead by offloading protocol processing such as checksums
• 2 copies on receive: NIC → kernel → user
• Solution is Remote DMA (RDMA)

Per byte                Percent overhead
User-system copy        16.5 %
TCP checksum            15.2 %
Network-memory copy     31.8 %

Per packet
Driver                   8.2 %
TCP+IP+ARP protocols     8.2 %
OS overhead             19.8 %
7. Infiniband Signalling Rate
• Each link is a point-to-point serial connection
• Usually aggregated into groups of four ("4X")
• Unidirectional effective bandwidth
– SDR 4X: 1 GB/s
– DDR 4X: 2 GB/s
– QDR 4X: 4 GB/s
• Bidirectional bandwidth is twice unidirectional
• Many factors impact measured performance!
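These effective rates follow from the signalling rates and the 8b/10b line encoding used by SDR, DDR and QDR. The short sketch below (my own arithmetic, not from the talk) reproduces the figures above:

    /* Reproduce the 4X effective bandwidth figures from first principles.
     * Assumes 8b/10b line encoding: 8 data bits per 10 signalled bits. */
    #include <stdio.h>

    int main(void) {
        const char *gen[]  = { "SDR", "DDR", "QDR" };
        double lane_gbps[] = { 2.5, 5.0, 10.0 };  /* per-lane signalling rate */
        const int lanes = 4;                      /* "4X" = four lanes */
        for (int i = 0; i < 3; i++) {
            double data_gbit = lane_gbps[i] * lanes * 8.0 / 10.0;
            printf("%s 4X: %2.0f Gbit/s = %.0f GB/s unidirectional\n",
                   gen[i], data_gbit, data_gbit / 8.0);
        }
        return 0;
    }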
9. DDR 4X Unidirectional Bandwidth
• Achieved bandwidth limited by PCIe 8x Gen 1
• Current platforms mostly ship with PCIe Gen 2
10. QDR 4X Unidirectional Bandwidth
• Still seem to have bottleneck at host if using QDR

http://mvapich.cse.ohio-state.edu/performance/interNode.shtml
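The arithmetic behind the host bottleneck (my own, not from the slides): a PCIe Gen 1 x8 slot signals 8 lanes at 2.5 GT/s with the same 8b/10b encoding, i.e. 2 GB/s raw per direction, and PCIe packet headers and flow control reduce usable payload to roughly 1.5 GB/s, below the 2 GB/s a DDR 4X link can deliver. Gen 2 doubles the lane rate to 5 GT/s (about 4 GB/s raw per direction), which clears DDR 4X but leaves little headroom for QDR 4X at 4 GB/s, consistent with the QDR results above.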
13. Infiniband Silicon Vendors
• Both switch and HCA parts
– Mellanox: InfiniScale, InfiniHost
– QLogic: TrueScale, InfiniPath
• Many OEMs use their silicon
• Large switches
– Parts arranged in fat tree topology
14. Infiniband Switch Hardware
24 port silicon product line (pictured at right: 24, 48, 96, 144 and 288 port switches)
Scales to thousands of ports
Host-based and hardware-based subnet management
Current generation (QDR) based on 36 port silicon
Up to 864 ports in a single switch!
15. Infiniband Topology
• Infiniband uses credit-based flow control
– Need to avoid loops in topology that may produce deadlock
• Common topology for small and medium size networks is a tree (Clos)
• Mesh/torus more cost effective for large clusters (>2500 hosts)
16. Infiniband Routing
• Infiniband is statically routed
• Subnet management software discovers the fabric and generates a set of routing tables
– Most subnet managers support multiple routing algorithms
• Tables updated only when topology changes
• Often cannot achieve theoretical bisection bandwidth with static routing
• QDR silicon introduces adaptive routing
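As a concrete example (mine, not from the slide): OpenSM, the open source subnet manager shipped with OFED, selects among several routing engines (min-hop, up/down, fat-tree, LASH, DOR) at startup, e.g.

    opensm -R ftree

chooses fat-tree routing instead of the default min-hop algorithm.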
17. HPCC Random Ring Benchmark
[Chart: HPCC random ring average bandwidth (MB/s) versus number of enclosures for four routing algorithms, "Routing 1" through "Routing 4"]
18. Infiniband Specification for Software
• IB specification does not define an API
• Actions are known as "verbs"
– Services provided to upper layer protocols
– Send verb, receive verb, etc.
• Community has standardized on an open source distribution called OFED to provide verbs
• Some Infiniband software is also available from vendors
– Subnet management
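For a flavour of the verbs layer, here is a minimal sketch against the OFED libibverbs API. It only enumerates HCAs and queries device attributes; real RDMA code would go on to create a protection domain, register memory and set up queue pairs.

    /* Minimal libibverbs sketch: enumerate HCAs and query capabilities.
     * Compile with: gcc verbs_list.c -libverbs */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void) {
        int n;
        struct ibv_device **devs = ibv_get_device_list(&n);
        if (!devs || n == 0) {
            fprintf(stderr, "no Infiniband devices found\n");
            return 1;
        }
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_device_attr attr;
        if (ctx && ibv_query_device(ctx, &attr) == 0)
            printf("%s: %d ports, max %d queue pairs\n",
                   ibv_get_device_name(devs[0]), attr.phys_port_cnt, attr.max_qp);
        if (ctx) ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }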
19. Application Support of Infiniband
• All MPI implementations support native IB
– OpenMPI, MVAPICH, Intel MPI
• Existing socket applications
– IP over IB (IPoIB)
– Sockets Direct Protocol (SDP), see sketch below
• Does NOT require re-link of application
• Oracle uses RDS (Reliable Datagram Sockets)
– First available in Oracle 10g R2
• Developers can program to the "verbs" layer
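To make the "no re-link" point concrete, the sketch below is plain TCP sockets code; preloading the OFED SDP library (LD_PRELOAD=libsdp.so ./client) intercepts the socket calls and runs the same unmodified binary over SDP, subject to the libsdp.conf rules. Host name and port are placeholders of mine.

    /* Ordinary TCP sockets code; with OFED's libsdp preloaded
     * (LD_PRELOAD=libsdp.so ./client) the same unmodified binary
     * can run over SDP on Infiniband. Host and port are placeholders. */
    #include <stdio.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(void) {
        struct addrinfo hints = {0}, *res;
        hints.ai_socktype = SOCK_STREAM;  /* stream socket: TCP, or SDP if preloaded */
        if (getaddrinfo("node001", "5000", &hints, &res) != 0)
            return 1;
        int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
            return 1;
        write(fd, "hello\n", 6);
        close(fd);
        freeaddrinfo(res);
        return 0;
    }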
23. "High Performance" Ethernet
• 1 GbE cheap and ubiquitous
– hardware acceleration
– multiple multiport NICs
– supported in kernel
• 10 GbE still used primarily as uplinks from edge switches and as backbone
• Some vendors providing 10 GbE to server
– low cost NIC on motherboard
– HCAs with performance proportional to cost
24. RDMA over Ethernet
• NIC capable of RDMA is called an RNIC
• RDMA is the primary method of reducing latency on the host side
• Multiple vendors have RNICs
– Mainstream: Broadcom, Intel, etc.
– Boutique: Chelsio, Mellanox, etc.
• New Ethernet standards
– "Data Center Bridging"; "Converged Enhanced Ethernet"; "Data Center Ethernet"; etc.
25. What is iWarp?
• RDMA Consortium (RDMAC) standardized some protocols which are now part of the IETF Remote Direct Data Placement (RDDP) working group
– http://www.rdmaconsortium.org/home
• Also defined SRP, iSER in addition to verbs
• iWARP supported in OFED
• Most specification work complete in ~2003
26. RDMA over Ethernet?
"The name 'RoCEE' (RDMA over Converged Enhanced Ethernet) is a working name. You might hear me say RoXE, RoE, RDMAoE, IBXoE, IBXE or any other of a host of equally obscure names."
Tom Talpey, Microsoft Corporation
Paul Grun, System Fabric Works
August 2009
27. The Future: InfiniFibreNet
• Vendors moving towards "converged fabrics"
• Using same "fabric" for both networking and storage
• Storage protocols and IB over Ethernet
• Storage protocols over Infiniband
– NFS over RDMA, Lustre (see example below)
• Gateway switches and converged adapters
– Various combinations of Ethernet, IB and FC
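One concrete example of a storage protocol over an RDMA fabric (my example, not from the slide): with the Linux NFS/RDMA modules loaded, a client can mount an export over RDMA rather than TCP. 20049 is the port conventionally used for NFS/RDMA; the server name and paths below are placeholders.

    mount -t nfs -o proto=rdma,port=20049 fileserver:/export /mnt/export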
28. Any Questions?
THANK YOU!
(And no mention of The Cloud)