2014/09/02 Cisco UCS HPC @ ANL


  1. Introduction to Cisco UCS and Userspace NIC (usNIC)
     Argonne National Laboratory, September 2, 2014
     Dave Goodell, dgoodell@cisco.com
  2. Record-setting Intel Ivy Bridge 1U and 2U servers (with GPU support), up to 1.5 TB RAM (yes, really!)
     Low-latency Ethernet: 1.6 μs end-to-end
     10 & 40 Gbps top-of-rack and core switching: 190 ns port-to-port
  3. Performance optimized for any type of workload. Integrated design:
     Service Profiles: agility and reduced time to deploy and provision applications
     UCS Manager: role-based management, automation, ease of integration
     UCS Central: centralized, multi-domain management, alerting and visibility
     Unified Fabric: simplified infrastructure
     Virtualized I/O: security isolation per application, scale, improved performance
     Form Factor Independence: supports both blade and rack-mount servers in a single domain
     Low Latency: low latency over industry-standard Ethernet networking
  4. Consolidating the messaging/interconnect network
     Traditional network: separate fabrics for the LAN (Ethernet), storage (FC), and the cluster interconnect (InfiniBand)
     Unified Fabric: a single Ethernet fabric carrying LAN, storage (via FCoE), and cluster traffic, using DCB and low-latency Ethernet
  5. Benefits
     Low-latency Ethernet delivers high performance while retaining all the advantages of managing a unified network fabric
     HPC compute clusters can coexist with enterprise IT under the same management framework
     Leverage true hybrid solutions from all IT resources
     Simplifies procurement, accelerates deployment, non-intrusive, extends the product life cycle / reusability
     Lower CAPEX and OPEX
  6. One wire to rule them all:
     OS management traffic (e.g., ssh)
     Server hardware management (Cisco CIMC, rich XML interface, unified management)
     File system / I/O traffic
     MPI / application traffic
     All over 10 & 40 Gbps Ethernet with QoS; HPC networking / routing
  7. Carving one wire into isolated vNICs (host port to switch port):
     eth0: VLAN 27, MTU 1500 B, bandwidth limited to 100 Mbps (e.g., the SSH process)
     eth1: VLAN 42, MTU 9000 B, bandwidth limited to 2 Gbps
     eth2: VLAN 64, MTU 9000 B, bandwidth not limited (e.g., the MPI process)
     Each vNIC is a PCIe physical function with isolated hardware resources: virtual functions and RX/TX queue pairs
  8. Characteristics
     Up to 20 chassis (160 blades), 3840 CPU cores
     20 Gbps bandwidth per blade, burst capacity up to 80 Gbps
     Single-wire management; enterprise & HPC
     Pod architecture, scalable; 96 or 48 ports
     5.3 μs any-to-any latency
     Up to 82.94 TeraFLOPs (Intel Ivy Bridge)
  9. C220 M3: 1RU dual-socket rack server (up to 384 GB RAM), 3rd-party GPU expansion
     C240 M3: 2RU dual-socket compute or storage rack server, 3rd-party GPU expansion
     C420 M3: 2RU dual- or quad-socket server (up to 1.5 TB RAM), 3rd-party GPU expansion
  10. Port-to-port latency
      Nexus 3548: 48 x 10 Gbps + 12 x 40 Gbps, 190 ns
      Nexus 3172PQ: 72 x 10 Gbps + 6 x 40 Gbps, <500 ns
      Nexus 3132Q: 32 x 40 Gbps, <500 ns
      Nexus 9000: 9504 (144 x 40 Gbps), 9508 (288 x 40 Gbps), 9516 (576 x 40 Gbps), <500 ns
  11. (image-only slide)
  12. App-to-app latency components: middleware, kernel, NIC, network
      TCP/IP path: kernel overhead dominates, 9.42 μs total
      usNIC path: kernel bypass using SR-IOV, 2.02 μs
      Hardware resource isolation using the IOMMU
      Dual functionality: the same NIC serves both TCP/IP and usNIC
  13. Direct access to NIC hardware from Linux userspace: operating-system bypass via the Linux Verbs API (UD)
      Utilizes the Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency: 2nd-generation 80 Gbps Cisco ASIC; 2 x 10 Gbps or 2 x 40 Gbps Ethernet ports; PCI and mezzanine form factors
      Half-round-trip (HRT) ping-pong latencies (Intel E5-2690 v2 servers): raw back-to-back 1.57 μs; MPI back-to-back 1.85 μs; through MPI + Nexus 3548 2.02 μs
      These numbers keep going down
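     Since usNIC is driven through the standard Linux Verbs API with unreliable-datagram (UD) queue pairs, setup looks like any other libibverbs application. A minimal sketch (generic verbs calls, not Cisco's actual MPI plugin; the device index and queue depths here are arbitrary):

        /* Open the first verbs device (e.g., a usNIC function) and create a
         * UD queue pair. Error handling trimmed for brevity. */
        #include <stdio.h>
        #include <infiniband/verbs.h>

        int main(void)
        {
            int num = 0;
            struct ibv_device **devs = ibv_get_device_list(&num);
            if (!devs || num == 0) { fprintf(stderr, "no verbs devices\n"); return 1; }

            struct ibv_context *ctx = ibv_open_device(devs[0]);
            struct ibv_pd *pd = ibv_alloc_pd(ctx);
            struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, NULL, 0);

            struct ibv_qp_init_attr attr = {
                .send_cq = cq,
                .recv_cq = cq,
                .qp_type = IBV_QPT_UD,   /* usNIC exposes UD, not RC */
                .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                         .max_send_sge = 1, .max_recv_sge = 1 },
            };
            struct ibv_qp *qp = ibv_create_qp(pd, &attr);
            printf("created UD QP %u on %s\n", qp->qp_num,
                   ibv_get_device_name(devs[0]));

            ibv_destroy_qp(qp); ibv_destroy_cq(cq);
            ibv_dealloc_pd(pd); ibv_close_device(ctx);
            ibv_free_device_list(devs);
            return 0;
        }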
  14. 2nd-generation VIC: can present itself 256 times on the PCI bus, with enough hardware queues / buffering for 256 actual NICs
      Created for virtualization, designed for hypervisor bypass
      Intent: each vNIC is assigned to a single virtual machine, which can therefore bypass the hypervisor for "bare metal" network performance in a VM
  15. Diagram: one VIC exposes six vNICs, each a PCI physical function (PF) with its own MAC address (aa:bb:cc:dd:ee:fa through aa:bb:cc:dd:ee:ff), all multiplexed onto two physical ports
  16. Diagram (traditional virtualized I/O): each VM's application goes through the guest kernel and guest driver to a virtual switch in the hypervisor data path, then through the host driver to the VIC's PCI PFs
  17. Diagram (hypervisor bypass): each guest driver attaches directly to a PCI virtual function (VF) on the VIC, skipping the hypervisor's virtual switch; the host driver retains a PF for the hypervisor data path
  18. Diagram (usNIC, no VMs): user processes on the host OS each attach a userspace driver directly to a PCI VF, while the host TCP/IP stack continues to use the host driver and a PF
  19. TCP/IP vs. usNIC data paths
      TCP/IP: application → userspace sockets library → kernel TCP stack → general Ethernet driver (Cisco VIC driver) → Cisco VIC hardware
      usNIC: application → userspace verbs library → Cisco VIC hardware on the send/receive fast path; bootstrapping and setup go through the verbs library, the IB core, and the Cisco usNIC kernel driver
  20. MPI directly injects L2 frames (with UDP/IP payloads) through the userspace verbs library, and receives L2 frames directly from the VIC hardware
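     Because the payloads are ordinary UDP/IP datagrams, send-side framing amounts to prepending Ethernet, IPv4, and UDP headers to the MPI payload before posting the buffer. A hedged illustration (the helper name and field choices are mine, not Cisco's code; checksums are left to hardware offload):

        #include <stdint.h>
        #include <string.h>
        #include <arpa/inet.h>
        #include <net/ethernet.h>
        #include <netinet/ip.h>
        #include <netinet/udp.h>

        /* Lay out Ethernet + IPv4 + UDP headers in front of a payload, as a
         * userspace stack must before handing the frame to the NIC. */
        static size_t build_frame(uint8_t *buf,
                                  const uint8_t dst_mac[6], const uint8_t src_mac[6],
                                  uint32_t src_ip, uint32_t dst_ip,      /* network order */
                                  uint16_t src_port, uint16_t dst_port,  /* host order */
                                  const void *payload, size_t len)
        {
            struct ether_header *eth = (struct ether_header *)buf;
            memcpy(eth->ether_dhost, dst_mac, 6);
            memcpy(eth->ether_shost, src_mac, 6);
            eth->ether_type = htons(ETHERTYPE_IP);

            struct iphdr *ip = (struct iphdr *)(eth + 1);
            memset(ip, 0, sizeof(*ip));
            ip->version  = 4;
            ip->ihl      = 5;
            ip->ttl      = 64;
            ip->protocol = IPPROTO_UDP;
            ip->saddr    = src_ip;
            ip->daddr    = dst_ip;
            ip->tot_len  = htons(sizeof(*ip) + sizeof(struct udphdr) + len);
            /* ip->check left 0: real code computes it or offloads to the NIC */

            struct udphdr *udp = (struct udphdr *)(ip + 1);
            udp->source = htons(src_port);
            udp->dest   = htons(dst_port);
            udp->len    = htons(sizeof(*udp) + len);
            udp->check  = 0;  /* UDP checksum is optional over IPv4 */

            memcpy(udp + 1, payload, len);
            return sizeof(*eth) + sizeof(*ip) + sizeof(*udp) + len;
        }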
  21. Diagram: MPI processes post outbound L2 frames straight to per-process queue pairs (QPs) on the SR-IOV VIC; a hardware classifier steers inbound L2 frames to the right QP, and the x86 chipset's VT-d IOMMU isolates each process's memory
  22. Diagram: each VIC physical function (PF, with its own MAC address and physical port) hosts many virtual functions (VFs), each VF carrying its own queue pairs
  23. Diagram: each MPI process maps a VF (and its QPs) from one PF/port; the Intel IOMMU sits between the VIC and the processes' memory
  24. Used for physical ↔ virtual memory translation: the VIC addresses userspace buffers through the Intel IOMMU, which maps each process's virtual addresses to physical RAM
      The usnic verbs driver programs (and de-programs) the IOMMU
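     From the application's side, this programming happens at memory-registration time: registering a buffer pins its pages and lets the driver install the translations; deregistration removes them. A minimal sketch using the standard verbs calls (the buffer size and access flags here are arbitrary):

        #include <stdlib.h>
        #include <infiniband/verbs.h>

        /* Registering memory pins the pages so the driver can program the
         * IOMMU, letting the VIC DMA into the buffer at the process's
         * virtual addresses. */
        void register_example(struct ibv_pd *pd)
        {
            size_t len = 1 << 20;           /* 1 MiB, arbitrary */
            void *buf = malloc(len);

            struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
            /* ... post send/receive work requests referencing mr->lkey ... */
            ibv_dereg_mr(mr);               /* driver de-programs the mappings */
            free(buf);
        }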
  25. (image-only slide)
  26. Do you know what these are? MAC address, IP, subnet, ARP; GID, LID, GRH
  27. Manage your Ethernet network however you want
      Manage and monitor UDP/IP traffic with standard tools
      Can use IP routing + ECMP to create spine-and-leaf (Clos) networks
      Incrementally grow deployments without rejiggering existing sub-cluster subnet configuration
      No additional cost for IP: Cisco switches route L2/L3 at the same speed
  28. Design principle: behave like the OS network stack as much as possible!
      Examples: routing, ARP, UDP/IP port usage + visibility, MAC addresses in L2 frames
      Can't always achieve full parity: exotic routing configurations (e.g., ip rule add blackhole ...); tcpdump (no OS in the datapath)
  29. QP creation flow (user space: MPI → libibverbs → libusnic_verbs; kernel: usnic_verbs.ko):
      1. MPI calls ibv_create_qp()
      2. libusnic_verbs allocates a full Linux UDP socket, reserving a port in the OS tables
      3. The socket is passed to the kernel module with the create_qp command
      4. The kernel module bumps the socket's refcount before installing the filter, preventing the socket from being freed before the QP is destroyed
      Result: usNIC QPs show up in lsof/netstat
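     The effect of step 2 can be mimicked with plain sockets: bind a kernel UDP socket to an ephemeral port so the OS reserves it (and shows it in netstat/lsof), then carry that port number in the bypass traffic. A sketch of the idea, not the actual libusnic_verbs code:

        #include <stdint.h>
        #include <arpa/inet.h>
        #include <sys/socket.h>
        #include <netinet/in.h>

        /* Reserve a UDP port by binding a real kernel socket. The socket is
         * never read from; its job is to make the port visible to the OS
         * and keep other applications from claiming it. */
        int reserve_udp_port(uint16_t *port_out)
        {
            int fd = socket(AF_INET, SOCK_DGRAM, 0);
            struct sockaddr_in addr = { .sin_family = AF_INET,
                                        .sin_addr.s_addr = htonl(INADDR_ANY),
                                        .sin_port = 0 /* let the kernel pick */ };
            if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
                return -1;

            socklen_t alen = sizeof(addr);
            getsockname(fd, (struct sockaddr *)&addr, &alen);
            *port_out = ntohs(addr.sin_port);
            /* keep fd open for the QP's lifetime; closing it frees the port */
            return fd;
        }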
  30. Open MPI natively supports multi-rail, with an "automagic" configuration philosophy (when possible)
      VICs have 2 ports, and a server can have more than one VIC
      To avoid artificial contention, pair local interfaces with remote interfaces
      A remote MPI process might be on the same subnet, or might not
      This is a nontrivial software problem (a toy pairing sketch follows slide 33)
  31. Example interface pairing: Host A has NICs A1 and A2, Host B has NICs B1 and B2, with one MPI process (P1, P2) per host
      Before pairing, all four NIC-to-NIC paths are possible connectivity
      Open MPI selects one valid pairing, e.g., A1-B1 with A2-B2, or A1-B2 with A2-B1
  32. Topology example: Host A's NICs A1/A2 sit on subnet S1 with router interfaces R1a/R2a; router interfaces R1b/R2b and Host B's NICs B1/B2 sit on subnet S2; the connecting switch does not need L3 capability
  33. Matching logic must watch for sub-optimal pairings
      Setup: A1 can reach B1 and B2, but A2 can only reach B1
      Case 1 (sub-optimal): pairing A1 with B1 leaves A2 unable to pair with any interface on Host B, reducing aggregate bandwidth
      Case 2 (desired): pairing A1 with B2 and A2 with B1 lets both Host A interfaces pair with Host B interfaces
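     One way to see why Case 2 falls out of a principled matcher: treat reachability as a bipartite graph and compute a maximum matching, so a greedy early choice (A1-B1) gets re-routed when it blocks another interface. A toy sketch of that idea (illustrative only; Open MPI's actual usnic pairing logic differs in detail):

        #include <stdio.h>
        #include <string.h>

        #define MAXN 8
        static int reach[MAXN][MAXN];  /* reach[i][j]: local NIC i sees remote j */
        static int match_of[MAXN];     /* remote j -> local i, or -1 if free */
        static int n_local, n_remote;

        /* Classic augmenting-path step: pair local NIC i, displacing an
         * earlier pairing if its owner can be re-routed elsewhere. */
        static int try_pair(int i, int *visited)
        {
            for (int j = 0; j < n_remote; j++) {
                if (!reach[i][j] || visited[j]) continue;
                visited[j] = 1;
                if (match_of[j] < 0 || try_pair(match_of[j], visited)) {
                    match_of[j] = i;
                    return 1;
                }
            }
            return 0;
        }

        int main(void)
        {
            /* Slide 33 example: A1 reaches B1 and B2, A2 reaches only B1 */
            n_local = 2; n_remote = 2;
            reach[0][0] = reach[0][1] = 1;  /* A1 -> B1, B2 */
            reach[1][0] = 1;                /* A2 -> B1 */

            memset(match_of, -1, sizeof(match_of));
            int paired = 0;
            for (int i = 0; i < n_local; i++) {
                int visited[MAXN] = {0};
                paired += try_pair(i, visited);
            }
            for (int j = 0; j < n_remote; j++)
                if (match_of[j] >= 0)
                    printf("A%d <-> B%d\n", match_of[j] + 1, j + 1);
            printf("%d of %d local NICs paired\n", paired, n_local);
            return 0;   /* prints the Case 2 pairing: A2<->B1, A1<->B2 */
        }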
  34. (image-only slide)
  35. (benchmark plot) 1.88 μs on this Sandy Bridge machine
  36. (image-only slide)
  37. Everything above the firmware is open source:
      Open MPI: shipping in Cisco Open MPI v1.6.5 (soon to be v1.8.2); upstream in Open MPI v1.7.3 and beyond (current stable is v1.8.1)
      The libibverbs plugin
      The verbs kernel module
  38. 3rd-generation VIC: 2 x 40G and PCIe Gen 3; more MPI offload to hardware
      Software update (expected this week): upgrade the transport from a custom L2 protocol to UDP
      Key rationale: Cisco switches L2 and L3 at the same speed, so UDP allows switching usNIC traffic around the data center and easier monitoring and policy control of usNIC traffic
      Kernel + userspace support for RHEL 7.0 and SLES 12; Open MPI optimizations for the 3rd-generation VIC
  39. Thank you.

Editor's Notes

  • UCS is Cisco’s x86 server line. It offers both blade and rack servers with a focus on manageability, virtualization, networking, and performance. It’s all designed to integrate smoothly with Cisco’s switching products. I’m really here to talk about usNIC, our low latency Ethernet solution for HPC.

    N3K: 48 ports of 10 Gbps, 12 ports of 40 Gbps, 1RU
    N6K: 384 ports of 10 Gbps, or 96 ports of 40 Gbps, 4RU
  • Many innovative features in UCS since we launched in 2009.
  • Simplifies deployment and management by cutting out specialized networks. Saves costs by reducing the number of expensive adapters that need to be plugged into a server and reducing the number of cables and switches that need to be purchased and installed.
  • usNIC allows customers to finally take control of their HPC resources and save time, energy, and money by empowering IT to do what previously only scientists and researchers could do with compute clusters. This technology also enables HPC on demand: the same VIC that has already demonstrated world-record performance in the enterprise now delivers the speed HPC applications require. Customers can provision compute at will from a single point over a single network fabric.
  • The trick is in VLANs and QoS, allowing you to carve that single wire into separate slices.
  • could poll the audience about Ethernet switch latencies
  • <Main point: approximately 85% of the end-to-end latency is within the server, so let's tackle the big-ticket item>

    <Click> Latency within the application depends on the application and on the way it has been written and designed
    <Click> The middleware layer is a big contributor as well, often taking approximately 20 μs
    <Click> The kernel protocol processing is responsible for at least another 6 μs
    <Click> The adapter itself adds between 3 and 6 μs depending on the hardware vendor's design and implementation
    <Click> Finally, the network elements between 2 servers can add up to 5 μs of latency per hop

    The breakdown of these latency elements shows that approximately 85% of the latency (not counting the application latency itself) is within the server. The network contributes only 15% of the total end-to-end application latency. At Cisco, our target is to reduce the overall latency, and we are taking a holistic view in our approach.
  • All over *standard* Ethernet (though the VIC is required).
  • VT-d: Virtualization Technology for Directed I/O
    IOMMU: input/output memory management unit
    SR-IOV: Single Root I/O Virtualization
  • Measurements taken on E5-2690 0 @ 2.90GHz CPUs (Sandy Bridge) with Icehouse 40 GbE cards (PCIe Gen2, x16)
