Towards an Open Data Center with an Interoperable Network (ODIN)
Volume 2: ECMP Layer 3 Networks

Casimer DeCusatis, Ph.D.
Distinguished Engineer
IBM System Networking, CTO Strategic Alliances
IBM Systems and Technology Group

May 2012
Executive Overview
As the data center network scales out (both through the addition of more servers per pod and the
interconnection of more pods per data center), conventional Ethernet designs need to be
modified. This section will consider the evolution from conventional network design to several
emerging standards that will support higher scalability and more complex network topologies.
Note that this section does not differentiate between traditional, lossy Ethernet (in which frames
may be dropped during transmission) and lossless Ethernet (also known as Converged
Enhanced Ethernet, which is a different technology that guarantees frame delivery, and will be
discussed in a separate volume of the ODIN reference architecture).
2.1 STP protocol limitations
In order to understand the motivation behind spine-leaf ECMP designs, we must first briefly
review the traditional Ethernet approach using spanning tree protocol (STP) and multi-chassis link
aggregation. Classic Ethernet uses STP to force the network topology into a single loop-free
tree while providing redundancy around both link and device failures. STP works by blocking
ports on redundant paths so that every node in the network is reachable through exactly one
path. If a device or link fails, the spanning tree algorithm selectively opens one or more
redundant paths to restore connectivity, while still reducing the topology to a loop-free tree.
Even when multiple links are connected for scalability and availability, only one link or LAG can
be active. An enhanced version, multiple spanning tree protocol (MSTP), has also been
standardized; it configures a separate spanning tree for each VLAN group and blocks all but
one of the possible alternate paths within each spanning tree.
The changing requirements of modern data center networks are forcing designers to reexamine
the role of STP. One drawback of spanning tree is that by blocking redundant ports and paths, it
leaves the bandwidth on those paths unused until a failure occurs. Put another way, spanning
tree forces all traffic onto a single tree and provides no multi-pathing support, which significantly
lowers utilization of the available network bandwidth.
Additionally, in many situations the choice of which ports to block can also lead to a suboptimal
path of communication between end nodes by forcing traffic to go up and down the spanning tree.
Spanning tree cannot be easily segregated into smaller domains to provide better scalability.
Finally, the convergence time taken to recompute the spanning tree and propagate the changes
in the event of a failure can vary and sometimes becomes quite large. When a new link is added
or removed, the entire network halts all traffic while it configures a new loop-free tree; this can
take anywhere from tens of seconds to minutes. This is highly disruptive for virtual machine
migration, storage traffic, and other applications; in some cases, it can lead to server or system
crashes.
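The cost of single-tree forwarding can be made concrete with a minimal sketch. The code below is an illustration only: it uses plain breadth-first search on a hypothetical four-switch full mesh to pick one active path per node, standing in for STP's actual bridge-priority and root-election machinery, and then counts how many links end up blocked.

```python
from collections import deque

def spanning_tree_active_links(nodes, links, root):
    """BFS from the root picks one loop-free path to every node,
    standing in for STP's single active topology (real STP elects
    the root and paths via bridge priorities, not plain BFS)."""
    adj = {n: [] for n in nodes}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    active, seen, queue = set(), {root}, deque([root])
    while queue:
        n = queue.popleft()
        for m in adj[n]:
            if m not in seen:
                seen.add(m)
                active.add(frozenset((n, m)))
                queue.append(m)
    return active

# A hypothetical full mesh of 4 switches: 6 links in total.
nodes = ["S1", "S2", "S3", "S4"]
links = [("S1", "S2"), ("S1", "S3"), ("S1", "S4"),
         ("S2", "S3"), ("S2", "S4"), ("S3", "S4")]
active = spanning_tree_active_links(nodes, links, root="S1")
blocked = [l for l in links if frozenset(l) not in active]
print(len(active), len(blocked))  # 3 3 — half the fabric's links sit idle
```

Even in this small example, half the installed bandwidth is unused in the failure-free case, which is exactly the inefficiency that the multi-pathing designs in the rest of this volume address.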
As the data center grows larger and networking devices proliferate, designers are forced to give
closer attention than ever before to the complexity associated with the vast number of devices to
be managed in a single fabric. Long distance bridging between networks has also made the
overall data center design more complex. Virtual machine mobility adds requirements to the
network in terms of extending Layer 2 VLANs between racks within a data center or between
different geographic data centers. These moves typically require network configuration changes
and in many cases the traffic may use a non-optimal path between data centers.
To optimize bandwidth utilization in this environment, several vendors have proposed proprietary
alternatives to STP. Since these approaches only function within a single vendor network, and
only for certain devices, we will not discuss them here. However, we will review link aggregation
technology which can be used as part of a spine-leaf ECMP network.
2.2 MLAG – Multi-chassis Link Aggregation Groups
The original link aggregation group (LAG) standard (IEEE 802.3ad) is supported by all switch
vendors today, and was developed in part to overcome the limitations of STP. LAG allows you to
bond two or more physical links into a logical link between two switches or between a server and
a switch. Subsequently, an extension of LAG was proposed for standardization as IEEE
802.1AXbq, Link Aggregation Amendment: Distributed Resilient Network Interconnect. This is
more commonly known as multi-chassis link aggregation, or MLAG. As illustrated in the figure
below, one end of the link aggregated port group is dual-homed into two different devices to
provide device level redundancy. The other end of the group is still single homed into a single
device. The link aggregated port group that is single homed continues to run normal LAG and is
unaware that MLAG is being used. For example, in the figure below Device 1 continues to run
normal LAG. However, Device 2 and Device 3 run MLAG.
Figure 2.1 – Illustration of Multi-Chassis Link Aggregation
If the data center network is designed with multiple links between devices, the connections
between the end systems and the access switches and between the access switches and the
aggregation switches can be based on MLAG. MLAG can be used to create a logically loop-free
topology without relying on spanning tree protocol.
MLAG builds on IEEE 802.1AX (2008) and inherits key properties from conventional LAG: all
frames in a flow are sent over the same physical link, typically selected by hashing packet
header fields, so that frame order is maintained and duplication is avoided. The number of hops
traversed between two devices remains the same, so delay should be equivalent regardless of
the path taken. As a relatively mature technology, MLAG has been deployed extensively, does
not require new encapsulation, and works with existing OAM systems and multicast protocols.
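The flow-to-link mapping described above can be sketched in a few lines. This is a rough model, not a vendor implementation: CRC32 stands in for whatever hardware hash a given switch uses, and the field names are illustrative.

```python
import zlib

def pick_member_link(src_ip, dst_ip, src_port, dst_port, n_links):
    """Map a flow's header fields to one member link of a LAG/MLAG.
    Every frame of the same flow hashes to the same link, which
    preserves frame order without any per-packet reordering logic."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_links

# Frames of one flow always land on the same member link:
first = pick_member_link("10.1.1.5", "10.1.2.9", 49152, 443, n_links=2)
again = pick_member_link("10.1.1.5", "10.1.2.9", 49152, 443, n_links=2)
print(first == again)  # True — deterministic per flow
```

Because the hash is deterministic per flow, load balancing is statistical across many flows; a single elephant flow still occupies only one member link.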
MLAG is supported by a broad spectrum of switch vendors, many of whom refer to this technology
by slightly different brand names. It is possible to run MLAG from a server to two TORs using NIC
teaming on the server side with MLAG at the TORs, or from a blade switch to two aggregation
switches, or from a TOR into two cores. In each of these cases, each tier of switches or servers
could come from a different vendor, since one end of the MLAG sees it as a traditional LAG. In
other words, MLAG allows interoperability across tiers. The primary constraint is that the two
devices used for dual-homing should come from the same vendor. For example, in the previous
figure, Device 1 could be from one vendor while Device 2 and Device 3 could be from another
vendor; however, Device 2 and Device 3 need to be from the same vendor. Furthermore, Device 1
and Devices 2/3 could be different device types: Device 1 could be a server or blade switch, while
Device 2 and Device 3 could be aggregation switches. Here again, the constraint is that Device 2
and Device 3 should typically be similar devices. In practice, most MLAG systems allow
dual-homing across only two paths, as it is difficult to maintain coherent state between more than
two devices with sub-microsecond refresh times.
Figure 2.2 – MLAG and STP comparison
As shown on the left, MLAG increases switching bandwidth and allows dual homing; a change to
the MLAG configuration only impacts affected links. As shown on the right, STP can block certain
links and does not allow dual homing; a change to STP impacts the whole network.
2.3 Layer 3 Spine-Leaf Designs with VLAG and ECMP
In this section, we will describe the basic approach to a Layer 3 “Fat Tree” design (or Clos
network) using Equal Cost Multi-Pathing (ECMP). As shown in the figure below, a Layer 3 ECMP
design creates multiple paths between nodes in a network, across which traffic is load balanced.
The number of paths varies by implementation; the figure shows a 4-way ECMP (in other words,
there are 4 paths which can be used for load balancing).
Bandwidth can be adjusted by adding or removing paths up to the maximum allowed number of
links. Unlike a Layer 2 network which relies on STP, no links are blocked with this approach.
Broadcast loops are avoided by using different VLANs, and the network can route around link
failures.
A typical Layer 3 ECMP implementation is shown in the figure below. In this case, all attached
servers are dual homed (each server has two connections to the first network switch using active-
active NIC teaming). This approach is known as a spine and leaf architecture, where the switches
closest to the server are “leaf” switches which interconnect with a set of “spine” switches using a
set of load balanced paths (a 4 way ECMP in this case). In this example, there are 16 IP subnets
per rack and 64 IP subnets per uplink, for a total of 80 IP subnets. Using a two tier design such as
this with a reasonably sized (48 port) leaf and spine switch and relatively low oversubscription
(3:1), it is possible to scale this L3 ECMP network up to around 1,000 – 2,000 ports. The spine of
the network supports east-west traffic between servers, which can account for over 90% of the
traffic flow in modern data center networks. Note that the design does not require a larger form
factor core switch, although we could certainly use core switches to replace the spine switches in
this example. Any vendor product which supports L3 ECMP can be employed in this manner.
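The scaling claim above can be checked with back-of-the-envelope arithmetic. The sketch below works only from the stated assumptions (identical 48-port switches, 3:1 oversubscription, every spine port feeding a distinct leaf); real designs also weigh spine count, link speeds, and failure domains.

```python
def leaf_spine_server_ports(radix, oversub):
    """Estimate server-facing ports in a two-tier leaf-spine fabric
    built entirely from switches with `radix` ports each, splitting
    each leaf's ports by the oversubscription ratio."""
    uplinks = radix // (oversub + 1)   # 48-port leaf at 3:1 -> 12 uplinks
    downlinks = radix - uplinks        # -> 36 server-facing ports
    max_leaves = radix                 # each spine port connects one leaf
    return max_leaves * downlinks

print(leaf_spine_server_ports(48, 3))  # 1728 — within the 1,000-2,000 range cited
```

The figure's example (4 spines, 16 leaves, 48 servers per leaf) sits well inside this upper bound, leaving headroom to add leaves before the spine radix is exhausted.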
Figure 2.3 – Layer 3 ECMP design principles (four load-balanced paths, subnets 10.1.1.0/24 through 10.1.4.0/24)
Figure 2.4 – Example Layer 3 ECMP leaf-spine design (4 “spine” switches, 16 “leaf” switches, 48 servers per “leaf” switch, 40GbE links)
A Layer 3 ECMP design can be enhanced by using virtual link aggregation groups (vLAGs), as
shown in the figure below. If devices attached to the network support Link Aggregation Control
Protocol (LACP), it becomes possible to logically aggregate multiple connections to the same
device under a common vLAG ID. It is also possible to use vLAG inter-switch links (ISLs)
combined with VRRP to interconnect switches at the same tier of the network. VRRP
supports IP forwarding between subnets, and protocols such as OSPF or BGP can be used to
route around link failures. Server pods can be constructed as shown in this example, and VMs
can be migrated to any server within the pod (note that migration across multiple pods is not
supported by this design).
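On the routing side, enabling equal-cost multipath typically amounts to letting the routing protocol install more than one equal-cost route. As a hedged illustration only (generic IOS-style syntax; the process ID and network statement are hypothetical and will differ per deployment), a 4-way ECMP OSPF configuration might look like:

```
router ospf 1
 maximum-paths 4                     ! install up to four equal-cost next hops
 network 10.1.0.0 0.0.255.255 area 0
```

The equivalent knob exists under different names on most vendors' platforms; the key design decision is matching the configured path count to the physical spine width.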
Figure 2.5 – Spine-Leaf ECMP design example: Layer 3 connections with unique OSPF/BGP subnets per link; a vLAG primary switch and vLAG secondary switch joined by a Layer 2/3 ISL running active/active VRRP; vLAG groups (vLAG3 through vLAG6) down to Layer 2 subnets 1 through 4 in server pod 1, with servers dual-homed using LACP
Layer 3 ECMP designs offer several advantages. They are based on proven, standardized
technology which leverages smaller, less expensive rack or blade switches (virtual switches
typically do not provide Layer 3 functions and would not participate in an ECMP network). The
control plane is distributed, and smaller fault domains are possible using the pod design
approach. These networks scale well (up to 1-2 thousand ports with a slightly oversubscribed 2
tier topology, higher with more tiers).
There are also some tradeoffs when using a Layer 3 ECMP design. The native Layer 2 domains
are relatively small, which limits the ability to perform live VM migrations from any server to any
other server. Such designs can also be fairly complex, requiring expertise in IP routing to set up
and manage the network, and presenting complications with multicast domains. In the examples
shown earlier, scaling is limited by the control plane, which can become unstable in some
conditions (for example, if all the servers attached to a leaf switch boot up at once, the switch’s
ability to process ARP and DHCP relay requests will be a bottleneck in overall performance). In a
Layer 3 design, the size of the ARP table supported by the switches can become a limiting factor
in scaling the design, even if the MAC address tables are quite large. Finally, complications may
result from the use of different hashing algorithms on the spine and leaf switches.
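The ARP-table ceiling is easy to quantify. The sketch below reuses the 16-leaf, 48-server example from this section; the figure of 20 VMs per server is an assumption added for illustration, not a number from this document.

```python
def arp_entries_per_leaf(leaves, servers_per_leaf, vms_per_server):
    """Rough count of host (ARP) entries a Layer 3 leaf switch may
    need if any VM in the fabric can talk to any other VM."""
    return leaves * servers_per_leaf * vms_per_server

# Hypothetical load: the 16-leaf, 48-server example with 20 VMs per server.
print(arp_entries_per_leaf(16, 48, 20))  # 15360 host entries
```

A switch whose ARP table holds only a few thousand entries would cap this design well before its MAC table or port count does, which is why the text calls out ARP capacity as a distinct scaling limit.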
Summary
We have outlined industry standard best practices related to the use of MLAG and Layer 3 ECMP
“fat tree” networks within the data center. This approach addresses the rising CAPEX and OPEX
associated with data center design, enables cost effective scaling of the network, and supports
virtualization of the network servers.
For More Information
IBM System Networking http://ibm.com/networking/
IBM PureSystems http://ibm.com/puresystems/
IBM System x Servers http://ibm.com/systems/x
IBM Power Systems http://ibm.com/systems/power
IBM BladeCenter Server and options http://ibm.com/systems/bladecenter
IBM System x and BladeCenter Power Configurator http://ibm.com/systems/bladecenter/resources/powerconfig.html
IBM Standalone Solutions Configuration Tool http://ibm.com/systems/x/hardware/configtools.html
IBM Configuration and Options Guide http://ibm.com/systems/x/hardware/configtools.html
Technical Support http://ibm.com/server/support
Other Technical Support Resources http://ibm.com/systems/support
Legal Information

IBM Systems and Technology Group, Route 100, Somers, NY 10589. Produced in the USA, May 2012. All rights reserved.

IBM, the IBM logo, ibm.com, BladeCenter, and VMready are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at ibm.com/legal/copytrade.shtml

InfiniBand is a trademark of InfiniBand Trade Association.

Intel, the Intel logo, Celeron, Itanium, Pentium, and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Linux is a registered trademark of Linus Torvalds.

Lotus, Domino, Notes, and Symphony are trademarks or registered trademarks of Lotus Development Corporation and/or IBM Corporation.

Microsoft, Windows, Windows Server, the Windows logo, Hyper-V, and SQL Server are trademarks or registered trademarks of Microsoft Corporation.

TPC Benchmark is a trademark of the Transaction Processing Performance Council.

UNIX is a registered trademark in the U.S. and/or other countries licensed exclusively through The Open Group.

Other company, product and service names may be trademarks or service marks of others.

This publication may contain links to third party sites that are not under the control of or maintained by IBM. Access to any such third party site is at the user's own risk and IBM is not responsible for the accuracy or reliability of any information, data, opinions, advice or statements made on these sites. IBM provides these links merely as a convenience and the inclusion of such links does not imply an endorsement.

Information in this presentation concerning non-IBM products was obtained from the suppliers of these products, published announcement material or other publicly available sources. IBM has not tested these products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

MB, GB and TB = 1,000,000, 1,000,000,000 and 1,000,000,000,000 bytes, respectively, when referring to storage capacity. Accessible capacity is less; up to 3GB is used in service partition. Actual storage capacity will vary based upon many factors and may be less than stated.

Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will depend on considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.

Maximum internal hard disk and memory capacities may require the replacement of any standard hard drives and/or memory and the population of all hard disk bays and memory slots with the largest currently supported drives available. When referring to variable speed CD-ROMs, CD-Rs, CD-RWs and DVDs, actual playback speed will vary and is often less than the maximum possible.

IBM reserves the right to change specifications or other product information without notice. References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates. IBM PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

QCW03020USEN-00