Rajesh Kumar Sundararajan, Assistant VP of Product Management at Aricent, gave a talk about TRILL and Datacenter technologies at the Interop Show in Las Vegas, May 2012.
1. TRILL & Datacenter technologies – their importance, and alternatives for datacenter network convergence
Rajesh Kumar Sundararajan
Assistant VP Product Management, Aricent
May 10, 2012 – Interop, Las Vegas
2. About Me
ARICENT Group
• Global innovation, technology, and services company
• Focused exclusively on communications
• Co-creating the most innovative communications products and applications with customers
• Complete lifecycle solutions for networks, management, and applications
Me
• Rajesh Kumar Sundararajan
• Assistant Vice President – Product Line Management
• Ethernet, IP and Datacenter offerings
2
3. Agenda
Datacenter imperatives
Solutions proposed for the datacenter imperatives
Categorization of the solutions
Technological overview of each of the solutions
TRILL and alternatives
Summary comparison
Conclusion
3
4. 3 Famous C’s of Datacenter Operations
Operational trends in the Datacenter driven by Virtualization, Convergence, & Cloud Services
COST
• Increasing amount of physical space to accommodate new hardware
• Additional CAPEX for hardware and software
• Increased OPEX for staffing and power
COMPLEXITY
• Physical separation of users from applications
• Latency-related concerns due to geographical distribution of applications
• High performance demands of Cloud Services applications
CAPACITY
• Ever-increasing amounts of bandwidth demanded by consumer and enterprise applications
• Increasing proliferation of video to deliver both consumer and business-related content
4
5. Virtualization, Convergence, & Cloud Services
Improved efficiencies in the Datacenter from
VIRTUALIZATION
• Increased utilization of individual servers
• Consolidation of servers and network ports
• Simplified management and operations
• Network virtualization (beyond storage and server virtualization)
CONVERGENCE
• Convergence of equipment and network architectures
• Simplified design and increased flexibility of network architecture
• LAN/SAN convergence enables the ubiquity and extends the reach of Ethernet in the Datacenter
CLOUD SERVICES
• Ability to push hardware (storage, server) and software (SaaS) to a 3rd-party provider
• Eliminates the need to procure, install, update, and upgrade hardware & software; resources can be obtained on an as-needed basis
• Drive performance / load across datacenters
• Remote datacenter backup
5
6. Imperatives and Solutions
Increase Price/ Performance Ratio
• 10Gb Ethernet support required for all new Datacenter Ethernet equipment
• Migration to 40GbE/100GbE needs to be planned
• Next-gen products need to balance high performance with low CAPEX & OPEX
Improve Energy Efficiency
• Networking equipment needs to lower energy consumption costs
• New server chipsets, architectures and software needed to improve overall energy efficiency
Support Multiple Migration Options
• Multiple migration options are available for evolution to a converged network
• Companies may start by migrating to converged network adaptors, then to top-of-rack switches, then to a converged core, or vice versa
• Equipment vendors need to support the different migration options in their products
Evolve Standards
• DCBX, ETS, PFC, QCN
• FCoE, FIP, FIP Snooping
• OpenFlow, SDN, TRILL, SPB, MC-LAG
• VxLAN, NVGRE
6
8. Lossless Ethernet
Why existing QoS and Flow Control are not enough
QoS techniques are primarily
– Flow control – pause frames between switches to control sender’s rate
– 802.1p and DSCP based queuing – flat or hierarchical queuing
– Congestion avoidance methods – WRED, etc.
Issues
– Flow control – no distinction between different applications or frame priorities
– 802.1p and DSCP based QoS methods
• Need to differentiate classes of applications (LAN, SAN, IPC, Management)
• Need to allocate deterministic bandwidth to classes of applications
– Congestion avoidance methods
• Rely on dropping frames at the switches
• Source may continue to transmit at same rate
• Acceptable for IP-based applications, which assume channel loss
• Do not work for storage applications, which are loss-intolerant
8
9. Lossless Ethernet
Flow Control – all traffic on the port is affected
• Internal back pressure triggers PAUSE frames toward the sender
• Every queue on the port (Q1 – CoS 1 … Q4 – CoS 4) is throttled
Priority Flow Control (PFC) – only traffic for a specific CoS on the port is affected (see the frame sketch after this slide)
• Internal back pressure triggers PAUSE frames that name the congested priority (e.g. CoS = 3)
• Only the queue for that CoS is throttled; the other queues keep transmitting
DCBX
• Capabilities are carried as TLVs in LLDP messages
• Each switch advertises its own capabilities: priority groups = x; PFC = yes, and for which priorities; congestion notification = yes/no
• Switches advertise and learn the capabilities to use on their links
Quantized Congestion Notification (QCN)
• Congestion point (switch facing congestion on an egress port) sends congestion notification messages
• Reaction point (source / end-point) accepts the notification (or not) and throttles its Tx rate accordingly
9
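To make PFC concrete, here is a minimal Python sketch (not from the deck) that builds an 802.1Qbb PFC PAUSE frame pausing only CoS 3, so the other queues keep transmitting. The source MAC and pause quanta are illustrative assumptions.

    import struct

    def build_pfc_frame(src_mac: bytes, paused_priorities: dict) -> bytes:
        """paused_priorities maps priority (0-7) to a pause time in quanta."""
        dst_mac = bytes.fromhex("0180C2000001")      # reserved MAC Control multicast address
        ethertype = struct.pack("!H", 0x8808)        # MAC Control EtherType
        opcode = struct.pack("!H", 0x0101)           # PFC opcode
        enable_vector = 0
        quanta = [0] * 8
        for prio, pause_time in paused_priorities.items():
            enable_vector |= 1 << prio               # set bit = pause only this priority
            quanta[prio] = pause_time
        frame = dst_mac + src_mac + ethertype + opcode + struct.pack("!H8H", enable_vector, *quanta)
        return frame.ljust(60, b"\x00")              # pad to minimum frame size (FCS excluded)

    # Pause only CoS 3 (e.g. the storage class); CoS 1, 2 and 4 keep flowing
    pfc = build_pfc_frame(bytes.fromhex("001122334455"), {3: 0xFFFF})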
10. NPIV & NPV
• NPIV = N_Port ID Virtualization – a host-based technology
• NPV = N_Port Virtualization – a switch-based technology
• Technologies for the storage side
• Relevant to Datacenter Ethernet because of the virtualization capabilities and the bearing on FCoE
NPIV
• Virtualization of the storage device port to support VMs and multiple zones on the same link
• The physical port presents multiple logical N_Ports (N_Port1/N_PortID1, N_Port2/N_PortID2, N_Port3/N_PortID3) to the attached F_Port
• Requires support on the storage endpoint and the connected switch as well
NPV
• The endpoint is unaware; the switch proxies for multiple endpoints (N_Ports) using a single NP_Port carrying multiple N_PortIDs
• Reduces the number of switches required
(Diagram: FiberChannel switches interconnected via E_Ports; storage nodes attach as N_Ports to F_Ports on the switches)
10
11. FCoE (FiberChannel Over Ethernet)
• A means to carry FiberChannel frames within Ethernet frames
• Interconnects FiberChannel endpoints or switches across an Ethernet (DataCenter Bridged Ethernet) network
• FC frames are encapsulated in an Ethernet header (see the sketch after this slide)
• A new EtherType transports the FC frames
• FCoE can be enabled on FC endpoint devices / FC switches / Ethernet switches
(Diagram: FC endpoints and FC switches attach over FC links to FCoE switches, with plain Ethernet switches interconnecting them; along the path the FC frame becomes an FCoE frame – Ethernet header + FC frame – and is restored to an FC frame at the far end)
11
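A minimal Python sketch (an assumption, not from the deck) of the layering: the FC frame is carried behind the dedicated FCoE EtherType 0x8906, while FIP control frames use 0x8914. The real FCoE header also carries version/reserved fields and standard-defined SOF/EOF codes; they are reduced to single placeholder bytes here.

    import struct

    FCOE_ETHERTYPE = 0x8906   # EtherType for FCoE data frames
    FIP_ETHERTYPE = 0x8914    # EtherType for FIP control frames (next slide)

    def encapsulate_fcoe(dst_mac: bytes, src_mac: bytes, fc_frame: bytes,
                         sof: int = 0x2E, eof: int = 0x41) -> bytes:
        """Wrap an FC frame in an Ethernet header carrying the FCoE EtherType.
        sof/eof are illustrative placeholders for the start/end-of-frame delimiters."""
        eth_header = dst_mac + src_mac + struct.pack("!H", FCOE_ETHERTYPE)
        return eth_header + bytes([sof]) + fc_frame + bytes([eof])

    fc_frame = b"\x22" * 36                                   # placeholder FC frame
    fcoe_frame = encapsulate_fcoe(bytes.fromhex("0efc00010203"),
                                  bytes.fromhex("0efc000a0b0c"), fc_frame)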
12. FIP (FCoE Initialization Protocol)
• Protocols between FC devices were built on the assumption of a direct connection
• Traversing an Ethernet cloud requires additional procedures
• Addressed by FIP:
  – Device discovery
  – Initializing communication
  – Maintaining communication
• FCoE frame types:
  – Control frames – FIP – use a different EtherType than FCoE
  – Data frames – FC frames encapsulated with the FCoE EtherType
(Diagram: two FCoE devices across an Ethernet cloud; FIP performs FCoE device discovery, initializes communication, and maintains communication)
12
13. FIP Snooping
• Protocols between FC devices were built on the assumption of a direct connection
• The FC switch enforces many configurations, and performs validation and access control on attached endpoints
• Security concerns arise when this is exposed over non-secure Ethernet
• Addressed by FIP Snooping:
  – Done on transit switches carrying FCoE, on VLANs dedicated to FCoE
  – Switches install firewall filters to protect FCoE ports
  – Filters are based on inspection of the FLOGI procedure
• Example filters (sketched after this slide):
  (a) deny ENodes using the FC switch MAC address as source MAC
  (b) ensure the address assigned to an ENode is used only for FCoE traffic
(Diagram: FCoE devices attached over Ethernet to switches performing FIP Snooping)
13
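A minimal Python sketch of the FIP-snooping style filters described above: a transit Ethernet switch protecting its FCoE ports after watching the FLOGI exchange. The table names (fcf_macs, enode_assigned_macs) and the rule set are illustrative assumptions, not an actual switch implementation.

    FCOE_ETHERTYPE = 0x8906
    FIP_ETHERTYPE = 0x8914

    fcf_macs = {"0e:fc:00:ff:ff:01"}                 # learned FCoE Forwarder (FCF) MACs
    enode_assigned_macs = {"0e:fc:00:01:02:03"}      # MACs granted to ENodes via FLOGI

    def permit_on_enode_port(src_mac: str, ethertype: int) -> bool:
        # (a) deny ENodes spoofing the FCF MAC as their source address
        if src_mac in fcf_macs:
            return False
        # (b) an address assigned through FLOGI may carry only FCoE/FIP traffic
        if src_mac in enode_assigned_macs and ethertype not in (FCOE_ETHERTYPE, FIP_ETHERTYPE):
            return False
        return True

    assert not permit_on_enode_port("0e:fc:00:ff:ff:01", FCOE_ETHERTYPE)  # spoofed FCF MAC
    assert not permit_on_enode_port("0e:fc:00:01:02:03", 0x0800)          # non-FCoE traffic from an FCoE MAC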
14. Fabric Scalability and Performance
Why Spanning Tree (RSTP/MSTP/PVRST) is not enough
Necessary fundamentals for FCoE to work
Multipath through the network
Lossless fabric
Rapid convergence in fabric
Spanning tree (with variants like RSTP, MSTP, PVRST) is the universal way to provide redundancy and stability (loop avoidance) in Ethernet networks
Spanning tree is a distance vector based protocol
• Routing equivalent of spanning tree (distance vector based) = RIP
• Limits the size of the network that it can handle
• Much smaller network size than link state based protocols (OSPF / ISIS)
Datacenter networks have become much bigger (and are getting bigger still!)
Spanning tree blocks redundant links / paths, which leads to inefficient capacity utilization
Does not support multipath which is important for SAN/LAN convergence
The TRILL solution
• Apply link state routing to bridging / Layer 2 Ethernet
• Use techniques like ECMP for alternate paths without blocking any links or paths (see the sketch after this slide)
14
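A minimal sketch of the ECMP idea mentioned above: pick one of several equal-cost next hops by hashing flow fields, so all links stay active (nothing is blocked) while packets of one flow keep taking the same path. The field choice and hash are illustrative assumptions, not the TRILL-specified algorithm.

    import hashlib

    def ecmp_next_hop(next_hops: list, src_mac: str, dst_mac: str, vlan: int) -> str:
        key = f"{src_mac}-{dst_mac}-{vlan}".encode()
        index = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % len(next_hops)
        return next_hops[index]

    paths = ["rbridge-2", "rbridge-3", "rbridge-4"]   # equal-cost paths from the IS-IS SPF calculation
    print(ecmp_next_hop(paths, "00:11:22:33:44:55", "66:77:88:99:aa:bb", 100))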
15. Fabric Scalability and Performance
TRILL – Transparent Interconnection of Lots of Links
Focus on the problem of a dense collection of interconnected clients and switches
Attempt to:
• Eliminate the limitations of spanning-tree-centric solutions
• Bring the benefits of routing technologies to the L2 network (without the need for IP/subnets, etc.)
Objectives:
• Zero configuration and zero assumptions
• Forwarding loop mitigation
• No changes to spanning tree protocols
Key components:
• RBridges (Routing Bridges)
• Extensions to IS-IS (the TRILL control protocol)
• Apply link state routing to the VLAN-aware bridging problem (see the encapsulation sketch after this slide)
(Diagram: RBridges interconnected by the TRILL control protocol (IS-IS extension); edge RBridges learn MAC addresses, advertise learnt MACs in TRILL control frames, and carry MAC frames with a TRILL header added between RBridges toward a normal bridge or destination)
15
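A minimal Python sketch of the TRILL encapsulation shown in the diagram: the ingress RBridge prepends an outer Ethernet header (TRILL EtherType 0x22F3) and a 6-byte TRILL header carrying ingress/egress RBridge nicknames and a hop count. The nicknames and MAC addresses are illustrative assumptions.

    import struct

    TRILL_ETHERTYPE = 0x22F3

    def trill_encapsulate(outer_dst: bytes, outer_src: bytes,
                          egress_nick: int, ingress_nick: int,
                          hop_count: int, inner_frame: bytes) -> bytes:
        # First 16 bits: version(2) | reserved(2) | multi-dest flag(1) | op-length(5) | hop count(6)
        first16 = hop_count & 0x3F                         # version 0, unicast, no options
        trill_header = struct.pack("!HHH", first16, egress_nick, ingress_nick)
        outer_eth = outer_dst + outer_src + struct.pack("!H", TRILL_ETHERTYPE)
        return outer_eth + trill_header + inner_frame

    inner = b"\xaa" * 60                                    # the original MAC frame
    pkt = trill_encapsulate(bytes.fromhex("00005e001b01"), bytes.fromhex("00005e001b02"),
                            egress_nick=0x0040, ingress_nick=0x0041,
                            hop_count=0x20, inner_frame=inner)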
16. Fabric Scalability and Performance
TRILL – Handling Multicast and Broadcast
• Create a distribution tree with a selected root
• Distribute from the root to the rest of the tree
• Multiple distribution trees for:
  – Multipath and load distribution
  – Alternate paths and resilience
• All RBridges pre-calculate and maintain the distribution trees (see the calculation sketch after this slide)
• An algorithm is specified to ensure identical calculations at all RBridges
• By default, a distribution tree is shared across all VLANs and multicast groups
• How an ingress node selects a specific tree (from multiple existing trees) is not specified
• An ingress RBridge receiving multicast encapsulates it in a TRILL header and sends it to the root of the tree and to downstream branches
• The frame with the TRILL header is distributed down the branches of the tree
• RBridges at the edges remove the TRILL header and send the frame to the receivers
• RBridges listen to IGMP messages
• RBridges prune trees based on the presence of multicast receivers
• Information from IGMP messages is propagated through the TRILL core to prune the distribution trees
(Diagram: two distribution trees, each with its own selected root)
16
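A minimal sketch of how each RBridge could pre-compute a distribution tree from the shared IS-IS link-state topology: a deterministic BFS from the selected root gives every RBridge the same parent/child relationships. The topology and the tie-break (sorted neighbor order) are illustrative assumptions, not the exact algorithm TRILL specifies.

    from collections import deque

    topology = {                       # adjacency list learned via IS-IS
        "rb1": ["rb2", "rb3"],
        "rb2": ["rb1", "rb3", "rb4"],
        "rb3": ["rb1", "rb2", "rb4"],
        "rb4": ["rb2", "rb3"],
    }

    def distribution_tree(root: str) -> dict:
        """Return child -> parent links of the tree rooted at `root`."""
        parent, visited, queue = {}, {root}, deque([root])
        while queue:
            node = queue.popleft()
            for neighbor in sorted(topology[node]):   # deterministic order at every RBridge
                if neighbor not in visited:
                    visited.add(neighbor)
                    parent[neighbor] = node
                    queue.append(neighbor)
        return parent

    print(distribution_tree("rb1"))   # identical result on every RBridge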
17. Fabric Scalability and Performance
TRILL – Issues and Problems
•Does not (yet) address different types of virtual networks (VxLAN, NVGRE…)
• Provides L2 multipathing (traffic within a VLAN), but L3 (routed traffic across VLANs) is unipath only
• Initial scope of TRILL was defined to address spanning tree limitations
• IP may be an afterthought; only one default router with VRRP = unipath for L3
• Result of the above – forces datacenter operators to provision larger VLANs (more members per VLAN), which restricts segmentation using VLANs
•Requires hardware replacement in switching infrastructure
•Existing security processes have to be enhanced
– Existing security processes rely on packet scanning and analysis
– Encapsulation changes packet headers, existing tools must be modified / enhanced
•Does not inherently address fault isolation
• Does not inherently address QoS mapping between edge & core (for example, congestion management requires congestion in the network to be signaled back to the source)
• Does not clearly address source-specific multicast, so multicast is based on groups only
17
19. Fabric Scalability and Performance
SPB – Shortest Path Bridging
• Key components:
  – Extend IS-IS to compute paths between shortest path bridges
  – Encapsulate MAC frames in an additional header for transport between shortest path bridges
• Variations in encapsulation:
  – SPB-VID
  – SPB-MAC (reuses the 802.1ah encapsulation)
• Allows reuse of reliable Ethernet OAM technology (802.1ag, Y.1731)
• Source MAC learning from SPB-encapsulated frames at the edge SP-bridges
(Diagram: SP-Bridges interconnected by the control protocol (IS-IS extension); edge SP-Bridges learn MAC addresses and carry MAC frames with an SPB header added toward a normal bridge or destination)
19
20. EVB (Edge Virtual Bridging)
• Addresses the interaction between virtual switching environments in a hypervisor and the first layer of the physical switching infrastructure
• 2 different methods – VEPA (Virtual Ethernet Port Aggregator) & VN-Tag
• Without VEPA – in a virtualized environment, traffic between VMs is switched within the virtualizer (VEB / vSwitch)
  – Key issues – monitoring of traffic, security policies, etc., between VMs is broken
• With VEPA – all traffic is pushed out to the switch and then back to the appropriate VM
  – Key issue – additional external link bandwidth requirement, additional latency
• The switch must be prepared to do a "hairpin turn" (see the sketch after this slide)
• Accomplished by software negotiation between the switch and the virtualizer
(Diagram: VMs VM-11 … VM-1n above a virtualizer; without VEPA a VEB/vSwitch switches VM-to-VM traffic internally, with VEPA the traffic is negotiated out to the Ethernet switch)
20
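A minimal sketch of the "hairpin turn" (reflective relay) a VEPA-attached switch must support: a frame may be sent back out the very port it arrived on, which a standard bridge forbids. The port names and MAC table are illustrative assumptions.

    mac_table = {"vm-11": "port1", "vm-12": "port1", "server-b": "port2"}   # MACs learned per port

    def forward(ingress_port, dst_mac, reflective_relay):
        egress = mac_table.get(dst_mac)
        if egress is None:
            return "flood"
        if egress == ingress_port and not reflective_relay:
            return None                  # classic bridge rule: never send a frame back out its ingress port
        return egress                    # with VEPA negotiated, the hairpin turn is allowed

    print(forward("port1", "vm-12", reflective_relay=False))   # None – a standard bridge refuses the hairpin
    print(forward("port1", "vm-12", reflective_relay=True))    # "port1" – hairpinned back toward the other VM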
21. MC-LAG (Multi Chassis LAG)
• Relies on the fact that the datacenter network is large but has a predictable topology (core / aggregation & access)
• A downstream node has multiple links to different upstream nodes
• Links are link-aggregated (trunked) into a single logical interface
• Can be used in redundant or load-shared mode
• Load-shared mode offers multipath
• Resilience and multipath are inherent
• No hardware changes required – accomplished with a software upgrade
• Switches must have a protocol extension to coordinate LAG termination across switches
• Does nothing about the address reuse problem from endpoint virtualization
(Diagram: typical datacenter network – LAGs from access nodes terminate on multiple aggregation switches, with a coordination protocol running across the switches)
21
22. OpenFlow and Software Defined Networking (SDN)
A paradigm shift in networking
• SDN Controller – service creation, flow management, first-packet processing, route creation
• FlowVisor – responsible for network partitioning based on rules (e.g. bridge IDs, flow IDs, user credentials)
• Secure connection from the controller down to the OpenFlow-enabled switches
• SDN Controller – focus on service; OpenFlow – enabling network virtualization
• Simplify the network (make it dumb?) and move the intelligence outside
(Diagram, source ONF: an SDN controller and FlowVisor connected over secure channels to rows of OpenFlow-enabled switches)
22
23. Fabric Scalability and Performance
OpenFlow and Software Defined Networking (SDN) paradigm
SDN Controller – Focus on Service
• Open platform for managing the traffic on OpenFlow-compliant switches
• Functions – network discovery, network service creation, provisioning, QoS, "flow" management, first-packet handling
• Interoperability with existing networking infrastructure – hybrid networks
• Overlay networks, application-aware routing, performance routing, extensions to existing network behavior
OpenFlow – Enabling Network Virtualization
• Lightweight software (strong resemblance to client software)
• Standard interface for access and provisioning
• Secure access to the controller
• Push/pull support for statistics
• Unknown-flow packet trap and disposition through the controller
• Complies with the OpenFlow specification (current version 1.2)
• Accessed and managed by multiple controllers
23
24. Fabric Scalability and Performance
OpenFlow/SDN based switches
• OpenFlow-enabled switches – 2 types:
  – Hybrid (OpenFlow HAL alongside the current network control plane)
  – Pure OpenFlow switches
• Pure OpenFlow switches – simpler, low in software content, lower cost; primarily contain:
  – SSL (for secure management)
  – Encap/decap
  – Hardware programming layer / driver
  – Event handler
• OpenFlow switches receive forwarding instructions from the service controllers
• Architectural aspects – resource partitioning, packet flow aspects
Flow table entry = Rule | Action | Stats (see the sketch after this slide)
  – Rule: switch port, MAC src, MAC dst, Eth type, VLAN ID, IP src, IP dst, IP proto, TCP sport, TCP dport (+ mask)
  – Action: 1. forward packet to port(s)  2. encapsulate and forward to controller  3. drop packet  4. send to normal processing pipeline
  – Stats: packet + byte counters
(Diagram: hybrid switch software stack – management (CLI, SNMP, Web), secure connection (SSL), routing block (RIP, OSPF, ISIS, BGP, RTM, multicast), DCBX / PFC / ETS / LLDP for datacenters, IP forwarding, chassis management and infrastructure software, congestion notification, master policy engine, system monitoring, Layer-2 block (VLAN, STP, LACP, IGMP), QoS (hierarchical, multiple scheduling schemes) & ACLs, and a HAL layer for OpenFlow management)
24
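A minimal Python sketch of the Rule | Action | Stats flow-table entry described above. The match fields follow the 10-tuple listed on the slide; the classes and the sample flow are illustrative assumptions, not an OpenFlow library API.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class FlowRule:                      # None means "wildcard" for that field
        in_port: Optional[int] = None
        mac_src: Optional[str] = None
        mac_dst: Optional[str] = None
        eth_type: Optional[int] = None
        vlan_id: Optional[int] = None
        ip_src: Optional[str] = None
        ip_dst: Optional[str] = None
        ip_proto: Optional[int] = None
        tcp_sport: Optional[int] = None
        tcp_dport: Optional[int] = None

    @dataclass
    class FlowEntry:
        rule: FlowRule
        action: str                       # "output:N", "to_controller", "drop", "normal"
        packets: int = 0                  # stats counters
        bytes: int = 0

        def matches(self, pkt: dict) -> bool:
            return all(v is None or pkt.get(k) == v
                       for k, v in vars(self.rule).items())

    # Send all TCP/80 traffic on VLAN 100 out of port 3
    entry = FlowEntry(FlowRule(vlan_id=100, ip_proto=6, tcp_dport=80), action="output:3")
    print(entry.matches({"vlan_id": 100, "ip_proto": 6, "tcp_dport": 80, "in_port": 1}))  # True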
25. OpenFlow – key issues
• Enormous amount of provisioning for rules in each switch
• In today’s switches – must rely on setting up ACLs
• Switches typically have low limits on number of ACLs that can be set up
• Will need a hardware upgrade to use switches that support large numbers of ACLs
• How to ensure consistency of ACLs across all switches in the network? – troubleshooting challenges
25
26. NVGRE and VXLAN – Background
Challenges with Virtualization:
• More VMs = more MAC addresses and more IP addresses
• Multi-user datacenter + VMs = need to reuse MAC and IP addresses across users
• Moving applications to cloud
= avoid having to renumber all client applications
= need to reuse MAC addresses, IP addresses and VLAN-IDs across users
• More MAC addresses and IP addresses
= larger table sizes in switches
= larger network, more links and paths
Necessity = Create a virtual network for each user
Possible Solutions:
• VLANs per user – limitations of VLAN-Id range
• Provider bridging (Q-in-Q)
– Limitations in number of users (limited by VLAN-ID range)
– Proliferation of VM MAC addresses in switches in the network (requiring larger table sizes in switches)
– Switches must support use of same MAC address in multiple VLANs (independent VLAN learning)
• VXLAN, NVGRE – new methods
26
27. VxLAN (Virtual eXtensible LAN) – How it Works
Without VxLAN
• SERVER – A (ip-A, mac-A) runs a hypervisor hosting VM-11 … VM-1n (ip-11/mac-11 … ip-1n/mac-1n); SERVER – B (ip-B, mac-B) runs a hypervisor hosting VM-21 … VM-2n (ip-21/mac-21 … ip-2n/mac-2n)
• Ethernet frame from VM-11 to VM-21:
  D-MAC = mac-21 | S-MAC = mac-11 | D-IP = ip-21 | S-IP = ip-11 | payload
  (Ethernet header, IP header)
Using VxLAN (tunneling in UDP/IP; see the encapsulation sketch after this slide)
• The frame addressed to ip-21 / mac-21 is delivered by the hypervisor to ip-B / mac-B
• Ethernet frame from VM-11 to VM-21:
  D-MAC = mac-B | S-MAC = mac-A | D-IP = ip-B | S-IP = ip-A | VNI | D-MAC = mac-21 | S-MAC = mac-11 | D-IP = ip-21 | S-IP = ip-11 | payload
  (Outer Ethernet header, outer IP header, inner Ethernet header, inner IP header)
27
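A minimal Python sketch of the VXLAN encapsulation shown above: the original VM-to-VM Ethernet frame rides inside UDP/IP between the two servers, behind an 8-byte VXLAN header that carries the 24-bit VNI. Header packing is simplified (outer Ethernet header omitted, IP checksum left at 0); all addresses and the VNI are illustrative assumptions.

    import struct, socket

    VXLAN_UDP_PORT = 4789

    def vxlan_header(vni: int) -> bytes:
        # flags byte 0x08 = "VNI present", 3 reserved bytes, 24-bit VNI, 1 reserved byte
        return struct.pack("!B3x", 0x08) + struct.pack("!I", vni << 8)

    def encapsulate(outer_src_ip: str, outer_dst_ip: str, vni: int, inner_frame: bytes) -> bytes:
        vxlan = vxlan_header(vni)
        udp_len = 8 + len(vxlan) + len(inner_frame)
        udp = struct.pack("!HHHH", 49152, VXLAN_UDP_PORT, udp_len, 0)   # src port derived from a flow hash
        ip_total = 20 + udp_len
        ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, ip_total, 0, 0, 64, 17, 0,
                         socket.inet_aton(outer_src_ip), socket.inet_aton(outer_dst_ip))
        return ip + udp + vxlan + inner_frame

    inner_eth_frame = b"\xaa" * 60                      # the frame from VM-11 to VM-21
    packet = encapsulate("10.0.0.1", "10.0.0.2", vni=5000, inner_frame=inner_eth_frame)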
28. VXLAN - Internals
• SERVER – A (ip-A, mac-A) and SERVER – B (ip-B, mac-B) each run a hypervisor hosting VMs; VMs belong to per-user virtual networks, e.g. VN-X (User X) and VN-Y (User Y)
• Self table – local VM to VNI mapping, e.g.:
  VM-11 → VNI-X
  VM-12 → VNI-Y
  VM-13 → VNI-Z
• Remote table – remote VM address to VNI and remote server IP, e.g.:
  ip-21 → VNI-X, ip-B
  ip-22 → VNI-X, ip-B
  ip-2n → VNI-Y, ip-B
• Remote table is learnt and aged continuously based on actual traffic (see the sketch after this slide)
28
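A minimal sketch of the remote-table behavior described above: the VXLAN endpoint learns which remote server (VTEP IP) hides each remote VM address by looking at decapsulated traffic, and ages out entries that stop being refreshed. The timings and names are illustrative assumptions.

    import time

    AGE_LIMIT = 300.0                                   # seconds without traffic before removal
    remote_table = {}                                   # inner address -> (vni, remote vtep ip, last seen)

    def learn(inner_src: str, vni: str, outer_src_ip: str) -> None:
        remote_table[inner_src] = (vni, outer_src_ip, time.time())

    def lookup(inner_dst: str):
        entry = remote_table.get(inner_dst)
        if entry and time.time() - entry[2] < AGE_LIMIT:
            return entry[:2]                            # (vni, remote vtep ip) still fresh
        remote_table.pop(inner_dst, None)               # aged out – fall back to flooding/multicast
        return None

    learn("ip-21", "VNI-X", "ip-B")                     # seen in traffic arriving from server B
    print(lookup("ip-21"))                              # ('VNI-X', 'ip-B')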
29. NVGRE
• NVGRE = Network Virtualization using Generic Routing Encapsulation
• Transport Ethernet frames from VMs by tunneling in GRE (Generic Routing Encapsulation)
• Tunneling involves GRE header + outer IP header + outer Ethernet header
• Relies on existing standardized GRE protocol – avoids new protocol, new Assigned Number, etc
• Use of GRE (as opposed to UDP/IP) results in loss of multipath capability (see the header sketch after this slide)
VXLAN vs. NVGRE
• Identifier: VNI – VXLAN Network Identifier (or VXLAN Segment ID) vs. TNI – Tenant Network Identifier
• Added headers: VxLAN header + UDP header + IP header + Ethernet header = 8+8+40+16 = 72 bytes added per Ethernet frame vs. GRE header + IP header + Ethernet header = 8+40+16 = 64 bytes added per Ethernet frame
• End point: VTEP – VXLAN Tunnel End Point – originates or terminates VXLAN tunnels vs. NVGRE endpoint
• Gateway: VXLAN Gateway – forwards traffic between VXLAN and non-VXLAN environments vs. NVGRE gateway
• Protocol: a new protocol vs. extending an existing protocol for new usage
• Multipath: multipath using different UDP ports vs. no multipath, since the GRE header is the same
29
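A minimal Python sketch of the NVGRE encapsulation: the inner Ethernet frame rides in GRE with protocol type 0x6558 (Transparent Ethernet Bridging), and the 24-bit tenant identifier is carried in the GRE key field together with an 8-bit FlowID. Outer Ethernet/IP headers are omitted; the values are illustrative assumptions.

    import struct

    TRANSPARENT_ETHERNET_BRIDGING = 0x6558

    def nvgre_header(tni: int, flow_id: int = 0) -> bytes:
        flags = 0x2000                                   # Key Present bit set, version 0
        key = (tni << 8) | flow_id                       # 24-bit tenant id + 8-bit flow id
        return struct.pack("!HHI", flags, TRANSPARENT_ETHERNET_BRIDGING, key)

    inner_frame = b"\xaa" * 60
    packet = nvgre_header(tni=5000, flow_id=0) + inner_frame
    # Unlike VXLAN, there is no per-flow UDP source port in the outer header to hash on,
    # which is why the slide notes the loss of multipath capability.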
30. Problems with NVGRE and VXLAN
Configuration and Management:
• Controlled multicast (with the use of, say, IGMPv3) within a tenant network now gets broadcast to all endpoints in the tenant's virtual network, since broadcast and multicast get mapped to one multicast address for the entire VXLAN/VNI
• Requires configuration of the virtual network mapping consistently on all the virtual machines – a management nightmare without tools to debug and isolate misconfiguration
• These may be just the tip of the iceberg – will we need virtualized DHCP, virtualized DNS, etc.?
Security:
• Existing security processes are broken
  – Existing security processes rely on packet scanning and analysis
  – Encapsulating MAC in IP changes packet headers; existing tools must be modified / enhanced
• Starts to put more emphasis on firewalls and IPS in virtualizers – redoing the Linux network stack!!
Partial Support:
• Does not address QoS
  – Encapsulation / tunneling techniques like Provider Bridging or PBB clearly addressed QoS by mapping the internal "marking" to an external "marking"
• What is tunneled may already be tunneled – questions of backward compatibility with existing apps
New Ecosystem:
• Existing network analysis tools won't work – a partner ecosystem for the technology has to be developed
• Existing ACLs installed in the network infrastructure are broken
• Needs an additional gateway to communicate outside the virtual network
30
31. TRILL vs. VXLAN (or) TRILL & VXLAN?
TRILL vs. VxLAN
• TRILL addresses the network and tries to optimize it; VxLAN addresses the end points (like servers)
• TRILL is technology to be implemented in the network infrastructure (switches/routers); VxLAN is technology to be implemented in the virtualizers
• TRILL needs hardware replacement of the switch infrastructure; VxLAN needs a software upgrade in the virtualizers (assuming the virtualizer supplier supports this)
• TRILL is restricted to handling VLANs, with no optimizations for VXLAN/NVGRE; VxLAN is agnostic about the switches/routers between end-points (leaves them as good/bad as they are)
• TRILL requires no changes in end-points; VxLAN requires more computing power from the virtualizers (additional packet header handling, additional table maintenance, associated timers)
• TRILL's need is for large datacenters (lots of links and switches); VxLAN's need is primarily for multi-tenant datacenters
Need for both – depending on which requirement a datacenter addresses
31
32. TRILL & the other networking alternatives
Comparison across TRILL, SPB, VEPA, VN-Tag, MC-LAG, and SDN/OpenFlow:
• Multipath: TRILL – multipath; SPB – multipath; VEPA – no / NA; VN-Tag – no / NA; MC-LAG – multipath; SDN/OpenFlow – multipath
• Hardware: TRILL – requires h/w change; SPB – requires h/w change; VEPA – no h/w change; VN-Tag – h/w change at analyzer point; MC-LAG – no h/w change; SDN/OpenFlow – no h/w change for a small number of flows, requires h/w change for large numbers
• Complexity: TRILL – more complex switch; SPB – more complex switch; VEPA – negligible additional complexity; VN-Tag – more complex endpoint; MC-LAG – negligible additional complexity; SDN/OpenFlow – simpler switch, complex controller
• Troubleshooting: TRILL – does not address network troubleshooting; SPB – tools available from MAC-in-MAC technology; VEPA – existing tools continue to work; VN-Tag – requires enhancement to existing tools at the analyzer point; MC-LAG – existing tools continue to work; SDN/OpenFlow – existing tools continue to work
• Provisioning: TRILL – lesser provisioning; SPB – least provisioning; VEPA – lesser provisioning; VN-Tag – least provisioning; SDN/OpenFlow – high amount of provisioning
32