Albert Greenberg
Director of Development
Microsoft
Keynotes Session
ONS2015: http://bit.ly/ons2015sd
ONS Inspire! Webinars: http://bit.ly/oiw-sd
Watch the talk (video) on ONS Content Archives: http://bit.ly/ons-archives-sd
Windows Azure: Scaling SDN in the Public Cloud
3. Windows Azure: Scaling SDN in the Public Cloud
Albert Greenberg
Director of Development
Windows Azure Networking
albert@microsoft.com
4. • Microsoft’s big bet on public cloud
• Companies move their IT infrastructure to the cloud
• Elastic scaling and less expensive than an on-premises DC
• Runs major Microsoft properties (Office 365, OneDrive, Skype, Bing, Xbox)
5. Summary
• Scenario: BYO Virtual Network to the Cloud
• Per customer, with capabilities equivalent to its on-premises counterpart
• Challenge: How do we scale virtual networks across millions of servers?
• Solution: Host SDN solves it: scale, flexibility, timely feature rollout, debuggability
• Virtual networks, software load balancing, …
• How: Scaling flow processing to millions of nodes
• Flow tables on the host, with on-demand rule dissemination
• RDMA to storage
• Demo: ExpressRoute to the Cloud (Bing it!)
6. Infrastructure as a Service: Develop, test, run your apps
• Easy VM portability: if it runs on Hyper-V, it runs in Windows Azure (Windows, Linux, Ubuntu, redis, mongodb, …)
• Deploy VMs anywhere with no lock-in
7. What Does IaaS Mean for Networking? Scenario: BYO Network
Windows Azure Virtual Networks
• Goal: BYO address space + policy
• Azure is just another branch office of your enterprise, via VPN
• Communication between tenants of your Azure deployment should be efficient and scalable
[Diagram: an enterprise 10.1/16 network linked over a secure tunnel to an Azure virtual network with the same 10.1/16 address space]
14. How do we support 50k+ virtual networks, spread over a single 100k+ server deployment in a DC?
Start by finding the right abstractions.
15. SDN: Building the Right Abstractions for Scale
Abstract by separating the management, control, and data planes
[Diagram: Azure Frontend (management plane) -> Controller (control plane) -> Switch (data plane)]
Example: ACLs
• Management plane: create a tenant
• Control plane: plumb these tenant ACLs to these switches
• Data plane: apply these ACLs to these flows
• The data plane needs to apply per-flow policy to millions of VMs
• How do we apply billions of flow policy actions to packets?
16. Solution: Host Networking
• If every host performs all packet actions for its own VMs, scale is much more tractable
• Use a tiny bit of the distributed computing power of millions of servers to solve the SDN problem
• If millions of hosts work to implement billions of flows, each host only needs thousands
• Build the controller abstraction to push all SDN to the host
17. VNets on the Host
• A VNet is essentially a set of mappings from a customer-defined address space (CAs) to the provider addresses (PAs) of the hosts where the VMs are located
• Separate the interface that specifies a VNet from the interface that plumbs mappings to switches, via a Network Controller
• All CA <-> PA mappings for a local VM reside on the VM’s host, and are applied there
[Diagram: customer config enters the Azure Frontend over the northbound API as a VNet description (CAs); the Controller pushes L3 forwarding policy (CA <-> PA mappings) over the southbound API to the VMSwitches hosting the Blue and Green CA spaces]
18. VNet Controller
[Diagram: the Controller, backed by secondary controllers kept in sync via a consensus protocol, takes customer config (VNet description) from the Azure Frontend and pushes L3 forwarding policy to the Azure VMSwitches on Node1 (PA 10.1.1.5: Blue VM1 10.1.1.2, Green VM1 10.1.1.2), Node2 (PA 10.1.1.6: Red VM1 10.1.1.2, Green VM2 10.1.1.3), and Node3 (PA 10.1.1.7: Green S2S GW 10.1.2.1), which connects through a VPN GW to the Green enterprise network, 10.2/16. Note the deliberately overlapping CAs across tenants.]
19. Forwarding Policy: Traffic to On-Prem
[Diagram: Green VM1 (CA 10.1.1.2) on Node1 (PA 10.1.1.5) sends a packet Src:10.1.1.2 Dst:10.2.0.9. The Azure VMSwitch’s policy lookup, using L3 forwarding policy from the Controller, finds that 10.2/16 routes to the GW on the host with PA 10.1.1.7, so it encapsulates: Src:10.1.1.5 Dst:10.1.1.7 GRE:Green | Src:10.1.1.2 Dst:10.2.0.9. Node3 (10.1.1.7) decapsulates, and the Green S2S GW (10.1.2.1) carries the inner packet over L3VPN/PPP through the VPN GW to the Green enterprise network, 10.2/16.]
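The lookup-then-encapsulate step above can be sketched in a few lines. This is an illustrative model using the addresses from the slide, not the actual Azure VMSwitch code; the `VNetForwarder` class and its method names are invented for this sketch.

```python
import ipaddress

class VNetForwarder:
    """Toy per-host L3 forwarding table: CA prefix -> next-hop host PA."""
    def __init__(self, host_pa, vnet_key):
        self.host_pa = host_pa      # provider address of this host
        self.vnet_key = vnet_key    # GRE key identifying the VNet
        self.routes = []            # (CA prefix, PA of next-hop host)

    def add_route(self, ca_prefix, pa):
        self.routes.append((ipaddress.ip_network(ca_prefix), pa))
        # keep longest prefixes first so the most specific route wins
        self.routes.sort(key=lambda r: r[0].prefixlen, reverse=True)

    def forward(self, src_ca, dst_ca):
        """Return the encapsulated packet for a CA-addressed packet."""
        dst = ipaddress.ip_address(dst_ca)
        for prefix, pa in self.routes:
            if dst in prefix:
                # outer header uses provider addresses; GRE key = VNet
                return {"outer_src": self.host_pa, "outer_dst": pa,
                        "gre_key": self.vnet_key,
                        "inner_src": src_ca, "inner_dst": dst_ca}
        return None  # no policy found: would trigger a mapping request

# The example from the slide: Green VM1 on Node1 sends to on-prem 10.2.0.9.
fw = VNetForwarder(host_pa="10.1.1.5", vnet_key="Green")
fw.add_route("10.2.0.0/16", "10.1.1.7")   # on-prem space via the S2S GW host
pkt = fw.forward("10.1.1.2", "10.2.0.9")
```

The key point the sketch preserves: the tenant’s packet is untouched; policy lives entirely in an outer header chosen on the sending host.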
20. Cloud Load Balancing
• All infrastructure runs behind an LB to enable high availability and application scale
• How do we make application load balancing scale to the cloud?
• Challenges:
• Load balancing the load balancers
• Hardware LBs are expensive, and cannot support the rapid creation/deletion of LB endpoints required in the cloud
• Support 10s of Gbps per cluster
• Support a simple provisioning model
[Diagram: an LB fronting web server VMs, which call a SQL service running on IaaS VMs]
21. All-Software Load Balancer: Scale Using the Hosts
• Goal of an LB: map a Virtual IP (VIP) to the Dynamic IP (DIP) set of a cloud service
• Two steps: load balance (select a DIP) and NAT (translate VIP -> DIP and ports)
• Pushing the NAT to the vswitch makes the LBs stateless (ECMP) and enables direct return
• SDN controller abstracts out LB/vswitch interactions
[Diagram: the Controller takes a tenant definition (VIPs, # DIPs) and plumbs mappings to stateless LB VMs and to the NAT in each Azure VMSwitch; the edge routers spread client VIP traffic across the LB VMs, which tunnel statelessly to DIPs 10.1.1.2 through 10.1.1.5, while return traffic flows directly from the VMSwitch back to the client (direct return)]
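The two steps on this slide can be sketched as follows. This is a simplified illustration of the design, not the Azure SLB implementation; the function names, the example VIP, and the client address are all invented for the sketch.

```python
import hashlib

def pick_dip(five_tuple, dips):
    """Stateless load balance: hash the flow to a DIP. Because every
    MUX replica computes the same hash, ECMP can spray packets across
    MUXes with no shared per-flow state."""
    key = ":".join(map(str, five_tuple)).encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return dips[h % len(dips)]

def vswitch_inbound_nat(pkt, vip, dip):
    """On the DIP's host: the vswitch rewrites VIP -> DIP before
    delivering the packet to the VM."""
    return dict(pkt, dst=dip) if pkt["dst"] == vip else pkt

def vswitch_direct_return(pkt, vip, dip):
    """Outbound: the vswitch rewrites DIP -> VIP and sends the reply
    straight to the client, bypassing the load balancer entirely."""
    return dict(pkt, src=vip) if pkt["src"] == dip else pkt

# Illustrative flow: a client hits a hypothetical VIP fronting 4 DIPs.
dips = ["10.1.1.2", "10.1.1.3", "10.1.1.4", "10.1.1.5"]
flow = ("203.0.113.9", 51000, "65.52.0.1", 80, "tcp")
dip = pick_dip(flow, dips)
inbound = vswitch_inbound_nat({"src": flow[0], "dst": "65.52.0.1"},
                              "65.52.0.1", dip)
reply = vswitch_direct_return({"src": dip, "dst": flow[0]},
                              "65.52.0.1", dip)
```

Because the NAT state lives on the DIP’s host rather than in the MUX, a MUX can fail or be added without breaking existing connections, and reply traffic (often the bulk of the bytes) never transits the LB at all.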
23. Flow Tables Are the Right Abstraction
• VMSwitch exposes a typed Match-Action-Table API to the controller
• One table per policy
• Key insight: let the controller tell the switch exactly what to do with which packets (e.g. encap/decap), rather than trying to use existing abstractions (tunnels, …)
[Diagram: the Controller takes a tenant description (VNet description, VNet routing policy, ACLs, NAT endpoints) and programs per-policy tables in the Azure VMSwitch on node 10.4.1.5, between Blue VM1 (10.1.1.2) and the NIC]
VNET table (flow -> action):
• TO 10.2/16: encap to GW
• TO 10.1.1.5: encap to 10.5.1.7
• TO !10/8: NAT out of VNET
LB NAT table (flow -> action):
• TO 79.3.1.2: DNAT to 10.1.1.2
• TO !10/8: SNAT to 79.3.1.2
ACL table (flow -> action):
• TO 10.1.1/24: allow
• 10.4/16: block
• TO !10/8: allow
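A minimal sketch of the typed match-action-table idea, using the example rules from the slide. The `Rule` type and `apply_table` helper are illustrative, not the VMSwitch API; the `negate` flag models the slide’s “TO: !10/8” entries.

```python
import ipaddress

class Rule:
    """One typed match-action entry: destination prefix -> action."""
    def __init__(self, dst_prefix, action, negate=False):
        self.prefix = ipaddress.ip_network(dst_prefix)
        self.negate = negate          # True models "TO: !prefix"
        self.action = action

    def matches(self, dst):
        inside = ipaddress.ip_address(dst) in self.prefix
        return inside != self.negate

def apply_table(table, dst):
    """Return the action of the first matching rule, else None."""
    for rule in table:
        if rule.matches(dst):
            return rule.action
    return None

# One table per policy, as on the slide.
vnet = [Rule("10.2.0.0/16", "encap to GW"),
        Rule("10.1.1.5/32", "encap to 10.5.1.7"),
        Rule("10.0.0.0/8", "NAT out of VNET", negate=True)]
acls = [Rule("10.1.1.0/24", "allow"),
        Rule("10.4.0.0/16", "block"),
        Rule("10.0.0.0/8", "allow", negate=True)]
```

The point of the typing: the controller hands the switch concrete, unambiguous actions (“encap to this PA”) per table, so the switch never has to infer intent from a generic tunnel abstraction.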
24. 1. Table Typing and Flow Caching Are Critical to Dataplane Performance
• COGS in the cloud is driven by VM density: 40GbE is here
• NIC offloads are critical to achieving density
• Requires significant design work in the VMSwitch to scale overlay / NAT / ACL policy to line speed
• First-packet actions can be complex, but established-flow matches need to be typed, predictable, and simple
[Diagram: the same VNET / LB NAT / ACL tables as slide 23, in the Azure VMSwitch on node 10.4.1.5 between Blue VM1 (10.1.1.2) and the NIC]
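The first-packet vs. established-flow split can be sketched as a flow cache. This is an illustrative model: the real VMSwitch compiles offload-friendly typed actions, whereas here the slow path is just a stand-in function.

```python
class FlowCache:
    """First packet of a flow takes the slow path (full table walk);
    every later packet hits a per-flow cache of the compiled actions."""
    def __init__(self, slow_path):
        self.slow_path = slow_path   # complex, runs every policy table
        self.cache = {}              # 5-tuple -> compiled action list
        self.hits = self.misses = 0

    def process(self, five_tuple):
        actions = self.cache.get(five_tuple)
        if actions is None:
            self.misses += 1
            # Slow path once, then cache a simple, typed action list
            # that the NIC/fast path can apply predictably.
            actions = self.slow_path(five_tuple)
            self.cache[five_tuple] = actions
        else:
            self.hits += 1
        return actions

def slow_path(five_tuple):
    # Stand-in for walking the VNET / LB NAT / ACL tables.
    return ["encap to GW", "allow"]

fc = FlowCache(slow_path)
flow = ("10.1.1.2", 44000, "10.2.0.9", 443, "tcp")
for _ in range(5):
    fc.process(flow)
```

Only the first packet pays the cost of the full policy walk; the other four are cache hits, which is what makes line-rate processing (and NIC offload of the cached actions) feasible.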
25. 2. Separate Controllers by Application
[Diagram: behind the northbound API, separate per-application controllers (a VNet Controller with the tenant description, VNet description, VNet routing policy, and ACLs; an LB Controller with LB VIP endpoints; a Network Controller with NAT endpoints) each program their own tables (VNET, LB NAT, ACLs) in the Azure VMSwitch on node 10.4.1.5, between Blue VM1 (10.1.1.2) and the NIC]
26. 3. Eventing: Agents Are Also Per-Application
• Attempting to give each VMSwitch a synchronously consistent view of the entire network is not scalable
• Separate rapidly changing policy (location mappings of VMs in a VNet) from static provisioning policy
• VMSwitches should request needed mappings on demand via eventing
• We need a smart host agent to handle eventing and look up mappings
[Diagram: the VNet Controller pushes policy once to the VNet Agent; when the Azure VMSwitch finds no policy for a packet, it raises a mapping request event to the agent, which fetches the mappings from a replicated Mapping Service and plumbs them into the VNET table]
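The event flow above (miss, mapping request, plumb rule, fast path thereafter) can be sketched end to end. All class and method names here are invented for illustration; the real agent/switch boundary is an OS-level API, as the next slide explains.

```python
class MappingService:
    """Stand-in for the replicated mapping service: (VNet, CA) -> PA."""
    def __init__(self, mappings):
        self.mappings = mappings
        self.requests = 0

    def lookup(self, vnet, ca):
        self.requests += 1
        return self.mappings[(vnet, ca)]

class VNetAgent:
    """Host agent: subscribes to miss events and resolves mappings."""
    def __init__(self, service, switch):
        self.service = service
        switch.on_miss = self.handle_miss

    def handle_miss(self, switch, vnet, ca):
        pa = self.service.lookup(vnet, ca)
        switch.table[(vnet, ca)] = pa    # plumb the rule back down
        return pa

class VMSwitch:
    def __init__(self):
        self.table = {}      # (VNet, CA) -> PA, populated on demand
        self.on_miss = None

    def forward(self, vnet, ca):
        if (vnet, ca) not in self.table:
            # No policy found for this packet: raise a mapping request.
            return self.on_miss(self, vnet, ca)
        return self.table[(vnet, ca)]

svc = MappingService({("Green", "10.2.0.9"): "10.1.1.7"})
sw = VMSwitch()
agent = VNetAgent(svc, sw)
sw.forward("Green", "10.2.0.9")   # miss: one request to the service
sw.forward("Green", "10.2.0.9")   # hit: served from the host's table
```

Each host ends up holding only the mappings its own VMs actually talk to, which is what keeps per-host state in the thousands rather than the billions.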
27. Eventing: The Real API Is on the Host
• The wire protocols between the controller, agent, and related services are now application-specific (rather than generic SDN APIs)
• The real southbound API (which is implemented by VNet, LB, ACLs, etc.) is now between the agents and the VMSwitch
• A high-performance OS-level API rather than a wire protocol
• We have found that eventing is a requirement of any nontrivial SDN application
[Diagram: the same mapping-request flow as slide 26, with the VNet Controller, VNet Agent, and Mapping Service grouped as the VNet application, and the agent-to-VMSwitch boundary labeled as the southbound API]
28. 4. Separate Regional and Local Controllers
• VNet scope is a region: 100k+ nodes. One controller can’t manage them all!
• Solution: a regional controller defines the VNet; local controllers program the end hosts
• Make the Mapping Service hierarchical, enabling DNS-style recursive lookup
[Diagram: replicated regional controllers hold the VNet description, policy, and regional mappings; each local controller receives policy, serves local mappings to the VNET agents on its hosts, and recursively forwards mapping requests it cannot answer up to the regional controllers]
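The DNS-style recursive lookup can be sketched as a two-tier resolver with caching. The `Resolver` class and the example addresses are illustrative only.

```python
class Resolver:
    """One tier of the hierarchical mapping service: answer from local
    state if possible, otherwise recurse to the parent tier and cache."""
    def __init__(self, mappings, parent=None):
        self.mappings = dict(mappings)  # mappings this tier knows
        self.parent = parent            # next tier up (None at the top)
        self.lookups = 0

    def resolve(self, ca):
        self.lookups += 1
        if ca in self.mappings:
            return self.mappings[ca]
        if self.parent is None:
            raise KeyError(ca)
        # Recurse upward, then cache locally so the cross-tier cost
        # is paid only once per mapping (as with a DNS resolver).
        pa = self.parent.resolve(ca)
        self.mappings[ca] = pa
        return pa

regional = Resolver({"10.2.0.9": "10.9.1.7"})          # regional tier
local = Resolver({"10.1.1.2": "10.1.1.5"}, regional)   # local tier

local.resolve("10.1.1.2")   # answered locally
local.resolve("10.2.0.9")   # recursive: goes regional once
local.resolve("10.2.0.9")   # now cached locally
```

The regional controllers see only first-time misses per local controller, so neither tier has to hold (or be consulted for) the full region’s state on every lookup.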
29. A complete virtual network needs storage as well as compute!
How do we make Azure Storage scale?
30. Storage Is Software Defined, Too
• Erasure coding provides the durability of 3-copy writes with small (<1.5x) overhead by distributing coded blocks over many servers
• Lots of network I/O for each storage I/O
• We want to make storage clusters scale cheaply on commodity servers
[Diagram: a write is committed, then erasure coded into blocks spread across many servers]
To make storage cheaper, we use lots more network!
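The simplest erasure code, XOR parity, illustrates the idea. Azure Storage actually uses more sophisticated codes (Reed-Solomon-style and Local Reconstruction Codes) to tolerate multiple failures at under 1.5x overhead; this sketch only shows how coded blocks let the cluster rebuild a lost block from survivors, at 1.33x overhead for one-failure tolerance versus 3x for triplication.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def encode(data_blocks):
    """k data blocks -> k data blocks + 1 parity block, each of which
    would be placed on a different server."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def reconstruct(stripe, lost_index):
    """Rebuild one lost block by XOR-ing all surviving blocks: this is
    the extra network I/O per storage I/O the slide mentions."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

data = [b"hot ", b"cold", b"blob"]
stripe = encode(data)   # 3 data + 1 parity = 1.33x storage overhead
```

Note the trade the slide calls out: reconstruction reads k surviving blocks over the network to recover one, so cheaper storage is bought with more network traffic.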
31. RDMA: High-Performance Transport for Storage
• Remote DMA primitives (e.g. read address, write address) implemented on-NIC
• Zero copy (the NIC handles all transfers via DMA)
• Zero CPU utilization at 40Gbps (the NIC handles all packetization)
• <2μs E2E latency
• RoCE enables the InfiniBand RDMA transport over an IP/Ethernet network (all L3)
• Enabled at 40GbE for Windows Azure Storage, achieving massive COGS savings by eliminating many CPUs in the rack
All the logic is in the host: software-defined storage now scales with the software-defined network
[Diagram: an application writes its local memory buffer at address A; the NICs DMA the bytes directly into the remote application’s buffer at address B (“Buffer B is filled”), with no CPU involvement on either side]
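The one-sided semantics in the diagram can be modeled in a few lines. This is purely a toy model of the semantics: real code uses InfiniBand verbs (e.g. posting an RDMA write work request naming a remote address and key), and the whole point is that hardware, not software, moves the bytes.

```python
class Nic:
    """Toy model of an RDMA-capable NIC with one registered region."""
    def __init__(self, memory):
        self.memory = memory   # registered memory region (bytearray)

    def rdma_write(self, remote_nic, local_addr, remote_addr, length):
        """One-sided write: the initiator names a remote address and
        the NICs move the bytes; the remote CPU/application is never
        involved, which is what gives zero remote CPU utilization."""
        data = self.memory[local_addr:local_addr + length]
        remote_nic.memory[remote_addr:remote_addr + length] = data

# Two hosts, each with a registered 32-byte buffer.
mem_a = bytearray(32)
mem_b = bytearray(32)
nic_a, nic_b = Nic(mem_a), Nic(mem_b)

mem_a[0:4] = b"data"                    # application fills buffer A
nic_a.rdma_write(nic_b, local_addr=0, remote_addr=8, length=4)
# Buffer B is filled at the named offset, with no remote-side code.
```

Contrast with TCP: there the remote CPU must run the stack, copy into user buffers, and wake the application per transfer; here the receive side does nothing at all.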
38. Result: We Made SDN Scale
• VNET, SLB, ACLs, metering, and more scale to millions of servers
• Tens of thousands of VNETs
• Tens of thousands of gateways
• Hundreds of thousands of VIPs
• 10s of Tbps of LB’d traffic
• Billions of flows…
all in the host!
[Chart: bandwidth served by SLB to a storage cluster over a week, fluctuating between roughly 20Gbps and 40Gbps]
39. Host Networking Makes the Physical Network Fast and Scalable
• Massive, distributed 40GbE network built on commodity hardware
• No hardware per-tenant ACLs
• No hardware NAT
• No hardware VPN / overlay
• No vendor-specific control, management, or data plane
• All policy is in software, and everything’s a VM!
• Network services deployed like all other services
• Battle-tested solutions in Windows Azure are coming to private cloud
[Diagram: racks of 10G servers]
40. We Bet Our Infrastructure on Host SDN, and It Paid Off
• The incremental cost of deploying a new tenant, new VNet, or new load balancer is tiny: everything is in software
• Using scale, we are cheaper and faster than any tenant deployed by an admin on-prem
• Public cloud is the future! Join us!