The Docker overlay network driver relies on several technologies: network namespaces, VXLAN, Netlink and a distributed key-value store. This talk will present each of these mechanisms one by one, along with their userland tools, and show hands-on how they interact when setting up an overlay to connect containers. The talk will continue with a demo showing how to build your own simple overlay using these technologies. Finally, it will show how we can dynamically distribute IP and MAC information to every host in the overlay using BGP EVPN.
6. What is VXLAN?
• Tunneling technology over UDP (L2 in UDP)
• Developed for cloud SDNs to provide multi-tenancy
• Without the need for L2 connectivity between hosts
• Without the usual VLAN limit (4096 VLAN IDs)
• Easy to encrypt: IPsec
• Overhead: 50 bytes
• In Linux
Started with Open vSwitch
Native with kernel >= 3.7 (>= 3.16 for namespace support)
Packet format: Outer IP | Outer UDP (dst port 4789) | VXLAN header | Original L2 frame (see the capture sketch below)
VXLAN: Virtual eXtensible LAN
VNI: VXLAN Network Identifier
VTEP: VXLAN Tunnel Endpoint
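To see this encapsulation on the wire, one can capture the VXLAN UDP port on the underlay (a minimal sketch; eth0 as the underlay interface is an assumption):
sudo tcpdump -ni eth0 udp port 4789    # recent tcpdump versions decode the VXLAN header and show the inner frame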
9. Creating the overlay namespace
ip netns add overns                                          # create overlay NS
ip netns exec overns ip link add dev br42 type bridge        # create bridge in NS
ip netns exec overns ip addr add dev br42 192.168.0.1/24
ip link add dev vxlan42 type vxlan id 42 proxy dstport 4789  # create VXLAN interface
ip link set vxlan42 netns overns                             # move it to NS
ip netns exec overns ip link set vxlan42 master br42         # add it to bridge
ip netns exec overns ip link set vxlan42 up                  # bring all interfaces up
ip netns exec overns ip link set br42 up
setup_vxlan script
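A quick way to verify the result (assumed check commands, not part of the deck):
ip netns exec overns ip -d link show vxlan42   # should report: vxlan id 42 ... dstport 4789 ... proxy
ip netns exec overns bridge link show          # should show vxlan42 attached to br42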
11. Attaching containers (commands on docker0)
docker run -d --net=none --name=demo debian sleep infinity                        # create container without net
ctn_ns_path=$(docker inspect --format="{{ .NetworkSettings.SandboxKey}}" demo)    # get NS for container
ctn_ns=${ctn_ns_path##*/}
ip link add dev veth1 mtu 1450 type veth peer name veth2 mtu 1450                 # create veth pair
ip link set dev veth1 netns overns                                                # send veth1 to overlay NS
ip netns exec overns ip link set veth1 master br42                                # attach it to overlay bridge
ip netns exec overns ip link set veth1 up
ip link set dev veth2 netns $ctn_ns                                               # send veth2 to container
ip netns exec $ctn_ns ip link set dev veth2 name eth0 address 02:42:c0:a8:00:10   # rename & configure
ip netns exec $ctn_ns ip addr add dev eth0 192.168.0.10/24
ip netns exec $ctn_ns ip link set dev eth0 up
On docker1: same with 192.168.0.20 / 02:42:c0:a8:00:20
plumb script
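A sanity check from the host (an assumed step, reusing the namespace handle extracted above):
ip netns exec $ctn_ns ip addr show eth0    # expect 192.168.0.10/24 and link/ether 02:42:c0:a8:00:10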
12. Does it ping?
docker0:~$ docker exec -it demo ping 192.168.0.20
PING 192.168.0.20 (192.168.0.20): 56 data bytes
92 bytes from 192.168.0.10: Destination Host Unreachable
docker0:~$ sudo ip netns exec overns ip neighbor show        # empty: no ARP entry for 192.168.0.20
docker0:~$ sudo ip netns exec overns ip neighbor add 192.168.0.20 lladdr 02:42:c0:a8:00:20 dev vxlan42
docker0:~$ sudo ip netns exec overns bridge fdb add 02:42:c0:a8:00:20 dev vxlan42 self dst 10.0.1.10 vni 42 port 4789
docker1: Same with 192.168.0.10, 02:42:c0:a8:00:10 and 10.0.0.10
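After these entries are in place the ping should succeed; the entries themselves can be verified (assumed check, not on the original slide):
docker0:~$ sudo ip netns exec overns ip neighbor show             # should list 192.168.0.20 lladdr 02:42:c0:a8:00:20 PERMANENT
docker0:~$ sudo ip netns exec overns bridge fdb show dev vxlan42  # should list 02:42:c0:a8:00:20 dst 10.0.1.10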
15. VXLAN Control Plane options - 1: Multicast
Use a multicast group (239.x.x.x) to send traffic for unknown L3/L2 addresses
ARP: who has 192.168.0.2?
L2 discovery: where is 02:42:c0:a8:00:02?
PROS: simple and efficient
CONS: multicast connectivity is not always available (on public clouds for instance)
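A minimal sketch of a multicast-based VXLAN interface (the group address 239.1.1.1 and underlay interface eth0 are assumptions):
ip link add dev vxlan42 type vxlan id 42 group 239.1.1.1 dev eth0 dstport 4789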
16. VXLAN Control Plane options - 2: Point-to-point
Configure a remote IP address where to send traffic for unknown addresses (everything goes to the remote VTEP)
PROS: simple, no need for multicast, very good for two hosts
CONS: difficult to manage with more than 2 hosts
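A sketch of the point-to-point variant, assuming the peer host's underlay address is 10.0.1.10 as in the demo:
ip link add dev vxlan42 type vxlan id 42 remote 10.0.1.10 dstport 4789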
17. VXLAN Control Plane options - 3: User-land
Do nothing in the kernel; provide ARP / FDB information from outside (a daemon on each host modifies the ARP and FDB tables)
ARP: do you know 192.168.0.2?
L2: where is 02:42:c0:a8:00:02?
PROS: very flexible
CONS: requires a daemon and a centralized database of addresses
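In practice such a daemon would run the same commands we typed manually on slide 12; a hypothetical sketch, where $IP, $MAC and $VTEP come from the centralized database:
ip netns exec overns ip neighbor replace $IP lladdr $MAC dev vxlan42
ip netns exec overns bridge fdb replace $MAC dev vxlan42 self dst $VTEP vni 42 port 4789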
21. VXLAN Control Plane options - 4: BGP EVPN
Rely on the BGP EVPN address family to distribute L2 and L3 data
Endpoint data is distributed with BGP (a bgpd daemon runs next to each VXLAN endpoint)
PROS: BGP is a standard for distributing addresses, supported by SDN vendors
CONS: limited Linux implementations, requires some BGP knowledge
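Since BGP populates the FDB, kernel source-MAC learning can be disabled on the VXLAN interface; a sketch (the nolearning flag reappears later in the gateway setup):
ip link add dev vxlan42 type vxlan id 42 dstport 4789 nolearning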
22. BGP in one slide
● Routing Protocol between network entities ("Autonomous Systems", AS)
Google ASN: 15169 / Amazon ASN: 16509
(both actually have more than one)
● BGP is an EGP: Exterior Gateway Protocol
IGP: Interior Gateway Protocol (OSPF, EIGRP, IS-IS)
IGP: next hop is the IP of a router
BGP: next hop is an Autonomous System
● BGP is what makes the Internet work
● BGP scales very well
500 000+ prefixes for a full Internet table
23. A quick BGP example
AS1 announces 20.0.0.0/16; each AS propagates the route, prepending itself to the path:
20.0.0.0/16: AS1 (learned by AS1's direct peers)
20.0.0.0/16: AS2-AS1 and 20.0.0.0/16: AS4-AS1 (one hop further)
20.0.0.0/16: AS5-AS4-AS1
When several paths exist, the shortest AS path wins
AS: Autonomous System
eBGP: external (between different ASes)
iBGP: internal (within the same AS)
24. iBGP: distributing BGP information within an Autonomous System
iBGP requires a full mesh between all peers
n peers => n * (n-1) / 2 connections
50 peers => 1225 connections (49 per host)
Route-reflectors (RR) simulate the mesh
More scalable and simpler
Possible to have more than one RR
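For reference, a hypothetical Quagga route-reflector snippet (AS 65000 and the docker0 address are borrowed from the demo):
router bgp 65000
 neighbor 10.0.0.10 remote-as 65000
 neighbor 10.0.0.10 route-reflector-client   ! reflect routes between iBGP clients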
25. BGP EVPN
● Part of MP-BGP (multi-protocol BGP: not only IP prefixes)
● Announce VXLAN information instead of IP prefixes
L3: IP addresses of VXLAN endpoints (VTEP)
L2: Location of MAC addresses
● BUM (Broadcast, Unknown unicast, Multicast) traffic is unicasted to all VTEPs
● Get the scalability of BGP
28. What we have so far
(Diagram: underlay 10.0.0.0/16; docker0 at 10.0.0.10 and docker1 at 10.0.1.10, each running a quagga container attached to eth0; two route reflectors, quagga-rr RR1 at 10.0.0.5 and quagga-rr RR2 at 10.0.1.5)
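The host-side configuration might look like this sketch, assuming Cumulus's EVPN-capable Quagga (advertise-all-vni announces every local VNI; the exact config is not shown on the slide):
router bgp 65000
 neighbor 10.0.0.5 remote-as 65000
 neighbor 10.0.1.5 remote-as 65000
 address-family evpn
  neighbor 10.0.0.5 activate
  neighbor 10.0.1.5 activate
  advertise-all-vni
 exit-address-family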
29. Let's look at the BGP data
docker0:~$ docker exec -it quagga vtysh
docker0# show run
docker0# show bgp neighbors
docker0# show bgp evpn summary
BGP router identifier 10.0.0.10, local AS number 65000 vrf-id 0
Peers 2, using 42 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
quagga0(10.0.0.5) 4 65000 42 43 0 0 0 00:02:01 0
quagga1(10.0.1.5) 4 65000 42 43 0 0 0 00:02:01 0
docker0# show bgp evpn route
No EVPN prefixes exist
31. Let's look at the BGP data
docker0:~$ docker exec -it quagga vtysh
docker0# show bgp evpn route
BGP table version is 0, local router ID is 10.0.0.10
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 10.0.0.10:1
*> [3]:[0]:[32]:[10.0.0.10]
10.0.0.10 32768 i
Route Distinguisher: 10.0.1.10:1
*>i[3]:[0]:[32]:[10.0.1.10]
10.0.1.10 0 100 0 i
docker0# show evpn mac vni all
33. What about BGP?
docker0:~$ docker exec -it quagga vtysh
docker0# show bgp evpn route
BGP table version is 0, local router ID is 10.0.0.10
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
Route Distinguisher: 10.0.1.10:1
*>i[2]:[0]:[0]:[48]:[02:42:c0:a8:00:20]
10.0.1.10 0 100 0 i
* i[3]:[0]:[32]:[10.0.1.10]
10.0.1.10 0 100 0 i
docker0# show evpn mac vni all
VNI 42 #MACs (local and remote) 2
MAC Type Intf/Remote VTEP VLAN
02:42:c0:a8:00:10 local veth0pldemo
02:42:c0:a8:00:20 remote 10.0.1.10
35. What's interesting about this setup?
● Standard VXLAN address distribution (used on many routers)
● Full management of BUM traffic
ARP queries
Broadcasts (DHCP)
Multicast (Discovery, keepalived)
● BUM traffic is unicasted (not efficient)
Possible optimizations: ARP suppression (Cumulus Quagga)
38. What about BGP? (after adding a second overlay, VNI 66)
docker0:~$ docker exec -it quagga vtysh
docker0# show evpn vni
Number of VNIs: 2
VNI VxLAN IF VTEP IP # MACs # ARPs # Remote VTEPs
42 vxlan42 0.0.0.0 2 0 1
66 vxlan66 0.0.0.0 2 0 1
docker0# show evpn mac vni all
VNI 42 #MACs (local and remote) 2
MAC Type Intf/Remote VTEP VLAN
02:42:c0:a8:00:10 local veth0pldemo
02:42:c0:a8:00:20 remote 10.0.1.10
VNI 66 #MACs (local and remote) 2
MAC Type Intf/Remote VTEP VLAN
02:42:c0:a8:66:10 local veth0pldemo66
02:42:c0:a8:66:20 remote 10.0.1.10
42. Getting out of our Docker environment
gateway0:~$ ./setup_vxlan 42 host dstport 4789 nolearning
gateway0:~$ ip link add dev vethbr type veth peer name vethgw
gateway0:~$ ip link set vethbr master br42
gateway0:~$ ip link set vethbr up
gateway0:~$ ip link set vethgw up
gateway0:~$ ip addr add 192.168.0.1/24 dev vethgw
gateway0:~$ ping 192.168.0.10
PING 192.168.0.10 (192.168.0.10): 56 data bytes
64 bytes from 192.168.0.10: icmp_seq=0 ttl=47 time=0.866 ms
(Diagram: on gateway0, br42 bridges vxlan42 and vethbr; vethgw, the other end of the veth pair, carries 192.168.0.1)
46. What could a real-life setup look like?
(Diagram: Docker hosts, each running quagga, peer over BGP EVPN with two route reflectors RR1 and RR2; a BGP/EVPN router bridges the VXLAN overlay and standard hosts, importing routes from the non-VXLAN infrastructure and exporting routes to the VXLAN networks)
47. How does it compare to other solutions?
Solution              Data plane           Control plane
Swarm Classic         VXLAN                External KV store (Consul / Etcd)
SwarmKit              VXLAN                SwarmKit (Raft / Gossip implementation)
Flannel host-gw       Routing              Etcd / Kubernetes API
Flannel VXLAN         VXLAN                Etcd / Kubernetes API
Calico                Routing / IPIP       Etcd / BGP (IP prefixes)
Weave Classic         Custom               Custom
Weave Fast Datapath   VXLAN                Custom
Contiv                VXLAN, Routing, L2   Etcd / BGP (IP and maybe eVPN)
Disclaimer: almost no hands-on experience with these solutions (based mostly on documentation and discussions)
48. Perspectives
● FRRouting
Quagga fork
Cumulus has switched to FRRouting and merged EVPN support
● Open vSwitch
Alternative to the Linux native bridge and VXLAN implementations
(Possibly) better performance and more features
Not sure how Quagga/FRRouting would integrate with Open vSwitch
● Performance
Measure the impact of VXLAN
Test VXLAN acceleration when available on NICs
● CNI plugin (to test on Kubernetes, mostly for learning purposes)