5. Schedule
- Online at https://ceph.io/cephdays
- Highlights:
- Short break at 10:45
- Lunch at 12:00
- Short break at 15:15
- Pub (the Gallery, 1st floor) at 17:00
7. Who am I?
- Wido den Hollander (1986)
- Dutch, living in the Netherlands
- Ceph, CloudStack and IPv6 guru
- Open Source lover!
- Ceph Trainer and Consultant
- CTO at 42on
8. 42on
- Founded the company in 2012
- Company specialized in Ceph
- Consultancy
- Training
- Emergency Assistance
- https://42on.com/
9. PCextreme
- Co-founded this Dutch hosting company in 2004
- Traditionally a domain + web hosting company
- Transitioned into providing infrastructure for many other hosting companies
- Running cloud deployment with Ceph and CloudStack
- Using the KVM hypervisor
- Numbers:
- 80k customers
- >100k domain names
- ~20k Virtual Machines
- ~3PB of Ceph storage
10. Layer 2 networks
- When trying to scale a network, Layer 2 becomes difficult
- Redundancy is a problem
- Reliability is a challenge
- L2 works great in smaller scale networks
- We are trying to eliminate them on our cloud deployment
- Usually a 4096 (4k) limit on the total number of VLANs
11. Layer 3 networks are better!
- More flexible
- More reliable
- More scalable
- They are just better! (imho)
12. Compute 2.0
● Initiated at the end of 2018
● Existing Cloud(Stack) and Ceph clusters were based on Layer 2 networking
● Goals:
○ Increase flexibility
○ Better scalability (more virtual networks)
■ Less (or no) Layer 2 networking
○ As Open Source as it can be
○ Underlying network should be IPv6-only!
15. VXLAN
"Virtual Extensible LAN (VXLAN) is a network virtualization technology that
attempts to address the scalability problems associated with large cloud
computing deployments. It uses a VLAN-like encapsulation technique to
encapsulate OSI layer 2 Ethernet frames within layer 4 UDP datagrams, using
4789 as the default IANA-assigned destination UDP port number."
16. VXLAN
Ethernet frames encapsulated inside UDP
● This means you can route them through an existing Layer 3 network
● MTU of underlying network needs to be >1580
○ We use 9216
● VXLAN segments are identified by a VNI (VXLAN Network Identifier)
○ You can create 16M VNIs!
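As a quick illustration of what this looks like on a Linux host, a minimal sketch (the VNI and source address are made-up examples; the MTU values match the ones mentioned in these slides):

# VNI 100, encapsulated in UDP towards port 4789, sourced from this host's address
ip link add vxlan100 type vxlan id 100 dstport 4789 local 2001:db8:601:2::7
# An inner MTU of 9000 still fits within the 9216 MTU of the underlay
ip link set dev vxlan100 mtu 9000 up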
17. VXLAN & Multicast
● Default VXLAN uses Multicast to exchange VNI, IP and MAC information with
other hosts
● Multicast has scalability problems
● Difficult to route Multicast across different Layer 2 segments
18. The solution: VXLAN+BGP+EVPN
● Use the Border Gateway Protocol (BGP) for exchanging VNI, IP and MAC
information between hosts
● Allows scaling VXLAN over multiple Layer 3 segments and beyond
○ Multi-DC if required
● Good blog: https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn
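In FRR this boils down to activating the l2vpn evpn address-family on the BGP sessions; a minimal sketch with an example AS number and interface name (not our exact configuration):

vtysh -c 'conf t' \
      -c 'router bgp 4200100123' \
      -c ' neighbor enp81s0f0 interface remote-as external' \
      -c ' address-family l2vpn evpn' \
      -c '  neighbor enp81s0f0 activate' \
      -c '  advertise-all-vni'

advertise-all-vni makes FRR advertise all locally configured VNIs (and the MAC/IP entries behind them) over BGP EVPN, so remote VTEPs are learned via BGP instead of multicast.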
19. VXLAN & EVPN
Ethernet VPNs (EVPNs) enable you to connect groups of dispersed customer
sites using Layer 2 virtual bridges, and Virtual Extensible LANs (VXLANs) allow
you to stretch Layer 2 connectivity over an intervening Layer 3 network, while
providing network segmentation like a VLAN, but without the scaling limitation of
traditional VLANs.
Source: Juniper.net
21. VXLAN & CloudStack
● Properly supported since the 4.12 release
○ We (PCextreme) pushed multiple commits to polish and improve support
● BGP+EVPN only supported with customized script
○ https://github.com/PCextreme/cloudstack/blob/vxlan-bgp-evpn/scripts/vm/network/vnet/modifyvxlan.sh
● Configuring FRRouting (BGP) is not done by CloudStack
○ FRR learns new VNIs automatically when CloudStack spawns a new VM
● Security Grouping (Firewalling) works
○ With IPv6 support!
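The linked script is the real implementation; roughly speaking, the per-VNI plumbing comes down to something like this (names and values below are illustrative):

VNI=1000
# VTEP without learning: MAC/IP/VTEP information comes from BGP EVPN, not flooding
ip link add vxlan${VNI} type vxlan id ${VNI} dstport 4789 local 2001:db8:601:2::7 nolearning
ip link add brvx-${VNI} type bridge
ip link set vxlan${VNI} master brvx-${VNI}
ip link set vxlan${VNI} up
ip link set brvx-${VNI} up
# CloudStack/libvirt then plugs the VM's tap interface into brvx-${VNI};
# FRR picks up the new VNI automatically and starts advertising it via EVPN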
22. Cumulus Linux
● Debian-based Linux operating system for ONIE switches
○ "Open networking software for the modern data center"
● Support for
○ VXLAN
○ BGP
○ EVPN
● You can run your own (Linux) scripting/tooling on the switch!
○ We run the Salt Minion for provisioning on these switches
23. Dell S5232F-ON
● 32x100Gbit switch with ONIE support
○ Atom CPU
○ 4GB Memory
● Modern Broadcom Trident3 chipset
○ VXLAN offloading in the chipset
○ This does all the actual heavy lifting. Cumulus just programs the chipset
● Affordable
○ ~EUR 3,500 (ex VAT) for a single switch
● Supported by Cumulus Linux
● We use them for two purposes
○ Core routers for this cloud deployment
○ Top-of-Rack for Hypervisors
27. Ceph OSD
● 1U SuperMicro with a single AMD EPYC 7351P
○ 1113S-WN10RT
● 128GB DDR4 Memory
● 10x Samsung 4TB NVMe
○ PM983 SSDs
● 2x 10Gbit to top of rack
● Affordable
○ ~EUR 8,500 for a single system
29. Networking on each host
● Hypervisors and Ceph servers run BGP on the host
○ We are using FRRouting
● No LACP, bonding or teaming towards servers
○ This means no Layer 2 to hosts!
● All hosts announce their IPv6 loopback address through BGP
○ This is a /128 route
● Jumbo Frames enabled on all hosts
○ MTU is set to 9000
○ IPv6 Path MTU Discovery makes sure this also works towards the internet
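A per-host sketch of the above (address and interface names are examples):

# Loopback address, announced to the ToR switches via BGP as a /128
ip -6 addr add 2001:db8:601:2::7/128 dev lo
# Jumbo frames on both uplinks
ip link set enp81s0f0 mtu 9000
ip link set enp81s0f1 mtu 9000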
31. BGP Unnumbered
● Each host creates a BGP session with both Top-of-Rack switches
● Using BGP Unnumbered, the BGP sessions are created over IPv6 link-local addresses
○ No need to manually assign IP(v6) addresses to hosts and switches
○ VXLAN and IPv4 traffic is routed over IPv6
○ Super easy to scale!
BGP neighbor on enp81s0f0: fe80::225:90ff:feb2:bcdf, remote AS 4200100002, local AS 4200100123, external link
Hostname: tor-138-a05-46
..
..
Local host: fe80::ba59:9fff:fe20:6e22, Local port: 39020
Foreign host: fe80::225:90ff:feb2:bcdf, Foreign port: 179
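A minimal frr.conf sketch that matches the session shown above, with two unnumbered uplinks (AS number, interface names and address are examples):

cat > /etc/frr/frr.conf <<'EOF'
router bgp 4200100123
 neighbor enp81s0f0 interface remote-as external
 neighbor enp81s0f1 interface remote-as external
 address-family ipv6 unicast
  network 2001:db8:601:2::7/128
  neighbor enp81s0f0 activate
  neighbor enp81s0f1 activate
 exit-address-family
EOF
# (bgpd also needs to be enabled in /etc/frr/daemons)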
33. Scaling
● We scale by expanding and building multiple Ceph clusters
○ We choose NOT to grow a Ceph cluster larger than a single 19" rack
○ This way we prevent our whole cloud environment from going down due to an issue with a single Ceph cluster
○ CloudStack supports multiple Ceph clusters
● Layer 2 (VLAN) scaling is no longer a problem!
○ We can add as many hypervisors as we think CloudStack (safely) supports
○ No need to stretch VLANs over multiple racks
● Hypervisors can be spread out over multiple racks
○ They consume about 0.5kW each. Maximum per rack is ~5kW
34. Scaling
● With VLANs we had to be conservative
● Now we can provide isolated networks for customers on request
○ We can route their own IP-space
○ Additional traffic policies can be set
35. Ceph clusters
● One cluster per rack
● 3 MONs
● As many OSD machines as we can safely fit
○ Power consumption restrictions (5~6kW) apply
● ~650TB of RAW Ceph storage per rack
○ 100% NVMe storage
○ We are considering using 8TB NVMe drives instead of 4TB
● If needed, we could deploy an HDD-only Ceph cluster for slow bulk storage
36. Performance
● Excellent!
● We can saturate the 100Gbit links
● We achieve a ~0.8ms write latency
○ 4k write size
○ 3x replication
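Numbers like these can be reproduced with rados bench; a sketch (the pool name is an example, results depend on hardware and cluster load):

# 30 seconds of 4k writes with a single outstanding op against pool 'rbd'
rados bench -p rbd 30 write -b 4096 -t 1

The average latency it reports is a good proxy for the per-write latency mentioned above.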
37. Ceph and BGP
● Just works
○ Ceph only wants a working Layer 3 route to other hosts
● Need to define public_addr in ceph.conf
○ public_addr = 2001:db8:601:2::7
● No further configuration needed for Ceph
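For an IPv6-only cluster like this, the relevant ceph.conf snippet stays small; a sketch using the address from the slide (section placement and the ms_bind_* options are illustrative and may already match your defaults):

cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
# bind the messengers to IPv6 only
ms_bind_ipv4 = false
ms_bind_ipv6 = true

[osd]
# the host's loopback address that is announced via BGP
public_addr = 2001:db8:601:2::7
EOF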
38. Conclusion
● Scalable networking solution without the use of VLANs
● We can easily create Virtual Networks for customers on our CloudStack
deployment(s)
○ Announce their (public) IP-space (v4/v6) and route it to their VMs
● High performance Ceph environment providing reliable and fast storage
○ Risks mitigated by creating multiple Ceph clusters