Making our networking stack truly extensible
1. Making our networking stack
truly extensible
Olivier Bonaventure with
Quentin Deconinck, Cyril Dénos, Fabien Duchêne, Mathieu Jadin, David Lebrun, Francois Michel,
Maxime Piraux, Olivier Tilmans, Hoang Tran Viet, Thomas Wirtgen, Mathieu Xhonneux
http://inl.info.ucl.ac.be
LCN2019 Keynote, October 2019
Partially supported by FNRS, FRIA, MQUIC project (DG06 in cooperation with
Tessares), ARC-SDN and a Facebook grant
2. Agenda
• Evolution of the networking stack
• Making IPv6 Segment Routing programmable
• Making TCP extensible again
• Pluginizing Routing Protocols
• Pluginizing QUIC
4. But deploying TCP extensions
remains very difficult
• 20th century extensions took more than a
decade to be widely deployed
– TCP Window Scale
– TCP Timestamp
• Still not supported by Microsoft Windows
– TCP Selective Acknowledgements
– Explicit Congestion Notification
• Multipath TCP is being deployed, but getting it
everywhere will require lots of effort
6. Tuning such an implementation
• Implementations typically expose a few
configuration knobs
– Socket options to enable/disable a given feature
– Socket options to set some limit (e.g. window)
– Sysctl variables for system-wide tuning
– Linux modules provide some flexibility
• Congestion control as loadable modules
• Path managers in Linux Multipath TCP
7. Agenda
• Evolution of the networking stack
• Making IPv6 Segment Routing programmable
• Making TCP extensible again
• Pluginizing QUIC
• Pluginizing Routing Protocols
8. IPv6 Segment Routing in one slide
• Each router advertises its loopback in IGP
– Packets contain a source route in the SRH and follow the
shortest path to the next address in the SRH
[Figure: example network with routers R1 to R5 and R7 to R9, one link labelled 100; packets carry the segment list 3:7 or 8:4:7:3 in their SRH]
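The forwarding rule on this slide can be sketched in a few lines of C. This is only an illustration: `struct srh_sketch` and `srh_advance` are hypothetical names, the structure is a simplified stand-in for the on-wire SRH, and the reverse ordering of the segment list is assumed to follow RFC 8754.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_SEGMENTS 8

/* Simplified stand-in for the SRH (not the on-wire layout). */
struct srh_sketch {
    uint8_t segments_left;              /* index of the active segment */
    uint8_t last_entry;                 /* index of the last list entry */
    uint8_t segments[MAX_SEGMENTS][16]; /* reverse order: [0] is the final segment */
};

/*
 * Called when a packet reaches the node named by the active segment.
 * Returns 1 if dst now holds the next segment, 0 if no segment is left
 * (the packet has reached its last segment), -1 if the header is malformed.
 */
int srh_advance(struct srh_sketch *srh, uint8_t dst[16])
{
    assert(srh != NULL && dst != NULL);
    if (srh->last_entry >= MAX_SEGMENTS || srh->segments_left > srh->last_entry)
        return -1;
    if (srh->segments_left == 0)
        return 0;
    srh->segments_left--;
    /* The new destination is then reached over the IGP shortest path. */
    memcpy(dst, srh->segments[srh->segments_left], 16);
    return 1;
}
```

For the segment list 3:7, the list would hold R7 at index 0 and R3 at index 1; when R3 processes the packet, the destination becomes R7.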
10. IPv6 Segment Routing
Network Programming
• IPv6 SR enables more than non-shortest paths
– Each node advertises one or more prefixes
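[Figure: routers R2, R4, R5, R7, R8 and R9; a node advertises a prefix in the IGP (IGP : 2001:…:4/40) and binds functions to it (FCT1:param, FCT2:param). An address is split into Locator, Function and Param.]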
C. Filsfils et al., SRv6 Network Programming, draft-filsfils-spring-srv6-
network-programming-03, Dec. 2017
11. Implementing SRv6
Network Programming
• First step
– Add support for IPv6 Segment Routing in Linux
– David Lebrun's PhD thesis
• Second step
– Find a simple way to enable network operators to
truly program their network
• Socket options ?
• Kernel modules ?
• Add eBPF support in Linux's IPv6 Segment Routing
implementation
Lebrun, D., & Bonaventure, O. (2017, July). Implementing IPv6 Segment Routing in
the Linux kernel. In ANRW 2017, ACM.
12. eBPF
• Lightweight virtual machine, in Linux kernel
since 2014
– RISC instruction set (~100)
• ALU, memory and branch purposes
• Bytecode recompiled to native architecture
• Verifier
– Checks absence of loops, stack usage, …
• Dedicated, isolated stack memory
– But no persistence
• Use cases
– Monitoring, SECCOMP, …
[Figure: eBPF bytecode (01011 10010) compiled to native x86_64 code]
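To give an intuition of the "absence of loops" check, here is a toy verifier. It is not the Linux verifier and does not use the real eBPF instruction encoding: `toy_insn`, `toy_op` and `toy_verify` are hypothetical names. It accepts a program only if it ends with an exit instruction and every jump goes forward and stays inside the program, which is enough to guarantee termination for this toy instruction set.

```c
#include <assert.h>
#include <stddef.h>

enum toy_op { TOY_ALU, TOY_LOAD, TOY_STORE, TOY_JUMP, TOY_EXIT };

struct toy_insn {
    enum toy_op op;
    int off; /* for TOY_JUMP: target index = current index + 1 + off */
};

/* Returns 1 if the program is accepted, 0 otherwise. */
int toy_verify(const struct toy_insn *prog, size_t len)
{
    assert(prog != NULL);
    if (len == 0 || prog[len - 1].op != TOY_EXIT)
        return 0; /* execution could run past the end of the program */
    for (size_t i = 0; i < len; i++) {
        if (prog[i].op != TOY_JUMP)
            continue;
        if (prog[i].off < 0)
            return 0; /* backward jump: a loop is possible */
        if ((size_t)prog[i].off >= len - i - 1)
            return 0; /* jump target outside the program */
    }
    return 1;
}
```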
13. Realising Network Programming :
the power of eBPF
[Figure: an application passes eBPF bytecode to the kernel through the bpf syscall; the verifier checks the bytecode before it runs in the eBPF VM; a map is also shown]
M. Xhonneux et al., Leveraging eBPF for programmable network functions
with IPv6 Segment Routing, Proc. CoNEXT 2018
14. eBPF for SRv6
• When are eBPF programs called ?
– Upon reception of a packet whose address in SRH
matches
• Which features of the stack can eBPF programs use ?
– bpf_lwt_seg6_store_bytes
• update parts of SRH
– bpf_lwt_seg6_adjust_srh
• update TLVs in SRH.
– bpf_lwt_seg6_action
• execute basic SRv6 function (End.X, End.T, End.B6, End.B6.Encaps
and End.DT6)
• Each eBPF program returns specific code
– BPF_OK, BPF_DROP, BPF_REDIRECT
16. Demonstrated use cases
• Delay measurements
– Sender timestamps some packets and requests
routers to timestamp and tunnel them as well
• Hybrid Access Networks
– Segments are used to forward packets over
different paths and combine them as one router
• Failure Detection and recovery
– Uses eBPF to implement detection similar to BFD
and a simple fast reroute technique
Xhonneux, Mathieu, and Olivier Bonaventure. "Flexible failure detection and fast reroute
using eBPF and SRv6." CNSM'18, 2018.
17. Agenda
• Evolution of the networking stack
• Making IPv6 Segment Routing programmable
• Making TCP extensible again
• Pluginizing Routing Protocols
• Pluginizing QUIC
18. Debugging TCP performance problems
• Classical approaches
– Collect packet traces and ask Ph.D. student to
analyze them
– Look at SNMP MIB, output of netstat, ss, …
• Limitations
– Either limited visibility or scalability concerns
19. In-protocol debugging with eBPF
• eBPF probes can be attached at specific places
in the TCP stack to observe unusual events
– Retransmission of SYN packet
– Reception of out-of-order packets
– Peak in measured round-trip-time
– Application too slow to recv data from kernel
– …
• Daemon collects stats and sends them via
IPFIX
O. Tilmans, O. Bonaventure, COP2: Continuously Observing
Protocol Performance, Feb. 2019, arxiv 1902.04280v1
22. TCP can be made more extensible
• Starting point
– Lawrence Brakmo's TCP-BPF patches
– Add various hooks inside the TCP stack:
• Callbacks
– BPF_SOCK_OPS_TCP_CONNECT_CB,
BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB,
BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
• Access to socket options
– BPF_SETSOCKOPT and BPF_GETSOCKOPT
• Read and write TCP state variables (rtt, cwnd, …)
– Main use case is to configure TCP parameters
on a per connection basis
Brakmo, L. (2017). TCP-BPF: Programmatically tuning TCP behavior through BPF.
NetDev 2.2.
23. What does TCP-BPF bring ?
• Callbacks
– Triggered by specific events (e.g. protocol messages)
• Helpers
– An API used by eBPF programs
24. User Timeout TCP option
• Defined in RFC 5482 but not supported by Linux
• Sender side changes
– Add eBPF hooks in tcp_transmit_skb and
tcp_options_write
– eBPF code controls the transmission of this option
• Receiver side changes
– Add eBPF hook to tcp_parse_options
– eBPF code interprets the received option and adjusts
TCP state
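The option itself is small. The sketch below encodes and decodes it in plain userspace C, assuming the format of RFC 5482 (kind 28, length 4, one granularity bit followed by a 15-bit timeout). The function names are hypothetical and this is not the eBPF code from the paper, only an illustration of what the sender-side and receiver-side hooks have to produce and interpret.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_UTO     28 /* option kind, per RFC 5482 */
#define TCPOPT_UTO_LEN 4

/* Sender side: write the option; returns the bytes written, or 0 on error. */
size_t uto_write(uint8_t *buf, size_t room, int minutes, uint16_t timeout)
{
    assert(buf != NULL);
    if (room < TCPOPT_UTO_LEN || timeout > 0x7fff)
        return 0;
    /* Granularity bit set: the value is in minutes, otherwise in seconds. */
    uint16_t field = (uint16_t)((minutes ? 0x8000u : 0u) | timeout);
    buf[0] = TCPOPT_UTO;
    buf[1] = TCPOPT_UTO_LEN;
    buf[2] = (uint8_t)(field >> 8);
    buf[3] = (uint8_t)(field & 0xff);
    return TCPOPT_UTO_LEN;
}

/* Receiver side: returns the advertised timeout in seconds, or -1 on error. */
long uto_parse_seconds(const uint8_t *opt, size_t len)
{
    assert(opt != NULL);
    if (len < TCPOPT_UTO_LEN || opt[0] != TCPOPT_UTO || opt[1] != TCPOPT_UTO_LEN)
        return -1;
    uint16_t field = (uint16_t)((opt[2] << 8) | opt[3]);
    long value = field & 0x7fff;
    return (field & 0x8000) ? value * 60 : value;
}
```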
27. Different use cases
• New TCP option to specify initial congestion
window
– Could be sent by an iPhone depending on wireless
conditions
• New TCP option to specify delayed
acknowledgment strategy (delayed ack timer)
• Various improvements to Multipath TCP
– eBPF-based path manager
Tran, Viet-Hoang, and Olivier Bonaventure. "Beyond socket options: making the Linux TCP
stack truly extensible." IFIP Networking, 2019.
28. Agenda
• Evolution of the networking stack
• Making IPv6 Segment Routing programmable
• Making TCP extensible again
• Pluginizing Routing Protocols
• Pluginizing QUIC
29. Difficult to innovate in BGP/OSPF
• How long does it take for ISPs to get new
features in BGP ?
• Example: BGP large communities
– December 2002: draft-lange-flexible-bgp-communities-00
– 2009: First AS with a 32-bit AS number
– September 2016: draft-ietf-idr-large-community-00
– February 2017: RFC 8092
– ~March 2018: Router implementation !
– 9 years before ISPs can use BGP large communities
30. Faster deployment of new routing
features
[Figure: the protocol implementation (with its CLI, SNMP and NetConf interfaces) exposes an API through which plugin bytecode, executed in a VM, accesses the protocol memory: the RIB (route1 → via R2, …, routeN → via R4), internal data structures and the neighbor routers context]
T. Wirtgen et al., “The Case for Pluginized Routing Protocols”, ICNP 2019, Chicago
31. How to safely execute plugins ?
● Userspace eBPF VM
○ Same RISC instruction set (~100) as in Linux
kernel
■ ALU, memory and branch purposes
● Bytecode recompiled to native architecture
● Dedicated, isolated stack memory
○ But no persistence
● Rely on a user-space implementation
○ With relaxed verifier
○ With persistent heap memory
32. Example: Adding Monitoring to BGP
int bgp_update(args)
{
// code
r = decision_process(args);
// ...
// end of function
}
int decision_process(args)
{
// code
// ...
return something;
}
PRE bytecode (runs in a VM before the protocol function):
time_t start = time(NULL);
POST bytecode (runs in a VM after the protocol function):
time_t diff = time(NULL) - start;
33. Example: protocol function
replacement
int bgp_update(args)
{
// code
r = decision_process(args);
// ...
// end of function
}
int decision_process(args)
{
// code
// ...
return something;
}
REPLACE bytecode (runs in a VM instead of the built-in protocol function)
36. Summary : plugin structure
[Figure: a plugin groups PRE, REPLACE and POST bytecodes, each executed in its own VM with a heap, a stack and a context (ctx). PRE and POST have read-only access and REPLACE has write access to the protocol memory (API, RIB, internal data structure, neighbor context); a shared memory area is also shown.]
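The PRE / REPLACE / POST structure can be mimicked in plain C with function pointers. The sketch below is only meant to show the control flow: there is no VM and no bytecode, every name (`protocol_op`, `run_op`, the example hooks) is hypothetical, and the single `plugin_heap` pointer is a simplification of the per-plugin memory shown in the figure. The `const` qualifier on observers stands in for the read-only access of PRE and POST.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBSERVERS 4

struct conn_state { int rto_ms; int rtt_ms; };

/* PRE/POST hooks only observe the state; REPLACE may modify it. */
typedef void (*observer_fn)(const struct conn_state *state, void *heap);
typedef int (*replace_fn)(struct conn_state *state);

struct protocol_op {
    observer_fn pre[MAX_OBSERVERS];
    size_t n_pre;
    replace_fn replace; /* NULL: use the built-in implementation */
    replace_fn builtin;
    observer_fn post[MAX_OBSERVERS];
    size_t n_post;
    void *plugin_heap;  /* memory shared by the hooks of one plugin */
};

int run_op(struct protocol_op *op, struct conn_state *state)
{
    assert(op != NULL && state != NULL && op->builtin != NULL);
    for (size_t i = 0; i < op->n_pre; i++)
        op->pre[i](state, op->plugin_heap);
    replace_fn body = op->replace ? op->replace : op->builtin;
    int ret = body(state);
    for (size_t i = 0; i < op->n_post; i++)
        op->post[i](state, op->plugin_heap);
    return ret;
}

/* Example hooks (arbitrary formulas, for illustration only). */
int builtin_update_rto(struct conn_state *s) { return s->rto_ms = 2 * s->rtt_ms; }
int plugin_update_rto(struct conn_state *s) { return s->rto_ms = s->rtt_ms + 50; }
void count_calls(const struct conn_state *s, void *heap) { (void)s; (*(int *)heap)++; }
```

Attaching `plugin_update_rto` to `replace` changes the behaviour of the operation without touching `run_op` or the built-in code, while any number of observers can still be attached before and after it.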
37. Use case: Flexible BGP filters
• BGP filters are key for ISPs,
– But they need to be written in special languages
C-based filter
uint64_t
filter_routes_from_even_as(bpf_full_args_t *args)
{
as_t a = bpf_get_args(args, 2);
// from even AS → DENY
if (a % 2 == 0) return FILTER_DENY;
// the route is originated from odd AS → ACCEPT
return FILTER_PERMIT;
}
router bgp 64512
bgp router-id 10.236.87.1
neighbor 10.0.0.1 remote-as 64515
neighbor 10.0.0.1 filter-list IN in
!
! IN list accepts routes originated from odd AS only
as-path access-list IN permit ^(.+_+)*(.*)1$
as-path access-list IN permit ^(.+_+)*(.*)3$
as-path access-list IN permit ^(.+_+)*(.*)5$
as-path access-list IN permit ^(.+_+)*(.*)7$
as-path access-list IN permit ^(.+_+)*(.*)9$
as-path access-list IN deny any
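The filter on the slide depends on the plugin API (`bpf_full_args_t`, `bpf_get_args`). A self-contained variant of the same logic, which can be compiled and tested outside a router, could look as follows. The verdict values, the AS path representation and the function name are assumptions made for this sketch; the origin AS is taken to be the last AS of the path, and an empty path is rejected, as the final "deny any" rule would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdict values; the real plugin API defines its own. */
enum filter_verdict { FILTER_DENY = 0, FILTER_PERMIT = 1 };

/* Accept a route only if its origin AS (last AS of the path) is odd. */
enum filter_verdict filter_odd_origin(const uint32_t *as_path, size_t len)
{
    assert(as_path != NULL || len == 0);
    if (len == 0)
        return FILTER_DENY;
    uint32_t origin = as_path[len - 1];
    /* An AS number is odd exactly when its last decimal digit is 1, 3, 5,
       7 or 9, which is what the five regular expressions enumerate. */
    return (origin % 2 == 0) ? FILTER_DENY : FILTER_PERMIT;
}
```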
41. Agenda
• Evolution of the networking stack
• Making IPv6 Segment Routing programmable
• Making TCP extensible again
• Pluginizing Routing Protocols
• Pluginizing QUIC
42. The QUIC revolution
• What are the benefits ?
– Deploy without convincing kernel developers/ SDO
[Figure: two protocol stacks. One: Application over HTTP/2, TLS, TCP and IP. The other: Application over QUIC, UDP and IP.]
43. Pluginized QUIC
• Key ideas
– Include an eBPF VM inside PQUIC to enable it to
be dynamically extended with bytecode
– Expose a richer set of callback functions and
helpers than inside TCP
– Leverage QUIC's flexible packet format to support
a wide range of extensions
– Leverage QUIC's multistream and security features
to allow clients and servers to exchange bytecode
over QUIC connections
Q. Deconinck et al., “Pluginized QUIC”, SIGCOMM’19, August 2019, Beijing
44. Exchanging plugins
First connection
– Client: Initial: Client Hello - “Hey, I support multipath”
– Server: Initial: Server Hello - “I want to inject monitoring” (server: “Let’s monitor the client state.”)
– Client: Encrypted - PLUGIN_REQUEST(monitoring) (client: “Let’s request monitoring”)
– Server: Encrypted - PLUGIN(monitoring), carrying the bytecode
...
45. Exchanging plugins
Next connections
– Client: Initial: Client Hello - “Hey, I support multipath and monitoring”
– Server: Initial: Server Hello - “Let’s use monitoring”
– Encrypted - STREAM, STAT(info about RTT, reordering,...), where the STAT frame is added by the monitoring plugin
– Encrypted - STREAM
...
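The difference between the first and the next connections boils down to a lookup in the local plugin cache. A minimal sketch of that decision, with hypothetical names (`plugin_decide`, `plugin_action`) and plugins identified by plain strings, which is an assumption of this example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum plugin_action { PLUGIN_USE, PLUGIN_REQUEST_BYTECODE };

/*
 * Decide what to do when the peer proposes the plugin `wanted`:
 * use it directly if it is already cached locally, otherwise
 * ask the peer to send the bytecode (PLUGIN_REQUEST on the slide).
 */
enum plugin_action plugin_decide(const char *const *cached, size_t n_cached,
                                 const char *wanted)
{
    assert(wanted != NULL);
    assert(cached != NULL || n_cached == 0);
    for (size_t i = 0; i < n_cached; i++) {
        if (strcmp(cached[i], wanted) == 0)
            return PLUGIN_USE;
    }
    return PLUGIN_REQUEST_BYTECODE;
}
```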
46. Very Different Use Cases
● Monitoring
● A QUIC VPN
● Multipath
● Forward Erasure Correction
See our SIGCOMM’19
paper for more details
Plugin                       Lines of C code   Number of bytecodes
Monitoring                   500               14
QUIC VPN                     500               11
Multipath                    2600              32
Forward Erasure Correction   2500              51
47. Use case : Monitoring
• Plugin
– Collect statistics about various events in the QUIC
stack
• bytes/packets sent/received, lost, received out-of-
order, etc.
– Exports data to a monitoring server, but could also
transmit them over QUIC connection
– Passive: pluglets are only attached to pre or post hooks
Plugin size: 500 lines of C code, 14 pluglets, 86 Kbytes of bytecode
48. Use case : Multipath QUIC
• Plugin
– Supports our proposed Multipath QUIC draft
• Connection id and path id, address advertisement
– Includes path manager, packet scheduler (round
robin and lowest RTT) as in MPTCP
– Provides performance similar to MPTCP
Plugin size: 2600 lines of C code, 32 pluglets, 138 Kbytes of bytecode
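The two packet schedulers mentioned on this slide can be sketched as follows. This is a simplified illustration with hypothetical names (`path_sketch`, `pick_lowest_rtt`, `pick_round_robin`); a real scheduler would also take the congestion window of each path into account, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct path_sketch {
    int usable;       /* non-zero if packets can be sent on this path */
    uint32_t srtt_us; /* smoothed RTT of the path */
};

/* Lowest RTT: index of the usable path with the smallest RTT, or -1. */
int pick_lowest_rtt(const struct path_sketch *paths, size_t n)
{
    assert(paths != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!paths[i].usable)
            continue;
        if (best < 0 || paths[i].srtt_us < paths[best].srtt_us)
            best = (int)i;
    }
    return best;
}

/* Round robin: next usable path starting at *cursor, or -1 if none. */
int pick_round_robin(const struct path_sketch *paths, size_t n, size_t *cursor)
{
    assert(cursor != NULL);
    assert(paths != NULL || n == 0);
    for (size_t k = 0; k < n; k++) {
        size_t i = (*cursor + k) % n;
        if (paths[i].usable) {
            *cursor = (i + 1) % n; /* resume after the chosen path next time */
            return (int)i;
        }
    }
    return -1;
}
```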
49. Use case: Forward Erasure Correction
• Objective
– Encode packets so that losses can be recovered at
the receiver without waiting for retransmissions
• Plugin
– Adds new frame to carry Repair Symbols
– Supports XOR and Random Linear Code (RLC)
• Complex computations are required
Plugin size: 2500 lines of C code, 51 pluglets, 236 Kbytes of bytecode
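The XOR code is the simplest of the two schemes: one repair symbol protects a group of packets and allows the receiver to rebuild a single lost packet. The sketch below illustrates the principle on equal-length buffers; the function names are hypothetical and it says nothing about how the plugin frames or schedules repair symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sender: the repair symbol is the XOR of all n source packets. */
void fec_xor_repair(const uint8_t *const *pkts, size_t n, size_t len,
                    uint8_t *repair)
{
    assert(pkts != NULL && repair != NULL);
    memset(repair, 0, len);
    for (size_t i = 0; i < n; i++)
        for (size_t b = 0; b < len; b++)
            repair[b] ^= pkts[i][b];
}

/*
 * Receiver: lost packets are marked with NULL. If exactly one packet is
 * missing, rebuild it in `out` and return its index; otherwise return -1.
 */
int fec_xor_recover(const uint8_t *const *pkts, size_t n, size_t len,
                    const uint8_t *repair, uint8_t *out)
{
    assert(pkts != NULL && repair != NULL && out != NULL);
    int missing = -1;
    for (size_t i = 0; i < n; i++) {
        if (pkts[i] != NULL)
            continue;
        if (missing >= 0)
            return -1; /* XOR can only recover a single loss */
        missing = (int)i;
    }
    if (missing < 0)
        return -1; /* nothing to recover */
    memcpy(out, repair, len);
    for (size_t i = 0; i < n; i++)
        if (pkts[i] != NULL)
            for (size_t b = 0; b < len; b++)
                out[b] ^= pkts[i][b];
    return missing;
}
```

Recovering from several losses in the same group requires a stronger code such as RLC, which is where the complex computations mentioned above come from.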
51. Security and safety concerns
• PQUIC relies on several techniques to ensure
safety of plugins
• Plugins are isolated from PQUIC and each other
– The eBPF VM's JIT adds code to validate memory accesses
• We propose a system to certify plugins
– Manual certification like applications in a store
– Tool-assisted certification
• We have successfully used tools to prove termination
• Future work required to develop tools to verify more specific
properties than termination
– Cryptographic certificates are attached to plugins and
can be validated before injecting them
52. Conclusions …
• eBPF-based Protocol plugins bring benefits to
various protocols
– IPv6 Segment Routing for network programmability
– TCP to collect accurate measurement data, implement
new options, update key algorithms
– BGP with more flexible eBPF filters, OSPF for new LSAs
• Pluginized QUIC goes one step further by
exchanging eBPF plugins over QUIC connections
– Makes the protocol truly extensible
53. … next steps
• How to redesign network protocols to completely
leverage plugins ?
– A more efficient virtual machine
• WebAssembly, improved eBPF, other ?
– A simple base protocol that provides a clean API
• Similar to microkernels, offload more complex or less
frequently used functions to plugins
– Interoperable independent implementations
• The same plugin should work on different implementations
– Tools and techniques to validate plugins
• Not only termination, but other types of automated proofs
Editor's Notes
Something other than what the IETF proposes
context, memory, API, etc.
The idea of Pluginized QUIC is to revisit how protocol implementations are structured
In particular, the transport protocol is now viewed as a set of basic functions which can now be easily mapped to implementation methods
In this canvas, we can insert plugins, which are a set of modified or added protocol functions
For instance, in the base implementation, we find operations for RTO computations, for the preparation of header or to handle new data from the application.
In PQUIC, a plugin can change the algorithms, for instance the RTO computation, or even add new functions such as the support of unreliable messages.
Our PQUIC provides dynamic per-connection customization allowing different algorithms on different connections, thanks to the isolation between them.
Now that we have a mechanism to exchange plugins, what do PQUIC implementations need to run the bytecode?
Now, to exemplify how protocol operations appear in PQUIC, consider the following situation where a host receives a packet.
In QUIC and thus PQUIC, a packet is composed of a small cleartext header and an encrypted payload containing the frames.
So first, the host needs to process the incoming packet, with potentially decryption.
Then it needs to parse the packet header, and all of the frames contained in the packet.
While receiving the ACK frame, the host can estimate the latency of the network and update its retransmit timer.
All these operations are base operations of the protocol, which thus map into implementation functions provided by the core of PQUIC.
What if we want to recompute the retransmit timer? Just attach a VM at that place.
How to handle the reception of STAT frames used to exchange monitoring information? Just create a new protocol operation to process the STAT frame and attach a VM.
How can we insert the code here? Actually, the implementation of an operation is placed in a hook called REPLACE, which refers by default to the built-in implementation.
If we want to change this code, we can simply attach a VM to this hook that will replace the behavior of the protocol operation.
The code inserted at that hook has full read/write access to the connection state.
For the monitoring use case, we want to see how the code works, so we only look at the arguments and outputs of the protocol operations.
The PRE and POST hooks serve that purpose; VMs can be attached at those points.
Those hooks have read-only access to the connection state, which enables multiple VMs, possibly for different purposes, to observe the operations.
Materialize OR
In the monitoring case, the inserted observers and the code that writes STAT frames need to communicate: they use shared heap memory.
A plugin for another use case gets its own, isolated memory.