High Performance Networking Using eBPF, XDP, and io_uring
Bryan McCoid, Sr. Software Engineer at Couchbase
About me
Bryan McCoid, Sr. Software Engineer
■ I work on the clustering/control plane component at Couchbase
■ Into all things Linux, eBPF, and networking
■ I live in Santa Cruz, CA -- so I literally enjoy long walks on the beach
■ This talk is not based on Couchbase product(s) or technology
● It's based on personal interest and open source work
So who cares about networking anyway?!
Why do we even care about this? It all sounds very boring and I just want to write my application logic..
The answer I've found most compelling:
■ Most (interesting) high-performance applications are I/O bound
■ Backend servers and databases all tend to be bound by either networking or disk performance
■ … so that's why I'm going to focus on networking
Some history...
So what are our options?
Most people deal with networking in Linux at the socket level:
■ Normal, blocking sockets
● 1 thread blocking on every I/O syscall
■ Async sockets (epoll + nonblocking sockets, io_uring) -- see the sketch below
● epoll: register sockets and get notified when they are ready
● io_uring: the new hotness
■ Low level: AF_PACKET
■ Kernel bypass:
● DPDK - Data Plane Development Kit (Intel) - Link
● Netmap - "the fast packet I/O framework" - Link
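
To make the epoll bullet concrete, here is a minimal sketch (my illustration, not from the deck) of the register-and-wait pattern; sockfd is assumed to be an already-created socket:

#include <sys/epoll.h>
#include <fcntl.h>

/* Make the socket nonblocking, register it, then wait for readiness. */
void epoll_loop(int sockfd)
{
    fcntl(sockfd, F_SETFL, fcntl(sockfd, F_GETFL, 0) | O_NONBLOCK);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = sockfd } };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);

    struct epoll_event ready[64];
    for (;;) {
        /* Blocks here (not on the socket) until something is readable. */
        int n = epoll_wait(epfd, ready, 64, -1);
        for (int i = 0; i < n; i++) {
            /* read() from ready[i].data.fd until EAGAIN, then loop. */
        }
    }
}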
Enter eBPF...
“eBPF is a revolutionary technology with origins in the Linux kernel that
can run sandboxed programs in an operating system kernel. It is used
to safely and efficiently extend the capabilities of the kernel without
requiring to change kernel source code or load kernel modules.”
-- https://ebpf.io/
XDP/AF_XDP - Linux’s answer to kernel bypass
AF_PACKET V4 -> AF_XDP
■ A new version of AF_PACKET was being developed around the same time eBPF came into existence.
■ Instead of releasing it, the developers transformed it into what we now know as XDP, using eBPF as the glue.
■ A specific socket type was created that uses eBPF to route packets to userspace.
■ And that's how we got AF_XDP (and this is where the talk really begins ;-)
XDP
XDP comes in two variants to the user, but these interfaces are intertwined:
■ XDP programs can be standalone: written in C, compiled to eBPF bytecode, and run on raw packet data without ever having anything to do with userspace (see the sketch below).
■ XDP can be used to route packets to ring buffers that allow rapid transfer to userspace buffers. This is referred to as AF_XDP.
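
For the first, standalone variant, here is a minimal sketch of my own (not from the deck) that never touches userspace: it counts packets in a per-CPU map and passes everything to the normal stack. It uses the same legacy bpf_map_def style as the “default” code shown later:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct bpf_map_def SEC("maps") pkt_count = {
    .type = BPF_MAP_TYPE_PERCPU_ARRAY,
    .key_size = sizeof(__u32),
    .value_size = sizeof(__u64),
    .max_entries = 1,
};

SEC("xdp")
int count_and_pass(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *count = bpf_map_lookup_elem(&pkt_count, &key);

    if (count)
        (*count)++;      /* per-CPU map, so no atomics needed */

    return XDP_PASS;     /* hand the packet on to the kernel stack */
}

char _license[] SEC("license") = "GPL";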
The Linux kernel's networking is great, but...
■ For many use cases it's not ideal
■ It's relatively scalable, but..
● At some point you are just at the mercy of the kernel
■ Copies.. copies, copies..
● Multiple copies often need to take place, even with the best optimized methods
■ Bypassing it gives much better latency, or at least more consistent latency
My use for XDP
And how to use AF_XDP in general
How to use it? + Designing a runtime in Rust
■ Built for an existing Rust asynchronous I/O library named Glommio
● Designed as a thread-per-core (TPC) async I/O library that uses io_uring
■ Currently just in a forked version
■ Still being worked on.. (check back someday)
■ Would love contributions!
So how would we design it and how does AF_XDP work?
Basic layout
■ Memory region (pre-allocated)
■ Ring buffers (queues)
● These are single producer, single consumer
Basic layout (including another CPU for driver / IRQ handling)
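
Roughly what that layout looks like in code, using the C helpers from libbpf's xsk.h (since moved to libxdp) that sit underneath the Rust wrappers shown later. A sketch only: error handling is omitted, and the interface name and queue id are placeholders:

#include <bpf/xsk.h>
#include <linux/if_xdp.h>
#include <sys/mman.h>

#define NUM_FRAMES 4096
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE   /* 4096 bytes */

int setup(void)
{
    struct xsk_umem *umem;
    struct xsk_socket *xsk;
    struct xsk_ring_prod fill, tx;   /* we produce: frames to RX into, frames to send   */
    struct xsk_ring_cons comp, rx;   /* kernel produces: completed TX, received packets */

    /* 1. The pre-allocated memory region, one big chunk split into frames. */
    void *bufs = mmap(NULL, NUM_FRAMES * FRAME_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* 2. Register it as the UMEM; the fill and completion rings come with it. */
    xsk_umem__create(&umem, bufs, NUM_FRAMES * FRAME_SIZE, &fill, &comp, NULL);

    /* 3. One AF_XDP socket bound to a single NIC queue, with RX and TX rings.
       All four rings are the single-producer, single-consumer queues above. */
    struct xsk_socket_config cfg = {
        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        .bind_flags = XDP_USE_NEED_WAKEUP,   /* matters for the io_uring part later */
    };
    xsk_socket__create(&xsk, "eth0", /*queue_id=*/0, umem, &rx, &tx, &cfg);
    return 0;
}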
“Default” eBPF code for AF_XDP socket
#include <bpf/bpf_helpers.h>

struct bpf_map_def SEC("maps") xsks_map = {
    .type = BPF_MAP_TYPE_XSKMAP,
    .key_size = sizeof(int),
    .value_size = sizeof(int),
    .max_entries = 64,
};
“Default” eBPF code for AF_XDP socket
SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx) {
    int index = ctx->rx_queue_index;

    if (bpf_map_lookup_elem(&xsks_map, &index)) {
        return bpf_redirect_map(&xsks_map, index, 0);
    }
    return XDP_PASS;
}
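
To complete the picture, a sketch of the userspace wiring for that program: load and attach it, then put the AF_XDP socket's fd into xsks_map at the index of the queue it is bound to. (In practice libbpf's xsk_socket__create() loads an equivalent default program for you unless you inhibit it; the object file name and function name below are my placeholders.)

#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <bpf/xsk.h>
#include <net/if.h>

void attach_and_register(struct xsk_socket *xsk)
{
    /* Load the compiled object containing xdp_sock_prog and xsks_map. */
    struct bpf_object *obj = bpf_object__open_file("xdp_sock.o", NULL);
    bpf_object__load(obj);

    /* Attach the XDP program to the interface. */
    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "xdp_sock_prog");
    bpf_set_link_xdp_fd(if_nametoindex("eth0"), bpf_program__fd(prog), 0);

    /* Map key = NIC queue index, value = AF_XDP socket fd. */
    int map_fd = bpf_object__find_map_fd_by_name(obj, "xsks_map");
    int sock_fd = xsk_socket__fd(xsk);
    int queue_id = 0;
    bpf_map_update_elem(map_fd, &queue_id, &sock_fd, 0);
}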
How about other designs?
Shared UMEM / multiple AF_XDP sockets
Many sockets, many UMEMs, 1 per core
Interfacing with the kernel is easy
let count = _xsk_ring_cons__peek(
    self.rx_queue.as_mut(),
    filled_frames.len().unwrap_or(0),
    &mut idx,
);
… snip …
for idx in 0..count {
    let rx_descriptor = _xsk_ring_cons__rx_desc(rx_queue.as_ref(), idx);
    // put ^^ this descriptor somewhere
}
… snip …
_xsk_ring_cons__release(rx_queue.as_mut(), count);
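
The same consume loop in the underlying C API, plus the step the snippet elides: handing frames back to the fill ring so the kernel can reuse them. A sketch assuming the rx/fill rings and bufs UMEM area from the setup sketch earlier, and that the fill-ring reserve succeeds:

__u32 idx_rx = 0, idx_fill = 0;

/* How many packets are waiting? Take up to 64. */
unsigned int rcvd = xsk_ring_cons__peek(&rx, 64, &idx_rx);
if (rcvd > 0) {
    /* Reserve matching fill-ring slots so the frames can be recycled. */
    xsk_ring_prod__reserve(&fill, rcvd, &idx_fill);

    for (unsigned int i = 0; i < rcvd; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&rx, idx_rx + i);
        void *pkt = xsk_umem__get_data(bufs, desc->addr);
        /* ... process desc->len bytes at pkt ... */

        /* Give the frame back so the kernel can RX into it again. */
        *xsk_ring_prod__fill_addr(&fill, idx_fill + i) = desc->addr;
    }
    xsk_ring_prod__submit(&fill, rcvd);
    xsk_ring_cons__release(&rx, rcvd);
}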
How this fits into the larger framework
The framework I'm integrating with focuses on io_uring, so integration can happen a few different ways..
1. When you need to poll/drive I/O (needs_wakeup mode) you can submit to io_uring (see the sketch after this list)
a. When packets arrive, this will wake up the consumer (from io_uring) to poll the rings again.
2. For busy-polling mode you just make the calls in the runtime loop and you can't easily go to sleep/block. So you either spin, or fall back to the other mode.
a. This makes it harder to use the existing capabilities of io_uring, and pretty much forces you into spinning constantly, calling synchronous kernel APIs, whether there are packets or not.
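
A minimal liburing sketch of option 1 (my illustration of the idea, not Glommio's actual code): arm a poll on the AF_XDP socket fd, sleep in the ring, and drain the RX ring when the completion arrives.

#include <liburing.h>
#include <poll.h>

void wait_for_packets(struct io_uring *ring, int xsk_fd)
{
    /* Arm a one-shot poll on the AF_XDP socket fd. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_add(sqe, xsk_fd, POLLIN);
    io_uring_submit(ring);

    /* Block here; arriving packets wake us via the completion. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(ring, &cqe);
    io_uring_cqe_seen(ring, cqe);

    /* ... drain the RX ring as in the loop above, then re-arm ... */
}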
Other considerations
■ Is relying on io_uring for needs_wakeup notifications of readiness too slow?
● Does it add too much latency?
■ Is it possible to use multi-shot poll and never have to re-arm the ring? (sketch below)
■ The runtime is cooperatively scheduled, so how do you guard against tight loops on receive or similar?
● Most of the calls to kernel/libbpf are actually synchronous; is that a problem?
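
On the multi-shot question: liburing does expose a multi-shot poll (kernel 5.13+), sketched below; whether its latency is acceptable is exactly the open question above. One SQE keeps producing a completion per readiness event, and IORING_CQE_F_MORE on each CQE tells you the poll is still armed:

#include <liburing.h>
#include <poll.h>

void arm_multishot(struct io_uring *ring, int xsk_fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_poll_multishot(sqe, xsk_fd, POLLIN);
    io_uring_submit(ring);
}

/* For each CQE: if (cqe->flags & IORING_CQE_F_MORE) the poll is still
   armed; otherwise it terminated and must be re-armed. */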
This is all a work in progress...
Thanks for listening, and we'd love for people to get interested and contribute!
Main repo: https://github.com/DataDog/glommio
Parting words:
When in doubt, just try using io_uring alone: it's really great, and it's the future of most async I/O on Linux.
Bryan McCoid
bryan.mccoid@couchbase.com
@bryanmccoid
(BONUS) Popular projects
■ (DPDK) Seastar *cough* *cough* - asynchronous fast I/O framework
■ (DPDK) mTCP - TCP stack
■ (XDP) Katran - XDP load balancer
■ (XDP/eBPF) Cilium - Kubernetes networking stack built into the Linux kernel using eBPF
■ (XDP) Mizar - high performance cloud network for virtual machines