eBPF is an exciting new technology that is poised to transform Linux performance engineering. eBPF enables users to dynamically and programmatically trace any kernel or user space code path, safely and efficiently. However, understanding eBPF is not so simple. The goal of this talk is to give audiences a fundamental understanding of eBPF, how it interconnects existing Linux tracing technologies, and how it provides a powerful platform to solve any Linux performance problem.
4. Examples
● A developer claims boxes have “slow” I/O
● Network connections are randomly
terminated.
● Your service is crashing, you’re not sure why,
maybe it’s getting OOM killed?
● You think some process might be getting
starved.
9. What is eBPF? (Extended Berkeley Packet Filter)
● Fast and safe, in-kernel, register based,
bytecode VM.
● Designed to be JITed with direct mapping to
x86_64 and other modern architectures.
● eBPF programs are “attached” to code paths
within the kernel or user space programs and
are executed when the code path is traversed.
● Linux Kernel 3.18 (2014) - bpf(2) syscall
○ (4.1 for Kprobes)
12. What is eBPF? … cont.
● Programs are written in restricted C. eBPF backend for
LLVM/Clang.
○ clang -O2 -emit-llvm -c bpf.c -o - | llc -march=bpf -filetype=obj -o bpf.o
● eBPF Verifier
○ Programs are verified to finish (no loops), with no unreachable instructions, no reads of uninitialized
registers, and no memory access through arbitrary pointers. Kernel function calls and data structure access are restricted.
● eBPF Maps / Perf Events Ring Buffer
○ Memory-Mapped, bi-directional data structures for storage. Allow sharing of data between
eBPF kernel programs, and also between kernel and user-space applications.
● Helper Functions
○ Kernel functions exposed to eBPF programs.
○ Context sensitive to type of eBPF program.
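The "verified to finish (no loops)" rule can be illustrated with a toy check in user space. This is only a sketch under stated assumptions, not the kernel's verifier: `struct toy_insn` and `toy_verify` are hypothetical names, the checker only rejects backward or out-of-range jumps, and it requires the last instruction to be an exit. Kernels since 5.3 also accept bounded loops; the sketch follows the stricter rule on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified mirror of one 64-bit eBPF instruction. */
struct toy_insn {
    uint8_t code;   /* opcode */
    uint8_t regs;   /* dst_reg in the low 4 bits, src_reg in the high 4 bits */
    int16_t off;    /* signed jump or memory offset */
    int32_t imm;    /* signed immediate */
};

#define TOY_CLASS(code) ((code) & 0x07)  /* instruction class bits */
#define TOY_OP(code)    ((code) & 0xf0)  /* operation bits */
#define TOY_JMP  0x05
#define TOY_CALL 0x80
#define TOY_EXIT 0x90

/* Returns 1 if the program passes the toy rules, 0 if it is rejected. */
int toy_verify(const struct toy_insn *prog, size_t len)
{
    assert(prog != NULL);
    if (len == 0)
        return 0;

    for (size_t pc = 0; pc < len; pc++) {
        uint8_t code = prog[pc].code;

        if (TOY_CLASS(code) != TOY_JMP)
            continue;                       /* not a jump-class instruction */
        if (TOY_OP(code) == TOY_CALL || TOY_OP(code) == TOY_EXIT)
            continue;                       /* calls and exit do not branch */

        /* A jump lands at pc + 1 + off. */
        long target = (long)pc + 1 + prog[pc].off;
        if (target <= (long)pc)
            return 0;                       /* back-edge: possible loop */
        if (target >= (long)len)
            return 0;                       /* jumps past the program end */
    }

    /* Toy rule: the program must end with an exit instruction. */
    return prog[len - 1].code == (TOY_JMP | TOY_EXIT);
}
```

For example, `{0xb7, 0x95}` (r0 = 0; exit) is accepted, while a program starting with opcode 0x05 and offset -1 (a jump to itself) is rejected.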
17. eBPF is appealing to different people for different reasons,
but its power resides in what you can attach it to.
For Performance Engineering
we’re primarily interested in
these hooks.
● Kprobes/Uprobes
● Tracepoints
● USDT
● PerfEvents
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L145
18. Tracepoints (2.6.32) - 2009
● Static places in the kernel where tracing is inserted.
● $ grep -ri TRACE_EVENT *
● https://github.com/brendangregg/perf-tools
19. K/J(ret)probes (2.6.9) - 2004 / U(ret)probes (3.15) - 2014
● Probe any instruction, dynamically
● grep <func> /proc/kallsyms
● Registering a kprobe copies the probed instruction and inserts a breakpoint
(int3 on x86_64).
● When the CPU hits the breakpoint, a trap occurs, registers are saved and
control passes to Kprobes.
● The pre-handler is called, Kprobes single-steps the copied
instruction (slow), then the post-handler is called.
● CONFIG_OPTPROBES=y (enabled on x86_64) replaces the breakpoint with a jump to avoid the trap.
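The `grep <func> /proc/kallsyms` step can also be done programmatically. The sketch below assumes the usual "address type name" line layout of /proc/kallsyms; `struct ksym`, `parse_kallsyms_line` and `find_ksym` are hypothetical names, and the address in the usage note is made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct ksym {
    unsigned long long addr;  /* symbol address */
    char type;                /* symbol type, e.g. 'T' for text */
    char name[128];           /* symbol name */
};

/* Parse one line such as "ffffffff81a2b3c0 T tcp_set_state".
 * Returns 1 on success, 0 if the line does not match. */
int parse_kallsyms_line(const char *line, struct ksym *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "%llx %c %127s",
                  &out->addr, &out->type, out->name) == 3;
}

/* Scan a kallsyms-style file for an exact symbol name.
 * Returns 1 and fills *out if found, 0 otherwise. */
int find_ksym(const char *path, const char *func, struct ksym *out)
{
    char line[256];
    int found = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return 0;
    while (!found && fgets(line, sizeof(line), f) != NULL) {
        if (parse_kallsyms_line(line, out) && strcmp(out->name, func) == 0)
            found = 1;
    }
    fclose(f);
    return found;
}
```

A call such as `find_ksym("/proc/kallsyms", "tcp_set_state", &sym)` tells you whether the function is a candidate for a kprobe. Without root, the kernel typically reports addresses as zeros, so only the name match is meaningful in that case.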
24. Perf events (2.6.31) - 2009
● The “nearly un-googleable” - http://web.eece.maine.edu/~vweaver/projects/perf_events/
● Trace and count tracepoints and lower level events, PMU, HW events (L1
cache store/load/miss etc).
● User space reads the collected data efficiently through the perf_events ring
buffer.
25. USDT (BCC March 2016)
● Userland Statically Defined Tracepoints
● sudo ./tplist -l <library name>
51. eBPF - Extended Berkeley Packet Filter
● Bytecode, register based VM, with an extended instruction set
○ Designed to be JITed with direct mapping to x86_64
● 64-bit instructions, and ten 64-bit general-purpose registers plus a read-only frame pointer
○ R0 - return value from in-kernel function, and exit value for eBPF program
○ R1 - R5 - arguments from eBPF program to in-kernel function
○ R6 - R9 - callee saved registers that in-kernel function will preserve
○ R10 - read-only frame pointer to access stack
● BPF_CALL
○ The calling convention maps to hardware registers, so JITed calls to other kernel functions have zero overhead
● BPF_MAPS
○ Bi-directional data structures for storage. Allow sharing of data between eBPF kernel
programs, and also between kernel and user-space applications.
● Helper Functions
○ https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md ← Very Important!
52. eBPF - Extended Berkeley Packet Filter… cont
● Load programs via bpf(2) syscall (see: man bpf)
○ int bpf(int cmd, union bpf_attr *attr, unsigned int size);
● Cmd: BPF_PROG_LOAD
○ Verify and load an eBPF program, returning a new file descriptor associated with the
program. The close-on-exec file descriptor flag (see fcntl(2)) is automatically enabled for
the new file descriptor.
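A BPF_PROG_LOAD call can be sketched in a few lines of user-space C. glibc provides no wrapper for bpf(2), so the sketch goes through syscall(2). The names `sys_bpf` and `load_trivial_prog` are hypothetical, and the choice of a socket filter program that only does "r0 = 0; exit" is an assumption made to keep the example small. It needs the Linux UAPI headers and, on many systems, root privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper: glibc does not export bpf(), so use syscall(2) directly. */
static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return (int)syscall(__NR_bpf, cmd, attr, size);
}

/* Ask the kernel to verify and load a two-instruction program.
 * Returns a new file descriptor, or -1 with errno set. */
int load_trivial_prog(void)
{
    struct bpf_insn prog[] = {
        /* r0 = 0 (the program's return value) */
        { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
        /* exit */
        { .code = BPF_JMP | BPF_EXIT },
    };
    union bpf_attr attr;

    assert(sizeof(struct bpf_insn) == 8);   /* one instruction is 64 bits */

    memset(&attr, 0, sizeof(attr));         /* unused fields must be zero */
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insn_cnt  = sizeof(prog) / sizeof(prog[0]);
    attr.insns     = (uint64_t)(uintptr_t)prog;
    attr.license   = (uint64_t)(uintptr_t)"GPL";

    return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

If the call fails, errno usually says why: EPERM points at missing privileges, EINVAL or EACCES at a program the verifier rejected. Setting `log_buf`, `log_size` and `log_level` in the attr would return the verifier's log, which this sketch leaves out.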
54. Can we learn more about
the eBPF VM like we did with
tcpdump?
64. As you can imagine, the next 4 instructions
copy “hello wo” into scratch space at
offset -16, copy a “0” into r1, and then
store that “0” at offset -4. Finally we copy the
frame pointer in r10 into r1, as the base for
the variable’s address.
65. To prepare for the call to
int bpf_trace_printk(const char *fmt, u32 fmt_size, ...)
we need to point r1 to the variable (which is -16 bytes
from the frame pointer), and in r2 we store the size of
“hello world\n\0” = 13 bytes.
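The 13-byte size and the immediates used to build the string on the stack can be checked in plain C. This is a sketch: `pack_le` and `fmt_size` are hypothetical helpers, and the 8 + 4 + 1 split of the string ("hello wo", then "rld\n", then the terminating zero) is inferred from the offsets -16 and -4 mentioned in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The format string; sizeof counts the terminating '\0' as well. */
static const char fmt[] = "hello world\n";

/* Size passed in r2 as fmt_size. */
size_t fmt_size(void)
{
    return sizeof(fmt);
}

/* Pack up to 8 bytes into the little-endian integer that a store
 * instruction would write to the stack on x86_64. */
uint64_t pack_le(const char *bytes, size_t n)
{
    uint64_t v = 0;

    assert(bytes != NULL && n <= 8);
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)(uint8_t)bytes[i] << (8 * i);
    return v;
}
```

Here `fmt_size()` is 13, `pack_le("hello wo", 8)` is 0x6f77206f6c6c6568 and `pack_le("rld\n", 4)` is 0x0a646c72, which is the kind of constant you would expect to see as an immediate in the disassembly.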
66. Opcode 0x85 is a function call, with an imm of 6. We need to
look that up in bpf.h in order to figure out which helper that is.
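The lookup can be sketched as a small table. Opcode 0x85 is the jump class (0x05) combined with the call operation (0x80), and the immediate indexes the helper list in include/uapi/linux/bpf.h. The table below copies only the first seven helper IDs from that header; `describe_call` and `helper_names` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OP_CLASS_JMP 0x05   /* jump instruction class */
#define OP_CALL      0x80   /* call operation */

/* First helper IDs, in the order they appear in bpf.h. */
static const char *const helper_names[] = {
    "unspec",           /* 0 */
    "map_lookup_elem",  /* 1 */
    "map_update_elem",  /* 2 */
    "map_delete_elem",  /* 3 */
    "probe_read",       /* 4 */
    "ktime_get_ns",     /* 5 */
    "trace_printk",     /* 6 */
};

static_assert(sizeof(helper_names) / sizeof(helper_names[0]) == 7,
              "table is expected to end at helper id 6");

/* Returns the helper name for a call instruction, "unknown" for an id
 * outside the table, or NULL if the opcode is not a call. */
const char *describe_call(uint8_t opcode, int32_t imm)
{
    size_t count = sizeof(helper_names) / sizeof(helper_names[0]);

    if (opcode != (OP_CLASS_JMP | OP_CALL))
        return NULL;
    if (imm < 0 || (size_t)imm >= count)
        return "unknown";
    return helper_names[imm];
}
```

With this table, `describe_call(0x85, 6)` resolves to "trace_printk", matching the bpf_trace_printk call prepared on the previous slide.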
Editor's Notes
We’re going to refer back to this slide several times in our presentation
Kprobe tcp_set_state
We check subnet for whether it’s an AWS hosted service
docker