DPDK applications
1. share, discuss, & ask
DPDK: What is it, and what is it not?
How to port an application?
Where can we run it?
ISSUES: What are they?
Why did they occur?
TOOLS: General
Debug guide
Custom
2. 2
a) Platform:
Difference between a general-purpose processor and a network processor (NPU):
NPU has smaller caches and a lower clock speed (around 1 GHz)
Specialized ISA (Instruction Set Architecture) for single-clock parsing (e.g. IP, MPLS, VLAN)
Specialized schedulers keep OS and worker threads separate
b) Processing Architecture:
HW: dedicated peripheral interrupt processing, reduced TLB misses
SW: schedulers, locks
Timers: scheduler tick; remove the SW watchdog on DP (data plane) cores
Inter-processor communication: LRU drain, rcu_barrier, paging, per-CPU drain
c) Locking overheads by CPU or Memory:
IPIs (inter-processor interrupts) between a large number of cores are expensive
Locking memory or memory regions is expensive
vmstat_update runs every second to refresh virtual memory statistics
d) Pipeline Latency: Hiding for both HW & SW
Get bulk packets for processing
Pre-fetch to cache
Allow bulk lookup for multiple packets in burst
e) Bus Interface: Improve PCIe (NIC) to CPU Cache
Design Cache to accommodate more frames per core.
Larger L2 or L3.
Use HW assisted caching to manage incoming packets.
f) Meta Buffer:
Millions of packets per second cannot be held and accessed in CPU cache or memory; latency and sheer size are the obstacles.
Pre-parse stage: prepare metadata that holds the essential headers and fits in cache.
Prefetch bytes in an interleaved fashion so that the pipeline hides latency.
3. Measuring cross socket bandwidth
Figure labels: two CPU sockets, each with CPU cores, LLC (holding packet and RXd), UPI controller and NIC; memory controller and RAM.
7. Problem?
Figure: Linux kernel packet path. User app (user buffer), network stack (SKB), generic driver (_rcv_ISR(), _hard_start_xmit()), RX DMA buffer and TX DMA buffer.
SKB frame: the head, data, tail, end and len fields delimit the headroom, user data and tailroom, followed by the SKB shared info.
9. Application – Packet Life
Read from NIC
Check content
Ensure integrity
Do lookup / hash
Identify processing
Map to queue
Action per queue/schedule
Update stats counters
Send burst to NIC
Figure axis: CPU, NIC, programmable NIC (slow to fast).
10. What is not DPDK?
HW support: Huge Page Size, Data Direct I/O, SIMD
Conserve power or cycles as required
Allow multi-process data sharing without SYSCALL IPC (SHM, sockets, FIFO)
Either burst or low latency polls
Adapt to small, big or hybrid cases
Prototype and Deploy quickly
HW offload with SW fallback
Runs in User Space (Bypasses Kernel Path)
Library of Functions
What is DPDK?
11. Where can we run DPDK?
Figure: deployment options. The application with DPDK + Ext runs in host user space on a NIC, inside Docker, inside a VM guest on a vNIC, inside Docker within a VM guest, and in NIC user space.
13. Other Applications
Figure: network I/O over multiple 10 Gbit/s interfaces, with control, configuration and stats in user space; traffic is clear text or encrypted. From the RX NIC, the RSS hash feeds parallel Capture, Decode, Stream, Detect, Output pipelines.
DPDK stages: parse for metadata, match against the rule set, buffer and zero copy.
19. Bottleneck Analysis
Is there a mismatch in packet rates (received < desired)?
Do the RX lcore threads get enough cycles?
Are packets dropped at receive or transmit?
What is the packet or object processing rate in the pipeline?
Is the performance of user functions not as expected?
Are execution cycles for dynamic service functions not frequent?
Is the packet not in the expected format?
20. Why are there various drops?
Stress & Regress: pkt-gen, trex
Generic tools: lstopo, dmidecode, libunwind
DPDK apps: proc-info, pdump
Isolate: debug guide, NUMA, huge pages, pinning
Characterize & Quantify: perf top, perf stat, vtune
Custom tools: malloc scanner, memzone monitor, thread stack tracer
23. Hardware related items
• NIC details, configuration and firmware version via Linux
• PCIe capability and current configuration
• PCIe advertised speed and configuration
• SFP and SFP+ details
lshw -c network -businfo
lshw -c network | egrep 'firmware|pci@'
ethtool -m | -k | -P | -S <interface>
• CPU flags and features
• lscpu
• cat /proc/cpuinfo
HW performance counters (use perf and vtune on IA)
Helpful!
24. Linux Signals
Signal Value Action Comment
SIGHUP 1 Term Hangup detected on controlling terminal or death of controlling process
SIGINT 2 Term Interrupt from keyboard
SIGQUIT 3 Core Quit from keyboard
SIGILL 4 Core Illegal Instruction
SIGABRT 6 Core Abort signal from abort(3)
SIGKILL 9 Term Kill signal
SIGSEGV 11 Core Invalid memory reference
SIGTERM 15 Term Termination signal
SIGSTOP 17,19,23 Stop Stop process
SIGTSTP 18,20,24 Stop Stop typed at terminal
SIGBUS 10,7,10 Core Bus error (bad memory access)
SIGFPE 8 Core Floating point exception
SIGPIPE 13 Term Broken pipe: write to pipe with no readers
SIGALRM 14 Term Timer signal from alarm(2)
SIGUSR1 30,10,16 Term User-defined signal 1
SIGUSR2 31,12,17 Term User-defined signal 2
SIGCHLD 20,17,18 Ign Child stopped or terminated
SIGCONT 19,18,25 Cont Continue if stopped
SIGTTIN 21,21,26 Stop Terminal input for background process
SIGTTOU 22,22,27 Stop Terminal output for background process
The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.
Next are the signals not in the POSIX.1-1990 standard but described in SUSv2 and POSIX.1-2001.
Signal Value Action Comment
SIGPOLL Term Pollable event (Sys V). Synonym for SIGIO
SIGPROF 27,27,29 Term Profiling timer expired
SIGSYS 12,31,12 Core Bad argument to routine (SVr4)
SIGTRAP 5 Core Trace/breakpoint trap
SIGURG 16,23,21 Ign Urgent condition on socket (4.2BSD)
SIGVTALRM 26,26,28 Term Virtual alarm clock (4.2BSD)
SIGXCPU 24,24,30 Core CPU time limit exceeded (4.2BSD)
SIGXFSZ 25,25,31 Core File size limit exceeded (4.2BSD)
Not Sure!
25. STRACE
strace -e trace=open,read <executable>
strace -t -e open <executable>
strace -r -e open <executable>
strace -c <executable>
strace -i <executable>
strace -T -e read <executable>
strace -e trace=network <executable> (also trace=signal, trace=memory)
strace is a user-space utility for Linux used for diagnostics, debugging and instruction; it monitors system calls and signals. It is made possible by the kernel feature known as ptrace.
Other features:
Specifying a list of paths to be traced (-P /etc/ld.so.cache, for example).
Modifying the return and error codes of specified syscalls, and injecting signals upon their execution (-e inject= option, since strace 4.15).
Extracting information about file descriptors, including sockets (-y option).
Not Helpful!
26. objdump
File header: -f
Private (format-specific) headers: -p
Section headers: -h
All headers: -x
Disassemble executable sections: -d
Disassemble all sections: -D
Full contents: -s
Debug information: -g
Symbol table: -t
Dynamic symbol table: -T
Dynamic relocations: -R
Function content via name: -s -j .rodata, -D --prefix-addresses
Relocations can also be listed with readelf --relocs
Somewhat Helpful!
27. nm <executable>
t|T – The symbol is in the .text (code) section
b|B – The symbol is in the uninitialized data (.bss) section
D|d – The symbol is in the initialized .data section
nm -A ./*.o: prefix each symbol with its file name
nm -u: undefined symbols only
nm -n: sort symbols numerically by address
nm -S: symbols with size
nm -D: dynamic symbols
A : Global absolute symbol.
a : Local absolute symbol.
B : Global bss symbol.
b : Local bss symbol.
D : Global data symbol.
d : Local data symbol.
f : Source file name symbol.
L : Global thread-local symbol (TLS).
l : Static thread-local symbol (TLS).
T : Global text symbol.
t : Local text symbol.
U : Undefined symbol.
Somewhat Helpful!
28. CPU utilization
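#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/sysinfo.h>

/* Print the average CPU usage (%) since start of the process whose PID is argv[1]. */
int main(int argc, char **argv)
{
    /* Fields 14, 15, 16, 17 and 22 of /proc/<pid>/stat, in clock ticks. */
    const char *stat_param[5] = {"utime", "stime", "cutime", "cstime", "starttime"};
    char *stat_result[5] = {0};
    char buf[1000];
    struct sysinfo info = {0};
    unsigned long cpu_usage = 0;
    int res = 0;

    if (argc < 2 || sysinfo(&info) != 0)
        return EXIT_FAILURE;
    fprintf(stdout, "Process to fetch stat: %s\n", argv[1]);
    /* Field numbers assume the process name (field 2) contains no spaces. */
    snprintf(buf, sizeof(buf),
             "awk '{print $14 \",\" $15 \",\" $16 \",\" $17 \",\" $22}' /proc/%s/stat", argv[1]);
    FILE *fp = popen(buf, "r");
    if (fp == NULL)
        return EXIT_FAILURE;
    char *parse = fgets(buf, sizeof(buf), fp);
    pclose(fp);
    for (char *p = parse ? strtok(parse, ",\n") : NULL; p && res < 5; p = strtok(NULL, ",\n")) {
        fprintf(stdout, "%s = %s\n", stat_param[res], p);
        stat_result[res++] = p;
    }
    if (res < 5)
        return EXIT_FAILURE;
    fprintf(stdout, " --- Calculation --- \n");
    unsigned long hertz = sysconf(_SC_CLK_TCK);
    unsigned long total_time = atol(stat_result[0]) + atol(stat_result[1]) +
                               atol(stat_result[2]) + atol(stat_result[3]);
    /* Seconds the process has existed: system uptime minus its start time. */
    unsigned long sec = info.uptime - (atol(stat_result[4]) / hertz);
    if (sec > 0)
        cpu_usage = (100 * total_time) / (sec * hertz);
    fprintf(stdout, "cpu_usage (%lu) for process (%s)\n", cpu_usage, argv[1]);
    return EXIT_SUCCESS;
}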
Not Helpful!
29. GDB
Call library functions, or functions from within the debugged program, using the command call.
Start GDB with gdbtui or gdb -tui. Switch views using 'layout src|asm|regs'.
shell executes commands in the shell.
print, examine (x) and display
info file - entry point
set disassembly-flavor intel
set print pretty
set print addr off
set print array on|off
Display the next 5 instructions - x/5i $pc
disassemble <function name>
Example .gdbinit:
file exe
break *0x400710
set disassembly-flavor intel
layout asm
layout regs
run argument1 argument2
To patch instructions in place, we can use set to do the magic for us. Let's first inspect the instruction bytes:
(gdb) x/10b $pc
(gdb) set write on
(gdb) set {unsigned int}$pc = 0x90909090
(gdb) set {unsigned char}($pc+4) = 0x90
(gdb) set write off
(gdb) x/10i $pc
x/6i $pc
=> 0x40911f: nop
0x409120: nop
0x409121: nop
0x409122: nop
0x409123: nop
0x409124: push rbp
The same patch with absolute addresses:
set {unsigned int}0x40911f = 0x90909090
set {unsigned char}0x409123 = 0x90
To skip the instructions instead of patching them:
set $pc+=5
jump *$pc+5
Somewhat Helpful!
30. DPDK packet processing using Data Direct I/O (DDIO)
1. Core writes RXd preparing for receiving packet
2. NIC reads RXd to get buffer address
3. NIC writes packet
4. NIC writes RXd
5. Core reads RXd (polling)
6. Core reads packet and performs some action
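Figure: CPU socket with CPU cores, LLC, memory controller and RAM; the packet and RXd are held in the LLC, and the numbers 1 to 6 mark the steps above between core, LLC and NIC.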
Easy!
34. Stack, register, variable trace for all threads
When to use: an unexpected signal or crash occurs
What to do: dump stack and register information for all threads in an environment where GDB is not present or cannot be run.
Where it works:
Binaries are stripped.
Binary and application have no debug symbols.
Rare cases and combinations in which faults occur.
Errors or faults that are difficult to reproduce.
There is no access to GDB or remote GDB, ptrace or pstack-dump.
Inspect the stack for each thread.
Inspect and dump global and debug variables.
DPDK cases where a secondary process causes the primary to segfault, or where running GDB on the primary causes the secondary to segfault.
Q & A:
Does this work for all shared libraries? Yes
Does this work for mixed static and shared libraries? Yes
Does this work for all stripped libraries? Yes
Can we register SIGUSR1 to dump intermediate state? Yes
How to make it work:
Build:
LIB: libunwind-dev
CFLAGS: -DDUMPSTACK_EXTRAREG -DDUMPSTACK_EXTRASTACK -DDUMPSTACK -L/usr/lib/x86_64-linux-gnu/ -lunwind
LDFLAGS: -L/usr/lib/x86_64-linux-gnu/ -lunwind
Application code change: register the custom signal handler
Somewhat Easy!
35. Stack trace
----------------- THREAD NAME BEGIN -----------------
/proc/41253/task/41248/comm
/proc/41253/task/41249/comm
/proc/41253/task/41250/comm
/proc/41253/task/41251/comm
/proc/41253/task/41252/comm
/proc/41253/task/41253/comm
l2fwd
eal-intr-thread
lcore-slave-3
lcore-slave-4
lcore-slave-5
pdump-thread
----------------- THREAD NAME DONE -----------------
DPDK Version 0x11080010
Config: msater 2 lcore count 4 process 0
rte_sys_gettid 41253
37. When to use: Memory layout is shared across multiple processes. This can lead to:
Unintended changes within the same process
Unintended changes from another process in a multi-process setup
Application logic or function pointers modifying unintended areas
What to do: dump all threads stack and register information in an
environment where GDB is not present or not run.
Where it works:
Control and data plane are in the same or different processes.
Tables are close to each other in memory.
Table entries are allocated dynamically with malloc.
Isolate the table or counter where the change is occurring.
Can monitor multiple tables.
Program errors:
Keys or values are read without const.
Values are modified using pointer arithmetic.
Table types: Lookup, Lookup + Result, Lookup + Result + Counters, Counters or Index to Counters, and Reference to Lookup, Lookup + Result or Lookup + Result + Counters
Q & A:
Does this work for all shared libraries? Yes
Does this work for mixed static and shared libraries? Yes
Does this work for all stripped libraries? Yes
Can we register SIGUSR1 to dump intermediate state? Yes
How it works: Runs as a secondary application that periodically monitors selected tables or memory regions and reports the offset where a change has occurred.
Build:
LIB: libunwind-dev
CFLAGS: -DDUMPSTACK_EXTRAREG -DDUMPSTACK_EXTRASTACK -DDUMPSTACK -L/usr/lib/x86_64-linux-gnu/ -lunwind
LDFLAGS: -L/usr/lib/x86_64-linux-gnu/ -lunwind
Application code change: register the custom signal handler
Somewhat Easy!
Memzone Monitor
40. MALLOC-FREE Scanner
When to use: quick and dirty valgrind-like reporting tool
What to do:
For every successful malloc, calloc or zalloc, create a container to hold the name, pointer and size.
For every free of an allocated entry, remove the container.
How it works:
Create 'struct rte_fbarray' with 'rte_memzone_reserve'.
In the primary process, call 'rte_fbarray_init'.
In the secondary, call 'rte_fbarray_attach'.
In the primary process, for each alloc, retrieve a container with 'rte_fbarray_find_next_free'.
For each successful alloc, mark it with 'rte_fbarray_set_used'.
For each free, call 'rte_fbarray_set_free'.
In the secondary, fetch the details back with 'rte_fbarray_find_next_used' or 'rte_fbarray_find_next_n_used'.
Where it works:
rte_malloc, rte_calloc and rte_zalloc do not map the allocation name to an address, which makes it difficult to track usage of dynamically allocated instances.
Easy!
Figure: the memzone container spans segments Seg-0 to Seg-n and holds entries Alloc-1, Alloc-2 and Alloc-3; the secondary reads them via rte_fbarray_attach.
41. Dynamic DEBUG with eBPF (user-space)
Figure: lookup table and counters.
API:
I. Application specific
II. DPDK eBPF functions for debug API
When to use: for dynamic debug
What to do: load eBPF to existing applications
How it works: same as user space eBPF
Where:
1. Applications in field
2. Recompile not possible
3. Compiler MACROs not possible
Mem-copy:
Pros: XDP buffers are released to the pool immediately after the copy.
Cons: Limited vector instructions (a large copy becomes multiple smaller copies; HW is limited to 2 loads and 1 store per vector operation). Even with 512-bit SIMD, a single copy moves only 64 B (512 bits).
Zero-Copy:
Pros: Buffer is already in DPDK buffer format; no copy or external buffer.
Cons: All buffers need to be page aligned, applications need to be adapted, and the buffer is held by the application until the packet is dropped or TX completes.
single or multiple primary processes.
single primary and single secondary.
single primary and multiple secondaries.
Set LD_PRELOAD to the path of a shared object and that file will be loaded before any other library (including the C runtime, libc.so).
To run with a special library (for example, a custom malloc): 'LD_PRELOAD=/path/to/my/malloc.so /bin/ls'