2. Breakup
• To debug an application, isolate it to run in a Linux 64/32-bit environment
• User space only
• Uses DPDK PMDs and libraries
• Data capture works for both static and shared library builds
• Use gcc and GNU debug options too
3. Sample Packet Life (RX → Classifier + Load balance → worker cores → crypto (optional) → traffic manager → TX)
[Figure: packet-path diagram. Packets enter from NIC 1,2,3 … n as rte_mbufs (Rx/Tx metadata + payload) drawn from a mempool of MBUFs; "PKT RX + Classify and distribute" runs on Lcore 1, workers on Lcore 3, 4, 5, 6, the CRYPTO DEV stage on Lcore 7, QoS + TX on Lcore 8, and packets exit via the NICs on Lcore 0/9. "Stats + health" runs on Service core 2.]
4. 1. Is ‘received PKT rate < sent PKT rate’?
Is the generic configuration correct?
• What are the port speed and duplex? – rte_eth_link_get()
• Are packets of larger sizes dropped? – rte_eth_dev_get_mtu()
• Are only specific MACs received? – rte_eth_promiscuous_get()
Are there NIC-specific drops?
• Check rte_eth_rx_queue_info_get() for nb_desc, scattered_rx
• Check rte_eth_stats_get() for per-queue stats
• Do the stats of other queues show no change? – rte_eth_dev_rss_hash_conf_get()
If the problem still persists, the RX lcore thread may not be pulling traffic fast enough:
• Check whether the RX thread, distributor or event RX adapter is holding or processing more than required
• Try rte_prefetch_non_temporal() to bring the pulled mbuf into cache temporarily
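The configuration checklist above can be condensed into one pass over a port snapshot. A minimal sketch: the `port_cfg` struct and issue names are illustrative stand-ins for what rte_eth_link_get(), rte_eth_dev_get_mtu() and rte_eth_promiscuous_get() would return in a real application.

```c
#include <stdint.h>

/* Illustrative snapshot of the values the DPDK calls on this slide return. */
struct port_cfg {
    int link_up;          /* link status from rte_eth_link_get() */
    uint32_t speed_mbps;  /* negotiated speed */
    uint16_t mtu;         /* from rte_eth_dev_get_mtu() */
    int promiscuous;      /* from rte_eth_promiscuous_get() */
};

enum cfg_issue {
    ISSUE_LINK_DOWN    = 1 << 0,
    ISSUE_SLOW_LINK    = 1 << 1, /* negotiated below expected speed */
    ISSUE_SMALL_MTU    = 1 << 2, /* larger packets will be dropped */
    ISSUE_UNICAST_ONLY = 1 << 3, /* only specific MACs are received */
};

/* Walk the checklist and return a bitmask of suspected causes for
 * 'received rate < sent rate'. */
int check_port_config(const struct port_cfg *c,
                      uint32_t expected_mbps, uint16_t max_pkt_size)
{
    int issues = 0;
    if (!c->link_up)
        issues |= ISSUE_LINK_DOWN;
    else if (c->speed_mbps < expected_mbps)
        issues |= ISSUE_SLOW_LINK;
    if (c->mtu < max_pkt_size)
        issues |= ISSUE_SMALL_MTU;
    if (!c->promiscuous)
        issues |= ISSUE_UNICAST_ONLY;
    return issues;
}
```

Running this against each port before digging into per-queue stats quickly rules out the generic misconfigurations.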
5. 2. Are there packet drops?
At RX:
• Get the RX queue count via rte_eth_dev_info_get() for nb_rx_queues
• Check for misses and errors via rte_eth_stats_get(): imissed, ierrors, q_errors, rx_nombuf; also check mbuf reference counts with rte_mbuf_refcnt_read()
At TX:
• Are we transmitting in bulk to reduce the TX descriptor overhead?
• Check rte_eth_stats_get(): oerrors, q_errors; also check mbuf reference counts with rte_mbuf_refcnt_read()
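The drop counters above only help when sampled as deltas. A sketch of that pattern, using a cut-down stand-in for struct rte_eth_stats with only the counters this slide inspects; the classification strings are illustrative:

```c
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for struct rte_eth_stats. */
struct eth_stats {
    uint64_t imissed;   /* RX packets dropped by HW (descriptor ring full) */
    uint64_t ierrors;   /* erroneous received packets */
    uint64_t rx_nombuf; /* RX mbuf allocation failures */
    uint64_t oerrors;   /* failed transmitted packets */
};

/* Compare two snapshots (e.g. taken a second apart via rte_eth_stats_get())
 * and report which class of drops is currently growing. */
const char *classify_drops(const struct eth_stats *prev,
                           const struct eth_stats *cur)
{
    if (cur->imissed > prev->imissed)
        return "rx-ring-full";      /* app not polling fast enough */
    if (cur->rx_nombuf > prev->rx_nombuf)
        return "mempool-exhausted"; /* mbufs not freed back fast enough */
    if (cur->ierrors > prev->ierrors)
        return "rx-errors";         /* bad packets on the wire */
    if (cur->oerrors > prev->oerrors)
        return "tx-errors";
    return "no-new-drops";
}
```

Growing imissed points at the application side; growing rx_nombuf points at mempool sizing, which slide 8 covers.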
6. 3. Drops at producer ring?
Performance for the producer:
• Fetch the type of ring – rte_ring_dump() for ‘flags (RING_F_SP_ENQ)’
• If ‘burst enqueue - actual enqueue > 0’ – check rte_ring_count() or rte_ring_free_count()
• If a burst or single enqueue returns 0, there is no more space – check with rte_ring_full()
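The ‘burst - actual > 0’ symptom is easiest to see on a toy model. The sketch below mimics rte_ring_enqueue_burst() semantics (enqueue as many objects as fit, return the actual count) on a simplified single-producer ring; it is not the real rte_ring layout.

```c
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 8
struct toy_ring {
    void *slots[RING_SIZE];
    unsigned head, tail; /* head: producer index, tail: consumer index */
};

unsigned ring_count(const struct toy_ring *r)      { return r->head - r->tail; }
unsigned ring_free_count(const struct toy_ring *r) { return RING_SIZE - ring_count(r); }
int ring_full(const struct toy_ring *r)            { return ring_free_count(r) == 0; }

/* Enqueue up to n objects; the drop case on this slide shows up as a
 * return value smaller than n. */
unsigned ring_enqueue_burst(struct toy_ring *r, void *const objs[], unsigned n)
{
    unsigned space = ring_free_count(r);
    unsigned todo = n < space ? n : space;
    for (unsigned i = 0; i < todo; i++)
        r->slots[r->head++ % RING_SIZE] = objs[i];
    return todo;
}
```

In a real pipeline, `n - ring_enqueue_burst(...)` per burst is exactly the producer-side drop count worth exporting as a stat.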
7. 4. Drops at consumer ring?
Performance for the consumer:
1. Fetch the type of ring – rte_ring_dump() for ‘flags (RING_F_SC_DEQ)’
2. If ‘burst dequeue - actual dequeue > 0’ – check rte_ring_free_count()
3. If a burst or single dequeue always returns 0 – check whether the ring is empty via rte_ring_empty()
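The three checks above reduce to one decision helper. A sketch, with illustrative diagnosis strings: `dequeued` is the return of a burst dequeue, `burst` the requested size, and `empty` the result of an rte_ring_empty()-style check taken right after.

```c
#include <string.h>

const char *diagnose_consumer(unsigned burst, unsigned dequeued, int empty)
{
    if (dequeued == 0)
        return empty ? "ring-empty-producer-slow"
                     : "dequeue-failed-check-ring-type"; /* e.g. wrong SC/MC flags */
    if (dequeued < burst)
        return "partial-dequeue-producer-lagging";
    return "full-burst";
}
```

An always-empty ring shifts the investigation back to the producer side (slide 6); a non-empty ring with failing dequeues points at the ring's flag configuration.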
8. 5. Are we still not getting packets or objects at the desired rate?
• Are packets received from multiple NICs? – rte_eth_dev_count_all()
• Are NIC interfaces on different sockets? – use rte_eth_dev_socket_id()
• Is the mempool created on the right socket? – rte_mempool_create() or rte_pktmbuf_pool_create()
• Are we seeing drops on a specific socket? – it might require more mempool objects; try allocating more
• Is there a single RX thread for multiple NICs? – try multiple lcores, each reading from a fixed interface
• We might be hitting the cache limit. Increase cache_size at pool create time
If we are still seeing low performance:
• Check if we have sufficient objects – rte_mempool_avail_count()
• Failure on some packets – we might be getting packets larger than the mbuf data size. Check rte_pktmbuf_is_contiguous()
• If a user pthread is used for object access – rte_mempool_cache_create()
• Try using 1 GB huge pages instead of 2 MB; if stuck with 2 MB pages, try rte_mem_lock_page()
A stall in the release of mbufs can be because:
• The processing pipeline is too heavy
• There are too many stages
• TX is not transferring at the desired rate
• Application misuse scenarios:
  a. not freeing packets
  b. rte_mbuf_refcnt_set()
  c. rte_pktmbuf_prefree_seg()
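Undersized pools make every stall above fatal. A back-of-envelope sizing rule implied by this slide: count every location that can hold an mbuf at once, or the pool runs dry and rx_nombuf grows. The formula and names below are an illustrative rule of thumb, not a DPDK API.

```c
#include <stdint.h>

/* Rough lower bound on mbuf pool size for a run-to-completion app. */
unsigned mbuf_pool_size(unsigned nb_ports,
                        unsigned rx_desc_per_port, unsigned tx_desc_per_port,
                        unsigned nb_lcores, unsigned burst_size,
                        unsigned cache_size)
{
    return nb_ports * (rx_desc_per_port + tx_desc_per_port) /* held in NIC rings */
         + nb_lcores * burst_size                           /* in flight per core */
         + nb_lcores * cache_size;                          /* parked in per-lcore caches */
}
```

Note the last term: raising cache_size (as suggested above) silently raises the total object count needed, since cached objects are invisible to other lcores.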
9. 6. Performance issues in crypto lookaside?
Is the generic configuration correct?
• Get the total crypto devices – rte_cryptodev_count()
• Cross-check that SW or HW flags are configured properly – rte_cryptodev_info_get() for feature_flags
Is enqueue request > actual enqueue (drops)?
• Is the queue pair set up on the proper node? – rte_cryptodev_queue_pair_setup() for socket_id
• Is the session pool created on the same socket_id as the queue pair?
• Is the enqueue thread on the same socket_id?
• Check rte_cryptodev_stats_get() for enqueue or dequeue err_count (drops)
• Are multiple threads enqueuing or dequeuing on the same queue pair?
Is enqueue rate > dequeue rate?
• Is the dequeue lcore thread on the same socket_id?
If we are using SW crypto:
• Is the crypto library built with the right flags?
• Is the queue pair using the CPU ISA? – rte_cryptodev_info_get() for feature_flags AVX|SSE
If we are using HW crypto – is the card on the same NUMA socket as the queue pair and session pool?
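Most of the questions above are NUMA-placement checks plus one rate check, which a single helper can encode. A sketch with illustrative names; the socket ids would come from calls like rte_cryptodev_socket_id() and rte_eth_dev_socket_id() in a real application.

```c
#include <string.h>

/* Diagnose a crypto queue pair: all four actors should share one NUMA
 * socket, and enqueued < requested signals queue-pair back-pressure. */
const char *diagnose_crypto(int dev_socket, int qp_socket,
                            int session_pool_socket, int lcore_socket,
                            unsigned requested, unsigned enqueued)
{
    if (!(dev_socket == qp_socket &&
          qp_socket == session_pool_socket &&
          session_pool_socket == lcore_socket))
        return "numa-mismatch";   /* cross-socket traffic on every op */
    if (enqueued < requested)
        return "qp-backpressure"; /* device slower than the enqueue rate */
    return "ok";
}
```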
11. 7. What to check in a process for bottlenecks?
Debug:
• Mode of operation – rte_eal_get_configuration() (master lcore, lcore|service|numa count, process_type)
• Check the lcore run mode – rte_eal_lcore_role() (rte, off, service)
• Process details – rte_dump_stack(), rte_dump_registers(), rte_memdump()
Performance:
• Too many thread context switches – identify the lcore with rte_lcore_id() and its index mapping with rte_lcore_index()
• Check the lcore role type – rte_eal_lcore_role() (rte, off, service)
• Check the CPU core – rte_thread_get_affinity() and rte_eal_get_lcore_state()
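Context-switch counts themselves come from the kernel, not DPDK: Linux reports them per thread in /proc/<pid>/task/<tid>/status. The helper below just parses one line of that file (reading the file is left out to keep the sketch self-contained); a growing nonvoluntary count on an lcore suggests it is not alone on its pinned core.

```c
#include <stdio.h>

/* Extract the counter from a 'voluntary_ctxt_switches' or
 * 'nonvoluntary_ctxt_switches' status line; -1 if the line is neither. */
long parse_ctxt_switches(const char *status_line)
{
    long v = -1;
    if (sscanf(status_line, "voluntary_ctxt_switches: %ld", &v) == 1)
        return v;
    if (sscanf(status_line, "nonvoluntary_ctxt_switches: %ld", &v) == 1)
        return v;
    return -1;
}
```

Sampling this per lcore thread from a stats/health core gives a cheap cross-check on the rte_thread_get_affinity() result above.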
12. 8. What to debug on a service core?
• Get the service core count – rte_service_lcore_count() and compare with the result of rte_eal_get_configuration()
• Check if the registered service is available – rte_service_get_by_name(), rte_service_get_count() and rte_service_get_name()
• Is a given service running in parallel on multiple lcores? – rte_service_probe_capability() and rte_service_map_lcore_get()
• Is the service running? – rte_service_runstate_get()
• Find how many services are running on a specific service lcore – rte_service_lcore_count_services()
• Generic debug – rte_service_dump()
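The checks above combine into a simple triage order. A sketch with illustrative strings: the inputs mirror what rte_service_runstate_get(), rte_service_probe_capability() (for the MT-safe capability) and rte_service_map_lcore_get() would report.

```c
#include <string.h>

const char *diagnose_service(int runstate_running, int mt_safe,
                             unsigned mapped_lcores)
{
    if (!runstate_running)
        return "service-not-running";     /* enable the runstate first */
    if (mapped_lcores == 0)
        return "no-lcore-mapped";         /* map a service lcore to it */
    if (mapped_lcores > 1 && !mt_safe)
        return "unsafe-parallel-mapping"; /* service is not MT-safe */
    return "ok";
}
```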
13. 9. Bottleneck in event_dev?
Is the generic configuration correct?
• Get the event_dev devices – rte_event_dev_count()
• Are they created on the correct socket_id? – rte_event_dev_socket_id()
• Check the HW or SW capabilities – rte_event_dev_info_get() for event_qos, queue_all_types, burst_mode, multiple_queue_port, max_event_queue|dequeue_depth
• Packets stuck in a queue – check for stages (event queues) where packets are looped back to the same or a previous stage
Performance drops when enqueue event count > actual enqueued?
• Dump the event_dev information – rte_event_dev_dump()
• Check the stats for the eventdev's queues and ports
• Check the inflight and current queue element counts for enqueue|dequeue
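The ‘enqueue count > actual enqueued’ symptom is commonly back-pressure from the device's new-event limit: once the inflight count reaches nb_events_limit, enqueues of new events fail until workers release events. A pure model of that admission check (names are illustrative, not the eventdev implementation):

```c
/* How many NEW events out of 'requested' the device would admit, given
 * the current inflight count and the configured nb_events_limit. */
unsigned admit_new_events(unsigned inflight, unsigned nb_events_limit,
                          unsigned requested)
{
    unsigned credits = inflight < nb_events_limit
                     ? nb_events_limit - inflight : 0;
    return requested < credits ? requested : credits;
}
```

If rte_event_dev_dump() shows inflight pinned at the limit, a stage is holding events too long, often the looped-back stage mentioned above.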
14. 10. How to debug an HW or SW traffic manager? Is the configuration right?
• Get the current capabilities of the DPDK port – rte_tm_capabilities_get() for max nodes, levels, shaper_private, shaper_shared, sched_n_children and stats_mask
• Check if the current leaves are configured identically – rte_tm_capabilities_get() for leaf_nodes_identical
• Get the leaf nodes for a DPDK port – rte_tm_get_number_of_leaf_nodes()
• Check level capabilities with rte_tm_level_capabilities_get() for n_nodes:
  • max, nonleaf_max, leaf_max
  • identical, non_identical
  • shaper_private_supported
  • stats_mask
  • cman WRED packet|byte supported
  • cman head drop supported
• Check node capabilities with rte_tm_node_capabilities_get():
  • shaper_private_supported
  • stats_mask
  • cman WRED packet|byte supported
  • cman head drop supported
• Debug via stats – rte_tm_node_stats_update() and rte_tm_node_stats_read()
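The stats_mask fields above gate which counters are valid: before trusting a counter returned by rte_tm_node_stats_read(), check its bit in the advertised mask. A sketch; the bit constants below are local illustrative stand-ins in the spirit of DPDK's RTE_TM_STATS_* flags, not the real values.

```c
#include <stdint.h>

/* Illustrative stat bits (stand-ins for RTE_TM_STATS_* flags). */
#define TM_STATS_N_PKTS         (1ULL << 0)
#define TM_STATS_N_BYTES        (1ULL << 1)
#define TM_STATS_N_PKTS_DROPPED (1ULL << 4)

/* A stat is only meaningful if its bit is set in the capability mask. */
int tm_stat_supported(uint64_t stats_mask, uint64_t stat_bit)
{
    return (stats_mask & stat_bit) != 0;
}
```

Reading an unsupported counter typically yields zero, which is easy to misread as "no drops" during debugging.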
15. Memory
• Tailq: rte_dump_tailq(), rte_eal_tailq_lookup()
• Memzone: rte_memzone_dump() (used to debug the config space for tables and counters)
Timer
• rte_timer_manage() – only after enabling debug
Crypto inline
• rte_security_session_stats_get()
16. How to develop custom code to debug?
• For a single process – the debug functionality has to be added to the same process
• For multiple processes – the debug functionality can be added to a secondary process (multi-process mode)
Debug functions can be:
1. A timer callback
2. A service function under a service core
3. A USR1 or USR2 signal handler
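Option 3 can be sketched with plain POSIX signals: hook SIGUSR1 so an operator can trigger state dumps (e.g. rte_ring_dump(), rte_mempool_dump()) on a live process without stopping it. The handler only sets a flag; the heavy dumping belongs in the main loop, since dump functions are not async-signal-safe.

```c
#include <signal.h>

static volatile sig_atomic_t dump_requested;

static void on_sigusr1(int sig)
{
    (void)sig;
    dump_requested = 1; /* polled by the main loop, which does the dumps */
}

int install_debug_handler(void)
{
    /* signal() is enough for a sketch; production code would use sigaction(). */
    return signal(SIGUSR1, on_sigusr1) == SIG_ERR ? -1 : 0;
}
```

Usage: `kill -USR1 <pid>` from a shell flips the flag; the poll loop then calls the dump functions and clears it.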
17. 11. PKT capture with pdump?
Primary and secondary:
• With the primary enabling pdump, the secondary can access it. It copies packets from specific RX or TX queues into the secondary process's ring buffers.
• Need to explore: if the secondary shares the same interface, can we enable capture from the secondary for RX|TX happening on the primary?
• PMD-specific private data – dump the details
• User private data, if present – dump the details
• Useful to identify the packets passed per RX or TX queue.