2. Breakup
• To debug an application, isolate it to run in a Linux 64/32-bit environment
• User space only
• Uses DPDK PMDs and libraries
• Data capture works for both static and shared library builds
• Use gcc and GNU debug options too
3. Sample Packet Life (RX → Classifier + Load balance → worker cores → crypto (optional) → traffic manager → TX)
[Figure: packet-path diagram. Packets enter from NIC 1,2,3 … n as rte_mbufs (Rx/Tx metadata + payload) drawn from a mempool of MBUFs; "PKT RX + Classify and distribute" runs on Lcore 1, workers on Lcore 3, 4, 5, 6, the CRYPTO DEV stage on Lcore 7, QoS + TX on Lcore 8, and packets exit via the NICs on Lcore 0/9. "Stats + health" runs on Service core 2.]
4. 1. Is ‘received PKT rate < sent PKT rate’?
Is the generic configuration correct?
• What are the port speed and duplex? – rte_eth_link_get()
• Are packets of larger sizes dropped? – rte_eth_dev_get_mtu()
• Are only specific MACs received? – rte_eth_promiscuous_get()
Are there NIC-specific drops?
• Check rte_eth_rx_queue_info_get() for nb_desc, scattered_rx
• Check rte_eth_stats_get() for per-queue stats
• Do the stats of other queues show no change? – rte_eth_dev_rss_hash_conf_get()
If the problem still persists, the RX lcore thread may not be pulling traffic fast enough:
• Check whether the RX thread, distributor or event RX adapter is holding or processing more than required
• Try rte_prefetch_non_temporal() to bring the pulled mbuf into cache temporarily
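The configuration checklist above can be condensed into one pass over a port snapshot. A minimal sketch: the `port_cfg` struct and issue names are illustrative stand-ins for what rte_eth_link_get(), rte_eth_dev_get_mtu() and rte_eth_promiscuous_get() would return in a real application.

```c
#include <stdint.h>

/* Illustrative snapshot of the values the DPDK calls on this slide return. */
struct port_cfg {
    int link_up;          /* link status from rte_eth_link_get() */
    uint32_t speed_mbps;  /* negotiated speed */
    uint16_t mtu;         /* from rte_eth_dev_get_mtu() */
    int promiscuous;      /* from rte_eth_promiscuous_get() */
};

enum cfg_issue {
    ISSUE_LINK_DOWN    = 1 << 0,
    ISSUE_SLOW_LINK    = 1 << 1, /* negotiated below expected speed */
    ISSUE_SMALL_MTU    = 1 << 2, /* larger packets will be dropped */
    ISSUE_UNICAST_ONLY = 1 << 3, /* only specific MACs are received */
};

/* Walk the checklist and return a bitmask of suspected causes for
 * 'received rate < sent rate'. */
int check_port_config(const struct port_cfg *c,
                      uint32_t expected_mbps, uint16_t max_pkt_size)
{
    int issues = 0;
    if (!c->link_up)
        issues |= ISSUE_LINK_DOWN;
    else if (c->speed_mbps < expected_mbps)
        issues |= ISSUE_SLOW_LINK;
    if (c->mtu < max_pkt_size)
        issues |= ISSUE_SMALL_MTU;
    if (!c->promiscuous)
        issues |= ISSUE_UNICAST_ONLY;
    return issues;
}
```

Running this against each port before digging into per-queue stats quickly rules out the generic misconfigurations.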
5. 2. Are there packet drops?
At RX:
• Get the RX queue count via rte_eth_dev_info_get() for nb_rx_queues
• Check for misses and errors via rte_eth_stats_get(): imissed, ierrors, q_errors, rx_nombuf; also check mbuf reference counts with rte_mbuf_refcnt_read()
At TX:
• Are we transmitting in bulk to reduce the TX descriptor overhead?
• Check rte_eth_stats_get(): oerrors, q_errors; also check mbuf reference counts with rte_mbuf_refcnt_read()
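The drop counters above only help when sampled as deltas. A sketch of that pattern, using a cut-down stand-in for struct rte_eth_stats with only the counters this slide inspects; the classification strings are illustrative:

```c
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for struct rte_eth_stats. */
struct eth_stats {
    uint64_t imissed;   /* RX packets dropped by HW (descriptor ring full) */
    uint64_t ierrors;   /* erroneous received packets */
    uint64_t rx_nombuf; /* RX mbuf allocation failures */
    uint64_t oerrors;   /* failed transmitted packets */
};

/* Compare two snapshots (e.g. taken a second apart via rte_eth_stats_get())
 * and report which class of drops is currently growing. */
const char *classify_drops(const struct eth_stats *prev,
                           const struct eth_stats *cur)
{
    if (cur->imissed > prev->imissed)
        return "rx-ring-full";      /* app not polling fast enough */
    if (cur->rx_nombuf > prev->rx_nombuf)
        return "mempool-exhausted"; /* mbufs not freed back fast enough */
    if (cur->ierrors > prev->ierrors)
        return "rx-errors";         /* bad packets on the wire */
    if (cur->oerrors > prev->oerrors)
        return "tx-errors";
    return "no-new-drops";
}
```

Growing imissed points at the application side; growing rx_nombuf points at mempool sizing, which slide 8 covers.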
6. 3. Drops at producer ring?
Performance for the producer:
• Fetch the type of ring – rte_ring_dump() for ‘flags (RING_F_SP_ENQ)’
• If ‘burst enqueue - actual enqueue > 0’ – check rte_ring_count() or rte_ring_free_count()
• If a burst or single enqueue returns 0, there is no more space – check with rte_ring_full()
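The ‘burst - actual > 0’ symptom is easiest to see on a toy model. The sketch below mimics rte_ring_enqueue_burst() semantics (enqueue as many objects as fit, return the actual count) on a simplified single-producer ring; it is not the real rte_ring layout.

```c
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 8
struct toy_ring {
    void *slots[RING_SIZE];
    unsigned head, tail; /* head: producer index, tail: consumer index */
};

unsigned ring_count(const struct toy_ring *r)      { return r->head - r->tail; }
unsigned ring_free_count(const struct toy_ring *r) { return RING_SIZE - ring_count(r); }
int ring_full(const struct toy_ring *r)            { return ring_free_count(r) == 0; }

/* Enqueue up to n objects; the drop case on this slide shows up as a
 * return value smaller than n. */
unsigned ring_enqueue_burst(struct toy_ring *r, void *const objs[], unsigned n)
{
    unsigned space = ring_free_count(r);
    unsigned todo = n < space ? n : space;
    for (unsigned i = 0; i < todo; i++)
        r->slots[r->head++ % RING_SIZE] = objs[i];
    return todo;
}
```

In a real pipeline, `n - ring_enqueue_burst(...)` per burst is exactly the producer-side drop count worth exporting as a stat.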
7. 4. Drops at consumer ring?
Performance for the consumer:
1. Fetch the type of ring – rte_ring_dump() for ‘flags (RING_F_SC_DEQ)’
2. If ‘burst dequeue - actual dequeue > 0’ – check rte_ring_free_count()
3. If a burst or single dequeue always returns 0 – check whether the ring is empty via rte_ring_empty()
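The three checks above reduce to one decision helper. A sketch, with illustrative diagnosis strings: `dequeued` is the return of a burst dequeue, `burst` the requested size, and `empty` the result of an rte_ring_empty()-style check taken right after.

```c
#include <string.h>

const char *diagnose_consumer(unsigned burst, unsigned dequeued, int empty)
{
    if (dequeued == 0)
        return empty ? "ring-empty-producer-slow"
                     : "dequeue-failed-check-ring-type"; /* e.g. wrong SC/MC flags */
    if (dequeued < burst)
        return "partial-dequeue-producer-lagging";
    return "full-burst";
}
```

An always-empty ring shifts the investigation back to the producer side (slide 6); a non-empty ring with failing dequeues points at the ring's flag configuration.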
8. 5. Are we still not getting packets or objects at the desired rate?
• Are packets received from multiple NICs? – rte_eth_dev_count_all()
• Are NIC interfaces on different sockets? – use rte_eth_dev_socket_id()
• Is the mempool created on the right socket? – rte_mempool_create() or rte_pktmbuf_pool_create()
• Are we seeing drops on a specific socket? – it might require more mempool objects; try allocating more
• Is there a single RX thread for multiple NICs? – try multiple lcores, each reading from a fixed interface
• We might be hitting the cache limit. Increase cache_size at pool create time
If we are still seeing low performance:
• Check if we have sufficient objects – rte_mempool_avail_count()
• Failure on some packets – we might be getting packets larger than the mbuf data size. Check rte_pktmbuf_is_contiguous()
• If a user pthread is used for object access – rte_mempool_cache_create()
• Try using 1 GB huge pages instead of 2 MB; if stuck with 2 MB pages, try rte_mem_lock_page()
A stall in the release of mbufs can be because:
• The processing pipeline is too heavy
• There are too many stages
• TX is not transferring at the desired rate
• Application misuse scenarios:
  a. not freeing packets
  b. rte_mbuf_refcnt_set()
  c. rte_pktmbuf_prefree_seg()
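Undersized pools make every stall above fatal. A back-of-envelope sizing rule implied by this slide: count every location that can hold an mbuf at once, or the pool runs dry and rx_nombuf grows. The formula and names below are an illustrative rule of thumb, not a DPDK API.

```c
#include <stdint.h>

/* Rough lower bound on mbuf pool size for a run-to-completion app. */
unsigned mbuf_pool_size(unsigned nb_ports,
                        unsigned rx_desc_per_port, unsigned tx_desc_per_port,
                        unsigned nb_lcores, unsigned burst_size,
                        unsigned cache_size)
{
    return nb_ports * (rx_desc_per_port + tx_desc_per_port) /* held in NIC rings */
         + nb_lcores * burst_size                           /* in flight per core */
         + nb_lcores * cache_size;                          /* parked in per-lcore caches */
}
```

Note the last term: raising cache_size (as suggested above) silently raises the total object count needed, since cached objects are invisible to other lcores.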
9. 6. Performance issues in crypto lookaside?
Is the generic configuration correct?
• Get the total crypto devices – rte_cryptodev_count()
• Cross-check that SW or HW flags are configured properly – rte_cryptodev_info_get() for feature_flags
Is enqueue request > actual enqueue (drops)?
• Is the queue pair set up on the proper node? – rte_cryptodev_queue_pair_setup() for socket_id
• Is the session pool created on the same socket_id as the queue pair?
• Is the enqueue thread on the same socket_id?
• Check rte_cryptodev_stats_get() for enqueue or dequeue err_count (drops)
• Are multiple threads enqueuing or dequeuing on the same queue pair?
Is enqueue rate > dequeue rate?
• Is the dequeue lcore thread on the same socket_id?
If we are using SW crypto:
• Is the crypto library built with the right flags?
• Is the queue pair using the CPU ISA? – rte_cryptodev_info_get() for feature_flags AVX|SSE
If we are using HW crypto – is the card on the same NUMA socket as the queue pair and session pool?
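Most of the questions above are NUMA-placement checks plus one rate check, which a single helper can encode. A sketch with illustrative names; the socket ids would come from calls like rte_cryptodev_socket_id() and rte_eth_dev_socket_id() in a real application.

```c
#include <string.h>

/* Diagnose a crypto queue pair: all four actors should share one NUMA
 * socket, and enqueued < requested signals queue-pair back-pressure. */
const char *diagnose_crypto(int dev_socket, int qp_socket,
                            int session_pool_socket, int lcore_socket,
                            unsigned requested, unsigned enqueued)
{
    if (!(dev_socket == qp_socket &&
          qp_socket == session_pool_socket &&
          session_pool_socket == lcore_socket))
        return "numa-mismatch";   /* cross-socket traffic on every op */
    if (enqueued < requested)
        return "qp-backpressure"; /* device slower than the enqueue rate */
    return "ok";
}
```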
11. 7. What to check in a process for bottlenecks?
Debug:
• Mode of operation – rte_eal_get_configuration() (master lcore, lcore|service|numa count, process_type)
• Check the lcore run mode – rte_eal_lcore_role() (rte, off, service)
• Process details – rte_dump_stack(), rte_dump_registers(), rte_memdump()
Performance:
• Too many thread context switches – identify the lcore with rte_lcore_id() and its index mapping with rte_lcore_index()
• Check the lcore role type – rte_eal_lcore_role() (rte, off, service)
• Check the CPU core – rte_thread_get_affinity() and rte_eal_get_lcore_state()
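Context-switch counts themselves come from the kernel, not DPDK: Linux reports them per thread in /proc/<pid>/task/<tid>/status. The helper below just parses one line of that file (reading the file is left out to keep the sketch self-contained); a growing nonvoluntary count on an lcore suggests it is not alone on its pinned core.

```c
#include <stdio.h>

/* Extract the counter from a 'voluntary_ctxt_switches' or
 * 'nonvoluntary_ctxt_switches' status line; -1 if the line is neither. */
long parse_ctxt_switches(const char *status_line)
{
    long v = -1;
    if (sscanf(status_line, "voluntary_ctxt_switches: %ld", &v) == 1)
        return v;
    if (sscanf(status_line, "nonvoluntary_ctxt_switches: %ld", &v) == 1)
        return v;
    return -1;
}
```

Sampling this per lcore thread from a stats/health core gives a cheap cross-check on the rte_thread_get_affinity() result above.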
12. 8. What to debug on a service core?
• Get the service core count – rte_service_lcore_count() and compare with the result of rte_eal_get_configuration()
• Check if the registered service is available – rte_service_get_by_name(), rte_service_get_count() and rte_service_get_name()
• Is a given service running in parallel on multiple lcores? – rte_service_probe_capability() and rte_service_map_lcore_get()
• Is the service running? – rte_service_runstate_get()
• Find how many services are running on a specific service lcore – rte_service_lcore_count_services()
• Generic debug – rte_service_dump()
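The checks above combine into a simple triage order. A sketch with illustrative strings: the inputs mirror what rte_service_runstate_get(), rte_service_probe_capability() (for the MT-safe capability) and rte_service_map_lcore_get() would report.

```c
#include <string.h>

const char *diagnose_service(int runstate_running, int mt_safe,
                             unsigned mapped_lcores)
{
    if (!runstate_running)
        return "service-not-running";     /* enable the runstate first */
    if (mapped_lcores == 0)
        return "no-lcore-mapped";         /* map a service lcore to it */
    if (mapped_lcores > 1 && !mt_safe)
        return "unsafe-parallel-mapping"; /* service is not MT-safe */
    return "ok";
}
```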
13. 9. Bottleneck in event_dev?
Is the generic configuration correct?
• Get the event_dev devices – rte_event_dev_count()
• Are they created on the correct socket_id? – rte_event_dev_socket_id()
• Check the HW or SW capabilities – rte_event_dev_info_get() for event_qos, queue_all_types, burst_mode, multiple_queue_port, max_event_queue|dequeue_depth
• Packets stuck in a queue – check for stages (event queues) where packets are looped back to the same or a previous stage
Performance drops when enqueue event count > actual enqueued?
• Dump the event_dev information – rte_event_dev_dump()
• Check the stats for the eventdev's queues and ports
• Check the inflight and current queue element counts for enqueue|dequeue
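The ‘enqueue count > actual enqueued’ symptom is commonly back-pressure from the device's new-event limit: once the inflight count reaches nb_events_limit, enqueues of new events fail until workers release events. A pure model of that admission check (names are illustrative, not the eventdev implementation):

```c
/* How many NEW events out of 'requested' the device would admit, given
 * the current inflight count and the configured nb_events_limit. */
unsigned admit_new_events(unsigned inflight, unsigned nb_events_limit,
                          unsigned requested)
{
    unsigned credits = inflight < nb_events_limit
                     ? nb_events_limit - inflight : 0;
    return requested < credits ? requested : credits;
}
```

If rte_event_dev_dump() shows inflight pinned at the limit, a stage is holding events too long, often the looped-back stage mentioned above.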
14. 10. How to debug an HW or SW traffic manager? Is the configuration right?
• Get the current capabilities of the DPDK port – rte_tm_capabilities_get() for max nodes, levels, shaper_private, shaper_shared, sched_n_children and stats_mask
• Check if the current leaves are configured identically – rte_tm_capabilities_get() for leaf_nodes_identical
• Get the leaf nodes for a DPDK port – rte_tm_get_number_of_leaf_nodes()
• Check level capabilities with rte_tm_level_capabilities_get() for n_nodes:
  • max, nonleaf_max, leaf_max
  • identical, non_identical
  • shaper_private_supported
  • stats_mask
  • cman WRED packet|byte supported
  • cman head drop supported
• Check node capabilities with rte_tm_node_capabilities_get():
  • shaper_private_supported
  • stats_mask
  • cman WRED packet|byte supported
  • cman head drop supported
• Debug via stats – rte_tm_node_stats_update() and rte_tm_node_stats_read()
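The stats_mask fields above gate which counters are valid: before trusting a counter returned by rte_tm_node_stats_read(), check its bit in the advertised mask. A sketch; the bit constants below are local illustrative stand-ins in the spirit of DPDK's RTE_TM_STATS_* flags, not the real values.

```c
#include <stdint.h>

/* Illustrative stat bits (stand-ins for RTE_TM_STATS_* flags). */
#define TM_STATS_N_PKTS         (1ULL << 0)
#define TM_STATS_N_BYTES        (1ULL << 1)
#define TM_STATS_N_PKTS_DROPPED (1ULL << 4)

/* A stat is only meaningful if its bit is set in the capability mask. */
int tm_stat_supported(uint64_t stats_mask, uint64_t stat_bit)
{
    return (stats_mask & stat_bit) != 0;
}
```

Reading an unsupported counter typically yields zero, which is easy to misread as "no drops" during debugging.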
15. Memory
• Tailq: rte_dump_tailq(), rte_eal_tailq_lookup()
• Memzone: rte_memzone_dump() (used to debug the config space for tables and counters)
Timer
• rte_timer_manage() – only after enabling debug
Crypto inline
• rte_security_session_stats_get()
16. How to develop custom code to debug?
• For a single process – the debug functionality has to be added to the same process
• For multiple processes – the debug functionality can be added to a secondary process (multi-process mode)
Debug functions can be:
1. A timer callback
2. A service function under a service core
3. A USR1 or USR2 signal handler
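Option 3 can be sketched with plain POSIX signals: hook SIGUSR1 so an operator can trigger state dumps (e.g. rte_ring_dump(), rte_mempool_dump()) on a live process without stopping it. The handler only sets a flag; the heavy dumping belongs in the main loop, since dump functions are not async-signal-safe.

```c
#include <signal.h>

static volatile sig_atomic_t dump_requested;

static void on_sigusr1(int sig)
{
    (void)sig;
    dump_requested = 1; /* polled by the main loop, which does the dumps */
}

int install_debug_handler(void)
{
    /* signal() is enough for a sketch; production code would use sigaction(). */
    return signal(SIGUSR1, on_sigusr1) == SIG_ERR ? -1 : 0;
}
```

Usage: `kill -USR1 <pid>` from a shell flips the flag; the poll loop then calls the dump functions and clears it.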
17. 11. PKT capture with pdump?
Primary and secondary:
• With the primary enabling pdump, the secondary can access it. It copies packets from specific RX or TX queues into the secondary process's ring buffers.
• Need to explore: if the secondary shares the same interface, can we enable capture from the secondary for RX|TX happening on the primary?
• PMD-specific private data – dump the details
• User private data, if present – dump the details
• Useful to identify the packets passed per RX or TX queue.