2. • Introduction
• Use case
• libperf and in-kernel perf API
• Test analysis: direct user space access vs syscall-based
perf counter access
• Design issues and next steps
• Q&A
Fast access to perf Counters
3. • Access to perf counters is not fast enough for the
embedded networking space.
• We think we need:
• The fastest possible access from user space (see use
case).
• Counters that are shared when read-only (no locking overhead).
• A stable API (based on libperf).
• An easy way to access SoC-specific counters.
Introduction
4. • In the fast path (which could be ODP in future), there will be a
method to analyze an ODP crash dump based on statistics.
• Because crash dump statistics are based on the perf hardware
counters, very low-overhead counter access is needed. It
should provide near-accurate CPU or bus clock cycle
precision.
• For example, if the per-packet budget in the fast path is
1000 CPU cycles, then a measurement cannot take 3000 CPU
cycles, as it does today with syscall-based perf counter access in
Linux.
Use Case
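The budget argument on this slide can be written down as a small calculation. The sketch below is illustrative only: the function name and the 100-cycle figure in the usage note are hypothetical, while the 1000-cycle budget and the 3000-cycle syscall cost come from the slide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Percentage of the per-packet cycle budget that is spent on counter
 * reads alone. Hypothetical helper, not part of perf or libperf.
 */
static uint32_t measurement_overhead_percent(uint32_t budget_cycles,
                                             uint32_t read_cost_cycles,
                                             uint32_t reads_per_packet)
{
    assert(budget_cycles > 0);
    return (uint32_t)(((uint64_t)read_cost_cycles * reads_per_packet * 100u)
                      / budget_cycles);
}
```

With the slide's numbers, `measurement_overhead_percent(1000, 3000, 1)` gives 300, i.e. a single syscall-based read costs three times the whole packet budget. A hypothetical 100-cycle read taken twice per packet would cost 20 percent of it.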
5. Perf provides a syscall that opens a perf file descriptor, so that a user space
application can attach events to counters and read them.
sys_perf_event_open - the syscall (originally named sys_perf_counter_open)
- event type attributes for monitoring/sampling
- target pid
- target cpu
- group_fd
- flags
Event type:
- PERF_TYPE_HARDWARE
- PERF_TYPE_SOFTWARE
- PERF_TYPE_TRACEPOINT
- PERF_TYPE_HW_CACHE
- PERF_TYPE_RAW (for raw, implementation-specific hardware event codes)
Perf
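As an illustration of the syscall parameters listed on this slide, here is a minimal sketch that opens a CPU cycle counter for the calling thread and reads it around a region of interest. The helper names (`cycles_attr_init`, `cycles_counter_open`, `cycles_start`, `cycles_stop`) are made up for this example; the syscall, the attribute fields and the ioctl requests are the ones declared in `linux/perf_event.h`. Opening a hardware event may fail inside a VM or without sufficient privileges, so callers should treat -1 as a normal outcome.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no wrapper for perf_event_open, so go through syscall(2). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Describe a user-space-only CPU cycle counter that starts disabled. */
static void cycles_attr_init(struct perf_event_attr *attr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_HARDWARE;
    attr->size = sizeof(*attr);
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->disabled = 1;
    attr->exclude_kernel = 1;
    attr->exclude_hv = 1;
}

/* pid = 0: calling thread, cpu = -1: any CPU, no group, no flags.
 * Returns a file descriptor, or -1 on failure. */
static int cycles_counter_open(void)
{
    struct perf_event_attr attr;
    cycles_attr_init(&attr);
    return (int)perf_event_open(&attr, 0, -1, -1, 0);
}

/* Zero the counter and start counting. Returns 0 or -1. */
static int cycles_start(int fd)
{
    if (ioctl(fd, PERF_EVENT_IOC_RESET, 0) == -1)
        return -1;
    return ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == -1 ? -1 : 0;
}

/* Stop counting and fetch the 64-bit count. Returns 0 or -1. */
static int cycles_stop(int fd, uint64_t *cycles)
{
    if (ioctl(fd, PERF_EVENT_IOC_DISABLE, 0) == -1)
        return -1;
    return read(fd, cycles, sizeof(*cycles)) == (ssize_t)sizeof(*cycles) ? 0 : -1;
}
```

Every measurement here costs two ioctl calls and one read, each of which enters the kernel.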
7. • Libperf creates a set of file descriptors for a group of perf events by calling
the perf_event_open syscall, and performs enable/disable/read operations on
them.
The current API has:
libperf_initialize : sets up a set of fds for profiling
code to read from
libperf_finalize : reads from the fds, prints, and closes all
perf fds
libperf_readcounter : reads a perf counter
libperf_enablecounter : enables a perf counter
libperf_disablecounter : disables a perf counter
libperf_close : closes the fd
Libperf
8. • Raw proposal:
• Mmapping hardware counters to user space could be a way forward for fast
access, removing the overhead of the current kernel implementation.
• Adding a scalable framework in user space (this could be libperf) to read
CPU-specific counters, counters on offload blocks, and other variants of
counters.
• Current mmap-based perf support in the kernel:
• In-kernel perf supports an mmap-based persistent ring-buffer
implementation for user space.
• This implementation is limited in performance for the following reason:
the hardware counter is mappable and stored into the ring buffer with a lot of
synchronisation overhead for user space access, i.e. an rmb for every
perf counter read, locking, and an async wake-up event for user space to
read statistics.
design issues, next step investigation
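To make the synchronisation cost of an mmap-based read concrete, here is a sketch of the sequence-lock read loop that the kernel headers describe for the first mmap'ed page of a perf event (`struct perf_event_mmap_page` in `linux/perf_event.h`). The function name is hypothetical, and `__atomic_thread_fence` is used here as a stand-in for the read barrier. The sketch only reads the `offset` field; a complete reader would add the value read from the hardware counter itself, where the architecture allows that from user space.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read the kernel-maintained count offset from a perf mmap page.
 * The kernel bumps pc->lock around every update, so the reader must
 * retry until it sees the same sequence number before and after.
 */
static int64_t mmap_page_read_offset(const volatile struct perf_event_mmap_page *pc)
{
    uint32_t seq;
    int64_t offset;

    assert(pc != NULL);
    do {
        seq = pc->lock;
        __atomic_thread_fence(__ATOMIC_ACQUIRE); /* barrier before the data read */
        offset = pc->offset;
        __atomic_thread_fence(__ATOMIC_ACQUIRE); /* barrier before the re-check */
    } while (pc->lock != seq);

    return offset;
}
```

Even in the uncontended case, each read pays for two barriers and a possible retry, which is the kind of overhead the slide refers to.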
9. • But:
• The current kernel mappable events are exclusive and
not shareable, and they won't fall back to sysfs perf event
mode. Therefore the approach is not scalable.
• The current kernel counter overhead is still significant,
so the current implementation won't meet the 1000-cycle
requirement of the fast path model, for example the ODP
crash dump statistics requirement mentioned on the earlier slide
[4].
Next Step continued..
10. • Effort to investigate and evaluate these issues:
• Focus on an exclusive fast access approach.
• HW counter pinned to a specific core and a specific task.
• Avoid sync primitives in kernel space while reading the HW counter; let
the user space application handle this job.
• Teach libperf to handle sync primitives and decide on a locking policy.
• The design should be flexible enough to fall back to syscall-based perf
mode.
• Respect SMP policy as much as possible.
Next Step continued..
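One way to picture the fallback requirement is a reader object that prefers a direct user space read and drops back to reading a perf file descriptor when no fast path is available. This is only a sketch of the idea, not a proposed libperf interface: `struct counter_reader`, `counter_read` and `stub_fast_read` are all hypothetical names, and the stub simply returns a constant so the logic can be exercised on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

struct counter_reader {
    uint64_t (*fast_read)(void); /* direct user space read, NULL if unavailable */
    int fd;                      /* perf event fd for the syscall path, -1 if none */
};

/* Try the fast path first, then the syscall path. Returns 0 or -1. */
static int counter_read(const struct counter_reader *r, uint64_t *value)
{
    assert(r != NULL && value != NULL);
    if (r->fast_read != NULL) {
        *value = r->fast_read();
        return 0;
    }
    if (r->fd >= 0 &&
        read(r->fd, value, sizeof(*value)) == (ssize_t)sizeof(*value))
        return 0;
    return -1;
}

/* Stand-in for a real direct counter read, for illustration only. */
static uint64_t stub_fast_read(void)
{
    return 1234;
}
```

Any locking policy would sit above `counter_read`, in line with the idea of keeping synchronisation in user space.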
11. [Figure: user space fast access flow, with labels for the application, the SoC, and the ARM processor core.]
event extensions
12. Custom user space application detail -
• Ran a test application on an Arndale board to measure the delta between user space
and kernel space perf counter access. The result shows close to a 9x improvement.
• A tiny test kernel module enables user-mode access to the perf counter and disables
counter overflow interrupts.
/* enable user-mode access to the performance counters (PMUSERENR) */
asm ("MCR p15, 0, %0, C9, C14, 0\n\t" :: "r"(1));
/* disable counter overflow interrupts (PMINTENCLR) */
asm ("MCR p15, 0, %0, C9, C14, 2\n\t" :: "r"(0x8000000f));
• The user app uses an x86 rdtsc-style timer API to read the cycle counter (PMCCNTR).
static inline uint32_t
rdtsc32(void)
{
#if defined(__GNUC__) && defined(__ARM_ARCH_7A__)
    uint32_t r = 0;
    /* read the 32-bit cycle count register directly from user mode */
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(r) );
    return r;
#else
#error Unsupported architecture/compiler!
#endif
}
Benchmarking current & proposed access
13. Libperf application using the perf syscall -
• Creates a perf event fd using the perf_event_open syscall.
• Reads the perf counter event from the file descriptor.
void
init(void)
{
    /* static storage, so every field not set below is zero */
    static struct perf_event_attr attr;
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    /* pid = 0 (calling thread), cpu = -1 (any CPU), group_fd = -1, flags = 0 */
    fddev = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}
• Both applications run in a tight loop for some duration, and
their delta is recorded for comparison.
Benchmarking cont..
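The tight-loop comparison can be sketched as a small harness that takes the counter read as a function pointer and reports the average delta between two back-to-back reads. The names `avg_read_delta` and `fake_counter` are hypothetical; `fake_counter` is a deterministic stand-in that advances by 10 per call, so the harness can be checked without PMU access. On the ARM target, a function such as `rdtsc32` from the earlier slide, or a wrapper around a read of the perf fd, would be passed instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*read_counter_fn)(void);

/* Average cost, in counter ticks, of one counter read. */
static uint32_t avg_read_delta(read_counter_fn read_counter, uint32_t iterations)
{
    uint64_t total = 0;

    assert(read_counter != NULL);
    for (uint32_t i = 0; i < iterations; i++) {
        uint32_t start = read_counter();
        uint32_t end = read_counter();
        /* unsigned subtraction stays correct across a 32-bit wraparound */
        total += (uint32_t)(end - start);
    }
    return iterations ? (uint32_t)(total / iterations) : 0;
}

/* Stand-in counter: starts near the wrap point and advances 10 per read. */
static uint32_t fake_counter(void)
{
    static uint32_t ticks = 0xFFFFFFF0u;
    ticks += 10;
    return ticks;
}
```

Running the same harness once with the direct read and once with the syscall-based read gives the two averages whose ratio the benchmark reports.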
14. • Direct user space PMU access (enabled) vs the perf syscall-based
application.
Benchmarking cont..
17. More about Linaro Connect: http://connect.linaro.org
More about Linaro: http://www.linaro.org/about/
More about Linaro engineering: http://www.linaro.org/engineering/
Linaro members: www.linaro.org/members