Peemuperf is a Linux kernel module and userspace tool that uses the Performance Monitoring Unit (PMU) on ARM processors to monitor performance metrics like CPU cycles, cache misses, and stalls. It can profile the ARM Cortex A8 and A9 by dynamically configuring the number and types of performance counters. The tool outputs profiling data to the Linux proc filesystem for inspection in userspace. Peemuperf aims to provide cache monitoring capabilities for ARM devices where the oprofile tool is currently limited.
2. What is the PMU?
• Cortex-A series processors contain event counting hardware which
can be used to profile and benchmark code, including generation of
cycle and instruction count figures and to derive figures for cache
misses and so forth. The performance counter block contains a cycle
counter which can count processor cycles, or be configured to count
every 64 cycles. There are also a number of configurable 32-bit wide
event counters which can be set to count instances of events from a
wide-ranging list (for example, instructions executed, or MMU TLB
misses). These counters can be accessed through debug tools, or by
software running on the processor, through the CP15 Performance
Monitoring Unit (PMU) registers. They provide a non-invasive debug
feature and do not change the behavior of the processor. CP15 also
provides a number of controls for enabling and resetting the counters
and to indicate overflows (there is an option to generate an interrupt
on a counter overflow). The cycle counter can be enabled
independently of the event counters.
• From ARM Architecture Reference Manual
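The control bits described above can be sketched in C. This is a minimal illustration, not peemuperf's actual code: the bit positions follow the ARMv7 PMCR layout (E, P, C, D in bits 0 to 3), while the helper names `pmcr_start_value` and `pmu_cycles` are hypothetical. On real hardware the value would be written to the CP15 PMCR register from privileged code; here only the bit arithmetic is shown.

```c
#include <stdint.h>

/* ARMv7 PMCR (Performance Monitor Control Register) control bits. */
#define PMCR_E (1u << 0)  /* enable all counters */
#define PMCR_P (1u << 1)  /* reset all event counters */
#define PMCR_C (1u << 2)  /* reset the cycle counter */
#define PMCR_D (1u << 3)  /* cycle counter ticks once every 64 cycles */

/* Hypothetical helper: PMCR value that resets and enables the counters,
 * optionally turning on the divide-by-64 mode of the cycle counter. */
static uint32_t pmcr_start_value(int divide_by_64)
{
    uint32_t v = PMCR_E | PMCR_P | PMCR_C;
    if (divide_by_64)
        v |= PMCR_D;
    return v;
}

/* Hypothetical helper: convert a raw cycle-counter reading into CPU
 * cycles, taking the divide-by-64 setting into account. */
static uint64_t pmu_cycles(uint32_t raw_ccnt, uint32_t pmcr)
{
    return (pmcr & PMCR_D) ? (uint64_t)raw_ccnt * 64u : (uint64_t)raw_ccnt;
}
```

With the divider enabled, a raw reading of 100 corresponds to 6400 processor cycles, at the cost of a coarser resolution.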
3. Profiling alternatives
• Oprofile
– Supported in mainline kernel (drivers/oprofile)
– ARM support enabled
– Relies on interrupts from the HW unit when event counters
overflow
– Timer fallback when no HW event monitors are available
• Unfortunately, various errata in current ARM Cortex-A8/A9
devices make interrupt-based monitoring unreliable
– To be fixed in later ARM cores
• Because of this, oprofile supports only timer-based CPU cycle
measurement on the majority of ARM cores,
at least up to the 3.2 kernel
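One way to live without the overflow interrupt is to poll the 32-bit counters and accumulate the differences in software. The sketch below illustrates that idea only; it is an assumption, not a description of how oprofile or peemuperf work, and the `polled_counter` type and its update function are hypothetical. It stays correct as long as the counter is polled at least once per wrap-around.

```c
#include <stdint.h>

/* Hypothetical accumulator for one polled 32-bit event counter. */
struct polled_counter {
    uint32_t last_raw;  /* raw value seen at the previous poll */
    uint64_t total;     /* 64-bit running total of counted events */
};

/* Fold a new raw reading into the running total. */
static void polled_counter_update(struct polled_counter *pc, uint32_t raw_now)
{
    /* Unsigned subtraction wraps modulo 2^32, so the delta is still
     * right if the hardware counter wrapped once since the last poll. */
    uint32_t delta = raw_now - pc->last_raw;

    pc->total += delta;
    pc->last_raw = raw_now;
}
```

The trade-off is that a counter wrapping more than once between two polls goes unnoticed, so the polling period has to be chosen with the fastest expected event rate in mind.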
4. Latest status
• http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/103189.html
• Convert OMAP2/3 devices to use HWMOD for creating a PMU device. To support PMU
on OMAP2/3 devices we only need to use MPU sub-system and so we can simply use
the MPU HWMOD to create the PMU device. The MPU HWMOD for OMAP2/3 devices is
currently missing the PMU interrupt and so add the PMU interrupt to the MPU
HWMOD for these devices.
• This change also moves the PMU code out of the mach-omap2/devices.c files into
its own pmu.c file as suggested by Kevin Hilman to de-clutter devices.c.
• Cc: Ming Lei <ming.lei at canonical.com>
• Cc: Will Deacon <will.deacon at arm.com>
• Cc: Benoit Cousson <b-cousson at ti.com>
• Cc: Paul Walmsley <paul at pwsan.com>
• Cc: Kevin Hilman <khilman at ti.com>
• Signed-off-by: Jon Hunter <jon-hunter at ti.com>
• ---
• arch/arm/mach-omap2/Makefile | 1+
• arch/arm/mach-omap2/devices.c | 33 -----------
• arch/arm/mach-omap2/omap_hwmod_2xxx_ipblock_data.c | 6 ++
• arch/arm/mach-omap2/omap_hwmod_3xxx_data.c | 6 ++
• arch/arm/mach-omap2/pmu.c | 59 ++++++++++++++++++++
• arch/arm/plat-omap/include/plat/irqs.h | 1+
• 6 files changed, 73 insertions(+), 33 deletions(-)
• create mode 100644 arch/arm/mach-omap2/pmu.c
5. Patch status
• The patch set mentioned in the earlier slide is
in various stages of integration into
different SoC architectures
• Beagle / OMAP35x is supported
• AM335x is not supported as of
2012; support is expected in mainline by 2013
• In the interim, what are the options?
6. What is the need?
• Cache usage monitoring is key to
measuring the aspects of performance
related to external memory bandwidth
• Current oprofile does not support this on
various SoCs
7. peemuperf
• A tool to measure overall Linux
performance using the ARM PMU hardware:
CPU cycles, cache misses at the L1
and L2 levels, stalls, NEON, and so on
• Consists of a kernel module that does the
heavy lifting and exposes all profile
information to userspace via a proc entry
10. A8 vs A9
• A8 has 4 performance counters
• A9 has 6
• peemuperf configures itself dynamically, based
on a run-time query of the available counters
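A run-time query of this kind can be done by reading the N field of the ARMv7 PMCR register, bits [15:11], which reports how many event counters the core implements. The sketch below shows only the decoding step; whether peemuperf does exactly this is an assumption, and both function names are hypothetical.

```c
#include <stdint.h>

/* PMCR.N, bits [15:11]: number of event counters implemented. */
static unsigned pmcr_num_event_counters(uint32_t pmcr)
{
    return (pmcr >> 11) & 0x1Fu;
}

/* Hypothetical helper: never program more events than the hardware
 * has counters for (4 on Cortex-A8, 6 on Cortex-A9). */
static unsigned events_to_program(uint32_t pmcr, unsigned requested)
{
    unsigned available = pmcr_num_event_counters(pmcr);

    return requested < available ? requested : available;
}
```

For a PMCR value whose N field is 4, a request for 6 events is clamped to 4, while a core reporting 6 counters accepts all 6.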
11. Default Events monitored
• 1 ==> Instruction fetch that causes a refill at the
lowest level of instruction or unified cache
• 68 ==> Any cacheable miss in the L2 cache
• 3 ==> Data read or write operation that causes a
refill at the lowest level of data or unified cache
• 4 ==> Data read or write operation that causes a
cache access at the lowest level of data or
unified cache
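The default event numbers above can be kept in a small table, and events 3 and 4 combine into an L1 data cache miss ratio (refills divided by accesses). This is a sketch under stated assumptions: the array name, the ordering, and the `l1d_miss_percent` helper are hypothetical and not taken from the peemuperf source.

```c
#include <stdint.h>

/* Default event numbers from the slide, in decimal (68 == 0x44). */
static const uint8_t default_events[] = {
    1,   /* instruction fetch causing a lowest-level cache refill */
    68,  /* any cacheable miss in the L2 cache */
    3,   /* data read/write causing a lowest-level cache refill */
    4,   /* data read/write causing a lowest-level cache access */
};

#define NUM_DEFAULT_EVENTS (sizeof default_events / sizeof default_events[0])

/* Hypothetical helper: L1 data miss ratio in percent, computed from the
 * counts of event 3 (refills) and event 4 (accesses). */
static double l1d_miss_percent(uint64_t refills, uint64_t accesses)
{
    if (accesses == 0)
        return 0.0;  /* avoid dividing by zero when nothing was counted */
    return 100.0 * (double)refills / (double)accesses;
}
```

For example, 25 refills out of 1000 accesses gives a miss ratio of 2.5 percent.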