ARM Linux Cache Monitoring Using PMU

peemuperf

Cache monitoring on ARM Linux
2012

What is the PMU?
• Cortex-A series processors contain event counting hardware which
can be used to profile and benchmark code, including generation of
cycle and instruction count figures and to derive figures for cache
misses and so forth. The performance counter block contains a cycle
counter which can count processor cycles, or be configured to count
every 64 cycles. There are also a number of configurable 32-bit wide
event counters which can be set to count instances of events from a
wide-ranging list (for example, instructions executed, or MMU TLB
misses). These counters can be accessed through debug tools, or by
software running on the processor, through the CP15 Performance
Monitoring Unit (PMU) registers. They provide a non-invasive debug
feature and do not change the behavior of the processor. CP15 also
provides a number of controls for enabling and resetting the counters
and to indicate overflows (there is an option to generate an interrupt
on a counter overflow). The cycle counter can be enabled
independently of the event counters.
• From ARM Architecture Reference Manual
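
Because the event counters are 32 bits wide, software that samples them periodically has to tolerate wraparound. The following is a minimal sketch of that arithmetic, not code taken from peemuperf. The function names are hypothetical, and it assumes a counter wraps at most once between two samples.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def counter_delta(previous, current):
    """Events counted between two raw samples of a 32-bit counter.

    Assumption: the counter wrapped at most once between the samples.
    """
    return (current - previous) & COUNTER_MASK


def approx_cycles(raw_count, divide_by_64):
    """Approximate processor cycles from a raw cycle counter value.

    When the counter is configured to tick once every 64 cycles, the raw
    value is scaled back up; the result is accurate only to within 64 cycles.
    """
    return raw_count * 64 if divide_by_64 else raw_count
```

A shorter sampling interval makes the single-wrap assumption safer, at the cost of more sampling overhead.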

Profiling alternatives
• OProfile
– Supported in the mainline kernel (drivers/oprofile)
– ARM support is enabled
– Relies on interrupts raised by the PMU hardware when an event counter overflows
– Falls back to a timer when no hardware event monitors are available
• Unfortunately, several errata in current Cortex-A8/A9 devices make interrupt-based monitoring unreliable
– To be fixed in later ARM cores
• As a result, on the majority of ARM cores OProfile supports only timer-based CPU cycle measurement, at least up to the 3.2 kernel

Latest status
• http://lists.infradead.org/pipermail/linux-arm-kernel/2012-June/103189.html
• Patch description, quoted from the post:

Convert OMAP2/3 devices to use HWMOD for creating a PMU device. To support PMU
on OMAP2/3 devices we only need to use MPU sub-system and so we can simply use
the MPU HWMOD to create the PMU device. The MPU HWMOD for OMAP2/3 devices is
currently missing the PMU interrupt and so add the PMU interrupt to the MPU
HWMOD for these devices.

This change also moves the PMU code out of the mach-omap2/devices.c files into
its own pmu.c file as suggested by Kevin Hilman to de-clutter devices.c.

Cc: Ming Lei <ming.lei at canonical.com>
Cc: Will Deacon <will.deacon at arm.com>
Cc: Benoit Cousson <b-cousson at ti.com>
Cc: Paul Walmsley <paul at pwsan.com>
Cc: Kevin Hilman <khilman at ti.com>

Signed-off-by: Jon Hunter <jon-hunter at ti.com>
---
 arch/arm/mach-omap2/Makefile                       |  1 +
 arch/arm/mach-omap2/devices.c                      | 33 -----------
 arch/arm/mach-omap2/omap_hwmod_2xxx_ipblock_data.c |  6 ++
 arch/arm/mach-omap2/omap_hwmod_3xxx_data.c         |  6 ++
 arch/arm/mach-omap2/pmu.c                          | 59 ++++++++++++++++++++
 arch/arm/plat-omap/include/plat/irqs.h             |  1 +
 6 files changed, 73 insertions(+), 33 deletions(-)
 create mode 100644 arch/arm/mach-omap2/pmu.c

Patch status
• The patch set mentioned on the previous slide is in various stages of integration for different SoC families
• Beagle / OMAP35x is supported
• AM335x is not supported as of 2012; support is expected in mainline by 2013
• What are the options in the interim?

What is the need?
• Cache usage monitoring is key to measuring the aspects of performance that depend on external memory bandwidth
• Current OProfile does not support this on various SoCs

peemuperf
• A tool to measure overall Linux performance using the ARM PMU hardware: CPU cycles, cache misses at L1 and L2 level, stalls, NEON, ...
• Consists of a kernel module that does the heavy lifting and exposes all profile information to userspace via a proc entry
Configurable parameters
• evdelay=500 evlist=1,68,3,4 evdebug=1

• evdelay – Sampling interval in milliseconds
• evlist – Comma-separated list of event IDs (see section 3.2.49, "c9, Event Selection Register", of the Cortex-A8 TRM)
• evdebug – Controls debug output messages
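
As an illustration of how these parameters fit together, the sketch below assembles the parameter string shown on this slide and a matching module-load command. It is only a sketch under assumptions: the helper names are hypothetical, the module file name `peemuperf.ko` is assumed rather than taken from the slides, and the values used are the example values above.

```python
def module_args(evdelay=500, evlist=(1, 68, 3, 4), evdebug=1):
    """Build the parameter list using the example values from the slide."""
    return [
        f"evdelay={evdelay}",                        # sampling interval in ms
        "evlist=" + ",".join(str(e) for e in evlist),  # event IDs to count
        f"evdebug={evdebug}",                        # debug output control
    ]


def insmod_command(module_path="peemuperf.ko", **params):
    """Argument vector for loading the module.

    Assumption: the built module file is named peemuperf.ko.
    """
    return ["insmod", module_path] + module_args(**params)


if __name__ == "__main__":
    print(" ".join(insmod_command()))
```

Passing the list returned by `insmod_command()` to `subprocess.run` (as root, on the target board) would be one way to load the module from a test script.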

Userspace access
• The proc entry is
– /proc/peemuperf
• Output is displayed in the following format
– <COUNTER #> : <COUNTER VALUE>
– Counter[0] : 48,
– Counter[1] :77448,
– Counter[2]: 13,
– Counter[3]: 115058
– Overflow flag: = 0, Cycle Count: = 5739253
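
A userspace script can turn this text into numbers with a few regular expressions. The parser below is a sketch based only on the sample output shown on this slide; it assumes all values are printed in decimal, and it is deliberately tolerant of the uneven spacing in the sample. The function and key names are hypothetical.

```python
import re

COUNTER_RE = re.compile(r"Counter\[(\d+)\]\s*:\s*(\d+)")
OVERFLOW_RE = re.compile(r"Overflow flag:\s*=\s*(\d+)")
CYCLES_RE = re.compile(r"Cycle Count:\s*=\s*(\d+)")


def parse_peemuperf(text):
    """Parse the text of /proc/peemuperf into a dictionary.

    Assumption: the format matches the sample on the slide, with decimal values.
    """
    counters = {int(i): int(v) for i, v in COUNTER_RE.findall(text)}
    overflow = OVERFLOW_RE.search(text)
    cycles = CYCLES_RE.search(text)
    return {
        "counters": counters,
        "overflow": int(overflow.group(1)) if overflow else None,
        "cycles": int(cycles.group(1)) if cycles else None,
    }


# Sample text copied from the slide.
SAMPLE = """Counter[0] : 48,
Counter[1] :77448,
Counter[2]: 13,
Counter[3]: 115058
Overflow flag: = 0, Cycle Count: = 5739253
"""

if __name__ == "__main__":
    print(parse_peemuperf(SAMPLE))
```

On a target with the module loaded, the same function could be fed the result of `open("/proc/peemuperf").read()`.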

A8 vs A9
• Cortex-A8 has 4 configurable event counters
• Cortex-A9 has 6
• peemuperf queries the number of counters at run time and configures itself accordingly
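
One way such a run-time query can work on ARMv7 is to read the N field, bits [15:11], of the Performance Monitor Control Register, which reports how many event counters are implemented. Whether peemuperf uses exactly this field is an assumption; the sketch below only shows the bit arithmetic on a register value that has already been read, and the function names are hypothetical.

```python
PMCR_N_SHIFT = 11   # N field occupies bits [15:11] of the control register
PMCR_N_MASK = 0x1F


def event_counter_count(pmcr_value):
    """Number of implemented event counters encoded in a control register value."""
    return (pmcr_value >> PMCR_N_SHIFT) & PMCR_N_MASK


def usable_events(evlist, pmcr_value):
    """Trim a requested event list to the number of counters available."""
    return list(evlist)[:event_counter_count(pmcr_value)]
```

With a register value whose N field is 4, a six-entry event list is cut down to its first four entries; with N equal to 6 all six are kept.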

Default events monitored
• 1 ==> Instruction fetch that causes a refill at the
lowest level of instruction or unified cache
• 68 ==> Any cacheable miss in the L2 cache
• 3 ==> Data read or write operation that causes a
refill at the lowest level of data or unified cache
• 4 ==> Data read or write operation that causes a
cache access at the lowest level of data or
unified cache
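
Events 3 and 4 together give an L1 data cache miss ratio: refills divided by accesses. The sketch below works this out for the sample values from the "Userspace access" slide, assuming that Counter[2] and Counter[3] correspond to the third and fourth entries of evlist (events 3 and 4). That mapping is an assumption, and the function name is hypothetical.

```python
def miss_ratio(refills, accesses):
    """Fraction of cache accesses that caused a refill, or None if no accesses."""
    if accesses == 0:
        return None
    return refills / accesses


if __name__ == "__main__":
    # Sample values from the slide: event 3 (refills) and event 4 (accesses).
    ratio = miss_ratio(13, 115058)
    print(f"L1 data miss ratio: {ratio * 100:.4f}%")
```

For the sample values this comes to roughly 0.0113%, that is, about one refill per 8,850 data cache accesses during that sampling interval.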

Source
• github.com/prabindh/peemuperf
