Computer Performance Microscopy with SHIM

Kathryn McKinley, Microsoft Research
Steve Blackburn, Australian National University
Xi Yang, Australian National University

Benchmark IPC
Intel i7-4770, 3.4 GHz, 4 μops per cycle.
[Figure: IPC per benchmark; y-axis IPC, 0 to 4.]
Lusearch is a DaCapo benchmark based on the widely used open-source search engine framework Lucene.
Plenty of room here!

Interrupt Driven Profilers
Sampling rate: default 1 kHz, maximum 100 kHz.

Method IPC: Lusearch
[Figure: IPC (y-axis 0 to 1.6) of the top 10 methods, which account for 74% of total execution time, measured at the default 1 kHz, at the maximum 100 kHz, and by SHIM at 10 MHz.]

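Sampling IPC
Two counters are read at each sample point in time: C (cycles) and R (retired instructions).
Samples: $(R_0, C_0)$, $(R_1, C_1)$, $(R_2, C_2)$, $(R_3, C_3)$, giving $\mathrm{IPC}_1$, $\mathrm{IPC}_2$, $\mathrm{IPC}_3$.
$\mathrm{IPC}_t = (R_t - R_{t-1}) / (C_t - C_{t-1})$
IPC is a high-frequency signal.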
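The IPC formula above can be sketched in a few lines of Python. This is only an illustration: the function name `ipc_series` and the sample values are made up for the example, and the readings are assumed to be cumulative counter values taken at successive sample points.

```python
def ipc_series(samples):
    """Compute IPC between successive samples.

    samples: list of (retired_instructions, cycles) cumulative readings,
    i.e. (R_t, C_t) in the notation of the slide. Hypothetical helper.
    """
    result = []
    for (r_prev, c_prev), (r_cur, c_cur) in zip(samples, samples[1:]):
        d_cycles = c_cur - c_prev
        if d_cycles <= 0:
            continue  # no elapsed cycles: IPC is undefined for this interval
        result.append((r_cur - r_prev) / d_cycles)
    return result


# Made-up readings (R0, C0) ... (R3, C3), not measured data.
readings = [(0, 0), (200, 100), (250, 200), (650, 300)]
ipcs = ipc_series(readings)
print(ipcs)  # [2.0, 0.5, 4.0]
```

Even in this toy input the IPC swings between 0.5 and 4.0 from one interval to the next, which is the sense in which a coarse sampling period averages away a high-frequency signal.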

Sampling Lusearch IPC
[Figure: three time-series panels of Lusearch IPC (y-axis 0 to 2.5), sampled by SHIM at 10 MHz, at the maximum 100 kHz, and at the default 1 kHz.]

#define DEFAULT_MAX_SAMPLE_RATE 100000
/*
* perf samples are done in some very critical code paths (NMIs).
* If they take too much CPU time, the system can lock up and not
* get any real work done. This will drop the sample rate when
         profilers  SHIM  simulators
HiFi     ✗          ✓     ✓
handy    ✓          ✓     ✗
online   ✓          ✓     ✗

Hardware and Software Generate Signals
Hardware signals: hardware performance counters.
Software signals: memory locations. For example, as the code below runs, the executing method changes over time from A() to B() to C().
A (x){
    x.y = B();
    x.z = C();
}

Signals
                    hardware signals  software signals
hardware counters   ✓                 ✓
software tags       ✓                 ✓

Observe Signals From Another Hardware Context

Observe Global Counters
Example: LLC misses per cycle.
while (true):
    for counter in LLC misses, cycles:
        buf[i++] = readCounter(counter)

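Observe Local Counters
SHIM runs on hardware thread HT2 and observes the application thread on HT1.
while (true):
    for counter in HT2 SHIM, Core, cycles:
        buf[i++] = readCounter(counter);
$\text{HT1 IPC} = \text{Core IPC} - \text{HT2 SHIM IPC}$
[Figure: HT1 IPC, Core IPC, and HT2 SHIM IPC, each on a 0 to 4 axis.]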
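The subtraction HT1 IPC = Core IPC - HT2 SHIM IPC can be illustrated as follows. This is a minimal sketch, not SHIM's implementation: `ht1_ipc` is a hypothetical helper, and it assumes each counter is given as a (previous, current) pair of cumulative readings taken over the same interval.

```python
def ht1_ipc(core_instr, shim_instr, cycles):
    """Estimate the IPC of the observed thread on HT1.

    Each argument is a (previous, current) pair of cumulative readings:
    core_instr - instructions retired by the whole core (HT1 + HT2)
    shim_instr - instructions retired by the SHIM thread on HT2
    cycles     - elapsed cycles
    All names are illustrative assumptions.
    """
    d_cycles = cycles[1] - cycles[0]
    if d_cycles <= 0:
        raise ValueError("cycle counter did not advance")
    core_ipc = (core_instr[1] - core_instr[0]) / d_cycles
    shim_ipc = (shim_instr[1] - shim_instr[0]) / d_cycles
    # The core-wide count includes SHIM's own instructions, so remove them.
    return core_ipc - shim_ipc


# Made-up readings: the core retires 400 instructions in 200 cycles,
# of which 100 belong to the SHIM thread.
estimate = ht1_ipc(core_instr=(1000, 1400), shim_instr=(500, 600), cycles=(0, 200))
print(estimate)  # 1.5
```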

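Correlate Hardware and Software Signals
while (true):
    for counter in HT2 SHIM, Core, cycles:
        buf[i++] = readCounter(counter);
    tid = thread on HT1
    buf[i++] = tid.method;
[Figure: HT1 IPC, Core IPC, and HT2 SHIM IPC (each on a 0 to 4 axis) for samples 1 to 3, aligned with the methods A(), B(), and C() on the stack of the thread on HT1. SHIM runs on HT2.]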
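Once each sample carries both counter deltas and a method ID, attributing IPC to methods is a grouping step. The sketch below assumes the buffer has already been decoded into (method, instruction delta, cycle delta) tuples; that record layout and the name `ipc_by_method` are assumptions made for the example.

```python
from collections import defaultdict


def ipc_by_method(samples):
    """Aggregate per-sample counter deltas into one IPC value per method.

    samples: iterable of (method_id, instructions_delta, cycles_delta).
    This layout is hypothetical, chosen only for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # method -> [instructions, cycles]
    for method, d_instr, d_cycles in samples:
        totals[method][0] += d_instr
        totals[method][1] += d_cycles
    # Divide summed instructions by summed cycles, skipping empty methods.
    return {m: instr / cyc for m, (instr, cyc) in totals.items() if cyc > 0}


# Made-up samples tagged with the methods from the slide.
decoded = [("A", 100, 100), ("B", 300, 100), ("A", 50, 100), ("C", 40, 100)]
per_method = ipc_by_method(decoded)
print(per_method)  # {'A': 0.75, 'B': 3.0, 'C': 0.4}
```

Summing instructions and cycles before dividing weights each method's IPC by the time spent in it, rather than averaging per-sample ratios.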

Raw Samples
[Figure: distribution of raw samples; x-axis IPC (log scale), y-axis % of samples (log scale).]

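Problem: Samples Are Not Atomic
Counters: C (cycles), R (retired instructions).
Samples: $(R_0, C_0)$, $(R_1, C_1)$, $(R_2, C_2)$, $(R_3, C_3)$, giving $\mathrm{IPC}_1$, $\mathrm{IPC}_2$, $\mathrm{IPC}_3$.
$\mathrm{IPC}_t = (R_t - R_{t-1}) / (C_t - C_{t-1})$
Because the counters in a sample are not read atomically, some IPC values are valid (✓) and some are not (✗).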

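Solution: Use Clock As Ground Truth
Each sample $t$ is bracketed by two reads of the cycle clock: $C^s_t$ at the start and $C^e_t$ at the end.
Samples: $(C^s_0, R_0, C_0, C^e_0)$, $(C^s_1, R_1, C_1, C^e_1)$, $(C^s_2, R_2, C_2, C^e_2)$, $(C^s_3, R_3, C_3, C^e_3)$, giving $\mathrm{IPC}_1$, $\mathrm{IPC}_2$, $\mathrm{IPC}_3$.
$\mathrm{CPC}_t = (C^e_t - C^e_{t-1}) / (C^s_t - C^s_{t-1})$ (this should be 1!)
$\mathrm{CPC}_1 = 1.0 \pm 1\%$ (✓), $\mathrm{CPC}_2 = 1.0 \pm 1\%$ (✓), $\mathrm{CPC}_3 \neq 1.0 \pm 1\%$ (✗)
while (true):
    buf[i++] = readCycle(); // read Cs
    for counter in HT2 SHIM, Core, cycles:
        buf[i++] = readCounter(counter);
    buf[i++] = readCycle(); // read Ce
    tid = thread on HT1
    buf[i++] = tid.method;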
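The cycles-per-cycle (CPC) check can be sketched as a filter over recorded samples. This is an illustrative sketch under stated assumptions, not SHIM's code: each sample is assumed to be a tuple (Cs, R, C, Ce), the function name `filtered_ipc` is made up, and the 1% tolerance follows the slide.

```python
def filtered_ipc(samples, tolerance=0.01):
    """Keep only IPC values whose enclosing sample has CPC within 1 +/- tolerance.

    samples: list of (cs, r, c, ce) tuples, where cs and ce are the clock
    reads that bracket the counter reads r (retired instructions) and
    c (cycles). The tuple layout is an assumption for this example.
    """
    kept = []
    for prev, cur in zip(samples, samples[1:]):
        cs0, r0, c0, ce0 = prev
        cs1, r1, c1, ce1 = cur
        d_start = cs1 - cs0
        d_cycles = c1 - c0
        if d_start <= 0 or d_cycles <= 0:
            continue  # clock or counter did not advance: discard
        # Start-to-start and end-to-end clock deltas should match if the
        # sample was taken without disturbance.
        cpc = (ce1 - ce0) / d_start
        if abs(cpc - 1.0) <= tolerance:
            kept.append((r1 - r0) / d_cycles)
    return kept


# Made-up data: the last sample is stretched (its end clock read is late),
# so its CPC is 1.39 and the corresponding IPC is dropped.
raw = [
    (0, 0, 5, 10),
    (1000, 1500, 1005, 1010),
    (2000, 2500, 2005, 2010),
    (3000, 3800, 3005, 3400),
]
clean = filtered_ipc(raw)
print(clean)  # [1.5, 1.0]
```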

Filter Lusearch Samples
[Figure: distributions of raw IPC, raw CPC, filtered IPC, and filtered CPC in [0.99, 1.01]; y-axis % of samples (log scale).]

Software Signal, Other Core
SHIM observes method and loop IDs.
[Figure: results normalized to without SHIM (y-axis 0 to 4), for sampling periods of 30 cycles and 1213 cycles.]
Overheads are from write-invalidate transactions.
30 cycles (113 MHz): more than three orders of magnitude better than 'maximum'.
1213 cycles (3 MHz): more than an order of magnitude better than 'maximum'.
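The sampling rates quoted here follow from dividing the clock frequency by the sampling period in cycles. The short calculation below assumes the 3.4 GHz clock of the Intel i7-4770 named earlier in the talk and compares against the 100 kHz maximum of interrupt-driven profilers; the function and constant names are illustrative.

```python
CLOCK_HZ = 3.4e9          # Intel i7-4770 clock frequency, as stated in the talk
MAX_INTERRUPT_HZ = 100e3  # 'maximum' rate of interrupt-driven profilers


def sample_rate_hz(period_cycles, clock_hz=CLOCK_HZ):
    """Sampling frequency for one sample every `period_cycles` cycles."""
    return clock_hz / period_cycles


for period in (30, 1213):
    rate = sample_rate_hz(period)
    ratio = rate / MAX_INTERRUPT_HZ
    print(f"{period} cycles: {rate / 1e6:.1f} MHz, {ratio:.0f}x the 100 kHz maximum")
```

A 30-cycle period works out to roughly 113 MHz and a 1213-cycle period to roughly 2.8 MHz, which the slide rounds to 3 MHz.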

Software Signal, Same Core
SHIM observes method and loop IDs.
[Figure: y-axis 0 to 2.5, for sampling periods of 15 cycles and 1505 cycles.]
Overheads are from sharing the core resources.

Hardware and Software Signals, Same Core
SHIM correlates IPC with method and loop IDs.
[Figure: y-axis 0 to 1.8, for a sampling period of 495 cycles.]

Reducing Overheads
• Bursty sampling
• SMT priorities
• Heterogeneous multicore
• Globally visible per-thread performance counters

Conclusion
• High frequency sampling is important
• SHIM observes signals directly, low overhead
• Cycles per cycle filters samples
• Opportunities for hardware analysis
• Opportunities for hardware design
Questions?
https://github.com/ShimProfiler/SHIM

100 kHz (10 μs): high or low?

10 μs is not bad?
25 μs! Simple Address Book
*Name: Xi YANG
*Email: xi.yang@anu.edu.au

100 kHz (10 μs) won't see this
The 25 μs life of address_book.SerializeToOstream(&output).
Sampling at 5 MHz (608 cycles).
