Computer Performance Microscopy with SHIM
Kathryn McKinley, Microsoft Research
Steve Blackburn, Australian National University
Xi Yang, Australian National University
Benchmark IPC
[Figure: benchmark IPC on an Intel i7-4770 (3.4 GHz, 4 μops); y-axis: IPC, 0 to 4.]
Lusearch is a DaCapo benchmark based on
the widely used open source search engine
framework Lucene.
Plenty of room here!
Interrupt Driven Profilers
Sampling at default 1 KHz, maximum 100 KHz.
Method IPC: Lusearch
[Figure: IPC of the top 10 methods (74% of total execution time), as measured at the default 1 KHz, at the maximum 100 KHz, and by SHIM at 10 MHz; x-axis: methods 1 to 10; y-axis: IPC, 0 to 1.6.]
Sampling IPC
Two counters: C (cycles) and R (retired instructions)
Samples over time: (R_0, C_0), (R_1, C_1), (R_2, C_2), (R_3, C_3), giving IPC_1, IPC_2, IPC_3
IPC_t = (R_t - R_{t-1}) / (C_t - C_{t-1})
IPC is a high frequency signal.
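As a minimal sketch of the IPC formula above, the following Python turns cumulative counter readings into per-interval IPC. The function name `ipc_series` and the readings are made up for illustration; they are not measurements from the talk.

```python
from typing import List, Tuple


def ipc_series(samples: List[Tuple[int, int]]) -> List[float]:
    """Turn cumulative (retired_instructions, cycles) readings into per-interval IPC."""
    ipcs = []
    for (r_prev, c_prev), (r_cur, c_cur) in zip(samples, samples[1:]):
        delta_cycles = c_cur - c_prev
        if delta_cycles <= 0:
            continue  # no cycles elapsed between readings; skip the interval
        ipcs.append((r_cur - r_prev) / delta_cycles)
    return ipcs


# Hypothetical cumulative readings (R_t, C_t).
readings = [(0, 0), (150, 100), (200, 200), (500, 300)]
fine = ipc_series(readings)                       # use every reading
coarse = ipc_series([readings[0], readings[-1]])  # use only the first and last reading
print(fine, coarse)
```

With these readings the fine-grained series swings between 0.5 and 3.0, while the coarse series collapses to a single average of about 1.67. That is one way to picture why a low sampling rate hides a high frequency signal.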
Sampling Lusearch IPC
[Figure: three panels of sampled Lusearch IPC: SHIM at 10 MHz, the maximum 100 KHz, and the default 1 KHz; y-axis of each panel: IPC, 0 to 2.5.]
#define DEFAULT_MAX_SAMPLE_RATE 100000
/*
* perf samples are done in some very critical code paths (NMIs).
* If they take too much CPU time, the system can lock up and not
* get any real work done. This will drop the sample rate when
[Table: profilers, SHIM, and simulators compared on three properties (HiFi, handy, online), with a ✓ or ✗ in each cell.]
insight
Hardware and Software Generate Signals
hardware signals: hardware performance counters
software signals: memory locations
A(x) {
    x.y = B();
    x.z = C();
}
[Figure: timeline of A(), B(), and C() executing over time.]
Signals
[Table: hardware signals and software signals (columns) against hardware and software, counters and tags; all four cells are marked ✓.]
Observe Signals From Another Hardware Context
SHIM design
Observe Global Counters
while (true):
    for counter in LLC misses, cycles:
        buf[i++] = readCounter(counter)
[Figure: LLC misses per cycle; y-axis 0 to 4.]
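The observation loop can be sketched in Python as follows, assuming a caller-supplied `read_counter` function. Both `observe` and `make_fake_counters` are hypothetical names, and the fake counters exist only so the sketch runs without hardware access. A real observer would read hardware performance counters and run far tighter than Python allows.

```python
from array import array
from typing import Callable, Sequence


def observe(read_counter: Callable[[str], int],
            counters: Sequence[str],
            num_samples: int) -> array:
    """Read each counter in turn and append the raw values to a preallocated buffer."""
    buf = array("Q", [0]) * (num_samples * len(counters))
    i = 0
    for _ in range(num_samples):  # bounded here; the slide loops forever
        for name in counters:
            buf[i] = read_counter(name)
            i += 1
    return buf


def make_fake_counters() -> Callable[[str], int]:
    """Stand-in for hardware counters: every read costs 10 pretend cycles."""
    state = {"llc_misses": 0, "cycles": 0}

    def read(name: str) -> int:
        state["cycles"] += 10
        if name == "llc_misses":
            state["llc_misses"] += 1  # pretend one miss per miss-counter read
        return state[name]

    return read


buf = observe(make_fake_counters(), ["llc_misses", "cycles"], num_samples=2)
print(list(buf))
```

The buffer holds interleaved raw readings; rates such as LLC misses per cycle are computed afterwards from the differences between consecutive samples.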
Observe Local Counters
while (true):
    for counter in HT2 SHIM, Core, Cycles:
        buf[i++] = readCounter(counter);
HT1 IPC = Core IPC - HT2 SHIM IPC
[Figure: hardware threads HT1 and HT2; panels HT1 IPC, Core IPC, and HT2 SHIM IPC, each with a y-axis from 0 to 4.]
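The subtraction HT1 IPC = Core IPC - HT2 SHIM IPC can be written out as a small sketch. It assumes that the three counter deltas cover the same interval; the function and parameter names are illustrative only.

```python
def ht1_ipc(core_instr_delta: int, shim_instr_delta: int, cycle_delta: int) -> float:
    """Infer the IPC of the thread on HT1 from whole-core and SHIM (HT2) counters."""
    if cycle_delta <= 0:
        raise ValueError("cycle_delta must be positive")
    core_ipc = core_instr_delta / cycle_delta  # instructions retired by both threads
    shim_ipc = shim_instr_delta / cycle_delta  # instructions retired by SHIM itself
    return core_ipc - shim_ipc


# Hypothetical deltas: the core retired 300 instructions in 100 cycles, 100 of them by SHIM.
print(ht1_ipc(core_instr_delta=300, shim_instr_delta=100, cycle_delta=100))
```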
Correlate Hardware and Software Signals
while (true):
    for counter in HT2 SHIM, Core, cycles:
        buf[i++] = readCounter(counter);
    tid = thread on HT1
    buf[i++] = tid.method;
[Figure: HT1 IPC, Core IPC, and HT2 SHIM IPC (each y-axis 0 to 4) for samples 1, 2, and 3, aligned with the HT1 stack showing methods A(), B(), and C(); HT1 and HT2 are the two hardware threads.]
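Once each sample carries a method tag next to its counter deltas, per-method IPC is an aggregation over the buffer. The sketch below assumes samples already reduced to (instruction delta, cycle delta, method) triples; that layout and the name `ipc_by_method` are assumptions for illustration, not SHIM's buffer format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def ipc_by_method(samples: Iterable[Tuple[int, int, str]]) -> Dict[str, float]:
    """Aggregate (instr_delta, cycle_delta, method) samples into IPC per method."""
    instr: Dict[str, int] = defaultdict(int)
    cycles: Dict[str, int] = defaultdict(int)
    for instr_delta, cycle_delta, method in samples:
        instr[method] += instr_delta
        cycles[method] += cycle_delta
    # Divide totals rather than averaging per-sample IPC, so long samples weigh more.
    return {m: instr[m] / cycles[m] for m in instr if cycles[m] > 0}


# Hypothetical samples tagged with the method running on HT1.
tagged = [(100, 100, "A"), (300, 100, "B"), (50, 100, "A")]
print(ipc_by_method(tagged))
```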
Fidelity
Raw Samples
[Figure: distribution of raw samples; axes: IPC (log scale) and % of samples (log scale).]
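Problem: Samples Are Not Atomic
Counters: C (cycles) and R (retired instructions)
Samples over time: (R_0, C_0), (R_1, C_1), (R_2, C_2), (R_3, C_3), giving IPC_1, IPC_2, IPC_3
IPC_t = (R_t - R_{t-1}) / (C_t - C_{t-1})
[Figure: of the three IPC samples, two are marked ✓ and one is marked ✗.]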
Solution: Use Clock As Ground Truth
Each sample reads the cycle counter at its start (C^s) and at its end (C^e):
(C^s_0, R_0, C_0, C^e_0), (C^s_1, R_1, C_1, C^e_1), (C^s_2, R_2, C_2, C^e_2), (C^s_3, R_3, C_3, C^e_3), giving IPC_1, IPC_2, IPC_3
CPC_t = (C^e_t - C^e_{t-1}) / (C^s_t - C^s_{t-1}), this should be 1!
CPC_1 = 1.0 +/- 1% (✓), CPC_2 = 1.0 +/- 1% (✓), CPC_3 != 1.0 +/- 1% (✗)
while (true):
    buf[i++] = readCycle(); // read C^s
    for counter in HT2 SHIM, Core, cycles:
        buf[i++] = readCounter(counter);
    buf[i++] = readCycle(); // read C^e
    tid = thread on HT1
    buf[i++] = tid.method;
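The cycles-per-cycle (CPC) check can be sketched as a filter over consecutive samples. The `Sample` layout, the function name `filtered_ipc`, and the numbers are assumptions chosen for illustration; the 1% tolerance follows the slide.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    cs: int  # cycle counter read at the start of the sample
    r: int   # retired instructions
    c: int   # cycles
    ce: int  # cycle counter read at the end of the sample


def filtered_ipc(samples: List[Sample], tolerance: float = 0.01) -> List[float]:
    """Keep IPC only for intervals whose CPC is within tolerance of 1.0."""
    kept = []
    for prev, cur in zip(samples, samples[1:]):
        start_delta = cur.cs - prev.cs
        cycle_delta = cur.c - prev.c
        if start_delta <= 0 or cycle_delta <= 0:
            continue  # degenerate interval; nothing to compute
        cpc = (cur.ce - prev.ce) / start_delta
        if abs(cpc - 1.0) <= tolerance:
            kept.append((cur.r - prev.r) / cycle_delta)
    return kept


# Hypothetical samples; the third one was disturbed between its two clock reads.
samples = [
    Sample(0, 0, 10, 20),
    Sample(1000, 1500, 1010, 1020),
    Sample(2000, 2000, 2010, 2520),  # end-of-sample clock read arrived 500 cycles late
    Sample(3000, 4000, 3010, 3020),
    Sample(4000, 5000, 4010, 4020),
]
print(filtered_ipc(samples))
```

Because CPC is computed from differences, one disturbed sample invalidates both intervals it borders. In this example only the first and last intervals survive the filter.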
Filter Lusearch Samples
[Figure: % of samples (log scale) for four series: raw IPC, raw CPC, filtered IPC, and filtered CPC in [0.99, 1.01].]
overheads
Software Signal: Other Core
observe method and loop IDs.
[Figure: y-axis: normalized to without SHIM, 0 to 4; x-axis labels: 30 cycles and 1213 cycles.]
Overheads are from write invalidate transactions.
3 MHz: more than an order of magnitude better than ‘maximum’
113 MHz: more than three orders of magnitude better than ‘maximum’
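The two frequencies follow from the cycle counts if those counts are read as sampling periods and the clock is the 3.4 GHz of the Intel i7-4770 named earlier in the deck. Both readings are assumptions of this sketch rather than statements from the slide.

```python
CLOCK_HZ = 3.4e9                 # assumed: 3.4 GHz Intel i7-4770 from the deck
MAX_INTERRUPT_RATE_HZ = 100_000  # the 'maximum' 100 KHz interrupt-driven rate


def sampling_rate_hz(cycles_per_sample: float) -> float:
    """Sampling frequency implied by a sampling period given in cycles."""
    return CLOCK_HZ / cycles_per_sample


for cycles in (30, 1213):
    rate = sampling_rate_hz(cycles)
    ratio = rate / MAX_INTERRUPT_RATE_HZ
    print(f"{cycles} cycles -> {rate / 1e6:.1f} MHz, {ratio:.0f}x the maximum")
```

Under those assumptions, 30 cycles corresponds to roughly 113 MHz (about 1,100 times the 100 KHz maximum) and 1213 cycles to roughly 2.8 MHz (about 28 times), which matches the slide's "three orders" and "an order" of magnitude.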
Software Signal: Same Core
observe method and loop IDs.
[Figure: y-axis: normalized to without SHIM, 0 to 2.5; x-axis labels: 15 cycles and 1505 cycles.]
Overheads are from sharing the core resources.
Hardware and Software Signals: Same Core
Correlate IPC with method and loop IDs.
[Figure: y-axis: normalized to without SHIM, 0 to 1.8; x-axis label: 495 cycles.]
Reducing Overheads
• Bursty sampling
• SMT priorities
• Heterogeneous multicore
• Globally visible per-thread performance counters
Conclusion
• High frequency sampling is important
• SHIM observes signals directly, low overhead
• Cycles per cycle filters samples
• Opportunities for hardware analysis
• Opportunities for hardware design
Questions?
https://github.com/ShimProfiler/SHIM
Backup Slides
100 KHz (10 μs): High or low?
10 μs is not bad
10 μs is not bad?
25 μs!
Simple Address Book
*Name: Xi YANG
*Email: xi.yang@anu.edu.au
100 KHz (10 μs) won’t see this
The 25 μs life of the address_book.SerializeToOstream(&output).
Sampling at 5 MHz, 608 cycles
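A rough way to see why 100 KHz "won't see" a 25 μs event is to count how many samples land inside it at each rate. The helper name `samples_during` is hypothetical, and the calculation ignores sampling jitter and overhead.

```python
def samples_during(event_ns: int, rate_hz: int) -> float:
    """Average number of samples that fall inside an event of the given duration."""
    return event_ns * rate_hz / 1e9


EVENT_NS = 25_000  # the 25 μs call from the slide

print(samples_during(EVENT_NS, 100_000))    # interrupt-driven maximum, 100 KHz
print(samples_during(EVENT_NS, 5_000_000))  # SHIM at 5 MHz
```

At 100 KHz the call receives 2.5 samples on average, too few to show any structure inside it, while at 5 MHz it receives 125.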


Editor's Notes

  1. I will introduce SHIM, a high-frequency profiler.
  2. Many of you have a CPU with this micro-architecture in your laptop.
  3. Need strong reasons for Lusearch: similar to Bing and Google.
  4. Intrinsic limitations of interrupt-driven profilers.
  5. If we increase the frequency 100x more, then we see very interesting pictures. (Formatting: font size 20 for legends and keys, 28 for words, 36 for titles.)
  6. Transition: let’s see how we build this tool.
  7. Let’s start with the insights of SHIM.
  8. software signal: explicit signal and implicit signal
  9. Talk through examples for the matrix.
  10. Transition: a hardware context (HC) could be a core or a hyper-thread.
  11. Have to explain why HT2 IPC is stable; talk about the size of the profiling loop.
  12. software signa
  13. Now that we have shown the design of SHIM, we need one more thing: how can we trust those numbers?
  14. Existing profilers share the same problem: a low sampling rate. Low sampling rate -> 1) can’t observe fine-granularity events, 2) can’t
  15. After filtering with the CPC metric, we can trust those samples, and they are in the valid range. That is the complete design of our tool; you can check it out from GitHub.
  16. We are going to show a few simple examples and overheads
  17. Change fonts. Method and loop IDs are very high-frequency signals.
  18. SMT priority isn’t har
  19. put url here