1. Performance Profiling of Virtual Machines
Jiaqing Du+, Nipun Sehrawat*, Willy Zwaenepoel+
+EPFL, Switzerland
*University of Illinois at Urbana-Champaign
2. Performance Profiling
• Use CPU performance counters
• Monitor software runtime behavior
• Incur very low overhead
• Used extensively: OProfile, VTune, …
%CYCLE    Function            Module
98.5529   vmx_vcpu_run        kvm-intel.ko
0.2226    (no symbols)        libc.so
0.1034    hpet_cpuhp_notify   vmlinux
0.1034    native_patch        vmlinux
Jiaqing Du, VEE, March 9, 2011 2
3. Terminology
(1) Native profiling: the profiler runs in the OS, on top of the CPU PMU.
(2) Guest-wide profiling: the profiler runs in the guest, above the VMM and the CPU PMU.
(3) System-wide profiling: profilers run in both the guest and the VMM, above the CPU PMU.
4. Profiling with Virtual Machines
                        Para-virtualization   Hardware assistance   Binary translation
Guest-wide profiling    ?                     ?                     ?
System-wide profiling   XenOprof              ?                     ?
Profilers do not work well with virtual machines.
6. Outline
• Native profiling
• Guest-wide profiling
• System-wide profiling
• Evaluation
7. Native Profiling
• Performance monitoring unit (PMU)
– consists of a set of event counters
– generates an interrupt when a counter overflows
• PMU-based profiler
– user level: Control, Interpret
– kernel level: Configure, Collect (on top of the CPU PMU)
– diagram annotations: previous PC value, process identifier
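The overflow-driven sampling described on this slide can be sketched as a small simulation. This is only an illustration under stated assumptions: the struct and function names, the buffer size, and the software counter are all hypothetical, and a real profiler programs hardware counters and collects samples from an interrupt handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 64   /* hypothetical sample buffer size */

/* One sample: where the CPU was and which process was running. */
struct sample {
    uint64_t pc;
    int pid;
};

/* Simulated event counter: armed at -period, overflows when it reaches 0. */
struct pmu_counter {
    int64_t value;
    int64_t period;
};

struct profiler {
    struct pmu_counter ctr;
    struct sample samples[MAX_SAMPLES];
    size_t nsamples;
};

/* Configure: arm the counter so it overflows after `period` events. */
void profiler_configure(struct profiler *p, int64_t period)
{
    assert(period > 0);
    p->ctr.period = period;
    p->ctr.value = -period;
    p->nsamples = 0;
}

/* Collect: what the overflow interrupt handler would do. */
void profiler_collect(struct profiler *p, uint64_t pc, int pid)
{
    if (p->nsamples < MAX_SAMPLES) {
        p->samples[p->nsamples].pc = pc;
        p->samples[p->nsamples].pid = pid;
        p->nsamples++;
    }
    p->ctr.value = -p->ctr.period;   /* re-arm for the next overflow */
}

/* Simulate `events` monitored events occurring at (pc, pid). */
void profiler_run(struct profiler *p, int64_t events, uint64_t pc, int pid)
{
    for (int64_t i = 0; i < events; i++) {
        if (++p->ctr.value == 0)
            profiler_collect(p, pc, pid);
    }
}
```

With a period of 100, running 250 events yields two samples and leaves the counter halfway to its next overflow; the Interpret step would later map each recorded PC to a function and module.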
8. Guest-wide Profiling
• Profiler runs in the guest and only profiles the guest
[Diagram: Control, Interpret, Configure and Collect all run inside the guest, which sits above the VMM and the CPU PMU.]
Injected interrupts should be handled right after the guest resumes execution.
Challenge: synchronous interrupt delivery to the guest
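A toy model can show why the injected interrupt has to be handled before the guest runs any further. It is a sketch under assumptions: `struct vcpu` and the three functions are hypothetical names, the program counter advances by one unit per simulated instruction, and no real VMM interface is used.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical virtual CPU state for the simulation. */
struct vcpu {
    uint64_t guest_pc;     /* next guest instruction to execute */
    int pending_pmi;       /* overflow interrupt waiting to be injected */
    uint64_t sampled_pc;   /* PC recorded by the guest profiler */
    int samples;           /* number of samples taken so far */
};

/* Guest profiler's interrupt handler: records the interrupted PC. */
void guest_pmi_handler(struct vcpu *v)
{
    v->sampled_pc = v->guest_pc;
    v->samples++;
}

/* VMM: a counter overflowed while the guest was running. */
void vmm_on_overflow(struct vcpu *v)
{
    v->pending_pmi = 1;
}

/* VMM: resume the guest for `steps` simulated instructions.
 * A pending interrupt is delivered first, so the handler sees the
 * PC at which the guest was interrupted. */
void vmm_resume_guest(struct vcpu *v, uint64_t steps)
{
    if (v->pending_pmi) {
        v->pending_pmi = 0;
        guest_pmi_handler(v);
    }
    assert(!v->pending_pmi);   /* nothing is left pending once the guest runs */
    v->guest_pc += steps;
}
```

If the handler ran only after the guest had advanced, `sampled_pc` would point past the code that actually caused the overflow, and the profile would attribute the event to the wrong place.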
9. System-wide Profiling (1/3)
• Reveal runtime behavior of both VMM and guest(s)
[Diagram: Control, Interpret, Configure and Collect all run inside the VMM, below Guest1 and Guest2 and above the CPU PMU.]
The VMM does not know the internals of a guest.
Challenge: interpret samples belonging to the guest
10. System-wide Profiling (2/3)
• Interpret guest samples: full delegation
[Diagram: the guest and the VMM each run a complete profiler (Control, Interpret, Configure, Collect) above the CPU PMU.]
11. System-wide Profiling (3/3)
• Interpret guest samples: interpretation delegation
[Diagram: the guest and the VMM each run a profiler (Control, Interpret, Configure, Collect), connected through a shared buffer, above the CPU PMU.]
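One way to picture the shared buffer is as a ring buffer, assuming the VMM writes raw guest samples into it and the guest reads them out for interpretation. The layout, names, and capacity below are hypothetical; a real implementation would also need proper synchronization between the two sides, which this single-threaded sketch omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_SLOTS 8   /* hypothetical size; one slot always stays empty */

struct guest_sample {
    uint64_t pc;      /* guest program counter at counter overflow */
    uint64_t event;   /* hypothetical id of the event that overflowed */
};

/* Ring buffer shared between the VMM (producer) and a guest (consumer). */
struct shared_buffer {
    struct guest_sample slot[BUF_SLOTS];
    size_t head;      /* next slot the VMM writes */
    size_t tail;      /* next slot the guest reads */
};

void buffer_init(struct shared_buffer *b)
{
    b->head = 0;
    b->tail = 0;
}

/* VMM side: returns 0 on success, -1 if the buffer is full. */
int vmm_push_sample(struct shared_buffer *b, struct guest_sample s)
{
    size_t next = (b->head + 1) % BUF_SLOTS;
    if (next == b->tail)
        return -1;            /* full: the sample is dropped */
    b->slot[b->head] = s;
    b->head = next;
    return 0;
}

/* Guest side: returns 1 and fills *out, or 0 if the buffer is empty. */
int guest_pop_sample(struct shared_buffer *b, struct guest_sample *out)
{
    assert(out != NULL);
    if (b->tail == b->head)
        return 0;
    *out = b->slot[b->tail];
    b->tail = (b->tail + 1) % BUF_SLOTS;
    return 1;
}
```

The guest would drain the buffer with `guest_pop_sample` and resolve each PC against its own symbol information, which the VMM does not have.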
12. PMU Multiplexing
• When to save & restore performance counters?
• CPU switch
– only in-guest execution is accounted to the guest
[Timeline: guest1 | VMM (I/O for guest1) | guest2 | VMM (I/O for guest2) | guest2. The three guest slices are accounted to guest 1, guest 2 and guest 2; the VMM slices are accounted to no guest.]
• Domain switch
– in-VMM execution is also accounted to the guest
[Timeline: guest1 | VMM (I/O for guest1) | guest2 | VMM (I/O for guest2) | guest2. The first two slices are accounted to guest1, the remaining three to guest2.]
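The two accounting policies can be compared on the same timeline with a short sketch. Everything here is a modeling assumption: the `slice` record, the `guest` field standing for the guest on whose behalf the VMM runs, and the event counts are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_GUESTS 2

enum policy { CPU_SWITCH, DOMAIN_SWITCH };

/* One slice of the execution timeline. */
struct slice {
    int in_vmm;    /* 1 if the VMM is running, 0 if guest code is running */
    int guest;     /* running guest, or guest the VMM is working for */
    long events;   /* events counted by the PMU during this slice */
};

/* Sum the events each guest is charged with under the given policy. */
void account(const struct slice *t, size_t n, enum policy pol,
             long totals[NUM_GUESTS])
{
    for (int g = 0; g < NUM_GUESTS; g++)
        totals[g] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(t[i].guest >= 0 && t[i].guest < NUM_GUESTS);
        /* CPU switch: only in-guest execution is accounted to the guest. */
        if (t[i].in_vmm && pol == CPU_SWITCH)
            continue;
        /* Domain switch: in-VMM execution is also accounted to the guest. */
        totals[t[i].guest] += t[i].events;
    }
}
```

For a timeline of guest1 (100 events), VMM for guest1 (30), guest2 (80), VMM for guest2 (20), guest2 (50), CPU switch charges 100 and 130 events, while domain switch charges 130 and 150.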
15. Profiling Overhead
• Measure execution time
– a computation-intensive program
– with and without profiling
– about 400 counter overflows per second
Profiling environment   Increased execution time
Native Linux            0.04% ± 0.004%
KVM guest-wide          0.39% ± 0.045%
KVM system-wide         0.44% ± 0.043%
QEMU system-wide        0.94% ± 0.044%
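The "increased execution time" metric is presumably the relative slowdown of the profiled run over the unprofiled one. A minimal sketch of that arithmetic, with a hypothetical function name and made-up timings:

```c
#include <assert.h>

/* Percentage increase in execution time caused by profiling.
 * base_seconds: run time without profiling.
 * profiled_seconds: run time of the same program with profiling on. */
double overhead_percent(double base_seconds, double profiled_seconds)
{
    assert(base_seconds > 0.0);
    return (profiled_seconds - base_seconds) / base_seconds * 100.0;
}
```

For example, a program that takes 200 s unprofiled and 201 s profiled has an overhead of about 0.5%.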
16. Evaluation question #2
Are profiling results accurate?
17. Profiling Accuracy (1/4)
• A computation-intensive benchmark
• compute_a() and compute_b() do floating-point arithmetic
• Monitor CPU cycles
void compute_a(void);   /* floating-point work, defined elsewhere */
void compute_b(void);

int main(int argc, char *argv[])
{
    while (1) {
        compute_a();
        compute_b();
    }
}
18. Profiling Accuracy (2/4)
• Comparison with native profiling
[Bar chart: Cycle % (y-axis, 0 to 90) by routine name (compute_a, compute_b) for Native, KVM guest-wide, KVM system-wide and QEMU system-wide.]
19. Profiling Accuracy (3/4)
• A memory-intensive benchmark
• Randomly access a fixed-size region of memory
• Monitor last level cache misses
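#include <stddef.h>   /* NULL */

struct item {
    struct item *next;
    long pad[NUM_PAD];   /* padding; NUM_PAD is defined elsewhere */
};

void chase_pointer(void)
{
    /* start at the head of the randomly linked items (defined elsewhere) */
    struct item *p = &randomly_connected_items;
    while (p != NULL)
        p = p->next;
}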
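The slide does not show how the items are connected. One possible construction, offered only as a sketch: shuffle the item indices with a Fisher-Yates shuffle and link the items in that order. The value of `NUM_PAD`, the function `connect_randomly`, and the counting variant of `chase_pointer` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_PAD 7   /* hypothetical: each item holds 8 machine words */

struct item {
    struct item *next;
    long pad[NUM_PAD];
};

/* Link items[0..n-1] into one chain in shuffled order; return its head.
 * Returns NULL if the temporary index array cannot be allocated. */
struct item *connect_randomly(struct item *items, size_t n, unsigned seed)
{
    assert(items != NULL && n > 0);
    size_t *order = malloc(n * sizeof *order);
    if (order == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
    for (size_t i = 0; i + 1 < n; i++)
        items[order[i]].next = &items[order[i + 1]];
    items[order[n - 1]].next = NULL;          /* end of the chain */
    struct item *head = &items[order[0]];
    free(order);
    return head;
}

/* Walk the chain once and return how many items were visited. */
size_t chase_pointer(struct item *head)
{
    size_t visited = 0;
    for (struct item *p = head; p != NULL; p = p->next)
        visited++;
    return visited;
}
```

The working set size is then roughly `n * sizeof(struct item)`, so varying `n` sweeps the region from cache-resident to larger than the last level cache.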
20. Profiling Accuracy (4/4)
• Comparison with native profiling
[Plot: cache misses per memory access (y-axis, 0 to 1.6) versus working set size (x-axis, 256 to 3072 KB) for Native, KVM guest-wide, KVM system-wide and QEMU system-wide.]
21. Evaluation question #3
What is the difference between
CPU switch and domain switch?
22. Recap
• CPU switch
[Timeline: guest1 | VMM (I/O for guest1) | guest2 | VMM (I/O for guest2) | guest2. The three guest slices are accounted to guest 1, guest 2 and guest 2; the VMM slices are accounted to no guest.]
• Domain switch
[Timeline: guest1 | VMM (I/O for guest1) | guest2 | VMM (I/O for guest2) | guest2. The first two slices are accounted to guest1, the remaining three to guest2.]
23. Profiling Packet Receive (1/2)
• Experiment
– push packets to a Linux guest in KVM
– run OProfile in the guest
– monitor instruction retirements
[Diagram: one machine runs a Linux guest with a virtual NIC on KVM; a second machine runs Linux natively. Each machine has a hardware NIC.]
25. Related Work
• XenOprof
– first profiler targeting virtual machines
– system-wide profiling for Xen
• Linux perf
– a profiling infrastructure for Linux
– limited support for profiling a Linux guest in KVM
• VMware vmkperf
– only reads and writes CPU performance counters