This document summarizes the results of experiments comparing nested virtualization performance on the x86_64 and S390x architectures. Key findings include:
- CPU performance decreases more sharply with each level of virtualization on x86_64/KVM than on S390x/z/VM; z/VM performance scales better.
- Thread scheduling throughput and response times degrade more on x86_64/KVM than on S390x/z/VM as virtualization levels increase; z/VM scheduling scales better.
- Memory write performance drops sharply at L2 on z/VM but degrades gradually on KVM; z/VM performs better overall, but KVM's changes are more predictable.
- Memory read performance degrades faster on z/VM than on KVM; at L2, KVM outperforms z/VM.
3. Introduction
● Background
– What is "Nested Virtualization"?
– Why should I care?
– What is "Turtles"?
– What is a "VM"?
● Purpose of our research
4. Experimental Setup
● Problem: Comparing apples to apples
– Mainframes are bigger than desktop servers
– Needed to make both have similar resources
● Solution: Abstraction and partitioning
– Logical partitioning
6. Experimental Setup
● Three test configurations
● Each named by its "degree of virtualization"
– Levels of virtualization on the system
– "Level 0" - non-virtualized environment
– "Level 1" - single hypervisor
– "Level 2" - nested virtualization
11. Experimental Setup
● Software
– GCC 4.8.3
– SysBench 0.5
– KVM, QEMU, and friends
● Custom Linux kernel
– Can't have any large page support!
12. Experimental Setup
● Creating L1 and L2 environments
● S390x
– Install new z/VM at L0
– Reuse L0 Linux installation for L1 and L2
● x86_64
– Create qcow2 disk image for L1
– Install/configure Linux on disk image
– Copy and tweak disk image for L2
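On x86_64 the L2 environment is essentially a clone of the L1 disk image. A minimal sketch of the commands as data (the paths and image size are invented for illustration; `qemu-img create -f qcow2` is QEMU's real image-creation command):

```python
# Build the shell commands used to prepare nested-guest disk images.
# Paths and the 40G size are invented; the qemu-img invocation shape
# follows QEMU's standard image-creation syntax.
l1_img, l2_img = "l1-guest.qcow2", "l2-guest.qcow2"

commands = [
    f"qemu-img create -f qcow2 {l1_img} 40G",  # L1 disk image
    # ... install/configure Linux on l1_img, then clone it for L2 ...
    f"cp {l1_img} {l2_img}",                   # reuse the image at L2
]
```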
13. Experimental Setup
● SysBench 0.5
– Each test executes a number of transactions
– A "transaction" is some discrete computational operation
– Focused on five different tests to get a good idea of overall system performance
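The deck's two metrics, transactional throughput and mean response time, can both be derived from per-transaction timings. A minimal sketch (function and field names are mine, not SysBench's own code; it assumes transactions run back-to-back on one thread):

```python
def summarize(latencies_s):
    """Summarize a benchmark run from per-transaction latencies
    (in seconds). Elapsed time is the sum of latencies, assuming
    the transactions executed back-to-back."""
    elapsed = sum(latencies_s)
    events = len(latencies_s)
    return {
        "throughput_eps": events / elapsed,           # events per second
        "mean_response_ms": 1000 * elapsed / events,  # mean response time
    }

# Three transactions taking 2, 3, and 2.5 ms
stats = summarize([0.002, 0.003, 0.0025])
```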
15. Experimental Setup
● Goal was to maximize resource usage
● Problem: resource over-commitment
– Guest VMs and host VMM fighting for the same resources
– Leads to resource contention
– Dispatched to wait queue, paged to disk
16. Experimental Setup
● Solution: Second set of tests performed without resource over-commitment
● L0 – 16GB RAM, 4 CPU
● L1 – 14GB RAM, 3 VCPU
● L2 – 12GB RAM, 2 VCPU
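The stepped-down allocations leave each host level headroom for its own hypervisor, which is what prevents over-commitment. A quick check (the allocations are from the slide; the helper function is illustrative):

```python
levels = {  # level: (RAM in GB, CPUs) -- taken from the slide
    "L0": (16, 4),
    "L1": (14, 3),
    "L2": (12, 2),
}

def headroom(outer, inner):
    """RAM/CPU the outer level keeps for itself after hosting inner."""
    (ram_o, cpu_o), (ram_i, cpu_i) = levels[outer], levels[inner]
    return ram_o - ram_i, cpu_o - cpu_i

# Each level reserves 2 GB RAM and 1 CPU for the hypervisor hosting it
assert headroom("L0", "L1") == (2, 1)
assert headroom("L1", "L2") == (2, 1)
```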
19. Definition of "Performance"
● With regards to our two measurements
● "Better" performance
– Higher throughput
– Faster (lower) response times
● "Worse" performance
– Lower throughput
– Slower (higher) response times
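This ordering can be captured directly in code. A minimal sketch (the function name is mine; it treats a result as better only when it wins on both metrics):

```python
def better(a, b):
    """True if result a is strictly better than result b under the
    deck's definition: higher throughput AND lower (faster) response
    time. Each result is a (events_per_s, response_ms) pair."""
    throughput_a, response_a = a
    throughput_b, response_b = b
    return throughput_a > throughput_b and response_a < response_b

# More events/s and a lower response time => better (numbers illustrative)
assert better((1300, 0.8), (1100, 1.2))
assert not better((1100, 1.2), (1300, 0.8))
```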
20. CPU Performance Comparison
Transactional Throughput
[Chart: SysBench CPU Performance Comparison, vertical comparison (configurations by environment): transactional throughput (events/s), scale 0-1400. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench CPU Performance Comparison, horizontal comparison (environments by configuration): transactional throughput (events/s), scale 0-1400. 95% confidence intervals for the mean.]
● x86_64 has greater throughput, but it decreases
● S390x throughput doesn't change much at all:
– 20.7% probability that L0 and L1 throughput means are identical
– 82.9% probability that L1 and L2 throughput means are identical
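The intervals behind these comparisons are standard 95% confidence intervals for a mean, computed from each sample's own standard deviation. A sketch using the normal-approximation critical value 1.96 (the authors may have used Student's t instead; the numbers below are illustrative, not the measured values):

```python
import math

def ci95(mean, sd, n):
    """95% confidence interval for a sample mean, using the sample's
    own standard deviation (normal approximation, z = 1.96)."""
    half = 1.96 * sd / math.sqrt(n)
    return mean - half, mean + half

# Illustrative throughput sample: mean 1200 events/s, sd 25, n = 30 runs
lo, hi = ci95(mean=1200.0, sd=25.0, n=30)
```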
21. CPU Performance Comparison
Mean Response Time
[Chart: SysBench CPU Performance Comparison, vertical comparison (configurations by environment): mean response time (ms), scale 0-4. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench CPU Performance Comparison, horizontal comparison (environments by configuration): mean response time (ms), scale 0-4. 95% confidence intervals for the mean.]
● x86_64 response times are faster, but they increase
● S390x response time doesn't change much at all:
– 16.7% probability that L0 and L1 response time means are identical
– 79.4% probability that L1 and L2 response time means are identical
22. CPU Performance Comparison
No Memory or CPU Over-commitment
[Chart: SysBench CPU Performance Comparison, horizontal comparison (environments by configuration): transactional throughput (events/s), scale 0-1400, no resource overcommitment. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench CPU Performance Comparison, horizontal comparison (environments by configuration): mean response time (ms), scale 0-4, no resource overcommitment. 95% confidence intervals for the mean.]
● Removing over-commitment greatly reduced the variation on S390x
● Transactional throughput scales with number of processors
23. CPU Performance Comparison
● x86_64 performs better than S390x
● x86_64 performance is impacted by each level of virtualization (KVM)
● S390x performance apparently has no such impact from z/VM
● CPU performance on z/VM scales better
24. Thread Scheduling Comparison
Transactional Throughput
[Chart: SysBench Thread Scheduling Comparison, vertical comparison (configurations by environment): transactional throughput (events/s), scale 0-5000. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench Thread Scheduling Comparison, horizontal comparison (environments by configuration): transactional throughput (events/s), scale 0-5000. 95% confidence intervals for the mean.]
● Without virtualization, x86_64 has higher throughput
● S390x has higher throughput in all virtualized configurations
25. Thread Scheduling Comparison
Mean Response Time
[Chart: SysBench Thread Scheduling Comparison, vertical comparison (configurations by environment): mean response time (ms), scale 0-50. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench Thread Scheduling Comparison, horizontal comparison (environments by configuration): mean response time (ms), scale 0-50. 95% confidence intervals for the mean.]
● Without virtualization, x86_64 has faster response time
● S390x has faster response time in all virtualized configurations
● x86_64 response time scales poorly with degree of virtualization
26. Thread Scheduling Comparison
No Memory or CPU Over-commitment
[Chart: SysBench Thread Scheduling Comparison, horizontal comparison (environments by configuration): transactional throughput (events/s), scale 0-4000, no overcommitment. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench Thread Scheduling Comparison, horizontal comparison (environments by configuration): mean response time (ms), scale 0-90, no overcommitment. 95% confidence intervals for the mean.]
● Removing over-commitment reduced the variation
● S390x L1 performance "jumps" are eliminated
● S390x thread scheduling throughput and response time scale better
27. Thread Scheduling Comparison
● Performance decreases with increasing degree of virtualization
● x86_64 hardware advantages erased by KVM
● z/VM provides better thread scheduling performance than KVM
● Thread scheduling on S390x and z/VM scales better than on x86_64 and KVM
28. Memory Write Comparison
Transactional Throughput
[Chart: SysBench Memory Write Comparison, vertical comparison (configurations by environment): transactional throughput (events/s), scale 0.0-1.2. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench Memory Write Comparison, horizontal comparison (environments by configuration): transactional throughput (events/s), scale 0.0-1.2. 95% confidence intervals for the mean.]
● No virtualization: S390x has higher throughput than x86_64
● Throughput increases on L1 z/VM, decreases on L1 KVM
● Big throughput decrease on L2 z/VM, gradual decrease on L2 KVM
29. Memory Write Comparison
Mean Response Time
[Chart: SysBench Memory Write Comparison, vertical comparison (configurations by environment): mean response time (ms), scale 0-25000. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench Memory Write Comparison, horizontal comparison (environments by configuration): mean response time (ms), scale 0-25000. 95% confidence intervals for the mean.]
● No virtualization: S390x has faster response time than x86_64
● Response time improves on L1 z/VM, degrades on L1 KVM
● Nested virtualization causes L2 z/VM response time to slow, L2 KVM response time to improve. Unexpected result.
30. Memory Write Comparison
No Memory or CPU Over-commitment
● Over-commitment was cause of L2 discrepancy
● Variation on KVM all but eliminated
● S390x L0, L1 very close: 70.5% probability that means are identical
● x86_64 response time slows from L1 to L2
[Chart: SysBench Memory Write Comparison, horizontal comparison (environments by configuration): transactional throughput (events/s), scale 0.0-0.9, no overcommitment. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench Memory Write Comparison, horizontal comparison (environments by configuration): mean response time (ms), scale 0-6000, no overcommitment. 95% confidence intervals for the mean.]
31. Memory Write Comparison
● L1 z/VM performance comparable to HW
● L2 z/VM performance significantly degrades
– Throughput halved
– Response time more than doubled
● x86_64: gradual but consistent degradation
● z/VM has overall better performance, but KVM performance changes are more "predictable"
32. Memory Read Comparison
Transactional Throughput
[Chart: SysBench Memory Read Comparison, vertical comparison (configurations by environment): transactional throughput (events/s), scale 0.0-1.0. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench Memory Read Comparison, horizontal comparison (environments by configuration): transactional throughput (events/s), scale 0.0-1.0. 95% confidence intervals for the mean.]
● Transactional throughput decreased in both environments
● Throughput decreased at a faster rate on z/VM than on KVM
● L2 z/VM throughput far lower than L2 KVM throughput
33. Memory Read Comparison
Mean Response Time
[Chart: SysBench Memory Read Comparison, vertical comparison (configurations by environment): mean response time (ms), scale 0-5000. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench Memory Read Comparison, horizontal comparison (environments by configuration): mean response time (ms), scale 0-5000. 95% confidence intervals for the mean.]
● Response times slowed with degree of virtualization
● Response times on z/VM slowed at a faster rate than on KVM
● L2 z/VM response time much slower than L2 KVM response time
34. Memory Read Comparison
No Memory or CPU Over-commitment
[Chart: SysBench Memory Read Comparison, horizontal comparison (environments by configuration): transactional throughput (events/s), scale 0.0-1.0, no overcommitment. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench Memory Read Comparison, horizontal comparison (environments by configuration): mean response time (ms), scale 0-4000, no overcommitment. 95% confidence intervals for the mean.]
● L1 response times for z/VM, KVM within 1% of HW response times
● z/VM performance degrades at a faster rate than KVM performance
● L2 KVM performance still beats L2 z/VM performance
35. Memory Read Comparison
● Performance degrades with each level of virtualization
● S390x, z/VM have better L0, L1 performance
● KVM has better L2 memory read performance
● Relative performance change between L1 and L2 KVM is much smaller than between L1 and L2 z/VM
36. MySQL Database Comparison
Transactional Throughput
[Chart: SysBench MySQL Database Performance Comparison, vertical comparison (configurations by environment): transactional throughput (events/s), scale 0-1400. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench MySQL Database Performance Comparison, horizontal comparison (environments by configuration): transactional throughput (events/s), scale 0-1400. 95% confidence intervals for the mean.]
● S390x and z/VM provide superior L0, L1 throughput
● Precipitous drop in z/VM throughput between L1 and L2
● x86_64 throughput degrades, but at a much more practical rate
37. MySQL Database Comparison
Mean Response Time
[Chart: SysBench MySQL Database Performance Comparison, vertical comparison (configurations by environment): mean response time (ms), scale 0-100. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench MySQL Database Performance Comparison, horizontal comparison (environments by configuration): mean response time (ms), scale 0-100. 95% confidence intervals for the mean.]
● S390x and z/VM offer incredible L0, L1 response times
● Between L1 and L2, z/VM response time degrades by over 2000%!
● x86_64 and KVM performance degrade at a far more consistent pace
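The percentage degradation quoted here is the relative change in mean response time. A quick check with illustrative numbers (not the measured values):

```python
def pct_degradation(old_ms, new_ms):
    """Percent increase in response time from old to new; values
    above 100 mean response time more than doubled."""
    return 100.0 * (new_ms - old_ms) / old_ms

# Illustrative only: a jump from 4 ms to 88 ms is a 2100% degradation,
# i.e. "over 2000%"
assert pct_degradation(4.0, 88.0) == 2100.0
```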
38. MySQL Database Comparison
No Memory or CPU Over-commitment
[Chart: SysBench MySQL Database Performance Comparison, horizontal comparison (environments by configuration): transactional throughput (events/s), scale 0-1400, no overcommitment. 95% confidence intervals for the mean, calculated from individual standard deviations.]
[Chart: SysBench MySQL Database Performance Comparison, horizontal comparison (environments by configuration): mean response time (ms), scale 0-70, no overcommitment. 95% confidence intervals for the mean.]
● L1, L2 throughput decreases from over-committed results
● L1, L2 response times improve from over-committed results
● L2 KVM response time now very close to L1 KVM response time
● Not sufficient to help S390x's performance problems
39. MySQL Database Comparison
● MySQL test performance degrades with increasing degree of virtualization
● x86_64 and KVM had the most reasonable rate of performance change
● S390x and z/VM had vastly superior L0 and L1 performance
● Performance degradation at L2 z/VM is "jarring", could be a show-stopper
40. MySQL Database Comparison
● Why does L2 z/VM performance collapse?
● Three factors
– An I/O-intensive workload
– Design of S390x interpretive execution
– The way z/VM virtualizes interpretive execution for nested hypervisors
41. Conclusion
● z/VM outperformed KVM in a number of areas
– Largely due to architectural benefits
● KVM had more predictable performance
– Memory read, memory write, MySQL
● KVM needs to improve how CPU and thread scheduling scale with degree of virtualization
● z/VM needs to address L2 performance degradation of I/O-generating workloads
42. Future Work
● This study is only a first step
– Not a predictor of scalable performance
● Test how performance scales with increasing numbers of nested and non-nested guests
● Analyze performance of disk and network I/O
● Perform a study using a "real world" macro-benchmark, such as DayTrader
44. Special Thanks
● My IBM managers who encouraged and supported this work
– Hanif Dandia (z/VM Development Org.)
– Jennifer Hunt (z/Firmware Development Org.)
– Keri Liburdi (z/Firmware Development Org.)
– Rob Urfer (IBM Wave for z/VM)
● Elizabeth Crew (BCC)
● Sarah FitzGerald
47. Interpretive Execution
● Used by z/VM to achieve hardware levels of performance in L1 guests
● Allows most privileged guest instructions to be handled by hardware, not by the hypervisor
● Problem: cannot handle guest I/O instructions
● "SIE Break" - context switch when the hypervisor leaves SIE to simulate a guest I/O instruction
48. Interpretive Execution
● Further problem: interpretive execution is only available to L1 guests
● In order to run a hypervisor as an L1 guest, the L0 hypervisor must simulate interpretive execution for it
– Which L1 will need in order to run its L2 VMs
● This "virtual" interpretive execution adds overhead to SIE breaks
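A toy cost model may make the compounding clearer. All constants below are invented for illustration; only the structure, where every guest I/O instruction forces a SIE break and simulated interpretive execution multiplies the cost of each break at L2, reflects the slides:

```python
# Toy model of hypervisor overhead for an I/O-generating workload.
# Each I/O instruction forces a SIE break; when the guest runs under
# *simulated* interpretive execution (nested), each break costs more.
# Both constants are hypothetical, chosen only to show the shape.
SIE_BREAK_US = 5.0       # hypothetical cost of one SIE break (microseconds)
VIRT_SIE_PENALTY = 4.0   # hypothetical multiplier for simulated SIE

def io_overhead_us(io_ops, nested):
    """Total SIE-break overhead for io_ops guest I/O instructions."""
    per_break = SIE_BREAK_US * (VIRT_SIE_PENALTY if nested else 1.0)
    return io_ops * per_break

l1 = io_overhead_us(10_000, nested=False)  # L1 guest under real SIE
l2 = io_overhead_us(10_000, nested=True)   # L2 guest under simulated SIE
assert l2 == VIRT_SIE_PENALTY * l1  # overhead scales with I/O volume
```

This is why the I/O-heavy MySQL test is hit hardest at L2 while CPU-bound tests are not: the penalty applies per I/O instruction.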
49. Interpretive Execution
● The added overhead of L1 SIE breaks (caused by L2 guest I/O operations) is the cause of the poor L2 z/VM performance in the MySQL test
● It may also be a factor in the poor L2 z/VM performance observed with memory reads and writes