1. Chicago, October 19 - 22, 2010
Virtualization Technical Deep Dive
Key Concepts for Developers
Richard McDougall - VMware
2. Virtualization Technical Deep Dive
We’ll be covering
• Virtualization Capabilities
• Workstation Virtualization
• How virtual machines work, and what the overhead is
• How Server Virtualization/Consolidation works
• Java and Consolidation on Server Virtualization
4. Three Properties of Virtualization
Partitioning
• Run multiple operating systems on one physical machine
• Fully utilize server resources
• Support high availability by clustering virtual machines
• Guarantee service levels
Isolation
• Isolate faults and security at the virtual-machine level
• Dynamically control CPU, memory, disk and network resources per virtual machine
Encapsulation
• Encapsulate the entire state of the virtual machine in hardware-independent files
• Save the virtual machine state as a snapshot in time
• Re-use or transfer whole virtual machines with a simple file copy
5. Virtualization for Desktops/Laptops
• Desktop products
– VMware Fusion and Workstation
• Features for Developers
– Run multiple OS versions concurrently
– Test server applications on your desktop/laptop
– Leverage the record/replay capability for debugging
6. Virtualization for Servers:
Problem: Underutilized Servers
Consolidation targets are often <30% utilized
• Windows average utilization: 5-8%
• Linux/Unix average: 10-35%
7. Initial Virtualization Benefits: Consolidation
BEFORE VMware: 1,000 servers; direct-attach storage; network with 3,000 cables/ports and 200 racks; facilities with 400 power whips
AFTER VMware: 80 servers; tiered SAN and NAS storage; network with 400 cables/ports and 10 racks; facilities with 20 power whips
8. Next Benefit: Simpler Management
VMotion Technology
VMotion Technology moves running virtual machines from one
host to another while maintaining continuous service availability
- Enables Resource Pools
- Enables High Availability
9. Pooling of Resources
Pools replace hosts as the primary compute abstraction
[Diagram: hosts aggregated into resource pools]
10. Automated Pool of Resources
[Diagram: vCenter turns an imbalanced cluster (heavy load on some hosts, lighter load on others) into a balanced cluster]
11. DRS Scalability – Transactions per Minute
(higher is better)
• An already-balanced cluster shows fewer gains
• Higher gains (> 40%) with more imbalance
14. “Hosted” Virtualization Architecture
The virtual CPU abstraction is created by the “monitor”. Each VM is an OS process on the host:

rmc$ ps -fp 4295
UID PID PPID C STIME TTY TIME CMD
0 4295 1 0 18:15.66 ?? 21:05.14 /Library/Application Support/VMware Fusion/vmware-vmx /Users/rmc/Documents/Virtual Machines/Windows XP Pro.vmwarevm/Windows XP Pro.vmx

The monitor supports:
– BT (Binary Translation)
– HW (Hardware assist)
– PV (Paravirtualization)

The VM’s configuration is a plain text file:

rmc$ more Windows XP Pro.vmx
virtualHW.version = "7"
memsize = "776"
ide0:0.fileName = "Windows XP Professional.vmdk"
ethernet0.connectionType = "nat"

Memory is allocated by the host OS and virtualized by the monitor. Network and I/O devices (virtual NIC, virtual SCSI) are emulated and proxied through native device drivers; the virtual disk (e.g. mydisk.vmdk) lives on the host's local file system.
15. Inside the Monitor: Classical Instruction Virtualization
Trap-and-emulate
Nonvirtualized (“native”) system:
– The OS runs in privileged mode (Ring 0) and “owns” the hardware
– Application code (Ring 3) has less privilege
Virtualized:
– The VMM is most privileged (Ring 0), for isolation
– Classical “ring compression” or “de-privileging”:
• Run the guest OS kernel in Ring 1; apps stay in Ring 3
• Privileged instructions trap and are emulated by the VMM
– But: this does not work on classic x86 (some sensitive instructions execute silently instead of trapping)
16. Binary Translation of Guest Code
Translate guest kernel code
Replace privileged instructions with safe “equivalent” instruction sequences
No need for traps
BT is an extremely powerful technology
– Permits any unmodified x86 OS to run in a VM
– Can virtualize any instruction set
17. Combining BT and Direct Execution
Direct Execution
(user mode guest code)
Faults, syscalls
interrupts
VMM
IRET, sysret
Binary Translation
(kernel mode guest code)
18. BT Mechanics
Each translator invocation
– Consume one input basic block (guest code)
– Produce one output basic block
Store output in translation cache
– Future reuse
– Amortize translation costs
– Guest-transparent: no patching “in place”
[Diagram: the guest's input basic block flows through the translator into a translated basic block stored in the translation cache]
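The translator-plus-cache loop above can be sketched in a few lines of Java. This is a toy model under stated assumptions: the instruction names, the "translation" (tagging privileged instructions with emulation callouts), and the per-PC keying are all illustrative, not VMware's implementation; the point is the memoization that amortizes translation cost without patching guest code in place.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of a BT translation cache (names are illustrative).
// A basic block of guest code is translated once; the result is cached so
// future executions of the same block reuse the translation.
public class TranslationCache {
    private final Map<Long, String> cache = new HashMap<>(); // guest PC -> translated block
    private int translations = 0; // how many times the translator actually ran

    // Translate one input basic block into a "safe" output block. Here we just
    // tag privileged instructions; real BT rewrites them into emulation code.
    private String translate(String guestBlock) {
        translations++;
        return guestBlock.replace("cli", "call emulate_cli")
                         .replace("popf", "call emulate_popf");
    }

    // Look up the translated block, translating on first use. The guest's own
    // code is never patched in place (guest-transparent).
    public String lookup(long guestPc, String guestBlock) {
        return cache.computeIfAbsent(guestPc, pc -> translate(guestBlock));
    }

    public int translationCount() { return translations; }

    public static void main(String[] args) {
        TranslationCache tc = new TranslationCache();
        String block = "mov eax, ebx; cli; ret";
        System.out.println(tc.lookup(0x1000L, block)); // miss: translator runs
        System.out.println(tc.lookup(0x1000L, block)); // hit: reused from cache
        System.out.println("translator invocations: " + tc.translationCount());
    }
}
```

Running the same block twice invokes the translator only once; that reuse is what makes BT cheap for hot code paths.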
19. Intel VT / AMD-V: 1st Generation HW Support
• Key feature: root vs. guest CPU mode
– The VMM executes in root mode
– The guest (OS in Ring 0, apps in Ring 3) executes in guest mode
• The VMM and guest run as “co-routines”:
– VM enter
– Guest runs
– A while later: VM exit
– VMM runs
– ...
20. Qualitative Comparison of BT and VT-x/AMD-V
• VT-x/AMD-V loses on:
– exits (costlier than “callouts”)
– no adaptation (cannot eliminate exits)
– page table updates
– memory-mapped I/O
– IN/OUT instructions
• VT-x/AMD-V wins on:
– system calls
– almost all code runs “directly”
– no traps for privileged instructions
• BT loses on:
– system calls
– translator overheads
– path lengthening
– indirect control flow
• BT wins on:
– page table updates (adaptation)
– memory-mapped I/O (adaptation)
– IN/OUT instructions
21. Can I Virtualize CPU-Intensive Applications?
Most CPU-intensive applications have very low overhead
VMware ESX 3.x compared to native:
• SPECcpu results are covered in the paper by K. Adams and O. Agesen
• WebSphere results published jointly by IBM and VMware
• SPECjbb results from recent internal measurements
22. Virtualizing Virtual Memory
[Diagram: per-VM processes with virtual memory (VA) mapping to guest physical memory (PA), which maps to machine memory (MA)]
• To run multiple VMs on a single system, another level of memory virtualization must be done
– The guest OS still controls the virtual-to-physical mapping: VA -> PA
– The guest OS has no direct access to machine memory (to enforce isolation)
• The VMM maps guest physical memory to actual machine memory: PA -> MA
23. Virtualizing Virtual Memory: Shadow Page Tables
• The VMM builds “shadow page tables” to accelerate the mappings
– The shadow directly maps VA -> MA
– Avoids doing two levels of translation on every access
– The TLB caches the VA -> MA mapping
– Leverages the hardware page walker for TLB fills (walking the shadows)
– When the guest changes VA -> PA, the VMM updates the shadow page tables
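The two-level mapping and its shadow can be modeled as composed hash maps. This is a minimal sketch under stated assumptions: real shadow page tables are radix trees walked by hardware, invalidation is driven by write-protecting guest page tables, and the class/field names here are invented for illustration. What it shows is the essential contract: the shadow caches VA -> MA, and a guest remap must invalidate the stale shadow entry.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of shadow page tables (illustrative, not ESX's implementation).
// The guest maintains VA->PA; the VMM maintains PA->MA; the shadow caches
// the composed VA->MA mapping so lookups resolve in one level.
public class ShadowPageTable {
    public final Map<Long, Long> guestVaToPa = new HashMap<>();  // managed by guest OS
    public final Map<Long, Long> vmmPaToMa = new HashMap<>();    // managed by VMM
    public final Map<Long, Long> shadowVaToMa = new HashMap<>(); // built lazily

    // Resolve a guest virtual address; fill the shadow on first touch.
    public Long resolve(long va) {
        Long ma = shadowVaToMa.get(va);
        if (ma != null) return ma;          // fast path: one-level lookup
        Long pa = guestVaToPa.get(va);
        if (pa == null) return null;        // guest page fault
        ma = vmmPaToMa.get(pa);
        if (ma == null) return null;        // hidden fault: VMM must back the page
        shadowVaToMa.put(va, ma);           // cache the composed mapping
        return ma;
    }

    // When the guest changes VA->PA, the VMM must invalidate the shadow entry.
    public void guestRemap(long va, long newPa) {
        guestVaToPa.put(va, newPa);
        shadowVaToMa.remove(va);
    }

    public static void main(String[] args) {
        ShadowPageTable s = new ShadowPageTable();
        s.guestVaToPa.put(0x1000L, 0x2000L);
        s.vmmPaToMa.put(0x2000L, 0x9000L);
        System.out.println("MA = 0x" + Long.toHexString(s.resolve(0x1000L)));
    }
}
```

The invalidation in `guestRemap` is the expensive part in practice: it is why workloads with heavy page-table churn favored BT adaptation (slide 20) until nested page tables moved the composition into hardware (slide 24).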
24. 2nd Generation Hardware Assist:
Nested/Extended Page Tables
• The guest manages its own page tables (guest PT pointer): the VA -> PA mapping
• The VMM manages the nested page tables (nested PT pointer): the PA -> MA mapping
• On a TLB miss, the hardware walks both levels and fills the TLB with the combined VA -> MA translation
27. vSphere Virtualization Architecture
The virtual CPU abstraction is created by the “monitor”; each VM is an OS process.
The monitor supports:
– BT (Binary Translation)
– HW (Hardware assist)
– PV (Paravirtualization)
The VMkernel provides the scheduler, memory allocator, virtual switch, and file system, plus the virtual NIC and virtual SCSI devices. Memory is allocated by the OS and virtualized by the monitor. Network and I/O devices are emulated and proxied through native NIC and I/O drivers on the physical hardware.
28. Performance
Ability to satisfy performance demands, from the general population of apps up to mission-critical apps:
• ESX 2.x (2003): overhead 30-60%; 2 vCPUs; VM RAM 3.6 GB; phys RAM 64 GB; 16-core PCPUs; IOPS <10,000; network 380 Mb/s; monitor type: Binary Translation
• VI 3.0 (2005): overhead 20-40%; 2 vCPUs; VM RAM 16 GB; phys RAM 64 GB; 16-core PCPUs; IOPS 10,000; network 800 Mb/s; Gen-1 HW virtualization; monitor type: VT/SVM
• VI 3.5 (2007): overhead 10-30%; 4 vCPUs; VM RAM 64 GB; phys RAM 256 GB; 64-core PCPUs; IOPS 100,000; network 9 Gb/s; 64-bit OS support; Gen-2 HW virtualization; monitor type: NPT
• vSphere 4.0 (2009): overhead 2-15%; 8 vCPUs; VM RAM 255 GB; phys RAM 1 TB; 64-core PCPUs; IOPS 350,000; network 28 Gb/s; 64-bit OS support; 320 VMs per host; 512 vCPUs per host; monitor type: EPT
29. High Throughput Web Workloads
(SPECweb)
Overall response time is lower when CPU utilization is less than 100% due to multi-core offload
30. >95% of All Databases fit in a Virtual Machine
31. CPUs and Scheduling
[Diagram: guest/monitor pairs running on the VMkernel scheduler, which places virtual CPUs on physical CPUs]
o Schedules virtual CPUs on physical CPUs
o Virtual-time-based proportional-share CPU scheduler
o Flexible and accurate rate-based controls over CPU time allocations
o NUMA/processor/cache topology aware
o Provides graceful degradation in over-commitment situations
o High scalability with low scheduling latencies
o Fine-grained built-in accounting for workload observability
o Support for vSMP virtual machines
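A "virtual-time-based proportional-share" scheduler can be sketched stride-style: each vCPU's virtual time advances inversely to its shares, and the vCPU with the smallest virtual time runs next, so received CPU time converges to the share ratio. This is an illustrative model only, with invented names; the VMkernel's actual scheduler is far more involved (topology awareness, co-scheduling for vSMP, etc.).

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Stride-style proportional-share scheduler sketch (illustrative, not the
// VMkernel's algorithm). Higher shares => slower virtual time => more slices.
public class ProportionalShare {
    static class VCpu {
        final String name;
        final int shares;
        double virtualTime = 0; // advances inversely to shares
        long ran = 0;           // time slices received
        VCpu(String name, int shares) { this.name = name; this.shares = shares; }
    }

    public static void run(VCpu[] vcpus, int slices) {
        PriorityQueue<VCpu> ready =
            new PriorityQueue<>(Comparator.comparingDouble((VCpu v) -> v.virtualTime));
        for (VCpu v : vcpus) ready.add(v);
        for (int i = 0; i < slices; i++) {
            VCpu v = ready.poll();           // lowest virtual time runs next
            v.ran++;
            v.virtualTime += 1.0 / v.shares; // advance by the vCPU's stride
            ready.add(v);                    // re-queue after the mutation
        }
    }

    public static void main(String[] args) {
        VCpu a = new VCpu("vm-a", 2000), b = new VCpu("vm-b", 1000);
        run(new VCpu[]{a, b}, 3000);
        // With a 2:1 share ratio, vm-a gets roughly twice the slices of vm-b.
        System.out.println(a.name + " ran " + a.ran + ", " + b.name + " ran " + b.ran);
    }
}
```

Note the queue element is removed before its virtual time is mutated and then re-inserted; mutating a key while it sits in a priority queue would corrupt the ordering.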
32. VM Scheduling: How will multiple VMs operate?
• VM states:
– running (%used)
– waiting (%wait)
– ready to run (%ready)
• A VM goes to the “ready to run” state when:
– the guest wants to run or needs to be woken up (to deliver an interrupt), but
– all available CPU is busy running other VMs
33. Resource Controls: Performance SLA
• Reservation
– Minimum service-level guarantee (in MHz)
– Applies even when the system is overcommitted
– Must pass admission control
• Shares
– CPU entitlement is directly proportional to the VM's shares and depends on the total number of shares issued
– Abstract number; only the ratio matters
– Shares apply in the range between the reservation and the limit
• Limit
– Absolute upper bound on CPU entitlement (in MHz)
– Applies even when the system is not overcommitted
[Diagram: a scale from 0 MHz to total MHz, with the reservation as the floor, the limit as the ceiling, and shares arbitrating in between]
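How the three controls combine can be shown with a small worked calculation. This is a simplified sketch, not VMware's admission-control or scheduling math: the share-proportional term and the clamping order are an assumption made for illustration, and the method name is invented.

```java
// Sketch: reservation (floor), shares (pro-rata slice), limit (ceiling)
// combined into a CPU entitlement in MHz. Simplified for illustration.
public class Entitlement {
    public static double entitle(double totalMhz, double reservationMhz,
                                 double limitMhz, int shares, int totalShares) {
        double byShares = totalMhz * shares / (double) totalShares; // share-proportional slice
        double e = Math.max(byShares, reservationMhz); // reservation is guaranteed
        return Math.min(e, limitMhz);                  // limit caps the entitlement
    }

    public static void main(String[] args) {
        // 6000 MHz host; the VM holds 1000 of 3000 shares => 2000 MHz by shares.
        // The 500 MHz reservation doesn't bind, but the 1500 MHz limit caps it.
        System.out.println(entitle(6000, 500, 1500, 1000, 3000) + " MHz");
    }
}
```

Playing with the numbers makes the slide's bullets concrete: raise the reservation above the share slice and it becomes the binding term; remove the limit and shares alone decide.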
34. vSphere Memory Management
Thin provisioned (undercommitted): two guests each configured with 1 GB but each using only 200 MB (400 MB total used), so 2 GB of configured VMs on a 1 GB physical host is OK.
Overcommitted: the same two guests each actually using their full 1 GB; demand exceeds the 1 GB of physical memory, so the hypervisor must resort to paging and swapping to disk.
36. Application Memory Management
– Starts with no memory
– Allocates memory through syscalls to the operating system
– Often frees memory voluntarily, through syscalls
– Explicit memory allocation interface with the operating system
[Diagram: App on top of OS on top of Hypervisor]
37. Operating System Memory Management
– Assumes it owns all physical memory
– No memory allocation interface with hardware
• Does not explicitly allocate or free physical memory
– Defines the semantics of “allocated” and “free” memory
• Maintains “free” and “allocated” lists of physical memory
• Memory is “free” or “allocated” depending on which list it resides on
38. Hypervisor Memory Management
– Very similar to operating system memory management
• Assumes it owns all machine memory
• No memory allocation interface with hardware
• Maintains lists of “free” and “allocated” memory
39. VM Memory Allocation
– A VM starts with no physical memory allocated to it
– Physical memory is allocated on demand
• The guest OS will not explicitly allocate it
• Allocated on the VM's first access to the memory (read or write)
40. VM Memory Reclamation
• Guest physical memory is not “freed” in the typical sense
– The guest OS moves memory to its “free” list
– Data in “freed” memory may not have been modified
• The hypervisor isn't aware when the guest frees memory
– Freed memory's state is unchanged
– No access to the guest's “free” list
– Unsure when to reclaim “freed” guest memory
41. VM Memory Reclamation Cont’d
• Inside the VM, the guest OS:
– allocates and frees…
– and allocates and frees…
– and allocates and frees…
• Seen from the outside, the VM:
– allocates…
– and allocates…
– and allocates…
The hypervisor needs some way of reclaiming memory!
42. Ballooning
The guest OS manages its own memory; a balloon driver inside the guest gives the hypervisor implicit cooperation:
• Inflate balloon (+ pressure): the guest OS may free buffers or page out to its virtual disk to satisfy the balloon's allocations
• Deflate balloon (– pressure): the guest OS may grow buffers or page back in from its virtual disk
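The balloon's bookkeeping can be sketched in a few lines. This is an illustrative model, not the vmmemctl driver: the class and field names are invented, and the guest's option of paging to its virtual disk to satisfy inflation is deliberately not modeled (the balloon here only gets what is already free). What it shows is the core trade: pages the guest hands to the balloon become machine pages the hypervisor can safely reclaim, and deflation reverses that.

```java
// Sketch of balloon-driver cooperation (illustrative). The hypervisor cannot
// see the guest's free list, so it asks an in-guest balloon driver to pin
// guest pages; whatever the guest grants, the host can safely reclaim.
public class Balloon {
    public int guestFreePages;      // pages on the guest OS free list
    public int balloonPages = 0;    // pages pinned by the balloon driver
    public int reclaimedByHost = 0; // machine pages reclaimed by the hypervisor

    public Balloon(int guestFreePages) { this.guestFreePages = guestFreePages; }

    // Hypervisor raises pressure: the balloon inflates. In this sketch it only
    // gets what's free; a real guest might page out to grant the rest.
    public void inflate(int pages) {
        int granted = Math.min(pages, guestFreePages);
        guestFreePages -= granted;
        balloonPages += granted;
        reclaimedByHost += granted; // backing machine pages go back to the host
    }

    // Pressure drops: deflate and hand pages back to the guest.
    public void deflate(int pages) {
        int released = Math.min(pages, balloonPages);
        balloonPages -= released;
        guestFreePages += released;
        reclaimedByHost -= released;
    }

    public static void main(String[] args) {
        Balloon b = new Balloon(100);
        b.inflate(30);
        System.out.println("host reclaimed " + b.reclaimedByHost + " pages");
    }
}
```

The key property is that the guest, not the hypervisor, decides which pages to surrender, which is why ballooning reclaims memory so much more gracefully than hypervisor-level swapping.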
46. Java Heap Usage – Without Reservations
[Figure: the VM's configured size vs. actual VM usage; the JVM heap size is set with -Xmx]
47. Java Heap Usage – With VM Reservation
[Figure: the same chart, with a memory reservation (between 0 MB and the VM's total MB) set below the limit and sized to cover the JVM heap]
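The point of slide 47 is that the VM's memory reservation should cover everything the JVM will actually touch, so the heap never lands in hypervisor swap. A back-of-the-envelope sizing sketch, where the overhead figures are illustrative assumptions (not VMware or JVM guidance) and the method name is invented:

```java
// Back-of-the-envelope sizing sketch for a Java guest VM: set the vSphere
// memory reservation to at least heap (-Xmx) + JVM non-heap overhead
// (code cache, metadata, thread stacks) + guest OS working set.
// The overhead numbers in main() are illustrative assumptions.
public class VmSizing {
    public static long reservationMb(long heapMb, long jvmOverheadMb, long guestOsMb) {
        return heapMb + jvmOverheadMb + guestOsMb;
    }

    public static void main(String[] args) {
        // e.g. -Xmx1024m, ~256 MB assumed JVM overhead, ~512 MB assumed guest OS
        System.out.println(reservationMb(1024, 256, 512) + " MB reservation");
    }
}
```

The arithmetic is trivial; the discipline is not. A reservation sized only to -Xmx leaves the JVM's non-heap memory and the guest OS exposed to ballooning and host swapping, which is exactly what hurts garbage-collection pauses.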
48. Performance Measurement in a Virtual World
Traditionally, the OS was the authority. The operating system performs various roles:
– Application runtime libraries
– Resource management (CPU, memory, etc.)
– Hardware + driver management
• Performance and scalability of the OS were paramount
• Performance observability tools are a feature of the OS
49. Performance Measurement in a Virtual World
The OS becomes the “Application Library”, and the Hypervisor becomes the authority
50. Important Notes about Measuring Performance
• Resources measured from within the guest OS may not be accurate
– The OS is sharing physical resources with others
– CPU utilization is often under-reported (some CPU time is stolen by other guest OSes)
• Time measurements
– Coarse-grained time measurements are correct (if VMware Tools are installed/enabled)
– Fine-grained measurements are subject to jitter (don't try to measure sub-millisecond response times without special tools)
– CPU steal adds latency to non-CPU events being measured (e.g. I/O response times)
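One pragmatic way to cope with the jitter described above is to take many samples and report a robust statistic rather than trusting a single fine-grained reading. A minimal sketch (an illustrative technique, not a VMware tool; the class name is invented), using the median to damp samples distorted by CPU steal:

```java
import java.util.Arrays;

// Sketch: jitter-tolerant in-guest latency measurement. Individual nanosecond
// samples can be inflated by CPU steal, so sample repeatedly and take the
// median, which ignores outlier samples.
public class MedianTimer {
    public static long medianNanos(Runnable op, int samples) {
        long[] t = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            op.run();
            t[i] = System.nanoTime() - start;
        }
        Arrays.sort(t);
        return t[samples / 2]; // median damps outlier samples
    }

    public static void main(String[] args) {
        long med = medianNanos(() -> {
            long s = 0;
            for (int i = 0; i < 1000; i++) s += i; // arbitrary small workload
        }, 101);
        System.out.println("median ns: " + med);
    }
}
```

Even the median is only as trustworthy as the clock behind it, which is why the slide's advice stands: prefer coarse-grained measurements, and cross-check with host-side tools like esxtop.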
51. Tools for Performance Analysis
• Guest Tools: vmstat, mpstat, management tools
• VirtualCenter client (VI client):
– Per-host and per-cluster stats
– Graphical Interface
– Historical and Real-time data
• esxtop: per-host statistics
– Command-line tool found in the console-OS
• Java SDK
– Lets you collect only the statistics you want
52. Potential Impacts to Performance
• Virtual machine contributors to latency:
– CPU Overhead can contribute to latency (but it’s small!)
– Scheduling latency (VM runnable, but waiting…)
– Waiting for a global memory paging operation
– Disk Reads/Writes taking longer
• Virtual machine impacts to Throughput:
– Throughput ceiling if not enough resources allocated
– Throughput ceiling if not enough virtual CPU/Mem allocated
53. vSphere Instrumentation Points
[Diagram: instrumentation points at every layer of the stack: service console; guest (vCPU); monitor (vNIC, virtual NIC, virtual disk, virtual SCSI, VMHBA); VMkernel (scheduler, memory allocator, virtual switch, file system, TCP/IP); physical hardware (pCPU, pNIC, HBA, physical disk)]
54. VI Client
Chart controls: chart type, real-time vs. historical, object, counter type, rollup, stats type
55. CPU Capacity (screenshot from VI Client)
Some caveats on ready time:
• When ready time is comparable to used time, it may signal contention; however, the host might not be overcommitted, due to workload variability
• In this example there are periods of activity and idle periods, so the CPU isn't overcommitted all the time; during the idle periods ready time is less than used time
56. esxtop
What is esxtop ?
• Performance troubleshooting tool for an ESX host
• Displays performance statistics in row-and-column format
57. Performance Summary
• Use vSphere rather than Workstation/Fusion for any
performance testing
– Better performance from the scheduler, I/O, large pages, etc.
• vSphere will provide near-native performance
– Ensure resources are available (under-commit or use
controls)
– If I/O intensive, ensure shared storage is configured with
enough capacity
– Ensure VMware tools are installed
• Use the correct performance instrumentation
– vSphere or esxtop
58. Q&A
SpringOne 2GX 2010. All rights reserved. Do not distribute without permission.