1. Chicago, October 19 - 22, 2010
Virtualization Technical Deep Dive
Key Concepts for Developers
Richard McDougall - VMware
2. Virtualization Technical Deep Dive
We’ll be covering
• Virtualization Capabilities
• Workstation Virtualization
• How virtual machines work, and what the overhead is
• How Server Virtualization/Consolidation works
• Java and Consolidation on Server Virtualization
4. Three Properties of Virtualization
Partitioning
• Run multiple operating systems on one physical machine
• Fully utilize server resources
• Support high availability by clustering virtual machines
• Guarantee service levels
Isolation
• Isolate faults and security at the virtual-machine level
• Dynamically control CPU, memory, disk and network resources per virtual machine
Encapsulation
• Encapsulate the entire state of the virtual machine in hardware-independent files
• Save the virtual machine state as a snapshot in time
• Re-use or transfer whole virtual machines with a simple file copy
5. Virtualization for Desktops/Laptops
• Desktop products
– VMware Fusion and Workstation
• Features for Developers
– Run multiple OS versions concurrently
– Test server applications on your desktop/laptop
– Leverage the record/replay capability for debugging
6. Virtualization for Servers:
Problem: Underutilized Servers
Consolidation targets are often <30% utilized
• Windows average utilization: 5-8%
• Linux/Unix average: 10-35%
7. Initial Virtualization Benefits: Consolidation
BEFORE VMware: 1,000 servers; direct-attach storage; network with 3,000 cables/ports and 200 racks; facilities with 400 power whips
AFTER VMware: 80 servers; tiered SAN and NAS storage; network with 400 cables/ports and 10 racks; facilities with 20 power whips
8. Next Benefit: Simpler Management
VMotion Technology
VMotion Technology moves running virtual machines from one
host to another while maintaining continuous service availability
- Enables Resource Pools
- Enables High Availability
9. Pooling of Resources
Pools replace hosts as the primary compute abstraction
[Diagram: hosts aggregated into resource pools]
10. Automated Pool of Resources
[Diagram: vCenter turns an imbalanced cluster (heavy load on some hosts, lighter load on others) into a balanced cluster]
11. DRS Scalability – Transactions per Minute
(higher is better)
• An already-balanced cluster shows fewer gains
• Higher gains (> 40%) with more imbalance
14. “Hosted” Virtualization Architecture
The virtual CPU abstraction is created by the “monitor”. Each VM is an OS process on the host:

rmc$ ps -fp 4295
UID PID PPID C STIME TTY TIME CMD
0 4295 1 0 18:15.66 ?? 21:05.14 /Library/Application Support/VMware Fusion/vmware-vmx /Users/rmc/Documents/Virtual Machines/Windows XP Pro.vmwarevm/Windows XP Pro.vmx

The monitor supports:
– BT (Binary Translation)
– HW (Hardware assist)
– PV (Paravirtualization)

The VM’s configuration is a plain text file:

rmc$ more Windows XP Pro.vmx
virtualHW.version = "7"
memsize = "776"
ide0:0.fileName = "Windows XP Professional.vmdk"
ethernet0.connectionType = "nat"

Memory is allocated by the host OS and virtualized by the monitor. Network and I/O devices (virtual NIC, virtual SCSI) are emulated and proxied through native device drivers; the virtual disk (e.g. mydisk.vmdk) lives on the host's local file system.
15. Inside the Monitor: Classical Instruction Virtualization
Trap-and-emulate
Nonvirtualized (“native”) system:
– The OS runs in privileged mode (Ring 0) and “owns” the hardware
– Application code (Ring 3) has less privilege
Virtualized:
– The VMM is most privileged (Ring 0), for isolation
– Classical “ring compression” or “de-privileging”:
• Run the guest OS kernel in Ring 1; apps stay in Ring 3
• Privileged instructions trap and are emulated by the VMM
– But: this does not work on classic x86 (some sensitive instructions execute silently instead of trapping)
16. Binary Translation of Guest Code
Translate guest kernel code
Replace privileged instructions with safe “equivalent” instruction sequences
No need for traps
BT is an extremely powerful technology
– Permits any unmodified x86 OS to run in a VM
– Can virtualize any instruction set
17. Combining BT and Direct Execution
Direct Execution
(user mode guest code)
Faults, syscalls
interrupts
VMM
IRET, sysret
Binary Translation
(kernel mode guest code)
18. BT Mechanics
Each translator invocation
– Consume one input basic block (guest code)
– Produce one output basic block
Store output in translation cache
– Future reuse
– Amortize translation costs
– Guest-transparent: no patching “in place”
[Diagram: the guest's input basic block flows through the translator into a translated basic block stored in the translation cache]
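The translator-plus-cache loop above can be sketched in a few lines of Java. This is a toy model under stated assumptions: the instruction names, the "translation" (tagging privileged instructions with emulation callouts), and the per-PC keying are all illustrative, not VMware's implementation; the point is the memoization that amortizes translation cost without patching guest code in place.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of a BT translation cache (names are illustrative).
// A basic block of guest code is translated once; the result is cached so
// future executions of the same block reuse the translation.
public class TranslationCache {
    private final Map<Long, String> cache = new HashMap<>(); // guest PC -> translated block
    private int translations = 0; // how many times the translator actually ran

    // Translate one input basic block into a "safe" output block. Here we just
    // tag privileged instructions; real BT rewrites them into emulation code.
    private String translate(String guestBlock) {
        translations++;
        return guestBlock.replace("cli", "call emulate_cli")
                         .replace("popf", "call emulate_popf");
    }

    // Look up the translated block, translating on first use. The guest's own
    // code is never patched in place (guest-transparent).
    public String lookup(long guestPc, String guestBlock) {
        return cache.computeIfAbsent(guestPc, pc -> translate(guestBlock));
    }

    public int translationCount() { return translations; }

    public static void main(String[] args) {
        TranslationCache tc = new TranslationCache();
        String block = "mov eax, ebx; cli; ret";
        System.out.println(tc.lookup(0x1000L, block)); // miss: translator runs
        System.out.println(tc.lookup(0x1000L, block)); // hit: reused from cache
        System.out.println("translator invocations: " + tc.translationCount());
    }
}
```

Running the same block twice invokes the translator only once; that reuse is what makes BT cheap for hot code paths.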
19. Intel VT / AMD-V: 1st Generation HW Support
• Key feature: root vs. guest CPU mode
– The VMM executes in root mode
– The guest (OS in Ring 0, apps in Ring 3) executes in guest mode
• The VMM and guest run as “co-routines”:
– VM enter
– Guest runs
– A while later: VM exit
– VMM runs
– ...
20. Qualitative Comparison of BT and VT-x/AMD-V
• VT-x/AMD-V loses on:
– exits (costlier than “callouts”)
– no adaptation (cannot eliminate exits)
– page table updates
– memory-mapped I/O
– IN/OUT instructions
• VT-x/AMD-V wins on:
– system calls
– almost all code runs “directly”
– no traps for privileged instructions
• BT loses on:
– system calls
– translator overheads
– path lengthening
– indirect control flow
• BT wins on:
– page table updates (adaptation)
– memory-mapped I/O (adaptation)
– IN/OUT instructions
21. Can I Virtualize CPU-Intensive Applications?
Most CPU-intensive applications have very low overhead
VMware ESX 3.x compared to native:
• SPECcpu results are covered in the paper by K. Adams and O. Agesen
• WebSphere results published jointly by IBM and VMware
• SPECjbb results from recent internal measurements
22. Virtualizing Virtual Memory
[Diagram: per-VM processes with virtual memory (VA) mapping to guest physical memory (PA), which maps to machine memory (MA)]
• To run multiple VMs on a single system, another level of memory virtualization must be done
– The guest OS still controls the virtual-to-physical mapping: VA -> PA
– The guest OS has no direct access to machine memory (to enforce isolation)
• The VMM maps guest physical memory to actual machine memory: PA -> MA
23. Virtualizing Virtual Memory: Shadow Page Tables
• The VMM builds “shadow page tables” to accelerate the mappings
– The shadow directly maps VA -> MA
– Avoids doing two levels of translation on every access
– The TLB caches the VA -> MA mapping
– Leverages the hardware page walker for TLB fills (walking the shadows)
– When the guest changes VA -> PA, the VMM updates the shadow page tables
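The two-level mapping and its shadow can be modeled as composed hash maps. This is a minimal sketch under stated assumptions: real shadow page tables are radix trees walked by hardware, invalidation is driven by write-protecting guest page tables, and the class/field names here are invented for illustration. What it shows is the essential contract: the shadow caches VA -> MA, and a guest remap must invalidate the stale shadow entry.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of shadow page tables (illustrative, not ESX's implementation).
// The guest maintains VA->PA; the VMM maintains PA->MA; the shadow caches
// the composed VA->MA mapping so lookups resolve in one level.
public class ShadowPageTable {
    public final Map<Long, Long> guestVaToPa = new HashMap<>();  // managed by guest OS
    public final Map<Long, Long> vmmPaToMa = new HashMap<>();    // managed by VMM
    public final Map<Long, Long> shadowVaToMa = new HashMap<>(); // built lazily

    // Resolve a guest virtual address; fill the shadow on first touch.
    public Long resolve(long va) {
        Long ma = shadowVaToMa.get(va);
        if (ma != null) return ma;          // fast path: one-level lookup
        Long pa = guestVaToPa.get(va);
        if (pa == null) return null;        // guest page fault
        ma = vmmPaToMa.get(pa);
        if (ma == null) return null;        // hidden fault: VMM must back the page
        shadowVaToMa.put(va, ma);           // cache the composed mapping
        return ma;
    }

    // When the guest changes VA->PA, the VMM must invalidate the shadow entry.
    public void guestRemap(long va, long newPa) {
        guestVaToPa.put(va, newPa);
        shadowVaToMa.remove(va);
    }

    public static void main(String[] args) {
        ShadowPageTable s = new ShadowPageTable();
        s.guestVaToPa.put(0x1000L, 0x2000L);
        s.vmmPaToMa.put(0x2000L, 0x9000L);
        System.out.println("MA = 0x" + Long.toHexString(s.resolve(0x1000L)));
    }
}
```

The invalidation in `guestRemap` is the expensive part in practice: it is why workloads with heavy page-table churn favored BT adaptation (slide 20) until nested page tables moved the composition into hardware (slide 24).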
24. 2nd Generation Hardware Assist:
Nested/Extended Page Tables
• The guest manages its own page tables (guest PT pointer): the VA -> PA mapping
• The VMM manages the nested page tables (nested PT pointer): the PA -> MA mapping
• On a TLB miss, the hardware walks both levels and fills the TLB with the combined VA -> MA translation
27. vSphere Virtualization Architecture
The virtual CPU abstraction is created by the “monitor”; each VM is an OS process.
The monitor supports:
– BT (Binary Translation)
– HW (Hardware assist)
– PV (Paravirtualization)
The VMkernel provides the scheduler, memory allocator, virtual switch, and file system, plus the virtual NIC and virtual SCSI devices. Memory is allocated by the OS and virtualized by the monitor. Network and I/O devices are emulated and proxied through native NIC and I/O drivers on the physical hardware.
28. Performance
Ability to satisfy performance demands, from the general population of apps up to mission-critical apps:
• ESX 2.x (2003): overhead 30-60%; 2 vCPUs; VM RAM 3.6 GB; phys RAM 64 GB; 16-core PCPUs; IOPS <10,000; network 380 Mb/s; monitor type: Binary Translation
• VI 3.0 (2005): overhead 20-40%; 2 vCPUs; VM RAM 16 GB; phys RAM 64 GB; 16-core PCPUs; IOPS 10,000; network 800 Mb/s; Gen-1 HW virtualization; monitor type: VT/SVM
• VI 3.5 (2007): overhead 10-30%; 4 vCPUs; VM RAM 64 GB; phys RAM 256 GB; 64-core PCPUs; IOPS 100,000; network 9 Gb/s; 64-bit OS support; Gen-2 HW virtualization; monitor type: NPT
• vSphere 4.0 (2009): overhead 2-15%; 8 vCPUs; VM RAM 255 GB; phys RAM 1 TB; 64-core PCPUs; IOPS 350,000; network 28 Gb/s; 64-bit OS support; 320 VMs per host; 512 vCPUs per host; monitor type: EPT
29. High Throughput Web Workloads
(SPECweb)
Overall response time is lower when CPU utilization is less than 100% due to multi-core offload
30. >95% of All Databases fit in a Virtual Machine
31. CPUs and Scheduling
[Diagram: guest/monitor pairs running on the VMkernel scheduler, which places virtual CPUs on physical CPUs]
o Schedules virtual CPUs on physical CPUs
o Virtual-time-based proportional-share CPU scheduler
o Flexible and accurate rate-based controls over CPU time allocations
o NUMA/processor/cache topology aware
o Provides graceful degradation in over-commitment situations
o High scalability with low scheduling latencies
o Fine-grained built-in accounting for workload observability
o Support for vSMP virtual machines
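A "virtual-time-based proportional-share" scheduler can be sketched stride-style: each vCPU's virtual time advances inversely to its shares, and the vCPU with the smallest virtual time runs next, so received CPU time converges to the share ratio. This is an illustrative model only, with invented names; the VMkernel's actual scheduler is far more involved (topology awareness, co-scheduling for vSMP, etc.).

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Stride-style proportional-share scheduler sketch (illustrative, not the
// VMkernel's algorithm). Higher shares => slower virtual time => more slices.
public class ProportionalShare {
    static class VCpu {
        final String name;
        final int shares;
        double virtualTime = 0; // advances inversely to shares
        long ran = 0;           // time slices received
        VCpu(String name, int shares) { this.name = name; this.shares = shares; }
    }

    public static void run(VCpu[] vcpus, int slices) {
        PriorityQueue<VCpu> ready =
            new PriorityQueue<>(Comparator.comparingDouble((VCpu v) -> v.virtualTime));
        for (VCpu v : vcpus) ready.add(v);
        for (int i = 0; i < slices; i++) {
            VCpu v = ready.poll();           // lowest virtual time runs next
            v.ran++;
            v.virtualTime += 1.0 / v.shares; // advance by the vCPU's stride
            ready.add(v);                    // re-queue after the mutation
        }
    }

    public static void main(String[] args) {
        VCpu a = new VCpu("vm-a", 2000), b = new VCpu("vm-b", 1000);
        run(new VCpu[]{a, b}, 3000);
        // With a 2:1 share ratio, vm-a gets roughly twice the slices of vm-b.
        System.out.println(a.name + " ran " + a.ran + ", " + b.name + " ran " + b.ran);
    }
}
```

Note the queue element is removed before its virtual time is mutated and then re-inserted; mutating a key while it sits in a priority queue would corrupt the ordering.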
32. VM Scheduling: How will multiple VMs operate?
• VM states:
– running (%used)
– waiting (%wait)
– ready to run (%ready)
• A VM goes to the “ready to run” state when:
– the guest wants to run or needs to be woken up (to deliver an interrupt), but
– all available CPU is busy running other VMs
33. Resource Controls: Performance SLA
• Reservation
– Minimum service-level guarantee (in MHz)
– Applies even when the system is overcommitted
– Must pass admission control
• Shares
– CPU entitlement is directly proportional to the VM's shares and depends on the total number of shares issued
– Abstract number; only the ratio matters
– Shares apply in the range between the reservation and the limit
• Limit
– Absolute upper bound on CPU entitlement (in MHz)
– Applies even when the system is not overcommitted
[Diagram: a scale from 0 MHz to total MHz, with the reservation as the floor, the limit as the ceiling, and shares arbitrating in between]
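How the three controls combine can be shown with a small worked calculation. This is a simplified sketch, not VMware's admission-control or scheduling math: the share-proportional term and the clamping order are an assumption made for illustration, and the method name is invented.

```java
// Sketch: reservation (floor), shares (pro-rata slice), limit (ceiling)
// combined into a CPU entitlement in MHz. Simplified for illustration.
public class Entitlement {
    public static double entitle(double totalMhz, double reservationMhz,
                                 double limitMhz, int shares, int totalShares) {
        double byShares = totalMhz * shares / (double) totalShares; // share-proportional slice
        double e = Math.max(byShares, reservationMhz); // reservation is guaranteed
        return Math.min(e, limitMhz);                  // limit caps the entitlement
    }

    public static void main(String[] args) {
        // 6000 MHz host; the VM holds 1000 of 3000 shares => 2000 MHz by shares.
        // The 500 MHz reservation doesn't bind, but the 1500 MHz limit caps it.
        System.out.println(entitle(6000, 500, 1500, 1000, 3000) + " MHz");
    }
}
```

Playing with the numbers makes the slide's bullets concrete: raise the reservation above the share slice and it becomes the binding term; remove the limit and shares alone decide.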
34. vSphere Memory Management
Thin provisioned (undercommitted): two guests each configured with 1 GB but each using only 200 MB (400 MB total used), so 2 GB of configured VMs on a 1 GB physical host is OK.
Overcommitted: the same two guests each actually using their full 1 GB; demand exceeds the 1 GB of physical memory, so the hypervisor must resort to paging and swapping to disk.
36. Application Memory Management
– Starts with no memory
– Allocates memory through syscalls to the operating system
– Often frees memory voluntarily, through syscalls
– Explicit memory allocation interface with the operating system
[Diagram: App on top of OS on top of Hypervisor]
37. Operating System Memory Management
– Assumes it owns all physical memory
– No memory allocation interface with hardware
• Does not explicitly allocate or free physical memory
– Defines the semantics of “allocated” and “free” memory
• Maintains “free” and “allocated” lists of physical memory
• Memory is “free” or “allocated” depending on which list it resides on
38. Hypervisor Memory Management
– Very similar to operating system memory management
• Assumes it owns all machine memory
• No memory allocation interface with hardware
• Maintains lists of “free” and “allocated” memory
39. VM Memory Allocation
– A VM starts with no physical memory allocated to it
– Physical memory is allocated on demand
• The guest OS will not explicitly allocate it
• Allocated on the VM's first access to the memory (read or write)
40. VM Memory Reclamation
• Guest physical memory is not “freed” in the typical sense
– The guest OS moves memory to its “free” list
– Data in “freed” memory may not have been modified
• The hypervisor isn't aware when the guest frees memory
– Freed memory's state is unchanged
– No access to the guest's “free” list
– Unsure when to reclaim “freed” guest memory
41. VM Memory Reclamation Cont’d
• Inside the VM, the guest OS:
– allocates and frees…
– and allocates and frees…
– and allocates and frees…
• Seen from the outside, the VM:
– allocates…
– and allocates…
– and allocates…
The hypervisor needs some way of reclaiming memory!
42. Ballooning
The guest OS manages its own memory; a balloon driver inside the guest gives the hypervisor implicit cooperation:
• Inflate balloon (+ pressure): the guest OS may free buffers or page out to its virtual disk to satisfy the balloon's allocations
• Deflate balloon (– pressure): the guest OS may grow buffers or page back in from its virtual disk
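The balloon's bookkeeping can be sketched in a few lines. This is an illustrative model, not the vmmemctl driver: the class and field names are invented, and the guest's option of paging to its virtual disk to satisfy inflation is deliberately not modeled (the balloon here only gets what is already free). What it shows is the core trade: pages the guest hands to the balloon become machine pages the hypervisor can safely reclaim, and deflation reverses that.

```java
// Sketch of balloon-driver cooperation (illustrative). The hypervisor cannot
// see the guest's free list, so it asks an in-guest balloon driver to pin
// guest pages; whatever the guest grants, the host can safely reclaim.
public class Balloon {
    public int guestFreePages;      // pages on the guest OS free list
    public int balloonPages = 0;    // pages pinned by the balloon driver
    public int reclaimedByHost = 0; // machine pages reclaimed by the hypervisor

    public Balloon(int guestFreePages) { this.guestFreePages = guestFreePages; }

    // Hypervisor raises pressure: the balloon inflates. In this sketch it only
    // gets what's free; a real guest might page out to grant the rest.
    public void inflate(int pages) {
        int granted = Math.min(pages, guestFreePages);
        guestFreePages -= granted;
        balloonPages += granted;
        reclaimedByHost += granted; // backing machine pages go back to the host
    }

    // Pressure drops: deflate and hand pages back to the guest.
    public void deflate(int pages) {
        int released = Math.min(pages, balloonPages);
        balloonPages -= released;
        guestFreePages += released;
        reclaimedByHost -= released;
    }

    public static void main(String[] args) {
        Balloon b = new Balloon(100);
        b.inflate(30);
        System.out.println("host reclaimed " + b.reclaimedByHost + " pages");
    }
}
```

The key property is that the guest, not the hypervisor, decides which pages to surrender, which is why ballooning reclaims memory so much more gracefully than hypervisor-level swapping.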
46. Java Heap Usage – Without Reservations
[Figure: the VM's configured size vs. actual VM usage; the JVM heap size is set with -Xmx]
47. Java Heap Usage – With VM Reservation
[Figure: the same chart, with a memory reservation (between 0 MB and the VM's total MB) set below the limit and sized to cover the JVM heap]
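The point of slide 47 is that the VM's memory reservation should cover everything the JVM will actually touch, so the heap never lands in hypervisor swap. A back-of-the-envelope sizing sketch, where the overhead figures are illustrative assumptions (not VMware or JVM guidance) and the method name is invented:

```java
// Back-of-the-envelope sizing sketch for a Java guest VM: set the vSphere
// memory reservation to at least heap (-Xmx) + JVM non-heap overhead
// (code cache, metadata, thread stacks) + guest OS working set.
// The overhead numbers in main() are illustrative assumptions.
public class VmSizing {
    public static long reservationMb(long heapMb, long jvmOverheadMb, long guestOsMb) {
        return heapMb + jvmOverheadMb + guestOsMb;
    }

    public static void main(String[] args) {
        // e.g. -Xmx1024m, ~256 MB assumed JVM overhead, ~512 MB assumed guest OS
        System.out.println(reservationMb(1024, 256, 512) + " MB reservation");
    }
}
```

The arithmetic is trivial; the discipline is not. A reservation sized only to -Xmx leaves the JVM's non-heap memory and the guest OS exposed to ballooning and host swapping, which is exactly what hurts garbage-collection pauses.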
48. Performance Measurement in a Virtual World
Traditionally, the OS was the authority. The operating system performs various roles:
– Application runtime libraries
– Resource management (CPU, memory, etc.)
– Hardware + driver management
• Performance and scalability of the OS were paramount
• Performance observability tools are a feature of the OS
49. Performance Measurement in a Virtual World
The OS becomes the “Application Library”, and the Hypervisor becomes the authority
50. Important Notes about Measuring Performance
• Resources measured from within the guest OS may not be accurate
– The OS is sharing physical resources with others
– CPU utilization is often under-reported (some CPU time is stolen by other guest OSes)
• Time measurements
– Coarse-grained time measurements are correct (if VMware Tools are installed/enabled)
– Fine-grained measurements are subject to jitter (don't try to measure sub-millisecond response times without special tools)
– CPU steal adds latency to non-CPU events being measured (e.g. I/O response times)
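One pragmatic way to cope with the jitter described above is to take many samples and report a robust statistic rather than trusting a single fine-grained reading. A minimal sketch (an illustrative technique, not a VMware tool; the class name is invented), using the median to damp samples distorted by CPU steal:

```java
import java.util.Arrays;

// Sketch: jitter-tolerant in-guest latency measurement. Individual nanosecond
// samples can be inflated by CPU steal, so sample repeatedly and take the
// median, which ignores outlier samples.
public class MedianTimer {
    public static long medianNanos(Runnable op, int samples) {
        long[] t = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            op.run();
            t[i] = System.nanoTime() - start;
        }
        Arrays.sort(t);
        return t[samples / 2]; // median damps outlier samples
    }

    public static void main(String[] args) {
        long med = medianNanos(() -> {
            long s = 0;
            for (int i = 0; i < 1000; i++) s += i; // arbitrary small workload
        }, 101);
        System.out.println("median ns: " + med);
    }
}
```

Even the median is only as trustworthy as the clock behind it, which is why the slide's advice stands: prefer coarse-grained measurements, and cross-check with host-side tools like esxtop.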
51. Tools for Performance Analysis
• Guest Tools: vmstat, mpstat, management tools
• VirtualCenter client (VI client):
– Per-host and per-cluster stats
– Graphical Interface
– Historical and Real-time data
• esxtop: per-host statistics
– Command-line tool found in the console-OS
• Java SDK
– Lets you collect only the statistics you want
52. Potential Impacts to Performance
• Virtual machine contributors to latency:
– CPU Overhead can contribute to latency (but it’s small!)
– Scheduling latency (VM runnable, but waiting…)
– Waiting for a global memory paging operation
– Disk Reads/Writes taking longer
• Virtual machine impacts to Throughput:
– Throughput ceiling if not enough resources allocated
– Throughput ceiling if not enough virtual CPU/Mem allocated
53. vSphere Instrumentation Points
[Diagram: instrumentation points at every layer of the stack: service console; guest (vCPU); monitor (vNIC, virtual NIC, virtual disk, virtual SCSI, VMHBA); VMkernel (scheduler, memory allocator, virtual switch, file system, TCP/IP); physical hardware (pCPU, pNIC, HBA, physical disk)]
54. VI Client
Chart controls: chart type, real-time vs. historical, object, counter type, rollup, stats type
55. CPU Capacity (screenshot from VI Client)
Some caveats on ready time:
• When ready time is comparable to used time, it may signal contention; however, the host might not be overcommitted, due to workload variability
• In this example there are periods of activity and idle periods, so the CPU isn't overcommitted all the time; during the idle periods ready time is less than used time
56. esxtop
What is esxtop ?
• Performance troubleshooting tool for an ESX host
• Displays performance statistics in row-and-column format
57. Performance Summary
• Use vSphere rather than Workstation/Fusion for any
performance testing
– Better performance from the scheduler, I/O, large pages, etc.
• vSphere will provide near-native performance
– Ensure resources are available (under-commit or use
controls)
– If I/O intensive, ensure shared storage is configured with
enough capacity
– Ensure VMware tools are installed
• Use the correct performance instrumentation
– vSphere or esxtop
58. Q&A
SpringOne 2GX 2010. All rights reserved. Do not distribute without permission.