2. Agenda
• Server Virtualization technologies (~15 min)
− Overview and history
− VMM architectures
− Criteria for a processor to be virtualizable
• X86 Virtualization (~30 min)
− The x86 processor architecture overview
− Virtualization challenges in x86 processors
• Break 1 – Q&A
• Software techniques for virtualization (~45 min)
− CPU virtualization (Binary Translation/Para-virtualization)
− Memory virtualization (shadow tables/Xen writeable page tables)
− I/O virtualization (device emulation)
• Break 2 – Q&A
• Hardware techniques for virtualization (~45 min)
− CPU virtualization (VT-x/AMD-V)
− Memory virtualization (Intel EPT/AMD NPT)
− I/O virtualization (VT-d/VT-d2/PCI SIG SR-IOV/MR-IOV)
• Future Trends (~5 min)
− Manageability
− Security
Did you ever wonder if the person in the puddle is real, and you're just
a reflection of him? ~Calvin and Hobbes
2 16 January 2009
3. Server Virtualization Technologies
[Figure: four approaches side by side along an axis running from isolation to flexibility. Each column stacks applications (APP1, APP2) and operating systems (OS1, OS2) over the software layer and the hardware CPUs and memory.]
• Hardware Partitioning
• Software/Firmware Partitioning (hypervisor layer in software/firmware)
• Software/Firmware Virtualization (hypervisor layer in software/firmware)
• Resource Virtualization (APP1 and APP2 share a single OS)
Products named on the slide: HP nPar, HP vPar, HP Integrity VM, HP-UX SRP, Sun DSD, IBM DLPAR, IBM SLPARs (micro-partitions), Solaris Containers (Zones), Sun Logical Domains, Hitachi Virtage, PVC (earlier SWSoft), VMware ESX/GSX, OpenVZ, Microsoft Hyper-V, IBM WPAR, Xen, KVM, xVM…
4. A brief history lesson
1960s: IBM VM/370 on an IBM mainframe, hosting CMS and MVS guests and their apps
• VMM on IBM Mainframe
• Many apps on $$$ HW
1996: Stanford Research; VMware on an Intel / AMD x86 server, hosting W2K3, W2K, WNT4 and Linux guests and their apps
• DISCO project
• VMM on cheap x86 HW
• VMware in 1999
Commodity hardware becomes powerful enough to support a virtual machine monitor (VMM), so it’s back to the future with a proven technology!
6. Hosted VM Architecture
HP Integrity VM, Microsoft Virtual Server, VMware GSX
7. Virtualization Requirements – Popek and Goldberg
• A Model of Third Generation Machines
− Two modes of execution
− Protection mechanism for the supervisor mode
− A method to automatically signal the supervisor when the VM executes a sensitive instruction
• Properties for a Virtual Machine Monitor
− Equivalence
− Resource control
− Efficiency
8. VMM Requirements (Sensitive Instructions)
Ref : Analyzing the Intel Pentium’s ability to support a secure VMM – John Scott Robin (1999)
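The Popek and Goldberg criterion can be phrased as a set test: an architecture is classically virtualizable when every sensitive instruction is also privileged, so that it traps when run outside supervisor mode. The sketch below applies that test to a small, hand-picked table. The struct, the function name, and the five-row sample are illustrative assumptions, not a complete classification of x86.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One row per instruction: does it read or change privileged state
 * (sensitive), and does it trap outside supervisor mode (privileged)? */
struct insn_class {
    const char *name;
    bool sensitive;
    bool privileged;
};

/* Illustrative sample only; the classification of SGDT and POPF follows
 * the pre-VT x86 behaviour discussed in these slides. */
static const struct insn_class x86_sample[] = {
    { "LGDT",    true,  true  },
    { "MOV CR3", true,  true  },
    { "SGDT",    true,  false },  /* reads GDTR without trapping */
    { "POPF",    true,  false },  /* drops IF changes silently in user mode */
    { "ADD",     false, false },
};

/* Returns true when every sensitive instruction is also privileged.
 * If 'offender' is non-NULL it receives the first violating name. */
bool is_classically_virtualizable(const struct insn_class *set, size_t n,
                                  const char **offender)
{
    assert(set != NULL);
    for (size_t i = 0; i < n; i++) {
        if (set[i].sensitive && !set[i].privileged) {
            if (offender)
                *offender = set[i].name;
            return false;
        }
    }
    return true;
}
```

Run over the whole sample, the check fails at SGDT; restricted to the first two rows, it passes.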
10. X86 architecture – Privilege Levels
Data structures that contain privilege levels
• DPL: Descriptor Privilege Level
• CPL: Current Privilege Level
− DPL of the access rights byte in the CS segment descriptor cache register
− Privilege level of the code and data segment for the current task
• RPL: Requested Privilege Level
− The privilege level of the new selector loaded into a segment register
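As a rough illustration of how the three levels interact, the sketch below models the check the processor applies when a data segment selector is loaded: the numerically larger (less privileged) of CPL and RPL must not exceed the descriptor's DPL. The function name is hypothetical, and conforming code segments, stack segments, and gates follow different rules that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Privilege levels run from 0 (most privileged) to 3 (least privileged).
 * Models only the data-segment rule: max(CPL, RPL) <= DPL. */
bool data_segment_access_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
{
    assert(cpl <= 3 && rpl <= 3 && dpl <= 3);
    unsigned effective = cpl > rpl ? cpl : rpl;  /* weaker of CPL and RPL */
    return effective <= dpl;
}
```

For example, ring 0 code using a selector with RPL 3 is refused access to a DPL 0 segment, which is how RPL lets a kernel act on behalf of a less privileged caller.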
12. X86 memory management - segmentation
• Upper 13 bits of the segment selector are used to index the descriptor table (located through GDTR, LDTR)
• TI = Table Indicator, selects the descriptor table
− 0 = Global Descriptor Table
− 1 = Local Descriptor Table
• Segment register layout: selector (visible part), followed by segment base, segment limit and access rights (hidden part of segment register)
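The selector layout described above can be decoded with a few shifts and masks. This is a minimal sketch; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 16-bit x86 segment selector. */
struct selector_fields {
    unsigned index;  /* bits 15..3: index into the descriptor table */
    unsigned ti;     /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;    /* bits 1..0: requested privilege level */
};

struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = sel >> 3;
    f.ti    = (sel >> 2) & 1u;
    f.rpl   = sel & 3u;
    assert(f.index < 8192);  /* 13 bits give at most 8192 descriptors */
    return f;
}
```

For instance, selector 0x001B decodes to index 3 in the GDT with RPL 3.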
13. X86 Paging – 32 bit mode
Page Table
Page Table Entry
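For reference, a sketch of how a 32-bit linear address is split under classic two-level paging with 4 KB pages (no PAE, no large pages): 10 bits select the page directory entry, 10 bits select the page table entry, and 12 bits are the offset within the page. The names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

struct linear_parts {
    unsigned pd_index;  /* bits 31..22: page directory entry */
    unsigned pt_index;  /* bits 21..12: page table entry */
    unsigned offset;    /* bits 11..0: byte offset within the 4 KB page */
};

struct linear_parts split_linear_32(uint32_t linear)
{
    struct linear_parts p;
    p.pd_index = linear >> 22;
    p.pt_index = (linear >> 12) & 0x3FFu;
    p.offset   = linear & 0xFFFu;
    assert(p.pd_index < 1024 && p.pt_index < 1024);
    return p;
}
```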
17. X86 virtualization challenges
[Figure: guest apps run in ring 3, the deprivileged guest OS in ring 1, and the VMM in ring 0 above the hardware. Instructions shown: CPUID and Sysenter in ring 3; POPF, CLI/STI, SGDT/SIDT/SLDT/STR/PUSHF/SMSW/POP/PUSH, and LAR/LSL/VERR/VERW/CALL/INT/JMP/RET in ring 1.]
Challenges called out on the slide:
• Non-faulting read of privileged registers (3B1)
• Non-faulting write to privileged state (eflags.IF) (3B1)
• Incorrect execution when run in ring level > 0 (3C1)
• Leakage of privilege level (3C1)
• Excessive faulting
• Ring aliasing/compression
• Address space compression
• Segment reversibility issue on context switch
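The non-faulting write to eflags.IF can be illustrated with a small model of POPF. This is a simplified sketch that tracks only the IF and IOPL fields in protected mode: IF changes only when CPL <= IOPL, IOPL changes only at CPL 0, and in neither case does the instruction fault. The function name is hypothetical and reserved or system flag bits are not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_IF          (1u << 9)
#define EFLAGS_IOPL_SHIFT  12
#define EFLAGS_IOPL_MASK   (3u << EFLAGS_IOPL_SHIFT)

/* Returns the flags value after POPF pops 'popped' at privilege level
 * 'cpl'. Protected bits that the caller may not change keep their old
 * value silently; no exception is raised. */
uint32_t popf_model(unsigned cpl, uint32_t old_flags, uint32_t popped)
{
    assert(cpl <= 3);
    unsigned iopl = (old_flags & EFLAGS_IOPL_MASK) >> EFLAGS_IOPL_SHIFT;
    /* Simplification: treat every other bit as freely writable. */
    uint32_t writable = ~(EFLAGS_IF | EFLAGS_IOPL_MASK);
    if (cpl <= iopl)
        writable |= EFLAGS_IF;
    if (cpl == 0)
        writable |= EFLAGS_IOPL_MASK;
    return (old_flags & ~writable) | (popped & writable);
}
```

A guest kernel deprivileged to ring 1 with IOPL 0 that tries to clear IF this way keeps IF set and never traps, so the VMM gets no chance to emulate the change.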
19. Dynamic Binary Translation
[Figure: dynamic binary translation flow. An x86 binary on disk passes through the x86 parser & high level translator, high level optimization, low level code generation, and low level optimization and scheduling. Translated code is held in code caches (on disk and in RAM, with tags) next to the data in RAM, and the translator runtime executes it.]
Ref : Virtual Machines and Dynamic Translation: Implementing ISAs in Software – Joel Emer, Massachusetts Institute of Technology
20. Binary Translation - C Code Example
int isPrime(int a) {
    for (int i = 2; i < a; i++) {
        if (a % i == 0) return 0;
    }
    return 1;
}
Ref : Keith Adams and Ole Agesen. A comparison of software and hardware techniques for x86
virtualization. Operating Systems Review, 40(5):2–13, December 2006
21. Basic Block Translation
• Most instructions copied identically.
• Privileged instructions must be emulated.
• Jumps must be translated since translation can alter code layout.
• Each translated BB must end with jump to next translated BB.
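The four rules above can be sketched with a toy translator. Everything here is hypothetical: the guest and host instruction sets are invented for illustration, a real translator works on x86 bytes, and the jump target is kept as a guest block id that a runtime would later resolve to a translated block.

```c
#include <assert.h>
#include <stddef.h>

/* Toy guest ISA: two ordinary instructions, one privileged one, one jump. */
enum guest_op { G_MOV, G_ADD, G_CLI, G_JMP };

/* Toy translated ISA. */
enum host_op { H_MOV, H_ADD, H_CALL_EMULATE_CLI, H_JMP_TO_TRANSLATED_BB };

struct guest_insn { enum guest_op op; int target; /* used by G_JMP only */ };
struct host_insn  { enum host_op op;  int target; };

/* Translates one basic block: stops after the first jump.
 * Returns the number of translated instructions written to 'out'. */
size_t translate_bb(const struct guest_insn *in, size_t n_in,
                    struct host_insn *out, size_t out_cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n_in && n < out_cap; i++) {
        out[n].target = 0;
        switch (in[i].op) {
        case G_MOV:  /* ordinary instruction: copied identically */
            out[n++].op = H_MOV;
            break;
        case G_ADD:
            out[n++].op = H_ADD;
            break;
        case G_CLI:  /* privileged: replaced by a call into the emulator */
            out[n++].op = H_CALL_EMULATE_CLI;
            break;
        case G_JMP:  /* block terminator: retargeted at translated code */
            out[n].target = in[i].target;
            out[n++].op = H_JMP_TO_TRANSLATED_BB;
            return n;
        }
    }
    return n;
}
```

Instructions that follow the jump belong to another basic block and are left untranslated until execution actually reaches them.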
Ref : Keith Adams and Ole Agesen. A comparison of software and hardware techniques for x86
virtualization. Operating Systems Review, 40(5):2–13, December 2006
22. Translation of isPrime(49)
Note that the prime: BB is never translated, since 49 is not prime.
Ref : Keith Adams and Ole Agesen. A comparison of software and hardware techniques for x86
virtualization. Operating Systems Review, 40(5):2–13, December 2006
26. Dynamic memory resizing - Ballooning
• Inflating a balloon
− When the server wants to reclaim memory
− Driver allocates pinned physical pages within the VM
− Increases memory pressure in the guest OS, which reclaims space to satisfy the driver allocation request
− Driver communicates the physical page number for each allocated page to the VMM
• Deflating
− Frees up memory for general use within the guest OS
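The inflate and deflate steps can be sketched as a toy model of a balloon driver. All names and the 16-page guest are assumptions for illustration. The model treats any page outside the balloon as allocatable, whereas a real guest OS would first have to reclaim pages under memory pressure.

```c
#include <assert.h>
#include <stddef.h>

#define GUEST_PAGES 16  /* hypothetical guest size in pages */

struct guest_mem {
    unsigned char in_balloon[GUEST_PAGES];  /* 1 = pinned by the driver */
    size_t balloon_size;                    /* pages currently in balloon */
};

/* Inflate: pin up to 'want' pages and record each page number in
 * 'reported', standing in for the list sent to the VMM.
 * Returns the number of pages actually pinned. */
size_t balloon_inflate(struct guest_mem *g, size_t want, size_t *reported)
{
    size_t got = 0;
    assert(g != NULL && reported != NULL);
    for (size_t pfn = 0; pfn < GUEST_PAGES && got < want; pfn++) {
        if (!g->in_balloon[pfn]) {
            g->in_balloon[pfn] = 1;
            reported[got++] = pfn;
        }
    }
    g->balloon_size += got;
    return got;
}

/* Deflate: release up to 'want' pages back to the guest for general use.
 * Returns the number of pages released. */
size_t balloon_deflate(struct guest_mem *g, size_t want)
{
    size_t freed = 0;
    assert(g != NULL);
    for (size_t pfn = 0; pfn < GUEST_PAGES && freed < want; pfn++) {
        if (g->in_balloon[pfn]) {
            g->in_balloon[pfn] = 0;
            freed++;
        }
    }
    g->balloon_size -= freed;
    return freed;
}
```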
27. I/O system architecture overview (PCI/PCI-e)
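[Figure: several OS drivers, running on multiple CPUs above a VMM, all reach the same devices through the root complex and memory. Each device function (0, 1, 2) has its own configuration space and RX/TX paths and is identified by bus, device, function (BDF), for example 3, 0, 0. Slide callout: CHAOS!!]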
28. I/O Virtualization Architecture
Monolithic Model (VMware ESX)
Guest VMs (VM0 … VMn) run the guest OS and apps. I/O services and device drivers live in the hypervisor. Devices are shared.
• Pro: Higher Performance
• Pro: I/O Device Sharing
• Pro: VM Migration
• Con: Larger Hypervisor
Service VM Model (Xen)
Service VMs hold the I/O services and device drivers on behalf of the guest VMs (VM0 … VMn). Devices are shared.
• Pro: High Security
• Pro: I/O Device Sharing
• Pro: VM Migration
• Con: Lower Performance
Pass-through Model
Each VM (VM0 … VMn) runs its own device drivers next to the guest OS and apps. The hypervisor assigns devices to VMs.
• Pro: Highest Performance
• Pro: Smaller Hypervisor
• Pro: Device assisted sharing
• Con: Migration Challenges
31. Networking in Xen
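[Figure: Xen network path. Guest domains (1, 2, ...) each run a front-end driver. The driver domain runs the back-end drivers, an Ethernet bridge, and the NIC driver. Packet data moves between the driver domain and the guests by page flipping; the hypervisor handles driver control, performs interrupt dispatch, and delivers virtual interrupts. Below the hypervisor, the hardware (NIC, CPU / memory / disk / other devices) exchanges control + data, packet data, and interrupts.]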
33. CPU Virtualization with Intel VT-x
• Two new VT-x operating modes
− Less-privileged mode (VMX non-root) for guest OSes
− More-privileged mode (VMX root) for VMM
• Two new transitions
− VM entry to non-root operation
− VM exit to root operation
• Execution controls determine when exits occur
− Access to privileged state, occurrence of exceptions, etc.
− Flexibility provided to minimize unwanted exits
• VM Control Structure (VMCS) controls VT-x operation
− Also holds guest and host state
[Figure: virtual machines (VMs) with apps in ring 3 and the OS in ring 0, above the VM monitor (VMM) in VMX root; VM exit and VM entry arrows connect the two.]
34. VT-x Operations
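[Figure: VMXON takes the processor from ordinary IA-32 operation into VMX root operation (ring 0 and ring 3). VMLAUNCH and VMRESUME enter VMX non-root operation, where VM 1, VM 2 ... VM n each have their own ring 0 and ring 3 and their own VMCS (1, 2 ... n). A VM exit returns to VMX root operation.]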
35. VT-x new instructions
• VMXON and VMXOFF
− To enter and exit VMX-root mode.
• VMLAUNCH: Used on initial transition from VMM to Guest
− Enters VMX non-root operation mode
• VMRESUME: Used on subsequent entries
− Enters VMX non-root operation mode
− Loads Guest state and Exit criteria from VMCS
• VM exit (a hardware transition rather than an instruction)
− Used on transition from Guest to VMM
− Enters VMX root operation mode
− Saves Guest state in VMCS
− Loads VMM state from VMCS
• VMPTRST and VMPTRLD
− To read (store) and write (load) the current VMCS pointer.
• VMREAD, VMWRITE, VMCLEAR
− Read from, write to and clear a VMCS
• VMCALL
− Hypervisor entry point for hypercall from guest
36. VT-x Data Structures (VMCS)
• VMCS is a 4 KB structure which specifies the processor behaviour in non-root mode
• It is referenced by physical address only, and is accessed through the VMREAD/VMWRITE interface
• Loads and stores to the current VMCS pointer go through VMPTRLD and VMPTRST
• VMRESUME is used if the same VMCS is being resumed on a processor. Else, VMCLEAR followed by VMLAUNCH.
VMCS areas:
− VM execution controls: control the VM environment. External interrupt exiting, interrupt window exiting, CR3 load/store exiting, VPID enable, VPID value, EPT enable, EPTP…
− Guest save state: processor state saved on VM exits and loaded on VM entries. EIP, ESP, EFLAGS, IDTR, segment registers etc.
− Host save state: processor state loaded on VM exits. CR3, EIP set to monitor entry, EFLAGS etc.
− VM exit controls: these fields control VM exits. MSR save etc.
− VM entry controls: these fields control VM entries. Interrupts on entry, MSR loads etc.
37. VT-x solution to x86 virtualization challenges
[Figure: with VT-x, guest apps run in ring 3 and the guest OS in ring 0, with the VMM underneath at "ring -1" above the hardware. Instructions shown: CPUID and Sysenter in ring 3; POPF, CLI/STI, SGDT/SIDT/SLDT/STR/PUSHF/SMSW/POP/PUSH, and LAR/LSL/VERR/VERW/CALL/INT/JMP/RET in ring 0.]
• All reads return privilege level 0, GDT/LDT owned by guest OS, CPUID can be made to trap into VMM
• Guest OS in full control of segment/task descriptors
• Sysenter calls into guest OS. CLI/STI optimized to deliver virtual interrupts to VM
• No ring compression: all rings available
• No need for VMM to share address space with guest: no address space compression
• Eflags.IF is no longer used for interrupt masking
• Clean context switch on VM entry/exit
38. Intel EPT/AMD NPT
[Figure: a guest linear address is translated by the x86 guest page tables (rooted at gCR3) into a guest physical address, which the host page tables, labelled GPT (GPT base pointer, hCR3), translate into a host physical address. The TLB & caches hold the result.]
• GPT directly translates Guest Virtual addresses into Host Physical addresses on the fly.
− Uses Guest Page Table and Host-based Page Table
• Significant reduction in “exit frequency”
− Primary page table modifications are as fast as native
− Page faults require no exits
− Context switches require no exits
− No shadow page table memory overhead
• However, results in more expensive TLB misses (the “memsweep effect”), mitigated by large guest pages
• AMD ASID/Intel VPID segments the TLB, reduces TLB purge overheads.
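One way to see why TLB misses get more expensive with nested paging: every guest page table entry sits at a guest physical address, so reading it first needs a full walk of the host tables. With g guest levels and h host levels, that is g × (h + 1) references for the guest entries plus h more to translate the final guest physical address, or (g + 1)(h + 1) − 1 in total. The sketch below computes this worst case, assuming no paging-structure caches; the function names are illustrative.

```c
#include <assert.h>

/* Worst-case memory references to resolve one TLB miss when both the
 * guest and the host page tables must be walked. */
unsigned nested_walk_refs(unsigned guest_levels, unsigned host_levels)
{
    assert(guest_levels > 0 && host_levels > 0);
    return (guest_levels + 1) * (host_levels + 1) - 1;
}

/* Native paging for comparison: one reference per level. */
unsigned native_walk_refs(unsigned levels)
{
    assert(levels > 0);
    return levels;
}
```

With four levels on each side, the count is 24 references instead of 4, which is why larger pages (fewer levels to walk, fewer misses) help.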
39. VT-x extension: Extended Page Table (EPT)
• All guest-physical addresses go through extended page tables
− Includes address in CR3, address in PDE, address in PTE, etc.
40. VT-x extension: Virtual Processor IDs (VPID)
• The idea of a tagged TLB is that each TLB entry is “tagged” with an identifier
• Having such a tag allows the TLB entries to not be “flushed” when switching between the host and a guest
• VPID is activated if the new “enable VPID” control bit is set in VMCS
Tag Virtual Address Physical Address
Host 0x1000 0x10001000
Host 0x2000 0x10002000
Host 0x3000 0x10003000
Host 0x4000 0x10004000
Guest 0x1000 0xFFF01000
Guest 0x2000 0xFFF02000
Guest 0x3000 0xFFF03000
Guest 0x4000 0xFFF04000
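The table can be read as a lookup keyed on both the tag and the virtual address. The sketch below uses four of the rows above in a linear search; the names are illustrative, and a hardware TLB is an associative structure rather than an array scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum tlb_tag { TAG_HOST, TAG_GUEST };

struct tlb_entry {
    enum tlb_tag tag;
    uint32_t vaddr;
    uint32_t paddr;
};

/* Four rows taken from the slide's table. */
static const struct tlb_entry tagged_tlb[] = {
    { TAG_HOST,  0x1000u, 0x10001000u },
    { TAG_HOST,  0x2000u, 0x10002000u },
    { TAG_GUEST, 0x1000u, 0xFFF01000u },
    { TAG_GUEST, 0x2000u, 0xFFF02000u },
};

/* A hit requires both the tag and the virtual address to match, so host
 * and guest entries for the same virtual address can coexist. */
bool tlb_lookup(enum tlb_tag tag, uint32_t vaddr, uint32_t *paddr)
{
    assert(paddr != NULL);
    for (size_t i = 0; i < sizeof tagged_tlb / sizeof tagged_tlb[0]; i++) {
        if (tagged_tlb[i].tag == tag && tagged_tlb[i].vaddr == vaddr) {
            *paddr = tagged_tlb[i].paddr;
            return true;
        }
    }
    return false;
}
```

Because the tag takes part in the match, switching between host and guest only changes the current tag and leaves the other side's entries in place.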
41. VT-x extension: CPUID spoofing (Flex Migration)
• Allows software to “spoof” the CPUID feature bits (i.e. make the value of the CPUID feature bits appear different from what they really are).
• This is the same as the CPUID spoofing feature that the current VT processors have.
[Figure: live VM migration between older / existing servers (pre-2004 32 bit single core; 2004+ 64 bit single core) and newer servers (2006+ Intel® Core™, 64 bit dual and quad-core).]
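A common use of feature-bit spoofing is to give every host in a migration pool the same visible feature set. The sketch below computes that common set as the bitwise AND of each host's feature word and applies it as a mask. The function names and the idea of a single 32-bit feature word are simplifying assumptions; real CPUID output spans several registers and leaves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feature bits supported by every host in the pool.
 * Returns all ones for an empty pool. */
uint32_t common_feature_mask(const uint32_t *host_features, size_t n)
{
    uint32_t mask = UINT32_MAX;
    assert(host_features != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        mask &= host_features[i];
    return mask;
}

/* Feature word a guest would see on a host once the mask is applied. */
uint32_t spoofed_features(uint32_t real, uint32_t mask)
{
    return real & mask;
}
```

A guest that only ever sees the masked value will not start using an instruction set extension that is missing on the host it later migrates to.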
42. Intel VT-d Architecture Detail
[Figure: DMA requests carry a device ID, a virtual address, and a length. The DMA remapping engine (translation cache, context cache, fault generation) looks up the device assignment structures, indexed by bus (Bus 0 ... Bus N ... Bus 255) and then by device and function (Dev 0, Func 0 ... Dev P, Func 1; Dev P, Func 2 ... Dev 31, Func 7). Each entry points to the address translation structures for a device (Device D1, Device D2), whose 4KB page tables resolve to a 4KB page frame. The output is a memory access with a system physical address. The partitioning and translation structures are memory-resident.]
43. VT-d: Remapping Structures
• VT-d hardware selects page-table based on source of DMA request
− Requestor ID (bus / device / function) in request identifies DMA source
• VT-d Device Assignment Entry
− Bits 127–64: Rsvd | Domain ID | Rsvd | Address Width
− Bits 63–0: Address Space Root Pointer | Rsvd | Ext. Controls | Controls | P
• VT-d supports hierarchical page tables for address translation
− Page directories and page tables are 4 KB in size
− 4 KB base page size with support for larger page sizes
− Support for DMA snoop control through page table entries
• VT-d Page Table Entry
− Bits 63–0: Rsvd | Page-Frame / Page-Table Address | Available | SP | Rsvd | Ext. Controls | W | R
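The requester ID that selects the page table is the standard 16-bit PCI bus / device / function triple: 8 bits of bus, 5 bits of device, 3 bits of function. A minimal sketch of packing and unpacking it follows; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

struct bdf {
    unsigned bus;   /* bits 15..8 */
    unsigned dev;   /* bits 7..3 */
    unsigned func;  /* bits 2..0 */
};

struct bdf decode_requester_id(uint16_t rid)
{
    struct bdf b;
    b.bus  = rid >> 8;
    b.dev  = (rid >> 3) & 0x1Fu;
    b.func = rid & 0x7u;
    return b;
}

uint16_t encode_requester_id(unsigned bus, unsigned dev, unsigned func)
{
    assert(bus < 256 && dev < 32 && func < 8);
    return (uint16_t)((bus << 8) | (dev << 3) | func);
}
```

The field widths explain the bounds in the VT-d architecture figure: buses run from 0 to 255, devices from 0 to 31, functions from 0 to 7.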
45. PCI SIG IOV Overview
[Figure: PCIe Single-Root IOV, where several SIs on one system share devices through a VI, shown next to PCIe Multi-Root IOV, where SIs and VIs on several systems share devices.]
• PCI SIG is standardizing mechanisms that enable PCIe Devices to be directly shared
− Single-Root IOV – Direct sharing between SIs on a single system
− Multi-Root IOV – Direct sharing between SIs on multiple systems
• PCI-SIG IOV Specification covers “north-side” of the Device
46. PCI SIG IOV Terminologies
[Figure: SIs above VIs and an SR-PCIM, connected through a PCIe switch to device functions (F).]
• System Image (SI)
− SW, e.g., a guest OS, to which virtual and physical devices can be assigned
• Virtual Intermediary (VI)
− Performs resource allocation, isolation, management and event handling
• PCIM – PCI Manager
− Controls configuration, management and error handling of physical functions (PFs) and virtual functions (VFs)
− May be in SW and/or Firmware.
− May be integrated into a VI
• Translation Agent (TA)
− Uses ATPT to translate PCI Bus Addresses into platform addresses
• Address Translation and Protection Table (ATPT)
− Validates access rights of incoming PCI memory transactions.
− Translates PCI Address into platform physical addresses
47. VT-c: Virtual Machine Device Queues (VMDq)
• On the receive path, VMDq provides a hardware 'sorter' or classifier that essentially does the pre-work for the VMM of directing which end VM the packets should go to. The NIC or LAN silicon is performing a hardware assist for the VMM layer.
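The receive-side sorting can be pictured as a filter table that maps an incoming frame to a per-VM queue. The sketch below assumes the classifier keys on the destination MAC address and that unmatched frames fall back to a default queue; the slide does not state the key, and all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN       6
#define DEFAULT_QUEUE 0  /* assumption: unmatched frames go to queue 0 */

struct queue_filter {
    uint8_t mac[MAC_LEN];  /* MAC address assigned to one VM */
    unsigned queue;        /* receive queue owned by that VM */
};

/* Returns the receive queue for a frame with the given destination MAC. */
unsigned classify_rx(const struct queue_filter *filters, size_t n,
                     const uint8_t dst_mac[MAC_LEN])
{
    assert(dst_mac != NULL);
    assert(filters != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (memcmp(filters[i].mac, dst_mac, MAC_LEN) == 0)
            return filters[i].queue;
    }
    return DEFAULT_QUEUE;
}
```

With the sorting done in the NIC, the VMM only has to hand each queue to its VM instead of inspecting every packet in software.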
49. Deja-Vu – Back to the future
• What VT calls "non-root mode", and Pacifica calls "guest mode", was called "interpretive execution" on the IBM VM/370 and VM/ESA mainframes.
• VT's "vmlaunch" instruction and Pacifica's "vmrun" were called "sie".
• Intel's "VMCS" and AMD's "VMCB" were called the "state description" on the IBM mainframes.
• IBM also defined the concept of shadow translation tables and a dual page-table walk in hardware.
• IBM also defined an interpreted SIE for nested hypervisor support (not yet in Intel/AMD)
51. Future Trends
• Secure Hypervisors – The hypervisor itself, like an OS, can have holes.
• BluePill attacks – subverting the hypervisor
• Trusted Virtualization – Virtualizing TPMs for use by guest virtual machines
• Trusted Virtualization – How do we trust the VMM? Intel’s LT (LaGrande) and AMD’s Presidio introduce architectural extensions for security
• Firewalls to protect guests. XenMotion security hole
• Storage QoS – FC NPIV, Storage vMotion
• Datacenter/Lifecycle Management (Virtualization 2.0)
− OpsWare PAS (now HP Operations Orchestrator)
− Novell ZENworks Orchestrator
− VMware Lifecycle Manager
52. References
• D. L. Osisek, K. M. Jackson, and P. H. Gum. ESA/390 interpretive-execution architecture, foundation for VM/ESA. IBM Systems Journal, 30(1):34–51, 1991.
• John Scott Robin and Cynthia E. Irvine. Analysis of the Intel Pentium’s ability to support a secure virtual machine monitor. In Proceedings of the Ninth USENIX Security Symposium, August 14–17, 2000, Denver, Colorado, page 275. USENIX, San Francisco, CA, USA, 2000.
• Keith Adams and Ole Agesen. A comparison of software and hardware techniques for x86 virtualization. Operating Systems Review, 40(5):2–13, December 2006.
• PCI IOV talks at WinHEC and HP by Michael Krause
• VMWorld 2007 talk by Ole Agesen
• Intel IDF 2007/2008 presentations