2. 2012/11/28
Agenda
• Introduction
  History
  Usage model
• Virtualization overview
  CPU virtualization
  Memory virtualization
  I/O virtualization
• Xen/KVM architecture
  Xen
  KVM
• Some Intel work for OpenStack
  OAT
3.
Virtualization history
• 1960s: IBM — CP/CMS on S/360, VM/370, …
• 1970s–80s: silence
• 1998: VMware — out of the SimOS project at Stanford
• 2003: Xen — the Xen project at Cambridge
• After that: KVM, Hyper-V, Parallels, …
4.
What is Virtualization
• The VMM is a layer of abstraction
  Supports multiple guest OSes
  De-privileges each OS to run as a guest OS
• The VMM is a layer of redirection
  Redirects the one physical platform to create the illusion of many virtual platforms
  Provides a virtual platform to each guest OS
[Diagram: Platform HW (Processors, Memory, I/O Devices) hosts the Virtual Machine Monitor (VMM); on top run VM0 … VMn, each with its own Guest OS and Apps]
5.
Server Virtualization Usage Model
• Server Consolidation
  Benefit: cost savings
  • Consolidate services onto fewer machines
  • Power saving
• Dynamic Load Balancing
  Benefit: productivity
  • Migrate VMs from a heavily loaded host (e.g., 90% CPU usage) to a lightly loaded one (e.g., 30%)
• Disaster Recovery
  Benefit: loss savings
  • RAS
  • Live migration
  • Relieves losses from hardware failure
• R&D / Production
  Benefit: business agility and productivity
7.
x86 virtualization challenges
• Ring deprivileging
  Goal: prevent the guest OS from
  • Controlling physical resources directly
  • Modifying VMM code and data
• Ring deprivileging layout
  • VMM runs fully privileged at ring 0
  • Guest kernel is deprivileged
    • x86-32: runs at ring 1
    • x86-64: runs at ring 3
  • Guest apps run at ring 3
• Ring deprivileging problems
  • Unnecessary faulting
    • Some privileged instructions
    • Some exceptions
  • Guest kernel protection (x86-64: kernel shares ring 3 with apps)
  • Virtualization holes
    19 sensitive but unprivileged instructions
    • SIDT/SGDT/SLDT …
    • PUSHF/POPF …
    Some user-space holes are hard to fix with a software approach
    • Hard to trap, or
    • High performance overhead
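The virtualization-hole problem above can be sketched as a toy trap-and-emulate model. Privileged instructions fault when run deprivileged, so the VMM gets a chance to emulate them against the virtual CPU state; sensitive-but-unprivileged instructions like SGDT execute directly and leak host state. All names and values here are illustrative, not a real VMM.

```python
# Toy trap-and-emulate model (illustrative, not a real VMM).
PRIVILEGED = {"LGDT", "LIDT", "HLT"}                 # fault at CPL > 0 -> VMM traps
SENSITIVE_NO_TRAP = {"SGDT", "SIDT", "SLDT", "PUSHF", "POPF"}

def run_guest_insn(insn, vmm_state):
    if insn in PRIVILEGED:
        # #GP fault -> VMM emulates against the *virtual* CPU state
        return ("emulated", vmm_state["virtual_gdt"])
    if insn in SENSITIVE_NO_TRAP:
        # Executes directly without faulting: guest sees the *host* value -> hole
        return ("leaked", vmm_state["host_gdt"])
    return ("native", None)                          # ordinary instruction

vmm = {"virtual_gdt": 0x1000, "host_gdt": 0xFFFF8000}
assert run_guest_insn("LGDT", vmm) == ("emulated", 0x1000)
assert run_guest_insn("SGDT", vmm) == ("leaked", 0xFFFF8000)
```

The second assertion is the hole: the VMM never intervened, so the guest observed host state.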
9.
Typical x86 virtualization approaches
• Para-virtualization (PV)
  Para-virtualization approach, e.g., Xen
  Modified guest OS is aware of and cooperates with the VMM
  Standardization milestone: Linux 3.0
  • VMI vs. PVOPS
  • Bare metal vs. virtual platform
• Binary Translation (BT)
  Full virtualization approach, e.g., VMware
  Unmodified guest OS
  Translates binary code on the fly
  • Translation blocks with caching
  • Usually applied to kernel code: ~80% of native performance
  • User-space apps run natively as much as possible: ~100% of native performance
  • Overall ~95% of native performance
  • Complicated
    • Involves excessive complexity, e.g., self-modifying code
• Hardware-assisted Virtualization (VT)
  Full virtualization approach assisted by hardware, e.g., KVM
  Unmodified guest OS
  Intel VT-x, AMD-V
  Benefits:
  • Closes the virtualization holes in hardware
  • Simplifies VMM software
  • Optimized for performance
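The binary-translation scheme above can be sketched in a few lines: kernel code is rewritten block by block, sensitive instructions become calls into the VMM, and translated blocks are cached by address so each block is translated only once. Opcode and function names are hypothetical.

```python
# Minimal binary-translation sketch (hypothetical opcode names).
SENSITIVE = {"cli", "sti", "popf"}   # must not run natively in the guest kernel
cache = {}                           # translation cache: address -> translated block

def translate_block(addr, insns):
    if addr in cache:                # cache hit: reuse earlier translation
        return cache[addr]
    out = []
    for i in insns:
        # Rewrite sensitive instructions into VMM emulation calls,
        # copy everything else through unchanged.
        out.append(f"vmm_emulate_{i}()" if i in SENSITIVE else i)
    cache[addr] = out
    return out

block = ["mov", "cli", "add", "popf"]
print(translate_block(0x1000, block))
# ['mov', 'vmm_emulate_cli()', 'add', 'vmm_emulate_popf()']
```

The cache is what keeps the ~80%-of-native kernel figure plausible: translation cost is paid once per block, not per execution.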
10.
Memory virtualization challenges
• A guest OS makes two assumptions
  Expects to own physical memory starting from address 0
  • BIOS and legacy OSes are designed to boot from the low 1MB
  Expects its physical memory to be basically contiguous
  • The OS kernel requires a minimum of contiguous low memory
  • DMA requires a certain amount of contiguous memory
  • Efficient memory management, e.g., less buddy-allocator overhead
  • Efficient TLB usage, e.g., superpage TLB entries
• MMU virtualization
  How to keep the physical TLB valid
  Different approaches involve different complexity and overhead
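Both guest assumptions are satisfied by one level of indirection, a GPA-to-HPA map (often called a p2m table): the guest sees contiguous "physical" memory starting at 0 even though the host frames backing it are scattered. A minimal sketch, with example frame addresses:

```python
# p2m sketch: guest frame number -> host frame base address.
# The guest's view (frames 0,1,2 contiguous from 0) is an illusion;
# the backing host frames are deliberately scattered here.
PAGE = 0x1000
p2m = {0: 0x9000, 1: 0x3000, 2: 0x7000}

def gpa_to_hpa(gpa):
    frame, offset = divmod(gpa, PAGE)
    return p2m[frame] + offset       # indirection hides host fragmentation

assert gpa_to_hpa(0x0) == 0x9000     # guest "address 0" exists
assert gpa_to_hpa(0x1004) == 0x3004  # contiguous to the guest, not on the host
```

The MMU-virtualization approaches on the next slides differ only in *where* this composition with the guest's own GVA-to-GPA tables happens: software (shadow) or hardware (EPT).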
12.
Memory virtualization approaches
• Direct page table
  Guest and VMM share the same linear address space
  Guest and VMM share the same page table
• Shadow page table
  Guest page table unmodified
  • gva -> gpa
  VMM shadow page table
  • gva -> hpa
  Complexity and memory overhead
• Extended page table
  Guest page table unmodified
  • gva -> gpa
  • Guest keeps full control of CR3 and page faults
  VMM extended page table
  • gpa -> hpa
  • Hardware based
  • Good scalability for SMP
  • Low memory overhead
  • Greatly reduces page-fault VM exits
• Flexible choices
  Para-virtualization
  • Direct page table
  • Shadow page table
  Full virtualization
  • Shadow page table
  • Extended page table
[Diagram: guest page table maps GVA → GPA; direct and shadow page tables map GVA → HPA; extended page table maps GPA → HPA]
13.
Shadow page table
• Guest page table remains unmodified, as the guest expects
  Translates gva -> gpa
• Hypervisor creates a new page table for the physical translation
  Uses hpa in PDEs/PTEs
  Translates gva -> hpa
  Invisible to the guest
[Diagram: virtual side, vCR3 → guest Page Directory (PDE) → Page Table (PTE); physical side, pCR3 → shadow Page Directory (PDE) → Page Table (PTE)]
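The shadow construction above is, conceptually, a composition of two maps: the guest's GVA-to-GPA table (which the VMM leaves untouched) and the VMM's GPA-to-HPA map. A minimal sketch with flat single-level tables and example page addresses; the real structures are multi-level and must be kept coherent with guest page-table writes, which is where the complexity and overhead come from.

```python
# Shadow page table sketch: compose guest table (GVA->GPA, guest-owned)
# with the VMM's GPA->HPA map to get the table hardware actually walks.
guest_pt = {0x0000: 0x2000, 0x1000: 0x5000}   # GVA page -> GPA page
p2m      = {0x2000: 0x9000, 0x5000: 0x3000}   # GPA page -> HPA page (VMM-owned)

def build_shadow(guest_pt, p2m):
    # The shadow table maps GVA directly to HPA; vCR3 still points at
    # guest_pt for the guest's benefit, pCR3 points at the shadow.
    return {gva: p2m[gpa] for gva, gpa in guest_pt.items()}

shadow = build_shadow(guest_pt, p2m)
assert shadow == {0x0000: 0x9000, 0x1000: 0x3000}
```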
14.
Extended page table
• Guest has full control over its page tables and paging events
  • CR3, INVLPG, page faults
• VMM controls the extended page tables
  • The complicated shadow page table is eliminated
  • Improved scalability for SMP guests
[Diagram: Guest CR3 points at the guest page tables (guest linear address → guest physical address); the EPT base pointer points at the extended page tables (guest physical address → host physical address)]
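Under EPT the composition happens in hardware as a two-dimensional walk: the MMU walks the guest's table for GVA-to-GPA, then the EPT for GPA-to-HPA, with no VMM software on the path. A sketch with flat single-level tables and example addresses:

```python
# Two-dimensional translation sketch: hardware walks both tables itself,
# so no shadow table and no page-fault interception are needed.
PAGE = 0x1000
guest_pt = {0x0000: 0x2000, 0x1000: 0x5000}   # GVA page -> GPA page (guest-controlled)
ept      = {0x2000: 0x9000, 0x5000: 0x3000}   # GPA page -> HPA page (VMM-controlled)

def translate(gva):
    page, off = divmod(gva, PAGE)
    gpa = guest_pt[page * PAGE] + off         # dimension 1: guest CR3 walk
    epage, eoff = divmod(gpa, PAGE)
    return ept[epage * PAGE] + eoff           # dimension 2: EPT walk

assert translate(0x1004) == 0x3004
```

The guest can rewrite guest_pt freely without a VM exit; only EPT violations (a missing GPA mapping) reach the VMM.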
15.
I/O virtualization requirements
• An I/O device from the OS point of view
  Resource configuration and probing
  I/O requests: port I/O, MMIO
  I/O data: DMA
  Interrupts
• I/O virtualization requires presenting the guest OS driver a complete device interface
  • Presenting an existing interface
    • Software emulation
    • Direct assignment
  • Presenting a brand-new interface
    • Para-virtualization
[Diagram: device interface between device and CPU — register access, DMA through shared memory, interrupts]
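For the register-access part of that interface, software emulation works by trapping guest MMIO accesses and dispatching them to a device model. A minimal sketch; the device, its MMIO base, and its register layout are all made up for illustration.

```python
# MMIO emulation sketch: guest reads/writes to the device's MMIO range
# trap to the VMM, which forwards them to this device model.
class EmulatedNIC:
    MMIO_BASE = 0xFEB00000                    # hypothetical BAR address

    def __init__(self):
        self.regs = {0x0: 0x1234,             # device-ID register (example value)
                     0x4: 0x0}                # control register

    def mmio_read(self, addr):
        return self.regs[addr - self.MMIO_BASE]

    def mmio_write(self, addr, val):
        self.regs[addr - self.MMIO_BASE] = val

nic = EmulatedNIC()
nic.mmio_write(0xFEB00004, 0x1)               # guest driver enables the device
assert nic.mmio_read(0xFEB00004) == 0x1
```

Every register access costs a VM exit plus this dispatch, which is exactly the emulation overhead the next slide contrasts with paravirtualized and direct I/O.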
16.
I/O virtualization approaches
• Emulated I/O
  Software emulates a real hardware device
  VMs run the same driver as for the real hardware device
  Good legacy software compatibility
  Emulation overhead limits performance
• Paravirtualized I/O
  Uses abstract interfaces and stacks for the I/O services
  FE driver: guest runs virtualization-aware drivers
  BE driver: host-side driver based on a simplified I/O interface and stack
  Better performance than emulated I/O
• Direct I/O
  Directly assigns a device to a guest
  • Guest accesses the I/O device directly
  • High performance and low CPU utilization
  DMA issue
  • Guest programs guest physical addresses
  • DMA hardware only accepts host physical addresses
  Solution: DMA remapping (a.k.a. IOMMU)
  • An I/O page table is introduced
  • The DMA engine translates according to the I/O page table
  Some limitations under live migration
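The DMA-remapping step can be sketched as one more address-translation table, this time walked by the IOMMU on behalf of the device: the assigned device DMAs with guest physical addresses, the I/O page table (set up by the VMM per device) yields host physical addresses, and unmapped accesses fault instead of scribbling on other VMs' memory. Addresses are examples.

```python
# IOMMU sketch: per-device I/O page table, GPA page -> HPA page.
PAGE = 0x1000
io_pt = {0x2000: 0x9000}                 # only this guest page is DMA-mapped

def iommu_translate(gpa):
    page, off = divmod(gpa, PAGE)
    hpa_page = io_pt.get(page * PAGE)
    if hpa_page is None:
        # Stray or malicious DMA is blocked, not silently redirected.
        raise PermissionError("DMA fault: address not mapped for this device")
    return hpa_page + off

assert iommu_translate(0x2010) == 0x9010
```

Live migration remains awkward because this table reflects real device state on one host; in-flight DMA cannot simply be replayed on the destination.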
22. Trusted Pools - Implementation
• OpenStack side
  Scheduler exposes the EC2 API / OS API; a TrustedFilter consults the attestation service's Query API
  User specifies, e.g.:
  • Mem > 2G
  • Disk > 50G
  • GPGPU=Intel
  • trusted_host=trusted (requires HW with TXT)
  Compute hosts run a tboot-enabled hypervisor with a host agent
• OAT-based Attestation Server
  Components: Privacy CA, Appraiser, Whitelist DB
  APIs: Whitelist API, HostAgent API, Query API
• Flow: Create VM → TrustedFilter → Query → host agent Attest → Report → host marked trusted/untrusted
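The scheduling side of this flow can be sketched as a filter in the OpenStack filter-scheduler style. This is a simplified illustration: the function and attribute names below are made up, not the actual nova or OAT APIs, and the attestation query is stubbed with a canned cache.

```python
# TrustedFilter-style sketch (illustrative names, not the real nova API).
# Stand-in for the OAT attestation service's Query API results.
ATTESTATION_CACHE = {"node1": "trusted", "node2": "untrusted"}

def attest(hostname):
    # In the real flow this is a call to the attestation server,
    # which appraises the host agent's report against the whitelist DB.
    return ATTESTATION_CACHE.get(hostname, "untrusted")

def trusted_filter(hosts, flavor_extra_specs):
    # The filter only engages when the flavor asks for a trusted host.
    if flavor_extra_specs.get("trusted_host") != "trusted":
        return hosts
    return [h for h in hosts if attest(h) == "trusted"]

assert trusted_filter(["node1", "node2"], {"trusted_host": "trusted"}) == ["node1"]
```

VMs without the trusted_host spec still schedule anywhere, so trusted and ordinary workloads can share the same pool.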