2. Agenda
• IO Virtualization Overview
  • Software solutions
  • Hardware solutions
  • IO performance trend
• How IO Virtualization Performs in Micro-Benchmarks
  • Network
  • Disk
• Performance in Enterprise Workloads
  • Web Server: PV, VT-d and native performance
  • Database: VT-d vs. native performance
  • Consolidated workload: SR-IOV benefit
• Direct IO (VT-d) Overhead Analysis
3. IO Virtualization Overview
IO virtualization enables VMs (VM0 … VMN) to use the Input/Output resources of the hardware platform. In this session we cover network and storage.
Software solutions — the two we are familiar with on Xen (see the sketch below):
• Emulated devices (QEMU): good compatibility, very poor performance.
• Para-virtualized devices: need driver support in the guest; optimized performance compared to QEMU.
Both require the participation of Dom0 (the driver domain) to serve a VM's IO request.
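As an illustration only (a minimal sketch, not the exact configuration used in these tests), the two software device models are usually selected in the xm domain config file, which uses Python syntax; the MAC addresses and bridge name below are made up:

    # HVM guest with an emulated NIC: type=ioemu selects the QEMU device model
    vif = [ 'mac=00:16:3e:00:00:01, bridge=xenbr0, type=ioemu' ]

    # PV guest (or HVM guest with PV drivers): omitting the type gives the
    # paravirtualized netfront/netback path through Dom0
    vif = [ 'mac=00:16:3e:00:00:02, bridge=xenbr0' ]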
4. IO Virtualization Overview – Hardware Solutions
Three kinds of hardware assists accelerate IO; a single technology or a combination can be used to address various usages.
• VMDq (Virtual Machine Device Queue) — network only, software "switch": the NIC provides separate Rx & Tx queue pairs for each VM. Requires specific OS and VMM support.
• Direct IO (VT-d) — VM exclusively owns the device: improved IO performance through direct assignment of an I/O device to an unmodified or paravirtualized VM.
• SR-IOV (Single Root I/O Virtualization) — one device, multiple Virtual Functions: changes to I/O device silicon to support multiple PCI device IDs, so one I/O device can serve multiple directly assigned guests. Requires VT-d.
A minimal configuration sketch for the last two follows below.
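A minimal sketch of how direct assignment is typically configured (the BDF number is made up, and the exact pciback/SR-IOV setup varies with the Xen and driver versions in use):

    # Dom0: first hide the device from its native driver so it can be
    # assigned, e.g. with the pciback.hide=(06:00.0) boot parameter or by
    # rebinding the device to pciback at runtime.

    # Guest xm config (Python syntax): assign the PCI device to the VM.
    pci = [ '0000:06:00.0' ]

    # SR-IOV looks the same from the guest side: each Virtual Function
    # appears as its own PCI device with its own BDF and is assigned the
    # same way, one VF per guest.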
5. IO Virtualization Overview – Trend
The trend is toward much higher throughput and denser IO capacity:
• 40Gb/s and 100Gb/s Ethernet: Draft 3.0 scheduled for release in Nov. 2009, standard approval in 2010**.
• Fibre Channel over Ethernet (FCoE): unified IO consolidates network (IP) and storage (SAN) onto a single connection.
• Solid State Drives (SSD): hundreds of MB/s of bandwidth and >10,000 IOPS from a single device*.
• PCIe 2.0: doubles the bit rate from 2.5GT/s to 5.0GT/s (with 8b/10b encoding, roughly 500MB/s per lane per direction).
* http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3403
** See http://en.wikipedia.org/wiki/100_Gigabit_Ethernet
6. How IO Virtualization Performs in Micro-Benchmarks – Network
iperf with a 10Gb/s Ethernet NIC was used to benchmark the TCP bandwidth of the different device models(*); a scripting sketch follows the footnotes below.
Thanks to VT-d, a VM easily achieved 10GbE line rate in both transmit and receive, with much lower resource consumption. Also, with SR-IOV we were able to reach 19Gb/s transmit bandwidth with 30 VFs assigned to 30 VMs on a dual-port 10GbE NIC.
[Chart: iperf transmit bandwidth and CPU% — HVM+PV driver 8.47Gb/s (1.81x over PV guest), PV guest 4.68Gb/s, HVM+VT-d 9.54Gb/s (1.13x over HVM+PV driver), 30 VMs+SR-IOV 19Gb/s]
[Chart: iperf receive bandwidth and CPU% — HVM+PV driver 3.10Gb/s (2.12x over PV guest), PV guest 1.46Gb/s, HVM+VT-d 9.43Gb/s (3.04x over HVM+PV driver)]
* HVM+VT-d uses a 2.6.27 kernel, while the PV guest and HVM+PV driver use 2.6.18.
* We turned off multiqueue support in the NIC driver for HVM+VT-d because the 2.6.18 kernel has no multi-TX-queue support; so for the iperf test there was only one TX/RX queue in the NIC, and all interrupts were sent to a single physical core.
* ITR (Interrupt Throttle Rate) was set to 8000 in all cases.
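For reference, a minimal sketch of how one such transmit run can be scripted from the guest; the receiver is assumed to already be running "iperf -s", and the host name, duration and stream count are illustrative assumptions, not the exact parameters used here:

    # Measure TCP transmit bandwidth from the guest with iperf.
    import subprocess

    out = subprocess.run(
        ["iperf", "-c", "receiver", "-t", "60", "-P", "4", "-f", "g"],
        capture_output=True, text=True, check=True).stdout
    print(out)  # with -P > 1, the [SUM] line reports aggregate Gbits/sec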
7. How IO Virtualization Performs in Micro-Benchmarks – Network (cont.)
Packet transmit rate is another essential aspect of high-throughput networking. Using the Linux kernel packet generator (pktgen) with small UDP packets (128 bytes), HVM+VT-d can send close to 4 million packets/s with 1 TX queue and over 8 million packets/s with 4 TX queues; a sketch of the pktgen setup follows below.
PV performance was far behind due to its long packet processing path.
[Chart: packet transmit performance in million packets per second (mpp/s) — PV guest, 1 queue: 0.33; HVM+VT-d, 1 queue: 3.90 (11.7x over PV guest); HVM+VT-d, 4 queues: 8.15 (2.1x over 1 queue)]
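A minimal sketch of driving pktgen the way this test is described (one TX queue, 128-byte packets); the interface name, packet count and addresses are illustrative assumptions:

    # Requires: modprobe pktgen. Commands are written to /proc/net/pktgen.
    def pgset(path, cmd):
        with open(path, "w") as f:
            f.write(cmd + "\n")

    pgset("/proc/net/pktgen/kpktgend_0", "rem_device_all")
    pgset("/proc/net/pktgen/kpktgend_0", "add_device eth1")

    dev = "/proc/net/pktgen/eth1"
    pgset(dev, "pkt_size 128")                # small UDP packets, as in the test
    pgset(dev, "count 10000000")
    pgset(dev, "dst 192.168.0.2")
    pgset(dev, "dst_mac 00:1b:21:aa:bb:cc")

    # Blocks until the run finishes; results appear in /proc/net/pktgen/eth1.
    # On newer kernels, per-queue devices such as eth1@0 can be bound to
    # several kpktgend threads to reproduce the multi-TX-queue case.
    pgset("/proc/net/pktgen/pgctrl", "start")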
8. How IO Virtualization Performs in Micro-Benchmarks – Disk IO
We measured disk bandwidth with sequential reads and IOPS with random reads to check block device performance (IOmeter; a rough Linux analogue is sketched below).
HVM+VT-d outperforms the PV guest by roughly 3x in both tests.
[Chart: IOmeter disk bandwidth — PV guest 1,711MB/s, HVM+VT-d 4,911MB/s (2.9x)]
[Chart: IOmeter disk IOPS — PV guest 7,056, HVM+VT-d 18,725 (2.7x)]
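The measurements here used IOmeter; as a rough, hypothetical Linux analogue (not the presenters' setup), the same two access patterns can be expressed with fio, with the device path, queue depth and runtime below being illustrative assumptions:

    # Approximate the two IOmeter patterns with fio:
    # sequential 1MB reads for bandwidth, random 4KB reads for IOPS.
    import subprocess

    common = ["--ioengine=libaio", "--direct=1", "--runtime=60",
              "--time_based", "--filename=/dev/sdb"]

    subprocess.run(["fio", "--name=seqread", "--rw=read", "--bs=1m",
                    "--iodepth=32", *common], check=True)
    subprocess.run(["fio", "--name=randread", "--rw=randread", "--bs=4k",
                    "--iodepth=32", *common], check=True)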
9. Performance in Enterprise Workloads – Web Server
The Web Server workload simulates a support website where connected users browse and download files. We measure the maximum number of simultaneous user sessions the web server can support while satisfying the QoS criteria.
Only HVM+VT-d was able to push the server's utilization to ~100%. The PV solutions hit some bottleneck and failed to pass QoS while utilization was still below 70%.
[Chart: web server sessions and CPU utilization — PV guest 5,000; HVM+PV driver 9,000 (1.8x over PV guest); HVM+VT-d 24,500 (2.7x over HVM+PV driver)]
10. Performance in Enterprise Workloads – Database
The Decision Support DB requires high disk bandwidth, while the OLTP DB is IOPS-bound and also needs a certain amount of network bandwidth to connect to its clients.
The HVM+VT-d combination achieved more than 90% of native performance in both DB workloads.
[Chart: Decision Support DB performance (QphH) — native 11,443 vs. HVM+VT-d 10,762 (94.06% of native)]
[Chart: OLTP DB performance — native 199.71 vs. HVM+VT-d storage & NIC 184.08 (92.2%) vs. HVM+VT-d storage only 127.56 (63.9%)]
11. Performance in Enterprise Workloads – Consolidation with SR-IOV
The consolidation workload runs multiple tiles of servers on the same physical machine. One tile consists of one Web Server instance, one J2EE AppServer and one Mail Server, altogether 6 VMs. It is a complex workload that consumes CPU, memory, disk and network.
The PV solution could only support 4 tiles on a two-socket server, and it failed to pass the Web Server QoS criteria before saturating the CPU.
As a pilot, we enabled an SR-IOV NIC for the Web Server. This brought a >49% performance increase and also allowed the system to support two more tiles (12 VMs).
[Chart: consolidated workload performance ratio and system utilization — PV guest 1.00 vs. HVM+SR-IOV 1.49 (1.49x)]
12. Direct IO (VT-d) Overhead
APIC accesses and interrupt delivery consumed the most cycles. (Note that some interrupts arrive while the CPU is halted (HLT), so they are not counted.)
Across various workloads we have seen Xen introduce about 5~12% overhead, mainly spent on serving interrupts. The Intel OTC team has developed a patch set that eliminates part of this Xen software overhead; see Xiaowei's session for details.
[Chart: VT-d cases, utilization breakdown (Dom0 / Xen / guest kernel / guest user) — Xen accounts for ~5.94% in the disk bandwidth test and ~11.878% in the Web Server (SPECweb) test]
[Chart: Xen cycles breakdown for the Web Server — interrupt window, INTR, APIC access and IO instruction; the two largest slices are 6.37% and 4.73%]
13. CREDIT
Great thanks to DUAN, Ronghui and XIANG, Kai for providing the VT-d network and SR-IOV data.
15. BACKUP
16. Configuration
Hardware configuration:
• Intel® Nehalem-EP server system
• CPU: 2-socket Nehalem, 2.66GHz, 8MB LLC cache, C0 stepping; hardware prefetchers OFF, Turbo mode OFF, EIST OFF
• NIC: Intel 10Gb XF SR (82598EB) — two single-port NICs installed on one machine and one dual-port NIC installed on the server
• RAID bus controller: LSI Logic MegaRAID SAS 8888elp x3
• Disk: 6 disk arrays (each with 12 x 70GB SAS HDDs)
• Memory: 64GB (16x 4GB DDR3 1066MHz), 32GB on each node
Test case VM configurations:
• Network micro-benchmark: 4 vCPU, 64GB memory
• Storage micro-benchmark: 2 vCPU, 12GB memory
• Web Server: 4 vCPU, 64GB memory
• Database: 4 vCPU, 12GB memory
Software configuration:
• Xen changeset 18771 for the network/disk micro-benchmarks, 19591 for the SR-IOV test