webinar vmware v-sphere performance management Challenges and Best Practices

vSphere Performance Management
Challenges and Best Practices

Jamie Baker
Principal Consultant
jamie.baker@metron-athene.com

Agenda

• Can my Application be virtualized?

• Virtual Machine Performance Management and Best Practices

• CPU Performance Management and Best Practices

• Memory Performance Management and Best Practices

• Networking Performance Management and Best Practices

• Storage Performance Management and Best Practices

• x86 Virtualization Challenges ( not covered in presentation but included with
slides)

www.metron-athene.com

Can my application be virtualized?

Resource Application Category
CPU CPU-intensive Green (with latest HW)
More than 8 CPUs Red
Memory Memory-intensive Green (with latest HW)
Greater than 255GB Red
RAM

Network bandwidth 1-27Gb/s Yellow
Greater than 27Gb/s Red
Storage bandwidth 10-250K IOPS Yellow
Greater than 250K IOPS Red

1/3/2012 3

Virtual Machines Performance Management

• Selecting the correct operating system

• Virtual machine timekeeping

• Why installing VMware Tools brings benefits

• SMP guidelines

• NUMA server considerations

• Overview of Best Practices

1/3/2012 4

Selecting the Correct Operating System

• When creating a VM, select the correct OS to:
• Determine the optimal monitor mode to use
• Determine the default optimal devices, e.g. SCSI controller
• Correct version of VMware Tools is installed

• Use a 64-bit OS only when necessary
• If running 64-bit applications
• 64-bit VMs require more memory overhead
• Compare 32-bit / 64-bit application benchmarks to determine the
worthiness of 64-bit

1/3/2012 5

Guest Operating System Timekeeping

• To keep time, most OS’ count periodic timer interrupts, or “ticks”
• Tick frequency is 64-1000Hz or more

• Counting “ticks” can be a real-time issue
• Ticks are not always delivered on time
• VM might be descheduled
• If a tick is lost, time falls behind
• Ticks are backlogged
• When backlogged, the system delivers ticks faster to catch up

• Mitigate these issues by:
• Use guest OS’ that require fewer ticks
• Most Windows – 66 to 100Hz, Linux 2.4: 100Hz
• Linux 2.6 – 1000Hz, Recent Linux – 250Hz
• Clock synchronisation software (VMware Tools)
• For Linux use NTP instead of VMware Tools
• Check article kb1006427

1/3/2012 6

SMP Guidelines

• Avoid using SMP unless specifically required
• Processes can migrate across vCPUs
• This increases CPU overhead
• If migration is frequent – use CPU Affinity

• VMkernel in vSphere now owns all PCI devices
• No performance concern with IRQ sharing in vSphere
• Earlier versions of ESX had performance issues when IRQ shared
devices were owned by the Service Console and VMkernel

1/3/2012 7

NUMA Server Considerations

• If using NUMA, ensure that a
VM fits in a NUMA node
• If the total amount of guest
vCPUs exceeds the cores,
then the VM is not treated as
a NUMA client and not
managed by the NUMA
balancer
• VM Memory size is greater
than memory per NUMA node
• Wide VMs are supported in
ESX 4.1
• If more vCPUs than cores
• Splits the VM into multiple
NUMA clients
• Improves memory locality

1/3/2012 8

Virtual Machine Performance Best Practices

• Select the right guest operating system type during virtual machine
creation

• Use 64-bit operating systems only when necessary

• Don’t deploy single-threaded applications in SMP virtual machines

• Configure proper time synchronization

• Install VMware Tools in the guest operating system and keep up-to-
date

1/3/2012 9

CPU Performance Management

• What are Worlds?

• CPU Scheduling
• How it works
• Processor Topology
• SMP and CPU ready time

• What affects CPU Performance?
• Causes and resolution of Host Saturation
• ESX PCPU0 – High Utilization


1/3/2012 10

Worlds

• A world is an execution context scheduled on a processor (a.k.a.
Process)

• A virtual machine is a group of worlds
• World for each vCPU
• World for virtual machines Mouse, Keyboard and Screen (MKS)
• World for VMM

• CPU Scheduler chooses which world to schedule on a processor

• VMkernel worlds are known as non-virtual machine worlds

1/3/2012 11

CPU Scheduling

• Schedules vCPUs on physical CPUs

• Scheduler check physical CPU utilization every 20 milliseconds and
migrates worlds as necessary

• Entitlement is implemented when CPU resources are overcommitted
• Calculated from user resource specifications, such as Shares,
Reservations and Limits
• Ratio of consumed CPU / Entitlement is used to set priority of the world
• High Priority = consumed CPU < Entitlement

1/3/2012 12

CPU Scheduling – SMP VMs

• ESX/ESXi uses a form of co-scheduling to run SMP VMs

• Co-scheduling is a technique that schedules related processes to run on
different processors at the same time
• At any time, each vCPU might be scheduled, descheduled, pre-empted or
blocked while waiting for some event.

• The CPU Scheduler takes “skew” into account when scheduling vCPUs
• Skew is the difference in execution rates between two or more vCPUs
• A vCPUs “skew” increases when it is not making progress whilst one of its
sibling is.
• A vCPU is considered to be skewed if its cumulative skew exceeds a
configurable threshold (typically a few milliseconds)

• Further relaxed co-scheduling introduced to mitigate vCPU Skew

• More information on vSphere 4 CPU Scheduler visit
http://www.vmware.com/resources/techresources/10059

1/3/2012 13

Processor Topology / Cache Aware

• The CPU Scheduler uses processor topology information to optimize
the placement of vCPUs onto different sockets.

• The CPU Scheduler spreads load across all sockets to maximize the
aggregate amount of cache available.
• Cores within a single socket typically use shared last-level cache
• Use of a shared last-level cache can improve vCPU performance if
running memory intensive workloads.
• Reduces the need to access slower main memory
• Last-level cache has a dedicated channel to a CPU socket enabling it to
run at the full speed of the CPU

1/3/2012 14

CPU Ready Time

• The amount of time the vCPU waits for physical CPU time to
become available

• Co-scheduling SMP VMs can increase CPU ready time

• This latency can impact on the performance of the guest operating
system and its applications in a VM

• Avoid over committing vCPUs to host physical CPUs

1/3/2012 15

What affects CPU Performance?

• Idling VMs
• Consider the overhead of delivering guest timer interrupts
• Try to use as few vCPUs as possible to reduce timer interrupts and
reduce any co-scheduling overhead

• CPU Affinity
• This can constrain the scheduler and cause an imbalanced load.
• VMware strongly recommends against using CPU affinity

• SMP VMs
• Some co-scheduling overhead is incurred.

• Insufficient CPU resources to satisfy demand
• When there is contention the scheduler forces vCPUs of lower-priority
VMs to queue behind higher-priority VMs

1/3/2012 16

Causes of Host CPU Saturation

• VMs running on the host are demanding more CPU resource than
the host has available.

• Main scenarios for this problem:
• The host has a small number of VMs with high CPU demand
• The host has a large number of VMs with moderate CPU demand
• The host has a mix of VMs with high and low CPU demand

1/3/2012 17

Resolving Host CPU Saturation

• Reduce the VMs on the host
• Monitor CPU Usage (both Peak and Average) of each VM, then identify and
migrate the VMs to other hosts with spare CPU capacity
• Load can be manually rebalanced without downtime
• If additional hosts are not available, either power down non-critical VMs or
use resource controls such as Shares.

• Increase the CPU resources by adding the host to a DRS cluster
• Automatic vMotion migrations to load balance across Hosts

• Increase the efficiency of a VM’s CPU Usage
• Reference application tuning guides, papers, forums.
• Guest operating system and application should include:
• Use of large memory pages
• Reducing timer interrupt rate for VM operating system
• Optimize VMs by:
• Choose right vCPU number and memory allocation
• Use resource controls to direct available resources to critical VMs

1/3/2012 18

ESX Host – pCPU0 High Utilization

• Monitor the usage on pCPU0
• There is high utilization on pCPU0, if the average usage is > 75% and
20% > than overall host usage

• Within ESX the service console is restricted to running on pCPU0

• Management agents installed in Service Console can request large
amounts of CPU

• High utilization on PCPU0 can impact on hosted VMs performance

• SMP VMs running on NUMA systems might be impacted when
assigned to the home node that includes pCPU0

1/3/2012 19

CPU Performance Best Practices

• Avoid using SMP unless specifically required by the application
running in the VM.

• Prioritize VM CPU Usage with Shares

• Use vMotion and DRS to load balance VMs and reduce contention

• Increase the efficiency of VM usage by:
• Referencing application tuning guides
• Tuning the guest operating system
• Optimizing the virtual hardware

1/3/2012 20

Memory Performance Management

• Reclamation – how and why?

• Monitoring – what and why?

• Troubleshooting

• vSwp files placement guidelines


1/3/2012 21

Memory Reclamation Challenges

• VM physical memory is not
“freed”
• Memory is moved to the “free”
list

• The hypervisor is not aware
when the VM releases memory
• It has no access to the VMs
“free” list
• The VM can accrue lots of host
physical memory

• Therefore, the hypervisor
cannot reclaim released VM
memory

1/3/2012 22

VM Memory Reclamation Techniques

• The hypervisor relies on these techniques to “free” the host physical
memory

• Transparent page sharing (default)
• redundant copies reclaimed

• Ballooning
• Forces guest OS to “free” up guest physical memory when the physical
host memory is low
• Balloon driver installed with VMware Tools

• Host-level (hypervisor) swapping
• Used when TPS and Ballooning are not enough
• Swaps out guest physical memory to the swap file
• Might severely penalize guest performance

1/3/2012 23

Memory Management Reporting

Production Cluster
Memory Shared, Ballooned and Swapped
VIXEN (ESX)

Average Swap space i n use MB
Average Amount of memo ry used by memory control MB

Average Memory shared across VMs MB

5,000

4,000

3,000

2,000

1,000

0

1/3/2012 24

Why does the Hypervisor Reclaim Memory?

• Hypervisor reclaims memory to support memory overcommitment

• ESX host memory is overcommitted when the total amount of VM
physical memory exceeds the total amount of host

1/3/2012 25

When to Reclaim Host Memory

• ESX/ESXi maintains four host free memory states and associated
thresholds:
• High (6%), Soft (4%), Hard (2%), Low (1%)

• If the host free memory drops towards the stated thresholds, the
following reclamation technique is used:

High None
Soft Ballooning
Hard Swapping and Ballooning
Low Swapping

1/3/2012 26

Monitoring VM and Host Memory Usage

• Active
• amount of physical host memory currently used by the guest
• displayed as “Guest Memory Usage” in vCenter at Guest level
• Consumed
• amount of physical ESX memory allocated (granted) to the guest, accounting for
savings from memory sharing with other guests.
• includes memory used by Service Console & VMKernel
• displayed as “Memory Usage” in vCenter at Host level
• displayed as “Host Memory Usage” in vCenter at Guest level

• If consumed host memory > active memory
• Host physical memory not overcommitted
• Active guest usage low but high host physical memory assigned
• Perfectly normal

• If consumed host memory <= active memory
• Active guest memory might not completely reside in host physical memory
• This might point to potential performance degradation

1/3/2012 27

Memory Troubleshooting

1. Active host-level swapping
• Cause: excessive memory overcommitment
• Resolution:
• reduce memory overcommitment (add physical memory / reduce VMs)
• enable balloon driver in all VMs
• reduce memory reservations and use shares

2. Guest operating system paging
• Monitor the hosts ballooning activity
• If host ballooning > 0 look at the VM ballooning activity
• If VM ballooning > 0 check for high paging activity within the guest OS

3. When swapping occurs before ballooning
• Many VMs are powered on at same time
• VMs might access a large portion of their allocated memory
• At the same time, the balloon drivers have not started yet
• This causes the host to swap VMs

1/3/2012 28

vSwp file usage and placement guidelines

• Used when memory is overcommitted

• vSwp file is created for every VM

• Default placement is with VM files

• Can affect vMotion performance if vSwp file is not located on Shared
Storage

1/3/2012 29

Memory Performance Best Practices

• Allocate enough memory to hold the working set of applications you
will run in the virtual machine, thus minimizing swapping

• Never disable the balloon driver

• Keep transparent page sharing enabled

• Avoid over committing memory to the point that it results in heavy
memory reclamation

1/3/2012 30

Networking Performance Management

• Reducing CPU Load
• TCP Segmentation Off Load and jumbo frames

• NetQueue
• What is it and how will it benefit me?

• Monitoring
• How to identify network performance problems


1/3/2012 31

TCP Segmentation Off-Load

• Segmentation is the process of breaking messages into frames
• Size of a frame is the Maximum Transmission Unit (MTU)
• Default MTU is 1500 bytes

• Historically, the operating system used the CPU to perform
segmentation.

• Now modern NICs optimize TCP segmentation
• Using larger segments and offloading from CPU to NIC hardware
• Improves networking performance by reducing the CPU overhead involved
in sending large amounts of TCP traffic

• TSO is supported in VMkernel and Guest OS
• Enabled by default in VMkernel
• You must select Enhanced vmxnet, vmxnet2 (or later) or e1000 as the
network device for the VM

1/3/2012 32

Jumbo Frames

• Data is transmitted into MTU sized frames

• The receive side reassembles the data

• A jumbo frame:
• Is an Ethernet frame with a bigger MTU, typically 9000 bytes
• reduces the number of frames transmitted
• reduced the CPU utilization on transmit and receive side

• VMs must be configured with vmxnet2 or vmxnet3 adapters

• The network must support jumbo frames end to end

1/3/2012 33

NetQueue

• NetQueue is a performance technology that significantly improves
performance in a 10Gb Ethernet environment

• How?
• It allows network processing to scale over multiple CPUs
• Multiple transmit and receive queues are used to allow I/O processing
across multiple CPUs.

• NetQueue monitors receive load and balances across queues
• Can assign queues to critical VMs

• NetQueue requires MSI-X support from the server platform, so
NetQueue support is limited to specific systems.

• For further information -
http://www.vmware.com/support/vi3/doc/whatsnew_esx35_vc25.html

1/3/2012 34

Monitoring Network Statistics

• Network packets are queued in buffers if:
• The destination is not ready to receive (Rx)
• The network is too busy to send (Tx)

• Buffers are finite in size.
• Virtual NIC devices buffer packets when they cannot be handled
immediately
• If the Virtual NIC queue fills, packets are buffered by the virtual switch port
• Packets are dropped if this happens

• Monitor the droppedRx and droppedTx values

• If droppedRX or dropped Tx values > 0 = network throughput issue
• For droppedRx = check CPU Utilization and driver configuration
• For droppedTx = check virtual switch usage, move VMs to other virtual
switch

1/3/2012 35

Networking Best Practices

• Use the vmxnet3 network adapter where possible
• Use vmxnet, vmxnet2 if vmxnet3 is not supported by the guest
operating system

• Use a physical network adapter that supports high-performance
features

• Use TCP Off-load and Jumbo Frames where possible to reduce the
CPU load.

• Monitor droppedRx and droppedTx metrics for network throughput
information

1/3/2012 36

Storage Performance Management

• LUN Queue Depth

• Monitoring

• Storage Response Time – key factors

• Overview of Storage Best Practices

1/3/2012 37

LUN Queue Depth

• LUN queue depth determines how many commands to a given LUN
can be active at one time.

• The default depth driver queue depth is 32:
• Qlogic FC HBAs support up to 255
• Emulex HBAs support up to 128
• Maximum recommended queue depth is 64.

• If ESX generates more commands to a LUN than the specified
queue depth:
• Excess commands are queued
• Disk I/O Latency increased

• Review the Disk.SchedNumReqOutstanding parameter

1/3/2012 38

Monitoring Disk Metrics

• Monitor the following key metrics, to determine available bandwidth and
identify disk-related performance issues:

• Disk throughout
• Disk Read Rate, Disk Write Rate and Disk Usage
• Latency (Device and Kernel)
• Physical device command latency (values > 15ms indicate a slow array)
• Kernel command latency (values should be 0-1ms, >4ms more data being sent
to the storage system than it supports

• Number of aborted disk commands
• If this value is >0 for any LUN, then storage is overloaded on that LUN

• Number of active disk commands
• If this value is close to or zero, the storage subsystem is not being used

• Number of active commands queued
• Significant double digit average values indicate storage hardware unable to
process the hosts I/O requests fast enough

1/3/2012 39

Storage Response Time - Factors

Three main factors:

• I/O arrival rate
• Maximum rate at which a storage device can handle specific mixes of I/O
requests. Requests may be queued in buffers if they exceed this rate.
• This queuing can add to the overall response time

• I/O size
• Transmission rate of storage interconnects. Huge IOPS naturally take
longer to complete.

• I/O locality
• I/O requests to data that is stored sequentially can be completed faster than
to data that is stored randomly.
• High speed caches typically complete read requests to sequential data

1/3/2012 40

Storage Performance Best Practices

• Applications that write a lot of data to storage should not share
Ethernet links to a storage device

• Eliminate all possible swapping to reduce the burden on the storage
subsystem

• If required, set the LUN Queue size to 64 (default 32)

1/3/2012 41

x86 Virtualization Challenges

• Privilege levels

• Software Virtualization – Binary Translation

• Hardware Virtualization – Intel VT-x and AMD-V

• Memory Management Concepts
• MMU Virtualization
• Software MMU
• Hardware MMU

• Memory Virtualization Overhead

1/3/2012 43

x86 Virtualization Challenges

• x86 operating systems are designed to run
on the bare-metal hardware.

• Four levels of privilege are available to
operating systems and applications – (Ring
0,1, 2 & 3)

• Operating system needs direct access to
the memory and hardware and must
execute instructions in Ring 0.

• Virtualizing x86 architecture requires a
virtualization layer under the operating
system

• Initial difficulties in trapping and translating
sensitive and privileged instructions made
virtualizing x86 architecture look impossible!

• VMware resolved the issue by developing
Binary Translation

1/3/2012 44

Software Virtualization - Binary Translation

• Original approach to virtualizing
the (32-bit) x86 instruction set

• Binary Translation allows the
VMM to run in Ring 0

• Guest operating system moved
to Ring 1

• Applications still run in Ring 3

1/3/2012 45

Hardware Virtualization

• In addition to software virtualization:
• Intel VT-x
• AMD –V

• Both are similar in aim but different in
detail

• Aim: is to simplify virtualization techniques
• VMM removes Binary Translation whilst
fully controlling VM.
• Restricts privileged instructions the VM can
execute without assistance from VMM.

• CPU execution mode feature allows:
• The VMM to run in a root mode below 0
• Automatically traps privileged and sensitive
call to the hypervisor
• Stores the guest operating system state in
VM control structures (Intel) or blocks
(AMD)

1/3/2012 46

Memory Management Concepts

• Memory virtualization is next critical component

• Processes see virtual memory

• Guest operating systems use page tables to map virtual memory
addresses to physical memory addresses

• The MMU translates virtual addresses to physical addresses and
the TLB cache help the MMU speed up these translations.

• Page table is consulted if a TLB hit is not achievable.

• The TLB is updated with virtual/physical address map, when page
table walk is completed.

1/3/2012 47

MMU Virtualization

• Hosting multiple virtual machines on a single host requires:
• Another level of virtualization – Host Physical Memory

• VMM maps “guest” physical addresses (PA) to host physical
addresses (MA)

• To support the Guest operating system, the MMU must be virtualized
by using:
• Software technique: shadow page tables
• Hardware technique: Intel EPT and AMD RVI

1/3/2012 48

Software MMU Virtualization - Shadow Page Tables

• Are created for each primary page table

• Consist of two mappings:
• Virtual Addresses (VA) -> Physical Addresses (PA)
• Physical Addresses (PA) -> Machine Addresses (MA)

• Accelerate memory access
• VMM points the hardware MMU directly at Shadow Page Tables
• Memory access runs as native speed
• Ensures VM cannot access host physical memory that is not associated

1/3/2012 49

Hardware MMU Virtualization

• AMD RVI and Intel EPT permit two levels of address mapping
• Guest page tables
• Nested page tables

• When a virtual address is accessed, the hardware walks both the
guest page and nested page tables

• Eliminates the need for VMM to synchronize shadow page tables
with guest page tables

• Can affect performance of applications that stress the TLB
• Increases the cost of a page walk
• Can be mitigated by use of Large Pages

1/3/2012 50

Memory Virtualization Overhead

• Software MMU virtualization incurs CPU overhead:
• When new processes are created
• New address spaces created
• When context switching occurs
• Address spaces are switched
• Running large numbers of processes
• Shadow page tables need updating
• Allocating or deallocating pages

• Hardware MMU virtualization incurs CPU overhead
• When there is a TLB miss

1/3/2012 51

webinar vmware v-sphere performance management Challenges and Best Practices

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a webinar vmware v-sphere performance management Challenges and Best Practices

Semelhante a webinar vmware v-sphere performance management Challenges and Best Practices (20)

Último

Último (20)

webinar vmware v-sphere performance management Challenges and Best Practices

Notas do Editor