3. www.metron-athene.com
1: Dangers with OS Metrics
[Diagram: CPU over a 1-second interval. VM1 (Dormant/Idle for part of the interval): OS: 50% CPU Busy, VMware: 25% CPU Busy. VM2: OS: 50% CPU Busy, VMware: 50% CPU Busy.]
Time Slicing
• Cores are shared between vCPUs in time slices
– 1 vCPU to 1 core at any point in time
• More vCPUs = More time slicing
• More time slicing = less accurate data from the OS
• Ignore OS metrics that involve time
– (Disk Occupancy is probably OK)
[Diagram: VM1 timeline showing Running and Dormant/Idle periods]
2: Ready Time
• Ready Time
– VM wants to process, but can’t
– Accumulated against VM
– More of a stack than a queue
– Contention for CPUs
– Performance impact
• How to avoid Ready Time
– Fewer vCPUs per VM
– Monitor: CPU Threads vs vCPUs, and Ready Time
3: Monitoring Memory
• Tightest headroom in most clusters
• Not just a question of % used
– Reservations
– Limits
– Ballooning
– Shared Pages
– Active Memory
– Memory Available for VMs
5: Trending Clusters
• VMs have soft limits
• Resource Pools have soft limits
• Individual hosts are unimportant
• Want to know when you’ll run out of capacity?
– The hardware is the limit
– Trend hardware utilisation
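As a rough illustration of trending hardware utilisation, the sketch below fits a straight line to cluster-level CPU utilisation and projects when it would reach the hardware limit. The function names, the weekly sampling and the sample figures are all hypothetical, and a straight-line fit is only one possible trending method, not necessarily the one used in practice.

```python
def linear_fit(xs, ys):
    """Least-squares straight line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_limit(xs, ys, limit_pct=100.0):
    """Periods after the last sample until the trend reaches limit_pct.

    Returns None when utilisation is flat or falling.
    """
    slope, intercept = linear_fit(xs, ys)
    if slope <= 0:
        return None
    return (limit_pct - intercept) / slope - xs[-1]


# Made-up weekly averages of cluster-wide CPU utilisation (% of hardware).
weeks = [0, 1, 2, 3, 4, 5]
cluster_cpu_pct = [52.0, 54.0, 56.0, 58.0, 60.0, 62.0]

remaining = periods_until_limit(weeks, cluster_cpu_pct)
print(f"Trend reaches the hardware limit in about {remaining:.0f} weeks")
```

The trend is taken over the whole cluster's hardware rather than any single host, VM or resource pool, since those have soft limits that can be moved.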
Review
1. Dangers with OS Metrics
– Time slicing
2. Ready Time
– Delays processing
– vCPU vs Cores
3. Monitoring Memory
– Swap
– Allocated vs Active vs Consumed on Host
4. Disk Latency
– Device and Kernel
5. Trending Clusters
– Size of the cluster is the hard limit
Editor's Notes
Dangers with OS Metrics:
Almost every time we discuss data capture for VMware, someone asks whether we can capture the utilisation of specific VMs by monitoring the OS. The simple answer is no.
The more complex answer is that we can capture the data from the OS, but it may not be reliable. Here’s an example of why.
We have 2 VMs. Within the 1-second interval we are looking at, one of the VMs was on the CPU for only ½ a second. In that ½ second the VM used 50% of its possible CPU, so the OS thinks it was running at 50% CPU utilisation. If we look at data from VMware, we’ll see that VMware knows the VM used only ½ of the CPU available to it during ½ a second: 25% of the full second.
The 2nd VM was running on the CPU for the entire second and again used 50% of its possible CPU. So the OS thinks it was running at 50% CPU utilisation, and VMware reports the same result.
The more contention there is for CPU time, the more time VMs will spend Dormant/Idle, and the further apart the two values will be. This effect means that any metric with an element of time in its calculation cannot be relied upon to be accurate.
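The arithmetic in this example can be written out as a short sketch. The function and parameter names are made up for illustration, and the model is deliberately simplified: it assumes the guest OS measures busy time only against the time it was actually on a core, while the hypervisor measures against the full wall-clock interval.

```python
def reported_cpu(scheduled_fraction: float, busy_while_scheduled: float) -> tuple[float, float]:
    """Return (os_view, hypervisor_view) CPU utilisation as percentages.

    scheduled_fraction: share of the wall-clock interval the VM was on a core (0..1).
    busy_while_scheduled: share of its on-core time the VM spent doing work (0..1).
    """
    # The guest OS is unaware of the time it was descheduled, so it
    # reports utilisation relative to its on-core time only.
    os_view = busy_while_scheduled * 100
    # The hypervisor measures the same work against the whole interval.
    hypervisor_view = scheduled_fraction * busy_while_scheduled * 100
    return os_view, hypervisor_view


vm1 = reported_cpu(scheduled_fraction=0.5, busy_while_scheduled=0.5)
vm2 = reported_cpu(scheduled_fraction=1.0, busy_while_scheduled=0.5)
print(vm1)  # (50.0, 25.0)
print(vm2)  # (50.0, 50.0)
```

The smaller `scheduled_fraction` becomes, the wider the gap between the two views.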
Here is data from a real VM.
The (top) dark blue line is the data captured from the OS, and the (bottom) light blue line is the data from VMware. While there is clearly some correlation between the two, at the start of the chart there is a difference of about 1.5% CPU. Given we’re only running at about 4.5% CPU, that is an overestimation by the OS of about 35%. At about 09:00 the difference is ~0.5%, so the gap doesn’t remain stable either. It would not be unusual to see the OS reporting 70% CPU utilisation and VMware reporting 30%.
The difference we saw between the OS and VMware is caused by time slicing. A typical VMware host has more vCPUs assigned to VMs than it has physical cores, so the processing time of the cores has to be shared among the vCPUs. The more vCPUs we have, the less time each can be on a core, and therefore the slower time passes for that VM. To keep the VM’s clock in step, extra timer interrupts are sent in quick succession. So time passes slowly and then very fast.
Time is no longer a constant, but the OS doesn’t know. So the safest approach is to avoid using anything from the OS that involves time.
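To make the effect of over-commitment concrete, here is a minimal sketch of how the on-core share shrinks as vCPUs are added. It assumes every vCPU wants to run all the time and that cores are shared evenly, which a real scheduler does not guarantee; `max_core_share` is a hypothetical helper, not a VMware metric.

```python
def max_core_share(vcpus: int, cores: int) -> float:
    """Upper bound on the fraction of wall-clock time each vCPU can be on a core,
    assuming all vCPUs are runnable and cores are shared evenly."""
    if vcpus <= 0 or cores <= 0:
        raise ValueError("vcpus and cores must be positive")
    # With no over-commitment every vCPU can have a core to itself.
    return min(1.0, cores / vcpus)


for vcpus in (8, 16, 32, 64):
    share = max_core_share(vcpus, cores=16)
    print(f"{vcpus:>2} vCPUs on 16 cores: each on a core at most {share:.0%} of the time")
```

Under these assumptions, doubling the vCPU count beyond the core count halves the time each vCPU can spend on a core, which is why time-based OS metrics drift further from the hypervisor's figures as contention grows.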
Imagine you are driving a car, and you are stationary. There could be several reasons for this. You may be waiting to pick someone up, you may have stopped to take a phone call, or you may have stopped at a red light. In the 1st two cases (pick-up, phone) you decided to stop the car to perform a task. But the red light is stopping you from doing something you want to do. You spend the whole time at the light ready to move away as soon as you get a green. That time is ready time. When a VM wants to use the processor but is prevented from doing so, it accumulates ready time. This has a direct impact on performance.
For any processing to happen, all the vCPUs assigned to the VM must be running at the same time. This means that a 4 vCPU VM needs 4 available cores or hyperthreads before it can run. So the fewer vCPUs a VM has, the more likely it is to be able to get onto the processors. You can reduce contention by giving each VM as few vCPUs as possible. And if you monitor CPU Threads, vCPUs and Ready Time, you’ll be able to see whether there is a correlation between increasing vCPU numbers and Ready Time in your systems.
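One way to check for that correlation is sketched below. It converts ready time reported as milliseconds per sample into a percentage of the sample interval, then computes a Pearson correlation against the vCPU-to-thread ratio. The 20-second default matches the vCenter real-time chart interval; the helper names and all sample figures are invented for illustration, and Pearson correlation is just one simple choice of measure.

```python
from math import sqrt


def ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Ready time as a percentage of the sample interval."""
    return ready_ms / (interval_s * 1000.0) * 100.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up samples: vCPUs per CPU thread on a host, and a VM's ready time
# (ms per 20-second sample) observed at each level of over-commitment.
vcpus_per_thread = [1.0, 1.5, 2.0, 2.5, 3.0]
ready_ms = [200, 400, 900, 1500, 2400]

ready_pct = [ready_percent(ms) for ms in ready_ms]
r = pearson(vcpus_per_thread, ready_pct)
print(f"Ready %: {ready_pct}")
print(f"Correlation between vCPUs per thread and Ready Time: {r:.2f}")
```

A coefficient close to 1 would suggest that Ready Time rises as more vCPUs are packed onto the same threads; it does not by itself prove that vCPU count is the cause.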