How to Fail at VDI
- 2. How to Fail at VDI
Dan Brinkmann @dbrinkmann
blog.danbrinkmann.com
Solutions Architect, VMware vExpert
Lewan & Associates (Denver, CO)
BriForum | © TechTarget
- 4. Business/Expectation VDI Failures
● No business problem
● Desktop virtualization is not server virtualization
● Saving money
● Project in the hands of the vSphere administrator
● No success criteria
● Assume you know what users do
● The same or better experience remotely as locally
- 6. How to Fail at VDI
The technology failure points
● Test with 5 users
● Using vendor provided users/core sizing
● Using vendor provided IOPs estimates
● Ignore anti-virus
● Ignore user profile management
● Use existing desktop images for physical PCs
● Guess
- 7. Compute
It’s magic until it stops working
● Multi-threaded apps
● Latency sensitive workloads
● Hyperthreading
● Latency = Health
- 8. Compute
CPU scheduler in vSphere
● The CPU scheduler in vSphere is entitlement/consumption based, not priority based (unlike Windows)
● There is no priority in the CPU scheduler
● Given equal entitlement, the more a VM/world consumes, the more likely it is to be preempted by another VM/world
● http://www.vmware.com/resources/techresources/10131
- 9. Compute with a Physical PC
OS/Apps/Profile
CPU 1
- 10. Compute with Citrix XenApp
[Diagram: several OS/Apps/Profile stacks and per-user Profile boxes sharing CPU 1 and CPU 2]
- 13. vSphere Compute
This is better performance monitoring - ESXTOP
Display | Metric | Threshold | Explanation
CPU | %RDY | 10 | Overprovisioning of vCPUs, excessive usage of vSMP or a limit (check %MLMTD) has been set.
CPU | %CSTP | 3 | Excessive usage of vSMP. Decrease amount of vCPUs for this particular VM. This should lead to increased scheduling opportunities.
CPU | %SYS | 20 | The percentage of time spent by system services on behalf of the world. Most likely caused by high IO VM. Check other metrics and VM for possible root cause.
CPU | %MLMTD | 0 | The percentage of time the vCPU was ready to run but deliberately wasn’t scheduled because that would violate the “CPU limit” settings. If larger than 0 the world is being throttled due to the limit on CPU.
CPU | %SWPWT | 5 | VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment.
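The thresholds in the table above can be applied mechanically to a sample of counters. The following is a minimal sketch, not an esxtop API: the function name `flag_cpu_metrics` and the dictionary input format are hypothetical, and treating "above the threshold" as a strict greater-than comparison is an assumption (the slide only states it explicitly for %MLMTD).

```python
# Thresholds copied from the ESXTOP CPU table on this slide.
CPU_THRESHOLDS = {
    "%RDY": 10,
    "%CSTP": 3,
    "%SYS": 20,
    "%MLMTD": 0,
    "%SWPWT": 5,
}


def flag_cpu_metrics(sample):
    """Return the names of metrics in `sample` that exceed their threshold.

    `sample` is a hypothetical dict of metric name -> observed percentage,
    for example values read off an esxtop screen by hand.
    Assumption: a metric is flagged only when strictly above its threshold.
    """
    return sorted(
        name
        for name, value in sample.items()
        if name in CPU_THRESHOLDS and value > CPU_THRESHOLDS[name]
    )


if __name__ == "__main__":
    observed = {"%RDY": 14.2, "%CSTP": 0.5, "%MLMTD": 0.0}
    print(flag_cpu_metrics(observed))
```

A flagged %RDY alongside a non-zero %MLMTD would point at a CPU limit rather than plain overprovisioning, which is why the table says to check both.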
- 17. Summary on Compute
● Multithreading, vSMP
● Not priority based
● % Utilization is not the complete picture
● Latency = Health
● http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1017926
- 18. Storage
The wrath of the math
● #1 cause of performance issues in server virtualization
● #1 cause of performance issues in desktop virtualization
● Latency = Health
- 20ms - in trouble
- 50ms - your users hate you
- 19. What You Need to Know
● Capacity vs performance
● Random vs sequential
● Average vs peak
● Where it’s coming from
● Most are guessing
- 20. Storage
Spinning disk
Device | Type | IOPS
7,200 rpm SATA drives | HDD | ~75-100 IOPS
10,000 rpm SATA drives | HDD | ~125-150 IOPS
10,000 rpm SAS drives | HDD | ~140 IOPS
15,000 rpm SAS drives | HDD | ~175-210 IOPS
- 21. RAID Penalty
RAID level | Read penalty | Write penalty
RAID 0 | 1 | 1
RAID 1 and 10 | 1 | 2
RAID 5 | 1 | 4
RAID 6 | 1 | 6
- 22. The Math – RAID 5 50/50
Some back of the napkin math
● 500 users, Windows 7, 20 IOPs avg, 50/50 read/write, RAID 5
● 500 * 20 = 10,000 IOPs – 5,000 read, 5,000 write
● 5,000 write * 4 = 20,000 + 5,000 read = 25,000 IOPs
● 25,000 IOPs on 15K spindles (200 IOPS) = 125 spindles
- 23. The Math – RAID 10 50/50
Some back of the napkin math
● 500 users, Windows 7, 20 IOPs avg, 50/50 read/write, RAID 10
● 500 * 20 = 10,000 IOPs – 5,000 read, 5,000 write
● 5,000 write * 2 = 10,000 + 5,000 read = 15,000 IOPs
● 15,000 IOPs on 15K spindles (200 IOPS) = 75 spindles
- 24. The Math – RAID 10 20/80
Some back of the napkin math
● 500 users, Windows 7, 20 IOPs avg, 20/80 read/write, RAID 10
● 500 * 20 = 10,000 IOPs – 2,000 read, 8,000 write
● 8,000 write * 2 = 16,000 + 2,000 read = 18,000 IOPs
● 18,000 IOPs on 15K spindles (200 IOPS) = 90 spindles
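The napkin math on the last three slides follows one pattern: split front-end IOPS into reads and writes, multiply the writes by the RAID write penalty, then divide by per-spindle IOPS. A minimal sketch of that pattern, assuming the penalties from the RAID Penalty slide and 200 IOPS per 15K spindle as used in the slides; the function names are hypothetical:

```python
import math

# Write penalties from the RAID Penalty slide
# (back-end I/Os generated per front-end write). Reads cost 1 at every level.
WRITE_PENALTY = {"RAID 0": 1, "RAID 1": 2, "RAID 10": 2, "RAID 5": 4, "RAID 6": 6}


def backend_iops(users, iops_per_user, read_pct, raid_level):
    """Back-end IOPS the array must deliver for a given front-end load.

    read_pct is the read share as a whole-number percentage (e.g. 50 or 20).
    """
    frontend = users * iops_per_user
    reads = frontend * read_pct // 100
    writes = frontend - reads
    return reads + writes * WRITE_PENALTY[raid_level]


def spindles_needed(total_backend_iops, iops_per_spindle=200):
    """Number of disks required, rounded up to a whole spindle."""
    return math.ceil(total_backend_iops / iops_per_spindle)


if __name__ == "__main__":
    scenarios = [
        ("RAID 5", 50),   # 50/50 read/write
        ("RAID 10", 50),  # 50/50 read/write
        ("RAID 10", 20),  # 20/80 read/write
    ]
    for raid_level, read_pct in scenarios:
        total = backend_iops(500, 20, read_pct, raid_level)
        print(raid_level, f"{read_pct}% read:", total, "IOPs,",
              spindles_needed(total), "spindles")
```

The sketch sizes for average load only; it ignores cache, peaks such as boot or login storms, and capacity, so it is a starting point rather than a sizing tool.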
- 25. vSphere Storage Latency
[Diagram: storage latency stack. Layers shown: Application, Filesystem, I/O Drivers, Device Queue, Virtual SCSI, VMkernel Filesystem]
A = Application Latency
R = Physical Disk “Disk Secs/Transfer”
G = Guest Latency
K = ESX Kernel
D = Device Latency
- 26. vSphere Storage
Performance monitoring for storage
Display | Metric | Threshold | Explanation
DISK | GAVG | 20 | Look at “DAVG” and “KAVG” as the sum of both is GAVG.
DISK | DAVG | 20 | Disk latency most likely to be caused by array.
DISK | KAVG | 2 | Disk latency caused by the VMkernel; high KAVG usually means queuing. Check “QUED”.
DISK | QUED | 1 | Queue maxed out. Possibly queue depth set too low. Check with array vendor for optimal queue depth value.
DISK | ABRTS/s | 1 | Aborts issued by guest (VM) because storage is not responding. Can be caused when paths failed.
DISK | RESETS/s | 1 | The number of commands reset per second.
DISK | CONS/s | 20 | SCSI Reservation Conflicts per second. Can be caused by too many VMDKs on a datastore.
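Because GAVG is the sum of DAVG and KAVG, the two components indicate where to look first. The sketch below encodes that reasoning using the thresholds from the table; the function name `diagnose_latency`, the finding strings, the millisecond units, and the strict greater-than comparison are all assumptions made for illustration.

```python
def diagnose_latency(davg_ms, kavg_ms):
    """Return (gavg_ms, findings) for one device latency sample.

    Thresholds come from the storage table on this slide:
    GAVG 20, DAVG 20, KAVG 2. Values are assumed to be in milliseconds,
    and a metric is flagged only when strictly above its threshold.
    """
    gavg_ms = davg_ms + kavg_ms  # GAVG is the sum of DAVG and KAVG
    findings = []
    if gavg_ms > 20:
        findings.append("GAVG high: guest sees slow storage")
    if davg_ms > 20:
        findings.append("DAVG high: most likely the array")
    if kavg_ms > 2:
        findings.append("KAVG high: likely queuing, check QUED")
    return gavg_ms, findings


if __name__ == "__main__":
    gavg, findings = diagnose_latency(davg_ms=25, kavg_ms=1)
    print(gavg, findings)
```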
- 27. Building for Read IOPs
Fairly easy
● Memory - Storage controller cache, PVS
● Host/Hypervisor - CBRC, Intellicache
● Storage - SSD tiering / flash cache
- 28. Building for Write IOPs
Much harder…and expensive
● Profiles/Apps
● Spinning disk
● SSD tiering
● Local disk
● IO optimization (dedupe, serializing IO)
- 29. Storage Summary
● 25,000 IOPs R5 50/50 – 125 spindles
● 15,000 IOPs R10 50/50 – 75 spindles
● 18,000 IOPs R10 20/80 – 90 spindles
● Latency is the key metric
● Write IOPs, and whatever generates them, are the #1 focus
- 30. How does this relate to VDI failure?
● Pilot performance is great, then terrible in production
● Boot storm vs login storm
● Applications in gold image vs streamed
● Read/write ratio is important
● Anti-virus software
● Existing desktop images
- 31. Guessing
You need to use tools to do this
● Initial sizing
● Determine peaks and when they occur
● Baseline application impact
● Monitor application impact over time
● Application updates/changes
- 32. Project testing
Good to know what you are and aren’t doing
● Unit/system testing
● Application testing
● Performance/scalability testing
● Operational testing
● User acceptance testing
- 33. Summary
● Understand your limited resources (compute/storage)
● Don’t guess
● Testing with 5 users: what kind of testing is that, and what are you really accomplishing?