Cloudstack Top 5 technical issues and troubleshooting. Cloudstack is a mature product in use by companies worldwide. Having been associated with CloudStack development for over 5 years, Abhi has come across some technical issues that once in a while affect CloudStack deployments. This presentation is an effort to put together the top 5 such issues, analyze their symptoms, look at them from the perspective of the CloudStack architecture and the distributed nature of cloud orchestration, then look at ways to avoid them and finally troubleshoot them if they occur.
CloudStack - Top 5 Technical Issues and Troubleshooting
1. The Cloud Specialists
Cloudstack - Top 5 technical issues and troubleshooting
Looking at Cloudstack through the prism of support tickets.
2. Abhinandan Prateek
ShapeBlue.com @ShapeBlue
Software Architect @ ShapeBlue
Tinkering with cloudstack since 2011
Based out of Hyderabad, India
Apache cloudstack committer since 2012
aprateek@apache.org
abhinandan.prateek@shapeblue.com
Charminar - 1591
18m high monolithic Buddha statue
3. About ShapeBlue
“ShapeBlue are expert builders of public & private clouds. They are the leading global CloudStack services company.”
4. ShapeBlue customers
7. What do we do?
Amongst other things, we provide 2nd to 4th line remote support of the entire Cloudstack infrastructure.
We have helped build some of the biggest Cloudstack deployments.
We regularly work round the clock with enterprises around the world, troubleshooting live production environments.
8. What kind of deployments do we manage?
Variety is the spice of life.
A mix of hypervisors, mainly using advanced networking.
Customer environments vary from 4 hypervisor hosts to 2500 hypervisor hosts.
We have mainly been dealing with VMware, XenServer and KVM.
9. How did I arrive at the top 5 issues?
Analysed 698 support tickets (across 36 months) and divided them into specific areas like VR, storage, timeouts, agent and database.
Support tickets were divided into areas covering Cloudstack components like storage, networking or hypervisors.
Issues that depended on the environment, external hardware or obvious user errors were grouped under misc.
Cloudstack is overall very reliable, used by many large service providers and enterprise customers.
(70% of our customers are cloud service providers, 30% enterprise)
10. Service tickets by logical areas
[Bar chart: support tickets (%) per area, axis 0 to 40. Areas: Virtual Router, Storage, Misc, Upgrade, Timeouts, Hypervisor, Offering]
After analysing these support issues, 5 areas were picked for a deep dive.
11. Top 5 Cloudstack areas to understand and troubleshoot
Virtual router - Entangled cables
Storage - Storage tsunami
Timeouts - Database & Cloudstack-timers
Service offering - Possibilities for upgrading
Hypervisor - Operations
12. Virtual Router - Tangled cables
Problems with the VR are one of the most frequently reported sets of issues.
"My VR does not start"
"My network does not behave as I want it to"
13. Virtual Router Issues
My VR does not start:
Look at each step in VR orchestration to narrow down the issue.
Do a network restart with cleanup.
My network does not behave as I want it to:
Work out which service is not working: DHCP, DNS, NAT, LB, VPN, Password.
Access the VR to troubleshoot the service.
14. VR Orchestration
Steps in virtual router VM deployment:
1. Seed or upload the template to secondary storage.
2. Copy the template to primary storage.
3. Deploy the system VM + boot-args on the hypervisor.
4. Mount systemvm.iso.
5. /etc/init.d/cloud-early-config < boot-args + systemvm.iso
6. cloud-early-config initialises router specific services like haproxy and dnsmasq.
Settings shown on the diagram: systemvm.default.hypervisor, router.template.xenserver
[Diagram: Cloudstack, Secondary Storage, Hypervisor, VR, systemvm.iso]
15. VR Orchestration - Configuring rules
[Diagram: Cloudstack reaches the VR (systemvm.iso) through the hypervisor layer]
KVM: Cloudstack sends the command to the KVM agent.
VMware: via vCenter.
XenServer: via xapi and the host shell.
The command goes over ssh as a script name and params; router_proxy.sh proxies the command and parameters to the VR.
16. VR Troubleshooting
[Diagram: XenServer, xapi, shell, VR]
VR interfaces:
GUEST: eth0, vlan 778, 10.1.1.1
CONTROL: eth1, 169.254.3.116
PUBLIC: eth2, vlan 7, 10.1.34.223
ssh -p 3922 -i /root/.ssh/id_rsa.cloud root@169.254.3.116
Router details are found on the MS UI:
Home > Infrastructure > Virtual Routers > r-4-VM
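The login shown above can be scripted when it has to be repeated across many routers. This is a minimal sketch, assuming Python 3.8+ is available on the host you log in from; the helper name vr_ssh_command is hypothetical, while the port, key path and link-local IP are the ones shown on the slide.

```python
import shlex

def vr_ssh_command(control_ip, key_path="/root/.ssh/id_rsa.cloud",
                   port=3922, remote_cmd=None):
    """Build the ssh argv for reaching a system VM over its control (link-local) IP."""
    argv = ["ssh", "-p", str(port), "-i", key_path, f"root@{control_ip}"]
    if remote_cmd:
        # Optional command to run on the VR instead of opening a shell.
        argv.append(remote_cmd)
    return argv

if __name__ == "__main__":
    cmd = vr_ssh_command("169.254.3.116")
    # Print a copy-pasteable command line; hand the list to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning an argv list rather than a string avoids shell quoting problems if a remote command with spaces is passed.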
19. VR Troubleshooting
tcpdump port 3922 -i eth1 -c 10 -A
port: capture traffic on a specific port
-i: capture on a particular interface
-c: stop after a specific packet count
-A: print output in ASCII
21. Top 5 Cloudstack areas to understand and troubleshoot
Virtual router - Entangled cables
Storage - Storage tsunami
Timeouts - Database & Cloudstack-timers
Service offering - Possibilities for upgrading
Hypervisor - Operations
22. Secondary Storage - Tsunami
"My storage is full!"
"Storage 99% used"
"Storage not found"
"Copy operation not working!"
The volume download fails with "Failed to copy the volume from the source primary storage pool to secondary storage."
Cloudstack fails in migrating root volume from one primary storage to another for a specific VM.
[Diagram: Primary, SSVM, Secondary]
23. Storage Issues?
Primary Storage / Secondary Storage (SSVM)
Having issues with template download?
Having issues with volume extraction?
Host getting fenced due to unreachable storage?
Storage is full?
Having issues with snapshot chain?
Snapshot backup problems?
SSVM not running?
Host getting rebooted as a result of fencing the VM?
24. Storage: Primary and Secondary
Make sure the SSVM is running and working fine.
SSVM operations: Backup Snapshots, Copy Template, Download Template, Upload Volume
Objects handled: Templates, Volumes, Snapshots
25. Secondary storage troubleshooting
1. Login: ssh -i /root/.ssh/id_rsa -p 3922 root@169.254.3.178
SSVM interfaces:
GUEST: eth1, 10.2.6.32
CONTROL: eth0, 169.254.3.178
PUBLIC: eth2, 10.1.34.221
STORAGE: eth3, 10.3.34.221
26. Secondary storage troubleshooting
2. Run: /usr/local/cloud/systemvm/ssvm-check.sh
The SSVM health check checks the following:
• Connectivity with DNS server
• Resolving of domain names
• Status of secondary storage
• Ability to write to secondary storage
• Connectivity with management server at port 8250
• Status of java process
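Two of the checks listed above (name resolution and reachability of the management server on port 8250) are easy to reproduce by hand when the script is not available. The sketch below is an illustration only and is not the logic of ssvm-check.sh; the function names and the management server address are hypothetical.

```python
import socket

def can_resolve(hostname):
    """Return True if the name resolves through the configured resolver."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False

if __name__ == "__main__":
    ms_host = "192.0.2.10"  # hypothetical management server address
    print("resolve localhost:", can_resolve("localhost"))
    print("management server port 8250:", can_connect(ms_host, 8250, timeout=1.0))
```

A failed connect on 8250 with working name resolution points at routing or firewalling between the SSVM and the management server rather than at DNS.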
28. Secondary Storage Full?
Snapshots not getting cleared up!
This could happen if the snapshot chain gets broken, resulting in an exception in the MS logs.
[Diagram: XenServer, xapi, shell]
Do these global settings have anything to do with this issue?
1. storage.cleanup.interval: The interval (in seconds) to wait before running the storage cleanup thread. (Default: 86400)
2. storage.cleanup.enabled: Enables/disables the storage cleanup thread. (Default: true)
3. storage.template.cleanup.enabled: Enables/disables template cleanup activity; only takes effect when overall storage cleanup is enabled. (Default: true)
29. Secondary Storage Full?
Volumes are physically deleted from the storage device by the garbage collection process, which runs based on the following global settings:
1. expunge.delay: determines how old the volume must be before it is destroyed, in seconds (Default: 86400)
2. expunge.interval: determines how often to run the garbage collection check, in seconds (Default: 86400)
Is the storage really full?
Cleanup required? Migrate to a bigger NFS share.
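The two expunge settings add up when estimating how long space stays occupied after a volume is destroyed. The sketch below is a simplified model, assuming the garbage collector runs on a fixed interval and removes a volume on the first run after it becomes old enough; the function name is hypothetical.

```python
def worst_case_expunge_seconds(expunge_delay=86400, expunge_interval=86400):
    """Upper bound on the time between destroying a volume and its physical removal.

    The volume becomes eligible after expunge_delay; in the worst case the
    collector has just run, so the next run is a full expunge_interval away.
    """
    return expunge_delay + expunge_interval

if __name__ == "__main__":
    hours = worst_case_expunge_seconds() / 3600
    print(f"with the defaults, space may stay occupied for up to {hours:.0f} hours")
```

With both defaults at 86400 seconds this comes to 48 hours, which is worth keeping in mind before concluding that cleanup is not working.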
30. Secondary Storage Cleanup
For secondary storage cleanup you need to determine the items that are no longer in use or obsolete.
For that you need to understand the mapping from the cloud DB to hypervisor resources.
Cloud DB tables: image_store, snapshot_store_ref, volume_store_ref, template_store_ref
[Diagram: SSVM]
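One way to find cleanup candidates is to compare the files present on the secondary storage mount with the paths the cloud DB still references. The sketch below works on two plain lists so it can be tried safely; it assumes the referenced paths were exported from the *_store_ref tables (for example from an install_path style column, which is an assumption here) and are relative to the mount root. The example paths are made up.

```python
from pathlib import PurePosixPath

def find_orphans(files_on_storage, referenced_paths):
    """Return files present on secondary storage that no DB row references."""
    # Normalise paths so that "a//b" and "a/b" compare equal; skip NULL/empty refs.
    referenced = {str(PurePosixPath(p)) for p in referenced_paths if p}
    return sorted(
        f for f in files_on_storage
        if str(PurePosixPath(f)) not in referenced
    )

if __name__ == "__main__":
    on_disk = [  # hypothetical listing of the NFS mount
        "template/tmpl/1/201/a.vhd",
        "snapshots/2/5/b.vhd",
        "volumes/9/c.vhd",
    ]
    in_db = ["template/tmpl/1/201/a.vhd", "snapshots/2/5/b.vhd"]
    for path in find_orphans(on_disk, in_db):
        print("candidate for review:", path)
```

The result is only a list of candidates: verify each one against the DB and the hypervisor before deleting anything.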
33. Secondary Storage - Tsunami
"My storage is full!"
"Storage 99% used - EMERGENCY"
"Storage not found - EMERGENCIA"
"Copy operation not working!"
The volume download fails with "Failed to copy the volume from the source primary storage pool to secondary storage." #1250
Cloudstack fails in migrating root volume from one primary storage to another for a specific VM. #1253
[Diagram: Primary, SSVM, Secondary]
39. Prevent Storage Issues
Periodically check if the storage has enough space.
Keep a check on snapshot chain sizes.
Keep a check on spurious objects that do not have any reference in the Cloudstack DB. Clean up such objects periodically.
File bugs for issues that you encounter.
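The periodic space check can be as small as the sketch below, which could be run from cron against the storage mount point. The 80% threshold, the function names and the path used in the example are assumptions for illustration.

```python
import shutil

def used_percent(total_bytes, used_bytes):
    """Percentage of capacity in use; 0.0 for an empty or pseudo filesystem."""
    if total_bytes == 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes

def check_mount(path, warn_at=80.0):
    """Return (used percentage, True if at or above the warning threshold)."""
    usage = shutil.disk_usage(path)
    pct = used_percent(usage.total, usage.used)
    return pct, pct >= warn_at

if __name__ == "__main__":
    # Replace "/" with the mount point of the primary or secondary storage.
    pct, warn = check_mount("/")
    print(f"{pct:.1f}% used", "WARNING" if warn else "ok")
```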
40. Concluding storage troubleshooting
We looked at how the secondary storage and primary storage are organized in Cloudstack.
We know what each is responsible for.
We checked how the object mapping from the management server to back-end storage works.
We also looked at maintaining storage in case it becomes full.
41. Top 5 Cloudstack areas to understand and troubleshoot
Virtual router - Entangled cables
Storage - Storage nuances
Timeouts - Database & Cloudstack-timers
Service offering - Possibilities for upgrading
Hypervisor - Operations
42. Timeouts
Timeout issues are one broad area covering various components.
Here we will look at some typical timeouts that occur the most in Cloudstack and then look at ways to resolve them.
A timeout is detected by error messages logged in the MS logs here: /var/log/cloudstack/management/management-server.log
"On KVM the virtual router commands are timing out, resulting in MS shutting down the VR!"
"Snapshot failure! In the logs it says that MS tried for ‘wait’ timeout."
"Snapshot completes on Hypervisor after 6 hours, but MS logs show DB timeout"
43. Timeout
Management Server Timeout:
• A timeout can happen due to Cloudstack aborting an operation that is taking longer than expected.
• A timeout can also happen because one of the Cloudstack components failed to respond in time, like the agent or hypervisor.
External Timeouts:
• A timeout can happen due to external timeout conditions like a database connection timeout or ssh script execution timeout.
44. Cloudstack Timeouts
Check the logs and try to determine what kind of timeout it is.
A timeout could point to problems with a Cloudstack subsystem.
If it is a Cloudstack timeout then tweak the corresponding global timeout parameter value and restart Cloudstack.
Typical operations: Snapshot, VM Migration, CopyVolume
45. Cloudstack Timeouts
Operations: Snapshot, Migrate, CopyVolume
wait: Time in seconds to wait for control commands to return (value shown: 3600)
backup.snapshot.wait: Timeout in seconds for BackupSnapshotCommand (value shown: 3600)
migratewait: Time in seconds to wait for VM migration to finish (value shown: 1200)
copy.volume.wait: Timeout in seconds for the copy volume command (value shown: 1200)
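When a long operation fails, it helps to map it to the setting that governs it and compare the expected duration with the configured value. A small sketch follows; the operation labels and function name are hypothetical, and the numbers are the values shown on the slide, which may differ from the defaults of a given release.

```python
# Operation label (hypothetical) -> global setting named on the slide.
TIMEOUT_SETTINGS = {
    "control": "wait",
    "snapshot": "backup.snapshot.wait",
    "migrate": "migratewait",
    "copy_volume": "copy.volume.wait",
}

def would_time_out(operation, expected_seconds, settings):
    """Return (True if the operation would exceed its timeout, governing setting name)."""
    key = TIMEOUT_SETTINGS[operation]
    return expected_seconds > settings[key], key

if __name__ == "__main__":
    current = {  # values as shown on the slide
        "wait": 3600,
        "backup.snapshot.wait": 3600,
        "migratewait": 1200,
        "copy.volume.wait": 1200,
    }
    exceeded, key = would_time_out("snapshot", 6 * 3600, current)
    print(f"snapshot of 6h exceeds {key}: {exceeded}")
```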
46. KVM Agent - timeouts
"On KVM the virtual router commands are timing out, resulting in MS shutting down the VR!"
A generic timeout is configured for ‘almost’ all the agent commands, which sometimes shows weird behaviour.
‘OperationTimedOut’
[Diagram: Cloudstack sends the command to the KVM agent on the KVM host; the agent sends back a response]
47. KVM Agent Timeouts
"We have a virtual router starting on KVM; a few seconds after start it is shut down by Cloudstack."
The timeout values changed in the MS global config are not propagated to KVM agents automatically.
Many times this results in failure to apply config to the VR as the aggregated commands time out.
This timeout setting is governed by the router.aggregation.command.each.timeout global setting.
To make this or any other global setting effective on a KVM agent you need to add it to the /etc/cloudstack/agent/agent.properties file.
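Adding the setting to agent.properties by hand on many hosts is error prone, so a small helper that sets a key exactly once can be useful. This is a sketch that works on the file contents as text; the function name and the value 600 are hypothetical, and it assumes simple key=value lines. The agent most likely needs a restart to pick up the change.

```python
def upsert_property(text, key, value):
    """Return properties text with key set to value, replacing an existing entry or appending."""
    lines = text.splitlines()
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and lines without '=' untouched.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    original = "# KVM agent settings\nworkers=5\n"
    print(upsert_property(original, "router.aggregation.command.each.timeout", 600))
```

To apply it, read /etc/cloudstack/agent/agent.properties, pass the contents through the function and write the result back after taking a backup.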
48. Database - timeouts
[Diagram: Cloudstack, XenServer, xapi, shell]
1. Get a connection from the pool
2. Start a long-winded resource action
3. MySQL expires the connection
4. Even though the action finished properly, Cloudstack rolls back the transaction on completion
Even though the snapshot operation succeeds on XenServer, Cloudstack rolls back the transaction on completion, thereby failing it.
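The four steps above can be modelled in a few lines to reason about when a long action is at risk. This is a deliberately simplified sketch: it assumes the DB connection sits completely idle while the resource action runs and that the server closes it once the idle time exceeds wait_timeout (28800 seconds is a common MySQL default). The function names are hypothetical.

```python
def connection_expired(idle_seconds, wait_timeout=28800):
    """True if MySQL would have closed a connection that was idle this long."""
    return idle_seconds > wait_timeout

def commit_outcome(action_seconds, wait_timeout=28800):
    """Model steps 1 to 4: the pooled connection is idle for the whole action."""
    if connection_expired(action_seconds, wait_timeout):
        return "rolled back: connection closed while the action was running"
    return "committed"

if __name__ == "__main__":
    for hours in (1, 6, 9):
        print(f"{hours}h action ->", commit_outcome(hours * 3600))
```

In this model raising wait_timeout on the MySQL side moves the threshold, which matches the direction of the fix, though real behaviour also depends on connection pool settings.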
49. Database - timeouts
mysql> show variables;
+---------------------+-------+
| Variable_name       | Value |
+---------------------+-------+
| interactive_timeout | 28800 |
| wait_timeout        | 28800 |
+---------------------+-------+
mysql> SET interactive_timeout=36000;
Query OK, 0 rows affected (0.01 sec)
interactive_timeout: the timeout, in seconds, for interactive sessions such as the mysql command line tools.
wait_timeout: the number of seconds of inactivity that MySQL waits before closing a non-interactive connection.
**Fixed in 4.9, where the connection gets refreshed if found dead.
50. Top 5 Cloudstack areas to understand and troubleshoot
Virtual router - Entangled cables
Storage - Storage nuances
Timeouts - Database & Cloudstack-timers
Service offering - Possibilities for upgrading
Agent - Operations
51. Compute and Disk Service Offering - Tags
"How do I migrate a machine service offering from an offering with tags to an offering without? I get this error …"
A service offering defines a set of virtual hardware features that will be assigned to a virtual machine.
[Diagram: Service Offering with CPU, RAM, RATE, TAGS (tag1), Hypervisor, Deployment Planner]
Storage tags: these tags are associated with primary storage.
Host tags: tags associated with hosts.
A service offering also provides a way to measure usage.
52. Compute and Disk Service Offering - Tags
This set of errors has to do with pinning the virtual machine to a particular set of resources.
You may want to do this to provide premium services to high-paying customers.
At some point you or your customer decides that the VM needs to be moved to a different set of resources.
What are the options at this point?
53. Compute and Disk Service Offering - Tags
Tags - confused compatibility!
[Diagram: tags hdd, ssd, blue; three compute offerings, each with Name, Desc, Size, …]
Compute Offering with Host Tags: hdd
Compute Offering with Host Tags: hdd, ssd
Compute Offering with Host Tags: (none)
Post CS 4.7: Change Offering
Pre CS 4.7: Change Offering
Ideally both ways should be allowed!
54. Compute and Disk Service Offering - Tags
How DevOps sees it:
"How do I migrate a machine service offering from an offering with tags to an offering without? I get this error"
How a software engineer sees it:
“if our current compute offering is associated with tags (x,y) then the new compute offering should have tags at least (x,y), then only it can have the same functionality as the current one. Suppose the new offering has the only tag(x), then it is missing the functionality associated with tag(y)….”
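The rule in the quote is a superset test on tags, which can be checked before attempting the change. The sketch below implements that rule as quoted; it is not taken from the Cloudstack code base, the function names are hypothetical, and which direction a given release actually enforces may differ.

```python
def parse_tags(raw):
    """Split a comma-separated tag string, ignoring blanks such as 'stx,,sty'."""
    return {t.strip() for t in (raw or "").split(",") if t.strip()}

def missing_tags(current_raw, new_raw):
    """Tags on the current offering that the new offering does not carry."""
    return parse_tags(current_raw) - parse_tags(new_raw)

def can_change_offering(current_raw, new_raw):
    """True if the new offering has at least the tags of the current one."""
    return not missing_tags(current_raw, new_raw)

if __name__ == "__main__":
    print(can_change_offering("x,y", "x"))        # new offering lacks y
    print(can_change_offering("x,y", "x, y, z"))  # new offering is a superset
    print(sorted(missing_tags("htx, hty", "")))   # moving to an untagged offering
```

Under this rule, moving from a tagged offering to an untagged one always fails, which is exactly the case raised in the support question.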
55. Compute and Disk Service Offering - Tags
MariaDB [cloud]> select id, cpu, speed, ram_size, host_tag, deployment_planner from
service_offering where id=17;
+----+------+-------+----------+----------+--------------------+
| id | cpu | speed | ram_size | host_tag | deployment_planner |
+----+------+-------+----------+----------+--------------------+
| 17 | 2 | 1024 | 512 | htx, hty | FirstFitPlanner |
+----+------+-------+----------+----------+--------------------+
MariaDB [cloud]> select id, name, display_text, tags from disk_offering where id=17;
+----+--------------+-------------------+----------+
| id | name | display_text | tags |
+----+--------------+-------------------+----------+
| 17 | CCC Offering | CCC Demo Offering | stx,,sty |
+----+--------------+-------------------+----------+
Don’t update the tags in the database for existing offerings!
There might be many VMs associated with them.
MariaDB [cloud]> select id, name, state, service_offering_id, disk_offering_id from vm_instance where id=6;
+----+-----------+---------+---------------------+------------------+
| id | name | state | service_offering_id | disk_offering_id |
+----+-----------+---------+---------------------+------------------+
| 6 | vmdiskone | Running | 1 | 6 |
+----+-----------+---------+---------------------+------------------+
Instead create a new service offering and update the
vm_instance table to upgrade the VMs to this new offering.
56. Top 5 Cloudstack areas to understand and troubleshoot
Virtual router - Entangled cables
Storage - Storage nuances
Timeouts - Database & Cloudstack-timers
Service offering - Possibilities for upgrading
Hypervisor - Operations
57. Hypervisor Issues
Some typical issues involving hypervisors:
"KVM agent connection blowing up!"
"XenServer snapshots failing as the chain is too long!"
"VMware worker VMs created for snapshots are not getting cleared up!"
"On KVM the virtual router commands are timing out, resulting in MS shutting down the VR!"
"XenServer has been upgraded but it seems the existing hosts are not instrumented with the changes!"
Cloudstack interacts with these hypervisors using the agent framework.
58. Xenserver
xe sr-list <gives you a list of SRs>
xe sr-scan <refreshes Xen db and coalesces, reclaiming disk space>
vhd-util scan -f -p -a -m /var/run/sr-mount/c31f8a5a-ef12-e573-cee0-02787136601c/ ..
07fe1fe4-074c-434d-8010-823e29c260af.vhd
vhd=f4008be7-514d-4358-aa3f-1d0c1e78f37a.vhd capacity=268435456000 size=263885443584 hidden=1 parent=none
vhd=07fe1fe4-074c-434d-8010-823e29c260af.vhd capacity=268435456000 size=183076900864 hidden=0 parent=f4008be7-514d-4358-aa3f-1d0c1e78f37a.vhd
The XenServer Direct Agent running with the management server uses xapi to communicate with the Xen host. Logs are in /var/log/: SMlog and xensource.log.
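Scan output in the form shown above can be parsed to measure how long a snapshot chain has become. A minimal sketch, assuming the output uses the vhd=, capacity=, size=, hidden= and parent= fields seen on the slide; the function names are hypothetical, and the regular expression tolerates the fields being wrapped over two lines.

```python
import re

# One record per VHD; \s+ also matches line breaks in wrapped output.
RECORD_RE = re.compile(
    r"vhd=(?P<vhd>\S+)\s+capacity=(?P<capacity>\d+)\s+size=(?P<size>\d+)"
    r"\s+hidden=(?P<hidden>[01])\s+parent=(?P<parent>\S+)"
)

def parse_scan(output):
    """Map each VHD name to its parent name ('none' for the chain root)."""
    return {m["vhd"]: m["parent"] for m in RECORD_RE.finditer(output)}

def chain_length(parents, leaf):
    """Count the VHDs from leaf up to the root, guarding against loops."""
    length, current, seen = 0, leaf, set()
    while current in parents and current not in seen:
        seen.add(current)
        length += 1
        current = parents[current]
    return length

if __name__ == "__main__":
    sample = """
    vhd=f4008be7-514d-4358-aa3f-1d0c1e78f37a.vhd capacity=268435456000 size=263885443584
    hidden=1 parent=none
    vhd=07fe1fe4-074c-434d-8010-823e29c260af.vhd capacity=268435456000 size=183076900864
    hidden=0 parent=f4008be7-514d-4358-aa3f-1d0c1e78f37a.vhd
    """
    parents = parse_scan(sample)
    print(chain_length(parents, "07fe1fe4-074c-434d-8010-823e29c260af.vhd"))
```

Tracking this number per volume over time gives an early warning before snapshots start failing because the chain is too long.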
59. KVM
There is a hypervisor (QEMU) and a management library (libvirt).
KVM qemu logs are here: /var/log/libvirt/qemu/<domain>.log
Libvirtd log is at: /var/log/libvirt/libvirtd.log
Agent log is here: /var/log/cloudstack/agent/agent.log
For further debugging and querying you can use the virsh tool:
virsh <command> <domain>
command can be vol-info, dumpxml etc.
domain is the VM name, like s-1-VM
virsh console s-1-VM
virsh dumpxml s-1-VM
60. VMWare
The VMware Direct Agent running with the management server uses vim to communicate with the vCenter host. Check vCenter for details.
67. VR Trivia
1. How does a password-enabled guest VM get its password?
Cloudstack to Xen (management server log):
{"com.cloud.agent.api.routing.SavePasswordCommand":{,"vmIpAddress":"10.1.1.217","vmName":"vmthree","executeInSequence":false,"accessDetails":{"router.name":"r-4-VM","router.guest.ip":"10.1.1.1","router.ip":"169.254.3.116","zone.network.type":"Advanced"},"wait":0}
2017-03-24 05:56:34,854 DEBUG [c.c.h.x.r.CitrixResourceBase] (DirectAgent-383:ctx-24596387) (logid:abb6150f) VR Config file vm_password.json got created in VR, ip 169.254.3.116 with content {"ip_address":"10.1.1.217","password":"YkSS9S","type":"vmpassword"}
Xen to VR (VR log):
2017-03-24 05:57:23,777 merge.py save:72 {u'10.1.1.195': u'saved_password', u'id': u'vmpassword', u'10.1.1.166': u'saved_password', u'10.1.1.217': u'YkSS9S'}
2017-03-24 05:57:24,346 CsHelper.py execute:184 Executing: curl --header "DomU_Request: save_password" "http://10.1.1.1:8080/" -F "ip=10.1.1.217" -F "password=YkSS9S" -F "token=5c10a12a384a27c5a3e92c8a96d5ea40" >/dev/null 2>/dev/null &
On the VR:
root@r-4-VM:~# cat /var/cache/cloud/passwords-10.1.1.1
10.1.1.195=saved_password
10.1.1.217=Ga6CyR
10.1.1.166=saved_password
root@r-4-VM:~#
68. VR Trivia
1. How does a password-enabled guest VM get its password?
On the VM:
wget -t 3 -T 20 -O - --header "DomU_Request: saved_password" $PASSWORD_SERVER_IP:8080
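The wget call above is a plain HTTP GET with one custom header, so it can be reproduced from any guest for debugging. The sketch below builds the same request with the Python standard library; the function names are hypothetical, the header value is the one shown on the slide, and the 20 second timeout mirrors the -T 20 option.

```python
import urllib.request

def build_password_request(server_ip, header_value="saved_password", port=8080):
    """Build the GET request a guest sends to the password server on the VR."""
    url = f"http://{server_ip}:{port}/"
    # urllib normalises the capitalisation of header names; HTTP treats them
    # as case-insensitive.
    return urllib.request.Request(url, headers={"DomU_Request": header_value})

def fetch(server_ip, timeout=20):
    """Send the request and return the response body as text."""
    req = build_password_request(server_ip)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode().strip()

if __name__ == "__main__":
    req = build_password_request("10.1.1.1")
    print(req.get_method(), req.full_url, req.header_items())
    # Call fetch("10.1.1.1") from a guest on that network to send it.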