Snapdrive for UNIX storage wizard create times out
'snapdrive storage wizard create' fails or times out on HP-UX systems when
the nodename does not equal the hostname.
What is SnapDrive for UNIX?
SnapDrive for UNIX is a tool that automates storage provisioning operations. It lets
you manage Snapshot™ copies easily and simplifies data backup, so that you can
recover data that is accidentally deleted or modified. With the Snap Creator
framework integrated, you can also manage database applications, create clones and
more. For more information, please refer to the NetApp documentation.
Issue:
‘Snapdrive storage wizard create’ command on HP-UX fails/times out while trying to
get the configured NetApp storage systems.
A little story behind this issue:
This issue occurred roughly a year and a half ago while I was working at one of
my client sites. It took 2 months of struggle & frustration before the enlightenment
happened. It was a struggle because there was absolutely no indication
of what was broken in the 'snapdrive storage wizard' API.
During this frustrating period of troubleshooting I turned to NetApp support, but
because the issue was unusual and did not have a wide impact, there wasn't much help
available. NetApp said they would have to test it, which could take some time, and
hinted that we might be hitting a bug. Basically, I was told to advise a workaround,
which meant provisioning and managing the LUNs manually.
As so much time had already passed and sheer frustration & pressure were building up,
I decided to keep working on it for a few more days before giving up. Thankfully, it
all got sorted out in time.
Background about the issue:
HP-UX and Solaris systems have separate 'nodename' & 'hostname' concepts, which is
different from the Red Hat Linux systems I have been used to for a number of years.
Error on the front end: [The following screenshot is a simulation of the error on a
Red Hat box]
As you can see, the wizard just sits there and eventually times out, which is not very
helpful.
Error in the SD trace log:
It was the generic nature of the error that caused so much frustration. As you can see,
the error [could not get IP address] is traced in the sd-trace log:
As the error indicated ‘could not get IP address’, we checked host resolution in the
/etc/hosts file and also on the nameserver, and it worked fine. We could telnet/ping and
the name resolved as expected. Hence, we were clueless as to what else could be wrong.
Note: Out of 5 HP-UX servers that had SnapDrive installed, only one complained
during the 'snapdrive storage wizard create...' operation.
The server that complained turned out to have a nodename different from its
hostname. For example purposes only:
HP-UX>uname -n
server
HP-UX>hostname
server01
As you can see nodename is different from the hostname.
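The same comparison can be scripted. The following is a minimal Python sketch, not something SnapDrive itself provides. The helper name names_differ is hypothetical, and it assumes that the nodename is normally the part of the hostname before the first ".". On Linux both values usually come from the same kernel field, so a difference is only expected on systems such as HP-UX that keep them separately.

```python
import os
import socket


def names_differ(nodename: str, hostname: str) -> bool:
    """Return True when the short form of the hostname is not the nodename.

    Hypothetical helper: compares case-insensitively and ignores any
    domain part of the hostname (everything after the first ".").
    """
    short_host = hostname.split(".", 1)[0]
    return nodename.lower() != short_host.lower()


if __name__ == "__main__":
    nodename = os.uname().nodename   # value reported by `uname -n`
    hostname = socket.gethostname()  # value reported by `hostname`
    if names_differ(nodename, hostname):
        print(f"nodename '{nodename}' differs from hostname '{hostname}'")
    else:
        print(f"nodename and hostname agree: '{nodename}'")
```

With the example values above, names_differ("server", "server01") is true, which is the condition that triggered the problem.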
Cause:
The nodename on the HP-UX host was different from the hostname, and the
getaddrinfo() call that 'snapdrive storage wizard' makes to obtain the IP address of
the host fails or times out because the nodename is not recorded either on the
nameserver or in the flat file /etc/hosts. As a result, it fails to obtain the IP address.
Let's try to understand what getaddrinfo() does.
What is getaddrinfo?
The getaddrinfo [RFC 2553 function] provides protocol-independent translation from
an ANSI host name to an address.
[The RFC 2553 functions getaddrinfo and getnameinfo provide an abstracted way to
convert between a pair of host name/service name and socket addresses, or vice versa.
getaddrinfo converts names into a set of arguments to pass to the socket() and
connect() syscalls, and getnameinfo converts a socket address back into its host
name/service name pair.]
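To make the description above concrete, here is a small Python sketch that uses the standard library wrapper socket.getaddrinfo, which calls the same C routine underneath. The helper name resolve is hypothetical, and this is only an illustration of the call, not how SnapDrive invokes it.

```python
import socket


def resolve(host: str, port: int, flags: int = 0):
    """Return (family, sockaddr) pairs suitable for socket() and connect().

    Hypothetical helper around socket.getaddrinfo; restricts results to
    stream (TCP) sockets to keep the output short.
    """
    results = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM, flags=flags)
    return [(family, sockaddr) for family, _type, _proto, _canon, sockaddr in results]


if __name__ == "__main__":
    # AI_NUMERICHOST skips name lookup entirely, so this works offline.
    for family, sockaddr in resolve("127.0.0.1", 80, socket.AI_NUMERICHOST):
        print(family, sockaddr)
```

Passing a name that is listed neither in DNS nor in /etc/hosts makes the call raise socket.gaierror, which is the Python counterpart of the failure described in this article.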
How it works: the getaddrinfo() library routines internally call the name service switch
to access the ipnodes database lookup policy configured in the /etc/nsswitch.conf file.
Domain Name Server Operation: [BIND/DNS]
If the local system is configured to use the BIND/DNS name server for name/address
resolution, getaddrinfo() retrieves the host information from the name server. If the
name server is not available, or if the hostname is not in its database, the lookup
then moves to the flat file /etc/hosts, subject to the lookup policy in
/etc/nsswitch.conf.
Nonserver Operation: [/etc/hosts]
During name/address resolution, if the database is configured for flat-file resolution,
getaddrinfo() uses /etc/hosts. It searches the /etc/hosts file sequentially until a
host name (official name or alias) matching the name parameter is found or until the
end of the file is reached. Host names are matched regardless of upper or lower
case.
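The flat-file search described above can be sketched in a few lines of Python. This is an illustration of the described behaviour (sequential scan, official name or alias, case-insensitive match), not the actual libc implementation. The function name lookup_in_hosts and the sample addresses are hypothetical.

```python
def lookup_in_hosts(hosts_text: str, name: str):
    """Return the address of the first entry whose name or alias matches.

    Scans the text line by line, ignores comments and blank lines, and
    compares names without regard to case. Returns None if nothing matches.
    """
    wanted = name.lower()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comment
        if len(fields) < 2:
            continue  # blank line, comment, or address without names
        address, names = fields[0], fields[1:]
        if wanted in (n.lower() for n in names):
            return address
    return None


if __name__ == "__main__":
    sample = (
        "127.0.0.1   localhost\n"
        "192.0.2.10  server01.example.com SERVER01  # hostname only\n"
    )
    print(lookup_in_hosts(sample, "server01"))  # hostname is listed
    print(lookup_in_hosts(sample, "server"))    # nodename is not listed
```

In the sample file the hostname is found but the nodename 'server' is not, which mirrors the situation on the failing HP-UX host.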
Reason for failure:
The nodename, which was different from the hostname, was not listed anywhere, neither
in DNS nor in the /etc/hosts file. Hence, getaddrinfo() could not get the IP address.
Resolution:
If for some reason you must have a hostname that differs from the nodename, you
can set both in '/etc/rc.config.d/netconf', for example:
HOSTNAME=longnetworkname
NODENAME=shortname
Then declare both names in the /etc/hosts file, so that both values resolve to the host's
IP address.
Simply put: you can just add the nodename & IP entry to the /etc/hosts file. That’s it!
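Before editing /etc/hosts by hand, it can help to check which of the two names is actually missing. The sketch below is a hypothetical Python helper (missing_hosts_lines is not part of any tool) that takes the file contents, the host's IP address and the names to check, and returns the lines that would need to be added. The IP address used is a documentation placeholder.

```python
def missing_hosts_lines(hosts_text: str, ip: str, names):
    """Return '/etc/hosts'-style lines for names not yet mapped to `ip`.

    Only entries whose address field equals `ip` exactly are considered;
    names are compared without regard to case.
    """
    known = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comment
        if fields and fields[0] == ip:
            known.update(f.lower() for f in fields[1:])
    return [f"{ip}\t{name}" for name in names if name.lower() not in known]


if __name__ == "__main__":
    current = "127.0.0.1 localhost\n192.0.2.10 longnetworkname\n"
    for line in missing_hosts_lines(current, "192.0.2.10",
                                    ["longnetworkname", "shortname"]):
        print(line)
```

Here only the nodename entry is reported as missing. The helper does not write to the file; appending the line is left to the administrator.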
On HP-UX systems:
Normally, when a system boots, the /sbin/init.d/hostname script runs the uname -S
command to set the uname value equal to the root of the HOSTNAME variable
(everything before the first ".", in case a domain name is specified) defined in
/etc/rc.config.d/netconf. In most cases the nodename & hostname are the same, and
hence hostname/nodename to IP translation works well. But there are situations when
you have no choice but to have a hostname different from the nodename, because of
the 8-character restriction on the nodename length.
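The rule above (take everything before the first "." and keep it within 8 characters) can be expressed as a quick check. This Python sketch only models the rule as described in this article. The function names are hypothetical, and the 8-character limit is taken from the text rather than verified against a particular HP-UX release.

```python
MAX_NODENAME_LEN = 8  # limit quoted in the article, assumed here


def derive_nodename(hostname: str) -> str:
    """Return the part of HOSTNAME before the first '.'."""
    return hostname.split(".", 1)[0]


def fits_nodename_limit(hostname: str) -> bool:
    """Return True if the derived nodename is within the length limit."""
    return len(derive_nodename(hostname)) <= MAX_NODENAME_LEN


if __name__ == "__main__":
    for candidate in ("server01.example.com", "longnetworkname"):
        nodename = derive_nodename(candidate)
        status = "ok" if fits_nodename_limit(candidate) else "too long"
        print(f"{candidate}: derived nodename '{nodename}' is {status}")
```

A hostname such as 'longnetworkname' fails the check, which is the case where a separate, shorter NODENAME has to be set and must then resolve as well.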
How to replicate this issue:
You can test this scenario on any *nix system [I believe]; an HP-UX system is not
strictly necessary. I am going to test this on Red Hat EL 5.
Steps:
1. Make sure everything is working as it should be. In this test, my hostname is
‘redhat’ and it resolves to an IP address without any issues.
2. Run the wizard; it should work fine.
3. Now, let’s break it. Let’s alter the hostname from redhat to redhat_01. Basically, by
changing the hostname, we are trying to simulate a scenario where the nodename &
hostname are not the same.
4. Run the wizard again; this time it should fail as expected.
5. You can also trace the error in the SnapDrive trace log [sd-trace.log].
6. Let’s fix this issue by simply adding an entry for the modified hostname [redhat_01]
to the /etc/hosts file.
7. Run the wizard; we should be back in business now.
Observation
How could we have avoided this issue?
Well, I am not a developer, so I really don’t know how more informative error
reporting could be provided around the getaddrinfo() calls that SnapDrive makes.
I am guessing SnapDrive uses the standard POSIX getaddrinfo() call to do the
nodename translation, so I guess it has to report errors as per the getaddrinfo() API.
Could the SnapDrive for UNIX developers have inserted some extra logic to report a
more informative error? I leave this to NetApp.
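Purely as an illustration of what such extra logic might look like, here is a Python sketch that wraps the lookup and adds the name being resolved, plus a hint when the nodename and hostname differ. This is my own guess at a friendlier message, not SnapDrive code. The function name resolve_or_explain and its resolver parameter are hypothetical.

```python
import socket


def resolve_or_explain(name, nodename, hostname, resolver=socket.getaddrinfo):
    """Try to resolve `name`; return None on success or a message on failure.

    The message states which name could not be resolved and, when the
    nodename and hostname differ, suggests checking both of them.
    """
    try:
        resolver(name, None)
        return None
    except socket.gaierror as exc:
        hint = ""
        if nodename != hostname:
            hint = (f" (nodename '{nodename}' differs from hostname "
                    f"'{hostname}'; check that both resolve)")
        return f"could not get IP address for '{name}': {exc}{hint}"


if __name__ == "__main__":
    # "server.invalid" uses the reserved .invalid TLD, so it should not resolve.
    message = resolve_or_explain("server.invalid", "server", "server01")
    print(message or "resolved")
```

A message of this kind would have pointed straight at the nodename instead of leaving only a generic 'could not get IP address' in the trace log.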
This issue has been turned into a KB; you can access the KB through this link:
https://kb.netapp.com/support/index?page=content&id=2017582
Oct, 2014
ashwinwriter@gmail.com