Chapter 8: Troubleshooting
Problems with Serviceguard may be of several types. The following is a list of common categories of problem:
The first two categories of problems result from incorrect configuration of Serviceguard. The last category contains “normal” failures, to which Serviceguard is designed to react in order to ensure the availability of your applications.
If you are having trouble starting Serviceguard, it is possible that someone has accidentally deleted, modified, or placed files in the directory that is reserved for Serviceguard use only:
Many Serviceguard commands, including cmviewcl, depend on name resolution services to look up the addresses of cluster nodes. When name services are not available (for example, if a name server is down), Serviceguard commands may hang or may return a network-related error message. If this happens, use the nslookup command on each cluster node to see whether name resolution is correct. For example:
nslookup ftsys9
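Healthy output looks something like the following (the name server and addresses shown here are hypothetical):
Name Server:  dns1.example.com
Address:  192.168.1.10

Name:    ftsys9.example.com
Address: 192.168.1.9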
If the output of this command does not include the correct IP address of the node, then check your name resolution services further.
Cluster re-formations may occur from time to time due to current cluster conditions. Some of the causes are as follows:
In these cases, applications continue running, though they might experience a small performance impact during cluster re-formation.
There are a number of errors you can make when configuring Serviceguard that will not show up when you start the cluster. Your cluster can be running, and everything can appear to be fine, until there is a hardware or software failure and control of your packages is not transferred to another node as you expected.
These problems are caused specifically by errors in the cluster configuration file and package configuration scripts. Examples of these errors include:
You can use the following commands to check the status of your disks:
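(The following are representative HP-UX and VxVM examples; the device file name is hypothetical.)
bdf                          # report mounted file systems and space usage
ioscan -fnC disk             # list the disk devices the system has claimed
diskinfo /dev/rdsk/c0t5d0    # query the characteristics of a specific disk
vxdg list                    # show the VxVM disk groups imported on this node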
When a RUN_SCRIPT_TIMEOUT or HALT_SCRIPT_TIMEOUT value is set, and the control script hangs, causing the timeout to be exceeded, Serviceguard kills the script and marks the package “Halted.” Similarly, when a package control script fails, Serviceguard kills the script and marks the package “Halted.” In both cases, the following also take place:
Following such a failure, since the control script is terminated, some of the package's resources may be left activated. Specifically:
In this kind of situation, Serviceguard will not restart the package without manual intervention. You must clean up manually before restarting the package. Use the following steps as guidelines:
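The exact steps depend on the package, but a typical cleanup sequence looks something like the following (the package, node, volume group, and mount point names are hypothetical):
# stop any application processes the failed script left running, then:
umount /mnt/pkg1data          # unmount any file systems the package left mounted
vgchange -a n vg_pkg1         # deactivate the package's LVM volume group
cmmodpkg -e pkg1              # re-enable switching for the package
cmrunpkg -n ftsys10 pkg1      # restart the package on an available node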
The default Serviceguard control scripts are designed to take the straightforward steps needed to get an application running or stopped. If the package administrator specifies a time limit within which these steps need to occur and that limit is subsequently exceeded for any reason, Serviceguard takes the conservative approach that the control script logic must either be hung or defective in some way. At that point the control script cannot be trusted to perform cleanup actions correctly, thus the script is terminated and the package administrator is given the opportunity to assess what cleanup steps must be taken.
If you want the package to switch automatically in the event of a control script timeout, set the node_fail_fast_enabled parameter to yes. In this case, Serviceguard will cause the node where the control script timed out to halt (system reset). This effectively cleans up any side effects of the package's run or halt attempt. In this case the package will be automatically restarted on any available alternate node for which it is configured.
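In a modular package configuration file this is a single line, for example (legacy package ASCII files use the equivalent NODE_FAIL_FAST_ENABLED YES):
node_fail_fast_enabled    yes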
If you have a system multi-node package for Veritas CFS, you may not be able to start the cluster until SG-CFS-pkg starts. Check SG-CFS-pkg.log for errors.
You will have trouble running the cluster if there is a discrepancy between the CFS cluster and the Serviceguard cluster. To check, enter the gabconfig -a command.
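Representative output for a healthy two-node cluster looks something like the following (the generation numbers are illustrative):
GAB Port Memberships
===============================================================
Port a gen   a36e0003 membership 01
Port b gen   a36e0006 membership 01
Port h gen   fd570002 membership 01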
The ports that must be up are:
Use any form of the mount command (for example, mount -o cluster, dbed_chkptmount, or sfrac_chkptmount) other than cfsmount or cfsumount with caution in an HP Serviceguard Storage Management Suite environment with CFS. These non-CFS commands could cause conflicts with subsequent command operations on the file system or on Serviceguard packages. Using these other forms of mount does not create an appropriate multi-node package, which means that the cluster packages are not aware of the file system changes.
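For example, to mount a cluster file system through the CFS framework (the mount point name is hypothetical):
cfsmount /mnt/cfsdata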
Also check the syslog file for information.
This section describes some approaches to solving problems that may occur with VxVM disk groups in a cluster environment. For most problems, it is helpful to use the vxdg list command to display the disk groups currently imported on a specific node. Also, you should consult the package control script log files for messages associated with importing and deporting disk groups on particular nodes.
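For example, vxdg list output resembles the following (the group names and IDs are illustrative):
NAME         STATE           ID
rootdg       enabled         971995699.1025.ftsys9
dg_01        enabled         972078742.1084.ftsys9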
After certain failures, packages configured with VxVM disk groups will fail to start, logging an error such as the following in the package log file:
This can happen if a package is running on a node that then fails before the package control script can deport the disk group. In this case, the host name of the failed node is still written in the disk group header. When the package starts up on another node in the cluster, a series of messages is printed in the package log file.
Follow the instructions in the messages to use the force import option (-C) to allow the current node to import the disk group. Then deport the disk group, after which it can be used again by the package. Example:
vxdg -tfC import dg_01
vxdg deport dg_01
The force import will clear the host name currently written on the disks in the disk group, after which you can deport the disk group without error so it can then be imported by a package running on a different node.
These errors are similar to the system administration errors, except that they are caused specifically by errors in the package control script. The best way to prevent them is to test your package control script before putting your high availability application online.
Adding a set -x statement in the second line of your control script will cause additional details to be logged into the package log file, which can give you more information about where your script may be failing.
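For example (a sketch of the top of a control script, assuming a POSIX shell script):
#!/usr/bin/sh
set -x    # trace each command into the package log file
# ... remainder of the control script ...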
These failures cause Serviceguard to transfer control of a package to another node. This is the normal action of Serviceguard, but you must be able to recognize when a transfer has taken place and decide whether to leave the cluster in its current condition or restore it to its original condition.
Possible node failures can be caused by the following conditions:
In the event of a TOC (Transfer of Control), a system dump is performed on the failed node, and numerous messages are also displayed on the console.
You can use the following commands to check the status of your network and subnets:
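(Representative HP-UX examples.)
netstat -in    # show interface status, addresses, and packet counts
lanscan        # list LAN interfaces and their hardware states
arp -a         # display the ARP cache entries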
Since your cluster is unique, there are no cookbook solutions to all possible problems. But if you apply these checks and commands and work your way through the log files, you should be able to identify and solve most problems.
The following kind of message in a Serviceguard node’s syslog file or in the output of cmviewcl -v may indicate an authorization problem:
Access denied to quorum server 18.104.22.168
The reason may be that you have not updated the authorization file. Verify that the node is included in the file, and try using /usr/lbin/qs -update to re-read the quorum server authorization file.
The following kinds of message in a Serviceguard node’s syslog file may indicate timeout problems:
Unable to set client version at quorum server 22.214.171.124: reply timed out
These messages could indicate an intermittent network problem, or the default quorum server timeout may not be sufficient. You can set QS_TIMEOUT_EXTENSION to increase the timeout, or you can increase the heartbeat or node timeout value.
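QS_TIMEOUT_EXTENSION is set in the cluster configuration file; for example (the host name is hypothetical, and the value, in microseconds, is illustrative):
QS_HOST                 qs-host.example.com
QS_TIMEOUT_EXTENSION    2000000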
The following kind of message in a Serviceguard node’s syslog file indicates that the node did not receive a reply to its lock request in time. This could be caused by a delay in communication between the node and the quorum server, or between the quorum server and other nodes in the cluster:
The coordinator node in Serviceguard sometimes sends a request to the quorum server to set the lock state. (This is different from a request to obtain the lock in tie-breaking.) If the quorum server’s connection to one of the cluster nodes has not completed, the request to set may fail with a two-line message like the following in the quorum server’s log file:
This condition can be ignored. The request will be retried a few seconds later and will succeed. The following message is logged: