Chapter 8: Troubleshooting
Problems with Serviceguard may be of several types. The following is a list of common categories of problem:
The first two categories of problems result from incorrect configuration of Serviceguard. The last category contains “normal” failures, to which Serviceguard is designed to react in order to ensure the availability of your applications.
If you are having trouble starting Serviceguard, it is possible that someone has accidentally deleted, modified, or placed files in the directory that is reserved for Serviceguard use only:
Many Serviceguard commands, including cmviewcl, depend on name resolution services to look up the addresses of cluster nodes. When name services are not available (for example, if a name server is down), Serviceguard commands may hang or may return a network-related error message. If this happens, use the nslookup command on each cluster node to see whether name resolution is correct. For example:
nslookup ftsys9
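Healthy output looks something like the following (the name server and addresses shown here are hypothetical):
Name Server:  dns1.example.com
Address:  192.168.1.10

Name:    ftsys9.example.com
Address: 192.168.1.9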
If the output of this command does not include the correct IP address of the node, then check your name resolution services further.
Cluster re-formations may occur from time to time due to current cluster conditions. Some of the causes are as follows:
In these cases, applications continue running, though they might experience a small performance impact during cluster re-formation.
There are a number of errors you can make when configuring Serviceguard that will not show up when you start the cluster. Your cluster can be running, and everything can appear to be fine, until there is a hardware or software failure and control of your packages is not transferred to another node as you expected.
These problems are caused specifically by errors in the cluster configuration file and package configuration scripts. Examples of these errors include:
You can use the following commands to check the status of your disks:
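(The following are representative HP-UX and VxVM examples; the device file name is hypothetical.)
bdf                          # report mounted file systems and space usage
ioscan -fnC disk             # list the disk devices the system has claimed
diskinfo /dev/rdsk/c0t5d0    # query the characteristics of a specific disk
vxdg list                    # show the VxVM disk groups imported on this node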
When a RUN_SCRIPT_TIMEOUT or HALT_SCRIPT_TIMEOUT value is set, and the control script hangs, causing the timeout to be exceeded, Serviceguard kills the script and marks the package “Halted.” Similarly, when a package control script fails, Serviceguard kills the script and marks the package “Halted.” In both cases, the following also take place:
Following such a failure, since the control script is terminated, some of the package's resources may be left activated. Specifically:
In this kind of situation, Serviceguard will not restart the package without manual intervention. You must clean up manually before restarting the package. Use the following steps as guidelines:
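The exact steps depend on the package, but a typical cleanup sequence looks something like the following (the package, node, volume group, and mount point names are hypothetical):
# stop any application processes the failed script left running, then:
umount /mnt/pkg1data          # unmount any file systems the package left mounted
vgchange -a n vg_pkg1         # deactivate the package's LVM volume group
cmmodpkg -e pkg1              # re-enable switching for the package
cmrunpkg -n ftsys10 pkg1      # restart the package on an available node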
The default Serviceguard control scripts are designed to take the straightforward steps needed to get an application running or stopped. If the package administrator specifies a time limit within which these steps need to occur and that limit is subsequently exceeded for any reason, Serviceguard takes the conservative approach that the control script logic must either be hung or defective in some way. At that point the control script cannot be trusted to perform cleanup actions correctly, thus the script is terminated and the package administrator is given the opportunity to assess what cleanup steps must be taken.
If you want the package to switch automatically in the event of a control script timeout, set the node_fail_fast_enabled parameter to yes. In this case, Serviceguard will cause the node where the control script timed out to halt (system reset). This effectively cleans up any side effects of the package's run or halt attempt. In this case the package will be automatically restarted on any available alternate node for which it is configured.
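In a modular package configuration file this is a single line, for example (legacy package ASCII files use the equivalent NODE_FAIL_FAST_ENABLED YES):
node_fail_fast_enabled    yes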
If you have a system multi-node package for Veritas CFS, you may not be able to start the cluster until SG-CFS-pkg starts. Check SG-CFS-pkg.log for errors.
You will have trouble running the cluster if there is a discrepancy between the CFS cluster and the Serviceguard cluster. To check, enter the gabconfig -a command.
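Representative output for a healthy two-node cluster looks something like the following (the generation numbers are illustrative):
GAB Port Memberships
===============================================================
Port a gen   a36e0003 membership 01
Port b gen   a36e0006 membership 01
Port h gen   fd570002 membership 01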
The ports that must be up are:
Use any form of the mount command (for example, mount -o cluster, dbed_chkptmount, or sfrac_chkptmount) other than cfsmount or cfsumount with caution in an HP Serviceguard Storage Management Suite environment with CFS. These non-CFS commands could cause conflicts with subsequent command operations on the file system or on Serviceguard packages. Using these other forms of mount does not create an appropriate multi-node package, which means that the cluster packages are not aware of the file system changes.
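For example, to mount a cluster file system through the CFS framework (the mount point name is hypothetical):
cfsmount /mnt/cfsdata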
Also check the syslog file for information.
This section describes some approaches to solving problems that may occur with VxVM disk groups in a cluster environment. For most problems, it is helpful to use the vxdg list command to display the disk groups currently imported on a specific node. Also, you should consult the package control script log files for messages associated with importing and deporting disk groups on particular nodes.
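For example, vxdg list output resembles the following (the group names and IDs are illustrative):
NAME         STATE           ID
rootdg       enabled         971995699.1025.ftsys9
dg_01        enabled         972078742.1084.ftsys9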
After certain failures, packages configured with VxVM disk groups will fail to start, logging an error such as the following in the package log file:
This can happen if a package is running on a node that then fails before the package control script can deport the disk group. In this case, the host name of the failed node is still written in the disk group header. When the package starts up on another node in the cluster, a series of messages is printed in the package log file.
Follow the instructions in the messages to use the force import option (-C) to allow the current node to import the disk group. Then deport the disk group, after which it can be used again by the package. Example:
vxdg -tfC import dg_01
vxdg deport dg_01
The force import will clear the host name currently written on the disks in the disk group, after which you can deport the disk group without error so it can then be imported by a package running on a different node.
These errors are similar to the system administration errors, except that they are caused specifically by errors in the package control script. The best way to prevent them is to test your package control script before putting your high availability application online.
Adding a set -x statement in the second line of your control script will cause additional details to be logged into the package log file, which can give you more information about where your script may be failing.
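For example (a sketch of the top of a control script, assuming a POSIX shell script):
#!/usr/bin/sh
set -x    # trace each command into the package log file
# ... remainder of the control script ...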
These failures cause Serviceguard to transfer control of a package to another node. This is the normal action of Serviceguard, but you must be able to recognize when a transfer has taken place and decide whether to leave the cluster in its current condition or restore it to its original condition.
Possible node failures can be caused by the following conditions:
In the event of a TOC (Transfer of Control), a system dump is performed on the failed node, and numerous messages are also displayed on the console.
You can use the following commands to check the status of your network and subnets:
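(Representative HP-UX examples.)
netstat -in    # show interface status, addresses, and packet counts
lanscan        # list LAN interfaces and their hardware states
arp -a         # display the ARP cache entries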
Since your cluster is unique, there are no cookbook solutions to all possible problems. But if you apply these checks and commands and work your way through the log files, you should be able to identify and solve most problems.
The following kind of message in a Serviceguard node’s syslog file or in the output of cmviewcl -v may indicate an authorization problem:
Access denied to quorum server 18.104.22.168
The reason may be that you have not updated the authorization file. Verify that the node is included in the file, and try using /usr/lbin/qs -update to re-read the quorum server authorization file.
The following kinds of message in a Serviceguard node’s syslog file may indicate timeout problems:
Unable to set client version at quorum server 22.214.171.124: reply timed out
These messages could indicate an intermittent network problem, or the default quorum server timeout may not be sufficient. You can set QS_TIMEOUT_EXTENSION to increase the timeout, or you can increase the heartbeat or node timeout value.
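QS_TIMEOUT_EXTENSION is set in the cluster configuration file; for example (the host name is hypothetical, and the value, in microseconds, is illustrative):
QS_HOST                 qs-host.example.com
QS_TIMEOUT_EXTENSION    2000000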
The following kind of message in a Serviceguard node’s syslog file indicates that the node did not receive a reply to its lock request in time. This could be caused by a delay in communication between the node and the quorum server, or between the quorum server and other nodes in the cluster:
The coordinator node in Serviceguard sometimes sends a request to the quorum server to set the lock state. (This is different from a request to obtain the lock in tie-breaking.) If the quorum server’s connection to one of the cluster nodes has not completed, the request to set may fail with a two-line message like the following in the quorum server’s log file:
This condition can be ignored. The request will be retried a few seconds later and will succeed. The following message is logged: