Responses to Failures

Serviceguard responds to different kinds of failures in specific ways. For most hardware failures, the response is not user-configurable, but for package and service failures, you can choose the system’s response, within limits.

System Reset When a Node Fails

The most dramatic response to a failure in a Serviceguard cluster is an HP-UX TOC or INIT, which is a system reset without a graceful shutdown (normally referred to in this manual simply as a system reset). This allows packages to move quickly to another node, protecting the integrity of the data.

A system reset occurs if a cluster node cannot communicate with the majority of cluster members for the predetermined time, or under other circumstances such as a kernel hang or failure of the cluster daemon (cmcld).

The node timeout case is covered in more detail under “What Happens when a Node Times Out”. See also “Cluster Daemon: cmcld”.

A system reset is also initiated by Serviceguard itself under specific circumstances; see “Responses to Package and Service Failures”.

What Happens when a Node Times Out

Each node sends a heartbeat message to the cluster coordinator every HEARTBEAT_INTERVAL microseconds (as specified in the cluster configuration file). The cluster coordinator looks for this message from each node, and if it does not receive it within NODE_TIMEOUT microseconds, the cluster is reformed without the node that is no longer sending heartbeat messages. (See the HEARTBEAT_INTERVAL and NODE_TIMEOUT entries under “Cluster Configuration Parameters” for advice about configuring these parameters.)

On a node that is not the cluster coordinator, and on which a node timeout occurs (that is, no heartbeat message has arrived within NODE_TIMEOUT microseconds), the following sequence of events occurs:

The node tries to reform the cluster.
If the node cannot get a quorum (that is, if it cannot get the cluster lock), the node halts (system reset).
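
The sequence above can be sketched in a few lines of Python. This is only an illustrative model, not Serviceguard code: the function names, the Action values, and the idea of passing the quorum result in as a boolean are all assumptions made for the example. Times are in microseconds, matching NODE_TIMEOUT.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for what a non-coordinator node does next."""
    STAY_IN_CLUSTER = "stay in cluster"
    REFORM_CLUSTER = "reform cluster"
    SYSTEM_RESET = "system reset"


def heartbeat_timed_out(last_heartbeat_us, now_us, node_timeout_us):
    """Return True when no heartbeat has arrived within node_timeout_us."""
    return (now_us - last_heartbeat_us) > node_timeout_us


def node_response(last_heartbeat_us, now_us, node_timeout_us, got_quorum):
    """Model the documented sequence for a node that sees a node timeout.

    got_quorum stands in for the result of the reformation attempt
    (for example, whether the node obtained the cluster lock).
    """
    if not heartbeat_timed_out(last_heartbeat_us, now_us, node_timeout_us):
        return Action.STAY_IN_CLUSTER
    # A timeout occurred: the node tries to reform the cluster.
    if got_quorum:
        return Action.REFORM_CLUSTER
    # No quorum: the node halts with a system reset.
    return Action.SYSTEM_RESET
```

For instance, with a hypothetical timeout of 2000000 microseconds, a node whose last heartbeat arrived 3000000 microseconds ago and which fails to get a quorum would be modeled as Action.SYSTEM_RESET.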

Example

Situation. Assume a two-node cluster, with Package1 running on SystemA and Package2 running on SystemB. Volume group vg01 is exclusively activated on SystemA; volume group vg02 is exclusively activated on SystemB. Package IP addresses are assigned to SystemA and SystemB respectively.

Failure. Only one LAN has been configured for both heartbeat and data traffic. During the course of operations, heavy application traffic monopolizes the bandwidth of the network, preventing heartbeat packets from getting through.

Since SystemA does not receive heartbeat messages from SystemB, SystemA attempts to reform as a one-node cluster. Likewise, since SystemB does not receive heartbeat messages from SystemA, SystemB also attempts to reform as a one-node cluster. During the election protocol, each node votes for itself, giving both nodes 50 percent of the vote. Because both nodes have 50 percent of the vote, both nodes now vie for the cluster lock. Only one node will get the lock.

Outcome. Assume SystemA gets the cluster lock. SystemA reforms as a one-node cluster. After reformation, SystemA will make sure all applications configured to run on an existing clustered node are running. When SystemA discovers that Package2 is not running in the cluster, it will try to start Package2 if Package2 is configured to run on SystemA.

SystemB recognizes that it has failed to get the cluster lock and so cannot reform the cluster. To release all resources related to Package2 (such as exclusive access to volume group vg02 and the Package2 IP address) as quickly as possible, SystemB halts (system reset).
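
The tie-break in this example can be modeled with a short Python sketch. It is a simplification under stated assumptions: the function name and the outcome strings are hypothetical, a group holding more than half of the nodes is assumed to reform without needing the lock, and a group holding less than half is assumed to reset. Only the exact 50 percent case is decided by the cluster lock, as in the example.

```python
def resolve_partition(partitions, total_nodes, lock_winner):
    """Decide the outcome for each group of nodes after a network partition.

    partitions  -- dict mapping a group name to the number of nodes in it
    total_nodes -- number of nodes in the cluster before the failure
    lock_winner -- name of the group that obtains the cluster lock
                   (only consulted when a group holds exactly half the votes)
    """
    outcomes = {}
    for name, size in partitions.items():
        if 2 * size > total_nodes:
            # Clear majority: assumed to reform without the lock.
            outcomes[name] = "reform"
        elif 2 * size == total_nodes and name == lock_winner:
            # Exactly 50 percent of the vote: the lock holder reforms.
            outcomes[name] = "reform"
        else:
            # Lost the lock, or a minority: the node halts (system reset).
            outcomes[name] = "system reset"
    return outcomes


# The two-node example from the text, with SystemA winning the lock.
result = resolve_partition({"SystemA": 1, "SystemB": 1}, 2, "SystemA")
```

Here result maps SystemA to "reform" and SystemB to "system reset", which matches the outcome described above.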




	NOTE: If `AUTOSTART_CMCLD` in `/etc/rc.config.d/cmcluster` (`$SGAUTOSTART`) is set to zero, the node will not attempt to join the cluster when it comes back up.

For more information on cluster failover, see the white paper Optimizing Failover Time in a Serviceguard Environment at http://www.docs.hp.com -> High Availability -> Serviceguard -> White Papers.

Responses to Hardware Failures

If a serious system problem occurs, such as a system panic or physical disruption of the SPU's circuits, Serviceguard recognizes a node failure and transfers the failover packages currently running on that node to an adoptive node elsewhere in the cluster. (System multi-node and multi-node packages do not fail over.)

The new location for each failover package is determined by that package's configuration file, which lists primary and alternate nodes for the package. Transfer of a package to another node does not transfer the program counter. Processes in a transferred package will restart from the beginning. In order for an application to be swiftly restarted after a failure, it must be “crash-tolerant”; that is, all processes in the package must be written so that they can detect such a restart. This is the same application design required for restart after a normal system crash.
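
One common way for a process to detect that it is restarting after a crash is a marker file that is created at startup and removed only on clean shutdown. The Python sketch below illustrates that idea; it is not a Serviceguard mechanism, the function names are hypothetical, and the marker path is assumed to be on storage that moves with the package so that the adoptive node can see it.

```python
import os


def start_with_crash_detection(marker_path):
    """Record that the application is running.

    Returns True if a marker from a previous run is still present,
    which suggests the previous run ended without a clean shutdown.
    """
    unclean_previous_run = os.path.exists(marker_path)
    with open(marker_path, "w") as marker:
        marker.write(str(os.getpid()))
    return unclean_previous_run


def clean_shutdown(marker_path):
    """Remove the marker so the next start is treated as a normal start."""
    if os.path.exists(marker_path):
        os.remove(marker_path)
```

An application would call start_with_crash_detection() first and, if it returns True, run its own recovery steps (such as rolling back incomplete work) before accepting new requests.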

In the event of a LAN interface failure, a local switch is done to a standby LAN interface if one exists. If a heartbeat LAN interface fails and no standby or redundant heartbeat is configured, the node fails with a system reset. If a monitored data LAN interface fails without a standby, the node fails with a system reset only if node_fail_fast_enabled (described further in “Package Configuration Planning”) is set to YES for the package. Otherwise any packages using that LAN interface will be halted and moved to another node if possible (unless the LAN recovers immediately; see “When a Service, Subnet, or Monitored Resource Fails, or a Dependency is Not Met”).
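
The LAN failure rules in the preceding paragraph can be summarized as a decision function. The Python sketch below is a simplified model: the function and argument names are hypothetical, heartbeat and data interfaces are treated as separate cases, the immediate-recovery exception is ignored, and the result for a failed heartbeat interface with a redundant heartbeat configured is an inference, since the text states only the opposite case.

```python
def lan_failure_response(is_heartbeat_lan, has_standby,
                         has_redundant_heartbeat=False,
                         node_fail_fast_enabled=False):
    """Return a short description of the response to a LAN interface failure."""
    if has_standby:
        return "local switch to standby LAN interface"
    if is_heartbeat_lan:
        if has_redundant_heartbeat:
            # Inferred: the text only says a reset happens when neither a
            # standby nor a redundant heartbeat is configured.
            return "no system reset (heartbeat continues on another network)"
        return "system reset"
    # Monitored data LAN without a standby.
    if node_fail_fast_enabled:
        return "system reset"
    return "halt packages using the interface and move them if possible"
```

For example, a failed data interface with no standby and node_fail_fast_enabled left at its default is modeled as halting the affected packages rather than resetting the node.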

Disk protection is provided by separate products, such as Mirrordisk/UX in LVM or Veritas mirroring in VxVM and related products. In addition, separately available EMS disk monitors allow you to notify operations personnel when a specific failure, such as a lock disk failure, takes place. Refer to the manual Using High Availability Monitors (HP part number B5736-90074) for additional information; you can find it at http://docs.hp.com -> High Availability -> Event Monitoring Service and HA Monitors -> Installation and User’s Guide.

Serviceguard does not respond directly to power failures, although a loss of power to an individual cluster component may appear to Serviceguard like the failure of that component, and will result in the appropriate switching behavior. Power protection is provided by HP-supported uninterruptible power supplies (UPS), such as HP PowerTrust.

Responses to Package and Service Failures

In the default case, the failure of a failover package, or of a service within the package, causes the package to be shut down by running the control script with the ‘stop’ parameter, and then restarted on an alternate node. A package will also fail if it is configured to have a dependency on another package, and that package fails. If the package manager receives a report of an EMS (Event Monitoring Service) event showing that a configured resource dependency is not met, the package fails and tries to restart on the alternate node.

You can modify this default behavior by specifying that the node should halt (system reset) before the transfer takes place. You do this by setting failfast parameters in the package configuration file.

In cases where package shutdown might hang, leaving the node in an unknown state, failfast options can provide a quick failover, after which the node will be cleaned up on reboot. Remember, however, that a system reset causes all packages on the node to halt abruptly.

The settings of the failfast parameters in the package configuration file determine the behavior of the package and the node in the event of a package or resource failure:

If service_fail_fast_enabled is set to yes in the package configuration file, Serviceguard will halt the node with a system reset if there is a failure of that specific service.
If node_fail_fast_enabled is set to yes in the package configuration file, and the package fails, Serviceguard will halt (system reset) the node on which the package is running.
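
The effect of the two failfast parameters can be sketched as follows. This Python model is illustrative only: the function name, the failure labels 'service' and 'package', and the returned strings are assumptions, and each rule is applied only to the kind of failure that the text above names for it.

```python
def failure_response(failure, service_fail_fast_enabled=False,
                     node_fail_fast_enabled=False):
    """Return the modeled response to a failure.

    failure -- 'service' for the failure of a specific service,
               'package' for the failure of the package itself
    """
    if failure not in ("service", "package"):
        raise ValueError("failure must be 'service' or 'package'")
    if failure == "service" and service_fail_fast_enabled:
        return "system reset"
    if failure == "package" and node_fail_fast_enabled:
        return "system reset"
    # Default behavior: halt the package and restart it elsewhere.
    return "halt package and restart on alternate node"
```

With both parameters left at their defaults, every failure in this model produces the default response of halting the package and restarting it on an alternate node.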




	NOTE: In a very few cases, Serviceguard will attempt to reboot the system before a system reset when this behavior is specified. If there is enough time to flush the buffers in the buffer cache, the reboot succeeds, and a system reset does not take place. Either way, the system will be guaranteed to come down within a predetermined number of seconds.

“Configuring a Package: Next Steps” provides advice on choosing appropriate failover behavior.

Service Restarts

You can allow a service to restart locally following a failure. To do this, you indicate a number of restarts for each service in the package control script. When a service starts, the variable RESTART_COUNT is set in the service’s environment. The service, as it executes, can examine this variable to see whether it has been restarted after a failure, and if so, it can take appropriate action such as cleanup.
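
A service written in Python might examine RESTART_COUNT as sketched below. The variable name comes from the text above; the assumptions are that its value is a non-negative integer, that a missing or unparsable value can be treated as zero, and that a value greater than zero means the service has been restarted. The function names are hypothetical.

```python
import os


def restart_count(environ=None):
    """Read RESTART_COUNT from the environment, defaulting to 0."""
    env = os.environ if environ is None else environ
    raw = env.get("RESTART_COUNT", "0")
    try:
        return max(int(raw), 0)
    except ValueError:
        # Assumption: treat an unparsable value as a first start.
        return 0


def was_restarted(environ=None):
    """Return True if the service appears to have been restarted."""
    return restart_count(environ) > 0
```

A service could call was_restarted() early in its startup path and, when it returns True, perform cleanup such as removing stale lock files before resuming normal work.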

Network Communication Failure

An important element in the cluster is the health of the network itself. As it continuously monitors the cluster, each node listens for heartbeat messages from the other nodes confirming that all nodes are able to communicate with each other. If a node does not hear these messages within the configured amount of time, a node timeout occurs; see “What Happens when a Node Times Out”.