How Packages Run

Packages are the means by which Serviceguard starts and halts configured applications. Failover packages are also units of failover behavior in Serviceguard. A package is a collection of services, disk volumes and IP addresses that are managed by Serviceguard to ensure they are available. There can be a maximum of 150 packages per cluster and a total of 900 services per cluster.

What Makes a Package Run?

There are 3 types of packages:

The failover package is the most common type of package. It runs on one node at a time. If a failure occurs, it can switch to another node listed in its configuration file. If switching is enabled for several nodes, the package manager will use the failover policy to determine where to start the package.
A system multi-node package runs on all the active cluster nodes at the same time. It can be started or halted on all nodes, but not on individual nodes.
A multi-node package can run on several nodes at the same time. If auto_run is set to yes, Serviceguard starts the multi-node package on all the nodes listed in its configuration file. It can be started or halted on all nodes, or on individual nodes, either by user command (cmhaltpkg) or automatically by Serviceguard in response to a failure of a package component, such as a service, EMS resource, or subnet.

System multi-node packages are supported only for use by applications supplied by Hewlett-Packard.

A failover package can be configured to have a dependency on a multi-node or system multi-node package. The package manager cannot start a package on a node unless the package it depends on is already up and running on that node.

The package manager will always try to keep a failover package running unless there is something preventing it from running on any node. The most common reasons for a failover package not being able to run are that auto_run is disabled so Serviceguard is not allowed to start the package, that node switching is disabled for the package on particular nodes, or that the package has a dependency that is not being met. When a package has failed on one node and is enabled to switch to another node, it will start up automatically in a new location where its dependencies are met. This process is known as package switching, or remote switching.

A failover package starts on the first available node in its configuration file; by default, it fails over to the next available one in the list. Note that you do not necessarily have to use a cmrunpkg command to restart a failed failover package; in many cases, the best way is to enable package and/or node switching with the cmmodpkg command.
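The default placement rule can be pictured with a small model. The Python sketch below is purely illustrative: `pick_node` and its arguments are hypothetical names, not part of Serviceguard, and it assumes only the default behavior described above (configured order, skipping nodes that are down or have switching disabled for the package).

```python
def pick_node(node_list, up_nodes, switching_disabled=()):
    """Return the first node, in configured order, that is up and
    has node switching enabled for the package; None if there is none."""
    for node in node_list:
        if node in up_nodes and node not in switching_disabled:
            return node
    return None


# Hypothetical three-node configuration, listed in order of preference.
nodes = ["node1", "node2", "node3"]

# All nodes up: the package starts on the first node in the list.
first_choice = pick_node(nodes, up_nodes={"node1", "node2", "node3"})

# node1 has failed: the package fails over to the next available node.
after_failure = pick_node(nodes, up_nodes={"node2", "node3"})
```

In this model, a result of `None` corresponds to the case where something prevents the package from running on any node.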

When you create the package, you indicate the list of nodes on which it is allowed to run. System multi-node packages must list all the nodes in the cluster. Multi-node packages and failover packages can name some subset of the cluster’s nodes or all of them.

If the auto_run parameter is set to yes in a package’s configuration file, Serviceguard automatically starts the package when the cluster starts. System multi-node packages are required to have auto_run set to yes. If a failover package has auto_run set to no, Serviceguard cannot start it automatically at cluster startup time; you must explicitly enable this kind of package using the cmmodpkg command.




	NOTE: If you configure the package while the cluster is running, the package does not start up immediately after the cmapplyconf command completes. To start the package without halting and restarting the cluster, issue the cmrunpkg or cmmodpkg command.

How does a failover package start up, and what is its behavior while it is running? Some of the many phases of package life are shown in Figure 3-13 “Legacy Package Time Line Showing Important Events”.




	NOTE: This diagram applies specifically to legacy packages. Differences for modular scripts are called out below.

Figure 3-13 Legacy Package Time Line Showing Important Events

The following are the most important moments in a package’s life:

Before the control script starts. (For modular packages, this is the master control script.)
During run script execution. (For modular packages, during control script execution to start the package.)
While services are running
When a service, subnet, or monitored resource fails, or a dependency is not met.
During halt script execution. (For modular packages, during control script execution to halt the package.)
When the package or the node is halted with a command
When the node fails

Before the Control Script Starts

First, a node is selected. This node must be in the package’s node list, it must conform to the package’s failover policy, and any resources required by the package must be available on the chosen node. One resource is the subnet that is monitored for the package. If the subnet is not available, the package cannot start on this node. Another type of resource is a dependency on a monitored external resource or on a special-purpose package. If monitoring shows a value for a configured resource that is outside the permitted range, the package cannot start.

Once a node is selected, a check is then done to make sure the node allows the package to start on it. Then services are started up for a package by the control script on the selected node. Strictly speaking, the run script on the selected node is used to start a legacy package; the master control script starts a modular package.
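The checks described above can be summarized as a list of reasons that block a start on a given node. This is a minimal sketch under stated assumptions: the `Node` and `Package` classes and the `blocking_reasons` function are hypothetical, the failover policy check is left out, and real Serviceguard evaluates these conditions internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    switching_enabled: bool = True      # node allows this package to start
    subnets_up: set = field(default_factory=set)
    resources_in_range: bool = True     # monitored resource values are acceptable
    packages_up: set = field(default_factory=set)


@dataclass
class Package:
    name: str
    node_list: list
    monitored_subnets: set = field(default_factory=set)
    depends_on: set = field(default_factory=set)


def blocking_reasons(pkg, node):
    """Return the reasons why pkg cannot start on node (empty list = eligible)."""
    reasons = []
    if node.name not in pkg.node_list:
        reasons.append("node not in package node list")
    if not node.switching_enabled:
        reasons.append("node switching disabled")
    missing = pkg.monitored_subnets - node.subnets_up
    if missing:
        reasons.append("subnet down: " + ", ".join(sorted(missing)))
    if not node.resources_in_range:
        reasons.append("monitored resource out of range")
    unmet = pkg.depends_on - node.packages_up
    if unmet:
        reasons.append("dependency not up: " + ", ".join(sorted(unmet)))
    return reasons
```

Returning the reasons rather than a plain yes or no makes it easier to see which precondition is preventing a start.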

During Run Script Execution

Once the package manager has determined that the package can start on a particular node, it launches the script that starts the package (that is, a package’s control script or master control script is executed with the start parameter). This script carries out the following steps:

Executes any external_pre_scripts (modular packages only; see “external_pre_script”)
Activates volume groups or disk groups.
Mounts file systems.
Assigns package IP addresses to the LAN card on the node (failover packages only).
Executes any customer-defined run commands (legacy packages only; see “Adding Customer Defined Functions to the Package Control Script”) or external_scripts (modular packages only; see “external_script”).
Starts each package service.
Starts up any EMS (Event Monitoring Service) resources needed by the package that were specially marked for deferred startup.
Exits with an exit code of zero (0).

Figure 3-14 Package Time Line (Legacy Package)

At any step along the way, an error will result in the script exiting abnormally (with an exit code of 1). For example, if a package service cannot be started, the control script will exit with an error.
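The start sequence and its stop-at-first-error behavior can be modeled in a few lines. The sketch below is illustrative only: `RUN_STEPS` paraphrases the steps listed earlier in this section, and `run_script` is a hypothetical function, not the actual control script.

```python
# Steps carried out by the script that starts a package, in order.
RUN_STEPS = [
    "run external_pre_scripts",              # modular packages only
    "activate volume groups or disk groups",
    "mount file systems",
    "assign package IP addresses",           # failover packages only
    "run customer-defined commands or external_scripts",
    "start package services",
    "start deferred EMS resources",
]


def run_script(failing_step=None):
    """Return (exit_code, completed_steps).

    The sequence stops at the first step that fails and reports exit
    code 1; if every step succeeds, the exit code is 0.
    """
    completed = []
    for step in RUN_STEPS:
        if step == failing_step:
            return 1, completed
        completed.append(step)
    return 0, completed
```

For instance, `run_script("mount file systems")` returns exit code 1 with only the first two steps completed, which is why resources may need to be cleaned up after a failed start.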




	NOTE: This diagram is specific to legacy packages. Modular packages also run external scripts and “pre-scripts” as explained above.

If the run script execution is not complete before the time specified in the run_script_timeout, the package manager will kill the script. During run script execution, messages are written to a log file. For legacy packages, this is in the same directory as the run script and has the same name as the run script and the extension .log. For modular packages, the pathname is determined by the script_log_file parameter in the package configuration file (see “script_log_file”). Normal starts are recorded in the log, together with error messages or warnings related to starting the package.




	NOTE: After the package run script has finished its work, it exits, which means that the script is no longer executing once the package is running normally. After the script exits, the PIDs of the services started by the script are monitored by the package manager directly. If a service dies, the package manager will then run the package halt script or, if `service_fail_fast_enabled` is set to `yes`, it will halt the node on which the package is running. If a number of restarts is specified for a service in the package control script, the service may be restarted if the restart count allows it, without re-running the package run script.

Normal and Abnormal Exits from the Run Script

Exit codes on leaving the run script determine what happens to the package next. A normal exit means the package startup was successful, but all other exits mean that the start operation did not complete successfully.

0—normal exit. The package started normally, so all services are up on this node.
1—abnormal exit, also known as no_restart exit. The package did not complete all startup steps normally. Services are killed, and the package is disabled from failing over to other nodes.
2—alternative exit, also known as restart exit. There was an error, but the package is allowed to start up on another node. You might use this kind of exit from a customer defined procedure if there was an error, but starting the package on another node might succeed. A package with a restart exit is disabled from running on the local node, but can still run on other nodes.
Timeout—Another type of exit occurs when the run_script_timeout is exceeded. In this scenario, the package is killed and disabled globally. It is not disabled on the current node, however. The package script may not have been able to clean up some of its resources such as LVM volume groups, VxVM disk groups or package mount points, so before attempting to start up the package on any node, be sure to check whether any resources for the package need to be cleaned up.
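The four outcomes can be condensed into a lookup. The following sketch is a hypothetical model, not Serviceguard code: the function name and dictionary keys are invented for illustration, and it assumes node failfast is not enabled (with node failfast enabled, some of these cases end in a system reset instead, as shown in Table 3-3).

```python
def outcome_of_run_script(exit_code=None, timed_out=False):
    """Summarize what happens to a package after its run script ends.

    Keys (all illustrative):
      running             - package is up on this node
      can_start_elsewhere - package may start on another node
      local_node          - 'unchanged' or 'disabled' for this package
      check_cleanup       - resources may need manual cleanup first
    """
    if timed_out:
        # Killed and disabled globally, but not disabled on this node.
        return {"running": False, "can_start_elsewhere": False,
                "local_node": "unchanged", "check_cleanup": True}
    if exit_code == 0:
        return {"running": True, "can_start_elsewhere": True,
                "local_node": "unchanged", "check_cleanup": False}
    if exit_code == 1:
        # no_restart exit: disabled from failing over to other nodes.
        return {"running": False, "can_start_elsewhere": False,
                "local_node": "unchanged", "check_cleanup": False}
    if exit_code == 2:
        # restart exit: disabled locally, still allowed on other nodes.
        return {"running": False, "can_start_elsewhere": True,
                "local_node": "disabled", "check_cleanup": False}
    raise ValueError("unexpected exit code: %r" % (exit_code,))
```

A customer-defined procedure would choose between exit 1 and exit 2 depending on whether a start on another node has any chance of succeeding.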

Service Startup with cmrunserv

Within the package control script, the cmrunserv command starts up the individual services. This command is executed once for each service that is coded in the file. You can configure a number of restarts for each service. The cmrunserv command passes this number to the package manager, which will restart the service the appropriate number of times if the service should fail. The following are some typical settings in a legacy package; for more information about configuring services in modular packages, see the discussion starting at “service_name”, and the comments in the package configuration template file.

SERVICE_RESTART[0]=" "        # do not restart
SERVICE_RESTART[0]="-r <n>"   # restart as many as <n> times
SERVICE_RESTART[0]="-R"       # restart indefinitely




	NOTE: If you set `<n>` restarts and also set `service_fail_fast_enabled` to `yes`, the failfast will take place after `<n>` restart attempts have failed. It does not make sense to set `service_restart` to `"-R"` for a service and also set `service_fail_fast_enabled` to `yes`.
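To make the three settings concrete, the sketch below turns a restart string into a restart limit. It is an illustration only: `parse_service_restart` is a hypothetical helper, and the parsing rules are inferred from the three examples above rather than taken from the cmrunserv implementation.

```python
import math


def parse_service_restart(value):
    """Convert a SERVICE_RESTART-style string into a restart limit.

    " "      -> 0         (do not restart)
    "-r <n>" -> n         (restart as many as n times)
    "-R"     -> math.inf  (restart indefinitely)
    """
    tokens = value.split()
    if not tokens:
        return 0
    if tokens == ["-R"]:
        return math.inf
    if len(tokens) == 2 and tokens[0] == "-r" and tokens[1].isdigit():
        return int(tokens[1])
    raise ValueError("unrecognized restart setting: %r" % value)
```

Representing "-R" as infinity also shows why combining it with service failfast is pointless: the restart limit is never reached, so the failfast can never trigger.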

While Services are Running

During the normal operation of cluster services, the package manager continuously monitors the following:

Process IDs of the services
Subnets configured for monitoring in the package configuration file
Configured resources on which the package depends

Some failures can result in a local switch. For example, if there is a failure on a specific LAN card and there is a standby LAN configured for that subnet, then the Network Manager will switch to the healthy LAN card. If a service fails but the restart parameter for that service is set to a value greater than 0, the service will restart, up to the configured number of restarts, without halting the package.

If there is a configured EMS resource dependency and there is a trigger that causes an event, the package will be halted.

During normal operation, while all services are running, you can see the status of the services in the “Script Parameters” section of the output of the cmviewcl command.

When a Service, Subnet, or Monitored Resource Fails, or a Dependency is Not Met

What happens when something goes wrong? If a service fails and there are no more restarts, if a subnet fails and there are no standbys, if a configured resource fails, or if a configured dependency on a special-purpose package is not met, then a failover package will halt on its current node and, depending on the setting of the package switching flags, may be restarted on another node. If a multi-node or system multi-node package fails, all of the packages that have configured a dependency on it will also fail.

Package halting normally means that the package halt script executes (see the next section). However, if a failover package’s configuration has the service_fail_fast_enabled flag set to yes for the service that fails, then the node will halt as soon as the failure is detected. If this flag is not set, the loss of a service will result in halting the package gracefully by running the halt script.

If auto_run is set to yes, the package will start up on another eligible node, if it meets all the requirements for startup. If auto_run is set to no, then the package simply halts without starting up anywhere else.
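The decision taken when a service fails can be sketched as follows. This is a simplified, hypothetical model: the function and its return strings are invented for illustration, and it covers only the restart count, the service failfast flag, and auto_run as described in this section.

```python
def on_service_failure(restarts_used, max_restarts, fail_fast, auto_run):
    """Return the action taken when a package service fails.

    restarts_used - restarts already performed for this service
    max_restarts  - configured restart limit (use float('inf') for -R)
    fail_fast     - service_fail_fast_enabled for the failed service
    auto_run      - package auto_run setting
    """
    if restarts_used < max_restarts:
        # Restart in place; the package itself is not halted.
        return "restart service"
    if fail_fast:
        # The node halts as soon as the failure is detected.
        return "halt node"
    if auto_run:
        return "run halt script, then start on another eligible node"
    return "run halt script"
```

Note that the failfast branch is reached only after the restart limit is exhausted, matching the note in "Service Startup with cmrunserv".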




	NOTE: If a package is dependent on a subnet, and the subnet fails on the node where the package is running, the package will start to shut down. If the subnet recovers immediately (before the package is restarted on an adoptive node), the package manager restarts the package on the same node; no package switch occurs.

When a Package is Halted with a Command

The Serviceguard cmhaltpkg command has the effect of executing the package halt script, which halts the services that are running for a specific package. This provides a graceful shutdown of the package that is followed by disabling automatic package startup (see “auto_run”).

You cannot halt a multi-node or system multi-node package unless all packages that have a configured dependency on it are down. Use cmviewcl to check the status of dependents. For example, if pkg1 and pkg2 depend on PKGa, both pkg1 and pkg2 must be halted before you can halt PKGa.
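The dependency rule amounts to a simple check before halting. The sketch below reuses the pkg1, pkg2, and PKGa names from the example above; `blocking_dependents` is a hypothetical helper, and in practice the package states would come from cmviewcl output rather than a hand-built set.

```python
def blocking_dependents(target, depends_on, running):
    """Return the running packages that depend on target.

    depends_on - mapping of package name -> set of packages it depends on
    running    - set of package names that are currently up
    The target can be halted only when the returned list is empty.
    """
    return sorted(
        pkg for pkg, deps in depends_on.items()
        if target in deps and pkg in running
    )


depends_on = {"pkg1": {"PKGa"}, "pkg2": {"PKGa"}}

# Both dependents are up, so PKGa cannot be halted yet.
still_blocking = blocking_dependents("PKGa", depends_on, {"pkg1", "pkg2", "PKGa"})

# After pkg1 and pkg2 are halted, nothing blocks halting PKGa.
none_blocking = blocking_dependents("PKGa", depends_on, {"PKGa"})
```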




	NOTE: If you use the cmhaltpkg command with the `-n <nodename>` option, the package is halted only if it is running on that node.

The cmmodpkg command cannot be used to halt a package, but it can disable switching either on particular nodes or on all nodes. A package can continue running when its switching has been disabled, but it will not be able to start on other nodes if it stops running on its current node.

During Halt Script Execution

Once the package manager has detected the failure of a service or package that a failover package depends on, or when the cmhaltpkg command has been issued for a particular package, the package manager launches the halt script. That is, a package’s control script or master control script is executed with the stop parameter. This script carries out the following steps (also shown in Figure 3-15 “Legacy Package Time Line for Halt Script Execution”):

Halts any deferred resources that had been started earlier.
Halts all package services.
Executes any customer-defined halt commands (legacy packages only) or external_scripts (modular packages only; see “external_script”).
Removes package IP addresses from the LAN card on the node.
Unmounts file systems.
Deactivates volume groups.
Executes any external_pre_scripts (modular packages only; see “external_pre_script”).
Exits with an exit code of zero (0).
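Setting aside the final exit step, the halt steps mirror the start steps from "During Run Script Execution" in reverse order: whatever was brought up last is taken down first. The sketch below illustrates that pairing; the step names paraphrase the two lists in this section, and `STEP_PAIRS`, `start_order`, and `halt_order` are hypothetical names.

```python
# Each pair is (action at package start, matching action at package halt),
# listed in the order the start actions are performed.
STEP_PAIRS = [
    ("run external_pre_scripts", "run external_pre_scripts"),
    ("activate volume groups", "deactivate volume groups"),
    ("mount file systems", "unmount file systems"),
    ("assign package IP addresses", "remove package IP addresses"),
    ("run customer-defined commands or external_scripts",
     "run customer-defined commands or external_scripts"),
    ("start package services", "halt package services"),
    ("start deferred resources", "halt deferred resources"),
]


def start_order():
    """Start actions, first to last."""
    return [start for start, _ in STEP_PAIRS]


def halt_order():
    """Halt actions: the matching actions in reverse start order."""
    return [halt for _, halt in reversed(STEP_PAIRS)]
```

Reversing the order ensures, for example, that services are stopped before the file systems they use are unmounted.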

Figure 3-15 Legacy Package Time Line for Halt Script Execution

At any step along the way, an error will result in the script exiting abnormally (with an exit code of 1). Also, if the halt script execution is not complete before the time specified by halt_script_timeout, the package manager will kill the script. During halt script execution, messages are written to a log file. For legacy packages, this is in the same directory as the run script and has the same name as the run script and the extension .log. For modular packages, the pathname is determined by the script_log_file parameter in the package configuration file (see “script_log_file”). Normal halts are recorded in the log, together with error messages or warnings related to halting the package.




	NOTE: This diagram applies specifically to legacy packages. Differences for modular scripts are called out above.

Normal and Abnormal Exits from the Halt Script

The package’s ability to move to other nodes is affected by the exit conditions on leaving the halt script. The following are the possible exit codes:

0—normal exit. The package halted normally, so all services are down on this node.
1—abnormal exit, also known as no_restart exit. The package did not halt normally. Services are killed, and the package is disabled globally. It is not disabled on the current node, however.
Timeout—Another type of exit occurs when the halt_script_timeout is exceeded. In this scenario, the package is killed and disabled globally. It is not disabled on the current node, however. The package script may not have been able to clean up some of its resources such as LVM volume groups, VxVM disk groups or package mount points, so before attempting to start up the package on any node, be sure to check whether any resources for the package need to be cleaned up.

Package Control Script Error and Exit Conditions

Table 3-3 “Error Conditions and Package Movement for Failover Packages” shows the possible combinations of error condition, failfast setting and package movement for failover packages.

Table 3-3 Error Conditions and Package Movement for Failover Packages

Package Error Condition			Results
Error or Exit Code	Node Failfast Enabled	Service Failfast Enabled	HP-UX Status on Primary after Error	Halt script runs after Error or Exit	Package Allowed to Run on Primary Node after Error	Package Allowed to Run on Alternate Node
Service Failure	YES	YES	system reset	No	N/A (system reset)	Yes
Service Failure	NO	YES	system reset	No	N/A (system reset)	Yes
Service Failure	YES	NO	Running	Yes	No	Yes
Service Failure	NO	NO	Running	Yes	No	Yes
Run Script Exit 1	Either Setting	Either Setting	Running	No	Not changed	No
Run Script Exit 2	YES	Either Setting	system reset	No	N/A (system reset)	Yes
Run Script Exit 2	NO	Either Setting	Running	No	No	Yes
Run Script Timeout	YES	Either Setting	system reset	No	N/A (system reset)	Yes
Run Script Timeout	NO	Either Setting	Running	No	Not changed	No
Halt Script Exit 1	YES	Either Setting	Running	N/A	Yes	No
Halt Script Exit 1	NO	Either Setting	Running	N/A	Yes	No
Halt Script Timeout	YES	Either Setting	system reset	N/A	N/A (system reset)	Yes, unless the timeout happened after the cmhaltpkg command was executed.
Halt Script Timeout	NO	Either Setting	Running	N/A	Yes	No
Service Failure	Either Setting	YES	system reset	No	N/A (system reset)	Yes
Service Failure	Either Setting	NO	Running	Yes	No	Yes
Loss of Network	YES	Either Setting	system reset	No	N/A (system reset)	Yes
Loss of Network	NO	Either Setting	Running	Yes	Yes	Yes
Loss of Monitored Resource	YES	Either Setting	system reset	No	N/A (system reset)	Yes
Loss of Monitored Resource	NO	Either Setting	Running	Yes	Yes, if the resource is not a deferred resource. No, if the resource is deferred.	Yes
Dependency Package Failed	Either Setting	Either Setting	Running	Yes	Yes when dependency is again met	Yes if dependency met
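Reading the table is a matter of matching the error and the two failfast settings, with "Either Setting" acting as a wildcard. The sketch below encodes a few rows transcribed from Table 3-3 to show that lookup; the `ROWS` structure and `lookup` function are hypothetical, and only three of the result columns are included.

```python
ANY = None  # stands for "Either Setting" in the table

# (error, node failfast, service failfast) ->
#     (HP-UX status on primary, halt script runs, allowed on alternate node)
ROWS = [
    (("Service Failure", ANY, True), ("system reset", "No", "Yes")),
    (("Service Failure", ANY, False), ("Running", "Yes", "Yes")),
    (("Run Script Exit 1", ANY, ANY), ("Running", "No", "No")),
    (("Run Script Exit 2", True, ANY), ("system reset", "No", "Yes")),
    (("Run Script Exit 2", False, ANY), ("Running", "No", "Yes")),
    (("Loss of Network", True, ANY), ("system reset", "No", "Yes")),
    (("Loss of Network", False, ANY), ("Running", "Yes", "Yes")),
]


def lookup(error, node_failfast, service_failfast):
    """Return the result columns of the first row matching the inputs."""
    for (row_error, row_node, row_service), result in ROWS:
        if (row_error == error
                and row_node in (ANY, node_failfast)
                and row_service in (ANY, service_failfast)):
            return result
    raise KeyError(error)
```

For example, `lookup("Loss of Network", False, True)` matches the row with node failfast NO and returns the "Running" outcome, because the service failfast setting is irrelevant for that error.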