Can the application be started and stopped automatically or
does it require operator intervention?
This section describes how to automate application operations
to avoid the need for user intervention. One of the first rules
of high availability is to avoid manual intervention. If it takes
a user at a terminal, console, or GUI to enter commands
to bring up a subsystem, the user becomes a key part of the system.
It may take hours before a user can get to a system console to do
the necessary work. The hardware in question may be located in a
far-off area where no trained users are available, the systems may
be located in a secure datacenter, or someone may have to connect
via modem during off hours.
There are two principles to keep in mind for automating application relocation:
- Insulate users from outages.
- Applications must have defined startup and shutdown procedures.
You need to be aware of what happens currently when the system
your application is running on is rebooted, and whether changes
need to be made in the application's response for high availability.
Insulate Users from Outages
Wherever possible, insulate your end users from outages. Issues
include the following:
- Do not require user intervention to reconnect when a connection
  is lost due to a failed server.
- Where possible, warn users of slight delays due to a failover
  in progress.
- Minimize the reentry of data.
- Engineer the system for reserve capacity to minimize the
  performance degradation experienced by users.
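The first two points can be sketched as a small retry wrapper that a client-side script might place around its connection step. This is only an illustration under stated assumptions: the function name with_retry, the message wording, and the connect_to_server command in the usage comment are hypothetical and not part of any HA product.

```shell
#!/bin/sh
# Retry a command a fixed number of times so the user does not have to
# reconnect by hand, and tell the user why there is a delay.
# Usage: with_retry MAX_TRIES DELAY_SECONDS command [args...]
with_retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_tries" ]; do
        # Run the command exactly as given; success ends the loop.
        if "$@"; then
            return 0
        fi
        # Pause only if another attempt will follow.
        if [ "$attempt" -lt "$max_tries" ]; then
            echo "Attempt $attempt failed; retrying in ${delay}s (a failover may be in progress)" >&2
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    echo "Giving up after $max_tries attempts" >&2
    return 1
}

# Hypothetical usage: try to reconnect up to 5 times, 10 seconds apart.
#   with_retry 5 10 connect_to_server
```

The delay and the number of attempts would need to be chosen so that the total retry window is longer than the expected failover time.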
Define Application Startup and Shutdown
Applications must be restartable without manual intervention.
If the application requires a switch to be flipped on a piece of
hardware, then automated restart is impossible. Procedures for application
startup, shutdown and monitoring must be created so that the HA
software can perform these functions automatically.
To ensure automated response, there should be defined procedures
for starting up the application and stopping the application. In
Serviceguard these procedures are placed in the package control
script. These procedures must check for errors and return status
to the HA control software. The startup and shutdown should be command-line
driven and not interactive unless all of the answers can be predetermined
and scripted.
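The requirements above (command-line driven, non-interactive, error checking, status returned to the caller) might look like the following pair of shell functions. This is a minimal sketch and not the Serviceguard package control script format; the names app_start and app_stop, the APP_CMD command, and the PID_FILE path are all hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical settings; replace with the real application command and paths.
APP_CMD=${APP_CMD:-"sleep 300"}
PID_FILE=${PID_FILE:-"/tmp/myapp.pid"}

# Start the application in the background.
# Returns 0 if it is still running after one second, 1 otherwise.
app_start() {
    # APP_CMD is left unquoted on purpose so that it splits into words.
    $APP_CMD >/dev/null 2>&1 &
    pid=$!
    sleep 1                          # give the process a moment to fail fast
    if kill -0 "$pid" 2>/dev/null; then
        echo "$pid" > "$PID_FILE"
        return 0
    fi
    echo "app_start: process exited during startup" >&2
    return 1
}

# Stop the application recorded in PID_FILE.
# Returns 0 if it is no longer running, 1 if it did not exit in time.
app_stop() {
    [ -f "$PID_FILE" ] || return 0   # nothing recorded, nothing to stop
    pid=$(cat "$PID_FILE")
    kill "$pid" 2>/dev/null
    # Wait up to 10 seconds for the process to exit.
    i=0
    while kill -0 "$pid" 2>/dev/null && [ "$i" -lt 10 ]; do
        sleep 1
        i=$((i + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        echo "app_stop: process $pid did not exit" >&2
        return 1
    fi
    rm -f "$PID_FILE"
    return 0
}
```

Both functions report the outcome only through their return status and messages on standard error, so a calling script can decide what to do without any operator input.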
In an HA failover environment, HA software restarts the application
on a surviving system in the cluster that has the necessary resources,
such as access to the necessary disk drives. The application must be
restartable in two aspects:
- It must be able to restart and recover on the backup system
  (or on the same system if the application restart option is
  chosen).
- It must be able to restart if it fails during the startup and
  the cause of the failure is resolved.
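One common obstacle to the second point is state left behind by a failed start, such as a lock file that makes the next start refuse to run. The sketch below shows one way a startup procedure might clear such a file. It assumes the lock file holds the process ID of its owner; the function name and the LOCK_FILE path are hypothetical. Note that kill -0 only tests for a process on the local system that the caller is allowed to signal, so this check is not meaningful for a lock file on shared storage that was written by another node.

```shell
#!/bin/sh
# Hypothetical path; a real application would use its own lock location.
LOCK_FILE=${LOCK_FILE:-"/tmp/myapp.lock"}

# Remove a lock file left by a process that no longer exists.
# Returns 0 if startup may proceed, 1 if a live instance holds the lock.
clear_stale_lock() {
    [ -f "$LOCK_FILE" ] || return 0      # no lock file, nothing to clean up
    old_pid=$(cat "$LOCK_FILE")
    # kill -0 sends no signal; it only tests whether the process exists.
    if [ -n "$old_pid" ] && kill -0 "$old_pid" 2>/dev/null; then
        echo "clear_stale_lock: instance $old_pid is still running" >&2
        return 1
    fi
    rm -f "$LOCK_FILE"
    return 0
}
```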
Application administrators need to learn to start up and shut down
applications using the appropriate HA commands. Inadvertently shutting
down the application directly will initiate an unwanted failover.
Application administrators also need to be careful that they don't
accidentally shut down a production instance of an application rather
than a test instance in a development environment.
A mechanism to monitor whether the application is active is
necessary so that the HA software knows when the application has
failed. This may be as simple as a script that issues the command
ps -ef | grep xxx for all the processes belonging to the application.
To reduce the impact on users, the application should not
simply abort in case of error, since aborting would cause an unneeded
failover to a backup system. Applications should determine the exact
error and take specific action to recover from the error rather
than, for example, aborting upon receipt of any error.