Fault Tolerance

Overview: Fault Tolerance

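Cisco RPMS allows you to build fault tolerance and resiliency into your dial service offerings. Fault tolerance and resiliency consist of the following features: hot standby (high availability) servers, tolerance to database failures, automatic restart of failed Cisco RPMS processes, detection of universal gateway failures, and tolerance to AAA server failure.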

Cisco RPMS provides the hot standby (or high availability) feature to protect against server failure. Any stateful Cisco RPMS component server, such as a policy processor, can be deployed as part of a high availability (HA) pair. In this configuration, the two servers continually synchronize with each other by exchanging messages.

In an HA pair, the active call counts and other network state are mirrored across both servers. If either server fails, the system continues to operate with the remaining server without losing state.

Both servers in a hot standby pair act as peers; there is no primary or backup server. Each server is fully functional and independently capable of handling all the network traffic. Because of this, you can provision a network so that some devices (such as RASERs) communicate with one server while others communicate with its peer. This type of provisioning can help with network load sharing.

Each server in a hot standby pair replicates its state to its peer by exchanging messages over a reliable TCP link, so both servers effectively receive the combined network traffic meant for the pair. The servers make their policy decisions based on the total network activity. For this reason, network load sharing does not mean less load or lower CPU utilization for the servers participating in a hot standby configuration.
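The mirroring described above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class `HotStandbyPeer`, its methods, and the per-customer call limit are hypothetical names, and a direct method call stands in for the message that would travel over the TCP link. It shows why both peers hold the same totals and therefore do the same amount of bookkeeping.

```python
from collections import Counter


class HotStandbyPeer:
    """Hypothetical model of one server in a hot standby pair."""

    def __init__(self, name):
        self.name = name
        self.peer = None
        self.active_calls = Counter()  # pair-wide call count per customer

    def pair_with(self, other):
        # Peers are symmetric: neither one is primary or backup.
        self.peer, other.peer = other, self

    def call_started(self, customer):
        """Record a call received locally, then mirror it to the peer."""
        self._apply(customer, +1)
        if self.peer is not None:
            self.peer._apply(customer, +1)  # stands in for a TCP message

    def call_ended(self, customer):
        """Record a call teardown locally, then mirror it to the peer."""
        self._apply(customer, -1)
        if self.peer is not None:
            self.peer._apply(customer, -1)

    def _apply(self, customer, delta):
        self.active_calls[customer] += delta

    def admits(self, customer, limit):
        """Policy decision based on the total activity seen by the pair."""
        return self.active_calls[customer] < limit


a, b = HotStandbyPeer("a"), HotStandbyPeer("b")
a.pair_with(b)
a.call_started("customer-1")  # this call arrives at server a
b.call_started("customer-1")  # this call arrives at server b
print(a.active_calls["customer-1"], b.active_calls["customer-1"])
```

Both counters end up at the same value, so either server could take over alone, and each one processes every update regardless of where the call arrived.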

After installing a new server or rebooting an existing one, you can synchronize it with its peer at any time by using a CLI command. You can also configure HA servers to synchronize with their peer automatically at startup.

For more specific configuration information, refer to "Configuring Cisco RPMS Fault Tolerance" in the Cisco Resource Policy Management System Configuration Guide.

Tolerance to Database Failures

Cisco RPMS needs database connectivity for configuration purposes only; database connectivity is not required for call processing. If the Oracle database fails, Cisco RPMS therefore continues to accept requests from the universal gateways (UGs).

When Cisco RPMS detects a database failure, it generates an e-mail notification. It also generates an SNMP trap when connectivity to the database fails and again when connectivity is restored.
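The notification behavior above amounts to reacting to changes in connectivity state. The following is a minimal sketch, assuming hypothetical names throughout: `DatabaseMonitor`, the two callbacks, and the trap names `dbConnectionLost` and `dbConnectionRestored` are illustrative and are not taken from Cisco RPMS.

```python
class DatabaseMonitor:
    """Hypothetical tracker that reports database connectivity changes."""

    def __init__(self, send_email, send_trap):
        self.send_email = send_email  # callable taking a message string
        self.send_trap = send_trap    # callable taking a trap name
        self.connected = True

    def observe(self, reachable):
        """Feed in the latest connectivity check result."""
        if self.connected and not reachable:
            # Failure detected: e-mail plus a trap.
            self.send_email("database failure detected")
            self.send_trap("dbConnectionLost")
        elif not self.connected and reachable:
            # Connectivity restored: trap only.
            self.send_trap("dbConnectionRestored")
        self.connected = reachable


events = []
monitor = DatabaseMonitor(
    send_email=lambda text: events.append(("email", text)),
    send_trap=lambda name: events.append(("trap", name)),
)
for reachable in [True, False, False, True]:
    monitor.observe(reachable)
print(events)
```

Repeated failed checks produce no additional notifications because only transitions trigger the callbacks.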

Cisco RPMS Autorestart

Cisco RPMS can detect server process failures and automatically restart any Cisco RPMS processes that have failed. The following components are monitored:

Policy Processors: These processors enforce policies and also handle GUI requests for reports. A failed policy processor must be restarted immediately.

RASER: This processor accepts RADIUS messages from the UG and forwards them to the appropriate destination (for example, the Cisco RPMS server or AAA proxy). Because all call processing depends on this process, it must be restarted after a failure.

Oracle RDBMS: The Oracle server is not distributed as a Cisco RPMS component, so Cisco RPMS can detect an Oracle RDBMS failure but cannot restart the Oracle database automatically. The DBServer detects the Oracle database failure and notifies the administrator, and the database administrator must restart Oracle manually.

DBServer: If the DBServer fails while the Oracle database is operational, it restarts automatically. If the DBServer shuts down because Oracle failed, it cannot restart until the database administrator brings up the Oracle database server, so the autorestart process verifies that the Oracle database is operational before restarting the DBServer. The Oracle tool tnsping is used to detect Oracle availability.

FastTrack Web server: The FastTrack server consists of multiple HTTP daemons, which are monitored and restarted after a failure.

Acme Web server (servlets): The Acme Web server process is monitored and restarted after a failure.
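The restart rules in the list above can be summarized as a small decision function. This is a sketch under assumptions, not the actual autorestart implementation: the process names and the function `plan_restarts` are hypothetical, and the `oracle_up` flag stands in for whatever availability check is used (the text names tnsping for that purpose).

```python
def plan_restarts(failed, oracle_up):
    """Decide which failed processes to restart now.

    failed:    names of monitored processes found to be down
               (hypothetical names such as "raser" or "dbserver").
    oracle_up: True if the Oracle database is currently reachable.

    Returns (restart, notify_admin): the processes to restart and whether
    the administrator must be told to bring Oracle back up manually.
    """
    restart = []
    for name in failed:
        if name == "dbserver" and not oracle_up:
            # The DBServer cannot come up without the database; wait.
            continue
        restart.append(name)
    # Oracle itself is never restarted automatically, only reported.
    notify_admin = not oracle_up
    return restart, notify_admin


print(plan_restarts(["raser", "dbserver"], oracle_up=False))
print(plan_restarts(["raser", "dbserver"], oracle_up=True))
```

With the database down, only the RASER is restarted and the administrator is notified; once the database is reachable, the DBServer becomes eligible for restart as well.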

Detection of Universal Gateway Failures

Cisco RPMS implements a heartbeat checker mechanism that allows Cisco RPMS to test whether or not a UG is still active.

You can configure Cisco RPMS to use SNMP to automatically poll the UGs that are in the UG list. When polling, Cisco RPMS sends an SNMP Get request to each UG. If an SNMP agent is running on the UG, the request returns either the time, in hundredths of a second, that the agent has been running or a "no such name" error message. Either response signifies that the agent and the UG are alive, regardless of the returned value. If a UG does not respond to the request, Cisco RPMS resets all the corresponding active calls.
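The liveness rule can be sketched as follows. The example assumes the SNMP polling has already happened and only models the interpretation of the results; `PollResult`, `check_heartbeats`, and the UG names are hypothetical, and resetting active calls is modelled as setting a per-UG counter to zero.

```python
from enum import Enum


class PollResult(Enum):
    UPTIME = "uptime value returned"
    NO_SUCH_NAME = "no such name error"
    NO_RESPONSE = "no response"


def check_heartbeats(poll_results, active_calls):
    """Apply the heartbeat rule to one polling round.

    poll_results: dict mapping a UG name to its PollResult.
    active_calls: dict mapping a UG name to its active call count;
                  modified in place for UGs that did not respond.
    Returns the list of UGs considered down.
    """
    down = []
    for ug, result in poll_results.items():
        if result is PollResult.NO_RESPONSE:
            active_calls[ug] = 0  # reset the calls tracked for this UG
            down.append(ug)
        # Any reply, even an error, proves the agent and UG are alive.
    return down


calls = {"ug-1": 12, "ug-2": 7, "ug-3": 3}
results = {
    "ug-1": PollResult.UPTIME,
    "ug-2": PollResult.NO_SUCH_NAME,
    "ug-3": PollResult.NO_RESPONSE,
}
print(check_heartbeats(results, calls), calls)
```

Only the silent UG has its calls reset; the UG that answered with an error keeps its count because any answer counts as a heartbeat.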

For information on configuring the heartbeat checker or for the heartbeat configuration definitions, refer to the "Overview: The Universal Gateway Heartbeat" section in the Cisco Resource Policy Management System Configuration Guide.

Tolerance to AAA Server Failure

To enhance fault tolerance to AAA server failures, Cisco RPMS allows you to create a prioritized list of AAA servers. Cisco RPMS RASERs use this list to determine the destination of authorization and accounting messages received from the UG.

The RASERs forward messages to the AAA server with the highest priority. If the RASERs detect that this AAA server has failed, they switch over to the server with the next highest priority. When the RASERs reach the end of the list, they start again from the top.
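The switchover order described above behaves like a circular walk over the prioritized list. A minimal sketch, assuming a hypothetical `AaaServerList` class and placeholder server names; how a RASER actually detects that an AAA server has failed is outside the scope of the example.

```python
class AaaServerList:
    """Hypothetical prioritized AAA server list with wraparound failover."""

    def __init__(self, servers):
        # Servers are ordered from highest to lowest priority.
        self.servers = list(servers)
        if not self.servers:
            raise ValueError("at least one AAA server is required")
        self.index = 0

    @property
    def current(self):
        """The server that messages are currently forwarded to."""
        return self.servers[self.index]

    def mark_failed(self):
        """Switch to the next server, wrapping to the top of the list."""
        self.index = (self.index + 1) % len(self.servers)
        return self.current


aaa = AaaServerList(["aaa-1", "aaa-2", "aaa-3"])
order = [aaa.current]
for _ in range(3):
    order.append(aaa.mark_failed())
print(order)
```

After the last server in the list is marked as failed, the selection returns to the highest-priority entry.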