Managing Serviceguard Fifteenth Edition > Chapter 8 Troubleshooting Your Cluster

Monitoring Hardware
Good standard practice in handling a high availability system includes careful fault monitoring so as to prevent failures if possible or at least to react to them swiftly when they occur. The following should be monitored for errors or warnings of all kinds:
Some monitoring can be done through simple physical inspection, but for the most comprehensive monitoring, you should periodically examine the system log file (/var/adm/syslog/syslog.log) for reports on all configured HA devices. Errors relating to a device indicate that it needs maintenance.

Monitoring is especially important because, when the proper redundancy has been configured, failures can occur with no external symptoms. For example, if a Fibre Channel switch in a redundant mass storage configuration fails, LVM will automatically fail over to the alternate path through another Fibre Channel switch. Without monitoring, however, you may not know that the failure has occurred, since the applications are still running normally. At this point there is no redundant path left, so the mass storage configuration is vulnerable to a second failure.

Event Monitoring Service (EMS) allows you to configure monitors of specific devices and system resources. You can direct alerts to an administrative workstation, where operators can be notified that further action is needed when a problem occurs. For example, you could configure a disk monitor to report when a mirror is lost from a mirrored volume group used in the cluster. Refer to the manual Using High Availability Monitors (http://docs.hp.com -> High Availability -> Event Monitoring Service and HA Monitors -> Installation and User’s Guide) for additional information.

A set of hardware monitors is available for monitoring and reporting on memory, CPU, and many other system values. Some of these monitors are supplied with specific hardware products.

When hardware monitors are disabled using the monconfig tool, the associated hardware monitor persistent requests are removed from the persistence files. When hardware monitoring is re-enabled, the monitor requests that were initialized using the monconfig tool are re-created. However, hardware monitor requests created using Serviceguard Manager, or established when Serviceguard is started, are not re-created. These requests are related to the psmmon hardware monitor. To re-create them, halt Serviceguard on the node and then restart it.

In addition to messages reporting actual device failure, the logs may accumulate messages of lesser severity which, over time, can indicate that a failure may happen soon. One product that provides a degree of automation in monitoring is HP ISEE, which gathers information from the status queues of a monitored system to see what errors are accumulating. This tool reports failures and also predicts failures based on statistics for devices that are experiencing specific non-fatal errors over time. In a Serviceguard cluster, HP ISEE should be run on all nodes. HP ISEE also reports error conditions directly to an HP Response Center, alerting support personnel to the potential problem. HP ISEE is available through various support contracts. For more information, contact your HP representative.
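The periodic check of the system log file described earlier can be partly scripted. The following is a minimal sketch only, not part of Serviceguard, EMS, or HP ISEE: it counts log lines that mention a given device together with an error, warning, or failure keyword. The log path comes from the text above; the keyword pattern, the function names `scan_lines` and `scan_syslog`, and the device names are assumptions made for illustration, and a Python interpreter is assumed to be available on the node.

```python
import re
from collections import Counter
from pathlib import Path

# Path taken from the text above; override it when testing elsewhere.
SYSLOG_PATH = Path("/var/adm/syslog/syslog.log")

# Hypothetical severity keywords; tune these to the messages your devices log.
SEVERITY_PATTERN = re.compile(r"\b(error|warning|fail(?:ed|ure)?)\b", re.IGNORECASE)


def scan_lines(lines, devices):
    """Return a Counter mapping device name -> number of suspicious log lines."""
    hits = Counter()
    for line in lines:
        # Skip lines that carry no error, warning, or failure keyword.
        if not SEVERITY_PATTERN.search(line):
            continue
        lowered = line.lower()
        for device in devices:
            if device.lower() in lowered:
                hits[device] += 1
    return hits


def scan_syslog(devices, path=SYSLOG_PATH):
    """Scan the log file at `path`; return an empty Counter if it is missing."""
    if not path.exists():
        return Counter()
    with path.open(errors="replace") as handle:
        return scan_lines(handle, devices)


if __name__ == "__main__":
    # Hypothetical device names, for illustration only.
    for device, count in scan_syslog(["lan0", "c4t0d0"]).most_common():
        print(f"{device}: {count} error/warning line(s)")
```

Run from a scheduler, a script like this gives a rough per-device count that can draw attention to a device accumulating non-fatal messages. It only matches text, so it complements rather than replaces the EMS monitors and HP ISEE.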