What happens if part or all of an application fails?
The preceding sections assumed that the failure in question was
not a failure of the application, but of another component of the
cluster. This section deals specifically with application problems.
For instance, a software bug may cause an application to fail, or a
system resource problem (such as low swap or memory space) may cause
an application to die. This section describes how to design your
application to recover from these types of failures.
Design Applications to be Failure Tolerant
An application should tolerate the failure of a single component.
Many applications have multiple processes running on a single node.
If one process fails, what happens to the other processes? Do they
also fail? Can the failed process be restarted on the same node
without affecting the remaining pieces of the application?
Ideally, if one process fails, the other processes can wait
a period of time for that component to come back online. This is
true whether the component is on the same system or a remote system.
The failed component can be restarted automatically on the same
system, rejoin the waiting processes, and continue. This type of
failure can be detected and the component restarted within a few
seconds, so the end user would never know a failure occurred.
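As an illustration of automatic restart on the same system, the
following is a minimal sketch in Python, not a description of any
particular cluster product. The function name supervise and its
parameters max_restarts and backoff_seconds are hypothetical and
chosen only for this example. It assumes the component is a single
command that exits with status 0 on a normal shutdown and a nonzero
status on failure.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, backoff_seconds=1.0):
    """Run cmd and restart it whenever it exits with a nonzero status.

    Returns the number of restarts that were needed before the command
    exited cleanly. Raises RuntimeError once max_restarts is exhausted.
    (Hypothetical helper for illustration only.)
    """
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        status = proc.wait()          # block until the component exits
        if status == 0:
            return restarts           # clean exit: nothing left to do
        if restarts >= max_restarts:
            raise RuntimeError(
                f"{cmd[0]} failed {restarts + 1} times; giving up"
            )
        restarts += 1
        time.sleep(backoff_seconds)   # short pause before restarting


if __name__ == "__main__":
    # Example: supervise a trivial child that exits cleanly.
    count = supervise([sys.executable, "-c", "print('component running')"])
    print(f"restarts needed: {count}")
```

Capping the number of restarts is a design choice in this sketch: a
component that fails repeatedly probably has a problem that a restart
on the same system cannot fix, so the failure is reported instead of
being retried forever.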
Another alternative is for the other components to remain able to
shut down cleanly after one component fails. If a database SQL
server fails, it should still be possible to bring the database
down cleanly so that no database recovery is necessary.
The worst case is for the failure of one component to cause
the entire system to fail. If one component fails and all other
components need to be restarted, the downtime will be high.
Be Able to Monitor Applications
It should be possible to monitor the health of every component
in a system, including applications. A monitor might be as
simple as a display command or as complicated as a SQL query. There
must be a way to ensure that the application is behaving correctly.
If the application fails and the failure is not detected
automatically, it might take hours for a user to determine the cause
of the downtime and recover from it.
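A monitor of this kind can be sketched as a polling loop. The example
below is a hedged illustration in Python, not the interface of any
specific monitoring tool. The name monitor and the parameters
interval_seconds, threshold, and max_polls are hypothetical. It
assumes the health check is supplied as a callable that returns True
when the application is healthy; that callable could wrap a display
command or a SQL query. Requiring several consecutive failures before
declaring the application down is an assumption made here to avoid
reacting to a single transient error.

```python
import time


def monitor(check, interval_seconds=5.0, threshold=3, max_polls=None):
    """Poll check() until it fails threshold times in a row.

    check is any callable returning True when the application is
    healthy. An exception raised by check counts as a failed poll.
    Returns the poll number at which the application was declared
    failed, or None if max_polls was reached first.
    (Hypothetical helper for illustration only.)
    """
    consecutive_failures = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            healthy = bool(check())
        except Exception:
            healthy = False           # a check that cannot run is a failure
        if healthy:
            consecutive_failures = 0  # any success resets the count
        else:
            consecutive_failures += 1
        if consecutive_failures >= threshold:
            return polls              # caller can now restart or fail over
        time.sleep(interval_seconds)
    return None


if __name__ == "__main__":
    # Example: a scripted check that is healthy twice and then stays down.
    results = iter([True, True, False, False, False])
    failed_at = monitor(lambda: next(results), interval_seconds=0, threshold=3)
    print(f"application declared failed at poll {failed_at}")
```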