Soft mount issues (Managing NFS and NIS, 2nd Edition)

18.2. Soft mount issues

Repeated retransmission cycles only occur for hard-mounted filesystems. When the soft option is supplied in a mount, the RPC retransmission sequence ends at the first major timeout, producing messages like:

NFS write failed for server wahoo: error 5 (RPC: Timed out)
NFS write error on host wahoo: error 145.
(file handle: 800000 2 a0000 114c9 55f29948 a0000 11494 5cf03971)

The NFS operation that failed is indicated, the server that failed to respond before the major timeout, and the filehandle of the file affected. RPC timeouts may be caused by extremely slow servers, or they can occur if a server crashes and is down or rebooting while an RPC retransmission cycle is in progress.

With soft-mounted filesystems, you have to worry about damaging data due to incomplete writes, losing access to the text segment of a swapped process, and making soft-mounted filesystems more tolerant of variances in server response time. If a client does not give the server enough latitude in its response time, the first two problems impair both the performance and correct operation of the client. If write operations fail, data consistency on the server cannot be guaranteed. The write error is reported to the application during some later call to write( ) or close( ), which is consistent with the behavior of a local filesystem residing on a failing or overflowing disk. When the actual write to disk is attempted by the kernel device driver, the failure is reported to the application as an error during the next similar or related system call.

A well-conditioned application should exit abnormally after a failed write, or retry the write if possible. If the application ignores the return code from write( ) or close( ), then it is possible to corrupt data on a soft-mounted filesystem. Some write operations may fail and never be retried, leaving holes in the open file.

To guarantee data integrity, all filesystems mounted read-write should be hard-mounted. Server performance as well as server reliability determine whether a request eventually succeeds on a soft-mounted filesystem, and neither can be guaranteed. Furthermore, any operating system that maps executable images directly into memory (such as Solaris) should hard-mount filesystems containing executables. If the filesystem is soft-mounted, and the NFS server crashes while the client is paging in an executable (during the initial load of the text segment or to refill a page frame that was paged out), an RPC timeout will cause the paging to fail. What happens next is system-dependent; the application may be terminated or the system may panic with unrecoverable swap errors.

A common objection to hard-mounting filesystems is that NFS clients remain catatonic until a crashed server recovers, due to the infinite loop of RPC retransmissions and timeouts. By default, Solaris clients allow interrupts to break the retransmission loop. Use the intr mount option if your client doesn't specify interrupts by default. Unfortunately, some older implementations of NFS do not process keyboard interrupts until a major timeout has occurred: with even a small timeout period and retransmission count, the time required to recognize an interrupt can be quite large.

If you choose to ignore this advice, and choose to use soft-mounted NFS filesystems, you should at least make NFS clients more tolerant of soft-mounted NFS fileservers by increasing the retrans mount option. Increasing the number of attempts to reach the server makes the client less likely to produce an RPC error during brief periods of server loading.