7.2. NFS protocol and implementation
NFS is an RPC-based protocol, with
a client-server relationship between the
machine having the filesystem to be distributed and the machine
wanting access to that filesystem. NFS kernel server threads run on
the server and accept RPC calls from clients. These server threads
are initiated
by an
nfsd daemon. NFS servers
also
run the
mountd daemon to handle filesystem mount
requests and some pathname translation. On an NFS client,
asynchronous I/O threads (async threads) are
usually run to improve NFS
performance, but they are not required.
On the client, each process using NFS files is a client of the
server. The client's system calls that access NFS-mounted files
make RPC calls to the NFS servers from which these files were
mounted. The virtual filesystem really
just
extends the operation of basic system
calls like
read( ) and
write( ), similar to the way that NIS extends
the operation of library calls like
getpwuid( ).
In NIS, the
getpwuid( ) routine knows how to use
the NIS RPC protocol to locate user information that isn't in
the local
/etc/passwd file. Within the virtual
filesystem, the basic file- and filesystem-oriented system calls were
modified to "know" how to operate on non-local
filesystems.
Let's look at this with an example. On an NFS client, a user
process
executes a
chmod( )
system call on an NFS-mounted file. The virtual filesystem passes
this system call to NFS, which then executes a remote procedure call
to set the permissions on the file, as specified in the
process's system call. When the RPC completes, the system call
returns to the user process. This example is fairly simple, because
it doesn't involve any block I/O to get file data to or from
the NFS server. When blocks of files are moved around, the async
threads get involved to improve NFS performance. This section covers
the protocols used by NFS and features of its implementation that
were driven by performance or transparency goals.
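As a concrete illustration of this transparency, the short user-level sketch below changes permissions on a file that happens to be NFS-mounted; the application code is exactly what it would be for a local file, and the kernel's virtual filesystem layer is what turns the call into a remote procedure call. The pathname is hypothetical.

```c
/* A user process changes permissions on an NFS-mounted file. The code is
 * identical to the local-filesystem case; the kernel's VFS layer turns this
 * chmod() into the corresponding NFS RPC. The pathname is hypothetical.
 */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    if (chmod("/home/thud/stern/notes.txt", 0644) < 0) {
        perror("chmod");   /* remote errors surface as ordinary errno values */
        return 1;
    }
    return 0;
}
```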
7.2.1. NFS RPC procedures
Each version of the NFS RPC protocol
contains
several procedures, each of which operates
on either a file or a filesystem object. The basic procedures
performed on an NFS server can be grouped into directory operations,
file operations, link operations, and filesystem operations.
Directory operations include
mkdir and
rmdir, which
create
and destroy directories like their Unix system call equivalents.
readdir reads a directory, using an opaque
directory pointer to perform sequential reads of the same directory.
Other directory-oriented procedures are
rename
and
remove, which operate on entries in a
directory the same way the
mv and
rm commands do.
create
makes a new directory entry for a file.
The
lookup operation is
the heart of the
pathname-to-filehandle translation mechanism.
lookup finds a named directory entry and returns
a filehandle pointing to it. The
open( ) system
call uses
lookup( ) extensively: it breaks a
pathname down into its components and locates each component in its
parent directory. For example,
open( ) would
handle the pathname
/home/thud/stern by
performing three operations:
- Look up home in the root directory (/).
- Look up thud in /home.
- Look up stern in /home/thud.
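A minimal sketch of this component-at-a-time traversal follows; the fhandle_t type and the nfs_lookup() stub are hypothetical stand-ins for the real client implementation, which would send a lookup request to the server for each component.

```c
/* Conceptual sketch of component-at-a-time lookup, in the spirit of the
 * three steps above. fhandle_t and nfs_lookup() are hypothetical stand-ins.
 */
#include <stdio.h>
#include <string.h>

typedef struct { unsigned char opaque[32]; } fhandle_t;   /* opaque to the client */

/* Pretend to ask the server for "name" inside the directory "dir". */
static int nfs_lookup(const fhandle_t *dir, const char *name, fhandle_t *out)
{
    (void)dir;
    printf("LOOKUP \"%s\"\n", name);
    memset(out, 0, sizeof(*out));   /* a real client would fill this from the reply */
    return 0;
}

int main(void)
{
    fhandle_t cur = {{0}};              /* filehandle for "/", obtained at mount time */
    char path[] = "home/thud/stern";    /* components of /home/thud/stern             */

    for (char *comp = strtok(path, "/"); comp != NULL; comp = strtok(NULL, "/")) {
        fhandle_t next;
        if (nfs_lookup(&cur, comp, &next) != 0)
            return 1;                   /* e.g., the component does not exist */
        cur = next;                     /* descend one level                  */
    }
    return 0;
}
```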
File operations are very closely associated with Unix system calls:
read and
write move data to
and from the NFS client, and
getattr and
setattr get or modify the file's
attributes. In a local filesystem, such as UFS, these attributes are
stored in the file's inode, but file attributes are mapped to
whatever system is used by the NFS server. Link operations include
link, which creates a hard link on the server,
and
symlink and
readlink, which
create and read the values of symbolic
links, respectively. Finally,
statfs is a
filesystem operation that returns information about the mounted
filesystem that might be needed by
df, for
example.
Other filesystem operations include mounting and unmounting a
filesystem,
but
these are handled through the
NFS
mountd server rather
than
the
server threads. Mount operations are separated from the NFS protocol
because mount points revolve around pathnames, and pathname syntax is
peculiar to each operating system. Unix and VMS, for example, do not
use the same syntax to specify the path to a file. The mount protocol
is responsible for turning the server's file pathname into
information that NFS can use to locate the file in future operations.
From the preceding descriptions, it is fairly clear how the basic
Unix system
calls map into
NFS RPC calls. It is important to note that the NFS RPC protocol and
the vnode interface are two different things. The vnode interface
defines a set of operating system services that are used to access
all filesystems, NFS or local. Vnodes simply generalize the interface
to file objects. There are many routines in the vnode interface that
correspond directly to procedures in the NFS protocol, but the vnode
interface also contains implementations of operating system services
such as mapping file blocks and buffer cache management.
The NFS RPC protocol is a specific realization of one of these vnode
interfaces. It is used to perform specific vnode operations on remote
files. Using the vnode interface, new filesystem types may be plugged
into the operating system by adding kernel routines that perform the
necessary vnode operations on objects in that filesystem.
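The "pluggable" nature of the vnode interface can be sketched as a table of function pointers, one table per filesystem type; the filesystem-independent layer calls through the table without knowing whether the object is local or remote. The structures and names below are illustrative, not an actual kernel API.

```c
/* Sketch of the idea behind the vnode interface: each filesystem type
 * supplies a table of operations, and the filesystem-independent layer
 * dispatches through it. Illustrative only, not a real kernel API.
 */
#include <stdio.h>

struct vnode;

struct vnodeops {                      /* one table per filesystem type */
    int (*getattr)(struct vnode *vp);
    int (*setattr)(struct vnode *vp, int mode);
};

struct vnode {                         /* one per file object */
    const struct vnodeops *v_op;       /* which filesystem implements me */
    const char            *v_tag;
};

static int ufs_getattr(struct vnode *vp) { printf("UFS getattr on %s\n", vp->v_tag); return 0; }
static int ufs_setattr(struct vnode *vp, int m) { printf("UFS setattr %o on %s\n", m, vp->v_tag); return 0; }
static int nfs_getattr(struct vnode *vp) { printf("NFS GETATTR RPC for %s\n", vp->v_tag); return 0; }
static int nfs_setattr(struct vnode *vp, int m) { printf("NFS SETATTR RPC %o for %s\n", m, vp->v_tag); return 0; }

static const struct vnodeops ufs_ops = { ufs_getattr, ufs_setattr };
static const struct vnodeops nfs_ops = { nfs_getattr, nfs_setattr };

/* The system-call layer is filesystem-independent: a chmod() ends up here. */
static int vn_setattr(struct vnode *vp, int mode) { return vp->v_op->setattr(vp, mode); }
static int vn_getattr(struct vnode *vp)           { return vp->v_op->getattr(vp); }

int main(void)
{
    struct vnode local  = { &ufs_ops, "local file"  };
    struct vnode remote = { &nfs_ops, "remote file" };
    vn_setattr(&local,  0644);      /* dispatches to UFS code       */
    vn_setattr(&remote, 0644);      /* dispatches to the NFS client */
    vn_getattr(&remote);
    return 0;
}
```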
7.2.2. Statelessness and crash recovery
The NFS protocol is stateless, meaning that
there is no need to maintain information about the protocol on the
server. The client keeps track of all information required to send
requests to the server, but the server has no information about
previous NFS requests, or how various NFS requests relate to each
other. Remember the differences between the TCP and UDP protocols:
UDP is a stateless protocol that can lose packets or deliver them out
of order; TCP is a stateful protocol that guarantees that packets
arrive and are delivered in order. The hosts using TCP must remember
connection state information to recognize when part of a transmission
was lost.
The choice of a stateless protocol has two implications for the
design and implementation of NFS:
- NFS RPC requests must completely describe the operation to be
performed. When writing a file block, for example, the
write operation must contain a
filehandle, the offset into the file, and the length of the write
operation. This is distinctly different from the Unix
write( ) system call, which writes a buffer to
wherever the current file descriptor's write pointer directs
it. The state contained in the file descriptor does not exist on the NFS server (see the sketch following this list).
- Most NFS requests are idempotent, which
means that an NFS client may send the
same request one or more times without any harmful side effects. The
net result of these duplicate requests is the same. For example,
reading a specific block from a file is idempotent: the same data is
returned from each operation.
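To make the first point concrete, here is a minimal sketch contrasting the state carried in a Unix file descriptor with a self-describing NFS write request; the request structure shown is illustrative rather than the actual wire format defined by the protocol.

```c
/* Minimal sketch contrasting a stateful Unix write() with a self-contained
 * NFS write request. The request structure is illustrative only.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { unsigned char opaque[32]; } fhandle_t;  /* opaque to the client */

struct nfs_write_request {
    fhandle_t file;        /* which file, with no reference to client-side state */
    uint64_t  offset;      /* where in the file to write                         */
    uint32_t  count;       /* how many bytes                                     */
    char      data[8192];  /* the data itself                                    */
};

int main(void)
{
    /* A local write(fd, buf, len) is meaningless without the per-process
     * file descriptor state (which file, current write offset). An NFS
     * write carries all of that explicitly in every request: */
    struct nfs_write_request req;
    memset(&req, 0, sizeof(req));
    req.offset = 32768;
    req.count  = 4096;
    printf("WRITE %u bytes at offset %llu\n",
           (unsigned)req.count, (unsigned long long)req.offset);
    return 0;
}
```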
Obviously, some operations are not idempotent: removing a file
can't be repeated without side effects, because a second
attempt to remove the file will fail if the first one succeeded. Most
NFS servers make all requests idempotent by recording recently
performed operations. A duplicate request that matches one of the
recently performed requests is thrown away by the NFS
server.[11]
The primary motivation for choosing a stateless protocol was to
minimize the burden of crash recovery. Unlike a database system,
which must verify transaction logs and look for incomplete
operations, NFS has no explicit crash recovery mechanism. Because no
state is maintained, the server may reboot and begin accepting client
NFS requests again as if nothing had happened. Similarly, when
clients reboot, the server does not need to know anything about them.
Each NFS request contains enough information to be completed without
any reference to state on the client or
server.
7.2.3. Request retransmission
NFS RPC requests are sent from a
client
to the server one at a time. A single client process will not issue
another RPC call until the call in progress completes and has been
acknowledged by the NFS server. In this respect NFS RPC calls are
like system calls -- a process cannot continue with the next
system call until the current one completes. A single client host may
have several RPC calls in progress at any time, coming from several
processes, but each process ensures that its file operations are well
ordered by waiting for their acknowledgements. Using the NFS async
threads makes this a little more complicated, but for now it's
helpful to think of each process sending a stream of NFS requests,
one at a time.
When a client makes an RPC request, it sets a
timeout period during which the
server must service and acknowledge it. If the server doesn't
get the request because it was lost along the way, or because the
server is too overloaded to complete the request within the timeout
period, the client
retransmits the request.
Requests are idempotent (if the server has a duplicate request
cache), so no harm is done if the server executes the same request
twice -- when the NFS client receives a second reply to the same RPC request, it simply discards the duplicate.
NFS clients continue to retransmit requests until the request
completes, either with an acknowledgement from the server or an error
from the RPC layer. If an NFS server crashes, clients continue to
repeat the call to the RPC layer (if the NFS filesystem is
hard-mounted, otherwise the RPC timeout error is returned to the
application) until the server reboots and can service them again.
When the server is up again, NFS clients continue as if nothing
happened. NFS clients cannot tell the difference between a server
that has crashed and one that is very slow. This raises some
important issues for tuning NFS servers and networks, which will be
visited in
Section 18.1, "Slow server compensation".
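The retransmission behavior just described can be sketched as a simple loop: send the request, wait for a reply until a timeout expires, and resend if none arrives. The helper functions below are hypothetical stand-ins for the real RPC layer, and the hard/soft distinction is simplified.

```c
/* Sketch of client-side retransmission: send, wait, and resend on timeout.
 * send_request() and await_reply() are hypothetical stand-ins.
 */
#include <stdbool.h>
#include <stdio.h>

static void send_request(int xid)    { printf("send request xid=%d\n", xid); }
static bool await_reply(int timeout) { (void)timeout; return false; /* pretend it timed out */ }

/* Returns true once the server has acknowledged the request. A hard-mounted
 * filesystem retries forever; a soft mount gives up after 'retries' attempts. */
static bool nfs_call(int xid, int timeout, int retries, bool hard)
{
    for (int attempt = 0; hard || attempt < retries; attempt++) {
        send_request(xid);
        if (await_reply(timeout))
            return true;                 /* acknowledged; the caller may proceed */
        printf("timeout, retransmitting xid=%d\n", xid);
    }
    return false;                        /* soft mount: error returned to the app */
}

int main(void)
{
    if (!nfs_call(42, 1 /* second */, 3, false))
        printf("RPC timed out (soft mount behavior)\n");
    return 0;
}
```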
The duplicate request cache on NFS servers
usually contains a few hundred entries
-- the last few seconds (at most) of NFS requests on a busy
server. This cache is limited in size to establish a
"window" in which non-idempotent NFS requests are
considered duplicates caused by retransmission rather than distinct
requests. For example, if you execute:
% rm foo
on an NFS client, the client may need to send two or
more
remove
requests to the NFS server before it receives an acknowledgment.
It's up to the NFS server to weed out the duplicate
remove requests, even if they are a second or so
apart. However, if you execute
rm foo on Monday,
and then on Tuesday you execute the same command in the same
directory (where the file has already been removed), you would be
very surprised if
rm did not return an error.
Executing this "duplicate request" a day later should
produce this familiar error:
% rm foo
rm: foo: No such file or directory
To distinguish between duplicates generated due to an RPC timeout and
retry and duplicates due to you repeating a command (whether it be a
day later or a second later), NFS servers record a 32-bit RPC
transaction identifier (xid) with each entry in
the duplicate request cache. The xid is part of every RPC
request's
header,
and it is expected that the NFS client will generate unique xids.
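The following sketch shows the essence of such a duplicate request cache: a small ring of recently seen xids, consulted before a non-idempotent request is executed. The size and fields are illustrative; real servers keep far more state per entry.

```c
/* Sketch of a server-side duplicate request cache: a small ring of recently
 * seen transaction IDs. A retransmitted request whose xid is still in the
 * cache is dropped rather than re-executed. Sizes and fields are illustrative.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DUPCACHE_SIZE 8          /* real servers keep a few hundred entries */

static uint32_t dupcache[DUPCACHE_SIZE];
static int      dupnext;

static bool seen_recently(uint32_t xid)
{
    for (int i = 0; i < DUPCACHE_SIZE; i++)
        if (dupcache[i] == xid)
            return true;
    dupcache[dupnext] = xid;                 /* remember it, overwriting the oldest */
    dupnext = (dupnext + 1) % DUPCACHE_SIZE;
    return false;
}

int main(void)
{
    uint32_t xids[] = { 1001, 1002, 1002, 1003 };   /* 1002 is a retransmission */
    for (int i = 0; i < 4; i++) {
        if (seen_recently(xids[i]))
            printf("xid %u: duplicate, dropped\n", xids[i]);
        else
            printf("xid %u: executing request\n", xids[i]);
    }
    return 0;
}
```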
7.2.4. Preserving Unix filesystem semantics
The VFS makes all filesystems appear
homogeneous to user
processes. There is a single Unix system call interface that operates
on files, and the VFS and underlying vnode interface translate
semantics of these system calls into actions appropriate for each
type of underlying filesystem. It's important to stress the
difference between
syntax and
semantics of system calls. Consistent syntax
means that the system calls take the same arguments independent of
the underlying filesystem. Semantics refers to what the system calls
actually do: preserving semantics across different filesystem types
means that a system call will have the same net effect on the files
in each filesystem type.
Unix filesystem
semantics collectively refers to the way in which Unix
files behave when various sequences of system calls are made. For
example, opening a file and then unlinking it doesn't cause the
file's data blocks to be released until the
close( ) system call is
made. A new filesystem that wants to
maintain Unix filesystem semantics must support this behavior.
The VFS definition makes it possible to ensure that semantics are
preserved for all filesystems, so they all behave in the same manner
when Unix system calls are made on their files. It is easy to use VFS
to implement a filesystem with non-Unix semantics. It's also
possible to integrate a filesystem into the VFS interface
without supporting all of the Unix
semantics; for example, you can put FAT (a filesystem used in MS-DOS,
Windows, and NT operating systems) filesystems under VFS, but you
can't create Unix-like symbolic links on them because the
native FAT filesystem doesn't support symbolic links.
In this section, we'll look at how NFS deals with Unix
filesystem semantics, including some of the operations that
aren't exactly the same under NFS. NFS has slightly different
semantics than the local Unix filesystem, but it tries to preserve
the Unix semantics. An application that works with a local filesystem
works equally well with an NFS-mounted filesystem and will not be
able to distinguish between the two.
Consistency at the vnode interface level
makes
NFS a powerful tool for creating filesystem hierarchies using many
different NFS servers. The
mount command
requires that a filesystem be mounted on a directory; but directories
are vnodes themselves. An NFS filesystem can be mounted on any vnode,
which means that NFS filesystems can be mounted on top of other NFS
filesystems or local filesystems. This is completely consistent with
the way in which local disks are mounted on local filesystems.
/net may be on the root filesystem, and
/net/host is mounted on top of it. A workstation
configured using NFS can create a view of the filesystems on the
network that best meets its requirements by mounting these
filesystems with a directory naming scheme of its choice.
Maintaining other Unix filesystem semantics is not quite as easy.
Locking operations, for example, introduce state into a system that
was meant to be stateless. This problem is addressed by a separate
lock manager daemon. Another bit of Unix lore that had to be preserved
was the retention of an open file's data blocks, even when the
file's directory entry was removed. Many Unix utilities,
including shells and mailers, use this "delayed unlink"
feature to create temporary files that have no name in the
filesystem, and are therefore invisible to probing users.
A complete solution to the problem would require that the server keep
open file reference counts for each file and not free the
file's data blocks until the reference count decreased to zero.
However, this is precisely the kind of state information that makes
crash recovery difficult, so NFS was implemented with a client-side
solution that handles the common applications of this feature. When a
remove operation is performed on an open file,
the client issues a
rename NFS RPC instead. The
file is renamed to
.nfsXXXX, where
XXXX is a suffix to make the filename unique.
When the file is eventually closed, the client issues the
remove operation on the previously unlinked
file. Note that there is no need for an "open" or
"close" NFS RPC procedure, since "opened" and
"closed" are states that are maintained on the client. It
is still possible to confuse two clients that attempt to unlink a
shared, open NFS-mounted file, since one client will not know that
the other has the file open, but this approach emulates the behavior of a
local filesystem well enough that utilities relying on the delayed
unlink do not need to change.
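The client-side logic just described can be sketched as follows; the helper names, the open-count bookkeeping, and the way the unique suffix is generated are all illustrative rather than any particular implementation.

```c
/* Sketch of the client-side "silly rename": if the file is still open when
 * it is removed, rename it to a hidden .nfsXXXX name and defer the real
 * remove until the last close. Names and helpers are illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

static void nfs_rename(const char *from, const char *to) { printf("RENAME %s -> %s\n", from, to); }
static void nfs_remove(const char *name)                 { printf("REMOVE %s\n", name); }

static char sillyname[64];
static bool silly_pending;

static void client_unlink(const char *name, int open_count)
{
    if (open_count > 0) {
        /* Some process on this client still has the file open. */
        snprintf(sillyname, sizeof(sillyname), ".nfs%04d", 1234);  /* unique suffix */
        nfs_rename(name, sillyname);
        silly_pending = true;
    } else {
        nfs_remove(name);
    }
}

static void client_last_close(void)
{
    if (silly_pending) {
        nfs_remove(sillyname);     /* now it is safe to free the data blocks */
        silly_pending = false;
    }
}

int main(void)
{
    client_unlink("scratchfile", 1);   /* file is open: renamed, not removed       */
    client_last_close();               /* last close: the remove finally happens   */
    return 0;
}
```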
7.2.5. Pathnames and filehandles
All NFS operations use filehandles to designate
the files or directories on which they will be performed. Filehandles
are created on the server and contain information that uniquely
identifies the file or directory on the server. The client's
NFS
mount and
lookup requests retrieve these filehandles for
existing files. A side effect of making all vnodes homogeneous is
that file pathname lookup must be done one component at a time. Each
directory in the pathname might be a mount point for another
filesystem, so each name look-up request cannot include multiple
components. For example, let's look at
Client
A that NFS-mounts the
/usr/local
filesystem and also NFS-mounts a filesystem on
/usr/local/bin:
clientA# mount server1:/usr/local /usr/local
clientA# mount server2:/usr/local/bin.mips /usr/local/bin
When the NFS client reaches the
bin component in
the pathname, it realizes that there is an NFS filesystem mounted on
this directory, and it sends its lookup requests to
server2 instead of
server1.
If the NFS client passed the whole pathname to
server1, it might get the wrong answer on its
lookup:
server1 has its own
/usr/local/bin directory that may or may not be
the same directory that
Client A has mounted.
While this may seem to be a very expensive series of operations, the
kernel keeps a directory name lookup cache (DNLC) that prevents every
look-up request from going to an NFS server.
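The DNLC can be pictured as a small table mapping a (directory, component name) pair to a previously resolved result, consulted before any lookup is sent to the server. The sketch below is conceptual; the structure and integer "vnode" stand-ins are not an actual kernel implementation.

```c
/* Sketch of a directory name lookup cache (DNLC): consult a small table
 * before sending a LOOKUP to the server. Illustrative only.
 */
#include <stdio.h>
#include <string.h>

struct dnlc_entry {
    int  dir_id;          /* stands in for the directory vnode      */
    char name[64];        /* pathname component                     */
    int  result_id;       /* stands in for the vnode it resolves to */
};

#define DNLC_SIZE 4
static struct dnlc_entry dnlc[DNLC_SIZE];

static int dnlc_lookup(int dir_id, const char *name)
{
    for (int i = 0; i < DNLC_SIZE; i++)
        if (dnlc[i].dir_id == dir_id && strcmp(dnlc[i].name, name) == 0)
            return dnlc[i].result_id;          /* hit: no RPC needed            */
    return -1;                                 /* miss: caller must do a LOOKUP */
}

static void dnlc_enter(int slot, int dir_id, const char *name, int result_id)
{
    dnlc[slot].dir_id = dir_id;
    snprintf(dnlc[slot].name, sizeof(dnlc[slot].name), "%s", name);
    dnlc[slot].result_id = result_id;
}

int main(void)
{
    dnlc_enter(0, 1, "bin", 7);   /* remember that "bin" under directory 1 is vnode 7 */
    printf("lookup bin: %d\n", dnlc_lookup(1, "bin"));    /* hit            */
    printf("lookup lib: %d\n", dnlc_lookup(1, "lib"));    /* miss -> LOOKUP */
    return 0;
}
```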
The
lookup operation takes a filename and a
filehandle for a directory, and returns a filehandle pointing to the
named file on the server. How then does the pathname traversal get
started, if every
lookup requires a filehandle
from a previous pathname resolution? The
mount
operation seeds the lookup process by providing a filehandle for the
root of the mounted filesystem. Within NFS, the only procedure that
accepts full pathnames is the
mount RPC, which
turns the pathname into a filehandle for the mounted filesystem.
Let's look at how NFS turns the pathname
/usr/local/bin/emacs into an NFS filehandle,
assuming that it's on a filesystem mounted on
/usr/local from server
wahoo:
- The NFS client asks the mountd daemon on
wahoo for a filehandle for the filesystem the
client has mounted on /usr/local, using the
server's pathname that was supplied in the
/etc/vfstab file or mount
command. That is, if the client has mounted
/usr/local with the
/etc/vfstab entry:
wahoo:/tools/local - /usr/local nfs - yes ro,hard
then the client will ask wahoo for a filehandle
for the /tools/local directory.[12]
- Using the mount point filehandle, the client performs a lookup
operation on the next component in the pathname:
bin. It sends a lookup to
wahoo, supplying the filehandle for the
/usr/local directory and the name
"bin." Server wahoo returns another
filehandle for this directory.
- The client goes to work on the next component in the path,
emacs. Again, it sends a
lookup using the filehandle for the directory
containing emacs and the name it is looking for.
The filehandle returned by the server is used by the client as a
"pointer" (on the server) to
/usr/local/bin/emacs (in the filesystem seen by
client) for all future operations on that file.
Filehandles are opaque to the
client. In most NFS implementations on
Unix machines, they are an encoding of the file's inode number,
disk device number, and inode generation number. Other
implementations, particularly non-Unix NFS servers that do not have
inodes, encode their own native filesystem information in the
filehandle. In any system, the filehandle is in a form that can be
disassembled only on the NFS server. The structures contained in the
filehandle are kept hidden from the client, the same way the
structures in an object-oriented system are hidden in the
object's implementation routines. In the case of NFS
filehandles, the data described by the structure doesn't even
exist on the client -- it's all on the server, where the
filehandle can be converted into a pointer to a local file.
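A sketch of the kind of information a Unix server might pack into a filehandle appears below: a filesystem (or device) identifier, the inode number, and the inode generation number. The layout is illustrative; real filehandles are server-specific and deliberately opaque to clients.

```c
/* Sketch of a server packing identifying information into an opaque
 * filehandle. The layout is illustrative, not any particular server's.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { unsigned char data[32]; } fhandle_t;   /* what the client sees */

struct fh_contents {                                    /* what the server packs */
    uint32_t fsid;          /* which exported filesystem / device */
    uint32_t inode;         /* the file's inode number            */
    uint32_t generation;    /* inode generation number            */
};

static fhandle_t fh_encode(uint32_t fsid, uint32_t inode, uint32_t gen)
{
    fhandle_t fh;
    struct fh_contents c = { fsid, inode, gen };
    memset(&fh, 0, sizeof(fh));
    memcpy(fh.data, &c, sizeof(c));        /* only the server knows this layout */
    return fh;
}

int main(void)
{
    fhandle_t fh = fh_encode(3, 182354, 9);
    struct fh_contents c;
    memcpy(&c, fh.data, sizeof(c));        /* the server disassembles it later */
    printf("fsid=%u inode=%u generation=%u\n", c.fsid, c.inode, c.generation);
    return 0;
}
```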
Filehandles become invalid, or stale,
when
the
inodes to which they point (on the server) are freed or re-used. NFS
clients have no way of knowing what other operations may be affecting
objects pointed to by their filehandles, so there is no way to warn a
client in advance that a filehandle is invalid. If an RPC call is
made with a filehandle that is stale, the NFS server returns a
stale filehandle error to the
caller. Say that a user on one client
removes an NFS-mounted directory and its contents using
rm
-rf test, while another client has a process using
test as its current working directory. The next
time the other process tries to read its working directory, it gets a
stale filehandle error back from the NFS server:
| Client A | Client B |
| --- | --- |
| cd /mnt/test | cd /mnt |
| | rm -rf test |
| stat(.) --> Stale file handle | |
If one client removes a file and then creates a new file that re-uses
the freed inode, other filehandles (on other clients) that point to
the re-used inode must be marked stale. Inode generation numbers were
added to the basic Unix filesystem to add a time history to an inode.
In addition to the inode number, the filehandle must match the
current generation number of the inode, or it is marked stale. When
the inode is re-used for a new file, its generation number is
incremented. Stale filehandles become a problem when one user's
work tramples on an area in use by another, or when a filesystem on a
server is rebuilt from a backup tape. When restoring from a dump tape
onto a fresh filesystem, all of the inode generation numbers in the
filesystem are set to random numbers. This causes every filehandle in
use for that filesystem to become stale -- every inode pointed
to by a pre-restore filehandle now probably points to a completely
different file on the disk.
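The generation-number check itself is simple, as the sketch below shows: the server decodes the filehandle, finds the inode, and compares generation numbers, returning the protocol's stale filehandle error on a mismatch. The structures are illustrative; only the error value is taken from the protocol.

```c
/* Sketch of a server detecting a stale filehandle: the generation number in
 * the handle no longer matches the one stored in the inode, so the handle
 * refers to a file that has been removed or replaced. Structures are
 * illustrative.
 */
#include <stdint.h>
#include <stdio.h>

struct fh_contents { uint32_t fsid, inode, generation; };  /* decoded filehandle */
struct inode       { uint32_t generation; int allocated; };

#define NFSERR_STALE 70     /* "stale filehandle" error in the NFS protocol */

static int check_handle(const struct fh_contents *fh, const struct inode *itab)
{
    const struct inode *ip = &itab[fh->inode];
    if (!ip->allocated || ip->generation != fh->generation)
        return NFSERR_STALE;    /* inode freed or re-used since the handle was issued */
    return 0;
}

int main(void)
{
    struct inode itab[4] = { {0, 0}, {5, 1}, {6, 1}, {0, 0} };
    struct fh_contents current = { 0, 1, 5 };   /* matches inode 1, generation 5     */
    struct fh_contents stale   = { 0, 2, 3 };   /* inode 2 has been re-used (gen 6)  */

    printf("current handle: %d\n", check_handle(&current, itab));  /* 0            */
    printf("stale handle:   %d\n", check_handle(&stale, itab));    /* NFSERR_STALE */
    return 0;
}
```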
Therefore, a quick way to cripple an NFS network
is
to restore a fileserver from a dump
tape without rebooting the NFS clients. When you rebuild the
server's filesystems, all of the inode generation numbers are
reset; when you load the tape, files end up with different inode
numbers and different inode generation numbers than they had on the
original filesystem. All NFS client filehandles are now invalid
because of the new generation numbers and the (random) renumbering of
each file's inode. Any attempt to use an open filehandle
results in stale filehandle errors. If you are going to restore an
NFS-exported filesystem
from tape, unmount it from its clients
or reboot
the
clients.
7.2.6. NFS Version 3
There are three versions of the NFS protocol in use or specified:
Versions 2, 3, and 4. Version 1 did exist, but it was only a prototype,
and neither an implementation nor a specification was ever released.
Version 4 has been specified, but at the time this book was written,
there were no commercial implementations. Version 3 has three major
differences from Version 2:
- Large file support
- Version 2 supported files up to
four gigabytes in length, though most
implementations are limited to up to two-gigabyte files. Version 3
supports files up to and including 2^64 - 1
bytes in length. Large file support was the primary driver for a
protocol revision.
- Writes to unstable storage
- Version 2 of the NFS protocol specified that NFS servers could not reply
successfully to a write request until the data
had been committed to stable storage, usually magnetic disk, but
non-volatile RAM was permissible as well. This limited the write
throughput of NFS clients, and so Version 3 of the protocol permits
the client to indicate that the write need not
be committed to stable storage. This allows NFS servers to respond
quickly to write requests. Of course, clients
are still interested in committing their data to stable storage, and
so Version 3 has a new procedure called commit,
which tells the NFS server to write the uncommitted data to stable
storage before returning success.
The theory behind this, supported by experimental measurement, is
that faster throughput is gained by the NFS server committing data to
stable storage in parallel with the client doing something else (such
as generating more NFS requests), before the client issues the
commit. Typically, the NFS Version 3 client will issue a commit when it
is about to close a file, or when buffer space is tight (a sketch of
this write/commit pattern appears at the end of this section).
- Large transfer sizes
- NFS Version 2 had a limit of 8192 bytes per NFS read and write
request. NFS Version 3 lets the client and server negotiate a
mutually acceptable limit.
Recall from Section 1.3.1, "Datagrams and packets" that packets larger than the
medium's MTU must be fragmented. Fragmentation of output
packets is easy, but the other direction, reassembly of input
fragments, is harder if the fragments arrive out of order, or if a
fragment is dropped or delayed. With larger NFS transfer sizes, the
risk of a reassembly problem is higher, and if there is a problem,
the entire datagram must be retransmitted, including all the
fragments. NFS Version 2 was designed to be gentler to the network
during the days when operating systems, routers, and network hardware
were less capable. Nowadays, these components are much more
effective, and so NFS Version 3 removes the artificial limits to
transfer
size.
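The write/commit pattern described under "Writes to unstable storage" can be sketched as follows; the functions are hypothetical stand-ins for the client's RPC layer, while the stability flags mirror those defined by the Version 3 protocol.

```c
/* Sketch of the NFS Version 3 write/commit pattern: issue WRITEs flagged as
 * unstable, then a single COMMIT before closing the file. The functions are
 * hypothetical stand-ins for the client's RPC layer.
 */
#include <stdint.h>
#include <stdio.h>

enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };  /* as in the v3 protocol */

static void nfs3_write(uint64_t offset, uint32_t count, enum stable_how how)
{
    printf("WRITE  offset=%llu count=%u %s\n", (unsigned long long)offset,
           (unsigned)count, how == UNSTABLE ? "UNSTABLE" : "FILE_SYNC");
}

static void nfs3_commit(uint64_t offset, uint32_t count)
{
    printf("COMMIT offset=%llu count=%u\n", (unsigned long long)offset, (unsigned)count);
}

int main(void)
{
    /* The server may reply to these before the data is on stable storage. */
    for (int i = 0; i < 4; i++)
        nfs3_write((uint64_t)i * 8192, 8192, UNSTABLE);

    /* Before close (or when buffer space is tight), force the data out. */
    nfs3_commit(0, 4 * 8192);
    return 0;
}
```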
7.2.7. NFS over TCP
Both NFS Version 2 and Version 3 operate
over UDP and TCP. Since TCP is
stateful and NFS is stateless, it would seem to be a contradiction,
if not an impossibility, for NFS to operate over TCP. However, the
layer between NFS and TCP is RPC, and RPC is implemented to hide
state issues of TCP from NFS.
The first time an NFS client contacts a server over TCP, the RPC
layer takes care of establishing a connection. If a server crashes,
the client won't know that immediately, but the next time it
sends a request over the connection, the connection will break due to
a connection reset from the server, or a connection timeout. In
either case, the RPC layer simply re-establishes a connection.
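The reconnection behavior can be pictured as a small loop in the RPC layer: if a send on the TCP connection fails because the server rebooted, the layer simply re-establishes the connection and retries, keeping the NFS layer unaware of any of it. The helpers in the sketch below are hypothetical.

```c
/* Sketch of the reconnection behavior described above. The helpers stand in
 * for real TCP and RPC machinery and are purely illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

static bool connected;

static void tcp_connect(void)       { connected = true;  printf("connect to server\n"); }
static bool tcp_send(const char *m)
{
    if (!connected)
        return false;                 /* connection reset or timed out */
    printf("send %s\n", m);
    return true;
}

static void rpc_call(const char *request)
{
    while (!tcp_send(request)) {      /* hide TCP connection state from the NFS layer */
        printf("connection lost, re-establishing\n");
        tcp_connect();
    }
}

int main(void)
{
    rpc_call("GETATTR");              /* first call: triggers the initial connect */
    connected = false;                /* pretend the server crashed and rebooted  */
    rpc_call("READ");                 /* the RPC layer reconnects transparently   */
    return 0;
}
```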
Some NFS/TCP implementations, such as that in Solaris, maintain a
single connection between the NFS client and server, such that all
traffic -- for all users and mount points -- is multiplexed
between the client and server. Other implementations, such as those
in the BSD releases, have one connection per mountpoint. Aside from a
user-level NFS client like a web browser, or a Java application
linked to NFS classes, you are not likely to encounter an NFS client
that creates a connection per user.
If the client crashes, the server will periodically close
connections that
haven't been used in a while. On a Solaris NFS server, this
connection idle
timer defaults to six minutes.