7.2. NFS protocol and implementation
NFS is an RPC-based protocol, with
a client-server relationship between the
machine having the filesystem to be distributed and the machine
wanting access to that filesystem. NFS kernel server threads run on
the server and accept RPC calls from clients. These server threads
are initiated
by an
nfsd daemon. NFS servers
also
run the
mountd daemon to handle filesystem mount
requests and some pathname translation. On an NFS client,
asynchronous I/O threads (async threads) are
usually run to improve NFS
performance, but they are not required.
On the client, each process using NFS files is a client of the
server. The client's system calls that access NFS-mounted files
make RPC calls to the NFS servers from which these files were
mounted. The virtual filesystem really
just
extends the operation of basic system
calls like
read( ) and
write( ), similar to the way that NIS extends
the operation of library calls like
getpwuid( ).
In NIS, the
getpwuid( ) routine knows how to use
the NIS RPC protocol to locate user information that isn't in
the local
/etc/passwd file. Within the virtual
filesystem, the basic file- and filesystem-oriented system calls were
modified to "know" how to operate on non-local
filesystems.
Let's look at this with an example. On an NFS client, a user
process
executes a
chmod( )
system call on an NFS-mounted file. The virtual filesystem passes
this system call to NFS, which then executes a remote procedure call
to set the permissions on the file, as specified in the
process's system call. When the RPC completes, the system call
returns to the user process. This example is fairly simple, because
it doesn't involve any block I/O to get file data to or from
the NFS server. When blocks of files are moved around, the async
threads get involved to improve NFS performance. This section covers
the protocols used by NFS and features of its implementation that
were driven by performance or transparency goals.
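As a concrete illustration of this transparency, the short user-level sketch below changes permissions on a file that happens to be NFS-mounted; the application code is exactly what it would be for a local file, and the kernel's virtual filesystem layer is what turns the call into a remote procedure call. The pathname is hypothetical.

```c
/* A user process changes permissions on an NFS-mounted file. The code is
 * identical to the local-filesystem case; the kernel's VFS layer turns this
 * chmod() into the corresponding NFS RPC. The pathname is hypothetical.
 */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    if (chmod("/home/thud/stern/notes.txt", 0644) < 0) {
        perror("chmod");   /* remote errors surface as ordinary errno values */
        return 1;
    }
    return 0;
}
```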
7.2.1. NFS RPC procedures
Each version of the NFS RPC protocol
contains
several procedures, each of which operates
on either a file or a filesystem object. The basic procedures
performed on an NFS server can be grouped into directory operations,
file operations, link operations, and filesystem operations.
Directory operations include
mkdir and
rmdir, which
create
and destroy directories like their Unix system call equivalents.
readdir reads a directory, using an opaque
directory pointer to perform sequential reads of the same directory.
Other directory-oriented procedures are
rename
and
remove, which operate on entries in a
directory the same way the
mv and
rm commands do.
create
makes a new directory entry for a file.
The
lookup operation is
the heart of the
pathname-to-filehandle translation mechanism.
lookup finds a named directory entry and returns
a filehandle pointing to it. The
open( ) system
call uses
lookup( ) extensively: it breaks a
pathname down into its components and locates each component in its
parent directory. For example,
open( ) would
handle the pathname
/home/thud/stern by
performing three operations:
- Look up home in the root directory (/).
- Look up thud in /home.
- Look up stern in /home/thud.
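A minimal sketch of this component-at-a-time traversal follows; the fhandle_t type and the nfs_lookup() stub are hypothetical stand-ins for the real client implementation, which would send a lookup request to the server for each component.

```c
/* Conceptual sketch of component-at-a-time lookup, in the spirit of the
 * three steps above. fhandle_t and nfs_lookup() are hypothetical stand-ins.
 */
#include <stdio.h>
#include <string.h>

typedef struct { unsigned char opaque[32]; } fhandle_t;   /* opaque to the client */

/* Pretend to ask the server for "name" inside the directory "dir". */
static int nfs_lookup(const fhandle_t *dir, const char *name, fhandle_t *out)
{
    (void)dir;
    printf("LOOKUP \"%s\"\n", name);
    memset(out, 0, sizeof(*out));   /* a real client would fill this from the reply */
    return 0;
}

int main(void)
{
    fhandle_t cur = {{0}};              /* filehandle for "/", obtained at mount time */
    char path[] = "home/thud/stern";    /* components of /home/thud/stern             */

    for (char *comp = strtok(path, "/"); comp != NULL; comp = strtok(NULL, "/")) {
        fhandle_t next;
        if (nfs_lookup(&cur, comp, &next) != 0)
            return 1;                   /* e.g., the component does not exist */
        cur = next;                     /* descend one level                  */
    }
    return 0;
}
```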
File operations are very closely associated with Unix system calls:
read and
write move data to
and from the NFS client, and
getattr and
setattr get or modify the file's
attributes. In a local filesystem, such as UFS, these attributes are
stored in the file's inode, but file attributes are mapped to
whatever system is used by the NFS server. Link operations include
link, which creates a hard link on the server,
and
symlink and
readlink, which
create and read the values of symbolic
links, respectively. Finally,
statfs is a
filesystem operation that returns information about the mounted
filesystem that might be needed by
df, for
example.
Other filesystem operations include mounting and unmounting a
filesystem,
but
these are handled through the
NFS
mountd server rather
than
the
server threads. Mount operations are separated from the NFS protocol
because mount points revolve around pathnames, and pathname syntax is
peculiar to each operating system. Unix and VMS, for example, do not
use the same syntax to specify the path to a file. The mount protocol
is responsible for turning the server's file pathname into
information that NFS can use to locate the file in future operations.
From the preceding descriptions, it is fairly clear how the basic
Unix system
calls map into
NFS RPC calls. It is important to note that the NFS RPC protocol and
the vnode interface are two different things. The vnode interface
defines a set of operating system services that are used to access
all filesystems, NFS or local. Vnodes simply generalize the interface
to file objects. There are many routines in the vnode interface that
correspond directly to procedures in the NFS protocol, but the vnode
interface also contains implementations of operating system services
such as mapping file blocks and buffer cache management.
The NFS RPC protocol is a specific realization of one of these vnode
interfaces. It is used to perform specific vnode operations on remote
files. Using the vnode interface, new filesystem types may be plugged
into the operating system by adding kernel routines that perform the
necessary vnode operations on objects in that filesystem.
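The "pluggable" nature of the vnode interface can be sketched as a table of function pointers, one table per filesystem type; the filesystem-independent layer calls through the table without knowing whether the object is local or remote. The structures and names below are illustrative, not an actual kernel API.

```c
/* Sketch of the idea behind the vnode interface: each filesystem type
 * supplies a table of operations, and the filesystem-independent layer
 * dispatches through it. Illustrative only, not a real kernel API.
 */
#include <stdio.h>

struct vnode;

struct vnodeops {                      /* one table per filesystem type */
    int (*getattr)(struct vnode *vp);
    int (*setattr)(struct vnode *vp, int mode);
};

struct vnode {                         /* one per file object */
    const struct vnodeops *v_op;       /* which filesystem implements me */
    const char            *v_tag;
};

static int ufs_getattr(struct vnode *vp) { printf("UFS getattr on %s\n", vp->v_tag); return 0; }
static int ufs_setattr(struct vnode *vp, int m) { printf("UFS setattr %o on %s\n", m, vp->v_tag); return 0; }
static int nfs_getattr(struct vnode *vp) { printf("NFS GETATTR RPC for %s\n", vp->v_tag); return 0; }
static int nfs_setattr(struct vnode *vp, int m) { printf("NFS SETATTR RPC %o for %s\n", m, vp->v_tag); return 0; }

static const struct vnodeops ufs_ops = { ufs_getattr, ufs_setattr };
static const struct vnodeops nfs_ops = { nfs_getattr, nfs_setattr };

/* The system-call layer is filesystem-independent: a chmod() ends up here. */
static int vn_setattr(struct vnode *vp, int mode) { return vp->v_op->setattr(vp, mode); }
static int vn_getattr(struct vnode *vp)           { return vp->v_op->getattr(vp); }

int main(void)
{
    struct vnode local  = { &ufs_ops, "local file"  };
    struct vnode remote = { &nfs_ops, "remote file" };
    vn_setattr(&local,  0644);      /* dispatches to UFS code       */
    vn_setattr(&remote, 0644);      /* dispatches to the NFS client */
    vn_getattr(&remote);
    return 0;
}
```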
7.2.2. Statelessness and crash recovery
The NFS protocol is stateless, meaning that
there is no need to maintain information about the protocol on the
server. The client keeps track of all information required to send
requests to the server, but the server has no information about
previous NFS requests, or how various NFS requests relate to each
other. Remember the differences between the TCP and UDP protocols:
UDP is a stateless protocol that can lose packets or deliver them out
of order; TCP is a stateful protocol that guarantees that packets
arrive and are delivered in order. The hosts using TCP must remember
connection state information to recognize when part of a transmission
was lost.
The choice of a stateless protocol has two implications for the
design and implementation of NFS:
- NFS RPC requests must completely describe the operation to be
performed. When writing a file block, for example, the
write operation must contain a
filehandle, the offset into the file, and the length of the write
operation. This is distinctly different from the Unix
write( ) system call, which writes a buffer to
wherever the current file descriptor's write pointer directs
it. The state contained in the file descriptor does not exist on the NFS server (see the sketch following this list).
- Most NFS requests are idempotent, which
means that an NFS client may send the
same request one or more times without any harmful side effects. The
net result of these duplicate requests is the same. For example,
reading a specific block from a file is idempotent: the same data is
returned from each operation.
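To make the first point concrete, here is a minimal sketch contrasting the state carried in a Unix file descriptor with a self-describing NFS write request; the request structure shown is illustrative rather than the actual wire format defined by the protocol.

```c
/* Minimal sketch contrasting a stateful Unix write() with a self-contained
 * NFS write request. The request structure is illustrative only.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { unsigned char opaque[32]; } fhandle_t;  /* opaque to the client */

struct nfs_write_request {
    fhandle_t file;        /* which file, with no reference to client-side state */
    uint64_t  offset;      /* where in the file to write                         */
    uint32_t  count;       /* how many bytes                                     */
    char      data[8192];  /* the data itself                                    */
};

int main(void)
{
    /* A local write(fd, buf, len) is meaningless without the per-process
     * file descriptor state (which file, current write offset). An NFS
     * write carries all of that explicitly in every request: */
    struct nfs_write_request req;
    memset(&req, 0, sizeof(req));
    req.offset = 32768;
    req.count  = 4096;
    printf("WRITE %u bytes at offset %llu\n",
           (unsigned)req.count, (unsigned long long)req.offset);
    return 0;
}
```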
Obviously, some operations are not idempotent: removing a file
can't be repeated without side effects, because a second
attempt to remove the file will fail if the first one succeeded. Most
NFS servers make all requests idempotent by recording recently
performed operations. A duplicate request that matches one of the
recently performed requests is thrown away by the NFS
server.[11]
The primary motivation for choosing a stateless protocol was to
minimize the burden of crash recovery. Unlike a database system,
which must verify transaction logs and look for incomplete
operations, NFS has no explicit crash recovery mechanism. Because no
state is maintained, the server may reboot and begin accepting client
NFS requests again as if nothing had happened. Similarly, when
clients reboot, the server does not need to know anything about them.
Each NFS request contains enough information to be completed without
any reference to state on the client or
server.
7.2.3. Request retransmission
NFS RPC requests are sent from a
client
to the server one at a time. A single client process will not issue
another RPC call until the call in progress completes and has been
acknowledged by the NFS server. In this respect NFS RPC calls are
like system calls -- a process cannot continue with the next
system call until the current one completes. A single client host may
have several RPC calls in progress at any time, coming from several
processes, but each process ensures that its file operations are well
ordered by waiting for their acknowledgements. Using the NFS async
threads makes this a little more complicated, but for now it's
helpful to think of each process sending a stream of NFS requests,
one at a time.
When a client makes an RPC request, it sets a
timeout period during which the
server must service and acknowledge it. If the server doesn't
get the request because it was lost along the way, or because the
server is too overloaded to complete the request within the timeout
period, the client
retransmits the request.
Requests are idempotent (if the server has a duplicate request
cache), so no harm is done if the server executes the same request
twice -- when the NFS client receives a second reply to the same RPC request, it simply discards the duplicate.
NFS clients continue to retransmit requests until the request
completes, either with an acknowledgement from the server or an error
from the RPC layer. If an NFS server crashes, clients continue to
repeat the call to the RPC layer (if the NFS filesystem is
hard-mounted, otherwise the RPC timeout error is returned to the
application) until the server reboots and can service them again.
When the server is up again, NFS clients continue as if nothing
happened. NFS clients cannot tell the difference between a server
that has crashed and one that is very slow. This raises some
important issues for tuning NFS servers and networks, which will be
visited in
Section 18.1, "Slow server compensation".
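The retransmission behavior just described can be sketched as a simple loop: send the request, wait for a reply until a timeout expires, and resend if none arrives. The helper functions below are hypothetical stand-ins for the real RPC layer, and the hard/soft distinction is simplified.

```c
/* Sketch of client-side retransmission: send, wait, and resend on timeout.
 * send_request() and await_reply() are hypothetical stand-ins.
 */
#include <stdbool.h>
#include <stdio.h>

static void send_request(int xid)    { printf("send request xid=%d\n", xid); }
static bool await_reply(int timeout) { (void)timeout; return false; /* pretend it timed out */ }

/* Returns true once the server has acknowledged the request. A hard-mounted
 * filesystem retries forever; a soft mount gives up after 'retries' attempts. */
static bool nfs_call(int xid, int timeout, int retries, bool hard)
{
    for (int attempt = 0; hard || attempt < retries; attempt++) {
        send_request(xid);
        if (await_reply(timeout))
            return true;                 /* acknowledged; the caller may proceed */
        printf("timeout, retransmitting xid=%d\n", xid);
    }
    return false;                        /* soft mount: error returned to the app */
}

int main(void)
{
    if (!nfs_call(42, 1 /* second */, 3, false))
        printf("RPC timed out (soft mount behavior)\n");
    return 0;
}
```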
The duplicate request cache on NFS servers
usually contains a few hundred entries
-- the last few seconds (at most) of NFS requests on a busy
server. This cache is limited in size to establish a
"window" in which non-idempotent NFS requests are
considered duplicates caused by retransmission rather than distinct
requests. For example, if you execute:
% rm foo
on an NFS client, the client may need to send two or
more
remove
requests to the NFS server before it receives an acknowledgment.
It's up to the NFS server to weed out the duplicate
remove requests, even if they are a second or so
apart. However, if you execute
rm foo on Monday,
and then on Tuesday you execute the same command in the same
directory (where the file has already been removed), you would be
very surprised if
rm did not return an error.
Executing this "duplicate request" a day later should
produce this familiar error:
% rm foo
rm: foo: No such file or directory
To distinguish between duplicates generated due to an RPC timeout and
retry and duplicates due to you repeating a command (whether it be a
day later or a second later), NFS servers record a 32-bit RPC
transaction identifier (xid) with each entry in
the duplicate request cache. The xid is part of every RPC
request's
header,
and it is expected that the NFS client will generate unique xids.
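The following sketch shows the essence of such a duplicate request cache: a small ring of recently seen xids, consulted before a non-idempotent request is executed. The size and fields are illustrative; real servers keep far more state per entry.

```c
/* Sketch of a server-side duplicate request cache: a small ring of recently
 * seen transaction IDs. A retransmitted request whose xid is still in the
 * cache is dropped rather than re-executed. Sizes and fields are illustrative.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DUPCACHE_SIZE 8          /* real servers keep a few hundred entries */

static uint32_t dupcache[DUPCACHE_SIZE];
static int      dupnext;

static bool seen_recently(uint32_t xid)
{
    for (int i = 0; i < DUPCACHE_SIZE; i++)
        if (dupcache[i] == xid)
            return true;
    dupcache[dupnext] = xid;                 /* remember it, overwriting the oldest */
    dupnext = (dupnext + 1) % DUPCACHE_SIZE;
    return false;
}

int main(void)
{
    uint32_t xids[] = { 1001, 1002, 1002, 1003 };   /* 1002 is a retransmission */
    for (int i = 0; i < 4; i++) {
        if (seen_recently(xids[i]))
            printf("xid %u: duplicate, dropped\n", xids[i]);
        else
            printf("xid %u: executing request\n", xids[i]);
    }
    return 0;
}
```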
7.2.4. Preserving Unix filesystem semantics
The VFS makes all filesystems appear
homogeneous to user
processes. There is a single Unix system call interface that operates
on files, and the VFS and underlying vnode interface translate
semantics of these system calls into actions appropriate for each
type of underlying filesystem. It's important to stress the
difference between
syntax and
semantics of system calls. Consistent syntax
means that the system calls take the same arguments independent of
the underlying filesystem. Semantics refers to what the system calls
actually do: preserving semantics across different filesystem types
means that a system call will have the same net effect on the files
in each filesystem type.
Unix filesystem
semantics collectively refers to the way in which Unix
files behave when various sequences of system calls are made. For
example, opening a file and then unlinking it doesn't cause the
file's data blocks to be released until the
close( ) system call is
made. A new filesystem that wants to
maintain Unix filesystem semantics must support this behavior.
The VFS definition makes it possible to ensure that semantics are
preserved for all filesystems, so they all behave in the same manner
when Unix system calls are made on their files. It is easy to use VFS
to implement a filesystem with non-Unix semantics. It's also
possible to integrate a filesystem into the VFS interface
without supporting all of the Unix
semantics; for example, you can put FAT (a filesystem used in MS-DOS,
Windows, and NT operating systems) filesystems under VFS, but you
can't create Unix-like symbolic links on them because the
native FAT filesystem doesn't support symbolic links.
In this section, we'll look at how NFS deals with Unix
filesystem semantics, including some of the operations that
aren't exactly the same under NFS. NFS has slightly different
semantics than the local Unix filesystem, but it tries to preserve
the Unix semantics. An application that works with a local filesystem
works equally well with an NFS-mounted filesystem and will not be
able to distinguish between the two.
Consistency at the vnode interface level
makes
NFS a powerful tool for creating filesystem hierarchies using many
different NFS servers. The
mount command
requires that a filesystem be mounted on a directory; but directories
are vnodes themselves. An NFS filesystem can be mounted on any vnode,
which means that NFS filesystems can be mounted on top of other NFS
filesystems or local filesystems. This is completely consistent with
the way in which local disks are mounted on local filesystems.
/net may be on the root filesystem, and
/net/host is mounted on top of it. A workstation
configured using NFS can create a view of the filesystems on the
network that best meets its requirements by mounting these
filesystems with a directory naming scheme of its choice.
Maintaining other Unix filesystem semantics is not quite as easy.
Locking operations, for example, introduce state into a system that
was meant to be stateless. This problem is addressed by a separate
lock manager daemon. Another bit of Unix lore that had to be preserved
was the retention of an open file's data blocks, even when the
file's directory entry was removed. Many Unix utilities,
including shells and mailers, use this "delayed unlink"
feature to create temporary files that have no name in the
filesystem, and are therefore invisible to probing users.
A complete solution to the problem would require that the server keep
open file reference counts for each file and not free the
file's data blocks until the reference count decreased to zero.
However, this is precisely the kind of state information that makes
crash recovery difficult, so NFS was implemented with a client-side
solution that handles the common applications of this feature. When a
remove operation is performed on an open file,
the client issues a
rename NFS RPC instead. The
file is renamed to
.nfsXXXX, where
XXXX is a suffix to make the filename unique.
When the file is eventually closed, the client issues the
remove operation on the previously unlinked
file. Note that there is no need for an "open" or
"close" NFS RPC procedure, since "opened" and
"closed" are states that are maintained on the client. It
is still possible to confuse two clients that attempt to unlink a
shared, open NFS-mounted file, since one client will not know that
the other has the file open, but this approach emulates the behavior of a
local filesystem well enough that utilities relying on the delayed
unlink do not need to change.
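The client-side logic just described can be sketched as follows; the helper names, the open-count bookkeeping, and the way the unique suffix is generated are all illustrative rather than any particular implementation.

```c
/* Sketch of the client-side "silly rename": if the file is still open when
 * it is removed, rename it to a hidden .nfsXXXX name and defer the real
 * remove until the last close. Names and helpers are illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

static void nfs_rename(const char *from, const char *to) { printf("RENAME %s -> %s\n", from, to); }
static void nfs_remove(const char *name)                 { printf("REMOVE %s\n", name); }

static char sillyname[64];
static bool silly_pending;

static void client_unlink(const char *name, int open_count)
{
    if (open_count > 0) {
        /* Some process on this client still has the file open. */
        snprintf(sillyname, sizeof(sillyname), ".nfs%04d", 1234);  /* unique suffix */
        nfs_rename(name, sillyname);
        silly_pending = true;
    } else {
        nfs_remove(name);
    }
}

static void client_last_close(void)
{
    if (silly_pending) {
        nfs_remove(sillyname);     /* now it is safe to free the data blocks */
        silly_pending = false;
    }
}

int main(void)
{
    client_unlink("scratchfile", 1);   /* file is open: renamed, not removed       */
    client_last_close();               /* last close: the remove finally happens   */
    return 0;
}
```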
7.2.5. Pathnames and filehandles
All NFS operations use filehandles to designate
the files or directories on which they will be performed. Filehandles
are created on the server and contain information that uniquely
identifies the file or directory on the server. The client's
NFS
mount and
lookup requests retrieve these filehandles for
existing files. A side effect of making all vnodes homogeneous is
that file pathname lookup must be done one component at a time. Each
directory in the pathname might be a mount point for another
filesystem, so each name look-up request cannot include multiple
components. For example, let's look at
Client
A that NFS-mounts the
/usr/local
filesystem and also NFS-mounts a filesystem on
/usr/local/bin:
clientA# mount server1:/usr/local /usr/local
clientA# mount server2:/usr/local/bin.mips /usr/local/bin
When the NFS client reaches the
bin component in
the pathname, it realizes that there is an NFS filesystem mounted on
this directory, and it sends its lookup requests to
server2 instead of
server1.
If the NFS client passed the whole pathname to
server1, it might get the wrong answer on its
lookup:
server1 has its own
/usr/local/bin directory that may or may not be
the same directory that
Client A has mounted.
While this may seem to be a very expensive series of operations, the
kernel keeps a directory name lookup cache (DNLC) that prevents every
look-up request from going to an NFS server.
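The DNLC can be pictured as a small table mapping a (directory, component name) pair to a previously resolved result, consulted before any lookup is sent to the server. The sketch below is conceptual; the structure and integer "vnode" stand-ins are not an actual kernel implementation.

```c
/* Sketch of a directory name lookup cache (DNLC): consult a small table
 * before sending a LOOKUP to the server. Illustrative only.
 */
#include <stdio.h>
#include <string.h>

struct dnlc_entry {
    int  dir_id;          /* stands in for the directory vnode      */
    char name[64];        /* pathname component                     */
    int  result_id;       /* stands in for the vnode it resolves to */
};

#define DNLC_SIZE 4
static struct dnlc_entry dnlc[DNLC_SIZE];

static int dnlc_lookup(int dir_id, const char *name)
{
    for (int i = 0; i < DNLC_SIZE; i++)
        if (dnlc[i].dir_id == dir_id && strcmp(dnlc[i].name, name) == 0)
            return dnlc[i].result_id;          /* hit: no RPC needed            */
    return -1;                                 /* miss: caller must do a LOOKUP */
}

static void dnlc_enter(int slot, int dir_id, const char *name, int result_id)
{
    dnlc[slot].dir_id = dir_id;
    snprintf(dnlc[slot].name, sizeof(dnlc[slot].name), "%s", name);
    dnlc[slot].result_id = result_id;
}

int main(void)
{
    dnlc_enter(0, 1, "bin", 7);   /* remember that "bin" under directory 1 is vnode 7 */
    printf("lookup bin: %d\n", dnlc_lookup(1, "bin"));    /* hit            */
    printf("lookup lib: %d\n", dnlc_lookup(1, "lib"));    /* miss -> LOOKUP */
    return 0;
}
```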
The
lookup operation takes a filename and a
filehandle for a directory, and returns a filehandle pointing to the
named file on the server. How then does the pathname traversal get
started, if every
lookup requires a filehandle
from a previous pathname resolution? The
mount
operation seeds the lookup process by providing a filehandle for the
root of the mounted filesystem. Within NFS, the only procedure that
accepts full pathnames is the
mount RPC, which
turns the pathname into a filehandle for the mounted filesystem.
Let's look at how NFS turns the pathname
/usr/local/bin/emacs into an NFS filehandle,
assuming that it's on a filesystem mounted on
/usr/local from server
wahoo:
- The NFS client asks the mountd daemon on
wahoo for a filehandle for the filesystem the
client has mounted on /usr/local, using the
server's pathname that was supplied in the
/etc/vfstab file or mount
command. That is, if the client has mounted
/usr/local with the
/etc/vfstab entry:
wahoo:/tools/local - /usr/local nfs - yes ro,hard
then the client will ask wahoo for a filehandle
for the /tools/local directory.[12]
- Using the mount point filehandle, the client performs a lookup
operation on the next component in the pathname:
bin. It sends a lookup to
wahoo, supplying the filehandle for the
/usr/local directory and the name
"bin." Server wahoo returns another
filehandle for this directory.
- The client goes to work on the next component in the path,
emacs. Again, it sends a
lookup using the filehandle for the directory
containing emacs and the name it is looking for.
The filehandle returned by the server is used by the client as a
"pointer" (on the server) to
/usr/local/bin/emacs (in the filesystem seen by
client) for all future operations on that file.
Filehandles are opaque to the
client. In most NFS implementations on
Unix machines, they are an encoding of the file's inode number,
disk device number, and inode generation number. Other
implementations, particularly non-Unix NFS servers that do not have
inodes, encode their own native filesystem information in the
filehandle. In any system, the filehandle is in a form that can be
disassembled only on the NFS server. The structures contained in the
filehandle are kept hidden from the client, the same way the
structures in an object-oriented system are hidden in the
object's implementation routines. In the case of NFS
filehandles, the data described by the structure doesn't even
exist on the client -- it's all on the server, where the
filehandle can be converted into a pointer to a local file.
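A sketch of the kind of information a Unix server might pack into a filehandle appears below: a filesystem (or device) identifier, the inode number, and the inode generation number. The layout is illustrative; real filehandles are server-specific and deliberately opaque to clients.

```c
/* Sketch of a server packing identifying information into an opaque
 * filehandle. The layout is illustrative, not any particular server's.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { unsigned char data[32]; } fhandle_t;   /* what the client sees */

struct fh_contents {                                    /* what the server packs */
    uint32_t fsid;          /* which exported filesystem / device */
    uint32_t inode;         /* the file's inode number            */
    uint32_t generation;    /* inode generation number            */
};

static fhandle_t fh_encode(uint32_t fsid, uint32_t inode, uint32_t gen)
{
    fhandle_t fh;
    struct fh_contents c = { fsid, inode, gen };
    memset(&fh, 0, sizeof(fh));
    memcpy(fh.data, &c, sizeof(c));        /* only the server knows this layout */
    return fh;
}

int main(void)
{
    fhandle_t fh = fh_encode(3, 182354, 9);
    struct fh_contents c;
    memcpy(&c, fh.data, sizeof(c));        /* the server disassembles it later */
    printf("fsid=%u inode=%u generation=%u\n", c.fsid, c.inode, c.generation);
    return 0;
}
```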
Filehandles become invalid, or stale,
when
the
inodes to which they point (on the server) are freed or re-used. NFS
clients have no way of knowing what other operations may be affecting
objects pointed to by their filehandles, so there is no way to warn a
client in advance that a filehandle is invalid. If an RPC call is
made with a filehandle that is stale, the NFS server returns a
stale filehandle error to the
caller. Say that a user on one client
removes an NFS-mounted directory and its contents using
rm
-rf test, while another client has a process using
test as its current working directory. The next
time the other process tries to read its working directory, it gets a
stale filehandle error back from the NFS server:
| Client A | Client B |
| --- | --- |
| cd /mnt/test | cd /mnt |
| | rm -rf test |
| stat(.) --> Stale file handle | |
If one client removes a file and then creates a new file that re-uses
the freed inode, other filehandles (on other clients) that point to
the re-used inode must be marked stale. Inode generation numbers were
added to the basic Unix filesystem to add a time history to an inode.
In addition to the inode number, the filehandle must match the
current generation number of the inode, or it is marked stale. When
the inode is re-used for a new file, its generation number is
incremented. Stale filehandles become a problem when one user's
work tramples on an area in use by another, or when a filesystem on a
server is rebuilt from a backup tape. When restoring from a dump tape
onto a fresh filesystem, all of the inode generation numbers in the
filesystem are set to random numbers. This causes every filehandle in
use for that filesystem to become stale -- every inode pointed
to by a pre-restore filehandle now probably points to a completely
different file on the disk.
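The generation-number check itself is simple, as the sketch below shows: the server decodes the filehandle, finds the inode, and compares generation numbers, returning the protocol's stale filehandle error on a mismatch. The structures are illustrative; only the error value is taken from the protocol.

```c
/* Sketch of a server detecting a stale filehandle: the generation number in
 * the handle no longer matches the one stored in the inode, so the handle
 * refers to a file that has been removed or replaced. Structures are
 * illustrative.
 */
#include <stdint.h>
#include <stdio.h>

struct fh_contents { uint32_t fsid, inode, generation; };  /* decoded filehandle */
struct inode       { uint32_t generation; int allocated; };

#define NFSERR_STALE 70     /* "stale filehandle" error in the NFS protocol */

static int check_handle(const struct fh_contents *fh, const struct inode *itab)
{
    const struct inode *ip = &itab[fh->inode];
    if (!ip->allocated || ip->generation != fh->generation)
        return NFSERR_STALE;    /* inode freed or re-used since the handle was issued */
    return 0;
}

int main(void)
{
    struct inode itab[4] = { {0, 0}, {5, 1}, {6, 1}, {0, 0} };
    struct fh_contents current = { 0, 1, 5 };   /* matches inode 1, generation 5     */
    struct fh_contents stale   = { 0, 2, 3 };   /* inode 2 has been re-used (gen 6)  */

    printf("current handle: %d\n", check_handle(&current, itab));  /* 0            */
    printf("stale handle:   %d\n", check_handle(&stale, itab));    /* NFSERR_STALE */
    return 0;
}
```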
Therefore, a quick way to cripple an NFS network
is
to restore a fileserver from a dump
tape without rebooting the NFS clients. When you rebuild the
server's filesystems, all of the inode generation numbers are
reset; when you load the tape, files end up with different inode
numbers and different inode generation numbers than they had on the
original filesystem. All NFS client filehandles are now invalid
because of the new generation numbers and the (random) renumbering of
each file's inode. Any attempt to use an open filehandle
results in stale filehandle errors. If you are going to restore an
NFS-exported filesystem
from tape, unmount it from its clients
or reboot
the
clients.
7.2.6. NFS Version 3
There are three versions of the NFS protocol in use or specified:
Versions 2, 3, and 4. Version 1 did exist, but it was only a prototype,
and neither an implementation nor a specification was ever released.
Version 4 has been specified, but at the time this book was written,
there were no commercial implementations. Version 3 has three major
differences from Version 2:
- Large file support
- Version 2 supported files up to
four gigabytes in length, though most
implementations are limited to up to two-gigabyte files. Version 3
supports files up to and including 2^64 - 1
bytes in length. Large file support was the primary driver for a
protocol revision.
- Writes to unstable storage
- Version 2 of the NFS protocol specified that NFS servers could not reply
successfully to a write request until the data
had been committed to stable storage, usually magnetic disk, but
non-volatile RAM was permissible as well. This limited the write
throughput of NFS clients, and so Version 3 of the protocol permits
the client to indicate that the write need not
be committed to stable storage. This allows NFS servers to respond
quickly to write requests. Of course, clients
are still interested in committing their data to stable storage, and
so Version 3 has a new procedure called commit,
which tells the NFS server to write the uncommitted data to stable
storage before returning success.
The theory behind this, supported by experimental measurement, is
that faster throughput is gained by the NFS server committing data to
stable storage in parallel with the client doing something else (such
as generating more NFS requests), before the client issues the
commit. Typically, the NFS Version 3 client will issue a commit when it
is about to close a file, or when buffer space is tight (a sketch of
this write/commit pattern appears at the end of this section).
- Large transfer sizes
- NFS Version 2 had a limit of 8192 bytes per NFS read and write
request. NFS Version 3 lets the client and server negotiate a
mutually acceptable limit.
Recall from Section 1.3.1, "Datagrams and packets" that packets larger than the
medium's MTU must be fragmented. Fragmentation of output
packets is easy, but the other direction, reassembly of input
fragments, is harder if the fragments arrive out of order, or if a
fragment is dropped or delayed. With larger NFS transfer sizes, the
risk of a reassembly problem is higher, and if there is a problem,
the entire datagram must be retransmitted, including all the
fragments. NFS Version 2 was designed to be gentler to the network
during the days when operating systems, routers, and network hardware
were less capable. Nowadays, these components are much more
effective, and so NFS Version 3 removes the artificial limits to
transfer
size.
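The write/commit pattern described under "Writes to unstable storage" can be sketched as follows; the functions are hypothetical stand-ins for the client's RPC layer, while the stability flags mirror those defined by the Version 3 protocol.

```c
/* Sketch of the NFS Version 3 write/commit pattern: issue WRITEs flagged as
 * unstable, then a single COMMIT before closing the file. The functions are
 * hypothetical stand-ins for the client's RPC layer.
 */
#include <stdint.h>
#include <stdio.h>

enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };  /* as in the v3 protocol */

static void nfs3_write(uint64_t offset, uint32_t count, enum stable_how how)
{
    printf("WRITE  offset=%llu count=%u %s\n", (unsigned long long)offset,
           (unsigned)count, how == UNSTABLE ? "UNSTABLE" : "FILE_SYNC");
}

static void nfs3_commit(uint64_t offset, uint32_t count)
{
    printf("COMMIT offset=%llu count=%u\n", (unsigned long long)offset, (unsigned)count);
}

int main(void)
{
    /* The server may reply to these before the data is on stable storage. */
    for (int i = 0; i < 4; i++)
        nfs3_write((uint64_t)i * 8192, 8192, UNSTABLE);

    /* Before close (or when buffer space is tight), force the data out. */
    nfs3_commit(0, 4 * 8192);
    return 0;
}
```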
7.2.7. NFS over TCP
Both NFS Version 2 and Version 3 operate
over UDP and TCP. Since TCP is
stateful and NFS is stateless, it would seem to be a contradiction,
if not an impossibility, for NFS to operate over TCP. However, the
layer between NFS and TCP is RPC, and RPC is implemented to hide
state issues of TCP from NFS.
The first time an NFS client contacts a server over TCP, the RPC
layer takes care of establishing a connection. If a server crashes,
the client won't know that immediately, but the next time it
sends a request over the connection, the connection will break due to
a connection reset from the server, or a connection timeout. In
either case, the RPC layer simply re-establishes a connection.
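The reconnection behavior can be pictured as a small loop in the RPC layer: if a send on the TCP connection fails because the server rebooted, the layer simply re-establishes the connection and retries, keeping the NFS layer unaware of any of it. The helpers in the sketch below are hypothetical.

```c
/* Sketch of the reconnection behavior described above. The helpers stand in
 * for real TCP and RPC machinery and are purely illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

static bool connected;

static void tcp_connect(void)       { connected = true;  printf("connect to server\n"); }
static bool tcp_send(const char *m)
{
    if (!connected)
        return false;                 /* connection reset or timed out */
    printf("send %s\n", m);
    return true;
}

static void rpc_call(const char *request)
{
    while (!tcp_send(request)) {      /* hide TCP connection state from the NFS layer */
        printf("connection lost, re-establishing\n");
        tcp_connect();
    }
}

int main(void)
{
    rpc_call("GETATTR");              /* first call: triggers the initial connect */
    connected = false;                /* pretend the server crashed and rebooted  */
    rpc_call("READ");                 /* the RPC layer reconnects transparently   */
    return 0;
}
```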
Some NFS/TCP implementations, such as that in Solaris, maintain a
single connection between the NFS client and server, such that all
traffic -- for all users and mount points -- is multiplexed
between the client and server. Other implementations, such as those
in the BSD releases, have one connection per mountpoint. Aside from a
user-level NFS client like a web browser, or a Java application
linked to NFS classes, you are not likely to encounter an NFS client
that creates a connection per user.
If the client crashes, the server will periodically close
connections that
haven't been used in a while. On a Solaris NFS server, this
connection idle
timer defaults to six minutes.