6.5. Replication
Solaris 2.6 introduced the concept of replication to NFS clients. This feature is known as client-side failover.
Client-side failover is useful whenever you have read-only data that
you need to be highly available. An example will illustrate this.
Suppose your user community needs to access a collection of
historical data on the last 200 national budgets of the United
States. This is a lot of data, and so is a good candidate to store on
a central NFS server. However, because your users' jobs depend
on it, you do not want to have a single point of failure, and so you
keep the data on several NFS servers. (Keeping the data on several NFS servers also gives you the opportunity to balance the load.) Suppose
you have three NFS servers, named
hamilton,
wolcott, and
dexter, each
exporting a copy of the data. Then each server might have an entry like
this in its
dfstab file:
share -o ro /export/budget_stats
Now, without client-side failover, each NFS client might have one of
the following
vfstab entries:
hamilton:/export/budget_stats - /stats/budget nfs - yes ro
wolcott:/export/budget_stats - /stats/budget nfs - yes ro
dexter:/export/budget_stats - /stats/budget nfs - yes ro
Suppose an NFS client is mounting /stats/budget from NFS server hamilton, and hamilton stops responding. The user on that client will want to mount a different server. In order to do this, he'll have to do all of the following:
- Terminate any applications that are currently accessing files under the /stats/budget mount point.
- Unmount /stats/budget.
- Edit the vfstab file to point at a different server.
- Mount /stats/budget.
The user might have a problem with the first step, especially if the
application has buffered some unsaved critical information. And the
other three steps are tedious.
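To make the tedium concrete, here is a rough sketch of the manual recovery (after the applications have been stopped), assuming the vfstab entries shown above and root access on the client:
# umount /stats/budget
# vi /etc/vfstab          (change hamilton to wolcott)
# mount /stats/budget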
With client-side failover, each NFS client can have a single entry in the vfstab file such as:
hamilton,wolcott,dexter:/export/budget_stats - /budget_stats nfs - yes ro
This vfstab entry defines a replicated NFS filesystem. When this vfstab entry is mounted, the NFS client will:
- Contact each server to verify that each is responding and exporting /export/budget_stats.
- Generate a list of the NFS servers that are responding and exporting /export/budget_stats, and associate that list with the mount point.
- Pick one of the servers to get NFS service from. In other words, the NFS traffic for the mount point is bound to one server at a time.
As long as the server selected to provide NFS service is responding,
the NFS mount operates as a normal non-client-side failover mount.
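You can verify by hand, before creating the vfstab entry, that every server in the list is exporting the filesystem, mirroring the first step the client performs. This is only a sketch of an administrative sanity check, not something the client itself runs; it assumes each server answers showmount queries:
# for s in hamilton wolcott dexter; do echo $s; showmount -e $s | grep budget_stats; done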
Assuming the NFS client selected server hamilton, if hamilton stops responding, the NFS client will automatically select the next server, in this case wolcott, without requiring that you manually unmount hamilton and mount wolcott. And if wolcott later stops responding, the NFS client will then select dexter. As you might expect, if later on dexter stops responding, the NFS client will bind the NFS traffic back to hamilton. Thus, client-side failover uses a round-robin scheme.
You can tell which server a replicated
mount
is using via the
nfsstat command:
% nfsstat -m
...
/budget_stats from hamilton,wolcott,dexter:/export/budget_stats
Flags:
vers=3,proto=tcp,sec=sys,hard,intr,llock,link,symlink,acl,rsize=32768,wsize=32768,
retrans=5
Failover:noresponse=1, failover=1, remap=1, currserver=wolcott
The
currserver value tells us that NFS traffic for the
/budget_stats mount point is bound to server
wolcott. Apparently
hamilton stopped responding at one point,
because we see non-zero values for the counters
noresponse,
failover and
remap. The counter
noresponse counts the number of times a remote
procedure call to the currently bound NFS server timed out. The
counter failover counts the number of times the NFS client has "failed over" or switched to another NFS server due to a timed-out remote procedure call. The counter remap counts the number of files that were "mapped" to another NFS server after a failover. For example, if an application on the NFS client had /budget_stats/1994/deficit open, and then the client failed over to another server, the next time the application went to read data from /budget_stats/1994/deficit, the open file reference would be re-mapped to the corresponding 1994/deficit file on the newly bound NFS server.
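If you want to watch which server the mount is bound to over time, a simple loop over nfsstat will do. This is only a sketch; it assumes your nfsstat -m accepts a mount point argument, as the Solaris version does, and that a five-minute polling interval is acceptable:
# while :; do date; nfsstat -m /budget_stats | grep currserver; sleep 300; done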
Solaris will also notify you when a failover happens. Expect a
message like:
NOTICE: NFS: failing over from hamilton to wolcott
on both the NFS client's system console and in its
/var/adm/messages file.
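If the console message has scrolled away, you can look for past failover events in the messages file. A minimal check, assuming the notice is logged with the wording shown above:
# grep "failing over" /var/adm/messages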
By the way, it is not required that each server export the same pathname. The mount command will let you mount replica servers with different directories. For example:
# mount -o ro serverX:/q,serverY:/m /mnt
As long as the contents of serverX:/q and serverY:/m are the same, the top-level directory name does not have to be. The next section discusses rules for the content of replicas.
6.5.1. Properties of replicas
Replicas on each server in the replicated
filesystem have to be the same in
content. For example, if on an NFS client we have done:
# mount -o ro serverX,serverY:/export /mnt
then
/export on both servers needs to be an
exact copy. One way to generate such a copy would be:
# rlogin serverY
serverY # cd /export
serverY # rm -rf ../export
serverY # mount serverX:/export /mnt
serverY # cd /mnt
serverY # find . -print | cpio -dmp /export
serverY # umount /mnt
serverY # exit
#
The third command invoked here, rm -rf ../export, is somewhat curious. What we want to do is remove the contents of /export in a manner that is as fast and secure as possible. We could do rm -rf /export, but that has the side effect of removing
/export as well as its contents. Since
/export is exported, any NFS client that is
currently mounting
serverY:/export will
experience stale filehandles (see
Section 18.8, "Stale filehandles"). Recreating
/export
immediately with the
mkdir command
does not suffice because of the way NFS servers generate filehandles
for clients. The filehandle contains, among other things, the inode number (a file's or directory's unique identification number), and the inode number of the recreated directory is almost guaranteed to be different. So we want to
remove just what is under
/export. A commonly
used method for doing that is:
# cd /export ; find . -print | xargs rm -rf
but the problem there is that if someone has placed a filename like
foo /etc/passwd (i.e., a file with an embedded
space character) in
/export, then the
xargs rm -rf command will remove a file called
foo and a file called
/etc/passwd, which on Solaris may prevent one
from logging into the system. Doing
rm -rf
../export will prevent
/export from
being removed because
rm will not remove the
current working directory. Note that this behavior may vary with
other systems, so test it on something unimportant to be sure.
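If you would rather not rely on that behavior of rm, one possible alternative (a sketch, not part of the original procedure) is to let find hand each top-level name directly to rm, so that embedded whitespace is never re-parsed and /export itself is never passed to rm:
# cd /export ; find . ! -name . -prune -exec rm -rf {} \;
The ! -name . -prune expression stops find at the entries directly under /export, and -exec passes each name to rm as a single argument.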
At any rate, the aforementioned sequence of commands will create a
replica that has the following properties:
- Each regular file, directory, named pipe, symbolic link, socket, and device node in the original has a corresponding object with the same name in the copy.
- The file type of each regular file, directory, named pipe, symbolic link, socket, and device node in the original is the same as that of the corresponding object with the same name in the copy.
- The contents of each regular file, directory, symbolic link, and device node in the original are equal to the contents of the corresponding object with the same name in the copy.
- The user identifier, group identifier, and file permissions of each regular file, directory, named pipe, symbolic link, socket, and device node in the original are equal to the user identifier, group identifier, and file permissions of the corresponding object with the same name in the copy. Strictly speaking, this last property is not mandatory for client-side failover to work, but if, after a failover, the user on the NFS client no longer has access to the file his application was reading, then the application will stop working.
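There is no Solaris tool that checks these properties for you, but a rough content comparison can be made from any client that can mount both replicas. This is only a sketch; the /mnt_x and /mnt_y mount points are made up for the example, and diff -r checks names and file contents but not ownership or permissions:
# mkdir /mnt_x /mnt_y
# mount -o ro serverX:/export /mnt_x
# mount -o ro serverY:/export /mnt_y
# diff -r /mnt_x /mnt_y && echo "names and contents match"
# umount /mnt_x ; umount /mnt_y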
6.5.2. Rules for mounting replicas
In order to use client-side failover, the filesystem must be mounted with the suboptions ro (read-only) and hard. The reason why it has to be mounted read-only is that if NFS clients could write to the replica filesystem, then the replicas would no longer be synchronized, producing the following undesirable effects:
- If another NFS client failed over from one server to the server with the modified file, it would encounter an unexpected inconsistency.
- Likewise, if the NFS client or application that modified the file failed over to another server, it would find that its changes were no longer present.
The filesystem has to be mounted hard because it is not clear what it would mean to mount a replicated filesystem soft. When a filesystem is mounted soft, the client is supposed to return an error from a timed-out remote procedure call. When a replicated filesystem is mounted, after a remote procedure call times out, the NFS client is supposed to try the next server in the list associated with the mount point. These two semantics are at odds, so replicated filesystems must be mounted hard.
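A command-line replica mount therefore looks like the following sketch, which reuses the servers from the earlier example; hard is the default for Solaris NFS mounts, so spelling it out is optional:
# mount -o ro,hard hamilton,wolcott,dexter:/export/budget_stats /budget_stats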
The NFS servers in the replica list must support a common NFS
version. When specifying a replicated filesystem that has some
servers that support NFS Version 3, and some that support just NFS
Version 2, the
mount command will fail with the
error "replicas must have the same version." Usually,
though, the NFS servers that support Version 3 will also support
Version 2. Thus, if you are happy with using NFS Version 2 for your
replicated filesystem, then you can force the mount to succeed by
specifying the
vers=2 suboption. For example:
# mount -o vers=2 serverA,serverB,serverC:/export /mnt
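If you are not sure which NFS versions a given server supports, you can ask its rpcbind before mounting. This is just a quick check; the output shown is typical of rpcinfo -p, where program number 100003 is NFS:
# rpcinfo -p serverB | grep 100003
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
Here serverB registers both Version 2 and Version 3, over both transports.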
Note that it is not a requirement that all the NFS servers in the
replicated filesystem support the same transport
protocol (TCP or
UDP).
6.5.3. Managing replicas
In Solaris, the onus for creating, distributing, and maintaining
replica filesystems is on the system administrator; there are no
tools to manage replication. The techniques used in the example given
in the
Section 6.5.1, "Properties of replicas", can
be used, although the example script given in that subsection for
generating a replica may cause stale filehandle problems when using
it to update a replica; we will address this in
Section 18.8, "Stale filehandles". You will want to automate the replica
distribution procedure. To do so, you would alter the aforementioned example to:
- Prevent stale filehandles.
- Use the rsh command instead of the rlogin command.
Other methods of distribution to consider are ones that use tools like the rdist and filesync commands.
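For example, if hamilton holds the master copy, a minimal rdist control file might look like the following. This is only a sketch; the host and path names come from the earlier example, and rdist relies on rsh-style trust between the servers:
HOSTS = ( wolcott dexter )
FILES = ( /export/budget_stats )
${FILES} -> ${HOSTS}
        install ;
You would then run rdist -f on hamilton, pointing it at this file, whenever the master copy changes.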