6.5. Replication
Solaris 2.6 introduced the concept of replication to NFS clients. This feature is known as client-side failover.
Client-side failover is useful whenever you have read-only data that
you need to be highly available. An example will illustrate this.
Suppose your user community needs to access a collection of
historical data on the last 200 national budgets of the United
States. This is a lot of data, and so is a good candidate to store on
a central NFS server. However, because your users' jobs depend
on it, you do not want to have a single point of failure, and so you
keep the data on several NFS servers. (Keeping the data on several NFS servers also gives you the opportunity to balance the load.) Suppose
you have three NFS servers, named
hamilton,
wolcott, and
dexter, each
exporting a copy of the data. Then each server might have an entry like
this in its
dfstab file:
share -o ro /export/budget_stats
Now, without client-side failover, each NFS client might have one of
the following
vfstab entries:
hamilton:/export/budget_stats - /stats/budget nfs - yes ro
wolcott:/export/budget_stats - /stats/budget nfs - yes ro
dexter:/export/budget_stats - /stats/budget nfs - yes ro
Suppose an NFS client is mounting /stats/budget from NFS server hamilton, and hamilton stops responding. The user on that client will want to mount a different server. In order to do this, he'll have to do all of the following:
- Terminate any applications that are currently accessing files under the /stats/budget mount point.
- Unmount /stats/budget.
- Edit the vfstab file to point at a different server.
- Mount /stats/budget.
The user might have a problem with the first step, especially if the
application has buffered some unsaved critical information. And the
other three steps are tedious.
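To make the tedium concrete, here is a rough sketch of the manual recovery (after the applications have been stopped), assuming the vfstab entries shown above and root access on the client:
# umount /stats/budget
# vi /etc/vfstab          (change hamilton to wolcott)
# mount /stats/budget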
With client-side failover, each NFS client can have a single entry in the vfstab file such as:
hamilton,wolcott,dexter:/export/budget_stats - /budget_stats nfs - yes ro
This vfstab entry defines a replicated NFS filesystem. When this vfstab entry is mounted, the NFS client will:
- Contact each server to verify that each is responding and exporting /export/budget_stats.
- Generate a list of the NFS servers that are responding and exporting /export/budget_stats, and associate that list with the mount point.
- Pick one of the servers to get NFS service from. In other words, the NFS traffic for the mount point is bound to one server at a time.
As long as the server selected to provide NFS service is responding,
the NFS mount operates as a normal non-client-side failover mount.
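You can verify by hand, before creating the vfstab entry, that every server in the list is exporting the filesystem, mirroring the first step the client performs. This is only a sketch of an administrative sanity check, not something the client itself runs; it assumes each server answers showmount queries:
# for s in hamilton wolcott dexter; do echo $s; showmount -e $s | grep budget_stats; done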
Assuming the NFS client selected server hamilton, if hamilton stops responding, the NFS client will automatically select the next server, in this case wolcott, without requiring that you manually unmount hamilton and mount wolcott. And if wolcott later stops responding, the NFS client will then select dexter. As you might expect, if later on dexter stops responding, the NFS client will bind the NFS traffic back to hamilton. Thus, client-side failover uses a round-robin scheme.
You can tell which server a replicated
mount
is using via the
nfsstat command:
% nfsstat -m
...
/budget_stats from hamilton,wolcott,dexter:/export/budget_stats
Flags:
vers=3,proto=tcp,sec=sys,hard,intr,llock,link,symlink,acl,rsize=32768,wsize=32768,
retrans=5
Failover:noresponse=1, failover=1, remap=1, currserver=wolcott
The
currserver value tells us that NFS traffic for the
/budget_stats mount point is bound to server
wolcott. Apparently
hamilton stopped responding at one point,
because we see non-zero values for the counters
noresponse,
failover and
remap. The counter
noresponse counts the number of times a remote
procedure call to the currently bound NFS server timed out. The
counter failover counts the number of times the NFS client has "failed over" or switched to another NFS server due to a timed-out remote procedure call. The counter remap counts the number of files that were "mapped" to another NFS server after a failover. For example, if an application on the NFS client had /budget_stats/1994/deficit open, and then the client failed over to another server, the next time the application went to read data from /budget_stats/1994/deficit, the open file reference would be re-mapped to the corresponding 1994/deficit file on the newly bound NFS server.
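If you want to watch which server the mount is bound to over time, a simple loop over nfsstat will do. This is only a sketch; it assumes your nfsstat -m accepts a mount point argument, as the Solaris version does, and that a five-minute polling interval is acceptable:
# while :; do date; nfsstat -m /budget_stats | grep currserver; sleep 300; done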
Solaris will also notify you when a failover happens. Expect a
message like:
NOTICE: NFS: failing over from hamilton to wolcott
on both the NFS client's system console and in its
/var/adm/messages file.
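If the console message has scrolled away, you can look for past failover events in the messages file. A minimal check, assuming the notice is logged with the wording shown above:
# grep "failing over" /var/adm/messages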
By the way, it is not required that each server export the same pathname. The mount command will let you mount replica servers with different directories. For example:
# mount -o ro serverX:/q,serverY:/m /mnt
As long as the contents of serverX:/q and serverY:/m are the same, the top-level directory name does not have to be. The next section discusses rules for the content of replicas.
6.5.1. Properties of replicas
Replicas on each server in the replicated
filesystem have to be the same in
content. For example, if on an NFS client we have done:
# mount -o ro serverX,serverY:/export /mnt
then
/export on both servers needs to be an
exact copy. One way to generate such a copy would be:
# rlogin serverY
serverY # cd /export
serverY # rm -rf ../export
serverY # mount serverX:/export /mnt
serverY # cd /mnt
serverY # find . -print | cpio -dmp /export
serverY # umount /mnt
serverY # exit
#
The third command invoked here, rm -rf ../export, is somewhat curious. What we want to do is remove the contents of /export in a manner that is as fast and secure as possible. We could do rm -rf /export, but that has the side effect of removing
/export as well as its contents. Since
/export is exported, any NFS client that is
currently mounting
serverY:/export will
experience stale filehandles (see
Section 18.8, "Stale filehandles"). Recreating
/export
immediately with the
mkdir command
does not suffice because of the way NFS servers generate filehandles
for clients. The filehandle contains, among other things, the inode number (a file's or directory's unique identification number), and the inode number of the recreated directory is almost guaranteed to be different. So we want to
remove just what is under
/export. A commonly
used method for doing that is:
# cd /export ; find . -print | xargs rm -rf
but the problem there is that if someone has placed a filename like
foo /etc/passwd (i.e., a file with an embedded
space character) in
/export, then the
xargs rm -rf command will remove a file called
foo and a file called
/etc/passwd, which on Solaris may prevent one
from logging into the system. Doing
rm -rf
../export will prevent
/export from
being removed because
rm will not remove the
current working directory. Note that this behavior may vary with
other systems, so test it on something unimportant to be sure.
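If you would rather not rely on that behavior of rm, one possible alternative (a sketch, not part of the original procedure) is to let find hand each top-level name directly to rm, so that embedded whitespace is never re-parsed and /export itself is never passed to rm:
# cd /export ; find . ! -name . -prune -exec rm -rf {} \;
The ! -name . -prune expression stops find at the entries directly under /export, and -exec passes each name to rm as a single argument.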
At any rate, the aforementioned sequence of commands will create a
replica that has the following properties:
- Each regular file, directory, named pipe, symbolic link, socket, and device node in the original has a corresponding object with the same name in the copy.
- The file type of each regular file, directory, named pipe, symbolic link, socket, and device node in the original is the same as that of the corresponding object with the same name in the copy.
- The contents of each regular file, directory, symbolic link, and device node in the original are equal to the contents of the corresponding object with the same name in the copy.
- The user identifier, group identifier, and file permissions of each regular file, directory, named pipe, symbolic link, socket, and device node in the original are equal to the user identifier, group identifier, and file permissions of the corresponding object with the same name in the copy. Strictly speaking, this last property is not mandatory for client-side failover to work, but if, after a failover, the user on the NFS client no longer has access to the file his application was reading, then the application will stop working.
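There is no Solaris tool that checks these properties for you, but a rough content comparison can be made from any client that can mount both replicas. This is only a sketch; the /mnt_x and /mnt_y mount points are made up for the example, and diff -r checks names and file contents but not ownership or permissions:
# mkdir /mnt_x /mnt_y
# mount -o ro serverX:/export /mnt_x
# mount -o ro serverY:/export /mnt_y
# diff -r /mnt_x /mnt_y && echo "names and contents match"
# umount /mnt_x ; umount /mnt_y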
6.5.2. Rules for mounting replicas
In order to use client-side failover, the filesystem must be mounted with the suboptions ro (read-only) and hard. The reason why it has to be mounted read-only is that if NFS clients could write to the replica filesystem, then the replicas would no longer be synchronized, producing the following undesirable effects:
- If another NFS client failed over from one server to the server with the modified file, it would encounter an unexpected inconsistency.
- Likewise, if the NFS client or application that modified the file failed over to another server, it would find that its changes were no longer present.
The filesystem has to be mounted hard because it is not clear what it would mean to mount a replicated filesystem soft. When a filesystem is mounted soft, the client is supposed to return an error from a timed-out remote procedure call. When a replicated filesystem is mounted, after a remote procedure call times out, the NFS client is supposed to try the next server in the list associated with the mount point. These two semantics are at odds, so replicated filesystems must be mounted hard.
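A command-line replica mount therefore looks like the following sketch, which reuses the servers from the earlier example; hard is the default for Solaris NFS mounts, so spelling it out is optional:
# mount -o ro,hard hamilton,wolcott,dexter:/export/budget_stats /budget_stats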
The NFS servers in the replica list must support a common NFS
version. When specifying a replicated filesystem that has some
servers that support NFS Version 3, and some that support just NFS
Version 2, the
mount command will fail with the
error "replicas must have the same version." Usually,
though, the NFS servers that support Version 3 will also support
Version 2. Thus, if you are happy with using NFS Version 2 for your
replicated filesystem, then you can force the mount to succeed by
specifying the
vers=2 suboption. For example:
# mount -o vers=2 serverA,serverB,serverC:/export /mnt
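If you are not sure which NFS versions a given server supports, you can ask its rpcbind before mounting. This is just a quick check; the output shown is typical of rpcinfo -p, where program number 100003 is NFS:
# rpcinfo -p serverB | grep 100003
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
Here serverB registers both Version 2 and Version 3, over both transports.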
Note that it is not a requirement that all the NFS servers in the
replicated filesystem support the same transport
protocol (TCP or
UDP).
6.5.3. Managing replicas
In Solaris, the onus for creating, distributing, and maintaining
replica filesystems is on the system administrator; there are no
tools to manage replication. The techniques used in the example given
in the
Section 6.5.1, "Properties of replicas", can
be used, although the example script given in that subsection for
generating a replica may cause stale filehandle problems when using
it to update a replica; we will address this in
Section 18.8, "Stale filehandles". You will want to automate the replica
distribution procedure. To do so, you would alter the aforementioned example to:
- Prevent stale filehandles.
- Use the rsh command instead of the rlogin command.
Other methods of distribution to consider are ones that use tools like the rdist and filesync commands.
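For example, if hamilton holds the master copy, a minimal rdist control file might look like the following. This is only a sketch; the host and path names come from the earlier example, and rdist relies on rsh-style trust between the servers:
HOSTS = ( wolcott dexter )
FILES = ( /export/budget_stats )
${FILES} -> ${HOSTS}
        install ;
You would then run rdist -f on hamilton, pointing it at this file, whenever the master copy changes.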