11.3.2. Examining lock state on NFS/NLM servers
Solaris and other System V-derived systems have a useful tool
called crash for analyzing system state. Crash reads the Unix
kernel's memory and formats its data structures in a more
human-readable form.
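As the session below shows, crash reads /dev/mem and /dev/ksyms by
default, but it also accepts a -d option naming a dump file and a -n
option naming the matching namelist, so the same functions can be run
against a crash dump saved by savecore. A minimal sketch, assuming the
default savecore directory for a server named spike:
spike# crash -d /var/crash/spike/vmcore.0 -n /var/crash/spike/unix.0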
Continuing with the example from Section 11.3.1, "Diagnosing NFS lock hangs",
assume that /export/home/mre is a directory on a UFS filesystem,
which you can verify by doing:
spike# df -F ufs | grep /export
/export (/dev/dsk/c0t0d0s7 ): 503804 blocks 436848 files
You can then use crash to get more lock state.
The crash command is like a shell, but with internal commands for
examining kernel state. The internal command we will be using is
lck:
spike# crash
dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
> lck
Active and Sleep Locks:
INO TYP START END PROC PID FLAGS STATE PREV NEXT LOCK
30000c3ee18 w 0 0 13 136 0021 3 48bf0f8 ae9008 6878d00
30000dd8710 w 0 MAXEND 17 212 0001 3 8f1a48 8f02d8 8f0e18
30001cce1c0 w 193 MAXEND -1 3242 2021 3 6878850 c43a08 2338a38
Summary From List:
TOTAL ACTIVE SLEEP
3 3 0
>
An important field is PROC, the "slot" number of the process holding
the lock. A value of -1 indicates that the lock is held by a nonlocal
process, i.e., an NFS client, in which case the PID field is the
process ID on that NFS client. In the sample display, we see one such
entry:
30001cce1c0 w 193 MAXEND -1 3242 2021 3 6878850 c43a08 2338a38
Note that the process ID, 3242, matches the one that the
pfiles command displayed earlier in this example. We can confirm
that this lock is on the file in question via crash's
uinode command:
> uinode 30001cce1c0
UFS INODE MAX TABLE SIZE = 34020
ADDR MAJ/MIN INUMB RCNT LINK UID GID SIZE MODE FLAGS
30001cce1c0 136, 7 5516985 2 1 466 300 403 f---644 mt rf
>
The inode number matches what pfiles displayed earlier on the NFS
client. However, inode numbers are unique only within a local
filesystem, so we can make doubly sure this is the right file by
comparing the major and minor device numbers from the
uinode command, 136 and 7, with those of the device for the
filesystem mounted on /export:
spike# ls -lL /dev/dsk/c0t0d0s7
brw------- 1 root sys 136, 7 May 6 2000 /dev/dsk/c0t0d0s7
spike#
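For locks held by local processes (PROC values other than -1, in this
example slots 13 and 17), crash's proc function, alias p, prints the
process table entry for a given slot at the crash > prompt. The exact
columns vary by release, so no output is reproduced here; look for the
PID and command name in the entry and match the PID against the one
lck reported (136 for slot 13, 212 for slot 17):
> p 13
> p 17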
11.3.3. Clearing lock state
Continuing with our example from Section 11.3.2, "Examining lock state on NFS/NLM servers",
at this point we know that the file is locked by another NFS client.
Unfortunately, we don't know which client it is, as
crash won't give us that information. We do, however, have a
potential list of clients in the server's
/var/statmon/sm directory:
spike# cd /var/statmon/sm
spike# ls
client1 ipv4.10.1.0.25 ipv4.10.1.0.26 gonzo java
The entries prefixed with ipv4 are just symbolic links to other
entries. The remaining, non-symbolic-link entries identify the hosts
we want to check.
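You can see where the ipv4 entries point with ls -l. A sketch, in
which the owner, date, and link target shown are assumptions for
illustration; your links will point at whichever names the clients
registered under:
spike# ls -l ipv4.10.1.0.25
lrwxrwxrwx   1 daemon   daemon         7 May  8 16:32 ipv4.10.1.0.25 -> client1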
The most likely cause of the lock not being released is that the
holding NFS client has crashed. You can take the list of hosts from
the /var/statmon/sm directory and check whether any of them are
dead, or are not responding due to a network partition. Once you
determine which clients are dead, you can use Solaris's
clear_locks command to clear their lock state.
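A quick way to check each host is with ping, and, for hosts that
answer, with rpcinfo to verify that the host's status monitor
(rpc.statd, registered as the status service) is running. A sketch,
using the hostnames from the /var/statmon/sm listing above; the exact
messages vary with the cause of the failure:
spike# ping gonzo
no answer from gonzo
spike# rpcinfo -u client1 status
program 100024 version 1 ready and waiting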
Let's suppose you determine that
gonzo is
dead. Then you would do:
spike# clear_locks gonzo
If clearing the lock state of dead clients doesn't fix the
problem, then perhaps a now-live client crashed but, for some reason,
its status monitor did not notify the NLM server's status monitor
after the reboot. You can log onto the live clients and check whether
they are currently mounting the filesystem from the server (in our
example, spike:/export). If they are not, then you should consider
using clear_locks to clear any residual lock state those clients
might have had.
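A sketch of that check, using df on one of the live clients from the
/var/statmon/sm listing; the client name java and the grep pattern
are illustrations only, and any command that lists the client's
current NFS mounts will do. If the grep prints nothing, that client
is not mounting the filesystem, and you can clear its residual state
from the server:
java# df -k | grep spike:/export
java#
spike# clear_locks java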
Ultimately, you may be forced to reboot your server. Short of that,
there are other things you could do. Since you know the inode number
and the filesystem of the file in question, you can determine the
file's name:
spike# cd /export
spike# find . -inum 5516985 -print
./home/mre/database
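On a large filesystem, find can take quite a while; the ncheck
command, which generates a list of path names and inode numbers
directly from the device, may be faster. A sketch, assuming the
output format of your release; ncheck reports paths relative to the
root of that filesystem, so the path below corresponds to
/export/home/mre/database:
spike# ncheck -F ufs -i 5516985 /dev/dsk/c0t0d0s7
/dev/dsk/c0t0d0s7:
5516985 /home/mre/database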
You could rename the file database to something else and copy it
back to a file named database. Then kill and restart the
SuperApp application on client1. Of course, such an approach
requires intimate knowledge of, or experience with, the application
to know whether this will be safe.
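A sketch of that rename-and-copy sequence; the .locked suffix is an
arbitrary choice. Because mv within the same filesystem keeps the
same inode, the stale lock stays attached to database.locked, while
cp -p creates a new inode for database, preserving owner, mode, and
times but carrying no NLM lock state:
spike# cd /export/home/mre
spike# mv database database.locked
spike# cp -p database.locked database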