Troubleshooting (Managing NFS and NIS, 2nd Edition)

8.6. Troubleshooting

When diskless clients refuse to boot, they do so rather emphatically. Shuffling machines and hostnames to accommodate changes in personnel increases the likelihood that a diskless machine will refuse to boot. Start debugging by verifying that hostnames, IP addresses, and Ethernet addresses are all properly registered on boot and NIS servers. The point at which the boot fails usually indicates where to look next for the problem: machines that cannot even locate a boot block may be getting the wrong boot information, while machines that boot but cannot enter single-user mode may be missing their /usr filesystems.

To watch rarpd handle a client's request, kill the running daemon on the boot server and restart it with the -d (debug) option:

# ps -eaf | grep rarpd   
    root   274     1  0   Apr 16 ?        0:00 /usr/sbin/in.rarpd -a
    root  5890  5825  0 01:02:18 pts/0    0:00 grep rarpd
# kill 274
# /usr/sbin/in.rarpd -d -a
/usr/sbin/in.rarpd:[1]  device hme0 ethernetaddress 8:0:20:a0:16:63
/usr/sbin/in.rarpd:[1]  device hme0 address 130.141.14.8
/usr/sbin/in.rarpd:[1]  device hme0 subnet mask 255.255.255.0
/usr/sbin/in.rarpd:[5]  starting rarp service on device hme0 address 8:0:20:a0:16:63
/usr/sbin/in.rarpd:[5]  RARP_REQUEST for 8:0:20:a0:65:8f
/usr/sbin/in.rarpd:[5]  trying physical netnum 130.141.14.0 mask ffffff00
/usr/sbin/in.rarpd:[5]  good lookup, maps to 130.141.14.9
/usr/sbin/in.rarpd:[5]  immediate reply sent

Keep in mind that a daemon started with the -d option usually stays in the foreground, so you won't get a shell prompt back unless you explicitly place it in the background by appending an ampersand (&) to the command invocation.
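
Because the debug output scrolls past in the foreground, it can help to save it to a file and summarize it afterward. The following is a minimal sketch, assuming a POSIX shell and awk and a log in the format shown above. The function name rarp_summary is hypothetical, and the NO-MAPPING label is our own: the sample output shows only a successful lookup, so the sketch simply treats a request with no following "good lookup" line as unanswered.

```shell
# Summarize a saved in.rarpd debug log (format as in the sample output above).
# For each RARP_REQUEST, print the requesting Ethernet address followed by the
# IP address from the next "good lookup" line, or NO-MAPPING if another
# request (or end of file) arrives first.
rarp_summary() {
    awk '
        /RARP_REQUEST for/ {
            if (pending != "") print pending, "NO-MAPPING"
            pending = $NF          # last field is the Ethernet address
            next
        }
        /good lookup, maps to/ && pending != "" {
            print pending, $NF     # last field is the IP address
            pending = ""
        }
        END { if (pending != "") print pending, "NO-MAPPING" }
    ' "$1"
}
```

Run against a log saved from the session above, rarp_summary would be expected to print one line pairing 8:0:20:a0:65:8f with 130.141.14.9. A client that never appears in the summary corresponds to the first question below; a NO-MAPPING line corresponds to the second.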

The two things to look out for when debugging rarpd are:

Does rarpd register a RARP_REQUEST? If it doesn't, there may be a physical network problem, or the server may not be on the same physical network as the client.
Can rarpd map the client's Ethernet address back to an IP address? If not, suspect a bad ethers map, a bad /etc/ethers file, or an /etc/nsswitch.conf file that is not pointing at the right place.
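
One way to test the second point by hand is to look the client's Ethernet address up in an ethers-format file yourself. The sketch below assumes a POSIX shell and awk, and a file with one "address hostname" pair per line; the function names ether_norm and ethers_lookup are hypothetical. It normalizes addresses before comparing them, on the assumption that the same address may be written with or without leading zeros (08:00:20 versus 8:0:20) in different places.

```shell
# Normalize an Ethernet address: lowercase, leading zeros stripped from each
# octet, so 08:00:20:A0:65:8F and 8:0:20:a0:65:8f compare equal.
ether_norm() {
    printf '%s\n' "$1" | awk -F: '{
        out = ""
        for (i = 1; i <= NF; i++) {
            o = tolower($i)
            sub(/^0+/, "", o)
            if (o == "") o = "0"
            out = (i == 1) ? o : out ":" o
        }
        print out
    }'
}

# ethers_lookup ADDRESS FILE
# Print the hostname that FILE gives for ADDRESS; return non-zero when the
# file has no entry for it. Blank lines and comment lines are skipped.
ethers_lookup() {
    want=$(ether_norm "$1")
    while read -r addr host rest; do
        case $addr in ''|'#'*) continue ;; esac
        if [ "$(ether_norm "$addr")" = "$want" ]; then
            printf '%s\n' "$host"
            return 0
        fi
    done < "$2"
    return 1
}
```

For example, ethers_lookup 8:0:20:a0:65:8f /etc/ethers checks the local file; the same function could be pointed at a saved dump of the NIS ethers map. A failed lookup in the source that /etc/nsswitch.conf actually consults would be consistent with rarpd's being unable to answer.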

By enabling debug mode in bootparamd on the server, you can see the hostname, addresses, and pathnames given to the diskless client. You can turn on bootparamd debugging by killing it on the server and starting it again with the -d option:

# ps -eaf | grep bootparamd 
    root   276     1  0   Apr 16 ?        0:00 /usr/sbin/rpc.bootparamd
    root  5878  5825  0 00:33:27 pts/0    0:00 grep bootparamd
 
# kill 276 
# rpc.bootparamd -d 
in debug mode.
msg 1:  group =  260   mib_id =     0   length = 128
msg 2:  group =  261   mib_id =     0   length = 132
msg 3:  group = 1025   mib_id =     0   length = 36
msg 4:  group = 1026   mib_id =     0   length = 64
msg 5:  group =  260   mib_id =    20   length = 144
msg 6:  group =  260   mib_id =   100   length = 88
msg 7:  group = 1026   mib_id =     1   length = 0
msg 8:  group = 1026   mib_id =     2   length = 0
msg 9:  group =  260   mib_id =    21   length = 2464
msg 10:  group =  260   mib_id =    22   length = 360
mibget getmsg(  ) 11 returned EOD (level 0, name 0)
interface_addr = 130.141.14.8.
interface_mask = 255.255.255.0
22 records for ipRouteEntryTable
Whoami returning name = honeymoon, router address = 130.141.14.253
getfile_1: file is "honeymoon" 130.141.14.8 "/export/root/honeymoon"

The messages that start with msg are the results of asking the IP layer for Simple Network Management Protocol (SNMP) Management Information Base (MIB) information. The bootparamd daemon makes this inquiry to find the IP address of the best router for the diskless client. Only the messages with group = 260 matter for this purpose, and of those, only the ones with a mib_id of 0 or 20. Normally both kinds appear; if one is missing, the server's network configuration may be at fault. When there are no problems, the debug output shows a router address for the client, as in the Whoami line.

The getfile_1 message is simply reporting that it knows where the client's root filesystem is. Note the IP address is the same as the server's interface, which means that the NFS server for the client is the same as the bootparamd server.
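
When the debug output is long, the two lines that matter can be pulled out of a saved copy. This is a minimal sketch, assuming a POSIX shell and sed and the exact line formats shown in the sample output; the function name bootparam_summary and the field labels it prints are hypothetical. It does not interpret the quoted name on the getfile_1 line, it only reports it.

```shell
# bootparam_summary FILE
# Extract the Whoami and getfile_1 lines from saved rpc.bootparamd debug
# output and print them as labeled fields. Other lines are ignored.
bootparam_summary() {
    sed -n \
        -e 's/^Whoami returning name = \([^,]*\), router address = \(.*\)$/whoami name=\1 router=\2/p' \
        -e 's/^getfile_1: file is "\([^"]*\)" \([^ ]*\) "\([^"]*\)"$/getfile name=\1 addr=\2 path=\3/p' \
        "$1"
}
```

Comparing the addr field with the interface_addr line in the same output is a quick way to confirm whether the bootparamd server is also the client's NFS server, as it is in the example above.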

If the server shows strange boot parameters passed to the client, check that the server's /etc/bootparams file is correct, and that the boot server's NIS server has up-to-date maps.

If the boot parameters received by the client are incorrect, check that the server answering the request for them has current information. Because requests are broadcast to bootparamd, the server that can reply in the shortest time supplies the information. If the client refuses to boot at all, it may complain with one of the following messages:

null domain name 
invalid domain name 
invalid boot parameters

Also ensure that the boot server exports the client's root and swap filesystems with the proper root mapping and access restrictions. In /etc/dfs/dfstab, both the root and swap filesystems should have the options:

rw=client,root=client

to limit access to the diskless client and to allow the superuser to write to the filesystems. If the swap filesystem is not exported so that root can write to it, the diskless client will not be able to start the init process to begin the single-user boot.
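
The export options can be checked mechanically. The sketch below assumes that the dfstab entries are ordinary share command lines with the options after -o and the pathname as the last field, for example share -F nfs -o rw=honeymoon,root=honeymoon /export/root/honeymoon. The function name share_opts_ok is hypothetical. It needs a POSIX awk with -v and user-defined functions, which on older Solaris systems may mean nawk rather than awk.

```shell
# share_opts_ok FILE PATH CLIENT
# Succeed only if FILE has a share line for PATH whose -o option list names
# CLIENT in both rw= and root=. Colon-separated host lists are accepted.
# A bare "rw" with no host list is deliberately NOT accepted.
share_opts_ok() {
    awk -v path="$2" -v client="$3" '
        # True when opt has the form key=<hosts> and <hosts> includes client.
        function lists(opt, key,    v, h, n, k) {
            if (index(opt, key "=") != 1) return 0
            v = substr(opt, length(key) + 2)
            n = split(v, h, ":")
            for (k = 1; k <= n; k++) if (h[k] == client) return 1
            return 0
        }
        $1 == "share" && $NF == path {
            found = 1; rw = 0; root = 0; opts = ""
            for (i = 2; i < NF; i++) if ($i == "-o") opts = $(i + 1)
            n = split(opts, o, ",")
            for (j = 1; j <= n; j++) {
                if (lists(o[j], "rw")) rw = 1
                if (lists(o[j], "root")) root = 1
            }
            ok = (rw && root)
        }
        END { exit !(found && ok) }
    ' "$1"
}
```

A check such as share_opts_ok /etc/dfs/dfstab /export/swap/honeymoon honeymoon || echo "check swap export" flags the case described above, in which the swap filesystem is exported without root access for the client. The file shows only what will be shared at the next boot or shareall, so this complements rather than replaces a look at what the server is exporting at the moment.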

8.6.4. Missing /usr

After setting the host and domain names and configuring network interfaces in the boot process, a machine mounts its /usr filesystem. If there are problems with /usr, the boot process either hangs or fails at the first reference to the /usr filesystem. The two most common problems are not being able to locate the NFS server for /usr and attempting to mount the wrong /usr.

NIS cannot be started until after /usr is mounted, since client-side daemons like ypbind live in /usr. Generally, /usr is mounted from the boot server, so a diskless client needs its own name and its server's hostname in its /etc/hosts. If /usr is not mounted from the root/swap filesystem server, the /usr server's hostname must appear in the local hosts file as well. You may need as many as four different entries in the "runt" /etc/hosts file on a diskless client: its hostname, a localhost entry, the boot server's name, and the name of the /usr server.
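
Since the client's /etc/hosts lives in its root filesystem on the boot server, the required entries can be checked from the server side. The following is a sketch under that assumption; the function name hosts_missing is hypothetical, and it assumes a POSIX shell and awk.

```shell
# hosts_missing FILE NAME...
# Print each NAME that does not appear as a hostname or alias in the
# hosts-format FILE. Comments are ignored. Prints nothing if all are present.
hosts_missing() {
    file=$1
    shift
    for name in "$@"; do
        awk -v n="$name" '
            { sub(/#.*/, "") }                                  # drop comments
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 } # skip address field
            END { exit !found }
        ' "$file" || printf '%s\n' "$name"
    done
}
```

With the pathnames used elsewhere in this section, hosts_missing /export/root/honeymoon/etc/hosts honeymoon localhost wahoo would list any of the three names that the client cannot resolve before NIS starts; add the /usr server's name to the list if it differs from the boot server.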

Heterogeneous client/server environments create another set of problems. Clients of different architectures need their own /usr filesystems with executables built for the client's CPU, not the server's. The most obvious problem is when the client mounts the wrong /usr. If the executables on it were built for a different CPU, then the first attempt to invoke one of them produces a fairly descriptive error. However, if the /usr/platform directory is for the correct CPU architecture but doesn't contain the right kernel architecture (for example, Sun's sun4u and sun4m variants), then the client boots, but certain Unix utilities will not work. Processes that read the kernel or user address spaces, such as crash, are the most likely to break.

If you suspect that you're mounting the wrong /usr, first check the client's /etc/vfstab file to see where it gets /usr:

wahoo:/export/root/honeymoon                 - /         nfs - - rw
wahoo:/export/swap/honeymoon                 - /dev/swap nfs - - -
wahoo:/export/exec/Solaris_2.7_sparc.all/usr - /usr      nfs - - ro

In this example, we would check /export/exec/Solaris_2.7_sparc.all/usr on the server wahoo. The directories in /export/exec have names with this format: Solaris_<release>_<architecture>. If the client and the server are of the same CPU architecture and are running the same release of the operating system, the usr subdirectory in /export/exec/Solaris_<release>_<architecture> is a symbolic link to the server's /usr directory.
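
The same lookup can be scripted. This is a minimal sketch, assuming a vfstab laid out as in the example above (resource in the first field, mount point in the third, filesystem type in the fourth) and directory names of the form Solaris_<release>_<architecture>; the function names usr_source and exec_release_arch are hypothetical.

```shell
# usr_source FILE
# Print "server path" for the NFS /usr entry in a vfstab-format FILE.
usr_source() {
    awk '$1 !~ /^#/ && $3 == "/usr" && $4 == "nfs" {
        i = index($1, ":")
        print substr($1, 1, i - 1), substr($1, i + 1)
    }' "$1"
}

# exec_release_arch PATH
# Print "release architecture" as encoded in a pathname of the form
# .../Solaris_<release>_<architecture>/usr; print nothing if PATH does not
# match that pattern.
exec_release_arch() {
    printf '%s\n' "$1" |
        sed -n 's|.*/Solaris_\([^_/]*\)_\([^/]*\)/usr$|\1 \2|p'
}
```

For the vfstab shown above, usr_source reports wahoo and /export/exec/Solaris_2.7_sparc.all/usr, and exec_release_arch splits that path into 2.7 and sparc.all, which can then be compared with what the client is supposed to be running.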

If the client and server do not have the same release and CPU architectures, the directories in /export/exec contain complete operating system releases.

Three things can go wrong with this link-and-directory scheme:

The links /export/exec/*/usr point to the wrong place. This is possible if you changed the architecture of the server but restored /export from a backup tape. Make sure that Solaris_2.7_sparc.all/usr links point to /usr only if the server is a SPARC running Solaris 7. You'll get "exec format" errors if you mount a /usr of the wrong architecture on the client.
The /export/exec/* directories referenced by the clients don't exist. This is possible if you added a client of a new, different CPU architecture but did not install the appropriate operating system software for it. If you try to mount a directory that doesn't exist, you should see "cannot mount root" errors on the client.
The client may have the wrong mount point listed in its /etc/vfstab file. If you did not specify the architecture of the client correctly when using the AdminSuite software, the client's vfstab file is likely to contain the wrong mount information.
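
The first two failure modes can be spotted on the server by walking the directories under /export/exec. The sketch below is one way to do that, assuming a POSIX shell; the function name check_exec_usr and the wording of its report are hypothetical. It reads link targets from ls -ld output rather than relying on a readlink command, which may not be present on every system.

```shell
# check_exec_usr DIR
# For every subdirectory of DIR, report whether its usr entry is a symbolic
# link (and where it points), a real directory, or missing.
check_exec_usr() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue        # glob matched nothing
        u=${d%/}/usr
        if [ -L "$u" ]; then
            target=$(ls -ld "$u" | sed 's/.* -> //')
            printf '%s link -> %s\n' "$u" "$target"
        elif [ -d "$u" ]; then
            printf '%s directory\n' "$u"
        else
            printf '%s MISSING\n' "$u"
        fi
    done
}
```

Running check_exec_usr /export/exec on the boot server gives one line per installed release. A link to /usr is appropriate only for the entry that matches the server's own release and architecture; every other entry that clients reference should be a populated directory.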

If you are unsure of how a mount and link combination will work, experiment on another diskless client having the same architecture. For example, mount /export/exec/Solaris_2.7_sparc.all/usr on /mnt, and then try a sample command to be sure you've mounted the right one:

client# mount wahoo:/export/exec/Solaris_2.7_sparc.all/usr /mnt 
client# cd /mnt
client# /mnt/bin/ls 
4lib        dict        krb5        oasys       sbin        ucblib
5bin        dist        kvm         old         share       vmsys
X           dt          lib         openwin     snadm       xpg4
adm         games       lost+found  platform    spool
aset        include     mail        preserve    src
bin         java        man         proc        tmp
ccs         java1.1     net         pub         ucb
demo        kernel      news        sadm        ucbinclude

If the commands execute properly, then you should be able to mount /usr safely on the diskless client in question.
