8.6. Troubleshooting
When diskless clients refuse to boot, they do so
rather emphatically. Shuffling
machines and hostnames to accommodate changes in personnel increases
the likelihood that a diskless machine will refuse to boot. Start
debugging by verifying that hostnames, IP addresses, and Ethernet
addresses are all properly registered on boot and NIS servers. The
point at which the boot fails usually indicates where to look next
for the problem: machines that cannot even locate a boot block may be
getting the wrong boot information, while machines that boot but
cannot enter single-user mode may be missing their
/usr filesystems.
8.6.1. Missing and inconsistent client information
A few kinds of missing host
information are easily
tracked down. If a client tries to boot but gets no RARP response,
check that the NIS
ethers map or the
/etc/ethers files on the boot servers contain an
entry for the client with the proper MAC address. A client reports
RARP failures by complaining that it cannot get its IP address.
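The file-based half of this check can be scripted. The sketch below assumes the usual ethers format (Ethernet address, whitespace, hostname) and a POSIX awk (on Solaris, nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk). The function name ethers_lookup is illustrative and not part of any Solaris tool. It compares addresses after lowercasing them and stripping leading zeros from each octet, so that 08:00:20:A0:65:8F and 8:0:20:a0:65:8f are treated as the same address.

```shell
#!/bin/sh
# ethers_lookup FILE MAC
# Print the hostname registered for MAC in an ethers-format FILE.
# Returns 1 if no entry matches. (Hypothetical helper, for illustration.)
ethers_lookup() {
    awk -v want="$2" '
        # Normalize a MAC address: lowercase, no leading zeros per octet.
        function norm(mac,    n, parts, i, o, out) {
            n = split(tolower(mac), parts, ":")
            out = ""
            for (i = 1; i <= n; i++) {
                o = parts[i]
                sub(/^0+/, "", o)
                if (o == "") o = "0"
                out = out (i > 1 ? ":" : "") o
            }
            return out
        }
        BEGIN { want = norm(want) }
        /^[ \t]*#/ || NF < 2 { next }          # skip comments and short lines
        norm($1) == want { print $2; found = 1; exit }
        END { exit found ? 0 : 1 }
    ' "$1"
}
```

On a boot server this might be run as ethers_lookup /etc/ethers 8:0:20:a0:65:8f. For the NIS side, ypmatch honeymoon ethers shows the entry that the map holds for the client.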
Diskless clients that boot part-way but hang after mounting their
root filesystems may have
/etc/hosts files that
do not agree with the NIS
ethers or
hosts maps. It's also possible that the
client booted using one name and IP address combination, but chose to
use a different name while going through the single-user boot
process. Check the boot scripts to be sure that the client is using
the proper hostname, and also check that its local
/etc/hosts file agrees with the NIS maps.
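One way to compare the two sources is to dump the NIS map to a file and look the client up in both. This is a minimal sketch, assuming hosts-format input (address first, then names) and a POSIX awk; host_ip and hosts_agree are hypothetical helper names.

```shell
#!/bin/sh
# host_ip FILE NAME
# Print the address of the first entry in a hosts-format FILE that lists
# NAME as its canonical name or as an alias.
host_ip() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                     # drop comments
        NF < 2 { next }
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$1"
}

# hosts_agree LOCAL NISDUMP NAME
# Succeed only if both files map NAME to the same, non-empty address.
hosts_agree() {
    a=$(host_ip "$1" "$3")
    b=$(host_ip "$2" "$3")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example, after ypcat hosts > /tmp/hosts.nis, the command hosts_agree /etc/hosts /tmp/hosts.nis honeymoon returns a nonzero status if the local file and the map disagree about the client's address, or if either one has no entry for it.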
Other less obvious failures may be due
to confusion with the
bootparams map and the
bootparamd daemon. Since the diskless client
broadcasts a request for boot parameters, any host running
bootparamd can answer it, and that server may
have an incorrect
/etc/bootparams file, or it
may have bound to an NIS server with an out-of-date map.
Sometimes when you correct information, things still do not work. The
culprit could be caching. Solaris has a name service cache daemon,
/usr/sbin/nscd, which, if running, acts as a
frontend for some databases maintained in
/etc
or NIS. The
nscd daemon could return stale
information and also stale negative information, such as a failed
lookup of an IP address in the
hosts file or
map. You can re-invoke
nscd with the
-i option to invalidate the cache. See the
manpage for more
details.
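As a small sketch, the invalidation can be wrapped in a function that takes the cache names as arguments. The -i option and the /usr/sbin/nscd path come from the text above; the NSCD variable and the invalidate_cache name are assumptions added here so that the command can be swapped out for a dry run.

```shell
#!/bin/sh
# NSCD may be overridden for a dry run (for example NSCD=echo).
NSCD=${NSCD:-/usr/sbin/nscd}

# invalidate_cache NAME...
# Ask the running nscd to discard each named cache; stop at the first failure.
# (Hypothetical wrapper; must be run as root against a real nscd.)
invalidate_cache() {
    for cache in "$@"; do
        "$NSCD" -i "$cache" || return 1
    done
}
```

After correcting a host entry, invalidate_cache hosts discards both the cached positive answers and the cached failed lookups for that database.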
8.6.2. Checking boot parameters
The
bootparamd daemon returns a fairly large
bundle
of values to a diskless client. In addition to the pathnames used for
root and swap filesystems, the diskless client gets the name of its
boot server and a default route. Depending on how
/etc/nsswitch.conf is set up, the boot server may take
these values from a local /etc/bootparams file instead of
the NIS bootparams map, so ensure that local file copies,
if they are used, match the NIS maps.
Changing the map on the NIS master server will not help a diskless
client if its boot server uses only a local copy of the boot
parameters file.
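To see which sources a boot server consults, read the bootparams line in its name service switch file. The sketch below assumes the standard nsswitch.conf layout (database name followed by a colon, then the sources in order) and a POSIX awk; bootparams_sources is a hypothetical helper name.

```shell
#!/bin/sh
# bootparams_sources FILE
# Print the lookup sources configured for the bootparams database in an
# nsswitch.conf-format FILE, in the order they are consulted.
bootparams_sources() {
    awk '
        { sub(/#.*/, "") }                     # drop comments
        $1 == "bootparams:" {
            $1 = ""                            # remove the database name
            sub(/^[ \t]+/, "")                 # and the gap it leaves behind
            print
            exit
        }
    ' "$1"
}
```

If bootparams_sources /etc/nsswitch.conf prints only files, changes to the NIS map never reach the client. In that case compare the client's line in /etc/bootparams with the output of ypmatch honeymoon bootparams.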
8.6.3. Debugging rarpd and bootparamd
You can debug boot parameter problems
by enabling debugging on the boot
server. Both
rarpd and
bootparamd accept a debug option.
By enabling debugging in
rarpd on the server,
you can see which Ethernet address the client is asking about
and whether rarpd can map it to an IP address. You can turn on
rarpd debugging by killing it on the server and
starting it again with the
-d option:
# ps -eaf | grep rarpd
root 274 1 0 Apr 16 ? 0:00 /usr/sbin/in.rarpd -a
root 5890 5825 0 01:02:18 pts/0 0:00 grep rarpd
# kill 274
# /usr/sbin/in.rarpd -d -a
/usr/sbin/in.rarpd:[1] device hme0 ethernetaddress 8:0:20:a0:16:63
/usr/sbin/in.rarpd:[1] device hme0 address 130.141.14.8
/usr/sbin/in.rarpd:[1] device hme0 subnet mask 255.255.255.0
/usr/sbin/in.rarpd:[5] starting rarp service on device hme0 address 8:0:20:a0:16:63
/usr/sbin/in.rarpd:[5] RARP_REQUEST for 8:0:20:a0:65:8f
/usr/sbin/in.rarpd:[5] trying physical netnum 130.141.14.0 mask ffffff00
/usr/sbin/in.rarpd:[5] good lookup, maps to 130.141.14.9
/usr/sbin/in.rarpd:[5] immediate reply sent
Keep in mind that a daemon started with the
-d option usually stays in the foreground,
so you won't get a shell prompt back unless you explicitly place it
in the background by appending an ampersand (&) to the command
invocation.
The two things to look out for when debugging
rarpd are:
- Does rarpd register a RARP_REQUEST? If it
  doesn't, this could indicate a physical network problem, or the
  server is not on the same physical network as the client.
- Can rarpd map the client's Ethernet
  address back to an IP address? If not, this could indicate a bad
  ethers map, a bad /etc/ethers file, or an
  /etc/nsswitch.conf file that is not pointing at
  the right place.
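If the debug output has been saved to a file, both questions can be answered mechanically. This sketch relies only on the two message formats visible in the sample output above ("RARP_REQUEST for" and "good lookup, maps to"); what rarpd prints on a failed lookup is not assumed, so a request is simply reported as UNRESOLVED when no good lookup follows it. The name rarp_report is hypothetical, and a POSIX awk is assumed.

```shell
#!/bin/sh
# rarp_report LOG
# For each RARP_REQUEST in saved "in.rarpd -d" output, print the requesting
# Ethernet address followed by the IP address from a later "good lookup"
# line, or UNRESOLVED if none appears before the next request.
rarp_report() {
    awk '
        function emit() {
            if (mac != "") print mac, (ip != "" ? ip : "UNRESOLVED")
        }
        /RARP_REQUEST for/     { emit(); mac = $NF; ip = "" }
        /good lookup, maps to/ { ip = $NF }
        END                    { emit() }
    ' "$1"
}
```

No output at all means that rarpd never saw a request, which points at the first item in the list. An UNRESOLVED line points at the second.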
By enabling debug mode in
bootparamd on the
server, you can see the hostname, addresses, and pathnames given to
the diskless client. You can turn on
bootparamd
debugging by killing it on the server and starting it again with the
-d option:
# ps -eaf | grep bootparamd
root 276 1 0 Apr 16 ? 0:00 /usr/sbin/rpc.bootparamd
root 5878 5825 0 00:33:27 pts/0 0:00 grep bootparamd
# kill 276
# rpc.bootparamd -d
in debug mode.
msg 1: group = 260 mib_id = 0 length = 128
msg 2: group = 261 mib_id = 0 length = 132
msg 3: group = 1025 mib_id = 0 length = 36
msg 4: group = 1026 mib_id = 0 length = 64
msg 5: group = 260 mib_id = 20 length = 144
msg 6: group = 260 mib_id = 100 length = 88
msg 7: group = 1026 mib_id = 1 length = 0
msg 8: group = 1026 mib_id = 2 length = 0
msg 9: group = 260 mib_id = 21 length = 2464
msg 10: group = 260 mib_id = 22 length = 360
mibget getmsg( ) 11 returned EOD (level 0, name 0)
interface_addr = 130.141.14.8.
interface_mask = 255.255.255.0
22 records for ipRouteEntryTable
Whoami returning name = honeymoon, router address = 130.141.14.253
getfile_1: file is "honeymoon" 130.141.14.8 "/export/root/honeymoon"
The messages that start with msg are the results of asking the IP
layer for Simple Network Management Protocol (SNMP) Management
Information Base (MIB) information. The
bootparamd daemon makes this inquiry to find the
IP address of the best router for the diskless client. The messages
that say group = 260 are the ones of interest for this purpose. Of
those messages, the ones with a mib_id of 0 or 20 are of interest.
Normally both kinds of messages will appear. If not, that may
indicate a problem with the server's network configuration. But
if there are no problems, we can expect the debug output to show a
router address for the client.
The getfile_1 message is simply reporting that it knows where the
client's root filesystem is. Note the IP address is the same as
the server's interface, which means that the NFS server for the
client is the same as the
bootparamd server.
If the server shows strange boot parameters passed to the client,
check that the server's
/etc/bootparams
file is correct, and that the boot server's NIS server has
up-to-date maps.
If the boot parameters received by the client are incorrect, check
that the server answering the request for them has current
information. Because requests are broadcast to
bootparamd, the server that can reply in the
shortest time supplies the information. If the client refuses to boot
at all, complaining of:
null domain name
invalid domain name
invalid boot parameters
or similar problems, verify that the host answering its broadcasts is
using the same boot protocol and configuration files. See
Section 15.3, "Boot parameter confusion" for an example of invalid boot parameters.
Also ensure that the boot server exports the client's root and
swap filesystems with the proper
root mapping
and access restrictions. In
/etc/dfs/dfstab,
both the root and swap filesystems should have the options:
rw=client,root=client
to limit access to the diskless client and to allow the superuser to
write to the filesystems. If the swap filesystem is not exported so
that
root can write to it, the diskless client
will not be able to start the
init process to
begin the
single-user boot.
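A quick check of the export options might look like the following sketch. It assumes dfstab entries of the form share -F nfs -o options pathname, a POSIX awk, and an access list that names exactly one client; a colon-separated list such as rw=a:b would need a looser match. The name export_ok is hypothetical.

```shell
#!/bin/sh
# export_ok DFSTAB CLIENT PATH
# Succeed if DFSTAB has an uncommented share line for PATH whose -o option
# list contains both rw=CLIENT and root=CLIENT.
export_ok() {
    awk -v client="$2" -v path="$3" '
        /^[ \t]*#/ { next }
        $1 == "share" && $NF == path {
            opts = ""
            # The word after -o is the comma-separated option list.
            for (i = 2; i < NF; i++)
                if ($i == "-o") opts = "," $(i + 1) ","
            if (index(opts, ",rw=" client ",") &&
                index(opts, ",root=" client ","))
                ok = 1
        }
        END { exit ok ? 0 : 1 }
    ' "$1"
}
```

Run it once for each filesystem, for example export_ok /etc/dfs/dfstab honeymoon /export/root/honeymoon and again with /export/swap/honeymoon. A nonzero status means the entry is missing or lacks one of the two options.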
8.6.4. Missing /usr
After setting the host and domain
names and configuring
network interfaces in the boot process, a machine mounts its
/usr filesystem. If there are problems with
/usr, the boot process either hangs or fails at
the first reference to the
/usr filesystem. The
two most common problems are not being able to locate the NFS server
for
/usr and attempting to mount the wrong
/usr.
NIS cannot be started until after
/usr is
mounted, since client-side daemons like
ypbind
live in
/usr. Generally,
/usr is mounted from the boot server, so a
diskless client needs its own name and its server's hostname in
its
/etc/hosts. If
/usr is
not mounted from the root/swap filesystem server, the
/usr server's hostname must appear in the
local hosts file as well. You may need as many as four different
entries in the "runt"
/etc/hosts
file on a diskless client: its hostname, a localhost entry, the boot
server's name, and the name of the
/usr
server.
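The required entries can be checked from the boot server, where the client's root filesystem is visible under /export/root. A minimal sketch, assuming a hosts-format file and a POSIX awk; missing_hosts is a hypothetical helper name.

```shell
#!/bin/sh
# missing_hosts FILE NAME...
# Print each NAME that has no entry in the hosts-format FILE.
# Succeeds only if every NAME is present.
missing_hosts() {
    file=$1
    shift
    rc=0
    for name in "$@"; do
        if ! awk -v name="$name" '
                { sub(/#.*/, "") }             # drop comments
                { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
                END { exit found ? 0 : 1 }
            ' "$file"; then
            printf '%s\n' "$name"
            rc=1
        fi
    done
    return $rc
}
```

For the client in this chapter's examples, missing_hosts /export/root/honeymoon/etc/hosts honeymoon localhost wahoo lists whichever of the three names the runt hosts file lacks. Add the /usr server's name to the argument list if /usr is mounted from a different machine.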
Heterogeneous client/server environments create another set of
problems. Clients of different architectures need their own
/usr filesystems with executables built for the
client's CPU, not the server's. The most obvious problem
is when the client mounts the wrong
/usr. If the
executables on it were built for a different CPU, then the first
attempt to invoke one of them produces a fairly descriptive error.
However, if the
/usr/platform directory is for
the correct CPU architecture but doesn't contain the right
kernel architecture (for example, Sun's
sun4u and
sun4m variants),
then the client boots, but certain Unix utilities will not work.
Processes that read the kernel or user address spaces, such as
crash, are the most likely to break.
If you suspect that you're mounting the wrong
/usr, first check the client's
/etc/vfstab file to see where it gets
/usr:
wahoo:/export/root/honeymoon - / nfs - - rw
wahoo:/export/swap/honeymoon - /dev/swap nfs - - -
wahoo:/export/exec/Solaris_2.7_sparc.all/usr - /usr nfs - - ro
In this example, we would check
/export/exec/Solaris_2.7_sparc.all/usr on the
server
wahoo. The directories in
/export/exec have names with this format:
Solaris_<release>_<architecture>. If
the client and the server are of the same CPU architecture and are
running the same release of the operating system, the
usr subdirectory in
/export/exec/Solaris_<release>_<architecture>
is a symbolic link to the server's
/usr
directory.
If the client and server do not have the same release and CPU
architectures, the directories in
/export/exec
contain complete operating system releases.
Three things can go wrong with this link-and-directory scheme:
- The links /export/exec/*/usr point to the wrong
  place. This is possible if you changed the architecture of the server
  but restored /export from a backup tape. Make
  sure that Solaris_2.7_sparc.all/usr links point
  to /usr only if the server is a SPARC running
  Solaris 7. You'll get "exec format" errors if you
  mount a /usr of the wrong architecture on the
  client.
- The /export/exec/* directories referenced by the
  clients don't exist. This is possible if you added a client of
  a new, different CPU architecture but did not install the appropriate
  operating system software for it. If you try to mount a directory
  that doesn't exist, you should see "cannot mount
  root" errors on the client.
- The client may have the wrong mount point listed in its
  /etc/vfstab file. If you did not specify the
  architecture of the client correctly when using the AdminSuite
  software, the client's vfstab file is
  likely to contain the wrong mount information.
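All three checks start from the same fact: what the client's vfstab names as the source of /usr. The sketch below pulls that field out of a vfstab-format file, assuming the field order shown in the example above (device to mount first, mount point third) and a POSIX awk; usr_source is a hypothetical helper name.

```shell
#!/bin/sh
# usr_source VFSTAB
# Print the "device to mount" field (server:path) of the /usr entry in a
# vfstab-format file. Prints nothing if there is no /usr entry.
usr_source() {
    awk '
        /^[ \t]*#/ { next }                    # skip comments
        $3 == "/usr" { print $1; exit }
    ' "$1"
}
```

Run against the client's copy on the boot server, usr_source /export/root/honeymoon/etc/vfstab prints the server name and path separated by a colon. On the named server, ls -ld on that path shows whether it exists and, if it is a symbolic link, where it points.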
If you are unsure of how a mount and link combination will work,
experiment on another diskless client having the same architecture.
For example, mount
/export/exec/Solaris_2.7_sparc.all/usr on
/mnt, and then try a sample command to be sure
you've mounted the right one:
client# mount wahoo:/export/exec/Solaris_2.7_sparc.all/usr /mnt
client# cd /mnt
client# /mnt/bin/ls
4lib dict krb5 oasys sbin ucblib
5bin dist kvm old share vmsys
X dt lib openwin snadm xpg4
adm games lost+found platform spool
aset include mail preserve src
bin java man proc tmp
ccs java1.1 net pub ucb
demo kernel news sadm ucbinclude
If the commands execute properly, then you
should be able to
mount
/usr safely on the diskless client in
question.