Chapter 15. Debugging Network Problems

This chapter consists of case studies in network problem analysis and debugging, ranging from Ethernet addressing problems to a machine posing as an NIS server in the wrong domain. This chapter is a bridge between the formal discussion of NFS and NIS tools and their use in performance analysis and tuning. The case studies presented here walk through debugging scenarios, but they should also give you an idea of how the various tools work together.

When debugging a network problem, it's important to form a hypothesis about the likely cause and then use it to start ruling out other factors. For example, if your attempts to bind to an NIS server are failing, you can test the network with ping, the health of the ypserv processes with rpcinfo, and finally the binding itself with ypset. Working your way up through the protocol layers ensures that you don't miss a low-level problem that is posing as a higher-level failure. In keeping with that advice, we'll start by looking at a network layer problem.
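
The layer-by-layer approach can be sketched as a small driver that runs checks from the lowest layer upward and reports the first one that fails. This is only an illustration: the function name `first_failing_layer` is hypothetical, and the three checks are stand-in lambdas rather than real invocations of ping, rpcinfo, or ypset.

```python
from typing import Callable, List, Optional, Tuple

# A check is a (description, callable) pair; the callable returns True on success.
Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in order, lowest layer first, and return the name of the
    first one that fails. Return None if every check passes."""
    for name, check in checks:
        if not check():
            return name
    return None


# Hypothetical stand-ins: a real version might wrap ping, rpcinfo, and ypset
# in functions that run the command and inspect its exit status.
checks = [
    ("network reachability (ping)", lambda: True),
    ("ypserv health (rpcinfo)", lambda: False),
    ("NIS binding (ypset)", lambda: True),
]

print(first_failing_layer(checks))
```

Stopping at the first failure is the point of the ordering: a higher-layer check is not worth interpreting until everything beneath it has passed.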

15.1. Duplicate ARP replies

ARP misinformation was briefly mentioned in Section 13.2.3, "IP to MAC address mappings", and this story showcases some of the baffling effects it creates. A network of two servers and ten clients suddenly began to run very slowly, with the following symptoms:

Some users attempting to start a document-processing application waited 10 to 30 minutes for the application's window to appear, while those on well-behaved machines waited a few seconds. The executables resided on a fileserver and were NFS mounted on each client. Every machine in the group experienced these delays over a period of a few days, although not all at the same time.
Machines would suddenly "go away" for several minutes. Clients would stop seeing their NFS and NIS servers, producing streams of messages like:
```
NFS server muskrat not responding still trying
```
or:
```
ypbind: NIS server not responding for domain "techpubs"; still trying
```

The local area network with the problems was joined to the campus-wide backbone via a bridge. An identical network of machines, running the same applications with nearly the same configuration, was operating without problems on the far side of the bridge. We were assured of the health of the physical network by two engineers who had verified physical connections and cable routing.

The sporadic nature of the problem, and the fact that it resolved itself over time, pointed toward ARP request and reply mismatches. This hypothesis neatly explained the extraordinarily slow loading of the application: a client machine trying to read the application executable would do so by issuing NFS Version 2 requests over UDP. To send the UDP packets, the client would ARP the server, randomly get the wrong reply, and then be stuck with that unusable entry for several minutes. When the ARP table entry had aged and was deleted, the client would again ARP the server; if the correct ARP response was received, the client could continue reading pages of the executable. Every wrong reply received by the client would add a few minutes to the loading time.

There were several possible sources of the ARP confusion, so to isolate the problem, we forced a client to ARP the server and watched what happened to the ARP table:

```
# arp -d muskrat
muskrat (139.50.2.1) deleted
# ping -s muskrat
PING muskrat: 56 data bytes
(no further output from ping)
```

By deleting the ARP table entry and then directing the client to send packets to muskrat, we forced an ARP of muskrat from the client. ping timed out without receiving any ICMP echo replies, so we examined the ARP table and found a surprise:

```
# arp -a | fgrep muskrat
le0   muskrat               255.255.255.255       08:00:49:05:02:a9
```

Since muskrat was a Sun workstation, we expected its Ethernet address to begin with 08:00:20 (the prefix assigned to Sun Microsystems), not the 08:00:49 prefix used by Kinetics gateway boxes. The next step was to figure out how the wrong Ethernet address was ending up in the ARP table: was muskrat lying in its ARP replies, or had we found a network imposter?
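
The prefix check described here is easy to automate. The following is a minimal sketch, assuming the Ethernet address is the last whitespace-separated field of an `arp -a` line as in the output above. The two prefixes are the ones quoted in the text; the function names and the `KNOWN_PREFIXES` table are hypothetical.

```python
# Vendor prefixes quoted in the text; a real site would keep a fuller table.
KNOWN_PREFIXES = {
    "08:00:20": "Sun Microsystems",
    "08:00:49": "Kinetics",
}


def normalize_mac(mac: str) -> str:
    """Lowercase the address and zero-pad each octet, so that
    8:0:20:1:2:3 and 08:00:20:01:02:03 compare equal."""
    return ":".join(part.zfill(2) for part in mac.strip().lower().split(":"))


def vendor_of(mac: str) -> str:
    """Look up the vendor for the first three octets of the address."""
    return KNOWN_PREFIXES.get(normalize_mac(mac)[:8], "unknown")


def mac_from_arp_line(line: str) -> str:
    """Assume the Ethernet address is the last field of an arp -a line."""
    return line.split()[-1]


line = "le0   muskrat               255.255.255.255       08:00:49:05:02:a9"
mac = mac_from_arp_line(line)
print(mac, "->", vendor_of(mac))
```

A lookup like this only catches imposters from a different vendor; two machines with the same prefix would pass it.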

Using a network analyzer, we repeated the ARP experiment and watched the ARP replies that came back. We saw two distinct replies: the correct one from muskrat, followed by an invalid reply from the Kinetics FastPath gateway. The root of this problem was that the Kinetics box had been configured using the IP broadcast address 0.0.0.0, allowing it to answer all ARP requests. Reconfiguring the Kinetics box with a non-broadcast IP address solved the problem.
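
What the analyzer revealed, two different Ethernet addresses answering for one IP address, can be detected mechanically once the replies are available as (IP address, Ethernet address) pairs. The sketch below assumes that form of input; how the pairs are extracted from a capture is left out, and the function name is hypothetical. The IP address and the Kinetics address come from the transcripts above, while the Sun address is made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def conflicting_replies(replies: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group ARP replies by IP address and return only those addresses
    that were claimed by more than one distinct Ethernet address."""
    claimed = defaultdict(set)
    for ip, mac in replies:
        claimed[ip].add(mac.lower())
    return {ip: macs for ip, macs in claimed.items() if len(macs) > 1}


observed = [
    ("139.50.2.1", "08:00:20:00:00:01"),  # hypothetical reply from muskrat
    ("139.50.2.1", "08:00:49:05:02:a9"),  # reply carrying the Kinetics address
]

for ip, macs in conflicting_replies(observed).items():
    print(ip, "answered by", sorted(macs))
```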

The last update to the ARP table is the one that "sticks," so the wrong Ethernet address was overwriting the correct ARP table entry. The Kinetics FastPath was located on the other side of the bridge, virtually guaranteeing that its replies would be the last to arrive, delayed by their transit over the bridge. Only when muskrat was heavily loaded, and therefore slow to reply to the ARP request, did its own ARP response arrive last and leave the correct entry in the table. Reconfiguring the Kinetics FastPath to use a proper IP address and network mask cured the problem.
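
The "last reply sticks" behavior can be modeled in a few lines. This is a toy model, not a description of any particular ARP implementation: the function name is hypothetical, the arrival times are invented, and the Sun address is made up, with only the Kinetics address taken from the transcript above.

```python
from typing import List, Optional, Tuple

MUSKRAT = "08:00:20:00:00:01"   # hypothetical correct address
FASTPATH = "08:00:49:05:02:a9"  # address seen in the ARP table


def cache_entry_after(replies: List[Tuple[float, str]]) -> Optional[str]:
    """Apply replies in arrival order, each overwriting the previous one,
    and return the Ethernet address left in the cache (None if no replies)."""
    entry = None
    for _arrival_ms, mac in sorted(replies, key=lambda r: r[0]):
        entry = mac  # the last update is the one that sticks
    return entry


# Usual case: the gateway's reply is delayed by the bridge and arrives last.
print(cache_entry_after([(1.0, MUSKRAT), (3.0, FASTPATH)]))

# muskrat heavily loaded: its reply is the slow one, so the correct entry wins.
print(cache_entry_after([(5.0, MUSKRAT), (3.0, FASTPATH)]))
```

The model shows why the symptom depends on timing rather than on which reply is correct: the cache has no way to prefer one answer over the other.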

ARP servers that have out-of-date information create similar problems. This situation arises if an IP address is changed without a corresponding update of the server's published ARP table initialization, or if the IP address in question is re-assigned to a machine that implements the ARP protocol. If an ARP server had been employed because muskrat could not answer ARP requests, we would have seen exactly one ARP reply, coming from the ARP server. However, an ARP server with a published ARP table entry for a machine capable of answering its own ARP requests produces exactly the same duplicate response symptoms described above. With both machines on the same local network, the failures tend to be more intermittent, since there is no obvious time-ordering of the replies.

There's a moral to this story: you should rarely need to know the Ethernet address of a workstation, but it does help to have them recorded in a file or NIS map. This problem was solved with a bit of luck, because the machine generating incorrect replies had a different manufacturer, and therefore a different Ethernet address prefix. If the incorrectly configured machine had been from the same vendor, we would have had to compare the Ethernet addresses in the ARP table with what we believed to be the correct addresses for the machine in question.
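
The comparison described above, checking the ARP table against a record of known-good addresses, might look like the following sketch. It assumes the record is kept in an ethers-style text format with one Ethernet address and one hostname per line; the function names are hypothetical, and the recorded address for muskrat is made up for illustration.

```python
from typing import Dict, Tuple


def normalize_mac(mac: str) -> str:
    """Lowercase and zero-pad each octet so 8:0:20:... matches 08:00:20:..."""
    return ":".join(part.zfill(2) for part in mac.strip().lower().split(":"))


def parse_ethers(text: str) -> Dict[str, str]:
    """Parse 'ethernet-address hostname' lines, skipping blanks and # comments."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        table[fields[1]] = normalize_mac(fields[0])
    return table


def mismatches(recorded: Dict[str, str],
               observed: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {host: (recorded_mac, observed_mac)} for hosts whose ARP
    table entry disagrees with the recorded address."""
    result = {}
    for host, mac in observed.items():
        seen = normalize_mac(mac)
        if host in recorded and recorded[host] != seen:
            result[host] = (recorded[host], seen)
    return result


recorded = parse_ethers("""
# hypothetical record of known-good addresses
8:0:20:0:0:1   muskrat
""")
observed = {"muskrat": "08:00:49:05:02:a9"}  # from the ARP table above

print(mismatches(recorded, observed))
```

Unlike a vendor-prefix check, this comparison still works when the misconfigured machine comes from the same manufacturer as the real one.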
