Network administration tasks fall into two very different categories:
configuration and troubleshooting. Configuration tasks prepare for
the expected; they require detailed knowledge of command syntax, but
are usually simple and predictable. Once a system is properly
configured, there is rarely any reason to change it. The
configuration process is repeated each time a new operating system
release is installed, but with very few changes.
In contrast, network troubleshooting deals with the unexpected.
Troubleshooting frequently requires knowledge that is conceptual
rather than detailed. Network problems are usually unique and
sometimes difficult to resolve. Troubleshooting is an important part
of maintaining a stable, reliable network service.
In this chapter, we discuss the tools you will use to ensure that the
network is in good running condition. However, good tools are not
enough. No troubleshooting tool is effective if applied haphazardly.
Effective troubleshooting requires a methodical approach to the
problem, and a basic understanding of how the network works.
We'll start our discussion by looking at ways to approach a network problem.
13.1. Approaching a Problem
To approach a problem properly, you need a basic understanding of
TCP/IP. The first few
chapters of this book discuss the basics of TCP/IP and provide enough
background information to troubleshoot most network problems.
Knowledge of how TCP/IP routes data through the network, between
individual hosts, and between the layers in the protocol stack is
important for understanding a network problem. But detailed knowledge
of each protocol usually isn't necessary. When you need these
details, look them up in a definitive reference -- don't try
to recall them from memory.
Not all TCP/IP problems are alike, and not all problems can be
approached in the same manner. But the key to solving any problem is
understanding what the problem is. This is not as easy as it may
seem. The "surface" problem is sometimes misleading, and
the "real" problem is frequently obscured by many layers
of software. Once you understand the true nature of the problem, the
solution to the problem is often obvious.
First, gather detailed information about exactly what's
happening. When a user reports a problem, talk to her. Find out which
application failed. What is the remote host's name and IP
address? What is the user's hostname and address? What error
message was displayed? If possible, verify the problem by having the
user run the application while you talk her through it. If possible,
duplicate the problem on your own system.
Testing from the user's system, and other systems, find out:
Does the problem occur in other applications on the user's
host, or is only one application having trouble? If only one
application is involved, the application may be misconfigured or
disabled on the remote host. Because of security concerns, many
systems disable some services.
Does the problem occur with only one remote host, all remote hosts,
or only certain "groups" of remote hosts? If only one
remote host is involved, the problem could easily be with that host.
If all remote hosts are involved, the problem is probably with the
user's system (particularly if no other hosts on your local
network are experiencing the same problem). If only hosts on certain
subnets or external networks are involved, the problem may be related to routing.
Does the problem occur on other local systems? Make sure you check
other systems on the same subnet. If the problem occurs only on the
user's host, concentrate testing on that system. If the problem
affects every system on a subnet, concentrate on the router for that subnet.
Once you know the symptoms of the problem, visualize each protocol
and device that handles the data. Visualizing the problem will help
you avoid oversimplification, and keep you from assuming that you
know the cause even before you start testing. Using your TCP/IP
knowledge, narrow your attack to the most likely causes of the
problem, but keep an open mind.
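The scoping questions above (one application or several, one remote host or all, one local system or the whole subnet) can be written down as a small decision aid. The following is only a sketch in Python under our own assumptions: the Observation record, the suggest_focus function, and the host and application names are hypothetical and are not part of any tool discussed in this chapter. The hints it returns are starting points for testing, not a diagnosis.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    """One test: did `application` on `local_host` reach `remote_host`?"""
    local_host: str
    remote_host: str
    application: str
    ok: bool


def suggest_focus(observations: List[Observation], user_host: str) -> List[str]:
    """Turn recorded test results into hints about where to concentrate."""
    hints: List[str] = []
    from_user = [o for o in observations if o.local_host == user_host]
    failed_from_user = [o for o in from_user if not o.ok]
    if not failed_from_user:
        return ["problem not reproduced from the user's host"]

    # Question 1: is only one application having trouble?
    tested_apps = {o.application for o in from_user}
    failed_apps = {o.application for o in failed_from_user}
    if len(failed_apps) == 1 and len(tested_apps) > 1:
        hints.append("check the application: " + next(iter(failed_apps)))

    # Question 2: one remote host, or all remote hosts?
    tested_remotes = {o.remote_host for o in from_user}
    failed_remotes = {o.remote_host for o in failed_from_user}
    if len(tested_remotes) > 1:
        if len(failed_remotes) == 1:
            hints.append("check the remote host: " + next(iter(failed_remotes)))
        elif failed_remotes == tested_remotes:
            hints.append("all remote hosts fail: suspect the user's system")

    # Question 3: only the user's host, or every local system tested?
    tested_locals = {o.local_host for o in observations}
    failed_locals = {o.local_host for o in observations if not o.ok}
    if len(tested_locals) > 1:
        if failed_locals == {user_host}:
            hints.append("only the user's host fails: concentrate testing there")
        elif failed_locals == tested_locals:
            hints.append("every local system fails: concentrate on the subnet's router")

    return hints


if __name__ == "__main__":
    # Hypothetical results gathered from two systems on the same subnet.
    results = [
        Observation("pc1", "hostA", "ssh", False),
        Observation("pc1", "hostB", "ssh", False),
        Observation("pc2", "hostA", "ssh", True),
    ]
    for hint in suggest_focus(results, user_host="pc1"):
        print(hint)
```

The sketch makes no suggestion when too few tests have been run to compare (for example, when only one remote host was tried), which is a reminder to gather more information before drawing conclusions.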
13.1.1. Troubleshooting Hints
Below are several useful troubleshooting
hints. They are not part of a troubleshooting methodology -- just
good ideas to keep in mind.
Approach problems methodically. Allow the information gathered from
each test to guide your testing. Don't jump on a hunch into
another test scenario without ensuring that you can pick up your
original scenario where you left off.
Work carefully through the problem, dividing it into manageable
pieces. Test each piece before moving on to the next. For example,
when testing a network connection, test each part of the network
until you find the problem.
Keep good records of the tests you have completed and their results.
Keep a historical record of the problem in case it reappears.
Keep an open mind. Don't assume too much about the cause of the
problem. Some people believe their network is always at fault, while
others assume the remote end is always the problem. Some are so sure
they know the cause of a problem that they ignore the evidence of the
tests. Don't fall into these traps. Test each possibility and
base your actions on the evidence of the tests.
Be aware of security barriers. Security firewalls sometimes block
ping, traceroute, and even ICMP
error messages. If problems seem to cluster around a specific remote
site, find out if it has a firewall.
Pay attention to error messages. Error messages are often vague, but
they frequently contain important hints for solving the problem.
Duplicate the reported problem yourself. Don't rely too heavily
on the user's problem report. The user has probably seen this
problem only from the application level. If necessary, obtain the
user's data files to duplicate the problem. Even if you cannot
duplicate the problem, log the details of the reported problem for future reference.
Most problems are caused by human error. You can prevent some of
these errors by providing information and training on network
configuration and usage.
Keep your users informed. This reduces the number of duplicated
trouble reports and the duplication of effort when several system
administrators work on the same problem without knowing others are
already working on it. If you're lucky, someone may have seen
the problem before and have a helpful suggestion about how to resolve it.
Don't speculate about the cause of the problem while talking to
the user. Save your speculations for discussions with your networking
colleagues. Your speculations may be accepted by the user as gospel,
and become rumors. These rumors can cause users to avoid using
legitimate network services and may undermine confidence in your
network. Users want solutions to their problems; they're not
interested in speculative techno-babble.
Stick to a few simple troubleshooting tools. For most TCP/IP software
problems, the tools discussed in this chapter are sufficient. Just
learning how to use a new tool is often more time-consuming than
solving the problem with an old, familiar tool.
Thoroughly test the problem at your end of the network before
locating the owners of the remote system to coordinate testing with
them. The greatest difficulty of network troubleshooting is that you
do not always control the systems at both ends of the network. In
many cases, you may not even know who does control the remote system.
The more information you have about your end, the simpler the job
will be when you have to contact the remote administrator.
Don't neglect the obvious. A loose or damaged cable is always a
possible problem. Check plugs, connectors, cables, and switches.
Small things can cause big problems.
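Two of the hints above, testing each piece of a connection in turn and keeping good records of every test, combine naturally. The following is a minimal sketch in Python, not a tool from this chapter: the CheckRecord type, the run_in_order function, and the stubbed checks are all hypothetical. Each check is assumed to be a function you supply that returns a pass/fail flag and a short note; in practice it might wrap one of the tools discussed later.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional, Tuple

# A check returns (passed, detail). What it actually tests is up to you.
Check = Callable[[], Tuple[bool, str]]


@dataclass
class CheckRecord:
    """One line in the troubleshooting log."""
    name: str
    passed: bool
    detail: str
    when: str  # UTC timestamp in ISO 8601 form


def run_in_order(pieces: List[Tuple[str, Check]],
                 log: List[CheckRecord]) -> Optional[str]:
    """Test each piece in order, logging every result.

    Returns the name of the first piece that fails, or None if all pass.
    """
    for name, check in pieces:
        try:
            passed, detail = check()
        except Exception as exc:  # a crashing check is recorded as a failure
            passed, detail = False, "check raised " + repr(exc)
        stamp = datetime.now(timezone.utc).isoformat()
        log.append(CheckRecord(name, passed, detail, stamp))
        if not passed:
            return name  # stop here; later pieces were not tested
    return None


if __name__ == "__main__":
    # Stubbed checks standing in for real tests, nearest piece first.
    pieces = [
        ("local interface", lambda: (True, "interface is up")),
        ("default gateway", lambda: (False, "no reply")),
        ("remote host", lambda: (True, "not reached in this run")),
    ]
    log: List[CheckRecord] = []
    print("first failing piece:", run_in_order(pieces, log))
    for record in log:
        print(record)
```

Because the log is an ordinary list that the caller owns, it can be saved after each session, which gives you the historical record to consult if the problem reappears.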
Copyright © 2002 O'Reilly & Associates. All rights reserved.