Network administration tasks fall into two very different categories:
configuration and troubleshooting. Configuration tasks prepare for the
expected; they require detailed knowledge of command syntax,
but are usually simple and predictable. Once a system is properly
configured, there is rarely any reason to change it. The configuration
process is repeated each time a new operating system release is
installed, but with very few changes.
In contrast, network troubleshooting deals with the unexpected.
Troubleshooting frequently requires knowledge that is conceptual rather
than detailed. Network problems are usually unique and sometimes
difficult to resolve. Troubleshooting is an important part of
maintaining a stable, reliable network service.
In this chapter, we discuss the tools you will use to ensure that the
network is in good running condition. However, good tools are not
enough. No troubleshooting tool is effective if applied
haphazardly. Effective troubleshooting requires a methodical approach
to the problem, and a basic understanding of how the network
works. We'll start our discussion by looking at ways to approach a
network problem.
To approach a problem properly, you need a basic understanding of
TCP/IP. The first few chapters of this book discuss the basics of
TCP/IP and provide enough background information to troubleshoot most
network problems. Knowledge of how TCP/IP routes data through the
network, between individual hosts, and between the layers in the
protocol stack, is important for understanding a network problem. But
detailed knowledge of each protocol usually isn't necessary. When you
need these details, look them up in a definitive reference; don't try
to recall them from memory.
Not all TCP/IP problems are alike, and not all problems can be
approached in the same manner. But the key to solving any problem is
understanding what the problem is. This is not as easy as it may
seem. The "surface" problem is sometimes misleading, and the "real"
problem is frequently obscured by many layers of software. Once you
understand the true nature of the problem, the solution is often
obvious.
First, gather detailed information about exactly what's happening.
When a user reports a problem, talk to her. Find out which
application failed. What is the remote host's name and IP address?
What is the user's hostname and address? What error message was
displayed? If possible, verify the problem by having the user run the
application while you talk her through it. If possible, duplicate the
problem on your own system.
Testing from the user's system, and other systems, find out:
-
Does the problem occur in other applications on the user's host, or is
only one application having trouble? If only one application is
involved, the application may be misconfigured or disabled on the
remote host. Because of security concerns, many systems disable some
services.
-
Does the problem occur with only one remote host, all remote hosts, or
only certain "groups" of remote hosts? If only one remote host is
involved, the problem could easily be with that host. If all remote
hosts are involved, the problem is probably with the user's system
(particularly if no other hosts on your local network are experiencing
the same problem). If only hosts on certain subnets or external
networks are involved, the problem may be related to routing.
-
Does the problem occur on other local systems? Make sure you check
other systems on the same subnet. If the problem only occurs on the
user's host, concentrate testing on that system. If the problem
affects every system on a subnet, concentrate on the router for that
subnet.
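The three questions above amount to a small decision procedure. The
following Python sketch is one way to write it down; it is an
illustration, not a diagnostic tool. The class, field, and label names
are all hypothetical, and "other local hosts fail" is treated as a
stand-in for "every system on the subnet is affected," which is a
simplification.

```python
from dataclasses import dataclass

# Hypothetical labels for where to concentrate testing.
APPLICATION = "application configuration, or a service disabled on the remote host"
REMOTE_HOST = "the remote host"
ROUTING = "routing to the affected subnets or external networks"
SUBNET_ROUTER = "the router for the local subnet"
USER_HOST = "the user's host"


@dataclass
class Symptoms:
    other_apps_fail: bool         # do other applications on the user's host fail?
    remote_scope: str             # "one", "some", or "all" remote hosts affected
    other_local_hosts_fail: bool  # do other systems on the same subnet fail?


def suggest_focus(s):
    """Return a list of places to start testing, based on the symptoms."""
    hints = []
    if not s.other_apps_fail:
        hints.append(APPLICATION)
    if s.remote_scope == "one":
        hints.append(REMOTE_HOST)
    elif s.remote_scope == "some":
        hints.append(ROUTING)
    # Whole-subnet trouble points at the router; otherwise look at the user's host.
    hints.append(SUBNET_ROUTER if s.other_local_hosts_fail else USER_HOST)
    return hints


if __name__ == "__main__":
    # Every application fails, to every remote host, but only on this one system.
    print(suggest_focus(Symptoms(True, "all", False)))
```

The function returns hints, not conclusions. Treat each entry as a
place to begin testing, and let the test results decide what to do
next.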
Once you know the symptoms of the problem, visualize each protocol and
device that handles the data. Visualizing the problem will help you
avoid oversimplification, and keep you from assuming that you know the
cause even before you start testing. Using your TCP/IP knowledge,
narrow your attack to the most likely causes of the problem, but keep
an open mind.
Below we offer several useful troubleshooting hints. They are not part
of a troubleshooting methodology, just good ideas to keep in mind.
-
Approach problems methodically. Allow the information gathered from
each test to guide your testing. Don't jump on a hunch into another test
scenario without ensuring that you can pick up your
original scenario where you left off.
-
Work carefully through the problem, dividing it into manageable pieces.
Test each piece before moving on to the next. For example, when
testing a network connection, test each part of the network until you
find the problem.
-
Keep good records of the tests you have completed and their results.
Keep a historical record of the problem in case it reappears.
-
Keep an open mind. Don't assume too much about the cause of the
problem. Some people believe their network
is always at fault, while others assume the remote end is always the
problem. Some are so sure they know the cause of a problem that
they ignore the evidence of the tests. Don't fall into these traps.
Test each possibility and base your actions on the evidence of the
tests.
-
Be aware of security barriers. Security firewalls sometimes block
ping, traceroute, and even ICMP error messages. If problems
seem to cluster around a specific remote site, find out if they have a
firewall.
-
Pay attention to error messages. Error messages are often vague, but
they frequently contain important hints for solving the problem.
-
Duplicate the reported problem yourself. Don't rely too heavily on the
user's problem report. The user has probably only seen this problem
from the application level. If necessary, obtain the user's data files
to duplicate the problem. Even if you cannot duplicate the problem,
log the details of the reported problem for your records.
-
Most problems are caused by human error. You can prevent some of these errors
by providing information and training on network configuration and
usage.
-
Keep your users informed. This reduces the number of duplicated
trouble reports, and the duplication of effort when several system
administrators work on the same problem without knowing others are
already working on it. If you're lucky, someone may have seen the
problem before and have a helpful suggestion about how to resolve it.
-
Don't speculate about the cause of the problem while talking to the
user. Save your speculations for discussions with your networking
colleagues. Your speculations may be accepted by the user as gospel,
and become rumors. These rumors can cause users to avoid using
legitimate network services and may undermine confidence in your
network. Users want solutions to their problems; they're not
interested in speculative techno-babble.
-
Stick to a few simple troubleshooting tools. For most TCP/IP software
problems, the tools discussed in this chapter are sufficient. Just
learning how to use a new tool is often more time-consuming than
solving the problem with an old familiar tool.
-
Thoroughly test the problem at your end of the network before locating
the owner of the remote system to coordinate testing with him. The
greatest difficulty of network troubleshooting is that you do not
always control the systems at both ends of the network. In many cases,
you may not even know who does control the remote system. [1] The more
information you have about your end, the simpler the job will be when
you have to contact the remote administrator.
-
Don't neglect the obvious. A loose or damaged cable is always a
possible problem. Check plugs, connectors, cables, and switches.
Small things can cause big problems.
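Two of the hints above, testing one piece at a time and keeping records
of each test, combine naturally. The Python sketch below shows one
possible shape for that habit. The function name, the step names, and
the record fields are hypothetical, and the checks are stand-ins that
simply return True or False; in practice each check would wrap
whatever test you actually run.

```python
import datetime


def run_steps(steps, log):
    """Run (name, check) pairs in order and stop at the first failure.

    Each check is a zero-argument callable that returns True on success.
    Every result is appended to `log`, so there is a record of which
    tests were completed and how they turned out.
    """
    for name, check in steps:
        passed = bool(check())
        log.append({
            "time": datetime.datetime.now().isoformat(timespec="seconds"),
            "test": name,
            "passed": passed,
        })
        if not passed:
            return name  # the first piece that failed
    return None  # every piece passed


if __name__ == "__main__":
    history = []
    # Stand-in checks; replace the lambdas with real tests.
    steps = [
        ("local interface", lambda: True),
        ("default gateway", lambda: False),
        ("remote host", lambda: True),
    ]
    failed_at = run_steps(steps, history)
    print("first failure:", failed_at)
    for entry in history:
        print(entry)
```

Stopping at the first failure keeps attention on one piece of the
network at a time, and the saved history is useful if the problem
reappears later.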