Chapter 12. Troubleshooting Strategies
While many of the tools described in this book are extremely
powerful, no one tool does everything. If you have been downloading
and installing these tools as you have read this book, you now have
an extensive, versatile set of tools. When faced with a problem, you
should be equipped to select the best tool or tools for the
particular job, augmenting your selection with other tools as needed.
This chapter outlines several strategies that show how these tools
can be used together. When troubleshooting, your approach should be
to look first at the specific task and then select the most
appropriate tool(s) based on the task. I do not describe the details
of using the tools or show output in this chapter. You should already
be familiar with these from the previous chapters. Rather, this
chapter focuses on the selection of tools and the overall strategy
you should take in using them. If you feel confident in your
troubleshooting skills, you may want to skip this chapter.
12.1. Generic Troubleshooting
Any troubleshooting task is
basically a series of steps. The actual steps you take will vary from
problem to problem. Later steps in the process may depend on the
results from earlier steps. Still, it is worth thinking about and
mapping out the steps since doing this will help you remain focused
and avoid needless steps. In watching others troubleshoot, I have
been astonished at how often people perform tests with no goal in
mind. Often the test has no relation to the problem at hand. It is
just something easy to do. When your car won't start, what is
the point of checking the air pressure of the tires?
For truly difficult problems, you will need to become formal and
systematic. What follows is a general series of steps you can work
through, along with a running example. Keep in mind that this set
of steps is only a starting point.
-
Document.
Before you do anything else, start documenting what you are doing.
This is a real test of willpower and self-discipline. It is extremely
difficult to force yourself to sit down and write a problem
description or take careful notes when your system is down or
crackers are running rampant through your system.[41] This is not just you; everyone has this problem. But it
is an essential step for several reasons.
Depending on your circumstances, management may require a written
report. Even if this isn't the usual practice, if an outage
becomes prolonged or if there are other consequences, it might become
necessary. This is particularly true if there are some legal
consequences of the problem. An accurate log can be essential in such
cases.
If you have a complex problem, you are likely to forget at some point
what you have actually done. This often means starting over. It can
be particularly frustrating if you appear to have found a solution,
but you can't remember exactly what you did. A seemingly
insignificant step may prove to be a key element in a solution.
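One way to lower the effort of note taking is to make each entry a single timestamped line. The following is a minimal sketch, not a prescribed format: the function names, the field layout, and the log path are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def log_entry(action, result, when=None):
    """Format one line of a troubleshooting log (hypothetical layout)."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y-%m-%d %H:%M:%SZ")
    return f"{stamp} | {action} | {result}"


def append_log(path, action, result):
    """Append one entry to the log file at path (any writable file)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(log_entry(action, result) + "\n")
```

Calling something like append_log("outage.log", "telnet from bsd1 to lnx1", "timed out") after every test leaves a record of what was actually done and in what order.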
-
Collect information and
identify symptoms. Strictly speaking, these are two steps, but they
are so intertwined that you usually can't separate them. You must
collect information while filtering that information
for indications of anomalous behavior. These two steps will be
repeated throughout the troubleshooting process. This is easiest when
you have a clear sense of direction.
As you identify symptoms, try to expand and clarify the problem. If
the problem was reported by someone else, then you will want to try
to recreate the problem so that you can observe the symptoms
directly. Keep in mind that if you can't recognize normal behavior,
you won't be able to recognize anomalous behavior. This has
been a recurring theme in this book and a reason you should learn how
to use these tools before you need them.
As an example, the first indication of a problem might be a user
complaining that she cannot telnet from host
bsd1 to host lnx1. To
expand and clarify the problem, you might try different applications.
Can you connect using ftp? You might look to
see if bsd1 and lnx1 are on
the same network or different networks. You might see if
lnx1 can reach bsd1. You
might include other local and remote hosts to see the extent of the
problem.
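Trying different applications against several hosts can be sketched as a small survey of TCP connections. This is only an illustration under stated assumptions: it tests from the machine running the script (so it would have to be run on bsd1 to reproduce the user's view), it checks only that a TCP connection opens, and the function names are hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-lookup failures.
        return False


def survey(hosts, ports, probe=can_connect):
    """Probe every host/port pair and return {(host, port): reachable}."""
    return {(h, p): probe(h, p) for h in hosts for p in ports}
```

For the running example, survey(["lnx1"], [23, 21]) would check the telnet and ftp ports on lnx1. The probe argument exists so that the survey logic can be exercised without touching the network.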
-
Define the problem. Once you
have a clear picture of the symptoms, you can begin coming to terms
with the problem. This is not the same as identifying the symptoms;
it is the process of combining the symptoms and making
generalizations. You are looking for common elements that allow you to
succinctly describe the anomalous behavior of a system.
Your problem definition may go through several refinements.
Continuing with the previous problem, you might, over time, generate
the following series of problem definitions:
-
bsd1 can't telnet to
lnx1.
-
bsd1 can't connect to
lnx1.
-
bsd1 can't connect to
lnx1, but lnx1 can connect
to other hosts including bsd1.
-
Hosts on the same network as lnx1 can't
connect to lnx1.
-
Hosts on the same network as lnx1 can't
connect to lnx1, but hosts on remote networks
can connect to lnx1.
(Yes, this was a real problem, and no, I didn't get that last
one backward.)
It is natural to try to define the problem as quickly as possible,
but you shouldn't be too tied to your definition. Try to keep
an open mind and be willing to redefine your problem as your
information changes.
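Looking for common elements in the observations can be made concrete by grouping them. The sketch below assumes you have recorded, for each source host, whether it could connect to the target; the addresses and the network prefix in the usage note are invented for illustration and are not taken from the actual incident.

```python
import ipaddress


def summarize_by_network(target_network, observations):
    """Count (successes, attempts) for local and for remote sources.

    observations is an iterable of (source_address, connected) pairs;
    a source is "local" if it lies inside target_network.
    """
    network = ipaddress.ip_network(target_network)
    summary = {"local": [0, 0], "remote": [0, 0]}
    for source, connected in observations:
        key = "local" if ipaddress.ip_address(source) in network else "remote"
        summary[key][0] += int(connected)
        summary[key][1] += 1
    return {key: tuple(counts) for key, counts in summary.items()}
```

A summary in which every local source fails and every remote source succeeds corresponds to the last problem definition in the series above.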
-
Identify systems or subsystems
involved. As you collect information, as seen in the previous
example, you will define and refine not only the nature of the
problem, but also the scope of the problem. This is the step in which
we divide and hopefully conquer our problem.
In this example, we have worked outward from one system to include a
number of systems. Usually troubleshooting tries to narrow the scope
of the problem, but as seen from this example, in networking just the
opposite may happen. You must discover the full scope of the problem
before you can narrow your focus. In this running example, realizing
that remote hosts could connect was a key discovery.
-
Develop a testable
hypothesis. Of course, what you can test will depend on what tools
you have, which is the rationale for this book. But don't let tools
drive your approach. As you define and continually refine the
problem, you will generate hypotheses as to the cause or
nature of the problem. Such generalizations are relatively worthless
unless they can be verified. (Remember those lectures on the
scientific method in high school?) In this sense, developing a set of
tests is more important than having an exact definition of a problem.
In many instances, if you know the source of the problem, you can
correct it without fully understanding the problem. For example, if
you know an Ethernet card is failing, you can replace it without ever
worrying about which chip on the card malfunctioned. I'm not
suggesting that you don't want to understand the problem, but
that there are levels of understanding. Your hypotheses must be
guided by what you can test. As in science, an untestable hypothesis
is worthless.
In general, you want tests that will reduce the size of the search
space (i.e., identify the subsystems involved), that are easy to apply,
that do not create further problems, and so on.
In our running example, a necessary first step in making a connection
is doing address resolution. This suggests that there might be some
problem with the ARP mechanism. Notice that this is not a full
hypothesis, but rather a point of further investigation. Having
expanded the scope of the problem, we are attempting to focus in on
subsystems to reduce the problem. Also notice that I haven't
used any fancy tools up to this point. Keep it simple as long as you
can.
-
Select and apply tests. Not all tests are created equal. Some will
be much easier to apply, while others will provide more information.
Determining the optimal order for a set of tests is largely a
judgment call. Clearly, the simple tests that answer questions
decisively are the best.
Returning to our example, there are several ways we could investigate
whether the ARP mechanism is functioning correctly. One way would be
to use tcpdump or ethereal
to capture traffic on the network to see if the ARP requests and
responses are present. A simpler test, however, is to use the
arp command to see if the appropriate entries
are in the ARP cache on the hosts that are trying to connect to
lnx1. In this instance, it was observed that the
entries were missing from all the hosts attempting to connect to
lnx1. The exception was the router on the
network, which had a much longer cache timeout than the local
hosts did. This also explained why remote hosts could connect but local
hosts could not. The remote hosts always went through the
router, which still had the Ethernet address cached and so bypassed
the ARP mechanism. Note that this was not a definitive test but was
done first because it was much easier.
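Checking the ARP cache on several hosts can be scripted. The sketch below assumes the cache is listed with arp -an and that each resolved entry contains "(address) at hardware-address", which is common on BSD and Linux systems but not guaranteed; the output format varies by platform, and the function names and sample addresses are hypothetical.

```python
import re
import subprocess

# Matches "(172.16.2.1) at 00:10:7b:66:f7:62"; incomplete entries do not match.
ARP_LINE = re.compile(
    r"\((\d{1,3}(?:\.\d{1,3}){3})\) at "
    r"((?:[0-9a-fA-F]{1,2}:){5}[0-9a-fA-F]{1,2})"
)


def parse_arp_cache(text):
    """Return {ip_address: hardware_address} for resolved entries in text."""
    return {m.group(1): m.group(2).lower() for m in ARP_LINE.finditer(text)}


def read_arp_cache():
    """Run arp -an on the local host and parse its output."""
    out = subprocess.run(["arp", "-an"], capture_output=True, text=True)
    return parse_arp_cache(out.stdout)


def has_entry(cache, ip_address):
    """True if the cache holds a resolved entry for ip_address."""
    return ip_address in cache
```

Running read_arp_cache() on each host that is trying to reach lnx1 and testing for lnx1's address with has_entry() reproduces the simple test described above.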
-
Assess results. As you
perform tests, you will need to assess the results, refine your
tests, and repeat the process. You will want new tests that confirm
your results. This is clearly an iterative process.
With our extended example, two additional tests were possible. One
was to manually add the address of lnx1 to
bsd1's ARP table using the
arp command. When this was done, connectivity
was restored. When the entry was deleted, connectivity was lost. A
more revealing but largely unnecessary test, using packet-capture
software to watch the exchange of packets between
bsd1 and lnx1, showed that
bsd1's ARP requests were being ignored by
lnx1.
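The two follow-up tests can be written down as a short plan before anything is run. This sketch only builds the command lines; it does not execute them. The address, hardware address, and interface name in the usage note are placeholders, and the commands themselves normally require root privileges.

```python
import shlex


def static_arp_test_plan(host, mac, interface):
    """Return the commands for the add/delete ARP test as strings."""
    steps = [
        # Add a static entry; connectivity should be restored.
        ["arp", "-s", host, mac],
        # Delete the entry again; connectivity should be lost.
        ["arp", "-d", host],
        # Optionally watch ARP traffic to and from the host.
        ["tcpdump", "-n", "-i", interface, "arp", "and", "host", host],
    ]
    return [shlex.join(step) for step in steps]
```

For example, static_arp_test_plan("172.16.2.232", "00:10:7b:66:f7:62", "eth0") yields three command lines that can be pasted into the troubleshooting log before they are run, which keeps the test and its record together.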
-
Develop and assess solutions.
Once you have clearly identified the problem, you must develop and
assess possible solutions. With many problems, there will be several
possible solutions to consider. You should not implement a
solution until you have thought through the consequences. With
lnx1, solutions ranged from rebooting the system
to reinstalling software. I chose the simplest first and rebooted the
system.
-
Implement and evaluate your solution.
Once you have decided on a solution and have implemented it, you
should confirm the proper operation of your system. Depending on the
scope of the changes needed, this may mean extensive testing of the
system and all related systems.
With our running problem, this was not necessary. Connectivity was
fully restored when the system was rebooted. What caused the problem?
That was never fully resolved, but since the problem never recurred,
it really isn't an issue.
If restarting the system hadn't solved the problem, what would
have been the next step? In this case, the likely problem was
corrupted system software. If you are running an integrity checker
like tripwire, you might try locating anything
that has changed and doing a selective reinstallation. Otherwise, you
may be faced with reinstalling the operating system.
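The idea behind locating anything that has changed can be illustrated with a hash comparison. This is not how tripwire is configured or run; it is a simplified sketch of the general technique, assuming a baseline of file digests was recorded while the system was known to be good. All names here are hypothetical.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Record a digest for each path: {path: digest}."""
    return {str(p): file_digest(p) for p in paths}


def changed_files(baseline, current):
    """List baseline paths whose digest differs or that are now missing."""
    return sorted(p for p in baseline if current.get(p) != baseline[p])
```

Files reported by changed_files() are the candidates for a selective reinstallation. A real integrity checker also protects the baseline itself from tampering, which this sketch does not attempt.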
One last word of warning. It is often tempting to seize on an overly
complex explanation and ignore simpler explanations. Frequently,
problems really are complex, but not always. It is worth asking
yourself if there is a simpler solution. Often, this will save a
tremendous amount of time.