Chapter 12. Troubleshooting Strategies

While many of the tools described in this book are extremely powerful, no one tool does everything. If you have been downloading and installing these tools as you have read this book, you now have an extensive, versatile set of tools. When faced with a problem, you should be equipped to select the best tool or tools for the particular job, augmenting your selection with other tools as needed.

This chapter outlines several strategies that show how these tools can be used together. When troubleshooting, your approach should be to look first at the specific task and then select the most appropriate tool(s) based on the task. I do not describe the details of using the tools or show output in this chapter. You should already be familiar with these from the previous chapters. Rather, this chapter focuses on the selection of tools and the overall strategy you should take in using them. If you feel confident in your troubleshooting skills, you may want to skip this chapter.

12.1. Generic Troubleshooting

Any troubleshooting task is basically a series of steps. The actual steps you take will vary from problem to problem. Later steps in the process may depend on the results from earlier steps. Still, it is worth thinking about and mapping out the steps since doing this will help you remain focused and avoid needless steps. In watching others troubleshoot, I have been astonished at how often people perform tests with no goal in mind. Often the test has no relation to the problem at hand. It is just something easy to do. When your car won't start, what is the point of checking the air pressure of the tires?

For truly difficult problems, you will need to become formal and systematic. A somewhat general, standard series of steps you can go through follows, along with a running example. Keep in mind, this set of steps is only a starting point.

Document. Before you do anything else, start documenting what you are doing. This is a real test of willpower and self-discipline. It is extremely difficult to force yourself to sit down and write a problem description or take careful notes when your system is down or crackers are running rampant through your system.[41] This is not just you; everyone has this problem. But it is an essential step for several reasons.
[41] Compromised hosts are a special problem requiring special responses. Documentation can be absolutely essential, particularly if you are contemplating legal action or have liability concerns. Documentation used in legal actions has special requirements. For more information you might look at Simson Garfinkel and Gene Spafford's Practical UNIX & Internet Security or visit http://www.cert.org/nav/recovering.html.

Depending on your circumstances, management may require a written report. Even if this isn't the usual practice, if an outage becomes prolonged or if there are other consequences, it might become necessary. This is particularly true if there are some legal consequences of the problem. An accurate log can be essential in such cases.
If you have a complex problem, you are likely to forget at some point what you have actually done. This often means starting over. It can be particularly frustrating if you appear to have found a solution, but you can't remember exactly what you did. A seemingly insignificant step may prove to be a key element in a solution.
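One low-effort way to keep such a record is to append a timestamped line for every step you take. The following is only a minimal sketch; the function names, the log format, and the file name in the usage note are assumptions made for illustration, not something prescribed in this book.

```python
from datetime import datetime, timezone


def log_entry(action, result, now=None):
    """Return one timestamped line describing a step and its outcome."""
    moment = now or datetime.now(timezone.utc)
    stamp = moment.strftime("%Y-%m-%d %H:%M:%SZ")
    return f"{stamp} | {action} | {result}"


def append_log(path, action, result):
    """Append a single entry to a plain-text log file (hypothetical path)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(log_entry(action, result) + "\n")
```

A call such as append_log("outage.log", "telnet from bsd1 to lnx1", "no response") costs a few seconds and preserves the exact order of what was tried.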
Collect information and identify symptoms. Actually, these are two steps, but they are so intertwined that you usually can't separate them. You must collect information while filtering that information for indications of anomalous behavior. These two steps will be repeated throughout the troubleshooting process. This is easiest when you have a clear sense of direction.
As you identify symptoms, try to expand and clarify the problem. If the problem was reported by someone else, then you will want to try to recreate the problem so that you can observe the symptoms directly. Keep in mind, if you can't recognize normal behavior, you won't be able to recognize anomalous behavior. This has been a recurring theme in this book and a reason you should learn how to use these tools before you need them.
As an example, the first indication of a problem might be a user complaining that she cannot telnet from host bsd1 to host lnx1. To expand and clarify the problem, you might try different applications. Can you connect using ftp? You might look to see if bsd1 and lnx1 are on the same network or different networks. You might see if lnx1 can reach bsd1. You might include other local and remote hosts to see the extent of the problem.
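Trying several applications against the same host can be scripted as a series of plain TCP connection attempts. The sketch below assumes the conventional ports for telnet (23) and ftp (21); the function names are invented for illustration, and a successful TCP connection shows only that the port accepts connections, not that the service behind it works correctly.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_services(host, ports, timeout=3.0):
    """Map each service name to whether its port accepted a connection."""
    return {name: can_connect(host, port, timeout) for name, port in ports.items()}


# Example call, run from bsd1 (not executed here):
#   probe_services("lnx1", {"telnet": 23, "ftp": 21})
```

Running the same probe from several local and remote hosts gives a quick picture of the extent of the problem.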
Define the problem. Once you have a clear idea, you can begin coming to terms with the problem. This is not the same as identifying the symptoms but is the process of combining the symptoms and making generalizations. You are looking for common elements that allow you to succinctly describe the anomalous behavior of a system.
Your problem definition may go through several refinements. Continuing with the previous problem, you might, over time, generate the following series of problem definitions:
- bsd1 can't telnet to lnx1.
- bsd1 can't connect to lnx1.
- bsd1 can't connect to lnx1, but lnx1 can connect to other hosts including bsd1.
- Hosts on the same network as lnx1 can't connect to lnx1.
- Hosts on the same network as lnx1 can't connect to lnx1, but hosts on remote networks can connect to lnx1.
(Yes, this was a real problem, and no, I didn't get that last one backward.)
It is natural to try to define the problem as quickly as possible, but you shouldn't be too tied to your definition. Try to keep an open mind and be willing to redefine your problem as your information changes.
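The last refinement in the list above came from noticing that failures clustered by network location. If you have recorded which source hosts could and could not connect, that grouping can be computed rather than spotted by eye. The following is a minimal sketch using Python's ipaddress module; the function name and the addresses in the usage note are hypothetical.

```python
import ipaddress


def summarize_by_locality(target_network, results):
    """Count successes and failures for local and remote source hosts.

    target_network: the network the target host is on, e.g. "172.16.2.0/24".
    results: maps each source IP address to True (connected) or False.
    """
    net = ipaddress.ip_network(target_network)
    summary = {"local": {"ok": 0, "failed": 0},
               "remote": {"ok": 0, "failed": 0}}
    for source, reachable in results.items():
        where = "local" if ipaddress.ip_address(source) in net else "remote"
        summary[where]["ok" if reachable else "failed"] += 1
    return summary
```

With hypothetical results in which two hosts on 172.16.2.0/24 fail and two hosts elsewhere succeed, the summary shows every local attempt failing and every remote attempt succeeding, which is the pattern described in the final problem definition.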
Identify systems or subsystems involved. As you collect information, as seen in the previous example, you will define and refine not only the nature of the problem, but also the scope of the problem. This is the step in which we divide and hopefully conquer our problem.
In this example, we have worked outward from one system to include a number of systems. Usually troubleshooting tries to narrow the scope of the problem, but as seen from this example, in networking just the opposite may happen. You must discover the full scope of the problem before you can narrow your focus. In this running example, realizing that remote hosts could connect was a key discovery.
Develop a testable hypothesis. Of course, what you can test will depend on what tools you have, which is the rationale for this book. But don't let tools drive your approach. With the definition of the problem and its continual refinement comes the generation of hypotheses as to the cause or nature of the problem. Such generalizations are relatively worthless unless they can be verified. (Remember those lectures on the scientific method in high school?) In this sense, developing a set of tests is more important than having an exact definition of a problem. In many instances, if you know the source of the problem, you can correct it without fully understanding the problem. For example, if you know an Ethernet card is failing, you can replace it without ever worrying about which chip on the card malfunctioned. I'm not suggesting that you don't want to understand the problem, but that there are levels of understanding. Your hypotheses must be guided by what you can test. As in science, an untestable hypothesis is worthless.
In general, you want tests that will reduce the size of the search space (i.e., identify the subsystems involved), that are easy to apply, that do not create further problems, and so on.
In our running example, a necessary first step in making a connection is doing address resolution. This suggests that there might be some problem with the ARP mechanism. Notice that this is not a full hypothesis, but rather a point of further investigation. Having expanded the scope of the problem, we are attempting to focus in on subsystems to reduce the problem. Also notice that I haven't used any fancy tools up to this point. Keep it simple as long as you can.
Select and apply tests. Not all tests are created equal. Some will be much easier to apply, while others will provide more information. Determining the optimal order for a set of tests is largely a judgment call. Clearly, the simple tests that answer questions decisively are the best.
Returning to our example, there are several ways we could investigate whether the ARP mechanism is functioning correctly. One way would be to use tcpdump or ethereal to capture traffic on the network to see if the ARP requests and responses are present. A simpler test, however, is to use the arp command to see if the appropriate entries are in the ARP cache on the hosts that are trying to connect to lnx1. In this instance, it was observed that the entries were missing from all the hosts attempting to connect to lnx1. The exception was the router on the network, which had a much longer cache timeout than did the local hosts. This also explained why remote hosts could connect but local hosts could not. The remote hosts always went through the router, which had cached the Ethernet address, bypassing the ARP mechanism. Note that this was not a definitive test but was done first because it was much easier.
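Checking the cache on several hosts is easier if the output of the arp command is reduced to a simple lookup table. The sketch below assumes the line layout commonly printed by arp -an, that is, an IP address in parentheses followed by "at" and a hardware address. The exact layout differs between systems, and the sample text and addresses are invented for illustration.

```python
import re

# Matches "(ip) at mac"; incomplete entries have no hardware address
# and are therefore skipped.
ARP_LINE = re.compile(
    r"\((\d{1,3}(?:\.\d{1,3}){3})\) at "
    r"([0-9A-Fa-f]{1,2}(?::[0-9A-Fa-f]{1,2}){5})"
)


def parse_arp_cache(text):
    """Return a dict mapping IP address to hardware address."""
    return {m.group(1): m.group(2).lower() for m in ARP_LINE.finditer(text)}


# Hypothetical output; real text could be captured with
#   subprocess.run(["arp", "-an"], capture_output=True, text=True).stdout
SAMPLE = """\
? (172.16.2.1) at 00:10:7b:66:f7:62 [ether] on eth0
? (172.16.2.236) at <incomplete> on eth0
? (172.16.2.13) at 0:10:5a:2c:9e:11 on xl0 [ethernet]
"""

cache = parse_arp_cache(SAMPLE)
missing = "172.16.2.236" not in cache  # no usable entry for this host
```

A host that should have been contacted recently but has no entry, or only an incomplete one, points toward address resolution as the place to look next.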
Assess results. As you perform tests, you will need to assess the results, refine your tests, and repeat the process. You will want new tests that confirm your results. This is clearly an iterative process.
With our extended example, two additional tests were possible. One was to manually add the address of lnx1 to bsd1's ARP table using the arp command. When this was done, connectivity was restored. When the entry was deleted, connectivity was lost. A more informative but largely unnecessary test, using packet-capture software to watch the exchange of packets between bsd1 and lnx1, showed that bsd1's ARP requests were being ignored by lnx1.
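The add-and-delete test can be written down as a pair of commands so that it is repeatable and easy to record. The sketch below uses the -s and -d options of the arp command to set and delete an entry. The helper names, the dry-run default, and the addresses are assumptions for illustration, and actually changing the ARP table normally requires root privileges.

```python
import subprocess


def arp_add_cmd(ip, mac):
    """Command that adds a static ARP entry for ip."""
    return ["arp", "-s", ip, mac]


def arp_delete_cmd(ip):
    """Command that deletes the ARP entry for ip."""
    return ["arp", "-d", ip]


def run(cmd, dry_run=True):
    """Return the command line; execute it only when dry_run is False."""
    if not dry_run:
        subprocess.run(cmd, check=True)
    return " ".join(cmd)


# Hypothetical addresses for lnx1; printed rather than executed by default.
plan = [
    run(arp_add_cmd("172.16.2.236", "00:10:5a:11:22:33")),
    run(arp_delete_cmd("172.16.2.236")),
]
```

Testing connectivity after each of the two commands reproduces the restored-then-lost pattern described above.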
Develop and assess solutions. Once you have clearly identified the problem, you must develop and assess possible solutions. With many problems, there will be several possible solutions to consider. You should not hastily implement a solution until you have thought out the consequences. With lnx1, solutions ranged from rebooting the system to reinstalling software. I chose the simplest first and rebooted the system.
Implement and evaluate your solution. Once you have decided on a solution and have implemented it, you should confirm the proper operation of your system. Depending on the scope of the changes needed, this may mean extensive testing of the system and all related systems.
With our running problem, this was not necessary. Connectivity was fully restored when the system was rebooted. What caused the problem? That was never fully resolved, but since the problem never recurred, it really isn't an issue.
If restarting the system hadn't solved the problem, what would have been the next step? In this case, the likely problem was corrupted system software. If you are running an integrity checker like tripwire, you might try locating anything that has changed and do a selective reinstallation. Otherwise, you may be faced with reinstalling the operating system.
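The idea behind locating what has changed can be illustrated with a simple comparison of file hashes against a baseline recorded while the system was known to be good. This is a simplified sketch of the general technique, not a description of how tripwire itself works; the function names are invented, and a real integrity checker also protects its baseline from tampering.

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(paths):
    """Map each existing path to its digest; missing files are left out."""
    digests = {}
    for path in paths:
        try:
            digests[str(path)] = file_digest(path)
        except FileNotFoundError:
            pass
    return digests


def changed_files(baseline, current):
    """Return baseline paths whose digest differs or that have disappeared."""
    return sorted(p for p in baseline if current.get(p) != baseline[p])
```

The files reported by changed_files are the candidates for a selective reinstallation.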

One last word of warning. It is often tempting to seize on an overly complex explanation and ignore simpler explanations. Frequently, problems really are complex, but not always. It is worth asking yourself if there is a simpler solution. Often, this will save a tremendous amount of time.
