12.2. Task-Specific Troubleshooting

The guidelines just given are a general or generic overview of troubleshooting. Of course, each problem will be different, and you will need to vary your approach as appropriate. The remainder of this chapter consists of guidelines for a number of the more common troubleshooting tasks you might face. It is hoped that these will give you further insight into the process.
12.2.1. Installation Testing

Ironically, one of the best ways to save time and avoid troubleshooting is to take the time to do a thorough job of testing when you install software or hardware. You will be testing the system when you are most familiar with the installation process, and you will avoid disruptions to service that can happen when a problem isn't discovered until the software or hardware is in use.
This is a somewhat broad interpretation of troubleshooting, but in my experience, there is very little difference between the testing you do when you install software and the testing you do when you encounter a problem. For most people, the only real difference is the scope of the testing. Most people will test until they believe that a system is working correctly and then stop. Where that point falls varies: failures, particularly repeated failures, tend to leave you skeptical and thorough, while a fresh installation tends to make people overly optimistic.
12.2.1.1. Firewall testing

Because of the complexities, firewall testing is an excellent example of the problems that installation testing may present. Troubleshooting a firewall is a demanding task for several reasons. First, to avoid disruptions in service, initial firewall testing should be done in an isolated environment before moving on to a production environment.
Second, you need to be very careful to develop an appropriate set of tests so that you don't leave gaping holes in your security. You'll need to go through the firewall rule by rule. You won't be able to check every possibility, but you should be able to test each general type of traffic. For example, consider a rule that passes HTTP traffic to your web server. You will want to pass traffic to port 80 on that server. If you are taking the approach of denying all traffic that is not explicitly permitted, you will also want to block traffic to that host at all other ports, and to block traffic to port 80 on other hosts. Thus, you should develop a set of three tests for this one rule. Although there will be some duplicated tests, you'll want to take the same approach for each rule. Developing an explicit set of tests is the key step in this type of testing.
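The three-test pattern just described can be sketched as a small table of probes plus an expected-outcome check. Everything here is illustrative: the addresses come from the 192.0.2.0/24 documentation range, and the default-deny policy is the one assumed in the text, not a particular firewall's syntax.

```python
# Sketch: turning one firewall rule into an explicit test list.
# Assumed policy: deny everything not explicitly permitted; the only
# rule under test passes TCP port 80 to the web server.

WEB_SERVER = "192.0.2.10"   # hypothetical web server address
OTHER_HOST = "192.0.2.20"   # some other internal host

def expected(dst, port):
    """What the policy says should happen to a probe."""
    if dst == WEB_SERVER and port == 80:
        return "pass"
    return "block"           # default-deny covers everything else

# The three probes discussed in the text:
tests = [
    (WEB_SERVER, 80),   # traffic the rule should pass
    (WEB_SERVER, 23),   # same host, another port -> should be blocked
    (OTHER_HOST, 80),   # port 80 on another host -> should be blocked
]

for dst, port in tests:
    print(dst, port, expected(dst, port))
```

Writing the expectations down this way, before touching the firewall, gives you something concrete to compare observed behavior against.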
If you doubt the need for this last test, read RFC 3093, a slightly tongue-in-cheek description of how to use port 80 to bypass a firewall.

The first step in testing a firewall is to test the environment in which the firewall will function, without the firewall. It can be extraordinarily frustrating to try to debug anomalous firewall behavior only to discover that you had a routing problem before you began. Thus, the first thing you will want to do is turn off any filtering and test your routing. You could use tools like ripquery to retrieve routing tables and examine entries, but it is probably much simpler to use ping to check connectivity, assuming ICMP ECHO_REQUEST packets aren't being blocked. (If they are, you might try tools like nmap or hping.)
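When ICMP is blocked and tools like nmap or hping aren't at hand, even a plain TCP connect answers the basic reachability question. This sketch is a stand-in for ping, not a replacement for it; the throwaway loopback listener exists only to exercise the function.

```python
import socket
import socketserver
import threading

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demonstration against a throwaway listener on the loopback interface:
class _Quiet(socketserver.BaseRequestHandler):
    def handle(self):
        pass

server = socketserver.TCPServer(("127.0.0.1", 0), _Quiet)
threading.Thread(target=server.serve_forever, daemon=True).start()
open_port = server.server_address[1]

print(tcp_reachable("127.0.0.1", open_port))  # True while the listener is up
```

Unlike ping, this only proves reachability of one port on one host, which is exactly the granularity firewall testing needs anyway.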
You'll also want to verify that all concomitant software is working. This includes all intrusion detection software, accounting and logging software, and your testing software itself. For example, you'll probably use packet capture software like tcpdump or ethereal to verify the operation of your firewall, and you'll want to make sure that capture software is working properly first. I hate to admit it, but I've started packet capture software on a host that I forgot was attached to a switch and banged my head wondering why I wasn't seeing anything. Clearly, if I had used this setup to make sure packets were blocked without first testing it, I could have been severely misled.
Test the firewall in isolation. If you are adding filtering to a production router, admittedly this is going to be a problem. The easiest way to test in isolation is to connect each interface to an isolated host that can both generate and capture packets. You might use hping, nemesis, or any of the other custom packet generation software discussed in Chapter 9, "Testing Connectivity Protocols". Work through each of your tests for each rule with the rule disabled and enabled. Be sure you explicitly document all your tests, particularly the syntax.
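The work-through-each-rule procedure lends itself to a small harness that records the exact syntax and outcome of every probe, with the rule disabled and then enabled. This sketch fakes the probe itself; in practice send_probe() would wrap hping, nemesis, or whatever generator you use, and the probe strings here are invented, not real tool syntax.

```python
# Sketch of a record-keeping harness for isolated firewall tests.

def run_suite(tests, send_probe):
    """tests: list of (description, probe_command) pairs.
    Runs every probe with the rule disabled and then enabled,
    returning a log of exactly what was sent and what happened."""
    log = []
    for rule_state in ("disabled", "enabled"):
        for desc, cmd in tests:
            outcome = send_probe(cmd, rule_state)
            log.append((rule_state, desc, cmd, outcome))
    return log

# A fake probe so the harness can run without a lab network:
def fake_probe(cmd, rule_state):
    if rule_state == "enabled" and "port=23" in cmd:
        return "blocked"
    return "passed"

suite = [
    ("http to web server", "send tcp dst=192.0.2.10 port=80"),
    ("telnet to web server", "send tcp dst=192.0.2.10 port=23"),
]
for entry in run_suite(suite, fake_probe):
    print(entry)
```

The log doubles as the explicit documentation of test syntax that the text recommends; save it alongside the rule set so the same tests can be replayed after every change.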
Once you are convinced that the firewall is working, it is time to move it online. If you can schedule offline testing, that is the best approach. Work through your tests again with and without the filters enabled. If offline testing isn't possible, you can still go through your tests with the filters enabled.
Finally, don't forget to come back and go through these tests periodically. In particular, you'll want to reevaluate the firewall every time you change rules.
12.2.2. Performance Analysis and Monitoring

If a system simply isn't working, then you know troubleshooting is needed. But in many cases, it may not be clear that you even have a problem. Performance analysis is often the first step to getting a handle on whether your system is functioning properly. And it is often the case that careful performance analysis will identify the problem so that no further troubleshooting is needed.
Performance analysis is another management task that hinges on collecting information. It is a task that you will never complete, and it is important at every stage in the system's life cycle. The most successful network administrator will take a proactive approach, addressing issues before they become problems. Chapter 7, "Device Monitoring with SNMP" and Chapter 8, "Performance Measurement Tools" discussed the use of specific tools in greater detail.
For planning, performance analysis is used to compare systems, establish system requirements, and do capacity planning and forecasting. For management, it provides guidance in configuring and tuning the system. In particular, the identification of bottlenecks can be essential for management, planning, and troubleshooting.
There are three general approaches to performance analysis -- analytical modeling, simulations, and measurement. Analytical models are mathematical models usually based on queuing theory. Simulations are computer models that attempt to mimic the behavior of the system through computer programs. Measurement is, of course, the collection of data from an existing network. This book has focused primarily on measurement (although simulation tools were mentioned in Chapter 9, "Testing Connectivity Protocols").
Each approach has its role. In practice, there can be a considerable overlap in using these approaches. Analytical models can serve as the basis for simulations, or direct measurements may be needed to supply parameters used with analytical models or simulations.
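As a small concrete illustration of the analytical approach, a link can be modeled as an M/M/1 queue from elementary queuing theory: packets arrive at rate lam and the link serves mu packets per second. The rates below are invented for illustration, and real traffic rarely matches the model's assumptions exactly.

```python
# Sketch: the simplest analytical model, an M/M/1 queue, applied to a
# link. lam = arrival rate (packets/s), mu = service rate (packets/s).

def mm1(lam, mu):
    assert lam < mu, "queue is unstable if arrivals outpace service"
    rho = lam / mu               # utilization
    delay = 1.0 / (mu - lam)     # mean time in system (queueing + service)
    qlen = rho / (1.0 - rho)     # mean number of packets in the system
    return rho, delay, qlen

rho, delay, qlen = mm1(lam=400.0, mu=1000.0)
print(rho)    # 0.4 -- the link is 40% utilized
print(delay)  # about 1.7 ms per packet
print(qlen)   # about 0.67 packets in the system on average
```

Even this toy model shows the characteristic nonlinearity: as lam approaches mu, delay and queue length blow up, which is why a link can feel fine at 40% utilization and terrible at 95%.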
Measurement has its limitations. Obviously, the system must exist before measurements can be made, so measurement may not be a viable tool for planning. Measurements tend to produce the most variable results, and many things can go wrong with them. On the positive side, measurement carries a great deal of authority with most people. When you say you have measured something, it is treated as irrefutable evidence by many, often unjustifiably.
12.2.2.1. General steps

Measuring performance is something of an art. It is much more difficult to decide what to measure and how to make the actual measurements than it might appear at first glance. And there are many ways to waste time collecting data that will not be useful for your purposes.
What follows is a fairly informal description of the steps involved in performance analysis. As I said before, listing the steps can be very helpful in focusing attention on some parts of the process that might otherwise be ignored. Of course, every situation is different, so these steps are only an approximation. Designing performance analysis tests is an iterative process. You should go back through these steps as you proceed, refining each step as needed.
If you would like a more complete discussion of the steps in performance analysis, you should get Raj Jain's exceptional book, The Art of Computer Systems Performance Analysis. Jain's book considers performance analysis from a broader perspective than this book.
12.2.2.2. Bottleneck analysis

Networks are composed of a number of pieces, and if those pieces are not well matched, overall performance may be limited by the behavior of a single component. Bottleneck analysis is the process of identifying that component.
When looking at performance, you'll need to be sure you get a complete picture. Generally, one bottleneck will dominate performance statistics. Many systems, however, will have multiple bottlenecks. It's just that one bottleneck is a little worse than the others. Correcting one bottleneck will simply shift the problem -- the bottleneck will move from one component to another. When doing performance monitoring, your goal should be to discover as many bottlenecks as possible.
Often identifying a bottleneck is easy. Once you have a clear picture of your network's architecture, topology, and uses, bottlenecks will be obvious. For example, if 90% of your network traffic is to the Internet and you have a gigabit backbone and a 56-Kbps WAN connection, you won't need a careful analysis to identify your bottleneck.
Identifying bottlenecks is process dependent. What may be a bottleneck for one process may not be a problem for another. For example, if you are moving small files, the delay in making a connection will be the primary bottleneck. If you are moving large files, the speed of the link may be more important.
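The small-file versus large-file distinction can be made concrete with a little arithmetic: total transfer time is connection setup plus size divided by bandwidth. The setup delay and link speed below are invented numbers, not measurements.

```python
# Worked example of a process-dependent bottleneck.

def transfer_time(size_bytes, setup_s, bandwidth_bps):
    """Seconds to move a file: connection setup + serialization."""
    return setup_s + (size_bytes * 8) / bandwidth_bps

SETUP = 0.5           # 500 ms to establish the connection (assumed)
LINK = 10_000_000     # 10 Mbit/s link (assumed)

small = transfer_time(10_000, SETUP, LINK)       # 10 KB file
large = transfer_time(100_000_000, SETUP, LINK)  # 100 MB file
print(small)  # 0.508 s -- about 98% of it is connection setup
print(large)  # 80.5 s -- setup is negligible; the link dominates
```

The same network, the same two components, but the bottleneck moves depending on what the process is doing.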
Bottleneck analysis is essential in planning because it will tell you what improvements will provide the greatest benefit to your network. The only real way to escape bottlenecks is to grossly overengineer your network, not something you'll normally want to do. Thus, your goal should not be to completely eliminate bottlenecks but to minimize their impact to the point that they don't cause any real problems. Upgrading the network in a way that doesn't address bottlenecks will provide very little benefit to the network. If the bottlenecks on your network are a slow WAN connection and slow servers, upgrading from Fast Ethernet to Gigabit Ethernet will be a foolish waste of money. The key consideration here is utilization. If you are seeing 25% utilization with Fast Ethernet, don't be surprised to see utilization drop below 3% with Gigabit Ethernet. But you should be aware that even if the utilization is low, increasing the capacity of a line will shorten download times for large files. Whether this is worthwhile will depend on your organization's mission and priorities.
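The utilization figures in that example follow directly from the definition: utilization is offered load divided by capacity, so the same load on a tenfold-faster link shows a tenfold-lower utilization.

```python
# The arithmetic behind the Fast Ethernet -> Gigabit example.

def utilization(load_bps, capacity_bps):
    return load_bps / capacity_bps

load = 0.25 * 100e6          # 25% of Fast Ethernet = 25 Mbit/s offered load
print(utilization(load, 100e6))  # 0.25  on Fast Ethernet
print(utilization(load, 1e9))    # 0.025 on Gigabit Ethernet
```

A utilization number only means something relative to the capacity it was measured against, which is why the upgrade "improves" the statistic without necessarily improving anything users notice.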
Here is a rough outline of the steps you might go through to identify a bottleneck:
12.2.2.3. Capacity planning

Capacity planning is an extremely important task. Done correctly, it is also an extremely complex and difficult one, both to learn and to do. But this shouldn't keep you from attempting it. The description here can best be described as a crude, first-order approximation of capacity planning. But it will give you a place to start while you are learning.
Capacity planning is really an umbrella that describes several closely related activities. Capacity management is the process of allocating resources in a cost-efficient way. It is concerned with the resources that you currently have. (As you might guess, this is closely related to bottleneck analysis.) Trend analysis is the process of looking at system performance over time, trying to identify how it has changed in the past with the goal of predicting future changes. Capacity planning attempts to combine capacity management and trend analysis. The goal is to predict future needs to provide for effective planning.
The basic steps are fairly straightforward to describe, just difficult to carry out. First, decide what you need to measure. That means looking at your system in much the same way you did with bottleneck analysis but augmenting your analysis with anything you know about the future growth of your system. You'll need to think about your system in context to do this.
Next, select appropriate tools to collect the information you'll need. (mrtg and cricket are the most obvious tools among those described in this book, but there are a number of other viable tools if you are willing to do the work to archive the data.) With the tools in place, begin monitoring your system, recording and archiving appropriate data. Deciding what to keep and how to organize it is a tremendously difficult problem. Every situation is different, and each is largely a question of balancing the work involved in keeping the data organized and accessible against the likelihood that you will actually use it. Striking that balance is a judgment that comes only with experience.
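If you want a feel for what tools like mrtg do with the data they collect, the core computation is just turning successive counter samples into a rate. This sketch assumes a 32-bit SNMP-style octet counter and invented sample values; real tools also handle reboots, 64-bit counters, and data aging.

```python
# Sketch: rate from two readings of a wrapping 32-bit byte counter.

WRAP = 2**32

def rate_bps(prev_count, cur_count, interval_s):
    """Bits/s from two byte-counter readings taken interval_s apart."""
    delta = (cur_count - prev_count) % WRAP   # survives one 32-bit wrap
    return delta * 8 / interval_s

# Two samples taken five minutes apart:
print(rate_bps(1_000_000, 10_000_000, 300))   # 240000.0 bit/s
# The counter wrapped between samples:
print(rate_bps(WRAP - 500, 1_500, 300))       # 53.33... bit/s
```

Archiving these rates, rather than the raw counters, is what makes the long-term trend data usable later.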
Once you have the measurements, you will need to analyze them. In general, focus on areas that show the greatest change. Collecting and analyzing data will be an iterative process. If little is different from one measurement to the next, then collect data less frequently. When there is high variability, collect more often.
Finally, you'll make your predictions and adjust your system accordingly.
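A minimal sketch of that prediction step: fit a line to past utilization samples and project it forward. The monthly figures are invented, and a straight-line fit is only a first approximation; as the text goes on to note, it says nothing about changes that break the trend.

```python
# Sketch: least-squares trend fit over monthly utilization samples.

def linear_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

months = [0, 1, 2, 3, 4, 5]
util   = [20, 23, 25, 29, 31, 34]   # percent utilization, by month
m, b = linear_fit(months, util)
print(round(m, 2))                  # 2.8 -- percentage points per month
forecast = m * 12 + b               # projected utilization at month 12
print(round(forecast, 1))           # 53.6
```

Even this crude projection answers a useful question: roughly when the line crosses whatever utilization threshold you consider a problem, which is when the upgrade needs to be in place.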
There are a number of difficulties in capacity planning. Perhaps the greatest difficulty comes with unanticipated, fundamental changes in the way your network is used. If you will be offering new services, predictions based on trends that predate these services will not adequately predict new needs. For example, if you are introducing new technologies such as Internet telephony or video, trend analysis before the fact will be of limited value. There is a saying that you can't predict how many people will use a bridge by counting how many people are currently swimming across the river. In that situation, about the best you can do is look to others who have built similar bridges over similar rivers.
Another closely related problem is differential growth. If your network, like most, provides a variety of different services, then they are probably growing at different rates. This makes it very difficult to predict aggregate performance or need if you haven't adequately collected data to analyze individual trends.
Yet another difficulty is motivation. The key to trend analysis is keeping adequate records, i.e., measuring and recording information in a way that makes it accessible and usable. This is difficult for many people since the records won't have much immediate utility. Their worth comes from being able to look back at them over time for trends. It is difficult to invest the time needed to collect and maintain this data when there will be no immediate return on the effort and when fundamental changes can destroy the utility of the data.
Copyright © 2002 O'Reilly & Associates. All rights reserved.