1.3. Troubleshooting and Management
Troubleshooting does not exist in
isolation from network management. How you manage your network will
determine in large part how you deal with problems. A proactive
approach to management can greatly simplify problem resolution. The
remainder of this chapter describes several important management
issues. Coming to terms with these issues should, in the long run,
make your life easier.
1.3.1. Documentation
As a new administrator, your first
step is to assess your existing resources and begin creating new
resources. Software sources, including the tools discussed in this
book, are described and listed in Appendix A, "Software Sources". Other
sources of information are described in Appendix B, "Resources and References".
The most important source of information is the local documentation
created by you or your predecessor. In a properly maintained network,
there should be some kind of log about the network, preferably with
sections for each device. In many networks, this log will be in an
abysmal state. Almost no one likes documenting or thinks he has the
time required to do it, so the log will be full of errors, out of date,
and incomplete. Local documentation should always be read with a healthy
degree of skepticism. But even incomplete, erroneous documentation,
if treated as such, may be of value. There are probably no
intentional errors, just careless mistakes and errors of omission.
Even flawed documentation can give you some sense of the history of
the system. Problems frequently occur due to multiple conflicting
changes to a system. Software that may have been only partially
removed can have lingering effects. Homegrown documentation may be
the quickest way to discover what may have been on the system.
While
the creation and maintenance of documentation may once have been
someone else's responsibility, it is now your responsibility.
If you are not happy with the current state of your documentation, it
is up to you to update it and adopt policies so the next
administrator will not be muttering about you the way you are
muttering about your predecessors.
There are a couple of sets of
standard documentation that, at a minimum, you will always want to
keep. One is purchase information, the other a change log. Purchase
information includes sales information, licenses, warranties, service
contracts, and related information such as serial numbers. An
inventory of equipment, software, and documentation can be very
helpful. When you unpack a system, you might keep a list of
everything you receive and date all documentation and software. (A
changeable rubber date stamp and ink pad can help with this last
task.) Manufacturers can do a poor job of distinguishing one version
of software and its documentation from the next. Dates can be helpful
in deciding which version of the documentation applies when you have
multiple systems or upgrades. Documentation has a way of ending up in
someone's personal library, never to be seen again, so a list
of what you should have can be very helpful at times.
Keep in mind, there are a number of
ways software can enter your system other than through purchase
orders. Some software comes through CD-ROM subscription services,
some comes in over the Internet, some is bundled with the operating
system, some comes in on a CD-ROM in the back of a book, some is
brought from home, and so forth. Ideally, you should have some
mechanism to track software. For example, for downloads from the
Internet, keep a log that records filenames, dates, and sources.
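One possible way to keep such a download log is a small script that appends a dated entry to a CSV file. This is only a minimal sketch: the file name `downloads.csv`, the column layout, and the function name `log_download` are assumptions made for illustration, not something prescribed here.

```python
import csv
from datetime import date
from pathlib import Path

# Hypothetical log location and column names; adjust to local conventions.
LOG_FILE = Path("downloads.csv")
FIELDS = ["date", "filename", "source"]


def log_download(filename, source, log_file=LOG_FILE, when=None):
    """Append one dated entry describing a downloaded file."""
    when = when or date.today()
    is_new = not log_file.exists()
    with log_file.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()  # write the header once, for a new log
        writer.writerow(
            {"date": when.isoformat(), "filename": filename, "source": source}
        )
```

A call such as `log_download("tool-1.0.tar.gz", "ftp://example.org/pub/")` (both values are placeholders) adds one line to the log. A plain text file kept by hand serves the same purpose; the point is that every entry carries a filename, a date, and a source.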
You should also keep a change log for each
major system. Record every significant change or problem you have
with the system. Each entry should be dated. Even if some entries no
longer seem relevant, you should keep them in your log. For instance,
if you have installed and later removed a piece of software on a
server, there may be lingering configuration changes that you are not
aware of that may come to haunt you years later. This is particularly
true if you try to reinstall the program, but it could also be true for
a new program.
Beyond
these two basic sets of documentation, you can divide the
documentation you need to keep into two general
categories -- configuration documentation and process
documentation. Configuration documentation statically describes a
system. It assumes that the steps involved in setting up the system
are well understood and need no further comments, i.e., that
configuration information is sufficient to reconfigure or reconstruct
the system. This kind of information can usually be collected at any
time. Ironically, for that reason, it can become so easy to put off
that it is never done.
Process documentation describes the steps involved in setting up a
device, installing software, or resolving a problem. As such, it is
best written while you are doing the task. This creates a different
set of collection problems. Here the stress from the task at hand
often prevents you from documenting the process.
The first question you must ask is what you want to keep. This may
depend on the circumstances and which tools you are using. Static
configuration information might include lists of IP addresses and
Ethernet addresses, network maps, copies of server configuration
files, switch configuration settings such as VLAN partitioning by
ports, and so on.
When dealing with a single device, the best approach is probably just
a simple copy of the configuration. This can be either printed or
saved as a disk file. This will be a personal choice based on which
you think is easiest to manage. You don't need to waste time
prettying this up, but be sure you label and date it.
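If you save configurations as disk files, the labeling and dating can be built into the copy itself. The following is a rough sketch under stated assumptions: the function name `archive_config` and the naming scheme `label_date_originalname` are invented for this example.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_config(config_path, archive_dir, label, when=None):
    """Copy a configuration file into archive_dir with a label and date in its name."""
    when = when or date.today()
    source = Path(config_path)
    target_dir = Path(archive_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Resulting name looks like: <label>_<YYYY-MM-DD>_<original file name>
    target = target_dir / f"{label}_{when.isoformat()}_{source.name}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

Because the date is part of the file name, successive copies of the same configuration sit side by side and can be compared later.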
When the information spans multiple systems, such as a list of IP
addresses, management of the data becomes more difficult.
Fortunately, much of this information can be collected automatically.
Several tools that ease the process are described in subsequent
chapters, particularly in Chapter 6, "Device Discovery and Mapping".
For process documentation, the best approach is to log and annotate
the changes as you make them and then reconstruct the process at a
later time. Chapter 11, "Miscellaneous Tools" describes some of the common
Unix utilities you can use to automate documentation. You might refer
to this chapter if you aren't familiar with utilities like
tee, script, and
xwd. [2]
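The idea of logging as you work can be sketched in a few lines. The example below is not tee or script; it is a hypothetical Python helper (the name `run_and_log` and the log format are assumptions) that runs one command, shows its output, and appends a timestamped copy to a log file for later annotation.

```python
import subprocess
import sys
from datetime import datetime


def run_and_log(command, log_path):
    """Run a command, echo its output to the screen, and append it to a log file."""
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as log:
        # Record when the command ran and what was typed, then its output.
        log.write(f"# {stamp} $ {' '.join(command)}\n")
        log.write(result.stdout)
    sys.stdout.write(result.stdout)
    return result.returncode
```

The command is passed as a list of arguments, for example `run_and_log(["ls", "-l"], "changes.log")`, where `changes.log` is a placeholder name. The standard utilities capture an entire interactive session and are usually the better choice; this sketch only shows the principle of recording each step as it happens.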
1.3.2. Management Practices
A fundamental assumption of this book is that troubleshooting should
be proactive. It is better to avoid a problem than to have to
correct it. Proper management practices can help. While some of this
section may, at first glance, seem unrelated to troubleshooting,
there are fundamental connections. Management practices will
determine what you can do and how you do it. This is true both for
avoiding problems and for dealing with problems that can't be
avoided. The remainder of this chapter reviews some of the more
important management issues.
1.3.2.1. Professionalism
To effectively administer a system
requires a high degree of professionalism. This includes personal
honesty and ethical behavior. You should learn to evaluate yourself
in an honest, objective manner. (See the sidebar "The Peter Principle Revisited".) It also
requires that you conform to the organization's mission and
culture. Your network serves some higher purpose within your
organization. It does not exist strictly for your benefit. You should
manage the network with this in mind. This means that everything you
do should be done from the perspective of a cost-benefit trade-off.
It is too easy to get caught in the trap of doing something
"the right way" at a higher cost than the benefits
justify. Performance analysis is the key element.
The
organization's mind-set or culture will have a tremendous
impact on how you approach problems in general and the use of tools
in particular. It will determine which tools you can use, how you can
use the tools, and, most important, what you can do with the
information you obtain. Within organizations, there is often a battle
between openness and secrecy. The secrecy advocate believes that
details of the network should be available only on a need-to-know
basis, if then. She believes, not without justification, that this
enhances security. The openness advocate believes that the details of
a system should be open and available. This allows users to adapt and
make optimal use of the system and provides a review process, giving
users more input into the operation of the network.
Taken to an extreme, the secrecy advocate will suppress information
that is needed by the user, making a system or network virtually
unusable. Openness, taken to an extreme, will leave a network
vulnerable to attack. Most people's views fall somewhere
between these two extremes but often favor one position over the
other. I advocate prudent openness. In most situations, it makes no
sense to shut down a system because it might be
attacked. And it is asinine not to provide users with the information
they need to protect themselves. Openness among those responsible for
the different systems within an organization is absolutely essential.
1.3.2.2. Ego management
We would all like to think that we
are irreplaceable, and that no one else could do our jobs as well as
we do. This is human nature. Unfortunately, some people take steps to
make sure this is true. The most obvious way an administrator may do
this is to hide what he actually does and how his system works.
This can be done many ways. Failing
to document the system is one approach -- leaving comments out of
code or configuration files is common. The goal of such an
administrator is to make sure he is the only one who truly
understands the system. He may try to limit others' access to a system
by restricting accounts or access to passwords. (This can be done to
hide other types of unprofessional activities as well. If an
administrator occasionally reads other users' email, he may not
want anyone else to have standard accounts on the email server. If he
is overspending on equipment to gain experience with new
technologies, he will not want any technically literate people
knowing what equipment he is buying.)
This behavior is usually well disguised, but it is extremely common.
For example, a technician may insist on doing tasks that users could
or should be doing. The problem is that this keeps users dependent on
the technician when it isn't necessary. This can seem very
helpful or friendly on the surface. But, if you repeatedly ask for
details and don't get them, there may be more to it than meets
the eye.
Common
justifications are security and privacy. Unless you are in a
management position, there is often little you can do other than
accept the explanations given. But if you are in a management
position, are technically competent, and still hear these excuses
from your employees, beware! You have a serious problem.
No one knows everything. Whenever information is suppressed, you lose
input from individuals who don't have the information. If an
employee can't control her ego, she should not be turned loose
on your network with the tools described in this book. She will not
share what she learns. She will only use it to further entrench
herself.
The problem is basically a personnel
problem and must be dealt with as such. Individuals in technical
areas seem particularly prone to these problems. It may stem from
enlarged egos or from insecurity. Many people are drawn to technical
areas as a way to seem special. Alternatively, an administrator may see
information as a source of power or even a weapon. He may feel that
if he shares the information, he will lose his leverage. Often
individuals may not even recognize the behavior in themselves. It is
just the way they have always done things and it is the way that
feels right.
If you are a manager, you should deal with this problem immediately.
If you can't correct the problem in short order, you should
probably replace the employee. An irreplaceable employee today will
be even more irreplaceable tomorrow. Sooner or later, everyone
leaves -- finds a better job, retires, or runs off to Poughkeepsie
with an exotic dancer. In the meantime, such a person only becomes
more entrenched, making the eventual departure more painful. It will
be better to deal with the problem now rather than later.
1.3.2.3. Legal and ethical considerations
From the perspective of tools,
you must ensure that you use tools in a manner that conforms not just
to the policies of your organization, but to all applicable laws as
well. The tools I describe in this book can be abused, particularly
in the realm of privacy. Do you have the appropriate permission to
use the tools? This will depend greatly on your role within the
organization. Do not assume that just because you have access to
tools, you are authorized to use them. Nor should you assume that
any authorization you have is unlimited.
Packet capture software is a prime example. It allows you to examine
every packet that travels across a link, including application data
and each and every header. Unless data is encrypted, it can be
decoded. This means that passwords can be captured and email can be
read. For this reason alone, you should be very circumspect in how
you use such tools.
A key consideration is the legality of collecting such information.
Unfortunately, there is a constantly changing legal morass with
respect to privacy in particular and technology in general.
Collecting some data may be legitimate in some circumstances but
illegal in others. [3] This depends on factors such as the
nature of your operations, what published policies you have, what
assurances you have given your users, new and existing laws, and what
interpretations the courts give to these laws.
It is impossible for a book like this to provide a definitive answer
to the questions such considerations raise. I can, however, offer
four pieces of advice:
- First, if the information you are collecting can be tied to the
  activities of an individual, you should consider the information
  highly confidential and should collect only the information that you
  really need. Be aware that even seemingly innocent information may be
  sensitive in some contexts. For example, source/destination address
  pairs may reveal communications between individuals that they would
  prefer not be made public.
- Second, place your users on notice. Let them know that you collect
  such information, why it is necessary, and how you use the
  information. Remember, however, if you give your users assurances as
  to how the information is used, you are then constrained by those
  assurances. If your management policies permit, make their prior
  acceptance of these policies a requirement for using the system.
- Third, you must realize that with monitoring come obligations. In
  many instances, your legal culpability may be less if you don't
  monitor.
- Finally, don't rely on this book or what your colleagues say. Get
  legal advice from a lawyer who specializes in this area. Beware: many
  lawyers will not like to admit that they don't know everything about
  the law, but many aren't current with the new laws relating to
  technology. Also, keep in mind that even if what you are doing is
  strictly legal and you have appropriate authority, your actions may
  still not be ethical.
The Peter Principle Revisited
In 1969, Laurence Peter and Raymond
Hull published the satirical book, The Peter
Principle. The premise of the book was that people rise to
their level of incompetence. For example, a talented high school
teacher might be promoted to principal, a job requiring a quite
different set of skills. Even if ill suited for the job, once she has
this job, she will probably remain with it. She just won't earn
any new promotions. However, if she is adept at the job, she may be
promoted to district superintendent, a job requiring yet another set
of skills. The process of promotions will continue until she reaches
her level of incompetence. At that point, she will spend the
remainder of her career at that level.
While hardly a rigorous sociological principle, the book was well
received because it contained a strong element of truth. In my humble
opinion, the Peter Principle usually fails miserably when applied to
technical areas such as networking and telecommunications. The
problem is the difficulty in recognizing incompetence. If
incompetence is not recognized, then an individual may rise well
beyond his level of incompetence. This often happens in technical
areas because there is no one in management who can judge an
individual's technical competence.
Arguably, unrecognized incompetence is
usually overengineering. Networking, a field of engineering, is
always concerned with trade-offs between costs and benefits. An
underengineered network that fails will not go unnoticed. But an
overengineered network will rarely be recognizable as such. Such
networks may cost many times what they should, drawing resources from
other needs. But to the uninitiated, it appears as a normal,
functioning network.
If a network engineer really wants the latest in new equipment when
it isn't needed, who, outside of the technical personnel, will
know? If this is a one-person department, or if all the members of
the department can agree on what they want, no one else may ever
know. It is too easy to come up with some technical mumbo jumbo if
they are ever questioned.
If this seems far-fetched, I once attended a meeting where a young
engineer was arguing that a particular router needed to be replaced
before it became a bottleneck. He had picked out the ideal
replacement, a hot new box that had just hit the market. The problem
with all this was that I had recently taken measurements on the
router and knew the average utilization of that
"bottleneck" was less than 5% with peaks that rarely hit
40%.
This is an extreme example of why collecting information is the
essential first step in network management and troubleshooting.
Without accurate measurements, you can easily spend money fixing
imaginary problems.
1.3.2.4. Economic considerations
Solutions to problems have economic
consequences, so you must understand the economic implications of
what you do. Knowing how to balance the cost of the time used to
repair a system against the cost of replacing a system is an obvious
example. Cost management is a more general issue that has important
implications when dealing with failures.
One particularly difficult task for many system administrators is to
come to terms with the economics of networking. As long as everything
is running smoothly, the next biggest issue to upper management will
be how cost effectively you are doing your job. Unless you have
unlimited resources, when you overspend in one area, you take
resources from another area. One definition of an engineer that I
particularly like is that "an engineer is someone who can do
for a dime what a fool can do for a dollar." My best guess is
that overspending and buying needlessly complex systems is the single
most common engineering mistake made when novice network
administrators purchase network equipment.
One problem is that some traditional economic models do not apply in
networking. In most engineering projects, incremental costs are less
than the initial per-unit cost. For example, if a 10,000-square-foot
building costs $1 million, a 15,000-square-foot building will cost
somewhat less than $1.5 million. It may make sense to buy additional
footage even if you don't need it right away. This is justified
as "buying for the future."
This kind of reasoning, when applied to computers and networking,
leads to waste. Almost no one would go ahead and buy a computer now
if they won't need it until next year. You'll be able to
buy a better computer for less if you wait until you need it.
Unfortunately, this same reasoning isn't applied when buying
network equipment. People will often buy higher-bandwidth equipment
than they need, arguing that they are preparing for the future, when
it would be much more economical to buy only what is needed now and
buy again in the future as needed.
Moore's Law lies at the heart of the
matter. Around 1965, Gordon Moore, one of the founders of Intel, made
the empirical observation that the density of integrated circuits was
doubling about every 12 months, which he later revised to 24 months.
Since the cost of manufacturing integrated circuits is relatively
flat, this implies that, in two years, a circuit can be built with
twice the functionality with no increase in cost. And, because
distances are halved, the circuit runs at twice the speed -- a
fourfold improvement. Since the doubling applies to previous
doublings, we have exponential growth.
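Taking this simplified argument at face value, the growth can be written out explicitly. The symbols below are assumed notation for this sketch: $t$ is time in years, and $D_0$, $P_0$, and $C_0$ are today's density, performance, and cost of a fixed level of performance.

```latex
% Sketch of the simplified model described in the text; t is in years.
\begin{align*}
  D(t) &= D_0 \cdot 2^{t/2} && \text{(density doubles every two years)} \\
  P(t) &= P_0 \cdot 4^{t/2} = P_0 \cdot 2^{t} && \text{(twice the functionality at twice the speed)} \\
  C(t) &= C_0 \cdot 2^{-t} && \text{(cost of a fixed level of performance)}
\end{align*}
```

Setting $t = 2$ gives $P(2) = 4P_0$ and $C(2) = C_0/4$, which is the fourfold improvement every two years.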
It is generally estimated
that this exponential growth with chips will go on for another 15 to
20 years. In fact, this growth is nothing new. Raymond Kurzweil, in
The Age of Spiritual Machines: When Computers Exceed Human
Intelligence, collected information on computing speeds
and functionality from the beginning of the twentieth century to the
present. This covers mechanical, electromechanical (relay), vacuum
tube, discrete transistor, and integrated circuit technologies.
Kurzweil found that exponential growth has been the norm for the last
hundred years. He believes that new technologies will be developed
that will extend this rate of growth well beyond the next 20 years.
It is certainly true that we have seen even faster growth in disk
densities and fiber-optic capacity in recent years, neither of which
can be attributed to semiconductor technology.
What does this mean economically? Clearly, if you wait, you can buy
more for less. But usually, waiting isn't an option. The real
question is how far into the future should you invest? If the price
is coming down, should you repeatedly buy for the short term or
should you "invest" in the long term?
The general answer
is easy to see if we look at a few numbers. Suppose that $100,000
will provide you with network equipment that will meet your
anticipated bandwidth needs for the next four years. A simpleminded
application of Moore's Law would say that you could wait and
buy similar equipment for $25,000 in two years. Of course, such a
system would have a useful life of only two additional years, not the
original four. So, how much would it cost to buy just enough
equipment to make it through the next two years? Following the same
reasoning, about $25,000. If your growth is tracking the growth of
technology, [4] then two years ago it would have cost $100,000 to buy
four years' worth of technology. That will have fallen to about
$25,000 today. Your choice: $100,000 now or $25,000 now and $25,000
in two years. This is something of a no-brainer. It is summarized in
the first two lines of Table 1-1.
Table 1-1. Cost estimates
| Plan | Year 1 | Year 2 | Year 3 | Year 4 | Total |
|---|---|---|---|---|---|
| Four-year plan | $100,000 | $0 | $0 | $0 | $100,000 |
| Two-year plan | $25,000 | $0 | $25,000 | $0 | $50,000 |
| Four-year plan with maintenance | $112,000 | $12,000 | $12,000 | $12,000 | $148,000 |
| Two-year plan with maintenance | $28,000 | $3,000 | $28,000 | $3,000 | $62,000 |
| Four-year plan with maintenance and 20% MARR | $112,000 | $10,000 | $8,300 | $6,900 | $137,200 |
| Two-year plan with maintenance and 20% MARR | $28,000 | $2,500 | $19,500 | $1,700 | $51,700 |
If
this argument isn't compelling enough, there is the issue of
maintenance. As a general rule of thumb, service contracts on
equipment cost about 1% of the purchase price per month. For
$100,000, that is $12,000 a year. For $25,000, this is $3,000 per
year. Moore's Law doesn't apply to maintenance for
several reasons:
- A major part of maintenance is labor costs and these, if anything,
  will go up.
- The replacement parts will be based on older technology and older
  (and higher) prices.
- The mechanical parts of older systems, e.g., fans, connectors, and so
  on, are all more likely to fail.
- There is more money to be made selling new equipment so there is no
  incentive to lower maintenance prices.
Thus, maintenance on a $100,000 system will cost $12,000 a year for
all four years. The third and fourth lines of
Table 1-1 summarize these numbers.
Yet another consideration
is the time value of money. If you don't need the $25,000 until
two years from now, you can invest a smaller amount now and expect to
have enough to cover the costs later. So the $25,000 needed in two
years is really somewhat less in terms of today's dollars. How
much less depends on the rate of return you can expect on
investments. For most organizations, this number is called the
minimum acceptable rate of return (MARR). The
last two lines of Table 1-1 use a MARR of 20%.
This may seem high, but it is not an unusual number. As you can see,
buying for the future is more than two and a half times as expensive
as going for the quick fix.
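The arithmetic behind Table 1-1 can be reproduced with a short script. This is a sketch, not the method used to prepare the table: the function names are invented, and the discounting convention (Year 1 outlays undiscounted, each later year divided by one more factor of 1.20) is an assumption inferred from the table's values.

```python
def plan_cash_flows(purchase_price, purchase_years, horizon=4, maintenance_rate=0.0):
    """Yearly outlays: purchases in the given years plus annual maintenance.

    maintenance_rate is a fraction of the purchase price per year
    (1% per month is 0.12).
    """
    flows = []
    for year in range(1, horizon + 1):
        outlay = purchase_price * maintenance_rate
        if year in purchase_years:
            outlay += purchase_price
        flows.append(outlay)
    return flows


def present_value(flows, marr):
    """Discount each year's outlay to Year 1 dollars at the given MARR."""
    return sum(outlay / (1 + marr) ** index for index, outlay in enumerate(flows))


# Plans with maintenance: buy once for four years, or buy in Years 1 and 3.
four_year = plan_cash_flows(100_000, {1}, maintenance_rate=0.12)
two_year = plan_cash_flows(25_000, {1, 3}, maintenance_rate=0.12)

print(round(sum(four_year)), round(sum(two_year)))
print(round(present_value(four_year, 0.20)), round(present_value(two_year, 0.20)))
```

The undiscounted totals come to $148,000 and $62,000. The discounted totals come to roughly $137,278 and $51,681, slightly different from the table's $137,200 and $51,700 because the table sums yearly figures that were rounded first. Either way, the ratio is a little over 2.6.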
Of course, all this is a gross
simplification. There are a number of other important considerations
even if you believe these numbers. First and foremost, Moore's
Law doesn't always apply. The most important exception is
infrastructure. It is not going to get any cheaper to pull cable. You
should take the time to do infrastructure well; that's where
you really should invest in the future.
Most of the
other considerations seem to favor short-term investing. First, with
short-term purchasing, you are less likely to invest in dead-end
technology since you are buying later in the life cycle and will have
a clearer picture of where the industry is going. For example, think
about the difference two years might have made in choosing between
Fast Ethernet and ATM for some organizations. For the same reason,
the cost of training should be lower. You will be dealing with more
familiar technology, and there will be more resources available. You
will have to purchase and install equipment more often, but the
equipment you replace can be reused in your network's
periphery, providing additional savings.
On the downside, the equipment you buy won't have a lot of
excess capacity or a very long, useful lifetime. It can be very
disconcerting to nontechnical management when you keep replacing
equipment. And, if you experience sudden unexpected growth, this is
exactly what you will need to do. Take the time to educate upper
management. If frequent changes to your equipment are particularly
disruptive or if you have funding now, you may need to consider
long-term purchases even if they are more expensive. Finally,
don't take the two-year time frame presented here too
literally. You'll discover the appropriate time frame for your
network only with experience.
Other problems come when comparing plans.
You must consider the total economic picture. Don't look just
at the initial costs, but consider ongoing costs such as maintenance
and the cost of periodic replacement. As an example, consider the
following plans. Plan A has an estimated initial cost of $400,000,
all for equipment. Plan B requires $150,000 for equipment and
$450,000 for infrastructure upgrades. If you consider only initial
costs, Plan A seems to be $200,000 cheaper. But equipment needs to be
maintained and, periodically, replaced. At 1% per month, the
equipment for Plan A would cost $48,000 a year to maintain, compared
to $18,000 per year with Plan B. If you replace equipment a couple of
times in the next decade, that will be an additional $800,000 for
Plan A but only $300,000 for Plan B. As this quick,
back-of-the-envelope calculation shows, the 10-year cost for Plan A
is $1.68 million, compared to only $1.08 million for Plan B. What appeared
to be $200,000 cheaper is really $600,000 more expensive. Of course,
this was a very crude example, but it should convey the idea.
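The same back-of-the-envelope comparison can be written as a small function. It is a sketch with assumed names (`ten_year_cost` and its parameters), and it mirrors the crude assumptions above: maintenance at 1% of the equipment price per month, and two full-price equipment replacements over the decade.

```python
def ten_year_cost(equipment, infrastructure, replacements=2, years=10,
                  monthly_maintenance_rate=0.01):
    """Initial outlay plus maintenance and periodic equipment replacement."""
    initial = equipment + infrastructure
    maintenance = equipment * monthly_maintenance_rate * 12 * years
    replacement = equipment * replacements  # infrastructure is not replaced
    return initial + maintenance + replacement


plan_a = ten_year_cost(equipment=400_000, infrastructure=0)
plan_b = ten_year_cost(equipment=150_000, infrastructure=450_000)
print(f"Plan A: ${plan_a:,.0f}  Plan B: ${plan_b:,.0f}")
# prints "Plan A: $1,680,000  Plan B: $1,080,000"
```

Changing the number of replacements or the maintenance rate shows how sensitive the comparison is to those assumptions.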
You shouldn't take this example
too literally either. Every situation is different. In particular,
you may not be comfortable deciding what is adequate surplus capacity
in your network. In general, however, you are probably much better
off thinking in terms of scalability than raw capacity. If you want
to hedge your bets, you can make sure that high-speed interfaces are
available for the router you are considering without actually buying
those high-speed interfaces until needed.
How does
this relate to troubleshooting? First, don't buy overly complex
systems you don't really need. They will be much harder to
maintain, as you can expect the complexity of troubleshooting to grow
with the complexity of the systems you buy. Second, don't spend
all your money on the system and forget ongoing maintenance costs. If
you don't anticipate operational costs, you may not have the
funds you need.