Monitoring Your System (Building Internet Firewalls, 2nd Edition)

26.2. Monitoring Your System

Another important aspect of firewall maintenance involves monitoring your system. Monitoring is intended to tell you several things:

Has your firewall been compromised?
What kinds of attacks are being tried against your firewall?
Is your firewall in working order?
Is your firewall able to provide the service your users need?

In order to answer these questions, you'll need to know what the normal pattern of usage is.

26.2.1. Special-Purpose Monitoring Devices

You'll do most of your monitoring using the tools and the logging provided by the existing parts of your firewall, but you may find it convenient to have some dedicated monitoring devices as well. For example, you may want to put a monitoring station on your perimeter net so you can be sure only the packets you expect are going across it. You can use a general-purpose computer with network snooping software on it, or you can use a special-purpose network sniffer.

How can you make certain that this monitoring machine can't be used by an intruder? In fact, you'd prefer that an intruder not even detect its existence. On some network hardware, you can disable transmission in the network interface (with sufficient expertise and a pair of wire cutters), which will make the machine impossible to detect and extremely difficult for an intruder to use (because it can't reply). If you have the source for your operating system, you can always disable transmission there; however, in this case, it's much harder to be certain you've been successful. In most cases, you'll have to settle for extremely cautious configuration of the machine. Treat it like a bastion host that needs to do less and be more secure.

In particular, you should note that Microsoft's Network Monitor registers a special NetBIOS name when it starts up. It is a blatant advertisement of the fact that you are monitoring the network. The only way to stop it from happening is to unbind NetBIOS from the network interface you're monitoring, which will also make it impossible to use the Network Monitor Agent on it (not to mention most other Microsoft-native network applications!).

Other forms of network sniffing are more subtle, but also often detectable. Most obviously, network sniffers tend to do a lot of name service requests, in order to find hostnames for the IP addresses. Less obviously, a machine that is accepting all packets often slightly changes the way it handles incoming requests. A group called The L0pht (pronounced "loft") released an anti-sniffer-sniffer to detect network sniffers, which has led to an arms race, with people developing less and less detectable sniffers, not to mention an anti-sniffer-sniffer-sniffer. These technologies are rapidly evolving; we advise that you deploy the most up-to-date, least detectable sniffer technology available and an anti-sniffer-sniffer on critical networks to help you find sniffers installed by attackers. (Information about getting The L0pht's anti-sniffer-sniffer can be found in Appendix B, "Tools".)

Clever attackers will assume that you are intercepting packets and will take steps to conceal traffic, whether or not they can actually find your monitoring system. There's nothing you can do but stay alert.

26.2.2. Intrusion Detection Systems

An intrusion detection system is a piece of special-purpose software designed to figure out if somebody has broken into your site. Intrusion detection systems can range from relatively simple, passive programs that read log files and look for bad things, to extremely complex systems that use special-purpose monitoring devices spread out across a large network, inject fake traffic into the network to see if anybody uses the information, and/or employ sophisticated artificial intelligence techniques to detect subtle forms of abnormal behavior.

Intrusion detection is a big subject, and we can't cover it fully here. More information can be found in other places (for instance, Stephen Northcutt's Network Intrusion Detection: An Analyst's Handbook, New Riders, 1999).

There are two basic techniques for intrusion detection: systems can either know what kind of behavior is bad and set off alarms when it happens, or know what kind of behavior is good and set off alarms when anything else happens. Relatively speaking, it's easier to recognize bad behavior than good behavior. Unfortunately, it's more effective to recognize good behavior.

Systems that recognize bad behavior use attack signatures, information about what particular attacks look like. For instance, this sort of system would recognize a port scan as an attack because it would know that a series of attempts to contact different ports on the same host was a sign of an attack.
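As a rough illustration of such a signature, the following Python sketch flags any source host that contacts many different ports on a single destination. The record format (source, destination, port) and the threshold value are assumptions made for this example, not settings taken from any particular product, and the sketch ignores timing entirely.

```python
from collections import defaultdict

def find_port_scans(connections, threshold=10):
    """Return {(src, dst): ports} for sources that hit many ports on one host.

    `connections` is an iterable of (src_ip, dst_ip, dst_port) tuples.
    `threshold` is a hypothetical tuning knob, not a recommended value.
    """
    ports_seen = defaultdict(set)
    for src, dst, port in connections:
        ports_seen[(src, dst)].add(port)
    # Keep only the (source, destination) pairs that touched enough ports.
    return {pair: sorted(ports) for pair, ports in ports_seen.items()
            if len(ports) >= threshold}
```

Because the signature is keyed on one source host, it says nothing about scans that are spread across several sources.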

Systems that recognize good behavior use usage profiles, information about what normally happens. For instance, this sort of system would recognize a port scan as an attack because it would know that normally, when somebody connects to a port, the next thing that happens is that they use that port for its normal purpose. In a port scan, the attacker will connect, get a response, and immediately disconnect and try another port. This behavior is outside of the normal pattern and therefore will be detected as an attack.

The difficulty with recognizing attack signatures is that the system can detect only attacks that it knows about. When new attacks are created, the system won't know about them until a signature is created and added. In addition, it's often possible to disguise signatures. For instance, the signatures for port scans used to look for multiple connections to different ports from the same source host, so attackers now use multiple collaborating source hosts.

Systems that rely on usage profiles have problems with what are called "false positives", or cases where they think an attack is occurring but it isn't. Usage of systems changes over time, and any profile that's specific enough to catch any significant number of attacks will set off a large number of alarms. Good systems now have false positive rates in the range of about 1-3 percent; unfortunately, that's 1-3 percent of the events they look at, which in the case of a network usually means 1-3 percent of packets. Since many sites have millions of incoming packets a day, this apparently small error rate can still result in thousands of false alarms a day, which is rarely acceptable.
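The arithmetic is worth doing for your own site. The sketch below assumes a volume of one million packets a day, a figure chosen only for illustration; substitute your own traffic numbers.

```python
def expected_false_alarms(events_per_day, false_positive_rate):
    """Expected number of false alarms per day for a given event volume."""
    return events_per_day * false_positive_rate

# Assumed volume: 1,000,000 packets a day (illustrative only).
for rate in (0.01, 0.03):
    alarms = expected_false_alarms(1_000_000, rate)
    print(f"{rate:.0%} of 1,000,000 packets = {alarms:,.0f} false alarms/day")
```

At that volume, a 1-3 percent rate works out to somewhere between 10,000 and 30,000 false alarms a day.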

Intruders also have ways of hiding attacks that will defeat almost all intrusion detection systems. For instance, a patient intruder can scan a network very slowly; most intrusion detection systems look at a few minutes', or maybe a few hours', worth of context at a time. An impatient intruder can bury an attack in a large amount of network traffic, and few systems will be able to keep up.

Similarly, some techniques can be used that defeat almost all attempts at hiding attacks. The most powerful and popular of these is the honeypot, the tempting bait with nothing but a trap behind it. For instance, if you put a machine on your perimeter network that you don't use for any services, you know that any attempt to connect to it is an attack. It doesn't matter whether or not it matches an attack signature, or whether or not it fits a normal usage pattern. It's just obviously wrong.
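The logic behind a honeypot alarm is about as simple as detection gets, as this Python sketch suggests. The address used here is hypothetical, and the (source, destination, port) record format is an assumption made for the example.

```python
# Hypothetical address of an unused machine on the perimeter network.
HONEYPOT_ADDRESSES = {"192.168.20.99"}

def honeypot_alerts(connections):
    """Yield every connection record whose destination is a honeypot.

    No signature or usage profile is consulted: any contact at all
    with a honeypot address is treated as an alert.
    """
    for src, dst, port in connections:
        if dst in HONEYPOT_ADDRESSES:
            yield (src, dst, port)
```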

How much an intrusion detection system can do for you depends mostly on how much time, money, and development effort you can invest in the system. Although intrusion detection is theoretically a very effective technology, actually making it work in practice is not an easy proposition, and it requires constant maintenance and attention. There is no point in having an intrusion detection system unless you have the personnel to keep it up to date and to deal with the alarms that it produces.

26.2.3. What Should You Watch For?

In a perfect world, you'd like to know absolutely everything that goes through your firewall -- every packet dropped or accepted and every connection requested. In the real world, neither the firewall nor your brain can cope with that much information. To come up with a practical compromise, you'll want to turn on the most verbose logging that doesn't slow down your machines too much and doesn't fill up your disks too fast; then, you'll want to summarize the logs that are produced.

You can ease the disk space problem by keeping verbose logs offline, on some form of removable media (for instance, tapes, writable CDs, or writable DVDs). Tapes are cheap and hold a lot of data, but they have some drawbacks. They're not particularly fast under the best circumstances, and log entries are generally too short to achieve maximum performance. They're also annoying to read data from. If you're interested in using them, write summary logs to disk and write everything to tape. If you find a situation where you need more data, you can go back to the tape for it. A tape drive can probably keep up with the packets on an average Internet connection, but it won't keep up with an internal connection at full LAN speeds or even with a T-1 connection to the Internet that's at close to its maximum performance. CD and DVD writers are even slower, but they're much easier to read data from. If you have large amounts of disk space to use as temporary storage, they may be an effective solution.

No matter what you are using to write logs to, you should protect the logs. The data that's in them may be useful to attackers, and it may be confidential for other reasons. For instance, if you log the contents of packets, you may well be logging encrypted sensitive information. Even if you don't log packet contents, the information about what packets went where may be private; it's one thing to log the 434 times that somebody tried to go to an embarrassing web site, and another to have it become public knowledge.

No matter how you're storing logs, you want to log the following cases:

All dropped or rejected packets, denied connections, and rejected attempts
At least the time, protocol, and username for every successful connection to or through your bastion host
All error messages from your routers, your bastion host, and any proxying programs

NOTE
For security reasons, some information should never be logged where an intruder could possibly be able to read it. For example, although you should log failed login attempts, you should not log the password that was used in the failed attempt. Users frequently mistype their own passwords, and logging these mistyped passwords would make it easier for a computer cracker to break into a user's account.
Some system administrators believe that the account name should also not be logged on failed login attempts, especially when the account typed by the user is nonexistent. The reason is that users occasionally type their passwords when they are prompted for their usernames. If invalid accounts are logged, it might be possible for an attacker to use those logs to infer people's passwords.
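One way to follow both pieces of advice is sketched below in Python. The function name, its parameters, and the placeholder text are all hypothetical; the point is only that the password never reaches the log, and that an account name is recorded only when it matches a real account.

```python
import logging

log = logging.getLogger("auth")

def record_failed_login(source_host, username, password, known_users):
    """Log a failed login without recording the password that was typed.

    `password` is accepted but deliberately never written anywhere.
    Unknown account names are replaced by a placeholder, in case the
    user typed a password at the username prompt.
    Returns the text that was logged.
    """
    account = username if username in known_users else "<unknown account>"
    message = f"login failed: host={source_host} account={account}"
    log.warning(message)
    return message
```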

What are you watching for? You want to know what your usual pattern is (and what trends there are in it), and you want to be alerted to any exceptions to that pattern. To recognize when things are going wrong, you have to understand what happens when things are going right. It's important to know what messages you get when everything is working. Most systems produce error messages that sound peculiar and threatening even when they're working perfectly well. For example, in the sample syslog output in Example 26-1, messages 10, 14, and 17 all look vaguely threatening, but are in fact perfectly OK.[185] (Although these examples are taken from a Unix syslog, exactly the same phenomena can be seen in the Windows NT Event Log; information about setting up logging can be found in Chapter 10, "Bastion Hosts", Chapter 11, "Unix and Linux Bastion Hosts", and Chapter 12, "Windows NT and Windows 2000 Bastion Hosts".)

[185]Message 10 is a common network failure that will result in a retry, and how good do you expect your connection to Cameroon to be? 14 is traceroute running. 17 says there are no synonyms defined, which you presumably already know.

If you see those messages for the first time when you're trying to debug a problem, you're likely to leap to the conclusion that the messages have something to do with your problem and get thoroughly sidetracked. Even if you never do figure out what the messages are and why they're appearing, just knowing that certain messages appear even when things are working fine will save you time.

Example 26-1. A Sample syslog File (Line Numbers Added)

1:  May 29 00:00:58 localhost wn[27194]: noc.nca.or.bv - - [] "GET 
    /long/consulting.html HTTP/1.0" 200 1074  <Sent file: >  
2:  May 29 00:00:58 localhost wn[27194]: <User_Agent: Mozilla/1.0N 
    (X11; SunOS 4.1.3-KL sun4m)> <Referrer: http://www.longitude.example/>  
3:  May 29 00:02:38 localhost ftpd[26086]: 26086: 05/29/95 0:02:38 
    spoke.cst.cnes.vg(gupta@) retrieved  
    /pub/firewalls/digest/v04.n278.Z(15788 bytes)  
4:  May 29 00:15:57 localhost ftpd[27195]: 27195: 05/29/95 0:01:52 
    client42.sct.io connected, duration 845 seconds  
5:  May 29 00:18:04 localhost ftpd[26086]: 26086: 05/29/95 23:26:32 
    spoke.cst.cnes.vg connected, duration 3092 seconds  
6:  May 27 01:13:38 mv-gw.longitude.example user: host 
    naismith.longitude.example admin login failed 
7:  May 27 01:13:47 mv-gw.longitude.example last message repeated 2 times 
8:  May 27 01:15:17 mv-gw.longitude.example user: host 
    naismith.longitude.example admin login succeeded 
9:  May 27 01:19:18 mv-gw.longitude.example 16 permit: TCP from 
    192.168.20.35.2591 to 172.16.1.3.53 seq 324EE800, ack 0x0, win  
    4096, SYN 
10: May 29 02:20:09 naismith sendmail[27366]: CAA27366: SYSERR(root): 
    collect: I/O error on connection from atx.eb.cm, from=<<Mailer- 
    Daemon@eb.cm>: Connection reset by peer during collect 
    with atx.eb.cm 
11: May 29 02:30:28 naismith named[79]: sysquery: server name mismatch 
    for [172.16.8.25]: (sun.nhs-relay.ac.cv != nhs-relay.ac.cv) (server  
    for cus.ox.ac.cv) 
12: May 29 02:31:00 naismith named[79]: sysquery: server name mismatch  
    for [172.16.8.25]: (nhs-relay.ac.cv != sun.nhs-relay.ac.cv) (server 
    for PANSY.CSV.WARWICK.AC.CV) 
13: May 29 02:47:04 naismith named[79]: sysquery: server name mismatch  
    for [172.16.8.25]: (nhs-relay.ac.cv != sun.nhs-relay.ac.cv) (server  
    for LUPUS.CNS.UMIST.AC.CV) 
14: May 29 07:50:59 mv-gw.longitude.example  8 deny: UDP from 
    192.168.69.250.33072 to 192.168.20.34.33467  
15: May 29 08:06:16 naismith popper: (v1.831beta) Servicing request  
    from "penta.longitude.example" at 192.168.20.36  
16: May 29 08:06:56 naismith popper: (v1.831beta) Ending request from 
    "penta.longitude.example" at 192.168.20.36  
17: May 29 10:04:02 localhost waisserver1[28430]: -2: Warning: couldn't open  
    wais-sources/firewalls-digest.syn - synonym translation 
    disabled 
18: May 29 16:26:46 mv-gw.longitude.example  8 deny: UDP from 
    192.168.186.11.20 to 192.168.20.34.1937

Most of your logging will probably be done via the Unix syslog facility or some other similar file-based log mechanism. You'll need to develop log-scanning scripts to analyze each of these log files on a regular basis. Some firewall packages, such as TIS FWTK, come with scripts to analyze and summarize their own logs. You could use these scripts as templates for your own logging, or you could write your own scripts from scratch in awk, perl, or some other suitable language.

As you can see, the log file is verbose and not particularly readable (even with better linebreaks inserted). An unimportant error condition on a distant host (the server name mismatch on nhs-relay.ac.cv) is producing multiple error messages (11, 12, and 13, in this highly condensed version). The log file is also in chronological order, which is not necessarily the order of importance.

Example 26-2 shows a report based on a log file, with messages arranged in a more useful order and somewhat summarized.

Example 26-2. A Report Based on a syslog File

May 27 06:42:07 localhost ftpd[10159]: securityalert: refused passwd 
    file to chen@calm.example from chen.dialup.zarf.net 
May 27 06:42:10 localhost ftpd[10159]: securityalert: refused passwd 
    file to chen@calm.example from chen.dialup.zarf.net 
----------------------------------------------------------------- 
May 26 12:33:39 localhost su: nxn to root on /dev/ttyp1 
May 27 01:23:17 naismith su: bart to root on /dev/ttyp3 
----------------------------------------------------------------- 
May 26 12:29:44 naismith kernel: uid 31 on /naismith_b: file system full 
May 26 12:31:33 naismith kernel: uid 31 on /naismith_b: file system full 
----------------------------------------------------------------- 
May 26 02:49:03 naismith named[79]: Malformed response from 
    [192.168.192.2].53 (ran out of data in answer)  
----------------------------------------------------------------- 
May 26 12:14:36 mv-gw.longitude.example 16 deny: UDP from 192.168.69.1.58899 
    to 192.168.20.35.33459  
May 26 12:15:15 mv-gw.longitude.example 16 deny: UDP from 192.168.69.1.58962 
    to 192.168.20.35.33459  
May 27 01:24:05 mv-gw.longitude.example 16 permit: TCP from 
    192.168.20.34.2637 to 192.168.54.72.23 seq BE793A01, ack 0x0, win 
    4096, SYN  
May 27 01:24:11 mv-gw.longitude.example 16 permit: TCP from 
    192.168.20.34.2637 to 192.168.54.72.23 seq BE793A01, ack 0x0, win 
    4096, SYN  
----------------------------------------------------------------- 
FTP:    Connections: 240 
        Files: 733 
        Bytes: 32,747,429 (31.23  M) 
        Seconds: 92,787 (25.77  hours)
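The FTP totals at the end of such a report are easy to produce with a short script. The Python sketch below is one possible approach; its regular expressions are assumptions based on the ftpd message format shown in Example 26-1, and they expect each log entry to be on a single line (without the line numbers and line breaks added for printing).

```python
import re

# Patterns modeled on the ftpd entries in Example 26-1 (an assumption;
# other FTP servers word their messages differently).
RETRIEVED = re.compile(r"ftpd\[\d+\]: .* retrieved\s+\S+\((\d+) bytes\)")
CONNECTED = re.compile(r"ftpd\[\d+\]: .* connected, duration (\d+) seconds")

def summarize_ftp(lines):
    """Total up connections, files, bytes, and seconds from ftpd log lines."""
    totals = {"connections": 0, "files": 0, "bytes": 0, "seconds": 0}
    for line in lines:
        match = RETRIEVED.search(line)
        if match:
            totals["files"] += 1
            totals["bytes"] += int(match.group(1))
            continue
        match = CONNECTED.search(line)
        if match:
            totals["connections"] += 1
            totals["seconds"] += int(match.group(1))
    return totals
```

Lines that match neither pattern are simply skipped, so the same function can be run over a mixed syslog file.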

In general, it's safer to write scripts to filter out messages to be ignored (leaving unusual stuff), rather than writing scripts to identify the unusual stuff directly. The reason for this is that you seldom know all of the different messages your firewall might produce. It's easier to ignore the benign messages than to recognize the dangerous ones.

Log messages fall into three categories:

Known to be OK
Known to be dangerous
Unknown

Setting up the criteria is an iterative process; once a human has examined a mystery message, future examples of that message can probably be classified as either OK or dangerous without being examined again. You'll change the rules as time goes on.
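A minimal Python sketch of this kind of filter follows. The patterns are drawn from the sample messages in Examples 26-1 and 26-2, but which list each pattern belongs in is an assumption made for illustration; your own lists will differ and will grow as you examine mystery messages.

```python
import re

# Hypothetical rule lists; extend them as unknown messages are examined.
OK_PATTERNS = [re.compile(p) for p in (
    r"synonym translation disabled",
    r"sysquery: server name mismatch",
)]
DANGEROUS_PATTERNS = [re.compile(p) for p in (
    r"securityalert:",
    r"file system full",
)]

def classify(line):
    """Sort a log line into 'dangerous', 'ok', or 'unknown'."""
    # Check the dangerous list first so that an alert is never masked
    # by an overly broad "OK" pattern.
    if any(p.search(line) for p in DANGEROUS_PATTERNS):
        return "dangerous"
    if any(p.search(line) for p in OK_PATTERNS):
        return "ok"
    return "unknown"
```

Anything that comes back as "unknown" is what a human should look at; once it has been examined, a pattern for it can be added to one of the two lists.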

Log entries often must be considered in context. A message that's mildly mysterious if it occurs once is cause for serious worry if it occurs every minute. For example, "login succeeded for user smith" is good, unless it's preceded by three "login failed" messages for every user above "smith" in your password file; in that case, it's very bad indeed. In Example 26-1, message 9 shows an unexceptional outbound TCP connection, logged just on general principles. It wouldn't be at all worrying if it weren't preceded by messages 6 through 8. In context, you know that someone made three failed tries at logging in as "admin", finally succeeded, and then immediately started an outbound connection. This looks extremely suspicious. Message 7 doesn't mean anything at all without context.
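The pattern in messages 6 through 8 can be checked for mechanically. The following Python sketch is a simplified illustration: the regular expressions assume the router's message wording from Example 26-1, the threshold of three failures is an assumption, and the code treats all lines as coming from one host.

```python
import re

FAILED = re.compile(r"(\S+) login failed")
SUCCEEDED = re.compile(r"(\S+) login succeeded")
REPEATED = re.compile(r"last message repeated (\d+) times")

def suspicious_successes(lines, max_failures=3):
    """Return accounts whose successful login followed a run of failures."""
    failures = {}        # account -> failures seen since the last success
    last_failed = None   # account named in the immediately preceding failure
    flagged = []
    for line in lines:
        match = FAILED.search(line)
        if match:
            last_failed = match.group(1)
            failures[last_failed] = failures.get(last_failed, 0) + 1
            continue
        match = REPEATED.search(line)
        if match and last_failed is not None:
            # "last message repeated N times" only has meaning in context:
            # here it stands for N more copies of the preceding failure.
            failures[last_failed] += int(match.group(1))
            continue
        match = SUCCEEDED.search(line)
        if match and failures.pop(match.group(1), 0) >= max_failures:
            flagged.append(match.group(1))
        last_failed = None
    return flagged
```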

In a large system, getting context may require correlating log files from multiple hosts. This is one reason for keeping consistent time settings; it is also a reason why people use intrusion detection systems. If you have high volumes of traffic, a complex firewall, or a requirement for strict security, you will probably need an intrusion detection system to help you with the log analysis.
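If the clocks agree, combining logs is mostly a matter of merging on the timestamp. The Python sketch below assumes each log is already in time order and begins every line with the usual syslog "Mon DD HH:MM:SS" stamp; because syslog omits the year, the caller has to supply one, and the default used here is only an assumption.

```python
import heapq
from datetime import datetime

def timestamp(line, year=1995):
    """Parse the leading syslog time stamp of a log line.

    Syslog stamps carry no year, so `year` is an assumption supplied
    by the caller.
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def merge_logs(*logs):
    """Merge per-host logs (each already in time order) into one timeline."""
    return list(heapq.merge(*logs, key=timestamp))
```

The merge is only as trustworthy as the clocks on the machines that wrote the logs.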

26.2.4. The Good, the Bad, and the Ugly

Once you go beyond the obvious (for example, it's OK for users to log in; it's not OK for the disk to be bad), how can you tell when you're in trouble? Some rules of thumb:

Once is an accident; twice is coincidence; three times is enemy action

One user who tries to log in at 2 A.M. and fails is up too late and can't type. Two users who try to log in at 2 A.M. may have been at the same party, but you're certainly going to be curious about the incident. Three or more attempts to log in at 2 A.M., and someone is trying to break in. This rule of thumb applies mostly to attempts on separate accounts; stubborn repeated attempts by the same user to do the same thing that doesn't work probably merely indicate that the user is single-minded -- and wrong.

Accidents don't try to cover themselves up

If your log files are missing, if entries have been deleted, or if there is any other evidence that somebody has been covering his or her tracks, you probably have a break-in. If not, you have some other serious problem. (Either something is broken, or somebody administering the machine is deleting things or rotating the logs inappropriately.)

Most mysteries don't mean anything

For everybody who sets out to track down a mysterious problem or a strange log entry and finds an intruder, there are 99 people who set out to track down a mysterious problem or a strange log entry and find an annoying but trivial bug. You should still try to track these things down, but there's no need to panic.

Straightforward explanations are usually correct

It's possible that you were broken into at the same time you had another known problem, but it's not likely. If you know that you had a hardware failure, or a person wandering around doing misguided things, you'll want to spend some time ruling out side effects of the known problem before you decide that you also have an intruder. On the other hand, if your files are mysteriously disappearing and nothing is apparently wrong with your disk, somebody is probably deleting them, and you'll want to spend a very long time ruling out an intruder before you decide that your filesystem code is buggy.

You're going to end up classifying suspicious events into several categories:

You know what caused it, and it's not a security problem.
You don't know what caused it, you're probably never going to know what caused it, but whatever it was, it's not happening anymore.
Somebody was trying to break in but not very hard; this is a probe.
Somebody made a serious attempt to get in; this is an attack.
Somebody actually broke in.

The boundaries between these categories are vague. Unless you're dealing with messages from the first category (i.e., a known nonproblem), it's going to come down to a judgment call most of the time. It's impossible to provide an exhaustive list of the symptoms of any of these situations, but here are some generalizations that may help.

You should suspect that someone's been probing your site if you see:

A few attempts to access services at insecure ports (e.g., attempts to contact portmapper or an X server).
Attempts to log in with common account names (e.g., "guest" or "lp"; most attempts to log in as "anonymous" are mistakes).
Requests to TFTP files or to transfer NIS maps.
Somebody feeding the debug command to your SMTP server.
Packets sent to the same ports on every IP address in a range.
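The last symptom in this list, the same port tried across a range of addresses, lends itself to a simple check. This Python sketch assumes packet records of the form (source, destination, port) and uses a hypothetical threshold; neither comes from any particular firewall's log format.

```python
from collections import defaultdict

def find_address_sweeps(packets, min_hosts=20):
    """Return {(src, port): host_count} for sources that tried one port
    on many different destination addresses.

    `min_hosts` is a hypothetical threshold, not a recommended value.
    """
    targets = defaultdict(set)
    for src, dst, port in packets:
        targets[(src, port)].add(dst)
    return {key: len(hosts) for key, hosts in targets.items()
            if len(hosts) >= min_hosts}
```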

You should be more concerned if you see any of the following; an attack may be going on:

Multiple failed attempts to log in to valid accounts on your machines, particularly accounts that are used across the Internet, or attempts on accounts in the order in which they appear in your password file.
Unusual accepted packets or commands whose purpose you don't understand.
Packets sent to every port in a range.
Successful logins from an unexpected site.
Sudden increases in incoming or outgoing traffic.

You should suspect a successful break-in if you see:

Deleted or modified log files.
Programs that suddenly omit expected information (this suggests that they have been replaced with versions that ignore the intruder's files and programs). On Unix machines, the most frequent victims are login, ls, ps, netstat, and ifconfig.
New log files containing password information or packet traces that you can't explain.
Directories that contain more administrative entries than they should. For example, on Unix machines, directories should contain two entries with names made out of periods ("." and "..", indicating "this directory" and "parent directory"), but there should not be more than two such entries (for instance, "..." or ".. "). If it looks as if there is more than one entry for each, the extra entry probably has spaces in it and is being used to conceal the file or directory from casual observation.
Unexpected logins as privileged users (for example, root) or unexpected users who are suddenly able to become privileged users.
Services running that you have not intentionally turned on.
Apparent probes or attacks coming from your own machines.
Extra processes with names that are variants of common system processes (for example, both sendmail and Sendmail are running, or init and initd; this is another trick for sneaking things in where you won't notice them).
An unexpected change in the login behavior of your machine or for other machines you reach from yours. This indicates that the program you use to log in has been modified.
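Some of these symptoms can be checked for automatically. As one example, the Python sketch below looks for the disguised directory entries described above: names made up entirely of periods and whitespace. The function names are hypothetical, and the check is only as trustworthy as the system it runs on; on a machine whose programs have been replaced, it proves nothing.

```python
import os

def suspicious_names(names):
    """Return names made only of periods and whitespace, other than the
    two legitimate entries "." and ".."."""
    return [name for name in names
            if name not in (".", "..") and name.strip(". \t") == ""]

def scan_directory(path):
    """Check one directory; os.listdir() already omits "." and ".."."""
    return suspicious_names(os.listdir(path))
```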

26.2.5. Responding to Probes

Inevitably, you're going to detect apparent probes of your firewall -- packets sent to services you don't offer to the Internet, attempts to log in to nonexistent accounts, and so on. Probes are the Internet equivalent of someone walking down a line of doors and checking every door knob to see if it's locked. Probers generally try one or two things, and if they don't get an interesting response, they move on. If you're inclined to do so, you can spend a lot of time chasing down such incidents, attempting to figure out where the probes are coming from and who is behind them. However, in most situations, it probably isn't worth the effort. The novelty of chasing down probes of this kind fades quickly. If you're getting persistent probes from some site, you might contact the management of that site to let it know what's going on (in general, it has been broken into and needs to know), but that's usually about as far as folks need to go in responding to these probes.

It's unfortunate that on the Internet today, probes are so frequent that the laissez-faire attitude we've described is often an appropriate one. In good neighborhoods, people don't get away with trying door knobs. You have a right to be unhappy with people who behave this way, and trying to get them to stop is perfectly reasonable. However, you do need to decide where you're going to spend your energy. Save extreme responses for extreme situations. Treating probers with maximum harshness is just going to convince people that you are unreasonable.

Some people amuse themselves by setting up firewall machines to lead on people who try common probes. For example, they put a password file in the anonymous FTP area that appears to contain user account data. However, if the prober breaks the encrypted passwords, he or she sees a snide message. This is a harmless way to spend your spare time and provides a satisfactory feeling of revenge, but it doesn't actually improve your security much. It simply annoys attackers, and doing so may cause them to take a personal interest in breaking into your site.

Some people and some firewalls prefer a more active approach to probes, engaging in counter-attacks. A fairly wide variety of things are sold as counter-attacks (firewall marketers find the concept extremely sexy). They range from attempts to find out further information about where the probes are coming from (giving you a start on tracking them down), to automatically configuring the firewall to reject all connections from the probing site, to actual counter-attacks. Regardless of their often remarkably aggressive marketing material, no commercial firewall does anything that can reasonably be described as a counter-attack, for obvious reasons involving liability. This is good because misidentification, loops, and bad feelings often result from even mild automated responses.

For instance, a forged probe purporting to be from a site that does information-gathering when it gets probes and aimed at another such site is almost guaranteed to suck the site administrators into a "Yeah, but you hit me first!" argument. Why would somebody forge such a packet? Because they did something and were annoyed at the response they got. A little forgery can bring a lot of community pressure to bear on somebody who has an overly vigorous probe response. A notable example occurred with somebody whose response to anonymous FTP attempts was to send mail threatening to sue; a widely broadcast email contained a URL that attempted anonymous FTP to the site, causing both resource problems and publicity problems.

Different sites have different opinions about what constitutes a probe, and what constitutes a full-fledged attack. Most people call something a probe as long as they know it's not going to work, even if it is determined and drawn out. For example, somebody who determinedly tries every possible combination of lowercase alphabetic characters as your root password is not going to succeed and can probably be ignored as a probe until you get tired of reading the log messages. (That kind of attack won't succeed, no matter how many combinations are tried.) However, if you have the time and the energy, it's probably worth pursuing people who are making determined attempts, even when you know they'll fail.

26.2.6. Responding to Attacks

If your logs show that someone is making a determined attack against your system (see the rules of thumb we presented in Section 26.2.4, "The Good, the Bad, and the Ugly", earlier in this chapter), you probably want to do a little more than sit back and watch. Chapter 27, "Responding to Security Incidents", describes in detail how you should respond to a real security incident.