Noticing Suspicious Activities (Perl for System Administration)

10.2. Noticing Suspicious Activities

A good night watchman needs more than just the ability to monitor for change. She or he also needs to be able to spot suspicious activities and circumstances. A hole in the perimeter fence or unexplained bumps in the night need to be brought to someone's attention. We can write programs to play this role.

10.2.1. Local Signs of Peril

It's unfortunate, but learning to be good at spotting signs of suspicious activity often comes as a result of pain and the desire to avoid it in the future. After the first few security breaches, you'll start to notice that intruders often follow certain patterns and leave behind telltale clues. Spotting these signs, once you know what they are, is often easy in Perl.

TIP

After each security breach, it is vitally important that you take a few moments to perform a postmortem of the incident. Document (to the best of your knowledge) where the intruders came in, what tools or holes they used, what they did, who else they attacked, what you did in response, and so on.

It is tempting to return to normal daily life and forget the break-in. If you can resist this temptation, you'll find later that you've gained something from the incident, rather than just losing time and effort. The Nietzschean principle of "that which does not kill you makes you stronger" is often applicable in the system administration realm as well.

For instance, intruders, especially the less-sophisticated kind, often try to hide their activities by creating "hidden" directories to store their data. On Unix and Linux systems they will put exploit code and sniffer output in directories with names like "..." (dot dot dot), ". " (dot space), or " Mail" (space Mail). These names are likely to be passed over in a cursory inspection of ls output.

We can easily write a program to search for these names using the tools we learned about in Chapter 2, "Filesystems". Here's a program based on the File::Find module (as called by find.pl) that looks for anomalous directory names.

require "find.pl";

# Traverse desired filesystems

&find('.');

sub wanted {
    
   (-d $_) and                           # is a directory
     $_ ne "." and $_ ne ".." and        # is not . or ..

      (/[^-.a-zA-Z0-9+,:;_~\$\#()]/ or     # contains a "bad" character
       /^\.{3,}/ or                      # or starts with at least 3 dots
       /^-/) and                         # or begins with a dash

       print "'".&nice($name)."'\n";
}

# print a "nice" version of the directory name, i.e., with control chars 
# explicated. This subroutine is barely modified from &unctrl() in Perl's
# stock dumpvar.pl
sub nice {
    my($name) = $_[0];
    $name =~ s/([\001-\037\177])/'^'.pack('c',ord($1)^64)/eg;

    $name;
}
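The find.pl wrapper dates from Perl 4 and may not be present on a current installation. As a rough sketch under that assumption, the same checks can be written against File::Find directly. The helper names is_suspicious() and suspicious_dirs() are ours, not part of the original program, and the $ and # in the character class are escaped here only to keep them from being read as a variable.

```perl
use strict;
use warnings;
use File::Find;

# Return true if a directory name looks anomalous (hypothetical helper).
sub is_suspicious {
    my ($dir) = @_;
    return 0 if $dir eq '.' or $dir eq '..';
    return 1 if $dir =~ /[^-.a-zA-Z0-9+,:;_~\$\#()]/;   # contains a "bad" character
    return 1 if $dir =~ /^\.{3,}/;                      # starts with at least 3 dots
    return 1 if $dir =~ /^-/;                           # begins with a dash
    return 0;
}

# Render control characters visibly, e.g. a tab becomes ^I.
sub nice {
    my ($name) = @_;
    $name =~ s/([\001-\037\177])/'^' . chr(ord($1) ^ 64)/eg;
    return $name;
}

# Return the full paths of suspicious directories under the given roots.
sub suspicious_dirs {
    my @roots = @_;
    my @found;
    find(sub {
        push @found, $File::Find::name if -d $_ and is_suspicious($_);
    }, @roots);
    return @found;
}

# Directories to search are taken from the command line.
print "'", nice($_), "'\n" for suspicious_dirs(@ARGV);
```

Keeping the name test in its own subroutine makes it easy to try new patterns against a list of known-bad names before turning the program loose on a whole filesystem.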

Remember the sidebar "Regular Expressions" in Chapter 9, "Log Files"? Filesystem sifting programs like these are another example where this holds true. The effectiveness of these programs often hinges on the quality and quantity of their regular expressions. Too few regexps and you miss things you might want to catch. Too many regexps, or regexps that are inefficient, give your program an exorbitant runtime and resource usage. If you use regexps that are too loose, the program will generate many false positives. It is a delicate balance.

10.2.2. Finding Problematic Patterns

Let's use some of the things we learned in Chapter 9, "Log Files" to move us along in our discussion. We've just talked about looking for suspicious objects; now let's move on to looking for patterns that may indicate suspicious activity. We can demonstrate this with a program that does some primitive logfile analysis to determine potential break-ins.

This example is based on the following premise: most users logging in remotely do so consistently from the same place or a small list of places. They usually log in remotely from a single machine, or from the same ISP modem bank each time. If you find an account that has logged in from more than a handful of domains, it's a good indicator that this account has been compromised and the password has been widely distributed. Obviously this premise does not hold for populations of highly mobile users, but if you find an account that has been logged into from Brazil and Finland in the same two-hour period, that's a pretty good indicator that something is fishy.

Let's walk through some code that looks for this indicator. This code is Unix-centric, but the techniques demonstrated in it are platform independent. First, here's our built-in documentation. It's not a bad idea to put something like this near the top of your program for the sake of other people who will read your code. Before we move on, be sure to take a quick look at the arguments the rest of the program will support:

sub usage {
    print <<"EOU";
lastcheck - check the output of the last command on a machine
            to determine if any user has logged in from > N domains
            (inspired by an idea from Daniel Rinehart)

   USAGE:  lastcheck [args], where args can be any of:
    -i:           for IP #'s, treat class C subnets as the same "domain"
    -h:           this help message
    -f <domain>:  count only foreign domains, specify home domain
    -l <command>: use <command> instead of default /usr/ucb/last
                  note: no output format checking is done!
    -m <#>:       max number of unique domains allowed, default 3
    -u <user>:    perform check for only this username
EOU
    exit;
}

Next we parse the user's command-line arguments. The getopts line below will look at the arguments to the program and set $opt_<flag letter> appropriately. A colon after a letter means that the option takes an argument:

use Getopt::Std;       # standard option processor
getopts('ihf:l:m:u:'); # parse user input

&usage if (defined $opt_h);

# number of unique domains before we complain
$maxdomains = (defined $opt_m) ? $opt_m : 3;
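If you would rather not rely on the $opt_ globals, Getopt::Std's getopts() also accepts a hash reference as a second argument and stores the switches there. The following is a minimal sketch of that form; parse_args() is a hypothetical wrapper introduced only so the example can be run with a fixed argument list.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse lastcheck-style switches from a list of arguments into a hash.
# parse_args() is a hypothetical helper, not part of the original program.
sub parse_args {
    local @ARGV = @_;                     # getopts() works on @ARGV
    my %opt;
    getopts('ihf:l:m:u:', \%opt) or die "bad arguments\n";
    $opt{m} = 3 unless defined $opt{m};   # default number of domains
    return %opt;
}

my %opt = parse_args(qw(-i -m 5 -u laf));
print "max domains: $opt{m}\n";
print "user: $opt{u}\n";
print "class C mode on\n" if $opt{i};
```

Only the switches that actually appear on the command line show up as keys in the hash, so defined() or exists() tests work the same way they do for the $opt_ variables.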

The following lines reflect the portability versus efficiency decision we discussed in Chapter 9, "Log Files". Here we're opting to call an external program. If you wanted to make the program less portable and a little more efficient, you could use unpack() as discussed in that chapter:

$lastex = (defined $opt_l) ? $opt_l : "/usr/ucb/last";

open(LAST,"$lastex|") || die "Can't run the program $lastex:$!\n";
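Because the command string can come from the -l switch, it may be worth reading from the pipe without involving a shell. One way to sketch this, assuming a Unix-like system, is the list form of open() with the '-|' mode and a lexical filehandle. read_command() is a hypothetical helper, and the running perl binary ($^X) stands in for a real last command so the example is self-contained.

```perl
use strict;
use warnings;

# Run a command and return its output lines, without involving a shell.
# read_command() is a hypothetical helper, not part of the original program.
sub read_command {
    my @cmd = @_;
    open(my $fh, '-|', @cmd) or die "Can't run the program $cmd[0]: $!\n";
    my @lines = <$fh>;
    close($fh) or warn "$cmd[0] exited abnormally\n";
    return @lines;
}

# $^X is the running perl; it stands in here for a real last binary.
my @lines = read_command($^X, '-e', 'print "laf pts/1 host.ccs.neu.edu\n"');
print @lines;
```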

Before we get any further into the program, let's take a quick look at the hash of lists data structure this program uses as it processes the data from last. This hash will have a username as its key and a reference to a list of the unique domains that user has logged in from as its value.

For instance, a sample entry might be:

$userinfo{laf} = [ 'ccs.neu.edu', 'xerox.com', 'foobar.edu' ];

This entry shows the account laf has logged in from the ccs.neu.edu, xerox.com, and foobar.edu domains.
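For readers who have not worked with a hash of lists before, here is a short sketch of how such an entry is created, extended, and read back. The extra bu.edu domain is added purely for illustration.

```perl
use strict;
use warnings;

my %userinfo;

# The value stored for each user is a reference to an anonymous array.
$userinfo{laf} = [ 'ccs.neu.edu', 'xerox.com', 'foobar.edu' ];

# Dereference with @{...} to treat the value as an ordinary array.
push @{ $userinfo{laf} }, 'bu.edu';

my $count = @{ $userinfo{laf} };    # number of domains stored
my $first = $userinfo{laf}[0];      # a single element

print "laf: $count domains, first is $first\n";
print "  $_\n" for @{ $userinfo{laf} };
```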

We begin by iterating over the input we get from last; the output on our system looks like this:

cindy    pts/10   sinai.ccs.neu.ed  Fri Mar 27 13:51   still logged in
michael  pts/3    regulus.ccs.neu.  Fri Mar 27 13:51   still logged in 
david    pts/5    fruity-pebbles.c  Fri Mar 27 13:48   still logged in
deborah  pts/5    grape-nuts.ccs.n  Fri Mar 27 11:43 - 11:53  (00:09)
barbara  pts/3    152.148.23.66     Fri Mar 27 10:48 - 13:20  (02:31)
jerry    pts/3    nat16.aspentec.c  Fri Mar 27 09:24 - 09:26  (00:01)

You'll notice that the hostnames (column 3) in our last output have truncated names. We've seen this hostname length restriction before in Chapter 9, "Log Files", but up until now we've sidestepped the challenge it represents. We'll stare danger right in the face in a moment when we start populating our data structure.

Early on in the while loop, we try to skip lines that contain cases we don't care about. In general it is a good idea to check for special cases like this at the beginning of your loops before any actual processing of the data (e.g., a split()) takes place. This lets the program quickly identify when it can skip a particular line and continue reading input:

while (<LAST>){

    # ignore special users
    next if /^reboot\s|^shutdown\s|^ftp\s/; 

    # if we've used -u to specify a specific user, skip all entries
    # that don't pertain to this user (whose name is stored in $opt_u 
    # by getopts for us).           
    next if (defined $opt_u && !/^$opt_u\s/); 

    # ignore X console logins
    next if /:0\s+:0/;
    
    # find the user's name, tty, and remote hostname
    ($user, $tty,$host) = split;
    
    # ignore if the log had a bad username after parsing
    next if (length($user) < 2);

    # ignore if no domain name info in name
    next if $host !~ /\./; 

    # find the domain name of this host (see explanation below)
    $dn = &domain($host);

    # ignore if you get a bogus domain name
    next if (length ($dn) < 2); 

    # ignore this input line if it is in the home domain as specified 
    # by the -f switch (\Q...\E keeps the dots in the domain literal)
    next if (defined $opt_f && ($dn =~ /^\Q$opt_f\E/));
     
    # if we've never seen this user before, simply create a list with 
    # the user's domain and store this in the hash of lists.
    unless (exists $userinfo{$user}){
        $userinfo{$user} = [$dn];
    }
    # otherwise, this can be a bit hairy; see the explanation below
    else {
        &AddToInfo($user,$dn);
    }
}
close(LAST);
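To see what the early filters and the split leave us with, here is a small sketch that runs a few lines through the same checks. parse_last_line() is a hypothetical helper that bundles the tests from the loop above, and the reboot line is a made-up example of an entry the loop would skip.

```perl
use strict;
use warnings;

# Split one line of last output into user, tty, and host fields.
# Returns an empty list for lines the main loop would skip.
sub parse_last_line {
    my ($line) = @_;
    return if $line =~ /^reboot\s|^shutdown\s|^ftp\s/;   # special users
    return if $line =~ /:0\s+:0/;                        # X console logins
    my ($user, $tty, $host) = split ' ', $line;
    return unless defined $host and $host =~ /\./;       # no domain info
    return ($user, $tty, $host);
}

my @sample = (
    'deborah  pts/5    grape-nuts.ccs.n  Fri Mar 27 11:43 - 11:53  (00:09)',
    'barbara  pts/3    152.148.23.66     Fri Mar 27 10:48 - 13:20  (02:31)',
    'reboot   system boot',                # made-up line, skipped
);

for my $line (@sample) {
    my ($user, $tty, $host) = parse_last_line($line) or next;
    print "$user on $tty from $host\n";
}
```

Only the first two lines produce output; the reboot entry is rejected before split is ever called.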

Now let's take a look at the individual subroutines that handle the tricky parts of this program. Our first subroutine, &domain(), takes a Fully Qualified Domain Name (FQDN), i.e., a hostname with the full domain name attached, and returns its best guess at the domain name of that host. It has to be a little smart for two reasons:

1. Not all hostnames in the logs will be actual names. They may be simple IP addresses. In this case, if the user has set the -i switch, we assume any IP address we get is a class C network subnetted on the standard byte boundary. In practical terms this means that we treat the first three octets as the "domain name" of the host. This allows us to treat logins from 192.168.1.10 as coming from the same logical source as logins from 192.168.1.12. This may not be the best of assumptions, but it is the best we can do without consulting another source of information (and it works most of the time). If the user does not use the -i switch, we treat the entire IP address as the domain of record.
2. As mentioned before, the hostnames may be truncated. This leaves us to deal with partial entries like grape-nuts.ccs.n and nat16.aspentec.c. This is not as bad as it might sound, since each host will have its FQDN truncated at the same point every time it is stored in the log. We attempt to work around this restriction as best we can in the &AddToInfo() subroutine we'll discuss in a moment.

Back to the code:

# take an FQDN and attempt to return its domain name
sub domain{
    # look for IP addresses
    if ($_[0] =~ /^\d+\.\d+\.\d+\.\d+$/) {

        # if the user did not use -i, simply return the IP address as is
        unless (defined $opt_i){
            return $_[0];
        }

        # otherwise, return everything but the last octet
        else {
            $_[0] =~ /(.*)\.\d+$/;
            return $1;
        }
    }

    # if we are not dealing with an IP address
    else {
        # downcase the info to make later processing simpler and quicker
        $_[0] = lc($_[0]);

        # then return everything after first dot
        $_[0] =~ /^[^.]+\.(.*)/;
        return $1;
    }
}
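A quick way to get a feel for this routine is to feed it a few hostnames from the sample output. The sketch below uses domain_of(), a hypothetical variant that takes the -i setting as a second argument instead of reading $opt_i, and that returns an empty string when no domain can be extracted.

```perl
use strict;
use warnings;

# domain_of() is a hypothetical variant of &domain() that takes the
# -i setting as its second argument instead of reading $opt_i.
sub domain_of {
    my ($host, $class_c) = @_;

    # IP address: return it whole, or minus the last octet with -i
    if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {
        return $host unless $class_c;
        return $host =~ /^(.*)\.\d+$/ ? $1 : $host;
    }

    # hostname: lowercase it and return everything after the first dot
    return lc($host) =~ /^[^.]+\.(.*)/ ? $1 : '';
}

for my $case (['grape-nuts.ccs.n', 0], ['152.148.23.66', 0], ['152.148.23.66', 1]) {
    my ($host, $class_c) = @$case;
    printf "%-18s -i=%d  ->  %s\n", $host, $class_c, domain_of($host, $class_c);
}
```

The three calls yield ccs.n, 152.148.23.66, and 152.148.23 respectively.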

This next subroutine, short as it is, encapsulates the hardest part of this program. Our &AddToInfo() subroutine has to deal with truncated hostnames and the storing of information into our hash table. We're going to use a substring matching technique that you may find useful in other contexts.

In this case, we'd really like all of the following domain names to be treated and stored as the same domain name in our array of unique domains for a user:

ccs.neu.edu
ccs.neu.ed
ccs.n

When the uniqueness of a domain name is in question, we check three things:

1. Is this domain name an exact match of anything we have stored for this user?
2. Is this domain name a substring of already stored data?
3. Is the stored domain data a substring of the domain name we are checking?

If any of these is the case, we don't need to add a new entry to our data structure because we already have a substring equivalent stored in the user's domain list. If case #3 is true, we'll want to replace the stored data entry with our current entry, ensuring we've stored the largest string possible. Astute readers will also note that cases #1 and #2 can be checked simultaneously since an exact match is equivalent to a substring match where all the characters match.

If all of these cases are false, we do need to store the new entry. Let's take a look at the code first and then talk about how it works:

sub AddToInfo{
    my($user, $dn) = @_;

    for (@{$userinfo{$user}}){

      # case #1 & #2 from above: is this either exact or substring match?
      return if (index($_,$dn) > -1); 

      # check case #3 from above, i.e. is the stored domain data
      # a substring of the domain name we are checking?
      if (index($dn,$_) > -1){
        $_ = $dn; # swap current & stored values
        return;
      } 
    }
    
    # otherwise, this is a new domain, add it to the list
    push @{$userinfo{$user}}, $dn;
}

@{$userinfo{$user}} returns the list of domains we've stored for the specified user. We iterate over each item in this list to see if $dn can be found in any item. If it can, we have a substring equivalent already stored, so we exit the subroutine.

If we pass this test, we look for case #3 above: we check whether the stored entry can be found in our current domain. If it can, we overwrite the list entry with the current domain data, thus storing the larger of the two strings. (An exact match never gets this far, because the first test has already returned.) We overwrite the entry using a special property of the for and foreach Perl operators. Assigning to $_ in the middle of a for loop like this actually assigns to the current element of the list at that point in the loop. The loop variable becomes an alias for the list element. If we've made this swap, we can leave the subroutine. If no stored entry matches any of the three cases, the final line adds the domain name in question to the user's domain list.
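To watch the substring matching at work, here is a small sketch that feeds a handful of truncated and full names through the same logic. add_domain() is a hypothetical stand-in for &AddToInfo() that operates on a plain array reference so it can be run on its own.

```perl
use strict;
use warnings;

# add_domain() is a hypothetical stand-in for &AddToInfo() that works
# on a plain array reference so it can be run on its own.
sub add_domain {
    my ($list, $dn) = @_;
    for (@$list) {
        return if index($_, $dn) > -1;   # cases #1 and #2: already covered
        if (index($dn, $_) > -1) {       # case #3: stored entry is shorter
            $_ = $dn;                    # $_ is an alias, so this updates @$list
            return;
        }
    }
    push @$list, $dn;                    # a genuinely new domain
    return;
}

my @domains;
for my $dn (qw(ccs.n ccs.neu.ed xerox.com ccs.neu.edu ccs.neu.ed)) {
    add_domain(\@domains, $dn);
}
print join(', ', @domains), "\n";
```

Five names go in, but the list ends up holding only ccs.neu.edu and xerox.com: each longer form of the ccs name replaces the shorter one in place, and the final truncated ccs.neu.ed is recognized as already stored.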

That's it for the gory details of iterating over the file and building our data structure. To wrap this program up, let's run through all of the users we found and check how many unique domains each has logged in from (i.e., the size of the list we've stored for each). For those entries that have more domains than our comfort level, we print the contents of their entry:

for (sort keys %userinfo){
    # compare the number of stored domains (not the last index) to the limit
    if (scalar(@{$userinfo{$_}}) > $maxdomains){
        print "\n\n$_ has logged in from:\n";
        print join("\n",sort @{$userinfo{$_}});
    }
}
print "\n";
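One detail worth keeping in mind when comparing a list against a limit: $#{...} yields the last index of the array, which is one less than the number of elements, while the array itself in numeric context yields the count. The sketch below shows the difference; over_limit() is a hypothetical helper and the sample data is invented for illustration.

```perl
use strict;
use warnings;

# over_limit() is a hypothetical helper: it returns the users whose
# domain list holds more than $max entries.
sub over_limit {
    my ($userinfo, $max) = @_;
    return grep { scalar(@{ $userinfo->{$_} }) > $max } sort keys %$userinfo;
}

# Invented sample data, for illustration only.
my %userinfo = (
    laf   => [ 'ccs.neu.edu', 'xerox.com', 'foobar.edu' ],
    cindy => [ 'ccs.neu.edu', 'bu.edu', 'hials.no', 'xerox.com' ],
);

my $last_index = $#{ $userinfo{cindy} };   # index of the last element
my $count      = @{ $userinfo{cindy} };    # number of elements

print "last index $last_index, count $count\n";
print "over the limit of 3: @{[ over_limit(\%userinfo, 3) ]}\n";
```

With a limit of 3, cindy's four domains put her over while laf's three do not.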

Now that you've seen the code, you might wonder if this approach really works. Here's some real sample output of our program for a user who had her password sniffed at another site:

username has logged in from:
38.254.131
bu.edu
ccs.neu.ed
dac.neu.ed
hials.no
ipt.a
tnt1.bos1
tnt1.bost
tnt1.dia
tnt2.bos
tnt3.bos
tnt4.bo
toronto4.di

Some of these entries look normal for a user in the Boston area. However, the toronto4.di entry is a bit suspect and the hials.no site is in Norway. Busted!

This program could be further refined to include the element of time or correlations with another log file like that from tcpwrappers. But as you can see, pattern detection is often very useful by itself.