Relational and Boolean Operators (sed & awk, Second Edition)

7.8.1. Getting Information About Files

Now we are going to look at a couple of scripts that process the output of a UNIX command, ls. The following is a sample of the long listing produced by the command ls -l:[49]

[49]Note that on a Berkeley 4.3BSD-derived UNIX system such as Ultrix or SunOS 4.1.x, ls -l produces an eight-column report; use ls -lg to get the same report format shown here.

$ ls -l
-rw-rw-rw-   1 dale     project   6041 Jan  1 12:31 com.tmp
-rwxrwxrwx   1 dale     project   1778 Jan  1 11:55 combine.idx
-rw-rw-rw-   1 dale     project   1446 Feb 15 22:32 dang
-rwxrwxrwx   1 dale     project   1202 Jan  2 23:06 format.idx

This listing is a report in which data is presented in rows and columns. Each file is presented across a single row. The file listing consists of nine columns. The file's permissions appear in the first column, the size of the file in bytes in the fifth column, and the filename is found in the last column. Because one or more spaces separate the data in columns, we can treat each column as a field.

In our first example, we're going to pipe the output of this command to an awk script that prints selected fields from the file listing. To do this, we'll create a shell script so that we can make the pipe transparent to the user. Thus, the structure of the shell script is:

ls -l $* | awk 'script'

The $* variable is used by the shell and expands to all arguments passed from the command line. (We could use $1 here, which would pass the first argument, but passing all the arguments provides greater flexibility.) These arguments can be the names of files or directories or additional options to the ls command. If no arguments are specified, the "$*" will be empty and the current directory will be listed. Thus, the output of the ls command will be directed to awk, which will automatically read standard input, since no filenames have been given.
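One caveat, offered as an aside rather than as part of the fls script: an unquoted $* is split again on whitespace, so a filename that contains a space reaches ls as two arguments. The following minimal sketch shows the difference between $* and "$@". The function names count_args, unquoted_pass, and quoted_pass are invented for this illustration.

```shell
#!/bin/sh
# count_args: hypothetical helper that reports how many arguments it received
count_args() { echo "$#"; }

# Illustrative wrappers only; neither is part of fls or filesum
unquoted_pass() { count_args $*; }     # each argument is split again on whitespace
quoted_pass()   { count_args "$@"; }   # each argument is passed through intact

unquoted_pass "my file" notes    # prints 3
quoted_pass   "my file" notes    # prints 2
```

The scripts in this section use $* as written. If the filenames you pass might contain spaces, substituting "$@" is one possible adjustment.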

We'd like our awk script to print the size and name of the file. That is, print field 5 ($5) and field 9 ($9).

ls -l $* | awk '{ 
	print $5, "\t", $9
}'

If we put the above lines in a file named fls and make that file executable, we can enter fls as a command.

$ fls
6041     com.tmp
1778     combine.idx
1446     dang
1202     format.idx
$ fls com*
6041     com.tmp
1778     combine.idx

So what our program does is take the long listing and reduce it to two fields. Now, let's add new functionality to our report by producing some information that the ls -l listing does not provide. We add each file's size to a running total, to produce the total number of bytes used by all files in the listing. We can also keep track of the number of files and produce that total. There are two parts to adding this functionality. The first is to accumulate the totals for each input line. We create the variable sum to accumulate the size of files and the variable filenum to accumulate the number of files in the listing.

{
	sum += $5
	++filenum
	print $5, "\t", $9 
}

The first expression uses the assignment operator +=. It adds the value of field 5 to the present value of the variable sum. The second expression increments the present value of the variable filenum. This variable is used as a counter, and each time the expression is evaluated, 1 is added to the count.
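To watch the two variables change, we can feed the sample listing shown earlier to the same action with one extra print statement. This is only a sketch: the wrapper function running_totals is a name made up here, and the print of filenum and sum is added purely for illustration. Note that awk variables start out empty, which counts as 0 in arithmetic, so neither variable needs to be initialized.

```shell
#!/bin/sh
# running_totals: hypothetical wrapper around the action described above,
# with an extra print so that the intermediate values are visible
running_totals() {
	awk '{
		sum += $5       # add field 5 (size in bytes) to the running total
		++filenum       # count this input line
		print filenum, sum
	}'
}

# The four-line listing from the beginning of this section
running_totals <<'EOF'
-rw-rw-rw-   1 dale     project   6041 Jan  1 12:31 com.tmp
-rwxrwxrwx   1 dale     project   1778 Jan  1 11:55 combine.idx
-rw-rw-rw-   1 dale     project   1446 Feb 15 22:32 dang
-rwxrwxrwx   1 dale     project   1202 Jan  2 23:06 format.idx
EOF
```

Each output line shows the count so far followed by the running total; the last line is "4 10467".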

The action we've written will be applied to all input lines. The totals that are accumulated in this action must be printed after awk has read all the input lines. Therefore, we write an action that is controlled by the END rule.

END { print "Total: ", sum, "bytes (" filenum " files)" }

We can also use the BEGIN rule to add column headings to the report.

BEGIN { print "BYTES", "\t", "FILE" }

Now we can put this script in an executable file named filesum and execute it as a single-word command.

$ filesum c* 
BYTES    FILE
882      ch01
1771     ch03
1987     ch04
6041     com.tmp
1778     combine.idx
Total:  12459 bytes (5 files)

What's nice about this command is that it allows you to determine the size of all files in a directory or any group of files.

While the basic mechanism works, there are a few problems to be taken care of. The first problem occurs when you list the entire directory using the ls -l command. The listing contains a line that specifies the total number of blocks in the directory. The partial listing (all files beginning with "c") in the previous example does not have this line. But the following line would be included in the output if the full directory was listed:

total 555

The block total does not interest us because the program displays the total file size in bytes. Currently, filesum prints nothing meaningful for this line, because the line has no fifth or ninth field; however, it does read the line, which causes the filenum counter to be incremented.
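The miscount is easy to reproduce. The sketch below assembles the BEGIN rule, the main action, and the END rule developed so far, and wraps them in a function so that a fixed listing can be supplied on standard input in place of real ls output. The function name filesum_v1 is invented here, and pairing "total 555" with just two entries is for illustration only.

```shell
#!/bin/sh
# filesum_v1: the first version of the awk program, wrapped in a function
# (a name invented here) so that it can read a fixed listing on standard input
filesum_v1() {
	awk '
	BEGIN { print "BYTES", "\t", "FILE" }
	{
		sum += $5               # an empty $5 adds 0
		++filenum               # incremented for every line, including "total"
		print $5, "\t", $9
	}
	END { print "Total: ", sum, "bytes (" filenum " files)" }'
}

# Two entries from the earlier listing, preceded by the block-total line
filesum_v1 <<'EOF'
total 555
-rw-rw-rw-   1 dale     project   6041 Jan  1 12:31 com.tmp
-rwxrwxrwx   1 dale     project   1778 Jan  1 11:55 combine.idx
EOF
```

The last line of output is "Total:  7819 bytes (3 files)". The byte total is right, because an empty field adds nothing to sum, but the file count is one too high.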

There is also a problem with this script in how it handles subdirectories. Look at the following line from an ls -l:

drwxrwxrwx   3 dale     project    960 Feb  1 15:47 sed

A "d" as the first character in column 1 (file permissions) indicates that the file is a subdirectory. The size of this file (960 bytes) does not indicate the size of files in that subdirectory and therefore, it is slightly misleading to add it to the file size totals. Also, it might be helpful to indicate that it is a directory.

If you want to list the files in subdirectories, supply the -R (recursive) option on the command line. It will be passed to the ls command. However, the listing is slightly different as it identifies each directory. For instance, to identify the subdirectory old, the ls -lR listing produces a blank line followed by:

./old:

Our script ignores that line and the blank line preceding it, but both nonetheless increment the file counter. Fortunately, we can devise rules to handle these cases. Let's look at the revised, commented script:

ls -l $* | awk '
# filesum: list files and total size in bytes
# input: long listing produced by "ls -l"

#1 output column headers
BEGIN { print "BYTES", "\t", "FILE" }

#2 test for 9 fields; files begin with "-"
NF == 9 && /^-/ {
        sum += $5       # accumulate size of file
        ++filenum       # count number of files
        print $5, "\t", $9       # print size and filename
}

#3 test for 9 fields; directory begins with "d"
NF == 9 && /^d/ {
        print "<dir>", "\t", $9  # print <dir> and name
}

#4 test for ls -lR line ./dir:
$1 ~ /^\..*:$/ {
        print "\t" $0 # print that line preceded by tab
}

#5 once all is done,
END {
	# print total file size and number of files
	print "Total: ", sum, "bytes (" filenum " files)"
}'

The rules and their associated actions have been numbered to make it easier to discuss them. The listing produced by ls -l contains nine fields for a file. Awk supplies the number of fields for a record in the system variable NF. Therefore, rules 2 and 3 test that NF is equal to 9. This helps us avoid matching odd blank lines or the line stating the block total. Because we want to handle directories and files differently, we use another pattern to match the first character of the line. In rule 2 we test for "-" in the first position on the line, which indicates a file. The associated action increments the file counter and adds the file size to the previous total. In rule 3, we test for a directory, indicated by "d" as the first character. The associated action prints "<dir>" in place of the file size. Rules 2 and 3 are compound expressions, specifying two patterns that are combined using the && operator. Both patterns must be matched for the expression to be true.

Rule 4 tests for the special case produced by the ls -lR listing ("./old:"). There are a number of patterns that we can write to match that line, using regular expressions or relational expressions:

NF == 1           If the number of fields equals 1 ...
/^\..*:$/         If the line begins with a period followed by any number of
                  characters and ends in a colon ...
$1 ~ /^\..*:$/    If field 1 matches the regular expression ...

We used the last expression because it seems to be the most specific. It employs the match operator (~) to test the first field against a regular expression. The associated action consists of only a print statement.
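To compare the three candidate patterns, we can run them side by side against a few test lines. This is a rough sketch: the function name pattern_demo, the output labels, and the directory name "old files" (chosen to show what happens when a name contains a space) are all made up for the experiment.

```shell
#!/bin/sh
# pattern_demo: hypothetical helper that reports which of the three
# candidate patterns match each input line
pattern_demo() {
	awk '
	NF == 1         { print "NF==1:    " $0 }
	/^\..*:$/       { print "line re:  " $0 }
	$1 ~ /^\..*:$/  { print "field re: " $0 }'
}

printf '%s\n' './old:' './old files:' 'total 555' | pattern_demo
```

All three patterns match "./old:", and none matches "total 555". Only the whole-line regular expression matches "./old files:", because that line has two fields and the first of them does not end in a colon.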

Rule 5 is the END pattern and its action is only executed once, printing the sum of file sizes as well as the number of files.
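To see all five rules at work without depending on the contents of a real directory, we can feed the same awk program a fixed listing. In this sketch the function name filesum_rules is invented, and the listing is an illustrative stand-in for ls -lR output built from entries shown earlier; placing dang inside ./old is an assumption made only for the example.

```shell
#!/bin/sh
# filesum_rules: the revised awk program, wrapped in a function (a name
# invented here) so that it reads a fixed listing instead of running ls
filesum_rules() {
	awk '
	BEGIN { print "BYTES", "\t", "FILE" }
	NF == 9 && /^-/ {                  # rule 2: regular file
		sum += $5
		++filenum
		print $5, "\t", $9
	}
	NF == 9 && /^d/ {                  # rule 3: directory
		print "<dir>", "\t", $9
	}
	$1 ~ /^\..*:$/ {                   # rule 4: ./dir: line from ls -lR
		print "\t" $0
	}
	END { print "Total: ", sum, "bytes (" filenum " files)" }'
}

filesum_rules <<'EOF'
total 555
-rw-rw-rw-   1 dale     project   6041 Jan  1 12:31 com.tmp
drwxrwxrwx   3 dale     project    960 Feb  1 15:47 sed

./old:
total 12
-rw-rw-rw-   1 dale     project   1446 Feb 15 22:32 dang
EOF
```

The two "total" lines and the blank line match no rule and are skipped. The report contains the heading, one line each for com.tmp, sed (shown as <dir>), ./old:, and dang, and it ends with "Total:  7487 bytes (2 files)".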

The filesum program demonstrates many of the basic constructs used in awk. What's more, it gives you a pretty good idea of the process of developing a program (although syntax errors produced by typos and hasty thinking have been gracefully omitted). If you wish to tinker with this program, you might add a counter for directories, or a rule that handles symbolic links.
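Here is one way such tinkering might go. This is a sketch, not a finished program: it assumes that ls -l shows a symbolic link with "l" as the first character and the name followed by "-> target", which makes 11 fields when neither name contains a space. The function name filesum_ext, the variables dirnum and linknum, the "<link>" label, and the sample link entry are all invented for the example.

```shell
#!/bin/sh
# filesum_ext: hypothetical extension of filesum that also counts
# directories and symbolic links; reads an ls -l style listing on stdin
filesum_ext() {
	awk '
	BEGIN {
		sum = filenum = dirnum = linknum = 0   # so that empty counts print as 0
		print "BYTES", "\t", "FILE"
	}
	NF == 9 && /^-/ {                  # regular file
		sum += $5
		++filenum
		print $5, "\t", $9
	}
	NF == 9 && /^d/ {                  # directory
		++dirnum
		print "<dir>", "\t", $9
	}
	NF == 11 && /^l/ {                 # assumed link format: name -> target
		++linknum
		print "<link>", "\t", $9, $10, $11
	}
	END {
		print "Total: ", sum, "bytes (" filenum " files, " dirnum " dirs, " linknum " links)"
	}'
}

filesum_ext <<'EOF'
-rw-rw-rw-   1 dale     project   6041 Jan  1 12:31 com.tmp
drwxrwxrwx   3 dale     project    960 Feb  1 15:47 sed
lrwxrwxrwx   1 dale     project      7 Feb  1 15:47 latest -> com.tmp
EOF
```

For this input the final line reads "Total:  6041 bytes (1 files, 1 dirs, 1 links)". The size of the link is left out of the byte total on the same reasoning that applies to directories.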
