$ ls -l
-rw-rw-rw- 1 dale project 6041 Jan 1 12:31 com.tmp
-rwxrwxrwx 1 dale project 1778 Jan 1 11:55 combine.idx
-rw-rw-rw- 1 dale project 1446 Feb 15 22:32 dang
-rwxrwxrwx 1 dale project 1202 Jan 2 23:06 format.idx
This listing is a report in which data is presented in rows
and columns. Each file is presented across a single row.
The file listing consists of nine columns.
The file's permissions appear in the first column,
the size of the file in bytes in
the fifth column, and the filename is found in
the last column. Because
one or more spaces separate the data in columns,
we can treat each column as a field.
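As a quick check of this field splitting, the following sketch feeds one line of the listing above to awk and prints the field count (held in awk's built-in variable NF) along with fields 1, 5, and 9. The shell variable names line and fields are ours, chosen only for illustration.

```shell
# One line copied from the ls -l listing above.
line='-rw-rw-rw- 1 dale project 6041 Jan 1 12:31 com.tmp'

# awk splits the record on runs of spaces; NF holds the number of fields.
fields=$(echo "$line" | awk '{ print NF, $1, $5, $9 }')
echo "$fields"
```

The line has nine fields, with the permissions first, the size fifth, and the filename last.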
In our first example,
we're going to pipe the output of this command to an awk
script that prints selected fields from the
file listing.
To do this, we'll create a
shell script so that we can make the pipe transparent to the user.
Thus, the structure of the shell script is:
ls -l $* | awk 'script'
The $* variable is used by the shell and expands to
all arguments passed from the command line. (We could use
$1 here, which would pass the first argument, but passing
all the arguments provides greater flexibility.)
These arguments can be the names of files or directories
or additional options to the ls command.
If no arguments are specified, $* will be empty
and ls will list the current directory.
Thus, the output of the ls command will be directed to
awk, which will automatically read standard input, since
no filenames have been given.
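One caveat, offered as an aside rather than as part of the original script: an unquoted $* is split again on whitespace, so a filename containing a space arrives as two arguments. The sketch below uses a hypothetical helper, count_args, to show the difference. "$@" is the standard shell form that keeps each argument intact.

```shell
# count_args is a hypothetical helper that reports how many arguments it received.
count_args() { echo $#; }

# Simulate a command line with two arguments, one of which contains a space.
set -- "my file" other

unquoted=$(count_args $*)     # "my file" is split into two words
quoted=$(count_args "$@")     # each original argument is preserved
echo "$unquoted $quoted"
```

For the filenames used in this section, which contain no spaces, the two forms behave the same.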
We'd like our awk script to print the size and name of the file.
That is, print field 5 ($5) and field 9 ($9).
ls -l $* | awk '{
print $5, "\t", $9
}'
If we put the above lines in a file named fls
and make that file executable, we can enter fls
as a command.
$ fls
6041 com.tmp
1778 combine.idx
1446 dang
1202 format.idx
$ fls com*
6041 com.tmp
1778 combine.idx
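The steps of saving the script and making it executable might look like the following sketch. It works in a scratch directory created with mktemp so that nothing in the current directory is touched. The #!/bin/sh line is an addition not shown in the original script, included so that the file runs the same way regardless of the invoking shell.

```shell
# Work in a throwaway directory (an assumption made for this sketch).
dir=$(mktemp -d)
cd "$dir" || exit 1

# Save the script in a file named fls; the quoted EOF keeps $* and $5 literal.
cat > fls <<'EOF'
#!/bin/sh
ls -l $* | awk '{
print $5, "\t", $9
}'
EOF

# Make the file executable, then run it on itself.
chmod +x fls
listing=$(./fls fls)
echo "$listing"
```

The output is a single line giving the size of fls in bytes followed by its name.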
So what our program does is take the long listing and reduce it
to two fields. Now, let's add new functionality to our report
by producing some information that the ls -l listing does not provide.
We add each file's size to a running total, to produce
the total number of bytes used by all files in the listing.
We can also keep track of the number of files and produce
that total.
There are two parts to adding this functionality. The first is
to accumulate the totals for each input line. We create the variable
sum to accumulate the size of files and the variable
filenum to accumulate the number of files in the listing.
{
sum += $5
++filenum
print $5, "\t", $9
}
The first expression uses the assignment operator
+=. It adds the value of field 5 to the present value of the variable
sum.
The second expression increments the present value of
the variable filenum.
This variable is used as a counter, and each
time the expression is evaluated,
1 is added to the count.
The action we've written will be applied to all input lines.
The totals that are accumulated in this action must
be printed after awk has read all the input lines.
Therefore, we write an action that is controlled by the
END rule.
END { print "Total: ", sum, "bytes (" filenum " files)" }
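To see the accumulating action and the END rule working together, here is a small sketch that runs them on two lines copied from the listing at the top of this section rather than on live ls output. Awk variables start out empty and are treated as zero in arithmetic, so sum and filenum need no initialization. The shell variable totals is ours.

```shell
totals=$(printf '%s\n' \
    '-rw-rw-rw- 1 dale project 6041 Jan 1 12:31 com.tmp' \
    '-rwxrwxrwx 1 dale project 1778 Jan 1 11:55 combine.idx' |
awk '
{
    sum += $5      # add this file size to the running total
    ++filenum      # count this input line as one file
}
END { print sum, filenum }   # runs once, after the last line is read
')
echo "$totals"
```

This prints 7819 (that is, 6041 + 1778) and 2.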
We can also use the BEGIN rule to add column headings to the
report.
BEGIN { print "BYTES", "\t", "FILE" }
Now we can put this script in an executable file named filesum
and execute it as a single-word command.
$ filesum c*
BYTES FILE
882 ch01
1771 ch03
1987 ch04
6041 com.tmp
1778 combine.idx
Total: 12459 bytes (5 files)
What's nice about this command is that it allows you to determine
the total size of all files in a directory or of any group of files.
While the basic mechanism works, there are a few problems to
be taken care of.
The first problem occurs when you list the entire directory
using the ls -l command.
The listing contains
a line that specifies the total number of blocks in the directory.
The partial listing (all files beginning with "c") in the previous
example does not have this line.
But the following line would be included in the output if the
full directory was listed:
total 555
The block total does not interest us because
the program displays the total file size in bytes.
Currently, filesum prints none of this line's text (fields 5 and 9 are empty);
however, it does read the line, which causes the filenum counter to be incremented.
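The miscount is easy to reproduce. This sketch feeds the block-total line plus one file line to the same counting action. The shell variable count is ours.

```shell
count=$(printf '%s\n' \
    'total 555' \
    '-rw-rw-rw- 1 dale project 6041 Jan 1 12:31 com.tmp' |
awk '{ ++filenum } END { print filenum }')
echo "$count"    # the "total" line is counted as if it were a file
```

The counter reports 2 even though only one file is listed.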
There is also a problem with this script in how
it handles subdirectories. Look at the following
line from an ls -l:
drwxrwxrwx 3 dale project 960 Feb 1 15:47 sed
A "d" as the first character in column 1 (file permissions) indicates that
the file is a subdirectory. The size of this file (960 bytes) does not
indicate the size of the files in that subdirectory, and it is therefore
slightly misleading to add it to the file size totals. Also, it might be
helpful to indicate that it is a directory.
If you want to list the
files in subdirectories, supply the -R (recursive)
option on the command line.
It will be passed to the ls command.
However, the listing is slightly different as it
identifies each directory.
For instance, to identify the subdirectory old, the
ls -lR listing produces a blank line followed
by:
./old:
Our script prints nothing from that
line or from the blank line preceding it, but both lines nonetheless increment
the file counter.
Fortunately, we can devise rules
to handle these cases.
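One way to look for a distinguishing test is to print the number of fields awk finds on each kind of line; awk keeps that count in its built-in variable NF. The sketch below does this for the five kinds of lines discussed so far, using lines copied from this section. The names counts and sep are ours.

```shell
counts=$(printf '%s\n' \
    'total 555' \
    '-rw-rw-rw- 1 dale project 6041 Jan 1 12:31 com.tmp' \
    'drwxrwxrwx 3 dale project 960 Feb 1 15:47 sed' \
    '' \
    './old:' |
awk '
{ printf "%s%s", sep, NF; sep = " " }   # field count of each line, space separated
END { print "" }                        # finish with a newline
')
echo "$counts"
```

Only the file and directory lines have nine fields; the block total has two, the blank line none, and the directory heading one.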
Let's look at the revised, commented script:
ls -l $* | awk '
# filesum: list files and total size in bytes
# input: long listing produced by "ls -l"
#1 output column headers
BEGIN { print "BYTES", "\t", "FILE" }
#2 test for 9 fields; files begin with "-"
NF == 9 && /^-/ {
sum += $5 # accumulate size of file
++filenum # count number of files
print $5, "\t", $9 # print size and filename
}
#3 test for 9 fields; directory begins with "d"
NF == 9 && /^d/ {
print "<dir>", "\t", $9 # print <dir> and name
}
#4 test for ls -lR line ./dir:
$1 ~ /^\..*:$/ {
print "\t" $0 # print that line preceded by tab
}
#5 once all is done,
END {
# print total file size and number of files
print "Total: ", sum, "bytes (" filenum " files)"
}'
The rules and their associated actions have been numbered
to make it easier to discuss them. The listing
produced by ls -l contains nine fields for
a file. Awk supplies the number of fields
for a record in the system variable NF.
Therefore, rules 2 and 3 test that NF is equal to 9.
This helps us avoid matching odd blank lines
or the line stating the block total.
Because we want to handle directories and files differently,
we use another pattern to match the first character of
the line. In rule 2 we test for "-" in the first position
on the line, which indicates a file.
The associated action increments
the file counter and adds the file size to the previous
total. In rule 3, we test for a directory, indicated by "d" as
the first character. The associated action
prints "<dir>" in place of the file size.
Rules 2 and 3 are compound expressions,
specifying two patterns that are
combined using the && operator.
Both patterns must be matched for the expression to
be true.
Rule 4 tests for the special case produced by the ls -lR
listing ("./old:").
There are a number of patterns that we can write to match
that line, using regular expressions or relational expressions:
NF == 1            If the number of fields equals 1 ...
/^\..*:$/          If the line begins with a period followed by any number of
                   characters and ends in a colon ...
$1 ~ /^\..*:$/     If field 1 matches the regular expression ...
We used the last expression because it seems to be the most
specific. It employs the match operator (~) to test the first
field against a regular expression. The associated action
consists of only a print statement.
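The difference between the three patterns can be seen by counting how many lines each one matches. In this sketch the input lines "notes" and "./my dir:" are hypothetical, made up to show where the patterns disagree, and the counters a, b, and c are ours.

```shell
matches=$(printf '%s\n' './old:' 'notes' './my dir:' |
awk '
NF == 1          { a++ }   # any one-field line, including "notes"
/^\..*:$/        { b++ }   # whole line starts with "." and ends with ":"
$1 ~ /^\..*:$/   { c++ }   # first field alone starts with "." and ends with ":"
END { print a, b, c }
')
echo "$matches"
```

The result is 2 2 1: the field test is the strictest of the three, although it also passes over a directory name that contains a space.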
Rule 5 is the END pattern and its action is only executed once,
printing the sum of file sizes as well as the number of files.
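To watch the five rules at work without depending on the contents of a real directory, we can feed the same rules a canned listing. The sketch below is a condensed copy of the script with the comments removed. The input lines are taken from examples in this section, and the shell variable report is ours.

```shell
report=$(printf '%s\n' \
    'total 555' \
    '-rw-rw-rw- 1 dale project 6041 Jan 1 12:31 com.tmp' \
    'drwxrwxrwx 3 dale project 960 Feb 1 15:47 sed' \
    '' \
    './old:' \
    '-rw-rw-rw- 1 dale project 1446 Feb 15 22:32 dang' |
awk '
BEGIN { print "BYTES", "\t", "FILE" }
NF == 9 && /^-/ { sum += $5; ++filenum; print $5, "\t", $9 }
NF == 9 && /^d/ { print "<dir>", "\t", $9 }
$1 ~ /^\..*:$/ { print "\t" $0 }
END { print "Total: ", sum, "bytes (" filenum " files)" }
')
echo "$report"
```

The block total and the blank line produce no output and are not counted, the directory is listed but left out of the totals, and the final line reports 7487 bytes in 2 files.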
The filesum program demonstrates many of the
basic constructs used in awk. What's more, it gives
you a pretty good idea of the process of developing a
program (although syntax errors produced by typos
and hasty thinking have been gracefully omitted).
If you wish to tinker with this program, you might add a counter
for directories, or a rule that handles symbolic links.
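As one possible starting point for that tinkering, the sketch below adds a directory counter and a symbolic-link rule. It assumes that ls -l shows a link as "name -> target", which gives such lines 11 fields and a leading "l". The variable names dirnum, linknum, and summary are our own, and the link entry in the canned input is hypothetical.

```shell
summary=$(printf '%s\n' \
    '-rw-rw-rw- 1 dale project 6041 Jan 1 12:31 com.tmp' \
    'drwxrwxrwx 3 dale project 960 Feb 1 15:47 sed' \
    'lrwxrwxrwx 1 dale project 7 Jan 3 09:00 latest -> com.tmp' |
awk '
NF == 9 && /^-/  { sum += $5; ++filenum }      # regular files, as before
NF == 9 && /^d/  { ++dirnum }                  # count directories
NF == 11 && /^l/ { ++linknum                   # count symbolic links
                   print "<link>", "\t", $9, "->", $11 }
END { print filenum " files, " dirnum " directories, " linknum " links" }
')
echo "$summary"
```

The last line of output reads "1 files, 1 directories, 1 links"; as in filesum, the link's own size is left out of the byte total.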