As we were writing this book, I decided to make a list of all the
articles, with the number of lines and characters in each, and then
combine that with the description, a status code, and the article's title.
After a few minutes with wc -l -c (29.6), cut (35.14), sort (36.1),
and join (35.19), I had a file that looked like this:
% cat messfile
2850 2095 51441 ~BB A sed tutorial
3120 868 21259 +BB mail - lots of basics
6480 732 31034 + How to find sources - JIK's periodic posting
...900 lines...
5630 14 453 +JP Running Commands on Directory Stacks
1600 12 420 !JP With find, Don't Forget -print
0495 9 399 + Make 'xargs -i' use more than one filename
Yuck. It was tough to read. The columns needed to be straightened.
A little awk (33.11) script turned the mess into this:
% cat cleanfile
2850 2095  51441 ~BB  A sed tutorial
3120  868  21259 +BB  mail - lots of basics
6480  732  31034 +    How to find sources - JIK's periodic posting
...900 lines...
5630   14    453 +JP  Running Commands on Directory Stacks
1600   12    420 !JP  With find, Don't Forget -print
0495    9    399 +    Make 'xargs -i' use more than one filename
Here's the simple script I used and the command I typed to run it:
% cat neatcols
{
    printf "%4s %4s %6s %-4s %s\n", \
        $1, $2, $3, $4, substr($0, index($0,$5))
}
% awk -f neatcols messfile > cleanfile
You can adapt that script for whatever kinds of columns you need to
clean up.
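Here is one possible adaptation, offered only as a sketch: the same
printf with the status code moved to the front of each line. The
function name statusfirst is made up for this illustration; the sample
line comes from the listing above.

```shell
# Sketch: a variant of neatcols that prints the status code first.
# The function name statusfirst is hypothetical.
statusfirst() {
    awk '{
        printf "%-4s %4s %4s %6s %s\n",
            $4, $1, $2, $3, substr($0, index($0, $5))
    }' "$@"
}

# Try it on one line from messfile:
echo '5630 14 453 +JP Running Commands on Directory Stacks' | statusfirst
```

Only the format string and the order of the arguments changed; the
catch-all for the title stays the same.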
In case you don't know awk, here's a quick summary:
The first line of the printf, between double quotes ("), tells the
field widths and alignments.
For example, the first column should be right-aligned in 4 characters
(%4s).
The fourth column should be 4 characters wide, left-adjusted (%-4s).
The fifth column is big enough to just fit (%s).
I used string (%s) instead of decimal (%d) so awk wouldn't strip off
the leading zeros in the columns.
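To see the difference between those formats, a quick experiment like
this one may help. It is only a sketch: the function name show_widths
and the square brackets (added to make the padding visible) are not
part of neatcols.

```shell
# Print the same field with %s and with %d, plus a left-adjusted field.
# The brackets are there only to show where the padding goes.
show_widths() {
    echo '0495 9' | awk '{ printf "[%4s] [%4d] [%-4s]\n", $1, $1, $2 }'
}

show_widths
```

With %d, awk converts 0495 to the number 495 before printing it, so
the leading zero is lost and a space pads the column instead.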
The second line arranges the input data fields onto the output line.
Here, input and output are in the same order, but I could have reordered them.
The first four columns get the first four fields ($1, $2, $3, $4).
The fifth column is a catch-all; it gets everything else.
substr($0, index($0,$5)) means "find the fifth input column; print it
and everything after it."
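One caveat: index() returns the first place where the text of $5
appears anywhere in the line. If the same text also shows up in an
earlier column, the catch-all starts too soon. A sketch of one way
around that, assuming the fields are separated by spaces or tabs, is to
strip the first four fields from a copy of the line. The function name
neatcols_safe and the sample title below are made up for illustration.

```shell
# Sketch: same output format as neatcols, but the title is found by
# removing the first four fields instead of searching for the text of $5.
neatcols_safe() {
    awk '{
        rest = $0
        sub(/^[ \t]*/, "", rest)                 # drop leading blanks
        for (i = 1; i <= 4; i++)
            sub(/^[^ \t]+[ \t]*/, "", rest)      # drop one field and its blanks
        printf "%4s %4s %6s %-4s %s\n", $1, $2, $3, $4, rest
    }' "$@"
}

# A hypothetical line whose title starts with the same text as field 2:
echo '1600 12 420 !JP 12 Ways to Use find' | neatcols_safe
```

For the lines shown in messfile, both versions should give the same
result; the difference only matters when the first word of the title
repeats something from an earlier column.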