

UNIX Power Tools

Chapter 36: Sorting
 

36.4 Confusion with White Space Field Delimiters

One would hope that a simple task like sorting would be relatively unambiguous. Unfortunately, it isn't. The behavior of sort can be very puzzling. I'll try to straighten out some of the confusion; at the same time, I'll be leaving myself open to abuse by the real sort experts. I hope you appreciate this! Seriously, though: if we find any new wrinkles to the story, we'll add them in the next edition.

The trouble with sort is figuring out where one field ends and another begins. It's simplest if you can specify an explicit field delimiter (36.3), which makes it easy to tell where fields begin and end. But by default, sort uses white space characters (tabs and spaces) to separate fields, and the rules for interpreting white space field delimiters are unfortunately complicated. As I see them, they are:

  • The first white space character you encounter is a "field delimiter"; it marks the end of the old field and the beginning of the next field.

  • Any white space character following a field delimiter is part of the new field. That is, if you have two or more white space characters in a row, the first one is used as a field delimiter and isn't sorted. The remainder are sorted as part of the next field.

  • Every field has at least one non-whitespace character, unless you're at the end of the line. (That is: null fields only occur when you've reached the end of a line.)

  • Not all white space is equal. Sorting is done according to the ASCII (51.3) collating sequence. Therefore, TABs are sorted before spaces.
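The last rule is easy to check in isolation. The sketch below is only an illustration: the sample lines are made up, and it assumes a sort that honors LC_ALL=C (plain byte-order comparison), since other locales may order white space differently.

```shell
# Each made-up line starts with a different kind of character and
# describes itself, so the resulting order is easy to read.
# \t in the printf format is a TAB.
sorted=$(printf 'b <- lowercase\nF <- uppercase\n2 <- digit\n <- space\n\t<- TAB\n' | LC_ALL=C sort)

# Print the lines in sorted order.
printf '%s\n' "$sorted"
```

In the C locale the lines should come out in the order TAB, space, digit, uppercase, lowercase, which matches the ASCII table.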

Here is a silly but instructive example that demonstrates most of the hard cases. We'll sort the file sortme , which is:

        apple   Fruit shipment
20      beta    beta test sites 
 5              Something or other

All is not as it seems: cat -t -v (25.6, 25.7) shows that the file really looks like this:

^Iapple^IFruit shipment
20^Ibeta^Ibeta test sites 
 5^I^ISomething or other

^I indicates a tab character. Before showing you what sort does with this file, let's break it into fields, being very careful to apply the rules above. In the table, we use quotes to show exactly where each field begins and ends:

Line   Field 0      Field 1          Field 2      Field 3
1      "^Iapple"    "Fruit"          "shipment"   null (no more data)
2      "20"         "beta"           "beta"       "test"
3      " 5"         "^ISomething"    "or"         "other"
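To follow along, the file can be recreated with printf, and the field boundaries can be made visible by marking the delimiters. This is a sketch under assumptions: trailing blanks are left out, the "|" marker is an arbitrary choice, and the sed expression encodes only the rules as listed above (a delimiter is a white space character that directly follows a non-white-space character), not sort's actual implementation.

```shell
# Recreate the sample file; \t is a TAB. Trailing blanks are omitted (an assumption).
printf '\tapple\tFruit shipment\n20\tbeta\tbeta test sites\n 5\t\tSomething or other\n' > sortme

# Show the TABs as ^I, as in the listing above.
cat -t -v sortme

# Replace every field delimiter with "|". A literal TAB is kept in $tab
# because not every sed understands \t inside a bracket expression.
tab=$(printf '\t')
marked=$(sed "s/\([^ $tab]\)[ $tab]/\1|/g" sortme)

# Print the marked lines, again with TABs shown as ^I:
#   ^Iapple|Fruit|shipment
#   20|beta|beta|test|sites
#    5|^ISomething|or|other
printf '%s\n' "$marked" | cat -t -v
```

Any white space left over after the marking belongs to a field, which is exactly what makes these lines sort in unexpected ways.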

OK, now let's try some sort commands; I've added annotations on the right, showing what character the "sort" was based on. First, we'll sort on field zero - that is, the first field in each line:


% sort sortme                            sort on field zero
        apple   Fruit shipment           field 0, first character: TAB
 5              Something or other       field 0, first character: SPACE
20      beta    beta test sites          field 0, first character: 2

As I noted earlier, a TAB precedes a space in the collating sequence. Everything is as expected. Now let's try another, this time sorting on field 1 (the second field):


% sort +1 sortme                         sort on field 1
 5              Something or other       field 1, first character: TAB
        apple   Fruit shipment           field 1, first character: F
20      beta    beta test sites          field 1, first character: b

Again, the initial TAB causes "Something or other" to appear first. "Fruit shipment" precedes "beta" because in the ASCII table, uppercase letters precede lowercase letters. Now, let's sort on the next field:


% sort +2 sortme                         sort on field 2
20      beta    beta test sites          field 2, first character: b
 5              Something or other       field 2, first character: o
        apple   Fruit shipment           field 2, first character: s

No surprises here. And finally, sort on field 3 (the "fourth" field):


% sort +3 sortme                         sort on field 3
        apple   Fruit shipment           field 3, NULL
 5              Something or other       field 3, first character: o
20      beta    beta test sites          field 3, first character: t

The only surprise here is that the null field gets sorted first. That's really no surprise, though: an empty field compares lower than any field that contains characters, so we should expect it to come first.
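The +N notation used above is obsolete; the standard spelling is -k, which counts fields from 1, so "sort +1" becomes "sort -k 2". The sketch below reruns the four sorts that way. It assumes a POSIX-style sort and the C locale (LC_ALL=C), and the variable names are just for illustration. Note also that a POSIX sort treats all the blanks before a field as part of that field unless -b is given, which differs slightly from the rules listed earlier but should give the same orderings for this file.

```shell
# Recreate the sample file; \t is a TAB. Trailing blanks are omitted (an assumption).
printf '\tapple\tFruit shipment\n20\tbeta\tbeta test sites\n 5\t\tSomething or other\n' > sortme

# LC_ALL=C asks for plain byte-order (ASCII) comparison.
by_line=$(LC_ALL=C sort sortme)          # old notation: sort sortme
by_field1=$(LC_ALL=C sort -k 2 sortme)   # old notation: sort +1 sortme
by_field2=$(LC_ALL=C sort -k 3 sortme)   # old notation: sort +2 sortme
by_field3=$(LC_ALL=C sort -k 4 sortme)   # old notation: sort +3 sortme

# Print the first word of each line to show the order of one result;
# here the line with no fourth field (apple) should come first.
printf '%s\n' "$by_field3" | awk '{print $1}'
```

Without LC_ALL=C, many systems sort according to the user's locale, and the white space effects described here may not appear at all.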

OK, this was a silly example. But it was a difficult one; a casual understanding of what sort "ought to do" won't explain any of these cases. Which leads to another point. If someone tells you to sort some terrible mess of a data file, you could be heading for a nightmare. But often, you're not just sorting; you're also designing the data file you want to sort. If you get to design the format for the input data, a little bit of care will save you lots of headaches. If you have a choice, never allow TABs in the file. And be careful of leading spaces; a word with an extra space before it will be sorted before other words. Therefore, use an explicit delimiter between fields (like a colon), or use the -b option (and an explicit sort field), which tells sort to ignore initial white space.
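Both remedies can be tried on the sample data. This is a sketch, assuming a POSIX-style sort and the C locale; the colon-delimited file sortme.colon and the variable names are hypothetical, invented here for illustration.

```shell
# Recreate the sample file; \t is a TAB. Trailing blanks are omitted (an assumption).
printf '\tapple\tFruit shipment\n20\tbeta\tbeta test sites\n 5\t\tSomething or other\n' > sortme

# -b skips leading blanks in each key; -k 2,2 restricts the key to the
# second field, so the keys compared are "Fruit", "beta", and "Something".
ignoring_blanks=$(LC_ALL=C sort -b -k 2,2 sortme)
printf '%s\n' "$ignoring_blanks" | awk '{print $1}'

# A hypothetical colon-delimited version of similar data. With -t : every
# colon ends a field, so an empty field is unambiguous.
printf '5::Something or other\n:apple:Fruit shipment\n20:beta:beta test sites\n' > sortme.colon
by_colon_field=$(LC_ALL=C sort -t : -k 2,2 sortme.colon)
printf '%s\n' "$by_colon_field"
```

With -b the stray TABs no longer decide the order: the lines sort by the words themselves, uppercase before lowercase. In the colon version, the line with the empty second field comes first.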

- ML

