36.4 Confusion with White Space Field Delimiters

One would hope that a simple task like sorting would be relatively unambiguous. Unfortunately, it isn't. The behavior of sort can be very puzzling. I'll try to straighten out some of the confusion; at the same time, I'll be leaving myself open to abuse by the real sort experts. I hope you appreciate this! Seriously, though: if we find any new wrinkles to the story, we'll add them in the next edition.

The trouble with sort is figuring out where one field ends and another begins. It's simplest if you can specify an explicit field delimiter (36.3). This makes it easy to tell where fields end and begin. But by default, sort uses white space characters (tabs and spaces) to separate fields, and the rules for interpreting white space field delimiters are unfortunately complicated. As I see them, they are:
Here is a silly but instructive example that demonstrates most of the hard cases. We'll sort the file sortme, which is:

    apple Fruit shipment
    20 beta beta test sites
    5 Something or other

All is not as it seems: cat -t -v (25.6, 25.7) shows that the file really looks like this:

    ^Iapple^IFruit shipment
    20^Ibeta^Ibeta test sites
    5^I^ISomething or other
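If you want to check a file of your own for hidden TABs, a sketch along these lines may help. The file contents are recreated from the listing above, so the exact spacing is an assumption, and the temporary file stands in for sortme.

```shell
# Recreate a file like sortme in a temporary location.
# Contents are assumed from the listing above; \t is a TAB.
sortme=$(mktemp)
printf '\tapple\tFruit shipment\n20\tbeta\tbeta test sites\n5\t\tSomething or other\n' > "$sortme"

# cat -t -v prints each TAB as ^I, so the invisible delimiters show up.
visible=$(cat -t -v "$sortme")
printf '%s\n' "$visible"

rm -f "$sortme"
```

Two ^I markers in a row, as on the last line, mean there is an empty field between two TABs.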
OK, now let's try some sort commands; I've added annotations on the right, showing what character the "sort" was based on. First, we'll sort on field zero - that is, the first field in each line:

%

As I noted earlier, a TAB precedes a space in the collating sequence. Everything is as expected. Now let's try another, this time sorting on field 1 (the second field):

%

Again, the initial TAB causes "Something or other" to appear first. "Fruit shipment" precedes "beta" because in the ASCII table, uppercase letters precede lowercase letters. Now, let's sort on the next field:

%

No surprises here. And finally, sort on field 3 (the "fourth" field):

%

The only surprise here is that the NULL field gets sorted first. That's really no surprise, though: NULL has the ASCII value zero, so we should expect it to come first.

OK, this was a silly example. But it was a difficult one; a casual understanding of what sort "ought to do" won't explain any of these cases. Which leads to another point. If someone tells you to sort some terrible mess of a data file, you could be heading for a nightmare. But often, you're not just sorting; you're also designing the data file you want to sort. If you get to design the format for the input data, a little bit of care will save you lots of headaches. If you have a choice, never allow TABs in the file. And be careful of leading spaces; a word with an extra space before it will be sorted before other words. Therefore, use an explicit delimiter between fields (like a colon), or use the -b option (and an explicit sort field), which tells sort to ignore initial white space.
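Here is a minimal sketch of both pieces of advice. It assumes a sort that accepts the POSIX -t and -k options (fields numbered from 1), not the older zero-based field notation used in this section. The colon-delimited data is a hypothetical reworking of sortme, and LC_ALL=C is set so that the plain ASCII collating order applies.

```shell
# Hypothetical colon-delimited version of sortme: no TABs, no leading spaces.
data=$(mktemp)
printf '%s\n' ':apple:Fruit shipment' '20:beta:beta test sites' '5::Something or other' > "$data"

# -t: makes the colon the field delimiter; -k2,2 sorts on the second field only.
# The empty second field sorts first, then "apple", then "beta".
by_name=$(LC_ALL=C sort -t: -k2,2 "$data")
printf '%s\n' "$by_name"

# Sort on the third field: uppercase letters precede lowercase in ASCII.
by_desc=$(LC_ALL=C sort -t: -k3,3 "$data")
printf '%s\n' "$by_desc"

# White space delimiters with stray leading spaces: -b plus an explicit
# sort field makes sort ignore the leading blanks when comparing.
trimmed=$(printf '%s\n' '  pear' 'apple' ' banana' | LC_ALL=C sort -b -k1,1)
printf '%s\n' "$trimmed"

rm -f "$data"
```

With the colon as delimiter, an empty field is simply two colons in a row, so there is never any doubt about where a field begins or ends.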