Definitions of sort keys



Last revision July 20, 2004

A sort key specifies which portion of each line is compared to put the lines in order.

If more than one sort key is specified, then the lines are first ordered according to the first key, comparing character by character within the key. Next, all lines with equal values of the first key are ordered according to the second key. Then, all lines with equal values of the first and second keys are ordered according to the third key, and so on.

By default, there is a single sort key consisting of the entire line. You can divide the line into fields and specify which fields make up which sort key. The default field delimiter is a blank space. You can reset the field delimiter to any character x with the option -tx.
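
As a rough illustration of the -t option, the sketch below sorts colon-delimited lines by their third field. The data is made up for the example. It uses the POSIX -k key syntax, in which fields are numbered from 1 (so -k 3,3 covers the same field as +2 -3), because many current versions of sort no longer accept the +pos form. LC_ALL=C is set on the assumption that you want plain ASCII ordering regardless of your locale settings.

```shell
# Hypothetical sample data: name:department:city, one record per line.
people() {
    printf '%s\n' 'smith:sales:boston' 'jones:admin:denver' 'adams:sales:austin'
}

# -t: makes the colon the field delimiter.
# -k 3,3 limits the sort key to the third field (the city).
by_city=$(people | LC_ALL=C sort -t: -k 3,3)

printf '%s\n' "$by_city"
# Lines come out in city order: austin, boston, denver.
```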

Specify the fields that make up a sort key with an option like:
      +pos1 -pos2
The sort key starts at field position pos1 and extends up to, but NOT including, pos2. Each position is simply a field number, counted from the left end of the line, with 0 (zero) as the first field on the line. This differs from most utilities, which number fields starting at 1. You can also think of these position numbers as offsets from the first field: +0 means no offset (start at the first field), and -1 means the key ends one field in (just before the second field). For example, +1 -2 selects only the second field. You can omit the ending position (-pos2), in which case the sort key extends through the last field on the line.

Give as many sort keys as you like. You can even "back up" to an earlier field in a later key. For example, this sorts on the fifth field first and then on the second field:
      +4 -5 +1 -2
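
To see the effect of a key list like the one above, here is a small sketch using made-up five-field lines. It assumes a sort that accepts the POSIX -k syntax, where fields are numbered from 1, so +4 -5 +1 -2 is written as -k 5,5 -k 2,2. LC_ALL=C is an added assumption to force plain ASCII ordering.

```shell
# Hypothetical sample data with five blank-separated fields per line.
records() {
    printf '%s\n' 'a x 1 1 red' 'b k 1 1 blue' 'c m 1 1 red' 'd b 1 1 blue'
}

# First key: fifth field (the color). Second key: second field.
# Lines with the same color are ordered among themselves by the second field.
two_keys=$(records | LC_ALL=C sort -k 5,5 -k 2,2)

printf '%s\n' "$two_keys"
# Order: both "blue" lines (second field b, then k),
# followed by both "red" lines (second field m, then x).
```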

Within any sort key, comparison is done column by column from left to right, ending at the last column of the shorter key. WARNING: if your fields are separated by multiple blanks, you may not get the sorting order that you expect unless you use the -b option to ignore leading blanks. Where a line has several blanks in a row, only the first blank is treated as the delimiter; the remaining blanks are treated as part of the next field. If you use a non-blank delimiter character, all the blanks are treated as part of the field. In the default sort order, a blank sorts before any alphanumeric character in the same character position.

For example, suppose your file samplefile has these two lines:

    abc   def ghi
    abc abc ghi

If you want to sort them alphabetically by the second field, you might give a sort command like this:
      sort +1 -2 samplefile
You might be surprised to get the output in this order:

    abc   def ghi
    abc abc ghi

Clearly, abc comes before def, so why is the line whose second field is def first? The problem is that the first line has three blanks separating the first and second fields, while the second line has only one. One blank is used up as the delimiter and is not compared, but the other two blanks in the first line are considered part of the second field. So sort is actually comparing "  def" with "abc". The blank in the first position of the second field on the first line sorts before the letter a in the first position of the second field on the second line.

To get the intuitively "correct" sorting in this example, use the -b option to tell sort to ignore the leading blanks in each field. The second field of the first line then contains no leading blanks, so def sorts after abc, as expected. This command:
      sort -b +1 -2 samplefile
produces the expected output:

    abc abc ghi
    abc   def ghi

By default, sort keys are compared according to the ASCII collating sequence; that is, by the ASCII code values of the characters. The general order is blanks, numerals, uppercase letters, and then lowercase letters, with special characters mixed in at various places. On pangea, give the command man ascii to see the actual ASCII table, showing the octal, hexadecimal, or decimal codes that correspond to each letter and character.
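
A quick way to check this ordering is to sort a few one-item lines, as in the sketch below. The sample lines are made up. LC_ALL=C is set on the assumption that your system may otherwise use a locale whose collating order differs from plain ASCII.

```shell
# One line starting with a blank, one numeral, one uppercase letter,
# and one lowercase letter, deliberately given out of order.
ordered=$(printf '%s\n' 'b' 'B' '2' ' x' | LC_ALL=C sort)

printf '%s\n' "$ordered"
# ASCII order: the line starting with a blank, then 2, then B, then b.
```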
