Assignment - sort command

See both the discussion in your text book (look it up in the index) and the man page for sort. sort is used often to sort lines of text. The lines of text could be the contents of a file, or lines generated by a command and fed as standard input to the sort command through a pipe. sort can be directed to sort on certain "columns" within the lines, by directing it to the corresponding fields.
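As a minimal sketch of the two input styles just described, the commands below sort the same three lines once from a file and once from a pipe. The temporary file held in "demo" and its contents are made up for illustration; they are not part of the assignment.

```shell
# Hypothetical demo data, not a file from this assignment.
demo=$(mktemp)
printf '%s\n' pear apple mango > "$demo"

# 1. sort reads the lines from the named file.
from_file=$(sort "$demo")

# 2. sort reads the same lines from standard input, through a pipe.
from_pipe=$(printf '%s\n' pear apple mango | sort)

printf '%s\n' "$from_file"
printf '%s\n' "$from_pipe"
rm -f "$demo"
```

Either way, sort sees the same lines and produces the same ordering.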

sort can be unexpectedly confusing. To avoid some of the confusion, certain understandings about sort's method of operation are useful. Consider an easy file, with the following contents:

It is unsorted. If you apply the sort command to it with the simplest syntax, "sort sortbasic", you will find it gets sorted by the first field (the fruits). However, you can tell it to skip, or leapfrog, the first field, which makes it use the second one for sorting, with "sort +1 sortbasic". Similarly, you can push its focus farther over, to the number column. See below:

Notice the number you give it. It is not the number of the field by which you want to sort. Rather, it is how many preceding fields you want to ignore, so it is always one less than the desired column's number. Above, for example, in order to sort on the 3rd column (the numbers), the sort command said "+2", not "+3".
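The "+N" notation is the older way of naming a sort field. The POSIX form is "-k", which counts fields starting from 1, so "+1" corresponds to "-k 2" and "+2" to "-k 3"; many current versions of sort accept only the "-k" form. The sketch below assumes a made-up three-column file in the spirit of the regular one above (its contents and the variable name "demo" are illustrative, not the real sortbasic file). It sets LC_ALL=C so that the plain ASCII ordering described in this lesson is used.

```shell
# Hypothetical stand-in for a regular file like sortbasic.
demo=$(mktemp)
printf '%s\n' 'Peach Spain 200' 'Apple Italy 300' 'Grape Chile 100' > "$demo"

# Sort on the 1st field (same idea as "sort sortbasic").
by_fruit=$(LC_ALL=C sort "$demo")

# Sort on the 2nd field: "-k 2" is the modern spelling of "+1".
by_country=$(LC_ALL=C sort -k 2 "$demo")

# Sort on the 3rd field: "-k 3" is the modern spelling of "+2".
by_number=$(LC_ALL=C sort -k 3 "$demo")

printf '%s\n' "$by_fruit" '' "$by_country" '' "$by_number"
rm -f "$demo"
```

If your system's sort rejects "+1", trying the "-k" spelling is a reasonable first step.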

With the above file we were lucky. This file is unusual because it is so regular. The columns are separated by exactly the same number of spaces, and the items in every column all have the same number of letters. Uniformly, we have 5-letter fruits, 5-letter countries, and 3-digit numbers. Also, all the words follow the same capitalization pattern: first letter always capitalized, all other letters always lowercase. Most files are not so nice, and most files don't sort the way you expect until you learn the sort command better. Here's another, not so nice file.

Let's sort it once on each field, as we did with the other file.

Very bad!! None of the sorts worked according to our expectations. First we tried to sort by fruits in the first column, and "banana" which is at the beginning of the alphabet came in last place. Then we tried to sort on the countries in the second (+1) column and "Zambia," at the end of the alphabet, is in first place. Finally we tried the numbers in the third (+2) column and the smallest number 555, which should come first, is in the middle. Let's figure out what went wrong with each column.

With column 1, "sort sortstuff," the fruits should get sorted alphabetically. When computers sort alphabetically, they distinguish between uppercase and lowercase letters, and put all uppercase letters "ahead" of all lowercase ones in the sorting sequence. (This sorting sequence, on U.S. computers, is the well-known ASCII code sequence.)  "banana" is a loser because the "b" character is number 98 in the sequence, and the uppercase letters have lower numbers ("W" in "Watermelon" is 87 for example). 87 before 98; "Watermelon" before "banana". If you want to tell sort not to distinguish between upper and lowercase (b and B are the same, and both b and B come before both w and W), use the command's -f option:

banana-Mango-Pear-Rasberry-Watermelon, that's better.
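Here is a small sketch of the case effect, using a made-up one-column file (the contents and the name "demo" are illustrative only). It assumes LC_ALL=C so that sort uses the ASCII sequence discussed above; in other locales sort may already ignore case differences. The printf lines rely on the standard shell feature that a leading quote in a numeric argument yields the character's code.

```shell
# Hypothetical one-column demo file.
demo=$(mktemp)
printf '%s\n' banana Watermelon Mango > "$demo"

# ASCII codes of the two characters compared in the text.
code_W=$(printf '%d' "'W")
code_b=$(printf '%d' "'b")
printf 'W=%s b=%s\n' "$code_W" "$code_b"

# Plain sort: every uppercase letter comes before every lowercase one.
case_sensitive=$(LC_ALL=C sort "$demo")

# -f folds case, so b and B are treated alike.
case_folded=$(LC_ALL=C sort -f "$demo")

printf '%s\n' "$case_sensitive" '' "$case_folded"
rm -f "$demo"
```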

What about column 2? Why did we get Zambia-Germany-Argentina-Mexico-Laos when we sorted on it?? Zambia before Laos? To answer, we need to identify exactly what sort was sorting. How does sort define the "fields" that it uses? How did it define the content of the "second field" in each line? The man page for sort says, "By default, fields are separated by the empty string between a non-whitespace character and a whitespace character. ...The field separator is not considered to be part of either the field preceding or the field following it." Whitespace usually means space characters. (Or sometimes tabs as well.) Space characters are regular characters inside the file, but the area they occupy when printed on the screen is "painted" plain white.

So look at the "Zambia" and "Laos" lines. Between "Pear" and "Zambia" there are 7 spaces. sort separates the first and 2nd fields "between a non-whitespace character and a whitespace character." That is, between the "r" in Pear and the space character that follows it, or between the "n" in Watermelon and the space that follows it. So spaces are "awarded" to the field on their right, not their left. In alphabetic sorting the first character of each word is initially compared, progressing to the second character in case of a tie. When sorting these 2 lines by their 2nd fields, "Z" and "L" don't govern because they are not the first characters of their fields. Rather, the first character in both fields is a space. The tie is broken by looking at the second character. In the Zambia line, it's another space; in the Laos line, it's an L. Space comes before L (have another look at the ASCII code sequence, space is decimal 32 while L is 76). Space wins. Zambia wins. Zambia comes before Laos because space comes before L.
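The reasoning above can be tried out on two lines modeled on the ones just discussed. This is only a sketch: the two-column lines, the single space before "Laos", and the name "demo" are assumptions made for illustration, and "-k 2" is used as the POSIX spelling of "+1". LC_ALL=C is set so that the ASCII sequence applies.

```shell
# Two hypothetical lines: 7 spaces before Zambia, 1 space before Laos.
demo=$(mktemp)
printf 'Watermelon Laos\n' > "$demo"
printf 'Pear%7sZambia\n' '' >> "$demo"

# Sort on the 2nd field. The leading spaces belong to that field,
# so the keys compared are "       Zambia" and " Laos".
by_second=$(LC_ALL=C sort -k 2 "$demo")

printf '%s\n' "$by_second"
rm -f "$demo"
```

The Zambia line comes out first, because its second character is a space while the Laos line's second character is "L".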

If you want to get rid of this effect, you can tell sort with the -b option to award the blanks to no field at all. It will then construe the fields like this:

and the resulting sort is what you expected:
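A minimal sketch of the -b option at work, on two made-up lines with uneven spacing (the lines and the name "demo" are illustrative, not the sortstuff file). "-k 2" is the POSIX spelling of "+1", and LC_ALL=C keeps the ASCII ordering.

```shell
# Hypothetical lines with different amounts of space before field 2.
demo=$(mktemp)
printf 'Pear%7sZambia\n' '' > "$demo"
printf 'Watermelon Laos\n' >> "$demo"

# With -b the leading blanks are not counted as part of the field,
# so the keys compared are "Zambia" and "Laos".
by_second_b=$(LC_ALL=C sort -b -k 2 "$demo")

printf '%s\n' "$by_second_b"
rm -f "$demo"
```

Now "L" is compared against "Z", and the Laos line comes first.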

 

Now suppose we want to sort on the numbers in the last column. We'll use +2 to sort on the 3rd field. Judging from how irregularly the numbers are staggered, blanks will be an issue unless we use -b. So let's try it:

The only problem is that 3333 should be at the bottom. It's at the top because these numbers are being viewed as numerals and sorted alphabetically just like words. "3" comes before "9" in the ASCII sequence. It even comes before "5". So it's on top. Frequently you want numbers to be sorted according to their numeric value. For that, use sort's -n option:

Perfect. 
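The difference between character-by-character and numeric ordering can be sketched on a made-up file with staggered numbers (the lines below and the name "demo" are assumptions for illustration, not the real sortstuff contents). "-k 3" is the POSIX spelling of "+2", and LC_ALL=C keeps the ASCII ordering.

```shell
# Hypothetical three-column file with irregular spacing.
demo=$(mktemp)
printf '%s\n' 'Pear Zambia    555' 'Mango   Mexico 3333' 'banana  Laos 98' > "$demo"

# Without -n the 3rd field is compared character by character:
# "3333" < "555" < "98" because "3" < "5" < "9".
as_text=$(LC_ALL=C sort -b -k 3 "$demo")

# With -n the 3rd field is compared by numeric value: 98 < 555 < 3333.
as_numbers=$(LC_ALL=C sort -b -n -k 3 "$demo")

printf '%s\n' "$as_text" '' "$as_numbers"
rm -f "$demo"
```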

There are other subtleties with the sort command, but the most common problems come from the treatment of case, whitespace, and numerics, and now you have the solutions for all of them. Now that you do, you can see why users often sort file listings by file size using "ls -l | sort -n +4". The file size appears in the 5th column of the output of ls -l. Hence the "+4". The output of ls -l is being fed to the sort command. Hence the "|". And we want sort to know these numbers should be treated numerically, not alphabetically. Hence the "-n".
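To try the file-size idea safely, here is a sketch that builds a scratch directory with three files of known sizes and sorts its listing. The directory, the file names, and the sizes are made up for illustration. It uses "-k 5" as the POSIX spelling of "+4", and it assumes the usual ls -l layout in which the size is the 5th column and the name is the last.

```shell
# Scratch directory with three hypothetical files of 1, 10, and 100 bytes.
dir=$(mktemp -d)
printf '%100s' '' > "$dir/big"
printf '%1s' '' > "$dir/small"
printf '%10s' '' > "$dir/medium"

# Feed the long listing to sort, ordering numerically on the 5th column.
by_size=$(LC_ALL=C ls -l "$dir" | sort -n -k 5)
printf '%s\n' "$by_size"

# Keep only the file names, skipping the short "total" line.
names=$(printf '%s\n' "$by_size" | awk 'NF >= 9 { print $NF }')
printf '%s\n' "$names"
rm -rf "$dir"
```

The "total" line that ls -l prints has no 5th column, so sort treats its key as zero and puts it at the top.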

Assignment:

Your assignment deals with the file /home/ftp/pub/class/states on the remote unix computer. Here it is:

Copy it to your home directory. In your "assignments" directory, create a "sort" subdirectory.

The assignment is to sort the states file once on each of its 4 fields. Sort it on the state abbreviation and put the result in a file named states1. Then sort it on the state name, into file states2. Sort it on the population number in the 3rd column to file states3. And finally sort it again, on the state capital field, into file states4. Do it as follows:

[ your command to sort on the 1st column ] > states1
[ your command to sort on the 2nd column ] > states2
[ your command to sort numerically on the 3rd column ] > states3
[ your command to sort on the 4th column ] > states4

Put the 4 files you've created into the "assignments/sort" subdirectory. I will create the same 4 files, and give credit where your file and mine are identical.

(Incidentally, when I created this file I had to be careful to exclude states like New York, West Virginia, Rhode Island, and South Carolina. You know about sort now. Can you see why that was necessary? This question is not part of the assignment, just an interesting observation.)