Subscript Heap Sort
Elizabeth Deal
Malvern, PA
This article describes a one-level-deep, ascending, alphanumeric subscript heap sort. It is written for the PET/CBM computer. It should work on systems that use Microsoft BASIC and permit arrays of character strings (Pet, Apple, OSI, Radio Shack).
Sort vs Subscript Sort
"Subscript sort" may be called tag sort, pointer sort, index sort or whatever you wish. The principle behind this type of ordering is that elements in a list are never moved and are not actually sorted. What gets rearranged into an ascending sequence are the subscripts of the array. The neat thing about this trick is that, as we are sorting records with several fields, we never need to move masses of data around. The corresponding fields are carried with the field that is being sorted. Subsequent to sorting, the access to the elements of the array is through the ordered list of subscripts.
For people with garbage collection problems, there is an additional advantage if they are sorting character strings. Since character strings do not have to move, time-consuming garbage collection during the sort will not need to occur. For further information on that subject consult Jim Butterfield's article in COMPUTE! #10, p. 96.
Sorting in BASIC takes considerable time no matter which of many available sorting methods is selected. I like heap sort because its performance is "even" no matter what the order of the original list is and the sorting time is almost linear relative to the list size. The algorithm itself is interesting, fun to study, and efficient on long lists. On short lists (N<25) there is, however, some time penalty as compared to several other sorting methods.
Don't Reinvent The Wheel
If you haven't done so already, you might want to look into a classic on the subject of sorting, merging, and general data management—Knuth, The Art of Computer Programming, vol. 3: Sorting and Searching, Addison and Wesley, 1973. The book looks mathematical and forbidding. But the appearance is deceptive, for there are no Greek letters in it and the sentences that look mathematical are, simply, ideas for the lines of a program. The illustrations are clear and the explanations are not at all complicated.
Book in hand, the algorithm is possible to follow if you practice the binary tree logic and the entire process with pencil and paper. It is then possible to modify the program from the book or the one from COMPUTE! #2 with some degree of assurance that it will successfully sort by subscripts. This program does just that.
Suggestions On Data Management
The demonstration program consists primarily of sorting multifield records. The sort routine sorts field HV. The field type (alpha or numeric) is in HT, number of records to be sorted in H1. The resulting list of subscripts is placed in the SB array, their placement being determined by the comparative numeric or string value of the corresponding elements of the V or V$ array, depending on HT.
When sorting has been finished, in order to use the undisturbed, unsorted list, we ask for V$(f,r) as shown in lines 680-710. To use the list in sorted order we ask for V$(f,SB(r)) as coded in lines 630-661. In plain English, it means to print a value pointed to by the r-th subscript.
The program also contains some suggestions pertaining to general management of data. Take these nonsorting suggestions with a grain of salt. Vary them. These are some of the methods I use, find adequate and which fit most things I do on my PET. It does not mean that your arrangement of data or its parameters has to be like that. These ideas and the following details of the program are given mainly for people who are starting and don't know where to begin.
The program is originally set up (line 760) for NN = 20 estimated number of records and VV = 15 fields per record. You may change those estimates. The actual count of variables (NV) is performed in lines 770-810 while reading in data descriptors contained in the first DATA line. The actual count of records (KN) is done in lines 840-852 while reading in the six records from the remaining DATA lines.
There are two alphabetic and two numeric fields in each of the six records. The field type is stored in array TP. Type is 1 (one) for alpha (A) and 0 (zero) for numbers (N). TP is developed in lines 770-810 using the first DATA line. Since the ASCII collating sequence is irrelevant to sorting unaligned or non-integer or signed numeric values in character string form, these fields are not used in their string form. The values are placed in a one-dimensional work array V which has been set up in line 840. Should you be short of space for this extra array, you may change the program like this: omit all references to the V array by deleting its DIM in line 840 and lines 591 and 650. In the sort routine change all V(SB(*)) to VAL(V$(HV, SB(*))).
In any case, when HT is set to zero in line 600, the sort routine sorts numbers. This coding is in each second line of the subroutines which begin in lines 310, 330 and 350. The main routine (lines 560–711) handles these numbers as character strings, however, so that the output can look tidy while permitting a messy, unaligned input (sometimes useful in files for space-saving reasons).
Two different output methods are used, depending on the type of variable. You'll see different coding for alpha fields (line 640) from that for numeric fields(lines 650–652). The output format is controlled by arrays V1 and V2 which specify the field width. In case of alpha variables, only V1 is required (see the first DATA line where A-12 and A-14 sequences specify alpha fields of 12 and 14 characters to be left-justified by line 530). In case of numbers, both V1 and V2 are needed (see line 450 and N-2-0 and N-4-3 sequences in line 870 which specify right-justified numeric output formats of xxx and xxxx.xxx respectively). The Butterfield formatting procedure from COMPUTE! #9 is used for printing numbers in a neat column.
Why not, you may ask, just read the values into a numeric array since that's what has to be used in sorting? There are several reasons. (1) This data might be an example of an existing disk file containing only character strings. (2) This might be a larger task requiring character by character data checking. Hence there is the need for input of character strings. Editing of data is a story outside the scope of this article, but it's a good idea to remember the issue every once in a while. (3) Unless you enjoy looking at unaligned columns of numbers the output ought to be formatted. Here, again, the easiest way is to work with character strings. Again, these are the methods I am comfortable with. Your opinions may differ and lead to a totally different approach.
Ranking
Finally, there exists a short ranking routine within the listing that might be useful to statistics people who would like to use this for nonparametric tests and suspect tied scores. Ranking fits in this sorting scheme automatically. Note that if there are no tied values then, by definition, at the end of sort the subscripts are the ranks, otherwise an average of ranks is given. Thus the rank routine is needed only when tied values are obvious or suspected. This routine creates an array of ranks (RV) while doing one extra pass through the list in subscript-sorted order. Needless to say, since you get a chance in this demo program to sort on any one of the four variables, the rank values are meaningless in some situations.
Figure 1.* SORTED ON FIELD 4 *
CHARLOTTE | FARM | 74 | -93.000 |
FATHER FOX | VERMONT | 100 | .003 |
WILBUR | FARM | 1 | .488 |
MOUSE | TOOTLETOWN | 84 | 33.700 |
TEMPLETON | FARM | 98 | 647.000 |
TANKER | TOOTLETOWN | 84 | 647.000 |
TANKER | 5.5 |
MOUSE | 4.0 |
FATHER FOX | 2.0 |
CHARLOTTE | 1.0 |
WILBUR | 3.0 |
TEMPLETON | 5.5 |
WILBUR | FARM | 1 | .488 |
CHARLOTTE | FARM | 74 | -93.000 |
TANKER | TOOTLETOWN | 84 | 647.000 |
MOUSE | TOOTLETOWN | 84 | 33.700 |
TEMPLETON | FARM | 98 | 647.000 |
FATHER FOX | VERMONT | 100 | .003 |
TANKER | 3.5 |
MOUSE | 3.5 |
FATHER FOX | 6.0 |
CHARLOTTE | 2.0 |
WILBUR | 1.0 |
TEMPLETON | 5.0 |