|
Perl Practicum: Out of Sorts
by Hal Pomeranz
Welcome to the first in a series of articles intended to demystify
some of the more occult aspects of Perl programming. The emphasis will
be on practical solutions for common problems, rather than a
serialized step-by-step tutorial for the language.
There have been a fair number of articles in comp.lang.perl recently
which back down to a basic unfamiliarity with the workings of
Perl's sort() function. Very little space is specifically
devoted to sort() in the Camel Book [1] despite the fact
that it is a basic and powerful function. On the face of it,
sort() is rather straightforward: you feed it a list of strings
and it spits the list back in alphabetic order:
|
@list = (`how', `now', `brown', `cow');
@sorted = sort @list;
|
If you want a descending sort, just feed the output of
sort() through reverse() :
|
@ztoa = reverse sort @list;
|
Everybody with us so far? The
powerful aspect of sort() is the ability to define customized sort
criteria. To invoke this customization, add another parameter which is
the name of a function which defines your sort criteria:
|
@sorted = sort mysort @list;
|
Every time sort() needs to compare two values in
@list , it will invoke your function (mysort()
in this case). For ascending sorts, your function should return a
negative integer (any negative integer) if the first value is less
than the second, zero if equal, and positive if the first value is
greater than the second. Reverse these conditions for a descending
sort. Wily UNIX hackers will recognize the noncoincidental similarity
to the qsort(3) library call.
In the interests of speed, the normal Perl function call mechanism is
short-circuited; this has several effects. First, the values to be
compared are automatically passed in as $a and
$b rather than via the argument list. Second, these
values are passed by reference, so expect civilization as we know it
to end if you fiddle with the values of these variables in your
subroutine. Third, your sort function must not be recursive.
With all of this in mind, here's how to do an ascending
numeric sort:
|
@numbers = (3, 5, 4, 11, 1, 7);
@sorted = sort mysort @numbers;
sub mysort { $a <=> $b; }
|
Reverse the positions of $a and $b for a
descending sort (or use the reverse() function on the
resulting list - this wouldn't be a Perl article if the motto "There's
more than one way to do it!" didn't appear somewhere). Note that the
function avoids an explicit return() to save time. In the
absence of an explicit return() , the return value of a
Perl function is the value of the last expression in that function.
In fact, it's not necessary to create a separate subroutine - we may
write an in-line block in place of the function name:
|
@sorted = sort { $a <=> $b; } @numbers;
|
Let aesthetics be your guide here. The truly lazy will point out that
the semicolon inside the block is optional if we are using a recent
version of Perl (pl35 or greater). There are too many versions of Perl
out there which do not support this (ahem) feature for the
author to feel comfortable in recommending this practice, however.
Suppose we are producing a monthly disk report. We have an
associative array, %usage , which is keyed by user name
and stores total disk usage information in MBytes. We would like to
have a list of user names in descending order of disk usage so we can
send nasty-grams to the top ten or so. We need to sort the list of
keys in the array by the values referenced by those keys:
|
@sorted = sort { $usage{$b} <=> $usage{$a}; } keys %usage;
|
Since the granularity of our data is MBytes, we may have several users
with the same amount of disk usage. We can refine our sort to
alphabetize user names with equivalent disk usage
|
@sorted = sort { $usage{$b} <=> $usage{$a} || $a <=> $b; } keys%usage;
|
This is an example of a multifield sort. It relies on the
short-circuit evaluation of the logical OR to only do the second
comparison if the disk usage values are equal (i.e., if the "spaceship
operator," <=> , returns zero).
Suppose instead we had a list into which we had slurped the contents
of a file and we wanted to sort the lines of the file on many fields
within each line. For example, suppose we wished to sort the password
file first by parent directory of the user's home and then by user
name:
|
print sort psort @lines; sub psort {
($anm, $adr) = (split(/:/, $a))[0,5];
($bnm, $bdr) = (split(/:/, $b))[0,5];
$adr =~ s,/[^/]*,,; $bdr =~ s,/[^/]*,,;
$adr cmp $bdr || $anm cmp $bnm; }
|
The function first uses the array slice operator to extract the user
name and home directory fields from each password entry. It then uses
the substitution operator to extract the parent directory of each home
directory by removing the last "/" and everything after it.
Unfortunately, this approach is somewhat inefficient. We may compare
against a given line multiple times (quicksort is an O(n log n)
sort). Each time we do, we have to redo the split() and strip off the
last portion of the home directory path. We can improve performance on
the sort by caching this information ahead of time, at the cost of
more than tripling the amount of memory consumed:
|
for (@lines) {
($nm{$_}, $dir{$_}) = (split(/: ))[0,5];
$dir{$_} =~ s,/[^/]*,,;
}
print sort { $dir{$a} cmp $dir{$b} || $nm{$a} cmp $nm{$b}; } @lines;
|
This sort of approach is often twice as fast as the noncaching
approach.
Further optimization can be achieved if you are sorting on multiple
numeric fields, e.g., the octets of an IP address. By combining the
octets into a single integer value, we are able to sort by doing one,
rather than many, numeric comparisons:
|
for (@ipaddrs) {
$num{$_} = sprintf("%d%03d%03d%03d", split(/\./));
}
@sorted = sort { $num{$a} <=> $num{$b}; } @ipaddrs;
|
The single comparison is generally more efficient than stringing
together multiple comparisons with logical ORs.
What have we learned? First, simple things can have unexpectedly
complex applications. In the case of sort() , we have
moved from simple alphabetic sorts to sorts on multiple fields and
sorts on associated values. Second, heavy optimization can be applied
to sort if you are willing to do some work and expend some extra
memory up front. Third, some data is amenable to being jammed into a
single compound key, thus avoiding multiple comparisons.
[1] Larry Wall and Randal Schwartz, Programming
Perl, O'Reilly and Associates, 1991.ISBN 0-937175-64-1.
Reproduced from ;login: Vol. 18 No. 4, August 1993.
|