NAME
Sort::External::Cookbook - sort huge lists efficiently
DESCRIPTION
Sort::External has a special affinity for the GRT or Guttman-Rosler Transform, also known as the "packed-default" sort. The GRT first came to the author's attention via "A Fresh Look at Efficient Perl Sorting", a paper by the transform's creators, Uri Guttman and Larry Rosler. It is located as of this writing at http://www.sysarch.com/perl/sort_paper.html.
For most applications, the GRT is the most efficient Perl sorting transform. This document explores how to use it to solve common coding problems in conjunction with either sort() or Sort::External.
GRT Fundamentals
Eliminate the sortsub!
Perl's sortsubs are extremely flexible and powerful, but they are also computationally expensive. If you can eliminate the sortsub entirely from your algo, major optimizations kick in as Perl decides that it can perform the whole sort at the C level without leaving to execute Perl code.
Like Perl's native sort() function, Sort::External really, really likes to use the default ascending lexical sort. If you are chasing efficiency, you should do everything you can to avoid supplying Sort::External with a sortsub.
Encode, sort, decode.
The standard technique for sorting a hash by value in Perl is to use a sortsub:
my %colors = (
"Granny Smith" => "green",
"Golden Delicious" => "yellow",
"Pink Lady" => "pink",
);
my @sorted_by_color = sort { $colors{$a} cmp $colors{$b} } keys %colors;
# Result:
# @sorted_by_color = ("Granny Smith", "Pink Lady", "Golden Delicious");
To accomplish the same goal with a GRT, we use a three-step process:
# 1. Encode
my @sortkeys;
while (my ($variety, $color) = each %colors) {
push @sortkeys, sprintf( "%-6s%-16s", $color, $variety );
}
# @sortkeys = (
# "green Granny Smith ",
# "yellowGolden Delicious",
# "pink Pink Lady ",
# );
# 2. Sort
my @sorted_by_color = sort @sortkeys; # no sortsub!
# 3. Decode
@sorted_by_color = map { substr($_, 6) } @sorted_by_color;
When substituting Sort::External for sort, steps 2 and 3 need to be modified:
# 1. Encode
my @sortkeys;
while (my ($variety, $color) = each %colors) {
push @sortkeys, sprintf( "%-6s%-16s", $color, $variety );
}
# 2. Sort
my $sortex = Sort::External->new( -mem_threshold => 2**24 );
$sortex->feed( @sortkeys );
$sortex->finish;
# 3. Decode
my @sorted_by_color;
while (defined($_ = $sortex->fetch)) {
push @sorted_by_color, substr($_, 6);
}
For the sake of brevity, we'll use sort() instead of Sort::External for most of our examples.
Encoding techniques
sprintf
The sprintf technique for encoding sortkeys is the most straightforward and flexible. In addition to preparing multi-field text data for a lexical sort, it can be used to put numerical data into a string form where the lexical sort doubles as a numerical sort.
my %prices = (
"Granny Smith" => 1.69,
"Pink Lady" => 1.99,
"Golden Delicious" => .79, # on sale!
);
my @sortkeys;
while (my ($variety, $price) = each %prices) {
push @sortkeys, sprintf( "%1.2f%-16s", $price, $variety );
}
# @sortkeys = (
# "1.69Granny Smith ",
# "0.79Golden Delicious",
# "1.99Pink Lady ",
# );
There are two potential problems with the sprintf technique.
First, you must know the maximum width of all fields in advance. If you don't, you'll have to run through your data in a wasteful "characterization pass" solely for the purpose of measuring field widths.
Second, if the fields have a long maximum width but the content is sparse, sprintf-generated sortkeys take up a lot of unnecessary space. With Sort::External, that means extra disk i/o and reduced performance.
null-termination
An alternative to using sprintf to create fixed-width data is to separate fields with null bytes.
while (my ($variety, $color) = each %colors) {
push @sortkeys, "$color\0$variety";
}
my @sorted = sort @sortkeys;
s/.*?\0// for @sorted;
Null-termination often saves space over sprintf -- however, if your data may contain null bytes, you have to be very careful about how you use it.
pack
pack() can serve as a less-flexible version of sprintf for encoding fixed-width strings. More usefully, it can also encode positive integers into a compact sortable form. The four integer pack formats which sort correctly are:
# c => signed char maximum value 127
# C => unsigned char maximum value 255
# n => 16-bit network short maximum value 65_535
# N => 32-bit network unsigned long maximum value 4_294_967_295
my %first_introduced = (
"Granny Smith" => 1868,
"Pink Lady" => 1970,
"Golden Delicious" => 1890,
);
while (my ($variety, $year) = each %first_introduced) {
push @sortkeys, (pack('n', $year) . $variety);
}
my $sortex = Sort::External->new( -mem_threshold => 2**24 );
$sortex->feed(@sortkeys);
$sortex->finish;
my @sorted;
while (defined ($_ = $sortex->fetch)) {
push @sorted, substr($_, 2);
}
If you have a lot of large integers to sort, pack() is considerably faster and more space-efficient than sprintf().
Combining encoding techniques
It is often the case that a combination of techniques yields the most efficient sortkey encoding and decoding. Here is a contrived example:
my %apple_stats = (
"Granny Smith" => {
color => "green",
price => 1.69,
year => 1868,
},
"Pink Lady" => {
color => "pink",
price => 1.99,
year => 1970,
},
"Golden Delicious" => {
color => "yellow",
price => .79,
year => 1890,
},
);
my @sortkeys;
while (my $variety, $stats) = each %apple_stats) {
push @sortkeys, (
$stats->{color} . "\0\0" .
sprintf( "%1.2f", $stats->{price} ) .
pack( 'n', $stats->{year} ) .
"\0\0" . $variety
);
}
my @sorted = sort @sortkeys;
my @sorted_varieties = map { s/.*\0\0//; $_ } @sorted;
sprintf would be a much better choice in this particular case, since it makes it easier to recover the all the fields using unpack:
while (my ($variety, $stats) = each %apple_stats) {
push @sortkeys, (
sprintf( "%-6s%1.2f%04d%-16s",
@{ $stats }{'color','price','year'}, $variety
)
);
}
my @sorted = sort @sortkeys;
for (@sorted) {
my (%result, $variety);
(@result{'color','price','year'}, $variety)
= unpack( 'a6 a4 a4 a16', $_ );
$_ = { $variety => \%result };
}
Extended sorting techniques
Descending sort order using the "bitwise not" operator
You can reverse the order of a lexical sort by flipping all the bits on your sortkeys.
while (my ($variety, $color) = each %colors) {
push @sortkeys, sprintf( "%-6s%-16s", $color, $variety );
}
@sortkeys = map { ~ $_ } @sortkeys;
my @sorted = sort @sortkeys;
@sorted = map { ~ $_ } @sorted;
@sorted = map { substr($_, 5) } @sorted;
Multi-field sorting with mixed asc/desc order
Multi-field GRT sorts are usually as straightforward as concatenating the fields using either sprintf or null-termination. However, if you need both ascending and descending sub-sorts within the same sort, special care is required. Null-termination is generally not an option, because bit-flipped fields may contain null bytes.
# sort by:
# color descending,
# price ascending,
# year descending
while (my $variety, $stats) = each %apple_stats) {
push @sortkeys, (
(~ sprintf( "%-6s", stats->{color} ) ) .
sprintf( "%1.2f", $stats->{price} ) .
(~ sprintf( "%04d", stats->{year} ) ) .
sprintf( "%-16s", $variety )
);
}
my @sorted = sort @sortkeys;
Case-insensitive sorting
Case-insensitive sorting is something of a weakness for the GRT. The only way to avoid a sortsub is to double-encode:
while (my ($variety, $color) = each %colors) {
push @sortkeys, ( lc($variety) . "\0\0" . $variety );
push @sortkeys, ( lc($color) . "\0\0" . $color );
}
my @sorted = sort @sortkeys;
@sorted = map { s/.*?\0\0//; $_ } @sorted;
# @sorted = (
# "Golden Delicious",
# "Granny Smith",
# "green",
# "pink",
# "Pink Lady",
# "yellow",
# );
A challenge
Ken Clarke has identified a case where there seems to be no elegant GRT solution:
# Each row contains multiple fields
# AND is impossible to know the field lengths in advance
# AND there are both ascending and descending subsorts
If you can think of a way to perform this sort without a characterization pass, please write the author.
ACKNOWLEDGEMENTS
This document arose out of discussions with Ken Clarke.
Uri Guttman and Larry Rosler, as mentioned above.
SEE ALSO
AUTHOR
Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com
COPYRIGHT
Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.