NAME
Sort::External - sort huge lists
VERSION
0.05
SYNOPSIS
my $sortex = Sort::External->new( -mem_threshold = 2 ** 24 );
while (<HUGEFILE>) {
$sortex->feed( $_ );
}
$sortex->finish;
my $stuff;
while (defined($stuff = $sortex->fetch)) {
&do_stuff_with( $stuff );
}
DESCRIPTION
Problem: You have a list which is too big to sort in-memory. Solution: Use Sort::External, the closest thing to a drop-in replacement for Perl's sort() function when dealing with unmanageably large lists.
Where's the sortex() function?
In a perfect world, Sort::External would export a sortex() function that you could swap with sort() and be done. That isn't possible, because it would have to return a list which would, in all likelihood, be too large to fit in memory -- otherwise, why use Sort::External in the first place?
Replacing sort() with Sort::External
When you replace sort() with the "feed, finish, fetch" cycle of a Sort::External object, there are two things to watch out for.
-line_separator -- Sort::External uses temp files to cache sortable items. If each item is terminated by a newline/CRLF, as would be the case for lines from a text file, then Sort::External has no problem figuring out where items begin and end when reading back from disk. If that's not the case, you need to set a -line_separator, as documented below.
undef values and references -- Perl's sort() function sorts undef values to the front of a list -- it will complain if you have warnings enabled, but it preserves their undef-ness. sort() also preserves references. In contrast, Sort::External's behavior is unpredictable and almost never desirable when you feed it either undef values or references. If you really care about sorting lists containing undefs or refs, you'll have to symbollically replace and restore them yourself.
Memory management
Sort::External functions best when it can accumulate a large input cache before sorting the cache and flushing it to disk. For backwards compatibility reasons, the default threshold is 10,000 array items, which may be too high if your items are large. However, starting at version .10, Sort::External implements -mem_threshold, an *experimental* feature which makes it possible to flush to disk when the amount of memory consumed by the input cache exceeds a certain number of bytes.
There are two important caveats to keep in mind about -mem_threshold. First, Sort::External uses Devel::Size to assess memory consumption, which tends to underestimate real memory consumption somewhat -- set the docs for details. Second, the amount of memory consumed by the Sort::External process will be no less than double what you set for -mem_threshold, and probably considerably more. Don't get too aggressive. A good starting point is 2**24: 16 megabytes.
METHODS
new()
my $sortscheme = sub { $Sort::External::b <=> $Sort::External::a };
my $sortex = Sort::External->new(
-sortsub => $sortscheme, # default sort: standard lexical
-working_dir => $temp_directory, # default: see below
-line_separator => 'random', # default: $/
-mem_threshold => 2 ** 24, # default: 0 (inactive);
-cache_size => 100_000, # default: 10_000;
);
Construct a Sort::External object.
-sortsub -- A sorting subroutine. Be advised that you MUST use $Sort::External::a and $Sort::External::b instead of $a and $b in your sub.
-working_dir -- The directory where the temporary sortfiles will reside. By default, this directory is created using File::Temp's tempdir() command.
-line_separator -- By default, Sort::External assumes that your items are already terminated with a newline/CRLF or whatever your system considers to be a line ending (see the perlvar documentation for $/). If that's not true, you have two options: 1) specify your own value for -line_separator (which Sort::External will append to each item when storing and chomp() away upon retrieval) or 2) specify 'random', in which case Sort::External will use a random 16-byte string suitable for delimiting arbitrary binary data.
-mem_threshold -- EXPERIMENTAL FEATURE. Allow the input cache to grow to -mem_threshold bytes before sorting it and flushing to disk.
-cache_size -- The size for each of Sort::External's caches, in sortable items. If -mem_threshold is enabled, -cache_size is ignored.
feed()
$sortex->feed( @items );
Feed one or more sortable items to your Sort::External object. It is normal for occasional pauses to occur as caches are flushed and sortfiles are merged.
finish()
### if you intend to call fetch...
$sortex->finish;
### otherwise....
use Fcntl;
$sortex->finish(
-outfile => 'sorted.txt',
-flags => (O_CREAT | O_WRONLY),
);
Prepare to output items in sorted order. If you haven't yet exceeded the cache size, Sort::External never writes to disk -- it just sorts the items in-memory.
If you specify the parameter -outfile, Sort::External will attempt to write your sorted list to that outfile. By default, Sort::External will refuse to overwrite an existing file; if you want to override that behavior, you can pass Fcntl flags to finish() using the optional -flags parameter.
Note that you can either finish() to an -outfile, or finish() then fetch()... but not both.
fetch()
while (my $stuff = $sortex->fetch) {
&do_stuff_with( $stuff );
}
Fetch the next sorted item.
DISCUSSION
"internal" vs. "external" sorting
In the CS world, "internal sorting" refers to sorting data in RAM, while "external sorting" refers to sorting data which is stored on disk, tape, or any storage medium except RAM. The main goal when implementing an external sorting algorithm is to minimize disk I/O. Sort::External's routine can be summarized like so:
Cache sortable items in memory. Every X items, sort the cache and empty it into a temporary sortfile. As sortfiles accumulate, interleave them periodically into larger sortfiles. Complete the sort by emptying the input cache then interleaving the contents of all existing sortfiles into an output stream.
BUGS
Please report any bugs or feature requests to bug-sort-external@rt.cpan.org
, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Sort-External.
ACKNOWLEDGEMENTS
Bug reports and patches: Ken Clarke, Chris Nandor.
SEE ALSO
File::Sort, File::MergeSort, and Sort::Merge as possible alternatives.
AUTHOR
Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com
COPYRIGHT
Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.