NAME

Sort::External - sort huge lists

SYNOPSIS

my $sortex = Sort::External->new( -mem_threshold => 2**24 );
while (<HUGEFILE>) {
    $sortex->feed( $_ );
}
$sortex->finish;
while (defined($_ = $sortex->fetch)) {
    &do_stuff_with( $_ );
}

DESCRIPTION

Problem: You have a list which is too big to sort in-memory.

Solution: "feed, finish, and fetch" with Sort::External, the closest thing to a drop-in replacement for Perl's sort() function when dealing with unmanageably large lists.

Where's the sortex() function?

In a perfect world, Sort::External would export a sortex() function that you could swap with sort() and be done. That isn't possible, because it would have to return a list which would, in all likelihood, be too large to fit in memory -- otherwise, why use Sort::External in the first place?

How it works

Cache sortable items in memory. Periodically sort the cache and empty it into a temporary sortfile. As sortfiles accumulate, interleave them into larger sortfiles. Complete the sort by sorting the input cache and any existing sortfiles into an output stream.

In the CS world, "internal sorting" refers to sorting data in RAM, while "external sorting" refers to sorting data which is stored on disk, tape, punchcards, or any storage medium except RAM -- hence, this module's name.

Note that if Sort::External hasn't yet flushed the cache to disk when finish() is called, the whole operation completes in-memory.

undefs and refs

Perl's sort() function sorts undef values to the front of a list -- it will complain if you have warnings enabled, but it preserves their undef-ness. sort() also preserves references. In contrast, Sort::External's behavior is unpredictable and almost never desirable when you feed it either undefs or refs. If you really care about sorting lists containing undefs or refs, you'll have to symbollically replace and restore them yourself.

Subtle changes to scalars e.g. utf8 flags get stripped

Once the input cache grows large enough, Sort::External writes items to disk and throws them away, only to recreate them later by reading back from disk. Provided that the sort does not complete in-memory, the stringified return scalars you get back from Sort::External will have changed in one subtle respect from their precursors: if they were tagged as utf8 before, they won't be now. (There are other subtle changes, but they don't matter unless you're working at the perlguts level, in which case you know what to expect.)

Memory management

Sort::External functions best when it can accumulate a large input cache before sorting the cache and flushing it to disk. For backwards compatibility, the default trigger for a buffer flush is the accumulation of 10,000 array items, which may be too high if your items are large. However, starting at version 0.10, Sort::External implements -mem_threshold, an *experimental* feature which makes it possible to flush to disk when the amount of memory consumed by the input cache exceeds a certain number of bytes.

There are two important caveats to keep in mind about -mem_threshold. First, Sort::External uses an extremely crude algorithm to assess memory consumption: the length of each scalar in bytes, plus an extra 15 per item to account for the overhead of a typical scalar. This is by no means accurate, but it provides a meaningful level of control. Second, the amount of memory consumed by the Sort::External process will be no less than double what you set for -mem_threshold, and probably considerably more. Don't get too aggressive, because the only time a very high -mem_threshold provides outsized benefits is when it's big enough that it prevents a disk flush -- in which case, you might just want to use sort().

METHODS

new()

my $sortscheme = sub { $Sort::External::b <=> $Sort::External::a };
my $sortex = Sort::External->new(
    -sortsub         => $sortscheme,      # default sort: standard lexical
    -working_dir     => $temp_directory,  # default: see below
    -mem_threshold   => 2 ** 24,          # default: 0 (inactive)
    -cache_size      => 100_000,          # default: 10_000
);

Construct a Sort::External object.

  • -sortsub -- A sorting subroutine. Be advised that you MUST use $Sort::External::a and $Sort::External::b instead of $a and $b in your sub. Before deploying a sortsub, consider using a packed-default sort instead, as described in the Sort::External::Cookbook. It's probably faster.

  • -working_dir -- The directory where the temporary sortfiles will reside. By default, this directory is created using File::Temp's tempdir() command.

  • -mem_threshold -- EXPERIMENTAL FEATURE. Allow the input cache to grow to -mem_threshold bytes before sorting it and flushing to disk. Suggested value: Start somewhere between 2**20 and 2**24: 1-16 Mb.

  • -cache_size -- The size for Sort::External's input cache, in sortable items. If -mem_threshold is enabled, -cache_size is ignored.

feed()

$sortex->feed( @items );

Feed one or more sortable items to your Sort::External object. It is normal for occasional pauses to occur as caches are flushed and sortfiles are merged.

finish()

# if you intend to call fetch...
$sortex->finish; 

# otherwise....
use Fcntl;
$sortex->finish( 
    -outfile => 'sorted.txt',
    -flags => (O_CREAT | O_WRONLY),
);

Prepare to output items in sorted order.

If you specify the parameter -outfile, Sort::External will attempt to write your sorted list to that location. By default, Sort::External will refuse to overwrite an existing file; if you want to override that behavior, you can pass Fcntl flags to finish() using the optional -flags parameter.

Note that you can either finish() to an -outfile, or finish() then fetch()... but not both.

fetch()

while (defined ($_ = $sortex->fetch)) {
    &do_stuff_with( $_ );
}

Fetch the next sorted item.

BUGS

Please report any bugs or feature requests to bug-sort-external@rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Sort-External.

ACKNOWLEDGEMENTS

Bug reports and patches: Ken Clarke, Chris Nandor.

SEE ALSO

The Sort::External::Cookbook.

File::Sort, File::MergeSort, and Sort::Merge as possible alternatives.

AUTHOR

Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com

COPYRIGHT

Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.