NAME

Sort::External - Sort huge lists

VERSION

0.01

WARNING

This is ALPHA release software. The interface may change. However, it's simple enough that it probably won't stay in alpha very long.

SYNOPSIS

my $sortex = Sort::External->new;
while (<HUGEFILE>) {
    $sortex->feed( $_ );
}
$sortex->finish;
while (my $stuff = $sortex->fetch) {
    &do_stuff_with( $stuff );
}

DESCRIPTION

Use Sort::External when you have a collection which is too large to sort in memory. Most often you will feed a file to Sort::External line by line, but there's nothing to stop you from feeding it items created on the fly.

All items must be terminated, typically by whatever your system uses for line endings. If you're reading from a file, they're presumably already terminated; otherwise, make sure you terminate each one.
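For items created on the fly, one way to guarantee termination is to append the separator only when it is missing. The following is a minimal sketch; the helper name terminate() is hypothetical and not part of Sort::External, and it assumes a newline separator unless told otherwise.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Sort::External): append the
# separator to an item only when the item does not already end with it.
sub terminate {
    my ( $item, $sep ) = @_;
    $sep = "\n" unless defined $sep;
    return $item =~ /\Q$sep\E\z/ ? $item : $item . $sep;
}

# A mix of terminated and unterminated items.
my @items = map { terminate($_) } ( "pear", "apple\n", "fig" );
print @items;
```

Each element of @items could then be passed to feed() without risk of two items running together in a sortfile.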

METHODS

new()

my $sortscheme = sub { $Sort::External::b <=> $Sort::External::a };
my $sortex = Sort::External->new(
    -sortsub         => $sortscheme,      # default sort: standard lexical
    -working_dir     => $temp_directory,  # default: see below
    -line_separator  => $special_string,  # default: $/
    -cache_size      => 100_000,          # default: 10_000
    );

Construct a Sort::External object.

-sortsub

A sorting subroutine. Be advised that you MUST use $Sort::External::a and $Sort::External::b instead of $a and $b in your sub.
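The reason for this requirement is that Perl's sort() sets $a and $b as package globals in the package where the sort() call was compiled, not in the package where the comparison sub was written. The sketch below illustrates this with a hypothetical stand-in package, My::Sorter, which is an assumption for demonstration only and says nothing about how Sort::External is actually written.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a module that sorts on the caller's behalf.
package My::Sorter;

sub sort_with {
    my ( $sortsub, @items ) = @_;
    # This sort() is compiled in package My::Sorter, so Perl sets
    # $My::Sorter::a and $My::Sorter::b, not $main::a and $main::b.
    return sort $sortsub @items;
}

package main;

# The comparison sub lives in main, so it must name the sorter's
# package variables explicitly.
my $descending = sub { $My::Sorter::b <=> $My::Sorter::a };
my @sorted = My::Sorter::sort_with( $descending, 3, 10, 1 );
print "@sorted\n";    # prints "10 3 1"
```

A sub written with plain $a and $b in package main would compare $main::a and $main::b, which the sort() inside the other package never sets.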

-working_dir

The directory where the temporary sortfiles will reside. By default, this directory is created using File::Temp's tempdir() function.

-line_separator

The delimiter which terminates every item. See the perlvar documentation for $/.
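As a reminder of how $/ behaves, the sketch below reads records delimited by a pipe character from an in-memory filehandle. The data and the choice of '|' are illustrative assumptions; the point is that each record read this way keeps its terminator, which is the form in which items are expected.

```perl
use strict;
use warnings;

# Illustrative data: three records, each terminated by '|'.
my $data = "b|a|c|";
open( my $fh, '<', \$data ) or die "can't open in-memory file: $!";

my @records;
{
    # Change the input record separator only inside this block.
    local $/ = '|';
    @records = <$fh>;
}
close $fh;

# Every record still carries its trailing '|'.
print join( ',', sort @records ), "\n";
```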

-cache_size

The size of each of Sort::External's caches, measured in sortable items. Set this higher for faster performance, but not so high that Perl exhausts physical RAM and starts swapping to virtual memory.

feed()

$sortex->feed( @items );

Feed one or more sortable items to your Sort::External object. It is normal for occasional pauses to occur as sortfiles are merged.

finish()

$sortex->finish( -outfile => 'sorted.txt' );
### or, if you intend to call fetch...
$sortex->finish; 

Prepare to output items in sorted order.

If you specify the parameter -outfile, Sort::External will attempt to write your sorted list to that outfile (it will croak() if the file already exists).

Note that you can either have finish() write to an outfile, or call finish() and then fetch(), but not both.

fetch()

while (my $stuff = $sortex->fetch) {
    &do_stuff_with( $stuff );
}

Fetch the next sorted item.

DISCUSSION

"internal" vs. "external" sorting

In the CS world, "internal sorting" refers to sorting data in RAM, while "external sorting" refers to sorting data which is stored on disk, tape, or any storage medium except RAM. The main goal when implementing an external sorting algorithm is to minimize disk I/O. Sort::External's routine can be summarized like so:

Cache sortable items in memory. Every X items, sort the cache and empty it into a temporary sortfile. As sortfiles accumulate, interleave them periodically into larger sortfiles. Use caching extensively during the interleaving process to minimize disk I/O. Complete the sort by emptying the input cache then interleaving the contents of all existing sortfiles into an output stream.
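The routine summarized above can be sketched in a few dozen lines. This is a deliberately simplified illustration, not Sort::External's actual implementation: it assumes lexical order and newline-terminated items, performs a single merge at the end instead of interleaving periodically, and does no read caching. The names feed_item(), flush_cache() and merge_runs(), and the tiny cache size, are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

my $CACHE_SIZE = 3;    # tiny on purpose, so the example spills to disk
my ( @cache, @runs );

# Sort the in-memory cache and empty it into one temporary sortfile.
sub flush_cache {
    return unless @cache;
    my $fh = tempfile();    # scalar context: returns only the filehandle
    print {$fh} sort @cache;
    seek( $fh, 0, 0 ) or die "seek failed: $!";    # rewind for reading
    push @runs, $fh;
    @cache = ();
}

# Accept items, flushing to disk every $CACHE_SIZE items.
sub feed_item {
    push @cache, @_;
    flush_cache() if @cache >= $CACHE_SIZE;
}

# Empty the input cache, then interleave all sortfiles by repeatedly
# emitting the smallest item at the head of any sortfile.
sub merge_runs {
    flush_cache();
    my @heads = map { scalar readline($_) } @runs;
    my @out;
    while (1) {
        my $min;
        for my $i ( 0 .. $#heads ) {
            next unless defined $heads[$i];
            $min = $i if !defined($min) || $heads[$i] lt $heads[$min];
        }
        last unless defined $min;    # every sortfile is exhausted
        push @out, $heads[$min];
        $heads[$min] = readline( $runs[$min] );
    }
    return @out;
}

feed_item("$_\n") for qw( pear apple fig kiwi banana cherry date );
my @sorted = merge_runs();
print @sorted;
```

With seven items and a cache of three, the sketch writes three sortfiles and merges them into one sorted stream. A real implementation would return items one at a time instead of collecting them in an array.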

BUGS

Please report any bugs or feature requests to bug-sort-external@rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Sort-External.

AUTHOR

Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com

COPYRIGHT

Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.