NAME
File::MergeSort - Mergesort ordered files.
SYNOPSIS
use File::MergeSort;
# Create the MergeSort object.
my $sort = File::MergeSort->new(
[ $file_1, ..., $file_n ], # Anonymous array of input files
\&extract_function, # Sub to extract merge key
);
# Retrieve the next line for processing
my $line = $sort->next_line;
print $line, "\n";
# Dump remaining records in sorted order to a file.
$sort->dump( $file ); # Omit $file to default to STDOUT
DESCRIPTION
File::MergeSort provides methods to merge and process a number of pre-sorted files into a single sorted output.
Merge keys are extracted from the input lines using a user-defined subroutine. Comparisons on the keys are done lexicographically.
Plaintext and compressed (.z or .gz) files are catered for. IO::Zlib is used to handle compressed files.
File::MergeSort is intended as a straightforward solution for situations where one wishes to merge data files whose records are already sorted. An example might be application server logs from a cluster, each of which records events chronologically.
POINTS TO NOTE
Comparisons on the merge keys are carried out lexicographically. The user should ensure that the subroutine used to extract merge keys formats the keys if required so that they sort correctly.
Note that earlier versions (< 1.06) of File::MergeSort performed numeric, not lexicographical, comparisons.
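As a minimal sketch of why key formatting matters, the following compares unpadded and zero-padded numeric keys under string comparison. The subroutine name padded_key, the sample lines, and the assumption that each line starts with an unsigned integer are illustrative only and are not part of File::MergeSort.

```perl
use strict;
use warnings;

# Hypothetical extractor: assumes each line starts with an unsigned integer.
# Zero-padding to a fixed width makes string order agree with numeric order.
sub padded_key {
    my ($line) = @_;
    my ($id) = $line =~ /^(\d+)/ or return '';
    return sprintf '%010d', $id;
}

my @lines = ( "9 nine\n", "10 ten\n", "100 hundred\n" );

# Unpadded keys compared as strings: "10" and "100" sort before "9".
my @raw = sort { $a cmp $b } map { /^(\d+)/ ? $1 : '' } @lines;

# Padded keys compared as strings: the order matches the numeric order.
my @padded = sort { $a cmp $b } map { padded_key($_) } @lines;

print "raw:    @raw\n";
print "padded: @padded\n";
```

A subroutine like padded_key could be passed as the key extractor wherever the input is ordered numerically, provided the chosen width is large enough for every key in the data.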
DETAILS
The user is expected to supply a list of file pathnames and a function to extract an index value from each record line (the merge key).
By calling the "next_line" or "dump" method, the user can retrieve the records in an ordered manner.
As arguments, MergeSort takes a reference to an anonymous array of file paths/names and a reference to a subroutine that extracts a merge key from a line.
The array lists the files to be merged; the subroutine determines the order in which their records are output.
MergeSort opens each file using IO::File, or IO::Zlib for compressed files. Compressed and uncompressed files can be mixed freely: any file with a .z or .gz extension is treated as compressed.
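The suffix rule described above can be sketched as follows. This is not the module's own code: handle_class_for is a hypothetical helper, and the case-sensitive match is an assumption, since the text does not say how upper-case suffixes are treated.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of File::MergeSort: reports which IO class
# a path would be routed to under the .z/.gz suffix rule.
# Assumption: the suffix match is case-sensitive.
sub handle_class_for {
    my ($path) = @_;
    return $path =~ /\.(?:gz|z)$/ ? 'IO::Zlib' : 'IO::File';
}

for my $path (qw( app.log app.log.gz archive.z )) {
    printf "%-12s => %s\n", $path, handle_class_for($path);
}
```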
When passed a line (a scalar, passed as the first and only argument, $_[0]) from one of the files, the user-supplied subroutine must return the merge key for the line.
The records are then output in ascending order based on the merge keys returned by the user supplied subroutine. A stack is created based on the merge keys returned by the subroutine.
When the next_line method is called, File::MergeSort returns the line with the lowest merge key/value. File::MergeSort then replenishes the stack: it reads a new line from the corresponding file and places it in the proper position for the next call to next_line.
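The stack mechanism described above can be sketched in plain Perl as follows. This is an illustration under stated assumptions, not the module's implementation: merge_sorted and the key/line/fh hash fields are hypothetical names, the inputs are in-memory filehandles rather than real files, and the stack is kept ordered with a simple linear insertion.

```perl
use strict;
use warnings;

# Sketch only: merges pre-sorted inputs by keeping one pending record per
# input in a list ordered by merge key (compared as strings).
sub merge_sorted {
    my ( $handles, $key_of ) = @_;

    # Prime the stack with the first line of every non-empty input.
    my @stack;
    for my $fh (@$handles) {
        my $line = <$fh>;
        push @stack, { key => $key_of->($line), line => $line, fh => $fh }
            if defined $line;
    }
    @stack = sort { $a->{key} cmp $b->{key} } @stack;

    my @out;
    while ( my $top = shift @stack ) {
        push @out, $top->{line};

        # Replenish from the input that supplied the record just emitted.
        my $fh   = $top->{fh};
        my $line = <$fh>;
        next unless defined $line;
        my $new = { key => $key_of->($line), line => $line, fh => $fh };

        # Insert before the first pending record with a strictly greater key.
        my $i = 0;
        $i++ while $i < @stack && $stack[$i]{key} le $new->{key};
        splice @stack, $i, 0, $new;
    }
    return @out;
}

# In-memory filehandles stand in for two pre-sorted files.
my @data    = ( "a1\nc1\ne1\n", "b2\nd2\n" );
my @handles = map { open my $fh, '<', \$_ or die $!; $fh } @data;

# The merge key here is the first character of each line.
my @merged = merge_sorted( \@handles, sub { substr $_[0], 0, 1 } );
print @merged;
```

Only one record per input is held in memory at a time, which is why the inputs must already be sorted by the same key: the sketch never looks further ahead than the next unread line of each input.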
If a simple merge is required, without any user processing of each line read from the input files, the dump method can be used to read and merge the input files into the specified output file, or to STDOUT if no file is specified.
CONSTRUCTOR
- new( ARRAY_REF, CODE_REF );
-
Create a new File::MergeSort object. There are two required arguments:
ARRAY_REF: A reference to an array of files to read from. These files can be either plaintext or compressed. Any file with a .gz or .z suffix will be opened using IO::Zlib.
CODE_REF: A code reference. When called, the coderef should return the merge key for a line, which is given as the only argument to that subroutine/coderef.
METHODS
- next_line( );
-
Returns the next line from the merged input files.
- dump( [ FILENAME ] );
-
Reads and merges from the input files to FILENAME, or STDOUT if FILENAME is not given, until all files have been exhausted.
EXAMPLES
# This program reads the files found in logfiles/ and returns their
# records sorted by the date, which appears in mm-dd-yyyy format
use File::MergeSort;
# The constructor expects an array reference, so wrap the list in [ ].
my $files = [ qw( logfiles/log_server_1.log
logfiles/log_server_2.log
logfiles/log_server_3.log
) ];
my $sort = File::MergeSort->new( $files, \&index_sub );
while (my $line = $sort->next_line) {
# some operations on $line
}
sub index_sub {
# Use this to extract a date of the form mm-dd-yyyy.
my $line = shift;
# Make sure the pattern matches only the date.
# Lines with no date get an empty key here, which sorts first.
my ( $mm, $dd, $yyyy ) = $line =~ /(\d{2})-(\d{2})-(\d{4})/
or return '';
return "$yyyy$mm$dd"; # Key is a fixed-width string, yyyymmdd
# Lower keys will be read first.
}
# This slightly more compact example performs a simple merge of
# several input files with fixed width merge keys into a single
# output file.
use File::MergeSort;
my $files = [ qw( input_1 input_2 input_3 ) ]; # Array reference of input files
my $extract = sub { substr($_[0], 15, 10 ) }; # To substr merge key out of line
my $sort = File::MergeSort->new( $files, $extract );
$sort->dump( "output_file" );
TODO
+ Implement a generic test/comparison function to replace text/numeric comparison.
+ Implement a configurable record separator.
+ Allow for optional deletion of duplicate entries.
EXPORTS
Nothing. OO interface. See CONSTRUCTOR and METHODS.
AUTHOR
Chris Brown: <chris.brown@cal.berkeley.edu>
Copyright(c) 2003 Christopher Brown.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
perl, IO::File, IO::Zlib, Compress::Zlib.
File::Sort as an alternative.