NAME

File::MergeSort - Performs a merge sort on ordered data files.

SYNOPSIS

use File::MergeSort;

my $sort = File::MergeSort->new(
              \@file_list,             # Reference to an array of paths/files
              \&index_extract_function # Reference to the index subroutine
);

my $line = $sort->next_line;  # Retrieves the next line for processing
print "$line\n";

$sort->dump( [file] );  # Dumps remaining records in sorted order
                        # to a file.            Default: <STDOUT>

DESCRIPTION

File::MergeSort provides an easy way to merge, parse, process and analyze data that is distributed across presorted files, using the well-known merge sort algorithm. The user supplies a list of file paths and then calls either the next_line method or the dump method from the OO interface.

File::MergeSort is a hopefully straightforward solution for situations where one wishes to merge data files whose records are already ordered. An example might be application server logs from a cluster, each of which records events chronologically. If you want to examine, process or merge several such files while retaining the chronological order, then MergeSort is for you.

Here's how it works ...

As arguments, MergeSort takes a reference to an array of file paths and a reference to a subroutine that extracts an index value. The array lists the files to be merged, and the subroutine determines the sort order. When passed a line (i.e. a scalar) from one of the files, the user-supplied subroutine must return a numeric index value associated with the line. The records are then returned in ascending order based on the index values.
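
As a rough illustration of the contract described above, here is a sketch of an index subroutine for lines that begin with a Unix epoch timestamp. The log format, the name epoch_index and the fallback value of 0 are all assumptions made for this example; none of them come from File::MergeSort itself.

```perl
use strict;
use warnings;

# Hypothetical index subroutine: assumes each line starts with a
# Unix epoch timestamp, e.g. "1017619200 GET /index.html".
sub epoch_index {
    my $line = shift;

    # Capture the leading run of digits. Lines without a timestamp
    # get index 0 so that they sort first (an arbitrary choice).
    my ($epoch) = $line =~ /^(\d+)/;
    return defined $epoch ? $epoch : 0;
}

print epoch_index("1017619200 GET /index.html"), "\n";   # prints 1017619200
```

A reference to such a subroutine, \&epoch_index, would then be passed as the second argument to the constructor.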

More detail ...

For each file, MergeSort opens an IO::File or IO::Zlib object. (MergeSort handles mixed compressed and uncompressed files seamlessly by checking for .z or .gz file extensions.) Initially, the first line of each file is indexed according to the subroutine. A stack is created based on these values.

When the method 'next_line' is called, MergeSort returns the line with the lowest index value. MergeSort then replenishes the stack: it reads a new line from the corresponding file and places it in the proper position for the next call to 'next_line'.
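
The take-lowest-then-replenish cycle can be sketched roughly as follows. This is not the module's actual implementation: in-memory arrays stand in for files, and the name merge_sorted and the tie-breaking rule are assumptions made for illustration.

```perl
use strict;
use warnings;

# Illustrative only: merge several pre-sorted lists of lines by repeatedly
# taking the entry with the lowest index. Note that the source arrays are
# consumed (shifted) as the merge proceeds.
sub merge_sorted {
    my ($index_sub, @sources) = @_;

    # The "stack": one [index, line, source] entry per non-empty source,
    # kept in ascending index order.
    my @stack;
    for my $src (@sources) {
        next unless @$src;
        my $line = shift @$src;
        push @stack, [ $index_sub->($line), $line, $src ];
    }
    @stack = sort { $a->[0] <=> $b->[0] } @stack;

    my @merged;
    while (@stack) {
        # Take the line with the lowest index ...
        my $top = shift @stack;
        my (undef, $line, $src) = @$top;
        push @merged, $line;

        # ... then replenish the stack from the same source, if any
        # lines remain there.
        next unless @$src;
        my $next  = shift @$src;
        my $entry = [ $index_sub->($next), $next, $src ];

        # Insert in front of the first entry whose index is not lower.
        # On ties the current source keeps priority (an assumption).
        my $pos = 0;
        $pos++ while $pos < @stack && $stack[$pos][0] < $entry->[0];
        splice @stack, $pos, 0, $entry;
    }
    return @merged;
}

my @a = ("1 a", "4 a", "5 a");
my @b = ("2 b", "3 b", "6 b");
print "$_\n" for merge_sorted(sub { (split ' ', $_[0])[0] }, \@a, \@b);
```

With the two sample lists, the lines come out in the order 1 a, 2 b, 3 b, 4 a, 5 a, 6 b. A linear scan is used to find the insertion point because the number of open sources is usually small; a heap would be the usual alternative for many sources.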

Additional Notes:

- A stable sort is implemented, i.e. a single file is read until its index is no longer the lowest value.
- If the file ends in .z or .gz then the file is opened with IO::Zlib instead.
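
Choosing a handle class by file extension might look something like the sketch below. The helper name open_for_reading is hypothetical, and the exact pattern MergeSort uses to detect compressed files is not shown in this document, so the regular expression here is an assumption.

```perl
use strict;
use warnings;
use IO::File;
use IO::Zlib;

# Hypothetical helper: open a file for reading, using IO::Zlib when the
# name ends in .z or .gz and IO::File otherwise.
sub open_for_reading {
    my $path = shift;

    my $fh = $path =~ /\.(?:z|gz)$/
        ? IO::Zlib->new($path, "rb")    # compressed input
        : IO::File->new($path, "r");    # plain input

    die "Cannot open $path\n" unless $fh;
    return $fh;
}
```

Both classes provide a getline method, so the calling code can read lines without caring which kind of handle it was given.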

EXAMPLE

   # This program looks at the files found
   # in /logfiles and returns their records
   # sorted by date, given in mm-dd-yyyy
   # format.

  use File::Recurse;
  use File::MergeSort;

  my @files;
  recurse { push(@files, $_) } "/logfiles";

  my $fs = File::MergeSort->new(\@files, \&index_sub);

  while (my $line = $fs->next_line) {
    # ... some operations on $line ...
  }


  sub index_sub {

    # Use this to extract a date of
    # the form mm-dd-yyyy.

    my $line = shift;

    # Be cautious that only the date will be
    # extracted.
    $line =~ /(\d{2})-(\d{2})-(\d{4})/;

    return "$3$1$2";  # Index is an integer, yyyymmdd.
                      # Lower numbers will be read first.

  }
	

TODO

Implement a generic test/comparison function to replace text/numeric comparison.
Implement a configurable record separator.
Allow for optional deletion of duplicate entries.

EXPORT

None by default.

AUTHOR

Chris Brown, chris.brown@cal.berkeley.edu

Copyright (c) 2002 Christopher Brown. All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the license distributed with Perl. Not intended for evil purposes. Yadda, yadda, yadda ...

SEE ALSO

perl, IO::File, IO::Zlib, Compress::Zlib.