NAME

File::FlexSort - Perl extension for merging and processing data distributed over ordered files.

SYNOPSIS

File::FlexSort provides an easy way to merge, parse, process and analyze data that is spread across many files, each of which is already in order. Because FlexSort takes advantage of the existing order, processing should be both quick and frugal with memory.

  use File::FlexSort;
  
  my $sort = File::FlexSort->new(
                \@file_list,              # Reference to an array of file paths
                \&index_extract_function  # Reference to the index subroutine
  );


  my $line = $sort->next_line;  # Retrieves the next line for processing
  print "$line\n";

  $sort->dump( [file] );        # Dumps remaining records in sorted order
                                # to a file.  Default: <STDOUT>

DESCRIPTION

File::FlexSort is intended as a straightforward solution for situations where one wishes to merge data files whose records are already ordered. An example might be application server logs from a cluster, each of which records events chronologically. If you want to examine, process or merge several such files while retaining the chronological order, then FlexSort is for you.

Here's how it works ...

As arguments, FlexSort takes a reference to an array of file paths and a reference to a subroutine that extracts an index value. The array lists the files to be merged; the subroutine determines the sort order. When passed a line (i.e. a scalar) from one of the files, the user-supplied subroutine must return a numeric index value associated with that line. The records are then returned in ascending order of their index values. In the future, File::FlexSort will likely become more flexible to live up to its name.

More detail ...

For each file FlexSort opens an IO::File or IO::Zlib object. (FlexSort handles a mix of compressed and uncompressed files seamlessly by checking for a .z or .gz extension.) Initially, the first line of each file is indexed according to the subroutine. A stack is created based on these values.
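The handle selection described above might look roughly like the following. This is a sketch under stated assumptions, not the module's actual code: open_handle is a hypothetical helper name, and the case-insensitive extension test is an assumption.

```perl
use strict;
use warnings;
use IO::File;

# Hypothetical helper (not part of File::FlexSort's API): return a read
# handle for $path, choosing the handle class from the file extension.
sub open_handle {
    my ($path) = @_;

    my $fh;
    if ( $path =~ /\.(?:z|gz)$/i ) {    # assumption: case-insensitive match
        require IO::Zlib;               # loaded only for compressed files
        $fh = IO::Zlib->new( $path, 'rb' );
    }
    else {
        $fh = IO::File->new( $path, 'r' );
    }

    die "Cannot open $path\n" unless $fh;
    return $fh;
}
```

Both classes provide a getline method, so the caller can read either kind of file in the same way.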

When the method 'next_line' is called, FlexSort returns the line with the lowest index value. FlexSort then replenishes the stack: it reads a new line from the corresponding file and places it in the proper position for the next call to 'next_line'.
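The stack-and-replenish cycle can be sketched in a few lines of plain Perl. This is an illustration only, not the module's implementation: make_merger is a hypothetical name, and in-memory arrays of lines stand in for open files so that the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical sketch: each source is an array ref of already-ordered
# lines, standing in for one open file.
sub make_merger {
    my ( $sources, $index_of ) = @_;

    # The "stack": one [index, line, source] entry per non-empty source,
    # kept sorted so that the lowest index is always first.
    my @stack;
    for my $src ( map { [@$_] } @$sources ) {    # copy; leave caller's data alone
        next unless @$src;
        my $line = shift @$src;
        push @stack, [ $index_of->($line), $line, $src ];
    }
    @stack = sort { $a->[0] <=> $b->[0] } @stack;

    return sub {
        my $top = shift @stack or return;    # undef once every source is drained
        my ( undef, $line, $src ) = @$top;

        # Replenish from the source that supplied the returned line.
        if (@$src) {
            my $new_line = shift @$src;
            my $entry    = [ $index_of->($new_line), $new_line, $src ];

            # Insert before the first entry whose index is not smaller.
            # On ties the new entry goes first, so the same source keeps
            # being read while its index is still the lowest.
            my $pos = 0;
            $pos++ while $pos < @stack && $stack[$pos][0] < $entry->[0];
            splice @stack, $pos, 0, $entry;
        }
        return $line;
    };
}

# Usage with made-up data: the index is the leading number on each line.
my $next_line = make_merger(
    [ [ '1 a', '4 d' ], [ '2 b', '3 c' ] ],
    sub { ( split ' ', $_[0] )[0] },
);
while ( defined( my $line = $next_line->() ) ) {
    print "$line\n";    # lines come out in index order: 1, 2, 3, 4
}
```

Only one line per source is held at a time, which is why a merge of this kind stays frugal with memory however large the inputs are.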

Additional Notes:

- A stable sort is implemented, i.e. a single file is read until its index is no longer the lowest value.

- If the file ends in .z or .gz then the file is opened with IO::Zlib instead.

EXAMPLE

   # This program looks at the files found
   # in /logfiles and returns their records
   # sorted by date; dates are in mm-dd-yyyy
   # format.

  use File::Recurse;
  use File::FlexSort;

  my @files;
  recurse { push(@files, $_) } "/logfiles";

  my $fs = File::FlexSort->new(\@files, \&index_sub);

  while (my $line = $fs->next_line) {
      # ... some operations on $line ...
  }


  sub index_sub {

    # Use this to extract a date of
    # the form mm-dd-yyyy.

    my $line = shift;

    # Make sure the pattern matches only
    # the intended date.
    $line =~ /(\d{2})-(\d{2})-(\d{4})/;

    return "$3$1$2";  # Index is an integer, yyyymmdd.
                      # Lower numbers will be read first.

  }
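Putting the year first is what makes the numeric comparison chronological. The following check illustrates this with a hypothetical date_index helper and made-up sample lines; unlike index_sub, it returns undef when no date is found, which is an assumption of this sketch and not documented FlexSort behavior.

```perl
use strict;
use warnings;

# Hypothetical helper: same extraction as index_sub above, but returns
# undef when the line contains no mm-dd-yyyy date.
sub date_index {
    my ($line) = @_;
    return unless $line =~ /(\d{2})-(\d{2})-(\d{4})/;
    return "$3$1$2";    # yyyymmdd
}

# Made-up sample lines, for illustration only.
my $older = date_index('12-31-2001 server stopped');    # 20011231
my $newer = date_index('01-01-2002 server started');    # 20020101

print "year-first index orders correctly\n" if $older < $newer;

# A month-first index (mmddyyyy) would put December 2001 after January 2002.
print "month-first index would not\n" if 12312001 > 1012002;
```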
        

TODO

        Implement a generic test/comparison function to replace text/numeric comparison.
        Implement a configurable record separator.
        Allow for optional deletion of duplicate entries.

EXPORT

None by default.

AUTHOR

Chris Brown, <chris.brown@cal.berkeley.edu>

Copyright (c) 2002 Christopher Brown. All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the license distributed with Perl. Not intended for evil purposes. Yadda, yadda, yadda ...

SEE ALSO

perl, IO::File, IO::Zlib, Compress::Zlib.