NAME

Data::Range::Compare::Stream::Iterator::File::MergeSortAsc - On Disk Merge Sort for really big data sets!

SYNOPSIS

use Data::Range::Compare::Stream;
use Data::Range::Compare::Stream::Iterator::File;
use Data::Range::Compare::Stream::Iterator::File::MergeSortAsc;

my $iterator=Data::Range::Compare::Stream::Iterator::File::MergeSortAsc->new(
  filename=>'somefile.csv',
);

while($iterator->has_next) {
  my $next_range=$iterator->get_next;
  print $next_range,"\n";
}

DESCRIPTION

This module extends Data::Range::Compare::Stream::Iterator::Base and provides an on-disk merge sort for objects that implement or extend Data::Range::Compare::Stream::Iterator::Base.
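As a rough illustration of the general on-disk merge sort idea (not this module's actual implementation), the sketch below reads "start end" lines in fixed-size buckets, writes each sorted bucket to a temporary file with File::Temp, and then merges the spooled files. The names cmp_ranges, spool_bucket, read_range and external_sort are hypothetical, and the start-then-end ascending ordering is an assumption made for the example.

```perl
use strict;
use warnings;
use File::Temp;

# Hypothetical comparator: order [start,end] pairs by start, then by end.
# This ordering is an assumption for the sketch, not the module's own.
sub cmp_ranges {
  my ($left,$right)=@_;
  return $left->[0] <=> $right->[0] || $left->[1] <=> $right->[1];
}

# Sort one bucket in memory, write it to a temp file, rewind the file.
sub spool_bucket {
  my ($bucket)=@_;
  my $tmp=File::Temp->new;
  print {$tmp} join(' ',@$_),"\n" for sort { cmp_ranges($a,$b) } @$bucket;
  $tmp->flush;
  seek($tmp,0,0);
  return $tmp;
}

# Read the next [start,end] pair from a spooled file.
# Returns an explicit undef at end of file so list-context callers get one value.
sub read_range {
  my ($fh)=@_;
  my $line=<$fh>;
  return undef unless defined $line;
  chomp $line;
  return [split /\s+/,$line];
}

sub external_sort {
  my ($lines,$bucket_size)=@_;
  my (@spooled,@bucket);

  # Phase 1: pre-sort fixed-size buckets and spool each one to disk.
  for my $line (@$lines) {
    push @bucket,[split /\s+/,$line];
    if(@bucket>=$bucket_size) {
      push @spooled,spool_bucket([@bucket]);
      @bucket=();
    }
  }
  push @spooled,spool_bucket([@bucket]) if @bucket;

  # Phase 2: merge by repeatedly taking the smallest head among the files.
  my @heads=map { read_range($_) } @spooled;
  my @sorted;
  while(1) {
    my $best;
    for my $i (0..$#heads) {
      next unless defined $heads[$i];
      $best=$i if !defined($best) || cmp_ranges($heads[$i],$heads[$best])<0;
    }
    last unless defined $best;
    push @sorted,$heads[$best];
    $heads[$best]=read_range($spooled[$best]);
  }
  return \@sorted;
}

my $sorted=external_sort(["5 9","1 2","3 4","1 1","0 7"],2);
print join(' ',@$_),"\n" for @$sorted;
```

Only one bucket plus one head per spooled file is held in memory at a time, which is what makes this approach suitable for inputs that do not fit in memory. For brevity the sketch collects the merged output in an array; a real implementation would write it to a result file instead.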

OO Methods

  • my $iterator=new Data::Range::Compare::Stream::Iterator::File::MergeSortAsc(key=>value);

    Instance constructor. Arguments are passed as key=>value pairs.

    At least one of the following arguments is required:

    filename=>'source_file.csv'  
      # the value is assumed to be an absolute or relative path to the file.
    
    file_list=>[]
      # An array ref of file names in absolute or relative paths
        
    iterator_list=>[]
      # an array ref of objects that implement or extend Data::Range::Compare::Stream::Iterator::Base

    Optional Arguments:

     auto_prepare=>0|1
       # Default: 0, If set to 1 sort operations happen on object creation.
    
     unlink_result_file=>1|0
       # Default: 1, If set to 0 the sorted result file will not be deleted
    
     bucket_size=>4000
       # sets the number of ranges to be pre-sorted
       # 2 buckets are created, so the number of objects loaded into memory is bucket_size * 2
    
     NEW_ITERATOR_FROM=>'Data::Range::Compare::Stream::Iterator::File'
       # sets the file iterator object to be used when loading spooled files for merging
       # make sure you load or require the object class being passed in as an argument!
    
     NEW_ARRAY_ITERATOR_FROM=>'Data::Range::Compare::Stream::Iterator::Array'
       # sets the array iterator class
    
     NEW_FROM=>'Data::Range::Compare::Stream',
       # deprecated but still supported, see factory_instance.
       # sets the object class new ranges will be created from
       # This argument is passed to objects being constructed from: NEW_ITERATOR_FROM
    
     factory_instance =>$obj
       # defines the object that implements $obj->factory($start,$end,$data).
       # new ranges are constructed through the factory interface.  If no factory instance
       # is provided, an instance of Data::Range::Compare::Stream is assumed.
    
    
     parse_line=>undef|code_ref
       # Default: undef, Sets the code ref to be used when parsing a line
       # if not set the default internals will be used
       # This argument is passed to objects being constructed from: NEW_ITERATOR_FROM
    
     result_to_line=>undef|code_ref
       # Default: undef, Sets the code ref used to convert a result to a line that can be parsed
       # if not set the default internals will be used
       # This argument is passed to objects being constructed from: NEW_ITERATOR_FROM
    
     sort_func=>undef|code_ref
       # Default: undef, Sets the code ref used for comparing objects in the sort process
       # if not set the default internals are used.
    
     tmpdir=>undef|'/some/folder'
       # Default: undef, If tmpdir is defined its value is passed to File::Temp->new(DIR=>$self->{tmpdir});
  • my $class=$iterator->NEW_FROM;

    Returns the Class that new Range objects are constructed from.

  • my $class=$iterator->NEW_ITERATOR_FROM;

    $class will contain the name of the class new file Iterators are to be constructed from.

  • my $class=$iterator->NEW_ARRAY_ITERATOR_FROM;

    $class will contain the name of the class new array Iterators are constructed from.

  • while($iterator->has_next) { ... }

    Returns true when there are more rows to fetch.

  • my $result=$iterator->get_next;

    Returns the next $result from the given source file.

  • my $line=$iterator->result_to_line($range);

    Given a $result from $iterator->get_next, converts it into a line that can be parsed by $iterator->parse_line($line). Think of this method as a serializer for the range objects produced by an $iterator object. When overloading this method or supplying a callback, make sure the line it produces can be parsed by parse_line.

    sub result_to_line {
      my ($self,$result)=@_;
      return $self->{result_to_line}->($result) if defined($self->{result_to_line});
    
      my $range=$result->get_common;
      my $line=$range->range_start_to_string.' '.$range->range_end_to_string."\n";
      return $line;
    }
  • my $ref=$iterator->parse_line($line);

    Given a $line, returns an array reference containing the arguments required to construct an object that extends or implements Data::Range::Compare::Stream. When overloading this method or supplying a callback through the constructor, make sure parse_line accepts the lines that result_to_line produces.

    sub parse_line {
      my ($self,$line)=@_;
      return $self->{parse_line}->($line) if defined($self->{parse_line});
      chomp $line;
      [split /\s+/,$line];
    }
  • my $cmp=$iterator->sort_method($left_range,$right_range);

    This is the internal object compare function used when sorting.

    sub sort_method {
      my ($self,$left_range,$right_range)=@_;
      
      return $self->{sort_func}->($left_range,$right_range) if $self->{sort_func};
      my $cmp=sort_in_consolidate_order_asc($left_range->get_common,$right_range->get_common);
    
      return $cmp;
    }
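The parse_line, result_to_line and sort_func callbacks described above can be sketched for a comma-separated file. This is a minimal illustration under stated assumptions: the "start,end" layout is invented for the example, and the callbacks operate on a plain [start,end] array ref instead of a real result object so that the round trip can be run without the module installed. With the real module, result_to_line and sort_func receive result objects, so the range would first be fetched with get_common as in the default implementations shown above.

```perl
use strict;
use warnings;

# parse_line callback: receives one line and returns an array ref of
# constructor arguments (here: start and end).
my $parse_line=sub {
  my ($line)=@_;
  chomp $line;
  return [split /,/,$line];
};

# Stand-in for a result_to_line callback. Hypothetical: it takes a plain
# [start,end] array ref rather than a result object.
my $result_to_line=sub {
  my ($range)=@_;
  return join(',',@$range)."\n";
};

# Stand-in for a sort_func callback: receives ($left,$right) and returns a
# negative, zero or positive number. The start-then-end ascending order is
# an assumption made for this sketch.
my $sort_func=sub {
  my ($left,$right)=@_;
  return $left->[0] <=> $right->[0] || $left->[1] <=> $right->[1];
};

# Round trip: whatever result_to_line emits must be readable by parse_line.
my $line=$result_to_line->([10,20]);
my $args=$parse_line->($line);
print "start=$args->[0] end=$args->[1]\n";

# Sorting parsed lines with the comparison callback.
my @ranges=sort { $sort_func->($a,$b) } map { $parse_line->($_) } "7,9\n","1,4\n","1,2\n";
print $result_to_line->($_) for @ranges;
```

Keeping the two line callbacks as an inverse pair matters because the sort writes ranges to spooled files with result_to_line and reads them back with parse_line while merging.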

SEE ALSO

Data::Range::Compare::Stream::Cookbook

AUTHOR

Michael Shipper

Source-Forge Project

As of version 0.001 the Project has been moved to Source-Forge.net

Data Range Compare https://sourceforge.net/projects/data-range-comp/

COPYRIGHT

Copyright 2011 Michael Shipper. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.