NAME

List::Insertion - Binary search a list for insertion point

SYNOPSIS

Export a subroutine to locate the index of the first instance (left) of value in a sorted simple scalar list:

# Exports 'search_numeric_left' only
#
use List::Insertion {type=>"numeric", duplicate=>"left"};

# Data has value 30 duplcated

my @data=(10, 20, 30, 30, 40, 50, 60, 70);
#index    0   1   2   3   4   5   6   7

# The insert position will be index 2, which is left of duplicates
#
my $key= 30;
my $pos=search_numeric_left $key, \@data;

if($data[$pos] == $key){
  # Exact match, $pos is the index of the searched $key in the array (left)
}
else{
  # $pos is the index where $key should be inserted
}  

Same as above with string comparison:

use List::Insertion {type=>"string", duplicate=>"left"};

my @data=qw<10 20 30 30 40 50 60 70>;
#index      0  1  2  3  4  5  6  7

my $key= "30";
my $pos=search_string_left $key, \@data;

if($data[$pos] eq $key){
}
else{
}  

Export a subroutine to search for insertion point, of array of hashes, right of duplicates, with accessor to comparison value:

# Data has value=>3 duplcated. Elements are hash refs so use an accessor snippet
#
use List::Insertion {type=>"numeric", duplicate=>"right", accessor=>'->{value}'};
my @data=(
  {value=>1, other=>"a"},
  {value=>2, other=>"b"},
  {value=>3, other=>"c"},
  {value=>3, other=>"d"},
  {value=>4, other=>"e"},
  {value=>5, other=>"f"},
);

# The insert position will be index 4, which is right of duplicates
#
my $key= 3;
my $pos=search_numeric_right $key, \@data;

Export 'make_search' and create anonymous subroutine instead of injecting named subroutines to name space

use List:::Insertion "make_search";

my @data=(1, 2, 3);

my $search=make_search type=>"numeric", duplicate=>"left";
my $key=2;
my $pos=$search->($key, \@data);

DESCRIPTION

List::Insertion implements binary search algorithms to locate the insertion point of a sorted list for a given value. If the value were to be inserted at that position, the list would remain sorted.

A distinction is made between inserting left or right of an exact match or duplicated entry. This allows to insert data before contiguous equal items or after.

Performance rather than flexibility is favoured in the implementation and necessitates the following restrictions on the data stored in the array/list and how it's compared:

Data must be sorted in asscending order only

This simplifies the combinations of subroutines exported

No code blocks can be specified

While code blocks are very flexible, 90% of the functionality can be achieved with 'accessor' snippets.

Element comparision is implicitly numerical or string

No object methods for comparison. Only basic <=>, >=, le and ge operators are used for internally for string and numeric comparison.

List element must be homogeneous and defined

All items in the list are expected to have the same data structure. The structure can be a simple scalar like a string or number, in which case no 'acceesor' snippet is used.

On the other hand it could be complex, like an array of arrays or hashes of objects to any level. The only restriction in this case is the 'accessor' snippet is able to access the value for comparison via post dereferencing or method calls

Although intended for searching for a insertion point, this module can also be used for general searching of elements. A simple check of equality between the search key and value at the found index will determine if the value was actually found or not.

The returned index will never indicate 'not found' as there is always a insert location in a list.

No symbols are exported by default. Specifications given at import time are used to generate and export the search routine (s) needed. Anonymous search subroutines can also be generated.

PERFORMANCE

General Binary Search Performance

Thanks to the lack of a CODE blocks and more streamlined comparison, this module is on par with List::BinarySearch::XS for at least one level of structured data. Not bad for pure perl.

Please see the associated benchmarking script bench.pl in this distribution:

Results for searching an array with 1000 random numeric elements stored in a
key/value hash. The keys searched are also randomly generated and may not
existing inthe sorted list. Comparision between List::BinarySearch::PP,
List::BinarySearch::XS, and List::Insertion

             Rate L::BS::PP      L::I L::BS::XS
L::BS::PP 14902/s        --      -80%      -81%
L::I      74796/s      402%        --       -6%
L::BS::XS 79644/s      434%        6%        --

Building/Updating an Already Sorted List

Perl is pretty fast when sorting bulk array data in a single sort call. However adding a single new value and resorting an existing array is not nearly as fast.

Normally to update an already sorted array with a new value, you would push the new value to the array and resort. This module can improve performance by first locating the insert point at which you would splice in the new value. For example:

eg.
  
  my @data= map rand(10), 1..1000; 

  my $key=4.3;

  # Instead of this ...
  #
  my @perl_sort;
  push @perl_sort, $key;
  @perl_sort=sort {$a <=> $b} @perl_sort;

  # Do this ...
  #
  my @sorted;
  my $pos;
  if(@sort){
    $pos=search_numeric_left $key, \@sorted;
    splice @sorted, $pos, 0, $key;
  }
  else {
    push @sorted, $key;
  }

The benchmarking script build-sorted.pl in this distribution demonstrates building a list where each element is a key value pair (hash) storing a random value. This module is within approx 20% of the XS implementation of List::BinarySearch::XS and is much faster than the pure perl List::BinarySearch::PP.

Results for constructing/sorting an array with 1000 random numeric elements
one element at a time. Each element is key/value hash.  Comparision between
List::BinarySearch::PP, List::BinarySearch::XS, perl sort, and
List::Insertion

                   Rate perl_sort_update L_BS_PP_update L_I_update L_BS_XS_update
perl_sort_update 25.5/s               --           -86%       -97%           -97%
L_BS_PP_update    179/s             603%             --       -76%           -81%
L_I_update        760/s            2884%           325%         --           -18%
L_BS_XS_update    922/s            3523%           416%        21%             --

API

Importing

When importing this module, the type of comparison (string or numeric) and the behaviour in dealing with duplicate values (left or right) is specified along with optional accessor and prefix options.

Combinations of these options are generated via Data::Combination, allowing multiple subroutines to be configured and returned with minimal typing.

Consider the following examples:

use List::Insertion {type=>"numeric", position=>left};                #(1)

use List::Insertion {type=>"numeric", position=>["left", "right"]};   #(2)

use List::Insertion {                                               #(3)
  type=>"numeric", duplicate=>["left", "right"], accessor=>'->{hash_key}'
};

use List::Insertion {                                               #(4)
  type=>"numeric", duplicate=>["left", "right"], accessor=>'->{hash_key}->method',
  prefix=>"find"};
  1. Imports the subroutine "search_numeric_left"

  2. Imports the subroutines "search_numeric_left" and "search_numeric_right"

  3. Imports the subroutines "search_numeric_left" and "search_numeric_right" and uses the accessor '->{hash_key} when accessing elements. Elements must all be hash references and the hash key will be compared in numeric fashion.

  4. Imports the subroutines "find_numeric_left" and "find_numeric_right" and uses the accessor '->{hash_key}->method' when accessing elements. Elements must all respond to 'method', with its return value compared in numeric fashion.

The default values of supported options are used if no matching options are present in an import specification. The defaults are:

type=>"string",
duplicate=>"left",
accessor=>"",
prefix=>"search"

Supported options during importing include:

type

type=>NAME or type=[NAME,...]

A plain scalar or array ref of comparison type names. The type is implicitly used as the second part of an exported subroutines name. Supported values for NAME are:

numeric

Numerical comparison

string

String comparison

duplicate

duplicate=>SIDE  or pos=>[SIDE,...]

A plain scalar or array ref of side names. The side to choose when duplicate values are encountered. This is used implicitly as the last part of an exported subroutines name.

Supported values for SIDE are:

left, lesser

Choose the lower index when the duplicate items are encountered.

eg 
  my @list=( 10, 20, 20, 20, 30)
                /\
                ||

  A 'left' search for 20 will result in a index of 1
right, greater

Choose the greater index (after duplicates) when the duplicate items are encountered.

eg  
  my @list=( 10, 20, 20, 20, 30)
                            /\
                            ||
  A 'right' search for 20 will result in a index of 4

accessor

accessor=>STRING

A string consisting of perl post dereferencing/method call syntax, which is used is to access internal levels of a element's data structure. Internally this string is literally appending to the array element indexing code:

eg
  Acessor: ->{hash_ref}[array_deref]->method
  Interal code: ... $array[$index] ...

  Resulting code: ... $array[$index]->{hash_ref}[array_deref]->method ...

If not specified, it is treated as an empty string, and elements in the list are treated as numeric/string simple scalars. Their value are used directly in comparisons in the search algorithm.

If specified, elements are dereferenced/called with the accessor. The resulting value is used in the comparison in the search algorithm.

The value of the search key is NOT subject to the accessor and is used directly in comparison.

prefix

prefix=>STRING

A string which becomes the start of the imported subroutines name. If unspecified, the string "search" is used.

eg 
  use List::Insertion {prefix=>"my_searcher", ...};

  # The subrotine imported will start with my_searcher

  my $pos=my_searcher_...

Anonymous Subroutines

Instead of importing named subroutines into your namespace, anonymous subroutines can be generated by importing the make_search subroutine:

use List::Insertion "make_search";
my $sub=make_search {options}

Creates a search subroutine configured with a options hash ref. Each option is a key value pair, as described in the importing section. Only simple scalars key/values are allowed, as only a single subroutine is returned per call. Multiple calls to this subroutine will need to be used to generate multiple search subroutines.

This is the subroutine called internally during import to generate the named subroutines.

The option prefix has no effect as the routine is anonymous.

Using Generated or Exported subrotines

The generated/imported subroutines are named in the format:

prefix_type_duplicate

where prefix, type and duplicate represent the prefix, data type ( string or numeric) and duplicated entry handling (left or right) configuration

These routines are called with two arguments, the search key and reference to the sorted data:

my $insert=find_nv_left $key, \@data;

The return value is the index in the @data, which if inserting $key will keep the list sorted.

The value of the element located at $insert my be equal to the search key.

NOTE: Search routines never return less then 0 or otherwise indicate 'element not found'. The index is always the point when data can be inserted. So an empty list will always return a found index of 0, as this where an element would be inserted.

FUTURE WORK

Make an XS version

That could be tricky with the accessor feature.

Validate accessor

Add a 'validator' for testing/confirmation of accessor syntax

SEE ALSO

List::BinarySearch and the List::BinarySearch::PP(pure perl) and List::BinarySearch::XS (XS enhanced) 'sub modules' provide more flexibility than this module thanks to the use of code blocks for element comparison.

REPOSITORY and BUG REPORTING

Please report any bugs and feature requests on the repo page: GitHub

AUTHOR

Ruben Westerberg, <drclaw@mac.com>

COPYRIGHT AND LICENSE

Copyright (C) 2023 by Ruben Westerberg

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, or under the MIT license

DISCLAIMER OF WARRANTIES

THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.