NAME
Gzip::BinarySearch - binary search a sorted, gzipped flatfile database
SYNOPSIS
use Gzip::BinarySearch qw(tsv_column);
my $db = Gzip::BinarySearch->new(
    file => 'file.gz',
    key_func => tsv_column(2),
);
print $db->find($key);
print for $db->find_all($key);
DESCRIPTION
This module can binary search gzipped databases, such as TSVs, without decompressing the entire file. You need only declare how the file is sorted.
Behind the scenes, we use Gzip::RandomAccess to perform the random-access decompression.
METHODS
new (%args)
You may pass index_file, index_span and cleanup arguments, which are passed directly to Gzip::RandomAccess, so check that module's perldoc for more info.
- file (required)
-
Path to the gzip file you want to search.
- key_func (default: first field, whitespace-separated)
-
A function that takes a line (aliased to $_) and should return the key for that line, which will be used when comparing lines. For TSVs, you can use tsv_column to generate a key function (see below).
- cmp_func (default: Perl's 'cmp' operator)
-
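A function that accepts two keys, $a and $b, and returns a value indicating which is 'greater' in the same way as Perl's sort builtin. This must match the order in which the file is actually sorted; if it does not, the binary search will look in the wrong part of the file and can miss matching lines.
- est_line_length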
-
Providing an estimate of the maximum line length in the gzip file can help Gzip::BinarySearch know how much data to uncompress. The default is 512 bytes - getting it wrong will affect speed, but it'll still work.
- surrounding_lines_blocksize
-
How many bytes to search either side of a matching line to find adjacent matching lines when using find_all. If you have a lot of rows with the same key, upping this value will speed things up. The default is 4096 bytes.
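As a rough illustration of the key_func and cmp_func callbacks, the sketch below builds both for a hypothetical TSV whose second column is an integer ID. It assumes key_func reads the line from $_ and cmp_func reads $a and $b exactly as a sort block does. The sample lines and variable names are invented for the example, and the module itself is not loaded.

```perl
use strict;
use warnings;

# Hypothetical sample rows: name, numeric ID, city (tab-separated),
# already sorted numerically by the ID column.
my @lines = ("alice\t9\tLeeds\n", "bob\t10\tYork\n", "carol\t100\tBath\n");

# key_func: the line arrives in $_, so a bare split operates on it.
my $key_func = sub { (split /\t/)[1] };

# cmp_func: compares $a and $b like a sort block. A numerically sorted
# file needs <=> because string comparison orders "10" before "9".
my $cmp_func = sub { $a <=> $b };

my @keys   = map { $key_func->() } @lines;   # keys in file order
my @sorted = sort $cmp_func @keys;           # keys in comparator order

# If these two lists differ, the file is not sorted the way cmp_func expects.
print "keys:   @keys\n";
print "sorted: @sorted\n";
```

Running this kind of check over a small sample of the file before building the database object is one cheap way to confirm that the comparator and the file's sort order agree.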
find ($key)
Return the line matching the key supplied, or nothing (undef/empty list) if nothing found.
find_all ($key)
Return all lines matching the key supplied, or an empty list if none found. The lines will be returned in the order they appear in the file.
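To make the find_all behaviour concrete, here is a minimal sketch of the same idea on an in-memory array of lines: a binary search for the first matching line, followed by a forward scan over adjacent matches. It is an illustration only and not the module's implementation, which works on compressed data. The function find_all_in_memory and its helpers are hypothetical and, unlike the module's callbacks, take their inputs as explicit arguments.

```perl
use strict;
use warnings;

# Return every line whose key equals $want, in file order.
# $lines must already be sorted consistently with $cmp.
sub find_all_in_memory {
    my ($lines, $want, $key_of, $cmp) = @_;

    # Lower-bound binary search: first index whose key is >= $want.
    my ($lo, $hi) = (0, scalar @$lines);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($cmp->($key_of->($lines->[$mid]), $want) < 0) {
            $lo = $mid + 1;
        } else {
            $hi = $mid;
        }
    }

    # Walk forward while the key still matches.
    my @found;
    while ($lo < @$lines && $cmp->($key_of->($lines->[$lo]), $want) == 0) {
        push @found, $lines->[$lo++];
    }
    return @found;
}

my @rows   = ("apple\t1", "berry\t2", "berry\t3", "cherry\t4");
my $key_of = sub { (split /\t/, $_[0])[0] };   # first tab-separated field
my $cmp    = sub { $_[0] cmp $_[1] };          # plain string comparison

print "$_\n" for find_all_in_memory(\@rows, 'berry', $key_of, $cmp);
```

Searching for the lower bound rather than any matching line is what keeps the results in file order when several rows share a key.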
gzip
Returns the Gzip::RandomAccess object we're using.
EXPORTED FUNCTIONS
tsv_column ($column_number)
Returns a key function that will parse each line as a TSV and return the specified column number as a key.
fs_column ($field_separator, $column_number)
Returns a key function that will split a line by the field separator provided, and return the specified column number. ($field_separator may be a regex or a string.)
For example, to split like awk(1) and use the first column:
key_func => fs_column(qr/\s+/, 1)
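For readers who want to write their own key function, the following sketch shows roughly what a generator in the style of fs_column could look like. The name my_fs_column is hypothetical and this is not the module's source. It assumes the line is supplied in $_, that columns are numbered from 1 (as in the example above), and that the trailing newline should be stripped before splitting.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fs_column: returns a closure that reads the
# line from $_ and returns the requested 1-based column.
sub my_fs_column {
    my ($separator, $column) = @_;
    return sub {
        my $line = $_;                  # copy, so the caller's line is untouched
        chomp $line;                    # keep the newline out of the last field
        my @fields = split $separator, $line;
        return $fields[$column - 1];
    };
}

my $first_word = my_fs_column(qr/\s+/, 1);   # regex separator
my $third_csv  = my_fs_column(',', 3);       # string separator

for ("GET /index.html 200\n") { print $first_word->(), "\n" }
for ("a,b,c\n")               { print $third_csv->(), "\n" }
```

A closure of this shape can be passed wherever a key_func is expected, under the same assumption about $_.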
est_line_length
surrounding_lines_blocksize
Accessors for the constructor arguments of the same name.
CAVEATS
Currently only works with Linux line endings (LF, ASCII 0x0A).
Does not support fancy multibyte encodings (specifically UTF-8) but I aim to add support in a later release.
Isn't as efficient as it could be - aligning decompression to the indexed points in the gzip would help, as would caching decompressed blocks.
AUTHOR
Richard Harris <richardjharris@gmail.com>