NAME

OurNet::FuzzyIndex - Inverted search for double-byte characters

SYNOPSIS

use OurNet::FuzzyIndex;

my $idxfile  = 'test.idx'; # Name of the database file
my $pagesize = undef;      # Page size (twice the average record size)
my $cache    = undef;      # Cache size (undef to use default)
my $subdbs   = 0;          # Number of child dbs; 0 for none

# Initiate the DB from scratch
unlink $idxfile if -e $idxfile;
my $db = OurNet::FuzzyIndex->new($idxfile, $pagesize, $cache, $subdbs);

# Index a record: key = 'Doc1', content = 'Some text here'
$db->insert('Doc1', 'Some text here');

# Alternatively, parse the content first with different weights
my %words = $db->parse("Some other text here", 5);
%words = $db->parse_xs("Some more texts here", 2, \%words);

# Then index the resulting hash with 'Doc2' as its key
$db->insert('Doc2', %words);

# Perform a query: the 2nd argument is the 'exact match' flag
my %result = $db->query('search for some text', $MATCH_FUZZY);

# Combine the result with another query
%result = $db->query('more please', $MATCH_NOT, \%result);

# Dump the results; note you have to call $db->getkey each time
foreach my $idx (sort {$result{$b} <=> $result{$a}} keys(%result)) {
    my $val = $result{$idx};
    print "Matched: ".$db->getkey($idx)." (score $val)\n";
}

# Set database variables
$db->setvar('variable', "fetch success!\n");
print $db->getvar('variable');

# Get all records: the optional 0 says we want an array of keys
print "These records are indexed:\n";
print join(',', $db->getkeys(0));

# Alternatively, get it with its internal index number
my %allkeys = $db->getkeys(1);

DESCRIPTION

OurNet::FuzzyIndex implements a simple consecutive-letter indexing mechanism designed specifically for multi-byte encodings such as Big5 or UTF-8.

It uses DB_File to create an associative mapping from each character to the one that follows it, using DB_BTREE's duplicate-key feature to speed up queries. Its scoring algorithm is also designed to reduce the impact of redundant words on the query result.
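The consecutive-character idea can be illustrated with a small, self-contained sketch. This is not the module's implementation: a plain in-memory hash stands in for the DB_File B-tree, the scoring is a bare count of matching pairs, and the names build_pairs, index_doc and score_query are hypothetical.

```perl
use strict;
use warnings;
use utf8;

# Illustrative only: %index stands in for the on-disk B-tree with
# duplicate keys. Each first character maps to a list of
# [next character, document key] entries.
my %index;

# Split text into consecutive character pairs within each word.
sub build_pairs {
    my ($text) = @_;
    my @pairs;
    for my $word (split ' ', lc $text) {
        my @chars = split //, $word;
        push @pairs, map { [ $chars[$_], $chars[$_ + 1] ] } 0 .. $#chars - 1;
    }
    return @pairs;
}

# Record every character pair of a document under its first character.
sub index_doc {
    my ($key, $text) = @_;
    for my $pair (build_pairs($text)) {
        my ($first, $next) = @$pair;
        push @{ $index{$first} }, [ $next, $key ];
    }
    return;
}

# Score each document by the number of distinct query pairs it contains.
sub score_query {
    my ($query) = @_;
    my (%score, %seen);
    for my $pair (build_pairs($query)) {
        my ($first, $next) = @$pair;
        next if $seen{"$first$next"}++;    # count a repeated pair once
        my %hit;
        for my $entry (@{ $index{$first} || [] }) {
            $hit{ $entry->[1] } = 1 if $entry->[0] eq $next;
        }
        $score{$_}++ for keys %hit;
    }
    return %score;
}

index_doc('Doc1', 'Some text here');
index_doc('Doc2', 'Other words');

my %score = score_query('text');
print "$_: $score{$_}\n" for sort keys %score;
```

Counting each distinct pair only once per query is a crude stand-in for the redundancy handling mentioned above; the module's actual weighting may differ.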

This module also supports a distributed-database option, which lets each query access only a small portion of the database.

Although this module currently supports only the Big5 and Latin-1 encodings internally, you can override the parse.c module to extend it, or add your own translation maps.

KNOWN ISSUES

The query() function parses the query string through a time-consuming callback, _parse_q(); this is expected to be replaced by a simple function that returns the whole processed list. (Fortunately, most query strings are too short for this to make a significant difference.)

The MATCH_EXACT flag is misleading: FuzzyIndex cannot tell from the information stored in the index file alone whether a query matches the content exactly. You are encouraged to write your own grep-like post-filter.
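One possible shape for such a post-filter is sketched below. It assumes the result hash has already been translated from internal index numbers to document keys (for example via getkey), and that the caller can fetch the original text for a key. The exact_filter name and the $fetch_text callback are hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the hits whose original text really
# contains the query string. $fetch_text is a caller-supplied callback
# that maps a document key to its content, since the index does not
# hold the full text.
sub exact_filter {
    my ($query, $hits, $fetch_text) = @_;
    my %kept;
    for my $key (keys %$hits) {
        my $text = $fetch_text->($key);
        next unless defined $text;
        $kept{$key} = $hits->{$key} if index($text, $query) >= 0;
    }
    return %kept;
}

# Made-up data standing in for stored documents and query() scores
my %content = (Doc1 => 'Some text here', Doc2 => 'Some more texts here');
my %hits    = (Doc1 => 30, Doc2 => 12);

my %exact = exact_filter('text here', \%hits, sub { $content{ $_[0] } });
print "$_ (score $exact{$_})\n" for sort keys %exact;
```

The filter uses index() for a literal substring test; a regular expression could be substituted where case-insensitive or word-boundary matching is wanted.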

TODO

  • Internal handling of locale/unicode mappings

  • Boolean / selective search using combined MATCH_* flags

  • Fix bugs concerning sub_dbs

AUTHORS

Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.

COPYRIGHT

Copyright 2001 by Autrijus Tang <autrijus@autrijus.org>, Chia-Liang Kao <clkao@clkao.org>.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html
