NAME

OurNet::FuzzyIndex - Inverted search for double-byte characters

SYNOPSIS

use OurNet::FuzzyIndex;

my $idxfile  = 'test.idx'; # Name of the database file
my $pagesize = undef;      # Page size (twice the size of an average record)
my $cache    = undef;      # Cache size (undef to use default)
my $subdbs   = 0;          # Number of child dbs; 0 for none

# Initiate the DB from scratch
unlink $idxfile if -e $idxfile;
my $db = OurNet::FuzzyIndex->new($idxfile, $pagesize, $cache, $subdbs);

# Index a record: key = 'Doc1', content = 'Some text here'
$db->insert('Doc1', 'Some text here');

# Alternatively, parse the content first with different weights
my %words = $db->parse("Some other text here", 5);
%words = $db->parse_xs("Some more texts here", 2, \%words);

# Then index the resulting hash with 'Doc2' as its key
$db->insert('Doc2', %words);

# Perform a query: the 2nd argument is a MATCH_* flag
my %result = $db->query('search for some text', $MATCH_FUZZY);

# Combine the result with another query
%result = $db->query('more please', $MATCH_NOT, \%result);

# Dump the results; note you have to call $db->getkey each time
foreach my $idx (sort {$result{$b} <=> $result{$a}} keys(%result)) {
    my $val = $result{$idx};
    print "Matched: ".$db->getkey($idx)." (score $val)\n";
}

# Set database variables
$db->setvar('variable', "fetch success!\n");
print $db->getvar('variable');

# Get all records: the optional 0 says we want an array of keys
print "These records are indexed:\n";
print join(',', $db->getkeys(0));

# Alternatively, get it with its internal index number
my %allkeys = $db->getkeys(1);

DESCRIPTION

OurNet::FuzzyIndex implements a simple consecutive-letter indexing mechanism specifically designed for multi-byte encoding maps, e.g. big-5 or utf8.

It uses DB_File to create an associative mapping from each character to the one that follows it, utilizing DB_BTREE's duplicate key feature to speed up queries. Its scoring algorithm is also geared to reduce the impact of redundant words on a query's result.
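The consecutive-character idea can be illustrated with a minimal in-memory sketch. It is not the module's implementation: the names `%index`, `index_text` and `lookup` are hypothetical, a plain hash of arrays stands in for the DB_File B-tree with duplicate keys, the input is assumed to be already-decoded character strings, and the scoring is a bare pair count rather than the module's own algorithm.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the on-disk B-tree: each character
# maps to a list of [following character, document key] pairs, much as
# duplicate keys would store several entries under one key.
my %index;

# Record every consecutive character pair of $text under document $key.
sub index_text {
    my ($key, $text) = @_;
    my @chars = split //, $text;
    for my $i (0 .. $#chars - 1) {
        push @{ $index{ $chars[$i] } }, [ $chars[$i + 1], $key ];
    }
}

# Count, per document, how many consecutive pairs of $query it contains.
sub lookup {
    my ($query) = @_;
    my %score;
    my @chars = split //, $query;
    for my $i (0 .. $#chars - 1) {
        my %seen;    # count each query pair at most once per document
        for my $entry (@{ $index{ $chars[$i] } || [] }) {
            my ($next, $key) = @$entry;
            next unless $next eq $chars[$i + 1];
            $score{$key}++ unless $seen{$key}++;
        }
    }
    return %score;
}

index_text('Doc1', 'abcd');
index_text('Doc2', 'bcx');

my %hits = lookup('bcd');
print "$_ => $hits{$_}\n" for sort keys %hits;
```

Here the query 'bcd' contributes the pairs b-c and c-d, so Doc1 (which contains both) scores 2 and Doc2 (which contains only b-c) scores 1.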

Although this module currently only supports big-5 and latin-1 encodings internally, you could override the parse.c module for extensions, or add your own translation maps.

KNOWN ISSUES

The query() function uses a time-consuming callback function, _parse_q, to parse the query string; it is expected to be replaced by a simple function that returns the whole processed list. (Fortunately, most query strings are not long enough for this to make a significant difference.)

The MATCH_EXACT flag is misleading: FuzzyIndex cannot tell whether a query matches the content exactly from the information stored in the index file alone. You are encouraged to write your own grep-like post-filter.
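One possible shape for such a post-filter is sketched below. It assumes the original text of every record is still available to the caller; the `%content` hash, the sample scores and the `exact_filter` helper are all hypothetical and not part of this module. For simplicity the sample result hash is keyed by record key, whereas a real query result is keyed by internal index number and would need $db->getkey first.

```perl
use strict;
use warnings;

# Hypothetical data: %content holds the original text of each record, and
# %result stands in for a hash of record key => score built from a query.
my %content = (
    Doc1 => 'Some text here',
    Doc2 => 'Some other text here',
);
my %result = (Doc1 => 30, Doc2 => 20);

# Keep only the records whose text contains $phrase verbatim.
sub exact_filter {
    my ($phrase, $result, $content) = @_;
    my %kept;
    for my $key (keys %$result) {
        my $text = $content->{$key};
        next unless defined $text;    # skip records with no stored text
        $kept{$key} = $result->{$key} if index($text, $phrase) >= 0;
    }
    return %kept;
}

my %exact = exact_filter('other text', \%result, \%content);
print "$_ (score $exact{$_})\n" for sort keys %exact;
```

index() performs a literal substring test, which avoids having to escape regex metacharacters in the phrase; a regex match could be substituted where case-insensitive or pattern matching is wanted.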

TODO

* Internal handling of locale/unicode mappings
* Boolean / selective search using combined MATCH_* flags
* Fix bugs concerning sub_dbs

AUTHORS

Autrijus Tang <autrijus@autrijus.org>

COPYRIGHT
