NAME

Search::Kinosearch::Lingua - Language-specific Kinosearch functions

DEPRECATED

Search::Kinosearch has been superseded by KinoSearch. Please use the new version.

SYNOPSIS

### Search::Kinosearch::Lingua is an abstract base class.
### Search::Kinosearch::Lingua::Xx subclasses are invoked indirectly.

### Example 1: A Spanish Kindexer object
my $kindexer = Search::Kinosearch::Kindexer->new(
    -language => 'Es',
    );

### Example 2: An English KSearch object (language of 'En' is implicit)
my $ksearch = Search::Kinosearch::KSearch->new();

DESCRIPTION

The Search::Kinosearch::Lingua::Xx subclasses provide language-specific functionality to the rest of the Kinosearch suite. They are never loaded directly; the appropriate subclass is loaded according to the -language parameter passed to either Search::Kinosearch::Kindexer->new() or Search::Kinosearch::KSearch->new().

All Search::Kinosearch::Lingua::Xx subclasses implement two methods: tokenize() and stem(). The code in these methods is reused by the following:

  • &Search::Kinosearch::Kindexer::tokenize_field

  • &Search::Kinosearch::Kindexer::stem_field

  • &Search::Kinosearch::KSearch::process

Additionally, each Lingua::Xx subclass contains a default stoplist and a precompiled regex matching a single token.

Kindexer and KSearch default to 'En' (English). It is also possible to specify no language at all with -language => '', in which case the default tokenize() and stem() methods from the base class Search::Kinosearch::Lingua are used.
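
The arrangement described above can be sketched in a few lines of Perl. This is an illustration only, not the Kinosearch source: the Demo::Lingua packages, the stoplist() accessor, the lingua_for() and terms_for() helpers, the three-word stoplist, and the order of operations (tokenize, drop stopwords, stem) are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the base class, with crude defaults.
package Demo::Lingua;

sub new      { my $class = shift; return bless {}, $class }
sub tokenize { my ( $self, $text ) = @_; return grep { length } split /\s+/, $text }
sub stem     { my ( $self, $token ) = @_; return $token }    # no-op default
sub stoplist { return {} }                                    # no stopwords

# Hypothetical language subclass: inherits tokenize() and stem(),
# and supplies its own (tiny, illustrative) stoplist.
package Demo::Lingua::En;
our @ISA = ('Demo::Lingua');

sub stoplist { return { the => 1, a => 1, of => 1 } }

package main;

# Map a -language value to a class name; '' selects the base class.
sub lingua_for {
    my ($language) = @_;
    my $class = length $language ? "Demo::Lingua::$language" : 'Demo::Lingua';
    return $class->new;
}

# Tokenize, drop stopwords, then stem what is left (assumed order).
sub terms_for {
    my ( $language, $text ) = @_;
    my $lingua = lingua_for($language);
    my $stop   = $lingua->stoplist;
    return map { $lingua->stem($_) }
        grep { !$stop->{$_} } $lingua->tokenize($text);
}

my @no_language = terms_for( '',   'the history of a language' );
my @english     = terms_for( 'En', 'the history of a language' );
print "@no_language\n";    # every token survives
print "@english\n";        # stopwords removed
```

With an empty language string every token passes through untouched, because the base class has no stoplist and its stem() returns its argument unchanged.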

METHODS

tokenize()

Tokenizing is the process of breaking up a stream of symbols into pieces. The default tokenize() routine, invoked when a language of '' (the empty string) is specified, is quite crude: all it does is split on whitespace. For comparison, the default English-language tokenizer converts most non-word characters to spaces before splitting on whitespace; apostrophes receive special treatment.
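
The difference between the two approaches is easiest to see side by side. The sketch below is an approximation, not the module's own code: both function names are hypothetical, and the apostrophe rule (keep an apostrophe only when it sits between two word characters) is an assumption about what "special treatment" might mean.

```perl
use strict;
use warnings;

# Crude tokenizer: split on whitespace, nothing else.
sub tokenize_plain {
    my ($text) = @_;
    return grep { length } split /\s+/, $text;
}

# English-style tokenizer (an approximation): keep apostrophes that sit
# between word characters, turn every other non-word character into a
# space, then split on whitespace.
sub tokenize_english_like {
    my ($text) = @_;
    $text =~ s/(?<!\w)'|'(?!\w)/ /g;    # drop apostrophes not inside a word
    $text =~ s/[^\w']+/ /g;             # other non-word characters become spaces
    return grep { length } split /\s+/, $text;
}

my $sample  = "Don't stop: it's the 'end', isn't it?";
my @plain   = tokenize_plain($sample);
my @english = tokenize_english_like($sample);
print join( '|', @plain ),   "\n";
print join( '|', @english ), "\n";
```

The whitespace-only version leaves punctuation glued to its neighbours ("stop:", "it?"), so those tokens would not match the bare words in a query. The second version keeps contractions such as "Don't" whole while discarding quotation marks and other punctuation.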

stem()

stem() provides a wrapper for a language-specific stemming algorithm. For a conceptual explanation of stemming, see the documentation for Lingua::Stem.

Currently, the Lingua::Stem::Snowball stemmers are preferred for performance reasons.
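
For readers unfamiliar with the Snowball stemmers, the following sketch shows one way to stem a list of tokens with Lingua::Stem::Snowball. It is not the wrapper used by Kinosearch. The stem_tokens() helper is hypothetical, and the fallback that returns the tokens unchanged when the module is not installed is an assumption added so that the example runs anywhere. Lingua::Stem::Snowball is a CPAN module, not part of core Perl.

```perl
use strict;
use warnings;

# Stem a list of tokens with Lingua::Stem::Snowball when it is installed.
# Returning the tokens unchanged otherwise is an assumption of this sketch.
sub stem_tokens {
    my ( $lang, @tokens ) = @_;
    return @tokens unless eval { require Lingua::Stem::Snowball; 1 };
    my $stemmer = Lingua::Stem::Snowball->new( lang => $lang );
    my @stems   = @tokens;    # copy first: stem_in_place modifies its argument
    $stemmer->stem_in_place( \@stems );
    return @stems;
}

my @stems = stem_tokens( 'en', qw( running horses quickly ) );
print "@stems\n";
```

The language code passed to the constructor is a lowercase ISO code such as 'en' or 'es', which differs from the capitalized 'En' and 'Es' values used for the -language parameter.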

TODO

  • Implement Search::Kinosearch::Lingua::Xx modules for as many languages as possible.

  • Consider enabling alternative stemming routines for languages with no Snowball stemmer.

SEE ALSO

Search::Kinosearch, Search::Kinosearch::Kindexer, Search::Kinosearch::KSearch, Lingua::Stem, Lingua::Stem::Snowball

AUTHOR

Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com

COPYRIGHT

Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.