NAME

Search::Tools::Keywords - extract keywords from a search query

SYNOPSIS

my $query = 'the quick fox color:brown and "lazy dog" not jumped';

my $kw = Search::Tools::Keywords->new(
           stopwords           => 'the',
           and_word            => 'and',
           or_word             => 'or',
           not_word            => 'not',
           stemmer             => \&your_stemmer_here,
           ignore_first_char   => '\+\-',
           ignore_last_char    => ''
           );
           
my @words = $kw->extract( $query );
# returns:
#   quick
#   fox
#   brown
#   lazy dog
#   ("jumped" is excluded because it follows "not")

DESCRIPTION

Do not confuse this class with Search::Tools::RegExp::Keywords.

Search::Tools::Keywords extracts the meaningful words from a search query. Since many search engines support a syntax that includes special characters, boolean words, stopwords, and fields, search queries can become complicated. To separate the wheat from the chaff, the supporting words and symbols are removed and only the actual search terms (keywords) are returned.

This class is used internally by Search::Tools::RegExp. You probably don't need to use it directly. But if you do, read on.

METHODS

new( %opts )

The new() method instantiates a Search::Tools::Keywords object. With the exception of extract(), all of the following methods can also be passed as key/value pairs in new().

extract( query )

The extract() method parses query and returns a list of meaningful words. query can be either a scalar string or an array reference (to parse multiple queries at once).

Only positive words are extracted. In other words, if you search for:

foo not bar

then only foo is returned. Likewise:

+foo -bar

would return only foo.
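
The rule above can be illustrated with a simplified sketch. This is not the module's actual parser: positive_terms() is a hypothetical helper that assumes whitespace-separated terms, a single negating word, and leading + or - signs, and it ignores phrases, fields, and stopwords entirely.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only. It is NOT how
# Search::Tools::Keywords is implemented.
sub positive_terms {
    my ($query, $not_word) = @_;
    $not_word = 'not' unless defined $not_word;

    my @keep;
    my $negate_next = 0;
    for my $term (split /\s+/, $query) {
        next unless length $term;          # leading whitespace yields an empty field
        if (lc $term eq lc $not_word) {    # the term after "not" is negative
            $negate_next = 1;
            next;
        }
        if ($negate_next) {
            $negate_next = 0;
            next;
        }
        next if $term =~ /^-/;             # -bar is negative
        $term =~ s/^\+//;                  # +foo is positive; drop the sign
        push @keep, $term if length $term;
    }
    return @keep;
}

print join(',', positive_terms('foo not bar')), "\n";    # prints: foo
print join(',', positive_terms('+foo -bar')), "\n";      # prints: foo
```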

NOTE: All queries are converted to UTF-8. See the charset param.

stemmer

The stemmer function is used to find the root 'stem' of a word. There are many stemming algorithms available, including many on CPAN. The stemmer function should expect to receive two parameters: the Keywords object and the word to be stemmed. It should return exactly one value: the stemmed word.

Example stemmer function:

use Lingua::Stem;
my $stemmer = Lingua::Stem->new;

sub mystemfunc
{
    my ($kw,$word) = @_;
    return $stemmer->stem($word)->[0];
}

# and pass to Keywords new() method:

my $keyword_obj = Search::Tools::Keywords->new(stemmer => \&mystemfunc);
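
If Lingua::Stem is not available, any function with the same two-argument signature should work. The following is a minimal sketch of such a function; naive_stem() is a hypothetical toy that strips a few English suffixes and is no substitute for a real stemming algorithm.

```perl
use strict;
use warnings;

# Toy stemmer, for illustration only.
# Signature as described above: (Keywords object, word) -> stemmed word.
sub naive_stem {
    my ($kw, $word) = @_;    # $kw is accepted but unused in this sketch
    my $stem = lc $word;
    # Strip one common suffix, but leave short words alone.
    $stem =~ s/(?:ing|ed|es|s)$// if length($stem) > 4;
    return $stem;
}

# Called directly, with undef standing in for the Keywords object:
print naive_stem(undef, 'Jumped'), "\n";    # prints: jump
print naive_stem(undef, 'foxes'),  "\n";    # prints: fox
```

It would be passed to new() in the same way as mystemfunc above, as a code reference: stemmer => \&naive_stem.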
    

stopwords

A list of common words that should be ignored in parsing out keywords.

NOTE: If a stopword is contained in a phrase, then the phrase will be split into its separate words based on whitespace.
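
The effect described in the note can be sketched as follows. split_phrase_on_stopwords() is a hypothetical helper, not part of the module, and it assumes that the stopwords themselves are dropped once the phrase has been split.

```perl
use strict;
use warnings;

# Illustration only: a hypothetical helper, not the module's own code.
sub split_phrase_on_stopwords {
    my ($phrase, @stopwords) = @_;
    my %is_stop = map { ( lc($_) => 1 ) } @stopwords;
    my @words   = split ' ', $phrase;

    # No stopword in the phrase: keep it as a single keyword.
    return ($phrase) unless grep { $is_stop{ lc $_ } } @words;

    # Otherwise split on whitespace and drop the stopwords (an assumption).
    return grep { !$is_stop{ lc $_ } } @words;
}

print join('|', split_phrase_on_stopwords('lazy dog', 'the')), "\n";
# prints: lazy dog
print join('|', split_phrase_on_stopwords('over the lazy dog', 'the')), "\n";
# prints: over|lazy|dog
```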

ignore_first_char

String of characters to strip from the beginning of all words.

ignore_last_char

String of characters to strip from the end of all words.
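
The SYNOPSIS passes '\+\-' for ignore_first_char, which suggests the value is used inside a regular expression character class. Under that assumption, stripping might look like the sketch below; strip_ignored() is a hypothetical helper and not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper. Assumes both option values are interpolated
# into a regex character class, so metacharacters must be escaped.
sub strip_ignored {
    my ($word, $first, $last) = @_;
    $word =~ s/^[$first]+// if length $first;    # leading characters
    $word =~ s/[$last]+$//  if length $last;     # trailing characters
    return $word;
}

print strip_ignored('+foo',  '\+\-', ''),  "\n";    # prints: foo
print strip_ignored('-bar!', '\+\-', '!'), "\n";    # prints: bar
```

An empty string, as in the SYNOPSIS value for ignore_last_char, leaves that end of the word untouched.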

and_word

The boolean AND word recognized in queries. Default: and

or_word

The boolean OR word recognized in queries. Default: or

not_word

The boolean NOT word recognized in queries. Words negated by it are not returned by extract(). Default: not

wildcard

Default: *

locale

Set a locale explicitly for a Keywords object. The charset value is extracted from the locale. If not set, the locale is inherited from the LC_CTYPE environment variable.

charset

Base charset used for converting queries to UTF-8. If not set, it is extracted from locale.
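
As a rough illustration of what converting a query from a base charset to UTF-8 involves, the sketch below uses the core Encode module. The charset value and the sample bytes are assumptions made for the example, and this is not necessarily how the module performs the conversion internally.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed for illustration: the query arrives as ISO-8859-1 bytes.
my $charset = 'iso-8859-1';
my $bytes   = "caf\xE9";    # "cafe" with e-acute: 4 bytes in Latin-1

my $chars = decode($charset, $bytes);    # bytes -> Perl character string
my $utf8  = encode('UTF-8', $chars);     # characters -> UTF-8 bytes

printf "%d bytes in, %d bytes out\n", length($bytes), length($utf8);
# prints: 4 bytes in, 5 bytes out
```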

AUTHOR

Peter Karman perl@peknet.com

Based on the HTML::HiLiter regular expression building code, originally by the same author, copyright 2004 by Cray Inc.

Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of this module.

COPYRIGHT

Copyright 2006 by Peter Karman. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

HTML::HiLiter, Search::QueryParser