NAME
Search::Tools::Keywords - extract keywords from a search query
SYNOPSIS
use Search::Tools::Keywords;
use Search::Tools::RegExp;
my $query = 'the quick fox color:brown and "lazy dog" not jumped';
my $kw = Search::Tools::Keywords->new(
    stopwords         => 'the',
    and_word          => 'and',
    or_word           => 'or',
    not_word          => 'not',
    stemmer           => \&your_stemmer_here,
    ignore_first_char => '\+\-',
    ignore_last_char  => '',
    word_characters   => $Search::Tools::RegExp::WordChar,
    debug             => 0,
    phrase_delim      => '"',
);
my @words = $kw->extract( $query );
# returns:
# quick
# fox
# brown
# lazy dog
DESCRIPTION
Do not confuse this class with Search::Tools::RegExp::Keywords.
Search::Tools::Keywords extracts the meaningful words from a search query. Many search engines support a syntax that includes special characters, boolean words, stopwords, and fields, so search queries can become complicated. To separate the wheat from the chaff, the supporting words and symbols are removed and only the actual search terms (keywords) are returned.
This class is used internally by Search::Tools::RegExp. You probably don't need to use it directly. But if you do, read on.
METHODS
new( %opts )
The new() method instantiates a Search::Tools::Keywords object. With the exception of extract(), all of the following methods can also be passed as key/value pairs to new().
extract( query )
The extract method parses query and returns an array of meaningful words. query can either be a scalar string or an array reference (if multiple queries should be parsed simultaneously).
Only positive words are extracted. In other words, if you search for:
foo not bar
then only foo is returned. Likewise:
+foo -bar
would return only foo.
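The behavior described above can be pictured with a small standalone sketch. It is not the module's parser: positive_terms() is a hypothetical helper, it assumes whitespace-separated terms, and it ignores phrases, fields, and stopwords entirely.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Search::Tools: keep only the terms
# that are not negated by a preceding NOT word or a leading '-'.
sub positive_terms {
    my ( $query, $not_word ) = @_;
    $not_word = 'not' unless defined $not_word;
    my @keep;
    my $skip_next = 0;
    for my $tok ( split ' ', $query ) {
        if ( lc $tok eq lc $not_word ) {
            $skip_next = 1;    # the term after the NOT word is negated
            next;
        }
        if ($skip_next) {
            $skip_next = 0;
            next;
        }
        next if $tok =~ /^-/;    # '-bar' is an excluded term
        $tok =~ s/^\+//;         # '+foo' is a required term; drop the sign
        push @keep, $tok if length $tok;
    }
    return @keep;
}

print join( ',', positive_terms('foo not bar') ), "\n";
print join( ',', positive_terms('+foo -bar') ),   "\n";
```

Both calls print foo, which matches the two examples given for extract().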
NOTE: All queries are converted to UTF-8. See the charset param.
stemmer
The stemmer function is used to find the root 'stem' of a word. There are many stemming algorithms available, including many on CPAN. The stemmer function should expect to receive two parameters: the Keywords object and the word to be stemmed. It should return exactly one value: the stemmed word.
Example stemmer function:
use Lingua::Stem;
my $stemmer = Lingua::Stem->new;
sub mystemfunc
{
    my ($kw, $word) = @_;
    # stem() returns an array ref of stems; take the first one
    return $stemmer->stem($word)->[0];
}
# and pass to Keywords new() method:
my $keyword_obj = Search::Tools::Keywords->new(stemmer => \&mystemfunc);
stopwords
A list of common words that should be ignored in parsing out keywords. May be either a string that will be split on whitespace, or an array ref.
NOTE: If a phrase contains a stopword, the phrase is tokenized into words on whitespace and the stopwords are then removed.
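A rough standalone illustration of the NOTE above follows. The helper name split_phrase_on_stopwords() is hypothetical, and exact (case-sensitive) matching of stopwords is an assumption made only for this sketch; the module itself may compare words differently.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the NOTE; not the module's own code.
sub split_phrase_on_stopwords {
    my ( $phrase, $stopwords ) = @_;

    # Accept either a whitespace-separated string or an array ref.
    my @stop = ref $stopwords ? @$stopwords : split ' ', $stopwords;
    my %is_stop = map { ( $_ => 1 ) } @stop;

    my @words = split ' ', $phrase;

    # A phrase without any stopword is kept intact.
    return ($phrase) unless grep { $is_stop{$_} } @words;

    # Otherwise return the individual words minus the stopwords.
    return grep { !$is_stop{$_} } @words;
}

print join( '|', split_phrase_on_stopwords( 'lazy dog',   'the of' ) ), "\n";
print join( '|', split_phrase_on_stopwords( 'dog of war', [ 'the', 'of' ] ) ), "\n";
```

The first call leaves the phrase whole; the second breaks it into dog and war because it contains the stopword of.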
ignore_first_char
String of characters to strip from the beginning of all words.
ignore_last_char
String of characters to strip from the end of all words.
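As a sketch of how these two settings might be applied, the hypothetical helper below interpolates each string into a regex character class. It assumes the values are already escaped for that purpose (as '\+\-' is in the SYNOPSIS) and that runs of matching characters are stripped, not just a single character; neither assumption is confirmed by the module.

```perl
use strict;
use warnings;

# Hypothetical helper; assumes both settings are strings that are safe
# to interpolate inside a regex character class.
sub strip_edges {
    my ( $word, $first, $last ) = @_;
    $word =~ s/^[$first]+// if length $first;    # leading characters
    $word =~ s/[$last]+$//  if length $last;     # trailing characters
    return $word;
}

print strip_edges( '+quick', '\+\-', '' ),    "\n";
print strip_edges( 'fox!?',  '',     '!\?' ), "\n";
```

An empty string disables stripping on that side, which mirrors the empty ignore_last_char value in the SYNOPSIS.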
and_word
Default: and
or_word
Default: or
not_word
Default: not
wildcard
Default: *
locale
Set a locale explicitly for a Keywords object. The charset value is extracted from the locale. If not set, the locale is inherited from the LC_CTYPE environment variable.
charset
Base charset used for converting queries to UTF-8. If not set, it is extracted from locale.
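The relationship between locale, charset, and the UTF-8 conversion can be sketched with the core Encode module. Both helper names are hypothetical, and the locale parsing assumes the common language_TERRITORY.charset form; the module's own handling may differ.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: pull the charset out of a locale name such as
# 'en_US.ISO-8859-1'; fall back to a caller-supplied default.
sub charset_from_locale {
    my ( $locale, $default ) = @_;
    return $1 if defined $locale && $locale =~ /\.([^@]+)/;
    return $default;
}

# Hypothetical helper: convert a byte string in $charset to UTF-8 bytes.
sub to_utf8 {
    my ( $bytes, $charset ) = @_;
    my $chars = decode( $charset, $bytes );    # bytes -> Perl characters
    return encode( 'UTF-8', $chars );          # characters -> UTF-8 bytes
}

my $charset = charset_from_locale( 'en_US.ISO-8859-1', 'UTF-8' );
my $utf8    = to_utf8( "caf\xE9", $charset );    # "cafe" with e-acute, Latin-1
printf "%s: %d bytes\n", $charset, length $utf8;
```

The four-byte Latin-1 input becomes five bytes once the accented character is encoded as UTF-8.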
AUTHOR
Peter Karman perl@peknet.com
Based on the HTML::HiLiter regular expression building code, originally by the same author, copyright 2004 by Cray Inc.
Thanks to Atomic Learning (www.atomiclearning.com) for sponsoring the development of this module.
COPYRIGHT
Copyright 2006 by Peter Karman. This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
HTML::HiLiter, Search::QueryParser