NAME
Search::Kinosearch - Search Engine Library
WARNING
Search::Kinosearch is ALPHA test software. Aspects of the interface are almost certain to change, given the suite's complexity. Users should not count on compatibility of files created by Kinosearch when future versions are released -- any time you upgrade Kinosearch (or possibly, a module such as Lingua::Stem::Snowball on which Kinosearch depends), you should expect to regenerate all kindex files from scratch.
SYNOPSIS
First, write an application to build a 'kindex' from your document collection. (A 'kindex' is a Kinosearch index.)
use Search::Kinosearch::Kindexer;
my $kindexer = Search::Kinosearch::Kindexer->new(
-mainpath => '/foo/bar/kindex',
);
for my $field ('title', 'bodytext') {
$kindexer->define_field(
-name => $field,
-lowercase => 1,
-tokenize => 1,
-stem => 1,
);
}
while (my ($title, $bodytext) = each %docs) {
my $doc = $kindexer->new_doc( $title );
$doc->set_field( title => $title );
$doc->set_field( bodytext => $bodytext );
$kindexer->add_doc( $doc );
}
$kindexer->generate;
$kindexer->write_kindex;
Then, write a second application to search the kindex:
use Search::Kinosearch::KSearch;
use Search::Kinosearch::Query;
my $ksearch = Search::Kinosearch::KSearch->new(
-mainpath => '/foo/bar/kindex',
);
my $query = Search::Kinosearch::Query->new(
-string => 'this AND NOT (that OR "the other thing")',
-lowercase => 1,
-tokenize => 1,
-stem => 1,
);
$ksearch->add_query( $query );
$ksearch->process;
while (my $hit = $ksearch->fetch_hit_hashref) {
print "$hit->{title}\n";
}
DESCRIPTION
Primary Features
Match 'any' or 'all' search terms
Match phrases
Boolean operators AND, OR, and AND NOT
Support for parenthetical groupings of arbitrary depth
Prepended +plus or -minus to require or negate a term
Sort results by relevance or by datetime
Stemming
Algorithmic selection of relevant excerpts
Hilighting of search terms in excerpts, even when stemmed
Fast, efficient algorithms for both indexing and searching
Works well with large or small document collections
High quality ranking algorithm based on term frequency / inverse document frequency (tf/idf)
General Introduction
Search::Kinosearch (hereafter abreviated 'Kinosearch') is a search engine library -- it handles the tricky, behind-the-scenes work involved in running a search application. It is up to you how to present the search results to the end-user.
Kinosearch has two main parts: Search::Kinosearch::Kindexer and Search::Kinosearch::KSearch (hereafter abbreviated 'Kindexer' and 'KSearch'). When you want to know which pages of a book are most relevant for a given subject (e.g. avocados, Aesop, Alekhine's Defense...) you look up the term in the book's index and it tells you which pages may be of interest to you; the 'kindex' produced by Kinosearch's Kindexer performs an analogous role -- using the interface tools provided by the KSearch module, you consult the kindex to find which documents within the document collection Kinosearch considers most relevant to your query.
[The Search::Kinosearch module itself doesn't do very much, and as an abstract base class, it does nothing on its own; this documentation is an overview which ties together the various components of the Kinosearch suite.]
The Kindexer thinks of your documents as database rows, each with as many fields as you define. HTML documents might be parsed, prepared, and fed to the Kindexer like so:
Store the URL in a kindex field called 'url'.
Store the text surrounded by the <title> tags in a kindex field called 'title';
Isolate portions of the document that are not relevant to content (such as navigation panels or advertisements) and remove them.
Strip all html tags.
Store what's left in a field called 'bodytext'.
Most of the time, you will want to take advantage of three crucial functions performed by the Kindexer, executed on a per-field basis depending on the parameters passed to define_field():
- -lowercase
-
This does exactly what you would expect - lc the text to be indexed (but leave the copy of the text to be stored intact). If you select the same option for your queries at search-time, your searches will be case-insensitive.
- -tokenize
-
Tokenizing breaks up input text into an array of words (roughly speaking). If you tokenize the text "Dr. Strangelove, or How I Learned to Stop Worrying and Love the Bomb", then searches for "Strangelove" or "Bomb" will match. If you don't tokenize it, then only a search for the complete string "Dr. Strangelove, or How I Learned to Stop Worrying and Love the Bomb" will match.
- -stem
-
Stemming reduces words to a root form. For instance, "horse", "horses", and "horsing" all become "hors" -- so that a search for 'horse' will also match documents containing 'horses' and 'horsing'. For more information, see the documentation for Lingua::Stem.
Once you have the completed the indexing phase, you will need a search application. Most often, this takes the form of a CGI script accessed via web browser. The interface design requirements of such an app will be familiar to anyone who's surfed the web.
Getting started
If you want to get started right away, copy the sample applications out of Search::Kinosearch::Tutorial, get them functioning, and then swap in some of your own material.
You may wish to consult the documentation for Search::Kinosearch::Kindexer, Search::Kinosearch::KSearch and Search::Kinosearch::Query before continuing on to the next section.
Fine Tuning Kinosearch
Performance Optimizations
The bottleneck in search applications is the ranking and sorting of a large number of relevant documents. Minimizing the time it takes to return results is a central design goal of Kinosearch.
The single most important optimization available for Kinosearch apps is to store either some or all of the kindex files in ram -- in particular, the files stored within the directory specified by the '-freqpath' parameter. These files contain all of the kindex data upon which search-time speed depends. Storing the other kindex files in ram won't hurt, but will not yield anywhere near the same benefit.
An additional search-time optimization is available when running Kinosearch applications under mod_perl. See the Search::Kinosearch::Kindex documentation for details.
Stoplists
A "stoplist" is collection of "stopwords": words which are common enough to be of little use when determining search results. For example, so many documents in English contain "the", "if", and "maybe" that it is best if they are blocked, not just at search time, but at index time. Removing them from the equation saves time and space, plus improves relevance.
By default, the Kindexer excludes a small list of stopwords from the kindex and KSearch reports when it detects any of them in a search query. It is possible to disable the stoplist or use a different one, if you so desire.
Phrase Matching Algorithm
Kinosearch's phrase matching implementation involves storing concatenated pairs of words in the kindex. This strategy yields a substantial performance benefit at search time (since the extremely fast hash-based algorithm used to scan for individual terms can be extended to detect phrases as well), but at the cost of increased kindex size. Blocks of text are broken into overlapping couplets, which are stored as individual terms in the kindex -- e.g. the text "brush your teeth" produces four kindex entries: "brush", "teeth", "brush your", and "your teeth" ("your" on its own is a stopword, so it is excluded from the kindex by default). If a user searches for the the specific phrase "brush your teeth", Kinosearch will return only documents which contain both "brush your" and "your teeth".
Ranking Algorithm
Kinosearch uses a variant of the well-established "Term Frequency / Inverse Document Frequency" weighting scheme. Explaining TF/IDF is beyond the scope of this documentation, but in a nutshell:
in a search for "skate park", documents which score well for the comparatively rare term "skate" will rank higher than documents which score well for the common term "park".
a 10-word text which has one occurrence each of both "skate" and "park" will rank higher than a 1000-word text which also contains one occurrence of each.
A web search for "tf idf" will turn up many excellent explanations of the algorithm.
Field Weights
Kinosearch will allow you to weight the title of a document more heavily than the bodytext -- or vice versa -- by assigning a weight to each field at either index-time or search-time. However, the multipliers don't have to be that large, because under TF/IDF a single occurrence of a word within the 10-word title will automatically contribute more to the score of a document than a single occurrence of the same word in the 1000-word bodytext. See the documentation for Kindexer's define_field() method for more details.
TO DO
Implement data compression methods.
Refine the API.
Dispense with DB_File and implement a composite index format.
Shrink proximity data.
Enable range queries.
Enable customizable tokenizing and stemming.
Implement docrank.
Fix excerpt highlighting for phrases so that a search for 'we the people' doesn't cause every instance of 'the' to be highlighted.
Complete support for UTF-8. At present, everything works so long as the encoding of all supplied files is ASCII compatible and is consistent across all files.
Add unicode normalization.
Add support for other encodings.
Implement -cache_level option for Search::Kinosearch::Kindexer, in order to allow user control over memory requirements.
Implement descending term weighting.
Implement alternative backends.
Add support for 64-bit timestamps.
Enable support for psuedo-stemming using Lingua::Spelling::Alternative for languages where no Snowball stemmer is available.
BUGS
Excerpt may be up to 20 characters longer than requested.
Spurious results can turn up in searches for phrases 3 terms or longer: for instance, a document containing "brush your hair and floss your teeth" will be returned in a search for '"brush your teeth"'.
FILES
Kinosearch requires several other CPAN distributions:
SEE ALSO
ACKNOWLEDGEMENTS
Chris Nandor has been helpful with debugging.
AUTHOR
Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com
COPYRIGHT
Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.