NAME
Search::Kinosearch::Kindexer - create kindex files
DEPRECATED
Search::Kinosearch has been superseded by KinoSearch. Please use the new version.
SYNOPSIS
my $kindexer = Search::Kinosearch::Kindexer->new(
-mainpath => '/foo/bar/kindex',
-temp_directory => '/foo/bar',
);
for my $field ('title', 'bodytext') {
$kindexer->define_field(
-name => $field,
-lowercase => 1,
-tokenize => 1,
-stem => 1,
);
}
while (my ($title, $bodytext) = each %docs) {
my $doc = $kindexer->new_doc( $title );
$doc->set_field( title => $title );
$doc->set_field( bodytext => $bodytext );
$kindexer->add_doc( $doc );
}
$kindexer->generate;
$kindexer->write_kindex;
DESCRIPTION
Create a kindex
How to create a kindex, in 7 easy steps...
Step 1: Create a Kindexer object.
my $kindexer = Search::Kinosearch::Kindexer->new(
-mainpath => '/foo/bar/kindex',
-temp_directory => '/foo/bar',
-mode => 'overwrite',
);
Step 2: Define all the fields that you'll ever need this kindex to have -- because as soon as you process your first document, you lose the ability to add, remove, or change the characteristics of any fields.
$kindexer->define_field(
-name => 'url',
-score => 0,
);
$kindexer->define_field(
-name => 'title',
-lowercase => 1,
-tokenize => 1,
-stem => 1,
);
$kindexer->define_field(
-name => 'bodytext',
-lowercase => 1,
-tokenize => 1,
-stem => 1,
);
$kindexer->define_field(
-name => 'keywords',
-lowercase => 1,
-tokenize => 1,
-stem => 1,
-store => 0,
);
$kindexer->define_field(
-name => 'section',
-weight => 0,
);
Step 3: Start a new document, identified by something unique (such as a URL).
my $doc = $kindexer->new_doc($url);
Step 4: Set the value for each field.
$doc->set_field( url => $url );
$doc->set_field( title => $title );
$doc->set_field( bodytext => $bodytext );
$doc->set_field( keywords => $keywords );
$doc->set_field( section => $section );
Step 5: Add the document to the kindex.
$kindexer->add_doc( $doc );
Step 6: Repeat steps 3-5 for each document in the collection.
Step 7: Finalize the kindex and write it out.
$kindexer->generate;
$kindexer->write_kindex;
Update an existing kindex
Other than making sure that -mode is set to 'update', there is no difference in how you treat the Kindexer, though you may wish to choose a custom setting for -optimization.
my $kindexer = Search::Kinosearch::Kindexer->new(
-mainpath => '/path/to/kindex',
-temp_directory => '/foo/bar',
-mode => 'update',
-optimization => 1,
);
If you want to overwrite a document currently in the kindex, call new_doc(), set_field(), and add_doc() just as you did when you first added it, making sure that the unique identifier matches.
File Locking
Whenever you create a Kindexer, a KSearch or a Kindex object, a shared lock is requested on a file called 'kinoreadlock' within the -mainpath directory. Kindexer objects perform all of their file manipulation on temporary files which are swapped in at the last moment, so it is safe to continue searching against an existing kindex while it is in the process of being updated, or even overwritten.
After the Kindexer receives the shared lock on 'kinoreadlock', it requests an exclusive lock on another file called 'kinowritelock'. If it cannot get this exclusive lock, it bombs out immediately, since it is not safe for two Kindexers to run updates against the same kindex simultaneously.
After $kindexer->generate() completes, the files are ready to be swapped into place using File::Copy's move() command. Calling write_kindex() triggers a request for an exclusive lock on 'kinoreadlock', for which the Kindexer will wait as long as necessary. Once the exclusive lock is granted, the outdated files are unlinked, the new files take their spots, and all locks are released. If the temp directory and the kindex are on the same volume, the process can be almost instantaneous.
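The lock sequence described above can be sketched with core Perl modules only. This is a minimal illustration, not Kinosearch's actual implementation: the helper name swap_into_place() and the file name 'kinodata' are assumptions made for the example, and only the lockfile names 'kinoreadlock' and 'kinowritelock' come from the text.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);
use File::Copy qw(move);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper (not part of Kinosearch): swap one finished temp file
# into the kindex directory using the lock sequence described above.
# Returns 1 on success, 0 if another writer already holds 'kinowritelock'.
sub swap_into_place {
    my ( $mainpath, $temp_file, $dest_name ) = @_;

    # 1. Shared lock on 'kinoreadlock' (searchers request this one too).
    my $read_path = File::Spec->catfile( $mainpath, 'kinoreadlock' );
    open( my $read_fh, '+>>', $read_path )
        or die "Can't open $read_path: $!";
    flock( $read_fh, LOCK_SH ) or die "Can't get shared lock: $!";

    # 2. Exclusive, non-blocking lock on 'kinowritelock'. Give up at once
    #    if another writer holds it.
    my $write_path = File::Spec->catfile( $mainpath, 'kinowritelock' );
    open( my $write_fh, '+>>', $write_path )
        or die "Can't open $write_path: $!";
    flock( $write_fh, LOCK_EX | LOCK_NB ) or return 0;

    # ... all index building would happen here, on temp files only ...

    # 3. Ask for an exclusive lock on 'kinoreadlock', waiting as long as
    #    necessary for searchers to finish.
    flock( $read_fh, LOCK_EX ) or die "Can't get exclusive lock: $!";

    # 4. Swap the new file into place, then release both locks.
    my $dest = File::Spec->catfile( $mainpath, $dest_name );
    move( $temp_file, $dest ) or die "Can't move $temp_file to $dest: $!";
    close $write_fh;
    close $read_fh;
    return 1;
}

# Usage: throwaway directories stand in for -mainpath and -temp_directory.
my $mainpath  = tempdir( CLEANUP => 1 );
my $temp_dir  = tempdir( CLEANUP => 1 );
my $temp_file = File::Spec->catfile( $temp_dir, 'kinodata.new' );
open( my $out, '>', $temp_file ) or die "Can't write $temp_file: $!";
print {$out} "new index data\n";
close $out;

my $swapped = swap_into_place( $mainpath, $temp_file, 'kinodata' );
print $swapped ? "swapped\n" : "another writer is active\n";
```

The write lock is requested with LOCK_NB so that a second writer fails fast, while the final exclusive request on 'kinoreadlock' blocks, mirroring the "wait as long as necessary" behavior described above.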
METHODS
new()
my $kindexer = Search::Kinosearch::Kindexer->new(
-mode => 'overwrite', # default: 'update'
-mainpath => '/foo/bar/kindex', # default: ./kindex
-freqpath => '/baz/freqdata', # default: ./kindex/freqdata
-temp_directory => '/foo/temp', # default: current directory
-optimization => 1, # default: 2
-language => 'Es', # default: 'En'
-encoding => 'UTF-8', # default: 'UTF-8'
-phrase_matching => 0, # default: 1
-enable_datetime => 1, # default: 0
-stoplist => \%stoplist, # default: see below
-max_kinodata_fs => 2 ** 29, # default: 2 ** 28 [256 MB]
-verbosity => 1, # default: 0
);
Create a Kindexer object.
- -mode
-
Two options are available: 'overwrite' and 'update'. If there is no kindex at the specified -mainpath, a kindex will be created no matter what -mode is set to. In either case, no permanent file modifications beyond the creation of -mainpath, -freqpath, and the lockfiles are applied until write_kindex() is called.
- -mainpath
-
The path to your kindex. If not specified, defaults to 'kindex'.
- -freqpath
-
Files within this directory contain term frequency data. The speed with which they can be read has a major impact on search-time performance, so you may wish to copy this directory onto a ram disk once Kindexer finishes. If you don't specify -freqpath, it appears as a directory called 'freqdata' within -mainpath.
- -temp_directory
-
The Kindexer object will create a single randomly-named temp directory within whatever is specified as -temp_directory, then use that inner directory for all its temporary files. BUG: In the 0.02 branch of Kinosearch, the -temp_directory MUST be on the same filesystem as the kindex itself.
- -optimization
-
This parameter, which controls the behavior of update mode, is primarily relevant for large-scale Kinosearch deployments that require frequent updates. For small scale deployments, the simplest course is to run Kindexer with -mode set to 'overwrite' and regenerate the kindex from scratch every time -- in which case -optimization is irrelevant.
There are 4 possible settings for -optimization:
1 - Full optimization. Long indexing times, but quick searches. All subkindexes are merged into one every time.
2 - Close to full optimization (default setting). Searches perform at close to maximum speed; index times are usually pretty short, but every once in a while a spike occurs. A maximum of 2 subkindexes are allowed to exist at any given moment. If the second (auxiliary) subkindex is detected to be larger than 10% of the size of the first (primary) subkindex when the Kindexer starts, -optimization is kicked up to level 1 and the two are merged.
3 - The "incremental indexing" setting. Indexing times are usually quick, though spikes occur; search times may be somewhat slower, though the difference is minimal if you are using a ram disk. Subkindexes are consolidated either when there are 10 of them, or when several of them contain as many documents as their left neighbor. The goal is to minimize the resources expended on consolidating subkindexes while maintaining decent search-time performance.
4 - No optimization. Short indexing times; search performance degrades the more the kindex is updated, and doesn't recover until it is updated with -optimization set to 1, 2, or 3. Subkindexes are not merged -- a new subkindex is tacked on to the end of the kindex every time Kindexer is called upon to update it.
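The level 2 merge rule can be written as a one-line test. The sketch below is illustrative only: should_merge_level2() is a hypothetical helper, not a Kinosearch function, and it assumes both sizes are measured in the same unit (the text does not say which).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Kinosearch): returns 1 when the auxiliary
# subkindex is larger than 10% of the primary subkindex, which is the point
# at which level 2 is described as escalating to a full (level 1) merge.
sub should_merge_level2 {
    my ( $primary_size, $auxiliary_size ) = @_;
    return $auxiliary_size > 0.10 * $primary_size ? 1 : 0;
}

print should_merge_level2( 1000, 50 ),  "\n";    # prints 0: 50 is under 10% of 1000
print should_merge_level2( 1000, 150 ), "\n";    # prints 1: 150 exceeds 10% of 1000
```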
- -language
-
The language of the documents being indexed. Options: the final part of any Search::Kinosearch::Lingua::Xx module name e.g. 'Es' (Spanish), 'Hr' (Croatian). [At present, only 'En' works.] This setting determines the algorithms used for stemming and tokenizing. See the Search::Kinosearch::Lingua documentation for details.
- -encoding
-
This doesn't do anything yet.
- -phrase_matching
-
If set to 1, word pairs will be indexed along with individual words. Enabling phrase matching at index-time is required for enabling phrase matching at search-time, because if the word pairs aren't in there, the phrase matching algorithm breaks. Disabling phrase matching reduces the size of the kindex considerably. See the Search::Kinosearch documentation for a discussion of Kinosearch's phrase-matching algorithm.
- -enable_datetime
-
Set this to 1 if you want to be able to assign datetimes to individual records (see Search::Kinosearch::Doc) and sort searches by datetime. Note that enabling datetime increases the size of your kindex, specifically the frequency data portion.
- -stoplist
-
The default stoplist for each language is defined in its Lingua module (e.g. Search::Kinosearch::Lingua::En). If you wish to use a custom stoplist, supply a hashref pointing to a hash where the keys are all stopwords.
- -max_kinodata_fs
-
Set the maximum size for a kinodata file, in bytes. At present, changing this won't do anything significant.
- -verbosity
-
Verbose (debugging) output. At present, this only tells you the progress of flock calls.
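As an illustration of the -stoplist parameter described above, a custom stoplist can be built from a plain word list. The word list here is made up for the example, and the hash values (1) are an assumption, since the text only requires that the keys be the stopwords.

```perl
use strict;
use warnings;

# Illustrative stopwords only; choose words that suit your own collection.
my @stopwords = qw( a an and of the );

# A hash whose keys are the stopwords. A reference to it (\%stoplist) is
# what would be passed as -stoplist => \%stoplist.
my %stoplist = map { $_ => 1 } @stopwords;

# Quick check of how such a hash filters a token list.
my @tokens = qw( the history of search engines );
my @kept   = grep { !exists $stoplist{$_} } @tokens;
print "@kept\n";    # prints "history search engines"
```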
define_field()
Kinosearch conceptualizes each document like a row in a database table: as a collection of discrete fields. Before you add any documents to the kindex, you must define attributes for each field.
$kindexer->define_field(
-name => 'category', # required (no default)
-store => 1, # default: 1
-score => 1, # default: 1
-weight => 2, # default: 1
-lowercase => 1, # default: 0
-tokenize => 1, # default: 0
-stem => 1, # default: 0
);
- -name
-
The name of the field. Can contain only [a-zA-Z_].
- -store
-
If -store is set to 1 (the default), the field's contents will be recorded in the kindex and available for retrieval at search-time. Examples: title, text, URL, and so forth would typically have -store set to 1, so that their contents could be used in the presentation of search results; keywords fields would most often have -store set to 0.
- -score
-
If -score is set to 1 (the default), the field's contents will be included in the kindex and considered by default when determining the score of a document against a given search phrase. Note that it is possible to issue field-specific queries at search-time, so the -score attribute is not the only tool for determining which fields are to be considered. Example: URL fields would typically have -score set to 0.
- -weight
-
If -weight is set to a number other than 1, then this field will contribute more heavily to a document's aggregate score for any term within it than would otherwise be the case. You can also weight a field more heavily at search-time, but since KSearch is slightly more efficient if it can use the aggregate score, apply field-weighting at index-time if you can. Note that if you perform field-weighting at search-time, any weighting that you set with define_field at index-time will be ignored.
- -lowercase
-
Lowercase the text to be indexed. (The copy of text to be stored will not be affected.)
- -tokenize
-
Tokenize the text to be indexed. Not all fields should be tokenized -- for example, there is rarely any point in tokenizing a URL.
- -stem
-
Stem the text to be indexed.
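Because field names can contain only [a-zA-Z_], it may help to check names before passing them to define_field(). The helper below is a hypothetical sketch, not part of Kinosearch, and it assumes that an empty name is also invalid.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Kinosearch): returns 1 if the name is
# non-empty and made up solely of the characters [a-zA-Z_], otherwise 0.
sub is_valid_field_name {
    my ($name) = @_;
    return ( defined($name) && $name =~ /\A[a-zA-Z_]+\z/ ) ? 1 : 0;
}

for my $name ( 'bodytext', 'body_text', 'body-text', 'field2' ) {
    printf "%-10s %s\n", $name,
        is_valid_field_name($name) ? 'ok' : 'rejected';
}
```

Under this check, 'bodytext' and 'body_text' pass, while 'body-text' and 'field2' are rejected because they contain a hyphen and a digit respectively.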
new_doc()
my $doc = $kindexer->new_doc( $doc_id );
Spawn a Search::Kinosearch::Doc object.
One argument is required: a unique identifier. The identifier could be a database primary key, a URL, a filepath, or anything else. If the document's contents change later and you wish to update the kindex to reflect that change, use the same identifier.
add_doc()
$kindexer->add_doc( $doc );
Add a document, in the form of a Search::Kinosearch::Doc object, to the kindex.
delete_doc()
$kindexer->delete_doc( $doc_id );
Delete a document from the kindex.
doc_is_indexed()
my $confirmation = $kindexer->doc_is_indexed( $doc_id );
Check for the existence of a document in the kindex.
generate()
$kindexer->generate();
Complete the kindex, but don't save it just yet. Note: depending on how many files have been indexed this pass and how much optimization has to take place, generate() can take a while.
write_kindex()
$kindexer->write_kindex();
Clear out existing kindex files as necessary -- all of them in overwrite mode, some of them in update mode -- and use File::Copy's move() command to transfer new files to their destinations.
BUGS
The -temp_directory must be on the same filesystem as -mainpath, or the move() operation may fail.
TO DO
Add more verbose output.
SEE ALSO
AUTHOR
Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com
COPYRIGHT
Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.