NAME
DBIx::KwIndex - create and maintain keyword indices in DBI tables
_________________________________________________________________
SYNOPSIS
package MyKwIndex;
use base 'DBIx::KwIndex';
sub document_sub { ... }
package main;
$kw = MyKwIndex->new({dbh => $dbh, index_name => 'myindex'})
or die "can't create index";
$kw->add_document ([1,2,3,...]) or die $kw->{ERROR};
$kw->remove_document([1,2,3,...]) or die $kw->{ERROR};
$kw->update_document([1,2,3,...]) or die $kw->{ERROR};
$docs = $kw->search({ words=>'upset stomach' });
$docs = $kw->search({ words=>'upset stomach', boolean=>'AND' });
$docs = $kw->search({ words=>'upset stomach', start=>11, num=>10 });
$docs = $kw->search({ words=>'upset (bite|stomach)', re=>1 });
$kw->add_stop_word(['the','an','am','is','are']) or die $kw->{ERROR};
$words = $kw->common_word(85);
$kw->remove_word(['gingko', 'bibola']) or die $kw->{ERROR};
$ndocs = $kw->document_count();
$nwords = $kw->word_count();
$kw->remove_index or die $kw->{ERROR};
$kw->empty_index or die $kw->{ERROR};
_________________________________________________________________
DESCRIPTION
DBIx::KwIndex is a keyword indexer. It indexes documents and stores
the index data in database tables. You can tell DBIx::KwIndex to index
lots of documents and later ask it which ones contain a certain
word. The typical application of DBIx::KwIndex is in a search engine.
How to use this module:
1. Provide a database handle.
use DBI;
my $dbh = DBI->connect(...) or die $DBI::errstr;
2. Subclass DBIx::KwIndex and provide a `document_sub' method to
retrieve documents referred to by an integer id. The method should
accept a list of document ids in an array reference and return the
documents in an array reference. In this way, you can index any
kind of document that you want: text files, HTML files, BLOB
columns, etc., as long as you provide a suitable document_sub()
to retrieve the documents. The one thing to remember is that each
document must be referred to by a unique integer id. Below is a
sample document_sub() that retrieves documents from the
'content' field of a database table.
package MyKwIndex;
require DBIx::KwIndex;
use base 'DBIx::KwIndex';
sub document_sub {
my ($self, $ary_ref) = @_;
my $dbh = $self->{dbh};
my $result = $dbh->selectall_arrayref(
'SELECT id,content FROM documents
WHERE id IN ('. join(',',@$ary_ref). ')');
# if retrieval fails, you should return undef
defined($result) or return undef;
# now return the content fields in the order of the ids
# requested. remember to return the documents exactly
# in the order requested!
my %tmp = map { $_->[0] => $_->[1] } @$result;
return [ @tmp{ @$ary_ref } ];
}
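The same contract applies to documents that do not live in a database. Below is a minimal sketch of a document_sub() for plain text files. The package name MyFileKwIndex, the $DOC_DIR variable, and the one-file-per-id naming scheme (<id>.txt) are assumptions made for illustration only; they are not part of DBIx::KwIndex.

```perl
use strict;
use warnings;

{
    package MyFileKwIndex;
    # In real use this package would also say: use base 'DBIx::KwIndex';

    # Assumption: one file per document, named <id>.txt, in this directory.
    our $DOC_DIR = 'docs';

    sub document_sub {
        my ($self, $ary_ref) = @_;
        my @docs;
        for my $id (@$ary_ref) {
            open(my $fh, '<', "$DOC_DIR/$id.txt")
                or return undef;          # retrieval failed
            local $/;                     # slurp the whole file
            my $content = <$fh>;
            close($fh);
            push @docs, $content;         # keeps the order of the ids requested
        }
        return \@docs;
    }
}
```

Because the files are read in the order of the ids passed in, the returned array reference is automatically in the order requested.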
3. Create the indexer object.
my $kw = MyKwIndex->new({
dbh => $dbh,
index_name => 'article_index',
# other options...
});
dbh is the database handle. index_name is the name of the index;
DBIx::KwIndex will create several tables, all of which are prefixed
with the index_name. The default index_name is 'kwindex'. Other
options include: max_word_length (default 32).
4. Index some documents. You can index one document at a time, e.g.
$kw->add_document([1]) or die $kw->{ERROR};
$kw->add_document([2]) or die $kw->{ERROR};
or small batches of documents at a time:
$kw->add_document([1..10]) or die $kw->{ERROR};
$kw->add_document([11..20]) or die $kw->{ERROR};
or large batches of documents at a time:
$kw->add_document([1..300]) or die $kw->{ERROR};
$kw->add_document([301..600]) or die $kw->{ERROR};
Which one to choose is a matter of memory-speed trade-off. Larger
batches will increase the speed of indexing, but with increased
memory usage.
Note: DBIx::KwIndex ignores single-character words, numbers, and
words longer than 'max_word_length'.
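The batch size discussed in step 4 can be kept in one place with a small helper. This is only a sketch: batches() is a hypothetical function, not something DBIx::KwIndex provides, and the batch size of 300 in the usage comment simply mirrors the example above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of DBIx::KwIndex): split a list of
# document ids into batches of at most $size ids each.
sub batches {
    my ($ids, $size) = @_;
    die "batch size must be positive" unless $size > 0;
    my @rest = @$ids;            # copy, so the caller's array is untouched
    my @batches;
    push @batches, [ splice(@rest, 0, $size) ] while @rest;
    return \@batches;
}

# Usage sketch: $kw is assumed to be an indexer object created as in step 3.
# for my $batch (@{ batches([1..600], 300) }) {
#     $kw->add_document($batch) or die $kw->{ERROR};
# }

print scalar(@{ batches([1..7], 3) }), " batches\n";   # prints "3 batches"
```

Tuning the memory-speed trade-off then means changing a single number.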
5. If you want to search the index, use the search() method.
$docs = $kw->search({ words => 'upset stomach' });
die "can't search" if !defined($docs);
The search() method will return an ARRAY ref containing the
ids of the documents that match the criteria. Other parameters include:
num => maximum number of results to retrieve; start => starting
position (1 = from the beginning); boolean => 'AND' or 'OR'
(default is 'OR'); re => use regular expression, 1 or 0.
Note: num and start use the LIMIT clause, which is not part of
standard SQL (it is supported by MySQL). re uses the REGEXP
operator. Do not use these options if your database server does not
support them.
Also note: Searching is done entirely from the index. No documents
are retrieved while searching. A simple 'relevancy' ranking is
used. Search is case-insensitive and there is no phrase-search
support yet.
Some examples:
# retrieve only the 11th-20th results.
$docs = $kw->search({ words=>'upset stomach', start=>11, num=>10 });
die "can't search" if !defined($docs);
# find documents which contain all the words.
$docs = $kw->search({ words=>'upset stomach', boolean=>'AND' });
die "can't search" if !defined($docs);
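Since start is 1-based, paging through results takes a little arithmetic. The sketch below shows one way to derive start and num from a page number; page_params() is a hypothetical helper introduced here for illustration, not a method of DBIx::KwIndex.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a 1-based page number into the
# start/num parameters of search() (start is 1-based as well).
sub page_params {
    my ($page, $per_page) = @_;
    return (
        start => ($page - 1) * $per_page + 1,
        num   => $per_page,
    );
}

my %p = page_params(2, 10);
print "start=$p{start} num=$p{num}\n";   # prints "start=11 num=10"

# Usage sketch, assuming $kw was created as in step 3:
# my $docs = $kw->search({ words => 'upset stomach', page_params(2, 10) });
# die "can't search" if !defined($docs);
```

Page 2 with 10 results per page gives the same start=>11, num=>10 as the first example above.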
6. Now suppose some documents change, and you need to update the
index to reflect that. Just use the methods below.
# if you want to remove documents from index
$kw->remove_document([90..100]) or die $kw->{ERROR};
# if you want to update the index
$kw->update_document([90..100]) or die $kw->{ERROR};
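Deciding which ids to pass to each method is left to the application. As a rough sketch, assuming you keep a snapshot of id => last-modified time from the previous indexing run, the comparison could look like this. plan_changes() and both snapshot hashes are hypothetical and not part of DBIx::KwIndex.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two snapshots (id => last-modified time)
# and work out which documents to add, update, or remove.
sub plan_changes {
    my ($indexed, $current) = @_;
    my (@add, @update, @remove);
    for my $id (sort { $a <=> $b } keys %$current) {
        if    (!exists $indexed->{$id})            { push @add,    $id }
        elsif ($indexed->{$id} != $current->{$id}) { push @update, $id }
    }
    for my $id (sort { $a <=> $b } keys %$indexed) {
        push @remove, $id unless exists $current->{$id};
    }
    return { add => \@add, update => \@update, remove => \@remove };
}

my $plan = plan_changes(
    { 1 => 100, 2 => 100, 3 => 100 },   # state at the last indexing run
    { 1 => 100, 2 => 200, 4 => 50  },   # state now
);
print "add: @{ $plan->{add} }; update: @{ $plan->{update} }; ",
      "remove: @{ $plan->{remove} }\n";

# Usage sketch, assuming $kw was created as in step 3:
# $kw->add_document($plan->{add})       or die $kw->{ERROR} if @{ $plan->{add} };
# $kw->update_document($plan->{update}) or die $kw->{ERROR} if @{ $plan->{update} };
# $kw->remove_document($plan->{remove}) or die $kw->{ERROR} if @{ $plan->{remove} };
```

In the example, document 4 is new, document 2 has changed, and document 3 has disappeared.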
_________________________________________________________________
SOME UTILITY METHODS
If you want to exclude some words (usually very common words, or
``stop words'') from being indexed, do this before you index any
document:
$kw->add_stop_word(['the','an','am','is','are'])
or die "can't add stop words";
Adding stop words is a good thing to do, as stop words are not very
useful for your index. They occur in a large proportion of documents
(so they do not help searches differentiate documents) and they increase
the size of your index (slowing the searches).
But which words are common in your collection? You can use the
common_word method:
$words = $kw->common_word(85);
This will return an array reference containing all the words that
occur in at least 85% of all documents (default is 80%).
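To make the meaning of the percentage concrete, here is a small in-memory illustration of the same idea. It is not how DBIx::KwIndex computes the result (the module works from its index tables); common_words_in() and its simple \w+ word splitting are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Illustration only: return the words that occur in at least $pct percent
# of the given documents (each document is a plain string).
sub common_words_in {
    my ($docs, $pct) = @_;
    $pct = 80 unless defined $pct;          # same default as common_word()
    my %doc_freq;
    for my $doc (@$docs) {
        my %seen;
        $seen{ lc $_ }++ for $doc =~ /(\w+)/g;
        $doc_freq{$_}++ for keys %seen;     # count each word once per document
    }
    my $n = @$docs;
    my @hits = grep { $doc_freq{$_} / $n * 100 >= $pct } keys %doc_freq;
    return [ sort @hits ];
}

my $common = common_words_in(
    [ 'the cat sat', 'the dog ran', 'the bird flew', 'a fish swam' ],
    75,
);
print "@$common\n";   # prints "the"
```

Here 'the' appears in 3 of 4 documents (75%), so it passes a 75% threshold, while every other word appears in only one document.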
If you want to delete some words from the index:
$kw->remove_word(['common','cold'])
or die "can't remove words";
To get some statistics about your index:
# the number of documents
$ndocs = $kw->document_count();
# the number of words
$nwords = $kw->word_count();
Last, if you got bored with the index and want to delete it:
$kw->remove_index or die $kw->{ERROR};
This will delete the database tables. Or, if you just want to empty
the index and start all over:
$kw->empty_index or die $kw->{ERROR};
_________________________________________________________________
AUTHOR
Steven Haryanto <steven@haryan.to>
_________________________________________________________________
COPYRIGHT
Copyright (c) 1995-1999 Steven Haryanto. All rights reserved.
You may distribute under the terms of either the GNU General Public
License or the Artistic License, as specified in the Perl README file.
_________________________________________________________________
BUGS/CAVEATS/TODOS
Test the module under other database servers (besides MySQL).
Use a more correct search sorting (the current one is kinda bogus :).
Probably implement phrase-searching (but this will require a larger
vectorlist).
Probably, maybe, implement English/Indonesian stemming.
Any safer, non database-specific way to test existence of tables other
than $dbh->tables?
_________________________________________________________________
NOTES
At least two other Perl extensions exist for creating keyword indices
and storing them in a database: DBIx::TextIndex and MyConText. As of
this writing, only DBIx::TextIndex features phrase-searching and
boolean NOT, and only DBIx::KwIndex offers a way to delete documents
from the index (but please see the updated versions and documentation
for details). I personally find DBIx::KwIndex more convenient when I need
to index documents that change often, because one can add or remove some
documents without rebuilding the entire index.
Advice, comments, and patches are welcome.
_________________________________________________________________
HISTORY
0001xx=first draft,satunet.com. 000320=words->scalar.
000412=0.01/documentation/cpan.