NAME
DBIx::TextIndex - Perl extension for full-text searching in SQL
databases
SYNOPSIS
use DBIx::TextIndex;
$index = DBIx::TextIndex->new({
index_dbh => $index_dbh,
collection => 'collection_name',
doc_fields => ['field1', 'field2'],
});
$index->initialize();
$index->add( key1 => { field1 => 'some text', field2 => 'more text' } );
$results = $index->search({
field1 => '"a phrase" +and -not or',
field2 => 'more words',
});
foreach my $key
(sort {$$results{$b} <=> $$results{$a}} keys %$results )
{
print "Key: $key Score: $$results{$key} \n";
}
DESCRIPTION
DBIx::TextIndex was developed for doing full-text searches on BLOB
columns stored in a database. Almost any database with BLOB and DBI
support should work with minor adjustments to SQL statements in the
module. MySQL, PostgreSQL, and SQLite are currently supported.
As of version 0.24, data from any source can be indexed by passing it to
the "add()" method as a string.
INDEX CREATION
Preparing an index for use for the first time
To set up a new index, call "new()", followed by "initialize()".
$index = DBIx::TextIndex->new({
index_dbh => $dbh,
collection => 'my_books',
doc_fields => [ 'title', 'author', 'text' ],
});
$index->initialize();
"initialize()" should only be called the first time a new index is
created. Calling initialize a second time with the same collection name
will delete and re-create the index.
The "doc_fields" attribute specifies which fields of a document are
contained in the index. This decision must be made at initialization --
additional document fields cannot be added to the index later.
After the index is initialized once, subsequent calls to "new()" require
only the "index_dbh" and "collection" arguments.
$index = DBIx::TextIndex->new({
index_dbh => $dbh,
collection => 'my_books',
});
Adding documents to the index
Every document is made up of fields, and has a unique key that is
returned with search results.
$index->add( book1 => {
author => 'Leo Tolstoy',
title => 'War and Peace',
text => '"Well, Prince, so Genoa and Lucca ...',
},
book2 => {
author => 'J.R.R. Tolkien',
title => 'The Hobbit',
text => 'In a hole in the ground there lived ...',
},
);
With each call to "add()", the index is written to tables in the
underlying SQL database.
When adding many documents in a loop, use "begin_add()" and
"commit_add()" around the loop. This will increase indexing performance
by delaying writes to the SQL database until "commit_add()" is called.
$index->begin_add();
while ( my ($book_id, $author, $title, $text) = fetch_doc() ) {
$index->add( $book_id => { author => $author,
title => $title,
text => $text } );
}
$index->commit_add();
Indexing data in SQL tables
DBIx::TextIndex has additional convenience methods for indexing data
contained in SQL tables. Before calling "initialize()", also set the
"doc_dbh", "doc_table", and "doc_id_field" attributes:
$index = DBIx::TextIndex->new({
index_dbh => $dbh,
collection => 'my_books',
doc_dbh => $doc_dbh,
doc_table => 'book',
doc_id_field => 'book_id',
doc_fields => [ 'title', 'author', 'text' ],
});
$index->initialize();
After initialization, subsequent creation of index objects requires
only the "index_dbh", "collection", and "doc_dbh" arguments:
$index = DBIx::TextIndex->new({
index_dbh => $dbh,
collection => 'my_books',
doc_dbh => $doc_dbh,
});
Passing an array of ids to "add_doc()" indexes the "doc_fields"
(columns) in "doc_table" matched using the "doc_id_field" column.
$index->add_doc(1, 2, 3);
"add_doc()" creates SQL statements to retrieve data from the document
table before adding to the index. In the above example, a series of
statements like "SELECT title, author, text FROM book WHERE book_id = 1"
would be issued.
If more flexibility is needed, data could be fetched first and passed to
the "add()" method instead. For example, a multi-table JOIN could be
issued or several columns could be concatenated into a single index
field.
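The fragment below sketches that approach in plain Perl. The row data, column names, and doc keys are invented for illustration, and the hard-coded rows stand in for data fetched via DBI; only the final (commented) add() call is part of the DBIx::TextIndex API.

```perl
# Hypothetical sketch: combine several fetched columns into a single
# index field before calling add(). The rows below stand in for data
# fetched via DBI (e.g. from a multi-table JOIN).
use strict;
use warnings;

my @rows = (
    { book_id => 1, title => 'War and Peace', subtitle => '',
      author  => 'Leo Tolstoy' },
    { book_id => 2, title => 'The Hobbit', subtitle => 'There and Back Again',
      author  => 'J.R.R. Tolkien' },
);

my %docs;
for my $row (@rows) {
    # Concatenate title and subtitle into one "title" index field,
    # skipping empty columns.
    my $title = join ' ', grep { length } $row->{title}, $row->{subtitle};
    $docs{ "book" . $row->{book_id} } = {
        title  => $title,
        author => $row->{author},
    };
}

# With a real index object, each document would then be added with:
# $index->add( $_ => $docs{$_} ) for keys %docs;
print "$_: $docs{$_}{title}\n" for sort keys %docs;
```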
QUERY SYNTAX
FIXME: This section is incomplete.
Searches are case insensitive.
Phrase Searches
Enclose phrases in double quotes:
"See Spot run"
Proximity Searches
Use the tilde "~" operator at the end of a phrase to find words within
a certain distance.
"some phrase"~1 - matches only exact "some phrase"
"some phrase"~2 - matches "some other phrase"
"some phrase"~10 - matches "some [1..9 words] phrase"
Defaults to "~1" when omitted, which is a normal phrase search.
The proximity match works from left to right, which means ""some
phrase"~3" does not match "phrase other some" or "phrase some".
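These rules can be illustrated with a stand-alone sketch. This is only a demonstration of the matching semantics, not the module's internal implementation (which uses a compact position index); proximity_match() is an invented helper.

```perl
# Sketch: decide a left-to-right proximity match like "some phrase"~N
# from word positions in a text.
use strict;
use warnings;

sub proximity_match {
    my ($text, $phrase, $n) = @_;
    my @words   = split ' ', lc $text;
    my @targets = map { lc } @$phrase;

    # Positions at which each phrase word occurs in the text.
    my @pos;
    for my $t (@targets) {
        push @pos, [ grep { $words[$_] eq $t } 0 .. $#words ];
        return 0 unless @{ $pos[-1] };
    }

    # Depth-first search: each word must lie 1..$n positions to the
    # right of the previous one -- matching is left-to-right only.
    my $search;
    $search = sub {
        my ($i, $prev) = @_;
        return 1 if $i > $#targets;
        for my $p (@{ $pos[$i] }) {
            next if $p <= $prev || $p - $prev > $n;
            return 1 if $search->($i + 1, $p);
        }
        return 0;
    };
    for my $start (@{ $pos[0] }) {
        return 1 if $search->(1, $start);
    }
    return 0;
}

# "some phrase"~2 matches "some other phrase":
print proximity_match('some other phrase', ['some', 'phrase'], 2)
    ? "match\n" : "no match\n";
```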
Wildcard Partial-Term Searches
You can use wildcard characters "*" or "?" at the end of or in the
middle of search terms:
"*" matches zero or more characters
car* - "car", "cars", "careful", "cartel", ....
ca*r - "car", "career", "caper", "cardiovascular"
"?" matches any single character
car? - "care", "cars", "cart"
d?g - "dig", "dog", "dug"
"+" at the end matches singular or plural form (naively, by appending an
's' to the word)
car+ - "car", "cars"
By default, at least 1 alphanumeric character must appear before the
first wildcard character. The option "min_wildcard_length" can be
changed to require more alphanumeric characters before the first
wildcard.
The option "max_wildcard_term_expansion" specifies the maximum number of
words a wildcard term can expand to before throwing a query exception.
The default is 30 words.
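One way to picture these rules is to translate a wildcard term into a Perl regex. This is a sketch only: wildcard_to_regex() is an invented helper, not the module's parser, and '\w' is used as an approximation of "any character" since terms are single words.

```perl
# Sketch: translate the documented wildcard syntax into a regex.
# '*' matches zero or more characters, '?' matches exactly one, and a
# trailing '+' naively allows an optional 's' (singular or plural).
use strict;
use warnings;

sub wildcard_to_regex {
    my ($term) = @_;
    my $re = '';
    for my $c (split //, $term) {
        if    ($c eq '*') { $re .= '\w*' }
        elsif ($c eq '?') { $re .= '\w'  }
        elsif ($c eq '+') { $re .= 's?'  }
        else              { $re .= quotemeta $c }
    }
    return qr/^$re$/i;   # searches are case insensitive
}

print "careful" =~ wildcard_to_regex('car*') ? "match\n" : "no match\n";
```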
USAGE
The following methods are available:
"new()"
$index = DBIx::TextIndex->new(\%args)
Constructor method, accepts args as a hashref. The first time an index
is created, "index_dbh", "collection", and "doc_fields" must be passed.
For subsequent calls to new(), only "index_dbh" and "collection" are
required.
To index documents using "add_doc()", "doc_dbh", "doc_table", and
"doc_id_field" are also required for initialization. "doc_dbh" is
required each time the index is used to add documents.
Other arguments are optional.
"new()" accepts these arguments:
index_dbh
index_dbh => $index_dbh
DBI connection handle used to store tables for DBIx::TextIndex. Use
a separate database if possible to avoid name collisions with
existing tables.
collection
collection => $collection
A name for the index. Should contain only alpha-numeric characters
or underscores [A-Za-z0-9_]. Limited to 100 characters.
doc_dbh
doc_dbh => $doc_dbh
A DBI connection handle to database containing text documents
doc_table
doc_table => $doc_table
Name of database table containing text documents
doc_fields
doc_fields => \@doc_fields
An arrayref of fields contained in the index. If using "add_doc()",
lists column names to be indexed in "doc_table".
doc_id_field
doc_id_field => $doc_id_field
Name of an integer key column in "doc_table". Must be a primary or
unique key.
proximity_index
proximity_index => 1
Enables index structure to support phrase and proximity searches.
Default is on (1), pass 0 to turn off.
errors
errors => {
empty_query => "your query was empty",
quote_count => "phrases must be quoted correctly",
no_results => "your search did not produce any results",
no_results_stop => "no results, these words were stoplisted: ",
wildcard_length =>
"Use at least one letter or number at the beginning " .
"of the word before wildcard characters.",
wildcard_expansion =>
"The wildcard term you used was too broad, " .
"please use more characters before or after the wildcard",
}
This hash reference can be used to override default error messages.
charset
charset => 'iso-8859-1'
Default is 'iso-8859-1'.
Accented characters are converted to ASCII equivalents based on the
charset.
Pass 'iso-8859-2' for Czech or other Slavic languages.
Only iso-8859-1 and iso-8859-2 have been tested.
stoplist
stoplist => [ 'en' ]
Activates stoplisting of very common words that are present in
almost every document. The default is no stoplisting. The value of
the parameter is a reference to an array of lowercase two-letter
language codes. Currently only two stoplists exist:
en - English
cz - Czech
Stoplisting is usually not recommended because certain queries
containing common words cannot be resolved, such as: "The Who" or
"To be or not to be." DBIx::TextIndex is optimized well enough that
the performance gains from stoplisting are minimal.
max_word_length
max_word_length => 20
Specifies maximum word length resolution. Defaults to 20 characters.
phrase_threshold
phrase_threshold => 1000
If "proximity_index" is turned off, and documents were indexed with
"add_doc()", and "doc_dbh" is available, some phrase queries can be
resolved by scanning the original document rows with a LIKE
'%phrase%' query. The phrase threshold is the maximum number of rows
that will be scanned.
It is recommended that the "proximity_index" option always be used,
because it is more efficient than scanning rows, and it is not
limited to documents added using "add_doc()".
decode_html_entities
decode_html_entities => 1
Decode HTML entities before indexing documents (e.g. "&amp;" is converted to "&").
Default is 1.
print_activity
print_activity => 0
Activates STDOUT debugging. Higher value increases verbosity.
update_commit_interval
update_commit_interval => 20000
When indexing a large number of documents using "add_doc()" or
"add()" inside a "begin_add()" / "commit_add()" block, this setting
will trigger an automatic commit to the database when the number of
added documents exceeds this number.
Setting this higher will increase indexing speed, but also increase
memory usage. In tests, the default setting of 20000 when indexing
10KB documents results in about 500MB of memory used.
doc_key_sql_type
doc_key_sql_type => 'varchar'
SQL datatype to store doc keys, defaults to varchar. If only numeric
keys are required, this could be changed to an integer type for more
compact storage.
doc_key_length
doc_key_length => 200
The maximum length of a doc_key.
After creating a new TextIndex for the first time, and after calling
initialize(), only the index_dbh, doc_dbh, and collection arguments are
needed to create subsequent instances of a TextIndex.
"initialize()"
$index->initialize()
This method creates all the inverted tables for DBIx::TextIndex in the
database specified by index_dbh. This method should be called only once
when creating an index for the first time. It drops all the inverted
tables before creating new ones.
"initialize()" also stores the "doc_table", "doc_fields",
"doc_id_field", "charset", "stoplist", and "errors" attributes, along
with the "proximity_index", "max_word_length", "phrase_threshold", and
"min_wildcard_length" preferences, in a special table called
"collection", so subsequent calls to "new()" for a given collection do
not need those arguments.
Calling "initialize()" will upgrade the collection table created by
earlier versions of DBIx::TextIndex if necessary.
"add()"
$index->add($doc_key, \%doc_fields)
Indexes a document represented by a hashref, where the keys of the hash
are field names and the values are strings to be indexed. When
"search()" is called and a hit for that document is scored, $doc_key
will be returned in the search results.
"begin_add()"
Before performing a large number of "add()" operations in a loop, call
"begin_add()" to delay writing to the database until "commit_add()" is
called. If "begin_add()" is not called, "add()" runs in an
"autocommit" mode.
Has no effect if using "add_doc()" method instead of "add()".
The "update_commit_interval" parameter defines an upper limit on the
number of documents held in memory before being committed to the
database. If the limit is reached, the changes to the index will be
committed at that point.
"commit_add()"
Commits a group of "add()" operations to the database. It is only
necessary to call this if "begin_add()" was called first.
"add_doc()"
$index->add_doc(\@doc_ids)
Adds the rows of "doc_table" whose "doc_id_field" values match
@doc_ids to the index. Reads from the database handle specified by
"doc_dbh".
If @doc_ids references documents that are already indexed, those
documents will be re-indexed.
"remove()"
$index->remove(\@doc_keys)
@doc_keys can be a list of doc keys originally passed to "add()" or the
numeric doc ids used for "add_doc()".
The disk space used for the removed doc keys is not recovered, so an
index rebuild is recommended after a significant number of documents
are removed.
"search()"
$results = $index->search(\%args)
"search()" returns $results, a hash reference. The keys of the hash are
doc ids, and the values are the relative scores of the documents. If an
error occurred while searching, "search()" will throw a
DBIx::TextIndex::Exception::Query object.
eval {
    $results = $index->search({
        first_field => '+andword -notword orword "phrase words"',
        second_field => ...
        ...
    });
};
if ($@) {
    if ($@->isa('DBIx::TextIndex::Exception::Query')) {
        print "No results: " . $@->error . "\n";
    } else {
        # Something more drastic happened
        $@->rethrow;
    }
} else {
    foreach my $doc_key (keys %$results) {
        print "The score for $doc_key is $results->{$doc_key}\n";
    }
}
"unscored_search()"
$doc_ids = $index->unscored_search(\%args)
unscored_search() returns $doc_ids, a reference to an array. Since the
scoring algorithm is skipped, this method is much faster than search().
A DBIx::TextIndex::Exception::Query object will be thrown if the query
is bad or no results are found.
eval {
    $doc_ids = $index->unscored_search({
        first_field => '+andword -notword orword "phrase words"',
        second_field => ...
    });
};
if ($@) {
    if ($@->isa('DBIx::TextIndex::Exception::Query')) {
        print "No results: " . $@->error . "\n";
    } else {
        # Something more drastic happened
        $@->rethrow;
    }
} else {
    print "Here are all the doc ids:\n";
    print "$_\n" for @$doc_ids;
}
"optimize()"
FIXME: Implementation not complete
"delete()"
$index->delete()
"delete()" removes the tables associated with a TextIndex from
index_dbh.
"stat()"
Allows you to obtain some meta information about the index. Accepts one
parameter that specifies what you want to obtain.
$index->stat('total_words')
Returns the total count of words in the index. This number may differ
from the total count of words in the documents themselves.
"upgrade_collection_table()"
$index->upgrade_collection_table()
Upgrades the collection table to the latest format. Usually does not
need to be called by the programmer, because initialize() handles
upgrades automatically.
BOOLEAN SEARCH MASKS
DBIx::TextIndex can apply boolean operations on arbitrary lists of doc
ids to search results.
Take this table:
doc_id category doc_full_text
1 green full text here ...
2 green ...
3 blue ...
4 red ...
5 blue ...
6 green ...
Masks representing the doc ids in each of the three categories can be
created:
"add_mask()"
$index->add_mask($mask_name, \@doc_ids);
$index->add_mask('green_category', [ 1, 2, 6 ]);
$index->add_mask('blue_category', [ 3, 5 ]);
$index->add_mask('red_category', [ 4 ]);
The first argument is an arbitrary string, and the second is a reference
to an array of doc ids that the mask name identifies.
Mask operations are passed in a second argument hash reference to
$index->search():
%query_args = (
first_field => '+andword -notword orword "phrase words"',
second_field => ...
...
);
%args = (
not_mask => \@not_mask_list,
and_mask => \@and_mask_list,
or_mask => \@or_mask_list,
or_mask_set => [ \@or_mask_list_1, \@or_mask_list_2, ... ],
);
$index->search(\%query_args, \%args);
not_mask
For each mask in the not_mask list, the intersection of the search
query results and all documents not in the mask is calculated.
From our example above, to narrow search results to documents not in
green category:
$index->search(\%query_args, { not_mask => ['green_category'] });
and_mask
For each mask in the and_mask list, the intersection of the search
query results and all documents in the mask is calculated.
This would return results only in the blue category:
$index->search(\%query_args,
{ and_mask => ['blue_category'] });
Instead of using named masks, lists of doc ids can be passed on the
fly as array references. This would give the same results as the
previous example:
my @blue_ids = (3, 5);
$index->search(\%query_args,
{ and_mask => [ \@blue_ids ] });
or_mask_set
With the or_mask_set argument, the union of all the masks in each
list is computed individually, and then the intersection of each
union set with the query results is calculated.
or_mask
An or_mask is treated as an or_mask_set with only one list. In this
example, the union of blue_category and red_category is taken, and
then the intersection of that union with the query results is
calculated:
$index->search(\%query_args,
{ or_mask => [ 'blue_category', 'red_category' ] });
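The set semantics of the mask operations can be sketched with plain Perl hashes. This is an illustration only: the module itself implements these operations with Bit::Vector, and the helper names and sample scores below are invented.

```perl
# Sketch: not_mask, and_mask, and or_mask semantics on plain doc-id
# lists, using the green/blue/red example table from above.
use strict;
use warnings;

my %results = ( 1 => 0.9, 3 => 0.7, 4 => 0.5, 6 => 0.2 );  # doc_id => score
my @green = (1, 2, 6);
my @blue  = (3, 5);
my @red   = (4);

# not_mask: keep only results that are NOT in the mask.
sub apply_not_mask {
    my ($results, @mask) = @_;
    my %in_mask = map { $_ => 1 } @mask;
    return { map  { $_ => $results->{$_} }
             grep { !$in_mask{$_} } keys %$results };
}

# and_mask: keep only results that ARE in the mask.
sub apply_and_mask {
    my ($results, @mask) = @_;
    my %in_mask = map { $_ => 1 } @mask;
    return { map  { $_ => $results->{$_} }
             grep { $in_mask{$_} } keys %$results };
}

# or_mask: take the union of the listed masks, then intersect that
# union with the results.
my %union     = map { $_ => 1 } @blue, @red;
my $or_masked = { map  { $_ => $results{$_} }
                  grep { $union{$_} } keys %results };

my $not_green = apply_not_mask(\%results, @green);  # docs 3 and 4
my $only_blue = apply_and_mask(\%results, @blue);   # doc 3
```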
"delete_mask()"
$index->delete_mask($mask_name);
Deletes a single mask from the mask table in the database.
RESULTS HIGHLIGHTING
The HTML::Highlight module can be used either independently or together
with DBIx::TextIndex for this task.
HTML::Highlight provides Google-like highlighting, using different
colors for different words or phrases, and can also be used to preview
the context in which the query words appear in the resulting documents.
The module works together with DBIx::TextIndex using its new method
html_highlight().
See the example script 'html_search.cgi' in the 'examples/' directory
of the DBIx::TextIndex distribution, or refer to the documentation of
HTML::Highlight for more information.
AUTHOR
Daniel Koch, dkoch@bizjournals.com.
COPYRIGHT
Copyright 1997-2004 by Daniel Koch. All rights reserved.
LICENSE
This package is free software; you can redistribute it and/or modify it
under the same terms as Perl itself, i.e., under the terms of the
"Artistic License" or the "GNU General Public License".
DISCLAIMER
This package is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the "GNU General Public License" for more details.
ACKNOWLEDGEMENTS
Thanks to Jim Blomo, for PostgreSQL patches.
Thanks to the lucy project (http://www.seg.rmit.edu.au/lucy/) for ideas
and code for the Okapi scoring function.
Simon Cozens' Lucene::QueryParser module was adapted to create the
DBIx::TextIndex QueryParser module.
Special thanks to Tomas Styblo for the first version of the proximity
index, Czech language support, stoplists, highlighting, document
removal, and many other improvements.
Thanks to Ulrich Pfeifer for ideas and code from the Man::Index module,
described in the "Information Retrieval, and What pack 'w' Is For"
article in The Perl Journal, vol. 2 no. 2.
Thanks to Steffen Beyer for the Bit::Vector module, which enables fast
set operations in this module. Version 5.3 or greater of Bit::Vector is
required by DBIx::TextIndex.
BUGS
Documentation is not complete.
Please feel free to email me (dkoch@bizjournals.com) with any questions
or suggestions.
SEE ALSO
perl(1).
CHANGES
Revision history for Perl extension DBIx::TextIndex.
0.24
WARNING: indexes must be recreated, format has changed
Ability to reindex existing documents
New methods add(), begin_add(), commit_add() with support for arbitrary
keys and indexing of data from sources other than a database table
Made individual database drivers subclasses of DBIx::TextIndex::DBD
for easier porting. A new driver can override selected database-specific
methods instead of having to rewrite all methods.
Check for division by zero in _resolve_plural()
Bug fix for pathological case where only query term is a NOT term
0.23
SQLite support
Changed wildcard syntax slightly: a '?' character matches any
single character. A '+' at the end of a word behaves like the '?'
character in previous versions, matching the singular or plural
forms of the word.
Wildcard characters '*' and '?' can now be used in the middle
of a query term, as well as at the end of a query term.
Added error messages error_wildcard_length and
error_wildcard_expansion, collection table will be upgraded
Added argument 'max_wildcard_term_expansion' to new(). This specifies
the maximum number of words a wildcard term can expand to before
throwing an exception. The default is 30 words.
Fixed incorrect results for Okapi-scored results with wildcard terms
Fixed segfault if position index structure is corrupted
Fixed bug occurring when add_doc() is followed by search() on same
instance of $index
0.22
WARNING: indexes must be recreated from scratch.
Rewrote proximity index code from scratch using XS. New index format
is faster, more compact, and scalable. Flag proximity_index
now defaults to 1.
Removed legacy_tf_idf scoring method.
Fixed division-by-zero error with wildcard searches
Added support for PostgreSQL
0.21
Fixed bug where proximity search was case-sensitive instead of
case-insensitive
Fixed problem where some punctuation characters in query were
treated as query terms instead of ignoring the characters
0.20
Added inline fielded queries, e.g.:
$index->search( field1 => 'field2:this field3:"that phrase" foo bar' );
In previous version, this search could only be represented like this:
$index->search( field1 => 'foo bar',
field2 => 'this',
field3 => '"that phrase"');
The advantage of the new syntax is that fields can be combined in
arbitrary boolean expressions, where previously only the union of the
fields was returned. For example:
(field1:this OR field2:that) AND (field3:"foo bar" OR field1:foo)
Wild and plural search bug fixes
Turned proximity search back on, changed syntax. For example, to
find "foo" followed by "bar" with 0, 1 or 2 words between:
"foo bar"~3
0.19
Rewrote query parser. Now we support nested subqueries in parentheses,
AND/OR/NOT operators, and existing '+/-' operators
(this OR that) AND (foo OR bar) +"this phrase" -"not this phrase"
Changed prefix query syntax slightly: '*' is now prefix operator, '?'
is plural operator:
dog* matches dog, dogs, dogma, dogmatic, dogmatically
cat? matches cat, cats
Rewrote critical loop of Okapi scoring as XSUB. Search over large
collections is much faster.
0.18
Fixed bug where Okapi function did not score multiple fields
Tweaks to improve boolean search performance.
Added periodic flush of hash tables caches to save memory
in long-running instances.
Used a more compact representation of docweights; index needs to be
recreated.
0.17
Reduced memory usage by TermDocsCache
Implemented Okapi BM25 weighting function [S.E. Robertson et al., Google
for description] for scoring documents. Pass scoring_method =>
'legacy_tfidf' to new() or search() to use older method. Thanks to
the lucy project (http://www.seg.rmit.edu.au/lucy/) for inspiration.
Added docweights table to support Okapi BM25 scoring. Indexes need to
be dropped and recreated.
Pass "update_commit_interval" to new() to control memory usage in
index updates. As the index is built, postings lists are
built up in memory; this parameter controls the number of documents
held in memory before being flushed to the database. Default is 20000.
To disable, pass 0, and all documents passed to add_doc() will be
indexed in memory before writing to the database.
Fixed problem with inaccurate search results if multiple searches were
performed on a single instance while another instance was updating the index.
0.16
Added XSUB pack_term_docs_append_vint to speed up _commit_docs when
adding new postings to existing compressed postings
0.15
WARNING: indexes will have to be recreated after upgrading to this version
Changed inverted list tables: renamed "docs" to "term_docs" and removed
redundant column "docs_vector". Bit vectors are now built from postings
stored in term_docs column. Total index size should be reduced
significantly with removal of docs_vector column.
Implemented DBIx::TextIndex::TermDocsCache to reduce database queries
Fixed bug where single word queries for high doc frequency terms would
return no results
0.14
Fixed memory leak in integer decoding
0.13
Changed format for inverted list postings to compressed binary
format. Index sizes should be 50% smaller than with previous versions.
Implemented XS code for fast coding and decoding. This format is similar
to the format in Lucene (http://jakarta.apache.org/lucene/).
Renamed 'freq_d' to 'docfreq_t'.
Bug fix: add_doc will only set bits in all_docs_vector if document
was actually added
UPGRADE warning, indexes will have to be rebuilt.
0.12
Added _optimize_or_search(). If query contains just OR words,
turn the rarest words into AND words to reduce result set size
before scoring. Practically, this makes queries behave more like
"all of the words" instead of "any of the words," which seems to
be what the average user expects.
Changed all symbols 'occurence' to 'freq_d' (frequency of docs in
entire collection that contain term)
Multi-field queries will return the intersection of all fields
instead of the union.
Added method all_doc_ids().
There is no longer any need to sort doc_ids before passing
to add_doc().
Added more tests. WARNING: MySQL database 'test' must be available
on localhost for tests to succeed.
UPGRADE WARNING: collection table format changed, new index table
collection_all_docs_vector added, some field names have changed in
inverted tables. Any indexes created with 0.11 or earlier will have
to be deleted and recreated.
UPGRADE WARNING: all symbols with "document" have been renamed to
"doc" for brevity. Methods have also been renamed, e.g. add_document()
is now add_doc(). The old method names will work, but are deprecated.
Replaced option 'language' with 'charset'. iso-8859-1 is the
default charset.
Added call to Text::Unaccent::unac_string (www.senga.org) to replace
accented characters with plain ASCII equivalent. Uses 'charset' option
to determine mapping.
UPGRADE WARNING: added structured exceptions using Exception::Class.
Calls to search() now have to be wrapped with eval blocks to catch
query exceptions.
0.11
Bug fix: HTML tags are now changed to a single space, instead of
empty string when indexing document. Prevents concatenation of words
in some cases.
0.10
Fixed collection table upgrade bug
0.09
Changed $MAX_WORD_LENGTH default to 20
Allow numbers to be indexed as words
Use HTML::Entities to decode entities in indexed documents, on by default. Set option decode_html_entities to 0 to disable.
Use $dbh->tables to check for existence of tables (caution, may not
work in DBI > 1.30).
0.08
Bug fix: add_mask() was not inserting masks
0.07
UPGRADE WARNING: collection table format changed, use new method
$index->upgrade_collection_table() to recreate collection table.
Calling initialize() method for a new collection will also upgrade
collection table. Index backup recommended.
Added error_ prefix to error message column names in collection table
Added version column to collection table
Added language column to collection table, removed czech_language column
UPGRADE WARNING: instead of new({ czech_language => 1}), use
new({ language => 'cz' })
Bug fix: _store_collection_info() error if stop lists are not used
unscored_search() will now return a scalar error message
if an error occurs in search
search() will croak if passed an invalid field name to search on
Added documentation for mask operations
0.06
tripie's patch v2 updates:
- a bug in the document removal process related to incorrect
'occurency' data updates when multi-field documents were removed
was fixed. The methods remove_document() and _inverted_remove()
were affected.
- a bug related to wildcards in queries in form of "+word +next%"
or "+word% +next%" was fixed
- a bug related to "%" wildcards used while searching of
multi-field documents was fixed
- a bug related to stoplists and phrases that contain a
non-stoplisted word together with a stoplisted word was fixed
- a new full-featured solution for highlighting query words or
patterns in the content of resulting documents was added
I've written a new module HTML::Highlight, that can be used
either independently or together with DBIx::TextIndex. Its
advantages include:
- it makes highlighting very easy
- takes phrases and wildcards into account
- supports diacritics insensitive highlighting for iso-8859-2
languages
- takes HTML tags into account. That means when a user searches
for, for example, 'font', a FONT element in <FONT COLOR="red">
does not get "highlighted".
The module provides a very nice Google-like highlighting using
different colors for different words or phrases.
The module works together with DBIx::TextIndex using its new method
html_highlight().
The module can also be used to preview a context in which query
words appear in resulting documents.
- the old method highlight() was not changed nor removed for sake of
compatibility with old code
- I put a new 'html_search.cgi' script to examples/ to show how the new
highlighting and context previewing works.
- the HTML::Highlight module can be found on CPAN -
http://www.cpan.org/authors/id/T/TR/TRIPIE/
- the new highlighting solution has been documented in a new section
of the documentation
tripie v1 changes:
NOTE: tripie proximity index is replaced with newer compact index
structure as of version 0.22
added proximity indexing
- based on positions of words in a document
- by default it is disabled, activate it by new option "proximity_index"
- very efficient for bigger documents, much worse for small ones
- it's very big (approx. 20 bytes for each word)
- allows fast proximity based searches in form of:
":2 some phrase" => matches "some nice phrase"
":1 some phrase" => matches only exact "some phrase"
":10 some phrase" => matches "some [1..9 words] phrase"
defaults to ":1" when omitted
- the proximity matches work only forwards, not backwards,
that means:
":3 some phrase" does not match "phrase nice some" or "phrase some"
NOTE: as of version 0.20, this syntax has changed:
"some phrase"~3
rewrote the word splitter and query parser
- added support for czech language diacritics insensitive indexing
and searching (option "czech_language")
(note: changed option to "charset", pass value "iso-8859-2"
to enable -dkoch)
that is implemented by converting both the indexed data and
the query from iso-8859-2 to ASCII
- this can also be used for other iso-8859-2 based Slavic languages
- the above is performed by my module "CzFast" that can
be found on CPAN (my CPAN id is "TRIPIE"), and is optional
added partial pattern matching using wildcards "%" or "*"
- these wildcards can be used at end of a word to match all words
that begin with that word, ie.
the "%" character means "match any characters"
car% ==> matches "car", "cars", "careful", "cartel", ....
the "*" character means "match also the plural form"
car* ==> matches only "car" or "cars"
- added option "min_wildcard_length" to specify minimal length of a
word base appearing before the "%" wildcard character to avoid
selection of excessive amount of results
added a database abstraction layer - all SQL queries were moved to
separate module (see lib/ and the docs for new "db" option) and
polished a bit for better maintainability and possible support of
other SQL dialects
added stoplists
- some words that are too common (are present in almost every document)
are not indexed and are removed from the search query before processing
to avoid expensive processing of excessively huge result sets
- user is notified when a search does not produce any results because
some words he used in his query were stoplisted (no_results_stop)
- stoplists can be easily localized or modified
- default is not to use any stoplist, one or more stoplists
can be selected using the "stoplist" option
- stoplist data files are in lib/
- english (en) and czech (cz) stoplists are included
- more than one stoplist can be used
added a facility to remove documents from the index -
check the documentation for new method "remove_document" for more info.
There is no way to recover all the space taken by the documents that are
being removed. This method manages to recover approx. 80% of the space.
It's recommended to rebuild the index when you remove a
significant amount of documents.
added a facility to obtain some statistical information about the index -
check the documentation for new method "stat"
max_word_length, phrase_threshold and result_threshold are now configurable
options
added configuration options to customize/localize the error messages
(no_results etc.)
all new configuration options are properly stored in collection's data
max_word_length limit now works much better - all words are stripped
down to the maximum size before they are stored to the index, and all
query words are stripped down to the maximum word size before they are
processed. Now when the max word length is set to, say, six, and a user
searches for "consciousness", all documents containing any words
beginning with "consci" are returned. Therefore the new max_word_length
option is not a limit on word size, but rather a "resolution"
preference.
added some comments and occasionally corrected indentation
documented the enhancements
bugfix: when RaiseError was set on a DBI connection, a query that
temporarily switched off PrintError to avoid some problems failed
note: the interface was not changed - old code using this module should
run without any changes
Thanks for this excellent module and please excuse my inferior English !
Tomas Styblo, tripie@cpan.org
0.05
Added unscored_search() which returns a reference to an array of
doc_ids, without scores. Should be much faster than scored search.
Added error handling in case _occurence() doesn't return a number.
0.04
Bug fix: add_document() will return if passed an empty array ref instead
of producing an error.
Changed _boolean_compare() and _phrase_search() so and_words and
phrases behave better in multiple-field searches. Result set for each
field is calculated first, then union of all fields is taken for
final result set.
Scores are scaled lower in _search().
0.03
Added example scripts in examples/.
0.02
Added or_mask_set.
0.01
Initial public release. Should be considered beta, and methods may be
added or changed until the first stable release.