NAME
Search::ContextGraph - Run searches using a contextual network graph
SYNOPSIS
    use Search::ContextGraph;
    my %docs = (
        'First Document'  => { 'elephant' => 2, 'snake' => 1 },
        'Second Document' => { 'camel' => 1, 'pony' => 1 },
        'Third Document'  => { 'snake' => 2, 'constrictor' => 1 },
    );
    my $cg = Search::ContextGraph->new();
    $cg->add_documents( %docs );
    # Regular word search
    my ( $docs, $words ) = $cg->search('snake');
    # Document similarity search (reuses the variables declared above)
    ( $docs, $words ) = $cg->find_similar('First Document');
    # Search on a little bit of both...
    ( $docs, $words ) = $cg->mixed_search( {
        docs  => [ 'First Document' ],
        terms => [ 'snake', 'pony' ],
    } );
    # Print out result set of returned documents
    foreach my $k ( sort { $docs->{$b} <=> $docs->{$a} } keys %{ $docs } ) {
        print "Document $k had relevance ", $docs->{$k}, "\n";
    }
    # Store the graph for future generations
    $cg->store( "filename" );
    # Reload it
    my $new = Search::ContextGraph->retrieve( "filename" );
DESCRIPTION
Search a document collection using a spreading activation search. The search algorithm represents the collection as a set of term and document nodes, connected to one another based on a co-occurrence matrix. If a word occurs in a document, we create an edge between the appropriate term and document node. Searches take place by spreading energy from a query node along the edges of the graph according to some simple rules. All result nodes exceeding a threshold T are returned. You can read a full description of this algorithm at http://www.nitle.org/papers/Contextual_Network_Graphs.pdf.
The search engine gives expanded recall (relevant results even when there is no keyword match) without incurring the kind of computational and patent issues posed by latent semantic indexing (LSI).
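The spreading activation idea described above can be sketched in a few lines of pure Perl. This is an illustration only, not the module's implementation: the "D:" and "T:" node prefixes, the spread() function, and the propagation rule (each node passes on its energy divided by its neighbor count plus one) are assumptions chosen so that energy always decays, and occurrence counts are ignored. The paper linked above describes the actual rules.

```perl
use strict;
use warnings;

my %docs = (
    'First Document'  => { elephant => 2, snake => 1 },
    'Second Document' => { camel => 1, pony => 1 },
    'Third Document'  => { snake => 2, constrictor => 1 },
);

# Build an undirected graph with one edge per (document, term) pair.
# Occurrence counts are ignored in this sketch.
my %neighbors;
for my $doc ( keys %docs ) {
    for my $term ( keys %{ $docs{$doc} } ) {
        $neighbors{"D:$doc"}{"T:$term"} = 1;
        $neighbors{"T:$term"}{"D:$doc"} = 1;
    }
}

# Add the incoming energy to the node, then pass a share to each neighbor.
# Assumed rule: share = energy / (number of neighbors + 1), so the energy
# at least halves on every hop and the recursion always terminates.
sub spread {
    my ( $node, $energy, $threshold, $total ) = @_;
    $total->{$node} += $energy;
    my @next  = keys %{ $neighbors{$node} || {} };
    my $share = $energy / ( @next + 1 );
    return if $share <= $threshold;
    spread( $_, $share, $threshold, $total ) for @next;
}

my ( $start_energy, $activate, $collect ) = ( 100, 1, 1 );
my %energy;
spread( 'T:snake', $start_energy, $activate, \%energy );

# Keep only the nodes whose accumulated energy exceeds the collection threshold.
my %results = map { $_ => $energy{$_} } grep { $energy{$_} > $collect } keys %energy;
for my $node ( sort { $results{$b} <=> $results{$a} || $a cmp $b } keys %results ) {
    printf "%-20s %.2f\n", $node, $results{$node};
}
```

Under these assumptions a query on "snake" also collects the terms "elephant" and "constrictor", which share a document with it, while "Second Document" and its terms receive no energy at all. That is the expanded recall effect in miniature.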
METHODS
- new %PARAMS
Object constructor. Possible parameters:
debug LEVEL
Set this to 1 (verbose) or 2 (very verbose) to turn on debugging output.
xs
When true, tells the module to use compiled C internals. This reduces memory requirements by about 60%, but actually runs a little slower than the pure Perl version. Don't bother to turn it on unless you have a huge graph. Default is pure Perl.
BUG: using the compiled version makes it impossible to store the graph to disk.
START_ENERGY
Initial energy to assign to a query node. Default is 100.
ACTIVATE_THRESHOLD
Minimal energy needed to propagate search along the graph. Default is 1.
COLLECT_THRESHOLD
Minimal energy needed for a node to enter the result set. Default is 1.
- [get|set]_activate_threshold
Accessor for node activation threshold value. This value determines how far energy can spread in the graph. Lower it to increase the number of results. Default is 1.
- [get|set]_collect_threshold
Accessor for collection threshold value. This determines how much energy a node must have to make it into the result set. Lower it to increase the number of results. Default is 1.
- set_debug_mode [012]
Turns debugging on or off. 1 is verbose, 2 is very verbose, 0 is off.
- [get|set]_initial_energy
Accessor for initial energy value at the query node. This controls how much energy gets poured into the graph at the start of the search. Increase this value to get more results from your queries. Default is 100.
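To see why lowering a threshold or raising the initial energy yields more results, consider a toy model. The sketch below assumes, purely for illustration, that energy halves at every hop along a chain of nodes; nodes_reached() is a hypothetical helper, not part of the module, and the real decay depends on the shape of your graph.

```perl
use strict;
use warnings;

# Count how many nodes along a simple chain receive more than $threshold
# energy, assuming (hypothetically) that energy halves at every hop.
sub nodes_reached {
    my ( $start_energy, $threshold ) = @_;
    my ( $energy, $count ) = ( $start_energy, 0 );
    while ( $energy > $threshold ) {
        $count++;
        $energy /= 2;
    }
    return $count;
}

# Compare the defaults with a lower threshold and with a higher start energy.
printf "start=%d threshold=%s reaches %d nodes\n", @$_, nodes_reached(@$_)
    for [ 100, 1 ], [ 100, 0.1 ], [ 200, 1 ];
```

In this model the defaults (100 and 1) reach 7 nodes, a threshold of 0.1 reaches 10, and doubling the start energy to 200 reaches 8.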
- load_from_tdm TDM_FILE [, LM_FILE]
Opens and loads a term-document matrix (TDM) file to initialize the graph. The TDM encodes information about term-to-document links. For notes on the proper file format, see the README file. Note that document-document links are NOT YET IMPLEMENTED.
- raw_search @NODES
Given a list of nodes, returns a hash of nearest nodes with relevance values, in the format NODE => RELEVANCE, for all nodes above the threshold value. (You probably want one of search, find_similar, or mixed_search instead.)
- debug_on, debug_off
Turn debug mode on or off.
- add_documents %DOCS
Load up the search engine with documents in the form TITLE => WORDS, where WORDS is either a reference to a hash of terms and occurrence counts, or a reference to an array of words. For example:
TITLE => { WORD1 => COUNT1, WORD2 => COUNT2 ... }
or
TITLE => [ WORD1, WORD2, WORD3 ]
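If your documents start out as raw text, you need to turn them into one of these two shapes first. The following is a minimal sketch of doing so; count_words() is a hypothetical helper, and splitting on non-word characters and lowercasing are illustrative tokenizing choices rather than anything the module requires.

```perl
use strict;
use warnings;

# Turn raw text into a reference to a TERM => COUNT hash.
# Tokens are lowercased; empty tokens from leading punctuation are dropped.
sub count_words {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for grep { length } split /\W+/, $text;
    return \%count;
}

my %docs = (
    # Hash form: each term mapped to its occurrence count.
    'First Document' => count_words('Elephant, elephant, snake.'),
    # Array form: a plain list of words, repeats included.
    'Third Document' => [ split ' ', 'snake snake constrictor' ],
);

for my $title ( sort keys %docs ) {
    my $words = $docs{$title};
    my @terms = ref $words eq 'HASH' ? sort keys %$words : @$words;
    print "$title: @terms\n";
}
```

A %docs hash built this way has the TITLE => WORDS shape that add_documents expects.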
- search @QUERY
Searches the graph for all of the words in @QUERY. Use find_similar if you want to do a document similarity search instead, or mixed_search if you want to search on any combination of words and documents. Returns a pair of hashrefs: the first a reference to a hash of docs and relevance values, the second to a hash of words and relevance values.
- find_similar @DOCS
Given an array of document identifiers, performs a similarity search and returns a pair of hashrefs. The first hashref is to a hash of docs and relevance values, the second to a hash of words and relevance values.
- mixed_search \%QUERY
Given a hashref in the form:
    {
        docs  => [ 'Title 1', 'Title 2' ],
        terms => [ 'buffalo', 'fox' ],
    }
runs a combined search on the terms and documents provided, and returns a pair of hashrefs. The first hashref is to a hash of docs and relevance values, the second to a hash of words and relevance values.
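All three search methods return their results as unordered NAME => RELEVANCE hashes, so you will usually want to rank them before display. Here is one possible sketch; top_n() is a hypothetical helper, and the relevance values are made up to stand in for the first hashref a search would return.

```perl
use strict;
use warnings;

# Return up to $n [ NAME, RELEVANCE ] pairs, highest relevance first.
# Ties are broken alphabetically so the order is stable.
sub top_n {
    my ( $scores, $n ) = @_;
    my @ranked = sort { $scores->{$b} <=> $scores->{$a} || $a cmp $b } keys %$scores;
    splice @ranked, $n if @ranked > $n;
    return map { [ $_, $scores->{$_} ] } @ranked;
}

# Made-up relevance values, for illustration only.
my $docs = {
    'First Document'  => 33.3,
    'Second Document' => 1.2,
    'Third Document'  => 41.7,
};

printf "%-16s %.1f\n", @$_ for top_n( $docs, 2 );
```

The same helper works unchanged on the second hashref, which maps words to relevance values.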
- store FILENAME
Stores the object to a file for later use. Not compatible (yet) with the compiled XS version, which will give a fatal error.
BUGS
Document-document links are not yet implemented
No way to delete nodes once they're in the graph
No way to break edges once they're in the graph
Can't store graph if using compiled C internals
AUTHOR
Maciej Ceglowski <maciej@ceglowski.com>
The technique used here was developed in 2003 by John Cuadrado, and later found to have antecedents in the spreading activation approach described in a 1981 doctoral dissertation by Scott Preece. XS implementation thanks to Schuyler Erle.
CONTRIBUTORS
Schuyler Erle
Ken Williams
Leon Brocard
COPYRIGHT AND LICENSE
Perl module: (C) 2003 Maciej Ceglowski
XS Implementation: (C) 2003 Maciej Ceglowski, Schuyler Erle
This program is free software, distributed under the GNU General Public License. See LICENSE for details.