NAME

Search::ContextGraph - Run searches using a contextual network graph

SYNOPSIS

use Search::ContextGraph;

my %docs = (
  'First Document'  => { 'elephant' => 2, 'snake' => 1 },
  'Second Document' => { 'camel' => 1, 'pony' => 1 },
  'Third Document'  => { 'snake' => 2, 'constrictor' => 1 },
);

my $cg = Search::ContextGraph->new();
$cg->add_documents( %docs );

# Regular word search
my ( $docs, $words ) = $cg->search('snake');

# Document similarity search
my ( $docs, $words ) = $cg->find_similar('First Document');

# Search on a little bit of both...
my ( $docs, $words ) = 
  $cg->mixed_search( { docs  => [ 'First Document' ],
                       terms => [ 'snake', 'pony' ]
                   );

# Print out result set of returned documents
foreach my $k ( sort { $docs->{$b} <=> $docs->{$a} }
    keys %{ $docs } ) {
    print "Document $k had relevance ", $docs->{$k}, "\n";
}

# Store the graph for future generations
$cg->store( "filename" );

# Reload it
my $new = Search::ContextGraph->retrieve( "filename" );

DESCRIPTION

Search a document collection using a spreading activation search. The search algorithm represents the collection as a set of term and document nodes, connected to one another based on a co-occurrence matrix. If a word occurs in a document, we create an edge between the appropriate term and document node. Searches take place by spreading energy from a query node along the edges of the graph according to some simple rules. All result nodes exceeding a threshold T are returned. You can read a full description of this algorithm at http://www.nitle.org/papers/Contextual_Network_Graphs.pdf.

The search engine gives expanded recall (relevant results even when there is no keyword match) without incurring the kind of computational and patent issues posed by latent semantic indexing (LSI).

METHODS

new %PARAMS

Object constructor.

[get|set]_activate_threshold

Accessor for node activation threshold value. This value determines how far energy can spread in the graph

[get|set]_collect_threshold

Accessor for collection threshold value. This determines how much energy a node must have to make it into the result set.

[get|set]_initial_energy

Accessor for initial energy value at the query node. The higher this value, the larger the result set

load_from_tdm TDM_FILE [, LM_FILE]

Opens and loads a term-document matrix (TDM) file to initialize the graph. Optionally also opens and loads a document link matrix (DLM) file of document-to-document links. The TDM encodes information about term-to-document links, while the DLM file holds information about inter-document links, like hyperlinks or citation data. For notes on these file formats, see the README file Note that document-document links are NOT YET IMPLEMENTED.

raw_search @NODES

Given a list of nodes, returns a hash of nearest nodes with relevance values, in the format NODE => RELEVANCE, for all nodes above the threshold value. (You probably want one of search, find_similar, or mixed_search instead).

set_debug_mode

Turns verbose comments on if given a true value as its argument

add_documents %DOCS

Load up the search engine with documents in the form TITLE => WORDHASH, where WORDHASH is a reference to a hash of terms and occurence counts. In other words,

TITLE => { WORD1 => COUNT1, WORD2 => COUNT2 ... }
search @QUERY

Searches the graph for all of the words in @QUERY. No support yet for document similarity searches, but it's coming. Returns a pair of hashrefs: the first a reference to a hash of docs and relevance values, the second to a hash of words and relevance values.

find_similar @DOCS

Given an array of document identifiers, returns a pair of hashrefs. First hashref is to a hash of docs and relevance values, second is to a hash of words and relevance values.

mixed_search @DOCS

Given a hashref in the form: { docs => [ 'Title 1', 'Title 2' ], terms => ['buffalo', 'fox' ], } } Runs a combined search on the terms and documents provided, and returns a pair of hashrefs. The first hashref is to a hash of docs and relevance values, second is to a hash of words and relevance values.

BUGS

No way to delete nodes once they're in the graph
No way to break edges once they're in the graph

AUTHOR

Maciej Ceglowski <maciej@ceglowski.com>

The technique used here was developed in 2003 by John Cuadrado, and later found to have antecedents in the spreading activation approach described in a 1981 doctoral dissertation by Scott Preece.

CONTRIBUTORS

Ken Williams Leon Brocard

COPYRIGHT AND LICENSE

(C) 2003 Maciej Ceglowski

This program may be distributed under the same terms as Perl itself.