NAME

TM::Corpus - Topic Maps, Document Corpus

SYNOPSIS

   use TM;
   my $tm = ...

   use TM::Corpus;
   use LWP::UserAgent;
   my $co = TM::Corpus->new (map => $tm);   # bind the corpus to the map

   $co->update;                             # copy all content from the map
   $co->harvest (LWP::UserAgent->new);      # add documents from the Internet

ABSTRACT

This package connects a topic map instance and a document corpus into one container.

DESCRIPTION

A corpus is normally a set of documents. A topic map based corpus is a set of documents, internal or external to a topic map.

Once your topic map is stable, you can first update the corpus with the map's content and then let a user agent download all documents which are mentioned in the map. With this corpus you can then do any number of things, one of them being full-text search.

INTERFACE

Constructor

The constructor accepts a hash as parameter with the following keys:

map (mandatory)

The value must be a TM object. Any map should do.

ua (optional)

You can pass in your own LWP::UserAgent object. It is used when you ask the corpus to harvest the documents behind occurrence URLs. If you omit it, a stock object will be generated.

Methods

map

my $tm = $co->map

Read-only access to the underlying map.

resources

Read-only access to the data. Probably not wise to use, but here it is.

update

$co = $co->update

$co->update

This method synchronizes all data from the map into the corpus. The underlying map is the authoritative source, but when it is modified, the corpus is NOT automatically updated. Instead, you should invoke this method at a suitable time.

harvest

$co = $co->harvest

$co->harvest

$co->harvest ($ua)

This method uses the defined user agent to resolve all URLs within the underlying map and to load the content locally. All network related modalities (timeout, limits, etc.) have to be implemented via the user agent.
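As a sketch of where such settings live, the following configures an LWP::UserAgent before it is handed to harvest. The concrete values and the agent string are illustrative assumptions; timeout, max_size, max_redirect and agent are regular LWP::UserAgent constructor options.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# All network policy is set on the user agent, not on the corpus.
my $ua = LWP::UserAgent->new (
    timeout      => 10,                       # seconds to wait per request
    max_size     => 1024 * 1024,              # stop reading a response body after 1 MB
    max_redirect => 3,                        # follow at most three redirects
    agent        => 'corpus-harvester/0.1',   # hypothetical User-Agent string
);

printf "timeout=%d max_size=%d\n", $ua->timeout, $ua->max_size;
```

With a corpus $co bound to a map as in the SYNOPSIS, this agent would then be passed as $co->harvest ($ua).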

deficit

$deficit = $co->deficit

This method returns all the URL references which could not be resolved successfully during all previous invocations of harvest. The result is a hash (reference) with the assertion ID as key and a combination of the URL and its fails information as value.

Example:

    warn "damn" if keys %{ $co->deficit };
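Building on that check, harvest and deficit can be combined into a bounded retry loop. The following is only a sketch: it assumes that a repeated harvest attempts the still unresolved URLs again, which this documentation does not state explicitly. The helper harvest_with_retries and the package My::FakeCorpus are hypothetical; the fake corpus merely stands in for a real TM::Corpus object so that the example runs without network access.

```perl
use strict;
use warnings;

# Call harvest up to $max_rounds times and stop early once the deficit
# is empty. Returns the number of rounds used, or nothing if unresolved
# URLs remain after the final round.
sub harvest_with_retries {
    my ($co, $ua, $max_rounds) = @_;
    for my $round (1 .. $max_rounds) {
        defined $ua ? $co->harvest ($ua) : $co->harvest;
        my $left = keys %{ $co->deficit };    # number of unresolved references
        return $round if $left == 0;
    }
    return;
}

# Hypothetical stand-in for a corpus: its deficit stays non-empty
# for the first two rounds and is empty from the third round on.
{
    package My::FakeCorpus;
    sub new     { bless { rounds => 0 }, shift }
    sub harvest { $_[0]{rounds}++; $_[0] }
    sub deficit { $_[0]{rounds} < 3 ? { a1 => 'unresolved' } : {} }
}

my $rounds = harvest_with_retries (My::FakeCorpus->new, undef, 5);
print "resolved after $rounds round(s)\n";
```

Capping the number of rounds keeps a permanently dead URL from causing an endless loop; whatever remains in the deficit afterwards can be reported or handled separately.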

inject

$co->inject ($tid => $doc, ...)

This method injects documents into the corpus. For each of these you have to provide a topic identifier (tid, see TM) and a TM::Corpus::Document instance.

extract

@docs = $co->extract ($tid, ...)

Given a list of topic identifiers, the subject addresses of these are used to identify the underlying documents. These are then returned in a list.

eject

$co->eject ($tid, ...)

This method removes all documents which are subjects of the topics handed in. The topics are removed as well from the underlying topic map.

features

$fs = $co->features (%options)

($fs, $vs) = $co->features (%options)

This method computes a hash (reference) of feature values inside the corpus. Optionally the method also returns a list of all feature vectors from which the feature set has been computed.

The extracting and tokenizing options are those in TM::Corpus::Document. Additionally you can define cut-off points to remove the most frequent and the least frequent features.

low (default: 0.01)
high (default: 1.0)
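To illustrate the idea of frequency cut-offs, here is a minimal, self-contained sketch in plain Perl. It does not reproduce the module's implementation: the helper prune_features, the sample vectors, and above all the interpretation of low and high as bounds on the relative document frequency of a feature are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical feature vectors, one per document: token => count.
my @vectors = (
    { topic => 2, map    => 1, the => 5 },
    { topic => 1, corpus => 3, the => 4 },
    { perl  => 1, the    => 6 },
    { map   => 2, the    => 2 },
);

# Keep only features whose document frequency, relative to the number
# of documents, lies within [$low, $high]. ASSUMPTION: this reading of
# the cut-offs is not taken from the module source.
sub prune_features {
    my ($vectors, $low, $high) = @_;
    my %df;                                   # token => number of documents containing it
    for my $v (@$vectors) {
        $df{$_}++ for keys %$v;
    }
    my $n = @$vectors;
    my %kept;
    for my $token (keys %df) {
        my $ratio = $df{$token} / $n;
        $kept{$token} = $ratio if $ratio >= $low && $ratio <= $high;
    }
    return \%kept;
}

my $fs = prune_features (\@vectors, 0.3, 0.9);
print join (', ', sort keys %$fs), "\n";
```

For this data the sketch prints map, topic: the token "the" occurs in every document and exceeds the high bound, while "corpus" and "perl" each occur in only one of four documents and fall below the low bound.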

SEE ALSO

TM::Corpus::MLDBM, TM::Corpus::SearchAble

COPYRIGHT AND LICENSE

Copyright 200[89] by Robert Barta, <drrho@cpan.org>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.