NAME

TM::Corpus::Document - Topic Maps, Document

SYNOPSIS

   use TM::Corpus::Document;
   my $d = TM::Corpus::Document->new ({ mime => 'text/plain',
                                        val  => 'this is some text' });
   # accessors
   $val  = $d->val ('new text');
   $mime = $d->mime ('new/mime');
   $url  = $d->ref ('http://somewhere/some.txt');

   my @tokens = $d->tokenize; # leaving defaults

   # using some predefined tokenizing steps, in this order
   my @tokens = $d->tokenize (tokenizers => 'NUMBER QUOTER COM&BO');

   # using negative ones (i.e. throw things away)
   my @tokens = $d->tokenize (tokenizers => 'COM&BO COM-BO -INTERPUNCT');

   # using filters (detect numbers and throw them away)
   my @tokens = $d->tokenize (tokenizers => 'NUMBER !NUMBER');

   # get also debugging output
   my @tokens = $d->tokenize (tokenizers => 'NUMBER TAP !NUMBER TAP');

   # define your own filters
   $TM::Corpus::Document::FILTERS{'!4LETTER'} = 
                sub { $_ = shift; return length($_) == 4 ? '' : $_; };
   my @tokens = $d->tokenize (tokenizers => 'WORDER !4LETTER');

   # collect features, here single tokens and two subsequent tokens
   my %features = $d->features (tokenizers  => '...', 
                                featurizers => 'TOKEN1 TOKEN2');

ABSTRACT

This package implements documents, i.e. the information pertinent to a document: its content, the corresponding MIME type and, if the document has one, a reference to it.

Most notable is the functionality to find the tokens (i.e. word substrings) in a document and to derive a feature vector for the document from them.

DESCRIPTION

INTERFACE

Constructor

The constructor expects a hash reference with one or more of the following fields:

ref

A URI string referring to the network address of the document. In Topic Maps parlance this will be the subject locator for the document topic.

val

The character stream associated with the document.

mime

The MIME type of the content.

Methods

ref

Accessor for the ref component of the document. Nothing happens with the other components.

val

Accessor for the val component of the document. Nothing happens with the other components.

mime

Accessor for the mime component of the document. Nothing happens with the other components.

tokenize

This method returns a list reference to recognized tokens.

To generate this, the method will first find an extractor according to the document's MIME type. That will extract text, but also relevant meta data, such as title, length, etc. Some extractors are predefined; you can get a list with

   perl -MTM::Corpus::Document -e 'warn join ",", keys %TM::Corpus::Document::EXTRACTORS;'

The extractor can also be overridden:

   $d->tokenize (extractor => sub { ... });

The extractor receives the value (content) of the document as its first parameter.
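
As an illustration, a replacement extractor could look like the minimal sketch below. The name $strip_tags and the tag-stripping logic are hypothetical. The sketch also assumes that returning the extracted plain text is enough; what exactly an extractor has to return (for example for meta data) is not specified in this section.

```perl
use strict;
use warnings;

# Hypothetical extractor: receives the raw content as its first parameter.
# Assumption: returning the extracted plain text is sufficient; the return
# convention for meta data is not documented here.
my $strip_tags = sub {
    my ($content) = @_;
    $content =~ s/<[^>]*>/ /g;    # replace anything that looks like a tag with a space
    $content =~ s/\s+/ /g;        # collapse runs of whitespace
    $content =~ s/^ | $//g;       # trim one leading and one trailing space
    return $content;
};

print $strip_tags->('<p>this is <b>some</b> text</p>'), "\n";   # prints "this is some text"
```

Under these assumptions the code reference would be handed over as the value of the extractor parameter shown above.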

In a second step the content stream of the document is analyzed for patterns, such as numbers, dates or words. Which patterns are relevant, and in which order they are handled, is specified with a simple language: a string of space-separated step names.

Example:

   $d->tokenize (tokenizers => 'COM&BO COM-BO');

Positive tokenizers detect patterns and bless them as valid tokens which will not be further analyzed or questioned:

WORDER: detects words in the current locale
QUOTER: detects substrings wrapped in ""
NUMBER: detects decimal numbers
DATE: detects date specifications in the current locale (NOT IMPLEMENTED!)
COM&BO: detects patterns like AT&T
COM-BO: detects patterns like T-Mobile
Capitalize: detects capitalized words
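
The idea that an accepted token is not analyzed again by later steps can be sketched as follows. This is not the module's implementation: the two regular expressions are guesses based on the examples AT&T and T-Mobile, and the helper run_step and the chunk representation are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical patterns, guessed from the examples AT&T and T-Mobile.
my %pattern = (
    'COM&BO' => qr/\w+&\w+/,
    'COM-BO' => qr/\w+-\w+/,
);

# Each chunk is [ $is_token, $string ]. A step only looks at chunks that are
# still raw text; strings already accepted as tokens are passed through.
sub run_step {
    my ($re, @chunks) = @_;
    my @out;
    for my $chunk (@chunks) {
        my ($is_token, $str) = @$chunk;
        if ($is_token) { push @out, $chunk; next }
        # split with a capturing group keeps the matches at odd positions
        my @parts = split /($re)/, $str;
        for my $i (0 .. $#parts) {
            next unless length $parts[$i];
            push @out, [ $i % 2, $parts[$i] ];
        }
    }
    return @out;
}

my @chunks = ([ 0, 'AT&T and T-Mobile merged' ]);
@chunks = run_step($pattern{$_}, @chunks) for ('COM&BO', 'COM-BO');
print join('|', map { $_->[1] } grep { $_->[0] } @chunks), "\n";   # prints "AT&T|T-Mobile"
```

The text fragments that remain raw (here " and " and " merged") are what later steps, such as WORDER or a negative tokenizer, would still get to see.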

Negative tokenizers detect patterns and immediately throw them away:

-WORDER: everything which is left as text fragment is suppressed
-QUOTER: quoted text is suppressed
-NUMBER: decimal numbers are suppressed
-INTERPUNCT: punctuation characters are suppressed

Filters take existing tokens and either modify them, suppress them or pass them through (and suppress everything else).

!NUMBER: number tokens are replaced with empty tokens
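
The contract of a filter, as used by the !4LETTER example in the SYNOPSIS, can be tried out in isolation. The sketch below does not call the module; the helper apply_filter is hypothetical, and dropping the emptied tokens afterwards is an assumption about what happens to them.

```perl
use strict;
use warnings;

# A filter in the style shown in the SYNOPSIS: it receives one token and
# returns either the token (possibly modified) or '' to suppress it.
my $no_4letter = sub {
    my ($token) = @_;
    return length($token) == 4 ? '' : $token;
};

# Hypothetical helper, not part of the module: run one filter over a token
# list and drop the tokens it emptied (assumption: empty tokens are discarded).
sub apply_filter {
    my ($filter, @tokens) = @_;
    return grep { length } map { $filter->($_) } @tokens;
}

my @kept = apply_filter($no_4letter, qw(this is some longer text));
print join(',', @kept), "\n";   # prints "is,longer"
```

Unlike the SYNOPSIS version, this sketch copies the token into a lexical variable instead of assigning to $_, so the caller's $_ is left untouched.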

You can override and extend tokenizers and filters by modifying the hashes %TOKENIZERS and %FILTERS. For instance, you can hook in a stopword list like this:

    my %stops =  map { $_ => 1 } qw(Terror CIA HLS);
    $TM::Corpus::Document::FILTERS{'!STOPS'} = 
                 sub { $_ = shift; return $stops{$_} ? '' : $_; };

    $d->tokenize (tokenizers => ' .... !STOPS ....');

features

This method computes the feature vector of a document. It accepts all parameters of the method tokenize, as it will invoke that first. Additionally you can specify which featurizers to apply:

  my %fv = $d->features (tokenizers  => 'QUOTER NUMBER WORDER', 
                         featurizers => 'TOKEN1 TOKEN2');

The following featurizers are defined:

TOKEN1: occurrences of single tokens are counted in the document
TOKEN2: occurrences of two subsequent tokens in the document are counted
TOKEN3: occurrences of three subsequent tokens in the document are counted
MIME: the MIME type is converted into some numeric value

You can extend or modify the %FEATURIZERS hash to add your own featuritis.
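
To make the counting behind TOKEN1 and TOKEN2 concrete, here is a minimal standalone sketch. It is not the module's code: the helper count_ngrams is hypothetical, and keying a group of tokens by joining them with a space is an assumption, as the actual key format of the feature vector is not described here.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: count every run of $n
# subsequent tokens. Assumption: a group is keyed by joining with a space.
sub count_ngrams {
    my ($n, @tokens) = @_;
    my %count;
    for my $i (0 .. @tokens - $n) {
        $count{ join ' ', @tokens[$i .. $i + $n - 1] }++;
    }
    return %count;
}

my @tokens = qw(to be or not to be);

# single tokens and pairs of subsequent tokens, merged into one hash
my %fv = (count_ngrams(1, @tokens), count_ngrams(2, @tokens));
print "$_ => $fv{$_}\n" for sort keys %fv;
```

In this example "to" and "be" are each counted twice, and so is the pair "to be"; all other single tokens and pairs occur once.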

NOTES

No. Plucene tokenizing was NOT helpful.

SEE ALSO

TM::Corpus

COPYRIGHT AND LICENSE

Copyright 200[8] by Robert Barta, <drrho@cpan.org>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.