Add many tokens to the batch, by supplying the string to be tokenized, and arrays of token starts and token ends.

Take an array of Perl scalars and map their string contents to the texts for each token in the batch.

Return a Perl array whose elements correspond to the token texts in this batch.

NAME

KinoSearch::Analysis::TokenBatch - A collection of tokens.

SYNOPSIS

# create a TokenBatch with a single Token
my $source_batch = KinoSearch::Analysis::TokenBatch->new(
    text => 'Key Lime Pie',
);

# lowercase and split text into multiple tokens, append to new batch
my $dest_batch = KinoSearch::Analysis::TokenBatch->new;
while ( my $source_token = $source_batch->next ) {
    my $source_text = $source_token->get_text;
    while ( $source_text =~ /\s*?(\S+)/g ) {
        my $new_token = KinoSearch::Analysis::Token->new(
            text         => lc($1),
            start_offset => $-[1],
            end_offset   => $+[1],
        );
        $dest_batch->append($new_token);
    }
}

# prints 'keylimepie'
while ( my $token = $dest_batch->next ) {
    print $token->get_text;
}

DESCRIPTION

A TokenBatch is a collection of Token objects which you can add to, then iterate over.

METHODS

new

my $batch = KinoSearch::Analysis::TokenBatch->new(
    text => $utf8_text,
);

# ... which is equivalent to:
my $batch = KinoSearch::Analysis::TokenBatch->new;
my $token = KinoSearch::Analysis::Token->new(
    text         => $utf8_text,
    start_offset => 0,
    end_offset   => length($utf8_text),
);
$batch->append($token);

Constructor. Takes one optional hash-style argument.

  • text - UTF-8 encoded text, used to prime the TokenBatch with a single initial Token (see KinoSearch::Analysis::Token).

append

$batch->append($token);

Tack a Token onto the end of the batch.

add_many_tokens

$batch->add_many_tokens( $string, \@starts, \@ends );
# or...
$batch->add_many_tokens( $string, \@starts, \@ends, \@boosts );

High-efficiency method for adding multiple tokens to the batch in one call. The starts and ends, which must be specified in characters (not bytes), identify the substrings of $string to use as token texts.

(Note: boosts should be supplied only for fields which are set to store_pos_boost.)
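As a rough illustration of how the starts and ends arrays might be built, the sketch below collects character offsets with a regex and the @- and @+ arrays, in the same way the SYNOPSIS does for single tokens. The token_offsets() helper and the sample string are hypothetical and not part of KinoSearch; the script runs without the library and only shows the final add_many_tokens() call in a comment.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of KinoSearch): return the character
# offsets of every run of non-whitespace in $string.
sub token_offsets {
    my ($string) = @_;
    my ( @starts, @ends );
    while ( $string =~ /(\S+)/g ) {
        push @starts, $-[1];    # where capture group 1 begins
        push @ends,   $+[1];    # just past the end of capture group 1
    }
    return ( \@starts, \@ends );
}

# A character string containing three non-ASCII characters.
my $string = "Cr\x{E8}me Br\x{FB}l\x{E9}e Tart";
my ( $starts, $ends ) = token_offsets($string);

# Recover the substrings that the offsets identify.
my @texts = map {
    substr( $string, $starts->[$_], $ends->[$_] - $starts->[$_] )
} 0 .. $#$starts;

print "starts: @$starts\n";    # prints 'starts: 0 6 13'
print "ends:   @$ends\n";      # prints 'ends:   5 12 17'

# Character and byte counts differ once the text is UTF-8 encoded,
# which is why byte offsets would point at the wrong substrings.
my $chars = length($string);                       # 17
my $bytes = length( encode( 'UTF-8', $string ) );  # 20
print "chars: $chars, bytes: $bytes\n";

# With a TokenBatch in hand, the arrays would then be passed as:
# $batch->add_many_tokens( $string, $starts, $ends );
```

Because the sample string is a Perl character string, @- and @+ report character positions, which is the unit add_many_tokens() expects.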

next

while ( my $token = $batch->next ) {
    # ...
}

Return the next token in the TokenBatch, or undef if out of tokens.

reset

$batch->reset;

Reset the TokenBatch's iterator, so that the next call to next() returns the first Token in the batch.

COPYRIGHT

Copyright 2005-2007 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

See KinoSearch version 0.20_01.