Add many tokens to the batch, by supplying the string to be tokenized, and arrays of token starts and token ends.
Take an array of Perl scalars and map their string contents to the texts for each token in the batch.
Return a Perl array whose elements correspond to the token texts in this batch.
NAME
KinoSearch::Analysis::TokenBatch - A collection of tokens.
SYNOPSIS
# create a TokenBatch with a single Token
my $source_batch = KinoSearch::Analysis::TokenBatch->new(
    text => 'Key Lime Pie',
);
# lowercase and split text into multiple tokens, append to new batch
my $dest_batch = KinoSearch::Analysis::TokenBatch->new;
while ( my $source_token = $source_batch->next ) {
    my $source_text = $source_token->get_text;
    while ( $source_text =~ /\s*?(\S+)/g ) {
        my $new_token = KinoSearch::Analysis::Token->new(
            text         => lc($1),
            start_offset => $-[1],
            end_offset   => $+[1],
        );
        $dest_batch->append($new_token);
    }
}
# prints 'keylimepie'
while ( my $token = $dest_batch->next ) {
    print $token->get_text;
}
DESCRIPTION
A TokenBatch is a collection of Token objects which you can add to, then iterate over.
METHODS
new
my $batch = KinoSearch::Analysis::TokenBatch->new(
    text => $utf8_text,
);
# ... which is equivalent to:
my $batch = KinoSearch::Analysis::TokenBatch->new;
my $token = KinoSearch::Analysis::Token->new(
    text         => $utf8_text,
    start_offset => 0,
    end_offset   => length($utf8_text),
);
$batch->append($token);
Constructor. Takes one optional hash-style argument.
text - UTF-8 encoded text, used to prime the TokenBatch with a single initial Token.
append
$batch->append($token);
Tack a Token onto the end of the batch.
add_many_tokens
$batch->add_many_tokens( $string, \@starts, \@ends );
# or...
$batch->add_many_tokens( $string, \@starts, \@ends, \@boosts );
High-efficiency method for adding multiple tokens to the batch with one call. The starts and ends, which must be specified in characters (not bytes), will be used to identify substrings of $string to supply as token texts to Token->new.
(Note: boosts should be supplied only for fields which are set to store_pos_boost.)
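The following is a minimal sketch of one way to build character-based starts and ends for a string containing non-ASCII text. It uses only core Perl and does not call KinoSearch itself. The helper name word_offsets and the sample string are illustrative assumptions, not part of the KinoSearch API.

```perl
use strict;
use warnings;
use utf8;                      # the string literal below contains non-ASCII characters
use Encode qw( encode );

# Hypothetical helper: collect the character offsets of each
# whitespace-delimited word in a decoded (character) string.
sub word_offsets {
    my ($string) = @_;
    my ( @starts, @ends );
    while ( $string =~ /(\S+)/g ) {
        push @starts, $-[1];    # offset of the first character of the match
        push @ends,   $+[1];    # offset just past the last character
    }
    return ( \@starts, \@ends );
}

my $string = "crème brûlée";
my ( $starts, $ends ) = word_offsets($string);

# Recover the substrings that each start/end pair identifies.
my @texts = map {
    substr( $string, $starts->[$_], $ends->[$_] - $starts->[$_] )
} 0 .. $#$starts;

# Character length and UTF-8 byte length differ for this string,
# which is why the unit of the offsets matters.
my $char_length = length($string);
my $byte_length = length( encode( 'UTF-8', $string ) );

binmode( STDOUT, ':encoding(UTF-8)' );
print "starts: @$starts\n";     # starts: 0 6
print "ends:   @$ends\n";       # ends:   5 12
print "texts:  @texts\n";
print "chars=$char_length bytes=$byte_length\n";

# With KinoSearch installed, the same arrays could be passed along:
# $batch->add_many_tokens( $string, $starts, $ends );
```

Because the offsets come from @- and @+ on a decoded string, they count characters, which matches the requirement stated above. Offsets computed on the raw UTF-8 bytes would run past the intended substrings for this input.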
next
while ( my $token = $batch->next ) {
    # ...
}
Return the next token in the TokenBatch, or undef if out of tokens.
reset
$batch->reset;
Reset the TokenBatch's iterator, so that the next call to next() returns the first Token in the batch.
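To illustrate how append, next, and reset interact, here is a small stand-in class in plain Perl. It is a hypothetical sketch of the iterator behavior described above (an array plus a cursor), not the KinoSearch implementation. The package name My::MiniBatch and its internals are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in, NOT the KinoSearch implementation: a list of
# items plus a cursor, mimicking the append/next/reset interface.
package My::MiniBatch;

sub new {
    my ($class) = @_;
    return bless { items => [], tick => 0 }, $class;
}

# Tack an item onto the end of the collection.
sub append {
    my ( $self, $item ) = @_;
    push @{ $self->{items} }, $item;
    return;
}

# Return the next item, or undef once the cursor passes the last item.
sub next {
    my ($self) = @_;
    return if $self->{tick} >= @{ $self->{items} };
    return $self->{items}[ $self->{tick}++ ];
}

# Rewind the cursor so that next() starts from the first item again.
sub reset {
    my ($self) = @_;
    $self->{tick} = 0;
    return;
}

package main;

my $batch = My::MiniBatch->new;
$batch->append($_) for qw( key lime pie );

# First pass consumes every item.
my @first_pass;
while ( defined( my $item = $batch->next ) ) {
    push @first_pass, $item;
}

# The iterator is exhausted, so another call yields undef.
my $exhausted = $batch->next;

# After reset, a second pass sees the same items from the start.
$batch->reset;
my @second_pass;
while ( defined( my $item = $batch->next ) ) {
    push @second_pass, $item;
}

print "first:  @first_pass\n";     # first:  key lime pie
print "second: @second_pass\n";    # second: key lime pie
```

Without the call to reset, the second loop in this sketch would collect nothing, because the cursor would still sit past the last item.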
COPYRIGHT
Copyright 2005-2007 Marvin Humphrey
LICENSE, DISCLAIMER, BUGS, etc.
See KinoSearch version 0.20.