NAME
Search::Tokenizer - Decompose a string into tokens (words)
SYNOPSIS
# generic usage
use Search::Tokenizer;
my $tokenizer = Search::Tokenizer->new(
  regex     => qr/.../,
  filter    => sub { ... },
  stopwords => {word1 => 1, word2 => 1, ... },
  lower     => 1,
);
my $iterator = $tokenizer->($string);
while (my ($term, $len, $start, $end, $index) = $iterator->()) {
  ...
}
# usage for DBD::SQLite (with builtin tokenizers: word, word_locale,
# word_unicode, unaccent)
use Search::Tokenizer;
$dbh->do("CREATE VIRTUAL TABLE t "
        ." USING fts3(tokenize=perl 'Search::Tokenizer::unaccent')");
DESCRIPTION
This module builds an iterator function that will progressively extract terms from a given input string. Terms are defined by a regular expression (for example \w+). Term matching relies on the builtin "global match" operator of Perl (the 'g' flag), and therefore is quite efficient.
Before being returned to the caller, terms may be filtered by an auxiliary function, for performing tasks such as stemming or stopword elimination.
A tokenizer returned from the new method is a code reference, not a regular Perl object. To use the tokenizer, just call it with a string to parse; this returns another code reference, which works as an iterator. Each call to the iterator returns the next term from the string, until the string is exhausted.
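The two levels of code references described above can be illustrated with a few lines of core Perl. This is only a minimal sketch of the closure technique, not the module's actual source; make_tokenizer is a hypothetical helper name, and filtering, case folding and offsets are deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Search::Tokenizer: returns a tokenizer
# (a code reference) built around the given compiled regex.
sub make_tokenizer {
    my ($regex) = @_;

    # The tokenizer: takes a string and returns an iterator.
    return sub {
        my ($string) = @_;    # private copy; pos($string) tracks progress

        # The iterator: each call resumes the /g match where the
        # previous call stopped, and returns nothing when exhausted.
        return sub {
            return $string =~ /($regex)/g ? $1 : ();
        };
    };
}

my $tokenizer = make_tokenizer(qr/\w+/);
my $iterator  = $tokenizer->("Once upon a time");

my @terms;
while (defined(my $term = $iterator->())) {
    push @terms, $term;
}
print join("|", @terms), "\n";    # Once|upon|a|time
```

Because the match position lives in the closed-over $string, several iterators created from the same tokenizer can run independently of each other.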
This API was explicitly designed for integrating Perl with the FTS3 fulltext search engine in DBD::SQLite; however, the API is general enough to be useful for other purposes, which is why it is published in its own, separate distribution.
METHODS
Creating a tokenizer
my $tokenizer = Search::Tokenizer->new($regex);
my $tokenizer = Search::Tokenizer->new(%options);
Builds a new tokenizer, returned as a code reference. The first syntax with a single Regexp argument is a shorthand for ->new(regex => $regex). The second syntax, with named arguments, has the following available options:
regex => $regex
$regex is a compiled regular expression that specifies how to match a term; that regular expression should not match the empty string (otherwise the tokenizer would enter an infinite loop). The default is qr/\w+/. Here are some examples of more advanced regexes:
# take 'locale' into account
$regex = do {use locale; qr/\w+/};
# rely on Unicode's definition of "word characters"
$regex = qr/\p{Word}+/;
# words like "don't", "it's" are treated as a single term
$regex = qr/\w+(?:'\w+)?/;
# same thing but also with internal hyphens like "fox-trot"
$regex = qr/\w+(?:[-']\w+)?/;
lower => $bool
If true, the term returned by the $regex is converted to lowercase (or more precisely: is "case-folded" through the fc function of Unicode::CaseFold). This option is activated by default.
filter => $filter
$filter is a reference to a function that may modify or cancel a term before it is returned to the caller. The filter takes one single argument (the term) and returns a scalar (the modified term). If the value returned from the filter is empty, then this term is canceled.
filter_in_place => $filter
Like filter, except that the filtering function directly modifies the term in its $_[0] argument instead of returning a new term. This is useful for example when building a filter from Lingua::Stem::Snowball or from Text::Transliterator::Unaccent.
stopwords => $hashref
The keys in $hashref are terms to cancel (usually: common terms for which indexing would consume lots of resources with little added value). Values in the hash should evaluate to true. Lists of stopwords for various languages may be found in the Lingua::StopWords module. Stopwords filtering is applied after the filter or filter_in_place function (if any).
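The difference between the filter and filter_in_place calling conventions can be tried in isolation with plain code references. The toy "stemmer" below (it strips one trailing "s") is an assumption made for illustration only; it stands in for a real stemmer and does not use Search::Tokenizer itself.

```perl
use strict;
use warnings;

# 'filter' convention: receives the term, returns the modified term.
my $filter = sub {
    my ($term) = @_;
    $term =~ s/s\z//;
    return $term;
};

# 'filter_in_place' convention: modifies the caller's variable through
# the $_[0] alias; its return value is not used.
my $filter_in_place = sub {
    $_[0] =~ s/s\z//;
    return;
};

my $term1  = "tokens";
my $result = $filter->($term1);    # $term1 itself is left unchanged

my $term2 = "tokens";
$filter_in_place->($term2);        # $term2 is changed directly

print "$term1 $result $term2\n";   # tokens token token
```

Either kind of code reference could then be passed to new() under the corresponding option name.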
Whenever a term is canceled through the filter or stopwords options, the tokenizer does not return that term to the client, but nevertheless remembers the canceled position. For example, when tokenizing "Once upon a time" with
$tokenizer = Search::Tokenizer->new(
  stopwords => Lingua::StopWords::getStopWords('en')
);
we get the term sequence
("upon", 4, 5, 9, 1)
("time", 4, 12, 16, 3)
where terms "once" and "a" in positions 0 and 2 have been canceled.
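The way canceled terms still consume an index can be reproduced with a short core-Perl loop. This is a sketch under stated assumptions, not the module's implementation: the hand-written %stopwords hash stands in for Lingua::StopWords::getStopWords('en'), and lc is used as a simplification of full case folding.

```perl
use strict;
use warnings;

# Assumed stopword list for this example only.
my %stopwords = (once => 1, a => 1);

my $string = "Once upon a time";
my $index  = -1;
my @results;

while ($string =~ /\w+/g) {
    $index++;                              # counts every match, kept or not
    my ($start, $end) = ($-[0], $+[0]);    # character offsets of the match
    my $term = lc substr($string, $start, $end - $start);
    next if $stopwords{$term};             # canceled, but $index has moved on
    push @results, [$term, length($term), $start, $end, $index];
}

print "(", join(", ", @$_), ")\n" for @results;
# (upon, 4, 5, 9, 1)
# (time, 4, 12, 16, 3)
```

Keeping the original indexes matters for phrase and proximity queries, where the distance between terms must reflect the source text rather than the filtered sequence.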
Creating an iterator
my $iterator = $tokenizer->($text);
# loop over terms ..
while (my $term = $iterator->()) {
  work_with_term($term);
}
# .. or loop over terms with detailed information
while (my @term_details = $iterator->()) {
  work_with_details(@term_details); # ($term, $len, $start, $end, $index)
}
The tokenizer takes one string argument and returns an iterator. The iterator takes no argument; each call returns the next term from the string, until the string is exhausted, at which point the iterator returns an empty result.
If called in scalar context, the iterator returns just the term as a string; if called in list context, it returns a tuple composed of
- $term: the term (after filtering)
- $len: the term length
- $start: the starting offset in the string where this term was found
- $end: the end offset (where the search for the next term will start)
- $index: the index of this term within the string, starting at 0
Length and start/end offsets are computed in characters, not in bytes (note for SQLite users: the C layer in SQLite needs byte values, but the C implementation in DBD::SQLite takes care of the conversion automatically).
Beware that ($end - $start) is the length of the original term extracted by the regex, while $len is the length of the final $term, after filtering; both may differ, especially if stemming is being applied.
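To see why character offsets and byte offsets differ, the following sketch computes both for a string containing accented letters. It is only an illustration in pure Perl, assuming UTF-8 storage; byte_offset is a hypothetical helper and is not how DBD::SQLite performs the conversion internally.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# "café crème", written with escapes so the example does not depend
# on the encoding of the source file.
my $string = "caf\x{e9} cr\x{e8}me";

# Hypothetical helper: byte offset matching a character offset,
# assuming the text is stored as UTF-8.
sub byte_offset {
    my ($str, $char_offset) = @_;
    return length(encode_utf8(substr($str, 0, $char_offset)));
}

my @offsets;
while ($string =~ /\p{Word}+/g) {
    my ($start, $end) = ($-[0], $+[0]);    # offsets in characters
    push @offsets,
        [$start, $end, byte_offset($string, $start), byte_offset($string, $end)];
}

print "chars $_->[0]..$_->[1], bytes $_->[2]..$_->[3]\n" for @offsets;
# chars 0..4, bytes 0..5
# chars 5..10, bytes 6..12
```

Each accented letter here occupies one character but two bytes in UTF-8, so the byte offsets drift ahead of the character offsets as the string is scanned.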
BUILTIN TOKENIZERS
For convenience, the following tokenizers are built in:
Search::Tokenizer::word
Terms are "words" according to Perl's notion of \w+.
Search::Tokenizer::word_locale
Terms are "words" according to Perl's notion of \w+ under use locale.
Search::Tokenizer::word_unicode
Terms are "words" according to Unicode's notion of \p{Word}+.
Search::Tokenizer::unaccent
Like Search::Tokenizer::word_unicode, but filtered through Text::Transliterator::Unaccent to replace all accented characters by their base character.
These builtin tokenizers may take the same arguments as new(), for example:
use Search::Tokenizer;
my $tokenizer = Search::Tokenizer::unaccent(lower => 0, stopwords => ...);
SEE ALSO
Other tokenizers on CPAN: KinoSearch::Analysis::Tokenizer and Search::Tools::Tokenizer.
Stopwords: Lingua::StopWords
Stemming: Lingua::Stem::Snowball
Removing accented characters: Text::Transliterator::Unaccent
AUTHOR
Laurent Dami, <dami@cpan.org>
LICENSE AND COPYRIGHT
Copyright 2010, 2021 Laurent Dami.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.