NAME

Search::Fulltext::Tokenizer::Ngram - Character n-gram tokenizer for Search::Fulltext

VERSION

version 0.01

SYNOPSIS

use utf8;
use Search::Fulltext;
use Search::Fulltext::Tokenizer::Bigramm;

my $searcher = Search::Fulltext->new(
    docs => [
        'ハンプティ・ダンプティ 塀の上',
        'ハンプティ・ダンプティ 落っこちた',
        '王様の馬みんなと 王様の家来みんなでも',
        'ハンプティを元に 戻せなかった',
    ],
    tokenizer => q/perl 'Search::Fulltext::Tokenizer::Bigram::get_tokenizer'/,
);
my $hit_document_ids = $searcher->search('ハンプティ');  # [0, 1, 3]

DESCRIPTION

This module provides character N-gram tokenizers for Search::Fulltext.

By default {1,2,3}-gram tokenzers are available.

CREATING A N(> 3)-GRAM TOKENIZER

If you wish to use other N-grams where N > 3, you can create it by inheriting Search::Fulltext::Tokenizer::Ngram:

package My::Tokenizer::42gram;

use parent qw/Search::Fulltext::Tokenizer::Ngram/;

my $iterator_generator = __PACKAGE__->new(42);

sub get_tokenizer {
    sub { $iterator_generator->create_token_iterator(@_) };
}

SEE ALSO

Search::Fulltext::Tokenizer::Unigram Search::Fulltext::Tokenizer::Bigram Search::Fulltext::Tokenizer::Trigram

AUTHOR

Koichi SATOH <sekia@cpan.org>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2014 by Koichi SATOH.

This is free software, licensed under:

The MIT (X11) License

1 POD Error

The following errors were encountered while parsing the POD:

Around line 63:

Non-ASCII character seen before =encoding in ''ハンプティ・ダンプティ'. Assuming UTF-8