NAME

Lingua::CJK::Tokenizer - CJK Tokenizer

SYNOPSIS

use Lingua::CJK::Tokenizer;

my $tknzr = Lingua::CJK::Tokenizer->new();
$tknzr->ngram_size(5);           # size of the returned n-grams
$tknzr->max_token_count(100);    # upper limit on the number of returned n-grams
my $tokens_ref = $tknzr->tokenize("CJK Text");
$tokens_ref = $tknzr->segment("CJK Text");
$tokens_ref = $tknzr->split("CJK Text");
my $flag = $tknzr->has_cjk("CJK Text");
$flag = $tknzr->has_cjk_only("CJK Text");

DESCRIPTION

This module tokenizes CJK (Chinese, Japanese, and Korean) text into n-grams.
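
For readers unfamiliar with the term, the following pure-Perl sketch shows one common reading of "character n-grams": overlapping windows of n consecutive characters. The helper name char_ngrams is hypothetical and is not part of this module, and the module's own tokenizer may treat whitespace, punctuation, and non-CJK runs differently.

```perl
use strict;
use warnings;
use utf8;    # the source below contains literal CJK characters

# Hypothetical helper, NOT part of Lingua::CJK::Tokenizer.
# Returns a reference to the list of overlapping character n-grams of size $n.
sub char_ngrams {
    my ($text, $n) = @_;
    my @chars = split //, $text;    # one element per character
    my @ngrams;
    # Slide a window of $n characters across the text, one step at a time.
    for my $i (0 .. @chars - $n) {
        push @ngrams, join '', @chars[$i .. $i + $n - 1];
    }
    return \@ngrams;
}

my $bigrams = char_ngrams("東京都庁", 2);
# $bigrams now refers to ("東京", "京都", "都庁")
```

When the text is shorter than n, the loop range is empty and the sketch returns a reference to an empty list.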

METHODS

ngram_size

Sets the size of the returned n-grams.

max_token_count

Sets an upper limit on the number of returned n-grams, for cases where the input text is very long or of indefinite size.
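
As an illustration of why such a limit is useful, the sketch below stops producing n-grams once a cap is reached, so the work done is bounded no matter how long the input is. The helper capped_ngrams is hypothetical and only mimics the idea; how the module enforces its limit internally is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Lingua::CJK::Tokenizer.
# Produces overlapping n-grams of size $n but returns at most $max of them.
sub capped_ngrams {
    my ($text, $n, $max) = @_;
    my @ngrams;
    my $last_start = length($text) - $n;    # last valid window start
    for my $i (0 .. $last_start) {
        last if @ngrams >= $max;            # cap reached: stop early
        push @ngrams, substr($text, $i, $n);
    }
    return \@ngrams;
}

my $tokens = capped_ngrams("abcdef", 2, 3);
# $tokens now refers to ("ab", "bc", "cd")
```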

tokenize

Tokenizes text into n-grams.

segment

Cuts CJK text into chunks.

split

Tokenizes text into unigrams.

has_cjk

Returns true if the text contains CJK characters.

has_cjk_only

Returns true if the text contains only CJK characters.
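
The difference between the two checks can be sketched in pure Perl with Unicode script properties. This is only an approximation under a stated assumption: "CJK character" is taken to mean a character in the Han, Hiragana, Katakana, or Hangul script. The module may draw the boundary differently, and the names sketch_has_cjk and sketch_has_cjk_only are hypothetical.

```perl
use strict;
use warnings;

# Assumption: a "CJK character" is any Han, Hiragana, Katakana, or Hangul character.
my $CJK = qr/[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/;

# Hypothetical helpers, NOT part of Lingua::CJK::Tokenizer.
# True if at least one CJK character occurs anywhere in the text.
sub sketch_has_cjk {
    my ($text) = @_;
    return $text =~ /$CJK/ ? 1 : 0;
}

# True only if the text is non-empty and every character is CJK.
sub sketch_has_cjk_only {
    my ($text) = @_;
    return $text =~ /\A$CJK+\z/ ? 1 : 0;
}

my $mixed = "Perl \x{65E5}\x{672C}\x{8A9E}";    # ASCII followed by three Han characters
my $any   = sketch_has_cjk($mixed);             # 1: contains CJK
my $only  = sketch_has_cjk_only($mixed);        # 0: also contains ASCII
```

Both helpers expect decoded character strings, so text read from a file or socket should be decoded (for example as UTF-8) first.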

PREREQUISITE

This module requires libunicode by Tom Tromey.

COPYRIGHT

Copyright (c) 2009 Yung-chung Lin.

This program is free software; you can redistribute it and/or modify it under the terms of the MIT License.