NAME
Lingua::CJK::Tokenizer - CJK Tokenizer
SYNOPSIS
use Lingua::CJK::Tokenizer;

my $tknzr = Lingua::CJK::Tokenizer->new();
$tknzr->ngram_size(5);
$tknzr->max_token_count(100);
my $tokens_ref = $tknzr->tokenize("CJK Text");
$tokens_ref = $tknzr->segment("CJK Text");
$tokens_ref = $tknzr->split("CJK Text");
my $flag = $tknzr->has_cjk("CJK Text");
$flag = $tknzr->has_cjk_only("CJK Text");
DESCRIPTION
This module tokenizes Chinese, Japanese, and Korean (CJK) text into n-grams.
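A minimal end-to-end sketch; the sample string, the bigram size, and the printed output shape are illustrative assumptions rather than guaranteed results:

use utf8;                                  # allow literal CJK characters in the source
use Lingua::CJK::Tokenizer;

binmode STDOUT, ":encoding(UTF-8)";

my $tknzr = Lingua::CJK::Tokenizer->new();
$tknzr->ngram_size(2);                     # ask for bigrams
my $tokens_ref = $tknzr->tokenize("中文資訊處理");
print join(" ", @$tokens_ref), "\n";       # one line of bigram tokens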
METHODS
ngram_size
Sets the size of the returned n-grams.
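For example, to request bigrams (the value 2 is only illustrative), given a tokenizer object $tknzr as in the SYNOPSIS:

$tknzr->ngram_size(2);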
max_token_count
Sets an upper limit on the number of n-grams returned, which is useful when the input text is very long or of unbounded length.
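For example, to stop after at most 100 n-grams (an illustrative value):

$tknzr->max_token_count(100);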
tokenize
Tokenizes text into n-grams and returns them as an array reference.
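A hedged sketch, assuming a tokenizer object $tknzr as in the SYNOPSIS; the input string is illustrative and the exact tokens depend on the configured n-gram size:

my $tokens_ref = $tknzr->tokenize("中文資訊");
for my $token (@$tokens_ref) {
    print "$token\n";                      # one n-gram per line
}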
segment
Cuts CJK text into chunks.
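A sketch of the call only; the mixed input string is illustrative, and the assumption that an array reference of chunks comes back follows from the SYNOPSIS:

my $chunks_ref = $tknzr->segment("Perl 與 中文 mixed text");
print scalar(@$chunks_ref), " chunks\n";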
split
Tokenizes text into uni-grams (single characters).
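A sketch, assuming a tokenizer object $tknzr as in the SYNOPSIS; the input string is illustrative:

my $chars_ref = $tknzr->split("中文字");
# expected: one element per character, e.g. three uni-grams here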
has_cjk
Returns true if the text contains CJK characters.
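For example (the sample string is illustrative):

print "contains CJK\n" if $tknzr->has_cjk("Perl 與 中文");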
has_cjk_only
Returns true if the text contains only CJK characters.
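For example (the sample strings are illustrative):

print "pure CJK\n"     if  $tknzr->has_cjk_only("中文");
print "not pure CJK\n" if !$tknzr->has_cjk_only("Perl 中文");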
PREREQUISITE
This module requires libunicode by Tom Tromey.
COPYRIGHT
Copyright (c) 2009 Yung-chung Lin.
This program is free software; you can redistribute it and/or modify it under the MIT License.