NAME
Lingua::JA::Summarize - A keyword extractor / summary generator
SYNOPSIS
# Functional style
use Lingua::JA::Summarize qw(:all);
@keywords = keyword_summary('You need longer text to obtain keywords');
print join(' ', @keywords) . "\n";
# OO style
use Lingua::JA::Summarize;
$s = Lingua::JA::Summarize->new;
$s->analyze('You need longer text to obtain keywords');
$s->analyze_file('filename_to_analyze.txt');
@keywords = $s->keywords;
print join(' ', @keywords) . "\n";
DESCRIPTION
Lingua::JA::Summarize is a keyword extractor / summary generator for Japanese texts. It uses MeCab to split the text into words and then extracts the keywords.
CONSTRUCTOR
- new()
- new({ params })
You may provide behaviour parameters through a hashref.
Example: new({ mecab => '/usr/local/mecab/bin/mecab' })
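As a minimal sketch, the call below passes several behaviour parameters at once. It assumes that the hashref keys are the property names listed under CONTROLLING THE BEHAVIOUR (only mecab is confirmed by the example above), and the path is a placeholder for your own installation.

```perl
use strict;
use warnings;
use Lingua::JA::Summarize;

# Assumption: hashref keys match the accessor names documented under
# CONTROLLING THE BEHAVIOUR. Adjust the mecab path to your installation.
my $s = Lingua::JA::Summarize->new({
    mecab        => '/usr/local/mecab/bin/mecab',
    default_cost => 1000,
    omit_number  => 1,
});

# Called without arguments, an accessor returns the current value.
print "mecab path: ", $s->mecab, "\n";
```

Parameters that are not given keep their documented defaults.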
ANALYZING TEXT
- analyze($string)
- analyze_file($filename)
Use either of these functions to analyze text. Both throw an error on failure.
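Because both functions throw on failure, a caller may want to trap the error. The following is a sketch using Perl's built-in eval; 'input.txt' is a placeholder file name, and the exact error message is whatever the module raises.

```perl
use strict;
use warnings;
use Lingua::JA::Summarize;

my $s = Lingua::JA::Summarize->new;

# 'input.txt' is a placeholder; replace it with the file to analyze.
my $ok = eval {
    $s->analyze_file('input.txt');
    1;    # reached only if analyze_file did not throw
};

if ($ok) {
    print join(' ', $s->keywords), "\n";
}
else {
    warn "analysis failed: $@";
}
```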
OBTAINING KEYWORDS
- keywords($name)
- keywords($name, { params })
Returns an array of keywords. The following parameters control the output.
- maxwords
Maximum number of words. The default is 5.
- threshold
Minimum calculated significance value a word must reach to be treated as a keyword.
CONTROLLING THE BEHAVIOUR
Use the member functions described below to control the behaviour of the analyzer.
- alnum_as_word([boolean])
Sets or retrieves a flag indicating whether to keep a word consisting of alphabetic and numeric characters as a single word instead of splitting it. Also controls splitting at apostrophes.
If set to false, "O'reilly" would be treated as "o reilly", "30boxes" as "30 boxes".
The default is true.
- default_cost([number])
Sets or retrieves the default cost applied for unknown words. The default is 800.
- jaascii_as_word([boolean])
Sets or retrieves a flag indicating whether an ASCII word and an adjacent Japanese word are treated as a single word. The default is true.
If set to true, strings like "ǧ¾Úapi" and "lamda´Ø¿ô" are treated as single words.
- mecab([mecab_path])
Sets or retrieves the path to the mecab executable. The default is "mecab".
- ng([ng_words])
Sets or retrieves a hashref listing the words to omit. The default is generated by the Lingua::JA::Summarize::NG function.
- omit_number([boolean])
Sets or retrieves a flag indicating whether or not to omit numbers.
- singlechar_factor([number])
Sets or retrieves a factor value to be used for calculating weight of single-character words. The default is 0.5.
- url_as_word([boolean])
Sets or retrieves a flag indicating whether or not to treat URLs as single words.
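The member functions above act as combined getters and setters. A short sketch of how they might be used on an analyzer object follows; the values are arbitrary illustrations, not recommendations.

```perl
use strict;
use warnings;
use Lingua::JA::Summarize;

my $s = Lingua::JA::Summarize->new;

# With an argument, the member function sets the value.
$s->alnum_as_word(0);        # "30boxes" would be treated as "30 boxes"
$s->singlechar_factor(0.3);  # lower the weight of single-character words
$s->url_as_word(1);          # treat URLs as single words

# Without an argument, it returns the current value.
printf "singlechar_factor: %s\n", $s->singlechar_factor;
printf "default_cost: %s\n",      $s->default_cost;
```

Changing the settings before calling analyze or analyze_file is the safe order, since the document does not say whether later changes affect text that has already been analyzed.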
STATIC FUNCTIONS
- keyword_summary($text)
- keyword_summary($text, { params })
Given a text to analyze, returns an array of keywords. Any of the properties described in the CONTROLLING THE BEHAVIOUR section, as well as the parameters of the keywords member function, may be set as parameters.
- NG()
Returns a default hashref containing NG words.
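As a sketch of the functional interface, the call below mixes one parameter of the keywords member function (maxwords) with one behaviour property (omit_number) in a single hashref. The sample text is the placeholder from the SYNOPSIS; real input should be a longer Japanese text. NG is called by its fully qualified name, since whether it is exported by :all is not stated here.

```perl
use strict;
use warnings;
use Lingua::JA::Summarize qw(:all);

# Placeholder text taken from the SYNOPSIS.
my $text = 'You need longer text to obtain keywords';

my @keywords = keyword_summary($text, {
    maxwords    => 3,   # parameter of the keywords member function
    omit_number => 1,   # property from CONTROLLING THE BEHAVIOUR
});
print join(' ', @keywords), "\n";

# The default NG words are returned as a hashref.
my $ng = Lingua::JA::Summarize::NG();
print scalar(keys %$ng), " entries in the default NG hashref\n";
```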
AUTHOR
Kazuho Oku <kazuhooku ___at___ gmail.com>
ACKNOWLEDGEMENTS
Thanks to Takesako-san for writing the prototype.
COPYRIGHT
Copyright (C) 2006 Cybozu Labs, Inc.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.7 or, at your option, any later version of Perl 5 you may have available.