NAME

Lingua::JA::NormalizeText - Text Normalizer

SYNOPSIS

use Lingua::JA::NormalizeText;
use utf8;

my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
my $normalizer = Lingua::JA::NormalizeText->new(@options);

print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
# -> 鳥がトンドルです♥

sub dearinsu_to_desu
{
    my $text = shift;
    $text =~ s/でありんす/です/g;

    return $text;
}

# or

use Lingua::JA::NormalizeText qw/old2new_kanji/;
use utf8;

print old2new_kanji('惡の華');
# -> 悪の華

DESCRIPTION

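Lingua::JA::NormalizeText normalizes Japanese text by applying a user-selected sequence of options, such as Unicode normalization, full-width/half-width conversion, kana conversion, and whitespace cleanup.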

METHODS

new(@options)

Creates a new Lingua::JA::NormalizeText instance.

The following options are available:

OPTION                 SAMPLE INPUT           OUTPUT FOR SAMPLE INPUT
---------------------  ---------------------  -----------------------
lc                     DdD                    ddd
uc                     DdD                    DDD
nfkc                   ㌦                     ドル (length: 2)
nfkd                   ㌦                     ドル (length: 3)
nfc
nfd
decode_entities        &hearts;               ♥
strip_html             <em>あ</em>            あ
alnum_z2h              ＡＢＣ１２３           ABC123
alnum_h2z              ABC123                 ＡＢＣ１２３
space_z2h
space_h2z
katakana_z2h           ハァハァ               ﾊｧﾊｧ
katakana_h2z           ｽｰﾊｰｽｰﾊｰ               スーハースーハー
katakana2hiragana      パンツ                 ぱんつ
hiragana2katakana      ぱんつ                 パンツ
wave2tilde             〜, 〰                 ～
tilde2wave             ～                     〜
wavetilde2long         〜, 〰, ～             ー
wave2long              〜, 〰                 ー
tilde2long             ～                     ー
fullminus2long         －                     ー
dashes2long            —                      ー
drawing_lines2long     ─                      ー
unify_long_repeats     ヴァーーー             ヴァー
nl2space               (LF)(CR)(CRLF)         (space)(space)(space)
unify_nl               (LF)(CR)(CRLF)         \n\n\n
unify_long_spaces      あ(space)(space)あ     あ(space)あ
unify_whitespaces      \x{00A0}               (space)
trim                   (space)あ(space)あ(space)  あ(space)あ
ltrim                  (space)あ(space)       あ(space)
rtrim                  ああ(space)(space)     ああ
old2new_kana           ゐヰゑヱヸヹ           いイえエイ゙エ゙
old2new_kanji          亞逸鬭                 亜逸闘
tab2space              (tab)(tab)             (space)(space)
remove_controls        あ\x{0000}あ           ああ
remove_spaces          (space)あ(space)あ(space)  ああ
dakuon_normalize       さ\x{3099}             ざ
handakuon_normalize    は\x{309A}             ぱ
all_dakuon_normalize   さ\x{3099}は\x{309A}   ざぱ
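
The length difference in the nfkc and nfkd rows comes from Unicode normalization itself. The following minimal sketch reproduces it with the core Unicode::Normalize module (listed under SEE ALSO). That the nfkc and nfkd options behave like NFKC() and NFKD() is an assumption based on the table, not something this document states.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw/NFKC NFKD/;

binmode STDOUT, ':encoding(UTF-8)';

my $square_doru = '㌦';

# NFKC expands the compatibility character and keeps ド as one composed code point.
my $composed = NFKC($square_doru);

# NFKD leaves the voiced sound mark (U+3099) as a separate code point after ト.
my $decomposed = NFKD($square_doru);

printf "NFKC: %s (length: %d)\n", $composed,   length $composed;
printf "NFKD: %s (length: %d)\n", $decomposed, length $decomposed;
```

The two results look the same when displayed but compare as different strings, which matters when the normalized text is used as a hash key or search term.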

Options are applied in the order in which they appear in @options: the first element is applied first, and the last element is applied last.

You can also pass references to your own functions as options. (See the dearinsu_to_desu function in the SYNOPSIS section.)
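
A minimal sketch of why the order matters, using only options from the table above. The add_brackets function is a hypothetical user-defined step invented for this illustration; it is not part of the module.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::NormalizeText;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical user-defined step: wraps the text in corner brackets.
sub add_brackets
{
    my $text = shift;

    return '「' . $text . '」';
}

# lc runs first, then uc, so uc decides the final result.
my $lc_then_uc = Lingua::JA::NormalizeText->new(qw/lc uc/);

# uc runs first, then lc, so lc decides the final result.
my $uc_then_lc = Lingua::JA::NormalizeText->new(qw/uc lc/);

# A code reference is placed in the list like any other option; here it runs after trim.
my $with_custom = Lingua::JA::NormalizeText->new('trim', \&add_brackets);

print $lc_then_uc->normalize('DdD'), "\n";
print $uc_then_lc->normalize('DdD'), "\n";
print $with_custom->normalize(' あ あ '), "\n";
```

Going by the sample outputs in the table, the first two calls should print DDD and ddd, and the third should print the trimmed text inside the brackets.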

normalize($text)

Normalizes $text by applying the options passed to new() and returns the result.

OPTIONS

dashes2long

Note that this option does not convert hyphens into long (ー, U+30FC).

drawing_lines2long

This option converts drawing lines that look similar to long (ー, U+30FC).

unify_long_spaces

Note that this option unifies only SPACE(U+0020) and IDEOGRAPHIC SPACE(U+3000).

remove_controls

Note that this option does not remove the following characters:

CHARACTER TABULATION
LINE FEED
CARRIAGE RETURN
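
A short usage sketch of this exception list. The input string is made up for illustration, and the code points are printed because the characters involved are invisible.

```perl
use strict;
use warnings;
use utf8;
use Lingua::JA::NormalizeText;

my $normalizer = Lingua::JA::NormalizeText->new('remove_controls');

# NUL (U+0000) is a control character and should be removed;
# the tab and the line feed are on the exception list and should be kept.
my $cleaned = $normalizer->normalize("あ\x{0000}\tあ\n");

# Show code points so the surviving characters are visible.
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $cleaned), "\n";
```

Going by the table entry and the list above, this should print U+3042 U+0009 U+3042 U+000A.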

remove_spaces

Note that this option removes only SPACE(U+0020) and IDEOGRAPHIC SPACE(U+3000).

unify_whitespaces

This option converts the following characters into SPACE(U+0020).

LINE TABULATION
FORM FEED
NEXT LINE
NO-BREAK SPACE
OGHAM SPACE MARK
MONGOLIAN VOWEL SEPARATOR
EN QUAD
EM QUAD
EN SPACE
EM SPACE
THREE-PER-EM SPACE
FOUR-PER-EM SPACE
SIX-PER-EM SPACE
FIGURE SPACE
PUNCTUATION SPACE
THIN SPACE
HAIR SPACE
LINE SEPARATOR
PARAGRAPH SEPARATOR
NARROW NO-BREAK SPACE
MEDIUM MATHEMATICAL SPACE

Note that this option does not convert the following characters:

CHARACTER TABULATION
LINE FEED
CARRIAGE RETURN
IDEOGRAPHIC SPACE
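
The mapping described above can be sketched in plain Perl. This is not the module's implementation: unify_whitespaces_sketch is a hypothetical helper that only mirrors the two lists, with the code points filled in from the Unicode character names.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::JA::NormalizeText.
sub unify_whitespaces_sketch
{
    my $text = shift;

    # Code points of the characters in the first list. The replacement list
    # is shorter than the search list, so tr/// maps every one of them to U+0020.
    $text =~ tr/\x{000B}\x{000C}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}/ /;

    return $text;
}

# NO-BREAK SPACE and EM SPACE are converted; IDEOGRAPHIC SPACE, tab, and LF are not.
my $input  = "a\x{00A0}b\x{2003}c\x{3000}d\te\n";
my $output = unify_whitespaces_sketch($input);

# Print code points so the invisible characters can be checked.
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $output), "\n";
```

Keeping tab, LF, CR, and IDEOGRAPHIC SPACE out of the mapping leaves them available for the dedicated options (tab2space, nl2space, unify_nl, space_z2h), so each conversion can be ordered separately.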

AUTHOR

pawa <pawapawa@cpan.org>

SEE ALSO

Table of old and new kanji forms (新旧字体表): http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html

Lingua::JA::Regular::Unicode

Lingua::JA::Dakuon

Lingua::JA::Moji

Unicode::Normalize

HTML::Entities

HTML::Scrubber

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.