NAME

ShiftJIS::Regexp - Perl module for using Shift_JIS-oriented regexps in byte-oriented perl.

SYNOPSIS

use ShiftJIS::Regexp qw(:all);

match('あお１２', '\p{InHiragana}{2}\p{IsDigit}{2}');
match('あいいううう', '^あい+う{3}$');
replace($str, 'A', 'Ａ', 'g');

DESCRIPTION

This module provides functions for using Shift_JIS-oriented regexps in byte-oriented perl.

A legal Shift_JIS character in this module must match the following regexp:

[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]

FUNCTIONS

issjis(STRING)

Returns a boolean indicating whether the string is legally encoded in Shift_JIS.
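
As a rough illustration of what such a check involves, the sketch below applies the regexp from DESCRIPTION to a whole byte string using only core Perl. The helper name looks_like_sjis is hypothetical and is not part of this module; issjis() itself may be implemented differently.

```perl
use strict;
use warnings;

# One legal character, copied from the regexp given in DESCRIPTION.
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper (not the module's issjis): true when the whole
# byte string is a sequence of legal characters.
sub looks_like_sjis {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:$sjis_char)*\z/ ? 1 : 0;
}

print looks_like_sjis("\x82\xA0\x82\xA2"), "\n";   # two double-byte characters: 1
print looks_like_sjis("abc\xB1"),          "\n";   # ASCII plus a single-byte kana: 1
print looks_like_sjis("\x82"),             "\n";   # lead byte with no trail byte: 0
print looks_like_sjis("\x82\x7F"),         "\n";   # 0x7F is not a legal trail byte: 0
```

The strings are written with \xhh escapes so that the sketch does not depend on the encoding of the script file.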

re(PATTERN)
re(PATTERN, MODIFIER)

Returns a regexp that byte-oriented perl can parse.

PATTERN is specified as a string.

MODIFIER is specified as a string.

i  do case-insensitive pattern matching (only for ascii alphabets)
s  treat string as single line
m  treat string as multiple lines
x  ignore whitespace (i.e. [ \n\r\t\f], but not comments!)
   unless backslashed or inside a character class

match(STRING, PATTERN)
match(STRING, PATTERN, MODIFIER)

Emulates the m// operator for the Shift_JIS encoding.

PATTERN is specified as a string.

MODIFIER is specified as a string.

i  do case-insensitive pattern matching (only for ascii alphabets)
s  treat string as single line
m  treat string as multiple lines
x  ignore whitespace (i.e. [ \n\r\t\f], but not comments!)
   unless backslashed or inside a character class
g  match globally
z  tell the function that the pattern can match a zero-length substring
      (required because automatic detection of this case is unreliable)

replace(STRING or SCALAR REF, PATTERN, REPLACEMENT)
replace(STRING or SCALAR REF, PATTERN, REPLACEMENT, MODIFIER)

Emulates the s/// operator for the Shift_JIS encoding.

If a reference to a scalar variable is given as the first argument, returns the number of substitutions made. If a string is given as the first argument, returns the substituted string and leaves the given string unchanged.

my $d = '\p{IsDigit}';
my $str = '金１５３００００円';
1 while replace(\$str, "($d)($d$d$d)(?!$d)", '$1，$2');
print $str; # 金１，５３０，０００円

MODIFIER is specified as a string.

i  do case-insensitive pattern matching (only for ascii alphabets)
s  treat string as single line
m  treat string as multiple lines
x  ignore whitespace (i.e. [ \n\r\t\f], but not comments!)
   unless backslashed or inside a character class
g  match globally
z  tell the function that the pattern can match a zero-length substring
      (required because automatic detection of this case is unreliable)

jsplit(PATTERN, STRING)
jsplit(PATTERN, STRING, LIMIT)

This function emulates CORE::split.

Outside list context, these functions return only the number of fields found; they do not split into the @_ array.

Unlike in CORE::split, ' ' as PATTERN has no special meaning. To split a string on whitespace, use the splitspace() function.

PATTERN should be specified as a string.

jsplit('／', 'あいう／えおメ^');

splitspace(STRING)
splitspace(STRING, LIMIT)

This function emulates CORE::split ' ', STRING: it splits the string on whitespace, including IDEOGRAPHIC SPACE, and returns the resulting list. Leading whitespace characters do not produce any field.

splitchar(STRING)
splitchar(STRING, LIMIT)

This function emulates CORE::split //, STRING: it splits the supplied string into characters and returns the resulting list.
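
The idea can be sketched in core Perl by collecting successive matches of the character regexp from DESCRIPTION. The name chars_of is hypothetical, the sketch assumes the input is already legal Shift_JIS, and it ignores LIMIT; splitchar() may work differently internally.

```perl
use strict;
use warnings;

# One legal character, copied from the regexp given in DESCRIPTION.
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: split a legal Shift_JIS byte string into characters.
# \G forces each match to start exactly where the previous one ended.
sub chars_of {
    my ($bytes) = @_;
    return $bytes =~ /\G$sjis_char/g;
}

# "A", the single-byte kana 0xB1, and the double-byte character 0x8341,
# whose trail byte happens to be the ASCII letter "A".
my @chars = chars_of("A\xB1\x83\x41");
print scalar(@chars), "\n";                       # 3
print join(' ', map { length } @chars), "\n";     # 1 1 2
```

A plain split // in byte-oriented perl would instead return four one-byte fields for the same input.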

REGEXPS

regexp          meaning

^               match the start of the string
                match the start of any line with 'm' modifier

$               match the end of the string
                match the end of any line with 'm' modifier

.               match any character except \n
                match any character with 's' modifier

\C              match a single C char (octet), i.e. [\0-\xFF] in perl.
\j              match any character, i.e. [\0-\x{FCFC}] in this module.
\J              match any character except \n, i.e. [^\n] in this module.

  * \j and \J are extensions by this module. e.g.

     match($_, '(\j{5})\z') returns last five chars including \n at the end
     match($_, '(\J{5})\Z') returns last five chars excluding \n at the end

\a              alarm      (BEL)
\b              backspace  (BS) * within character classes *
\t              tab        (HT, TAB)
\n              newline    (LF, NL)
\f              form feed  (FF)
\r              return     (CR)
\e              escape     (ESC)

\0              null       (NUL)

\ooo            octal single-byte character
\xhh            hexadecimal single-byte character
\x{hhhh}        hexadecimal double-byte character
\c[             control character

   e.g. \012 \123 \x5c \x5C \x{824F} \x{9Fae} \cA \cZ \c^ \c?
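
To make the \x{hhhh} notation concrete: the four hex digits are the lead byte followed by the trail byte. The following core-Perl sketch converts a code value to the bytes it denotes; the helper name sjis_bytes is hypothetical and not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: the bytes denoted by a code value as written
# in \xhh (single-byte) or \x{hhhh} (double-byte) notation.
sub sjis_bytes {
    my ($code) = @_;
    return $code > 0xFF ? pack('n', $code)    # double-byte: lead byte, then trail byte
                        : pack('C', $code);   # single-byte
}

print sjis_bytes(0x824F) eq "\x82\x4F" ? "ok\n" : "not ok\n";   # ok
print sjis_bytes(0x5C)   eq "\\"       ? "ok\n" : "not ok\n";   # ok
```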

regexp           equivalent character class

\d               [\d]              [0-9]
\D               [\D]              [^0-9]
\w               [\w]              [0-9A-Z_a-z]
\W               [\W]              [^0-9A-Z_a-z]
\s               [\s]              [\t\n\r\f ]
\S               [\S]              [^\t\n\r\f ]

\p{IsDigit}      [[:digit:]]       [0-9０-９]
\P{IsDigit}      [[:^digit:]]      [^0-9０-９]
\p{IsUpper}      [[:upper:]]       [A-ZＡ-Ｚ]
\P{IsUpper}      [[:^upper:]]      [^A-ZＡ-Ｚ]
\p{IsLower}      [[:lower:]]       [a-zａ-ｚ]
\P{IsLower}      [[:^lower:]]      [^a-zａ-ｚ]
\p{IsAlpha}      [[:alpha:]]       [A-Za-zＡ-Ｚａ-ｚ]
\P{IsAlpha}      [[:^alpha:]]      [^A-Za-zＡ-Ｚａ-ｚ]
\p{IsAlnum}      [[:alnum:]]       [0-9A-Za-z０-９Ａ-Ｚａ-ｚ]
\P{IsAlnum}      [[:^alnum:]]      [^0-9A-Za-z０-９Ａ-Ｚａ-ｚ]

\p{IsWord}       [[:word:]]
       [0-9A-Z_a-z０-９Ａ-Ｚａ-ｚΑ-Ωα-ωА-Яа-яｦ-ﾟぁ-んァ-ヶ゛゜ヽ-ー亜-腕弌-熙]
\P{IsWord}       [[:^word:]]
       [^0-9A-Z_a-z０-９Ａ-Ｚａ-ｚΑ-Ωα-ωА-Яа-яｦ-ﾟぁ-んァ-ヶ゛゜ヽ-ー亜-腕弌-熙]

\p{IsPunct}      [[:punct:]]
             [!-/:-@[-`{-~｡-･、-！´-＿―-〓∈-∩∧-∃∠-∬Å-¶◯─-╂]
\P{IsPunct}      [[:^punct:]]
             [^!-/:-@[-`{-~｡-･、-！´-＿―-〓∈-∩∧-∃∠-∬Å-¶◯─-╂]
\p{IsSpace}      [[:space:]]       [\t\n\r\f \x{8140}]
\P{IsSpace}      [[:^space:]]      [^\t\n\r\f \x{8140}]
\p{IsGraph}      [[:graph:]]       [^\0- \x7F\x{8140}]
\P{IsGraph}      [[:^graph:]]      [\0- \x7F\x{8140}]
\p{IsPrint}      [[:print:]]       [^\x00-\x08\x0B\x0E-\x1F\x7F]
\P{IsPrint}      [[:^print:]]      [\x00-\x08\x0B\x0E-\x1F\x7F]
\p{IsCntrl}      [[:cntrl:]]       [\x00-\x1F]
\P{IsCntrl}      [[:^cntrl:]]      [^\x00-\x1F]

\p{IsAscii}      [[:ascii:]]       [\x00-\x7F]
\P{IsAscii}      [[:^ascii:]]      [^\x00-\x7F]
\p{IsHankaku}    [[:hankaku:]]     [\xA1-\xDF]
\P{IsHankaku}    [[:^hankaku:]]    [^\xA1-\xDF]
\p{IsZenkaku}    [[:zenkaku:]]     [\x{8140}-\x{FCFC}]
\P{IsZenkaku}    [[:^zenkaku:]]    [^\x{8140}-\x{FCFC}]

\p{InLatin}      [[:latin:]]       [A-Za-z]
\P{InLatin}      [[:^latin:]]      [^A-Za-z]
\p{InFullLatin}  [[:fulllatin:]]   [Ａ-Ｚａ-ｚ]
\P{InFullLatin}  [[:^fulllatin:]]  [^Ａ-Ｚａ-ｚ]
\p{InGreek}      [[:greek:]]       [Α-Ωα-ω]
\P{InGreek}      [[:^greek:]]      [^Α-Ωα-ω]
\p{InCyrillic}   [[:cyrillic:]]    [А-Яа-я]
\P{InCyrillic}   [[:^cyrillic:]]   [^А-Яа-я]
\p{InHalfKana}   [[:halfkana:]]    [ｦ-ﾟ]
\P{InHalfKana}   [[:^halfkana:]]   [^ｦ-ﾟ]
\p{InHiragana}   [[:hiragana:]]    [ぁ-ん゛゜ゝゞ]
\P{InHiragana}   [[:^hiragana:]]   [^ぁ-ん゛゜ゝゞ]
\p{InKatakana}   [[:katakana:]]    [ァ-ヶーヽヾ]
\P{InKatakana}   [[:^katakana:]]   [^ァ-ヶーヽヾ]
\p{InFullKana}   [[:fullkana:]]    [ぁ-んァ-ヶ゛゜ーゝゞヽヾ]
\P{InFullKana}   [[:^fullkana:]]   [^ぁ-んァ-ヶ゛゜ーゝゞヽヾ]
\p{InKana}       [[:kana:]]        [ｦ-ﾟぁ-んァ-ヶ゛゜ーゝゞヽヾ]
\P{InKana}       [[:^kana:]]       [^ｦ-ﾟぁ-んァ-ヶ゛゜ーゝゞヽヾ]
\p{InKanji1}     [[:kanji1:]]      [亜-腕]
\P{InKanji1}     [[:^kanji1:]]     [^亜-腕]
\p{InKanji2}     [[:kanji2:]]      [弌-熙]
\P{InKanji2}     [[:^kanji2:]]     [^弌-熙]
\p{InKanji}      [[:kanji:]]       [〃-〇亜-腕弌-熙]
\P{InKanji}      [[:^kanji:]]      [^〃-〇亜-腕弌-熙]
\p{InBoxDrawing} [[:boxdrawing:]]  [─-╂]
\P{InBoxDrawing} [[:^boxdrawing:]] [^─-╂]

* In \p{Prop} or \P{Prop} expressions, 'Is' or 'In' can be omitted,
  as in \p{Digit} or \P{Kanji}.
  (Omitting 'In' is an extension by this module.)

Character class

Ranges in character classes are supported.

The order of Shift_JIS characters is: 0x00 .. 0x7F, 0xA1 .. 0xDF, 0x8140 .. 0x9FFC, 0xE040 .. 0xFCFC.

So [\0-\x{fcfc}] matches any one Shift_JIS character.
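
Read numerically, this order is simply the byte value for a single-byte character and lead byte * 256 + trail byte for a double-byte character. The sketch below sorts a few characters that way in core Perl; the helper name sjis_code is hypothetical, and the module's own range handling may be implemented differently.

```perl
use strict;
use warnings;

# Hypothetical helper: numeric position of one character in the order above.
sub sjis_code {
    my ($ch) = @_;
    my @b = unpack 'C*', $ch;
    return @b == 2 ? ($b[0] << 8) | $b[1] : $b[0];
}

# 0x889F, single-byte kana 0xB1, ASCII "A" (0x41), and 0x82A0.
my @chars  = ("\x88\x9F", "\xB1", "A", "\x82\xA0");
my @sorted = sort { sjis_code($a) <=> sjis_code($b) } @chars;

# Prints: 41 B1 82A0 889F
print join(' ', map { sprintf '%X', sjis_code($_) } @sorted), "\n";
```

A plain bytewise sort would place the single-byte kana 0xB1 last, after both double-byte characters, which is why ranges need this character-level order.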

In character classes, any character or byte sequence that does not match any one Shift_JIS character, e.g. re('[\xA0-\xFF]'), causes the function to croak.

Character classes that match a non-Shift_JIS substring are not supported (use \C or alternation instead).

Code embedded in regexp

(?{ ... }) and (??{ ... }) assertions are parsed without any special handling of double-byte characters.

Perl disallows (?{ ... }) assertions in the match() and replace() functions for security reasons. Use them via the re() function inside your own scope.

use ShiftJIS::Regexp qw(:all);

use re 'eval';

$::res = 0;
$_ = 'ポ' x 8;

my $regex = re(q/
     \j*?
     (?{ $cnt = 0 })
     (
       ポ (?{ local $cnt = $cnt + 1; })
     )*
     ポポポ
     (?{ $::res = $cnt })
   /, 'x');

/$regex/;
print $::res; # 5

CAVEAT

A legal Shift_JIS character in this module must match the following regexp:

[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]

Any string from an external resource should be checked with the issjis() function, unless you know for certain that it is encoded in Shift_JIS. If an illegal Shift_JIS string is given, the result is unpredictable.

Some Shift_JIS double-byte characters have one of [\x40-\x7E] as the trail byte.

@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

The Perl lexer takes no special care of these characters, so they sometimes cause trouble. For example, the quoted literal "表" (bytes 0x95 0x5C) causes a fatal error, since its trail byte 0x5C escapes the closing quote.

This problem does not arise when the string comes from an external resource. But writing a script that contains Shift_JIS double-byte characters requires the greatest care.

To define a Shift_JIS string literal, a single-quoted heredoc (<< '') or \xhh escapes are recommended.
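
A minimal core-Perl sketch of both recommendations follows. It writes the double-byte character 0x955C with \xhh escapes, and it uses an ASCII-only heredoc body (an assumption made so the sketch does not depend on the encoding of the script file) to show that a single-quoted heredoc leaves backslashes alone.

```perl
use strict;
use warnings;

# 0x955C written with \xhh escapes, so no raw 0x5C byte appears in the source.
my $ch = "\x95\x5C";

print length($ch), "\n";                                   # 2
print substr($ch, 1, 1) eq "\\" ? "trail byte is a backslash\n" : "other\n";

# A single-quoted heredoc takes its body literally, so a trail byte such
# as 0x5C would not act as an escape there. (ASCII body shown here.)
my $text = <<'EOT';
C:\path\name
EOT
print $text;                                               # C:\path\name
```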

The safe ASCII-graphic characters, [\x21-\x3F], are:

!"#$%&'()*+,-./0123456789:;<=>?

They are preferred as delimiters of quote-like operators.

BUGS

\U, \L, \Q, \E, and interpolation are not supported. If necessary, use them in "" (or qq//) operators in the argument list.

The word boundary assertions \b and \B do not work correctly.

A look-behind assertion such as (?<=[A-Z]) is not prevented from matching the trail byte of the preceding double-byte character: e.g. match("アイウ", '(?<=[A-Z])(\p{InKana})') returns ('イ').
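
The cause can be seen with core Perl alone. In the sketch below the same three characters are written as bytes (0x8341 0x8343 0x8345); the trail byte of the first character is 0x41, which byte-oriented perl treats as the ASCII letter "A".

```perl
use strict;
use warnings;

# 0x8341 0x8343 0x8345: three double-byte katakana written as bytes.
my $str = "\x83\x41\x83\x43\x83\x45";

# Byte-oriented perl sees the trail byte 0x41 as the ASCII letter "A",
# so the look-behind succeeds in front of the second character.
if ($str =~ /(?<=[A-Z])/) {
    print "look-behind matched at byte offset $-[0]\n";   # offset 2
}
```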

Using a non-greedy regexp that can match the empty string, such as .?? or \d*?, as the PATTERN in jsplit() may cause the emulation of CORE::split to fail.

AUTHOR

Tomoyuki SADAHIRO

bqw10602@nifty.com
http://homepage1.nifty.com/nomenclator/perl/

This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

SEE ALSO

perl(1).
