NAME

ShiftJIS::Regexp - Shift_JIS-oriented regular expressions on byte-oriented perl

ABOUT THIS POD

This POD is written in Shift_JIS encoding.

Do you see 'あ' as HIRAGANA LETTER A, and '\' as YEN SIGN rather than REVERSE SOLIDUS? If not, switch to an appropriate font (or the POD may have been converted incorrectly).

SYNOPSIS

use ShiftJIS::Regexp qw(:all);

match('あお１２', '\p{Hiragana}{2}\p{Digit}{2}');
# that is equivalent to this:
match('あお１２', '\pH{2}\pD{2}');

match('あいいううう', '^あい+う{3}$');

replace($str, 'A', 'Ａ', 'g');

DESCRIPTION

This module provides functions for using Shift_JIS-oriented regular expressions on byte-oriented perl.

A legal Shift_JIS character in this module must match the following regular expression:

[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]
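
As a rough sketch, the expression above can be used directly in byte-oriented Perl to validate a byte string and to walk it character by character. The helper names is_legal_sjis and sjis_chars are hypothetical and are not part of this module.

```perl
use strict;
use warnings;

# One legal Shift_JIS character, exactly as defined above.
my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# True (1) if the whole byte string is a sequence of legal characters.
sub is_legal_sjis {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:$SJIS_CHAR)*\z/ ? 1 : 0;
}

# Split a byte string into characters; stops at the first illegal byte.
sub sjis_chars {
    my ($bytes) = @_;
    my @chars = ($bytes =~ /\G($SJIS_CHAR)/g);
    return @chars;
}

# HIRAGANA LETTER A, 'A', HALFWIDTH KATAKANA LETTER A
my $str   = "\x82\xA0" . "A" . "\xB1";
my @chars = sjis_chars($str);

print is_legal_sjis($str), "\n";   # 1
print scalar(@chars), "\n";        # 3
```

A lone lead byte such as "\x82", or a byte outside every range such as "\xA0", makes is_legal_sjis return 0.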

Functions

re(PATTERN)
re(PATTERN, MODIFIER)

Returns a regular expression parsable by the byte-oriented perl.

PATTERN is specified as a string.

MODIFIER is specified as a string.

i  case-insensitive pattern (only for ascii alphabets)
I  case-insensitive pattern (greek, cyrillic, fullwidth latin)
j  hiragana-katakana-insensitive pattern

s  treat string as single line
m  treat string as multiple lines
x  ignore whitespace (i.e. [ \n\r\t\f], but not comments!)
   unless backslashed or inside a character class

o  once parsed (not compiled!) and the result is cached internally.

re('^コンピューター?$') matches 'コンピューター' or 'コンピュータ'.

re('^らくだ$','j') matches 'らくだ', 'ラクダ', 'らクだ', etc.
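
How the module implements the j modifier is not described here. The sketch below only illustrates the correspondence between double-byte hiragana (0x829F to 0x82F1) and katakana (0x8340 to 0x8393) in Shift_JIS that such a modifier can rely on. hira_to_kata is a hypothetical helper, not a function of this module.

```perl
use strict;
use warnings;

# Map each double-byte hiragana to the katakana with the same reading;
# every other character passes through unchanged.
sub hira_to_kata {
    my ($bytes) = @_;
    my $char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;
    my $out  = '';
    while ($bytes =~ /\G($char)/g) {
        my $c = $1;
        if ($c =~ /\A\x82([\x9F-\xF1])\z/) {
            my $trail = 0x40 + ord($1) - 0x9F;
            $trail++ if $trail >= 0x7F;   # 0x7F is never a trail byte
            $c = "\x83" . chr($trail);
        }
        $out .= $c;
    }
    return $out;
}

# HIRAGANA RA, KU, DA  ->  KATAKANA RA, KU, DA
print unpack('H*', hira_to_kata("\x82\xE7\x82\xAD\x82\xBE")), "\n";   # 8389834e835f
```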

o modifier

while(<DATA>){
  print replace($_, '(perl)', '<strong>$1</strong>', 'igo');
}
   is more efficient than

while(<DATA>){
  print replace($_, '(perl)', '<strong>$1</strong>', 'ig');
}

because in the latter case the pattern is parsed again every time the function is called.
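
The effect of caching a parsed pattern can be sketched in plain Perl as follows. This is not the module's code: parse_pattern is a stand-in for an expensive parser, and parse_once and the %cache layout are assumptions made for illustration.

```perl
use strict;
use warnings;

my %cache;        # parsed results, keyed by modifier and pattern
my $parses = 0;   # counts how often the stand-in parser actually runs

# Stand-in for an expensive pattern parser (purely hypothetical).
sub parse_pattern {
    my ($pattern, $modifier) = @_;
    $parses++;
    return "(?$modifier)$pattern";
}

# Parse once per distinct (pattern, modifier) pair, then reuse the result.
sub parse_once {
    my ($pattern, $modifier) = @_;
    my $key = "$modifier\0$pattern";
    $cache{$key} = parse_pattern($pattern, $modifier)
        unless exists $cache{$key};
    return $cache{$key};
}

parse_once('(perl)', 'i') for 1 .. 1000;
print "$parses\n";   # 1
```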

match(STRING, PATTERN)
match(STRING, PATTERN, MODIFIER)

An emulation of m// operator for the Shift_JIS encoding.

PATTERN is specified as a string.

MODIFIER is specified as a string.

i  case-insensitive pattern (only for ascii alphabets)
I  case-insensitive pattern (greek, cyrillic, fullwidth latin)
j  hiragana-katakana-insensitive pattern

s  treat string as single line
m  treat string as multiple lines
x  ignore whitespace (i.e. [ \n\r\t\f], but not comments!)
   unless backslashed or inside a character class
g  match globally
z  tell the function the pattern matches zero-length substring
      (sorry, due to the poor auto-detection)

o  once parsed (not compiled!) and the result is cached internally.

replace(STRING or SCALAR REF, PATTERN, REPLACEMENT)
replace(STRING or SCALAR REF, PATTERN, REPLACEMENT, MODIFIER)

An emulation of s/// operator for the Shift_JIS encoding.

If a reference to a scalar variable is specified as the first argument, returns the number of substitutions made. If a string is specified as the first argument, returns the substituted string and leaves the specified string unaffected.

my $str = '金１５３００００円';
1 while replace(\$str, '(\pD)(\pD{3})(?!\pD)', '$1，$2');
print $str; # 金１，５３０，０００円
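
For comparison, the same "1 while" idiom with the built-in s/// operator on ASCII digits might look like the sketch below. group_digits is a hypothetical helper name. Each pass inserts one separator, and the loop ends when no further substitution is made.

```perl
use strict;
use warnings;

# Insert a comma before each trailing group of three ASCII digits.
sub group_digits {
    my ($str) = @_;
    1 while $str =~ s/(\d)(\d{3})(?!\d)/$1,$2/;
    return $str;
}

print group_digits('1530000'), "\n";   # 1,530,000
```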

MODIFIER is specified as a string.

i  case-insensitive pattern (only for ascii alphabets)
I  case-insensitive pattern (greek, cyrillic, fullwidth latin)
j  hiragana-katakana-insensitive pattern

s  treat string as single line
m  treat string as multiple lines
x  ignore whitespace (i.e. [ \n\r\t\f], but not comments!)
   unless backslashed or inside a character class
g  match globally
z  tell the function the pattern matches zero-length substring
      (sorry, due to the poor auto-detection)

o  once parsed (not compiled!) and the result is cached internally.

jsplit(PATTERN or ARRAY REF of [PATTERN, MODIFIER], STRING)
jsplit(PATTERN or ARRAY REF of [PATTERN, MODIFIER], STRING, LIMIT)

This function emulates CORE::split.

In non-list context, this function only returns the number of fields found; it does not split into the @_ array.

PATTERN is specified as a string.

jsplit('／', 'あいう／えおメ^');

But ' ' as PATTERN has no special meaning; it splits the string on a single space similarly to CORE::split / /.

When you want to split the string on whitespace, pass an undefined value as PATTERN or use the splitspace() function.

jsplit(undef, ' 　 This  is 　 perl.');
splitspace(' 　 This  is 　 perl.');
# ('This', 'is', 'perl.')

To pass a pattern with modifiers, specify an array reference of [PATTERN, MODIFIER] as the first argument.

jsplit([ 'あ', 'jo' ], '01234あいうえおアイウエオ');

Or you can say (see "Embedded Modifiers"):

jsplit('(?jo)あ', '01234あいうえおアイウエオ');

MODIFIER is specified as a string.

i  do case-insensitive pattern matching (only for ascii alphabets)
I  do case-insensitive pattern matching
   (greek, cyrillic, fullwidth latin)
j  do hiragana-katakana-insensitive pattern matching

s  treat string as single line
m  treat string as multiple lines
x  ignore whitespace (i.e. [ \n\r\t\f], but not comments!)
   unless backslashed or inside a character class

o  once parsed (not compiled!) and the result is cached internally.

splitspace(STRING)
splitspace(STRING, LIMIT)

This function emulates CORE::split ' ', STRING, LIMIT and returns the array given by split on whitespace including IDEOGRAPHIC SPACE. Leading whitespace characters do not produce any field.

Note: splitspace(STRING, LIMIT) is equivalent to jsplit(undef, STRING, LIMIT).

splitchar(STRING)
splitchar(STRING, LIMIT)

This function emulates CORE::split //, STRING, LIMIT and returns the array given by split of the specified string into characters.

Note: splitchar(STRING, LIMIT) is equivalent to jsplit('', STRING, LIMIT).

Basic Regular Expressions

regexp          meaning

^               match the start of the string
                match the start of any line with 'm' modifier

$               match the end of the string, or before newline at the end
                match the end of any line with 'm' modifier

.               match any character except \n
                match any character with 's' modifier

\A              only at beginning of string
\Z              at the end of the string, or before newline at the end
\z              only at the end of the string (eq. '(?!\n)\Z')

\C              match a single C char (octet), i.e. [\0-\xFF] in perl.
\j              match any character, i.e. [\0-\x{FCFC}] in this module.
\J              match any character except \n, i.e. [^\n] in this module.

  * \j and \J are extensions by this module. e.g.

     match($_, '(\j{5})\z') returns last five chars including \n at the end
     match($_, '(\J{5})\Z') returns last five chars excluding \n at the end

\a              alarm      (BEL)
\b              backspace  (BS) * within character classes *
\e              escape     (ESC)
\f              form feed  (FF)
\n              newline    (LF, NL)
\r              return     (CR)
\t              tab        (HT, TAB)
\0              null       (NUL)

\ooo            octal single-byte character
\xhh            hexadecimal single-byte character
\x{hhhh}        hexadecimal double-byte character
\c[             control character

   e.g. \012 \123 \x5c \x5C \x{824F} \x{9Fae} \cA \cZ \c^ \c?
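
\x{hhhh} names a double-byte character by its lead byte and trail byte. A minimal sketch of that correspondence in plain Perl, using the hypothetical helper names dbcs_bytes and dbcs_code:

```perl
use strict;
use warnings;

# Turn a double-byte code such as 0x824F into its two-byte string.
sub dbcs_bytes {
    my ($code) = @_;
    return pack('n', $code);      # 16-bit big-endian: lead byte, trail byte
}

# Turn a two-byte string back into its numeric code.
sub dbcs_code {
    my ($bytes) = @_;
    return unpack('n', $bytes);
}

print unpack('H*', dbcs_bytes(0x824F)), "\n";      # 824f
printf "%04X\n", dbcs_code("\x9F\xAE");            # 9FAE
```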

Predefined Character Classes

  \d                        [\d]              [0-9]
  \D                        [\D]              [^0-9]
  \w                        [\w]              [0-9A-Z_a-z]
  \W                        [\W]              [^0-9A-Z_a-z]
  \s                        [\s]              [\t\n\r\f ]
  \S                        [\S]              [^\t\n\r\f ]

  \p{Xdigit}     \pX        [[:xdigit:]]      [0-9A-Fa-f]
  \p{Digit}      \pD        [[:digit:]]       [0-9０-９]
  \p{Upper}      \pU        [[:upper:]]       [A-ZＡ-Ｚ]
  \p{Lower}      \pL        [[:lower:]]       [a-zａ-ｚ]
  \p{Alpha}      \pA        [[:alpha:]]       [A-Za-zＡ-Ｚａ-ｚ]
  \p{Alnum}      \pQ        [[:alnum:]]       [0-9A-Za-z０-９Ａ-Ｚａ-ｚ]

  \p{Word}       \pW        [[:word:]]        [_\p{Digit}\p{European}\p{Kana}\p{Kanji}]
  \p{Punct}      \pP        [[:punct:]]       [!-/:-@[-`{-~｡-･、-！´-＿―-〓∈-∩∧-∃∠-∬Å-¶◯─-╂]
  \p{Graph}      \pG        [[:graph:]]       [\p{Word}\p{Punct}]
  \p{Print}      \pT        [[:print:]]       [\x20\x{8140}\p{Graph}]
  \p{Space}      \pS        [[:space:]]       [\x20\x{8140}\x09-\x0D]
  \p{Blank}      \pB        [[:blank:]]       [\x20\x{8140}\t]
  \p{Cntrl}      \pC        [[:cntrl:]]       [\x00-\x1F\x7F]
  \p{ASCII}                 [[:ascii:]]       [\x00-\x7F]

  \p{Roman}      \pR        [[:roman:]]       [\x21-\x7E]
  \p{Hankaku}    \pY        [[:hankaku:]]     [\xA1-\xDF]
  \p{Zenkaku}    \pZ        [[:zenkaku:]]     [\x{8140}-\x{FCFC}]

( \p{^Zenkaku}   \PZ        [[:^zenkaku:]]    [\x00-\x7F\xA1-\xDF] )

  \p{X0201}                 [[:x0201:]]       [\x20-\x7F\xA1-\xDF]
  \p{X0208}                 [[:x0208:]]       [\x{8140}-〓∈-∩∧-∃∠-∬Å-¶◯０-９Ａ-Ｚａ-ｚぁ-んァ-ヶΑ-Ωα-ωА-Яа-я─-╂亜-腕弌-熙]
  \p{X0211}                 [[:x0211:]]       [\x00-\x1F]
  \p{JIS}        \pJ        [[:jis:]]         [\p{X0201}\p{X0208}\p{X0211}]

  \p{NEC}        \pN        [[:nec:]]         [\x{8740}-\x{875D}\x{875f}-\x{8775}\x{877E}-\x{879c}\x{ed40}-\x{eeec}\x{eeef}-\x{eefc}]
  \p{IBM}        \pI        [[:ibm:]]         [\x{fa40}-\x{fc4b}]
  \p{Vendor}     \pV        [[:vendor:]]      [\p{NEC}\p{IBM}]
  \p{MSWin}      \pM        [[:mswin:]]       [\p{JIS}\p{NEC}\p{IBM}]

  \p{Halfwidth}             [[:halfwidth:]]   [!#$%&()*+,./0-9:;<=>?@A-Z\[\\\]^_`a-z{|}~]
  \p{Fullwidth}  \pF        [[:fullwidth:]]   [！＃＄％＆（）＊＋，．／０-９：；＜＝＞？＠Ａ-Ｚ［￥］＾＿｀ａ-ｚ｛｜｝￣]

  \p{Latin}                 [[:latin:]]       [A-Za-z]
  \p{FullLatin}             [[:fulllatin:]]   [Ａ-Ｚａ-ｚ]
  \p{Greek}                 [[:greek:]]       [Α-Ωα-ω]
  \p{Cyrillic}              [[:cyrillic:]]    [А-Яа-я]
  \p{European}   \pE        [[:european:]]    [A-Za-zＡ-Ｚａ-ｚΑ-Ωα-ωА-Яа-я]

  \p{HalfKana}              [[:halfkana:]]    [ｦ-ﾟ]
  \p{Hiragana}   \pH        [[:hiragana:]]    [ぁ-ん゛゜ゝゞ]
  \p{Katakana}   \pK        [[:katakana:]]    [ァ-ヶーヽヾ]
  \p{FullKana}  [\pH\pK]    [[:fullkana:]]    [ぁ-んァ-ヶ゛゜ーゝゞヽヾ]
  \p{Kana}                  [[:kana:]]        [ｦ-ﾟぁ-んァ-ヶ゛゜ーゝゞヽヾ]
  \p{Kanji0}     \p0        [[:kanji0:]]      [〃-〇]
  \p{Kanji1}     \p1        [[:kanji1:]]      [亜-腕]
  \p{Kanji2}     \p2        [[:kanji2:]]      [弌-熙]
  \p{Kanji}    [\p0\p1\p2]  [[:kanji:]]       [〃-〇亜-腕弌-熙]
  \p{BoxDrawing}            [[:boxdrawing:]]  [─-╂]

  • \p{NEC} matches an NEC special character or an NEC-selected IBM extended character.

    \p{IBM} matches an IBM extended character.

    \p{Vendor} matches a character of vendor-defined characters in Microsoft CP932, i.e. equivalent to [\p{NEC}\p{IBM}].

    \p{MSWin} matches a character of Microsoft CP932.

    \p{Kanji0} matches a kanji of the minimum kanji class of JIS X 4061; \p{Kanji1}, of the level 1 kanji of JIS X 0208; \p{Kanji2}, of the level 2 kanji of JIS X 0208; \p{Kanji}, of the basic kanji class of JIS X 4061.

  • \p{Prop}, \P{^Prop}, [\p{Prop}], etc. are equivalent to each other; and their complements are \P{Prop}, \p{^Prop}, [\P{Prop}], [^\p{Prop}], etc.

    \pP, \P^P, [\pP], etc. are equivalent to each other; and their complements are \PP, \p^P, [\PP], [^\pP], etc.

    [[:class:]] is equivalent to [^[:^class:]]; and their complements are [[:^class:]] or [^[:class:]].

    In \p{Prop}, \P{Prop}, [:class:] expressions, Prop and class are case-insensitive (e.g. \p{digit}, [:BoxDrawing:]).

  • The prefixes Is and In for \p{Prop} and \P{Prop} (e.g. \p{IsProp}, \P{InProp}, etc.) are optional. However, \p{isProp}, \p{ISProp}, etc. are not accepted, as Is and In are case-sensitive. Use of Is and In is deprecated, since they may conflict with a property name beginning with 'is' or 'in' in the future.

Character Classes

Ranges in character class are supported.

The order of Shift_JIS characters is: 0x00 .. 0x7F, 0xA1 .. 0xDF, 0x8140 .. 0x9FFC, 0xE040 .. 0xFCFC.

So [\0-\x{fcfc}] matches any one Shift_JIS character.
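
The order above is simply the numeric order of the character codes, with every single-byte character sorting before every double-byte one. A minimal sketch in plain Perl, assuming each list element is one legal Shift_JIS character (sjis_code is a hypothetical helper name):

```perl
use strict;
use warnings;

# Numeric code of one character: 0x00-0xDF for single-byte characters,
# 0x8140-0xFCFC for double-byte characters.
sub sjis_code {
    my ($char) = @_;
    return length($char) == 2 ? unpack('n', $char) : ord($char);
}

my @chars  = ("\x82\xA0", "\xB1", "A", "\x88\x9F");
my @sorted = sort { sjis_code($a) <=> sjis_code($b) } @chars;

print join(' ', map { unpack('H*', $_) } @sorted), "\n";   # 41 b1 82a0 889f
```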

In character classes, any character or byte sequence that does not match a Shift_JIS character, e.g. re('[\xA0-\xFF]'), causes the function to croak.

Character classes that match a non-Shift_JIS substring are not supported (use \C or alternation instead).

Character Equivalences

Since version 0.13, the POSIX character equivalence classes [=cc=] are supported. e.g. [[=あ=]] is identical to [ぁァｧあアｱ]; [[=P=]] to [pPｐＰ]; [[=4=]] to [4４]. They are used in a character class, like [[=cc=]], [[=p=][=e=][=r=][=l=]].

As cc in [=cc=], any character literal or meta character (\xhh, \x{hhhh}) that belongs to the character equivalents can be used. e.g. [=あ=], [=ア=], [=\x{82A0}=], [=\xB1=], etc. have identical meanings.

[[=か=]] matches 'か', 'カ', 'ｶ', 'が', 'ガ', 'ｶﾞ', 'ヵ' ('ｶﾞ' is a two-character string, but one collation element, HALFWIDTH FORM FOR KATAKANA LETTER GA).

[[===]] matches EQUALS SIGN or FULLWIDTH EQUALS SIGN; [[=[=]] matches LEFT SQUARE BRACKET or FULLWIDTH LEFT SQUARE BRACKET; [[=]=]] matches RIGHT SQUARE BRACKET or FULLWIDTH RIGHT SQUARE BRACKET; [[=\=]] matches YEN SIGN or FULLWIDTH YEN SIGN.

Code Embedded in a Regular Expression (Perl 5.005 or later)

Parsing (?{ ... }) or (??{ ... }) assertions is carried out without any special care of double-byte characters.

(?{ ... }) assertions are disallowed in match() or replace() functions by perl due to security concerns. Use them via re() function inside your scope.

use ShiftJIS::Regexp qw(:all);

use re 'eval';

$::res = 0;
$_ = 'ポ' x 8;

my $regex = re(q/
     \j*?
     (?{ $cnt = 0 })
     (
       ポ (?{ local $cnt = $cnt + 1; })
     )*
     ポポポ
     (?{ $::res = $cnt })
   /, 'x');

/$regex/;
print $::res; # 5

Embedded Modifiers

Since version 0.15, embedded modifiers are extended.

An embedded modifier, (?iIjsmxo), that appears at the beginning of the 'regexp' or that follows one of regular expressions ^, \A, or \G at the beginning of the 'regexp' is allowed to contain I, j, o modifiers.

e.g. (?sm)pattern  ^(?i)pattern  \G(?j)pattern  \A(?ijo)pattern

And match('エ', '(?i)ト') returns false (Good result) even on Perl below 5.005, since it works like match('エ', 'ト', 'i').

Avoiding Mismatching

Using the 'e' modifier in a replacement, or looping over matches in a while clause, is not supported by this module's functions.

Both can be used only via the usual syntax (i.e. in the m// or s/// operators).

Use a regular expression '\A(\j*?)' or '\G(\j*?)' to avoid matching a single-byte character against the trail byte of a double-byte character, or a double-byte character against two bytes that straddle a character boundary.

Don't forget that $1 then corresponds to '(\j*?)', so the backreferences you intend to use begin from $2.
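
The mismatch can be reproduced in plain byte-oriented Perl. In the sketch below, $char is the legal-character expression from this document, used as a stand-in for what \j matches; that equivalence is an assumption made for illustration, not the module's actual output.

```perl
use strict;
use warnings;

my $char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# KATAKANA A, KATAKANA I: their trail bytes are 'A' and 'C'.
my $str = "\x83\x41\x83\x43";

# Naive byte match: 'A' is found inside the first double-byte character.
my $naive = ($str =~ /A/) ? 1 : 0;

# Anchored match: only whole characters are skipped, so 'A' can be found
# only at a character boundary. $1 holds the skipped prefix.
my $anchored = ($str =~ /\A((?:$char)*?)A/) ? 1 : 0;

print "$naive $anchored\n";   # 1 0
```

With a genuine single-byte 'A' appended, the anchored form matches and $1 holds the preceding characters.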

Ex.1

use ShiftJIS::Regexp qw(re);

$_ = 'あいうえおアイウエオ漢字 シフトＪＩＳ';
my $regex = re('\G(\j*?)(\pK)');
# or say: my $regex = re('(\R{padG})(\pK)');

while (/$regex/go) {
    print "found a katakana: $2\n";
}

Ex.2

use ShiftJIS::Regexp qw(re);
use ShiftJIS::String qw(strrev); # a Shift_JIS-oriented scalar reverse()

my $regex = re('\G(\j*?)(\w+)');
# or say: my $regex = re('(\R{padG})(\w+)');

foreach ('s/Perl/Camel/g', '(アイウエオ)AIUEO-漢字') {
    (my $str = $_) =~ s/$regex/$1.strrev($2)/geo; # <$1.> must be said.
    print "$str\n";
}

Note: When matching against a very long string, the special regular expression \R{padG} may be safer than \G(\j*?), as it is less likely that the repeat count of * will overflow a preset limit.

CAVEATS

A legal Shift_JIS character in this module must match the following regular expression:

[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]

Any string from an external resource should be checked with the issjis() function of ShiftJIS::String, unless you know it is definitely encoded in Shift_JIS.

Use of an illegal Shift_JIS string may lead to odd results.

Some Shift_JIS double-byte characters have one of [\x40-\x7E] as the trail byte.

@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

The Perl lexer takes no special care of these characters, so they sometimes cause trouble. e.g. the quoted literal "表" (bytes 0x95 0x5C) causes a fatal error, since its trail byte 0x5C backslash-escapes the closing quote.

Such a problem doesn't arise when the string is obtained from an external resource. But writing a script that contains Shift_JIS double-byte characters requires the greatest care.

Using a single-quoted heredoc (<< ''), or \xhh meta characters, is recommended for defining a Shift_JIS string literal.

The safe ASCII-graphic characters, [\x21-\x3F], are:

!"#$%&'()*+,-./0123456789:;<=>?

They are preferred as the delimiter of quote-like operators.
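
As a small illustration of the advice above, the sketch below defines a literal with \xhh escapes and lists the double-byte characters whose trail byte falls in [\x40-\x7E]. risky_chars is a hypothetical helper, not part of this module or of ShiftJIS::String.

```perl
use strict;
use warnings;

my $char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Return the double-byte characters whose trail byte is in [\x40-\x7E].
sub risky_chars {
    my ($bytes) = @_;
    my @chars = ($bytes =~ /\G($char)/g);
    return grep { length($_) == 2 && substr($_, 1, 1) =~ /[\x40-\x7E]/ } @chars;
}

# Written with \xhh escapes, so the source itself stays pure ASCII.
# 0x955C has trail byte '\', 0x82A0 has trail byte 0xA0, 0x8341 has 'A'.
my $literal = "\x95\x5C" . "\x82\xA0" . "\x83\x41";

print join(' ', map { unpack('H*', $_) } risky_chars($literal)), "\n";   # 955c 8341
```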

BUGS

\U, \L, \Q, \E, and variable interpolation are not taken into account. If necessary, use them in "" (or qq//) operators in the argument list.

The word-boundary assertions \b and \B don't work correctly.

Never pass a regular expression containing '(?i)' on perl below 5.005. Pass the 'i' modifier in the MODIFIER argument instead. (On Perl 5.005 or later, '(?i)' is allowed because '(?-i:RE)' prevents wrong matching.)

e.g.

match('エ', '(?i)ト') returns true on Perl below 5.005 (Wrong).
match('エ', '(?i)ト') returns false on Perl 5.005 or later (Good).
match('エ', 'ト', 'i') returns false, ok.
# The trail byte of 'エ' is 'G' and that of 'ト' is 'g';

(see also "Embedded Modifiers")
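
The wrong match can be reproduced with plain /i on raw bytes. The sketch below also shows one way to restrict case folding to single-byte ASCII letters; ascii_ci_pattern is a hypothetical helper and is not how this module is known to implement the 'i' modifier.

```perl
use strict;
use warnings;

my $char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Build a byte pattern from a literal in which only single-byte ASCII
# letters are case-insensitive; all other characters must match exactly.
sub ascii_ci_pattern {
    my ($literal) = @_;
    my $re = '';
    for my $c ($literal =~ /\G($char)/g) {
        $re .= ($c =~ /\A[A-Za-z]\z/)
             ? '[' . uc($c) . lc($c) . ']'
             : quotemeta($c);
    }
    return $re;
}

my $e  = "\x83\x47";   # KATAKANA LETTER E,  trail byte 'G'
my $to = "\x83\x67";   # KATAKANA LETTER TO, trail byte 'g'

# Plain /i folds the trail byte, so two different characters "match".
my $naive = ($e =~ /\A$to\z/i) ? 1 : 0;

my $safe_re = ascii_ci_pattern($to);
my $safe    = ($e =~ /\A$safe_re\z/) ? 1 : 0;

print "$naive $safe\n";   # 1 0
```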

The i, I and j modifiers have no effect on \p{}, \P{}, and POSIX [: :] (e.g. \p{IsLower}, [:lower:], etc.). So use re('\p{IsAlpha}') instead of re('\p{IsLower}', 'iI').

A look-behind assertion like (?<=[A-Z]) is not prevented from matching the trail byte of the preceding double-byte character: e.g. match("アイウ", '(?<=[A-Z])(\p{InKana})') returns ('イ') (which is of course wrong).

Non-greedy regular expressions that can match the empty string, such as .?? and \d*?, may cause the emulation of CORE::split to fail when used as the PATTERN in jsplit().

AUTHOR

Tomoyuki SADAHIRO

bqw10602@nifty.com
http://homepage1.nifty.com/nomenclator/perl/

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

ShiftJIS::String
ShiftJIS::Collate
