NAME

ShiftJIS::String - functions to manipulate Shift_JIS encoded strings

SYNOPSIS

use ShiftJIS::String;

ShiftJIS::String::substr($str, ShiftJIS::String::index($str, $substr));

ABOUT THIS POD

This POD is written in Shift_JIS encoding.

Do you see 'あ' as HIRAGANA LETTER A, and '\' as YEN SIGN rather than REVERSE SOLIDUS? If not, change your font to an appropriate one (or the POD may have been badly converted).

DESCRIPTION

This module provides functions that emulate the corresponding CORE functions and help you manipulate multibyte character sequences in Shift_JIS encoding.

* 'Hankaku' and 'Zenkaku' mean 'halfwidth' and 'fullwidth' characters in Japanese, respectively.

FUNCTIONS

issjis(LIST)

Returns a boolean indicating whether all the strings in the parameter list are legally encoded in Shift_JIS.
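
The check can be pictured with the regular expression given in the CAVEAT section. The following is only a minimal sketch under that assumption, not the module's own code; the helper name looks_like_sjis is hypothetical, and the strings are written with \xhh escapes so the script itself stays pure ASCII.

```perl
use strict;
use warnings;

# One legal Shift_JIS character, as given in the CAVEAT section.
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper (not the module's code): true only if every
# string in the list is a sequence of legal characters.
sub looks_like_sjis {
    for my $str (@_) {
        return 0 unless $str =~ /\A(?:$sjis_char)*\z/;
    }
    return 1;
}

# "Perl" and HIRAGANA A + I are legal; a lone leading byte is not.
print looks_like_sjis("Perl", "\x82\xA0\x82\xA2") ? "legal\n" : "illegal\n";
print looks_like_sjis("\x82")                     ? "legal\n" : "illegal\n";
```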

Length

length(STRING)

Returns the length in characters of the supplied string.
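
To illustrate the difference between characters and bytes, here is a rough sketch that counts characters by matching them one at a time with the CAVEAT regular expression. The helper sjis_length is a hypothetical name, and the sketch assumes the input is a legal Shift_JIS string.

```perl
use strict;
use warnings;

# One legal Shift_JIS character, as given in the CAVEAT section.
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: count the characters of a legal string.
sub sjis_length {
    my ($str) = @_;
    my $count = () = $str =~ /$sjis_char/g;   # number of matches
    return $count;
}

my $str = "Perl\x82\xA0\x82\xA2";    # "Perl" + HIRAGANA A + HIRAGANA I
print sjis_length($str), "\n";       # 6 (characters)
print CORE::length($str), "\n";      # 8 (bytes)
```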

Reverse

strrev(STRING)

Returns a reversed string (having all characters in the opposite order).

index(STRING, SUBSTR)
index(STRING, SUBSTR, POSITION)

Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, starts searching from the beginning of the string.

If the substring is not found, returns -1.
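
The reason a character-oriented search is needed can be shown with a small sketch. CORE::index works on bytes, so it can "find" an ASCII letter inside the trailing byte of a double-byte character. The helper sjis_index below is hypothetical and deliberately simple (it is not how the module is implemented); it assumes legal Shift_JIS input.

```perl
use strict;
use warnings;

# One legal Shift_JIS character, as given in the CAVEAT section.
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: position of SUBSTR in STRING, counted in characters.
sub sjis_index {
    my ($str, $sub, $pos) = @_;
    $pos = 0 unless defined($pos) && $pos > 0;
    my @chars  = $str =~ /$sjis_char/g;
    my $sublen = () = $sub =~ /$sjis_char/g;
    for my $i ($pos .. @chars - $sublen) {
        return $i if join('', @chars[$i .. $i + $sublen - 1]) eq $sub;
    }
    return -1;
}

my $str = "\x83\x41B";                  # KATAKANA A (0x8341) followed by "B"
print CORE::index($str, "A"), "\n";     # 1: a false match on the trailing byte
print sjis_index($str, "A"), "\n";      # -1: there is no character "A"
```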

rindex(STRING, SUBSTR)
rindex(STRING, SUBSTR, POSITION)

Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before that position.

If the substring is not found, returns -1.

strspn(STRING, SEARCHLIST)

Returns the position of the first occurrence of any character not contained in the search list.

strspn("+0.12345*12", "+-.0123456789");
# returns 8. 

If the specified string does not contain any character in the search list, returns 0.

If the string consists only of characters in the search list, the returned value equals the length of the string.

strcspn(STRING, SEARCHLIST)

Returns the position of the first occurrence of any character contained in the search list.

strcspn("Perlは面白い。", "赤青黄白黒");
# returns 6. 

If the specified string does not contain any character in the search list, the returned value equals the length of the string.
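
Both strspn and strcspn amount to walking the string character by character and stopping at the first character that is (or is not) in the search list. The sketch below reproduces the two examples above under that reading; sjis_span is a hypothetical helper, not the module's implementation, and the Japanese strings are spelled with \xhh escapes.

```perl
use strict;
use warnings;

# One legal Shift_JIS character, as given in the CAVEAT section.
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: length of the leading run of characters that are
# in LIST (WANT_MEMBER true, like strspn) or not in LIST (false, like strcspn).
sub sjis_span {
    my ($str, $list, $want_member) = @_;
    my %in_list;
    $in_list{$_} = 1 for $list =~ /$sjis_char/g;
    my $n = 0;
    for my $ch ($str =~ /$sjis_char/g) {
        my $member = exists $in_list{$ch} ? 1 : 0;
        last if $member != ($want_member ? 1 : 0);
        $n++;
    }
    return $n;
}

print sjis_span("+0.12345*12", "+-.0123456789", 1), "\n";    # 8

# "Perl" + 5 double-byte characters; the search list holds 5 kanji,
# one of which (0x9492) is the 7th character of the string.
my $str  = "Perl\x82\xCD\x96\xCA\x94\x92\x82\xA2\x81\x42";
my $list = "\x90\xD4\x90\xC2\x89\xA9\x94\x92\x8D\x95";
print sjis_span($str, $list, 0), "\n";                       # 6
```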

Substring

substr(STRING or SCALAR REF, OFFSET)
substr(STRING or SCALAR REF, OFFSET, LENGTH)
substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)

Works like CORE::substr, but with the character semantics of Shift_JIS encoding.

If REPLACEMENT is specified as the fourth parameter, replaces that part of the SCALAR and returns what was there before.

If a reference to a scalar variable is used as the first argument, an lvalue reference is returned, and you can assign through it.

${ &substr(\$str,$off,$len) } = $replace;

    works like

CORE::substr($str,$off,$len) = $replace;

The returned lvalue is not Shift_JIS-oriented but byte-oriented, so successive assignments may cause unexpected results.

$str = "0123456789";
$lval  = &substr(\$str,3,1);
$$lval = "あい";
$$lval = "a";
# $str is NOT "012aい456789", but an illegal string "012a\xA0い456789".

Split

strsplit(SEPARATOR, STRING)
strsplit(SEPARATOR, STRING, LIMIT)

This function emulates CORE::split, but splits on the SEPARATOR string, not on a pattern. In non-list context, it returns only the number of fields found and does not split into the @_ array.

strsplit('||', '||あいうえお||パピプペポ||01234||');
# ('', 'あいうえお', 'パピプペポ', '01234')
# The trailing byte of 'ポ' is 0x7C ('|'), but it is not taken as part of a separator.

strsplit('／', 'Perl／駱駝／Camel');
# ('Perl', '駱駝', 'Camel')

If an empty string is specified as SEPARATOR, splits the specified string into characters (similarly to CORE::split //, STRING, LIMIT).

strsplit('', 'This is Perl.', 7);
# ('T', 'h', 'i', 's', ' ', 'i',  's Perl.')

If an undefined value is specified as SEPARATOR, splits the specified string on whitespace characters (including IDEOGRAPHIC SPACE). Leading whitespace characters do not produce any field (similarly to CORE::split ' ', STRING, LIMIT).

strsplit(undef, ' 　 This  is 　 Perl.');
# ('This', 'is', 'Perl.')

Comparison

strcmp(LEFT-STRING, RIGHT-STRING)

Returns 1 (when LEFT-STRING is greater than RIGHT-STRING), 0 (when LEFT-STRING is equal to RIGHT-STRING), or -1 (when LEFT-STRING is less than RIGHT-STRING).

The order is roughly as follows:

JIS X 0201 Roman, JIS X 0201 Kana, then JIS X 0208 Kanji (Zenkaku).

For example, 0x41 ('A') is less than 0xB1 ('ｱ' HANKAKU KATAKANA A). 0xB1 ('ｱ') is less than 0x8341 ('ア' KATAKANA A). 0x8341 ('ア') is less than 0x8383 ('ャ' KATAKANA SMALL YA). 0x8383 ('ャ') is less than 0x83B1 ('Τ' GREEK CAPITAL TAU).

Caveat! Compare the 2nd and the 4th examples. Byte "\xB1" sorts before byte "\x83" when "\x83" is a leading byte, but after it when both are trailing bytes. In short, a plain binary comparison does not give the Shift_JIS codepoint order.

strEQ(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is equal to RIGHT-STRING.

Note: strEQ is an expensive equivalent of the CORE eq operator.

strNE(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is not equal to RIGHT-STRING.

Note: strNE is an expensive equivalent of the CORE ne operator.

strLT(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is less than RIGHT-STRING.

strLE(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is less than or equal to RIGHT-STRING.

strGT(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is greater than RIGHT-STRING.

strGE(LEFT-STRING, RIGHT-STRING)

Returns a boolean indicating whether LEFT-STRING is greater than or equal to RIGHT-STRING.

strxfrm(STRING)

Returns a string transformed so that the CORE cmp operator can be used for binary comparisons (NOT the length of the transformed string).

I.e. strxfrm($a) cmp strxfrm($b) is equivalent to strcmp($a, $b), as long as your cmp doesn't use any locale other than that of Perl.
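
One way to picture such a transformation is to pad every single-byte character with a leading "\x00", so that all characters occupy two bytes and a binary comparison follows the Shift_JIS codepoint order. This is only a sketch of the idea under that assumption; the actual transformation used by strxfrm may differ, and sjis_xfrm is a hypothetical name.

```perl
use strict;
use warnings;

# One legal Shift_JIS character, as given in the CAVEAT section.
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical transform: widen single-byte characters to two bytes.
sub sjis_xfrm {
    my ($str) = @_;
    $str =~ s/($sjis_char)/length($1) == 1 ? "\x00$1" : $1/ge;
    return $str;
}

# The five characters from the strcmp examples, in scrambled order.
my @chars = ("\x83\xB1", "\xB1", "\x83\x83", "A", "\x83\x41");

my @bytewise = sort { $a cmp $b } @chars;
my @sjiswise = sort { sjis_xfrm($a) cmp sjis_xfrm($b) } @chars;

print join(' ', map { unpack('H*', $_) } @bytewise), "\n";  # 41 8341 8383 83b1 b1
print join(' ', map { unpack('H*', $_) } @sjiswise), "\n";  # 41 b1 8341 8383 83b1
```

With the plain byte sort, the hankaku katakana 0xB1 lands after every double-byte character; with the transformed keys it takes its place between 'A' and 0x8341.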

Character Range

mkrange(EXPR, EXPR)

Returns the list of characters (or, in non-list context, a concatenated string) obtained by parsing the specified character range.

A character range is specified with a HYPHEN-MINUS, '-'. The backslashed combinations '\-' and '\\' are used instead of the characters '-' and '\', respectively. The hyphen at the beginning or end of the range is also evaluated as the hyphen itself.

For example, mkrange('+\-0-9a-fA-F') returns ('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C', 'D', 'E', 'F') and scalar mkrange('か-ご') returns 'かがきぎくぐけげこご'.

The order of Shift_JIS characters is: 0x00 .. 0x7F, 0xA1 .. 0xDF, 0x8140 .. 0x9FFC, 0xE040 .. 0xFCFC. So, mkrange('亜-腕') returns the list of all characters in level 1 Kanji.

If a true value is specified as the second parameter, reverse character ranges such as '9-0' or 'Z-A' can be used; otherwise, a reverse character range causes the function to croak.
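
The expansion of a double-byte range can be sketched by walking the 16-bit code values and keeping only those that form a legal double-byte character (this skips trailing bytes such as 0x7F and 0xFD-0xFF). The helper sjis_double_range is hypothetical, handles only a single forward range between two double-byte characters, and is not the module's parser.

```perl
use strict;
use warnings;

# A legal double-byte character (second alternative of the CAVEAT regexp).
my $double = qr/[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: all legal double-byte characters from FROM to TO.
sub sjis_double_range {
    my ($from, $to) = @_;
    my ($lo, $hi) = map { unpack('n', $_) } ($from, $to);
    return grep { /\A$double\z/ } map { pack('n', $_) } $lo .. $hi;
}

# HIRAGANA KA (0x82A9) .. HIRAGANA GO (0x82B2)
my @ka_go = sjis_double_range("\x82\xA9", "\x82\xB2");
print scalar(@ka_go), "\n";     # 10

# 0x889F .. 0x9872, the level 1 Kanji range
my @level1 = sjis_double_range("\x88\x9F", "\x98\x72");
print scalar(@level1), "\n";    # 2965
```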

Transliteration

strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)

Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list.

If a reference to a scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and leaves the specified string unaffected.

$str = "なんといおうか";
print strtr(\$str,"あいうえお", "アイウエオ"), "  ", $str;
# output: 3  なんとイオウか

$str = "後門の狼。";
print strtr($str,"後狼。", "前虎、"), $str;
# output: 前門の虎、後門の狼。

SEARCHLIST and REPLACEMENTLIST

Character ranges such as "ぁ-お" (internally utilizing mkrange()) are supported.

If the REPLACEMENTLIST is empty (specified as '', not undef, because the use of an uninitialized value causes a warning under the -w option), the SEARCHLIST is replicated.

If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (but this works differently when the 'd' modifier is used).

strtr(\$str, 'ぁ-んァ-ヶｦ-ﾟ', '#');
  # replaces all Kana letters by '#'. 

MODIFIER

  c   Complement the SEARCHLIST.
  d   Delete found but unreplaced characters.
  s   Squash duplicate replaced characters.
  R   Do not use character ranges.
  r   Allow reverse character ranges.
  o   Cache the conversion table internally.

strtr(\$str, 'ぁ-んァ-ヶｦ-ﾟ', '');
  # counts all Kana letters in $str. 

$onlykana = strtr($str, 'ぁ-んァ-ヶｦ-ﾟ', '', 'cd');
  # deletes all characters except Kana letters. 

strtr(\$str, " \x81\x40\n\r\t\f", '', 'd');
  # deletes all whitespace characters including IDEOGRAPHIC SPACE.

strtr("おかかうめぼし　ちちとはは", 'ぁ-ん', '', 's');
  # output: おかうめぼし　ちとは

strtr("条件演算子の使いすぎは見苦しい", 'ぁ-ん', '＃', 'cs');
  # output: ＃の＃いすぎは＃しい

If 'R' modifier is specified, '-' is not evaluated as a meta character but HYPHEN-MINUS itself like in tr'''. Compare:

strtr("90 - 32 = 58", "0-9", "A-J");
  # output: "JA - DC = FI"

strtr("90 - 32 = 58", "0-9", "A-J", "R");
  # output: "JA - 32 = 58"
  # cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J';
  # '0' to 'A', '-' to '-', and '9' to 'J'.

If 'r' modifier is specified, you are allowed to use reverse character ranges. For example, strtr($str, "0-9", "9-0", "r") is equivalent to strtr($str, "0123456789", "9876543210").

strtr($text, '亜-腕', '腕-亜', "r");
  # Your text may seem to be clobbered.

PATTERN and TOPATTERN

By using PATTERN and TOPATTERN, you can transliterate the string using lists that contain multi-character substrings.

If called with four arguments, SEARCHLIST, REPLACEMENTLIST and STRING are split characterwise.

If called with five arguments, a multi-character substring that matches PATTERN in SEARCHLIST, REPLACEMENTLIST or STRING is regarded as a transliteration unit.

If both PATTERN and TOPATTERN are specified, a multi-character substring that matches PATTERN in SEARCHLIST or STRING, or that matches TOPATTERN in REPLACEMENTLIST, is regarded as a transliteration unit.

print strtr(
  "Caesar Aether Goethe", 
  "aeoeueAeOeUe", 
  "&auml;&ouml;&uuml;&Auml;&Ouml;&Uuml;", 
  "", 
  "[aouAOU]e",
  "&[aouAOU]uml;");

# output: C&auml;sar &Auml;ther G&ouml;the

LIST as Anonymous Array

Instead of specifying PATTERN and TOPATTERN, you can use anonymous arrays as SEARCHLIST and/or REPLACEMENTLIST as follows.

print strtr(
  "Caesar Aether Goethe", 
  [qw/ae oe ue Ae Oe Ue/], 
  [qw/&auml; &ouml; &uuml; &Auml; &Ouml; &Uuml;/]
);

Caching the conversion table

If the 'o' modifier is specified, the conversion table is cached internally. For example,

foreach (@hiragana_strings) {
  print strtr($_, 'ぁ-ん', 'ァ-ン', 'o');
}
# katakana strings are printed

will be almost as efficient as this:

$hiragana_to_katakana = trclosure('ぁ-ん', 'ァ-ン');

foreach (@hiragana_strings) {
  print &$hiragana_to_katakana($_);
}

You can use whichever you like.

Without 'o',

foreach (@hiragana_strings) {
  print strtr($_, 'ぁ-ん', 'ァ-ン');
}

will be very slow, since the conversion table is rebuilt every time the function is called.

Generation of the Closure to Transliterate

trclosure(SEARCHLIST, REPLACEMENTLIST)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)

Returns a closure to transliterate the specified string. The return value is a plain code reference, not a blessed object. With this code ref, you save time because you need not specify the parameter list on every call.

my $digit_tr = trclosure("1234567890-", "一二三四五六七八九〇−");
print &$digit_tr ("TEL ：0124-45-6789\n"); # ok to perl 5.003
print $digit_tr->("FAX ：0124-51-5368\n"); # perl 5.004 or better

# output:
# TEL ：〇一二四−四五−六七八九
# FAX ：〇一二四−五一−五三六八

The functionality of the closure made by trclosure() is equivalent to that of strtr(). In fact, strtr() calls trclosure() internally and uses the returned closure.
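
The benefit of a closure is that the conversion table is built once and then reused for every string. The following is a minimal sketch of that idea only: make_tr is a hypothetical helper that supports neither ranges nor modifiers, and it assumes that both lists contain the same number of characters. It is not the module's implementation.

```perl
use strict;
use warnings;

# One legal Shift_JIS character, as given in the CAVEAT section.
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: build the table once, return a code reference.
sub make_tr {
    my ($search, $replace) = @_;
    my @from = $search  =~ /$sjis_char/g;
    my @to   = $replace =~ /$sjis_char/g;
    my %map;
    @map{@from} = @to;      # assumes both lists have the same length
    return sub {
        my ($str) = @_;
        # Walk the string character by character, never byte by byte.
        $str =~ s/($sjis_char)/exists $map{$1} ? $map{$1} : $1/ge;
        return $str;
    };
}

# HIRAGANA A I U E O  ->  KATAKANA A I U E O
my $hi2ka = make_tr("\x82\xA0\x82\xA2\x82\xA4\x82\xA6\x82\xA8",
                    "\x83\x41\x83\x43\x83\x45\x83\x47\x83\x49");

# Seven hiragana; the 4th, 5th and 6th are I, O and U.
my $out = $hi2ka->("\x82\xC8\x82\xF1\x82\xC6\x82\xA2\x82\xA8\x82\xA4\x82\xA9");
print unpack('H*', $out), "\n";   # 82c882f182c683438349834582a9
```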

Case of the Alphabet

toupper(STRING or SCALAR REF)

Returns an uppercased string of STRING. Converts only half-width Latin characters a-z to A-Z.

If a reference to a scalar variable is specified as the first argument, the string it refers to is uppercased and the number of characters replaced is returned.

tolower(STRING or SCALAR REF)

Returns a lowercased string of STRING. Converts only half-width Latin characters A-Z to a-z.

If a reference to a scalar variable is specified as the first argument, the string it refers to is lowercased and the number of characters replaced is returned.
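
A byte-oriented case conversion is not safe for Shift_JIS, because the trailing byte of a double-byte character may fall in the range A-Z or a-z. The sketch below contrasts a plain tr with a character-aware conversion; sjis_tolower is a hypothetical helper written for illustration, not the module's code.

```perl
use strict;
use warnings;

# One legal Shift_JIS character, as given in the CAVEAT section.
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: lowercase only single-byte A-Z.
sub sjis_tolower {
    my ($str) = @_;
    $str =~ s{($sjis_char)}{
        my $ch = $1;
        $ch =~ tr/A-Z/a-z/ if length($ch) == 1;
        $ch;
    }ge;
    return $str;
}

my $str = "\x83\x41BC";    # KATAKANA A (0x8341) followed by "BC"

(my $bytewise = $str) =~ tr/A-Z/a-z/;
print unpack('H*', $bytewise), "\n";            # 83616263: 0x8341 became 0x8361
print unpack('H*', sjis_tolower($str)), "\n";   # 83416263: the katakana is intact
```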

Conversion between hiragana and katakana

If a reference to a scalar variable is specified as the first argument, the string it refers to is converted and the number of characters replaced is returned. Otherwise, returns the converted string and leaves the specified string unaffected.

Note: The conversion between a voiced (or semi-voiced) zenkaku kana, such as 'ガ' or 'パ', and a hankaku katakana with a voiced or semi-voiced mark, such as 'ｶﾞ' or 'ﾊﾟ', is counted as 1. Similarly, the conversion between zenkaku hiragana 'う゛' and zenkaku katakana 'ヴ' is counted as 1.

kanaH2Z(STRING or SCALAR REF)
kataH2Z(STRING or SCALAR REF)

Converts Hankaku Katakana to Zenkaku Katakana

Note: kataH2Z is an alias of kanaH2Z.

kataZ2H(STRING or SCALAR REF)

Converts Zenkaku Katakana to Hankaku Katakana

kanaZ2H(STRING or SCALAR REF)

Converts Zenkaku Hiragana and Katakana to Hankaku Katakana

hiXka(STRING or SCALAR REF)

Converts Zenkaku Hiragana to Zenkaku Katakana and Zenkaku Katakana to Zenkaku Hiragana at once.

hi2ka(STRING or SCALAR REF)

Converts Zenkaku Hiragana to Zenkaku Katakana

ka2hi(STRING or SCALAR REF)

Converts Zenkaku Katakana to Zenkaku Hiragana

Conversion of Whitespace Characters

If a reference to a scalar variable is specified as the first argument, the string it refers to is converted and the number of characters replaced is returned. Otherwise, returns the converted string and leaves the specified string unaffected.

spaceH2Z(STRING or SCALAR REF)

Converts space (half-width) to ideographic space (full-width) in the specified string and returns the converted string.

spaceZ2H(STRING or SCALAR REF)

Converts ideographic space (full-width) to space (half-width) in the specified string and returns the converted string.

CAVEAT

A legal Shift_JIS character in this module must match the following regular expression:

[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]

Any string from an external source should be checked with the issjis() function, unless you know for certain that it is encoded in Shift_JIS.

Use of an illegal Shift_JIS string may lead to odd results.

Some Shift_JIS double-byte characters have a trailing byte in the range of [\x40-\x7E], viz.,

@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

The Perl lexer (perhaps) takes no special care of these bytes, so they sometimes cause trouble. For example, the quoted literal "表" (bytes 0x95 0x5C) causes a fatal error, since its trailing byte 0x5C escapes the closing quote.

Such a problem does not arise when the string comes from an external source. But writing a script that contains Shift_JIS double-byte characters requires the greatest care.

The use of single-quoted heredoc, << '', or \xhh meta characters is recommended in order to define a Shift_JIS string literal.
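
As a small illustration of the \xhh approach, the sketch below defines the troublesome character 0x955C without putting any Shift_JIS byte into the script itself. The variable name $hyou is only an example.

```perl
use strict;
use warnings;

# The double-byte character 0x955C, written with \xhh escapes so that
# no raw 0x5C byte appears next to a closing quote in the source.
my $hyou = "\x95\x5C";

print length($hyou), "\n";                                  # 2 (bytes)
print "trailing byte is a backslash\n" if substr($hyou, 1) eq "\\";
```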

The safe ASCII-graphic characters, [\x21-\x3F], are:

!"#$%&'()*+,-./0123456789:;<=>?

They are preferred as the delimiter of quote-like operators.

BUGS

This library supposes $[ is always equal to 0, never 1.

The functions provided by this library use many regexp operations. Therefore, the values of $1 etc. may be changed or discarded unexpectedly. Save them in variables of your own before calling any of these functions.

AUTHOR

Tomoyuki SADAHIRO

bqw10602@nifty.com
http://homepage1.nifty.com/nomenclator/perl/

Copyright(C) 2001-2002, SADAHIRO Tomoyuki. Japan. All rights reserved.

This program is free software; you can redistribute it and/or 
modify it under the same terms as Perl itself.

SEE ALSO

ShiftJIS::Regexp
ShiftJIS::Collate
String::Multibyte
