NAME
ShiftJIS::String - functions to manipulate Shift_JIS encoded strings
SYNOPSIS
use ShiftJIS::String;
ShiftJIS::String::substr($str, ShiftJIS::String::index($str, $substr));
ABOUT THIS POD
This POD is written in Shift_JIS encoding.
Do you see 'あ' as HIRAGANA LETTER A, and '\' as YEN SIGN rather than REVERSE SOLIDUS? If not, change your font to an appropriate one (or the POD may have been badly converted).
DESCRIPTION
This module provides functions that emulate the corresponding CORE functions and help you manipulate multiple-byte character sequences in Shift_JIS encoding.
* 'Hankaku' and 'Zenkaku' mean 'halfwidth' and 'fullwidth' characters in Japanese, respectively.
FUNCTIONS
Check Whether the String is Legal
issjis(LIST)
-
Returns a boolean indicating whether all the strings in the parameter list are legally encoded in Shift_JIS.
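The check can be pictured with the legal-character regular expression given in the CAVEAT section. The following is a minimal sketch in core Perl, not the module's actual implementation; the helper name looks_like_sjis and the variable $sjis_char are hypothetical, and Shift_JIS bytes are written as \xhh escapes.

```perl
use strict;
use warnings;

# One legal character, per the regular expression in the CAVEAT section.
my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: true only if every string is a run of legal characters.
sub looks_like_sjis {
    for my $str (@_) {
        return 0 unless $str =~ /\A(?:$sjis_char)*\z/;
    }
    return 1;
}

print looks_like_sjis("Perl\x82\xCD") ? "legal\n" : "illegal\n";   # legal
print looks_like_sjis("abc\x82")      ? "legal\n" : "illegal\n";   # illegal (lone leading byte)
```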
Length
Reverse
Search
index(STRING, SUBSTR)
index(STRING, SUBSTR, POSITION)
-
Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, starts searching from the beginning of the string.
If the substring is not found, returns -1.
rindex(STRING, SUBSTR)
rindex(STRING, SUBSTR, POSITION)
-
Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before that position.
If the substring is not found, returns -1.
strspn(STRING, SEARCHLIST)
-
Returns the position of the first occurrence of any character not contained in the search list.
strspn("+0.12345*12", "+-.0123456789"); # returns 8.
If the specified string does not contain any character in the search list, returns 0.
If the string consists only of characters in the search list, the returned value equals the length of the string.
strcspn(STRING, SEARCHLIST)
-
Returns the position of the first occurrence of any character contained in the search list.
strcspn("Perlは面白い。", "赤青黄白黒"); # returns 6.
If the specified string does not contain any character in the search list, the returned value equals the length of the string.
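Positions in this section are counted in characters, and a match must start on a character boundary. The sketch below, in core Perl only, illustrates why a bytewise search is not enough; sjis_index and $sjis_char are hypothetical names, and this is not necessarily how the module implements index().

```perl
use strict;
use warnings;

my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: position of $sub in $str, counted in characters.
sub sjis_index {
    my ($str, $sub, $pos) = @_;
    $pos = 0 if !defined $pos || $pos < 0;
    my ($char_no, $byte_no) = (0, 0);
    while ($byte_no + length($sub) <= length($str)) {
        return $char_no
            if $char_no >= $pos
            && CORE::substr($str, $byte_no, length($sub)) eq $sub;
        # Step over one whole character (1 or 2 bytes); stop on an illegal byte.
        CORE::substr($str, $byte_no) =~ /\A($sjis_char)/ or last;
        $byte_no += length($1);
        $char_no++;
    }
    return -1;
}

my $katakana_a = "\x83\x41";                  # KATAKANA A; its trailing byte is 0x41 ('A')
print CORE::index($katakana_a, "A"), "\n";    # 1  (false match inside the character)
print sjis_index($katakana_a, "A"), "\n";     # -1
print sjis_index("Perl\x82\xCD\x96\xCA\x94\x92", "\x94\x92"), "\n";   # 6
```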
Substring
substr(STRING or SCALAR REF, OFFSET)
substr(STRING or SCALAR REF, OFFSET, LENGTH)
substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)
-
It works like CORE::substr, but using character semantics of Shift_JIS encoding.
If REPLACEMENT is specified as the fourth parameter, replaces part of the SCALAR and returns what was there before.
If a reference to a scalar variable is used as the first argument, an lvalue reference is returned, and you can assign through it.
${ &substr(\$str,$off,$len) } = $replace;
works like
CORE::substr($str,$off,$len) = $replace;
The returned lvalue is not Shift_JIS-oriented but byte-oriented, so successive assignments may cause unexpected results.
$str = "0123456789";
$lval = &substr(\$str,3,1);
$$lval = "あい";
$$lval = "a";
# $str is NOT "012aい456789", but an illegal string "012a\xA0い456789".
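To illustrate what "character semantics" means here, the following core-Perl sketch extracts a substring by counting characters rather than bytes. It is read-only and is not the module's implementation; sjis_substr and $sjis_char are hypothetical names.

```perl
use strict;
use warnings;

my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: OFFSET and LENGTH are counted in characters.
sub sjis_substr {
    my ($str, $offset, $length) = @_;
    my @chars = $str =~ /$sjis_char/g;     # split into 1- or 2-byte characters
    my @part  = defined $length ? splice(@chars, $offset, $length)
                                : splice(@chars, $offset);
    return join '', @part;
}

my $str = "Perl\x82\xCD\x96\xCA\x94\x92";  # "Perl" followed by three double-byte characters
print unpack('H*', sjis_substr($str, 4, 2)), "\n";    # 82cd96ca (two whole characters)
print unpack('H*', CORE::substr($str, 4, 3)), "\n";   # 82cd96   (second character cut in half)
```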
Split
strsplit(SEPARATOR, STRING)
strsplit(SEPARATOR, STRING, LIMIT)
-
This function emulates CORE::split, but splits on the SEPARATOR string, not by a pattern. If not in list context, it only returns the number of fields found and does not split into the @_ array.
strsplit('||', '||あいうえお||パピプペポ||01234||'); # ('', 'あいうえお', 'パピプペポ', '01234')
strsplit('／', 'Perl／駱駝／Camel'); # ('Perl', '駱駝', 'Camel')
If an empty string is specified as SEPARATOR, splits the specified string into characters (similarly to CORE::split //, STRING, LIMIT).
strsplit('', 'This is Perl.', 7); # ('T', 'h', 'i', 's', ' ', 'i', 's Perl.')
If an undefined value is specified as SEPARATOR, splits the specified string on whitespace characters (including IDEOGRAPHIC SPACE). Leading whitespace characters do not produce any field (similarly to CORE::split ' ', STRING, LIMIT).
strsplit(undef, ' 　 This is 　 Perl.'); # ('This', 'is', 'Perl.')
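The first example above matters because the trailing byte of 'ポ' (0x83 0x7C) is the same byte as '|'. The sketch below, in core Perl only, compares a bytewise split with a character-aware one. sjis_split is a hypothetical helper that handles only a non-empty separator and no LIMIT; it is not the module's implementation.

```perl
use strict;
use warnings;

my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: split on a literal separator, only at character boundaries.
sub sjis_split {
    my ($sep, $str) = @_;
    die "this sketch needs a non-empty separator\n" unless length($sep);
    my @fields = ('');
    while (length($str)) {
        if (CORE::substr($str, 0, length($sep)) eq $sep) {
            CORE::substr($str, 0, length($sep), '');   # consume the separator
            push @fields, '';
        }
        elsif ($str =~ s/\A($sjis_char)//) {           # consume one whole character
            $fields[-1] .= $1;
        }
        else {
            last;                                      # illegal byte: give up
        }
    }
    pop @fields while @fields && $fields[-1] eq '';    # drop trailing empty fields
    return @fields;
}

my $text  = "||\x82\xA0||\x83\x7C||01234||";
my @chars = sjis_split('||', $text);
my @bytes = split /\|\|/, $text;
print join(',', map { unpack('H*', $_) } @chars), "\n";   # ,82a0,837c,3031323334
print join(',', map { unpack('H*', $_) } @bytes), "\n";   # ,82a0,83,7c3031323334
```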
Comparison
strcmp(LEFT-STRING, RIGHT-STRING)
-
Returns 1 (when LEFT-STRING is greater than RIGHT-STRING), 0 (when LEFT-STRING is equal to RIGHT-STRING), or -1 (when LEFT-STRING is less than RIGHT-STRING).
The order is roughly as shown in the following list.
JIS X 0201 Roman, JIS X 0201 Kana, then JIS X 0208 Kanji (Zenkaku).
For example,
0x41 as 'A' is less than 0xB1 ('ｱ' HANKAKU KATAKANA A).
0xB1 as 'ｱ' is less than 0x8341 ('ア' KATAKANA A).
0x8341 as 'ア' is less than 0x8383 ('ャ' KATAKANA SMALL YA).
0x8383 as 'ャ' is less than 0x83B1 ('Τ' GREEK CAPITAL TAU).
Caveat! Compare the 2nd and the 4th examples. Byte "\xB1" is less than byte "\x83" as a leading byte, but greater as a trailing byte. In short, bytewise (binary) ordering does not agree with the Shift_JIS codepoint order.
strEQ(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is equal to RIGHT-STRING.
Note: strEQ is an expensive equivalent of CORE's eq operator.
strNE(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is not equal to RIGHT-STRING.
Note: strNE is an expensive equivalent of CORE's ne operator.
strLT(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is less than RIGHT-STRING.
strLE(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is less than or equal to RIGHT-STRING.
strGT(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is greater than RIGHT-STRING.
strGE(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is greater than or equal to RIGHT-STRING.
strxfrm(STRING)
-
Returns a string transformed so that CORE's cmp can be used for binary comparisons (NOT the length of the transformed string).
I.e. strxfrm($a) cmp strxfrm($b) is equivalent to strcmp($a, $b), as long as your cmp doesn't use any locale other than that of Perl.
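One way to obtain such a transformed string is to widen every single-byte character to two bytes by prefixing "\x00", so that all single-byte characters sort before all double-byte ones. The sketch below assumes exactly that scheme; it is an illustration in core Perl with hypothetical names (sjis_xfrm, $sjis_char), and the module's real transformation may differ.

```perl
use strict;
use warnings;

my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: pad single-byte characters to two bytes with a leading NUL.
sub sjis_xfrm {
    my ($str) = @_;
    return join '', map { length($_) == 1 ? "\x00" . $_ : $_ } $str =~ /$sjis_char/g;
}

# The five characters from the strcmp examples, deliberately shuffled.
my @shuffled = ("\x83\xB1", "\x83\x41", "\xB1", "A", "\x83\x83");
my @sorted   = sort { sjis_xfrm($a) cmp sjis_xfrm($b) } @shuffled;
my @bytewise = sort { $a cmp $b } @shuffled;

print join(',', map { unpack('H*', $_) } @sorted),   "\n";   # 41,b1,8341,8383,83b1
print join(',', map { unpack('H*', $_) } @bytewise), "\n";   # 41,8341,8383,83b1,b1
```

A plain bytewise sort puts 0xB1 last, which is the mismatch described in the strcmp caveat.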
Character Range
mkrange(EXPR, EXPR)
-
Returns the character list (or, when not in list context, a concatenated string) gained by parsing the specified character range.
A character range is specified with a HYPHEN-MINUS, '-'. The backslashed combinations '\-' and '\\' are used instead of the characters '-' and '\', respectively. The hyphen at the beginning or end of the range is also evaluated as the hyphen itself.
For example, mkrange('+\-0-9a-fA-F') returns ('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C', 'D', 'E', 'F') and scalar mkrange('か-ご') returns 'かがきぎくぐけげこご'.
The order of Shift_JIS characters is: 0x00 .. 0x7F, 0xA1 .. 0xDF, 0x8140 .. 0x9FFC, 0xE040 .. 0xFCFC. So, mkrange('亜-腕') returns the list of all characters in level 1 Kanji.
If a true value is specified as the second parameter, reverse character ranges such as '9-0' and 'Z-A' can be used; otherwise, reverse character ranges cause the function to croak.
Transliteration
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)
-
Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list.
If a reference to a scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and the specified string is unaffected.
$str = "なんといおうか";
print strtr(\$str,"あいうえお", "アイウエオ"), " ", $str; # output: 3 なんとイオウか
$str = "後門の狼。";
print strtr($str,"後狼。", "前虎、"), $str; # output: 前門の虎、後門の狼。
SEARCHLIST and REPLACEMENTLIST
Character ranges such as "ぁ-お" (internally utilizing mkrange()) are supported.
If the REPLACEMENTLIST is empty (specified as '', not undef, because the use of an uninitialized value causes a warning under the -w option), the SEARCHLIST is replicated.
If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (but this works differently when the 'd' modifier is used).
strtr(\$str, 'ぁ-んァ-ヶｦ-ﾟ', '#'); # replaces all Kana letters by '#'.
MODIFIER
c Complement the SEARCHLIST.
d Delete found but unreplaced characters.
s Squash duplicate replaced characters.
R No use of character ranges.
r Allows use of reverse character ranges.
o Caches the conversion table internally.
strtr(\$str, 'ぁ-んァ-ヶｦ-ﾟ', ''); # counts all Kana letters in $str.
$onlykana = strtr($str, 'ぁ-んァ-ヶｦ-ﾟ', '', 'cd'); # deletes all characters except Kana letters.
strtr(\$str, " \x81\x40\n\r\t\f", '', 'd'); # deletes all whitespace characters including IDEOGRAPHIC SPACE.
strtr("おかかうめぼし　ちちとはは", 'ぁ-ん', '', 's'); # output: おかうめぼし　ちとは
strtr("条件演算子の使いすぎは見苦しい", 'ぁ-ん', '＃', 'cs'); # output: ＃の＃いすぎは＃しい
If the 'R' modifier is specified, '-' is not evaluated as a meta character but as HYPHEN-MINUS itself, as in tr'''. Compare:
strtr("90 - 32 = 58", "0-9", "A-J"); # output: "JA - DC = FI"
strtr("90 - 32 = 58", "0-9", "A-J", "R"); # output: "JA - 32 = 58"
# cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J'; # '0' to 'A', '-' to '-', and '9' to 'J'.
If the 'r' modifier is specified, you are allowed to use reverse character ranges. For example, strtr($str, "0-9", "9-0", "r") is equivalent to strtr($str, "0123456789", "9876543210").
strtr($text, '亜-腕', '腕-亜', "r"); # Your text may seem to be clobbered.
PATTERN and TOPATTERN
By use of PATTERN and TOPATTERN, you can transliterate the string using lists containing some multi-character substrings.
If called with four arguments, SEARCHLIST, REPLACEMENTLIST and STRING are split characterwise.
If called with five arguments, a multi-character substring that matches PATTERN in SEARCHLIST, REPLACEMENTLIST or STRING is regarded as a transliteration unit.
If both PATTERN and TOPATTERN are specified, a multi-character substring that matches either PATTERN in SEARCHLIST or STRING, or TOPATTERN in REPLACEMENTLIST, is regarded as a transliteration unit.
print strtr(
    "Caesar Aether Goethe",
    "aeoeueAeOeUe",
    "&auml;&ouml;&ouml;&Auml;&Ouml;&Uuml;",
    "",
    "[aouAOU]e",
    "&[aouAOU]uml;");
# output: C&auml;sar &Auml;ther G&ouml;the
LIST as Anonymous Array
Instead of specifying PATTERN and TOPATTERN, you can use anonymous arrays as SEARCHLIST and/or REPLACEMENTLIST as follows.
print strtr(
    "Caesar Aether Goethe",
    [qw/ae oe ue Ae Oe Ue/],
    [qw/&auml; &ouml; &ouml; &Auml; &Ouml; &Uuml;/]
);
Caching the conversion table
If the 'o' modifier is specified, the conversion table is cached internally. E.g.
foreach (@hiragana_strings) {
    print strtr($_, 'ぁ-ん', 'ァ-ン', 'o');
}
# katakana strings are printed
will be almost as efficient as this:
$hiragana_to_katakana = trclosure('ぁ-ん', 'ァ-ン');
foreach (@hiragana_strings) {
    print &$hiragana_to_katakana($_);
}
You can use whichever you like.
Without 'o',
foreach (@hiragana_strings) {
    print strtr($_, 'ぁ-ん', 'ァ-ン');
}
will be very slow since the conversion table is rebuilt every time the function is called.
Generation of the Closure to Transliterate
trclosure(SEARCHLIST, REPLACEMENTLIST)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)
-
Returns a closure to transliterate the specified string. The return value is only a code reference, not a blessed object. By using this code ref, you save time because you need not specify the parameter list every time.
my $digit_tr = trclosure("1234567890-", "一二三四五六七八九〇－");
print &$digit_tr ("TEL ：0124-45-6789\n"); # ok to perl 5.003
print $digit_tr->("FAX ：0124-51-5368\n"); # perl 5.004 or better
# output:
# TEL ：〇一二四－四五－六七八九
# FAX ：〇一二四－五一－五三六八
The functionality of the closure made by trclosure() is equivalent to that of strtr(). Frankly speaking, strtr() calls trclosure() internally and uses the returned closure.
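The idea of building the conversion table once and reusing it through a closure can be sketched in core Perl as follows. This is only an illustration under simplifying assumptions (no ranges, modifiers or patterns); make_tr and $sjis_char are hypothetical names, and the rule that the first mapping wins for a repeated search character is borrowed from CORE's tr///, not confirmed for the module.

```perl
use strict;
use warnings;

my $sjis_char = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper: build the table once, return a closure that applies it.
sub make_tr {
    my ($from, $to) = @_;
    my @from = $from =~ /$sjis_char/g;
    my @to   = $to   =~ /$sjis_char/g;
    @to = @from unless @to;                 # empty replacement list: replicate the search list
    my %map;
    for my $i (0 .. $#from) {
        next if exists $map{ $from[$i] };   # first mapping wins (assumption, as in tr///)
        $map{ $from[$i] } = $i <= $#to ? $to[$i] : $to[-1];   # pad with the final character
    }
    return sub {
        my ($str) = @_;
        return join '', map { exists $map{$_} ? $map{$_} : $_ } $str =~ /$sjis_char/g;
    };
}

# Two hiragana (0x82A0, 0x82A2) to the matching katakana (0x8341, 0x8343).
my $hira_to_kata = make_tr("\x82\xA0\x82\xA2", "\x83\x41\x83\x43");
print unpack('H*', $hira_to_kata->("\x82\xA0\x82\xA2A")), "\n";   # 8341834341
```

The trailing ASCII 'A' is left alone even though 0x41 also occurs as the trailing byte of a replacement character, because the table is keyed on whole characters.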
Case of the Alphabet
toupper(STRING or SCALAR REF)
-
Returns an uppercased copy of STRING. Converts only half-width Latin characters a-z to A-Z.
If a reference to a scalar variable is specified as the first argument, the string it refers to is uppercased and the number of characters replaced is returned.
tolower(STRING or SCALAR REF)
-
Returns a lowercased copy of STRING. Converts only half-width Latin characters A-Z to a-z.
If a reference to a scalar variable is specified as the first argument, the string it refers to is lowercased and the number of characters replaced is returned.
Conversion between hiragana and katakana
If a reference to a scalar variable is specified as the first argument, the string it refers to is converted and the number of characters replaced is returned. Otherwise, returns the converted string and the specified string is unaffected.
Note: The conversion between a voiced (or semivoiced) katakana (or hiragana), such as 'ガ', 'パ', and hankaku katakana with a voiced mark or a semi-voiced mark, such as 'ｶﾞ', 'ﾊﾟ', is counted as 1. Similarly, the conversion between zenkaku hiragana 'う゛' and zenkaku katakana 'ヴ' is counted as 1.
kanaH2Z(STRING or SCALAR REF)
kataH2Z(STRING or SCALAR REF)
-
Converts Hankaku Katakana to Zenkaku Katakana
Note: kataH2Z is an alias of kanaH2Z.
kataZ2H(STRING or SCALAR REF)
-
Converts Zenkaku Katakana to Hankaku Katakana
kanaZ2H(STRING or SCALAR REF)
-
Converts Zenkaku Hiragana and Katakana to Hankaku Katakana
hiXka(STRING or SCALAR REF)
-
Converts Zenkaku Hiragana to Zenkaku Katakana and Zenkaku Katakana to Zenkaku Hiragana at once.
hi2ka(STRING or SCALAR REF)
-
Converts Zenkaku Hiragana to Zenkaku Katakana
ka2hi(STRING or SCALAR REF)
-
Converts Zenkaku Katakana to Zenkaku Hiragana
Conversion of Whitespace Characters
If a reference to a scalar variable is specified as the first argument, the string it refers to is converted and the number of characters replaced is returned. Otherwise, returns the converted string and the specified string is unaffected.
spaceH2Z(STRING or SCALAR REF)
-
Converts space (half-width) to ideographic space (full-width) in the specified string and returns the converted string.
spaceZ2H(STRING or SCALAR REF)
-
Converts ideographic space (full-width) to space (half-width) in the specified string and returns the converted string.
CAVEAT
A legal Shift_JIS character in this module must match the following regular expression:
[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]
Any string from an external source should be checked by the issjis() function, unless you know it is surely encoded in Shift_JIS.
Use of an illegal Shift_JIS string may lead to odd results.
Some Shift_JIS double-byte characters have a trailing byte in the range of [\x40-\x7E], viz.,
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
The Perl lexer (perhaps) does not treat these bytes specially, so they sometimes cause trouble. E.g. the quoted literal "表" (bytes 0x95 0x5C) causes a fatal error, since its trailing byte 0x5C backslashes the closing quote.
Such a problem doesn't arise when the string is obtained from an external resource. But writing a script containing Shift_JIS double-byte characters needs the greatest care.
The use of a single-quoted heredoc, << '', or \xhh meta characters is recommended in order to define a Shift_JIS string literal.
The safe ASCII-graphic characters, [\x21-\x3F], are:
!"#$%&'()*+,-./0123456789:;<=>?
They are preferred as the delimiter of quote-like operators.
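As a small illustration of the \xhh recommendation, the sketch below defines the problematic character from the example above without putting a raw 0x5C byte in the source file. It uses core Perl only, and the variable name $hyou is hypothetical.

```perl
use strict;
use warnings;

# The character whose trailing byte is 0x5C, written with \xhh escapes so the
# Perl lexer never sees a raw backslash byte just before the closing quote.
my $hyou = "\x95\x5C";

# Show the two bytes, then confirm that the trailing byte really is a backslash.
print join(' ', map { sprintf('%02X', ord($_)) } split //, $hyou), "\n";   # 95 5C
print CORE::substr($hyou, 1, 1) eq '\\'
    ? "trailing byte is a backslash\n"
    : "trailing byte is not a backslash\n";
```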
BUGS
This library supposes $[ is always equal to 0, never 1.
The functions provided by this library use many regexp operations. Therefore, the values of $1 etc. may be changed or discarded unexpectedly. I suggest you save them in variables before calling a function.
AUTHOR
Tomoyuki SADAHIRO
bqw10602@nifty.com
http://homepage1.nifty.com/nomenclator/perl/
Copyright(C) 2001-2002, SADAHIRO Tomoyuki. Japan. All rights reserved.
This program is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
SEE ALSO