NAME
String::Multibyte - Perl module to manipulate multibyte character strings
SYNOPSIS
use String::Multibyte;
$utf8 = String::Multibyte->new('UTF8');
$utf8_len = $utf8->length($utf8_str);
DESCRIPTION
This module provides methods that emulate the corresponding CORE
functions for strings encoded in a multibyte character set.
Definition of Multibyte Charsets
The definition files are located under the /String/Multibyte directory.
The definition file must return a hashref, whose keys should include 'charset', 'regexp', 'nextchar', and 'cmpchar'.
The value for the key 'charset' is a string giving the charset name. Omitting 'charset' matters very little.
The value for the key 'regexp', REQUIRED, is a regexp matching one character of the concerned charset. If 'regexp' is omitted, calling any method croaks.
The value for the key 'nextchar' must be a coderef that returns the character following the specified character. If the 'nextchar' coderef is omitted, the mkrange and strtr functions do not treat the hyphen as a metacharacter for character ranges.
The value for the key 'cmpchar' must be a coderef that compares the two specified characters. If the 'cmpchar' coderef is omitted, the mkrange and strtr functions do not understand reverse character ranges.
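As a rough illustration of the four keys described above, a definition for a single-byte charset might look like the following sketch. The charset name 'MyBytes' is hypothetical, and several details are assumptions rather than facts taken from this documentation: that 'regexp' is given as a pattern string, that 'nextchar' returns undef after the last character, and that 'cmpchar' returns a negative, zero, or positive value in the manner of cmp. In a real definition file the hashref would be the value returned by the file.

```perl
use strict;
use warnings;

# Hypothetical definition for a single-byte charset called 'MyBytes'.
my $def = +{
    charset  => 'MyBytes',
    # Pattern matching exactly one character (here: any single byte).
    # Assumption: the value is a pattern string.
    regexp   => '[\x00-\xFF]',
    # Character following the given one.
    # Assumption: undef signals that there is no next character.
    nextchar => sub {
        my $ch   = shift;
        my $code = ord $ch;
        return $code < 0xFF ? chr($code + 1) : undef;
    },
    # Assumption: three-way comparison, like the cmp operator.
    cmpchar  => sub { $_[0] cmp $_[1] },
};

# Exercise the two coderefs directly.
print $def->{nextchar}->('a'), "\n";        # b
print $def->{cmpchar}->('a', 'b'), "\n";    # -1
```

For the definition files actually shipped with the module, consult the files under the String/Multibyte directory rather than this sketch.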
Constructor
$mbcs = String::Multibyte->new(CHARSET)
$mbcs = String::Multibyte->new(CHARSET, VERBOSE)
CHARSET is the charset name; strictly speaking, it is the file name of the definition file (without the .pm suffix).
new returns an instance that tells the methods in which charset the specified strings should be handled.
e.g.
$sjis = String::Multibyte->new('ShiftJIS');
$substr = $sjis->substr('あいうえお',2,2); # 'うえ'
# 'あいうえお' should be encoded in Shift_JIS.
If a true value is specified as VERBOSE, each called method (except islegal) checks its arguments and carps if any of them is not legally encoded in the concerned charset.
Otherwise no such check is carried out. This saves a bit of time but is unsafe, though you can call the islegal method yourself if necessary.
Check Whether the String is Legal
$mbcs->islegal(LIST)
Returns a boolean indicating whether all the strings in arguments are legally encoded in the concerned charset.
Length
Reverse
Search
$mbcs->index(STRING, SUBSTR)
$mbcs->index(STRING, SUBSTR, POSITION)
Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, starts searching from the beginning of the string.
If the substring is not found, returns -1.
$mbcs->rindex(STRING, SUBSTR)
$mbcs->rindex(STRING, SUBSTR, POSITION)
Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before that position.
If the substring is not found, returns -1.
$mbcs->strspn(STRING, SEARCHLIST)
Returns the position of the first occurrence of any character not contained in the search list.
$mbcs->strspn("+0.12345*12", "+-.0123456789"); # returns 8.
If the specified string does not contain any character in the search list, returns 0.
If the string consists only of characters in the search list, the returned value equals the length of the string.
$mbcs->strcspn(STRING, SEARCHLIST)
Returns the position of the first occurrence of any character contained in the search list.
$mbcs->strcspn("Perlは面白い。", "赤青黄白黒"); # returns 6.
If the specified string does not contain any character in the search list, the returned value equals the length of the string.
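The return values described for strspn and strcspn can be reproduced with a small core-Perl sketch that works on decoded character strings. The helper names my_strspn and my_strcspn are hypothetical and are not part of String::Multibyte; the sketch only mirrors the documented results and is not the module's implementation.

```perl
use strict;
use warnings;
use utf8;    # the source text below contains non-ASCII literals

# Hypothetical helper: length of the leading run of characters
# that ARE in the search list.
sub my_strspn {
    my ($str, $list) = @_;
    my %in  = map { $_ => 1 } split //, $list;
    my $pos = 0;
    for my $ch (split //, $str) {
        last unless $in{$ch};
        $pos++;
    }
    return $pos;
}

# Hypothetical helper: length of the leading run of characters
# that are NOT in the search list.
sub my_strcspn {
    my ($str, $list) = @_;
    my %in  = map { $_ => 1 } split //, $list;
    my $pos = 0;
    for my $ch (split //, $str) {
        last if $in{$ch};
        $pos++;
    }
    return $pos;
}

print my_strspn("+0.12345*12", "+-.0123456789"), "\n";    # 8
print my_strcspn("Perlは面白い。", "赤青黄白黒"), "\n";      # 6
```

Both helpers count characters, not bytes, which is the point of the module's methods when the string is held in a legacy multibyte encoding.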
Substring
$mbcs->substr(STRING or SCALAR REF, OFFSET)
$mbcs->substr(STRING or SCALAR REF, OFFSET, LENGTH)
$mbcs->substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)
It works like CORE::substr, but uses the character semantics of the multibyte charset encoding.
If REPLACEMENT is specified as the fourth argument, replaces part of the SCALAR and returns what was there before.
If a reference to a scalar variable is used as the first argument, an lvalue reference is returned, which you can assign through.
${ $mbcs->substr(\$str,$off,$len) } = $replace;
works like
CORE::substr($str,$off,$len) = $replace;
The returned lvalue is not multibyte character-oriented but byte-oriented, so successive assignments may lead to odd results.
$str = "0123456789";
$lval = $sjis->substr(\$str,3,1);
$$lval = "あい";
$$lval = "a";
# $str is NOT "012aい456789", but an illegal string "012a\xA0い456789".
Split
$mbcs->strsplit(SEPARATOR, STRING)
$mbcs->strsplit(SEPARATOR, STRING, LIMIT)
This function emulates CORE::split, but splits on the SEPARATOR string, not on a pattern.
If not in list context, it only returns the number of fields found and does not split into the @_ array.
$sjis->strsplit('/', 'Perl/駱駝/Camel'); # ('Perl', '駱駝', 'Camel')
If an empty string is specified as SEPARATOR, splits the specified string into characters.
$bytes->strsplit('', 'This is perl.', 7);
# ('T', 'h', 'i', 's', ' ', 'i', 's perl.')
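For comparison, the same two results can be obtained with CORE::split on decoded character strings, provided the separator is quoted so that it is matched literally. This is only a core-Perl analogue of the behaviour described above, under the assumption that the input has already been decoded; it does not use String::Multibyte.

```perl
use strict;
use warnings;
use utf8;    # the source text below contains non-ASCII literals

binmode STDOUT, ':encoding(UTF-8)';

# Quote the separator with \Q...\E so it is treated as a literal string,
# not as a pattern.
my $sep    = '/';
my @fields = split /\Q$sep\E/, 'Perl/駱駝/Camel';
print join('|', @fields), "\n";    # Perl|駱駝|Camel

# An empty pattern splits into characters; the third argument caps the
# number of fields, so the last field keeps the rest of the string.
my @chars = split //, 'This is perl.', 7;
print join('|', @chars), "\n";     # T|h|i|s| |i|s perl.
```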
Character Range
$mbcs->mkrange(EXPR, EXPR)
Returns the character list (or, if not in list context, a concatenated string) obtained by parsing the specified character range.
The result depends on the character order for the concerned charset. For the character order of each charset, see its definition file.
If the character order is undefined in the definition file, returns a string identical to the specified string.
A character range is specified with a HYPHEN-MINUS, '-'. The backslashed combinations '\-' and '\\' are used instead of the characters '-' and '\', respectively. A hyphen at the beginning or end of the range is also evaluated as the hyphen itself.
For example,
$mbcs->mkrange('+\-0-9A-F')
returns
('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F')
and
scalar $mbcs->mkrange('A-P')
returns 'ABCDEFGHIJKLMNOP'.
If a true value is specified as the second argument, reverse character ranges such as '9-0' and 'Z-A' are allowed.
$bytes = String::Multibyte->new('Bytes');
$bytes->mkrange('p-e-r-l', 1); # ponmlkjihgfefghijklmnopqrqponml
Transliteration
$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)
$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)
Transliterates all occurrences of the characters found in the search list into the corresponding characters in the replacement list.
If a reference to a scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and leaves the specified string unaffected.
If the 'h' modifier is specified, returns a histogram hash in list context, or a reference to the histogram hash in scalar context.
$str = "なんといおうか";
print $mbcs->strtr(\$str,"あいうえお", "アイウエオ"), " ", $str;
# output: 3 なんとイオウか
$str = "後門の狼。";
print $mbcs->strtr($str,"後狼。", "前虎、"), $str;
# output: 前門の虎、後門の狼。
SEARCHLIST and REPLACEMENTLIST
Character ranges such as "ぁ-お" (internally utilizing mkrange()) are supported.
If the REPLACEMENTLIST is empty (specified as '', not undef, because the use of an uninitialized value causes a warning under the -w option), the SEARCHLIST is replicated.
If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (but this works differently when the 'd' modifier is used).
$mbcs->strtr(\$str, 'ぁ-んァ-ヶヲ-゚', '#'); # replaces all Kana letters by '#'.
MODIFIER
c   Complement the SEARCHLIST.
d   Delete found but unreplaced characters.
s   Squash duplicate replaced characters.
h   Return a hash (or a hashref) of histogram.
R   No use of character ranges.
r   Allows the use of reverse character ranges.
o   Caches the conversion table internally.
$mbcs->strtr(\$str, 'ぁ-んァ-ヶヲ-゚', '');
# counts all Kana letters in $str.
$mbcs->strtr(\$str, 'ぁ-んァ-ヶヲ-゚', '', 'h');
# counts all Kana letters in $str and returns a histogram.
$onlykana = $mbcs->strtr($str, 'ぁ-んァ-ヶヲ-゚', '', 'cd');
# deletes all characters except Kana.
$mbcs->strtr(\$str, " \x81\x40\n\r\t\f", '', 'd');
# deletes all whitespace characters including full-width space
$mbcs->strtr("おかかうめぼし ちちとはは", 'ぁ-ん', '', 's');
# output: おかうめぼし ちとは
$mbcs->strtr("条件演算子の使いすぎは見苦しい", 'ぁ-ん', '#', 'cs');
# output: #の#いすぎは#しい
If the 'R' modifier is specified, '-' is not evaluated as a metacharacter but as the hyphen itself, as in tr'''. Compare:
$mbcs->strtr("90 - 32 = 58", "0-9", "A-J");
# output: "JA - DC = FI"
$mbcs->strtr("90 - 32 = 58", "0-9", "A-J", "R");
# output: "JA - 32 = 58"
# cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J';
# '0' to 'A', '-' to '-', and '9' to 'J'.
If the 'r' modifier is specified, reverse character ranges are allowed. e.g.
$mbcs->strtr($str, "0-9", "9-0", "r")
is identical to
$mbcs->strtr($str, "0123456789", "9876543210")
Caching the conversion table
If the 'o' modifier is specified, the conversion table is cached internally. e.g.
foreach (@hiragana_strings) {
    print $mbcs->strtr($_, 'ぁ-ん', 'ァ-ン', 'o');
}
# katakana strings are printed
will be almost as efficient as this:
$hiragana_to_katakana = $mbcs->trclosure('ぁ-ん', 'ァ-ン');
foreach (@hiragana_strings) {
    print &$hiragana_to_katakana($_);
}
You can use whichever you like.
Without 'o',
foreach (@hiragana_strings) {
    print $mbcs->strtr($_, 'ぁ-ん', 'ァ-ン');
}
will be very slow, since the conversion table is rebuilt every time the function is called.
Generation of the Closure to Transliterate
$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST)
$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)
Returns a closure that transliterates the specified string. The return value is a plain code reference, not a blessed object. Using this code ref saves time, because you need not specify the arguments every time.
my $digit_tr = $mbcs->trclosure("1234567890-", "一二三四五六七八九〇-");
print &$digit_tr ("TEL :0124-45-6789\n"); # ok to perl 5.003
print $digit_tr->("FAX :0124-51-5368\n"); # perl 5.004 or better
# output:
# TEL :〇一二四-四五-六七八九
# FAX :〇一二四-五一-五三六八
The functionality of the closure made by trclosure() is equivalent to that of strtr(). In fact, strtr() calls trclosure() internally and uses the returned closure.
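The benefit of a closure here is that the conversion table is built once and then reused for every string. The following core-Perl sketch illustrates that idea only; it is not the module's code. The function make_tr is hypothetical, works on decoded character strings, assumes both lists have the same length, and handles neither character ranges nor modifiers.

```perl
use strict;
use warnings;
use utf8;    # the source text below contains non-ASCII literals

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-in for the idea behind trclosure(): build the
# character-to-character table once and capture it in a closure.
sub make_tr {
    my ($search, $replace) = @_;
    my @from = split //, $search;
    my @to   = split //, $replace;    # assumed to be as long as @from
    my %table;
    for my $i (0 .. $#from) {
        # Keep the first mapping if a character is listed twice.
        $table{ $from[$i] } = $to[$i] unless exists $table{ $from[$i] };
    }
    return sub {
        my $str = shift;
        # Look up each character; pass unmapped characters through.
        return join '',
            map { exists $table{$_} ? $table{$_} : $_ } split //, $str;
    };
}

my $digit_tr = make_tr("1234567890", "一二三四五六七八九〇");
print $digit_tr->("TEL :0124-45-6789"), "\n";    # TEL :〇一二四-四五-六七八九
```

Calling make_tr inside a loop would rebuild %table on every iteration, which corresponds to the slow case of strtr() without the 'o' modifier.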
BUGS
This module assumes that $[ is always equal to 0, never 1.
The functions provided by this library use many regexp operations. Therefore the values of $1 etc. may be changed or discarded unexpectedly. It is advisable to save them in variables of your own before calling these functions.
AUTHOR
Tomoyuki SADAHIRO
bqw10602@nifty.com
http://homepage1.nifty.com/nomenclator/perl/
Copyright(C) 2001, SADAHIRO Tomoyuki. Japan. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
perl(1).