NAME
ShiftJIS::CP932::MapUTF - conversion between Microsoft Windows CP-932 and Unicode
SYNOPSIS
use ShiftJIS::CP932::MapUTF qw(:all);
$utf8_string = cp932_to_utf8($cp932_string);
$cp932_string = utf8_to_cp932($utf8_string);
DESCRIPTION
The table of Microsoft Windows CodePage 932 (CP-932) comprises 7915 characters:
JIS X 0201/0211 single-byte characters (191 characters),
JIS X 0208 double-byte characters (6879 characters),
NEC special characters (83 characters, row 13),
NEC-selected IBM extended characters (374 characters, rows 89..92),
and IBM extended characters (388 characters, rows 115..119).
This table includes duplicates that do not round-trip. These duplicates come from the vendor-defined characters of NEC and IBM. For example, two characters are mapped to U+2252 in Unicode: 0x81e0 (a JIS X 0208 character) and 0x8790 (an NEC special character).
In fact, the 7915 characters in CP-932 are mapped to only 7517 characters in Unicode, so there are 398 non-round-trip mappings (7915 - 7517).
This module provides some functions to convert properly from CP-932 to Unicode, and vice versa.
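The figures above can be checked with a few lines of Perl. This is only a sketch: the two hashes are a toy excerpt holding just the U+2252 duplicate (not the module's real tables), and round_trip() is a hypothetical helper introduced for illustration.

```perl
use strict;
use warnings;

# Character counts quoted in the description above.
my %count = (
    single_byte  => 191,
    jis_x_0208   => 6879,
    nec_special  => 83,
    nec_selected => 374,
    ibm_extended => 388,
);
my $cp932_total = 0;
$cp932_total += $_ for values %count;

my $unicode_total  = 7517;
my $non_round_trip = $cp932_total - $unicode_total;

# Toy excerpt of the mapping: both CP-932 codes map to U+2252,
# but U+2252 maps back to only one of them.
my %to_unicode   = ("\x81\xE0" => 0x2252, "\x87\x90" => 0x2252);
my %from_unicode = (0x2252 => "\x81\xE0");

# Hypothetical helper: CP-932 -> Unicode -> CP-932 through the toy tables.
sub round_trip {
    my ($cp932_char) = @_;
    return $from_unicode{ $to_unicode{$cp932_char} };
}

printf "total=%d non-round-trip=%d\n", $cp932_total, $non_round_trip;
printf "0x8790 comes back as 0x%s\n",
    uc unpack('H*', round_trip("\x87\x90"));
```

The NEC special character 0x8790 does not survive the round trip; it comes back as the JIS X 0208 character 0x81E0.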
Transcoding from CP-932 to Unicode
If the first parameter is a reference, it is used as SJIS_CALLBACK, which copes with CP-932 characters unmapped to Unicode. (A reference is never allowed as STRING.)
If SJIS_CALLBACK is given, STRING is the second parameter; otherwise it is the first.
If SJIS_CALLBACK is not specified, CP-932 characters unmapped to Unicode are silently deleted and partial bytes are skipped one byte at a time, as if a coderef that always returns an empty string, sub {''}, were passed as SJIS_CALLBACK.
Currently, only coderefs are allowed as SJIS_CALLBACK. The string returned from SJIS_CALLBACK is inserted in place of the unmapped character.
A coderef given as SJIS_CALLBACK is called with one or more arguments. If the unmapped character is a partial double-byte character (i.e. a one-byte string consisting of a leading byte only), the first argument is undef and the second argument is an unsigned integer representing the byte. If the unmapped character is not partial, the first argument is a defined string representing the character.
By default, a partial double-byte character can appear only at the end of STRING, never at the beginning or in the middle (see also 't' of SJIS_OPTION).
Example
my $sjis_callback = sub {
    my ($char, $byte) = @_;
    return function($char) if defined $char;
    die sprintf "found partial byte 0x%02x", $byte;
};
In the example above, $char may be one of "\x80", "\x82\xf2", "\xfc\xfc", "\xff".
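To make the two calling conventions concrete, the following sketch defines a callback and invokes it by hand, once as for a complete unmapped character and once as for a partial byte. The bracketed output format is an arbitrary choice made for this illustration; being plain ASCII, it is suitable for UTF-8 output only.

```perl
use strict;
use warnings;

# Hypothetical callback for UTF-8 output: it reports what it was handed.
my $sjis_callback = sub {
    my ($char, $byte) = @_;
    if (defined $char) {
        # Complete but unmapped character: show its bytes in hex.
        return '[' . uc(unpack('H*', $char)) . ']';
    }
    # Partial double-byte character: only the leading byte is passed.
    return sprintf '[partial %02X]', $byte;
};

# Called by hand here, the way the description says the converter calls it.
print $sjis_callback->("\x82\xf2"), "\n";    # unmapped character
print $sjis_callback->(undef, 0x81), "\n";   # partial leading byte
```

Following the signatures listed below, such a callback would be passed as the first argument, as in cp932_to_utf8($sjis_callback, $cp932_string).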
The return value of SJIS_CALLBACK must be legal in the target format. E.g. never pass a callback that returns UTF-8 to cp932_to_utf16be(). I.e. you should prepare an SJIS_CALLBACK for each UTF.
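One way to prepare a callback per target format is a small factory, sketched below. The choice of U+FFFD REPLACEMENT CHARACTER as the substitute and the name make_sjis_callback() are assumptions made for this illustration; they are not part of the module.

```perl
use strict;
use warnings;

# U+FFFD REPLACEMENT CHARACTER, pre-encoded for each target format.
my %replacement = (
    utf8    => "\xEF\xBF\xBD",
    utf16le => pack('v', 0xFFFD),
    utf16be => pack('n', 0xFFFD),
    utf32le => pack('V', 0xFFFD),
    utf32be => pack('N', 0xFFFD),
);

# Hypothetical factory: returns a callback whose result is legal in $format.
# The callback ignores its arguments, so partial bytes are replaced as well.
sub make_sjis_callback {
    my ($format) = @_;
    my $bytes = $replacement{$format};
    die "unknown format: $format" unless defined $bytes;
    return sub { $bytes };
}

my $for_utf16be = make_sjis_callback('utf16be');
printf "%s\n", uc unpack('H*', $for_utf16be->("\x80"));
```

The callback built for 'utf16be' would then be used only with cp932_to_utf16be(), the one built for 'utf8' only with cp932_to_utf8(), and so on.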
SJIS_OPTION may be specified after STRING. Options can be combined, like 'tg' and 'gst' (the order does not matter).
'g' add mappings of Gaiji (user defined characters)
[0xF040 to 0xF9FC (rows 95 to 114) in CP-932]
to Unicode's PUA [0xE000 to 0xE757] (1880 characters).
's' add mappings of undefined Single-byte characters:
0x80 => U+0080, 0xA0 => U+F8F0,
0xFD => U+F8F1, 0xFE => U+F8F2, 0xFF => U+F8F3.
't' check the Trailing byte range [0x40..0x7E, 0x80..0xFC].
E.g. "\x81\x39" is regarded as an undefined double-byte character
by default; with 't', it is a partial character byte 0x81
followed by a single-byte character "\x39".
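The ranges quoted for 'g', 's' and 't' can be expressed as plain arithmetic. The sketch below is not the module's code: is_trail_byte() and gaiji_to_pua() are hypothetical helpers, and the consecutive layout of 188 cells per leading byte is an assumption, chosen because it reproduces the stated endpoints (0xF040 to U+E000, 0xF9FC to U+E757).

```perl
use strict;
use warnings;

# 't': a legal trailing byte lies in 0x40..0x7E or 0x80..0xFC.
sub is_trail_byte {
    my ($byte) = @_;
    return ($byte >= 0x40 && $byte <= 0x7E)
        || ($byte >= 0x80 && $byte <= 0xFC) ? 1 : 0;
}

# 'g': Gaiji use leading bytes 0xF0..0xF9 with 188 trailing bytes each
# (assumed to be laid out consecutively from U+E000).
sub gaiji_to_pua {
    my ($lead, $trail) = @_;
    return unless $lead >= 0xF0 && $lead <= 0xF9 && is_trail_byte($trail);
    # 0x7F is not a trailing byte, so bytes from 0x80 up shift down by one.
    my $cell = $trail - ($trail >= 0x80 ? 0x41 : 0x40);
    return 0xE000 + ($lead - 0xF0) * 188 + $cell;
}

# 's': the five undefined single bytes, exactly as listed above.
my %single_to_unicode = (
    0x80 => 0x0080, 0xA0 => 0xF8F0,
    0xFD => 0xF8F1, 0xFE => 0xF8F2, 0xFF => 0xF8F3,
);

printf "0x39 is a trailing byte? %d\n", is_trail_byte(0x39);
printf "0xF040 => U+%04X\n", gaiji_to_pua(0xF0, 0x40);
printf "0xF9FC => U+%04X\n", gaiji_to_pua(0xF9, 0xFC);
printf "0xA0   => U+%04X\n", $single_to_unicode{0xA0};
```

Since 0x39 is outside the trailing byte range, "\x81\x39" is split under 't' into the partial byte 0x81 and the single-byte character "\x39", as described above.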
cp932_to_utf8([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
-
Converts CP-932 to UTF-8.
cp932_to_unicode([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
-
Converts CP-932 to Unicode (Perl's internal format, flagged with SVf_UTF8; see perlunicode). This function is provided only for Perl 5.6.1 or later, and via XS.
cp932_to_utf16le([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
-
Converts CP-932 to UTF-16LE.
cp932_to_utf16be([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
-
Converts CP-932 to UTF-16BE.
cp932_to_utf32le([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
-
Converts CP-932 to UTF-32LE.
cp932_to_utf32be([SJIS_CALLBACK,] STRING [, SJIS_OPTION])
-
Converts CP-932 to UTF-32BE.
Transcoding from Unicode to CP-932
Any duplicates are converted according to Microsoft PRB Q170559. E.g. U+2252 is converted to "\x81\xE0", not to "\x87\x90".
If the first parameter is a reference, it is used as UNICODE_CALLBACK, which copes with Unicode characters unmapped to CP-932. (A reference is never allowed as STRING.)
If UNICODE_CALLBACK is given, STRING is the second parameter; otherwise it is the first.
If UNICODE_CALLBACK is not specified, Unicode characters unmapped to CP-932 are silently deleted and partial bytes are skipped one byte at a time, as if a coderef that always returns an empty string, sub {''}, were passed as UNICODE_CALLBACK.
Currently, only coderefs are allowed as UNICODE_CALLBACK. The string returned from the coderef is inserted in place of the unmapped character.
A coderef given as UNICODE_CALLBACK is called with one or more arguments. If the unmapped character is a partial character (an illegal byte), the first argument is undef and the second argument is an unsigned integer representing the byte. If it is not partial, the first argument is an unsigned integer representing a Unicode code point.
For example, the following callback converts characters unmapped to CP-932 into numeric character references for HTML 4.01.
sub toHexNCR {
    my ($char, $byte) = @_;
    return sprintf("&#x%x;", $char) if defined $char;
    die sprintf "illegal byte 0x%02x was found", $byte;
}
$cp932 = utf8_to_cp932   (\&toHexNCR, $utf8_string);
$cp932 = unicode_to_cp932(\&toHexNCR, $unicode_string);
$cp932 = utf16le_to_cp932(\&toHexNCR, $utf16le_string);
The return value of UNICODE_CALLBACK must be legal in CP-932.
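Two further callbacks, sketched here as variations on toHexNCR, show other return values that stay legal in CP-932. The names toDecNCR and toGeta, and the policy of dropping illegal bytes instead of dying, are assumptions made for this illustration. The callbacks are called by hand with the arguments described above.

```perl
use strict;
use warnings;

# Hypothetical variant: decimal references; illegal bytes are dropped
# rather than treated as fatal.
sub toDecNCR {
    my ($char, $byte) = @_;
    return sprintf('&#%d;', $char) if defined $char;
    return '';
}

# Hypothetical variant: always substitute GETA MARK (U+3013),
# which is "\x81\xAC" in CP-932.
sub toGeta {
    my ($char, $byte) = @_;
    return "\x81\xAC";
}

print toDecNCR(0x20AC), "\n";                       # a code point
printf "%s\n", uc unpack('H*', toGeta(0x20AC));     # bytes of the geta mark
```

Both return values consist only of ASCII or of a defined CP-932 double-byte character, so they can be inserted into the CP-932 result as they are.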
UNICODE_OPTION may be specified after STRING. Options can be combined, like 'fg' and 'gsf' (the order does not matter).
'g' add mappings of Gaiji (user defined characters)
[0xF040 to 0xF9FC (rows 95 to 114) in CP-932]
from Unicode's PUA [0xE000 to 0xE757] (1880 characters).
's' add mappings of undefined Single-byte characters:
U+0080 => 0x80, U+F8F0 => 0xA0,
U+F8F1 => 0xFD, U+F8F2 => 0xFE, U+F8F3 => 0xFF.
'f' add some Fallback mappings from Unicode to CP-932.
The characters additionally mapped are
some characters in latin-1 region [U+00A0..U+00FF], and
HIRAGANA LETTER VU [U+3094, to KATAKANA LETTER VU, 0x8394].
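The reverse direction of 'g' and 's' can be sketched the same way. As before, this is an illustration and not the module's code: pua_to_gaiji() is a hypothetical helper that assumes 188 consecutive cells per leading byte, which reproduces the stated endpoints (U+E000 to 0xF040, U+E757 to 0xF9FC). The hashes hold only the entries listed above.

```perl
use strict;
use warnings;

# 'g': PUA code point back to a two-byte Gaiji code (assumed layout).
sub pua_to_gaiji {
    my ($code) = @_;
    return unless $code >= 0xE000 && $code <= 0xE757;
    my $offset = $code - 0xE000;
    my $lead   = 0xF0 + int($offset / 188);
    my $trail  = 0x40 + $offset % 188;
    $trail++ if $trail >= 0x7F;    # 0x7F is not a trailing byte
    return pack('C2', $lead, $trail);
}

# 's': the five single-byte mappings listed above.
my %unicode_to_single = (
    0x0080 => "\x80", 0xF8F0 => "\xA0",
    0xF8F1 => "\xFD", 0xF8F2 => "\xFE", 0xF8F3 => "\xFF",
);

# 'f': the one fallback spelled out above (HIRAGANA LETTER VU).
my %fallback = (0x3094 => "\x83\x94");

printf "U+E000 => 0x%s\n", uc unpack('H*', pua_to_gaiji(0xE000));
printf "U+E757 => 0x%s\n", uc unpack('H*', pua_to_gaiji(0xE757));
printf "U+3094 => 0x%s\n", uc unpack('H*', $fallback{0x3094});
```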
utf8_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
-
Converts UTF-8 to CP-932.
unicode_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
-
Converts Unicode to CP-932.
The Unicode string is in Perl's internal format (see perlunicode). If it is not flagged with SVf_UTF8, it is upgraded as an ISO 8859-1 (latin1) string. This function is provided only for Perl 5.6.1 or later, and via XS.
utf16_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
-
Converts UTF-16 (with or without BOM) to CP-932.
utf16le_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
-
Converts UTF-16LE to CP-932.
utf16be_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
-
Converts UTF-16BE to CP-932.
utf32_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
-
Converts UTF-32 (with or without BOM) to CP-932.
utf32le_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
-
Converts UTF-32LE to CP-932.
utf32be_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])
-
Converts UTF-32BE to CP-932.
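For readers who prefer to handle the BOM themselves and call the explicit LE/BE functions, the following sketch shows one way to detect a UTF-16 BOM. split_utf16_bom() is a hypothetical helper, not part of the module, and how utf16_to_cp932() itself treats input without a BOM is not stated here. Note that a UTF-32LE BOM also begins with the bytes FF FE, so this check is meant for UTF-16 input only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the byte order announced by a leading BOM
# ('be', 'le', or undef when there is none) and the string without the BOM.
sub split_utf16_bom {
    my ($string) = @_;
    return ('be', substr($string, 2)) if $string =~ /\A\xFE\xFF/;
    return ('le', substr($string, 2)) if $string =~ /\A\xFF\xFE/;
    return (undef, $string);
}

my ($order, $rest) = split_utf16_bom("\xFF\xFE\x41\x00");
printf "order=%s rest=%s\n", $order, uc unpack('H*', $rest);
```

Depending on the detected order, the remainder would be handed to utf16le_to_cp932() or utf16be_to_cp932(); what to do when no BOM is found is left to the caller.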
Export
By default:
cp932_to_utf8 utf8_to_cp932
cp932_to_utf16le utf16le_to_cp932
cp932_to_utf16be utf16be_to_cp932
cp932_to_unicode unicode_to_cp932 (only for XS)
On request:
cp932_to_utf32le utf32le_to_cp932
cp932_to_utf32be utf32be_to_cp932
utf16_to_cp932 [*]
utf32_to_cp932 [*]
[*] Their counterparts, cp932_to_utf16() and cp932_to_utf32(), are not implemented yet. They need more investigation on return values from SJIS_CALLBACK ... (concatenation needs recognition of and coping with BOM)
CAVEAT
The pure Perl edition of this module doesn't understand any logically wide characters (see perlunicode). Use utf8::decode/utf8::encode (see utf8) on Perl 5.7 or later if necessary.
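As a reminder of what the two utf8 functions do, the following sketch converts a UTF-8 byte string into a string of logically wide characters and back. It uses only core Perl; the sample string (the UTF-8 bytes of U+3042) is chosen for illustration.

```perl
use strict;
use warnings;

my $utf8_bytes = "\xE3\x81\x82";    # UTF-8 bytes of U+3042

# utf8::decode works in place: bytes become one logically wide character.
my $characters = $utf8_bytes;
utf8::decode($characters);

# utf8::encode is the inverse: characters become UTF-8 bytes again.
my $encoded = $characters;
utf8::encode($encoded);

printf "%d byte(s) -> %d character(s), U+%04X\n",
    length($utf8_bytes), length($characters), ord($characters);
```

With the pure Perl edition, a byte string such as $utf8_bytes or $encoded is the form to exchange with utf8_to_cp932() and cp932_to_utf8().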
AUTHOR
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
http://homepage1.nifty.com/nomenclator/perl/
Copyright(C) 2001-2003, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
- Microsoft PRB, Article ID: Q170559
-
Conversion Problem Between Shift-JIS and Unicode
- cp932 to Unicode table
-
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://www.microsoft.com/typography/unicode/932.txt (dead link)
http://www.microsoft.com/globaldev/reference/dbcs/932.htm
http://oss.software.ibm.com/cvs/icu/charset/data/xml/windows-932-2000.xml
http://oss.software.ibm.com/cvs/icu/charset/data/ucm/windows-932-2000.ucm