NAME

ShiftJIS::CP932::MapUTF - conversion between Microsoft Windows CP-932 and Unicode

SYNOPSIS

use ShiftJIS::CP932::MapUTF qw(:all);

$utf8_string  = cp932_to_utf8($cp932_string);
$cp932_string = utf8_to_cp932($utf8_string);

DESCRIPTION

The table of Microsoft Windows CodePage 932 (CP-932) comprises 7915 characters:

JIS X 0201 single-byte graphic characters (159 characters),
JIS X 0211 single-byte control characters (32 characters),
JIS X 0208 double-byte graphic characters (6879 characters),
NEC special characters (83 characters, row 13),
NEC-selected IBM extended characters (374 characters, rows 89..92),
and IBM extended characters (388 characters, rows 115..119).

This table includes duplicates that cannot be mapped round-trip. These duplicates arise from the vendor-defined characters of NEC and IBM. For example, two characters are mapped to U+2252 in Unicode: 0x81e0 (a JIS X 0208 character) and 0x8790 (an NEC special character).

In fact, the 7915 characters in CP-932 are mapped to only 7517 characters in Unicode, so there are 398 non-round-trip mappings.
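
The counts above can be cross-checked with a few lines of Perl. This is only an arithmetic sketch; the labels and variable names are chosen here for illustration and are not part of the module.

```perl
use strict;
use warnings;

# Character counts as listed in the description above.
# The labels are illustrative only, not identifiers from the module.
my @groups = (
    [ 'JIS X 0201 single-byte graphic' => 159  ],
    [ 'JIS X 0211 single-byte control' => 32   ],
    [ 'JIS X 0208 double-byte graphic' => 6879 ],
    [ 'NEC special (row 13)'           => 83   ],
    [ 'NEC-selected IBM (rows 89..92)' => 374  ],
    [ 'IBM extended (rows 115..119)'   => 388  ],
);

my $cp932_total = 0;
$cp932_total += $_->[1] for @groups;

my $unicode_total  = 7517;                            # distinct Unicode targets
my $non_round_trip = $cp932_total - $unicode_total;   # codes sharing a target

printf "CP-932 characters: %d, non-round-trip: %d\n",
    $cp932_total, $non_round_trip;
```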

This module provides some functions to map properly from CP-932 to Unicode, and vice versa.

Transcoding from CP-932 to Unicode

If the first parameter is a reference, it is used as SJIS_CALLBACK, which handles CP-932 characters that have no mapping to Unicode. (A reference is never allowed as STRING.)

If SJIS_CALLBACK is given, the second parameter is used as STRING; otherwise the first.

If SJIS_CALLBACK is not specified, CP-932 characters unmapped to Unicode are silently deleted and partial bytes are skipped by one byte.

Currently, only coderefs are used as SJIS_CALLBACK. A string returned from SJIS_CALLBACK is inserted in place of unmapped characters.

A coderef given as SJIS_CALLBACK is called with one or more arguments. If the unmapped character is a partial double-byte character (i.e. only the leading byte), the first argument is undef and the second argument is an unsigned integer representing the byte. If it is not partial, the first argument is a defined string representing the character.

Example

my $sjis_callback = sub {
    my ($char, $byte) = @_;
    return function($char) if defined $char;
    die sprintf "found partial byte 0x%02x", $byte;
};

In the example above, $char may be one of "\x80", "\x82\xf2", "\xfc\xfc", "\xff".
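
As a concrete use of such a callback, the sketch below substitutes U+FFFD (encoded as UTF-8) for each unmapped character and inserts a visible marker for a partial byte instead of dying. The input bytes, the variable names, and the marker text are choices made for this illustration.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf8);

# Callback for the UTF-8 target: the return value must itself be UTF-8.
my $to_replacement = sub {
    my ($char, $byte) = @_;
    return "\xEF\xBF\xBD" if defined $char;        # U+FFFD as UTF-8 bytes
    return sprintf "[partial 0x%02X]", $byte;      # ASCII, so legal UTF-8
};

# HIRAGANA LETTER A, the unmapped code 0x82F2, HIRAGANA LETTER I
my $cp932 = "\x82\xA0" . "\x82\xF2" . "\x82\xA2";
my $utf8  = cp932_to_utf8($to_replacement, $cp932);

print unpack('H*', $utf8), "\n";
```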

The return value of SJIS_CALLBACK must be legal in the target format. For example, never pass a callback that returns UTF-8 to cp932_to_utf16be(). In other words, prepare a separate SJIS_CALLBACK for each UTF.
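
One way to keep a callback per target format is to build it from a table of pre-encoded replacement strings. The following is a minimal sketch in plain Perl; the hash %replacement_for and the helper make_sjis_callback are hypothetical names introduced here, not part of the module.

```perl
use strict;
use warnings;

# U+FFFD REPLACEMENT CHARACTER, pre-encoded for each target format.
# Both this hash and make_sjis_callback are hypothetical helpers.
my %replacement_for = (
    utf8    => "\xEF\xBF\xBD",
    utf16le => pack('v', 0xFFFD),
    utf16be => pack('n', 0xFFFD),
    utf32le => pack('V', 0xFFFD),
    utf32be => pack('N', 0xFFFD),
);

sub make_sjis_callback {
    my ($format) = @_;
    my $replacement = $replacement_for{$format};
    die "unknown target format: $format" unless defined $replacement;

    # The same replacement is returned for unmapped characters
    # and for partial bytes.
    return sub { return $replacement };
}

my $cb_utf16be = make_sjis_callback('utf16be');
print unpack('H*', $cb_utf16be->("\x82\xF2")), "\n";
```

A callback built this way would be passed as the first argument of the matching function, for instance make_sjis_callback('utf16be') with cp932_to_utf16be().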

SJIS_OPTION may be specified after STRING. Options can be combined, like 'tg' and 'gst' (the order does not matter).

'g'    add mapping of CP-932 gaiji (user defined characters)
       [0xF040 to 0xF9FC (rows 95 to 114)]
       to Unicode's PUA [0xE000 to 0xE757] (1880 characters).

's'    add mapping of CP-932 undefined single-byte characters:
       0x80 => U+0080,  0xA0 => U+F8F0,
       0xFD => U+F8F1,  0xFE => U+F8F2,  0xFF => U+F8F3.

't'    check trailing byte ranges [0x40..0x7E, 0x80..0xFC].
       That is, "\x81\x39" is treated as an undefined double-byte character
       by default; with 't', it is a partial byte 0x81
       followed by a single-byte character "\x39".

cp932_to_utf8([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to UTF-8.
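
The options described above can be tried out with cp932_to_utf8(). The sketch below is illustrative only: the helper hex_bytes and the variable names are introduced here, and the comments restate what the option descriptions say rather than verified output.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf8);

# Render a byte string as space-separated hex, for inspection.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# 's': the undefined single byte 0x80 is mapped to U+0080.
my $with_s = cp932_to_utf8("\x80", 's');

# 'g': the first gaiji code 0xF040 is mapped to U+E000 in the PUA.
my $with_g = cp932_to_utf8("\xF0\x40", 'g');

# 't' with a callback: 0x81 followed by 0x39 is reported as a partial byte.
my @partial_bytes;
my $record_partial = sub {
    my ($char, $byte) = @_;
    push @partial_bytes, $byte unless defined $char;
    return '';                       # insert nothing in the output
};
my $with_t = cp932_to_utf8($record_partial, "\x81\x39", 't');

print "s: ", hex_bytes($with_s), "\n";
print "g: ", hex_bytes($with_g), "\n";
print "t: ", hex_bytes($with_t), "\n";
```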

cp932_to_unicode([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to Unicode (Perl's internal format, flagged with SVf_UTF8; see perlunicode).

This function is available only on Perl 5.6.1 or later, and only via XS.

cp932_to_utf16le([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to UTF-16LE.

cp932_to_utf16be([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to UTF-16BE.

cp932_to_utf32le([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to UTF-32LE.

cp932_to_utf32be([SJIS_CALLBACK,] STRING [, SJIS_OPTION])

Converts CP-932 to UTF-32BE.

Transcoding from Unicode to CP-932

Any duplicates are converted according to Microsoft PRB Q170559. E.g. U+2252 is converted to \x81\xe0, not to \x87\x90.
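
The effect on duplicates can be seen by converting the NEC code for U+2252 to UTF-8 and back. This is a small sketch using only the two functions from the SYNOPSIS; the variable names are illustrative.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf8 utf8_to_cp932);

my $nec_code = "\x87\x90";                  # NEC special character for U+2252
my $utf8     = cp932_to_utf8($nec_code);    # UTF-8 bytes of U+2252
my $back     = utf8_to_cp932($utf8);        # the JIS X 0208 code is preferred

# Show each stage as hex: the result differs from the starting code.
printf "%s -> %s -> %s\n", map { unpack 'H*', $_ } $nec_code, $utf8, $back;
```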

If the first parameter is a reference, it is used as UNICODE_CALLBACK, which handles Unicode characters that have no mapping to CP-932. (A reference is never allowed as STRING.)

If UNICODE_CALLBACK is given, the second parameter is used as STRING; otherwise the first.

If UNICODE_CALLBACK is not specified, Unicode characters unmapped to CP-932 are silently deleted and partial bytes are skipped by one byte.

Currently, only coderefs are used as UNICODE_CALLBACK. A string returned from the coderef is inserted in place of unmapped characters.

A coderef given as UNICODE_CALLBACK is called with one or more arguments. If the unmapped character is a partial character (an illegal byte), the first argument is undef and the second argument is an unsigned integer representing the byte. If it is not partial, the first argument is an unsigned integer representing a Unicode code point.

For example, characters unmapped to CP-932 can be converted to numeric character references for HTML 4.01.

sub toHexNCR {
    my ($char, $byte) = @_;
    return sprintf("&#x%x;", $char) if defined $char;
    die sprintf "illegal byte 0x%02x was found", $byte;
}

$cp932 = utf8_to_cp932   (\&toHexNCR, $utf8_string);
$cp932 = unicode_to_cp932(\&toHexNCR, $unicode_string);
$cp932 = utf16le_to_cp932(\&toHexNCR, $utf16le_string);

The return value of UNICODE_CALLBACK must be legal in CP-932.

UNICODE_OPTION may be specified after STRING. Options can be combined, like 'fg' and 'gsf' (the order does not matter).

'g'    add mapping of CP-932 gaiji (user defined characters)
       [0xF040 to 0xF9FC (rows 95 to 114)]
       from Unicode's PUA [0xE000 to 0xE757] (1880 characters).

's'    add mapping of CP-932 undefined single-byte characters:
       U+0080 => 0x80,  U+F8F0 => 0xA0,
       U+F8F1 => 0xFD,  U+F8F2 => 0xFE,  U+F8F3 => 0xFF.

'f'    add some fallback mappings from Unicode to CP-932.

utf8_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-8 to CP-932.
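
The ranges given for the 'g' option can be illustrated with plain arithmetic. The sketch below assumes that the PUA range is laid out linearly over the gaiji area in code order (188 trailing bytes per leading byte, skipping 0x7F), which is consistent with the endpoints listed above. It does not call the module, and the function name pua_to_gaiji is hypothetical.

```perl
use strict;
use warnings;

# Map a PUA code point (U+E000..U+E757) to a CP-932 gaiji code,
# assuming a linear layout: 10 leading bytes (0xF0..0xF9), each with
# 188 trailing bytes (0x40..0x7E and 0x80..0xFC).
sub pua_to_gaiji {
    my ($code_point) = @_;
    return if $code_point < 0xE000 || $code_point > 0xE757;

    my $index = $code_point - 0xE000;
    my $lead  = 0xF0 + int($index / 188);
    my $cell  = $index % 188;
    my $trail = 0x40 + $cell + ($cell >= 63 ? 1 : 0);   # skip 0x7F

    return ($lead << 8) | $trail;
}

printf "U+%04X => 0x%04X\n", $_, pua_to_gaiji($_) for 0xE000, 0xE03F, 0xE757;
```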

unicode_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts Unicode to CP-932.

This Unicode is in Perl's internal format (see perlunicode). If STRING is not flagged with SVf_UTF8, it is upgraded as an ISO 8859-1 string.

This function is available only on Perl 5.6.1 or later, and only via XS.

utf16_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-16 (with or w/o BOM) to CP-932.
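
As a sketch of the BOM handling, the same two characters are prepared below in both byte orders, each preceded by its BOM. It is assumed here that the BOM selects the byte order and is not carried into the CP-932 output; the variable names are illustrative. Note that utf16_to_cp932() is exported only on request.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(utf16_to_cp932);

my @code_points = (0x3042, 0x3044);    # HIRAGANA LETTER A, HIRAGANA LETTER I

# The same text in both byte orders, each with its own BOM.
my $utf16_le = "\xFF\xFE" . pack('v*', @code_points);
my $utf16_be = "\xFE\xFF" . pack('n*', @code_points);

my $from_le = utf16_to_cp932($utf16_le);
my $from_be = utf16_to_cp932($utf16_be);

printf "LE: %s\nBE: %s\n", unpack('H*', $from_le), unpack('H*', $from_be);
```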

utf16le_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-16LE to CP-932.

utf16be_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-16BE to CP-932.

utf32_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-32 (with or w/o BOM) to CP-932.

utf32le_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-32LE to CP-932.

utf32be_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])

Converts UTF-32BE to CP-932.

Export

By default:

cp932_to_utf8     utf8_to_cp932
cp932_to_utf16le  utf16le_to_cp932
cp932_to_utf16be  utf16be_to_cp932

cp932_to_unicode  unicode_to_cp932 (only for XS)

On request:

cp932_to_utf32le  utf32le_to_cp932
cp932_to_utf32be  utf32be_to_cp932
                  utf16_to_cp932 [*]
                  utf32_to_cp932 [*]

[*] Their counterparts cp932_to_utf16() and cp932_to_utf32() are not implemented yet. They need more investigation into the return values from SJIS_CALLBACK (concatenation requires recognizing and coping with a BOM).
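
Functions listed under "On request" have to be named in the import list (or pulled in with :all, as in the SYNOPSIS). A minimal sketch with the UTF-32BE pair follows; the variable names are illustrative.

```perl
use strict;
use warnings;

# Naming functions explicitly imports exactly those names.
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf32be utf32be_to_cp932);

my $utf32be = cp932_to_utf32be("\x82\xA0");    # HIRAGANA LETTER A
my $cp932   = utf32be_to_cp932($utf32be);

printf "%s -> %s\n", unpack('H*', $utf32be), unpack('H*', $cp932);
```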

CAVEAT

The pure Perl edition of this module does not understand any logically wide characters (see perlunicode). Use utf8::decode/utf8::encode (see utf8.pm) on Perl 5.7 or later if necessary.
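
A minimal sketch of that workaround, assuming the text starts out as a Perl character string: encode it to UTF-8 bytes before calling the module, and decode the module's UTF-8 output afterwards. The variable names are illustrative.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(utf8_to_cp932 cp932_to_utf8);

my $text = "\x{3042}\x{3044}";     # a character string (logically wide)

# Encode to UTF-8 bytes before handing the string to the module.
my $utf8_bytes = $text;
utf8::encode($utf8_bytes);
my $cp932 = utf8_to_cp932($utf8_bytes);

# Decode the module's UTF-8 output back into a character string.
my $round_trip = cp932_to_utf8($cp932);
utf8::decode($round_trip);

print "round trip ok\n" if $round_trip eq $text;
```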

AUTHOR

SADAHIRO, Tomoyuki

SADAHIRO@cpan.org

http://homepage1.nifty.com/nomenclator/perl/

Copyright(C) 2001-2003, SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Microsoft PRB, Article ID: Q170559

Conversion Problem Between Shift-JIS and Unicode

cp932 to Unicode table

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

http://www.microsoft.com/typography/unicode/932.txt (dead link)

http://oss.software.ibm.com/cvs/icu/charset/data/xml/windows-932-2000.xml

http://oss.software.ibm.com/cvs/icu/charset/data/ucm/windows-932-2000.ucm
