NAME

ShiftJIS::CP932::MapUTF - conversion between Microsoft Windows CP-932 and Unicode

SYNOPSIS

use ShiftJIS::CP932::MapUTF qw(:all);

$utf8_string  = cp932_to_utf8($cp932_string);
$cp932_string = utf8_to_cp932($utf8_string);

DESCRIPTION

The Microsoft Windows CodePage 932 (CP-932) table comprises 7915 characters:

JIS X 0201:1997 single-byte characters (159 characters),
JIS X 0211:1994 single-byte characters (32 characters),
JIS X 0208:1997 double-byte characters (6879 characters),
NEC special characters (83 characters in SJIS row 13),
NEC-selected IBM extended characters (374 characters in SJIS rows 89..92),
and IBM extended characters (388 characters in SJIS rows 115..119).

The table contains duplicates that do not round-trip. These duplicates come from the vendor-defined characters (NEC and IBM). For example, two characters are mapped to U+2252: 0x81e0 (a JIS X 0208 character) and 0x8790 (an NEC special character).

There are 398 non-round-trip mappings; i.e. the 7915 characters in CP-932 are mapped to only 7517 distinct characters in Unicode.

This module provides functions that map correctly from Windows CP-932 to Unicode, and vice versa.

Functions to transcode CP-932 to Unicode

If a coderef SJIS_CALLBACK is not specified, characters that cannot be mapped to Unicode are deleted; otherwise, the string returned from SJIS_CALLBACK is inserted in their place. The argument passed to SJIS_CALLBACK is a CP-932 string such as "\x80", "\x82\xf2", "\xfc\xfc", or "\xff" (which may be an illegal byte or a partial character).

The return value of SJIS_CALLBACK must be legal in the target format. For example, never pass a callback that returns UTF-8 to cp932_to_utf16be(). In other words, prepare a separate callback coderef for each target encoding.
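
The rule above can be sketched with one callback per target encoding. This is a minimal illustration, not the module's prescribed usage: the variable names and the choice of replacement text (bracketed hex for UTF-8, U+FFFD for UTF-16BE) are assumptions made for the example, and "\x82\xf2" is taken from the list of unmapped arguments above.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf8 cp932_to_utf16be);

# Callback for UTF-8 output: show the offending bytes as ASCII hex.
# ASCII is valid UTF-8, so the return value is legal in the target format.
my $to_utf8_cb = sub {
    my ($sjis) = @_;
    return sprintf "[%s]", unpack("H*", $sjis);
};

# Callback for UTF-16BE output: U+FFFD encoded as two big-endian bytes.
my $to_utf16be_cb = sub { return pack("n", 0xFFFD) };

# "\x82\xf2" has no Unicode mapping; "A" and "B" are ordinary ASCII.
my $input = "A\x82\xf2B";

my $utf8    = cp932_to_utf8($to_utf8_cb, $input);
my $utf16be = cp932_to_utf16be($to_utf16be_cb, $input);

# Without a callback the unmapped character is simply deleted.
my $dropped = cp932_to_utf8($input);
```

Reusing $to_utf8_cb with cp932_to_utf16be() would insert single-byte ASCII into a two-byte stream and corrupt the output, which is why each encoding gets its own coderef.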

cp932_to_utf8(STRING)
cp932_to_utf8(SJIS_CALLBACK, STRING)

Converts Windows CP-932 to UTF-8.

cp932_to_unicode(STRING)
cp932_to_unicode(SJIS_CALLBACK, STRING)

Converts Windows CP-932 to Unicode in Perl's internal format (see perlunicode), with the UTF-8 flag set.

This function is available only on Perl 5.6.1 or later, and only in the XS version.

cp932_to_utf16le([SJIS_CALLBACK,] STRING)

Converts Windows CP-932 to UTF-16LE.

cp932_to_utf16be([SJIS_CALLBACK,] STRING)

Converts Windows CP-932 to UTF-16BE.

cp932_to_utf32le([SJIS_CALLBACK,] STRING)

Converts Windows CP-932 to UTF-32LE.

cp932_to_utf32be([SJIS_CALLBACK,] STRING)

Converts Windows CP-932 to UTF-32BE.

Functions to transcode Unicode to CP-932

Any duplicates are normalized; e.g. U+2252 is converted to \x81\xe0, not to \x87\x90, in CP-932.
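
The normalization can be observed with a short round trip. The following is a minimal sketch that uses only the functions and byte values named in this document; the variable names are illustrative.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf8 utf8_to_cp932);

# Two CP-932 characters that are both mapped to U+2252.
my $jis = "\x81\xe0";   # JIS X 0208 character
my $nec = "\x87\x90";   # NEC special character

# Both decode to the same UTF-8 byte sequence (U+2252 is E2 89 92).
my $utf8_from_jis = cp932_to_utf8($jis);
my $utf8_from_nec = cp932_to_utf8($nec);

# Encoding back yields the JIS X 0208 code, so the NEC code
# does not survive the round trip.
my $back = utf8_to_cp932($utf8_from_nec);

printf "%s\n", unpack("H*", $back);
```

Code that compares CP-932 byte strings before and after a round trip should therefore expect the vendor-defined duplicates to change.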

If the UNICODE_CALLBACK coderef is not specified, illegal bytes are skipped one byte at a time, and characters that cannot be mapped to Windows CP-932 are deleted; otherwise, the string returned from UNICODE_CALLBACK is inserted in their place.

The first argument to UNICODE_CALLBACK is the Unicode code point (an integer), or undef when an illegal byte is encountered.

If the first argument is undef, the value of the illegal byte (an integer) is passed as the second argument.

For example, the following callback converts characters unmapped to Windows CP-932 into numerical character references for HTML 4.01.

sub toHexNCR {
    my ($char, $byte) = @_;
    return sprintf("&#x%x;", $char) if defined $char;
    die sprintf "illegal byte 0x%02x was found", $byte;
}

$cp932 = utf8_to_cp932(\&toHexNCR, $utf8_string);
$cp932 = unicode_to_cp932(\&toHexNCR, $unicode_string);
$cp932 = utf16le_to_cp932(\&toHexNCR, $utf16le_string);

utf8_to_cp932(STRING)
utf8_to_cp932(UNICODE_CALLBACK, STRING)

Converts UTF-8 to Windows CP-932.

unicode_to_cp932(STRING)
unicode_to_cp932(UNICODE_CALLBACK, STRING)

Converts Unicode to Windows CP-932.

The input is in Perl's internal format (see perlunicode). If the string is not flagged with SVf_UTF8, it is upgraded as an ISO 8859-1 string.

This function is available only on Perl 5.6.1 or later, and only in the XS version.

utf16le_to_cp932([UNICODE_CALLBACK,] STRING)

Converts UTF-16LE to Windows CP-932.

utf16be_to_cp932([UNICODE_CALLBACK,] STRING)

Converts UTF-16BE to Windows CP-932.

utf32le_to_cp932([UNICODE_CALLBACK,] STRING)

Converts UTF-32LE to Windows CP-932.

utf32be_to_cp932([UNICODE_CALLBACK,] STRING)

Converts UTF-32BE to Windows CP-932.

Export

By default:

cp932_to_utf8     utf8_to_cp932
cp932_to_utf16le  utf16le_to_cp932
cp932_to_utf16be  utf16be_to_cp932

cp932_to_unicode  unicode_to_cp932 (only for XS)

On request:

cp932_to_utf32le  utf32le_to_cp932
cp932_to_utf32be  utf32be_to_cp932
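
Because the UTF-32 functions are exported only on request, they have to be named in the import list. A minimal sketch, with an illustrative sample character (HIRAGANA LETTER A, 0x82a0 in CP-932, U+3042 in Unicode):

```perl
use strict;
use warnings;
# The UTF-32 functions are not exported by default, so name them explicitly.
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf32be utf32be_to_cp932);

my $cp932   = "\x82\xa0";                  # HIRAGANA LETTER A in CP-932
my $utf32be = cp932_to_utf32be($cp932);    # U+3042 as four big-endian bytes
my $back    = utf32be_to_cp932($utf32be);  # converted back to CP-932
```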

CAVEAT

The pure Perl version of this module does not understand logically wide characters (see perlunicode). If necessary, use utf8::decode/utf8::encode (see utf8) on Perl 5.7 or later.
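
One way to apply this is to convert between character strings and UTF-8 octets at the boundary of the module's functions. The sketch below assumes Perl 5.8 or later, where utf8::encode and utf8::decode are always available; the variable names are illustrative.

```perl
use strict;
use warnings;
use ShiftJIS::CP932::MapUTF qw(cp932_to_utf8 utf8_to_cp932);

# A character string containing a wide character (U+3042).
my $text = "\x{3042}";

# utf8_to_cp932() works on UTF-8 octets, so encode a copy in place first.
my $octets = $text;
utf8::encode($octets);

my $cp932 = utf8_to_cp932($octets);

# In the other direction, decode the returned UTF-8 octets
# into a character string.
my $chars = cp932_to_utf8($cp932);
utf8::decode($chars);
```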

AUTHOR

Tomoyuki SADAHIRO

bqw10602@nifty.com
http://homepage1.nifty.com/nomenclator/perl/

Copyright(C) 2001-2002, SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

Microsoft PRB, Article ID: Q170559

Conversion Problem Between Shift-JIS and Unicode

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

cp932 to Unicode table