NAME

ShiftJIS::X0213::MapUTF - conversion between Shift_JISX0213 and Unicode

SYNOPSIS

use ShiftJIS::X0213::MapUTF;

$unicode_string  = sjis0213_to_unicode($sjis0213_string);
$sjis0213_string = unicode_to_sjis0213($unicode_string);

DESCRIPTION

This module provides some functions to map from Shift_JISX0213 to Unicode, and vice versa.

sjis0213_to_unicode(STRING)
sjis0213_to_unicode(CODEREF, STRING)

Converts Shift_JISX0213 to Unicode (UTF-8 or UTF-EBCDIC, whichever a Unicode-aware perl uses).

If CODEREF is not specified, characters that have no mapping to Unicode are deleted; otherwise, each such character is converted by calling CODEREF with the Shift_JISX0213 character string.
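As an illustration of the CODEREF form, the sketch below defines a fallback that renders each unmapped character as a visible hexadecimal placeholder. The handler name `hex_placeholder` and the placeholder format are assumptions made for this example; the text above only states that the CODEREF is given the Shift_JISX0213 character string. The module is loaded inside `eval` so that the sketch still runs where it is not installed.

```perl
use strict;
use warnings;

# Hypothetical fallback: receives the bytes of one unmapped
# Shift_JISX0213 character and returns replacement text.
sub hex_placeholder {
    my $sjis_char = shift;
    return sprintf "[SJIS:%s]", uc(unpack("H*", $sjis_char));
}

# Called directly on an arbitrary byte pair, the handler behaves like this:
print hex_placeholder("\xFC\xFC"), "\n";   # prints "[SJIS:FCFC]"

# Pass it as the first argument when the module is available.
if (eval { require ShiftJIS::X0213::MapUTF; 1 }) {
    my $convert = ShiftJIS::X0213::MapUTF->can('sjis0213_to_unicode');
    my $sjis0213_string = "\x82\xA0";      # sample input bytes
    my $unicode_string  = $convert->(\&hex_placeholder, $sjis0213_string)
        if $convert;
}
```

A visible placeholder of this kind makes lost characters easy to find in the output, whereas the default behaviour drops them silently.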

sjis0213_to_utf16be(STRING)
sjis0213_to_utf16be(CODEREF, STRING)

Converts Shift_JISX0213 to UTF-16BE.

sjis0213_to_utf16le(STRING)
sjis0213_to_utf16le(CODEREF, STRING)

Converts Shift_JISX0213 to UTF-16LE.

In both UTF-16 functions, if CODEREF is not specified, characters that have no mapping to Unicode are deleted; otherwise, each such character is converted by calling CODEREF with the Shift_JISX0213 character string.
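The two UTF-16 variants differ only in byte order. The sketch below does not call the module; it uses the core `unpack` function on hand-written sample bytes to show how the output of either function could be inspected as 16-bit code units. The helper name `utf16_units` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a UTF-16 byte string into 16-bit code units.
# 'n' reads big-endian units, 'v' reads little-endian units.
sub utf16_units {
    my ($bytes, $order) = @_;
    return unpack($order eq 'be' ? 'n*' : 'v*', $bytes);
}

# U+3042 (HIRAGANA LETTER A) written by hand in each byte order.
my $utf16be = "\x30\x42";
my $utf16le = "\x42\x30";

printf "U+%04X\n", $_ for utf16_units($utf16be, 'be');   # prints "U+3042"
printf "U+%04X\n", $_ for utf16_units($utf16le, 'le');   # prints "U+3042"
```

Code points above U+FFFF appear as two units (a surrogate pair) in this view.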

unicode_to_sjis0213(STRING)
unicode_to_sjis0213(CODEREF, STRING)

Converts Unicode (UTF-8 or UTF-EBCDIC, whichever a Unicode-aware perl uses) to Shift_JISX0213.

If CODEREF is not specified, characters that have no mapping to Shift_JISX0213 are deleted; otherwise, each such character is converted by calling CODEREF with its Unicode codepoint (an integer).

For example, the following converts characters unmapped to Shift_JISX0213 into numeric character references for HTML 4.01:

unicode_to_sjis0213(sub {sprintf "&#x%04x;", shift}, $unicode_string);

utf16be_to_sjis0213(STRING)
utf16be_to_sjis0213(CODEREF, STRING)

Converts UTF-16BE to Shift_JISX0213.

utf16le_to_sjis0213(STRING)
utf16le_to_sjis0213(CODEREF, STRING)

Converts UTF-16LE to Shift_JISX0213.

In both UTF-16 functions, if CODEREF is not specified, characters that have no mapping to Shift_JISX0213 are deleted; otherwise, each such character is converted by calling CODEREF with its Unicode codepoint (an integer).

For example, the following converts characters unmapped to Shift_JISX0213 into numeric character references for HTML 4.01:

utf16le_to_sjis0213(sub {sprintf "&#x%04x;", shift}, $utf16LE_string);
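Because the fallback is an ordinary CODEREF, it can also record what was lost. The sketch below builds a closure that returns a numeric character reference and remembers each code point it was given. The names `make_recording_fallback`, `$fallback` and `@unmapped` are hypothetical; the only behaviour assumed of the module is what is stated above, namely that the CODEREF receives the Unicode codepoint as an integer.

```perl
use strict;
use warnings;

# Hypothetical factory: returns a fallback that logs every code point
# it is asked to replace into the array referenced by $seen.
sub make_recording_fallback {
    my ($seen) = @_;
    return sub {
        my $codepoint = shift;
        push @$seen, $codepoint;
        return sprintf "&#x%04x;", $codepoint;
    };
}

my @unmapped;
my $fallback = make_recording_fallback(\@unmapped);

# Direct calls with arbitrary code points stand in for the calls
# that a conversion would make.
print $fallback->(0x20AC), "\n";    # prints "&#x20ac;"
print $fallback->(0x1F600), "\n";   # prints "&#x1f600;"
printf "%d code points recorded\n", scalar @unmapped;   # prints "2 code points recorded"

# With the module installed, the same CODEREF would be passed as
#   unicode_to_sjis0213($fallback, $unicode_string);
```

After a conversion, `@unmapped` can be reported to the user or used to decide whether Shift_JISX0213 is a suitable target encoding for the text.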

BUGS

On mapping between Shift_JISX0213 and Unicode used in this module, notice that:

  • If an authoritative mapping is published, the mapping used by this module will be corrected to follow it.

  • 0xFC5A in Shift_JISX0213 is mapped to U+9B1D according to JIS X 0213:2000, while Unicode's Unihan.txt maps it to U+9B1C.

  • 0x81D4 and 0x81D5 in Shift_JISX0213 are mapped into the Halfwidth and Fullwidth Forms block, not the Miscellaneous Mathematical Symbols-B block, according to Shibano's JIS KANJI JITEN, published in June 2002.

  • The following 25 JIS non-Kanji characters are not included in Unicode 3.2.0, so each is mapped to a sequence of two Unicode characters. These mappings round-trip for *one Shift_JISX0213 character*, but round-tripping for a Shift_JISX0213 *string* is broken. (E.g. Shift_JISX0213 <0x8663> and <0x857B, 0x867B> are both mapped to <U+00E6, U+0300>, while <U+00E6, U+0300> is mapped only to SJIS <0x8663>.)

    SJIS0213  Unicode 3.2.0    # Name by JIS X 0213:2000
    
    0x82F5    <U+304B, U+309A> # [HIRAGANA LETTER BIDAKUON NGA]
    0x82F6    <U+304D, U+309A> # [HIRAGANA LETTER BIDAKUON NGI]
    0x82F7    <U+304F, U+309A> # [HIRAGANA LETTER BIDAKUON NGU]
    0x82F8    <U+3051, U+309A> # [HIRAGANA LETTER BIDAKUON NGE]
    0x82F9    <U+3053, U+309A> # [HIRAGANA LETTER BIDAKUON NGO]
    0x8397    <U+30AB, U+309A> # [KATAKANA LETTER BIDAKUON NGA]
    0x8398    <U+30AD, U+309A> # [KATAKANA LETTER BIDAKUON NGI]
    0x8399    <U+30AF, U+309A> # [KATAKANA LETTER BIDAKUON NGU]
    0x839A    <U+30B1, U+309A> # [KATAKANA LETTER BIDAKUON NGE]
    0x839B    <U+30B3, U+309A> # [KATAKANA LETTER BIDAKUON NGO]
    0x839C    <U+30BB, U+309A> # [KATAKANA LETTER AINU CE]
    0x839D    <U+30C4, U+309A> # [KATAKANA LETTER AINU TU(TU)]
    0x839E    <U+30C8, U+309A> # [KATAKANA LETTER AINU TO(TU)]
    0x83F6    <U+31F7, U+309A> # [KATAKANA LETTER AINU P]
    0x8663    <U+00E6, U+0300> # [LATIN SMALL LETTER AE WITH GRAVE]
    0x8667    <U+0254, U+0300> # [LATIN SMALL LETTER OPEN O WITH GRAVE]
    0x8668    <U+0254, U+0301> # [LATIN SMALL LETTER OPEN O WITH ACUTE]
    0x8669    <U+028C, U+0300> # [LATIN SMALL LETTER TURNED V WITH GRAVE]
    0x866A    <U+028C, U+0301> # [LATIN SMALL LETTER TURNED V WITH ACUTE]
    0x866B    <U+0259, U+0300> # [LATIN SMALL LETTER SCHWA WITH GRAVE]
    0x866C    <U+0259, U+0301> # [LATIN SMALL LETTER SCHWA WITH ACUTE]
    0x866D    <U+025A, U+0300> # [LATIN SMALL LETTER HOOKED SCHWA WITH GRAVE]
    0x866E    <U+025A, U+0301> # [LATIN SMALL LETTER HOOKED SCHWA WITH ACUTE]
    0x8685    <U+02E9, U+02E5> # [RISING SYMBOL]
    0x8686    <U+02E5, U+02E9> # [FALLING SYMBOL]
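The loss of string-level round-tripping described above can be reproduced with a three-entry excerpt of the mapping. The sketch below is not the module's implementation: the helper names `sjis_to_cps` and `cps_to_sjis` are hypothetical, and the longest-match rule in the encoder is an assumption chosen to match the statement that <U+00E6, U+0300> is mapped only to <0x8663>.

```perl
use strict;
use warnings;

# Three entries implied by the note above; the real table is far larger.
my %to_unicode = (
    "\x86\x63" => [0x00E6, 0x0300],   # one character, two code points
    "\x85\x7B" => [0x00E6],
    "\x86\x7B" => [0x0300],
);

# Reverse table keyed by the code points joined with commas.
my %to_sjis;
for my $sjis (keys %to_unicode) {
    $to_sjis{ join(",", @{ $to_unicode{$sjis} }) } = $sjis;
}

# Decode a list of Shift_JISX0213 characters into code points.
sub sjis_to_cps {
    return map { @{ $to_unicode{$_} } } @_;
}

# Encode code points, preferring a two-code-point match when one exists.
sub cps_to_sjis {
    my @cps = @_;
    my @out;
    while (@cps) {
        if (@cps >= 2 && exists $to_sjis{"$cps[0],$cps[1]"}) {
            push @out, $to_sjis{"$cps[0],$cps[1]"};
            splice @cps, 0, 2;
        }
        else {
            my $cp = shift @cps;
            push @out, $to_sjis{$cp};
        }
    }
    return @out;
}

my @original = ("\x85\x7B", "\x86\x7B");
my @back     = cps_to_sjis(sjis_to_cps(@original));
print join(" ", map { uc(unpack("H*", $_)) } @back), "\n";   # prints "8663"
```

Two characters went in and one came out, which is the string-level breakage; a lone <0x8663> still survives the same round trip unchanged.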

AUTHOR

Tomoyuki SADAHIRO

bqw10602@nifty.com
http://homepage1.nifty.com/nomenclator/perl/

Copyright(C) 2002-2002, SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

JIS X 0213:2000

7-bit and 8-bit double byte coded extended KANJI sets for information interchange (by JIS Committee)

JIS KANJI JITEN, the revised edition.

edited by Shibano, published by Japanese Standards Association (JSA), 2002, Tokyo [ISBN4-542-20129-5]

http://www.jsa.or.jp/

Japanese Standards Association (access to JIS)

http://www.unicode.org/Public/UNIDATA/Unihan.txt

Unihan database (Unicode version: 3.2.0) by Unicode (c).

http://homepage1.nifty.com/nomenclator/unicode/sjis0213.zip

A mapping table between Shift_JISX0213 and Unicode 3.2.0.

(I prepared this table myself and it carries no authority, but it shows exactly what this module does.)

ShiftJIS::CP932::MapUTF

conversion between Microsoft Windows CP-932 and Unicode

(The CP932-Unicode mapping differs from the Shift_JISX0213-Unicode mapping, and the former may be the one you actually want.)