NAME

Unicode::Transform - conversion among Unicode Transformation Formats (UTFs)

SYNOPSIS

use Unicode::Transform qw(:all);

$unicode_string = utf16be_to_unicode($utf16be_string);
$utf16le_string = unicode_to_utf16le($unicode_string);
$utf8_string    = utf32be_to_utf8   ($utf32be_string);

$utf8_string    = utf32be_to_utf8(\&chr_utf8, $utf32be_string);
     # illegal code points are allowed.

DESCRIPTION

This module provides functions that convert a string between Unicode Transformation Formats (UTFs).

Conversion Between UTFs

(Exporting: use Unicode::Transform qw(:conv);)

<SRC_UTF_NAME>_to_<DST_UTF_NAME>([CALLBACK,] STRING)

Function names

A function name consists of SRC_UTF_NAME, the string '_to_', and DST_UTF_NAME.

SRC_UTF_NAME (the UTF of the source string) and DST_UTF_NAME (the UTF of the return value) must each be one of the following names, which are lowercased and have their hyphens removed:

unicode    (for Perl's internal strings; see perlunicode)
utf16le    (for UTF-16LE)
utf16be    (for UTF-16BE)
utf32le    (for UTF-32LE)
utf32be    (for UTF-32BE)
utf8       (for UTF-8)
utf8mod    (for UTF-8-Mod)
utfcp1047  (for CP-1047-oriented UTF-EBCDIC).

In all, 64 (i.e. 8 times 8) functions are available. Examples of available function names are utf16be_to_utf32le() and utf8_to_unicode(). DST_UTF_NAME may be the same as SRC_UTF_NAME, as in utf8_to_utf8().
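The naming scheme can be sketched in plain Perl. The snippet below only builds the names from the list above; it does not load Unicode::Transform, and the variable names are illustrative.

```perl
use strict;
use warnings;

# The eight UTF names listed above.
my @utf_names = qw(unicode utf16le utf16be utf32le utf32be utf8 utf8mod utfcp1047);

# Every SRC/DST pair yields one function name, including pairs where SRC equals DST.
my @function_names;
for my $src (@utf_names) {
    for my $dst (@utf_names) {
        push @function_names, "${src}_to_${dst}";
    }
}

printf "%d conversion functions\n", scalar @function_names;
print "$_\n" for grep { /^utf8_to_/ } @function_names;
```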

Parameters

If the first parameter is a reference, it is taken as CALLBACK, which is used for coping with illegal characters and octets. A reference is not allowed as STRING.

If CALLBACK is given, STRING is the second parameter; otherwise the first. STRING is a source string. Currently, only coderefs are allowed as CALLBACK.

If CALLBACK is omitted, illegal code points and partial octets are deleted, as if a code reference that always returns the empty string, sub {''}, were used as CALLBACK.

Illegal code points comprise surrogate code points [0xD800..0xDFFF] and out-of-range code points [0x110000 and greater].

Partial octets are octets which do not represent any code point. Examples include a UTF-8 lead octet with no following octets, such as "\xC2", and the final octet of a UTF-16BE or UTF-16LE string with an odd number of octets.

If CALLBACK is specified, it is called each time an illegal code point or a partial octet is encountered. The first parameter passed to CALLBACK is the unsigned integer value of the code point or octet; a value less than 256 indicates a partial octet.

The return value of CALLBACK is inserted at that position and should be encoded in the UTF named by DST_UTF_NAME. You may use chr_<DST_UTF_NAME>() as CALLBACK (see below).

You can call die or croak in CALLBACK if you want to trap an ill-formed source.
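As a rough illustration of the callback rules above, the following sketch defines two callbacks in plain Perl. The names mark_illegal and reject_illegal are hypothetical, and the ASCII-only markers assume a destination of utf8 or unicode, where ASCII text is already valid. The calls into this module are shown only as comments, so the snippet runs without it.

```perl
use strict;
use warnings;

# Hypothetical callback: replaces each illegal item with a visible ASCII marker.
sub mark_illegal {
    my ($value) = @_;
    # Values below 256 are partial octets; anything else is an illegal code point.
    return sprintf('[octet %02X]', $value) if $value < 0x100;
    return sprintf('[U+%04X]', $value);
}

# Hypothetical callback: aborts the conversion on the first illegal item.
sub reject_illegal {
    my ($value) = @_;
    die sprintf("ill-formed source: 0x%X\n", $value);
}

# Intended use with this module (not executed here):
#   my $marked = utf8_to_utf8(\&mark_illegal, $octets);
#   my $strict = eval { utf8_to_unicode(\&reject_illegal, $octets) };

print mark_illegal(0xC2), "\n";
print mark_illegal(0xD800), "\n";
```

Wrapping the strict variant in eval lets the caller decide how to report the failure instead of ending the program.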

Conversion from Code Point to String

(Exporting: use Unicode::Transform qw(:chr);)

Returns the character represented by CODEPOINT as a string in the corresponding Unicode transformation format. CODEPOINT should be an unsigned integer. A string is returned even if CODEPOINT is a surrogate code point [0xD800..0xDFFF].

The maximum value of CODEPOINT is:

0x0010_FFFF for chr_utf16le() and chr_utf16be()
0x7FFF_FFFF for chr_utf8(), chr_utf8mod(), chr_utfcp1047()
0xFFFF_FFFF for chr_utf32le(), chr_utf32be()

Returns undef if CODEPOINT is greater than the maximum value.

chr_unicode(CODEPOINT)
chr_utf16le(CODEPOINT)
chr_utf16be(CODEPOINT)
chr_utf32le(CODEPOINT)
chr_utf32be(CODEPOINT)
chr_utf8(CODEPOINT)
chr_utf8mod(CODEPOINT)
chr_utfcp1047(CODEPOINT)
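To make the documented rules concrete, here is a minimal sketch of what chr_utf16be() is described as doing, written with core pack() only. The function sketch_chr_utf16be is a hypothetical stand-in that mirrors the rules stated above (undef above 0x0010_FFFF, a string even for surrogates); it is not this module's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for chr_utf16be(), using core pack() only.
sub sketch_chr_utf16be {
    my ($cp) = @_;
    return undef if $cp > 0x10FFFF;            # above the documented maximum
    return pack('n', $cp) if $cp < 0x10000;    # one 16-bit unit, surrogates included
    my $offset = $cp - 0x10000;                # supplementary planes need a surrogate pair
    return pack('nn', 0xD800 | ($offset >> 10), 0xDC00 | ($offset & 0x3FF));
}

# Show the octets for U+1F600 as hexadecimal digits.
printf "%s\n", unpack('H*', sketch_chr_utf16be(0x1F600));
```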

Numeric Value of the First Character

(Exporting: use Unicode::Transform qw(:ord);)

Returns the unsigned integer value of the first character of STRING. If STRING is empty or begins with a partial octet, returns undef.

STRING may begin with a surrogate code point [0xD800..0xDFFF] or an out-of-range code point [0x110000 and greater].

ord_unicode(STRING)
ord_utf16le(STRING)
ord_utf16be(STRING)
ord_utf32le(STRING)
ord_utf32be(STRING)
ord_utf8(STRING)
ord_utf8mod(STRING)
ord_utfcp1047(STRING)
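The behaviour described for ord_utf32be() can be approximated in core Perl as follows. The function sketch_ord_utf32be is a hypothetical stand-in, not this module's implementation, and it treats any string shorter than four octets as beginning with a partial octet.

```perl
use strict;
use warnings;

# Hypothetical stand-in for ord_utf32be(), using core unpack() only.
sub sketch_ord_utf32be {
    my ($string) = @_;
    return undef if length($string) < 4;   # empty, or fewer octets than one UTF-32 unit
    return unpack('N', $string);           # first four octets, big-endian
}

# An out-of-range value is still reported, as the text above allows.
printf "0x%X\n", sketch_ord_utf32be(pack('N', 0x110000) . "trailing");
```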

AUTHOR

SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

http://homepage1.nifty.com/nomenclator/perl/

Copyright(C) 2002-2003, SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

perlunicode
UTF-EBCDIC (and UTF-8-Mod)

http://www.unicode.org/reports/tr16/