NAME
Unicode::Transform - conversion among Unicode Transformation Formats (UTFs)
SYNOPSIS
use Unicode::Transform qw(:all);
$unicode_string = utf16be_to_unicode($utf16be_string);
$utf16le_string = unicode_to_utf16le($unicode_string);
$utf8_string = utf32be_to_utf8 ($utf32be_string);
$utf8_string = utf32be_to_utf8(\&chr_utf8, $utf32be_string);
# illegal code points are allowed.
DESCRIPTION
This module provides functions that convert a string among several Unicode Transformation Formats (UTFs).
Conversion Between UTFs
(Exporting: use Unicode::Transform qw(:conv);)
Function names
A function name consists of SRC_UTF_NAME, string '_to_', and DST_UTF_NAME.
SRC_UTF_NAME (the UTF of the source string) and DST_UTF_NAME (the UTF of the return value) must each be one of the following names, which are written in lowercase with hyphens removed:
unicode (for Perl's internal strings; see perlunicode)
utf16le (for UTF-16LE)
utf16be (for UTF-16BE)
utf32le (for UTF-32LE)
utf32be (for UTF-32BE)
utf8 (for UTF-8)
utf8mod (for UTF-8-Mod)
utfcp1047 (for CP-1047-oriented UTF-EBCDIC).
In all, 64 (i.e. 8 times 8) functions are available, for example utf16be_to_utf32le() and utf8_to_unicode(). DST_UTF_NAME may be the same as SRC_UTF_NAME, as in utf8_to_utf8().
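The naming scheme above can be illustrated with a short sketch. It assumes the module is installed and that the non-"unicode" formats are returned as plain octet strings; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:conv);

# "A" followed by U+3042 (HIRAGANA LETTER A), as UTF-16BE octets.
my $utf16be = "\x00\x41\x30\x42";

# SRC_UTF_NAME = utf16be, DST_UTF_NAME = utf8
my $utf8 = utf16be_to_utf8($utf16be);

# SRC_UTF_NAME = utf8, DST_UTF_NAME = utf32le
my $utf32le = utf8_to_utf32le($utf8);

# Convert back to the format we started from.
my $back = utf32le_to_utf16be($utf32le);

# Show the octets of each result in hexadecimal.
print "utf8:    ", unpack('H*', $utf8),    "\n";   # 41e38182
print "utf32le: ", unpack('H*', $utf32le), "\n";   # 4100000042300000
print "utf16be: ", unpack('H*', $back),    "\n";   # 00413042
```

Any of the 64 combinations follows the same pattern: pick the source name, append _to_, then the destination name.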
Parameters
If the first parameter is a reference, it is taken as CALLBACK, which is used for coping with illegal characters and octets. A reference is never allowed as STRING.
If CALLBACK is given, STRING is the second parameter; otherwise it is the first. STRING is the source string. Currently, only code references are allowed as CALLBACK.
If CALLBACK is omitted, illegal code points and partial octets are deleted, as if a code reference that always returns an empty string, sub {''}, were used as CALLBACK.
Illegal code points comprise surrogate code points [0xD800..0xDFFF] and out-of-range code points [0x110000 and greater].
Partial octets are octets which do not represent any code point. They include a first octet in UTF-8 that lacks its following octets, like "\xC2", and the last octet of a UTF-16BE or UTF-16LE string that has an odd number of octets.
If CALLBACK is specified, the code reference is called whenever an illegal code point or a partial octet appears. The first parameter passed to CALLBACK is the unsigned integer value of the code point; if the value is less than 256, it is a partial octet.
The return value from CALLBACK is inserted at that position. You may use chr_<DST_UTF_NAME>() as CALLBACK (see below). The return value from CALLBACK should be in the UTF named by DST_UTF_NAME.
You can call die or croak in CALLBACK if you want to trap an ill-formed source.
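The three ways of handling ill-formed input described above can be sketched as follows. This is a minimal illustration, assuming the module is installed; the sample string and the choice of U+FFFD as a replacement character are assumptions made for the example, not something the module prescribes.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:conv :chr);

# UTF-8 octets: "A", a lone lead octet "\xC2" (a partial octet), then "Z".
my $source = "A\xC2Z";

# 1. No CALLBACK: the partial octet is silently deleted.
my $dropped = utf8_to_utf16be($source);

# 2. CALLBACK returns a replacement, encoded in the destination UTF.
my $replaced = utf8_to_utf16be(sub { chr_utf16be(0xFFFD) }, $source);

# 3. CALLBACK dies, so an ill-formed source can be trapped with eval.
my $strict = eval {
    utf8_to_utf16be(
        sub {
            my ($value) = @_;
            # Values below 256 are partial octets, per the rule above.
            my $kind = $value < 256 ? 'partial octet' : 'illegal code point';
            die sprintf("ill-formed source: %s 0x%X\n", $kind, $value);
        },
        $source,
    );
};
my $error = $@;

print "dropped:  ", unpack('H*', $dropped),  "\n";
print "replaced: ", unpack('H*', $replaced), "\n";
print "rejected: $error" unless defined $strict;
```

Returning a replacement keeps the conversion going, whereas dying aborts it at the first problem; which is appropriate depends on whether the caller can tolerate lossy output.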
Conversion from Code Point to String
(Exporting: use Unicode::Transform qw(:chr);)
Each chr_<UTF_NAME>() function returns the character represented by CODEPOINT as a string in the corresponding Unicode transformation format. CODEPOINT should be an unsigned integer. A string is returned even if CODEPOINT is a surrogate code point [0xD800..0xDFFF].
The maximum value of CODEPOINT is:
0x0010_FFFF for chr_utf16le() and chr_utf16be()
0x7FFF_FFFF for chr_utf8(), chr_utf8mod(), chr_utfcp1047()
0xFFFF_FFFF for chr_utf32le(), chr_utf32be()
Returns undef if CODEPOINT is greater than the maximum value.
chr_unicode(CODEPOINT)
chr_utf16le(CODEPOINT)
chr_utf16be(CODEPOINT)
chr_utf32le(CODEPOINT)
chr_utf32be(CODEPOINT)
chr_utf8(CODEPOINT)
chr_utf8mod(CODEPOINT)
chr_utfcp1047(CODEPOINT)
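As a rough illustration of the chr_ functions, the sketch below encodes one code point in several formats and shows the undef result for a value beyond the UTF-16 maximum. It assumes the module is installed; the code points chosen are arbitrary examples.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:chr);

my $cp = 0x3042;    # HIRAGANA LETTER A

# The same code point as octets in four different formats.
my %octets = (
    utf8    => unpack('H*', chr_utf8($cp)),       # e38182
    utf16be => unpack('H*', chr_utf16be($cp)),    # 3042
    utf16le => unpack('H*', chr_utf16le($cp)),    # 4230
    utf32be => unpack('H*', chr_utf32be($cp)),    # 00003042
);
printf "%-8s %s\n", $_, $octets{$_} for sort keys %octets;

# A supplementary code point becomes a surrogate pair in UTF-16.
my $pair = unpack('H*', chr_utf16be(0x1F600));    # d83dde00
print "U+1F600  $pair\n";

# 0x110000 exceeds the UTF-16 maximum of 0x0010_FFFF, so undef is returned.
my $too_big = chr_utf16be(0x110000);
print "0x110000 cannot be encoded in UTF-16\n" unless defined $too_big;
```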
Numeric Value of the First Character
(Exporting: use Unicode::Transform qw(:ord);)
Each ord_<UTF_NAME>() function returns the unsigned integer value of the first character of STRING. If STRING is empty or begins with a partial octet, undef is returned.
STRING may begin with a surrogate code point [0xD800..0xDFFF] or an out-of-range code point [0x110000 and greater].
ord_unicode(STRING)
ord_utf16le(STRING)
ord_utf16be(STRING)
ord_utf32le(STRING)
ord_utf32be(STRING)
ord_utf8(STRING)
ord_utf8mod(STRING)
ord_utfcp1047(STRING)
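A brief sketch of the ord_ functions follows, covering the normal case and the two undef cases described above. It assumes the module is installed; the input strings are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:ord);

# Only the first character is examined; the trailing "A" is ignored.
my $from_utf8 = ord_utf8("\xE3\x81\x82\x41");    # U+3042 followed by "A"

# The same code point read from UTF-16LE octets.
my $from_utf16le = ord_utf16le("\x42\x30");

# An empty string has no first character.
my $empty = ord_utf8("");

# A single octet cannot form a UTF-16BE code unit, so it is a partial octet.
my $partial = ord_utf16be("\x30");

printf "utf8:    U+%04X\n", $from_utf8;
printf "utf16le: U+%04X\n", $from_utf16le;
print  "empty string gives undef\n"  unless defined $empty;
print  "partial octet gives undef\n" unless defined $partial;
```

Checking the result with defined, rather than for truth, matters here because a valid first character U+0000 yields 0.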
AUTHOR
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
http://homepage1.nifty.com/nomenclator/perl/
Copyright(C) 2002-2003, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.