NAME
Unicode::Transform - conversion among Unicode Transformation Formats (UTFs)
SYNOPSIS
use Unicode::Transform qw(:all);
$unicode_string = utf16be_to_unicode($utf16be_string);
$utf16le_string = unicode_to_utf16le($unicode_string);
$utf8_string = utf32be_to_utf8($utf32be_string);
$utf8_string = utf32be_to_utf8(\&chr_utf8, $utf32be_string);
# illegal code points are allowed.
DESCRIPTION
This module provides functions to convert a string among several Unicode Transformation Formats (UTFs).
Conversion Between UTF
(Exporting: use Unicode::Transform qw(:conv);)
Function names
A function name consists of SRC_UTF_NAME, the string '_to_', and DST_UTF_NAME.
SRC_UTF_NAME (the UTF of the source string) and DST_UTF_NAME (the UTF of the return value) must each be one of the following hyphen-removed, lowercased names:
unicode (for Perl's internal strings; see perlunicode)
utf16le (for UTF-16LE)
utf16be (for UTF-16BE)
utf32le (for UTF-32LE)
utf32be (for UTF-32BE)
utf8 (for UTF-8)
utf8mod (for UTF-8-Mod)
utfcp1047 (for CP-1047-oriented UTF-EBCDIC).
In all, 64 (i.e. 8 times 8) functions are available. Examples include utf16be_to_utf32le() and utf8_to_unicode(). DST_UTF_NAME may be the same as SRC_UTF_NAME, as in utf8_to_utf8().
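The naming scheme above can be enumerated mechanically. The following is a minimal sketch in plain Perl that builds every name of the form SRC_UTF_NAME, '_to_', DST_UTF_NAME from the eight UTF names listed. The variables @utf_names and @function_names are illustrative only and are not part of the module.

```perl
use strict;
use warnings;

# The eight hyphen-removed, lowercased UTF names from the list above.
my @utf_names = qw(unicode utf16le utf16be utf32le utf32be utf8 utf8mod utfcp1047);

# Every (source, destination) pair yields one function name, including
# pairs where source and destination are the same (e.g. utf8_to_utf8).
my @function_names;
for my $src (@utf_names) {
    for my $dst (@utf_names) {
        push @function_names, "${src}_to_${dst}";
    }
}

printf "%d conversion functions\n", scalar @function_names;
print "$_\n" for @function_names[0 .. 2];
```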
Parameters
If the first parameter is a reference, it is taken as CALLBACK, which is used for coping with illegal characters and octets. A reference is never allowed as STRING.
If CALLBACK is given, STRING is the second parameter; otherwise it is the first. STRING is the source string. Currently, only code references are allowed as CALLBACK.
If CALLBACK is omitted, illegal code points and partial octets are deleted, as if a code reference that always returns an empty string, sub {''}, had been used as CALLBACK.
Illegal code points comprise surrogate code points [0xD800..0xDFFF] and out-of-range code points [0x110000 and greater].
Partial octets are octets which do not represent any code point. They include a first octet without its following octets in UTF-8, like "\xC2", and the last octet of a UTF-16BE or UTF-16LE string that has an odd number of octets.
If CALLBACK is specified, each occurrence of an illegal code point or a partial octet calls the code reference. The first parameter passed to CALLBACK is the unsigned integer value of the code point; if the value is less than 256, it is a partial octet.
The return value from CALLBACK is inserted at that position. You may use chr_<DST_UTF_NAME>() as CALLBACK (see below). The return value from CALLBACK should be in the UTF named by DST_UTF_NAME.
You can call die or croak in CALLBACK if you want to trap an ill-formed source.
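As an illustration of the CALLBACK rules above, here is a hedged sketch of two callbacks: one that dies on the first ill-formed element, and one that inserts U+FFFD encoded as UTF-8. The helper classify_illegal() and the variables $strict and $lenient_utf8 are hypothetical names, not part of the module. The call into Unicode::Transform is wrapped in eval so that the sketch still runs where the module is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::Transform): classify the value
# that a conversion function passes to CALLBACK, following the rules above.
sub classify_illegal {
    my ($value) = @_;
    return 'partial octet'           if $value < 0x100;
    return 'surrogate code point'    if $value >= 0xD800 && $value <= 0xDFFF;
    return 'out-of-range code point' if $value >= 0x110000;
    return 'unexpected value';
}

# A strict CALLBACK: abort the conversion on the first ill-formed element.
my $strict = sub {
    my ($value) = @_;
    die sprintf("ill-formed source: %s 0x%X\n", classify_illegal($value), $value);
};

# A lenient CALLBACK for a UTF-8 destination: insert U+FFFD as UTF-8 octets.
my $lenient_utf8 = sub { "\xEF\xBF\xBD" };

# Only exercised when the module is available.
if (eval { require Unicode::Transform; Unicode::Transform->import(':conv'); 1 }) {
    # UTF-16BE octets for 'A', a lone surrogate (0xD800), and 'B'.
    my $source = "\x00\x41\xD8\x00\x00\x42";
    my $utf8   = utf16be_to_utf8($lenient_utf8, $source);
    print unpack('H*', $utf8), "\n";
}
```

Passing $strict instead of $lenient_utf8 turns the same conversion into a validity check: wrap the call in eval and inspect $@ to learn what was wrong with the source.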
Conversion from Code Point to String
(Exporting: use Unicode::Transform qw(:chr);)
Returns the character represented by CODEPOINT as a string in the named Unicode transformation format. CODEPOINT should be an unsigned integer. A string is returned even if CODEPOINT is a surrogate code point [0xD800..0xDFFF].
The maximum value of CODEPOINT is:
0x0010_FFFF for chr_utf16le() and chr_utf16be()
0x7FFF_FFFF for chr_utf8(), chr_utf8mod(), chr_utfcp1047()
0xFFFF_FFFF for chr_utf32le(), chr_utf32be()
Returns undef if CODEPOINT is greater than the maximum value.
chr_unicode(CODEPOINT)
chr_utf16le(CODEPOINT)
chr_utf16be(CODEPOINT)
chr_utf32le(CODEPOINT)
chr_utf32be(CODEPOINT)
chr_utf8(CODEPOINT)
chr_utf8mod(CODEPOINT)
chr_utfcp1047(CODEPOINT)
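To make the documented limits concrete, the following pure-Perl sketch mimics the behaviour described for chr_utf16be(): undef above 0x0010_FFFF, a single 16-bit unit below 0x10000 (surrogates included), and a surrogate pair otherwise. The function sketch_chr_utf16be() is a hypothetical illustration of standard UTF-16 encoding and is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl illustration; NOT the module's implementation.
sub sketch_chr_utf16be {
    my ($cp) = @_;
    return undef if $cp > 0x10FFFF;           # above the UTF-16 maximum
    return pack('n', $cp) if $cp < 0x10000;   # one 16-bit unit, surrogates included
    my $offset = $cp - 0x10000;               # 20-bit value split into two halves
    return pack('n2', 0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

# U+1F600 becomes the surrogate pair D83D DE00.
print uc(unpack('H*', sketch_chr_utf16be(0x1F600))), "\n";   # prints D83DDE00
```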
Numeric Value of the First Character
(Exporting: use Unicode::Transform qw(:ord);)
Returns the unsigned integer value of the first character of STRING. If STRING is empty or begins with a partial octet, returns undef.
STRING may begin with a surrogate code point [0xD800..0xDFFF] or an out-of-range code point [0x110000 and greater].
ord_unicode(STRING)
ord_utf16le(STRING)
ord_utf16be(STRING)
ord_utf32le(STRING)
ord_utf32be(STRING)
ord_utf8(STRING)
ord_utf8mod(STRING)
ord_utfcp1047(STRING)
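As one reading of the rules above, the sketch below mimics what is described for ord_utf32be() in pure Perl: undef for an empty string or one that starts with fewer than four octets, otherwise the big-endian value of the first four octets, with no rejection of surrogates or out-of-range values. The function sketch_ord_utf32be() is hypothetical, assumes STRING is a byte string, and is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl illustration; NOT the module's implementation.
# Assumes $string holds octets (no characters above 0xFF).
sub sketch_ord_utf32be {
    my ($string) = @_;
    return undef if length($string) < 4;   # empty, or only a partial unit
    return unpack('N', $string);           # first four octets, big-endian
}

my $value = sketch_ord_utf32be("\x00\x00\x00\x41\x00\x00\x00\x42");
printf "0x%X\n", $value;
```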
AUTHOR
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
http://homepage1.nifty.com/nomenclator/perl/
Copyright(C) 2002-2003, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.