NAME
Unicode::Transform - conversion among Unicode Transformation Formats (UTFs)
SYNOPSIS
use Unicode::Transform;
$unicode_string = utf16be_to_unicode($utf16be_string);
$utf16le_string = unicode_to_utf16le($unicode_string);
$utf8_string = utf32be_to_utf8($utf32be_string);
DESCRIPTION
This module provides functions that convert a string from one Unicode Transformation Format (UTF) to another.
Conversion Between UTFs
(Exporting: use Unicode::Transform qw(:conv);)
Function names
A function name consists of SRC_UTF_NAME, the string '_to_', and DST_UTF_NAME.
SRC_UTF_NAME (the UTF of the source string) and DST_UTF_NAME (the UTF of the return value) must each be one of the following names, which are the UTF names lowercased with hyphens removed:
unicode (for Perl's internal strings; see perlunicode)
utf16le (for UTF-16LE)
utf16be (for UTF-16BE)
utf32le (for UTF-32LE)
utf32be (for UTF-32BE)
utf8 (for UTF-8)
utf8mod (for UTF-8-Mod)
utfcp1047 (for CP-1047-oriented UTF-EBCDIC).
In all, 64 (i.e. 8 times 8) functions are available, for example utf16be_to_utf32le() and utf8_to_unicode(). DST_UTF_NAME may be the same as SRC_UTF_NAME, as in utf8_to_utf8().
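The naming rule can be checked with a short sketch in plain Perl. It does not load the module; it only builds the names from the list above, and the variable names are illustrative.

```perl
use strict;
use warnings;

# The eight UTF names listed above.
my @utf_names = qw(unicode utf16le utf16be utf32le utf32be utf8 utf8mod utfcp1047);

# Every SRC/DST pair, including SRC equal to DST, gives one function name.
my @function_names;
for my $src (@utf_names) {
    for my $dst (@utf_names) {
        push @function_names, "${src}_to_${dst}";
    }
}

print scalar(@function_names), " names\n";   # prints "64 names"
print "$function_names[0]\n";                # prints "unicode_to_unicode"
```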
Parameters
If the first parameter is a reference, it is taken as CALLBACK, which is used to handle illegal characters and octets. A reference is never allowed as STRING.
If CALLBACK is given, STRING is the second parameter; otherwise it is the first. STRING is the source string. Currently, only code references are allowed as CALLBACK.
If CALLBACK is omitted, illegal code points and partial octets are deleted.
Illegal code points comprise surrogate code points [0xD800..0xDFFF] and out-of-range code points [0x110000 and greater].
Partial octets are octets that do not represent any code point. They include a first octet with no following octets in UTF-8, like "\xC2", and the last octet of a UTF-16BE or UTF-16LE string that has an odd number of octets.
If CALLBACK is specified, the code reference is called each time an illegal code point or a partial octet appears. The first parameter passed to CALLBACK is the unsigned integer value of the code point or octet; a value less than 256 indicates a partial octet.
The return value from CALLBACK is inserted at that position.
(You can call die or croak in CALLBACK if you want to trap an ill-formed source.)
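The two modes described above might be used as in the following sketch. The names $bad_utf8 and $strict are illustrative, and the message format is an assumption, not something the module defines.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:conv);

# "\xC2" is a UTF-8 lead octet with no continuation octet after it.
my $bad_utf8 = "abc\xC2";

# Without CALLBACK, the partial octet is deleted.
my $cleaned = utf8_to_unicode($bad_utf8);

# With CALLBACK, refuse ill-formed input instead of repairing it.
my $strict = sub {
    my ($value) = @_;
    # Values below 256 are partial octets, per the text above.
    my $kind = $value < 256 ? 'partial octet' : 'illegal code point';
    die sprintf("ill-formed source: %s 0x%04X\n", $kind, $value);
};

my $ok = eval { utf8_to_unicode($strict, $bad_utf8); 1 };
print "rejected: $@" unless $ok;
```

Dying inside CALLBACK and catching the error with eval keeps the check in one place; a CALLBACK that returns a replacement string would repair the input instead.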
Conversion from Code Point to String
(Exporting: use Unicode::Transform qw(:chr);)
Returns the character represented by CODEPOINT as a string in the corresponding Unicode transformation format. CODEPOINT can be in the range 0..0x7FFF_FFFF. A string is returned even if CODEPOINT is a surrogate code point [0xD800..0xDFFF].
chr_utf16le() and chr_utf16be() return undef when CODEPOINT is out of range (i.e., 0x110000 or greater).
chr_unicode(CODEPOINT)
chr_utf16le(CODEPOINT)
chr_utf16be(CODEPOINT)
chr_utf32le(CODEPOINT)
chr_utf32be(CODEPOINT)
chr_utf8(CODEPOINT)
chr_utf8mod(CODEPOINT)
chr_utfcp1047(CODEPOINT)
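A minimal sketch of these functions, assuming the module is installed; U+3042 is an arbitrary sample code point and the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:chr);

my $codepoint = 0x3042;   # HIRAGANA LETTER A

my $utf8    = chr_utf8($codepoint);
my $utf16le = chr_utf16le($codepoint);
my $utf32be = chr_utf32be($codepoint);

# Show each result as hex octets.
# For reference, U+3042 is E3 81 82 in UTF-8 and 42 30 in UTF-16LE.
printf "%-8s %s\n", $_->[0], unpack('H*', $_->[1])
    for [utf8 => $utf8], [utf16le => $utf16le], [utf32be => $utf32be];

# Beyond U+10FFFF the UTF-16 functions are documented to return undef.
my $too_big = chr_utf16be(0x110000);
print "chr_utf16be(0x110000) is ", (defined $too_big ? "defined" : "undef"), "\n";
```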
Numeric Value of the First Character
(Exporting: use Unicode::Transform qw(:ord);)
Returns the unsigned integer value of the first character of STRING. If STRING is empty or begins with a partial octet, returns undef.
STRING may begin with a surrogate code point [0xD800..0xDFFF] or an out-of-range code point [0x110000 and greater].
ord_unicode(STRING)
ord_utf16le(STRING)
ord_utf16be(STRING)
ord_utf32le(STRING)
ord_utf32be(STRING)
ord_utf8(STRING)
ord_utf8mod(STRING)
ord_utfcp1047(STRING)
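A minimal sketch of these functions, assuming the module is installed. The octet strings are written by hand and the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:ord);

# U+3042 is E3 81 82 in UTF-8 and 30 42 in UTF-16BE.
my $from_utf8    = ord_utf8("\xE3\x81\x82");
my $from_utf16be = ord_utf16be("\x30\x42");

# Per the text above, an empty string or a leading partial octet gives undef.
my $from_empty   = ord_utf8("");
my $from_partial = ord_utf8("\xC2");

for my $result ($from_utf8, $from_utf16be, $from_empty, $from_partial) {
    print defined $result ? sprintf("U+%04X\n", $result) : "undef\n";
}
```

Checking the result with defined, rather than for truth, matters here because a string that begins with U+0000 yields the valid value 0.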
AUTHOR
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
http://homepage1.nifty.com/nomenclator/perl/
Copyright(C) 2002-2003, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.