NAME
Unicode::Transform - conversion among Unicode Transformation Formats
SYNOPSIS
use Unicode::Transform ':all';
$unicode_string = utf16be_to_unicode($utf16be_string);
$utf16le_string = unicode_to_utf16le($unicode_string);
$utf8_string = utf32be_to_utf8($utf32be_string);
$utf8_string = utf32be_to_utf8(\&chr_utf8, $utf32be_string);
# ill-formed octet sequences are allowed.
DESCRIPTION
This module provides functions that convert a string from one Unicode Transformation Format (UTF) to another.
Conversion Between UTF
(Exporting: use Unicode::Transform ':conv';)
Returns a string in DST_UTF_NAME corresponding to STRING in SRC_UTF_NAME.
Function names
A function name consists of SRC_UTF_NAME, the string '_to_', and DST_UTF_NAME. Each of SRC_UTF_NAME and DST_UTF_NAME must be one of the following names, which are lowercased and have hyphens removed:
unicode (for Perl internal Unicode encoding; see perlunicode)
utf16le (for UTF-16LE)
utf16be (for UTF-16BE)
utf32le (for UTF-32LE)
utf32be (for UTF-32BE)
utf8 (for UTF-8)
utf8mod (for UTF-8-Mod)
utfcp1047 (for CP1047-oriented UTF-EBCDIC).
In all, 64 (i.e. 8 times 8) functions are available. Available function names include utf16be_to_utf32le() and utf8_to_unicode(). DST_UTF_NAME may be the same as SRC_UTF_NAME, as in utf8_to_utf8().
Conversions where both SRC_UTF_NAME and DST_UTF_NAME begin with 'utf' are well defined and stable. In contrast to these UTFs, the Perl internal Unicode encoding is influenced by platform-dependent features (e.g. 32-bit/64-bit, ASCII/EBCDIC).
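The naming rule can be checked with a short sketch that builds every SRC_UTF_NAME/DST_UTF_NAME combination from the eight names listed above. It uses core Perl only and does not load the module; the variable names are illustrative.

```perl
use strict;
use warnings;

# The eight UTF names listed above (lowercased, hyphens removed).
my @utf_names = qw(unicode utf16le utf16be utf32le utf32be utf8 utf8mod utfcp1047);

# Build every SRC_to_DST combination, including those where SRC equals DST.
my @function_names;
for my $src (@utf_names) {
    for my $dst (@utf_names) {
        push @function_names, "${src}_to_${dst}";
    }
}

printf "%d conversion function names\n", scalar @function_names;
print "$_\n" for @function_names[0 .. 2];
```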
Parameters
If the first parameter is a reference, it is regarded as the CALLBACK. A reference is never allowed as STRING. If CALLBACK is given, the second parameter is STRING; otherwise the first is. Currently, only code references are allowed as CALLBACK.
If CALLBACK is omitted, only Unicode scalar values (0x0000..0xD7FF and 0xE000..0x10FFFF) are allowed. Ill-formed octet sequences (corresponding to a code point outside the range of Unicode scalar values) and partial octets (which do not correspond to any code point) are deleted, as if a code reference that always returns an empty string, sub {''}, were used as CALLBACK.
Examples of partial octets: the first octet without following octets in UTF-8, like "\xC2"; the last octet in UTF-16BE or UTF-16LE input with an odd number of octets.
If CALLBACK is specified, each occurrence of an ill-formed octet sequence or a partial octet calls the code reference. The first parameter passed to CALLBACK is the unsigned integer value of the code point; if the value is less than 256, it is a partial octet.
The return value from CALLBACK is inserted at that position. You may use chr_<DST_UTF_NAME>() as CALLBACK (see below). The return value from CALLBACK should be in the UTF of DST_UTF_NAME.
You can call die or croak in CALLBACK to stop the operation when STRING as a whole is not well-formed.
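The three ways of handling a partial octet described above can be sketched as follows. The input string and variable names are illustrative, and the choice of U+FFFD as a substitute is an assumption of this example, not something the module requires.

```perl
use strict;
use warnings;
use Unicode::Transform qw(:conv :chr);

# "\xC2" at the end is a UTF-8 lead octet with no following octet,
# i.e. a partial octet.
my $input = "AB\xC2";

# 1. No CALLBACK: the partial octet is deleted.
my $dropped = utf8_to_utf16be($input);

# 2. CALLBACK returning a substitute in the destination UTF (UTF-16BE here).
my $replaced = utf8_to_utf16be(sub { chr_utf16be(0xFFFD) }, $input);

# 3. CALLBACK that aborts the conversion; $_[0] is the integer value
#    (less than 256 for a partial octet).
my $ok = eval {
    utf8_to_utf16be(sub { die sprintf("ill-formed input: 0x%X\n", $_[0]) }, $input);
    1;
};
print "conversion aborted: $@" unless $ok;

printf "dropped:  %d octets\n", length $dropped;
printf "replaced: %d octets\n", length $replaced;
```

Returning the substitute through chr_utf16be() keeps the CALLBACK result in the same UTF as DST_UTF_NAME, which is what the text above asks for.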
Conversion from Code Point to String
(Exporting: use Unicode::Transform ':chr';)
Returns a string in DST_UTF_NAME corresponding to CODEPOINT. CODEPOINT should be an unsigned integer. If CODEPOINT is outside the range of Unicode scalar values, a corresponding ill-formed octet sequence will be returned.
If CODEPOINT is greater than the maximum value, returns undef. The maximum value of CODEPOINT is:
0x0010_FFFF for chr_utf16le() and chr_utf16be()
0x7FFF_FFFF for chr_utf8(), chr_utf8mod(), chr_utfcp1047()
0xFFFF_FFFF for chr_utf32le(), chr_utf32be()
The maximum value of CODEPOINT for chr_unicode() depends on platform features (e.g. 32-bit/64-bit, ASCII/EBCDIC).
Function names
The full list of functions provided:
chr_unicode(CODEPOINT)
chr_utf16le(CODEPOINT)
chr_utf16be(CODEPOINT)
chr_utf32le(CODEPOINT)
chr_utf32be(CODEPOINT)
chr_utf8(CODEPOINT)
chr_utf8mod(CODEPOINT)
chr_utfcp1047(CODEPOINT)
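A brief sketch of the chr_*() functions, including the undef result for a CODEPOINT above the UTF-16 maximum. The helper hex_octets() is hypothetical and exists only to display the result; it is not part of the module.

```perl
use strict;
use warnings;
use Unicode::Transform ':chr';

# Hypothetical display helper: render each octet as two hex digits.
sub hex_octets { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $utf8    = chr_utf8(0x3042);       # U+3042 HIRAGANA LETTER A
my $utf16be = chr_utf16be(0x1F600);   # above U+FFFF, so a surrogate pair
my $utf32le = chr_utf32le(0x3042);

print "UTF-8:    ", hex_octets($utf8),    "\n";
print "UTF-16BE: ", hex_octets($utf16be), "\n";
print "UTF-32LE: ", hex_octets($utf32le), "\n";

# 0x110000 exceeds the maximum of 0x0010_FFFF for chr_utf16le().
my $too_big = chr_utf16le(0x110000);
print "chr_utf16le(0x110000) is ", (defined $too_big ? "defined" : "undef"), "\n";
```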
Numeric Value of the First Character
(Exporting: use Unicode::Transform ':ord';)
Returns the unsigned integer value of the first character of STRING in SRC_UTF_NAME. STRING may begin with an ill-formed octet sequence corresponding to a surrogate code point (0xD800..0xDFFF) or an out-of-range code point (0x110000 and greater). If STRING is empty or begins with a partial octet, returns undef.
Function names
The full list of functions provided:
ord_unicode(STRING)
ord_utf16le(STRING)
ord_utf16be(STRING)
ord_utf32le(STRING)
ord_utf32be(STRING)
ord_utf8(STRING)
ord_utf8mod(STRING)
ord_utfcp1047(STRING)
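A brief sketch of the ord_*() functions, covering a well-formed string, a surrogate code point, and the two undef cases described above. The sample strings are illustrative.

```perl
use strict;
use warnings;
use Unicode::Transform ':ord';

# Only the first character is examined; the trailing "abc" is ignored.
my $first = ord_utf8("\xE3\x81\x82abc");
printf "first code point: U+%04X\n", $first;

# Two octets encoding a surrogate code point in UTF-16BE; per the text
# above, a surrogate is still reported as an integer.
my $surrogate = ord_utf16be("\xD8\x00");
printf "surrogate: 0x%04X\n", $surrogate if defined $surrogate;

# An empty string and a string starting with a partial octet give undef.
my $empty   = ord_utf8("");
my $partial = ord_utf8("\xC2");
print "empty:   ", (defined $empty   ? "defined" : "undef"), "\n";
print "partial: ", (defined $partial ? "defined" : "undef"), "\n";
```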
AUTHOR
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Copyright(C) 2002-2005, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.