NAME

I18N::Charset - IANA Character Set Registry names and Unicode::Map8 conversion scheme names

SYNOPSIS

use I18N::Charset;

$sCharset = iana_charset_name('WinCyrillic');
# $sCharset is now 'windows-1251'
$sCharset = map8_charset_name('windows-1251');
# $sCharset is now 'cp1251' which can be passed to Unicode::Map8->new()
$sCharset = umap_charset_name('Adobe DingBats');
# $sCharset is now 'ADOBE-DINGBATS' which can be passed to Unicode::Map->new()

I18N::Charset::add_iana_alias('my-japanese' => 'iso-2022-jp');
I18N::Charset::add_map8_alias('my-arabic' => 'arabic7');
I18N::Charset::add_umap_alias('my-hebrew' => 'ISO-8859-8');

DESCRIPTION

The I18N::Charset module provides access to the IANA Character Set Registry names for identifying character encoding schemes. It also provides a mapping to the character set names used by the Unicode::Map8 and Unicode::Map modules.

So, for example, if you get an HTML document with a META CHARSET="..." tag, you can fairly quickly determine what Unicode:: module can be used to convert it to Unicode.

If you don't have the module Unicode::Map8 installed, the map8_ functions will always return undef; similarly for Unicode::Map and the umap_ functions.
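
As a rough illustration of the workflow described above, the following sketch takes a character set label (such as one read from a META tag) and reports which conversion module has a mapping for it. The helper name pick_converter is hypothetical and not part of I18N::Charset; the checks treat both undef and an empty string as "no mapping".

```perl
use strict;
use warnings;
use I18N::Charset;

# Hypothetical helper (not part of I18N::Charset): given a charset label,
# return the converter module that has a mapping, plus the mapping name.
sub pick_converter {
    my ($label) = @_;

    my $map8 = map8_charset_name($label);
    return ('Unicode::Map8', $map8) if defined $map8 && length $map8;

    my $umap = umap_charset_name($label);
    return ('Unicode::Map', $umap) if defined $umap && length $umap;

    return;    # empty list: no known converter for this label
}

my ($module, $name) = pick_converter('windows-1251');
if (defined $module) {
    print "Use $module with mapping name '$name'\n";
}
else {
    print "No Unicode::Map8 or Unicode::Map mapping found\n";
}
```

Which branch runs depends on which of the optional modules are installed, so no particular output is implied here.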

CONVERSION ROUTINES

There are three conversion routines: iana_charset_name(), map8_charset_name(), and umap_charset_name().

iana_charset_name()

This function takes a string containing the name of a character set and returns a string which contains the official IANA name of the character set identified. If no valid character set name can be identified, then undef will be returned. The case and punctuation within the string are not important.

$sCharset = iana_charset_name('WinCyrillic');

map8_charset_name()

This function takes a string containing the name of a character set (in almost any format) and returns a string which contains a name for the character set that can be used to designate a Unicode::Map8 character code map. If no valid character set name can be identified, then undef will be returned. The case and punctuation within the argument string are not important.

$sCharset = map8_charset_name('windows-1251');

umap_charset_name()

This function takes a string containing the name of a character set (in almost any format) and returns a string which contains a name for the character set that can be used to designate a Unicode::Map character code mapping. If no valid character set name can be identified, then undef will be returned. The case and punctuation within the argument string are not important.

$sCharset = umap_charset_name('hebrew');
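
Because each routine returns undef when it cannot identify a name, callers will usually want to check the result before using it. The sketch below loops over a few labels and prints what iana_charset_name() makes of each; the label 'no-such-charset' is an invented example that is assumed not to be registered.

```perl
use strict;
use warnings;
use I18N::Charset;

# Two spellings of the same alias (case should not matter, per the text
# above) and one label assumed to be unknown.
for my $label ('WinCyrillic', 'WINCYRILLIC', 'no-such-charset') {
    my $iana = iana_charset_name($label);
    printf "%-16s => %s\n", $label,
        defined $iana ? $iana : '(not recognized)';
}
```

The same defined-check pattern applies to map8_charset_name() and umap_charset_name().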

QUERY ROUTINES

There is one function which can be used to obtain a list of all IANA-registered character set names.

all_iana_charset_names()

Returns a list of all registered IANA character set names. The names are not in any particular order.
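
Since the list comes back unordered, a caller that wants stable output has to sort it. A minimal sketch, calling the function by its full package name so that it does not depend on what the module exports by default:

```perl
use strict;
use warnings;
use I18N::Charset;

# Sort case-insensitively for stable, readable output.
my @names = sort { lc $a cmp lc $b } I18N::Charset::all_iana_charset_names();

printf "%d registered names\n", scalar @names;

# Show at most the first five names.
my $last = $#names < 4 ? $#names : 4;
print "$_\n" for @names[0 .. $last];
```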

CHARACTER SET NAME ALIASING

This module supports three semi-private routines for specifying character set name aliases.

add_iana_alias()

This function takes two strings: a new alias, and a target IANA Character Set Name (or another alias). It defines the new alias to refer to that character set name (or to the character set name to which the second alias refers).

Returns the target character set name of the successfully installed alias. Returns undef if the target character set name (or the character set name to which the second alias refers) is not registered.

I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');

With this code, "my-alias1" becomes an alias for the existing IANA character set name 'Shift_JIS'.

I18N::Charset::add_iana_alias('my-alias2' => 'sjis');

With this code, "my-alias2" becomes an alias for the IANA character set name referred to by the existing alias 'sjis' (which happens to be 'Shift_JIS').
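
The return value can be used to confirm that an alias was actually installed. The sketch below checks it and then resolves the new alias through iana_charset_name(); the names 'my-alias3' and 'no-such-charset' are invented for illustration, and the second is assumed not to be registered.

```perl
use strict;
use warnings;
use I18N::Charset;

my $target = I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');
if (defined $target) {
    # The new alias should now resolve through the normal lookup routine.
    my $resolved = iana_charset_name('my-alias1');
    print "my-alias1 -> ",
        (defined $resolved ? $resolved : '(unresolved)'), "\n";
}
else {
    warn "Alias not installed: target is not a registered name\n";
}

# An unregistered target is documented to yield undef.
my $bad = I18N::Charset::add_iana_alias('my-alias3' => 'no-such-charset');
print "Second call returned ", (defined $bad ? "'$bad'" : 'undef'), "\n";
```

The module may also emit a warning for the unregistered target; that behavior is not described above, so the sketch does not rely on it.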

add_map8_alias()

This function takes two strings: a new alias, and a target Unicode::Map8 Character Set Name (or an existing alias to a Map8 name). It defines the new alias to refer to that mapping name (or to the mapping name to which the second alias refers).

If the first argument is a registered IANA character set name, then all aliases of that IANA character set name will end up pointing to the target Map8 mapping name.

Returns the target mapping name of the successfully installed alias. Returns undef if the target mapping name (or the mapping name to which the second alias refers) is not registered.

I18N::Charset::add_map8_alias('normal' => 'ANSI_X3.4-1968');

With the above statement, "normal" becomes an alias for the existing Unicode::Map8 mapping name 'ANSI_X3.4-1968'.

I18N::Charset::add_map8_alias('normal' => 'US-ASCII');

With the above statement, "normal" becomes an alias for the existing Unicode::Map8 mapping name 'ANSI_X3.4-1968' (which is what "US-ASCII" is an alias for).

I18N::Charset::add_map8_alias('IBM297' => 'EBCDIC-CA-FR');

With the above statement, "IBM297" becomes an alias for the existing Unicode::Map8 mapping name 'EBCDIC-CA-FR'. As a side effect, all the aliases for 'IBM297' (i.e. 'cp297' and 'ebcdic-cp-fr') also become aliases for 'EBCDIC-CA-FR'.
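
Because the map8_ functions return undef when Unicode::Map8 is not installed, it can be useful to guard alias setup with a check for that module. The following is only a sketch of one way to do that, reusing the 'my-arabic' alias from the SYNOPSIS; the eval/require guard is an assumption of this example, not something the module requires.

```perl
use strict;
use warnings;
use I18N::Charset;

# Unicode::Map8 is optional; check for it before relying on map8_ results.
if (eval { require Unicode::Map8; 1 }) {
    my $target = I18N::Charset::add_map8_alias('my-arabic' => 'arabic7');
    if (defined $target) {
        my $name = map8_charset_name('my-arabic');
        print "my-arabic -> ",
            (defined $name ? $name : '(unresolved)'), "\n";
    }
    else {
        warn "Alias not installed: 'arabic7' is not a known Map8 name\n";
    }
}
else {
    print "Unicode::Map8 is not installed; skipping Map8 aliases\n";
}
```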

add_umap_alias()

This function works identically to add_map8_alias() above, but operates on Unicode::Map encoding tables.

EXAMPLES

KNOWN BUGS AND LIMITATIONS

  • There could probably be many more aliases added (for convenience) to all the IANA names. If you have some specific recommendations, please email the author!

  • The only character set names which have a corresponding mapping in the Unicode::Map8 module are the character sets that Unicode::Map8 can convert.

    Similarly, the only character set names which have a corresponding mapping in the Unicode::Map module are the character sets that Unicode::Map can convert.

  • In the current implementation, all tables are read in and initialized when the module is loaded, and then held in memory until the program exits. A "lazy" implementation (or a less-portable tied hash) might lead to a shorter startup time. Suggestions, patches, comments are always welcome!
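
To make the "lazy" idea in the last item concrete, here is a small self-contained sketch of deferred table construction. It does not use or reflect the module's real tables or internals: the two-entry table, the _build_table and lookup names, and the $build_count counter are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an expensive table build.
my $table;              # stays undef until the first lookup
my $build_count = 0;    # counts how often the table is built

sub _build_table {
    $build_count++;
    return {
        'wincyrillic' => 'windows-1251',
        'sjis'        => 'Shift_JIS',
    };
}

sub lookup {
    my ($name) = @_;
    $table ||= _build_table();    # built once, on first use
    (my $key = lc $name) =~ s/[^a-z0-9]//g;    # ignore case and punctuation
    return $table->{$key};
}

print lookup('WinCyrillic'), "\n";
print lookup('S-JIS'), "\n";
print "table built $build_count time(s)\n";
```

With this arrangement a program that loads the code but never performs a lookup pays no table-building cost, at the price of a slower first lookup.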

SEE ALSO

Unicode::Map

Convert strings from various multi-byte character encodings to and from Unicode.

Unicode::Map8

Convert strings from various 8-bit character encodings to and from Unicode.

Jcode

Convert strings among various Japanese character encodings and Unicode.

Unicode::MapUTF8

A wrapper around all three of these character set conversion distributions.

AUTHOR

Martin Thurn <MartinThurn@iname.com>

COPYRIGHT

Copyright (c) 1998-2001 TASC, Inc.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
