NAME
I18N::Charset - IANA Character Set Registry names and Unicode::Map8 conversion scheme names
SYNOPSIS
use I18N::Charset;
$sCharset = iana_charset_name('WinCyrillic'); # $sCharset gets 'windows-1251'
$sCharset = map8_charset_name('windows-1251'); # $sCharset gets 'cp1251'
I18N::Charset::add_iana_alias('my-japanese' => 'iso-2022-jp');
I18N::Charset::add_map8_alias('my-arabic' => 'arabic7');
DESCRIPTION
The I18N::Charset module provides access to the IANA Character Set Registry names for identifying character encoding schemes. It also provides a mapping to the character set names used by the Unicode::Map8 module.
So, for example, if you get an HTML document with a META CHARSET="..." tag, you can quickly determine what Unicode::Map8 conversion to use on it.
If you don't have the module Unicode::Map8 installed, the map8_ functions will always return undef.
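As a rough illustration of the META tag scenario above, the following sketch pulls a charset label out of an HTML fragment and looks it up. The helper charset_from_meta() and its regular expression are hypothetical and not part of I18N::Charset; a real application would more likely use a proper HTML parser.

```perl
use strict;
use warnings;
use I18N::Charset;

# Hypothetical helper (not part of I18N::Charset): extract the charset
# label from the first META tag that carries one.
sub charset_from_meta {
    my ($html) = @_;
    return $1 if $html =~ /<meta[^>]*\bcharset\s*=\s*["']?([^"'\s;>\/]+)/i;
    return;
}

my $html  = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">';
my $label = charset_from_meta($html);

if (defined $label) {
    my $iana = iana_charset_name($label);
    my $map8 = map8_charset_name($label);   # undef if Unicode::Map8 is missing
    print "label: $label\n";
    print "IANA:  ", (defined $iana ? $iana : '(not recognized)'), "\n";
    print "Map8:  ", (defined $map8 ? $map8 : '(no mapping available)'), "\n";
}
```

Both lookups are checked with defined() because either one may return undef.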
CONVERSION ROUTINES
There are two conversion routines: iana_charset_name() and map8_charset_name().
- iana_charset_name()
-
This function takes a string containing the name of a character set and returns a string which contains the official IANA name of the character set identified. If no valid character set name can be identified, then undef will be returned. The case and punctuation within the string are not important.
$sCharset = iana_charset_name('WinCyrillic');
- map8_charset_name()
-
This function takes a string containing the name of a character set and returns a string which contains a name for the character set that can be used to designate a Unicode::Map8 character code map. If no valid character set name can be identified, then undef will be returned. The case and punctuation within the argument string are not important.
$sCharset = map8_charset_name('windows-1251');
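The following sketch runs a few spellings of the same label through both conversion routines and shows one way to handle the undef result. The label 'no-such-charset-xyz' is made up for illustration and is assumed not to be registered; show() is a hypothetical formatting helper.

```perl
use strict;
use warnings;
use I18N::Charset;

# Hypothetical helper: make undef visible when printing.
sub show {
    my ($value) = @_;
    return defined $value ? $value : '(undef)';
}

# Differing case and punctuation, plus one label assumed to be unknown.
my @labels = ('WinCyrillic', 'windows-1251', 'WINDOWS-1251',
              'windows_1251', 'no-such-charset-xyz');

for my $label (@labels) {
    my $iana = iana_charset_name($label);
    my $map8 = map8_charset_name($label);
    printf "%-20s IANA: %-14s Map8: %s\n", $label, show($iana), show($map8);
}
```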
QUERY ROUTINES
There is one function which can be used to obtain a list of all IANA-registered character set names.
- all_iana_charset_names()
-
Returns a list of all registered IANA character set names. The names are not in any particular order.
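Since the names come back unordered, a caller will usually want to sort or filter them. A minimal sketch follows; it calls the function by its fully qualified name so that it does not depend on whether the function is exported by default, and the ISO-8859 filter pattern is only an example.

```perl
use strict;
use warnings;
use I18N::Charset;

# Sort case-insensitively, since the list has no particular order.
my @names = sort { lc $a cmp lc $b } I18N::Charset::all_iana_charset_names();
printf "%d registered character set names\n", scalar @names;

# Example filter: keep only names that start with "ISO-8859-".
my @iso8859 = grep { /^ISO-8859-/i } @names;
print "$_\n" for @iso8859;
```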
CHARACTER SET NAME ALIASING
This module supports two semi-private routines for specifying character set name aliases. In order to avoid namespace corruption, they are not exported.
- add_iana_alias()
-
This function takes two strings, an alias and a target IANA Character Set Name (or another alias). It defines the alias to refer to that character set name (or to the character set name that the second alias refers to).
Returns the target character set name of the successfully installed alias. Returns undef if the target character set name is not registered, or if the second argument is an alias whose target character set name is not registered.
I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');
With this code, "my-alias1" becomes an alias for the existing IANA character set name 'Shift_JIS'.
I18N::Charset::add_iana_alias('my-alias2' => 'sjis');
With this code, "my-alias2" becomes an alias for the IANA character set name referred to by the existing alias 'sjis' (which happens to be 'Shift_JIS').
- add_map8_alias()
-
This function takes two strings, a new alias and a target Unicode::Map8 Character Set Name (or an existing alias to a Map8 name). It defines the new alias to refer to that mapping name (or to the mapping name that the second alias refers to).
If the first argument is a registered IANA character set name, then all aliases of that character set name will end up pointing to the target Map8 mapping name.
Returns the target mapping name of the successfully installed alias. Returns undef if the target mapping name is not registered, or if the second argument is an alias whose target mapping name is not registered.
I18N::Charset::add_map8_alias('normal' => 'ANSI_X3.4-1968');
With this code, "normal" becomes an alias for the existing Unicode::Map8 mapping name 'ANSI_X3.4-1968'.
I18N::Charset::add_map8_alias('normal' => 'US-ASCII');
With this code, "normal" becomes an alias for the existing Unicode::Map8 mapping name 'ANSI_X3.4-1968' (which is what "US-ASCII" is an alias for).
I18N::Charset::add_map8_alias('IBM297' => 'EBCDIC-CA-FR');
With this code, "IBM297" becomes an alias for the existing Unicode::Map8 mapping name 'EBCDIC-CA-FR'. As a side effect, all the aliases for 'IBM297' (by default, 'cp297' and 'ebcdic-cp-fr') also become aliases for 'EBCDIC-CA-FR'.
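Because both aliasing routines signal failure by returning undef, it is worth checking the return value before relying on a new alias. A minimal sketch, assuming 'no-such-charset-xyz' is not a registered name (it is made up for illustration):

```perl
use strict;
use warnings;
use I18N::Charset;

# Install an alias and confirm that lookups through it now succeed.
my $target = I18N::Charset::add_iana_alias('my-alias1' => 'Shift_JIS');
if (defined $target) {
    print "my-alias1 resolves to ", iana_charset_name('my-alias1'), "\n";
}

# An unregistered target should be rejected with undef.
my $bad = I18N::Charset::add_iana_alias('my-alias3' => 'no-such-charset-xyz');
print "my-alias3 was not installed\n" unless defined $bad;

# Map8 aliases work the same way, but need Unicode::Map8 to be installed.
my $map8 = I18N::Charset::add_map8_alias('my-arabic' => 'arabic7');
print defined $map8
    ? "my-arabic maps to $map8\n"
    : "my-arabic was not installed (is Unicode::Map8 available?)\n";
```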
EXAMPLES
KNOWN BUGS AND LIMITATIONS
There could probably be many more aliases added (for convenience) to all the IANA names. If you have some specific recommendations, please email the author!
There are many character set names which do not have a corresponding mapping in the Unicode::Map8 module (or at least I have not been able to figure out which mapping they correspond to). For the most part, these are obscure encodings.
In the current implementation, all data is read when the module is loaded, and then held in memory. A lazy implementation would be more memory-friendly.
SEE ALSO
- Unicode::Map8
-
Convert strings from various character encodings to Unicode.
- Locale::Country
-
ISO two letter codes for identification of country (ISO 3166).
- Locale::Language
-
ISO two letter codes for identification of language (ISO 639). (Those codes are used in the Content-Language header in HTTP.)
AUTHOR
Martin Thurn <MartinThurn@iname.com>
COPYRIGHT
Copyright (c) 1998 TASC, Inc.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.