NAME
perli18n - Perl i18n (internationalization)
DESCRIPTION
Perl supports language-specific notions of data such as "is this a letter" and "which letter comes first". These are important issues, especially for languages other than English, but also for English: it would be very naïve indeed to think that A-Za-z defines all the "letters".
Perl understands the language-specific data via the standardized (ISO C, XPG4, POSIX 1.c) method called "the locale system". The locale system is controlled per application using one function call and several environment variables.
USING LOCALES
If your operating system supports the locale system, the locale system is installed, and your locale environment variables are set correctly (please see below) before running Perl, Perl will interpret your data according to your locale settings.
At run time you can switch locales using POSIX::setlocale().
# setlocale is the function call
# LC_CTYPE will be explained later
use POSIX qw(setlocale LC_CTYPE);
# query and save the old locale.
$old_locale = setlocale(LC_CTYPE);
setlocale(LC_CTYPE, "fr_CA.ISO8859-1");
# LC_CTYPE now in locale "French, Canada, codeset ISO 8859-1"
setlocale(LC_CTYPE, "");
# LC_CTYPE is now in the locale that LC_ALL / LC_CTYPE / LANG define.
# See below for documentation about LC_ALL / LC_CTYPE / LANG.
# restore the old locale
setlocale(LC_CTYPE, $old_locale);
The first argument of setlocale() is called the category and the second argument the locale. The category tells in what aspect of data processing we want to apply language-specific rules; the locale tells in what language-country/territory-codeset. But read on for the naming of the locales: not all systems name locales as in the example.
For further information about the categories, please consult your setlocale(3) manual. For the locales available in your system, also consult the setlocale(3) manual and see whether it leads you to the list of the available locales (search for the SEE ALSO section). If that fails, try the following commands on the command line:
and see whether they list something resembling these
en_US.ISO8859-1 de_DE.ISO8859-1 ru_RU.ISO8859-5
en_US de_DE ru_RU
en de ru
english german russian
english.iso88591 german.iso88591 russian.iso88595
Sadly, even though the calling interface has been standardized, the names of the locales have not. The naming is usually language_country/territory.codeset, but the latter parts may not be present.
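Because the names vary between systems, one defensive approach is to try several candidate spellings and keep the first one the system accepts. The following is a minimal sketch, assuming that setlocale() returns a false value when it refuses a name; the helper name first_available_locale and the list of candidates are illustrative only and are not part of Perl or POSIX.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: try each candidate name in turn and return the
# first one the operating system accepts, or nothing if none is accepted.
sub first_available_locale {
    my ($category, @candidates) = @_;
    for my $name (@candidates) {
        # setlocale() returns a false value when the name is refused.
        return $name if setlocale($category, $name);
    }
    return;
}

# Remember the locale that was in effect before probing.
my $old_locale = setlocale(LC_CTYPE);

# Candidate spellings of one locale; "C" is the portable fallback.
my $chosen = first_available_locale(
    LC_CTYPE, "fr_CA.ISO8859-1", "fr_CA", "fr", "french", "C");
print "Using locale: $chosen\n";

# Restore whatever was in effect before the probe.
setlocale(LC_CTYPE, $old_locale);
```

Ending the candidate list with "C" means the probe always finds some usable locale.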
Two special locales are worth special mention: "C" and "POSIX". Currently and effectively these are the same locale: the difference is mainly that the first one is defined by the C standard and the second one by the POSIX standard. They define the default locale in which every program starts. The language is (American) English and the character codeset is ASCII. NOTE: Not all systems have the "POSIX" locale (not all systems are POSIX), so use the "C" locale when you need the default locale.
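As a short illustration, a program can put every category back into the default locale by naming "C" explicitly. This is only a sketch; the variable name $result is arbitrary, and the exact string that setlocale() returns may differ between systems.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_ALL);

# Put every category back into the default "C" locale.
# The return value is true on success and false if the request fails.
my $result = setlocale(LC_ALL, "C");
print "Locale is now: $result\n";
```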
The use locale Pragma
By default, Perl ignores the current locale. The use locale pragma tells Perl to use the current locale for some operations: the comparison functions (lt, le, eq, cmp, ne, ge, gt, sort) use LC_COLLATE; regular expressions and case-modification functions (uc, lc, ucfirst, lcfirst) use LC_CTYPE; and formatting functions (printf and sprintf) use LC_NUMERIC. The default behavior returns with no locale or by reaching the end of the enclosing block.
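The block scoping of the pragma can be sketched as follows. The word list and variable names are made up for illustration, and the locale ordering printed depends entirely on the locale in your environment.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE);

# Take the collation locale from the environment.
setlocale(LC_COLLATE, "");

my @words = qw(banana Apple cherry apple);

# Outside any "use locale": the default ordering applies.
my @default_order = sort @words;

my @locale_order;
{
    use locale;                   # locale rules apply from here ...
    @locale_order = sort @words;
}                                 # ... to the end of this block

# The default behavior is back in effect here.
my @again = sort @words;

print "default: @default_order\n";
print "locale:  @locale_order\n";
```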
Note that the result of any operation that uses locale information is tainted, since locales can be created by unprivileged users on some systems (see perlsec.pod).
Category LC_COLLATE: Collation
When in the scope of use locale, Perl obeys the LC_COLLATE environment variable, which controls the application's notion of the collation (ordering) of characters. In most Latin alphabets B follows A, but where do Á and Ä belong?
NOTE: Comparing and sorting by locale is usually slower than the default sorting; factors of 2 to 4 have been observed. It will also consume more memory: while a Perl scalar variable is participating in any string comparison or sorting operation and obeying the locale collation rules, it will take about 3 to 15 times more memory than normal (the exact value depends on the operating system). These downsides are dictated more by the operating system's implementation of the locale system than by Perl.
Here is a code snippet that will tell you what are the alphanumeric characters in the current locale, in the locale order:
use POSIX qw(setlocale LC_COLLATE);
use locale;
setlocale(LC_COLLATE, "");
print +(sort grep /\w/, map { chr() } 0..255), "\n";
The default collation must be used, for example, for sorting raw binary data, whereas the locale collation is useful for natural text.
NOTE: In some locales some characters may have no collation value at all. This means, for example, that if '-' is such a character, relocate and re-locate may sort to the same place.
NOTE: In certain environments the operating system's locale support is simply broken and cannot be used or fixed by Perl. Such deficiencies can and will result in mysterious hangs and/or Perl core dumps. One example is IRIX before release 6.2, where the LC_COLLATE support simply does not work. When confronted with such a system, please report it in excruciating detail to perlbug@perl.com and complain to your vendor: bug fixes for these problems may exist for your operating system. Sometimes such bug fixes are called an operating system upgrade.
NOTE: In Perl releases before 5.003_06, per-locale collation was possible using the I18N::Collate library module. This is now mildly obsolete and to be avoided. The LC_COLLATE functionality is integrated into the Perl core language and one can use scalar data completely normally: there is no need to juggle with the scalar references of I18N::Collate.
Category LC_CTYPE: Character Types
When in the scope of use locale, Perl obeys the LC_CTYPE locale information, which controls the application's notion of which characters are alphabetic. This affects the regular expression metanotation \w, which stands for alphanumeric characters, that is, alphabetic and numeric characters (please consult perlre for more information about regular expressions). Thanks to LC_CTYPE, depending on your locale settings, characters like Æ, É, ß, and ø may be understood as \w characters.
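One way to see the effect of LC_CTYPE is to count how many single-byte characters match \w with and without the pragma. This is a sketch only: the variable names are arbitrary, and the count obtained under use locale depends on the locale taken from your environment.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Take the character type locale from the environment.
setlocale(LC_CTYPE, "");

# All single-byte characters.
my @chars = map { chr } 0 .. 255;

# Default rules: count the characters that match \w.
my $default_count = grep { /\w/ } @chars;

# Locale rules: the same count inside the scope of "use locale".
my $locale_count;
{
    use locale;
    $locale_count = grep { /\w/ } @chars;
}

print "default: $default_count, locale: $locale_count\n";
```

In a locale with an ISO 8859 codeset the second count is typically larger than the first, because the accented letters are then classified as alphabetic.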
Category LC_NUMERIC: Numeric Formatting
When in the scope of use locale, Perl obeys the LC_NUMERIC locale information, which controls the application's notion of how numbers should be formatted for input and output. This affects the printf and sprintf functions, as well as POSIX::strtod.
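The following sketch formats a number inside the scope of use locale and then asks POSIX::localeconv() which decimal point the current locale uses. It sets LC_NUMERIC to "C" so that the result is predictable; replacing "C" with "" (an assumption about how you would adapt it) takes the setting from the environment instead.

```perl
use strict;
use warnings;
use POSIX qw(setlocale localeconv LC_NUMERIC);

# The "C" locale uses "." as the decimal point.
setlocale(LC_NUMERIC, "C");

my $formatted;
{
    use locale;
    # sprintf obeys LC_NUMERIC inside this block.
    $formatted = sprintf("%.2f", 1234.5);
}
print "$formatted\n";

# localeconv() reports the numeric conventions of the current locale.
my $conv = localeconv();
print "decimal point: $conv->{decimal_point}\n";
```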
ENVIRONMENT
- PERL_BADLANG
A string that controls whether Perl warns at startup about failed locale settings. This can happen if the locale support in the operating system is lacking (broken) in some way. If this string has an integer value differing from zero, Perl will not complain.
NOTE: This is just hiding the warning message. The message tells about some problem in your system's locale support and you should investigate what the problem is.
The following environment variables are not specific to Perl: They are part of the standardized (ISO C, XPG4, POSIX 1.c) setlocale method to control an application's opinion on data.
- LC_ALL
LC_ALL is the "override-all" locale environment variable. If it is set, it overrides all the rest of the locale environment variables.
- LC_CTYPE
In the absence of LC_ALL, LC_CTYPE chooses the character type locale. In the absence of both LC_ALL and LC_CTYPE, LANG chooses the character type locale.
- LC_COLLATE
In the absence of LC_ALL, LC_COLLATE chooses the collation locale. In the absence of both LC_ALL and LC_COLLATE, LANG chooses the collation locale.
- LC_NUMERIC
In the absence of LC_ALL, LC_NUMERIC chooses the numeric format locale. In the absence of both LC_ALL and LC_NUMERIC, LANG chooses the numeric format.
- LANG
LANG is the "catch-all" locale environment variable. If it is set, it is used as the last resort after the overall LC_ALL and the category-specific LC_....
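The precedence among these variables can be sketched as a small lookup. The helper name effective_locale is hypothetical, and two choices in it are assumptions rather than statements from this document: an empty value is treated as "not set", and "C" is returned when nothing is set. The real decision is made by the operating system's setlocale(), not by Perl code like this.

```perl
use strict;
use warnings;

# Hypothetical helper: given a category name such as "LC_CTYPE" and a
# hash of environment variables, return the locale name that wins.
sub effective_locale {
    my ($category, %env) = @_;
    # Check LC_ALL first, then the category variable, then LANG.
    for my $var ("LC_ALL", $category, "LANG") {
        my $value = $env{$var};
        # Assumption: an empty value counts as "not set".
        return $value if defined $value && length $value;
    }
    return "C";    # assumption: nothing set means the default locale
}

print effective_locale("LC_CTYPE", %ENV), "\n";
```

For example, with LANG set to en_US and LC_COLLATE set to de_DE, the helper returns de_DE for LC_COLLATE and en_US for LC_CTYPE.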
There are further locale-controlling environment variables (LC_MESSAGES, LC_MONETARY, LC_TIME) but Perl does not currently use them, except possibly as they affect the behavior of library functions called by Perl extensions.