NAME
Locale::MakePhrase - Language translation facility
SYNOPSIS
This group of modules is used to translate application text strings, which may or may not include values which also need to be translated, into the preferred language of the end-user.
Example:
use Locale::MakePhrase::BackingStore::Directory;
use Locale::MakePhrase;
my $bs = new Locale::MakePhrase::BackingStore::Directory(
directory => '/some/path/to/language/files',
);
my $mp = new Locale::MakePhrase(
language => 'en_AU',
backing_store => $bs,
);
...
my $color_count = 1;
print $mp->translate("Please select [_1] colors.",$color_count);
Output:
Please select a colour.
Notice that a) the word 'color' has been localised to Australian English, and b) that the argument has influenced the resultant output text to take into account the display of the singular version.
DESCRIPTION
The aim of these modules is to implement run-time evaluation of an input phrase, including program arguments, and have it generate a suitable output phrase, in the language and encoding specified by the user of the application.
Since this problem has been around for some time, there are a number of sources of useful information available on the web, which describe why this problem is hard to solve. The problem with most existing solutions is that each design suffers some form of limitation, often due to the designer thinking that there are enough commonalities between all/some languages that these commonalities can be factored into various rules which can be implemented in programming code.
However, each language has its own history and evolution. Thus it is pointless to compare two different languages unless they have a common history and a common character set.
Before continuing to read this document, you really should read the following info on the Locale::Maketext Perl module:
http://search.cpan.org/~sburke/Locale-Maketext-1.08/lib/Locale/Maketext.pod
and look at the slides presented here:
http://www.autrijus.org/webl10n/
The Locale::MakePhrase modules are based on a design similar to the Locale::Maketext module, except that this new implementation has taken a different approach, that being...
Since it is possible (and quite likely) that the application will need to be able to understand the language rules of any specific language, we want to use a run-time evaluation of the rules that a linguist would use to convert one language to another. Thus we have coined the term linguistic rules as a means to describe this technique. These rules are used to decide which piece of text is displayed, for a given input text and arguments.
REQUIREMENTS
The Locale::MakePhrase module was initially designed to meet the requirements of a web application (as opposed to a desktop application), which may display many languages in the HTML form at any given instance.
Its design is modelled on the language-lexicon design used by the existing Locale::Maketext Perl module. The reasons for building a new module are:
We wanted to completely abstract the language rule capability, to be programming language agnostic so that we could re-implement this module in other programming languages.
We needed run-time evaluation of the rules, since the translations may be updated at any time; new rules may be added whenever there is some ambiguity in the existing phrase. Also, we didn't want to re-start the application whenever we updated a rule.
We would like to support various types of storage mechanisms for the translations. The original design constraint preferred the use of a PostgreSQL database to hold the translations - most existing language translation systems use flat files.
We want to store/manipulate the current text phrase, only encoded in UTF-8 (ie: we don't want to store the text in a locale-specific encoding). This allows us to output text to any other character set.
As an example of application usage, it is possible for a Hebrew speaking user to be logged into a web-form which contains Japanese data. As such they will see:
Menus and tooltips are translated into the user's language (ie: Hebrew).
Titles are in the language of the dataset (ie: Japanese).
Some of the data is in the Latin character set (ie: English).
If the user prefers to see the page as RTL rather than LTR, the page is altered to reflect this preference.
BACKGROUND
When implementing any new software, it is necessary to understand the problem domain. In the case of language translation, there are a number of requirements that we can define:
Quite a few people speak multiple languages; we would like the language translation system to use the user's preferred language localisation, or if we don't know which language that is, try to make an approximate guess, based on application capabilities.
Since some people speak multiple languages, the application may not have been localised to their preferred localisation. We should try to fall back to using a language which is similar.
Some languages support the notion of a dialect for that language. A good example is that the English language is used in many countries, but countries such as the United States, Australia and Great Britain each have their own localised version, ie. the dialect is specified as the country or region. The language translation mechanism needs to be able to use the user's preferred dialect when looking up the text to display. If no translation is found, then it should fall back to the parent language.
Some languages are written using a script which displays its output as right-to-left text (as used by Arabic, Hebrew, etc), rather than left-to-right text (as used by English, Latin, Greek, etc). The language translation mechanism should allow the text display mechanism to change the text direction if that is a requirement (which is another reason for mandating the use of UTF-8).
The string to be translated should support the ability to re-order the wording of the text.
The text translation mechanism should support the ability to show arguments supplied to the string (by the application), within the correct context of the meaning of the string.
- Eg:
-
We could say something like "You selected 4 balls" (where the number 4 is program dependent); in another language you may want to say the equivalent of "4 balls selected".
Notice that the numeric position has moved from being the third mnemonic, to being the first mnemonic. The requirement is that we would like to be able to rearrange the order/placement of any mnemonic (including any program arguments).
We would like to be able to support an arbitrary number of argument replacements. We shouldn't be limited in the number of replacements that need to occur, for any given number of program arguments.
Most program arguments that are given to strings are in numeric format (i.e. they are a number). We would also like to support arguments which are text strings, which themselves should be open to language translation (but only after rule evaluation). The purpose being that the output phrase should make sense within the current context of the application.
Many languages have the concept of singular and plural. Other languages have no such concept, while others still have the concept of duality. There is also the concept that a phrase can be descriptive when discussing the zero of something. Thus we want to display a specific phrase, depending on the value of an argument.
- Eg:
-
In English, the following text "Selected __ files" has multiple possible outputs, depending on the program value; we can have:
0 case: "No files selected" - no numeric value
1 case: "One file selected" - 'files' is singular
2 case: "Selected two files" - the '__' is a text value, not a number
more than 2 case: "Lots of selections" - no direct comparison to the original text
...as we can see, this is just for translating a single text string, from English to English.
To counter this problem, the translation system needs to be able to apply linguistic rules to the original text, so that it can evaluate which piece of text should be displayed, given the current context and program argument.
When updating a specific phrase for language translation, the next screen re-draw should show the new translation text. Thus translations need to be dynamically changeable, and run-time configurable.
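The re-ordering and value-dependent requirements above can be sketched in a few lines of plain Perl. This is only an illustration of the idea, not the Locale::MakePhrase engine: the names @rules, fill_placeholders and select_phrase are hypothetical, and the rule tests are ordinary Perl closures rather than the module's rule expressions.

```perl
use strict;
use warnings;

# Hypothetical rule table for one phrase key: each rule pairs a test on the
# first program argument with the text to use when that test passes.
my @rules = (
  { test => sub { $_[0] == 0 }, text => "No files selected"  },
  { test => sub { $_[0] == 1 }, text => "One file selected"  },
  { test => sub { $_[0] == 2 }, text => "Selected two files" },
  { test => sub { $_[0] >  2 }, text => "Lots of selections" },
);

# Replace each [_N] placeholder with the N-th argument (1-based), wherever
# the translator has chosen to place it in the text.
sub fill_placeholders {
  my ($text, @args) = @_;
  $text =~ s/\[_(\d+)\]/defined $args[$1 - 1] ? $args[$1 - 1] : ''/ge;
  return $text;
}

# Return the text of the first rule whose test is true; if no rule matches,
# fall back to the untranslated key.
sub select_phrase {
  my ($key, @args) = @_;
  for my $rule (@rules) {
    return fill_placeholders($rule->{text}, @args) if $rule->{test}->(@args);
  }
  return fill_placeholders($key, @args);
}

# The same argument can appear at any position in the output text.
print fill_placeholders("You selected [_1] balls", 4), "\n";
print fill_placeholders("[_1] balls selected", 4), "\n";

# The argument value decides which phrase is shown.
print select_phrase("Selected [_1] files", $_), "\n" for 0, 1, 2, 7;
```

Keeping the rules as data rather than code is what allows them to be reloaded while the application is running.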
INTERNAL TEXT ENCODING
This module uses UTF-8 text encoding internally, thus it requires a minimum of Perl 5.8. So, for any given application string and user language combination, we require the backing store to look up the combination, then return a list of Locale::MakePhrase::LanguageRule objects, which must be created with the key and translated strings stored in the UTF-8 encoding.
Thus, to simplify the string-load functionality, we recommend loading and storing the translated strings as UTF-8 encoded strings. See Locale::MakePhrase::BackingStore for more information.
- ie.
-
The PostgreSQL backing store assumes that the database instance stores strings in the UNICODE encoding (rather than, say, ASCII); this avoids the need to translate every string when we load it.
OUTPUT TEXT ENCODING
Locale::MakePhrase uses UTF-8 encoding internally, as described above. This is also the default output encoding. You can choose to have a different output encoding, such as ISO-8859-1.
Normally, if the output display mechanism can display UNICODE (encoded as UTF-8), then text will be rendered in the correct language and correct text direction (ie. left-to-right or right-to-left).
By supplying the encoding as a constructor argument, Locale::MakePhrase will transcode the translated text from UTF-8 into your output-specific encoding (using the Encode module). This is useful in cases where font support within an application hasn't yet evolved to the same level as a language-specific font.
See the Encode module for a list of available output encodings.
Default output character set encoding: UTF-8
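To give a feel for what the output conversion involves, here is a minimal sketch that calls the core Encode module directly. How Locale::MakePhrase invokes Encode internally is not shown in this document, so the variable names and the sample phrase are illustrative assumptions only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A translated phrase containing one non-ASCII character (e-acute).
my $phrase = "caf\x{e9}";

# The same characters encoded for two different output targets.
my $as_utf8   = encode('UTF-8', $phrase);
my $as_latin1 = encode('ISO-8859-1', $phrase);

# UTF-8 needs two bytes for the accented character, ISO-8859-1 needs one.
printf "UTF-8: %d bytes, ISO-8859-1: %d bytes\n",
  length($as_utf8), length($as_latin1);

# Encode can report every encoding name it knows about.
my @available = Encode->encodings(':all');
printf "%d encodings available\n", scalar @available;
```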
WHAT ARE LINGUISTIC RULES?
Since the concept of a linguistic rule is at the heart of this translation module, its documentation is located in Locale::MakePhrase::RuleManager. It explains the syntax of the rule expressions, how rules are sorted and selected, as well as the operators and functions that are available within the expressions. You should read that information, before continuing.
- Available operators:
-
==, !=, <, >, <=, >=, eq, ne
- Available functions:
-
defined(x), length(x), int(x), abs(n), lc(s), uc(s), left(s,n), right(s,n), substr(s,n), substr(s,n,r)
Object API
The following methods are part of the Locale::MakePhrase object API:
new()
Constructs a new instance of a Locale::MakePhrase object. Takes the following named parameters (ie: via a hash or hashref):
language
languages
-
Specify one or more languages which are used for locating the correct language string (all forms are supported; first found is used).
They take either a string (eg 'en'), a comma-separated list (eg 'en_AU, en_GB') or an array of strings (eg ['en_AU','en_GB']).
The order specified is the order in which phrases are looked up. These strings go through a manipulation process (using the Perl module I18N::LangTags) of:
The strings are converted to RFC3066 language tags; these become the primary tags.
Superordinate tags are retrieved for each primary tag.
Alternates of the primary tags are then retrieved.
Panic language tags are retrieved for each primary tag (if enabled).
The fallback language is retrieved (see 'fallback language').
Duplicate language tags are removed.
All tags are converted to lowercase, and '-' are changed to '_'.
This leaves us with a list of at least the fallback language.
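The manipulation steps above can be approximated with functions that I18N::LangTags exports. The helper expand_language_tags below is hypothetical, and the exact order in which Locale::MakePhrase interleaves the tags may differ; treat it as a sketch of the technique rather than the module's code.

```perl
use strict;
use warnings;
use I18N::LangTags qw(
  locale2language_tag super_languages
  alternate_language_tags panic_languages
);

# Hypothetical helper: expand the requested languages into the ordered list
# of tags to search, always ending with the fallback language.
sub expand_language_tags {
  my ($fallback, $use_panic, @requested) = @_;

  # Convert locale-style names (en_AU) into language tags (en-AU).
  my @primary = grep { defined } map { locale2language_tag($_) } @requested;

  my @tags = @primary;
  push @tags, map { super_languages($_) } @primary;          # en-AU -> en
  push @tags, map { alternate_language_tags($_) } @primary;  # he <-> iw
  push @tags, panic_languages(@primary) if $use_panic;
  push @tags, $fallback;

  # Lowercase, switch '-' to '_', and drop duplicates while keeping order.
  my (%seen, @result);
  for my $tag (@tags) {
    (my $normalised = lc $tag) =~ tr/-/_/;
    push @result, $normalised unless $seen{$normalised}++;
  }
  return @result;
}

print join(',', expand_language_tags('en', 0, 'en_AU', 'en_GB')), "\n";
# prints "en_au,en_gb,en"
```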
charset
encoding
-
This option (both forms are supported; first found is used) allows you to change the output character set encoding, to something other than UTF-8, such as ISO-8859-1.
See OUTPUT TEXT ENCODING for more information.
backing_store
-
Takes either a reference to a backing store instance, or to a string which can be used to dynamically construct the instance.
The final backing store instance must have a type of Locale::MakePhrase::BackingStore.
Default: use a Locale::MakePhrase::BackingStore
rule_manager
-
Takes either a reference to a rule manager instance, or to a string which can be used to dynamically construct the instance.
The final manager instance must have a type of Locale::MakePhrase::RuleManager.
Default: use a Locale::MakePhrase::RuleManager
malformed_character_mode
-
Perl normally outputs \x{HH} for malformed characters (or \x{HHHH}, \x{HHHHHH}, etc. for wide characters). Setting this value changes the behaviour to output alternative character entity formats.
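As an illustration of what such alternative formats look like, the sketch below uses the fallback modes of the core Encode module. Whether the malformed character modes of Locale::MakePhrase map directly onto these Encode constants is an assumption; this document does not say so. The helper encode_with is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character that plain ASCII cannot represent (e-acute, code point 233).
my $text = "caf\x{e9}";

# Hypothetical helper: encode a private copy of the string to ASCII using
# the given Encode fallback mode.
sub encode_with {
  my ($check, $string) = @_;
  return encode('ascii', $string, $check);
}

# Perl-style escape, the behaviour described above as the normal one.
print encode_with(Encode::FB_PERLQQ, $text), "\n";

# HTML numeric character reference: caf&#233;
print encode_with(Encode::FB_HTMLCREF, $text), "\n";

# XML hexadecimal character reference.
print encode_with(Encode::FB_XMLCREF, $text), "\n";
```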
Note that if you are using Locale::MakePhrase to generate strings used within web pages / HTML, you should set this parameter to Locale::MakePhrase->MALFORMED_MODE_HTML.
numeric_format
-
This option allows the user to control how numbers are output. You can set the output to be one of a number of forms of stringification defined in Locale::MakePhrase::Numeric, eg:
- '.', ',', '(', ')'
-
Place comma separators before every third digit; use brackets for negative, as in: (10,000,000.1)
This takes either a string format or an array reference containing the format.
Default: don't format; show decimal as full-stop
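To make the four-element format concrete, here is a small stand-alone sketch. It is not the implementation in Locale::MakePhrase::Numeric; the function format_with is hypothetical, and it assumes the elements mean decimal symbol, thousands separator, negative prefix and negative suffix, in that order.

```perl
use strict;
use warnings;

# Hypothetical helper: format a number using a four-element format of
# (decimal symbol, thousands separator, negative prefix, negative suffix).
sub format_with {
  my ($number, $format) = @_;
  my ($decimal, $thousands, $neg_open, $neg_close) = @$format;

  my $negative = $number < 0;
  my ($int, $frac) = split /\./, abs($number), 2;

  # Insert the separator before each group of three digits, working from
  # the right-hand end of the integer part.
  1 while $int =~ s/^(\d+)(\d{3})/$1$thousands$2/;

  my $text = defined $frac ? "$int$decimal$frac" : $int;
  return $negative ? "$neg_open$text$neg_close" : $text;
}

print format_with(-10000000.1, ['.', ',', '(', ')']), "\n";  # (10,000,000.1)
print format_with(1234567, [',', '.', '-', '']), "\n";       # 1.234.567
```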
die_on_bad_translation
-
Set this option to true to make Locale::MakePhrase die if the translated string is incorrectly formatted (eg: too many argument place holders are specified) or the expression is not valid. The alternative is to output the phrase <INVALID TRANSLATION> or <INVALID EXPRESSION>.
Dying here means that translations have the ability to abort your code. If you don't have control over the quality of the phrases added to your dictionary, you should probably use the default behaviour.
Note that an invalid expression or translation generates a warning to STDERR.
Default: don't die; output the appropriate error phrase
translate_arguments
-
Set this option to false to stop Locale::MakePhrase from translating the supplied arguments before applying them to the output of the engine. When left enabled, you do not have to call translate() for each argument within your own code.
Default: do translate arguments
add_newline
-
Set this option to true to make Locale::MakePhrase automatically add newline characters to the end of every translated string. The reason for having this is to allow your translation-key to not require the OS-dependent newline character(s), and to not require newline character(s) on the target-translation.
Note that the API provides alternate method calls so as to allow you to add newline character(s) as necessary.
Default: don't add any newline characters
panic_language_lookup
-
Set this option to true to make Locale::MakePhrase load 'panic' languages as defined by "panic_languages" in I18N::LangTags. Basically it provides a mechanism to allow the engine to return a language string from languages which have a similar heritage to the primary language(s), if a translation from the primary language hasn't been found.
eg: Spanish has a similar heritage to Italian, thus if no translations are found in Italian, then Spanish translations will be used.
Default: don't look up panic-languages
- Notes:
-
If the arguments aren't a hash or hashref, then we assume that the arguments are language tags.
If you don't supply any language, the fallback language will be used.
Default language: en
$self init([...])
Allows a sub-class a chance to control construction of the object. You must return a reference to $self, to 'allow' the construction to complete.
At this point of construction you can call $self->options() which returns a reference to the current constructor options. This allows you to add/modify any existing options; for example you may want to inject something specific...
$string context_translate($context, $string [, ...])
[ $context is either a text string or an object reference (which then gets stringified into its class name). ]
This is a primary entry point; call this with your application context, your string and any program arguments which need to be translated. Note however that in most cases you will most likely want to call the translate function instead; see below.
In some cases you will find that you use the same text phrase in separate parts of your application, but the meaning of the phrase differs (due to the different application context); supplying a context allows your backing store to use the extra context information to return the correct language rules.
The steps involved in a string translation are:
Fetch all possible translation rules for all language tags (including alternates and the fallbacks), from the backing store. The store will return a list reference of LanguageRule objects.
Sort the list based on the implementation defined in the Locale::MakePhrase::RuleManager module.
Select the rule instance for which the rule-expression evaluates to true for the supplied program arguments (if there is no expression, the rule is always true).
If no rules have been selected, then make a rule from the input string.
Apply the program arguments to the rule's translated text. If the argument is a text phrase, it (optionally) undergoes the language translation procedure. If the argument is numeric, it is formatted by one of your language sub-classes, or the Locale::MakePhrase::Numeric module.
Apply the output character set encoding to convert the text from UTF-8 into the preferred character set. If the output encoding is UTF-8 (thus matching the internal encoding), this step does nothing.
$string translate($string [, ...])
This is a primary entry point; call this with your string and any program arguments which need to be translated.
This function is a wrapper around the context_translate function, where the context is set to undef (which is usually what you want).
$string context_translate_ln($context, $string [, ...])
This is a primary entry point; call this with your context, string and any program arguments which need to be translated.
This function is a wrapper around the context_translate function, but this adds newline character(s) to the output.
$string translate_ln($string [, ...])
This is a primary entry point; call this with your string and any program arguments which need to be translated.
As above, this function is a wrapper around the context_translate function, where the context is set to undef, but this adds newline character(s) to the output.
$string format_number($number,$options)
This method implements the number-specific formatting, by calling into Locale::MakePhrase::Numeric's stringify_number method.
To provide custom handling of number formatting, you can do one of:
Define a Locale::MakePhrase::Numeric number formatting option.
Implement 'per-language' number formatting, by sub-classing the Locale::MakePhrase::Language module, then implementing a format_number method.
$backing_store fallback_backing_store()
Backing store to use, if not specified on construction. You can overload this in a sub-class.
$string fallback_language()
Language to fallback to, if all others fail (this defaults to 'en'). You can override this method in a sub-class.
Usually this will be the language in which you write your application text (eg: you may be coding using German rather than English).
Note that this must return an RFC 3066 compliant language tag.
$string_array language_classes()
This method returns a list of possible class names (which must be sub-classes of Locale::MakePhrase::Language) which can get prepended to the language tags for this instance. Locale::MakePhrase will then try to dynamically load these modules during construction.
The idea being that you simply need to put your language-specific module in the same directory as your sub-class, thus we will find the custom modules.
Alternatively, you can override this method in a sub-class, to return the correct class hierarchy name.
$format numeric_format($format)
This method allows you to set and/or get the format that is being used for numeric formatting. You can supply an array, an array ref, or a string.
Accessor methods
- $hash options()
-
Returns the options that were supplied to the constructor.
- $string_array languages()
-
Returns a list of the language tags that are in use.
- $object_list language_modules()
-
Returns a list of the loaded language modules.
- $object backing_store()
-
Returns the loaded backing store instance.
- $object rule_manager()
-
Returns the loaded rule manager instance.
- $string encoding()
-
Returns the output character set encoding.
- $int malformed_character_mode()
-
Returns the current UTF-8 malformed character output mode.
- $bool die_on_bad_translation()
-
Returns the current state of 'die_on_bad_translation'.
- $bool translate_arguments()
-
Returns the current state of 'translate_arguments'.
- $bool add_newline()
-
Returns the current state of 'add_newline'.
- $bool panic_language_lookup()
-
Returns the current state of 'panic_language_lookup'.
Function API
The following items are helper functions, which can be used to simplify the usage of Locale::MakePhrase objects.
$string mp($string [, ...])
This is a helper function to the translate() function call. It will use the last-constructed instance of Locale::MakePhrase to invoke the translate function on. eg:
print mp("This is test no: [_1]",$test_no);
could produce:
This is the first test.
$string __ $string [, ...]
This function is the same as the previous helper function, except that it makes your code easier to read and easier to write. eg:
print __"This is test no: [_1]",$test_no;
could produce:
This is test no: 4
Note that we use double-underscore as this makes search-n-replace tasks easier than if we used a single-underscore.
NOTE
The previous functions use a reference to an internal variable. If you are using this module from within Apache (say under mod_perl), make sure that you construct a new instance of a Locale::MakePhrase object, in the child Apache processes.
SUB-CLASSING
These modules can be used standalone, or they can be sub-classed so as to control certain aspects of their behaviour. Each individual module from this group is capable of being sub-classed; refer to each module's specific documentation for more details.
In particular the Locale::MakePhrase::Language module is designed to be sub-classed, so as to support, say, language-specific keyboard input handling.
Construction control
Due to the magic of inheritance, there are two primary ways to control construction of any of these modules:
Overload the new() method:
Implement the new() method in your sub-class.
Call SUPER::new() so as to execute the parent class constructor.
Re-bless the returned object.
For example:
sub new {
  my $class = shift;
  ...
  my $self = $class->SUPER::new(...sub-class specific arguments...);
  $self = bless $self, $class;
  ...
  return $self;
}
Overload the init() method:
Implement the init() method in your sub-class.
Return a reference to the current object.
For example:
sub init {
  my $self = shift;
  ...
  return $self;
}
Sub-classing this module
This module (MakePhrase.pm) has a number of methods which can be overloaded:
init()
fallback_backing_store()
fallback_language()
language_classes()
format_number()
DEBUGGING
Since this module and framework are relatively new, it is quite likely that a few bugs still exist. By setting the module-specific DEBUG variable, you can enable debug messages to be sent to STDERR.
Set the value to zero, to disable debug. Setting progressively higher values (up to a maximum value of 9), results in more debug messages being generated.
The following variables can be set:
$Locale::MakePhrase::DEBUG
$Locale::MakePhrase::RuleManager::DEBUG
$Locale::MakePhrase::LanguageRule::DEBUG
$Locale::MakePhrase::BackingStore::Cached::DEBUG
$Locale::MakePhrase::BackingStore::File::DEBUG
$Locale::MakePhrase::BackingStore::Directory::DEBUG
$Locale::MakePhrase::BackingStore::PostgreSQL::DEBUG
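For example, the levels could be set like this. The particular values are arbitrary, and the snippet assumes you run the assignments after the modules have been loaded so that a module's own default does not overwrite them; it does not load the modules itself, so it runs on its own.

```perl
use strict;
use warnings;

# Moderate debug output from the core module, maximum output from the
# rule manager, and no output from the directory backing store.
$Locale::MakePhrase::DEBUG = 3;
$Locale::MakePhrase::RuleManager::DEBUG = 9;
$Locale::MakePhrase::BackingStore::Directory::DEBUG = 0;

printf "core=%d rules=%d directory=%d\n",
  $Locale::MakePhrase::DEBUG,
  $Locale::MakePhrase::RuleManager::DEBUG,
  $Locale::MakePhrase::BackingStore::Directory::DEBUG;
```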
NOTES
Text directionality
This module internally uses UTF-8 character encoding for text storage for a number of reasons, one of them being for the ability to encode the directionality within the text string using Unicode character glyphs.
However it is up to the application drawing mechanism to support the correct interpretation of these Unicode glyphs, before the text can be displayed in the correct direction.
Localised text layout
In some languages there may be a requirement to lay out the application interface using a different layout scheme than what would normally be available. This requirement is known as layout localisation. For example, Chinese text may prefer a top-to-bottom, left-to-right layout (rather than left-to-right, top-to-bottom).
This module doesn't provide this facility, as that is up to the application layout mechanism to handle the differences in layout. eg: A web-browser uses HTML as a formatting language; web-browsers do not implement top-to-bottom text layout.
SEE ALSO
Locale::MakePhrase is made up of a number of modules, for which there is POD documentation for each module. Refer to:
- Locale::MakePhrase::Language
- Locale::MakePhrase::Language::en
- Locale::MakePhrase::LanguageRule
- Locale::MakePhrase::RuleManager
- Locale::MakePhrase::BackingStore
- Locale::MakePhrase::BackingStore::File
- Locale::MakePhrase::BackingStore::Directory
- Locale::MakePhrase::BackingStore::PostgreSQL
- Locale::MakePhrase::Utils
- Locale::MakePhrase::Numeric
- Locale::MakePhrase::Print
It also uses the following modules internally:
You can (and should) read the documentation provided by the Locale::Maketext module.
BUGS
Multiple levels of quoting
The rule expression parser cannot handle multiple levels of quoting. It needs modification to support this (however, this may make the parser slower).
Expression parsing failure
The rule expression parser splits the rule into sub-expressions by chunking on ' && '. This means it will fail to parse a text evaluation containing these characters. For example this will fail to parse:
_1 eq ' && '
Since the ' && ' is not a common text expression, this bug will probably never be fixed.
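The failure is easy to reproduce with a naive split. The real parser's code is not shown in this document, so the snippet below only assumes that the chunking behaves like Perl's split on the literal ' && ' sequence.

```perl
use strict;
use warnings;

# The expression from the example above: a single comparison whose quoted
# text happens to contain ' && '.
my $expression = q{_1 eq ' && '};

# Chunk on ' && ' without paying attention to the quotes.
my @chunks = split / && /, $expression;

# The single comparison is wrongly cut into two pieces, leaving an
# unbalanced quote in each chunk.
printf "%d chunks: %s\n", scalar(@chunks), join(' | ', @chunks);
```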
TODO
Need to add support for male / female context of phrase. This could be implemented using a context specific translation, however the better way would be to add native support for gender.
CREDITS
This module was written for NetRatings, Inc.; they paid for part of my time to develop this module.
Various suggestions and bug fixes were also provided by:
LICENSE
This module was written by Mathew Robertson mailto:mathew@users.sf.net for NetRatings, Inc. http://www.netratings.com. Copyright (C) 2006
This module is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License version 2 (or at your option, any later version) as published by the Free Software Foundation http://www.fsf.org.
This module is distributed WITHOUT ANY WARRANTY WHATSOEVER, in the hope that it will be useful to others.