NAME

Locale::Unicode - Unicode Locale Identifier compliant with BCP47 and CLDR

SYNOPSIS

use Locale::Unicode;
my $locale = Locale::Unicode->new( 'ja-Kana-t-it' ) ||
    die( Locale::Unicode->error, "\n" );
say $locale; # ja-Kana-t-it

# Some undefined locale in Cyrillic script
my $locale = Locale::Unicode->new( 'und-Cyrl' );
$locale->transform( 'und-latn' );
$locale->mechanism( 'ungegn-2007' );
say $locale; # und-Cyrl-t-und-latn-m0-ungegn-2007
# A locale in Cyrillic, transformed from Latin, according to a UNGEGN specification dated 2007.

VERSION

v0.1.0

DESCRIPTION

This module implements the Unicode LDML (Locale Data Markup Language) extensions.

It does not enforce the standard, and is merely an API to construct, access and modify locales. It is your responsibility to set the right values.

For your convenience, a summary of the key elements of the standard can be found in this documentation.

It is lightweight and fast with no dependency outside of Scalar::Util and Want. It requires perl v5.10 minimum to operate.

The object stringifies, and once its string value is computed, it is cached and reused until the object is changed. Thus, repeated calls to as_string, or repeated stringification, do not incur any speed penalty from recomputing what has not changed.
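The caching described above can be pictured with a small, self-contained sketch. The package My::Tag and its fields are hypothetical and are not taken from the Locale::Unicode source; the sketch only illustrates the general idea of building the string once and dropping the cache when an attribute changes.

```perl
use strict;
use warnings;

package My::Tag;

# Stringification is overloaded and delegates to as_string.
use overload '""' => \&as_string, fallback => 1;

our $BUILDS = 0;    # counts how often the string is actually rebuilt

sub new
{
    my( $class, %parts ) = @_;
    return( bless( { parts => {%parts}, cache => undef } => $class ) );
}

# Combined getter/setter: setting a value drops the cached string.
sub script
{
    my $self = shift( @_ );
    if( @_ )
    {
        $self->{parts}->{script} = shift( @_ );
        undef( $self->{cache} );
    }
    return( $self->{parts}->{script} );
}

sub as_string
{
    my $self = shift( @_ );
    return( $self->{cache} ) if( defined( $self->{cache} ) );
    $BUILDS++;
    my $p = $self->{parts};
    $self->{cache} = join( '-', grep { defined( $_ ) } @{$p}{qw( lang script )} );
    return( $self->{cache} );
}

package main;

my $tag  = My::Tag->new( lang => 'ja', script => 'Kana' );
my @seen = ( "$tag", "$tag" );   # built once, then served from the cache
$tag->script( 'Hira' );          # invalidates the cache
push( @seen, "$tag" );           # rebuilt once more
print( join( ', ', @seen ), " (builds: $My::Tag::BUILDS)\n" );
```

This prints "ja-Kana, ja-Kana, ja-Hira (builds: 2)": three stringifications, but only two actual computations.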

CONSTRUCTOR

new

my $locale = Locale::Unicode->new( 'en' );
my $locale = Locale::Unicode->new( 'en-GB' );
my $locale = Locale::Unicode->new( 'en-Latn-AU' );
my $locale = Locale::Unicode->new( 'he-IL-u-ca-hebrew-tz-jeruslm' );
my $locale = Locale::Unicode->new( 'ja-Kana-t-it' );
my $locale = Locale::Unicode->new( 'und-Latn-t-und-cyrl' );
my $locale = Locale::Unicode->new( 'und-Cyrl-t-und-latn-m0-ungegn-2007' );
my $locale = Locale::Unicode->new( 'de-u-co-phonebk-ka-shifted' );
# Machine translated from German to Japanese using an undefined vendor
my $locale = Locale::Unicode->new( 'ja-t-de-t0-und' );
$locale->script( 'Kana' );
$locale->country_code( 'JP' );
# Now: ja-Kana-JP-t-de-t0-und

This takes a locale compliant with the BCP47 standard and an optional hash or hash reference of options, and returns a new object.

The locale provided is parsed and its components can be accessed and modified using all the methods of this class API.

If a hash or hash reference of options is provided, it is used to set or modify the components of the locale provided.

If an error occurs, an exception object is set and undef is returned in scalar context, or an empty list in list context. The exception object can then be retrieved using error, such as:

my $locale = Locale::Unicode->new( $something_bad ) ||
    die( Locale::Unicode->error );

METHODS

All the methods below are context sensitive.

If they are called in an object context, they return the current Locale::Unicode object for chaining; otherwise, they return the current value. If that value is undef, they return undef in scalar context, or an empty list in list context.

Also, if an error occurs, they set an exception object and return undef in scalar context, or an empty list in list context.
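As a brief illustration of these context rules, the following sketch uses only methods documented on this page. It assumes Locale::Unicode is installed; the chained call relies on the object context described above, and the resulting string is the one shown in the constructor example.

```perl
use strict;
use warnings;
use Locale::Unicode;

my $locale = Locale::Unicode->new( 'ja-t-de-t0-und' ) ||
    die( Locale::Unicode->error, "\n" );

# Object context: each setter returns the object, so calls can be chained.
$locale->script( 'Kana' )->country_code( 'JP' );

# Scalar context: the accessor returns the current value.
my $script = $locale->script;

# List context: a component that was never set yields an empty list.
my @unset = $locale->calendar;

print( "$locale\n" );    # ja-Kana-JP-t-de-t0-und, as in the constructor example
print( "script: $script; unset items: ", scalar( @unset ), "\n" );
```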

apply

my $hash_reference = Locale::Unicode->parse( 'ja-Kana-t-it' );
$locale->apply( $hash_reference );

Provided with a hash reference of key-value pairs, this calls each corresponding method with the associated value.

If a property provided has no corresponding method, it emits a warning if warnings are enabled.

It returns the current object upon success, or sets an error object upon error and returns undef in scalar context, or an empty list in list context.

as_string

Returns the Locale object as a string, based on its latest attributes set.

The string value returned is computed only once, and further calls to as_string return a cached value unless changes were made to the Locale attributes.

break_exclusion

my $locale = Locale::Unicode->new( 'ja' );
$locale->break_exclusion( 'hani-hira-kata' );
# Now: ja-u-dx-hani-hira-kata

This is a Unicode Dictionary Break Exclusion Identifier that specifies scripts to be excluded from dictionary-based text break (for words and lines).

Sets or gets the Unicode extension dx.

See also dx

ca

This is an alias for "calendar"

calendar

my $locale = Locale::Unicode->new( 'th' );
$locale->calendar( 'buddhist' );
# or:
# $locale->ca( 'buddhist' );
# Now: th-u-ca-buddhist
# which is Thai with the Buddhist calendar

Sets or gets the Unicode extension ca, which is a calendar identifier.

See the section on "BCP47 EXTENSIONS" for the proper values.

cf

This is an alias for "cu_format"

co

my $locale = Locale::Unicode->new( 'de' );
$locale->collation( 'phonebk' );
$locale->ka( 'shifted' );
# Now: de-u-co-phonebk-ka-shifted

This is a Unicode collation identifier that specifies a type of collation (sort order).

This is an alias for "collation"

colAlternate

my $locale = Locale::Unicode->new( 'de' );
$locale->collation( 'phonebk' );
$locale->ka( 'shifted' );
# Now: de-u-co-phonebk-ka-shifted

$locale->colAlternate( 'noignore' );
# or similarly:
$locale->colAlternate( 'non-ignorable' );

Sets alternate handling for variable weights.

Sets or gets the Unicode extension ka

See "Collation Options" for more information.

colBackwards

$locale->colBackwards(1); # true
# Now: kb-true
$locale->colBackwards(0); # false
# Now: kb-false

Sets collation boolean value for backward collation weight.

Sets or gets the Unicode extension kb

See "Collation Options" for more information.

colCaseFirst

Sets or gets the Unicode extension kf

colCaseLevel

$locale->colCaseLevel(1); # true
# Now: kc-true
$locale->colCaseLevel(0); # false
# Now: kc-false

Sets collation boolean value for case level.

Sets or gets the Unicode extension kc

See "Collation Options" for more information.

colHiraganaQuaternary

$locale->colHiraganaQuaternary(1); # true
# Now: kh-true
$locale->colHiraganaQuaternary(0); # false
# Now: kh-false

Sets collation parameter key for special Hiragana handling.

Sets or gets the Unicode extension kh

See "Collation Options" for more information.

collation

my $locale = Locale::Unicode->new( 'fr' );
$locale->collation( 'emoji' );
# Now: fr-u-co-emoji

my $locale = Locale::Unicode->new( 'de' );
$locale->collation( 'phonebk' );
# Now: de-u-co-phonebk
# which is: German using Phonebook sorting

Sets or gets the Unicode extension co

This specifies a type of collation (sort order).

See "Unicode extensions" for possible values and more information on the standard.

See also "Collation Options" for more on collation options.

colNormalisation

This is an alias for colNormalization

colNormalization

$locale->colNormalization(1); # true
# Now: kk-true
$locale->colNormalization(0); # false
# Now: kk-false

Sets collation parameter key for normalisation.

Sets or gets the Unicode extension kk

See "Collation Options" for more information.

colNumeric

$locale->colNumeric(1); # true
# Now: kn-true
$locale->colNumeric(0); # false
# Now: kn-false

Sets collation parameter key for numeric handling.

Sets or gets the Unicode extension kn

See "Collation Options" for more information.

colReorder

my $locale = Locale::Unicode->new( 'en' );
$locale->colReorder( 'latn-digit' );
# Now: en-u-kr-latn-digit
# Reorder digits after Latin characters.

my $locale = Locale::Unicode->new( 'en' );
$locale->colReorder( 'arab-cyrl-others-symbol' );
# Now: en-u-kr-arab-cyrl-others-symbol
# Reorder Arabic characters first, then Cyrillic, and put
# symbols at the end, after all other characters.

Sets collation reorder codes.

Sets or gets the Unicode extension kr

See "Collation Options" for more information.

shiftedGroup

This is an alias for "colValue"

colStrength

$locale->colStrength( 'level1' );
# Now: ks-level1
# or, equivalent:
$locale->colStrength( 'primary' );

$locale->colStrength( 'level2' );
# or, equivalent:
$locale->colStrength( 'secondary' );

$locale->colStrength( 'level3' );
# or, equivalent:
$locale->colStrength( 'tertiary' );

$locale->colStrength( 'level4' );
# or, equivalent:
$locale->colStrength( 'quaternary' );
$locale->colStrength( 'quarternary' );

$locale->colStrength( 'identic' );
# or, equivalent:
$locale->colStrength( 'identical' );

Sets the collation parameter key for collation strength used for comparison.

Sets or gets the Unicode extension ks

See "Collation Options" for more information.

colValue

$locale->colValue( 'currency' );
$locale->colValue( 'punct' );
$locale->colValue( 'space' );
$locale->colValue( 'symbol' );

Sets the collation value for the last reordering group to be affected by ka-shifted.

Sets or gets the Unicode extension kv

See "Collation Options" for more information.

colVariableTop

Sets the string value for the variable top.

Sets or gets the Unicode extension vt

See "Collation Options" for more information.

country_code

my $locale = Locale::Unicode->new( 'en' );
$locale->country_code( 'US' );
# Now: en-US
$locale->country_code( 'GB' );
# Now: en-GB

Sets or gets the country code part of the locale.

A country code should be a 2-letter ISO 3166 code, but keep in mind that the LDML (Locale Data Markup Language) accepts old data to ensure stability.

cu

my $locale = Locale::Unicode->new( 'ja' );
$locale->cu( 'jpy' );
# Now: ja-u-cu-jpy
# which is the Japanese yen

This is a Unicode currency identifier that specifies a type of currency (ISO 4217 code).

This is an alias for "currency"

cu_format

# Using minus sign symbol for negative numbers
$locale->cf( 'standard' );
# Using parentheses for negative numbers
$locale->cf( 'account' );

This is a currency format identifier such as standard or account

Sets or gets the Unicode extension cf

See the section on "BCP47 EXTENSIONS" for the proper values.

currency

my $locale = Locale::Unicode->new( 'ja' );
$locale->currency( 'jpy' );
# or
# $locale->cu( 'jpy' );
# Now: ja-u-cu-jpy
# which is the Japanese yen

Sets or gets the Unicode extension cu.

This specifies a currency by its ISO 4217 code.

d0

This is an alias for "destination"

dest

This is an alias for "destination"

destination

Sets or gets the Transformation extension d0 for destination.

See the section on "Transform extensions" for more information.

dx

This is an alias for "break_exclusion"

em

This is an alias for "emoji"

emoji

This is a Unicode Emoji Presentation Style Identifier that specifies a request for the preferred emoji presentation style.

Sets or gets the Unicode extension em.

false

This is read-only and returns a Locale::Unicode::Boolean object representing a false value.

fw

This is an alias for "first_day"

first_day

This is a Unicode First Day Identifier that specifies the preferred first day of the week for calendar display.

Sets or gets the Unicode extension fw.

Its values are sun, mon, tue, wed, thu, fri and sat.
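A minimal usage sketch, assuming Locale::Unicode is installed and following the same set-then-stringify pattern as the other extension methods on this page. No exact output is shown because it has not been verified against the module; the value mon comes from the list above.

```perl
use strict;
use warnings;
use Locale::Unicode;

my $locale = Locale::Unicode->new( 'en-GB' ) ||
    die( Locale::Unicode->error, "\n" );

# Request Monday as the first day of the week
$locale->first_day( 'mon' );
# or, using the alias:
# $locale->fw( 'mon' );

my $day = $locale->first_day;   # reads back the value just set
print( "first day: $day\n" );
print( "$locale\n" );           # the locale, now carrying the fw extension
```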

h0

This is an alias for "hybrid"

hc

This is an alias for "hour_cycle"

hour_cycle

This is a Unicode Hour Cycle Identifier that specifies the preferred time cycle.

Sets or gets the Unicode extension hc.

hybrid

my $locale = Locale::Unicode->new( 'ru' );
$locale->transform( 'en' );
$locale->hybrid(1); # true
# or
# $locale->hybrid( 'hybrid' );
# or
# $locale->h0( 'hybrid' );
# Now: ru-t-en-h0-hybrid
# Hybrid Cyrillic - Runglish

my $locale = Locale::Unicode->new( 'en' );
$locale->transform( 'zh-hant' );
$locale->hybrid( 'hybrid' );
# Now: en-t-zh-hant-h0-hybrid
# which is Hybrid Latin - Chinglish

Those are Hybrid Locale Identifiers indicating that the t value is a language that is mixed into the main language tag to form a hybrid.

Sets or gets the Transformation extension h0.

See the section on "Transform extensions" for more information.

i0

This is an alias for "input"

k0

This is an alias for "keyboard"

input

my $locale = Locale::Unicode->new( 'zh' );
$locale->input( 'pinyin' );
# Now: zh-t-i0-pinyin

This is an Input Method Engine transformation.

Sets or gets the Transformation extension i0.

See the section on "Transform extensions" for more information.

ka

This is an alias for "colAlternate"

kb

This is an alias for "colBackwards"

kc

This is an alias for "colCaseLevel"

keyboard

my $locale = Locale::Unicode->new( 'en' );
$locale->keyboard( 'dvorak' );
# Now: en-t-k0-dvorak

This is a keyboard transformation, such as used by client-side virtual keyboards.

Sets or gets the Transformation extension k0.

See the section on "Transform extensions" for more information.

kf

This is an alias for "colCaseFirst"

kh

This is an alias for "colHiraganaQuaternary"

kk

This is an alias for "colNormalization"

kn

This is an alias for "colNumeric"

kr

This is an alias for "colReorder"

ks

This is an alias for "colStrength"

kv

This is an alias for "colValue"

lang

# current value: fr-FR
$obj->lang( 'de' );
# Now: de-FR

Sets or gets the locale part of this Locale object.

See also "locale"

lb

This is an alias for "line_break"

line_break

This is a Unicode Line Break Style Identifier that specifies a preferred line break style corresponding to the CSS level 3 line-break option.

Sets or gets the Unicode extension lb.

line_break_word

This is a Unicode Line Break Word Identifier that specifies a preferred line break word handling behavior corresponding to the CSS level 3 word-break option.

Sets or gets the Unicode extension lw.

locale

This is an alias for "lang"

locale3

my $locale = Locale::Unicode->new( 'jpn' );
$locale->script( 'Kana' );
# Now: jpn-Kana

Sets or gets the 3-letter ISO 639-2 code. Keep in mind, however, that to ensure stability, the LDML (Locale Data Markup Language) also uses old data.

lw

This is an alias for "line_break_word"

m0

This is an alias for "mechanism"

machine

my $locale = Locale::Unicode->new( 'ja' );
$locale->transform( 'de' );
$locale->machine( 'und' );
# Now: ja-t-de-t0-und
# Japanese translated from German by an undefined vendor

This is used to indicate content that has been machine translated, or a request for a particular type of machine translation of content.

Sets or gets the Transformation extension t0.

See the section on "Transform extensions" for more information.

measurement

This is a Unicode Measurement System Identifier that specifies a preferred measurement system.

Sets or gets the Unicode extension ms.

mechanism

my $locale = Locale::Unicode->new( 'und-Latn' );
$locale->transform( 'ru' );
$locale->mechanism( 'ungegn-2007' );
# Now: und-Latn-t-ru-m0-ungegn-2007
# representing a transformation according to the United Nations Group of
# Experts on Geographical Names (UNGEGN) specification dated 2007

This is a transformation mechanism referencing an authority or rules for a type of transformation.

Sets or gets the Transformation extension m0.

See the section on "Transform extensions" for more information.

ms

This is an alias for "measurement"

mu

This is an alias for "unit"

nu

This is an alias for "number"

number

This is a Unicode Number System Identifier that specifies a type of number system.

Sets or gets the Unicode extension nu.

private

my $locale = Locale::Unicode->new( 'ja-JP' );
$locale->private( 'something-else' );
# Now: ja-JP-x-something-else

This serves to set or get the value for a private subtag.

region

# current value: fr-FR
$locale->region( 'DE' );
# Now: fr-DE

Sets or gets the region part of a Unicode locale.

This is normally an ISO3166-1 country code.

region_override

my $locale = Locale::Unicode->new( 'en-GB' );
$locale->region_override( 'uszzzz' );
# Now: en-GB-u-rg-uszzzz
# which is a locale for British English but with region-specific defaults set to US.

This is a Unicode Region Override that specifies an alternate region to use for obtaining certain region-specific default values.

Sets or gets the Unicode extension rg.

reset

When provided with any argument, this resets the cached value computed by "as_string".

rg

This is an alias for "region_override"

s0

This is an alias for "source"

script

# current value: zh-Hans
$locale->script( 'Hant' );
# Now: zh-Hant

Sets or gets the script part of the Locale identifier.

sd

This is an alias for "subdivision"

sentence_break

This is a Unicode Sentence Break Suppressions Identifier that specifies a set of data to be used for suppressing certain sentence breaks.

Sets or gets the Unicode extension ss.

source

This is a transformation source for non-languages or scripts, such as fullwidth-halfwidth conversion.

Sets or gets the Transformation extension s0.

See the section on "Transform extensions" for more information.

ss

This is an alias for "sentence_break"

subdivision

my $locale = Locale::Unicode->new( 'gsw' );
$locale->subdivision( 'chzh' );
# or
# $locale->sd( 'chzh' );
# Now: gsw-u-sd-chzh

my $locale = Locale::Unicode->new( 'en-US' );
$locale->sd( 'usca' );
# Now: en-US-u-sd-usca

This is a Unicode Subdivision Identifier that specifies a regional subdivision used for locale. This is typically the States in the U.S., or prefectures in France or Japan, or provinces in Canada.

Sets or gets the Unicode extension sd.

Be careful of the rule in the standard. For example, en-CA-u-sd-gbsct would be invalid because gb in gbsct does not match the region subtag CA.
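Since this module does not enforce the standard, a check of this kind is left to the caller. The following is only a sketch: the helper sd_matches_region is hypothetical, is not part of Locale::Unicode, and only handles 2-letter region codes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Locale::Unicode: returns true when the
# 2-letter prefix of the subdivision equals the region subtag.
sub sd_matches_region
{
    my( $region, $sd ) = @_;
    return(0) unless( defined( $region ) && defined( $sd ) );
    return(0) unless( $region =~ /\A[A-Za-z]{2}\z/ && length( $sd ) > 2 );
    return( lc( substr( $sd, 0, 2 ) ) eq lc( $region ) ? 1 : 0 );
}

for my $pair ( [ 'GB', 'gbsct' ], [ 'US', 'usca' ], [ 'CA', 'gbsct' ] )
{
    my( $region, $sd ) = @$pair;
    printf( "%s + %s: %s\n", $region, $sd,
        sd_matches_region( $region, $sd ) ? 'consistent' : 'mismatch' );
}
```

The first two pairs are reported as consistent and the last one as a mismatch. A locale without a region subtag, such as gsw-u-sd-chzh above, has nothing to compare against, so the check only applies when a region is present.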

t0

This is an alias for "machine"

t_private

my $locale = Locale::Unicode->new( 'ja' );
$locale->transform( 'de' );
$locale->machine( 'und' );
$locale->t_private( 'medical' );
# Now: ja-t-de-t0-und-x0-medical

This is a private transformation subtag.

Sets or gets the Transformation private subtag x0.

t_x0

This is an alias for "t_private"

time_zone

This is a Unicode Timezone Identifier that specifies a time zone.

Sets or gets the Unicode extension tz.
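Because the tz extension takes a CLDR timezone ID rather than an IANA name, one plausible workflow is to convert the name first with the class function tz_name2id documented below. This sketch assumes Locale::Unicode is installed and uses only methods described on this page.

```perl
use strict;
use warnings;
use Locale::Unicode;

# Convert an IANA Olson name to its CLDR identifier first
my $id = Locale::Unicode->tz_name2id( 'Asia/Tokyo' );
die( Locale::Unicode->error, "\n" ) if( !defined( $id ) );
die( "No CLDR timezone ID known for Asia/Tokyo\n" ) if( !length( $id ) );

my $locale = Locale::Unicode->new( 'ja-JP' ) ||
    die( Locale::Unicode->error, "\n" );
# $id is jptyo, according to the tz_name2id documentation
$locale->time_zone( $id );
print( "$locale\n" );
```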

timezone

This is an alias for "time_zone"

transform

my $locale = Locale::Unicode->new( 'ja' );
$locale->transform( 'it' );
# Now: ja-t-it
# which is Japanese, transformed from Italian

my $locale = Locale::Unicode->new( 'ja-Kana' );
$locale->transform( 'it' );
# Now: ja-Kana-t-it
# which is Japanese Katakana, transformed from Italian

# 'und' is undefined and is perfectly valid
my $locale = Locale::Unicode->new( 'und-Latn' );
$locale->transform( 'und-cyrl' );
# Now: und-Latn-t-und-cyrl
# which is Latin script, transformed from the Cyrillic script

Sets or gets the Transformation extension t.

transform_locale

my $locale = Locale::Unicode->new( 'ja' );
my $locale2 = Locale::Unicode->new( 'it' );
$locale->transform_locale( $locale2 );
# Now: ja-t-it
my $object = $locale->transform_locale;

Sets or gets a Locale::Unicode object used to indicate the original locale subject to transformation.

This will trigger an exception if a value, other than Locale::Unicode or an inheriting class object, is set.

See the section on "Transform extensions" for more information.

translation

Sets or gets the Transformation extension t0.

true

This is read-only and returns a Locale::Unicode::Boolean object representing a true value.

tz

This is an alias for "time_zone"

unit

This is a Measurement Unit Preference Override that specifies an override for measurement unit preference.

Sets or gets the Unicode extension mu.

va

This is an alias for "variant"

variant

This is a Unicode Variant Identifier that specifies a special variant used for locales.

Sets or gets the Unicode extension va.

vt

This is an alias for "colVariableTop"

CLASS FUNCTIONS

matches

Provided with a BCP47 locale, this returns a hash reference of its components if it matches the BCP47 regular expression, which can be accessed as the global class variable $LOCALE_RE.

If nothing matches, it returns an empty string in scalar context, or an empty list in list context.

If an error occurs, it sets an error object and returns undef in scalar context, or an empty list in list context.
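The three possible outcomes in scalar context (hash reference, empty string, undef) can be told apart as sketched below. This assumes Locale::Unicode is installed, and that the second candidate is simply reported as a non-match rather than as an error.

```perl
use strict;
use warnings;
use Locale::Unicode;

for my $candidate ( 'ja-Kana-t-it', 'not a locale!' )
{
    my $ref = Locale::Unicode->matches( $candidate );
    if( !defined( $ref ) )
    {
        # undef signals an error; details are in the error object
        die( Locale::Unicode->error, "\n" );
    }
    elsif( !$ref )
    {
        # empty string: the input did not match $LOCALE_RE
        print( "'$candidate' is not a BCP47 locale\n" );
    }
    else
    {
        # hash reference of the matched components
        print( "'$candidate' matched: ", join( ', ', sort( keys( %$ref ) ) ), "\n" );
    }
}
```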

parse

my $hash_ref = Locale::Unicode->parse( 'ja-Kana-t-it' );
# Transcription in Japanese Katakana of an Italian word:
# {
#     ext_transform => "t-it",
#     ext_transform_subtag => "it",
#     locale => "ja",
#     script => "Kana",
# }
my $hash_ref = Locale::Unicode->parse( 'he-IL-u-ca-hebrew-tz-jeruslm' );
# Represents Hebrew as spoken in Israel, using the traditional Hebrew calendar, 
# and in the "Asia/Jerusalem" time zone
# {
#     country_code => "IL",
#     ext_unicode => "u-ca-hebrew-tz-jeruslm",
#     ext_unicode_subtag => "ca-hebrew-tz-jeruslm",
#     locale => "he",
# }

Provided with a BCP47 locale, and an optional hash reference like the one returned by matches, this returns a hash reference with a detailed breakdown of the information embedded in the locale, as per the Unicode BCP47 standard.

tz_id2name

Provided with a CLDR timezone ID, such as jptyo, this returns the equivalent IANA Olson name, which, in this case, would be Asia/Tokyo.

If an error occurs, it sets an error object and returns undef in scalar context, or an empty list in list context.
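A short usage sketch, assuming Locale::Unicode is installed. The expected value is the one given in the description above.

```perl
use strict;
use warnings;
use Locale::Unicode;

my $name = Locale::Unicode->tz_id2name( 'jptyo' );
# undef signals an error
die( Locale::Unicode->error, "\n" ) if( !defined( $name ) );
print( "$name\n" );    # Asia/Tokyo
```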

tz_id2names

my $ref = Locale::Unicode->tz_id2names( 'unknown' );
# yields an empty array object
my $ref = Locale::Unicode->tz_id2names( 'jptyo' );
# Asia/Tokyo

Provided with a CLDR timezone ID, such as ausyd, which stands primarily for Australia/Sydney, this returns an array object of IANA Olson timezone names, which, in this case, would yield: ['Australia/Sydney', 'Australia/ACT', 'Australia/Canberra', 'Australia/NSW']

The order is set by the BCP47 timezone data.

If an error occurs, it sets an error object and returns undef in scalar context, or an empty list in list context.

tz_info

my $def = Locale::Unicode->tz_info( 'jptyo' );
# yields the following hash reference:
# {
#     alias => [qw( Asia/Tokyo Japan )],
#     desc => "Tokyo, Japan",
#     tz => "Asia/Tokyo",
# }
my $def = Locale::Unicode->tz_info( 'unknown' );
# yields an empty string (not undef)

Provided with a CLDR timezone ID, such as jptyo, this returns a hash reference representing the dictionary entry for that ID.

If no information exists for the given timezone ID, an empty string is returned. undef is returned only for errors.

If an error occurs, it sets an error object and returns undef in scalar context, or an empty list in list context.

tz_name2id

my $id = Locale::Unicode->tz_name2id( 'Asia/Tokyo' );
# jptyo
my $id = Locale::Unicode->tz_name2id( 'Australia/Canberra' );
# ausyd

Provided with an IANA Olson timezone name, such as Asia/Tokyo, this returns its CLDR equivalent, which, in this case, would be jptyo.

If none exists, an empty string is returned.

If an error occurs, it sets an error object and returns undef in scalar context, or an empty list in list context.

OVERLOADING

Any object from this class is overloaded and stringifies to its locale representation.

For example:

my $locale = Locale::Unicode->new('ja-Kana-t-it' );
say $locale; # ja-Kana-t-it
$locale->transform( 'de' );
say $locale; # ja-Kana-t-de

BCP47 EXTENSIONS

Unicode extensions

Example:

Known BCP47 language extensions as defined in RFC6067 are as follows:

Transform extensions

This is used for transliterations, transcriptions, translations, etc., as per RFC6497.

For example:

The complete list of valid subtags is as follows. They are all two to eight alphanumeric characters.

Collation Options

Parametric settings can be specified in language tags or in rule syntax (in the form [keyword value] ). For example, -ks-level2 or [strength 2] will only compare strings based on their primary and secondary weights.

The description of the options below is taken from the LDML standard, and reflects how the algorithm works when implemented by web browsers or other runtime environments. This module does not implement any of those algorithms. The documentation is only here for your benefit and convenience.

See the standard documentation and the DUCET (Default Unicode Collation Element Table) for more information.
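To make the effect of a strength setting such as -ks-level2 more concrete, the sketch below uses the separate core module Unicode::Collate, whose level option plays a comparable role. This is an illustration only: Locale::Unicode does not perform collation, and equating ks values with Unicode::Collate levels is an assumption made for this example.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $primary   = Unicode::Collate->new( level => 1 );
my $secondary = Unicode::Collate->new( level => 2 );

my $plain    = "resume";
my $accented = "re\x{301}sume\x{301}";   # "resume" with combining acute accents
my $upper    = "RESUME";

# Level 1 compares primary weights only: accents and case are ignored.
print( "level 1, plain vs accented: ",
    ( $primary->eq( $plain, $accented ) ? 'equal' : 'different' ), "\n" );
# Level 2 adds secondary weights: accents now matter, case still does not.
print( "level 2, plain vs accented: ",
    ( $secondary->eq( $plain, $accented ) ? 'equal' : 'different' ), "\n" );
print( "level 2, plain vs upper:    ",
    ( $secondary->eq( $plain, $upper ) ? 'equal' : 'different' ), "\n" );
```

With the default collation table, the three lines report equal, different and equal respectively, which matches the statement that level 2 compares strings on their primary and secondary weights only.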

AUTHOR

Jacques Deguest <jack@deguest.jp>

SEE ALSO

https://github.com/unicode-org/cldr/tree/main/common/bcp47, https://en.wikipedia.org/wiki/IETF_language_tag

https://www.rfc-editor.org/info/bcp47

Unicode Locale Data Markup Language

BCP47

RFC6067 on the Unicode extensions

RFC6497 on the transformation extension

COPYRIGHT & LICENSE

Copyright(c) 2024 DEGUEST Pte. Ltd.

All rights reserved

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.