NAME
Unicode::Casing - Perl extension to override system case changing functions
SYNOPSIS
use Unicode::Casing
    uc => \&my_uc, lc => \&my_lc,
    ucfirst => \&my_ucfirst, lcfirst => \&my_lcfirst;

no Unicode::Casing;

package foo::bar;
use Unicode::Casing -load;
sub import {
    Unicode::Casing->import(
        uc      => \&_uc,
        lc      => \&_lc,
        ucfirst => \&_ucfirst,
        lcfirst => \&_lcfirst,
    );
}
sub unimport {
    Unicode::Casing->unimport;
}
DESCRIPTION
This module allows overriding the system-defined character case changing functions. Any time something in its lexical scope would ordinarily call lc(), lcfirst(), uc(), or ucfirst(), the corresponding user-specified function will instead be called. This applies to direct calls, and to indirect calls via the \L, \l, \U, and \u escapes in double-quoted strings and regular expressions.
Each function is passed a string to change the case of, and should return the case-changed version of that string. Using, for example, \U inside the override function for uc() will lead to infinite recursion, but the standard casing functions are available via CORE::. For example,
sub my_uc {
    my $string = shift;
    print "Debugging information\n";
    return CORE::uc($string);
}
use Unicode::Casing uc => \&my_uc;
uc($foo);
gives the standard upper-casing behavior, but prints "Debugging information" first.
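The indirect route through the \U escape can be sketched in the same way. The override name marking_uc and its angle-bracket marker are hypothetical, chosen only so that calls routed through the override are easy to spot; the sketch assumes Unicode::Casing is installed.

```perl
use strict;
use warnings;

# Hypothetical override: wraps the standard result in angle brackets so
# that any call routed through it is visible in the result.
sub marking_uc {
    my $string = shift;
    return '<' . CORE::uc($string) . '>';   # CORE:: avoids recursion
}

use Unicode::Casing uc => \&marking_uc;

my $word = 'abc';

my $direct = uc($word);        # direct call
my $interp = "\U$word\E";      # indirect call via \U in a double-quoted string
my $lower  = lc('ABC');        # lc() has no override, so the standard one runs

print "$direct $interp $lower\n";
```

Only uc() is overridden here, so lc() keeps its standard behavior within the same scope.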
It is an error to not specify at least one override in the "use" statement. Any function for which no override is specified uses the standard version. It is also an error to specify more than one override for the same function.
use re 'eval' is not needed to have the inline case-changing sequences work in regular expressions.
Here's an example of a real-life application, for Turkish, that shows context-sensitive case-changing. (Because of bugs in earlier Perls, version 5.12 is required for this example to work properly.)
sub turkish_lc($) {
    my $string = shift;

    # Unless an I is before a dot_above, it turns into a dotless i (the
    # dot above being attached to the I, without an intervening other
    # Above mark; an intervening non-mark (ccc=0) would mean that the
    # dot above would be attached to that character and not the I)
    $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;

    # But when the I is followed by a dot_above, remove the dot_above so
    # the end result will be i.
    $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;

    $string =~ s/\x{130}/i/g;

    return CORE::lc($string);
}
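As a rough check of what this routine does, it can be called directly, without installing it as an override. In the sketch below the helper as_codepoints is hypothetical and exists only to keep the output in plain ASCII; turkish_lc is repeated (without its comments) so the block runs on its own.

```perl
use strict;
use warnings;

# Same logic as the turkish_lc example, comments omitted.
sub turkish_lc($) {
    my $string = shift;
    $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;
    $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
    $string =~ s/\x{130}/i/g;
    return CORE::lc($string);
}

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub as_codepoints {
    my $string = shift;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $string);
}

# Prints:
#   U+0049 -> U+0131          (I alone becomes dotless i)
#   U+0049 U+0307 -> U+0069   (I + dot above becomes plain i)
#   U+0130 -> U+0069          (capital dotted I becomes plain i)
for my $input ("I", "I\x{0307}", "\x{130}") {
    printf "%s -> %s\n", as_codepoints($input),
                         as_codepoints(turkish_lc($input));
}
```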
A potential problem with context-dependent case changing is that the routine may be passed insufficient context, especially with the in-line escapes like \L.
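A small sketch of how this can go wrong, calling the turkish_lc routine from this document directly rather than through an override. The second call imitates what could happen with a string such as "\L$word\E$suffix", where the dot above lies outside the span handed to the routine; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Same logic as the turkish_lc example in this document, comments omitted.
sub turkish_lc {
    my $string = shift;
    $string =~ s/I (?! [^\p{ccc=0}\p{ccc=Above}]* \x{0307} )/\x{131}/gx;
    $string =~ s/I ([^\p{ccc=0}\p{ccc=Above}]* ) \x{0307}/i$1/gx;
    $string =~ s/\x{130}/i/g;
    return CORE::lc($string);
}

my $dot_above = "\x{0307}";

# The routine sees the whole sequence: I + dot above becomes a plain "i".
my $full_context = turkish_lc("I$dot_above");

# The routine sees only the "I", so it produces a dotless i, and the dot
# above is appended afterwards.
my $split_context = turkish_lc("I") . $dot_above;

# %vX prints each character's code point in hex, joined by dots.
printf "full:  %vX\n", $full_context;    # full:  69
printf "split: %vX\n", $split_context;   # split: 131.307
```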
90turkish.t, which comes with the distribution, includes a full implementation of all the Turkish casing rules.
Note that there are problems with the standard case changing operation for characters whose code points are between 128 and 255. To get the correct Unicode behavior, the strings must be encoded in utf8 (which the override functions can force), or calls to the operations must be within the scope of use feature 'unicode_strings' (which is available starting in Perl version 5.12).
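The difference can be seen with core Perl alone. This sketch assumes an ASCII-based platform with no locale in effect, and uses U+00E0 as an arbitrary sample character from the affected range.

```perl
use strict;
use warnings;

# U+00E0 (LATIN SMALL LETTER A WITH GRAVE), stored as a single byte, so the
# string is not utf8-encoded internally.
my $native = "\xE0";

# Outside the scope of 'unicode_strings', the character is left unchanged.
my $plain = uc($native);

# Option 1: force the utf8 representation before changing case.
my $upgraded = $native;
utf8::upgrade($upgraded);
my $via_upgrade = uc($upgraded);

# Option 2: make the call within the scope of the feature (Perl 5.12+).
my $via_feature;
{
    use feature 'unicode_strings';
    $via_feature = uc($native);
}

printf "plain:   %02X\n", ord($plain);         # plain:   E0
printf "upgrade: %02X\n", ord($via_upgrade);   # upgrade: C0
printf "feature: %02X\n", ord($via_feature);   # feature: C0
```

An override function could apply the first option itself, calling utf8::upgrade() on its copy of the string before passing it to the CORE:: function.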
Note that there can be problems installing this module (at least on Windows) if using an old version of ExtUtils::Depends. To get around this, follow these steps:
upgrade ExtUtils::Depends
force install B::Hooks::OP::Check
force install B::Hooks::OP::PPAddr
See http://perlmonks.org/?node_id=797851.
AUTHOR
Karl Williamson, <khw@cpan.org>, with advice and guidance from various Perl 5 porters, including Paul Evans, Burak Gürsoy, Florian Ragwitz, and Ricardo Signes.
COPYRIGHT AND LICENSE
Copyright (C) 2011 by Karl Williamson
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.