NAME
Text::Transliterator::Unaccent - Compile a transliterator from Unicode tables, to remove accents from text
SYNOPSIS
my $unaccenter = Text::Transliterator::Unaccent->new(script    => 'Latin',
                                                     wide      => 0,
                                                     upper     => 0,
                                                     modifiers => 'r');
$unaccenter->($string);
my $map = Text::Transliterator::Unaccent->char_map(script => 'Latin');
my $descr = Text::Transliterator::Unaccent->char_map_descr();
DESCRIPTION
This package compiles a transliteration function that replaces accented characters with unaccented characters. That function is fast, because it uses Perl's builtin tr/.../.../ operator; it is compact, because it only treats the Unicode subset that you need for your language; and it is complete, because it relies on the builtin Unicode character tables shipped with your Perl installation.
The algorithm for detecting accented characters is derived from the notion of compositions in Unicode; that notion is explained in perluniintro. Characters considered "accented" are the precomposed characters for which the Unicode canonical decomposition contains more than one codepoint; for such decompositions, the first codepoint is the unaccented character to which the accented one will be mapped. This definition seems to work well for the Latin script; I presume that it also makes sense for other scripts, but I'm not able to test.
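The rule above can be sketched with the core Unicode::Normalize module. This is only an illustration of the definition, not necessarily how the module computes its map; the helper name unaccented_base is hypothetical and not part of this module's API.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not part of Text::Transliterator::Unaccent):
# returns the first codepoint of the canonical decomposition when that
# decomposition has more than one codepoint, and undef otherwise.
sub unaccented_base {
    my ($char) = @_;
    my @parts = split //, NFD($char);   # full canonical decomposition
    return @parts > 1 ? $parts[0] : undef;
}

# U+00E9 and U+01D6 are precomposed accented letters; U+0065 is a plain 'e'.
for my $cp (0xE9, 0x1D6, 0x65) {
    my $base = unaccented_base(chr $cp);
    printf "U+%04X => %s\n", $cp,
        defined $base ? sprintf('U+%04X', ord $base) : 'not accented';
}
```

Note that NFD applies the decomposition recursively, so a character carrying several accents still yields its plain base letter as the first codepoint.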
METHODS
new
my $unaccenter = Text::Transliterator::Unaccent->new(%options);
# or
my $unaccenter = Text::Transliterator::Unaccent->new(); # script => 'Latin'
Compiles a new 'unaccenter' function. Valid %options are:
script => $unicode_script
$unicode_script is the name of a Unicode script, such as 'Latin', 'Greek' or 'Cyrillic'. For a complete list of Unicode scripts, see
perl -MUnicode::UCD=charscripts -e "print join ', ', keys %{charscripts()}"
block => $unicode_block
$unicode_block is the name of a Unicode block. For a complete list of Unicode blocks, see
perl -MUnicode::UCD=charblocks -e "print join ', ', keys %{charblocks()}"
range => \@codepoint_ranges
@codepoint_ranges is a list of arrayrefs that contain start-of-range, end-of-range code point pairs.
wide => $bool
Decides whether wide characters (i.e. characters with code points above 255) are kept within the map. The default is true.
upper => $bool
Decides whether uppercase characters are kept within the map. The default is true.
lower => $bool
Decides whether lowercase characters are kept within the map. The default is true.
modifiers => $string
Any combination of the cdsr modifiers to the tr/.../.../ operator. In particular, the 'r' modifier may be used to specify that transliterated strings should be returned as new strings instead of modifying the input strings in place.
%options may contain a list of several scripts, blocks and/or ranges; all will get concatenated into a single correspondence map. If the list is empty, the default range is script => 'Latin'.
Unlike usual object-oriented modules, here the return value from the new method is a reference to a function, not an object. That function should be called as
$unaccenter->(@strings);
By default every member of @strings is modified in place, as with the tr/.../.../ operator, unless the r modifier is present.
The function returns the list of results of the tr/.../.../ operation on each of the input strings. By default this will be the number of transliterated characters for each string. If the r modifier is present, the return value is the list of transliterated strings. In scalar context, the last member of the list is returned (for compatibility with the previous API).
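The calling conventions described above can be tried out without the Unicode tables. The following is a minimal sketch under stated assumptions: make_transliterator is a hypothetical stand-in built from a tiny hand-written map, and compiling tr/.../.../ through a string eval is an assumption about the general technique, not a description of this module's internals.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the function returned by new(); the map is
# hand-written here instead of being derived from the Unicode tables.
sub make_transliterator {
    my ($from, $to, $modifiers) = @_;
    $modifiers //= '';
    die "unsupported modifiers: $modifiers" unless $modifiers =~ /^[cdsr]*$/;

    # tr/// builds its table at compile time, so a variable map needs a
    # string eval. Every character is written as a \x{...} escape to keep
    # the generated source pure ASCII.
    my $escape = sub { join '', map { sprintf '\\x{%X}', ord } split //, shift };
    my $code = sprintf
        'sub { my @r; push @r, tr/%s/%s/%s for @_; wantarray ? @r : $r[-1] }',
        $escape->($from), $escape->($to), $modifiers;
    my $sub = eval $code or die $@;
    return $sub;
}

# Default behaviour: arguments are modified in place, counts are returned.
my $in_place = make_transliterator("\x{E9}\x{E0}", "ea");
my $string   = "caf\x{E9} d\x{E9}j\x{E0}";
my @counts   = $in_place->($string);

# With the 'r' modifier: arguments are left alone, new strings are returned.
my $copying = make_transliterator("\x{E9}\x{E0}", "ea", 'r');
my $source  = "d\x{E9}j\x{E0}";
my @results = $copying->($source, "\x{E9}t\x{E9}");
my $last    = $copying->($source, "\x{E9}t\x{E9}");   # scalar context
```

In-place modification works because the elements of @_ are aliases to the caller's variables; the 'r' modifier requires Perl 5.14 or later.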
char_map
my $map = Text::Transliterator::Unaccent->char_map(@range_description);
Utility class method that returns a hashref of the accented characters in @range_description, mapped to their corresponding unaccented characters, according to the algorithm described in the introduction. The @range_description format is exactly the same as for the new() method.
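To give an idea of the shape of such a hashref, here is a rough sketch that builds a comparable map for explicit code point ranges, using the core Unicode::Normalize module. The function sketch_char_map is hypothetical; it ignores the script, block, wide, upper and lower options, and its output is not guaranteed to match what char_map returns.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not the module's char_map): takes a list of
# [start, end] code point pairs and returns a hashref that maps each
# decomposable character to the first character of its decomposition.
sub sketch_char_map {
    my @ranges = @_;
    my %map;
    for my $range (@ranges) {
        my ($start, $end) = @$range;
        for my $cp ($start .. $end) {
            my $char   = chr $cp;
            my $decomp = NFD($char);
            next if length($decomp) < 2;        # no accent to remove
            $map{$char} = substr($decomp, 0, 1);
        }
    }
    return \%map;
}

# Accented letters of the Latin-1 Supplement block.
my $map = sketch_char_map([0xC0, 0xFF]);
printf "U+%04X => %s\n", ord($_), $map->{$_} for sort keys %$map;
```

Characters such as U+00DF or U+00F8 do not appear in the result, because Unicode defines no canonical decomposition for them.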
char_map_descr
my $descr = Text::Transliterator::Unaccent->char_map_descr(@range_descr);
Utility class method that returns a textual description of the map generated by @range_descr.
SEE ALSO
Text::Unaccent is another unaccenter module, with a C and a Pure Perl version. It is based on iconv instead of Perl's internal Unicode tables, and therefore may produce slightly different results. According to some experimental benchmarks, the C version of Text::Unaccent is faster than Text::Transliterator::Unaccent on short strings and small numbers of calls, and slower on long strings or large numbers of calls (but this may be a side effect of the fact that it returns a copy of the string instead of replacing characters in place). However, I am not able to give a predictable rule about which module is faster in which circumstances.
Text::StripAccents is a Pure Perl module. It only handles Latin1, and is several orders of magnitude slower because it does an internal split and join of the whole string.
Search::Tokenizer uses the present module for building an unaccent tokenizer.
AUTHOR
Laurent Dami, <dami@cpan.org>
LICENSE AND COPYRIGHT
Copyright 2010-2025 Laurent Dami.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.