lib/Text/Transliterator/Unaccent.pm

package Text::Transliterator::Unaccent;

use warnings;
use strict;

our $VERSION = "1.03";

use Text::Transliterator;
use Unicode::UCD        qw(charinfo charscript charblock);
use Unicode::Normalize  qw();

sub char_map {
  my $class = shift;

  my @all_ranges;
  my $ignore_wide  = 0;
  my $ignore_upper = 0;
  my $ignore_lower = 0;

  # decode arguments to get character ranges and boolean flags
  while (my ($kind, $arg) = splice(@_, 0, 2)) {
    my $ranges;

    my $todo = {
      script   => sub { $ranges = charscript($arg)
                          or die "$arg is not a valid Unicode script" },
      block    => sub { $ranges = charblock($arg)
                          or die "$arg is not a valid Unicode block"  },
      ranges   => sub { $ranges = $arg },
      wide     => sub { $ignore_wide  = !$arg                         },
      upper    => sub { $ignore_upper = !$arg                         },
      lower    => sub { $ignore_lower = !$arg                         },
     };
    my $coderef = $todo->{$kind}
      or die "invalid argument: $kind";
    $coderef->();
    push @all_ranges, @$ranges if $ranges;
  }

  # default
  @all_ranges = @{charscript('Latin')} if !@all_ranges;

  # build the map
  my %map;
  foreach my $range (@all_ranges) {
    my ($start, $end) = @$range;

    # iterate over characters in range
  CHAR:
    for my $c ($start .. $end) {

      # maybe drop that char under some conditions
      last CHAR if $ignore_wide and $c > 255;
      next CHAR if $ignore_upper and chr($c) =~ /\p{Uppercase_Letter}/;
      next CHAR if $ignore_lower and chr($c) =~ /\p{Lowercase_Letter}/;

      # get canonical decomposition (if any)
      my $canon  = Unicode::Normalize::getCanon($c);

      # store into map
      if ($canon && length($canon) > 1) {
        # the unaccented char is the the base (first char) of the decomposition
        my $base = substr $canon, 0, 1;
        $map{chr($c)} = $base,
      }
    }
  }

  return \%map;
}

sub char_map_descr {
  my $class = shift;

  my $map = $class->char_map(@_);

  my $txt = "";
  while (my ($k, $v) = each %$map) {
    my $accented = ord($k);
    my $base     = ord($v);
    $txt .= sprintf "U+%04x %-40s => U+%04x %s\n", 
               $accented,
               charinfo($accented)->{name},
               $base,
               charinfo($base)->{name};
  }
  return $txt;
}

sub new {
  my $class = shift;
  my $map = $class->char_map(@_);
  return Text::Transliterator->new($map)
}

1; # End of Text::Transliterator::Unaccent


__END__

=head1 NAME

Text::Transliterator::Unaccent - Compile a transliterator from Unicode tables, to remove accents from text

=head1 SYNOPSIS

  my $unaccenter = Text::Transliterator::Unaccent->new(script => 'Latin',
                                                       wide   => 0,
                                                       upper  => 0);
  $unaccenter->($string);

  my $map   = Text::Transliterator::Unaccent->char_map(script => 'Latin');

  my $descr = Text::Transliterator::Unaccent->char_map_descr();

=head1 DESCRIPTION

This package compiles a transliteration function that will replace
accented characters by unaccented characters. That function
is fast, because it uses the builtin C<tr/.../.../> Perl operator; it
is compact, because it only treats the Unicode subset that you need
for your language; and it is complete, because it relies on
the builtin Unicode character tables shipped with your Perl installation.

The algorithm for detecting accented characters is derived from the notion
of I<compositions> in Unicode; that notion is explained in L<perluniintro>.
Characters considered "accented" are the precomposed characters for
which the Unicode canonical decomposition contains more than one
codepoint; for such decompositions, the first codepoint is the
unaccented character that will be mapped to the accented one.  This
definition seems to work well for the Latin script; I presume that it
also makes sense for other scripts as well, but I'm not able to test.

=head1 METHODS

=head2 new

  my $unaccenter = Text::Transliterator::Unaccent->new(@range_description);
  # or
  my $unaccenter = Text::Transliterator::Unaccent->new(); # script => 'Latin'

Compiles a new 'unaccenter' function. The C<@range_description>
argument specifies which ranges of characters will be handled, and is
comprised of pairs of shape :

=over

=item C<< script => $unicode_script >>

C<$unicode_script> is the name of a Unicode script, such as 'Latin', 
'Greek' or 'Cyrillic'.
For a complete list of unicode scripts, see

  perl -MUnicode::UCD=charscripts -e "print join ', ', keys %{charscripts()}"

=item C<< block => $unicode_block >>

C<$unicode_block> is the name of a Unicode block. For a complete list of 
Unicode blocks, see

  perl -MUnicode::UCD=charblocks -e "print join ', ', keys %{charblocks()}"

=item C<< range => \@codepoint_ranges >>

C<@codepoint_ranges> is a list of arrayrefs that contain
I<start-of-range, end-of-range>
code point pairs.

=item C<< wide => $bool >>

Decides if wide characters (i.e. characters with code points above 255)
are kept or not within the map. The default is I<true>.

=item C<< upper => $bool >>

Decides if uppercase characters are kept or not within the map. The
default is I<true>.

=item C<< lower => $bool >>

Decides if lowercase characters are kept or not within the map. The
default is I<true>.

=back

The C<@range_description> may contain a list of several scripts,
blocks and/or ranges; all will get concatenated into a single
correspondance map.  If the list is empty, the default range is
C<< script => 'Latin' >>.

The return value from that C<new> method is actually
a reference to a function, not an object. That function is called as 

  $unaccenter->(@strings);

and modifies every member of C<@strings> I<in place>, 
like the C<tr/.../.../> operator.
The return value is the number of transliterated characters
in the last member of C<@strings>.


=head2 char_map

  my $map = Text::Transliterator::Unaccent->char_map(@range_description);

Utility class method that 
returns a hashref of the accented characters in C<@range_description>,
mapped to their unaccented corresponding characters, according to
the algorithm described in the introduction. The C<@range_description>
format is exactly like for the C<new()> method.

=head2 char_map_descr

  my $descr = Text::Transliterator::Unaccent->char_map_descr(@range_descr);

Utility class method that 
returns a textual description of the map 
generated by C<@range_descr>.

=head1 SEE ALSO

L<Text::Unaccent> is another unaccenter module, with a C and a Pure
Perl version. It is based on C<iconv> instead of Perl's internal
Unicode tables, and therefore may produce slighthly different
results. According to some experimental benchmarks, the C version of
C<Text::Unaccent> is faster than C<Text::Transliterator::Unaccent> on
short strings and on small number of calls, and slower on long strings
or high number of calls (but this may be a side-effect of the fact
that it returns a copy of the string instead of replacing characters
in-place); however I am not able to give a predictable rule about
which module is faster in which circumstances.

L<Text::StripAccents> is a Pure Perl module. In only handles Latin1, and
is several orders of magnitude slower because it does an
internal split and join of the whole string.

L<Search::Tokenizer> uses the present module for building
an C<unaccent> tokenizer.


=head1 AUTHOR

Laurent Dami, C<< <dami@cpan.org> >>

=head1 BUGS

Please report any bugs or feature requests to
C<bug-text-transliterator at rt.cpan.org>, or through the web
interface at
L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Transliterator>.
I will be notified, and then you'll automatically be notified of
progress on your bug as I make changes.


=head1 SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Text::Transliterator::Unaccent


You can also look for information at:

=over 4

=item * RT: CPAN's request tracker

L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=Text-Transliterator>

=item * AnnoCPAN: Annotated CPAN documentation

L<http://annocpan.org/dist/Text-Transliterator>

=item * CPAN Ratings

L<http://cpanratings.perl.org/d/Text-Transliterator>

=item * Search CPAN

L<http://search.cpan.org/dist/Text-Transliterator/>

=back


=head1 LICENSE AND COPYRIGHT

Copyright 2010, 2017 Laurent Dami.

This program is free software; you can redistribute it and/or modify it
under the terms of either: the GNU General Public License as published
by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.


=cut
	Global
`s`	Focus search bar
`?`	Bring up this help dialog
	GitHub
`g` `p`	Go to pull requests
`g` `i`	Go to GitHub issues (only if GitHub is preferred repository)
	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse
Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)
Keyboard Shortcuts