NAME
Text::Names::Canonicalize - Locale-aware personal name canonicalization with YAML rules, inheritance, and user overrides
VERSION
Version 0.01
SYNOPSIS
use Text::Names::Canonicalize qw(
    canonicalize_name
    canonicalize_name_struct
);
my $canon = canonicalize_name(
    "Jean d'Ormesson",
    locale => 'fr_FR',
);
# jean d'ormesson
my $struct = canonicalize_name_struct(
    "Karl von der Heide",
    locale => 'de_DE',
);
# {
#     original  => "Karl von der Heide",
#     locale    => "de_DE",
#     parts     => {
#         given   => ["karl"],
#         surname => ["von der", "heide"],
#     },
#     canonical => "karl von der heide",
# }
DESCRIPTION
Text::Names::Canonicalize provides a robust, data-driven engine for canonicalizing personal names across multiple languages and cultural conventions. It is designed for data cleaning, indexing, matching, and normalization tasks where consistent, locale-aware handling of names is essential.
The module uses declarative YAML rules for each locale, supports inheritance between locale files, detects circular includes, and allows users to override or extend rules via configuration files.
A command-line tool text-names-canonicalize is included for interactive use.
FEATURES
Locale-aware name canonicalization
YAML-based rule definitions
Inheritance between locale files (include:)
Circular-include detection
User override rules via $CONFIG_DIR or ~/.config
Multi-word particle handling (e.g. von der, d', l')
Tokenization and surname-strategy engine
CLI tool with --explain and --rules
FUNCTIONS
canonicalize_name( $name, %opts )
Returns a canonicalized string form of the name.
my $canon = canonicalize_name("John Mc Donald", locale => 'en_GB');
Options:
locale - Locale code (e.g. en_GB, fr_FR, de_DE). Defaults to en_GB.
canonicalize_name_struct( $name, %opts )
Returns a structured hashref describing the canonicalization process:
{
    original  => "...",
    locale    => "...",
    parts     => {
        given   => [...],
        surname => [...],
    },
    canonical => "...",
}
Useful for debugging, testing, and downstream processing.
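As an illustration of downstream processing, the following sketch consumes a hashref shaped like the one shown in the SYNOPSIS. The structure is hard-coded so the example runs without the module installed, and the helper sort_key is hypothetical, not part of the module's API.

```perl
use strict;
use warnings;

# A result shaped like the canonicalize_name_struct() output shown in the
# SYNOPSIS; hard-coded here so the sketch runs without the module installed.
my $struct = {
    original  => "Karl von der Heide",
    locale    => "de_DE",
    parts     => {
        given   => ["karl"],
        surname => ["von der", "heide"],
    },
    canonical => "karl von der heide",
};

# Hypothetical helper (not part of the module): build a "surname, given"
# key for sorting or indexing from the structured parts.
sub sort_key {
    my ($s) = @_;
    my $surname = join ' ', @{ $s->{parts}{surname} // [] };
    my $given   = join ' ', @{ $s->{parts}{given}   // [] };
    return length($given) ? "$surname, $given" : $surname;
}

print sort_key($struct), "\n";    # von der heide, karl
```

Working from parts rather than re-splitting canonical keeps multi-word particles such as "von der" attached to the surname.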
LOCALE SYSTEM
Locale rules are stored as YAML files under:
Text/Names/Canonicalize/Rules/*.yaml
Each file contains one or more rulesets (typically default).
Inheritance
A ruleset may include one or more parent locales:
default:
  include: en_GB
  particles:
    - de
    - du
Parents are merged in order, and child keys override parent keys.
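The merge order can be pictured with the sketch below. It is not the module's implementation: it assumes rulesets are plain hashes already parsed from YAML, that include may be a single name or a list, and that a child key replaces the parent's key wholesale (a shallow merge). The names %RULES and resolve_ruleset are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical in-memory rulesets standing in for parsed YAML files.
my %RULES = (
    base  => { particles => ['de'], hyphen_policy => 'preserve' },
    en_GB => { include => 'base', strip_titles => ['mr', 'mrs'] },
    fr_FR => { include => 'en_GB', particles => ['de', 'du'] },
);

# Resolve a locale by merging its parents in order, then overlaying its
# own keys so that child keys replace parent keys (assumed shallow merge).
sub resolve_ruleset {
    my ($locale) = @_;
    my $rules = $RULES{$locale} or die "Unknown locale: $locale\n";

    my $include = $rules->{include};
    my @parents = !defined($include)       ? ()
                : ref($include) eq 'ARRAY' ? @$include
                :                            ($include);

    my %merged;
    for my $parent (@parents) {
        my $inherited = resolve_ruleset($parent);
        @merged{ keys %$inherited } = values %$inherited;
    }
    @merged{ keys %$rules } = values %$rules;
    delete $merged{include};    # bookkeeping only, not a rule
    return \%merged;
}

my $fr = resolve_ruleset('fr_FR');
print join(', ', @{ $fr->{particles} }), "\n";    # de, du
print $fr->{hyphen_policy}, "\n";                 # preserve
```

Here fr_FR replaces the inherited particles list and still picks up hyphen_policy from base through en_GB.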
Circular include detection
Circular include chains (direct or indirect) are detected and reported with a clear error message.
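One common way to detect such chains is a depth-first walk that remembers the locales currently being resolved. The sketch below shows that technique under stated assumptions: the include graph %INCLUDES, the locale names aa_AA, bb_BB and cc_CC, the function check_includes, and the error wording are all hypothetical and may differ from what the module does.

```perl
use strict;
use warnings;

# Hypothetical include graph: locale name => list of parent locales.
my %INCLUDES = (
    base  => [],
    en_GB => ['base'],
    aa_AA => ['bb_BB'],
    bb_BB => ['cc_CC'],
    cc_CC => ['aa_AA'],    # indirect cycle back to aa_AA
);

# Depth-first walk that tracks the chain currently being resolved.
# Meeting a locale that is already on the chain means a circular include.
sub check_includes {
    my ($locale, @chain) = @_;
    if (grep { $_ eq $locale } @chain) {
        die "Circular include: " . join(' -> ', @chain, $locale) . "\n";
    }
    for my $parent (@{ $INCLUDES{$locale} // [] }) {
        check_includes($parent, @chain, $locale);
    }
    return 1;
}

print "en_GB ok\n" if check_includes('en_GB');

my $ok  = eval { check_includes('aa_AA'); 1 };
my $err = $@;
print "error: $err" unless $ok;
# error: Circular include: aa_AA -> bb_BB -> cc_CC -> aa_AA
```

Carrying the chain as an argument means the error message can show the full path of the cycle, not only the locale where it was noticed.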
USER OVERRIDES
Users may override or extend locale rules by placing YAML files in:
$CONFIG_DIR/text-names-canonicalize/rules/*.yaml
or, if $CONFIG_DIR is not set:
~/.config/text-names-canonicalize/rules/*.yaml
User rules override built-in rules at the per-ruleset level.
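The lookup and override behaviour described above might be sketched as follows. This is an interpretation, not the module's code: it assumes an empty $CONFIG_DIR counts as unset, and it reads "per-ruleset level" as meaning a user ruleset replaces the built-in ruleset of the same name as a whole. The functions user_rules_dir and apply_user_overrides and the ruleset name strict are hypothetical.

```perl
use strict;
use warnings;
use File::Spec;

# Directory searched for user rule files: $CONFIG_DIR when set, otherwise
# ~/.config, followed by text-names-canonicalize/rules.
# Takes the environment as a hash so it can be called with %ENV or test data.
sub user_rules_dir {
    my (%env) = @_;
    my $base;
    if (defined($env{CONFIG_DIR}) && length($env{CONFIG_DIR})) {
        $base = $env{CONFIG_DIR};
    }
    else {
        die "HOME is not set\n" unless defined $env{HOME};
        $base = File::Spec->catdir($env{HOME}, '.config');
    }
    return File::Spec->catdir($base, 'text-names-canonicalize', 'rules');
}

# Assumed per-ruleset override: a user ruleset replaces the built-in ruleset
# of the same name entirely; other built-in rulesets are left untouched.
sub apply_user_overrides {
    my ($builtin, $user) = @_;
    return { %$builtin, %$user };
}

# Paths shown are for a Unix-like system.
print user_rules_dir(CONFIG_DIR => '/srv/conf'), "\n";
# /srv/conf/text-names-canonicalize/rules
print user_rules_dir(HOME => '/home/alice'), "\n";
# /home/alice/.config/text-names-canonicalize/rules

my $builtin = {
    default => { particles => ['de', 'du'], hyphen_policy => 'preserve' },
    strict  => { particles => [] },
};
my $user  = { default => { particles => ['de', 'du', 'des'] } };
my $rules = apply_user_overrides($builtin, $user);
print join(',', sort keys %{ $rules->{default} }), "\n";    # particles
print join(',', sort keys %$rules), "\n";                   # default,strict
```

Under this reading, a user file that redefines default must restate every key it still wants, because nothing is merged from the built-in default.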
CLI TOOL
The distribution includes a command-line utility:
text-names-canonicalize [options] "Full Name"
Options:
--locale LOCALE   Select locale (default: en_GB)
--explain         Dump structured canonicalization
--rules           Show resolved ruleset for the locale
Examples:
text-names-canonicalize "Jean d'Ormesson" --locale fr_FR
text-names-canonicalize "Karl von der Heide" --locale de_DE --explain
text-names-canonicalize --rules --locale fr_FR
YAML RULE FORMAT
Each ruleset contains:
particles - list of surname particles
suffixes - generational/professional suffixes
strip_titles - titles to remove
hyphen_policy - currently preserve
surname_strategy - e.g. last_token_with_particles
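To show how particles and a strategy such as last_token_with_particles could interact, here is a simplified sketch. It is an assumption about the general idea, not the module's algorithm: the last token is taken as the surname and the longest particle directly before it is attached. The function split_name is hypothetical, and the sketch ignores titles, suffixes, hyphens, and elided particles such as d' and l'.

```perl
use strict;
use warnings;

# Hypothetical sketch of a "last token with particles" strategy: the last
# token is the surname, and a particle directly before it is attached.
sub split_name {
    my ($name, @particles) = @_;
    my @tokens = split ' ', lc $name;
    return { given => [], surname => [] } unless @tokens;

    my @surname = (pop @tokens);

    # Try longer (multi-word) particles first so "von der" wins over "der".
    my @sorted = sort { length($b) <=> length($a) } map { lc($_) } @particles;
    for my $particle (@sorted) {
        my @words = split ' ', $particle;
        my $n     = @words;
        next if $n == 0 || $n > @tokens;

        # Compare the last $n remaining tokens with the particle's words.
        my $tail = join ' ', @tokens[ -$n .. -1 ];
        if ($tail eq join(' ', @words)) {
            splice @tokens, -$n;
            unshift @surname, $tail;
            last;
        }
    }
    return { given => [@tokens], surname => [@surname] };
}

my $parts = split_name("Karl von der Heide", 'von', 'der', 'von der');
print "given: ", join(' | ', @{ $parts->{given} }), "\n";
# given: karl
print "surname: ", join(' | ', @{ $parts->{surname} }), "\n";
# surname: von der | heide
```

Sorting particles by length before matching is what keeps a multi-word particle together as a single surname element.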
SUPPORTED LOCALES
base - shared Western defaults
en_GB - British English
en_US - American English
fr_FR - French
de_DE - German
Additional locales can be added easily by creating new YAML files.
EXTENDING
To add a new locale:
1. Create a YAML file in Rules/
2. Optionally inherit from base or another locale
3. Add locale-specific particles, titles, or suffixes
4. Write tests under t/
To override rules locally:
mkdir -p ~/.config/text-names-canonicalize/rules
cp my_rules.yaml ~/.config/text-names-canonicalize/rules/
AUTHOR
Nigel Horne, <njh at nigelhorne.com>
SEE ALSO
REPOSITORY
https://github.com/nigelhorne/Text-Names-Canonicalize
SUPPORT
This module is provided as-is without any warranty.
Please report any bugs or feature requests to bug-text-names-canonicalize at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Names-Canonicalize. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Text::Names::Canonicalize
You can also look for information at:
MetaCPAN
RT: CPAN's request tracker
https://rt.cpan.org/NoAuth/Bugs.html?Dist=Text-Names-Canonicalize
CPAN Testers' Matrix
CPAN Testers Dependencies
http://deps.cpantesters.org/?module=Text::Names::Canonicalize
LICENCE AND COPYRIGHT
Copyright 2026 Nigel Horne.
Usage is subject to licence terms.
The licence terms of this software are as follows:
Personal single user, single computer use: GPL2
All other users (including Commercial, Charity, Educational, Government) must apply in writing for a licence for use from Nigel Horne at the above e-mail.