NAME
Lingua::JA::Name::Splitter - split a Japanese name into given and family
SYNOPSIS
use Lingua::JA::Name::Splitter 'split_kanji_name';
my ($family, $given) = split_kanji_name ('風太郎');
# Now $family = 風 and $given = 太郎.
DESCRIPTION
This module attempts to split the names of Japanese people into given and family names.
FUNCTIONS
split_kanji_name
my ($family, $given) = split_kanji_name ('渡辺純子');
# Now $family = 渡辺 and $given = 純子
Native Japanese writing does not use spaces, so a name appears as an unbroken string of characters. This function guesses where the family name ends and the given name begins. The guess is based on a simple algorithm, so it suits those who need to process large numbers of names quickly. Its output is not reliable and must be checked by a human.
The heuristics used are as follows. The first character is assumed to belong to the family name, and the last character to the given name. When there are more than two characters in the name, hiragana are assumed to be part of the given name. Kanji are weighted by their distance from the beginning of the name. A dictionary giving the probability that a kanji belongs to a family or a given name is also used to weight some characters. The name is then split at the first character which seems more likely to be part of the given name.
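The following is a minimal sketch of heuristics of this kind, not the module's actual code. The function name sketch_split_kanji, the %given_prob table with its invented numbers, the equal weighting of position and dictionary, and the 0.5 threshold are all assumptions made for illustration.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical probabilities that a kanji belongs to a GIVEN name.
# These numbers are invented for this sketch; they are not Enamdict data.
my %given_prob = (
    '子' => 0.95,
    '郎' => 0.9,
    '太' => 0.8,
    '純' => 0.8,
    '渡' => 0.1,
    '辺' => 0.1,
);

sub sketch_split_kanji
{
    my ($name) = @_;
    my @chars = split //, $name;
    my $n = scalar @chars;
    # A single character cannot be split.
    return ($name, '') if $n < 2;
    # Default: only the last character belongs to the given name.
    my $split = $n - 1;
    # The first character is always family and the last always given,
    # so only the characters in between are scored.
    for my $i (1 .. $n - 2) {
        my $c = $chars[$i];
        my $score;
        if ($c =~ /\p{Hiragana}/) {
            # Hiragana are treated as part of the given name.
            $score = 1;
        }
        else {
            # Assumed weighting: the further from the start of the
            # name, the more likely the kanji is in the given name.
            my $position = $i / ($n - 1);
            # Kanji missing from the table are treated as neutral.
            my $dict = $given_prob{$c} // 0.5;
            $score = ($position + $dict) / 2;
        }
        # Split at the first character which looks like a given name.
        if ($score > 0.5) {
            $split = $i;
            last;
        }
    }
    my $family = join '', @chars[0 .. $split - 1];
    my $given = join '', @chars[$split .. $n - 1];
    return ($family, $given);
}

binmode STDOUT, ':encoding(UTF-8)';
my ($family, $given) = sketch_split_kanji ('渡辺純子');
print "$family / $given\n";
# Prints "渡辺 / 純子" with the invented table above.
```

With a realistic dictionary the weights would come from counts of how often each kanji appears in family and given names, and the result would still need checking by a human.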
split_romaji_name
my ($first, $last) = split_romaji_name ('KATSU, Shintaro');
# $first = Shintaro, $last = Katsu
($first, $last) = split_romaji_name ('Risa Yoshiki');
# $first = Risa, $last = Yoshiki
Given a string containing the name of a Japanese person in romanized form, guess which part is the first (given) name and which part is the last (family) name, using the spaces, capitalization and commas in the name.
Japanese people write their names in a variety of romanized formats, such as "KATSU, Shintaro", "Shintaro Katsu", "KATSU Shintaro", or even "ShintaroKATSU". This function is intended as a "rock breaker" for processing a large number of Japanese names in romanized form. Its output needs to be checked by a human.
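The following is a rough sketch of how the four formats mentioned above might be told apart using commas, spaces and capitalization. It is not the module's implementation. The name sketch_split_romaji, the regular expressions, and the case normalization of the result are assumptions made for illustration.

```perl
use strict;
use warnings;

sub sketch_split_romaji
{
    my ($name) = @_;
    # Trim leading and trailing whitespace.
    $name =~ s/^\s+|\s+$//g;
    my ($first, $last);
    if ($name =~ /^([^,]+?)\s*,\s*(.+)$/) {
        # "KATSU, Shintaro": the part before the comma is taken to be
        # the family name.
        ($last, $first) = ($1, $2);
    }
    elsif ($name =~ /^(\S+)\s+(\S+)$/) {
        my ($x, $y) = ($1, $2);
        if ($x =~ /^[A-Z]+$/ && $y !~ /^[A-Z]+$/) {
            # "KATSU Shintaro": an all-capitals first word is taken
            # to be the family name.
            ($last, $first) = ($x, $y);
        }
        else {
            # "Shintaro Katsu": assume given name first.
            ($first, $last) = ($x, $y);
        }
    }
    elsif ($name =~ /^([A-Z][a-z]+)([A-Z]{2,})$/) {
        # "ShintaroKATSU": split where the capitals begin.
        ($first, $last) = ($1, $2);
    }
    else {
        # No guess is possible; return an empty list.
        return;
    }
    # Assumed normalization: capitalize the first letter only.
    return (ucfirst (lc ($first)), ucfirst (lc ($last)));
}

for my $name ('KATSU, Shintaro', 'Shintaro Katsu', 'KATSU Shintaro',
              'ShintaroKATSU') {
    my ($first, $last) = sketch_split_romaji ($name);
    print "$name: first = $first, last = $last\n";
}
# Each line ends with "first = Shintaro, last = Katsu".
```

A name which fits none of the patterns, such as a single word, yields an empty list here, which makes it easy to set such names aside for a human to check.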
SEE ALSO
For people who've stumbled upon this module by accident and wonder why anyone would need a Japanese name splitter, see the Sci.lang.japan FAQ on Japanese names or the Wikipedia page on Japanese names.
The dictionary data used to split the names is taken from Enamdict; see ENAMDICT/JMnedict Japanese Proper Names Dictionary Files. The script which makes the dictionary, enamdict-counter.pl, is in the module's repository, but it is not included in the distribution itself.
AUTHOR
Ben Bullock, <bkb@cpan.org>
COPYRIGHT & LICENCE
This package and associated files are copyright (C) 2012-2013 Ben Bullock.
You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.