NAME
Data::Kanji::Tomoe - parse the data files of the Tomoe project
SYNOPSIS
my $tomoe = Data::Kanji::Tomoe->new (
tomoe_data_file => '/path/to/data/file',
character_callback => \& user_callback,
);
$tomoe->parse ();
DESCRIPTION
This Perl module parses the kanji or hanzi data files supplied with the Tomoe "handwriting recognition engine".
The data itself is not supplied with this module.
Parsing is based on XML::Parser. The module breaks the Tomoe data into individual characters and calls a user-supplied subroutine with the data for each character.
METHODS
new
my $obj = Data::Kanji::Tomoe->new ();
my $obj = Data::Kanji::Tomoe->new (
tomoe_data_file => '/path/to/data/file',
character_callback => \& user_callback,
);
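Create the object. The arguments are a hash of keys and values. The name of the data file to be parsed must be supplied under the key tomoe_data_file. The subroutine to be called for each character is supplied under the key character_callback.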
parse
$tomoe->parse ();
Parse the XML in the Tomoe data file.
As each <character>...</character> element is parsed from the file, the callback specified by character_callback is called in the form
&{$callback} ($obj, $character);
where $obj is the Data::Kanji::Tomoe object and $character is a hash reference with the following keys and values.
- utf8
Value: the character itself.
- strokes
Value: an array reference containing the strokes of the character. Each element is a reference to an array of the points of one stroke, and each point is in turn a reference to a two-element array holding its x and y values. So, for example, if the original Tomoe data consists of
<strokes>
  <stroke>
    <point x="1" y="2"/>
    <point x="3" y="4"/>
  </stroke>
  <stroke>
    <point x="5" y="6"/>
  </stroke>
</strokes>
then $character->{strokes} contains something like
[[[1, 2], [3, 4]], [[5, 6]]]
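The nested structure described above can be walked with two loops, one over strokes and one over the points of each stroke. The following is a minimal sketch rather than part of the module: the helper stroke_summary is a hypothetical name, and the character hash is built by hand to match the documented example, so the code runs without the Tomoe data.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Data::Kanji::Tomoe): summarise the
# "strokes" value of a character hash reference.
sub stroke_summary
{
    my ($character) = @_;
    my $strokes = $character->{strokes};
    my $n_points = 0;
    my ($min_x, $min_y, $max_x, $max_y);
    # Outer loop: each stroke is a reference to an array of points.
    for my $stroke (@$strokes) {
        # Inner loop: each point is a reference to an array [x, y].
        for my $point (@$stroke) {
            my ($x, $y) = @$point;
            $n_points++;
            $min_x = $x if !defined $min_x || $x < $min_x;
            $min_y = $y if !defined $min_y || $y < $min_y;
            $max_x = $x if !defined $max_x || $x > $max_x;
            $max_y = $y if !defined $max_y || $y > $max_y;
        }
    }
    return {
        strokes => scalar @$strokes,
        points => $n_points,
        box => [$min_x, $min_y, $max_x, $max_y],
    };
}

# Hand-built data with the same shape as the documented example.
my $character = {
    strokes => [[[1, 2], [3, 4]], [[5, 6]]],
};
my $summary = stroke_summary ($character);
print "$summary->{strokes} strokes, $summary->{points} points\n";
# prints "2 strokes, 3 points"
```

In real use, a subroutine like this would be called from inside the character callback with the $character argument it receives.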
Any data which the user wishes to pass to the callback can be carried in the object itself:
my $obj = Data::Kanji::Tomoe->new (
tomoe_data_file => '/path/to/data/file',
character_callback => \& user_callback,
data_I_wish_to_send => \%data,
);
$obj->parse ();
sub user_callback
{
my ($obj, $c) = @_;
my $data = $obj->{data_I_wish_to_send};
}
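The same pattern also works in the other direction, with the callback storing results in the user's data for use after parsing. The sketch below is illustrative only: count_strokes is a hypothetical callback, and a plain hash reference stands in for the Data::Kanji::Tomoe object so that the example runs without the module or the Tomoe data. With the real module, parse () would make these calls.

```perl
use strict;
use warnings;

# Hypothetical callback with the documented signature: the object
# first, then the character hash reference.
sub count_strokes
{
    my ($obj, $c) = @_;
    my $data = $obj->{data_I_wish_to_send};
    # Record the number of strokes under the character itself.
    $data->{$c->{utf8}} = scalar @{$c->{strokes}};
}

my %data;

# Stand-in for the object (an assumption for this sketch only): a
# plain hash reference carrying the user's data under the same key.
my $stand_in = {data_I_wish_to_send => \%data};

# Hand-built characters in the documented format.
my @characters = (
    {utf8 => 'A', strokes => [[[1, 2], [3, 4]], [[5, 6]]]},
    {utf8 => 'B', strokes => [[[0, 0], [9, 9]]]},
);

# Simulate what parse () does for each <character> element.
count_strokes ($stand_in, $_) for @characters;

for my $char (sort keys %data) {
    print "$char: $data{$char}\n";
}
# prints "A: 2" then "B: 1"
```

Because the callback receives the object on every call, no global variables are needed to collect the results.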
SEE ALSO
Tomoe
The Tomoe "handwriting recognition engine" is located at http://tomoe.sourceforge.jp. The data files can be downloaded from this location. The most recent update of the software was on 29 June 2007, and the project is currently dormant. For queries about Tomoe, try the mailing list for the "Tegaki Handwriting Recognition Project", a similar project involving some of the same people.
Other sources of kanji shape data
Those who are new to this field and are considering what data to use should note that the Tomoe data for the Japanese kanji is currently unmaintained and contains many errors. A far better set of data for most purposes is the data of the KanjiVG project. A sister parser to the current project is on CPAN as Data::Kanji::KanjiVG. Users who do not have a specific reason to use the Tomoe data are strongly recommended not to use it.
Scripts
The git repository for this project contains a script and a schema for inserting the Tomoe files into a SQLite database, as well as a script for extracting an individual character and drawing a PNG graphic of it. These files are not included in the CPAN distribution of this module and are not supported as part of it. Users will need to modify the scripts to make use of them.
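For readers who only want a quick look at a character's shape, the stroke data can be turned into a picture with a few lines of Perl. The following sketch is not the repository's PNG script and uses no graphics library. It writes SVG instead, and the helper name strokes_to_svg and the canvas size of 1000 are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module or its repository
# scripts: convert the documented "strokes" structure into SVG text,
# one <polyline> element per stroke.
sub strokes_to_svg
{
    my ($character, $size) = @_;
    my $svg = qq(<svg xmlns="http://www.w3.org/2000/svg" width="$size" height="$size">\n);
    for my $stroke (@{$character->{strokes}}) {
        # Each point is [x, y]; SVG wants "x,y" pairs separated by spaces.
        my $points = join ' ', map { "$_->[0],$_->[1]" } @$stroke;
        $svg .= qq(<polyline points="$points" fill="none" stroke="black"/>\n);
    }
    $svg .= "</svg>\n";
    return $svg;
}

# Hand-built data in the documented format; the size is an assumed value.
my $svg = strokes_to_svg ({strokes => [[[1, 2], [3, 4]], [[5, 6]]]}, 1000);
print $svg;
```

The canvas size should be adjusted to match the coordinate range found in the data file actually being parsed.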
AUTHOR
Ben Bullock, <bkb@cpan.org>
COPYRIGHT & LICENCE
This package and associated files are copyright (C) 2012-2013 Ben Bullock.
You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.