NAME
Data::Kanji::Kanjidic - parse the "kanjidic" kanji data file
SYNOPSIS
use Data::Kanji::Kanjidic 'parse_kanjidic';
my $kanji = parse_kanjidic ('/path/to/kanjidic');
for my $k (keys %$kanji) {
print "$k has radical number $kanji->{$k}{B}.\n";
}
FUNCTIONS
parse_kanjidic
my $kanjidic = parse_kanjidic ('kanjidic');
The input is the name of the file containing Kanjidic. The return value is a hash reference whose keys are kanji, encoded as Unicode. Each value is a hash reference representing one line of Kanjidic, namely the entry for the kanji in its key, with the keys described in "parse_entry".
This function assumes that the kanjidic file is encoded using the EUC-JP encoding.
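As a minimal sketch of what the EUC-JP assumption means in practice, the following standalone code writes an abbreviated sample line to a temporary file as EUC-JP bytes and reads it back through Perl's ":encoding" layer, which yields Unicode strings. It does not use this module; the sample line and the temporary-file handling are assumptions made for illustration, not the module's implementation.

```perl
use strict;
use warnings;
use utf8;
use Encode 'encode';
use File::Temp 'tempfile';

# An abbreviated, illustrative line in the style of Kanjidic (not a full entry).
my $line = "亜 3021 U4e9c B1 C7 G8 S7 {Asia}";

# Write the line as EUC-JP bytes, the encoding the real file is assumed to use.
my ($fh, $filename) = tempfile (UNLINK => 1);
binmode $fh;
print $fh encode ('EUC-JP', $line), "\n";
close $fh or die $!;

# Read it back, decoding EUC-JP into Perl's internal Unicode strings.
open my $in, "<:encoding(EUC-JP)", $filename or die $!;
my $decoded = <$in>;
close $in or die $!;
chomp $decoded;

binmode STDOUT, ":utf8";
print "First character: ", substr ($decoded, 0, 1), "\n";
```

If a copy of the file has already been converted to UTF-8, it would presumably need converting back to EUC-JP before being passed to parse_kanjidic.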
parse_entry
my %values = parse_entry ($line);
Parse one line of Kanjidic. The input is one line from Kanjidic, encoded as Unicode. The return value is a hash containing each field from the line.
The possible keys and values of the returned hash are as follows. Values are scalars unless otherwise mentioned.
- kanji
-
The kanji itself (the same as the key).
- B
-
Bushu (radical as defined by the Nelson kanji dictionary).
- C
-
Classic radical (the usual radical, where this is different from the Nelson radical).
- DB
-
Japanese for Busy People textbook numbers.
- DC
-
The index numbers used in "The Kanji Way to Japanese Language Power" by Dale Crowley.
- DF
-
"Japanese Kanji Flashcards", by Max Hodges and Tomoko Okazaki.
- DG
-
The index numbers used in the "Kodansha Compact Kanji Guide".
- DH
-
The index numbers used in the 3rd edition of "A Guide To Reading and Writing Japanese" edited by Kenneth Henshall et al.
- DJ
-
The index numbers used in "Kanji in Context" by Nishiguchi and Kono.
- DK
-
The index numbers used by Jack Halpern in his Kanji Learners Dictionary.
- DM
-
The index numbers from the French-language version of "Remembering the kanji".
- DO
-
The index numbers used in P.G. O'Neill's Essential Kanji.
- DR
-
The codes developed by Father Joseph De Roo, and published in his book "2001 Kanji" (Bonjinsha).
- DS
-
The index numbers used in the early editions of "A Guide To Reading and Writing Japanese" edited by Florence Sakade.
- DT
-
The index numbers used in the Tuttle Kanji Cards, compiled by Alexander Kask.
- E
-
The numbers used in Kenneth Henshall's kanji book.
- F
-
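Frequency-of-use ranking of the kanji. A lower number means a more frequently used kanji; kanji without a ranking have no F field.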
The following example program prints a list of kanji from most to least frequently used.
use Data::Kanji::Kanjidic 'parse_kanjidic';
my $kanji = parse_kanjidic ('/path/to/kanjidic');
my @sorted;
for my $k (keys %$kanji) {
    if ($kanji->{$k}->{F}) {
        push @sorted, $kanji->{$k};
    }
}
@sorted = sort {$a->{F} <=> $b->{F}} @sorted;
binmode STDOUT, ":utf8";
for (@sorted) {
    print "$_->{kanji}: $_->{F}\n";
}
- G
-
Year of elementary school.
- H
-
Index number in Jack Halpern's "New Japanese-English Character Dictionary".
- I
-
The Spahn-Hadamitzky book number.
- IN
-
The Spahn-Hadamitzky kanji-kana book number.
- J
-
Japanese Language Proficiency Test (JLPT) level.
- K
-
The index in the Gakken Kanji Dictionary (A New Dictionary of Kanji Usage).
- L
-
Code from "Remembering the Kanji" by James Heisig.
- MN
-
Morohashi index number.
- MP
-
Morohashi volume/page.
- N
-
Nelson code from original Nelson dictionary.
- O
-
The numbers used in P.G. O'Neill's "Japanese Names". This may take multiple values, so the value is an array reference.
- P
-
SKIP code.
- Q
-
Four-corner code. This may take multiple values, so the value is an array reference.
- S
-
Stroke count. This may take multiple values, so the value is an array reference.
- T
-
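Marker for special readings. In Kanjidic, T1 precedes name (nanori) readings and T2 precedes a radical name.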
- U
-
Unicode code point as a hexadecimal number.
- V
-
Nelson code from the "New Nelson" dictionary. This may take multiple values, so the value is an array reference.
- W
-
Korean pronunciation. This may take multiple values, so the value is an array reference.
The following example program prints a list of Korean pronunciations, romanized (requires Lingua::KO::Munja).
use Data::Kanji::Kanjidic 'parse_kanjidic';
use Lingua::KO::Munja ':all';
# 강남스타일
binmode STDOUT, ":utf8";
my $kanji = parse_kanjidic ($ARGV[0]);
for my $k (sort keys %$kanji) {
    my $w = $kanji->{$k}->{W};
    if ($w) {
        my @h = map {'"' . hangul2roman ($_) . '"'} @$w;
        print "$k is Korean ", join (", ", @h), "\n";
    }
}
- X
-
Cross reference.
- XDR
-
De Roo cross-reference. This may take multiple values, so the value is an array reference.
- XH
-
Cross-reference. This may take multiple values, so the value is an array reference.
- XI
-
Cross-reference.
- XJ
-
Cross-reference. This may take multiple values, so the value is an array reference.
- XN
-
Nelson cross-reference. This may take multiple values, so the value is an array reference.
- XO
-
Cross-reference.
- Y
-
Pinyin pronunciation. This may take multiple values, so the value is an array reference.
- ZBP
-
SKIP misclassification code, for a likely error in both position and stroke count. This may take multiple values, so the value is an array reference.
- ZPP
-
SKIP misclassification code, for a likely error in position. This may take multiple values, so the value is an array reference.
- ZRP
-
SKIP misclassification code, for cases where the stroke count is ambiguous or disputed. This may take multiple values, so the value is an array reference.
- ZSP
-
SKIP misclassification code, for a likely error in stroke count. This may take multiple values, so the value is an array reference.
- kokuji
-
This has a true value (1) if the character is marked as a "kokuji" in Kanjidic. See http://www.sljfaq.org/afaq/kokuji.html.
- english
-
This contains an array reference to the English-language meanings given in Kanjidic. It may be undefined, if there are no English-language meanings listed.
# The following "joke" program converts English into kanji.
# Call it like "english-to-kanji.pl /where/is/kanjidic english-text".
use Data::Kanji::Kanjidic 'parse_kanjidic';
use Convert::Moji 'make_regex';
my $kanji = parse_kanjidic ($ARGV[0]);
my %english;
for my $k (keys %$kanji) {
    my $english = $kanji->{$k}->{english};
    if ($english) {
        for (@$english) {
            push @{$english{$_}}, $k;
        }
    }
}
my $re = make_regex (keys %english);
open my $in, "<", $ARGV[1] or die $!;
while (<$in>) {
    s/\b($re)\b/$english{$1}[int rand (@{$english{$1}})]/ge;
    print;
}
Given input like this,
This is an example of the use of "english-to-kanji.pl", a program which converts English words into kanji. This may or may not be regarded as a good idea. What do you think?
it outputs this:
This is an 鑒 之 彼 使 之 "english負to負kanji.pl", a program 孰 converts 英 辭 into kanji. This 得 将 得 無 跨 regarded as a 臧 見. What 致 尓 憶?
- onyomi
-
This is an array reference which contains the on'yomi (音読) of the kanji. (See http://www.sljfaq.org/afaq/kanji-pronunciation.html.) It may be undefined, if no on'yomi readings are listed. The on'yomi readings are in katakana, as per Kanjidic itself. They are encoded in Perl's internal Unicode encoding.
The following example prints a list of kanji which have the same on'yomi:
binmode STDOUT, ":utf8";
use Data::Kanji::Kanjidic 'parse_kanjidic';
my $kanji = parse_kanjidic ($ARGV[0]);
my %all_onyomi;
for my $k (keys %$kanji) {
    my $onyomi = $kanji->{$k}->{onyomi};
    if ($onyomi) {
        for my $o (@$onyomi) {
            push @{$all_onyomi{$o}}, $k;
        }
    }
}
for my $o (sort keys %all_onyomi) {
    if (@{$all_onyomi{$o}} > 1) {
        print "Same onyomi 「$o」 for 「@{$all_onyomi{$o}}」!\n";
    }
}
- kunyomi
-
This is an array reference which contains the kun'yomi (訓読) of the kanji. (See http://www.sljfaq.org/afaq/kanji-pronunciation.html.) It may be undefined, if no kun'yomi readings are listed. The kun'yomi readings are in hiragana, as per Kanjidic itself. They are encoded in Perl's internal Unicode encoding.
- nanori
-
This is an array reference which contains nanori (名乗り) readings of the character. It may be undefined, if no nanori readings are listed. The nanori readings are in hiragana, as per Kanjidic itself. They are encoded in Perl's internal Unicode encoding.
- morohashi
-
This is a hash reference containing data on the kanji's location in the Morohashi 'Dai Kan-Wa Jiten' kanji dictionary. The hash reference has the following keys.
- volume
-
The volume number of the character.
- page
-
The page number of the character.
- index
-
The index number of the character.
If there is no information, this remains unset.
For example, to print all the existing values,
use Data::Kanji::Kanjidic 'parse_kanjidic';
use FindBin;
binmode STDOUT, ":utf8";
my $kanji = parse_kanjidic ("/path/to/kanjidic");
for my $k (sort keys %$kanji) {
    my $mo = $kanji->{$k}->{morohashi};
    if ($mo) {
        print "$k: volume $mo->{volume}, page $mo->{page}, index $mo->{index}.\n";
    }
}
For detailed explanations of these codes, see "Kanjidic".
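Because some fields hold a scalar and others an array reference, code that walks an entry generically has to handle both. The sketch below shows one way to do that. The entry is a hand-built hash in the shape described above rather than real parser output, and the helper "field_values" is a hypothetical name, not part of this module.

```perl
use strict;
use warnings;
use utf8;

# Return the values of one field as a list, whether the field is stored
# as a scalar or as an array reference. (Hypothetical helper.)
sub field_values
{
    my ($entry, $key) = @_;
    my $value = $entry->{$key};
    return () unless defined $value;
    return ref ($value) eq 'ARRAY' ? @$value : ($value);
}

# Hand-built entry for illustration only; real entries come from
# parse_kanjidic or parse_entry.
my %entry = (
    kanji => '亜',
    U => '4e9c',
    B => 1,
    S => [7],
    english => ['Asia'],
);

binmode STDOUT, ":utf8";
for my $key (sort keys %entry) {
    print "$key: ", join (", ", field_values (\%entry, $key)), "\n";
}

# The U field is hexadecimal, so hex and chr recover the character.
my $char = chr (hex ($entry{U}));
print "U+$entry{U} is $char\n";
```

Treating every field as a list in this way avoids having to remember which keys are documented as array references.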
kanjidic_order
kanjidic_order ($kanjidic_ref);
Return a list of the elements of \%kanjidic, sorted by stroke order. Also add the field "kanji_id" to each element so that the order can be reconstructed when referring to elements.
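The idea of ordering entries and recording each one's position can be sketched as follows. This is not the module's own code: the sort key used here (the first stroke count in S, with ties broken by code point), the numbering from 1, and the hand-built data are all assumptions made for illustration.

```perl
use strict;
use warnings;
use utf8;

# Hand-built data in the documented shape; stroke counts are in S.
my %kanjidic = (
    '一' => {kanji => '一', S => [1]},
    '亜' => {kanji => '亜', S => [7]},
    '二' => {kanji => '二', S => [2]},
);

# Sort the kanji by stroke count, then by code point (assumed tie-break).
my @order = sort {
    $kanjidic{$a}{S}[0] <=> $kanjidic{$b}{S}[0] || $a cmp $b
} keys %kanjidic;

# Record each kanji's position so the order can be recovered later.
my $id = 0;
for my $k (@order) {
    $kanjidic{$k}{kanji_id} = ++$id;
}

binmode STDOUT, ":utf8";
print "$kanjidic{$_}{kanji_id}: $_\n" for @order;
```

Storing the position in each entry means a later lookup by kanji can recover where that kanji falls in the ordering without sorting again.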
grade
my $grade2 = grade ($kanjidic_ref, 2);
Given a school grade such as 2 above, and the return value of "parse_kanjidic", $kanjidic_ref, return an array reference containing a list of all of the kanji from that grade.
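A selection of this kind amounts to filtering on the G field. The following sketch shows that logic on hand-built data; "kanji_in_grade" is a hypothetical stand-in written for illustration, not the module's implementation, and the sorting of the result is likewise an assumption.

```perl
use strict;
use warnings;
use utf8;

# Collect the kanji whose G field equals the requested grade.
# (Hypothetical stand-in; entries without a G field are skipped.)
sub kanji_in_grade
{
    my ($kanjidic_ref, $grade) = @_;
    my @found = grep {
        defined $kanjidic_ref->{$_}{G} && $kanjidic_ref->{$_}{G} == $grade
    } sort keys %$kanjidic_ref;
    return \@found;
}

# Hand-built data for illustration only.
my %kanjidic = (
    '一' => {kanji => '一', G => 1},
    '引' => {kanji => '引', G => 2},
    '亜' => {kanji => '亜', G => 8},
    '鑒' => {kanji => '鑒'},   # entry with no G field
);

my $grade2 = kanji_in_grade (\%kanjidic, 2);
binmode STDOUT, ":utf8";
print "Grade 2: @$grade2\n";
```

The check for a defined G field matters because not every entry carries a grade.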
SEE ALSO
Other Perl modules
- Lingua::JP::Kanjidic
-
This module parses an old version of kanjidic.
Kanjidic
The official description of kanjidic is at http://www.csse.monash.edu.au/~jwb/kanjidic.html. To download kanjidic, follow the download link on that page.
AUTHOR
Ben Bullock, <bkb@cpan.org>
COPYRIGHT & LICENCE
This package and associated files are copyright (C) 2012-2013 Ben Bullock.
You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.