NAME
Lingua::ZH::CCDICT - An interface to the CCDICT Chinese dictionary
SYNOPSIS
use Lingua::ZH::CCDICT;
my $dict = Lingua::ZH::CCDICT->new( storage => 'InMemory',
file => '/path/to/ccdict.txt',
);
DESCRIPTION
This module provides a Perl interface to the CCDICT dictionary created by Thomas Chin. This dictionary is indexed by Unicode character number, and contains information about these characters.
As of version 3.2.0 of the dictionary, it was released under the Open Publication License v0.4, without either of the optional clauses. See the CCDICT licensing statement for more details. IANAL, but I believe that the OPL, combined with the fact that this module is under the Artistic License, makes this module and the CCDICT dictionary fit both the Free Software and Open Source definitions.
The dictionary contains the following information, though not all information is available for all characters.
Radical number
The number of the radical. Always available.
Index
The total number of strokes minus the number of strokes in the radical. Always available.
Alternate radical number and index
The dictionary defines a format for storing this information, but as of version 3.2.0 it does not actually contain it for any characters.
Total stroke count
The total number of strokes in the character.
Cangjie
The Cangjie Chinese input system code.
Four Corner
The Four Corner Chinese input system code.
In addition, the dictionary contains English definitions (often multiple definitions), and romanizations for the character in different languages and systems. The romanizations available include the MacIver, Rey, Hagfa Pinyim, Siyan, and Hailu systems for Hakka, as well as the Jyutping Cantonese system and the Hanyu and Tongyong Pinyin systems for Mandarin.
The Hanyu Pinyin system was developed in Mainland China in the 1950s and officially adopted there in 1958, while the Tongyong Pinyin system is a variation on this system introduced in Taiwan in 1998. Tongyong Pinyin was adopted as Taiwan's official romanization system in 2002.
However, due to the overwhelming dominance of the Hanyu system in worldwide Chinese education, the Hanyu system is generally known simply as Pinyin. If you studied Mandarin as a foreign language, it is likely that you learned Hanyu Pinyin.
DICTIONARY BUGS
The CCDICT dictionary is distributed by Thomas Chin in a simple, but non-standard, textual ASCII-only format. Errors in the dictionary are handled by this module internally, although occasional odd entries may result in odd data. Please send bug reports to me so I can make sure that the error is actually in the dictionary, not in this code.
STORAGE
This module is capable of parsing the CCDICT format file, and can also rewrite it in a number of other formats, including XML or as a set of BerkeleyDB files.
Each storage system is implemented via a module in the Lingua::ZH::CCDICT::Storage::* class hierarchy. All of these modules are subclasses of the Lingua::ZH::CCDICT class, and implement its methods for searching the dictionary.
Some storage classes may also offer additional methods.
Storage Subclasses
The following storage subclasses are available:
Lingua::ZH::CCDICT::Storage::InMemory
This class stores the entire parsed dictionary in memory. Be forewarned, on my GNU/Linux 2.4 machine, this takes about 234 megabytes of memory. Carpe user!
The only parameter it takes is "file", which should contain the full path to a CCDICT source file.
Lingua::ZH::CCDICT::Storage::XML
This class can convert the CCDICT source file to XML, and perform searches on it using XML::Twig.
The only parameter it takes is "xml_file", which should contain the full path to a CCDICT XML file. If the parse_source_file method is called, then the XML file specified by the "xml_file" parameter will be created, overwriting any existing data.
This class is quite memory-efficient but searches are painfully slow.
Lingua::ZH::CCDICT::Storage::BerkeleyDB
This class can convert the CCDICT source file to a set of BerkeleyDB files.
The only parameter it takes is "work_dir". This directory will be used when creating new files from a CCDICT source file, when parse_source_file is called. Once these files exist, they can be used to perform searches.
This class is the most memory-efficient of all the storage classes, as it uses BerkeleyDB cursors for result sets. It is also quite fast.
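Since each storage subclass expects its path under a different parameter name, a small helper can keep the three spellings in one place. The following is only a sketch: ccdict_args and open_ccdict are hypothetical helper names, not part of the module, and the parameter names are taken from the descriptions above.

```perl
use strict;
use warnings;

# Constructor parameter that carries the path for each storage type,
# as described in the storage subclass documentation above.
my %PATH_PARAM = (
    InMemory   => 'file',        # full path to a CCDICT source file
    XML        => 'xml_file',    # full path to a CCDICT XML file
    BerkeleyDB => 'work_dir',    # directory holding the BerkeleyDB files
);

# Hypothetical helper: build the argument list for new().
sub ccdict_args {
    my ( $storage, $path ) = @_;
    my $param = $PATH_PARAM{$storage}
        or die "Unknown storage type '$storage'\n";
    return ( storage => $storage, $param => $path );
}

# Hypothetical helper: load the module lazily and construct the dictionary.
sub open_ccdict {
    my ( $storage, $path ) = @_;
    require Lingua::ZH::CCDICT;
    return Lingua::ZH::CCDICT->new( ccdict_args( $storage, $path ) );
}
```

With this in place, open_ccdict( 'BerkeleyDB', '/path/to/work_dir' ) would be equivalent to calling new with storage and work_dir directly.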
USAGE
This module allows you to look up information in the dictionary based on a number of keys. These include the Unicode character (as a character, not its number), stroke count, radical number, and any of the various romanization systems.
METHODS
These methods are always available.
new
This method always takes at least one parameter, "storage". This indicates what storage subclass to use. The current options are "InMemory", "XML", and "BerkeleyDB".
Any other parameters given will be passed to the appropriate subclass's new method.
parse_source_file ($filename)
Given a source file, this will parse it and create a representation of the appropriate type.
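A typical first run might parse the source file once and then check that entries are available. The sketch below assumes that parse_source_file is called on the object returned by new, which the text above does not state explicitly; build_store is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a CCDICT source file into whatever storage
# $dict uses, then report how many entries are available.
# ASSUMPTION: parse_source_file is an object method on the value from new().
sub build_store {
    my ( $dict, $source_file ) = @_;

    # Fail early with a clear message instead of a parse error later on.
    die "CCDICT source file not found: $source_file\n"
        unless -f $source_file;

    $dict->parse_source_file($source_file);
    return $dict->entry_count;
}
```

For the XML and BerkeleyDB storage classes this step should only be needed once, since the files it creates can be reused for later searches.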
Match methods
When doing a lookup based on the romanization of a character, the tone is indicated with a number at the end of the syllable, as opposed to using the Unicode character combining the latin letter with the diacritic.
In addition, lookups based on a Hanyu Pinyin romanization should use the u-with-umlaut character (code point 252, U+00FC, in Latin-1 and Unicode; it is not part of ASCII) rather than a doubled "u" character. Lower case should be used when doing lookups on romanizations.
The return value for any lookup will be an object of a Lingua::ZH::CCDICT::ResultSet subclass. All the subclasses share a similar interface, described below.
Result sets always return matches in ascending Unicode character order.
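The lookup rules above (lower case, tone number at the end, u-with-umlaut instead of a doubled "u") can be applied to user input before it is handed to a match method. This is a minimal sketch; pinyin_lookup_key is a hypothetical helper, and it assumes plain ASCII input in which "uu" only ever stands for u-with-umlaut.

```perl
use strict;
use warnings;

# Hypothetical helper: normalize loosely typed Hanyu Pinyin such as "LUU4"
# into the form expected for lookups.
sub pinyin_lookup_key {
    my ($syllable) = @_;
    my $key = lc $syllable;      # romanization lookups use lower case
    $key =~ s/uu/\x{FC}/g;       # doubled "u" becomes u-with-umlaut (U+00FC)
    return $key;                 # the trailing tone number is left untouched
}
```

The result could then be passed on, for example as $dict->match_pinyin( map { pinyin_lookup_key($_) } @input ).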
match_unicode (@chars)
This method matches on one or more Unicode characters. Unicode characters should be given as Perl characters (e.g. chr(0x7D20)), not as a number.
match_radical (@numbers)
Given a set of numbers, this method returns those characters containing the specified radical(s).
match_index (@numbers)
Given a set of numbers, this method returns those characters containing the specified index(es).
match_alternate_radical (@numbers)
Given a set of numbers, this method returns those characters containing the specified radical(s) as alternates.
match_alternate_index (@numbers)
Given a set of numbers, this method returns those characters containing the specified index(es) as alternates.
match_stroke_count (@numbers)
Given a set of numbers, this method returns those characters containing the specified number(s) of strokes.
match_cangjie (@codes)
Given a set of Cangjie codes, this method returns the character(s) for those code(s).
match_four_corner (@codes)
Given a set of Four Corner codes, this method returns the character(s) for those code(s).
match_maciver (@romanizations)
match_rey (@romanizations)
match_hagfa_pinyim (@romanizations)
match_siyan (@romanizations)
match_hailu (@romanizations)
match_jyutping (@romanizations)
match_pinyin (@romanizations)
This returns matches for Hanyu Pinyin.
match_tongyong (@romanizations)
all_characters
Returns a result set containing all of the characters in the dictionary.
entry_count
Returns the number of entries in the dictionary.
The Lingua::ZH::CCDICT::ResultSet Class
This class offers the following API:
next
Return the next item in the result set. If there are no items left then a false value is returned. A subsequent call will start back at the first result.
all
Returns all of the items in the result set.
reset
Resets the index so that the next call to next returns the first item in the set.
count
Returns a number indicating how many items have been returned so far.
Subclasses of this class may offer additional methods. See their documentation for details.
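For large result sets, such as the one returned by all_characters, it may be preferable to pull a few items with next rather than call all. The sketch below uses only the next and reset methods described above; first_n is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return at most $n items from a result set, then
# rewind it so the caller can iterate again from the start.
sub first_n {
    my ( $set, $n ) = @_;
    my @items;
    while ( @items < $n ) {
        my $item = $set->next or last;    # a false value means no items left
        push @items, $item;
    }
    $set->reset;    # the next call to next() returns the first item again
    return @items;
}
```

Calling reset explicitly avoids depending on whether the set was exhausted, since only an exhausted set starts over on its own.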
The Lingua::ZH::CCDICT::ResultItem Class
Each individual result returned by an iterator is an object of this class. This class provides the following methods:
unicode
radical
index
alternate_radical
alternate_index
stroke_count
cangjie
four_corner
These methods always return a single item, when the requested data is available, or a false value if this item is not available.
maciver
rey
hagfa_pinyim
siyan
hailu
jyutping
pinyin
Also available via the method hanyu.
tongyong
english
These methods represent data for which there may be multiple values. In a list context, all values are returned. In a scalar context, only the first value is returned. When the requested data is not available, a false value is returned.
Romanizations are returned as Lingua::ZH::CCDICT::Romanization objects. This class is described below.
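The difference between the single-value and multi-value accessors matters mostly when formatting output. The following sketch builds a one-line summary from a result item; describe_item is a hypothetical helper, and the filtering step is a precaution because the text above does not say whether unavailable data comes back in list context as an empty list or as a single false value.

```perl
use strict;
use warnings;

# Hypothetical helper: one-line text summary of a result item.
sub describe_item {
    my ($item) = @_;

    # List context returns every value; drop undefined or empty entries.
    my @pinyin  = grep { defined($_) && length($_) } $item->pinyin;
    my @english = grep { defined($_) && length($_) } $item->english;

    # Single-value accessor; a false value means "not available".
    my $strokes = $item->stroke_count || '?';

    return sprintf 'radical %s, index %s, %s strokes; pinyin: %s; english: %s',
        scalar $item->radical,
        scalar $item->index,
        $strokes,
        ( @pinyin  ? join( ', ', @pinyin )  : '(none)' ),
        ( @english ? join( '; ', @english ) : '(none)' );
}
```

Joining the romanizations relies on the string overloading of the romanization objects described below.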
The Lingua::ZH::CCDICT::Romanization Class
This class represents romanizations. For all romanizations, two methods are available:
syllable
This is the romanized syllable, with the tone indicated via a number at the end of the syllable.
is_obsolete
The CCDICT dictionary marks some romanizations as obsolete. For those entries, this value is true.
All objects of this class are overloaded so that they stringify to the value of the syllable method, and in string comparisons they use the value of this method as well. In addition, they are overloaded in a boolean context to return true.
The Lingua::ZH::CCDICT::Romanization::Pinyin class is used for the return values of the Lingua::ZH::CCDICT::ResultItem class's tongyong method, and provides the following additional method:
as_unicode
The syllable with tone markings as diacritics using Unicode characters where needed.
The Lingua::ZH::CCDICT::Romanization::Pinyin::Hanyu class is used for the return values of the Lingua::ZH::CCDICT::ResultItem class's pinyin method, and provides the following additional method:
as_ascii
The syllable with umlaut-"u" characters replaced with a doubled "u". Useful when you can only display ASCII.
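Because the extra methods depend on which romanization class an object belongs to, code that handles romanizations generically can check for them with can before calling. This is a sketch only; display_forms is a hypothetical helper, and it makes no assumption about which of as_unicode and as_ascii a given object supports.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every display form a romanization offers.
sub display_forms {
    my ($rom) = @_;
    my %forms = (
        syllable => $rom->syllable,                   # tone as trailing number
        obsolete => ( $rom->is_obsolete ? 1 : 0 ),
    );

    # Only call the class-specific methods when the object provides them.
    for my $method (qw( as_unicode as_ascii )) {
        $forms{$method} = $rom->$method() if $rom->can($method);
    }
    return \%forms;
}
```

A caller that can only print ASCII might prefer the as_ascii key when it exists and fall back to syllable otherwise.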
ENVIRONMENT VARIABLES
CCDICT_DEBUG_SOURCE
Causes a warning when bad data is encountered in the CCDICT dictionary source. This is primarily useful if you want to find bugs in the dictionary itself.
CCDICT_VERBOSE
Tells the module to give you progress reports when parsing the source file. These are sent to STDERR.
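Both variables can also be set from within a program for the duration of a single parse. The sketch below assumes the module consults %ENV while parsing rather than only once when it is loaded; if that assumption is wrong, set the variables in the shell or in a BEGIN block before loading the module instead. with_ccdict_debug is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: run a block of code with both CCDICT debugging
# variables switched on, restoring the previous environment afterwards.
sub with_ccdict_debug {
    my ($code) = @_;
    local $ENV{CCDICT_DEBUG_SOURCE} = 1;    # warn about bad source data
    local $ENV{CCDICT_VERBOSE}      = 1;    # progress reports on STDERR
    return $code->();
}
```

A parse could then be wrapped as with_ccdict_debug( sub { $dict->parse_source_file($file) } ).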
AUTHOR
David Rolsky <autarch@urth.org>
COPYRIGHT
Copyright (c) 2002 David Rolsky. All rights reserved. This program is free software licensed under the ...
The Artistic License
The full text of the license can be found in the LICENSE file included with this module.
CCDICT COPYRIGHT
Copyright (c) 1995-2002 Thomas Chin.
SEE ALSO
Lingua::ZH::CCDICT::Storage::InMemory, Lingua::ZH::CCDICT::Storage::BerkeleyDB
Lingua::ZH::CEDICT - for converting between Chinese and English.
Encode::HanConvert and Lingua::ZH::HanConvert - for converting between simplified and traditional characters in various character sets.
http://www.chinalanguage.com/CCDICT/ - the home of the CCDICT dictionary.