NAME

Lingua::ZH::CCDICT - An interface to the CCDICT Chinese dictionary

SYNOPSIS

use Lingua::ZH::CCDICT;

my $dict = Lingua::ZH::CCDICT->new( storage => 'InMemory',
                                    file    => '/path/to/ccdict.txt',
                                  );

DESCRIPTION

This module provides a Perl interface to the CCDICT dictionary created by Thomas Chin. This dictionary is indexed by Unicode character number, and contains information about these characters.

As of version 3.2.0, the dictionary is released under the Open Publication License v0.4, without either of the optional clauses. See the CCDICT licensing statement for more details. IANAL, but I believe that the OPL, combined with the fact that this module is under the Artistic License, makes this module and the CCDICT dictionary fit both the Free Software and Open Source definitions.

The dictionary contains the following information, though not all of it is available for every character.

  • Radical number

    The number of the radical. Always available.

  • Index

    The total number of strokes minus the number of strokes in the radical. Always available.

  • Alternate radical number and index

    The dictionary defines a format for storing this information, but as of version 3.2.0 it does not actually contain it for any characters.

  • Total stroke count

    The total number of strokes in the character.

  • Cangjie

    The Cangjie Chinese input system code.

  • Four Corner

    The Four Corner Chinese input system code.
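
The relationship between the index and the total stroke count described above is plain subtraction. A minimal sketch, where the function name residual_index and the sample numbers are made up for illustration and are not taken from the dictionary:

```perl
use strict;
use warnings;

# Hypothetical helper: the index is the total stroke count minus the
# number of strokes in the radical.
sub residual_index {
    my ( $total_strokes, $radical_strokes ) = @_;
    return $total_strokes - $radical_strokes;
}

# A character with 10 strokes in total whose radical has 6 strokes.
print residual_index( 10, 6 ), "\n";    # prints 4
```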

In addition, the dictionary contains English definitions (often multiple definitions), and romanizations for the character in different languages and systems. The romanizations available include the MacIver, Rey, Hagfa Pinyim, Siyan, and Hailu systems for Hakka, as well as the Jyutping Cantonese system and the Hanyu and Tongyong Pinyin systems for Mandarin.

The Hanyu Pinyin system was developed in Mainland China in the 1950s and officially adopted there in 1958, while the Tongyong Pinyin system is a variation on this system invented in Taiwan in 1998. Tongyong Pinyin was adopted as Taiwan's official Pinyin system in 2002.

However, due to the overwhelming dominance of the Hanyu system in worldwide Chinese education, the Hanyu system is generally known simply as Pinyin. If you studied Mandarin as a foreign language, it is likely that you learned Hanyu Pinyin.

DICTIONARY BUGS

The CCDICT dictionary is distributed by Thomas Chin in a simple, but non-standard, textual ASCII-only format. Errors in the dictionary are handled by this module internally, although occasional odd entries may result in odd data. Please send bug reports to me so I can make sure that the error is actually in the dictionary, not in this code.

STORAGE

This module is capable of parsing the CCDICT format file, and can also rewrite it in a number of other formats, including XML and a set of BerkeleyDB files.

Each storage system is implemented via a module in the Lingua::ZH::CCDICT::Storage::* class hierarchy. All of these modules are subclasses of the Lingua::ZH::CCDICT class, and implement its methods for searching the dictionary.

In addition, some storage classes may offer additional methods.

Storage Subclasses

The following storage subclasses are available:

  • Lingua::ZH::CCDICT::Storage::InMemory

    This class stores the entire parsed dictionary in memory. Be forewarned: on my GNU/Linux 2.4 machine, this takes about 234 megabytes of memory. Caveat user!

    The only parameter it takes is "file", which should contain the full path to a CCDICT source file.

  • Lingua::ZH::CCDICT::Storage::XML

    This class can convert the CCDICT source file to XML, and perform searches on it using XML::Twig.

    The only parameter it takes is "xml_file", which should contain the full path to a CCDICT XML file. If the parse_source_file method is called, then the XML file specified by the "xml_file" parameter will be created, overwriting any existing data.

    This class is quite memory-efficient but searches are painfully slow.

  • Lingua::ZH::CCDICT::Storage::BerkeleyDB

    This class can convert the CCDICT source file to a set of BerkeleyDB files.

    The only parameter it takes is "work_dir". This directory will be used when creating new files from a CCDICT source file, when parse_source_file is called. Once these files exist, they can be used to perform searches.

    This class is the most memory-efficient of all the storage classes, as it uses BerkeleyDB cursors for result sets. It is also quite fast.
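
As a rough illustration of the conversion workflow described above, the following sketch creates BerkeleyDB files from a source file and then reports the entry count. The helper name build_berkeleydb_dictionary and both paths are placeholders invented for this example. The module is loaded at run time so that the sketch simply does nothing when the module or the source file is missing.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): converts a CCDICT source
# file into BerkeleyDB files under $work_dir and returns the dictionary
# object. Returns nothing when the source file or the module is missing.
sub build_berkeleydb_dictionary {
    my ( $source, $work_dir ) = @_;

    return unless -f $source;
    return unless eval { require Lingua::ZH::CCDICT; 1 };

    my $dict = Lingua::ZH::CCDICT->new(
        storage  => 'BerkeleyDB',
        work_dir => $work_dir,
    );

    # Creates the BerkeleyDB files from the source file.
    $dict->parse_source_file($source);

    return $dict;
}

# Placeholder paths; replace them with real locations.
my $dict = build_berkeleydb_dictionary( '/path/to/ccdict.txt', '/path/to/work_dir' );
print $dict ? 'Entries: ' . $dict->entry_count . "\n" : "Nothing built\n";
```

Once the files exist, later runs should be able to construct the object with the same "work_dir" and search without parsing the source again.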

USAGE

This module allows you to look up information in the dictionary based on a number of keys. These include the Unicode character (as a character, not its number), stroke count, radical number, and any of the various romanization systems.

METHODS

These methods are always available.

  • new

    This method always takes at least one parameter, "storage". This indicates what storage subclass to use. The current options are "InMemory", "XML", and "BerkeleyDB".

    Any other parameters given will be passed to the appropriate subclass's new method.

  • parse_source_file ($filename)

    Given a source file, this will parse it and create a representation of the appropriate type.

Match methods

When doing a lookup based on the romanization of a character, the tone is indicated with a number at the end of the syllable, as opposed to using the Unicode character combining the Latin letter with the diacritic.

In addition, lookups based on a Hanyu Pinyin romanization should use the u-with-umlaut character (character 252, U+00FC, in Latin-1 and Unicode; it is not part of ASCII) rather than a doubled "u" character. Lower case should be used when doing lookups on romanizations.

The return value for any lookup will be an object of a Lingua::ZH::CCDICT::ResultSet subclass. All the subclasses share a similar interface, described below.

Result sets always return matches in ascending Unicode character order.
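
The romanization conventions above (lower case, u-with-umlaut instead of a doubled "u") can be applied to user input before a lookup. A minimal sketch, assuming input that may arrive in mixed case or with "uu"; the function name normalize_pinyin_query is hypothetical and not part of this module:

```perl
use strict;
use warnings;

# Hypothetical helper: prepares a Hanyu Pinyin syllable for a lookup.
sub normalize_pinyin_query {
    my ($syllable) = @_;

    my $query = lc $syllable;

    # lc may leave an uppercase U-umlaut (U+00DC) untouched in a byte
    # string, so convert it explicitly.
    $query =~ tr/\x{DC}/\x{FC}/;

    # Replace a doubled "u" with u-with-umlaut (U+00FC).
    $query =~ s/uu/\x{FC}/g;

    return $query;
}

binmode STDOUT, ':encoding(UTF-8)';
print normalize_pinyin_query('NUU3'), "\n";
```

The result can then be passed to a lookup method such as match_pinyin.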

  • match_unicode (@chars)

    This method matches on one or more Unicode characters. Unicode characters should be given as Perl characters (i.e. chr(0x7D20)), not as a number.

  • match_radical (@numbers)

    Given a set of numbers, this method returns those characters containing the specified radical(s).

  • match_index (@numbers)

    Given a set of numbers, this method returns those characters containing the specified index(es).

  • match_alternate_radical (@numbers)

    Given a set of numbers, this method returns those characters containing the specified radical(s) as alternates.

  • match_alternate_index (@numbers)

    Given a set of numbers, this method returns those characters containing the specified index(es) as alternates.

  • match_stroke_count (@numbers)

    Given a set of numbers, this method returns those characters containing the specified number(s) of strokes.

  • match_cangjie (@codes)

    Given a set of Cangjie codes, this method returns the character(s) for those code(s).

  • match_four_corner (@codes)

    Given a set of Four Corner codes, this method returns the character(s) for those code(s).

  • match_maciver (@romanizations)

  • match_rey (@romanizations)

  • match_hagfa_pinyim (@romanizations)

  • match_siyan (@romanizations)

  • match_hailu (@romanizations)

  • match_jyutping (@romanizations)

  • match_pinyin (@romanizations)

    This returns matches for Hanyu Pinyin.

  • match_tongyong (@romanizations)

  • all_characters

    Returns a result set containing all of the characters in the dictionary.

  • entry_count

    Returns the number of entries in the dictionary.
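
A lookup with one of the match methods might look like the sketch below. The helper name print_pinyin_matches, the source path, and the syllable 'su4' are placeholders chosen for illustration. The sketch also assumes that the unicode method of a result item returns the character itself. It returns nothing when the module or the source file is not available.

```perl
use strict;
use warnings;

# Hypothetical helper: prints every character matching a Hanyu Pinyin
# syllable and returns how many were printed.
sub print_pinyin_matches {
    my ( $source, $syllable ) = @_;

    return unless -f $source;
    return unless eval { require Lingua::ZH::CCDICT; 1 };

    my $dict = Lingua::ZH::CCDICT->new(
        storage => 'InMemory',
        file    => $source,
    );

    # Lower case, with the tone given as a trailing number.
    my $results = $dict->match_pinyin($syllable);

    my $printed = 0;
    while ( my $item = $results->next ) {
        # The stroke count is not available for every character.
        my $strokes = $item->stroke_count || 'unknown';
        printf "%s  radical %s  strokes %s\n",
            $item->unicode, $item->radical, $strokes;
        $printed++;
    }

    return $printed;
}

binmode STDOUT, ':encoding(UTF-8)';
my $count = print_pinyin_matches( '/path/to/ccdict.txt', 'su4' );
print defined $count ? "$count match(es)\n" : "Dictionary not available\n";
```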

The Lingua::ZH::CCDICT::ResultSet Class

This class offers the following API:

  • next

    Returns the next item in the result set. If there are no items left, a false value is returned. A subsequent call will start back at the first result.

  • all

    Returns all of the items in the result set.

  • reset

    Resets the index so that the next call to next returns the first item in the set.

  • count

    Returns a number indicating how many items have been returned so far.

Subclasses of this class may offer additional methods. See their documentation for details.
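
To make the iteration semantics concrete, here is a small stand-in class that mimics the behavior described above. It is only an illustration: Sketch::ResultSet is a made-up name and this is not how the module's result sets are implemented. In particular, what count reports after the set has been exhausted is an assumption of this sketch.

```perl
use strict;
use warnings;

package Sketch::ResultSet;    # hypothetical stand-in, not the module's class

sub new {
    my ( $class, @items ) = @_;
    return bless { items => [@items], position => 0 }, $class;
}

sub next {
    my ($self) = @_;
    if ( $self->{position} >= @{ $self->{items} } ) {
        $self->{position} = 0;    # exhausted: the following call starts over
        return;
    }
    return $self->{items}[ $self->{position}++ ];
}

sub all   { return @{ $_[0]{items} } }
sub reset { $_[0]{position} = 0; return }
sub count { return $_[0]{position} }

package main;

my $results = Sketch::ResultSet->new(qw( first second third ));

# The usual loop: stops when next returns a false value.
while ( my $item = $results->next ) {
    print "$item\n";
}

# The set rewound itself after returning false, so this prints "first" again.
print $results->next, "\n";
```

The same while loop works with the result sets returned by the match methods.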

The Lingua::ZH::CCDICT::ResultItem Class

Each individual result returned by a result set is an object of this class. This class provides the following methods:

  • unicode

  • radical

  • index

  • alternate_radical

  • alternate_index

  • stroke_count

  • cangjie

  • four_corner

    These methods always return a single item when the requested data is available, or a false value when it is not.

  • maciver

  • rey

  • hagfa_pinyim

  • siyan

  • hailu

  • jyutping

  • pinyin

    Also available via the method hanyu.

  • tongyong

  • english

    These methods represent data for which there may be multiple values. In a list context, all values are returned. In a scalar context, only the first value is returned. When the requested data is not available, a false value is returned.

    Romanizations are returned as Lingua::ZH::CCDICT::Romanization objects. This class is described below.
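
The list versus scalar behavior described above follows the usual Perl wantarray idiom. A minimal sketch with made-up data; the function english_sketch is hypothetical and only imitates how a multi-valued accessor such as english behaves:

```perl
use strict;
use warnings;

# Hypothetical accessor imitating a multi-valued method.
sub english_sketch {
    my @definitions = ( 'first definition', 'second definition' );    # sample data
    return wantarray ? @definitions : $definitions[0];
}

my @all   = english_sketch();    # list context: every value
my $first = english_sketch();    # scalar context: only the first value

print scalar(@all), " definitions; the first is '$first'\n";
```

Note that argument lists, such as the arguments to print, impose list context, so assign to a scalar first when only the first value is wanted.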

The Lingua::ZH::CCDICT::Romanization Class

This class represents romanizations. For all romanizations, two methods are available:

  • syllable

    This is the romanized syllable, with the tone indicated via a number at the end of the syllable.

  • is_obsolete

    The CCDICT dictionary marks some romanizations as obsolete. For those entries, this value is true.

All objects of this class are overloaded so that they stringify to the value of the syllable method, and in string comparisons they use the value of this method as well. In addition, they are overloaded in a boolean context to return true.

The Lingua::ZH::CCDICT::Romanization::Pinyin class is used for the return values of the Lingua::ZH::CCDICT::ResultItem class's tongyong method, and provides the following additional method:

  • as_unicode

    The syllable with tone markings as diacritics using Unicode characters where needed.

The Lingua::ZH::CCDICT::Romanization::Pinyin::Hanyu class is used for the return values of the Lingua::ZH::CCDICT::ResultItem class's pinyin method, and provides the following additional method:

  • as_ascii

    The syllable with umlaut-"u" characters replaced with a doubled "u". Useful when you can only display ASCII.
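
The two conversions can be pictured with the following sketch. It is not the module's implementation: the function names are hypothetical, the tone mark placement follows the common Hanyu Pinyin convention (on "a" or "e" if present, on the "o" of "ou", otherwise on the last vowel), and treating tone 5 as an unmarked neutral tone is an assumption.

```perl
use strict;
use warnings;

# Tone-marked forms of each vowel for tones 1-4, as code point escapes.
my %TONE_MARKS = (
    'a'      => [ "\x{101}", "\x{E1}", "\x{1CE}", "\x{E0}" ],
    'e'      => [ "\x{113}", "\x{E9}", "\x{11B}", "\x{E8}" ],
    'i'      => [ "\x{12B}", "\x{ED}", "\x{1D0}", "\x{EC}" ],
    'o'      => [ "\x{14D}", "\x{F3}", "\x{1D2}", "\x{F2}" ],
    'u'      => [ "\x{16B}", "\x{FA}", "\x{1D4}", "\x{F9}" ],
    "\x{FC}" => [ "\x{1D6}", "\x{1D8}", "\x{1DA}", "\x{1DC}" ],
);

# Replaces u-with-umlaut with a doubled "u".
sub pinyin_to_ascii {
    my ($syllable) = @_;
    ( my $ascii = $syllable ) =~ s/\x{FC}/uu/g;
    return $ascii;
}

# Turns a trailing tone number into a diacritic on the right vowel.
sub pinyin_to_unicode {
    my ($syllable) = @_;

    my ( $letters, $tone ) = $syllable =~ /\A(.*?)([1-5])\z/s
        or return $syllable;    # no tone number: leave unchanged
    return $letters if $tone == 5;    # assumed: 5 is the unmarked neutral tone

    my $position;
    if    ( $letters =~ /[ae]/ )                   { $position = $-[0] }
    elsif ( $letters =~ /ou/ )                     { $position = $-[0] }
    elsif ( $letters =~ /.*([aeiou\x{FC}])/s )     { $position = $-[1] }
    else                                           { return $letters }

    my $vowel = substr $letters, $position, 1;
    substr $letters, $position, 1, $TONE_MARKS{$vowel}[ $tone - 1 ];
    return $letters;
}

binmode STDOUT, ':encoding(UTF-8)';
print pinyin_to_unicode('hao3'), "\n";
print pinyin_to_ascii("n\x{FC}3"), "\n";    # prints nuu3
```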

ENVIRONMENT VARIABLES

  • CCDICT_DEBUG_SOURCE

    Causes a warning when bad data is encountered in the CCDICT dictionary source. This is primarily useful if you want to find bugs in the dictionary itself.

  • CCDICT_VERBOSE

    Tells the module to give you progress reports when parsing the source file. These are sent to STDERR.
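
These variables can be set in the shell or from within a script. The sketch below sets them from Perl before the module is loaded. Using the value 1 is an assumption, since the values the module expects are not specified here.

```perl
use strict;
use warnings;

# Assumed values: any true value is used here for illustration.
$ENV{CCDICT_VERBOSE}      = 1;
$ENV{CCDICT_DEBUG_SOURCE} = 1;

# Loaded at run time, after the variables are set.
my $loaded = eval { require Lingua::ZH::CCDICT; 1 };
print $loaded
    ? "Module loaded with progress reports and source warnings enabled\n"
    : "Lingua::ZH::CCDICT is not installed\n";
```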

AUTHOR

David Rolsky <autarch@urth.org>

COPYRIGHT

Copyright (c) 2002 David Rolsky. All rights reserved. This program is free software licensed under the ...

The Artistic License

The full text of the license can be found in the LICENSE file included with this module.

CCDICT COPYRIGHT

Copyright (c) 1995-2002 Thomas Chin.

SEE ALSO

Lingua::ZH::CCDICT::Storage::InMemory, Lingua::ZH::CCDICT::Storage::BerkeleyDB

Lingua::ZH::CEDICT - for converting between Chinese and English.

Encode::HanConvert and Lingua::ZH::HanConvert - for converting between simplified and traditional characters in various character sets.

http://www.chinalanguage.com/CCDICT/ - the home of the CCDICT dictionary.