NAME

CEDict::Pinyin - Validates pinyin strings

SYNOPSIS

use CEDict::Pinyin;
use Data::Dumper;

my $py = CEDict::Pinyin->new;

print "Validating pinyin strings:\n";
for ("ji2 - rui4 cheng2", "xi'an", "dian4 nao3, yuyan2", "kongzi",
		"123", "not pinyin", "gu1 fstr4 zu3") {
	my $parts = [];    # receives the syllables parsed from this string
	$py->setSource($_);
	if ($py->isPinyin($parts)) {
		print "Valid string: $_\n";
	} else {
		print "Invalid string: $_\n";
	}
	print Dumper($parts);
}

DESCRIPTION

This class helps you validate and parse pinyin. Currently the pinyin must follow certain formatting rules before this class's validation method considers it "valid". Every valid pinyin syllable is written with characters in the 7-bit ASCII range, so the validation method fails on a string like "nán nǚ lǎo shào". Tones should instead be written as a number after each syllable: instead of the string above, use "nan2 nv3 lao3 shao4". Accepting accented characters that mark the tone of a syllable is a feature I hope to add in a future version of this module.

The parser first looks at the entire string you pass it to see whether it is even worth parsing. The regular expression used is shown below.

/^[A-Za-z]+[A-Za-z1-5,'\- ]*$/

If the pinyin doesn't match this regex, isPinyin returns false and stops parsing the string. In other words, if you want to use this module to validate pinyin that is not in exactly the format just described, you need to clean up your pinyin strings a little first.
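As a rough illustration of what this first check accepts, the sketch below applies the regular expression quoted above to a few sample strings. The helper name passes_precheck is hypothetical and is not part of CEDict::Pinyin. Note that passing this check only means the string is worth parsing; a string such as "not pinyin" gets through here and is rejected later, when its syllables are examined.

```perl
use strict;
use warnings;

# The pre-check regex, copied verbatim from the documentation above.
my $PRECHECK = qr/^[A-Za-z]+[A-Za-z1-5,'\- ]*$/;

# Hypothetical helper (not part of CEDict::Pinyin): returns 1 if the
# string passes the pre-check, 0 otherwise.
sub passes_precheck {
    my ($string) = @_;
    return $string =~ $PRECHECK ? 1 : 0;
}

for my $s ("ji2 - rui4 cheng2", "xi'an", "123", "not pinyin", "zhong1*") {
    printf "%-20s %s\n", $s, passes_precheck($s) ? "passes" : "rejected";
}
```

Here "123" is rejected because the string must start with a letter, and "zhong1*" is rejected because the asterisk is not in the allowed character set.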

Again, hopefully future versions of this class will be more flexible in what they accept as valid pinyin. However, we want to be sure that what we are looking at is really pinyin and not English words, since this module was originally written in part to distinguish a pinyin string from English. I would like to keep this idea in future versions, so if you update the class with your own code, please keep that in mind.
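Until accented input is accepted, one way to clean up tone-marked pinyin before validation is to convert it to the numbered form yourself. The following is a minimal sketch of that idea, not something this module provides: the helper accented_to_numbered is hypothetical, it assumes syllables are separated by whitespace, and it only handles lowercase "ü" (written as "v", as in the example above).

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Combining tone marks mapped to tone numbers.
my %TONE = (
    "\x{0304}" => 1,    # macron
    "\x{0301}" => 2,    # acute accent
    "\x{030C}" => 3,    # caron
    "\x{0300}" => 4,    # grave accent
);

# Hypothetical helper (not part of CEDict::Pinyin): converts
# whitespace-separated, tone-marked syllables to numbered pinyin.
sub accented_to_numbered {
    my ($text) = @_;
    my @out;
    for my $syllable (split ' ', $text) {
        my $s    = NFD($syllable);    # split letters from their accents
        my $tone = '';
        # Remove the first tone mark found and remember its number.
        if ($s =~ s/([\x{0300}\x{0301}\x{0304}\x{030C}])//) {
            $tone = $TONE{$1};
        }
        $s =~ s/u\x{0308}/v/g;        # u + diaeresis becomes "v"
        push @out, $s . $tone;
    }
    return join ' ', @out;
}

print accented_to_numbered("nán nǚ lǎo shào"), "\n";
```

Syllables without a tone mark are passed through without a number, so the result should still be checked with isPinyin.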

Methods

CEDict::Pinyin->new(SCALAR)

Creates a new CEDict::Pinyin object. SCALAR should be a string containing the pinyin you want to work with. If SCALAR is omitted, it can be set later using the setSource method.

$obj->setSource(SCALAR)

Sets the source string to work with. Currently only the isPinyin method accesses this attribute.

$obj->isPinyin or $obj->isPinyin(ARRAYREF)

Validates the pinyin supplied to the constructor or to $obj->setSource(SCALAR). If an ARRAYREF is supplied as an argument, adds each syllable of the parsed pinyin to the array. If a syllable is considered invalid then the method stops parsing and immediately returns false. Returns true otherwise.

CEDict::Pinyin->buildRegex(STRING)

Takes a string containing pinyin and returns a regular expression that can be used with the MySQL database (so far only tested against the 5.1 series). Accepts an asterisk ("*") as a wildcard. Note that the isPinyin method will return false when validating such a string, so if you plan on first validating the pinyin then generating the regex, make sure you are validating the string without the asterisks ($string =~ s/\*//g).
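The validate-then-build order described above might look like the sketch below. The helper strip_wildcards is hypothetical and simply wraps the substitution mentioned in the text; the calls to CEDict::Pinyin are left as comments so the snippet runs without the module installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CEDict::Pinyin): returns a copy of
# the string with every asterisk removed, leaving the original intact.
sub strip_wildcards {
    my ($string) = @_;
    (my $clean = $string) =~ s/\*//g;
    return $clean;
}

my $pattern = "zhong1 guo2*";               # search string with a wildcard
my $clean   = strip_wildcards($pattern);    # copy used for validation only

print "validate:   '$clean'\n";
print "build from: '$pattern'\n";

# With the module installed, the two steps would then be:
#   my $py = CEDict::Pinyin->new($clean);
#   if ($py->isPinyin) {
#       my $regex = CEDict::Pinyin->buildRegex($pattern);
#   }
```

Working on a copy matters here: the asterisks must still be present in the string handed to buildRegex.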

AUTHOR

Christopher Davaz www.chrisdavaz.com cdavaz@gmail.com

VERSION

Version 0.1 (Jun 11 2008)

COPYRIGHT

Copyright (c) 2008 Christopher Davaz. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.