NAME
Lingua::EO::Orthography - An orthography/substitute converter for Esperanto characters
VERSION
This document describes Lingua::EO::Orthography version 0.02.
Translations
- en: English
-
Lingua::EO::Orthography (This document)
- eo: Esperanto
- ja: Japanese
SYNOPSIS
use utf8;
use Lingua::EO::Orthography;
my ($converter, $original, $converted);
# orthographize ...
$converter = Lingua::EO::Orthography->new;
$original = q(C^i-momente, la songha h'orajxo ^sprucigas aplauwdon.);
$converted = $converter->convert($original);
# substitute ... (X-system)
$converter->sources([qw(orthography)]); # (accepts multiple notations)
$converter->target('postfix_x');
# same as above:
# $converter = Lingua::EO::Orthography->new(
# sources => [qw(orthography)],
# target => 'postfix_x',
# );
$original = q(Ĉi-momente, la sonĝa ĥoraĵo ŝprucigas aplaŭdon);
$converted = $converter->convert($original);
DESCRIPTION
Six letters of the Esperanto alphabet do not exist in ASCII. These letters, which carry supersigns (eo: supersignoj), have often been spelled in substitute notations (eo: surogataj skribosistemoj) for historical reasons, namely the constraints of typography and typewriters. Today it is not unusual to spell them in the orthography (eo: ortografio), thanks to the spread of Unicode (eo: Unikodo). However, there are still many environments where typing these letters on a keyboard is difficult, and people may need to handle old documents written in a substitute notation.
This object-oriented module converts text between these notations.
Caveat
This module is in beta, and the API may change. Your feedback is welcome.
Catalogue of notations
The following notation names can be used in new(), add_sources(), and so on.
I am going to expand the API in the future so that you will be able to add notations other than these.
orthography
-
Ĉ ĉ Ĝ ĝ Ĥ ĥ Ĵ ĵ Ŝ ŝ Ŭ ŭ (\x{108} \x{109} \x{11C} \x{11D} \x{124} \x{125} \x{134} \x{135} \x{15C} \x{15D} \x{16C} \x{16D})
This is the orthography of the Esperanto alphabet. The converter treats the letters with supersigns as they exist in Unicode. The character encoding is UTF-8.
You should use the orthography today unless you have a particular reason not to, because Unicode is now sufficiently widespread. Perl 5.8.1 or later also handles it correctly.
I recommend that you work with UTF-8 flagged strings throughout your program and convert strings only on input from, or output to, the outside world (on demand), so that functions such as length() work correctly when the utf8 pragma is on. This is the same principle as that of Encode and the Perl IO layers.
zamenhof
-
Ch ch Gh gh Hh hh Jh jh Sh sh U u
It is a substitute notation that places h as a postfix, except that it places nothing after u.
It was suggested by Dr. Zamenhof, the father of Esperanto, in Fundamento de Esperanto, and people called it the Zamenhof system (eo: Zamenhofa sistemo). For this reason, people also called it the second orthography, but it is not used very much today.
It has a problem: a string that spans a root boundary (such as 'flug/haven/o') looks like a substituted string in several words such as 'flughaveno' (en: 'airport'). This module does not currently avoid this problem.
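The root-boundary problem can be illustrated with a minimal sketch. It does not use this module; the table %ZAMENHOF and the functions naive_from_zamenhof() and guarded_from_zamenhof() are hypothetical names invented for this example, and only the lower-case digraphs are handled. The guard simply skips words that the caller lists.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical table for the lower-case digraphs of the 'zamenhof' notation.
my %ZAMENHOF = (ch => 'ĉ', gh => 'ĝ', hh => 'ĥ', jh => 'ĵ', sh => 'ŝ');

# Naive conversion: replaces every digraph, even across a root boundary.
sub naive_from_zamenhof {
    my ($text) = @_;
    $text =~ s/(ch|gh|hh|jh|sh)/$ZAMENHOF{$1}/g;
    return $text;
}

# Guarded conversion: words found in a caller-supplied list are left alone.
sub guarded_from_zamenhof {
    my ($text, @ignore_words) = @_;
    my %ignore = map { $_ => 1 } @ignore_words;

    # The capturing split keeps the separators, so join() restores the text.
    return join q(),
        map { $ignore{$_} ? $_ : naive_from_zamenhof($_) }
        split /(\W+)/, $text;
}
```

With the naive function, 'flughaveno' loses its 'gh' even though the g and the h belong to different roots. The guarded function keeps the word intact when it is passed in the list, at the cost of maintaining that list by hand.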
capital_zamenhof
-
CH ch GH gh HH hh JH jh SH sh U u
It is a variant of the 'zamenhof' notation.
It places a capital H as a postfix of a capital letter.
postfix_h
-
Ch ch Gh gh Hh hh Jh jh Sh sh Uw uw
It is an extension of the 'zamenhof' notation.
It places w as a postfix of u.
People called it the H-system (eo: H-sistemo).
postfix_capital_h
-
CH ch GH gh HH hh JH jh SH sh UW uw
It is a variant of the 'postfix_h' notation.
It places a capital H or W as a postfix of a capital letter.
postfix_x
-
Cx cx Gx gx Hx hx Jx jx Sx sx Ux ux
It is a substitute notation that places x as a postfix.
People called it the X-system (eo: X-sistemo, iksa sistemo).
People widely use it as a substitute notation, because X does not exist in the Esperanto alphabet and is used only to write non-Esperanto words in their original spelling.
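Because x never occurs in native Esperanto words, the X-system maps to the orthography and back without ambiguity. The following sketch shows the idea with two lookup tables. It is not this module's implementation; x_to_orthography() and orthography_to_x() are hypothetical names used only here.

```perl
use strict;
use warnings;
use utf8;

# The twelve letters with supersigns and their 'postfix_x' counterparts,
# in the same order as in the catalogue.
my @ORTHOGRAPHY = qw(Ĉ ĉ Ĝ ĝ Ĥ ĥ Ĵ ĵ Ŝ ŝ Ŭ ŭ);
my @POSTFIX_X   = qw(Cx cx Gx gx Hx hx Jx jx Sx sx Ux ux);

my (%TO_ORTHOGRAPHY, %TO_POSTFIX_X);
@TO_ORTHOGRAPHY{@POSTFIX_X} = @ORTHOGRAPHY;
@TO_POSTFIX_X{@ORTHOGRAPHY} = @POSTFIX_X;

# Substitute notation to orthography. The lower-case x makes the match
# case-sensitive, so a string such as 'cX' is left untouched.
sub x_to_orthography {
    my ($text) = @_;
    $text =~ s/([CcGgHhJjSsUu]x)/$TO_ORTHOGRAPHY{$1}/g;
    return $text;
}

# Orthography to substitute notation.
sub orthography_to_x {
    my ($text) = @_;
    $text =~ s/([ĈĉĜĝĤĥĴĵŜŝŬŭ])/$TO_POSTFIX_X{$1}/g;
    return $text;
}
```

The same two-table approach would not round-trip as safely for the h-based notations, where the postfix letter is itself part of the alphabet.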
postfix_capital_x
-
CX cx GX gx HX hx JX jx SX sx UX ux
It is a variant of the 'postfix_x' notation.
It places a capital X as a postfix of a capital letter.
postfix_caret
-
C^ c^ G^ g^ H^ h^ J^ j^ S^ s^ U^ u^
It is a substitute notation that places a caret ^ as a postfix.
People called it the caret system (eo: ĉapelita sistemo).
People often use it as a substitute notation, because the caret has the same shape as the circumflex.
This module does not currently support the variant that writes u~ instead of u^.
postfix_apostrophe
-
C' c' G' g' H' h' J' j' S' s' U' u'
It is a substitute notation that places an apostrophe ' as a postfix.
prefix_caret
-
^C ^c ^G ^g ^H ^h ^J ^j ^S ^s ^U ^u
It is a substitute notation that places a caret ^ as a prefix.
Comparison with Lingua::EO::Supersignoj
Lingua::EO::Supersignoj is also available on CPAN. It provides functions corresponding to those of this module.
The following table compares the two:
Viewpoints                  ::Supersignoj    ::Orthography                Note
--------------------------  ---------------  ---------------------------  ----
Version                     0.02             0.02
Can convert @lines          Yes              No                           *1
Have accessors              Yes              Yes, and it has utilities    *2
Can customize notation      Only 'u'         No (under consideration)     *3
Can treat 'flughaveno'      No               No (under consideration)     *4
API language                eo: Esperanto    en: English
Can convert as N:1          No               Yes                          *5
Speed                       Satisfied        About 400% faster            *6
Immediate dependencies      1 (0 in core)    6 (2 in core)                *7
Whole dependencies          1 (0 in core)    15 (8 in core)               *7
Test case number            3                93                           *8
License                     Unknown          Perl (Artistic or GNU GPL)
Last modified on            Mar. 2003        Mar. 2010
*1: To convert @lines with Lingua::EO::Orthography:
    @converted_lines = map { $converter->convert($_) } @original_lines;
*2: Lingua::EO::Orthography has utility methods, namely all_sources(), add_sources() and remove_sources().
*3: I plan to design the API of this function as follows:
    $converter = Lingua::EO::Orthography->new(
        notations => {
            postfix_asterisk => [qw(C* c* G* g* H* h* J* j* S* s* U* u*)],
        },
    );
    $notations_ref = $converter->notations;
    @notations = $converter->all_notations;
    @notations = $converter->notations({
        postfix_underscore => [qw(C_ c_ G_ g_ H_ h_ J_ j_ S_ s_ U_ u_)],
    });
    $converter->add_notations(
        postfix_diacritics => [qw(C^ c^ G^ g^ H^ h^ J^ j^ S^ s^ U~ u~)],
    );
*4: I plan to design the API of this function as follows:
    $converter = Lingua::EO::Orthography->new(
        ignore_words => [qw( bushaltejo flughaveno Kinghaio ... )],
    );
    $ignore_words_ref = $converter->ignore_words;
    @ignore_words = $converter->all_ignore_words;
    @ignore_words = $converter->ignore_words([qw(kuracherbo)]);
    $converter->add_ignore_words([qw( longhara navighalto ... )]);
*5: From my experience, I expect that a practical application will need to accept multiple notations. I included an example in the distribution: Lingua::EO::Orthography can convert a string into the orthography in a single pass, as in examples/converter.pl. The counterpart for Lingua::EO::Supersignoj is examples/correspondent.pl; in that case, you must convert the string repeatedly, once per source notation.
*6: Lingua::EO::Orthography can convert strings about 400% faster than Lingua::EO::Supersignoj. The difference comes from caching, with Memoize, the regular expression pattern and the character conversion table used to replace strings. Furthermore, Lingua::EO::Orthography can convert characters from multiple notations at once. See examples/benchmark.pl in this distribution.
*7: The source of the dependency counts is http://deps.cpantesters.org/. The numbers exclude modules needed only for building and testing. All of the dependencies of Lingua::EO::Orthography are well regarded, and I quite agree with those recommendations. Nevertheless, I am considering reducing the dependencies. I have already given up making this module depend on namespace::clean, namespace::autoclean, and so on.
*8: The number excludes the author's tests.
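Note *6 attributes the speed to caching the pattern and the conversion table with Memoize. The following is only a sketch of that general technique, not the module's actual code. The function build_pattern() and the counter $build_count are hypothetical and exist purely to make the caching visible.

```perl
use strict;
use warnings;
use Memoize;

my $build_count = 0;    # counts real (uncached) builds, for illustration only

# Hypothetical helper: builds one alternation pattern from a
# space-separated list of substitute tokens such as "C^ c^ G^ g^".
sub build_pattern {
    my ($joined_tokens) = @_;
    $build_count++;
    my $alternation = join q(|),
        map { quotemeta($_) } split ' ', $joined_tokens;
    return qr/($alternation)/;
}

# After this call, repeated calls with the same argument return the
# cached pattern instead of rebuilding it.
memoize('build_pattern');
```

Calling build_pattern() twice with the same token list runs the body once. A long-lived converter that is called for every line of input benefits from this, because the pattern depends only on the chosen notations and not on the text.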
METHODS
Constructor
new
$converter = Lingua::EO::Orthography->new(%init_arg);
Returns a Lingua::EO::Orthography object, which is a converter.
Accepts a hash that configures the conversion. You can assign sources and/or target as keys of the hash.
sources => \@source_notations
-
Accepts an array reference or :all as source notations. :all is equivalent to zamenhof, capital_zamenhof, postfix_h, postfix_capital_h, postfix_x, postfix_capital_x, postfix_caret, postfix_apostrophe and prefix_caret.
If you omit it, the converter behaves as if you had assigned :all.
If you assign a value other than :all or an array reference, if the array is empty, or if an element is an unknown notation or undef, the converter throws an exception.
target => $target_notation
-
Accepts a string as the target notation.
If you omit it, the converter behaves as if you had assigned orthography.
If you assign an unknown notation or undef, the converter throws an exception.
Accessors
sources
$source_notations_ref = $converter->sources;
Returns the source notations as an array reference. If you want them as a list, use all_sources().
$source_notations_ref = $converter->sources(\@notations);
Accepts an array reference as source notations. You can use the same notations as in the new() constructor.
The return value is the same as when no argument is passed.
target
$target_notation = $converter->target;
Returns the target notation as a scalar.
$target_notation = $converter->target($notation);
Accepts a string as the target notation. You can use the same notations as in the new() constructor.
The return value is the same as when no argument is passed.
Converter
convert
$converted_string = $converter->convert($original_string);
Accepts a string, converts it, and returns the converted string. The argument string is not modified by this method; that is, the method has no side effect on its argument. The conversion is based on the notations assigned through the new() constructor or the sources() and target() accessors.
Strings are case-sensitive. That is to say, the converter does not regard cX as a substitute notation in the 'postfix_x' notation, and does not convert it.
The argument string should have the UTF8 flag on. The returned string also has it on.
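One common way to obtain a UTF8 flagged string is to decode octets once at the input boundary and encode once at the output boundary, using the core Encode module. The sketch below shows only that boundary handling and the effect on length(). The variable names are illustrative, and the call to convert() is left as a comment so that the example runs without this module installed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from a file or a socket:
# the two-character word "ĉu" encoded in UTF-8 (three bytes).
my $octets = "\xC4\x89u";

# Decode once on input to get a character string.
my $characters = decode('UTF-8', $octets);

# length() counts bytes before decoding and characters after it.
my $octet_length     = length $octets;
my $character_length = length $characters;

# ... $characters would be passed to $converter->convert() here ...

# Encode once on output.
my $output = encode('UTF-8', $characters);
```

Keeping the decode and encode steps at the edges of the program means every string in between can be treated uniformly as characters.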
A URL or an e-mail address may contain a string that can be confused with a substitute notation. If you do not want to convert it, split() the sentence into words and run convert() on each word. This lets you exclude strings that include :// or @ from the conversion. See RFC 2396 and 3986 for URIs, and RFC 5321 and 5322 for e-mail addresses. A concrete example is given in examples/ignore_addresses.pl in the distribution.
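The word-by-word approach described above might look like the following sketch. It is not the content of examples/ignore_addresses.pl, which is not reproduced here; convert_except_addresses() and looks_like_address() are hypothetical names, and the converter is passed in as a code reference so that the sketch stays independent of this module.

```perl
use strict;
use warnings;

# True if the word contains "://" or "@", the two markers named above.
sub looks_like_address {
    my ($word) = @_;
    return index($word, '://') >= 0 || index($word, '@') >= 0;
}

# Splits a sentence on whitespace, converts every word with $convert
# except those that look like addresses, and joins the pieces again.
# The capturing split keeps the original whitespace.
sub convert_except_addresses {
    my ($convert, $sentence) = @_;
    return join q(),
        map { looks_like_address($_) ? $_ : $convert->($_) }
        split /(\s+)/, $sentence;
}
```

With a converter object, the code reference could be written as sub { $converter->convert($_[0]) }. The check is deliberately crude: it does not validate addresses against the RFCs, it only keeps anything containing the two markers out of the conversion.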
Utilities
all_sources
@all_source_notations = $converter->all_sources;
Returns the source notations as a list. If you want them as an array reference, use sources().
add_sources
$source_notations_ref = $converter->add_sources(@adding_notations);
Adds the notations passed as a list to the source notations. You can use the same notations as in the new() constructor.
Returns the source notations as an array reference.
remove_sources
$source_notations_ref = $converter->remove_sources(@removing_notations);
Removes the notations passed as a list from the source notations. You can use the same notations as in the new() constructor.
Returns the remaining source notations as an array reference.
At least one notation must remain after the removal. If you remove all notations, the converter throws an exception.
SEE ALSO
L. L. Zamenhof, Fundamento de Esperanto, 1905
INCOMPATIBILITIES
None reported.
BUGS AND LIMITATIONS
No bugs have been reported.
Making suggestions and reporting bugs
Please report any bugs, feature requests, and ideas for improvement to <bug-lingua-eo-orthography at rt dot cpan dot org>, or through the web interface at http://rt.cpan.org/Public/Bug/Report.html?Queue=Lingua-EO-Orthography. I will be notified, and then you'll automatically be notified of progress on your bugs/requests as I make changes.
When reporting bugs, if possible, please include the smallest code sample you can make that reproduces the bug. And of course, suggestions and patches are welcome.
SUPPORT
You can find documentation for this module with the perldoc command.
% perldoc Lingua::EO::Orthography
The Esperanto edition of documentation is also available.
% perldoc Lingua::EO::Orthography::EO
You can also find the Japanese edition of the documentation for this module with the perldocjp command from Pod::PerldocJp.
% perldocjp Lingua::EO::Orthography::JA
You can also look for information at:
- RT: CPAN's request tracker
-
http://rt.cpan.org/Public/Dist/Display.html?Name=Lingua-EO-Orthography
- AnnoCPAN: Annotated CPAN documentation
- Search CPAN
- CPAN Ratings
VERSION CONTROL
This module is maintained using git. You can get the latest version from git://github.com/gardejo/p5-lingua-eo-orthography.git.
CODE COVERAGE
I use Devel::Cover to test the code coverage of my tests. Below is the Devel::Cover summary report on this distribution's test suite.
---------------------------- ------ ------ ------ ------ ------ ------ ------
File stmt bran cond sub pod time total
---------------------------- ------ ------ ------ ------ ------ ------ ------
.../Lingua/EO/Orthography.pm 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Total 100.0 100.0 100.0 100.0 100.0 100.0 100.0
---------------------------- ------ ------ ------ ------ ------ ------ ------
TO DO
More tests
Fewer dependencies
To provide an API to add user's notation
To correctly treat words such as flughaveno (flug/haven/o) in 'postfix_h' notation with user's lexicon
To correctly treat words such as ankaŭ in 'zamenhof' notation with user's lexicon
To release a Moose friendly class such as Lingua::EO::Orthography::Moosified
AUTHOR
- MORIYA Masaki, alias Gardejo
-
<moriya at cpan dot org>, http://gardejo.org/
ACKNOWLEDGEMENTS
Juerd Waalboer wrote Lingua::EO::Supersignoj, which this module refers to.
COPYRIGHT AND LICENSE
Copyright (c) 2010 MORIYA Masaki, alias Gardejo
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlgpl and perlartistic.
The full text of the license can be found in the LICENSE file included with this distribution.