NAME

Lingua::EO::Orthography - A converter of notations (orthography and substitute notations) for Esperanto characters

VERSION

This document describes Lingua::EO::Orthography version 0.00.

Translations

en: English

Lingua::EO::Orthography (This document)

eo: Esperanto

Lingua::EO::Orthography::EO

ja: Japanese

Lingua::EO::Orthography::JA

SYNOPSIS

use utf8;
use Lingua::EO::Orthography;

my ($converter, $original, $converted);

# orthographize ...
$converter = Lingua::EO::Orthography->new;
$original  = q(C^i-momente, la songha h'orajxo ^sprucigas aplauwdon.);
$converted = $converter->convert($original);

# substitute ... (X-system)
$converter->sources([qw(orthography)]); # (accepts multiple notations)
$converter->target('postfix_x');
    # same as above:
    # $converter = Lingua::EO::Orthography->new(
    #     sources => [qw(orthography)],
    #     target  => 'postfix_x',
    # );
$original  = q(Ĉi-momente, la sonĝa ĥoraĵo ŝprucigas aplaŭdon);
$converted = $converter->convert($original);

DESCRIPTION

Six letters of the Esperanto alphabet do not exist in ASCII. For historical reasons dating back to the ages of typography and the typewriter, these letters, which carry supersigns (eo: supersignoj), are often spelled in substitute notations (eo: surogataj skribosistemoj). Today, thanks to the spread of Unicode (eo: Unikodo), it is not unusual to spell them in the orthography (eo: ortografio). However, there are still many environments where typing them on a keyboard is difficult, and people may have to handle old documents written in a substitute notation.

This object-oriented module converts text between these notations.

Caveat

This module is in the beta stage, and the API may change. Your feedback is welcome.

Catalogue of notations

The following notation names can be used in new(), add_sources(), and so on.

I am going to expand the API in the future so that you will be able to add notations other than these.

orthography
Ĉ ĉ Ĝ ĝ Ĥ ĥ Ĵ ĵ Ŝ ŝ Ŭ ŭ

(\x{108} \x{109} \x{11C} \x{11D} \x{124} \x{125}
 \x{134} \x{135} \x{15C} \x{15D} \x{16C} \x{16D})

It is the orthography of the Esperanto alphabet. The converter treats the letters with supersigns as they exist in Unicode. The character encoding is UTF-8.

You should use the orthography today unless there is a particular reason not to, because Unicode is now sufficiently widespread. Perl 5.8.1 or later also handles it correctly.

I recommend that you handle UTF-8 flagged strings throughout your program and convert strings only on input from, or output to, the outside world (on demand), so that functions such as length() work correctly while the utf8 pragma is on. This is the same principle as that of Encode and the Perl IO layer.
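The recommendation above can be illustrated with a minimal sketch that uses only the core Encode module. The input bytes and variable names are assumptions made for illustration; they are not part of this module.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes of "ŝi" as they might arrive
# from a file or a socket (U+015D is encoded as 0xC5 0x9D).
my $bytes = "\xC5\x9Di";

# Decode once at the boundary, then work with characters internally.
my $chars = decode('UTF-8', $bytes);

my $byte_length = length $bytes;    # counts bytes
my $char_length = length $chars;    # counts characters

# Encode again only when writing to the outside world.
my $output = encode('UTF-8', $chars);
```

Here length() reports one more unit for the undecoded bytes than for the decoded characters, which is why decoding at the boundary matters.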

zamenhof
Ch ch Gh gh Hh hh Jh jh Sh sh U  u

It is a substitute notation that places h as a postfix, except that it places nothing after u.

It was suggested by Dr. Zamenhof, the father of Esperanto, in Fundamento de Esperanto, and people call it the Zamenhof system (eo: Zamenhofa sistemo). For this reason, people have also called it the second orthography, but it is not used very much today.

It has a problem: in several words such as 'flughaveno' (en: 'airport'), a string that spans a boundary between roots ('flug/haven/o') looks like a substituted string. This module does not work around this problem at present.
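The ambiguity can be reproduced with a naive, self-contained substitution. This is only a sketch for illustration: the table and the naive_from_zamenhof() function are hypothetical and are not this module's implementation.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical table for the 'zamenhof' notation (u has no substitute).
my %h_system = (
    Ch => 'Ĉ', ch => 'ĉ', Gh => 'Ĝ', gh => 'ĝ', Hh => 'Ĥ', hh => 'ĥ',
    Jh => 'Ĵ', jh => 'ĵ', Sh => 'Ŝ', sh => 'ŝ',
);

# Replace every "letter + h" pair without any knowledge of word roots.
sub naive_from_zamenhof {
    my ($text) = @_;
    $text =~ s/([CcGgHhJjSs]h)/$h_system{$1}/g;
    return $text;
}

my $intended   = naive_from_zamenhof('chapelo');     # a real substitute: ch
my $unintended = naive_from_zamenhof('flughaveno');  # g and h belong to different roots
```

The second call rewrites the 'gh' that straddles 'flug' and 'haven', which is the kind of false match a lexicon of exception words would have to prevent.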

capital_zamenhof
CH ch GH gh HH hh JH jh SH sh U  u

It is a variant of the 'zamenhof' notation.

It places a capital H as a postfix of a capital letter.

postfix_h
Ch ch Gh gh Hh hh Jh jh Sh sh Uw uw

It is an extension of the 'zamenhof' notation.

It places w as a postfix of u.

People call it the H-system (eo: H-sistemo).

postfix_capital_h
CH ch GH gh HH hh JH jh SH sh UW uw

It is a variant of the 'postfix_h' notation.

It places a capital H or W as a postfix of a capital letter.

postfix_x
Cx cx Gx gx Hx hx Jx jx Sx sx Ux ux

It is a substitute notation that places x as a postfix.

People call it the X-system (eo: X-sistemo, iksa sistemo).

People widely use it as a substitute notation, because x does not exist in the Esperanto alphabet and is used only to write non-Esperanto words in their original language.

postfix_capital_x
CX cx GX gx HX hx JX jx SX sx UX ux

It is a variant of the 'postfix_x' notation.

It places a capital X as a postfix of a capital letter.

postfix_caret
C^ c^ G^ g^ H^ h^ J^ j^ S^ s^ U^ u^

It is a substitute notation that places a caret ^ as a postfix.

People call it the caret system (eo: ĉapelita sistemo).

People often use it as a substitute notation, because the caret has the same shape as the circumflex.

At present, this module does not support the variant that writes u~ instead of u^.

postfix_apostrophe
C' c' G' g' H' h' J' j' S' s' U' u'

It is a substitute notation that places an apostrophe ' as a postfix.

prefix_caret
^C ^c ^G ^g ^H ^h ^J ^j ^S ^s ^U ^u

It is a substitute notation that places a caret ^ as a prefix.

Comparison with Lingua::EO::Supersignoj

Lingua::EO::Supersignoj is also available on CPAN. It provides functions that correspond to those of this module.

The following table compares them:

Viewpoints                 ::Supersignoj   ::Orthography               Note
-------------------------- --------------- --------------------------- ----
Version                    0.02            0.00
Can convert @lines         Yes             No                          *1
Have accessors             Yes             Yes, and it has utilities   *2
Can customize notation     Only 'u'        No (under consideration)    *3
Can treat 'flughaveno'     No              No (under consideration)    *4
API language               eo: Esperanto   en: English
Can convert as N:1         No              Yes                         *5
Speed                      Satisfied       About 400% faster           *6
Immediate dependencies     1 (0 in core)   6 (2 in core)               *7
Whole dependencies         1 (0 in core)   15 (8 in core)              *7
Test case number           3               93                          *8
License                    Unknown         Perl (Artistic or GNU GPL)
Last modified on           Mar. 2003       Mar. 2010

  1. To convert @lines with Lingua::EO::Orthography:

    @converted_lines = map { $converter->convert($_) } @original_lines;

  2. Lingua::EO::Orthography has utility methods, namely all_sources(), add_sources() and remove_sources().

  3. I plan to design the API for this feature as follows:

    $converter = Lingua::EO::Orthography->new(
        notations => {
            postfix_asterisk => [qw(C* c* G* g* H* h* J* j* S* s* U* u*)],
        },
    );
    
    $notations_ref = $converter->notations;
    
    @notations = $converter->all_notations;
    
    @notations = $converter->notations({
        postfix_underscore => [qw(C_ c_ G_ g_ H_ h_ J_ j_ S_ s_ U_ u_)],
    });
    
    $converter->add_notations(
        postfix_diacritics => [qw(C^ c^ G^ g^ H^ h^ J^ j^ S^ s^ U~ u~)],
    );

  4. I plan to design the API for this feature as follows:

    $converter = Lingua::EO::Orthography->new(
        ignore_words => [qw(
            bushaltejo flughaveno Kinghaio ...
        )],
    );
    
    $ignore_words_ref = $converter->ignore_words;
    
    @ignore_words = $converter->all_ignore_words;
    
    @ignore_words = $converter->ignore_words([qw(kuracherbo)]);
    
    $converter->add_ignore_words([qw(
        longhara navighalto ...
    )]);
  5. In my experience, a practical application often needs to accept multiple notations.

    I included an example in the distribution. Lingua::EO::Orthography can convert a string into the orthography in one pass, as in examples/converter.pl. The counterpart for Lingua::EO::Supersignoj is examples/correspondent.pl; in that case, you must convert the string repeatedly, once for each source notation.

  6. Lingua::EO::Orthography can convert strings about 400% faster than Lingua::EO::Supersignoj.

    The reason for the difference is that it caches the regular expression pattern and the character conversion table used for replacement, with Memoize. Furthermore, Lingua::EO::Orthography can convert characters from multiple notations at once.

    See examples/benchmark.pl in this distribution.

  7. The source of the dependency counts is http://deps.cpantesters.org/.

    These numbers exclude modules for building and testing.

    All dependencies of Lingua::EO::Orthography are well regarded, and I quite agree with those recommendations.

    However, I am considering reducing the dependencies. I have already given up making this module depend on namespace::clean, namespace::autoclean, and so on.

  8. This number excludes the author's tests.
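
Note 6 attributes the speed to caching a regular expression pattern and a conversion table with Memoize. The following is a rough, self-contained sketch of that general technique, not this module's actual code: the %notation table, build_parts(), and convert_with_cache() are hypothetical names, and only two notations are included.

```perl
use strict;
use warnings;
use utf8;
use Memoize;

# Hypothetical notation table with only two entries, for illustration.
my %notation = (
    orthography => [qw(Ĉ ĉ Ĝ ĝ Ĥ ĥ Ĵ ĵ Ŝ ŝ Ŭ ŭ)],
    postfix_x   => [qw(Cx cx Gx gx Hx hx Jx jx Sx sx Ux ux)],
);

my $build_count = 0;    # counts how often the expensive part really runs

# Build the pattern and the conversion table for one source/target pair.
sub build_parts {
    my ($source, $target) = @_;
    $build_count++;
    my %table;
    @table{ @{ $notation{$source} } } = @{ $notation{$target} };
    my $alternation = join '|', map { quotemeta } keys %table;
    return [ qr/($alternation)/, \%table ];
}
memoize('build_parts');    # later calls with the same arguments hit the cache

sub convert_with_cache {
    my ($text, $source, $target) = @_;
    my ($pattern, $table) = @{ build_parts($source, $target) };
    $text =~ s/$pattern/$table->{$1}/g;
    return $text;
}

my $first  = convert_with_cache('ehxosxangxo', 'postfix_x', 'orthography');
my $second = convert_with_cache('cxiujxauxde', 'postfix_x', 'orthography');
```

After both calls, $build_count is still 1, because the second call reuses the cached pattern and table instead of rebuilding them.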

METHODS

Constructor

new

$converter = Lingua::EO::Orthography->new(%init_arg);

Returns a Lingua::EO::Orthography object, which is a converter.

Accepts a hash that configures the conversion. You can specify sources and/or target as keys of the hash.

sources => \@source_notations

Accepts an array reference or :all as source notations.

:all is equivalent to zamenhof, capital_zamenhof, postfix_h, postfix_capital_h, postfix_x, postfix_capital_x, postfix_caret, postfix_apostrophe and prefix_caret.

If you omit it, the converter assumes :all.

The converter throws an exception if you specify a value other than :all or an array reference, if the array has no elements, or if it contains an unknown notation or undef.

target => $target_notation

Accepts a string as target notation.

If you omit it, the converter assumes orthography.

If you assign an unknown notation or undef, the converter throws an exception.

Converter

convert

$converted_string = $converter->convert($original_string);

Accepts a string, converts it, and returns the converted string. The argument string is not modified by this method; that is, the method has no side effect on it. The conversion is based on the notations specified in the new() constructor or through the sources() and target() accessors.

Strings are case-sensitive. That is to say, the converter does not regard cX as a substitute notation in the 'postfix_x' notation, and does not convert it.

The argument string should have the UTF8 flag on. The returned string also has the flag on.

A URL or an e-mail address may contain a string that can be confused with a substitute notation. If you do not want to convert it, split() the sentence into words and run convert() on each word. This lets you exclude strings that include :// or @ from the conversion. See RFC 2396 and 3986 for URIs, and RFC 5321 and 5322 for e-mail addresses. A concrete example is in examples/ignore_addresses.pl in the distribution.
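The word-by-word approach might look like the following sketch. It is not the code of examples/ignore_addresses.pl; convert_except_addresses() is a hypothetical helper, and a small X-system code reference stands in for the converter so that the sketch runs without this module installed. With the real module you would pass sub { $converter->convert($_[0]) } instead.

```perl
use strict;
use warnings;
use utf8;

# Stand-in for $converter->convert(): a hypothetical X-system replacer.
my %x_system = (
    Cx => 'Ĉ', cx => 'ĉ', Gx => 'Ĝ', gx => 'ĝ', Hx => 'Ĥ', hx => 'ĥ',
    Jx => 'Ĵ', jx => 'ĵ', Sx => 'Ŝ', sx => 'ŝ', Ux => 'Ŭ', ux => 'ŭ',
);
my $convert = sub {
    my ($word) = @_;
    $word =~ s/([CcGgHhJjSsUu]x)/$x_system{$1}/g;
    return $word;
};

# Split on whitespace (keeping the separators), and leave any token
# that contains :// or @ untouched.
sub convert_except_addresses {
    my ($sentence, $convert_word) = @_;
    return join q(),
        map { m{://|\@} ? $_ : $convert_word->($_) }
        split /(\s+)/, $sentence;
}

my $converted = convert_except_addresses(
    'Vizitu http://example.com/sxipo aux skribu al sxi@example.com pri la sxipo.',
    $convert,
);
```

The URL and the e-mail address keep their 'sx' sequences, while 'aux' and the final 'sxipo.' are converted.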

Accessors

sources

$source_notations_ref = $converter->sources;

Returns source notations as an array reference. If you want to get it as a list, you can use all_sources().

$source_notations_ref = $converter->sources(\@notations);

Accepts an array reference as source notations. You can use the same notations as in the new() constructor.

The return value is the same as when no argument is passed.

target

$target_notation = $converter->target;

Returns target notation as a scalar.

$target_notation = $converter->target($notation);

Accepts a string as target notation. You can use the same notations as in the new() constructor.

The return value is the same as when no argument is passed.

Utilities

all_sources

@all_source_notations = $converter->all_sources;

Returns source notations as a list. If you want to get it as an array reference, you can use sources().

add_sources

$source_notations_ref = $converter->add_sources(@adding_notations);

Adds the notations passed as a list to the source notations. You can use the same notations as in the new() constructor.

Returns source notations as an array reference.

remove_sources

$source_notations_ref = $converter->remove_sources(@removing_notations);

Removes the notations passed as a list from the source notations. You can use the same notations as in the new() constructor.

Returns the remaining source notations as an array reference.

At least one notation must remain after the removal. If you remove all notations, the converter throws an exception.

SEE ALSO

INCOMPATIBILITIES

None reported.

TO DO

  • More tests

  • Fewer dependencies

  • To provide an API to add user's notation

  • To correctly treat words such as flughaveno (flug/haven/o) in 'postfix_h' notation with user's lexicon

  • To correctly treat words such as ankaŭ in 'zamenhof' notation with user's lexicon

  • To release a Moose friendly class such as Lingua::EO::Orthography::Moosified

BUGS AND LIMITATIONS

No bugs have been reported.

Making suggestions and reporting bugs

Please report any bugs, feature requests, and ideas for improvements to <bug-lingua-eo-orthography at rt dot cpan dot org>, or through the web interface at http://rt.cpan.org/Public/Bug/Report.html?Queue=Lingua-EO-Orthography. I will be notified, and then you'll automatically be notified of progress on your bugs/requests as I make changes.

When reporting bugs, if possible, please add as small a sample as you can make of the code that produces the bug. And of course, suggestions and patches are welcome.

SUPPORT

You can find documentation for this module with the perldoc command.

% perldoc Lingua::EO::Orthography

The Esperanto edition of documentation is also available.

% perldoc Lingua::EO::Orthography::EO

You can also find the Japanese edition of documentation for this module with the perldocjp command from Pod::PerldocJp.

% perldocjp Lingua::EO::Orthography::JA

You can also look for information at:

RT: CPAN's request tracker

http://rt.cpan.org/Public/Dist/Display.html?Name=Lingua-EO-Orthography

AnnoCPAN: Annotated CPAN documentation

http://annocpan.org/dist/Lingua-EO-Orthography

Search CPAN

http://search.cpan.org/dist/Lingua-EO-Orthography

CPAN Ratings

http://cpanratings.perl.org/dist/Lingua-EO-Orthography

VERSION CONTROL

This module is maintained using git. You can get the latest version from git://github.com/gardejo/p5-lingua-eo-orthography.git.

CODE COVERAGE

I use Devel::Cover to test the code coverage of my tests. Below is the Devel::Cover summary report on this distribution's test suite.

---------------------------- ------ ------ ------ ------ ------ ------ ------
File                           stmt   bran   cond    sub    pod   time  total
---------------------------- ------ ------ ------ ------ ------ ------ ------
.../Lingua/EO/Orthography.pm  100.0  100.0  100.0  100.0  100.0  100.0  100.0
Total                         100.0  100.0  100.0  100.0  100.0  100.0  100.0
---------------------------- ------ ------ ------ ------ ------ ------ ------

AUTHOR

MORIYA Masaki, alias Gardejo

<moriya at cpan dot org>, http://ttt.ermitejo.com/

ACKNOWLEDGEMENTS

COPYRIGHT AND LICENSE

Copyright (c) 2010 MORIYA Masaki, alias Gardejo

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlgpl and perlartistic.

The full text of the license can be found in the LICENSE file included with this distribution.