NAME
Lingua::Deva - Convert between Latin and Devanagari Sanskrit text
SYNOPSIS
use v5.12.1;
use strict;
use utf8;
use charnames ':full';
use Lingua::Deva;
# Basic usage
my $d = Lingua::Deva->new();
say $d->to_latin('आसीद्राजा'); # prints 'āsīdrājā'
say $d->to_deva('Nalo nāma'); # prints 'नलो नाम'
# With configuration: strict, allow Danda, 'w' for 'v'
my %c = %Lingua::Deva::Maps::Consonants;
$d = Lingua::Deva->new(
strict => 1,
allow => [ "\N{DEVANAGARI DANDA}" ],
C => do { $c{'w'} = delete $c{'v'}; \%c },
);
say $d->to_deva('ziwāya'); # 'zइवाय', warning for 'z'
say $d->to_latin('सर्वम्।'); # 'sarvam।', no warnings
DESCRIPTION
Facilities for converting Sanskrit in Latin transliteration to Devanagari and vice versa. The principal interface is exposed through instances of the Lingua::Deva class. "Deva" is the four-letter ISO 15924 code for the Devanagari (devanāgarī) script.
Using the module is as simple as creating a Lingua::Deva instance and calling to_deva() or to_latin() with appropriate string arguments.
my $d = Lingua::Deva->new();
say $d->to_latin('कामसूत्र');
say $d->to_deva('Kāmasūtra');
The default translation maps adhere to the IAST transliteration scheme, but it is easy to customize these mappings. This is done by copying and modifying a map from Lingua::Deva::Maps and passing it to the Lingua::Deva constructor.
# Copy and modify the consonants map
my %c = %Lingua::Deva::Maps::Consonants;
$c{"c\x{0327}"} = delete $c{"s\x{0301}"};
# Pass a reference to the modified map to the constructor
my $d = Lingua::Deva->new( C => \%c );
Behind the scenes, all translation is done via an intermediate object representation called "aksara" (Sanskrit akṣara). These objects are instances of Lingua::Deva::Aksara, which provides an interface to inspect and manipulate individual aksaras.
# Create an array of aksaras
my $a = $d->l_to_aksara('Kāmasūtra');
# Print vowel in the fourth Aksara
say $a->[3]->vowel();
Having the intermediate Lingua::Deva::Aksara representation comes with a slight penalty in efficiency, but gives you the advantage of having aksara structure available for precise analysis and validation.
Methods
- new()
Constructor. Takes optional arguments which are described below.
strict => 0 or 1
In strict mode, warnings are issued for invalid input. Input is invalid if it is either not a Devanagari token (e.g. "q") or structurally ill-formed (e.g. a Devanagari diacritic vowel following an independent vowel).
Off by default.
allow => [ ... ]
In strict mode, the allow array can be used to exempt certain characters from being flagged as invalid even though they normally would be.
C => { consonants map }
V => { independent vowels map }
D => { diacritic vowels map }
F => { finals map }
Translation maps in the direction Latin to Devanagari.
DC => { consonants map }
DV => { independent vowels map }
DD => { diacritic vowels map }
DF => { finals map }
Translation maps in the direction Devanagari to Latin.
The default maps are in Lingua::Deva::Maps. To customize, make a copy of an existing mapping hash and pass it to one of these parameters. Note that the map keys need to be in Unicode NFD form (see Unicode::Normalize).
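The NFD requirement for map keys can be checked with the core Unicode::Normalize module. The following is a minimal sketch, not part of the module itself: it decomposes two precomposed letters into the base-plus-combining-mark form that a map key would need. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Precomposed "s with acute" (U+015B) decomposes into "s" + U+0301 COMBINING ACUTE ACCENT.
my $key = NFD("\x{015B}");

# Precomposed "c with cedilla" (U+00E7) decomposes into "c" + U+0327 COMBINING CEDILLA.
my $cedilla_key = NFD("\x{00E7}");

# Print the code points in hex so the output does not depend on the terminal.
for my $k ($key, $cedilla_key) {
    print join(' ', map { sprintf('U+%04X', ord($_)) } split //, $k), "\n";
}
```

Passing a key through NFD() before storing it in a copied map avoids entries that look right on screen but never match the normalized input.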
- l_to_tokens()
Converts a string of Latin characters into "tokens" and returns a reference to an array of tokens. A "token" is either a character sequence which may constitute a single Devanagari grapheme or a single non-Devanagari character.
my $t = $d->l_to_tokens("Bhārata\n"); # $t now refers to the array ['Bh','ā','r','a','t','a',"\n"]
The input string will be normalized (NFD). No chomping takes place. Upper case and lower case distinctions are preserved.
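The behaviour described above can be illustrated with a rough sketch. It is not the module's implementation: the token set is a small hypothetical one, and tokenize_sketch() is an illustrative name. Each candidate token is extended one character at a time for as long as the extended string is still a known token; anything else passes through as a single character.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical token set in NFD form; every prefix of a longer token is also a token.
my %is_token = map { $_ => 1 }
    "r", "r\x{0323}", "r\x{0323}\x{0304}", "b", "bh", "t", "a", "a\x{0304}";

# Split a Latin string into tokens, growing each candidate while it stays a token.
sub tokenize_sketch {
    my ($text) = @_;
    my @chars = split //, NFD($text);   # normalize to NFD before scanning
    my @tokens;
    my $i = 0;
    while ($i < @chars) {
        my $token = $chars[$i++];
        # Look up in lower case, but keep the original case in the result.
        while ($i < @chars && $is_token{ lc($token . $chars[$i]) }) {
            $token .= $chars[$i++];
        }
        push @tokens, $token;
    }
    return \@tokens;
}

my $tokens = tokenize_sketch("Bh\x{0101}rata\n");
print scalar(@$tokens), " tokens\n";
```

For the input used in the example above, this sketch yields the same seven tokens, with the trailing newline passed through as a token of its own.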
Technical note: This is not a general-purpose tokenizer. A token consisting of more than one element is only correctly recognized if all preceding subsequences are also tokens. For the token "abc" to be recognized, both "ab" and "a" need to be tokens as well. Fortunately, the decomposed tokens in IAST transliteration do fulfil this property:
"r\x{0323}\x{0304}" "r\x{0323}" "r"
- l_to_aksara()
Converts its argument into "aksaras" and returns a reference to an array of aksaras (see Lingua::Deva::Aksara). The argument can be a Latin string, or a reference to an array of tokens.
# is() and done_testing() are provided by Test::More
my $a = $d->l_to_aksara('hyaḥ');
is( ref($a->[0]), 'Lingua::Deva::Aksara', 'one aksara object' );
done_testing();
Input tokens which cannot be part of an aksara are passed through untouched. This means that the resulting array can contain both aksara objects and separate tokens.
In strict mode, warnings are issued for invalid tokens.
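Because the returned array can mix aksara objects with plain string tokens, callers typically need to separate the two before calling aksara methods. One possible approach, sketched here with the core Scalar::Util module, relies on objects being blessed references while passed-through tokens are plain strings. To keep the snippet runnable on its own, the array is built by hand with a stand-in class (My::StandIn, hypothetical); in real use it would come from l_to_aksara().

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hand-built stand-in for the array returned by l_to_aksara(); My::StandIn is a
# hypothetical placeholder for Lingua::Deva::Aksara objects.
my $mixed = [ bless({}, 'My::StandIn'), ' ', '4', bless({}, 'My::StandIn') ];

# Objects are blessed references; passed-through tokens are plain strings.
my @aksaras = grep {  blessed($_) } @$mixed;
my @others  = grep { !blessed($_) } @$mixed;

printf "%d aksara-like objects, %d pass-through tokens\n",
    scalar @aksaras, scalar @others;
```

Filtering this way before a loop that calls accessor methods avoids a "Can't call method" error on a plain string.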
- d_to_aksara()
Converts a Devanagari string into "aksaras" and returns a reference to an array of aksaras.
my $text = 'बुद्धः';
my $a = $d->d_to_aksara($text);
my $o = $a->[1]->onset();
# $o now refers to the array ['d','dh']
Input tokens which cannot be part of an aksara are passed through untouched. This means that the resulting array can contain both aksara objects and separate tokens.
In strict mode, warnings are issued for invalid tokens.
- to_deva()
Converts a Latin string or a reference to an array of aksaras to a Devanagari string.
say $d->to_deva('Kāmasūtra');
# same as
my $a = $d->l_to_aksara('Kāmasūtra');
say $d->to_deva($a);
Aksaras are assumed to be well-formed.
- to_latin()
Converts a Devanagari string or a reference to an array of aksaras to an equivalent string in Latin transliteration.
Aksaras are assumed to be well-formed.
EXAMPLES
The synopsis gives the simplest usage patterns. Here are a few more.
To use "ring below" instead of "dot below" for syllabic r:
my %v = %Lingua::Deva::Maps::Vowels;
$v{"r\x{0325}"} = delete $v{"r\x{0323}"};
$v{"r\x{0325}\x{0304}"} = delete $v{"r\x{0323}\x{0304}"};
my %d = %Lingua::Deva::Maps::Diacritics;
$d{"r\x{0325}"} = delete $d{"r\x{0323}"};
$d{"r\x{0325}\x{0304}"} = delete $d{"r\x{0323}\x{0304}"};
my $d = Lingua::Deva->new( V => \%v, D => \%d );
say $d->to_deva('Kr̥ṣṇa');
Use the aksara objects to produce simple statistics.
# Count occurrences of each rhyme in @aksaras
my %rhymes;
for my $a (grep { defined $_->get_rhyme() } @aksaras) {
    $rhymes{ join '', @{$a->get_rhyme()} }++;
}
# Print number of 'au' rhymes
say $rhymes{'au'};
The following script converts a Latin input file "in.txt" to Devanagari.
#!/usr/bin/env perl
use v5.12.1;
use strict;
use warnings;
use open ':encoding(UTF-8)';
use Lingua::Deva;
open my $in, '<', 'in.txt' or die "Cannot open in.txt: $!";
open my $out, '>', 'out.txt' or die "Cannot open out.txt: $!";
my $d = Lingua::Deva->new();
while (my $line = <$in>) {
    print $out $d->to_deva($line);
}
close $out or die "Cannot close out.txt: $!";
On a Unicode-capable terminal one-liners are also possible:
echo 'Himālaya' | perl -CS -MLingua::Deva -e 'print Lingua::Deva->new()->to_deva(<>);'
DEPENDENCIES
There are no requirements apart from standard Perl modules.
Note that Perl 5.12 or later is required for its Unicode support.
AUTHOR
glts <676c7473@gmail.com>
BUGS
Report bugs to the author or at https://github.com/glts/Lingua-Deva
COPYRIGHT
This program is free software. You may copy or redistribute it under the same terms as Perl itself.
Copyright (c) 2012 by glts <676c7473@gmail.com>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.12.1 or, at your option, any later version of Perl 5 you may have available.