NAME

Lingua::Interset - DZ Interset is a universal morphosyntactic feature set to which all tagsets of all corpora/languages can be mapped.

VERSION

version 3.000

SYNOPSIS

use Lingua::Interset qw(decode encode);

my $tag1 = 'NN'; # in the English Penn Treebank, "NN" means "noun"
my $feature_structure = decode('en::penn', $tag1);
print($feature_structure->as_string(), "\n");
$feature_structure->set_number('plur');
my $tag2 = encode('en::penn', $feature_structure);
print("$tag2\n");

DESCRIPTION

DZ Interset is a universal framework for reading, writing, converting and interpreting part-of-speech and morphosyntactic tags from multiple tagsets of many different natural languages.

Individual tagsets are mapped to Interset using specialized modules called tagset drivers. Every driver must implement three methods: decode, encode and list.

The main module, Lingua::Interset, provides parameterized access to the drivers and their methods. Instead of loading particular driver modules (which would require knowing in advance which tagsets your program will work with), you specify the tagset by passing its identifier as a parameter. Tagset ids are derived from Perl package names but they are always all-lowercase. Most tagsets are tailored to one language, and their id has two components separated by ::, namely the ISO 639 code of the language and a part that distinguishes the various tagsets for that language. This second component may be, for example, an abbreviated name of the corpus where the tagset is used.
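The relation between driver package names and tagset ids can be sketched as follows. This is a minimal illustration, not code from Lingua::Interset: the helper names tagset_id_from_package() and split_tagset_id() are hypothetical, and the sketch assumes that an id is obtained by dropping the Lingua::Interset::Tagset:: prefix and lowercasing the rest.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::Interset: derive a tagset id
# from a driver package name by dropping the assumed common prefix and
# lowercasing what remains.
sub tagset_id_from_package
{
    my $package = shift;
    my $prefix = 'Lingua::Interset::Tagset::';
    # Return undef for packages that do not live under the prefix.
    return unless index($package, $prefix) == 0;
    return lc(substr($package, length($prefix)));
}

# Hypothetical helper: split a two-component id into the language code
# and the part that identifies the tagset within that language.
sub split_tagset_id
{
    my $id = shift;
    my ($language, $name) = split(/::/, $id, 2);
    return ($language, $name);
}

my $id = tagset_id_from_package('Lingua::Interset::Tagset::EN::Penn');
my ($language, $name) = split_tagset_id($id);
print("$id => language '$language', tagset '$name'\n");
```

For the package above, the sketch yields the id en::penn, which is the form expected by decode(), encode() and the other functions described below.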

More information is given at the DZ Interset project page, https://wiki.ufal.ms.mff.cuni.cz/user:zeman:interset.

FUNCTIONS

decode()

my $fs  = decode ('en::penn', 'NNS');

A generic interface to the decode() method of Lingua::Interset::Tagset. Takes a tagset id and a tag in that tagset. Returns a Lingua::Interset::FeatureStructure object with the corresponding feature values set.

encode()

my $fs  = decode ('en::penn', 'NNS');
my $tag = encode ('en::conll', $fs);

A generic interface to the encode() method of Lingua::Interset::Tagset. Takes a tagset id and a Lingua::Interset::FeatureStructure object. Returns the tag in the given tagset that corresponds to the feature values. Note that some features may be ignored because they cannot be represented in the given tagset.

encode_strict()

my $fs  = decode ('en::penn', 'NNS');
my $tag = encode_strict ('en::conll', $fs);

A generic interface to the encode_strict() method of Lingua::Interset::Tagset. Takes tagset id and a feature structure (Lingua::Interset::FeatureStructure). Returns a tag of the identified tagset that matches the contents of the feature structure.

Unlike encode(), encode_strict() always returns a known tag, i.e. one that is returned by the list() method of the Tagset object. Many tagsets consist of structured tags, i.e. they can be defined as a compact representation of a feature structure (a set of attribute-value pairs). With encode(), it is in principle possible to encode combinations of features and values that did not appear in the original tagset. For example, a tagset for Czech is unlikely to contain a tag saying that a word is a preposition and at the same time setting a non-empty value for gender. Yet it is possible to create such a tag because the tagset encodes part of speech and gender independently.

If this behavior is undesirable, the application should call encode_strict() instead of encode(). The resulting tag is then guaranteed to be one of those returned by list(). Nevertheless, think twice about whether you really need the guarantee, as it does not come for free. The need to replace forbidden feature values with permitted ones may sometimes lead to surprising or confusing results.
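The difference between permissive and strict encoding can be illustrated with a toy model. Everything in the sketch below is invented for illustration: the two-character tagset, the functions toy_encode() and toy_encode_strict(), and the fallback rule (take the first known tag with the same part of speech). It is not how Lingua::Interset implements encode_strict(); it only shows why a strict encoder must sometimes replace feature values.

```perl
use strict;
use warnings;

# Toy tagset, invented for illustration: one character for the part of
# speech (N = noun, R = preposition) and one for gender (M, F, - = none).
my @known_tags = ('NM', 'NF', 'R-');
my %is_known = map { $_ => 1 } @known_tags;

# Permissive encoding: part of speech and gender are encoded
# independently, so the result need not be a known tag.
sub toy_encode
{
    my ($pos, $gender) = @_;
    my %pos_code = ('noun' => 'N', 'adp' => 'R');
    my %gender_code = ('masc' => 'M', 'fem' => 'F', '' => '-');
    return $pos_code{$pos} . $gender_code{$gender};
}

# Strict encoding (assumed fallback rule): if the permissive result is
# not a known tag, return the first known tag with the same part of speech.
sub toy_encode_strict
{
    my ($pos, $gender) = @_;
    my $tag = toy_encode($pos, $gender);
    return $tag if $is_known{$tag};
    my $pos_char = substr($tag, 0, 1);
    my ($fallback) = grep { substr($_, 0, 1) eq $pos_char } @known_tags;
    return $fallback;
}

print(toy_encode('adp', 'fem'), "\n");        # RF, not a known tag
print(toy_encode_strict('adp', 'fem'), "\n"); # R-, gender dropped
print(toy_encode_strict('noun', ''), "\n");   # NM, gender invented
```

The last call shows the cost of the guarantee: a noun without gender has no tag of its own in the toy tagset, so the strict encoder returns a masculine noun tag, which adds information that was never in the feature structure.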

list()

my $list_of_tags = list ('en::penn');

A generic interface to the list() method of Lingua::Interset::Tagset. Takes a tagset id and returns a reference to the list of all known tags of that tagset. This is not directly needed to decode, encode or convert tags but it is very useful for testing and for advanced operations over the tagset. Note, however, that many tagset drivers contain only an approximate list, created by collecting tag occurrences in some corpus.

get_driver_object()

my $driver = get_driver_object ('en::penn');

A generic accessor to installed Interset drivers of tagsets. Takes tagset id and returns a Lingua::Interset::Tagset object.

The objects are cached and if you call this function several times for the same tagset, you will always get the reference to the same object. Tagset objects do not have variable state, so it probably does not make sense to have several different driver objects for the same tagset. If you want to get a different object, you must call new(), e.g. Lingua::Interset::Tagset::EN::Penn->new().

find_drivers()

my $list_of_drivers = find_drivers ();

This function searches relevant folders in @INC for installed Interset drivers for tagsets. It looks both for the new Interset 2 drivers (e.g. Lingua::Interset::Tagset::EN::Penn) and for the old Interset 1 drivers (e.g. tagset::en::penn). It returns a reference to an array of hash references. Every hash in the list contains the following fields (here with example values):

my %record =
(
    'old'     => 0, # 1 or 0 ... old or new driver?
    'tagset'  => 'en::penn', # tagset id
    'package' => 'Lingua::Interset::Tagset::EN::Penn', # this is what you 'use' or 'require' in your code
    'path'    => '/home/zeman/perl5/lib/Lingua/Interset/Tagset/EN/Penn.pm' # path where it is installed
);

Note that you may find more than one package for the same tagset id. This function will list all of them. When you ask Interset to do something with a tagset (e.g. decode ('en::penn', $tag)), Interset will select one of the available packages for you. It will prefer new drivers over the old ones. If you have two old or two new drivers, their priority will be decided by Perl and it should correspond to the order of your $PERL5LIB environment variable. To avoid confusion, it is recommended that you have each package installed only once.

hash_drivers()

my $hash_of_drivers = hash_drivers ();

Returns the set of all known tagset drivers indexed by tagset id. The elements are hashes themselves, with the same record structure as returned by find_drivers(). Unlike find_drivers(), here the records are organized in a hash instead of a list, and only one driver per tagset is present. If there are two drivers installed for the same tagset, the one that appears earlier in @INC (or in the PERL5LIB environment variable) is returned. Exception: Interset 2.0 and newer drivers are preferred over the old ones.
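The selection rule described above can be sketched in plain Perl. The function select_drivers() and the sample records are hypothetical; they only mimic the record structure documented for find_drivers(). The sketch assumes that the input list is ordered the same way as @INC, so an earlier record corresponds to an earlier folder.

```perl
use strict;
use warnings;

# Hypothetical records shaped like the output of find_drivers().
# The list order is assumed to mirror the order of @INC.
my @records =
(
    { 'old' => 1, 'tagset' => 'en::penn', 'package' => 'tagset::en::penn',
      'path' => '/a/tagset/en/penn.pm' },
    { 'old' => 0, 'tagset' => 'en::penn', 'package' => 'Lingua::Interset::Tagset::EN::Penn',
      'path' => '/b/Lingua/Interset/Tagset/EN/Penn.pm' },
    { 'old' => 0, 'tagset' => 'en::penn', 'package' => 'Lingua::Interset::Tagset::EN::Penn',
      'path' => '/c/Lingua/Interset/Tagset/EN/Penn.pm' },
    { 'old' => 1, 'tagset' => 'cs::pdt', 'package' => 'tagset::cs::pdt',
      'path' => '/a/tagset/cs/pdt.pm' },
);

# Hypothetical helper: keep one record per tagset id. The first record
# seen wins, except that a new driver replaces a previously seen old one.
sub select_drivers
{
    my @candidates = @_;
    my %selected;
    foreach my $record (@candidates)
    {
        my $current = $selected{$record->{tagset}};
        if(!defined($current) || ($current->{old} && !$record->{old}))
        {
            $selected{$record->{tagset}} = $record;
        }
    }
    return \%selected;
}

my $drivers = select_drivers(@records);
foreach my $id (sort(keys(%{$drivers})))
{
    print("$id => $drivers->{$id}{path}\n");
}
```

With this sample data, en::penn resolves to the new driver under /b (the first new driver in the list), while cs::pdt resolves to its only available driver, which is an old one.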

find_tagsets()

my @list_of_tagset_ids = find_tagsets ();

This function uses find_drivers() and further processes its output. It returns the list of tagset ids for which there is a driver installed. The user can then pass these ids to get_driver_object().

SEE ALSO

Lingua::Interset::FeatureStructure, Lingua::Interset::Tagset

AUTHOR

Dan Zeman <zeman@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

This software is copyright (c) 2017 by Univerzita Karlova (Charles University).

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.