NAME
DigLib::Thesaurus - Perl extension for managing an ISO thesaurus
SYNOPSIS
use DigLib::Thesaurus;
$obj = thesaurusNew();
$obj = thesaurusLoad('iso-file');
$obj = thesaurusRetrieve('storable-file');
$obj->save('iso-file');
$obj->storeOn('storable-file');
$obj->addTerm('term');
$obj->addRelation('term','relation','term1',...,'termn');
$obj->deleteTerm('term');
$obj->describe('Relation','description');
$obj->addInverse('Relation1','Relation2');
$html = $obj->navigate(+{configuration},%parameters);
$html = $obj->getHTMLTop();
$obj->dt('term', %handler);
$obj->full_dt(%handler);
$obj->complete();
$obj->tc('term', 'relation1', 'relation2');
$obj->depth_first('term', 2, "NT", "UF");
DESCRIPTION
A thesaurus is a classification structure. It can be seen as a graph where the nodes are terms and the edges are relations between terms.
This module provides transparent methods to maintain thesaurus files. It uses a subset of ISO 2788, which defines some standard features to be found in thesaurus files. The standard includes a set of relations, but this module also accepts user-defined ones, so it can be used with both ISO and non-ISO thesaurus files.
File Structure
Thesaurus files used with this module are standard ASCII documents. A file can contain processing instructions, comments and term definitions. The instructions area is used to define new relations and the mathematical properties between them.
The file has the following structure:
     ______________
    |              |
    |    HEADER    | --> Can contain only processing instructions,
    |______________|     comments or empty lines.
    |              |
    |  Def Term 1  | --> Term definitions should be separated
    |              |     from each other by an empty line.
    |  Def Term 2  |
    |              |
    |    .....     |
    |              |
    |  Def Term n  |
    |______________|
Comments can appear on any line, but the comment character (#) must be the first character of the line, with no spaces before it. A comment extends to the end of the line.
Processing instruction lines must likewise start at the first column, with the percent sign (%). These instructions are described later in this document.
A term definition cannot contain empty lines, because empty lines separate definitions from each other. The first line of a term definition record holds the term being defined. The following lines define relations with other terms: each starts with the abbreviation of the relation (in upper case) followed by spaces, and then a comma-separated list of terms.
The same relation can appear on more than one line; the module concatenates the lists. To continue a list on the next line, either repeat the relation abbreviation or leave some spaces between the start of the line and the list of terms.
Here is an example:
    Animal
    NT cat, dog, cow
       fish, ant
    NT camel
    BT Life being

    cat
    BT Animal
    SN domestic animal to be kicked when
       anything bad occurs.
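The record rules above (empty-line separators, an upper-case relation abbreviation, continuation lines that start with spaces) can be illustrated with a short parser. This is only a sketch of the format, not the module's actual parser: `parse_terms` is a hypothetical helper, and it treats every relation as a comma-separated list of terms, so free-text relations such as SN would need separate handling.

```perl
use strict;
use warnings;

# Hypothetical helper: parse term definition records from a string.
# Returns a hash reference of the form term => { RELATION => [terms...] }.
sub parse_terms {
    my ($text) = @_;
    my %thesaurus;
    # Records are separated by one or more empty lines.
    for my $record (split /\n\s*\n/, $text) {
        # Drop comment lines and blank leftovers.
        my @lines = grep { !/^#/ && /\S/ } split /\n/, $record;
        next unless @lines;
        my $term = shift @lines;        # first line is the defined term
        $term =~ s/^\s+|\s+$//g;
        my $rel;                        # relation currently being filled
        for my $line (@lines) {
            my $list = $line;
            if ($line =~ /^([A-Z]+)\s+(.*)$/) {
                ($rel, $list) = ($1, $2);   # new or repeated relation
            }
            elsif ($line =~ /^\s+(.*)$/) {
                $list = $1;                 # continuation line
            }
            next unless defined $rel;
            for my $item (split /,/, $list) {
                $item =~ s/^\s+|\s+$//g;
                push @{ $thesaurus{$term}{$rel} }, $item if length $item;
            }
        }
    }
    return \%thesaurus;
}

my $data = parse_terms(<<'END');
Animal
NT cat, dog, cow
   fish, ant
NT camel
BT Life being

cat
BT Animal
END

print join(", ", @{ $data->{Animal}{NT} }), "\n";
```

The two NT lines and the indented continuation line all end up in the same list, which mirrors the concatenation behaviour described above.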
A special term, _top_, can also be defined. Use it when you want a top of the tree for thesaurus navigation. The _top_ term is normally defined with the most interesting terms to navigate.
The ISO subset used is:
- TT - Top Term
  The broadest term that can be defined for the current term.
- NT - Narrower Term
  Terms more specific than the current term.
- BT - Broader Term
  Terms more generic than the current term.
- USE - Synonym
  Other options when looking for a synonym.
- UF - Quasi-Synonym
  Terms that are not synonyms of the current term but can sometimes be used with that meaning.
- RT - Related Term
  A related term that does not fit in any other category.
- SN - Scope Note
  Text. A note on the context of the current term. Use it for definitions or comments about the scope in which you are using the term.
Processing Instructions
Processing instructions, as said before, are written on a line starting with the percent sign. Current commands are:
- top
  When presenting a thesaurus, we need to know which term to start from. Normally we want the thesaurus to have some kind of top level from which navigation starts. This command specifies that term: the one used when no term is specified.
  Example:
      %top Contents

      Contents
      NT Biography ...
      RT ...
- inverse
  This command defines the mathematical inverse of a relation. That is, if you define
      inverse A B
  and you know that foo is related by A with bar, then bar is related by B with foo.
  Example:
      %inv BT NT
      %inverse UF USE
- description
  This command defines a description for a relation class. These descriptions are used when the thesaurus is output as HTML.
  Example:
      %desc SN Note of Scope
      %description IOF Instance of
  If you are building a multilingual thesaurus, you will want to translate the relation class descriptions. To do this, use the description command with the language code in brackets after the command name:
      %desc[PT] SN Nota de Contexto
      %description[PT] IOF Instância de
- externals
  This command defines classes that do not relate terms to each other but instead relate a term to some text (a scope note, a URL, etc.). It can be used like this:
      %ext SN URL
      %externals SN URL
  Note that you can specify more than one relation type per line.
- languages
  This command allows the construction of a multilingual thesaurus. To specify language classifiers (like PT, EN, FR, and so on) you can use either of these lines:
      %lang PT EN FR
      %languages PT EN FR
  To describe the language names (as a legend), use the description command. For example, you could append:
      %description PT Portuguese
      %description EN English
      %description FR French
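To make the header syntax concrete, here is a minimal sketch that collects the processing instructions described above into a configuration hash. `parse_instructions` and the layout of the returned hash are hypothetical, chosen for illustration; recording an inverse pair in both directions is an assumption, not something this document states.

```perl
use strict;
use warnings;

# Hypothetical helper: collect processing instructions from header lines.
sub parse_instructions {
    my (@lines) = @_;
    my %conf = (inverse => {}, desc => {}, externals => [], languages => []);
    for my $line (@lines) {
        # %command or %command[LANG], followed by its arguments.
        next unless $line =~ /^%(\w+)(?:\[(\w+)\])?\s+(.*?)\s*$/;
        my ($cmd, $lang, $args) = ($1, $2, $3);
        if ($cmd eq 'top') {
            $conf{top} = $args;
        }
        elsif ($cmd =~ /^inv(?:erse)?$/) {
            my ($r1, $r2) = split ' ', $args;
            # Assumption: the pairing is recorded in both directions.
            $conf{inverse}{$r1} = $r2;
            $conf{inverse}{$r2} = $r1;
        }
        elsif ($cmd =~ /^desc(?:ription)?$/) {
            my ($rel, $text) = split ' ', $args, 2;
            # Descriptions without a language tag go under '_default'.
            my $key = defined $lang ? $lang : '_default';
            $conf{desc}{$key}{$rel} = $text;
        }
        elsif ($cmd =~ /^ext(?:ernals)?$/) {
            push @{ $conf{externals} }, split ' ', $args;
        }
        elsif ($cmd =~ /^lang(?:uages)?$/) {
            push @{ $conf{languages} }, split ' ', $args;
        }
    }
    return \%conf;
}

my $conf = parse_instructions(
    '%top Contents',
    '%inv BT NT',
    '%desc SN Note of Scope',
    '%desc[PT] SN Nota de Contexto',
    '%ext SN URL',
    '%languages PT EN FR',
);

print "inverse of NT: $conf->{inverse}{NT}\n";
print "SN in PT: $conf->{desc}{PT}{SN}\n";
```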
I18N
The internationalization functions, interfaceLanguages and interfaceSetLanguage, should be called before any other function or constructor. Note that when a saved thesaurus is loaded, the descriptions defined in that file are not translated. This is important!
interfaceLanguages()
This function returns the list of languages supported by the current version of the module.
interfaceSetLanguage( <lang-name> )
This function selects the specified language, so it should be the first function you call when using this module. The default is Portuguese. Future versions may change this, so you should call it anyway.
API
This module uses Perl's object-oriented programming model, so you must first create an object with one of the thesaurusNew, thesaurusLoad or thesaurusRetrieve functions. All other commands are then called as methods on that object.
Constructors
thesaurusNew
Creates an empty thesaurus object. The newly created object contains the inversion properties of the ISO classes and some stub descriptions for those classes.
thesaurusLoad
The thesaurusLoad function requires a file name, which should refer to an ISO ASCII file as described in the earlier sections. It returns an object with the contents of the file. If the file does not define relations and descriptions for the ISO classes, they are added.
thesaurusRetrieve
Reading and parsing a text file on every run is slow, so this module can save thesauri to, and load them from, Storable files. This function receives the name of a file that was saved with the storeOn method.
Methods
save
This method dumps the object to an ISO ASCII file; you must supply a file name. Note that the sequence thesaurusLoad, save is not the identity function: comments are removed and processing instructions can be added.
Note: if the process fails, this method returns 0. Any other method dies when it fails to save to a file.
storeOn
This method saves the thesaurus object in Storable format. Use it when you want to load the thesaurus later with the thesaurusRetrieve function.
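For readers unfamiliar with Storable, the following sketch shows the general store and retrieve round trip using the core Storable module directly. The hash used here is an illustrative assumption and is not the module's internal object layout; storeOn and thesaurusRetrieve presumably wrap a similar pair of calls, but that is not confirmed by this document.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Illustrative data only; the real object layout is not documented here.
my %thesaurus = (
    Animal => { NT => [ 'cat', 'dog' ] },
    cat    => { BT => [ 'Animal' ] },
);

# Temporary file that is removed when the program exits.
my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;

store(\%thesaurus, $file);      # binary dump of the data structure
my $copy = retrieve($file);     # reference to the restored structure

print join(", ", @{ $copy->{Animal}{NT} }), "\n";
```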
addTerm
You can add term definitions through the Perl API. This method adds a term to the thesaurus. Note that if the term already exists, all its relations will be deleted.
addRelation
Use this method to add relations to a term. It can be called repeatedly; previously inserted relations are not deleted. It accepts a list of terms for the relation, as in:
$obj->addRelation('Animal','NT','cat','dog','cow','camel');
deleteTerm
Use this method to remove the supplied term. Note that all references to it in the thesaurus will be deleted.
describe
You can use this method to describe some relation class. You can use it to change the description of an existing class (like the ISO ones) or to define a new class.
addInverse
Use this method to declare the inversion property between two relation classes. Note that any previous inversion property involving either relation will be deleted. If either relation does not exist, it will be added.
navigate
This method works as a kind of CGI script packaged as an object method. You must supply an associative array of CGI parameters. It prints an HTML view of the thesaurus for web navigation.
A typical thesaurus navigation CGI script is:
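#!/usr/bin/perl -w
use CGI qw/:standard/;
use DigLib::Thesaurus;

print header;
# Collect the CGI parameters into a hash
my %arg;
for (param()) { $arg{$_} = param($_) }
# Load the thesaurus and print the HTML navigation page
my $thesaurus = thesaurusLoad("thesaurus_file");
$thesaurus->navigate(%arg);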
This method can receive, as its first argument, a reference to an associative array with configuration variables, such as which relations to expand and which language to use by default.
So, in the last example we could write
$thesaurus->navigate(+{expand => ['NT', 'USE'],
                       lang   => 'EN'}, %arg);
meaning that the structure should show two levels of 'NT' and 'USE' relations, and that it should use the English language.
complete
This method completes the thesaurus according to the inversion properties. It is only needed after adding terms and relations through this API; a thesaurus loaded from an ISO file is completed automatically.
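The idea of completion can be sketched on a plain hash: for every stored relation that has a declared inverse, the reverse link is added to the target term. `complete_relations` and the data layout (term => { RELATION => [terms] }) are hypothetical and only illustrate the concept; they are not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: add the inverse of every known relation.
# $thes maps term => { RELATION => [terms] }; $inv maps RELATION => INVERSE.
sub complete_relations {
    my ($thes, $inv) = @_;
    for my $term (keys %$thes) {
        for my $rel (keys %{ $thes->{$term} }) {
            my $back = $inv->{$rel} or next;    # relation has no inverse
            for my $other (@{ $thes->{$term}{$rel} }) {
                my $list = $thes->{$other}{$back} ||= [];
                # Avoid duplicate entries when the link already exists.
                push @$list, $term unless grep { $_ eq $term } @$list;
            }
        }
    }
    return $thes;
}

my %thes = (Animal => { NT => [ 'cat', 'dog' ] });
my %inv  = (NT => 'BT', BT => 'NT');

complete_relations(\%thes, \%inv);
print "cat BT: $thes{cat}{BT}[0]\n";
```

After the call, both cat and dog carry a BT link back to Animal, even though only the NT side was entered.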
dt and full_dt
The dt method processes the information of a single term. It takes a term and an associative array of anonymous subroutines, one for each relation class to be processed. Example:
$the->dt("frog", {NT => sub{ #Do nothing
},
-default => sub{ print "$class", join(",",@terms) }});
The full_dt method does not receive a term and calls the dt method for all terms in the thesaurus.
depth_first
The depth_first method returns the terms related to $term by the relations @r, up to level $lev:
$the->depth_first($term ,$lev, @r)
$the->depth_first("frog", 2, "NT","UF")
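A depth-limited traversal of this kind can be sketched as a small recursive function. The helper name `walk`, the plain-hash data layout and the nested-hash return value are all assumptions made for illustration; the actual return format of depth_first is not specified here.

```perl
use strict;
use warnings;

# Hypothetical helper: follow relations @rels from $term, at most $level deep.
# Returns a nested hash: { RELATION => { child => subtree, ... }, ... }.
sub walk {
    my ($thes, $term, $level, @rels) = @_;
    return {} if $level <= 0;
    my %tree;
    my $node = $thes->{$term} || {};
    for my $rel (@rels) {
        for my $child (@{ $node->{$rel} || [] }) {
            # Recurse with one level less, so cycles cannot loop forever.
            $tree{$rel}{$child} = walk($thes, $child, $level - 1, @rels);
        }
    }
    return \%tree;
}

my %thes = (
    Animal     => { NT => [ 'vertebrate' ] },
    vertebrate => { NT => [ 'frog' ] },
    frog       => { UF => [ 'toad' ] },
);

my $tree = walk(\%thes, 'Animal', 2, 'NT', 'UF');
print join(", ", sort keys %{ $tree->{NT}{vertebrate}{NT} }), "\n";
```

With a level of 2, frog is reached but its own UF link to toad is not followed.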
tc
transitive closure
The tc method computes the transitive closure of the relations @r, starting from the term $term:
$the->tc($term , @r)
$the->tc("frog", "NT","UF")
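As an illustration of what a transitive closure over a set of relations means, the sketch below does a breadth-first search on a plain hash. `closure` is a hypothetical helper, and leaving the starting term out of the result is an assumption; the module's tc method may behave differently.

```perl
use strict;
use warnings;

# Hypothetical helper: all terms reachable from $term through relations @rels.
# Assumption: the starting term itself is not part of the result.
sub closure {
    my ($thes, $term, @rels) = @_;
    my %seen  = ($term => 1);
    my @queue = ($term);
    my @found;
    while (defined(my $current = shift @queue)) {
        my $node = $thes->{$current} or next;   # term has no definition
        for my $rel (@rels) {
            for my $next (@{ $node->{$rel} || [] }) {
                next if $seen{$next}++;         # visit each term only once
                push @found, $next;
                push @queue, $next;
            }
        }
    }
    return @found;
}

my %thes = (
    Animal     => { NT => [ 'vertebrate' ] },
    vertebrate => { NT => [ 'frog' ] },
    frog       => { UF => [ 'toad' ], BT => [ 'vertebrate' ] },
);

print join(", ", closure(\%thes, 'Animal', 'NT', 'UF')), "\n";
```

The `%seen` hash keeps the search finite even when relations form cycles, which is common once inverse relations such as BT and NT are both present.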
terms
The terms method returns all the terms related to $term by the relations @r:
$the->terms($term , @r)
$the->terms("frog", "NT","UF")
AUTHORS
Alberto Simões, <albie@alfarrabio.um.geira.pt>
José João Almeida, <jj@di.uminho.pt>
Sara Correia, <sara.correia@portugalmail.com>
This module is included in the Natura project. You can visit it at http://natura.di.uminho.pt.
SEE ALSO
The example thesaurus file (examples/thesaurus), DigLib::MLang(3) and perl(1) manpages.