NAME
Biblio::WebPortal - Perl extension for Digital Library support
SYNOPSIS
use Biblio::WebPortal;
$a = mkdiglib($conf);
$a->search( term => 'animal', regexp => 'water' );
$a->asHTML();
$a->asLaTeX();
print $diglib->navigate(%vars);
DESCRIPTION
Biblio::WebPortal uses Biblio::Thesaurus and a configuration file to manage digital libraries in a simple way. For this purpose, we define a digital library as a set of searchable catalogs plus an ontology for their subject. The Biblio::WebPortal configuration file lists the catalogs together with the information needed to parse each one.
For this to be possible, there must be some way to access any kind of catalog: a plain text file, an XML document, an SQL database or anything else. The approach taken is to define functions that convert these implementation techniques into a mathematical definition. So the user must supply four functions to this module before it can use a catalog. These functions are:
- split the catalog
-
Given a string (say, a catalog identifier), the function should return a Perl array with all catalog entries. The array should be the same every time the function is called for the same catalog, so that entries can be indexed by position. The function can treat the string as a filename, an SQL table identifier, or anything else it understands.
- terms for an entry
-
Given an entry in the format returned by the previous function, this function should return a list of terms related to the object catalogued by this entry. These terms will be used later for thesaurus integration.
- html from the entry
-
Given an entry, return a piece of HTML code to be embedded when listing records.
- text from the entry
-
Given an entry, return the searchable text it includes.
The following example shows a sample configuration file:
    $userconf = {
        catalog   => "/var/library/catalog.xml",
        thesaurus => "/var/library/thesaurus",
        name      => 'libraryName',
        catsyn    => {
            asList      => sub{ my $file=shift;
                                my $t=`cat $file`;
                                return ($t =~ m{(<entry.*?</entry>)}gs); },
            asRelations => sub{ my $f=shift;
                                my $data;
                                while($f =~ m{<rel\s+tipo='(.*?)'>(.*?)</rel>}g)
                                  { push @{$data->{$1}}, $2; }
                                $data; },
            asHTML      => sub{ my $f=shift; &mp::cat::fichacat2html($f)},
            asLaTeX     => sub{ ... },
            asText      => sub{ my $f=shift;
                                $f =~ s{</?\w+}{ }g;
                                $f =~ s/(\s*[\n>"'])+\s*/,/g;
                                $f =~ s/\w+=//g;
                                $f =~ s/\s{2,}/ /g;
                                $f } } };
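To make the role of the callbacks more concrete, the following minimal sketch applies the same patterns used by asList and asRelations to an in-memory string. The sample catalog, its element names and the helper name as_relations are assumptions made for illustration; they are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical two-entry catalog; element and attribute names are
# assumptions chosen to match the regular expressions in the sample above.
my $xml = <<'XML';
<catalog>
<entry id="1"><title>Otters</title><rel tipo='subject'>animal</rel><rel tipo='subject'>water</rel></entry>
<entry id="2"><title>Rivers</title><rel tipo='subject'>water</rel></entry>
</catalog>
XML

# Same pattern as asList, applied to a string instead of a file's contents.
my @entries = ($xml =~ m{(<entry.*?</entry>)}gs);

# Same logic as asRelations: maps each relation type to its list of terms.
sub as_relations {
    my $f    = shift;
    my $data = {};
    while ($f =~ m{<rel\s+tipo='(.*?)'>(.*?)</rel>}g) {
        push @{ $data->{$1} }, $2;
    }
    return $data;
}

my $rels = as_relations($entries[0]);
printf "%d entries; first entry subjects: %s\n",
    scalar @entries, join(', ', @{ $rels->{subject} });
```

With this sample data the script prints "2 entries; first entry subjects: animal, water".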
When the mkdiglib function is used with this configuration information, the module creates a set of files with cached data for quick response, inside a libraryName directory. This function returns a library object.
The configuration file can refer to more than one catalog file. This is done with the following syntax:
    $userconf = {
        thesaurus => "/var/library/thesaurus",
        name      => 'libraryName',
        catalog   => [
            { file => "/var/library/catalog.xml",
              type => {
                  asList      => sub{ ... },
                  asRelations => sub{ ... },
                  asHTML      => sub{ ... },
                  asText      => sub{ ... },
                  asLaTeX     => sub{ ... },
              } },
            { file => ["/var/library/data1.db", "/var/library/data2.db"],
              type => {
                  asList      => sub{ ... },
                  asRelations => sub{ ... },
                  asHTML      => sub{ ... },
                  asText      => sub{ ... },
              } }, ] }
After creating the object, we can open it in another script with the opendiglib command, which receives the base name of the digital library. The base name is the path where the library was created concatenated with the identifier used.
The most common way to use the digital library is to build a script like:
    use Biblio::WebPortal;
    use CGI qw/:standard :cgi-lib/;   # :cgi-lib exports Vars()
    my $library = "/var/library/libraryName";
    my %vars = Vars();
    print header;
    my $diglib = Biblio::WebPortal::opendiglib( { name => $library } );
    print $diglib->navigate(%vars);
The following attributes can be used in conjunction with the previous configuration:
- scriptname
-
This should be used whenever the module can't detect the correct script name on the navigate method. Use it to point to the correct place.
- bt_next_txt
-
Set this attribute to the text to display on the link to the next page of search results.
- bt_prev_txt
-
Set this attribute to the text to display on the link to the previous page of search results.
Note that the configuration options of the Biblio::Thesaurus navigate method are also allowed in the configuration file.
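As an illustration only, the attributes above might be set as in the following fragment. It assumes they are given as extra keys in the same hash that carries name; the script path and the link texts are hypothetical values.

```perl
# Sketch of a configuration hash using the attributes described above.
# Assumption: they sit alongside 'name' in the same hash.
my $conf = {
    name        => "/var/library/libraryName",
    scriptname  => "/cgi-bin/library.cgi",   # hypothetical script location
    bt_next_txt => "Next page",              # text for the "next" link
    bt_prev_txt => "Previous page",          # text for the "previous" link
};
```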
Module Interface
- navigate
-
This method is used to navigate a digital library. It should be called with the hash of variables passed by the CGI.
The Digital Library directory
When creating a Biblio::WebPortal object (a digital library), a directory is created with the name given in the configuration file. This directory contains a set of files, each of them holding already processed information.
catalogs.index
-
This is a text file. It contains a map between integers and processed catalogs. Each line consists of a sequential integer (beginning at 0), a colon, and the full path to the catalog file.
All other databases will use that integer when referring to a catalog.
0:/home/user/diglib/catalog1.xml
1:/home/user/diglib/catalog2.xml
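A minimal sketch of reading this layout, assuming one integer and path pair per line as in the example. The function name parse_catalogs_index is hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Parse text in the catalogs.index layout into { integer => path }.
sub parse_catalogs_index {
    my ($text) = @_;
    my %path_for;
    for my $line (split /\n/, $text) {
        # The identifier ends at the first colon; the rest is the path.
        my ($id, $path) = $line =~ /^(\d+):(.+)$/ or next;
        $path_for{$id} = $path;
    }
    return \%path_for;
}

my $catalogs = parse_catalogs_index(
    "0:/home/user/diglib/catalog1.xml\n1:/home/user/diglib/catalog2.xml\n"
);
print "catalog 1 is $catalogs->{1}\n";
```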
entry-catalog.index
-
Another text file, which maps digital library identifiers to entries in each catalog. Biblio::WebPortal assigns a different integer to each entry, no matter which catalog it is from. Each line of this file contains the entry identifier in the digital library, a colon, the identifier of the catalog it comes from (as defined in the catalogs.index file), a dot, and a number indicating the position of the entry in that catalog. Note that this position starts at 0, like the catalog identifiers.
1:0.0
2:0.1
3:0.2
4:0.3
5:0.4
6:1.0
7:1.1
8:1.2
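The mapping can be read back with a few lines of Perl. This is a sketch under the assumption that each line holds exactly one entry:catalog.position triple; the function name parse_entry_catalog_index is hypothetical.

```perl
use strict;
use warnings;

# Parse text in the entry-catalog.index layout into
# { entry id => { catalog => N, position => M } }.
sub parse_entry_catalog_index {
    my ($text) = @_;
    my %location_for;
    for my $line (split /\n/, $text) {
        my ($entry, $catalog, $position) =
            $line =~ /^(\d+):(\d+)\.(\d+)$/ or next;
        $location_for{$entry} = { catalog => $catalog, position => $position };
    }
    return \%location_for;
}

# The sample values from the description, one per line.
my $sample = join "\n", qw(1:0.0 2:0.1 3:0.2 4:0.3 5:0.4 6:1.0 7:1.1 8:1.2);
my $loc    = parse_entry_catalog_index($sample);
printf "entry 7 is item %d of catalog %d\n",
    $loc->{7}{position}, $loc->{7}{catalog};
```

For the sample data, entry 7 resolves to position 1 of catalog 1.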
html.db
-
This is a Berkeley DB file whose keys are the entry identifiers defined in the entry-catalog.index file. For each key, the database stores a pre-calculated HTML version of the entry (using the asHTML function shown in the previous section).
latex.db
-
This is a Berkeley DB file whose keys are the entry identifiers defined in the entry-catalog.index file. For each key, the database stores a pre-calculated LaTeX version of the entry (using the asLaTeX function shown in the previous section).
relation.index
-
A text file mapping entry identifiers to relation terms. Each line contains the entry identifier, a mark, and a list of classification terms.
relations.db
-
...
relations.list
-
This is a text file where each line contains a term. These are the classification terms used in all catalogs.
relations.statistics
-
Contains the same terms as relations.list, but each term is followed by a mark and an occurrence count.
text.index
-
This text file contains, in each line, the entry identifier, a mark, and the text version of the entry, calculated using the asText function shown in the previous section.
thesaurus.log
-
This is a text file in thesaurus format that maps to the term 'Others' every classification term found in the catalogs that does not exist in the thesaurus file.
thesaurus.store
-
This is a data dump (in the format of the Storable Perl module) of the full thesaurus structure. It is used as a cache for quick reading when using a navigation-enabled thesaurus web page.
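The caching idea can be sketched with Storable's store and retrieve functions. The data structure below is a hypothetical stand-in, since the real layout of the thesaurus structure is not described here, and the file is written to a temporary directory rather than to a real library.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Work in a temporary directory so no real library is touched.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/thesaurus.store";

# Hypothetical stand-in for a thesaurus structure:
# term => { relation => [related terms] }.
my $thesaurus = { animal => { NT => ['otter', 'beaver'] } };

store($thesaurus, $file);        # write the dump once
my $cached = retrieve($file);    # later reads skip re-parsing the source

print join(', ', @{ $cached->{animal}{NT} }), "\n";
```

Reading a Storable dump is generally much faster than re-parsing a text thesaurus, which is presumably why the cache exists.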
AUTHOR
José João Almeida <jj@di.uminho.pt>
Alberto Simões <albie@alfarrabio.di.uminho.pt>
SEE ALSO
Manpages: Biblio::Thesaurus(3) Biblio::Catalog(3) perl(1)