NAME
Htdig::Database - Perl interface to ht://Dig docdb and config files
SYNOPSIS
use Htdig::Database;
my $config = Htdig::Database::get_config( $config_file )
or die "$0: Can't access $config_file\n";
my $record = Htdig::Database::parse_docdb( $docdb_record );
print "URL = $record->{URL}\n";
DESCRIPTION
Exported functions
The following functions are provided by Htdig::Database:
get_config
parse_docdb
encode_url
decode_url
By default, functions are not exported into the caller's namespace, and you must invoke them using the full package name, e.g.:
Htdig::Database::get_config( $config_file );
To import all available function names, invoke the module with:
use Htdig::Database qw(:all);
Parsing a config file
get_config parses a config file and returns a hash ref that contains the configuration attributes. For example:
my $config = Htdig::Database::get_config( $config_file )
or die "$0: Can't access $config_file\n";
print "start_url = $config->{start_url}\n";
All values in the hash are scalars, and any items that are intended to be lists or booleans must be parsed by the calling program. get_config returns undef if the config file can't be opened, and carps about various syntax errors.
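Since every value comes back as a plain scalar, a caller might convert list and boolean attributes along the following lines. This is only a sketch: the helpers config_list and config_bool are hypothetical and not part of Htdig::Database, the hash stands in for the result of get_config, and both the whitespace-separated list format and the set of strings treated as true are assumptions.

```perl
use strict;
use warnings;

# Stand-in for the hash ref returned by get_config (sample values, not real output).
my $config = {
    start_url      => 'http://www.example.com/ http://www.example.org/docs/',
    case_sensitive => 'true',
    allow_numbers  => 'no',
};

# Hypothetical helper: split a whitespace-separated attribute into a list.
sub config_list {
    my ( $config, $name ) = @_;
    return () unless defined $config->{$name};
    return split ' ', $config->{$name};
}

# Hypothetical helper: treat "true", "yes", "on" and "1" as true (assumed spellings).
sub config_bool {
    my ( $config, $name ) = @_;
    my $value = $config->{$name};
    return 0 unless defined $value;
    return $value =~ /^\s*(?:true|yes|on|1)\s*$/i ? 1 : 0;
}

my @start_urls = config_list( $config, 'start_url' );
print "start_url has ", scalar @start_urls, " entries\n";
print "case_sensitive is ", ( config_bool( $config, 'case_sensitive' ) ? "on" : "off" ), "\n";
```

Keeping the conversion in the calling program matches the module's behavior of returning raw scalars, and lets each caller decide how strict to be about malformed values.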
Parsing a record from the document database
parse_docdb parses a record from the document database and returns a hash ref. For example:
use DB_File;
use Fcntl;    # provides O_RDONLY
my %docdb;
tie( %docdb, 'DB_File', $docdb, O_RDONLY, 0, $DB_BTREE ) ||
die "$0: Unable to open $docdb: $!";
while ( my ( $key, $value ) = each %docdb ) {
    # Skip the bookkeeping entry that is not a document record.
    next if $key =~ /^nextDocID/;
    my $record = Htdig::Database::parse_docdb( $value );
    print "     URL: $record->{URL}\n";
    print "HOPCOUNT: $record->{HOPCOUNT}\n";
}
URLs in the database are encoded using two attributes from the configuration file: common_url_parts and url_part_aliases. parse_docdb does only rudimentary decoding. It can't handle more than 25 elements in the common_url_parts list, and it currently can't handle url_part_aliases at all.
get_config caches the value of common_url_parts that's used for decoding URLs, and should usually be called before parse_docdb.
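Because of the 25-element limit, a calling program may want to check common_url_parts before relying on decoded URLs. The following is a minimal sketch under stated assumptions: count_common_url_parts is a hypothetical helper that is not part of this module, the hash stands in for the result of get_config, and the list is assumed to be whitespace-separated.

```perl
use strict;
use warnings;

# Limit stated in the documentation above.
my $MAX_COMMON_URL_PARTS = 25;

# Stand-in for the hash ref returned by get_config (sample values only).
my $config = {
    common_url_parts => 'http:// http://www. ftp:// .html .htm .gif .jpg',
};

# Hypothetical helper: count the elements, assuming whitespace separation.
sub count_common_url_parts {
    my ($config) = @_;
    return 0 unless defined $config->{common_url_parts};
    my @parts = split ' ', $config->{common_url_parts};
    return scalar @parts;
}

my $count = count_common_url_parts($config);
warn "common_url_parts has $count elements; URLs may not decode correctly\n"
    if $count > $MAX_COMMON_URL_PARTS;
print "common_url_parts: $count elements\n";
```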
Compressed data in the HEAD element will be automatically decompressed if the Compress::Zlib module is available. If Compress::Zlib is not installed, compressed data will be silently replaced by the string:
"Compressed data: Zlib not available"
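Since the substitution is silent, a caller that depends on the HEAD text may want to test for the placeholder explicitly. The sketch below assumes the placeholder is exactly the string quoted above, without the surrounding quotation marks; head_is_usable is a hypothetical helper, and the hash stands in for a record returned by parse_docdb.

```perl
use strict;
use warnings;

# Placeholder text quoted in the documentation above (assumed to exclude the quotes).
my $ZLIB_PLACEHOLDER = 'Compressed data: Zlib not available';

# Stand-in for a hash ref returned by parse_docdb (sample values only).
my $record = {
    URL  => 'http://www.example.com/',
    HEAD => $ZLIB_PLACEHOLDER,
};

# Hypothetical helper: true if HEAD holds real text rather than the placeholder.
sub head_is_usable {
    my ($record) = @_;
    return 0 unless defined $record->{HEAD};
    return $record->{HEAD} eq $ZLIB_PLACEHOLDER ? 0 : 1;
}

# Check once whether Compress::Zlib can be loaded at all.
my $have_zlib = eval { require Compress::Zlib; 1 } ? 1 : 0;

print "Compress::Zlib ", ( $have_zlib ? "is" : "is not" ), " available\n";
print "HEAD for $record->{URL} is ",
    ( head_is_usable($record) ? "usable" : "not usable" ), "\n";
```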
If only a single value is needed from the database record, it can be specified as a second parameter to parse_docdb, which then returns the requested value as a scalar. For example:
my $url = Htdig::Database::parse_docdb( $value, 'URL' );
Encoding a URL
my $encoded_url = Htdig::Database::encode_url( $url );
This may be useful for computing database keys, since the keys are encoded URLs. get_config should be called before encode_url or decode_url to initialize the value of common_url_parts.
Decoding a URL
my $url = Htdig::Database::decode_url( $encoded_url );
This should seldom be necessary, since URLs are normally decoded by parse_docdb.
AUTHOR
Warren Jones <wjones@halcyon.com>
BUGS
Only simple cases of URL encoding are handled correctly. No more than 25 elements are allowed in common_url_parts. The value of url_part_aliases is not used at all. Someday this module may implement the same URL encoding logic found in HtWordCodec.cc, but a better solution might be to provide an XSUB interface to the C++ functions.
This module works with ht://Dig 3.1.4. It probably works with 3.1.5, though this hasn't been tested. Because of changes in the database format, it will not work with version 3.2.