NAME

Htdig::Database - Perl interface to ht://Dig docdb and config files

SYNOPSIS

    use Htdig::Database;

    my $config = Htdig::Database::get_config( $config_file )
	    or die "$0: Can't access $config_file\n";
    my $record = Htdig::Database::parse_docdb( $docdb_record );
    print "URL = $record->{URL}\n";

DESCRIPTION

Exported functions

The following functions are provided by Htdig::Database:

get_config
parse_docdb
encode_url
decode_url

By default, functions are not exported into the caller's namespace, and you must invoke them using the full package name, e.g.:

    Htdig::Database::get_config( $config_file );

To import all available function names, invoke the module with:

    use Htdig::Database qw(:all);
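
For readers unfamiliar with how an ":all" import tag behaves, the following is a minimal sketch using the standard Exporter module. The package My::Demo and its stub functions are hypothetical and stand in for a real module; this is not the source of Htdig::Database, only an illustration of the general mechanism.

```perl
use strict;
use warnings;

# Hypothetical package defined inline so the sketch runs on its own.
package My::Demo;
use Exporter 'import';

# Nothing is exported by default; names must be requested explicitly.
our @EXPORT_OK   = qw(get_config parse_docdb);
our %EXPORT_TAGS = ( all => \@EXPORT_OK );

# Stub functions that only echo their argument back.
sub get_config  { return { source => $_[0] } }
sub parse_docdb { return { URL    => $_[0] } }

package main;

# The package lives in this file, so call import directly instead of "use".
My::Demo->import(':all');

# Parentheses are required because the names were imported at run time.
my $config = get_config('demo.conf');
print "source = $config->{source}\n";
```

With a module installed in the usual way, a single "use" line with the tag has the same effect as the explicit import call shown here.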

Parsing a config file

get_config parses a config file and returns a hash ref that contains the configuration attributes. For example:

    my $config = Htdig::Database::get_config( $config_file )
	or die "$0: Can't access $config_file\n";
    print "start_url = $config->{start_url}\n";

All values in the hash are scalars, and any items that are intended to be lists or booleans must be parsed by the calling program. get_config returns undef if the config file can't be opened, and carps about various syntax errors.
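
Since list and boolean attributes arrive as plain scalars, the caller has to split or interpret them. The sketch below shows one way to do that. The helper names config_list and config_bool are hypothetical, the hash is a stand-in for the value returned by get_config, and the assumptions that lists are whitespace-separated and that "true", "yes" and "1" mean true should be checked against your ht://Dig version.

```perl
use strict;
use warnings;

# Stand-in for the hash ref returned by get_config; values are illustrative.
my $config = {
    start_url      => 'http://www.example.com/ http://www.example.org/',
    case_sensitive => 'true',
    allow_numbers  => 'no',
};

# Split an attribute into a list, assuming whitespace-separated items.
sub config_list {
    my ( $config, $name ) = @_;
    my $value = $config->{$name};
    return () unless defined $value;
    return grep { length } split /\s+/, $value;
}

# Interpret an attribute as a boolean, assuming true/yes/1 mean true.
# A missing attribute is treated as false.
sub config_bool {
    my ( $config, $name ) = @_;
    my $value = $config->{$name};
    return 0 unless defined $value;
    return $value =~ /^\s*(?:true|yes|1)\s*$/i ? 1 : 0;
}

my @start_urls = config_list( $config, 'start_url' );
print "start_url: $_\n" for @start_urls;
print "case_sensitive is ",
    ( config_bool( $config, 'case_sensitive' ) ? "on" : "off" ), "\n";
```

Keeping this logic in small helpers means the calling program decides, attribute by attribute, which values are lists and which are flags.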

Parsing a record from the document database

parse_docdb parses a record from the document database and returns a hash ref. For example:

    use DB_File;
    use Fcntl;    # supplies O_RDONLY

    my %docdb;
    tie( %docdb, 'DB_File', $docdb, O_RDONLY, 0, $DB_BTREE )
        or die "$0: Unable to open $docdb: $!";

    while ( my ( $key, $value ) = each %docdb ) {
        next if $key =~ /^nextDocID/;    # not a document record
        my $record = Htdig::Database::parse_docdb( $value );
        print "     URL: $record->{URL}\n";
        print "HOPCOUNT: $record->{HOPCOUNT}\n";
    }

URLs in the database are encoded using two attributes from the configuration file: common_url_parts and url_part_aliases. parse_docdb does only rudimentary decoding. It can't handle more than 25 elements in the common_url_parts list, and it currently can't handle url_part_aliases at all.

get_config caches the value of common_url_parts that's used for decoding URLs, and should usually be called before parse_docdb.

Compressed data in the HEAD element will be automatically decompressed if the Compress::Zlib module is available. If Compress::Zlib is not installed, compressed data will be silently replaced by the string:

    "Compressed data: Zlib not available"

If only a single value is needed from the database record, it can be specified as a second parameter to parse_docdb, which then returns the requested value as a scalar. For example:

    my $url = Htdig::Database::parse_docdb( $value, 'URL' );
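
The optional Compress::Zlib handling described earlier in this section follows a common Perl pattern: try to load the module once and fall back to a placeholder if it is missing. The sketch below illustrates that pattern only. The helper maybe_uncompress is hypothetical, and treating the data as a zlib stream readable by Compress::Zlib::uncompress is an assumption, not a statement about how the module decodes HEAD.

```perl
use strict;
use warnings;

# Try to load Compress::Zlib once; eval keeps a missing module from being fatal.
my $have_zlib = eval { require Compress::Zlib; 1 } ? 1 : 0;

# Hypothetical helper: inflate $data when Compress::Zlib is available,
# otherwise return the placeholder string.
sub maybe_uncompress {
    my ($data) = @_;
    return "Compressed data: Zlib not available" unless $have_zlib;

    # uncompress returns undef when the input is not valid zlib data;
    # in that case hand the input back unchanged.
    my $plain = Compress::Zlib::uncompress($data);
    return defined $plain ? $plain : $data;
}

print "Compress::Zlib is ", ( $have_zlib ? "available" : "not available" ), "\n";
```

A caller that needs to know whether real data came back can compare the result against the placeholder string.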

Encoding a URL

    my $encoded_url = Htdig::Database::encode_url( $url );

This may be useful for computing database keys, since the keys are encoded URLs. get_config should be called before encode_url or decode_url to initialize the value of common_url_parts.

Decoding a URL

    my $url = Htdig::Database::decode_url( $encoded_url );

This should seldom be necessary, since URLs are normally decoded by parse_docdb.

AUTHOR

Warren Jones <wjones@halcyon.com>

BUGS

Only simple cases of URL encoding are handled correctly. No more than 25 elements are allowed in common_url_parts. The value of url_part_aliases is not used at all. Someday this module may implement the same URL encoding logic found in HtWordCodec.cc, but a better solution might be to provide an XSUB interface to the C++ functions.

This module works with ht://Dig 3.1.4. It probably works with 3.1.5, though this hasn't been tested. Because of changes in the database format, it will not work with version 3.2.