NAME

NexTrieve::Targz - create and maintain Targz archives

SYNOPSIS

use NexTrieve;
$ntv = NexTrieve->new( | {method => value} );
$targz = $ntv->Targz( | {method => value} );

# specifying the conversion to XML
$rfc822 = $ntv->RFC822( {method => value} );
$targz->RFC822( $rfc822 );

# adding to archive and creating new XML files from separate message files
$targz->add_file( <files> );

# adding to archive and creating new XML files from mbox files
$targz->add_mbox( <mboxes> );

# adding, creating new XML and automatically indexing the document sequence
$targz->Docseq( $ntv->Index( $resource )->Docseq );
$targz->add_file( <files> );
$targz->Docseq->done;

# updating XML with new specifications
$targz->update_xml( $rfc822 );

# updating XML and automatically re-index
$targz->update_xml( $rfc822,$ntv->Index( $resource )->Docseq );

# re-index all XML
$targz->xml( $ntv->Index( $resource )->Docseq );

# obtain document sequence XML
$xml = $targz->xml;

DESCRIPTION

The Targz object of the Perl support for NexTrieve. Do not create it directly, but through the Targz method of the NexTrieve object.

YET ANOTHER ARCHIVE

The Targz archive is basically just another way of archiving RFC822 messages. The name comes from the fact that it internally uses "tar"-files that are "gz"ipped for storage of the messages.

Apart from being able to archive messages, it can also save other representations of those messages: currently only one alternative method is allowed, namely the XML format as used by NexTrieve document sequences.

The NexTrieve::Targz module implements the Targz archive. It was developed internally at NexTrieve for an initial version of the News Search Engine of http://www.search.nl .

When a message is added to the Targz archive (also referred to as "targz"), the header of the message is read to determine the origination date of the message. Messages for which no origination date can be determined cannot be added to the targz. The origination date is determined by looking at the "From ", "Date:" and any other header that has a ";..." comment field.

An internal ID is assigned to the message. This internal ID is calculated by taking the epoch value of midnight GMT of the date of the message and adding an ordinal number to it. So the ID of the first message is in fact the epoch value of one second past midnight, the second message two seconds, etc. This internal ID value has the advantage that it will fit in 4 bytes for quite some years to come, it is unique and it can be easily used in constraints that have a granularity of a day.
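The ID arithmetic described above can be sketched in a few lines of plain Perl. This is only an illustration, not the code of the module itself: the helper names midnight_gmt, message_id and datestamp_of are hypothetical, and the datestamp is assumed to be in the "YYYYMMDD" form.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);
use POSIX qw(strftime);

# Hypothetical helper: epoch value of midnight GMT for a YYYYMMDD datestamp.
sub midnight_gmt {
    my ($datestamp) = @_;
    my ( $year, $month, $day ) = $datestamp =~ /^(\d{4})(\d\d)(\d\d)$/
      or die "invalid datestamp '$datestamp'\n";
    return timegm( 0, 0, 0, $day, $month - 1, $year );
}

# Hypothetical helper: ID of the Nth message (1..86399) of a date.
sub message_id {
    my ( $datestamp, $ordinal ) = @_;
    die "ordinal out of range\n" if $ordinal < 1 or $ordinal > 86399;
    return midnight_gmt($datestamp) + $ordinal;
}

# Hypothetical helper: recover the datestamp from an ID.
sub datestamp_of {
    my ($id) = @_;
    return strftime( '%Y%m%d', gmtime($id) );
}

my $id = message_id( '20020323', 1 );
print "$id belongs to ", datestamp_of($id), "\n";
```

Because all IDs of a date lie between two consecutive midnights, a constraint for a single day is just a numeric range on the ID.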

Messages that originated on the same date are stored in the same tarfile, which is stored gzipped to save space. Up to 86399 messages can be stored for a single date, which seems enough for even the busiest newsgroup.

To be able to access single messages, a tarfile of a specific date is automatically extracted completely whenever the absolute filename of a message is requested. In benchmarks it was shown that there is hardly any difference in CPU-usage between extracting a single message or all messages. By extracting all messages of a date, it is possible to use the existence of the directory in which they were extracted as a flag.
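The idea of using the existence of the extraction directory as a flag can be sketched as follows. The subroutine extract_date is a hypothetical stand-in that merely creates the directory; the real module would unpack the tarfile of that date there.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $work        = tempdir( CLEANUP => 1 );    # stand-in for the work directory
my $extractions = 0;

# Hypothetical stand-in for unpacking the tarfile of a date.
sub extract_date {
    my ( $datestamp, $directory ) = @_;
    mkdir $directory or die "cannot create $directory: $!\n";
    $extractions++;
}

# Return the directory holding the messages of a date, extracting only
# when the directory does not exist yet.
sub directory_for {
    my ($datestamp) = @_;
    my $directory = "$work/$datestamp";
    extract_date( $datestamp, $directory ) unless -d $directory;
    return $directory;
}

directory_for('20020323') for 1 .. 3;
print "extracted $extractions time(s)\n";
```

Only the first request for a date pays for the extraction; later requests just test for the directory.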

The Targz archive is a compromise between simplicity, space used to store messages and being able to access them quickly. Older versions of this software that were used internally at NexTrieve, used a monthly tarfile for low-volume newsgroups. But since the invention of super-economic file-systems such as ReiserFS (which is highly recommended as the file system of choice for storing Targz archives on), the overhead of having a single tarfile per date seems to be more than bearable, especially compared to the simplicity it creates.

Apart from the tarfiles that are created, no external files are being kept. Ordinal numbers for messages are determined by the files that are already in a tarfile and nothing else.
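A minimal sketch of deriving the next ordinal from what is already stored, assuming (this is not stated above) that the members of a tarfile can be mapped back to their IDs. The helper name next_id is hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: given the epoch value of midnight GMT of a date and
# the IDs already present in that date's tarfile, return the next free ID.
sub next_id {
    my ( $midnight, @existing ) = @_;
    my $last = max( 0, map { $_ - $midnight } @existing );
    die "no room left for this date\n" if $last >= 86399;
    return $midnight + $last + 1;
}

my $midnight = 1016841600;    # 2002-03-23 00:00:00 GMT
print next_id($midnight), "\n";    # empty tarfile: first ordinal
print next_id( $midnight, $midnight + 1, $midnight + 2 ), "\n";
```

Since the tarfile is the only source of truth, no separate counter file can get out of step with the archive.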

Internally at NexTrieve, this archive format is in use for storing over 65 million messages from over 10000 textual (non-binary) newsgroups in about 50 Gigabyte of diskspace, with about 250000 messages being added each day. At the same time, the XML generated for these messages is being served as HTML on a web-server.

PREREQUISITES

Currently a "tar" program that understands the following parameters must be available for this module to operate correctly:

--append        append to existing tarfile
--create        create new tarfile
--directory=    extract files to indicated directory
--extract       extract files from archive
--file=         specify name of tarfile to work on
--gzip          filter tarfile through "gzip"
--gunzip        filter tarfile through "gunzip"
--list          list filenames of files stored in tarfile
--remove-files  remove original files upon storing in tarfile
--to-stdout     extract files to STDOUT
--verbose       perform action verbosely (list files being extracted)

Currently a "gzip" program that understands the following parameters must be available for this module to operate correctly:

--best          use best compression possible
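To give an impression of how the long options listed above combine, here is a sketch that only builds argument lists and does not run anything. How the module itself invokes "tar" is not documented here, so the helper names, the tarfile name and the target directory are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: argument list for listing the members of a tarfile.
sub list_command {
    my ($tarfile) = @_;
    return ( 'tar', '--list', '--gunzip', "--file=$tarfile" );
}

# Hypothetical helper: argument list for extracting a whole tarfile.
sub extract_command {
    my ( $tarfile, $directory ) = @_;
    return ( 'tar', '--extract', '--gunzip', '--verbose',
        "--file=$tarfile", "--directory=$directory" );
}

print join( ' ', list_command('20020323.tar.gz') ), "\n";
print join( ' ', extract_command( '20020323.tar.gz', '/tmp/20020323' ) ), "\n";
```

Passing such a list to system() rather than a single string avoids shell quoting problems with unusual file names.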

OBJECT METHODS

The following methods return objects.

Docseq

$targz->Docseq( | $docseq, | {method => value} );
$docseq = $targz->Docseq;

The "Docseq" method allows you to access the NexTrieve::Docseq object that lives inside of the NexTrieve::Targz object.

The first optional input parameter is the NexTrieve::Docseq object that should be used by the NexTrieve::Targz object. The current NexTrieve::Docseq object is assumed if none is specified. A new one is created if none was associated with the object before.

A reference to a method-value pair hash to be applied to the NexTrieve::Docseq object can be specified as the second input parameter.

For more information, see the documentation of the NexTrieve::Docseq module itself.

Resource

$resource = $targz->Resource( | {method => value} );

The "Resource" method allows you to create a NexTrieve::Resource object from the internal structure of the NexTrieve::RFC822 object that lives inside of the NexTrieve::Targz object.

For more information, see the documentation of the NexTrieve::RFC822 and NexTrieve::Resource modules themselves.

RFC822

$targz->RFC822( | $rfc822, | {method => value} );
$rfc822 = $targz->RFC822;

The "RFC822" method allows you to access the NexTrieve::RFC822 object that lives inside of the NexTrieve::Targz object.

The first optional input parameter is the NexTrieve::RFC822 object that should be used by the NexTrieve::Targz object. The current NexTrieve::RFC822 object is assumed if none is specified. A new one is created if none was associated with the object before.

A reference to a method-value pair hash to be applied to the NexTrieve::RFC822 object can be specified as the second input parameter.

For more information, see the documentation of the NexTrieve::RFC822 module itself.

OTHER METHODS

The following methods change aspects of the NexTrieve::Targz object.

add_file

$targz->add_file || die "could not add *.new files\n";
$targz->add_file( <files> ) || die "could not add files\n";

The "add_file" method allows you to add RFC822 messages that are stored in separate files to the targz. Returns true if successful.

The input parameters can either be filenames or references to lists with filenames. If no input parameters are specified, all files with the extension ".new" that are stored in the directory of the targz (see the directory method) will be assumed.

If the rm_original method was previously called with a true value, then the files specified will be deleted on successful execution of this method.
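The two forms of input parameter (plain filenames and references to lists of filenames) can be mixed freely. A minimal sketch of how such a parameter list can be flattened, using a hypothetical helper called flatten_files that is not part of the module:

```perl
use strict;
use warnings;

# Hypothetical helper: expand a mix of filenames and list references
# into one flat list of filenames.
sub flatten_files {
    return map { ref($_) eq 'ARRAY' ? @$_ : $_ } @_;
}

my @files = flatten_files( '1.new', [ '2.new', '3.new' ], '4.new' );
print "@files\n";
```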

add_mbox

$targz->add_mbox( <mboxes> ) || die "could not add mboxes\n";

The "add_mbox" method allows you to add RFC822 messages that are stored in one or more Unix mailboxes to the targz. Returns true if successful. If the rm_original method was previously called with a true value, then the files specified will be deleted on successful execution of this method.

add_news

$targz->add_news( $nntp ) || die "could not add news\n";

($messages,$nntp) = $targz->add_news( $nntp,\&create_NNTP );
die "could not add news\n" unless $messages =~ m#^\d+$#;

The "add_news" method allows you to add RFC822 messages from a news (NNTP) server.

The first input parameter specifies the Net::NNTP object that should be used to obtain messages from the newsgroup of the name of the targz.

The optional second input parameter specifies a reference to an (anonymous) subroutine that can be called to create the Net::NNTP object. This is especially handy when reading a lot of newsgroups with the same Net::NNTP object, as some news servers let the connection go stale after a while. By specifying this parameter you allow the add_news method to recover from such a situation automatically.

Returns the number of messages that were successfully obtained in a scalar context. In a list context, the second output parameter is the possibly adapted Net::NNTP object that was passed as the first input parameter.
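The difference between the scalar and list context results can be illustrated with Perl's wantarray. The subroutine fetch below is a hypothetical stand-in, not the add_news method itself, and it pretends that three messages were obtained.

```perl
use strict;
use warnings;

# Hypothetical stand-in that returns its results the way add_news is
# described to: a count in scalar context, (count, object) in list context.
sub fetch {
    my ($connection) = @_;
    my $messages = 3;    # pretend three messages were obtained
    return wantarray ? ( $messages, $connection ) : $messages;
}

my $count = fetch('connection');
my ( $messages, $connection ) = fetch('connection');
print "$count $messages $connection\n";
```

Note that a call on the left-hand side of "||" is evaluated in scalar context, so only the count is seen there.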

clean

$exit = $targz->clean;

The "clean" method cleans the temporary directory in use by the object. It is usually called automatically when the object is DESTROYed, unless inhibited by a call to the no_auto_clean method.

It returns the exit status of the system's "rm" command.

count

$all = $targz->count;
$some = $targz->count( 'regexp' );
$oneday = $targz->count( datestamp );
$stored = $targz->count( '',$hashref );

The "count" method returns the number of messages in the targz. The count can be for the whole targz, constrained by a regular expression (as used in a "grep()"), or for just a single date (if a datestamp is specified).

The optional second input parameter is a reference to a hash. This hash will be filled with a key for each datestamp for which files are found. The value of the key in the hash is a reference to a list which currently contains two values: the last modified time of the tarfile and the number of messages in it. If the tarfile is deemed to not have changed, the tarfile itself will not be read but instead the count found in the hash will be used. The hash reference can e.g. be stored in the directory with the Storable module, which is what the count_storable method does, or can be stored in any other database backend that you might desire.

Use the ids method to find out the actual IDs of the messages.
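The caching behaviour of that hash can be sketched as follows. The subroutines read_tarfile and cached_count are hypothetical; read_tarfile stands in for the slow path of actually reading a tarfile and always reports 42 messages here.

```perl
use strict;
use warnings;

my $reads = 0;

# Hypothetical stand-in for counting the messages in a tarfile the slow way.
sub read_tarfile { $reads++; return 42 }

# Count the messages of one date, reusing the cached [mtime, count] entry
# when the last modified time of the tarfile has not changed.
sub cached_count {
    my ( $cache, $datestamp, $mtime ) = @_;
    my $entry = $cache->{$datestamp};
    return $entry->[1] if $entry and $entry->[0] == $mtime;
    my $count = read_tarfile($datestamp);
    $cache->{$datestamp} = [ $mtime, $count ];
    return $count;
}

my %cache;
cached_count( \%cache, '20020323', 1000 );    # reads the tarfile
cached_count( \%cache, '20020323', 1000 );    # served from the hash
cached_count( \%cache, '20020323', 2000 );    # tarfile changed: read again
print "tarfile read $reads times\n";
```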

count_storable

$all = $targz->count_storable;
$some = $targz->count_storable( 'regexp' );
$oneday = $targz->count_storable( datestamp );

The "count_storable" method is similar to the count method, but it uses a hash that is stored in an external file ("count.gz" in the directory returned by the directory method) to remember which tarfiles were counted already before. If the Storable module is not available, calling this method will still work but the counting will be much slower.

The "count_storable" method returns the number of messages in the targz. The count can be for the whole targz, constrained by a regular expression (as used in a "grep()"), or for just a single date (if a datestamp is specified).
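Persisting such a hash between runs with the Storable module might look like the sketch below. The file name "count.storable" and the stored values are made up for the illustration; the actual on-disk format of "count.gz" is not described here and is not reproduced by this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Storable qw(nstore retrieve);

my $directory = tempdir( CLEANUP => 1 );
my $file      = "$directory/count.storable";    # hypothetical file name

# datestamp => [ last modified time, number of messages ]
my %counted = ( '20020323' => [ 1016900000, 42 ] );
nstore( \%counted, $file );

# A later run can load the hash again instead of re-reading every tarfile.
my $restored = retrieve($file);
print "20020323: $restored->{'20020323'}[1] messages\n";
```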

datestamps

foreach my $datestamp (@{$targz->datestamps}) { print "$datestamp\n" }

The "datestamps" method returns a reference to a list of datestamps of the dates of which the targz contains messages. Datestamps are in the form "YYYYMMDD". They are always ordered in ascending order. A datestamp can be used as an input parameter to files.

directory

$directory = $targz->directory;

The "directory" method returns the directory that is used by the NexTrieve::Targz object to permanently store information. The directory is created when the NexTrieve::Targz object is created.

filename

$rfc = $targz->filename( $id );
$xml = $targz->filename( "$id.xml" );
system( "cat $xml" );

The "filename" method returns an absolute filename for the message specified by the input parameter. As a side effect, it extracts all messages of the same date into a temporary directory.

The input parameter is the id of which to obtain the absolute filename. It can be suffixed with the string ".xml" to indicate that the XML version of the message is requested. The id values can be obtained by a call to ids.

Returns the empty string if there is no message (or XML-version of that message) available.

ids

$all = $targz->ids;
$some = $targz->ids( 'regexp' );
$oneday = $targz->ids( datestamp );

The "ids" method returns a reference to a list of IDs of the messages in the targz. The list can be complete, constrained by a regular expression (as used in a "grep()"), or for just a single date (if a datestamp is specified).

Use the filename method to find out the absolute filename of a message to be able to get at its contents.

name

$name = $targz->name;

The "name" method returns the name of the targz. This is the same as the name of the final subdirectory on which the targz works.

no_auto_clean

$targz->no_auto_clean( 1 );
$no_auto_clean = $targz->no_auto_clean;

The "no_auto_clean" method specifies whether the temporary directory that is used by the object should be cleaned when the object is DESTROYed. By default, the object cleans the temporary directory. A true value indicates that the temporary directory should not be removed when the object is DESTROYed. This is generally only useful in debugging situations.
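The interplay between automatic cleaning on DESTROY and the no_auto_clean flag can be sketched with a small stand-alone class. The class Scratch is hypothetical and not part of NexTrieve; it removes its directory with File::Path rather than with the system's "rm" command.

```perl
use strict;
use warnings;

package Scratch;    # hypothetical class, not part of NexTrieve
use File::Temp qw(tempdir);
use File::Path qw(remove_tree);

sub new {
    my ($class) = @_;
    return bless { directory => tempdir(), no_auto_clean => 0 }, $class;
}

# Combined getter/setter, in the style of the methods documented here.
sub no_auto_clean {
    my $self = shift;
    $self->{no_auto_clean} = shift if @_;
    return $self->{no_auto_clean};
}

sub directory { $_[0]->{directory} }
sub clean     { remove_tree( $_[0]->{directory} ) }

# Clean up automatically unless that was inhibited.
sub DESTROY {
    my $self = shift;
    $self->clean unless $self->{no_auto_clean};
}

package main;

my $kept;
{
    my $scratch = Scratch->new;
    $scratch->no_auto_clean(1);
    $kept = $scratch->directory;
}
print( ( -d $kept ) ? "still there\n" : "removed\n" );
File::Path::remove_tree($kept);    # tidy up after the demonstration
```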

rm_original

$targz->rm_original( 1 );
$rm_original = $targz->rm_original;

The "rm_original" method specifies whether files that are specified to be added (with either the add_file or add_mbox method) are automatically removed from the file system upon successful adding.

tarfile

$tarfile = $targz->tarfile( '20020323' );
$tarfile = $targz->tarfile( $datestamp,'xml' );

The "tarfile" method returns the absolute name of the tarfile that contains the files of a given date. It is only necessary if you want to do some low level action on the tarfile.

The first input parameter specifies the datestamp of the date of which you want to know the tarfile name.

The optional second input parameter specifies which type of information you want the tarfile name of. Two values are currently supported: 'rfc' and 'xml'. The value 'rfc' will be assumed if this input parameter is not specified.

update_xml

$files = $targz->update_xml( | $rfc822, | $docseq );

The "update_xml" method reads all the messages in the targz and creates new XML for them, either using the NexTrieve::RFC822 object that lives inside the NexTrieve::Targz object, or with a specific one that is specified.

The second input parameter specifies the Docseq object that should also be used to process all newly created document XML. The NexTrieve::Docseq object that lives inside the NexTrieve::Targz object will be assumed if none is specified.

Returns the number of files that were processed.

work

$work = $targz->work;

The "work" method returns the work directory that is used by the NexTrieve::Targz objects that are in this process. The work directory is created when the NexTrieve::Targz object is created. The location of the work directory is determined by the Tmp setting of the NexTrieve::Targz object (which is inherited from the NexTrieve object).

xml

$xml = $targz->xml;
$targz->xml( | $docseq );

The "xml" method either returns the document sequence XML of the entire targz, or processes the document XML of all the messages in the targz with the Docseq object specified. When called in a void context, the Docseq object of the targz will be assumed, or a new one will be created. When called in a scalar context, no Docseq object will be used unless one is explicitly specified.

AUTHOR

Elizabeth Mattijsen, <liz@dijkmat.nl>.

Please report bugs to <perlbugs@dijkmat.nl>.

SUPPORT

NexTrieve is no longer being supported.

COPYRIGHT

Copyright (c) 1995-2003 Elizabeth Mattijsen <liz@dijkmat.nl>. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

The NexTrieve.pm and the other NexTrieve::xxx modules.