NAME

WARC::Record::Sponge - data sponge for WARC records

SYNOPSIS

use Digest::SHA;
use WARC::Builder;
use WARC::Record::Sponge;

$builder = new WARC::Builder ( ... );
$sponge = new WARC::Record::Sponge ( type => 'response' );
$sponge->begin_digest(block => sha1 => new Digest::SHA ('sha1'));

while (<$socket>) {
  print $sponge $_;
  # ... other processing ...
  $sponge->begin_digest(payload => sha1 => new Digest::SHA ('sha1'))
   if $end_of_headers_reached;
}

$builder->add($sponge);  # add to growing WARC volume

DESCRIPTION

WARC::Record::Sponge objects provide a streaming interface for constructing WARC records as data is received. The record content is stored in a temporary file rather than in memory, so a record may be larger than available memory.

This class provides objects with a tied filehandle interface using a data sponge model. In the "soak" phase, data is written to the handle, along with markers indicating where the computation of each digest begins. In the "squeeze" phase, data is read back from the handle and the digests are collected. The object can then be reset, which returns it to the "soak" phase to collect new data, and later "squeezed" again. All digest markers are removed upon returning to the "soak" phase. The handle is seekable in the "squeeze" phase, but append-only in the "soak" phase.
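The two phases can be illustrated with a minimal sketch built only on core modules. The Sketch::Sponge package and its soak and squeeze methods are hypothetical names chosen for this illustration; this is not the WARC::Record::Sponge implementation, which uses a tied filehandle instead of explicit methods.

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical stand-in for the sponge model, not the module's own code.
package Sketch::Sponge {
    sub new {
        my ($class) = @_;
        my $fh = File::Temp->new;    # temporary file holds the content
        binmode $fh;
        return bless { fh => $fh, phase => 'soak' }, $class;
    }

    # "soak" phase: append-only writes
    sub soak {
        my ($self, $data) = @_;
        die "not in soak phase\n" unless $self->{phase} eq 'soak';
        print { $self->{fh} } $data or die "write failed: $!\n";
        return;
    }

    # switch from "soak" to "squeeze" and rewind to the start of the data
    sub readback {
        my ($self) = @_;
        $self->{fh}->flush;
        seek($self->{fh}, 0, 0) or die "seek failed: $!\n";
        $self->{phase} = 'squeeze';
        return;
    }

    # "squeeze" phase: block reads; returns undef at end of data
    sub squeeze {
        my ($self, $size) = @_;
        die "not in squeeze phase\n" unless $self->{phase} eq 'squeeze';
        my $count = read($self->{fh}, my $block, $size);
        die "read failed: $!\n" unless defined $count;
        return $count ? $block : undef;
    }

    # discard the stored data and return to a fresh "soak" phase
    sub reset {
        my ($self) = @_;
        my $fh = $self->{fh};
        seek($fh, 0, 0)   or die "seek failed: $!\n";
        truncate($fh, 0)  or die "truncate failed: $!\n";
        $self->{phase} = 'soak';
        return;
    }
}

my $sponge = Sketch::Sponge->new;
$sponge->soak("HTTP/1.1 200 OK\r\n\r\n");
$sponge->soak("hello");
$sponge->readback;

my $content = '';
while (defined(my $block = $sponge->squeeze(8))) {
    $content .= $block;
}
```

Because the data lives in a temporary file, the loop at the end never holds more than one block plus whatever the caller chooses to accumulate.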

A WARC::Record::Sponge isa WARC::Record and inherits the fields method. Header fields may be set on a WARC::Record::Sponge, but all fields other than "WARC-Type" are erased when the reset method is used.

Methods

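$sponge = new WARC::Record::Sponge ( ... )

Construct a new sponge object, ready to accept data in the "soak" phase. The SYNOPSIS shows a type option being passed.

$sponge->begin_digest ( $key , $tag , $digest )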

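Insert a digest marker at the current end of the data. The $key names the marker for later use with get_digest, the $tag labels the resulting digest value, and $digest is a digest object that must support the add, clone, and digest methods from the Digest API. All data written to the record is included in every digest that is active when the data is written.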
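The effect of starting digests at different points can be sketched with Digest::SHA alone. The helper names below (begin_digest_sketch, write_sketch, peek_hex) and the %active hash are hypothetical bookkeeping for this illustration, not part of the module. Cloning before finalizing is one reason a digest object would need the clone method, since digest resets the object; whether the module works this way internally is an assumption.

```perl
use strict;
use warnings;
use Digest::SHA ();

my %active;    # key => digest object (hypothetical bookkeeping)

# Start a digest at the current point in the stream.
sub begin_digest_sketch {
    my ($key, $digest) = @_;
    $active{$key} = $digest;
    return;
}

# Feed newly written data to every digest that is currently active.
sub write_sketch {
    my ($data) = @_;
    $_->add($data) for values %active;
    return;
}

# Read a running digest without disturbing it: finalize a clone.
sub peek_hex {
    my ($key) = @_;
    return unpack 'H*', $active{$key}->clone->digest;
}

begin_digest_sketch(block => Digest::SHA->new('sha1'));
write_sketch("Header: value\r\n\r\n");

# Started later, so it covers only the data written from here on.
begin_digest_sketch(payload => Digest::SHA->new('sha1'));
write_sketch("body");
```

Here the block digest covers the headers and the body, while the payload digest covers only the body.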

$value = $sponge->get_digest ( $key )

Return a digest value for the data from the digest marker inserted with the given key to the current end of the data. The result is a Base32 value labelled with the tag given when the digest was started.
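As an illustration of a labelled Base32 digest, the following sketch encodes a raw digest with the RFC 4648 Base32 alphabet and prefixes the tag, in the "tag:value" form used by WARC digest header fields. The base32_encode and labelled_digest helpers are hypothetical, and the exact string that get_digest returns is assumed to follow this form.

```perl
use strict;
use warnings;
use Digest::SHA ();

# Encode bytes with the RFC 4648 Base32 alphabet, without "=" padding.
# A 20-byte SHA-1 digest is exactly 32 characters, so it needs no padding.
sub base32_encode {
    my ($bytes) = @_;
    my @alphabet = ('A' .. 'Z', '2' .. '7');
    my $bits = unpack 'B*', $bytes;
    # pad the bit string with zeros to a multiple of 5 bits
    $bits .= '0' x ((5 - length($bits) % 5) % 5);
    return join '', map { $alphabet[ oct "0b$_" ] } $bits =~ /(.{5})/g;
}

# Label a running digest with its tag; the clone keeps $digest usable.
sub labelled_digest {
    my ($tag, $digest) = @_;
    return "$tag:" . base32_encode($digest->clone->digest);
}

my $sha1 = Digest::SHA->new('sha1');
$sha1->add('payload data');
my $value = labelled_digest(sha1 => $sha1);
```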

$sponge->readback

End the "soak" phase and switch to the "squeeze" phase.

$sponge->content_length

Return the length of the data stored in the sponge if called in the "squeeze" phase. Returns undef if called during the "soak" phase.

$sponge->reset

End the "squeeze" phase and return to a new "soak" phase.

CAVEATS

The readback mode only implements block reads using read or sysread; readline and getc are not implemented.
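Code that needs lines can assemble them from block reads. The following sketch uses only read; each_line_by_blocks is a hypothetical helper, shown here on an in-memory filehandle. Passing a sponge in the "squeeze" phase as $fh is assumed to work the same way, since read is one of the implemented operations.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per line, using block reads only.
sub each_line_by_blocks {
    my ($fh, $callback, $block_size) = @_;
    $block_size ||= 8192;
    my $pending = '';
    while (1) {
        my $count = read($fh, my $block, $block_size);
        die "read failed: $!\n" unless defined $count;
        last if $count == 0;    # end of data
        $pending .= $block;
        # hand over every complete line collected so far
        while ($pending =~ s/\A(.*?\n)//s) {
            $callback->($1);
        }
    }
    # a final line without a trailing newline
    $callback->($pending) if length $pending;
    return;
}

my $data = "a\r\nbb\nccc";
open(my $in, '<', \$data) or die "open failed: $!\n";
my @lines;
each_line_by_blocks($in, sub { push @lines, $_[0] }, 2);
close $in;
```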

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

SEE ALSO

WARC::Record, WARC::Builder, WARC, Digest

COPYRIGHT AND LICENSE

Copyright (C) 2020 by Jacob Bachmeyer

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.