NAME
WARC::Record::Sponge - data sponge for WARC records
SYNOPSIS
use Digest::SHA;
use WARC::Builder;
use WARC::Record::Sponge;
$builder = new WARC::Builder ( ... );
$sponge = new WARC::Record::Sponge ( type => 'response' );
$sponge->begin_digest(block => sha1 => new Digest::SHA ('sha1'));
while (<$socket>) {
  print $sponge $_;
  # ... other processing ...
  $sponge->begin_digest(payload => sha1 => new Digest::SHA ('sha1'))
    if $end_of_headers_reached;
}
$builder->add($sponge); # add to growing WARC volume
DESCRIPTION
WARC::Record::Sponge objects provide a streaming interface for constructing WARC records as data is received, using a temporary file to store the record content. This allows building records that exceed available memory.
This class provides objects with a tied filehandle interface using a data sponge model. In the "soak" phase, data is written to the handle, along with markers indicating the computation of digests for that data. In the "squeeze" phase, data is read back from the handle and the digests are collected. The object can then be reset to return to the "soak" phase to collect new data and "squeezed" again. All digest markers are removed upon returning to the "soak" phase. The handle is seekable in the "squeeze" phase, but append-only in the "soak" phase.
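The soak/squeeze cycle can be pictured with an ordinary temporary file. The following is only a minimal sketch using the core File::Temp module; it is not the module's implementation, and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use File::Temp ();

# Hypothetical stand-in for the sponge's backing store.
my $tmp = File::Temp->new;    # temporary file, removed when $tmp goes away
binmode $tmp;

# "Soak": data is appended as it arrives.
print {$tmp} "HTTP/1.1 200 OK\r\n\r\n";
print {$tmp} "hello, world\n";

# "Squeeze": note the stored length, rewind, and read back in blocks.
my $length = tell $tmp;
seek $tmp, 0, 0 or die "seek failed: $!";
my ($content, $buf) = ('', '');
while (my $n = read $tmp, $buf, 8) {
    $content .= $buf;
}

# "Reset": discard the stored data so a new soak phase starts empty.
seek $tmp, 0, 0 or die "seek failed: $!";
truncate $tmp, 0 or die "truncate failed: $!";
```

Because the data lives in a file rather than in a Perl scalar, the amount that can be soaked up is bounded by disk space instead of memory.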
A WARC::Record::Sponge isa WARC::Record and inherits the fields method. Header fields may be set on a WARC::Record::Sponge, but all fields other than "WARC-Type" are erased when the reset method is used.
Methods
- $sponge = new WARC::Record::Sponge ( ... )
- $sponge->begin_digest ( $key , $tag , $digest )
-
Insert a digest marker using a digest object that must support the add, clone, and digest methods from the Digest API. All data written to the record is included in all digests active when the data is written.
- $value = $sponge->get_digest ( $key )
-
Return a digest value for the data from the digest marker inserted with the given key to the current end of the data. The result is a Base32 value labelled with the tag given when the digest was started.
- $sponge->readback
-
End the "soak" phase and switch to the "squeeze" phase.
- $sponge->content_length
-
Return the length of the data stored in the sponge if called in the "squeeze" phase. Returns undefined if called during the "soak" phase.
- $sponge->reset
-
End the "squeeze" phase and return to a new "soak" phase.
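The digest marker behavior described for begin_digest and get_digest can be sketched with Digest::SHA alone. This is an illustration, not the module's code: the helpers begin_marker, soak, labelled_digest, and base32 are hypothetical names, and the "tag:value" label layout is assumed from the usual WARC digest header form. It also shows one plausible reason the clone method is required: digest resets the object, so a running digest can only be sampled through a copy.

```perl
use strict;
use warnings;
use Digest::SHA;

# RFC 4648 Base32 without "=" padding (a 20-byte SHA-1 value needs none).
sub base32 {
    my ($bytes) = @_;
    my @alphabet = ('A' .. 'Z', '2' .. '7');
    my $bits = unpack 'B*', $bytes;
    $bits .= '0' x ((5 - length($bits) % 5) % 5);    # pad to a multiple of 5 bits
    return join '', map { $alphabet[ oct "0b$_" ] } $bits =~ /(.{5})/g;
}

my %active;    # key => [ tag, digest object ]; hypothetical bookkeeping

# Start a digest that covers everything written from now on.
sub begin_marker {
    my ($key, $tag, $digest) = @_;
    $active{$key} = [ $tag, $digest ];
}

# Feed newly written data to every digest that is currently active.
sub soak {
    my ($data) = @_;
    $_->[1]->add($data) for values %active;
}

# Finalize a clone so the original digest keeps running.
sub labelled_digest {
    my ($key) = @_;
    my ($tag, $digest) = @{ $active{$key} };
    return $tag . ':' . base32($digest->clone->digest);
}

my $headers = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n";
my $body    = "hello, world\n";

begin_marker(block => sha1 => Digest::SHA->new('sha1'));
soak($headers);
begin_marker(payload => sha1 => Digest::SHA->new('sha1'));
soak($body);

my $block   = labelled_digest('block');      # covers headers and body
my $payload = labelled_digest('payload');    # covers the body only
print "$block\n$payload\n";
```

Here the "block" digest sees both writes while the "payload" digest, started later, sees only the second one.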
CAVEATS
The readback mode only implements block reads using read or sysread; readline and getc are not implemented.
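Code that needs line-oriented input can layer it over block reads. The sketch below is one possible approach, with an in-memory filehandle standing in for a sponge in readback mode; line_reader is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;

# Return an iterator that yields one line per call, built only on read().
sub line_reader {
    my ($fh, $block_size) = @_;
    $block_size ||= 4096;
    my $pending = '';    # data read but not yet returned
    my $eof     = 0;
    return sub {
        # Read more blocks until a newline is buffered or input is exhausted.
        until ($eof or index($pending, "\n") >= 0) {
            my $n = read $fh, my $chunk, $block_size;
            die "read failed: $!" unless defined $n;
            if ($n) { $pending .= $chunk } else { $eof = 1 }
        }
        return unless length $pending;
        my $pos = index $pending, "\n";
        # Remove and return one line, or the unterminated remainder at EOF.
        return substr($pending, 0, $pos + 1, '') if $pos >= 0;
        return substr($pending, 0, length($pending), '');
    };
}

my $data = "first\r\nsecond\r\nlast without newline";
open my $fh, '<', \$data or die "cannot open in-memory handle: $!";

my $next = line_reader($fh, 5);    # tiny block size to exercise buffering
my @lines;
while (defined(my $line = $next->())) {
    push @lines, $line;
}
```

The iterator keeps any partial line in a buffer between calls, so the block size only affects how often read is called, not where lines are split.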
AUTHOR
Jacob Bachmeyer, <jcb@cpan.org>
SEE ALSO
WARC::Record, WARC::Builder, WARC, Digest
COPYRIGHT AND LICENSE
Copyright (C) 2020 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.