NAME
WARC::Record::Logical::Heuristics - heuristics for locating record segments
SYNOPSIS
use WARC::Record::Logical::Heuristics;
DESCRIPTION
This is an internal module that provides functions for locating record segments when the needed information is not available from an index.
These heuristics mostly assume that the IIPC WARC guidelines have been followed; otherwise, there is simply no efficient solution.
Implementations vary, however. Some use only an incrementing serial number and a constant timestamp taken from the initiation of the crawl job, while the guidelines and specification envision a timestamp reflecting the first write to that specific file rather than the start of the crawl. Constant timestamps are checked first, since that search is simpler.
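To make the constant-timestamp case concrete, the following is a minimal sketch of guessing the next file name by incrementing only the serial number. It is not this module's implementation. The helper name guess_next_volume_name is hypothetical, and the sketch assumes names of the form Prefix-Timestamp-Serial-Crawlhost.warc or .warc.gz, with a 14 to 17 digit timestamp and an all-digit serial.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WARC::Record::Logical::Heuristics.
# Assumes names like Prefix-Timestamp-Serial-Crawlhost.warc[.gz], where the
# timestamp has 14 to 17 digits and the serial consists only of digits.
sub guess_next_volume_name {
    my ($name) = @_;
    my ($prefix, $stamp, $serial, $rest) =
        $name =~ /\A(.+)-(\d{14,17})-(\d+)-(.+?\.warc(?:\.gz)?)\z/
        or return;    # name does not follow the assumed pattern
    # Increment the serial while preserving its zero-padded width.
    my $next = sprintf '%0*d', length $serial, $serial + 1;
    # The timestamp is carried over unchanged (the constant-timestamp case).
    return join '-', $prefix, $stamp, $next, $rest;
}

print guess_next_volume_name('crawl-20190101000000-00041-host1.warc.gz'), "\n";
# prints "crawl-20190101000000-00042-host1.warc.gz"
```

When each file carries its own timestamp, the next name cannot be computed this way and the candidates have to be found by other means, which is why that search is the harder one.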
- $WARC::Record::Logical::Heuristics::Patience
This variable sets a threshold used to limit the reach of an unproductive search. This module tracks the "effort" expended (I/O performed) during a search and abandons the search if the threshold is exceeded. Finding results dynamically (and temporarily) increases this threshold during a search, such that this really sets how far the search will go between results before giving up and concluding that there are no more results.
The search reaches farther if the WARC files are either not compressed or written with the "sl" GZIP extension documented in WARC::Builder. Decompressing record data to find the next record takes considerable effort for larger records, but is not counted for very small records that the system is likely to already have cached after the header has been read.
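The interaction between effort and results can be illustrated with a small simulation. This is a sketch under assumptions, not the module's code: the function name search_with_patience, the unit costs, and the rule that each result resets the limit to "effort so far plus patience" are all illustrative.

```perl
use strict;
use warnings;

# Hypothetical simulation of a patience-limited search. The names, the costs,
# and the exact rule for extending the limit are assumptions for illustration.
sub search_with_patience {
    my ($patience, @candidates) = @_;    # each candidate: [ cost, is_match ]
    my $limit  = $patience;    # effort allowed before giving up
    my $effort = 0;            # effort expended so far
    my @found;
    for my $i (0 .. $#candidates) {
        my ($cost, $is_match) = @{ $candidates[$i] };
        $effort += $cost;
        last if $effort > $limit;    # too much effort since the last result
        if ($is_match) {
            push @found, $i;
            $limit = $effort + $patience;    # a result extends the reach
        }
    }
    return @found;
}

# Matches sit at positions 1, 3, and 7, but three misses in a row after
# position 3 exhaust the patience before position 7 is examined.
my @hits = search_with_patience(5,
    [2, 0], [2, 1], [2, 0], [2, 1], [2, 0], [2, 0], [2, 0], [2, 1]);
print "@hits\n";    # prints "1 3"
```

In this model, a lower per-candidate cost (as with uncompressed files) lets the same patience cover more candidates between results.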
- %WARC::Record::Logical::Heuristics::Effort
This internal hash indicates how costly certain operations should be considered. The keys and their meanings are subject to change at whim, but this is available for quick tuning if needed. Generally, the better solution is to index your data rather than spend time tuning heuristics.
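If quick tuning is needed anyway, one cautious approach is to change a setting only for the duration of a single search by using Perl's local. The sketch below does this for the Patience variable; the helper name with_patience and the value 1000 are hypothetical, and no particular value is recommended. The same technique would apply to entries of %Effort, whose keys are not named here because they are subject to change.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code with a temporarily different Patience.
sub with_patience {
    my ($patience, $code) = @_;
    # "local" restores the previous value when this sub returns or dies.
    local $WARC::Record::Logical::Heuristics::Patience = $patience;
    return $code->();
}

# Demonstration only: the callback simply reports the value it sees.
my $seen = with_patience(1000,
    sub { $WARC::Record::Logical::Heuristics::Patience });
print "inside: $seen\n";    # prints "inside: 1000"
```

Because the change is scoped, other searches in the same process keep the default behavior.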
- ( $first_segment, @clues ) = find_first_segment( $record )
Attempt to locate the first segment of the logical record suggested by the given record without using indexes. Croaks if given a record that does not appear to have been written using WARC segmentation. Returns a WARC::Record object for the first record and a list of other objects that may be useful for locating continuation records. Returns undef in the first slot if no clear first segment was found, but can still return other records encountered during the search even if the search was ultimately unsuccessful.
- ( @segments ) = find_continuation( $first_segment, @clues )
Attempt to locate the continuation segments of a logical record without using indexes. Uses the clues returned from find_first_segment to aid in the search and returns a list of continuation records found that appear to be part of the same logical record as the given first segment.
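The two functions are meant to be used together, with the clues from the first feeding the second. The following sketch shows one way to chain them while handling the documented failure modes. The wrapper name locate_segments is hypothetical, and the sketch assumes the functions are called by their fully qualified names, since this documentation does not say whether they are exported.

```perl
use strict;
use warnings;

# Hypothetical wrapper; only the two find_* calls and their documented
# behavior come from the documentation above.
sub locate_segments {
    my ($record) = @_;
    require WARC::Record::Logical::Heuristics;

    my ($first, @clues);
    # find_first_segment croaks if the record does not appear to have been
    # written using WARC segmentation, so the call is guarded with eval.
    my $ok = eval {
        ($first, @clues) =
            WARC::Record::Logical::Heuristics::find_first_segment($record);
        1;
    };
    return unless $ok;               # not a segmented record
    return unless defined $first;    # no clear first segment was found

    my @rest =
        WARC::Record::Logical::Heuristics::find_continuation($first, @clues);
    return ($first, @rest);          # first segment, then continuations
}
```

An empty return here means only that the heuristics found nothing; an index remains the more reliable way to locate segments.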
AUTHOR
Jacob Bachmeyer, <jcb@cpan.org>
SEE ALSO
WARC, WARC::Collection, WARC::Record
COPYRIGHT AND LICENSE
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.