NAME
WARC - Web ARChive support for Perl
SYNOPSIS
use WARC;
$collection = assemble WARC::Collection (@indexes);
$record = $collection->search(url => $url, time => $when);
$volume = mount WARC::Volume ($filename);
$record = $volume->first_record;
$next_record = $record->next;
$record = $volume->record_at($offset);
# $record is a WARC::Record object
DESCRIPTION
The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.
Overview of the WARC reader support modules
- WARC::Collection
  A WARC::Collection object represents a set of indexed WARC files.
- WARC::Volume
  A WARC::Volume object represents a single WARC file.
- WARC::Record
  Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.
- WARC::Record::Logical
  Support class for WARC records that span multiple segments.
- WARC::Record::Payload
  Planned support for tied filehandles reading WARC payloads.
- WARC::Fields
  A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set and order is preserved.
  The key-value format used in WARC headers has its own MIME type "application/warc-fields" and is also usable as the contents of a "warcinfo" record and elsewhere. The WARC::Fields class also provides support for objects of this type.
- WARC::Index
  WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.
- WARC::Index::Entry
  WARC::Index::Entry is the base class for WARC index entries returned from the various index formats.
- WARC::Index::File::CDX
  Access module for the common CDX WARC index format.
- WARC::Index::File::SDBM
  Planned "fast index" format using "SDBM_File" to index multiple CDX indexes for fast lookup by URL/timestamp pairs. SDBM is planned because it is included with Perl; its 1008-byte record limit should be only a minor problem, worked around by storing URL prefixes and splitting records.
- WARC::Index::File::SQLite
  Another planned "fast index" format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.
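The order-preserving key-value format described under WARC::Fields can be illustrated with a minimal sketch in plain Perl. This is not the WARC::Fields implementation or its API; the helper name parse_warc_fields and the sample field values are hypothetical, and the sketch assumes CRLF or LF line endings with optional folded continuation lines.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the WARC distribution: parse a block of
# "application/warc-fields" text into an ordered list of [name, value] pairs.
# A list (rather than a hash) keeps the original order and repeated names.
sub parse_warc_fields {
    my ($text) = @_;
    my @fields;
    for my $line (split /\r?\n/, $text) {
        last if $line eq '';    # a blank line ends the field block
        if (@fields && $line =~ /^[ \t]+(.*)$/) {
            # Continuation line: append to the previous field's value.
            $fields[-1][1] .= ' ' . $1;
            next;
        }
        my ($name, $value) = $line =~ /^([^:\s]+):[ \t]*(.*)$/
            or die "malformed field line: $line\n";
        $value =~ s/[ \t]+$//;  # drop trailing whitespace
        push @fields, [$name, $value];
    }
    return \@fields;
}

# Sample input with made-up values, as might appear in a "warcinfo" record.
my $sample = join "\r\n",
    'software: ExampleCrawler/1.0',
    'format: WARC File Format 1.0',
    'description: a sample',
    ' crawl',
    '';

my $fields = parse_warc_fields($sample);
print "$_->[0] => $_->[1]\n" for @$fields;
```

Returning an array of pairs instead of a hash is what keeps field order and duplicate names intact, which is the property that distinguishes WARC headers from the HTTP::Headers model.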
Overview of the WARC writer support modules
- WARC::Builder
  The WARC::Builder class provides a means to write new WARC files.
- WARC::Index::Builder
  WARC::Index::Builder is the base class for the index-building tools.
- WARC::Index::File::CDX::Builder
- WARC::Index::File::SDBM::Builder
- WARC::Index::File::SQLite::Builder
  The WARC::Index::File::*::Builder classes provide tools for building indexes either incrementally while writing the corresponding WARC file or after the fact by scanning an existing WARC file.
  The build constructor that WARC::Index provides uses one of these classes for the actual work.
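Building an index after the fact by scanning an existing WARC file can be sketched roughly as follows. This is an illustration only and not how the WARC::Index::File::*::Builder classes work; the helpers make_record and scan_warc are hypothetical, the sample records are abbreviated (real records carry more mandatory headers), and the sketch assumes an uncompressed WARC stream without folded header lines.

```perl
use strict;
use warnings;

# Hypothetical helper: build one abbreviated, uncompressed WARC record.
sub make_record {
    my ($uri, $date, $body) = @_;
    return join("\r\n",
        'WARC/1.0',
        'WARC-Type: resource',
        "WARC-Target-URI: $uri",
        "WARC-Date: $date",
        'Content-Length: ' . length($body),
        '', '') . $body . "\r\n\r\n";
}

# Hypothetical scanner: walk the stream record by record and return a list of
# { url, time, offset } entries for every record that names a target URI.
sub scan_warc {
    my ($fh) = @_;
    local $/ = "\r\n";    # WARC header lines end in CRLF
    my @entries;
    while (1) {
        my $offset  = tell $fh;    # where this record starts
        my $version = <$fh>;
        last unless defined $version;
        chomp $version;
        $version =~ m{^WARC/\d+\.\d+$}
            or die "no WARC version line at offset $offset\n";

        my %header;
        while (defined(my $line = <$fh>)) {
            chomp $line;
            last if $line eq '';    # blank line ends the header
            my ($name, $value) = $line =~ /^([^:\s]+):[ \t]*(.*)$/
                or die "malformed header line in record at offset $offset\n";
            $header{lc $name} = $value;
        }

        my $length = $header{'content-length'};
        die "missing Content-Length in record at offset $offset\n"
            unless defined $length && $length =~ /^\d+$/;

        # Skip the record block plus the two CRLFs that follow it.
        my $want = $length + 4;
        my $got  = read($fh, my $skipped, $want);
        die "truncated record at offset $offset\n"
            unless defined $got && $got == $want
                && substr($skipped, -4) eq "\r\n\r\n";

        push @entries, {
            url    => $header{'warc-target-uri'},
            time   => $header{'warc-date'},
            offset => $offset,
        } if defined $header{'warc-target-uri'};
    }
    return \@entries;
}

# Two made-up records held in memory stand in for a WARC file on disk.
my $data = make_record('http://example.com/', '2019-01-01T00:00:00Z', 'hello')
         . make_record('http://example.com/about', '2019-01-02T00:00:00Z', 'about page');

open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
binmode $fh;
my $index = scan_warc($fh);
close $fh;

printf "%-26s %s offset=%d\n", @{$_}{qw(url time offset)} for @$index;
```

Each entry pairs a URL and timestamp with a byte offset, which matches the shape of the lookups shown in the synopsis: a search by url and time, followed by record_at with an offset.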
CAVEATS
Support for the RFC 2047 "encoded-words" mechanism is required by the WARC specification but not yet implemented.
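Until this is implemented, a caller who encounters an encoded-word in a header value could decode it by hand. The following is a minimal sketch using the "MIME-Header" encoding from the core Encode distribution; it assumes the raw header value has already been obtained as a string, and the sample value is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A made-up header value consisting of one RFC 2047 encoded-word
# (UTF-8 charset, "Q" encoding).
my $raw = '=?UTF-8?Q?caf=C3=A9?=';

# The 'MIME-Header' encoding is provided by Encode::MIME::Header,
# which ships with Encode.
my $decoded = decode('MIME-Header', $raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```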
Support for WARC record segmentation is planned but not yet implemented.
Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.
The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for "WARC-alike" handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.
Formats planned for eventual inclusion include MAFF, described at http://maf.mozdev.org/maff-specification.html, and the MHTML format defined in RFC 2557.
AUTHOR
Jacob Bachmeyer, <jcb@cpan.org>
SEE ALSO
Information about the WARC format at http://bibnum.bnf.fr/WARC/.
An overview of the WARC format at https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
# TODO: add relevant RFCs.
The POD pages for the modules mentioned in the overview lists.
COPYRIGHT AND LICENSE
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.