NAME
WARC - Web ARChive support for Perl
SYNOPSIS
use WARC;
$collection = assemble WARC::Collection (@indexes);
$record = $collection->search(url => $url, time => $when);
$volume = mount WARC::Volume ($filename);
$record = $volume->first_record;
$next_record = $record->next;
$record = $volume->record_at($offset);
# $record is a WARC::Record object
DESCRIPTION
The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.
Overview of the WARC reader support modules
- WARC::Collection
  A WARC::Collection object represents a set of indexed WARC files.
- WARC::Volume
  A WARC::Volume object represents a single WARC file.
- WARC::Record
  Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.
- WARC::Record::Logical
  Support class for WARC records that span multiple segments.
- WARC::Record::Payload
  Planned support for tied filehandles reading WARC payloads.
- WARC::Fields
  A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set and order is preserved.
  The key-value format used in WARC headers has its own MIME type "application/warc-fields" and is also usable as the contents of a "warcinfo" record and elsewhere. The WARC::Fields class also provides support for objects of this type.
- WARC::Index
  WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.
- WARC::Index::Entry
  WARC::Index::Entry is the base class for WARC index entries returned from the various index formats.
- WARC::Index::File::CDX
  Access module for the common CDX WARC index format.
- WARC::Index::File::SDBM
  Planned "fast index" format using "SDBM_File" to index multiple CDX indexes for fast lookup by URL/timestamp pairs. SDBM was chosen because it is included with Perl, and its 1008-byte record limit should be only a minor problem that can be worked around by storing URL prefixes and splitting records.
- WARC::Index::File::SQLite
  Another planned "fast index" format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.
- WARC::Index::Volatile
  Simple in-memory index module for small-scale applications that need index support but want to avoid requiring additional files beyond the WARC volume itself. This reads an entire WARC volume to build and attach an index.
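As an illustration of the order-preserving key-value format described under WARC::Fields above, the following is a minimal sketch of a parser for "application/warc-fields" text. It is not the WARC::Fields implementation: parse_warc_fields and the sample field values are hypothetical, and the sketch assumes CRLF or LF line endings and RFC 822 style folded continuation lines.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the WARC distribution: parse a block of
# "application/warc-fields" text into an ordered list of [name, value] pairs.
# An array is used instead of a hash so that field order is preserved and
# repeated field names are not collapsed.
sub parse_warc_fields {
    my ($text) = @_;
    my @fields;
    for my $line (split /\r?\n/, $text) {
        last if $line eq '';    # a blank line ends the block
        if (@fields && $line =~ /^[ \t]+(.*)$/) {
            # continuation line: fold it into the previous value
            $fields[-1][1] .= ' ' . $1;
        } elsif ($line =~ /^([^:\s]+):[ \t]*(.*?)[ \t]*$/) {
            push @fields, [$1, $2];
        } else {
            die "malformed field line: $line\n";
        }
    }
    return \@fields;
}

# Made-up sample data in the style of a "warcinfo" record body.
my $block = join "\r\n",
    'software: ExampleCrawler/1.0',
    'format: WARC File Format 1.0',
    'description: first line',
    '  continued on a second line',
    '', '';

my $fields = parse_warc_fields($block);
print "$_->[0] => $_->[1]\n" for @$fields;
```

Keeping the pairs in an array rather than a hash is the point of the sketch: iteration returns the fields in the order they appeared in the input.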
Overview of the WARC writer support modules
- WARC::Builder
  The WARC::Builder class provides a means to write new WARC files.
- WARC::Index::Builder
  WARC::Index::Builder is the base class for the index-building tools.
- WARC::Index::File::CDX::Builder
- WARC::Index::File::SDBM::Builder
- WARC::Index::File::SQLite::Builder
  The WARC::Index::File::*::Builder classes provide tools for building indexes either incrementally while writing the corresponding WARC file or after the fact by scanning an existing WARC file.
  The build constructor that WARC::Index provides uses one of these classes for the actual work.
CAVEATS
Support for the RFC 2047 "encoded-words" mechanism is required by the WARC specification but not yet implemented.
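For reference, an encoded-word looks like the sample value in the sketch below, which decodes it with the MIME-Header encoding from the core Encode module. This only illustrates the mechanism; it is not a statement of how this distribution will eventually handle encoded-words, and the sample value is made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A made-up field value containing one RFC 2047 encoded-word
# (charset UTF-8, "Q" encoding) followed by plain ASCII text.
my $raw = '=?UTF-8?Q?caf=C3=A9?= archive';

# Encode's MIME-Header encoding decodes encoded-words into Perl characters.
my $text = decode('MIME-Header', $raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```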
Support for WARC record segmentation is planned but not yet implemented.
Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.
The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for "WARC-alike" handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.
Formats planned for eventual inclusion include MAFF described at http://maf.mozdev.org/maff-specification.html and the MHTML format defined in RFC 2557.
AUTHOR
Jacob Bachmeyer, <jcb@cpan.org>
SEE ALSO
Information about the WARC format at http://bibnum.bnf.fr/WARC/.
An overview of the WARC format at https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
# TODO: add relevant RFCs.
The POD pages for the modules mentioned in the overview lists.
COPYRIGHT AND LICENSE
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.