NAME
WARC::Fields - WARC record headers and application/warc-fields
SYNOPSIS
require WARC::Fields;
$f = WARC::Fields->new;
$f = $record->fields; # get WARC record headers
$g = $f->clone; # make writable copy
$g->set_readonly; # make read-only
$f->field('WARC-Type' => 'metadata'); # set
$value = $f->field('WARC-Type'); # get
$f->remove_field('WARC-Type'); # delete
$fields_text = $f->as_string; # get WARC header lines
tie @field_names, ref $f, $f; # bind ordered list of field names
tie %fields, ref $f, $f; # bind hash of field names => values
$row = $f->[$num]; # tie an anonymous array and access it
$value = $f->{$name}; # likewise with an anonymous tied hash
$name = "$row"; # tied array returns objects
$value = $row->value; # one specific value
$offset = $row->offset; # N of M with same name
foreach (keys %{$f}) { ... } # iterate over names, in order
DESCRIPTION
The WARC::Fields class encapsulates information in the "application/warc-fields" format used for WARC record headers. This is a simple key-value format closely analogous to HTTP headers, but the differences are significant enough that the HTTP::Headers class cannot be reliably reused for WARC fields.
Instances of this class are usually created as member variables of the WARC::Record class, but can also be returned as the content of WARC records with Content-Type "application/warc-fields".
Instances of WARC::Fields retrieved from WARC files are read-only and will croak() if any attempt is made to change their contents.
This class strives to faithfully represent the contents of a WARC file, although the field names are defined to be case-insensitive.
Most WARC headers may only appear once and with a single value in valid WARC records, with the notable exception of the WARC-Concurrent-To header. WARC::Fields neither attempts to enforce nor relies upon this constraint. Headers that appear multiple times are considered to have multiple values, that is, the value associated with the header name will be an array reference. Similarly, the name of a recurring header is repeated in the tied array interface. When iterating a tied hash, all values of a recurring header are collected and returned with the first occurrence of its key.
As with HTTP::Headers, the '_' character is converted to '-' in field names unless the first character of the name is ':', which cannot itself appear in a field name. Unlike HTTP::Headers, the leading ':' is stripped off immediately and the name stored otherwise exactly as given. The method and tied hash interfaces allow this convenience feature. The field names exposed via the tied array interface are reported exactly as they appear in the WARC file.
Strictly, "X-Crazy-Header" and "X_Crazy_Header" are two different headers that the above convenience mechanism conflates. The solution is simple: if (and only if) a header field already exists with the exact name given, it is used; otherwise y/_/-/ is applied and the name is rechecked for another exact match. If no match is found, case is folded and a third check is performed. If a match is found, the existing header is updated; otherwise a new header is created with character case as given.
The WARC standard specifically states that field names are case-insensitive. Accordingly, "X-Crazy-Header" and "X-CRAZY-HeAdEr" are considered the same header for the method and tied hash interfaces. They will appear exactly as given in the tied array interface, however.
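The lookup order described above can be sketched in plain Perl. The helper resolve_name below is hypothetical and is not part of WARC::Fields; it only models the three checks (exact, after y/_/-/, case-folded) against a list of stored names. Applying the case-folded check to names given with a leading ':' and returning the converted name for a new header are assumptions of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the lookup order only; the internals of
# WARC::Fields may differ.
sub resolve_name {
    my ($existing, $name) = @_;    # $existing: array reference of stored names

    # A leading ':' suppresses the '_' to '-' conversion and is stripped.
    my $literal = ($name =~ s/^://);

    # 1. An existing header with the exact name given wins.
    for my $stored (@$existing) {
        return $stored if $stored eq $name;
    }

    # 2. Convert '_' to '-' (unless suppressed) and look for an exact match.
    my $candidate = $name;
    $candidate =~ tr/_/-/ unless $literal;
    for my $stored (@$existing) {
        return $stored if $stored eq $candidate;
    }

    # 3. Fold case and check a third time.
    for my $stored (@$existing) {
        return $stored if lc($stored) eq lc($candidate);
    }

    # No match: a new header would be created with this name.
    return $candidate;
}

my @stored = ('WARC-Type', 'X_Crazy_Header', 'Content-Length');

print resolve_name(\@stored, $_), "\n"
    for 'X_Crazy_Header', 'WARC_Type', 'content_length',
        ':X_New_Header', 'X_New_Header';
```

Because the exact check runs first, a stored "X_Crazy_Header" is still reachable by its literal name even though underscores are normally converted.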
Methods
- $f = WARC::Fields->new
  Construct a new WARC::Fields object. Initial contents can be passed as key-value pairs to this constructor and will be added in the given order.
- $f->clone
  Copy a WARC::Fields object. A copy of a read-only object is writable.
- $f->field( $name )
- $f->field( $name => $value )
- $f->field( $n1 => $v1, $n2 => $v2, ... )
  Get or set the value of one or more fields. The field name is not case sensitive, but WARC::Fields will preserve its case if a new entry is created.
- $f = WARC::Fields->parse( $text )
- $f = WARC::Fields->parse( from => $fh )
  Construct a new WARC::Fields object, reading initial contents from the provided text string or filehandle.
  If the parse method encounters a field name with a leading ':', which implies an empty name and is not allowed, the leading ':' is silently dropped from the line and parsing is retried. If the line is not valid after this change, the parse method croaks.
- $f->as_string
  Return the contents as a formatted WARC header or application/warc-fields block.
- $f->set_readonly
  Mark a WARC::Fields object read-only. All methods that modify the object will croak() if called on a read-only object.
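The retry rule described for the parse method (drop a leading ':' once, then croak if the line is still invalid) can be sketched for a single line as follows. The function parse_field_line and its deliberately simplified line grammar (no continuation lines, any non-colon, non-space characters accepted as a name) are assumptions made for illustration, not the actual WARC::Fields parser.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical single-line parser with a simplified grammar; it only models
# the "drop the leading ':' and retry" rule.
sub parse_field_line {
    my ($line) = @_;
    my $pattern = qr/^([^:\s]+):[ \t]*(.*?)[ \t]*$/;

    return ($1, $2) if $line =~ $pattern;

    # A leading ':' would imply an empty field name: drop it and retry once.
    if ($line =~ /^:/) {
        my $retry = $line;
        $retry =~ s/^://;
        return ($1, $2) if $retry =~ $pattern;
    }

    croak "invalid field line: $line";
}

my ($name, $value) = parse_field_line(':WARC-Type: metadata');
print "$name => $value\n";

# Still invalid after dropping one ':', so this line is rejected.
my $ok = eval { parse_field_line('::broken'); 1 };
print "rejected: $@" unless $ok;
```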
Tied Array Access
The order of field names can be fully controlled by tying an array to a WARC::Fields object and manipulating the array using ordinary Perl operations. Removing a name from the array effectively removes the field from the object, but the value for that name is still remembered, allowing names to be moved about without loss of data.
WARC::Fields will croak() if an attempt is made to set a field name with a leading ':' using the tied array interface.
The tied array interface accepts simple string values but returns objects with additional information. The returned object stringifies to the name for that row but additionally has value and offset methods.
- $row = $array[$n]
- $row = $f->[$n]
  The tied array FETCH method returns a "row object" instead of the name itself.
- $name = "$row"
- $name = $row->name
- $name = "$f->[$n]"
- $name = $f->[$n]->name
  The name method on a row object returns the field name. Stringification is overloaded to call this method.
- $value = $row->value
- $value = $array[$n]->value
- $value = $f->[$n]->value
  The value method on a row object returns the field value for this particular row. Only a single scalar is returned, even if multiple rows share the same name.
- $offset = $row->offset
- $offset = $array[$n]->offset
- $offset = $f->[$n]->offset
  The offset method on a row object returns the position of this row amongst multiple rows with the same field name. These positions are numbered from zero and are identical to the positions in the array reference returned for this row's field name from the field method or the tied hash interface.
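To make the row-object behavior concrete, here is a minimal stand-in written in plain Perl. The package My::Row and its constructor are hypothetical and only mimic the documented interface: name, value, and offset methods, with stringification overloaded to return the name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a row object; not the class used by WARC::Fields.
package My::Row;

# Stringification calls the name method, as described for real row objects.
use overload '""' => sub { $_[0]->name }, fallback => 1;

sub new    { my ($class, %args) = @_; return bless { %args }, $class }
sub name   { $_[0]{name} }
sub value  { $_[0]{value} }
sub offset { $_[0]{offset} }

package main;

# Build rows from ordered [name, value] pairs; rows sharing a name
# (compared case-insensitively) are numbered from zero.
my %seen;
my @rows = map {
    my ($name, $value) = @$_;
    My::Row->new(name => $name, value => $value, offset => $seen{ lc $name }++);
} (
    [ 'WARC-Type'          => 'response' ],
    [ 'WARC-Concurrent-To' => '<urn:uuid:aaaa>' ],
    [ 'WARC-Record-ID'     => '<urn:uuid:bbbb>' ],
    [ 'WARC-Concurrent-To' => '<urn:uuid:cccc>' ],
);

# Each row stringifies to its name and carries exactly one value.
printf "%s [%d] = %s\n", $_, $_->offset, $_->value for @rows;
```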
Tied Hash Access
The contents of a WARC::Fields object can be easily examined by tying a hash to the object. Reading or setting a hash key is equivalent to the field method, but the tied hash will iterate keys and values in the order in which each key first appears in the internal list.
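The iteration order can be illustrated with a small read-only tied hash over an ordered list of rows. The class My::OrderedFields is hypothetical and is not how WARC::Fields is implemented; it only shows keys coming back in first-appearance order, with the values of a recurring name collected into an array reference and lookups treated as case-insensitive.

```perl
use strict;
use warnings;

# Hypothetical read-only tied hash over ordered [name, value] rows.
package My::OrderedFields;

sub TIEHASH {
    my ($class, @rows) = @_;
    my (@order, %by_name);
    for my $row (@rows) {
        my ($name, $value) = @$row;
        my $key = lc $name;
        # Remember a name only at its first occurrence.
        push @order, $name unless exists $by_name{$key};
        push @{ $by_name{$key} }, $value;
    }
    return bless { order => \@order, by_name => \%by_name, cursor => 0 }, $class;
}

sub FETCH {
    my ($self, $name) = @_;
    my $found = $self->{by_name}{ lc $name };
    return undef unless $found;
    # One value: plain scalar. Several values: array reference.
    return @$found > 1 ? $found : $found->[0];
}

sub EXISTS   { exists $_[0]{by_name}{ lc $_[1] } }
sub FIRSTKEY { my ($self) = @_; $self->{cursor} = 0; return $self->NEXTKEY }
sub NEXTKEY  { my ($self) = @_; return $self->{order}[ $self->{cursor}++ ] }

package main;

tie my %fields, 'My::OrderedFields',
    [ 'WARC-Type'          => 'response' ],
    [ 'WARC-Concurrent-To' => '<urn:uuid:aaaa>' ],
    [ 'WARC-Record-ID'     => '<urn:uuid:bbbb>' ],
    [ 'WARC-Concurrent-To' => '<urn:uuid:cccc>' ];

my @names = keys %fields;
for my $name (@names) {
    my $value = $fields{$name};
    print "$name => ", (ref($value) ? join(', ', @$value) : $value), "\n";
}
```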
Overloaded Dereference Operators
The WARC::Fields class provides overloaded dereference operators for array and hash dereferencing. The overloaded operators provide an anonymous tied array or hash as needed, allowing the object itself to be used as a reference to its tied array and hash interfaces. There is a caveat to be aware of, however, so read on.
Reference Count Trickery with Overloaded Dereference Operators
To avoid problems, the underlying tied object is a reference to the parent object. For ordinary use of tie, this is a strong reference. However, the anonymous tied array and hash are cached in the object to avoid having to tie a new object every time the dereference operators are used.
To prevent memory leaks due to circular references, the overloaded dereference operators tie a weak reference to the parent object. The tied aggregate always holds a strong reference to its object, but when the dereference operators are used, that inner object is a weak reference to the actual WARC::Fields object.
The caveat is this: do not attempt to save a reference to the array or hash produced by dereferencing a WARC::Fields object. The parent WARC::Fields object must remain in scope for as long as any anonymous tied aggregates exist.
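The caveat can be reproduced with a small model of the mechanism. The classes My::Fields and My::NamesArray below are hypothetical and cover only the array side; they assume a parent that caches its anonymous tied array and a tie object that holds the parent through a reference weakened with Scalar::Util. The croak on a missing parent is this sketch's own choice; the text above does not say how WARC::Fields itself behaves in that situation.

```perl
use strict;
use warnings;
use Carp ();
use Scalar::Util ();

# Hypothetical tie class: its underlying object refers back to the parent.
package My::NamesArray;

sub TIEARRAY {
    my ($class, $parent, $weak) = @_;
    my $self = bless { parent => $parent }, $class;
    # Used via overloaded dereference: keep only a weak reference.
    Scalar::Util::weaken($self->{parent}) if $weak;
    return $self;
}

sub _parent   { $_[0]{parent} or Carp::croak('parent object no longer exists') }
sub FETCH     { $_[0]->_parent->{names}[ $_[1] ] }
sub FETCHSIZE { scalar @{ $_[0]->_parent->{names} } }

# Hypothetical parent class with an overloaded array dereference.
package My::Fields;

use overload '@{}' => sub { $_[0]->_as_array }, fallback => 1;

sub new { my ($class, @names) = @_; return bless { names => [@names] }, $class }

sub _as_array {
    my ($self) = @_;
    # The tied array is cached in the parent. Because the tie object holds
    # the parent only weakly, the cache does not create a strong cycle.
    unless ($self->{array}) {
        tie my @array, 'My::NamesArray', $self, 1;
        $self->{array} = \@array;
    }
    return $self->{array};
}

package main;

my $f     = My::Fields->new(qw(WARC-Type WARC-Date));
my $count = scalar @$f;    # fine: the parent is alive

my $saved = \@$f;          # the mistake: keeping the tied array itself
undef $f;                  # parent destroyed; the weak reference is cleared

my $ok = eval { my $n = scalar @$saved; 1 };
print "saved reference is no longer usable: $@" unless $ok;
```

In this model the saved array outlives its parent and every later access fails, which is why the parent object has to stay in scope while the anonymous aggregates are in use.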
AUTHOR
Jacob Bachmeyer, <jcb@cpan.org>
SEE ALSO
WARC, HTTP::Headers, Scalar::Util for weaken
COPYRIGHT AND LICENSE
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.