NAME
WARC::Fields - WARC record headers and application/warc-fields
SYNOPSIS
require WARC::Fields;
$f = new WARC::Fields;
$f = $record->fields; # get WARC record headers
$g = $f->clone; # make writable copy
$g->set_readonly; # make read-only
$f->field('WARC-Type' => 'metadata'); # set
$value = $f->field('WARC-Type'); # get
$fields_text = $f->as_string; # get WARC header lines for display
$fields_block = $f->as_block; # format for WARC file
tie @field_names, ref $f, $f; # bind ordered list of field names
tie %fields, ref $f, $f; # bind hash of field names => values
$entry = $f->[$num]; # tie an anonymous array and access it
$value = $f->{$name}; # likewise with an anonymous tied hash
$name = "$entry"; # tied array returns objects
$value = $entry->value; # one specific value
$offset = $entry->offset; # N of M with same name
foreach (keys %{$f}) { ... } # iterate over names, in order
DESCRIPTION
The WARC::Fields class encapsulates information in the "application/warc-fields" format used for WARC record headers. This is a simple key-value format closely analogous to HTTP headers, but the differences are significant enough that the HTTP::Headers class cannot be reliably reused for WARC fields.
Instances of this class are usually created as member variables of the WARC::Record class, but can also be returned as the content of WARC records with Content-Type "application/warc-fields".
Instances of WARC::Fields retrieved from WARC files are read-only and will croak() if any attempt is made to change their contents.
This class strives to faithfully represent the contents of a WARC file, while providing a simple interface to answer simple questions.
Multiple Values
Most WARC headers may only appear once and with a single value in valid WARC records, with the notable exception of the WARC-Concurrent-To header. WARC::Fields neither attempts to enforce nor relies upon this constraint. Headers that appear multiple times are considered to have multiple values. When iterating a tied hash, all values of a recurring header are collected and returned with the first occurrence of its key.
Multiple values are returned from the field method and tied hash interface as array references, and are set by passing in an array reference. Existing rows are reused where possible when updating a field with multiple values. If the new array reference contains fewer items (including the special case of replacing multiple values with a single value), excess rows are deleted. If the new array reference requires additional rows to be inserted, they are inserted immediately after the last existing row for a field, with the same name case as that row.
Precise control of the layout is available using the tied array interface, but the ordering of the header rows is not constrained in the WARC specification.
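The row-update rules described above can be sketched in plain Perl. This is only an illustration over an ordered table of [name, value] pairs: set_field_rows is a hypothetical helper, not part of WARC::Fields, and appending rows for a field that does not yet exist is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WARC::Fields: update the rows for one
# field in an ordered table of [name, value] pairs.
sub set_field_rows {
    my ($rows, $name, @values) = @_;

    # Indexes of existing rows whose name matches, ignoring case.
    my @slots = grep { lc($rows->[$_][0]) eq lc($name) } 0 .. $#$rows;

    # Reuse existing rows where possible.
    my $reused = @slots < @values ? scalar @slots : scalar @values;
    $rows->[ $slots[$_] ][1] = $values[$_] for 0 .. $reused - 1;

    if (@values < @slots) {
        # Fewer values than rows: delete the excess rows, last first so
        # that the earlier indexes stay valid.
        splice @$rows, $_, 1 for reverse @slots[ $reused .. $#slots ];
    }
    elsif (@values > @slots) {
        my @extra = @values[ $reused .. $#values ];
        if (@slots) {
            # Insert after the last existing row, copying its name case.
            my $last = $slots[-1];
            my $case = $rows->[$last][0];
            splice @$rows, $last + 1, 0, map { [ $case, $_ ] } @extra;
        }
        else {
            # Assumption: rows for a field with no existing rows are appended.
            push @$rows, map { [ $name, $_ ] } @extra;
        }
    }
    return $rows;
}

my $rows = [
    [ 'WARC-Type'          => 'metadata' ],
    [ 'warc-concurrent-to' => '<urn:uuid:a>' ],
    [ 'Content-Length'     => '0' ],
];
set_field_rows($rows, 'WARC-Concurrent-To', '<urn:uuid:b>', '<urn:uuid:c>');
print join(', ', map { "$_->[0]: $_->[1]" } @$rows), "\n";
```

In this sketch the existing row is reused for the first value and the second value is inserted directly after it, keeping the lowercase spelling of the existing row.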
Field Name Mangling
As with HTTP::Headers, the '_' character is converted to '-' in field names unless the first character of the name is ':', which cannot itself appear in a field name. Unlike HTTP::Headers, the leading ':' is stripped off immediately and the name stored otherwise exactly as given. The field method and tied hash interface allow this convenience feature. The field names exposed via the tied array interface are reported exactly as they appear in the WARC file.
Strictly, "X-Crazy-Header" and "X_Crazy_Header" are two different headers that the above convenience mechanism conflates. The solution is simple: if (and only if) a header field already exists with the exact name given, it is used; otherwise s/_/-/g is applied and the name is rechecked for another exact match. If no match is found, case is folded and a third check performed. If a match is found, the existing header is updated; otherwise a new header is created with character case as given.
The WARC specification specifically states that field names are case-insensitive; accordingly, "X-Crazy-Header" and "X-CRAZY-HeAdEr" are considered the same header for the field method and tied hash interface. They will appear exactly as given in the tied array interface, however.
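The three-step lookup described above might be sketched as follows. resolve_name is a hypothetical helper written for illustration, not the module's own code. Whether a name quoted with a leading ':' still takes part in the case-folded check is an assumption here, as is returning the converted name when no header matches.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: pick the stored name that a
# requested name refers to, or return the name a new header would get.
sub resolve_name {
    my ($existing, $given) = @_;    # $existing: array ref of stored names

    # A leading ':' quotes the name: strip it and skip the '_' conversion.
    my $quoted = ($given =~ s/^://);

    # 1. Exact match on the name as given.
    return $given if grep { $_ eq $given } @$existing;

    # 2. Convert '_' to '-' and look for an exact match again.
    my $name = $given;
    $name =~ tr/_/-/ unless $quoted;
    return $name if grep { $_ eq $name } @$existing;

    # 3. Fold case and check a third time.
    my ($match) = grep { lc($_) eq lc($name) } @$existing;
    return $match if defined $match;

    # No match: a new header would be created with the case as given.
    return $name;
}

my @stored = ('X_Crazy_Header', 'WARC-Type');
print resolve_name(\@stored, 'X_Crazy_Header'), "\n";    # exact match wins
print resolve_name(\@stored, 'warc_type'),      "\n";    # resolves to WARC-Type
print resolve_name(\@stored, ':X_New'),         "\n";    # quoted, kept as X_New
```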
Methods
- $f = WARC::Fields->new
-
Construct a new WARC::Fields object. Initial contents can be passed as key-value pairs to this constructor and will be added in the given order.
Repeating a key or supplying an array reference as a value assigns multiple values to a key. To reduce the risk of confusion, only quoting with a leading ':' overrides the convenience feature of applying s/_/-/g when constructing a WARC::Fields object. The exact match rules used when setting values on an existing object do not apply here.
Field names given when constructing a WARC::Fields object are otherwise stored exactly as given, with case preserved, even when other names that fold to the same string have been given earlier in the argument list.
- $f->clone
-
Copy a WARC::Fields object. A copy of a read-only object is writable.
- $f->field( $name )
- $f->field( $name => $value )
- $f->field( $n1 => $v1, $n2 => $v2, ... )
-
Get or set the value of one or more fields. The field name is not case sensitive, but WARC::Fields will preserve its case if a new entry is created.
Setting a field to undef effectively deletes that field, although it remains visible in the tied array interface and will retain its position if a new value is assigned. Setting a field to an empty array reference removes that field entirely.
- $f = WARC::Fields->parse( $text )
- $f = WARC::Fields->parse( from => $fh )
- $f = parse WARC::Fields from => $fh
-
Construct a new WARC::Fields object, reading initial contents from the provided text string or filehandle.
The parse method throws an exception if it encounters input that it does not understand.
If the parse method encounters a field name with a leading ':', which implies an empty name and is not allowed, the leading ':' is silently dropped from the line and parsing retried. If the line is not valid after this change, the parse method throws an exception. This feature is in keeping with the general principle of "be liberal in what you accept" and is a preemptive workaround for a predicted bug in other implementations.
- $f->as_block
- $f->as_string
-
Return the contents as a formatted WARC header or application/warc-fields block. The as_block method uses network line endings and UTF-8 as specified for the WARC format, while the as_string method uses the local line endings and does not perform encoding.
- $f->set_readonly
-
Mark a WARC::Fields object read-only. All methods that modify the object will croak() if called on a read-only object.
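The difference between as_block and as_string described above comes down to line endings and encoding. The following sketch shows that distinction over a plain list of [name, value] rows. format_rows is a hypothetical function, not the module's code, and whether the real as_block output carries any additional terminator is not covered here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical formatter over [name, value] rows, not the module's code.
sub format_rows {
    my ($rows, $eol) = @_;
    return join '', map { "$_->[0]: $_->[1]$eol" } @$rows;
}

my @rows = ( [ 'WARC-Type' => 'metadata' ], [ 'X-Note' => "caf\x{e9}" ] );

# Characters with local line endings, suitable for display.
my $string = format_rows(\@rows, "\n");

# Octets with CRLF line endings and UTF-8 encoding, suitable for a file.
my $block = encode('UTF-8', format_rows(\@rows, "\r\n"));

printf "%d characters, %d octets\n", length $string, length $block;
```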
Tied Array Access
The order of fields can be fully controlled by tying an array to a WARC::Fields object and manipulating the array using ordinary Perl operations. The splice and sort functions are likely to be useful for reordering array elements if desired.
WARC::Fields will croak() if an attempt is made to set a field name with a leading ':' using the tied array interface.
The tied array interface accepts simple string values but returns objects with additional information. The returned object has an overloaded string conversion that yields the name for that entry but additionally has value and offset methods.
An entry object is bound to a slot in its parent WARC::Fields object, but will be copied if it is assigned to another slot in the same or another WARC::Fields object.
Due to complex aliasing rules necessary for array slice assignment to work for permuting rows in the table, entry objects must be short-lived. Storing the object read from a tied array and attempting to use it after modifying its parent WARC::Fields object produces unspecified results.
- $entry = $array[$n]
- $entry = $f->[$n]
-
The tied array FETCH method returns an "entry object" instead of the name itself.
- $name = "$entry"
- $name = $entry->name
- $name = "$f->[$n]"
- $name = $f->[$n]->name
-
The name method on an entry object returns the field name. String conversion is overloaded to call this method.
- $value = $entry->value
- $value = $array[$n]->value
- $value = $f->[$n]->value
- $entry->value( $new_value )
- $array[$n]->value( $new_value )
- $f->[$n]->value( $new_value )
-
The value method on an entry object returns the field value for this particular entry. Only a single scalar is returned, even if multiple entries share the same name.
If given an argument, the value method replaces the value for this particular entry. The argument will be coerced to a string.
- $offset = $entry->offset
- $offset = $array[$n]->offset
- $offset = $f->[$n]->offset
-
The offset method on an entry object returns the position of this entry amongst multiple entries with the same field name. These positions are numbered from zero and are identical to the positions in the array reference returned for this entry's field name from the field method or the tied hash interface.
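The entry-object behaviour described above (string conversion yielding the name, plus value and offset methods) can be illustrated with a small standalone tied array. Sketch::Entry and Sketch::Array are hypothetical stand-in classes and do not reflect how WARC::Fields is implemented internally. Comparing names without regard to case when computing the offset is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the entry objects returned by the tied array.
package Sketch::Entry;
use overload '""' => sub { $_[0]->name }, fallback => 1;

sub new {
    my ($class, $rows, $n) = @_;
    return bless { rows => $rows, n => $n }, $class;
}

sub name { my $self = shift; return $self->{rows}[ $self->{n} ][0] }

sub value {
    my $self = shift;
    my $row  = $self->{rows}[ $self->{n} ];
    $row->[1] = "$_[0]" if @_;    # coerce a new value to a string
    return $row->[1];
}

sub offset {
    my $self = shift;
    my $name = lc $self->name;
    # Count the earlier rows that share this (case-folded) name.
    return scalar( grep { lc($self->{rows}[$_][0]) eq $name } 0 .. $self->{n} - 1 );
}

# Hypothetical read-only tied array over a table of [name, value] rows.
package Sketch::Array;
sub TIEARRAY  { my ($class, $rows) = @_; return bless { rows => $rows }, $class }
sub FETCHSIZE { return scalar @{ $_[0]{rows} } }
sub FETCH     { my ($self, $n) = @_; return Sketch::Entry->new($self->{rows}, $n) }

package main;

my @rows = (
    [ 'WARC-Type'          => 'metadata' ],
    [ 'WARC-Concurrent-To' => '<urn:uuid:a>' ],
    [ 'WARC-Concurrent-To' => '<urn:uuid:b>' ],
);
tie my @names, 'Sketch::Array', \@rows;

my $entry = $names[2];
print "$entry => ", $entry->value, " (offset ", $entry->offset, ")\n";
```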
Tied Hash Access
The contents of a WARC::Fields object can be easily examined by tying a hash to the object. Reading or setting a hash key is equivalent to the field method, but the tied hash will iterate keys and values in the order in which each key first appears in the internal table.
Like the tied array interface, the tied hash interface returns magical objects that internally refer back to the parent WARC::Fields object. These objects remain valid if the underlying WARC::Fields object is changed, but further use may produce surprising and unspecified results.
The use of magical objects enables the values in a tied hash to always be arrays, even for keys that do not exist (the array will have zero elements) or that have only one value (the array will have a string conversion that produces that one value). This allows a tied hash to support autovivification of an array value just as Perl's own hashes do.
Overloaded Dereference Operators
The WARC::Fields class provides overloaded dereference operators for array and hash dereferencing. The overloaded operators provide an anonymous tied array or hash as needed, allowing the object itself to be used as a reference to its tied array and hash interfaces. There is a caveat, however, so read on.
Reference Count Trickery with Overloaded Dereference Operators
To avoid problems, the underlying tied object is a reference to the parent object. For ordinary use of tie, this is a strong reference. However, the anonymous tied array and hash are cached in the object to avoid having to tie a new object every time the dereference operators are used.
To prevent memory leaks due to circular references, the overloaded dereference operators tie a weak reference to the parent object. The tied aggregate always holds a strong reference to its object, but when the dereference operators are used, that inner object is a weak reference to the actual WARC::Fields object.
The caveat is thus: do not attempt to save a reference to the array or hash produced by dereferencing a WARC::Fields object. The parent WARC::Fields object must remain in scope for as long as any anonymous tied aggregates exist.
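The reason for this caveat can be seen in a small standalone sketch of the weak back-reference pattern, using weaken from Scalar::Util. Sketch::Parent and Sketch::View are hypothetical classes that only mimic the arrangement described above; they are not the module's own code.

```perl
use strict;
use warnings;
use Scalar::Util ();

# Hypothetical view object holding a back-reference to its parent.
package Sketch::View;
sub new {
    my ($class, $parent, %opt) = @_;
    my $self = bless { parent => $parent }, $class;
    Scalar::Util::weaken($self->{parent}) if $opt{weak};
    return $self;
}

# Hypothetical parent that caches its view, as the text describes.
package Sketch::Parent;
sub new { return bless {}, shift }
sub view {
    my $self = shift;
    # A strong back-reference from the cached view would form a cycle,
    # so the cached view holds a weak one instead.
    return $self->{view} ||= Sketch::View->new($self, weak => 1);
}

package main;

my $parent = Sketch::Parent->new;
my $view   = $parent->view;    # saved beyond the parent's lifetime
undef $parent;                 # parent is freed; no cycle keeps it alive

print defined($view->{parent}) ? "parent alive\n" : "parent gone\n";
```

Once the last strong reference to the parent is dropped, the saved view is left with an undefined back-reference, which is why the parent object must outlive anything obtained by dereferencing it.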
CAVEATS
Do not save references to the anonymous tied aggregates returned by dereferencing a WARC::Fields object.
Do not save references to the entries read from tied aggregates unless the WARC::Fields object is read-only. Modifications may or may not be reflected in previously constructed entry objects and hash value arrays and the exact behavior may change without warning or notice.
AUTHOR
Jacob Bachmeyer, <jcb@cpan.org>
SEE ALSO
WARC, HTTP::Headers, Scalar::Util for weaken
COPYRIGHT AND LICENSE
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.