NAME
File::ANVL - routines to support A Name Value Language
SYNOPSIS
use File::ANVL; # to import routines into a Perl script
getlines( # read from $filehandle (defaults to *ARGV) up to
$filehandle # blank line; returns record read or undef on EOF;
); # record may be all whitespace (almost EOF)
trimlines( # strip initial whitespace from record, often just
$record, # returned by getlines(), and return remainder;
$r_wslines, # optional ref to line count in trimmed whitespace
$r_rrlines ); # optional ref to line count of real record lines
anvl_recarray( # split $record into array of lineno-name-value
$record, # triples, first triple being <anvl, beta, "">
$r_elems, # reference to returned array
$lineno, # starting line number (default 1)
$opts ); # options/default, eg, comments/0, autoindent/1
erc_anvl_expand_array(# change short ERC ANVL array to long form ERC
$r_elems ); # reference to array to modify in place
anvl_valsplit( # split ANVL value into an array of subvalues
$value, # input value; arg 2 is reference to returned
$r_svals ); # array of arrays of returned values
anvl_rechash( # split ANVL record into hash of elements
$record, # input record; arg 2 is reference to returned
$r_hash, # hash; a value is scalar, or array of scalars
$strict ); # if more than one element shares its name
anvl_decode( $str ); # decode ANVL-style %xy chars in string
anvl_name_naturalize( # convert name from sort-friendly to natural
$name ); # word order using ANVL inversion points
anvl_om( # read and process records from *ARGV
$om, # a File::OM formatting object
{ # a hash reference to various options
autoindent => 0, # don't (default do) correct sloppy indention
comments => 1, # do (default don't) preserve input comments
verbose => 1, # output record and line numbers (default don't)
... } ); # other options listed later
anvl_opt_defaults(); # return hash reference with factory defaults
*DEPRECATED*
anvl_recsplit( # split record into array of name-value pairs;
$record, # input record; arg 2 is reference to returned
$r_elems, # array; optional arg 3 (default 0) requires
$strict ); # properly indented continuation lines
anvl_encode( $str ); # ANVL-encode string
*REPLACED*
# instead of anvl_fmt use File::OM::ANVL object's 'elems' method
$elem = anvl_fmt( # format ANVL element, wrapping to 72 columns
$name, # $name is what goes to left of colon (:)
$value, # $value is what goes to right of colon
... ); # more name/value pairs may follow
DESCRIPTION
This is documentation for the ANVL Perl module, which provides a general framework for data represented in the ANVL format. ANVL (A Name Value Language) represents elements in a label-colon-value format similar to email headers. Specific conversions, based on an "output multiplexer" File::OM, are possible to XML, Turtle, JSON, and Plain unlabeled text.
The OM package can also be used to build records from scratch in ANVL or the other formats. Below is an example of how to create a particular kind of ANVL record known as an ERC (which uses Dublin Kernel metadata). For the formats ANVL, Plain, and XML, the returned text string by default is wrapped to 72 columns.
use File::OM;
my $om = File::OM->new("ANVL");
$anvl_record = $om->elems(
"erc", "",
"who", $creator,
"what", $title,
"when", $date,
"where", $identifier)
. "\n"; # 2nd newline in a row terminates ANVL record
The getlines() function reads from $filehandle up to a blank line and returns the lines read. This is a general function for reading "paragraphs", which is useful for reading ANVL records. If unspecified, $filehandle defaults to *ARGV, which makes it easy to take input from successive file arguments specified on the command line (or from STDIN if none) of the calling program.
For convenience, trimlines() is often used to process the record just returned by getlines(). It strips leading whitespace, optionally counts lines, and returns undef if the passed record is undefined or contains only whitespace, both being equivalent to end-of-file (EOF).
These functions treat whitespace specially. Input is read up until at least one non-whitespace character and a blank line (two newlines in a row) or EOF is reached. If EOF is reached and the record would contain only whitespace, undef is returned. Input line counts for preliminary trimmed whitespace ($wslines) and real record lines ($rrlines) can be returned through optional scalar references given to trimlines(). These functions work together to give the caller access to all inputs, accurate line counts, and a familiar "loop until EOF" paradigm, as in
while (defined trimlines(getlines(), \$wslcount, \$rrlcount)) ...
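The division of labor between the two functions can be illustrated with a small stand-alone sketch. It does not use File::ANVL; read_paragraph() and trim_record() are hypothetical stand-ins written only from the description above, and the choice to count the terminating blank line among the record lines is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for getlines(): read up to a blank line or EOF,
# but keep reading until at least one non-whitespace character is seen.
sub read_paragraph {
    my ($fh) = @_;
    my $rec = '';
    while (defined(my $line = <$fh>)) {
        $rec .= $line;
        # stop at a blank line once the record holds real content
        last if $line =~ /^\s*$/ && $rec =~ /\S/;
    }
    # explicit undef so the value keeps its slot in an argument list
    return length($rec) ? $rec : undef;
}

# Hypothetical stand-in for trimlines(): strip leading whitespace,
# report line counts, and return undef for an all-whitespace record.
sub trim_record {
    my ($rec, $r_ws, $r_rr) = @_;
    return undef unless defined($rec) && $rec =~ /\S/;
    $rec =~ s/^(\s*)//;                      # remove leading whitespace
    my $lead = $1;
    $$r_ws = ($lead =~ tr/\n//) if $r_ws;    # lines of trimmed whitespace
    $$r_rr = ($rec  =~ tr/\n//) if $r_rr;    # lines left (assumption: the
                                             # terminating blank line counts)
    return $rec;
}

my $input = "\n\na: 1\nb: 2\n\nc: 3\n\n   \n";
open(my $fh, '<', \$input) or die "cannot open in-memory file: $!";

my @records;
my ($ws, $rr);
while (defined(my $rec = trim_record(read_paragraph($fh), \$ws, \$rr))) {
    push @records, [ $rec, $ws, $rr ];
    print "record of $rr line(s) after $ws whitespace line(s)\n";
}
close $fh;
```

The trailing whitespace-only input ends the loop just as a true EOF would, which is the "almost EOF" case the text describes.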
The anvl_recarray() function splits an ANVL record into elements, returning them via the array reference given as the second argument. Each returned element is a triple consisting of line number, name, and value. An optional third argument gives the starting line number (default 1). An optional fourth argument is a reference to a hash containing options; the argument { comments => 1, autoindent => 0 } will cause comments to be kept (stripped by default) and recoverable indention errors to be flagged as errors (corrected to continuation lines by default). This function returns the empty string on success, or a message beginning "warning: ..." or "error: ...".
The first triple of the returned array is special in that it describes the origin of the record; its elements are
INDEX   NAME        VALUE
  0     format      original format ("ANVL", "JSON", "XML", etc)
  1     <unused>
  2     <unused>
The remaining triples are free form except that the values will have been drawn from the original format and possibly decoded. The first item ("lineno") in each remaining triple is a number followed by a marker character, such as "34:" or "6#". The number indicates the line number (or octet offset, depending on the origin format) of the start of the element. The marker is either ':' to indicate a real element or '#' to indicate a comment; if the latter, the element name has no defined meaning and the comment is contained in the value. Here's example code that reads a 3-element record and reformats it.
($msg = File::ANVL::anvl_recarray('
a: b c
d: e
  f
g:
  h i
', \@elems)) and die "anvl_recarray: $msg"; # report what went wrong
# the first triple (indices 0-2) describes the record; names start at 4
for ($i = 4; $i < $#elems; $i += 3)
    { print "[$elems[$i] <- $elems[$i+1]] "; }
which prints
[a <- b c] [d <- e f] [g <- h i]
erc_anvl_expand_array() inspects and possibly modifies in place the kind of element array resulting from a call to anvl_recarray(). It returns the empty string on success, otherwise an error message. This routine is useful for transforming a short form ERC ANVL record into long form, for example, expanding
erc: a | b | c | d
into the equivalent,
erc:
who: a
what: b
when: c
where: d
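The short-to-long mapping can be sketched without the module. The helper expand_short_erc() below is hypothetical and works on plain name/value pairs rather than on the triples array that erc_anvl_expand_array() modifies; it only illustrates how the four '|'-separated positions line up with who, what, when, and where.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of File::ANVL: turn the value of a
# short-form "erc" element into long-form name/value pairs.
sub expand_short_erc {
    my ($value) = @_;
    my @labels = qw(who what when where);
    # split on '|' and trim whitespace around each part
    my @parts = map { my $p = $_; $p =~ s/^\s+|\s+$//g; $p }
                split /\|/, $value;
    my @pairs = (erc => '');             # long form starts with empty "erc"
    for my $i (0 .. $#labels) {
        last if $i > $#parts;            # fewer than four parts given
        push @pairs, $labels[$i], $parts[$i];
    }
    return @pairs;
}

my @pairs = expand_short_erc('a | b | c | d');
while (my ($name, $value) = splice(@pairs, 0, 2)) {
    print "$name:", (length $value ? " $value" : ""), "\n";
}
```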
The anvl_valsplit() function splits an ANVL value into sub-values (svals) and repeated values (rvals), returning them as an array of arrays via the array reference given as the second argument. The top-level of the array represents svals and the next level represents rvals. This function returns the empty string on success, or a message beginning "warning: ..." or "error: ...".
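A sketch of the resulting array-of-arrays shape follows. It assumes, as in ERC-style ANVL values, that '|' separates sub-values and ';' separates repeated values; the text above does not name the separators, so treat both as assumptions. The function split_value() is hypothetical and is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical sketch: the top level holds sub-values (split on '|'),
# the next level holds repeated values (split on ';').
# Both separators are assumptions, not taken from the text above.
sub split_value {
    my ($value) = @_;
    my @svals;
    for my $sval (split /\|/, $value) {
        my @rvals = map { my $r = $_; $r =~ s/^\s+|\s+$//g; $r }
                    split /;/, $sval;
        push @svals, \@rvals;
    }
    return \@svals;
}

my $svals = split_value('Smith, J; Wong, D | editors');
for my $i (0 .. $#$svals) {
    print "sval $i: ", join(' / ', @{ $svals->[$i] }), "\n";
}
```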
The anvl_rechash() function splits an ANVL record into elements, returning them via the hash reference given as the second argument. A hash key is defined for each element name found. Under that key is stored the corresponding element value, or an array of values if more than one occurrence of the element name was encountered. This function returns the empty string on success, or a message beginning "warning: ..." or "error: ...".
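The "scalar, or array of scalars" convention means that callers must check each value with ref() before using it. The stand-alone sketch below shows one way such a hash could be built and read; pairs_to_hash() is a hypothetical helper working on plain name/value pairs, not the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper: store a scalar for a name seen once and
# promote it to an array reference when the name repeats.
sub pairs_to_hash {
    my (@pairs) = @_;
    my %h;
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        if (!exists $h{$name}) {
            $h{$name} = $value;                  # first occurrence
        }
        elsif (ref $h{$name} eq 'ARRAY') {
            push @{ $h{$name} }, $value;         # third or later
        }
        else {
            $h{$name} = [ $h{$name}, $value ];   # second occurrence
        }
    }
    return \%h;
}

my $rec = pairs_to_hash(who => 'Smith', what => 'Title', who => 'Wong');
for my $name (sort keys %$rec) {
    my $v = $rec->{$name};
    print "$name: ", (ref $v ? join('; ', @$v) : $v), "\n";
}
```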
The anvl_decode() function takes an ANVL-encoded string and returns it after converting encoded characters to the standard representation (e.g., %vb becomes `|'). Some decoding, such as for the expansion block below,
print anvl_decode('http://example.org/node%{
? db = foo
& start = 1
& end = 5
& buf = 2
& query = foo + bar + zaf
%}');
will affect an entire region. This code prints
http://example.org/node?db=foo&start=1&end=5&buf=2&query=foo+bar+zaf
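The region-wide effect of an expansion block can be reproduced with a short stand-alone sketch. The function decode_sketch() is hypothetical and covers only the two behaviors mentioned above (whitespace removal between %{ and %}, and %vb); the real anvl_decode() handles more encodings than are shown here.

```perl
use strict;
use warnings;

# Hypothetical sketch of two decoding steps described above.
sub decode_sketch {
    my ($s) = @_;
    # expansion block: drop all whitespace between %{ and %}
    $s =~ s/%\{(.*?)%\}/ my $in = $1; $in =~ s{\s+}{}g; $in /seg;
    # single encoded character; only %vb is taken from the text above
    $s =~ s/%vb/|/g;
    return $s;
}

print decode_sketch('http://example.org/node%{
? db = foo
& start = 1
& end = 5
& buf = 2
& query = foo + bar + zaf
%}'), "\n";
```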
The anvl_name_naturalize() function takes an ANVL string (aval) and returns it after inversion at any designated inversion points. The input string will be returned if it does not end in a comma (`,'). For example, "Pat Smith" is returned by the call,
anvl_name_naturalize("Smith, Pat,");
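A simplified stand-alone sketch of the inversion idea follows. The function naturalize_sketch() is hypothetical: it treats every comma as an inversion point and reverses the pieces, which reproduces the example above, but how the real function handles names with several inversion points may differ.

```perl
use strict;
use warnings;

# Hypothetical sketch: invert only when the name ends in a comma.
sub naturalize_sketch {
    my ($name) = @_;
    # without a trailing comma the input is returned unchanged
    return $name unless $name =~ s/\s*,\s*$//;
    # remaining commas are treated as inversion points (assumption)
    return join ' ', reverse split /\s*,\s*/, $name;
}

print naturalize_sketch('Smith, Pat,'), "\n";
print naturalize_sketch('Pat Smith'), "\n";
```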
The anvl_om() routine takes a formatting object created by a call to File::OM->new($format), reads a stream of ANVL records, processes each element, and calls format-specific methods to build the output. The behavior of those methods is typically set by passing command line options in at object creation time.
use File::ANVL;
use File::OM;
my $fmt = $opt{format};                    # from command line
my $om = File::OM->new($fmt,
    { comments => $opt{comments} }) or     # from command line
        die "unknown format $fmt";
Options control various aspects of reading ANVL input records. The 'autoindent' option (default on) causes the parser to recover if it can when continuation lines are not properly indented. The 'comments' option (default off) causes input comments to be preserved in the output, format permitting. The 'verbose' option inserts record and line numbers in comments. Pseudo-comments will be created for formats that don't natively define comments (JSON, Plain).
Like the individual OM methods, anvl_om() returns the built string by default, or the return status of print using the file handle supplied as the 'outhandle' option (normally set to '') at object creation time, for example,
{ outhandle => *STDOUT }
The way anvl_om() works is roughly as follows.
$om->ostream(); # open stream
... { # loop over all records, eg, $recnum++
$anvlrec = trimlines(getlines());
last unless $anvlrec;
$err = anvl_recarray($anvlrec, $$o{elemsref}, $startline, $opts);
$err and return "anvl_recarray: $err";
...
$om->orec($anvlrec, $recnum, $startline); # open record
...... { # loop over all elements, eg, $elemnum++
$om->elem($name, $value, $elemnum, $lineno); # do element
...... }
$om->crec($recnum); # close record
... }
$om->cstream(); # close stream
DEPRECATED: The anvl_recsplit() function splits an ANVL record into elements, returning them via the array reference given as the second argument. Each returned element is a pair of items: a name and a value. An optional third argument, if true (default 0), rejects unindented continuation lines, a common formatting mistake. This function returns the empty string on success, or a message beginning "warning: ..." or "error: ...". Here's an example that extracts and uses the first returned element.
($msg = anvl_recsplit($record, $elemsref))
    and die "anvl_recsplit: $msg"; # report what went wrong
# the array holds name-value pairs, so halve its length
print scalar(@$elemsref) / 2, " elements found\n",
    "First element label is $$elemsref[0]\n",
    "First element value is $$elemsref[1]\n";
SEE ALSO
A Name Value Language (ANVL) http://www.cdlib.org/inside/diglib/ark/anvlspec.pdf
A Metadata Kernel for Electronic Permanence (PDF) http://journals.tdl.org/jodi/article/view/43
HISTORY
This is a beta version of ANVL tools. It is written in Perl.
AUTHOR
John A. Kunze jak at ucop dot edu
COPYRIGHT AND LICENSE
Copyright 2009-2010 UC Regents. Open source BSD license.
PREREQUISITES
Perl Modules: File::OM
Script Categories:
UNIX : System_administration