NAME
PICA::Parser - Parse PICA+ data
SYNOPSIS
use PICA::Parser;
PICA::Parser->parsefile( $filename_or_handle ,
Field => \&field_handler,
Record => \&record_handler
);
PICA::Parser->parsedata( $string_or_function ,
Field => \&field_handler,
Record => \&record_handler
);
$parser = PICA::Parser->new(
Record => \&record_handler,
Proceed => 1
);
$parser->parsefile( $filename );
$parser->parsedata( $picadata );
print $parser->counter() . " records read.\n";
You can also export parsedata and parsefile:
use PICA::Parser qw(parsefile);
parsefile( $filename, Record => sub {
my $record = shift;
print $record->to_string() . "\n";
});
Both functions return the parser, so you can use constructs like
my @records = parsefile($filename)->records();
DESCRIPTION
This module can be used to parse normalized PICA+ and PICA+ XML. The concrete parsers are implemented in PICA::PlainParser and PICA::XMLParser.
CONSTRUCTOR
new ( [ %params ] )
Creates a parser to store common parameters (see below). These parameters will be used as defaults when calling parsefile or parsedata. Note that you do not have to use the constructor to use PICA::Parser. These two methods do the same:
my $parser = PICA::Parser->new( %params );
$parser->parsefile( $file );
PICA::Parser->parsefile( $file, %params );
Common parameters that are passed to the specific parser are:
- Field
-
Reference to a handler function for parsed PICA+ fields. The function is passed a PICA::Field object and it should return it back to the parser. You can use this function as a simple filter by returning a modified field. If no PICA::Field object is returned then it will be skipped.
- Record
-
Reference to a handler function for parsed PICA+ records. The function is passed a PICA::Record. If the function returns a record then this record will be stored in an array that is passed to Collection. You can use this method as a filter by returning a modified record.
- Error
-
This handler is used if an error occurred while parsing, for instance if data does not look like PICA+. By default errors are just ignored.
TODO: Count errors and return the number of errors in the errors method.
- Dumpformat
-
If set to true, parse dumpformat (no newlines).
- Strict
-
Stop on errors. By default a parser just omits records that could not be parsed (default is false). Up to now, strict mode is only available in PICA::PlainParser!
- EmptyRecords
-
Skip empty records so they will not be passed to the record handler (default is false). Empty records easily occur, for instance, if your field handler does not return anything. This is useful for performance, but you should not forget to set the EmptyRecords parameter. In every case empty records are counted with a special counter that can be read with the empty method. The normal counter (method counter) counts all records, no matter if empty or not.
- Proceed
-
By default the internal counters are reset with each call of parsefile and parsedata. If you set the Proceed parameter to a true value, the same parser will be reused without resetting.
METHODS
parsefile ( $filename-or-handle [, %params ] )
Parses PICA+ data from a file, specified by a filename or filehandle. The default parser is PICA::PlainParser. If the filename extension is .xml or .xml.gz, or the 'Format' parameter is set to 'xml', then PICA::XMLParser is used instead.
PICA::Parser->parsefile( "data.picaplus", Field => \&field_handler );
PICA::Parser->parsefile( \*STDIN, Field => \&field_handler, Format => 'xml' );
See the constructor new for a description of parameters. The Proceed parameter is ignored.
You cannot parse a file named "PICA::Parser", by the way.
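The parser selection rule described for parsefile can be sketched as a small standalone function. The helper name choose_parser_class is hypothetical and not part of PICA::Parser, and the case-insensitive comparison of the Format value is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PICA::Parser: returns the name of the
# parser class that the documented rule would select.
sub choose_parser_class {
    my ( $file, %params ) = @_;

    # Assumption: the Format value is compared case-insensitively.
    my $format = defined $params{Format} ? lc $params{Format} : '';
    return 'PICA::XMLParser' if $format eq 'xml';

    # Only plain filenames have an extension to inspect; handles do not.
    return 'PICA::XMLParser'
        if !ref($file) && $file =~ /\.xml(?:\.gz)?$/;

    return 'PICA::PlainParser';
}

print choose_parser_class('data.picaplus'), "\n";
print choose_parser_class('dump.xml.gz'), "\n";
print choose_parser_class( \*STDIN, Format => 'xml' ), "\n";
```

Checking the Format parameter first means that a filehandle, which carries no extension, can still be routed to the XML parser.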
parsedata ( $data [, %params ] )
Parses data from a string, array reference, or function. See parsefile and the parsedata method of PICA::PlainParser and PICA::XMLParser for a description of parameters.
By default PICA::PlainParser is used unless the 'Format' parameter is set to 'xml':
PICA::Parser->parsedata( $picastring, Field => \&field_handler );
PICA::Parser->parsedata( \@picalines, Field => \&field_handler );
If data is a PICA::Record object, it is directly passed to the record handler without re-parsing. See the constructor new for a description of parameters. The Proceed parameter is ignored.
records
Get an array of the read records (as returned by the record handler which can thus be used as a filter). If no record handler was specified, records will be collected unmodified. For large record sets it is recommended not to collect the records but directly use them with a record handler.
counter
Get the number of read records so far.
empty
Get the number of empty records that have been read so far. Empty records are counted but not passed to the record handler unless you specify the EmptyRecords parameter. The number of non-empty records is the difference between counter and empty.
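The difference between the two counters can be wrapped in a small helper. This is a minimal sketch: non_empty_count and the StubParser package are hypothetical and exist only to make the example runnable without real data. With a real parser you would pass the object returned by new, parsefile or parsedata.

```perl
use strict;
use warnings;

# Works with any object offering counter() and empty(), as documented above.
sub non_empty_count {
    my ($parser) = @_;
    return $parser->counter() - $parser->empty();
}

# Hypothetical stand-in for a parser, used only to make the sketch runnable.
{
    package StubParser;
    sub new     { my ( $class, %c ) = @_; return bless {%c}, $class }
    sub counter { return $_[0]{counter} }
    sub empty   { return $_[0]{empty} }
}

my $stub = StubParser->new( counter => 10, empty => 3 );
print non_empty_count($stub), " non-empty records\n";
```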
INTERNAL METHODS
_getparser ( [ %params] )
Internal method to get a new parser or the internal parser of this object. By default, gives a PICA::PlainParser unless you specify the Format parameter. Single parameters override the default parameters specified at the constructor (except the Proceed parameter).
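The override rule can be sketched as a hash merge. The helper effective_params is hypothetical and is not the module's code. It assumes that a Proceed value passed per call is simply dropped, so only the constructor's Proceed setting survives.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: combine constructor defaults
# with per-call parameters.
sub effective_params {
    my ( $defaults, %call ) = @_;
    delete $call{Proceed};           # assumption: a per-call Proceed is dropped
    return ( %$defaults, %call );    # later keys win, so call params override
}

my %defaults = ( Format => 'xml', Strict => 0, Proceed => 1 );
my %merged   = effective_params( \%defaults, Strict => 1, Proceed => 0 );

print join( ", ", map { "$_=$merged{$_}" } sort keys %merged ), "\n";
```

In this run Strict is overridden by the call, while Proceed keeps the value given to the constructor.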
TODO
Support multiple handlers per record? Better logging needs to be added, for instance a status message every n records. This may be implemented with multiple handlers per record (maybe piped). Handling of broken records should also be improved.
AUTHOR
Jakob Voss <jakob.voss@gbv.de>
LICENSE
Copyright (C) 2007, 2009 by Verbundzentrale Goettingen (VZG) and Jakob Voss
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.