Net::OAI::Harvester - A package for harvesting metadata using OAI-PMH
    use Net::OAI::Harvester;

    ## create a harvester for the Library of Congress
    ## (set baseURL to the repository's OAI-PMH base URL)
    my $harvester = Net::OAI::Harvester->new(
        baseURL => ''
    );

    ## find out the name for a repository
    my $identity = $harvester->identify();
    print "name: ", $identity->repositoryName(), "\n";

    ## get a list of identifiers
    my $identifiers = $harvester->listIdentifiers(
        'metadataPrefix' => 'oai_dc'
    );
    while ( my $header = $identifiers->next() ) {
        print "identifier: ", $header->identifier(), "\n";
    }

    ## list all the records in a repository
    my $records = $harvester->listRecords(
        'metadataPrefix' => 'oai_dc'
    );
    while ( my $record = $records->next() ) {
        my $header   = $record->header();
        my $metadata = $record->metadata();
        print "identifier: ", $header->identifier(), "\n";
        print "title: ", $metadata->title(), "\n";
    }

    ## GetRecord, ListSets, ListMetadataFormats also supported
Net::OAI::Harvester is a Perl extension for easily querying OAI-PMH repositories. OAI-PMH is the Open Archives Initiative Protocol for Metadata Harvesting, which allows data repositories to share metadata about their digital assets. Net::OAI::Harvester is an OAI-PMH client, so it does for OAI-PMH what LWP::UserAgent does for HTTP.
You create a Net::OAI::Harvester object which you can then use to retrieve metadata from a selected repository. Net::OAI::Harvester tries to keep things simple by providing an API to get at the data you want; but it also has a framework which is easy to extend should you need to get more complicated.
The guiding principle behind OAI-PMH is to allow metadata about online resources to be shared by data providers, so that the metadata can be harvested by interested parties. The protocol is essentially XML over HTTP (much like XMLRPC or SOAP). Net::OAI::Harvester does XML parsing for you (using XML::SAX internally), but you can get at the raw XML if you want to do your own XML processing, and you can drop in your own XML::SAX handler if you would like to do your own parsing of metadata elements.
An OAI-PMH repository supports six verbs: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, and ListSets. The verbs translate directly into methods that you can call on a Net::OAI::Harvester object. More details about these methods are supplied below; for the real story, however, please consult the OAI-PMH specification.
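Since the protocol is XML over HTTP, each verb ends up as an HTTP request carrying a verb argument plus a few key=value pairs. The following is a minimal sketch, using only core Perl, of how such a request URL could be put together by hand. The helper name oai_request_url, the example.org base URL, and the set name are hypothetical and not part of Net::OAI::Harvester, which builds its requests for you; the simple percent-encoding assumes plain ASCII argument values.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Net::OAI::Harvester): build the GET URL
# for one OAI-PMH request from a base URL, a verb, and its arguments.
sub oai_request_url {
    my ( $base_url, $verb, %args ) = @_;
    my @pairs = ("verb=$verb");
    for my $name ( sort keys %args ) {
        # Percent-encode everything outside the URL-safe character set.
        ( my $value = $args{$name} ) =~
            s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
        push @pairs, "$name=$value";
    }
    return $base_url . '?' . join( '&', @pairs );
}

# The base URL below is a placeholder, not a real repository.
my $base = 'http://example.org/oai';

# prints http://example.org/oai?verb=Identify
print oai_request_url( $base, 'Identify' ), "\n";

print oai_request_url(
    $base, 'ListRecords',
    metadataPrefix => 'oai_dc',
    set            => 'maps:old',
), "\n";
```

Arguments are sorted only to make the output predictable; the protocol does not care about their order.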
Net::OAI::Harvester has a few features that are worth mentioning:
Since OAI-PMH results can be arbitrarily large, a stream-based (XML::SAX) parser is used. As the document is parsed, the corresponding Perl objects (records, headers, etc.) are created and then serialized to disk (as YAML, if you are curious). The serialized objects can then be iterated over one at a time. The benefit is a lower memory footprint when, for example, a ListRecords verb is exercised on a repository that returns 100,000 records.
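To make the spool-to-disk idea concrete, here is a rough sketch of the general technique rather than the module's actual code. It swaps the YAML serialization for JSON::PP so that it runs with core modules only, and the names spool_records and next_record, as well as the sample records, are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use JSON::PP qw(encode_json decode_json);

# Hypothetical helper: write each record to an anonymous temporary file,
# one JSON document per line, and return the filehandle rewound to the start.
sub spool_records {
    my @records = @_;
    my $fh = tempfile();    # scalar context: filehandle only, file cleaned up for us
    print {$fh} encode_json($_), "\n" for @records;
    seek( $fh, 0, 0 ) or die "cannot rewind spool: $!";
    return $fh;
}

# Hypothetical iterator: decode and return the next record, or undef
# once the spool is exhausted. Only one record is held in memory at a time.
sub next_record {
    my ($fh) = @_;
    my $line = <$fh>;
    return defined $line ? decode_json($line) : undef;
}

# Sample data standing in for parsed OAI-PMH records.
my $spool = spool_records(
    { identifier => 'oai:example:1', title => 'First record' },
    { identifier => 'oai:example:2', title => 'Second record' },
);

while ( my $record = next_record($spool) ) {
    print "identifier: $record->{identifier}\n";
}
```

In a real harvester the records would be written as the parser emits them instead of being passed in as a list, which is what keeps memory use flat.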
XML::SAX filters are used, which allows interested developers to write their own metadata parsing packages and drop them into place. This is useful because OAI-PMH is itself metadata schema agnostic, so you can use OAI-PMH to distribute all kinds of metadata (Dublin Core, MARC, EAD, or your favorite metadata schema). OAI-PMH does require that a repository provide at least Dublin Core metadata as a baseline. Net::OAI::Harvester has built-in support for unqualified Dublin Core, and has a framework for dropping in your own parser for other kinds of metadata. If you create an XML::Handler that you would like to contribute back to the Net::OAI::Harvester project, please get in touch!
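As an illustration of what such a drop-in parser might look like, the sketch below defines a tiny handler that collects the text of title elements from Perl SAX 2 events. The package name My::TitleHandler is hypothetical, and the events are fed in by hand so the example runs without any non-core modules. A handler meant for real use would normally subclass XML::SAX::Base, and exactly what Net::OAI::Harvester expects of a handler class is not shown here, so check the module's own handlers before relying on this shape.

```perl
use strict;
use warnings;

# Hypothetical handler: remembers the character data of every <title> element.
package My::TitleHandler;

sub new {
    my ($class) = @_;
    return bless { titles => [], in_title => 0, buffer => '' }, $class;
}

# SAX 2 passes each element as a hash reference with a LocalName key.
sub start_element {
    my ( $self, $element ) = @_;
    if ( $element->{LocalName} eq 'title' ) {
        $self->{in_title} = 1;
        $self->{buffer}   = '';
    }
}

# Character data may arrive in several chunks, so accumulate it.
sub characters {
    my ( $self, $chars ) = @_;
    $self->{buffer} .= $chars->{Data} if $self->{in_title};
}

sub end_element {
    my ( $self, $element ) = @_;
    if ( $element->{LocalName} eq 'title' ) {
        push @{ $self->{titles} }, $self->{buffer};
        $self->{in_title} = 0;
    }
}

sub titles { return @{ $_[0]{titles} } }

package main;

# Simulate the events a SAX parser would emit for a small metadata record.
my $handler = My::TitleHandler->new();
$handler->start_element( { LocalName => 'creator' } );
$handler->characters( { Data => 'Someone' } );
$handler->end_element( { LocalName => 'creator' } );
$handler->start_element( { LocalName => 'title' } );
$handler->characters( { Data => 'Map of ' } );
$handler->characters( { Data => 'Boston' } );
$handler->end_element( { LocalName => 'title' } );

print "title: $_\n" for $handler->titles();
```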
All the Net::OAI::Harvester methods return other objects. As you would expect, new() returns a Net::OAI::Harvester object; similarly getRecord() returns a Net::OAI::Record object, listIdentifiers() returns a Net::OAI::ListIdentifiers object, identify() returns a Net::OAI::Identify object, and so on. So when you use one of these methods you'll probably want to check out the docs for the object that gets returned so you can see what to do with it.
new() is the constructor; it returns a Net::OAI::Harvester object. You must supply the baseURL parameter to tell Net::OAI::Harvester which data repository you are going to harvest. For a list of data providers, check out the directory available on the Open Archives Initiative homepage.
    my $harvester = Net::OAI::Harvester->new(
        baseURL => ''
    );
identify() is the OAI verb that tells a metadata repository to provide a description of itself. A call to identify() returns a Net::OAI::Identify object, which you can then call methods on to retrieve the information you are interested in. For example:
    my $identity = $harvester->identify();
    print "repository name: ", $identity->repositoryName(), "\n";
    print "protocol version: ", $identity->protocolVersion(), "\n";
    print "earliest date stamp: ", $identity->earliestDatestamp(), "\n";
    print "admin email(s): ", join( ", ", $identity->adminEmail() ), "\n";
For more details see the Net::OAI::Identify documentation.
listMetadataFormats() asks the repository to return a list of the metadata formats that it supports. A call to listMetadataFormats() returns a Net::OAI::ListMetadataFormats object.
    my $list = $harvester->listMetadataFormats();
    print "archive supports metadata prefixes: ",
        join( ',', $list->prefixes() ), "\n";
If you are interested in the metadata formats available for a particular resource identifier then you can pass in that identifier.
    my $list = $harvester->listMetadataFormats( identifier => '1234567' );
    print "record identifier 1234567 can be retrieved as ",
        join( ',', $list->prefixes() ), "\n";
See documentation for Net::OAI::ListMetadataFormats for more details.
getRecord() is used to retrieve a single record from a repository. You must pass in the identifier parameter, and may pass an optional metadataPrefix parameter, to identify the record and the flavor of metadata you would like. Net::OAI::Harvester includes a parser for OAI Dublin Core, so if you do not specify a metadataPrefix, 'oai_dc' will be assumed. If you would like to drop in your own XML::Handler for another type of metadata, use the handler parameter.
    my $record = $harvester->getRecord(
        identifier => 'abc123',
    );

    ## get the Net::OAI::Record::Header object
    my $header = $record->header();

    ## get the metadata object (default will be a Net::OAI::Record::OAI_DC object)
    my $metadata = $record->metadata();
listRecords() allows you to retrieve all the records in a data repository. You must supply the metadataPrefix parameter to tell your Net::OAI::Harvester which type of records you are interested in. listRecords() returns a Net::OAI::ListRecords object. There are four other optional parameters, from, until, set, and resumptionToken, which are better described in the OAI-PMH spec.
    my $records = $harvester->listRecords(
        metadataPrefix => 'oai_dc'
    );

    ## iterate through the results with next()
    while ( my $record = $records->next() ) {
        my $metadata = $record->metadata();
    }
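The from and until parameters take UTC datestamps. The protocol requires every repository to accept day granularity (YYYY-MM-DD), while seconds granularity (YYYY-MM-DDThh:mm:ssZ) is optional and depends on the repository. The sketch below shows one way to prepare such values with core Perl; the helper name oai_datestamp and the seven-day window are illustrative assumptions, not part of Net::OAI::Harvester.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: format an epoch time as an OAI-PMH UTC datestamp.
# Day granularity is the default; pass a true second argument for seconds.
sub oai_datestamp {
    my ( $epoch, $with_time ) = @_;
    my $format = $with_time ? '%Y-%m-%dT%H:%M:%SZ' : '%Y-%m-%d';
    return strftime( $format, gmtime($epoch) );
}

# Options for harvesting everything that changed in the last seven days.
my $now  = time;
my %opts = (
    'metadataPrefix' => 'oai_dc',
    'from'           => oai_datestamp( $now - 7 * 24 * 60 * 60 ),
    'until'          => oai_datestamp($now),
);
print "$_ => $opts{$_}\n" for sort keys %opts;

# With a harvester object in hand, the options would be passed straight through:
# my $records = $harvester->listRecords( %opts );
```

Defaulting to day granularity keeps the request valid against any repository; switch to seconds only after identify() reports that the repository supports it.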
You must handle resumption tokens yourself, but it is fairly easy to do with a loop, and the token() method.
    my $finished = undef;
    my %opts = ( metadataPrefix => 'oai_dc' );
    while ( ! $finished ) {
        my $records = $harvester->listRecords( %opts );
        while ( my $record = $records->next() ) {
            my $metadata = $record->metadata();
        }
        ## ask for the next batch if the repository supplied a token
        my $rToken = $records->token();
        if ( $rToken ) {
            $opts{ resumptionToken } = $rToken->token();
        } else {
            $finished = 1;
        }
    }
listIdentifiers() takes the same parameters that listRecords() takes, but it returns only the record headers, allowing you to quickly retrieve all the record identifiers for a particular repository. The object returned is a Net::OAI::ListIdentifiers object.
    my $identifiers = $harvester->listIdentifiers(
        metadataPrefix => 'oai_dc'
    );

    ## iterate through the results with next()
    while ( my $header = $identifiers->next() ) {
        print "identifier: ", $header->identifier(), "\n";
    }
listSets() takes an optional resumptionToken parameter, and returns a Net::OAI::ListSets object. The set specs it reports can be passed as the set parameter of listRecords() to harvest a subset of a particular repository. For more information see the OAI-PMH spec and the Net::OAI::ListSets docs.
    my $sets = $harvester->listSets();
    foreach ( $sets->setSpecs() ) {
        print "set spec: $_ ; set name: ", $sets->setName( $_ ), "\n";
    }
baseURL() gets or sets the base URL for the repository being harvested.
    $harvester->baseURL( '' );
Or, if you want to know what the current baseURL is:
    my $baseURL = $harvester->baseURL();
userAgent() gets or sets the LWP::UserAgent object being used to perform the HTTP transactions. This method can be useful if you want to change the agent string, the timeout, or some other feature.
More documentation of other classes.
Create common handlers for other metadata formats (MARC, qualified DC, etc).
Selectively load Net::OAI::* classes as needed, rather than getting all of them at once at the beginning of Net::OAI::Harvester.
OAI-PMH Specification at
Ed Summers <>