Museum::Rijksmuseum::Object::Harvester - Bulk-fetching of Rijksmuseum data via the OAI-PMH interface
See Museum::Rijksmuseum::Object
Does a bulk fetch of the Rijksmuseum collection database using the OAI-PMH interface. For each record a callback will be called with the data. Note that the format of this data won't necessarily be the same as returned by the Museum::Rijksmuseum::Object calls, as it's coming from a different endpoint.
use Museum::Rijksmuseum::Object::Harvester;

my $h = Museum::Rijksmuseum::Object::Harvester->new( key => 'abc123xyz' );
my $status = $h->harvest(
    set      => 'subject:PublicDomainImages',
    from     => '2023-01-01',
    type     => 'identifiers',
    callback => \&process_record,
);
if ( $status->{error} ) {
    die "Error: $status->{error}\nLast resumption token: $status->{resumptionToken}\n";
}
if ( $status->{resumptionToken} ) {
    print "Finished, token: $status->{resumptionToken}\n";
}
my $h = Museum::Rijksmuseum::Object::Harvester->new( key => 'abc123xyz' );
Create a new instance of the harvester. key is required.
my $status = $h->harvest(
    set             => 'subject:PublicDomainImages',
    from            => '2023-01-01',
    to              => '2023-01-31',
    resumptionToken => $last_token_you_saw,
    delay           => 1_000,    # 1 second
    type            => 'identifiers',
    callback        => \&process_record,
);
Begins harvesting the records from the Rijksmuseum. The only required fields are callback and type, but the default delay is 10 seconds, so you probably want to put something sensible in there (or leave it at 10 seconds if you don't mind being very polite). If you have a resumption token, perhaps because you're recovering from a previous failure, you can supply that. from and to are not defined in the API documentation, so it's uncertain what they refer to. Latest update time maybe?
type can in theory be identifiers or records (mapping to ListIdentifiers and ListRecords internally), but records is currently unsupported, as at the time of writing I don't need it and it's a fair bit of work to do right.
callback will be called for every identifier or record. In the case of identifiers it'll receive a hashref containing identifier and datestamp. If the callback returns a true value (i.e. anything non-false), we quietly shut down. Due to the way resumption tokens work (i.e. they can be the same for subsequent requests), even if you request a shutdown, you'll still be fed the rest of the batch. This helps avoid missing records.
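As a minimal sketch of such a callback, assuming only the identifier and datestamp keys described above: the following collects identifiers and asks for a shutdown once a limit is reached. The names make_collector, $limit and @seen are illustrative and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): builds a callback that
# stores identifiers and requests a shutdown once $limit have been seen.
sub make_collector {
    my ( $limit, $seen ) = @_;
    return sub {
        my ($rec) = @_;    # hashref with 'identifier' and 'datestamp'
        push @$seen, $rec->{identifier};

        # A true return value asks the harvester to stop. The rest of the
        # current batch is still delivered, so we keep collecting regardless.
        return @$seen >= $limit ? 1 : 0;
    };
}

my @seen;
my $callback = make_collector( 2, \@seen );

# Pass it along as: $h->harvest( type => 'identifiers', callback => $callback );
```

Because the remainder of the batch still arrives after a shutdown request, @seen may end up holding more than $limit entries.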
The return value is a hashref that contains error if something went wrong, and possibly a resumptionToken to let you know how to pick up again.
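One way to make use of the returned resumptionToken, sketched under assumptions, is to persist it to a file and feed it back on the next run. The helper names and the token file are hypothetical; only the resumptionToken, type and callback arguments come from the documentation above.

```perl
use strict;
use warnings;

# Hypothetical helpers for remembering the last resumption token on disk.
sub save_token {
    my ( $file, $token ) = @_;
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} $token;
    close $fh or die "Cannot close $file: $!";
    return;
}

sub load_token {
    my ($file) = @_;
    return unless -e $file;
    open my $fh, '<', $file or die "Cannot read $file: $!";
    my $token = <$fh>;
    close $fh;
    chomp $token if defined $token;
    return ( defined $token && length $token ) ? $token : undef;
}

# Build the argument list for harvest(), adding the token only if one exists.
sub harvest_args {
    my ( $file, %base ) = @_;
    my $token = load_token($file);
    return ( %base, ( defined $token ? ( resumptionToken => $token ) : () ) );
}
```

After a run, calling save_token( 'harvest.token', $status->{resumptionToken} ) when a token is present, and then $h->harvest( harvest_args( 'harvest.token', type => 'identifiers', callback => \&process_record ) ) on the next run, should pick up roughly where the previous run stopped.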
There is some basic retry logic with exponential backoff that'll hopefully help seamlessly recover from transient network or service issues.
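The general idea of retrying with exponential backoff can be pictured with a generic sketch. This is not the module's actual implementation: the function name, the attempt limit and the base delay are all assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);    # allows fractional-second sleeps

# Illustrative only: run $code, retrying up to $max_tries times and doubling
# the wait after each failure (base, 2*base, 4*base, ... seconds).
sub with_backoff {
    my ( $code, %opt ) = @_;
    my $max_tries = $opt{max_tries}  // 5;
    my $delay     = $opt{base_delay} // 1;

    for my $try ( 1 .. $max_tries ) {
        my $result;
        my $ok = eval { $result = $code->(); 1 };
        return $result if $ok;

        my $err = $@;
        die $err if $try == $max_tries;    # out of attempts: rethrow
        sleep $delay;
        $delay *= 2;
    }
    return;
}
```

A caller would wrap whatever fetches one page of results, for example with_backoff( sub { fetch_page($url) } ), where fetch_page is a stand-in for any code that dies on a transient failure.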
key is the API key provided by the Rijksmuseum.
Robin Sheat, <rsheat at>
- Handle the ListRecords verb
This'll require writing a parser for EDM-DC or similar.
- Implement logging
A proper logging system would allow recording of transient failures to see if they are becoming a problem. It would also allow the option for more fine-grained progress information to be displayed.
Please report any bugs or feature requests to bug-museum-rijksmuseum-object at
, or through the web interface at I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
Alternately, use the tracker on the repository page at
You can find documentation for this module with the perldoc command.
perldoc Museum::Rijksmuseum::Object::Harvester
You can also look for information at:
Repository page (report bugs here)
RT: CPAN's request tracker (or here)
CPAN Ratings
Search CPAN
This software is Copyright (c) 2023 by Robin Sheat.
This is free software, licensed under:
The Artistic License 2.0 (GPL Compatible)