NAME
Museum::Rijksmuseum::Object::Harvester - Bulk-fetching of Rijksmuseum data via the OAI-PMH interface
VERSION
See Museum::Rijksmuseum::Object
SYNOPSIS
Does a bulk fetch of the Rijksmuseum collection database using the OAI-PMH interface. For each record a callback will be called with the data. Note that the format of this data won't necessarily be the same as returned by the Museum::Rijksmuseum::Object calls, as it's coming from a different endpoint.
    use Museum::Rijksmuseum::Object::Harvester;
    my $h = Museum::Rijksmuseum::Object::Harvester->new( key => 'abc123xyz' );
    my $status = $h->harvest(
        set      => 'subject:PublicDomainImages',
        from     => '2023-01-01',
        type     => 'identifiers',
        callback => \&process_record,
    );
    if ( $status->{error} ) {
        die "Error: $status->{error}\nLast resumption token: $status->{resumptionToken}\n";
    }
    if ( $status->{resumptionToken} ) {
        print "Finished, token: $status->{resumptionToken}\n";
    }
SUBROUTINES/METHODS
new
    my $h = Museum::Rijksmuseum::Object::Harvester->new( key => 'abc123xyz' );
Create a new instance of the harvester. key is required.
harvest
    my $status = $h->harvest(
        set             => 'subject:PublicDomainImages',
        from            => '2023-01-01',
        to              => '2023-01-31',
        resumptionToken => $last_token_you_saw,
        delay           => 1_000,    # 1 second
        type            => 'identifiers',
        callback        => \&process_record,
    );
Begins harvesting records from the Rijksmuseum. The only required fields are callback and type, but the default delay is 10 seconds, so you probably want to put something sensible in there (or leave it at 10 seconds if you don't mind being very polite). The delay is given in milliseconds, as in the example above. If you have a resumption token, for instance because you're recovering from a previous failure, you can supply it. from and to are not defined in the API documentation, so it's uncertain what they refer to; possibly the latest update time.
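One way to make use of resumptionToken across runs is to keep the last token in a small file and pass it back only when one exists. The following is a minimal sketch under that assumption; save_token, load_token, harvest_args and the token file are hypothetical names for illustration and are not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: write the last resumption token to a text file.
sub save_token {
    my ( $file, $token ) = @_;
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} $token;
    close $fh or die "Cannot close $file: $!";
    return;
}

# Hypothetical helper: read the token back, or return undef if there is none.
sub load_token {
    my ($file) = @_;
    return undef unless -e $file;    # no previous run
    open my $fh, '<', $file or die "Cannot read $file: $!";
    my $token = <$fh>;
    close $fh;
    return undef unless defined $token;
    chomp $token;
    return length($token) ? $token : undef;
}

# Build the argument list for harvest(), adding resumptionToken only when
# a token was stored by an earlier run.
sub harvest_args {
    my ( $token_file, %base ) = @_;
    my $token = load_token($token_file);
    return ( %base, ( defined $token ? ( resumptionToken => $token ) : () ) );
}
```

The list returned by harvest_args can be passed straight to harvest, and save_token can be called with the resumptionToken found in the returned status, if any.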
type can in theory be identifiers or records (mapping to ListIdentifiers and ListRecords internally), but records is currently unsupported: at the time of writing I don't need it, and it's a fair bit of work to do right.
callback will be called for every identifier or record. In the case of identifiers, it'll receive a hashref containing identifier and datestamp. If the callback returns a true value, we quietly shut down. Due to the way resumption tokens work (i.e. they can be the same for subsequent requests), even if you request a shutdown, you'll still be fed the rest of the batch. This helps avoid missing records.
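As a minimal sketch of such a callback, assuming only what is described above (a hashref with identifier and datestamp, and a true return value requesting shutdown): make_collector and the limit of 100 are hypothetical and chosen purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a callback that stores each identifier in
# the array referenced by $seen and asks the harvester to stop once at
# least $limit identifiers have been collected.
sub make_collector {
    my ( $limit, $seen ) = @_;
    return sub {
        my ($rec) = @_;    # hashref with 'identifier' and 'datestamp'
        push @$seen, $rec->{identifier};
        # A true return value requests shutdown; false means keep going.
        return @$seen >= $limit ? 1 : 0;
    };
}

my @seen;
my $callback = make_collector( 100, \@seen );
```

The closure would then be passed as callback => $callback. Because the rest of the current batch is still delivered after a shutdown request, @seen may end up holding more than the requested number of identifiers.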
The return value is a hashref that contains error if something went wrong, and possibly a resumptionToken to let you know how to pick up again.
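The returned hashref can drive a simple resume loop in the calling code. The sketch below assumes only the error and resumptionToken keys described here; harvest_until_done and its attempt limit are hypothetical, and $harvester stands for any object whose harvest method behaves as documented above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: call harvest() and, if it reports an error, try
# again from the returned resumption token, up to $max_attempts times.
sub harvest_until_done {
    my ( $harvester, $max_attempts, %args ) = @_;
    my $status;
    for my $attempt ( 1 .. $max_attempts ) {
        $status = $harvester->harvest(%args);
        return $status unless $status->{error};
        warn "Attempt $attempt failed: $status->{error}\n";
        # Resume from the last token if one was reported; otherwise the
        # next attempt simply repeats the same request.
        $args{resumptionToken} = $status->{resumptionToken}
            if defined $status->{resumptionToken};
    }
    return $status;    # still carries the error from the final attempt
}
```

This sits on top of whatever retrying the harvester does itself, so a small attempt limit is probably enough.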
There is some basic retry logic with exponential backoff that'll hopefully help seamlessly recover from transient network or service issues.
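For readers unfamiliar with the technique, exponential backoff in general looks something like the sketch below. This is an illustration of the idea only, not the code used inside this module; with_backoff and its tries and initial_wait options are hypothetical.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);    # allows fractional-second waits

# Illustrative only: run $action, and if it dies, wait and try again,
# doubling the wait after each failure. Rethrows the last error once
# all attempts are used up.
sub with_backoff {
    my ( $action, %opt ) = @_;
    my $tries = $opt{tries}        // 5;
    my $wait  = $opt{initial_wait} // 1;    # seconds
    for my $n ( 1 .. $tries ) {
        my $result;
        my $ok = eval { $result = $action->(); 1 };
        return $result if $ok;
        die $@ if $n == $tries;             # out of attempts: rethrow
        sleep $wait;
        $wait *= 2;
    }
    return;
}
```

With the defaults shown, a call that keeps failing would wait 1, 2, 4 and 8 seconds between its five attempts before giving up.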
ATTRIBUTES
key
The API key provided by the Rijksmuseum.
AUTHOR
Robin Sheat, <rsheat at cpan.org>
TODO
- Handle the ListRecords verb
  This'll require writing a parser for EDM-DC or similar.
- Implement logging
  A proper logging system would allow recording of transient failures to see if they are becoming a problem. It would also allow the option for more fine-grained progress information to be displayed.
BUGS
Please report any bugs or feature requests to bug-museum-rijksmuseum-object at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Museum-Rijksmuseum-Object. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
Alternately, use the tracker on the repository page at https://gitlab.com/eythian/museum-rijksmuseum-object.
SUPPORT
You can find documentation for this module with the perldoc command.
    perldoc Museum::Rijksmuseum::Object::Harvester
You can also look for information at:
- Repository page (report bugs here)
  https://gitlab.com/eythian/museum-rijksmuseum-object
- RT: CPAN's request tracker (or here)
  https://rt.cpan.org/NoAuth/Bugs.html?Dist=Museum-Rijksmuseum-Object
- CPAN Ratings
- Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
This software is Copyright (c) 2023 by Robin Sheat.
This is free software, licensed under:
The Artistic License 2.0 (GPL Compatible)