NAME

Museum::Rijksmuseum::Object::Harvester - Bulk-fetching of Rijksmuseum data via the OAI-PMH interface

VERSION

See Museum::Rijksmuseum::Object

SYNOPSIS

Performs a bulk fetch of the Rijksmuseum collection database using the OAI-PMH interface. For each record, a callback is called with the data. Note that the format of this data won't necessarily match what the Museum::Rijksmuseum::Object calls return, as it comes from a different endpoint.

use Museum::Rijksmuseum::Object::Harvester;

my $h      = Museum::Rijksmuseum::Object::Harvester->new( key => 'abc123xyz' );
my $status = $h->harvest(
    set      => 'subject:PublicDomainImages',
    from     => '2023-01-01',
    type     => 'identifiers',
    callback => \&process_record,
);
if ( $status->{error} ) {
    die "Error: $status->{error}\nLast resumption token: $status->{resumptionToken}\n";
}
if ( $status->{resumptionToken} ) {
    print "Finished, token: $status->{resumptionToken}\n";
}

SUBROUTINES/METHODS

new

my $h = Museum::Rijksmuseum::Object::Harvester->new( key => 'abc123xyz' );

Create a new instance of the harvester. key is required.

harvest

my $status = $h->harvest(
    set             => 'subject:PublicDomainImages',
    from            => '2023-01-01',
    to              => '2023-01-31',
    resumptionToken => $last_token_you_saw,
    delay           => 1_000, # 1 second
    type            => 'identifiers',
    callback        => \&process_record,
);

Begins harvesting records from the Rijksmuseum. The only required fields are callback and type. delay is given in milliseconds and defaults to 10 seconds, so you probably want to set something sensible (or leave it at 10 seconds if you don't mind being very polite). If you have a resumption token, for example because you're recovering from a previous failure, you can supply it as resumptionToken. from and to are not defined in the API documentation, so it's uncertain what they refer to. They probably bound each record's last-update datestamp, as the from and until arguments of OAI-PMH do.

type can in theory be identifiers or records (mapping to ListIdentifiers and ListRecords internally), but records is currently unsupported: at the time of writing I don't need it, and it's a fair bit of work to do right.

callback is called for every identifier or record. For identifiers, it receives a hashref containing identifier and datestamp. If the callback returns a true value, the harvester quietly shuts down. Because of the way resumption tokens work (they can be the same for subsequent requests), you'll still be fed the rest of the current batch even after requesting a shutdown. This helps avoid missing records.
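
To illustrate the callback contract described above, here is a minimal sketch of a callback that collects identifiers and asks for a shutdown after a fixed number of records. The helper make_collector, its $limit argument, and the sample records are hypothetical; only the identifier and datestamp keys come from the description above. The harvester is simulated by calling the callback directly.

```perl
use strict;
use warnings;

# Build a callback that stores each identifier and requests a shutdown
# once $limit records have been seen. make_collector is a hypothetical
# helper, not part of the module.
sub make_collector {
    my ( $limit, $seen ) = @_;
    return sub {
        my ($rec) = @_;    # hashref with 'identifier' and 'datestamp'
        push @$seen, "$rec->{identifier} ($rec->{datestamp})";

        # A true return value requests a shutdown. The callback may still
        # be called for the remainder of the current batch.
        return @$seen >= $limit ? 1 : 0;
    };
}

# Simulate the harvester feeding records to the callback (made-up data).
my @seen;
my $callback = make_collector( 2, \@seen );
my @results  = map { $callback->($_) } (
    { identifier => 'example:1', datestamp => '2023-01-01T00:00:00Z' },
    { identifier => 'example:2', datestamp => '2023-01-02T00:00:00Z' },
    { identifier => 'example:3', datestamp => '2023-01-03T00:00:00Z' },
);
```

Because the rest of the batch is still delivered after a shutdown request, the callback should tolerate being called again after it has returned a true value, as the third record here shows.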

The return value is a hashref that contains error if something went wrong, and possibly a resumptionToken to let you know how to pick up again.
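
One possible way to use the returned resumptionToken is to store it between runs and pass it back in on the next call. The sketch below assumes a plain text file is an acceptable place to keep the token; save_token, load_token and harvest_args are hypothetical helpers, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: persist the last resumption token in a small file.
sub save_token {
    my ( $path, $token ) = @_;
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} $token;
    close $fh or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: read a saved token, or return undef if none exists.
sub load_token {
    my ($path) = @_;
    return undef unless -e $path;    # no saved token: start from the beginning
    open my $fh, '<', $path or die "Cannot read $path: $!";
    my $token = <$fh>;
    close $fh;
    return undef unless defined $token;
    chomp $token;
    return length($token) ? $token : undef;
}

# Build the argument list for harvest(), adding resumptionToken only
# when a token was saved by an earlier run.
sub harvest_args {
    my ( $path, %base ) = @_;
    my $token = load_token($path);
    return ( %base, ( defined $token ? ( resumptionToken => $token ) : () ) );
}

# Intended use (assumes $h is a harvester and process_record is defined):
#   my $status = $h->harvest(
#       harvest_args( 'token.txt', type => 'identifiers', callback => \&process_record )
#   );
#   save_token( 'token.txt', $status->{resumptionToken} )
#       if $status->{resumptionToken};
```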

There is some basic retry logic with exponential backoff that'll hopefully help seamlessly recover from transient network or service issues.
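
For readers unfamiliar with the technique, the following is a generic sketch of retrying with exponential backoff. It is not the module's actual implementation: retry_with_backoff and all of its parameters (action, max_attempts, base_delay, sleeper) are hypothetical, and the defaults are arbitrary.

```perl
use strict;
use warnings;

# Call $action until it returns a defined value or $max_attempts is
# reached, doubling the wait after each failure. $sleeper is injectable
# so the logic can be exercised without actually waiting.
sub retry_with_backoff {
    my (%args) = @_;
    my $action       = $args{action};
    my $max_attempts = $args{max_attempts} // 5;
    my $delay        = $args{base_delay}   // 1;    # seconds
    my $sleeper      = $args{sleeper}      // sub { sleep $_[0] };

    my $last_error;
    for my $attempt ( 1 .. $max_attempts ) {
        my $result = eval { $action->() };
        return $result if defined $result;
        $last_error = $@ || 'action returned undef';
        last if $attempt == $max_attempts;    # no point waiting after the final try
        $sleeper->($delay);
        $delay *= 2;                          # exponential growth: 1, 2, 4, 8, ...
    }
    die "Giving up after $max_attempts attempts: $last_error";
}
```

Doubling the delay gives a struggling service progressively more room to recover, while the attempt cap ensures a persistent failure is eventually reported rather than retried forever.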

ATTRIBUTES

key

The API key provided by the Rijksmuseum.

AUTHOR

Robin Sheat, <rsheat at cpan.org>

TODO

Handle the ListRecords verb

This'll require writing a parser for EDM-DC or similar.

Implement logging

A proper logging system would allow recording of transient failures to see if they are becoming a problem. It would also allow the option for more fine-grained progress information to be displayed.

BUGS

Please report any bugs or feature requests to bug-museum-rijksmuseum-object at rt.cpan.org, or through the web interface at https://rt.cpan.org/NoAuth/ReportBug.html?Queue=Museum-Rijksmuseum-Object. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

Alternately, use the tracker on the repository page at https://gitlab.com/eythian/museum-rijksmuseum-object.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Museum::Rijksmuseum::Object::Harvester

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

This software is Copyright (c) 2023 by Robin Sheat.

This is free software, licensed under:

The Artistic License 2.0 (GPL Compatible)