NAME
XML::RSS::TimingBot - for efficiently fetching RSS feeds
SYNOPSIS
use XML::RSS::TimingBot;
$browser = XML::RSS::TimingBot->new;
my $response = $browser->get(
'http://interglacial.com/rss/cairo_times.rss'
# or whatever feed
);
... And process $response just as if it came from
a plain old LWP::UserAgent object, for example: ...
if($response->code == 200) { # 200 = "okay, here it is"
...process it...
} elsif($response->code == 304) { # 304 = "Not Changed"
# do nothing
} else {
print "Hm, couldn't access it: ", $response->status_line, "\n";
}
$browser->commit; # Save our history. Don't forget!!
DESCRIPTION
If you use LWP::UserAgent for fetching RSS/RDF feeds, use XML::RSS::TimingBot instead! XML::RSS::TimingBot has the same interface, but knows when to more efficiently request the data.
DETAILED DESCRIPTION
XML::RSS::TimingBot is for requesting RSS feeds only as often as needed. This class does this in two ways:
* When you request a feed the first time, this class remembers what Last-Modified and ETag headers it has; that the next time you request that feed, this class can specify that the feed's server should return data only if that feed has changed since last time, or has a different ETag value. If the feed has changed, you'll get the HTTP response back with full content and with a normal "200" status code. If the feed hasn't changed, you'll get a contentless "304" response (meaning "I'm not giving you any content, because it hasn't changed").
* When you request a feed, this class remembers any data that might be in the RSS that says how often this feed updates. See XML::RSS::Timing for the full story; but as a common example if there's a <ttl>180</ttl>
in the feed, that means that the feed will rebuild only once every three hours (180 minutes). When this class sees that in the received RSS data, it remembers this so that if you go to get the feed more often than that, it will stop you and give a "304" (Not Modified) error response.
USING UNDER WITH LWP::Simple
If you have an RSS-reading program is like this:
use LWP::Simple;
...and later...
$data = get($rss_url);
... then you can make it work with this this XML::RSS::TimingBot (or a subclass) by just changing the "use LWP::Simple" line to this:
use LWP::Simple qw(:DEFAULT $ua);
use XML::RSS::TimingBot;
$ua = XML::RSS::TimingBot->new;
And then change your $data = get($rss_url);
line to either have a commit call like this:
$data = get($rss_url); $ua->commit;
Or just add a $ua->commit;
line later in the program where it will be called after the feed data is processed, presumably toward the end of this RSS-aggregator program's run.
METHODS
This module inherits all of the methods from LWP::UserAgent and LWP::UserAgent::Determined, and adds the following ones.
- $browser->commit
-
This saves to disk (or database, whatever) all the data that this browser object has accumulated about how long to wait before re-requesting what URLs with what Last-Modified and ETag headers.
- $browser->minAge(60*60); # an hour
- $minage_seconds = $browser->minAge();
-
This sets (or in the second case, just reads) this browser object's
minAge
attribute. This attribute denotes the minimum amount of time (in seconds) that your client will go between polling, overriding whatever this feed says if it says a shorter interval.For example, if a feed says it can update every 5 minutes, but you've set your
minAge
to a half hour, then this timing object will act as if the feed really said to update only half hour at most. (This won't have any effect on feeds that say they update at intervals longer than what minAge is set to.)If you set minAge, you should probably set it only to a smallish value, like the number of seconds in an hour (60*60). By default, minAge is not set, meaning no minimum is enforced.
- $maxage_minutes = $browser->maxAge();
- $browser->maxAge(62*24*60*60); # two months
-
This sets (or in the second case, just reads) this browser object's
maxAge
attribute. This attribute denotes the maximum amount of time (in seconds) that your client will go between polling, overriding whatever this feed says if it says a longer interval.For example, if a feed says it updates only once a year but you've set your
minAge
to two months, then this timing object will act as if the feed really said to update every two months. (This won't have any effect on feeds that say they update at intervals shorter than what maxAge is set to.)If you set this, you should probably set it only to a large value, like the number of seconds in two months (62*24*60*60). By default, maxAge is not set, meaning no maximum is enforced. (So if a feed says to update only once a year, then that's what this timing object faithfully implements.)
THE BASIC STORAGE SYSTEM
XML::RSS::TimingBot uses a simple flat-file database system to store information about what URLs shouldn't be requested until when, and what the last-modified and ETag headers were from what URLs.
If you're using XML::RSS::TimingBot to poll vast numbers of feeds, you can try out XML::RSS::TimingBot, but you'll probably want to use XML::RSS::TimingbotDBI, which stores all its data not in a flat file database, but in an SQL table that you specify.
(Users who aren't polling vast numbers of feeds are still free to use XML::RSS::TimingbotDBI! It's just that XML::RSS::Timingbot is probably more convenient for you, since it doesn't need to have DBI installed, etc.)
There is a notable drawbacks to the basic flat-file storage system: it doesn't ever tidy up its database.
The fact that XML::RSS::TimingBot never tidies up its database is less serious. Basically, XML::RSS::TimingBot offers no way to say "I don't plan to ever look at $url again, so go ahead and delete your database's data on it". This lack is unlikely to be a problem for you unless you have a lot of feeds constantly being added and removed from polling. There's no obvious tidy solution, but a crude and effective solution is to just delete the local flat-file database directory every now and then (like every two months).
The current flat-file database works by keeping a bunch of text files in a directory called "rssdata", which is a subdirectory of one of the following:
$ENV{'TIMINGBOTPATH'} if it's defined,
otherwise $ENV{'APPDATA'} if it's defined,
otherwise $ENV{'HOME'} if it's defined (and it usually is)
otherwise, in the current directory ('.' in Unix terms, not `pwd`)
Normally you don't need to deal with any of this; but if having a "rssdata" directory in your home directory annoys you, then go ahead and set $ENV{'TIMINGBOTPATH'} = "/wherever/the/hell/you/want"
before you go manipulating an XML::RSS::TimingBot object.
(If the "rssdata" directory doesn't exist where XML::RSS::TimingBot expects it to be, it will be created there with default permissions. If that's not what you want, create it with whatever permissions you like.)
SEMI-INTERNAL METHODS
These are internal methods used by XML::RSS::TimingBot objects, but I document them here in case they might be useful to you. (Subclassers: you really shouldn't ever need to override these.)
- $sem_file = $browser->rss_lock_file;
- $browser->rss_lock_file($some_file);
-
This reads (or, in the second case, sets) the value of the
rss_lock_file
attribute. That's the name of a file that will be used to avoid race conditions as the "Basic Storage System" (see above) reads and writes to its data files. (This isn't used by XML::RSS::TimingBotDBI.)Normally you don't need to worry about this attribute, but here are its possible values:
A false value ("0", undef, or empty string) means that you won't use any file locking. Under MSWin and old MacOS, this is the default!. That's because those OSs don't support proper Unix-style file locking.
The value "1" means to use a sanely named file in "rssdata" directory. This is the default, except under MSWin and old MacOS.
Any other value (like, say,
"$ENV{'HOME'}/.rssualock"
) means to use that filespec as the lock file.
- $epoch_time = $browser->feed_get_next_update($url);
- $browser->feed_set_next_update($url, $next_update_epoch_time);
-
This queries (or in the second case, sets) the earliest time (in epoch time) that this feed will actually be queried.
- $last_mod_string = $browser->feed_get_last_modified($url);
- $browser->feed_set_last_modified($url, $last_mod_string);
-
This queries (or in the second case, sets) what the last-mod string for the given URL is.
- $etag_string = $browser->feed_get_etag($url);
- $browser->feed_set_etag($url, $etag_string);
-
This queries (or in the second case, sets) what the ETag string for the given URL is.
SUBCLASS INTERFACE METHODS
Read this section only if XML::RSS::TimingBot and XML::RSS::TimingBotDBI don't work for you and you need to write an XML::RSS::TimingBot subclass to do what you want.
To write a subclass, you need to override two crucial methods, datum_from_db
and commit
, which get called like so by either the user or XML:RSS::TimingBot itself:
$value = $browser->datum_from_db($url, $attribute_name);
$browser->commit();
The first method, datum_from_db
is called whenever a particular bit of data needs to be gotten from the database. The commit
method is called whenever the user wants to commit to disk/database all of this object's newly acquired memory of URLs and their attributes. "Attributes" is my term for fields. Currently XML::RSS::TimingBot only uses three attribute names: "lastmodified", "nextupdate", and "etag". You may consider any other attribute names to be an error. (Although I just might, in the future, add more; but I consider this unlikely at the moment.)
When $browser->commit
is called, $browser->{'rsstimingbot_for_db'}
will either be blank or {}
(nothing to write out), or will be reference to a hash-of-hash of strings, i.e.,
$browser->{'rsstimingbot_for_db'}{$url}{$attribute} = $value
You can traverse that structure like so:
my $hoh = $browser->{'rsstimingbot_for_db'} || return;
return unless scalar keys %$hoh;
foreach my $url (sort keys %$hoh) {
$for_this_url = $hoh->{$url} || next;
foreach my $attr (sort keys %$for_this_url) {
$value = $for_this_url->{$attr};
$value = '' unless defined $value;
... and here you do something to save
the datum that $url's $attrib is $value ...
}
}
# And if all went well, clear the cache:
%$hoh = ();
If neither XML::RSS::TimingBot nor XML::RSS::TimingBotDBI do what you want, and you need to write a new subclass of XML::RSS::TimingBot that will do what you want, feel free to email me for suggestions if the above isn't what you need.
IMPLEMENTATION
This class works by subclassing LWP::UserAgent (actually LWP::UserAgent::Determined, which subclasses LWP::UserAgent) and overriding its simple_request
method with an around-method. The around-method blocks the request if earlier requests expressed that the feed would not have new data by now (via ttl
, skipHours
, updateBase
, etc elements).
Otherwise the request gets a If-Modified-Since header added if that feed's Last-Modified line was noted last time, and the request gets an If-None-Match header added if that feed's ETag line was noted last time. Then the superclass's simple_request
method is called to actually perform the request.
If it returns non-RSS data, or returns an error, then that response is simply returned. Otherwise it is scanned for timing elements (ttl
, etc), whose contents are fed to an XML::RSS::Timing object to calculate when the feed could next have new data. The response's Last-Modified and ETag values are also stored, if they're found.
These pieces of data -- the feed's Last-Modified value, its ETag value, and the time that it shouldn't be polled again until -- are kept in the object until you call the commit
method, at which point they are written to disk (or to a DBI object, in the case of XML::RSS::TimingBotDBI's commit
override method).
XML PARSING
Thing module uses some regular expressions to extract the values of the timing elements from the RSS data. Using regexps for parsing XML is generally a bad idea, but in this specific case, it seems quite unproblematic. Email me if you come across a real-life case that my regexps don't handle.
ENVIRONMENT
This module is influenced by three environment variables: TIMINGBOTPATH, APPDATA, and HOME. See above, under "THE BASIC STORAGE SYSTEM".
SEE ALSO
lwptut, lwpcook, LWP::UserAgent, HTTP::Response, and the book Perl & LWP by Sean Burke.
XML::RSS::TimingBotDBI, XML::RSS::Timing, LWP::UserAgent::Determined, LWP
The HTTP spec, currently RFC 2616.
COPYRIGHT AND DISCLAIMER
Copyright 2004, Sean M. Burke sburke@cpan.org
, all rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
AUTHOR
Sean M. Burke, sburke@cpan.org