NAME
rss2leafnode -- post RSS or Atom feeds and web pages to newsgroups
SYNOPSIS
rss2leafnode [--options]
DESCRIPTION
RSS2Leafnode downloads RSS or Atom feeds and posts items as messages to an NNTP news server. It's designed to make simple text entries available in local newsgroups, not propagating anywhere (though that's not enforced).
Desired feeds are given in a configuration file .rss2leafnode.conf in your home directory. For example to put a feed in group "r2l.perl"
fetch_rss ('r2l.perl', 'http://log.perl.org/atom.xml');
This is actually Perl code, so comment lines begin with # and you can write conditionals etc. The target newsgroup must exist (see "Leafnode" below). With that done, run rss2leafnode as
rss2leafnode
You can automate this with cron or similar. If you do it under user news it could be just after a normal news fetch. The --config option below lets you run different config files at different times, etc. A sample config file is included in the RSS2Leafnode sources.
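Since the configuration file is ordinary Perl, a slightly fuller file might look like the following sketch. The group name r2l.work and the example.org URL are placeholders rather than real feeds, and the weekday test is only one illustration of a conditional.

```perl
# ~/.rss2leafnode.conf -- illustrative sketch only.
# The r2l.work group and the example.org URL are placeholders.

# A feed fetched on every run.
fetch_rss ('r2l.perl', 'http://log.perl.org/atom.xml');

# A conditional: fetch this (hypothetical) feed only on weekdays.
# Element 6 of localtime is the day of the week, 0 = Sunday.
my $wday = (localtime)[6];
if ($wday >= 1 && $wday <= 5) {
  fetch_rss ('r2l.work', 'http://example.org/work/feed.xml');
}
```

Both target groups must already exist on the news server before the file is used.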
Messages are added to the news spool using NNTP "POST" commands. When a feed is re-downloaded any items previously added are not repeated. Multiple feeds can be put into a single newsgroup. Feeds are inserted as they're downloaded, so the first articles appear while the rest are still in progress.
The target newsgroup can also be a news: or nntp: URL of a server on a different host, or on a different port number if running a personal server on a high port.
fetch_rss('news://somehost.mydomain.org:8119/r2l.weather',
'http://feeds.feedburner.com/PTCC');
Web Pages
Plain web pages can be downloaded too. Each time the page changes a new article is injected. This is good for a latest news or status page which doesn't have an RSS feed. For example
fetch_html ('r2l.finance',
'http://www.baresearch.com/free/index.php?category=1');
The target can be an image or similar directly too, it's simply put into a news message with its indicated MIME type. How well it displays depends on your newsreader.
The message "Subject" is the HTML <title>, or something better from URI::Title or Image::ExifTool if you've got them. URI::Title has special cases for a few unhelpful sites and Image::ExifTool can get a PNG image title.
Re-Downloading
HTTP ETag and Last-Modified headers are used, if provided by the server, to avoid re-downloading unchanged content (feeds or web pages). <thr:count> is used if provided to check for unchanged comments feeds. Values seen from the last run are saved in a .rss2leafnode.status file in your home directory.
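The ETag and Last-Modified mechanism can be sketched with the core HTTP::Tiny module as below. This is only an illustration of a conditional GET, not the code RSS2Leafnode itself uses; the function names and the etag and last_modified hash keys are made up for the example.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Build conditional-request headers from values saved on a previous run.
# $saved is a hashref such as { etag => '"abc"', last_modified => '...' };
# these key names are hypothetical.
sub conditional_headers {
  my ($saved) = @_;
  my %headers;
  $headers{'If-None-Match'} = $saved->{etag}
    if defined $saved->{etag};
  $headers{'If-Modified-Since'} = $saved->{last_modified}
    if defined $saved->{last_modified};
  return \%headers;
}

# Fetch $url, returning undef when the server answers 304 Not Modified.
# On a successful download the saved values are updated in place.
sub fetch_if_changed {
  my ($url, $saved) = @_;
  my $resp = HTTP::Tiny->new->get
    ($url, { headers => conditional_headers ($saved) });
  return if $resp->{status} == 304;
  die "Cannot fetch $url: $resp->{status} $resp->{reason}\n"
    unless $resp->{success};
  $saved->{etag}          = $resp->{headers}{etag};
  $saved->{last_modified} = $resp->{headers}{'last-modified'};
  return $resp->{content};
}
```

A caller would keep the $saved hash between runs (for instance in a status file) and skip any posting when fetch_if_changed() returns undef.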
If you've got XML::RSS::Timing then it's used for RSS ttl, updateFrequency, etc from a feed. This means the feed is not re-downloaded until its specified update times. Only a few feeds have useful timing info, most merely give a ttl advising for instance 5 minutes between rechecks.
With --verbose the next calculated update time is printed in case you're wondering why nothing is happening. The easiest way to force a re-download is to delete the ~/.rss2leafnode.status file. Old status file entries are automatically dropped if you don't fetch a particular feed for a while, so that file should normally need no maintenance.
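For the simple ttl case, the next update time amounts to the arithmetic sketched below. XML::RSS::Timing handles the fuller rules (updateFrequency etc); the function name here is hypothetical and only the plain "ttl in minutes" rule is shown.

```perl
use strict;
use warnings;

# Return the earliest epoch time at which a feed should be fetched again,
# given the time of the last fetch and the feed's <ttl> in minutes.
# A missing or non-numeric ttl is treated as "no restriction".
sub next_update_time {
  my ($last_fetch, $ttl_minutes) = @_;
  return $last_fetch
    unless defined $ttl_minutes && $ttl_minutes =~ /^\d+$/;
  return $last_fetch + 60 * $ttl_minutes;
}

my $last = time() - 120;                 # pretend the last fetch was 2 minutes ago
my $next = next_update_time ($last, 5);  # feed gives a ttl of 5 minutes
if (time() < $next) {
  printf "not due yet, next update in %d seconds\n", $next - time();
}
```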
Leafnode
rss2leafnode was originally created with the leafnode program in mind, but can be used with any server accepting posts. It's your responsibility to be careful where a target newsgroup propagates. Don't make automated postings to the world!
For leafnode see its README file section "LOCAL NEWSGROUPS" on creating local-only groups. Basically you add a line to the /etc/news/leafnode/local.groups file like
r2l.stuff y My various feeds
The group name is arbitrary and the description is optional, but note it must be a tab character between the name and the "y" and between the "y" and any description. "y" means posting is allowed.
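Because tabs and spaces look alike in most editors, one option is to generate the line programmatically. The following is a small sketch; the function name is made up for the example.

```perl
use strict;
use warnings;

# Format one local.groups line: group name, posting flag and optional
# description, with a single tab character between the fields.
sub local_groups_line {
  my ($group, $posting, $description) = @_;
  my @fields = ($group, $posting);
  push @fields, $description
    if defined $description && length $description;
  return join ("\t", @fields) . "\n";
}

print local_groups_line ('r2l.stuff', 'y', 'My various feeds');
```

The printed line can then be appended to /etc/news/leafnode/local.groups.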
Small News
The Small News "sn" program is a possible local server too. Create groups in it with snnewgroup r2l.something. When running snntpd from inetd or similar don't forget a logger program argument on the command line as shown in its INSTALL.run, otherwise log messages from snntpd will confuse client programs, including Net::NNTP as used by rss2leafnode.
Copyright
It's your responsibility to check the terms of use for any feeds or web pages you download with rss2leafnode. Pay particular attention if propagating or re-transmitting resulting messages.
Copyright or license statements in a feed are included in the messages under X-Copyright headers. Unless the content is in the public domain such copyright notices must be retained.
COMMAND LINE OPTIONS
The command line options are
--config=/some/filename
-
Read the specified configuration file instead of ~/.rss2leafnode.conf.
--help
-
Print some brief help information.
--verbose
-
Print some diagnostics about what's being done. With --verbose=2 print various technical details.
--version
-
Print the program version number and exit.
CONFIG OPTIONS
The following variables can be set in the configuration file
- $rss_get_links (default 0)
-
If true then download links in each item and include the content in the news message. For example,
$rss_get_links = 1;
fetch_rss ('r2l.finance', 'http://au.biz.yahoo.com/financenews/htt/financenews.xml');
Not all feeds have interesting things at their link. Sometimes the RSS has the full item text already. But if the RSS is a summary then $rss_get_links can make the full article ready to read immediately, instead of having to click through from the message.
Only the immediate link target URL is retrieved. No images within the page are downloaded (which is often a good thing), and you'll probably have trouble if the link uses frames (a set of HTML pages instead of just one).
- $rss_get_comments (default 0)
-
If true then download the comments feeds for items and post as followup news articles. For example,
$rss_get_comments = 1;
fetch_rss ('r2l.food', 'http://wickedgooddinner.blogspot.com/feeds/posts/default');
To send a followup comment you generally must go to the links in the original article (or the followups) and use some sort of web form. Posting a message to the newsgroup goes nowhere.
When a feed is available in both Atom and RSS formats sometimes only the Atom one includes a comments feed URL.
Comments feeds are followed for as long as an article appears in the feed, though in the current implementation they might be checked for new comments only when the originating feed changes.
- $render (default 0)
-
If true then render HTML to text for the news messages. Normally item text, $rss_get_links downloaded parts, and fetch_html pages are all presented as text/html. If your newsreader doesn't handle HTML very well then $render is a good way to see just the text. Setting 1 uses HTML::FormatText
$render = 1;
fetch_rss ('r2l.weather', 'http://xml.weather.yahoo.com/forecastrss?p=ASXX0001&u=f');
Setting "WithLinks" uses the HTML::FormatText::WithLinks variant (you must have that module) which shows HTML links as footnotes.
$render = 'WithLinks';
fetch_rss ('r2l.stuff', 'http://rss.sciam.com/sciam/basic-science');
Settings elinks, lynx or w3m dump through the respective external program (you must have HTML::FormatExternal and the program).
$render = 'lynx';
$rss_get_links = 1;
fetch_rss ('r2l.sport', 'http://fr.news.yahoo.com/rss/rugby.xml');
- $render_width (default 60)
-
The number of columns to use when rendering HTML to plain text or when wrapping Atom text. You can set this to whatever you find easiest to read, or any special width needed by a particular feed.
- $get_icon (default 0)
-
Download an RSS/Atom icon or HTML favicon as an image for the Face header. The Face header is shown by Gnus and perhaps only a few other news readers. In Gnus it appears with the "From" in the article mode display on a graphical screen. It can be a good visual cue to the channel origin, but may not always be worth the extra download.
$get_icon = 1;
fetch_rss ('r2l.whatsnew', 'http://www.archive.org/services/collection-rss.php');
Image::Magick is required to process the images. Banner images which are much wider than high are suppressed as probably advertising and in any case not suited to the 48x48 size of the Face header specification. A 48x48 image may add perhaps 4 kbytes or more to each message.
For plain RSS and Atom feeds an image is normally per-channel so is the same for all articles from the feed. But an itunes:image can be per-item and is used if present.
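Several of the above variables can be combined for one feed and then put back to their defaults for the feeds that follow. The fragment below is only a sketch; the example.org URL is a placeholder and the width of 72 is an arbitrary choice.

```perl
# Illustrative config fragment; the example.org URL is a placeholder.
$render = 'lynx';       # needs HTML::FormatExternal and the lynx program
$render_width = 72;     # wider than the default 60 columns
$get_icon = 1;          # needs Image::Magick
fetch_rss ('r2l.stuff', 'http://example.org/feed.xml');

# Back to the documented defaults for any feeds that follow.
$render = 0;
$render_width = 60;
$get_icon = 0;
```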
Obscure Options
- $rss_charset_override (default undef)
-
If set then force RSS content to be interpreted in this charset, irrespective of what the document says. See "ENCODINGS" in XML::Parser for the charsets supported by the parser (the .enc files under /usr/lib/perl5/XML/Parser/Encodings/ plus some builtins).
Use this option if the document is wrong or has no charset specified and isn't the XML default utf-8. Usually you'll only want this for a particular offending feed. For example,
# AIR is latin-1, but doesn't have a <?xml> saying that
$rss_charset_override = 'iso-8859-1';
fetch_rss ('r2l.finance', 'http://www.aireview.com.au/rss.php');
$rss_charset_override = undef;
By default RSS2Leafnode attempts to cope with bad multibyte sequences by re-coding to the feed's claimed charset. If that works then the text will have some substitute characters (either U+FFFD or question marks "?") and a warning is given like
Feed http://example.org/feed.xml recoded utf-8 to parse, expect substitutions for bad non-ascii (line 214, column 75, byte 13196)
Bad single-byte codings generally aren't detected and will just go through to display something incorrect (eg. Windows codepage 1252 used where Latin-1 is claimed). Nose around the raw feed as necessary to see where it goes wrong.
- $html_charset_from_content (default 0)
-
If true then the charset used for fetch_html content is taken from the HTML itself, rather than the server's HTTP headers. Normally the server should be believed, but if a particular server is misconfigured then you can try this.
$html_charset_from_content = 1;
fetch_html ('r2l.stuff', 'http://www.somebadserver.com/newspage.html');
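The undetected single-byte problem mentioned under $rss_charset_override can be seen with the core Encode module. The sketch below decodes the same bytes under both charsets; it only demonstrates the mismatch and is not part of RSS2Leafnode. Whether a given charset name is accepted by $rss_charset_override depends on XML::Parser.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Byte 0x93 is a left double quotation mark in codepage 1252, but an
# unprintable C1 control character in ISO-8859-1 (Latin-1).
my $bytes = "\x93quoted\x94";

my $as_latin1 = decode ('iso-8859-1', $bytes);
my $as_cp1252 = decode ('cp1252', $bytes);

printf "as latin-1: first char U+%04X\n", ord ($as_latin1);
printf "as cp1252:  first char U+%04X\n", ord ($as_cp1252);
```

If the raw feed contains bytes in the 0x80 to 0x9F range while claiming Latin-1, a codepage 1252 source is a likely explanation.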
Variable Extent
Variables take effect from the point they're set, through to the end of the file, or until a new setting. The Perl local feature and a braces block can confine a setting to a particular few feeds. Eg.
{ local $rss_get_links = 1;
fetch_rss ('r2l.finance',
'http://www.debian.org/News/weekly/dwn.en.rdf');
}
OTHER DETAILS
Non-ascii RSS and Atom text and rendered HTML text are all coded as utf-8 in the generated messages so for non-ascii content you'll need a newsreader which supports that. Unrendered HTML is left in the charset the server gave, to ensure it matches any <meta http-equiv> in the document. In all cases the charset is specified in the MIME message headers or attachment parts. Transfer format in the message body is chosen by MIME::Entity (except Atom base64 <content>) which normally means quoted-printable if there's any non-ascii or very long lines.
Links are shown for
<link> RSS and Atom
<enclosure> RSS
<comments> RSS
<content> Atom externals, except other XML feeds
<wfw:comment> well-formed web
<wiki:diff>
<wiki:history>
<sioc:has_creator>
<sioc:has_discussion>
<sioc:links_to>
<sioc:reply_of>
Author <url> Atom and wiki, not downloaded
Comment or reply links show a count from any of
<thr:total>
count="123" \ attribute of <link>
thr:count="123" /
<slash:comments> sub-element of <comments>
The RSS format comment feeds used by $rss_get_comments are as follows. "appication" is a typo from WordPress pre 2.5 and still sometimes found in use as of Feb 2011.
<wfw:commentRss>
<link rel='replies' type='application/atom+xml' ...>
<link rel='replies' type='appication/atom+xml' ...>
Common Alerts Protocol (CAP) fields for weather alerts etc are shown if present (eg. from the US NOAA). This can have more detail than just the text. Pseudo-link footnotes are shown for,
<geo:lat>,<geo:long>
<geo:Point>
<georss:point>
<statusnet:origin> possibly with URL target too
<media:credit>
Unrecognised item fields are shown in XML at the end of the message so as not to drop information, and to perhaps suggest extra things RSS2Leafnode might present or interpret.
An attempt is made to repair bad XML from a feed with XML::Liberal if you have that module. It uses XML::LibXML and the libxml library and is often successful on annoying things like bad entities, at least enough to process something.
Too much or too little entity escaping tends to be the most common XML problem. Too little can turn HTML markup into nested XML elements. RSS2Leafnode treats that as if it was XHTML elements, though the result is likely to be imperfect. Too much escaping currently ends up displaying raw or semi-raw HTML. An option for extra unescaping might improve the display of some bad feeds, but in practice each bad feed is bad in its own special way.
Message Headers
For reference the headers in the messages are generated roughly as follows,
- From:
-
First non-empty of
<author>
<dc:creator>
<dc:contributor>
<wiki:username>
<itunes:author>
<managingEditor>
<webMaster>
<dc:publisher>
<itunes:owner>
channel <title>
If there's no identifiable mailbox part then nobody@rss2leafnode.dummy is added to make an RFC822 address. The channel title as a fallback shows something about where a message came from when there's no other author identified. An author's home page is shown in the links (as noted above).
- Subject:
-
<title> or <dc:subject>. A <dc:subject> is normally only a keyword but might be better than nothing.
- Date:
-
First present of
<pubDate>
<dc:date>
<modified>
<updated>
<issued>
<created>
<lastBuildDate>
<published>
These are all supposed to be ISO format "2000-01-01T12:00:00Z" etc and are converted to RFC822 style. An unrecognised form is put through unmodified.
- Date-Received:
-
The date/time when rss2leafnode made the message.
- Message-ID:
-
First of
<id> (Atom)
<guid isPermaLink="true">
<link> from Yahoo Finance
<guid isPermaLink="false"> plus feed URL
MD5 hash of various fields and feed URL
Yahoo Finance items repeated in different feeds are noticed using a special match of the <link> so that just one copy is posted. (As of March 2010 those items don't offer RSS guid identifiers.)
- Keywords:
-
All of
<category>
<itunes:category>
<cap:category>
<itunes:keywords>
<media:keywords>
<dc:subject>
The sub-category system of <itunes:category> is not currently put through.
- In-Reply-To:
-
<thr:in-reply-to> elements (per RFC 4685) turned into Message-IDs the same way as an Atom <id>. This might help thread display in a news reader if the parent item was downloaded too.
<sioc:reply_of> is not used here. It'd be a possibility, but would probably need a hard-coded mapping of URL to Message-ID. For now it's just shown as a link (as noted above).
- Content-Location:
-
The URL of a fetch_html() or a $rss_get_links attachment part. Good newsreaders use this to resolve relative links in a HTML part.
- Content-Language:
-
First of
<language>
<dc:language>
<twitter:lang>
xml:lang=""
HTTP response Content-Language header
xml:lang is a standard XML attribute which may be present on any element and is sometimes found on Atom <content> text.
- Content-MD5:
-
From the corresponding HTTP header of a fetch_html() or $rss_get_links download part, though in practice this is almost never used.
- Importance:
- Priority:
-
Common Alerts Protocol <cap:severity> levels Extreme and Severe are treated as "Importance: high" and "Priority: urgent". <wiki:importance>minor is "Importance: low". These headers are only supposed to be for X.400 inter-operation though.
- Precedence:
-
"list" for certain Google Groups lists, identified by their link URLs per List-Post below. Perhaps other feeds which come from mailing lists could be identified in the future.
- Face:
-
As per the $get_icon option above, the first of item or channel
<image> RSS
<icon> Atom
<logo> Atom
<itunes:image>
<statusnet:postIcon>
<activity:actor><link rel="avatar">
HTML favicon for fetch_html()
Gnus and perhaps other newsreaders can display Face:, see http://quimby.gnus.org/circus/face.
It'd be possible to generate an X-Face: as well or instead, but it's black and white and a conversion from a colour image out of the feeds is unlikely to look good most of the time.
- List-Post:
-
Mailbox of a Google Groups mailing list feed such as http://groups.google.com/group/cake-php/feed/rss_v2_0_msgs.xml. This may help post a followup to the list, depending on the newsreader. (A followup to an rss2leafnode newsgroup will normally go nowhere.)
- PICS-Label:
-
Channel <rating>. Perhaps in the future <itunes:explicit> or <media:adult> could be turned into a rating too.
- X-Mailer:
-
"RSS2Leafnode/VERSION" plus the usual from MIME::Entity (see "build PARAMHASH" in MIME::Entity).
- X-Copyright:
-
An RSS2Leafnode extension, being all of the following. See "Copyright" above.
<rights>
<copyright>
<dc:license>
<dc:rights>
<creativeCommons:license>
<cc:license>
- X-RSS-Url:
-
An RSS2Leafnode extension, being the originating fetch_rss() feed URL downloaded. This is handy if an item has come out badly and you want to check the raw feed.
- X-RSS-Generator:
-
An RSS2Leafnode extension, being the channel <generator>. This might help assign blame for bad feed content etc.
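Two of the mappings above, the MD5 fallback for Message-ID and the ISO to RFC822 conversion for Date, can be sketched with core modules as below. The function names, the choice of hashed fields, the "md5." prefix and the use of rss2leafnode.dummy as the host part are all assumptions made for the illustration, not necessarily what RSS2Leafnode generates. Only UTC "Z" times are handled.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);
use Time::Local qw(timegm);

# Make a Message-ID from an MD5 hash of some item fields plus the feed URL.
# Field choice and the host part are assumptions for this sketch.
sub message_id_from_fields {
  my ($feed_url, @fields) = @_;
  my $str = join ("\0", $feed_url, map { defined $_ ? $_ : '' } @fields);
  return '<md5.' . md5_hex (encode_utf8 ($str)) . '@rss2leafnode.dummy>';
}

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Convert an ISO "2000-01-01T12:00:00Z" time (UTC only) to RFC822 style.
# Anything unrecognised is returned unmodified.
sub iso_to_rfc822 {
  my ($iso) = @_;
  my ($year, $mon, $mday, $hour, $min, $sec)
    = $iso =~ /^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/
    or return $iso;
  # timegm() croaks on out-of-range values, so trap that and pass through
  my $t = eval { timegm ($sec, $min, $hour, $mday, $mon - 1, $year) };
  return $iso unless defined $t;
  my $wday = (gmtime $t)[6];
  return sprintf ('%s, %02d %s %04d %02d:%02d:%02d +0000',
                  $DAYS[$wday], $mday, $MONTHS[$mon - 1], $year,
                  $hour, $min, $sec);
}

print message_id_from_fields ('http://example.org/feed.xml',
                              'Some title', 'Some text'), "\n";
print iso_to_rfc822 ('2000-01-01T12:00:00Z'), "\n";
# second line printed: Sat, 01 Jan 2000 12:00:00 +0000
```

Including the feed URL in the hash means identical item text from two different feeds gives two different Message-IDs.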
Of course all this mapping wouldn't be necessary if RSS had been news to start with. A news server already serves short messages, either read-only or with followups, and if news servers hadn't got a (well deserved) reputation for being a pain to administer, and transferring gigabytes of "full feed" instead of on-demand, then RSS might never have been needed. Of course the other side is that if you're accustomed to HTTP for web pages then everything starts looking like a web page, and if you're used to HTML then an edifice like XML to encapsulate a half dozen bits of text seems like a good idea. :-)
BUGS
The way Message-IDs are checked on the news server means that the server should be set up to retain messages for at least as long as the feed retains items. If that's not so then old articles will be re-posted by the next fetch_rss and will look like new articles to a newsreader.
Letting the news server track articles keeps down the amount of state rss2leafnode must maintain and means multiple users can insert a feed without duplication. But perhaps long running or mothballed feeds will need further repost protection.
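One way such a server-side check could be done with Net::NNTP is sketched below, using its nntpstat() and post() methods. This is an illustration of the idea rather than the actual RSS2Leafnode code, and the function name is made up.

```perl
use strict;
use warnings;
use Net::NNTP;

# Post an article unless the server already holds its Message-ID.
# $nntp is a connected Net::NNTP object, $msgid is like '<abc@example.org>'
# and $lines is an arrayref of header and body lines.
# Returns 1 if the article was posted, 0 if it was already present.
sub post_unless_present {
  my ($nntp, $msgid, $lines) = @_;
  return 0 if $nntp->nntpstat ($msgid);   # STAT succeeds if the article exists
  $nntp->post ($lines)
    or die "Cannot post $msgid: ", $nntp->message;
  return 1;
}

# Usage sketch: with no host argument Net::NNTP consults NNTPSERVER,
# NEWSHOST and libnet.cfg for the server.
#   my $nntp = Net::NNTP->new or die "Cannot connect to news server\n";
#   post_unless_present ($nntp, $msgid, \@article_lines);
#   $nntp->quit;
```

Once the server expires an article the STAT fails again, which is why the retention period matters as described above.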
Some pre-releases of leafnode 2 have trouble with posts to local newsgroups while a fetchnews run is in progress. The local articles don't show up until after a subsequent further fetchnews.
No attention is paid to <atom:updated> for changes in an item. Should an updated item be re-posted? If the <atom:id> changes then that will happen (or if there's no id and the content is different enough to make the MD5 hash change). But id is supposed to stay the same for an update, isn't it?
The way $rss_get_links only gets the immediate link target could perhaps be extended to fetch images, frame parts, etc of a HTML page there and include them in the message as RFC 2557 style "MHTML". Not sure that there are any news readers which would actually display that though.
ENVIRONMENT VARIABLES
NNTPSERVER
NEWSHOST
-
Default news server as per Net::NNTP.
FILES
- ~/.rss2leafnode.conf
-
Configuration file.
- ~/.rss2leafnode.status
-
Status file, recording "last modified" dates for downloads. This can be deleted if something bad seems to have happened to it; the next rss2leafnode run will recreate it.
- /etc/perl/Net/libnet.cfg
- ~/.libnet.cfg
-
Defaults per Net::NNTP and Net::Config.
SEE ALSO
leafnode(8), HTML::FormatText, HTML::FormatText::WithLinks, HTML::FormatExternal, lynx(1), URI::Title, XML::Parser, XML::Liberal, Image::Magick, Net::NNTP, Net::Config
Plagger, feed2imap(1), rss2email(1), rssdrop(1), toursst(1), http://www.gwene.org
HOME PAGE
http://user42.tuxfamily.org/rss2leafnode/index.html
LICENSE
Copyright 2007, 2008, 2009, 2010, 2011 Kevin Ryde
RSS2Leafnode is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3, or (at your option) any later version.
RSS2Leafnode is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with RSS2Leafnode. If not, see http://www.gnu.org/licenses/.