NAME

XML::RSS::FromHTML::Simple - Create RSS feeds for sites that don't offer them

SYNOPSIS

use XML::RSS::FromHTML::Simple;

my $proc = XML::RSS::FromHTML::Simple->new({
    url      => "http://perlmeister.com/art_eng.html",
    rss_file => "new_articles.xml",
});

$proc->link_filter( sub {
    my($link, $text) = @_;

    # Only extract links that contain 'linux-magazine'
    # in their URL
    if( $link =~ m#linux-magazine#) {
        return 1;
    } else {
        return 0;
    }
});

# Create RSS file
$proc->make_rss() or die $proc->error();

ABSTRACT

This module helps create RSS feeds for sites that don't offer them. It examines HTML documents, extracts their links, and puts the links and their textual descriptions into an RSS file.

DESCRIPTION

XML::RSS::FromHTML::Simple helps reel in web pages and create RSS files from them. Typically, it is used to contact websites that display news content in HTML but don't provide RSS files of their own. RSS files are typically used to track the content of frequently changing news websites and to give other programs a way to figure out whether news has arrived.

To create a new RSS generator, call new():

use XML::RSS::FromHTML::Simple;

my $f = XML::RSS::FromHTML::Simple->new({
    url      => "http://perlmeister.com/art_eng.html",
    rss_file => $outfile,
});

url is the URL of a site whose content you'd like to track. rss_file is the name of the resulting RSS file; it defaults to out.xml.

Instead of reeling in a document via HTTP, you can just as well use a local file:

my $f = XML::RSS::FromHTML::Simple->new({
    html_file => "art_eng.html",
    base_url  => "http://perlmeister.com",
    rss_file  => "perlnews.xml",
});

Note that in this case, a base_url is necessary to allow the generator to put fully qualified URLs into the RSS file later.

XML::RSS::FromHTML::Simple creates accessor functions for all of its attributes. Therefore, you could just as well create a boilerplate object and set its properties afterwards:

my $f = XML::RSS::FromHTML::Simple->new();
$f->html_file("art_eng.html");
$f->base_url("http://perlmeister.com");
$f->rss_file("perlnews.xml");

Typically, not all links embedded in the HTML document should be copied to the resulting RSS file. The link_filter() attribute takes a subroutine reference, which decides for each URL whether to process it or ignore it:

$f->link_filter( sub {
    my($url, $text) = @_;

    if($url =~ m#linux-magazine\.com/#) {
        return 1;
    } else {
        return 0;
    }
});

The link_filter subroutine gets called with each URL and its link text, as found in the HTML content. If link_filter returns 1, the link will be added to the RSS file. If link_filter returns 0, the link will be ignored.

In addition to deciding whether the link is RSS-worthy, the filter may also change the value of the URL or the corresponding text by modifying $_[0] or $_[1] directly.
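As a minimal sketch of such a rewriting filter, the subroutine below accepts only linux-magazine.com links, switches their scheme to https, and trims whitespace around the link text. The rewrite rules and the sample URL are illustrative assumptions, not part of the module; only the calling convention (URL in $_[0], text in $_[1]) is taken from the description above.

```perl
use strict;
use warnings;

# Hypothetical filter: the rewrite rules are illustrative only.
my $filter = sub {
    # Reject links that don't point to linux-magazine.com
    return 0 unless $_[0] =~ m#linux-magazine\.com/#;

    # Modify the caller's values in place via the @_ aliases
    $_[0] =~ s#^http://#https://#;   # rewrite the URL
    $_[1] =~ s/^\s+|\s+$//g;         # trim the link text

    return 1;
};

# Stand-alone check: call the filter with a URL and its link text
my($url, $text) = ("http://www.linux-magazine.com/issue/51", "  Perl column ");
my $keep = $filter->($url, $text);
print "$keep $url [$text]\n";
```

The same subroutine reference can be handed to the generator with $f->link_filter($filter).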

To start the RSS generator, run

$f->make_rss() or die $f->error();

which will generate the RSS file. If anything goes wrong, make_rss() returns false and the error() method will tell why it failed.

UTF-8 Woes

XML::RSS::FromHTML::Simple has been designed to handle UTF-8 encoded web pages well, but there are a few gotchas you should be aware of.

If the LWP::UserAgent used by XML::RSS::FromHTML::Simple detects that a web page is utf-8-encoded, it will return its content in utf-8 encoded strings via the decoded_content() method.

This means that if you filter on this content, you need to use utf-8 strings for comparisons, and if you specify strings or regexes literally in your code in utf-8, you'll have to make sure that the use utf8 pragma is set (unless, by the time you're reading this, we have the year 2038 and all source code gets written in utf8 by default).

Also make sure that your regexes handle non-ASCII characters which might occur in those strings. Simon Cozens' "Advanced Perl Programming" has an excellent chapter on how to tackle some of these problems correctly.
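A filter that compares against a non-ASCII literal might look like the following sketch. It assumes the source file itself is saved as UTF-8; the search word and the sample link texts are made up for illustration.

```perl
use strict;
use warnings;
use utf8;   # literals in this file are UTF-8 encoded characters

# Hypothetical filter: keep links whose text mentions "Änderung"
my $filter = sub {
    my($url, $text) = @_;
    return $text =~ /Änderung/ ? 1 : 0;
};

# Simulate decoded character strings as the filter would receive them
my @results = map { $filter->("http://example.com/", $_) }
    ("Neue Änderung im Kernel", "No match here");

print "@results\n";
```

Without the use utf8 pragma, Perl would treat the literal in the regex as a sequence of bytes, which would not match a decoded character string containing the same word.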

Secondly, the current version of LWP has an issue with pages that have UTF-8-encoded data in the HEAD section. It will print a warning like

  Parsing of undecoded UTF-8 will give garbage when decoding entities
  at .../LWP/Protocol.pm line 114.

which can be worked around by setting

my $ua = LWP::UserAgent->new(parse_head => 0);

and providing this resilient user agent to the XML::RSS::FromHTML::Simple constructor:

my $f = XML::RSS::FromHTML::Simple->new({
    url      => "...",
    rss_file => "...",
    ua       => $ua,
});

Note that this relies on the web server sending a header like

Content-Type: text/html; charset=utf-8

or the resulting string won't have the UTF-8 flag set.

Details on this problem are available at

http://www.nntp.perl.org/group/perl.libwww/2007/02/msg6965.html
http://www.nntp.perl.org/group/perl.libwww/2006/08/msg6801.html

in the libwww mailing list archive.
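Whether a given string carries the UTF-8 flag can be checked with Perl's built-in utf8::is_utf8() function, and the core Encode module turns raw bytes into a character string. The sketch below is independent of XML::RSS::FromHTML::Simple and only illustrates the difference between the two kinds of strings; the sample bytes are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as they might arrive without a charset header:
# "Zürich" encoded in UTF-8 (hypothetical sample data)
my $bytes = "Z\xC3\xBCrich";

# Decode the bytes into a character string
my $chars = decode("UTF-8", $bytes);

printf "bytes: flag=%d length=%d\n",
    (utf8::is_utf8($bytes) ? 1 : 0), length($bytes);
printf "chars: flag=%d length=%d\n",
    (utf8::is_utf8($chars) ? 1 : 0), length($chars);
```

The undecoded string counts seven bytes, while the decoded one counts six characters and is the kind of string a filter with use utf8 literals expects.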

DEBUGGING

XML::RSS::FromHTML::Simple is Log::Log4perl-enabled. To figure out what's going on under the hood, simply call

use Log::Log4perl qw(:easy);
Log::Log4perl->easy_init($DEBUG);

before using XML::RSS::FromHTML::Simple. For details on Log4perl, check the http://log4perl.sourceforge.net website.

HISTORY

This module was inspired by Sean Burke's article in TPJ 11/2002. I've discussed its code in the 02/2005 issue of Linux Magazine:

http://www.linux-magazine.com/issue/51/Perl_Collecting_News_Headlines.pdf

There's also XML::RSS::FromHTML on CPAN, which looks like it's offering a more powerful API. The focus of XML::RSS::FromHTML::Simple, on the other hand, is simplicity.

LEGALESE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

2007, Mike Schilli <m@perlmeister.com>
