NAME
HTML::ListScraper - generic web page scraping support
VERSION
Version 0.05
SYNOPSIS
    use HTML::ListScraper;

    $scraper = HTML::ListScraper->new( api_version => 3,
                                       marked_sections => 1 );
    # set up $scraper options...
    $scraper->parse($html);
    $scraper->eof;

    @seq = $scraper->find_sequences;
    $seq = shift @seq;
    if ($seq) { # is-a HTML::ListScraper::Sequence
        foreach $inst ($seq->instances) { # is-a HTML::ListScraper::Instance
            foreach $tag ($inst->tags) { # is-a HTML::ListScraper::Tag
                print "<", $tag->name, ">\n";
                print $tag->text, "\n";
            }
        }
    }
DESCRIPTION
While Perl has good support and is often used for extracting machine-friendly data from HTML pages, most scripts used for that task are ad-hoc, parsing just one site's HTML and depending on superficial, transient details of its structure - and are therefore brittle and labor-intensive to maintain. This module tries to support more generic scraping for a class of pages: those whose most important part is a list of links.
HTML::ListScraper is a subclass of HTML::Parser, building on its ability to convert an octet stream - whether strictly valid HTML or something just vaguely similar to it - to tags and text. HTML parsing works the same as with HTML::Parser, except you don't need to register your own HTML event handlers.
When the document is parsed, call find_sequences to find out which tags in the document repeat, one after the other, more than once (text and comments are ignored for this comparison). Since there'll probably be quite a lot of such sequences, HTML::ListScraper tries to find the "longest one repeating most often"; specifically, it maximizes log(number of non-overlapping runs)*log(number of tags in the sequence). There can be more than one such sequence, which is why the method returns an array (and the array can also be empty - see below). Your application can then iterate over the returned structure to find items of interest.
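The criterion above can be illustrated in a few lines of plain Perl. This is only a sketch of the quoted formula, not the module's implementation; the helper sequence_score and the candidate figures are made up for the example.

```perl
use strict;
use warnings;

# Score from the formula quoted above:
#   log(number of non-overlapping runs) * log(number of tags in the sequence)
# Hypothetical helper, not part of HTML::ListScraper.
sub sequence_score {
    my ($runs, $tags) = @_;
    return log($runs) * log($tags);
}

# Made-up candidates: [ label, runs, tags ]
my @candidates = (
    [ 'short, frequent', 40, 2 ],
    [ 'balanced',        10, 8 ],
    [ 'long, rare',      2,  30 ],
);

# Print the candidates, best score first.
my @ranked = sort {
    sequence_score($b->[1], $b->[2]) <=> sequence_score($a->[1], $a->[2])
} @candidates;

for my $c (@ranked) {
    printf "%-16s runs=%2d tags=%2d score=%.2f\n",
        $c->[0], $c->[1], $c->[2], sequence_score($c->[1], $c->[2]);
}
```

Because log(1) is 0, a candidate with a single run or a single tag scores zero under this formula, so the product favors sequences that are both reasonably long and reasonably frequent.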
This module includes a script, scrape, displaying the sequences found by HTML::ListScraper, so that you can see which items your application needs - and if they aren't there, you can try to tweak HTML::ListScraper's settings with the various scrape switches to make it find more.
HTML::ListScraper methods are as follows:
new
HTML::ListScraper's constructor. Passes all its parameters to the superclass and registers HTML::Parser's event handlers start, text and end.
min_count
Numeric threshold for the frequency of found sequences - get_sequences returns only those which repeat at least min_count times. Call without arguments to get the current value, with an argument to set it. Default (as well as the minimal allowed value) is 2.
shapeless
By default, get_sequences returns only "well-shaped" sequences, whose every opening tag is followed by the appropriate closing tag, with an exception for those tags whose closing tag is optional - for example, <div><br></div> is well-shaped but neither <div><br> nor <br></div> is. Tags which don't need a closing tag are those identified by is_unclosed_tag. Closing tags are paired with the nearest opening tag with the same name which hasn't been paired yet. A well-shaped sequence is basically an HTML fragment - like a tree, except it doesn't have to have a single root.
Well-shaped sequences should be fine when processing valid HTML, but since this module doesn't restrict itself to valid HTML, that isn't always good enough. Setting shapeless to a true value removes this filtering and makes all sequences eligible.
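The well-shaped rule can be sketched in plain Perl as follows. This is an illustration of the rule as described here, not the module's code: is_well_shaped is a hypothetical helper, %optional_close is a small stand-in for the real is_unclosed_tag list, and the sketch assumes that pairing by name is all that matters (it does not check strict nesting).

```perl
use strict;
use warnings;

# Small stand-in for is_unclosed_tag; the real list is derived from the
# HTML 4.01 Transitional DTD and is longer.
my %optional_close = map { $_ => 1 } qw(br hr img input li p);

# Takes tag names in the HTML::ListScraper::Tag convention
# (lowercase, no angle brackets, closing tags start with '/').
sub is_well_shaped {
    my @tags = @_;
    my @open;    # names of opening tags not yet paired

    for my $tag (@tags) {
        if ($tag =~ m{^/(.+)$}) {
            my $name = $1;

            # Pair with the nearest unpaired opening tag of the same name.
            my $found;
            for my $i (reverse 0 .. $#open) {
                if ($open[$i] eq $name) {
                    $found = $i;
                    last;
                }
            }
            return 0 unless defined $found;    # closing tag without an opener
            splice @open, $found, 1;
        }
        else {
            push @open, $tag;
        }
    }

    # Leftover opening tags are acceptable only if their closing tag is optional.
    my $unclosed = grep { !$optional_close{$_} } @open;
    return $unclosed ? 0 : 1;
}

for my $case ([qw(div br /div)], [qw(div br)], [qw(br /div)]) {
    printf "%-12s %s\n", join(' ', @$case),
        is_well_shaped(@$case) ? 'well-shaped' : 'not well-shaped';
}
```

The three cases are the ones from the text above: only the first one passes the check.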
is_unclosed_tag
Test for tag names with an optional closing tag. Takes a tag name, returns true for tags declared in the HTML 4.01 Transitional DTD as having either an optional or no closing tag. Note that overriding this method in a subclass won't change HTML::ListScraper behavior - it delegates to a real implementation deep in this module's guts, which are not documented here.
get_all_tags
Accessor for the document's tag sequence maintained by HTML::ListScraper, used mainly for debugging. Takes no arguments, returns an array (array reference if called in a scalar context) of HTML::ListScraper::Tag objects.
get_sequences
The core of HTML::ListScraper. Takes no arguments, returns an array of HTML::ListScraper::Sequence objects. The sequences are sorted by length (shortest first).
"Sequences" with just 1 tag and sequences which don't repeat are never returned; depending on the value of min_count and shapeless, get_sequences may also ignore other ones (see min_count and shapeless).
find_sequences
A generalization of get_sequences. Like get_sequences, find_sequences takes no arguments and returns an array of HTML::ListScraper::Sequence objects - the same sequences, in fact, as get_sequences, but with potentially more instances. In addition to the exact matches, find_sequences tries to find "approximate" instance matches, that is, tag sequences with a non-zero but low edit distance from the exact sequence.
The alignment uses Algorithm::NeedlemanWunsch (q.v.) in its local mode, with fixed scores whose particular values hopefully don't matter much (see the source of HTML::ListScraper::Sweep if you're really interested in them). Approximate instances are sought between the exact ones, from the most similar to a cut-off point of low similarity.
Found approximate instances are identified by the HTML::ListScraper::Instance::match value approx. Their score is available as the value of HTML::ListScraper::Instance::score. That value isn't always defined, though: if the shapeless flag isn't set, approximate tag sequences are made to look like valid HTML fragments by removing unpaired tags. Since that obviously damages the score, no score is returned for such cut-up instances.
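An application that cares about the difference between exact and approximate instances might handle it roughly like this. The helper describe_instances is hypothetical; it only calls the methods named in this document (instances, tags, name, match, score) and assumes nothing about the match value of exact instances beyond it not being approx.

```perl
use strict;
use warnings;

# Describe each instance of one sequence. $seq is expected to behave like an
# HTML::ListScraper::Sequence as returned by find_sequences.
# Hypothetical helper, not part of the module.
sub describe_instances {
    my ($seq) = @_;
    my @lines;

    for my $inst ($seq->instances) {
        my $tags  = join ' ', map { $_->name } $inst->tags;
        my $match = $inst->match;

        if (defined $match && $match eq 'approx') {
            # The score is undefined for instances that were cut up
            # to look like valid HTML fragments.
            my $score = $inst->score;
            my $shown = defined $score ? $score : 'n/a';
            push @lines, "approx (score $shown): $tags";
        }
        else {
            push @lines, "exact: $tags";
        }
    }

    return @lines;
}

# Typical use, assuming $scraper has already parsed a document:
#   for my $seq ($scraper->find_sequences) {
#       print "$_\n" for describe_instances($seq);
#   }
```

Checking that the score is defined before using it avoids warnings when the shapeless flag is off.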
get_known_sequence
When the "longest sequence repeating most often" found by HTML::ListScraper isn't quite the sought one, you can specify exactly which one you want by calling get_known_sequence instead of get_sequences. get_known_sequence takes a list of tag names spelled using the same convention as HTML::ListScraper::Tag, i.e. in lowercase, without angle brackets and with closing tags having '/' as the first character. If the parsed document doesn't contain the specified sequence, get_known_sequence returns undef. Otherwise, it returns an instance of HTML::ListScraper::Sequence.
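If the wanted sequence is at hand as markup rather than as tag names, it can be converted to the naming convention described above with a small helper. The function to_tag_names below is hypothetical and not part of the module; it simply lowercases the name, keeps a leading '/' for closing tags, and drops attributes.

```perl
use strict;
use warnings;

# Convert markup-style tags ("<LI>", "</A>") to the lowercase, bracket-free
# names expected by get_known_sequence ("li", "/a").
# Hypothetical helper; attributes, if present, are ignored.
sub to_tag_names {
    my @names;
    for my $tag (@_) {
        $tag =~ m{^\s*<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)}
            or die "not a tag: $tag\n";
        push @names, lc "$1$2";
    }
    return @names;
}

my @names = to_tag_names('<LI>', '<A HREF="x">', '</A>', '</LI>');
print "@names\n";    # prints "li a /a /li"

# Typical use, assuming $scraper has already parsed a document:
#   my $seq = $scraper->get_known_sequence(@names);
#   print "sequence not found\n" unless defined $seq;
```

Since get_known_sequence returns undef when the sequence is absent, the result should be tested before calling methods on it.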
find_known_sequence
A generalization of get_known_sequence. Like get_known_sequence, find_known_sequence takes a list of tag names, but it finds both exact and approximate matches for it. If the parsed document doesn't contain at least one tag sequence matching at least approximately, find_known_sequence returns undef. Otherwise, it returns an instance of HTML::ListScraper::Sequence.
on_start
Attribute start handler. Registered with signature self, tagname, attr, although the only attribute preserved by HTML::ListScraper is href. For ultimate flexibility in preprocessing the input HTML, you can override this method in a subclass, but do call the base version at least conditionally. Note that if you want to just ignore some tags, there are simpler ways, e.g. HTML::Parser::ignore_tags.
on_text
Text handler. Registered with signature self, dtext. For ultimate flexibility in preprocessing the input HTML, you can override this method in a subclass, but do call the base version at least conditionally.
on_end
Attribute end handler. Registered with signature self, tagname. For ultimate flexibility in preprocessing the input HTML, you can override this method in a subclass.
BUGS
Requires too much configuration.
AUTHOR
Vaclav Barta, <vbar@comp.cz>
COPYRIGHT & LICENSE
Copyright 2007 Vaclav Barta, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.