NAME
scrape - command-line frontend to HTML::ListScraper
SYNOPSIS
scrape --core=all sample.html
scrape --core=list [ --min-count=10 ] [ --detail=all ] [ --shapeless ]
[ --ignore=b,i,em,strong,wbr ] [ --export=seq.txt ] sample.html
scrape --core=item --import=seq.txt sample.html
scrape --whole sample.html
scrape --core=all --detail=all --acquire=Perl.html
'http://search.yahoo.com/search?p=Perl'
DESCRIPTION
This script processes an HTML page with HTML::ListScraper and shows the result as YAML (down to the tag sequences, which are YAML scalars formatted by HTML::ListScraper::Interactive). It's meant for interactive exploration of HTML::ListScraper results and for fine-tuning its settings for a specific scraping application.
Every invocation requires a single source, either a file or a URL. URLs are recognized by their http scheme: source names that don't start with http:// are normally interpreted as file names. All other command-line switches are optional and mutually independent. Note that with no switches, the script doesn't output anything. The switches are as follows:
core
Show found repeats. Value is a string, one of
- item (or just "i"): Show only the first sequence instance.
- list (or just "l"): Show all instances of the first sequence.
- all (or just "a"): Show all instances of all found sequences.
By default, no matches are shown. When they are shown, a YAML document, corresponding to an HTML::ListScraper::Sequence, has the sequence length as YAML field len, the repeat count as count, and a YAML sequence with items corresponding to HTML::ListScraper::Instance. Each item starts with a field, keyed by the value of HTML::ListScraper::Instance::match, whose value is the start position, followed by score (for approximate matches only) and inst with the actual tag sequence. The tag sequence is formatted by HTML::ListScraper::Interactive::format_tags, with formatting options depending on the value of the --detail command-line switch.
shapeless
Boolean switch; sets HTML::ListScraper::shapeless to true.
min-count
Value is an integer greater than 1, used to set HTML::ListScraper::min_count.
detail
Specifies formatting of found tag sequences. Value is a string, one of
- none: Don't show the matches at all. This is useful to see just how many sequences were found, how many instances they have and where.
- Show just the tags, without text and links. This is the default value.
- text: Show tags and text.
- attributes: Show tags with links.
- all: Show all content fields of HTML::ListScraper::Tag: tags, text and links.
whole
Boolean switch. When used, scrape outputs the whole sequence maintained by HTML::ListScraper as the first YAML document, which contains a single YAML scalar. Note that the sequence is formatted without attributes, without text and with tag positions, irrespective of the value of --detail.
ignore
A comma-separated list of tags the HTML parser should ignore. The list items must not contain slashes or angle brackets. For every name in the list, both the opening and the closing tag are ignored. The default is b, i, em, strong; when specifying the value explicitly, you probably want to include these tags in it.
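Because an explicit --ignore value replaces the default rather than adding to it, one way to add a tag is to append it to the documented default list. The following is a minimal shell sketch, not part of scrape itself: the variable names and the valid_ignore_list helper are hypothetical, and the command is only printed, not run.

```shell
# Default list as documented for --ignore.
default_ignore="b,i,em,strong"
# Example addition (wbr appears in the SYNOPSIS).
extra_ignore="wbr"
ignore="$default_ignore,$extra_ignore"

# Hypothetical helper: succeed only if the list contains
# no slashes and no angle brackets.
valid_ignore_list() {
    case "$1" in
        */*|*'<'*|*'>'*) return 1 ;;
        *) return 0 ;;
    esac
}

# Print the resulting command line instead of executing it.
if valid_ignore_list "$ignore"; then
    echo "scrape --core=list --ignore=$ignore sample.html"
fi
```

Writing the names without any markup (wbr rather than <wbr> or /wbr) is enough, since each name covers both the opening and the closing tag.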
export
Instructs scrape to dump the first found sequence into the file specified by the option's value. If the file already exists, it's overwritten. When no sequence is found, nothing is dumped. Note that the sequence is formatted with just tags, irrespective of the value of --detail.
import
Instructs scrape to call HTML::ListScraper::find_known_sequence instead of HTML::ListScraper::find_sequences, with arguments read from the file specified by the option's value. Lines of that file are converted to tag names by HTML::ListScraper::Interactive::canonicalize_tags.
acquire
Instructs scrape to save the downloaded HTML into the file specified by the option's value. If the file already exists, it's overwritten. Using this switch causes scrape to interpret the source as a URL, irrespective of its scheme, and pass it to LWP.
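Taken together with the http:// rule from the DESCRIPTION, the source is treated as a URL either when --acquire is given or when its name starts with http://. The shell sketch below only illustrates that decision; classify_source is a hypothetical helper, not something scrape provides, and it assumes the prefix test is a literal match on http://.

```shell
# Hypothetical helper: report how a source name would be treated.
#   $1 = source name
#   $2 = value of --acquire (empty or omitted if the switch is not used)
classify_source() {
    if [ -n "${2:-}" ]; then
        echo url    # --acquire forces URL interpretation
        return
    fi
    case "$1" in
        http://*) echo url ;;    # assumption: literal http:// prefix
        *)        echo file ;;
    esac
}

classify_source sample.html
classify_source 'http://search.yahoo.com/search?p=Perl'
classify_source search.yahoo.com Perl.html
```

The third call shows the effect of --acquire: a name without the http:// prefix is still handled as a URL.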
AUTHOR
Vaclav Barta, <vbar@comp.cz>
COPYRIGHT & LICENSE
Copyright 2007 Vaclav Barta, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.