NAME
HTML::ExtractText::Extra - extra useful HTML::ExtractText
SYNOPSIS
At its simplest; use CSS selectors:
# Same usage as HTML::ExtractText, but now we have extra
# optional options (default values are shown):
use HTML::ExtractText::Extra;
my $ext = HTML::ExtractText::Extra->new(
whitespace => 1, # strip leading/trailing whitespace
nbsp => 1, # replace non-breaking spaces with regular ones
);
$ext->extract(
{
page_title => 'title', # same extraction as HTML::ExtractText
links => ['a', qr{http://|www\.} ], # strip what matches
bold => ['b', sub { "<$_[0]>"; } ], # wrap what's found in <>
},
$html,
) or die "Error: $ext";
print "Page title is $ext->{page_title}\nLinks are: $ext->{links}";
DESCRIPTION
The module offers extra options and post-processing that the vanilla HTML::ExtractText does not provide.
METHODS FROM HTML::ExtractText
This module offers all the standard methods and behaviour HTML::ExtractText provides. See its documentation for details.
EXTRA OPTIONS IN ->new
my $ext = HTML::ExtractText::Extra->new(
whitespace => 1, # strip leading/trailing whitespace
nbsp => 1, # replace non-breaking spaces with regular ones
);
whitespace
my $ext = HTML::ExtractText::Extra->new(
whitespace => 1,
);
Optional. Defaults to: 1. When set to a true value,
leading and trailing whitespace will be trimmed from the results.
nbsp
my $ext = HTML::ExtractText::Extra->new(
nbsp => 1,
);
Optional. Defaults to: 1. When set to a true value,
non-breaking spaces in the results will be converted into regular spaces.
Note that this does not affect how the normal white-space folding
operates, so foo bar will end up having 3 spaces between
foo and bar.
EXTRA PROCESSING OPERATIONS IN ->extract
$ext->extract(
{
page_title => 'title', # same extraction as HTML::ExtractText
links => ['a', qr{http://|www\.} ], # strip what matches
bold => ['b', sub { "<$_[0]>"; } ], # wrap what's found in <>
},
$html,
) or die "Error: $ext";
This module extends possible values in the hashref given as the first
argument to ->extract method. They are given by changing
the string containing the selector to an arrayref, where the first element
is the selector you want to match and the rest of the elements are as
follows:
Regex reference
$ext->extract({ links => ['a', qr{http://|www\.} ] }, $html )
When second element of the arrayref is a regex reference, any text that matches the regex will be stripped from the text that is being extracted.
Code reference
$ext->extract({ links => ['a', sub { "<$_[0]>"; } ] }, $html )
When second element of the arrayref is a code reference, it will be
called for each found bit of text we're extracting and its @_ will
contain that text as the first element. Whatever the sub returns will
be used as the result of extraction.
ACCESSORS
whitespace
$ext->whitespace(0);
Accessor method for the whitespace argument to ->new.
nbsp
$ext->nbsp(0);
Accessor method for the nbsp argument to ->new.
SEE ALSO
HTML::ExtractText - a basic version of this extractor
Mojo::DOM, Text::Balanced, HTML::Extract
REPOSITORY
BUGS
To report bugs or request features, please use https://github.com/zoffixznet/HTML-ExtractText-Extra/issues
If you can't access GitHub, you can email your request
to bug-html-extracttext-extra at rt.cpan.org
AUTHOR
LICENSE
You can use and distribute this module under the same terms as Perl itself.
See the LICENSE file included in this distribution for complete
details.