NAME

HTML::ExtractText::Extra - extra useful HTML::ExtractText

SYNOPSIS

At its simplest, use CSS selectors:

# Same usage as HTML::ExtractText, but now we have extra
# optional options (default values are shown):
use HTML::ExtractText::Extra;
my $ext = HTML::ExtractText::Extra->new(
    whitespace => 1, # strip leading/trailing whitespace
    nbsp       => 1, # replace non-breaking spaces with regular ones
);

$ext->extract(
    {
        page_title => 'title', # same extraction as HTML::ExtractText
        links => ['a', qr{http://|www\.} ], # strip what matches
        bold  => ['b', sub { "<$_[0]>"; } ], # wrap what's found in <>
    },
    $html,
) or die "Error: $ext";
print "Page title is $ext->{page_title}\nLinks are: $ext->{links}";

DESCRIPTION

The module offers extra options and post-processing that the vanilla HTML::ExtractText does not provide.

METHODS FROM HTML::ExtractText

This module offers all the standard methods and behaviour HTML::ExtractText provides. See its documentation for details.

EXTRA OPTIONS IN ->new

my $ext = HTML::ExtractText::Extra->new(
    whitespace => 1, # strip leading/trailing whitespace
    nbsp       => 1, # replace non-breaking spaces with regular ones
);

whitespace

my $ext = HTML::ExtractText::Extra->new(
    whitespace => 1,
);

Optional. Defaults to: 1. When set to a true value, leading and trailing whitespace will be trimmed from the results.

nbsp

my $ext = HTML::ExtractText::Extra->new(
    nbsp => 1,
);

Optional. Defaults to: 1. When set to a true value, non-breaking spaces in the results will be converted into regular spaces. Note that this does not affect how the normal whitespace folding operates, so foo &nbsp; bar will end up with three spaces between foo and bar.
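As a rough illustration of the two options, the following plain-Perl sketch mimics the described behaviour on a single string. It does not use the module and is not its implementation; the helper name tidy_text and the order of the two steps are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::ExtractText::Extra: applies the
# two clean-ups described above to one already-extracted string.
sub tidy_text {
    my ( $text, %opt ) = @_;

    # nbsp: turn each non-breaking space (U+00A0) into a regular space
    $text =~ tr/\x{A0}/ / if $opt{nbsp};

    # whitespace: trim leading and trailing whitespace only;
    # whitespace in the middle of the string is left alone
    $text =~ s/^\s+|\s+$//g if $opt{whitespace};

    return $text;
}

# "foo &nbsp; bar" after entity decoding: space, U+00A0, space
my $raw  = "  foo \x{A0} bar  ";
my $tidy = tidy_text( $raw, whitespace => 1, nbsp => 1 );

print "[$tidy]\n";    # three regular spaces remain between foo and bar
```

The three inner spaces survive because trimming only touches the ends of the string, which matches the note above about whitespace folding.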

EXTRA PROCESSING OPERATIONS IN ->extract

$ext->extract(
    {
        page_title => 'title', # same extraction as HTML::ExtractText
        links => ['a', qr{http://|www\.} ],  # strip what matches
        bold  => ['b', sub { "<$_[0]>"; } ], # wrap what's found in <>
    },
    $html,
) or die "Error: $ext";

This module extends the values accepted in the hashref given as the first argument to the ->extract method. Instead of a plain selector string, a value can be an arrayref whose first element is the selector you want to match; the remaining elements are interpreted as follows:

Regex reference

$ext->extract({ links => ['a', qr{http://|www\.} ] }, $html );

When the second element of the arrayref is a regex reference, any text that matches the regex will be stripped from the text that is being extracted.
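The stripping can be pictured with a small plain-Perl sketch. It does not call the module: @found is made-up sample data standing in for the text of the matched elements, and removing every match (the /g flag) rather than only the first is an assumption of this example.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the text of each matched <a> element
my @found = ( 'http://example.com/', 'www.example.org', 'plain text' );

# The same regex as in the example above
my $strip = qr{http://|www\.};

# Remove whatever the regex matches from a copy of each string.
# Assumption: every match is removed, hence the /g flag.
my @links = map { ( my $text = $_ ) =~ s/$strip//g; $text } @found;

print join( ', ', @links ), "\n";    # example.com/, example.org, plain text
```

Text that does not match the regex, such as the third item, passes through unchanged.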

Code reference

$ext->extract({ links => ['a', sub { "<$_[0]>"; } ] }, $html );

When the second element of the arrayref is a code reference, it will be called once for each piece of extracted text, with that text as the first element of @_. Whatever the sub returns will be used as the result of the extraction.
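The calling convention can be sketched in plain Perl as follows. This is only an illustration of the description above, not the module's code; @found is made-up sample data standing in for the text of the matched elements.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the text of each matched <b> element
my @found = ( 'first', 'second' );

# The same callback as in the example above: wrap the text in <>
my $wrap = sub { "<$_[0]>" };

# Call the sub once per piece of text; its return value replaces the text
my @bold = map { $wrap->($_) } @found;

print "@bold\n";    # <first> <second>
```

Because the return value is used as-is, the sub can do any transformation, for example lowercasing or truncating the text.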

ACCESSORS

whitespace

$ext->whitespace(0);

Accessor method for the whitespace argument to ->new.

nbsp

$ext->nbsp(0);

Accessor method for the nbsp argument to ->new.

SEE ALSO

HTML::ExtractText - a basic version of this extractor

Mojo::DOM, Text::Balanced, HTML::Extract

REPOSITORY

Fork this module on GitHub: https://github.com/zoffixznet/HTML-ExtractText-Extra

BUGS

To report bugs or request features, please use https://github.com/zoffixznet/HTML-ExtractText-Extra/issues

If you can't access GitHub, you can email your request to bug-html-extracttext-extra at rt.cpan.org

AUTHOR

ZOFFIX

LICENSE

You can use and distribute this module under the same terms as Perl itself. See the LICENSE file included in this distribution for complete details.