NAME

HTML::ExtractText::Extra - extra useful HTML::ExtractText

SYNOPSIS

At its simplest, use CSS selectors:

# Same usage as HTML::ExtractText, but now we have extra
# optional options (default values are shown):
use HTML::ExtractText::Extra;
my $ext = HTML::ExtractText::Extra->new(
    whitespace => 1, # strip leading/trailing whitespace
    nbsp       => 1, # replace non-breaking spaces with regular ones
);

$ext->extract(
    {
        page_title => 'title', # same extraction as HTML::ExtractText
        links => ['a', qr{http://|www\.} ], # strip what matches
        bold  => ['b', sub { "<$_[0]>"; } ], # wrap what's found in <>
    },
    $html,
) or die "Error: $ext";
print "Page title is $ext->{page_title}\nLinks are: $ext->{links}";

DESCRIPTION

The module offers extra options and post-processing that the vanilla HTML::ExtractText does not provide.

METHODS FROM `HTML::ExtractText`

This module offers all the standard methods and behaviour HTML::ExtractText provides. See its documentation for details.

EXTRA OPTIONS IN `->new`

my $ext = HTML::ExtractText::Extra->new(
    whitespace => 1, # strip leading/trailing whitespace
    nbsp       => 1, # replace non-breaking spaces with regular ones
);

`whitespace`

my $ext = HTML::ExtractText::Extra->new(
    whitespace => 1,
);

Optional. Defaults to: 1. When set to a true value, leading and trailing whitespace will be trimmed from the results.
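As a rough illustration of the trimming described above, here is a minimal standalone sketch in core Perl. The helper name `trim_result` is hypothetical and is not part of this module; it only models the described effect on a single extracted string and makes no claim about how the module implements it.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: models what a true
# `whitespace` option is described as doing to one extracted string.
sub trim_result {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;    # remove leading and trailing whitespace only
    return $text;
}

my $trimmed = trim_result("  \n Hello, world \t ");
print "[$trimmed]\n";           # [Hello, world]
```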

`nbsp`

my $ext = HTML::ExtractText::Extra->new(
    nbsp => 1,
);

Optional. Defaults to: 1. When set to a true value, non-breaking spaces in the results will be converted into regular spaces. Note that this does not affect how the normal whitespace folding operates, so `foo&nbsp;&nbsp;&nbsp;bar` will end up having three spaces between `foo` and `bar`.
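The point about folding can be seen in a small standalone sketch. It assumes the conversion is a plain one-for-one character replacement; `nbsp_to_space` is a hypothetical helper written for this illustration, not a function the module exports.

```perl
use strict;
use warnings;

# Hypothetical helper: models the described effect of a true `nbsp` option.
sub nbsp_to_space {
    my ($text) = @_;
    $text =~ tr/\x{A0}/ /;    # U+00A0 (no-break space) becomes U+0020
    return $text;
}

# Three non-breaking spaces between "foo" and "bar"
my $converted = nbsp_to_space("foo\x{A0}\x{A0}\x{A0}bar");
print length($converted), "\n";    # 9: all three spaces survive, none are folded
```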

EXTRA PROCESSING OPERATIONS IN `->extract`

$ext->extract(
    {
        page_title => 'title', # same extraction as HTML::ExtractText
        links => ['a', qr{http://|www\.} ],  # strip what matches
        bold  => ['b', sub { "<$_[0]>"; } ], # wrap what's found in <>
    },
    $html,
) or die "Error: $ext";

This module extends the values accepted in the hashref passed as the first argument to the ->extract method. Instead of a plain string containing the selector, a value can be an arrayref whose first element is the selector you want to match and whose second element is one of the following:

Regex reference

$ext->extract({ links => ['a', qr{http://|www\.} ] }, $html )

When the second element of the arrayref is a regex reference, any text that matches the regex will be stripped from the text that is being extracted.
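To make the stripping concrete, the following standalone sketch applies the same regex to two sample strings. It assumes every match is removed (a global substitution); `strip_matches` is a hypothetical stand-in for the post-processing step, and the sample URLs are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the post-processing step: remove every match
# of the supplied regex from one extracted string.
sub strip_matches {
    my ($text, $re) = @_;
    $text =~ s/$re//g;
    return $text;
}

my @links = map { strip_matches($_, qr{http://|www\.}) }
    'http://www.example.com', 'www.perl.org';
print join(', ', @links), "\n";    # example.com, perl.org
```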

Code reference

$ext->extract({ links => ['a', sub { "<$_[0]>"; } ] }, $html )

When the second element of the arrayref is a code reference, it will be called for each piece of text being extracted, with that text as the first element of its @_. Whatever the sub returns will be used as the result of the extraction.
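The calling convention can be modelled outside the module with a short sketch. `apply_callback` is a hypothetical helper used only for illustration; it calls the sub once per string, passes the text as `$_[0]`, and keeps the return value.

```perl
use strict;
use warnings;

# Hypothetical stand-in: call the callback once per extracted string,
# passing the text as $_[0], and keep whatever it returns.
sub apply_callback {
    my ($cb, @texts) = @_;
    return map { $cb->($_) } @texts;
}

my @bold = apply_callback(sub { "<$_[0]>" }, 'foo', 'bar');
print "@bold\n";    # <foo> <bar>
```

Because the return value replaces the extracted text, a sub that returns nothing useful would discard the text it was given.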

ACCESSORS

`whitespace`

$ext->whitespace(0);

Accessor method for the whitespace argument to ->new.

`nbsp`

$ext->nbsp(0);

Accessor method for the nbsp argument to ->new.

REPOSITORY

Fork this module on GitHub: https://github.com/zoffixznet/HTML-ExtractText-Extra

BUGS

To report bugs or request features, please use https://github.com/zoffixznet/HTML-ExtractText-Extra/issues

If you can't access GitHub, you can email your request to bug-html-extracttext-extra at rt.cpan.org

AUTHOR

ZOFFIX

LICENSE

You can use and distribute this module under the same terms as Perl itself. See the LICENSE file included in this distribution for complete details.
