NAME

HTML::Feature - Extract Feature Sentences From HTML Documents

SYNOPSIS

    use HTML::Feature;

    my $f = HTML::Feature->new(enc_type => 'utf8');
    my $result = $f->parse('http://www.perl.com');

    # or $f->parse($html);

    print "Title:"        , $result->title(), "\n";
    print "Description:"  , $result->desc(),  "\n";
    print "Featured Text:", $result->text(),  "\n";

    # a simpler way:

    use HTML::Feature qw(feature);
    print scalar feature('http://www.perl.com');

    # very simple!

DESCRIPTION

This module extracts blocks of feature sentences from an HTML document.

Unlike other modules that perform similar tasks, this module by default extracts blocks without morphological analysis; it relies on simple statistical processing instead.

Because of this, HTML::Feature has an advantage over other similar modules in that it can be applied to documents in any language.

METHODS

new()

    my $f = HTML::Feature->new(%param);

    # for example:
    my $f = HTML::Feature->new(
        engine     => $class,              # backend engine module (default: 'TagStructure')
        max_bytes  => 5000,                # maximum number of bytes per node to analyze (default: '')
        min_bytes  => 10,                  # minimum number of bytes per node to analyze (default: '')
        enc_type   => 'euc-jp',            # encoding of return values (default: 'utf-8')
        http_proxy => 'http://proxy:3128', # HTTP proxy server (default: '')
    );

Instantiates a new HTML::Feature object. Takes the following parameters:

engine

Specifies the class name of the engine that you want to use.

HTML::Feature is designed to accept different engines that change its behavior. To customize the behavior of HTML::Feature, specify your own engine in this parameter.

The rest of the arguments are directly passed to the HTML::Feature::Engine object constructor.

parse()

    my $result = $f->parse($url);
    # or
    my $result = $f->parse($html);
    # or
    my $result = $f->parse($http_response);

Parses the given argument, which can be a URL, a string of HTML, or an HTTP::Response object. HTML::Feature detects the type of the argument and delegates to the appropriate method (see below).

parse_url($url)

Parses a URL. This method uses LWP::UserAgent to fetch the given URL.

parse_html($html)

Parses a string containing HTML.

parse_response($http_response)

Parses an HTTP::Response object.
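
The delegation that parse() performs among these three methods can be pictured with the sketch below. It is only an illustration under stated assumptions: guess_parse_method() is a hypothetical helper that is not part of HTML::Feature, and the URL test it uses (a string starting with http:// or https://) is a guess, not the module's actual detection logic.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper, not part of HTML::Feature: returns the name of the
# documented method that parse() might delegate to for a given argument.
sub guess_parse_method {
    my ($arg) = @_;

    # An object that is (or inherits from) HTTP::Response.
    return 'parse_response'
        if blessed($arg) && $arg->isa('HTTP::Response');

    # A plain string that looks like an absolute http(s) URL (assumed rule).
    return 'parse_url'
        if defined $arg && !ref($arg) && $arg =~ m{\A https?:// \S+ \z}x;

    # Anything else is treated as a string of HTML.
    return 'parse_html';
}

print guess_parse_method('http://www.perl.com'), "\n";
print guess_parse_method('<html><body>Hello</body></html>'), "\n";
```

If the kind of input is already known, calling parse_url(), parse_html(), or parse_response() directly avoids relying on any detection at all.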

extract()

    $data = $f->extract(url => $url);
    # or
    $data = $f->extract(string => $html);

HTML::Feature::extract() has been deprecated and exists for backwards compatibility only. Use HTML::Feature::parse() instead.

extract() extracts blocks of feature sentences from the given document, and returns a data structure like this:

    $data = {
        title => $title,
        description => $desc,
        block => [
            {
                contents => $contents,
                score => $score
            },
            ...
        ]
    };
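
A structure of this shape can be walked with ordinary Perl data-structure code. The sketch below uses made-up sample values in place of a real extract() result, and blocks_by_score() is a hypothetical helper. Sorting from highest to lowest score is an assumption made for illustration; the documentation does not say how scores should be interpreted.

```perl
use strict;
use warnings;

# Made-up sample data in the documented shape; real values would come
# from $f->extract(...).
my $data = {
    title       => 'Example page',
    description => 'An example description',
    block       => [
        { contents => 'Navigation links',     score => 0.5  },
        { contents => 'Main article text',    score => 12.3 },
        { contents => 'Footer and copyright', score => 1.2  },
    ],
};

# Hypothetical helper: return the blocks ordered from highest to lowest
# score. Call it in list context.
sub blocks_by_score {
    my ($data) = @_;
    return sort { $b->{score} <=> $a->{score} } @{ $data->{block} || [] };
}

print "Title: $data->{title}\n";
for my $block (blocks_by_score($data)) {
    printf "%6.1f  %s\n", $block->{score}, $block->{contents};
}
```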

feature()

feature() is a simple wrapper that performs new() and parse() in one step. If you do not need anything more complex, calling it is sufficient. In scalar context, it returns the featured text only. In list context, it returns additional metadata as a hash.

This function is exported on demand.

    use HTML::Feature qw(feature);
    print scalar feature($url);  # print featured text

    my %data = feature($url);    # list context: returns a hash
    print $data{title};
    print $data{desc};
    print $data{text};
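
Context-sensitive return values of this kind are commonly written with Perl's built-in wantarray. The sketch below shows the general pattern only: feature_like() is a hypothetical stand-in that returns canned values, and it is not how feature() is necessarily implemented.

```perl
use strict;
use warnings;

# Hypothetical stand-in for feature(); it returns canned values instead of
# parsing a document.
sub feature_like {
    my %data = (
        title => 'Example title',
        desc  => 'Example description',
        text  => 'Example featured text',
    );

    # wantarray is true in list context and false in scalar context.
    return wantarray ? %data : $data{text};
}

my $text = feature_like();    # scalar context: the text only
my %all  = feature_like();    # list context: the whole hash

print "$text\n";
print join(', ', sort keys %all), "\n";
```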

AUTHOR

Takeshi Miki <miki@cpan.org>

Special thanks to Daisuke Maki

COPYRIGHT AND LICENSE

Copyright (C) 2007 Takeshi Miki

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.