NAME

HTML::Feature - an extractor of feature sentences from HTML

SYNOPSIS

    use strict;
    use HTML::Feature;
    use Data::Dumper;
    use Encode;
    binmode STDOUT, ':encoding(utf-8)';

    my $f = HTML::Feature->new;
    my $data = $f->extract( url => 'http://www.perl.com' );

    # print result data

    my $boundary = "-" x 40;

    print "\n";
    print $boundary, "\n";
    print "* TITLE:\n";
    print $boundary, "\n";
    print $data->{title}, "\n";

    print "\n";
    print $boundary, "\n";
    print "* DESCRIPTION:\n";
    print $boundary, "\n";
    print $data->{description}, "\n";

    my $i = 0;
    for(@{$data->{block}}){
        $i++;
        print "\n";
        print $boundary, "\n";
        print "* CONTENTS-$i:\n";
        print $boundary, "\n";
        print $_->{contents}, "\n";
        print $boundary, "\n";
        print "SCORE:",$_->{score},"\n";
    }

    # print more details
    print Dumper($data);

DESCRIPTION

This module extracts feature sentences from an HTML document.

First, the HTML document is divided into several blocks along certain boundaries.

Each block is then evaluated individually.

The score of a block is determined by its size (the number of bytes) and by a per-tag coefficient.

The tag coefficients are optional and can be set to arbitrary values.

Note that this module is not designed to extract feature sentences from pages that consist mainly of lists of links (for example, the top page of a portal site).

It works best on pages that contain a reasonable amount of running text (for example, a news page or a blog).
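
The scoring idea described above can be illustrated with a minimal sketch. This is not the module's actual implementation. It assumes that a block's score is its byte length multiplied by the coefficient of every tag found in the block; the documentation does not state how the two are combined. The helper `score_block`, the `%tag_score` table, and the sample blocks are all hypothetical names introduced for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical per-tag coefficients, mirroring the tag_score option.
my %tag_score = (
    a      => 0.90,
    b      => 1.05,
    strong => 1.05,
    h1     => 1.2,
    h2     => 1.1,
    h3     => 1.05,
);

# Assumed scoring rule (not taken from the module's source):
# start from the number of bytes in the block's text and multiply
# by the coefficient of each tag that appears in the block.
sub score_block {
    my ( $text, @tags ) = @_;
    my $score = length Encode::encode_utf8($text);    # size in bytes
    for my $tag (@tags) {
        # Tags without a coefficient leave the score unchanged.
        $score *= exists $tag_score{$tag} ? $tag_score{$tag} : 1;
    }
    return $score;
}

# Hypothetical blocks: a short list of links and a longer article body.
my @blocks = (
    { text => 'Home News Sports Weather',           tags => [qw(a a a a)] },
    { text => 'Perl 5.8.8 has been released. ' x 5, tags => [qw(h1 strong)] },
);

$_->{score} = score_block( $_->{text}, @{ $_->{tags} } ) for @blocks;

# Highest score first, similar in spirit to returning the top ret_num blocks.
my @ranked = sort { $b->{score} <=> $a->{score} } @blocks;
printf "%.2f  %s\n", $_->{score}, substr( $_->{text}, 0, 30 ) for @ranked;
```

Under this assumed rule, a block made mostly of links is penalised by the coefficient of `a` being below 1, while a longer block with headings and emphasis is boosted, which matches the note that pages consisting of link lists are a poor fit.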

METHODS

new([options])

Creates a new HTML::Feature object using the given options.

extract(url => $url | string => $string)

Returns a hash reference containing the title, the description, and the feature blocks (each with its contents and score).

OPTIONS

    # default values can be overridden by passing options to the constructor
    my $f = HTML::Feature->new(
        # set the coefficient for each tag
        tag_score => {
            a => 0.90,
            b => 1.05,
            strong => 1.05,
            h1 => 1.2,
            h2 => 1.1,
            h3 => 1.05
        },
        # set the number of blocks to return
        ret_num => 3,
        # set the candidate character encodings to try when guessing
        suspects_enc => ['euc-jp', 'shiftjis', '7bit-jis']
    );

SEE ALSO

HTML::TokeParser, HTML::Entities, Encode::Guess

AUTHOR

Takeshi Miki <miki@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2007 Takeshi Miki

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.
