NAME

HTML::Feature - extracts feature sentences from HTML

SYNOPSIS

use strict;
use HTML::Feature;
use Data::Dumper;
use Encode;
binmode STDOUT, ':encoding(utf-8)';

my $f = HTML::Feature->new;
my $data = $f->extract( url => 'http://www.perl.com' );

# print result data

my $boundary = "-" x 40;

print "\n";
print $boundary, "\n";
print "* TITLE:\n";
print $boundary, "\n";
print $data->{title}, "\n";

print "\n";
print $boundary, "\n";
print "* DESCRIPTION:\n";
print $boundary, "\n";
print $data->{description}, "\n";

my $i = 0;
for(@{$data->{block}}){
    $i++;
    print "\n";
    print $boundary, "\n";
    print "* CONTENTS-$i:\n";
    print $boundary, "\n";
    print $_->{contents}, "\n";
    print $boundary, "\n";
    print "SCORE:",$_->{score},"\n";
}

# print more details
print Dumper($data);

DESCRIPTION

This module extracts feature sentences from HTML.

First, the HTML document is divided into multiple blocks along certain boundary lines.

Each block is then evaluated individually.

The score of each block is determined by the size of its text (the number of bytes) and by the coefficients of its tags.

Optionally, the coefficient of any tag can be set to an arbitrary value.

Note that this module is not designed to extract feature sentences from a page that consists mostly of a list of links (for example, the top page of a portal site).

It tends to work well on pages that contain a reasonable amount of running text (for example, a news page or a blog).
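
The scoring idea described above can be illustrated with a small standalone sketch. It is not the module's actual implementation: the way the byte count and the tag coefficients are combined (simple multiplication), the helper name score_block, and the sample blocks are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical coefficients, modeled on some of the defaults listed under OPTIONS.
my %tag_score = ( a => 0.85, strong => 1.15, h1 => 2 );

# Score one block: start from the byte length of its text, then multiply
# by the coefficient of each tag attached to it (unlisted tags count as 1).
# Assumption: the real module may combine these factors differently.
sub score_block {
    my ( $text, @tags ) = @_;
    my $score = length encode_utf8($text);
    $score *= $tag_score{$_} // 1 for @tags;
    return $score;
}

# Two made-up blocks: a short navigation bar and a sentence of body text.
my @blocks = (
    { text => 'Home | About | Contact', tags => ['a'] },
    { text => 'Perl 5 is a highly capable, feature-rich programming language.',
      tags => ['p'] },
);

$_->{score} = score_block( $_->{text}, @{ $_->{tags} } ) for @blocks;

# The block with the highest score is treated as the feature block.
my ($best) = sort { $b->{score} <=> $a->{score} } @blocks;
print $best->{text}, "\n";
```

In this sketch the link-only block is both shorter and scaled down by the coefficient for a, so the sentence of body text wins, which matches the behaviour the description leads one to expect.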

METHODS

new([options])

Creates an object using the given options.

extract(url => $url | string => $string)

Returns the feature blocks together with the TITLE and DESCRIPTION.

OPTIONS

# the default values can be overridden by passing options to the constructor
my $f = HTML::Feature->new(
    # the values shown here are the defaults
    tag_score => {
        a      => 0.85,
        option => 0.5,
        b      => 1.15,
        strong => 1.15,
        h1     => 2,
        h2     => 1.8,
        h3     => 1.5,
    },
    string_score => {
        'copyright'           => 0.65,
        'all rights reserved' => 0.65,
    },
    ret_num      => 1,
    suspects_enc => [ 'euc-jp', 'shiftjis', '7bit-jis' ],
);
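
The documentation does not spell out what string_score and ret_num do, so the following standalone sketch shows one plausible reading rather than the module's actual behaviour. It assumes that a block's score is multiplied by the coefficient of each listed phrase found in its text, and that ret_num is the number of top-scoring blocks returned. The helper names apply_string_score and top_blocks, the final key, and the sample blocks are hypothetical.

```perl
use strict;
use warnings;

# Assumed semantics (not confirmed by the module documentation):
#   string_score: multiply a block's score by the coefficient of every
#                 listed phrase that occurs in it (case-insensitive here)
#   ret_num:      how many of the top-scoring blocks to return
my %string_score = ( 'copyright' => 0.65, 'all rights reserved' => 0.65 );
my $ret_num      = 1;

# Apply the phrase coefficients to a block's base score.
sub apply_string_score {
    my ($block) = @_;
    my $score = $block->{score};
    for my $phrase ( keys %string_score ) {
        $score *= $string_score{$phrase}
            if index( lc $block->{contents}, $phrase ) >= 0;
    }
    return $score;
}

# Rank the blocks by adjusted score and keep at most $n of them.
sub top_blocks {
    my ( $n, @blocks ) = @_;
    my @ranked = sort { $b->{final} <=> $a->{final} }
        map { +{ %$_, final => apply_string_score($_) } } @blocks;
    $n = @ranked if $n > @ranked;
    return @ranked[ 0 .. $n - 1 ];
}

# Made-up blocks with made-up base scores.
my @blocks = (
    { contents => 'Copyright 2007 Example Corp. All rights reserved.', score => 100 },
    { contents => 'The release adds several new features.',            score => 60 },
);

print $_->{contents}, "\n" for top_blocks( $ret_num, @blocks );
```

Under these assumptions the footer-like block starts with the higher base score but is penalized twice, so the ordinary sentence is returned instead.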

SEE ALSO

HTML::TokeParser, HTML::Entities, Encode::Guess

AUTHOR

Takeshi Miki <miki@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2007 Takeshi Miki

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.
