NAME
HTML::ContentExtractor - extract the main content from a web page by analyzing the DOM tree
VERSION
Version 0.01
SYNOPSIS
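use HTML::ContentExtractor;
use LWP::UserAgent;   # needed for the fetch below
my $extractor = HTML::ContentExtractor->new();
my $agent = LWP::UserAgent->new;
my $url = 'http://sports.sina.com.cn/g/2007-03-23/16572821174.shtml';
# Fetch the page and take its decoded body
my $res = $agent->get($url);
my $HTML = $res->decoded_content();
# Run the extraction, then print the result as HTML and as plain text
$extractor->extract($url, $HTML);
print $extractor->as_html();
print $extractor->as_text();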
DESCRIPTION
Web pages often contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts the reader from the actual content. This module reduces the noise in a web page and thereby identifies its content-rich regions.
A web page is first parsed by an HTML parser, which corrects the markup and creates a DOM (Document Object Model) tree. The tree is then traversed depth-first; noise nodes are identified and removed, and what remains is the main content. Three kinds of nodes are removed. Useless nodes (script, style, etc.) are removed outright. Container nodes (table, div, etc.) are removed when their link/text ratio is higher than a threshold, where the link/text ratio is the number of links divided by the number of non-linked words. Nodes that contain any string from the predefined spam string list are also removed.
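The link/text ratio pruning described above can be sketched in plain Perl. This is only an illustration under stated assumptions, not the module's actual implementation: the toy node structure (hash references with `tag`, `text` and `children` keys), the helper names `count_links_and_words`, `subtree_link_text_ratio` and `prune`, the tag lists, and the choice to treat zero non-linked words as one (to avoid dividing by zero) are all hypothetical.

```perl
use strict;
use warnings;

# Assumed tag lists, for illustration only.
my %IGNORE    = map { $_ => 1 } qw(script style);
my %CONTAINER = map { $_ => 1 } qw(table div);

# Count links and non-linked words in a subtree (depth-first).
sub count_links_and_words {
    my ($node, $inside_link) = @_;
    my $is_link = $node->{tag} eq 'a';
    my $links   = $is_link ? 1 : 0;
    my $words   = 0;
    unless ($inside_link || $is_link) {
        my @w = split ' ', ($node->{text} // '');
        $words = scalar @w;
    }
    for my $child (@{ $node->{children} || [] }) {
        my ($l, $w) = count_links_and_words($child, $inside_link || $is_link);
        $links += $l;
        $words += $w;
    }
    return ($links, $words);
}

# Links divided by non-linked words; zero words is treated as one (assumption).
sub subtree_link_text_ratio {
    my ($node) = @_;
    my ($links, $words) = count_links_and_words($node, 0);
    return $links / ($words > 0 ? $words : 1);
}

# Drop ignored tags and link-heavy containers, then recurse into what is kept.
sub prune {
    my ($node, $threshold) = @_;
    my @kept;
    for my $child (@{ $node->{children} || [] }) {
        next if $IGNORE{ $child->{tag} };
        next if $CONTAINER{ $child->{tag} }
             && subtree_link_text_ratio($child) > $threshold;
        prune($child, $threshold);
        push @kept, $child;
    }
    $node->{children} = \@kept;
    return $node;
}

my $page = {
    tag => 'body', children => [
        { tag => 'script', text => 'var x = 1;' },
        { tag => 'div', children => [              # navigation: 3 links, 0 words
            { tag => 'a', text => 'Home' },
            { tag => 'a', text => 'News' },
            { tag => 'a', text => 'Sports' },
        ] },
        { tag => 'div', text => join(' ', ('word') x 40), children => [
            { tag => 'a', text => 'source' },      # article: 1 link, 40 words
        ] },
    ],
};

prune($page, 0.05);
print scalar @{ $page->{children} }, "\n";   # prints 1
```

In this toy page the script node and the navigation block are dropped, while the article block (ratio 1/40 = 0.025, below the 0.05 threshold) is kept.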
Note that the input HTML must be encoded in UTF-8, as must the spam words. This allows the module to handle web pages in any language (I've used it to process English, Chinese, and Japanese web pages).
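If a page is available only as raw bytes in some other charset, one way to obtain UTF-8 is the core Encode module. The sketch below is an assumption about how a caller might prepare input, not something the module itself does; the helper name `to_utf8_bytes` and the Latin-1 sample string are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert raw bytes in a known source charset into UTF-8 bytes.
sub to_utf8_bytes {
    my ($bytes, $charset) = @_;
    my $chars = decode($charset, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);         # characters -> UTF-8 bytes
}

my $latin1 = "caf\xE9";                     # "cafe" with e-acute, in ISO-8859-1
my $utf8   = to_utf8_bytes($latin1, 'ISO-8859-1');
print length($utf8), "\n";                  # prints 5: e-acute takes two bytes in UTF-8
```

The source charset has to be known or detected beforehand; guessing it wrongly produces garbled text rather than an error.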
- $e = HTML::ContentExtractor->new(%options);
-
Constructs a new HTML::ContentExtractor object. The optional %options hash can be used to set the options listed below.
- $e->table_tags();
- $e->table_tags(@tags);
-
Gets or sets the table tags array. These tags are treated as container tags.
- $e->ignore_tags();
- $e->ignore_tags(@tags);
-
Gets or sets the ignore tags array. Elements with these tags are removed.
- $e->spam_words();
- $e->spam_words(@strings);
-
Gets or sets the spam words list. Elements that contain any of these strings are removed.
- $e->link_text_ratio();
- $e->link_text_ratio($ratio);
-
Gets or sets the link/text ratio threshold. The default is 0.05.
- $e->min_text_len();
- $e->min_text_len($len);
-
Gets or sets the minimum text length. The default is 20. An element whose text is shorter than this value is removed.
- $e->extract($url,$HTML);
-
Performs the extraction. Note that the input $HTML must be encoded in UTF-8.
- $e->as_html();
-
Returns the extraction result in HTML format.
- $e->as_text();
-
Returns the extraction result in text format.
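To illustrate the spam words option listed above, here is a minimal sketch of how an element's text could be checked against a spam list. How the module actually matches is not stated in this documentation; the sketch assumes a plain, case-sensitive substring test, and the helper name `contains_spam` and the sample strings are hypothetical.

```perl
use strict;
use warnings;
use utf8;   # the spam strings below are written as UTF-8 literals

# Return 1 if the text contains any non-empty spam string, else 0.
sub contains_spam {
    my ($text, @spam_words) = @_;
    for my $word (@spam_words) {
        next unless length $word;             # an empty string would match everything
        return 1 if index($text, $word) >= 0;
    }
    return 0;
}

my @spam = ('All rights reserved', '版权所有');

print contains_spam('Copyright 2007, All rights reserved.', @spam) ? "removed\n" : "kept\n";
print contains_spam('The match ended 2-1 after extra time.', @spam) ? "removed\n" : "kept\n";
```

With a substring test like this, short or common spam strings can match article text as well, so longer and more specific strings are the safer choice.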
AUTHOR
Zhang Jun, <jzhang533 at gmail.com>
COPYRIGHT & LICENSE
Copyright 2007 Zhang Jun, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.