NAME

HTML::CruftText - Remove unuseful text from HTML

VERSION

Version 0.01

SYNOPSIS

Removes junk from HTML page text.

This module uses a regular-expression-based approach to remove cruft from HTML, i.e. content that is very unlikely to be useful or interesting.

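    use strict;
    use warnings;
    use HTML::CruftText;

    # Read the HTML file into an array of lines
    open( my $MYINPUTFILE, '<', 'input.html' ) or die "Cannot open input.html: $!";
    my @lines = <$MYINPUTFILE>;
    close( $MYINPUTFILE );

    # Remove cruft text from the lines
    my $de_crufted_lines = HTML::CruftText::clearCruftText( \@lines );

    ...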

DESCRIPTION

This module was developed for the Media Cloud project (http://mediacloud.org) as the first step in differentiating article text from ads, navigation, and other boilerplate text. Its approach is very conservative and almost never removes legitimate article text. However, it still leaves in a lot of cruft, so many users will want to do additional processing.

Typically, the clearCruftText method is called with an array reference containing the lines of an HTML file. Each line is then altered so that the cruft text is removed. After completion, some lines will be entirely blank, while others will have certain text removed. In a few rare cases, additional HTML tags are added. The result is NOT GUARANTEED to be valid, balanced HTML, though some HTML is retained because it is extremely useful for further processing. Thus some users will want to run an HTML stripper over the results.
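The post-processing mentioned above could look roughly like the following. This is only a minimal sketch: strip_and_compact is a hypothetical helper that is not part of this module, the sample array stands in for the output of clearCruftText, and the tag removal is a naive regular expression rather than a real HTML stripper.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::CruftText): takes an array reference
# of de-crufted lines and returns a new array reference of plain-text lines.
sub strip_and_compact
{
    my ( $lines ) = @_;

    my @text;
    for my $line ( @{ $lines } )
    {
        my $copy = $line;
        $copy =~ s/<[^>]*>//g;      # naive tag removal; not a real HTML parser
        $copy =~ s/^\s+|\s+$//g;    # trim surrounding whitespace

        # Drop lines that are empty once tags and whitespace are gone
        push @text, $copy if length $copy;
    }

    return \@text;
}

# Sample data standing in for the result of clearCruftText
my $sample_lines = [ "", "<p>Article text</p>\n", "   \n", "<div>More <b>text</b></div>\n" ];

my $plain_lines = strip_and_compact( $sample_lines );
print "$_\n" for @{ $plain_lines };
```

Because the output is not guaranteed to be balanced HTML, a tag left open on a line would survive this naive pattern; a dedicated HTML stripper is the safer choice for real data.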

The following tactics are used to remove cruft text:

* Non-body text (anything outside of the <body></body> tags) is removed

* Text within the following tags is removed: <script>, <style>, <frame>, <applet>, and <textarea>

* Clickprint markers: many web sites have clickprint annotation comments that explicitly mark whether text should be included. When present, these comments are used to decide which text to remove.

* Removal of HTML tags in comments: we remove any HTML tags within <!-- --> comments but keep other comment text. This makes the result easier to process with regular expressions.

* Tags that span multiple lines are closed on each line. For example, we would change:

    FOO<a 
    href="bar.com">BAZ

to:

    FOO<a >
    <a href="bar.com">BAZ

This makes the output easier to process with regular expressions.
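Several of the tactics listed above can be illustrated with a few regular expressions. The code below is a simplified sketch and not the module's actual implementation: remove_simple_cruft and close_multiline_tags are hypothetical names, the first works on a single string rather than line by line, and the patterns ignore many edge cases that real pages contain.

```perl
use strict;
use warnings;

# Hypothetical helper: applies three of the tactics to one HTML string.
sub remove_simple_cruft
{
    my ( $html ) = @_;

    # Keep only what is inside <body>...</body>, if a body is present
    if ( $html =~ m{<body\b[^>]*>(.*?)</body>}is )
    {
        $html = $1;
    }

    # Drop elements whose content is never article text
    for my $tag ( qw(script style frame applet textarea) )
    {
        $html =~ s{<$tag\b[^>]*>.*?</$tag>}{}gis;
    }

    # Strip tags found inside <!-- --> comments but keep the other comment text
    $html =~ s{<!--(.*?)-->}{
        my $comment = $1;
        $comment =~ s/<[^>]*>//g;
        '<!--' . $comment . '-->';
    }gse;

    return $html;
}

# Hypothetical helper: closes a tag left open at the end of a line and
# reopens it at the start of the next line.
sub close_multiline_tags
{
    my ( $lines ) = @_;

    my @out;
    my $open_tag;    # name of the tag left open by the previous line, if any

    for my $line ( @{ $lines } )
    {
        my $copy = defined $open_tag ? "<$open_tag $line" : $line;
        undef $open_tag;

        # A '<name ...' with no closing '>' before the end of the line
        if ( $copy =~ s{<(\w+)([^<>\n]*)$}{<$1$2>} )
        {
            $open_tag = $1;
        }

        push @out, $copy;
    }

    return \@out;
}

my $fixed = close_multiline_tags( [ "FOO<a \n", "href=\"bar.com\">BAZ\n" ] );
print @{ $fixed };
```

Keeping every tag complete on its own line is what allows later steps to use simple per-line patterns instead of tracking state across lines.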

SUBROUTINES/METHODS

clearCruftText( $lines )

This is the main method for this module. Removes cruft text from $lines and returns the result. Generally $lines will be a reference to an array of lines from an HTML file. However, this method can also be called with a string, in which case the string is split into multiple lines and an array reference of de-crufted HTML lines is returned.
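The two accepted argument types can be pictured with a small normalization step. This is a sketch under assumptions: lines_from_input is a hypothetical helper, and the rule used here to split a string into lines may differ from what the module does internally.

```perl
use strict;
use warnings;

# Hypothetical helper: accepts either an array reference of lines or a single
# string, and always returns an array reference of lines.
sub lines_from_input
{
    my ( $input ) = @_;

    # Array references are passed through unchanged
    return $input if ref( $input ) eq 'ARRAY';

    # Assumption: split on LF or CRLF; the module's own rule may differ
    return [ split( /\r?\n/, $input ) ];
}

my $from_string = lines_from_input( "<html>\n<body>text</body>\n</html>\n" );
print scalar( @{ $from_string } ), " lines\n";
```

Either way, callers end up working with an array reference, which matches the return value described above.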

has_clickprint( $lines )

Returns true if the HTML in $lines has clickprint annotation comment tags. Returns false otherwise.
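A check of this kind can be sketched as a scan for marker comments. The marker names below (startclickprintinclude and startclickprintexclude) are an assumption about what clickprint annotations look like, and looks_like_clickprint is a hypothetical stand-in rather than the module's has_clickprint.

```perl
use strict;
use warnings;

# Hypothetical stand-in for has_clickprint: returns 1 if any line contains a
# clickprint start marker, 0 otherwise. The marker names are assumed.
sub looks_like_clickprint
{
    my ( $lines ) = @_;

    for my $line ( @{ $lines } )
    {
        return 1 if $line =~ /<!--\s*startclickprint(?:include|exclude)\s*-->/i;
    }

    return 0;
}

my @page = ( "<div>nav</div>\n", "<!-- startclickprintinclude -->\n", "<p>Article</p>\n" );
print looks_like_clickprint( \@page ) ? "clickprint found\n" : "no clickprint\n";
```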

AUTHOR

David Larochelle, <dlarochelle at cyber.law.harvard.edu>

BUGS

Please report any bugs or feature requests to bug-html-crufttext at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-CruftText. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc HTML::CruftText

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2012 Berkman Center for Internet & Society at Harvard University.

This program is released under the following license: Affero General Public License