package HTML::Truncate;
use warnings; # rm when ready
no warnings 'uninitialized';
use strict;
use HTML::TokeParser;
use HTML::Entities ();
use Carp;
our @ISA = qw( Exporter );
our @EXPORT_OK = ("truncate_html");
=head1 NAME
HTML::Truncate - (alpha software!) truncate HTML by text or raw
character count while preserving well-formedness.
=head1 VERSION
0.01
=cut
our $VERSION = '0.01';
=head1 ABSTRACT
When working with text it is convenient and common to want to truncate
strings to make them fit a desired context. E.g., you might have a
menu that is only 100px wide and prefer text doesn't wrap so you'd
truncate it around 15-30 characters, depending on preference and
typeface size. This is trivial with plain text and C<substr> but with
HTML it is somewhat difficult because whitespace has fluid
significance and open tags that are not properly closed destroy
well-formedness and can wreck an entire layout.
HTML::Truncate attempts to account for those two problems by padding
truncation for spacing and entities and closing any tags that remain
open at the point of truncation.
=head1 SYNOPSIS
# OO
use HTML::Truncate;
my $html_truncate = HTML::Truncate->new();
$html_truncate->chars(200);
$html_truncate->ellipsis($read_more_link);
print $html_truncate->($bunch_of_html);
# Functional
use HTML::Truncate "truncate_html";
# NON-OPERATIONAL RIGHT NOW
print truncate_html($html, 80, "–");
=head1 WHITESPACE & ENTITIES
Repeated natural whitespace (i.e., "\s+" and not " ") in HTML
-- with rare exception (pre tags or user defined styles) -- is not
meaningful. Therefore it is normalized when truncating. Entities are
also normalized. The following is only counted 14 chars long.
\n<p>\nthis is ‘text’\n\n</p>
^^^^^^^12345----678--9------01234------^^^^^^^^
=head1 FUNCTIONS
=head2 truncate_html($html, $char_or_percent, $ellipsis)
Not operative yet!
L<HTML::Truncate> maybe be used functionally or with object methods.
It has only one function, C<truncate_html()> which is not exported by
default.
No arguments are strictly required. Without HTML to operate upon it
returns undef. Character count defaults to 100. Ellipsis defaults to
"C<…>."
=cut
sub truncate_html { # function, compare w/ method truncate
my ( $html, $chars, $ellipsis ) = @_;
return unless $html;
die "truncate_html() is not functional!";
}
=head1 METHODS
=head2 HTML::Truncate->new
Can take all the methods as hash style args. "percent" will always
override "chars" so don't use them both.
my $ht = HTML::Truncate->new(utf8 => 1,
chars => 500, # default is 100
);
=cut
sub new {
my $class = shift;
my $self = bless
{
_chars => 100,
_percent => '10%',
_utf8 => undef,
_style => 'text',
_ellipsis => '…',
_raw_html => undef,
}, $class;
while ( my ( $k, $v ) = splice(@_, 0, 2) )
{
next unless exists $self->{"_$k"};
$self->$k($v);
}
$self;
}
=head2 $ht->style
Set/get. Either the default "text" or "html." (N.b.: only "text" is
supported so far.) This determines which characters will counted for
the truncation point. The reason why "html" is probably a poor choice
is that you might set what you believe to be a reasonable truncation
length of 20 chars and get an HTML tag like E<lt>a
href="http://blah.blah.boo/longish/path/to/resource... and end up with
no useful output.
Another problem is that the truncate might fall inside an attribute,
like the "href" above, which means that attribute will necessarily be
excluded, quite probably rendering the remaining tag invalid so the
entire tag must be tossed out to preserve well-formedness.
But the best reason not to use "html" right now is it's not supported
yet. It probably will be sometime in the future but unless you send a
patch to do it, it will be awhile. It would be useful, for example to
keep fixed length database records containing HTML truncated validly,
but it's not something I plan to use personally so it will come last.
=cut
sub style {
my ( $self, $style ) = @_;
return $self->{_style} unless defined $style;
croak "'html' style is not yet supported, sorry!"
if $style eq 'html';
croak "Value for style must be either 'text' or 'html'"
unless $style =~ /^text|html$/;
$self->{_style } = $style;
}
=head2 $ht->utf8
Set/get, true/false. If utf8 is set, entities will be transformed with
C<HTML::Entity::decode> and the default ellipsis will be a literal
ellipsis and not the default of C<…>.
=cut
sub utf8 {
my $self = shift;
if ( @_ )
{
$self->{_utf8} = shift;
return 1; # say we did it, even if untrue value
}
else
{
return $self->{_utf8};
}
}
=head2 $ht->chars
Set/get. The number of characters remaining after truncation,
including the C<ellipsis>. The C<style> attribute determines whether
the chars will only count text or HTML and text. Only "text" is
supported currently.
Entities are counted as single characters. E.g., C<©> is one
character for truncation counts.
=cut
sub chars {
my ( $self, $chars ) = @_;
return $self->{_chars} unless defined $chars;
$chars =~ /^(?:[1-9][_\d]*|0)$/
or croak "Specified chars must be a number";
$self->{_chars} = $chars;
}
=head2 $ht->percent
Set/get. A percentage to keep while truncating the rest. For a
document of 1,000 chars, percent('15%') and chars(150) would be
equivalent. If you ever add a truncate C<percent> it will override any
C<chars> settings. The actual amount of character that the percent
represents cannot be known until the given HTML is parsed.
=cut
sub percent {
my ( $self, $percent ) = @_;
return sprintf("%d%%", 100 * $self->{_percent})
unless $percent;
$percent =~ /^(100|[1-9]?[0-9])\%$/
or croak "Specified percent is invalid '$percent'";
$self->{_percent} = $1 / 100;
}
=head2 $ht->ellipsis
Set/get. Ellipsis in this case means--
The omission of a word or phrase necessary for a complete syntactical
construction but not necessary for understanding.
http://www.answers.com/topic/ellipsis
What it will probably mean in most real applications is "read more."
The default is C<…> which if the utf8 flag is true will render
as a literal ellipsis, C<chr(8230)>.
The reason the default is C<…> and not "..." is this is meant
for use in HTML environments, not plain text, and "..." (dot-dot-dot)
is not typographically correct or equivalent to a real horizontal
ellipsis character.
=cut
sub ellipsis {
my $self = shift;
if ( @_ )
{
$self->{_ellipsis} = shift;
}
elsif ( $self->utf8() )
{
return HTML::Entities::decode($self->{_ellipsis});
}
else
{
return $self->{_ellipsis};
}
}
=head2 $ht->truncate
=cut
sub truncate { # method, compare w/ function truncate_html
my $self = shift if ref($_[0]) eq __PACKAGE__;
my ( $html, $chars_or_perc, $ellipsis ) = @_;
return unless $html;
$self->{_raw_html} = \$html;
if ( $chars_or_perc =~ /\d+\%$/ )
{
$self->percent($chars_or_perc);
$self->_load_chars_from_percent();
}
elsif ( defined $chars_or_perc )
{
$self->chars($chars_or_perc);
}
$self->ellipsis($ellipsis) if defined $ellipsis;
my $p = HTML::TokeParser->new( $self->{_raw_html} );
my %stand_alone = map { $_ => 1 } qw( br img hr link );
my %skip = map { $_ => 1 } qw( script iframe head title style link
);
my ( @tag_q );
$self->{_renew} = '';
my $chars = $self->chars();
TOKENS:
while ( my $token = $p->get_token() )
{
if ( $token->[0] eq 'S' )
{
next TOKENS if $skip{$token->[1]};
push @tag_q, $token->[1] unless $stand_alone{$token->[1]};
$self->{_renewed} .= $token->[-1];
}
elsif ( $token->[0] eq 'E' )
{
next TOKENS if $skip{$token->[1]};
my $open = pop @tag_q;
my $close = $token->[1];
croak "<$open> closed by </$close>" unless $open eq
$close;
$self->{_renewed} .= $token->[-1];
}
elsif ( $token->[0] eq 'T' )
{
next TOKENS if $token->[2];
my $txt = $token->[1];
$self->{_renewed} .= $txt and next if $txt =~ /^\s+$/;
# 987 should be padding instead so we don't alter original
my $length = length $txt;
# print qq|<#$txt#>\n|;
for ( $txt =~ /
\A(\s+)(?=\S)
|
(?<=\S)(\s+)\Z
|
(?<=\&)(\#\d+;)
|
(?<=\&)([[:alpha:]]{2,5};)
|
\s(\s+)
/gx )
{
$chars += length $1; # padding
}
if ( $length > $chars )
{
$self->{_renewed} .= substr($txt, 0, ( $chars ) );
$self->{_renewed} .= $self->ellipsis();
last TOKENS;
}
else
{
$self->{_renewed} .= $txt;
$chars -= $length;
}
}
}
$self->{_renewed} .= join('', map {"</$_>"} reverse @tag_q);
return $self->{_renewed} if defined wantarray;
}
sub _load_chars_from_percent {
my $self = shift;
my $p = HTML::TokeParser->new( $self->{_raw_html} );
my $txt_length = 0;
CHARS:
while ( my $token = $p->get_token )
{
# don't check padding b/c we're going by a document average
next unless $token->[0] eq 'T' and not $token->[2];
$txt_length += () = $token->[1] =~
/\&\#\d+;|\&[[:alpha:]]{2,5};|\S|\s+/g;
}
$self->chars( int( $txt_length * $self->{_percent} ) );
}
=head1 TO DO
Many more tests. Go through entire dist and make sure everything is
kosher (autogenerated with the lovely L<Module::Starter>). Reorganize
POD to read in best learning order. Make sure the padding check is
working across wide range of cases. POD to describe various ways to
use it including a TT2 filter. "html" style truncating.
=head1 AUTHOR
Ashley Pond V, C<< <ashley@cpan.org> >>
=head1 LIMITATIONS
There are places where this will break down right now. I'll pad out
possible edge cases as I find them or they are sent to me via the CPAN
bug ticket system.
=head1 BUGS
Please report any bugs or feature requests to
C<bug-html-truncate@rt.cpan.org>, or through the web interface at
L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Truncate>.
I will be notified, and then you'll automatically be notified of
progress on
your ticket as I make changes.
=head1 SEE ALSO
L<HTML::Entities>, L<HTML::TokeParser>, the "truncate" filter in
L<Template>, and L<Text::Truncate>.
=head1 COPYRIGHT & LICENSE
Copyright 2005 Ashley Pond V, all rights reserved.
This program is free software; you can redistribute it and/or modify
it
under the same terms as Perl itself.
=cut
1; # End of HTML::Truncate