NAME

HTML::StripScripts - strip scripting constructs out of HTML

SYNOPSIS

use HTML::StripScripts;

my $hss = HTML::StripScripts->new({ Context => 'Inline' });

$hss->input_start_document;

$hss->input_start('<i>');
$hss->input_text('hello, world!');
$hss->input_end('</i>');

$hss->input_end_document;

print $hss->filtered_document;

DESCRIPTION

This module strips scripting constructs out of HTML, leaving as much non-scripting markup in place as possible. This allows web applications to display HTML originating from an untrusted source without introducing XSS (cross site scripting) vulnerabilities.

The process is based on whitelists of tags, attributes and attribute values. This approach is the most secure against disguised scripting constructs hidden in malicious HTML documents.

As well as removing scripting constructs, this module ensures that there is a matching end for each start tag, and that the tags are properly nested.

The HTML document must be parsed into start tags, end tags and text before it can be filtered by this module. Use either HTML::StripScripts::Parser or HTML::StripScripts::Regex instead if you want to input an unparsed HTML document.

CONSTRUCTORS

new ( CONFIG )

Creates a new HTML::StripScripts filter object, bound to a particular filtering policy. If present, the CONFIG parameter must be a hashref. The following keys are recognized (unrecognized keys will be silently ignored).

Context

A string specifying the context in which the filtered document will be used. This influences the set of tags that will be allowed.

If present, the Context value must be one of:

Document

If Context is Document then the filter will allow a full HTML document, including the HTML tag and HEAD and BODY sections.

Flow

If Context is Flow then most of the cosmetic tags that one would expect to find in a document body are allowed, including lists and tables but not including forms.

Inline

If Context is Inline then only inline tags such as B and FONT are allowed.

NoTags

If Context is NoTags then no tags are allowed.

The default Context value is Flow.

BanList

If present, this option must be a hashref. Any tag that would normally be allowed (because it presents no XSS hazard) will be blocked if the lowercase name of the tag is a key in this hash.

For example, in a guestbook application where HR tags are used to separate posts, you may wish to prevent posts from including HR tags, even though HR is not an XSS risk.

BanAllBut

If present, this option must be reference to an array holding a list of lowercase tag names. This has the effect of adding all but the listed tags to the ban list, so that only those tags listed will be allowed.

AllowSrc

By default, the filter won't allow constructs that cause the browser to fetch things automatically, such as SRC attributes in IMG tags. If this option is present and true then those constructs will be allowed.

AllowHref

By default, the filter won't allow constructs that cause the browser to fetch things if the user clicks on something, such as the HREF attribute in A tags. Set this option to a true value to allow this type of construct.

AllowRelURL

By default, the filter won't allow relative URLs such as ../foo.html in SRC and HREF attribute values. Set this option to a true value to allow them.

METHODS

This class provides the following methods:

input_start_document ()

This method initializes the filter, and must be called once before starting on each HTML document to be filtered.

input_start ( TEXT )

Handles a start tag from the input document. TEXT must be the full text of the tag, including angle-brackets.

input_end ( TEXT )

Handles an end tag from the input document. TEXT must be the full text of the end tag, including angle-brackets.

input_text ( TEXT )

Handles some non-tag text from the input document.

input_process ( TEXT )

Handles a processing instruction from the input document.

input_comment ( TEXT )

Handles an HTML comment from the input document.

input_declaration ( TEXT )

Handles an declaration from the input document.

input_end_document ()

Call this method to signal the end of the input document.

filtered_document ()

Returns the filtered document as a string.

SUBCLASSING

The HTML::StripScripts class is subclassable. Filter objects are plain hashes and HTML::StripScripts reserves only hash keys that start with _hss. The filter configuration can be set up by invoking the hss_init() method, which takes the same arguments as new().

OUTPUT METHODS

The filter outputs a stream of start tags, end tags, text, comments, declarations and processing instructions, via the following output_* methods. Subclasses may override these to intercept the filter output.

The default implementations of the output_* methods pass the text on to the output() method. The default implementation of the output() method appends the text to a string, which can be fetched with the filtered_document() method once processing is complete.

If the output() method or the individual output_* methods are overridden in a subclass, then filtered_document() will not work in that subclass.

output_start_document ()

This method gets called once at the start of each HTML document passed through the filter. The default implementation does nothing.

output_end_document ()

This method gets called once at the end of each HTML document passed through the filter. The default implementation does nothing.

output_start ( TEXT )

This method is used to output a filtered start tag.

output_end ( TEXT )

This method is used to output a filtered end tag.

output_text ( TEXT )

This method is used to output some filtered non-tag text.

output_declaration ( TEXT )

This method is used to output a filtered declaration.

output_comment ( TEXT )

This method is used to output a filtered HTML comment.

output_process ( TEXT )

This method is used to output a filtered processing instruction.

output ( TEXT )

This method is invoked by all of the default output_* methods. The default implementation appends the text to the string that the filtered_document() method will return.

REJECT METHODS

When the filter encounters something in the input document which it cannot transform into an acceptable construct, it invokes one of the following reject_* methods to put something in the output document to take the place of the unacceptable construct.

The TEXT parameter is the full text of the unacceptable construct.

The default implementations of these methods output an HTML comment containing the text filtered.

Subclasses may override these methods, but should exercise caution. The TEXT parameter is unfiltered input and may contain malicious constructs.

reject_start ( TEXT )
reject_end ( TEXT )
reject_text ( TEXT )
reject_declaration ( TEXT )
reject_comment ( TEXT )
reject_process ( TEXT )

WHITELIST INITIALIZATION METHODS

The filter refers to various whitelists to determine which constructs are acceptable. To modify these whitelists, subclasses can override the following methods.

Each method is called once at object initialization time, and must return a reference to a nested data structure. These references are installed into the object, and used whenever the filter needs to refer to a whitelist.

The default implementations of these methods can be invoked as class methods.

init_context_whitelist ()

Returns a reference to the Context whitelist, which determines which tags may appear at each point in the document, and which other tags may be nested within them.

It is a hash, and the keys are context names, such as Flow and Inline.

The values in the hash are hashrefs. The keys in these subhashes are lowercase tag names, and the values are context names, specifying the context that the tag provides to any other tags nested within it.

The special context EMPTY as a value in a subhash indicates that nothing can be nested within that tag.

init_attrib_whitelist ()

Returns a reference to the Attrib whitelist, which determines which attributes each tag can have and the values that those attributes can take.

It is a hash, and the keys are lowercase tag names.

The values in the hash are hashrefs. The keys in these subhashes are lowercase attribute names, and the values are attribute value class names, which are short strings describing the type of values that the attribute can take, such as color or number.

init_attval_whitelist ()

Returns a reference to the AttVal whitelist, which is a hash that maps attribute value class names from the Attrib whitelist to coderefs to subs to validate (and optionally transform) a particular attribute value.

The filter calls the attribute value validation subs with the following parameters:

filter

A reference to the filter object.

tagname

The lowercase name of the tag in which the attribute appears.

attrname

The name of the attribute.

attrval

The attribute value found in the input document, in canonical form (see "CANONICAL FORM").

The validation sub can return undef to indicate that the attribute should be removed from the tag, or it can return the new value for the attribute, in canonical form.

init_style_whitelist ()

Returns a reference to the Style whitelist, which determines which CSS style directives are permitted in style tag attributes. The keys are value names such as color and background-color, and the values are class names to be used as keys into the AttVal whitelist.

init_deinter_whitelist

Returns a reference to the DeInter whitelist, which determines which inline tags the filter should attempt to automatically de-interleave if they are encountered interleaved. For example, the filter will transform:

<b>hello <i>world</b> !</i>

Into:

<b>hello <i>world</i></b><i> !</i>

because both b and i appear as keys in the DeInter whitelist.

CHARACTER DATA PROCESSING

These methods transform attribute values and non-tag text from the input document into canonical form (see "CANONICAL FORM"), and transform text in canonical form into a suitable form for the output document.

text_to_canonical_form ( TEXT )

This method is used to reduce non-tag text from the input document to canonical form before passing it to the filter_text() method.

The default implementation unescapes all entities that map to US-ASCII characters other than ampersand, and replaces any ampersands that don't form part of valid entities with &amp;.

quoted_to_canonical_form ( VALUE )

This method is used to reduce attribute values quoted with doublequotes or singlequotes to canonical form before passing it to the handler subs in the AttVal whitelist.

The default behavior is the same as that of text_to_canonical_form().

unquoted_to_canonical_form ( VALUE )

This method is used to reduce attribute values without quotes to canonical form before passing it to the handler subs in the AttVal whitelist.

The default implementation simply replaces all ampersands with &amp;, since that corresponds with the way most browsers treat entities in unquoted values.

canonical_form_to_attval ( ATTVAL )

This method is used to convert the text in canonical form returned by the AttVal handler subs to a form suitable for inclusion in doublequotes in the output tag.

The default implementation runs anything that doesn't look like a valid entity through the escape_html_metachars() method.

canonical_form_to_text ( TEXT )

This method is used to convert the text in canonical form returned by the filter_text() method to a form suitable for inclusion in the output document.

The default implementation runs anything that doesn't look like a valid entity through the escape_html_metachars() method.

validate_href_attribute ( TEXT )

If the AllowHref filter configuration option is set, then this method is used to validate href type attribute values. TEXT is the attribute value in canonical form. Returns a possibly modified attribute value (in canonical form) or undef to reject the attribute.

The default implementation allows only absolute http and https URLs, permits port numbers and query strings, and imposes reasonable length limits.

validate_src_attribute ( TEXT )

If the AllowSrc filter configuration option is set, then this method is used to validate src type attribute values. TEXT is the attribute value in canonical form. Returns a possibly modified attribute value (in canonical form) or undef to reject the attribute.

The default implementation behaves as validate_href_attribute().

OTHER METHODS TO OVERRIDE

As well as the output, reject, init and cdata methods listed above, it might make sense for subclasses to override the following methods:

filter_text ( TEXT )

This method will be invoked to filter blocks of non-tag text in the input document. Both input and output are in canonical form, see "CANONICAL FORM".

The default implementation does no filtering.

escape_html_metachars ( TEXT )

This method is used to escape all HTML metacharacters in TEXT. The return value must be a copy of TEXT with metacharacters escaped.

The default implementation escapes a minimal set of metacharacters for security against XSS vulnerabilities. The set of characters to escape is a compromise between the need for security and the need to ensure that the filter will work for documents in as many different character sets as possible.

Subclasses which make strong assumptions about the document character set will be able to escape much more aggressively.

strip_nonprintable ( TEXT )

Returns a copy of TEXT with runs of nonprintable characters replaced with spaces or some other harmless string. Avoids replacing anything with the empty string, as that can lead to other security issues.

The default implementation strips out only NULL characters, in order to avoid scrambling text for as many different character sets as possible.

Subclasses which make some sort of assumption about the character set in use will be able to have a much wider definition of a nonprintable character, and hence a more secure strip_nonprintable() implementation.

ATTRIBUTE VALUE HANDLER SUBS

References to the following subs appear in the AttVal whitelist returned by the init_attval_whitelist() method.

_hss_attval_style( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value hander for the style attribute.

_hss_attval_size ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for attributes who's values are some sort of size or length.

_hss_attval_number ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for attributes who's values are a simple integer.

_hss_attval_color ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for color attributes.

_hss_attval_text ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for text attributes.

_hss_attval_word ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for attributes who's values must consist of a single short word, with minus characters permitted.

_hss_attval_wordlist ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for attributes who's values must consist of one or more words, separated by spaces and/or commas.

_hss_attval_wordlistq ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for attributes who's values must consist of one or more words, separated by commas, with optional doublequotes around words and spaces allowed within the doublequotes.

_hss_attval_href ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for href type attributes. If the AllowHref configuration option is set, uses the validate_href_attribute() method to check the attribute value.

_hss_attval_src ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for src type attributes. If the AllowSrc configuration option is set, uses the validate_src_attribute() method to check the attribute value.

_hss_attval_stylesrc ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for src type style pseudo attributes.

_hss_attval_novalue ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )

Attribute value handler for attributes that have no value or a value that is ignored. Just returns the attribute name as the value.

CANONICAL FORM

Many of the methods described above deal with text from the input document, encoded in what I call canonical form, defined as follows:

All characters other than ampersands represent themselves. Literal ampersands are encoded as &amp;. Non US-ASCII characters may appear as literals in whatever character set is in use, or they may appear as named or numeric HTML entities such as &aelig;, &#31337; and &#xFF;. Unknown named entities such as &foo; may appear.

The idea is to be able to be able to reduce input text to a minimal form, without making too many assumptions about the character set in use.

PRIVATE METHODS

The following methods are internal to this class, and should not be invoked from elsewhere. Subclasses should not use or override these methods.

_hss_decode_numeric ( NUMERIC )

Returns the string that should replace the numeric entity NUMERIC in the text_to_canonical_form() method.

_hss_tag_is_banned ( TAGNAME )

Returns true if the lower case tag name TAGNAME is on the list of harmless tags that the filter is configured to block, false otherwise.

_hss_get_to_valid_context ( TAG )

Tries to get the filter to a context in which the tag TAG is allowed, by introducing extra end tags or start tags if necessary. TAG can be either the lower case name of a tag or the string 'CDATA'.

Returns 1 if an allowed context is reached, or 0 if there's no reasonable way to get to an allowed context and the tag should just be rejected.

_hss_close_innermost_tag ()

Closes the innermost open tag.

_hss_context ()

Returns the current named context of the filter.

_hss_valid_in_context ( TAG, CONTEXT )

Returns true if the lowercase tag name TAG is valid in context CONTEXT, false otherwise.

_hss_valid_in_current_context ( TAG )

Returns true if the lowercase tag name TAG is valid in the filter's current context, false otherwise.

BUGS

  • This module does a lot of work to ensure that tags are correctly nested and are not left open, causing unnecessary overhead for applications where that doesn't matter.

    Such applications may benefit from using the more lightweight HTML::Scrubber::StripScripts module instead.

SEE ALSO

HTML::Parser, HTML::StripScripts::Parser, HTML::StripScripts::Regex

AUTHOR

Nick Cleaton <nick@cleaton.net>

COPYRIGHT

Copyright (C) 2003 Nick Cleaton. All Rights Reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

1 POD Error

The following errors were encountered while parsing the POD:

Around line 1223:

=cut found outside a pod block. Skipping to next block.