# Callback for custom handling of HTML tag attributes
sub DefangAttribsCallback {
my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;
# Change all 'border' attribute values to zero.
$$AttrValR = '0' if $lcAttrKey eq 'border';
# Defang all 'src' attributes
return DEFANG_ALWAYS if $lcAttrKey eq 'src';
return DEFANG_NONE;
}
# Callback for all content between tags (except <style>, <script>, etc)
sub DefangContentCallback {
my ($Self, $Defang, $ContentR) = @_;
$$ContentR =~ s/remove this content//;
}
=head1 DESCRIPTION
This module accepts an input HTML and/or CSS string and removes any executable code including scripting, embedded objects, applets, etc., and neutralises any XSS attacks. A whitelist-based approach is used, which means only HTML known to be safe is allowed through.
HTML::Defang uses a custom HTML tag parser. The parser has been designed and tested to work with nasty real-world HTML and to emulate as closely as possible what browsers actually do with strange-looking constructs. The test suite has been built from examples from a range of sources such as http://ha.ckers.org/xss.html and http://imfo.ru/csstest/css_hacks/import.php to ensure that as many XSS attack scenarios as possible have been dealt with.
HTML::Defang can make callbacks to client code when it encounters the following:
=over 4
=item *
When a specified tag is parsed
=item *
When a specified attribute is parsed
=item *
When a URL is parsed as part of an HTML attribute, or CSS property value.
=item *
When style data is parsed, as part of an HTML style attribute, or as part of an HTML <style> tag.
=back
The callbacks include details about the current tag/attribute that is being parsed, and also give a scalar reference to the input HTML. Querying pos() on the input HTML should indicate how far the module has progressed with parsing. This gives the client code flexibility in working with HTML::Defang.
HTML::Defang can defang whole tags, any attribute in a tag, any URL that appears as an attribute or style property, or any CSS declaration in a declaration block in a style rule. This helps to block precisely the most specific unwanted elements in the content (for example, just an offending attribute instead of the whole tag), while retaining any safe HTML/CSS.
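The pos() behaviour mentioned above can be illustrated with a minimal, self-contained sketch. It does not use HTML::Defang's parser: the scanner loop and the TagCallback subroutine are hypothetical stand-ins, and the only point shown is that a callback holding a scalar reference to the input can read the current parse offset with pos().

```perl
use strict;
use warnings;

# Hypothetical stand-in for a tag callback: it receives a reference to the
# input HTML and records how far the scan has progressed.
my @Seen;
sub TagCallback {
    my ($lcTag, $HtmlR) = @_;
    # pos() on the referenced scalar is the offset just past the last match
    push @Seen, [ $lcTag, pos($$HtmlR) ];
    return;
}

my $Html = '<p>one</p><br>';

# Toy scanner (an assumption for this sketch, not the module's parser):
# walk the string with \G and /gc so pos($Html) tracks the parse position.
pos($Html) = 0;
while (pos($Html) < length($Html)) {
    if ($Html =~ m{\G</?([A-Za-z][A-Za-z0-9]*)[^>]*>}gc) {
        TagCallback(lc $1, \$Html);
    }
    else {
        # Skip text up to the next '<', or a stray '<' itself
        $Html =~ m{\G(?:[^<]+|<)}gc;
    }
}

printf "%-3s ends at offset %d\n", @$_ for @Seen;
```

Because the /c modifier keeps pos() intact when a match fails, the callback always sees the offset of the most recent successful match.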
# "style" => qr/expression|eval|script:|mocha:|\&{|\@import|(?<!background-)position:|background-image/i, # XXX there are probably a million more ways to cause trouble with css!
"style"=> qr/^.*$/s,
#kc In addition to this, we could strip all 'javascript:|expression|' etc. from all attributes(in attribute_cleanup())
"stylesheet"=> [ qr/expression|eval|script:|mocha:|\&\{|\@import/i ], # stylesheets are forbidden if Embedded => 1. css positioning can be allowed in an iframe.
# NB see also `process_stylesheet' below
"style-type"=> [ qr/script|mocha/i ],
"size"=> qr/^[\+\-]?[\d.]+(px|%)?$/i,
"target"=> qr/^[A-Za-z0-9_][A-Za-z0-9_.-]*$/,
"base-href"=> qr/^https?:\/\/[\w.\/]+$/,
"anything"=> qr/^.*$/, #[ 0, 0 ],
"meta:content"=> [ qr//],
);
my%CommonAttributes=
(
# Core attributes
"class"=> "class",
"id"=> "alnum",
"name"=> "alnum",
"style"=> "style",
"accesskey"=> "alnum",
"tabindex"=> "integer",
"title"=> "anything",
# Language attributes
"dir"=> "dir",
"lang"=> "alnum",
"language"=> "language",
"longdesc"=> "anything",
# Height, width, alignment, etc.
"align"=> "align",
"bgcolor"=> "color",
"bottommargin"=> "size",
"clear"=> "align",
"color"=> "color",
"height"=> "size",
"leftmargin"=> "size",
"marginheight"=> "size",
"marginwidth"=> "size",
"nowrap"=> "anything",
"rightmargin"=> "size",
"scroll"=> "boolean",
"scrolling"=> "boolean",
"topmargin"=> "size",
"type"=> "mime-type",
"valign"=> "align",
"width"=> "size",
"/"=> "empty",
);
my%ListAttributes=
(
"compact"=> "anything",
"start"=> "integer",
"type"=> "list-type",
);
my%TableAttributes=
(
"axis"=> "alnum",
"background"=> "src",
"border"=> "number",
"bordercolor"=> "color",
"bordercolordark"=> "color",
"bordercolorlight"=> "color",
"padding"=> "integer",
"spacing"=> "integer",
"cellpadding"=> "integer",
"cellspacing"=> "integer",
"cols"=> "anything",
"colspan"=> "integer",
"char"=> "alnum",
"charoff"=> "integer",
"datapagesize"=> "integer",
"frame"=> "frame",
"frameborder"=> "boolean",
"framespacing"=> "integer",
"headers"=> "anything",
"rows"=> "anything",
"rowspan"=> "size",
"rules"=> "rules",
"scope"=> "scope",
"span"=> "integer",
"summary"=> "anything"
);
my%UrlRules= (
"src"=> 1,
"href"=> 1,
"base-href"=> 1,
# cite => 1,
# action => 1,
);
my%Tags= (
script=> \&defang_script_tag,
style=> \&defang_style_tag,
"html"=> 100,
#
# Safe elements commonly found in the <head> block follow.
#
"head"=> 2,
"base"=>
{
"href"=> "base-href",
"target"=> "target",
},
# TODO: Deal with link below later
#"link" => \$r_link,
# {
# "rel" => "rel",
# "rev" => "rel",
# "src" => "src",
# "href" => "src", # Might be auto-loaded by the browser!!
# "charset" => "charset",
# "media" => "media",
# "target" => "target",
# "type" => "mime-type",
# },
"meta"=>
{
"_score"=> 2,
"content"=> "meta:content",
"http-equiv"=> "meta:name",
"name"=> "meta:name",
"charset"=> "charset",
},
"title"=> 2,
#
# Safe elements commonly found in the <body> block follow.
#
"body"=>
{
"_score"=> 2,
"link"=> "color",
"alink"=> "color",
"vlink"=> "color",
"background"=> "src",
"nowrap"=> "boolean",
"text"=> "color",
},
"a"=>
{
"charset"=> "charset",
"coords"=> "coords",
"href"=> "href",
"shape"=> "shape",
"target"=> "target",
"type"=> "mime-type",
"eudora"=> "eudora",
"notrack"=> "anything",
},
"address"=> 1,
"area"=>
{
"alt"=> "anything",
"coords"=> "coords",
"href"=> "href",
"nohref"=> "anything",
"shape"=> "shape",
"target"=> "target",
},
"article"=> 1,
"applet"=> 0,
"basefont"=>
{
"face"=> "font-face",
"family"=> "font-face",
"back"=> "color",
"size"=> "number",
"ptsize"=> "number",
},
"bdo"=> 1,
"bgsound"=>
{
"balance"=> "integer",
"delay"=> "integer",
"loop"=> "alnum",
"src"=> "src",
"volume"=> "integer",
},
"blockquote"=>
{
"cite"=> "href",
"type"=> "mime-type",
},
"br"=> 1,
"button"=> # FORM
{
"type"=> "input-type",
"disabled"=> "anything",
"value"=> "anything",
"tabindex"=> "number",
},
"caption"=> 1,
"center"=> 1,
"col"=> \%TableAttributes,
"colgroup"=> \%TableAttributes,
"comment"=> 1,
"dd"=> 1,
"del"=>
{
"cite"=> "href",
"datetime"=> "datetime",
},
"dir"=> \%ListAttributes,
"div"=> 1,
"dl"=> \%ListAttributes,
"dt"=> 1,
"embed"=> 0,
"fieldset"=> 1, # FORM
"font"=>
{
"face"=> "font-face",
"family"=> "font-face",
"back"=> "color",
"size"=> "number",
"ptsize"=> "number",
},
"footer"=> 1,
"form"=> # FORM
{
"method"=> "form-method",
"action"=> "href",
"enctype"=> "form-enctype",
"accept"=> "anything",
"accept-charset"=> "anything",
},
"header"=> 1,
"hr"=>
{
"size"=> "number",
"noshade"=> "anything",
},
"h1"=> 1,
"h2"=> 1,
"h3"=> 1,
"h4"=> 1,
"h5"=> 1,
"h6"=> 1,
"iframe"=> 0,
"ilayer"=> 0,
"img"=>
{
"alt"=> "anything",
"border"=> "size",
"dynsrc"=> "src",
"hspace"=> "size",
"ismap"=> "anything",
"loop"=> "alnum",
"lowsrc"=> "src",
"nosend"=> "alnum",
"src"=> "src",
"start"=> "alnum",
"usemap"=> "usemap-href",
"vspace"=> "size",
},
"inlineinput"=> 0,
"input"=> # FORM
{
"type"=> "input-type",
"disabled"=> "anything",
"value"=> "anything",
"maxlength"=> "input-size",
"size"=> "input-size",
"readonly"=> "anything",
"tabindex"=> "number",
"checked"=> "anything",
"accept"=> "anything",
# for type "image":
"alt"=> "anything",
"border"=> "size",
"dynsrc"=> "src",
"hspace"=> "size",
"ismap"=> "anything",
"loop"=> "alnum",
"lowsrc"=> "src",
"nosend"=> "alnum",
"src"=> "src",
"start"=> "alnum",
"usemap"=> "usemap-href",
"vspace"=> "size",
},
"ins"=>
{
"cite"=> "href",
"datetime"=> "datetime",
},
"isindex"=> 0,
"keygen"=> 0,
"label"=> # FORM
{
"for"=> "alnum",
},
"layer"=> 0,
"legend"=> 1, # FORM
"li"=> {
"value"=> "integer",
},
"listing"=> 0,
"map"=> 1,
"marquee"=> 0,
"menu"=> \%ListAttributes,
"multicol"=> 0,
"nextid"=> 0,
"nobr"=> 0,
"noembed"=> 1,
"nolayer"=> 1,
# Pretend our defang result is going into a non-scripting environment,
# even though javascript is likely enabled, so just defang all noscript tags
"noscript"=> 0,
"object"=> 0,
"ol"=> \%ListAttributes,
"optgroup"=> # FORM
{
"disabled"=> "anything",
"label"=> "anything",
},
"option"=> # FORM
{
"disabled"=> "anything",
"label"=> "anything",
"selected"=> "anything",
"value"=> "anything",
},
"o:p"=> 1,
"p"=> 1,
"param"=> 0,
"plaintext"=> 0,
"pre"=> 1,
"rt"=> 0,
"ruby"=> 0,
"section"=> 1,
"select"=> # FORM
{
"disabled"=> "anything",
"multiple"=> "anything",
"size"=> "input-size",
"tabindex"=> "number",
},
"spacer"=> 0,
"span"=> 1,
"spell"=> 0,
"sound"=>
{
"delay"=> "number",
"loop"=> "integer",
"src"=> "src",
},
"table"=> \%TableAttributes,
"tbody"=> \%TableAttributes,
"textarea"=> # FORM
{
"cols"=> "input-size",
"rows"=> "input-size",
"disabled"=> "anything",
"readonly"=> "anything",
"tabindex"=> "number",
"wrap"=> "anything",
},
"td"=> \%TableAttributes,
"tfoot"=> \%TableAttributes,
"th"=> \%TableAttributes,
"thead"=> \%TableAttributes,
"tr"=> \%TableAttributes,
"ul"=> \%ListAttributes,
"wbr"=> 1,
"xml"=> 0,
"xmp"=> 0,
"x-html"=> 0,
"x-tab"=> 1,
"x-sigsep"=> 1,
# Character formatting
"abbr"=> 1,
"acronym"=> 1,
"big"=> 1,
"blink"=> 0,
"b"=> 1,
"cite"=> 1,
"code"=> 1,
"dfn"=> 1,
"em"=> 1,
"i"=> 1,
"kbd"=> 1,
"q"=> 1,
"s"=> 1,
"samp"=> 1,
"small"=> 1,
"strike"=> 1,
"strong"=> 1,
"sub"=> 1,
"sup"=> 1,
"tt"=> 1,
"u"=> 1,
"var"=> 1,
#
# Safe elements commonly found in the <frameset> block follow.
my%BlockTags= map{ $_=> 1 } qw(h1 h2 h3 h4 h5 h6 p div pre plaintext address blockquote center form table tbody thead tfoot tr td th caption colgroup col dl ul ol li fieldset);
my%InlineTags= map{ $_=> 1 } qw(span abbr acronym q sub sup cite code em kbd samp strong var dfn strike b i u s tt small big nobr a font);
my%NestInlineTags= map{ $_=> 1 } qw(span abbr acronym q sub sup cite code em kbd samp strong var dfn strike b i u s tt small big nobr);
# Default list of mismatched tags to track
my%MismatchedTags= (%BlockTags, %InlineTags);
=head1 CONSTRUCTOR
=over 4
=cut
=item I<HTML::Defang-E<gt>new(%Options)>
Constructs a new HTML::Defang object. The following options are supported:
=over 4
=item B<Options>
=over 4
=item B<tags_to_callback>
Array reference of tags for which a callback should be made. If a tag in this array is parsed, the subroutine given in tags_callback is invoked.
=item B<attribs_to_callback>
Array reference of tag attributes for which a callback should be made. If an attribute in this array is parsed, the subroutine given in attribs_callback is invoked.
=item B<tags_callback>
Subroutine reference to be invoked when a tag listed in @$tags_to_callback is parsed.
=item B<attribs_callback>
Subroutine reference to be invoked when an attribute listed in @$attribs_to_callback is parsed.
=item B<url_callback>
Subroutine reference to be invoked when a URL is detected in an HTML tag attribute or a CSS property.
=item B<css_callback>
Subroutine reference to be invoked when CSS data is found either as the contents of a 'style' attribute in an HTML tag, or as the contents of a <style> HTML tag.
=item B<content_callback>
Subroutine reference to be invoked when standard content between HTML tags is found.
=item B<fix_mismatched_tags>
This property, if set, fixes mismatched tags in the HTML input. By default, tags present in the default %mismatched_tags_to_fix hash are fixed. This set of tags can be overridden by passing in an array reference $mismatched_tags_to_fix to the constructor. Any opened tags in the set are automatically closed if no corresponding closing tag is found. If an unbalanced closing tag is found, it is commented out.
=item B<mismatched_tags_to_fix>
Array reference of tags for which the code checks for matching opening and closing tags. See the property $fix_mismatched_tags.
=item B<context>
You can pass an arbitrary scalar as a 'context' value that's then passed as the first parameter to all callback functions. Most commonly this is something like '$Self'
=item B<allow_double_defang>
If this is true, then tag names and attribute names which already begin
with the defang string ("defang_" by default) will have an additional
copy of the defang string prepended if they are flagged to be defanged
by the return value of a callback, or if the tag or attribute name
is unknown.
The default is to assume that tag names and attribute names beginning
with the defang string are already made safe, and need no further
modification, even if they are flagged to be defanged by the
return value of a callback. Any tag or attribute modifications made
directly by a callback are still performed.
=item B<delete_defang_content>
Normally defanged tags are turned into comments and prefixed by defang_,
and defanged styles are surrounded by /* ... */. If this is set to
A number of the callbacks share the same parameters. These common parameters are documented here. Certain variables may have specific meanings in certain callbacks, so be sure to check the documentation for that method first before referring to this section.
=over 4
=item I<$context>
You can pass an arbitrary scalar as a 'context' value that's then passed as the first parameter to all callback functions. Most commonly this is something like '$Self'
=item I<$Defang>
Current HTML::Defang instance
=item I<$OpenAngle>
Opening angle bracket (<) of the current tag.
=item I<$lcTag>
Lower case version of the HTML tag that is currently being parsed.
=item I<$IsEndTag>
Has the value '/' if the current tag is a closing tag.
=item I<$AttributeHash>
A reference to a hash containing the attributes of the current tag and
their values. Each value is a scalar reference to the value, rather
than just a scalar value. You can add attributes (remember to make it a
If $Defang->{tags_callback} exists, and HTML::Defang has parsed a tag present in $Defang->{tags_to_callback}, the above callback is made to the client code. The return value of this method determines whether the tag is defanged or not. More details below.
=over 4
=item B<Return values>
=over 4
=item DEFANG_NONE
The current tag will not be defanged.
=item DEFANG_ALWAYS
The current tag will be defanged.
=item DEFANG_DEFAULT
The current tag will be processed normally by HTML::Defang as if there was no callback method specified.
If $Defang->{attribs_callback} exists, and HTML::Defang has parsed an attribute present in $Defang->{attribs_to_callback}, the above callback is made to the client code. The return value of this method determines whether the attribute is defanged or not. More details below.
=over 4
=item B<Method parameters>
=over 4
=item I<$lcAttrKey>
Lower case version of the HTML attribute that is currently being parsed.
=item I<$AttrVal>
Reference to the HTML attribute value that is currently being parsed.
See $AttributeHash for details of decoding.
=back
=item B<Return values>
=over 4
=item DEFANG_NONE
The current attribute will not be defanged.
=item DEFANG_ALWAYS
The current attribute will be defanged.
=item DEFANG_DEFAULT
The current attribute will be processed normally by HTML::Defang as if there was no callback method specified.
If $Defang->{url_callback} exists, and HTML::Defang has parsed a URL, the above callback is made to the client code. The return value of this method determines whether the attribute containing the URL is defanged or not. URL callbacks can be made from <style> tags as well as style attributes, in which case the particular style declaration will be commented out. More details below.
=over 4
=item B<Method parameters>
=over 4
=item I<$lcAttrKey>
Lower case version of the HTML attribute that is currently being parsed. However if this callback is made as a result of parsing a URL in a style attribute, $lcAttrKey will be set to the string I<style>, or will be set to I<undef> if this callback is made as a result of parsing a URL inside a style tag.
=item I<$AttrVal>
Reference to the URL value that is currently being parsed.
=item I<$AttributeHash>
A reference to a hash containing the attributes of the current tag and their values. Each value is a scalar reference to the value,
rather than just a scalar value. You can add attributes (remember to make the value a scalar ref, e.g. $AttributeHash->{"newattr"} = \"newval"), delete attributes, or modify attribute values in this hash, and any changes you make will be incorporated into the output HTML stream. Will be set to I<undef> if the callback is made due to a URL in a <style> tag or attribute.
=back
=item B<Return values>
=over 4
=item DEFANG_NONE
The current URL will not be defanged.
=item DEFANG_ALWAYS
The current URL will be defanged.
=item DEFANG_DEFAULT
The current URL will be processed normally by HTML::Defang as if there was no callback method specified.
If $Defang->{css_callback} exists, and HTML::Defang has parsed a <style> tag or style attribute, the above callback is made to the client code. The return value of this method determines whether a particular declaration in the style rules is defanged or not. More details below.
=over 4
=item B<Method parameters>
=over 4
=item I<$Selectors>
Reference to an array containing the selectors in a style tag or attribute.
=item I<$SelectorRules>
Reference to an array containing the style declaration blocks of all selectors in a style tag or attribute. Consider the below CSS:
a { b:c; d:e}
j { k:l; m:n}
The declaration blocks will get parsed into the following data structure:
[
[
[ "b", "c", DEFANG_DEFAULT ],
[ "d", "e", DEFANG_DEFAULT ]
],
[
[ "k", "l", DEFANG_DEFAULT ],
[ "m", "n", DEFANG_DEFAULT ]
]
]
So, generally each property:value pair in a declaration is parsed into an array of the form
["property", "value", X]
where X can be DEFANG_NONE, DEFANG_ALWAYS or DEFANG_DEFAULT, with DEFANG_DEFAULT being the default value. A client can change this value to tell HTML::Defang how to treat this property:value pair.
DEFANG_NONE - Do not defang
DEFANG_ALWAYS - Defang the property:value pair
DEFANG_DEFAULT - Process this as if there is no callback specified
=item I<$IsAttr>
True if the currently processed item is a style attribute. False if the currently processed item is a style tag.
=back
=back
=back
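As a minimal sketch of how a client might use the flag in each [property, value, X] triple, the callback below walks the structure exactly as it is documented above and marks "position: fixed" declarations for defanging. The constants are local stand-ins with placeholder values for those exported by HTML::Defang, the argument order is an assumption made for this sketch, and the data is built by hand rather than produced by the module.

```perl
use strict;
use warnings;

# Stand-ins for the constants exported by HTML::Defang; the numeric values
# here are placeholders chosen for this sketch.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical css_callback. $SelectorRules is an array of declaration
# blocks, and each declaration is [ property, value, flag ].
sub DefangCssCallback {
    my ($Context, $Defang, $Selectors, $SelectorRules, $IsAttr) = @_;
    for my $Block (@$SelectorRules) {
        for my $Declaration (@$Block) {
            my ($Property, $Value) = @$Declaration;
            # Flip the flag in place; the caller sees the change
            $Declaration->[2] = DEFANG_ALWAYS
                if $Property eq 'position' && $Value =~ /\bfixed\b/i;
        }
    }
    return;
}

# Hand-built input mirroring the documented shape
my $Selectors     = [ 'a', 'j' ];
my $SelectorRules = [
    [ [ 'b', 'c', DEFANG_DEFAULT ], [ 'position', 'fixed', DEFANG_DEFAULT ] ],
    [ [ 'k', 'l', DEFANG_DEFAULT ], [ 'm', 'n', DEFANG_DEFAULT ] ],
];

DefangCssCallback(undef, undef, $Selectors, $SelectorRules, 0);

for my $i (0 .. $#$Selectors) {
    printf "%s { %s: %s } flag=%d\n", $Selectors->[$i], @$_
        for @{ $SelectorRules->[$i] };
}
```

Declarations the callback leaves untouched keep DEFANG_DEFAULT, so they are handled as if no callback had been specified.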
=cut
=head1 METHODS
=over 4
=item B<PUBLIC METHODS>
=over 4
=cut
=item I<defang($InputHtml, \%Opts)>
Cleans up $InputHtml by removing any executable code including scripting, embedded objects, applets, etc., and defangs any XSS attacks.
=over 4
=item B<Method parameters>
=over 4
=item I<$InputHtml>
The input HTML string that needs to be sanitized.
=back
=back
Returns the cleaned HTML. If fix_mismatched_tags is set, any tags that appear in @$mismatched_tags_to_fix that are unbalanced are automatically commented or closed.
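A minimal usage sketch, assuming HTML::Defang is installed. It uses only the constructor option and method described in this document. The module is loaded at run time so the sketch still runs where it is absent, and the sample input string is made up for illustration.

```perl
use strict;
use warnings;

# Made-up input with a script block and an unclosed <b> tag
my $InputHtml = '<p>Hello <script>alert(1)</script><b>world</p>';
my $SanitizedHtml;

# Load the module at run time so the sketch still runs where it is absent
if (eval { require HTML::Defang; 1 }) {
    my $Defang = HTML::Defang->new(fix_mismatched_tags => 1);
    $SanitizedHtml = $Defang->defang($InputHtml);
    print "$SanitizedHtml\n";
}
else {
    print "HTML::Defang is not installed; nothing to sanitize with\n";
}
```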
Appends $String to the output after the current parsed tag ends. Can be used by client code in callback methods to add HTML text to the processed output. If the HTML text needs to be defanged, client code can safely call HTML::Defang->defang() recursively from within the callback.
=over 4
=item B<Method parameters>
=over 4
=item I<$String>
The string that is added after the current parsed tag ends.
This method is invoked when a <script> tag is parsed. Defangs the <script> opening tag, and any closing tag. Any scripting content is also commented out, so browsers don't display them.
Returns 1 to indicate that the <script> tag must be defanged.
=over 4
=item B<Method parameters>
=over 4
=item I<$OutR>
A reference to the processed output HTML before the tag that is currently being parsed.
=item I<$HtmlR>
A scalar reference to the input HTML.
=item I<$TagOps>
Indicates what operation should be done on a tag. Can be undefined, integer or code reference. Undefined indicates an unknown tag to HTML::Defang, 1 indicates a known safe tag, 0 indicates a known unsafe tag, and a code reference indicates a subroutine that should be called to parse the current tag. For example, <style> and <script> tags are parsed by dedicated subroutines.
=item I<$OpenAngle>
Opening angle bracket (<) of the current tag.
=item I<$IsEndTag>
Has the value '/' if the current tag is a closing tag.
=item I<$Tag>
The HTML tag that is currently being parsed.
=item I<$TagTrail>
Any space after the tag, but before attributes.
=item I<$Attributes>
A reference to an array of the attributes and their values, including any surrounding spaces. Each element of the array is added by 'push' calls like below.
Kurian Jose Aerthail E<lt>cpan@kurianja.fastmail.fmE<gt>. Thanks to Rob Mueller E<lt>cpan@robm.fastmail.fmE<gt> for initial code, guidance, support and bug fixes.
=cut
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2003-2013 by FastMail Pty Ltd
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.
=cut
1;