NAME

HTML::Valid::Tagset - data tables useful in parsing HTML

SYNOPSIS

use HTML::Valid::Tagset ':all';
for my $tag (qw/canvas a li moonshines/) {
    if ($isHTML5{$tag}) {
        print "<$tag> is ok\n";
    }
    else {
        print "<$tag> is not HTML5\n";
    }
}

produces output

<canvas> is ok
<a> is ok
<li> is ok
<moonshines> is not HTML5

(This example is included as tagset-synopsis.pl in the distribution.)

VERSION

This documents HTML::Valid::Tagset version 0.09 corresponding to git commit 0eff27f5da639787969d7ac7787f18df00b6753e released on Wed Jun 29 08:45:38 2022 +0900.

This Perl module is built on top of the "HTML Tidy" library version 5.8.0.

DESCRIPTION

This module contains several data tables useful in various kinds of HTML parsing operations.

All tag names used are lowercase.

This module and HTML::Tagset

This is a drop-in replacement for HTML::Tagset. However, HTML::Valid::Tagset is mostly not based on HTML::Tagset. It uses the tables of HTML elements from a C program called "HTML Tidy" (this is not the Perl module HTML::Tidy).

As far as possible, this module tries to be compatible with HTML::Tagset. Incompatibilities with HTML::Tagset are discussed in "Issues with HTML::Tagset".

Validation

If you need to validate tags, you should use, for example, "%isHTML5" for HTML 5 tags, or "%isKnown" if you want to check whether a tag is a known one.
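The following is a minimal sketch of how these hashsets might be combined to classify a tag. The helper tag_status and the strings it returns are hypothetical and not part of this module; only the hashsets come from HTML::Valid::Tagset.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset qw/%isHTML5 %isKnown %isObsolete %isProprietary/;

# Hypothetical helper: describe the status of a tag, checking the
# hashsets in order from "valid HTML5" down to "unknown".
sub tag_status
{
    my ($tag) = @_;
    # All tag names in the tables are lowercase.
    $tag = lc $tag;
    return 'HTML5' if $isHTML5{$tag};
    return 'obsolete' if $isObsolete{$tag};
    return 'proprietary' if $isProprietary{$tag};
    return 'known, but not HTML5' if $isKnown{$tag};
    return 'unknown';
}

for my $tag (qw/canvas PLAINTEXT blink isindex moonshines/) {
    printf "<%s>: %s\n", $tag, tag_status ($tag);
}
```

The checks run from the most specific table to the most general one, so a tag is reported as merely "known" only if none of the other tables claims it.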

Terminology

In the following documentation, a "hashset" is a hash being used as a set. The actual values associated with the keys are not significant.
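As an illustration of the idea only, a hashset can be built from a list of keys and queried with an ordinary hash lookup. The hash %isAllowed and the tags in it are made up for this sketch and are not part of this module.

```perl
use strict;
use warnings;

# Build a hashset: every key maps to a true value, but only the
# presence of the key is significant.
my %isAllowed = map { $_ => 1 } qw/a em strong/;

# Testing for membership is a plain hash lookup.
for my $tag (qw/a script em/) {
    if ($isAllowed{$tag}) {
        print "<$tag> is in the set\n";
    }
    else {
        print "<$tag> is not in the set\n";
    }
}

# The members of the set are the keys of the hash.
my @members = sort keys %isAllowed;
print "Members: @members\n";
```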

VARIABLES

None of these variables are exported by default. See "EXPORTS". Each entry below notes its compatibility with HTML::Tagset. In all cases, this refers to HTML::Tagset version 3.20.

@allTags

This contains all the HTML tags that this module knows of as an array sorted in alphabetical order. It is exactly the same thing as the keys of "%isKnown".

This is only in HTML::Valid::Tagset, not in HTML::Tagset.
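One possible use of @allTags, sketched here under the assumption that the array and %isHTML5 are imported together, is to list the known tags which are not part of HTML5. The variable name @not_html5 is illustrative only.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset qw/@allTags %isHTML5/;

# Collect the known tags which are not part of HTML5. Since @allTags
# is sorted, the result is also in alphabetical order.
my @not_html5 = grep { ! $isHTML5{$_} } @allTags;

printf "%d of %d known tags are not in HTML5:\n",
    scalar (@not_html5), scalar (@allTags);
print "@not_html5\n";
```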

%canTighten

This is copied from HTML::Tagset.

%emptyElement

This hashset has as keys the tag names of elements that cannot have content. For example, "base", "br", or "hr".

use HTML::Valid::Tagset '%emptyElement';
for my $tag (qw/hr dl br snakeeyes/) {
    if ($emptyElement{$tag}) {
        print "<$tag> is empty.\n";
    }
    else {
        print "<$tag> is not empty.\n";
    }
}

outputs

<hr> is empty.
<dl> is not empty.
<br> is empty.
<snakeeyes> is not empty.

This is compatible with HTML::Tagset.

%isBlock

This hashset contains all block elements.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isBodyElement

This hashset contains all elements that are to be found only in/under the "body" element of an HTML document.

This is compatible with the undocumented %HTML::Tagset::isBodyElement in HTML::Tagset and the documentation for %HTML::Tagset::isBodyMarkup. See also "Issues with HTML::Tagset". %isBodyMarkup is not implemented in HTML::Tagset, so it's not provided for compatibility here.

%isCDATA_Parent

This hashset includes all elements whose content is CDATA.

This is copied from HTML::Tagset.

%isFormElement

This hashset contains all elements that are to be found only in/under a "form" element.

This is compatible with HTML::Tagset.

%isHeadElement

This hashset contains elements that can be present in the 'head' section of an HTML document.

This is compatible with the contents of %HTML::Tagset::isHeadElement, but not its documentation. See also "Issues with HTML::Tagset".

%isHeadOrBodyElement

This hashset includes all elements that can fall either in the head or in the body.

This is compatible with HTML::Tagset.

%isHTML2

This hashset is true for elements which are part of the "HTML 2.0" standard.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isHTML3

This hashset is true for elements which are part of the "HTML 3.2" standard.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isHTML4

This hashset is true for elements which are part of the "HTML 4.01" standard.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isHTML5

use utf8;
use FindBin '$Bin';
use HTML::Valid::Tagset '%isHTML5';
if ($isHTML5{canvas}) {
    print "<canvas> is OK.\n"; 
}
if ($isHTML5{a}) {
    print "<a> is OK.\n";
}
if ($isHTML5{plaintext}) {
    print "OH NO!"; 
}
else {
    print "<plaintext> went out with scrambled eggs.\n";
}

outputs

<canvas> is OK.
<a> is OK.
<plaintext> went out with scrambled eggs.

This is true for elements which are valid HTML tags in "HTML5". It is not true for obsolete elements like the <plaintext> tag (see "%isObsolete"), or proprietary elements such as the <blink> tag which have never been part of any HTML standard (see "%isProprietary"). Further, some elements which are marked as neither obsolete nor proprietary are nevertheless absent from HTML5, for example the <isindex> tag.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isKnown

This hashset lists all known HTML elements. See also "@allTags".

This is compatible with HTML::Tagset.

%isList

This hashset contains all elements that can contain "li" elements.

This is copied from HTML::Tagset.

%isInline

This hashset contains all inline elements. It is identical to %isPhraseMarkup.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isObsolete

$isObsolete{canvas};
# Undefined
$isObsolete{plaintext};
# True

This is true for HTML elements which were once part of HTML standards, like plaintext, but have now been declared obsolete. Note that %isObsolete is not true for elements like the <blink> tag which were never part of any HTML standard. See "%isProprietary" for these tags.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isPhraseMarkup

This hashset contains all inline elements. It is identical to %isInline.

This is compatible with HTML::Tagset.

%isProprietary

This is true for elements which are not part of any HTML standard, but were added by computer companies.

use utf8;
use FindBin '$Bin';
use HTML::Valid::Tagset '%isProprietary';
my @tags = qw/a blink plaintext marquee/;
for my $tag (@tags) {
    if ($isProprietary{$tag}) {
        print "<$tag> is proprietary.\n";
    }
    else {
        print "<$tag> is not a proprietary tag.\n";
    }
}

outputs

<a> is not a proprietary tag.
<blink> is proprietary.
<plaintext> is not a proprietary tag.
<marquee> is proprietary.

This is only in HTML::Valid::Tagset, not in HTML::Tagset.

%isTableElement

This hashset contains all elements that are to be found only in/under a "table" element.

This is compatible with HTML::Tagset.

%optionalEndTag

Elements in this hashset are not empty (see "%emptyElement"), but their end-tags are generally, "safely", omissible.

use HTML::Valid::Tagset qw/%optionalEndTag %emptyElement/;
for my $tag (qw/li p a br/) {
    if ($optionalEndTag{$tag}) {
        print "OK to omit </$tag>.\n";
    }
    elsif ($emptyElement{$tag}) {
        print "<$tag> does not ever take '</$tag>'\n";
    }
    else {
        print "Cannot omit </$tag> after <$tag>.\n";
    }
}

outputs

OK to omit </li>.
OK to omit </p>.
Cannot omit </a> after <a>.
<br> does not ever take '</br>'

This is compatible with HTML::Tagset.

FUNCTIONS

all_attributes

my $attr = all_attributes ();

This returns an array reference containing all known attributes. The attributes are not sorted.
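Since the list is unsorted, a caller who wants a stable order has to sort it. The following sketch assumes that the functions are imported along with everything else by ':all'; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset ':all';

my $attr = all_attributes ();

# Sort a copy of the list, leaving the returned array untouched.
my @sorted = sort @$attr;
print scalar (@sorted), " attributes are known.\n";

# Print at most the first ten attributes in alphabetical order.
my $last = $#sorted < 9 ? $#sorted : 9;
print "$_\n" for @sorted[0 .. $last];
```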

attributes

my $attr = attributes ('a');

This returns an array reference containing all valid attributes for the specified tag (as decided by the WWW Consortium). The attributes are not sorted. By default this returns the valid attributes for HTML 5.

It is also possible to choose a value for standard which specifies which standard one wants:

my $attr = attributes ('a', standard => 'html5');

Possible values for standard are

html5

This returns valid attributes for "HTML5".

This is the default.

html4

This returns valid attributes for "HTML 4.01".

html3

This returns valid attributes for "HTML 3.2".

html2

This returns valid attributes for "HTML 2.0".
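As a sketch of how the standard option might be used, the following compares the attributes of one tag under two standards. It assumes the functions are imported by ':all'; the choice of the "table" tag and the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset ':all';

my $tag = 'table';
my $html4 = attributes ($tag, standard => 'html4');
my $html5 = attributes ($tag, standard => 'html5');

# Make a hashset of the HTML5 attributes for quick lookup.
my %in_html5 = map { $_ => 1 } @$html5;

# Attributes listed for HTML 4.01 but not for HTML5.
my @dropped = sort grep { ! $in_html5{$_} } @$html4;

print "Attributes of <$tag> in HTML 4.01 but not in HTML5:\n";
print "    $_\n" for @dropped;
```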

tag_attr_ok

my $ok = tag_attr_ok ('a', 'onmouseover');
# $ok = 1
$ok = tag_attr_ok ('table', 'cellspacing');
# $ok = undef, because "cellspacing" is not a valid attribute for
# table in HTML 5.

This returns a true value if the attribute is allowed for the specified tag. The default version is HTML 5. Another version of HTML can be specified using the parameter standard:

my $ok = tag_attr_ok ('html', 'onload', standard => 'html2');

The possible versions are as in "attributes".
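The following sketch shows one way tag_attr_ok might be used to filter a list of attributes against more than one standard. The attribute list is invented for the example, and it is assumed that the function is imported by ':all'.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset ':all';

# Hypothetical attributes found on a <table> tag in some document.
my @found = qw/border cellspacing class id/;

for my $standard (qw/html4 html5/) {
    # Keep only the attributes which the standard does not allow.
    my @rejected = grep {
        ! tag_attr_ok ('table', $_, standard => $standard)
    } @found;
    print "$standard: ";
    if (@rejected) {
        print "not allowed on <table>: @rejected\n";
    }
    else {
        print "all attributes are allowed\n";
    }
}
```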

attr_type

my $type = attr_type ('onmouseover');
# $type = 'script'

This returns a text string containing likely type information for the attribute. This content is extracted from the internals of "HTML Tidy", and it may or may not be correct. This interface is experimental, and likely to change.
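A possible use, sketched here with the caveat that the interface is experimental, is to group the attributes of a tag by their reported type. The hash %by_type and the fallback string 'unknown' are assumptions of this sketch, not part of the module.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset ':all';

# Group the HTML5 attributes of <a> by the type which attr_type reports.
my %by_type;
for my $attr (@{attributes ('a')}) {
    my $type = attr_type ($attr);
    # Guard against a missing type, since the interface may change.
    $type = 'unknown' unless defined $type;
    push @{$by_type{$type}}, $attr;
}

for my $type (sort keys %by_type) {
    my @names = sort @{$by_type{$type}};
    print "$type: @names\n";
}
```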

COMPATIBILITY-ONLY VARIABLES

These variables are present in this module for compatibility with existing programs which use HTML::Tagset. However, they are fundamentally flawed and should not be used for new projects.

%is_Possible_Strict_P_Content

In HTML::Valid::Tagset, this is identical to "%isInline".

This is a mistake in HTML::Tagset which is preserved in name only for backwards compatibility. See also "Issues with HTML::Tagset".

@p_closure_barriers

In HTML::Valid::Tagset, this resolves to an empty list.

This is a mistake in HTML::Tagset which is preserved in name only for backwards compatibility. See also "Issues with HTML::Tagset".

UNIMPLEMENTED

The following parts of HTML::Tagset are not implemented in version 0.09 of HTML::Valid::Tagset.

%boolean_attr

This is not implemented in HTML::Valid::Tagset.

%linkElements

This is not implemented in HTML::Valid::Tagset.

SEE ALSO

HTML Tidy

This is a program and a library in C for improving HTML. It was originally written by Dave Raggett of the W3 Consortium. HTML::Valid is based on this project.

Please note that this is not the Perl module HTML::Tidy by Andy Lester, although that module is also based on the above library.

CPAN modules

HTML::Tagset, HTML::Element, HTML::TreeBuilder, HTML::LinkExtor

HTML standards

This section gives links to the HTML standards which HTML::Valid supports.

HTML 2.0

HTML 2.0 was described in RFC ("Request For Comments") 1866, a standard of the Internet Engineering Task Force. See http://www.ietf.org/rfc/rfc1866.txt.

HTML 3.2

This was described in the HTML 3.2 Reference Specification. See http://www.w3.org/TR/REC-html32.

HTML 4.01

This was described in the HTML 4.01 Specification. See http://www.w3.org/TR/html401/.

HTML5

Dive into HTML5

http://diveintohtml5.info/.

This isn't a standards document, but "Dive into HTML 5" may be good background reading before trying to read the standards documents.

HTML: The Living Standard

This is at https://developers.whatwg.org/. It says

    This specification is intended for authors of documents and scripts that use the features defined in this specification.

HTML5 - A vocabulary and associated APIs for HTML and XHTML

This is at http://www.w3.org/TR/html5/. It's the W3 consortium's version of the WHATWG documents.

EXPORTS

The hashes and arrays are exported on demand. Everything can be exported with :all:

use HTML::Valid::Tagset ':all';

BUGS

Issues with HTML::Tagset

There are several problems with HTML::Tagset version 3.20 which mean that it's difficult to be fully compatible with it.

@p_closure_barriers should be an empty set

There is a long-winded argument in the documentation of HTML::Tagset, which has been there since version 3.01, released on Aug 21 2000, about why it's possible for a p element to contain another p element. However, the specification for HTML4.01, which HTML::Tagset seems to be based on, from 1999, states

    The P element represents a paragraph. It cannot contain block-level elements (including P itself).

Thus, it is simply not possible for any block element to legally be part of a paragraph, and the mechanism that HTML::Tagset suggests for how a paragraph element can contain a table which can contain a paragraph element, like this:

<p>
<table>

is not and was not legal HTML. The <table> element is itself a block-level element, and the HTML rule is that when a new block-level element is seen inside an open paragraph, a </p> is inserted automatically, so the above always becomes

<p>
</p>
<table>

anyway. See "%isBlock" for testing for whether an element is a block level element.

So in this module, "@p_closure_barriers" is an empty set.
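The rule described above can be sketched with "%isBlock" alone. The helper closes_paragraph is hypothetical and is not provided by this module; it merely reports whether a start tag is a block-level element, and so would end an open paragraph.

```perl
use strict;
use warnings;
use HTML::Valid::Tagset '%isBlock';

# Hypothetical helper: true if seeing <$tag> while a <p> is open
# implies that the paragraph has ended.
sub closes_paragraph
{
    my ($tag) = @_;
    return $isBlock{lc $tag} ? 1 : 0;
}

for my $tag (qw/table em p/) {
    if (closes_paragraph ($tag)) {
        print "<$tag> ends an open <p>.\n";
    }
    else {
        print "<$tag> can stay inside an open <p>.\n";
    }
}
```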

%is_Possible_Strict_P_Content doesn't really make sense

The comments for HTML::Tagset version 3.20 read

# I've no idea why there's these latter exceptions.
# I'm just following the HTML4.01 DTD.

and following this it lists the form tag in this hash. However, the form tag is a block level element, so the purpose of this hash seems to be misguided. Since, as noted above, a p tag can contain any inline element, in this module, for compatibility, "%is_Possible_Strict_P_Content" is just the same thing as "%isInline".

%isBodyMarkup doesn't exist

The documented %isBodyMarkup doesn't exist; in its place is %isBodyElement.

This is reported as https://rt.cpan.org/Public/Bug/Display.html?id=109024.

The documentation of %isHeadElement is misleading

The documentation of %isHeadElement claims

    This hashset contains all elements that elements that should be present only in the 'head' element of an HTML document.

However, it actually contains elements that can be present either only in the head, like <title>, or in both the head and the body, like <script>. In this module, "%isHeadElement" copies the contents of HTML::Tagset rather than its documentation.

The issue in HTML::Tagset is reported as https://rt.cpan.org/Ticket/Display.html?id=109044.

Some elements of %isHeadElement are not head elements

This is reported as https://rt.cpan.org/Public/Bug/Display.html?id=109018.

COPYRIGHT & LICENSE

Portions of this module are taken from HTML::Tagset, which bears the following copyright notice.

Copyright 1995-2000 Gisle Aas.

Copyright 2000-2005 Sean M. Burke.

Copyright 2005-2008 Andy Lester.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

However, the bulk of HTML::Valid::Tagset is not a fork of HTML::Tagset; it is based on "HTML Tidy".

HTML::Valid is based on HTML Tidy, which is under the following copyright:

HTML Tidy

HTML parser and pretty printer

Copyright (c) 1998-2016 World Wide Web Consortium (Massachusetts Institute of Technology, European Research Consortium for Informatics and Mathematics, Keio University). All Rights Reserved.

Additional contributions (c) 2001-2016 University of Toronto, Terry Teague, @geoffmcl, HTACG, and others.

Contributing Author(s):

Dave Raggett <dsr@w3.org>

The contributing author(s) would like to thank all those who helped with testing, bug fixes and suggestions for improvements. This wouldn't have been possible without your help.

This software and documentation is provided "as is," and the copyright holders and contributing author(s) make no representations or warranties, express or implied, including but not limited to, warranties of merchantability or fitness for any particular purpose or that the use of the software or documentation will not infringe any third party patents, copyrights, trademarks or other rights.

The copyright holders and contributing author(s) will not be held liable for any direct, indirect, special or consequential damages arising out of any use of the software or documentation, even if advised of the possibility of such damage.

Permission is hereby granted to use, copy, modify, and distribute this source code, or portions hereof, documentation and executables, for any purpose, without fee, subject to the following restrictions:

1. The origin of this source code must not be misrepresented. 2. Altered versions must be plainly marked as such and must not be misrepresented as being the original source. 3. This Copyright notice may not be removed or altered from any source or altered source distribution.

The copyright holders and contributing author(s) specifically permit, without fee, and encourage the use of this source code as a component for supporting the Hypertext Markup Language in commercial products. If you use this source code in a product, acknowledgement is not required but would be appreciated.

The Perl parts of this distribution are copyright (C) 2015-2021 Ben Bullock and may be used under either the above licence terms, or the usual Perl conditions, either the GNU General Public Licence or the Perl Artistic Licence.