NAME

HTML::EntityReference - A minimal, abstract, and reusable list of HTML entities

VERSION

Version 0.011

SYNOPSIS

This is a listing of HTML character entities. It is intended to be the last time such a list has to be compiled into a module, since it is meant to be exposed and usable in any situation. I found several modules that dealt with entities, but they either did not do what I needed or kept their tables for internal use.

The essential characteristic of this data is that "entities exist".

The entity is nothing more than a name for a Unicode character. Everything else having to do with it is attached to the character, and should be something I can find in the Unicode database and related Unicode Perl stuff. The most fundamental thing is a map of names to code point numbers. I mean the number itself (an integer), not some string representation of the number in hex or decimal or decorated with some other escape system. From the code point value, it is a single step to get the actual character, or the formatted numeric entity, or whatever.
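
To make the "single step" concrete, here is a minimal sketch in plain Perl. It does not use this module at all: the three-entry %entities hash is a hypothetical stand-in for the real data, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical three-entry table standing in for the module's data.
# The values are plain integers (code points), not strings like "&#8220;".
my %entities = (amp => 38, ldquo => 8220, hellip => 0x2026);

my $codepoint = $entities{ldquo};              # the integer 8220
my $char      = chr $codepoint;                # the character U+201C
my $decimal   = sprintf '&#%d;',  $codepoint;  # decimal numeric entity
my $hex       = sprintf '&#x%X;', $codepoint;  # hexadecimal numeric entity
```

Because the table stores the integer, every other representation is one built-in call away.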

You can use the supplied hash directly. Or, this module provides some simple functions that abstract the way the data is actually stored and return the common cases.

The function calls also provide an easy way to check multiple tables in one go. For that reason, non-standard entities, whether recognised by some browsers or of historical interest only, are documented here as well.

use 5.010;         # enables say
use charnames ();  # needed for charnames::viacode
use HTML::EntityReference;
my $codepoint= HTML::EntityReference::ordinal('ldquo');  # the integer 8220
say "Character is known formally as ", charnames::viacode($codepoint), '.';
my $char= HTML::EntityReference::character('amp');  # the string '&'

# can look up the other way too
my $entity= HTML::EntityReference::from_ordinal(0x2026);
say "You can use &$entity; on a web page.";  # "hellip"

# use non-standard definitions
$codepoint= HTML::EntityReference::ordinal($whatsit, ':all');

Data Tables

%W3C_Entities

The package variable %W3C_Entities contains the standard HTML entities as keys, and the code point (integer) as the value. The source also contains comments copied from http://www.w3.org/TR/html4/sgml/entities.html.

%HTML5_draft

The package variable %HTML5_draft contains the entities defined as part of the HTML5 standard, a work in progress. These are taken from http://dev.w3.org/html5/spec/named-character-references.html#named-character-references. This is loaded on demand, since there are over two thousand of them. So if you want to use this hash directly, be sure to call one of the functions specifying 'HTML5_draft' first.

Unlike the existing standard HTML Entity chart, this chart contains some entries that expand to more than one code point. They can be combining characters, variation selectors, and in a couple of cases really are two separate characters.
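
If you read a chart directly, you have to cope with both shapes of value. The following sketch assumes that a multi-valued entry is stored as an array reference of integers, which matches what ordinal is documented to return but is an assumption about the stored data. The %chart excerpt and the codepoints helper are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical excerpt: single-valued entries hold an integer,
# multi-valued entries hold an array reference of integers (assumed).
my %chart = (
    hellip          => 0x2026,
    NotHumpDownHump => [0x224E, 0x338],
);

# Normalise either shape to a flat list of code points.
sub codepoints {
    my ($value) = @_;
    return ref($value) eq 'ARRAY' ? @$value : ($value);
}

# Turn each entry into the string of characters it stands for.
my $single = join '', map { chr $_ } codepoints($chart{hellip});
my $multi  = join '', map { chr $_ } codepoints($chart{NotHumpDownHump});
```

Here $single is one character long and $multi is two: the base character followed by a combining overlay.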

other charts

Others will be added.

custom charts

You can pass your own chart data to the various functions, to be used instead of or in addition to the built-in charts. Do this by passing a reference to the hash as an element in the include or exclude list.

In addition to adding your own custom entities, you can also duplicate existing entities in order to override what gets generated (e.g. precomposed vs decomposed form), or provide priority in inverse lookups.

(This might work in this version but has not been tested yet)

Functions

The function calls provide an easy way to check multiple tables in one go. They also abstract the way the data is actually stored, handle the simple cases, and take care of fiddly details that you might not have thought of, such as multi-valued entities.

(parameters)

In general, the functions take the thing to be converted as the first parameter, and can take one or two additional optional arguments. Only the format function doesn't follow this pattern exactly, taking another parameter first.

The second parameter specifies the chart or charts to use. This is commonly referred to as the include parameter. That's because the third parameter works the same way but specifies things to exclude.

The include parameter may be a string or an array reference. The string is the name of a chart or the name of a bundle. The chart names available are "HTML4" and "HTML5_draft". The only bundle name available is ":all". Others will be added in later versions. If no parameter is given at all, it is the same as using "HTML4".

If you have more to say than just one string, you can use an array reference instead. Each element of the array can be a string as explained above. An item can also be a hash reference, which is a custom chart.

If more than one item is given as the include parameter, they are checked in order until something is found or the list is exhausted.
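
The in-order search can be pictured with a small stand-alone sketch. This is not the module's code: the lookup function, the %builtin excerpts, and the %custom chart are all hypothetical, and bundle names such as ":all" are not modelled.

```perl
use strict;
use warnings;

# Hypothetical excerpts standing in for the built-in charts.
my %builtin = (
    HTML4       => { amp => 38, hellip => 0x2026 },
    HTML5_draft => { amp => 38, hellip => 0x2026, bigstar => 0x2605 },
);

# Resolve an include parameter (a string or an array reference) to a list
# of charts and return the first hit, or nothing if no chart has the name.
sub lookup {
    my ($name, $include) = @_;
    $include = 'HTML4' unless defined $include;   # documented default
    my @items = ref($include) eq 'ARRAY' ? @$include : ($include);
    for my $item (@items) {
        # A hash reference is a custom chart; a string names a built-in one.
        my $chart = ref($item) eq 'HASH' ? $item : $builtin{$item};
        next unless $chart && exists $chart->{$name};
        return $chart->{$name};
    }
    return;
}

my %custom = (hellip => [0x2E, 0x2E, 0x2E], smiley => 0x263A);

my $default  = lookup('hellip');                       # HTML4 only
my $override = lookup('hellip', [\%custom, 'HTML4']);  # custom chart wins
my $fallback = lookup('amp',    [\%custom, 'HTML4']);  # falls through to HTML4
my $missing  = lookup('smiley');                       # not in HTML4: undef
```

Putting the custom chart first is what lets it override a built-in definition, while names it does not define still resolve from the later charts.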

The exclude parameter is not implemented yet.

ordinal

Calling $n=HTML::EntityReference::ordinal($entity); is simply the same as looking it up in the data hash: $n=$HTML::EntityReference::W3C_Entities{$entity};. It will return the code point if the $entity is listed, or undef otherwise.

The return value is normally a number, the integer value of the code point that the entity refers to. In the case of multi-valued entities, the return value is an array reference.

character

This is the same as calling the built-in chr on the result of ordinal, except that if the named entity was not listed it returns undef. It also takes care of entities that expand into multiple code points. For multi-valued entities, it simply produces a string with more than one character in it.

hex

This is the same as calling sprintf("%04x", $ord) on the result of ordinal, except that if the named entity was not listed it returns undef. Note that this returns the hex digits only, zero-padded to at least four, without any decoration or prefix. You can incorporate this into a hex notation or hex entity notation, as desired. However, that might be awkward for multi-value returns, so this function doesn't handle those. See the format function instead.
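
As a rough illustration of decorating the bare digits yourself, the sketch below uses only the built-in sprintf on literal code points rather than calling this module. The variable names are illustrative.

```perl
use strict;
use warnings;

my $ord = 0x2026;                      # code point for "hellip"
my $hex = sprintf '%04x', $ord;        # bare digits: "2026"

my $entity   = "&#x$hex;";             # hex character reference
my $notation = 'U+' . uc $hex;         # conventional Unicode notation
my $short    = sprintf '%04x', 0x26;   # "amp" is padded to "0026"
```

The padding only sets a minimum width, so code points above U+FFFF simply come out with more than four digits.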

format

This takes a format string as a first argument. After that are the usual entity, include, and exclude parameters. The format string is used with sprintf. For example, format ('&#x%X;', 'NotHumpDownHump', 'HTML5_draft') will produce "&#x224E; &#x338;" in scalar context.

For multi-value entities, it will format each code point. In scalar context, they are returned as one string with separating spaces. In list context, it returns a list of formatted numbers.
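
The context-sensitive return described above can be sketched with the built-in wantarray. The format_codepoints helper is hypothetical and takes code points directly instead of an entity name; it is only meant to mirror the described behaviour, not to reproduce the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: format every code point, then return the pieces
# as a list, or joined with single spaces in scalar context.
sub format_codepoints {
    my ($format, @codepoints) = @_;
    my @formatted = map { sprintf $format, $_ } @codepoints;
    return wantarray ? @formatted : join ' ', @formatted;
}

# The two code points of NotHumpDownHump, formatted in each context.
my @list   = format_codepoints('&#x%X;', 0x224E, 0x338);
my $string = format_codepoints('&#x%X;', 0x224E, 0x338);
```

In this sketch @list holds two separate references and $string holds the same two joined by a space.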

valid

This returns a truth value indicating whether the specified entity name is listed.

from_... Inverse Functions

Since Perl doesn't provide for overloading in the C++ sense, we need to clearly distinguish whether you are passing in a code point integer, or the character itself, or whatever other forms might be available. So the inverse functions match the names of the primary functions with the addition of from_ in front.

The inverse lookup table is not created until it is needed, the first time an inverse function is called for a given table. The inverse table is stored inside the main table, under a key whose name begins with a ";" character. Because entities are normally parsed out as terminating with a semicolon, you won't have an entity with a semicolon within the name! So names beginning with a semicolon are reserved for internal use, and if you access the charts directly (or use your own custom charts), you should ignore these keys.
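
A minimal sketch of that lazy caching scheme follows. It is not the module's code: the key name ";inverse" and the inverse_of function are hypothetical, only single-valued entries are handled, and ties are broken by sorted name purely to make the sketch deterministic.

```perl
use strict;
use warnings;
use 5.010;   # for the //= operator

# Hypothetical chart in which two names share one code point.
my %chart = (amp => 38, hellip => 0x2026, mldr => 0x2026);

# Build the inverse map on first use and cache it inside the chart itself.
# The key name ";inverse" is an assumption; only the leading ";" matters.
sub inverse_of {
    my ($chart) = @_;
    $chart->{';inverse'} ||= do {
        my %inv;
        for my $name (grep { !/^;/ } sort keys %$chart) {
            $inv{ $chart->{$name} } //= $name;   # first name in sorted order wins
        }
        \%inv;
    };
    return $chart->{';inverse'};
}

my $cp   = 0x2026;
my $name = inverse_of(\%chart)->{$cp};

# When walking a chart directly, skip the internal keys.
my @entities = grep { !/^;/ } sort keys %chart;
```

After the first call the chart contains the extra ";inverse" key, which is why code that iterates over a chart needs the grep shown on the last line.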

from_ordinal

If the argument contains more than one code point, it will try to match a multi-valued entity exactly. It will not take prefixes, change normalizations, or anything like that. You can pass an integer or an array ref containing integers to this function.

If multiple entities are defined that map to the same code point(s), it will simply return one of them essentially at random. There is no way to know which one is "better" for your purpose. However, it does check the tables in the order specified by the second argument, so you can put a custom table first that includes the answers you specifically want.

from_character

This is the inverse of character. It will return undef if no entity matches the argument. See notes on from_ordinal.

AUTHOR

John M. Dlugosz, <dlugosz AT cpan DOT com>

BUGS

Please report any bugs or feature requests to bug-html-entityreference at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-EntityReference. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc HTML::EntityReference

ACKNOWLEDGEMENTS

Thanks to Zsbán Ambrus for suggesting the handling of multiple charts. That pretty much made the module what it became.

Thanks to those on PerlMonks who chatted with me regarding the specifications and ideas.

LICENSE AND COPYRIGHT

Copyright 2011 John M. Dlugosz.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.
