NAME
HTML::Encoding - Determine the encoding of (X)HTML documents
SYNOPSIS
use HTML::Encoding;
# ...
my $encoding = get_encoding(
    headers       => $r->headers,
    string        => $r->content,
    check_bom     => 1,
    check_xmldecl => 0,
    check_meta    => 1,
);
DESCRIPTION
This module can be used to determine the encoding of HTML and XHTML files. It reports explicitly given encoding information, namely:
the charset parameter of the HTTP Content-Type header
the XML declaration
the byte order mark (BOM)
the meta element with http-equiv set to Content-Type
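To illustrate the first of these sources, here is a minimal sketch of pulling the charset parameter out of a Content-Type value. The helper name charset_from_content_type is hypothetical and the regular expression is a simplification; it is not the code this module uses.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Encoding: return the charset
# parameter of a Content-Type value in lower case, or nothing if the
# value carries no charset parameter.
sub charset_from_content_type {
    my ($value) = @_;
    return unless defined $value;

    # Accept both the charset=token and the charset="quoted" form.
    if ($value =~ /;\s*charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))/i) {
        return lc(defined $1 ? $1 : $2);
    }
    return;
}

print charset_from_content_type('text/html; charset=ISO-8859-1'), "\n";
```

The same kind of match would apply to the content attribute of a meta element, since it carries a Content-Type value as well.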
- get_encoding( %options )
This function takes a hash of configuration options as its argument. The following options are available:
- string
A string containing the (X)HTML document. The function assumes that any Content-Encoding or Content-Transfer-Encoding that was applied has already been removed.
- headers
An HTTP::Headers or Mail::Header object from which to extract the Content-Type header. Please note that by default LWP::UserAgent stores header values found in meta elements in the response headers. To turn this off, call the $ua->parse_head() method with a false value. get_encoding() only ever uses the first Content-Type header given; in most cases this should be the one from the original HTTP header.
- check_xmldecl
Checks the document for an XML declaration. If one is found, it tries to extract the value of the encoding pseudo-attribute. Please note that the XML declaration must not be preceded by any character. The default is no.
- check_bom
Checks the document for a byte order mark (BOM). The default is yes; it's always yes if check_xmldecl is set to a true value.
- check_meta
Checks the document for a meta element like
<meta http-equiv='Content-Type' content='text/html;charset=iso-8859-1'>
using HTML::HeadParser (or does nothing if it fails to load that module). The default is yes.
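As an illustration of what the check_bom and check_xmldecl checks look for, the following is a minimal sketch. The helpers encoding_from_bom and encoding_from_xmldecl are hypothetical names, not functions of this module, and the XML declaration pattern only handles ASCII-compatible input with nothing (not even a BOM) in front of the declaration.

```perl
use strict;
use warnings;

# Hypothetical helper: map a leading byte order mark to an encoding name.
# The four-byte marks are tested first because the UTF-32LE mark starts
# with the same two bytes as the UTF-16LE mark.
sub encoding_from_bom {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

# Hypothetical helper: extract the encoding pseudo-attribute from an XML
# declaration that sits at the very start of the byte string.
sub encoding_from_xmldecl {
    my ($bytes) = @_;
    if ($bytes =~ /\A<\?xml\s+version\s*=\s*(["'])[^"']*\1
                   \s+encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\2/x) {
        return $3;
    }
    return;
}

my $doc = qq{<?xml version="1.0" encoding="utf-8"?>\n<html/>};
print encoding_from_xmldecl($doc), "\n";
```

Both helpers return nothing when no explicit information is present, mirroring the idea that only explicitly given encoding information is reported.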
In list context it returns a list of hash references. Each hash reference consists of two key/value pairs, e.g.
[ { source => 4, encoding => 'utf-8' }, { source => 1, encoding => 'utf-8' } ]
The source value corresponds to one of the constants FROM_META, FROM_BOM, FROM_XMLDECL and FROM_HEADER. You can import these constants individually into your namespace or all together using the :constants tag, e.g.
use HTML::Encoding ':constants';
In scalar context it returns the value of the encoding key from the first entry in the list. The list is sorted according to the origin of the encoding information; see the list at the beginning of this document.
If no explicit encoding information is found, it returns undef. It's up to you to implement defaulting behaviour if this is applicable.
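A caller might handle the returned list roughly as sketched below. The sample data only mimics the documented shape of the result; first_encoding_or and sources_agree are hypothetical helpers, and the iso-8859-1 default is merely an assumption for this example (RFC 2616 names it as the default for text media types received via HTTP), not something this module prescribes.

```perl
use strict;
use warnings;

# Sample data in the documented shape. The source numbers are copied
# from the example above; their meaning is not asserted here.
my @found = (
    { source => 4, encoding => 'utf-8' },
    { source => 1, encoding => 'utf-8' },
);

# Hypothetical helper: the encoding of the first entry, or a
# caller-supplied default when no entry was found.
sub first_encoding_or {
    my ($default, @entries) = @_;
    return @entries ? $entries[0]{encoding} : $default;
}

# Hypothetical helper: true if all entries name the same encoding,
# compared case-insensitively (also true for an empty list).
sub sources_agree {
    my (@entries) = @_;
    my %seen;
    $seen{ lc $_->{encoding} } = 1 for @entries;
    return scalar(keys %seen) <= 1;
}

my $encoding = first_encoding_or('iso-8859-1', @found);
warn "conflicting encoding declarations\n" unless sources_agree(@found);
print "$encoding\n";
```

Checking for disagreement between sources is optional, but it can help to spot documents whose meta element contradicts the HTTP header.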
BUGS
The module does not recode the content before passing it to HTML::HeadParser (which only supports US-ASCII compatible encodings).
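One possible workaround on the caller's side is to recode the document into a US-ASCII compatible encoding first. The sketch below uses the core Encode module; recode_to_utf8 is a hypothetical helper, and it assumes the source encoding is already known (for instance from a byte order mark). This module itself does nothing of the kind.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert document bytes from a known encoding to
# UTF-8, which is US-ASCII compatible.
sub recode_to_utf8 {
    my ($bytes, $from) = @_;
    my $chars = decode($from, $bytes);    # bytes -> Perl characters
    return encode('UTF-8', $chars);       # characters -> UTF-8 bytes
}

# Build a small UTF-16LE sample and convert it back to ASCII-compatible bytes.
my $utf16 = encode('UTF-16LE',
    '<meta http-equiv="Content-Type" content="text/html;charset=utf-16">');
my $ascii_safe = recode_to_utf8($utf16, 'UTF-16LE');
print "$ascii_safe\n";
```

After such a conversion the markup consists of plain ASCII bytes again, so a parser limited to US-ASCII compatible input can find the meta element.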
WARNING
This module is currently at alpha stage; please note that the interface may change in subsequent versions.
SEE ALSO
http://www.w3.org/TR/REC-xml-20001006.htm#sec-guessing
http://www.w3.org/TR/1999/REC-html401-19991224/charset.html#h-5.2
http://www.ietf.org/rfc/rfc2854.txt
http://www.ietf.org/rfc/rfc2616.txt
RFC 2045 - RFC 2049
COPYRIGHT
Copyright (c) 2001 Björn Höhrmann
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
AUTHOR
Björn Höhrmann <bjoern@hoehrmann.de>