=head1 NAME
perlunifaq - Perl Unicode FAQ
=head1 Q and A
This is a list of questions and answers about Unicode in Perl, intended to be
read
after
L<perlunitut>.
=head2 perlunitut isn't really a Unicode tutorial, is it?
No, and this isn't really a Unicode FAQ.
Perl
has
an abstracted interface
for
all supported character encodings, so this
is actually a generic C<Encode> tutorial and C<Encode> FAQ. But many people
think that Unicode is special and magical, and I didn't want to disappoint
them, so I decided to call the document a Unicode tutorial.
=head2 What character encodings does Perl support?
To find out which character encodings your Perl supports, run:
perl -MEncode -le
"print for Encode->encodings(':all')"
=head2 Which version of perl should I
use
?
Well,
if
you can, upgrade to the most recent, but certainly C<5.8.1> or newer.
The tutorial and FAQ assume the latest release.
You should also check your modules, and upgrade them
if
necessary. For example,
HTML::Entities requires version >= 1.32 to function correctly, even though the
changelog is silent about this.
=head2 What about binary data, like images?
Well, apart from a bare C<
binmode
$fh
>, you shouldn't treat them specially.
(The
binmode
is needed because otherwise Perl may convert line endings on Win32
systems.)
Be careful, though, to never combine text strings
with
binary strings. If you
need text in a binary stream, encode your text strings first using the
appropriate encoding, then
join
them
with
binary strings. See also: "What
if
I
don't encode?".
=head2 When should I decode or encode?
Whenever you're communicating text
with
anything that is external to your perl
process, like a database, a text file, a
socket
, or another program. Even
if
the thing you're communicating
with
is also written in Perl.
=head2 What
if
I don't decode?
Whenever your encoded, binary string is used together
with
a text string, Perl
will assume that your binary string was encoded
with
ISO-8859-1, also known as
latin-1. If it wasn't latin-1, then your data is unpleasantly converted. For
example,
if
it was UTF-8, the individual bytes of multibyte characters are seen
as separate characters, and then again converted to UTF-8. Such double encoding
can be compared to double HTML encoding (C<
&
;gt;>), or double URI encoding
(C<%253E>).
This silent implicit decoding is known as
"upgrading"
. That may sound
positive, but it's best to avoid it.
=head2 What
if
I don't encode?
It depends on what you output and how you output it.
=head3 Output via a filehandle
=over
=item * If the string's characters are all code point 255 or lower, Perl
outputs bytes that match those code points. This is what happens
with
encoded
strings. It can also, though, happen
with
unencoded strings that happen to be
all code point 255 or lower.
=item * Otherwise, Perl outputs the string encoded as UTF-8. This only happens
with
strings you neglected to encode. Since that should not happen, Perl also
throws a
"wide character"
warning in this case.
=back
=head3 Other output mechanisms (e.g., C<
exec
>, C<
chdir
>, ..)
Your text string will be sent using the bytes in Perl's internal
format
.
Because the internal
format
is often UTF-8, these bugs are hard to spot,
because UTF-8 is usually the encoding you wanted! But don
't be lazy, and don'
t
use
the fact that Perl's internal
format
is UTF-8 to your advantage. Encode
explicitly to avoid weird bugs, and to show to maintenance programmers that you
thought this through.
=head2 Is there a way to automatically decode or encode?
If all data that comes from a certain handle is encoded in exactly the same
way, you can
tell
the PerlIO
system
to automatically decode everything,
with
the C<encoding> layer. If you
do
this, you can't accidentally forget to decode
or encode anymore, on things that
use
the layered handle.
You can provide this layer
when
C<
open
>ing the file:
open
my
$fh
,
'>:encoding(UTF-8)'
,
$filename
;
open
my
$fh
,
'<:encoding(UTF-8)'
,
$filename
;
Or
if
you already have an
open
filehandle:
binmode
$fh
,
':encoding(UTF-8)'
;
Some database drivers
for
DBI can also automatically encode and decode, but
that is sometimes limited to the UTF-8 encoding.
=head2 What
if
I don't know which encoding was used?
Do whatever you can to find out, and
if
you have to: guess. (Don't forget to
document your guess
with
a comment.)
You could
open
the document in a web browser, and change the character set or
character encoding
until
you can visually confirm that all characters look the
way they should.
There is
no
way to reliably detect the encoding automatically, so
if
people
keep sending you data without charset indication, you may have to educate them.
=head2 Can I
use
Unicode in
my
Perl sources?
Yes, you can! If your sources are UTF-8 encoded, you can indicate that
with
the
This doesn't
do
anything to your input, or to your output. It only influences
the way your sources are
read
. You can
use
Unicode in string literals, in
identifiers (but they still have to be
"word characters"
according to C<\w>),
and even in custom delimiters.
=head2 Data::Dumper doesn't restore the UTF8 flag; is it broken?
No, Data::Dumper's Unicode abilities are as they should be. There have been
some complaints that it should restore the UTF8 flag
when
the data is
read
again
with
C<
eval
>. However, you should really not look at the flag, and
nothing indicates that Data::Dumper should break this rule.
Here's what happens:
when
Perl reads in a string literal, it sticks to 8 bit
encoding as long as it can. (But perhaps originally it was internally encoded
as UTF-8,
when
you dumped it.) When it
has
to give that up because other
characters are added to the text string, it silently upgrades the string to
UTF-8.
If you properly encode your strings
for
output, none of this is of your
concern, and you can just C<
eval
> dumped data as always.
=head2 Why
do
regex character classes sometimes match only in the ASCII range?
Starting in Perl 5.14 (and partially in Perl 5.12), just put a
C<
use
feature
'unicode_strings'
> near the beginning of your program.
Within its lexical scope you shouldn't have this problem. It also is
automatically enabled under C<
use
feature
':5.12'
> or C<
use
v5.12> or
using C<-E> on the command line
for
Perl 5.12 or higher.
The rationale
for
requiring this is to not break older programs that
rely on the way things worked
before
Unicode came along. Those older
programs knew only about the ASCII character set, and so may not work
properly
for
additional characters. When a string is encoded in UTF-8,
Perl assumes that the program is prepared to deal
with
Unicode, but
when
the string isn't, Perl assumes that only ASCII
is wanted, and so those characters that are not ASCII
characters aren't recognized as to what they would be in Unicode.
C<
use
feature
'unicode_strings'
> tells Perl to treat all characters as
Unicode, whether the string is encoded in UTF-8 or not, thus avoiding
the problem.
However, on earlier Perls, or
if
you pass strings to subroutines outside
the feature's scope, you can force Unicode rules by changing the
encoding to UTF-8 by doing C<utf8::upgrade(
$string
)>. This can be used
safely on any string, as it checks and does not change strings that have
already been upgraded.
For a more detailed discussion, see L<Unicode::Semantics> on CPAN.
=head2 Why
do
some characters not uppercase or lowercase correctly?
See the answer to the previous question.
=head2 How can I determine
if
a string is a text string or a binary string?
You can
't. Some use the UTF8 flag for this, but that'
s misuse, and makes well
behaved modules like Data::Dumper look bad. The flag is useless
for
this
purpose, because it's off
when
an 8 bit encoding (by
default
ISO-8859-1) is
used to store the string.
This is something you, the programmer,
has
to keep track of; sorry. You could
consider adopting a kind of
"Hungarian notation"
to help
with
this.
=head2 How
do
I convert from encoding FOO to encoding BAR?
By first converting the FOO-encoded byte string to a text string, and then the
text string to a BAR-encoded byte string:
my
$text_string
= decode(
'FOO'
,
$foo_string
);
my
$bar_string
= encode(
'BAR'
,
$text_string
);
or by skipping the text string part, and going directly from one binary
encoding to the other:
from_to(
$string
,
'FOO'
,
'BAR'
);
or by letting automatic decoding and encoding
do
all the work:
open
my
$foofh
,
'<:encoding(FOO)'
,
'example.foo.txt'
;
open
my
$barfh
,
'>:encoding(BAR)'
,
'example.bar.txt'
;
print
{
$barfh
}
$_
while
<
$foofh
>;
=head2 What are C<decode_utf8> and C<encode_utf8>?
These are alternate syntaxes
for
C<decode(
'utf8'
, ...)> and C<encode(
'utf8'
,
...)>. Do not
use
these functions
for
data exchange. Instead
use
C<decode(
'UTF-8'
, ...)> and C<encode(
'UTF-8'
, ...)>; see
L</What's the difference between UTF-8 and utf8?> below.
=head2 What is a
"wide character"
?
This is a term used
for
characters occupying more than one byte.
The Perl warning
"Wide character in ..."
is caused by such a character.
With
no
specified encoding layer, Perl tries to
fit things into a single byte. When it can't, it
emits this warning (
if
warnings are enabled), and uses UTF-8 encoded data
instead.
To avoid this warning and to avoid having different output encodings in a single
stream, always specify an encoding explicitly,
for
example
with
a PerlIO layer:
binmode
STDOUT,
":encoding(UTF-8)"
;
=head1 INTERNALS
=head2 What is
"the UTF8 flag"
?
Please,
unless
you
're hacking the internals, or debugging weirdness, don'
t
think about the UTF8 flag at all. That means that you very probably shouldn't
use
C<is_utf8>, C<_utf8_on> or C<_utf8_off> at all.
The UTF8 flag, also called SvUTF8, is an internal flag that indicates that the
current internal representation is UTF-8. Without the flag, it is assumed to be
ISO-8859-1. Perl converts between these automatically. (Actually Perl usually
assumes the representation is ASCII; see L</Why
do
regex character classes
sometimes match only in the ASCII range?> above.)
One of Perl
's internal formats happens to be UTF-8. Unfortunately, Perl can'
t
keep a secret, so everyone knows about this. That is the source of much
confusion. It's better to pretend that the internal
format
is some unknown
encoding, and that you always have to encode and decode explicitly.
=head2 What about the C<
use
bytes> pragma?
Don't
use
it. It makes
no
sense to deal
with
bytes in a text string, and it
makes
no
sense to deal
with
characters in a byte string. Do the proper
conversions (by decoding/encoding), and things will work out well: you get
character counts
for
decoded data, and byte counts
for
encoded data.
C<
use
bytes> is usually a failed attempt to
do
something useful. Just forget
about it.
=head2 What about the C<
use
encoding> pragma?
Don
't use it. Unfortunately, it assumes that the programmer'
s environment and
that of the user will
use
the same encoding. It will
use
the same encoding
for
the source code and
for
STDIN and STDOUT. When a program is copied to another
machine, the source code does not change, but the STDIO environment might.
If you need non-ASCII characters in your source code, make it a UTF-8 encoded
If you need to set the encoding
for
STDIN, STDOUT, and STDERR,
for
example
based on the user's locale, C<
use
open
>.
=head2 What is the difference between C<:encoding> and C<:utf8>?
Because UTF-8 is one of Perl's internal formats, you can often just skip the
encoding or decoding step, and manipulate the UTF8 flag directly.
Instead of C<:encoding(UTF-8)>, you can simply
use
C<:utf8>, which skips the
encoding step
if
the data was already represented as UTF8 internally. This is
widely accepted as good behavior
when
you're writing, but it can be dangerous
when
reading, because it causes internal inconsistency
when
you have invalid
byte sequences. Using C<:utf8>
for
input can sometimes result in security
breaches, so please
use
C<:encoding(UTF-8)> instead.
Instead of C<decode> and C<encode>, you could
use
C<_utf8_on> and C<_utf8_off>,
but this is considered bad style. Especially C<_utf8_on> can be dangerous,
for
the same reason that C<:utf8> can.
There are some shortcuts
for
oneliners;
see L<-C in perlrun|perlrun/-C [numberE<sol>list]>.
=head2 What's the difference between C<UTF-8> and C<utf8>?
C<UTF-8> is the official standard. C<utf8> is Perl's way of being liberal in
what it accepts. If you have to communicate
with
things that aren't so liberal,
you may want to consider using C<UTF-8>. If you have to communicate
with
things
that are too liberal, you may have to
use
C<utf8>. The full explanation is in
L<Encode/
"UTF-8 vs. utf8 vs. UTF8"
>.
C<UTF-8> is internally known as C<utf-8-strict>. The tutorial uses UTF-8
consistently, even where utf8 is actually used internally, because the
distinction can be hard to make, and is mostly irrelevant.
For example, utf8 can be used
for
code points that don't exist in Unicode, like
9999999, but
if
you encode that to UTF-8, you get a substitution character (by
default
; see L<Encode/
"Handling Malformed Data"
>
for
more ways of dealing
with
this.)
Okay,
if
you insist: the
"internal format"
is utf8, not UTF-8. (When it's not
some other encoding.)
=head2 I lost track; what encoding is the internal
format
really?
It
's good that you lost track, because you shouldn'
t depend on the internal
format
being any specific encoding. But since you asked: by
default
, the
internal
format
is either ISO-8859-1 (latin-1), or utf8, depending on the
history of the string. On EBCDIC platforms, this may be different even.
Perl knows how it stored the string internally, and will
use
that knowledge
when
you C<encode>. In other words: don't
try
to find out what the internal
encoding
for
a certain string is, but instead just encode it into the encoding
that you want.
=head1 AUTHOR
Juerd Waalboer <
=head1 SEE ALSO
L<perlunicode>, L<perluniintro>, L<Encode>