=encoding utf8
=head1 NAME
perlunicook - cookbookish examples of handling Unicode in Perl
=head1 DESCRIPTION
This manpage contains short recipes demonstrating how to handle common Unicode
operations in Perl, plus one complete program at the end. Any undeclared
variables in individual recipes are assumed to have a previous appropriate
value in them.
=head1 EXAMPLES
=head2 ℞ 0: Standard preamble
Unless otherwise notes, all examples below
require
this standard preamble
to work correctly,
with
the C<
use
open
qw(:std :encoding(UTF-8)
);
This I<does> make even Unix programmers C<
binmode
> your binary streams,
or
open
them
with
C<:raw>, but that's the only way to get at them
portably anyway.
B<WARNING>: C<
use
autodie> (pre 2.26) and C<
use
open
>
do
not get along
with
each
other.
=head2 ℞ 1: Generic Unicode-savvy filter
Always decompose on the way in, then recompose on the way out.
while
(<>) {
$_
= NFD(
$_
);
...
}
continue
{
print
NFC(
$_
);
}
=head2 ℞ 2: Fine-tuning Unicode warnings
As of v5.14, Perl distinguishes three subclasses of UTF‑8 warnings.
no
warnings
"nonchar"
;
no
warnings
"surrogate"
;
no
warnings
"non_unicode"
;
=head2 ℞ 3: Declare source in utf8
for
identifiers and literals
Without the all-critical C<
use
utf8> declaration, putting UTF‑8 in your
literals and identifiers won’t work right. If you used the standard
preamble just
given
above, this already happened. If you did, you can
do
things like this:
my
$measure
=
"Ångström"
;
my
@μsoft =
qw( cp852 cp1251 cp1252 )
;
my
@ὑπέρμεγας =
qw( ὑπέρ μεγας )
;
my
@鯉 =
qw( koi8-f koi8-u koi8-r )
;
my
$motto
=
"👪 💗 🐪"
;
If you forget C<
use
utf8>, high bytes will be misunderstood as
separate characters, and nothing will work right.
=head2 ℞ 4: Characters and their numbers
The C<
ord
> and C<
chr
> functions work transparently on all codepoints,
not just on ASCII alone — nor in fact, not even just on Unicode alone.
ord
(
"A"
)
chr
(65)
ord
(
"Σ"
)
chr
(0x3A3)
ord
(
"𝑛"
)
chr
(0x1D45B)
ord
(
"\x{20_0000}"
)
chr
(0x20_0000)
=head2 ℞ 5: Unicode literals by character number
In an interpolated literal, whether a double-quoted string or a
regex, you may specify a character by its number using the
C<\x{I<HHHHHH>}> escape.
String:
"\x{3a3}"
Regex: /\x{3a3}/
String:
"\x{1d45b}"
Regex: /\x{1d45b}/
/[\x{1D434}-\x{1D467}]/
=head2 ℞ 6: Get character name by number
my
$name
= charnames::viacode(0x03A3);
=head2 ℞ 7: Get character number by name
my
$number
= charnames::vianame(
"GREEK CAPITAL LETTER SIGMA"
);
=head2 ℞ 8: Unicode named characters
Use the C<< \N{I<charname>} >> notation to get the character
by that name
for
use
in interpolated literals (double-quoted
strings and regexes). In v5.16, there is an implicit
But prior to v5.16, you must be explicit about which set of charnames you
want. The C<:full> names are the official Unicode character name, alias, or
sequence, which all share a namespace.
"\N{MATHEMATICAL ITALIC SMALL N}"
"\N{GREEK CAPITAL LETTER SIGMA}"
Anything
else
is a Perl-specific convenience abbreviation. Specify one or
more scripts by names
if
you want short names that are script-specific.
"\N{Greek:Sigma}"
"\N{ae}"
"\N{epsilon}"
The v5.16 release also supports a C<:loose>
import
for
loose matching of
character names, which works just like loose matching of property names:
that is, it disregards case, whitespace, and underscores:
"\N{euro sign}"
Starting in v5.32, you can also
use
qr/\p{name=euro sign}/
to get official Unicode named characters in regular expressions. Loose
matching is always done
for
these.
=head2 ℞ 9: Unicode named sequences
These look just like character names but
return
multiple codepoints.
Notice the C<
%vx
> vector-
print
functionality in C<
printf
>.
my
$seq
=
"\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
;
printf
"U+%v04X\n"
,
$seq
;
U+0100.0300
=head2 ℞ 10: Custom named characters
Use C<:alias> to give your own lexically scoped nicknames to existing
characters, or even to give unnamed private-
use
characters useful names.
ecute
=>
"LATIN SMALL LETTER E WITH ACUTE"
,
"APPLE LOGO"
=> 0xF8FF,
};
"\N{ecute}"
"\N{APPLE LOGO}"
=head2 ℞ 11: Names of CJK codepoints
Sinograms like “東京” come back
with
character names of
C<CJK UNIFIED IDEOGRAPH-6771> and C<CJK UNIFIED IDEOGRAPH-4EAC>,
because their “names” vary. The CPAN C<Unicode::Unihan> module
has
a large database
for
decoding these (and a whole lot more), provided you
know how to understand its output.
my
$str
=
"東京"
;
my
$unhan
= Unicode::Unihan->new;
for
my
$lang
(
qw(Mandarin Cantonese Korean JapaneseOn JapaneseKun)
) {
printf
"CJK $str in %-12s is "
,
$lang
;
say
$unhan
->
$lang
(
$str
);
}
prints:
CJK 東京 in Mandarin is DONG1JING1
CJK 東京 in Cantonese is dung1ging1
CJK 東京 in Korean is TONGKYENG
CJK 東京 in JapaneseOn is TOUKYOU KEI KIN
CJK 東京 in JapaneseKun is HIGASHI AZUMAMIYAKO
If you have a specific romanization scheme in mind,
my
$k2r
= Lingua::JA::Romanize::Japanese->new;
my
$str
=
"東京"
;
say
"Japanese for $str is "
,
$k2r
->chars(
$str
);
prints
Japanese
for
東京 is toukyou
=head2 ℞ 12: Explicit encode/decode
On rare occasion, such as a database
read
, you may be
given
encoded text you need to decode.
my
$chars
= decode(
"shiftjis"
,
$bytes
, 1);
my
$bytes
= encode(
"MIME-Header-ISO_2022_JP"
,
$chars
, 1);
For streams all in the same encoding, don't
use
encode/decode; instead
set the file encoding
when
you
open
the file or immediately
after
with
C<
binmode
> as described later below.
=head2 ℞ 13: Decode program arguments as utf8
$ perl -CA ...
or
$ export PERL_UNICODE=A
or
@ARGV
=
map
{ decode(
'UTF-8'
,
$_
, 1) }
@ARGV
;
=head2 ℞ 14: Decode program arguments as locale encoding
@ARGV
=
map
{ decode(
locale
=>
$_
, 1) }
@ARGV
;
=head2 ℞ 15: Declare STD{IN,OUT,ERR} to be utf8
Use a command-line option, an environment variable, or
else
call C<
binmode
> explicitly:
$ perl -CS ...
or
$ export PERL_UNICODE=S
or
use
open
qw(:std :encoding(UTF-8)
);
or
binmode
(STDIN,
":encoding(UTF-8)"
);
binmode
(STDOUT,
":utf8"
);
binmode
(STDERR,
":utf8"
);
=head2 ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding
binmode
STDIN,
":encoding(console_in)"
if
-t STDIN;
binmode
STDOUT,
":encoding(console_out)"
if
-t STDOUT;
binmode
STDERR,
":encoding(console_out)"
if
-t STDERR;
=head2 ℞ 17: Make file I/O
default
to utf8
Files opened without an encoding argument will be in UTF-8:
$ perl -CD ...
or
$ export PERL_UNICODE=D
or
use
open
qw(:encoding(UTF-8)
);
=head2 ℞ 18: Make all I/O and args
default
to utf8
$ perl -CSDA ...
or
$ export PERL_UNICODE=SDA
or
use
open
qw(:std :encoding(UTF-8)
);
@ARGV
=
map
{ decode(
'UTF-8'
,
$_
, 1) }
@ARGV
;
=head2 ℞ 19: Open file
with
specific encoding
Specify stream encoding. This is the normal way
to deal
with
encoded text, not by calling low-level
functions.
open
(
my
$in_file
,
"< :encoding(UTF-16)"
,
"wintext"
);
OR
open
(
my
$in_file
,
"<"
,
"wintext"
);
binmode
(
$in_file
,
":encoding(UTF-16)"
);
THEN
my
$line
= <
$in_file
>;
open
(
$out_file
,
"> :encoding(cp1252)"
,
"wintext"
);
OR
open
(
my
$out_file
,
">"
,
"wintext"
);
binmode
(
$out_file
,
":encoding(cp1252)"
);
THEN
print
$out_file
"some text\n"
;
More layers than just the encoding can be specified here. For example,
the incantation C<
":raw :encoding(UTF-16LE) :crlf"
> includes implicit
CRLF handling.
=head2 ℞ 20: Unicode casing
Unicode casing is very different from ASCII casing.
uc
(
"henry ⅷ"
)
uc
(
"tschüß"
)
"tschüß"
=~ /TSCHÜSS/i
"Σίσυφος"
=~ /ΣΊΣΥΦΟΣ/i
=head2 ℞ 21: Unicode case-insensitive comparisons
Also available in the CPAN L<Unicode::CaseFold> module,
the new C<fc> “foldcase” function from v5.16 grants
access to the same Unicode casefolding as the C</i>
pattern modifier
has
always used:
my
@sorted
=
sort
{ fc(
$a
) cmp fc(
$b
) }
@list
;
fc(
"tschüß"
) eq fc(
"TSCHÜSS"
)
fc(
"Σίσυφος"
) eq fc(
"ΣΊΣΥΦΟΣ"
)
=head2 ℞ 22: Match Unicode linebreak sequence in regex
A Unicode linebreak matches the two-character CRLF
grapheme or any of seven vertical whitespace characters.
Good
for
dealing
with
textfiles coming from different
operating systems.
\R
s/\R/\n/g;
=head2 ℞ 23: Get character category
Find the general category of a numeric codepoint.
my
$cat
= charinfo(0x3A3)->{category};
=head2 ℞ 24: Disabling Unicode-awareness in builtin charclasses
Disable C<\w>, C<\b>, C<\s>, C<\d>, and the POSIX
classes from working correctly on Unicode either in this
scope, or in just one regex.
my
(
$num
) =
$str
=~ /(\d+)/a;
Or
use
specific un-Unicode properties, like C<\p{ahex}>
and C<\p{POSIX_Digit>}. Properties still work normally
no
matter what charset modifiers (C</d /u /l /a /aa>)
should be effect.
=head2 ℞ 25: Match Unicode properties in regex
with
\p, \P
These all match a single codepoint
with
the
given
property. Use C<\P> in place of C<\p> to match
one codepoint lacking that property.
\pL, \pN, \pS, \pP, \pM, \pZ, \pC
\p{Sk}, \p{Ps}, \p{Lt}
\p{alpha}, \p{upper}, \p{lower}
\p{Latin}, \p{Greek}
\p{script_extensions=Latin}, \p{scx=Greek}
\p{East_Asian_Width=Wide}, \p{EA=W}
\p{Line_Break=Hyphen}, \p{LB=HY}
\p{Numeric_Value=4}, \p{NV=4}
=head2 ℞ 26: Custom character properties
Define at compile-
time
your own custom character
properties
for
use
in regexes.
sub
In_Tengwar {
"E000\tE07F\n"
}
if
(/\p{In_Tengwar}/) { ... }
sub
Is_GraecoRoman_Title {<<
'END_OF_SET'
}
+utf8::IsLatin
+utf8::IsGreek
&utf8::IsTitle
END_OF_SET
if
(/\p{Is_GraecoRoman_Title}/ { ... }
=head2 ℞ 27: Unicode normalization
Typically render into NFD on input and NFC on output. Using NFKC or NFKD
functions improves recall on searches, assuming you've already done to the
same text to be searched. Note that this is about much more than just pre-
combined compatibility glyphs; it also reorders marks according to their
canonical combining classes and weeds out singletons.
my
$nfd
= NFD(
$orig
);
my
$nfc
= NFC(
$orig
);
my
$nfkd
= NFKD(
$orig
);
my
$nfkc
= NFKC(
$orig
);
=head2 ℞ 28: Convert non-ASCII Unicode numerics
Unless you’ve used C</a> or C</aa>, C<\d> matches more than
ASCII digits only, but Perl’s implicit string-to-number
conversion does not current recognize these. Here’s how to
convert such strings manually.
my
$str
=
"got Ⅻ and ४५६७ and ⅞ and here"
;
my
@nums
= ();
while
(
$str
=~ /(\d+|\N)/g) {
push
@nums
, num($1);
}
say
"@nums"
;
my
$nv
= num(
"\N{RUMI DIGIT ONE}\N{RUMI DIGIT TWO}"
);
=head2 ℞ 29: Match Unicode grapheme cluster in regex
Programmer-visible “characters” are codepoints matched by C</./s>,
but user-visible “characters” are graphemes matched by C</\X/>.
my
$nfd
= NFD(
$orig
);
$nfd
=~ / (?=[aeiou]) \X /xi
=head2 ℞ 30: Extract by grapheme instead of by codepoint (regex)
my
(
$first_five
) =
$str
=~ /^ ( \X{5} ) /x;
=head2 ℞ 31: Extract by grapheme instead of by codepoint (
substr
)
my
$gcs
= Unicode::GCString->new(
$str
);
my
$first_five
=
$gcs
->
substr
(0, 5);
=head2 ℞ 32: Reverse string by grapheme
Reversing by codepoint messes up diacritics, mistakenly converting
C<crème brûlée> into C<éel̂urb em̀erc> instead of into C<eélûrb emèrc>;
so
reverse
by grapheme instead. Both these approaches work
right
no
matter what normalization the string is in:
$str
=
join
(
""
,
reverse
$str
=~ /\X/g);
$str
=
reverse
Unicode::GCString->new(
$str
);
=head2 ℞ 33: String
length
in graphemes
The string C<brûlée>
has
six graphemes but up to eight codepoints.
This counts by grapheme, not by codepoint:
my
$str
=
"brûlée"
;
my
$count
= 0;
while
(
$str
=~ /\X/g) {
$count
++ }
my
$gcs
= Unicode::GCString->new(
$str
);
my
$count
=
$gcs
->
length
;
=head2 ℞ 34: Unicode column-width
for
printing
Perl’s C<
printf
>, C<
sprintf
>, and C<
format
> think all
codepoints take up 1
print
column, but many take 0 or 2.
Here to show that normalization makes
no
difference,
we
print
out both forms:
my
@words
=
qw/crème brûlée/
;
@words
=
map
{ NFC(
$_
), NFD(
$_
) }
@words
;
for
my
$str
(
@words
) {
my
$gcs
= Unicode::GCString->new(
$str
);
my
$cols
=
$gcs
->columns;
my
$pad
=
" "
x (10 -
$cols
);
say
str,
$pad
,
" |"
;
}
generates this to show that it pads correctly
no
matter
the normalization:
crème |
crème |
brûlée |
brûlée |
=head2 ℞ 35: Unicode collation
Text sorted by numeric codepoint follows
no
reasonable alphabetic order;
use
the UCA
for
sorting text.
my
$col
= Unicode::Collate->new();
my
@list
=
$col
->
sort
(
@old_list
);
See the I<ucsort> program from the L<Unicode::Tussle> CPAN module
for
a convenient command-line interface to this module.
=head2 ℞ 36: Case- I<and> accent-insensitive Unicode
sort
Specify a collation strength of level 1 to ignore case and
diacritics, only looking at the basic character.
my
$col
= Unicode::Collate->new(
level
=> 1);
my
@list
=
$col
->
sort
(
@old_list
);
=head2 ℞ 37: Unicode locale collation
Some locales have special sorting rules.
my
$col
= Unicode::Collate::Locale->new(
locale
=>
"de__phonebook"
);
my
@list
=
$col
->
sort
(
@old_list
);
The I<ucsort> program mentioned above accepts a C<--locale> parameter.
=head2 ℞ 38: Making C<cmp> work on text instead of codepoints
Instead of this:
@srecs
=
sort
{
$b
->{AGE} <=>
$a
->{AGE}
||
$a
->{NAME} cmp
$b
->{NAME}
}
@recs
;
Use this:
my
$coll
= Unicode::Collate->new();
for
my
$rec
(
@recs
) {
$rec
->{NAME_key} =
$coll
->getSortKey(
$rec
->{NAME} );
}
@srecs
=
sort
{
$b
->{AGE} <=>
$a
->{AGE}
||
$a
->{NAME_key} cmp
$b
->{NAME_key}
}
@recs
;
=head2 ℞ 39: Case- I<and> accent-insensitive comparisons
Use a collator object to compare Unicode text by character
instead of by codepoint.
my
$es
= Unicode::Collate->new(
level
=> 1,
normalization
=>
undef
);
$es
->eq(
"García"
,
"GARCIA"
);
$es
->eq(
"Márquez"
,
"MARQUEZ"
);
=head2 ℞ 40: Case- I<and> accent-insensitive locale comparisons
Same, but in a specific locale.
my
$de
= Unicode::Collate::Locale->new(
locale
=>
"de__phonebook"
,
);
$de
->eq(
"tschüß"
,
"TSCHUESS"
);
=head2 ℞ 41: Unicode linebreaking
Break up text into lines according to Unicode rules.
my
$para
=
"This is a super\N{HYPHEN}long string. "
x 20;
my
$fmt
= Unicode::LineBreak->new;
print
$fmt
->break(
$para
),
"\n"
;
=head2 ℞ 42: Unicode text in DBM hashes, the tedious way
Using a regular Perl string as a key or value
for
a DBM
hash will trigger a wide character exception
if
any codepoints
won’t fit into a byte. Here’s how to manually manage the translation:
tie
%dbhash
,
"DB_File"
,
"pathname"
;
my
$enc_key
= encode(
"UTF-8"
,
$uni_key
, 1);
my
$enc_value
= encode(
"UTF-8"
,
$uni_value
, 1);
$dbhash
{
$enc_key
} =
$enc_value
;
my
$enc_key
= encode(
"UTF-8"
,
$uni_key
, 1);
my
$enc_value
=
$dbhash
{
$enc_key
};
my
$uni_value
= decode(
"UTF-8"
,
$enc_value
, 1);
=head2 ℞ 43: Unicode text in DBM hashes, the easy way
Here’s how to implicitly manage the translation; all encoding
and decoding is done automatically, just as
with
streams that
have a particular encoding attached to them:
my
$dbobj
=
tie
%dbhash
,
"DB_File"
,
"pathname"
;
$dbobj
->Filter_Value(
"utf8"
);
$dbhash
{
$uni_key
} =
$uni_value
;
my
$uni_value
=
$dbhash
{
$uni_key
};
=head2 ℞ 44: PROGRAM: Demo of Unicode collation and printing
Here’s a full program showing how to make
use
of locale-sensitive
sorting, Unicode casing, and managing
print
widths
when
some of the
characters take up zero or two columns, not just one column
each
time
.
When run, the following program produces this nicely aligned output:
Crème Brûlée....... €2.00
Éclair............. €1.60
Fideuà............. €4.20
Hamburger.......... €6.00
Jamón Serrano...... €4.45
Linguiça........... €7.00
Pâté............... €4.15
Pears.............. €2.00
Pêches............. €2.25
Smørbrød........... €5.75
Spätzle............ €5.50
Xoriço............. €3.00
Γύρος.............. €6.50
막걸리............. €4.00
おもち............. €2.65
お好み焼き......... €8.00
シュークリーム..... €1.85
寿司............... €9.99
包子............... €7.50
Here's that program.
use
open
qw(:std :encoding(UTF-8)
);
my
%price
= (
"γύρος"
=> 6.50,
"pears"
=> 2.00,
"linguiça"
=> 7.00,
"xoriço"
=> 3.00,
"hamburger"
=> 6.00,
"éclair"
=> 1.60,
"smørbrød"
=> 5.75,
"spätzle"
=> 5.50,
"包子"
=> 7.50,
"jamón serrano"
=> 4.45,
"pêches"
=> 2.25,
"シュークリーム"
=> 1.85,
"막걸리"
=> 4.00,
"寿司"
=> 9.99,
"おもち"
=> 2.65,
"crème brûlée"
=> 2.00,
"fideuà"
=> 4.20,
"pâté"
=> 4.15,
"お好み焼き"
=> 8.00,
);
my
$width
= 5 + max
map
{ colwidth(
$_
) }
keys
%price
;
my
$coll
= Unicode::Collate::Locale->new(
locale
=>
"ja"
);
for
my
$item
(
$coll
->
sort
(
keys
%price
)) {
print
pad(entitle(
$item
),
$width
,
"."
);
printf
" €%.2f\n"
,
$price
{
$item
};
}
sub
pad (
$str
,
$width
,
$padchar
) {
return
$str
. (
$padchar
x (
$width
- colwidth(
$str
)));
}
sub
colwidth (
$str
) {
return
Unicode::GCString->new(
$str
)->columns;
}
sub
entitle (
$str
) {
$str
=~ s{ (?=\pL)(\S) (\S*) }
{
ucfirst
($1) .
lc
($2) }xge;
return
$str
;
}
=head1 SEE ALSO
See these manpages, some of which are CPAN modules:
L<perlunicode>, L<perluniprops>,
L<perlre>, L<perlrecharclass>,
L<perluniintro>, L<perlunitut>, L<perlunifaq>,
L<PerlIO>, L<DB_File>, L<DBM_Filter>, L<DBM_Filter::utf8>,
L<Encode>, L<Encode::Locale>,
L<Unicode::UCD>,
L<Unicode::Normalize>,
L<Unicode::GCString>, L<Unicode::LineBreak>,
L<Unicode::Collate>, L<Unicode::Collate::Locale>,
L<Unicode::Unihan>,
L<Unicode::CaseFold>,
L<Unicode::Tussle>,
L<Lingua::JA::Romanize::Japanese>,
L<Lingua::ZH::Romanize::Pinyin>,
L<Lingua::KO::Romanize::Hangul>.
The L<Unicode::Tussle> CPAN module includes many programs
to help
with
working
with
Unicode, including
these programs to fully or partly replace standard utilities:
I<tcgrep> instead of I<egrep>,
I<uniquote> instead of I<cat -v> or I<hexdump>,
I<uniwc> instead of I<wc>,
I<unilook> instead of I<look>,
I<unifmt> instead of I<fmt>,
and
I<ucsort> instead of I<
sort
>.
For exploring Unicode character names and character properties,
see its I<uniprops>, I<unichars>, and I<uninames> programs.
It also supplies these programs, all of which are general filters that
do
Unicode-y things:
I<unititle> and I<unicaps>;
I<uniwide> and I<uninarrow>;
I<unisupers> and I<unisubs>;
I<nfd>, I<nfc>, I<nfkd>, and I<nfkc>;
and I<
uc
>, I<
lc
>, and I<tc>.
Finally, see the published Unicode Standard (page numbers are from version
6.0.0), including these specific annexes and technical reports:
=over
=item §3.13 Default Case Algorithms, page 113;
§4.2 Case, pages 120–122;
Case Mappings, page 166–172, especially Caseless Matching starting on page 170.
=item UAX
=item UTS
=item UAX
=item UTS
=item UAX
=item UAX
=item UAX
=back
=head1 AUTHOR
Tom Christiansen E<lt>tchrist
@perl
.comE<gt> wrote this,
with
occasional
kibbitzing from Larry Wall and Jeffrey Friedl in the background.
=head1 COPYRIGHT AND LICENCE
Copyright © 2012 Tom Christiansen.
This program is free software; you may redistribute it and/or modify it
under the same terms as Perl itself.
Most of these examples taken from the current edition of the “Camel Book”;
that is, from the 4ᵗʰ Edition of I<Programming Perl>, Copyright © 2012 Tom
Christiansen <et al.>, 2012-02-13 by O’Reilly Media. The code itself is
freely redistributable, and you are encouraged to transplant, fold,
spindle, and mutilate any of the examples in this manpage however you please
for
inclusion into your own programs without any encumbrance whatsoever.
Acknowledgement via code comment is polite but not required.
=head1 REVISION HISTORY
v1.0.0 – first public release, 2012-02-27