NAME
unichars - list characters for one or more properties
SYNOPSIS
unichars [options] criterion ...
Each criterion is either a square-bracketed character class, a regex starting with a backslash, or an arbitrary Perl expression. See the EXAMPLES section below.
OPTIONS:
Selection Options:
--bmp include the Basic Multilingual Plane (plane 0) [DEFAULT]
--smp include the Supplementary Multilingual Plane (plane 1)
--astral -a include planes above the BMP (planes 1-16)
--unnamed -u include various unnamed characters (see DESCRIPTION)
--locale -l specify the locale used for UCA functions
Display Options:
--category -c include the general category (GC=)
--script -s include the script name (SC=)
--block -b include the block name (BLK=)
--bidi -B include the bidi class (BC=)
--combining -C include the canonical combining class (CCC=)
--numeric -n include the numeric value (NV=)
--casefold -f include the casefold status
--decimal -d include the decimal representation of the code point
Miscellaneous Options:
--version -v version information and exit
--help -h this message
--man -m full manpage
--debug -d show debugging of criteria and examined code point span
Special Functions:
$_ is the current code point
ord is the current code point's ordinal
NAME is charnames::viacode(ord)
NUM is Unicode::UCD::num(ord), not code point number
CF is casefold->{status}
NFD, NFC, NFKD, NFKC, FCD, FCC (normalization)
UCA, UCA1, UCA2, UCA3, UCA4 (binary sort keys)
Singleton, Exclusion, NonStDecomp, Comp_Ex
checkNFD, checkNFC, checkNFKD, checkNFKC, checkFCD, checkFCC
NFD_NO, NFC_NO, NFC_MAYBE, NFKD_NO, NFKC_NO, NFKC_MAYBE
DESCRIPTION
The unichars program reports which characters match all selection criteria anded together.
A criterion beginning with a square bracket or a backslash is assumed to be a regular expression. Anything else is a Perl expression such as you might pass to the Perl grep function. The $_ variable is set to each successive Unicode character, and if all criteria match, that character is displayed. The numeric code point is therefore accessible as ord.
The special token NAME is set to the full name of the current code point. Also, the tokens NFD, NFKD, NFC, and NFKC are set to the corresponding normalization form.
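As a rough illustration of what these tokens stand for, the sketch below looks up a name with charnames::viacode and compares the NFD and NFKD forms from Unicode::Normalize for a single code point. The function name describe_cp and its output format are hypothetical and chosen only for this example; it assumes a Perl installation that includes both core modules.

```shell
# Hypothetical helper, not part of unichars: show what NAME and the
# expression NFD ne NFKD would evaluate to for one code point given in hex.
describe_cp() {
    perl -MUnicode::Normalize=NFD,NFKD -e '
        use charnames ();                        # for charnames::viacode
        my $cp   = hex shift;                    # e.g. "3D0" -> 0x3D0
        my $char = chr $cp;
        my $name = charnames::viacode($cp) // "<unnamed codepoint>";
        my $diff = NFD($char) ne NFKD($char) ? "differ" : "same";
        printf "%05X %s (NFD and NFKD %s)\n", $cp, $name, $diff;
    ' "$1"
}

describe_cp 3D0   # 003D0 GREEK BETA SYMBOL (NFD and NFKD differ)
describe_cp 3B2   # 003B2 GREEK SMALL LETTER BETA (NFD and NFKD same)
```

U+03D0 has only a compatibility decomposition (to U+03B2), so its canonical and compatibility forms differ, while the plain letter U+03B2 decomposes to itself in both.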
By default only plane 0, the Basic Multilingual Plane, is examined. For plane 1, the Supplementary Multilingual Plane, use --smp. To examine both, specify both the --bmp and --smp options, or -bs. To include all valid code points, use the -a or --astral option.
Unless the --unnamed option is given, characters with any of the properties Unassigned, PrivateUse, Han, or InHangulSyllables will be excluded.
EXAMPLES
Count all non-ASCII digits:
$ unichars -a '\d' '\P{ASCII}' | wc -l
401
Find all line terminators:
$ unichars '\R'
-- 10 0000A LINE FEED (LF)
-- 11 0000B LINE TABULATION
-- 12 0000C FORM FEED (FF)
-- 13 0000D CARRIAGE RETURN (CR)
-- 133 00085 NEXT LINE (NEL)
-- 8232 02028 LINE SEPARATOR
-- 8233 02029 PARAGRAPH SEPARATOR
Find what is not \s but is [\h\v]:
$ unichars '\S' '[\h\v]'
-- 11 0000B LINE TABULATION
Count how many code points in the Basic Multilingual Plane are not marks but are diacritics:
$ unichars '\PM' '\p{Diacritic}' | wc -l
209
Count how many code points in the Basic Multilingual Plane are marks but are not diacritics:
$ unichars '\pM' '\P{Diacritic}' | wc -l
750
Find all code points that are Letters, are in the Greek script, have differing canonical and compatibility decompositions, and whose name contains "SYMBOL":
$ unichars -a '\pL' '\p{Greek}' 'NFD ne NFKD' 'NAME =~ /SYMBOL/'
ϐ 976 003D0 GREEK BETA SYMBOL
ϑ 977 003D1 GREEK THETA SYMBOL
ϒ 978 003D2 GREEK UPSILON WITH HOOK SYMBOL
ϓ 979 003D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
ϔ 980 003D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
ϕ 981 003D5 GREEK PHI SYMBOL
ϖ 982 003D6 GREEK PI SYMBOL
ϰ 1008 003F0 GREEK KAPPA SYMBOL
ϱ 1009 003F1 GREEK RHO SYMBOL
ϲ 1010 003F2 GREEK LUNATE SIGMA SYMBOL
ϴ 1012 003F4 GREEK CAPITAL THETA SYMBOL
ϵ 1013 003F5 GREEK LUNATE EPSILON SYMBOL
Ϲ 1017 003F9 GREEK CAPITAL LUNATE SIGMA SYMBOL
Find all numeric nondigits in the Latin script (within the BMP):
$ unichars '\pN' '\D' '\p{Latin}'
Ⅰ 8544 02160 ROMAN NUMERAL ONE
Ⅱ 8545 02161 ROMAN NUMERAL TWO
Ⅲ 8546 02162 ROMAN NUMERAL THREE
Ⅳ 8547 02163 ROMAN NUMERAL FOUR
Ⅴ 8548 02164 ROMAN NUMERAL FIVE
Ⅵ 8549 02165 ROMAN NUMERAL SIX
Ⅶ 8550 02166 ROMAN NUMERAL SEVEN
Ⅷ 8551 02167 ROMAN NUMERAL EIGHT
(etc)
Find the first three alphanumunderish code points with no assigned name:
$ unichars -au '\w' '!length NAME' | head -3
㐀 13312 003400 <unnamed codepoint>
㐁 13313 003401 <unnamed codepoint>
㐂 13314 003402 <unnamed codepoint>
Count the combining characters in the Supplementary Multilingual Plane:
$ unichars -s '\pM' | wc -l
61
ENVIRONMENT
If your environment smells like it's in a Unicode encoding, program arguments will be in UTF-8.
BUGS
The --man option does not correctly process the page for UTF-8, because it does not pass the necessary --utf8 option to pod2man.
SEE ALSO
uniprops, uninames, perluniprops, perlunicode, perlrecharclass, perlre
AUTHOR
Tom Christiansen <tchrist@perl.com>
COPYRIGHT AND LICENCE
Copyright 2010 Tom Christiansen.
This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.