NAME
unichars - list characters for one or more properties
SYNOPSIS
unichars [options] criterion ...
Each criterion is either a square-bracketed character class, a regex starting with a backslash, or an arbitrary Perl expression. See the EXAMPLES section below.
OPTIONS:
Selection Options:
--bmp include the Basic Multilingual Plane (plane 0) [DEFAULT]
--smp include the Supplementary Multilingual Plane (plane 1)
--astral -a include planes above the BMP (planes 1-16)
--unnamed -u include various unnamed characters (see DESCRIPTION)
--locale -l specify the locale used for UCA functions
Display Options:
--category -c include the general category (GC=)
--script -s include the script name (SC=)
--block -b include the block name (BLK=)
--bidi -B include the bidi class (BC=)
--combining -C include the canonical combining class (CCC=)
--numeric -n include the numeric value (NV=)
--casefold -f include the casefold status
--decimal -d include the decimal representation of the code point
Miscellaneous Options:
--version -v print version information and exit
--help -h this message
--man -m full manpage
--debug -d show debugging of criteria and examined code point span
Special Functions:
$_ is the current code point
ord is the current code point's ordinal
NAME is charnames::viacode(ord)
NUM is Unicode::UCD::num(ord), not code point number
CF is casefold->{status}
NFD, NFC, NFKD, NFKC, FCD, FCC (normalization)
UCA, UCA1, UCA2, UCA3, UCA4 (binary sort keys)
Singleton, Exclusion, NonStDecomp, Comp_Ex
checkNFD, checkNFC, checkNFKD, checkNFKC, checkFCD, checkFCC
NFD_NO, NFC_NO, NFC_MAYBE, NFKD_NO, NFKC_NO, NFKC_MAYBE
DESCRIPTION
The unichars program reports the characters that match all of the selection criteria, ANDed together.
A criterion beginning with a square bracket or a backslash is assumed to be a regular expression. Anything else is a Perl expression such as you might pass to the Perl grep function. The $_ variable is set to each successive Unicode character, and if all criteria match, that character is displayed.
The numeric code point is therefore accessible as ord.
The special token NAME is set to the full name of the current code point. Also, the tokens NFD, NFKD, NFC, and NFKC are set to the corresponding normalization form.
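The selection rule described above can be sketched outside the tool. The following is a minimal Python illustration, not the program's own Perl implementation: the helper matches_all and the two lambda criteria are hypothetical stand-ins for a regex criterion ('\w') and an expression criterion ('NAME =~ /SYMBOL/').

```python
import re
import unicodedata

def matches_all(ch, criteria):
    """Return True only if every criterion accepts the character (logical AND)."""
    return all(test(ch) for test in criteria)

# Hypothetical stand-ins for two unichars criteria.
criteria = [
    lambda ch: re.fullmatch(r"\w", ch) is not None,    # like the regex criterion '\w'
    lambda ch: "SYMBOL" in unicodedata.name(ch, ""),   # like 'NAME =~ /SYMBOL/'
]

# Examine a few candidate characters and keep those that pass every test.
candidates = "Aβϐϑ"
hits = [ch for ch in candidates if matches_all(ch, criteria)]
print(hits)
```

A character is kept only when no criterion rejects it, which is why adding criteria can only narrow the result.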
By default only plane 0, the Basic Multilingual Plane, is examined. For plane 1, the Supplementary Multilingual Plane, use --smp. To examine both, specify both the --bmp and --smp options, or -bs. To include all valid code points, use the -a or --astral option.
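As a rough guide to what each plane covers, here is a small Python sketch of the plane arithmetic. It relies only on the Unicode layout (each plane holds 0x10000 code points); the names BMP, SMP, and plane_of are illustrative and are not part of unichars.

```python
# Each Unicode plane spans 0x10000 (65,536) code points.
BMP = range(0x0000, 0x10000)    # plane 0, Basic Multilingual Plane
SMP = range(0x10000, 0x20000)   # plane 1, Supplementary Multilingual Plane

def plane_of(cp):
    """Return the plane number of a code point (its value divided by 0x10000)."""
    return cp >> 16

print(plane_of(0x03D0))    # GREEK BETA SYMBOL
print(plane_of(0x1D7CE))   # MATHEMATICAL BOLD DIGIT ZERO
```

The first call reports plane 0 and the second plane 1, so the second character would be missed by a default (BMP-only) search.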
Unless the --unnamed option is given, characters with any of the properties Unassigned, PrivateUse, Han, or InHangulSyllables will be excluded.
EXAMPLES
Count all non-ASCII digits:
$ unichars -a '\d' '\P{ASCII}' | wc -l
401
Find all line terminators:
$ unichars '\R'
-- 10 0000A LINE FEED (LF)
-- 11 0000B LINE TABULATION
-- 12 0000C FORM FEED (FF)
-- 13 0000D CARRIAGE RETURN (CR)
-- 133 00085 NEXT LINE (NEL)
-- 8232 02028 LINE SEPARATOR
-- 8233 02029 PARAGRAPH SEPARATOR
Find what is not \s but is [\h\v]:
$ unichars '\S' '[\h\v]'
-- 11 0000B LINE TABULATION
Count how many code points in the Basic Multilingual Plane are not marks but are diacritics:
$ unichars '\PM' '\p{Diacritic}' | wc -l
209
Count how many code points in the Basic Multilingual Plane are marks but are not diacritics:
$ unichars '\pM' '\P{Diacritic}' | wc -l
750
Find all code points that are Letters, are in the Greek script, have differing canonical and compatibility decompositions, and whose name contains "SYMBOL":
$ unichars -a '\pL' '\p{Greek}' 'NFD ne NFKD' 'NAME =~ /SYMBOL/'
ϐ 976 003D0 GREEK BETA SYMBOL
ϑ 977 003D1 GREEK THETA SYMBOL
ϒ 978 003D2 GREEK UPSILON WITH HOOK SYMBOL
ϓ 979 003D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
ϔ 980 003D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
ϕ 981 003D5 GREEK PHI SYMBOL
ϖ 982 003D6 GREEK PI SYMBOL
ϰ 1008 003F0 GREEK KAPPA SYMBOL
ϱ 1009 003F1 GREEK RHO SYMBOL
ϲ 1010 003F2 GREEK LUNATE SIGMA SYMBOL
ϴ 1012 003F4 GREEK CAPITAL THETA SYMBOL
ϵ 1013 003F5 GREEK LUNATE EPSILON SYMBOL
Ϲ 1017 003F9 GREEK CAPITAL LUNATE SIGMA SYMBOL
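To see why the 'NFD ne NFKD' criterion selects these characters, the comparison can be reproduced with Python's standard unicodedata module. This is only an illustration of the test, not how unichars computes it; the function name decompositions_differ is made up for the example.

```python
import unicodedata

def decompositions_differ(ch):
    """True when canonical (NFD) and compatibility (NFKD) decompositions differ."""
    return unicodedata.normalize("NFD", ch) != unicodedata.normalize("NFKD", ch)

# GREEK BETA SYMBOL has only a compatibility mapping (to GREEK SMALL LETTER BETA),
# so NFD leaves it alone while NFKD rewrites it.
print(decompositions_differ("\u03D0"))
# GREEK SMALL LETTER BETA has no decomposition at all, so both forms agree.
print(decompositions_differ("\u03B2"))
```

Characters with only a canonical decomposition, such as precomposed accented letters, give the same NFD and NFKD result and so fail this criterion.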
Find all numeric nondigits in the Latin script (within the BMP):
$ unichars '\pN' '\D' '\p{Latin}'
Ⅰ 8544 02160 ROMAN NUMERAL ONE
Ⅱ 8545 02161 ROMAN NUMERAL TWO
Ⅲ 8546 02162 ROMAN NUMERAL THREE
Ⅳ 8547 02163 ROMAN NUMERAL FOUR
Ⅴ 8548 02164 ROMAN NUMERAL FIVE
Ⅵ 8549 02165 ROMAN NUMERAL SIX
Ⅶ 8550 02166 ROMAN NUMERAL SEVEN
Ⅷ 8551 02167 ROMAN NUMERAL EIGHT
(etc)
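The '\pN' plus '\D' combination amounts to "general category is some kind of Number, but not Decimal_Number". A hedged Python sketch of that part of the test follows; it uses the standard unicodedata module, the name is_numeric_nondigit is hypothetical, and the script check ('\p{Latin}') is left out because the Python standard library does not expose script properties.

```python
import unicodedata

def is_numeric_nondigit(ch):
    """True for categories Nl and No; False for Nd and for non-numbers."""
    category = unicodedata.category(ch)
    return category.startswith("N") and category != "Nd"

print(is_numeric_nondigit("\u2160"))   # ROMAN NUMERAL ONE, category Nl
print(is_numeric_nondigit("5"))        # DIGIT FIVE, category Nd
print(unicodedata.numeric("\u2160"))   # numeric value of ROMAN NUMERAL ONE
```

The numeric value printed last is the kind of information the --numeric display option reports as NV=.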
Find the first three alphanumunderish code points with no assigned name:
$ unichars -au '\w' '!length NAME' | head -3
㐀 13312 003400 <unnamed codepoint>
㐁 13313 003401 <unnamed codepoint>
㐂 13314 003402 <unnamed codepoint>
Count the combining characters in the Supplementary Multilingual Plane:
$ unichars -s '\pM' | wc -l
61
ENVIRONMENT
If your environment smells like it's in a Unicode encoding, program arguments will be in UTF-8.
BUGS
The --man option does not correctly process the page for UTF-8, because it does not pass the necessary --utf8 option to pod2man.
SEE ALSO
uniprops, uninames, perluniprops, perlunicode, perlrecharclass, perlre
AUTHOR
Tom Christiansen <tchrist@perl.com>
COPYRIGHT AND LICENCE
Copyright 2010 Tom Christiansen.
This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.