NAME
uniprops - list Unicode properties for one or more characters
SYNOPSIS
uniprops [options] character | U+codepoint | "name" ...
Options:
--version   print version information
--help      this message
--man       full manpage
--unicode   list simple Unicode properties (DEFAULT)
--general   include even the long form of general properties
--perl      list lowercase Perl short-cuts, plus \R (DEFAULT)
--negated   list uppercase Perl short-cuts
--all       list all Unicode categories, not just one-parters
--list      list all known Unicode properties, then exit
--reorder   sort Unicode property lists shortest first
--single    output each property one per line
--verbose   wrap Unicode properties in \p{xxx}
--width N   set column width
--debug     noisy internal processing
Options may be bundled if used in the short form; e.g., -va.
DESCRIPTION
Each argument to uniprops specifies a character in one of three forms:
a one-character literal, such as "#" or "A".
a code point number in hex, optionally prefixed by "0x", "U+", "\x", or "\u"; the backslash prefixes admit but do not require enclosing curly braces. Examples: "0x23", "U+394", "\x{0394}", "0394".
a case-sensitive character name, such as "COMMA" or "GREEK CAPITAL LETTER DELTA". Names may be given in full or as short names per the charnames pragma; a short name with no script prefix is tried as Latin first, then as Greek. See the EXAMPLES.
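The three argument forms can be told apart mechanically. The following is a minimal sketch in Python, for illustration only; it is not how uniprops itself is implemented. The helper name parse_argument and the order of the checks are assumptions. It handles full character names only, so charnames short names such as "greek:delta" are not covered, and Python's name lookup may treat letter case differently than uniprops does.

```python
import re
import unicodedata

# Hex code point with an optional 0x, U+, \x, or \u prefix; the backslash
# prefixes may, but need not, wrap the digits in curly braces.
_HEX = re.compile(
    r"""^(?: (?:0x|U\+)? ([0-9A-Fa-f]+)
          | \\[xu] (?: \{ ([0-9A-Fa-f]+) \} | ([0-9A-Fa-f]+) )
        )$""",
    re.VERBOSE,
)


def parse_argument(arg: str) -> int:
    """Return the code point named by one argument (hypothetical helper)."""
    if len(arg) == 1:
        # A one-character literal such as "#" or "A".
        return ord(arg)
    m = _HEX.match(arg)
    if m:
        # Exactly one of the three capture groups holds the hex digits.
        digits = next(g for g in m.groups() if g is not None)
        return int(digits, 16)
    # Otherwise treat it as a full character name; raises KeyError if unknown.
    return ord(unicodedata.lookup(arg))
```

Checking for a single character first keeps "A" a literal rather than the hex number 0xA, while a longer all-hex string such as "0394" is read as a code point.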
The uniprops program reports the properties that apply to a given character for use in regular expressions. By default, the Perl character class short-cuts and the one-part Unicode properties are listed, which are mostly those from the general category.
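To see where the general-category entries come from, the sketch below derives the two \p short-cuts that appear in the EXAMPLES for non-letter characters, such as "\pP \p{Po}" for the number sign. It is written in Python purely for illustration and uses that language's unicodedata module; the function name is hypothetical, and for letters uniprops lists further short-cuts (\w, \p{LC}, and so on) that this sketch does not attempt to produce.

```python
import unicodedata


def general_category_shortcuts(ch: str) -> str:
    """Build the one-letter and two-letter \\p forms for a character."""
    gc = unicodedata.category(ch)  # two-letter value such as "Po" or "Sm"
    # The first letter is the major class (P, S, C, ...); the pair is the
    # full general category.
    return "\\p" + gc[0] + " \\p{" + gc + "}"
```

The result depends on the Unicode version bundled with the Python interpreter, so a code point that was unassigned when this manpage was written may report a different category today.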
The --all option adds all the two-part Unicode properties from the non-general categories.
Long, two-part forms of general category properties are not listed unless the --general option is given.
The --negated option adds the Perl shortcuts that are in capitals. The --verbose option encloses Unicode properties with \p{PROPNAME}.
To simply list out all available Unicode properties, use the --list option, which then exits without processing further arguments.
Lines will be wrapped before the edge of your screen. You can override the window width with the --width NN option. To get only one property per line without any indentation, use the --single or -1 option.
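The two layouts can be pictured with a short sketch. This is Python for illustration only, not the program's own code; the function name, the default width of 80, and the four-space indent are all assumptions, since the manpage does not state them.

```python
import textwrap


def format_properties(props, width=80, single=False, indent="    "):
    """Lay out property names wrapped to a width, or one per line."""
    if single:
        # --single / -1: one property per line, no indentation.
        return "\n".join(props)
    return textwrap.fill(
        " ".join(props),
        width=width,
        initial_indent=indent,
        subsequent_indent=indent,
        break_long_words=False,   # never split a property name
        break_on_hyphens=False,
    )
```

With break_long_words disabled, a property name longer than the available width is left whole on its own line rather than being cut.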
Unicode properties are by default listed in the order in which they occur in perluniprops(1), but the --reorder option sorts them shortest first.
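The output shown for U+2212 in the EXAMPLES is consistent with sorting by length and breaking ties by name. A minimal Python sketch of that ordering follows; the tie-breaking rule is inferred from that one example rather than stated anywhere in this manpage, and the function name is hypothetical.

```python
def reorder(props):
    """Sort property names shortest first, then by name within a length."""
    return sorted(props, key=lambda p: (len(p), p))
```

For instance, reorder(["Math", "S", "Any", "Sm", "All", "Dash"]) yields S, Sm, All, Any, Dash, Math.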
Unicode properties designated as deprecated, obsolete, or discouraged, or which begin with an underscore, are ignored.
Loading and testing all the Unicode properties takes quite some time, so if you only need to confirm which character you have, ask for just the Perl properties (-p) rather than the Unicode ones; the program will then run at least six times faster.
EXAMPLES
Count known Unicode properties:
$ uniprops -l | wc -l
2478
List all known Unicode properties, sorted by length:
$ uniprops -lr
List all known Unicode properties, sorted by name:
$ uniprops -l | sort -df | more
List Greek-related Unicode properties:
$ uniprops -l | grep Greek | sort -dfu
Blk=Greek
Block:Ancient_Greek_Musical_Notation
Block:Ancient_Greek_Numbers
Block:Greek
Block=Greek_And_Coptic
Block:Greek_Extended
Greek
Greek_And_Coptic
InAncientGreekMusicalNotation
InAncientGreekNumbers
InGreek
InGreekExtended
Is_Greek
Script=Greek
List just Perl properties for three named characters:
$ uniprops -p delta greek:delta Greek:Delta
U+1E9F ‹ẟ› \N{ LATIN SMALL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+03B4 ‹δ› \N{ GREEK SMALL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+0394 ‹Δ› \N{ GREEK CAPITAL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
List just Perl properties for four named characters:
$ uniprops -p Thorn pi hebrew:alef cyrillic:be
U+00DE ‹Þ› \N{ LATIN CAPITAL LETTER THORN }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+05D0 ‹א› \N{ HEBREW LETTER ALEF }:
\w \pL \p{L_} \p{Lo}
U+0431 ‹б› \N{ CYRILLIC SMALL LETTER BE }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
List Perl and Unicode properties for three different literal characters:
$ uniprops \# ç π
U+0023 ‹#› \N{ NUMBER SIGN }:
\pP \p{Po}
All Any ASCII Assigned Common Zyyy Po P Gr_Base
Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct
Print Punctuation
U+00E7 ‹ç› \N{ LATIN SMALL LETTER C WITH CEDILLA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC
ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower
Lowercase Print Word XID_Continue XIDC XID_Start XIDS
U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek
InGreek Cased Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter
Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
Just list Perl shortcuts, including negated ones, for a named character:
$ uniprops -pn LF
U+000A ‹U+000A› \N{ LINE FEED (LF) }:
\s \v \R \pC \p{Cc}
\W \D \H
For the Greek final sigma character, list Unicode properties that are either one-parters or else two-part general categories:
$ uniprops -ug "greek:final sigma"
U+03C2 ‹ς› \N{ GREEK SMALL LETTER FINAL SIGMA }:
All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek
Cased Cased_Letter LC Changes_When_Casefolded CWCF
Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter Lower
Lowercase Print Word XID_Continue XIDC XID_Start XIDS
General_Category=Cased_Letter General_Category:Cased_Letter Gc=LC
General_Category:L General_Category=Letter General_Category:LC
General_Category:Letter Gc=L General_Category:Ll
General_Category=Lowercase_Letter
General_Category:Lowercase_Letter Gc=Ll
List just Unicode properties for a code point, given in hex:
$ uniprops -u 0xDF
U+00DF ‹ß› \N{ LATIN SMALL LETTER SHARP S }:
All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
Cased_Letter LC Changes_When_Casefolded CWCF
Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded
CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased
CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue
IDC ID_Start IDS Letter L_ Latin Latn Lowercase_Letter
Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
List Perl and Unicode properties for a named character, verbosely:
$ uniprops -v "ALEF SYMBOL"
U+2135 ‹ℵ› \N{ ALEF SYMBOL }:
\w \pL \p{L_} \p{Lo}
\p{All} \p{Any} \p{Alnum} \p{Alpha} \p{Alphabetic} \p{Assigned}
\p{InLetterlikeSymbols} \p{Changes_When_NFKC_Casefolded}
\p{CWKCF} \p{Common} \p{Zyyy} \p{L} \p{Lo} \p{Gr_Base}
\p{Grapheme_Base} \p{Graph} \p{GrBase} \p{ID_Continue} \p{IDC}
\p{ID_Start} \p{IDS} \p{Letter} \p{L_} \p{Other_Letter}
\p{Math} \p{Print} \p{Word} \p{XID_Continue} \p{XIDC}
\p{XID_Start} \p{XIDS}
List Unicode properties in all categories except for two-part general categories:
$ uniprops -au INFINITY
U+221E ‹∞› \N{ INFINITY }:
All Any Assigned InMathematicalOperators Common Zyyy Sm S
Gr_Base Grapheme_Base Graph GrBase Math Math_Symbol
Pat_Syn Pattern_Syntax PatSyn Print Symbol
Age:1.1 Bidi_Class:ON Bidi_Class=Other_Neutral
Bidi_Class:Other_Neutral Bc=ON Block:Mathematical_Operators
Canonical_Combining_Class:0
Canonical_Combining_Class=Not_Reordered
Canonical_Combining_Class:Not_Reordered Ccc=NR
Canonical_Combining_Class:NR Script=Common
Decomposition_Type:None Dt=None East_Asian_Width:A
East_Asian_Width=Ambiguous East_Asian_Width:Ambiguous Ea=A
Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX
Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
Hangul_Syllable_Type=Not_Applicable
Hangul_Syllable_Type:Not_Applicable Hst=NA
Joining_Group:No_Joining_Group Jg=NoJoiningGroup
Joining_Type:Non_Joining Jt=U Joining_Type:U
Joining_Type=Non_Joining Line_Break:AI Line_Break=Ambiguous
Line_Break:Ambiguous Lb=AI Numeric_Type:None Nt=None
Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1
Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0
In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0
In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:Other SB=XX
Sentence_Break:XX Sentence_Break=Other Word_Break:Other WB=XX
Word_Break:XX Word_Break=Other
For the HYPHEN character, verbosely list all Unicode properties including the two-part general categories, one per line, and sort them:
$ uniprops -1vgau HYPHEN | sort
List Perl and Unicode properties for code point U+2212, reordered by length and with width set to 50:
$ uniprops -r -w 50 U+2212
U+2212 ‹−› \N{ MINUS SIGN }:
\pS \p{Sm}
S Sm All Any Dash Math Zyyy Graph Print
Common GrBase PatSyn Symbol Gr_Base Pat_Syn
Assigned Math_Symbol Grapheme_Base
Pattern_Syntax InMathematicalOperators
Ask for a (currently) unassigned code point:
$ uniprops 1F12F
U+1F12F ‹U+1F12F› \N{ U+1F12F }:
\pC \p{Cn}
All Any InEnclosedAlphanumericSupplement C Other Cn
Unassigned Zzzz Unknown
ERRORS
It is an error to ask for properties of code points representing a UTF-16 surrogate.
Characters not legal for interchange are flagged as errors.
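Both checks amount to simple range tests on the code point. The sketch below is Python for illustration only. It assumes that "not legal for interchange" refers to the Unicode noncharacters (U+FDD0 through U+FDEF, plus the last two code points of every plane); that reading is an assumption, as are the function name and the message wording.

```python
from typing import Optional


def interchange_error(cp: int) -> Optional[str]:
    """Return an error message for a rejected code point, else None."""
    if 0xD800 <= cp <= 0xDFFF:
        return "U+%04X is a UTF-16 surrogate" % cp
    # Noncharacters: U+FDD0..U+FDEF and any code point ending in FFFE or FFFF.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "U+%04X is not legal for interchange" % cp
    return None
```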
ENVIRONMENT
If your environment smells like it's in a Unicode encoding, program arguments and output will be in UTF-8. This allows you to enter a single, literal UTF-8 character as a program argument.
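One plausible way to perform such a check is to look for "UTF-8" in the usual locale variables. The Python sketch below is illustrative only: the manpage does not say which variables are consulted, so the choice of LC_ALL, LC_CTYPE, and LANG, like the function name, is an assumption.

```python
import re

# Locale variables examined here; which ones uniprops reads is an assumption.
_LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def looks_like_utf8(env) -> bool:
    """Return True if any locale variable mentions UTF-8 (or utf8)."""
    return any(
        re.search(r"utf-?8", env.get(var, ""), re.IGNORECASE)
        for var in _LOCALE_VARS
    )
```

Passing the environment in as a mapping, for example os.environ, keeps the check easy to try with made-up settings.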
The PAGER environment variable is used for the --list option.
FILES
The pod source for the perluniprops(1) manpage is parsed to determine Unicode properties. This is expected to be found in the Config module's $installprivlib/pods directory.
PROGRAMS
The stty(1) program is called on Unix systems to determine the window size.
If the standard output is to a tty when the --list option is requested, the user's pager is used, defaulting to more(1).
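The pager rule reduces to two tests, sketched here in Python for illustration only; the function name is hypothetical. Treating an empty PAGER value the same as an unset one is an assumption.

```python
def choose_pager(env, stdout_is_tty):
    """Return the pager command to use for --list, or None for no pager."""
    if not stdout_is_tty:
        # Output is piped or redirected, so write it directly.
        return None
    # Fall back to more(1) when PAGER is unset or empty.
    return env.get("PAGER") or "more"
```

A caller would typically pass os.environ and sys.stdout.isatty() as the two arguments.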
BUGS
The --man option does not correctly process the page for UTF-8; pod2text(1) works fine, though.
SEE ALSO
unichars, uninames, perluniprops, perlunicode, perlrecharclass, perlre
AUTHOR
Tom Christiansen <tchrist@perl.com>
COPYRIGHT AND LICENCE
Copyright 2011 Tom Christiansen.
This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.