NAME

uniprops - list Unicode properties for one or more characters

SYNOPSIS

uniprops [options] character | U+codepoint | "name" ...

Options:

   --version   print version information
   --help      this message
   --man       full manpage

   --unicode   list simple Unicode properties (DEFAULT)
   --general   include even the long form of general properties

   --perl      list lowercase Perl short-cuts, plus \R (DEFAULT)
   --negated   list uppercase Perl short-cuts

   --all       list all Unicode categories, not just one-parters
   --list      list all known Unicode properties, then exit

   --reorder   sort Unicode property lists shortest first
   --single    output each property one per line

   --verbose   wrap Unicode properties in \p{xxx}
   --width N   set column width

   --debug     noisy internal processing

 Options may be bundled when given in their short form; e.g., -va

DESCRIPTION

Each argument to uniprops specifies a character in one of three forms:

  1. a one-character literal, such as "#" or "A".

  2. a code point number in hex, optionally prefixed by "0x", "U+", "\x", or "\u"; the backslash prefixes permit but do not require enclosing curly braces. Examples: "0x23", "U+394", "\x{0394}", "0394".

  3. a case-sensitive character name, such as "COMMA" or "GREEK CAPITAL LETTER DELTA". Names may be given in full or in the short form accepted by the charnames pragma; a short name with no script prefix is tried first as a Latin and then as a Greek character name (in that order). See the EXAMPLES.
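
As a rough illustration of how the three forms can be told apart, here is a minimal Python sketch. It is not the program's own code: the function name parse_argument, the order in which the forms are tried, and the use of Python's unicodedata.lookup() are all assumptions made for the example. In particular, unicodedata.lookup() accepts full character names but not charnames short names such as "greek:delta".

```python
import re
import unicodedata

# Hex forms: bare digits, "0x" or "U+" prefix, or "\x" / "\u" with
# optional (but balanced) curly braces around the digits.
HEX_FORM = re.compile(
    r'^(?:(?:0x|U\+)?([0-9A-Fa-f]+)'
    r'|\\[xu](?:\{([0-9A-Fa-f]+)\}|([0-9A-Fa-f]+)))$'
)

def parse_argument(arg):
    """Return the code point named by one argument (hypothetical helper)."""
    if len(arg) == 1:                       # form 1: one-character literal
        return ord(arg)
    match = HEX_FORM.match(arg)
    if match:                               # form 2: hex code point
        digits = next(g for g in match.groups() if g is not None)
        return int(digits, 16)
    # form 3: full character name; raises KeyError if the name is unknown
    return ord(unicodedata.lookup(arg))

for arg in ["#", "0x23", "U+394", r"\x{0394}", "0394", "COMMA"]:
    print("%-10s -> U+%04X" % (arg, parse_argument(arg)))
```

Trying the literal form first means that a one-character argument such as "A" is read as the letter and never as the hex number 0xA; that ordering is an assumption of this sketch.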

The uniprops program reports the properties that apply to a given character for use in regular expressions. By default, the Perl character class short-cuts and the one-part Unicode properties are listed, which are mostly those from the general category.

The --all option adds all the two-part Unicode properties from the non-general categories.

Long, two-part forms of general category properties are not listed unless the --general option is given.

The --negated option adds the uppercase (negated) Perl short-cuts. The --verbose option wraps each Unicode property name as \p{PROPNAME}.

To simply list out all available Unicode properties, use the --list option, which then exits without processing further arguments.

Output lines are wrapped before the edge of your screen. You can override the detected window width with the --width N option. To get only one property per line without any indentation, use the --single or -1 option.

Unicode properties are by default listed in the same order in which they occur in perluniprops(1), but the --reorder option will sort them shortest to longest.
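
The shortest-first ordering can be sketched in Python as below. This is an illustration, not the program's code: the function name reorder is hypothetical, and the alphabetical tie-break among names of equal length is inferred from the U+2212 example in the EXAMPLES section rather than documented. The input list holds the property names from that example in arbitrary order.

```python
def reorder(props):
    """Sort property names shortest first; ties broken alphabetically (assumed)."""
    return sorted(props, key=lambda name: (len(name), name))

props = [
    "All", "Any", "Assigned", "InMathematicalOperators", "Common", "Zyyy",
    "Sm", "S", "Dash", "Gr_Base", "Grapheme_Base", "Graph", "GrBase",
    "Math", "Math_Symbol", "Pat_Syn", "Pattern_Syntax", "PatSyn", "Print",
    "Symbol",
]
print(" ".join(reorder(props)))
```

With this key, the names come out in the same sequence as the Unicode property lines of the "uniprops -r -w 50 U+2212" example.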

Unicode properties designated as deprecated, obsolete, or discouraged, or which begin with an underscore, are ignored.

It takes quite some time to load and test all the Unicode properties, so if you only need to confirm which character an argument names, ask for the Perl properties alone rather than the Unicode ones; the program will then run at least six times faster.

EXAMPLES

Count known Unicode properties:

$ uniprops -l | wc -l
2478

List all known Unicode properties, sorted by length:

$ uniprops -lr

List all known Unicode properties, sorted by name:

$ uniprops -l | sort -df | more

List Greek-related Unicode properties:

$ uniprops -l | grep Greek | sort -dfu
Blk=Greek
Block:Ancient_Greek_Musical_Notation
Block:Ancient_Greek_Numbers
Block:Greek
Block=Greek_And_Coptic
Block:Greek_Extended
Greek
Greek_And_Coptic
InAncientGreekMusicalNotation
InAncientGreekNumbers
InGreek
InGreekExtended
Is_Greek
Script=Greek

List just Perl properties for three named characters:

$ uniprops -p delta greek:delta Greek:Delta
U+1E9F ‹ẟ› \N{ LATIN SMALL LETTER DELTA }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+03B4 ‹δ› \N{ GREEK SMALL LETTER DELTA }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+0394 ‹Δ› \N{ GREEK CAPITAL LETTER DELTA }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}

List just Perl properties for four named characters:

$ uniprops -p Thorn pi hebrew:alef cyrillic:be
U+00DE ‹Þ› \N{ LATIN CAPITAL LETTER THORN }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+05D0 ‹א› \N{ HEBREW LETTER ALEF }:
    \w \pL \p{L_} \p{Lo}
U+0431 ‹б› \N{ CYRILLIC SMALL LETTER BE }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}

List Perl and Unicode properties for three different literal characters:

$ uniprops \# ç π
U+0023 ‹#› \N{ NUMBER SIGN }:
    \pP \p{Po}
    All Any ASCII Assigned Common Zyyy Po P Gr_Base
       Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
       Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct
       Print Punctuation
U+00E7 ‹ç› \N{ LATIN SMALL LETTER C WITH CEDILLA }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
       Cased_Letter LC Changes_When_Casemapped CWCM
       Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
       L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC
       ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower
       Lowercase Print Word XID_Continue XIDC XID_Start XIDS
U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek
       InGreek Cased Cased_Letter LC Changes_When_Casemapped CWCM
       Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
       L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
       ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter
       Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS

Just list Perl shortcuts, including negated ones, for a named character:

$ uniprops -pn LF
U+000A ‹U+000A› \N{ LINE FEED (LF) }:
    \s \v \R \pC \p{Cc}
    \W \D \H

For the Greek final sigma character, list Unicode properties that are either one-parters or else two-part general categories:

$ uniprops -ug "greek:final sigma"
U+03C2 ‹ς› \N{ GREEK SMALL LETTER FINAL SIGMA }:
    All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek
       Cased Cased_Letter LC Changes_When_Casefolded CWCF
       Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF
       Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
       Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
       ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter Lower
       Lowercase Print Word XID_Continue XIDC XID_Start XIDS
    General_Category=Cased_Letter General_Category:Cased_Letter Gc=LC
       General_Category:L General_Category=Letter General_Category:LC
       General_Category:Letter Gc=L General_Category:Ll
       General_Category=Lowercase_Letter
       General_Category:Lowercase_Letter Gc=Ll

List just Unicode properties for a code point, given in hex:

$ uniprops -u 0xDF
U+00DF ‹ß› \N{ LATIN SMALL LETTER SHARP S }:
    All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
       Cased_Letter LC Changes_When_Casefolded CWCF
       Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded
       CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased
       CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue
       IDC ID_Start IDS Letter L_ Latin Latn Lowercase_Letter
       Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS

List Perl and Unicode properties for a named character, verbosely:

$ uniprops -v "ALEF SYMBOL"
U+2135 ‹ℵ› \N{ ALEF SYMBOL }:
    \w \pL \p{L_} \p{Lo}
    \p{All} \p{Any} \p{Alnum} \p{Alpha} \p{Alphabetic} \p{Assigned}
       \p{InLetterlikeSymbols} \p{Changes_When_NFKC_Casefolded}
       \p{CWKCF} \p{Common} \p{Zyyy} \p{L} \p{Lo} \p{Gr_Base}
       \p{Grapheme_Base} \p{Graph} \p{GrBase} \p{ID_Continue} \p{IDC}
       \p{ID_Start} \p{IDS} \p{Letter} \p{L_} \p{Other_Letter}
       \p{Math} \p{Print} \p{Word} \p{XID_Continue} \p{XIDC}
       \p{XID_Start} \p{XIDS}

List Unicode properties in all categories except for two-part general categories:

$ uniprops -au INFINITY
U+221E ‹∞› \N{ INFINITY }:
    All Any Assigned InMathematicalOperators Common Zyyy Sm S
       Gr_Base Grapheme_Base Graph GrBase Math Math_Symbol
       Pat_Syn Pattern_Syntax PatSyn Print Symbol
    Age:1.1 Bidi_Class:ON Bidi_Class=Other_Neutral
       Bidi_Class:Other_Neutral Bc=ON Block:Mathematical_Operators
       Canonical_Combining_Class:0
       Canonical_Combining_Class=Not_Reordered
       Canonical_Combining_Class:Not_Reordered Ccc=NR
       Canonical_Combining_Class:NR Script=Common
       Decomposition_Type:None Dt=None East_Asian_Width:A
       East_Asian_Width=Ambiguous East_Asian_Width:Ambiguous Ea=A
       Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX
       Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
       Hangul_Syllable_Type=Not_Applicable
       Hangul_Syllable_Type:Not_Applicable Hst=NA
       Joining_Group:No_Joining_Group Jg=NoJoiningGroup
       Joining_Type:Non_Joining Jt=U Joining_Type:U
       Joining_Type=Non_Joining Line_Break:AI Line_Break=Ambiguous
       Line_Break:Ambiguous Lb=AI Numeric_Type:None Nt=None
       Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1
       Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0
       In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
       Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0
       In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
       Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:Other SB=XX
       Sentence_Break:XX Sentence_Break=Other Word_Break:Other WB=XX
       Word_Break:XX Word_Break=Other

For the HYPHEN character, verbosely list all Unicode properties including the two-part general categories, one per line, and sort them:

$ uniprops -1vgau HYPHEN | sort

List Perl and Unicode properties for code point U+2212, reordered by length and with width set to 50:

$ uniprops -r -w 50 U+2212
U+2212 ‹−› \N{ MINUS SIGN }:
    \pS \p{Sm}
    S Sm All Any Dash Math Zyyy Graph Print
       Common GrBase PatSyn Symbol Gr_Base Pat_Syn
       Assigned Math_Symbol Grapheme_Base
       Pattern_Syntax InMathematicalOperators

Ask for a (currently) unassigned code point:

$ uniprops 1F12F
U+1F12F ‹U+1F12F› \N{ U+1F12F }:
    \pC \p{Cn}
    All Any InEnclosedAlphanumericSupplement C Other Cn
        Unassigned Zzzz Unknown

ERRORS

It is an error to ask for properties of code points representing a UTF-16 surrogate.

Characters not legal for interchange are flagged as errors.
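
The two checks above might look like the following Python sketch. It assumes that "not legal for interchange" refers to the Unicode noncharacters (U+FDD0..U+FDEF and the last two code points of every plane); the manpage does not spell this out, and the function name interchange_problem is hypothetical.

```python
def interchange_problem(cp):
    """Return a short reason if code point cp would be rejected, else None."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"                  # UTF-16 surrogate range
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"               # assumed meaning of "not legal for interchange"
    return None

for cp in (0x41, 0xD800, 0xFDD0, 0xFFFF, 0x1FFFE):
    print("U+%04X: %s" % (cp, interchange_problem(cp) or "ok"))
```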

ENVIRONMENT

If your environment smells like it's in a Unicode encoding, program arguments and output will be in UTF-8. This allows you to enter a single, literal UTF-8 character as a program argument.
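
One plausible way to make that guess is sketched below in Python. Which variables uniprops actually inspects is not stated here, so the choice of LC_ALL, LC_CTYPE and LANG (in the usual POSIX precedence order) is an assumption, as is the function name looks_like_utf8.

```python
import os

def looks_like_utf8(env):
    """Guess whether the locale settings in env name a UTF-8 encoding."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:                            # first non-empty variable decides
            lowered = value.lower()
            return "utf-8" in lowered or "utf8" in lowered
    return False

print(looks_like_utf8(os.environ))
```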

The PAGER environment variable is used for the --list option.

FILES

The pod source for the perluniprops(1) manpage is parsed to determine Unicode properties. This is expected to be found in the Config module's $installprivlib/pods directory.

PROGRAMS

The stty(1) program is called on Unix systems to determine the window size.

If the standard output is to a tty when the --list option is requested, the user's pager is used, defaulting to more(1).
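
The pager rule just described can be summarized in a few lines of Python. This is a sketch of the stated behavior only; the function name choose_pager is hypothetical, and treating an empty PAGER like an unset one is an assumption.

```python
import os
import sys

def choose_pager(env, stdout_is_tty):
    """Return the pager command to use for --list, or None to print directly."""
    if not stdout_is_tty:
        return None                          # output is piped or redirected
    return env.get("PAGER") or "more"        # fall back to more(1)

print(choose_pager(os.environ, sys.stdout.isatty()))
```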

BUGS

The --man option does not correctly process the page for UTF-8; pod2text(1) works fine, though.

SEE ALSO

unichars, uninames, perluniprops, perlunicode, perlrecharclass, perlre

AUTHOR

Tom Christiansen <tchrist@perl.com>

COPYRIGHT AND LICENCE

Copyright 2011 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.