Legal \\p{} and \\P{} constructs that match no characters
Unicode has some property-value pairs that currently don't match anything. This happens generally either because they are obsolete, or they exist for symmetry with other forms, but no language has yet been encoded that uses them. In this version of Unicode, the following match zero code points:
$zero_matches
END
}
# Generate a list of the properties that we don't accept, grouped by the
# reasons why. This is so that we put out each 'why' only once, and then list
# all the properties that have that reason under it.
my %why_list; # The keys are the reasons; the values are lists of
# properties that have the key as their reason
# For each property, add it to the list of those suppressed for its reason.
# The sort will cause the alphabetically first properties to be added to
# each list first, so each list will be sorted.
foreach my $property (sort keys %why_suppressed) {
next unless $why_suppressed{$property};
push @{$why_list{$why_suppressed{$property}}}, $property;
}
# For each reason (sorted by the first property that has that reason)...
my @bad_re_properties;
foreach my $why (sort { $why_list{$a}->[0] cmp $why_list{$b}->[0] }
keys %why_list)
{
# Add to the output, all the properties that have that reason.
my $has_item = 0; # Flag if actually output anything.
foreach my $name (@{$why_list{$why}}) {
# Split compound names into $property and $table components
my $property = $name;
my $table;
if ($property =~ / (.*) = (.*) /x) {
$property = $1;
$table = $2;
}
# This release of Unicode may not have a property that is
# suppressed, so don't reference a non-existent one.
$property = property_ref($property);
next if ! defined $property;
# And since this list is only for match tables, don't list the
# ones that don't have match tables.
next if ! $property->to_create_match_tables;
# Find any abbreviation, and turn it into a compound name if this
# is a property=value pair.
my $short_name = $property->name;
$short_name .= '=' . $property->table($table)->name if $table;
# Start with an empty line.
push @bad_re_properties, "\n\n" unless $has_item;
# And add the property as an item for the reason.
push @bad_re_properties, "\n=item I<$name> ($short_name)\n";
$has_item = 1;
}
# And add the reason under the list of properties, if such a list
# actually got generated. Note that the header got added
# unconditionally before. But pod ignores extra blank lines, so no
# harm.
push @bad_re_properties, "\n$why\n" if $has_item;
} # End of looping through each reason.
if (! @bad_re_properties) {
push @bad_re_properties,
"*** This installation accepts ALL non-Unihan properties ***";
}
else {
# Add =over only if non-empty to avoid an empty =over/=back section,
# which is considered bad form.
unshift @bad_re_properties, "\n=over 4\n";
push @bad_re_properties, "\n=back\n";
}
# Similarly, generate a list of files that we don't use, grouped by the
# reasons why (Don't output if the reason is empty). First, create a hash
# whose keys are the reasons, and whose values are anonymous arrays of all
# the files that share that reason.
my %grouped_by_reason;
foreach my $file (keys %skipped_files) {
next unless $skipped_files{$file};
push @{$grouped_by_reason{$skipped_files{$file}}}, $file;
}
# Then, sort each group.
foreach my $group (keys %grouped_by_reason) {
@{$grouped_by_reason{$group}} = sort { lc $a cmp lc $b }
@{$grouped_by_reason{$group}} ;
}
# Finally, create the output text. For each reason (sorted by the
# alphabetically first file that has that reason)...
my @unused_files;
foreach my $reason (sort { lc $grouped_by_reason{$a}->[0]
cmp lc $grouped_by_reason{$b}->[0]
}
keys %grouped_by_reason)
{
# Add all the files that have that reason to the output. Start
# with an empty line.
push @unused_files, "\n\n";
push @unused_files, map { "\n=item F<$_> \n" }
@{$grouped_by_reason{$reason}};
# And add the reason under the list of files
push @unused_files, "\n$reason\n";
}
# Similarly, create the output text for the UCD section of the pod
my @ucd_pod;
foreach my $key (keys %ucd_pod) {
next unless $ucd_pod{$key}->{'output_this'};
push @ucd_pod, format_pod_line($indent_info_column,
$ucd_pod{$key}->{'name'},
$ucd_pod{$key}->{'info'},
$ucd_pod{$key}->{'status'},
);
}
# Sort alphabetically, and fold for output
@ucd_pod = sort { lc substr($a, 2) cmp lc substr($b, 2) } @ucd_pod;
my $ucd_pod = simple_fold(\@ucd_pod,
' ',
$indent_info_column,
$automatic_pod_indent);
$ucd_pod = format_pod_line($indent_info_column, 'NAME', ' INFO')
. "\n"
. $ucd_pod;
my $space_hex = sprintf("%02x", ord " ");
local $" = "";
# Everything is ready to assemble.
my @OUT = << "END";
=begin comment
$HEADER
To change this file, edit $0 instead.
NAME
$pod_file - Index of Unicode Version $unicode_version character properties in Perl
DESCRIPTION
This document provides information about the portion of the Unicode database that deals with character properties, that is the portion that is defined on single code points. ("Other information in the Unicode data base" below briefly mentions other data that Unicode provides.)
Perl can provide access to all non-provisional Unicode character properties, though not all are enabled by default. The omitted ones are the Unihan properties and certain deprecated or Unicode-internal properties. (An installation may choose to recompile Perl's tables to change this. See "Unicode character properties that are NOT accepted by Perl".)
For most purposes, access to Unicode properties from the Perl core is through regular expression matches, as described in the next section. For some special purposes, and to access the properties that are not suitable for regular expression matching, all the Unicode character properties that Perl handles are accessible via the standard Unicode::UCD module, as described in the section "Properties accessible through Unicode::UCD".
Perl also provides some additional extensions and short-cut synonyms for Unicode properties.
This document merely lists all available properties and does not attempt to explain what each property really means. There is a brief description of each Perl extension; see "Other Properties" in perlunicode for more information on these. There is some detail about Blocks, Scripts, General_Category, and Bidi_Class in perlunicode, but to find out about the intricacies of the official Unicode properties, refer to the Unicode standard. A good starting place is $unicode_reference_url.
Note that you can define your own properties; see "User-Defined Character Properties" in perlunicode.
Properties accessible through \\p{} and \\P{}
The Perl regular expression \\p{} and \\P{} constructs give access to most of the Unicode character properties. The table below shows all these constructs, both single and compound forms.
Compound forms consist of two components, separated by an equals sign or a colon. The first component is the property name, and the second component is the particular value of the property to match against, for example, \\p{Script_Extensions: Greek} and \\p{Script_Extensions=Greek} both mean to match characters whose Script_Extensions property value is Greek. (Script_Extensions is an improved version of the Script property.)
Single forms, like \\p{Greek}, are mostly Perl-defined shortcuts for their equivalent compound forms. The table shows these equivalences. (In our example, \\p{Greek} is just a shortcut for \\p{Script_Extensions=Greek}.) There are also a few Perl-defined single forms that are not shortcuts for a compound form. One such is \\p{Word}. These are also listed in the table.
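The equivalence between the compound and single forms can be checked directly. The following is a minimal sketch, assuming a Perl recent enough to know the Script_Extensions property; the variable names are illustrative only.

```perl
use strict;
use warnings;

# U+03B1 GREEK SMALL LETTER ALPHA, written as an escape so the source stays ASCII.
my $alpha = "\x{3B1}";

# Compound forms: '=' and ':' are interchangeable separators.
my $compound_equals = $alpha =~ /\p{Script_Extensions=Greek}/  ? 1 : 0;
my $compound_colon  = $alpha =~ /\p{Script_Extensions: Greek}/ ? 1 : 0;

# Single form: a Perl-defined shortcut for the compound form.
my $single = $alpha =~ /\p{Greek}/ ? 1 : 0;

# A Perl-defined single form that is not a shortcut for any compound form.
my $word = "_" =~ /\p{Word}/ ? 1 : 0;

print "compound(=): $compound_equals, compound(:): $compound_colon, ",
      "single: $single, word: $word\n";
```

All four flags should come out as 1 on such a Perl.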
In parsing these constructs, Perl always ignores upper/lower case differences everywhere within the {braces}. Thus \\p{Greek} means the same thing as \\p{greek}. But note that changing the case of the "p" or "P" before the left brace completely changes the meaning of the construct, from "match" (for \\p{}) to "doesn't match" (for \\P{}). Casing in this document is for improved legibility.
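A short sketch of the two casing rules just described: case inside the braces is ignored, while the case of the letter before the brace flips the sense of the match. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";    # GREEK SMALL LETTER ALPHA

# Case inside the braces makes no difference.
my $mixed = $alpha =~ /\p{Greek}/ ? 1 : 0;
my $lower = $alpha =~ /\p{greek}/ ? 1 : 0;
my $upper = $alpha =~ /\p{GREEK}/ ? 1 : 0;

# Case of the letter before the brace is significant: \P negates.
my $negated_alpha = $alpha =~ /\P{Greek}/ ? 1 : 0;    # alpha is Greek: no match
my $negated_latin = "a"    =~ /\P{Greek}/ ? 1 : 0;    # "a" is not Greek: match

print "p-forms: $mixed $lower $upper; ",
      "P-forms: $negated_alpha $negated_latin\n";
```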
Also, white space, hyphens, and underscores are normally ignored everywhere between the {braces}, and hence can be freely added or removed even if the /x modifier hasn't been specified on the regular expression. But in the table below $a_bold_stricter at the beginning of an entry means that tighter (stricter) rules are used for that entry:
- Single form (\\p{name}) tighter rules:
-
White space, hyphens, and underscores ARE significant except for:
- white space adjacent to a non-word character
- underscores separating digits in numbers
That means, for example, that you can freely add or remove white space adjacent to (but within) the braces without affecting the meaning.
- Compound form (\\p{name=value} or \\p{name:value}) tighter rules:
-
The tighter rules given above for the single form apply to everything to the right of the colon or equals sign; the looser rules still apply to everything to the left.
That means, for example, that you can freely add or remove white space adjacent to (but within) the braces and the colon or equals sign.
Some properties are considered obsolete by Unicode, but still available. There are several varieties of obsolescence:
- Stabilized
-
A property may be stabilized. Such a determination does not indicate that the property should or should not be used; instead it is a declaration that the property will not be maintained nor extended for newly encoded characters. Such properties are marked with $a_bold_stabilized in the table.
- Deprecated
-
A property may be deprecated, perhaps because its original intent has been replaced by another property, or because its specification was somehow defective. This means that its use is strongly discouraged, so much so that a warning will be issued if used, unless the regular expression is in the scope of a no warnings 'deprecated' statement. $A_bold_deprecated flags each such entry in the table, and the entry there for the longest, most descriptive version of the property will give the reason it is deprecated, and perhaps advice. Perl may issue such a warning, even for properties that aren't officially deprecated by Unicode, when there used to be characters or code points that were matched by them, but no longer. This is to warn you that your program may not work like it did on earlier Unicode releases.
A deprecated property may be made unavailable in a future Perl version, so it is best to move away from it.
A deprecated property may also be stabilized, but this fact is not shown.
- Obsolete
-
Properties marked with $a_bold_obsolete in the table are considered (plain) obsolete. Generally this designation is given to properties that Unicode once used for internal purposes (but not any longer).
- Discouraged
-
This is not actually a Unicode-specified obsolescence, but applies to certain Perl extensions that are present for backwards compatibility, but are discouraged from being used. These are not obsolete, but their meanings are not stable. Future Unicode versions could force any of these extensions to be removed without warning, replaced by another property with the same name that means something different. $A_bold_discouraged flags each such entry in the table. Use the equivalent shown instead.
@block_warning
The table below has two columns. The left column contains the \\p{} constructs to look up, possibly preceded by the flags mentioned above; and the right column contains information about them, like a description, or synonyms. The table shows both the single and compound forms for each property that has them. If the left column is a short name for a property, the right column will give its longer, more descriptive name; and if the left column is the longest name, the right column will show any equivalent shortest name, in both single and compound forms if applicable.
If braces are not needed to specify a property (e.g., \\pL), the left column contains both forms, with and without braces.
The right column will also caution you if a property means something different than what might normally be expected.
All single forms are Perl extensions; a few compound forms are as well, and are noted as such.
Numbers in (parentheses) indicate the total number of Unicode code points matched by the property. For the entries that give the longest, most descriptive version of the property, the count is followed by a list of some of the code points matched by it. The list includes all the matched characters in the 0-255 range, enclosed in the familiar [brackets] the same as a regular expression bracketed character class. Following that, the next few higher matching ranges are also given. To avoid visual ambiguity, the SPACE character is represented as \\x$space_hex.
For emphasis, those properties that match no code points at all are listed as well in a separate section following the table.
Most properties match the same code points regardless of whether "/i" case-insensitive matching is specified or not. But a few properties are affected. These are shown with the notation (/i= other_property) in the second column. Under case-insensitive matching they match the same code points as the property other_property.
There is no description given for most non-Perl defined properties (See $unicode_reference_url for that).
For compactness, '*' is used as a wildcard instead of showing all possible combinations. For example, entries like:
\\p{Gc: *} \\p{General_Category: *}
mean that 'Gc' is a synonym for 'General_Category', and anything that is valid for the latter is also valid for the former. Similarly,
\\p{Is_*} \\p{*}
means that if and only if, for example, \\p{Foo} exists, then \\p{Is_Foo} and \\p{IsFoo} are also valid and all mean the same thing. And similarly, \\p{Foo=Bar} means the same as \\p{Is_Foo=Bar} and \\p{IsFoo=Bar}. "*" here is restricted to something not beginning with an underscore.
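As an illustration of the Is_ wildcard rule, the sketch below tries the three spellings of one single form against the same character. Only single forms are exercised here, and the hash and variable names are illustrative only.

```perl
use strict;
use warnings;

my $alpha = "\x{3B1}";    # GREEK SMALL LETTER ALPHA

# The plain single form and its two Is-prefixed spellings.
my @forms = ('\p{Greek}', '\p{Is_Greek}', '\p{IsGreek}');

# Record, for each spelling, whether it matched the Greek letter.
my %matched = map { $_ => ($alpha =~ /$_/ ? 1 : 0) } @forms;

print "$_ => $matched{$_}\n" for @forms;
```

Each spelling should report the same result.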
Also, in binary properties, 'Yes', 'T', and 'True' are all synonyms for 'Y'. And 'No', 'F', and 'False' are all synonyms for 'N'. The table shows 'Y*' and 'N*' to indicate this, and doesn't have separate entries for the other possibilities. Note that not all properties which have values 'Yes' and 'No' are binary, and they have all their values spelled out without using this wild card, and a NOT clause in their description that highlights their not being binary. These also require the compound form to match them, whereas true binary properties have both single and compound forms available.
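The synonyms for binary property values can be exercised with a small loop. This sketch uses the binary property Alphabetic as its example (chosen here for illustration) and builds each pattern as a string before compiling it; the variable names are illustrative only.

```perl
use strict;
use warnings;

my @yes_values = qw(Y Yes T True);
my @no_values  = qw(N No F False);

# Count how many of the 'yes' spellings match the letter "a" ...
my $yes_hits = 0;
for my $value (@yes_values) {
    my $pattern = "\\p{Alphabetic=$value}";
    $yes_hits++ if "a" =~ /$pattern/;
}

# ... and how many of the 'no' spellings match the digit "1".
my $no_hits = 0;
for my $value (@no_values) {
    my $pattern = "\\p{Alphabetic=$value}";
    $no_hits++ if "1" =~ /$pattern/;
}

# A true binary property also has a single form.
my $single = "a" =~ /\p{Alphabetic}/ ? 1 : 0;

print "yes spellings matching 'a': $yes_hits of ", scalar(@yes_values), "\n";
print "no spellings matching '1': $no_hits of ", scalar(@no_values), "\n";
```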
Note that all non-essential underscores are removed in the display of the short names below.
Legend summary:
- * is a wild-card
- (\\d+) in the info column gives the number of Unicode code points matched by this property.
- $DEPRECATED means this is deprecated.
- $OBSOLETE means this is obsolete.
- $STABILIZED means this is stabilized.
- $STRICTER means tighter (stricter) name matching applies.
- $DISCOURAGED means use of this form is discouraged, and may not be stable.
$formatted_properties
$zero_matches
Properties accessible through Unicode::UCD
The value of any Unicode (not including Perl extensions) character property mentioned above for any single code point is available through "charprop()" in Unicode::UCD. "charprops_all()" in Unicode::UCD returns the values of all the Unicode properties for a given code point.
Besides these, all the Unicode character properties mentioned above (except for those marked as for internal use by Perl) are also accessible by "prop_invlist()" in Unicode::UCD.
Due to their nature, not all Unicode character properties are suitable for regular expression matches, nor for prop_invlist(). The remaining non-provisional, non-internal ones are accessible via "prop_invmap()" in Unicode::UCD (except for those that this Perl installation hasn't included; see below for which those are).
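A minimal sketch of the Unicode::UCD interface, assuming a Perl whose Unicode::UCD exports charprop() (older releases lack it). The helper in_invlist() is hypothetical, written here only to show how the inversion list returned by prop_invlist() can be read.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charprop prop_invlist);

# Value of one property for one code point (U+0041 LATIN CAPITAL LETTER A).
my $gc = charprop(0x41, "General_Category");

# Inversion list for a property: elements at even indices start a range of
# code points that match; elements at odd indices start a range that doesn't.
my @greek = prop_invlist("Script_Extensions=Greek");

# Hypothetical helper: a code point is in the set if an odd number of
# list elements are less than or equal to it.
sub in_invlist {
    my ($code_point, $invlist) = @_;
    my $boundaries_passed = grep { $_ <= $code_point } @$invlist;
    return $boundaries_passed % 2;
}

my $alpha_is_greek = in_invlist(0x3B1, \@greek);    # GREEK SMALL LETTER ALPHA
my $a_is_greek     = in_invlist(0x41,  \@greek);    # LATIN CAPITAL LETTER A

print "Gc of U+0041: $gc\n";
print "U+03B1 Greek? $alpha_is_greek; U+0041 Greek? $a_is_greek\n";
```

The linear scan keeps the helper short; a binary search would suit large lists better.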
For compatibility with other parts of Perl, all the single forms given in the table in the section above are recognized. BUT, there are some ambiguities between some Perl extensions and the Unicode properties, all of which are silently resolved in favor of the official Unicode property. To avoid surprises, you should only use prop_invmap() for forms listed in the table below, which omits the non-recommended ones. The affected forms are the Perl single form equivalents of Unicode properties, such as \\p{sc} being a single-form equivalent of \\p{gc=sc}, which is treated by prop_invmap() as the Script property, whose short name is sc. The table indicates the current ambiguities in the INFO column, beginning with the word "NOT".
The standard Unicode properties listed below are documented in $unicode_reference_url; Perl_Decimal_Digit is documented in "prop_invmap()" in Unicode::UCD. The other Perl extensions are in "Other Properties" in perlunicode.
The first column in the table is a name for the property; the second column is an alternative name, if any, plus possibly some annotations. The alternative name is the property's full name, unless that would simply repeat the first column, in which case the second column indicates the property's short name (if different). The annotations are given only in the entry for the full name. The annotations for binary properties include a list of the first few ranges that the property matches. To avoid any ambiguity, the SPACE character is represented as \\x$space_hex.
If a property is obsolete, etc, the entry will be flagged with the same characters used in the table in the section above, like $DEPRECATED or $STABILIZED.
$ucd_pod
Properties accessible through other means
Certain properties are accessible also via core function calls. These are:
Lowercase_Mapping lc() and lcfirst()
Titlecase_Mapping ucfirst()
Uppercase_Mapping uc()
Also, Case_Folding is accessible through the /i modifier in regular expressions, the \\F transliteration escape, and the fc operator.
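The mappings listed above can be seen on the Greek sigmas, which have one capital form and two small forms. This is a sketch assuming Perl 5.16 or later, since `use v5.16` is what enables the fc operator and the \F escape here; the variable names are illustrative only.

```perl
use v5.16;       # turns on strict, plus the 'fc' and 'unicode_strings' features
use warnings;

my $capital_sigma = "\x{3A3}";    # GREEK CAPITAL LETTER SIGMA
my $small_sigma   = "\x{3C3}";    # GREEK SMALL LETTER SIGMA
my $final_sigma   = "\x{3C2}";    # GREEK SMALL LETTER FINAL SIGMA

# Lowercase_Mapping, Uppercase_Mapping, and Titlecase_Mapping.
my $lowered = lc $capital_sigma;
my $uppered = uc $final_sigma;
my $titled  = ucfirst($small_sigma . $small_sigma);

# Case_Folding, reached three ways: fc, the \F escape, and /i.
my $folds_equal = fc($capital_sigma) eq fc($final_sigma) ? 1 : 0;
my $escaped     = "\F$capital_sigma\E";
my $i_match     = $capital_sigma =~ /$final_sigma/i ? 1 : 0;

printf "lc: U+%04X, uc: U+%04X, folds equal: %d, /i match: %d\n",
    ord $lowered, ord $uppered, $folds_equal, $i_match;
```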
Besides being able to say \\p{Name=...}, the Name and Name_Aliases properties are accessible through the \\N{} interpolation in double-quoted strings and regular expressions; and functions charnames::viacode(), charnames::vianame(), and charnames::string_vianame() (which require a use charnames (); to be specified).
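A brief sketch of the three charnames functions named above, loaded with the empty import list as the text describes. The variable names are illustrative only.

```perl
use strict;
use warnings;
use charnames ();

# Code point to name.
my $name = charnames::viacode(0x3B1);

# Name to code point (a number) and name to string (a character).
my $code_point = charnames::vianame("GREEK SMALL LETTER ALPHA");
my $string     = charnames::string_vianame("GREEK SMALL LETTER ALPHA");

printf "%s is U+%04X\n", $name, $code_point;
```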
Finally, most properties related to decomposition are accessible via Unicode::Normalize.
Unicode character properties that are NOT accepted by Perl
Perl will generate an error for a few character properties in Unicode when used in a regular expression. The non-Unihan ones are listed below, with the reasons they are not accepted, perhaps with work-arounds. The short names for the properties are listed enclosed in (parentheses). As described after the list, an installation can change the defaults and choose to accept any of these. The list is machine generated based on the choices made for the installation that generated this document.
@bad_re_properties
An installation can choose to allow any of these to be matched by downloading the Unicode database from http://www.unicode.org/Public/ to \$Config{privlib}/unicore/ in the Perl source tree, changing the controlling lists contained in the program \$Config{privlib}/unicore/mktables and then re-compiling and installing. (\%Config is available from the Config module).
Also, perl can be recompiled to operate on an earlier version of the Unicode standard. Further information is at \$Config{privlib}/unicore/README.perl.
Other information in the Unicode data base
The Unicode data base is delivered in two different formats. The XML version is valid for more modern Unicode releases. The other version is a collection of files. The two are intended to give equivalent information. Perl uses the older form; this allows you to recompile Perl to use early Unicode releases.
The only non-character property that Perl currently supports is Named Sequences, in which a sequence of code points is given a name and generally treated as a single entity. (Perl supports these via the \\N{...} double-quotish construct, "charnames::string_vianame(name)" in charnames, and "namedseq()" in Unicode::UCD.)
Below is a list of the files in the Unicode data base that Perl doesn't currently use, along with very brief descriptions of their purposes. Some of the names of the files have been shortened from those that Unicode uses, in order to allow them to be distinguishable from similarly named files on file systems for which only the first 8 characters of a name are significant.
@unused_files
SEE ALSO
END
# And write it. The 0 means no utf8.
main::write([ $pod_directory, "$pod_file.pod" ], 0, \@OUT);
return;
}
sub make_Name_pm () {
# Create and write Name.pm, which contains subroutines and data to use in
# conjunction with Name.pl
# Maybe there's nothing to do.
return unless $has_hangul_syllables || @code_points_ending_in_code_point;
my @name = <<END;
$HEADER
$INTERNAL_ONLY_HEADER
NAME -- Internal generated file for use by charnames