NAME
Muldis::D::Ext::Text - Muldis D extension for character string data types and operators
VERSION
This document is Muldis::D::Ext::Text version 0.55.0.
PREFACE
This document is part of the Muldis D language specification, whose root document is Muldis::D; you should read that root document before you read this one, which provides subservient details.
DESCRIPTION
Muldis D has a mandatory core set of system-defined (eternally available) entities, which is referred to as the Muldis D core or the core; they are the minimal entities that all Muldis D implementations need to provide; they are mutually self-describing and are used to bootstrap the language; any entities outside the core, called Muldis D extensions, are non-mandatory and are defined in terms of the core or each other, but the reverse isn't true.
This current Text document describes the system-defined Muldis D Text Extension, which consists of character string data types and operators, essentially all the generic ones that a typical programming language should have, but for the bare minimum needed for bootstrapping Muldis D, which are defined in the language core instead.
This current document does not describe the polymorphic operators that all types, or some types including core types, have defined over them; said operators are defined once for all types in Muldis::D::Core.
This documentation is pending.
Maybe TODO: Add proper subtypes of Text specific to those values in each Unicode Normal Form; such would be the output of a folded_to_NFC etc function; or maybe not, as then maybe we'd want to add ASCII etc subtypes too; all said, too much complexity for too little benefit.
SYSTEM-DEFINED TEXT-CONCERNING FUNCTIONS
These functions implement commonly used character string operations.
function sys.std.Text.catenation result Text params { topic(array_of.Text) }
-
This function results in the catenation of the N element values of its argument; it is a reduction operator that recursively takes each consecutive pair of input values and catenates them together (catenation is associative) until just one is left, which is the result. If topic has zero values, then catenation results in the empty string value, which is the identity value for catenation.
function sys.std.Text.repeat result Text params { topic(Text), count(NNInt) }
-
This function results in the catenation of count instances of topic.
function sys.std.Text.length_in_codepoints result NNInt params { topic(Text) }
-
This function results in the length of its argument in codepoints, or in other words, in the actual length of the argument since Muldis D explicitly works natively at the codepoint abstraction level.
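The catenation, repeat and length_in_codepoints functions described above can be sketched in Python as follows. This is only an illustration of the stated semantics, not Muldis D code; the Python function names mirror the Muldis D ones, and the ValueError for a negative count is an assumption standing in for the NNInt parameter type.

```python
from functools import reduce


def catenation(topic):
    """Reduce a list of strings by pairwise catenation.

    The empty string is the identity value, so an empty list yields "".
    """
    return reduce(lambda left, right: left + right, topic, "")


def repeat(topic, count):
    """Catenate `count` instances of `topic`."""
    if count < 0:
        # Assumption: stands in for the NNInt (non-negative integer) type.
        raise ValueError("count must be a non-negative integer")
    return catenation([topic] * count)


def length_in_codepoints(topic):
    """A Python str is a sequence of code points, so len() counts them."""
    return len(topic)
```

Because catenation is associative and has an identity value, the reduction gives the same result however the pairs are grouped, and the zero-element case needs no special handling.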
function sys.std.Text.length_in_graphemes result NNInt params { topic(Text) }
-
This function results in the length of its argument in language-independent graphemes.
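To illustrate how the two length functions can differ, here is a rough Python sketch. It is an approximation under a stated assumption: a grapheme is taken to be a base code point plus any following combining marks, which is simpler than the Unicode grapheme cluster rules and ignores cases such as Hangul jamo or joined emoji sequences. The name length_in_graphemes_approx is hypothetical.

```python
import unicodedata


def length_in_graphemes_approx(topic):
    """Count code points that are not combining marks.

    Assumption: each code point whose General_Category starts with "M"
    attaches to the preceding base character; all others start a grapheme.
    """
    return sum(
        1 for ch in topic if not unicodedata.category(ch).startswith("M")
    )
```

For the string "e" followed by U+0301 (combining acute accent), the codepoint length is 2 while this approximate grapheme length is 1.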
function sys.std.Text.is_substr result Bool params { look_in(Text), look_for(Text), fixed_start(Bool)?, fixed_end(Bool)? }
-
This function results in Bool:true iff its look_for argument is a substring of its look_in argument as per the optional fixed_start and fixed_end constraints, and Bool:false otherwise. If fixed_start or fixed_end is Bool:true, then look_for must occur right at the start or end, respectively, of look_in in order for is_substr to result in Bool:true; if either flag is Bool:false, its additional constraint doesn't apply. Each of the fixed_(start|end) parameters is optional and defaults to Bool:false if no explicit argument is given to it. Note that is_substr will handle the common special cases of SQL's "LIKE" operator for patterns like ['foo', '%foo', 'foo%', '%foo%'], but see also the is_match_using_like function, which provides the full generality of SQL's "LIKE", such as 'foo%bar%baz'.
function sys.std.Text.is_not_substr result Bool params { look_in(Text), look_for(Text), fixed_start(Bool)?, fixed_end(Bool)? }
-
This function is exactly the same as sys.std.Text.is_substr except that it results in the opposite boolean value when given the same arguments.
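A minimal Python sketch of is_substr and is_not_substr follows. It assumes that setting both flags means look_for must be the whole of look_in, which is how the 'foo' case of SQL's "LIKE" mentioned above would behave; the specification text does not spell this case out, so treat that reading as an assumption.

```python
def is_substr(look_in, look_for, fixed_start=False, fixed_end=False):
    """Test whether look_for occurs in look_in, optionally anchored."""
    if fixed_start and fixed_end:
        # Assumption: anchored at both ends means equality (LIKE 'foo').
        return look_in == look_for
    if fixed_start:
        return look_in.startswith(look_for)  # LIKE 'foo%'
    if fixed_end:
        return look_in.endswith(look_for)  # LIKE '%foo'
    return look_for in look_in  # LIKE '%foo%'


def is_not_substr(look_in, look_for, fixed_start=False, fixed_end=False):
    """Opposite boolean result of is_substr for the same arguments."""
    return not is_substr(look_in, look_for, fixed_start, fixed_end)
```

The four branches correspond one to one with the four "LIKE" pattern shapes listed in the description of is_substr.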
FUNCTIONS FOR TEXT NORMALIZATION
These functions implement commonly used text normalization operations which are relatively simple or whose details are fully specified by the Unicode standard; examples are folding letters to lower or upper case, removing combining characters like accent marks and other diacritics from base letters, removing or normalizing whitespace, or converting text from a larger to a smaller character repertoire such as to ASCII. By contrast, operations such as stemming or removing common words or expanding abbreviations are not done by these functions and are best implemented by a third party language extension or library. You can use these functions as a basis for making comparison or ranking or collation operators that ignore some distinctions between values such as their case or accents, such as to do case-insensitive or accent-insensitive or whitespace-insensitive matching or indexing or sorting; the actual system-defined matching operators are still sensitive to case et al, but you can pretend they're not by having them work with the results of these normalization functions rather than on the inputs to these functions. This is useful when you want to emulate, over Muldis D, the semantics of systems that are insensitive though possibly preserving.
function sys.std.Text.folded_to_NF(C|D) result Text { topic(Text) }
-
This function results in the normalization of its argument into Unicode Normal Form C|D. TODO: Generalize this to handle the other normal forms, such as with an extra enum argument, or add extra functions. Also to do eventually, and definitely with the extra argument version, add normalization specific to locales, such as to handle language-specific graphemes right.
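As an illustration of what normalization into Normal Form C or D does, the following Python sketch uses the standard library's unicodedata.normalize; the wrapper names folded_to_NFC and folded_to_NFD simply mirror the two Muldis D functions and are not part of any Python API.

```python
import unicodedata


def folded_to_NFC(topic):
    """Normalize to Unicode Normal Form C (canonical composition)."""
    return unicodedata.normalize("NFC", topic)


def folded_to_NFD(topic):
    """Normalize to Unicode Normal Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", topic)
```

For example, "e" followed by U+0301 composes under NFC into the single code point U+00E9, and U+00E9 decomposes under NFD back into the two code point sequence.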
function sys.std.Text.case_folded_to_upper result Text { topic(Text) }
-
This function results in the normalization of its argument where any letters considered to be (small) lowercase are folded to (capital) uppercase.
function sys.std.Text.case_folded_to_lower result Text { topic(Text) }
-
This function results in the normalization of its argument where any letters considered to be (capital) uppercase are folded to (small) lowercase.
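The two case folding functions, and their use for case-insensitive comparison, can be sketched in Python as below. The sketch relies on Python's str.upper and str.lower; whether Muldis D implementations would fold exactly the same letters is not stated in this document, and the helper equal_ignoring_case is hypothetical.

```python
def case_folded_to_upper(topic):
    """Fold lowercase letters to uppercase."""
    return topic.upper()


def case_folded_to_lower(topic):
    """Fold uppercase letters to lowercase."""
    return topic.lower()


def equal_ignoring_case(left, right):
    """Hypothetical helper: compare the folded results, not the inputs."""
    return case_folded_to_lower(left) == case_folded_to_lower(right)
```

The comparison itself stays case-sensitive; the insensitivity comes only from comparing the normalized results rather than the original values.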
function sys.std.Text.accents_stripped result Text { topic(Text) }
-
This function results in the normalization of its argument where any accent marks or diacritics are removed from letters, leaving just the primary letters.
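One common way to strip accents, shown here as a Python sketch, is to decompose the text, drop the combining marks, and recompose. This technique is an assumption about how accents_stripped could be implemented; the specification does not prescribe an algorithm.

```python
import unicodedata


def accents_stripped(topic):
    """Remove combining marks, leaving just the base letters."""
    # Decompose so accents become separate combining code points.
    decomposed = unicodedata.normalize("NFD", topic)
    # Keep only code points with canonical combining class 0.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose anything that remains composable.
    return unicodedata.normalize("NFC", kept)
```

Letters that have no canonical decomposition, such as "ø", pass through this sketch unchanged.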
function sys.std.Text.ASCII result Text { topic(Text), mark(Text)? }
-
This function results in the normalization of its topic argument where each character not in the 7-bit ASCII repertoire is replaced with the common ASCII character string specified by its mark argument; if mark is the empty string, then the non-ASCII characters are simply stripped. This function is quite simple and does not do a smart replace with sequences of similar looking ASCII characters. The mark parameter is optional and defaults to the empty string if no explicit argument is given to it.
function sys.std.Text.whitespace_trimmed result Text { topic(Text) }
-
This function results in the normalization of its argument where any leading or trailing whitespace characters are trimmed.
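The ASCII and whitespace_trimmed functions can be sketched in Python as follows. The name to_ascii is hypothetical (chosen to avoid a bare ASCII identifier), and the sketch assumes that Python's notion of whitespace in str.strip is close enough to what the specification intends.

```python
def to_ascii(topic, mark=""):
    """Replace each non-ASCII character with `mark`.

    With the default empty mark, non-ASCII characters are simply stripped.
    """
    return "".join(ch if ord(ch) < 128 else mark for ch in topic)


def whitespace_trimmed(topic):
    """Trim leading and trailing whitespace characters."""
    return topic.strip()
```

As the description notes, this is a simple replacement: "ï" becomes the mark (or nothing), not a similar looking "i".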
FUNCTIONS FOR PATTERN MATCHING AND TRANSLITERATION
These functions implement commonly used operations for matching text against a pattern or performing substitutions of characters for others; included are both the functionality of SQL's simple "LIKE" pattern matching operator and support for Perl 5's regular expressions and Perl 6's rules. All of these functions are case-sensitive et al as per is_identical unless explicitly given flags to do otherwise, where applicable; or just use them to search results of normalization functions if you need to. Note that Perl 5.10+ is also an inspiration, in that its regular expression feature is algorithm-agnostic and can both be plugged in with new algorithms and have multiple system-defined ones. Note that a lot of this section is still TODO, with several useful functions missing, and more complicated parts like the Perl pattern matching may be separated off into their own language extensions later.
function sys.std.Text.is_match_using_like result Bool params { look_in(Text), look_for(Text), escape(Text) }
-
This function results in Bool:true iff its look_in argument is matched by the pattern given in its look_for argument, and Bool:false otherwise. This function implements the full generalization of SQL's simple "LIKE" pattern matching operator. Any characters in look_for are matched literally except for the 2 wildcard characters _ (match any single character) and % (match any string of 0..N characters); the preceding assumes that the escape argument is the empty string. If escape is a character, then that character is also special and, like the 2 wildcard characters, its lone occurrence in look_for will no longer match itself; rather it will be used in look_for to indicate when the pattern wishes to match a literal _ or % or the escape character itself. For example, if \ is used as the escape character, then you use \_, \%, \\ to match the literal wildcard characters or itself, respectively.
function sys.std.Text.is_not_match_using_like result Bool params { look_in(Text), look_for(Text), escape(Text) }
-
This function is exactly the same as sys.std.Text.is_match_using_like except that it results in the opposite boolean value when given the same arguments; it implements SQL's "NOT LIKE".
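The "LIKE" semantics described above can be sketched in Python by translating the pattern into a regular expression. Several details are assumptions not settled by this document: the helper like_to_regex is hypothetical, escape defaults to the empty string for convenience, an escape character followed by an ordinary character is treated as that literal character, and a pattern ending in a lone escape character is rejected.

```python
import re


def like_to_regex(look_for, escape=""):
    """Translate a LIKE pattern into a compiled regular expression."""
    if len(escape) > 1:
        raise ValueError("escape must be empty or a single character")
    parts = []
    i = 0
    while i < len(look_for):
        ch = look_for[i]
        if escape and ch == escape:
            # The escape character makes the next character literal.
            i += 1
            if i >= len(look_for):
                # Assumption: a trailing lone escape is an error.
                raise ValueError("pattern ends with a lone escape character")
            parts.append(re.escape(look_for[i]))
        elif ch == "_":
            parts.append(".")  # any single character
        elif ch == "%":
            parts.append(".*")  # any string of 0..N characters
        else:
            parts.append(re.escape(ch))
        i += 1
    # DOTALL lets the wildcards match newline characters too.
    return re.compile("".join(parts), re.DOTALL)


def is_match_using_like(look_in, look_for, escape=""):
    """True iff the whole of look_in is matched by the LIKE pattern."""
    return like_to_regex(look_for, escape).fullmatch(look_in) is not None


def is_not_match_using_like(look_in, look_for, escape=""):
    """Opposite boolean result; corresponds to SQL's NOT LIKE."""
    return not is_match_using_like(look_in, look_for, escape)
```

Using fullmatch rather than search reflects that a "LIKE" pattern must account for the entire look_in value, with % supplying any leading or trailing slack.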
SEE ALSO
Go to Muldis::D for the majority of distribution-internal references, and Muldis::D::SeeAlso for the majority of distribution-external references.
AUTHOR
Darren Duncan (perl@DarrenDuncan.net)
LICENSE AND COPYRIGHT
This file is part of the formal specification of the Muldis D language.
Muldis D is Copyright © 2002-2008, Darren Duncan.
See the LICENSE AND COPYRIGHT of Muldis::D for details.
TRADEMARK POLICY
The TRADEMARK POLICY in Muldis::D applies to this file too.
ACKNOWLEDGEMENTS
The ACKNOWLEDGEMENTS in Muldis::D apply to this file too.