=pod

=encoding utf8

=head1 NAME

Muldis::D::Ext::Text -
Muldis D extension for character string data types and operators

=head1 VERSION

This document is Muldis::D::Ext::Text version 0.26.0.

=head1 PREFACE

This document is part of the Muldis D language specification, whose root
document is L<Muldis::D>; you should read that root document
before you read this one, which provides subservient details.

=head1 DESCRIPTION

Muldis D has a mandatory core set of system-defined (eternally available)
entities, which is referred to as the I<Muldis D core> or the I<core>; they
are the minimal entities that all Muldis D implementations need to provide;
they are mutually self-describing and are used to bootstrap the language;
any entities outside the core, called I<Muldis D extensions>, are
non-mandatory and are defined in terms of the core or each other, but the
reverse isn't true.

This current C<Text> document describes the system-defined I<Muldis D Text
Extension>, which consists of character string data types and operators,
essentially all the generic ones that a typical programming language should
have, except for the bare minimum needed for bootstrapping Muldis D, which
are defined in the language core instead.

This current document does not describe the polymorphic operators that all
types, or some types including core types, have defined over them; said
operators are defined once for all types in L<Muldis::D::Core>.

I<This documentation is pending.>

=head1 SYSTEM-DEFINED TEXT-CONCERNING FUNCTIONS

These functions implement commonly used character string operations.

=over

=item C<function sys.std.Text.catenation result Text params {
topic(array_of.Text) }>

This function results in the catenation of the N element values of its
argument; it is a reduction operator that recursively takes each
consecutive pair of input values and catenates them together (catenation
is associative) until just one is left, which is the result.  If C<topic>
has zero values, then C<catenation> results in the empty string value,
which is the identity value for catenation.

=item C<function sys.std.Text.repeat result Text params { topic(Text),
count(UInt) }>

This function results in the catenation of C<count> instances of C<topic>.

=item C<function sys.std.Text.length_in_graphemes result UInt params {
topic(Text) }>

This function results in the length of its argument in graphemes.  We are
assuming here for simplicity that at the grapheme level of abstraction, the
particular Unicode Normal Form that might be in use behind the scenes has
no effect on the result.  Muldis D explicitly never works at the Unicode
codepoint or encoding byte level abstractions of text.

=item C<function sys.std.Text.is_substr result Bool params { look_in(Text),
look_for(Text), fixed_start(Bool)?, fixed_end(Bool)? }>

This function results in C<Bool:true> iff its C<look_for> argument is a
substring of its C<look_in> argument as per the optional C<fixed_start> and
C<fixed_end> constraints, and C<Bool:false> otherwise.  If C<fixed_start>
or C<fixed_end> is C<Bool:true>, then C<look_for> must occur right at the
start or end, respectively, of C<look_in> in order for C<is_substr> to
result in C<Bool:true>; if either flag is C<Bool:false>, its additional
constraint doesn't apply.  Each of the C<fixed_(start|end)> parameters is
optional and defaults to C<Bool:false> if no explicit argument is given to
it.  Note that C<is_substr> will handle the common special cases of SQL's
"LIKE" operator for patterns like ['foo', '%foo', 'foo%', '%foo%'], but see
also the C<is_match_using_like> function which provides the full generality
of SQL's "LIKE", such as 'foo%bar%baz'.

=item C<function sys.std.Text.is_not_substr result Bool params {
look_in(Text), look_for(Text), fixed_start(Bool)?, fixed_end(Bool)? }>

This function is exactly the same as C<sys.std.Text.is_substr> except that
it results in the opposite boolean value when given the same arguments.

=back
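
The semantics of the functions above can be illustrated with a minimal
Python sketch.  This is not Muldis D code and not a reference
implementation; the Python function names merely mirror the Muldis D ones.
Two points are assumptions rather than statements of this specification:
the grapheme count is only approximated by skipping combining marks (a full
implementation would use proper grapheme cluster segmentation), and the
case where both C<fixed_start> and C<fixed_end> are true is read as
requiring equality, by analogy with the SQL pattern 'foo'.

```python
import unicodedata
from functools import reduce


def catenation(topic):
    """Reduce a list of strings pairwise; the empty string is the identity."""
    return reduce(lambda left, right: left + right, topic, "")


def repeat(topic, count):
    """Catenate `count` instances of `topic`; `count` must not be negative."""
    if count < 0:
        raise ValueError("count must be a non-negative integer")
    return catenation([topic] * count)


def length_in_graphemes(topic):
    """Approximate grapheme count (assumption): skip combining marks.

    This gives the same answer for precomposed and decomposed accented
    letters, but it does not handle every kind of grapheme cluster.
    """
    return sum(1 for ch in topic if unicodedata.combining(ch) == 0)


def is_substr(look_in, look_for, fixed_start=False, fixed_end=False):
    """Substring test with optional anchoring at the start and/or end."""
    if fixed_start and fixed_end:
        # Assumed reading: anchored at both ends means the values are equal.
        return look_in == look_for
    if fixed_start:
        return look_in.startswith(look_for)
    if fixed_end:
        return look_in.endswith(look_for)
    return look_for in look_in


def is_not_substr(look_in, look_for, fixed_start=False, fixed_end=False):
    """Logical negation of is_substr for the same arguments."""
    return not is_substr(look_in, look_for, fixed_start, fixed_end)
```

In this sketch, C<is_substr("foobar", "foo", fixed_start=True)> plays the
role of the SQL pattern 'foo%', and leaving both flags false plays the role
of '%foo%'.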

=head1 FUNCTIONS FOR TEXT NORMALIZATION

These functions implement commonly used text normalization operations which
are relatively simple or whose details are fully specified by the Unicode
standard; examples are folding letters to lower or upper case, removing
combining characters like accent marks and other diacritics from base
letters, removing or normalizing whitespace, and converting text from a
larger to a smaller character repertoire such as ASCII.  By contrast,
operations such as stemming, removing common words, or expanding
abbreviations are not done by these functions and are best implemented by a
third party language extension or library.

You can use these functions as a basis for making comparison, ranking, or
collation operators that ignore some distinctions between values, such as
their case or accents, in order to do case-insensitive, accent-insensitive,
or whitespace-insensitive matching, indexing, or sorting.  The actual
system-defined matching operators are still sensitive to case et al, but
you can make them behave as if they were not by having them work with the
results of these normalization functions rather than with the inputs to
these functions.  This is useful when you want to emulate, over Muldis D,
the semantics of systems that are insensitive to such distinctions though
possibly preserving of them.

=over

=item C<function sys.std.Text.case_folded_to_upper result Text params {
topic(Text) }>

This function results in the normalization of its argument where any
letters considered to be (small) lowercase are folded to (capital)
uppercase.

=item C<function sys.std.Text.case_folded_to_lower result Text params {
topic(Text) }>

This function results in the normalization of its argument where any
letters considered to be (capital) uppercase are folded to (small)
lowercase.

=item C<function sys.std.Text.accents_stripped result Text params {
topic(Text) }>

This function results in the normalization of its argument where any accent
marks or diacritics are removed from letters, leaving just the primary
letters.

=item C<function sys.std.Text.ASCII result Text params { topic(Text),
mark(Text)? }>

This function results in the normalization of its C<topic> argument where
each character not in the 7-bit ASCII repertoire is replaced with the same
ASCII character string, namely the one specified by its C<mark> argument;
if C<mark> is the empty string, then the non-ASCII characters are simply
stripped out.  This function is quite simple and does not do a smart
replace with sequences of similar looking ASCII characters.  The C<mark>
parameter is optional and defaults to the empty string if no explicit
argument is given to it.

=item C<function sys.std.Text.whitespace_trimmed result Text params {
topic(Text) }>

This function results in the normalization of its argument where any
leading or trailing whitespace characters are trimmed.

=back
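
The normalization functions above can be approximated with a short Python
sketch.  It is illustrative only and makes several assumptions that this
specification does not state: case folding is delegated to Python's
built-in string methods, accents are identified as Unicode combining marks
after canonical decomposition, and the ASCII function (named C<ascii_only>
here, a hypothetical name) works per codepoint rather than per grapheme.

```python
import unicodedata


def case_folded_to_upper(topic):
    """Fold lowercase letters to uppercase using Python's own mapping."""
    return topic.upper()


def case_folded_to_lower(topic):
    """Fold uppercase letters to lowercase using Python's own mapping."""
    return topic.lower()


def accents_stripped(topic):
    """Remove combining marks, leaving just the base letters (assumption)."""
    decomposed = unicodedata.normalize("NFD", topic)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def ascii_only(topic, mark=""):
    """Replace each non-ASCII codepoint with `mark`; '' simply strips it."""
    return "".join(ch if ord(ch) < 128 else mark for ch in topic)


def whitespace_trimmed(topic):
    """Trim leading and trailing whitespace characters."""
    return topic.strip()


def same_ignoring_case_and_accents(left, right):
    """Hypothetical helper: compare normalized results, not the inputs."""
    def normalize(value):
        return case_folded_to_lower(accents_stripped(value))
    return normalize(left) == normalize(right)
```

The last helper shows the intended usage pattern: an ordinary, fully
sensitive equality test is applied to the results of the normalization
functions, which yields insensitive matching without changing the
comparison operator itself.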

=head1 FUNCTIONS FOR PATTERN MATCHING AND TRANSLITERATION

These functions implement commonly used operations for matching text
against a pattern or performing substitutions of characters for others;
included are both the functionality of SQL's simple "LIKE" pattern matching
operator and support for Perl 5's regular expressions and Perl 6's
rules.  All of these functions are sensitive to case et al as per
C<is_identical> unless explicitly given flags to do otherwise, where
applicable; alternatively, just use them to search the results of
normalization functions if you need to.  Note that Perl 5.10+ is also an
inspiration, in that its regular expression feature is algorithm-agnostic
and can both be extended with new plugged-in algorithms and have multiple
system-defined ones.  I<Note that a lot of this section is still TODO, with
several useful functions missing, or more complicated parts like the Perl
pattern matching may be separated off into their own language extensions
later.>

=over

=item C<function sys.std.Text.is_match_using_like result Bool params {
look_in(Text), look_for(Text), escape(Text) }>

This function results in C<Bool:true> iff its C<look_in> argument is
matched by the pattern given in its C<look_for> argument, and C<Bool:false>
otherwise.  This function implements the full generalization of SQL's
simple "LIKE" pattern matching operator.  Any characters in C<look_for> are
matched literally except for the 2 wildcard characters C<_> (match any
single character) and C<%> (match any string of 0..N characters); the
preceding assumes that the C<escape> argument is the empty string.  If
C<escape> is a character, then that character is also special and, like the
2 wildcard characters, its lone occurrence in C<look_for> will no longer
match itself; rather it will be used in C<look_for> to indicate when the
pattern wishes to match a literal C<_> or C<%> or the escape character
itself.  For example, if C<\> is used as the escape character,
then you use C<\_>, C<\%>, C<\\> to match the literal wildcard characters
or itself, respectively.

=item C<function sys.std.Text.is_not_match_using_like result Bool params {
look_in(Text), look_for(Text), escape(Text) }>

This function is exactly the same as C<sys.std.Text.is_match_using_like>
except that it results in the opposite boolean value when given the same
arguments; it implements SQL's "NOT LIKE".

=back
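
One way to picture the "LIKE" semantics described above is to translate the
pattern into a regular expression, as in the following Python sketch.  It
is a sketch under stated assumptions, not the specified algorithm: the
helper name C<like_to_regex> is hypothetical, C<_> matches one codepoint
rather than one grapheme, and the choice to raise an error for an escape
character that is not followed by C<_>, C<%> or itself is an assumption,
since this document does not say what should happen in that case.

```python
import re


def like_to_regex(look_for, escape):
    """Translate a LIKE pattern into a compiled regular expression."""
    if len(escape) > 1:
        raise ValueError("escape must be empty or a single character")
    parts = []
    i = 0
    while i < len(look_for):
        ch = look_for[i]
        if escape and ch == escape:
            # The escape character makes the next character literal.
            i += 1
            if i >= len(look_for) or look_for[i] not in ("_", "%", escape):
                # Assumed behaviour; the specification is silent on this.
                raise ValueError("invalid escape sequence in pattern")
            parts.append(re.escape(look_for[i]))
        elif ch == "_":
            parts.append(".")    # any single character
        elif ch == "%":
            parts.append(".*")   # any string of 0..N characters
        else:
            parts.append(re.escape(ch))
        i += 1
    # DOTALL lets the wildcards match newline characters too.
    return re.compile("".join(parts), re.DOTALL)


def is_match_using_like(look_in, look_for, escape):
    """True iff the whole of look_in is matched by the LIKE pattern."""
    return like_to_regex(look_for, escape).fullmatch(look_in) is not None


def is_not_match_using_like(look_in, look_for, escape):
    """Logical negation of is_match_using_like, like SQL's NOT LIKE."""
    return not is_match_using_like(look_in, look_for, escape)
```

Note that the pattern must match all of C<look_in>, not just part of it,
which is why the sketch uses a full match; a pattern such as 'foo%bar%baz'
is anchored at both ends and only the C<%> wildcards absorb extra text.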

=head1 SEE ALSO

Go to L<Muldis::D> for the majority of distribution-internal
references, and L<Muldis::D::SeeAlso> for the majority of
distribution-external references.

=head1 AUTHOR

Darren Duncan (C<perl@DarrenDuncan.net>)

=head1 LICENSE AND COPYRIGHT

This file is part of the formal specification of the Muldis D language.

Muldis D is Copyright © 2002-2008, Darren Duncan.

See the LICENSE AND COPYRIGHT of L<Muldis::D> for details.

=head1 TRADEMARK POLICY

The TRADEMARK POLICY in L<Muldis::D> applies to this file too.

=head1 ACKNOWLEDGEMENTS

The ACKNOWLEDGEMENTS in L<Muldis::D> apply to this file too.

=cut