THIS FUNCTION SHOULD BE USED IN ONLY VERY SPECIALIZED CIRCUMSTANCES. Instead, almost all code should use "uv_to_utf8" in perlapi or "uv_to_utf8_flags" in perlapi.

This function is like them, but the input is a strict Unicode (as opposed to native) code point. Only in very rare circumstances should code not be using the native code point.

For details, see the description for "uv_to_utf8_flags" in perlapi.

These functions are identical. THEY SHOULD BE USED IN ONLY VERY SPECIALIZED CIRCUMSTANCES.

Most code should use "uv_to_utf8_flags"() rather than call this directly.

This function is for code that wants any warning and/or error messages to be returned to the caller rather than be displayed. Any message that would have been displayed if all lexical warnings are enabled will instead be returned.

It is just like "uvchr_to_utf8_flags" but it takes an extra parameter placed after all the others, msgs. If this parameter is 0, this function behaves identically to "uvchr_to_utf8_flags". Otherwise, msgs should be a pointer to an HV * variable, in which this function creates a new HV to contain any appropriate message. The hash has three key-value pairs, as follows:

text

The text of the message as a SVpv.

warn_categories

The warning category (or categories) packed into a SVuv.

flag_bit

A single flag bit associated with this message, in a SVuv. The bit corresponds to some bit in the *errors return value. The possibilities are:

UNICODE_GOT_SURROGATE
UNICODE_GOT_NONCHAR
UNICODE_GOT_SUPER
UNICODE_GOT_PERL_EXTENDED

It's important to note that specifying this parameter as non-null will cause any warning this function would otherwise generate to be suppressed, and instead be placed in *msgs. The caller can check the lexical warnings state (or not) when choosing what to do with the returned message.

Only a single message is returned, even though a code point that requires Perl extended UTF-8 to represent is also above-Unicode. If either the UNICODE_WARN_PERL_EXTENDED or the UNICODE_DISALLOW_PERL_EXTENDED flag is set, the return is controlled by them; if neither is set, the return is controlled by the UNICODE_WARN_SUPER and UNICODE_DISALLOW_SUPER flags.

The caller, of course, is responsible for freeing any returned HV.

These each add the UTF-8 representation of the native code point uv to the end of the string d; d should have at least UVCHR_SKIP(uv)+1 (up to UTF8_MAXBYTES+1) free bytes available. The return value is the pointer to the byte after the end of the new character. In other words,

d = uv_to_utf8(d, uv);

This is the Unicode-aware way of saying

*(d++) = uv;

(uvchr_to_utf8 is a synonym for uv_to_utf8.)
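The append-and-return-the-new-end idiom can be illustrated outside of Perl with a minimal, standalone sketch. It assumes an ASCII platform and covers only standard UTF-8 (0 .. 0x10FFFF), not Perl's extended form; sketch_append_utf8 is a hypothetical name and is not Perl's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the d = uv_to_utf8(d, uv) idiom.
 * Writes the standard UTF-8 encoding of cp at d and returns the address
 * just past the last byte written.  The caller must guarantee room for
 * up to 4 bytes.  Returns NULL for code points above 0x10FFFF, which
 * this sketch does not attempt to encode. */
static unsigned char *
sketch_append_utf8(unsigned char *d, uint32_t cp)
{
    if (cp < 0x80) {                        /* 1 byte: 0xxxxxxx */
        *d++ = (unsigned char) cp;
    }
    else if (cp < 0x800) {                  /* 2 bytes: 110xxxxx 10xxxxxx */
        *d++ = (unsigned char) (0xC0 | (cp >> 6));
        *d++ = (unsigned char) (0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000) {                /* 3 bytes */
        *d++ = (unsigned char) (0xE0 | (cp >> 12));
        *d++ = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        *d++ = (unsigned char) (0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF) {              /* 4 bytes */
        *d++ = (unsigned char) (0xF0 | (cp >> 18));
        *d++ = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        *d++ = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        *d++ = (unsigned char) (0x80 | (cp & 0x3F));
    }
    else {
        return NULL;                        /* outside this sketch's range */
    }
    return d;
}
```

Because the return value is the new end of the string, successive calls can be chained on the same pointer to build up a buffer one code point at a time.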

uv_to_utf8_flags is used to make some classes of code points problematic in some way. uv_to_utf8 is effectively the same as calling uv_to_utf8_flags with flags set to 0, meaning no class of code point is considered problematic. That means any input code point from 0..IV_MAX is considered to be fine. IV_MAX is typically 0x7FFF_FFFF in a 32-bit word.

(uvchr_to_utf8_flags is a synonym for uv_to_utf8_flags).

A code point can be problematic in one of two ways. Its use could just raise a warning, and/or it could be forbidden with the function failing, and returning NULL.

The potential classes of problematic code points and the flags that make them so are:

If uv is a Unicode surrogate code point and UNICODE_WARN_SURROGATE is set, the function will raise a warning, provided UTF8 warnings are enabled. If instead UNICODE_DISALLOW_SURROGATE is set, the function will fail and return NULL. If both flags are set, the function will both warn and return NULL.

Similarly, the UNICODE_WARN_NONCHAR and UNICODE_DISALLOW_NONCHAR flags affect how the function handles a Unicode non-character.

And likewise, the UNICODE_WARN_SUPER and UNICODE_DISALLOW_SUPER flags affect the handling of code points that are above the Unicode maximum of 0x10FFFF. Languages other than Perl may not be able to accept files that contain these.

The flag UNICODE_WARN_ILLEGAL_INTERCHANGE selects all three of the above WARN flags; and UNICODE_DISALLOW_ILLEGAL_INTERCHANGE selects all three DISALLOW flags. UNICODE_DISALLOW_ILLEGAL_INTERCHANGE restricts the allowed inputs to the strict UTF-8 traditionally defined by Unicode. Similarly, UNICODE_WARN_ILLEGAL_C9_INTERCHANGE and UNICODE_DISALLOW_ILLEGAL_C9_INTERCHANGE are shortcuts to select the above-Unicode and surrogate flags, but not the non-character ones, as defined in Unicode Corrigendum #9. See "Noncharacter code points" in perlunicode.
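How the WARN and DISALLOW flags combine can be sketched in standalone C. The SK_* constants and sketch_check below are hypothetical names with made-up bit values; the real UNICODE_* values are defined in Perl's headers, and this is only an illustration of the rules described above, not Perl's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits mirroring the shape of the UNICODE_* flags. */
enum {
    SK_WARN_SURROGATE     = 1 << 0,
    SK_WARN_NONCHAR       = 1 << 1,
    SK_WARN_SUPER         = 1 << 2,
    SK_DISALLOW_SURROGATE = 1 << 3,
    SK_DISALLOW_NONCHAR   = 1 << 4,
    SK_DISALLOW_SUPER     = 1 << 5,

    /* Shortcuts: all three classes, or (Corrigendum #9) all but nonchars */
    SK_WARN_ILLEGAL_INTERCHANGE =
        SK_WARN_SURROGATE | SK_WARN_NONCHAR | SK_WARN_SUPER,
    SK_DISALLOW_ILLEGAL_INTERCHANGE =
        SK_DISALLOW_SURROGATE | SK_DISALLOW_NONCHAR | SK_DISALLOW_SUPER,
    SK_DISALLOW_ILLEGAL_C9_INTERCHANGE =
        SK_DISALLOW_SURROGATE | SK_DISALLOW_SUPER
};

static bool sk_is_surrogate(uint32_t cp) { return cp >= 0xD800 && cp <= 0xDFFF; }
static bool sk_is_super(uint32_t cp)     { return cp > 0x10FFFF; }

/* Non-characters: U+FDD0..U+FDEF plus the last two code points of each plane */
static bool sk_is_nonchar(uint32_t cp)
{
    return (cp >= 0xFDD0 && cp <= 0xFDEF)
        || (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}

/* Returns true if cp is accepted under flags.  *warn is set to true if a
 * warning would be requested (whether it is displayed would additionally
 * depend on the warnings state, which this sketch does not model). */
static bool
sketch_check(uint32_t cp, unsigned flags, bool *warn)
{
    unsigned warn_bit = 0, disallow_bit = 0;

    if (sk_is_surrogate(cp)) {
        warn_bit = SK_WARN_SURROGATE;  disallow_bit = SK_DISALLOW_SURROGATE;
    }
    else if (sk_is_nonchar(cp)) {
        warn_bit = SK_WARN_NONCHAR;    disallow_bit = SK_DISALLOW_NONCHAR;
    }
    else if (sk_is_super(cp)) {
        warn_bit = SK_WARN_SUPER;      disallow_bit = SK_DISALLOW_SUPER;
    }

    *warn = (flags & warn_bit) != 0;
    return (flags & disallow_bit) == 0;
}
```

With flags of 0 every code point is accepted silently, matching the behavior described for uv_to_utf8.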

Extremely high code points were never specified in any standard, and require an extension to UTF-8 to express, which Perl provides. It is likely that programs written in something other than Perl would not be able to read files that contain these; nor would Perl understand files written by something that uses a different extension. For these reasons, there is a separate set of flags that can warn and/or disallow these extremely high code points, even if other above-Unicode ones are accepted. They are the UNICODE_WARN_PERL_EXTENDED and UNICODE_DISALLOW_PERL_EXTENDED flags. For more information see "UTF8_GOT_PERL_EXTENDED". Of course UNICODE_DISALLOW_SUPER will treat all above-Unicode code points, including these, as malformations. (Note that the Unicode standard considers anything above 0x10FFFF to be illegal, but there are standards predating it that allow up to 0x7FFF_FFFF (2**31 - 1).)

A somewhat misleadingly named synonym for UNICODE_WARN_PERL_EXTENDED is retained for backward compatibility: UNICODE_WARN_ABOVE_31_BIT. Similarly, UNICODE_DISALLOW_ABOVE_31_BIT is usable instead of the more accurately named UNICODE_DISALLOW_PERL_EXTENDED. The names are misleading because on EBCDIC platforms, these flags can apply to code points that actually do fit in 31 bits. The new names accurately describe the situation in all cases.

These functions each translate from UTF-8 to UTF-32 (or UTF-64 on 64 bit platforms). In other words, to a code point ordinal value. (On EBCDIC platforms, the initial encoding is UTF-EBCDIC, and the output is a native code point).

For example, the string "A" would be converted to the number 65 on an ASCII platform, and to 193 on an EBCDIC one. Converting the string "ABC" would yield the same results, as the functions stop after the first character converted. Converting the string "\N{LATIN CAPITAL LETTER A WITH MACRON} plus anything more in the string" would yield the number 0x100 on both types of platforms, since the first character is U+0100.

The functions whose names contain to_uvchr are older than the functions whose names don't have chr in them. The API in the older functions is harder to use correctly, and so they are kept only for backwards compatibility, and may eventually become deprecated. If you are writing a module and use Devel::PPPort, your code can use the new functions back to at least Perl v5.7.1.

All the functions accept, without complaint, well-formed UTF-8 for any non-problematic Unicode code point 0 .. 0x10FFFF. There are two types of Unicode problematic code points: surrogate characters and non-character code points. (See perlunicode.) Some of the functions reject one or both of these. Private use characters and those code points yet to be assigned to a particular character are never considered problematic. Additionally, most of the functions accept non-Unicode code points, those starting at 0x110000.

There are two sets of these functions:

utf8_to_uv forms

Almost all code should use only utf8_to_uv, extended_utf8_to_uv, strict_utf8_to_uv, c9strict_utf8_to_uv, or utf8_to_uv_or_die. The other functions are either the problematic old form, or are for specialized uses.

utf8_to_uv_or_die has a simpler interface than the other four, for use when any errors encountered should be fatal. It throws an exception with any errors found, otherwise it returns the code point the input sequence represents.

The other four functions each return true if the sequence of bytes starting at s forms a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a code point; or false otherwise. They take an extra parameter, the address of a UV, &cp. On success, *cp will be set to the native code point value the sequence represents, and *advance will be set to its length, in bytes.

If the functions return false, *cp is set to the Unicode REPLACEMENT CHARACTER, and *advance is set so that s + *advance is the next position along s where the next possible UTF-8 character could begin. Failing to use this position as the next starting point during parsing of strings has led to successful attacks by crafted inputs.

The functions only examine as many bytes along s as are needed to form a complete UTF-8 representation of a single code point; they never examine the byte at e, or beyond. They return false (or die in the case of utf8_to_uv_or_die) if the code point requires more than e - s bytes to represent.
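The calling convention described above (true or false return, *cp, *advance, and never reading at or beyond e) can be modeled by a standalone sketch. It assumes an ASCII platform and handles only standard UTF-8 of up to four bytes; sketch_utf8_to_uv and SK_REPLACEMENT are hypothetical names, and the real functions accept more (for example Perl extended UTF-8) and raise warnings, which this sketch does not.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_REPLACEMENT 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/* Hypothetical model of the utf8_to_uv() interface.  On success, stores the
 * code point in *cp and its byte length in *advance.  On failure, stores
 * U+FFFD in *cp and sets *advance so that s + *advance is where the next
 * character could begin.  Never reads the byte at e or beyond. */
static bool
sketch_utf8_to_uv(const unsigned char *s, const unsigned char *e,
                  uint32_t *cp, size_t *advance)
{
    /* Smallest code point that legitimately needs each length */
    static const uint32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t unused, need, i;
    uint32_t value;

    if (advance == NULL) advance = &unused;   /* caller may pass NULL */
    *cp = SK_REPLACEMENT;

    if (s >= e) { *advance = 0; return false; }           /* empty input */

    if (s[0] < 0x80)                { need = 1; value = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; value = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; value = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; value = s[0] & 0x07; }
    else { *advance = 1; return false; }   /* stray continuation or bad start */

    for (i = 1; i < need; i++) {
        /* Stop at the buffer end or at the first non-continuation byte */
        if ((size_t) (e - s) <= i || (s[i] & 0xC0) != 0x80) {
            *advance = i;
            return false;
        }
        value = (value << 6) | (s[i] & 0x3F);
    }

    *advance = need;
    if (value < min_for_len[need] || value > 0x10FFFF)
        return false;                       /* overlong or out of range */

    *cp = value;
    return true;
}
```

Note that a failure caused by an unexpected byte leaves *advance pointing at that byte, so a parser that resumes at s + *advance does not skip over it.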

The functions differ only in what flavor of UTF-8 they accept. All reject syntactically invalid UTF-8.

  • strict_utf8_to_uv

    additionally rejects any UTF-8 that translates into a code point that isn't specified by Unicode to be freely exchangeable, namely the surrogate characters and non-character code points (besides non-Unicode code points, any above 0x10FFFF). It does not raise a warning when rejecting these.

  • c9strict_utf8_to_uv

    instead uses the exchangeable definition given by Unicode's Corrigendum #9, which accepts non-character code points while still rejecting surrogates. It does not raise a warning when rejecting these.

  • utf8_to_uv

  • utf8_to_uv_or_die

    accept all syntactically valid UTF-8, as extended by Perl to allow 64-bit code points to be encoded.

    extended_utf8_to_uv is merely a synonym for utf8_to_uv. Use this form to draw attention to the fact that it accepts any code point. But since Perl programs traditionally do this by default, plain utf8_to_uv is the form most often used.

Whenever syntactically invalid input is rejected, an explanatory warning message is raised, unless utf8 warnings (or the appropriate subcategory) are turned off. A given input sequence may contain multiple malformations, giving rise to multiple warnings, as the functions attempt to find and report on all malformations in a sequence. All the possible malformations are listed in "utf8_to_uv_msgs", with some examples of multiple ones for the same sequence. You can use that function or "utf8_to_uv_flags" to exert more control over the input that is considered acceptable, and the warnings that are raised.

Often, s is an arbitrarily long string containing the UTF-8 representations of many code points in a row, and these functions are called in the course of parsing s to find all those code points.

If your code doesn't know how to deal with illegal input, as would be typical of a low level routine, the loop could look like:

while (s < e) {
    Size_t advance;
    UV cp;
    (void) utf8_to_uv(s, e, &cp, &advance);
    <handle 'cp'>
    s += advance;
}

A REPLACEMENT CHARACTER will be inserted everywhere that malformed input occurs. Obviously, we aren't expecting such outcomes, but your code will be protected from attacks and many harmful effects that could otherwise occur.

If the situation is such that it would be a bug for the input to be invalid, a somewhat simpler loop suffices:

while (s < e) {
    Size_t advance;
    UV cp = utf8_to_uv_or_die(s, e, &advance);
    <handle 'cp'>
    s += advance;
}

This will throw an exception on invalid input, so your code doesn't have to concern itself with that possibility.

If you do have a plan for handling malformed input, you could instead write:

while (s < e) {
    Size_t advance;
    UV cp;

    if (UNLIKELY(! utf8_to_uv(s, e, &cp, &advance))) {
        <bail out or convert to handleable>
    }

    <handle 'cp'>

    s += advance;
}

You may pass NULL to these functions instead of a pointer to your advance variable. But the only legitimate reason to do this is if you are examining only the first character in s, and have no plans to ever look further. You could also advance by using UTF8SKIP, but this gives the correct result if and only if the input is well-formed, and the practice has led to successful attacks against such code. It is also always extra work, as the functions have already done the equivalent work and return the correct value in *advance, regardless of whether the input is well-formed or not.
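The hazard of advancing by the length implied by the first byte alone can be shown with a small standalone sketch. The two functions below are hypothetical; sketch_skip_by_lead only approximates what a lead-byte lookup such as UTF8SKIP does on an ASCII platform for standard UTF-8, and sketch_checked_advance stands in for the *advance value the decoding functions report.

```c
#include <stddef.h>

/* Length implied by the first byte alone, with no look at what follows */
static size_t
sketch_skip_by_lead(unsigned char lead)
{
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;      /* ASCII, stray continuation byte, or invalid start */
}

/* Number of bytes that actually belong to the sequence starting at s:
 * the start byte plus however many continuation bytes really follow,
 * never looking at e or beyond.  Requires s < e. */
static size_t
sketch_checked_advance(const unsigned char *s, const unsigned char *e)
{
    size_t need  = sketch_skip_by_lead(s[0]);
    size_t avail = (size_t) (e - s);
    size_t i = 1;

    while (i < need && i < avail && (s[i] & 0xC0) == 0x80)
        i++;
    return i;
}
```

For the malformed input 0xE2 'A' 'B', the lead byte claims three bytes, so skipping by it silently swallows 'A' and 'B'; for a two-byte buffer holding 0xE2 0x82 it would step past the end of the buffer. The checked advance yields 1 and 2 respectively.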

Except with utf8_to_uv_or_die, you must always pass a non-NULL pointer into which to store the (first) code point s represents. If you don't care about this value, you should be using one of the "isUTF8_CHAR" functions instead.

utf8_to_uvchr forms

These are the old form equivalents of utf8_to_uv (and its synonym, extended_utf8_to_uv). They are utf8_to_uvchr and utf8_to_uvchr_buf. There is no old form equivalent of either strict_utf8_to_uv or c9strict_utf8_to_uv.

utf8_to_uvchr is DEPRECATED. Do NOT use it; it is a security hole ready to bring destruction onto you and yours.

utf8_to_uvchr_buf is discouraged and may eventually become deprecated. It checks if the sequence of bytes starting at s form a complete, legal UTF-8 (or UTF-EBCDIC) sequence for a code point. If so, it returns the code point value the sequence represents, and *retlen will be set to its length, in bytes. Thus, the next possible character in s begins at s + *retlen.

The function only examines as many bytes along s as are needed to form a complete UTF-8 representation of a single code point, but it never examines the byte at e, or beyond.

If the sequence examined starting at s is not legal Perl extended UTF-8, the translation fails, and the resultant behavior unfortunately depends on if the warnings category "utf8" is enabled or not.

If 'utf8' warnings are disabled

The Unicode REPLACEMENT CHARACTER is silently returned, and *retlen is set (if retlen isn't NULL) so that (s + *retlen) is the next possible position in s that could begin a non-malformed character.

But note that it is ambiguous whether a REPLACEMENT CHARACTER was actually in the input, or if this function synthetically generated one. In the unlikely event that you care, you'd have to examine the input to disambiguate.

If 'utf8' warnings are enabled

A warning will be displayed, and 0 is returned and *retlen is set (if retlen isn't NULL) to -1.

But note that 0 may also be returned if *s is a legal NUL character. This means that you have to disambiguate a 0 return. You can do this by checking that the first byte of s is indeed a NUL; or by making sure to always pass a non-NULL retlen pointer, and by examining it.

Also note that should you wish to proceed with parsing s, you have no easy way of knowing where to start looking in it for the next possible character. It is important to look in the right place to prevent attacks on your code. It would be better to have instead called an equivalent function that provides this information; any of the utf8_to_uv series, or "utf8n_to_uvchr".

Because of these quirks, utf8_to_uvchr_buf is very difficult to use correctly and handle all cases. Generally, you need to bail out at the first failure it finds.

The deprecated utf8_to_uvchr behaves the same way as utf8_to_uvchr_buf for well-formed input, and for the malformations it is capable of finding, but doesn't find all of them, and it can read beyond the end of the input buffer, which is why it is deprecated.

The utf8_to_uv() family of functions is preferred because they make it easier to write code safe from attacks. You should be converting to them; this will result in simpler, more robust code.

These functions are extensions of "utf8_to_uv", where you need more control over what UTF-8 sequences are acceptable. These functions are unlikely to be needed except for specialized purposes.

utf8n_to_uvchr is more like an extension of utf8_to_uvchr_buf, but with fewer quirks, and a different method of specifying the bytes in s it is allowed to examine. It has a curlen parameter instead of an e parameter, so the furthest byte in s it can look at is s + curlen - 1. Its return value is, like utf8_to_uvchr_buf, ambiguous with respect to the NUL and REPLACEMENT characters, but the value of *retlen can be relied on (except with the UTF8_CHECK_ONLY flag described below) to know where the next possible character along s starts, removing that quirk. Hence, you always should use *retlen to determine where the next character in s starts.

These functions have an additional parameter, flags, besides the ones in utf8_to_uv and utf8_to_uvchr_buf, which can be used to broaden or restrict what is acceptable UTF-8. flags has the same meaning and behavior in both functions. When flags is 0, these functions accept any syntactically valid Perl-extended-UTF-8 sequence that doesn't overflow the platform's word size.

There are flags that apply to accepting particular sequences, and flags that apply to raising warnings about encountering sequences. Each type is independent of the other. You can reject and not warn; warn and still accept; or both reject and warn. Rejecting means that the sequence gets translated into the Unicode REPLACEMENT CHARACTER instead of what it was meant to represent.

Unless otherwise stated below, warnings are subject to the utf8 warnings category being on.

UTF8_CHECK_ONLY

This suppresses any warnings. And it changes what is stored into *retlen with the uvchr family of functions (for the worse). It is not likely to be of use to you. You can use UTF8_ALLOW_ANY (described below) to also turn off warnings, and that flag doesn't adversely affect *retlen.

This flag is ignored if UTF8_DIE_IF_MALFORMED is also set.

UTF8_FORCE_WARN_IF_MALFORMED

Normally, no warnings are generated if warnings are turned off lexically or globally, regardless of any flags to the contrary. But this flag effectively turns on warnings temporarily for the duration of this function's execution.

Do not use it lightly.

This flag is ignored if UTF8_CHECK_ONLY is also set.

UTF8_DISALLOW_SURROGATE
UTF8_WARN_SURROGATE

These reject and/or warn about UTF-8 sequences that represent surrogate characters. The warning categories utf8 and surrogate control if warnings are actually raised.

UTF8_DISALLOW_NONCHAR
UTF8_WARN_NONCHAR

These reject and/or warn about UTF-8 sequences that represent non-character code points. The warning categories utf8 and nonchar control if warnings are actually raised.

UTF8_DISALLOW_SUPER
UTF8_WARN_SUPER

These reject and/or warn about UTF-8 sequences that represent code points above 0x10FFFF. The warning categories utf8 and non_unicode control if warnings are actually raised.

UTF8_DISALLOW_ILLEGAL_INTERCHANGE
UTF8_WARN_ILLEGAL_INTERCHANGE

These are the same as having selected all three of the corresponding SURROGATE, NONCHAR and SUPER flags listed above.

None of these code points is considered to be safely and freely exchangeable between processes.

UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE
UTF8_WARN_ILLEGAL_C9_INTERCHANGE

These are the same as having selected both the corresponding SURROGATE and SUPER flags listed above.

Unicode issued Unicode Corrigendum #9 to allow non-character code points to be exchanged by processes aware of the possibility. (They are still discouraged, however.) For more discussion see "Noncharacter code points" in perlunicode.

UTF8_DISALLOW_PERL_EXTENDED
UTF8_WARN_PERL_EXTENDED

These reject and/or warn on encountering sequences that require Perl's extension to UTF-8 to represent them. These are all for code points above 0x10FFFF, so these sequences are a subset of the ones controlled by SUPER or either of the illegal interchange sets of flags. The warning categories utf8, non_unicode, and portable control if warnings are actually raised.

Perl predates Unicode, and earlier standards allowed for code points up through 0x7FFF_FFFF (2**31 - 1). Perl, of course, would like you to be able to represent in UTF-8 any code point available on the platform. To do so, some extension must be used to express them. Perl uses a natural extension to UTF-8 to represent the ones up to 2**36-1, and invented a further extension to represent even higher ones, so that any code point that fits in a 64-bit word can be represented. We lump both of these extensions together and refer to them as Perl extended UTF-8. There exist other extensions that people have invented, incompatible with Perl's.

On EBCDIC platforms starting in Perl v5.24, the Perl extension for representing extremely high code points kicks in at 0x3FFF_FFFF (2**30 -1), which is lower than on ASCII. Prior to that, code points 2**31 and higher were simply unrepresentable, and a different, incompatible method was used to represent code points between 2**30 and 2**31 - 1.

It is likely that programs written in something other than Perl would not be able to read files that contain these; nor would Perl understand files written by something that uses a different extension. Hence, you can specify that above-Unicode code points are generally accepted and/or warned about, but still exclude the ones that require this extension to represent.
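The 2**31 - 1 and 2**36 - 1 limits quoted above can be reproduced arithmetically. The following sketch assumes an ASCII platform and assumes that the "natural extension" simply continues the standard UTF-8 bit layout to a seven-byte sequence; the function names are hypothetical, and the further extension for still higher code points is not modeled.

```c
#include <stdint.h>

/* Payload bits carried by an n-byte sequence (1 <= n <= 7) laid out the
 * way standard UTF-8 is: the start byte has n one-bits, a zero bit, and
 * 7 - n payload bits; each of the n - 1 continuation bytes carries 6. */
static unsigned
sketch_payload_bits(unsigned n)
{
    if (n == 1)
        return 7;                       /* 0xxxxxxx */
    return (7 - n) + 6 * (n - 1);
}

/* Highest code point expressible in an n-byte sequence */
static uint64_t
sketch_max_code_point(unsigned n)
{
    return (UINT64_C(1) << sketch_payload_bits(n)) - 1;
}
```

Under these assumptions a six-byte sequence tops out at 0x7FFF_FFFF (2**31 - 1) and a seven-byte one at 2**36 - 1, which is why anything higher needs a different scheme.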

UTF8_ALLOW_ANY and kin

Other flags can be passed to allow, in a limited way, syntactic malformations and/or overflowing the number of bits available in a UV on the platform. The functions will not treat the relevant malformations as errors, hence will not raise any warnings for them. utf8_to_uv_msgs will return true.

However, all such malformations translate to the REPLACEMENT CHARACTER, regardless of any of the flags.

The only such flag that you would ever have any reason to use is UTF8_ALLOW_ANY which applies to any of the syntactic malformations and overflow, except for empty input. The other flags are analogous to ones in the _GOT_ bits list in "utf8_to_uv_msgs".

UTF8_DIE_IF_MALFORMED

If the function would otherwise return false, it instead croaks. The UTF8_FORCE_WARN_IF_MALFORMED flag is effectively turned on so that the cause of the croak is displayed.

These functions are extensions of "utf8_to_uv_flags" and "utf8n_to_uvchr". They are used for the highly specialized purpose of when the caller needs to know the exact malformations that were encountered and/or the diagnostics that would be raised.

They each take one or two extra parameters, pointers to where to store this information. The functions with _msgs in their names return both types, so take two extra parameters; those with _error return just the malformations, so take just one extra parameter. When the extra parameters are both 0, the functions behave identically to the function they extend.

When the errors parameter is not NULL, it should be the address of a U32 variable, into which the functions store a bitmap, described just below, with a bit set for each malformation the function found; 0 if none. The ALLOW-type flags are ignored when determining the content of this variable. That is, even if you "allow" a particular malformation, if it is encountered, the corresponding bit will be set to notify you that one was encountered. However, the bits for conditions that are accepted by default aren't set unless the flags passed to the function indicate that they should be rejected or warned about when encountering them. These are explicitly noted in the list below along with the controlling flags.

The bits returned in errors and their meanings are:

UTF8_GOT_CONTINUATION

The input sequence was malformed in that the first byte was a UTF-8 continuation byte.

UTF8_GOT_EMPTY

The input parameters indicated the length of s is 0. Technically, this is a coding error, not a malformation; you should check before calling these functions if there is actually anything to convert. But perl needs to be able to recover from bad input, and this is how it does it.

UTF8_GOT_LONG

The input sequence was malformed in that there is some other sequence that evaluates to the same code point, but that sequence is shorter than this one.

Until Unicode 3.1, it was legal for programs to accept this malformation, but it was discovered that this created security issues.
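As an illustration of what "overlong" means, the following standalone sketch compares the length of a sequence with the shortest length that could express the same code point. It assumes an ASCII platform, standard UTF-8 of one to four bytes, and input that is already known to have the right shape (a start byte followed by the right number of continuation bytes); the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Shortest standard UTF-8 length able to express cp (cp <= 0x10FFFF) */
static size_t
sketch_min_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* s points at a syntactically shaped sequence of len bytes (1..4).
 * Returns true if a shorter sequence evaluates to the same code point. */
static bool
sketch_is_overlong(const unsigned char *s, size_t len)
{
    /* Payload mask for the start byte of each length */
    static const unsigned char lead_mask[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    uint32_t value = s[0] & lead_mask[len];
    size_t i;

    for (i = 1; i < len; i++)
        value = (value << 6) | (s[i] & 0x3F);

    return len > sketch_min_len(value);
}
```

The three-byte sequence 0xE0 0x80 0xAF evaluates to 0x2F, the same as the single byte "/". A checker that looks for the one-byte form only would miss it, which is the kind of security issue the text refers to.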

UTF8_GOT_NONCHAR

The code point represented by the input UTF-8 sequence is for a Unicode non-character code point. This bit is set only if the input flags parameter contains either the UTF8_DISALLOW_NONCHAR or the UTF8_WARN_NONCHAR flags.

UTF8_GOT_NON_CONTINUATION

The input sequence was malformed in that a non-continuation-type byte was found in a position where only a continuation-type one should be. See also "UTF8_GOT_SHORT".

UTF8_GOT_OVERFLOW

The input sequence was malformed in that it is for a code point that is not representable in the number of bits available in an IV on the current platform.

UTF8_GOT_PERL_EXTENDED

The input sequence is not standard UTF-8, but a Perl extension. This bit is set only if the input flags parameter contains either the UTF8_DISALLOW_PERL_EXTENDED or the UTF8_WARN_PERL_EXTENDED flags.

UTF8_GOT_SHORT

The input sequence was malformed in that curlen is smaller than required for a complete sequence. In other words, the input is for a partial character sequence.

UTF8_GOT_SHORT and UTF8_GOT_NON_CONTINUATION both indicate a too short sequence. The difference is that UTF8_GOT_NON_CONTINUATION indicates always that there is an error, while UTF8_GOT_SHORT means that an incomplete sequence was looked at. If no other flags are present, it means that the sequence was valid as far as it went. Depending on the application, this could mean one of three things:

  • The e or curlen parameters passed in were too small, and the function was prevented from examining all the necessary bytes.

  • The buffer being looked at is based on reading data, and the data received so far stopped in the middle of a character, so that the next read will read the remainder of this character. (It is up to the caller to deal with the split bytes somehow.)

  • This is a real error, and the partial sequence is all we're going to get.
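For the second case, where a read stopped in the middle of a character, a caller typically has to carry the split bytes over to the next read. The following standalone sketch (hypothetical name, ASCII platform, standard UTF-8 only) reports how many bytes at the end of a buffer are the start of a character whose remaining bytes have not yet arrived.

```c
#include <stddef.h>

/* Returns the number of trailing bytes in buf[0 .. len-1] that begin a
 * not-yet-complete character, or 0 if the buffer ends on a character
 * boundary (or ends in something that is not a plausible partial
 * character, which the decoder should report as a real error). */
static size_t
sketch_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back;

    /* An incomplete character is at most 3 bytes of a 4-byte sequence */
    for (back = 1; back <= 3 && back <= len; back++) {
        unsigned char c = buf[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;                   /* continuation byte: keep looking */

        if (c < 0x80)                need = 1;
        else if ((c & 0xE0) == 0xC0) need = 2;
        else if ((c & 0xF0) == 0xE0) need = 3;
        else if ((c & 0xF8) == 0xF0) need = 4;
        else return 0;                  /* invalid start byte */

        /* 'back' bytes are present; is that fewer than the character needs? */
        return need > back ? back : 0;
    }
    return 0;
}
```

A caller could decode only the first len minus that many bytes, then move the tail to the front of the buffer before appending the next chunk of data.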

UTF8_GOT_SUPER

The input sequence was malformed in that it is for a non-Unicode code point; that is, one above the legal Unicode maximum. This bit is set only if the input flags parameter contains either the UTF8_DISALLOW_SUPER or the UTF8_WARN_SUPER flags.

UTF8_GOT_SURROGATE

The input sequence was malformed in that it is for a Unicode UTF-16 surrogate code point. This bit is set only if the input flags parameter contains either the UTF8_DISALLOW_SURROGATE or the UTF8_WARN_SURROGATE flags.

Note that more than one bit may have been set by these functions. This is because it is possible for multiple malformations to be present in the same sequence. An example would be an overlong sequence evaluating to a surrogate when surrogates are forbidden. Another example is overflow; standard UTF-8 never overflows, so something that does must have been expressed using Perl's extended UTF-8. It also is above all legal Unicode code points. So there will be a bit set for up to all three of these things. 1) Overflow always; 2) perl-extended if the calling flags indicate those should be rejected or warned about; and 3) above-Unicode, provided the calling flags indicate those should be rejected or warned about.

If you don't care about the system's message text or warning categories, you can customize error handling by calling one of the _error functions, using either of the flags UTF8_ALLOW_ANY or UTF8_CHECK_ONLY to suppress any warnings, and then examine the *errors return. If you don't use those flags, warnings will be raised as usual.

But if you do care, instead use one of the functions with _msgs in their names. These allow you to completely customize error handling by suppressing any warnings that would otherwise be raised; instead returning all relevant information in a structure specified by an extra parameter, msgs, a pointer to a variable which has been declared to be an AV*, and into which the function creates a new AV to store information, described below, about all the malformations that were encountered.

When this parameter is non-NULL, the UTF8_DIE_IF_MALFORMED and UTF8_FORCE_WARN_IF_MALFORMED flags are asserted against in DEBUGGING builds, and are ignored in non-DEBUGGING ones. The UTF8_CHECK_ONLY flag is always ignored.

What is considered a malformation is affected by flags, the same as described in "utf8_to_uv_flags". No array element is generated for malformations that are "allowed" by the input flags, in contrast to the bitmap returned in a non-NULL *errors.

Each element of the msgs AV array is an anonymous hash with the following three key-value pairs:

text

A SVpv containing the text of the message about the problematic input. This text is identical to any warning that otherwise would have been raised if the appropriate warning categories were enabled.

warn_categories

This is 0 if the flags parameter to the function would ordinarily not have caused the message to be output as a warning; otherwise it is the warning category (or categories) that would have been used to generate a warning for text, packed into a SVuv. For example, if flags contains UTF8_DISALLOW_SURROGATE, but not UTF8_WARN_SURROGATE, this would be 0 if the input was a surrogate.

flag

A SVuv containing a single flag bit associated with this message. The bit corresponds to some bit in the *errors return value, such as UTF8_GOT_LONG.

The array is sorted so that element [0] contains the first message that would have otherwise been raised; [1], the second; and so on.

You thus can completely override the normal error handling; you can check the lexical warnings state (or not) when choosing what to do with the returned messages.

The caller, of course, is responsible for freeing any returned AV.

Returns the number of characters in the sequence of UTF-8-encoded bytes starting at s and ending at the byte just before e. If s and e point to the same place, it returns 0 with no warning raised.

If e < s or if the scan would end up past e, it raises a UTF8 warning and returns the number of valid characters.
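The counting behavior just described can be modeled with a standalone sketch. It assumes an ASCII platform and standard UTF-8, steps by the length implied by each start byte, and reports the problem case through a hypothetical *malformed output parameter instead of raising a warning; sketch_utf8_length is not Perl's implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Counts the characters in [s, e).  If e < s, or if the last character
 * would extend past e, sets *malformed and returns the number of complete
 * characters counted so far. */
static size_t
sketch_utf8_length(const unsigned char *s, const unsigned char *e,
                   bool *malformed)
{
    size_t count = 0;

    *malformed = false;
    if (e < s) {
        *malformed = true;
        return 0;
    }

    while (s < e) {
        size_t skip = 1;                /* ASCII or unexpected byte */

        if ((*s & 0xE0) == 0xC0)      skip = 2;
        else if ((*s & 0xF0) == 0xE0) skip = 3;
        else if ((*s & 0xF8) == 0xF0) skip = 4;

        if ((size_t) (e - s) < skip) {  /* would scan past e */
            *malformed = true;
            break;
        }
        s += skip;
        count++;
    }
    return count;
}
```

The sketch only counts; it does not check that the bytes following each start byte are continuation bytes.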

Compares the sequence of characters (stored as octets) in b, blen with the sequence of characters (stored as UTF-8) in u, ulen. Returns 0 if they are equal, -1 or -2 if the first string is less than the second string, +1 or +2 if the first string is greater than the second string.

-1 or +1 is returned if the shorter string was identical to the start of the longer string. -2 or +2 is returned if there was a difference between characters within the strings.
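The five possible return values can be illustrated with a standalone sketch. It assumes an ASCII platform, where each octet in b is a code point 0 .. 255 and only the UTF-8 start bytes 0xC2 and 0xC3 encode code points in the range 0x80 .. 0xFF. sketch_bytes_cmp_utf8 is a hypothetical name; unlike the real function, it raises no warning on malformed UTF-8 and simply treats it like a character above 0xFF.

```c
#include <stddef.h>

/* Compares the octets in b (length blen) with the UTF-8 string in u
 * (length ulen).  Returns 0 if equal; -1/+1 if one string is a proper
 * prefix of the other; -2/+2 if some character differs. */
static int
sketch_bytes_cmp_utf8(const unsigned char *b, size_t blen,
                      const unsigned char *u, size_t ulen)
{
    const unsigned char *bend = b + blen;
    const unsigned char *uend = u + ulen;

    while (b < bend && u < uend) {
        unsigned c = *u++;

        if (c >= 0x80) {
            if ((c == 0xC2 || c == 0xC3)
                && u < uend && (*u & 0xC0) == 0x80)
            {
                /* Two-byte sequence for a code point 0x80..0xFF */
                c = ((c & 0x1F) << 6) | (*u++ & 0x3F);
            }
            else {
                /* Above 0xFF (or malformed): any octet sorts lower */
                return -2;
            }
        }

        if (*b != c)
            return *b < c ? -2 : 2;
        b++;
    }

    if (b == bend && u == uend)
        return 0;
    return b < bend ? 1 : -1;   /* the longer string sorts higher */
}
```

For instance, the single octet 0xE9 compares equal to the UTF-8 bytes 0xC3 0xA9, since both represent U+00E9.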

These each convert a string encoded as UTF-8 into the equivalent native byte representation, if possible. The first three forms are preferred; their API is more convenient to use, and each returns true if the result is in bytes; false if the conversion failed.

  • utf8_to_bytes_overwrite

  • utf8_to_bytes_new_pv

  • utf8_to_bytes_temp_pv

    These differ primarily in the form of the returned string and the allowed constness of the input string. In each, if the input string was already in native bytes or was not convertible, the input isn't changed.

    In each of these three functions, the input s_ptr is a pointer to the string to be converted and *lenp is its length (so that the first byte will be at (*s_ptr)[0]).

    utf8_to_bytes_overwrite overwrites the input string with the bytes conversion. Hence, the input string should not be const. (Converting the multi-byte UTF-8 encoding to single bytes never expands the result, so overwriting is always feasible.)

    Both utf8_to_bytes_new_pv and utf8_to_bytes_temp_pv allocate new memory to hold the converted string, never changing the input. Hence the input string may be const. They differ in that utf8_to_bytes_temp_pv arranges for the new memory to automatically be freed. With utf8_to_bytes_new_pv, the caller is responsible for freeing the memory. As explained below, not all successful calls result in new memory being allocated. Hence this function also returns to the caller (via an extra parameter, *free_me) a pointer to any new memory, or NULL if none was allocated.

    The functions return false when the input is not well-formed UTF-8 or contains at least one UTF-8 sequence that represents a code point that can't be expressed as a byte. The contents of *s_ptr and *lenp are not changed. utf8_to_bytes_new_pv sets *free_me to NULL.

    They all return true when either:

    The input turned out to already be in bytes form

    The contents of *s_ptr and *lenp are not changed. utf8_to_bytes_new_pv sets *free_me to NULL.

    The input was successfully converted

    For utf8_to_bytes_overwrite

    The input string *s_ptr was overwritten with the native bytes, including a NUL terminator. *lenp has been updated with the new length.

    For utf8_to_bytes_new_pv and utf8_to_bytes_temp_pv

    The input string was not changed. Instead, new memory has been allocated containing the translation of the input into native bytes, with a NUL terminator byte. *s_ptr now points to that new memory, and *lenp contains its length.

    For utf8_to_bytes_temp_pv, the new memory has been arranged to be automatically freed, via a call to "SAVEFREEPV".

    For utf8_to_bytes_new_pv, *free_me has been set to *s_ptr, and it is the caller's responsibility to free the new memory when done using it. The following paradigm is convenient to use for this:

    void * free_me;
    if (utf8_to_bytes_new_pv(&s, &len, &free_me)) {
       ...
    }
    else {
       ...
    }
    
    ...
    
    Safefree(free_me);

    free_me can be used as a boolean (non-NULL meaning true) to indicate that the input was indeed changed if you need to revisit that later in the code. Your design is likely flawed if you find yourself using free_me for any other purpose.

    Note that in all cases, *s_ptr and *lenp will have correct and consistent values, updated as was necessary.

    Also note that upon successful conversion, the number of variants in the string can be computed by having saved the value of *lenp before the call, and subtracting the after-call value of *lenp from it. This is also true for the other two functions described below.

  • utf8_to_bytes

    Plain utf8_to_bytes (which has never lost its experimental status) also converts a UTF-8 encoded string to bytes, but there are more glitches that the caller has to be prepared to handle.

    The input string is passed as s, with one less level of indirection than in the three functions above.

    If the conversion was a no-op

    The contents of s and *lenp are not changed, and the function returns s.

    If the conversion was successful

    The contents of s were changed, and *lenp updated to be the correct length. The function returns s (unchanged).

    If the conversion failed

    The contents of s were not changed.

    The function returns NULL and sets *lenp to -1, cast to STRLEN. This means that you will have to use a temporary containing the string length to pass to the function if you will need the value afterwards.

  • bytes_from_utf8

    bytes_from_utf8 also converts a potentially UTF-8 encoded string s to bytes. It preserves s, allocating new memory for the converted string.

    In contrast to the other functions, the input string to this one need not be UTF-8. If not, the caller has set *is_utf8p to be false, and the function does nothing, returning the original s.

    The function also does nothing, returning the original s, if the string contains code points that are not expressible in the native byte encoding.

    Otherwise, *is_utf8p is set to 0, and the return value is a pointer to a newly created string containing the native byte equivalent of s, and whose length is returned in *lenp, updated. The new string is NUL-terminated. The caller is responsible for arranging for the memory used by this string to get freed.

    The major problem with this function is that memory is allocated and filled even when the input string was already in bytes form.

New code should use the first three functions listed above.
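The contract of the overwriting form can be sketched in standalone C. This is an illustration only, not Perl's implementation: the name sketch_utf8_to_bytes_overwrite is hypothetical, an ASCII platform is assumed, and the caller is assumed to have room for a NUL at the end of the buffer. It shows the three outcomes described above: failure leaves everything untouched, an input already in bytes form is left alone, and a successful conversion shrinks *lenp by the number of variants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch of the "overwrite" contract; NOT Perl's implementation.
 * Assumes an ASCII platform and a buffer with room for a trailing NUL. */
static bool sketch_utf8_to_bytes_overwrite(unsigned char **s_ptr, size_t *lenp)
{
    unsigned char *s, *d, *end;
    size_t variants = 0;

    assert(s_ptr != NULL && *s_ptr != NULL && lenp != NULL);
    s = *s_ptr;
    end = s + *lenp;

    /* Pass 1: check that every character fits in one byte; modify nothing. */
    for (const unsigned char *p = s; p < end; p++) {
        if (*p < 0x80)
            continue;
        if ((*p != 0xC2 && *p != 0xC3) || p + 1 >= end || (p[1] & 0xC0) != 0x80)
            return false;          /* input and *lenp are left unchanged */
        variants++;
        p++;                       /* skip the continuation byte */
    }
    if (variants == 0)
        return true;               /* already in bytes form: nothing touched */

    /* Pass 2: collapse each two-byte sequence. The write position never
     * gets ahead of the read position, so overwriting in place is safe. */
    for (d = s; s < end; s++) {
        if (*s < 0x80)
            *d++ = *s;
        else {
            *d++ = (unsigned char)(((*s & 0x03) << 6) | (s[1] & 0x3F));
            s++;
        }
    }
    *d = '\0';
    *lenp = (size_t)(d - *s_ptr);
    return true;
}
```

Saving *lenp before the call and subtracting the value afterwards gives the number of variants, since each variant occupies two bytes in UTF-8 and one byte after conversion.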

These each convert a string s of length *lenp bytes from the native encoding into UTF-8 (UTF-EBCDIC on EBCDIC platforms), returning a pointer to the UTF-8 string, and setting *lenp to its length in bytes.

bytes_to_utf8 always allocates new memory for the result, making sure it is NUL-terminated.

bytes_to_utf8_free_me simply returns a pointer to the input string if the string's UTF-8 representation is the same as its native representation. Otherwise, it behaves like bytes_to_utf8, returning a pointer to new memory containing the conversion of the input. In other words, it returns the input string if converting the string would be a no-op. Note that when no new string is allocated, the function can't add a NUL to the original string if one wasn't already there.

In both cases, the caller is responsible for arranging for any new memory to get freed.

bytes_to_utf8_temp_pv simply returns a pointer to the input string if the string's UTF-8 representation is the same as its native representation, thus behaving like bytes_to_utf8_free_me in this situation. Otherwise, it behaves like bytes_to_utf8, returning a pointer to new memory containing the conversion of the input. The difference is that it also arranges for the new memory to automatically be freed by calling "SAVEFREEPV" on it.

bytes_to_utf8_free_me takes an extra parameter, free_me, to communicate to the caller whether memory was allocated. If that parameter is NULL, bytes_to_utf8_free_me acts identically to bytes_to_utf8, always allocating new memory.

But when it is a non-NULL pointer, bytes_to_utf8_free_me stores into it either NULL, if no memory was allocated, or a pointer to the new memory. This allows the following convenient paradigm:

void * free_me;
U8 * converted = bytes_to_utf8_free_me(string, &len, &free_me);

...

Safefree(free_me);

You don't have to know if memory was allocated or not. Just call Safefree unconditionally. free_me will contain a suitable value to pass to Safefree for it to do the right thing, regardless. Your design is likely flawed if you find yourself using free_me for anything other than passing to Safefree.

Upon return, the number of variants in the string can be computed by having saved the value of *lenp before the call, and subtracting the after-call value of *lenp from it.
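The free_me convention can be sketched outside of Perl as follows. This is an illustration under stated assumptions, not Perl's implementation: the name sketch_bytes_to_utf8_free_me is hypothetical, an ASCII platform is assumed (native bytes are Latin-1), malloc and free stand in for Perl's allocator, allocation failure simply returns NULL, and for brevity free_me must be non-NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch of the free_me contract; NOT Perl's implementation.
 * Assumes Latin-1 native bytes and uses malloc/free as the allocator. */
static unsigned char *sketch_bytes_to_utf8_free_me(const unsigned char *s,
                                                   size_t *lenp,
                                                   void **free_me)
{
    size_t variants = 0;
    unsigned char *out, *d;

    assert(s != NULL && lenp != NULL && free_me != NULL);

    /* A variant is a byte whose UTF-8 form differs from its native form. */
    for (size_t i = 0; i < *lenp; i++)
        variants += (s[i] >= 0x80);

    *free_me = NULL;
    if (variants == 0)
        return (unsigned char *)s;  /* no-op: hand back the input itself */

    out = d = malloc(*lenp + variants + 1);
    if (out == NULL)
        return NULL;                /* sketch only; *free_me stays NULL */

    for (size_t i = 0; i < *lenp; i++) {
        if (s[i] < 0x80)
            *d++ = s[i];
        else {                      /* each variant becomes two bytes */
            *d++ = (unsigned char)(0xC0 | (s[i] >> 6));
            *d++ = (unsigned char)(0x80 | (s[i] & 0x3F));
        }
    }
    *d = '\0';
    *lenp = (size_t)(d - out);
    *free_me = out;
    return out;
}
```

Because free(NULL) is defined to do nothing, the caller of this sketch can pass free_me to free unconditionally, which mirrors the unconditional Safefree in the paradigm above.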

If you want to convert to UTF-8 from encodings other than the native (Latin1 or EBCDIC), see "sv_recode_to_utf8"().

Build to the scalar dsv a displayable version of the UTF-8 encoded string spv, length len, the displayable version being at most pvlim bytes long (if longer, the rest is truncated and "..." will be appended).

The flags argument can have any combination of these flag bits

UNI_DISPLAY_ISPRINT

to display isPRINT()able characters as themselves

UNI_DISPLAY_BACKSLASH

to display the characters \n, \r, \f, \t, \a, and \\ as their backslashed versions (like "\n")

(UNI_DISPLAY_BACKSLASH is preferred over UNI_DISPLAY_ISPRINT for "\\").

UNI_DISPLAY_BACKSPACE

to display \b for a backspace, but only when UNI_DISPLAY_BACKSLASH also is set.

UNI_DISPLAY_REGEX

This is a shorthand for UNI_DISPLAY_ISPRINT along with UNI_DISPLAY_BACKSLASH.

UNI_DISPLAY_QQ

This is a shorthand for all three of UNI_DISPLAY_ISPRINT, UNI_DISPLAY_BACKSLASH, and UNI_DISPLAY_BACKSPACE.

The pointer to the PV of the dsv is returned.

See also "sv_uni_display".
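How the flag bits interact can be sketched for a single code point. Everything here is hypothetical and for illustration only: the SK_* flag values are invented for the sketch and are not taken from Perl's headers, isprint() in the C locale stands in for isPRINT(), and the "\x{...}" form used for characters that are not otherwise displayed is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag values for this sketch only; NOT Perl's definitions. */
enum {
    SK_ISPRINT   = 1,
    SK_BACKSLASH = 2,
    SK_BACKSPACE = 4
};

/* Append a displayable form of one code point to the NUL-terminated
 * string in out (total capacity outsz). Not Perl's implementation. */
static void sketch_display_cp(char *out, size_t outsz,
                              unsigned long cp, unsigned flags)
{
    size_t used = strlen(out);
    char mnemonic = 0;

    assert(used < outsz);

    /* The backslash forms are checked first, so they win over ISPRINT
     * for a character such as '\\' that is also printable. */
    if (flags & SK_BACKSLASH) {
        switch (cp) {
        case '\n': mnemonic = 'n';  break;
        case '\r': mnemonic = 'r';  break;
        case '\f': mnemonic = 'f';  break;
        case '\t': mnemonic = 't';  break;
        case '\a': mnemonic = 'a';  break;
        case '\\': mnemonic = '\\'; break;
        case '\b': if (flags & SK_BACKSPACE) mnemonic = 'b'; break;
        }
    }

    if (mnemonic)
        snprintf(out + used, outsz - used, "\\%c", mnemonic);
    else if ((flags & SK_ISPRINT) && cp < 0x80 && isprint((int)cp))
        snprintf(out + used, outsz - used, "%c", (char)cp);
    else    /* assumed fallback form for this sketch */
        snprintf(out + used, outsz - used, "\\x{%lx}", cp);
}
```

Calling the helper once per code point of a decoded string builds up the displayable version; the pvlim truncation described above is left out of the sketch.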

Build to the scalar dsv a displayable version of the scalar sv, the displayable version being at most pvlim bytes long (if longer, the rest is truncated and "..." will be appended).

The flags argument is as in "pv_uni_display"().

The pointer to the PV of the dsv is returned.

Returns true if the leading portions of the strings s1 and s2 (either or both of which may be in UTF-8) are the same case-insensitively; false otherwise. How far into the strings to compare is determined by other input parameters.

If u1 is true, the string s1 is assumed to be in UTF-8-encoded Unicode; otherwise it is assumed to be in native 8-bit encoding. Correspondingly for u2 with respect to s2.

If the byte length l1 is non-zero, it says how far into s1 to check for fold equality. In other words, s1+l1 will be used as a goal to reach. The scan will not be considered to be a match unless the goal is reached, and scanning won't continue past that goal. Correspondingly for l2 with respect to s2.

If pe1 is non-NULL and the pointer it points to is not NULL, that pointer is considered an end pointer to the position 1 byte past the maximum point in s1 beyond which scanning will not continue under any circumstances. (This routine assumes that UTF-8 encoded input strings are not malformed; malformed input can cause it to read past pe1). This means that if both l1 and pe1 are specified, and pe1 is less than s1+l1, the match will never be successful because it can never get as far as its goal (and in fact is asserted against). Correspondingly for pe2 with respect to s2.

At least one of s1 and s2 must have a goal (at least one of l1 and l2 must be non-zero), and if both do, both have to be reached for a successful match. Also, if the fold of a character is multiple characters, all of them must be matched (see tr21 reference below for 'folding').

Upon a successful match, if pe1 is non-NULL, it will be set to point to the beginning of the next character of s1 beyond what was matched. Correspondingly for pe2 and s2.

For case-insensitiveness, the "casefolding" of Unicode is used instead of upper/lowercasing both the characters, see https://www.unicode.org/reports/tr21/ (Case Mappings).
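The interplay of goals (l1, l2) and end pointers (pe1, pe2) can be sketched with an ASCII-only stand-in. This is not Perl's implementation: the name sketch_fold_eq is hypothetical, tolower() replaces Unicode case folding (so every character is one byte and folds to exactly one character), the u1 and u2 parameters are dropped, and a goal that lies beyond its end pointer makes the sketch return false rather than trip an assertion.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, ASCII-only sketch of the goal / end-pointer rules;
 * NOT Perl's implementation. */
static bool sketch_fold_eq(const char *s1, const char **pe1, size_t l1,
                           const char *s2, const char **pe2, size_t l2)
{
    const char *g1 = l1 ? s1 + l1 : NULL;            /* goals */
    const char *g2 = l2 ? s2 + l2 : NULL;
    const char *e1 = (pe1 && *pe1) ? *pe1 : NULL;    /* hard limits */
    const char *e2 = (pe2 && *pe2) ? *pe2 : NULL;
    const char *p1 = s1, *p2 = s2;

    assert(g1 || g2);              /* at least one goal is required */

    /* A goal beyond its hard limit can never be reached. */
    if ((g1 && e1 && e1 < g1) || (g2 && e2 && e2 < g2))
        return false;

    /* Each scan stops at its goal if it has one, else at its hard limit. */
    if (g1) e1 = g1;
    if (g2) e2 = g2;

    while ((!e1 || p1 < e1) && (!e2 || p2 < e2)) {
        if (tolower((unsigned char)*p1) != tolower((unsigned char)*p2))
            return false;
        p1++;
        p2++;
    }

    /* Every goal that was given must have been reached. */
    if ((g1 && p1 != g1) || (g2 && p2 != g2))
        return false;

    /* Report how far the match extended in each string. */
    if (pe1) *pe1 = p1;
    if (pe2) *pe2 = p2;
    return true;
}
```

For instance, giving only s1 a goal of 3 bytes and passing a pe1 whose pointed-to value is NULL compares the first three characters and, on success, leaves *pe1 pointing just past them.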