TITLE
Parrot Strings
The Parrot String API
This document describes how Parrot abstracts the programmer's interface to string types. All strings used in the Parrot core should use the Parrot STRING structure; Parrot programmers should not deal with char * or other string-like types outside of this abstraction without very good reason.
Interface functions on STRINGs
In fact, programmers should hardly ever even access members of the STRING structure directly. The reason for this is that the interpretation of the data inside the structure will be a function of the data's encoding. The idea is that Parrot's strings are encoding-aware so your functions don't need to be; if you break the abstraction, you suddenly have to start worrying about what the data actually means.
String Constructors
The most basic way of creating a string is through the function string_make:
STRING* string_make(struct Parrot_Interp *, const void *buffer, INTVAL buflen, INTVAL encoding, INTVAL flags, INTVAL type)
Here you pass a pointer to a buffer in a given encoding, the number of bytes in that buffer to examine, the encoding (see below for the enum which defines the different encodings), and the initial values of the flags and type fields. These should usually be zero. In return, you'll get a brand new Parrot string. This string will have its own private copy of the buffer, so you don't need to keep the original.
Hint: Nothing stops you doing
string_make(interpreter, NULL, 0, ...
If you already have a string, you can make a copy of it by calling
STRING* string_copy(struct Parrot_Interp *, STRING* s)
This is itself implemented in terms of string_make.
String Manipulation Functions
Unless otherwise stated, all lengths, offsets, and so on, are given in characters; you are not allowed to care about the byte representation of a string, so it doesn't make sense to give the values in bytes.
To find out the length of a string, use
INTVAL string_length(const STRING *s)
You may explicitly use s->strlen for this since it is such a useful operation.
To concatenate two strings - that is, to add the contents of string b to the end of string a, use:
STRING* string_concat(struct Parrot_Interp *, STRING* a, STRING *b, INTVAL flag)
a is updated, and is also returned as a convenience. If the flag is set to a non-zero value, then b will be transcoded to a's encoding before concatenation if the strings are of different encodings. You almost certainly don't want to stick, say, a UTF-32 string on the end of a Big-5 string.
To repeat a string (i.e., turn 'xyz' into 'xyzxyzxyz'), use:
STRING* string_repeat(struct Parrot_Interp *, const STRING* s, UINTVAL n, STRING** d)
This repeats string s n times and stores the result in d, which it also returns. If d or *d is NULL, a new string will be allocated to hold the result. s is not modified by this operation. If d is not of the same type as s, it will be upgraded appropriately.
Chopping n characters off the end of a string is achieved with the unlikely-sounding
STRING* string_chopn(STRING* s, INTVAL n)
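Because lengths are counted in characters rather than bytes, chopping is encoding-dependent. As a rough illustration only (this is not Parrot's code), the sketch below shows what chopping could look like for a well-formed UTF-8 buffer. The name sketch_utf8_chopn and its signature are hypothetical; it returns the number of bytes left once the last n characters are removed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Parrot: given well-formed UTF-8 in
 * buf[0..bufused), return how many bytes remain after removing the last
 * n characters. Chopping more characters than exist leaves 0 bytes. */
size_t sketch_utf8_chopn(const unsigned char *buf, size_t bufused, size_t n)
{
    assert(buf != NULL || bufused == 0);
    while (n > 0 && bufused > 0) {
        bufused--;                       /* step onto the last byte */
        /* Continuation bytes look like 10xxxxxx; walk back to the lead byte. */
        while (bufused > 0 && (buf[bufused] & 0xC0) == 0x80)
            bufused--;
        n--;
    }
    return bufused;
}
```

For the five bytes 'c', 'a', 'f', 0xC3, 0xA9 (the UTF-8 form of a four-character word), chopping one character leaves three bytes, not four.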
To retrieve a substring of the string, call
STRING* string_substr(struct Parrot_Interp *, STRING* src, INTVAL offset, INTVAL length, STRING** dest)
The result will be placed in dest. (Passing in dest avoids allocating a new string at runtime. If *dest is a null pointer, a new string structure is created with the same encoding as src.)
To retrieve a single character of the string, call
INTVAL string_ord(const STRING* s, INTVAL n)
The result will be returned from the function. It checks for the existence of s, and tests for n being out of range. Currently it applies the method that Perl uses on arrays to handle negative indices. That is to say, negative values count backwards from the end of the string. For example, index -1 is the last character in the string, -2 is the next-to-last, and so on.
If s is null or s is zero-length, it throws an exception. If n is out of range, it also throws an exception.
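The negative-index rule amounts to a small piece of arithmetic, sketched here in C. This is an illustration, not Parrot's implementation: sketch_normalize_index is a hypothetical helper, long stands in for INTVAL, and a return value of -1 stands in for the exception.

```c
#include <assert.h>

/* Hypothetical helper, not part of Parrot: map an index n, which may be
 * negative, onto the range 0..len-1. Returns -1 in the cases where the
 * text says an exception is thrown (empty string or index out of range). */
long sketch_normalize_index(long len, long n)
{
    assert(len >= 0);          /* a length is never negative */
    if (len == 0)
        return -1;             /* zero-length string */
    if (n < 0)
        n += len;              /* -1 becomes len-1, -2 becomes len-2, ... */
    if (n < 0 || n >= len)
        return -1;             /* still out of range */
    return n;
}
```

With a five-character string, an index of -1 maps to 4 (the last character) and -5 maps to 0; -6 and 5 are both out of range.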
To compare two strings, use:
INTVAL string_compare(struct Parrot_Interp *, STRING* s1, STRING* s2)
The value returned will be less than, equal to, or greater than zero depending on whether s1 is less than, equal to, or greater than s2.
Strings whose encodings are not the same can be compared - in this case a UTF-32 copy will be made of each string and these copies will be compared.
To test a string for truth, use:
BOOLVAL string_bool(STRING* s);
A string is false if it
o is not yet allocated
o has zero length
o consists of one digit character whose numeric value (as decided by its character type) is zero.
Otherwise the string will be true.
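The three rules above can be sketched as follows. This is a simplified stand-in, not the real string_bool: it assumes a one-byte-per-character buffer in which ASCII '0' is the only digit with value zero, and both struct sketch_bytes and sketch_string_bool are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified string: one byte per character. */
struct sketch_bytes {
    const char *bufstart;
    size_t len;                /* length in characters (== bytes here) */
};

/* Returns 0 for false and 1 for true, following the three rules. */
int sketch_string_bool(const struct sketch_bytes *s)
{
    if (s == NULL)
        return 0;                          /* not yet allocated */
    if (s->len == 0)
        return 0;                          /* zero length */
    assert(s->bufstart != NULL);           /* non-empty implies a buffer */
    if (s->len == 1 && s->bufstart[0] == '0')
        return 0;                          /* one digit whose value is zero */
    return 1;                              /* everything else is true */
}
```

Note that under these rules "00" and "0.0" are both true, since neither consists of exactly one digit character.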
To format output into a string, use
STRING* string_nprintf(struct Parrot_Interp *, STRING* dest, INTVAL len, char* format, ...)
dest may be a null pointer, in which case a new native string will be created. If len is zero, the behaviour becomes more sprintf-ish than snprintf-like.
Notes for Implementors
Termination
The character buffer pointed to by bufstart is not expected to be terminated by a nul byte, and functions which provide the string API will not add one. Any functions which access the buffer directly and which require a terminating nul byte must place one there themselves and also be very careful about nul bytes within the used portion of the character buffer. In particular, if bufused == buflen, more space must be allocated to hold a terminating byte.
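One way to respect both constraints is to copy the used bytes into a fresh buffer that has room for the terminator, rather than writing into the string's own buffer. The sketch below assumes a byte-oriented encoding (in UTF-16 or UTF-32 data, zero bytes are normal) and uses a hypothetical helper name, sketch_to_cstring; it is not a Parrot function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Parrot: return a malloc'ed,
 * nul-terminated copy of the first bufused bytes at bufstart. Returns
 * NULL if allocation fails or if the data contains an embedded nul byte
 * (which would silently truncate the C string). The caller frees it. */
char *sketch_to_cstring(const void *bufstart, size_t bufused)
{
    char *copy;

    assert(bufstart != NULL || bufused == 0);
    if (bufused > 0 && memchr(bufstart, '\0', bufused) != NULL)
        return NULL;                     /* nul inside the used portion */
    copy = malloc(bufused + 1);          /* one extra byte for the nul */
    if (copy == NULL)
        return NULL;
    if (bufused > 0)
        memcpy(copy, bufstart, bufused);
    copy[bufused] = '\0';
    return copy;
}
```

Allocating bufused + 1 bytes sidesteps the bufused == buflen case entirely, at the cost of a copy.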
Elements of the STRING structure
Those implementing the STRING API will obviously need to know about how the STRING structure works. You can find the definition of this structure in string.h:
struct parrot_string_t {
    void *bufstart;
    UINTVAL buflen;
    UINTVAL flags;
    UINTVAL bufused;
    void *strstart;
    UINTVAL strlen;
    const ENCODING *encoding;
    const CHARTYPE *type;
    INTVAL language;
};
Let's look at each element of this structure in turn.
bufstart
This pointer points to the buffer which holds the string, encoded in whatever is the string's specified encoding. Because of this, you should not make any assumptions about what's in the buffer, and hence you shouldn't try and access it directly.
buflen
This is used for memory allocation; it tells you the currently allocated size of the buffer in bytes.
flags
This is a general holding area for string flags. The exact flags required have not yet been determined.
bufused
bufused, on the other hand, contains the number of bytes out of the allocated buffer which are actually in use. This, together with buflen, is used by the buffer growing algorithm to determine when and by how much to grow the allocation buffer.
strstart
This stores the actual start of the string. In the case of copy-on-write (COW) strings holding references to portions of a larger string (for example, in regex match variables), this is a pointer into the larger string's buffer, at the point where this string starts.
strlen
This is the length of the string in characters, as you would expect to find from length $string in Perl. Again, because string buffers may be in one of a number of encodings, this must be computed by the appropriate encoding function. string_compute_strlen(STRING) updates this value, calling the compute_strlen function in the STRING's vtable.
encoding
This is a vtable of functions; the vtable should normally be taken from the array Parrot_string_vtable. Entries in this array specify the encoding of the string, from the following enum:
enum {
    enc_native,
    enc_utf8,
    enc_utf16,
    enc_utf32,
    enc_foreign,
    enc_max
};
The "native" string type is whatever happens when you set LANG=C in your shell; it's usually ISO-8859-1 on English-language machines. A character equals a byte equals eight bits. No shifts, no wide characters, nothing.
UTF8, UTF16, and UTF32 are what they sound like. UTF16 and UTF32 should use the native endianness of the machine.
enc_foreign is there to allow for expansion; foreign strings will call functions from a user-defined string vtable instead of the Perl built-in ones.
enc_max isn't an encoding. These aren't the droids you're looking for. It's just there to help know how big to make arrays.
type
XXX I don't know what this is for.
language
This field is currently unused; however, it can be used to hold a pointer to the correct vtable for foreign strings.
String Vtable Functions
The "String Manipulation Functions" above are implemented in terms of string vtables to create encoding abstraction; here's an example of one:
STRING*
string_concat(struct Parrot_Interp *interpreter, STRING* a, STRING* b, INTVAL flags) {
    return (ENC_VTABLE(a).concat)(a, b, flags);
}
ENC_VTABLE(a) is shorthand for:
a->encoding
Vtables are taken from the Parrot_string_vtable array, defined in string.c. Each encoding has its own vtable; to call the concatenation function for a, we look up its vtable and retrieve the concat entry from that vtable. This produces a function pointer we can throw the arguments at.
To get the actual position in the array from the vtable, use the which entry, which returns an INTVAL index into Parrot_string_vtable.
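The dispatch pattern described above can be tried out in isolation with a cut-down model. Everything in this sketch is hypothetical (the sketch_ and SK_ names, the reduced structures, and the choice of compute_strlen as the example entry); it only mirrors the shape of the real vtables and assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

enum { SK_NATIVE, SK_UTF8, SK_MAX };     /* hypothetical, mirrors enc_* */

struct sketch_string;                    /* forward declaration */

struct sketch_vtable {
    int which;                           /* index of this vtable in the array */
    size_t (*compute_strlen)(const struct sketch_string *);
};

struct sketch_string {
    const unsigned char *bufstart;
    size_t bufused;                      /* bytes in use */
    size_t strlen_chars;                 /* length in characters */
    const struct sketch_vtable *encoding;
};

/* "native": one byte per character. */
static size_t native_compute_strlen(const struct sketch_string *s)
{
    return s->bufused;
}

/* UTF-8: count every byte that is not a continuation byte (10xxxxxx). */
static size_t utf8_compute_strlen(const struct sketch_string *s)
{
    size_t i, n = 0;
    for (i = 0; i < s->bufused; i++)
        if ((s->bufstart[i] & 0xC0) != 0x80)
            n++;
    return n;
}

const struct sketch_vtable sketch_vtables[SK_MAX] = {
    { SK_NATIVE, native_compute_strlen },
    { SK_UTF8,   utf8_compute_strlen   }
};

/* Encoding-independent wrapper: look up the vtable, call through it. */
size_t sketch_compute_strlen(struct sketch_string *s)
{
    assert(s != NULL && s->encoding != NULL);
    s->strlen_chars = s->encoding->compute_strlen(s);
    return s->strlen_chars;
}
```

The same five bytes give a length of 5 when tagged as native and 4 when tagged as UTF-8 if two of them form one two-byte character; the caller never needs to know which case applies.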
Most of the string vtable functions are self-explanatory as they are thin wrappers around the functions given above. Some of them, however, are for internal use only, to help implement other functions. You'll find them in the next section.
How to add new vtable functions
The first thing to note is that if what you're doing isn't remotely encoding-specific, you don't need to add a vtable function; you can just add a function in string.c (don't forget to add the function prototype to string.h) and you don't need any more of this section. However, most things that people do with strings depend on the encoding of the string data, so if you need to add anything slightly complex, read on.
Currently, the construction of the vtables is not automated; it's hoped that soon someone will automate this and fix this section. However, for the time being, this is what you need to do when you implement a new vtable function:
1. Check to see whether or not the function's type has a typedef in string.h: for instance, if you have a function that takes a string and an INTVAL and returns a string, use string_iv_to_string_t; otherwise, add your own type.
2. Add the unqualified name of the function (frobnicate), together with your type, to string_vtable in string.h.
3. Create a function string_frobnicate in string.c which is a wrapper around frobnicate. This function must take a STRING* parameter, so that the encoding can be extracted and the relevant encoding vtable be found and despatched. It should look something like this:
yadda string_frobnicate(STRING *s, ...) {
    return (ENC_VTABLE(s).frobnicate)(s, ...);
}
4. Create functions string_XXX_frobnicate for all values of XXX in the encoding table (or better still, get other people to write them for you); string_native_frobnicate should go in strnative.c, string_utf8_frobnicate should go in strutf8.c, and so on.
5. Add string_XXX_frobnicate to the end of each vtable returned by string_XXX_vtable.
Non-user-visible String Manipulation Functions
If you've read this far, I hope you're a Parrot implementor. If you're not helping construct the Parrot core itself, you probably want to look away now.
The first two functions to note are
INTVAL string_compute_strlen(STRING* s)
and
INTVAL string_max_bytes(STRING *s, INTVAL iv)
The first updates the contents of s->strlen by contemplating the buffer bufstart and working out how many characters it contains. The second is given a number of characters which we assume are going to be added into the string at some point; it returns the maximum number of bytes that need to be allocated to admit that number of characters. For fixed-width encodings, this is trivial - the "native" encoding, for instance, encodes one byte per character, so string_native_max_bytes simply returns the INTVAL it is passed; string_utf8_max_bytes, on the other hand, returns three times the value that it is passed, on the assumption that a UTF8 character occupies at most three bytes. (That holds for the Basic Multilingual Plane; characters outside it need four bytes in UTF8.)
To grow a string to a specified size, use
void string_grow(struct Parrot_Interp *, STRING *s, INTVAL newsize)
The size is given in characters; string_max_bytes is called to turn this into a size in bytes, and then the buffer is grown to accommodate (at least) that many bytes.
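The relationship between the two functions can be sketched as below. This is a simplified model rather than Parrot's code: struct sketch_buf, sketch_max_bytes, and sketch_grow are hypothetical, the per-encoding multiplier is stored in the structure instead of being reached through a vtable, failure is reported by a return value, and overflow of the multiplication is not checked.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, cut-down string buffer. */
struct sketch_buf {
    void *bufstart;
    size_t buflen;                 /* bytes allocated */
    size_t bufused;                /* bytes in use */
    size_t max_bytes_per_char;     /* 1 for "native", 3 for UTF8 per the text */
};

/* Worst-case number of bytes needed to hold nchars characters. */
size_t sketch_max_bytes(const struct sketch_buf *s, size_t nchars)
{
    return nchars * s->max_bytes_per_char;
}

/* Make room for at least newsize characters. Never shrinks the buffer.
 * Returns 0 on success and -1 if the allocation fails. */
int sketch_grow(struct sketch_buf *s, size_t newsize)
{
    size_t need;
    void *p;

    assert(s != NULL && s->max_bytes_per_char > 0);
    need = sketch_max_bytes(s, newsize);
    if (need <= s->buflen)
        return 0;                  /* already big enough */
    p = realloc(s->bufstart, need);
    if (p == NULL)
        return -1;                 /* old buffer is still valid */
    s->bufstart = p;
    s->buflen = need;
    return 0;
}
```

Sizing for the worst case wastes some space with variable-width encodings, but it means a later append of that many characters cannot overrun the buffer.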
Transcoding
The fact that Parrot strings are encoding-abstracted really has to bottom out at some point, and it's usually when two strings of different encodings interact. When we try to append one type of string to another, we have the option of turning the second string into a string that matches the first string's encoding. This process, translating a string from one encoding into another, is called "transcoding".
In Parrot, transcoding is implemented by the two-dimensional array
Parrot_transcode_table[enc_from][enc_to]
Each entry in this table is a function pointer which takes two parameters:
string_utf32_to_utf8(STRING* from, STRING* to)
(If to is a null pointer, a new STRING* will be allocated. As before, it's all about avoiding memory allocation at runtime.)
A null pointer in the table should signify that no transcoding is necessary; Parrot_transcode_table[x][x] should always be NULL.
Parrot_transcode_table[enc_native][enc_utf8] isn't NULL. Don't fall for that, because "native" doesn't necessarily mean ISO-8859-1.
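A table of this shape can be modelled in a few lines. The sketch below is illustrative only: it uses two stand-in encodings (Latin-1 and UTF-32, chosen because conversion between them is trivial) rather than Parrot's enum, and the function signature works on raw buffers instead of STRING*. All sketch_ and SK_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SK_LATIN1, SK_UTF32, SK_MAX };    /* stand-in encodings */

/* Convert nchars characters from `from` into `to`. Returns 0 on success
 * and -1 if a character cannot be represented in the target encoding. */
typedef int (*sketch_transcoder)(const void *from, void *to, size_t nchars);

/* Every Latin-1 byte is also its own code point, so just widen. */
static int latin1_to_utf32(const void *from, void *to, size_t nchars)
{
    const unsigned char *in = from;
    uint32_t *out = to;
    size_t i;
    for (i = 0; i < nchars; i++)
        out[i] = in[i];
    return 0;
}

/* Narrowing fails for any code point above 0xFF. */
static int utf32_to_latin1(const void *from, void *to, size_t nchars)
{
    const uint32_t *in = from;
    unsigned char *out = to;
    size_t i;
    for (i = 0; i < nchars; i++) {
        if (in[i] > 0xFF)
            return -1;
        out[i] = (unsigned char)in[i];
    }
    return 0;
}

/* table[enc_from][enc_to]; the diagonal is NULL: nothing to do. */
static const sketch_transcoder sketch_table[SK_MAX][SK_MAX] = {
    /* from Latin-1 */ { NULL,            latin1_to_utf32 },
    /* from UTF-32  */ { utf32_to_latin1, NULL            }
};

/* Returns 1 if no transcoding was needed, else the transcoder's result. */
int sketch_transcode(int enc_from, int enc_to,
                     const void *from, void *to, size_t nchars)
{
    sketch_transcoder fn;

    assert(enc_from >= 0 && enc_from < SK_MAX);
    assert(enc_to >= 0 && enc_to < SK_MAX);
    fn = sketch_table[enc_from][enc_to];
    if (fn == NULL)
        return 1;                        /* same encoding */
    return fn(from, to, nchars);
}
```

The caller only ever indexes the table; which conversion routine runs is decided entirely by the two encoding values.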
Foreign Encodings
Fill this in later; if anyone wants to implement new encodings at this stage they must be mad.
Work In Progress
The transcoding section is out of sync with the code.
Should the following functions be mentioned? string_append, string_from_c_string, string_from_int, string_from_num, string_index, string_replace, string_set, string_str_index, string_to_cstring, string_to_int, string_to_num, string_transcode.
string_bool is here said to return BOOLVAL. But the code is returning INTVAL (2002Dec). Which is the right thing?