Lexer interface

This is the lower layer of the Perl parser, managing characters and tokens.

The variable "PL_parser" is a pointer to a structure encapsulating the state of the parsing operation currently in progress. The pointer can be locally changed to perform a nested parse without interfering with the state of an outer parse. Individual members of PL_parser have their own documentation.

The variable "PL_parser->linestr" is the buffer scalar containing the chunk of text currently under consideration by the lexer. This is always a plain string scalar (for which SvPOK is true). It is not intended to be used as a scalar by normal scalar means; instead refer to the buffer directly by the pointer variables described below.

The lexer maintains various char* pointers to things in the PL_parser->linestr buffer. If PL_parser->linestr is ever reallocated, all of these pointers must be updated. Don't attempt to do this manually, but rather use "lex_grow_linestr" if you need to reallocate the buffer.

The content of the text chunk in the buffer is commonly exactly one complete line of input, up to and including a newline terminator, but there are situations where it is otherwise. The octets of the buffer may be intended to be interpreted as either UTF-8 or Latin-1. The function "lex_bufutf8" tells you which. Do not use the SvUTF8 flag on this scalar, which may disagree with it.

For direct examination of the buffer, the variable "PL_parser->bufend" points to the end of the buffer. The current lexing position is pointed to by "PL_parser->bufptr". Direct use of these pointers is usually preferable to examination of the scalar through normal scalar means.

The variable "PL_parser->bufend" is a direct pointer to the end of the chunk of text currently being lexed, which is the end of the lexer buffer. This is equal to SvPVX(PL_parser->linestr) + SvCUR(PL_parser->linestr). A NUL character (zero octet) is always located at the end of the buffer, and does not count as part of the buffer's contents.
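The relationship between the buffer start, its length, and the end pointer can be illustrated with a small standalone model. This is only a sketch: the type toy_buf and the helpers toy_buf_init and toy_buf_remaining are hypothetical names invented here, and the code does not use the Perl headers or the real parser structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the lexer buffer; not Perl's real structures. */
typedef struct {
    char  *start;   /* plays the role of SvPVX(PL_parser->linestr) */
    size_t cur;     /* plays the role of SvCUR(PL_parser->linestr) */
    char  *bufptr;  /* current lexing position */
    char  *bufend;  /* one past the last content octet */
} toy_buf;

/* Point the toy buffer at a NUL-terminated chunk of text. */
static void toy_buf_init(toy_buf *b, char *text)
{
    b->start  = text;
    b->cur    = strlen(text);
    b->bufptr = text;
    b->bufend = b->start + b->cur;  /* the identity stated above */
}

/* Octets still to be lexed; the trailing NUL is not counted. */
static size_t toy_buf_remaining(const toy_buf *b)
{
    assert(b->start <= b->bufptr && b->bufptr <= b->bufend);
    return (size_t)(b->bufend - b->bufptr);
}
```

In this model, *bufend is always the NUL octet, so code scanning forward from bufptr may read the octet at bufend but must not treat it as content.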

The variable "PL_parser->bufptr" points to the current position of lexing inside the lexer buffer. Characters around this point may be freely examined, within the range delimited by SvPVX("PL_parser->linestr") and "PL_parser->bufend". The octets of the buffer may be intended to be interpreted as either UTF-8 or Latin-1, as indicated by "lex_bufutf8".

Lexing code (whether in the Perl core or not) moves this pointer past the characters that it consumes. It is also expected to perform some bookkeeping whenever a newline character is consumed. This movement can be more conveniently performed by the function "lex_read_to", which handles newlines appropriately.

Interpretation of the buffer's octets can be abstracted out by using the slightly higher-level functions "lex_peek_unichar" and "lex_read_unichar".

The variable "PL_parser->linestart" points to the start of the current line inside the lexer buffer. This is useful for indicating at which column an error occurred, and not much else. This must be updated by any lexing code that consumes a newline; the function "lex_read_to" handles this detail.

The function "lex_bufutf8" indicates whether the octets in the lexer buffer ("PL_parser->linestr") should be interpreted as the UTF-8 encoding of Unicode characters. If not, they should be interpreted as Latin-1 characters. This is analogous to the SvUTF8 flag for scalars.

In UTF-8 mode, it is not guaranteed that the lexer buffer actually contains valid UTF-8. Lexing code must be robust in the face of invalid encoding.

The actual SvUTF8 flag of the "PL_parser->linestr" scalar is significant, but not the whole story regarding the input character encoding. Normally, when a file is being read, the scalar contains octets and its SvUTF8 flag is off, but the octets should be interpreted as UTF-8 if the "use utf8" pragma is in effect. During a string eval, however, the scalar may have the SvUTF8 flag on, and in this case its octets should be interpreted as UTF-8 unless the "use bytes" pragma is in effect. This logic may change in the future; use "lex_bufutf8" instead of implementing the logic yourself.

The function "lex_grow_linestr" reallocates the lexer buffer ("PL_parser->linestr") to accommodate at least len octets (including terminating NUL). Returns a pointer to the reallocated buffer. This is necessary before making any direct modification of the buffer that would increase its length. "lex_stuff_pvn" provides a more convenient way to insert text into the buffer.

Do not use SvGROW or sv_grow directly on PL_parser->linestr; this function updates all of the lexer's variables that point directly into the buffer.
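The reason a bare reallocation is unsafe is that the block may move, leaving every saved pointer dangling. The following standalone sketch models the rebasing step only. The names toy_lexer, toy_init, and toy_grow are hypothetical, the model tracks just three pointers, and it is not a description of how Perl implements the real function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the lexer state; not Perl's real structures. */
typedef struct {
    char  *buf;        /* heap buffer, plays the role of SvPVX(linestr) */
    size_t cur;        /* octets of content, excluding the trailing NUL */
    size_t cap;        /* allocated size, including room for the NUL */
    char  *bufptr;
    char  *bufend;
    char  *linestart;
} toy_lexer;

/* Copy text into a fresh heap buffer; returns 0 if allocation fails. */
static int toy_init(toy_lexer *lx, const char *text)
{
    lx->cur = strlen(text);
    lx->cap = lx->cur + 1;
    lx->buf = malloc(lx->cap);
    if (!lx->buf)
        return 0;
    memcpy(lx->buf, text, lx->cap);
    lx->bufptr = lx->linestart = lx->buf;
    lx->bufend = lx->buf + lx->cur;
    return 1;
}

/* Grow the buffer to hold at least len octets (including the NUL),
 * then rebase every pointer that pointed into the old allocation.
 * Returns the possibly moved buffer, or NULL if allocation fails. */
static char *toy_grow(toy_lexer *lx, size_t len)
{
    /* Record positions as offsets before the old block can move. */
    size_t bufptr_off    = (size_t)(lx->bufptr - lx->buf);
    size_t linestart_off = (size_t)(lx->linestart - lx->buf);

    if (len > lx->cap) {
        char *nbuf = realloc(lx->buf, len);
        if (!nbuf)
            return NULL;
        lx->buf = nbuf;
        lx->cap = len;
    }
    lx->bufptr    = lx->buf + bufptr_off;
    lx->linestart = lx->buf + linestart_off;
    lx->bufend    = lx->buf + lx->cur;
    return lx->buf;
}
```

Any caller that cached its own char* into the buffer before growing it has the same problem and should re-derive that pointer from an offset afterwards.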

The function "lex_stuff_pvn" inserts characters into the lexer buffer ("PL_parser->linestr"), immediately after the current lexing point ("PL_parser->bufptr"), reallocating the buffer if necessary. This means that lexing code that runs later will see the characters as if they had appeared in the input. It is not recommended to do this as part of normal parsing, and most uses of this facility run the risk of the inserted characters being interpreted in an unintended manner.

The string to be inserted is represented by len octets starting at pv. These octets are interpreted as either UTF-8 or Latin-1, according to whether the LEX_STUFF_UTF8 flag is set in flags. The characters are recoded for the lexer buffer, according to how the buffer is currently being interpreted ("lex_bufutf8"). If a string to be interpreted is available as a Perl scalar, the "lex_stuff_sv" function is more convenient.
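The effect of inserting text at the lexing point can be modelled outside Perl as a gap opened with memmove. This is a simplified sketch under stated assumptions: toy_lexer, toy_init, and toy_stuff are hypothetical names, the lexing point is stored as an offset rather than a pointer, and the recoding between UTF-8 and Latin-1 described above is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the lexer state; not Perl's real structures.
 * The lexing point is an offset so that realloc cannot leave it dangling. */
typedef struct {
    char  *buf;   /* heap buffer, always NUL-terminated */
    size_t cur;   /* content length, excluding the NUL */
    size_t cap;   /* allocated size */
    size_t pos;   /* offset of the lexing point, plays the role of bufptr */
} toy_lexer;

/* Copy text into a fresh heap buffer; returns 0 if allocation fails. */
static int toy_init(toy_lexer *lx, const char *text, size_t pos)
{
    lx->cur = strlen(text);
    assert(pos <= lx->cur);
    lx->cap = lx->cur + 1;
    lx->buf = malloc(lx->cap);
    if (!lx->buf)
        return 0;
    memcpy(lx->buf, text, lx->cap);
    lx->pos = pos;
    return 1;
}

/* Insert len octets from pv at the lexing point, growing the buffer
 * if needed. No recoding is done here. Returns 0 on allocation failure. */
static int toy_stuff(toy_lexer *lx, const char *pv, size_t len)
{
    size_t need = lx->cur + len + 1;
    if (need > lx->cap) {
        char *nbuf = realloc(lx->buf, need);
        if (!nbuf)
            return 0;
        lx->buf = nbuf;
        lx->cap = need;
    }
    /* Shift the unlexed tail and its NUL right to open a gap. */
    memmove(lx->buf + lx->pos + len, lx->buf + lx->pos,
            lx->cur - lx->pos + 1);
    memcpy(lx->buf + lx->pos, pv, len);
    lx->cur += len;
    return 1;
}
```

Because the lexing point itself does not move, the inserted octets are the next thing later lexing code will read.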

The function "lex_stuff_sv" inserts characters into the lexer buffer ("PL_parser->linestr"), immediately after the current lexing point ("PL_parser->bufptr"), reallocating the buffer if necessary. This means that lexing code that runs later will see the characters as if they had appeared in the input. It is not recommended to do this as part of normal parsing, and most uses of this facility run the risk of the inserted characters being interpreted in an unintended manner.

The string to be inserted is the string value of sv. The characters are recoded for the lexer buffer, according to how the buffer is currently being interpreted ("lex_bufutf8"). If a string to be interpreted is not already a Perl scalar, the "lex_stuff_pvn" function avoids the need to construct a scalar.

The function "lex_unstuff" discards text about to be lexed, from "PL_parser->bufptr" up to ptr. Text following ptr will be moved, and the buffer shortened. This hides the discarded text from any lexing code that runs later, as if the text had never appeared.

This is not the normal way to consume lexed text. For that, use "lex_read_to".

The function "lex_read_to" consumes text in the lexer buffer, from "PL_parser->bufptr" up to ptr. This advances "PL_parser->bufptr" to match ptr, performing the correct bookkeeping whenever a newline character is passed. This is the normal way to consume lexed text.
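The newline bookkeeping can be pictured with a standalone model that advances a position while tracking the line start. This is a sketch, not Perl's code: toy_lexer, toy_read_to, and toy_column are hypothetical names, and the line counter is an assumption standing in for whatever line tracking the real parser performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the lexer state; not Perl's real structures. */
typedef struct {
    char         *bufptr;     /* current lexing position */
    char         *bufend;     /* end of buffer content */
    char         *linestart;  /* start of the current line */
    unsigned long line;       /* assumed line counter, starting at 1 */
} toy_lexer;

/* Consume text from bufptr up to ptr, noting each newline passed. */
static void toy_read_to(toy_lexer *lx, char *ptr)
{
    assert(lx->bufptr <= ptr && ptr <= lx->bufend);
    for (char *p = lx->bufptr; p != ptr; p++) {
        if (*p == '\n') {
            lx->line++;
            lx->linestart = p + 1;  /* next line begins after the newline */
        }
    }
    lx->bufptr = ptr;
}

/* Zero-based octet offset of the lexing position within its line. */
static size_t toy_column(const toy_lexer *lx)
{
    return (size_t)(lx->bufptr - lx->linestart);
}
```

Code that bumps the position directly past a newline without this step would report wrong columns in later error messages, which is why going through a single consuming function is convenient.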

Interpretation of the buffer's octets can be abstracted out by using the slightly higher-level functions "lex_peek_unichar" and "lex_read_unichar".

The function "lex_discard_to" discards the first part of the "PL_parser->linestr" buffer, up to ptr. The remaining content of the buffer will be moved, and all pointers into the buffer updated appropriately. ptr must not be later in the buffer than the position of "PL_parser->bufptr": it is not permitted to discard text that has yet to be lexed.

Normally it is not necessary to do this directly, because it suffices to use the implicit discarding behaviour of "lex_next_chunk" and things based on it. However, if a token stretches across multiple lines, and the lexing code has kept multiple lines of text in the buffer for that purpose, then after completion of the token it would be wise to explicitly discard the now-unneeded earlier lines, to avoid future multi-line tokens growing the buffer without bound.
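Discarding already-lexed text from the front of the buffer amounts to sliding the remaining content down and shifting the pointers by the same amount. The sketch below models that with hypothetical names (toy_lexer, toy_discard_to) and tracks only bufptr and bufend; the real parser has further pointers to adjust.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the lexer state; not Perl's real structures. */
typedef struct {
    char  *buf;     /* start of the buffer */
    size_t cur;     /* content length, excluding the trailing NUL */
    char  *bufptr;  /* current lexing position */
    char  *bufend;  /* buf + cur */
} toy_lexer;

/* Drop everything before ptr. Only text that has already been lexed
 * may be dropped, so ptr must not lie beyond bufptr. */
static void toy_discard_to(toy_lexer *lx, char *ptr)
{
    assert(lx->buf <= ptr && ptr <= lx->bufptr);
    size_t drop = (size_t)(ptr - lx->buf);
    if (drop == 0)
        return;
    /* Slide the remaining content and its NUL to the front. */
    memmove(lx->buf, ptr, lx->cur - drop + 1);
    lx->cur    -= drop;
    lx->bufptr -= drop;
    lx->bufend -= drop;
}
```

For example, after finishing a token that spanned two kept lines, lexing code in this model could pass the start of the second line as ptr to release the first.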

The function "lex_next_chunk" reads in the next chunk of text to be lexed, appending it to "PL_parser->linestr". This should be called when lexing code has looked to the end of the current chunk and wants to know more. It is usual, but not necessary, for lexing to have consumed the entirety of the current chunk at this time.

If "PL_parser->bufptr" is pointing to the very end of the current chunk (i.e., the current chunk has been entirely consumed), normally the current chunk will be discarded at the same time that the new chunk is read in. If flags includes LEX_KEEP_PREVIOUS, the current chunk will not be discarded. If the current chunk has not been entirely consumed, then it will not be discarded regardless of the flag.

Returns true if some new text was added to the buffer, or false if the buffer has reached the end of the input text.

The function "lex_peek_unichar" looks ahead one (Unicode) character in the text currently being lexed. Returns the codepoint (unsigned integer value) of the next character, or -1 if lexing has reached the end of the input text. To consume the peeked character, use "lex_read_unichar".

If the next character is in (or extends into) the next chunk of input text, the next chunk will be read in. Normally the current chunk will be discarded at the same time, but if flags includes LEX_KEEP_PREVIOUS then the current chunk will not be discarded.

If the input is being interpreted as UTF-8 and a UTF-8 encoding error is encountered, an exception is generated.
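What peeking one character means under the two interpretations of the buffer can be sketched with a small standalone decoder. The function toy_peek is a hypothetical name, and the sketch makes several simplifying assumptions: it returns -2 on malformed input instead of raising an exception, it treats a sequence cut off at the end of the buffer as malformed rather than reading another chunk, and it accepts only sequences of up to four octets without checking for surrogates or an upper limit.

```c
#include <assert.h>
#include <stddef.h>

/* Return the code point at p, -1 at the end of the buffer, or -2 on
 * malformed UTF-8. *len receives the number of octets the character
 * occupies (0 when nothing was decoded). */
static long toy_peek(const unsigned char *p, const unsigned char *end,
                     int is_utf8, size_t *len)
{
    *len = 0;
    if (p >= end)
        return -1;

    unsigned c = *p;
    if (!is_utf8 || c < 0x80) {   /* Latin-1 octet, or ASCII in UTF-8 */
        *len = 1;
        return (long)c;
    }

    size_t n;    /* total octets in the sequence */
    long cp;     /* code point being assembled */
    long min;    /* smallest value that needs n octets */
    if ((c & 0xE0) == 0xC0)      { n = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { n = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { n = 4; cp = c & 0x07; min = 0x10000; }
    else
        return -2;                /* stray continuation or invalid lead */

    if ((size_t)(end - p) < n)
        return -2;                /* sequence runs off the buffer */
    for (size_t i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return -2;            /* not a continuation octet */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min)
        return -2;                /* overlong encoding */

    *len = n;
    return cp;
}
```

The same two octets 0xC3 0xA9 yield one character (U+00E9) when the buffer is read as UTF-8 and two characters when it is read as Latin-1, which is why lexing code should not guess the interpretation itself.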

The function "lex_read_unichar" reads the next (Unicode) character in the text currently being lexed. Returns the codepoint (unsigned integer value) of the character read, and moves "PL_parser->bufptr" past the character, or returns -1 if lexing has reached the end of the input text. To non-destructively examine the next character, use "lex_peek_unichar" instead.

If the next character is in (or extends into) the next chunk of input text, the next chunk will be read in. Normally the current chunk will be discarded at the same time, but if flags includes LEX_KEEP_PREVIOUS then the current chunk will not be discarded.

If the input is being interpreted as UTF-8 and a UTF-8 encoding error is encountered, an exception is generated.

The function "lex_read_space" reads optional spaces, in Perl style, in the text currently being lexed. The spaces may include ordinary whitespace characters and Perl-style comments. #line directives are processed if encountered. "PL_parser->bufptr" is moved past the spaces, so that it points at a non-space character (or the end of the input text).

If spaces extend into the next chunk of input text, the next chunk will be read in. Normally the current chunk will be discarded at the same time, but if flags includes LEX_KEEP_PREVIOUS then the current chunk will not be discarded.
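The core of skipping Perl-style spaces within a single chunk can be sketched as a loop over whitespace and comments. The function toy_skip_space is a hypothetical name, and this model is deliberately incomplete: it does not process #line directives, does not read further chunks, and uses the C locale's isspace as an approximation of whitespace.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return a pointer to the first octet in [p, end) that is neither
 * whitespace nor part of a '#' comment, or end if there is none. */
static const char *toy_skip_space(const char *p, const char *end)
{
    while (p < end) {
        if (isspace((unsigned char)*p)) {
            p++;
        } else if (*p == '#') {
            /* A comment runs up to, but not including, the newline;
             * the newline itself is then skipped as whitespace. */
            while (p < end && *p != '\n')
                p++;
        } else {
            break;
        }
    }
    return p;
}
```

In this model the caller would then consume up to the returned pointer, so that any newlines passed are accounted for.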