Parser Interfaces

Basic interface for a simplified Perl binding for SAX (Simple API for XML).

SAX parsers are reusable but not re-entrant: the application may reuse a parser object (possibly with a different input source) once the first parse has completed successfully, but it may not invoke the parse() methods recursively within a parse.

Parser objects contain the following options. A new or different handler option may be provided in the middle of a parse, and the SAX parser must begin using the new handler immediately. The `Locale' option must not be changed in the middle of a parse. If an application does not provide a handler for a particular set of events, those events will be silently ignored unless otherwise stated. If an `EntityResolver' is not provided, the parser will resolve system identifiers and open connections to entities itself.

Handler          default handler to receive events
DocumentHandler  handler to receive document events
DTDHandler       handler to receive DTD events
ErrorHandler     handler to receive error events
EntityResolver   handler to resolve entities
Locale           locale to provide localisation for errors

If no handlers are provided then all events will be silently ignored, except for `fatal_error()' which will cause a `die()' to be called after calling `end_document()'.
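The relationship between the default `Handler' option and the more specific handler options can be sketched as follows. This is only an illustration: the `handler_for()' helper and the `%option_for' table are hypothetical, and the sketch assumes that a specific option, when present, takes precedence over `Handler'.

```perl
use strict;
use warnings;

# Assumed grouping of events by handler option, following the interfaces
# described in this document.
my %option_for = (
    (map { ($_ => 'DocumentHandler') }
        qw(start_document end_document start_element end_element
           characters ignorable_whitespace processing_instruction)),
    (map { ($_ => 'DTDHandler') } qw(notation_decl unparsed_entity_decl)),
    (map { ($_ => 'ErrorHandler') } qw(warning error fatal_error)),
    resolve_entity => 'EntityResolver',
);

# Hypothetical helper: pick the object that should receive an event.
sub handler_for {
    my ($options, $event) = @_;
    my $specific = $option_for{$event};
    return $options->{$specific}
        if defined $specific && defined $options->{$specific};
    return $options->{Handler};    # may be undef: the event is then ignored
}

# Strings stand in for handler objects to keep the sketch short.
my %options = (Handler => 'default object', ErrorHandler => 'error object');
my $for_warning = handler_for(\%options, 'warning');
my $for_element = handler_for(\%options, 'start_element');
print "$for_warning / $for_element\n";
```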

All handler methods are called with a single hash argument containing the parameters for that method. `new()' methods can be called with a hash or a list of key-value pairs containing the parameters.
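A constructor that accepts either form might look like the sketch below. The class name `My::SketchParser' is hypothetical, and the sketch assumes that "a hash" means a hash reference.

```perl
use strict;
use warnings;

# Hypothetical parser class, shown only to illustrate argument handling.
{
    package My::SketchParser;

    sub new {
        my $class = shift;
        # Accept either a single hash reference or a flat key/value list.
        my %options = (@_ == 1 && ref($_[0]) eq 'HASH') ? %{ $_[0] } : @_;
        return bless { %options }, $class;
    }
}

my $from_list = My::SketchParser->new(Locale => 'en_GB');
my $from_hash = My::SketchParser->new({ Locale => 'en_GB' });

print "$from_list->{Locale} $from_hash->{Locale}\n";
```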

Parser

All SAX parsers must implement this basic interface: it allows applications to provide handlers for different types of events and to initiate a parse from a URI, a byte stream, or a character stream.

new

Creates a Parser that will be used to parse XML sources. Any parameters passed to `new()' will be used for subsequent parses.

parse

Parse an XML document.

The application can use this method to instruct the SAX parser to begin parsing an XML document from any valid input source (a character stream, a byte stream, or a URI).

Applications may not invoke this method while a parse is in progress (they should create a new Parser instead for each additional XML document). Once a parse is complete, an application may reuse the same Parser object, possibly with a different input source.

`parse()' returns the result of calling the handler method `end_document()'.

The hash passed to `parse()' must contain at least one parameter, `Source', that provides the input source for parsing. Additional parameters replace the existing options set in the parser object. `Source' may be a scalar containing XML text or a hash with the following keys:

PublicId

The public identifier for this input source.

The public identifier is always optional: if the application writer includes one, it will be provided as part of the location information.

SystemId

The system identifier for this input source.

The system identifier is optional if there is a byte stream, a character stream, or a string, but it is still useful to provide one, since the application can use it to resolve relative URIs and can include it in error messages and warnings (the parser will attempt to open a connection to the URI only if there is no byte stream or character stream specified).

If the application knows the character encoding of the object pointed to by the system identifier, it can provide the encoding using the `Encoding' parameter.

If the system ID is a URL, it must be fully resolved.

String

A scalar value containing XML text to be parsed.

The SAX parser will ignore this if there is also a byte or character stream, but it will use a string in preference to opening a URI connection.

ByteStream

The byte stream (file handle) for this input source.

The SAX parser will ignore this if there is also a character stream specified, but it will use a byte stream in preference to opening a URI connection itself or using `String'.

If the application knows the character encoding of the byte stream, it should set it with the `Encoding' parameter.

CharacterStream

FOR FUTURE USE ONLY -- Perl does not currently support any character streams; use only the `ByteStream', `SystemId', or `String' parameters.

The character stream (file handle) for this input source.

If there is a character stream specified, the SAX parser will ignore any byte stream and will not attempt to open a URI connection to the system identifier.

Encoding

The character encoding, if known.

The encoding must be a string acceptable for an XML encoding declaration (see section 4.3.3 of the XML 1.0 recommendation).

This parameter has no effect when the application provides a character stream.
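The precedence rules among the `Source' keys can be summarised in a few lines of Perl. The `pick_input()' helper below is hypothetical and not part of any parser; it only reports which key a parser following the rules above would read from.

```perl
use strict;
use warnings;

# Hypothetical helper: report which key of a `Source' value would be used.
sub pick_input {
    my ($source) = @_;
    # A plain scalar is XML text, equivalent to providing `String'.
    return 'String' unless ref($source) eq 'HASH';
    # Highest precedence first: a character stream wins over a byte
    # stream, which wins over a string, which wins over opening a URI.
    for my $key (qw(CharacterStream ByteStream String SystemId)) {
        return $key if defined $source->{$key};
    }
    return;    # nothing usable was supplied
}

my $picked_string = pick_input({
    SystemId => 'http://example.com/doc.xml',
    String   => '<doc/>',
});
my $picked_uri    = pick_input({ SystemId => 'http://example.com/doc.xml' });
my $picked_scalar = pick_input('<doc/>');

print "$picked_string $picked_uri $picked_scalar\n";
```

In the first call the `SystemId' is still worth supplying, since it can be used to resolve relative URIs and to label error messages.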

Locator

Interface for associating a SAX event with a document location.

If a SAX parser provides location information to the SAX application, it does so by implementing this interface and then passing an instance to the application through the `Locator' parameter of the document handler's `start_document()' method. The application can use the object to obtain the location of any other document handler event in the XML source document.

Note that the results returned by the object will be valid only during the scope of each document handler method: the application will receive unpredictable results if it attempts to use the locator at any other time.

SAX parsers are not required to supply a locator, but they are very strongly encouraged to do so.

location

Return the location information for the current event.

Returns a hash containing the following parameters:

ColumnNumber The column number, or undef if none is available.
LineNumber   The line number, or undef if none is available.
PublicId     A string containing the public identifier, or undef if
             none is available.
SystemId     A string containing the system identifier, or undef if
             none is available.
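A handler that keeps the locator and queries it from a later event could be sketched as follows. Both classes are hypothetical: `My::SketchLocator' stands in for whatever object a real parser would supply, and the sketch assumes that `location()' returns a hash reference and that handlers receive their parameters as a hash reference.

```perl
use strict;
use warnings;

# Hypothetical locator standing in for one supplied by a real parser.
{
    package My::SketchLocator;

    sub new {
        my ($class, %pos) = @_;
        return bless { %pos }, $class;
    }

    sub location {
        my ($self) = @_;
        my %loc = map { ($_ => $self->{$_}) }
            qw(ColumnNumber LineNumber PublicId SystemId);
        return \%loc;
    }
}

{
    package My::LocatingHandler;

    sub new {
        my ($class) = @_;
        return bless { seen => [] }, $class;
    }

    sub start_document {
        my ($self, $args) = @_;
        $self->{locator} = $args->{Locator};    # may be undef
    }

    sub start_element {
        my ($self, $element) = @_;
        # Query the locator only while inside a handler method.
        my $where = $self->{locator} ? $self->{locator}->location : {};
        my $line  = defined $where->{LineNumber} ? $where->{LineNumber} : '?';
        push @{ $self->{seen} }, "$element->{Name} at line $line";
    }
}

my $locator = My::SketchLocator->new(
    LineNumber => 3, ColumnNumber => 1, SystemId => 'doc.xml',
);
my $handler = My::LocatingHandler->new;
$handler->start_document({ Locator => $locator });
$handler->start_element({ Name => 'para', Attributes => {} });
print "$_\n" for @{ $handler->{seen} };
```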

Handler Interfaces

SAX handler methods are grouped into four interfaces: the document handler for receiving normal document events, the DTD handler for receiving notation and unparsed entity events, the error handler for receiving errors and warnings, and the entity resolver for redirecting external system identifiers.

The application may choose to implement each interface in one package or in separate packages, as long as the objects provided as parameters to the parser provide the matching interface.

Parsers may implement additional methods in each of these categories; refer to the parser documentation for further information.

All handlers are called with a single hash argument containing the parameters for that handler.

Application writers who do not want to implement the entire interface can leave those methods undefined. Events whose handler methods are undefined will be ignored unless otherwise stated.
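One way a parser could honour undefined handler methods is to test for each method before calling it. The `send_event()' helper and the `My::PartialHandler' class below are hypothetical, and the sketch assumes parameters are passed as a hash reference.

```perl
use strict;
use warnings;

# Hypothetical dispatch helper of the kind a parser might use internally.
sub send_event {
    my ($handler, $method, $args) = @_;
    my $code = $handler && $handler->can($method);
    # An undefined handler method means the event is silently ignored.
    return $code ? $handler->$code($args) : undef;
}

{
    package My::PartialHandler;

    sub new {
        my ($class) = @_;
        return bless { names => [] }, $class;
    }

    # Only start_element() is implemented; every other event is ignored.
    sub start_element {
        my ($self, $element) = @_;
        push @{ $self->{names} }, $element->{Name};
        return scalar @{ $self->{names} };
    }
}

my $partial = My::PartialHandler->new;
send_event($partial, 'start_document', {});
send_event($partial, 'start_element', { Name => 'doc', Attributes => {} });
send_event($partial, 'characters', { Data => 'hello' });
send_event($partial, 'end_element', { Name => 'doc' });
print "@{ $partial->{names} }\n";
```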

DocumentHandler

This is the main interface that most SAX applications implement: if the application needs to be informed of basic parsing events, it implements this interface and provides an instance to the SAX parser using the `DocumentHandler' parameter. The parser uses the instance to report basic document-related events like the start and end of elements and character data.

The order of events in this interface is very important, and mirrors the order of information in the document itself. For example, all of an element's content (character data, processing instructions, and/or subelements) will appear, in order, between the `start_element()' event and the corresponding `end_element()' event.

The application can find the location of any event using the Locator interface supplied by the Parser through the `Locator' parameter to `start_document()'.

start_document

Receive notification of the beginning of a document.

The SAX parser will invoke this method only once, before any other methods in this interface or in DTDHandler.

`Locator' parameter, if provided, contains an object that can be queried for the current location within the document. Parsers are not required to provide a document locator.

Parameters:

Locator     An object that can return the location of any SAX document
            event.

end_document

Receive notification of the end of a document. No parameters are passed with this event.

The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input.

The value returned by calling `end_document()' will be the value returned by `parse()'.

start_element

Receive notification of the beginning of an element.

The Parser will invoke this method at the beginning of every element in the XML document; there will be a corresponding `end_element()' event for every `start_element()' event (even when the element is empty). All of the element's content will be reported, in order, before the corresponding `end_element()' event.

If the element name has a namespace prefix, the prefix will still be attached. Note that the attribute list provided will contain only attributes with explicit values (specified or defaulted): #IMPLIED attributes will be omitted.

Parameters:

Name        The element type name.
Attributes  The attributes attached to the element, if any.

end_element

Receive notification of the end of an element.

The SAX parser will invoke this method at the end of every element in the XML document; there will be a corresponding `start_element()' event for every `end_element()' event (even when the element is empty).

If the element name has a namespace prefix, the prefix will still be attached to the name.

Parameters:

Name        The element type name.

characters

Receive notification of character data.

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.

Note that some parsers will report whitespace using the `ignorable_whitespace()' method rather than this one (validating parsers must do so).

Parameters:

Data        The characters from the XML document.

ignorable_whitespace

Receive notification of ignorable whitespace in element content.

Validating Parsers must use this method to report each chunk of ignorable whitespace (see the W3C XML 1.0 recommendation, section 2.10): non-validating parsers may also use this method if they are capable of parsing and using content models.

SAX parsers may return all contiguous whitespace in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity, so that the Locator provides useful information.

The application must not attempt to read from the array outside of the specified range.

Parameters:

Data        The characters from the XML document.

processing_instruction

Receive notification of a processing instruction.

The Parser will invoke this method once for each processing instruction found: note that processing instructions may occur before or after the main document element.

A SAX parser should never report an XML declaration (XML 1.0, section 2.8) or a text declaration (XML 1.0, section 4.3.1) using this method.

Parameters:

Target      The processing instruction target. 
Data        The processing instruction data, if any.
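A small document handler, driven by a replayed list of events, illustrates the event order and the way the result of `end_document()' becomes the result of the parse. The `My::TextCollector' class and the `replay()' function are hypothetical; `replay()' merely stands in for a real parser's `parse()' method, and parameters are assumed to arrive as hash references.

```perl
use strict;
use warnings;

{
    package My::TextCollector;

    sub new {
        my ($class) = @_;
        return bless {}, $class;
    }

    sub start_document {
        my ($self) = @_;
        @{$self}{qw(text depth max)} = ('', 0, 0);
    }

    sub start_element {
        my ($self, $element) = @_;
        $self->{depth}++;
        $self->{max} = $self->{depth} if $self->{depth} > $self->{max};
    }

    sub end_element {
        my ($self, $element) = @_;
        $self->{depth}--;
    }

    # Character data may arrive in several chunks, so append each one.
    sub characters {
        my ($self, $chars) = @_;
        $self->{text} .= $chars->{Data};
    }

    # The value returned here is what the parse hands back to its caller.
    sub end_document {
        my ($self) = @_;
        return { Text => $self->{text}, MaxDepth => $self->{max} };
    }
}

# Hypothetical stand-in for parse(): replays a fixed list of events.
sub replay {
    my ($handler, @events) = @_;
    my $result;
    for my $event (@events) {
        my ($method, $args) = @$event;
        $result = $handler->$method($args) if $handler->can($method);
    }
    return $result;    # result of the last event, end_document()
}

my $summary = replay(
    My::TextCollector->new,
    [ start_document => {} ],
    [ start_element  => { Name => 'doc', Attributes => {} } ],
    [ start_element  => { Name => 'p',   Attributes => {} } ],
    [ characters     => { Data => 'Hello, ' } ],
    [ characters     => { Data => 'world' } ],
    [ end_element    => { Name => 'p' } ],
    [ end_element    => { Name => 'doc' } ],
    [ end_document   => {} ],
);
print "$summary->{Text} (depth $summary->{MaxDepth})\n";
```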

ErrorHandler

Basic interface for SAX error handlers.

If a SAX application needs to implement customized error handling, it must implement this interface and then provide an instance to the SAX parser using the parser's `ErrorHandler' parameter. The parser will then report all errors and warnings through this interface.

The parser shall use this interface instead of throwing an exception: it is up to the application whether to throw an exception for different types of errors and warnings. Note, however, that there is no requirement that the parser continue to provide useful information after a call to `fatal_error()' (in other words, a SAX driver class could catch an exception and report a fatal error).

All error handlers receive the following parameters. The `PublicId', `SystemId', `LineNumber', and `ColumnNumber' are provided only if the parser has that information available.

Message      The error or warning message, or undef to use the message
             from the `EvalError' parameter.
PublicId     The public identifier of the entity that generated the
             error or warning.
SystemId     The system identifier of the entity that generated the
             error or warning.
LineNumber   The line number of the end of the text that caused the
             error or warning.
ColumnNumber The column number of the end of the text that caused the
             error or warning.
EvalError    The error value returned from a lower level interface.

Application writers who do not want to implement the entire interface can leave those methods undefined. If they are not defined, calls to the `warning()' and `error()' handlers will be ignored, and processing will be terminated (going straight to `end_document()') after the call to `fatal_error()'.

warning

Receive notification of a warning.

SAX parsers will use this method to report conditions that are not errors or fatal errors as defined by the XML 1.0 recommendation. The default behaviour is to take no action.

The SAX parser must continue to provide normal parsing events after invoking this method: it should still be possible for the application to process the document through to the end.

error

Receive notification of a recoverable error.

This corresponds to the definition of "error" in section 1.2 of the W3C XML 1.0 Recommendation. For example, a validating parser would use this callback to report the violation of a validity constraint. The default behaviour is to take no action.

The SAX parser must continue to provide normal parsing events after invoking this method: it should still be possible for the application to process the document through to the end. If the application cannot do so, then the parser should report a fatal error even if the XML 1.0 recommendation does not require it to do so.

fatal_error

Receive notification of a non-recoverable error.

This corresponds to the definition of "fatal error" in section 1.2 of the W3C XML 1.0 Recommendation. For example, a parser would use this callback to report the violation of a well-formedness constraint.

The application must assume that the document is unusable after the parser has invoked this method, and should continue (if at all) only for the sake of collecting additional error messages: in fact, SAX parsers are free to stop reporting any other events once this method has been invoked.
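An error handler that logs warnings and recoverable errors but gives up on fatal errors could be sketched as below. The `My::ErrorCollector' class is hypothetical, the choice to `die()' in `fatal_error()' is an application policy assumed for this sketch, and parameters are assumed to arrive as a hash reference.

```perl
use strict;
use warnings;

{
    package My::ErrorCollector;

    sub new {
        my ($class) = @_;
        return bless { log => [] }, $class;
    }

    # Build a message, adding location fields only when they were supplied.
    sub _format {
        my ($self, $level, $err) = @_;
        my $message = defined $err->{Message}
            ? $err->{Message}
            : $err->{EvalError};
        my $text = "$level: $message";
        $text .= " at line $err->{LineNumber}" if defined $err->{LineNumber};
        $text .= ", column $err->{ColumnNumber}"
            if defined $err->{ColumnNumber};
        return $text;
    }

    sub warning {
        my ($self, $err) = @_;
        push @{ $self->{log} }, $self->_format('warning', $err);
    }

    sub error {
        my ($self, $err) = @_;
        push @{ $self->{log} }, $self->_format('error', $err);
    }

    # Assumed policy: treat the document as unusable and stop.
    sub fatal_error {
        my ($self, $err) = @_;
        die $self->_format('fatal', $err) . "\n";
    }
}

my $errors = My::ErrorCollector->new;
$errors->warning({
    Message => 'unknown encoding name', LineNumber => 1, ColumnNumber => 30,
});
$errors->error({ Message => undef, EvalError => 'element "x" not declared' });
eval { $errors->fatal_error({ Message => 'mismatched tag', LineNumber => 7 }) };
my $fatal = $@;

print "$_\n" for @{ $errors->{log} };
print $fatal;
```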

DTDHandler

Receive notification of basic DTD-related events.

If a SAX application needs information about notations and unparsed entities, then the application implements this interface and provides an instance to the SAX parser using the parser's `DTDHandler' parameter. The parser uses the instance to report notation and unparsed entity declarations to the application.

The SAX parser may report these events in any order, regardless of the order in which the notations and unparsed entities were declared; however, all DTD events must be reported after the document handler's `start_document()' event, and before the first `start_element()' event.

It is up to the application to store the information for future use (perhaps in a hash table or object tree). If the application encounters attributes of type "NOTATION", "ENTITY", or "ENTITIES", it can use the information that it obtained through this interface to find the entity and/or notation corresponding with the attribute value.

Application writers who do not want to implement the entire interface can leave those methods undefined. Events whose handler methods are undefined will be ignored.

notation_decl

Receive notification of a notation declaration event.

It is up to the application to record the notation for later reference, if necessary.

If a system identifier is present, and it is a URL, the SAX parser must resolve it fully before passing it to the application.

Parameters:

Name        The notation name.
PublicId    The notation's public identifier, or undef if none was
            given.
SystemId    The notation's system identifier, or undef if none was
            given.

unparsed_entity_decl

Receive notification of an unparsed entity declaration event.

Note that the notation name corresponds to a notation reported by the `notation_decl()' event. It is up to the application to record the entity for later reference, if necessary.

If the system identifier is a URL, the parser must resolve it fully before passing it to the application.

Parameters:

Name         The unparsed entity's name.
PublicId     The entity's public identifier, or undef if none was given.
SystemId     The entity's system identifier (it must always have one).
NotationName The name of the associated notation.
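Storing the declarations in hashes keyed by name is one simple way to keep them for later lookup. The `My::DTDRecorder' class and its `lookup_entity()' helper are hypothetical, the identifiers in the example are made up, and parameters are assumed to arrive as a hash reference.

```perl
use strict;
use warnings;

{
    package My::DTDRecorder;

    sub new {
        my ($class) = @_;
        return bless { notations => {}, entities => {} }, $class;
    }

    sub notation_decl {
        my ($self, $decl) = @_;
        $self->{notations}{ $decl->{Name} } = $decl;
    }

    sub unparsed_entity_decl {
        my ($self, $decl) = @_;
        $self->{entities}{ $decl->{Name} } = $decl;
    }

    # Hypothetical helper: map an ENTITY attribute value to the entity's
    # system identifier and the system identifier of its notation.
    sub lookup_entity {
        my ($self, $name) = @_;
        my $entity = $self->{entities}{$name} or return;
        my $notation = $self->{notations}{ $entity->{NotationName} } || {};
        return ($entity->{SystemId}, $notation->{SystemId});
    }
}

my $dtd = My::DTDRecorder->new;
# Events may arrive in any order, so the entity can precede its notation.
$dtd->unparsed_entity_decl({
    Name         => 'logo',
    PublicId     => undef,
    SystemId     => 'http://example.com/logo.gif',
    NotationName => 'gif',
});
$dtd->notation_decl({
    Name => 'gif', PublicId => undef, SystemId => 'image/gif',
});

my ($file, $viewer) = $dtd->lookup_entity('logo');
print "$file ($viewer)\n";
```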

EntityResolver

Basic interface for resolving entities.

If a SAX application needs to implement customized handling for external entities, it must implement this interface and provide an instance to the SAX parser using the parser's `EntityResolver' parameter.

The parser will then allow the application to intercept any external entities (including the external DTD subset and external parameter entities, if any) before including them.

Many SAX applications will not need to implement this interface, but it will be especially useful for applications that build XML documents from databases or other specialised input sources, or for applications that use URI types other than URLs.

[Demo method omitted for now]

The application can also use this interface to redirect system identifiers to local URIs or to look up replacements in a catalog (possibly by using the public identifier).

resolve_entity

Allow the application to resolve external entities.

The Parser will call this method before opening any external entity except the top-level document entity (including the external DTD subset, external entities referenced within the DTD, and external entities referenced within the document element): the application may request that the parser resolve the entity itself, that it use an alternative URI, or that it use an entirely different input source.

Application writers can use this method to redirect external system identifiers to secure and/or local URIs, to look up public identifiers in a catalogue, or to read an entity from a database or other input source (including, for example, a dialog box).

If the system identifier is a URL, the SAX parser must resolve it fully before reporting it to the application.

Parameters:

PublicId    The public identifier of the external entity being
            referenced, or undef if none was supplied. 
SystemId    The system identifier of the external entity being
            referenced.

`resolve_entity()' returns undef to request that the parser open a regular URI connection to the system identifier or returns a hash containing the same parameters as the `Source' parameter to Parser's `parse()' method, summarized here:

PublicId    The public identifier of the external entity being
            referenced, or undef if none was supplied. 
SystemId    The system identifier of the external entity being
            referenced.
String      String containing XML text
ByteStream  An open file handle.
CharacterStream
            An open file handle.
Encoding    The character encoding, if known.

See Parser's `parse()' method for complete details on how these parameters interact.
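A resolver that redirects known public identifiers to local copies, and leaves everything else to the parser, could be sketched as follows. The `My::CatalogResolver' class, the public identifier, and the URIs are all hypothetical; parameters are assumed to arrive as a hash reference and the replacement source is returned as a hash reference.

```perl
use strict;
use warnings;

{
    package My::CatalogResolver;

    # Hypothetical catalog mapping public identifiers to local URIs.
    sub new {
        my ($class, %catalog) = @_;
        return bless { catalog => {%catalog} }, $class;
    }

    sub resolve_entity {
        my ($self, $entity) = @_;
        my $public = $entity->{PublicId};
        # Returning undef asks the parser to open the system identifier
        # itself.
        return undef
            unless defined $public && exists $self->{catalog}{$public};
        return {
            PublicId => $public,
            SystemId => $self->{catalog}{$public},
        };
    }
}

my $resolver = My::CatalogResolver->new(
    '-//Example//DTD Doc 1.0//EN' => 'file:///opt/catalog/doc.dtd',
);
my $local = $resolver->resolve_entity({
    PublicId => '-//Example//DTD Doc 1.0//EN',
    SystemId => 'http://example.com/dtd/doc.dtd',
});
my $default = $resolver->resolve_entity({
    PublicId => undef,
    SystemId => 'http://example.com/chapter1.xml',
});

print $local->{SystemId}, "\n";
print defined $default ? "custom source\n" : "parser resolves it\n";
```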

Contributors

SAX <http://www.megginson.com/SAX/> was developed collaboratively by the members of the XML-DEV mailing list. Please see the ``SAX History and Contributors'' page for the people who did the real work behind SAX.

Thanks to the following people who contributed to Perl SAX.

Eduard (Enno) Derksen
Ken MacLeod
Eric Prud'hommeaux
Larry Wall