NAME
XML::Compile::Schema - Compile a schema into CODE
INHERITANCE
XML::Compile::Schema
is a XML::Compile
SYNOPSIS
# compile tree yourself
my $parser = XML::LibXML->new;
my $tree = $parser->parse...(...);
my $schema = XML::Compile::Schema->new($tree);
# get schema from string
my $schema = XML::Compile::Schema->new($xml_string);
# get schema from file
my $schema = XML::Compile::Schema->new($filename);
# adding schemas
$schema->addSchemas($tree);
$schema->importDefinitions('http://www.w3.org/2001/XMLSchema');
$schema->importDefinitions('2001-XMLSchema.xsd');
# create and use a reader
my $read = $schema->compile(READER => '{myns}mytype');
my $hash = $read->($xml);
# create and use a writer
my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $write = $schema->compile(WRITER => '{myns}mytype');
my $xml = $write->($doc, $hash);
# show result
print $xml->toString;
# to create the type nicely
use XML::Compile::Util qw/pack_type/;
my $type = pack_type 'myns', 'mytype';
print $type; # shows {myns}mytype
DESCRIPTION
This module collects knowledge about one or more schemas. The most important method provided is compile(), which can create XML file readers and writers based on the schema information and some selected element or attribute type.
Various implementations use the translator, and more can be added later:
$schema->compile('READER', ...) translates XML to HASH
The XML reader produces a HASH from an XML::LibXML::Node tree or an XML string. Those represent the input data. The values are checked. An error is produced when a value or the data-structure does not conform to the schema.
The CODE reference which is returned can be called with anything accepted by dataToXML().
example: create an XML reader
my $msgin = $rules->compile(READER => '{myns}mytype');
# or ... = $rules->compile(READER => pack_type('myns', 'mytype'));
my $xml = $parser->parse_file("some-xml.xml");
my $hash = $msgin->($xml);
or
my $hash = $msgin->('some-xml.xml');
my $hash = $msgin->($xml_string);
my $hash = $msgin->($xml_node);
$schema->compile('WRITER', ...) translates HASH to XML
The writer produces schema compliant XML, based on a Perl HASH. To get the data encoding right, you are required to pass a document object in which the XML nodes may later be placed.
example: create an XML writer
my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
my $write = $schema->compile(WRITER => '{myns}mytype');
my $xml = $write->($doc, $hash);
print $xml->toString;
alternative
my $write = $schema->compile(WRITER => 'myns#myid');
$schema->template('XML', ...) creates an XML example
Based on the schema, this produces an XML message as example. Schemas are usually so complex that people lose overview. This example may put you back on track, and can be used as a starting point for creating the XML version of the message.
$schema->template('PERL', ...) creates a Perl example
Based on the schema, this produces a Perl HASH structure (a bit like the output by Data::Dumper), which can be used as template for creating messages. The output contains documentation, and is usually much clearer than the schema itself.
Be warned that the schema is not validated; you can develop schemas which do work well with this module, but are not valid according to W3C. In many cases, however, the translator will refuse to accept mistakes: mainly because it cannot produce valid code.
METHODS
Constructors
XML::Compile::Schema->new(TOP, OPTIONS)
Collect schema information. Details about many name-spaces can be organized with only a single schema object (actually, the data is administered in an internal XML::Compile::Schema::NameSpaces object)
Option       --Defined in    --Default
hook                           undef
hooks                          []
schema_dirs    XML::Compile    undef
. hook => ARRAY-WITH-HOOKDATA | HOOK
See addHook(). Adds one HOOK (HASH).
. hooks => ARRAY-OF-HOOK
See addHooks().
. schema_dirs => DIRECTORY|ARRAY-OF-DIRECTORIES
Accessors
$obj->addHook(HOOKDATA|HOOK|undef)
HOOKDATA is a LIST of options as key-value pairs, HOOK is a HASH with the same data. undef is ignored. See addHooks() and "Schema hooks" below.
$obj->addHooks(HOOK, [HOOK, ...])
Add multiple hooks at once. These must all be HASHes. See "Schema hooks" and addHook(). undef values are ignored.
$obj->addSchemaDirs(DIRECTORIES)
XML::Compile::Schema->addSchemaDirs(DIRECTORIES)
$obj->addSchemas(XML, OPTIONS)
Collect all the schemas defined in the XML data. The XML parameter must be an XML::LibXML node, therefore it is advised to use importDefinitions(), which has a much more flexible way to specify the data.
No OPTIONS are defined at the moment.
$obj->findSchemaFile(FILENAME)
$obj->hooks
Returns the LIST of defined hooks (as HASHes).
$obj->importDefinitions(XMLDATA, OPTIONS)
Import (include) the schema information included in the XMLDATA. The XMLDATA must be acceptable for dataToXML(). The resulting node and the OPTIONS are passed to addSchemas().
$obj->knownNamespace(NAMESPACE|PAIRS)
XML::Compile::Schema->knownNamespace(NAMESPACE|PAIRS)
$obj->namespaces
Returns the XML::Compile::Schema::NameSpaces object which is used to collect schemas.
Read XML
$obj->dataToXML(NODE|REF-XML-STRING|XML-STRING|FILENAME|KNOWN)
Filters
$obj->walkTree(NODE, CODE)
Compilers
$obj->compile(('READER'|'WRITER'), TYPE, OPTIONS)
Translate the specified ELEMENT (found in one of the read schemas) into a CODE reference which is able to translate between XML-text and a HASH. When the TYPE is undef, an empty LIST is returned.
The indicated TYPE is the starting-point for processing in the data-structure, a toplevel element or attribute name. The name must be specified in {url}name format, where the url is the name-space. An alternative is the url#id which refers to an element or type with the specific id attribute value.
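The two notations can be illustrated with plain string handling. The sketch below is only an illustration: the helpers expanded_name, id_name and split_expanded_name are hypothetical names made up for this example, and they are not part of XML::Compile (the SYNOPSIS shows pack_type from XML::Compile::Util as the supported way to build a type name).

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; not part of XML::Compile.

# Build the {url}name notation from a name-space and a local name.
sub expanded_name {
    my ($ns, $local) = @_;
    return '{' . $ns . '}' . $local;
}

# Build the url#id notation from a name-space and an id value.
sub id_name {
    my ($ns, $id) = @_;
    return $ns . '#' . $id;
}

# Split {url}name back into its name-space and local name.
sub split_expanded_name {
    my $type = shift;
    return $type =~ m/^\{(.*?)\}(.+)$/ ? ($1, $2) : (undef, $type);
}

my $type = expanded_name('http://library', 'book');
my $id   = id_name('http://www.w3.org/2001/XMLSchema', 'float');
my ($ns, $local) = split_expanded_name($type);

print "$type\n$id\n$ns $local\n";
```

Either string could then be passed as TYPE to compile(), provided the schema actually defines such an element or id.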
When a READER is created, a CODE reference is returned which needs to be called with XML, as accepted by XML::Compile::dataToXML(). Returned is a nested HASH structure which contains the data from the XML. The transformation rules are explained below.
When a WRITER is created, a CODE reference is returned which needs to be called with an XML::LibXML::Document object and a HASH, and returns an XML::LibXML::Node.
Most options below are explained in more detail in the manual-page XML::Compile::Schema::Translate, which implements the compilation.
Option                --Default
anyAttribute            undef
anyElement              undef
attributes_qualified    <undef>
check_occurs            <false>
check_values            <true>
elements_qualified      <undef>
hook                    undef
hooks                   undef
ignore_facets           <false>
include_namespaces      <true>
namespace_reset         <false>
output_namespaces       {}
path                    <expanded name of type>
permit_href             <false>
sloppy_integers         <false>
use_default_prefix      <false>
. anyAttribute => CODE
In general, anyAttribute schema components cannot be handled automatically. If you need to create or process anyAttribute information, then read about wildcards in the DETAILS chapter of the manual-page for the specific back-end.
. anyElement => CODE
In general, any schema components cannot be handled automatically. If you need to create or process any information, then read about wildcards in the DETAILS chapter of the manual-page for the specific back-end.
. attributes_qualified => BOOLEAN
When defined, this will overrule the attributeFormDefault flags in all schemas. When not qualified, prefixes on attributes are neither produced nor processed.
. check_occurs => BOOLEAN
Whether code will be produced to do bounds checking on elements and blocks which may appear more than once. Without these checks, an element for which the schema says that maxOccurs is 1 becomes optional. When the schema says that maxOccurs is larger than 1, then the output is still always an ARRAY, but now of unrestricted length.
. check_values => BOOLEAN
Whether code will be produced to check that the XML fields contain the expected data format.
Turning this off will improve the processing speed significantly, but is (of course) much less safe. Do not turn it off when you expect data from external sources: validation is a crucial requirement for XML.
. elements_qualified => TOP|ALL|NONE|BOOLEAN
When defined, this will overrule the elementFormDefault flags in all schemas. When TOP is specified, at least the top-element will be name-space qualified. When ALL or a true value is given, then all elements will be used qualified. When NONE or a false value is given, prefixes on the elements are neither produced nor processed.
The form attributes will be respected, except on the top element when TOP is specified. Use hooks when you need to fix name-space use in more subtle ways.
. hook => HOOK|ARRAY-OF-HOOKS
Define one or more processing hooks. See "Schema hooks" below. These hooks are only active for this compiled entity, where addHook() and addHooks() can be used to define hooks which are used for all results of compile(). The hooks specified with the hook or hooks option are run before the global definitions.
. hooks => HOOK|ARRAY-OF-HOOKS
Alternative for option hook.
. ignore_facets => BOOLEAN
Facets influence the formatting and range of values. This does not come cheap, so can be turned off. It affects the restrictions set for a simpleType. The processing speed will improve, but validation is a crucial requirement for XML: please do not turn this off when the data comes from external sources.
. include_namespaces => BOOLEAN
Indicates whether the WRITER should include the prefix to namespace translation on the top-level element of the returned tree. If not, you may continue with the same name-space table to combine various XML components into one, and add the namespaces later.
. namespace_reset => BOOLEAN
Use the same prefixes in output_namespaces as with some other compiled piece, but reset the counts to zero first.
. output_namespaces => HASH|ARRAY-of-PAIRS
Can be used to predefine an output namespace (when 'WRITER') for instance to reserve common abbreviations like soap for external use. Each entry in the hash has as key the namespace uri. The value is a hash which contains uri, prefix, and used fields. Pass a reference to a private hash to catch this index.
. path => STRING
Prepended to each error report, to indicate the location of the error in the XML-Schema tree.
. permit_href => BOOLEAN
When parsing SOAP-RPC encoded messages, the elements may have a href attribute, pointing to an object with id. The READER will return the unparsed, unresolved node when the attribute is detected, and the SOAP-RPC decoder will have to discover and resolve it.
. sloppy_integers => BOOLEAN
The decimal and integer types must support at least 18 digits, which is larger than Perl's 32 bit internal integers. Therefore, the implementation will use Math::BigInt objects to handle them. However, often a simple int type would have sufficed, but the XML designer was lazy. A long is much faster to handle. Set this flag to use int as fast (but imprecise) replacements.
Be aware that Math::BigInt and Math::BigFloat objects are nearly, but not fully, transparent in mimicking the behavior of Perl's ints and floats. See their respective manual-pages. Especially when you wish for some performance, you should optimize access to these objects to avoid expensive copying, which is exactly the spot where the differences are.
. use_default_prefix => BOOLEAN
When mixing qualified and unqualified namespaces, then the use of a default prefix can be quite confusing. Therefore, by default, all qualified elements will have an explicit prefix.
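As a side note to the sloppy_integers option above, the following sketch shows the kind of difference between Math::BigInt objects and native integers that the text warns about. It uses only the core Math::BigInt module and does not involve XML::Compile; the 21-digit value is an arbitrary example.

```perl
use strict;
use warnings;
use Math::BigInt;

# A value with more digits than a native integer can hold exactly
my $big = Math::BigInt->new('123456789012345678901');

# The overloaded '+' returns a new object and leaves $big untouched
my $next = $big + 1;

# The method interface works in place: copy first when the
# original value is still needed
my $copy = $big->copy;
$copy->badd(1);

print "$big\n$next\n$copy\n";
```

The overloaded operators create a new object on each operation, while methods like badd() change the object itself, so code which mixes both styles has to keep track of which values are shared.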
$obj->elements
List all elements defined by all schemas, sorted alphabetically.
$obj->template('XML'|'PERL', TYPE, OPTIONS)
WARNING: under development! The implementation is far from complete.
Schemas can be horribly complex and unreadable. Therefore, this template method can be called to create an example which demonstrates how data of the specified TYPE as XML or Perl is organized in practice.
Some OPTIONS are explained in XML::Compile::Schema::Translate. There are some extra OPTIONS defined for the final output process.
Option                --Default
attributes_qualified    <undef>
elements_qualified      <undef>
include_namespaces      <true>
indent                  " "
show                    ALL
. attributes_qualified => BOOLEAN
. elements_qualified => ALL|TOP|NONE|BOOLEAN
. include_namespaces => BOOLEAN
. indent => STRING
The leading indentation string per nesting. Must start with at least one blank.
. show => STRING|'ALL'|'NONE'
A comma separated list of tokens, which explain what kind of comments need to be included in the output. The available tokens are: struct, type, occur, facets. A value of ALL will select all available comments. The NONE or empty string will exclude all comments.
$obj->types
List all types defined by all schemas, sorted alphabetically.
DETAILS
Comparison
Addressing components
Normally, external users can only address elements within a schema, and types are hidden to be used by other schemas only. For this reason, it is permitted to create an element and a type with the same name.
The compiler requires a starting-point. This can either be an element name or an element's id. The format of the element name is {url}name, for instance
{http://library}book
refers to the element book in the name-space http://library. You may also start with an id, like
http://www.w3.org/2001/XMLSchema#float
as long as this ID refers to an element.
Representing data-structures
The code will do its best to produce a correct translation. For instance, an accidental 1.9999 will be converted into 2 when the schema says that the field is an int. It will also strip superfluous blanks when the data-type permits. Watch out especially for the Integer types, which produce Math::BigInt objects unless compile(sloppy_integers) is used.
Elements can be complex, and themselves contain elements which are complex. In the Perl representation of the data, this will be shown as nested hashes with the same structure as the XML.
You do not have to take care of character encodings, because XML::LibXML does that for you: do not escape characters like "<" yourself.
The schemas define kinds of data types. There are various ways to define them (with restrictions and extensions), but that knowledge is not important for the resulting data structure.
- simpleType
A single value. A lot of single value data-types are built-in (see XML::Compile::Schema::BuiltInTypes).
Simple types may have range limiting restrictions (facets), which will be checked by default. Types may also have some white-space behavior, for instance blanks are stripped from integers: before, after, but also inside the number representing string.
Note that some of the reader hooks will alter the single value of these elements into a HASH like used for the complexType/simpleContent (next paragraph), to be able to return some extra collected information.
example: typical simpleType
In XML, it looks like this:
<test1>42</test1>
In the HASH structure, the data will be represented as
test1 => 42
With the reader hook after => 'XML_NODE' applied, it will become
test1 => { _ => 42 , _XML_NODE => $obj }
- complexType/simpleContent
In this case, the single value container may have attributes. There can be any number of attributes, but only one value. This value has no name, and therefore gets a predefined name _.
example: typical simpleContent example
In XML, this looks like this:
<test2 question="everything">42</test2>
As a HASH, this looks like
test2 => { _ => 42 , question => 'everything' }
- complexType and complexType/complexContent
These containers not only have attributes, but also multiple values as content. The complexContent is used to create inheritance structures in the data-type definition. This does not affect the XML data package itself.
example: typical complexType element
The XML could look like:
<test3 question="everything" by="mouse">
<answer>42</answer>
<when>5 billion BC</when>
</test3>
Represented as HASH, this looks like
test3 => { question => 'everything' , by => 'mouse' , answer => 42 , when => '5 billion BC' }
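To show how the three representations above are used from Perl, here is a small sketch. The data is typed in by hand in the shape the examples describe, so no schema is compiled, and the helper plain_value is a hypothetical name introduced only for this illustration.

```perl
use strict;
use warnings;

# Data in the shape of the three examples above (written by hand here)
my $data = {
    test1 => 42,
    test2 => { _ => 42, question => 'everything' },
    test3 => { question => 'everything', by => 'mouse',
               answer => 42, when => '5 billion BC' },
};

# Hypothetical helper: return the plain value of a simpleType, or of a
# simpleContent whose value is stored under the predefined name '_'
sub plain_value {
    my $v = shift;
    return ref $v eq 'HASH' ? $v->{'_'} : $v;
}

my $value1 = plain_value($data->{test1});
my $value2 = plain_value($data->{test2});
my $answer = $data->{test3}{answer};

print "$value1 $value2 $answer\n";
```

A helper like this may also be convenient when a reader hook turns a simpleType value into a HASH, as described for the simpleType above.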
Processing
A second factor which determines the data-structure is the element occurrence. Usually, elements have to appear exactly once at a certain location in the XML data structure. This order is automatically produced by this module. But elements may appear multiple times.
- usual case
The default behavior for an element (in a sequence container) is to appear exactly once. When missing, this is an error.
- maxOccurs larger than 1
In this case, the element or particle block can appear multiple times. Multiple values are kept in an ARRAY within the HASH. Non-schema based XML modules do not return a single value as an ARRAY, which makes that code more complicated. But in our case, we know the expected amount beforehand.
When the maxOccurs larger than 1 is specified for an element, an ARRAY of those elements is produced. When it is specified for a block (sequence, choice, all, group), then an ARRAY of HASHes is returned. See the special section about the subject.
An error is produced when the number of elements found is less than minOccurs (defaults to 1) or more than maxOccurs (defaults to 1), unless compile(check_occurs) is false.
example: elements with maxOccurs > 1
In the schema:
<element name="a" type="int" maxOccurs="unbounded" />
<element name="b" type="int" />
In the XML message:
<a>12</a><a>13</a><b>14</b>
In the Perl representation:
a => [12, 13], b => 14
- value is NIL
When an element is nillable, that is explicitly represented as a NIL constant string.
- use="optional" or minOccurs="0"
The element may be skipped. When found it is a single value.
- use="forbidden"
When the element is found, an error is produced.
- default="value"
When the XML does not contain the element, the default value is used... but only if this element's container exists. This has no effect on the writer.
- fixed="value"
Produce an error when the value is not present or different (after the white-space rules were applied).
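The occurrence rules above determine how the resulting HASH is walked. The sketch below uses the data of the maxOccurs example, written by hand, and adds a nillable element c which is purely hypothetical and only there to show the NIL test.

```perl
use strict;
use warnings;

# 'a' has maxOccurs > 1 and is therefore an ARRAY, 'b' is a single value.
# 'c' is a hypothetical nillable element, added for this illustration.
my $data = { a => [12, 13], b => 14, c => 'NIL' };

# Repeated elements are always an ARRAY, so no need to test ref()
my $sum = 0;
$sum += $_ for @{ $data->{a} };

# A nil element is represented by the constant string NIL
my $c_is_nil = (defined $data->{c} && $data->{c} eq 'NIL') ? 1 : 0;

print "sum of a: $sum, b: $data->{b}, c is nil: $c_is_nil\n";
```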
Repetitive blocks
Particle blocks come in four shapes: sequence, choice, all, and group (an indirect block). In situations like this:
<element name="example">
<complexType>
<sequence>
<element name="a" type="int" />
<sequence>
<element name="b" type="int" />
</sequence>
<element name="c" type="int" />
</sequence>
</complexType>
</element>
(yes, schemas are verbose) the data structure is
<example> <a>1</a> <b>2</b> <c>3</c> </example>
the Perl representation is flattened, into
example => { a => 1, b => 2, c => 3 }
Ok, this is very simple. However, schemas can use repetition:
<element name="example">
<complexType>
<sequence>
<element name="a" type="int" />
<sequence minOccurs="0" maxOccurs="unbounded">
<element name="b" type="int" />
</sequence>
<element name="c" type="int" />
</sequence>
</complexType>
</element>
The XML message may be:
<example> <a>1</a> <b>2</b> <b>3</b> <b>4</b> <c>5</c> </example>
Now, the Perl representation needs to produce an array of the data in the repeated block. This array needs to have a name, because more of these blocks may appear together in a construct. The name of the block is derived from the type of block and the name of the first element in the block, regardless of whether that element is present in the data or not. See XML::Compile::Util::block_label(). So, the above data is translated into (and vice versa)
example =>
{ a => 1
, seq_b => [ {b => 2}, {b => 3}, {b => 4} ]
, c => 5
}
example: always an array with maxOccurs > 1
Even when there is only one element found, it will be returned as ARRAY (of one element). Therefore, you can write
my $data = $reader->($xml);
foreach my $a ( @{$data->{a}} ) {...}
example: blocks with maxOccurs > 1
In the schema:
<sequence maxOccurs="5">
<element name="a" type="int" />
<element name="b" type="int" />
</sequence>
In the XML message:
<a>15</a><b>16</b><a>17</a><b>18</b>
In Perl representation:
seq_a => [ {a => 15, b => 16}, {a => 17, b => 18} ]
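A repeated block is processed as an ARRAY of HASHes, one HASH per repetition. The following sketch walks the seq_a structure of the example above; the data is written by hand, so no reader is involved.

```perl
use strict;
use warnings;

# The repeated sequence from the example: one HASH per repetition
my $data = { seq_a => [ { a => 15, b => 16 }, { a => 17, b => 18 } ] };

my @pairs;
for my $block (@{ $data->{seq_a} }) {
    # each repetition carries the elements of the block by name
    push @pairs, "$block->{a}/$block->{b}";
}

print join(', ', @pairs), "\n";
```

The same nested structure is what a writer would expect as input for such a block.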
List type
List simpleType objects are also represented as ARRAY, like elements with a minOccurs or maxOccurs unequal to 1.
example: with a list of ints
<test5>3 8 12</test5>
as Perl structure:
test5 => [3, 8, 12]
substitutionGroup
A substitution group is a kind of choice between alternative (complex) types. However, in this case roles have reversed: instead of a choice which lists the alternatives, here the alternative elements register themselves as valid for an abstract (head) element. All alternatives should be extensions of the head element's type, but there is no way to check that.
example: substitutionGroup
<xs:element name="price" type="xs:int" abstract="true" />
<xs:element name="euro" type="xs:int" substitutionGroup="price" />
<xs:element name="dollar" type="xs:int" substitutionGroup="price" />
<xs:element name="product">
<xs:complexType>
<xs:sequence>
<xs:element name="name" type="xs:string" />
<xs:element ref="price" />
</xs:sequence>
</xs:complexType>
</xs:element>
Now, valid XML data is
<product>
<name>Ball</name>
<euro>12</euro>
</product>
and
<product>
<name>Ball</name>
<dollar>6</dollar>
</product>
The HASH representation is respectively
product => {name => 'Ball', euro => 12}
product => {name => 'Ball', dollar => 6}
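Because only one of the alternatives shows up in the HASH, code which consumes the data has to find out which one is present. A minimal sketch, using the two product HASHes above written by hand; the list of alternatives is maintained manually here and is an assumption of this example, not something the module provides.

```perl
use strict;
use warnings;

# The alternatives registered for the abstract head element "price"
my @alternatives = qw/euro dollar/;

my @products = (
    { name => 'Ball', euro   => 12 },
    { name => 'Ball', dollar => 6  },
);

my @lines;
for my $product (@products) {
    # exactly one of the alternatives is expected to be present
    my ($currency) = grep { exists $product->{$_} } @alternatives;
    push @lines, "$product->{name}: $product->{$currency} $currency";
}

print "$_\n" for @lines;
```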
Wildcards
The any and anyAttribute elements are referred to as wildcards: they specify groups of elements and attributes which can be used, instead of being explicit.
The author of this module advises against the use of wildcards in schemas, because the purpose of schemas is to be explicit about the structure of the message, and that basic idea is simply thrown away by these wildcards. Let people cleanly extend the schema with inheritance! If you use a standard schema which facilitates these wildcards, then please do not use them!
Because wildcards are not explicit about the types to expect, the XML::Compile module cannot prepare for them automatically. However, as user of the schema you probably know better about the possible contents of these fields. Therefore, you can translate that knowledge into code explicitly. Read about the processing of wildcards in the manual page for each of the back-ends, because it is different in each case.
Schema hooks
You can use hooks, for instance, to block processing parts of the message, to create work-arounds for schema bugs, or to extract more information during the process than done by default.
defining hooks
Multiple hooks can be active during the compilation process of a type, when compile() is called. During Schema translation, each of the hooks is checked for all types which are processed. When multiple hooks select the object to get a modified behavior, then all are evaluated in order of definition.
Defining a global hook (where HOOKDATA is the LIST of PAIRS with hook parameters, and HOOK a HASH with such HOOKDATA):
my $schema = XML::Compile::Schema->new
( ...
, hook => HOOK
, hooks => [ HOOK, HOOK ]
);
$schema->addHook(HOOKDATA | HOOK);
$schema->addHooks(HOOK, HOOK, ...);
my $wsdl = XML::Compile::WSDL->new(...);
$wsdl->schemas->addHook(HOOKDATA | HOOK);
Local hooks are only used for one reader or writer. They are evaluated before the global hooks.
my $reader = $schema->compile(READER => $type
, hook => HOOK, hooks => [ HOOK, HOOK, ...]);
example: of HOOKs:
my $hook = { type => '{my_ns}my_type'
, before => sub { ... }
};
my $hook = { path => qr/\(volume\)/
, replace => 'SKIP'
};
# path contains "(volume)" or id is 'aap' or id is 'noot'
my $hook = { path => qr/\(volume\)/
, id => [ 'aap', 'noot' ]
, before => [ sub {...}, sub { ... } ]
, after => sub { ... }
};
general syntax
Each hook has two kinds of parameters: selectors and processors. Selectors define the schema component of which the processing is modified. When one of the selectors matches, the processing information for the hook is used. When no selector is specified, then the hook will be used on all elements.
Available selectors (see below for details on each of them): type, id, and path.
As argument, you can specify one element as STRING, a regular expression to select multiple elements, or an ARRAY of STRINGs and REGEXes.
Next to where the hook is placed, we need to know what to do in that case: the hook contains processing information. When more than one hook matches, then all of these processors are called in order of hook definition. However, first the compile hooks are taken, and then the global hooks.
How the processing works exactly depends on the compiler back-end. There are major differences. Each of those manual-pages lists the specifics. The label tells us when the processing is initiated. Available labels are before, replace, and after.
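The selector argument forms described above (a STRING, a regular expression, or an ARRAY of both) can be pictured with the following sketch. The function selector_matches is a hypothetical name for this illustration only; it is not the module's implementation, and it compares a STRING against the full name only, which is simpler than what the back-ends may do.

```perl
use strict;
use warnings;

# Hypothetical matcher, for illustration only: returns true when $name
# is selected by a STRING, a REGEX, or an ARRAY of STRINGs and REGEXes.
sub selector_matches {
    my ($selector, $name) = @_;
    my @tests = ref $selector eq 'ARRAY' ? @$selector : ($selector);

    for my $test (@tests) {
        my $hit = (ref $test eq 'Regexp') ? ($name =~ $test)
                                          : ($name eq $test);
        return 1 if $hit;
    }
    return 0;
}

my @results = (
    selector_matches('int', 'int'),
    selector_matches(qr/\}xml_/, '{my_ns}xml_thing'),
    selector_matches([ 'int', qr/float$/ ], '{my_ns}float'),
    selector_matches('int', 'float'),
);

print join(' ', @results), "\n";
```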
hooks on matching types
The type selector specifies a complexType or simpleType by name. Best is to base the selection on the full name, like {ns}type, which will avoid all kinds of name-space conflicts in the future. However, you may also specify only the type (in any name-space). Any REGEX will be matched to the full type name. Be careful with the pattern anchors.
example: use of the type selector
type => 'int'
type => '{http://www.w3.org/2000/10/XMLSchema}int'
type => qr/\}xml_/ # type name starts with xml_
type => [ qw/int float/ ];
use XML::Compile::Util qw/pack_type SCHEMA2000/;
type => pack_type(SCHEMA2000, 'int')
hooks on matching ids
Matching based on IDs can reach more schema elements: some types are anonymous but still have an ID. Best is to base selection on the full ID name, like ns#id, to avoid all kinds of name-space conflicts in the future.
example: use of the ID selector
# default schema types have id's with same name
id => 'int'
id => 'http://www.w3.org/2001/XMLSchema#int'
id => qr/\#xml_/ # id which starts with xml_
id => [ qw/int float/ ];
use XML::Compile::Util qw/pack_id SCHEMA2001/;
id => pack_id(SCHEMA2001, 'int')
hooks on matching paths
When you see error messages, you always see some representation of the path where the problem was discovered. You can use this path as selector, when you know what it is... BE WARNED that the current structure of the path is not really consistent, and will therefore be improved in one of the future releases, breaking backwards compatibility.
DIAGNOSTICS
Error: cannot find pre-installed name-space files
Use $ENV{SCHEMA_LOCATION} or new(schema_dirs) to express the location of installed name-space files, which came with the XML::Compile distribution package.
Error: don't known how to interpret XML data
SEE ALSO
This module is part of XML-Compile distribution version 0.61, built on November 27, 2007. Website: http://perl.overmeer.net/xml-compile/
LICENSE
Copyrights 2006-2007 by Mark Overmeer. For other contributors see ChangeLog.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See http://www.perl.com/perl/misc/Artistic.html