<html>
<head>
<title>XML::Comma Guide</title>
<style type="text/css">
body { background-color: white }
h1 { font-size: 150%; text-align: left }
h2 { font-size: 110% }
h3 { font-size: 100% }
p { font-size: 100% }
pre { background-color: #ddddff }
</style>
</head>
<body>
<!-- last synced with svn rev. 1540 -->
<h1>Introduction</h1>
<p> XML::Comma is an information management platform. Comma speeds the
development of content-heavy, networked applications, and was designed
to solve some of the problems that make managing extremely large web
sites so expensive, difficult and tedious. </p>
<p> Comma is written mostly in Perl, and its target demographic is the
Perl programmer who must build customized, complex systems that handle
very large amounts of dynamic content. Like most software that is
designed to be used by programmers to build other software, Comma is
several things at once: a code library, a design framework, a
development methodology and a runtime system all rolled into
one. However, Comma's central philosophy is "play well with others,"
and the system depends heavily on a number of tools -- the Apache web
server and its mod_perl extensions, the HTML::Mason web development
environment, relational databases, the underlying filesystem and OS
utilities -- to implement its functionality and to provide
programmers with a complete, flexible, scalable, and familiar
toolkit. </p>
<p> Comma shapes information into "documents," and -- as its (full)
name implies -- uses XML to structure those documents. XML, like Perl,
is a powerful and standard tool for organizing text. But XML, again
like Perl, doesn't do much of anything by itself. Comma defines a
number of discrete "processes" in the "life-cycle" of a document and
provides a framework that abstracts basic activities common to those
processes. These activities include structuring and validation;
long-term storage; programmatic manipulation; and indexing for fast
sorting, categorization and retrieval. </p>
<p> This document describes Comma 2 (version 1.90 and greater). <b>For
documentation of Comma versions earlier than 1.90, please view <a
href="http://xml-comma.org/guide-1.x-filter.html">this page</a>
instead</b>. For more information about the different branches,
please see the list of <a href="comma-2.html">Comma 2.0
enhancements</a>. </p>
<p> This document is available online: </p>
<ul>
<li> HTML: <a href="http://xml-comma.org/guide-filter.html">http://xml-comma.org/guide-filter.html</a> </li>
<!--
<li> PDF: <a href="http://xml-comma.org/guide.pdf">http://xml-comma.org/guide.pdf</a> </li>
-->
</ul>
<h1>Installation</h1>
<h2>Dependencies</h2>
<p> XML::Comma requires that Perl, a number of CPAN modules, and a
relational database be installed in order to function properly. The
Perl version must be 5.005 or greater. The basic required CPAN modules
(more may be used by additional parts of Comma) are
<b>Class::ClassDecorator</b>,
<b>Compress::Zlib</b>,
<b>Crypt::Blowfish</b>,
<b>Crypt::CBC</b>,
<!-- <b>Crypt::Twofish</b> no longer required-->
<b>DBI</b>,
<b>Digest::HMAC_MD5</b>,
<b>Inline</b>,
<b>Lingua::Stem</b>,
<b>Math::BaseCalc</b>,
<b>PAR</b>,
<b>Proc::Exists</b>,
<b>Storable</b>,
<b>String::CRC</b>
and the <b>DBD::</b> module that matches your database.
Currently, the database must be MySQL or PostgreSQL; MySQL has much
more robust support in the Comma framework. SQLite support has been
started, but it is not yet complete or stable and cannot be used in
production. Support for other databases can be made available
whenever someone asks for it. </p>
<p> Comma installs in the usual <code>perl Makefile.PL</code>,
<code>make</code>, <code>make test</code> and <code>make install</code>
fashion. If you don't have an existing Comma install,
<code>Makefile.PL</code> will ask you a series of questions
about the configuration of your SQL backend, how you want various
Comma features to work, and so on, and will store the answers in
<code>Comma/Configuration.pm</code> (which can be examined and overridden
by hand, if desired, before the <code>make</code> step).</p>
<p> <code>Comma/Configuration.pm</code> contains a package declaration
and then a <code>__DATA__ </code> section divider. Everything after
the <code>__DATA__</code> line is configuration information, in the
form of a big list of eval'able key/value pairs. Each key specifies
the name of a configuration variable, and each value is accessible as
a top-level Comma method, for example: </p>
<pre>
# the top of my Configuration.pm looks like this:
package XML::Comma::Configuration;
use base 'XML::Comma::Pkg::ModuleConfiguration'; 1;
__DATA__
comma_root => '/usr/local/comma',
log_file => '/usr/local/comma/log.comma',
document_root => '/usr/local/comma/docs',
sys_directory => '/usr/local/comma/sys',
tmp_directory => '/tmp',
defs_directories =>
[
'/allafrica/comma/defs',
'/usr/local/comma/defs',
'/usr/local/comma/defs/macros',
'/usr/local/comma/defs/standard',
'/usr/local/comma/defs/test'
],
###
###
# so, on my system, this assigns '/usr/local/comma' to $str
my $str = XML::Comma->comma_root();
# and, similarly
my $first_defs_directory = XML::Comma->defs_directories()->[0];
</pre>
<p> The <code>Configuration.pm</code> file that comes with the
distribution fully specifies all of the possible configuration
variables, and includes reasonable defaults for all those for which
reasonable defaults are likely. Just think of the configuration block
as a big hash assignment -- so pretty much any Perl code is, at least
theoretically, allowed. </p>
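<p> To make the "big hash assignment" idea concrete, here is a minimal
sketch of how a block of key/value text can be eval'ed into a hash. It
is not Comma's actual loading code; the <code>$config_text</code>
string and the <code>%config</code> hash are names invented for this
illustration. </p>
<pre>
use strict;
use warnings;

# hypothetical stand-in for the text that follows __DATA__
my $config_text = q{
  comma_root    => '/usr/local/comma',
  tmp_directory => '/tmp',
  log_file      => '/usr/local/comma' . '/log.comma',  # any Perl expression works
  defs_directories => [ '/usr/local/comma/defs',
                        '/usr/local/comma/defs/macros' ],
};

# evaluate the text as the body of a hash assignment
my %config = eval "( $config_text )";
die "bad configuration: $@" if $@;

print $config{comma_root}, "\n";
print $config{log_file}, "\n";
print $config{defs_directories}->[0], "\n";
</pre>
<p> Because the values are ordinary Perl expressions, a mistake in the
block shows up as a compile error in <code>$@</code> rather than as a
silently missing setting. </p>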
<h2>Configuration Variables</h2>
<ul>
<li> <b>comma_root</b> -- The base for Comma's directory tree. Comma
stores information in various subdirectories under its root, all of
which can also be specified independently. </li>
<li> <b>log_file</b> -- The file to which Comma will write its error
and warning log messages. This is usually a file under
<b>comma_root</b>. </li>
<li> <b>document_root</b> -- This is the default directory below which
Comma documents will be found. This can actually be overridden in each
store definition, but you will usually rely on the configuration
default to be the base directory for document storage. The permissions
on this directory must be set so that each user of the system has the
read or write access that they will need to retrieve or store
documents. </li>
<li> <b>sys_directory</b> -- This directory is available to the Comma
system for internal storage of long-term data. All of Comma's modules
that use Inline::C, for example, use this as a build directory. </li>
<li> <b>tmp_directory</b> -- This directory is available to the Comma
system, and to any Comma code, for temporary storage. Most folks use
<code>/tmp</code>. </li>
<li> <b>defs_directories</b> -- This is an anonymous array of
directories that contain document definitions and macros. Each
directory must be explicitly listed; Comma does not recurse into
subdirectories looking for definitions. The directories are searched
in the order that they occur in this list (which is relevant only if
you are worrying about naming collisions; definition loading happens
infrequently enough that speed is not a concern). </li>
<li> <b>defs_from_PARs</b> -- This is a boolean value ("1" or "0" is
strongly preferred) that specifies whether Defs may be loaded from
PAR files. Normally "1". </li>
<li> <b>defs_extension</b> -- The extension that Comma expects
definition files to have. The convention is <code>.def</code>, which
means that the <i>Foo</i> def will be found in a file called
<code>Foo.def</code>. </li>
<li> <b>macro_extension</b> -- Like <b>defs_extension</b>, only for
macros. Normally <code>.macro</code>. </li>
<li> <b>include_extension</b> -- Like <b>defs_extension</b>, only for
includes. The convention is <code>.include</code>. </li>
<li> <b>validate_new</b> -- A boolean value (0 or 1) that determines
whether <b>validate()</b> is called on a document created with
<code>XML::Comma::Doc->new()</code> from either a <i>file</i> or a
<i>block</i> argument. This can be overridden for a single call by
passing a <i>validate</i> argument, e.g.
<code>XML::Comma::Doc->new(block => "...", validate => 0)</code>
(or <code>validate => 1</code>).
In versions of Comma greater than 1.20 the default is to validate
these documents (but note that an existing Configuration.pm is not
overwritten when you upgrade, which preserves the old behavior). </li>
<li> <b>parser</b> -- The parser module that Comma will use. The two
standard choices are <code>PurePerl</code> and
<code>SimpleC</code>. The PurePerl module is written entirely in Perl,
so should work on any system and without any installation
headaches. The SimpleC module is faster, uses Brian Ingerson's really
nifty Inline framework, and like all such things May Not Work For
You. See the notes below. </li>
<li> <b>hash_module</b> -- Comma generates checksums for
documents. These checksums are used internally by the system, and are
also available via the Doc <b>get_hash()</b> method. You can use any
module that adheres to the CPAN "Digest"
interface. <code>Digest::MD5</code> is a good choice. </li>
<li> <b>system_db</b> -- Comma needs to know which database you're
using, and how to connect to it. Normally, <b>system_db</b> is
specified through one layer of indirection, as a string pointing to
another configuration entry that holds a hashref. There are examples
in the distribution file for both MySQL and Postgres
connections. </li>
</ul>
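<p> As a rough illustration of what "adheres to the CPAN Digest
interface" means for <b>hash_module</b>, the sketch below drives
<code>Digest::MD5</code> through the standard <code>new()</code>,
<code>add()</code> and <code>hexdigest()</code> calls. How Comma itself
invokes the module is not shown here; the <code>$hash_module</code>
variable and the sample text are assumptions made for the example. </p>
<pre>
use strict;
use warnings;
use Digest::MD5;

# any module offering new(), add() and hexdigest() can be used the same way
my $hash_module = 'Digest::MD5';

my $digest = $hash_module->new();
$digest->add ( 'some document text' );   # illustrative input only
my $checksum = $digest->hexdigest();
print "$checksum\n";
</pre>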
<h2>Using the SimpleC Parser</h2>
<p> The SimpleC parser module requires that the Inline and Inline::C
modules be installed on your system. After editing
<code>Comma/Configuration.pm</code> to specify SimpleC as the system parser, run
<code>make test</code> as root. The test scripts should attempt to
compile SimpleC and cache the results in Comma's tmp directory. If all
goes well, the compiled module will be available to all users of the
system. It must be admitted, however, that we have abused the Inline
mechanisms a bit to achieve the dynamic loading that Comma's config
methods require. If Inline::C passes all its tests, but SimpleC
doesn't work for you, don't hesitate to let us know. </p>
<h1>Documents and DocumentDefinitions</h1>
<p> An XML::Comma system stores pieces of information as
<i>Documents</i>. The structure and basic behaviors of the Documents
in each system are described by <i>DocumentDefinitions</i>. This
section introduces Documents and DocumentDefinitions. We will mostly
refer to Documents as <i>Docs</i> and DocumentDefinitions as
<i>Defs</i>; this saves typing and is consistent with the Perl
API. </p>
<h2> A Simple Doc and Def </h2>
<p> Here is a simple sample Doc, showing the beginnings of a structure
that could be used to keep track of information about a registered
user of a web site. We'll use this example as we go along, adding
features and providing example pieces of code. </p>
<pre>
<User>
<username>kwindla</username>
<email>kwindla@xymbollab.com</email>
<full_name>Kwindla Hultman Kramer</full_name>
</User>
</pre>
<p> That's pretty self-explanatory. The whole thing is XML, with a
very simple structure. Here is the corresponding Def: </p>
<pre>
<DocumentDefinition>
<name>User</name>
<element><name>username</name></element>
<element><name>email</name></element>
<element><name>full_name</name></element>
</DocumentDefinition>
</pre>
<p> Still pretty simple, so far. For your Comma installation to
recognize Docs of the <i>User</i> type, it suffices to put the above
Def in a file called <b>User.def</b> somewhere down the
<b>defs_directories</b> path. If you're following along at the
keyboard, you can do that, now, and you'll be able to try out the code
examples that follow. </p>
<h2> Basic Manipulation: new(), element(), set() and get() </h2>
<p> The most basic parts of the Comma API are the methods that
manipulate the elements of a Doc. Let's write a little Perl program to
make an "empty" User Doc, set its three elements, and then print the
result: </p>
<pre>
use XML::Comma;
my $doc = XML::Comma::Doc->new ( type=>'User' );
$doc->element('username')->set ( 'kwindla' );
$doc->element('email')->set ( 'kwindla@xymbollab.com' );
$doc->element('full_name')->set ( 'Kwindla Hultman Kramer' );
print $doc->to_string();
</pre>
<p> Running that program should print out something very similar to
the sample Doc, above. (The only difference should be that the three
elements are not indented. There's a way to do that, too, but we'll
cover the subtleties of <b>to_string()</b> later.) </p>
<p> What did we do, there? Well, let's take the program line by
line.</p>
<p> The first line tells Perl that we're going to be using the
XML::Comma framework. All of the Comma modules that we'll need -- such
as XML::Comma::Doc -- are pulled in by this statement. </p>
<p> The second line creates a new Doc object. The <b>Doc->new()</b>
method takes a parameterized argument <b>type</b>, specifying which
DocumentDefinition we want our Doc to adhere to. </p>
<p> The next three lines set the contents of the three elements in the
Doc. The three statements are completely independent; we could have
placed them in any order. We can break these lines up further, to
clarify what's going on. Here is the <i>username</i> line in two
separate statements: </p>
<pre>
my $username_element = $doc->element('username');
$username_element->set ( 'kwindla' );
</pre>
<p> First, the <b>element()</b> method selects for us the element that
we're interested in, taking a single argument -- the name of the
element, and returning a reference to an Element object. Then we call
that object's <b>set</b> method. <b>set()</b> takes a single argument,
too, a string which will become the content of the Element. </p>
<p> The final line of the little program prints out our Doc. The
<b>to_string()</b> method generates a string of XML text that
completely represents the contents of the Doc. </p>
<p> One more basic method call is worth mentioning here:
<b>get()</b>. As you might expect, <b>get()</b> is the opposite of
<b>set()</b>. It takes no arguments, and returns the contents of an
Element as a string: </p>
<pre>
my $username = $username_element->get();
</pre>
<h2> More Complex Structures: Nested Elements </h2>
<p> The Doc so far is very simple: it contains three elements, each of
which contains some string-ish content. But we can do better than
that: we can introduce elements that, themselves, contain other
elements. If we add an <i>address</i> element to the Doc, it might look like this: </p>
<pre>
<User>
<username>kwindla</username>
<email>kwindla@xymbollab.com</email>
<full_name>Kwindla Hultman Kramer</full_name>
<address>
<street1>922 M Street SE</street1>
<city>Washington</city>
<state>DC</state>
<zip>20003</zip>
</address>
</User>
</pre>
<p> Corresponding changes in the Def are necessary, of course: </p>
<pre>
<DocumentDefinition>
<name>User</name>
<element><name>username</name></element>
<element><name>email</name></element>
<element><name>full_name</name></element>
<nested_element>
<name>address</name>
<element><name>street1</name></element>
<element><name>street2</name></element>
<element><name>city</name></element>
<element><name>state</name></element>
<element><name>zip</name></element>
<element><name>country</name></element>
</nested_element>
</DocumentDefinition>
</pre>
<p> The new <i>address</i> element is declared as a
<b>nested_element</b>. This means that it will serve as a container
for other elements, and will not have content of its own. Comma
enforces this distinction between simple and nested elements -- an
element can have string content, or it can serve as a container for
other elements, but it cannot do both. </p>
<p> You might infer from the above that a nested element will not have
<b>set()</b> and <b>get()</b> methods, but rather, like a Doc, will
provide an <b>element()</b> method. If so, you infer correctly. To get
at the pieces of the address, we can simply "walk down the tree",
using the methods we already know about. </p>
<pre>
my $address = $doc->element('address');
my $formatted_address = $address->element('street1')->get() . "\n";
if ( $address->element('street2')->get() ) {
$formatted_address .= $address->element('street2')->get() . "\n";
}
$formatted_address .= $address->element('city')->get() . ', ' .
$address->element('state')->get() . ' ' .
$address->element('zip')->get();
</pre>
<p> In fact, a Doc is itself a nested element -- all of the methods
that are available for manipulating nested elements are available for
Docs, as well. When we talk in more detail about nested elements,
we'll often call the nested element a <i>container</i>, and the
elements that it contains <i>sub-elements</i>. Just keep in mind that
when we describe nested element operations it doesn't matter whether
the container is a Doc or a nested element. In a similar vein,
elements can be nested as deeply as you want; you just have to declare
the nesting in the Def. (And there's even a way to specify
arbitrarily deep recursive nesting, but that's best covered in
another section entirely.) </p>
<h2> Plural Elements </h2>
<p> What if we want to store more than one address? We might, like
Amazon, keep a number of shipping addresses on file for each user. To
do so, we add a line to the Def, declaring that the address element is
<i>plural</i>. </p>
<pre>
<DocumentDefinition>
<name>User</name>
<element><name>username</name></element>
<element><name>email</name></element>
<element><name>full_name</name></element>
<nested_element>
<name>address</name>
<element><name>street1</name></element>
<element><name>street2</name></element>
<element><name>city</name></element>
<element><name>state</name></element>
<element><name>zip</name></element>
<element><name>country</name></element>
</nested_element>
<plural>'address'</plural>
</DocumentDefinition>
</pre>
<p> Note the quotes around <code>address</code>, in the new line. The
contents of the <b>plural</b> specifier are evaluated as a Perl
expression when the Def is loaded into the system, and the return
value of that expression must be a list of elements that the system
will allow to be <i>plural</i>. </p>
<p> We gain a lot of flexibility here, by treating a piece of a Def as
a bit of Perl code. The price for this flexibility is a little bit of
added complexity: the contents of the <b>plural</b> tag must create a
valid Perl list. In this case, that means putting quotes around the
bareword <code>address</code>. Many other parts of Comma use this same
strategy of embedding Perl code into DocumentDefinitions, and we'll
see much more sophisticated examples shortly. </p>
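<p> The following stand-alone sketch shows what "evaluated as a Perl
expression" amounts to. It does not use Comma at all; it simply evals
two strings of the kind that could appear inside a <b>plural</b> tag
and counts the names that come back. The second string, which names
two elements with <code>qw()</code>, is an assumption about how a
multi-element list could be written, and <i>nickname</i> is just an
illustrative element name. </p>
<pre>
use strict;
use warnings;

# two strings as they might appear inside a plural tag
my @plural_specs = ( q{'address'}, q{qw( address nickname )} );

foreach my $spec ( @plural_specs ) {
  my @names = eval $spec;     # list context, so a list comes back
  die "bad plural expression: $@" if $@;
  print scalar(@names), " plural element(s): @names\n";
}
# prints:
#   1 plural element(s): address
#   2 plural element(s): address nickname
</pre>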
<p> The <b>element()</b> method continues to work as it always has. If
you re-run the earlier code fragments with the new Def in place, the
results will be exactly the same. But our understanding of what
<b>element()</b> is doing should change a tiny bit: the method doesn't
fetch the only matching element for us, it fetches the <i>first</i>
one. And, because elements don't exist in a Doc until we manipulate
them, <b>element()</b> must create a new element for us if need be. </p>
<p> For plural elements, we obviously need some more methods. We need
a way to fetch elements other than the first one, a way to add a new
element, and a way to delete elements that we don't need. </p>
<pre>
# add a new address
my $address2 = $doc->add_element ( 'address' );
$address2->element('street1')->set ( 'PO Box 0000' );
$address2->element('city')->set ( 'Anyplace' );
$address2->element('state')->set ( 'ZZ' );
# add another new address
my $address3 = $doc->add_element ( 'address' );
# change my mind, delete that element
$doc->delete_element ( $address3 );
# get a list of address elements
my @addresses = $doc->elements ( 'address' );
</pre>
<p> The <b>add_element()</b> method takes a single argument, the name
of the element to add. It creates a new element of the requested kind,
appends that element to the container, and returns the newly-created
element. It is an error to ask a container to add a non-plural
element when an element of that kind is already present.
Remember that <b>element()</b> auto-creates elements as
required, so it is never necessary to call <b>add_element()</b> for a
non-plural element. </p>
<p> The <b>delete_element()</b> method also takes a single argument,
but is a bit more complicated. It will accept an element name as a
string argument, in which case it deletes the last element of that
kind. It will also accept an element object, in which case it will
delete that specific element. The method returns true if it deletes
anything, false if it does not. </p>
<p> The <b>elements()</b> method accepts a list of element names and
returns a list of the elements of those types, in the order that they
exist in the container. (In the above example, we only asked for
<i>address</i> elements, but we could have asked for <i>username</i>
and <i>address</i> elements, or <i>username</i> and <i>full_name</i>
and <i>address</i> elements...) </p>
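<p> Here is a short sketch that exercises the two variations described
above: <b>delete_element()</b> called with a name rather than an
object, and <b>elements()</b> called with more than one name. It
assumes a working Comma installation with the plural-address
<code>User</code> Def in place, and it uses only the calls introduced
so far. </p>
<pre>
use XML::Comma;

my $doc = XML::Comma::Doc->new ( type=>'User' );
$doc->element('username')->set ( 'kwindla' );
$doc->add_element('address')->element('city')->set ( 'Washington' );
$doc->add_element('address')->element('city')->set ( 'Anyplace' );

# delete by name: removes the last address (the 'Anyplace' one)
$doc->delete_element ( 'address' );

# ask for two kinds of element at once; they come back in Doc order
my @found = $doc->elements ( 'username', 'address' );
print scalar(@found), " elements found\n";
</pre>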
<p> Actually, the return value of <b>elements()</b> is a little
trickier than the description above would suggest. In a list context,
the method returns an array. But in a scalar context, it returns a
reference to an array. This context-awareness makes it possible to
write code like: </p>
<pre>
# quick walk down the tree
my $last_street = $doc->elements('address')->[-1]->element('street1')->get();
</pre>
<p> This is usually not a problem; most of the time, things just work
out as you would expect them to. If you assign the return value to an
array, you get an array. If you dereference with a subscript, you get
an element of the list. But there is one very important case that does
not work as you would expect. <b>You can not do the following!!!</b> </p>
<pre>
# WRONG way to do something if we've got address elements
if ( $doc->elements('address') ) {...}
</pre>
<p> <b>The above if statement will always be true</b>, because what
<code>if</code> sees is the reference. Instead, you must use
constructions like the following for conditional elements-ing:</p>
<pre>
# do something if we've got address elements
if ( @{$doc->elements('address')} ) {...}
</pre>
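<p> If the reason for this behavior is not obvious, the stand-alone
sketch below reproduces it with Perl's <code>wantarray</code>. The
<code>elements_like()</code> sub is a hypothetical stand-in written
for this example, not Comma's implementation of <b>elements()</b>. </p>
<pre>
use strict;
use warnings;

# hypothetical stand-in for elements(): array in list context,
# array reference in scalar context
sub elements_like {
  my @found = @_;
  return wantarray ? @found : \@found;
}

# a boolean test is a scalar context, so we test a reference: always true
print "reference is true even when empty\n" if elements_like();

# dereferencing first tests the number of items instead
print "empty list is false\n" unless @{ elements_like() };
print "one item is true\n"    if     @{ elements_like('address') };
</pre>
<p> All three lines are printed, which is exactly the trap: the first
test succeeds even though nothing was found. </p>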
<h2>Methods</h2>
<p> An <b>element</b> holds a piece of information. A <b>method</b>
generates a piece of information each time it is called. A document
definition may supplement its elements, which hold static data, with
methods, which return dynamic data. </p>
<p> Suppose we want to provide a method that will display a user's
email address modified in such a way as to make things more difficult
for the address-collecting web crawlers often used to build spam
databases. Here is a method definition that will fetch the contents of
the email element, replace the at-sign and periods with text, and
return the result: </p>
<pre>
<method>
<name>email_anti_spammed</name>
<code>
<![CDATA[
sub {
my $self = shift();
my $email = $self->element('email')->get();
$email =~ s/\@/ (AT) /;
$email =~ s/\./ (DOT) /g;
return $email;
}
]]>
</code>
</method>
</pre>
<p> A method is expected to have <i>name</i> and <i>code</i>
elements. The <i>name</i> is the name by which the method will be
called. The <i>code</i> element should be text that, when eval'ed,
returns a reference to an anonymous subroutine. It is this subroutine
that will be called when the method is invoked. </p>
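<p> The eval step can be tried outside of Comma. In the sketch below,
<code>$code_text</code> plays the part of a <i>code</i> element's
content and a plain string stands in for the Doc object; both are
assumptions made for illustration, and the way Comma performs the eval
internally may differ. </p>
<pre>
use strict;
use warnings;

# text as it might appear inside a code element
my $code_text = q{
  sub {
    my ( $self, %args ) = @_;
    return "called on $self with " . join ( ',', sort keys %args );
  }
};

# eval'ing the text yields a code reference ...
my $sub = eval $code_text;
die "method did not compile: $@" if $@;
die "method text must return a code reference" unless ref($sub) eq 'CODE';

# ... which is then invoked with the object as its first argument
print $sub->( 'a_doc', print_country => 1 ), "\n";
# prints: called on a_doc with print_country
</pre>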
<p> Not too surprisingly, the <b>method()</b> routine calls a
method. The <i>email_anti_spammed</i> method could be used as follows:
</p>
<pre>
# set email
$doc->element('email')->set ( 'kwindla@allafrica.com' );
# get munged email: kwindla (AT) allafrica (DOT) com
my $munged = $doc->method('email_anti_spammed');
</pre>
<p> Methods are often most useful at the top level of a document; they
function both as bits of reusable code and as programmatic
short-cuts. But methods can be defined as "part" of any element -- not
just the top-level Doc. Here is a new definition for the
<i>address</i> element that includes a method to generate a formatted
block of text suitable for printing on an envelope (some of the code
inside this method will be familiar from an earlier example): </p>
<pre>
<nested_element>
<name>address</name>
<element><name>street1</name></element>
<element><name>street2</name></element>
<element><name>city</name></element>
<element><name>state</name></element>
<element><name>zip</name></element>
<element><name>country</name></element>
<method>
<name>formatted</name>
<code>
<![CDATA[
# returns a block-formatted address.
# takes one optional arg indicating whether the country field
# should be included: print_country => 1
sub {
my ( $self, %args ) = @_;
my $formatted_address = $self->element('street1')->get() . "\n";
if ( $self->element('street2')->get() ) {
$formatted_address .= $self->element('street2')->get() . "\n";
}
$formatted_address .= $self->element('city')->get() . ', ' .
$self->element('state')->get() . ' ' .
$self->element('zip')->get();
if ( $args{print_country} ) {
$formatted_address .= ' ' . $self->element('country')->get();
}
return "$formatted_address\n";
}
]]>
</code>
</method>
</nested_element>
</pre>
<p> The <i>formatted</i> example demonstrates that methods may make
use of arguments. The first argument to <b>method()</b> is the name of
the method to be invoked; any arguments after that are passed to the
invokee. Here is example usage of the <i>formatted</i> method: </p>
<pre>
# use a hypothetical &envelope_print sub to generate text on a mailing envelope
envelope_print ( $doc->element('full_name')->get() . "\n" );
envelope_print ( $doc->element('address')->method('formatted', print_country=>1) );
</pre>
<h2>Do What I Mean: Shortcut Syntax</h2>
<p> The <b>element()</b> syntax is quite verbose. Comma provides a
more concise syntax that reduces the length and unwieldiness of common
method calls. This <i>shortcut</i> syntax has a "Do What I Mean"
design, which, of course, means that it sometimes doesn't do what you
meant. </p>
<p> Shortcuts work via Perl's <code>AUTOLOAD</code>
mechanism. Any Doc or nested element automatically recognizes Perl
methods that have the same name as their defined methods and
sub-elements. Because our <code>User</code> Def defines
<i>username</i>, <i>email</i>, <i>full_name</i>, <i>address</i> and
<i>email_anti_spammed</i> elements and methods, all of the following
Perl method calls are allowed: </p>
<pre>
# top-level User 'shortcut' methods
$doc->username();
$doc->email();
$doc->full_name();
$doc->address();
$doc->email_anti_spammed();
</pre>
<p> What a shortcut call does depends on what the underlying object
referenced is. In the simplest, most useful, and most common case --
here represented by <i>username</i>, <i>email</i> and <i>full_name</i>
-- the shortcut fetches the content from the element with the same
name as the shortcut. </p>
<pre>
# get username with less typing
my $username = $doc->username();
# which is the same thing as:
my $username = $doc->element('username')->get();
</pre>
<p> If the shortcut is called with an argument, then a <b>set()</b> is
performed rather than a <b>get()</b>. </p>
<pre>
# set the username
$doc->username ( 'kwindla' );
</pre>
<p> In the case of a nested element such as <i>address</i>, on the
other hand, a <b>get()</b> would make no sense. For a
singular, nested element, the shortcut call returns the element. For
the <i>address</i> element, which is both nested and plural,
the shortcut call returns a list, or a reference to a list, of the
<i>address</i> elements. </p>
<pre>
# 'address' shortcut
my $first_address = $doc->address()->[0];
# which is the same thing as:
my $first_address = $doc->elements('address')->[0];
</pre>
<p> For a Comma method, such as <i>email_anti_spammed</i>, the
shortcut calls the method. So <b>$doc->email_anti_spammed()</b>
becomes <b>$doc->method('email_anti_spammed')</b>. It is possible for
a method and an element to have the same name; in this case, the
shortcut calls the method rather than accessing the element. <b>Comma
methods <i>shadow</i> elements of the same name in the context of
shortcut calls.</b> </p>
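<p> For readers who have not met <code>AUTOLOAD</code> before, here is
a toy sketch of the dispatch idea: a method of the given name wins,
otherwise an argument means "set" and no argument means "get". The
<code>Toy::Container</code> class is invented for this example and
stores its data in plain hashes; it is not how Comma is implemented,
and it ignores nested and plural elements entirely. </p>
<pre>
use strict;
use warnings;

package Toy::Container;   # hypothetical class, not part of Comma

sub new {
  my $class = shift;
  return bless { content => {}, methods => {} }, $class;
}

our $AUTOLOAD;
sub AUTOLOAD {
  my ( $self, @args ) = @_;
  ( my $name = $AUTOLOAD ) =~ s/.*:://;   # strip the package name
  return if $name eq 'DESTROY';           # destruction is not a shortcut

  # a method of this name shadows an element of the same name
  if ( my $method = $self->{methods}->{$name} ) {
    return $method->( $self, @args );
  }
  # with an argument, set; without, get
  $self->{content}->{$name} = $args[0] if @args;
  return $self->{content}->{$name};
}

package main;

my $toy = Toy::Container->new();
$toy->username ( 'kwindla' );
print $toy->username(), "\n";      # prints: kwindla

$toy->{methods}->{shout} = sub { return uc $_[0]->username() };
print $toy->shout(), "\n";         # prints: KWINDLA
</pre>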
<p> A table of shortcuts and their not-short equivalents is probably
the easiest way to describe all seven possible ways a shortcut
can be resolved. Here then, are the many faces of
<b>$x->foo ( [@args] )</b>. </p>
<p>
<table cellpadding="4" border="1">
<tr>
<td><code>$x->method('foo', @args)</code></td>
<td>If there is a method named <i>foo</i></td>
</tr>
<tr>
<td><code>$x->element('foo')</code></td>
<td>For singular, nested <i>foo</i></td>
</tr>
<tr>
<td><code>$x->elements('foo')</code></td>
<td>For plural, nested <i>foo</i></td>
</tr>
<tr>
<td><code>$x->element('foo')->get()</code></td>
<td>For singular, non-nested <i>foo</i> called with no arguments</td>
</tr>
<tr>
<td><code>$x->element('foo')->set ( $args[0] )</code></td>
<td>For singular, non-nested <i>foo</i> called with arguments</td>
</tr>
<tr>
<td><code>$x->elements_group_get('foo')</code></td>
<td>For plural, non-nested <i>foo</i> called with no arguments</td>
</tr>
<tr>
<td><code>$x->elements_group_add('foo', @args)</code></td>
<td>For plural, non-nested <i>foo</i> called with arguments</td>
</tr>
</table>
</p>
<p> We've used examples from the top level of the <code>User</code>
Doc, but short-cut methods are applicable to any nested element
context. (Indeed, shortcuts are most useful in terms of keystrokes
saved when used to shorten multi-level traversals.) Here is a line of
code to grab the zip-code of the first stored address in a
<code>User</code> Doc: </p>
<pre>
# a shortcut version of $doc->elements('address')->[0]->element('zip')->get()
$doc->address()->[0]->zip();
</pre>
<h2>Nested Element Helper Methods: elements_group_get() and Friends</h2>
<p> <i>Shortcuts</i> are one kind of convenience method; they're not
strictly necessary but do save typing and make code easier to
read. Another set of convenience methods is supported by nested
elements: the <i>group helpers</i>. These methods make it possible to
manipulate instances of a non-nested, plural element as a single
group. To demonstrate, we first need to add a simple, plural element
to our <code>User</code> Def. In an example even more contrived than
usual, let's allow a user to be known by
a number of <i>nicknames</i>. </p>
<pre>
<element><name>nickname</name></element>
<plural>'nickname'</plural>
</pre>
<p> A Doc that includes several nickname elements might look like this: </p>
<pre>
<User>
<username>kwindla</username>
<email>kwindla@xymbollab.com</email>
<full_name>Kwindla Hultman Kramer</full_name>
<nickname>Junior</nickname>
<nickname>khkramer</nickname>
<nickname>smooth_operator</nickname>
</User>
</pre>
<p> To add one or more new nickname elements to this Doc, we can use
one of the <i>group helper</i> methods:
<b>elements_group_add()</b>. The first argument to
<b>elements_group_add()</b> is the <i>name</i> of the element(s) we'll
be adding; the remaining arguments specify the <i>content</i> for each
new element. </p>
<pre>
# add two nicknames
$doc->elements_group_add ( 'nickname', 'Sneezy', 'Forgetful' );
# note: the above statement is equivalent to the following two lines of code:
$doc->add_element('nickname')->set ( 'Sneezy' );
$doc->add_element('nickname')->set ( 'Forgetful' );
</pre>
<p> The opposite function, deleting particular elements from a group,
is handled by the <b>elements_group_delete()</b> method. Again, the
first argument supplies a <i>name</i> and the remainder of the
arguments specify content strings. If the content of an element
matches one of the supplied strings, that element will be
deleted. (Any strings that are not matched will be ignored.) If
<b>elements_group_delete</b> is only given the first, <i>name</i>,
argument, then all elements in the group are deleted. This provides a
convenient idiom for clearing and re-setting an elements group.</p>
<pre>
# remove the nicknames we just added (wrong movie)
$doc->elements_group_delete ( 'nickname', 'Sneezy', 'Forgetful' );
# remove all the nicknames and replace them with a list of nicknames we
# get back from a couple of subroutine calls
my @new_nicknames = Television::Stooges::nicknames();
push @new_nicknames, Usenet::Rec::Humor::Stooges::FanFiction::nicknames();
$doc->elements_group_delete ( 'nickname' );
$doc->elements_group_add_uniq ( 'nickname', @new_nicknames );
</pre>
<p> To query a group for the presence of a particular piece of
content, use the <b>elements_group_lists()</b> method. This method
expects two arguments: <i>name</i> and <i>content</i>. </p>
<pre>
# check that we really removed the Snow White stuff
print "no more dwarves"
if ! $doc->elements_group_lists('nickname', 'Sneezy') and
! $doc->elements_group_lists('nickname', 'Forgetful');
</pre>
<p> To slurp the contents of the group's elements into a list, use
<b>elements_group_get()</b>. As is the case with most of the nested
element "plural" methods, <b>elements_group_get()</b> returns an
array in a list context and an array reference in a scalar context. </p>
<pre>
# get all of the nicknames
my @nicknames = $doc->elements_group_get ( 'nickname' );
# get the last nickname
my $last_nickname = $doc->elements_group_get('nickname')->[-1];
</pre>
<p> Finally, <b>elements_group_add_uniq()</b> works like
<b>elements_group_add()</b> except that it ignores duplicates. If we
always use <b>elements_group_add_uniq()</b> to add to the nicknames
list we will never list a nickname twice. </p>
<pre>
# add a new nickname
$doc->elements_group_add_uniq ( 'nickname', 'Bashful' );
# add several more nicknames, skipping 'Bashful' because it's already present
$doc->elements_group_add_uniq ( 'nickname', 'Dopey', 'Bashful', 'Doc' );
</pre>
<h2>Whitespace: Ignored and Trimmed</h2>
<p> XML-based systems must define how they treat whitespace. HTML, for
example, treats all occurrences of whitespace as equivalent: there is
no difference between a single space and a boatload of carriage
returns. The one exception is content inside a <code>pre</code> tag,
which is preserved exactly as supplied.
</p>
<p> Comma treats whitespace surrounding its tags as non-meaningful,
stripping it all out. The following Docs are exactly the same: </p>
<pre>
<!-- Two equivalent Docs -->
<User>
<username> kwindla </username>
<full_name> Kwindla Hultman Kramer </full_name>
</User>
<User><username>kwindla</username><full_name>Kwindla Hultman Kramer</full_name></User>
</pre>
<p> Comma's stripping of tag-adjacent whitespace has a very important
corollary: <b>whitespace is trimmed from the beginning and end of all
element content</b>. So the two <b>set()</b> statements below are
equivalent, and the string comparison will always be false: </p>
<pre>
# set the username
$doc->element('username')->set ( 'kwindla' );
# set the username to the same thing -- whitespace is "trimmed"
$doc->element('username')->set ( ' kwindla ' );
# because the whitespace is gone, this can *never* be true
my $matched = $doc->element('username')->get eq ' kwindla ';
</pre>
<p> Of course, the auto-trimming only applies to tags defined in Comma
document definitions. It is often convenient to embed XML-marked-up
text in a Comma element as "flat" content -- an element that stores an
HTML snippet, for example, will include XML tags that have no
"meaning" to Comma. Element content is always preserved verbatim
(after whitespace is trimmed from the very beginning and very end) by
the system; any XML-like strings inside element content are treated
exactly like all other text. </p>
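<p> The trimming rule is easy to picture in plain Perl. The sketch
below is not Comma's own code; <code>trim_content</code> is a
hypothetical helper that strips whitespace from the two ends of a
string and leaves everything in between, markup included, alone. </p>
<pre>
use strict;
use warnings;

# Hypothetical helper, not part of Comma: remove leading and trailing
# whitespace, leave the middle of the string untouched.
sub trim_content {
  my $str = shift;
  $str =~ s/^\s+//;
  $str =~ s/\s+$//;
  return $str;
}

my $snippet = "  Likes <b>Perl</b>  and   assembly. \n";
print '[', trim_content ( $snippet ), "]\n";
# prints: [Likes <b>Perl</b>  and   assembly.]
</pre>
<p> Note that the runs of spaces inside the string survive; only the
very beginning and very end are affected. </p>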
<h2>XML Escape/Unescape</h2>
<p> Every Comma Doc is a syntactically-legal XML document. All tags
must be properly balanced and nested, and bare ampersands, left
brackets and right brackets must be properly escaped. Elements that
contain XML-like tags or markup characters as part of their content
will need to take special action to ensure that proper formatting,
escaping or CDATA wrapping happens. </p>
<p> Let's add a <i>bio</i> element to our <code>User</code> Def, and
discuss some of the issues involved in storing HTML as element
content. </p>
<pre>
<!-- new 'bio' element: holds a chunk of HTML text -->
<element><name>bio</name></element>
</pre>
<pre>
<User>
<username> kwindla </username>
<full_name> Kwindla Hultman Kramer </full_name>
<bio> Kwin is a programmer who likes <a href="http://use.perl.org">Perl</a>
and <a href="http://www.motorola.com/mcu">6812</a>
assembly language. </bio>
</User>
</pre>
<p> The above Doc is perfectly fine. Because the two <i>a</i> tags are
balanced, the parser has no problem reading in the Doc. After parsing
is finished the content of the <i>bio</i> element is treated just like
any other "flat" piece of content. </p>
<p> We will run into problems, however, if we're not extremely careful
about the HTML we try to store in the <i>bio</i> element. For example,
HTML includes a number of "empty" tags that are usually used in a
non-balanced fashion -- <i>img</i> and <i>br</i>, for example. Unless
we force the use of XHTML syntax, which mandates XML-compatible tag
usage, we'll need to either escape all mark-up characters or wrap
content in a CDATA section. </p>
<p> The utility methods <b>XML_basic_escape</b> and
<b>XML_basic_unescape</b> handle simple escaping and unescaping of
markup characters. </p>
<pre>
use XML::Comma::Util qw ( XML_basic_escape XML_basic_unescape );
# escape a string
$escaped = XML_basic_escape ( '<img src="picture.png">' );
$unescaped = XML_basic_unescape ( $escaped );
</pre>
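<p> To make "simple escaping" concrete, here is a rough, stand-alone
sketch of what a basic escape/unescape pair for ampersands and angle
brackets might look like. The names <code>sketch_escape</code> and
<code>sketch_unescape</code> are made up for this illustration; the
real routines live in XML::Comma::Util and may differ in detail. </p>
<pre>
use strict;
use warnings;

# Hypothetical stand-ins, for illustration only.
sub sketch_escape {
  my $str = shift;
  $str =~ s/&/&amp;/g;   # ampersands first, so our own output isn't re-escaped
  $str =~ s/</&lt;/g;
  $str =~ s/>/&gt;/g;
  return $str;
}

sub sketch_unescape {
  my $str = shift;
  $str =~ s/&lt;/</g;
  $str =~ s/&gt;/>/g;
  $str =~ s/&amp;/&/g;   # ampersands last, mirroring the escape order
  return $str;
}

my $html    = '<img src="picture.png"> & more';
my $escaped = sketch_escape ( $html );
print "$escaped\n";
print "round trip ok\n" if sketch_unescape ( $escaped ) eq $html;
</pre>
<p> The order of the substitutions matters: escaping ampersands first
and unescaping them last is what lets a string survive the round trip
unchanged. </p>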
<p> The <b>set()</b> and <b>get()</b> methods provide a means to
escape and unescape strings during get and set operations. If
<b>set()</b> is called with additional arguments following the
<i>content</i> arg, they are interpreted as parameters that affect how
the set is performed. The argument <b>escape=>1</b> forces the content
string to be escaped before other pieces of the set routine --
validation, etc. -- go to work. Similarly, calling <b>get()</b> with
the parameterized arg <b>unescape=>1</b> unescapes the content string
before it is returned. </p>
<pre>
# safe set()
$doc->element('bio')->set ( $html_stuff, escape=>1 );
# get() bio content in a string that we can incorporate directly into
# a web page
$doc->element('bio')->get ( unescape=>1 );
</pre>
<p> Our other option, as mentioned above, is to "wrap" the bio
element's content in an XML CDATA section. The CDATA envelope forces
an XML parser to treat the characters inside it as plain text. Comma
allows an element to be flagged as CDATA-fied, meaning that on output
the entire contents will be wrapped in a CDATA section. Comma treats
this CDATA facility as high-impact and coarse-grained. As a result the
declaration is a one-way street: once a CDATA element, always a CDATA
element. The <b>cdata_wrap()</b> method flips the switch, so to
speak. </p>
<pre>
# configure the bio element so that it always CDATA-wraps its content
$doc->element('bio')->cdata_wrap();
# now we can set() with impunity
$doc->element('bio')->set ( $messy_html );
</pre>
<p> The <b>to_string()</b> method on the CDATA-set element will
produce output that looks something like this: </p>
<pre>
<bio><![CDATA[Kwin is a programmer who likes <a href="http://use.perl.org">Perl</a>
and <a href="http://www.motorola.com/mcu">6812</a>
assembly language.]]></bio>
</pre>
<h2>Flexible and Automatic Escape/Unescape</h2>
<p> Escaping and unescaping element content is common enough that
each Element in a Def can configure: </p>
<ol>
<li> The code that performs the <b>escape</b> operation</li>
<li> The code that performs the <b>unescape</b> operation</li>
<li> Whether to automatically escape element content on a <b>set()</b></li>
<li> Whether to automatically unescape element content on a <b>get()</b></li>
</ol>
<p> Here is a (silly) example of a custom escape/unescape pair as part
of an Element's definition: </p>
<pre>
<element>
<name>Xs_are_dangerous</name>
<escapes>
<escape_code>
sub { my $str=shift; $str =~ s:X:--x--:g; return $str; }
</escape_code>
<unescape_code>
sub { my $str=shift; $str =~ s:--x--:X:g; return $str; }
</unescape_code>
<auto>1</auto>
</escapes>
</element>
</pre>
<p>Within the <b>escapes</b> section, <b>escape_code</b> specifies
some code that performs the escape, and <b>unescape_code</b> specifies
some code that performs the unescape. They default, respectively, to: </p>
<pre>
\&XML::Comma::Util::XML_basic_escape
\&XML::Comma::Util::XML_basic_unescape
</pre>
<p> The <b>auto</b> element controls behaviors 3 and 4, from the list
above. The content of <b>auto</b> is eval'ed at Def load time, and if
<b>auto</b> contains a scalar value, that value sets the default for
both escaping and unescaping. If <b>auto</b> contains a listref, the
first value in the list controls escaping, and the second
unescaping. <b>auto</b> defaults to "0". </p>
<p> In the example above, <b>auto</b> is "1", so content is silently
escaped by the element's <b>set()</b> method and silently unescaped by
its <b>get()</b> method. Of course, explicitly passing
<b>escape=>0</b> to <b>set()</b> or <b>unescape=>0</b> to <b>get()</b>
overrides this behavior: </p>
<pre>
# if $el is an Xs_are_dangerous element...
# set $el content to "TE--x-- ME--x--"
$el->set ( "TEX MEX" );
# get back our string TEX MEX
$str = $el->get();
# get back the literal "TE--x-- ME--x--" stored in $el
$str = $el->get ( unescape => 0 );
# set $el content to literal "TEX MEX" -- no escape
$el->set ( "TEX MEX", escape => 0 );
</pre>
<p>Three more element Def examples: </p>
<pre>
<element>
<name>all_basic_escaped</name>
<escapes><auto>1</auto></escapes>
</element>
<element>
<name>esc_basic_escaped</name>
<escapes><auto>[1,0]</auto></escapes>
</element>
<element>
<name>unesc_basic_escaped</name>
<escapes><auto>[0,1]</auto></escapes>
</element>
</pre>
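<p> The scalar-or-listref rule for <b>auto</b> can be sketched in a
few lines of plain Perl. This is only an illustration of the rule as
described above, not Comma's implementation; <code>parse_auto</code>
is a hypothetical helper that evals the text of an <b>auto</b> element
and returns a pair of flags (escape on set, unescape on get). </p>
<pre>
use strict;
use warnings;

# Hypothetical helper, not part of Comma.
sub parse_auto {
  my $text  = shift;
  my $value = 0;                      # a missing auto defaults to 0
  if ( defined $text ) {
    $value = eval $text;
    die "bad auto value '$text': $@" if $@;
  }
  return ref($value) eq 'ARRAY'
    ? ( $value->[0] ? 1 : 0, $value->[1] ? 1 : 0 )   # listref: [ escape, unescape ]
    : ( $value ? 1 : 0, $value ? 1 : 0 );            # scalar: same flag for both
}

for my $text ( '1', '[1,0]', '[0,1]', '0' ) {
  my ( $esc, $unesc ) = parse_auto ( $text );
  print "auto=$text  escape on set: $esc  unescape on get: $unesc\n";
}
</pre>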
<h2>Automatic Content: <default></h2>
<p> It is often useful to define <i>default</i> content for a class of
elements, content that <b>get()</b> will return for any instance of an
element that doesn't have content of its own. We can amend the
definition of the <i>bio</i> element (defined in the previous section)
to provide a standard "no information available" string if a
<code>User</code> Doc doesn't include a bio.</p>
<pre>
<element>
<name>bio</name>
<default>No bio information available.</default>
</element>
</pre>
<pre>
# set() bio information
$doc->element('bio')->set ( 'Kwindla is a programmer' );
# get() will return our new bio -- this prints out 'Kwindla is a programmer';
print $doc->element('bio')->get();
# "clear" bio content by passing set() an undef argument
$doc->element('bio')->set();
# now get() will return our default string -- 'No bio information available'
print $doc->element('bio')->get();
</pre>
<p> As the above code demonstrates, calling <b>set()</b> with an
undefined value as its content argument (which passing no arguments
does implicitly) "clears" the content of an element, and any
subsequent <b>get()</b> calls will again return the default
string. Note that only an <code>undef</code> argument will clear an
element's content; in particular, an empty string is perfectly valid
as content and a <b>get()</b> on an element with an empty string as
its content will happily return that empty string.</p>
<p> It is sometimes important to differentiate between an element that
doesn't have any content and an element that has the same content as
its Def's default string. The <b>get_without_default()</b> method
returns an element's content exactly as is, without falling back to
any default value that may be defined. Unlike <b>get()</b>, which
returns an empty string if there is neither element content nor Def
default, <b>get_without_default()</b> returns <code>undef</code> if an
element has no content at all. </p>
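<p> The difference between "no content", "empty content" and "default
content" is subtle enough to be worth modelling. The toy class below
is a stand-alone sketch of the rules just described; it is not Comma's
element class, and <code>ToyElement</code> is a name invented for this
illustration. </p>
<pre>
use strict;
use warnings;

# Toy stand-in for an element whose Def supplies a default.
package ToyElement;

sub new {
  my ( $class, %arg ) = @_;
  return bless { default => $arg{default}, content => undef }, $class;
}

# set() with no argument (or with undef) clears the content
sub set {
  my ( $self, $content ) = @_;
  $self->{content} = $content;
  return $self;
}

# content if there is any, else the default, else an empty string
sub get {
  my $self = shift;
  return $self->{content} if defined $self->{content};
  return defined $self->{default} ? $self->{default} : '';
}

# content exactly as stored; undef if nothing has been set
sub get_without_default {
  my $self = shift;
  return $self->{content};
}

package main;

my $bio = ToyElement->new ( default => 'No bio information available.' );

$bio->set ( '' );        # an empty string is real content
print "empty string counts as content\n"
  if defined $bio->get_without_default() and $bio->get() eq '';

$bio->set();             # clear the content
print "cleared: get() falls back to the default\n"
  if $bio->get() eq 'No bio information available.';
print "cleared: get_without_default() is undef\n"
  if ! defined $bio->get_without_default();
</pre>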
<h2>Storing Dynamic Information in Defs: pnotes</h2>
<p> Document definitions are static constructs. However it can be
useful to tie some dynamic bits of information -- status or state
flags, simple lookup tables and the like -- to a def. It can also be
useful to have simple access to a perl-level hash that can store
arbitrary references.</p>
<p> To enable a Def to "hold" some long-lived bits of dynamic
information, each def exposes a unique <b>pnotes</b> hash, available
to any piece of code in the system. (Comma borrowed the idea for, and
the name of, the <b>pnotes</b> hash from Apache.) </p>
<pre>
# a bit of pnotes manipulation
my $def = XML::Comma::Def->read ( name=>'some_docdef' );
$def->def_pnotes()->{'foo'} = 'bar';
# prints out 'Foo from def: bar'
print "Foo from def: " . $def->def_pnotes()->{'foo'} . "\n";
my $doc = XML::Comma::Doc->new ( type => 'some_docdef' );
# prints out 'Foo from doc: bar'
print "Foo from doc: " . $doc->def_pnotes()->{'foo'} . "\n";
# prints out 'Foo from pathname: bar'
print "Foo from pathname: " . XML::Comma->pnotes('some_docdef')->{'foo'} . "\n";
# and every element down a def's tree has its own pnotes, too
XML::Comma->pnotes('some_docdef:nested_element:another_element')->{'test'} = 'Ok';
print "Ok down longer pathname: " . XML::Comma->pnotes('some_docdef:nested_element:another_element')->{'test'} . "\n";
</pre>
<p> There are three new methods here. Each element exposes a
<b>def_pnotes()</b> method, which returns a reference to that
element's def's pnotes hash. Each def also exposes a
<b>def_pnotes()</b> method, which returns a reference to its own
pnotes hash. The two methods are "different but the same" -- for
convenience, you can call <b>def_pnotes()</b> on an element or on that
element's def and get back the same hash reference. </p>
<p> The third new method is the system call
<b>XML::Comma->def_pnotes()</b>, which takes a pathname and returns
that def path's pnotes hash. </p>
<h2> Not just for Defs: pnotes for Elements </h2>
<p> Sometimes you need to store bits of perl-level data that are
specific to a particular Doc, rather than to a Def. You could always
write a closure-ish method that creates persistent variables, but
Comma provides a simple, Element-bound pnotes hash as an
alternative. </p>
<p> Here's the workhorse method from the MailMessageReader
input/output filter, which uses Mark Overmeer's Mail::Message module
to parse an internet email message and create a Doc. The code sticks
the Mail::Message object into the doc's pnotes hash, for possible
later use. </p>
<pre>
sub input {
my $msg = Mail::Message->read ( $_[1] );
my $doc = XML::Comma::Doc->new ( type => $_[0]->{_doctype} );
$doc->message_id ( get_message_id($msg) );
$doc->subject ( $msg->get ('Subject') );
$doc->from ( $msg->get ('From') );
$doc->to ( $msg->get ('To') );
my $date = $msg->get( 'Date' );
if ( $date ) {
my $unix_time = Date::Parse::str2time ( $date );
$doc->date ( $date );
$doc->date_utime ( $unix_time );
}
foreach ( get_references($msg) ) {
$doc->add_element('reference')->set ( $_ );
}
foreach ( get_parts_content_types($msg) ) {
$doc->add_element('part_content_type')->set ( $_ );
}
my $plain_part = get_plain_part ( $msg );
# declare first: "my ... if ..." has undefined behavior in Perl
my $body;
$body = autoformat $plain_part->decoded if $plain_part;
$doc->body ( $body ) if $body;
$doc->pnotes()->{mail_message_object} = $msg;
return $doc;
}
</pre>
<h1>Storage and Retrieval</h1>
<p> Manipulating Docs in memory is only a small part of the story. We
need a way to store Docs in permanent collections, a way to retrieve
these permanently stored Docs, and a way to manipulate the collections
themselves. </p>
<h2>The Store Definition</h2>
<p> Let's introduce a new section to the <i>User</i> Def:
<b>store</b>. </p>
<pre>
<DocumentDefinition>
<name>User</name>
<element><name>username</name></element>
<element><name>email</name></element>
<element><name>full_name</name></element>
<nested_element>
<name>address</name>
<element><name>street1</name></element>
<element><name>street2</name></element>
<element><name>city</name></element>
<element><name>state</name></element>
<element><name>zip</name></element>
</nested_element>
<plural>'address'</plural>
<store>
<name>main</name>
<base>comma_guide</base>
<location>Sequential_file</location>
</store>
</DocumentDefinition>
</pre>
<p> This is the simplest possible store specification: we supply a
<b>name</b>, a <b>base</b> directory and a <b>location</b>. </p>
<p> The <b>name</b> element specifies how we'll refer to this
particular store. As with elements, we can specify more than one
store, so we need names to differentiate one from 'nother. We've
called this particular store <i>main</i>.</p>
<p> The <b>base</b> element supplies a directory, underneath the
document root, where we're going to put the Docs that we're
storing. For this store, since the base is <code>comma_guide</code>,
all of the storage will take place in
<code><document_root>/comma_guide/</code>. </p>
<p> The <b>location</b> element specifies how Docs will be stored
within the <b>base</b> context. In this case we're storing Docs in a
series of sequentially-numbered files. </p>
<h2>Two Methods: store() and retrieve()</h2>
<p> With this definition of our <i>main</i> store in place, we're
ready to store and retrieve User documents. </p>
<pre>
# make a new Doc, so we have something to store.
my $doc = XML::Comma::Doc->new ( type=>'User' );
$doc->element('username')->set ( 'kwindla' );
$doc->element('email')->set ( 'kwindla@xymbollab.com' );
# write this Doc out to the "main" permanent store
my $key = $doc->store ( store => 'main' )->doc_key();
# now read the Doc back in, manipulate it, and store it back out to the same place
my $d2 = XML::Comma::Doc->retrieve ( $key );
$d2->element('full_name')->set ( 'Kwindla Hultman Kramer' );
$d2->store();
</pre>
<p> There are three new methods here -- <b>store()</b>,
<b>retrieve()</b>, and <b>doc_key()</b>. </p>
<p> The <b>store()</b> method writes a Doc out to permanent storage. A
<b>store => <name></b> argument must be supplied the first time
the method is called on a new Doc, to specify which of the stores in
the Def will be used. The <b>store()</b> method re-returns a reference
to the Doc, so that you can chain method calls together easily. The
<b>doc_key()</b> method returns a unique, long-term identifier for the
stored Doc. </p>
<p> The <b>retrieve()</b> method fetches a Doc out of storage, and
expects to be supplied a document <b>key</b> as its argument. </p>
<h2>Where Are the Files?</h2>
<p> It's worth looking at the files that <b>store()</b> writes out. If
you ran the above bit of code, you should be able to look in your
document root and see a directory named <code>comma_guide</code>. In
that directory, there should be a file named <code>0001</code>. (And
if you ran the code multiple times, also <code>0002</code>,
<code>0003</code>, etc.) The contents of these files should look
familiar: the text in them was produced by an internal call to
<b>to_string()</b>. We can compare the output from a
<b>to_string()</b> call with the contents of a store file, to confirm
this:</p>
<pre>
my $store = XML::Comma::Def->read(name=>'User')->get_store('main');
my $doc = XML::Comma::Doc->retrieve ( type => 'User',
store => 'main',
id => $store->first_id() );
# print out the doc with a to_string()
print "doc retrieved...\n";
print " key: " . $doc->doc_key() . "\n";
print " from to_string()...\n";
print "----\n";
print $doc->to_string();
print "----\n";
# cat the file that we got the doc from
print " from file: " . $doc->doc_location() . "\n";
open ( my $fh, '<', $doc->doc_location() )
  or die "can't open " . $doc->doc_location() . ": $!";
my @lines = <$fh>;
close ( $fh );
print "----\n";
print @lines;
print "----\n";
</pre>
<p> We've snuck several things into the above example. </p>
<p> In the first line we <b>read()</b> the <code>User</code>
Def. This is the Def that we've been adding to as we go along in this
chapter, but here we're going to be querying it programmatically,
rather than editing it as a text file. <b>Def->read()</b> gives us
a reference to the Def object, upon which we immediately call
<b>get_store()</b> to get a reference to our <code>main</code>
store. We use that to get the <b>id</b> of the first document we
stored in <code>main</code>, whatever and whenever that was. A
document <b>id</b>, as you might guess, is one of the parts that makes
up a document key. (The other mandatory parts are a document type and
a store name.) As you can see, <b>retrieve()</b> is flexible: it
accepts a single argument and interprets that as a key (as in the
previous example); it is also happy to accept separate, parameterized
arguments supplying a type, store name and Doc id, which is what we've
done here. </p>
<p> Again, we see the <b>doc_key()</b> method, which returns this
Doc's key, and a new method, <b>doc_location()</b>, which returns the
underlying file that this Doc was fetched from. It is worth noting
that <b>doc_location()</b> is rarely used in the course of "normal"
Doc manipulation, because Comma handles all of the underlying
filesystem tasks that are part of ordinary storage, retrieval and the
like. </p>
<p> There are other "doc_foo()" methods, including <b>doc_store()</b>
which returns a reference to the store that was used to fetch or store
the Doc, and <b>doc_id()</b>, which returns the Doc's id. It is an
error to call any of the doc_foo() methods on a newly-created Doc that
has not yet been stored. </p>
<h2>Multiple Users and Processes: Permissions and Locking</h2>
<p> Access permissions are an important part of any multi-user
system. XML::Comma uses the underlying filesystem to provide basic
permissions facilities. The store definition may include a
<b>file_permissions</b> element, which sets the rwx permissions on any
stored files. Here is our <code>main</code> store with a new line that
makes these files world-readable but writable only by their owner:
</p>
<pre>
<store>
<name>main</name>
<base>comma_guide</base>
<location>Sequential_file</location>
<file_permissions>644</file_permissions>
</store>
</pre>
<p> The <code>644</code> specification is suitable for a system in
which all <code>User</code> editing is done by processes running as a
single user, but in which many users might need to run processes that
need read-access to <code>User</code> information. It is actually more
common for a group of users to need write access to a Doc collection;
for that reason the default value of the <b>file_permissions</b>
element -- the value that is used by the system if no specifier is
given -- is <b>664</b>. </p>
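<p> If the three-digit notation is unfamiliar, the following
stand-alone snippet expands a specifier into the <code>rwx</code>
triplets that <code>ls -l</code> shows for owner, group and everyone
else. It assumes the usual Unix octal reading of the digits, and
<code>rwx</code> is a helper written just for this illustration. </p>
<pre>
use strict;
use warnings;

# Expand a specifier such as "644" into owner/group/other triplets.
sub rwx {
  my $spec = shift;
  return join ' ', map {
    my $bits = $_ + 0;   # force numeric, so & is a numeric bit test
    ( $bits & 4 ? 'r' : '-' ) . ( $bits & 2 ? 'w' : '-' ) . ( $bits & 1 ? 'x' : '-' )
  } split //, $spec;
}

print "644 => ", rwx ( '644' ), "\n";   # prints: 644 => rw- r-- r--
print "664 => ", rwx ( '664' ), "\n";   # prints: 664 => rw- rw- r--
</pre>
<p> The only difference between the two is the middle triplet: with
<code>664</code> members of the file's group may write as well as
read. </p>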
<p> Because Comma depends on the filesystem to manage permissions, you
will need to understand how the filesystem determines and applies
permissions information to/for individual files in order to set up
complicated scenarios. Remember that Comma code always runs as part of
some particular process, under the ownership of a specific user. </p>
<p> Permissions restrictions address issues of information ownership
and security. File permissions discriminate among multiple users of a
system. An even more fundamental set of problems is posed by the
multi-process nature of the systems on which Comma runs. We must be
able to <b>lock</b> Docs so that concurrent processes do not
simultaneously attempt to modify a file. </p>
<p> The <b>retrieve()</b> method automatically acquires a <b>lock</b>
on the requested Doc. As long as this lock is held, the Doc cannot be
retrieved again. The <b>store()</b> method automatically unlocks the
stored Doc. </p>
<p> Because of the automatic locking, <b>retrieve()</b> is a
relatively heavy-weight method. In addition, if <b>retrieve()</b>
cannot immediately acquire its lock, it waits -- re-trying
periodically -- until it finally can. The <b>retrieve()</b> method
should therefore be used carefully, with the time that a Doc is held
open kept as short as possible. (An optional argument to retrieve,
<b>timeout=><seconds></b>, is also available. With a timeout
specified, <b>retrieve()</b> will throw an error if it is unable to
acquire its lock within the given number of seconds.) </p>
<p> The <b>read()</b> method is an alternative to <b>retrieve()</b>,
for situations in which a Doc will be read but not modified. In fact,
in most applications, <b>read()</b> is by far the most common access
method. Because <b>read()</b> does not need to acquire a lock, it is
somewhat faster than <b>retrieve()</b>. The two methods take the same
arguments. </p>
<p> There is one other method in the retrieve family:
<b>retrieve_no_wait()</b>. This method is exactly like
<b>retrieve()</b>, except that if it fails to immediately acquire a
lock it returns <code>undef</code>, rather than blocking. Programmers
with extensive experience designing multi-threaded/concurrent systems
will find uses for this method; other programmers will find abuses. In
general, if you can't describe in exact and minute detail why you are
using <b>retrieve_no_wait()</b>, you shouldn't be. </p>
<p> As the necessary complement to <b>retrieve()</b>, <b>store()</b>
must unlock objects as they are written out to permanent storage so
that other users of the system will be able to fetch them. After
storage, a Doc object becomes read-only, as if it had been opened with
<b>read()</b>. </p>
<p> It is possible to <b>store()</b> a Doc without unlocking it
(useful, for example, to write out intermediate changes as part of a
series of operations). The <b>keep_open=><true></b> argument
specifies that the lock be retained. (Conversely, a Doc that has been
opened read-only can be locked with the <b>get_lock()</b> or
<b>get_lock_no_wait()</b> methods.) </p>
<p> Finally, the methods <b>erase()</b>, <b>copy()</b> and
<b>move()</b> perform the operations that their names suggest: </p>
<pre>
# retrieve and then erase a Doc
my $doc = XML::Comma::Doc->retrieve ( $key_a );
$doc->erase();
# retrieve and then move a Doc
$doc = XML::Comma::Doc->retrieve ( $key_b );
$doc->move ( store=>'other_store' );
# read and copy a Doc (we're not modifying the original, so it's
# okay to read() instead of retrieve())
$doc = XML::Comma::Doc->read ( $key_c );
$doc->copy ( store=>'other_store' );
</pre>
<p> As a side note, <b>copy()</b> and <b>move()</b> accept the same
arguments as <b>store()</b>, including <b>keep_open=><true></b>,
and you should always supply a <b>store=><name></b> when copying
and moving -- the normal use of these methods is to transfer a Doc
from one store to another. (Confusingly, in this normal case,
<b>copy()</b> is really just a synonym for <b>store()</b>; calling
<b>store()</b> with a <i>new</i> <b>store=><name></b> specifier
effectively performs a copy. The only case in which the actual
<b>copy()</b> method is uniquely required is the copying of a Doc
<i>within</i> the same store.)</p>
<h2>Iterating Over Stored Docs</h2>
<p> It is often necessary to process some or all of the Docs in a
store. Methods exist to fetch the first and last ids in a store and,
given an id, to fetch the ids before and after it. In one of the
examples above we retrieved the first Doc in the <code>main</code>
store. We'll begin with that same code, and then go on to iterate
through all of the Docs in the store. </p>
<pre>
my $store = XML::Comma::Def->read(name=>'User')->get_store('main');
my $doc = XML::Comma::Doc->retrieve ( type => 'User',
store => 'main',
id => $store->first_id() );
print "first doc: " . $doc->doc_key() . "\n";
while ( my $id = $store->next_id($doc->doc_id()) ) {
$doc = XML::Comma::Doc->retrieve ( type => 'User',
store => 'main',
id => $id );
print "next doc: " . $doc->doc_key() . "\n";
}
</pre>
<p> This code uses the store's <b>first_id()</b> and <b>next_id()</b>
methods. To iterate in the other direction, we could substitute
<b>last_id()</b> and <b>prev_id()</b>.</p>
<p> The <b>prev_</b> and <b>next_</b> methods are fine for fetching a
few docs, but for sizable loops they are a little clumsy and a lot
slow. An <i>iterator</i> provides a means by which to apply repetitive
operations to a set of stored documents quickly and easily. </p>
<pre>
# basic iterator -- start from the end and work backwards
my $iterator = $store->iterator();
while ( my $doc = $iterator->prev_read() ) {
print "working on doc: " . $doc->doc_id() . "\n";
}
# with some additional parameters -- start from the beginning and
# limit the set to the first 500 docs
$iterator = $store->iterator ( size=>500, pos=>'-' );
while ( my $doc = $iterator->next_read() ) {
print "working on doc: " . $doc->doc_id() . "\n";
}
</pre>
<p> An iterator is obtained by calling the store's <b>iterator()</b>
method. By default, <b>iterator()</b> provides access to all of the
store's documents, starting with the last doc. (This is the default
because iterating backwards over recently-stored docs is a fairly
common thing to do.) Two arguments to <b>iterator()</b> modify this
default behavior: <b>size=></b> limits the size of the iterator's
result set, and <b>pos=></b> specifies whether the iterator is
initially set to point at the end or at the beginning of the set --
<b>'+'</b> specifies the end (and is equivalent to the default of not
specifying a pos), and <b>'-'</b> specifies the beginning. The
<b>size=></b> argument can only be used to pick out the first or last
<i>n</i> documents. There is no way to pull a subset of documents out
of the "middle" of a store. When used with <b>pos=>'-'</b>, the size
specifier will select documents from the beginning of the store, and
when a <b>pos=></b> argument is not given (or when <b>pos=>'+'</b> is
specified), the size specifier will select documents from the end of
the store. </p>
<p> The basic iterator methods are <b>next_id()</b>, <b>prev_id()</b>,
<b>next_read()</b>, <b>prev_read()</b>, <b>next_retrieve()</b>, and
<b>prev_retrieve()</b>. The names are pretty self-explanatory. Each of
these methods returns an id or doc, as the case may be, unless the
iterator has passed the beginning or end of its collection, in which
case the method returns <code>undef</code>. The six methods can be
called in any combination and in any order. (Criticism-inclined
readers may, at this point, be thinking that "iterator" is a poor name
for this class, given that it is possible to move across the set in
any order and backwards and forwards. Those readers are probably
correct.) </p>
<p> Four more methods are defined for advanced mucking around with an
iterator. These methods should be wielded with caution, as they are
not usually needed and they don't do any error or sanity checking. The
<b>length()</b> method returns the size of the iterator's document
set; the <b>index()</b> method gives the position of the current
pointer into that document set; the <b>inc()</b> method moves the
pointer a relative amount -- with no argument <b>inc()</b> adds one to
the pointer, given an argument it adds that value to the pointer
(<code>-1</code> is a common argument); and the <b>set()</b> method
sets the pointer to an absolute index value -- so
<code>$iterator->set($iterator->length())</code> would reset an
iterator such that the next call to <b>prev_id()</b> will fetch the
last id in the set. </p>
<h2>Location Chains</h2>
<p> So far, our storage definition for <code>main</code> has used only
a single <b>location</b> element. We saw above that specifying
<code>Sequential_file</code> governed the "file" portion of the
storage location. To understand how to create more complex storage
patterns, we need to understand how multiple location specifiers can
be "chained" together. </p>
<p> A filesystem is a hierarchical store: directories contain files
and directories, which contain more files and directories, which
contain more files and directories, ad infinitum. Each time a Doc is
stored, Comma uses the <b>document_root</b>, the <b>base</b> specifier
and the <b>location</b> elements in a storage definition to build a
"location chain" that determines where in the filesystem to save the
written-out Doc. For our <code>main</code> store, the chain looks like
this: </p>
<p>
<table border="1" cellpadding="6">
<tr>
<td>document root</td> <td>base</td> <td>location</td>
</tr>
<tr>
<td><code>XML::Comma->document_root()</code></td>
<td><code>comma_guide</code></td>
<td><code><location>Sequential_file</location></code></td>
</tr>
</table>
</p>
<p> There are other location specifiers besides
<code>Sequential_file</code>. Some of these are designed to be used in
pairs or groups, so that several location specifiers can be combined
as part of a chain. One of these "intermediate" specifiers is
<code>Sequential_dir</code>, which is similar to
<code>Sequential_file</code> except that it determines an intermediate
directory in the location chain rather than a final file. Here is our
store definition with a new addition: </p>
<pre>
<store>
<name>main</name>
<base>comma_guide</base>
<location>Sequential_dir</location>
<location>Sequential_file</location>
</store>
</pre>
<p> The first file stored by this store will be located at: </p>
<pre>
<document_root>/comma_guide/0001/0001
</pre>
<p> We've added a directory level to the chain; the first
<code>0001</code> comes from the <code>Sequential_dir</code>, the
second from the <code>Sequential_file</code>. One effect of this
addition is to increase the capacity of the store. We're limited to
9999 files per directory, so where before we could store a maximum of
9999 Docs, we can now store up to 9999 * 9999, or 99,980,001. And we can
add as many <code>Sequential_dir</code>s to the chain as we like,
increasing the number of directories in the resulting storage
locations. </p>
<p> Location specifiers often accept arguments that further determine
how they behave in the chain. <code>Sequential_file</code> recognizes
two arguments, and <code>Sequential_dir</code> recognizes one. Here is
another modified version of our storage definition: </p>
<pre>
<store>
<name>main</name>
<base>comma_guide</base>
<location>Sequential_dir:max,10</location>
<location>Sequential_file:max,99,extension,'.xml'</location>
</store>
</pre>
<p> Now each of the location specifiers has an argument list
attached. A colon separates the specifier name from the arguments, and
the arguments themselves take the form of a Perl list, which will be
turned into a hash of key/value pairs when the definition is
loaded. </p>
<p> The first argument is common to both declarations: <b>max</b>
specifies the maximum number of files that will be allowed in this
part of the chain. (When we stated above that we were limited to 9999
files, we were referring to the default value of the <b>max</b>
argument. If we had wanted to square the capacity of the storage
without adding an intermediate directory, we could have simply
specified <b>max,99_980_001</b> as an argument to
<code>Sequential_file</code>. Doing so has a serious drawback,
however; finding, creating and deleting files gets progressively
slower as the number of files in a directory climbs.) </p>
<p> <code>Sequential_file</code>'s second argument, <b>extension</b>,
provides an extension to be tacked onto the end of every Doc's storage
file. This can be useful if, for example, other tools for managing or
manipulating files will co-exist with XML::Comma in a given
application. With our most recent definition, the first and last files
in the store would have the following locations: </p>
<pre>
<document_root>/comma_guide/01/01.xml
<document_root>/comma_guide/10/99.xml
</pre>
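<p> The arithmetic behind these locations can be sketched in a few
lines of stand-alone Perl. This is an illustration only, not Comma's
own code: the <code>nth_location()</code> function is hypothetical,
and it assumes that each directory is filled before the next one is
started and that names are zero-padded to the width of the <b>max</b>
value. </p>
<pre>
use strict;
use warnings;
# hypothetical helper: where would the Nth Doc land in a two-level
# Sequential_dir / Sequential_file chain?
sub nth_location {
my ( $n, $dir_max, $file_max, $extension ) = @_;
die "store is full\n" if $n < 1 or $n > $dir_max * $file_max;
# assumption: directory 1 fills completely before directory 2 is used
my $dir = int ( ($n - 1) / $file_max ) + 1;
my $file = ( ($n - 1) % $file_max ) + 1;
# pad each piece to the number of digits in its max value
return sprintf ( "%0*d/%0*d%s",
length($dir_max), $dir,
length($file_max), $file,
$extension );
}
print nth_location ( 1, 10, 99, '.xml' ), "\n"; # 01/01.xml
print nth_location ( 990, 10, 99, '.xml' ), "\n"; # 10/99.xml
</pre>
<p> With the default <b>max</b> of 9999 at both levels and no
extension, the same function yields <code>0001/0001</code> for the
first Doc, matching the earlier example. </p>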
<p> The Storage in More Detail section provides additional information
about storage definitions, including documentation for all of the
standard location modules. </p>
<h1>Validation, Macros and Hooks</h1>
<p> Document Definitions describe and constrain the basic structure of
the documents that we can produce. For example, an attempt to make use
of an element that isn't specified in a document's Def generates an
error. This section describes Comma's mechanisms for "validating" the
structure of documents and the content of elements. </p>
<h2>Document Structure: Required Elements and validate()</h2>
<p> Section Three introduced the <b>plural</b> tag. This tag
determines which elements may exist multiple times in the given
container. Another container-level tag is <b>required</b>, which
specifies that a container must include at least one of each of the
specified elements. Here is our <code>User</code> Def with a new
validity constraint: </p>
<pre>
<DocumentDefinition>
<name>User</name>
<element><name>username</name></element>
<element><name>email</name></element>
<element><name>full_name</name></element>
<nested_element>
<name>address</name>
<element><name>street1</name></element>
<element><name>street2</name></element>
<element><name>city</name></element>
<element><name>state</name></element>
<element><name>zip</name></element>
</nested_element>
<plural>'address'</plural>
<required>qw( username email full_name )</required>
<store>
<name>main</name>
<base>comma_guide</base>
<location>Sequential_file</location>
</store>
</DocumentDefinition>
</pre>
<p> To be "valid," a <code>User</code> Doc must now have content in
its <i>username</i>, <i>email</i> and <i>full_name</i> elements. A
document that is not valid cannot be stored -- the storage routines
all call the method <b>validate()</b>, which throws an error
unless all required elements are present. The
<b>validate()</b> method can also be called directly. It
takes no arguments and returns the empty string; its only function is
to throw an error if the Doc doesn't pass all validity tests. Here are
two simple code snippets; for more information, see the section on
errors and error handling:</p>
<pre>
# check whether a Doc passes validity tests
eval {
$doc->validate();
}; if ( $@ ) {
print "doc didn't validate: $@\n";
}
# the same idea, but during a store()
my $key;
eval {
$key = $doc->store( store=>'main' );
}; if ( $@ ) {
print "doc couldn't be stored: $@\n";
} else {
print "doc was stored successfully: $key\n";
}
</pre>
<p> Our example use of <b>required</b> is not very complicated. As
with all things to do with nested elements, <b>required</b> and
<b>validate()</b> are just as applicable deep inside a nested
structure as at the very top level. Any nested element can specify a
<b>required</b> list, and can be checked with a call to
<b>validate()</b>. More interestingly, calls to
<b>validate()</b> automatically check the validity of all
elements underneath the caller, so a Doc-level validity check
walks the entire document tree. This is convenient and it makes good
theoretical sense: no element can be valid that itself contains
invalid parts.</p>
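<p> The tree walk described above can be pictured with a small
stand-alone sketch. It does not use the Comma API at all: nested
elements are modeled as plain hashes, and the
<code>validate_node()</code> function and its data layout are
assumptions made purely for illustration. </p>
<pre>
use strict;
use warnings;
# hypothetical model: a nested element is a hash with a 'required'
# list and a 'children' hash; a simple element is just a string
sub validate_node {
my ( $node, $path ) = @_;
# first check this container's own required list
foreach my $name ( @{ $node->{required} || [] } ) {
my $child = $node->{children}->{$name};
die "required element '$name' not found in $path\n"
if ! defined $child or ( ! ref $child and $child eq '' );
}
# then descend into every nested child
foreach my $name ( sort keys %{ $node->{children} } ) {
my $child = $node->{children}->{$name};
validate_node ( $child, "$path:$name" ) if ref $child;
}
return '';
}
my $user = {
required => [ qw( username email ) ],
children => {
username => 'alice',
email => 'alice@example.com',
address => { required => [ 'zip' ],
children => { city => 'Springfield' } },
},
};
eval {
validate_node ( $user, 'User' );
}; if ( $@ ) {
print "doc didn't validate: $@";
}
</pre>
<p> The top level passes its own checks, but the call still fails
because the nested <i>address</i> is missing its required <i>zip</i>
-- one invalid part makes the whole tree invalid. </p>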
<h2>Element Content: Macros and validate_content()</h2>
<p> A container's validity is a function of the sub-elements that it
contains. A simple element's validity is a function of its contents. A
<i>macro</i> defines and limits the type of content that an element
may have. Here is our <code>User</code> Def with macros added to its
<i>username</i> and <i>email</i> definitions.</p>
<pre>
<DocumentDefinition>
<name>User</name>
<element>
<name>username</name>
<macro>length:min,4,max,20</macro>
</element>
<element>
<name>email</name>
<macro>email</macro>
</element>
<element><name>full_name</name></element>
<nested_element>
<name>address</name>
<element><name>street1</name></element>
<element><name>street2</name></element>
<element><name>city</name></element>
<element><name>state</name></element>
<element><name>zip</name></element>
</nested_element>
<plural>'address'</plural>
<required>qw( username email full_name )</required>
<store>
<name>main</name>
<base>comma_guide</base>
<location>Sequential_file</location>
</store>
</DocumentDefinition>
</pre>
<p> We can use the <b>validate_content()</b> method to check whether
a string can be accepted as an element's content. The method takes a
single argument -- the prospective content -- and throws an error if
the content fails to pass the validity checks. It is not usually
necessary to call <b>validate_content()</b> directly, because
<b>set()</b> calls the method at the very beginning of its operation,
before doing anything else. Here is a typical bit of error-checked
<b>set()</b> code: </p>
<pre>
# modifying a User doc
eval {
$doc->username ( $username );
$doc->email ( $email );
$doc->full_name ( $full_name );
}; if ( $@ ) {
handle_content_error ( $@ );
}
</pre>
<p> The <b>validate()</b> method is also defined for non-nested
elements. It is possible to use the unsafe <b>append()</b> method to
construct invalid element content (and also possible to read invalid
Docs out of storage). The <b>validate()</b> method checks an element's
existing content for validity. Just as with nested elements, this
method is called by all of the storage methods, so that an invalid
document will not be written out to permanent storage. </p>
<p> As with storage location specifiers, the <b>macro</b> tag should
contain a name followed by an optional argument list (with a colon in
between). Different macros expect different numbers of arguments and
different argument formats. The <i>enum</i> macro, for example, takes
a list of strings that will be the only acceptable contents for the
element being defined. Let's add a new <i>subscription</i> element to
the <code>User</code> Def, indicating what level of service a user has
paid for. (This time, we won't re-produce the whole Def, just the new
element.) </p>
<pre>
<element>
<name>subscription</name>
<macro>enum: 'basic', 'premium', 'lifetime'</macro>
</element>
</pre>
<p> There are only four possible values for the content of the
<i>subscription</i> element: <code>undef</code>, <i>basic</i>,
<i>premium</i>, and <i>lifetime</i>. "Hmm, <code>undef</code>," you
say, "I don't see <code>undef</code> in that list." Well, <i>enum</i>
always includes <code>undef</code> as an implicit member of the
possible-contents list. The reason for this will be clear after a
little reflection: because Comma treats a content-less element as
indistinguishable from an element that is not there at all,
<code>undef</code> must be legal content for all elements. To make an
empty element illegal is actually the same operation as to make it
required. If we want every <code>User</code> Doc to include
subscription information, we can define the <i>subscription</i>
element to be <b>required</b>: </p>
<pre>
<element>
<name>subscription</name>
<macro>enum: 'basic', 'premium', 'lifetime'</macro>
</element>
<required>'subscription'</required>
</pre>
<h2>More Flexibility: Perl Hooks</h2>
<p> The <b>required</b> and <b>macro</b> facilities that we've just
seen are actually implemented using a finer-grained, more flexible
tool: the <i>hook</i>. A hook is a piece of Perl code that will, under
specific circumstances, be automatically called by the Comma
system. Declaring an element as required actually installs a
<b>validate_hook</b> -- the <b>required</b> tag is just a short-cut,
provided because the facility is so important and so commonly
used. The following two pieces of a hypothetical Def are exactly
equivalent: </p>
<pre>
<!-- 1) a required tag specifying two element names-->
<required>'foo', 'bar'</required>
<!-- 2) the two validate_hooks that are actually installed
when the Def is parsed, one for 'foo' and one for 'bar' -->
<validate_hook>
<![CDATA[
sub {
my $self = shift();
my $req_el = $self->elements('foo')->[0];
die "required element 'foo' not found in " . $self->tag_up_path() . "\n" if
(! $req_el) or
((! $req_el->def()->is_nested()) and ($req_el->get() eq ''));
}
]]>
</validate_hook>
<validate_hook>
<![CDATA[
sub {
my $self = shift();
my $req_el = $self->elements('bar')->[0];
die "required element 'bar' not found in " . $self->tag_up_path() . "\n" if
(! $req_el) or
((! $req_el->def()->is_nested()) and ($req_el->get() eq ''));
}
]]>
</validate_hook>
</pre>
<p> This example demonstrates the common convention for writing hooks:
most hooks are subroutines that are compiled into code references when
the Def is loaded by the system; they can expect to be passed certain
arguments when they are invoked; they should make use of the Comma API
to do whatever work they need to do; and they should return
appropriate values or throw errors, depending on what is expected of
them. </p>
<p> We can go over the first of these hooks line by line. (The second
hook is exactly the same, except that <code>'bar'</code> is
substituted for <code>'foo'</code> in two places.) The first line is
an opening CDATA tag. Perl snippets usually include characters that
are illegal in XML -- the arrow operator is particularly common in
this kind of code -- so wrapping the content in a CDATA section is a
near necessity. The second line begins an anonymous subroutine
declaration. The third line establishes a named variable,
<code>$self</code>, which comes from the first argument to the
sub. The next line fetches the first 'foo' element, if any, into
<code>$req_el</code> -- the <code>$req_el</code> variable now holds
either an element object or an undefined value. The final statement
throws an error if either <code>$req_el</code> is not defined, or
<code>$req_el</code> is a non-nested, empty element. (NOTE: FIXING the
obvious bug here, real real soon.) </p>
<p> The required example demonstrates the use of a
<b>validate_hook</b> in the "structural" context -- checking the
sub-elements of a nested element. We can use the same approach to
validate the contents of a non-nested element, but in this case we
must expect two arguments: the element and the proposed new
content. Imagine, if you will, an element,
<i><delicate_sensibilities></i>, which must contain text that
will not shock or offend children, great aunts and members of the
clergy. Imagine, also, a hypothetical CPAN module Lingua::FCC_Check,
which can check for words that are proscribed by the Federal
Communications Commission from over-the-air broadcast in the United
States. Here, then, is a definition for the
<i><delicate_sensibilities></i> element: </p>
<pre>
<element>
<name>delicate_sensibilities</name>
<validate_hook>
<![CDATA[
use Lingua::FCC_Check;
sub {
die "unacceptable language detected for " . $_[0]->tag_up_path() . "\n" if
Lingua::FCC_Check::check ( $_[1] );
}
]]>
</validate_hook>
</element>
</pre>
<p> The only new thing in this example is the <code>use</code>
statement that precedes the subroutine definition. We need to pull in
the Lingua::FCC_Check module, so we do that just as we would in a
stand-alone program. </p>
<p> To summarize, <b>validate_hook</b>s may be defined for both simple
and nested elements and should take the form of anonymous
subroutines. In the case of a nested element, the hook expects the
element itself to be the sole argument. In the case of a non-nested
element, the hook expects to be passed the element and a string
containing the content to be checked. The <b>validate()</b>
method calls any hooks that have been defined for an element, as does
<b>validate_content()</b>. More hooks (called as part of storage,
indexing, etc.) will be introduced as we go along, and documentation
for all available hooks can be found in the hooks reference
section. </p>
<h2>Writing New Macros</h2>
<p> Macros were designed as a way to extend the syntax of document
definitions without modifying any of the Comma system code. When an
element definition is loaded, any macros that it contains are given a
chance to execute. Writing and installing macros is relatively
easy. In general, macros work by installing hooks, so you've already
seen most of what you need to know to create a new macro.</p>
<p> For a macro to be available to the system, the definition loader
must be able to find it. The loader will look in the same places that
it looks for defs (the list of directories in the
<b>defs_directory</b> Comma variable), and it will look in files named
<i>macro.extension</i>, where <i>macro</i> is the name of the macro
and <i>extension</i> is the string defined by the
<b>macro_extension</b> variable. </p>
<p> To turn the "FCC_Check" example from the previous section into a
macro, we need to save a few lines of Perl code in a file that meets
the above criteria (on my system, I'm using
/comma/defs/macros/fcc_approved.macro). Here are the contents of the
file: </p>
<pre>
# fcc_approved: a macro to check element content for blue language
use Lingua::FCC_Check;
$self->add_hook ( 'validate_hook',
sub {
die "unacceptable language detected for " . $_[0]->tag_up_path() . "\n" if
Lingua::FCC_Check::check ( $_[1] );
}
);
</pre>
<p> The first line is just a comment that helps us remember what this
code snippet does, if we run across it in an unexpected place. The
<code>use</code> statement and the subroutine definition are familiar
from the validation hook version of this code. The only new thing here
is the <b>add_hook()</b> method. The syntax is a little hard to see,
but <b>add_hook()</b> is quite simple: it expects a hook name as its
first argument, and a subroutine reference or string (which will be
eval'ed and must become a subroutine reference) as its second
argument. The subroutine is installed as a hook of the requested
type. </p>
<p> Turning the FCC check into a macro simplifies the definition of
the <i>delicate_sensibilities</i> element considerably. Even more
importantly, we can re-use this macro in any other Def on this system,
and changes -- bug fixes, new additions to the FCC list -- will only
need to be made to the macro, not to each element that defines the
hook. </p>
<pre>
<!-- the new, improved delicate_sensibilities def -->
<element>
<name>delicate_sensibilities</name>
<macro>fcc_approved</macro>
</element>
</pre>
<p> The <i>range</i> macro (part of the standard Comma installation)
provides a slightly more complex example. This macro is used to limit
content to a range of numbers, for example between one and ten. As
such, <i>range</i> requires two arguments: the first specifies the
low end of the range and the second the high end. </p>
<pre>
# range macro: takes two arguments, low-end and high-end
my $low = $macro_args[0];
my $high = $macro_args[1];
$self->add_hook ( 'validate_hook',
sub {
my ( $doc, $content ) = @_;
if ( $content < $low or
$content > $high ) {
die "'$content' out of range ($low:$high)\n";
};
}
);
</pre>
<p> The only thing here that we haven't seen before is the pre-defined
variable <b>@macro_args</b>. Like <b>$self</b>, the <b>@macro_args</b>
array is filled with the appropriate values by the macro
loader. Macros can do whatever they want with the arguments that are
supplied them. This macro simply makes use of the first two elements
in the list as part of the hook subroutine. (It should actually
probably do a little bit of pro-active error checking.) Here is how we
might use the <i>range</i> macro. </p>
<pre>
<element>
<name>one_to_ten</name>
<macro>range:1,10</macro>
</element>
</pre>
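<p> As a rough idea of what the "pro-active error checking" mentioned
above might look like, here is a stand-alone sketch. It is not the
<i>range</i> macro that ships with Comma: the
<code>make_range_check()</code> function is hypothetical, and it
builds the hook subroutine as an ordinary closure so that it can be
tried outside of a macro file. Treating empty content as legal is also
an assumption, made to mirror the way <i>enum</i> accepts
<code>undef</code>. </p>
<pre>
use strict;
use warnings;
use Scalar::Util qw( looks_like_number );
# hypothetical helper: check the macro arguments once, up front, and
# return a subroutine suitable for use as a validate_hook
sub make_range_check {
my ( $low, $high ) = @_;
die "range needs two numeric arguments\n"
unless looks_like_number($low) and looks_like_number($high);
die "range: low end ($low) is greater than high end ($high)\n"
if $low > $high;
return sub {
my ( $el, $content ) = @_;
# assumption: an empty element is legal unless it is also required
return if ! defined $content or $content eq '';
die "'$content' is not a number\n"
unless looks_like_number($content);
die "'$content' out of range ($low:$high)\n"
if $content < $low or $content > $high;
return;
};
}
my $check = make_range_check ( 1, 10 );
eval { $check->( undef, 11 ) };
print "rejected: $@" if $@;
</pre>
<p> Inside a real macro file, the same closure could be handed to
<b>add_hook()</b> after checking <b>@macro_args</b> in the same way.
</p>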
<h2>#include: Defs From Components</h2>
<p> Defs that are part of a single system or application usually share
common element definitions, hooks, and methods. These common
components can be abstracted out, placed into separate files, and
<b>#include</b>'ed into as many defs as necessary. </p>
<pre>
<!-- file 'simple_el.include' -- anywhere in the defs_directories -->
<element><name>included_el</name></element>
<!-- and a 'simple_def.def' that uses the above include -->
<DocumentDefinition>
<name>simple_def</name>
<element><name>el_one</name></element>
<element><name>el_two</name></element>
<? #include simple_el ?>
</DocumentDefinition>
</pre>
<p> The <b>#include</b> syntax is quite different from most everything
else that Comma defines for defs. XML aficionados will recognize it
as a <code>processing instruction</code>, a special part of an XML
document that is intended for the consumption of a particular parser
or tool-chain and should be ignored by everyone else. Using the
processing instruction syntax makes it possible to exempt
<b>#include</b> directives from the normal rules governing what can go
where in a def. </p>
<p> When a Comma parser encounters an <b>#include</b> statement, it
looks at the word immediately following <code>#include</code> and
tries to find an <code>.include</code> file of that name somewhere in
the system's def directories. If it succeeds, the parser continues
reading in the def from that file. When the parser reaches the end of
the <code>.include</code> file, it returns to the original file and
continues on. Except for adjusting the filename and line numbers that
are reported if the parser encounters an error, this switch between
files is completely transparent -- using an <b>#include</b> is the
same as cutting and pasting the content of the <b>#include</b> into
the def. </p>
<p> Of course, this wouldn't be Comma if you couldn't gussy up your
<b>#include</b>s with perl code. For many purposes, the simple include
behavior described above is perfectly sufficient. But sometimes the
content of the <b>#include</b> needs to be customized for the def at
hand. Here is an example, an include that takes two arguments and
generates a customized method: </p>
<pre>
<!-- file 'first_word_method.include' -->
sub {
my ( $method_name, $el_name ) = @_;
return <<END;
<method>
<name>$method_name</name>
<code><![CDATA[
sub {
my \$self = shift;
my \$content = \$self->get('$el_name');
\$content =~ m|^(\w+)|;
return \$1 || '';
} ]]></code>
</method>
END
}
<!-- and a def that uses 'first_word_method.include' -->
<DocumentDefinition>
<name>another_include_example</name>
<element><name>paragraph</name></element>
<? #include {first_word_method} 'fw_paragraph', 'paragraph' ?>
</DocumentDefinition>
</pre>
<p> Wrapping the name of the include in curly brackets indicates that
this is a dynamic, rather than a static include. Comma expects a
dynamic <code>.include</code> file to return a code reference that,
when executed, will return the content to be folded into the def. Any
text that follows the curly-bracketed include name will be treated as
a list to be eval'ed, then passed to the code reference as its
arguments. </p>
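<p> The two-step evaluation just described can be tried outside of
Comma with plain Perl. The sketch below is only a model of the idea:
the variable names and the tiny include text are invented for the
example, and the real def loader presumably does more work (finding
the file, reporting line numbers, and so on). </p>
<pre>
use strict;
use warnings;
# stand-in for the text of a dynamic .include file: Perl source that
# evaluates to a code reference
my $include_source = <<'END_INCLUDE';
sub {
my ( $el_name ) = @_;
return "<element><name>$el_name</name></element>\n";
}
END_INCLUDE
# stand-in for the text that follows the curly-bracketed include name
my $args_source = q{'summary'};
# step one: compile the include into a code reference
my $generator = eval $include_source;
die "include did not compile: $@" if $@;
die "include did not return a code reference\n"
unless ref($generator) eq 'CODE';
# step two: eval the argument text as a list and call the code
my @args = eval $args_source;
die "bad include arguments: $@" if $@;
my $fragment = $generator->( @args );
print $fragment; # <element><name>summary</name></element>
</pre>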
<h1>Indexing</h1>
<p> XML::Comma implements <i>storage</i> and <i>indexing</i>
separately. </p>
<p> Comma <i>storage</i> generally involves writing complete documents
out to disk. Each stored document is retrievable by a unique key, and
collections of stored documents can be iterated across in key
order. Most of the time, stored documents are saved as normal,
XML-formatted text files. Modern filesystems are fast, robust and well
understood. Relying on the standard filesystem functionality enables a
systems administrator to use normal tools for backup, maintenance and
monitoring, and allows programmers to use standard utilities for quick
or simple manipulations. (It is very convenient, for example, to be
able to do a quick <code>grep</code> on a directory full of Docs.)
</p>
<p> Comma <i>indexing</i> involves saving pieces of documents in a
relational database so that complex search, sort and retrieval
operations can be performed flexibly and efficiently. These tasks are
"above and beyond" what a filesystem is capable of, so Comma builds
its indexing functionality as a relational database framework. The
system can be configured to use any RDBMS; Comma provides a standard
interface that sits atop the sophisticated storage and query
capabilities of platforms such as MySQL, Postgres or Oracle. </p>
<h2>A <code>User</code> Index Definition</h2>
<p> An index allows a collection of Docs to be searched and sorted
according to their elements' contents. We'll build an index for
our <code>User</code> Docs to demonstrate the basic features of
the indexing framework. </p>
<p> An index defines one or more <b>fields</b>, with each field
normally corresponding to an element or method in the document
definition. A simple index might only contain a single <b>field</b>:
</p>
<pre>
<DocumentDefinition>
<name>User</name>
<element><name>username</name></element>
<element><name>email</name></element>
<element><name>full_name</name></element>
<nested_element>
<name>address</name>
<element><name>street1</name></element>
<element><name>street2</name></element>
<element><name>city</name></element>
<element><name>state</name></element>
<element><name>zip</name></element>
</nested_element>
<plural>'address'</plural>
<store>
<name>main</name>
<base>comma_guide</base>
<location>Sequential_file</location>
</store>
<index>
<name>main</name>
<field><name>email</name></field>
</index>
</DocumentDefinition>
</pre>
<p> With the <i>main</i> index part of our document definition, we can
use the <b>index_update()</b> method to add documents to it. Calling
<b>index_update()</b> -- which takes as its <b>index=></b> argument
the name of the index to update -- adds a document to an index or, if
the document is already present, updates the index to reflect any
changes. </p>
<pre>
# add/update this Doc's record in the 'main' index
$doc->index_update ( index => 'main' );
</pre>
<p> On the other hand, <b>index_remove()</b> deletes a document from
an index. Like <b>index_update()</b>, it expects the name of an index as
an <b>index=></b> argument. </p>
<pre>
# delete this Doc's record in the 'main' index
$doc->index_remove ( index=>'main' );
</pre>
<h2>Indexing from different stores and defs</h2>
<p> Sometimes it can be useful to index different document types
or documents from multiple stores in the same document type into
a single index. Comma allows one to do this by introducing the
<b><store></b> and <b><doctype></b> directives within
an index. </p>
<h2>Querying the Index: Iterators</h2>
<p> We need a way to get at the <code>User</code> Docs that are in our
index. First we need a handle to the index itself. Then we can ask the
index for an <b>iterator</b> that will step through all of the Docs:
</p>
<pre>
# get 'main' index
my $index = XML::Comma::Def->read(name=>'User')->get_index ('main');
# get iterator
my $i = $index->iterator();
# iterate, printing out "$key: $email"
while ( $i ) {
print $i->doc_key() . ': ' . $i->email() . "\n";
$i++;
}
</pre>
<p> There are several new methods here. The <b>get_index()</b> method
operates like <b>get_storage()</b>, taking a single argument and
returning the index of that name. The index's <b>iterator()</b> method
returns an iterator object, which provides a means to step through the
documents in the index. An iterator can only deal with documents one
at a time, and can only advance in one direction through its
sequence. Here, we use the <b>++</b> operator to advance the
iterator. </p>
<p> Every iterator exposes its fields as methods, so we call the
<b>email()</b> method to get the value of this record's <i>email</i>
field -- a value which came originally from the <i>email</i> element
of the Doc that this record represents. </p>
<p>Every indexing iterator implicitly includes a <b>doc_key</b> method
and <b>doc_id</b> field, so <b>doc_key()</b> and <b>doc_id()</b>
are always available as methods. Another special method,
<b>record_last_modified()</b>, is also available. As its name
suggests, <b>record_last_modified()</b> returns a timestamp (unix
system time) indicating when the index record was last changed. </p>
<p> The <b>++</b> operator is actually a short cut for a named method,
<b>iterator_next()</b>. And if that isn't bad enough, there's an
implicit check of the <b>iterator_has_stuff()</b> method triggered by
the boolean context of the <code>while</code> statement. The
implicit-nesses are an example of operator overloading, which is
explained in detail in the Camel book. For our purposes, suffice it
to say that 1) it is easy and correct to write an iterator loop as
above, and 2) you've just seen the only two overloaded operators that
the <code>Iterator</code> class defines -- <b>++</b> =>
<b>iterator_next()</b> and <b>boolean-ization</b> =>
<b>iterator_has_stuff()</b>. </p>
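<p> For readers without the Camel book at hand, here is a minimal,
stand-alone sketch of the same two overloads. The
<code>Sketch::Iterator</code> class is invented for this example and
steps through a plain list; it borrows the method names
<b>iterator_has_stuff()</b> and <b>iterator_next()</b> only to make
the parallel clear, and says nothing about how Comma's own iterator is
implemented. </p>
<pre>
use strict;
use warnings;
package Sketch::Iterator;
use overload
'bool' => sub { $_[0]->iterator_has_stuff() },
'++' => sub { $_[0]->iterator_next() },
'=' => sub { $_[0] }; # a "copy" is simply the same iterator
sub new {
my ( $class, @items ) = @_;
return bless { items => [ @items ], pos => 0 }, $class;
}
# true as long as the pointer is still inside the list
sub iterator_has_stuff {
my $self = shift();
return $self->{pos} < scalar @{ $self->{items} };
}
# move the pointer forward one step
sub iterator_next {
my $self = shift();
$self->{pos}++;
return $self;
}
sub current {
my $self = shift();
return $self->{items}->[ $self->{pos} ];
}
package main;
my $i = Sketch::Iterator->new ( qw( a b c ) );
my @seen;
while ( $i ) { # boolean context calls iterator_has_stuff()
push @seen, $i->current();
$i++; # calls iterator_next()
}
print "@seen\n"; # a b c
</pre>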
<p> In general, I prefer compact idioms, and the simple <b>++</b> loop
is both compact and (to me, anyway) highly readable. However, in the
spirit of over-explanation, here are several exactly-equivalent
versions of the lowly iterator loop. (And a note to careful observers:
it doesn't matter that some of these loops "increment" the iterator on
loop entry and some don't -- an iterator contrives to point to its
first record in a "lazy" fashion so that programmers don't ever have
to worry about whether an iterator is newly-created or not.) </p>
<pre>
my $i = $index->iterator();
while ( $i++ ) {
$i->foo();
}
my $i = $index->iterator();
while ( $i->iterator_has_stuff() ) {
$i->foo();
$i->iterator_next();
}
my $i = $index->iterator();
while ( $i ) {
$i->foo();
$i->iterator_next();
}
my $i = $index->iterator();
while ( $i->iterator_next() ) {
$i->foo();
}
my $i = $index->iterator();
while ( $i ) {
$i->foo();
$i++;
}
</pre>
<p> <b>NOTE</b> -- there is a bug in perl 5.6.x and 5.8.0 that makes
the first (and most compact) idiom above leak memory. The iterator
object won't get properly garbage collected when the
<code>while()</code> loop is written like that. This can be a big
problem in long-running contexts (such as inside a web server). For any
long-running code, use one of the more verbose forms. </p>
<p> Of course, an iterator that steps through all of the records in an
index is not usually what you want. The <b>iterator()</b> method
accepts arguments that specify matching and sorting criteria, making
it possible to construct iterators that return a subset of an index in
a specified order. </p>
<p> The <b>where_clause => <sql-where></b> argument matches a
conditional phrase against the index's fields to narrow down the
records that are returned. The <b>order_by => <sql-order-by></b>
argument controls the ordering of the records. </p>
<p> The <b>iterator()</b> method constructs a complex SQL statement
that, when executed by the database, selects the records that the
iterator will include. If a <b>where_clause</b> or <b>order_by</b>
argument is supplied when an iterator is constructed, that piece of
SQL logic is integrated into the iterator's complete SQL
statement. Given a knowledge of generic SQL, it is easy to write
<b>where_clause</b> and <b>order_by</b> arguments -- simply treat each
field as you would a column in the database and the iterator parser
will do the rest. An iterator with both a <b>where_clause</b> and an
<b>order_by</b> might look like this:</p>
<pre>
# find all users with hotmail addresses and sort alphabetically
my $i = $index->iterator ( where_clause => 'email LIKE "%@hotmail.com"',
order_by => 'email' );
</pre>
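<p> To picture how such arguments end up inside a larger statement,
here is a deliberately simplified, stand-alone sketch. The
<code>build_select()</code> function, the table name and the column
list are all made up for the example; the SQL that Comma really
generates is more involved and is not reproduced here. </p>
<pre>
use strict;
use warnings;
# hypothetical helper: splice optional WHERE and ORDER BY pieces into
# a basic SELECT statement
sub build_select {
my ( %arg ) = @_;
my $sql = 'SELECT doc_id, ' . join ( ', ', @{ $arg{fields} } ) .
" FROM $arg{table}";
$sql .= " WHERE $arg{where_clause}" if defined $arg{where_clause};
$sql .= " ORDER BY $arg{order_by}" if defined $arg{order_by};
return $sql;
}
print build_select ( table => 'user_main', # invented name
fields => [ 'email' ],
where_clause => 'email LIKE "%.edu"',
order_by => 'email' ), "\n";
</pre>
<p> Because the pieces are pasted in as-is, anything that is legal in
the database's WHERE or ORDER BY syntax can be used, which is also why
values that come from users should be checked or quoted before they
are placed in a <b>where_clause</b>. </p>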
<p> Let's add another couple of fields to this index, so we can build
some more interesting iterators. The <i>username</i> element is easy
to add; it's just another field. If we also want to add the <i>zip</i>
of the first <i>address</i>, that's a little harder. The index fields
that we've seen so far map to top-level pieces of a Doc. One way to
get at the <i>zip</i> information we need is to add a method to the
<code>User</code> Def that fetches the <i>zip</i> of the first
<i>address</i>. Here is the new method, along with the expanded index:
</p>
<pre>
<!-- a method to return the zip code of the first address -->
<method>
<name>first_zip</name>
<code>
<![CDATA[ sub { return $_[0]->element('address')->element('zip') } ]]>
</code>
</method>
<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field><name>first_zip</name></field>
</index>
</pre>
<p> Another way, if this method seems not likely to be used except to
build the index table, is to add a <b>code</b> specifier to the field: </p>
<pre>
<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field>
<name>first_zip</name>
<code>
<![CDATA[ sub { return $_[0]->element('address')->element('zip') } ]]>
</code>
</field>
</index>
</pre>
<p> As you can see, this is almost like adding another method to the
Def -- in fact, we didn't change the embedded perl at all. The main
difference is that we're not "cluttering up" the top level of our Def
with a method that will only be used as part of the indexing
operations. The <b>code</b> block is passed two arguments, the
<b>doc</b> being indexed and the <b>index</b> element. It turns out
that you almost always use the doc, and almost never use the
index. </p>
<p> Adding a <b>code</b> block to a <b>field</b> disassociates the
name of the field from the data that it stores. This is, obviously,
useful. It can also be confusing. The default, non-code behavior is
worth sticking to whenever possible, to keep defs and programs as
maintainable as possible. (A <b>code</b> block can also be part of
<b>collection</b> and <b>sort</b> elements, which are described
below.) </p>
<p> And here are a few possible iterators: </p>
<pre>
# find all the users with .edu addresses -- in any order
my $i = $index->iterator ( where_clause => 'email LIKE "%.edu"' );
# find all the users with .edu addresses in Beverly Hills
my $i = $index->iterator ( where_clause => 'email LIKE "%.edu" AND first_zip = "90210"' );
# sort the .edu email addresses by string length in descending order, then
# alphabetically by username (uses mysql's LENGTH function)
my $i = $index->iterator ( where_clause => 'email LIKE "%.edu"',
order_by => 'LENGTH(email) DESC, username' );
</pre>
<h2>Plural Items in Indexes</h2>
<p> The <b>field</b> elements of an index hold values derived from a
Doc's elements and methods. As we've seen, fields can be used to
select sets of records from an index, and to control the order in
which those results are returned. One limitation of using fields in
this way, however, is that each field can only hold a single value per
record. Looked at another way, fields do an excellent job standing in
for singular elements, but are not at all suited to dealing with
plural elements. A field that corresponds to a plural element will
always contain only the value of the first of those elements. </p>
<p> Another type of index element, the <b>collection</b>, is designed
to accommodate plural values, and to allow the kinds of "sorting"
operations that are common to many kinds of documents. Unlike a field,
a collection cannot be used in a where clause; collections are a
special-purpose tool. Let's add a collection to our <code>User</code>
index that will allow us to select all of our records that include an
address with a given zip-code -- any address this time, not just the
first one. </p>
<pre>
<!-- a method to return the zip codes of each address, as an array -->
<method>
<name>zips</name>
<code>
<![CDATA[ sub {
my @zips;
foreach my $addr ( $_[0]->elements('address') ) {
push @zips, $addr->element('zip')->get();
}
return @zips;
} ]]>
</code>
</method>
<!-- the expanded index definition, now including the zips info -->
<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field><name>first_zip</name></field>
<collection><name>zips</name></collection>
</index>
</pre>
<p> We tied the zips collection to a method, but we could just as
easily have tied it to a plural, non-nested element. The collection
isn't particular; it just wants to be handed an array when the index
is updated. </p>
<p> To select all of the users with an address in a given zip code, we
request an iterator qualified by a <b>collection_spec</b>. A
<b>collection_spec</b> is a string of the form
<b><collection_name>:<value></b>. The
<b>collection_spec</b> argument can be combined with the
<b>where_clause</b> and <b>order_by</b> specifiers that we've already
seen: </p>
<pre>
# select all the users with an address in the 20003 zip code -- in any order
my $i = $index->iterator ( collection_spec => 'zips:20003' );
# select the users as above, and order them alphabetically by username
my $i = $index->iterator ( collection_spec => 'zips:20003',
order_by => 'username' );
# select all of the users with an address in 20003 who also have a .edu
# email address, and order them alphabetically by username
my $i = $index->iterator ( collection_spec => 'zips:20003',
where_clause => 'email LIKE "%.edu"',
order_by => 'username' );
</pre>
<h2>Complex Collection Selectors</h2>
<p> It is possible to specify complex collection_spec arguments when
creating an Iterator.
<ul><li> The "pairs" in a spec can be strung together with
<code>AND</code> and <code>OR</code>. </li>
<li> AND'ed and OR'ed phrases can be parenthesized. </li>
<li> Any pair can be prefixed with a <code>NOT</code> to ask for Docs
that do not match the pair. </li>
<li> A pair that includes spaces in its value part can be specified by
surrounding the pair with single quotes. </li>
<li> A pair that includes single quotes in its value part can be
specified by surrounding the pair with single quotes and escaping the
internal single quotes with a single backslash. </li></ul>
For example: </p>
<pre>
# select all of the users with an address in any 2000x OR 0213x zip code
# who also have a .edu email address, and order them alphabetically by
# username
my $i = $index->iterator ( collection_spec => 'zips:2000% OR zips:0213%',
where_clause => 'email LIKE "%.edu"',
order_by => 'username' );
# select users with an address NOT in the 20003 zip code.
my $i = $index->iterator ( collection_spec => 'NOT zips:20003' );
# a hypothetical collection pair with spaces and single quotes. We use
# two backslashes here because the double quotes that surround the whole
# string treat a single backslash as the start of an escape sequence!
my $i = $index->iterator ( collection_spec => "'test:it\\'s easy' OR " .
"'test:it\\'s hard'" );
</pre>
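<p> The quoting rules listed above can be wrapped in a small helper. The
following is only a minimal sketch, not part of Comma:
<code>collection_pair()</code> is a hypothetical function name, and it
handles just the two cases described above (spaces and single quotes in
the value part of a pair). </p>
<pre>
use strict;
use warnings;

# hypothetical helper (not a Comma API): build one "name:value" pair,
# quoting and escaping it only when the value requires it
sub collection_pair {
  my ( $name, $value ) = @_;
  # plain values need no quoting at all
  return "$name:$value" unless $value =~ /[\s']/;
  # escape internal single quotes with a single backslash ...
  ( my $escaped = $value ) =~ s/'/\\'/g;
  # ... and surround the whole pair with single quotes
  return "'$name:$escaped'";
}

my $plain  = collection_pair ( 'zips', '20003' );
my $quoted = collection_pair ( 'test', "it's easy" );
print "$plain\n";    # zips:20003
print "$quoted\n";   # 'test:it\'s easy'
</pre>
<p> Because the helper returns an ordinary string, there is no need to
double the backslashes by hand; the result can be joined to other pairs
with <code>AND</code> or <code>OR</code> and passed as the
<b>collection_spec</b> argument. </p>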
<p> It is fairly easy to create complex specs that slow down database
queries quite a lot. In particular, OR'ing together selections on
large collections is very slow. </p>
<h2>Full Text Search</h2>
<p> A special content-holder is available that enables full-text
search on an index component. We could make all of a
<code>User</code>'s address information searchable by defining a
method to generate a chunk of "address text", then defining an index
<b>textsearch</b> container: </p>
<pre>
<!-- addresses_text: a method to glob all of a User's addresses together into
a single string -->
<method>
<name>addresses_text</name>
<code>
<![CDATA[
sub {
my $self = shift();
my $addr_text = $self->full_name() . "\n";
foreach my $addr ( $self->elements('address') ) {
$addr_text .= $addr->street1() . "\n";
$addr_text .= $addr->street2() . "\n";
$addr_text .= $addr->city() . ' ' . $addr->state() . ' ' . $addr->zip() . "\n";
}
return $addr_text;
}
]]>
</code>
</method>
<!-- the 'main' index, redefined to add full-text search on the addresses -->
<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field><name>first_zip</name></field>
<collection><name>zips</name></collection>
<textsearch><name>addresses_text</name></textsearch>
</index>
</pre>
<p>With the new 'addresses_text' <b>textsearch</b> in place, we can
use the full-text search feature in constructing iterators. A
<b>textsearch_spec</b> argument specifies keywords that must appear in
a record for it to be returned as part of an iterator's result
set: </p>
<pre>
# look up all users who have the word "elm" (or "elms", "elmy", "elmed",
# etc.) in any of their addresses
my $i = $index->iterator ( textsearch_spec=>'addresses_text:elm' );
# look up all users with an 'elm' and a 'springfield' in any
# of their addresses
my $i = $index->iterator ( textsearch_spec=>'addresses_text:elm springfield' );
</pre>
<p> As the comments in the above example imply, the textsearch
subsystem includes a "preprocessor" interface that allows words to be
stemmed and pruned before indexing. The preprocessor defaults to
<b>XML::Comma::Pkg::Textsearch::Preprocessor_En</b>, which handles
English text. It includes a stop list of roughly 500 words, and relies
on the CPAN module <b>Lingua::Stem</b> to do its stemming. </p>
<p> There are currently two other preprocessors in the standard
distribution, <b>Preprocessor_Fr</b> for French and
<b>Preprocessor_Sp</b> for Spanish. If you are only handling
English-language content, you can skip the next three code examples,
which detail how to specify Preprocessors other than
<b>Preprocessor_En</b>. </p>
<p> A textsearch's <b>which_preprocessor</b> element controls which
preprocessor will be used: <b>which_preprocessor</b> should define a
sub that will be passed some combination of four arguments -- an
active $doc; $index and $textsearch objects; and a search "attribute."
The sub must return the name of the Preprocessor package that should
be used. Here is a typical example: </p>
<pre>
<textsearch>
<name>body_text</name>
<which_preprocessor>
sub { return 'XML::Comma::Pkg::Textsearch::Preprocessor_Fr'; }
</which_preprocessor>
</textsearch>
</pre>
<p> Not much to it. </p>
<p> Things get somewhat more complex when we have to choose between
multiple pre-processors on the fly. A Preprocessor is used in two
different contexts: 1) when a doc is indexed, and 2) when a search is
performed. The routine below uses the $doc argument to determine what
Preprocessor to use in the former case, and the $attribute argument in
the latter. </p>
<pre>
<textsearch>
<name>paragraph</name>
<which_preprocessor>
<![CDATA[
use XML::Comma::Pkg::Textsearch::Preprocessor_En;
use XML::Comma::Pkg::Textsearch::Preprocessor_Sp;
sub {
my ( $doc, $index, $ts, $attribute ) = @_;
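# either argument may be missing, depending on the context
# (indexing vs. searching), so check before using it
if ( ( $doc and $doc->lang_code() eq 'sp' ) or
( defined $attribute and $attribute eq 'sp' ) ) {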
return 'XML::Comma::Pkg::Textsearch::Preprocessor_Sp';
} else {
return 'XML::Comma::Pkg::Textsearch::Preprocessor_En';
}
}
]]>
</which_preprocessor>
</textsearch>
</pre>
<p> The <b>$attribute</b> argument's value comes from the
<b>textsearch_spec</b>, which has a special form for just this
purpose: </p>
<pre>
# look up a word in the index, stemmed by the Spanish pre_processor
my $i = $index->iterator ( textsearch_spec=>'body_text{sp}:lobos' );
</pre>
<p> The extra bit of text after the textsearch name, enclosed in curly
brackets, is stripped off and passed as the <b>$attribute</b> to the
<b>which_preprocessor</b> sub. </p>
<p> The back end of the <b>textsearch</b> facility is currently
implemented on top of, and as part of, the Comma database-specific
modules. Its efficiency is only mediocre, and performing the
indexing operation on each document write is somewhat
resource-intensive. Because of this, a <b>textsearch</b> can specify
that its operations should be deferred -- performed as a batch rather
than on each and every update of the index. A cron job or application
hook can be written to call an index's
<b>sync_deferred_textsearches()</b> method at some convenient time (or
at some regular interval). </p>
<pre>
<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field><name>first_zip</name></field>
<collection><name>zips</name></collection>
<textsearch>
<name>addresses_text</name>
<defer_on_update>1</defer_on_update>
</textsearch>
</index>
</pre>
<p> A number of important features are missing from the current
implementation of full-text search: support for more languages, the
ability to search for phrases within text, boolean OR'ing, etc. The
strengths of the current implementation are that the full-text search
is fully integrated with the rest of the database system, so complex
iterators that include several different kinds of qualifiers can
easily be constructed; and that the storage overhead is relatively
small (only the inverted index is stored in the database, and that in
a compressed form). </p>
<p> Work to improve the <b>textsearch</b> framework is certainly an
area of interest for the Comma developers. It is likely that
integration with database-provided full-text search capabilities is
the best long-term option for fast, robust operation. Oracle certainly
provides such capabilities. For the moment, the open source databases
lag behind in this area. </p>
<h2>Using an Iterator Over and Over: iterator_refresh()</h2>
<p> It is often convenient to "reuse" an iterator. The
<b>iterator_refresh()</b> method re-fills and resets the iterator. In
its no-argument form <b>iterator_refresh()</b> is equivalent to asking
the index for a new iterator with exactly the same
specifications. However, the method also takes two optional arguments
to limit the total number and the starting position of the results
that are returned. Here are some examples: </p>
<pre>
# usage: $iterator->iterator_refresh ( [ limit_number [, limit_offset ]] );
## simple refresh of a once-used iterator
#
my $i = $index->iterator ( collection_spec => 'zips:20003' );
while ( $i++ ) {
# ... do some stuff
}
$i->iterator_refresh();
# now we can loop through again
while ( $i++ ) {
# .. do some other stuff
}
## using iterator_refresh() to process only the first 10 results of a set
#
my $i = $index->iterator()->iterator_refresh ( 10 );
while ( $i++ ) {
# ... do something with the first 10 (or fewer, if there weren't
# even that many)
}
## using iterator_refresh() to process the eleventh through fifteenth
## results (noting that the second argument, the offset, is zero-indexed)
#
my $i = $index->iterator()->iterator_refresh ( 5, 10 );
while ( $i++ ) {
# ... do something with these five results (again, assuming that
# there are that many)
}
</pre>
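<p> The two arguments map naturally onto "pages" of results. As a
minimal sketch (the <code>page_args()</code> helper is hypothetical, not
part of Comma), a one-based page number and a page size translate into a
limit and a zero-indexed offset like this: </p>
<pre>
use strict;
use warnings;

# hypothetical helper: turn a one-based page number and a page size
# into the ( limit_number, limit_offset ) pair described above
sub page_args {
  my ( $page, $per_page ) = @_;
  return ( $per_page, ( $page - 1 ) * $per_page );
}

# page 3 at five results per page covers the eleventh through
# fifteenth results: a limit of 5 and an offset of 10
my ( $limit, $offset ) = page_args ( 3, 5 );
print "limit=$limit offset=$offset\n";

# with a real iterator, this would be used as:
#   $i->iterator_refresh ( page_args ( 3, 5 ) );
</pre>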
<h2>Fetching the Record's Doc: read_doc() and retrieve_doc()</h2>
<p> Generally, an index should be designed so that its fields hold the
most commonly-used pieces of information in a Doc. Of course, any
criterion that will be used to select from an index must be available
as a field or collection. Additionally, any part of a Doc that is
regularly used during an iteration should also be defined as a
field. </p>
<p> But sometimes you actually need to get the Doc itself from an
iterator -- perhaps to do some complex read operation, or to check the
content of an element that is so infrequently used that it makes
little sense to include it as a field, or even to change the Doc and
re-store it. Two iterator methods make this possible:
<b>read_doc()</b> and <b>retrieve_doc()</b>. </p>
<p> As the names suggest, <b>read_doc()</b> is analogous to
<b>Doc->read()</b> and fetches a read-only copy of the document, while
<b>retrieve_doc()</b> is like <b>Doc->retrieve()</b>, returning a
fully modifiable Doc. </p>
<pre>
# print a simple list indicating how many addresses each User has defined
my $i = $index->iterator();
while ( $i++ ) {
print $i->doc_key() . ': ' . scalar @{$i->read_doc()->elements('address')} . "\n";
}
# permanently delete (from the store that this index is tied to)
# all documents with a .edu email address
my $i = $index->iterator ( where_clause => 'email LIKE "%.edu"' );
while ( $i++ ) {
print "deleting " . $i->doc_key() . "\n";
$i->retrieve_doc()->erase();
}
</pre>
<h2>Fetching One Record: single() and Company</h2>
<p> An iterator retrieves a set of records, in a particular
order. Sometimes you only want one record from an index. The
<b>single()</b> method accepts the same arguments as
<b>iterator()</b>, but it never returns more than one record, and if
no records satisfy its specification it returns
<code>undef</code>. </p>
<p> Usually, <b>single()</b> is used when you know there will only be
one record in the index that matches your selection criteria. For
example, we could write a <b>pre_store_hook</b> to make sure that no
document is ever stored that has the same email address as a document
that is already present. (See the section on advanced store techniques
for more information about store hooks.)</p>
<pre>
<pre_store_hook>
<![CDATA[
sub {
my ( $self, $store ) = @_;
my $email = $self->element('email')->get();
my $index = $self->def()->get_index('main');
if ( $index->single(where_clause => "email = '$email'") ) {
die "the email address '$email' is already in use\n";
}
}
]]>
</pre_store_hook>
</pre>
<p> The <b>single()</b> method isn't strictly necessary (you can
always substitute some equivalent, if longer and more involved,
iterator creation and refresh statement), but it does save some typing
and makes code more readable. In the same spirit, two more methods
exist that provide additional short-cuttage: <b>single_read()</b> and
<b>single_retrieve()</b>. </p>
<p> As their names suggest, <b>single_read()</b> is a <b>single()</b>
call plus (if possible) a <b>read_doc()</b>, and
<b>single_retrieve()</b> is the same except with
<b>retrieve_doc()</b>. Both methods return <code>undef</code> if there
is no record in the index that matches the iterator criteria. We can
(somewhat frivolously) modify our <b>pre_store_hook</b> to demonstrate
the use of <b>single_read()</b>: </p>
<pre>
<pre_store_hook>
<![CDATA[
sub {
my ( $self, $store ) = @_;
my $email = $self->element('email')->get();
my $index = $self->def()->get_index('main');
if ( my $other_user = $index->single_read(where_clause => "email = '$email'") ) {
my $other_users_location = $other_user->element('address')->element('zip')->get();
die "the email address '$email' is already in use by someone in zip code $other_users_location\n";
}
}
]]>
</pre_store_hook>
</pre>
<h2>Getting a Total Rather Than an Iterator: count()</h2>
<p> One more index method exists that, like <b>single()</b>,
<b>single_read()</b> and <b>single_retrieve()</b>, accepts the same
arguments as <b>iterator()</b>: <b>count()</b>. The <b>count()</b>
method returns the total number of records that match the supplied
criteria. </p>
<pre>
# how many Users have a .edu email address?
my $total = $index->count ( where_clause => 'email LIKE "%.edu"' );
</pre>
<h2>Using SQL Aggregates: aggregate ( function => ... )</h2>
<p> It is possible to get information from an Index using the
"aggregate" functions provided by SQL. What functions are available
(and how they work) are highly database-dependent. But, in general, it
is possible to ask a Comma Index to create an <b>aggregate()</b>
object in much the same way as an <b>iterator()</b> is created. </p>
<pre>
# ask for the sum of all the 'mailings_sent' columns in a
# hypothetical index
my $sum = $index->aggregate ( function => "SUM(mailings_sent)" );
# ask for the average age of all users in the 20003 zip code
my $avg_age = $index->aggregate ( function => "AVG(age)",
collection_spec => 'zips:20003' );
</pre>
<h2>Actions at Index Time: index_hook</h2>
<p> An <b>index_hook</b> is a hook that is run each time a record is
added or updated, just before any changes are made to the
database. Any number of these hooks can be defined for an index; they
will be run in the order in which they appear in the Def. Two
arguments are passed to the <b>index_hook</b> when it is invoked: the
Doc being indexed and the index object. The return value of the hook
is not used, but if the hook <code>die</code>s then the indexing
operation is silently aborted. </p>
<p> Here is our index definition with a hook that prevents any ".edu"
users from appearing in an index: </p>
<pre>
<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field><name>first_zip</name></field>
<collection><name>zips</name></collection>
<index_hook>
<![CDATA[ sub {
my ( $doc, $index ) = @_;
die if $doc->element('email')->get() =~ /\.edu$/;
} ]]>
</index_hook>
</index>
</pre>
<h2>Defining Methods for an Index</h2>
<p> You can add some flexibility to an index by defining <b>index
methods</b>, pieces of code that can be called from an Iterator in the
same way that element methods can be called from a Doc or
element. Index methods are written for much the same reasons as
element methods: to "standardize" a commonly-used piece of
functionality, or to process information in a way that depends on some
dynamic input (such as the time of day). </p>
<p> Here is our index definition with a fairly trivial index method
added that simply checks whether a record's email address is in a
hard-coded list: </p>
<pre>
<index>
<name>main</name>
<field><name>email</name></field>
<field><name>username</name></field>
<field><name>first_zip</name></field>
<collection><name>zips</name></collection>
<method>
<name>in_denied_list</name>
<code>
<![CDATA[ sub {
my @denied_list = ( 'spammer@spamming-org.org',
'impolite_person@unresponsize_isp.com' );
my ( $iterator, @rest_of_args ) = @_;
foreach my $denied ( @denied_list ) {
return 1 if $iterator->email() eq $denied;
}
return 0;
} ]]>
</code>
</method>
</index>
</pre>
<p> And here is a simple scan through the entire index, checking if
each record is in the denied list or not: </p>
<pre>
my $i = $index->iterator();
while ( $i++ ) {
if ( $i->in_denied_list() ) {
print $i->username() . " is in the denied-list\n";
} else {
print $i->username() . " is clean as a whistle\n";
}
}
</pre>
<h2>More Configuration, More SQL</h2>
<p> An index is stored, behind the scenes, as a set of tables in a
relational database. The Comma indexing framework provides a layer of
abstraction on top of the database, and most of the time a programmer
doesn't need to worry about the underlying implementation. But there
are a few limitations that one should be aware of, and a few
configuration directives that can make indexes more efficient and
usable. </p>
<h3>Configuring Fields</h3>
<p> Each piece of information stored in a database must have a
<i>type</i>. Every field in a Comma index defaults to the
<b>VARCHAR(255)</b> type. It is necessary to specify a different type
if a field needs to
<ul>
<li> accommodate values longer than 255 characters, or </li>
<li> use a more space-efficient representation, or </li>
<li> support comparisons or operations (in where clauses, for example)
other than those that VARCHAR provides. </li>
</ul>
A <b>sql_type</b> element can be used to specify the SQL type for a
field. Here is an index definition with every field fully specified as
to type: </p>
<pre>
<index>
<name>main</name>
<field>
<name>email</name>
<sql_type>VARCHAR(150)</sql_type>
</field>
<field>
<name>username</name>
<sql_type>VARCHAR(40)</sql_type>
</field>
<field>
<name>first_zip</name>
<sql_type>CHAR(5)</sql_type>
</field>
<collection><name>zips</name></collection>
</index>
</pre>
<p> The <b>doc_id_sql_type</b> element can be used to specify the SQL
type for the columns that store the <b>doc_id</b> parts of
records. This also defaults to <b>VARCHAR(255)</b>, and that's usually
fine, but for particular efficiency it too can be changed. <b>NOTE:
</b> unlike most other index elements, the <b>doc_id_sql_type</b> for
an index should not be altered after an index is created the first
time. If you need to change the <b>doc_id_sql_type</b>, you'll need to
drop and rebuild the index. Collections can also specify a
<b>sql_type</b>, which will be used only by 'binary table' type
collections. Like the <b>doc_id_sql_type</b>, this specifier should
not be changed for an existing index. </p>
<p> Another rarely-used specifier is the
<b>store</b> element. By convention, an index shares its name with the
store to which it is bound (remember, an index can only hold
information about documents from a single store). But the convention
is sometimes too restrictive -- for example, two indexes using the
same store obviously can't have the same name. The <b>store</b>
element allows an index to explicitly state with which store it is
associated. Here is an index definition that uses both the
<b>store</b> and <b>doc_id_sql_type</b> specifiers: </p>
<pre>
<index>
<name>just_emails</name>
<store>main</store>
<doc_id_sql_type>CHAR(12)</doc_id_sql_type>
<field>
<name>email</name>
<sql_type>VARCHAR(150)</sql_type>
</field>
</index>
</pre>
<p> The SQL type specifiers are passed directly to the relational
database. Databases differ in how they define (and name) even
commonly-used types, so which types are available from Comma will
obviously depend on which database is being used. The Comma database
adapters handle typing issues for the columns that are used internally
as part of the abstraction layer, but when you start specifying SQL
types for document ids and index fields, you're on your own. </p>
<h3>Configuring Collections</h3>
<p> Collection information can be stored in one of three ways: </p>
<ul>
<li>in a <b>stringified</b> form in the main Index table (the default)</li>
<li>in a <b>binary table</b> containing a list of doc_id/value pairs</li>
<li>in <b>many tables</b>, each containing a list of doc_ids </li>
</ul>
<p> Each of these has advantages and disadvantages. </p>
<p> The <b>stringified</b> type is the simplest. No extra database
tables are created, and the values of the collection are available
directly from the Iterator. Partial matches (using the SQL wildcard
character <b>%</b>) that are anchored at either the front or back of a
string are allowed. </p>
<p> The <b>binary table</b> type requires one additional table to be
maintained in the database. This type of collection has the potential
to provide the best mixture of flexibility and speed, although the
current implementation is less advanced than it should be. A number of
optimizations that would improve the speed of this type of storage are
yet to be performed, and the MySQL database doesn't support some types
of queries that would be particularly useful in this regard. Partial
matches are allowed, but NOT pairs are not. And any given
binary-table-typed collection may be used only once in a
collection_spec. </p>
<p> The <b>many tables</b> type requires a table to be maintained in the
database for each unique value that occurs in the
collection. Collections of this type are usually tied to elements that
are defined by an <b>enum</b> (or are similarly restricted in their
allowed values). </p>
<p> This type is not very flexible, but can provide the quickest
select times for many applications. Because <b>many tables</b> data is
spread across several database tables, it is often possible for the
system to take shortcuts that are not possible for <b>stringified</b>
or <b>binary table</b> collections. And each of the tables in a
collection of this type can be <b>clean()</b>ed separately (see below
for an explanation of <code><clean></code> clauses), further
reducing the amount of data that must be sifted through to create each
Iterator. Partial matches are not allowed for many-tables-typed
collections, but NOT pairs are. </p>
<p> Here are three collections that will hold the same data but store
that data differently behind the scenes (with <clean> behavior
specified for the <b>many tables</b> version, for good measure): </p>
<pre>
<collection>
<name>zips_str</name>
<code><![CDATA[ sub { $_[0]->zips() } ]]></code>
</collection>
<collection>
<name>zips_bt</name>
<type>binary table</type>
<code><![CDATA[ sub { $_[0]->zips() } ]]></code>
</collection>
<collection>
<name>zips_mt</name>
<type>many tables</type>
<code><![CDATA[ sub { $_[0]->zips() } ]]></code>
<clean>
<to_size>10_000</to_size>
<size_trigger>10_500</size_trigger>
<order_by>doc_id DESC</order_by>
</clean>
</collection>
</pre>
<h3>The fields=> Argument</h3>
<p> One more way to increase the speed of Iterator creation is to ask for only a subset of the fields defined in the Index: </p>
<pre>
# we only need email addresses
my $iterator = $index->iterator ( fields => [ 'email' ] );
</pre>
<p> This reduces the amount of data that the database must shuffle
around when joining tables together. It generally only helps when used
in conjunction with many-tables- and binary-table-type collections,
and on Indexes that define many fields. But in some cases the speed
increases can be dramatic: 20% or more. </p>
<p> The <code>fields=></code> argument should take the form of a
reference to an array of field names. The resulting Iterator will only
have data for the <code>doc_id</code> and
<code>record_last_modified</code> pseudo-fields, and for the fields
named in the argument array ref. </p>
<h2>Changes: Automatic Updating of Database Structure</h2>
<p> Comma does its best to adjust the table structures of the database
to match any changes that are made to an index definition. Mostly,
this is possible. You can always add or remove a collection or
field. You can usually change the SQL type of a field -- only the
willingness of the database to convert already-stored information from
the old type to the new limits this. </p>
<p> You <b>cannot</b> change the <b>doc_id_sql_type</b> without
dropping and rebuilding the index. </p>
<h2>Clean and Rebuild</h2>
<p> Left to their own devices, indexes will grow as new records are
added. But larger indexes can be slower to query, or take up too much
disk space. And some indexes are designed specifically to contain only
a subset of documents: the <i>n</i> most recently-updated, for
example, or all of the documents that haven't yet been tagged
"archive." </p>
<p> You have already seen how to write an <b>index_hook</b> that
prevents documents from being added to the index. But to dynamically
scan an index and remove records requires a different approach. </p>
<p> The <b>clean()</b> method triggers the "cleaning" of an
index. During a clean, records are deleted according to the criteria
listed in the index definition's <b>clean</b> section. Here is a
simple example index definition that uses a clean section to delete
records that haven't been "used" in 30 days: </p>
<pre>
<index>
<name>main</name>
<field><name>username</name></field>
<field><name>email</name></field>
<field><name>last_used</name></field>
<clean>
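<!-- assumes last_used holds a Unix timestamp; UNIX_TIMESTAMP() is a MySQL function -->
<erase_where_clause>last_used &lt; UNIX_TIMESTAMP() - 60*60*24*30</erase_where_clause>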
</clean>
</index>
</pre>
<p> The <b>erase_where_clause</b> contains a bit of SQL that will be
passed to the database. Any doc record that satisfies the
<b>erase_where_clause</b> will be removed from the index. Sometimes
it's useful to construct the <b>erase_where_clause</b> using a bit of
perl. If the first character is a <b>{</b>, the content of the
<b>erase_where_clause</b> specifier is eval'ed before being passed to
the database: </p>
<pre>
<clean>
<erase_where_clause><![CDATA[
{ 'record_last_modified < ' . (time() - 60*60*24) }
]]></erase_where_clause>
</clean>
</pre>
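<p> To see what the database actually receives, the same kind of
expression can be evaluated on its own. This sketch assumes a one-day
cutoff and that <code>record_last_modified</code> holds a Unix timestamp,
which is what the comparison against <code>time()</code> implies: </p>
<pre>
use strict;
use warnings;

# build the clause the same way the eval'ed block would
my $one_day = 60*60*24;                      # seconds in a day
my $clause  = 'record_last_modified < ' . ( time() - $one_day );

# prints something like: record_last_modified < 1700000000
print "$clause\n";
</pre>
<p> The point of the eval'ed form is that the cutoff is recomputed each
time the clause is evaluated, rather than being fixed when the Def is
written. </p>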
<p> The <b>clean</b> method can, of course, be called just like any
other, but it is common to use a cron job to clean a database each
night, or each week. </p>
<pre>
# simply call clean() to clean an index
$index->clean();
# perhaps a command-line version from within a cron (or similar) job
perl -MXML::Comma -e 'XML::Comma::Def->read(name=>"User")->get_index("main")->clean();';
</pre>
<p> The clean operation is as careful as possible about running in the
background: records added while a clean is in progress are ignored by
the clean, and only one clean runs at a time (if <b>clean()</b> is
called while another clean is already in progress, it simply returns). </p>
<p> The <b>erase_where_clause</b> is one style of clean, but there is
another. A clean can define a <b>to_size</b> element to trim an index
down to a certain number of records, and specify an <b>order_by</b>
clause to make sure that the list of records is arranged in the
correct order before being trimmed. (If no <b>order_by</b> is given,
the index's <b>default_order_by</b> is used, which itself defaults to
<b>doc_id</b>.) Here is a clean section that keeps only the 1000
most-recently-used records: </p>
<pre>
<clean>
<to_size>1000</to_size>
<order_by>last_used DESC</order_by>
</clean>
</pre>
<p> An alternative to calling <b>clean()</b> manually (or from a cron
job) is to configure cleaning to take place automatically when an
index contains a certain number of records. If a <b>size_trigger</b>
element is present, Comma will check the size of the index after each
<b>insert()</b> and if that size equals or exceeds the number given in
the <b>size_trigger</b> specifier then a clean will be triggered. </p>
<pre>
<clean>
<to_size>1000</to_size>
<order_by>last_used DESC</order_by>
<size_trigger>1200</size_trigger>
</clean>
</pre>
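<p> The interaction between <b>to_size</b>, <b>order_by</b> and
<b>size_trigger</b> can be illustrated with plain Perl data. This is
only a sketch of the rule as described above, using a hypothetical
in-memory list of records; it says nothing about how Comma implements
the clean internally. </p>
<pre>
use strict;
use warnings;

# hypothetical stand-in for index records: five docs, the higher
# doc_ids more recently used
my @records = map { +{ doc_id => $_, last_used => 100 + $_ } } 1 .. 5;

sub maybe_clean {
  my ( $records, $to_size, $size_trigger ) = @_;
  # nothing happens until the index reaches the trigger size
  return @$records if @$records < $size_trigger;
  # order_by "last_used DESC", then keep only the first $to_size records
  my @sorted = sort { $b->{last_used} <=> $a->{last_used} } @$records;
  return @sorted[ 0 .. $to_size - 1 ];
}

# to_size of 3, size_trigger of 5: the fifth record triggers a clean
my @kept = maybe_clean ( \@records, 3, 5 );
my $ids  = join ', ', map { $_->{doc_id} } @kept;
print "$ids\n";    # 5, 4, 3
</pre>
<p> Setting the trigger somewhat higher than the target size, as in the
example above, means a clean runs only occasionally rather than on every
insert once the index is full. </p>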
<p> A <b>clean</b> section can also apply to an individual collection,
rather than to the entire index. In this case, the clean affects only
a table of pointers holding collection information, not the main
records themselves. Configuring an index so that its collection tables
stay reasonably small can dramatically improve performance. </p>
<p> Sometimes, even a clean isn't clean enough. If several fields
are added to an existing index, or the criteria for indexing change
radically, or a problem with an index is identified, it can be
necessary to <b>rebuild()</b> the index. </p>
<p> A rebuild can be done starting from scratch (after dropping any
existing index tables), or on a fully-populated index. In either case,
the rebuild will add and update information in the database while it
runs, then clean the index after it is finished -- so an index can be
used while a rebuild is in progress. Care should be taken, however,
not to stop a rebuild operation before it has completed. </p>
<p> As might be expected, the <b>rebuild()</b> method is used to
trigger a rebuild operation. During a rebuild, Comma iterates
backwards through all of the documents in a store, calling
<b>index_update()</b> on each Doc. All of the normal rules apply, so
Docs that would not be added to an index by an explicit call to
<b>index_update()</b> are not added by a <b>rebuild()</b>, and all
hooks are run as normal for each Doc. </p>
<p> It can take a long time to handle all of the documents in a large
store, and often an index will only want to treat a subset of
documents. The <b>stop_rebuild_hook</b> allows some "stop-now-if"
logic to be inserted into the rebuild cycle. This hook was designed to
gracefully handle the common case of an index that does not include
documents that are "older" than a certain cut-off age. (Which is also,
of course, why the rebuild cycle iterates <i>backwards</i> through a
store. To be of much use, a <b>stop_rebuild_hook</b> must be used in
conjunction with a store that sorts doc ids by criteria that are
roughly similar to the criteria that the <b>stop_rebuild_hook</b>
cares about. But you already knew that.) </p>
<p> The <b>stop_rebuild_hook</b> is passed two arguments, the doc in
question and the index object, and should return true if the rebuild
should stop cycling through the storage documents and move on to its
cleanup phase. </p>
<pre>
<index>
<name>new_users</name>
<store>sequential_user_id_store</store>
<field>username</field>
<field>email</field>
<field>created_timestamp</field>
<stop_rebuild_hook>
<![CDATA[ sub {
my ( $doc, $index ) = @_;
my $age = time() - $doc->created_timestamp();
return $age > (60*60*24*90);
} ]]>
</stop_rebuild_hook>
</index>
</pre>
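<p> The interaction between backwards iteration and the stop hook can
be tried out without Comma at all. The following is a minimal sketch:
the hashes in <code>@docs</code> are hypothetical stand-ins for Docs
in a sequential store (oldest first), and the loop only models the
documented rebuild cycle. </p>
<pre>
use strict;
use warnings;

my $now = time();
my $day = 60 * 60 * 24;

# ten fake docs: id 1 is 180 days old, id 10 was created just now
my @docs = map { +{ id =&gt; $_, created =&gt; $now - ( 200 - $_ * 20 ) * $day } }
           1 .. 10;

# same test as the hook above: stop at the first doc older than 90 days
my $stop_hook = sub {
  my ( $doc ) = @_;
  return ( $now - $doc-&gt;{created} ) &gt; 90 * $day;
};

my @indexed;
for my $doc ( reverse @docs ) {          # newest first
  last if $stop_hook-&gt;( $doc );
  push @indexed, $doc-&gt;{id};
}

print "indexed: @indexed\n";
</pre>
<p> Only the five newest docs are visited; the five older ones are
never read, which is the whole point of the hook. </p>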
<h1>Storage in More Detail: Hooks, Output Filters and Location Modules</h1>
<h2>Hooks: pre_store_hook, post_store_hook, erase_hook</h2>
<p>Three types of hooks are available to run during store
operations. Just before a document is stored, each defined
<b>pre_store_hook</b> runs. Three arguments are passed: the doc being
stored and (though these are rarely needed) the store definition and a
hashref containing the args that were passed to the original store()
call. (This access to the original %args allows hook code to take
actions based on programmer-defined args. You can pass anything you
want as part of the store args -- it's completely open-ended.) If any
of the <b>pre_store_hooks</b> <code>die</code> then the store
operation is aborted, the remaining hooks are ignored, and a
<code>STORE_ERROR</code> is thrown. </p>
<p> Just after a doc is stored, each defined <b>post_store_hook</b>
runs, again taking the doc in question, the store definition and the
args hashref as its arguments. Every <b>post_store_hook</b> runs and
any errors that are thrown are ignored. </p>
<p> As a side note: very occasionally you may want to modify the doc
itself inside a <b>post_store_hook</b>, and then to save those
changes. (You might do this if, for example, you need to have the
doc's id available before your code can run, which requires that the
code be installed as a post rather than pre store_hook.) A special
flag for the <b>store()</b> method -- <b>no_hooks=>1</b> -- is
available to allow a <b>store()</b> to be performed without its
attendant pre and post hooks. This is obviously a power that should
not be abused. </p>
<p> Before a doc is erased, each defined <b>erase_hook</b> runs. Three
arguments are passed to an <b>erase_hook</b>: the doc being erased,
the store definition, and the doc's doc_location. If any
<b>erase_hook</b> dies the erase operation is aborted (and no more
hooks are run). It's worth noting that an erase operation happens --
and the erase_hooks are run -- during a <b>move()</b> as well as
during a simple <b>erase()</b>. </p>
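<p> The different failure rules for pre and post hooks are easy to
mix up, so here is a stand-alone sketch that models them. The
<code>run_store</code> function is a hypothetical stand-in written for
this example, not Comma's code, and the "doc" is just a hash. </p>
<pre>
use strict;
use warnings;

# hypothetical runner that follows the documented rules only
sub run_store {
  my ( $doc, $store, $args, $pre, $post ) = @_;
  for my $hook ( @$pre ) {
    # a dying pre hook aborts the store and skips the remaining hooks
    eval { $hook-&gt;( $doc, $store, $args ); 1 }
      or die "STORE_ERROR -- $@";
  }
  $doc-&gt;{stored} = 1;                           # stand-in for the write
  for my $hook ( @$post ) {
    eval { $hook-&gt;( $doc, $store, $args ) };    # errors are ignored
  }
  return $doc;
}

my @log;
my $pre  = [ sub { die "no title\n" unless $_[0]{title} } ];
my $post = [ sub { die "ignored\n" }, sub { push @log, "post ran" } ];

my $ok  = run_store( { title =&gt; 'x' }, 'main', {}, $pre, $post );
my $err = '';
eval { run_store( {}, 'main', {}, $pre, $post ); 1 } or $err = $@;

print "stored: $ok-&gt;{stored}; log: @log; error: $err";
</pre>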
<h2>More on Location Chains</h2>
<p> A store definition's location chain controls how a doc is written
out to long-term storage. A location chain must generate both a
storage "location" and a document id. The current location modules all
use the filesystem, but -- in principle at least -- the Store interface
is abstract enough that modules could equally well use a database, a
tape drive, or a networked archive of some kind. </p>
<p> Location modules come in two kinds, which betray their
file-system-centric roots by being called <i>_dir</i> and
<i>_file</i>. The _dir modules specify "directory" locations; a _dir
module cannot be used as the final link in a location chain. The _file
modules specify the "file" portion of a location; a _file module can
only be used as the final link in a chain. </p>
<h3>Standard _dir Modules</h3>
<p> <b>Sequential_dir</b> creates sequentially-numbered
directories. It takes two arguments, both of which are optional:
<b>max</b> specifies the highest legal number in the sequence, and
defaults to 9999; <b>digits</b> specifies how the number will be
formatted, and defaults to normal decimal notation. The <b>max</b>
number determines how many entries the <b>Sequential_dir</b> can
hold. Once the <b>max</b> number has been reached, all subsequent
store attempts throw an error. The directory name will be created by
formatting the current sequence number according to the <b>digits</b>
specifier, and padding the formatted string with leading zeros, so
that alpha-numeric sorting is possible. (Note that if the <b>max</b>
or <b>digits</b> specifier is changed, the width or format of
subsequently-created directory names could change, possibly ruining
the sort.) Behind the scenes, <b>Sequential_dir</b> uses the
<b>Math::BaseCalc</b> module to do the formatting, and the
<b>digits</b> argument accepts any digit set that
<b>Math::BaseCalc</b> understands. For example: </p>
<pre>
<store>
<name>hex_directory_plus_abc_filenames</name>
<base>strange/storage/base</base>
&lt;location&gt;Sequential_dir: 'digits', 'hex', 'max', 999&lt;/location&gt;
<location>Sequential_file: 'digits', ['a','b', 'c'], 'max', 4</location>
</store>
</pre>
<p> Note that some (one is tempted to qualify them as perverse)
digit-set choices will render alpha-numeric sorting useless for
reconstructing the order in which the directories were created. The id
fragment generated by the <b>Sequential_dir</b> module is simply the
directory name.</p>
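<p> To get a feel for the names that result, the formatting and
padding can be approximated in a few lines of Perl. This sketch uses a
hand-written <code>to_base</code> helper in place of
<b>Math::BaseCalc</b> so that it runs with core Perl only, and it
assumes that names are padded with the digit set's first ("zero")
digit to the width of the formatted <b>max</b>; both helpers are
hypothetical, not part of Comma. </p>
<pre>
use strict;
use warnings;

# convert a non-negative integer to a string over an arbitrary digit set
sub to_base {
  my ( $n, @digits ) = @_;
  my $str = '';
  do {
    $str = $digits[ $n % @digits ] . $str;
    $n   = int( $n / @digits );
  } while ( $n &gt; 0 );
  return $str;
}

# pad to the width of the formatted max (an assumption, see above)
sub dir_name {
  my ( $n, $max, @digits ) = @_;
  my $width = length( to_base( $max, @digits ) );
  my $name  = to_base( $n, @digits );
  return ( $digits[0] x ( $width - length( $name ) ) ) . $name;
}

my @hex = ( 0 .. 9, 'a' .. 'f' );
print dir_name( 255, 999, @hex ), "\n";          # hex digits, max 999
print dir_name( 1, 4, 'a', 'b', 'c' ), "\n";     # digits a,b,c, max 4
</pre>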
<p> <b>GMT_3layer_dir</b> creates a directory structure derived from
the current date. As the name implies, three directory layers are
created, in the pattern <code>YYYY/MM/DD</code>. The id fragment that
this module generates is the directory structure without internal
separators: <code>YYYYMMDD</code>. </p>
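<p> The two strings can be reproduced with <code>gmtime</code> and
<code>POSIX::strftime</code>. This is an illustration of the naming
pattern only, not the module's code; a fixed epoch value is used here
so that the example is repeatable, where the module uses the current
date. </p>
<pre>
use strict;
use warnings;
use POSIX qw( strftime );

my @gmt      = gmtime( 0 );                    # fixed time for the example
my $dir      = strftime( '%Y/%m/%d', @gmt );   # directory layers
my $fragment = strftime( '%Y%m%d', @gmt );     # id fragment

print "$dir $fragment\n";
</pre>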
<p> <b>Derived_dir</b> creates a directory from the contents of a doc
element or the return value of a method. The <b>derive_from</b>
argument is required, and specifies the element or method that will be
called to generate the directory name -- this works like the shortcut
syntax: <code>$doc-><derive_from>()</code>. The <b>width</b>
argument is also required, and specifies the number of characters that
will be in the directory name. The <b>derive_from</b>'ed value will be
left-padded with zeros if it is shorter than <b>width</b>, or
truncated if it is longer. The id fragment generated by
<b>Derived_dir</b> is the same as the directory name. </p>
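<p> The padding and truncation rule can be sketched as follows. The
<code>derived_dir_name</code> function is hypothetical, and it assumes
that truncation keeps the leading characters of the value; the text
above does not say which end is cut, so check the module source if
that matters for your data. </p>
<pre>
use strict;
use warnings;

sub derived_dir_name {
  my ( $value, $width ) = @_;
  # too long (or exactly right): keep the first $width characters
  return substr( $value, 0, $width ) if length( $value ) &gt;= $width;
  # too short: left-pad with zeros
  return ( '0' x ( $width - length( $value ) ) ) . $value;
}

print derived_dir_name( '42', 6 ), "\n";
print derived_dir_name( 'alphabet', 4 ), "\n";
</pre>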
<h3>Standard _file Modules</h3>
<p> <b>Sequential_file</b> creates unique, sequentially-numbered
files. It takes three arguments, all of which are optional. The
<b>max</b> and <b>digits</b> arguments operate as described above for
<b>Sequential_dir</b>. An additional argument, <b>extension</b>,
specifies what extension should be added to the filename. The
extension should include the separator character (a period is the most
common separator), and defaults to <code>.comma</code>. The id
fragment that <b>Sequential_file</b> generates is the filename,
stripped of its extension. </p>
<p> <b>Derived_file</b> creates a filename from the contents of a doc
element or the return value of a method. The <b>derive_from</b>
argument is required, and specifies the element or method that will be
called to generate the filename -- this works like the shortcut
syntax: <code>$doc-><derive_from>()</code>. The optional
<b>extension</b> argument defaults to <code>.comma</code>. Here is an
example: </p>
<pre>
<store>
<name>main</name>
<location>Derived_file: 'derive_from','foo', 'extension', '.xml'</location>
</store>
</pre>
<p> <b>Read_only_file</b> is used for collections of documents that
will be maintained by some non-Comma tool. Parts of Comma use
<b>Read_only_file</b> to read config files, which are always edited by
hand. Trying to write to a location chain that uses
<b>Read_only_file</b> will result in an error. The optional
<b>extension</b> argument specifies the extension that is present on
the files, and defaults to <code>.comma</code>. Here is the definition
used by <i>HTTP_Upload_Config</i>, part of the <i>HTTP_Upload</i>
package: </p>
<pre>
<store>
<name>main</name>
<base>config</base>
<location>Read_only_file:'extension','.config'</location>
</store>
</pre>
<h2>Index_Only Storage</h2>
<p> A special location module exists for documents that should only be
indexed, not stored. </p>
<p> It can be convenient to use the Comma API to manipulate data that
doesn't need to have the longevity or additional features that come
with file-system storage. The development of this functionality was
motivated by a desire for what might be called "sortable short-term
log files." For example, say you're tracking web page accesses by
registered users, and the questions your code needs to ask mostly look
something like "how many pages has this user visited in the last 10
minutes?" Updating a traditional, full-fledged Doc on each page view
is pretty heavy-weight. It would be better to have a way to perform a
simple mapping between Doc objects and database rows. Here's a simple
example Def: </p>
<pre>
<DocumentDefinition>
<name>test_index_only</name>
<element><name>time</name></element>
<element><name>string</name></element>
<store>
<name>main</name>
<location>Index_Only:'index_name','main'</location>
</store>
<index>
<name>main</name>
<field><name>time</name></field>
<field><name>string</name></field>
</index>
</DocumentDefinition>
</pre>
<p> Mostly, this works as one would expect. Normal Comma method calls
are used to create Doc objects and <b>store()</b> them. You can
<b>read()</b> an individual Doc, or use index <b>iterator()</b> and
<b>count()</b> methods to get at collections and totals. Doc
<b>read()</b> does come with a few caveats. Only simple index fields
-- no collections or code-constructed fields -- are available. And
it's not possible to store Blobs or to use storage iterators. </p>
<p> The <b>doc_id</b>s that are generated are sequential integers, and
are not 0-padded. </p>
<h2>Output Filters</h2>
<p> Like location chains, output chains influence how a document is
written to long-term storage. Whereas a location chain determines
"where" a document is stored, an output chain determines the "format"
of the stored doc. </p>
<p> There is an implicit output format present for every single
document: plain text. Document storage always starts with the
generation of a plain-text, xml-marked-up representation of the doc
(the same thing that is produced by a call to <b>to_string()</b>), and
document retrieval always ends with the parsing of that plain text
representation. But various output filters can be added to the
storage/retrieval process, with each one influencing what bytes
actually get written out to disk. </p>
<p> The <b>Gzip</b> output filter compresses the doc using the gzip
algorithm. Large documents can be compressed quite
effectively. <b>Gzip</b> takes no arguments and operates as simply as
possible. Other tools that understand gzip compression/decompression
-- <code>zcat</code>, for example -- can be used to examine or process
the stored docs independent of Comma. </p>
<pre>
<store>
<name>gzipped</name>
<location>Sequential_file</location>
<output>Gzip</output>
</store>
</pre>
<p> The <b>Twofish</b> filter uses the Twofish symmetric encryption
algorithm (implemented by Abhijit Menon-Sen's Crypt::Twofish module)
to encrypt the storage output. <b>Twofish</b> needs an argument,
<b>key</b>, specifying the encryption/decryption key to be used. An
additional argument, <b>key_hash</b>, is also required. The
<b>key_hash</b> is used to verify that the key supplied is
correct. This is important because encrypting data with a mis-typed
(or otherwise wrongly-supplied) key can cause much heartache and
difficulty. (And keys can -- and likely should -- be dynamically
supplied, so there is often opportunity to mis-type a key.) </p>
<p> The <b>key_hash</b> is produced by generating an md5
digest of the key, and should be supplied as a hex string. The
following one-liner spits out the <b>key_hash</b> for the <b>key</b>
'foo':</p>
<pre>
perl -MDigest::MD5 -e 'print Digest::MD5::md5_hex("foo") . "\n"'
</pre>
<p> And here is an example of a store that first gzips, then Twofish
encrypts, its docs: </p>
<pre>
<store>
<name>gzipped_and_encrypted</name>
<location>Sequential_file</location>
<output>Gzip</output>
<output>Twofish: 'key', 'an encryption p',
'key_hash', '67d8db90c079ae74967a1b750b87525f'</output>
</store>
</pre>
<p> The <b>HMAC_MD5</b> output filter uses Gisle Aas's
Digest::HMAC_MD5 module to fingerprint each of its stored docs. Like
the <b>Twofish</b> filter, <b>HMAC_MD5</b> requires <b>key</b> and
<b>key_hash</b> arguments. (The <b>key_hash</b> is generated in the
same way as for <b>Twofish</b>.) Here is a store that gzips, HMACs and
Twofish encrypts its docs: </p>
<pre>
<store>
<name>five</name>
<base>gz_hmac_twofish</base>
<location>Sequential_file:'max',10,'extension','.gz_hmac_encrypt'</location>
<output>Gzip</output>
<output>HMAC_MD5: 'key', 'an-hmac-sillykey',
'key_hash', '7c116a20dcc378de2afb4cc9955a2187'</output>
<output>Twofish: 'key', 'another-sillykey',
'key_hash', '6ae8eaeaa226a03a46d79a359ab00db0'</output>
</store>
</pre>
<p> As mentioned above, keys will often need to be supplied when Comma
applications are loaded, rather than hard-coded into defs. (Leaving
the key in the def in plain text is a potential security problem.)
Here is a toy storage definition that prompts the user to enter the
key on the command line when the def is loaded: </p>
<pre>
<store>
<name>four</name>
<base>test/four</base>
<location>Sequential_file:'max',10,'extension','.encrypt'</location>
<output>
<![CDATA[
Twofish:
'key' =&gt; do { print "key ('1234'): "; my $key=&lt;&gt;; chomp $key; $key },
'key_hash' => '81dc9bdb52d04dc20036dbd8313ed055'
]]>
</output>
</store>
</pre>
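<p> When keys are supplied dynamically like this, it can be handy to
check a key against its <b>key_hash</b> before loading any defs. The
sketch below does that with the core Digest::MD5 module. The
<code>key_matches</code> helper is hypothetical and only mirrors the
check described above; it is not the code the filters use. </p>
<pre>
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# true if the hex md5 digest of $key equals $key_hash
sub key_matches {
  my ( $key, $key_hash ) = @_;
  return md5_hex( $key ) eq lc $key_hash;
}

my $key_hash = '81dc9bdb52d04dc20036dbd8313ed055';   # from the def above
for my $key ( '1234', '1235' ) {
  print "$key: ", ( key_matches( $key, $key_hash ) ? "ok" : "mismatch" ), "\n";
}
</pre>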
<h2>Writing New Output and Location Modules</h2>
<p> Hmmm. See the text file: Storage/Location/location_modules.doc for
basic location module theory and practice. Output filters are much
simpler -- use the source, Luke. </p>
<h1>Blob Elements</h1>
<p> A <i>blob</i> is an element that is stored separately from the
rest of a Doc's content. Comma borrows the term (and to some extent
the concept) of blobs from the database world, where <i>BLOB</i>
originated as an acronym for "Binary Large Object." A blob element is
conceptually part of a Doc, but -- as a matter of implementation -- is
stored so that it does not need to be parsed when a doc is read and
does not need to have its content escaped/unescaped on <b>set()</b>
and <b>get()</b>. In addition, blobs are often stored transparently in
individual files, so that their content can be directly accessed by
non-comma, filesystem-centric tools. </p>
<p> The API for manipulating blob elements is the same as that for
manipulating normal elements, with a few additions. Here is a simple
document with a regular and a blob element defined: </p>
<pre>
<DocumentDefinition>
<name>two_easy_elements</name>
<element><name>regular_el</name></element>
<blob_element><name>blob_el</name></blob_element>
<store>
<base>te</base>
<location>Sequential_file</location>
</store>
</DocumentDefinition>
</pre>
<p> Our <code>blob_el</code> can be treated just like our
<code>regular_el</code>: <b>set()</b> and <b>get()</b> work exactly as
one would expect. When a <code>two_easy_elements</code> doc is stored,
it will look something like this: </p>
<pre>
<two_easy_elements>
<regular_el>whatever</regular_el>
<blob_el><_comma_blob>/usr/local/comma/docs/te/0001-cuPhU10z</_comma_blob></blob_el>
</two_easy_elements>
</pre>
<p> Where a <b>to_string()</b>'ed normal element has content, a blob
element has a pointer. What the pointer means (hence, how and where
the content is actually stored) is dependent on the particular
location module that did the storing. In most cases blob content is
stored in a file in the same directory as the doc, with a filename
consisting of the doc's id as a leading string, then a dash, then a
randomly generated alphanumeric string, then an optional
extension. </p>
<p> In addition to being <b>set()</b>, a blob element can be handed
content using <b>set_from_file()</b>. This makes using a blob to store
information that is already available on disk quite simple: </p>
<pre>
my $filename = "$ENV{HOME}/pictures/snowy_day.jpg"; # perl does not expand '~'
$blob_el->set_from_file ( $filename );
</pre>
<p> The <b>get_location()</b> method returns the pointer information
for the blob. Note that <b>get_location</b> will return an empty
string if the blob is unset, and that because comma jumps through some
hoops to store blob content in temporary files between a <b>set()</b>
and a <b>store()</b>, the pointer returned by <b>get_location()</b>
may not be what you expect in all situations, and shouldn't be treated
as a persistent piece of data. </p>
<pre>
# display our jpg on the screen
my $filename = $blob_el-&gt;get_location() ||
die "no pretty picture available";
system ( 'display', $filename ); # list form avoids shell quoting problems
</pre>
<p> Blob elements can be <b>validate()</b>'ed, and have hooks of type
<b>set_hook</b>, <b>read_hook</b> and <b>validate_hook</b> attached to
them. They also accept a unique <b>set_from_file</b> hook, which is
like a <b>set_hook</b> but is triggered by the <b>set_from_file()</b>
method. </p>
<p> The extension that is appended to a blob element's filename during
storage is specified with an <b>extension</b> def element: </p>
<pre>
<DocumentDefinition>
<name>two_easy_elements</name>
<element><name>regular_el</name></element>
<blob_element>
<name>blob_el</name>
<extension>'.jpg'</extension>
</blob_element>
<store>
<base>te</base>
<location>Sequential_file</location>
</store>
</DocumentDefinition>
</pre>
<p> The extension itself is produced by eval'ing the <b>extension</b>
specifier each time the element is stored. This makes it possible to
have the extension vary depending on the content of the blob. The
easiest way to do this, of course, is to put a method in your extension,
like so: </p>
<pre>
&lt;extension&gt;&lt;![CDATA[ $self-&gt;image_extension() ]]&gt;&lt;/extension&gt;
</pre>
<p> For a more complex example of this -- as well as for a generally
useful element that can be defname'ed into any context -- see the file
t/defs/standard/Comma_Standard_Image.def in the comma distribution. </p>
<h1>Grouping and Sorting Elements</h1>
<p> Comma provides a pair of methods that rearrange the order of
elements in a doc or nested element. </p>
<h2>Prettifying: group_elements()</h2>
<p> Calling the <b>group_elements()</b> method of a doc or nested
element pulls together all sub-elements of the same type, and arranges
these groupings of elements according to the order in which they are
listed in the doc or nested element's def. The order of the elements
in each group relative to one another remains unchanged. Often this
method is used to "prettify" a doc so that it will be more easily
readable or hand-editable. The <b>group_elements()</b> method returns
the object on which it was called.</p>
<pre>
<!-- a simple def -->
<DocumentDefinition>
<name>group_test</name>
<element><name>a</name></element>
<element><name>b</name></element>
<plural>'a','b'</plural>
</DocumentDefinition>
<!-- and a doc -->
<group_test>
<a>0</a>
<b>1</b>
<a>2</a>
<b>3</b>
</group_test>
# assume the above is '$doc' -- call group_elements and to_string() from some
# code...
print $doc->group_elements()->to_string();
# will print out:
<group_test>
<a>0</a>
<a>2</a>
<b>1</b>
<b>3</b>
</group_test>
</pre>
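<p> The reordering rule itself (group by tag, groups in def order,
original order kept inside each group) can be expressed in a few
lines of plain Perl. This is a sketch of the rule, not of Comma's
implementation; the <code>[ tag, content ]</code> pairs are
hypothetical stand-ins for elements. </p>
<pre>
use strict;
use warnings;

my @def_order = qw( a b );                  # order of elements in the def
my @elements  = ( [ a =&gt; 0 ], [ b =&gt; 1 ], [ a =&gt; 2 ], [ b =&gt; 3 ] );

# for each tag in def order, pull out its elements in their existing order
my @grouped = map {
  my $tag = $_;
  grep { $_-&gt;[0] eq $tag } @elements;
} @def_order;

print join( ' ', map { "$_-&gt;[0]=$_-&gt;[1]" } @grouped ), "\n";
</pre>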
<h2>Sorting: sort_elements()</h2>
<p> The <b>sort_elements()</b> method is much more flexible and
powerful. When called with no arguments, it rearranges all of the
elements in a container according to a defined <b>sort_sub</b>. Like
<b>elements()</b>, it accepts an optional list of tags as arguments;
if called with arguments it rearranges only the specified types of
elements. </p>
<p> The <b>sort_sub</b> is specified in one of two places:
<b>sort_elements()</b> first looks for a <b>sort_sub</b> in the
definition of the first element to be sorted, then in the definition
of the container on which it was called. If it doesn't find a
<b>sort_sub</b> in either def, the method throws an error. The
<b>sort_elements()</b> method (again, for cognitive compatibility with
<b>elements()</b>) returns a sorted array -- or reference -- of
elements.</p>
<p> Here is a document definition showing four slightly different uses
of a <b>sort_sub</b>. The <i>simple</i> element's sort would be
controlled by the document's <b>sort_sub</b> (which appears near the
end of the definition), as it doesn't have one of its own. The other
three elements all define <b>sort_sub</b> elements that (presumably)
are tailored for the types of data they will contain and the ways in
which they will be used. </p>
<pre>
<DocumentDefinition>
<name>sort_test</name>
<element><name>simple</name></element>
<element>
<name>self_sorting_alpha</name>
&lt;sort_sub&gt;&lt;![CDATA[ sub ($$) { $_[1]-&gt;get() cmp $_[0]-&gt;get(); } ]]&gt;&lt;/sort_sub&gt;
&lt;/element&gt;
&lt;element&gt;
&lt;name&gt;self_sorting_numeric&lt;/name&gt;
&lt;sort_sub&gt;&lt;![CDATA[ sub ($$) { $_[1]-&gt;get() &lt;=&gt; $_[0]-&gt;get(); } ]]&gt;&lt;/sort_sub&gt;
&lt;/element&gt;
&lt;nested_element&gt;
&lt;name&gt;self_sorting_nested&lt;/name&gt;
&lt;element&gt;&lt;name&gt;rank&lt;/name&gt;&lt;/element&gt;
&lt;sort_sub&gt;&lt;![CDATA[ sub ($$) { $_[0]-&gt;rank() &lt;=&gt; $_[1]-&gt;rank(); } ]]&gt;&lt;/sort_sub&gt;
&lt;/nested_element&gt;
&lt;plural&gt;'simple','self_sorting_alpha','self_sorting_numeric','self_sorting_nested'&lt;/plural&gt;
&lt;sort_sub&gt;&lt;![CDATA[ sub ($$) { $_[0]-&gt;get() &lt;=&gt; $_[1]-&gt;get(); } ]]&gt;&lt;/sort_sub&gt;
</DocumentDefinition>
</pre>
<p> Each <b>sort_sub</b> string is turned into a code reference (by an
eval statement), then passed to a <code>sort</code> statement whenever
<b>sort_elements()</b> is called. Note that you must use the
<i>prototyped</i> form of subroutine definition for the sort statement
to work properly. You may be more used to using the special variables
<code>$a</code> and <code>$b</code> -- the <code>($$)</code> prototype
simply tells <code>sort</code> to use normal subroutine parameters
instead: <code>$_[0]</code> and <code>$_[1]</code>. </p>
<pre>
# the above def in use
my $doc = XML::Comma::Doc->new ( type=>'sort_test' );
$doc-&gt;add_element('simple')-&gt;set(1);
$doc-&gt;add_element('simple')-&gt;set(23);
$doc-&gt;add_element('simple')-&gt;set(11);
$doc-&gt;add_element('self_sorting_alpha')-&gt;set('ccc');
$doc-&gt;add_element('self_sorting_alpha')-&gt;set('aaa');
$doc-&gt;add_element('self_sorting_alpha')-&gt;set('bbb');
$doc-&gt;add_element('self_sorting_numeric')-&gt;set(10);
$doc-&gt;add_element('self_sorting_numeric')-&gt;set(11);
$doc-&gt;add_element('self_sorting_numeric')-&gt;set(3);
$doc-&gt;add_element('self_sorting_nested')-&gt;rank(7);
$doc-&gt;add_element('self_sorting_nested')-&gt;rank(5);
$doc-&gt;add_element('self_sorting_nested')-&gt;rank(10);
# do some sorts
$doc->sort_elements ( 'simple' );
$doc->sort_elements ( 'self_sorting_alpha' );
$doc->sort_elements ( 'self_sorting_numeric' );
# do a sort and actually use the return value
print join "\n", map { "rank of " . $_ . ": " . $_->rank() }
$doc->sort_elements ( 'self_sorting_nested' );
</pre>
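<p> The prototype requirement can be checked outside of Comma. The
sketch below evals a <b>sort_sub</b>-style string into a code
reference and hands it to <code>sort</code>, which is roughly the
sequence of steps described above; plain numbers stand in for
elements, so the sub compares <code>$_[0]</code> and
<code>$_[1]</code> directly rather than calling <b>get()</b>. </p>
<pre>
use strict;
use warnings;

# a sort_sub as it would appear in a def: a string holding a prototyped sub
my $sort_sub_string = 'sub ($$) { $_[1] &lt;=&gt; $_[0] }';
my $sorter = eval $sort_sub_string or die $@;

my @values = ( 10, 11, 3 );
my @sorted = sort $sorter @values;     # descending numeric order

print "@sorted\n";
</pre>
<p> Without the <code>($$)</code> prototype, <code>sort</code> would
not fill <code>@_</code>, and the comparison would see only undefined
values. </p>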
<h1>Error Handling and Logging</h1>
<p> Comma includes error propagation facilities and some basic logging
functionality. The <b>log_file</b> configuration variable specifies a
file for Comma to log to. </p>
<p> Each line in the log file is made up of three fields, separated by
spaces: 1) the unix time, 2) the pid, and 3) the error string. Most
errors that are thrown as part of Comma's internal workings will have
an error string made up of a standard error name, then two dashes, then
more information about the error, then the file and line number of the
caller. For example: </p>
<pre>
1000838058 1584 STORE_ERROR -- no store given to first-time Doc->store() at t/storage.t line 167
</pre>
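<p> Because the format is so regular, log lines are easy to pick
apart with <code>split</code>. The following is a small sketch based
only on the format described above; it assumes that the error name
never contains the <code> -- </code> separator. </p>
<pre>
use strict;
use warnings;

my $line = '1000838058 1584 STORE_ERROR -- no store given to '
         . 'first-time Doc-&gt;store() at t/storage.t line 167';

# three space-separated fields; the third keeps its internal spaces
my ( $time, $pid, $error ) = split ' ', $line, 3;

# the error string is "NAME -- details"
my ( $name, $detail ) = split / -- /, $error, 2;

print "$name (pid $pid) at ", scalar gmtime( $time ), " GMT\n";
</pre>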
<p> Two public methods enable writing to the log
file. <b>XML::Comma::Log->err()</b> takes an error name and a
description string as arguments, writes a line to the log, then exits
with a <code>die</code>. The error name should be a short, arbitrary
string which contains no spaces and identifies the error. All errors
thrown by internal comma code use all-caps error names, which makes
the log files easy to read, but that's only a convention. The
description string can be as long as desired, and should contain more
specific information about the error (newline characters will be
replaced with spaces when the string is written to the log file). A
file and line-number from which the <b>err()</b> method was called
will be included in the output. </p>
<p> <b>XML::Comma::Log-&gt;warn()</b> takes a string, prepends the text
<code>WARNING -- </code> to it, and writes a line to the log. It
doesn't <code>die</code> or record the file and line-number of the
caller. </p>
<pre>
# non-fatal, informational logging
XML::Comma::Log->warn ( "whoa, we might have problems" );
# throw a fatal error
XML::Comma::Log->err ( "FAKE_ERROR", "we have broken down" );
</pre>
<h1>Network Transfer</h1>
<p> Comma ships with a client/server library for transferring Docs
across the network. The transfer operations are built on top of the
HTTP protocol, and the server side of the library is designed to run
as a mod_perl handler inside Apache. The client side can be used
programmatically, just like any other part of the Comma system. </p>
<p> Here is a bit of example code: </p>
<pre>
my $t = XML::Comma::Pkg::Transfer::HTTP_Transfer->new
( target => 'https://remote.server.com/util/transfer' );
if ( ! $t->ping() ) {
die "couldn't contact the remote server";
}
my $key = 'article|main|0012';
my $article = XML::Comma::Doc->read ( $key );
if ( $article->comma_hash ne $t->get_hash($key) ) {
print "transferring: $key ... ";
$t->put ( $article );
print "ok\n";
} else {
print "hash matched for $key";
}
</pre>
<h2> Client Methods </h2>
<p> The client is an instance of the
<b>XML::Comma::Pkg::Transfer::HTTP_Transfer</b> class. The constructor
for this class takes one optional argument, <code>target</code>, which
specifies the URL to which the client will connect. The available
client methods are: <b>ping()</b>, <b>put()</b>, <b>put_push()</b>,
<b>get_hash()</b> and <b>get_and_store()</b>. </p>
<p> The <b>ping()</b> method tests the connection to the server. It returns
<code>1</code> if the client is able to exchange data with the server
and returns <code>undef</code> otherwise. </p>
<p> The <b>put()</b> and <b>put_push()</b> methods transfer docs from
the client to the server. Both methods take a doc object as their
argument. A doc that is <b>put()</b> across the network is stored on
the server using the same <b>store</b> and <b>id</b> as the local
doc. (This implies that a newly-created and not-yet-stored doc may not
be <b>put()</b>ted.) A doc that is <b>put_push()</b>ed across the
network will be stored on the server in a new location -- as if it
were newly-created. The <b>put_push()</b> method takes an optional
second argument, a <b>store_name</b> indicating what store to use on
the server. If that argument is omitted, the doc's store is used. Both
methods return the doc id under which the doc was stored (though this
is presumably already known in the <b>put()</b> case). Both methods
throw an error if they encounter unrecoverable network difficulties or
problems on the remote server. </p>
<p> The <b>get_hash()</b> method gets a document's <b>comma_hash</b>
from the remote server. The method takes the three "retrieval"
arguments -- either parameterized as a <b>type=&gt;</b>,
<b>store=&gt;</b> and <b>id=&gt;</b> or stringified as a <b>key</b>. A
call to <b>get_hash()</b> returns the hash on success, returns
<code>undef</code> if the requested doc is not found on the remote
server, and throws an error if it encounters severe difficulties
across the network. </p>
<p> The <b>get_and_store()</b> method takes the same "retrieval"
arguments as <b>get_hash()</b>, but pulls the remote doc across and stores
it on the local server. It returns a read-only copy of the doc on
success, and throws an error on failure. </p>
<p> <b>The <i>HTTP_Transfer</i> library is not designed to support
transferring documents bi-directionally within a single store. All
kinds of potential problems <i>will</i> arise if one attempts to do
that.</b> </p>
<h2> Server <Location /util/transfer> Configuration </h2>
<p> To bring up an <i>HTTP_Transfer</i> server, an Apache handler must
be configured for a particular <b><Location></b>. The following
code should be all that is required for a basic installation. </p>
<pre>
<Location /util/transfer>
SetHandler perl-script
PerlHandler XML::Comma::Pkg::Transfer::HTTP_Transfer
</Location>
</pre>
<h2> Access Control and SSL Encryption </h2>
<p> Apache's extensive access control facilities can be used to
control usage of an <i>HTTP_Transfer</i> server. The Apache
documentation describes how to limit access by connection-oriented
criteria such as IP address.</p>
<p> The <i>HTTP_Transfer</i> library is also SSL-aware, allowing data
to be sent across the network in an encrypted form. Apache must be
configured with SSL support for this option to be available. On the
client side, all that is required is to specify a target url that
begins with <i>https</i>. </p>
<p> Client SSL certificates can be used to further limit access to the
server. Apache must be configured so that it has access to a
"Certificate Authority" public certificate and with the following two
SSL options: </p>
<pre>
SSLVerifyClient require
SSLVerifyDepth 1
</pre>
<p> This will limit access to clients that hold certificates signed by
the "Certificate Authority's" private key. The following two
constructor arguments can be used to make a certificate/key pair
available to the <i>HTTP_Transfer</i> client. </p>
<pre>
# use client certificate:
https_cert_file => '/tmp/cert.pem'
https_key_file => '/tmp/key.pem'
# example
my $t = XML::Comma::Pkg::Transfer::HTTP_Transfer->new
( target => 'https://remote.server.com/util/transfer',
https_cert_file => '/path/cert.pem',
https_key_file => '/path/key.pem' );
</pre>
<p> The key file should not be passphrase encrypted. The Crypt::SSLeay
library that <i>HTTP_Transfer</i> relies on does not cache the key, so
the passphrase must be re-typed on each connection if an encrypted key
is used. </p>
<h1>Reference: Defs</h1>
<p> Document definitions, which are Comma documents like any other,
are themselves constrained by a definition. This "bootstrap"
definition is often useful as a reference. </p>
<pre>
<DocumentDefinition>
<name>DocumentDefinition</name>
<element><name>name</name></element>
<element><name>ignore_for_hash</name></element>
<element><name>include_for_hash</name></element>
<element><name>plural</name></element>
<element><name>required</name></element>
<element><name>validate_hook</name></element>
<element><name>document_write_hook</name></element>
<element><name>def_hook</name></element>
<element><name>sort_sub</name></element>
<nested_element>
<name>method</name>
<element><name>name</name></element>
<element><name>code</name></element>
<required>'name','code'</required>
</nested_element>
<nested_element>
<name>element</name>
<element><name>name</name></element>
<element><name>validate_hook</name></element>
<element><name>set_hook</name></element>
<element><name>default</name></element>
<element><name>macro</name></element>
<element><name>defname</name></element>
<element><name>sort_sub</name></element>
<plural>'validate_hook','set_hook','macro'</plural>
<required>'name'</required>
</nested_element>
<nested_element>
<name>blob_element</name>
<element><name>name</name></element>
<element><name>extension</name></element>
</nested_element>
<nested_element>
<name>nested_element</name>
<element><name>name</name></element>
<element><name>defname</name></element>
<element><name>macro</name></element>
<element><name>plural</name></element>
<element><name>required</name></element>
<element><name>ignore_for_hash</name></element>
<element><name>include_for_hash</name></element>
<element><name>validate_hook</name></element>
<element><name>sort_sub</name></element>
<nested_element>
<name>element</name>
<defname>DocumentDefinition:element</defname>
</nested_element>
<nested_element>
<name>blob_element</name>
<defname>DocumentDefinition:blob_element</defname>
</nested_element>
<nested_element>
<name>nested_element</name>
<defname>DocumentDefinition:nested_element</defname>
</nested_element>
<nested_element>
<name>method</name>
<defname>DocumentDefinition:method</defname>
</nested_element>
<plural>
'macro',
'plural',
'required',
'ignore_for_hash',
'include_for_hash',
'validate_hook',
'element',
'blob_element',
'nested_element',
'method',
</plural>
<required>'name'</required>
</nested_element>
<nested_element>
<name>store</name>
<element><name>name</name></element>
<element><name>location</name></element>
<element><name>output</name></element>
<element><name>root</name></element>
<element><name>base</name></element>
<element>
<name>file_permissions</name>
<default>664</default>
</element>
<element><name>pre_store_hook</name></element>
<element><name>post_store_hook</name></element>
<element><name>erase_hook</name></element>
<element><name>index_on_store</name></element>
<plural>qw( location output
pre_store_hook post_store_hook
erase_hook
index_on_store )</plural>
<required>'name','base','location'</required>
</nested_element>
<nested_element>
<name>index</name>
<element><name>name</name></element>
<!-- 'store' will default to self->element('name')->get() -
(note, this is handled by the Store->store() method in
the internals ) -->
<element><name>store</name></element>
<!-- doc_id_sql_type SHOULD NOT BE CHANGED without completely
dropping and recreating a given index's database (or otherwise
altering the database structure outside of Comma). ** there is
no automatic change of this to match a def ** -->
<element>
<name>doc_id_sql_type</name>
<default>VARCHAR(255)</default>
</element>
<nested_element>
<name>field</name>
<element><name>name</name></element>
<element><name>code</name></element>
<element>
<name>sql_type</name>
<default>VARCHAR(255)</default>
</element>
<required>'name'</required>
</nested_element>
<nested_element>
<name>collection</name>
<element><name>name</name></element>
<element><name>code</name></element>
</nested_element>
<nested_element>
<name>sort</name>
<element><name>name</name></element>
<element><name>code</name></element>
<nested_element>
<name>clean</name>
<defname>DocumentDefinition:index:clean</defname>
</nested_element>
<required>'name'</required>
</nested_element>
<nested_element>
<name>textsearch</name>
<element><name>name</name></element>
<element><name>defer_on_update</name></element>
<required>'name'</required>
</nested_element>
<nested_element>
<name>sql_index</name>
<element><name>name</name></element>
<element><name>unique</name></element>
<element><name>fields</name></element>
<required>'name','fields'</required>
</nested_element>
<element>
<name>default_order_by</name>
<default>doc_id</default>
</element>
<nested_element>
<name>order_by_expression</name>
<element><name>name</name></element>
<element><name>expression</name></element>
<required>'name','expression'</required>
</nested_element>
<nested_element>
<name>clean</name>
<element><name>to_size</name></element>
<element><name>order_by</name></element>
<element><name>size_trigger</name></element>
<element><name>erase_where_clause</name></element>
</nested_element>
<element><name>index_hook</name></element>
<element><name>stop_rebuild_hook</name></element>
<nested_element>
<name>method</name>
<defname>DocumentDefinition:method</defname>
</nested_element>
<plural>qw( field
collection
sort
textsearch
sql_index
order_by_expression
index_hook
stop_rebuild_hook
method )</plural>
<required>'name'</required>
</nested_element>
<plural>
'element',
'nested_element',
'blob_element',
'method',
'store',
'index',
'document_write_hook',
'plural',
'required',
'ignore_for_hash',
'include_for_hash',
'validate_hook',
</plural>
</DocumentDefinition>
</pre>
<h1>Reference: Hooks</h1>
<p> <b>def_hook</b> -- any def. The <b>def_hook</b> is unlike the
other hooks in its operation. Rather than being defined as part of a
document definition and run during later operations, the
<b>def_hook</b> is run as part of the loading of the def itself. This
hook is designed to allow a def to create dynamic structures that will
be available to its various methods, hooks and elements. The contents
of the def_hook element should be a block of code suitable for
<code>eval</code>ing. Any error thrown during the <code>eval</code>
will cause a <code>DEF_HOOK_ERR</code> to be propagated. </p>
<p> <b>document_write_hook ( $doc )</b> -- any doc. The
<b>to_string()</b> method triggers the execution of any defined hooks
of this type, and any error thrown by one of these hooks will abort
the to_string operation. This hook is passed the doc that it is
attached to as its only argument. </p>
<p> <b>erase_hook ( $doc, $store, $location )</b> -- any store. This
hook is run at the beginning of an erase operation (triggered by doc
<b>erase()</b> and <b>move()</b> methods). Any thrown errors abort the
erase operation and cause a <code>STORE_ERROR</code> to be
propagated. This hook is passed the doc being erased, the store
object, and the doc location. </p>
<p> <b>index_hook ( $doc, $index )</b> -- any index. This hook is run
at the beginning of an <b>index_update()</b>. If an <b>index_hook</b>
throws an error, the update halts and the call to
<b>index_update()</b> returns <code>undef</code> (no error is
propagated, however). This hook is passed the doc that is the source
of the update and the index object as its two arguments. </p>
<p> <b>post_store_hook ( $doc, $store, \%store_arguments )</b> -- any
store. This hook is run at the very end of a store operation
(triggered by doc <b>store()</b>, <b>move()</b>, and <b>copy()</b>
methods). All hooks are run, then the doc is unlocked, then -- if
necessary -- the first of any errors that might have been encountered
as the hooks were run is thrown. This hook is passed the doc being
stored, the store object, and a reference to the hash of store
arguments. </p>
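<p> The ordering described above (run every hook, unlock, then throw
the first error) can be sketched in plain Perl. This is only an
illustration of the sequence, not Comma's internal code: the hook list,
the <code>@log</code> array and the "doc unlocked" step are stand-ins
invented for the example, and no Comma objects are involved. </p>
<pre>
use strict;
use warnings;

my @log;

# Three stand-in post_store_hooks; the second and third both fail.
my @hooks = (
  sub { push @log, 'hook 1 ran' },
  sub { push @log, 'hook 2 ran'; die "hook 2 failed\n" },
  sub { push @log, 'hook 3 ran'; die "hook 3 failed\n" },
);

# Run every hook, remembering errors instead of stopping at the first.
my @errors;
for my $hook ( @hooks ) {
  eval { $hook->( undef, undef, {} ); 1 } or push @errors, $@;
}

# The doc is unlocked before any error is reported (simulated here).
push @log, 'doc unlocked';

# Only the first recorded error is re-thrown.
eval { die $errors[0] if @errors };
my $thrown = $@;

print join( "\n", @log ), "\n";
print "thrown: $thrown";
</pre>
<p> In this sketch all three hooks run and the unlock step happens
even though two hooks died; only the error from the second hook is
reported. </p>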
<p> <b>pre_store_hook ( $doc, $store, \%store_arguments )</b> -- any
store. This type of hook is run at the very beginning of a store
operation (triggered by doc <b>store()</b>, <b>move()</b>, and
<b>copy()</b> methods). If any <b>pre_store_hook</b> throws an error,
the store operation is aborted and a <code>STORE_ERROR</code> is
propagated. This hook is passed the doc being stored, the store
object, and a reference to the hash of store arguments. </p>
<p> <b>stop_rebuild_hook ( $doc, $index )</b> -- any index. This hook
is run during an index <b>rebuild()</b>, after each doc is
processed. The hook should return a true value to indicate that the
rebuild operation has completed (and, conversely, a false value to
indicate that the rebuild should continue). This hook is passed the
doc that -- if the process continues -- will next be added to the
index and the index object itself as its two arguments. </p>
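<p> As a rough illustration of the return-value convention, the
following plain-Perl sketch stops a simulated rebuild after a fixed
number of docs. The loop, the <code>$max_docs</code> limit and the doc
ids are hypothetical stand-ins for the example; the real rebuild loop
lives inside Comma, and the real hook receives a doc and an index
object rather than a string and <code>undef</code>. </p>
<pre>
use strict;
use warnings;

my $max_docs = 3;   # hypothetical limit, not a Comma setting
my $passed   = 0;

# Same argument list as a stop_rebuild_hook: ( $doc, $index ).
# Returns true once enough docs have been let through.
my $stop_rebuild_hook = sub {
  my ( $doc, $index ) = @_;
  return 1 if $passed >= $max_docs;
  $passed++;
  return 0;
};

# Stand-in for the rebuild loop: ask the hook about the doc that
# would be added next, and stop as soon as it returns true.
my @indexed;
for my $doc_id ( qw( 0001 0002 0003 0004 0005 ) ) {
  last if $stop_rebuild_hook->( $doc_id, undef );
  push @indexed, $doc_id;
}

print "indexed: @indexed\n";
</pre>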
<p> <b>validate_hook ( $element, [ $content ] )</b> -- any nested element,
non-nested element, or doc. The element <b>validate()</b>,
<b>validate_content()</b>,
<b>set()</b>, and <b>append()</b> methods trigger the execution of
these hooks, as does the nested element <b>validate()</b>
method. If any <b>validate_hook()</b> throws an error, the validate
operation is halted and a <code>BAD_CONTENT</code> or
<code>VALIDATE_ERROR</code> is propagated. The hook is passed one or
two arguments: the element to which the hook is attached and (for
non-nested elements) the content to be validated. </p>
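<p> A minimal sketch of the calling convention for a non-nested
element's <b>validate_hook</b>, written as a standalone anonymous sub
so that it can be run without Comma. The five-digit rule and the test
values are made up for the example, and the <code>eval</code> wrapper
only imitates the way a thrown error halts validation; in a real def
the system would turn the failure into a <code>BAD_CONTENT</code> or
<code>VALIDATE_ERROR</code>. </p>
<pre>
use strict;
use warnings;

# Receives the element and the content to be validated; signals a
# problem by dying.
my $validate_hook = sub {
  my ( $element, $content ) = @_;
  die "not a five-digit code: '$content'\n"
    unless $content =~ /^\d{5}$/;
};

# Stand-in for the validate step: a thrown error rejects the content.
my @results;
for my $candidate ( '94110', 'abc' ) {
  my $ok = eval { $validate_hook->( undef, $candidate ); 1 };
  push @results, $ok ? 'ok' : 'rejected';
}

print "@results\n";
</pre>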
<p> <b>set_hook ( $element, $content_reference, \%set_arguments )</b>
-- non-nested elements and blob elements. This hook is triggered by a
call to an element's <b>set()</b> method. Any <b>set_hook</b>s defined
for an element are run after any <b>validate_hook</b>s and before the
actual set operation takes place. A <b>set_hook</b> receives, as its
second argument, a reference to the content that was passed to the
<b>set()</b> method, enabling the hook to modify the content, if
necessary. If any <b>set_hook</b> dies, an error is thrown and the set
is aborted. </p>
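<p> Because the second argument is a reference, a <b>set_hook</b> can
rewrite the content in place. The sketch below shows that mechanism
with a standalone anonymous sub; the whitespace-trimming rule is just
an example, and the direct call with <code>undef</code> in place of
the element stands in for what <b>set()</b> would do internally. </p>
<pre>
use strict;
use warnings;

# Receives the element, a reference to the content, and a hash
# reference of set arguments. Modifies the content through the
# reference.
my $set_hook = sub {
  my ( $element, $content_ref, $set_args ) = @_;
  die "no content given\n" unless defined $$content_ref;
  $$content_ref =~ s/^\s+//;   # strip leading whitespace
  $$content_ref =~ s/\s+$//;   # strip trailing whitespace
};

my $content = "  Hello, Comma  ";
$set_hook->( undef, \$content, {} );

print "[$content]\n";
</pre>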
<p> <b>set_from_file_hook ( $element, $filename, \%set_arguments )</b>
-- blob elements. This hook is exactly analogous to the
<b>set_hook</b> described above, except that it is unique to blob
elements, is triggered by the blob element <b>set_from_file()</b>
method, and is passed the filename that <b>set_from_file()</b>
receives instead of a content reference. If you want to intercept all
"set" operations on a blob element, you must define both of these
hooks. </p>
<p> <b>read_hook()</b> -- any doc or element. This hook is called when
a doc or element is "read in" from storage, during a <b>read()</b> or
<b>retrieve()</b>. After the system creates and initializes the
element (including any sub-elements or content that are read in), any
defined <b>read_hook</b>s are called. No arguments are passed. This
hook is sometimes needed as a complement to a <b>set_hook</b>, which
is only called by a <b>set()</b> method invocation and not upon
reading an element in from storage. Note that a <b>read_hook</b> can
only be called on an element that "exists" in the stored doc: empty
elements aren't stored, and so they can't be read in and hooked. </p>
<h1>Reference: Perl API (Methods and Objects)</h1>
<p>
<ul>
<li>XML::Comma<ul>
<li><all configuration variables readable via methods></li>
<li>$lock_interface = XML::Comma->lock_singlet()</li>
<ul>
<li>undef = $lock_interface->wait_for_hold ( $string );</li>
<li>undef = $lock_interface->release_hold ( $string );</li>
</ul>
<li>$hashref = XML::Comma->def_pnotes($defpath);</li>
</ul></li>
<li>XML::Comma::Doc<ul>
<li>$doc = XML::Comma::Doc->new ( type => )</li>
<li>$doc = XML::Comma::Doc->new ( block => )</li>
<li>$doc = XML::Comma::Doc->new ( file => )</li>
<li>$doc = XML::Comma::Doc->retrieve ( key, [timeout=><seconds>] )</li>
<li>$doc = XML::Comma::Doc->retrieve ( store =>, type =>, id =>, [timeout=><seconds>] )</li>
<li>$doc || undef = XML::Comma::Doc->retrieve_no_wait ( key )</li>
<li>$doc || undef = XML::Comma::Doc->retrieve_no_wait ( store =>, type =>, id => )</li>
<li>$doc = XML::Comma::Doc->read ( key )</li>
<li>$doc = XML::Comma::Doc->read ( <retrieve arguments> )</li>
<li>$doc = $doc->get_lock ( [timeout=><seconds>] );</li>
<li>$doc || undef = $doc->get_lock_no_wait();</li>
<li>$string = $doc->to_string()</li>
<li>$string = $doc->comma_hash()</li>
<li>@elements = $doc->get_leaf_nodes( [ include => [ path_1 ... path_n ] ], [ exclude => [ path_1 ... path_n ] ])</li>
<li>$string = $doc->full_field_texts( [ same args as get_leaf_nodes ] );</li>
<li>$self = $doc->store ( store=>, [keep_open=>], [no_hooks=>], [args...] )</li>
<li>$self = $doc->erase()</li>
<li>$self = $doc->copy()</li>
<li>$self = $doc->copy ( <store arguments> )</li>
<li>$self = $doc->move()</li>
<li>$self = $doc->move ( <store arguments> )</li>
<li>$store = $doc->doc_store()</li>
<li>$string = $doc->doc_location()</li>
<li>$string = $doc->doc_id()</li>
<li>$string = $doc->doc_key()</li>
<li>$string = $doc->doc_source_file()</li>
<li>$bool = $doc->doc_is_locked()</li>
<li>$bool = $doc->doc_is_new()</li>
<li>$int = $doc->doc_last_modified()</li>
<li>$doc = $doc->index_update ( index=>$index )</li>
<li>$doc = $doc->index_remove ( index=>$index )</li>
</ul></li>
<li>all elements<ul>
<li>$string = $el->tag()</li>
<li>$string = $el->tag_up_path()</li>
<li>$def = $el->def()</li>
<li>$return_val = $el->method ( $name, [ @args...] )</li>
<li>null = $el->set_attr ( $name => $value, [ $name => $value ... ] );</li>
<li>$string = $el->get_attr ( $name );</li>
<li>$hash_ref = $def->def_pnotes();</li>
<li>@names = $def->applied_macros();</li>
<li>1/undef = $def->applied_macros ( @names );</li>
<li>$hashref = $el->pnotes();</li>
</ul></li>
<li>blob elements<ul>
<li>$string = $el->set( $string )</li>
<li>$string = $el->get()</li>
<li>'' = $el->set_from_file ( $filename )</li>
<li>'' = $el->validate()</li>
<li>$string = $el->append ( $more_string )</li>
<li>$string = $el->get_location()</li>
</ul></li>
<li>simple elements<ul>
<li>$string = $el->get( [unescape=>], [%args] )</li>
<li>$string = $el->get_without_default()</li>
<li>$string = $el->set ( $string, [escape=>], [%args] )</li>
<li>$string = $el->append ( $more_string )</li>
<li>$string = $el->validate()</li>
<li>$string = $el->validate_content ( $string )</li>
<li>1 = $el->cdata_wrap();</li>
</ul></li>
<li>nested elements<ul>
<li>@els/[] = $el->elements ( [@tags] )</li>
<li>$el = $el->element ( $tag )</li>
<li>$el = $el->add_element ( $tag )</li>
<li>$el = $el->delete_element ( $tag )</li>
<li>@strings/[] = $el->elements_group_get ( $tag )</li>
<li>@strings/[] = $el->elements_group_add ( $tag, @strings )</li>
<li>@els/[] = $el->elements_group_delete ( $tag, @strings ) </li>
<li>$bool = $el->elements_group_lists ( $tag, $string )</li>
<li>$bool = $el->element_is_plural ( $tag )</li>
<li>$bool = $el->element_is_defined ( $tag )</li>
<li>$bool = $el->element_is_nested ( $tag )</li>
<li>$bool = $el->element_is_blob ( $tag )</li>
<li>$bool = $el->element_is_required ( $tag )</li>
<li>'' = $el->validate()</li>
<li>[DEPRECATED] '' = $el->validate_structure()</li>
<li>@els = $el->get_all_blobs()</li>
<li>$el = $el->group_elements();</li>
<li>@els/[] = $el->sort_elements ( [@tags] )</li>
</ul></li>
<li>XML::Comma::Def<ul>
<li>$def = XML::Comma::Def->read ( name => )</li>
<li>@names = $def->store_names();</li>
<li>$store = $def->get_store ( $name );</li>
<li>@names = $def->index_names();</li>
<li>@names = $def->method_names();</li>
<li>$index = $def->get_index ( $name );</li>
<li>$hash_ref = $def->def_pnotes();</li>
<li>$code_ref = $def->add_hook ( $hook_type, $string || $code_ref );</li>
<li>$code_ref = $def->add_method ( $name, $string || $code_ref );</li>
<li>$code_ref || undef = $def->method_code ( $name );</li>
<li>@return/[] = $def->method ( $name, @args );</li>
<li>@names = $def->applied_macros();</li>
<li>1/undef = $def->applied_macros ( @names );</li>
<li>@els/[] = $def->def_sub_elements();</li>
<li>$el = $def->def_by_name ( $element_name );</li>
<li>1/undef = $def->is_required();</li>
<li>1/undef = $def->is_plural();</li>
<li>1/undef = $def->is_nested();</li>
<li>1/undef = $def->is_blob();</li>
<li>1/undef = $def->is_ignore_for_hash();</li>
<li>1/undef = $def->has_property( [ ignore_for_hash |
include_for_hash | plural | required | nested | blob | enum | boolean |
range | timestamp | timestamp_created | timestamp_last_modified |
doc_key | single_line ] );</li>
</ul></li>
<li>XML::Comma::Indexing::Index<ul>
<li>@names = $index->field_names();</li>
<li>@names = $index->sort_names(); [ DEPRECATED ]</li>
<li>@names = $index->collection_names();</li>
<li>@names = $index->textsearch_names();</li>
<li>@names = $index->method_names();</li>
<li>$type_name = $index->collection_type ( $collection_name );</li>
<li>$iterator = $index->iterator ( [%args] );</li>
<li>$iterator/undef = $index->single ( [%args] );</li>
<li>$doc/undef = $index->single_read ( [%args] );</li>
<li>$doc/undef = $index->single_retrieve ( [%args] );</li>
<li>$int = $index->count ( [%args] );</li>
<li>$int = $index->last_modified_time ( [$sort_name, $sort_string] );</li>
<li>$val = $index->aggregate ( function=> [%args] );</li>
<li>'' = $index->rebuild ( [verbose=>'1'|'0',workers=>$processes_num,size=>$size_num] );</li>
<li>'' = $index->clean();</li>
<li>'' = $index->get_dbh();</li>
<li>$def_name = $index->def_name();</li>
<li>$idx_name = $index->name();</li>
</ul></li>
<li>XML::Comma::Indexing::Iterator<ul>
<li>$iterator = $iterator->iterator_refresh ( [$limit_number], [$limit_offset] );</li>
<li>$iterator/false = $iterator->iterator_next();</li>
<li>$bool = $iterator->iterator_has_stuff();</li>
<li>$string = $iterator->iterator_select_returnval();</li>
<li>$doc = $iterator->retrieve_doc();</li>
<li>$doc = $iterator->read_doc();</li>
<li>$string = $iterator->doc_key();</li>
<li>$string = $iterator->doc_id();</li>
<li>$string = $iterator->record_last_modified();</li>
<li>$num = $iterator->select_count();</li>
<li>$return_value = $iterator->$field/$method ( [@args] );</li>
<!-- <li>@docs = $iterator->to_array();</li> -->
</ul></li>
<li>XML::Comma::Log<ul>
<li><thrown error/die> = XML::Comma::Log->err ( $error_string, $info_string );</li>
<li>'' = XML::Comma::Log->warn ( $string );</li>
<li>'' = XML::Comma::Log->log ( $string/$error );</li>
</ul></li>
<li>XML::Comma::Storage::Store<ul>
<li>$id_string = $store->first_id();</li>
<li>$id_string = $store->last_id();</li>
<li>$id_string = $store->next_id ( $id_string );</li>
<li>$id_string = $store->prev_id ( $id_string );</li>
<li>$directory = $store->base_directory();</li>
<li>$store_name = $store->name();</li>
<li>$def_name = $store->def_name();</li>
<li>@index_names = $store->associated_indices();</li>
</ul></li>
<li>XML::Comma::Storage::Iterator<ul>
<li>$iterator = XML::Comma::Storage::Iterator->new ( store=>$store, size=>$num, pos=><'+' | '-'> );</li>
<li>$num = $iterator->length();</li>
<li>$num = $iterator->index();</li>
<li>$num = $iterator->set( $num );</li>
<li>$num = $iterator->next_id();</li>
<li>$num = $iterator->prev_id();</li>
<li>$num = $iterator->next_retrieve();</li>
<li>$num = $iterator->prev_retrieve();</li>
<li>$num = $iterator->next_read();</li>
<li>$num = $iterator->prev_read();</li>
<li>$num = $iterator->doc_id();</li>
<li>@docs = $iterator->read_doc();</li>
<!-- <li>@docs = $iterator->to_array();</li> -->
</ul></li>
<li>XML::Comma::Util<ul>
<li>$first_element = trim ( @strings_to_trim );</li>
<li>@trimmed_strings = trim ( @strings_to_trim );</li>
<li>$bool = array_includes ( @array, $string );</li>
<li>@array/[] = arrayref_remove_dups ( $array_ref );</li>
<li>@array/[] = arrayref_remove ( $array_ref, @els/[] );</li>
<li>@array = flatten_arrayrefs ( @arrays/[] [...] );</li>
<li>$escaped_string = XML_basic_escape ( $string );</li>
<li>$unescaped_string = XML_basic_unescape ( $string );</li>
<li>$escaped_string = XML_smart_escape ( $string ); # ignores entities</li>
<li>$escaped_string = XML_bare_amp_unescape ( $string ); # ditto</li>
<li>'' = dbg ( @arrays/[] [...] );</li>
<li>($name, @args) = name_and_args_eval ( $string );</li>
<li>$string = random_an_string ( $length );</li>
</ul></li>
<li>XML::Comma::Storage::Util<ul>
<li>$doc_key = XML::Comma::Storage::Util::concat_key ( type=>$type, store=>$store_name, id=>$id );</li>
<li>( $type, $store_name, $id ) = XML::Comma::Storage::Util->split_key ( $key );</li>
</ul></li>
</ul>
</p>
<h1>Appendix: Table Structure of Index Databases</h1>
<p> Each Comma <b>index</b> creates (and uses) at least one table in
the SQL database. All of these tables are kept track of by an
<i>index_tables</i> table. It is usually possible for a programmer or
system administrator to remain blissfully ignorant of the information
presented here. This section is written for the curious and the
unlucky. </p>
<h2>The index_tables Table</h2>
<p> The <i>index_tables</i> table contains a record for each database
table that has been created as part of an index's backing store. The fields are as follows: </p>
<ul>
<li>_comma_flag</li>
<li>_sq</li>
<li>doctype</li>
<li>index_name</li>
<li>table_name</li>
<li>table_type</li>
<li>last_modified</li>
<li>sort_spec</li>
<li>textsearch</li>
<li>collection</li>
<li>index_def</li>
</ul>
<p> The <i>_comma_flag</i> field is used by parts of the system that
need to mark a table as in use. The <b>rebuild()</b> method, for
example, tags any tables that it is working on, and will refuse to
begin work if any tables for an index appear to be so tagged. </p>
<p> The <i>_sq</i> field is a unique, ascending integer sequence, and
is used to generate unique names for all the tables that Comma
creates. </p>
<p> The <i>doctype</i> and <i>index_name</i> fields together identify
which Comma index a table "belongs" to. </p>
<p> The <i>table_name</i> field gives the name of the table. When a
new table is created, a name is generated by appending an underscore
and the next valid <i>_sq</i> integer to the first few letters of the
doctype. </p>
<p> The <i>table_type</i> field is an integer indicating what kind of
table this is (see below). </p>
<p> The <i>last_modified</i> field indicates when a table was last
changed, but is currently not much used. </p>
<p> Only one of the <i>sort_spec</i>, <i>textsearch</i>,
<i>collection</i> and <i>index_def</i> fields is used by any single
record: these fields hold extra information relevant to the various
kinds of tables. </p>
<h2> Table Type 1: The Data Table </h2>
<p> Every index has a Data table. Each row in the data table
represents a single record in the index. The table has three standard
columns: <code>_comma_flag</code> holds a status value,
<code>_sq</code> holds a short, unique identifier that can be used to
refer back to this record, and <code>doc_id</code> holds the doc_id of the
document this record is drawn from. The rest of the columns in the
table are created from the fields and collections defined by the
index; each is named the same as the field, typed according to the
field's <b>sql_type</b>, and holds the contents of that field or
collection for the record in question. Field columns simply hold the
scalar value returned by the field's element or method
call. Collection columns hold a string consisting of all of the values
returned by the element or method call, each blocked between two
'pipe' characters, concatenated all together. </p>
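<p> Reading the description of collection columns literally, each
value is wrapped in its own pair of pipe characters before the pieces
are joined. The following sketch builds such a string in plain Perl;
the sample values are invented, and the exact string Comma writes
should be checked against a real data table before relying on this
format. </p>
<pre>
use strict;
use warnings;

my @values = ( 'news', 'sports', 'local' );

# Wrap each value in pipes, then concatenate.
my $column = join '', map { "|$_|" } @values;

# A delimited search only matches whole values.
my $has_sports = index( $column, '|sports|' ) >= 0 ? 1 : 0;
my $has_sport  = index( $column, '|sport|' )  >= 0 ? 1 : 0;

print "$column\n";
print "sports: $has_sports, sport: $has_sport\n";
</pre>
<p> One consequence of wrapping every value is that a search for
<code>|sport|</code> does not match a record that only contains
<code>|sports|</code>. </p>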
<h2> Table Type 2: The 'many tables' Table </h2>
<p> An index may have as many 'many tables' collection tables as the
database permits. Each table is created on demand when an update
encounters a new value in the collection. </p>
<p> Each table contains only two columns, the familiar
<code>_comma_flag</code> field used for coordination and locking by
various pieces of the indexing code, and a <code>doc_id</code>
column. The table simply keeps track of the documents that carry
that collection value; logical joins are used to select subsets of
records that match the selection criteria. </p>
<h2> Table Types 3 and 4: Textsearch Index and Defers Tables </h2>
<p> Each textsearch defined by an index uses two tables. The main
table, called a textsearch_index_table by the system internals, stores
an inverted index, with each record in the table mapping a word to a
packed array containing data table _sq keys. The second table -- the
textsearch_defers table -- contains a list of actions that have been
performed on the index since the last
<b>sync_deferred_textsearches()</b> call. </p>
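<p> The idea of an inverted index that maps each word to a packed
array of <i>_sq</i> keys can be sketched in a few lines of Perl. This
is an illustration of the data structure only: the in-memory hash, the
<code>add_record</code> helper, the simple word splitting and the
<code>N</code> (32-bit unsigned) pack format are all assumptions made
for the example, not a description of how Comma tokenizes text or lays
out its table. </p>
<pre>
use strict;
use warnings;

# word => packed string of _sq keys
my %inverted;

# Hypothetical helper: record that the data-table row $sq contains
# each distinct word of $text.
sub add_record {
  my ( $sq, $text ) = @_;
  my %seen;
  for my $word ( grep { length } split /\W+/, lc $text ) {
    next if $seen{$word}++;
    $inverted{$word} = '' unless exists $inverted{$word};
    $inverted{$word} .= pack 'N', $sq;
  }
}

add_record( 1, 'Comma indexes documents' );
add_record( 2, 'Documents live in stores' );

# Look up a word and unpack the matching _sq keys.
my @hits = unpack 'N*', $inverted{documents};

print "documents: @hits\n";
</pre>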
<h2> Table Type 5: The 'binary table' Table </h2>
<p> A 'binary table' collection table is created for each
binary-table-typed collection in each Index. Each of these tables
contains three columns, the familiar <code>_comma_flag</code> field,
a <code>doc_id</code> field and a <code>value</code> field. The table
maps doc_id/value pairs together, so that a where'd selection can be
followed by a logical join to determine subsets of records that match
sort criteria. </p>
</body>
</html>