NAME
Data::Domain - Data description and validation
SYNOPSIS
use Data::Domain qw/:all/;
my $domain = Struct(
anInt => Int(-min => 3, -max => 18),
aNum => Num(-min => 3.33, -max => 18.5),
aDate => Date(-max => 'today'),
aLaterDate => sub {my $context = shift;
Date(-min => $context->{flat}{aDate})},
aString => String(-min_length => 2, -optional => 1),
anEnum => Enum(qw/foo bar buz/),
anIntList => List(-min_size => 1, -all => Int),
aMixedList => List(Int, String, Int(-min => 0), Date),
aStruct => Struct(foo => String, bar => Int(-optional => 1))
);
my $messages = $domain->inspect($some_data);
my_display_error($messages) if $messages;
DESCRIPTION
A data domain is a description of a set of values, either scalar or structured (arrays or hashes). The description can include many constraints, like minimal or maximal values, regular expressions, required fields, forbidden fields, and also contextual dependencies. From that description, one can then invoke the domain's inspect method to check if a given value belongs to it or not. In case of mismatch, a structured set of error messages is returned.
The motivation for writing this package was to be able to express in a compact way some possibly complex constraints about structured data. Typically the data is a Perl tree (nested hashrefs or arrayrefs) that may come from XML, JSON, from a database through DBIx::DataModel, or from postprocessing an HTML form through CGI::Expand. Data::Domain is a kind of tree parser on that structure, with some facilities for dealing with dependencies within the structure through lazy evaluation of domains.
There are several other packages in CPAN doing data validation; these are briefly listed in the "SEE ALSO" section.
DISCLAIMER : this code is still in design exploration phase; some parts of the API may change in future versions.
GLOBAL API
Shortcut functions for domain constructors
Internally, domains are represented as Perl objects; however, it would be tedious to write
my $domain = Data::Domain::Struct->new(
anInt => Data::Domain::Int->new(-min => 3, -max => 18),
aDate => Data::Domain::Date->new(-max => 'today'),
...
);
so for each of its builtin domain constructors, Data::Domain exports a plain function that just calls new on the appropriate subclass. If you import those functions (use Data::Domain qw/:all/, or use Data::Domain qw/Struct Int Date .../), then you can write more conveniently :
my $domain = Struct(
anInt => Int(-min => 3, -max => 18),
aDate => Date(-max => 'today'),
...
);
Short function names like Int or String are convenient, but may cause name clashes with other modules. If conflicts happen, don't import the function names, and explicitly call the new method on domain constructors -- or write your own wrappers around them.
Methods
new
Creates a new domain object, from one of the domain constructors listed below (Num, Int, Date, etc.). The Data::Domain class itself has no new method, because it is an abstract class.
Arguments to the new method specify various constraints for the domain (minimal/maximal values, regular expressions, etc.); most often they are specific to a given domain constructor, so see the details below. However, there are also some generic options :
-optional
-
if true, an undef value will be accepted, without generating an error message
-name
-
defines a name for the domain, that will be printed in error messages instead of the subclass name.
-messages
-
defines ad hoc messages for that domain, instead of the builtin messages. The argument can be either a string or a hashref, as explained in the "ERROR MESSAGES" section.
Option names always start with a dash. If no option name is given, parameters to the new method are passed to the default option, which differs according to the constructor subclass. For example the default option in List is -items, so
my $domain = List(Int, String, Int);
is equivalent to
my $domain = List(-items => [Int, String, Int]);
inspect
my $messages = $domain->inspect($some_data);
Inspects the supplied data, and returns an error message (or a structured collection of messages) if anything is wrong. If the data successfully passed all domain tests, then nothing is returned.
For scalar domains (Num, String, etc.), the error message is just a string. For structured domains (List, Struct), the return value is a corresponding arrayref or hashref, like for example
{anInt => "smaller than minimum 3",
aDate => "not a valid date",
aList => ["message for item 0", undef, undef, "message for item 3"]}
The client code can then exploit this structure to dispatch error messages to appropriate locations (typically these will be the form fields that gathered the data).
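As a rough illustration of that dispatching step, the sketch below walks a message structure shaped like the one above and flattens it into "path => message" pairs. The helper name flatten_messages and the dotted path notation are assumptions made for this example; they are not part of Data::Domain.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Data::Domain): walks the structure
# returned by inspect() and collects "path => message" pairs.
sub flatten_messages {
    my ($messages, @path) = @_;
    return () unless defined $messages;          # no error at this node
    if (ref $messages eq 'HASH') {
        return map { flatten_messages($messages->{$_}, @path, $_) }
               sort keys %$messages;
    }
    if (ref $messages eq 'ARRAY') {
        return map { flatten_messages($messages->[$_], @path, $_) }
               0 .. $#$messages;
    }
    return (join('.', @path) => $messages);      # scalar leaf: an error string
}

# Sample result, shaped like the structure shown above
my $messages = {
    anInt => "smaller than minimum 3",
    aDate => "not a valid date",
    aList => ["message for item 0", undef, undef, "message for item 3"],
};

my %by_field = flatten_messages($messages);
print "$_: $by_field{$_}\n" for sort keys %by_field;
```

Each key of %by_field (for instance aList.3) can then be mapped to the form field that should display the message; entries that are undef in the original structure simply do not appear.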
BUILTIN DOMAIN CONSTRUCTORS
Whatever
my $domain = Struct(
just_anything => Whatever,
is_defined => Whatever(-defined => 1),
is_undef => Whatever(-defined => 0),
is_true => Whatever(-true => 1),
is_false => Whatever(-true => 0),
is_object => Whatever(-isa => 'My::Funny::Object'),
has_methods => Whatever(-can => [qw/jump swim dance sing/]),
);
Encapsulates just any kind of Perl value (including undef). Options are :
- -defined
-
If true, the data must be defined. If false, the data must be undef.
- -true
-
If true, the data must be true. If false, the data must be false.
- -isa
-
The data must be an object of the specified class.
- -can
-
The data must implement the listed methods, supplied either as an arrayref (several methods) or as a scalar (just one method).
Num
my $domain = Num(-range =>[-3.33, 999], -not_in => [2, 3, 5, 7, 11]);
Domain for numbers (including floats). Options are :
- -min
-
The data must be greater or equal to the supplied value.
- -max
-
The data must be smaller or equal to the supplied value.
- -range
-
-range => [$min, $max] is equivalent to -min => $min, -max => $max.
- -not_in
-
The data must be different from all values in the exclusion set, supplied as an arrayref.
Int
my $domain = Int(-min => 0, -max => 999, -not_in => [2, 3, 5, 7, 11]);
Domain for integers. Accepts the same options as Num and returns the same error messages.
Date
Data::Domain::Date->parser('EU'); # default
my $domain = Date(-min => '01.01.2001',
-max => 'today',
-not_in => ['02.02.2002', '03.03.2003', 'yesterday']);
Domain for dates, implemented via the Date::Calc module. By default, dates are parsed according to the European format, i.e. through the Decode_Date_EU function; this can be changed by setting
Data::Domain::Date->parser('US'); # will use Decode_Date_US
or
Data::Domain::Date->parser(\&your_own_date_parsing_function);
# that func. should return an array ($year, $month, $day)
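As an illustration of the last variant, here is a minimal sketch of a custom parsing function for ISO-style dates (yyyy-mm-dd). The function name parse_iso_date is hypothetical, and returning an empty list for unparseable input is an assumption modelled on the behaviour of Date::Calc's Decode_Date_* functions; the sketch does not check that the month and day are in range.

```perl
use strict;
use warnings;

# Hypothetical parser for ISO dates such as '2007-03-15'.
# Returning an empty list on failure is an assumption, modelled on
# what Date::Calc's Decode_Date_* functions do.
sub parse_iso_date {
    my ($string) = @_;
    return () unless defined $string;
    my ($year, $month, $day) = $string =~ /^\s*(\d{4})-(\d{2})-(\d{2})\s*$/
        or return ();
    return ($year + 0, $month + 0, $day + 0);   # numify: '03' becomes 3
}

# It would then be registered as shown above:
# Data::Domain::Date->parser(\&parse_iso_date);
```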
When outputting error messages, dates will be printed according to Date::Calc's current language (English by default); see that module's documentation for changing the language.
In the options below, the special keywords today, yesterday or tomorrow may be used instead of a date constant, and will be replaced by the appropriate date when performing comparisons.
- -min
-
The data must be greater or equal to the supplied value.
- -max
-
The data must be smaller or equal to the supplied value.
- -range
-
-range => [$min, $max] is equivalent to -min => $min, -max => $max.
- -not_in
-
The data must be different from all values in the exclusion set, supplied as an arrayref.
Time
my $domain = Time(-min => '08:00', -max => 'now');
Domain for times in format hh:mm:ss (minutes and seconds are optional).
In the options below, the special keyword now may be used instead of a time, and will be replaced by the current local time when performing comparisons.
- -min
-
The data must be greater or equal to the supplied value.
- -max
-
The data must be smaller or equal to the supplied value.
- -range
-
-range => [$min, $max] is equivalent to -min => $min, -max => $max.
String
my $domain = String(qr/^[A-Za-z0-9_\s]+$/);
my $domain = String(-regex => qr/^[A-Za-z0-9_\s]+$/,
-antiregex => qr/$RE{profanity}/, # see Regexp::Common
-range => ['AA', 'zz'],
-length => [1, 20],
-not_in => [qw/foo bar/]);
Domain for strings. Options are:
- -regex
-
The data must match the supplied compiled regular expression. Don't forget to put ^ and $ anchors if you want your regex to check the whole string.
-regex is the default option, so you may just pass the regex as a single unnamed argument to String().
- -antiregex
-
The data must not match the supplied regex.
- -min
-
The data must be greater or equal to the supplied value.
- -max
-
The data must be smaller or equal to the supplied value.
- -range
-
-range => [$min, $max] is equivalent to -min => $min, -max => $max.
- -min_length
-
The string length must be greater or equal to the supplied value.
- -max_length
-
The string length must be smaller or equal to the supplied value.
- -length
-
-length => [$min, $max] is equivalent to -min_length => $min, -max_length => $max.
- -not_in
-
The data must be different from all values in the exclusion set, supplied as an arrayref.
Enum
my $domain = Enum(qw/foo bar buz/);
Domain for a finite set of scalar values. Options are:
- -values
-
Ref to an array of values admitted in the domain. This would be called as Enum(-values => [qw/foo bar buz/]), but since this is the default option, it can be simply written as Enum(qw/foo bar buz/).
List
my $domain = List(String, Int, String, Num);
my $domain = List(-items => [String, Int, String, Num]); # same as above
my $domain = List(-all => String(qr/^[A-Z]+$/),
-any => String(-min_length => 3),
-size => [3, 10]);
Domain for lists of values (stored as Perl arrayrefs). Options are:
- -items
-
Ref to an array of domains; then the first n items in the data must match those domains, in the same order.
This is the default option, so item domains may be passed directly to the new method, without the -items keyword.
- -min_size
-
The data must be a ref to an array with at least that number of entries.
- -max_size
-
The data must be a ref to an array with at most that number of entries.
- -size
-
-size => [$min, $max] is equivalent to -min_size => $min, -max_size => $max.
- -all
-
All remaining entries in the array, after the first n entries as specified by the -items option (if any), must satisfy that domain specification.
- -any
-
At least one remaining entry in the array, after the first n entries as specified by the -items option (if any), must satisfy that domain specification. A list domain can have both an -all and an -any constraint.
The argument to -any can also be an arrayref of domains, as in
List(-any => [String(qr/^foo/), Num(-range => [1, 10]) ])
This means that one member of the list must be a string starting with foo, and one member of the list (in this case, necessarily another one) must be a number between 1 and 10. Note that this is different from
List(-any => One_of(String(qr/^foo/), Num(-range => [1, 10])))
which says that one member of the list must be either a string starting with foo or a number between 1 and 10.
Struct
my $domain = Struct(foo => Int, bar => String);
my $domain = Struct(-fields => [foo => Int, bar => String],
-exclude => '*');
Domain for associative structures (stored as Perl hashrefs). Options are:
- -fields
-
Supplies a list of keys with their associated domains. The list might be given either as a hashref or as an arrayref (in which case the order of individual field checks will follow the order in the array). The ordering may make a difference in case of context dependencies (see "LAZY CONSTRUCTORS" below).
- -exclude
-
Specifies which keys are not allowed in the structure. The exclusion may be specified as an arrayref of key names, as a compiled regular expression, or as the string constant '*' or 'all' (meaning that no key will be allowed except those explicitly listed in the -fields option).
One_of
my $domain = One_of($domain1, $domain2, ...);
Union of domains : successively checks the member domains, until one of them succeeds. Options are:
LAZY CONSTRUCTORS (CONTEXT DEPENDENCIES)
Principle
If an element of a structured domain (List or Struct) depends on another element, then we need to lazily construct the domain. Consider for example a struct in which the value of field date_end must be greater than date_begin : the subdomain for date_end can only be constructed when the argument to -min is known, namely when the domain inspects an actual data structure.
Lazy domain construction is achieved by supplying a function reference instead of a domain object. That function will be called with some context information, and should return the domain object. So our example becomes :
my $domain = Struct(
date_begin => Date,
date_end => sub {my $context = shift;
Date(-min => $context->{flat}{date_begin})}
);
Structure of context
The supplied context is a hashref containing the following information:
- root
-
the overall root of the inspected data
- path
-
the sequence of keys or array indices that led to the current data node. With that information, the subdomain is able to jump to other ancestor or sibling data nodes within the tree, with the help of the node_from_path function.
- flat
-
a flat hash containing an entry for any hash key met so far while traversing the tree. In case of name clashes, most recent keys (down in the tree) override previous keys.
- list
-
a reference to the last list (arrayref) encountered while traversing the tree.
Here is an example :
use Data::Dumper; # provides Dumper(), used below
my $data = {foo => [undef, 99, {bar => "hello, world"}]};
my $domain = Struct(
foo => List(Whatever,
Whatever,
Struct(bar => sub {my $context = shift;
print Dumper($context);
String;})
)
);
$domain->inspect($data);
This code will print something like
$VAR1 = {
'root' => {'foo' => [undef, 99, {'bar' => 'hello, world'}]},
'path' => ['foo', 2, 'bar'],
'list' => $VAR1->{'root'}{'foo'},
'flat' => {
'bar' => 'hello, world',
'foo' => $VAR1->{'root'}{'foo'}
}
};
Usage examples
Contextual sets
my $some_cities = {
Switzerland => [qw/Genève Lausanne Bern Zurich Bellinzona/],
France => [qw/Paris Lyon Marseille Lille Strasbourg/],
Italy => [qw/Milano Genova Livorno Roma Venezia/],
};
my $domain = Struct(
country => Enum(keys %$some_cities),
city => sub {
my $context = shift;
Enum(-values => $some_cities->{$context->{flat}{country}});
});
Ordered lists
Here is an example of a domain for ordered lists of integers:
my $domain = List(-all => sub {
my $context = shift;
my $index = $context->{path}[-1];
return Int if $index == 0; # first item has no constraint
return Int(-min => $context->{list}[$index-1] + 1);
});
Recursive domains
A domain for expression trees, where leaves are numbers, and intermediate nodes are binary operators on subtrees
my $expr_domain; # declared first, so that the closures below can refer to it
$expr_domain = One_of(Num, Struct(operator => String(qr(^[-+*/]$)),
left => sub {$expr_domain},
right => sub {$expr_domain}));
WRITING NEW DOMAIN CONSTRUCTORS
Implementing new domain constructors is fairly simple : create a subclass of Data::Domain and implement a new method and an _inspect method. See the source code of Data::Domain::Num or Data::Domain::String for short examples.
However, before writing such a class, consider whether the existing mechanisms are not enough for your needs. For example, many domains could be expressed as a String with a regular expression; therefore it is just a matter of writing a wrapper that supplies that regular expression, and passes other arguments (like -optional) to the String constructor :
sub Phone { String(-regex => qr/^\+?[0-9() ]+$/,
-messages => "Invalid phone number", @_) }
sub Email { String(-regex => qr/^[-.\w]+\@[\w.]+$/,
-messages => "Invalid email", @_) }
sub Contact { Struct(-fields => [name => String,
phone => Phone,
mobile => Phone(-optional => 1),
emails => List(-all => Email) ], @_) }
ERROR MESSAGES
Messages returned by validation rules have default values, but can be customized in several ways.
Each error message has an internal string identifier, like TOO_SHORT, NOT_A_HASH, etc. The documentation for each builtin domain tells which message identifiers may be generated in that domain. Message identifiers are then associated with user-friendly strings, either within the domain itself, or via a global table. Such strings are actually sprintf format strings, with placeholders for printing some specific details about the validation rule : for example the String domain defines default messages such as
TOO_SHORT => "less than %d characters",
SHOULD_MATCH => "should match %s",
The -messages option to domain constructors
Any domain constructor may receive a -messages option to locally override the messages for that domain. The argument may be
a plain string : that string will be returned for any kind of validation error within the domain
a hashref : keys of the hash should be message identifiers, and values should be the associated error strings.
a coderef : the referenced function is called, and the return value becomes the error string. The called function receives the message identifier as argument.
Here is an example :
sub Phone {
String(-regex => qr/^\+?[0-9() ]+$/,
-min_length => 7,
-messages => {
TOO_SHORT => "phone number should have at least %d digits",
SHOULD_MATCH => "invalid chars in phone number"
}, @_)
}
The messages class method
Default strings associated with message identifiers are stored in a global table. The distribution contains builtin tables for English (the default) and for French : these can be chosen through the messages class method :
Data::Domain->messages('english'); # the default
Data::Domain->messages('français');
The same method can also receive a custom table.
my $custom_table = {...}; # see
Data::Domain->messages($custom_table);
This should be a two-level hashref : first-level entries in the hash correspond to Data::Domain subclasses (e.g. Num => {...}, String => {...}), or to the constant Generic; for each of those, the second-level entries should correspond to message identifiers as specified in the doc for each subclass (for example TOO_SHORT, NOT_A_HASH, etc.). Values should be strings suitable to be fed to sprintf. Look at $builtin_msgs in the source code to see an example.
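To make the two-level layout concrete, here is a small sketch of such a table together with a lookup function showing how a subclass name, a message identifier and some arguments could be turned into a string through sprintf. The message texts and the format_message helper are invented for this illustration; the helper is not the module's internal code, which may resolve messages differently.

```perl
use strict;
use warnings;

# First level: Data::Domain subclass; second level: message identifier.
# The texts below are made up for this example.
my $custom_table = {
    String => {
        TOO_SHORT    => "needs at least %d characters",
        SHOULD_MATCH => "does not look like %s",
    },
    Num => {
        TOO_SMALL => "must not be below %s",
    },
};

# Hypothetical lookup, for illustration only: not the module's own code.
sub format_message {
    my ($table, $subclass, $msg_id, @args) = @_;
    my $format = $table->{$subclass}{$msg_id};
    return $msg_id unless defined $format;    # fall back to the bare identifier
    return sprintf $format, @args;
}

print format_message($custom_table, String => 'TOO_SHORT', 7), "\n";

# The table itself would be installed as shown above:
# Data::Domain->messages($custom_table);
```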
Finally, it is also possible to write your own message generation handler :
Data::Domain->messages(sub {my ($msg_id, @args) = @_;
return "you just got it wrong ($msg_id)"});
What is received in @args depends on which validation rule is involved; it can be for example the minimal or maximal bounds, or the regular expression being checked.
The -name option to domain constructors
The name of the domain is prepended in front of error messages. The default name is the subclass of Data::Domain, so a typical error message for a string would be
String: less than 7 characters
However, if a -name is supplied to the domain constructor, that name will be printed instead :
my $dom = String(-min_length => 7, -name => 'Phone');
# now error would be: "Phone: less than 7 characters"
Message identifiers
This section lists all possible message identifiers generated by the builtin constructors.
Whatever
-
MATCH_DEFINED, MATCH_TRUE, MATCH_ISA, MATCH_CAN.
Num
-
INVALID, TOO_SMALL, TOO_BIG, EXCLUSION_SET.
Date
-
INVALID, TOO_SMALL, TOO_BIG, EXCLUSION_SET.
Time
-
INVALID, TOO_SMALL, TOO_BIG.
String
-
TOO_SHORT, TOO_LONG, TOO_SMALL, TOO_BIG, EXCLUSION_SET, SHOULD_MATCH, SHOULD_NOT_MATCH.
Enum
-
NOT_IN_LIST.
List
-
The domain will first check if the supplied array is of appropriate shape; in case of failure, it will return one of the following scalar messages : NOT_A_LIST, TOO_SHORT, TOO_LONG.
Then it will check all items in the supplied array according to the -items and -all specifications; in case of failure, an arrayref of messages is returned, where message positions correspond to the positions of offending data items.
Finally, the domain will check the -any constraint; in case of failure, it returns an ANY scalar message. Since that message contains the name of the missing domain, it is a good idea to use the -name option so that the message is easily comprehensible, as for example in
List(-any => String(-name => "uppercase word", -regex => qr/^[A-Z]$/))
Here the error message would be : should have at least one uppercase word.
Struct
-
The domain will first check if the supplied hash is of appropriate shape; in case of failure, it will return one of the following scalar messages : NOT_A_HASH, FORBIDDEN_FIELD.
Then it will check all entries in the supplied hash according to the -fields specification, and return a hashref of messages, where keys correspond to the keys of offending data items.
One_of
-
If all member domains failed to accept the data, an arrayref of error messages is returned, where the order of messages corresponds to the order of the checked domains.
INTERNALS
node_from_path
my $node = node_from_path($root, @path);
Convenience function to find a given node in a data tree, starting from the root and following a path (a sequence of hash keys or array indices). Returns undef if no such path exists in the tree. Mainly useful for contextual constraints in lazy constructors (see "LAZY CONSTRUCTORS" above).
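The traversal described here can be pictured with the following sketch, written under the hypothetical name my_node_from_path to make clear that it is an illustrative re-implementation and not the module's own code, which may differ in details (for instance in how it treats negative indices).

```perl
use strict;
use warnings;

# Illustrative re-implementation under a hypothetical name; the module's
# own node_from_path may differ in details.
sub my_node_from_path {
    my ($node, @path) = @_;
    for my $step (@path) {
        if (ref $node eq 'HASH') {
            return undef unless exists $node->{$step};
            $node = $node->{$step};
        }
        elsif (ref $node eq 'ARRAY') {
            # only plain non-negative indices within bounds are followed
            return undef unless $step =~ /^\d+$/ && $step <= $#$node;
            $node = $node->[$step];
        }
        else {
            return undef;    # cannot descend into a scalar or undef
        }
    }
    return $node;
}

my $root = {foo => [undef, 99, {bar => "hello, world"}]};
print my_node_from_path($root, 'foo', 2, 'bar'), "\n";
```

Within a lazy constructor, the root and path entries of the context would supply the arguments; for example, dropping the last element of the path and appending another key would reach a sibling field.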
msg
Internal utility method for generating an error message.
subclass
Method that returns the short name of the subclass of Data::Domain (e.g. returns 'Int' for Data::Domain::Int).
SEE ALSO
Doc and tutorials on complex Perl data structures: perlref, perldsc, perllol.
Other CPAN modules doing data validation : Data::FormValidator, CGI::FormBuilder, HTML::Widget::Constraint, Jifty::DBI, Data::Constraint, Declare::Constraints::Simple. Among those, Declare::Constraints::Simple is the closest to Data::Domain, because it is also designed to deal with substructures; yet it has a different approach to combinations of constraints and scope dependencies.
Some inspiration for Data::Domain came from the wonderful Parse::RecDescent module, especially the idea of passing a context where individual rules can grab information about neighbour nodes.
TODO
- generate javascript validation code
- generate XML schema
- normalization / conversions (-filter option)
- msg callbacks (-filter_msg option)
- default values within domains ? (good idea ?)
AUTHOR
Laurent Dami, <laurent.d...@etat.geneve.ch>
COPYRIGHT AND LICENSE
Copyright 2006, 2007 by Laurent Dami.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.