NAME

Data::Sah::Manual::Schema - Description of Sah schema language

VERSION

version 0.02

OVERVIEW AND TERMINOLOGY

Although it can contain extra things, a schema is essentially a type definition that states the set of valid values for data. Sah defines several basic types such as "bool", "int", "str", "array", and "hash", plus a few others. See Data::Sah::Type::* for the complete list.

A type can have clauses, which mostly declare constraints. Each clause has a clause name and a hash of clause attributes. The most commonly used attribute is probably the value attribute; it is so common that, when specifying a clause in a schema, you give only its value unless you are using the normalized form.

min => 10

# is actually equivalent to:
'min.*' => {value=>10}

Another example:

between => [1, 10]

# equivalent to the normalized form:
'between.*' => {value=>[1, 10]}

Aside from the value attribute, there are others, serving various functions. See "Clause attribute" for more information.

When validating data, each clause is tested and must succeed for the whole validation to succeed (there are exceptions, but they are not important right now). Aside from declaring constraints, clauses can also declare other things such as a default value (the default clause) or store metadata (the summary, description, and tags clauses). See each type's documentation for the list of clauses it knows.

A clause set is just a term for a hash of clauses. A Sah schema essentially consists of a type name and a clause set.

Base schema. You can define a schema, declare it as a new type, and then write subsequent schemas against that type, along with additional clauses. This is very much like subtyping. See "BASE SCHEMA" for more information.

GENERAL STRUCTURE

A Sah schema is an array:

[TYPE, CLAUSE_SET, EXTRA_STUFFS]

CLAUSE_SET and EXTRA_STUFFS are optional.

Examples:

['int']
['int', {min=>1, max=>10}]

If you don't have EXTRA_STUFFS, you can also write it like this (saves a couple of characters):

["int", min=>1, max=>10]

If you don't have any clauses, you can use the scalar/string form:

"int"

EXTRA_STUFFS is a hashref. Currently the only known key is def, for defining base schemas locally (see "BASE SCHEMA" for more information). The other keys are reserved for future use.

Text/strings should be in Unicode (UTF-8).
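
The accepted forms described above can be brought into one shape before further processing. The following is a minimal sketch in plain Perl, not the Data::Sah implementation; the helper name to_array_form is hypothetical, and it assumes EXTRA_STUFFS defaults to an empty hash when omitted.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Data::Sah): bring any of the accepted
# schema forms into the full [TYPE, CLAUSE_SET, EXTRA_STUFFS] array form.
sub to_array_form {
    my ($schema) = @_;

    # Scalar/string form: "int"
    return [$schema, {}, {}] unless ref($schema) eq 'ARRAY';

    my ($type, @rest) = @$schema;

    # Full form: the second element (if any) is a clause set hashref.
    if (!@rest || ref($rest[0]) eq 'HASH') {
        return [$type, $rest[0] // {}, $rest[1] // {}];
    }

    # Flattened form: ["int", min=>1, max=>10]
    die "Odd number of elements in flattened clause set\n" if @rest % 2;
    return [$type, {@rest}, {}];
}

my $full = to_array_form(["int", min => 1, max => 10]);
print "$full->[0]: min=$full->[1]{min}, max=$full->[1]{max}\n";
```

The sketch does not interpret the string-form shortcuts; it only unifies the container shape.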

String form shortcuts

For convenience, the string form can specify not only the type but also some common clauses via shortcut syntax:

  • The * suffix (req=>1)

    "X*"

    is equivalent to:

    [X => {req=>1}]
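
As an illustration of the "*" suffix, here is a small sketch that expands the string form into the array form. The function name expand_string_form is hypothetical, and the sketch assumes type names consist of word characters only; other string shortcuts (such as the 'throw[]' notation used later in this document) are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: expand the string form, including the "*" suffix
# shortcut, into [TYPE, CLAUSE_SET].
sub expand_string_form {
    my ($str) = @_;
    my ($type, $star) = $str =~ /\A(\w+)(\*?)\z/
        or die "Unrecognized string form: $str\n";
    return [$type, $star ? {req => 1} : {}];
}

my $s = expand_string_form("int*");
print "$s->[0] req=$s->[1]{req}\n";
```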

BASE SCHEMA

As mentioned before, you can define a schema as a type and then write other schemas against that type. For example:

# defined as pos_int type
[int => {min=>1}]

and later:

# a positive integer, divisible by 5
[pos_int => {div_by=>5}]

During data validation, each base schema is replaced by its original definition, and all the clause sets are evaluated. This is illustrated with a plus sign:

[int => {min=>1} + {div_by=>5}]
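
The replacement step can be sketched as a lookup loop. This is an illustrative sketch only: the %base registry, the type names small_int and small_even, and the function resolve_base are all hypothetical and are not part of Data::Sah.

```perl
use strict;
use warnings;

# Hypothetical registry of base schemas, keyed by type name. Each entry is
# [PARENT_TYPE, CLAUSE_SET].
my %base = (
    small_int  => [int       => {max    => 100}],
    small_even => [small_int => {div_by => 2}],
);

# Resolve a schema against the registry. Returns the built-in type followed
# by the clause sets to evaluate, outermost base schema first.
sub resolve_base {
    my ($schema) = @_;
    my ($type, $cset) = @$schema;
    my @csets = ($cset);
    while (my $def = $base{$type}) {
        unshift @csets, $def->[1];
        $type = $def->[0];
    }
    return ($type, @csets);
}

my ($type, @csets) = resolve_base([small_even => {min => 10}]);
print "$type with ", scalar(@csets), " clause sets\n";
```

The returned clause sets correspond to the operands of the plus-sign notation, read from left to right.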

You can also declare base schemas/types locally using the def key in EXTRA_STUFFS (the third element of the array schema), for example:

[throws => {},
 {
     def => {
         single_dice_throw  => [int => {in => [1,2,3,4,5,6]}],
         sdt                => "single_dice_throw", # short notation
         dice_pair_throw    => [array => {len=>2, elems=>["sdt", "sdt"]}],
         dpt                => "dice_pair_throw",   # short notation
         throw              => [any => {of => ["sdt", "dpt"]}],
         throws             => 'throw[]',
     },
 }
]

The above schema describes a list of dice throws ("throws"). Each throw can be a single dice throw ("sdt"), which is a number between 1 and 6, or a throw of two dice ("dpt"), which is a 2-element array where each element is a number between 1 and 6.

Examples of valid data for this schema:

[1, [1,3], 6, 4, 2, [3,5]]

Examples of invalid data:

1                  # not an array
[1, [2, 3], 0]     # the third throw is invalid
[1, [2, 0, 4], 4]  # the second throw is invalid

All the base schema names ("throw", "throws", "sdt", etc.) are declared only locally and are unknown outside the schema. You can even nest such definitions.
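
The rules of the dice schema can be restated as hand-written Perl checks, which makes it easy to confirm the valid and invalid examples above. This is not how Sah evaluates schemas; the function names is_sdt, is_dpt, and is_throws are hypothetical and merely mirror the local type names.

```perl
use strict;
use warnings;

# A single dice throw: a plain scalar that is one of 1..6.
sub is_sdt {
    my ($v) = @_;
    return (defined($v) && !ref($v) && $v =~ /\A[1-6]\z/) ? 1 : 0;
}

# A dice pair throw: an array of exactly two single throws.
sub is_dpt {
    my ($v) = @_;
    return (ref($v) eq 'ARRAY' && @$v == 2
            && is_sdt($v->[0]) && is_sdt($v->[1])) ? 1 : 0;
}

# A list of throws: an array whose elements are each an sdt or a dpt.
sub is_throws {
    my ($v) = @_;
    return 0 unless ref($v) eq 'ARRAY';
    for my $throw (@$v) {
        return 0 unless is_sdt($throw) || is_dpt($throw);
    }
    return 1;
}

print is_throws([1, [1,3], 6, 4, 2, [3,5]]) ? "valid\n" : "invalid\n";
```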

Optional/conditional definition

If you put a ? suffix after the definition name then it means that the definition is optional and can be skipped if the type is already defined, e.g.:

def         => {
    "email?"   => [str => {req=>1, match=>".+\@.+"}],
    "username" => [str => {req=>1, match=>'^[a-z0-9_]+$'}],
},

In the above example, if there is already an "email" type defined at that time, the definition will be skipped instead of a "cannot redefine type" error being generated.

Optional definitions are useful if you want to provide some defaults (e.g. a rudimentary validation for email) but don't mind if the validator already has something better (e.g. a stricter or more precise definition of email).
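
The skip-if-defined behavior can be sketched as follows. The function apply_defs and the %known table are hypothetical names used for illustration; the error text follows the "cannot redefine type" wording above, but the exact message produced by a real validator is an assumption.

```perl
use strict;
use warnings;

# Hypothetical sketch: process a "def" hash against a table of already known
# types. Names ending in "?" are skipped when the type already exists.
sub apply_defs {
    my ($known, $defs) = @_;
    for my $name (sort keys %$defs) {
        my $optional = (my $type = $name) =~ s/\?\z//;
        if (exists $known->{$type}) {
            next if $optional;
            die "cannot redefine type '$type'\n";
        }
        $known->{$type} = $defs->{$name};
    }
    return $known;
}

# An "email" type already exists, so the optional definition is skipped.
my %known = (email => [str => {req => 1, match => '\A[^@\s]+@[^@\s]+\z'}]);
apply_defs(\%known, {
    "email?"   => [str => {req => 1, match => '.+@.+'}],
    "username" => [str => {req => 1, match => '^[a-z0-9_]+$'}],
});
print join(", ", sort keys %known), "\n";
```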

CLAUSE

A clause set is a mapping of clause names and attributes:

{
    'CLAUSENAME1.*' => {value=>VAL, ATTR1=>..., ATTR2=>..., ...},
    'CLAUSENAME2.*' => {...},
}

For convenience, there are also some shortcuts:

CLAUSENAME => VAL

is a shortcut for:

'CLAUSENAME.*' => {value=>VAL}

(This is the Huffman encoding principle at work: since the value attribute is the most common, it gets the shortest syntax.)

Attributes can also be specified separately (they will be combined during normalization):

"CLAUSENAME.ATTRNAME1" => ATTRVAL1,
"CLAUSENAME.ATTRNAME2" => ATTRVAL2,
...

Another shortcut is:

"CLAUSENAME+" => [VAL, ...]
"CLAUSENAME&" => [VAL, ...]

for:

"CLAUSENAME.values" => [VAL, ...]

and:

"CLAUSENAME|" => [VAL, ...]

for:

"CLAUSENAME.values" => [VAL, ...],
"CLAUSENAME.min_ok" => 1,

Also this:

"!CLAUSENAME" => VAL

is a shortcut for this:

CLAUSENAME => VAL,
"CLAUSENAME.revert" => 1,

An attribute can contain a literal value or an expression. To specify an expression, add the = suffix:

"CLAUSENAME=" => EXPR
"CLAUSENAME.ATTRNAME1=" => EXPR

When doing validation, all clauses will be evaluated and must succeed if the validation is to succeed. The order of evaluation usually does not matter, but some clauses are early (like "default" and "prefilters") and some are late (like "postfilters").
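
The shortcuts in this section can be summarized by a small normalizer. This is a sketch under stated assumptions, not the Data::Sah normalizer: the function normalize_clause_set is hypothetical, the result is shaped as { CLAUSENAME => { ATTR => VAL } }, expressions are kept under an attribute name that retains the "=" suffix, and the empty clause name is not handled.

```perl
use strict;
use warnings;

# Hypothetical normalizer: returns { CLAUSENAME => { ATTR => VAL, ... } }.
sub normalize_clause_set {
    my ($cset) = @_;
    my %norm;
    for my $key (sort keys %$cset) {
        my $val = $cset->{$key};
        my ($not, $name, $attr, $sfx) =
            $key =~ /\A(!?)([A-Za-z_]\w*)(?:\.([\w.]+))?([+&|=]?)\z/
            or die "Invalid clause key: $key\n";
        my $c = $norm{$name} //= {};
        if ($sfx eq '+' || $sfx eq '&') {      # "CLAUSENAME+" / "CLAUSENAME&"
            $c->{values} = $val;
        } elsif ($sfx eq '|') {                # "CLAUSENAME|"
            $c->{values} = $val;
            $c->{min_ok} = 1;
        } elsif ($sfx eq '=') {                # expression, suffix retained
            $c->{ ($attr // 'value') . '=' } = $val;
        } else {                               # plain value or attribute
            $c->{ $attr // 'value' } = $val;
        }
        $c->{revert} = 1 if $not;              # "!CLAUSENAME"
    }
    return \%norm;
}

my $n = normalize_clause_set({
    min             => 10,
    'min.err_level' => 'warn',
    '!in'           => [0],
    'div_by|'       => [2, 3],
    'default='      => 'int(10*rand())+1',
});
print join(", ", sort keys %$n), "\n";
```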

Clause name

Clause names must begin with letter/underscore and contain letters/numbers/underscores only. All clauses which begin with an "_" (underscore) will be ignored by Sah (you can use this to embed extra data for other purposes).

Clause attribute

Attribute names must also contain only letters/numbers/underscores, but a name can be a dot-separated series of parts, e.g. alt.lang.id_ID. As with clauses, clause attributes which begin with "_" (underscore) will be ignored by Sah.

Currently known attributes:

  • value

    This is the most commonly used attribute.

  • values

    This attribute can be used to store more than one value for a clause. Each of the values will be evaluated. Example:

    ['int*' => {"div_by.values"=>[2, 3, 5]}]

    The above schema requires an integer which is divisible by 2, 3, and 5.

  • err_level

    Valid values: error, warn. The default is error. Normally, when a clause check fails, an error is generated and the data fails to validate against the whole schema. If err_level is set to warn, however, a failure only generates a warning and does not cause the validation to fail.

    [str=>{min_len=>4},
          {min_len=>8, 'min_len.err_level' => 'warn'},]

    Example:

    # password
    [str =>
      {req                 => 1,
       min_len             => 4},
      {min_len             => 8,
       "min_len.err_level" => "warn",
       "min_len.err_msg"   => "Although a password shorter than 8 letters is ".
                              "valid, it is highly recommended that a password be ".
                              "at least 8 letters long, for security reasons"}],

    In the above example, "err_level" and "err_msg" are attributes of the "min_len" clause. The second clause set adds an optional restriction on the password: when its "min_len" clause is not satisfied, the data does not fail validation; only a warning is issued.

  • err_msg[.LANGCODE]

    This tells the compiler that instead of the default error message from the type handler, a custom error message is supplied. You can add translations by adding more attributes. For example:

    [str=>{match                  => qr/\A[A-Za-z0-9_-]*\z/,
           'match.err_msg'        => 'Must not contain naughty characters',
           'match.err_msg.id_ID'  => 'Tidak boleh mengandung karakter aneh-aneh',
    }]

  • comment

    This is ignored during validation.

  • human[.LANGCODE]

    This is also ignored when validating data, but will be used by the Data::Sah::Compiler::human compiler to supply description. You can add translations by adding more attributes.

    [str=>{match               => qr/\A[A-Za-z0-9_-]*\z/,
           'match.human'       => 'Must not contain naughty characters',
           'match.human.id_ID' => 'Tidak boleh mengandung karakter aneh-aneh',
    }]

  • alt

    This attribute is used to store alternative clause values. For example:

    alt.lang.<LANGCODE>

    stores an alternative value for a different language. This applies to clauses which contain translatable text, like name, summary, and description.
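
To illustrate the err_level and err_msg attributes from the list above, here is a toy checker that understands only the min_len clause. It is a sketch, not the behavior of any Sah compiler: the function check_min_len, the default message text, and the shape of the returned hash are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical mini-validator that understands only min_len and its
# err_level/err_msg attributes, to show errors versus warnings.
sub check_min_len {
    my ($str, @clause_sets) = @_;
    my (@errors, @warnings);
    for my $cs (@clause_sets) {
        next unless defined $cs->{min_len};
        next if length($str) >= $cs->{min_len};
        my $msg   = $cs->{'min_len.err_msg'}
            // "Length must be at least $cs->{min_len}";
        my $level = $cs->{'min_len.err_level'} // 'error';
        if ($level eq 'warn') { push @warnings, $msg }
        else                  { push @errors,   $msg }
    }
    return {
        valid    => (@errors ? 0 : 1),
        errors   => \@errors,
        warnings => \@warnings,
    };
}

# A 6-letter password passes the hard limit (4) but trips the soft one (8).
my $res = check_min_len("secret",
    {min_len => 4},
    {min_len => 8, 'min_len.err_level' => 'warn'},
);
printf "valid=%d warnings=%d\n", $res->{valid}, scalar @{ $res->{warnings} };
```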

A clause attribute can store a literal value or an expression. To store an expression, append the = character to the attribute name. See "EXPRESSION" for more details on expressions.

Example:

default => "blah"
"default=" => 'int(10*rand())+1'

Special-purpose clauses

Normally clauses serve as a type constraint (e.g. for type string, the "min_len" and "max_len" clauses restrict how short/long the string can be). However there are also some clauses that are special.

The '' (empty clause)

This can be used to store attributes that can be used generally by other clauses ("general attributes"). For example:

[str => {not_match => qr/(password|abcd)$/,
         min_len => 4,
         ".err_msg" => "Password not good enough!"}]

When validation fails for one or more of the clauses, the custom error message will be used instead.

Clause set merging

Given several clause sets in the schema like:

[TYPE, CS1 + CS2 + CS3]

CS1, CS2, and CS3 will all be evaluated, in that order:

eval(CS1)
eval(CS2)
eval(CS3)

However, if a clause set hash contains one or more keys with merge prefix (explained later), the clause set will first be merged with the previous clause set(s) prior to evaluation. For example, if CS2 keys contain merge prefixes (notation: "*" indicates the presence of merge prefix):

[TYPE, CS1 + *CS2 + CS3]

then CS1 will be merged with CS2 before being evaluated (notation: "~" signifies merging):

eval(CS1 ~ *CS2)
eval(CS3)

If CS3 instead contains merge prefixes:

[TYPE, CS1 + CS2 + *CS3]

then CS1 will be evaluated first, and CS2 will be merged with CS3 before being evaluated:

eval(CS1)
eval(CS2 ~ *CS3)

If CS2 as well as CS3 contains merge prefixes:

[TYPE, CS1 + *CS2 + *CS3]

then the three will be merged first before evaluating:

eval(CS1 ~ *CS2 ~ *CS3)

In short, whenever the right-hand side contains merge prefixes, it is merged with the clause set on its left before evaluation, proceeding from left to right.

The first clause set should not contain any merge prefixes.
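
The grouping rule can be sketched as a partitioning step that decides which clause sets are merged together before evaluation. The function names has_merge_prefix and group_for_merging are hypothetical, and the sketch assumes a merge prefix is recognized by a top-level key starting with "[merge]" or "[merge:X]".

```perl
use strict;
use warnings;

# Hypothetical helper: true if any top-level key carries a merge prefix,
# for example "[merge]div_by" or "[merge:+]in".
sub has_merge_prefix {
    my ($cs) = @_;
    return scalar grep { /\A\[merge(?::.)?\]/ } keys %$cs;
}

# Partition clause sets into groups. Each group is merged left to right and
# then evaluated as a single clause set.
sub group_for_merging {
    my @groups;
    for my $cs (@_) {
        if (@groups && has_merge_prefix($cs)) {
            push @{ $groups[-1] }, $cs;    # joins the clause set on its left
        } else {
            push @groups, [$cs];           # starts a new evaluation group
        }
    }
    return @groups;
}

# Corresponds to [TYPE, CS1 + CS2 + *CS3]
my @groups = group_for_merging(
    {div_by => 2},
    {min => 1},
    {'[merge:+]in' => [6]},
);
print join(" | ", map { scalar(@$_) } @groups), "\n";
```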

Sah uses Data::ModeMerge to do the merging, with merge prefixes changed to '[merge:+]', '[merge:!]' and so on. In merging, Data::ModeMerge allows keys in the right-side hash not only to replace keys in the left-side hash but also to add to, subtract from, or remove them. This is powerful because it allows a schema definition not only to add clauses (restricting the type even more), but also to replace clauses (changing a type restriction) and to delete clauses (relaxing a type restriction). For more information, refer to the Data::ModeMerge documentation.

Examples:

[int => {div_by=>2} + {  div_by =>3}] # must be divisible by 2 & 3

[int => {div_by=>2} + {'[merge]div_by'=>3}] # will be merged and become:
[int => {div_by=>3}                       ] # must be divisible by 3 ONLY

[int => {div_by=>2} + {'[merge:!]div_by'=>0}] # will be merged and become:
[int => {}                                  ] # need not be divisible by any

[int => {in=>[1,2,3,4,5]} + {  in =>[6]}] # impossible to satisfy

[int => {in=>[1,2,3,4,5]} + {'[merge:+]in'=>[6]}] # will be merged and become:
[int => {in=>[1,2,3,4,5,6]}            ]

[int => {in=>[1,2,3,4,5]} + {'[merge:-]in'=>[4]}] # will be merged and become:
[int => {in=>[1,2,3,  5]}                       ]
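
The effect of the prefixes in these examples can be imitated with a much simplified stand-in for Data::ModeMerge. The sketch below is not that module's algorithm or API: the function merge_clause_sets is hypothetical, it handles only flat clause sets, it supports only the replace, add, subtract, and delete prefixes, and it assumes add and subtract operate on array values.

```perl
use strict;
use warnings;

# Simplified stand-in for Data::ModeMerge on flat clause sets. Supported
# prefixes: "[merge]" (replace), "[merge:+]" (append to an array),
# "[merge:-]" (remove array elements) and "[merge:!]" (delete the key).
# Keys without a prefix are treated as plain replacement.
sub merge_clause_sets {
    my ($left, $right) = @_;
    my %out = %$left;
    for my $key (sort keys %$right) {
        my ($mode, $name) = $key =~ /\A(?:\[merge(?::(.))?\])?(.+)\z/;
        die "Empty clause key\n" unless defined $name;
        $mode //= '';
        my $val = $right->{$key};
        if ($mode eq '') {
            $out{$name} = $val;
        } elsif ($mode eq '+') {
            $out{$name} = [@{ $out{$name} // [] }, @$val];
        } elsif ($mode eq '-') {
            my %drop = map { $_ => 1 } @$val;
            $out{$name} = [grep { !$drop{$_} } @{ $out{$name} // [] }];
        } elsif ($mode eq '!') {
            delete $out{$name};
        } else {
            die "Unsupported merge mode: $mode\n";
        }
    }
    return \%out;
}

my $m = merge_clause_sets({in => [1, 2, 3, 4, 5]}, {'[merge:-]in' => [4]});
print "@{ $m->{in} }\n";
```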

Merging and hash keys

XXX: Does merging of clause sets need to be done recursively?

Because Data::ModeMerge merges clause sets recursively, if your clause set keys contain a merge prefix you need to check all hashes (not just the top-level clause set hash, but also nested hashes deep down) for possible accidental merge prefixes:

[merge]
[merge:*]
[merge:+]
[merge:-]
[merge:.]
[merge:!]
[merge:^]

because they will also be merged and removed after merging.

Example:

[hash => {       "keys_regex" => { foo=>[int=>{max=>7}],
                                   "[merge:^]bar"=>"str" }}
      +  {"[merge]keys_regex" => { foo=>[int=>{max=>3}]
                                                         }}]

will be merged to:

[hash => {       "keys_regex" => { foo => [int=>{max=>3}],
                                   bar => "str"          }}]

that is, foo also gets merged. Sometimes this is what you want, but sometimes it is not. In the latter case, you can turn off prefix parsing:

[hash => {       "keys_regex" => { foo => [int=>{max=>7}],
                                  "[merge:^]bar" => "str",
                ""=>{parse_prefix=>0} }}
       + {"[merge]keys_regex" => { "[merge:-]foo" => [int=>{max=>4}]}}]

it will become:

[hash => { "keys_regex" => { foo => [int=>{max=>7}],
                             "[merge:-]foo" => [int=>{max=>4}],
                             "[merge:^]bar" => "str" } }]

Please also note that the empty string key ("") is also regarded as special by Data::ModeMerge. It is called the options key, and it regulates how merging should be done. Be careful not to use an empty string as one of your own keys.

EXPRESSION

XXX: Syntax of variables not yet fixed.

Sah supports expressions, using the Language::Expr minilanguage. See Language::Expr::Manual::Syntax for details on the syntax. You can specify an expression in the check clause, e.g.:

[int => {check => '$_ >= 4'}]

Alternatively, an expression can be specified in any clause attribute:

[int => {'min='       => '2+2'}]
[int => {'min.value=' => 'floor(4.9)'}]

The above three schemas are equivalent to:

[int => {min => 4}]

Expressions can refer to elements of the data and of the (normalized) schema, and can call functions, enabling more complex schemas to be defined. For example:

['array*' => {len=>2, elems => [
  ['str*', {match => '^\w+$'}],
  ['str*', {'match='   => '${../../0/clause_sets/0/match}',
            'min_len=' => '2*length(${data:../0})'}]
]}]

The above schema requires the data to be a two-element array containing strings, where the length of the second string has to be at least twice the length of the first. Both strings have to comply with the same regex, qr/^\w+$/ (which is declared in the first string's clause set and referred to in the second string's).

FUNCTION

Functions can be used in expressions. The syntax for calling a function is:

func()
func(ARG, ...)

For the list of functions supported, see Data::Sah::FuncSet::Core. To add a new Sah function set from Perl, see Data::Sah::Manual::NewFuncSet.

Functions in Sah can sometimes accept several types of arguments, e.g. length(ARRAY) will return the number of elements in the ARRAY, while length(STR) will return the number of characters in the string. However, when an inappropriate argument is given, a Perl exception will be thrown.
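
A function that accepts several argument types, as described above for length, might be implemented in Perl along the following lines. This is only a sketch: the name sah_length and the error message are hypothetical and are not taken from Data::Sah::FuncSet::Core.

```perl
use strict;
use warnings;

# Hypothetical Perl implementation of a length() function that accepts
# either an array reference or a plain string, and throws otherwise.
sub sah_length {
    my ($arg) = @_;
    return scalar(@$arg) if ref($arg) eq 'ARRAY';
    return length($arg)  if defined($arg) && !ref($arg);
    die "length(): argument must be an array or a string\n";
}

print sah_length([10, 20, 30]), " ", sah_length("hello"), "\n";
```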

SEE ALSO

Data::Sah::Manual::NewType for a guide on creating a new type.

Data::Sah::Manual::NewClause for a guide on extending an existing type by adding a new clause.

Data::Sah::Manual::NewFuncSet for a guide on creating a new function set.

Want more schema examples? Try Data::Sah::Manual::Cookbook.

AUTHOR

Steven Haryanto <stevenharyanto@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Steven Haryanto.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.