NAME

Elastic::Manual::Attributes - Fine-tuning how your attributes are indexed

VERSION

version 0.03

SYNOPSIS

package MyApp::User;

use Elastic::Doc;

has 'email' => (
    isa       => 'Str',
    is        => 'rw',
    analyzer  => 'english',
    multi => {
        untouched => { index    => 'not_analyzed' },
        ngrams    => { analyzer => 'edge_ngrams' }
    }
);

no Elastic::Doc;

INTRODUCTION

Elastic::Model uses Moose's type constraints to figure out how each of the attributes in your classes should be indexed. This is a good start, but it won't be long before you want to fine-tune the indexing process.

For instance, if you have an attribute title with isa => 'Str', then Elastic::Model::TypeMap::Moose will map the attribute as a string, whose value will be analyzed by the standard analyzer.

But perhaps you want the words in your title to be stemmed (eg "fox" will match "fox", "foxes" or "foxy"), or perhaps your title is in Spanish and you want Spanish stemming rules to apply. Or perhaps you want to use as-you-type auto-completion on the title attribute.

This fine-tuning is easy to do, using the attribute keywords that are added to your attributes when you include the line use Elastic::Doc; in your class. These keywords come from Elastic::Model::Trait::Field.

TYPES

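Each attribute/field must have one of the following types: string, integer, long, float, double, short, byte, boolean, date, binary, object, nested, ip, geo_point or attachment.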

type

The type keyword allows you to override the default type that is generated by the typemap. For instance:

has 'epoch_milliseconds' => (
    isa     => 'Int',
    type    => 'date'
);

Although the type constraint Int would normally generate type long, we have overridden the type so that the field is indexed as a date field.

Arrays

A notable omission from the above list is array. That's because a field of any type can store and index multiple values. For example:

has 'author'  => ( isa => 'Str'           );
has 'authors' => ( isa => 'ArrayRef[Str]' );

Both of these fields would be mapped as { type => 'string' }.

Note: the values contained in an array must be of a single type in order to be indexable. You can't mix types in an array.

Also see "NESTED FIELDS" for a more detailed explanation, specifically about how to deal with arrays of objects.

Unknown types

If the typemap does not know how to deal with the type constraint on an attribute, and you haven't specified a "type", then the field is mapped as:

{
    type    => 'object',
    enabled => 0
}

This means that the value in the field will be stored in ElasticSearch, but will not be queryable. Also, no attempt is made to inflate or deflate the value - it is just passed through unchanged. If it is a value that JSON::XS cannot handle natively (eg a blessed ref), then Elastic::Model will be unable to deal with it automatically.

See "CUSTOM MAPPING, INFLATION AND DEFLATION" for details of how to handle this situation.

GENERAL KEYWORDS

There are a number of keywords that apply to (almost) all of the field types listed below:

exclude

has 'cache_key' => (
    is      => 'ro',
    exclude => 1,
    builder => '_generate_cache_key',
    lazy    => 1,
);

If exclude is true then the attribute will not be stored in ElasticSearch. This is only useful for generated or temporary attributes. If you want to store the attribute but make it not searchable, then you should use the "index" keyword instead.

index

has 'tag' => (
    is      => 'rw',
    isa     => 'Str',
    index   => 'not_analyzed'
);

The index keyword controls how ElasticSearch will index your attribute. It accepts 3 values:

  • no: This attribute will not be indexed, and will thus not be searchable.

  • not_analyzed: This attribute will be indexed using exactly the value that you pass in, eg FoO will be stored (and searchable) as FoO.

  • analyzed: This attribute will be analyzed. In other words, the text value will be passed through the specified (or default) "analyzer" before being indexed. The analyzer will tokenize and pre-process the text to produce terms. For example FoO BAR would (depending on the analyzer) be stored as the terms foo and bar. This is the default for string fields (except for enums).
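The difference between not_analyzed and analyzed can be sketched in plain Perl. The toy_analyze function below is a hypothetical stand-in for an analyzer (it only lowercases and splits on non-word characters); real ElasticSearch analyzers do considerably more:

```perl
use strict;
use warnings;

# Hypothetical stand-in for an analyzer: lowercase the text, then
# split it into terms on anything that is not a word character.
sub toy_analyze {
    my ($text) = @_;
    return grep { length } split /\W+/, lc $text;
}

# index => 'not_analyzed': the value is indexed as one exact term.
sub not_analyzed {
    my ($text) = @_;
    return ($text);
}

print join( ', ', not_analyzed('FoO BAR') ), "\n";    # FoO BAR
print join( ', ', toy_analyze('FoO BAR') ),  "\n";    # foo, bar
```

A search for foo would find the analyzed version, while the not_analyzed version can only be found by the exact value FoO BAR.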

include_in_all

has 'secret' => (
    is              => 'ro',
    isa             => 'Str',
    include_in_all  => 0
);

By default, all attributes (except those with index => 'no') are also indexed in the special _all field. This is intended to make it easy to search for documents that contain a value in any field. If you would like to exclude a particular attribute from the _all field, then specify { include_in_all => 0 }.

Note:

  • The _all field has its own "analyzer" - so the tokens that are stored in the _all field may be different from the tokens stored in the attribute itself.

  • When include_in_all is set on a field of type object, its value will propagate down to all attributes within the object.

  • You can disable the _all field completely using "has_mapping" in Elastic::Doc:

    package MyApp::User;
    
    use Elastic::Doc;
    
    has_mapping {
        _all    => { enabled => 0 }
    };

boost

has 'title' => (
    is      => 'rw',
    isa     => 'Str',
    boost   => 2
);

A boost makes a value "more relevant". For instance, the words in the title field of a blog post are probably a better indicator of the topic than the words in the content field. You can boost a field at search time and at index time. The benefit of boosting at search time is that your boost is not fixed. The benefit of boosting at index time is that the boost value is carried over to the _all field.

Also see "omit_norms".

multi

has 'name' => (
    is      => 'ro',
    isa     => 'Str',
    multi   => {
        sorting     => { index    => 'not_analyzed' },
        partial     => { analyzer => 'ngrams'       }
    }
);

It is a common requirement to be able to use a single field in different ways. For instance, with the name field example above, we may want to:

  • Do a full text search for Joe Bloggs

  • Do a partial match on all names beginning with Blo

  • Sort results alphabetically by name.

A single field definition is insufficient in this case: the standard analyzer won't allow partial matching and, because it generates multiple terms/tokens, it can't be used for sorting (you can only sort on a single value).

This is where multi_fields are useful. The same value can be indexed and queried in multiple ways. When you specify a multi mapping, each "sub-field" inherits the type of the main field, but you can override this. The "sub-fields" can be referred to as eg name.partial or name.sorting and the "main field" name can also be referred to as name.name.

Another benefit of multi-fields is that they can be added without reindexing all of your data.
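To see why an n-gram style sub-field allows partial matching, here is a minimal sketch in plain Perl. The edge_ngrams function is a hypothetical illustration, not the analyzer that ElasticSearch uses; the analyzer names in the examples above are custom analyzers that you would have to define yourself:

```perl
use strict;
use warnings;

# Toy edge n-gram generator: every prefix of a term from $min to $max
# characters. This only approximates what an n-gram analyzer produces.
sub edge_ngrams {
    my ( $term, $min, $max ) = @_;
    my $last = length($term) < $max ? length($term) : $max;
    return map { substr( $term, 0, $_ ) } $min .. $last;
}

# Index the lowercased words of "Joe Bloggs" as prefixes.
my %index = map { $_ => 1 } map { edge_ngrams( $_, 1, 10 ) } qw(joe bloggs);

print join( ' ', edge_ngrams( 'bloggs', 1, 10 ) ), "\n";
# b bl blo blog blogg bloggs
print "partial match on 'blo'\n" if $index{blo};
```

Because blo is itself one of the indexed terms, a simple term lookup finds it. The sorting sub-field, by contrast, keeps the whole name as a single term.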

index_name

has 'foo' => (
    is          => 'rw',
    isa         => 'Str',
    index_name  => 'bar'
);

ElasticSearch uses dot-notation to refer to nested hashes. For instance, with this data structure:

{
    foo => {
        bar => {
            baz => 'xxx'
        }
    }
}

... you could refer to the baz value as baz or as foo.bar.baz.
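The full dotted names can be derived mechanically from the data structure. The following is a small sketch in plain Perl (dotted_paths is a hypothetical helper, not part of Elastic::Model):

```perl
use strict;
use warnings;

# Walk a nested hashref and return a flat hash keyed by dotted path.
sub dotted_paths {
    my ( $data, $prefix ) = @_;
    my %flat;
    for my $key ( sort keys %$data ) {
        my $path = defined $prefix ? "$prefix.$key" : $key;
        if ( ref $data->{$key} eq 'HASH' ) {
            %flat = ( %flat, dotted_paths( $data->{$key}, $path ) );
        }
        else {
            $flat{$path} = $data->{$key};
        }
    }
    return %flat;
}

my %flat = dotted_paths( { foo => { bar => { baz => 'xxx' } } } );
print "$_ => $flat{$_}\n" for sort keys %flat;    # foo.bar.baz => xxx
```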

Sometimes, you may want to specify a different name for a field. For instance:

{
    street => {
        name    => 'Oxford Street',
        number  => 1
    },
    town => {
        name    => 'London'
    }
}

You can use the index_name to distinguish town_name from street_name.

store

has 'big_field' => (
    is          => 'ro',
    isa         => 'Str',
    store       => 'yes'
);

Individual fields can be stored (ie have their original value stored on disk). This is not the same as whether the value is indexed or not (see "index"). It just means that this individual value can be retrieved separately from the others. store defaults to 'no' but can be set to 'yes'.

You almost never need this. The _source field (which is stored by default) contains the hashref representing your whole object, and is returned by default when you get or search for a document. This means a single disk seek to load the _source field, rather than a disk seek (think 5ms) for every stored field! It is much more efficient.

null_value

has 'foo' => (
    is          => 'rw',
    isa         => 'Str',
    null_value  => 'none'
);

If the attribute's value is undef then the null_value will be indexed instead. This option is included for completeness, but isn't very useful. Rather, just leave the value as undef and use the exists and missing filters when you need to consider undef values.

STRING FIELDS

String fields fall into two broad categories:

  • Text, like this paragraph, which needs to be searchable via flexible full text queries

  • Terms, for instance tags, postcodes, enums, which should be stored and queried exactly (eg Perl is different from PERL)

The simplest way to distinguish between text and terms is to set the "index" keyword to analyzed (the default) for text or to not_analyzed for terms. See Elastic::Manual::Analysis for a more detailed discussion.

The keywords below apply only to fields of "type" string. You can also use "index", "include_in_all", "boost", "multi", "index_name", "store" and "null_value".

analyzer

has 'email' => (
    is          => 'ro',
    isa         => 'Str',
    analyzer    => 'my_email_analyzer'
);

Specify which analyzer (built-in or custom) to use at index time and at search time. This is the equivalent of setting "index_analyzer" and "search_analyzer" to the same value.

Also see Elastic::Manual::Analysis for an explanation.

index_analyzer

has 'email' => (
    is              => 'ro',
    isa             => 'Str',
    index_analyzer  => 'my_email_analyzer'
);

Sets the "analyzer" to use at index time only.

search_analyzer

has 'email' => (
    is              => 'ro',
    isa             => 'Str',
    search_analyzer => 'my_email_analyzer'
);

Sets the "analyzer" to use at search time only.

search_quote_analyzer

has 'email' => (
    is                    => 'ro',
    isa                   => 'Str',
    search_analyzer       => 'my_email_analyzer',
    search_quote_analyzer => 'my_quoted_email_analyzer'
);

Sets the "analyzer" to use in a Query-String query or Field query when the search phrase includes quotes (""). If not set, then it falls back to the "search_analyzer" or the "analyzer".

omit_norms

has 'status' => (
    is          => 'ro',
    isa         => 'Str',
    analyzer    => 'keyword',
    omit_norms  => 1
);

Norms allow for index time "boost" and for field length normalization (shorter fields score higher). This may not always be what you want. For instance, a status field may contain a single value that is never used for relevance scoring, just for filtering (eg all docs where status is "active"). Or, if the values in a field are short (eg name, email) then the field length normalization may skew the results incorrectly.

You can turn off norms with { omit_norms => 1 }.

See http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e71 for more about omit_norms.

omit_term_freq_and_positions

has 'status' => (
    is                              => 'ro',
    isa                             => 'Str',
    analyzer                        => 'keyword',
    omit_term_freq_and_positions    => 1
);

ElasticSearch normally stores the frequency and position of each term in analyzed text. If you don't need this information, then you can turn it off, and save space.

See http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e63 for more about omit_term_freq_and_positions.

term_vector

has 'text' => (
    is          => 'ro',
    isa         => 'Str',
    store       => 'yes',
    term_vector => 'with_positions_offsets',
);

The full functionality of term vectors is not exposed via ElasticSearch, so the only real value for now is for highlighting snippets. Allowed values are: no (the default), yes, with_offsets, with_positions and with_positions_offsets.

Fast snippet highlighting

There are two highlighters available for highlighting matching snippets in text fields: the highlighter, which can be used on any analyzed text field without any preparation, and the fast-vector-highlighter, which is faster (better for large text fields which require frequent highlighting) but uses more disk space, and requires the field to be set up correctly before use:

has 'big_field' => (
    is          => 'ro',
    isa         => 'Str',
    term_vector => 'with_positions_offsets'
);

NUMERIC FIELDS

The following keyword applies only to fields of "type" integer, long, float, double, short or byte.

You can also use "index", "include_in_all", "boost", "multi", "index_name", "store" and "null_value".

precision_step (numeric)

has 'count' => (
    isa             => 'Int',
    precision_step  => 2,
);

The precision_step determines the number of terms generated for each value (defaults to 4). The more terms, the faster the lookup, but the more memory used.
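As a rough illustration of the trade-off, the sketch below assumes that one term is indexed for every precision_step bits of the value, which is how Lucene's numeric encoding is usually described. The terms_per_value helper is hypothetical and the figures are indicative only:

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Assumption: one term is indexed per precision_step-bit shift of the
# value, so a value of $bits bits yields ceil($bits / $step) terms.
sub terms_per_value {
    my ( $bits, $step ) = @_;
    return ceil( $bits / $step );
}

printf "long, precision_step %d: %d terms\n", $_, terms_per_value( 64, $_ )
    for 2, 4, 8;
```

Under that assumption, halving the precision_step doubles the number of terms stored for each value.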

DATE FIELDS

Dates in ElasticSearch are stored internally as long values containing milliseconds since the epoch.
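If you need to produce such a value yourself, it can be calculated with core Perl modules. This is a minimal sketch (epoch_millis is a hypothetical helper, and it assumes the date is in UTC):

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert a UTC calendar date to milliseconds since the epoch.
sub epoch_millis {
    my ( $year, $month, $day ) = @_;
    return timegm( 0, 0, 0, $day, $month - 1, $year ) * 1000;
}

print epoch_millis( 2012, 1, 1 ), "\n";    # 1325376000000
```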

The following keywords apply only to fields of "type" date. You can also use "index", "include_in_all", "boost", "multi", "index_name", "store" and "null_value".

format

has 'year_week' => (
    isa     => 'Str',
    type    => 'date',
    format  => 'basic_week_date'
);

Date fields by default can parse (1) milliseconds since epoch (2) yyyy/MM/dd HH:mm:ss Z or (3) yyyy/MM/dd Z.

If you would like to specify a different format, you can use one of the built-in formats or a custom format.

precision_step (date)

has 'epoch_milliseconds' => (
    isa             => 'Int',
    type            => 'date',
    precision_step  => 2,
);

The precision_step determines the number of terms generated for each value (defaults to 4). The more terms, the faster the lookup, but the more memory used.

BOOLEAN FIELDS

Boolean fields accept undef, 0, "" and "false" as false values, and any other value as true.

There are no boolean-specific keywords, but the general keywords apply: "index", "include_in_all", "boost", "multi", "index_name", "store" and "null_value".

Note: By default, ElasticSearch treats undef as NEITHER true NOR false, but as null (ie missing). To work around this, we automatically set "null_value" to 0 to make boolean fields more Perlish. If you would like to revert to the default behaviour, set "null_value" to undef.

BINARY FIELDS

Binary data can be stored in ElasticSearch in Base64 encoding. The easiest way to do this is to use "Binary" in Elastic::Model::Types, which will handle the Base64 conversion for you:

use Elastic::Model::Types qw(Binary);

has 'binary_field' => (
    is      => 'ro',
    isa     => Binary
);

The field is never indexed.
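For reference, this is what a Base64 round trip looks like with the core MIME::Base64 module. It is only a sketch of the encoding itself; how the Binary type performs the conversion internally is not shown here:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

my $raw     = join '', map { chr } 0 .. 255;    # arbitrary binary data
my $encoded = encode_base64( $raw, '' );        # '' = no line breaks
my $decoded = decode_base64($encoded);

print length($raw), " bytes -> ", length($encoded), " Base64 characters\n";
print "round trip ok\n" if $decoded eq $raw;
```

Note that the encoded form is roughly a third larger than the original data.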

There are no binary-specific keywords, but you can use: "store" (defaults to yes) and "index_name".

IP FIELDS

Fields of type ip can be used to index IPv4 addresses in numeric form.
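The "numeric form" of an IPv4 address is simply its four octets read as one 32-bit number. A minimal sketch in plain Perl (both helper names are hypothetical):

```perl
use strict;
use warnings;

# Pack the four octets into a 32-bit big-endian unsigned integer.
sub ipv4_to_int {
    my ($ip) = @_;
    return unpack( 'N', pack( 'C4', split /\./, $ip ) );
}

# And back again, for display.
sub int_to_ipv4 {
    my ($n) = @_;
    return join '.', unpack( 'C4', pack( 'N', $n ) );
}

print ipv4_to_int('192.168.0.1'), "\n";    # 3232235521
print int_to_ipv4(3232235521),    "\n";    # 192.168.0.1
```

Storing addresses as numbers is what makes range queries over addresses possible.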

The following keyword applies only to fields of "type" ip. You can also use "index", "include_in_all", "boost", "multi", "index_name", "store" and "null_value".

precision_step (ip)

has 'address' => (
    isa             => 'Str',
    type            => 'ip',
    precision_step  => 2,
);

The precision_step determines the number of terms generated for each value (defaults to 4). The more terms, the faster the lookup, but the more memory used.

GEO_POINT FIELDS

geo_point fields are used to index latitude/longitude points. The easiest way to use them is to use "GeoPoint" in Elastic::Model::Types:

use Elastic::Model::Types qw(GeoPoint);

has 'point' => (
    is      => 'ro',
    isa     => GeoPoint,
    coerce  => 1
);

See http://www.elasticsearch.org/guide/reference/mapping/geo-point-type.html for more information about geo_point fields.

The following keywords apply only to fields of "type" geo_point. You can also use "multi", "index_name" and "store".

lat_lon

has 'point' => (
    is      => 'ro',
    isa     => GeoPoint,
    lat_lon => 1
);

By default, a geo-point is indexed as a lat-lon combination. Set lat_lon to true to index the lat and lon values as numeric fields as well. This is considered good practice, as both the geo distance and bounding box filters can be executed either using in-memory checks, or using the indexed lat and lon values. Note: indexed lat and lon values only make sense when there is a single geo-point value for the field, and not multiple values.

geohash

has 'point' => (
    is      => 'ro',
    isa     => GeoPoint,
    geohash => 1,
);

Set geohash to true to index the geohash value as well.

This value is queryable via (eg) the point.geohash field.

geohash_precision

has 'point' => (
    is                  => 'ro',
    isa                 => GeoPoint,
    geohash             => 1,
    geohash_precision   => 8,
);

The geohash_precision determines how accurate the geohash will be - defaults to 12.
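The effect of the precision is easiest to see by computing a geohash by hand. The sketch below implements the standard public geohash algorithm in plain Perl; it is an illustration only, not the code that ElasticSearch runs, and the coordinates are arbitrary:

```perl
use strict;
use warnings;

my @BASE32 = split //, '0123456789bcdefghjkmnpqrstuvwxyz';

# Standard geohash encoding: alternately bisect the longitude and
# latitude ranges, emitting one base-32 character per five bits.
sub geohash {
    my ( $lat, $lon, $precision ) = @_;
    my @range = ( [ -180, 180 ], [ -90, 90 ] );    # lon, lat
    my @value = ( $lon, $lat );
    my ( $hash, $bits, $char, $which ) = ( '', 0, 0, 0 );

    while ( length($hash) < $precision ) {
        my $r   = $range[$which];
        my $mid = ( $r->[0] + $r->[1] ) / 2;
        if ( $value[$which] >= $mid ) {
            $char = $char * 2 + 1;
            $r->[0] = $mid;
        }
        else {
            $char = $char * 2;
            $r->[1] = $mid;
        }
        $which = 1 - $which;                       # swap lon <-> lat
        if ( ++$bits == 5 ) {
            $hash .= $BASE32[$char];
            ( $bits, $char ) = ( 0, 0 );
        }
    }
    return $hash;
}

print geohash( 57.64911, 10.40744, $_ ), "\n" for 4, 8, 12;
```

Each shorter geohash is a prefix of the longer ones: every extra character narrows the area that the hash describes.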

OBJECT FIELDS

Hashrefs (and objects which have been serialized to hashrefs) are considered to be "objects", as in JSON objects. Your doc class is serialized to a JSON object/hash, which is known as the root_object. The mapping for the root object can be configured with "has_mapping" in Elastic::Doc.

Your doc class may have attributes which are hash-refs, or objects, which may themselves contain hash-refs or objects. Multi-level data structures are allowed.

The following keywords apply only to fields of "type" object and nested. You can also use "include_in_all", "include_attrs", "exclude_attrs".

The mapping for these structures should be automatically generated, but these keywords give you some extra control:

enabled

has 'foo' => (
    is      => 'ro',
    isa     => 'HashRef',
    enabled => 0
);

Setting enabled to false disables the indexing of any value in the object. Defaults to true.

dynamic

has 'foo' => (
    is      => 'ro',
    isa     => 'HashRef',
    dynamic => 1
);

ElasticSearch defaults to trying to detect field types dynamically, but this can lead to mistakes, eg is "123" a string or a long? Elastic::Model turns off this dynamic detection, and instead uses Moose's type constraints to determine what type each field should have.

If you know what you're doing, you can set dynamic to 1 (auto-detect new field types), 0 (ignore new fields) or 'strict' (throw an error if an unknown field is included).

path

package MyApp::Types;

use MooseX::Types -declare => ['FullName'];

use MooseX::Types::Moose qw(Str);
use MooseX::Types::Structured qw(Dict);

subtype FullName,
as Dict[
    first   => Str,
    last    => Str
];


package MyApp::Couple;

use Elastic::Doc;
use MyApp::Types qw(FullName);

has 'husband' => (
    is      => 'ro',
    isa     => FullName,
    path    => 'just_name'
);

has 'wife' => (
    is      => 'ro',
    isa     => FullName,
    path    => 'full'
);

The path keyword accepts the values full and just_name. By default, nested attributes can be referenced by just their name, or by their path, using dot-notation, eg wife.first, or couple.wife.first.

The path setting, which defaults to full (eg wife.first), can be set to just_name, in which case the full name (eg husband.first) won't be defined.

The path keyword can also be combined with the "index_name" keyword.

NESTED FIELDS

nested fields are a sub-class of object fields that are useful when your attribute can contain multiple values.

First an explanation. Consider this data structure:

{
    person  => [
        { first => 'John', last => 'Smith' },
        { first => 'Mary', last => 'Smith' },
        { first => 'Mary', last => 'Jones' }
    ]
}

If the person field is of type object, then the above data structure is flattened into something more like this:

{
    'person.first' => ['John','Mary','Mary'],
    'person.last'  => ['Smith','Smith','Jones']
}

With this structure it is impossible to run queries that depend on matching on attributes of a SINGLE person object. For instance, a query asking for docs that have a person who has (first == John and last == Jones) would incorrectly match this document.
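The problem can be reproduced in a few lines of plain Perl. The two matching functions below are hypothetical simplifications of how a query behaves against flattened and nested data; they are not ElasticSearch queries:

```perl
use strict;
use warnings;

my @people = (
    { first => 'John', last => 'Smith' },
    { first => 'Mary', last => 'Smith' },
    { first => 'Mary', last => 'Jones' },
);

# type => 'object': values are flattened into one list per field, so
# the link between a first name and its last name is lost.
sub matches_flattened {
    my ( $people, $first, $last ) = @_;
    my %seen_first = map { $_->{first} => 1 } @$people;
    my %seen_last  = map { $_->{last}  => 1 } @$people;
    return $seen_first{$first} && $seen_last{$last} ? 1 : 0;
}

# type => 'nested': each object is matched on its own.
sub matches_nested {
    my ( $people, $first, $last ) = @_;
    my @hits = grep { $_->{first} eq $first && $_->{last} eq $last } @$people;
    return @hits ? 1 : 0;
}

print "flattened: ", matches_flattened( \@people, 'John', 'Jones' ), "\n";    # 1
print "nested:    ", matches_nested( \@people, 'John', 'Jones' ),    "\n";    # 0
```

There is no John Jones in the data, yet the flattened version reports a match because John and Jones each appear somewhere in the document.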

Nested objects are the solution to this problem. When an attribute is marked as "type" nested, then ElasticSearch creates each object as a separate-but-related hidden document. These nested objects can be queried with the nested query and the nested filter.

The following keywords apply only to fields of "type" nested. You can also use "include_in_all", "path", "dynamic", "include_attrs" and "exclude_attrs".

include_in_parent

has 'person' => (
    is                  => 'ro',
    isa                 => ArrayRef[Person],
    type                => 'nested',
    include_in_parent   => 1,
);

If you would also like the data from the nested objects to be indexed in their containing object (flattened, as in the second data structure above), then set include_in_parent to true.

include_in_root

has 'person' => (
    is                  => 'ro',
    isa                 => ArrayRef[Person],
    type                => 'nested',
    include_in_root     => 1,
);

Objects can be nested inside objects which are nested inside objects etc. The include_in_root keyword does the same as the "include_in_parent" keyword, but refers to the top-most document, rather than the direct parent.

ELASTIC::DOC FIELDS

You can have attributes in one class that refer to another Elastic::Doc class. For instance, a MyApp::Post object could have, as an attribute, the MyApp::User object to whom the post belongs.

You may want to store just the Elastic::Model::UID of the user object, or you may want to include the user's name and email address, so that you can search for posts by a user named "Joe Bloggs". You can't do joins in a NoSQL database, so you need to denormalize your data.

By default, all attributes in an object are included. You can change the list with "include_attrs" and "exclude_attrs". You can use these with Moose classes too, but any attributes that are excluded won't be stored, and it won't be possible to retrieve them.

include_attrs

has 'user' => (
    is            => 'ro',
    isa           => 'MyApp::User',
    include_attrs => ['name','email']
);

The above declaration will index the user object's UID, plus the name and email attributes. If include_attrs is not specified, then all the attributes from the user object will be indexed. If include_attrs is set to an empty array ref [] then no attributes other than the UID will be indexed.

exclude_attrs

has 'user' => (
    is            => 'ro',
    isa           => 'MyApp::User',
    exclude_attrs => ['secret']
);

The above declaration will index all the user attributes, except for the attribute secret.
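The selection rules for include_attrs and exclude_attrs can be summarised with a toy model in plain Perl. Here select_attrs is a hypothetical helper working on a plain hashref, and the uid key is a placeholder for the object's UID; neither reflects how Elastic::Model is implemented:

```perl
use strict;
use warnings;

# Toy model of the selection rules. $user is a plain hashref and the
# 'uid' key is a hypothetical placeholder for the object's UID.
sub select_attrs {
    my ( $user, %opt ) = @_;
    my @names = grep { $_ ne 'uid' } sort keys %$user;

    # An empty array ref is still true, so [] keeps only the UID.
    if ( my $include = $opt{include_attrs} ) {
        my %want = map { $_ => 1 } @$include;
        @names = grep { $want{$_} } @names;
    }
    if ( my $exclude = $opt{exclude_attrs} ) {
        my %skip = map { $_ => 1 } @$exclude;
        @names = grep { !$skip{$_} } @names;
    }
    return +{ uid => $user->{uid}, map { $_ => $user->{$_} } @names };
}

my $user = { uid => 'u1', name => 'Joe', email => 'joe@example.com', secret => 'x' };

my $doc = select_attrs( $user, include_attrs => [ 'name', 'email' ] );
print join( ', ', sort keys %$doc ), "\n";    # email, name, uid
```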

ATTACHMENT KEYWORDS

TODO

CUSTOM MAPPING, INFLATION AND DEFLATION

The preferred way to specify the mapping and how to deflate and inflate an attribute is by specifying an isa type constraint and adding a typemap entry.

However, you can provide custom values with the following:

mapping

has 'foo' => (
    is      => 'ro',
    type    => 'string',
    mapping => { index => 'no' },
);

You can specify a custom mapping directly in the attribute, which will be used instead of the typemap entry that would be generated. Any other keywords that you specify (eg "type") will be added to your mapping.

deflator

has 'foo' => (
    is       => 'ro',
    type     => 'string',
    deflator => sub { my $val = shift; return my_deflator($val) },
);

You can specify a custom deflator directly in the attribute. It should return undef, a string, or an unblessed data structure that can be converted to JSON.

inflator

has 'foo' => (
    is       => 'ro',
    type     => 'string',
    inflator => sub { my $val = shift; return my_inflator($val) },
);

You can specify a custom inflator directly in the attribute. It should be able to reinflate the original value from the plain data structure that is stored in ElasticSearch.
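As an illustration of what such a pair of subs might look like, here is a self-contained sketch. The My::Location class is hypothetical, as are the bodies of my_deflator and my_inflator; the only requirement taken from the text above is that the deflated value is unblessed and that the inflator can rebuild the original from it:

```perl
use strict;
use warnings;

package My::Location;

sub new {
    my ( $class, %args ) = @_;
    return bless { lat => $args{lat}, lon => $args{lon} }, $class;
}
sub lat { $_[0]{lat} }
sub lon { $_[0]{lon} }

package main;

# Hypothetical deflator: blessed object -> unblessed hashref.
sub my_deflator {
    my ($obj) = @_;
    return undef unless defined $obj;
    return { lat => $obj->lat, lon => $obj->lon };
}

# Hypothetical inflator: stored hashref -> original object.
sub my_inflator {
    my ($data) = @_;
    return undef unless defined $data;
    return My::Location->new(%$data);
}

my $stored = my_deflator( My::Location->new( lat => 51.5, lon => -0.1 ) );
my $again  = my_inflator($stored);
print ref($stored), ' -> ', ref($again), "\n";    # HASH -> My::Location
```

A deflator and inflator should always be written as a matching pair, so that a value survives the round trip through ElasticSearch unchanged.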

AUTHOR

Clinton Gormley <drtech@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Clinton Gormley.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.