NAME
Elastic::Manual::Attributes - Fine-tuning how your attributes are indexed
VERSION
version 0.03
SYNOPSIS
package MyApp::User;
use Elastic::Doc;
has 'email' => (
isa => 'Str',
is => 'rw',
analyzer => 'english',
multi => {
untouched => { index => 'not_analyzed' },
ngrams => { analyzer => 'edge_ngrams' }
}
);
no Elastic::Doc;
INTRODUCTION
Elastic::Model uses Moose's type constraints to figure out how each of the attributes in your classes should be indexed. This is a good start, but it won't be long before you want to fine-tune the indexing process.
For instance, if you have an attribute title which isa => 'Str', then Elastic::Model::TypeMap::Moose will map the attribute as a string, whose value will be analyzed by the standard analyzer.
But perhaps you want the words in your title to be stemmed (eg "fox" will match "fox", "foxes" or "foxy"), or perhaps your title is in Spanish and you want Spanish stemming rules to apply. Or perhaps you want to use as-you-type auto-completion on the title attribute.
This fine-tuning is easy to do, using the attribute keywords that are added to your attributes when you include the line use Elastic::Doc; in your class. These keywords come from Elastic::Model::Trait::Field.
TYPES
Each attribute/field must have one of the following types:
string
integer, long, float, double, short, byte
date
boolean
binary
ip (IPv4 addresses)
geo_point
object
nested (a specialized form of object)
multi_field (index the same field in different ways)
type
The type keyword allows you to override the default type that is generated by the typemap. For instance:
has 'epoch_milliseconds' => (
isa => 'Int',
type => 'date'
);
Although the type constraint Int would normally generate type long, we have overridden the type so that the attribute is indexed as a date field.
Arrays
A notable omission from the above list is array. That's because a field of any type can store and index multiple values. For example:
has 'author' => ( isa => 'Str' );
has 'authors' => ( isa => 'ArrayRef[Str]' );
Both of these fields would be mapped as { type => 'string' }.
Note: the values contained in an array must be of a single type in order to be indexable. You can't mix types in an array.
Also see "NESTED FIELDS" for a more detailed explanation, specifically about how to deal with arrays of objects.
Unknown types
If the typemap does not know how to deal with the type constraint on an attribute, and you haven't specified a "type", then the field is mapped as:
{
type => 'object',
enabled => 0
}
This means that the value in the field will be stored in ElasticSearch, but will not be queryable. Also, no attempt is made to inflate or deflate the value - it is just passed through unchanged. If it is a value that JSON::XS cannot handle natively (eg a blessed ref), then Elastic::Model will be unable to deal with it automatically.
See "CUSTOM MAPPING, INFLATION AND DEFLATION" for details of how to handle this situation.
GENERAL KEYWORDS
There are a number of keywords that apply to (almost) all of the field types listed below:
exclude
has 'cache_key' => (
is => 'ro',
exclude => 1,
builder => '_generate_cache_key',
lazy => 1,
);
If exclude is true then the attribute will not be stored in ElasticSearch. This is only useful for generated or temporary attributes. If you want to store the attribute but make it not searchable, then you should use the "index" keyword instead.
index
has 'tag' => (
is => 'rw',
isa => 'Str',
index => 'not_analyzed'
);
The index keyword controls how ElasticSearch will index your attribute. It accepts 3 values:
no: This attribute will not be indexed, and will thus not be searchable.
not_analyzed: This attribute will be indexed using exactly the value that you pass in, eg FoO will be stored (and searchable) as FoO.
analyzed: This attribute will be analyzed. In other words, the text value will be passed through the specified (or default) "analyzer" before being indexed. The analyzer will tokenize and pre-process the text to produce terms. For example FoO BAR would (depending on the analyzer) be stored as the terms foo and bar. This is the default for string fields (except for enums).
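The difference between analyzed and not_analyzed can be illustrated with a toy model in plain Perl. This is only a sketch: toy_analyze and toy_not_analyzed are hypothetical helpers, and the splitting and lowercasing here is a crude stand-in for what a real ElasticSearch analyzer does.

```perl
use strict;
use warnings;

# Toy stand-in for an analyzer (an assumption for illustration only):
# split on non-word characters and lowercase each token.
sub toy_analyze {
    my ($text) = @_;
    return map { lc $_ } grep { length $_ } split /\W+/, $text;
}

# index => 'not_analyzed': the whole value is kept as one unchanged term.
sub toy_not_analyzed {
    my ($text) = @_;
    return ($text);
}

print join( ', ', toy_analyze('FoO BAR') ),      "\n";
print join( ', ', toy_not_analyzed('FoO BAR') ), "\n";
```

A search for the term foo would find the analyzed value, but only the exact string FoO BAR would find the not_analyzed one.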
include_in_all
has 'secret' => (
is => 'ro',
isa => 'Str',
include_in_all => 0
);
By default, all attributes (except those with index => 'no') are also indexed in the special _all field. This is intended to make it easy to search for documents that contain a value in any field. If you would like to exclude a particular attribute from the _all field, then specify { include_in_all => 0 }.
Note:
The _all field has its own "analyzer" - so the tokens that are stored in the _all field may be different from the tokens stored in the attribute itself.
When include_in_all is set on a field of type object, its value will propagate down to all attributes within the object.
You can disable the _all field completely using "has_mapping" in Elastic::Doc:
package MyApp::User;
use Elastic::Doc;
has_mapping { _all => { enabled => 0 } };
boost
has 'title' => (
is => 'rw',
isa => 'Str',
boost => 2
);
A boost makes a value "more relevant". For instance, the words in the title field of a blog post are probably a better indicator of the topic than the words in the content field. You can boost a field at search time and at index time. The benefit of boosting at search time is that your boost is not fixed. The benefit of boosting at index time is that the boost value is carried over to the _all field.
Also see "omit_norms".
multi
has 'name' => (
is => 'ro',
isa => 'Str',
multi => {
sorting => { index => 'not_analyzed' },
partial => { analyzer => 'ngrams' }
}
);
It is a common requirement to be able to use a single field in different ways. For instance, with the name field example above, we may want to:
Do a full text search for Joe Bloggs
Do a partial match on all names beginning with Blo
Sort results alphabetically by name.
A single field definition is insufficient in this case: The standard analyzer won't allow partial matching, and because it generates multiple terms/tokens, it can't be used for sorting. (You can only sort on a single value).
This is where multi_fields are useful. The same value can be indexed and queried in multiple ways. When you specify a multi mapping, each "sub-field" inherits the type of the main field, but you can override this. The "sub-fields" can be referred to as eg name.partial or name.sorting and the "main field" name can also be referred to as name.name.
Another benefit of multi-fields is that they can be added without reindexing all of your data.
index_name
has 'foo' => (
is => 'rw',
isa => 'Str',
index_name => 'bar'
);
ElasticSearch uses dot-notation to refer to nested hashes. For instance, with this data structure:
{
foo => {
bar => {
baz => 'xxx'
}
}
}
... you could refer to the baz value as baz or as foo.bar.baz.
Sometimes, you may want to specify a different name for a field. For instance:
{
street => {
name => 'Oxford Street',
number => 1
},
town => {
name => 'London'
}
}
You can use the index_name keyword to distinguish town_name from street_name.
store
has 'big_field' => (
is => 'ro',
isa => 'Str',
store => 'yes'
);
Individual fields can be stored (ie have their original value stored on disk). This is not the same as whether the value is indexed or not (see "index"). It just means that this individual value can be retrieved separately from the others. store defaults to 'no' but can be set to 'yes'.
You almost never need this. The _source field (which is stored by default) contains the hashref representing your whole object, and is returned by default when you get or search for a document. This means a single disk seek to load the _source field, rather than a disk seek (think 5ms) for every stored field! It is much more efficient.
null_value
has 'foo' => (
is => 'rw',
isa => 'Str',
null_value => 'none'
);
If the attribute's value is undef then the null_value will be indexed instead. This option is included for completeness, but isn't very useful. It is better to leave the value as undef and use the exists and missing filters when you need to consider undef values.
STRING FIELDS
String fields fall into two broad categories:
Text, like this paragraph, which needs to be searchable via flexible full text queries
Terms, for instance tags, postcodes, enums, which should be stored and queried exactly (eg Perl is different from PERL)
The simplest way to distinguish between text and terms is to set the "index" keyword to analyzed (the default) for text or to not_analyzed for terms. See Elastic::Manual::Analysis for a more detailed discussion.
The keywords below apply only to fields of "type" string. You can also use "index", "include_in_all", "boost", "multi", "index_name", "store" and "null_value".
analyzer
has 'email' => (
is => 'ro',
isa => 'Str',
analyzer => 'my_email_analyzer'
);
Specify which analyzer (built-in or custom) to use at index time and at search time. This is the equivalent of setting "index_analyzer" and "search_analyzer" to the same value.
Also see Elastic::Manual::Analysis for an explanation.
index_analyzer
has 'email' => (
is => 'ro',
isa => 'Str',
index_analyzer => 'my_email_analyzer'
);
Sets the "analyzer" to use at index time only.
search_analyzer
has 'email' => (
is => 'ro',
isa => 'Str',
search_analyzer => 'my_email_analyzer'
);
Sets the "analyzer" to use at search time only.
search_quote_analyzer
has 'email' => (
is => 'ro',
isa => 'Str',
search_analyzer => 'my_email_analyzer',
search_quote_analyzer => 'my_quoted_email_analyzer'
);
Sets the "analyzer" to use in a Query-String query or Field query when the search phrase includes quotes (""). If not set, then it falls back to the "search_analyzer" or the "analyzer".
omit_norms
has 'status' => (
is => 'ro',
isa => 'Str',
analyzer => 'keyword',
omit_norms => 1
);
Norms allow for index time "boost" and for field length normalization (shorter fields score higher). This may not always be what you want. For instance, a status field may contain a single value that is never used for relevance scoring, just for filtering (eg all docs where status is "active"). Or, if the values in a field are short (eg name, email) then the field length normalization may skew the results incorrectly.
You can turn off norms with { omit_norms => 1 }.
See http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e71 for more about omit_norms.
omit_term_freq_and_positions
has 'status' => (
is => 'ro',
isa => 'Str',
analyzer => 'keyword',
omit_term_freq_and_positions => 1
);
ElasticSearch normally stores the frequency and position of each term in analyzed text. If you don't need this information, then you can turn it off, and save space.
See http://www.lucidimagination.com/content/scaling-lucene-and-solr#d0e63 for more about omit_term_freq_and_positions.
term_vector
has 'text' => (
is => 'ro',
isa => 'Str',
store => 'yes',
term_vector => 'with_positions_offsets',
);
The full functionality of term vectors is not exposed via ElasticSearch, so their only real use for now is highlighting snippets. Allowed values are: no (the default), yes, with_offsets, with_positions and with_positions_offsets.
Fast snippet highlighting
There are two highlighters available for highlighting matching snippets in text fields: the highlighter, which can be used on any analyzed text field without any preparation, and the fast-vector-highlighter, which is faster (better for large text fields which require frequent highlighting) but uses more disk space, and requires the field to be set up correctly before use:
has 'big_field' => (
is => 'ro',
isa => 'Str',
term_vector => 'with_positions_offsets'
);
NUMERIC FIELDS
The following keyword applies only to fields of "type" integer, long, float, double, short or byte.
You can also use "index", "include_in_all", "boost", "multi", "index_name", "store" and "null_value".
precision_step (numeric)
has 'count' => (
isa => 'Int',
precision_step => 2,
);
The precision_step determines the number of terms generated for each value (defaults to 4). The more terms, the faster the lookup, but the more memory used.
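As a rough guide to how precision_step affects the number of terms, the sketch below assumes the rule used by the underlying Lucene numeric fields: one term per precision_step bits of the value, ie ceil(bits / precision_step). The function name terms_per_value is hypothetical, and the rule itself is an assumption about Lucene rather than something Elastic::Model exposes.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Assumption: a numeric value of $bits bits is indexed as one term per
# $step-bit shift, giving ceil($bits / $step) terms per value.
sub terms_per_value {
    my ( $bits, $step ) = @_;
    return ceil( $bits / $step );
}

printf "long (64 bit), precision_step 4: %d terms\n", terms_per_value( 64, 4 );
printf "long (64 bit), precision_step 2: %d terms\n", terms_per_value( 64, 2 );
printf "integer (32 bit), precision_step 4: %d terms\n", terms_per_value( 32, 4 );
```

Under this assumption, halving the precision_step doubles the number of terms stored for each value.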
DATE FIELDS
Dates in ElasticSearch are stored internally as long values containing milliseconds since the epoch.
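For an attribute that holds milliseconds since the epoch directly (like the epoch_milliseconds example under "type"), the value can be computed in plain Perl. This is a minimal sketch using the core Time::Local module; epoch_millis is a hypothetical helper name, and it assumes the date is given in UTC.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert a UTC calendar date and time to milliseconds since the epoch.
sub epoch_millis {
    my ( $year, $month, $day, $hour, $min, $sec ) = @_;

    # timegm expects a zero-based month and returns whole seconds.
    my $seconds = timegm( $sec, $min, $hour, $day, $month - 1, $year );
    return $seconds * 1000;
}

print epoch_millis( 2012, 1, 1, 0, 0, 0 ), "\n";
```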
The following keywords apply only to fields of "type" date. You can also use "index", "include_in_all", "boost", "multi", "index_name", "store" and "null_value".
format
has 'year_week' => (
isa => 'Str',
type => 'date',
format => 'basic_week_date'
);
Date fields by default can parse (1) milliseconds since epoch, (2) yyyy/MM/dd HH:mm:ss Z or (3) yyyy/MM/dd Z.
If you would like to specify a different format, you can use one of the built-in formats or a custom format.
precision_step (date)
has 'epoch_milliseconds' => (
isa => 'Int',
type => 'date',
precision_step => 2,
);
The precision_step determines the number of terms generated for each value (defaults to 4). The more terms, the faster the lookup, but the more memory used.
BOOLEAN FIELDS
Boolean fields accept undef, 0, "" and "false" as false values, and any other value as true.
There are no boolean-specific keywords, but the general keywords apply: "index", "include_in_all", "boost", "multi", "index_name", "store" and "null_value".
Note: By default, ElasticSearch treats undef as NEITHER true NOR false, but as null (ie missing). To work around this, we automatically set "null_value" to 0 to make boolean fields more Perlish. If you would like to revert to the default behaviour, set "null_value" to undef.
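The rule for boolean values described above can be written as a small predicate. This is only an illustrative sketch of the documented rule, with the default "null_value" of 0 in effect so that undef counts as false; is_true_boolean is a hypothetical name, not part of Elastic::Model.

```perl
use strict;
use warnings;

# Sketch of the documented rule: undef, 0, "" and "false" are false,
# anything else is true. Assumes the default null_value of 0 for undef.
sub is_true_boolean {
    my ($value) = @_;
    return 0 if !defined $value;
    return 0 if $value eq '' || $value eq '0' || $value eq 'false';
    return 1;
}

for my $value ( undef, 0, '', 'false', 1, 'true', 'no' ) {
    my $label = defined $value ? "'$value'" : 'undef';
    printf "%-8s => %s\n", $label, is_true_boolean($value) ? 'true' : 'false';
}
```

Note that a string such as 'no' counts as true under this rule, which differs from what some applications expect.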
BINARY FIELDS
Binary data can be stored in ElasticSearch in Base64 encoding. The easiest way to do this is to use "Binary" in Elastic::Model::Types, which will handle the Base64 conversion for you:
use Elastic::Model::Types qw(Binary);
has 'binary_field' => (
is => 'ro',
isa => Binary
);
The field is never indexed.
There are no binary-specific keywords, but you can use: "store" (defaults to yes) and "index_name".
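If you are not using the Binary type, the Base64 conversion can be done by hand. The sketch below uses the core MIME::Base64 module to show the round trip; it illustrates the encoding only, and is not a description of how Elastic::Model::Types implements Binary.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

my $raw = "\x00\x01\xFFbinary data";

# The second argument is the line ending; '' keeps the output on one line.
my $encoded = encode_base64( $raw, '' );
my $decoded = decode_base64($encoded);

print "$encoded\n";
print $decoded eq $raw ? "round trip ok\n" : "round trip failed\n";
```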
IP FIELDS
Fields of type ip can be used to index IPv4 addresses in numeric form.
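To make "numeric form" concrete: an IPv4 address is four bytes, which can be read as a single unsigned 32-bit integer. The sketch below shows that conversion in plain Perl. ElasticSearch performs its own conversion internally, so ip_to_number and number_to_ip are hypothetical helpers for illustration only.

```perl
use strict;
use warnings;

# Pack the four octets as bytes, then read them back as one
# big-endian unsigned 32-bit integer.
sub ip_to_number {
    my ($ip) = @_;
    return unpack 'N', pack 'C4', split /\./, $ip;
}

# The reverse: split the 32-bit integer back into four octets.
sub number_to_ip {
    my ($number) = @_;
    return join '.', unpack 'C4', pack 'N', $number;
}

my $number = ip_to_number('192.168.0.1');
print "$number\n";
print number_to_ip($number), "\n";
```

Because the numeric form preserves ordering, range queries over addresses become simple numeric range queries.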
The following keyword applies only to fields of "type" ip. You can also use "index", "include_in_all", "boost", "multi", "index_name", "store" and "null_value".
precision_step (ip)
has 'address' => (
isa => 'Str',
type => 'ip',
precision_step => 2,
);
The precision_step determines the number of terms generated for each value (defaults to 4). The more terms, the faster the lookup, but the more memory used.
GEO_POINT FIELDS
geo_point fields are used to index latitude/longitude points. The easiest way to use them is to use "GeoPoint" in Elastic::Model::Types:
use Elastic::Model::Types qw(GeoPoint);
has 'point' => (
is => 'ro',
isa => GeoPoint,
coerce => 1
);
See http://www.elasticsearch.org/guide/reference/mapping/geo-point-type.html for more information about geo_point fields.
The following keywords apply only to fields of "type" geo_point. You can also use "multi", "index_name" and "store".
lat_lon
has 'point' => (
is => 'ro',
isa => GeoPoint,
lat_lon => 1
);
By default, a geo-point is indexed as a lat-lon combination. Set lat_lon to true to index the lat and lon values as numeric fields as well. This is considered good practice, as both the geo distance and bounding box filters can be executed either using in-memory checks or using the indexed lat and lon values. Note: indexing lat and lon only makes sense when there is a single geo point value for the field, and not multiple values.
geohash
has 'point' => (
is => 'ro',
isa => GeoPoint,
geohash => 1,
);
Set geohash to true to index the geohash value as well.
This value is queryable via (eg) the point.geohash field.
geohash_precision
has 'point' => (
is => 'ro',
isa => GeoPoint,
geohash => 1,
geohash_precision => 8,
);
The geohash_precision determines how accurate the geohash will be - defaults to 12.
OBJECT FIELDS
Hashrefs (and objects which have been serialised to hashrefs) are considered to be "objects", as in JSON objects. Your doc class is serialized to a JSON object/hash, which is known as the root_object. The mapping for the root object can be configured with "has_mapping" in Elastic::Doc.
Your doc class may have attributes which are hash-refs, or objects, which may themselves contain hash-refs or objects. Multi-level data structures are allowed.
The following keywords apply only to fields of "type" object and nested. You can also use "include_in_all", "include_attrs", "exclude_attrs".
The mapping for these structures should be automatically generated, but these keywords give you some extra control:
enabled
has 'foo' => (
is => 'ro',
isa => 'HashRef',
enabled => 0
);
Setting enabled to false disables the indexing of any value in the object. Defaults to true.
dynamic
has 'foo' => (
is => 'ro',
isa => 'HashRef',
dynamic => 1
);
ElasticSearch defaults to trying to detect field types dynamically, but this can lead to mistakes, eg is "123" a string or a long? Elastic::Model turns off this dynamic detection, and instead uses Moose's type constraints to determine what type each field should have.
If you know what you're doing, you can set dynamic to 1 (auto-detect new field types), 0 (ignore new fields) or 'strict' (throw an error if an unknown field is included).
path
package MyApp::Types;
use MooseX::Types -declare 'FullName';
use MooseX::Types::Moose qw(Str);
use MooseX::Types::Structured qw(Dict);
subtype FullName,
as Dict[
first => Str,
last => Str
];
package MyApp::Couple;
use Moose;
use MyApp::Types qw(FullName);
has 'husband' => (
is => 'ro',
isa => FullName,
path => 'just_name'
);
has 'wife' => (
is => 'ro',
isa => FullName,
path => 'full'
);
The path keyword accepts the values full and just_name. By default, nested attributes can be referenced by just their name, or by their path, using dot-notation, eg wife.first, or couple.wife.first.
The path setting, which defaults to full (eg wife.first), can be set to just_name, in which case a full name such as husband.first won't be defined.
The path keyword can also be combined with the "index_name" keyword.
NESTED FIELDS
nested fields are a specialized form of object fields, which are useful when your attribute can contain multiple values.
First an explanation. Consider this data structure:
{
person => [
{ first => 'John', last => 'Smith' },
{ first => 'Mary', last => 'Smith' },
{ first => 'Mary', last => 'Jones' }
]
}
If the person field is of type object, then the above data structure is flattened into something more like this:
{
'person.first' => ['John','Mary','Mary'],
'person.last' => ['Smith','Smith','Jones']
}
With this structure it is impossible to run queries that depend on matching on attributes of a SINGLE person object. For instance, a query asking for docs that have a person who has (first == John and last == Jones) would incorrectly match this document.
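The false match can be reproduced in plain Perl. The sketch below is a toy model, not ElasticSearch itself: flatten, flat_match and nested_match are hypothetical helpers that mimic how an object field loses the link between values, and how matching per object keeps it.

```perl
use strict;
use warnings;

my @people = (
    { first => 'John', last => 'Smith' },
    { first => 'Mary', last => 'Smith' },
    { first => 'Mary', last => 'Jones' },
);

# Flatten as an "object" field would: one list of values per key.
sub flatten {
    my ( $prefix, @objects ) = @_;
    my %flat;
    for my $obj (@objects) {
        push @{ $flat{"$prefix.$_"} }, $obj->{$_} for sort keys %$obj;
    }
    return \%flat;
}

# Each condition is checked against its own list, independently.
sub flat_match {
    my ( $flat, %want ) = @_;
    for my $field ( keys %want ) {
        return 0
            unless grep { $_ eq $want{$field} } @{ $flat->{$field} || [] };
    }
    return 1;
}

# All conditions must hold on the SAME object, as with nested fields.
sub nested_match {
    my ( $objects, %want ) = @_;
    for my $obj (@$objects) {
        return 1
            unless grep { ( $obj->{$_} // '' ) ne $want{$_} } keys %want;
    }
    return 0;
}

my $flat = flatten( 'person', @people );
print 'flattened: ',
    flat_match( $flat, 'person.first' => 'John', 'person.last' => 'Jones' ), "\n";
print 'per object: ',
    nested_match( \@people, first => 'John', last => 'Jones' ), "\n";
```

The flattened check reports a match for John Jones even though no such person exists, while the per-object check does not.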
Nested objects are the solution to this problem. When an attribute is marked as "type" nested, then ElasticSearch creates each object as a separate-but-related hidden document. These nested objects can be queried with the nested query and the nested filter.
The following keywords apply only to fields of "type" nested. You can also use "include_in_all", "path", "dynamic", "include_attrs" and "exclude_attrs".
include_in_parent
has 'person' => (
is => 'ro',
isa => ArrayRef[Person],
type => 'nested',
include_in_parent => 1,
);
If you would also like the data from the nested objects to be indexed in their containing object (as in the first data structure above), then set include_in_parent to true.
include_in_root
has 'person' => (
is => 'ro',
isa => ArrayRef[Person],
type => 'nested',
include_in_root => 1,
);
Objects can be nested inside objects which are nested inside objects etc. The include_in_root keyword does the same as the "include_in_parent" keyword, but refers to the top-most document, rather than the direct parent.
ELASTIC::DOC FIELDS
You can have attributes in one class that refer to another Elastic::Doc class. For instance, a MyApp::Post object could have, as an attribute, the MyApp::User object to whom the post belongs.
You may want to store just the Elastic::Model::UID of the user object, or you may want to include the user's name and email address, so that you can search for posts by a user named "Joe Bloggs". You can't do joins in a NoSQL database, so you need to denormalize your data.
By default, all attributes in an object are included. You can change the list with "include_attrs" and "exclude_attrs". You can use these with Moose classes too, but any attributes that are excluded won't be stored, and it won't be possible to retrieve them.
include_attrs
has 'user' => (
is => 'ro',
isa => 'MyApp::User',
include_attrs => ['name','email']
);
The above declaration will index the user object's UID, plus the name and email attributes. If include_attrs is not specified, then all the attributes from the user object will be indexed. If include_attrs is set to an empty array ref [] then no attributes other than the UID will be indexed.
exclude_attrs
has 'user' => (
is => 'ro',
isa => 'MyApp::User',
exclude_attrs => ['secret']
);
The above declaration will index all the user attributes, except for the attribute secret.
ATTACHMENT KEYWORDS
TODO
CUSTOM MAPPING, INFLATION AND DEFLATION
The preferred way to specify the mapping and how to deflate and inflate an attribute is by specifying an isa type constraint and adding a typemap entry.
However, you can provide custom values with the following:
mapping
has 'foo' => (
is => 'ro',
type => 'string',
mapping => { index => 'no' },
);
You can specify a custom mapping directly in the attribute, which will be used instead of the typemap entry that would be generated. Any other keywords that you specify (eg "type") will be added to your mapping.
deflator
has 'foo' => (
is => 'ro',
type => 'string',
deflator => sub { my $val = shift; return my_deflator($val) },
);
You can specify a custom deflator directly in the attribute. It should return undef, a string, or an unblessed data structure that can be converted to JSON.
inflator
has 'foo' => (
is => 'ro',
type => 'string',
inflator => sub { my $val = shift; return my_inflator($val) },
);
You can specify a custom inflator directly in the attribute. It should be able to reinflate the original value from the plain data structure that is stored in ElasticSearch.
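A deflator and inflator are expected to be inverses of each other. The sketch below shows one possible pair for a blessed object, in plain Perl so that it runs without Elastic::Model. My::Point is a hypothetical class invented for this example, and the two code refs are what you might pass as the deflator and inflator values.

```perl
use strict;
use warnings;

# Hypothetical blessed class, used only for illustration.
package My::Point;
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
sub x { return $_[0]{x} }
sub y { return $_[0]{y} }

package main;

# Deflator: blessed object -> unblessed hashref that JSON can represent.
my $deflator = sub {
    my $point = shift;
    return { x => $point->x, y => $point->y };
};

# Inflator: plain hashref, as stored, -> the original kind of object.
my $inflator = sub {
    my $data = shift;
    return My::Point->new(%$data);
};

my $stored   = $deflator->( My::Point->new( x => 1, y => 2 ) );
my $restored = $inflator->($stored);

print ref($stored), "\n";
print ref($restored), ' ', $restored->x, ',', $restored->y, "\n";
```

A quick round-trip check like this is a cheap way to confirm that nothing is lost between deflation and inflation.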
AUTHOR
Clinton Gormley <drtech@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2012 by Clinton Gormley.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.