NAME
ElasticSearch - An API for communicating with ElasticSearch
VERSION
Version 0.25, tested against ElasticSearch server version 0.12.0.
NOTE: This version has been completely refactored, to provide multiple Transport backends, and some methods have moved to subclasses.
DESCRIPTION
ElasticSearch is an Open Source (Apache 2 license), distributed, RESTful Search Engine based on Lucene, and built for the cloud, with a JSON API.
Check out its features: http://www.elasticsearch.com/products/elasticsearch/
This module is a thin API which makes it easy to communicate with an ElasticSearch cluster.
It maintains a list of all servers/nodes in the ElasticSearch cluster, and spreads the load randomly across these nodes. If the current active node disappears, then it attempts to connect to another node in the list.
Forking a process triggers a server list refresh, and a new connection to a randomly chosen node in the list.
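The random selection and failover described above can be pictured with a small sketch. This is not the module's own code: `pick_server` and the `$is_alive` callback are hypothetical names used only to illustrate the idea, and the liveness check is a stand-in.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper, not part of ElasticSearch.pm: try the known
# servers in random order and return the first one that responds.
sub pick_server {
    my ( $servers, $is_alive ) = @_;
    for my $server ( shuffle @$servers ) {
        return $server if $is_alive->($server);
    }
    die "No servers available\n";
}

# Pretend that es1 has disappeared and only es2 is still reachable.
my @servers = ( 'es1.foo.com:9200', 'es2.foo.com:9200' );
my $current = pick_server( \@servers, sub { $_[0] ne 'es1.foo.com:9200' } );
print "Using $current\n";
```

Shuffling the whole list, rather than picking one index at random, means every remaining node is tried exactly once before giving up.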
SYNOPSIS
use ElasticSearch;
my $e = ElasticSearch->new(
servers => 'search.foo.com:9200',
transport => 'http' | 'httplite' | 'thrift', # default 'http'
trace_calls => 'log_file',
);
$e->index(
index => 'twitter',
type => 'tweet',
id => 1,
data => {
user => 'kimchy',
post_date => '2009-11-15T14:12:12',
message => 'trying out Elastic Search'
}
);
$data = $e->get(
index => 'twitter',
type => 'tweet',
id => 1
);
$results = $e->search(
index => 'twitter',
type => 'tweet',
query => {
term => { user => 'kimchy' },
}
);
$results = $e->search(
index => 'twitter',
type => 'tweet',
query => {
query_string => { query => 'kimchy' },
}
);
See the examples/ directory for a simple working example.
GETTING ElasticSearch
You can download the latest released version of ElasticSearch from http://github.com/elasticsearch/elasticsearch/downloads.
See here for setup instructions: http://github.com/elasticsearch/elasticsearch/wiki/Setting-up-ElasticSearch
CALLING CONVENTIONS
I've tried to follow the same terminology as used in the ElasticSearch docs when naming methods, so it should be easy to tie the two together.
Some methods require a specific index and a specific type, while others allow a list of indices or types, or allow you to specify all indices or types. I distinguish between them as follows:
$e->method( index => multi, type => single, ...)
multi values can be:
index => 'twitter' # specific index
index => ['twitter','user'] # list of indices
index => undef # (or not specified) = all indices
single values must be a scalar, and are required parameters:
type => 'tweet'
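As a rough illustration of the three forms a multi value can take, the sketch below describes each one. `describe_multi` is a hypothetical helper, not something the module provides.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): show how a "multi"
# value is read: undef means all, otherwise one name or a list of names.
sub describe_multi {
    my ($value) = @_;
    return 'all' unless defined $value;
    return join ',', ( ref($value) eq 'ARRAY' ? @$value : $value );
}

print describe_multi('twitter'), "\n";
print describe_multi( [ 'twitter', 'user' ] ), "\n";
print describe_multi(undef), "\n";
```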
RETURN VALUES AND EXCEPTIONS
Methods that query the ElasticSearch cluster return the raw data structure that the cluster returns. This may change in the future, but as these data structures are still in flux, I thought it safer not to try to interpret.
Anything that is known to be an error throws an exception, eg trying to delete a non-existent index.
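Because errors are thrown rather than returned, calls that may legitimately fail are usually wrapped in an eval. The sketch below shows one way to do that; `try_request` is a hypothetical helper, and the failing code ref and its message are made up so that the example runs without a cluster.

```perl
use strict;
use warnings;

# Hypothetical wrapper: run a request and return (result, error)
# instead of letting the exception propagate.
sub try_request {
    my ($request) = @_;
    my $result;
    my $ok = eval { $result = $request->(); 1 };
    return $ok ? ( $result, undef ) : ( undef, $@ || 'Unknown error' );
}

# In real use the code ref would wrap a call such as
#   sub { $e->delete_index( index => 'twitter' ) }
# Here a stand-in that always dies (with a made-up message) is used.
my ( $result, $error ) = try_request( sub { die "index does not exist\n" } );
print defined $error ? "Failed: $error" : "OK\n";
```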
METHODS
Creating a new ElasticSearch instance
new()
$e = ElasticSearch->new(
transport => 'http|httplite|thrift', # default 'http'
servers => '127.0.0.1:9200' # single server
| ['es1.foo.com:9200',
'es2.foo.com:9200'], # multiple servers
trace_calls => 1 | '/path/to/log/file',
timeout => 30,
);
servers is a required parameter and can be either a single server or an ARRAY ref with a list of servers. These servers are used to retrieve a list of all servers in the cluster, after which one is chosen at random to be the "current_server()".
There are various transport backends that ElasticSearch can use: http (the default, based on LWP), httplite (based on HTTP::Lite) or thrift (which uses the Thrift protocol).
Although the thrift interface has the right buzzwords (binary, compact, sockets), the generated Perl code is very slow. Until that is improved, I recommend one of the http backends instead.
The httplite backend is about 30% faster than the default http backend, and will probably become the default after more testing in production.
See also: ElasticSearch::Transport, "timeout()", "trace_calls()", http://www.elasticsearch.com/docs/elasticsearch/modules/http and http://www.elasticsearch.com/docs/elasticsearch/modules/thrift
Document-indexing methods
index()
$result = $e->index(
index => single,
type => single,
id => $document_id, # optional, otherwise auto-generated
data => {
key => value,
...
},
timeout => eg '1m' or '10s' # optional
create => 1 | 0 # optional
refresh => 1 | 0 # optional
);
eg:
$result = $e->index(
index => 'twitter',
type => 'tweet',
id => 1,
data => {
user => 'kimchy',
post_date => '2009-11-15T14:12:12',
message => 'trying out Elastic Search'
},
);
Used to add a document to a specific index as a specific type with a specific id. If the index/type/id combination already exists, then that document is updated, otherwise it is created.
Note:
If the id is not specified, then ElasticSearch autogenerates a unique ID and a new document is always created.
If create is true, then a new document is created, even if the same index/type/id combination already exists! create can be used to slightly increase performance when creating documents that are known not to exist in the index.
See also: http://www.elasticsearch.com/docs/elasticsearch/rest_api/index, "bulk()" and "put_mapping()"
set()
set() is a synonym for "index()".
create()
create() is a synonym for "index()", but it creates the document without first checking whether it already exists. This speeds up the indexing process.
get()
$result = $e->get(
index => single,
type => single,
id => single,
);
Returns the document stored at index/type/id or throws an exception if the document doesn't exist.
Example:
$e->get( index => 'twitter', type => 'tweet', id => 1)
Returns:
{
_id => 1,
_index => "twitter",
_source => {
message => "trying out Elastic Search",
post_date=> "2009-11-15T14:12:12",
user => "kimchy",
},
_type => "tweet",
}
See also: "bulk()", "KNOWN ISSUES", http://www.elasticsearch.com/docs/elasticsearch/rest_api/get
delete()
$result = $e->delete(
index => single,
type => single,
id => single,
refresh => 1 | 0 # optional
);
Deletes the document stored at index/type/id or throws an exception if the document doesn't exist.
Example:
$e->delete( index => 'twitter', type => 'tweet', id => 1);
See also: "bulk()", http://www.elasticsearch.com/docs/elasticsearch/rest_api/delete
bulk()
$result = $e->bulk([
{ create => { index => 'foo', type => 'bar', id => 123,
data => { text => 'foo bar'} }},
{ index => { index => 'foo', type => 'bar', id => 123,
data => { text => 'foo bar'} }},
{ delete => { index => 'foo', type => 'bar', id => 123 }},
]);
Perform multiple index, create or delete operations in a single request. In my benchmarks, this is 10 times faster than serial operations.
For the above example, the $result will look like:
{
actions => [ the list of actions you passed in ],
results => [
{ create => { id => 123, index => "foo", type => "bar" } },
{ index => { id => 123, index => "foo", type => "bar" } },
{ delete => { id => 123, index => "foo", type => "bar" } },
]
}
where each row in results corresponds to the same row in actions. If there are any errors for individual rows, then the $result will contain a key errors which contains an array of each error and the associated action, eg:
$result = {
actions => [
## NOTE - num is numeric
{ index => { index => 'bar', type => 'bar', id => 123,
data => { num => 123 } } },
## NOTE - num is a string
{ index => { index => 'bar', type => 'bar', id => 123,
data => { num => 'foo bar' } } },
],
errors => [
{
action => {
index => { index => 'bar', type => 'bar', id => 123,
data => { num => 'foo bar' } }
},
error => "MapperParsingException[Failed to parse [num]]; ...",
},
],
results => [
{ index => { id => 123, index => "bar", type => "bar" } },
{ index => {
error => "MapperParsingException[Failed to parse [num]];...",
id => 123, index => "bar", type => "bar",
},
},
],
};
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/bulk for more details.
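The action list and the errors key described above can be handled with a couple of small helpers. This is only a sketch: `bulk_index_actions` and `bulk_error_messages` are hypothetical names, and the `$result` used here is written by hand in the documented shape so that the example runs without a cluster.

```perl
use strict;
use warnings;

# Hypothetical helper: build the ARRAY ref of 'index' actions for bulk()
# from a list of { id => ..., data => {...} } hashes.
sub bulk_index_actions {
    my ( $index, $type, @docs ) = @_;
    my @actions;
    for my $doc (@docs) {
        push @actions, {
            index => {
                index => $index,
                type  => $type,
                id    => $doc->{id},
                data  => $doc->{data},
            }
        };
    }
    return \@actions;
}

# Hypothetical helper: one message per entry under the 'errors' key.
sub bulk_error_messages {
    my ($result) = @_;
    my @messages;
    for my $failure ( @{ $result->{errors} || [] } ) {
        my ($op) = keys %{ $failure->{action} };
        my $id = $failure->{action}{$op}{id};
        push @messages, "$op id=$id: $failure->{error}";
    }
    return @messages;
}

my $actions = bulk_index_actions(
    'bar', 'bar',
    { id => 123, data => { num => 123 } },
    { id => 124, data => { num => 'foo bar' } },
);
# In real use: my $result = $e->bulk($actions);

# Hand-written result in the shape documented above.
my $result = {
    errors => [
        {   action => $actions->[1],
            error  => 'MapperParsingException[Failed to parse [num]]',
        },
    ],
};
print "$_\n" for bulk_error_messages($result);
```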
Query commands
search()
$result = $e->search(
index => multi,
type => multi,
query => {query},
search_type => $search_type # optional
explain => 1 | 0 # optional
facets => { facets } # optional
fields => [$field_1,$field_n] # optional
from => $start_from # optional
script_fields => { script_fields } # optional
size => $no_of_results # optional
sort => ['_score',$field_1] # optional
scroll => '5m' | '30s' # optional
highlight => { highlight } # optional
indices_boost => { index_1 => 1.5,... } # optional
);
Searches for all documents matching the query. Documents can be matched against multiple indices and multiple types, eg:
$result = $e->search(
index => undef, # all
type => ['user','tweet'],
query => { term => {user => 'kimchy' }}
);
For all of the options that can be included in the query parameter, see http://www.elasticsearch.com/docs/elasticsearch/rest_api/search and http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/
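The from and size parameters are typically used together for paging. A minimal sketch, assuming from is a zero-based offset into the result set; `page_params` is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a 1-based page number into from/size.
sub page_params {
    my ( $page, $per_page ) = @_;
    return ( from => ( $page - 1 ) * $per_page, size => $per_page );
}

my %params = page_params( 3, 10 );
print "from=$params{from} size=$params{size}\n";

# The pairs could then be passed along with the other parameters, eg:
#   $e->search( index => 'twitter', query => {...}, page_params( 3, 10 ) );
```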
scroll()
$result = $e->scroll(scroll_id => $scroll_id );
If a search has been executed with a scroll parameter, then the returned scroll_id can be used like a cursor to scroll through the rest of the results.
Note - this doesn't seem to work correctly in version 0.12.0 of ElasticSearch.
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/#Scrolling
count()
$result = $e->count(
index => multi,
type => multi,
bool
| constant_score
| custom_score
| dis_max
| field
| filtered
| flt
| flt_field
| fuzzy
| match_all
| mlt
| mlt_field
| query_string
| prefix
| range
| span_term
| span_first
| span_near
| span_not
| span_or
| term
| wildcard
);
Counts the number of documents matching the query. Documents can be matched against multiple indices and multiple types, eg
$result = $e->count(
index => undef, # all
type => ['user','tweet'],
term => {user => 'kimchy' },
);
See also "search()", http://www.elasticsearch.com/docs/elasticsearch/rest_api/count and http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/
delete_by_query()
$result = $e->delete_by_query(
index => multi,
type => multi,
bool
| constant_score
| custom_score
| dis_max
| field
| filtered
| flt
| flt_field
| fuzzy
| match_all
| mlt
| mlt_field
| query_string
| prefix
| range
| span_term
| span_first
| span_near
| span_not
| span_or
| term
| wildcard
);
Deletes any documents matching the query. Documents can be matched against multiple indices and multiple types, eg
$result = $e->delete_by_query(
index => undef, # all
type => ['user','tweet'],
term => {user => 'kimchy' }
);
See also "search()", http://www.elasticsearch.com/docs/elasticsearch/rest_api/delete_by_query and http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/
mlt()
# mlt == more_like_this
$results = $e->mlt(
index => single, # required
type => single, # required
id => $id, # required
# optional more-like-this params
boost_terms => float
mlt_fields => 'scalar' or ['scalar_1', 'scalar_n']
max_doc_freq => integer
max_query_terms => integer
max_word_len => integer
min_doc_freq => integer
min_term_freq => integer
min_word_len => integer
pct_terms_to_match => float
stop_words => 'scalar' or ['scalar_1', 'scalar_n']
# optional search params
scroll => '5m' | '10s'
search_type => "predefined_value"
explain => {explain}
facets => {facets}
fields => {fields}
from => {from}
highlight => {highlight}
size => {size}
sort => {sort}
)
More-like-this (mlt) finds related/similar documents. It is possible to run a search query with a more_like_this clause (where you pass in the text you're trying to match), or to use this method, which uses the text of the document referred to by index/type/id.
This gets transformed into a search query, so all of the search parameters are also available.
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/more_like_this/ and http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/more_like_this_query/
Index Admin methods
index_status()
$result = $e->index_status(
index => multi,
);
Returns the status of all indices, or of the specified index or indices, eg:
$result = $e->index_status(); # all
$result = $e->index_status( index => ['twitter','buzz'] );
$result = $e->index_status( index => 'twitter' );
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/status
create_index()
$result = $e->create_index(
index => single,
defn => {...} # optional
);
Creates a new index, optionally setting certain parameters, eg:
$result = $e->create_index(
index => 'twitter',
defn => {
number_of_shards => 3,
number_of_replicas => 2,
}
);
Throws an exception if the index already exists.
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/create_index
delete_index()
$result = $e->delete_index(
index => single
);
Deletes an existing index, or throws an exception if the index doesn't exist, eg:
$result = $e->delete_index( index => 'twitter' );
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/delete_index
update_index_settings()
$result = $e->update_index_settings(
index => multi,
settings => { ... settings ...}
);
Update the settings for all, one or many indices. Currently only the number_of_replicas setting is exposed:
$result = $e->update_index_settings(
settings => { number_of_replicas => 1 }
);
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/update_settings/
aliases()
$result = $e->aliases( actions => [actions] | {actions} )
Adds or removes an alias for an index, eg:
$result = $e->aliases( actions => [
{ remove => { index => 'foo', alias => 'bar' }},
{ add => { index => 'foo', alias => 'baz' }}
]);
actions can be a single HASH ref, or an ARRAY ref containing multiple HASH refs.
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/aliases/
get_aliases()
$result = $e->get_aliases( index => multi )
Returns a hashref listing all indices and their corresponding aliases, and all aliases and their corresponding indices, eg:
{
aliases => {
bar => ["foo"],
baz => ["foo"],
},
indices => { foo => ["baz", "bar"] },
}
If you pass in the optional index argument, which can be an index name or an alias name, then it will only return the indices and aliases related to that argument.
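The returned structure makes it easy to resolve a name to the real indices behind it. The sketch below works on a hand-written copy of the example result above; `indices_for` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: resolve a name to the real indices behind it.
# An alias resolves to its indices, an index name resolves to itself.
sub indices_for {
    my ( $result, $name ) = @_;
    my $via_alias = $result->{aliases}{$name};
    return @$via_alias if $via_alias;
    return ($name) if exists $result->{indices}{$name};
    return;
}

# Same shape as the example result shown above.
my $result = {
    aliases => { bar => ['foo'], baz => ['foo'] },
    indices => { foo => [ 'baz', 'bar' ] },
};
print join( ',', indices_for( $result, 'bar' ) ), "\n";
print join( ',', indices_for( $result, 'foo' ) ), "\n";
```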
flush_index()
$result = $e->flush_index(
index => multi,
full => 1 | 0, # optional
refresh => 1 | 0, # optional
);
Flushes one or more indices, which frees memory from the index by flushing data to the index storage and clearing the internal transaction log. By default, ElasticSearch uses memory heuristics in order to automatically trigger flush operations as required in order to clear memory.
Example:
$result = $e->flush_index( index => 'twitter' );
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/flush
refresh_index()
$result = $e->refresh_index(
index => multi
);
Explicitly refreshes one or more indices, making all operations performed since the last refresh available for search. The (near) real-time capabilities depend on the index engine used. For example, the robin engine requires refresh to be called, but by default a refresh is scheduled periodically.
Example:
$result = $e->refresh_index( index => 'twitter' );
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/refresh
clear_cache()
$result = $e->clear_cache( index => multi );
Clears the caches for the specified indices (currently only the filter cache).
See http://github.com/elasticsearch/elasticsearch/issues/issue/101
gateway_snapshot()
$result = $e->gateway_snapshot(
index => multi
);
Explicitly performs a snapshot through the gateway of one or more indices (backs them up). By default, each index gateway periodically snapshots changes, though this can be disabled and controlled completely through this API.
Example:
$result = $e->gateway_snapshot( index => 'twitter' );
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/gateway_snapshot and http://www.elasticsearch.com/docs/elasticsearch/modules/gateway
snapshot_index()
snapshot_index()
is a synonym for "gateway_snapshot()"
optimize_index()
$result = $e->optimize_index(
index => multi,
only_deletes => 1 | 0, # only_expunge_deletes
flush => 1 | 0, # flush after optimization
refresh => 1 | 0, # refresh after optimization
)
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/optimize
put_mapping()
$result = $e->put_mapping(
index => multi,
type => single,
_all => { ... },
_source => { ... },
properties => { ... }, # required
timeout => '5m' | '10s', # optional
ignore_conflicts => 1 | 0, # optional
);
A mapping is the data definition of a type. If no mapping has been specified, then ElasticSearch tries to infer the type of each field in a document by looking at its contents, eg:
'foo' => string
123 => integer
1.23 => float
However, these heuristics can be confused, so it is safer (and much more powerful) to specify an official mapping instead, eg:
$result = $e->put_mapping(
index => ['twitter','buzz'],
type => 'tweet',
_source => { compress => 1 },
properties => {
user => {type => "string", index => "not_analyzed"},
message => {type => "string", null_value => "na"},
post_date => {type => "date"},
priority => {type => "integer"},
rank => {type => "float"}
}
);
See also: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/put_mapping and http://www.elasticsearch.com/docs/elasticsearch/mapping
delete_mapping()
$result = $e->delete_mapping(
index => multi,
type => single,
);
Deletes a mapping/type in one or more indices. See also http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/delete_mapping
mapping()
$mapping = $e->mapping(
index => single,
type => multi
);
Returns the mappings for all types in an index, or the mapping for the specified type(s), eg:
$mapping = $e->mapping(
index => 'twitter',
type => 'tweet'
);
$mappings = $e->mapping(
index => 'twitter',
type => ['tweet','user']
);
# { twitter => { tweet => {mapping}, user => {mapping}} }
Note: the index name which is used in the results is the actual index name. If you pass an alias name as the index name, then this key will be the index (or indices) that the alias points to.
River admin methods
create_river()
$result = $e->create_river(
river => $river_name, # required
type => $type, # required
$type => {...}, # depends on river type
index => {...}, # depends on river type
);
Creates a new river with name $river_name, eg:
$result = $e->create_river(
river => 'my_twitter_river',
type => 'twitter',
twitter => {
user => 'user',
password => 'password',
},
index => {
index => 'my_twitter_index',
type => 'status',
bulk_size => 100
}
)
See http://www.elasticsearch.com/docs/elasticsearch/river/ and http://www.elasticsearch.com/docs/elasticsearch/river/twitter/.
get_river()
$result = $e->get_river( river => $river_name );
Returns the river details, eg:
$result = $e->get_river ( river => 'my_twitter_river' )
Throws an exception if the river doesn't exist.
See http://www.elasticsearch.com/docs/elasticsearch/river/.
delete_river()
$result = $e->delete_river( river => $river_name );
Deletes the corresponding river, eg:
$result = $e->delete_river ( river => 'my_twitter_river' )
Throws an exception if the river doesn't exist.
Cluster admin methods
cluster_state()
$result = $e->cluster_state(
filter_nodes => 1 | 0, # optional
filter_metadata => 1 | 0, # optional
filter_routing_table => 1 | 0, # optional
filter_indices => [ 'index_1', ... 'index_n' ], # optional
);
Returns cluster state information.
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/state/
cluster_health()
$result = $e->cluster_health(
index => multi,
level => 'cluster' | 'indices' | 'shards',
wait_for_status => 'red' | 'yellow' | 'green',
| wait_for_relocating_shards => $number_of_shards,
| wait_for_nodes => eg '>=2',
timeout => $seconds
);
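Returns the status of the cluster, or of the specified indices or shards, where the returned status means:
green: all primary and replica shards are allocated
yellow: all primary shards are allocated, but some replicas are not
red: some primary shards are not allocated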
It can block to wait for a particular status (or better), or can block to wait until the specified number of shards have been relocated (where 0 means all) or the specified number of nodes have been allocated.
If waiting, then a timeout can be specified.
For example:
$result = $e->cluster_health( wait_for_status => 'green', timeout => '10s')
See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/health/
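A typical use is to wait for the cluster to become healthy before continuing. The sketch below assumes that the result contains the status and timed_out keys seen in the trace_calls() example output, and that a timeout is reported in the result rather than thrown. `is_green_within` and the `FakeCluster` stand-in are hypothetical, and are there only so that the example runs without a cluster.

```perl
use strict;
use warnings;

# Hypothetical helper: wait for green status and report whether it was
# reached. The 'status' and 'timed_out' keys are assumed to be present.
sub is_green_within {
    my ( $e, $timeout ) = @_;
    my $health = $e->cluster_health(
        wait_for_status => 'green',
        timeout         => $timeout,
    );
    return !$health->{timed_out} && $health->{status} eq 'green';
}

# Stand-in object so that the sketch runs offline; in real use pass the
# object returned by ElasticSearch->new().
{
    package FakeCluster;
    sub new            { return bless {}, shift }
    sub cluster_health { return { status => 'green', timed_out => 0 } }
}

print is_green_within( FakeCluster->new, '10s' ) ? "green\n" : "not green\n";
```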
nodes()
$result = $e->nodes(
nodes => multi,
settings => 1 | 0 # optional
);
Returns information about one or more nodes or servers in the cluster. If settings is true, then it includes the node settings information.
See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/nodes_info
nodes_stats()
$result = $e->nodes_stats(
nodes => multi,
);
Returns various statistics about one or more nodes in the cluster.
See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/nodes_stats/
shutdown()
$result = $e->shutdown(
nodes => multi,
delay => '5s' | '10m' # optional
);
Shuts down one or more nodes (or the whole cluster if no nodes specified), optionally with a delay.
nodes can also have the values _local, _master or _all.
See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/nodes_shutdown/
restart()
$result = $e->restart(
nodes => multi,
delay => '5s' | '10m' # optional
);
Restarts one or more nodes (or the whole cluster if no nodes specified), optionally with a delay.
nodes can also have the values _local, _master or _all.
See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/nodes_restart
current_server_version()
$version = $e->current_server_version()
Returns a HASH containing the version number string, the build date and whether or not the current server is a snapshot_build.
Other methods
trace_calls()
$e->trace_calls(1); # log to STDERR
$e->trace_calls($filename); # log to $filename.$PID
$e->trace_calls(0 | undef); # disable logging
trace_calls() is used for debugging. All requests to the cluster are logged either to STDERR or the specified filename, with the current $PID appended, in a form that can be rerun with curl.
The cluster response will also be logged, and commented out.
Example: $e->cluster_health is logged as:
# [Tue Oct 19 15:32:31 2010] Protocol: http, Server: 127.0.0.1:9200
curl -XGET 'http://127.0.0.1:9200/_cluster/health'
# [Tue Oct 19 15:32:31 2010] Response:
# {
# "relocating_shards" : 0,
# "active_shards" : 0,
# "status" : "green",
# "cluster_name" : "elasticsearch",
# "active_primary_shards" : 0,
# "timed_out" : false,
# "initializing_shards" : 0,
# "number_of_nodes" : 1,
# "unassigned_shards" : 0
# }
transport()
$transport = $e->transport
Returns the Transport object, eg ElasticSearch::Transport::HTTP.
camel_case()
$bool = $e->camel_case($bool)
Gets/sets the camel_case flag. If true, then all JSON keys returned by ElasticSearch are in camelCase, instead of with_underscores. This flag does not apply to the source document being indexed or fetched.
Defaults to false.
error_trace()
$bool = $e->error_trace($bool)
If the ElasticSearch server is returning an error, setting error_trace to true will return some internal information about where the error originates. Mostly useful for debugging.
GLOBAL VARIABLES
$ElasticSearch::DEBUG = 0 | 1;
If $ElasticSearch::DEBUG is set to true, then ElasticSearch exceptions will include a stack trace.
AUTHOR
Clinton Gormley, <drtech at cpan.org>
KNOWN ISSUES
- "set()", "index()" and "create()"
If one of the fields that you are trying to index has the same name as the type, then you need to change the format as follows:
Instead of:
$e->set(index=>'twitter', type=>'tweet', data=> { tweet => 'My tweet', date => '2010-01-01' } );
you should include the type name in the data:
$e->set(index=>'twitter', type=>'tweet', data=> { tweet=> { tweet => 'My tweet', date => '2010-01-01' }} );
- "get()"
The _source key that is returned from a "get()" contains the original JSON string that was used to index the document initially. ElasticSearch parses JSON more leniently than JSON::XS, so if invalid JSON is used to index the document (eg unquoted keys) then $e->get(....) will fail with a JSON exception.
Any documents indexed via this module will not be susceptible to this problem.
- "scroll()"
scroll() is broken in version 0.12.0 and earlier versions of ElasticSearch. See http://github.com/elasticsearch/elasticsearch/issues/issue/136
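For the "set()", "index()" and "create()" issue above, the wrapping can be done mechanically. A minimal sketch follows; `wrap_data_for_type` is a hypothetical helper, and it should only be given data that has not already been wrapped.

```perl
use strict;
use warnings;

# Hypothetical helper: if a field has the same name as the type, nest the
# data under the type name; otherwise return the data unchanged.
# Pass only unwrapped data, as wrapped data would be wrapped again.
sub wrap_data_for_type {
    my ( $type, $data ) = @_;
    return exists $data->{$type} ? { $type => $data } : $data;
}

my $data = wrap_data_for_type( 'tweet',
    { tweet => 'My tweet', date => '2010-01-01' } );
print join( ',', keys %$data ), "\n";

# The result could then be passed as the data parameter, eg:
#   $e->set( index => 'twitter', type => 'tweet', data => $data );
```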
BUGS
This is a beta module, so there will be bugs, and the API is likely to change in the future, as the API of ElasticSearch itself changes.
If you have any suggestions for improvements, or find any bugs, please report them to http://github.com/clintongormley/ElasticSearch.pm/issues. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
TODO
Hopefully I'll be adding an ElasticSearch::Abstract (similar to SQL::Abstract) which will make it easier to generate valid queries for ElasticSearch.
Also, a non-blocking AnyEvent module has been written, but needs integrating with the new ElasticSearch::Transport.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc ElasticSearch
You can also look for information at:
GitHub
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
Thanks to Shay Banon, the ElasticSearch author, for producing an amazingly easy to use search engine.
LICENSE AND COPYRIGHT
Copyright 2010 Clinton Gormley.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.