NAME

Elastic::Manual::Searching - Which search method to use with a View, and how to use the results.

VERSION

version 0.29_2

DESCRIPTION

Once you have configured your view correctly, you need to call a method on it in order to produce search results.

The three main methods are search(), scroll(), and scan(). All three methods return an iterator, but each method has a different purpose. The correct method should be chosen to match the situation.

This document discusses how to use the returned iterator, and when to use which search method.

RESULTS ITERATOR

Iterator basics

All three search methods return an iterator based on Elastic::Model::Role::Results and Elastic::Model::Role::Iterator, which works pretty much like any iterator, eg:

$it->first;         # first element
$it->next;          # next element
$it->prev;          # previous element
$it->last;          # last element
$it->current;       # current element
$it->shift;         # return first element and remove it from $it

$it->all;           # all elements
$it->slice(0,10);   # elements 0..9

$it->peek_next;     # return next element but don't move the cursor
$it->peek_prev;     # return previous element but don't move the cursor

$it->has_next;      # 1 / 0
$it->has_prev;      # 1 / 0
$it->is_first;      # 1 / 0
$it->is_last;       # 1 / 0
$it->even;          # 1 / 0
$it->odd;           # 1 / 0
$it->parity;        # even / odd

$it->size;          # number of elements in $it
$it->total;         # total number of matching docs
$it->facets;        # any facets that were requested

What elements can the iterator return?

What's different about these iterators is the elements that they return. There are three options:

  • Result: The raw result returned from Elasticsearch is wrapped in an Elastic::Model::Result object, which provides methods for accessing the object itself, and metadata like highlights, script_fields or the relevance score that the current doc has.

  • Object: The object itself (or a stub-object which can auto-inflate) is returned. The other search metadata is not accessible.

  • Element: The raw result returned by Elasticsearch.

Depending on what you are doing, you may want any one of these three. For instance:

  • If you are doing a full text query (eg a user does a keyword search), you will want to return Results.

  • If you're retrieving the 20 most recent blog posts, you want just the Objects.

  • If you're reindexing your data from one index to another, you want to avoid the inflation/deflation process and just use the raw data Element.

Choosing an element type

From any results iterator, you can return any of the three element types:

$it->next_result;       # Result object
$it->next_object;       # Object itself
$it->next_element;      # Raw data

But that is verbose. By default, first(), next() etc all return Result objects, but you can change that:

$it = $view->search;

$it->next;              # Result object

$it->as_objects;
$it->next;              # Object itself

$it->as_elements;
$it->next;              # Raw data

$it->as_results;
$it->next;              # Result object

So the typical usage if you want a list of objects back, would be:

my $results = $view->search->as_objects;

while ( my $object = $results->next ) {
    do_something_with($object)
}

WHICH SEARCH METHOD SHOULD I USE WHEN?

Overview of differences

In summary:

  • Use search() when you want to retrieve a finite list of results. For instance: "Give me the 10 best results matching this query".

  • Use scroll() when you want an unbounded result set. For instance: "Give me all the blog posts by user 123".

  • Use scan() when you want to retrieve a large amount of data, eg when you want to reindex all of your data. Scanning is very efficient, but the results cannot be sorted.

Why do I need to choose a method?

When you create an index in Elasticsearch, it is created (by default) with 5 primary shards. Each of your docs is stored in one of those shards. It is these primary shards that allow you to scale your index size. But with the flexible scaling comes complexity.

The query process

Let's consider what happens when you run a query like: "Give me the 10 most relevant docs that match 'foo bar'".

  • Your query is sent to one of your Elasticsearch nodes.

  • That node forwards your query to all 5 shards in the index.

  • Each shard runs the query and finds the 10 most relevant docs, and returns them to the requesting node.

  • The requesting node sorts these 50 docs by relevance, discards the 40 least relevant, and returns the 10 most relevant.
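
The steps above can be sketched in plain Perl. This is only a toy simulation, not how Elasticsearch or Elastic::Model is implemented: the shard contents, the score formula and the helper names (by_relevance, shard_top) are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical toy data: 5 shards, each holding 30 docs with a made-up score.
# Scores are generated deterministically so that the run is repeatable.
my $num_shards = 5;
my $size       = 10;

my @shards;
for my $shard ( 0 .. $num_shards - 1 ) {
    for my $n ( 1 .. 30 ) {
        push @{ $shards[$shard] },
            { id => "s$shard-d$n", score => ( $n * 37 + $shard * 11 ) % 101 };
    }
}

# Sort by descending score; break ties by id so that the order is stable.
sub by_relevance { $b->{score} <=> $a->{score} or $a->{id} cmp $b->{id} }

# Each shard finds its own top $size docs.
sub shard_top {
    my ( $docs, $size ) = @_;
    my @sorted = sort by_relevance @$docs;
    my $last   = $size < @sorted ? $size - 1 : $#sorted;
    return @sorted[ 0 .. $last ];
}

# The requesting node gathers the per-shard results, sorts them again,
# keeps the top $size and throws the rest away.
my @gathered  = map { shard_top( $_, $size ) } @shards;
my @merged    = sort by_relevance @gathered;
my @top       = @merged[ 0 .. $size - 1 ];
my $discarded = @gathered - @top;

printf "gathered %d, returned %d, discarded %d\n",
    scalar @gathered, scalar @top, $discarded;
```

With 5 shards and a size of 10, the requesting node gathers 50 docs, returns 10 and discards 40, which matches the numbers in the list above.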

So then, if you ask for page 10,001 (ie results 100,001 - 100,010), each shard has to return 100,010 docs, and the requesting node has to sort through 500,050 docs and discard 500,040 of them!
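
To make the arithmetic concrete, here is a small sketch. The function name deep_paging_cost and its named arguments are hypothetical, and from is assumed to be the zero-based offset of the first wanted result.

```perl
use strict;
use warnings;

# Number of docs that have to be moved around for one page of sorted results.
# 'from' is the zero-based offset of the first result that we want.
sub deep_paging_cost {
    my (%args) = @_;
    my $per_shard = $args{from} + $args{size};     # each shard returns this many
    my $gathered  = $per_shard * $args{shards};    # requesting node sorts these
    return {
        per_shard => $per_shard,
        gathered  => $gathered,
        discarded => $gathered - $args{size},
    };
}

# Results 100,001 - 100,010 on an index with 5 primary shards.
my $cost = deep_paging_cost( from => 100_000, size => 10, shards => 5 );
printf "per shard: %d, gathered: %d, discarded: %d\n",
    @{$cost}{qw(per_shard gathered discarded)};
```

The cost grows with the offset, not with the page size: the first page of the same query only gathers 50 docs.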

That approach doesn't scale. More than likely the requesting node will just run out of memory and be killed. There is a good reason why search engines don't return more than 100 pages of results.

Why should I scroll? Why can't I just ask for page 2?

More than likely, your data is being updated constantly. In between your requests for page 1 and page 2, your data may have changed order, or a doc might have been added or deleted. So you could end up missing results or seeing duplicates.
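
A toy example of how a duplicate sneaks in, using a plain Perl array in place of an index. The post ids and the page helper are hypothetical; the point is only that offset-based paging reads from whatever the data looks like at the time of each request.

```perl
use strict;
use warnings;

# Hypothetical list of post ids, newest first: post-6 .. post-1
my @posts = map { "post-$_" } reverse 1 .. 6;

# Plain offset-based paging over the list as it is right now.
sub page {
    my ( $docs, $page, $size ) = @_;
    my $from = ( $page - 1 ) * $size;
    my $to   = $from + $size - 1;
    $to = $#$docs if $to > $#$docs;
    return $from > $to ? () : @{$docs}[ $from .. $to ];
}

my @page1 = page( \@posts, 1, 3 );    # post-6 post-5 post-4

# A new post arrives between the two requests and pushes everything down.
unshift @posts, 'post-7';

my @page2 = page( \@posts, 2, 3 );    # post-4 post-3 post-2

my %seen       = map { $_ => 1 } @page1;
my @duplicates = grep { $seen{$_} } @page2;
print "seen twice: @duplicates\n";
```

Here post-4 shows up on both pages. A deletion between the two requests would have the opposite effect: one doc would slip from page 2 onto page 1 and never be seen.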

Scrolling gives you consistent results. It is like paging, where it returns size docs on each request, but Elasticsearch keeps the original data around until your scroll times out.

Of course, this comes at a cost: extra disk space. That means that you shouldn't make your scroll timeouts longer than they need to be. The default is 1 minute, but you may be able to reduce that considerably depending on your use case.

That said, consistency doesn't always matter: it may be perfectly reasonable to show duplicates in keyword searches, but less reasonable to have duplicate or missing items in a list.

Why can't I just pull all the data in one request?

Nobody has more than 10,000 blog posts, so why not just request all the posts in a single search() and specify size => 10_000?

The answer is: memory usage.

Each shard needs to return 10,000 docs. The node handling the request has to make space for 50,000 docs, then sort through them to find the top 10,000. That may be fine as a one-off request, but when you have thousands of those happening concurrently, you're going to run out of memory pretty quickly.

But I need to retrieve all 10 billion docs!

OK, now we're in a different league. You can retrieve all the docs in your index, as long as you don't need them to be sorted: use scan(). Scanning works as follows (we'll assume that size is 10, but in practice you can probably make it a lot bigger):

  • Your query is sent to one of your Elasticsearch nodes.

  • That node forwards your query to all 5 shards in the index.

  • Each shard runs the query, finds all matching docs, and returns the first 10 docs to the requesting node, IN ANY ORDER.

  • The requesting node RETURNS ALL 50 DOCS IN ANY ORDER.

  • It also returns a scroll_id which:

    1. keeps track of what results have already been returned and

    2. keeps a consistent view of the index state at the time of the initial query.

  • With this scroll_id, we can keep pulling another 50 docs (ie number_of_primary_shards * size) until we have exhausted all the docs.
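
The scanning process can be simulated in a few lines of Perl. This is a sketch under made-up assumptions: the shard sizes are invented, and the @position array is only a stand-in for whatever the real scroll_id keeps track of.

```perl
use strict;
use warnings;

# Hypothetical toy index: 5 shards holding 12, 19, 26, 33 and 40 matching docs.
my $size   = 10;
my @shards = map {
    my $shard = $_;
    [ map { "s$shard-d$_" } 1 .. ( 12 + 7 * $shard ) ]
} 0 .. 4;

# Stand-in for the scroll_id: remembers how far each shard has been read.
my @position = (0) x @shards;

# One scan request: up to $size docs from every shard, with no sorting at all.
sub next_batch {
    my @batch;
    for my $i ( 0 .. $#shards ) {
        my $docs = $shards[$i];
        my $end  = $position[$i] + $size;
        $end = @$docs if $end > @$docs;
        push @batch, @{$docs}[ $position[$i] .. $end - 1 ];
        $position[$i] = $end;
    }
    return @batch;
}

# Keep pulling batches until a request comes back empty.
my $total = 0;
my @batch_sizes;
while ( my @batch = next_batch() ) {
    push @batch_sizes, scalar @batch;
    $total += @batch;
}
print "batch sizes: @batch_sizes, total: $total\n";
```

The first batch contains 50 docs (number_of_primary_shards * size). Later batches shrink to 41, 26 and 13 docs as the smaller shards run dry, and all 130 docs are returned exactly once.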

But I really need sorting!

Do you? Do you really? Why? No user needs to page through all 5 million of your matching results. Google only returns 1,000 results, for good reason.

OK, OK, so there may be situations where you need to retrieve large numbers of sorted results. The trick here is to break them up into chunks. For instance, you could request all docs created in October, then November etc. How you do it really depends on your requirements.
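
As a sketch of the chunking idea, the helper below builds consecutive month ranges that could each drive one sorted query. The name month_ranges, the 'YYYY-MM' input format and the created field mentioned in the comment are all assumptions made for this example.

```perl
use strict;
use warnings;

# Build consecutive [from, to) month ranges, eg to use as date range filters.
# Both arguments are 'YYYY-MM' strings; the end month is exclusive.
sub month_ranges {
    my ( $start, $end ) = @_;
    my ( $y,     $m )     = split /-/, $start;
    my ( $end_y, $end_m ) = split /-/, $end;
    my @ranges;
    while ( $y < $end_y or ( $y == $end_y and $m < $end_m ) ) {
        my ( $ny, $nm ) = $m == 12 ? ( $y + 1, 1 ) : ( $y, $m + 1 );
        push @ranges,
            {
            from => sprintf( '%04d-%02d-01', $y,  $m ),
            to   => sprintf( '%04d-%02d-01', $ny, $nm ),
            };
        ( $y, $m ) = ( $ny, $nm );
    }
    return @ranges;
}

for my $range ( month_ranges( '2013-10', '2014-01' ) ) {
    # Hypothetical: run one sorted query per range here, filtering on
    # created >= $range->{from} and created < $range->{to}.
    print "$range->{from} .. $range->{to}\n";
}
```

Each chunk is small enough to sort cheaply, and because the ranges don't overlap, processing them in order gives you the whole result set in sorted order.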

DIFFERENCES BETWEEN THE METHODS

search()

$results = $view->search;

search() retrieves the best matching results up to a maximum of size and returns them all in an Elastic::Model::Results object.

The "size" in Elastic::Model::Results attribute contains the number of results that are stored in the iterator. The "total" in Elastic::Model::Results attribute contains the total number of matching docs in Elasticsearch.

scroll()

$results = $view->scroll('1m');

scroll() takes a timeout parameter, which defaults to 1m (one minute). It retrieves size results and wraps them in an Elastic::Model::Results::Scrolled object.

As you iterate through the results, you will eventually request a next() doc which isn't available in the buffer. The iterator will request the next tranche of results from Elasticsearch. It is important to make sure that the timeout is longer than the time between requests, otherwise it will throw an error and you will need to start scrolling again.
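
The buffering behaviour can be pictured with a toy iterator. This is not the code behind Elastic::Model::Results::Scrolled: fetch_tranche and make_scroller are hypothetical stand-ins, and a plain array plays the role of Elasticsearch.

```perl
use strict;
use warnings;

# Hypothetical backend: 25 matching docs, handed out $size at a time.
my @matching = map { "doc-$_" } 1 .. 25;
my $size     = 10;
my $requests = 0;

# Stand-in for one scroll request to Elasticsearch.
sub fetch_tranche {
    $requests++;
    return splice @matching, 0, $size;
}

# Toy iterator: the buffer is only refilled when next() runs past its end.
sub make_scroller {
    my @buffer;
    my $pos = 0;
    return sub {
        push @buffer, fetch_tranche() if $pos >= @buffer;
        return $pos < @buffer ? $buffer[ $pos++ ] : undef;
    };
}

my $next  = make_scroller();
my $count = 0;
$count++ while defined $next->();
print "read $count docs in $requests requests\n";
```

Reading all 25 docs takes 4 requests: three that return docs and a final empty one that tells the iterator there is nothing left. In the real thing, the time between any two of those requests is what has to stay shorter than the scroll timeout.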

The "size" in Elastic::Model::Results::Scrolled attribute contains the number of docs in Elasticsearch that match the query and are available to pull (ie initially, it is the same as the "total" in Elastic::Model::Results::Scrolled attribute).

scan()

$results = $view->scan('1m');

scan() is pretty similar to "scroll()". It takes a timeout parameter, which defaults to 1m (one minute). However, it retrieves a maximum of number_of_primary_shards * size results in a single request and wraps them in an Elastic::Model::Results::Scrolled object. So you may want to consider reducing the size parameter when scanning.

When scrolling, there is a good chance that you want to load all of the results into memory. However, when scanning through billions of docs, you don't want to do that. Instead of using next() you should use shift():

while ( my $result = $results->shift ) {
    do_something_with($result)
}

This means, obviously, that prev() won't work - there is no previous doc. You've thrown it away.

When using shift(), while the "size" in Elastic::Model::Results::Scrolled attribute starts out the same as the "total" in Elastic::Model::Results::Scrolled attribute, it will decrement by one for each shift() call.

SEE ALSO

AUTHOR

Clinton Gormley <drtech@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2014 by Clinton Gormley.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.