Sometimes you want to run complex real-time queries on hundreds of millions of documents, and you’ll quickly find out the limits of what your hardware can do.
Some complex queries may involve indexing fields in multiple ways by applying different analyzers. For example, you may want to index the same field three times: once with an analyzer that applies stemming, once with an analyzer that uses the shingle token filter, and once with the standard token filter.
Luckily there is an easy way to do this in Elasticsearch using multi-type fields. Even though the real mapping type “multi_field” was removed in version 1.0, Elasticsearch allows every possible type to accept a “fields” parameter where multiple field mappings can be configured.
Using the previous example, suppose we have defined two analyzers in the index settings (as explained in the previous post), named “custom_stemmer” and “custom_shingles”.
August 2018: Please note that this post was written for an older version of Elasticsearch. Changes in the code might be necessary to adapt it to the latest versions and best practices.
We are creating the mapping for documents with type “test” with one field “foo”. The default “type” of “foo” is “string” and it has two sub-fields which are accessible under the names “foo.stemmed” and “foo.shingles”.
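The mapping listing itself is not reproduced here, so the following is a minimal reconstruction from the description above, written as a Python dict that serializes to the JSON request body. It assumes the old (pre-5.x) mapping syntax this post was written for. In newer versions the “string” type was replaced by “text”/“keyword” and mapping types such as “test” were removed, so the body would need adapting.

```python
import json

# Sketch of the index-creation body described above (old Elasticsearch syntax).
# "custom_stemmer" and "custom_shingles" are the analyzers assumed to exist
# in the index settings.
mapping_body = {
    "mappings": {
        "test": {
            "properties": {
                "foo": {
                    "type": "string",
                    "fields": {
                        # Reachable in queries as "foo.stemmed"
                        "stemmed": {"type": "string", "analyzer": "custom_stemmer"},
                        # Reachable in queries as "foo.shingles"
                        "shingles": {"type": "string", "analyzer": "custom_shingles"},
                    },
                }
            }
        }
    }
}

print(json.dumps(mapping_body, indent=2))
```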
Now you can create queries that target these sub-fields, while still maintaining the ability to run non-stemmed searches against the main field.
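As an illustration, here is a minimal sketch of two such search bodies, again as Python dicts. The `match` query is standard Elasticsearch query DSL; the search text is a made-up placeholder.

```python
import json

search_text = "running shoes"  # hypothetical search text

# Searches the stemmed version of the field.
stemmed_search = {"query": {"match": {"foo.stemmed": search_text}}}

# Searches the main field, which was indexed without stemming.
plain_search = {"query": {"match": {"foo": search_text}}}

print(json.dumps(stemmed_search))
print(json.dumps(plain_search))
```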
One interesting type of query you can create with multiple fields combines different queries with different boosts. Maybe you want to take into account both the stemmed and the shingled fields for the score. In that case you can build a query with one clause per field, giving a different boost to each component of the query.
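One possible shape for such a query is a `bool` query with one `should` clause per field. This is a sketch only: the boost values and the search text are arbitrary placeholders, not the values from the original listing.

```python
import json

search_text = "running shoes"  # hypothetical search text

# Each "should" clause contributes to the score; "boost" scales its contribution.
boosted_search = {
    "query": {
        "bool": {
            "should": [
                {"match": {"foo": {"query": search_text}}},
                {"match": {"foo.stemmed": {"query": search_text, "boost": 2.0}}},
                {"match": {"foo.shingles": {"query": search_text, "boost": 1.5}}},
            ]
        }
    }
}

print(json.dumps(boosted_search, indent=2))
```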
Queries of this type may start having performance issues when the number of fields used is too big and/or if the number of documents is too large. One workaround for this could be to just throw money at the problem and get more and better servers.
Another way is to optimize queries. One way you can do that is using a feature called rescoring. Rescoring allows you to apply a secondary sorting algorithm to the result set of the primary query. For example, you can use a simple (fast) query to return 500 results and then run another (slow) query on those 500 results to change the sorting.
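To make the idea concrete, here is a toy, in-memory sketch of two-phase scoring. It is not how Elasticsearch implements rescoring; the function and parameter names are invented for illustration, and the two scores are simply added.

```python
def search_with_rescore(docs, cheap_score, expensive_score, window_size):
    """Rank all docs with a cheap score, then re-rank only the top window."""
    ranked = sorted(docs, key=cheap_score, reverse=True)
    window, rest = ranked[:window_size], ranked[window_size:]
    # The expensive score is computed only for the docs inside the window.
    rescored = sorted(
        window,
        key=lambda d: cheap_score(d) + expensive_score(d),
        reverse=True,
    )
    return rescored + rest


docs = ["a", "bb", "ccc", "dddd"]
result = search_with_rescore(
    docs,
    cheap_score=len,                                     # fast first pass
    expensive_score=lambda d: 10 if d == "ccc" else 0,   # slow second pass
    window_size=2,
)
print(result)
```

The slow scorer runs on `window_size` documents instead of the whole collection, which is where the savings come from.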
Let’s assume the previous query is having performance issues and you want to change it to use rescoring. The request then has two parts: a simple primary query, and then the rescore definition.
The first thing we need is the window_size, which is the number of docs that will be examined on each shard.
Then we define the query and query settings. The first one you can notice is score_mode, which defines how the final score is calculated from the original query score and the rescore query score:
| Score mode | Description |
|---|---|
| total | Add the original score and the rescore query score. The default. |
| multiply | Multiply the original score by the rescore query score. |
| avg | Average the original score and the rescore query score. |
| max | Take the max of the original score and the rescore query score. |
| min | Take the min of the original score and the rescore query score. |
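The table can be read as a small function. The sketch below is an illustration of those five modes, not Elasticsearch's code; `combine_scores` is a made-up name, and it assumes that the optional weights (both 1 by default) are applied to each score before the two are combined.

```python
def combine_scores(original, rescore, score_mode="total",
                   query_weight=1.0, rescore_query_weight=1.0):
    """Combine an original score and a rescore query score (illustrative only)."""
    a = original * query_weight          # weighted original score (assumption)
    b = rescore * rescore_query_weight   # weighted rescore query score (assumption)
    modes = {
        "total": lambda: a + b,
        "multiply": lambda: a * b,
        "avg": lambda: (a + b) / 2,
        "max": lambda: max(a, b),
        "min": lambda: min(a, b),
    }
    if score_mode not in modes:
        raise ValueError(f"unknown score_mode: {score_mode}")
    return modes[score_mode]()


for mode in ("total", "multiply", "avg", "max", "min"):
    print(mode, combine_scores(2.0, 3.0, mode))
```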
The rescore query can be of the same type as any normal query. And finally we can use the parameters query_weight and rescore_query_weight to assign weights to each query when the scores are combined using the score_mode. By default both have the value 1.
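Putting the pieces together, a rescore request could look like the sketch below. The parameter names (`window_size`, `score_mode`, `rescore_query`, `query_weight`, `rescore_query_weight`) are the ones discussed above; the choice of primary query, the boosts and the search text are assumptions made for illustration, and the window of 500 echoes the earlier example.

```python
import json

search_text = "running shoes"  # hypothetical search text

rescore_search = {
    # Primary query: simple and fast, run against every matching document.
    "query": {"match": {"foo": search_text}},
    "rescore": {
        # Number of top docs per shard that the slow query will re-score.
        "window_size": 500,
        "query": {
            "score_mode": "total",
            # Secondary query: the more expensive multi-field query.
            "rescore_query": {
                "bool": {
                    "should": [
                        {"match": {"foo.stemmed": {"query": search_text, "boost": 2.0}}},
                        {"match": {"foo.shingles": {"query": search_text, "boost": 1.5}}},
                    ]
                }
            },
            "query_weight": 1.0,
            "rescore_query_weight": 1.0,
        },
    },
}

print(json.dumps(rescore_search, indent=2))
```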
After trying out this approach in the real world you may see how in most cases it returns results quicker and is, performance-wise, a better solution. So, if you are aiming to optimize your queries, try this strategy and let us know how it goes!