We at Tryolabs are big fans of Elasticsearch, so much so that we are even sponsoring the first ever Elasticon, which is taking place in March in San Francisco.
This time we are diving a little deeper into some of its more interesting features: we are going to talk about Analyzers and how to do cool things with them.
August 2018: Please note that this post was written for an older version of Elasticsearch. Changes in the code might be necessary to adapt it to the latest versions and best practices.
As you may know, Elasticsearch lets you customize how things are indexed through the Analyzers of the index analysis module. Analyzers define how Lucene processes and indexes the data. Each one is composed of a Tokenizer, zero or more TokenFilters and, optionally, CharFilters.
Tokenizers are used to split a string into a stream of tokens. For example, a basic Tokenizer will do the following:
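As a rough illustration of what a basic tokenizer does, here is a minimal Python sketch that splits on whitespace. It is a toy stand-in, not Lucene's implementation, and the function name `whitespace_tokenize` is made up for this example.

```python
def whitespace_tokenize(text):
    """Split a string into tokens on runs of whitespace (toy tokenizer)."""
    return text.split()


tokens = whitespace_tokenize("The quick brown fox")
print(tokens)  # ['The', 'quick', 'brown', 'fox']
```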
The TokenFilters, on the other hand, accept a stream of tokens and can modify them, remove them or add new tokens. To name some possibilities, a TokenFilter can apply stemming, remove stop words or add synonyms.
We are not focusing on CharFilters here; they are used to preprocess characters before they are sent to the tokenizer.
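The three kinds of behaviour (modify, remove, add) can be sketched with plain Python functions over a list of tokens. This only mimics the idea; the function names, the stop word list and the synonym map are all invented for illustration and have nothing to do with the real Lucene classes.

```python
# Illustrative stop word list, not the one Elasticsearch ships with.
STOP_WORDS = {"the", "a", "an", "and", "of"}


def lowercase_filter(tokens):
    """Modify tokens: lowercase each one."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Remove tokens: drop anything found in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def synonym_filter(tokens, synonyms):
    """Add tokens: emit each token followed by its synonyms, if any."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(synonyms.get(t, []))
    return out


lowered = lowercase_filter(["The", "Quick", "fox"])
kept = stop_filter(lowered)
expanded = synonym_filter(kept, {"quick": ["fast"]})
print(lowered)   # ['the', 'quick', 'fox']
print(kept)      # ['quick', 'fox']
print(expanded)  # ['quick', 'fast', 'fox']
```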
In order to use different combinations of Tokenizers and TokenFilters you need to create an Analyzer in your index settings and then use it in your mapping.
For example, let's suppose we want an Analyzer that tokenizes in the standard way and then applies a lowercase filter and stemming.
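A possible shape for such an index definition is sketched below as a Python dictionary that serializes to the JSON body you would send when creating the index. The names `custom_english_stemmer`, `custom_lowercase_stemmed`, `test` and `text` come from this post; the field type `"string"` and the per-type mapping layout are assumptions that match older Elasticsearch versions (newer versions use `"text"` and no mapping types), so treat this as a sketch to adapt rather than a definitive configuration.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            # The stemmer needs a language, so it is declared as a named filter.
            "filter": {
                "custom_english_stemmer": {
                    "type": "stemmer",
                    "language": "english",
                }
            },
            "analyzer": {
                "custom_lowercase_stemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Applied in this order: lowercase first, then stemming.
                    "filter": ["lowercase", "custom_english_stemmer"],
                }
            },
        }
    },
    "mappings": {
        # Assumption: old-style mapping with a type named "test".
        "test": {
            "properties": {
                "text": {
                    "type": "string",
                    "analyzer": "custom_lowercase_stemmed",
                }
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```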
This may seem like a load of gibberish but let me explain it step by step.
At the first level we have two keys, "settings" and "mappings". "settings" is where all the index settings need to go, and "mappings" is where you put all the mappings for the types of your index.
First let’s focus on "settings". There are a lot of possible index settings: replica settings, read settings, cache settings and, of course, the analysis settings, which are the ones we are interested in.
In the analysis JSON we have both an analyzer and a filter section defined. When using a custom analyzer, some of the available filters need to be defined explicitly because they have mandatory options. In our case the stemming filter needs to have a language defined, which is why we first define our "custom_english_stemmer".
Now that the filter is defined we can define the analyzer which will use it.
We name the analyzer custom_lowercase_stemmed, but you can use any name you want. In this example we are using the "standard" tokenizer and we define the list of filters to use.
The order of the list matters, since it is the order in which the filters are applied to the tokens in the indexing pipeline.
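To see why the order matters, consider this small Python sketch with two toy filters (both hypothetical, written only for this example). Running the stop word filter before lowercasing lets a capitalized "The" slip through, while the reverse order removes it.

```python
STOP_WORDS = {"the"}  # illustrative, lowercase-only stop word list


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def run_pipeline(tokens, filters):
    """Apply each filter in list order, feeding the output of one into the next."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = ["The", "Fox"]
lower_then_stop = run_pipeline(tokens, [lowercase_filter, stop_filter])
stop_then_lower = run_pipeline(tokens, [stop_filter, lowercase_filter])
print(lower_then_stop)  # ['fox']
print(stop_then_lower)  # ['the', 'fox']
```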
Finally, we can use this newly created analyzer in the mapping.
This way we are telling Elasticsearch that there is a type called "test" with a field called "text" that needs to be analyzed using the custom_lowercase_stemmed analyzer.
There is a special endpoint, /index/_analyze, where you can see the stream of tokens produced after applying the analyzer.
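As a sketch of how such a call could be put together, the snippet below builds (but does not send) an HTTP request for the analyze endpoint using only the Python standard library. The host `http://localhost:9200`, the index name `my_index` and the helper `build_analyze_request` are assumptions for illustration. The JSON body with `analyzer` and `text` is the form used by more recent Elasticsearch versions; older versions took the analyzer as a query-string parameter instead, so check the documentation for the version you run.

```python
import json
from urllib.request import Request


def build_analyze_request(base_url, index, analyzer, text):
    """Build a POST request for <base_url>/<index>/_analyze without sending it."""
    url = f"{base_url}/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and index name.
req = build_analyze_request(
    "http://localhost:9200", "my_index", "custom_lowercase_stemmed", "Jumping Dogs"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would perform the call against a running cluster; the response lists the tokens the analyzer produced.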
Table: original tokens compared with the tokens produced after lowercasing and stemming.
So you have created the analyzer, used it in a mapping and indexed some documents. To search with it, you just need to create a match query. Match queries automatically apply the field's analyzer to the query text before searching.
For example, if you index
Then you can create a query:
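A hypothetical document and query pair could look like the following; the sentence and the search terms are invented for illustration. The document's "text" field goes through the analyzer at index time, and the match query runs its own text through the same analyzer, so differently cased and inflected forms of a word (for example "Jumping" and "jump") would be expected to end up as the same token.

```python
import json

# Made-up document for the "text" field defined in the mapping.
document = {"text": "Jumping Dogs"}

# Match query on the same field; the query text is analyzed before searching.
query = {"query": {"match": {"text": "jump dog"}}}

print(json.dumps(document))
print(json.dumps(query))
```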
This would return the document you have just indexed, which would not have been returned without the custom analyzer.
Analyzers are a powerful and essential tool for relevance engineering. When starting with Elasticsearch you need to get acquainted with the different filters and tokenizers Elasticsearch provides so you can seize its full potential.
In future posts we are going to detail how you can use more powerful filters like n-grams, shingles and synonyms for more specific use cases.
At Tryolabs we're Elastic official partners. If you want to talk about Elasticsearch, ELK, applications and possible projects using these technologies, drop us a line to email@example.com (or fill out this form) and we will be glad to connect!