Elasticsearch Analyzers (or How I Learned to Stop Worrying and Love Custom Filters)

We at Tryolabs are big fans of Elasticsearch, so much so that we are even sponsoring the first-ever Elasticon, which takes place in March in San Francisco.

This time we are diving a little deeper into some of its more interesting features: we are going to talk about Analyzers and how to do cool things with them.

August 2018: Please note that this post was written for an older version of Elasticsearch. Changes in the code might be necessary to adapt it to the latest versions and best practices.

Analyzers

As you may know, Elasticsearch lets you customize how text is indexed through the Analyzers of the index analysis module. Analyzers define how Lucene processes and indexes the data. Each one is composed of:

  • 0 or more CharFilters
  • 1 Tokenizer
  • 0 or more TokenFilters

Tokenizers split a string into a stream of tokens. For example, a basic Tokenizer will do the following:

"Our goal at Tryolabs is to help Startups"
-> Tokenizer ->
["Our", "goal", "at", "Tryolabs", "is", "to", "help", "Startups"]
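
To make the idea concrete, here is a minimal sketch of a tokenizer in plain Python. It is not how Lucene implements the standard tokenizer; the function name simple_tokenize and the "split on non-word characters" rule are assumptions made for illustration only.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens, treating any run of non-word
    characters (spaces, punctuation) as a separator.

    Illustrative only: real tokenizers follow more elaborate rules.
    """
    return re.findall(r"\w+", text)


tokens = simple_tokenize("Our goal at Tryolabs is to help Startups")
print(tokens)
```

The output is a list of tokens in their original order and casing; changing them is the job of the TokenFilters.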

TokenFilters, on the other hand, accept a stream of tokens and can modify them, remove them, or add new ones. To name a few possibilities, a TokenFilter can apply stemming, remove stop words, or add synonyms.
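
A rough way to picture TokenFilters is as functions that take a list of tokens and return a new list. The sketch below is plain Python, not Elasticsearch code; the stop word set and the synonym table are tiny hypothetical examples chosen just for this illustration.

```python
# Hypothetical data, used only for this illustration.
STOP_WORDS = {"at", "is", "to"}
SYNONYMS = {"startups": ["companies"]}


def lowercase_filter(tokens):
    """Modify tokens: lowercase each one."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Remove tokens: drop the ones found in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def synonym_filter(tokens):
    """Add tokens: emit each token followed by its synonyms, if any."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out


def apply_filters(tokens, filters):
    """Run the token list through each filter in turn."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = ["Our", "goal", "at", "Tryolabs", "is", "to", "help", "Startups"]
result = apply_filters(tokens, [lowercase_filter, stop_filter, synonym_filter])
print(result)
```

Each filter illustrates one of the three things a TokenFilter can do: modify, remove, or add tokens.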

We will not focus on CharFilters in this post; they are used to preprocess characters before the text is sent to the Tokenizer.

Elasticsearch provides a large number of Tokenizers and TokenFilters, and you can create custom ones and install them as a plugin (although you may need to dive deep into Elasticsearch’s code base).

How to use Analyzers

In order to use different combinations of Tokenizers and TokenFilters you need to create an Analyzer in your index settings and then use it in your mapping.

For example, let’s suppose we want an Analyzer that tokenizes in the standard way and then applies a lowercase filter and stemming.

curl -X POST http://127.0.0.1:9200/tryoindex/ -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_english_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      },
      "analyzer": {
        "custom_lowercase_stemmed": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "custom_lowercase_stemmed"
        }
      }
    }
  }
}'

This may seem like a load of gibberish, but let me explain it step by step.

At the top level we have two keys: “settings” and “mappings”. “settings” is where all the index settings go, and “mappings” is where you put the mappings for the types of your index.

First, let’s focus on “settings”. There are many possible index settings: replica settings, read settings, cache settings and, of course, the analysis settings we are interested in.

In the analysis JSON we have both analyzers and filters defined. When building a custom analyzer, some of the available filters need to be defined explicitly because they require configuration. In our case the stemming filter needs a language, which is why we first define our “custom_english_stemmer”.

{
  "custom_english_stemmer": {
    "type": "stemmer",
    "name": "english"
  }
}

Now that the filter is defined we can define the analyzer which will use it.

{
  "analyzer": {
    "custom_lowercase_stemmed": {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "custom_english_stemmer"
      ]
    }
  }
}

We named the analyzer custom_lowercase_stemmed, but you can use any name you want. In this example we use the “standard” tokenizer and define the list of filters to apply:

  • lowercase: a filter provided by Elasticsearch that needs no extra configuration (though you can provide a language parameter for some languages, such as Greek or Turkish).
  • custom_english_stemmer: the filter we defined before.
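
Since an analyzer refers to its filters by name, a quick sanity check before creating the index is to confirm that every name in the list is either defined under "filter" or usable without configuration. The helper below is a hypothetical sketch in Python, not part of any Elasticsearch client, and NO_CONFIG_FILTERS is only an illustrative subset of the filters Elasticsearch ships with.

```python
# Illustrative subset of filters that can be referenced by name
# without a custom definition (an assumption for this sketch).
NO_CONFIG_FILTERS = {"lowercase", "asciifolding", "reverse"}


def undefined_filters(analysis):
    """Return {analyzer_name: [filter names that are neither defined
    in analysis["filter"] nor listed in NO_CONFIG_FILTERS]}."""
    custom = set(analysis.get("filter", {}))
    missing = {}
    for name, analyzer in analysis.get("analyzer", {}).items():
        unknown = [
            f for f in analyzer.get("filter", [])
            if f not in custom and f not in NO_CONFIG_FILTERS
        ]
        if unknown:
            missing[name] = unknown
    return missing


analysis = {
    "filter": {
        "custom_english_stemmer": {"type": "stemmer", "name": "english"}
    },
    "analyzer": {
        "custom_lowercase_stemmed": {
            "tokenizer": "standard",
            "filter": ["lowercase", "custom_english_stemmer"],
        }
    },
}
print(undefined_filters(analysis))
```

For the settings used in this post the check finds nothing missing; removing the "filter" section would report custom_english_stemmer as undefined.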

The order of the list matters: it is the order in which the filters are applied to the tokens in the indexing pipeline.
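
A small Python sketch shows why the order matters. It assumes a case-sensitive stop word filter with a made-up two-word list; the functions are stand-ins written for this example, not Elasticsearch's own filters.

```python
# Hypothetical stop word list, lowercase only.
STOP_WORDS = {"the", "and"}


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    # Case-sensitive comparison: "The" is not the same as "the".
    return [t for t in tokens if t not in STOP_WORDS]


def run_chain(tokens, filters):
    """Apply the filters one after another, in list order."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = ["The", "monkeys", "and", "THE", "kangaroos"]

lower_then_stop = run_chain(tokens, [lowercase_filter, stop_filter])
stop_then_lower = run_chain(tokens, [stop_filter, lowercase_filter])

print(lower_then_stop)
print(stop_then_lower)
```

With lowercasing first, every variant of "the" is removed. With the stop filter first, "The" and "THE" slip through and are only lowercased afterwards, so they end up in the index.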

Finally, we can use this newly created analyzer in the mapping.

{
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "custom_lowercase_stemmed"
        }
      }
    }
  }
}

This way we are telling Elasticsearch that there is a type called “test” with a field called “text” that must be analyzed using the custom_lowercase_stemmed analyzer.

Testing the analyzer

There is a special endpoint /index/_analyze where you can see the stream of tokens after applying the analyzer.

curl 'http://127.0.0.1:9200/tryoindex/_analyze?analyzer=custom_lowercase_stemmed' \
  -d 'Tryolabs running monkeys KANGAROOS and jumping elephants'

{
  "tokens": [
    {
      "token": "tryolab",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "run",
      "start_offset": 10,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "monkei",
      "start_offset": 18,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "kangaroo",
      "start_offset": 26,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "and",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "jump",
      "start_offset": 40,
      "end_offset": 47,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "eleph",
      "start_offset": 48,
      "end_offset": 57,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}

Original     Lowercase & stemming
Tryolabs     tryolab
running      run
monkeys      monkei
KANGAROOS    kangaroo
and          and
jumping      jump
elephants    eleph
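
If you want to check the analyzer output from a script instead of reading the JSON by eye, a few lines of Python are enough to pull out the token strings. This is only a sketch: SAMPLE_RESPONSE is an abbreviated copy of the first two tokens of the response above (the "type" field is left out), and token_strings is a hypothetical helper name.

```python
import json

# Abbreviated copy of the response above (first two tokens only).
SAMPLE_RESPONSE = """
{
  "tokens": [
    {"token": "tryolab", "start_offset": 0, "end_offset": 9, "position": 1},
    {"token": "run", "start_offset": 10, "end_offset": 17, "position": 2}
  ]
}
"""


def token_strings(response_text):
    """Return the analyzed tokens, ordered by their position."""
    body = json.loads(response_text)
    ordered = sorted(body["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in ordered]


print(token_strings(SAMPLE_RESPONSE))
```

Comparing this list against the tokens you expect is a cheap regression test whenever you change the filter list.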

Querying using the analyzer

So you have created the analyzer, used it in a mapping, and indexed some documents. To take advantage of it, you just need a match query: match queries automatically apply the field’s analyzer to the query text before searching.

For example, if you index

{
  "text": "JOHN LIKES RUNNING IN THE RAIN"
}

Then you can create a query:

curl -XGET 'http://127.0.0.1:9200/tryoindex/test/_search' -d '
{
  "query": {
    "match": {
      "text": "run"
    }
  }
}'

This returns the document you just indexed, which would not have matched without the custom analyzer.

{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5,
    "hits": [
      {
        "_index": "tryoindex",
        "_type": "test",
        "_id": "AUtgVQpqJh3uf--po6Ij",
        "_score": 0.5,
        "_source": {
          "text": "JOHN LIKES RUNNING IN THE RAIN"
        }
      }
    ]
  }
}
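
The reason this works is that the same analysis runs on both sides: on the document at index time and on the query text at search time. The Python sketch below imitates that with a deliberately crude toy stemmer. It is an assumption-laden stand-in for the real English stemmer, and matches is a hypothetical helper that mimics a match query where any shared term is enough.

```python
import re


def toy_stem(token):
    """Very rough suffix stripping, NOT the real English stemmer."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        # Collapse a doubled final letter: "runn" -> "run".
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]
    elif token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token


def analyze(text):
    """Tokenize on non-word characters, lowercase, then stem."""
    return [toy_stem(t.lower()) for t in re.findall(r"\w+", text)]


def matches(document_text, query_text):
    """True if any analyzed query term equals an analyzed document term."""
    indexed_terms = set(analyze(document_text))
    return any(term in indexed_terms for term in analyze(query_text))


doc = "JOHN LIKES RUNNING IN THE RAIN"
print(analyze(doc))
print(matches(doc, "run"))           # analyzed on both sides
print("run" in doc.lower().split())  # lowercasing alone is not enough
```

Without stemming, "run" and "running" are different terms, so the lookup fails even after lowercasing; once both sides go through the same analysis, they meet on the term "run".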

Conclusion

Analyzers are a powerful and essential tool for relevance engineering. When starting with Elasticsearch, you need to get acquainted with the different filters and tokenizers it provides so you can take full advantage of them.

In future posts we are going to detail how you can use more powerful filters like n-grams, shingles and synonyms for more specific use cases.

At Tryolabs we’re official Elastic partners. If you want to talk about Elasticsearch, ELK, or applications and possible projects using these technologies, drop us a line at hello@tryolabs.com and we will be glad to connect!

