Tue, Feb 17, 2015
Lately, here at Tryolabs, we have been gaining interest in big data and search-related platforms, which give us excellent resources for building complex web applications. One of them is Elasticsearch.
Elastic{ON}15, the first ES conference, is coming up, and since we see a lot of interest in this technology nowadays, we are taking the opportunity to give an introduction and a simple example for the Python developers out there who want to begin using it or give it a try.
Elasticsearch is a distributed, real-time search and analytics platform.
So, what does all of that mean? Good question! The previous definition is full of hype-sounding tech terms (distributed, real-time, analytics), so let's try to explain them.
ES is distributed: it organizes information in clusters of nodes, so it can run on multiple servers if we intend it to.
ES is real-time: since data is indexed, we get responses to our queries super fast!
And last but not least, it does searches and analytics. The main problem we are solving with this tool is exploring our data!
A platform like ES is the foundation for any respectable search engine.
Using a RESTful API, Elasticsearch saves data and indexes it automatically. It assigns types to fields, and that way a search can be done smartly and quickly using filters and different queries.
It uses the JVM in order to be as fast as possible. It distributes indexes in "shards" of data and replicates shards across different nodes, so it's distributed and clusters can function even if not all nodes are operational. Adding nodes is super easy, and that's what makes it so scalable.
ES uses Lucene to solve searches. This is quite an advantage compared with, for example, Django query strings. A RESTful API call allows us to perform searches using JSON objects as parameters, making it much more flexible and letting us give each search parameter within the object a different weight, importance and/or priority.
The final result ranks objects that comply with the search query requirements. You can even use synonyms, autocompletes, spell suggestions and typo correction. While the usual query strings provide results that follow strict logic rules, ES queries give you a ranked list of results that may match different criteria, and their order depends on how well they comply with a certain rule or filter.
ES can also provide answers for data analysis, like averages, unique term counts and other statistics. This is done using aggregations. To dig a little deeper into this feature, check the documentation here.
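To make this concrete, here is a minimal sketch of what an aggregation request body looks like (the field name is just illustrative):

```python
# A "terms" aggregation: counts documents per distinct value of a field.
# This dict is the kind of JSON body you send to the _search endpoint.
aggs_body = {
    "size": 0,  # return only the aggregation buckets, not the matching docs
    "aggs": {
        "per_category": {
            "terms": {"field": "category"}  # illustrative field name
        }
    }
}
```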
The main point is scalability and getting results and insights very fast. In most cases, using Lucene directly could be enough to have all you need.
It sometimes seems that these tools are designed for projects with tons of data and are distributed in order to handle tons of users. Startups dream of reaching that scenario, but they may start small first to build a prototype, and only once the data is there begin thinking about scaling problems.
Does it make sense, and does it pay off, to be prepared to grow A LOT? Why not? Elasticsearch has no real drawbacks and is easy to use, so adopting it is just a decision to be prepared for the future.
I'm going to give you a quick example of a dead-simple project using Elasticsearch to quickly and beautifully search some example data. It will be quick to build, Python-powered and ready to scale in case we need it to: the best of both worlds.
For the following part it would be nice to be familiar with concepts like cluster, node, document and index. Take a look at the official guide if you have doubts.
August 2018: Please note that this post was written for an older version of Elasticsearch. Changes in the code might be necessary to adapt it to the latest versions and best practices.
First things first, get ES from here.
I followed this video tutorial to get things started in just a minute. I recommend you all check it out later.
Once you've downloaded ES, it's as simple as running bin/elasticsearch, and you will have your ES cluster with one node running! You can interact with it at http://localhost:9200/
If you hit it, you will get something like this (the node name is random and the version fields depend on the release you downloaded):
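```json
{
  "status" : 200,
  "name" : "Wendigo",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.2",
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}
```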
Creating another node is as simple as:
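```bash
# run from a second terminal; on the same machine the new node picks the
# next free HTTP port (9201) and discovers the existing node automatically
bin/elasticsearch
```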
It automatically detects the old node as its master and joins our cluster. By default, we will be able to communicate with this new node on port 9201: http://localhost:9201. Now we can talk with either node and receive the same data; they are supposed to be identical.
To use ES with our all-time favorite language, Python, things get easier if we install the elasticsearch-py package:
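```bash
pip install elasticsearch
```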
Now we will be able to use this package to index and search data using Python.
So, I wanted to make this project a "real world example", I really did, but after I found out there is a Star Wars API (https://swapi.dev/), I couldn't resist it, and it ended up being a fictional "galaxy far, far away" example. The API is dead simple to use, so we will get some data from there.
I'm using an IPython Notebook to do this test. I started with the sample request to make sure we can hit the ES server:
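```python
import requests

# sanity check: the node we just started should answer on port 9200
res = requests.get('http://localhost:9200')
print(res.content)
```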
Then we connect to our ES server using Python and the elasticsearch-py library:
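```python
from elasticsearch import Elasticsearch

# connect to our single-node cluster; these are the client's defaults,
# so Elasticsearch() with no arguments would also work
# (newer client versions take a URL string instead)
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
```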
I added some data to test, and then deleted it. I'm skipping that part for this guide, but you can check it out in the notebook.
Now, using The Force, we connect to the Star Wars API and index some fictional people with a loop along these lines:
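```python
# walk swapi's people endpoint and index each character into ES;
# the "sw" index and the "people" doc_type are created on the fly
i = 1
while True:
    r = requests.get('https://swapi.dev/api/people/%d/' % i)
    if r.status_code != 200:
        break  # swapi answers 404 once we ask for a person that doesn't exist
    es.index(index='sw', doc_type='people', id=i, body=r.json())
    i += 1
```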
Please notice that we automatically created the "sw" index and the "people" doc_type with the indexing command. We get 17 responses from swapi and index them with ES. I'm sure there are many more "people" in the swapi DB, but it seems we are getting a 404 with https://swapi.dev/api/people/17. Bug report here! :-)
Anyway, to see if everything worked with these few results, we try to get the document with id=5:
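```python
# fetch one document by id from the index we just created
es.get(index='sw', doc_type='people', id=5)
```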
We will get Princess Leia; the document's _source field holds her swapi data, where her name appears as "Leia Organa".
Now, let's add more data, this time using node 2! And let's start at the 18th person, where we stopped:
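```python
# a second client pointing at the second node; both nodes belong to the
# same cluster, so they see the same "sw" index
es2 = Elasticsearch([{'host': 'localhost', 'port': 9201}])

i = 18
while True:
    r = requests.get('https://swapi.dev/api/people/%d/' % i)
    if r.status_code != 200:
        break
    es2.index(index='sw', doc_type='people', id=i, body=r.json())
    i += 1
```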
We got the rest of the characters just fine.
Where is Darth Vader? Here is our search query:
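```python
# full-text "match" query on the name field
es.search(index='sw', body={'query': {'match': {'name': 'Darth Vader'}}})
```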
This will give us both Darth Vader AND Darth Maul, ids 4 and 44 (notice that they are in the same index, even though we used different node clients to call the index command). Both results have a score, and Darth Vader's is much higher than Darth Maul's (2.77 vs 0.60), since Vader is an exact match. Take that, Darth Maul!
So, this query will give us results if the word is contained exactly in our indexed data. What if we want to build some kind of autocomplete input, where we get the names that start with the characters we are typing?
There are many ways to do that, and a great number of other queries. Take a look here to learn more. I picked this one, which gets all documents with the prefix "lu" in their name field:
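```python
# prefix query: matches names whose terms start with "lu"
es.search(index='sw', body={'query': {'prefix': {'name': 'lu'}}})
```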
We will get Luke Skywalker and Luminara Unduli, both with the same 1.0 score, since both match the same two initial characters.
There are many other interesting queries we can do. If, for example, we want to get all elements that are similar in some way, for a related-results or spell-correction search, we can use something like a fuzzy query:
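```python
# fuzzy query: tolerates small edit-distance typos ("jaba" ~ "jabba")
es.search(index='sw',
          body={'query': {'fuzzy': {'name': {'value': 'jaba',
                                             'fuzziness': 'AUTO'}}}})
```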
And we got Jabba although we had a typo in our search query. That is powerful!
This was just a simple overview of how to set up your Elasticsearch server and start working with some data using Python. The code used here is publicly available in this IPython notebook.
We encourage you to learn more about ES and especially take a look at the Elastic stack, where you will be able to see beautiful analytics and insights with Kibana and go through logs using Logstash.
In future posts we will talk about more advanced ES features, and we will try to extend this simple test to show a more interesting Django app powered by this data and by ES.
Hope this post was useful for developers trying to enter the ES world.
At Tryolabs we're official Elastic partners. If you want to talk about Elasticsearch, ELK, applications and possible projects using these technologies, drop us a line at hello@tryolabs.com (or fill out this form) and we will be glad to connect!