Significant terms aggregation is one of our favorite features in Elasticsearch. Initially included in the 1.1.0 release (March 2014), it is still an experimental feature.
In contrast with other types of Elasticsearch aggregations, which generally compute counts, sums, and other simple math operations, significant terms tries to find statistical anomalies in your data, or as the Elastic team calls it, finding the "uncommonly common".
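To make "uncommonly common" a bit more concrete, here is a minimal sketch of a significance score in the spirit of the JLH heuristic described in the Elasticsearch documentation (absolute change multiplied by relative change). The function name and all the numbers are invented for illustration; they are not taken from the SFPD dataset or from our demo code.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """Score how over-represented a term is in a foreground set.

    fg_count / fg_total: share of the filtered (foreground) documents with the term.
    bg_count / bg_total: share of all (background) documents with the term.
    """
    if fg_total == 0 or bg_total == 0 or bg_count == 0:
        return 0.0
    fg_pct = fg_count / fg_total
    bg_pct = bg_count / bg_total
    absolute_change = fg_pct - bg_pct
    if absolute_change <= 0:
        return 0.0  # the term is not more common than in the background
    relative_change = fg_pct / bg_pct
    return absolute_change * relative_change


# Hypothetical map cell A: many drug incidents, but even more incidents overall.
print(jlh_score(500, 10_000, 20_000, 100_000))  # 0.0

# Hypothetical map cell B: fewer drug incidents, but very few incidents overall.
print(jlh_score(300, 10_000, 500, 100_000))  # about 0.15
```

A plain count would rank cell A first, while the score ranks cell B first, because drug incidents make up an unusually large share of what happens there.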
There are plenty of excellent resources on what the significant terms aggregation does, and some practical examples of how to use it. But there isn’t any interactive demo in which you can see firsthand some of the possibilities of this type of aggregation.
We decided to build an interactive demo using geolocated SFPD incident data published by DataSF: around 1.7 million incidents (since 2003), from DUIs to kidnappings. Each incident also includes the type of resolution, date and time, day of the week, address, and coordinates.
The idea of our demo is that, on a map of SF, you can add heatmap layers for any incident category, day, or resolution, using either a simple aggregation (counting) or the significant terms aggregation.
Note: The visualization is limited to one year at a time because of the huge number of incidents. You can change the year and all the layers will refresh accordingly.
Initially, the demo shows a basic heatmap layer displaying all the incidents without filtering.
As you can see, most incidents occur in the downtown area of SF.
Let’s add a new layer in order to see the Drug/Narcotic incidents aggregation without using significant terms.
As you can see, most Drug/Narcotic incidents are centered around downtown as well. There is also a big green spot around the BART station at 16th St and Mission. So, according to the SFPD data, this area is a hotspot for criminal activity.
We can see some other spots that are highlighted more than in the "basic layer", but it’s not easy to see where Drug/Narcotic incidents are statistical anomalies.
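For reference, a counting layer like the one above could be requested with a body along these lines. This is only a sketch under assumptions: the field names (`category`, `date`, `location`), the function name, and the `bool`/`filter` query syntax of recent Elasticsearch versions are our choices for illustration, not necessarily what the demo uses.

```python
def count_layer_request(category, year, precision=7):
    """Build a search body that counts incidents per geohash cell.

    Field names are hypothetical; `location` is assumed to be a geo_point.
    """
    return {
        "size": 0,  # we only want the buckets, not the incidents themselves
        "query": {
            "bool": {
                "filter": [
                    {"term": {"category": category}},
                    {"range": {"date": {"gte": f"{year}-01-01",
                                        "lt": f"{year + 1}-01-01"}}},
                ]
            }
        },
        "aggs": {
            # Each bucket is a map cell with a plain doc_count.
            "cells": {
                "geohash_grid": {"field": "location", "precision": precision}
            }
        },
    }


body = count_layer_request("DRUG/NARCOTIC", 2014)
```

Each bucket's `doc_count` becomes the heat of its cell, which is why busy areas such as downtown dominate whatever category you pick.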
Let’s see the result if we delete the layer and create a new one, ticking the checkbox to use significant terms aggregation. Add the layer and you'll see something similar to this:
Now we are talking!
The first thing we notice is a big green spot around Twin Peaks. Besides downtown, there is also a lot of activity around Golden Gate Park. That also corresponds with what I could quickly find in Yelp reviews mentioning drugs at Hippie Hill. So, the data checks out.
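A significant terms layer like this one could be requested with a body similar to the sketch below. It assumes each incident carries a precomputed geohash string in a hypothetical `geohash_cell` field, since significant terms works on terms rather than on raw coordinates; the other field names and the function name are also assumptions. The `background_filter` option restricts the comparison set to incidents from the same year.

```python
def significant_layer_request(category, year, size=200):
    """Build a search body that finds map cells where a category is unusually common.

    Field names are hypothetical; `geohash_cell` is assumed to hold a
    precomputed geohash string for each incident.
    """
    year_range = {"range": {"date": {"gte": f"{year}-01-01",
                                     "lt": f"{year + 1}-01-01"}}}
    return {
        "size": 0,
        # Foreground set: incidents of one category in one year.
        "query": {
            "bool": {"filter": [{"term": {"category": category}}, year_range]}
        },
        "aggs": {
            "hot_cells": {
                "significant_terms": {
                    "field": "geohash_cell",
                    "size": size,
                    # Background set: every incident in that same year.
                    "background_filter": year_range,
                }
            }
        },
    }


body = significant_layer_request("DRUG/NARCOTIC", 2014)
```

Each returned bucket carries a `score` next to its `doc_count`, so the heatmap can be weighted by how unusual a cell is rather than by how busy it is.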
In conclusion, significant terms aggregation is a powerful tool that allows us to easily notice the uncommon. There are plenty of uses for this besides geo analysis. Some of them are mentioned in the original Elastic blog post, which you should check out if you haven't already.