Update (12/20/2016): here you can find the 2016 edition of this post.
As the new year approaches, we often sit back and think about what we have accomplished in 2015. Many of our projects would not have been as successful if it were not for the great work done by the open source community, providing some solid, bullet-proof libraries.
Everyone and their grandma seems to be writing top 10 lists, so we couldn’t resist compiling our own. Here is a list of the 10 best Python libraries we used in 2015 that you should know about, in no particular order. We tried to avoid the most established choices such as Django, Flask or Django REST Framework, and went for libraries that might not be as well known. Fasten your seatbelt, here we go!
How hard would it be for a painter to paint without immediately seeing the results of his work? Jupyter Notebook makes it easy to interact with code, plots and results, and is becoming one of the preferred tools of data scientists. Notebooks are documents which combine live code and documentation. For this reason, it is our go-to tool for creating fast prototypes or tutorials.
Although we use Jupyter for writing Python code only, it has added support for other programming languages such as Julia and Haskell.
The retrying library helps you avoid reinventing the wheel: it implements retrying behavior for you. It provides a generic decorator which makes it effortless to give retrying abilities to any method, and it has a bunch of properties you can set to get the desired retrying behavior, such as maximum number of attempts, delay, backoff sleeping, error conditions, etc. Small and simple.
As of 2015, the most important libraries have all been ported to Python 3, so we started embracing it. We really liked asyncio for writing concurrent code using coroutines, so we needed an HTTP client (such as requests) and server that use the same concurrency paradigm. The aiohttp library is exactly that, providing a clean and easy-to-use HTTP client/server for asyncio.
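A sketch of the client side, assuming a recent aiohttp with `async`/`await` syntax; the URL is just a placeholder:

```python
import asyncio

import aiohttp

async def fetch(url):
    # One client session, reused for all requests made with it.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

if __name__ == "__main__":
    # Placeholder URL -- substitute a real endpoint before running.
    body = asyncio.get_event_loop().run_until_complete(
        fetch("http://example.com"))
    print(body[:80])
```

While one `fetch` waits on the network, the event loop is free to run other coroutines, which is the whole point of the shared concurrency paradigm.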
We have tried several solutions for subprocess wrappers in order to call other scripts or executables from Python programs, but the model of plumbum blows them all away. With an easy to use syntax you can execute local or remote commands, get the output or error codes in a cross-platform way, and if that were not enough, you get composability (a la shell pipes) and an interface for building command line applications. Give it a try!
Working with and validating phone numbers can be a real pain, as there are international prefixes and area codes to take into account, and possibly other things depending on the country. The phonenumbers Python library is a port of Google’s libphonenumber, which thankfully simplifies this. It can be used to parse, format and validate phone numbers with very little code involved. Most importantly, phonenumbers can tell whether a phone number is unique or not (following the E.164 format). It works on both Python 2 and Python 3.
We have used this library extensively in many projects, mostly through its adaptation django-phonenumber-field, as a way to solve this tedious problem that pretty much always pops up.
Graphs and networks are often used for many different tasks, such as organizing data, showing its flow, or representing relations between entities. NetworkX allows the creation and manipulation of graphs and networks. The algorithms used in NetworkX make it highly scalable, so it is ideal when working with large graphs is required. Moreover, there are tons of options for rendering graphs, making it an awesome visualization tool too.
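A tiny sketch of building a graph and running one of the bundled algorithms:

```python
import networkx as nx

# A small directed graph: edges could represent dependencies, links, etc.
g = nx.DiGraph()
g.add_edges_from([("a", "b"), ("b", "c"), ("a", "c")])

print(nx.shortest_path(g, "a", "c"))  # ['a', 'c'] -- the direct edge wins
print(g.number_of_nodes(), g.number_of_edges())  # 3 3
```

The same graph object can then be handed to the drawing helpers (e.g. via matplotlib) for visualization.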
If you are thinking about storing loads of data on a time-series basis, then you have to consider InfluxDB. InfluxDB is a time-series database we have been using to store measurements over time. Thanks to its RESTful API, it is super easy to use and very efficient, which is a must when dealing with lots of data. Additionally, retrieving and grouping data is painless due to its built-in clustering functionalities. The official Python client abstracts away most of the work of invoking the API, although we would really like to see it improved with a Pythonic way to build queries instead of writing raw JSON.
If you have ever used Elasticsearch, you have surely suffered going over those long queries in JSON format, wasting time trying to find out where the parsing error is. The Elasticsearch DSL client is built upon the official Elasticsearch client and frees you from having to worry about JSON again: you simply write everything using Python-defined classes or queryset-like expressions. It also provides wrappers for working with documents as Python objects, mappings, etc.
Deep learning is the new trend, and this is where keras shines. It can run on top of Theano and allows fast experimentation with a variety of neural network architectures. Highly modular and minimalistic, it runs seamlessly on both CPU and GPU. Having something like keras was key for some of the R&D projects we tackled in 2015.
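The modularity shows in how little code a network takes. A sketch of a tiny fully connected model (layer sizes and training settings are arbitrary examples):

```python
from keras.layers import Dense
from keras.models import Sequential

# Stack layers one by one: a hidden layer and a single sigmoid output.
model = Sequential()
model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

# Pick an optimizer and a loss; the backend (e.g. Theano) does the rest.
model.compile(optimizer="sgd", loss="binary_crossentropy")
print(len(model.layers))  # 2
```

Swapping an architecture is just adding or replacing `model.add(...)` lines, which is what makes experimentation so fast.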
If you are into NLP (Natural Language Processing) and haven’t heard about Gensim, you are living under a rock. It provides fast, scalable (memory-independent) implementations of some of the most widely used algorithms, such as tf-idf, word2vec, doc2vec and LSA, as well as an easy-to-use, well-documented interface.
Bonus: MonkeyLearn Python
We couldn’t leave out MonkeyLearn. A product of Tryolabs which branched off into its own company, it offers text mining in the cloud via an easy-to-use RESTful API. You can get insights about text such as sentiment and the most important keywords, perform topic detection, or run any other task you can build with a custom text classifier. MonkeyLearn Python is the official Python client for the API, supporting both Python 2 and 3.
At Tryolabs we are great at developing heavy Python backends with Machine Learning components. If you need help with these kinds of projects, drop us a line at firstname.lastname@example.org (or fill out this form) and we’ll happily connect.