Wed, Feb 12, 2014
A couple of months ago I faced a problem, tried many things, and went deep into Django's code. This week I faced the same problem and realized I had never documented the solution, so I had to spend some time studying the problem all over again. This time I am writing it down on the blog.
This is the scenario: we are using Django 1.6, Postgres and Scrapy. If you are not familiar with Scrapy, don't worry, you can think of a spider as a "long-running process". In this project we have a view that schedules a Scrapy spider to run and scrape sites inside a Celery task, following the Scrapy guidelines.
August 2018: Please note that this post was written for an older version of Celery. Changes in the code might be necessary to adapt it to the latest versions and best practices.
The scraping process takes around 10 hours and after finishing the scraping process we want to flag the search (a Django model) as finished. To give some context, this is the piece of the code used:
Everything looks fine here, and this works in a scenario where the spider takes a couple of minutes to run, but when the spider takes 10 hours things change. In that case, if you try to update a model after the process finishes, you will get an InterfaceError from the psycopg2 backend.
InterfaceError is quite a broad exception, but given the symptoms and the solution this is what I think is happening:
The database engine closed the connection due to a timeout, and when you try to use that connection psycopg2 raises this exception. Django, for its part, doesn't recover because it never catches the exception. I guess Django assumes the database engine never closes the connection, because the ORM is usually used inside a view or something similar that doesn't take this long.
So, to solve this problem we have to catch that exception and then manually close the connection. The code with the solution is the following:
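What follows is only a minimal sketch of that catch, close and retry pattern, not the exact code from the project. The helper name save_with_reconnect and the stand-in classes (StaleConnectionError, FakeConnection, FakeSearch) are hypothetical; they exist so the snippet runs without a database. In a real Django project I would expect to pass django.db.connection and django.db.InterfaceError instead, assuming that closing the connection makes Django open a fresh one on the next query.

```python
def save_with_reconnect(save, connection, stale_errors):
    """Call save(); if the connection is dead, discard it and retry once."""
    try:
        return save()
    except stale_errors:
        # The database already dropped its side of the connection.
        # Closing ours discards the dead handle, so the retry below
        # can run on a fresh connection.
        connection.close()
        return save()


# --- Hypothetical stand-ins so the sketch runs without a database ---

class StaleConnectionError(Exception):
    """Plays the role of InterfaceError in this demo."""


class FakeConnection:
    def __init__(self, alive=False):
        self.alive = alive       # False: pretend the server timed us out
        self.close_calls = 0

    def close(self):
        self.close_calls += 1
        self.alive = True        # the next use "reconnects"


class FakeSearch:
    """Plays the role of the Django model that gets flagged as finished."""

    def __init__(self, connection):
        self.connection = connection
        self.finished = False

    def mark_finished(self):
        if not self.connection.alive:
            raise StaleConnectionError("connection is dead")
        self.finished = True
        return self.finished


conn = FakeConnection()
search = FakeSearch(conn)
result = save_with_reconnect(search.mark_finished, conn, StaleConnectionError)

# Assumed wiring in a Django project (not executed here):
#   from django.db import connection, InterfaceError
#   save_with_reconnect(search.save, connection, InterfaceError)
```

The helper retries only once on purpose: if the second attempt also fails, the problem is probably not a stale connection and the exception should surface.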
With this solution, if the process takes a short time to finish you don't close the database connection (which is an expensive thing to do), and in case it fails you discard the old connection that is already dead and create a new one. Hope this helped you and hope this saves me time in the future :)