Long-running processes and the Django ORM
A couple of months ago I faced a problem, tried many things and went deep into Django’s code. This week I faced the same problem again and realized I had never documented the solution, so I had to spend some time re-studying it. This time I am writing it down in the blog.
This is the scenario: we are using Django 1.6, Postgres and Scrapy. If you are not familiar with Scrapy, don’t worry, you can think of a spider as a “long-running process”. In this project we have a view that schedules a Scrapy spider to run and scrape sites, following the Scrapy guidelines, inside a Celery task.
August 2018: Please note that this post was written for an older version of Celery. Changes in the code might be necessary to adapt it to the latest versions and best practices.
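To picture the setup, here is a minimal sketch (not the project’s actual code) of how the view and the Celery task might be wired together. The names run_crawl, start_search, MySpider and the module paths are hypothetical; crawl is the function shown below, and the task decorator may need adjusting for your Celery version.

from celery import shared_task  # adjust to your Celery version's task decorator
from django.http import HttpResponse

from myproject.scraping import crawl    # hypothetical module holding crawl()
from myproject.spiders import MySpider  # hypothetical Scrapy spider


@shared_task
def run_crawl(loglevel, search_id):
    # The long-running part: this call blocks for the whole scraping run.
    crawl(MySpider(), loglevel, search_id)


def start_search(request, search_id):
    # The view only schedules the task and returns immediately.
    run_crawl.delay('INFO', search_id)
    return HttpResponse('Scraping scheduled for search %s' % search_id)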
The scraping process takes around 10 hours, and once it finishes we want to flag the search (a Django model) as finished. To give some context, this is the piece of code used:
from multiprocessing import Process

# These imports assume the (old) Scrapy API this post was written against.
from scrapy import log as scrapy_log, signals
from scrapy.crawler import Crawler
from scrapy.settings import CrawlerSettings
from twisted.internet import reactor


def crawl(spider, loglevel, search_id):
    def _crawl(crawler, spider, loglevel):
        # Run the spider in its own Twisted reactor; stop it when the spider closes.
        crawler.crawl(spider)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.start()
        scrapy_log.start(loglevel=loglevel)
        reactor.run()

    crawler = Crawler(CrawlerSettings(scrapy_settings))  # project's Scrapy settings
    crawler.configure()

    # Run the crawl in a separate process and wait (~10 hours) for it to end.
    p = Process(target=_crawl, args=[crawler, spider, loglevel])
    p.start()
    p.join()

    # Flag the search (a Django model) as finished.
    search = Search.objects.get(id=search_id)
    search.finished = True
    search.save()
The problem: when the spider finishes, around 10 hours later, that final save() fails with

InterfaceError: connection already closed

from the psycopg2 backend.

InterfaceError is a quite broad exception, but given the symptoms and the solution, this is what I think is happening: the database engine closed the connection due to a timeout, and when you try to use it again this exception is raised. Django doesn’t recover from this because the exception is never caught; I guess Django assumes the connection is never closed by the database engine, since the ORM is usually used in a Django context, inside a view or something similar that doesn’t take that long.
So, to solve this problem we have to catch that exception and manually close the stale connection; Django will then open a new one when we retry the save. The code with the solution is the following:
from django.db.utils import InterfaceError
from django import db

# (the Scrapy/multiprocessing imports are the same as in the previous snippet)


def crawl(spider, loglevel, search_id):
    def _crawl(crawler, spider, loglevel):
        crawler.crawl(spider)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.start()
        scrapy_log.start(loglevel=loglevel)
        reactor.run()

    crawler = Crawler(CrawlerSettings(scrapy_settings))
    crawler.configure()
    p = Process(target=_crawl, args=[crawler, spider, loglevel])
    p.start()
    p.join()

    search = Search.objects.get(id=search_id)
    search.finished = True
    try:
        search.save()
    except InterfaceError:
        # Postgres dropped the connection during the long crawl; close the
        # stale connection so Django reconnects, then retry the save.
        db.connection.close()
        search.save()
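An alternative sketch along the same lines (not the solution used above, but based on the same diagnosis) is to not wait for the exception at all: close the possibly stale connection as soon as the long-running work is done, before touching the ORM, and let Django open a fresh connection on the next query. Something like the following, where mark_search_finished and the import path are hypothetical:

from django import db

from myapp.models import Search  # hypothetical import path for the Search model


def mark_search_finished(search_id):
    # Close whatever connection this process is holding; if Postgres already
    # dropped it during the long crawl this clears the stale handle, and
    # Django will reconnect lazily on the next query.
    db.connection.close()

    search = Search.objects.get(id=search_id)
    search.finished = True
    search.save()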