When you need to do some web scraping in Python, an excellent choice is the Scrapy framework. Not only does it take care of most of the networking (HTTP, SSL, proxies, etc.), but it also facilitates the process of extracting data from the web by providing niceties such as XPath selectors.
Scrapy is built upon the Twisted networking engine. A limitation of its core component, the reactor, is that it cannot be restarted. This can cause trouble if we are trying to devise a mechanism to run Scrapy spiders independently from a Python script (and not from the Scrapy shell). Say, for example, we want to implement a Python function that receives some parameters, performs a search/web scraping on some sites, and returns a list of scraped items. A naive solution such as this will not work, since each call to the function would need the Twisted reactor to be restarted, which is unfortunately not possible.
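For reference, the naive approach looks roughly like the sketch below (written against the same Scrapy 0.13-era API used later in this post; run_spider is a hypothetical name). The first call works, but any subsequent call fails as soon as crawler.start() tries to start the already-stopped reactor:

from scrapy import project
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess

def run_spider(spider):
    crawler = CrawlerProcess(settings)
    if not hasattr(project, 'crawler'):
        crawler.install()
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()  # starts the Twisted reactor; blocks until the crawl ends
    crawler.stop()   # the reactor is now stopped and cannot be restarted

# run_spider(MySpider())  # first call works
# run_spider(MySpider())  # second call fails: reactor already ran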
A workaround for this is to run Scrapy in its own process. After searching around, I could not get any of the existing solutions to work with the latest Scrapy. However, one of them used multiprocessing and came pretty close! Here is an updated version for Scrapy 0.13:
from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
import multiprocessing

class CrawlerWorker(multiprocessing.Process):

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue

        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # Collect each item as the spider scrapes it
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        # Hand the collected items back to the parent process
        self.result_queue.put(self.items)
One way to invoke this, say inside a function, would be:
result_queue = Queue()
crawler = CrawlerWorker(MySpider(myArgs), result_queue)
crawler.start()
for item in result_queue.get():
    yield item
MySpider is, of course, the class of the spider you want to run, and myArgs are the arguments you wish to invoke the spider with.
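Putting it all together, the self-contained function described at the start could look like the following sketch (scrape, query, and the MySpider constructor argument are hypothetical names):

def scrape(query):
    # Each call runs the spider in a fresh child process, with a fresh
    # Twisted reactor, so the no-restart restriction never comes into play.
    result_queue = Queue()
    crawler = CrawlerWorker(MySpider(query), result_queue)
    crawler.start()
    items = result_queue.get()  # blocks until the worker puts its results
    crawler.join()              # wait for the child process to exit
    return items

Since the reactor lives and dies with the child process, you can call such a function as many times as you like from the same script.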