I get a twisted.internet.error.ReactorNotRestartable error when I execute the following code:

from time import sleep
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    result = item

while True:
    process = CrawlerProcess(get_project_settings())
    dispatcher.connect(set_result, signals.item_scraped)

    process.crawl('my_spider')
    process.start()

    if result:
        break
    sleep(3)

The first time it works, then I get the error. I create the process variable each time, so what's the problem?

k_wit
7 Answers

By default, CrawlerProcess's .start() will stop the Twisted reactor it creates when all crawlers have finished.

You should call process.start(stop_after_crawl=False) if you create process in each iteration.

Another option is to handle the Twisted reactor yourself and use CrawlerRunner. The docs have an example of doing that.

paul trmbrth
  • `process.start(stop_after_crawl=False)` will block the main process – Ilia w495 Nikitin Mar 19 '17 at 23:40
  • @Iliaw495Nikitin, CrawlerProcess.start() will run the reactor and give back control to the thread when the crawl is finished, correct. Is that an issue here? The alternative, [scrapy.crawler.CrawlerRunner's `.crawl()`](https://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerRunner.crawl), _"Returns a deferred that is fired when the crawling is finished."_ – paul trmbrth Mar 20 '17 at 09:37
  • Blocking wouldn't be a good idea for AWS Lambda, would it? I have literally spent half a day just trying to figure out how to get this running on AWS Lambda, still nothing. – André Yuhai Sep 05 '20 at 12:02
  • I have no idea how AWS Lambda works. You may want to post a new question. – paul trmbrth Sep 07 '20 at 17:12
I was able to solve this problem like this: process.start() should be called only once.

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

result = None

def set_result(item):
    global result  # without this, the module-level result is never set
    result = item

process = CrawlerProcess(get_project_settings())
dispatcher.connect(set_result, signals.item_scraped)

process.crawl('my_spider')
process.start()
Sagun Shrestha
Reference: http://crawl.blog/scrapy-loop/

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from twisted.internet.task import deferLater

def sleep(result, *args, seconds):
    """Non-blocking sleep callback."""
    return deferLater(reactor, seconds, lambda: None)

process = CrawlerProcess(get_project_settings())

def _crawl(result, spider):
    deferred = process.crawl(spider)
    deferred.addCallback(lambda results: print('waiting 100 seconds before restart...'))
    deferred.addCallback(sleep, seconds=100)
    deferred.addCallback(_crawl, spider)
    return deferred

_crawl(None, MySpider)  # MySpider is your project's spider class
process.start()
I would advise you to run your scrapers using the subprocess module:

from subprocess import Popen, PIPE

spider = Popen(["scrapy", "crawl", "spider_name", "-a", "argument=value"], stdout=PIPE)

spider.wait()
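The same idea can be sketched with a plain Python subprocess: each run starts a fresh interpreter, and a fresh interpreter means a fresh reactor, so you can repeat the run as often as you like. The crawl here is stood in by a trivial print; in practice the command would be the `scrapy crawl` invocation shown above:

```python
import sys
from subprocess import run, PIPE

# Stand-in for the real crawl; in practice use something like
# ["scrapy", "crawl", "spider_name"] as shown above.
CRAWL_CMD = [sys.executable, "-c", "print('crawl finished')"]

outputs = []
for attempt in range(3):
    # Each iteration spawns a fresh interpreter, hence a fresh reactor.
    proc = run(CRAWL_CMD, stdout=PIPE, text=True)
    if proc.returncode == 0:
        outputs.append(proc.stdout.strip())

print(outputs)
```

Checking proc.returncode also lets the parent detect a crawl that crashed and decide whether to retry.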
For a particular process, once you call reactor.run() or process.start(), you cannot rerun those commands, because the reactor cannot be restarted. The reactor stops once the script completes its execution.

So the best option is to use separate subprocesses if you need to run the reactor multiple times.

You can move the content of the while loop into a function (say execute_crawling) and run it in separate subprocesses. Python's multiprocessing module can be used for this. Code is given below.

from multiprocessing import Process
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

def execute_crawling():
    process = CrawlerProcess(get_project_settings())  # the same can be done with CrawlerRunner
    dispatcher.connect(set_result, signals.item_scraped)  # set_result as defined in the question
    process.crawl('my_spider')
    process.start()

if __name__ == '__main__':
    for k in range(Number_of_times_you_want):
        p = Process(target=execute_crawling)
        p.start()
        p.join()  # this blocks until the process terminates
Gihan Gamage
I was able to mitigate this problem using the crochet package, via this simple code based on Christian Aichinger's answer to the duplicate of this question, Scrapy - Reactor not Restartable. The initialization of the spiders is done in the main thread, whereas the actual crawling is done in a different thread. I'm using Anaconda (Windows).

import time
import scrapy
from scrapy.crawler import CrawlerRunner
from crochet import setup

class MySpider(scrapy.Spider):
    name = "MySpider"
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.text)
        for i in range(1,6):
            time.sleep(1)
            print("Spider "+str(self.name)+" waited "+str(i)+" seconds.")

def run_spider(number):
    crawler = CrawlerRunner()
    crawler.crawl(MySpider, name=str(number))

setup()
for i in range(1,6):
    time.sleep(1)
    print("Initialization of Spider #"+str(i))
    run_spider(i)
DovaX
I had a similar issue using Spyder. Running the file from the command line instead fixed it for me.

Spyder seems to work the first time, but after that it doesn't. Maybe the reactor stays open and doesn't close?

Daniel Wyatt