Force my scrapy spider to stop crawling

Question

Is there a way to stop crawling when a specific condition is true (like scrap_item_id == predefine_value)? My problem is similar to "Scrapy - how to identify already scraped urls", but I want to force my Scrapy spider to stop crawling once it discovers the last scraped item.

OK, got it ... I'm pretty sure there is a better solution, but this works: from scrapy.project import crawler; crawler.engine.close_spider(spider, 'closespider_blee') – no1, Dec 16 '10 at 15:59
That solution seems fine. It's used in the scrapy source too (e.g. contrib/closespider.py) — Shane Evans, Aug 31 '11 at 01:21

Sjaak Trekhaak · Answer 1 · 2014-07-23T13:26:29.790

40

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider.

The 0.14 release notes mention: "Added CloseSpider exception to manually close spiders (r2691)"

Example as per the docs:

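from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    # response.body is bytes in Python 3, so compare against a bytes literal
    if b'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')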

See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
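Applied to the question's case (stop once the last scraped item shows up again), a minimal sketch could look like the following. The names LAST_SCRAPED_ID and check_item_ids are made up for illustration, and the fallback class exists only so the snippet runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import CloseSpider
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed (illustration only).
    class CloseSpider(Exception):
        def __init__(self, reason="cancelled"):
            super().__init__()
            self.reason = reason

# Hypothetical: id of the newest item stored by the previous crawl.
LAST_SCRAPED_ID = "item-42"


def check_item_ids(item_ids, last_scraped_id=LAST_SCRAPED_ID):
    """Yield new item ids; raise CloseSpider at the first already-scraped one."""
    for item_id in item_ids:
        if item_id == last_scraped_id:
            raise CloseSpider("reached_last_scraped_item")
        yield item_id


# Inside a spider callback this might be used roughly like:
#
#     def parse(self, response):
#         ids = response.css("div.item::attr(data-id)").getall()  # assumed page layout
#         for item_id in check_item_ids(ids):
#             yield {"id": item_id}
```

This assumes the site lists items newest first, so everything after the known id has already been scraped. As with any CloseSpider shutdown, requests that are already in flight may still complete.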

edited Jul 23 '14 at 13:26

answered Nov 01 '11 at 16:03

Sjaak Trekhaak

It does force a stop, but not fast enough. It still lets some requests keep running. I hope Scrapy will provide a better solution in the future. – Aminah Nuraini Dec 11 '15 at 18:27
From my observations, it finishes the requests which were already fired, no? – Rafael Almeida Aug 22 '17 at 15:12

score 11 · Answer 2 · edited Jun 15 '18 at 09:16

This question was asked 8 months ago, but I was wondering the same thing and have found another (not great) solution. Hopefully this can help future readers.

I'm connecting to a database in my pipeline file. If the database connection is unsuccessful, I want the spider to stop crawling (no point in collecting data if there's nowhere to send it). What I ended up using was:

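# Scrapy 0.12-era API; scrapy.project and this private method may be absent in later releases.
from scrapy.project import crawler
crawler._signal_shutdown(9, 0)  # Run this if the database connection fails.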

This causes the Spider to do the following:

[scrapy] INFO: Received SIGKILL, shutting down gracefully. Send again to force unclean shutdown.

I just kind of pieced this together after reading your comment and looking through the "/usr/local/lib/python2.7/dist-packages/Scrapy-0.12.0.2543-py2.7.egg/scrapy/crawler.py" file. I'm not totally sure what it's doing. The first number passed to the function is the signal number (for example, using 3,0 instead of 9,0 returns the error [scrapy] INFO: Received SIGKILL...).

Seems to work well enough though. Happy scraping.

EDIT: I also suppose that you could just force your program to shut down with something like:

import sys
sys.exit("SHUT DOWN EVERYTHING!")

edited Jun 15 '18 at 09:16

Nicolò Gasparini

answered Aug 16 '11 at 03:23

alukach

Thanks for mentioning the extension. Right now it really is the way to go. Here are the docs: http://readthedocs.org/docs/scrapy/en/0.12/topics/extensions.html#module-scrapy.contrib.closespider – Victor Farazdagi Sep 13 '11 at 10:15
The thing that I dislike about the Close Spider extension is that it can only be triggered by four conditions (timeout, itempassed, pagecount, errorcount, as far as I know). It would be nice if you could define your own conditions for closing the spider, so that it's closed on a specific occurrence (e.g. a certain word is scraped). – alukach Sep 13 '11 at 19:32
the link to the extension is down – Dec 08 '15 at 10:37
The new link to the extension is: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/closespider.py – matiskay Dec 14 '17 at 20:43
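
For the custom-condition case raised in the comments above, one option is a small extension of your own modelled on closespider.py. The sketch below is an assumption-laden illustration, not part of Scrapy: the class name CloseOnKeyword, the setting CLOSESPIDER_KEYWORD, and the reason string are all made up, and items are assumed to be dict-like. The fallback for the signal object is only there so the snippet runs without Scrapy installed.

```python
try:
    from scrapy import signals
    ITEM_SCRAPED = signals.item_scraped
except ImportError:
    # Stand-in so the sketch can be run without Scrapy installed (illustration only).
    ITEM_SCRAPED = object()


class CloseOnKeyword:
    """Hypothetical extension: close the spider once an item mentions a keyword."""

    def __init__(self, crawler, keyword):
        self.crawler = crawler
        self.keyword = keyword
        # Call self.item_scraped every time an item makes it through the pipelines.
        crawler.signals.connect(self.item_scraped, signal=ITEM_SCRAPED)

    @classmethod
    def from_crawler(cls, crawler):
        # CLOSESPIDER_KEYWORD is a made-up setting name, not a built-in Scrapy setting.
        return cls(crawler, crawler.settings.get("CLOSESPIDER_KEYWORD", "stop"))

    def item_scraped(self, item, spider):
        # Assumes dict-like items; checks every field value for the keyword.
        if any(self.keyword in str(value) for value in dict(item).values()):
            self.crawler.engine.close_spider(spider, "closespider_keyword")
```

To try it, the class would be listed in the EXTENSIONS setting under whatever module path it lives at in your project, with CLOSESPIDER_KEYWORD set to the word to watch for.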

score 3 · Answer 3 · answered Nov 07 '19 at 23:34

From a pipeline, I prefer the following solution.

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # close_spider expects the spider instance, not the pipeline (self)
        spider.crawler.engine.close_spider(spider, reason='duplicate')
        return item

Source: Force spider to stop in scrapy
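
A slightly fuller sketch of the same idea, closing the spider only when a duplicate actually turns up. The class name StopOnDuplicatePipeline and the "id" field are assumptions for illustration; the only Scrapy call used is spider.crawler.engine.close_spider, as in the snippet above.

```python
class StopOnDuplicatePipeline:
    """Hypothetical pipeline: ask the engine to close the spider on the first duplicate id."""

    def __init__(self):
        self.seen_ids = set()
        self.closing = False

    def process_item(self, item, spider):
        item_id = item["id"]  # assumed field name; adjust to your item definition
        if item_id in self.seen_ids:
            # Items from in-flight requests can keep arriving after the first
            # close request, so only ask the engine to close once.
            if not self.closing:
                self.closing = True
                spider.crawler.engine.close_spider(spider, reason="duplicate")
        else:
            self.seen_ids.add(item_id)
        # In a real project you might raise scrapy.exceptions.DropItem for the
        # duplicate instead of passing it on to later pipelines.
        return item
```

The set here lives in memory for a single run. To stop at items from a previous crawl, the seen ids would have to be loaded from wherever you stored them.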

answered Nov 07 '19 at 23:34

Macbric

score 0 · Answer 4 · answered Aug 10 '20 at 18:37

I tried lots of options and nothing worked. This dirty hack does the trick on Linux:

import os
import signal
os.kill(os.getpid(), signal.SIGINT)
os.kill(os.getpid(), signal.SIGINT)

This sends the SIGINT signal to Scrapy twice. The second signal forces an unclean shutdown.

answered Aug 10 '20 at 18:37

Alex

I tried this on Linux, but I got: "NameError: name 'signal' is not defined" – Mantas Lukosevicius Nov 21 '20 at 16:35
add ```import signal``` – Alex Nov 21 '20 at 16:53
