
I have 20 spiders in one project. Each spider has a different task and URL to crawl, but the data are similar, so I'm using a shared items.py and pipelines.py for all of them. In my pipeline class I want the specified spider to stop crawling when certain conditions are satisfied. I've tested

  raise DropItem("terminated by me")

and

 raise CloseSpider('terminate by me')

but both of them only stop the current item from being processed, and the next_page URL is still crawled!

Part of my pipelines.py:

import pymongo

from scrapy.conf import settings
from scrapy.exceptions import CloseSpider, DropItem
from scrapy import log


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # both raises added only to test stopping the spider from the pipeline
        raise CloseSpider('terminateby')
        raise DropItem("terminateby")

        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Items added to MongoDB database!",
                    level=log.DEBUG, spider=spider)
        return item

And my spider:

import scrapy
import json
from Maio.items import MaioItem



class ZhilevanSpider(scrapy.Spider):
    name = 'tehran'
    allowed_domains = []
    start_urls = ['https://search.Maio.io/json/']
    place_code = str(1)

    def start_requests(self):

        request_body = {
                "id": 2,
                "jsonrpc": "2.0",
                "method": "getlist",
                "params": [[["myitem", 0, [self.place_code]]], next_pdate]
        }
        # for body in request_body:
        #     request_body = body

        request_body = json.dumps(request_body)
        print(request_body)
        yield scrapy.Request(
            url='https://search.Maio.io/json/',
            method="POST",
            body=request_body,
            callback=self.parse,
            headers={'Content-type': 'application/json;charset=UTF-8'}
        )

    def parse(self, response):

        print(response)
        # print(response.body.decode('utf-8'))
        input = (response.body.decode('utf-8'))
        result = json.loads(input)
        # for key,item in result["result"]:
        #     print(key)
        next_pdate = result["result"]["last_post_date"]
        print(result["result"]["last_post_date"])
        for item in result["result"]["post_list"]:
            print("title : {0}".format(item["title"]))
            ads = MaioItem()
            ads['title'] = item["title"]
            ads['desc'] = item["desc"]
            yield ads
        if next_pdate:
            request_body = {
                "id": 2,
                "jsonrpc": "2.0",
                "method": "getlist",
                "params": [[["myitem", 0, [self.place_code]]], next_pdate]
            }

            request_body = json.dumps(request_body)
            yield scrapy.Request(
                url='https://search.Maio.io/json/',
                method="POST",
                body=request_body,
                callback=self.parse,
                headers={'Content-type': 'application/json; charset=UTF-8'}
            )

**Update**

Even when I put `sys.exit("SHUT DOWN EVERYTHING!")` in the pipeline, the next page still runs.

I see the following log on every page that runs:

sys.exit("SHUT DOWN EVERYTHING!")
SystemExit: SHUT DOWN EVERYTHING!
  • `CloseSpider()` will still process the requests that are queued, if you want to terminate immediately then implement something like killing the process ID of your scraper twice ... – Umair Ayub Oct 15 '17 at 08:34
  • @Umair thanks for your attention, but I don't want to kill the other spiders which are running concurrently, is that possible? Please provide your idea as an answer – Yuseferi Oct 15 '17 at 08:39
  • Even though you are using a shared pipeline, each of your spiders will have a different process ID on your server (assuming you are on Linux) ... Send `KILL PROCESS_ID` once and your particular spider will finish after executing pending requests; send `KILL PROCESS_ID` twice and it will terminate immediately. I don't see any other way to terminate a spider without finishing all pending requests. – Umair Ayub Oct 15 '17 at 08:42
  • @Umair I tried, but still no result – Yuseferi Oct 15 '17 at 18:32
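For what it's worth, here is a blunt sketch of the process-level approach suggested in the comments above, assuming each spider runs as its own `scrapy crawl` process on Linux. The pipeline class and the stop condition are hypothetical, and SIGKILL skips all of Scrapy's normal shutdown and cleanup:

import os
import signal


class KillSwitchPipeline(object):
    """Hypothetical pipeline: hard-kill only the current spider's own process."""

    def process_item(self, item, spider):
        # hypothetical stop condition for one specific spider
        if spider.name == 'tehran' and not item.get('title'):
            # SIGKILL ends this process immediately, so no further requests run;
            # other spiders live in separate processes and are unaffected
            os.kill(os.getpid(), signal.SIGKILL)
        return item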

3 Answers


If you want to stop a spider from a pipeline, you can call the close_spider() function of the engine.

class MongoDBPipeline(object):

    def process_item(self, item, spider):
        # ask the engine to close the spider that produced this item
        spider.crawler.engine.close_spider(spider, reason='finished')
Adrien Blanquer
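For example, here is a hedged sketch of how that close_spider() call could be driven by a concrete condition in the shared pipeline. The should_stop() helper, the spider-name check, and the empty-title rule are hypothetical, not part of the answer above:

from scrapy.exceptions import DropItem


class MongoDBPipeline(object):

    def process_item(self, item, spider):
        if self.should_stop(item, spider):
            # asks the engine to stop scheduling new requests for this spider;
            # requests already in flight may still finish
            spider.crawler.engine.close_spider(spider, reason='condition met')
            raise DropItem("terminated by me")
        return item

    def should_stop(self, item, spider):
        # hypothetical rule: stop only the 'tehran' spider when a title is missing
        return spider.name == 'tehran' and not item.get('title')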

Alternatively, you can use the CloseSpider exception:

from scrapy.exceptions import CloseSpider
# condition
raise CloseSpider("message")
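For reference, the Scrapy documentation describes CloseSpider as an exception to be raised from a spider callback. A minimal sketch of that usage, with a hypothetical spider and stop condition:

import scrapy
from scrapy.exceptions import CloseSpider


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # hypothetical stop condition checked inside the callback
        if b'no results' in response.body:
            raise CloseSpider("no more results, stopping this spider")
        yield {'url': response.url}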
  • Look at the time of the question :). I can't test it because I don't remember how I finally resolved it, and I'm not sure the suggested solution works, but I'm sure I tried such basic solutions. Thank you for your attention, dude – Yuseferi Sep 13 '20 at 15:22
  • This is the working and recommended way to stop the spider. It may be too late to tell you, but definitely not too late for the people who want a proper solution to this problem. – Waqar Ahmad Sep 15 '20 at 12:59

Why not just use this?

import sys

# with some condition
sys.exit("Closing the spider")
  • I mentioned it in the question ;) ```even I put sys.exit("SHUT DOWN EVERYTHING!") in the pipeline but next page still run . I see the following log in every page running sys.exit("SHUT DOWN EVERYTHING!") SystemExit: SHUT DOWN EVERYTHING!``` – Yuseferi Sep 13 '20 at 10:03