
This question is very similar to Force my scrapy spider to stop crawling and some others asked several years ago. However, the solutions suggested there are either outdated for Scrapy 1.1.1 or not precisely relevant. The task is to close the spider when it reaches a certain URL. You definitely need this when crawling a news website for your media project, for instance.

Among the settings CLOSESPIDER_TIMEOUT, CLOSESPIDER_ITEMCOUNT, CLOSESPIDER_PAGECOUNT and CLOSESPIDER_ERRORCOUNT, the item-count and page-count options come close but are not enough, since you never know the number of pages or items in advance.
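
For reference, a minimal sketch of how those thresholds would be set in settings.py (the numbers below are arbitrary placeholders, which is exactly the problem, since you would have to guess them):

# settings.py -- CloseSpider extension thresholds (values are arbitrary guesses)
CLOSESPIDER_PAGECOUNT = 100    # close the spider after 100 responses have been crawled
CLOSESPIDER_ITEMCOUNT = 2000   # or after 2000 items have been scraped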

The raise CloseSpider(reason='some reason') exception seems to do the job, but so far it does so in a somewhat odd way. I follow the “Learning Scrapy” textbook, and the structure of my code looks like the one in the book.

In items.py I define the item fields:

import scrapy


class MyProjectItem(scrapy.Item):
    # Fields collected for each news item
    Headline = scrapy.Field()
    URL = scrapy.Field()
    PublishDate = scrapy.Field()
    Author = scrapy.Field()

In myspider.py I use the start_requests() method to give the spider the pages to process, parse each index page in parse(), and specify the XPaths for each item field in parse_item():

import scrapy
from scrapy.loader import ItemLoader

from myproject.items import MyProjectItem  # adjust "myproject" to your project package


class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.name.com']

    def start_requests(self):
        # Generate the index pages to process
        for i in range(1, 3000):
            yield scrapy.Request('http://domain.name.com/news/index.page' + str(i) + '.html', self.parse)

    def parse(self, response):
        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:
            # The urls are absolute in this case. There’s no need to use urllib.parse.urljoin()
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        l = ItemLoader(item=MyProjectItem(), response=response)
        l.add_xpath('Headline', 'XPath for Headline')
        l.add_value('URL', response.url)
        l.add_xpath('PublishDate', 'XPath for PublishDate')
        l.add_xpath('Author', 'XPath for Author')
        return l.load_item()

If the raise CloseSpider(reason='some reason') exception is placed in parse_item(), the spider still scrapes a number of items before it finally stops:

# assumes: from scrapy.exceptions import CloseSpider
if l.get_output_value('URL') == 'http://domain.name.com/news/1234567.html':
    raise CloseSpider('No more news items.')

If it’s placed in the parse() method to stop when the specific URL is reached, it stops after grabbing only the first item from the index page that contains that specific URL:

def parse(self, response):
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'
    urls = response.xpath('XPath for the URLs on index page').extract()

    if most_recent_url_in_db not in urls:
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_item)
    else:
        # Only request the URLs that appear before the most recent one in the database
        for url in urls[:urls.index(most_recent_url_in_db)]:
            yield scrapy.Request(url, callback=self.parse_item)
        raise CloseSpider('No more news items.')

For example, if you have 5 index pages (each of them with 25 item URLs) and most_recent_url_in_db is on page 4, you'll get all items from pages 1–3 and only the first item from page 4, and then the spider stops. If most_recent_url_in_db is number 10 in the list, items 2–9 from index page 4 won't appear in your database.

The “hacky” tricks with crawler.engine.close_spider() suggested in many other cases, or the ones shared in “How do I stop all spiders and the engine immediately after a condition in a pipeline is met?”, don’t seem to work.

What is the proper way to complete this task?

2 Answers


I'd recommend changing your approach. Scrapy crawls many requests concurrently and without a linear order, which is why closing the spider when you find what you're looking for won't do: a request scheduled after that one could already have been processed.

To tackle this you could make Scrapy crawl sequentially, meaning one request at a time in a fixed order. This can be achieved in different ways; here's an example of how I would go about it.

First of all, you should crawl a single page at a time. This could be done like this:

class MyProjectSpider(scrapy.Spider):

    pagination_url = 'http://domain.name.com/news/index.page{}.html'

    def start_requests(self):
        yield scrapy.Request(
            self.pagination_url.format(1),
            meta={'page_number': 1},
        )

    def parse(self, response):
        # code handling item links
        ...

        page_number = response.meta['page_number']
        next_page_number = page_number + 1

        if next_page_number <= 3000:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )

Once that's implemented, you could do something similar with the links on each page. However, since you can filter them without downloading their content, you could do something like this:

class MyProjectSpider(scrapy.Spider):

    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'

    def parse(self, response):
        url_found = False

        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:

            if url == self.most_recent_url_in_db:
                url_found = True
                break

            yield scrapy.Request(url, callback=self.parse_item)

        page_number = response.meta['page_number']
        next_page_number = page_number + 1

        if not url_found:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )

Putting it all together, you'll have:

import scrapy
from scrapy.loader import ItemLoader

from myproject.items import MyProjectItem  # adjust "myproject" to your project package


class MyProjectSpider(scrapy.Spider):
    name = 'spidername'
    allowed_domains = ['domain.name.com']

    pagination_url = 'http://domain.name.com/news/index.page{}.html'
    most_recent_url_in_db = 'http://domain.name.com/news/1234567.html'

    def start_requests(self):
        yield scrapy.Request(
            self.pagination_url.format(1),
            meta={'page_number': 1}
        )

    def parse(self, response):
        url_found = False

        urls = response.xpath('XPath for the URLs on index page').extract()
        for url in urls:

            if url == self.most_recent_url_in_db:
                url_found = True
                break

            yield scrapy.Request(url, callback=self.parse_item)

        page_number = response.meta['page_number']
        next_page_number = page_number + 1

        if next_page_number <= 3000 and not url_found:
            yield scrapy.Request(
                self.pagination_url.format(next_page_number),
                meta={'page_number': next_page_number},
            )

    def parse_item(self, response):

        l = ItemLoader(item=MyProjectItem(), response=response)

        l.add_xpath('Headline', 'XPath for Headline')
        l.add_value('URL', response.url)
        l.add_xpath('PublishDate', 'XPath for PublishDate')
        l.add_xpath('Author', 'XPath for Author')

        return l.load_item()

Hope that gives you an idea of how to accomplish what you're looking for. Good luck!

Julia Medina
    Julia, this solution is amazing! It is so elegant and very easy to understand, especially for beginners. I’m sure it will help a lot of people. Moreover, you wrote all the code. I don’t even need to add anything to it. Isn’t this just great?) Thank you ever so much!!! – Vladimir Sirovitskiy Sep 18 '16 at 11:15

When you raise the CloseSpider() exception, the ideal assumption is that Scrapy should stop immediately, abandoning all other activity (any future page requests, any processing in the pipeline, etc.).

But this is not the case. When you raise the CloseSpider() exception, Scrapy will try to close its current operations gracefully, meaning it will stop the current request but wait for any other requests pending in any of the queues (there are multiple queues!).

(i.e. if you are not overriding the default settings and have more than 16 start URLs, Scrapy makes 16 requests at a time)
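
For context, these appear to be the relevant defaults behind that behaviour (the values you would otherwise override in settings.py):

# Scrapy's default concurrency settings
CONCURRENT_REQUESTS = 16             # maximum concurrent requests overall
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # maximum concurrent requests per domain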

Now, if you want to stop the spider as soon as you raise the CloseSpider() exception, you will want to clear three queues:

-- At the spider middleware level --

  • spider.crawler.engine.slot.scheduler.mqs -> the in-memory queue of future requests
  • spider.crawler.engine.slot.inprogress -> any in-progress requests

-- At the downloader middleware level --

  • spider.requests_queue -> pending requests in the request queue

Flush all of these queues by overriding the proper middleware to prevent Scrapy from visiting any further pages.
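
As a rough sketch of that idea (not the exact middleware from this answer), the spider-middleware part could look something like this, assuming a hypothetical stop_crawl flag that the spider sets when it decides to stop:

import scrapy


class StopCrawlSpiderMiddleware(object):
    """Drops outgoing requests once spider.stop_crawl (hypothetical flag) is set,
    so no new requests reach the scheduler queues."""

    def process_spider_output(self, response, result, spider):
        for request_or_item in result:
            if isinstance(request_or_item, scrapy.Request) and getattr(spider, 'stop_crawl', False):
                continue  # swallow further requests instead of scheduling them
            yield request_or_item

# Enabled via the SPIDER_MIDDLEWARES setting, e.g.:
# SPIDER_MIDDLEWARES = {'myproject.middlewares.StopCrawlSpiderMiddleware': 50}

Clearing the in-progress and downloader-level queues would still require the extra middleware work described above.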

MrPandav
  • Thanks for the information! I guess I understand the logic behind your explanation but at present I have no idea how exactly I should alter the default settings. I need some time to figure it out. – Vladimir Sirovitskiy Sep 11 '16 at 17:11
  • link to what you need to do http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#downloader-middleware and http://doc.scrapy.org/en/latest/topics/spider-middleware.html#spider-middleware – MrPandav Sep 13 '16 at 06:32