
I've written a basic Scrapy spider to crawl a website, and it seems to run fine except that it doesn't want to stop: it keeps revisiting the same URLs and returning the same content, so I always end up having to kill it. Is there a rule that will stop this? Or is there something else I have to do, maybe middleware?

The Spider is as below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader.processor import Join
from scrapy.selector import Selector

from lsbu.items import LsbuItem

class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = [
        "http://www.lsbu.ac.uk"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        join = Join()
        sel = Selector(response)
        bits = sel.xpath('//*')
        scraped_bits = []
        for bit in bits:
            scraped_bit = LsbuItem()
            scraped_bit['title'] = bit.xpath('//title/text()').extract()
            scraped_bit['desc'] = join(bit.xpath('//*[@id="main_content_main_column"]//text()').extract()).strip()
            scraped_bits.append(scraped_bit)

        return scraped_bits

My settings.py file looks like this:

BOT_NAME = 'lsbu6'
DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = True
SPIDER_MODULES = ['lsbu.spiders']
NEWSPIDER_MODULE = 'lsbu.spiders'

Any help/guidance/instruction on stopping it from running continuously would be greatly appreciated...

As I'm a newbie to this, any comments on tidying the code up would also be helpful (or links to good instruction).

Thanks...

prbens

2 Answers


The dupefilter is enabled by default (http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class), and it's based on the request URL.
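A minimal sketch of why that's usually enough, assuming a Scrapy 1.0-era install: the default RFPDupeFilter fingerprints each request (essentially from its URL), so two requests for the same page get the same fingerprint and the second one is dropped.

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# Two requests for the same URL produce identical fingerprints,
# so the default RFPDupeFilter only schedules the first one.
r1 = Request("http://www.lsbu.ac.uk/business-and-partners/business")
r2 = Request("http://www.lsbu.ac.uk/business-and-partners/business")
print(request_fingerprint(r1) == request_fingerprint(r2))  # prints True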

I tried a simplified version of your spider on a new, vanilla Scrapy project without any custom configuration. The dupefilter worked and the crawl stopped after a few requests. I'd say you have something wrong in your settings or in your Scrapy version. I'd suggest you upgrade to Scrapy 1.0, just to be sure :)

$ pip install scrapy --pre

The simplified spider I tested:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field

class LsbuItem(Item):
    title = Field()
    url = Field()

class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]

    start_urls = [
        "http://www.lsbu.ac.uk"
    ]    

    rules = [
        Rule(LinkExtractor(allow=['lsbu.ac.uk/business-and-partners/.+']), callback='parse_item', follow=True),
    ]    

    def parse_item(self, response):
        scraped_bit = LsbuItem()
        scraped_bit['url'] = response.url
        yield scraped_bit
José Ricardo

Your design makes the crawl go in circles. For example, there is a page http://www.lsbu.ac.uk/business-and-partners/business which, when opened, contains a link to http://www.lsbu.ac.uk/business-and-partners/partners, and that page in turn contains a link back to the first one. Thus, you go in circles indefinitely.

In order to overcome this, you need to create better rules that eliminate the circular references. Also, you have two identical rules defined, which is not needed: if you want to follow links, you can put follow=True on the same rule, you don't need a separate rule for it.
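For illustration only, a minimal single-rule sketch along those lines (the deny pattern is a hypothetical placeholder, not taken from the real site structure, and the item fields are just examples):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field

class LsbuItem(Item):
    title = Field()
    url = Field()

class LsbuSpider(CrawlSpider):
    name = "lsbu6"
    allowed_domains = ["lsbu.ac.uk"]
    start_urls = ["http://www.lsbu.ac.uk"]

    rules = [
        # One rule does both jobs: follow=True keeps crawling matching pages
        # and callback='parse_item' scrapes them, so a second identical rule
        # is unnecessary. The deny pattern only illustrates how a looping
        # section could be cut out of the crawl.
        Rule(
            LinkExtractor(
                allow=[r'lsbu\.ac\.uk/business-and-partners/.+'],
                deny=[r'/business-and-partners/partners'],  # hypothetical exclusion
            ),
            callback='parse_item',
            follow=True,
        ),
    ]

    def parse_item(self, response):
        item = LsbuItem()
        item['title'] = response.xpath('//title/text()').extract()
        item['url'] = response.url
        yield item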

bosnjak
  • You are right, it should filter the duplicates by default. Set `DUPEFILTER_DEBUG=True` in settings, to see what is happening with it. – bosnjak Apr 28 '15 at 11:11
  • I tried what you suggested but it didn't change anything. I did some further reading and web searching based on your information and added some code to my pipeline.py and settings.py. Now I'm getting an error, so will ask a new question around that as I can't work it out - followed the Scrapy documentation to the letter!!! – prbens Apr 30 '15 at 17:43
  • Which version of scrapy do you have? – bosnjak Apr 30 '15 at 20:33
  • Sorry for the slow reply - was away. I'm using Scrapy 0.24.5 – prbens May 05 '15 at 16:08
  • Interestingly, when I change `return` to `yield` Scrapy tells me it's filtering out duplicate links, but the only issue then is the fact that the `items` are not scraped! – prbens May 07 '15 at 18:05