
I've created a python script using scrapy to scrape some information available in a certain webpage. The problem is that the link I'm trying gets redirected very often. However, when I try a few times using requests, I get the desired content.
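The requests behaviour described above can be sketched roughly like this (`fetch_until_ok`, `max_tries`, and the injectable `get` parameter are illustrative names of mine, not part of the original script):

```python
# Rough sketch: retry a url a few times with requests, not following
# redirects, until a non-redirect response finally comes back.
# fetch_until_ok, max_tries and `get` are hypothetical names.

def fetch_until_ok(url, max_tries=5, get=None):
    """Request `url` up to `max_tries` times without following redirects;
    return the first 200 response, or None if every attempt redirected."""
    if get is None:
        import requests  # assumed installed; `get` is injectable for testing
        get = lambda u: requests.get(u, allow_redirects=False)
    for _ in range(max_tries):
        response = get(url)
        if response.status_code == 200:
            return response
    return None
```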

In the case of scrapy, I'm unable to reuse the link because it keeps redirecting no matter how many times I try. I can even catch the original url using `response.meta.get("redirect_urls")[0]`, which is meant to be reused recursively within the parse method. However, the request always gets redirected, so the callback never runs.

This is my current attempt (the link used within the script is just a placeholder):

import scrapy
from scrapy.crawler import CrawlerProcess

class StackoverflowSpider(scrapy.Spider):

    handle_httpstatus_list = [301, 302]
    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        if response.meta.get("lead_link"):
            self.lead_link = response.meta.get("lead_link")
        elif response.meta.get("redirect_urls"):
            self.lead_link = response.meta.get("redirect_urls")[0]

        try:
            if response.status != 200: raise
            if not response.css("[itemprop='text'] > h2"): raise
            answer_title = response.css("[itemprop='text'] > h2::text").get()
            print(answer_title)

        except Exception:
            print(self.lead_link)
            yield scrapy.Request(self.lead_link, meta={"lead_link": self.lead_link}, dont_filter=True, callback=self.parse)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()

Question: How can I force scrapy to invoke the callback for the url that got redirected?

MITHU

2 Answers


As far as I understand, you want to keep requesting a link until it stops redirecting and you finally get HTTP status 200.

If yes, then you first have to remove handle_httpstatus_list = [301, 302] from your code. Then create a CustomMiddleware in middlewares.py:

import logging


class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        # Check for redirects first, so redirect responses (which never
        # contain the desired markup) don't fall into the retry-for-text branch.
        if response.status in [301, 302]:
            original_url = request.meta.get('redirect_urls', [response.url])[0]
            logging.info('%s is redirecting to %s, so re-scraping it' % (request.url, response.url))
            return request.replace(url=original_url, dont_filter=True)

        if not response.css("[itemprop='text'] > h2"):
            logging.info('Desired text not found on %s, so re-scraping' % request.url)
            return request.replace(dont_filter=True)

        return response

Then your spider should look something like this:

class StackoverflowSpider(scrapy.Spider):

    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'YOUR_PROJECT_NAME.middlewares.CustomMiddleware': 100,
        }
    }

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        answer_title = response.css("[itemprop='text'] > h2::text").get()
        print(answer_title)

If you tell me which site you are scraping, then I can help you out; you can also email me at the address on my profile.

Umair Ayub
  • Finally someone could understand what I'm trying to achieve. Thanks umair for your answer. I'll let you know if I find trouble implementing it. Thanks again. – MITHU Dec 17 '19 at 17:14
  • @MITHU sure, if you have any trouble do let me know – Umair Ayub Dec 18 '19 at 03:30

You may want to see this.
If you need to prevent redirecting, you can do it via the request meta:

request = scrapy.Request(self.start_url,meta={"lead_link":self.start_url},callback=self.parse)
request.meta['dont_redirect'] = True
yield request

According to the documentation, this is the way to stop redirecting.

Moein Kameli
  • I didn't mean to prevent redirecting; I knew that keyword already. If I went the way you suggested, the script would gracefully ignore that url. What I'm trying to do is reuse that url to get a response. Btw, this is what their documentation says - `If Request.meta has dont_redirect key set to True, the request will be ignored by this middleware`. – MITHU Dec 15 '19 at 03:51