
I've created a python script using scrapy to scrape some information available in a certain webpage. The problem is that the link I'm trying gets redirected very often. However, when I try a few times using requests, I get the desired content.
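The requests behaviour described above can be sketched roughly like this (`fetch_until_ok`, `max_tries`, and the injectable `get` parameter are illustrative names of mine, not part of the original script):

```python
# Rough sketch: retry a url a few times with requests, not following
# redirects, until a non-redirect response finally comes back.
# fetch_until_ok, max_tries and `get` are hypothetical names.

def fetch_until_ok(url, max_tries=5, get=None):
    """Request `url` up to `max_tries` times without following redirects;
    return the first 200 response, or None if every attempt redirected."""
    if get is None:
        import requests  # assumed installed; `get` is injectable for testing
        get = lambda u: requests.get(u, allow_redirects=False)
    for _ in range(max_tries):
        response = get(url)
        if response.status_code == 200:
            return response
    return None
```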

In the case of scrapy, I'm unable to reuse the link because it keeps redirecting no matter how many times I try. I can even catch the original url using `response.meta.get("redirect_urls")[0]`, which is meant to be reused recursively within the parse method. However, the request always gets redirected, so the callback never runs.

This is my current attempt (the link used within the script is just a placeholder):

import scrapy
from scrapy.crawler import CrawlerProcess

class StackoverflowSpider(scrapy.Spider):

    handle_httpstatus_list = [301, 302]
    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        if response.meta.get("lead_link"):
            self.lead_link = response.meta.get("lead_link")
        elif response.meta.get("redirect_urls"):
            self.lead_link = response.meta.get("redirect_urls")[0]

        try:
            if response.status != 200: raise
            if not response.css("[itemprop='text'] > h2"): raise
            answer_title = response.css("[itemprop='text'] > h2::text").get()
            print(answer_title)

        except Exception:
            print(self.lead_link)
            yield scrapy.Request(self.lead_link, meta={"lead_link": self.lead_link}, dont_filter=True, callback=self.parse)


if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(StackoverflowSpider)
    c.start()

Question: How can I force scrapy to invoke the callback for the url that got redirected?

MITHU

2 Answers


As far as I understand, you want to keep requesting a link until it stops redirecting and you finally get HTTP status 200.

If yes, then you first have to remove handle_httpstatus_list = [301, 302] from your code. Then create a CustomMiddleware in middlewares.py:

import logging


class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        # Check for redirects first, so redirect responses (which never
        # contain the desired markup) don't fall into the retry-for-text branch.
        if response.status in [301, 302]:
            original_url = request.meta.get('redirect_urls', [response.url])[0]
            logging.info('%s is redirecting to %s, so re-scraping it' % (request.url, response.url))
            return request.replace(url=original_url, dont_filter=True)

        if not response.css("[itemprop='text'] > h2"):
            logging.info('Desired text not found on %s, so re-scraping' % request.url)
            return request.replace(dont_filter=True)

        return response

Then your spider should look something like this:

class StackoverflowSpider(scrapy.Spider):

    name = "stackoverflow"
    start_url = 'https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean'

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'YOUR_PROJECT_NAME.middlewares.CustomMiddleware': 100,
        }
    }

    def start_requests(self):
        yield scrapy.Request(self.start_url, meta={"lead_link": self.start_url}, callback=self.parse)

    def parse(self, response):
        answer_title = response.css("[itemprop='text'] > h2::text").get()
        print(answer_title)

If you tell me which site you are scraping, then I can help you out; you can also email me at the address on my profile.

Umair Ayub
  • Finally someone could understand what I'm trying to achieve. Thanks umair for your answer. I'll let you know if I find trouble implementing it. Thanks again. – MITHU Dec 17 '19 at 17:14
  • @MITHU sure, if you have any trouble do let me know – Umair Ayub Dec 18 '19 at 03:30

You may want to see this.
If you need to prevent redirecting, you can do it via the request meta:

request = scrapy.Request(self.start_url,meta={"lead_link":self.start_url},callback=self.parse)
request.meta['dont_redirect'] = True
yield request

According to the documentation, this is the way to stop redirecting.

Moein Kameli
  • I didn't mean to prevent redirecting; I knew that keyword already. If I went the way you suggested, the script would gracefully ignore that url. What I'm trying to do is reuse that url to get a response. Btw, this is what their documentation says - `If Request.meta has dont_redirect key set to True, the request will be ignored by this middleware`. – MITHU Dec 15 '19 at 03:51