scrapy shell response differs from scrapy crawl response

Question

I have recreated an XHR request. Since it is a GET request, I can enter its URL directly in the browser window. On the first hit I get only partial JSON output; if I hit reload, it loads more data, which seems weird. Can anyone help me with this? Thanks in advance.

Additional info: I also tried it in the Scrapy shell, which returns the entire JSON response.

Code:

import scrapy
import datetime
import time
from scrapy.http.request import Request

class test(scrapy.Spider):
    name = "test"
    allowed_domains = ["ar.trivago.com"]

    def start_requests(self):
        yield scrapy.Request("http://ar.trivago.com/search/region?iPathId=38715&iGeoDistanceItem=47160&aDateRange%5Barr%5D=2015-11-13&aDateRange%5Bdep%5D=2015-11-14&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange%5Bfrom%5D=0&aPriceRange%5Bto%5D=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&&_=1446825699501",
                         callback=self.parse)

    def parse(self, response):
        print "RESPONSE::", response.body

Please help me to resolve this

score -1 · Answer 1 · edited May 23 '17 at 12:00

You are making a Request with an already encoded URL. Scrapy is encoding it again, and it looks like the target website doesn't support double encoding.
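To make "double encoding" concrete, here is a minimal sketch using only the Python 3 standard library, applied to one parameter taken from the URL in the question. It only illustrates the effect in general; it does not show what Scrapy itself does to this URL, and the variable names are my own.

```python
from urllib.parse import quote, unquote

# One query parameter from the question's URL, already percent-encoded.
encoded = "aDateRange%5Barr%5D=2015-11-13"

# Decoding once gives the plain form used in the spider below.
decoded = unquote(encoded)

# Encoding the already encoded string again escapes the '%' itself,
# so the server would see %255B instead of %5B.
double_encoded = quote(encoded, safe="=")

print(decoded)
print(double_encoded)
```

If a server decodes only once, a double-encoded parameter name no longer matches what it expects, which is why passing the decoded URL is the safer choice.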

Also, it is important to mention that some websites with API endpoints protect them by checking whether you already have a session. This is clearly meant to prevent direct requests to their endpoints. In these cases it is always recommended to make a first "fake" request (which will generate a session) before querying their API/endpoint.

An example of the above is this answer on SO:

https://stackoverflow.com/a/33542753/4120036

Just check how it first makes a request to the login page:
s.get(LOGIN_URL)
And then it makes the login POST request:
login_response = s.post(LOGIN_URL, data=payload, headers={'Referer':'http://infotrac.galegroup.com/default/palm83799?db=SP19', 'Content-Type':'application/x-www-form-urlencoded'})
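The session check described above can be reproduced offline. The sketch below is a toy, not trivago's actual behaviour: `ToyEndpoint`, the `sid` cookie and `fetch_twice` are hypothetical names, and the server runs locally on a random port. It returns "incomplete" data to a client without a session cookie and "complete" data once the cookie from the first response is sent back. It is written for Python 3 and uses only the standard library.

```python
import json
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class ToyEndpoint(BaseHTTPRequestHandler):
    """Hypothetical endpoint: full data only if a session cookie is sent."""

    def do_GET(self):
        has_session = "sid=" in (self.headers.get("Cookie") or "")
        body = json.dumps({"complete": has_session}).encode("utf-8")
        self.send_response(200)
        if not has_session:
            # First contact: hand out a session cookie.
            self.send_header("Set-Cookie", "sid=abc123; Path=/")
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_twice():
    """Request the same URL twice, one after the other, sharing one cookie jar."""
    server = HTTPServer(("127.0.0.1", 0), ToyEndpoint)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/search" % server.server_address[1]
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore any proxy settings
        urllib.request.HTTPCookieProcessor(CookieJar()),
    )
    try:
        first = json.loads(opener.open(url).read().decode("utf-8"))
        second = json.loads(opener.open(url).read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()
    return first, second


if __name__ == "__main__":
    print(fetch_twice())
```

The second request only succeeds because it is sent after the first response has been received and its cookie stored. Two requests fired at the same time would both arrive without a session.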

I've decoded the website URL, added X-Requested-With and Referer headers and it now returns the same amount of data as from your browser:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.request import Request

class test(scrapy.Spider):
    name = "test"
    allowed_domains = ["ar.trivago.com"]

    def start_requests(self):
        headers = {
                'Referer': "http://ar.trivago.com/?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2016-01-01&aDateRange[dep]=2016-01-02&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&",
                'X-Requested-With':'XMLHttpRequest'
            }
        # Note: fake_request is only constructed here; it is never yielded, so it is never sent.
        fake_request = Request("http://ar.trivago.com/search/region?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2015-11-13&aDateRange[dep]=2015-11-14&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&&_=1446825699501", headers=headers)
        yield Request("http://ar.trivago.com/search/region?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2015-11-13&aDateRange[dep]=2015-11-14&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&&_=1446825699501",
                         callback=self.parse, headers=headers)

    def parse(self, response):
        print "RESPONSE:", response.body

Did you notice that scrapy shell gives the full response? I also tried sleeping between the fake request and the original one; that succeeded only 2 times out of 7, so it is not reliable. — Sabeena, Nov 07 '15 at 09:10

score -1 · Accepted Answer · edited May 23 '17 at 12:24

Hi all, I have found the solution, based on Andrés's code.

@Andrés Pérez-Albela H. I have modified the code so that it gives me the actual response from the site. Because the requests were executed concurrently, the session was not created properly, so the response was partial most of the time. The post "Crawling with an authenticated session in Scrapy" helped me figure this out. Thanks @Acorn and @Andrés Pérez-Albela H.

# -*- coding: utf-8 -*-
import scrapy
import time
from scrapy.http.request import Request
headers = {
    'Referer': "http://ar.trivago.com/?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2016-01-01&aDateRange[dep]=2016-01-02&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&",
    'X-Requested-With':'XMLHttpRequest'
    }
class test(scrapy.Spider):
    name = "test"
    allowed_domains = ["ar.trivago.com"]
    def start_requests(self):
        yield Request("http://ar.trivago.com/search/region?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2015-11-13&aDateRange[dep]=2015-11-14&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&&_=1446825699501",
                      callback=self.parse, headers=headers)
    def parse(self, response):
        # The first response has arrived, so the session exists; repeat the request for the full data.
        yield Request("http://ar.trivago.com/search/region?iPathId=38715&iGeoDistanceItem=47160&aDateRange[arr]=2015-11-13&aDateRange[dep]=2015-11-14&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange[from]=0&aPriceRange[to]=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&&_=1446825699501",
                      callback=self.parse_final, headers=headers, dont_filter=True)
    def parse_final(self, response):
        print "RESPONSE:", response.body

It worked for me. Thanks, everyone, for the help.
