Is it possible to scrape an infinite-scroll page using only GET requests? For example,

http://www.justdial.com/Ahmedabad/Bearing-Dealers/ct-302676

requests the following URLs, one per page, as we scroll down:

http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Ahmedabad&search=Bearing+Dealers&where=&catid=302676&psearch=&prid=&page=2&SID=&mntypgrp=0&toknbkt=&bookDate=&jdsrc=

http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Ahmedabad&search=Bearing+Dealers&where=&catid=302676&psearch=&prid=&page=3&SID=&mntypgrp=0&toknbkt=&bookDate=&jdsrc=

So far my code looks like this:

import requests

def readJustDial(c):
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}    
    for i in range(1,10):
        url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city='+str(c)+'&search=Bearing+Dealers&where=&catid=302676&psearch=&prid=&page='+str(i)+'&SID=&mntypgrp=0&toknbkt=&bookDate=&jdsrc='
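        # headers must be passed by keyword; the second positional argument of requests.get is params
        page = requests.get(url, headers=hdr)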

def main(): #this is main function of this program
    allCities=["Ahmedabad","Hyderabad","Bangalore","Kolkata","Chennai","Mumbai","Delhi-NCR","Pune"]
    for city in allCities:
        readJustDial(city)
        #print(city)

if __name__ == "__main__":
    main()    

Also, please suggest any improvements I can make to my existing code. I am just learning Python, so any suggestions are welcome.

Bhavesh Ghodasara
  • Looks fine to me. How is your code not meeting your needs? – Ouroborus Nov 27 '16 at 12:24
  • @Ouroborus Access denied. Also, if you try to open that URL (with page 2 or 3) directly in your browser, it shows an empty list. Why? – Bhavesh Ghodasara Nov 27 '16 at 12:26
  • Looks like referer and cookies are required. – Ouroborus Nov 27 '16 at 12:34
  • @Ouroborus Sorry, can you please explain in more detail? I have not worked much with referers or cookies before. I know it is possible using selenium, but that looks like a clumsy solution to me. – Bhavesh Ghodasara Nov 27 '16 at 12:37
  • The `requests` documentation has a section on [cookies](http://docs.python-requests.org/en/master/user/quickstart/#cookies). `Referer` can be treated like any other header. Google can help you work out the details. – Ouroborus Nov 27 '16 at 12:40

1 Answer

Try to imitate the headers that a normal browser sends with its XHR request. You can view those headers in the browser's developer tools (I use Chrome's). When I look at the request, I see that it is sent with these headers:

Accept:application/json, text/javascript, */*; q=0.01
Accept-Encoding:gzip, deflate, sdch
Accept-Language:he-IL,he;q=0.8,en-US;q=0.6,en;q=0.4
Connection:keep-alive
Cookie:f5avrbbbbbbbbbbbbbbbb=BEKIPCFANCEHKADKNPJJJLHGCDKJOEEGKIIEPAAPHGEDJDNKFFBPCKEGMMIAECHOLECIMLJDAICKIFECEPMNKJNMKIDIMHPCOMHNNHMANENHHKEGMABPKFGKBAPGCHCJ; ppc=; PHPSESSID=bh34mlv2ba4gmgbntjtsjtt753; www=1712105664.20480.0000; _gat=1; scity=Ahmedabad; sarea=; dealBackCity=Ahmedabad; inweb_city=Ahmedabad; profbd=0; bdcheck=1; _ga=GA1.2.338746795.1480258713; tab=toprs; BDprofile=1; prevcatid=302676; view=lst_v; main_city=Ahmedabad
Host:www.justdial.com
Referer:http://www.justdial.com/Ahmedabad/Bearing-Dealers/ct-302676
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36
X-Requested-With:XMLHttpRequest

Try sending a request with these headers, except the cookies (usually they work only temporarily).
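A minimal sketch of that suggestion using `requests`, assuming the endpoint mainly checks the `Referer` and `X-Requested-With` headers (a guess that has not been verified against the site). The helper names `listing_url`, `pagination_url`, `xhr_headers` and `fetch_page` are made up for illustration. `Accept-Encoding` is deliberately left out so that `requests` advertises only encodings it can decode itself, and no `Cookie` header is sent.

```python
from urllib.parse import urlencode

BASE = "http://www.justdial.com"


def listing_url(city, category_slug, catid):
    """URL of the normal listing page; used here as the Referer."""
    return f"{BASE}/{city}/{category_slug}/ct-{catid}"


def pagination_url(city, search, catid, page):
    """Rebuild the pagination URL from the question, parameter by parameter."""
    params = {
        "national_search": 0, "act": "pagination", "city": city,
        "search": search, "where": "", "catid": catid, "psearch": "",
        "prid": "", "page": page, "SID": "", "mntypgrp": 0,
        "toknbkt": "", "bookDate": "", "jdsrc": "",
    }
    # urlencode turns spaces into '+', matching the URLs seen in the browser.
    return f"{BASE}/functions/ajxsearch.php?{urlencode(params)}"


def xhr_headers(referer):
    """Headers copied from the browser's XHR request, minus the cookies."""
    return {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Language": "en-US,en;q=0.8",
        "Connection": "keep-alive",
        "Referer": referer,
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; WOW64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/54.0.2840.99 Safari/537.36"),
        "X-Requested-With": "XMLHttpRequest",
    }


def fetch_page(city, page, category_slug="Bearing-Dealers",
               search="Bearing Dealers", catid=302676):
    """Send one pagination request; returns the requests.Response object."""
    import requests  # third-party; imported here so the helpers above need only the stdlib

    referer = listing_url(city, category_slug, catid)
    url = pagination_url(city, search, catid, page)
    return requests.get(url, headers=xhr_headers(referer), timeout=10)
```

Calling `fetch_page("Ahmedabad", 2)` and inspecting `response.status_code` and `response.text` should show whether the headers alone are enough.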

If that doesn't work either, you'll need the cookies. You can either use a browser (using selenium, for example), or do some reverse engineering of the webpage or the cookies and try to write a method for getting working cookies.
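One way to get working cookies without a browser, sketched under the assumption that the server sets everything it needs (for example `PHPSESSID`) when the normal listing page is loaded: open that page first with a `requests.Session`, then reuse the same session for the pagination requests. Whether this is enough for this site is not confirmed; cookies set by JavaScript would not be captured this way. The function names `new_session` and `scrape_pages` are hypothetical.

```python
BASE = "http://www.justdial.com"
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36")


def new_session():
    """Create a requests.Session that sends a browser-like User-Agent."""
    import requests  # third-party; imported lazily so scrape_pages can be tried with any session-like object

    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})
    return session


def scrape_pages(session, city, category_slug, search, catid, pages):
    """Load the listing page to collect cookies, then fetch each result page.

    Returns the response bodies as a list of strings, one per page.
    """
    listing = f"{BASE}/{city}/{category_slug}/ct-{catid}"
    # Any Set-Cookie headers in this response are stored on the session
    # and sent back automatically with the requests below.
    session.get(listing, timeout=10)

    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Referer": listing,
        "X-Requested-With": "XMLHttpRequest",
    }
    bodies = []
    for page in pages:
        params = {
            "national_search": 0, "act": "pagination", "city": city,
            "search": search, "where": "", "catid": catid, "psearch": "",
            "prid": "", "page": page, "SID": "", "mntypgrp": 0,
            "toknbkt": "", "bookDate": "", "jdsrc": "",
        }
        response = session.get(f"{BASE}/functions/ajxsearch.php",
                               params=params, headers=headers, timeout=10)
        bodies.append(response.text)
    return bodies
```

A call such as `scrape_pages(new_session(), "Ahmedabad", "Bearing-Dealers", "Bearing Dealers", 302676, range(2, 4))` would request pages 2 and 3. If the bodies still come back empty, the cookies are probably produced by JavaScript, and a real browser (selenium) or further reverse engineering is needed.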

kmaork