I'm scraping data from a site. I'm in Russia, and when I open the URL from my regular IP address, the page is rendered incorrectly, without data. When I use a British proxy, it works.

That's why I have to use a proxy while scraping, but I've run into a strange problem. When I open http://www.indeed.com/resumes/data-scientist/in-london?co=GB&start=1000 in a browser, it works (the page contains data). When I request the same URL from my script, the HTML that comes back is different from what I see in the browser.

For example, here is the HTML of http://www.indeed.com/resumes/data-scientist/in-london?co=GB&start=950, where the differences begin:

Via the browser (what I need):

<div id="pagination">Page:<a class="instl confirm-nav previous" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=900">« Previous</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=850">18</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=900">19</a><span class="current_page">20</span><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=1000">21</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=1050">22</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=1100">23</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=1150">24</a><a class="instl confirm-nav next" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=1000">Next »</a></div><div id="footer" class=""><p id="footer_nav" class="footer_nav">

The same place via the parser (wrong):

</div><div id="pagination">Page:<a class="instl confirm-nav previous" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=900" rel="nofollow">< Previous</a><a class="instl confirm-nav" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=850" rel="nofollow">18</a><a class="instl confirm-nav" href="?q=data+scientist&amp;l=london&amp;co=GB&amp;start=900" rel="nofollow">19</a><span class="current_page">20</span></div><div class="" id="footer"><p class="footer_nav" id="footer_nav">
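
The visible difference between the two snippets is that the parser's version stops at the current page and has no "Next" link. A minimal sketch of checking for that condition, so a scraper can notice when it is being served the truncated page. It uses only the standard library's html.parser so it runs without extra dependencies; the names PaginationParser and has_next_link are made up for this illustration, and the same check could be written with BeautifulSoup instead.

```python
from html.parser import HTMLParser


class PaginationParser(HTMLParser):
    """Look for an <a> whose class list contains "next" inside <div id="pagination">."""

    def __init__(self):
        super().__init__()
        self.depth = 0         # <div> nesting depth inside the pagination block (0 = outside)
        self.has_next = False  # set to True once a "next" link is seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the pagination block, or track divs nested inside it
            if self.depth or attrs.get("id") == "pagination":
                self.depth += 1
            return
        if self.depth and tag == "a":
            classes = (attrs.get("class") or "").split()
            if "next" in classes:
                self.has_next = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def has_next_link(html):
    """Return True if the pagination block of `html` contains a "next" link."""
    parser = PaginationParser()
    parser.feed(html)
    return parser.has_next
```

Calling has_next_link(req.text) on each downloaded page would flag the point where the server stops offering further pages, as happens here at start=950.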

I'm on Windows 7, using Python 3 and BeautifulSoup.

Code:

from bs4 import BeautifulSoup
import requests

# British proxy; the "http" key applies only to http:// URLs
proxy = {"http": "http://134.213.145.228:8080"}
# Send a desktop Chrome User-Agent string
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page_url = 'http://www.indeed.com/resumes/data-scientist/in-london?co=GB&start=950'
req = requests.get(page_url, proxies=proxy, headers=headers)
req.encoding = 'utf-8'
main = BeautifulSoup(req.text, 'html.parser')
# Collect the anchors with class "app_link"
profile_urls_tag = main.find_all('a', class_="app_link")

Edit 1:

One interesting thing, and I think the problem lies here: when I use the same proxy in Mozilla I can see only 20 pages, but in Chrome I can see 40.

Edit 2: The problem has been solved. It turns out that I have to register and log in to see the full information.
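
For anyone hitting the same thing, a rough sketch of the general pattern for scraping behind a login: reuse one session object so the cookies set at login are sent with later requests. Everything site-specific here is an assumption. The helper name fetch_after_login, the login URL, and the form field names are placeholders, not Indeed's actual login flow, and many login forms also require a hidden token that this sketch does not handle.

```python
def fetch_after_login(session, login_url, credentials, page_url, **request_kwargs):
    """Log in through `session`, then fetch `page_url` with the same cookies.

    `session` is expected to behave like requests.Session (post/get methods
    returning responses with raise_for_status() and .text).
    `login_url` and the keys of `credentials` are hypothetical; read the real
    values from the site's login form.
    Extra keyword arguments (e.g. proxies=..., headers=...) are passed to
    both requests.
    """
    # Submit the login form; the session stores any cookies the server sets
    login = session.post(login_url, data=credentials, **request_kwargs)
    login.raise_for_status()

    # Request the target page through the same session
    page = session.get(page_url, **request_kwargs)
    page.raise_for_status()
    return page.text
```

With requests this could be called as fetch_after_login(requests.Session(), login_url, {"email": "...", "password": "..."}, page_url, proxies=proxy, headers=headers), and the returned text passed to BeautifulSoup as before.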

GiveItAwayNow
  • The page is rendered using JavaScript; use a headless browser to get JavaScript-rendered pages, or use the underlying AJAX APIs to get the JSON/XML responses. – Alan Francis Jan 25 '16 at 06:04
  • Alan, unfortunately I don't understand. Could you tell me how I can integrate that into my code? – GiveItAwayNow Jan 25 '16 at 06:14
  • Read these: http://stackoverflow.com/questions/2148493/scrape-html-generated-by-javascript-with-python , http://stackoverflow.com/questions/33622435/is-it-possible-to-get-the-raw-text-of-a-webpage-post-js-in-python/33623257#33623257 – Alan Francis Jan 25 '16 at 06:35
  • Alan, it seems you didn't understand what I asked. The problem isn't in JS. – GiveItAwayNow Jan 25 '16 at 06:41
