I'm scraping data from a site. I'm in Russia, and when I open the URL from my own IP address, the page is rendered incorrectly, without data. Through a British proxy it works fine.
That's why I have to use a proxy while scraping, but I've run into a strange problem. When I open http://www.indeed.com/resumes/data-scientist/in-london?co=GB&start=1000 in a browser, it works (the page contains data). But when I request the same URL from my script, the page I get back is different from what the browser shows.
For example, here is the HTML of http://www.indeed.com/resumes/data-scientist/in-london?co=GB&start=950, where the differences begin:
Via the browser (what I need):
<div id="pagination">Page:<a class="instl confirm-nav previous" rel="nofollow" href="?q=data+scientist&l=london&co=GB&start=900">« Previous</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=GB&start=850">18</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=GB&start=900">19</a><span class="current_page">20</span><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=GB&start=1000">21</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=GB&start=1050">22</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=GB&start=1100">23</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=GB&start=1150">24</a><a class="instl confirm-nav next" rel="nofollow" href="?q=data+scientist&l=london&co=GB&start=1000">Next »</a></div><div id="footer" class=""><p id="footer_nav" class="footer_nav">
The same place as returned to the parser (wrong):
</div><div id="pagination">Page:<a class="instl confirm-nav previous" href="?q=data+scientist&l=london&co=GB&start=900" rel="nofollow">< Previous</a><a class="instl confirm-nav" href="?q=data+scientist&l=london&co=GB&start=850" rel="nofollow">18</a><a class="instl confirm-nav" href="?q=data+scientist&l=london&co=GB&start=900" rel="nofollow">19</a><span class="current_page">20</span></div><div class="" id="footer"><p class="footer_nav" id="footer_nav">
I'm on Windows 7, using Python 3 and BeautifulSoup.
Code:
from bs4 import BeautifulSoup
import requests
# Route plain-HTTP requests through the British proxy
proxy = {"http": "http://134.213.145.228:8080"}
# Send a browser-like User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page_url = 'http://www.indeed.com/resumes/data-scientist/in-london?co=GB&start=950'
req = requests.get(page_url, proxies=proxy, headers=headers)
req.encoding = 'utf-8'
# Parse the response body and collect the profile links
main = BeautifulSoup(req.text, 'html.parser')
profile_urls_tag = main.find_all('a', class_="app_link")
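To check from the script whether a fetched page has the truncated pagination shown above, one option is to look for a "next" link inside the `pagination` div. The following is only a minimal sketch: the names `PaginationProbe` and `has_next_page` are made up for illustration, and it uses the standard-library `html.parser` so that it runs without extra packages. The same check could presumably be written with BeautifulSoup on the `main` object.

```python
from html.parser import HTMLParser


class PaginationProbe(HTMLParser):
    """Collects the class lists of <a> tags found inside <div id="pagination">."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while we are inside the pagination div
        self.link_classes = []  # one list of class names per <a> tag seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the pagination div, or track a div nested inside it
            if self.depth or attrs.get("id") == "pagination":
                self.depth += 1
        elif tag == "a" and self.depth:
            self.link_classes.append((attrs.get("class") or "").split())

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def has_next_page(html):
    """Return True if the pagination div contains a link with class 'next'."""
    probe = PaginationProbe()
    probe.feed(html)
    return any("next" in classes for classes in probe.link_classes)
```

With the script above, `has_next_page(req.text)` would be expected to return `False` for the truncated response and `True` for the markup the browser receives, assuming the site keeps marking the link with the `next` class.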
Edit 1:
One interesting thing, which I think may be the cause: when I use the same proxy in Mozilla I can see only 20 pages, but in Chrome I can see 40.
Edit 2: The problem has been solved. It turns out that I have to register and log in to see the full information.
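For anyone hitting the same thing, here is a rough sketch of how a logged-in request could look with `requests.Session`, which keeps cookies between requests. Everything site-specific here is an assumption: `LOGIN_URL` and the keys in `LOGIN_FIELDS` are hypothetical placeholders (the real values have to be read from the site's login form, which may also require extra hidden fields or tokens), and `make_session` and `fetch_logged_in` are names invented for this example.

```python
import requests

# Hypothetical placeholders: the real login URL and form field names are not
# given in this post and must be taken from the site's actual login form.
LOGIN_URL = "https://example.com/login"
LOGIN_FIELDS = {"email": "you@example.com", "password": "your-password"}


def make_session(proxy, user_agent):
    """Return a Session that reuses the proxy, headers and cookies across requests."""
    session = requests.Session()
    # The proxies dict is keyed by URL scheme; add an "https" entry
    # if the login page is served over HTTPS.
    session.proxies.update(proxy)
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_logged_in(session, login_url, login_fields, page_url):
    """Post the login form, then fetch page_url with the same session cookies."""
    login_response = session.post(login_url, data=login_fields)
    login_response.raise_for_status()
    page_response = session.get(page_url)
    page_response.raise_for_status()
    return page_response.text
```

The returned text can then be passed to `BeautifulSoup(..., 'html.parser')` exactly as in the original script. A successful HTTP status on the login request does not prove that the login worked, so it is worth checking the fetched page for the expected data.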