I am writing code that loads a lot of websites, and sometimes a link doesn't exist and the request ends up at a different address than the one I asked for.

So I want to be able to identify when the site I am currently scraping isn't actually the address I told it to go to.

This is the code sample I am using. What should I add so I can find the name of the address it goes to?

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request(l, headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()
soup = BeautifulSoup(html_page, "lxml")
MathiasRa

1 Answer

There are two ways: either pass allow_redirects=False to requests.get to prevent the request from being redirected to another page, or check the canonical URL:

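from bs4 import BeautifulSoup
import requests

l = 'http://en.wikipedia.org/wiki/Google_Inc_Class_A'
req = requests.get(l, headers={'User-Agent': 'Mozilla/5.0'})
# Use the public .content attribute rather than the private ._content
soup = BeautifulSoup(req.content, "lxml")
# The canonical link, if present, names the page's preferred URL
canonical = soup.find('link', {'rel': 'canonical'})
print(canonical['href'] if canonical else None)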

You can see more here: "When I use python requests to check a site, if the site redirects me to another page, will I know?"
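
If you would rather stay with urllib as in the question, here is a minimal sketch of a third option: let the redirect happen and compare the address you ended up at with the one you asked for. urlopen follows redirects by default, and the response's geturl() reports the final address (with requests, response.url and response.history carry the same information). The helper names normalize, was_redirected, and fetch are made up for illustration, and treating http/https and a trailing slash as "the same page" is an assumption you may want to change. Note that this only detects real HTTP redirects; some sites serve the other page under the original address, in which case the canonical link above may be the only signal.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def normalize(url):
    """Reduce a URL to (host, path, query) so trivial differences are ignored.

    Assumption: scheme changes (http -> https) and a trailing slash
    do not count as landing on a different page.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    return host, path, parts.query


def was_redirected(requested_url, final_url):
    """Return True if the final URL differs from the requested one."""
    return normalize(requested_url) != normalize(final_url)


def fetch(url):
    """Fetch url and return (final_url, body); redirects are followed."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as response:
        return response.geturl(), response.read()


# Possible use inside the scraping loop (needs network access):
#   final_url, html_page = fetch(l)
#   if was_redirected(l, final_url):
#       print(l, "actually went to", final_url)
```

The comparison is kept separate from the download so you can adjust what counts as "a different page" without touching the fetching code.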

Jane