Find broken links with Selenium. HTTP 302, HTTP 404 expected

Question

I'm trying to find broken links in a page. My code works, but every link returns a 302 status code. At first I thought that was fine, but then I manually found that one page returned a 404 error. I then read up on what a 302 means, and I think I mostly get it, but is there a way to get the status code that the redirect ultimately returns? In case it helps, here's my code:

import requests
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options,
                          executable_path='C:\\Chromedriver\\chromedriver.exe')
driver.get('https://pageURL.com')
links = driver.find_elements_by_css_selector("a")
for link in links:
    href = link.get_attribute('href')  # read the attribute once per link
    # startswith() avoids a hard-coded slice length that does not match the prefix
    if href is not None and href.startswith('https://URLstart'):
        r = requests.head(href)
        print(href, r.status_code)

score 0 · Answer 1 · answered Sep 19 '18 at 02:50

When you use requests.head(), it doesn't follow redirects by default. For that, use allow_redirects=True. (The other HTTP methods follow redirects by default.)

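The response's status_code is always that of the final response, after any redirects have been followed. If there were redirects and you want the intermediate statuses, use the response's history attribute (r.history), which lists the earlier responses in order. Example: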

>>> import requests
>>> r = requests.head('http://google.com')  # default behaviour for HEAD
>>> r.status_code
301
>>>
>>> r = requests.head('http://google.com', allow_redirects=True)
>>> r.status_code
200
>>> r.url
'http://www.google.com/'
>>> r.history
[<Response [301]>]
>>> r.history[0].status_code
301
>>> r.history[0].url
'http://google.com/'

See this answer for an example of how you could iterate over your history.
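
As a rough sketch of what iterating over the history could look like for a link checker: the helper names redirect_chain, is_broken and check_link are made up for illustration, as is the choice to treat any final status of 400 or above as "broken". The stand-in object at the end only imitates the three attributes of a requests response that the helpers read, so the example runs without touching the network.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = list(response.history) + [response]
    return [(hop.status_code, hop.url) for hop in hops]


def is_broken(response):
    """Assumption: a final status of 400 or above counts as a broken link."""
    return response.status_code >= 400


def check_link(url, timeout=10):
    """Send a HEAD request that follows redirects and summarize the result."""
    import requests  # imported here so the helpers above work without it

    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return redirect_chain(response), is_broken(response)


# Stand-in for a real response: one 302 hop that ends in a 404.
first_hop = SimpleNamespace(status_code=302, url='https://example.com/old', history=[])
fake = SimpleNamespace(status_code=404, url='https://example.com/missing',
                       history=[first_hop])

for status, url in redirect_chain(fake):
    print(status, url)
print('broken:', is_broken(fake))
```

In the loop from the question, check_link(href) would take the place of the bare requests.head(href) call, so a link that redirects to a missing page is reported with its final 404 rather than the initial 302. Some servers answer HEAD differently from GET, so falling back to requests.get for suspicious results may be worth considering.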
