I have a .txt file that contains a list of URLs. The structure of the URLs varies: some begin with https, some with http, others with just www, and others with just the domain name (stackoverflow.com). An example of the .txt file's content is:

www.google.com
microsoft.com
https://www.yahoo.com
http://www.bing.com

What I want to do is parse through the list and check whether each URL is live. In order to do that, the structure of the URL must be correct, otherwise the request will fail. Here's my code so far:

import requests

with open('urls.txt', 'r') as f:
    urls = f.readlines()
    for url in urls:
        url = url.replace('\n', '')
        if not url.startswith('http'):  #This is to handle just domain names and those that begin with 'www'
            url = 'http://' + url
        if url.startswith('http:'):
            print("trying url {}".format(url))
            response = requests.get(url, timeout=10)
            status_code = response.status_code
            if status_code == 200:
                continue
            else:
                print("URL {} has a response code of {}".format(url,  status_code))
                print("encountered error. Now trying with https")
                url = url.replace('http://', 'https://')
                print("Now replacing http with https and trying again")
                response = requests.get(url, timeout=10)
                status_code = response.status_code
                print("URL {} has a response code of {}".format(url,  status_code))
        else:
            response = requests.get(url, timeout=10)
            status_code = response.status_code
            print("URL {} has a response code of {}".format(url,  status_code))

I feel like I've overcomplicated this somewhat, and there must be an easier way of trying variants (i.e. the bare domain name, the domain with 'www.' prepended, with 'http://' prepended, and with 'https://' prepended) until a site is identified as being live or not (i.e. all variants have been exhausted).

Any suggestions on my code, or a better way to approach this? In essence, I want to normalise the formatting of each URL before attempting to check its status.
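
To make it clearer what I mean by "trying variants", here's a rough sketch of the kind of loop I have in mind; the helpers `candidate_urls` and `is_live` are made-up names, just for illustration:

import requests

def candidate_urls(raw):
    # Build scheme/host variants from one raw entry in the file
    bare = raw.replace('https://', '').replace('http://', '')
    variants = [bare]
    if not bare.startswith('www.'):
        variants.append('www.' + bare)
    # Try https first, then http, for each host variant
    return ['https://' + v for v in variants] + ['http://' + v for v in variants]

def is_live(raw):
    # Return the first variant that answers 200, or None once all are exhausted
    for url in candidate_urls(raw):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 200:
                return url
        except requests.exceptions.RequestException:
            continue  # DNS failure, refused connection, timeout: try the next variant
    return None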

Thanks in advance

thefragileomen

  • Why do you have this block: `if url.startswith('http:'):`? You're calling `response = requests.get(url, timeout=10)` in both the `if` and the `else`. Remove the second `if` block altogether – Tgsmith61591 Feb 04 '20 at 20:10
  • Maybe just grab the bare domain (whatever sits between any 'www.' prefix and the first '/'), format it with https://, check the response, and then try http:// if that doesn't work – Ironkey Feb 04 '20 at 20:10
  • Don't use `readlines`; just iterate over `f` itself. Also, `url = url.rstrip('\n')` would be the idiomatic way to remove the trailing newline (though many use the simpler `url = url.strip()` under the assumption that the newline is the *only* bit of leading or trailing whitespace; a minimal sketch of this appears after these comments) – chepner Feb 04 '20 at 20:12
  • What is the issue, exactly? If you're asking how to validate a url, there are already plenty of resources on the subject. See, amongst others: https://stackoverflow.com/q/7160737/11301900, https://stackoverflow.com/q/827557/11301900. – AMC Feb 04 '20 at 21:08
  • Does this answer your question? [How do you validate a URL with a regular expression in Python?](https://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python) – AMC Feb 04 '20 at 21:11
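
A minimal sketch of chepner's suggestion above, assuming the file holds one URL per line:

with open('urls.txt', 'r') as f:
    for line in f:           # iterate over the file object directly; no readlines() needed
        url = line.strip()   # drops the trailing newline (and any stray whitespace)
        if not url:
            continue         # skip blank lines
        print(url)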

1 Answer

This is a little too long for a comment, but yes, this can be simplified, starting by replacing the `startswith` part:

if '//' not in url:
    url = 'http://' + url
    response = requests.get(url, timeout=10)

etc.
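
Fleshing that out, a minimal sketch of the whole loop might look something like this (keeping your http-then-https fallback, and catching connection errors so a dead host doesn't raise):

import requests

with open('urls.txt', 'r') as f:
    for line in f:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        if '//' not in url:
            url = 'http://' + url
        try:
            status_code = requests.get(url, timeout=10).status_code
        except requests.exceptions.RequestException:
            status_code = None  # request failed outright (DNS, refused, timeout)
        if status_code != 200:
            # fall back to https; the replace is a no-op if the URL was already https
            url = url.replace('http://', 'https://')
            try:
                status_code = requests.get(url, timeout=10).status_code
            except requests.exceptions.RequestException as e:
                print("URL {} failed entirely: {}".format(url, e))
                continue
        print("URL {} has a response code of {}".format(url, status_code))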

Jack Fleeting