I get data from an API in Django. The data comes from an order form on another website.

The data also includes a URL, for example example.com, but I can't validate the input because I don't have access to the order form.

The URL can arrive in several different forms. More examples:

example.de
http://example.de
www.example.com
https://example.de
http://www.example.de
https://www.example.de

Now I would like to open the URL to get the correct URL. For example, if I open example.com in my browser, I get the correct URL http://example.com/, and that is what I want for all URLs.

How can I do that quickly in Python?

Basti G.
  • "Now I would like to open the URL to get the correct URL". I'm not sure I understand what the issue is. If I understand correctly, you want to turn an incomplete URL like "example.com" into a complete one like "https://www.example.com". Is this correct? – wotanii Jan 30 '20 at 13:24
  • Yes, that is what I would like to do. See my comment under Joel's answer. – Basti G. Jan 30 '20 at 13:27
  • There is no easy solution to this, because there is no clean way to do what you want. If your input data is dirty, there is no good way to deal with it except telling the user to fix their input. Your browser "solves" this with some guesswork, which is not a clean solution either. I recommend checking whether the URL is valid (with Joel's first answer) and then letting the user handle any errors that come up. Adding https:// in front of the URL is as far as I would go when trying to correct the URL automatically. – wotanii Jan 30 '20 at 13:46
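
The conservative correction suggested in the last comment, prepending https:// only when no scheme is present and then checking the result, could be sketched with the standard library alone. The names `add_default_scheme` and `looks_like_http_url` are hypothetical, and the "contains a dot" test is a rough heuristic assumed here, not a full validation.

```python
from urllib.parse import urlsplit


def add_default_scheme(raw, scheme="https"):
    """Prepend a scheme if the input has none; otherwise return it unchanged."""
    url = raw.strip()
    if "://" not in url:
        url = f"{scheme}://{url}"
    return url


def looks_like_http_url(url):
    """Rough syntactic check: http(s) scheme and a host that contains a dot."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and "." in parts.netloc
```

This only checks the syntax. Whether the host actually answers still has to be tested with a request.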

1 Answer


If the request returns status_code 200, you know that you have a valid address.

Regarding https://: you will get an SSL error if you don't follow the answers in this guide. Once you have that in place, the program will find the correct URL for you.

import requests
import traceback

# Candidate prefixes, tried in this order.
validProtocols = ["https://www.", "http://www.", "https://", "http://"]

def removeAnyProtocol(url):
    # Strip a leading scheme and a leading "www." only at the start of the string,
    # so a "www." or "http://" appearing later in the URL is left untouched.
    url = url.strip()
    for prefix in ("https://", "http://", "www."):
        if url.lower().startswith(prefix):
            url = url[len(prefix):]
    return url

def validateUrl(url):
    bareUrl = removeAnyProtocol(url)
    for protocol in validProtocols:
        pUrl = protocol + bareUrl
        try:
            # HEAD keeps the check cheap; the timeout stops a dead host from blocking the loop.
            response = requests.head(pUrl, allow_redirects=True, timeout=5)
            if response.status_code == 200:
                return pUrl
        except requests.RequestException:
            print(traceback.format_exc())
    return None  # no candidate answered with status 200

Usage:

correctUrl = validateUrl("google.com") # https://www.google.com
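
Note that `validateUrl` returns the candidate string it built, not the address the server finally redirects to, which is what the question asks for. With `allow_redirects=True`, `requests` exposes that final address as `response.url`. A minimal sketch of such a variant follows. `candidate_urls`, `resolve_final_url`, and the injectable `head` parameter are hypothetical names introduced for illustration, not part of the code above.

```python
PREFIXES = ["https://www.", "http://www.", "https://", "http://"]


def candidate_urls(raw):
    """Build every scheme/www variant of a possibly incomplete address."""
    host = raw.strip()
    # Drop a leading scheme first, then a leading "www.", at the start only.
    for prefix in ("https://", "http://", "www."):
        if host.lower().startswith(prefix):
            host = host[len(prefix):]
    return [prefix + host for prefix in PREFIXES]


def resolve_final_url(raw, head=None, timeout=5):
    """Return the post-redirect URL of the first candidate answering 200, else None."""
    if head is None:
        import requests  # third-party; imported lazily so candidate_urls() has no dependency
        head = requests.head
    for candidate in candidate_urls(raw):
        try:
            response = head(candidate, allow_redirects=True, timeout=timeout)
        except OSError:
            # requests.RequestException derives from IOError (an alias of OSError),
            # so DNS failures, timeouts, and SSL errors all end up here.
            continue
        if response.status_code == 200:
            return response.url  # final address after redirects
    return None
```

Calling `resolve_final_url("example.com")` would return whatever address the server redirects to, or `None` if no candidate answers with 200. Some servers reject HEAD requests; passing `head=requests.get` is one way to work around that, at the cost of downloading the page body.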
Joel
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/206943/discussion-on-answer-by-joel-python-get-url-from-request). – Samuel Liew Jan 30 '20 at 23:38