Can't get this 'url-label' regex to be accurate

Question

I am writing a Python regex to match URLs. I have a reasonably complex one that works fairly well, and I am trying to make the label part of the hostname more accurate based on what I read on Wikipedia.

As the code snippet below shows, I have a truth table to test the label part I have written. The truth table is based on 'Restrictions on hostnames' in the Wikipedia article on hostnames.

I just can't get case 1 (strings[0]) to work without breaking other cases, and I cannot figure out why case 1 doesn't work. I have used debugging tools, but the failing string is too short for them to give me any significant information.

Please help me fix the regex I have written so that I understand what I am missing.

If you are wondering why I am not using a third-party library to match URLs: I am learning regular expressions.

import re

F = False
T = True

results = [T,F,T,F,F,F,T,F,T,F,F,F,F,F]

strings = ['a','-','aa','--','-a','a-','aaa','aa-','a-a','a--','-aa','-a-','--a','---']

x = list(range(len(strings)))

regex_test = r'(?P<Label>(?P<Label_start>[a-zA-Z0-9])((?P<Label_mid>[a-zA-Z0-9-]{1,61})?)(?=(?P<Label_end>[a-zA-Z0-9](?=[./])))(?P=Label_end)?)\.'

if len(strings) == len(results):
    for n in x:
        if results[n] == bool(re.match(pattern = regex_test, string = strings[n] + '.')):
            print("Works.")
            if results[n] == True:
                print(re.match(pattern = regex_test, string = strings[n] + '.').groupdict())
        else:
            print("Bug for: " + strings[n])
            #print(str(re.match(pattern = regex_test, string = strings[n] + '.',flags=re.DEBUG)))

Possible duplicate of [Domain name validation with RegEx](https://stackoverflow.com/questions/10306690/domain-name-validation-with-regex) — Michael Molter, Jun 12 '17 at 15:42
My regex is nearly accurate. Please point out why the regex fails to match a single character. That would help me understand where I went wrong in my regex, and I would highly appreciate it. Also, I need help only with the label, not the hostname. The '.' is being used as an end flag because this regex will be followed by the rest of the hostname regex and then the rest of the URL regex. — Tushar Vazirani, Jun 12 '17 at 18:06
I would try playing around with your regex on something like regex101.com. Also try simplifying the syntax (eliminate the named groups) to try and spot your error. You can add them back in later. — Michael Molter, Jun 12 '17 at 18:13
I have tried to debug it online but since the failure case is a very small string, even tools like Debuggex couldn't give me enough information. — Tushar Vazirani, Jun 12 '17 at 18:38
This regex `^(([[:alnum:]]{1,2})|([[:alnum:]][[:alnum:]-]+a))$` also passes all of your test cases (try it out https://regex101.com/r/Gv1ib2/2). It uses alternation to catch the pesky 1-2 character cases. — Michael Molter, Jun 12 '17 at 19:02
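On the single-character case raised in the comments above: in the question's pattern, the lookahead that captures Label_end is not optional. After Label_start consumes the only character, the engine still requires one more alphanumeric before the '.', so 'a.' cannot match. Below is a minimal sketch of one possible fix, not necessarily the only one. The names LABEL_FIXED and label_matches are made up for illustration. It wraps Label_mid and Label_end in a single optional group and reuses the question's truth table.

```python
import re

# Hypothetical rewrite of the question's pattern: the mid and end parts sit
# in one optional group, so a lone Label_start may be followed directly by '.'.
LABEL_FIXED = (
    r'(?P<Label>'
    r'(?P<Label_start>[a-zA-Z0-9])'
    r'(?:(?P<Label_mid>[a-zA-Z0-9-]{0,61})(?P<Label_end>[a-zA-Z0-9]))?'
    r')\.'
)

# Truth table copied from the question.
strings = ['a', '-', 'aa', '--', '-a', 'a-', 'aaa', 'aa-', 'a-a', 'a--',
           '-aa', '-a-', '--a', '---']
results = [True, False, True, False, False, False, True, False, True, False,
           False, False, False, False]


def label_matches(label):
    # Same convention as the question: '.' is appended as the end flag.
    return re.match(LABEL_FIXED, label + '.') is not None


actual = [label_matches(s) for s in strings]
for s, expected, got in zip(strings, results, actual):
    print('{!r}: {}'.format(s, 'Works.' if expected == got else 'Bug'))
```

With this arrangement Label_mid and Label_end are None for a one-character label, and the label is still capped at 63 characters (1 start, up to 61 middle, 1 end).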

score 0 · Answer 1 · answered Jun 12 '17 at 04:30

Sometimes it is easier (and more readable) to just use Python! However, there is one portion of the domain name validation that benefits from re. See my attempt here, or my full snippet below.

# What are valid hostnames?
#    (1) Series of labels separated by periods.
#    (2) Each label can contain only letters, numbers and hyphens.
#    (3) Each label can not start or end with a hyphen.
#    (4) Each label can be 1-63 characters
#    (5) Entire domain cannot be longer than 253 characters.

import re

domains = ['en.wikipedia.org',
           'my.his-site.com',
           '-bad.site-.5345']

def is_valid_hostname(domain):
    # Utilize rule 1 to split into labels.
    labels = domain.split('.')

    for label in labels:
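        # Check rules 2 and 3. The optional group lets a one-character
        # label pass; A-Za-z avoids the punctuation that A-z also covers.
        if not re.search(r'^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$', label):
            return False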

        # Check rule 4.
        if not (0 < len(label) < 64):
            return False

    # Check rule 5.
    if len(domain) > 253:
        return False

    return True


for domain in domains:
    if is_valid_hostname(domain):
        print('{}:\tvalid'.format(domain))
    else:
        print('{}:\tinvalid'.format(domain))

This is going to be part of a much bigger regex, so I need this solely as a regular expression. Also, domain name length is not a consideration at the moment. But thanks for taking the time out and understanding my problem. — Tushar Vazirani, Jun 12 '17 at 11:00
So you want a regex only solution? In that case, `^[0-9\p{L}][0-9\p{L}-\.]{1,61}[0-9\p{L}]\.[0-9\p{L}][\p{L}-]*[0-9\p{L}]+$` would do the job (see duplicate https://stackoverflow.com/a/38477788/1713185) — Michael Molter, Jun 12 '17 at 15:41
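A note on the pattern in the last comment: Python's built-in re module does not support \p{L}, so that expression would need the third-party regex module or a rewrite. As a rough, ASCII-only sketch of a regex-only hostname check, the label pattern can be repeated with dots between labels. LABEL, HOSTNAME and is_hostname are illustrative names, and the 253-character limit is deliberately left out since the asker set it aside.

```python
import re

# Assumption: ASCII-only labels (letters, digits, hyphens), 1-63 characters,
# with no leading or trailing hyphen. Overall hostname length is not checked.
LABEL = r'[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?'

# Zero or more "label." prefixes followed by a final label.
HOSTNAME = re.compile(r'(?:{0}\.)*{0}'.format(LABEL))


def is_hostname(text):
    # fullmatch requires the whole string to fit the pattern.
    return HOSTNAME.fullmatch(text) is not None


for domain in ['en.wikipedia.org', 'my.his-site.com', '-bad.site-.5345']:
    print('{}:\t{}'.format(domain, 'valid' if is_hostname(domain) else 'invalid'))
```

Because LABEL is a plain string, it can also be dropped into a larger URL pattern on its own, which seems closer to what the question is after.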
