How can I convert the URL into NDN format?

Question

I want to convert a URL to NDN format, as in this example:

https://stackoverflow.com/questions/ask

/com/stackoverflow/questions/ask

Using C++ or Python, how can I achieve this with regular expressions (regex)?

Emma · Accepted Answer · 2019-11-16T20:18:08.493

Method 1

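Maybe an expression such as

(?i)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]{2,6}(?:\.[a-z0-9]{2,6})?)\/?(.*)$

with a replacement of

/\2/\1/\3

might be worth looking into. The replacement puts the domain suffix (group 2) first, then the domain name (group 1), then the rest of the path (group 3).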

RegEx Demo

Test

import re

# Sample URLs, one per line
string = '''
https://stackoverflow.com/questions/ask
https://stackoverflow.co.uk/questions/ask
https://www.stackoverflow.com/questions/ask
http://www.stackoverflow.co.uk/questions/ask
http://www.stackoverflow.co.uk/
http://www.stackoverflow.co.uk
'''

# (?m) makes ^ and $ match at every line, so each URL is rewritten on its own
expression = r'(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]{2,6}(?:\.[a-z0-9]{2,6})?)\/?(.*)$'

# Group 2 is the suffix, group 1 the domain name, group 3 the rest of the path
print(re.sub(expression, r'/\2/\1/\3', string))

Output

/com/stackoverflow/questions/ask
/co.uk/stackoverflow/questions/ask
/com/stackoverflow/questions/ask
/co.uk/stackoverflow/questions/ask
/co.uk/stackoverflow/
/co.uk/stackoverflow/

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch there how it matches against some sample inputs.
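The Method 1 expression can also be annotated inline. The sketch below lays it out with Python's re.VERBOSE flag so that each part carries a comment; it is intended to be equivalent to the expression above, and the names PATTERN and to_ndn are made up for this illustration.

```python
import re

# Method 1's expression, laid out with re.VERBOSE so every part has a comment.
# PATTERN and to_ndn are illustrative names, not part of the original answer.
PATTERN = re.compile(r'''
    ^(?:https?://)              # scheme: http:// or https://
    (?:w{3}\.)?                 # optional leading www.
    ([^\r\n]+?)                 # group 1: domain name, as short as possible
    \.                          # dot before the suffix
    (                           # group 2: the suffix ...
        [a-z0-9]{2,6}           #   ... such as com or co
        (?:\.[a-z0-9]{2,6})?    #   ... with an optional second part (.uk)
    )
    /?                          # optional slash after the host
    (.*)$                       # group 3: the rest of the path
''', re.IGNORECASE | re.MULTILINE | re.VERBOSE)


def to_ndn(text):
    """Apply the Method 1 replacement to every URL in text."""
    return PATTERN.sub(r'/\2/\1/\3', text)


print(to_ndn('https://stackoverflow.com/questions/ask'))  # /com/stackoverflow/questions/ask
print(to_ndn('http://www.stackoverflow.co.uk'))           # /co.uk/stackoverflow/
```

Note that this keeps a two-part suffix such as co.uk together as a single component, exactly as Method 1 does.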

RegEx Circuit

jex.im visualizes regular expressions.

Method 2

Here, we'd split the suffix into two capturing groups and make the second one optional, to handle instances such as .co.uk:

RegEx Demo 2

Test

import re

string = '''
https://stackoverflow.com/questions/ask
https://stackoverflow.co.uk/questions/ask
https://www.stackoverflow.com/questions/ask
http://www.stackoverflow.co.uk/questions/ask
http://www.stackoverflow.co.uk/
http://www.stackoverflow.co.uk
'''

expression = r'(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]{2,6})\.?([a-z0-9]{2,6})?\/?(.*)$'

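output = []
# findall returns one tuple per URL: (name, suffix part 1, suffix part 2, path)
for match in re.findall(expression, string):
    if match[2] != '':
        # Two-part suffix such as co.uk: emit the parts in reverse order
        NDN = '/' + match[2] + '/' + match[1] + '/' + match[0] + '/' + match[3]
    else:
        NDN = '/' + match[1] + '/' + match[0] + '/' + match[3]

    # Always end the name with a slash
    if NDN[-1] != '/':
        NDN = NDN + '/'

    output.append(NDN)

print(output)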

Output

['/com/stackoverflow/questions/ask/', '/uk/co/stackoverflow/questions/ask/', '/com/stackoverflow/questions/ask/', '/uk/co/stackoverflow/questions/ask/', '/uk/co/stackoverflow/', '/uk/co/stackoverflow/']

Method 3

If there were one or more subdomains, such as:

http://www.subdomain1.subdomain2.stackoverflow.co.uk
http://www.subdomain1.subdomain2.subdomain3.stackoverflow.co.uk

then we'd simply add a \.?([a-z0-9]+)? to our expression for each subdomain:

(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]+)\.?([a-z0-9]+)?\.?([a-z0-9]+)?\.?([a-z0-9]+)?\.?([a-z0-9]+)?\/?(.*)$
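Method 3 comes without a test, so here is a minimal sketch of how its groups could be assembled, following the reverse-order idea from Method 2. The helper name method3_to_ndn, the skipping of unmatched optional groups, and the trailing slash are assumptions for illustration, not part of the original answer.

```python
import re

# The Method 3 expression, split across lines only for readability.
EXPRESSION = (r'(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]+)'
              r'\.?([a-z0-9]+)?\.?([a-z0-9]+)?\.?([a-z0-9]+)?\.?([a-z0-9]+)?'
              r'\/?(.*)$')


def method3_to_ndn(url):
    """Hypothetical helper: reverse the host labels captured by Method 3."""
    match = re.search(EXPRESSION, url)
    if match is None:
        return None
    # Groups 1-6 are host labels (unmatched ones become ''), group 7 is the path.
    *labels, path = match.groups(default='')
    # Reverse the labels and drop the empty optional groups.
    parts = [label for label in reversed(labels) if label]
    name = '/' + '/'.join(parts) + '/' + path
    # Same convention as Method 2: always end with a slash.
    return name if name.endswith('/') else name + '/'


print(method3_to_ndn('http://www.subdomain1.subdomain2.stackoverflow.co.uk'))
# /uk/co/stackoverflow/subdomain2/subdomain1/
print(method3_to_ndn('https://stackoverflow.com/questions/ask'))
# /com/stackoverflow/questions/ask/
```

As written, the expression captures at most six host labels; a longer host would need further optional groups.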

RegEx Demo 3
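Because the number of host labels varies, an alternative that avoids a fixed number of groups is to split the host on dots and reverse the pieces. The sketch below swaps the regular expression for the standard library's urllib.parse.urlsplit, so it is a different technique from the methods above. The helper url_to_ndn is hypothetical, and dropping a leading www, the query string, and any trailing slash are assumptions chosen to match the example in the question.

```python
from urllib.parse import urlsplit


def url_to_ndn(url):
    """Hypothetical helper: reversed host labels followed by the path segments."""
    parts = urlsplit(url)
    # hostname is lower-cased and excludes any port; it is None for non-URLs.
    labels = (parts.hostname or '').split('.')
    # Assumption: a leading www label is not wanted in the name.
    if labels and labels[0] == 'www':
        labels = labels[1:]
    labels.reverse()
    # Keep non-empty path segments only, which also drops a trailing slash.
    segments = [segment for segment in parts.path.split('/') if segment]
    return '/' + '/'.join(labels + segments)


print(url_to_ndn('https://stackoverflow.com/questions/ask'))
# /com/stackoverflow/questions/ask
print(url_to_ndn('http://www.stackoverflow.co.uk/questions/ask'))
# /uk/co/stackoverflow/questions/ask
```

This handles any number of subdomains without changing the code, at the cost of not being a pure regex solution.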

Thank you very much for your suggestion, it is very helpful, but your solution does not cover a wide range of websites. For example, http://www.stackoverflow.co.uk/questions/ask must become uk/co/stackoverflow/questions/ask — barbosa, Nov 16 '19 at 19:49
What I want is for the elements between the slashes, as in /stackoverflow.co.uk/, to come out like this: uk/co/stackoverflow — barbosa, Nov 16 '19 at 19:56
