I want to convert a URL to NDN format, as in this example:
https://stackoverflow.com/questions/ask
/com/stackoverflow/questions/ask
Using C++ or Python, how can I achieve this with regular expressions (regex)?
An expression such as
(?i)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]{2,6}(?:\.[a-z0-9]{2,6})?)\/?(.*)$
with a replacement of
/\2/\1/\3
might be a reasonable place to start. A quick test in Python:
import re
string = '''
https://stackoverflow.com/questions/ask
https://stackoverflow.co.uk/questions/ask
https://www.stackoverflow.com/questions/ask
http://www.stackoverflow.co.uk/questions/ask
http://www.stackoverflow.co.uk/
http://www.stackoverflow.co.uk
'''
expression = r'(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]{2,6}(?:\.[a-z0-9]{2,6})?)\/?(.*)$'
print(re.sub(expression, r'/\2/\1/\3', string))
Output:
/com/stackoverflow/questions/ask
/co.uk/stackoverflow/questions/ask
/com/stackoverflow/questions/ask
/co.uk/stackoverflow/questions/ask
/co.uk/stackoverflow/
/co.uk/stackoverflow/
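To make the groups easier to follow, the same pattern can be written with re.VERBOSE and wrapped in a small helper. This is only a sketch: the names NDN_PATTERN and url_to_ndn are made up for illustration, and it assumes a single URL per call rather than a multi-line string.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part can be commented.
NDN_PATTERN = re.compile(r'''
    ^(?:https?://)              # scheme, http or https
    (?:w{3}\.)?                 # optional leading "www."
    ([^\r\n]+?)                 # group 1: name before the TLD (lazy)
    \.
    ([a-z0-9]{2,6}              # group 2: TLD such as "com" ...
        (?:\.[a-z0-9]{2,6})?)   # ... optionally with a second part, e.g. "co.uk"
    /?                          # optional slash after the host
    (.*)$                       # group 3: rest of the path
''', re.IGNORECASE | re.VERBOSE)


def url_to_ndn(url):
    """Return the rewritten name, or the input unchanged if the pattern does not match."""
    return NDN_PATTERN.sub(r'/\2/\1/\3', url)


print(url_to_ndn('https://stackoverflow.com/questions/ask'))
# /com/stackoverflow/questions/ask
print(url_to_ndn('http://www.stackoverflow.co.uk'))
# /co.uk/stackoverflow/
```

Because re.sub leaves non-matching input untouched, a string that is not an http(s) URL comes back as-is; whether that is the desired behaviour depends on the caller.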
If you wish to simplify, modify, or explore the expression, you can paste it into regex101.com: the top-right panel explains each part, and you can see how it matches against sample inputs. jex.im can also visualize regular expressions as a diagram.
To split a two-part suffix such as .co.uk into its own components, we can capture its second part in a separate optional capturing group and build the name in code:
import re
string = '''
https://stackoverflow.com/questions/ask
https://stackoverflow.co.uk/questions/ask
https://www.stackoverflow.com/questions/ask
http://www.stackoverflow.co.uk/questions/ask
http://www.stackoverflow.co.uk/
http://www.stackoverflow.co.uk
'''
expression = r'(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]{2,6})\.?([a-z0-9]{2,6})?\/?(.*)$'
output = []
for match in re.findall(expression, string):
    # match[2] is the optional second suffix part (e.g. "uk" in "co.uk")
    if match[2] != '':
        NDN = '/' + match[2] + '/' + match[1] + '/' + match[0] + '/' + match[3]
    else:
        NDN = '/' + match[1] + '/' + match[0] + '/' + match[3]
    # make sure every name ends with a trailing slash
    if NDN[-1] != '/':
        NDN = NDN + '/'
    output.append(NDN)
print(output)
Output:
['/com/stackoverflow/questions/ask/', '/uk/co/stackoverflow/questions/ask/', '/com/stackoverflow/questions/ask/', '/uk/co/stackoverflow/questions/ask/', '/uk/co/stackoverflow/', '/uk/co/stackoverflow/']
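The loop above relies on how re.findall behaves when a pattern has several capturing groups: it returns one tuple per match, with an empty string for any optional group that did not take part in the match. A small illustration on a single input line (the variable name matches is just for this example):

```python
import re

expression = r'(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]{2,6})\.?([a-z0-9]{2,6})?\/?(.*)$'

# One tuple per match; index 0 is group 1, index 1 is group 2, and so on.
matches = re.findall(expression, 'https://stackoverflow.com/questions/ask')
print(matches)
# [('stackoverflow', 'com', '', 'questions/ask')]
```

The empty string at index 2 is what the `match[2] != ''` check tests for.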
If there are one or more subdomains, such as:
http://www.subdomain1.subdomain2.stackoverflow.co.uk
http://www.subdomain1.subdomain2.subdomain3.stackoverflow.co.uk
then we can add a \.?([a-z0-9]+)? for each subdomain to the expression:
(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]+)\.?([a-z0-9]+)?\.?([a-z0-9]+)?\.?([a-z0-9]+)?\.?([a-z0-9]+)?\/?(.*)$
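Adding one group per subdomain gets unwieldy quickly. An alternative sketch, which is a different technique from the expression above, is to let the regex capture only the host and the path, then reverse the dot-separated host labels in Python. The helper name url_to_ndn_labels is hypothetical. This variant assumes every label (including "co" and "uk") becomes its own name component, that a leading "www." is dropped, and that no trailing slash is appended; ports and query strings are not handled.

```python
import re

# Capture the host (without scheme and optional "www.") and the remaining path.
HOST_PATH = re.compile(r'^(?:https?://)(?:w{3}\.)?([^/\r\n]+)/?(.*)$', re.IGNORECASE)


def url_to_ndn_labels(url):
    match = HOST_PATH.match(url.strip())
    if match is None:
        raise ValueError('not an http(s) URL: ' + url)
    host, path = match.groups()
    # Reverse the host labels: a.b.example.co.uk -> uk/co/example/b/a
    parts = list(reversed(host.split('.')))
    if path:
        parts.append(path)
    return '/' + '/'.join(parts)


print(url_to_ndn_labels('http://www.subdomain1.subdomain2.stackoverflow.co.uk'))
# /uk/co/stackoverflow/subdomain2/subdomain1
print(url_to_ndn_labels('https://stackoverflow.com/questions/ask'))
# /com/stackoverflow/questions/ask
```

Since the label count is handled by split rather than by the pattern, the same helper works for any number of subdomains.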