I want to convert a URL to NDN name format, as in this example:

https://stackoverflow.com/questions/ask

/com/stackoverflow/questions/ask 

Using C++ or Python, how can I achieve this with regular expressions (regex)?

Zohaib Amir
barbosa

1 Answer

Method 1

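One option is the expression

(?i)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]{2,6}(?:\.[a-z0-9]{2,6})?)\/?(.*)$

with the replacement

/\2/\1/\3

which moves the domain suffix (group 2) in front of the domain name (group 1) and appends the path (group 3). Note that a two-part suffix such as co.uk stays together as one component.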

Test

import re

string = '''
https://stackoverflow.com/questions/ask
https://stackoverflow.co.uk/questions/ask
https://www.stackoverflow.com/questions/ask
http://www.stackoverflow.co.uk/questions/ask
http://www.stackoverflow.co.uk/
http://www.stackoverflow.co.uk
'''

expression = r'(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]{2,6}(?:\.[a-z0-9]{2,6})?)\/?(.*)$'

print(re.sub(expression, r'/\2/\1/\3', string))

Output

/com/stackoverflow/questions/ask
/co.uk/stackoverflow/questions/ask
/com/stackoverflow/questions/ask
/co.uk/stackoverflow/questions/ask
/co.uk/stackoverflow/
/co.uk/stackoverflow/

If you wish to simplify, modify, or explore the expression, regex101.com explains it in its top-right panel and shows how it matches against sample inputs.


Method 2

Here, we add an optional third capturing group for two-part suffixes such as .co.uk, so that their parts can be reversed separately:

Test

import re

string = '''
https://stackoverflow.com/questions/ask
https://stackoverflow.co.uk/questions/ask
https://www.stackoverflow.com/questions/ask
http://www.stackoverflow.co.uk/questions/ask
http://www.stackoverflow.co.uk/
http://www.stackoverflow.co.uk
'''

expression = r'(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]{2,6})\.?([a-z0-9]{2,6})?\/?(.*)$'

output = []
for match in re.findall(expression, string):
    if match[2] != '':
        NDN = '/' + match[2] + '/' + match[1] + '/' + match[0] + '/' + match[3]
    else:
        NDN = '/' + match[1] + '/' + match[0] + '/' + match[3]

    if NDN[-1] != '/':
        NDN = NDN + '/'

    output.append(NDN)

print(output)

Output

['/com/stackoverflow/questions/ask/', '/uk/co/stackoverflow/questions/ask/', '/com/stackoverflow/questions/ask/', '/uk/co/stackoverflow/questions/ask/', '/uk/co/stackoverflow/', '/uk/co/stackoverflow/']

Method 3

If there are one or more subdomains, such as:

http://www.subdomain1.subdomain2.stackoverflow.co.uk
http://www.subdomain1.subdomain2.subdomain3.stackoverflow.co.uk

Then we simply add a \.?([a-z0-9]+)? to the expression for each additional subdomain:

(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]+)\.?([a-z0-9]+)?\.?([a-z0-9]+)?\.?([a-z0-9]+)?\.?([a-z0-9]+)?\/?(.*)$
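
As a rough sketch of how the Method 3 expression could be applied in Python (only the pattern is given above), the snippet below collects the non-empty label groups in reverse order and then appends the path. The function name to_ndn is hypothetical, and unlike Method 2 it adds no trailing slash, which is an assumption based on the example in the question. Since the expression has one lazy label group plus five more, hosts with more than six labels (after an optional www.) are not handled correctly.

```python
import re

# Method 3 expression: a lazy first label, up to five further labels, then the path.
EXPRESSION = re.compile(
    r'(?im)^(?:https?:\/\/)(?:w{3}\.)?([^\r\n]+?)\.([a-z0-9]+)'
    r'\.?([a-z0-9]+)?\.?([a-z0-9]+)?\.?([a-z0-9]+)?\.?([a-z0-9]+)?\/?(.*)$'
)


def to_ndn(url):
    """Return an NDN-style name for url, or None if the pattern does not match.

    Hypothetical helper: the name and the no-trailing-slash choice are assumptions.
    """
    match = EXPRESSION.match(url.strip())
    if match is None:
        return None
    # Groups 1-6 are host labels (unused ones become ''), group 7 is the path.
    *labels, path = match.groups(default='')
    # Reverse the host labels and drop the unused optional groups.
    components = [label for label in reversed(labels) if label]
    if path:
        components.append(path)
    return '/' + '/'.join(components)


if __name__ == '__main__':
    for url in (
        'https://stackoverflow.com/questions/ask',
        'http://www.stackoverflow.co.uk/questions/ask',
        'http://www.subdomain1.subdomain2.stackoverflow.co.uk',
    ):
        print(to_ndn(url))
```

For example, to_ndn('http://www.subdomain1.subdomain2.stackoverflow.co.uk') returns '/uk/co/stackoverflow/subdomain2/subdomain1'.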

Emma

Thank you very much for your suggestion, it is very helpful. The problem is that your solution does not cover a wide range of websites: for example, http://www.stackoverflow.co.uk/questions/ask must become uk/co/stackoverflow/questions/ask – barbosa Nov 16 '19 at 19:49

What I want is for the elements of stackoverflow.co.uk to be reversed, like this: uk/co/stackoverflow – barbosa Nov 16 '19 at 19:56

Thank you very much, that is exactly what I need – barbosa Nov 16 '19 at 21:15