I have this string which is on one line:

https[:]//sometest[.]com,http[:]//differentt,est.net,https://lololo.com

Note that I purposely placed a , inside the second URL. I am trying to replace only the , that sits right before each http(s). So far I have tried this:

import re

pattern_src = r"http(.*)"
# Scan the file; mal_url ends up holding the last match found
for i, line_src in enumerate(open("/Users/test/Documents/tools/dump/email.txt")):
    for match in re.finditer(pattern_src, line_src):
        mal_url = match.group()
source_ = mal_url

string = source_
for ch in ["[", "]"]:
    for c in [","]:
        # This replaces every comma, including the one inside a URL
        string = string.replace(c, "\n")
        # Strip the defanging brackets
        string = string.replace(ch, "")
        with open("/Users/test/Documents/tools/dump/urls.txt", 'w') as file:
            file.write(string)
print(string)

But as you can see, this replaces every , in the string. So my question is: how would I go about replacing only the , before each http, so that every URL ends up on its own line?

uzdisral

1 Answer

>>> import re
>>> s = 'https[:]//sometest[.]com,http[:]//differentt,est.net,https://lololo.com'
>>> print(re.sub(r',(?=http)', '\n', s))
https[:]//sometest[.]com
http[:]//differentt,est.net
https://lololo.com

,(?=http) matches a , only if it is immediately followed by http. Here (?=http) is a positive lookahead assertion, which lets you check for a condition without consuming the characters it inspects, so the http itself is left in place.
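
To tie the lookahead back to the original goal (one cleaned URL per line), here is a minimal sketch that splits on the same pattern instead of substituting. The helper names split_urls and refang are made up for illustration and are not from the question; it also assumes that removing every [ and ] is what is wanted.

```python
import re


def split_urls(line):
    """Split a comma-joined line into URLs, breaking only at commas followed by http."""
    # ,(?=http) matches a comma only when "http" comes next. The lookahead
    # consumes nothing, so "http" stays attached to the URL that follows.
    return re.split(r",(?=http)", line.strip())


def refang(url):
    """Remove the [ and ] characters used to defang a URL (hypothetical helper)."""
    return url.replace("[", "").replace("]", "")


line = "https[:]//sometest[.]com,http[:]//differentt,est.net,https://lololo.com"
urls = [refang(u) for u in split_urls(line)]
print("\n".join(urls))
```

The comma inside differentt,est.net survives because it is not followed by http. To save the result, "\n".join(urls) could be written to the output file once, after all replacements are done, rather than inside a loop.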

See "Reference - What does this regex mean?" for details on lookarounds, or my book: https://learnbyexample.github.io/py_regular_expressions/lookarounds.html

Sundeep