I'm not comfortable with regex, so I need your help with this one, which seems tricky to me.
Let's say I've got the following string:
string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
What regex would extract title:hello and title:world, remove them from the original string, and leave "title:quoted" in it, because it is surrounded by double quotes?
I've already seen this similar SO answer, and here is what I ended up with:
import re

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

def replace(m):
    if m.group(1) is None:
        return m.group()
    return m.group().replace(m.group(1), "")

regex = r'\"[^\"]title:[^\s]+\"|([^\"]*)'
cleaned_string = re.sub(regex, replace, string)
assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
Of course, it does not work, and I'm not surprised, because regexes are esoteric to me.
Thank you for your help!
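For reference, the attempt above fails for two reasons. The first alternative requires exactly one non-quote character between the opening quote and title:, so it never matches "title:quoted". The second alternative captures any run of non-quote characters, so whole chunks of the string get removed. A minimal sketch of how the same alternation idea could be made to work, assuming quoted segments never contain escaped quotes (the names pattern, titles and drop_unquoted_title are illustrative only):

```python
import re

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

# First alternative: consume a whole double-quoted segment without capturing it.
# Second alternative: capture an unquoted title:... token in group 1.
pattern = re.compile(r'"[^"]*"|(title:\S+)')

titles = []

def drop_unquoted_title(m):
    if m.group(1) is None:
        # A quoted segment matched: put it back unchanged.
        return m.group(0)
    # An unquoted title token matched: record it and remove it.
    titles.append(m.group(1))
    return ""

# Substitute, then collapse the whitespace left behind by the removals.
cleaned = " ".join(pattern.sub(drop_unquoted_title, string).split())
```

Because the quoted alternative is tried first and consumes the whole segment, anything inside double quotes is skipped regardless of where title: sits within it.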
Final solution
Thanks to your answers, here is the final solution, which works for my needs:
import re

matches = []

def replace(m):
    # Record each removed token, then drop it from the string
    matches.append(m.group())
    return ""

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
# Raw string, so that \s is not treated as an invalid string escape
regex = r'(?<!")title:[^\s]+(?!")'
cleaned_string = re.sub(regex, replace, string)
# remove extra whitespace
cleaned_string = ' '.join(cleaned_string.split())
assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
assert matches[0] == "title:hello"
assert matches[1] == "title:world"
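A caveat worth checking before relying on this more widely: the lookbehind only inspects the single character immediately before title:, so the pattern protects quoted tokens that start with title: but not a title: that appears later inside a multi-word quoted phrase. A quick check on a hypothetical input (the sample string is made up for illustration):

```python
import re

# Same pattern as in the final solution.
regex = r'(?<!")title:[^\s]+(?!")'

# Hypothetical input: title: appears in the middle of a quoted phrase.
sample = 'keyword1 "some title:quoted phrase" title:hello'

found = re.findall(regex, sample)
print(found)  # ['title:quoted', 'title:hello']
```

If quoted phrases with several words can occur in the input, matching whole quoted segments first and capturing only what lies outside them is likely the safer route.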