3

I'm not comfortable with regex, so I need your help with this one, which seems tricky to me.

Let's say I've got the following string:

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

What regex would extract title:hello and title:world, remove them from the original string, and leave "title:quoted" in place because it is surrounded by double quotes?

I've already seen this similar SO answer, and here is what I ended up with:

import re

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

def replace(m):
    if m.group(1) is None:
        return m.group()

    return m.group().replace(m.group(1), "")

regex = r'\"[^\"]title:[^\s]+\"|([^\"]*)'
cleaned_string = re.sub(regex, replace, string)

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'

Of course, it does not work, and I'm not surprised, because regexes are esoteric to me.

Thank you for your help!

Final solution

Thanks to your answers, here is the final solution, which works for my needs:

import re
matches = []

def replace(m):
    matches.append(m.group())
    return ""

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = r'(?<!")title:[^\s]+(?!")'
cleaned_string = re.sub(regex, replace, string)

# remove extra whitespace
cleaned_string = ' '.join(cleaned_string.split())

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
assert matches[0] == "title:hello"
assert matches[1] == "title:world"
Agate
  • So you want to match `[keyword1, keyword2, title:hello, title:world, keyword3]`? – PepperoniPizza Jun 08 '14 at 21:47
  • In fact, I want to match `[title:hello, title:world]`, and remove both of them from the string. – Agate Jun 08 '14 at 21:50
  • There is a very simple regex for this, it is similar to this question about [regex-matching a pattern except when...](http://stackoverflow.com/questions/23589174/match-or-replace-a-pattern-except-in-situations-s1-s2-s3-etc/23589204#23589204). Give me a second to write an answer. :) – zx81 Jun 08 '14 at 22:01
  • Okay, FYI added explanation to the regex and online demo. – zx81 Jun 08 '14 at 22:13

4 Answers

6

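You can check for word boundaries (\b). Note that \b also matches between a double quote and the following "t", so the quoted title survives here only because re.I (which equals 2) is passed positionally and lands in the count parameter, limiting the substitution to the first two matches:

>>> import re
>>> s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
>>> re.sub(r'\btitle:\w+\b', '', s, re.I)  # re.I is taken as count=2, not as a flag
'keyword1 keyword2   "title:quoted" keyword3'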

Alternatively, you can use negative lookbehind and lookahead assertions to make sure there are no quotes around title:\w+:

>>> re.sub(r'(?<!")title:\w+(?!")', '', s)
'keyword1 keyword2   "title:quoted" keyword3'
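
One limitation worth keeping in mind: the lookarounds only inspect the characters directly next to the tag. The sketch below uses a hypothetical input in which the quoted phrase has spaces around the tag, and in that case the tag inside the quotes is removed as well.

```python
import re

# Lookaround pattern from the snippet above.
LOOKAROUND = re.compile(r'(?<!")title:\w+(?!")')

# Hypothetical input: the tag sits in the middle of a quoted phrase,
# so there is no quote directly before or after it.
s = 'keyword1 "some title:quoted phrase" title:hello'

result = LOOKAROUND.sub('', s)
print(repr(result))  # the tag inside the quotes is gone too
```

If quoted phrases can contain spaces, matching the whole quoted string first (an alternation such as `"[^"]*"|(...)`) is the safer choice.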
alecxe
3

This situation sounds very similar to "regex-match a pattern unless..."

We can solve it with a beautifully simple regex:

"[^"]*"|(\btitle:\S+)

The left side of the alternation | matches complete "double-quoted strings". We ignore these matches. The right side matches your title:hello strings and captures them in Group 1, and we know they are the right ones because they were not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):

import re
subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
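
The same alternation can also collect the removed tags and handle other prefixes such as url: or content:. This is a minimal sketch, not part of the original answer: `extract_tags` and its `prefixes` parameter are hypothetical names, and the sample input is made up.

```python
import re

def extract_tags(text, prefixes=("title",)):
    """Remove unquoted `prefix:value` tokens from text and return them.

    `extract_tags` and `prefixes` are hypothetical names for this sketch.
    """
    names = "|".join(re.escape(p) for p in prefixes)
    # Left branch: a complete double-quoted string, kept unchanged.
    # Right branch: an unquoted tag, captured in group 1.
    pattern = re.compile(r'"[^"]*"|(\b(?:%s):\S+)' % names)
    found = []

    def repl(m):
        if m.group(1):
            found.append(m.group(1))
            return ""
        return m.group(0)

    # Collapse the extra spaces left behind by the removed tags.
    cleaned = " ".join(pattern.sub(repl, text).split())
    return cleaned, found

cleaned, found = extract_tags(
    'keyword1 url:domain.com title:hello "title:quoted" keyword3',
    prefixes=("title", "url"),
)
print(cleaned)
print(found)
```

Note that the whitespace cleanup also collapses repeated spaces inside quoted strings, which may or may not be acceptable depending on the input.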

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

zx81
  • Thank you for your explanations, they are very useful to me. I haven't marked your answer as accepted because I used @alecxe's, but your approach seems to work too. I've read your linked answer, which is a very nice piece of work! – Agate Jun 08 '14 at 22:32
1
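re.sub(r'[^"]title:\w+', "", string)
keyword1 keyword2 "title:quoted" keyword3

This replaces any substring made of one character other than a double quote, followed by title: and one or more word characters (\w+). Because the preceding character (here a space) is removed together with the tag, no extra spaces are left behind.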
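
Since the pattern has to consume one character before the tag, a tag at the very start of the string is not matched. The sketch below shows this on a hypothetical input and compares it with a lookbehind variant, which is an assumption of this example rather than part of the answer.

```python
import re

# Hypothetical input with the tag at the very beginning.
text = 'title:hello keyword1 "title:quoted"'

# Pattern from the answer: needs a non-quote character before "title:".
consuming = re.sub(r'[^"]title:\w+', '', text)

# Variant with a zero-width lookbehind: nothing is consumed before the tag.
lookbehind = re.sub(r'(?<!")\btitle:\w+', '', text)

print(repr(consuming))   # unchanged: the leading tag is missed
print(repr(lookbehind))  # the leading tag is removed, the quoted one is kept
```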

Padraic Cunningham
  • I'm sorry but this answer is not working, it outputs `'keyword1 keyword2 "" keyword3'` – Agate Jun 08 '14 at 22:09
  • it works on the example you have provided, what do you expect it to do? – Padraic Cunningham Jun 08 '14 at 22:10
  • It's not working on the example I gave. Expected output is `keyword1 keyword2 "title:quoted" keyword3` I get `'keyword1 keyword2 "" keyword3'` with your snippet. – Agate Jun 08 '14 at 22:21
0

A little violent but works in all situations and without catastrophic backtracking:

import re

string = r'''keyword1 keyword2 title:hello title:world "title:quoted"title:foo
       "abcd \" title:bar"title:foobar keyword3 keywordtitle:keyword
       "non balanced quote title:foobar'''

pattern = re.compile(
    r'''(?:
            (      # other content
                (?:(?=(
                    " (?:(?=([^\\"]+|\\.))\3)* (?:"|$) # quoted content
                  |
                    [^t"]+             # all that is not a "t" or a quote
                  |
                    \Bt                # "t" preceded by word characters
                  |
                    t (?!itle:[a-z]+)  # "t" not followed by "itle:" + letters 
                )  )\2)+
            )
          |     # OR
            (?<!") # not preceded by a double quote
        )
        (?:\btitle:[a-z]+)?''',
    re.VERBOSE)

print(re.sub(pattern, r'\1', string))
Casimir et Hippolyte
  • Thank you for your answer, but it's a bit too long for my needs. Moreover, even if I used `title:something` in my question, it can be something else like `url:domain.com` or `content:text`. So matching the first letter of the word title is not suitable for me. – Agate Jun 08 '14 at 22:39
  • @EliotBerriot: I understand, however you can easily build the pattern with the word you want (it's not difficult to extract the first letter of a word, and to use placeholders). If the pattern is long, don't imagine that it is slow. – Casimir et Hippolyte Jun 08 '14 at 22:47
  • You're absolutely right but it's simply not as convenient as other answers – Agate Jun 08 '14 at 22:52