Python regex matching pattern not surrounded by double quotes

Question

I'm not comfortable with regex, so I need your help with this one, which seems tricky to me.

Let's say I've got the following string:

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

What regex would match title:hello and title:world so that I can remove them from the original string, while leaving "title:quoted" in place because it is surrounded by double quotes?

I've already seen this similar SO answer, and here is what I ended up with:

import re

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

def replace(m):
    if m.group(1) is None:
        return m.group()

    return m.group().replace(m.group(1), "")

regex = r'\"[^\"]title:[^\s]+\"|([^\"]*)'
cleaned_string = re.sub(regex, replace, string)

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'

Of course, it does not work, and I'm not surprised, because regexes are esoteric to me.

Thank you for your help!

Final solution

Thanks to your answers, here is the final solution, which works for my needs:

import re
matches = []

def replace(m):
    matches.append(m.group())
    return ""

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = r'(?<!")title:[^\s]+(?!")'
cleaned_string = re.sub(regex, replace, string)

# remove extra whitespace
cleaned_string = ' '.join(cleaned_string.split())

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
assert matches[0] == "title:hello"
assert matches[1] == "title:world"

So you want to match `[keyword1, keyword2, title:hello, title:world, keyword3]`? – PepperoniPizza, Jun 08 '14 at 21:47
In fact, I want to match `[title:hello, title:world]`, and remove both of them from the string. – Agate, Jun 08 '14 at 21:50
There is a very simple regex for this, it is similar to this question about [regex-matching a pattern except when...](http://stackoverflow.com/questions/23589174/match-or-replace-a-pattern-except-in-situations-s1-s2-s3-etc/23589204#23589204). Give me a second to write an answer. :) — zx81, Jun 08 '14 at 22:01

alecxe · Accepted Answer · 2014-06-08T22:08:45.317

6

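You can check for word boundaries (\b). Be aware that \b also matches between a double quote and the t of title, so this pattern does not exclude the quoted occurrence by itself. In the call below, re.I is passed as the fourth positional argument, which re.sub treats as count (re.I equals 2), so only the first two matches are replaced:

>>> import re
>>> s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
>>> re.sub(r'\btitle:\w+\b', '', s, re.I)
'keyword1 keyword2   "title:quoted" keyword3'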
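
A short sketch of how the fourth argument of re.sub affects the call above: positionally it is count, not flags, and re.I has the integer value 2. The variable names below are illustrative only.

```python
import re

s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
pattern = r'\btitle:\w+\b'

# Signature: re.sub(pattern, repl, string, count=0, flags=0).
# A fourth positional argument is the count, and int(re.I) == 2,
# so this is equivalent to the positional call shown above.
limited = re.sub(pattern, '', s, count=int(re.I))

# With the flag passed by keyword, every match is replaced, including the
# quoted one, because \b also matches between '"' and 't'.
unlimited = re.sub(pattern, '', s, flags=re.I)

print(limited)
print(unlimited)
```

Passing count and flags by keyword avoids this kind of mix-up.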

Alternatively, you can use negative lookbehind and lookahead assertions to check that title:\w+ is not surrounded by quotes:

>>> re.sub(r'(?<!")title:\w+(?!")', '', s)
'keyword1 keyword2   "title:quoted" keyword3'

edited Jun 08 '14 at 22:08

answered Jun 08 '14 at 21:47

Thank you, it works. Is there any way to store the matched patterns `[title:hello, title:world]` in a variable? – Agate Jun 08 '14 at 21:55
@EliotBerriot yup, I think you need [something like this](http://stackoverflow.com/a/9135166/771848). – alecxe Jun 08 '14 at 21:58
@EliotBerriot I've updated the answer with a simpler approach, please check. – alecxe Jun 08 '14 at 22:10
Thanks for your help :) – Agate Jun 08 '14 at 22:36
@alecxe efficient too, nice to see different approaches on the one page. :) +1 – zx81 Jun 09 '14 at 00:05

score 3 · Answer 2 · edited May 23 '17 at 10:34

This situation sounds very similar to "regex-match a pattern unless..."

We can solve it with a beautifully-simple regex:

"[^"]*"|(\btitle:\S+)

The left side of the alternation | matches complete "double-quoted strings". We will ignore these matches. The right side matches and captures your title:hello strings to Group 1, and we know they are the right ones because they were not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):

import re
subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
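
The same technique can also collect the removed tokens and handle other prefixes, such as the url: or content: tokens mentioned in the comments on this page. This is only a sketch: the helper name extract_unquoted and its prefixes parameter are hypothetical, not part of any library.

```python
import re

def extract_unquoted(text, prefixes=("title",)):
    # Hypothetical helper: strips unquoted `prefix:value` tokens from `text`
    # and returns the cleaned text together with the removed tokens.
    names = "|".join(re.escape(p) for p in prefixes)
    pattern = re.compile(r'"[^"]*"|(\b(?:' + names + r'):\S+)')
    found = []

    def _replace(m):
        if m.group(1):            # unquoted token: record it and drop it
            found.append(m.group(1))
            return ""
        return m.group(0)         # quoted string: keep it untouched

    # Note: this also collapses runs of whitespace inside quoted strings.
    cleaned = " ".join(pattern.sub(_replace, text).split())
    return cleaned, found

cleaned, found = extract_unquoted(
    'keyword1 url:domain.com title:hello "title:quoted" keyword3',
    prefixes=("title", "url"),
)
print(cleaned)
print(found)
```

Building the prefix alternation with re.escape keeps the pattern safe if a prefix ever contains regex metacharacters.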

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

Thank you for your explanations, they are very useful to me. I haven't marked your answer as accepted because I used @alecxe's, but your approach seems to work too. I've read your linked answer, which is a very nice piece of work! – Agate, Jun 08 '14 at 22:32

Padraic Cunningham · Answer 3 · 2014-06-08T22:34:51.270

1

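re.sub(r'[^"]title:\w+', "", string)
keyword1 keyword2 "title:quoted" keyword3

This replaces any substring made of title: followed by word characters (\w+), as long as the character just before it is not a double quote. That preceding character is removed as well.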
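
One property of the pattern above is that [^"] has to consume a character, so a token at the very start of the string is not matched. The sketch below uses a hypothetical input to compare it with a zero-width negative lookbehind.

```python
import re

# Hypothetical input with a token at position 0.
s = 'title:first keyword1 "title:quoted" title:last'

# [^"] needs a real character before "title:", so title:first is kept,
# and the space before title:last is removed along with the token.
consuming = re.sub(r'[^"]title:\w+', '', s)

# (?<!") consumes nothing, so the token at position 0 is matched as well
# and the surrounding spaces are left in place.
lookbehind = re.sub(r'(?<!")title:\w+', '', s)

print(repr(consuming))
print(repr(lookbehind))
```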

edited Jun 08 '14 at 22:34

answered Jun 08 '14 at 22:03

I'm sorry but this answer is not working, it outputs `'keyword1 keyword2 "" keyword3'` – Agate Jun 08 '14 at 22:09
it works on the example you have provided, what do you expect it to do? – Padraic Cunningham Jun 08 '14 at 22:10
It's not working on the example I gave. Expected output is `keyword1 keyword2 "title:quoted" keyword3` I get `'keyword1 keyword2 "" keyword3'` with your snippet. – Agate Jun 08 '14 at 22:21

Casimir et Hippolyte · Answer 4 · 2014-06-08T22:56:31.080

0

A little violent but works in all situations and without catastrophic backtracking:

import re

string = r'''keyword1 keyword2 title:hello title:world "title:quoted"title:foo
       "abcd \" title:bar"title:foobar keyword3 keywordtitle:keyword
       "non balanced quote title:foobar'''

pattern = re.compile(
    r'''(?:
            (      # other content
                (?:(?=(
                    " (?:(?=([^\\"]+|\\.))\3)* (?:"|$) # quoted content
                  |
                    [^t"]+             # all that is not a "t" or a quote
                  |
                    \Bt                # "t" preceded by word characters
                  |
                    t (?!itle:[a-z]+)  # "t" not followed by "itle:" + letters 
                )  )\2)+
            )
          |     # OR
            (?<!") # not preceded by a double quote
        )
        (?:\btitle:[a-z]+)?''',
    re.VERBOSE)

print(re.sub(pattern, r'\1', string))

edited Jun 08 '14 at 22:56

answered Jun 08 '14 at 22:34

Thank you for your answer, but it's a bit too long for my needs. Moreover, even though I used `title:something` in my question, it can be something else like `url:domain.com` or `content:text`. So matching on the first letter of the word title is not suitable for me. – Agate Jun 08 '14 at 22:39
@EliotBerriot: I understand, however you can easily build the pattern with the word you want (it's not difficult to extract the first letter of a word, and to use placeholders). If the pattern is long, don't imagine that it is slow. – Casimir et Hippolyte Jun 08 '14 at 22:47
You're absolutely right but it's simply not as convenient as other answers – Agate Jun 08 '14 at 22:52
