50

If a string has this expected format:

value = "hello and good morning"

The double quotes (") might instead be single quotes ('), and the closing character (' or ") will always be the same as the opening one. I want to match the string between the quotation marks.

\bvalue\s*=\s*(["'])([^\1]*)\1

(The two \s* are there to allow any whitespace around the = sign.)

The first capture group (the first pair of parentheses) should match the opening quotation mark, either ' or ". After that I want to allow any number of characters that are not whatever was captured in the first group, and then I expect the captured character again (the closing quotation mark).

(The required string should end up in the second capture group.)
This doesn't work though.

This does:

\bvalue\s*=\s*(['"])([^"']*)["']

but I want to make sure that both the opening and closing quotation mark (either double or single) are the same.


EDIT
The goal was basically to get the opening tag of an anchor that has a certain class name included in its class attribute, and I wanted to cover the rare case of the class attribute value including a (') or a (").

Following all of the advice here, I used the pattern:

<\s*\ba\b[^<>]+\bclass\s*=\s*("|'|\\"|\\')(?:(?!\1).)*\s*classname\s*(?:(?!\1).)*\1[^>]*>

Meaning:
Find the tag-opening sign (<).
Allow any whitespace.
Find the word a.
Allow any characters other than < and >.
Find "class (any whitespace) = (any whitespace)".
Get the opening quote, one of the following: (" or ' or \" or \').
From Alan Moore's answer: allow any characters that are not the opening quote.
Find classname.
Allow any characters that are not the opening quote.
Find the closing quote, which is the same as the opening one.
Allow any characters other than >.
Find the tag-closing character (>).
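
For anyone who wants to try the steps above, here is a minimal sketch in Python's re module. The choice of Python and the helper name find_anchor_open_tag are my assumptions and not part of the original post; "classname" is the same placeholder as in the pattern.

```python
import re

# The pattern from the post, unchanged. "classname" is a placeholder
# for the class name being searched for.
ANCHOR_WITH_CLASS = re.compile(
    r"""<\s*\ba\b[^<>]+\bclass\s*=\s*("|'|\\"|\\')(?:(?!\1).)*\s*classname\s*(?:(?!\1).)*\1[^>]*>"""
)

def find_anchor_open_tag(html):
    """Return the first matching <a ...> opening tag, or None (hypothetical helper)."""
    m = ANCHOR_WITH_CLASS.search(html)
    return m.group(0) if m else None

# The other quote type is allowed inside the attribute value.
print(find_anchor_open_tag("""<a class='it"s classname'>x</a>"""))
# "classname" in a different attribute is not accepted.
print(find_anchor_open_tag('<a class="other" title="classname">x</a>'))
```

Note that this is still a regex over markup, so it only covers tags shaped like the ones described in the steps.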

ekad
Yuval A.

5 Answers

72

Instead of a negated character class, you have to use a negative lookahead:

\bvalue\s*=\s*(["'])(?:(?!\1).)*\1

(?:(?!\1).)* consumes one character at a time, after the lookahead has confirmed that the character is not whatever was matched by the capturing group, (["']). A character class, negated or not, can only match one character at a time. As far as the regex engine knows, \1 could represent any number of characters, and there's no way to convince it that \1 will only contain " or ' in this case. So you have to go with the more general (and less readable) solution.
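
To see the difference in practice, here is a small sketch using Python's re module (my choice of language, the question does not name one). In Python, \1 inside a character class is read as the character \x01 rather than as a backreference, so [^\1]* happily runs past the closing quote. I also wrapped the lookahead loop in a capture group, which is an addition to the regex above, so that the content lands in group 2 as the question asked.

```python
import re

# Inside [...] Python treats \1 as the literal character \x01,
# so this class matches quotes too.
CLASS_ATTEMPT = re.compile(r"""\bvalue\s*=\s*(["'])([^\1]*)\1""")

# Lookahead version: take one character at a time as long as the
# opening quote (group 1) does not start at the current position.
LOOKAHEAD = re.compile(r"""\bvalue\s*=\s*(["'])((?:(?!\1).)*)\1""")

text = 'value = "hello" and "goodbye"'

print(CLASS_ATTEMPT.search(text).group(2))  # overshoots to the last quote
print(LOOKAHEAD.search(text).group(2))      # stops at the first closing quote
```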

Alan Moore
  • 1
    so, this part: (?!\1) means: match whatever follows, but make sure it's not what in \1 ? It's what I needed, thanks. – Yuval A. Nov 09 '11 at 14:05
  • I see that it only works right when inside a non-capturing group, like you did. I only half-understand why it must be inside a non-capturing group... – Yuval A. Nov 09 '11 at 14:13
  • 1
    The negative lookahead, `(?!\1)`, doesn't actually match anything, it merely asserts that it's not *possible* to match `\1` at the current position. It's the `.` that actually matches (i.e., *consumes*) the next character. – Alan Moore Nov 09 '11 at 21:17
  • 3
    As for the non-capturing group, that was just policy; I used it because I didn't *have to* use a capturing group there. The regex I posted should work either way, though `((?!\1).)*` would be gratuitously inefficient. More importantly, groups are numbered according to their position in the regex, so using non-capturing groups whenever possible makes it a lot easier to keep track of the capturing-group numbers. – Alan Moore Nov 09 '11 at 21:21
  • You... *have to* use a negative lookahead (?) – Code Jockey Sep 15 '17 at 16:10
2

Answering the question "How to use a numerical reference in neglected set?" here, because it was marked as an exact duplicate of this one.

You can't really reference a capture group inside a character class.
What you can do instead is put the backreference in a negative assertion, like this:

(["'])((?:(?!\1)[\S\s])*)(\1)

Expanded

 ( ["'] )                      # (1)
 (                             # (2 start)
      (?:
           (?! \1 )
           [\S\s] 
      )*
 )                             # (2 end)
 ( \1 )                        # (3)

Note that the [^char] in the original post would normally match line breaks
as well, but since this is JavaScript (the old JS, where the dot never matches line breaks) a plain dot is not an equivalent replacement.
Use [\S\s] instead, which matches any character.
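
The same line-break behaviour can be checked in Python, where the dot also skips newlines unless re.DOTALL is set. This is only an illustrative sketch; the answer itself is about JavaScript, and the names ANY_CHAR and DOT_ONLY are mine.

```python
import re

# [\S\s] matches every character, including line breaks.
ANY_CHAR = re.compile(r"""(["'])((?:(?!\1)[\S\s])*)(\1)""")

# The same pattern with a plain dot, which stops at a newline by default.
DOT_ONLY = re.compile(r"""(["'])((?:(?!\1).)*)(\1)""")

text = 'x = "line one\nline two"'

m = ANY_CHAR.search(text)
print(repr(m.group(2)))        # the content spans both lines
print(DOT_ONLY.search(text))   # no match: the dot cannot cross the newline
```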

2

You can use:

\bvalue\s*=\s*(['"])(.*?)\1

See it
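
A quick sketch of this in Python (my assumption, since no language was given; quoted_value is a made-up helper name). On its own the lazy version gives the same result as a lookahead loop. The two can differ once more pattern follows the closing \1, because backtracking lets .*? grow past a quote; the ";" suffix below is only there to show that.

```python
import re

LAZY = re.compile(r"""\bvalue\s*=\s*(['"])(.*?)\1""")

def quoted_value(text):
    """Return the text between the matching quotes, or None (hypothetical helper)."""
    m = LAZY.search(text)
    return m.group(2) if m else None

print(quoted_value("""value = 'say "hi"'"""))

# With something required after the closing quote, the lazy dot may
# run past an inner quote, while the lookahead loop refuses to.
LAZY_SEMI = re.compile(r"""\bvalue\s*=\s*(['"])(.*?)\1;""")
TEMPERED_SEMI = re.compile(r"""\bvalue\s*=\s*(['"])((?:(?!\1).)*)\1;""")
sample = 'value = "a" + "b";'
print(LAZY_SEMI.search(sample).group(2))
print(TEMPERED_SEMI.search(sample))
```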

codaddict
  • @YuvalA: you are right. We cannot have backreference in the char class. – codaddict Nov 08 '11 at 19:14
  • I’d urge you to delete this answer—it’s one of those unfortunate cases where it looks like it should work and is the most straightforward solution, but doesn’t actually work in practice. – Trey May 03 '20 at 20:33
2

Without knowing what you need the information for (or indeed even what language or tool you are using this regex in), there are many paths I can suggest.

Using these strings:

value = "hello and good morning"
value = 'hola y buenos dias'
value = 'how can I say "goodbye" so soon?'
value = 'why didn\'t you say "hello" to me this morning?'
value = "Goodbye! Please don't forget to write!"
value = 'Goodbye! Please don\'t forget to write!'

this expression:

"((\\"|[^"])*)"|'((\\'|[^'])*)'

will match these strings:

"hello and good morning"
'hola y buenos dias'
'how can I say "goodbye" so soon?'
'why didn\'t you say "hello" to me this morning?'
"Goodbye! Please don't forget to write!"
'Goodbye! Please don\'t forget to write!'

It would allow either the "other" type of quote or the same type of quote, when escaped with a single preceding \. The contents of the quoted strings are either in group 1 or 3. You could figure out which type of quotes are used by getting the first (or last) character.
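
Here is a rough sketch of how the group 1 / group 3 lookup could work, written in Python as an assumption (the helper name quoted_content is made up). The last call shows one limitation: an unterminated string containing an escaped quote still produces a partial match, because [^'] is also allowed to consume the backslash.

```python
import re

DOUBLE = r'"((\\"|[^"])*)"'   # groups 1 and 2
SINGLE = r"'((\\'|[^'])*)'"   # groups 3 and 4
QUOTED = re.compile(DOUBLE + "|" + SINGLE)

def quoted_content(text):
    """Return the content of the first quoted string, or None (hypothetical helper)."""
    m = QUOTED.search(text)
    if m is None:
        return None
    # Only one alternative takes part in a match; the other's groups are None.
    return m.group(1) if m.group(1) is not None else m.group(3)

print(quoted_content('value = "Goodbye! Please don\'t forget to write!"'))
print(quoted_content(r"value = 'why didn\'t you say hi?'"))
# Limitation: no closing quote, yet the escaped one is taken as the end.
print(quoted_content(r"'this shouldn\'t match"))
```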

If you need some of these things to be in particular match groups, please give more specific examples (and include things that should not work, but look like they might be close)

Please ask if you would like to take this route and need a little more help

Code Jockey
  • I'm curious where the downvote came from - anyone wanna tell me what I'm missing? I'm cool with everyone preferring the snazzier, less-"regular" features of more modern flavors of regex over simple alternation, but... does my approach... not work... for some use case?? – Code Jockey Sep 15 '17 at 16:07
  • 1
    I can't speak for the downvoter, but: `'this shouldn\'t match` – Coleoid Feb 05 '19 at 19:03
0

Example of replacement:

"markdown *text*"

to:

"markdown <em>text</em>"

PHP Code #1 for characters "*" and "_" (lazy quantifier):

preg_replace('%'.'([*_])'.'(?<phrase>.+?)'.'\\1'.'%sS', '<em>$2</em>', $text);

PHP Code #2 for characters "*" and "_" (negative lookahead on the back-reference):

preg_replace('%'.'([*_])'.'(?<phrase>(?:(?!\\1).){1,})'.'\\1'.'%sS', '<em>$2</em>', $text);

PHP Code #3 for single character "*" (negated character class):

preg_replace('%'.'([*])'.'(?<phrase>[^*]{1,})'.'[*]'.'%sS', '<em>$2</em>', $text);

Case #1 (lazy quantifier) is faster than Case #2 (negative lookahead on the back-reference).

Tested on 1000000 iterations:

  1. 0.0245740413665 sec.
  2. 3.3793921470642 sec.
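
For comparison, the first two replacements might look like this in Python. This is only a sketch under my own assumptions (re.DOTALL standing in for PHP's s modifier, and the function name emphasize is made up); it says nothing about the timings above.

```python
import re

# Lazy quantifier, as in Code #1.
LAZY_EM = re.compile(r"([*_])(.+?)\1", re.DOTALL)

# Negative lookahead on the back-reference, as in Code #2.
TEMPERED_EM = re.compile(r"([*_])((?:(?!\1).)+)\1", re.DOTALL)

def emphasize(text, pattern=LAZY_EM):
    """Wrap *text* or _text_ in <em> tags (hypothetical helper)."""
    return pattern.sub(r"<em>\2</em>", text)

print(emphasize('"markdown *text*"'))
print(emphasize("a _b_ and *c*", TEMPERED_EM))
```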