I have a regex that I run over an entire HTML page to find strings and replace them. However, if a string is inside single or double quotes, I do not want it to match.

Current Regex: ([a-zA-Z_][a-zA-Z0-9_]*)

I would like to match steve, john, cathie and john likes to walk (x3), but not "steve", 'sophie' or "john"'likes'"cake".

I have tried (^")([a-zA-Z_][a-zA-Z0-9_]*)(^") but I get no matches.

Test Cases:

(steve=="john") would return steve
("test"=="test") would not return anything
(boob==lol==cake) would return all three
Pez Cuckow

5 Answers

Try this one:

(\b(?<!['"])[a-zA-Z_][a-zA-Z_0-9]*\b(?!['"]))

Against this string:

john "michael" michael 'michael elt0n_john 'elt0n_j0hn'
 1      2        3        4       5            6

It would match nr 1 john, nr 3 michael and nr 5 elt0n_john.
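
For anyone who wants to check this outside PHP, here is a minimal sketch in Python (my choice for the demo, not something the question uses). Python's `re` module accepts the same lookbehind and lookahead syntax, so the pattern is copied unchanged; `PATTERN` and `sample` are just illustrative names.

```python
import re

# The pattern from this answer, unchanged: a word with no quote directly
# before it and no quote directly after it.
PATTERN = re.compile(r"""(\b(?<!['"])[a-zA-Z_][a-zA-Z_0-9]*\b(?!['"]))""")

# The test string from this answer.
sample = """john "michael" michael 'michael elt0n_john 'elt0n_j0hn'"""

matches = PATTERN.findall(sample)
print(matches)  # ['john', 'michael', 'elt0n_john']
```

Note that item 4 ('michael, with only an opening quote) is rejected as well, because the assertions look at each side of the word independently.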

PatrikAkerstrand

You could try with:

preg_match_all('#(?<!["\']) \b \w+ \b (?!["\'])#x', $str, $matches);

The \w+ matches word characters, so it would also accept 0123sophie, for example. The \b word boundaries ensure that the quote assertions cannot be sidestepped by matching only part of a quoted word.

However, this regex will also fail to find words that have only a single quote mark "before or after' them, i.e. words next to an unbalanced quote.
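
Both caveats can be seen in a small sketch. It is written in Python rather than PHP (an assumption made purely so the snippet is easy to run), with `re.VERBOSE` standing in for the `x` modifier; the input string is made up for the illustration.

```python
import re

# Same pattern as in the preg_match_all call; re.VERBOSE plays the role of
# the x modifier, so the spaces inside the pattern are ignored.
PATTERN = re.compile(r"""(?<!["']) \b \w+ \b (?!["'])""", re.VERBOSE)

# \w+ accepts a leading digit, so 0123sophie is reported as a word.
# apples has a quote on one side only, yet it is still skipped.
found = PATTERN.findall("0123sophie likes 'apples")
print(found)  # ['0123sophie', 'likes']
```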

mario
  • Tested and works great in Regex Coach... fails in PHP though, not sure why. http://codepad.org/LEitQoEJ Could just be codepad's issue. – Kevin Peno Mar 04 '11 at 18:13
  • Just did the same as well. Using the test case I posted on codepad, I tried it in the PHP CLI. It works great and returns an array containing `but`, `john`, `like`, `apples` – Kevin Peno Mar 04 '11 at 18:21

Pez, resurrecting this ancient question because the current answer is not quite correct (and I'm not sure any solution can be).

It will fail to match john when it sits next to an unbalanced quote, for instance in "john, john", 'john and john' (each with a quote on one side only; this can happen with john's birthday, etc.). See this demo.

This alternate solution just skips any content in quotes:

(?:'[^'\n]*'|"[^"\n]*")(*SKIP)(*F)|\b[a-zA-Z_][a-zA-Z_0-9]*\b

See demo

Either way, with quotes, no solution is perfect because you always run the risk of having unbalanced quotes. In this case I have tried to mitigate the problem by assuming that if it's on another line, it's a different string.
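
The (*SKIP)(*F) verbs are PCRE features, and Python's standard `re` module does not support them. A rough equivalent, sketched here in Python as a swapped-in technique rather than the regex above, is to let the quoted-string alternatives consume their text first and keep only the matches where the identifier group took part. The names `TOKEN` and `unquoted_words` are hypothetical.

```python
import re

# Alternatives are tried left to right at each position, so a complete quoted
# string (on a single line) is consumed before the identifier branch can see
# inside it.
TOKEN = re.compile(r"""'[^'\n]*'|"[^"\n]*"|\b([a-zA-Z_][a-zA-Z_0-9]*)\b""")

def unquoted_words(text):
    """Return identifiers that are not inside a complete quoted string."""
    # Group 1 is only set when the identifier branch matched.
    return [m.group(1) for m in TOKEN.finditer(text) if m.group(1) is not None]

print(unquoted_words('(steve=="john")'))        # ['steve']
print(unquoted_words('"john"\'likes\'"cake"'))  # []
print(unquoted_words('"john likes to walk'))    # ['john', 'likes', 'to', 'walk']
```

The last call shows the trade-off described above: a quote that is never closed on the same line is treated as ordinary text, so the words after it are matched.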

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...
zx81

To do that you probably need some dark magic:

'~(?:"[^"\\\\]*+(?:\\\\.[^"\\\\]*+)*+"|\'[^\'\\\\]*+(?:\\\\.[^\'\\\\]*+)*+\')(*SKIP)(*F)|([a-zA-Z_][a-zA-Z0-9_]*)~'

The (?:"[^"\\\\]*+(?:\\\\.[^"\\\\]*+)*+"|\'[^\'\\\\]*+(?:\\\\.[^\'\\\\]*+)*+\') part matches a string in either single or double quotes and implements backslash-escaping. The (*SKIP)(*F) skips the quoted string and forces a fail. ([a-zA-Z_][a-zA-Z0-9_]*) is your regex.

PS: If you are using this on PHP scripts, you may want to use the Tokenizer instead. That way you could for example exclude keywords (like class or abstract, I don't know whether you need this) and you will have much better handling of edge cases (like HEREDOC).
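
To make the backslash handling easier to see, here is a sketch of the same idea in Python (an assumption on my part; the answer itself targets PHP). Python's `re` has no (*SKIP)(*F), so the sketch filters on a capture group instead, and it uses plain `*` rather than possessive `*+` so that it also runs on Python versions before 3.11. `TOKEN` and `unquoted_identifiers` are hypothetical names.

```python
import re

TOKEN = re.compile(
    r'''
    "[^"\\]*(?:\\.[^"\\]*)*"      # double-quoted string, backslash escapes allowed
  | '[^'\\]*(?:\\.[^'\\]*)*'      # single-quoted string, backslash escapes allowed
  | ([a-zA-Z_][a-zA-Z0-9_]*)      # group 1: an identifier outside any string
    ''',
    re.VERBOSE,
)

def unquoted_identifiers(text):
    """Return identifiers outside quoted strings, honouring backslash escapes."""
    # Quoted strings match without setting group 1, so they are dropped here.
    return [m.group(1) for m in TOKEN.finditer(text) if m.group(1) is not None]

# The escaped quotes around john do not end the string early.
print(unquoted_identifiers(r'say "he said \"john\" loudly" to mary'))
# ['say', 'to', 'mary']
```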

NikiC

Ok I think I have it and it works for your test cases:

(?<!"|'|\w)(\w+)(?!"|'|\w)

This uses the regex look-ahead and look-behind features. The \w inside each assertion stops the engine from matching only part of a quoted word.
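
A quick way to check the three test cases from the question is the sketch below. It is in Python (an assumption, chosen only to make the check easy to run); the pattern is copied unchanged, and `PATTERN`, `cases` and `results` are illustrative names.

```python
import re

# The pattern from this answer, unchanged. Every alternative inside the
# look-behind is one character wide, which Python's re module requires.
PATTERN = re.compile(r"""(?<!"|'|\w)(\w+)(?!"|'|\w)""")

# The three test cases from the question.
cases = ['(steve=="john")', '("test"=="test")', '(boob==lol==cake)']
results = {case: PATTERN.findall(case) for case in cases}

for case, words in results.items():
    print(case, '->', words)
# (steve=="john") -> ['steve']
# ("test"=="test") -> []
# (boob==lol==cake) -> ['boob', 'lol', 'cake']
```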

Matej Baćo