47

Is there an implementation of regular expressions in Python/PHP/JavaScript that supports variable-length lookbehind-assertion?

/(?<!foo.*)bar/

How can I write a regular expression that has the same meaning, but uses no lookbehind-assertion?

Is there a chance that this type of assertion will be implemented some day?

Things are much better than I thought.

Update:

(1) There are regular expression implementations that already support variable-length lookbehind assertions.

The Python regex module (an additional module, not the standard re) supports such assertions (and has many other cool features).

>>> import regex
>>> m = regex.search(r'(?<!foo.*)bar', 'f00bar')
>>> print(m.group())
bar
>>> m = regex.search(r'(?<!foo.*)bar', 'foobar')
>>> print(m)
None

It was a really big surprise for me that there is something in regular expressions that Perl can't do and Python can. Perhaps there is an "enhanced regular expression" implementation for Perl as well?

(Thanks and +1 to MRAB).

(2) There is a cool feature \K in modern regular expressions.

This symbol means that when you make a substitution (and from my point of view the most interesting use case for assertions is substitution), all characters matched before \K are left unchanged.

s/unchanged-part\Kchanged-part/new-part/x

That is almost like a look-behind assertion, but not so flexible of course.

More about \K:

As far as I understand, you can't usefully use \K twice in the same regular expression. And you can't say up to which point you want to "kill" the characters that you've found: it is always everything back to the beginning of the match.

(Thanks and +1 to ikegami).

My additional questions:

  • Is it possible to specify the point at which the effect of \K should end?
  • What about enhanced regular expression implementations for Perl/Ruby/JavaScript/PHP? Something like regex for Python.
Igor Chubin
  • To know how to properly write an alternative that doesn't use a lookbehind assertion, we're going to need a little more context. What is this *actually* for? – Ry- Jul 24 '12 at 22:58
  • @minitech: there is no additional context. This is a general question – Igor Chubin Jul 25 '12 at 06:48
  • No, it requires additional context. The best way to solve your problem currently is to use `indexOf` to find `'foo'` and then repeat to find all `'bar'` after it. – Ry- Jul 25 '12 at 13:59
  • @minitech: I can remove this simple example; I provided it just for illustration purposes; the question is: "how (generally) can I avoid look-behind-negative-assertions and what (generally) I could use instead?". Why don't you like the answer from ikegami? I think that the answer is almost perfect. I was not aware of this `\K` trick and I find it really killing – Igor Chubin Jul 25 '12 at 14:06
  • I do, but it's not the most efficient solution if your problem is actually so simple. To recognize the validity of variable-width lookbehinds, I really need an example that can't be done using simple string searching (which works in engines without `\K`, too). – Ry- Jul 25 '12 at 14:09
  • I doubt Python's version is bug free. To implement variable width look behind correctly, you basically need to have two identical copies of the regex engine, except that one works backwards. It's simply not worth the cost. – ikegami Jul 25 '12 at 17:27
  • @ikegami: you mean `(?:(?!foo).)*` would be more (or at least not less) effective? – Igor Chubin Jul 25 '12 at 17:29
  • Also, not everyone agrees as to what the following should capture: `'foo bar baz moo' =~ /(?<=foo.*(ba.).*)moo/`. I'd say `bar`. Some might say `baz`. If you say `baz`, variable-width lookbehind becomes very inefficient and `(?{ ... })` won't work sensibly. – ikegami Jul 25 '12 at 17:29
  • What do you mean by "effective"? – ikegami Jul 25 '12 at 17:31
  • @ikegami: why `bar`? `.*` is greedy here, right? – Igor Chubin Jul 25 '12 at 17:33
  • It's not a question of greediness, it's a question of at which end do you start matching. If you start from the right, the rightmost `.*` is encountered first and gobbles up as much as it can, and the leftmost gobbles up what's left. If you start at the left, there's all sorts of problems, and it doesn't make as much sense conceptually, but that seems to be what some people (e.g. you) expect. – ikegami Jul 25 '12 at 17:34
  • How can you compare efficiency of code that works and code that doesn't work in the general case? But yeah, in this case, they'll surely be equally efficient because Python surely does exactly the same thing internally. – ikegami Jul 25 '12 at 17:34
  • @ikegami: what code doesn't work? Both variants work in the `regex` module. – Igor Chubin Jul 25 '12 at 17:36
  • No, it cannot work both right and efficiently without having two implementations of the regex engine, and I bet it doesn't. – ikegami Jul 25 '12 at 17:37
  • @ikegami: "it's a question of at which end do you start matching", ok, I've understood. I think that is just a question of definition. – Igor Chubin Jul 25 '12 at 17:47
  • @Igor Chubin, yes, but there's no agreement as to a definition, and there are issues with both definitions. It would be irresponsible to implement variable-width lookbehind at this point. – ikegami Jul 25 '12 at 18:19

5 Answers

46

Most of the time, you can avoid variable length lookbehinds by using \K.

s/(?<=foo.*)bar/moo/s;

would be

s/foo.*\Kbar/moo/s;

Anything up to the last \K encountered is not considered part of the match (e.g. for the purposes of replacement, $&, etc.).

Negative lookbehinds are a little trickier.

s/(?<!foo.*)bar/moo/s;

would be

s/^(?:(?!foo).)*\Kbar/moo/s;

because (?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR.
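For engines without \K, the same two substitutions can be approximated with a capture group, as the comments below also point out. The following is only a sketch using Python's standard re module (which has no \K); the function names are made up for illustration, and count=1 is an assumption chosen to mimic a Perl s/// without the /g flag.

```python
import re

def replace_bar_after_foo(text):
    # Rough counterpart of s/foo.*\Kbar/moo/s:
    # group 1 holds the "foo..." prefix and is put back unchanged.
    return re.sub(r"(foo.*)bar", r"\g<1>moo", text, count=1, flags=re.DOTALL)

def replace_bar_without_earlier_foo(text):
    # Rough counterpart of s/^(?:(?!foo).)*\Kbar/moo/s:
    # group 1 is a prefix that contains no "foo" anywhere.
    return re.sub(r"^((?:(?!foo).)*)bar", r"\g<1>moo", text,
                  count=1, flags=re.DOTALL)

print(replace_bar_after_foo("foo x bar"))          # foo x moo
print(replace_bar_without_earlier_foo("f00bar"))   # f00moo
print(replace_bar_without_earlier_foo("foobar"))   # foobar (unchanged)
```

Because the prefix is greedy, both versions rewrite the last bar that satisfies the condition, which is also what the \K forms do.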


If you're just matching, you might not even need the \K.

/foo.*bar/s

/^(?:(?!foo).)*bar/s
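As a hedged illustration of the match-only case, here is a small Python sketch with the standard re module. The helper not_containing() is hypothetical (it is not part of any library); it builds the (?:(?!STRING).)* fragment for an arbitrary literal, and re.DOTALL plays the role of the /s flag.

```python
import re

def not_containing(literal):
    # Builds (?:(?!STRING).)* for a literal STRING: a run of characters
    # in which STRING never starts, analogous to [^CHAR]* for one char.
    return "(?:(?!" + re.escape(literal) + ").)*"

def bar_without_earlier_foo(text):
    # Match-only counterpart of /^(?:(?!foo).)*bar/s
    pattern = "^" + not_containing("foo") + "bar"
    return re.search(pattern, text, flags=re.DOTALL) is not None

print(bar_without_earlier_foo("f00bar"))   # True
print(bar_without_earlier_foo("foobar"))   # False
print(bar_without_earlier_foo("bar foo"))  # True
```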
ikegami
  • This trick with `\K` is really cool, but is it possible to specify several `\K` in one regular expression? Probably, not – Igor Chubin Jul 25 '12 at 10:04
  • No (or not usefully), but you can use captures: `s/foo.*\Kbar/moo/s;` === `s/(foo.*)bar/${1}moo/s;`. – ikegami Jul 25 '12 at 16:57
  • captures are obvious but it is not interesting :) `\K` is much better :) – Igor Chubin Jul 25 '12 at 17:18
  • But you can only have one. I was pointing out what you could do if needed more than one as you asked. (Captures also work before 5.10 when `\K` was introduced.) – ikegami Jul 25 '12 at 17:23
  • ikegami, of course I'm aware of captures, but there are many situations where they can't help; although `\K` is also not a silver bullet, it is really a cool thing. And this trick `(?:(?!foo).)*` is brilliant also. – Igor Chubin Jul 25 '12 at 17:27
  • Wow. You made my day. I had no idea about `\K`. Thanks!! – Matt Apr 25 '13 at 17:20
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Lookarounds". – aliteralmind Apr 10 '14 at 00:32
  • This is great, thanks a lot. But please add a note about what `\K` actually is. It's not exactly easy to Google. – tremby Apr 22 '15 at 01:54
  • Perl's regular expressions are documented in perlre – ikegami Apr 22 '15 at 02:38
  • Very nice! I went through about 5 "fixes" before I found this one, which worked! – backend_dev_123 Jul 18 '16 at 03:21
12

For Python there's a regex implementation which supports variable-length lookbehinds:

http://pypi.python.org/pypi/regex

It's designed to be backwards-compatible with the standard re module.

MRAB
  • Thank you! That really works and the module is generally very interesting. Thank you very much! +1 – Igor Chubin Jul 25 '12 at 07:07
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Lookarounds". – aliteralmind Apr 10 '14 at 00:33
  • Works swimmingly on `Python 3.4.1`. It also seems to be a little faster than `re`. – Navin Feb 11 '16 at 04:47
5

You can reverse the string AND the pattern and use a variable-length lookahead:

(rab(?!\w*oof)\w*)

In the sample line below it matches raboo, rabof, rab and both occurrences of rabo:

raboof rab7790oof raboo rabof rab rabo raboooof rabo
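The reversal idea can be sketched in Python with the standard re module. This is only an illustration under stated assumptions: the function name and the sample text are invented here, and the pattern is the one above without the trailing \w*, so that only the position of the reversed bar is reported and mapped back to the original string.

```python
import re

# Reversed form of "bar not preceded, within the same word, by foo":
# in the reversed text, "rab" must not be followed by \w*oof.
REVERSED_PATTERN = re.compile(r"rab(?!\w*oof)")

def find_bar_not_after_foo(text):
    """Return start offsets (in the original text) of matching 'bar'."""
    reversed_text = text[::-1]
    n = len(text)
    # A match ending at offset e in the reversed text
    # starts at offset n - e in the original text.
    starts = [n - m.end() for m in REVERSED_PATTERN.finditer(reversed_text)]
    return sorted(starts)

sample = "foo1bar bar f00bar"
print(find_bar_not_after_foo(sample))  # [8, 15]
```

The lookahead only needs to scan forwards in the reversed text, which is why an engine without variable-length lookbehind can handle it.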

Original solution as far as I know by:

Jeff 'japhy' Pinyan

Benjamin Udink ten Cate
2

The regexp you show will find any instance of bar which is not preceded by foo.

A simple alternative would be to first match foo against the string, and find the index of the first occurrence. Then search for bar, and see if you can find an occurrence which comes before that index.
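A minimal sketch of that index-based approach in Python, using only str.find. The function name is hypothetical, and the sketch ignores line breaks, i.e. it treats the whole string as one line.

```python
def bars_before_first_foo(text):
    """Return start offsets of every 'bar' that begins before the first 'foo'."""
    foo_at = text.find("foo")
    # If there is no "foo" at all, every "bar" qualifies.
    limit = len(text) if foo_at == -1 else foo_at
    positions = []
    start = text.find("bar")
    while start != -1 and start < limit:
        positions.append(start)
        start = text.find("bar", start + 1)
    return positions

print(bars_before_first_foo("bar bar foo bar"))  # [0, 4]
print(bars_before_first_foo("f00bar"))           # [3]
print(bars_before_first_foo("foobar"))           # []
```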

If you want to find instances of bar which are not directly preceded by foo, I could also provide a regexp for that (without using lookbehind), but it will be very ugly. Basically, invert the sense of /foo/ -- i.e. /[^f]oo|[^o]o|[^o]|$/.

Alex D
  • Alex, thank you for the answer, but in general things are not as simple as you write. I provided just a small example of a regular expression with an assertion. Of course the re could be much more complex, and the assertion could be deep inside of it. In that case you couldn't simply check a string for some substring. – Igor Chubin Jul 25 '12 at 06:55
  • Alex, when you need "instances of `bar` which are not directly preceded by `foo`", you can just use normal lookbehind assertion `(? – Igor Chubin Jul 25 '12 at 12:38
2
foo.*|(bar)

If foo is in the string first, then the regex will match, but there will be no groups.

Otherwise, it will find bar and assign it to a group.

So you can use this regex and look for your results in the groups found:

>>> import re
>>> m = re.search('foo.*|(bar)', 'f00bar')
>>> if m: print(m.group(1))
bar
>>> m = re.search('foo.*|(bar)', 'foobar')
>>> if m: print(m.group(1))
None
>>> m = re.search('foo.*|(bar)', 'fobas')
>>> if m: print(m.group(1))
>>> 

Source.

twasbrillig