I have this regex that uses lookahead and lookbehind assertions:

import re
re.compile(r"<!inc\((?=.*?\)!>)|(?<=<!inc\(.*?)\)!>")

I'm trying to port it from C# to Python but keep getting the error:

look-behind requires fixed-width pattern

Is it possible to rewrite this in Python without losing meaning?

The idea is for it to match something like:

<!inc(C:\My Documents\file.jpg)!>

Update

I'm using the lookarounds to parse HTTP multipart text that I've modified:

body = r"""------abc
Content-Disposition: form-data; name="upfile"; filename="file.txt"
Content-Type: text/plain

<!inc(C:\Temp\file.txt)!>
------abc
Content-Disposition: form-data; name="upfile2"; filename="pic.png"
Content-Type: image/png

<!inc(C:\Temp\pic.png)!>
------abc
Content-Disposition: form-data; name="note"

this is a note
------abc--
"""

multiparts = re.compile(...).split(body)

When I do the split, I want to get just the file paths and the other text, without having to strip the opening and closing tags afterwards.

Code brevity is important, but I'm open to changing the <!inc( format if it makes the regex doable.

pppery
Chad

3 Answers

4

From the documentation:

(?<!...)

Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef, since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

Emphasis mine. No, I don't imagine you can port it to Python in its current form: the lookbehind contains .*?, which is not fixed-width.
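
To make the fixed-width rule concrete, here is a minimal sketch using only the standard re module. The variable names are illustrative; the second pattern is the lookbehind half of the regex from the question.

```python
import re

# A fixed-width lookbehind ("abc" is always 3 characters) compiles fine.
fixed = re.compile(r"(?<=abc)def")
print(fixed.search("abcdef").group(0))  # prints: def

# The lookbehind from the question contains .*?, so its width varies
# and the standard re module rejects it at compile time.
try:
    re.compile(r"(?<=<!inc\(.*?)\)!>")
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)
```

The failure happens in re.compile itself, before any text is searched, so it can't be worked around by changing the input.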

g.d.d.c
  • Yeah, I read the documentation and was hoping someone on SO is smart enough to help me rewrite this without the lookarounds since the documentation says they're not allowed. Thanks! – Chad Jun 25 '12 at 22:12
  • This answer has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Lookarounds". – aliteralmind Apr 10 '14 at 00:30
3

For paths + "everything" in the same array, just split on the opening and closing tag:

import re
p = re.compile(r'''<!inc\(|\)!>''')
awesome = p.split(body)

You say you're flexible on the closing tags. If )!> can occur elsewhere in the text, you may want to consider changing that closing tag to something like )!/inc> (or anything, as long as it's unique).

See it run.
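
As a rough illustration of what the split returns, here is a sketch on a shortened copy of the body from the question. The second pattern is a possible variant, not part of the answer above: it relies on re.split keeping the text of a capturing group, so only a properly paired <!inc( ... )!> acts as a separator. The names LOOSE, PAIRED, loose and paired are made up for this example.

```python
import re

# Shortened version of the multipart body from the question.
body = r"""------abc
Content-Disposition: form-data; name="upfile"; filename="file.txt"

<!inc(C:\Temp\file.txt)!>
------abc
Content-Disposition: form-data; name="note"

this is a note
------abc--
"""

# Split on either tag independently (the approach in the answer).
LOOSE = re.compile(r"<!inc\(|\)!>")
# Variant: split on the whole tag pair and keep the captured path.
PAIRED = re.compile(r"<!inc\((.*?)\)!>")

loose = LOOSE.split(body)
paired = PAIRED.split(body)

print(loose[1])         # the path between the tags
print(loose == paired)  # True for well-formed input

# A stray closing tag splits the loose pattern but not the paired one.
print(LOOSE.split("a )!> b"))
print(PAIRED.split("a )!> b"))
```

For well-formed input both patterns give the same list; they only differ when a tag appears without its partner.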

ohaal
  • +1 :: Optionally replace `.*?` with `.+?` for non-blank inside match – Ωmega Jun 25 '12 at 21:46
  • @user1215106: That wouldn't match his already existing regex. Keep in mind this is a port from C# to Python. – ohaal Jun 25 '12 at 21:47
  • That's why I wrote **optionally** and explain what would change, Sir. – Ωmega Jun 25 '12 at 21:48
  • BTW :: For better performance, don't use `*?` or `+?` at all, if you don't have to... – Ωmega Jun 25 '12 at 21:49
  • Just google for that - for example: http://blog.stevenlevithan.com/archives/greedy-lazy-performance – Ωmega Jun 25 '12 at 21:56
  • Sorry - I should have explained better - see my updated question. Thanks! – Chad Jun 25 '12 at 22:06
  • Yes, I think I'll take this approach. It does lose some accuracy since it doesn't verify the matching end tags, but I'm not too worried about that being a problem. Thanks! – Chad Jun 26 '12 at 21:19
1
import re

pat = re.compile(r"\<\!inc\((.*?)\)\!\>")

f = pat.match(r"<!inc(C:\My Documents\file.jpg)!>").group(1)

results in f == r'C:\My Documents\file.jpg'

In response to Jon Clements:

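print(re.escape("<!inc(filename)!>"))

results in (on Python versions before 3.7)

\<\!inc\(filename\)\!\>

Conclusion: re.escape seems to think they should be escaped. Since Python 3.7, re.escape only escapes characters that are special in a regex, so it returns <!inc\(filename\)!> instead; the pattern matches either way.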
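
As a quick check that the extra backslashes are harmless, here is a small sketch comparing the escaped and unescaped forms of the pattern. It uses findall to pull every path out of a sample text; the sample text and variable names are made up for illustration.

```python
import re

# Two tags on separate lines, in the format from the question.
text = "<!inc(C:\\Temp\\file.txt)!>\n<!inc(C:\\Temp\\pic.png)!>"

# Same pattern with and without backslashes before <, ! and >.
escaped = re.compile(r"\<\!inc\((.*?)\)\!\>")
plain = re.compile(r"<!inc\((.*?)\)!>")

# With a single capturing group, findall returns just the group text.
paths_escaped = escaped.findall(text)
paths_plain = plain.findall(text)

print(paths_plain)
print(paths_escaped == paths_plain)  # True: the escapes change nothing
```

Only the parentheses need escaping here, because they are the only characters in the tags that have special meaning in a regex.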

Hugh Bothwell