Python: \number Backreference in re.sub

Question

I'm trying to use Python's re.sub function to replace some text, but the \1 backreference in the replacement doesn't insert the captured group:

>>> import re
>>> text = "<hi type=\"italic\"> the></hi>"
>>> pat_error = re.compile(">(\s*\w*)*>")
>>> pat_error.search(text)
<_sre.SRE_Match object at 0xb7a3fea0>
>>> re.sub(pat_error, ">\1", text)
'<hi type="italic">\x01</hi>'

Afterwards the value of text should be

"<hi type="italic"> the</hi>"

This really isn't a question... – Mike Caron Dec 18 '09 at 19:58

score 10 · Accepted Answer · answered Jul 30 '09 at 03:13

Two bugs in your code. First, you're not matching (and specifically, capturing) what you think you're matching and capturing -- insert after your call to .search:

>>> _.groups()
('',)

The unconstrained repetition of repetitions (a star after a capturing group that contains nothing but starred items) matches one time too many -- with the empty string at the end of what you think you're matching -- and that's what gets captured. Fix by changing at least one of the stars to a plus, e.g., by:

>>> pat_error = re.compile(r">(\s*\w+)*>")
>>> pat_error.search(text)
<_sre.SRE_Match object at 0x83ba0>
>>> _.groups()
(' the',)
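As a minimal sketch of why the capture depends on the final repetition: a capturing group under a quantifier keeps only the text from its last iteration. The pattern `(\w)+` and the string "abc" below are made up for illustration and are not from the question.

```python
import re

# A capturing group under a quantifier keeps only its last iteration:
# the whole match is "abc", but group 1 holds just the final "c".
m = re.match(r"(\w)+", "abc")
print(m.group(0))  # abc
print(m.group(1))  # c

# With \w+ inside the group, an iteration can no longer match the empty
# string, so the last iteration is the one that consumed " the".
text = '<hi type="italic"> the></hi>'
fixed = re.search(r">(\s*\w+)*>", text)
print(fixed.groups())  # (' the',)
```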

With the plus in place, the pattern matches and captures sensibly. Second, you're not using raw string literal syntax where you should, so you don't have a backslash where you think you have one -- in an ordinary string literal, "\1" is an escape sequence, the same as chr(1). Fix by using raw string literal syntax, i.e., with the corrected pat_error:

>>> pat_error.sub(r">\1", text)
'<hi type="italic"> the</hi>'

Alternatively you could double up all of your backslashes, to avoid them being taken as the start of escape sequences -- but, raw string literal syntax is much more readable.
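A small sketch comparing the three spellings of the replacement string may help; it assumes the corrected pattern from above, and the variable names plain, raw and doubled are only for illustration.

```python
import re

text = '<hi type="italic"> the></hi>'
pat_error = re.compile(r">(\s*\w+)*>")

# In an ordinary string literal, "\1" is an octal escape: one character, chr(1).
plain = ">\1"
# A raw literal and a doubled backslash both keep the backslash itself.
raw = r">\1"
doubled = ">\\1"

print(len(plain), len(raw), len(doubled))  # 2 3 3
print(raw == doubled)                      # True

# Only the spellings that keep the backslash act as a backreference.
print(repr(pat_error.sub(plain, text)))    # '<hi type="italic">\x01</hi>'
print(pat_error.sub(raw, text))            # <hi type="italic"> the</hi>
print(pat_error.sub(doubled, text))        # <hi type="italic"> the</hi>
```

The raw and doubled forms are the same three-character string, so the choice between them is purely about readability.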

score 0 · Answer 2 · answered Jul 30 '09 at 03:05

>>> text.replace("><", "<")
'<hi type="italic"> the</hi>'

ghostdog74

286,686
52
238
332

This won't work because there are other instances where the value of text might be "stuffblah" – Daniel Jul 30 '09 at 03:11
