0

I'm trying to use Python's re.sub function to replace some text, but the backreference in the replacement does not come through:

>>> import re
>>> text = "<hi type=\"italic\"> the></hi>"
>>> pat_error = re.compile(">(\s*\w*)*>")
>>> pat_error.search(text)
<_sre.SRE_Match object at 0xb7a3fea0>
>>> re.sub(pat_error, ">\1", text)
'<hi type="italic">\x01</hi>'

The result I want from re.sub is:

"<hi type="italic"> the</hi>"
Daniel

2 Answers

10

There are two bugs in your code. First, you're not matching (and specifically, not capturing) what you think you're matching and capturing. To see this, run the following right after your call to .search:

>>> _.groups()
('',)

The unconstrained repetition of repetitions (a star after a capturing group that contains nothing but starred items) matches one time too many, on the empty string at the end of what you think you're matching. Because a repeated group keeps only its last iteration, that empty string is what gets captured. Fix this by changing at least one of the stars to a plus, e.g.:

>>> pat_error = re.compile(r">(\s*\w+)*>")
>>> pat_error.search(text)
<_sre.SRE_Match object at 0x83ba0>
>>> _.groups()
(' the',)

Now THIS matches and captures sensibly. Second, you're not using raw string literal syntax where you should, so you don't have a backslash where you think you have one: in an ordinary string literal, \1 is an escape sequence, the same as chr(1). Fix this by using raw string literal syntax, i.e., after the above snippet:

>>> pat_error.sub(r">\1", text)
'<hi type="italic"> the</hi>'

Alternatively, you could double up all of your backslashes so that they are not taken as the start of escape sequences, but raw string literal syntax is much more readable.
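
To make the escape-sequence point concrete, here is a minimal sketch as a standalone script rather than a REPL session. It reuses the corrected pattern from above; the names raw_repl, doubled_repl, fixed_raw and fixed_doubled are mine, added only for illustration.

```python
import re

text = '<hi type="italic"> the></hi>'
pat_error = re.compile(r">(\s*\w+)*>")

# In an ordinary string literal, "\1" is an octal escape: a single
# character equal to chr(1), not a backslash followed by the digit 1.
assert "\1" == chr(1)
assert len(">\1") == 2

# The raw literal and the doubled-backslash literal are the same
# three-character string: '>', a backslash, and '1'.
raw_repl = r">\1"
doubled_repl = ">\\1"
assert raw_repl == doubled_repl

# Either spelling gives re.sub a real backreference to group 1.
fixed_raw = pat_error.sub(raw_repl, text)
fixed_doubled = pat_error.sub(doubled_repl, text)
print(fixed_raw)
```

Both spellings produce the same replacement string, so the choice between them is purely about readability.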

Alex Martelli
0
>>> text.replace("><", "<")
'<hi type="italic"> the</hi>'
ghostdog74
  • This won't work because there are other instances where the value of text might be "stuffblah" – Daniel Jul 30 '09 at 03:11
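
A rough sketch of the kind of input this comment seems to be worried about. The adjacent string below is a hypothetical example of two well-formed tags sitting next to each other, not the commenter's actual data. It compares the plain str.replace approach with the corrected pattern from the other answer.

```python
import re

pat_error = re.compile(r">(\s*\w+)*>")

broken = '<hi type="italic"> the></hi>'
# Hypothetical input: nothing is wrong here, but "><" still occurs.
adjacent = "<b>stuff</b><i>blah</i>"

# str.replace rewrites every "><", so it damages the closing </b> tag.
replaced = adjacent.replace("><", "<")

# The regex only matches a '>' followed by optional words and another '>',
# so it leaves the well-formed input alone and still repairs the broken one.
subbed = pat_error.sub(r">\1", adjacent)
repaired = pat_error.sub(r">\1", broken)

print(replaced)
print(subbed)
print(repaired)
```

On this input the replace call corrupts markup that was fine to begin with, whereas the regex substitution returns it unchanged.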