I'm encountering confusing and seemingly contradictory rules regarding raw strings. Consider the following example:
>>> text = 'm\n'
>>> match = re.search('m\n', text)
>>> print match.group()
m
>>> print text
m
This works, which is fine.
>>> text = 'm\n'
>>> match = re.search(r'm\n', text)
>>> print match.group()
m
>>> print text
m
Again, this works. But shouldn't this throw an error, because the raw string contains the literal characters m\n (a backslash followed by n) while the actual text contains a newline?
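To make the comparison concrete, here is a minimal sketch of what each literal contains and what `re` does with the raw pattern. The variable names `plain` and `raw` are mine, chosen for illustration, and I'm assuming the standard `re` module behaves the same way in current Python versions:

```python
import re

plain = 'm\n'   # Python converts \n into a single newline character
raw = r'm\n'    # the raw literal keeps the backslash: m, backslash, n

print(len(plain))  # 2
print(len(raw))    # 3

# The regex engine applies its own escape rules to the pattern it receives.
# Handed backslash + n, it still reads that pair as "match a newline".
match = re.search(raw, plain)
print(repr(match.group()))  # 'm\n'
```

So the two strings really are different, yet the raw pattern still matches the newline in the text, apparently because `re` interprets the backslash sequence itself.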
>>> text = r'm\n'
>>> match = re.search(r'm\n', text)
>>> print match.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> print text
m\n
The above, surprisingly, throws an error, even though both are raw strings. This means both contain just the text m\n with no newlines.
>>> text = r'm\n'
>>> match = re.search(r'm\\n', text)
>>> print text
m\n
>>> print match.group()
m\n
The above works, surprisingly. Why do I have to escape the backslash in the pattern passed to re.search, but not in the text itself?
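The two cases above can be put side by side. This is only a sketch of the behavior I'm seeing, with `re.escape` added as one possible way to build a pattern that matches a string literally (the names `literal` and `escaped` are my own):

```python
import re

text = r'm\n'  # three characters: m, backslash, n

# The pattern r'm\n' reaches re as m, backslash, n, which re reads as
# "m followed by a newline". The text has no newline, so nothing matches.
print(re.search(r'm\n', text))  # None

# To match a literal backslash, the regex itself needs an escaped backslash.
literal = re.search(r'm\\n', text)
print(literal.group() == text)  # True

# re.escape produces such a pattern from a plain string.
escaped = re.escape(text)
print(escaped == r'm\\n')  # True
print(re.search(escaped, text).group() == text)  # True
```

The text is never parsed by the regex engine, only the pattern is, which would explain why the extra backslash belongs in the pattern alone.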
Then there's backslash with normal characters that have no special behavior:
>>> text = 'm\&'
>>> match = re.search('m\&', text)
>>> print text
m\&
>>> print match.group()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
This doesn't match, even though both the pattern and the string lack special characters.
In this situation, no combination of raw strings works (text as a raw string, patterns as a raw string, both or none).
However, consider the last example. Escaping in the text variable, 'm\\&', doesn't work, but escaping in the pattern does. This parallels the behavior above, and seems even stranger to me, considering that \& has no special meaning to either Python or re:
>>> text = 'm\&'
>>> match = re.search(r'm\\&', text)
>>> print text
m\&
>>> print match.group()
m\&
My understanding of raw strings is that they inhibit Python's own handling of the backslash. For regular expressions, this is important because it lets re.search apply its own internal backslash rules without conflicting with Python's. However, in situations like the above, where the backslash effectively means nothing, I'm not sure why escaping seems necessary. Worse yet, I don't understand why I need to escape the backslash in the pattern but not in the text, and why it doesn't work when I make both a raw string.
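The \& case can be checked the same way. A small sketch, assuming Python leaves an unrecognized escape such as \& in the string unchanged (newer Python 3 versions warn about it in non-raw literals, so the raw form is used here):

```python
import re

text = r'm\&'  # same three characters as 'm\&': m, backslash, &
print(len(text))  # 3

# In a regex, backslash + & means "a literal &", so this pattern
# describes the two-character string 'm&', not the text above.
print(re.search(r'm\&', text))          # None
print(re.search(r'm\&', 'm&').group())  # m&

# Escaping the backslash in the pattern matches the backslash in the text.
print(re.search(r'm\\&', text).group() == text)  # True
```

If this reading is right, \& is not meaningless to `re` after all: the backslash is consumed as an escape, which is why the unescaped pattern never sees a backslash to match.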
The docs don't provide much guidance in this regard. They focus on examples with obvious problems, such as '\section', where \s is a meta-character. I'm looking for a complete answer to prevent unanticipated behavior such as this.