The Python manual states:
The special sequence \w for 8-bit (bytes) patterns matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].
Now compare:
re.search(r"([\w]+)", 'München').group(1)
with:
re.search(r"([a-zA-Z0-9_]+)", 'München').group(1)
The first statement outputs the whole city name München, the second only the first letter M. The letter ü is a single byte with code point 0xFC = 252 (Latin-1).
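To make the comparison reproducible outside the IDLE prompt, here is a minimal sketch of the two calls as a script. The third call is my own addition and not taken from the manual: since the quoted sentence talks about 8-bit (bytes) patterns, it applies the same \w pattern as a bytes pattern to the name encoded as Latin-1 (the encoding and all variable names are assumptions for illustration).

```python
import re

city = "München"

# The two statements from above: str patterns applied to a str.
word_str = re.search(r"([\w]+)", city).group(1)
ascii_str = re.search(r"([a-zA-Z0-9_]+)", city).group(1)

# Assumption: encode as Latin-1 so that ü becomes the single byte 0xFC,
# then use a bytes pattern (rb"...") on the resulting bytes object.
city_bytes = city.encode("latin-1")
word_bytes = re.search(rb"([\w]+)", city_bytes).group(1)

print(word_str)    # München
print(ascii_str)   # M
print(word_bytes)  # b'M'
```

With the bytes pattern, [\w]+ stops after the first letter, just like [a-zA-Z0-9_]+ does in the str case.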
My question is: assuming that the Python manual is correct, how can I reconcile the difference in output between [\w]+ and [a-zA-Z0-9_]+ with the statement in the Python 3 manual? I use IDLE v. 3.6.2.