The Python manual states:

The special sequence \w for 8-bit (bytes) patterns matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].

Compare now:

re.search(r"([\w]+)", 'München').group(1)

with:

re.search(r"([a-zA-Z0-9_]+)", 'München').group(1)  

The first statement returns the whole city name München, the second only the first letter M. The letter ü is a single byte with code point 0xFC = 252 (Latin-1). My question is: assuming that the Python manual is correct, how can I reconcile the difference in output between [\w]+ and [a-zA-Z0-9_]+ with the statement in the Python 3 manual? I use IDLE v. 3.6.2.

Dmitry
P. Wormer
  • `re.U` flag is enabled by default (=`\w` matches any Unicode letters and digits) in Python 3. Python 3 strings are Unicode strings, not byte strings, by default. – Wiktor Stribiżew Aug 16 '17 at 09:50
  • But I use Latin-1, not UTF-8. And should the manual not mention the re.U flag? – P. Wormer Aug 16 '17 at 09:51
  • What do you actually need? Make `\w` always match only `[A-Za-z0-9_]` in Python 3? Then pass `re.ASCII` flag. – Wiktor Stribiżew Aug 16 '17 at 09:54
  • @P.Wormer The manual _does_ mention that. You just didn't read the correct section. You aren't working with `bytes`, so why do you quote the `bytes` section? – Aran-Fey Aug 16 '17 at 10:00
  • I wrote a little Python program that counts words in a text that is in Latin-1. The text contains single byte characters between 128 and 255 (accented characters). To my surprise \w+ did exactly what I wanted (counted words with accented characters). Now I try to understand what is going on. – P. Wormer Aug 16 '17 at 10:00
  • Maybe the reading of the file did a conversion from Latin-1 to UTF-8? – P. Wormer Aug 16 '17 at 10:02
  • The manual doesn't mention re.U (the unicode flag) because Python 3 uses unicode strings by default. You have the re.ASCII flag to restrict patterns to ASCII type behaviour, which is also done if the pattern is a bytes object rather than str. You've read the `bytes` specific paragraph when your pattern is `str` (unicode). – Yann Vernier Aug 16 '17 at 10:04
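For later readers, the distinction the comments keep coming back to (a `str` pattern versus a `bytes` pattern) can be seen directly. This is only a minimal sketch; the variable names are made up for illustration, and the Latin-1 encoding step is an assumption about how the 8-bit data would look.

```python
import re

city = "München"

# str pattern on a str subject: \w is Unicode-aware by default in Python 3.
str_match = re.search(r"\w+", city).group()

# bytes pattern on a bytes subject: \w is limited to [a-zA-Z0-9_].
# Encoding to Latin-1 turns ü into the single byte 0xFC.
city_latin1 = city.encode("latin-1")
bytes_match = re.search(rb"\w+", city_latin1).group()

# Mixing the two kinds raises an error instead of converting silently.
try:
    re.search(rb"\w+", city)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(str_match, bytes_match, type(mixed_error).__name__)
```

So the paragraph quoted in the question only describes the second case, where both the pattern and the subject are `bytes`.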

2 Answers

-1

You referenced the wrong manual (the one for Python 3.1).

The correct one is at https://docs.python.org/3/library/re.html

If you want \w to work like [a-zA-Z0-9_], you should pass the re.ASCII flag:

>>> import re
>>> re.search(r"([\w]+)", 'München').group(1)
'München'
>>> re.search(r"([\w]+)", 'München', flags=re.ASCII).group(1)
'M'
>>> re.search(r"([a-zA-Z0-9_]+)", 'München').group(1)
'M'
InQβ
-2

I'm not sure what source you're quoting from, but your link says:

For Unicode (str) patterns:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).

For 8-bit (bytes) patterns:

Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].
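The remark in the quoted passage that the ASCII flag affects the entire regular expression may be easier to see with a small example. This is a rough sketch, not taken from the manual; the test strings and variable names are invented for illustration, and `\u0663` is ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit.

```python
import re

text = "München \u0663"

# Default (Unicode) matching: \w covers ü and \d covers the non-ASCII digit.
unicode_match = re.fullmatch(r"\w+ \d", text)

# re.ASCII narrows \w, but it narrows \d in the same pattern as well,
# so the whole match fails.
ascii_match = re.fullmatch(r"\w+ \d", text, flags=re.ASCII)

# An explicit class restricts only the word part; \d stays Unicode-aware.
explicit_match = re.fullmatch(r"[a-zA-Z0-9_]+ \d", "Berlin \u0663")

print(bool(unicode_match), bool(ascii_match), bool(explicit_match))
```

Which of the two is preferable presumably depends on whether the rest of the pattern should stay Unicode-aware.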

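I'm still primarily using Python 2, but one of the big changes in Python 3 is that all strings are Unicode by default. When Python 3 reads a file in text mode, it decodes the bytes into a Unicode str, so your regex is applied to Unicode text and not to the original Latin-1 bytes.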

Cody Gray
Stael
  • I'm sure that the text I'm reading is in Latin-1. The text is actually older than Unicode. Maybe Python converts it somewhere (upon reading maybe?). – P. Wormer Aug 16 '17 at 10:05
  • OK that is the answer: Inadvertently I worked with UTF-8 and should have realized that the re.U flag is on. Thank you all! – P. Wormer Aug 16 '17 at 10:11
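For anyone landing here with the same word-count scenario, here is a minimal sketch of what probably happened. It assumes the file was opened in text mode; the sample sentence, the temporary file, and the variable names are made up for illustration.

```python
import os
import re
import tempfile

# Write a small Latin-1 encoded file; ü and ö are stored as single bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("München ist schön".encode("latin-1"))
    path = tmp.name

try:
    # Text mode decodes the bytes into a str using the given encoding.
    with open(path, encoding="latin-1") as fh:
        text = fh.read()
    # Binary mode returns the undecoded bytes.
    with open(path, "rb") as fh:
        raw = fh.read()
finally:
    os.remove(path)

# On the decoded str, \w+ keeps accented words whole.
words = re.findall(r"\w+", text)

# On the raw bytes, \w+ splits at every byte outside [a-zA-Z0-9_].
raw_words = re.findall(rb"\w+", raw)

print(words)
print(raw_words)
```

Once the file is read in text mode, the regex sees a `str` of Unicode code points regardless of how the file was encoded on disk, which is why `\w+` counted the accented words.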