python regex replace unicode

Question

In the first test string, I'm trying to replace the Unicode right-arrows character (») in the middle of the text with a space, but it doesn't seem to work.

More generally, I'm trying to remove every run of one or more Unicode "non-word" characters, while keeping words that are a mixture of a-z0-9 and Unicode characters, or that consist only of \w characters.

# -*- coding: utf-8 -*-
import re
str = 'hi… » Test'
str = 're of… » Pr'
str = 're of… » Pr | removepipeaswell'
print str
str = re.sub(r' [^a-z0-9]+ ', ' ', str , re.UNICODE|re.MULTILINE)
# str = re.sub(r' [^\p{Alpha}] ', ' ', str, re.UNICODE)
print str
're of… Pr removepipeaswell' #expected output

str_nbsp = 'afds » asf'

Edit: added another test string. I don't want to remove the "of…" (Unicode ellipsis); I want to remove only the runs of Unicode (non-word) characters.

Edit: the following works for the test case, but not on the full HTML. It appears to replace matches only in the first half of the string and then ignores the rest.

str = re.sub(r' [^a-z0-9]+ ', ' ', str , re.UNICODE|re.MULTILINE)

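Edit: fml, it had to be something stupid like not reading the argument list properly. The fourth positional argument of re.sub is count, not flags, so my flags value was being used as a limit on the number of replacements: http://bytes.com/topic/python/answers/689341-sub-does-not-replace-all-occurences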
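A minimal sketch of the argument mix-up, written for Python 3. The sample string and the variable names are made up for illustration; the only library facts relied on are the documented signature `re.sub(pattern, repl, string, count=0, flags=0)` and the integer values of the flag constants.

```python
import re

pattern = r' [^a-z0-9]+ '
text = 'a » b » c » d'  # illustrative sample, not from the original HTML

# A flags value in the fourth position is read as `count`, i.e. as a cap
# on the number of replacements. This is the cap the original call set.
accidental_count = int(re.UNICODE | re.MULTILINE)

# Same effect as the buggy call, with a small cap spelled out by keyword
# so the demo stays short: only the first match is replaced.
limited = re.sub(pattern, ' ', text, count=1)

# Passing flags by keyword leaves count at 0, which means "replace all".
full = re.sub(pattern, ' ', text, flags=re.UNICODE)

print(accidental_count)
print(limited)
print(full)
```

A cap of that size would be consistent with short test strings working while a long HTML document stops being cleaned partway through.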

[whoever just deleted their response - thank you for your help.]

str = re.sub(r' [^a-z0-9]+ ', ' ', str)

The final test string, "str_nbsp", did not match the regex above because one of its space characters is actually a non-breaking space. I figured this out by pasting it into www.regexr.com and hovering over each character.
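The same check can be done without a web tool. This is a sketch for Python 3; `odd_whitespace` is a hypothetical helper, and the sample assumes the first gap is the non-breaking space (U+00A0), since the post does not say which one it was.

```python
import re
import unicodedata

# Assumption: the first gap is U+00A0, the second is a plain space.
str_nbsp = 'afds\u00a0» asf'

def odd_whitespace(text):
    """List (index, Unicode name) for whitespace that is not a plain space."""
    return [(i, unicodedata.name(ch, 'UNNAMED'))
            for i, ch in enumerate(text)
            if ch.isspace() and ch != ' ']

# The literal-space pattern finds nothing to replace in this string.
unchanged = re.sub(r' [^a-z0-9]+ ', ' ', str_nbsp)

# For str patterns in Python 3, \s also matches U+00A0, so this variant does.
cleaned = re.sub(r'\s[^a-z0-9]+\s', ' ', str_nbsp)

print(odd_whitespace(str_nbsp))
print(repr(unchanged))
print(repr(cleaned))
```

Note that `\s` also matches tabs and newlines, so the second pattern can join lines that the literal-space version would leave alone.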

Just letting you know about the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496). :) — aliteralmind, Apr 17 '14 at 02:16
Thanks. I am a regex pro in perl, but I'm newbie on python. Still getting used to the different syntax. — Dave, Apr 17 '14 at 07:41
Debuggex.com, if you don't know already, is an online tester with both Python and PCRE. — aliteralmind, Apr 17 '14 at 11:53
thanks, I also found [regexr.com](http://regexr.com) and [rubular.com](http://www.rubular.com) to be useful as well for checking regex. — Dave, Apr 17 '14 at 16:43

score 2 · Accepted Answer · edited Oct 16 '16 at 23:42

str = re.sub(r' [^a-z0-9]+ ', ' ', str)
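As a quick check, here is that call run against the three test strings from the question. This is a Python 3 sketch; `strip_symbol_runs` is a hypothetical wrapper name, and the variable is not called `str` so the built-in stays usable.

```python
import re

def strip_symbol_runs(text):
    """Replace each space-delimited run of non [a-z0-9] characters with one space."""
    return re.sub(r' [^a-z0-9]+ ', ' ', text)

samples = [
    'hi… » Test',
    're of… » Pr',
    're of… » Pr | removepipeaswell',
]

results = [strip_symbol_runs(s) for s in samples]
for line in results:
    print(line)
```

The ellipsis in "of…" survives because it is attached to a word rather than surrounded by spaces, which is what the pattern requires on both sides of a run.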

answered Apr 17 '14 at 01:11 by Dave · edited by malana
