
In the first test string, I'm trying to replace the Unicode right-arrow character (») in the middle of the text with a space, but it doesn't seem to work.

In general, I'm trying to remove every run of one or more Unicode "non-word" characters, while keeping words that are a mixture of a-z0-9 and Unicode characters, or that consist only of \w characters.

# -*- coding: utf-8 -*-
import re
str = 'hi… » Test'
str = 're of… » Pr'
str = 're of… » Pr | removepipeaswell'
print str
str = re.sub(r' [^a-z0-9]+ ', ' ', str , re.UNICODE|re.MULTILINE)
# str = re.sub(r' [^\p{Alpha}] ', ' ', str, re.UNICODE)
print str
're of… Pr removepipeaswell' #expected output

str_nbsp = 'afds » asf'

edit: added another test string. I don't want to remove the "of..." (Unicode dots); I only want to remove standalone runs of Unicode (non-word) characters.

edit: using this works for the test case, but not on the full HTML: it appears to replace matches only in the first half of the string, then ignores the rest.

str = re.sub(r' [^a-z0-9]+ ', ' ', str , re.UNICODE|re.MULTILINE)

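edit: fml, it had to be something stupid: I hadn't read the argument list properly. The fourth positional argument of re.sub is count, not flags, so my flags value was being treated as a maximum number of replacements. See http://bytes.com/topic/python/answers/689341-sub-does-not-replace-all-occurences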

[whoever just deleted their response - thank you for your help.]

str = re.sub(r' [^a-z0-9]+ ', ' ', str)
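The argument-order mix-up can be reproduced in a few lines. This is a minimal sketch, assuming Python 3 (the code above is Python 2) and a made-up sample string: `re.UNICODE|re.MULTILINE` evaluates to an integer, and in the fourth positional slot that integer is read as `count`, which would explain why a long HTML document stopped being processed partway through.

```python
import re

pattern = r" [^a-z0-9]+ "
text = "a » b » c » d"  # hypothetical sample with three symbol runs

# Signature: re.sub(pattern, repl, string, count=0, flags=0).
# The fourth positional slot is count, so flags passed there act as a limit.
flags_as_count = int(re.UNICODE | re.MULTILINE)  # 32 | 8 == 40

# Same effect on a small scale: a count of 2 stops after two replacements.
limited = re.sub(pattern, " ", text, count=2)

# Passing flags by keyword leaves count at 0, which means "replace all".
replaced_all = re.sub(pattern, " ", text, flags=re.UNICODE)

print(flags_as_count)
print(limited)
print(replaced_all)
```

Naming the argument (`flags=...`) avoids the problem entirely; dropping the flags, as in the line above, also works because this pattern does not depend on them.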

The final test string, str_nbsp, did not match the regex above because one of its space characters is actually a non-breaking space. I found this out by hovering over each character on www.regexr.com.
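The hidden character can also be found from Python itself. The sketch below assumes Python 3 and assumes the non-breaking space is the one before the "»" (the post does not say which of the two it is). It lists the Unicode name of each character, then widens the pattern so that either kind of space is accepted as a delimiter.

```python
import re
import unicodedata

# Assumption: the NBSP (U+00A0) sits before the arrow; the other space is plain.
str_nbsp = "afds\u00a0» asf"

# Name every character to expose look-alikes such as NO-BREAK SPACE.
names = [unicodedata.name(ch, "UNKNOWN") for ch in str_nbsp]

# The original pattern needs a literal space on both sides, so nothing matches.
unchanged = re.sub(r" [^a-z0-9]+ ", " ", str_nbsp)

# Accept a plain space or an NBSP on either side of the symbol run.
fixed = re.sub(r"[ \xa0][^a-z0-9]+[ \xa0]", " ", str_nbsp)

print(names[4])
print(unchanged == str_nbsp)
print(fixed)
```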

malana
Dave
  • Just letting you know about the [Stack Overflow Regular Expressions FAQ](http://stackoverflow.com/a/22944075/2736496). :) – aliteralmind Apr 17 '14 at 02:16
  • Thanks. I'm a regex pro in Perl, but a newbie in Python. Still getting used to the different syntax. – Dave Apr 17 '14 at 07:41
  • Debuggex.com, if you don't know it already, is an online tester with both Python and PCRE. – aliteralmind Apr 17 '14 at 11:53
  • Thanks, I also found [regexr.com](http://regexr.com) and [rubular.com](http://www.rubular.com) useful for checking regexes. – Dave Apr 17 '14 at 16:43

1 Answer

str = re.sub(r' [^a-z0-9]+ ', ' ', str)
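For the more general goal in the question (drop tokens made only of symbols, keep anything containing a word character), a token-based approach is one possible alternative. This is only a sketch under assumptions, not the method used above: it assumes Python 3, the helper name `drop_symbol_tokens` is hypothetical, and it collapses all whitespace (including non-breaking spaces) to single spaces.

```python
import re

def drop_symbol_tokens(text):
    """Keep whitespace-separated tokens that contain at least one \\w character."""
    # str.split() with no arguments splits on any Unicode whitespace, NBSP included.
    tokens = text.split()
    kept = [tok for tok in tokens if re.search(r"\w", tok)]
    return " ".join(kept)

result = drop_symbol_tokens("re of… » Pr | removepipeaswell")
result_nbsp = drop_symbol_tokens("afds\u00a0» asf")
print(result)
print(result_nbsp)
```

Note that `\w` includes the underscore, so a lone `_` token would be kept; tighten the check if that matters.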