I am trying to strip all special characters from a string while retaining everything else, including punctuation marks.

mystring = "Q18. On a scale from 0 to 10 where 0 means ‘not at all interested' and 10 means ‘very interested', how interested are you in helping to address problems that affect poor people in poor countries?"

My effort so far:

newlabel = re.sub('[^A-Za-z0-9]+', ' ', mystring)

Output:

Q18 On a scale from 0 to 10 where 0 means not at all interested and 10 means very interested how interested are you in helping to address problems that affect poor people in poor countries 

How can I retain the punctuation marks with the regex I currently have, or is there a better solution?

Cœur
Boosted_d16

3 Answers

Solved. Note that this is Python 2; in Python 3, str has no decode method:

print(mystring.decode('unicode_escape').encode('ascii', 'ignore'))

Output:

Q18. On a scale from 0 to 10 where 0 means not at all interested' and 10 means very interested', how interested are you in helping to address problems that affect poor people in poor countries?
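
For readers on Python 3, the same idea can be sketched as an encode/decode round trip. This is only an illustration, not the poster's code; the sample string is a shortened version of the one in the question, and the name cleaned is made up for the example.

```python
# Python 3 sketch: drop every non-ASCII character, keep ASCII punctuation.
# Shortened sample of the string from the question (assumption for brevity).
mystring = "Q18. 0 means ‘not at all interested' and 10 means ‘very interested', how interested are you?"

# encode(..., 'ignore') discards characters that ASCII cannot represent;
# decode('ascii') turns the resulting bytes back into a str.
cleaned = mystring.encode('ascii', 'ignore').decode('ascii')
print(cleaned)
```

Unlike the regex in the question, this leaves all ASCII punctuation and whitespace untouched and only removes characters outside the ASCII range, such as the curly quote.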
Boosted_d16

If all you need is to retain the dot, then adding it to the character class will solve that. Inside [...] the dot is already literal, so no backslash is needed:

re.sub(r'[^A-Za-z0-9.]+', ' ', mystring)
Károly Nagy

Just add a backslash before each punctuation mark in the regular expression.
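
One way to do that without typing the backslashes by hand is to let re.escape handle it. The following is a rough sketch, assuming that "punctuation" means the ASCII set in string.punctuation; the names allowed and pattern are made up for the example, and the sample string is shortened from the question.

```python
import re
import string

# Shortened sample of the string from the question (assumption for brevity).
mystring = "Q18. 0 means ‘not at all interested' and 10 means ‘very interested', how interested are you?"

# string.punctuation holds the ASCII punctuation characters.
# re.escape escapes the ones that would otherwise be special in a pattern
# (such as ], \, ^ and -), so they are safe inside a character class.
allowed = 'A-Za-z0-9' + re.escape(string.punctuation)

# Replace every run of characters outside the allowed set with one space.
pattern = re.compile('[^' + allowed + ']+')
newlabel = pattern.sub(' ', mystring)
print(newlabel)
```

Because spaces are not in the allowed set, any run of whitespace plus special characters collapses to a single space, which matches the behaviour of the original regex in the question.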

Ian Ellis