-3

I want to remove each of the following special characters from my documents:

symbols = {`,~,!,@,#,$,%,^,&,*,(,),_,-,+,=,{,[,],},|,\,:,;,",<,,,>,.,?,/}

The reason why I am not simply doing something like this:

document = re.sub(r'([^\s\w]|_)+', '', document)

is that it also removes many (accented/special) letters from documents written in languages such as Polish.

How can I remove each of the special characters above in one expression?

Outcast
  • @DeveshKumarSingh, thank you for your answer but I do not know what this exactly means. Can you please give me an example or a complete answer to my question? – Outcast May 30 '19 at 10:34
  • @DeveshKumarSingh, it can be any text which has some of these special characters too (as many texts do). You can very easily create one by yourself. – Outcast May 30 '19 at 10:39
  • @DeveshKumarSingh no confusion will occur - let's not waste time - the others below have already given an answer. If you cannot come up with any text (how difficult is it?) then take this: `(Hello World)] *!` which should be `Hello World` – Outcast May 30 '19 at 10:44

5 Answers

1

You can solve this without regular expressions by using str.replace():

symbols = {"`", "~", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "-", "+", "=", "{", "[", "]", "}", "|", "\\", ":", ";", "\"", "<", ",", ">", ".", "?", "/"}

for c in symbols:
    document = document.replace(c, "")
Olvin Roght
  • To be honest, I am now leaning towards using `.replace()` rather than regular expressions, for the reasons mentioned here too: https://stackoverflow.com/questions/5668947/use-pythons-string-replace-vs-re-sub – Outcast May 30 '19 at 10:45
  • @PoeteMaudit, you should apply regex in situations where it's really needed. In this particular situation there are other solutions. You can also check the benchmarks in [this](https://stackoverflow.com/a/27086669/10824407) answer. – Olvin Roght May 30 '19 at 10:51
  • Of course, but the problem is that I am not sure when regexes are really needed. Any specific advice, please? To be very honest with you, I am not sure what the point of regex is while .replace() exists, but on the other hand I see many people using the former. – Outcast May 30 '19 at 10:53
  • @PoeteMaudit, there are [some useful tips](https://stackoverflow.com/a/5670379/10824407). – Olvin Roght May 30 '19 at 11:03
  • By the way, since you are open to answering these questions, why not use str.translate() in this case? It is the same question as with regexes: when to use str.translate() and when to use str.replace()? – Outcast May 30 '19 at 12:01
0

If you have a list of the symbols you want to remove, you can construct this simple regex:

rgx = '|'.join(map(re.escape, symbols))

Example:

import re

# example symbols list
symbols = ['"', '<', '+', '*']

document = '<div prop="+*+">'

rgx = '|'.join(map(re.escape, symbols))

document = re.sub(rgx, '', document)

print(document)

Output:

div prop=>

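The code '|'.join(map(re.escape, symbols)) will construct the following regex (on Python 3.7 and later, re.escape no longer escapes " and <, so the pattern there is "|<|\+|\*, which matches the same characters):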

\"|\<|\+|\*

which means match any one of the symbols ", <, +, or *.
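As a variation on the alternation above, the same symbols can also be folded into a single character class. This is only a minimal sketch, assuming the symbol list from the question is complete; the names SYMBOLS, PATTERN and strip_symbols are illustrative and not part of the original answer:

```python
import re

# Symbols from the question, written as one string (assumed complete).
SYMBOLS = "`~!@#$%^&*()_-+={[]}|\\:;\"<,>.?/"

# re.escape makes every character safe to place inside [...].
PATTERN = re.compile("[" + re.escape(SYMBOLS) + "]")


def strip_symbols(text):
    """Return text with every character in SYMBOLS removed."""
    return PATTERN.sub("", text)


print(strip_symbols("(Hello World)] *!"))
print(strip_symbols("Zażółć, gęślą: jaźń!"))
```

Whitespace is left untouched, so the first call keeps the space that sat between `]` and `*`; add a `.strip()` if trailing spaces matter. The accented Polish letters in the second call are not in SYMBOLS and so are preserved.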

MrGeek
0
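symbols = ['a', 'b', '|']  # example characters to delete
document = 'a|bcd'  # sample input
# Map each code point to None so that translate() drops it.
document = document.translate({ord(c): None for c in symbols})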
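Applied to the full symbol list from the question, the same idea might look like the sketch below. It uses str.maketrans to build the deletion table once; the names SYMBOLS, DELETE_TABLE and remove_symbols are assumptions made for illustration:

```python
# Symbols from the question, written as one string (assumed complete).
SYMBOLS = "`~!@#$%^&*()_-+={[]}|\\:;\"<,>.?/"

# The third argument of str.maketrans lists the characters to delete.
DELETE_TABLE = str.maketrans("", "", SYMBOLS)


def remove_symbols(text):
    """Return text with every character in SYMBOLS removed."""
    return text.translate(DELETE_TABLE)


print(remove_symbols("Zażółć, gęślą: jaźń!"))
```

Because only the listed characters are in the table, accented letters such as those in the Polish sample pass through unchanged.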
Mezbaul Haque
0

If you want to remove every punctuation character, you can use str.replace together with the string module:

import string

a = '345l,we.gm34mf,]-='

for char in string.punctuation:
    a = a.replace(char, '')
a

'345lwegm34mf'

If you need more symbols to replace (string.punctuation equals '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'), you can append them to the string that the for loop iterates over.
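A minimal sketch of that extension, assuming a few extra characters that are not in string.punctuation; EXTRA, TO_REMOVE and strip_chars are hypothetical names, and the extra characters are placeholders for whatever your documents contain:

```python
import string

# Hypothetical extras; replace with whatever your documents need.
EXTRA = "«»§"
TO_REMOVE = string.punctuation + EXTRA


def strip_chars(text, chars=TO_REMOVE):
    """Remove every character in chars from text, one replace() per character."""
    for ch in chars:
        text = text.replace(ch, "")
    return text


print(strip_chars("«345l,we.gm34mf,]-=»"))
```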

vurmux
0

Without re:

document = "".join(c for c in document if c not in symbols)
Afik Friedberg