
My aim is to remove all punctuation from a string so that I can then get the frequency of each word in it.

My string is:

WASHINGTON—Coming to the realization in front of millions of viewers during the broadcast of his show, a horrified Tucker Carlson stated, ‘I…I am the mainstream media’ Wednesday as he began spiraling live on air. “We’ve discovered evidence of rampant voter fraud, and the president has every right to call for an investigation even if the mainstream media thinks...” said Carlson, who trailed off, stared down at his shaking hands, and felt a sudden ringing in his ears as he looked back up and zeroed in on the production crew surrounding him. “The media says…wait. Those liars on TV will try to tell you…oh God. We’re the number-one program on cable news, aren’t we? Fox News…Fox ‘News.’ It’s the media. It’s me. This can’t be. No, no, no, no. Jesus Christ, I make $6 million a year. Get that camera off me!” At press time, Carlson had torn the microphone from his lapel and fled the set in panic.

source: https://www.theonion.com/i-i-am-the-mainstream-media-realizes-horrified-tuc-1845646901

I want to remove all punctuation from it. I do that like this:

import string
s.translate(str.maketrans('', '', string.punctuation))

This is the output -

WASHINGTON—Coming to the realization in front of millions of viewers during the broadcast of his show a horrified Tucker Carlson stated ‘I…I am the mainstream media’ Wednesday as he began spiraling live on air “We’ve discovered evidence of rampant voter fraud and the president has every right to call for an investigation even if the mainstream media thinks” said Carlson who trailed off stared down at his shaking hands and felt a sudden ringing in his ears as he looked back up and zeroed in on the production crew surrounding him “The media says…wait Those liars on TV will try to tell you…oh God We’re the numberone program on cable news aren’t we Fox News…Fox ‘News’ It’s the media It’s me This can’t be No no no no Jesus Christ I make 6 million a year Get that camera off me” At press time Carlson had torn the microphone from his lapel and fled the set in panic

As you can see, characters such as " and ... are still present. Am I wrong to expect them to be removed too? If the output is correct, how can I avoid treating ‘News’ and News as different words?

krtkush
  • I'd say yes, you're incorrectly expecting non-ASCII to match ASCII. – superb rain Nov 12 '20 at 20:20
  • It doesn't work because `...` in your example is not three ASCII periods; it's a Unicode ellipsis character, which is not present in `string.punctuation`. The same goes for all the other characters that you thought were plain punctuation. – John Gordon Nov 12 '20 at 20:24

3 Answers

>>> import string
>>> "“" in string.punctuation
False
>>> "—" in string.punctuation
False

Welcome to the wonderful world of Unicode where, among many other things, `…` is not three concatenated full stops and:

>>> import unicodedata
>>> unicodedata.name('—')
'EM DASH'

is not a hyphen.

How you want to handle the full scope of what could be considered 'punctuation' across the Unicode table is probably out of scope for this question, but you could either come up with your own ad-hoc list or use a third-party library designed for that type of text manipulation. Here is one such approach:

Best way to strip punctuation from a string
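
As one possible ad-hoc approach using only the standard library, the sketch below deletes every character whose Unicode general category starts with `P` (punctuation). The names `PUNCT_TABLE` and `strip_unicode_punctuation` are made up for this illustration. Note that this is not a superset of `string.punctuation`: characters such as `$`, `+` and `=` are classified as symbols, not punctuation, so they are left in place.

```python
import sys
import unicodedata

# Translation table mapping every punctuation code point to None (delete).
# Built once by scanning the whole Unicode range.
PUNCT_TABLE = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)

def strip_unicode_punctuation(text):
    """Remove all characters in the Unicode punctuation categories (P*)."""
    return text.translate(PUNCT_TABLE)

sample = "“The media says…wait.” Fox ‘News.’ It’s me—no!"
print(strip_unicode_punctuation(sample))
```

Because the characters are deleted outright, words that were separated only by punctuation (such as `says…wait`) end up joined together, which matters if the goal is word counting.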

Brad Solomon

Here is the list of characters that your implementation removes:

>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

You can use the following to remove all special characters while keeping spaces:

''.join(e for e in s if e.isalnum() or e == ' ')
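
Since the stated goal is word frequency, a variation on the line above may be worth considering: replacing each unwanted character with a space instead of deleting it, so that `says…wait` becomes two words and not `sayswait`. This is only a sketch; the function name `word_counts` and the decision to lowercase the text are assumptions, not part of the original question.

```python
from collections import Counter

def word_counts(text):
    # Swap every character that is neither alphanumeric nor whitespace
    # for a space, so punctuation acts as a word separator.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    # Lowercase so that "News" and "news" are counted together (an assumption).
    return Counter(cleaned.lower().split())

sample = "Fox News…Fox ‘News.’ It’s the media. It’s me."
counts = word_counts(sample)
print(counts["news"], counts["fox"])
```

The trade-off is that apostrophes also become separators, so `It’s` is counted as `it` and `s`. If contractions should stay whole, the apostrophe characters could be deleted first and only the remaining punctuation replaced with spaces.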
Cenk Bircanoglu

It looks like `…` and a couple of the other characters you are having trouble with are special Unicode characters. A workaround is to use `str.isalpha()`, which tells you whether every character in a string is alphabetic. Note that this also drops digits, so the `6` in `6 million` is removed.

result = ""
for ch in s:  # s is the input text; naming it `string` would shadow the module
    if ch.isalpha() or ch == " ":
        result += ch
deadpython