Remove punctuation from Unicode formatted strings

Question

I have a function that removes punctuation from a list of strings:

import re

def strip_punctuation(input):
    x = 0
    for word in input:
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
        x += 1
    return input

I recently modified my script to use Unicode strings so I could handle other non-Western characters. This function breaks when it encounters these special characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode formatted strings?

`strip_punctuation()` should accept a single string instead of a list of strings; then, if you need it, you could use `list_of_strings = map(strip_punctuation, list_of_strings)` – jfs Jun 16 '12 at 20:43
That might be a better way, actually. I like your and F.C.'s implementations using Unicode categories. – acpigeon Jun 16 '12 at 20:50

score 75 · Accepted Answer · edited Feb 07 '14 at 12:24

You could use the unicode.translate() method:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)

You could also use r'\p{P}', which is supported by the third-party regex module (it has to be installed separately):

import regex as re

def remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)
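If the third-party regex module cannot be guaranteed on every machine (the concern raised in the comments on this answer), one option is to use it when it is present and fall back to the standard library otherwise. The following is a minimal Python 3 sketch under that assumption; the try/except import and the fallback branch are illustrative and not part of the original answer.

```python
import unicodedata

try:
    import regex  # third-party package, installed separately (pip install regex)
except ImportError:
    regex = None  # fall back to the standard library below


def remove_punctuation(text):
    """Remove characters in the Unicode punctuation categories (P*)."""
    if regex is not None:
        # \p{P} matches any code point whose general category starts with "P".
        return regex.sub(r"\p{P}+", "", text)
    # Standard-library fallback: filter character by character.
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))


print(remove_punctuation("Hello, world!"))  # Hello world
```

Both branches use the same definition of punctuation (the Unicode P* categories), so callers should see the same result whichever path is taken.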

edited Feb 07 '14 at 12:24 by Pablo · answered Jun 16 '12 at 20:11 by jfs

+1 for suggesting regex - this is _the_ way to go here. It would be worth noting that it's non-standard (yet) and has to be installed separately. Also, in py2, you need the pattern to be unicode (`ur".."`) to toggle unicode matching mode. – georg Jun 16 '12 at 21:17
@thg435: I've added link to regex module and made the pattern unicode – jfs Jun 16 '12 at 21:21
@thg435 I agree regex is ideally the way to go. Unfortunately I need to keep my external modules to a minimum as I'm not the only user. I've gone with the former solution which is slow but it does work. Thanks everyone. – acpigeon Jun 17 '12 at 18:43
@acpigeon: I've moved `tbl` to the global scope to make it clear that it only needs to be generated once – jfs Jun 17 '12 at 21:47
@Pablo: why did you revert my Python 3 compatible edit? – metakermit Feb 07 '14 at 12:27
@kermit666: my guess: the question has no explicit [tag:python-3.x] tag therefore the code should run on Python 2. Your edit breaks the code on Python 2. – jfs Feb 07 '14 at 12:43
@kermit666 Look at the guidelines for reviewing Suggested Edits: http://meta.stackexchange.com/a/155539/178187. Specifically, the second point in "Common reasons to Reject". I reviewed your edit, but when I tried to reject it, other users had already approved it. That's why I reverted it. – Pablo Feb 07 '14 at 15:13
Fair enough. I added it as a new answer. – metakermit Feb 07 '14 at 19:16
The `re` module (not `regex`) doesn't seem to support `\p{P}`, does it? – ratsimihah Jul 17 '15 at 20:37
Hmm, I didn't realize `regex` was a PyPI module. Thanks! – ratsimihah Jul 20 '15 at 16:30
Thank you. I was looking for this answer when all other answers were for string. – seokhoonlee Mar 24 '16 at 05:57
On py3.5, I get a syntax error for the regex based solution, `temp = regex.sub(ur"\p{P}+", "", s)` the little arrow is at the double quote after the `+`. Something I am missing here? – posdef Sep 28 '16 at 15:55
@posdef it is Python 2 code (read the very first comment). Drop the `u''` prefix before `r''` on Python 3 or use `u"\\p{P}+"` (you have to escape the backslash manually in this case). – jfs Sep 28 '16 at 16:11
Note that the `regex` solution does not strip out `|`. Can anybody point out how to add it to the regex? – Dennis Golomazov Nov 09 '16 at 19:29
@DennisGolomazov: it is correct. `|` (U+007C) is a [Math Symbol: `\p{Sm}`](https://codepoints.net/search?gc=Sm), it is not a [Unicode punctuation](https://codepoints.net/search?gc=P). Perhaps, you want `\p{posix_punct}` (`[[:punct:]]`). Depending on your specific case, it might be simpler to specify characters that you want to keep. It might be a good separate question if you have a specific list of requirements (what to keep, what to remove). – jfs Nov 09 '16 at 20:15
@Mithril perhaps [my last comment](http://stackoverflow.com/questions/11066400/remove-punctuation-from-unicode-formatted-strings/11066687#comment68272124_11066687) applies to your case too. – jfs Feb 23 '17 at 09:03
@J.F. Sebastian Does that mean I have to `text.translate(tbl)` and then `re.sub`? Any way to merge them? – Mithril Feb 23 '17 at 13:18
@Mithril no. These are different solutions. The point is: what Unicode standard thinks "what is punctuation" may differ from what POSIX thinks "what is punctuation". You can pick whatever definition you like or even construct your own. You can use either str.translate or the regex module (but there is no point in mixing them). – jfs Feb 23 '17 at 14:16
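
The difference described in the comments above can be checked with the standard library. The sketch below is Python 3 and assumes that string.punctuation (the 32 ASCII punctuation characters) is an acceptable stand-in for the POSIX class; the names UNICODE_PUNCT, ASCII_PUNCT, COMBINED and remove_all_punctuation are made up for illustration.

```python
import string
import sys
import unicodedata

# "|" is a math symbol (Sm), so a table built from the P* categories skips it.
print(unicodedata.category("|"))  # Sm
print(unicodedata.category(","))  # Po

# Code points in the Unicode punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po).
UNICODE_PUNCT = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

# ASCII punctuation as listed by the string module; includes "|", "$", "+".
ASCII_PUNCT = dict.fromkeys(map(ord, string.punctuation))

# One table covering both definitions, so a single translate() call suffices.
COMBINED = {**UNICODE_PUNCT, **ASCII_PUNCT}


def remove_all_punctuation(text):
    return text.translate(COMBINED)


print(remove_all_punctuation("a|b, c«d»"))  # ab cd
```

This is one way of constructing your own definition of punctuation: it stays with a single mechanism (translate) rather than mixing it with the regex module.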

score 23 · Answer 2 · answered Feb 07 '14 at 19:14

If you want to use J.F. Sebastian's solution in Python 3:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                      if unicodedata.category(chr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)
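
For a quick check of the table-based approach on non-Latin text, the sketch below applies it to a list of strings, as suggested in the comments on the question. The sample strings and the names words and cleaned are illustrative; the sketch also uses sys.maxunicode + 1 so that the last code point is included in the scan.

```python
import sys
import unicodedata

# Built once at import time; maps every punctuation code point to None (delete).
tbl = dict.fromkeys(i for i in range(sys.maxunicode + 1)
                    if unicodedata.category(chr(i)).startswith('P'))


def remove_punctuation(text):
    return text.translate(tbl)


# Illustrative input: Spanish, Russian and Japanese, each with its own punctuation.
words = ["¿Cómo estás?", "«Привет», мир!", "こんにちは、世界。"]
cleaned = [remove_punctuation(w) for w in words]
print(cleaned)
```

Letters, digits and spaces are left untouched; only code points whose general category starts with P are dropped.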

score 9 · Answer 3 · edited Jun 16 '12 at 20:59

You can iterate through the string using the unicodedata module's category function to determine if the character is punctuation.

For the possible outputs of category, see unicode.org's documentation on General Category Values.

from unicodedata import category as cat

def strip_punctuation(word):
    # Keep only the characters whose general category is not punctuation (P*).
    return "".join(char for char in word if not cat(char).startswith('P'))

filtered = [strip_punctuation(word) for word in input]

Additionally, make sure that you're handling encodings and types correctly. This presentation is a good place to start: http://bit.ly/unipain
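
As a rough illustration of that advice, the Python 3 sketch below decodes bytes to text at the input boundary, filters on text, and encodes again only on output. The UTF-8 encoding and the helper name strip_punctuation_bytes are assumptions made for the example.

```python
import unicodedata


def strip_punctuation(text):
    # Operates on text (str), never on raw bytes.
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))


def strip_punctuation_bytes(raw, encoding="utf-8"):
    # Decode first, process as text, then encode again for output.
    return strip_punctuation(raw.decode(encoding)).encode(encoding)


raw = "naïve café, déjà vu!".encode("utf-8")
print(strip_punctuation_bytes(raw).decode("utf-8"))  # naïve café déjà vu
```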

edited Jun 16 '12 at 20:59 · answered Jun 16 '12 at 19:34 by Daenyth

+1 for unipain link. I'm trying to implement this, but I'm getting "IndexError: list assignment index out of range" on the result[i] line. I'll keep messing around. – acpigeon Jun 16 '12 at 20:47
@acpigeon: For some reason I was thinking you could assign to lists in a sparse way without pre-populating it. Edited with a better approach. – Daenyth Jun 16 '12 at 20:57
There's a small but important bug in this answer: strip_punctuation actually does the opposite of what you intend, and will return *only* the punctuation, because you forgot a `not` in your comprehension. I would edit the answer to fix it, except "edits must be at least 6 characters." – Edward Jan 15 '15 at 19:36

score 8 · Answer 4 · edited May 23 '17 at 10:31

A slightly shorter version based on Daenyth's answer:

import unicodedata

def strip_punctuation(text):
    """
    >>> strip_punctuation(u'something')
    u'something'

    >>> strip_punctuation(u'something.,:else really')
    u'somethingelse really'
    """
    punctuation_cats = set(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])
    return ''.join(x for x in text
                   if unicodedata.category(x) not in punctuation_cats)

input_data = [u'something', u'something, else', u'nothing.']
without_punctuation = map(strip_punctuation, input_data)

edited May 23 '17 at 10:31 by Community · answered Jun 16 '12 at 19:55 by Facundo Casco

OP said `input_data` is a list of strings, not just one string. (Of course, you can just map your version over it) – Daenyth Jun 16 '12 at 20:06
