56

I have the following code

import nltk, os, json, csv, string, cPickle
from scipy.stats import scoreatpercentile

lmtzr = nltk.stem.wordnet.WordNetLemmatizer()

def sanitize(wordList):
    answer = [word.translate(None, string.punctuation) for word in wordList]
    answer = [lmtzr.lemmatize(word.lower()) for word in answer]
    return answer

words = []
for filename in json_list:
    words.extend([sanitize(nltk.word_tokenize(' '.join([tweet['text'] 
                   for tweet in json.load(open(filename,READ))])))])

I tested lines 2-4 (the body of sanitize) in a separate testing.py file, where I wrote

import nltk, os, json, csv, string, cPickle
from scipy.stats import scoreatpercentile

lmtzr = nltk.stem.wordnet.WordNetLemmatizer()

wordList= ['\'the', 'the', '"the']
print wordList
wordList2 = [word.translate(None, string.punctuation) for word in wordList]
print wordList2
answer = [lmtzr.lemmatize(word.lower()) for word in wordList2]
print answer

freq = nltk.FreqDist(wordList2)
print freq

and the command prompt returns ['the','the','the'], which is what I wanted (removing punctuation).

However, when I put the exact same code in a different file, Python raises a TypeError:

File "foo.py", line 8, in <module>
  for tweet in json.load(open(filename, READ))])))])
File "foo.py", line 2, in sanitize
  answer = [word.translate(None, string.punctuation) for word in wordList]
TypeError: translate() takes exactly one argument (2 given)

json_list is a list of all the file paths (I printed it and checked that the list is valid). I'm confused by this TypeError because everything works perfectly fine when I test it in a different file.

Martijn Pieters
carebear
  • 1
    Maybe this happens, because another encoding (utf8 for instance) is used in this file, for which the translate function only gets one argument. I'm not sure, but is this possible? You can check this by printing type(wordList) for each case. – Thorben Apr 19 '14 at 22:03
  • Can you show your import statements? Maybe there is a translate function that you are unknowingly importing. Try "print translate" when you get the exception and see which module it comes from – Spaceghost Apr 20 '14 at 14:14
  • @Spaceghost, import statements are: `import nltk, os, json, csv, string, cPickle` `from scipy.stats import scoreatpercentile (2 separate lines)` – carebear Apr 20 '14 at 16:30
  • Your example code in the second file will not run even after adding imports because you have left out code like what creates lmtzr. – Spaceghost Apr 21 '14 at 17:15
  • @Spaceghost I have the proper statements for creating lmtzr. The second block of code works fine. just the translate method in the first block doesn't work. – carebear Apr 23 '14 at 23:16
  • 2
    Your code, as seen above, is incomplete. No-one else can take that and run it to see what it does. – Spaceghost Apr 24 '14 at 02:31
  • @Spaceghost I've edited the code; the second block should be able to run now; however you would need to install the nltk package for the lemmatizer to work. – carebear Apr 26 '14 at 03:24
  • Are you using the same Python version for all of your tests? In Python 3, `str.translate` doesn't allow the two argument form that was legal in Python 2. – Blckknght Apr 26 '14 at 04:00
  • @Blckknght I have both versions on my download folder but only Python 2.7 on my C drive..is there a way for me to check the Python version I'm using in the code? – carebear Apr 26 '14 at 04:07
  • @Carrie: You could try `import sys; print(sys.version)`. One other possibility that just occurred to me. The Python 3 behavior is actually a side effect of the change of `str` to be `unicode`. If your `word` values are unicode objects, you might have the same issue in Python 2 (and your simpler test code might work because you're using regular Python 2 `str` instances instead). – Blckknght Apr 26 '14 at 04:34
  • @Blckknght both files are running on python 2.7.3 – carebear Apr 27 '14 at 04:51
  • @Carrie: I've updated my answer to address that. I suspect the issue is `str` versus `unicode` instances. – Blckknght Apr 27 '14 at 18:21

4 Answers

112

If all you want is to do in Python 3 what you were doing in Python 2, here is what I used in Python 2 to throw away punctuation and digits:

text = text.translate(None, string.punctuation)
text = text.translate(None, '1234567890')

Here is my Python 3 equivalent:

text = text.translate(str.maketrans('','',string.punctuation))
text = text.translate(str.maketrans('','','1234567890'))

Basically it says "translate nothing to nothing" (the first two parameters) and "map any punctuation or digits to None" (the third parameter), i.e. remove them.
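As a minimal sketch of how this could be applied to a word list like the one in the question, assuming Python 3: the names REMOVE_TABLE and strip_chars below are illustrative and do not come from the original code. The table is built once and reused for every word.

```python
import string

# Build the table once; the third argument of str.maketrans lists
# the characters that translate() should delete.
REMOVE_TABLE = str.maketrans('', '', string.punctuation + string.digits)

def strip_chars(words):
    """Return the words with punctuation and digits removed (hypothetical helper)."""
    return [word.translate(REMOVE_TABLE) for word in words]

print(strip_chars(["'the", 'the', '"the']))  # ['the', 'the', 'the']
```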

drchuck
  • 2
    You can combine these two maps, trivially, by using `text.translate(str.maketrans('', '', string.punctuation + '1234567890'))` or even better, with `text.translate(str.maketrans('', '', string.punctuation + string.digits))`. I'd store the translation map first in a separate constant and re-use it. – Martijn Pieters Jan 02 '18 at 09:43
  • 1
    Using this `text = text.translate(str.maketrans('','',string.punctuation))` worked for me – Mona Jalal Mar 08 '18 at 21:31
73

I suspect your issue has to do with the differences between str.translate and unicode.translate (these are also the differences between str.translate on Python 2 versus Python 3). I suspect your original code is being sent unicode instances while your test code is using regular 8-bit str instances.

I don't suggest converting Unicode strings back to regular str instances, since unicode is a much better type for handling text data (and it is the future!). Instead, you should just adapt to the unicode.translate syntax. With regular str.translate (on Python 2), you can pass an optional deletechars argument, and the characters in it are removed from the string. For unicode.translate (and str.translate on Python 3), the extra argument is not allowed, but any character whose translation table entry is None is deleted from the output.

To solve the problem you'll need to create an appropriate translation table. A translation table is a dictionary mapping from Unicode ordinals (that is, ints) to ordinals, strings or None. A helper function for making them exists in Python 2 as string.maketrans (and Python 3 as a method of the str type), but the Python 2 version of it doesn't handle the case we care about (putting None values into the table). You can build an appropriate dictionary yourself with something like {ord(c): None for c in string.punctuation}.
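A minimal sketch of that dictionary approach, assuming the words are unicode objects on Python 2.7 or str on Python 3 (a Python 2 byte str would still need the two-argument form). The names PUNCT_TABLE and sanitize_text are hypothetical, and the lemmatizer step from the question is left out to keep the example self-contained.

```python
import string

# Map every punctuation code point to None so that translate() deletes it.
PUNCT_TABLE = {ord(c): None for c in string.punctuation}

def sanitize_text(words):
    """Strip punctuation from each word and lowercase it (hypothetical helper)."""
    return [word.translate(PUNCT_TABLE).lower() for word in words]

# Text loaded with json.load arrives as unicode on Python 2, like these literals.
words = [u"'The", u"the", u'"the']
print(sanitize_text(words))
```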

Blckknght
6

Python 3:

text = text.translate(str.maketrans('','','1234567890'))

static str.maketrans(x[, y[, z]])

This static method returns a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters (strings of length 1) to Unicode ordinals, strings (of arbitrary lengths) or None. Character keys will then be converted to ordinals.

If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.

https://docs.python.org/3/library/stdtypes.html?highlight=maketrans#str.maketrans
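To make the three call forms described above concrete, here is a small Python 3 sketch. The variable names are illustrative only.

```python
# One argument: a dict. Character keys are converted to ordinals.
table_from_dict = str.maketrans({'a': 'A', 'b': None})
print(table_from_dict)                   # {97: 'A', 98: None}
print('abc'.translate(table_from_dict))  # Ac

# Two arguments: equal-length strings, mapped position by position.
table_pairs = str.maketrans('abc', 'xyz')
print('aabbcc'.translate(table_pairs))   # xxyyzz

# Three arguments: characters in the third string are mapped to None (deleted).
table_delete = str.maketrans('abc', 'xyz', '-')
print('a-b-c'.translate(table_delete))   # xyz
```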

furins
ChuQuan
0

If you just want to implement something like this: "123hello.jpg".translate(None, "0123456789") then try this:

 "".join(c for c in "123hello.jpg" if c not in "0123456789")

Output: hello.jpg
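The same filter can be wrapped in a small helper that behaves identically on Python 2 and Python 3, for both str and unicode input. This is only a sketch, and remove_chars is a hypothetical name.

```python
import string

def remove_chars(text, unwanted):
    """Return text without any character found in unwanted (hypothetical helper)."""
    unwanted = frozenset(unwanted)  # set lookup instead of scanning a string each time
    return "".join(c for c in text if c not in unwanted)

print(remove_chars("123hello.jpg", string.digits))  # hello.jpg
```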

princebillyGK