I know there have been several very similar answers here on SO to this exact question, but none of them really answers mine.
I'm trying to remove a series of stop words and punctuation from a list of words to perform basic natural language processing.
```python
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

text = "Hello there. I am currently typing Python. "
custom_stopwords = set(stopwords.words('english') + list(punctuation))

# tokenizes the text into sentences
sentences = sent_tokenize(text)
# tokenizes each sentence into a list of words
words = [word_tokenize(sentence) for sentence in sentences]

filtered_words = [word for word in words if word not in custom_stopwords]
print(filtered_words)
```
This throws `TypeError: unhashable type: 'list'` on the `filtered_words` line. Why is this error being thrown? I'm not providing a `list` collection at all; I'm providing a `set`.
Note: I've read the post on SO on this exact error, but still have the same question. The accepted answer provides this explanation:
> Sets require their items to be hashable. Out of types predefined by Python only the immutable ones, such as strings, numbers, and tuples, are hashable. Mutable types, such as lists and dicts, are not hashable because a change of their contents would change the hash and break the lookup code.
I'm providing a set of strings here, so why is Python still complaining?
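To illustrate my confusion, here is a minimal reproduction with a toy stopword set (no NLTK involved), assuming I understand correctly that the membership test hashes the candidate item:

```python
stopwords_set = {'the', 'a', '.'}  # toy stand-in for my real set

# A string candidate is hashable, so membership testing works:
print('the' in stopwords_set)  # True

# A list candidate must be hashed to check membership, and lists
# are unhashable, so this raises the same error my code does:
try:
    ['the'] in stopwords_set
except TypeError as e:
    print(e)  # unhashable type: 'list'
```

So it seems the error is about the item being tested, not about the set itself.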
EDIT: after reading more into this SO post, which recommends using tuples, I edited my collection object:

```python
custom_stopwords = tuple(stopwords.words('english'))
```

I also realized I have to flatten my list, since `word_tokenize(sentence)` creates a list of lists, which will not filter out punctuation correctly (a list object will never be in `custom_stopwords`, which is a collection of strings).
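Here is a sketch of the flattening fix I mean, using a plain whitespace split as a toy stand-in for `word_tokenize` so it runs without NLTK data:

```python
custom_stopwords = {'i', 'am', '.', 'there'}  # toy stand-in for the real set

sentences = ["Hello there .", "I am currently typing Python ."]
# toy tokenizer: split on whitespace (the real code uses word_tokenize)
words = [sentence.split() for sentence in sentences]  # a list of lists

# flatten first, so each candidate for the membership test is a string
flat = [w for sentence_words in words for w in sentence_words]
filtered = [w for w in flat if w.lower() not in custom_stopwords]
print(filtered)  # ['Hello', 'currently', 'typing', 'Python']
```

With the flattening in place, the `TypeError` goes away and punctuation is filtered as expected.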
However, this still begs the question: why are tuples considered hashable by Python, but sets of strings not? And why does the `TypeError` say `list`?