
I know there have been several very similar answers here on SO to this exact question, but none of them really answers mine.

I'm trying to remove a series of stop words and punctuation from a list of words to perform basic natural language processing.

    from nltk.tokenize import word_tokenize, sent_tokenize
    from nltk.corpus import stopwords
    from string import punctuation


    text = "Hello there. I am currently typing Python. "
    custom_stopwords = set(stopwords.words('english')+list(punctuation))

    # tokenizes the text into a list of sentences
    sentences = sent_tokenize(text)

    # tokenizes each sentence into a list of words
    words = [word_tokenize(sentence) for sentence in sentences]
    filtered_words = [word for word in words if word not in custom_stopwords]
    print(filtered_words)

This throws `TypeError: unhashable type: 'list'` on the `filtered_words` line. Why is this error being thrown? I'm not providing a list collection at all; I'm providing a set.

Note: I've read the post on SO on this exact error, but still have the same question. The accepted answer provides this explanation:

Sets require their items to be hashable. Out of types predefined by Python only the immutable ones, such as strings, numbers, and tuples, are hashable. Mutable types, such as lists and dicts, are not hashable because a change of their contents would change the hash and break the lookup code.

I'm providing a set of strings here, so why is Python still complaining?
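To illustrate what I mean, here is a minimal check (independent of the NLTK code above) showing that the set itself is fine and that it is the membership test that blows up:

```python
stop = {"a", "the", "."}          # a perfectly valid set of strings

print("the" in stop)              # True: strings are hashable

try:
    ["the"] in stop               # probing with a list forces hash(["the"])
except TypeError as e:
    print(e)                      # unhashable type: 'list'
```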

EDIT: after reading more into this SO post, which recommends using tuples, I edited my collection object:

    custom_stopwords = tuple(stopwords.words('english'))

I also realized I have to flatten my list, since `word_tokenize(sentence)` will create a list of lists, which will not filter out punctuation correctly (since a list object will not be in `custom_stopwords`, which is a collection of strings).

However, this still raises the question: why does Python consider tuples hashable but not lists? And why does the `TypeError` mention a list?

Yu Chen
  • try [this](https://stackoverflow.com/questions/42203673/in-python-why-is-a-tuple-hashable-but-not-a-list) post – Nullman Dec 26 '17 at 18:07

1 Answer


words is a list of lists since word_tokenize() returns a list of words.

When you do `[word for word in words if word not in custom_stopwords]`, each `word` is actually a list. To evaluate the `word not in custom_stopwords` condition, Python has to hash `word` so it can be looked up in the set, and that hashing fails because lists are mutable containers and are not hashable.
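One way to fix it is to iterate over the inner lists so that every value you probe the set with is a string. Sketched here with plain nested lists standing in for the `word_tokenize()` output (so it runs without the NLTK data downloads); note that NLTK's English stopwords are lowercase, so the comparison is done case-insensitively:

```python
# Stand-in for [word_tokenize(sentence) for sentence in sentences]:
words = [["Hello", "there", "."], ["I", "am", "typing", "Python", "."]]
custom_stopwords = {"i", "am", "there", "."}

# Flatten and filter in one comprehension; each probe is now a string:
filtered_words = [word
                  for sentence in words
                  for word in sentence
                  if word.lower() not in custom_stopwords]
print(filtered_words)  # ['Hello', 'typing', 'Python']
```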

These posts might help explain what "hashable" means and why mutable containers are not hashable:

alecxe
  • Got it. It turns out that flattening the list is needed: `flattened_words = [item for sublist in words for item in sublist]`, then `filtered_words = [word for word in flattened_words if word not in custom_stopwords]`, since that converts `words` to a list of strings, not a list of lists. Once I flattened the list into a list of strings, I could use whatever collection I wanted to do my filtering (set, tuple, etc.) – Yu Chen Dec 26 '17 at 18:12