Bit of a simple question really, but I can't seem to crack it. I have a string that is formatted in the following way:
["category1",("data","data","data")]
["category2", ("data","data","data")]
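To make the layout concrete, here is a minimal sketch of how I assume the data looks once it is loaded into Python: a list of two-element entries, each holding a category name and a tuple of text strings. The values are placeholders copied from the sample above, not my real data, and posts is the variable name my loop iterates over.

```python
# Hypothetical sample data: placeholder values copied from the format above,
# not the real posts.
posts = [
    ["category1", ("data", "data", "data")],
    ["category2", ("data", "data", "data")],
]

# Each entry unpacks into a category name and a tuple of text strings.
for cat, texts in posts:
    print("%s: %d strings" % (cat, len(texts)))
```

Note that under this assumption the second element of each entry is a tuple of strings rather than a single string.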
I refer to the entries as posts, and I want to get the most frequent words from the data sections. This is what I tried:
from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)
for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print top
However, this gives me the top words per post.
What I need is a single list of top words across all posts.
If I move print top out of the for loop, I only get the results for the last post.
Does anyone have an idea?