4

Bit of a simple question really, but I can't seem to crack it. I have a string that is formatted in the following way:

["category1",("data","data","data")]
["category2", ("data","data","data")]

I call the different categories posts and I want to get the most frequent words from the data section. So I tried:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        if token in freq_dict:
            freq_dict[token] += 1
        else:
            freq_dict[token] = 1
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print top

However, this gives me the top words PER post in the string, while I need a general top-words list.
And if I take print top out of the for loop, it only gives me the results of the last post.
Does anyone have an idea?

likeitlikeit
Shifu
  • Do you want to count the occurrence of every unique word in all the tuples combined? – Janus Troelsen May 04 '13 at 14:35
  • What does wordpunct_tokenize do? It would be easier to help you if we could execute the code you posted. Does it always take a triple or would it work with any length? – Janus Troelsen May 04 '13 at 14:36
  • wordpunct_tokenize comes from the nltk package and tokenizes the string: from nltk.tokenize import wordpunct_tokenize; changed it in the question. And no, I just want the most frequent words from all the posts combined. – Shifu May 04 '13 at 14:38
  • 5
    you probably want to take a look at [Counter](http://docs.python.org/2/library/collections.html#collections.Counter) – soulcheck May 04 '13 at 14:40
  • Seems like a use case for `defaultdict`. As in [this answer](http://stackoverflow.com/questions/893417/item-frequency-count-in-python). Argh, @Nikolaas, please don't neglect to give *all* the information so that we can advise you properly and not leave half-informed comments. – kojiro May 04 '13 at 14:40
  • 3
    @Nikolaas: Please use a better headline next time. Your question is not "for loop, pretty simple". Your question is "how do I find the most common words in multiple separate texts?" – Janus Troelsen May 04 '13 at 15:04

4 Answers

3

This is a scoping problem: the sorting and printing happen inside the loop, so you only ever see per-post results. Also, you don't need to initialize the elements of a defaultdict before incrementing them, which simplifies your code.

Try it like this:

posts = [["category1",("data1 data2 data3")],["category2", ("data1 data3 data5")]]

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict
freq_dict = defaultdict(int)

for cat, text2 in posts:
   tokens = wordpunct_tokenize(text2)
   for token in tokens:
      freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top

This, as expected, outputs

['data1', 'data3', 'data5', 'data2']

as a result.

If you really have something like

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

as an input, you won't need wordpunct_tokenize() as the input data is already tokenized. Then, the following would work:

posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]]

from collections import defaultdict
freq_dict = defaultdict(int)

for cat, tokens in posts:
   for token in tokens:
      freq_dict[token] += 1

top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top

and it also outputs the expected result:

['data1', 'data3', 'data5', 'data2']
Janus Troelsen
likeitlikeit
  • actually I did already initialize the freq_dict, I just did not write it in my post, I will edit it now. – Shifu May 04 '13 at 14:40
  • `freq_dict` is a defaultdict, and anyway `freq_dict` cannot be a list, as `token` is a string rather than an integer. So, it cannot be a list index! – pradyunsg May 04 '13 at 14:44
  • @Nikolaas I removed the offending line from the listing. – likeitlikeit May 04 '13 at 14:47
  • @Nikolaas added more information about the different input formats that might apply. If you really have the input format stated in your question, have a look at the second listing because you won't need `wordpunct_tokenize()` at all. Have fun... – likeitlikeit May 04 '13 at 16:35
3

Why not just use Counter?

In [30]: from collections import Counter

In [31]: data=["category1",("data","data","data")]

In [32]: Counter(data[1])
Out[32]: Counter({'data': 3})

In [33]: Counter(data[1]).most_common()
Out[33]: [('data', 3)]
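
To aggregate over all of the posts rather than a single one, here is a minimal sketch along the same lines (the sample posts list is hypothetical, in the question's (category, text) format):

from collections import Counter
from nltk.tokenize import wordpunct_tokenize

# hypothetical sample input in the question's format
posts = [("category1", "data1 data2 data3"), ("category2", "data1 data3 data5")]

freq = Counter()
for cat, text in posts:
    # update() adds each post's token counts to the running total
    freq.update(wordpunct_tokenize(text))

print freq.most_common(50)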
Fredrik Pihl
  • Still doesn't show how to best chain the texts (**after tokenization**) and get a specific number of most common words. Check my answer. – Janus Troelsen May 04 '13 at 15:11
  • 2
    @Nikolaas: Of course, we already did that. But it is unnecessarily complex to write your own counter when you can just use the one in the standard library. [The best code is no code at all](http://www.codinghorror.com/blog/2007/05/the-best-code-is-no-code-at-all.html). – Janus Troelsen May 04 '13 at 15:32
2
from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize
texts=["a quick brown car", "a fast yellow rose", "a quick night rider", "a yellow officer"]
print Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3)

outputs:

[('a', 4), ('yellow', 2), ('quick', 2)]

As you can see in the documentation for Counter.most_common, the returned list is sorted.

To use with your code, you can do

texts = (x[1] for x in posts)

or you can do

... wordpunct_tokenize(x[1]) for x in posts ...

If your posts actually look like this:

posts=[("category1",["a quick brown car", "a fast yellow rose"]), ("category2",["a quick night rider", "a yellow officer"])]

You can get rid of the categories:

texts = list(chain.from_iterable(x[1] for x in posts))

(texts will be ['a quick brown car', 'a fast yellow rose', 'a quick night rider', 'a yellow officer'])

You can then use that in the snippet at the top of this answer.
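
Putting those pieces together, a self-contained version for that input format might look like this (same sample data as above):

from itertools import chain
from collections import Counter
from nltk.tokenize import wordpunct_tokenize

posts = [("category1", ["a quick brown car", "a fast yellow rose"]),
         ("category2", ["a quick night rider", "a yellow officer"])]

# drop the categories, then tokenize and count every post in one pass
texts = chain.from_iterable(x[1] for x in posts)
print Counter(chain.from_iterable(wordpunct_tokenize(t) for t in texts)).most_common(3)

This gives the same counts as the snippet at the top of the answer.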

Janus Troelsen
  • instead of list comprehension it would be better to use a generator expression – soulcheck May 04 '13 at 15:02
  • @soulcheck: Why? All of it would be read anyway. I think you'll get better spatial locality like this, and better performance. – Janus Troelsen May 04 '13 at 15:05
  • i imagine posts can be quite large. it also doesn't make sense to create a collection only to iterate it once and throw it away. – soulcheck May 04 '13 at 15:07
  • Ahh, you meant for the posts, thought you meant for the tokenized data. – Janus Troelsen May 04 '13 at 15:08
  • That's a good approach, but why not [`chain.from_iterable(wordpunct_tokenize(x) for x in texts)`](http://docs.python.org/3.3/library/itertools.html#itertools.chain.from_iterable)? Soulcheck is right, you can get rid of the list comprehension. – Adam May 04 '13 at 15:16
  • @codesparkle: Mainly because it was longer, and I think "chain" is intuitive to understand, but chain.from_iterable is less so. Anyway, changed it. :) – Janus Troelsen May 04 '13 at 15:20
  • is there no way to do it my way? – Shifu May 04 '13 at 15:48
  • @Nikolaas: If there is no solution in the thread right now that does what you want, you haven't explained the problem very well. Here are some suggestions: If you have a problem [make a short independent test case](http://sscce.org/) ("all my posts are big" is not an excuse, cause you can just do a small post for the example). Also, try drawing on paper what it is that you want. Try to understand why the solutions here don't work for you, find out what it is that they do, you can just fix them. My answer is self-contained. What's the problem? The input is not like yours? Change your code then. – Janus Troelsen May 04 '13 at 17:28
  • @Nikolaas: Now I covered a variant which I think you might be using. Writing `("data", "data")` in your question was a bad idea. If it is actually a list of posts, it makes sense, so that's what I added here. Remember, tuples are not lists. – Janus Troelsen May 04 '13 at 17:53
1

Just change your code so that all the posts are processed first, and only then get the top words:

from nltk.tokenize import wordpunct_tokenize
from collections import defaultdict

freq_dict = defaultdict(int)

for cat, text2 in posts:
    tokens = wordpunct_tokenize(text2)
    for token in tokens:
        freq_dict[token] += 1

# get top after all posts have been processed
top = sorted(freq_dict, key=freq_dict.get, reverse=True)
top = top[:50]
print top
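
For example, with hypothetical sample input like the following, the tokens from both posts land in the same freq_dict, so data1 gets a combined count of 3 and comes out on top (the relative order of the words that occur only once is arbitrary):

posts = [("category1", "data1 data2 data3 data1"),
         ("category2", "data1 data3 data5")]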
pradyunsg
  • I still only get the results from the last post in the loop, so it does not work. – Shifu May 04 '13 at 14:46
  • 1
    @Nikolaas: Why do you think that? – Janus Troelsen May 04 '13 at 14:46
  • @Nikolaas Could you post the value of `posts`?? Because if that's the case it would be helpful for us to see the input, to verify the output... – pradyunsg May 04 '13 at 14:47
  • No need for `freq_dict[token] = 0` as it already is set to zero. – eandersson May 04 '13 at 14:52
  • @Janus I tried it with some fake data, it only gives me back the last line. Schoolboy, posts looks like I posted above: ["category", ("data","data",...)]. I'm working with actual blog posts, so giving the actual string would be a few pages. – Shifu May 04 '13 at 14:53
  • Say Schoolboy, it seems I was far too quick to judge and didn't copy your code over correctly; I thought the unindented freq_dict[token] += 1 was a mistake. It wasn't. It works like a charm now. My apologies and thank you! – Shifu May 04 '13 at 15:38
  • wait, it still does not work, it awards 1 for words, not multiple, creating a correct frequency count. – Shifu May 04 '13 at 15:44
  • @Nikolaas: How is it a problem that it creates a correct frequency count? Isn't that what you wanted? – Janus Troelsen May 04 '13 at 17:56