1

I've got a list of about 273000 words in the list Word_array. About 17000 of them are unique, and those are stored in Word_arrayU.

I want a count for each one.

# make bag of words
Word_arrayU = np.unique(Word_array)
wordBag = [['0','0'] for _ in range(len(Word_array))] # preallocate necessary space
i=0
while i< len(Word_arrayU): #for each unique word
    wordBag[i][0] = Word_arrayU[i]
    # I think this is the part that takes a long time: summing a list comprehension with a conditional.
    wordBag[i][1]=sum([1 if x == Word_arrayU[i] else 0 for x in Word_array])
    i=i+1

Summing a list comprehension with a conditional just seems sloppy. Is there a better way to do it?

CDspace
Mohammad Athar

6 Answers

2
from collections import Counter
counter = Counter(Word_array)
the_count_of_some_word = counter["some_word"]

#printing the counts
for word, count in counter.items():
    print("{} appears {} times.".format(word, count))
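
If only the most frequent words matter, Counter can also rank them. A minimal sketch, assuming a small stand-in list called sample_words (a hypothetical name, not the asker's data):

```python
from collections import Counter

# Hypothetical stand-in for Word_array; the real list has about 273000 entries.
sample_words = ["the", "cat", "the", "hat", "the", "cat"]

# Counter makes a single pass over the list.
counter = Counter(sample_words)

# most_common(n) returns the n highest counts as (word, count) tuples.
top_two = counter.most_common(2)
print(top_two)

# A word that never appeared reports 0 instead of raising KeyError.
missing = counter["dog"]
print(missing)
```

Looking up a missing word this way does not add it to the counter, so the counts stay clean.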
Padraic Cunningham
bravosierra99
1

Since you are already using numpy.unique, just set return_counts=True in the unique call:

import numpy as np

unique,  count = np.unique(Word_array, return_counts=True)

That will give you two arrays, the unique elements and their counts:

In [10]: arr = [1,3,2,11,3,4,5,2,3,4]

In [11]: unique,  count = np.unique(arr, return_counts=True)

In [12]: unique
Out[12]: array([ 1,  2,  3,  4,  5, 11])

In [13]: count
Out[13]: array([1, 2, 3, 2, 1, 1])
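
The two arrays line up by index, so they can be zipped back into the word/count pairs the question asks for. A rough sketch, assuming a short hypothetical list sample_words in place of Word_array:

```python
import numpy as np

# Hypothetical stand-in for Word_array.
sample_words = ["foo", "bar", "foo", "baz", "foo", "bar"]

# unique is sorted; count[i] is the number of occurrences of unique[i].
unique, count = np.unique(sample_words, return_counts=True)

# Convert NumPy scalars to plain Python types for easier downstream use.
word_counts = {str(w): int(c) for w, c in zip(unique, count)}

# Same data in the [word, count] layout used by wordBag in the question.
word_bag = [[str(w), int(c)] for w, c in zip(unique, count)]

print(word_counts)
print(word_bag)
```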
Padraic Cunningham
0

Building on the suggestion from @jonrsharpe...

from collections import Counter

words = Counter()

words['foo'] += 1
words['foo'] += 1
words['bar'] += 1

Output

Counter({'bar': 1, 'foo': 2})

It's really convenient because you don't have to initialize words.

You can also initialize directly from a list of words:

Counter(['foo', 'foo', 'bar'])

Output

Counter({'bar': 1, 'foo': 2})
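
If the words arrive in batches rather than as one list, Counter.update adds to the existing counts instead of replacing them. A small sketch, where the chunks variable is a hypothetical stand-in for whatever produces the batches:

```python
from collections import Counter

# Hypothetical batches of words, e.g. one list per line or per file.
chunks = [["foo", "bar"], ["foo"], ["bar", "baz", "foo"]]

words = Counter()
for chunk in chunks:
    # update() increments counts for each element of the iterable.
    words.update(chunk)

print(words)
```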
Taylor Edmiston
0

I don't know about most 'Pythonic' but definitely the easiest way of doing this would be to use collections.Counter.

from collections import Counter

Word_array = ["word1", "word2", "word3", "word1", "word2", "word1"]

wordBag = Counter(Word_array).items()
work.bin
  • Terrible suggestion - count twice? If you, for some obscure reason, want tuples, Counter(Word_array).items() will do the trick. – volcano Oct 13 '16 at 20:32
  • 1
    If you call a function over the same data twice, that is bad practice. Since OP is talking about a list of 237,000 words, yes, it is terrible. – volcano Oct 13 '16 at 21:16
0

In Python 3, lists have a built-in count method. For example:

>>> h = ["a", "b", "a", "a", "c"]
>>> h.count("a")
3
>>> 

So, you could make it more efficient by doing something like:

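import numpy as np

# Word_array must be a plain Python list here; NumPy arrays have no .count method.
Word_arrayU = np.unique(Word_array)
wordBag = []
for uniqueWord in Word_arrayU:
    # Each .count call scans the whole list once.
    wordBag.append([uniqueWord, Word_array.count(uniqueWord)])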
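
Keep in mind that this loop still rescans the full list once per unique word, whereas Counter reads it once. The sketch below checks that both approaches produce the same pairs on a small hypothetical list (sample_words); it uses sorted(set(...)) rather than np.unique only to stay dependency-free:

```python
from collections import Counter

# Hypothetical stand-in for Word_array (must be a list for .count to exist).
sample_words = ["a", "b", "a", "a", "c"]

# One full scan of sample_words per unique word.
bag_via_count = [[w, sample_words.count(w)] for w in sorted(set(sample_words))]

# One scan in total, then cheap dictionary lookups.
counter = Counter(sample_words)
bag_via_counter = [[w, counter[w]] for w in sorted(counter)]

print(bag_via_count)
print(bag_via_count == bag_via_counter)
```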
rassar
-1

If you want a solution that is less efficient than Counter but more transparent, you can use collections.defaultdict:

from collections import defaultdict
my_counter = defaultdict(int)
for word in Word_array:
    my_counter[word] += 1
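
For comparison, here is a small sketch showing that the defaultdict loop yields the same mapping as Counter, plus one behavioural difference worth knowing. The list sample_words is a hypothetical stand-in for Word_array:

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for Word_array.
sample_words = ["the", "cat", "the", "hat", "the", "cat"]

my_counter = defaultdict(int)
for word in sample_words:
    # A missing key is created with int() == 0 before the increment.
    my_counter[word] += 1

# Both approaches end up with the same word -> count mapping.
counts_match = dict(my_counter) == dict(Counter(sample_words))
print(counts_match)

# Caveat: merely reading a missing key from a defaultdict inserts it,
# whereas Counter returns 0 without storing anything.
_ = my_counter["unseen"]
inserted = "unseen" in my_counter
print(inserted)
```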
Patrick Haugh