I am using python / pandas.

I have a dataframe like this:

     date         id         my_column
0    31.07.20     128909     ['hey', 'hi']
1    31.07.20     128914     ['hi']
3    31.07.20     853124     ['hi', 'hello', 'hey']
4    30.07.20     123456     ['hey']
...

The dataframe is over 1,000,000 rows long. I want the 10 most common words in the my_column column.

Appreciate any help.

Ben W

1 Answer

Use Series.explode with Series.value_counts. The counts are sorted in descending order by default, so the top 10 are the first 10 index values:

out = df['my_column'].explode().value_counts().index[:10].tolist()

Or you can use a pure Python solution to flatten the lists and count the top 10:

from collections import Counter
from itertools import chain

c = Counter(chain.from_iterable(df['my_column']))
out = [a for a, b in c.most_common(10)]
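
As a quick check, here is a minimal sketch that runs both approaches on a small made-up frame. The sample rows are invented for illustration (they are not the asker's data), and the names `top_pandas` and `top_python` are placeholders:

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Made-up sample data: each cell holds a real Python list of words.
df = pd.DataFrame({
    "my_column": [
        ["hey", "hi"],
        ["hi"],
        ["hi", "hello", "hey"],
        ["hey"],
        ["hi"],
    ]
})

# pandas approach: one row per word, then count and keep the first 10 labels.
top_pandas = df["my_column"].explode().value_counts().index[:10].tolist()

# Pure Python approach: flatten the lists lazily and count with Counter.
counts = Counter(chain.from_iterable(df["my_column"]))
top_python = [word for word, _ in counts.most_common(10)]

print(top_pandas)  # ['hi', 'hey', 'hello']
print(top_python)  # ['hi', 'hey', 'hello']
```

With tied counts the two approaches may order the tied words differently, so compare the counts themselves if the exact order matters.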
jezrael
  • Kind of works but the output doesn't give the single most common elements. It also gives several elements in combination. Kind of weird. Is it possible to get an output where only single elements are listed? – Ben W Oct 16 '20 at 11:02
  • @BenW - Are the values in the column lists? What does `print (df['my_column'].head(3).tolist())` show? – jezrael Oct 16 '20 at 11:03
  • ["['hi', 'hey', 'hello', 'asd', 'fgh]", '[]', '[]'] – Ben W Oct 16 '20 at 11:06
  • @BenW - Is it possible to use `import ast` and `df['my_column'] = df['my_column'].apply(ast.literal_eval)` before my solution? – jezrael Oct 16 '20 at 11:07
  • Gives the same output. EDIT: Never mind, I made a mistake. It works now. – Ben W Oct 16 '20 at 11:10
  • @BenW - Do you get the same output with `out = df['my_column'].apply(ast.literal_eval).explode().value_counts().index[:10].tolist()` as with `out = df['my_column'].explode().value_counts().index[:10].tolist()`? – jezrael Oct 16 '20 at 11:10
  • Thank you very much. Amazing help! Made my day! – Ben W Oct 16 '20 at 11:10
  • Quick question: What does the .apply(ast.literal_eval) do? And another question: How do I make all the words lower case or alternatively make the code not distinguish between upper and lower case for the top list. – Ben W Oct 16 '20 at 11:19
  • @BenW - It converts the string representation of each list into an actual list, check [this](https://stackoverflow.com/questions/1894269/how-to-convert-string-representation-of-list-to-a-list). For lowercase, use `out = df['my_column'].apply(ast.literal_eval).explode().str.lower().value_counts().index[:10].tolist()` – jezrael Oct 16 '20 at 11:21
  • Thanks. I get a "ValueError: malformed node or string". The words in my data actually contain a "#" before every word. I think that may cause the error. – Ben W Oct 16 '20 at 11:29
  • @BenW - Hmm, it means the values are already lists, or some values are malformed. – jezrael Oct 16 '20 at 11:30
  • Ah, it's because I applied ast.literal_eval twice. My bad. Everything is working perfectly now. Can I donate some money to you for your help? – Ben W Oct 16 '20 at 11:32
  • Feel free to send me your paypal or something in a message. Appreciate your help a ton! – Ben W Oct 16 '20 at 11:33
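
To pull the comment thread together, here is a rough sketch that assumes the column holds string representations of lists, as in the output shown in the comments. The sample values and the helper `to_list` are hypothetical. The `isinstance` guard is one possible way to avoid the "malformed node or string" error that appears when `ast.literal_eval` is applied to values that are already lists:

```python
import ast

import pandas as pd

# Made-up sample: lists stored as strings, words prefixed with "#".
df = pd.DataFrame({
    "my_column": [
        "['#Hi', '#hey']",
        "[]",
        "['#hi', '#HELLO', '#hi', '#Hey']",
    ]
})


def to_list(value):
    """Hypothetical helper: parse a string into a list, pass lists through."""
    return ast.literal_eval(value) if isinstance(value, str) else value


parsed = df["my_column"].apply(to_list)
# Applying the helper a second time is harmless because lists pass through.
parsed = parsed.apply(to_list)

# Lowercase each word before counting so "#Hi" and "#hi" are merged.
top = parsed.explode().str.lower().value_counts().index[:10].tolist()

print(top)  # ['#hi', '#hey', '#hello']
```

Empty lists become NaN after `explode`, and `value_counts` drops NaN by default, so they do not show up in the result.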