6

I have a word list like following.

wordlist = ['p1','p2','p3','p4','p5','p6','p7']

And the dataframe is like following.

df = pd.DataFrame({'id' : [1,2,3,4],
                'path'  : ["p1,p2,p3,p4","p1,p2,p1","p1,p5,p5,p7","p1,p2,p3,p3"]})

output:

    id path

    1 p1,p2,p3,p4
    2 p1,p2,p1
    3 p1,p5,p5,p7
    4 p1,p2,p3,p3

I want to count the path data to get following output. Is it possible to get this kind of transformation?

id p1 p2 p3 p4 p5 p6 p7
1  1  1  1  1  0  0  0
2  2  1  0  0  0  0  0
3  1  0  0  0  2  0  1
4  1  1  2  0  0  0  0
Nilani Algiriyage
  • 22,706
  • 29
  • 77
  • 114
  • Are the words in `wordlist` really as simple as that, or can they be substrings of each other? Even if they can be substrings, the answers below are 95% what you want. – U2EF1 Dec 04 '13 at 08:39

3 Answers3

5

You can use the vectorized string method str.count() (see docs and reference), and that for each element in wordlist feed that to a new dataframe:

In [4]: pd.DataFrame({name : df["path"].str.count(name) for name in wordlist})
Out[4]:
    p1  p2  p3  p4  p5  p6  p7
id
1    1   1   1   1   0   0   0
2    2   1   0   0   0   0   0
3    1   0   0   0   2   0   1
4    1   1   2   0   0   0   0

UPDATE: some answers to the comments. Indeed this will not work if the strings can be substrings of each other (but the OP should clarify it then). If that is the case, this would work (and is also faster):

splitted = df["path"].str.split(",")
pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})

And some tests to back up my claim of being faster :-)
Off course, I don't know what the realistic use case is, but I made the dataframe a bit larger (just repeated it 1000 times, the differences are bigger then):

In [37]: %%timeit
   ....: splitted = df["path"].str.split(",")
   ....: pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name i
n wordlist})
   ....:
100 loops, best of 3: 17.9 ms per loop

In [38]: %%timeit
   ....: pd.DataFrame({name:df["path"].str.count(name) for name in wordlist})
   ....:
10 loops, best of 3: 23.6 ms per loop

In [39]: %%timeit
   ....: c = df["path"].str.split(',').apply(Counter)
   ....: pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
   ....:
10 loops, best of 3: 42.3 ms per loop

In [40]: %%timeit
   ....: dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
   ....: pd.DataFrame(dfN, columns=wordlist).fillna(0)
   ....:
1 loops, best of 3: 715 ms per loop

I also did the test with more elements in wordlist, and conclusion is: if you have a larger dataframe with relative smaller number of elements in wordlist my approach is faster, if you have a large wordlist the approach with Counter from @RomanPekar can be faster (but only the last one).

joris
  • 106,362
  • 32
  • 216
  • 184
  • 1
    This works in the example but may be a terrible idea in practice, as certain words will be double-counted (e.g. "at" in the word "cat"). Hopefully a revision of the question can address this. – U2EF1 Dec 04 '13 at 08:38
  • @joris : This can be also applied, but for my real problem I used Roman Pekar's method. Thanks :) – Nilani Algiriyage Dec 04 '13 at 08:47
  • @U2EF1 yep, this approach looks unefficient for me anyway, because it runs over the strings multiple times. That's why I suggest to split string first. – Roman Pekar Dec 04 '13 at 08:48
  • @RomanPekar note that my solution is faster, although it may seem unefficient to you. And personally I think it is simpler. – joris Dec 04 '13 at 08:50
  • @joris good to know, may be add some tests? :) I think it could be less efficient if number of words is large. It's not that I don't like the solution, just don't like an idea of iterating over strings over and over. – Roman Pekar Dec 04 '13 at 08:53
  • @joris I too like your solution simplicity, just don't know how it would perform in real situation. – Roman Pekar Dec 04 '13 at 08:54
  • About double counting, may be change it to `df["path"].str.split(',').count(name)`? – Roman Pekar Dec 04 '13 at 09:01
  • @RomanPekar What do you mean with your last comment? On a series, `count` is a method to count the number non-NA values. – joris Dec 04 '13 at 09:12
  • @joris yeah, mean `apply` with `count` but you've already added that. – Roman Pekar Dec 04 '13 at 09:13
5

I think this would be efficient

# create Series with dictionaries
>>> from collections import Counter
>>> c = df["path"].str.split(',').apply(Counter)
>>> c
0    {u'p2': 1, u'p3': 1, u'p1': 1, u'p4': 1}
1                        {u'p2': 1, u'p1': 2}
2              {u'p1': 1, u'p7': 1, u'p5': 2}
3              {u'p2': 1, u'p3': 2, u'p1': 1}

# create DataFrame
>>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
   p1  p2  p3  p4  p5  p6  p7
0   1   1   1   1   0   0   0
1   2   1   0   0   0   0   0
2   1   0   0   0   2   0   1
3   1   1   2   0   0   0   0

update

Another way to do this:

>>> dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
>>> pd.DataFrame(dfN, columns=wordlist).fillna(0)
   p1  p2  p3  p4  p5  p6  p7
0   1   1   1   1   0   0   0
1   2   1   0   0   0   0   0
2   1   0   0   0   2   0   1
3   1   1   2   0   0   0   0

update 2

Some rough tests for performance:

>>> dfL = pd.concat([df]*100)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
0.7363274283027295

>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
0.5305424618886718

# now let's make wordlist larger
>>> wordlist = wordlist + list(lowercase) + list(uppercase)

>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
1.765344003293876

>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
2.33328927599905

update 3

after reading this topic I've found that Counter is really slow. You can optimize it a bit by using defaultdict:

>>> def create_dict(x):
...     d = defaultdict(int)
...     for c in x:
...         d[c] += 1
...     return d
>>> c = df["path"].str.split(",").apply(create_dict)
>>> pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})
   p1  p2  p3  p4  p5  p6  p7
0   1   1   1   1   0   0   0
1   2   1   0   0   0   0   0
2   1   0   0   0   2   0   1
3   1   1   2   0   0   0   0

and some tests:

>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
0.45942801555111146

# now let's make wordlist larger
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
1.5798653213942089
Community
  • 1
  • 1
Roman Pekar
  • 92,153
  • 25
  • 168
  • 181
  • My problem was not as easy as the question I asked. That's why took bit of time to accept the answer. Thanks :) – Nilani Algiriyage Dec 04 '13 at 08:46
  • Note that if you don't have a larger wordlist (or a relative large dataframe compared to the wordlist), even your last solution is slower than mine. All I want to say is, depends on the exact real use case :-) And I like the simplicity of mine. – joris Dec 04 '13 at 15:00
  • @joris agreed about simplicity, and readability counts too. If I have time I'll try to think about simple solution with dictionaries, I think there're should be one :) – Roman Pekar Dec 04 '13 at 15:05
0

something similar to this:

df1 = pd.DataFrame([[path.count(p) for p in wordlist] for path in df['path']],columns=['p1','p2','p3','p4','p5','p6','p7'])
VIKASH JAISWAL
  • 798
  • 1
  • 4
  • 11