I think this approach should be fairly efficient:
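For context, the snippets below assume a DataFrame `df` with a comma-separated "path" column and a `wordlist` of the words to count; both were defined earlier, so this is just a reconstruction inferred from the output shown below:

```python
import pandas as pd

# Hypothetical sample data, inferred from the Counter output shown below
df = pd.DataFrame({"path": [
    "p1,p2,p3,p4",   # row 0: each word once
    "p1,p2,p1",      # row 1: p1 twice
    "p5,p7,p1,p5",   # row 2: p5 twice
    "p1,p2,p3,p3",   # row 3: p3 twice
]})
wordlist = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
```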
# create Series with dictionaries
>>> from collections import Counter
>>> c = df["path"].str.split(',').apply(Counter)
>>> c
0 {u'p2': 1, u'p3': 1, u'p1': 1, u'p4': 1}
1 {u'p2': 1, u'p1': 2}
2 {u'p1': 1, u'p7': 1, u'p5': 2}
3 {u'p2': 1, u'p3': 2, u'p1': 1}
# create DataFrame
>>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
Update
Another way to do this:
>>> dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
>>> pd.DataFrame(dfN, columns=wordlist).fillna(0)
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
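A vectorized alternative (not in the original answer, just a sketch) avoids building per-row dicts entirely by stacking the split tokens and counting them per row:

```python
import pandas as pd

# Hypothetical sample data matching the outputs above
df = pd.DataFrame({"path": ["p1,p2,p3,p4", "p1,p2,p1", "p5,p7,p1,p5", "p1,p2,p3,p3"]})
wordlist = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]

# split into one token per cell, stack into a long Series,
# count tokens within each original row, then align columns to wordlist
counts = (df["path"].str.split(",", expand=True)
          .stack()
          .groupby(level=0)
          .value_counts()
          .unstack(fill_value=0)
          .reindex(columns=wordlist, fill_value=0))
```

The `reindex` step fills in words that never occur (like p6 here) with zeros, matching the output of the dict-based versions.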
Update 2
Some rough tests for performance:
>>> dfL = pd.concat([df]*100)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
0.7363274283027295
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
0.5305424618886718
# now let's make wordlist larger
>>> from string import lowercase, uppercase  # Python 2 string module
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
1.765344003293876
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
2.33328927599905
Update 3

After reading this topic, I found that Counter is really slow. You can optimize it a bit by using defaultdict:
>>> from collections import defaultdict
>>> def create_dict(x):
...     d = defaultdict(int)
...     for c in x:
...         d[c] += 1
...     return d
>>> c = df["path"].str.split(",").apply(create_dict)
>>> pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
and some tests:
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
0.45942801555111146
# now let's make wordlist larger
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
1.5798653213942089