5

I need to initialize the cells in a column of a DataFrame to lists.

df['some_col'] = [[] for _ in range(no_of_rows)]

Is there a better way to do this in terms of time efficiency?

Stefan
daiyue
  • You have accepted an answer that offers a solution 3x slower than your starting point. – Stefan May 24 '16 at 14:08
  • @Stefan it seems that you are correct, as `apply(list)` is indeed slightly slower than my old code. – daiyue May 24 '16 at 14:15
  • As you can see below, you can get a tiny bit faster using `itertools`, but I think your code is already quite good: I don't see a faster way to add the column than the standard assignment. Perhaps someone will come up with some magic. – Stefan May 24 '16 at 14:18

2 Answers

5

Since you are asking about time efficiency, here are some benchmarks. The list comprehension is already quite fast at creating the list of empty lists, but you can squeeze out a marginal improvement with itertools.repeat. For the assignment step, apply is about 3x slower because it loops over the rows and calls list on each one:

import numpy as np
import pandas as pd
from itertools import repeat
df = pd.DataFrame({"A":np.arange(100000)})

%timeit df['some_col'] = [[] for _ in range(len(df))]
100 loops, best of 3: 8.75 ms per loop

%timeit df['some_col'] = [[] for i in repeat(None, len(df))]
100 loops, best of 3: 8.02 ms per loop

%%timeit 
df['some_col'] = ''
df['some_col'] = df['some_col'].apply(list)
10 loops, best of 3: 25 ms per loop
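
Outside IPython, the same comparison can be sketched with the standard `timeit` module. This is only a rough harness, not the measurements above: the function names (`with_range`, `with_repeat`, `with_apply`) and the `results` dict are made up for illustration, and the numbers will vary with the machine and the pandas version.

```python
import timeit
from itertools import repeat

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(100000)})


def with_range():
    # Plain list comprehension over range.
    df["some_col"] = [[] for _ in range(len(df))]


def with_repeat():
    # Same comprehension, driven by itertools.repeat.
    df["some_col"] = [[] for _ in repeat(None, len(df))]


def with_apply():
    # Fill with empty strings, then turn each one into a list.
    df["some_col"] = ""
    df["some_col"] = df["some_col"].apply(list)


results = {}
for func in (with_range, with_repeat, with_apply):
    # Best of 3 runs of 5 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    results[func.__name__] = best * 1000
    print(f"{func.__name__}: {results[func.__name__]:.2f} ms per call")
```

Taking the minimum over the repeats mirrors the "best of 3" that `%timeit` reports.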
Stefan
4

Try apply:

df1['some_col'] = ''
df1['some_col'] = df1['some_col'].apply(list)

Sample:

df1 = pd.DataFrame({'a': pd.Series([1,2])})
print (df1)
   a
0  1
1  2

df1['some_col'] = ''
df1['some_col'] = df1['some_col'].apply(list)
print (df1)
   a some_col
0  1       []
1  2       []
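
As a quick sanity check, the sketch below confirms that each row ends up with its own list, so mutating one cell leaves the others alone. The `[[]] * n` variant is not from this thread; it is included only as a counterexample, because list multiplication repeats a reference to a single list. The name `shared` is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"a": pd.Series([1, 2])})

# Approach from this answer: list('') is evaluated once per row,
# so every cell holds a separate empty list.
df1["some_col"] = ""
df1["some_col"] = df1["some_col"].apply(list)

# Mutate the list in the first row only.
df1.at[0, "some_col"].append("x")
print(df1["some_col"].tolist())  # [['x'], []]

# Counterexample: [[]] * n stores the SAME list object in every row.
shared = pd.DataFrame({"a": [1, 2]})
shared["some_col"] = [[]] * len(shared)
shared.at[0, "some_col"].append("x")
print(shared["some_col"].tolist())  # [['x'], ['x']]
```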
jezrael
  • How is this better in terms of time efficiency? – Stefan May 24 '16 at 14:11
  • Hmm, I don't think it is better in terms of time efficiency. But it is up to the OP which answer to mark as accepted. Maybe he prefers mine because I was first, maybe because he likes it. But maybe in a few seconds he will change his mind. I don't know. – jezrael May 24 '16 at 14:14
  • Also note that `lambda _: []` will be faster than `list`. – hilberts_drinking_problem May 24 '16 at 14:17
  • Just asking because the question was about time efficiency, so it's a good thing if the answer tries to do so as well. – Stefan May 24 '16 at 14:17
  • @Stefan And now maybe your solution will be accepted. – jezrael May 24 '16 at 14:18
  • @Stefan what about `xrange`? `%timeit df['some_col'] = [[] for _ in xrange(len(df))]`. Unfortunately I cannot test it, because it only works in Python 2. – jezrael May 24 '16 at 14:42
  • `xrange` has been renamed to `range` in python3 - http://stackoverflow.com/questions/15014310/why-is-there-no-xrange-function-in-python3 and GvR: https://docs.python.org/3/whatsnew/3.0.html#views-and-iterators-instead-of-lists – Stefan May 24 '16 at 19:52