
I have a dataframe like this:

import pandas as pd

dat = [['ID1', '[1, 0, 1, 0, 0]'], ['ID2', '[0, 0, 1, 0, 0]'], ['ID3', '[1, 0, 1, 1, 0]']]
df = pd.DataFrame(dat, columns=['ID', 'Values'])
df

    ID    Values
0   ID1   [1, 0, 1, 0, 0]
1   ID2   [0, 0, 1, 0, 0]
2   ID3   [1, 0, 1, 1, 0]

I want to calculate the cumulative sum of each list and store it in a new column. After this, I want to take the row-wise dot product of the two lists:

import ast
import numpy as np

df['Values'] = df['Values'].apply(ast.literal_eval)  # convert each string into a list
df['Cumsum_Values'] = df['Values'].apply(np.cumsum)  # cumulative sum of each list
df['Dot_Values'] = df.apply(lambda x: np.dot(x.Values, x.Cumsum_Values), axis=1)  # row-wise dot product of Values and Cumsum_Values
df

    ID  Values            Cumsum_Values     Dot_Values
0   ID1 [1, 0, 1, 0, 0]   [1, 1, 2, 2, 2]   3
1   ID2 [0, 0, 1, 0, 0]   [0, 0, 1, 1, 1]   1
2   ID3 [1, 0, 1, 1, 0]   [1, 1, 2, 3, 3]   6

Using pandas `apply`, I am able to do the task. However, my records are in the millions, and some of these operations take a lot of time (>10 minutes).

Is there any alternative to make this processing faster?

Hardik Gupta
  • Maybe [this](https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06) can help. – John Doe Nov 14 '19 at 06:11
  • @JohnDoe I believe he knows that vectorization and built-in functionality are faster than `apply`, but that is exactly what he is asking: is it possible to do that here? – Aryerez Nov 14 '19 at 06:16

2 Answers


First, I think working with lists in pandas is not a good idea.

One idea is to work with numpy arrays. If each list has the same length, then instead of a per-row dot product you can multiply element-wise and sum along each row:

import json
import numpy as np

arr = np.array([json.loads(x) for x in df['Values']])  # parse all strings into one 2-D array
out = np.sum(arr * np.cumsum(arr, axis=1), axis=1)     # row-wise dot of each list with its cumsum
print (out)
[3 1 6]

df['Dot_Values'] = out
print (df)
    ID           Values  Dot_Values
0  ID1  [1, 0, 1, 0, 0]           3
1  ID2  [0, 0, 1, 0, 0]           1
2  ID3  [1, 0, 1, 1, 0]           6
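
To confirm the vectorised result matches the row-wise `apply` version from the question, a quick sanity check can help (a minimal sketch; it assumes `df['Values']` still holds the raw strings, as above):

import ast
import numpy as np

# Recompute the dot products row by row, as in the question,
# and check they agree with the vectorised output.
expected = [int(np.dot(v, np.cumsum(v)))
            for v in (ast.literal_eval(s) for s in df['Values'])]
assert list(df['Dot_Values']) == expected  # [3, 1, 6]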
jezrael

Maybe you can consider using a list comprehension rather than `apply`:

import ast
import numpy as np
df['Cumsum_Values'] = [np.cumsum(ast.literal_eval(i)) for i in df['Values']]
df['dot'] = [np.dot(ast.literal_eval(a), b) for a, b in zip(df['Values'], df['Cumsum_Values'])]

    ID           Values    Cumsum_Values  dot
0  ID1  [1, 0, 1, 0, 0]  [1, 1, 2, 2, 2]    3
1  ID2  [0, 0, 1, 0, 0]  [0, 0, 1, 1, 1]    1
2  ID3  [1, 0, 1, 1, 0]  [1, 1, 2, 3, 3]    6
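
Note that this parses every string twice, once in each comprehension. A variant of the same idea (just a sketch) that calls `ast.literal_eval` only once per row:

import ast
import numpy as np

parsed = [ast.literal_eval(s) for s in df['Values']]  # parse each string once
df['Cumsum_Values'] = [np.cumsum(v) for v in parsed]
df['dot'] = [np.dot(v, c) for v, c in zip(parsed, df['Cumsum_Values'])]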

Comparison:

(benchmark screenshot comparing the approaches; timing numbers are quoted in the comments below)

anky
  • Can you compare the vectorized numpy solution? – jezrael Nov 14 '19 at 06:28
  • @jezrael I am off my computer now. You can add that in your solution :) though I think there is a column missing in the solution – anky Nov 14 '19 at 06:33
  • I think it would be a better idea to start the timing only *after* the creation of `df`, as it's not relevant to the comparison and only adds noise to it. – Aryerez Nov 14 '19 at 06:37
  • @Aryerez agreed :) I added it because `ast.literal_eval` would not work with timer loops if it didn't get fresh data. I will check the way of doing that when I am back. Can't test on mobile :) – anky Nov 14 '19 at 06:41
  • @anky_91 I got that error too. Weird. Anyway, on my computer the asker's method was **1.47ms**, yours was **1.09ms**, and jezrael's **626µs**. – Aryerez Nov 14 '19 at 06:50
  • The conversion using `ast.literal_eval` is taking a lot of the time in the first line – Hardik Gupta Nov 14 '19 at 07:07
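
Following up on that last comment: jezrael's answer sidesteps part of the parsing cost by using `json.loads`, which is usually noticeably faster than `ast.literal_eval` for simple lists of integers like these. A small sketch (standard library only) to measure the difference on your own machine:

import ast
import json
import timeit

s = '[1, 0, 1, 0, 0]'
print(timeit.timeit(lambda: ast.literal_eval(s), number=100_000))  # slower: full Python-AST parse
print(timeit.timeit(lambda: json.loads(s), number=100_000))        # faster: C-based JSON parser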