
I have a dataframe like this:

import pandas as pd

dat = [['ID1', '[1, 0, 1, 0, 0]'], ['ID2', '[0, 0, 1, 0, 0]'], ['ID3', '[1, 0, 1, 1, 0]']]
df = pd.DataFrame(dat, columns=['ID', 'Values'])
df

    ID    Values
0   ID1   [1, 0, 1, 0, 0]
1   ID2   [0, 0, 1, 0, 0]
2   ID3   [1, 0, 1, 1, 0]

I want to calculate the cumulative sum of each list and store it in a new column. After this, I want to take the row-wise dot product of the two lists:

import ast
import numpy as np

df['Values'] = df['Values'].apply(ast.literal_eval)  # convert each string into a list
df['Cumsum_Values'] = df['Values'].apply(np.cumsum)  # cumulative sum of each list
df['Dot_Values'] = df.apply(lambda x: np.dot(x.Values, x.Cumsum_Values), axis=1)  # row-wise dot product of Values and Cumsum_Values
df

    ID  Values            Cumsum_Values     Dot_Values
0   ID1 [1, 0, 1, 0, 0]   [1, 1, 2, 2, 2]   3
1   ID2 [0, 0, 1, 0, 0]   [0, 0, 1, 1, 1]   1
2   ID3 [1, 0, 1, 1, 0]   [1, 1, 2, 3, 3]   6

Using pandas `apply`, I am able to do the task. However, my records are in the millions, and some of these operations take a lot of time (>10 minutes).

Is there any alternative to make this processing faster?

Hardik Gupta
  • Maybe [this](https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06) can help. – John Doe Nov 14 '19 at 06:11
  • @JohnDoe I believe he knows that vectorization and built-in functionality are faster than `apply`, but that is exactly what he is asking: is it possible to do that here? – Aryerez Nov 14 '19 at 06:16

2 Answers


First, I think working with lists in pandas is not a good idea.

One idea is to work with numpy arrays. If each list has the same length, then instead of a per-row dot product you can multiply element-wise and sum along each row:

import json
import numpy as np

arr = np.array([json.loads(x) for x in df['Values']])  # parse all strings into one 2-D array
out = np.sum(arr * np.cumsum(arr, axis=1), axis=1)     # row-wise dot of each list with its cumsum
print (out)
[3 1 6]

df['Dot_Values'] = out
print (df)
    ID           Values  Dot_Values
0  ID1  [1, 0, 1, 0, 0]           3
1  ID2  [0, 0, 1, 0, 0]           1
2  ID3  [1, 0, 1, 1, 0]           6
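
To confirm the vectorised result matches the row-wise `apply` version from the question, a quick sanity check can help (a minimal sketch; it assumes `df['Values']` still holds the raw strings, as above):

import ast
import numpy as np

# Recompute the dot products row by row, as in the question,
# and check they agree with the vectorised output.
expected = [int(np.dot(v, np.cumsum(v)))
            for v in (ast.literal_eval(s) for s in df['Values'])]
assert list(df['Dot_Values']) == expected  # [3, 1, 6]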
jezrael

Maybe you can consider using a list comprehension rather than `apply`:

import ast
import numpy as np
df['Cumsum_Values'] = [np.cumsum(ast.literal_eval(i)) for i in df['Values']]
df['dot'] = [np.dot(ast.literal_eval(a), b) for a, b in zip(df['Values'], df['Cumsum_Values'])]

    ID           Values    Cumsum_Values  dot
0  ID1  [1, 0, 1, 0, 0]  [1, 1, 2, 2, 2]    3
1  ID2  [0, 0, 1, 0, 0]  [0, 0, 1, 1, 1]    1
2  ID3  [1, 0, 1, 1, 0]  [1, 1, 2, 3, 3]    6
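
Note that this parses every string twice, once in each comprehension. A variant of the same idea (just a sketch) that calls `ast.literal_eval` only once per row:

import ast
import numpy as np

parsed = [ast.literal_eval(s) for s in df['Values']]  # parse each string once
df['Cumsum_Values'] = [np.cumsum(v) for v in parsed]
df['dot'] = [np.dot(v, c) for v, c in zip(parsed, df['Cumsum_Values'])]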

Comparison:

(benchmark screenshot comparing the approaches; timing numbers are quoted in the comments below)

anky
  • Can you compare the vectorized numpy solution? – jezrael Nov 14 '19 at 06:28
  • @jezrael I am off my computer now. You can add that in your solution :) though I think there is a column missing in the solution – anky Nov 14 '19 at 06:33
  • I think it would be a better idea to start the timing only *after* the creation of `df`, as it's not relevant to the comparison and only adds noise to it. – Aryerez Nov 14 '19 at 06:37
  • @Aryerez agreed :) I added it because `ast.literal_eval` would not work with timer loops if it didn't get fresh data. I will check the way of doing that when I am back. Can't test on mobile :) – anky Nov 14 '19 at 06:41
  • @anky_91 I got that error too. Weird. Anyway, on my computer the asker's method was **1.47ms**, yours was **1.09ms**, and jezrael's **626µs**. – Aryerez Nov 14 '19 at 06:50
  • The conversion using `ast.literal_eval` is taking a lot of the time in the first line – Hardik Gupta Nov 14 '19 at 07:07
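
Following up on that last comment: jezrael's answer sidesteps part of the parsing cost by using `json.loads`, which is usually noticeably faster than `ast.literal_eval` for simple lists of integers like these. A small sketch (standard library only) to measure the difference on your own machine:

import ast
import json
import timeit

s = '[1, 0, 1, 0, 0]'
print(timeit.timeit(lambda: ast.literal_eval(s), number=100_000))  # slower: full Python-AST parse
print(timeit.timeit(lambda: json.loads(s), number=100_000))        # faster: C-based JSON parser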