I have a DataFrame like this:
import ast
import numpy as np
import pandas as pd

dat = [['ID1', '[1, 0, 1, 0, 0]'], ['ID2', '[0, 0, 1, 0, 0]'], ['ID3', '[1, 0, 1, 1, 0]']]
df = pd.DataFrame(dat, columns=['ID', 'Values'])
df
ID Values
0 ID1 [1, 0, 1, 0, 0]
1 ID2 [0, 0, 1, 0, 0]
2 ID3 [1, 0, 1, 1, 0]
I want to calculate the cumulative sum of each list and store it in a new column. After that, I want to take the dot product of the two lists in each row.
df['Values'] = df['Values'].apply(lambda x: ast.literal_eval(x))  # Convert string into list type
df['Cumsum_Values'] = df['Values'].apply(lambda x: np.cumsum(x))  # Cumulative sum of each list
df['Dot_Values'] = df.apply(lambda x: np.dot(x.Values, x.Cumsum_Values), axis=1)  # Dot product of columns Values and Cumsum_Values
df
ID Values Cumsum_Values Dot_Values
0 ID1 [1, 0, 1, 0, 0] [1, 1, 2, 2, 2] 3
1 ID2 [0, 0, 1, 0, 0] [0, 0, 1, 1, 1] 1
2 ID3 [1, 0, 1, 1, 0] [1, 1, 2, 3, 3] 6
By using pandas apply, I am able to do the task. However, my records number in the millions, and some of these operations are taking a lot of time (>10 mins).
Is there any alternative to make this processing faster?
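One direction I have been considering (not sure whether it is the right approach) is to parse the lists once and stack them into a single 2D NumPy array, so that the cumulative sum and the dot product run as vectorized array operations instead of per-row Python calls. A minimal sketch, assuming all lists have the same length:

```python
import ast
import numpy as np
import pandas as pd

dat = [['ID1', '[1, 0, 1, 0, 0]'],
       ['ID2', '[0, 0, 1, 0, 0]'],
       ['ID3', '[1, 0, 1, 1, 0]']]
df = pd.DataFrame(dat, columns=['ID', 'Values'])

# Parse each string once, then stack the equal-length lists into one 2D array
vals = np.array(df['Values'].map(ast.literal_eval).tolist())  # shape (n_rows, 5)

# Cumulative sum along each row, computed in one vectorized call
cums = np.cumsum(vals, axis=1)

# Row-wise dot product: elementwise multiply, then sum across each row
df['Dot_Values'] = (vals * cums).sum(axis=1)

print(df['Dot_Values'].tolist())  # [3, 1, 6]
```

This keeps the whole computation inside NumPy, which should scale much better than row-wise apply, though the one-time ast.literal_eval parsing of millions of strings will still dominate if the column arrives as text.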