4

I have the following pandas DataFrame:

import pandas as pd

df = pd.DataFrame([[1, 100, 200, float('nan')],
                   [1, 110, float('nan'), float('nan')],
                   [1, 115, float('nan'), float('nan')],
                   [2, 50, 100, float('nan')],
                   [2, 50, float('nan'), float('nan')],
                   [3, 100, 100, float('nan')],
                   [3, 125, float('nan'), float('nan')]],
                  columns=['User', 'Amount', 'FAB', 'Result'])

╔══════╦════════╦══════╦════════╗
║ User ║ Amount ║ FAB  ║ Result ║
╠══════╬════════╬══════╬════════╣
║    1 ║    100 ║ 200  ║ NaN    ║
║    1 ║    110 ║ NaN  ║ NaN    ║
║    1 ║    115 ║ NaN  ║ NaN    ║
║    2 ║     50 ║ 100  ║ NaN    ║
║    2 ║     50 ║ NaN  ║ NaN    ║
║    3 ║    100 ║ 100  ║ NaN    ║
║    3 ║    125 ║ NaN  ║ NaN    ║
╚══════╩════════╩══════╩════════╝

I would need to apply some math operations to fill FAB and Result:

For example, to fill the first row of the Result column I'd have to do:

df['Result'].iloc[0] = df['FAB'].iloc[0]*1.025 + df['Amount'].iloc[0]

This equals 305, so we put that value into the first row of Result and into the second row of FAB:

╔══════╦════════╦══════╦════════╗
║ User ║ Amount ║ FAB  ║ Result ║
╠══════╬════════╬══════╬════════╣
║    1 ║    100 ║ 200  ║ 305    ║
║    1 ║    110 ║ 305  ║ NaN    ║
║    1 ║    115 ║ NaN  ║ NaN    ║
║    2 ║     50 ║ 100  ║ NaN    ║
║    2 ║     50 ║ NaN  ║ NaN    ║
║    3 ║    100 ║ 100  ║ NaN    ║
║    3 ║    125 ║ NaN  ║ NaN    ║
╚══════╩════════╩══════╩════════╝

The same logic for the next row:

df['Result'].iloc[1] = df['FAB'].iloc[1]*1.025 + df['Amount'].iloc[1]

This equals 422.625, which again goes into Result and into the next row's FAB:

╔══════╦════════╦═════════╦═════════╗
║ User ║ Amount ║  FAB    ║ Result  ║
╠══════╬════════╬═════════╬═════════╣
║    1 ║    100 ║ 200     ║ 305     ║
║    1 ║    110 ║ 305     ║ 422.625 ║
║    1 ║    115 ║ 422.625 ║ NaN     ║
║    2 ║     50 ║ 100     ║ NaN     ║
║    2 ║     50 ║ NaN     ║ NaN     ║
║    3 ║    100 ║ 100     ║ NaN     ║
║    3 ║    125 ║ NaN     ║ NaN     ║
╚══════╩════════╩═════════╩═════════╝

So, after applying the same logic for all users:

╔══════╦════════╦═════════╦════════════╗
║ User ║ Amount ║  FAB    ║   Result   ║
╠══════╬════════╬═════════╬════════════╣
║    1 ║    100 ║ 200     ║ 305        ║
║    1 ║    110 ║ 305     ║ 422.625    ║
║    1 ║    115 ║ 422.625 ║ 548.190625 ║
║    2 ║     50 ║ 100     ║ 152.5      ║
║    2 ║     50 ║ 152.5   ║ 206.3125   ║
║    3 ║    100 ║ 100     ║ 202.5      ║
║    3 ║    125 ║ 202.5   ║ 332.5625   ║
╚══════╩════════╩═════════╩════════════╝
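
For reference, the straightforward row-by-row approach I have in mind (roughly sketched below, and exactly the kind of loop I'd like to avoid) is:

# rough sketch of the per-user recurrence I want to avoid
for user, idx in df.groupby('User').groups.items():
    prev = None
    for cur in idx:
        if prev is not None and pd.isnull(df.at[cur, 'FAB']):
            df.at[cur, 'FAB'] = df.at[prev, 'Result']   # carry the previous Result into FAB
        df.at[cur, 'Result'] = df.at[cur, 'FAB'] * 1.025 + df.at[cur, 'Amount']
        prev = cur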

Is there a way to achieve this without a for loop?

Snedecor
  • I think the answer is no; this kind of iterative calculation, which involves a multiplication and an addition using the result of the previous iteration, is not easily vectorized. But if you currently use iloc in the `for` loop the way you describe it in the question, I think the loop can be optimized :) – Ben.T Jul 30 '20 at 00:05
  • In the general case `FAB(n)` depends on `Result(n-1)`, which depends on `FAB(n-1)`. The `FAB(0)` → `FAB(1)` → `FAB(2)` → ... chain is iterative and in the worst case O(N). However: a) you can perform the computation in parallel in chunks per user, e.g. `FAB: 200, nan, nan, 100, nan, ...` can be split into 2 chunks `200, nan, nan` and `100, nan`; and b) if the math were different, you could simplify, e.g. if `FAB(n) = C + FAB(n-1)/10` then the contribution of `FAB(n-6)` would be insignificant and could be ignored. – Dima Tisnek Jul 30 '20 at 00:33

2 Answers

3

The logic is listed below. I think a for loop is a reasonable way to do this, since the logic gets quite involved, but it can be unrolled and vectorized:

C1 = B1*1.025 + A1
C2 = C1*1.025 + A2 = (B1*1.025 + A1)*1.025 + A2
C3 = C2*1.025 + A3 = (B1*1.025 + A1)*1.025*1.025 + A2*1.025 + A3
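
With the first user's numbers this unrolling gives C1 = 200*1.025 + 100 = 305, C2 = 305*1.025 + 110 = 422.625 and C3 = 422.625*1.025 + 115 = 548.190625, matching the expected Result column. The same closed form can be built with a matrix of powers:
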
import numpy as np

s = df.User.values
s = np.tril((s == s[:, None]).cumsum(axis=1))   # per-row cumulative count of same-user columns, lower triangle only
idx = np.where(s == 0)                          # positions outside each user's block, to zero out later
s = (s.max(1)[:, None] - s)                     # reverse the counter so the current row's column gets power 0
s = 1.025 ** s                                  # matrix of powers of 1.025
s[idx] = 0
g = df.groupby('User')
# geometric weights applied to Amount, plus each user's first FAB (forward-filled) times its own power
s = s @ df.Amount + 1.025 ** (g.cumcount().add(1)) * df.FAB.ffill()
df['New'] = s
s
Out[173]: 
0    305.000000
1    422.625000
2    548.190625
3    152.500000
4    206.312500
5    202.500000
6    332.562500
dtype: float64

Explanation row by row

From C3 = (B1*1.025 + A1)*1.025*1.025 + A2*1.025 + A3 we can see that, for each user, the value is the sum of two geometric sequences.

s = df.User.values
s = np.tril((s == s[:, None]).cumsum(axis=1))
s
Out[183]: 
array([[1, 0, 0, 0, 0, 0, 0],  # row one only includes A1
       [1, 2, 0, 0, 0, 0, 0],  # row two includes A1 and A2, with different powers of 1.025
       [1, 2, 3, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 2, 0, 0],
       [0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 2]], dtype=int32)

From the above we find that Cn = B1*1.025**n + A1*1.025**(n-1) + A2*1.025**(n-2) + ... + An.

But the counter we get from the cumsum runs 1..n, while the powers we need run n-1..0, so we need to reverse it:

s = (s.max(1)[:, None] - s)
s
Out[185]: 
array([[0, 1, 1, 1, 1, 1, 1],
       [1, 0, 2, 2, 2, 2, 2],
       [2, 1, 0, 3, 3, 3, 3],
       [1, 1, 1, 0, 1, 1, 1],
       [2, 2, 2, 1, 0, 2, 2],
       [1, 1, 1, 1, 1, 0, 1],
       [2, 2, 2, 2, 2, 1, 0]], dtype=int32)

After this we also need to keep the positions that were originally 0 at 0:

s = 1.025 ** s
s[idx] = 0
s
Out[187]: 
array([[1.      , 0.      , 0.      , 0.      , 0.      , 0.      ,
        0.      ],
       [1.025   , 1.      , 0.      , 0.      , 0.      , 0.      ,
        0.      ],
       [1.050625, 1.025   , 1.      , 0.      , 0.      , 0.      ,
        0.      ],
       [0.      , 0.      , 0.      , 1.      , 0.      , 0.      ,
        0.      ],
       [0.      , 0.      , 0.      , 1.025   , 1.      , 0.      ,
        0.      ],
       [0.      , 0.      , 0.      , 0.      , 0.      , 1.      ,
        0.      ],
       [0.      , 0.      , 0.      , 0.      , 0.      , 1.025   ,
        1.      ]])

Now we have the matrix of powers, and we just need to apply the calculation above:

Cn = B1*1.025**n + A1*1.025**(n-1) + A2*1.025**(n-2) + ... + An
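
If you also want the FAB column filled in as in the question's target table, a small follow-up sketch (assuming each user's first FAB is always given) is to shift the computed Result within each user group:

df['Result'] = s
df['FAB'] = df['FAB'].fillna(df.groupby('User')['Result'].shift())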
BENY
  • Interesting! How many consecutive `NaN`s can it handle? Perhaps one step (over the entire matrix) can be done over and over until no `NaN`s remain? – Dima Tisnek Jul 30 '20 at 02:50
  • @DimaTisnek It should handle any number of NaNs, until the matrix exceeds memory. – BENY Jul 30 '20 at 02:55
  • Oh my, looks like I don't understand the code after all... Would you be so kind as to comment what you are doing and why, @YOBEN_S ? – Dima Tisnek Jul 30 '20 at 02:59
1

I believe this is covered in the docs under Group by: split-apply-combine:

https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

Assuming only the final Result per user is of interest, consider groupby and aggregate:

def last_result(data, **_):
    # re-index the group 0..len-1 so .loc[i-1] addresses the previous row of this group
    data = data.reset_index(drop=True)
    for i in range(len(data)):
        if pd.isnull(data.loc[i, "FAB"]):
            data.loc[i, "FAB"] = data.loc[i - 1, "Result"]
        if pd.isnull(data.loc[i, "Result"]):
            data.loc[i, "Result"] = data.loc[i, "FAB"] * 1.025 + data.loc[i, "Amount"]
    return data.iloc[-1]

df.groupby("User").aggregate(last_result, engine="numba")

      Amount      FAB      Result
User
1.0    115.0  422.625  548.190625
2.0     50.0  152.500  206.312500
3.0    125.0  202.500  332.562500

Caveats:

  1. Custom aggregate function receives slightly different arguments depending on engine=... setting
  2. The group data is a slice from df and maintains original index values. I couldn't figure out how to get the group's selected index values, so I worked around it with reset_index and indexing [0..len(data))

In summary: processing of a single user is still iterative, but multiple users may be processed in parallel.
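
A variation of the same idea, if every row is needed rather than only the last one per user: a rough sketch (the helper name fill_group is just for illustration, and it assumes each user's first FAB is given) that returns the whole group from apply instead of aggregating:

def fill_group(data):
    # rough sketch: copy the group slice, then fill FAB/Result positionally, row by row
    data = data.copy()
    fab, res = data.columns.get_loc("FAB"), data.columns.get_loc("Result")
    for i in range(len(data)):
        if pd.isnull(data["FAB"].iloc[i]):
            data.iloc[i, fab] = data["Result"].iloc[i - 1]   # carry the previous Result forward
        data.iloc[i, res] = data["FAB"].iloc[i] * 1.025 + data["Amount"].iloc[i]
    return data

df.groupby("User", group_keys=False).apply(fill_group)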

Dima Tisnek