pd.df find rows pairwise using groupby and change bogus values

Question

My pd.DataFrame looks like this example but has about 10mio rows, hence I am looking for an efficient solution.

import pandas as pd                                                                                                                   

df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'], 
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],  
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],  
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],  
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],  
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True) 
df.index = pd.to_datetime(df.index) 
df.opt_expiry = pd.to_datetime(df.opt_expiry) 

Out[2]: 
           opt_expiry  strike type  sigma  delta
timestamp                                       
2004-09-06 2005-12-16     2.0    c  0.250   0.70
2004-09-06 2005-12-16     2.0    p  0.250  -0.30
2004-09-06 2005-12-16     2.5    c  0.001   1.00
2004-09-06 2005-12-16     2.5    p  0.170  -0.25
2004-09-07 2005-06-17     1.5    c  0.195   0.60
2004-09-07 2005-06-17     1.5    p  0.190  -0.40

here is what I am looking to achieve:

1) find the pairs with identical timestamp, opt_expiry and strike:

groups = df.groupby(['timestamp','opt_expiry','strike'])

2) for each group check if the sum of the absolute delta equals 1. If true find the maximum of the two sigma values and assign that to both rows as the new, correct sigma. pseudo code:

for group in groups:
    # if sum of absolute deltas != 1
    if (abs(group.delta[0]) + abs(group.delta[1])) != 1:
        correct_sigma = group.sigma.max()
        group.sigma = correct_sigma

Expected output:

           opt_expiry  strike type  sigma  delta
timestamp                                       
2004-09-06 2005-12-16     2.0    c  0.250   0.70
2004-09-06 2005-12-16     2.0    p  0.250  -0.30
2004-09-06 2005-12-16     2.5    c  0.170   1.00
2004-09-06 2005-12-16     2.5    p  0.170  -0.25
2004-09-07 2005-06-17     1.5    c  0.195   0.60
2004-09-07 2005-06-17     1.5    p  0.190  -0.40

[JFYI](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). I think a small input(required columns + a few records) and expected `df` is better to understanding. — Danila Ganchar, Apr 12 '20 at 11:54
@steff I made an attempt. Not sure I got everything right. If I didnt let me know. Willing to recode — wwnde, Apr 12 '20 at 12:02
@DanilaGanchar done. edited the question with a re-producible df — steff, Apr 13 '20 at 00:31

wwnde · Answer 1 · 2020-04-13T03:36:51.620

1

Revised answer. I believe there could be a shorter answer out there. Maybe put it up as bounty Data

df = pd.DataFrame({'timestamp':['2004-09-06', '2004-09-06', '2004-09-06', '2004-09-06', '2004-09-07', '2004-09-07'], 
                   'opt_expiry': ['2005-12-16', '2005-12-16', '2005-12-16', '2005-12-16', '2005-06-17', '2005-06-17'],  
                   'strike': [2, 2, 2.5, 2.5, 1.5, 1.5],  
                   'type': ['c', 'p', 'c', 'p', 'c', 'p'],  
                   'sigma': [0.25, 0.25, 0.001, 0.17, 0.195, 0.19],  
                   'delta': [0.7, -0.3, 1, -0.25, 0.6, -0.4]}).set_index('timestamp', drop=True) 
df

Working

Absolute delta sum for each groupfor each row

df['absdelta']=df['delta'].abs()

Absolute delta sum and maximum sigma for each group in a new dataframe df2

df2=df.groupby(['timestamp','opt_expiry','strike']).agg({'absdelta':'sum','sigma':'max'})#.reset_index().drop(columns=['timestamp','opt_expiry'])
df2

Merge df2 with df

df3=df.merge(df2, left_on='strike', right_on='strike',
          suffixes=('', '_right'))
df3

mask groups with sum absolute delta not equal to 1

m=df3['absdelta_right']!=1
m

Using mask, apply maximum sigma to entities in groups masked above

df3.loc[m,'sigma']=df3.loc[m,'sigma_right']

Slice to return to original dataframe

df3.iloc[:,:-4]

Output

edited Apr 13 '20 at 03:36

answered Apr 12 '20 at 11:52

wwnde

14,189
2
8
21

Thanks. Not quite what I needed though. For every pair I need to replace the smaller sigma with the higher value. Your function assigns the highest sigma of the entire df. For example `df.iloc[-1].sigma` which is 0.152608. needs to be assigned to `df.iloc[-1].sigma` which is 0.001. sorry if I wasn't clear before... – steff Apr 12 '20 at 12:22
What do yo mean with the higher value? What/which is the higher value? We can find away round it – wwnde Apr 12 '20 at 12:26
every group we finds has type c and p. If the absolute delta doesn't add to 1 then one of the sigma values is wrong and needs to be replaced. The value that is wrong is typically the the lower one. Hence I want to assign the higher value for both. hope this clarifies. – steff Apr 12 '20 at 12:33
See my edits. I believe then that any that is ==1 maintains its sigma? – wwnde Apr 12 '20 at 12:48
thanks. will need to look at this tomorrow. will revert asap. – steff Apr 12 '20 at 13:08
edited question with reproducible df. sorry should have done that from the start – steff Apr 13 '20 at 00:02
Look at your rule again, it does not explain why third row sigma changes. If considered each group[0] going to group[1] have their absolute sum >=1. Would love to assist but need some clarity. In simple words, what determined why the third row changed? – wwnde Apr 13 '20 at 01:04
each resulting group has two rows. for each group I want to find the max value of sigma and then assign that to both rows within that group – steff Apr 13 '20 at 01:10
Looks to me that if the absolute value of delta is ==1,sigma of the value with absolute value delta tha is <=1 is taken. Is that correct? If not enlarge your sample dataset – wwnde Apr 13 '20 at 01:10
"each resulting group has two rows. for each group I want to find the max value of sigma and then assign that to both rows within that group" Why doesnt that happen to the last group in the last two rows? – wwnde Apr 13 '20 at 01:12
cos their absolute delta adds up to 1 – steff Apr 13 '20 at 01:13
Ok, see my latest attempt. Went the long way to ensure nothing was missed – wwnde Apr 13 '20 at 02:39
on the small sample df it works. on the big df it doesn't. – steff Apr 13 '20 at 03:57
Interesting, is the issue logic or code too slow?Happy to assist further. Where does it fail? – wwnde Apr 13 '20 at 04:01
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/211523/discussion-between-wwnde-and-steff). – wwnde Apr 13 '20 at 04:01

pd.df find rows pairwise using groupby and change bogus values

1 Answers1