
I have a dataset that is something like this:

g_id    event   time_left  home away
1       "TIP"   00:12:00   8    6
1       "SHOT"  00:11:48   8    6
1       "MISS"  00:11:20   8    6
1       "TOV"   00:11:15   8    6
1       "SHOT"  00:10:40   8    6
2       "REB"   00:11:48   7    3
2       "FOUL"  00:11:35   7    3
2       "FT"    00:11:33   7    3
2       "FT"    00:11:31   7    3
3       "TIP"   00:12:00   5    1
3       "MISS"  00:11:43   5    1
3       "REB"   00:11:42   5    1
3       "SHOT"  00:11:27   5    1
3       "TOV"   00:11:04   5    1 
4       "SHOT"  00:11:39   9    4
4       "MISS"  00:11:17   9    4
4       "REB"   00:11:16   9    4
4       "SHOT"  00:10:58   9    4

I noticed that my problem is somewhat similar to this one in MySQL, but I was wondering if this can be done in Pandas as well. As you may have noticed, the data is grouped by 'g_id' and some of the sequences start with 'TIP' and others don't. What I want to do is go group by group on 'g_id' and, if the group doesn't start with event = 'TIP', insert a row that contains 'TIP' in that column, '00:12:00' in the 'time_left' column, and carries over the 'home' and 'away' values from the group's first row. How can I do that? The real dataset has more columns, but I basically just need to know how to insert a new row where some column values are copied from the row it's inserted before and some are assigned new values.
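For reference, the sample data can be rebuilt with something like this (only the first two games shown; treating 'event' and 'time_left' as plain strings is an assumption about the real dtypes):

    import pandas as pd

    df = pd.DataFrame({
        'g_id':      [1, 1, 1, 1, 1, 2, 2, 2, 2],
        'event':     ['"TIP"', '"SHOT"', '"MISS"', '"TOV"', '"SHOT"',
                      '"REB"', '"FOUL"', '"FT"', '"FT"'],
        'time_left': ['00:12:00', '00:11:48', '00:11:20', '00:11:15', '00:10:40',
                      '00:11:48', '00:11:35', '00:11:33', '00:11:31'],
        'home':      [8, 8, 8, 8, 8, 7, 7, 7, 7],
        'away':      [6, 6, 6, 6, 6, 3, 3, 3, 3],
    })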

2 Answers


You can iterate over the groups and check whether the first event is TIP; if it isn't, shift and pd.concat let you add the new first row and re-append the last row:

# For each game that doesn't start with a tip: shift the rows down one slot,
# fill the new first row's event/time_left, back-fill g_id/home/away from the
# next row, then re-append the last row that shift pushed out.
l = [pd.concat((g.shift().fillna({'event': '"TIP"', 'time_left': '00:12:00'}).bfill(),
                g.iloc[[-1]]))
     if 'TIP' not in g['event'].iloc[0] else g
     for _, g in df.groupby('g_id')]

out = pd.concat(l, ignore_index=True)
print(out)

   g_id   event time_left home away
0     1   "TIP"  00:12:00    8    6
1     1  "SHOT"  00:11:48    8    6
2     1  "MISS"  00:11:20    8    6
3     1   "TOV"  00:11:15    8    6
4     1  "SHOT"  00:10:40    8    6
5     2   "TIP"  00:12:00    7    3
6     2   "REB"  00:11:48    7    3
7     2  "FOUL"  00:11:35    7    3
8     2    "FT"  00:11:33    7    3
9     2    "FT"  00:11:31    7    3
10    3   "TIP"  00:12:00    5    1
11    3  "MISS"  00:11:43    5    1
12    3   "REB"  00:11:42    5    1
13    3  "SHOT"  00:11:27    5    1
14    3   "TOV"  00:11:04    5    1
15    4   "TIP"  00:12:00    9    4
16    4  "SHOT"  00:11:39    9    4
17    4  "MISS"  00:11:17    9    4
18    4   "REB"  00:11:16    9    4
19    4  "SHOT"  00:10:58    9    4
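A more explicit equivalent (a sketch, not the one-liner above) is to build the missing "TIP" rows from each group's first row and concatenate them back; this assumes the clock in 'time_left' counts down within a game:

    # First row of every game, with g_id kept as a column
    first_rows = df.groupby('g_id', as_index=False).first()

    # Games whose opening event is not a tip get a new "TIP" row at 00:12:00,
    # carrying over g_id/home/away from that first row
    need_tip = first_rows[~first_rows['event'].str.contains('TIP')].copy()
    need_tip['event'] = '"TIP"'
    need_tip['time_left'] = '00:12:00'

    # Append and re-sort; 00:12:00 is the largest time, so it lands first
    out = (pd.concat([df, need_tip], ignore_index=True)
             .sort_values(['g_id', 'time_left'], ascending=[True, False],
                          ignore_index=True))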
anky

A slightly more drawn-out solution: you can get an array of all your group ids with

    g_ids = df['g_id'].unique()

For your example this returns the array [1, 2, 3, 4]. You can then loop over the games, append a copy of each group's first row as the missing 'TIP' event, and re-sort:

    for g_id in g_ids:
        events = df[df['g_id'] == g_id]['event']
        # `in` on a Series checks the index, so test the values instead
        if not events.str.contains('TIP').any():
            # append a copy of the group's first row, then overwrite the
            # event and clock for the new tip-off row
            insert_index = len(df.index)
            df.loc[insert_index] = df[df['g_id'] == g_id].iloc[0]
            df.loc[insert_index, 'event'] = '"TIP"'
            df.loc[insert_index, 'time_left'] = '00:12:00'
    # re-sort so the appended rows fall back into game order
    # (the clock counts down, so 00:12:00 sorts first within each game)
    df.sort_values(['g_id', 'time_left'], ascending=[True, False],
                   inplace=True, ignore_index=True)
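
As a quick sanity check (purely illustrative), you can confirm that every game now opens with a tip:

    # each g_id should now report "TIP" as its first event
    print(df.groupby('g_id')['event'].first())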
greenPlant