Adding a new column to a multiIndex dataframe, using another dataframe with different column size

Question

I have the following multiIndex dataframe:

df= 
        id/uniqueID       var1    var2    var3   
        5171/0            10.0    2.8     0.0   
        5171/1            40.9    2.5     3.4   
        5171/2            60.7    3.1     5.2   
        ...
        5171/57           0.5     1.3     5.1   
        4567/0            1.5     2.0     1.0   
        4567/1            4.4     2.0     1.3   
        4567/2            6.3     3.0     1.5   
        ...
        4567/57           0.7     1.4     1.6   
       ... 
        9584/0            0.3     2.6     0.0   
        9584/1            0.5     1.2     8.3   
        9584/2            0.7     3.0     5.6   
        ...
        9584/57           0.7     1.3     0.1   

indexes_df= 
        id              labeled_idxs
        5171            [0,1,3,6,49,50]
        4567            [45,46,47,56,57]
        9584            [21]
        ...

I need to add a new binary column to df with 1 or True for the indexes labeled in the second dataframe, indexes_df, like this:

df= 
        id/uniqueID       var1    var2    var3    labels
        5171/0            10.0    2.8     0.0       1
        5171/1            40.9    2.5     3.4       1
        5171/2            60.7    3.1     5.2       0
        ...
        5171/57           0.5     1.3     5.1       0
        4567/0            1.5     2.0     1.0       0
        4567/1            4.4     2.0     1.3       0   
        4567/2            6.3     3.0     1.5       0   
        ...
        4567/56           0.4     0.4     1.3       1
        4567/57           0.7     1.4     1.6       1   
       ... 
        9584/0            0.3     2.6     0.0       0   
        9584/1            0.5     1.2     8.3       0   
        9584/2            0.7     3.0     5.6       0   
        ...
        9584/21           2.7     0.0     0.6       1
        ...
        9584/57           0.7     1.3     0.1       0

I tried to do it with the following code and similar approaches but they all failed with SyntaxError:

df['labes'] = indexes_df['labeled_idxs'].apply(lambda x: [i>0 ? 1 : 0 for i in x]))

How can I get the results I need?

It seems dupe is incorrect, my answer was edited (I miss first columns are indexes in both) — jezrael, Apr 12 '20 at 12:25
@jezrael ah ok! I was struggling to implement your solution! Let me try again! ;) — Birish, Apr 12 '20 at 12:30

score 1 · Answer 1 · answered Apr 12 '20 at 11:48

Python has no C-style `? :` ternary operator. However, you can use a conditional expression (`a if condition else b`) instead:

df['labes'] = indexes_df['labeled_idxs'].apply(lambda x: [1 if i > 0 else 0 for i in x])
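
As a minimal sketch of the conditional expression on its own, here it is applied to a plain list. The sample values are the ones listed for id 5171 in the question; the variable names are only illustrative.

```python
# Sample list from the question's indexes_df row for id 5171
labeled_idxs = [0, 1, 3, 6, 49, 50]

# C-style `i > 0 ? 1 : 0` is written `1 if i > 0 else 0` in Python
flags = [1 if i > 0 else 0 for i in labeled_idxs]
print(flags)  # [0, 1, 1, 1, 1, 1]
```

Note that this only transforms the list itself; it does not map the values onto rows of `df`.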

BeneSim


Thanks! But now it's assigning `NaN` to all values for the new column `labels`. I think it doesn't understand that it should update the values in the related indexes of `df` – Birish Apr 12 '20 at 11:56
Now it's only replacing the index values in `indexes_df` with 1! For `id=5171` it is converting `[0,1,3,6,49,50]` to `[1,1,1,1,1,1]`. It doesn't update the corresponding indexes in `df['labes']` – Birish Apr 12 '20 at 12:05

score 1 · Answer 2 · answered Apr 12 '20 at 11:50

You're half right. The solution is to use a list comprehension inside a lambda, but you've got one part wrong. Python doesn't use ? and : for conditionals, so you have to write:

df['labes'] = indexes_df['labeled_idxs'].apply(lambda x: [(1 if i > 0 else 0) for i in x])

Tom Robinson


score 1 · Accepted Answer · edited Apr 12 '20 at 14:34

Your solution is problematic: even if the apply part is corrected as in the other answers, there is still a problem with:

df['labes'] = indexes_df['labeled_idxs']

For this assignment to work, labeled_idxs would need to be a column of the df DataFrame, or the index of indexes_df would have to match the index of df. Otherwise, values are set only for rows whose index values are the same in both DataFrames.
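
The alignment problem can be reproduced on a tiny example. This is only a sketch: the two frames below are hypothetical miniatures of the ones in the question, with a string index on `df` and an integer `id` index on `indexes_df`.

```python
import pandas as pd

# Hypothetical miniature versions of the question's frames
df = pd.DataFrame(
    {"var1": [10.0, 40.9, 1.5]},
    index=pd.Index(["5171/0", "5171/1", "4567/0"], name="id/uniqueID"),
)
indexes_df = pd.DataFrame(
    {"labeled_idxs": [[0, 1], [45]]},
    index=pd.Index([5171, 4567], name="id"),
)

# Column assignment aligns on index labels. None of df's labels
# ("5171/0", ...) occur in indexes_df's index (5171, 4567),
# so every row of the new column ends up as NaN.
df["labels"] = indexes_df["labeled_idxs"]
print(df["labels"].isna().all())  # True
```

This matches the all-`NaN` column reported in the comments on the other answers.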

Here it is better to use a pure pandas solution. First, use DataFrame.explode to expand the column of lists into rows (the values are converted to strings in a later step):

indexes_df = indexes_df.explode('labeled_idxs')
print (indexes_df)
     labeled_idxs
id               
4567           45
4567           46
4567           47
4567           56
4567           57
          ...
5171            3
5171            6
5171           49
5171           50
9584           21

[62 rows x 1 columns]
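
To see what `explode` does in isolation, here is a small self-contained sketch. The frame is a made-up, shortened version of `indexes_df`, and `exploded` is just an illustrative name.

```python
import pandas as pd

indexes_df = pd.DataFrame(
    {"labeled_idxs": [[0, 1, 3], [45, 46], [21]]},
    index=pd.Index([5171, 4567, 9584], name="id"),
)

# Each list element becomes its own row; the id index value is repeated
exploded = indexes_df.explode("labeled_idxs")
print(exploded.shape)  # (6, 1)
print(exploded.index.tolist())  # [5171, 5171, 5171, 4567, 4567, 9584]

# The exploded column keeps object dtype, so cast it before using it as a key
print(exploded["labeled_idxs"].astype(int).tolist())  # [0, 1, 3, 45, 46, 21]
```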

UPDATE: Since df is a multiIndex dataframe, the following should work

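# One row per (id, labeled index); cast the exploded object column to int
indexes_df = indexes_df.explode('labeled_idxs').astype(int)
indexes_df['labels'] = 1
# Move labeled_idxs into the index so it lines up with df's two-level MultiIndex
indexes_df.set_index('labeled_idxs', append=True, inplace=True)
# Default every row to 0, then overwrite the labeled rows with 1
df['labels'] = 0
df.loc[indexes_df.index, indexes_df.columns] = indexes_df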
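
For a version that can be run end to end, here is a sketch on a small made-up frame. It assumes `df` has a two-level `(id, uniqueID)` MultiIndex, as described in the comments. Note that it swaps the `.loc` assignment for `MultiIndex.isin`, which is my substitution rather than the code above; the names `exploded` and `keys` are illustrative.

```python
import pandas as pd

# Hypothetical miniature df with a two-level (id, uniqueID) MultiIndex
df = pd.DataFrame(
    {"var1": [10.0, 40.9, 60.7, 1.5, 4.4]},
    index=pd.MultiIndex.from_tuples(
        [(5171, 0), (5171, 1), (5171, 2), (4567, 0), (4567, 1)],
        names=["id", "uniqueID"],
    ),
)
indexes_df = pd.DataFrame(
    {"labeled_idxs": [[0, 1], [1]]},
    index=pd.Index([5171, 4567], name="id"),
)

exploded = indexes_df.explode("labeled_idxs").astype(int)
# Build (id, uniqueID) pairs for every labeled row
keys = list(zip(exploded.index, exploded["labeled_idxs"]))
# True where df's index tuple is one of the labeled pairs
df["labels"] = df.index.isin(keys).astype(int)
print(df["labels"].tolist())  # [1, 1, 0, 0, 1]
```

With `isin`, labeled pairs that do not occur in `df` are simply ignored, which may or may not be what you want.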

OLD ANSWER:

Then join the index and the column of the DataFrame into a Series of strings:

s = indexes_df.index.astype(str) + '/' + indexes_df['labeled_idxs'].astype(str)
print (s)
id
4567    4567/45
4567    4567/46
4567    4567/47
4567    4567/56
4567    4567/57
         ...   
5171     5171/3
5171     5171/6
5171    5171/49
5171    5171/50
9584    9584/21
Length: 62, dtype: object

Finally, test the id/uniqueID index against these strings with Index.isin and cast the boolean mask to integers:

df['labes'] = df.index.isin(s).astype(int)
print (df)
             var1  var2  var3  labes
id/uniqueID                         
5171/0       10.0   2.8   0.0      1
5171/1       40.9   2.5   3.4      1
5171/2       60.7   3.1   5.2      0
5171/57       0.5   1.3   5.1      0
4567/0        1.5   2.0   1.0      0
4567/1        4.4   2.0   1.3      0
4567/2        6.3   3.0   1.5      0
4567/57       0.7   1.4   1.6      1
9584/0        0.3   2.6   0.0      0
9584/1        0.5   1.2   8.3      0
9584/2        0.7   3.0   5.6      0
9584/57       0.7   1.3   0.1      0
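
The string-key approach can be sketched on a small made-up frame as follows. It assumes `df` has a single-level index of `"id/uniqueID"` strings. The keys are built with a plain list comprehension instead of the `astype(str)` concatenation above, purely to keep the example short; `exploded` and `keys` are illustrative names.

```python
import pandas as pd

# Hypothetical miniature df indexed by "id/uniqueID" strings
df = pd.DataFrame(
    {"var1": [10.0, 40.9, 60.7, 1.5]},
    index=pd.Index(["5171/0", "5171/1", "5171/2", "4567/0"], name="id/uniqueID"),
)
indexes_df = pd.DataFrame(
    {"labeled_idxs": [[0, 1], [0]]},
    index=pd.Index([5171, 4567], name="id"),
)

exploded = indexes_df.explode("labeled_idxs")
# Format each (id, labeled index) pair the same way as df's index labels
keys = [f"{i}/{j}" for i, j in zip(exploded.index, exploded["labeled_idxs"])]
# Boolean mask of rows whose label is in keys, cast to 0/1
df["labels"] = df.index.isin(keys).astype(int)
print(df["labels"].tolist())  # [1, 1, 0, 1]
```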

Thanks for the answer, it fails at the join step: `s = indexes_df['id'] + '/' + indexes_df['labeled_idxs']` with this error: `Ufunc 'add' did not contain a loop with signature matching types (dtype('U21'))`. I tried to cast the values to string like: `s = str(indexes_df['id']) + '/' + str(indexes_df['labeled_idxs'])` but it creates `s` as a string with weird values... — Birish, Apr 12 '20 at 12:28
my `df` dataframe is a multiIndex dataframe with two levels. I should have mentioned it in the question to avoid confusion, sorry. Here is how I managed to get the desired results: ` indexes_df = indexes_df.explode('labeled_idxs').astype(int) ` `indexes_df['labels'] = 1 ` `indexes_df.set_index('labeled_idxs', append=True, inplace=True) ` `df['labels'] = 0 ` `df.loc[indexes_df.index, indexes_df.columns] = indexes_df` — Birish, Apr 12 '20 at 14:30
