pandas: Merge two columns with different names?

Question

I am trying to concatenate two dataframes, above and below. Not concatenate side-by-side.

The dataframes contain the same data, however, in the first dataframe one column might have name "ObjectType" and in the second dataframe the column might have name "ObjectClass". When I do

df_total = pandas.concat ([df0, df1])

the df_total will have two column names, one with "ObjectType" and another with "ObjectClass". In each of these two columns, half of the values will be "NaN". So I have to manually merge these two columns into one which is a pain.

Can I somehow merge the two columns into one? I would like to have a function that does something like:

df_total = pandas.merge_many_columns(input=["ObjectType,"ObjectClass"], output=["MyObjectClasses"]

which merges the two columns and creates a new column. I have looked into melt() but it does not really do this?

(Maybe it would be nice if I could specify what will happen if there is a collision, say that two columns contain values, in that case I supply a lambda function that says "keep the largest value", "use an average", etc)

If you came here looking to learn how to actually merge two DataFrames with different column names, you can learn more [here](https://stackoverflow.com/questions/53645882/pandas-merging-101/53645883#53645883). — cs95, Dec 07 '18 at 10:53

jezrael · Accepted Answer · 2017-04-05T11:46:46.493

I think you can rename column first for align data in both DataFrames:

df0 = pd.DataFrame({'ObjectType':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df0)

df1 = pd.DataFrame({'ObjectClass':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df1)

inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"

#dict comprehension 
d = {x:output for x in inputs}
print (d)
{'ObjectType': 'MyObjectClasses', 'ObjectClass': 'MyObjectClasses'}

df0 = df0.rename(columns=d)
df1 = df1.rename(columns=d)
df_total = pd.concat([df0, df1], ignore_index=True)
print (df_total)
   B  C  MyObjectClasses
0  4  7                1
1  5  8                2
2  6  9                3
3  4  7                1
4  5  8                2
5  6  9                3

EDIT:

More simplier is update (working inplace):

df = pd.concat([df0, df1])
df['ObjectType'].update(df['ObjectClass'])
print (df)
   B  C  ObjectClass  ObjectType
0  4  7          NaN         1.0
1  5  8          NaN         2.0
2  6  9          NaN         3.0
0  4  7          1.0         1.0
1  5  8          2.0         2.0
2  6  9          3.0         3.0

Or fillna, but then need drop original columns columns:

df = pd.concat([df0, df1])
df["ObjectType"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop('ObjectClass', axis=1)
print (df)
   B  C  ObjectType
0  4  7         1.0
1  5  8         2.0
2  6  9         3.0
0  4  7         1.0
1  5  8         2.0
2  6  9         3.0

df = pd.concat([df0, df1])
df["MyObjectClasses"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop(['ObjectType','ObjectClass'], axis=1)
print (df)
   B  C  MyObjectClasses
0  4  7              1.0
1  5  8              2.0
2  6  9              3.0
0  4  7              1.0
1  5  8              2.0
2  6  9              3.0

EDIT1:

Timings:

df0 = pd.DataFrame({'ObjectType':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df0)

df1 = pd.DataFrame({'ObjectClass':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df1)
df0 = pd.concat([df0]*1000).reset_index(drop=True)
df1 = pd.concat([df1]*1000).reset_index(drop=True)

inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"

#dict comprehension 
d = {x:output for x in inputs}

In [241]: %timeit df_total = pd.concat([df0.rename(columns=d), df1.rename(columns=d)], ignore_index=True)
1000 loops, best of 3: 821 µs per loop

In [240]: %%timeit
     ...: df = pd.concat([df0, df1])
     ...: df['ObjectType'].update(df['ObjectClass'])
     ...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
     ...: 

100 loops, best of 3: 2.18 ms per loop

In [242]: %%timeit
     ...: df = pd.concat([df0, df1])
     ...: df['MyObjectClasses'] = df['ObjectType'].combine_first(df['ObjectClass'])
     ...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
     ...: 
100 loops, best of 3: 2.21 ms per loop

In [243]: %%timeit 
     ...: df = pd.concat([df0, df1])
     ...: df['MyObjectClasses'] = df['ObjectType'].fillna(df['ObjectClass'])
     ...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
     ...: 
100 loops, best of 3: 2.28 ms per loop

Yes this might work. But, I have many columns, and I only want to rename two columns. Does your solution only work when dataframe has two columns? — Orvar Korvar, Apr 04 '17 at 11:36
I think it is universal solution - simply need same columns names in both dataframes. — jezrael, Apr 04 '17 at 11:37
Thanx for your help, but I have chosen the answer with "combine_first" because it is simpler. :) — Orvar Korvar, Apr 05 '17 at 11:17
I add 2 solutions, but in my opinion `rename` is better, if working with more different columns and need mapping `input=["ObjectType,"ObjectClass"]` to `output=["MyObjectClasses"]` — jezrael, Apr 05 '17 at 11:22
Hmmm, mainly it depends what you need. `combine_first`, `fillna` and `update` are slowier also (see edited answer) and `int` is converted to `float`, so astype(int) last is necessary. ;) Nice day! — jezrael, Apr 05 '17 at 11:38
Ok, I think "update()" is easier than "combine_first()", so I change my mind and use this as an answer. Thanx for all your help, guys, you are the best! :) — Orvar Korvar, Apr 06 '17 at 10:33

greole · Answer 2 · 2017-04-06T05:22:07.267

1

You can merge two columns separated by Nan's into one using combine_first

>>> import numpy as np
>>> import pandas as pd
>>>
>>> df0 = pd.DataFrame({'ObjectType':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9]})

>>> df1 = pd.DataFrame({'ObjectClass':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9]})

>>> df = pd.concat([df0, df1])
>>> df['ObjectType'] = df['ObjectType'].combine_first(df['ObjectClass'])
>>> df['ObjectType']

0    1
1    2
2    3
0    1
1    2
3    3
Name: ObjectType, dtype: float64

edited Apr 06 '17 at 05:22

answered Apr 05 '17 at 06:53

greole

3,766
5
24
48

As i understood the op, he got the df full of Nan's after concatenating and is looking for a way to merge the two columns now. – greole Apr 05 '17 at 06:57
This exactly what I was after. Thanx! Maybe you could edit the last lined to this instead? It took me 15 minutes to figure this out, so I would appreciate if this could be added: df["ObjectType"] = df['ObjectType'].combine_first(df['ObjectClass']) – Orvar Korvar Apr 05 '17 at 11:16
Sorry, but I think "update()" is slightly more pythonic than "combine_first()". I would like to choose boths as a solution. – Orvar Korvar Apr 06 '17 at 10:32

pandas: Merge two columns with different names?

2 Answers2

Linked