3

I am trying to concatenate two dataframes, above and below. Not concatenate side-by-side.

The dataframes contain the same data, however, in the first dataframe one column might have name "ObjectType" and in the second dataframe the column might have name "ObjectClass". When I do

df_total = pandas.concat ([df0, df1])

the df_total will have two column names, one with "ObjectType" and another with "ObjectClass". In each of these two columns, half of the values will be "NaN". So I have to manually merge these two columns into one which is a pain.

Can I somehow merge the two columns into one? I would like to have a function that does something like:

df_total = pandas.merge_many_columns(input=["ObjectType,"ObjectClass"], output=["MyObjectClasses"]

which merges the two columns and creates a new column. I have looked into melt() but it does not really do this?

(Maybe it would be nice if I could specify what will happen if there is a collision, say that two columns contain values, in that case I supply a lambda function that says "keep the largest value", "use an average", etc)

Orvar Korvar
  • 741
  • 1
  • 9
  • 17
  • If you came here looking to learn how to actually merge two DataFrames with different column names, you can learn more [here](https://stackoverflow.com/questions/53645882/pandas-merging-101/53645883#53645883). – cs95 Dec 07 '18 at 10:53

2 Answers2

5

I think you can rename column first for align data in both DataFrames:

df0 = pd.DataFrame({'ObjectType':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df0)

df1 = pd.DataFrame({'ObjectClass':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df1)

inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"

#dict comprehension 
d = {x:output for x in inputs}
print (d)
{'ObjectType': 'MyObjectClasses', 'ObjectClass': 'MyObjectClasses'}

df0 = df0.rename(columns=d)
df1 = df1.rename(columns=d)
df_total = pd.concat([df0, df1], ignore_index=True)
print (df_total)
   B  C  MyObjectClasses
0  4  7                1
1  5  8                2
2  6  9                3
3  4  7                1
4  5  8                2
5  6  9                3

EDIT:

More simplier is update (working inplace):

df = pd.concat([df0, df1])
df['ObjectType'].update(df['ObjectClass'])
print (df)
   B  C  ObjectClass  ObjectType
0  4  7          NaN         1.0
1  5  8          NaN         2.0
2  6  9          NaN         3.0
0  4  7          1.0         1.0
1  5  8          2.0         2.0
2  6  9          3.0         3.0

Or fillna, but then need drop original columns columns:

df = pd.concat([df0, df1])
df["ObjectType"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop('ObjectClass', axis=1)
print (df)
   B  C  ObjectType
0  4  7         1.0
1  5  8         2.0
2  6  9         3.0
0  4  7         1.0
1  5  8         2.0
2  6  9         3.0

df = pd.concat([df0, df1])
df["MyObjectClasses"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop(['ObjectType','ObjectClass'], axis=1)
print (df)
   B  C  MyObjectClasses
0  4  7              1.0
1  5  8              2.0
2  6  9              3.0
0  4  7              1.0
1  5  8              2.0
2  6  9              3.0

EDIT1:

Timings:

df0 = pd.DataFrame({'ObjectType':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df0)

df1 = pd.DataFrame({'ObjectClass':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df1)
df0 = pd.concat([df0]*1000).reset_index(drop=True)
df1 = pd.concat([df1]*1000).reset_index(drop=True)

inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"

#dict comprehension 
d = {x:output for x in inputs}

In [241]: %timeit df_total = pd.concat([df0.rename(columns=d), df1.rename(columns=d)], ignore_index=True)
1000 loops, best of 3: 821 µs per loop

In [240]: %%timeit
     ...: df = pd.concat([df0, df1])
     ...: df['ObjectType'].update(df['ObjectClass'])
     ...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
     ...: 

100 loops, best of 3: 2.18 ms per loop

In [242]: %%timeit
     ...: df = pd.concat([df0, df1])
     ...: df['MyObjectClasses'] = df['ObjectType'].combine_first(df['ObjectClass'])
     ...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
     ...: 
100 loops, best of 3: 2.21 ms per loop

In [243]: %%timeit 
     ...: df = pd.concat([df0, df1])
     ...: df['MyObjectClasses'] = df['ObjectType'].fillna(df['ObjectClass'])
     ...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
     ...: 
100 loops, best of 3: 2.28 ms per loop
jezrael
  • 629,482
  • 62
  • 918
  • 895
  • Yes this might work. But, I have many columns, and I only want to rename two columns. Does your solution only work when dataframe has two columns? – Orvar Korvar Apr 04 '17 at 11:36
  • I think it is universal solution - simply need same columns names in both dataframes. – jezrael Apr 04 '17 at 11:37
  • Thanx for your help, but I have chosen the answer with "combine_first" because it is simpler. :) – Orvar Korvar Apr 05 '17 at 11:17
  • I add 2 solutions, but in my opinion `rename` is better, if working with more different columns and need mapping `input=["ObjectType,"ObjectClass"]` to `output=["MyObjectClasses"]` – jezrael Apr 05 '17 at 11:22
  • Hmmm, mainly it depends what you need. `combine_first`, `fillna` and `update` are slowier also (see edited answer) and `int` is converted to `float`, so astype(int) last is necessary. ;) Nice day! – jezrael Apr 05 '17 at 11:38
  • Ok, I think "update()" is easier than "combine_first()", so I change my mind and use this as an answer. Thanx for all your help, guys, you are the best! :) – Orvar Korvar Apr 06 '17 at 10:33
1

You can merge two columns separated by Nan's into one using combine_first

>>> import numpy as np
>>> import pandas as pd
>>>
>>> df0 = pd.DataFrame({'ObjectType':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9]})

>>> df1 = pd.DataFrame({'ObjectClass':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9]})

>>> df = pd.concat([df0, df1])
>>> df['ObjectType'] = df['ObjectType'].combine_first(df['ObjectClass'])
>>> df['ObjectType']

0    1
1    2
2    3
0    1
1    2
3    3
Name: ObjectType, dtype: float64
greole
  • 3,766
  • 5
  • 24
  • 48
  • As i understood the op, he got the df full of Nan's after concatenating and is looking for a way to merge the two columns now. – greole Apr 05 '17 at 06:57
  • This exactly what I was after. Thanx! Maybe you could edit the last lined to this instead? It took me 15 minutes to figure this out, so I would appreciate if this could be added: df["ObjectType"] = df['ObjectType'].combine_first(df['ObjectClass']) – Orvar Korvar Apr 05 '17 at 11:16
  • Sorry, but I think "update()" is slightly more pythonic than "combine_first()". I would like to choose boths as a solution. – Orvar Korvar Apr 06 '17 at 10:32