How to concatenate pandas dataframes with automatic keys?

Question

I have

df1 = pd.Dataframe(
    [
    {'a': 1},
    {'a': 2},
    {'a': 3},
    ]
)

df2 = pd.Dataframe(
    [
    {'a': 4},
    {'a': 5},
    ]
)

And I want

I accepted an answer too soon, that told me to do

pd.concat([df1, df2], keys=[1,2])

which gives the correct result, but [1,2] is hardcoded.

I also want this to be incremental, meaning given

df3

and

df4 = pd.Dataframe(
    [
    {'a': 6},
    {'a': 7},
    ]
)

I want the concatenation to give

Using the same function.

How can I achieve this correctly?

EDIT: A discount- I can manage with only the incrementing function. It doesn't have to work with the single level dfs, but it would be nice if it did.

@Ch3steR If you can come up with a simpler solution for only 2 dfs, then go for it. One has to be the 2-level-index df. — Gulzar, Oct 25 '20 at 18:18
@Gulzar then why not generate a list of dataframes that you need to concat and then use range(length(yourlist)) to spit out the keys? — BICube, Oct 25 '20 at 18:20
@BICube I must have understood the question differently. I don't know in advance how many dfs I am going to have. I will be getting them in a stream, one by one. — Gulzar, Oct 25 '20 at 18:22

ansev · Accepted Answer · 2020-10-27T09:29:41.297

1

IIUC,

def split_list_by_multitindex(l):

    l_multi, l_not_multi = [], []
    for df in l:
        if isinstance(df.index, pd.MultiIndex):
            l_multi.append(df)
        else:
            l_not_multi.append(df)
    
    return l_multi, l_not_multi

def get_start_key(df):
    return df.index.get_level_values(0)[-1]

def concat_starting_by_key(l, key):
    return pd.concat(l, keys=range(key, key+len(l))) \
        if len(l) > 1 else set_multiindex_in_df(l[0], key)

def set_multiindex_in_df(df, key):
    return df.set_axis(pd.MultiIndex.from_product(([key], df.index)))


def myconcat(l):
    l_multi, l_not_multi = split_list_by_multitindex(l)
    return pd.concat([*l_multi, 
                      concat_starting_by_key(l_not_multi, 
                                              get_start_key(l_multi[-1]) + 1)
                     ]) if l_multi else concat_starting_by_key(l_not_multi, 1)

Examples

l1 = [df1, df2]

print(myconcat(l1))

     a
1 0  1
  1  2
  2  3
2 0  4
  1  5

l2 = [myconcat(l1), df4]

print(myconcat(l2))

     a
1 0  1
  1  2
  2  3
2 0  4
  1  5
3 0  6
  1  7

myconcat([df4, myconcat([df1, df2]), df1, df2])

     a
1 0  1
  1  2
  2  3
2 0  4
  1  5
3 0  6
  1  7
4 0  1
  1  2
  2  3
5 0  4
  1  5

Note

This assumes that if we make a concatenation of the dataframes belonging to the l_multi list, the resulting dataframe would already be ordered

edited Oct 27 '20 at 09:29

answered Oct 25 '20 at 18:41

ansev

26,199
5
10
28

Flame wars aside, I chose @drec4s answer as accepted because it is shorter for the same result. I find it hard to understand why this solution is superior. Feel free to explain, if it really is, it will be the accepted one. About concatenating with an empty - it makes a lot of sense from a mathematical standpoint - There exists a zero element. – Gulzar Oct 27 '20 at 08:12
actually in my code a concatenaction with a unique dataframe is also performed only when l_not_mult has only size one, I will update the code for this case using `set_axis (pd.MultiIndex (....))`. It just doesn't make sense to concatenate a DataFrame with nothing, actually this is a shortcut to MultiIndex a single DataFrame, which is done with other methods. Although it is shorter it does not make sense, just open the documentation for pd.concat and pd.DataFrame to realize that doing this is inefficient – ansev Oct 27 '20 at 08:49
First of all, the use of the `max ()` function with the index that you always know is going to be 0, 1,2, 3 ... n is unnecessary. It is simply taking the last value. Second: her solution is exactly mine. The only difference is that I have written a function that supports one and two level Dataframes and does everything in one step. The first has written a code for the DataFrames that do not have MultiIndex and then a second code to add a new dataframe to a MultiIndex. I have taken the time to unify these two operations into a single function. – ansev Oct 27 '20 at 08:53
I have updated the solution removing all unnecessary concatenation with set_axis and MultiIndex, I have also added an additional example where the list contains the MultiIndex in second position – ansev Oct 27 '20 at 09:05
My function is bigger basically because it allows you to enter a list with the MultiIndex DataFrame in any position and all the dataframes you want to add, it also supports more than one MultiIndex as long as its level 0 indexes are already ordered, although it would be easy to modify it to that will be ordered automatically. In my opinion he has taken my idea and has tried to make it apparently simpler by eliminating the functions that are necessary for abstraction – ansev Oct 27 '20 at 09:14

afonso · Answer 2 · 2020-10-26T10:05:45.280

1

My approach was to nest two pd.concat functions, the second one to create a MultiIndex dataframe, from a single index.

import pandas as pd

df = pd.DataFrame(
    [
    {'a': 1},
    {'a': 2},
    {'a': 3},
    ]
)

df2 = pd.DataFrame(
    [
    {'a': 4},
    {'a': 5},
    ]
)

df = pd.concat([df, df2], keys=df.index.get_level_values(0))
In[2]: df
Out[2]:
     a
0 0  1
  1  2
  2  3
1 0  4
  1  5

And to merge a new dataframe:

df3 = pd.DataFrame(
    [
    {'a': 6},
    {'a': 7},
    ]
)

In[3]: pd.concat([df, pd.concat([df3,], keys=(max(df.index.get_level_values(0))+1,))])
Out[3]: 
     a
0 0  1
  1  2
  2  3
1 0  4
  1  5
2 0  6
  1  7

EDIT: Following the comment from ansev saying that this method was inefficent, ran some simple test. This is the output:

In[5]: %timeit pd.concat([df, pd.concat([df3,], keys=(max(df.index.get_level_values(0))+1,))])
Out[5]: 1.99 ms ± 98.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Comparing to his method:

In[6]: %timeit [myconcat(l1), df3]
Out[6]: 1.92 ms ± 96.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

edited Oct 26 '20 at 10:05

answered Oct 26 '20 at 00:01

afonso

7,371
5
23
48

concatenating an empty dataframe is simply inefficient. Also due to the date of editions of each answer I think this answer is basically like mine with get_level_values (a copy but much worse) – ansev Oct 26 '20 at 08:48
Inefficient comparing to what? From what I can see their are pretty much the same in terms of performance, but your answer has 30 lines more to achieve the same output. – afonso Oct 26 '20 at 09:45
do you think concatenating with an empty data frame can be a good solution? do you think this code is coherent and easily understandable and revisable? Why use max () if you already know the order of the keys and you already know that it will be the last value of the previous data frame + 1? I think this is taking my idea and making it worse – ansev Oct 26 '20 at 09:50
Why do you think this is a competition? Why didn't you came up with a simpler solution then? Also, I didn't know that the `get_level_values` function was your idea, sorry. Here: [pandas.MultiIndex.get_level_values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.get_level_values.html) – afonso Oct 26 '20 at 09:56
I always look for how to eliminate an `if else` like anyone else. But when I realized that I should concatenate with a empty dataframe, I ruled out this possibility. I would never post a concatenation with an empty dataframe, but feel free. – ansev Oct 26 '20 at 10:00
It was my idea because my publication is prior to yours, that you include the reference to the documentation does not mean anything – ansev Oct 26 '20 at 10:02
Ahaha of course it's your idea. Also, I was wrong here, there is absolutely no need to use an empty DataFrame. Thanks for the heads up! – afonso Oct 26 '20 at 10:06
great! Now you propose to concatenate a dataframe with none more! – ansev Oct 26 '20 at 10:07
`pd.concat([df,])` , are you kidding me? do you think this would be valid somewhere? Aren't you looking for something like `df.set_axis(pd.MultiIndex.from_arrays( ... . .))`? sometimes trying very short code has worse consequences – ansev Oct 26 '20 at 10:08
about the max - i didn't like it either. Is there a more elegant way, using existing data? and not calculating something "strange"? – Gulzar Oct 27 '20 at 08:15
@Gulzar sorry, this argument has gone way further than I was expecting. Use whatever you feel like is the right solution for your problem. – afonso Oct 27 '20 at 09:34
@drec4s I chose the more general as the accepted, because I assume it would be better for future users. It is more readable, because it doesn't use the arbitrary max. Thank you for understanding – Gulzar Oct 27 '20 at 09:37

score 0 · Answer 3 · answered Oct 25 '20 at 18:39

0

This is how I solved it

import pandas as pd

df1 = pd.DataFrame(
    [
    {'a': 1},
    {'a': 2},
    {'a': 3},
    ]
)

df2 = pd.DataFrame(
    [
    {'a': 4},
    {'a': 5},
    ]
)

df = df1.append(df2)

df['from'] = df.index == 0
df['from'] = df['from'].cumsum()
df = df[['from', 'a']]

print(df)

answered Oct 25 '20 at 18:39

Prayson W. Daniel

9,790
2
35
43

1

didn't select this solution because it iterates all of the existing df for a single addition at the end. – Gulzar Oct 25 '20 at 19:26

How to concatenate pandas dataframes with automatic keys?

3 Answers3