Pandas: Inconsistent concatenation

Question

I have two pandas DataFrames with a date-time stamp as an index called "datetime" and some floats as a column called "metric1" or "metric2". When I try to concatenate them I get the following error:

ValueError: cannot reindex from a duplicate axis

After reading for many hours I cannot find a solution that fixes my problem (e.g. What does `ValueError: cannot reindex from a duplicate axis` mean? "ValueError: cannot reindex from a duplicate axis" ).

I then spent a long time trying to recreate the problem but couldn't whithout my specific data but my data is two big to post on here.

In the end it seemed like certain parts of the dataframes were incompatible but other parts were fine. After comparing many different slices of indexes I finally found incompatible sets. Can someone please help me understand why I can't concatenate these.

Basically I would like it so that all possible time-stamps are in the index and there is a column for "metric1" and "metric2". If there is no data for a column at a given timestamp then we just have NaN or something. This works fine normally with pd.concat but it is not working in this case. To recreate please use:

CSV Files:

test1.csv

timestamp,metric1
2018-03-21 15:46:36,3.5555559999999997
2018-03-21 15:47:36,5.345001
2018-03-21 15:48:36,5.719998

test2.csv

timestamp,metric2
2018-03-28 05:49:59,3.28
2018-03-28 05:50:59,3.45
2018-03-28 05:51:59,3.258332
2018-03-28 05:52:59,3.068333
2018-03-28 05:53:59,2.9733330000000002
2018-03-28 05:54:59,3.0650009999999996
2018-03-28 05:55:59,3.109999
2018-03-28 05:56:59,3.3683330000000002
2018-03-28 05:57:59,3.1516669999999998
2018-03-28 05:58:59,3.051666
2018-03-28 05:59:59,3.3083339999999994
2018-03-28 06:01:01,3.328333
2018-03-28 06:01:01,3.1
2018-03-28 06:02:00,3.305
2018-03-28 06:03:00,3.29
2018-03-28 06:04:00,3.2183330000000003
2018-03-28 06:05:00,3.176666
2018-03-28 06:06:00,3.353333
2018-03-28 06:07:00,3.3233330000000003
2018-03-28 06:08:00,3.393332
2018-03-28 06:09:00,3.053334
2018-03-28 06:10:00,3.268333
2018-03-28 06:11:00,3.239999
2018-03-28 06:12:00,3.223332
2018-03-28 06:13:00,3.119999

test4.csv

timestamp,metric2
2018-03-21 00:00:00,10.665
2018-03-21 00:01:00,10.285
2018-03-21 00:02:00,10.12834

Note: test2.csv and test4.csv come from EXACTLY the same data set.

Now let's load the CSV files:

tt1 = pd.read_csv('test1.csv', index_col=0)
tt1.index = pd.to_datetime(tt1.index)
tt2 = pd.read_csv('test2.csv', index_col=0)
tt2.index = pd.to_datetime(tt2.index)
tt4 = pd.read_csv('test4.csv', index_col=0)
tt4.index = pd.to_datetime(tt4.index)

Now let's test concatenating them:

Test with no error

tt3 = pd.concat([tt1, tt4], axis = 1)

Test with error

tt3 = pd.concat([tt1, tt2], axis = 1)
ValueError: cannot reindex from a duplicate axis

Concat is not the correct tool for this. For this problem you should use merge. — tobsecret, May 23 '18 at 17:57
@tobsecret thanks for the comment but can you expand a bit please? Why is merge appropriate now and are there any 'rules of thumb' to help decide this kind of thing? — ojunk, May 23 '18 at 18:37
The underlying issue is that `tt1` and `tt2` have indices with the same name. — BallpointBen, May 23 '18 at 19:03
@ojunk See https://stackoverflow.com/a/49564930/7480990 In a nutshell the problem with pd.concat for this case is that repeated columns will show up repeatedly in the DataFrame as well. Merge mitigates that because as Harv Ipan's answer shows, it also works on columns, allowing you to deal with duplicate indices by reseting the index for each dataframe. — tobsecret, May 23 '18 at 20:08

score 2 · Answer 1 · answered May 23 '18 at 17:59

2

You have duplicate index in tt2. That causes the error. Correct way to get the output:

tt1.reset_index().merge(tt2.reset_index(), how='outer')

answered May 23 '18 at 17:59

harvpan

7,966
2
16
33

Thanks @Harv Ipan what does this do with the duplicate index? – ojunk May 23 '18 at 18:41
@ojunk, Are you sure your dataframes are exactly the minimal example of what you are dealing with? If you do not `reset_index()`, you are ought to get this error. The code works for me. Also, see [this](https://github.com/pandas-dev/pandas/issues/6963) – harvpan May 23 '18 at 18:47

score 1 · Answer 2 · answered May 23 '18 at 19:04

1

Don't merge, join. join joins on the indices.

tt1.join(tt2, how='outer')

answered May 23 '18 at 19:04

BallpointBen

5,916
1
27
47

You need `how='outer'` – harvpan May 23 '18 at 19:06

score -1 · Answer 3 · answered May 23 '18 at 18:56

-1

I have solved your question.

Have a look this solution :)

import pandas as pd

tt1 = pd.read_csv('test1.csv', index_col=0)
tt1.index = pd.to_datetime(tt1.index)
tt2 = pd.read_csv('test2.csv', index_col=0)
tt2.index = pd.to_datetime(tt2.index)
tt4 = pd.read_csv('test4.csv', index_col=0)
tt4.index = pd.to_datetime(tt4.index)

tt3 = pd.concat([tt1, tt4], axis=1)
tt4 = tt3.reset_index().merge(tt2.reset_index(), how='outer')
tt4 = tt4.set_index('timestamp')
print(tt4)

Hope it will make sense

answered May 23 '18 at 18:56

ricko72

24
2

What's the difference? – harvpan May 23 '18 at 19:02
not any difference you are right in you answer my friend. I just complete the code and reindex timestamp column. Hope it will clear things – ricko72 May 23 '18 at 19:05