
I am trying to concatenate multiple CSV files in a directory. I am able to concatenate the files, but I am getting duplicated rows and columns. If a field doesn't exist in an input file, the output file should show NaN (or null) for it.

Ideally the output would look as follows:

Merged.csv

   ItemID  Price  Discount ItemName          Desc  Barcode
0       1      4       2.0      Toy      For kids   667865
1       2     10       1.0     Game    For adults   998364
2       3      8       NaN      NaN  For everyone    43831

file1.csv

   ItemID  Price  Discount
0       1      4       2.0
1       2     10       1.0
2       3      8       NaN

file2.csv

   ItemID ItemName          Desc
0       1      Toy      For kids
1       2     Game    For adults
2       3      NaN  For everyone

file3.csv

   ItemID  Barcode
0       1   667865
1       2   998364
2       3    43831

The following is the code I am working on:

import glob, os
import pandas as pd

path = '/location/test/'
all_files = glob.glob(os.path.join(path, "*.csv"))
pd.concat([pd.read_csv(f, sep=';').assign(file=os.path.basename(f)) for f in all_files]).to_csv('merged.csv', index=False)
  • Instead of concat use pd.merge and pass on = ItemID in arguments. – r_hudson Sep 28 '20 at 21:27
  • IIUC, you need a merge or join `try pd.merge(left,right,on='ItemID')` – Umar.H Sep 28 '20 at 21:27
  • 1
    Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – Umar.H Sep 28 '20 at 21:27
  • @r_hudson would that line of code be as follows ``` pd.merge(how = "right", on='ITEM_ID'[pd.read_csv(f, sep=';').assign(file=os.path.basename(f)) for f in all_files]).to_csv('merged.csv', index=False) ``` – confusedntired Sep 28 '20 at 21:50

1 Answer

  1. You can pass axis=1 to pd.concat and use .set_index('ItemID') in order to concat on ItemID. You will just need to reset_index() at the end.
  2. You can use reduce from functools with merge:

I prefer the former as it's a little bit cleaner, but I'm including the reduce solution in case it is more performant, or if you want to use some of the additional features unique to .merge:

Setup:

import glob, os
import pandas as pd
from functools import reduce

path = '/location/test/'
all_files = glob.glob(os.path.join(path, "*.csv"))

Method 1* (please see a note on this method at the end of the answer):

# read each file with ItemID as the index, then concat column-wise so the
# frames align on ItemID and any missing field becomes NaN
(pd.concat([pd.read_csv(f, sep=';').set_index('ItemID') for f in all_files], axis=1)
 .reset_index().to_csv('merged.csv', index=False))
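
As a side note (my own variation, not part of the answer proper): your original .assign(file=...) tagged each row with its source file. That per-row tag no longer makes sense after a column-wise concat, but you can record each column's source file with add_suffix:

import os

frames = []
for f in all_files:
    tag = os.path.splitext(os.path.basename(f))[0]   # e.g. 'file1'
    df = pd.read_csv(f, sep=';').set_index('ItemID')
    frames.append(df.add_suffix('_' + tag))          # Price -> Price_file1, etc.
pd.concat(frames, axis=1).reset_index().to_csv('merged.csv', index=False)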

Method 2:

# successively merge each file onto the running result on ItemID
(reduce(lambda left, right: pd.merge(left, right, on=['ItemID'], how='left'),
        [pd.read_csv(f, sep=';') for f in all_files]).to_csv('merged.csv', index=False))
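
If those merge-only features are what you're after, here is a small illustration (my own sketch, with made-up frames) of options like suffixes, indicator and validate, which concat doesn't offer:

import pandas as pd

left = pd.DataFrame({'ItemID': [1, 2, 3], 'Price': [4, 10, 8]})
right = pd.DataFrame({'ItemID': [1, 2], 'Price': [5, 9]})

merged = pd.merge(left, right, on='ItemID', how='left',
                  suffixes=('_file1', '_file2'),  # disambiguate the clashing Price columns
                  indicator=True,                 # adds a _merge column: both / left_only / right_only
                  validate='one_to_one')          # raises MergeError if ItemID is duplicated on either side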

The output in your .csv file will look like this:

Out[1]:

   ItemID  Price  Discount ItemName          Desc  Barcode
0       1      4       2.0      Toy      For kids   667865
1       2     10       1.0     Game    For adults   998364
2       3      8       NaN      NaN  For everyone    43831

A note on Method #1: it is not a universal solution merely because the join column exists in every dataframe; the key also has to be unique within each file. For example, if you have a base file with non-unique key values that you are trying to join to other files whose keys are unique, the other files will concat together fine, but pd.concat will raise a ValueError when it tries to align the base file on the duplicated key. In that situation it's better to use join or merge. Below, I first read in the base file, then concat all of the other files (excluding the base file), and finally attach the base file with join:

import pandas as pd
import glob, os

path = 'Downloads\\'
base_file = os.path.join(path, "EVS1CPP.csv")
all_files = glob.glob(os.path.join(path, "*1CPP.csv"))
all_files.remove(base_file)  # keep the base file out of the concat
df_base = pd.read_csv(base_file, sep=';').set_index('ARTICLE_ID')
# these files all have unique ARTICLE_ID values, so concat on axis=1 is safe
dfs = pd.concat([pd.read_csv(f, sep=';').set_index('ARTICLE_ID') for f in all_files], axis=1)
# join handles the non-unique ARTICLE_ID values in the base file
df = df_base.join(dfs, how='outer').rename_axis('ARTICLE_ID').reset_index()
df
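
To see that failure mode concretely, here is a minimal reproduction (my own illustration with made-up data): concat on axis=1 raises once the key is duplicated, while join still works:

import pandas as pd

base = pd.DataFrame({'ARTICLE_ID': [1, 1, 2], 'qty': [5, 7, 3]}).set_index('ARTICLE_ID')
other = pd.DataFrame({'ARTICLE_ID': [1, 2], 'desc': ['a', 'b']}).set_index('ARTICLE_ID')

# pd.concat([base, other], axis=1)  # ValueError: cannot reindex on an axis with duplicate labels
base.join(other, how='outer')       # works: both rows with ARTICLE_ID 1 get desc 'a'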