
I have 4 CSV files. Each file has different fields, e.g. name, id_number, etc., but all four describe the same objects, and each file carries a unique id for every object. I would like to combine the fields from all 4 files into a single DataFrame. For instance, one file contains first_name and another contains last_name; I want to merge those two so that I have both the first and last name for each object.

Doing that is trivial, but I'd like to know the most efficient way, or if there is some built-in function that does it very efficiently.

The files look something like this:

file1:

id   name   age pets
b13  Marge  18  cat
y47  Dan    13  dog
h78  Mark   20  lizard

file2:

id   last_name   income  city
y47  Schmidt     1800    Dallas
b13  Olson       1670    Paris
h78  Diaz        2010    London

Files 3 and 4 look similar, with different fields. The ids are not necessarily in the same order. The goal, again, is to have one DataFrame that looks like this:

id   name   age pets    last_name income city
b13  Marge  18  cat     Olson     1670   Paris
y47  Dan    13  dog     Schmidt   1800   Dallas
h78  Mark   20  lizard  Diaz      2010   London

What I've done is this:

import pandas as pd

file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')
file3 = pd.read_csv('file3.csv')
file4 = pd.read_csv('file4.csv')

f1_group = file1.groupby(['id'])
f2_group = file2.groupby(['id'])
f3_group = file3.groupby(['id'])
f4_group = file4.groupby(['id'])
data = []
# check every combination of groups and concatenate the rows whose ids match
for id1, group1 in f1_group:
    for id2, group2 in f2_group:
        for id3, group3 in f3_group:
            for id4, group4 in f4_group:
                if id1 == id2 == id3 == id4:
                    frames = [group1, group2, group3, group4]
                    con = pd.concat(frames, axis=1)
                    data.append(con)

That works, but it is extremely inefficient: the four nested loops compare every combination of ids, which is O(n^4) in the number of ids. Removing each id from the remaining groups once it has been matched would help, but the approach would still be inefficient.

Thanks in advance.

Schach21
  • IIUC, you can use concat/join if you set your indices as your unique IDs (a sketch of this follows the comments), but it's better if you provide some sample data, as @TrentonMcKinney suggested. – Umar.H Jun 19 '20 at 22:45
  • Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – Umar.H Jun 19 '20 at 22:45
  • It kind of does, in that it provides a solution, but I am looking in particular for an efficient way to implement it. He mentioned in that post that he is not concerned with performance or timing. But, yes, I guess that's probably the most efficient way to do it while keeping the code simple and easy to understand. Thanks for the reference. – Schach21 Jun 20 '20 at 00:21
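
A minimal sketch of the concat/join approach suggested in the comment above, assuming each file has a unique id column (the file names are the ones from the question):

import pandas as pd

files = ['file1.csv', 'file2.csv', 'file3.csv', 'file4.csv']

# read each file and index it by its unique id
frames = [pd.read_csv(f).set_index('id') for f in files]

# align all four frames on the shared index in a single pass;
# join='inner' keeps only the ids present in every file
merged = pd.concat(frames, axis=1, join='inner').reset_index()

Because the frames are aligned on the index, this avoids the nested loops entirely and handles ids that appear in a different order in each file.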

1 Answer


Hi, maybe you can try this :)

https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/

import os
import glob
import pandas as pd

# set working directory
os.chdir("/mydir")

# find all csv files in the folder
# use glob pattern matching -> extension = 'csv'
# save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# print(all_filenames)

# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
# export to csv
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')
  • That solution merges the csv files, but it does not take their key into account. I didn't specify my problem clearly enough; I've edited my original question. Maybe it's clear now. – Schach21 Jun 20 '20 at 00:24
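
Following up on that comment: the answer's concat stacks rows rather than joining on the shared key. A key-aware variant is a reduce over pd.merge; a minimal sketch, reusing the glob file list from the answer and assuming every file carries the 'id' column:

import glob
from functools import reduce
import pandas as pd

# gather the csv files (assumes they are the only csvs in the folder)
all_filenames = glob.glob('*.csv')
frames = [pd.read_csv(f) for f in all_filenames]

# successively merge each frame with the running result on the shared 'id' key
combined = reduce(lambda left, right: pd.merge(left, right, on='id'), frames)
combined.to_csv('combined_csv.csv', index=False, encoding='utf-8-sig')

By default pd.merge performs an inner join, so the result keeps only the ids present in every file; pass how='outer' to keep them all.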