I have 4 csv files. Each file has different fields, e.g. name, id_number, etc. Each file is talking about the same thing, for which there is a unique id that each file has. So, I would like to concatenate the fields of each of the 4 files into a single DataFrame. For instance, one file contains first_name, another file contains last_name, then I want to merge those two, so that I can have first and last name for each object.
Doing that is trivial, but I'd like to know the most efficient way, or if there is some built-in function that does it very efficiently.
The files look something like this:
file1:
id name age pets
b13 Marge 18 cat
y47 Dan 13 dog
h78 Mark 20 lizard
file2:
id last_name income city
y47 Schmidt 1800 Dallas
b13 Olson 1670 Paris
h78 Diaz 2010 London
file 3 and 4 are like that with different fields. The ids are not necessarily ordered. The goal again, is to have one DataFrame looking like this:
id name age pets last_name income city
b13 Marge 18 cat Olson 1670 Paris
y47 Dan 13 dog Schmidt 1800 Dallas
h78 Mark 20 lizard Diaz 2010 London
What I've done is this:
file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')
file3 = pd.read_csv('file3.csv')
file4 = pd.read_csv('file4.csv')
f1_group = file1.groupby(['id'])
f2_group = file2.groupby(['id'])
f3_group = file3.groupby(['id'])
f4_group = file4.groupby(['id'])
data = []
for id1, group1 in f1_group:
for id2, group2 in f2_group:
for id3, group3 in f3_group:
for id4, group4 in f4_group:
if id1 == id2 == id3 == id4:
frames = [group1, group2, group3, group4]
con = pd.concat(frames, axis=1)
data.append(con)
That works but is extremely inefficient. If I could eliminate the element that has been already considered from group1, group2, etc, that would help, but it would still be inefficient.
Thanks in advance.