
I hacked together some code that I thought would print all column names that are common to all CSV files in a folder. I'm using an inner join, but it's acting like an outer join. There must be a quick fix for this, right?

import glob
import pandas as pd

files = glob.glob(r'C:\my_files\*.csv')

def get_merged(files, **kwargs):
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='inner')
    return df

print(get_merged(files))

So, if I have 4 files with these columns:

 cola   colb    colc    cold    cole

And I have 1 file with these columns:

cola    colc    cole

I would like to see this:

cola    colc    cole
asked by ASH
  • an inner `merge` will merge on a common-index basis, not a common-column basis IIRC. So it makes sense that you're getting a ton of columns. – Brian Joseph Sep 16 '19 at 20:24
  • You want `merge` not `join`. **`merge` works on column-names, whereas `join` works on indices** – smci Sep 16 '19 at 20:29
  • Pandas is kind of overkill for that. Are you willing to accept a solution without pandas? – Paulo Scardine Sep 16 '19 at 20:31
  • If you just want the column names, then reading in all of that data is pointless. Grab the headers, then `set.intersection` (or `Index.intersection`) is all you need. – ALollz Sep 16 '19 at 20:32
  • No, pandas is not overkill. So just read in the column names and not the data with `pd.read_csv(..., nrows=1)`. Or, if you want a native-Python solution, read them in with `csv` and then use `set()` intersection on the column names, iteratively. – smci Sep 16 '19 at 20:33
  • Check your column names in all those files and make sure that the five column names you want actually are the same `string`. When you do an inner join, by default pandas will use the index of each dataframe you load as the value to match (in other words, the index then acts like an id-column in a database). – CypherX Sep 16 '19 at 20:38
  • Pandas may not be overkill performance-wise if you are reading just the first row, but it is still a huge dependency if all you want is to get the set of column names from a bunch of CSV files. But if your only tool is a hammer, all problems look like nails, I guess. Either way, this question seems to be a dupe, as there are answers covering this both with and without pandas. – Paulo Scardine Sep 16 '19 at 20:42
  • @PauloScardine: I show both approaches: pandas and native Python. pandas can also be useful for handling delimiters/whitespace, quoting, encodings, multiline... [native `csv` module is notoriously brittle](https://stackoverflow.com/a/39197736/202229). My rule-of-thumb is: if you ever find yourself writing code that duplicates pandas functionality, just use pandas. – smci Sep 16 '19 at 20:45
  • ...actually to read only the column-names, do `pd.read_csv(..., nrows=0)` then take `set(df.columns)` – smci Sep 16 '19 at 20:50
  • I was originally thinking intersect, like a SQL set operation. I looked at merge and it seems like that would work as well. I just want something that is easy to set up and maintain. It doesn't have to scan thousands of files, and it doesn't have to be lightning fast, although I think it will work pretty fast as it just has to fetch the first row from each file. I still don't have a working solution. I tried the ideas mentioned above. One challenge is that I'm not working with just two files; I'm working with more than two. – ASH Sep 16 '19 at 21:07
  • Apologies, this is not a dupe (I thought the OP was insisting on pandas) and should not have been closed; also, strangely, it turns out that `pd.merge(..., how='inner')` still **takes the union (instead of the intersection) of the columns; it only drops non-matching rows, not columns** (see the short demo after these comments). I revised my answer. – smci Sep 16 '19 at 23:30
  • I noticed some weirdness. I'm pretty sure you know this stuff better than me. Anyway, how can I achieve my results as I described above? None of the recommendations posted has actually worked for me. – ASH Sep 16 '19 at 23:35
  • ...and `pd.merge(left, right, on=left.columns, how='inner')` gives an error if the right dataframe doesn't have all those columns in common. So `merge`/`join` are not the droids we're looking for. – smci Sep 16 '19 at 23:41
  • @asher: both my recommendations work: I coded the first one up and tested it and it works. Please edit your updated code into your question if you still have an issue. – smci Sep 17 '19 at 00:32
  • @PauloScardine: Can you please delete the disparaging remark *"But if your only tool is a hammer all problems look like nails, I guess."* If you ever have lots of large CSV files in Unicode encodings, with quoting, escaping, whitespace, various separators etc., then pandas is the only tool I know that will get the job done out-of-the-box without you having to write lots of unnecessary throwaway manual CSV-parsing code... which we already see too much of on SO on a daily basis. Whereas yeah if it was three small CSV files it would be overkill. Depends on the operational context. – smci Sep 28 '19 at 05:41
  • @smci I apologize if it sounded rude or judgmental; English is not my first language. On whether pandas is overkill for the task described by the OP, let's cordially agree to disagree. – Paulo Scardine Sep 29 '19 at 03:53
  • Does this answer your question? [list of columns in common in two pandas dataframes](https://stackoverflow.com/questions/48539195/list-of-columns-in-common-in-two-pandas-dataframes) – AMC Feb 08 '20 at 01:23
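A quick standalone sketch (not from the thread; the tiny DataFrames are made up) illustrating the point from the comments above: an inner merge drops non-matching rows, but the result still carries the union of both frames' columns.

import pandas as pd

# Two small frames that share only some columns ('cola' and 'colc').
left = pd.DataFrame({'cola': [1, 2], 'colb': ['x', 'y'], 'colc': [10, 20]})
right = pd.DataFrame({'cola': [1, 3], 'colc': [10, 30], 'cole': ['p', 'q']})

# With no on= given, merge joins on the columns common to both frames;
# how='inner' keeps only the matching rows...
merged = left.merge(right, how='inner')

# ...but every column from either frame is still present in the result.
print(list(merged.columns))  # ['cola', 'colb', 'colc', 'cole']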

2 Answers


You can either use pandas or pure Python to compute the set-intersection of column-names.

1) pandas solution

import pandas as pd

def get_common_columns(files, **kwargs):
    """Get the set intersection of column names across the specified CSV files."""
    # nrows=0 reads only the header row, so none of the data is loaded.
    common_columns = set(pd.read_csv(files[0], nrows=0, **kwargs).columns)
    for f in files[1:]:
        common_columns &= set(pd.read_csv(f, nrows=0, **kwargs).columns)
    return common_columns
  • I tested this and it works.
  • pandas is not overkill: to read in only the column names and not the data, simply do `pd.read_csv(..., nrows=0)` and take `set(df.columns)`.
  • It turns out we can't use `merge`/`join`: even a `merge(..., how='inner')` takes the union, not the intersection, of the columns. FYI, `merge` works on column names whereas `join` works on indices, but both operate row-wise.
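For example, a hypothetical call to the function above using the question's folder pattern (the expected set follows from the question's example files):

import glob

files = glob.glob(r'C:\my_files\*.csv')
print(get_common_columns(files))
# e.g. {'cola', 'colc', 'cole'} for the example files described in the question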

2) native Python solution with csv and set()

  • same idea, just in native Python
  • read in the column header with `csv`, then take the `set()` intersection of the column names iteratively: `common_columns &= set(columns_from_current_csv)` (a sketch follows below)
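A minimal sketch of that approach (not code from the answer; the helper name is made up, and it assumes every file has a header row):

import csv
import glob

def common_csv_columns(files):
    """Return the set of column names present in every one of the CSV files."""
    common = None
    for filename in files:
        with open(filename, newline='') as f:
            header = next(csv.reader(f), [])  # first row = column names
        common = set(header) if common is None else common & set(header)
    return common or set()

print(common_csv_columns(glob.glob(r'C:\my_files\*.csv')))  # e.g. {'cola', 'colc', 'cole'}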
answered by smci
  • I see what's happening now. I may have asked my original question slightly wrong. I just updated my question slightly. I want to find the column names that appear in all files in the folder of CSV files. – ASH Sep 17 '19 at 01:42
  • What would be the native python solution? Can you show the details please? Let's say I have 4 files with these field names: colA colB colC colD colE. Also, I have 1 file with these names: colA colC colE colX. I would like the final result to show this: colA colC colE. So, colA colC colE are in all sets of column names. – ASH Sep 17 '19 at 15:01
  • Sorry for the confusion before smci. This works fine on different CSV files. There must have been some weird thing in the other files that I tested this on. Thanks so much!!! – ASH Sep 20 '19 at 13:53

Taken from another answer on Stack Overflow:

import csv
from glob import iglob

# Note: this collects the union of all header names seen in any file,
# not the intersection of names common to every file.
unique_headers = set()
for filename in iglob('*.csv'):
    with open(filename, newline='') as fin:  # text mode for csv in Python 3 (not 'rb')
        csvin = csv.reader(fin)
        unique_headers.update(next(csvin, []))  # first row = header

print(unique_headers)

How can I read only the header column of a CSV file using Python?
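As the comment in the snippet notes, this yields every header name that appears in any of the files, whereas the question asks for the names common to all of them. A minimal variation of the same loop (same '*.csv' pattern, assuming each file has a header row) that intersects instead of unioning:

import csv
from glob import iglob

common_headers = None  # stays None until the first file's header seeds the set
for filename in iglob('*.csv'):
    with open(filename, newline='') as fin:
        header = next(csv.reader(fin), [])  # first row = column names
    if common_headers is None:
        common_headers = set(header)
    else:
        common_headers.intersection_update(header)

print(common_headers or set())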