
I have a large number (over 200,000) of separate files, each of which contains a single row and many columns (sometimes upwards of several hundred). There is one column (id) common to all of the files. Otherwise, the column names are semi-random, with incomplete overlap between data frames. Currently, I am using %in% to determine which columns are in common and then merging on those columns. This works perfectly well, although I am confident (maybe hopeful is a better word) that it could be done faster.

For example:

dfone <- data.frame(id = "12fgh", fred = "1", wilma = "2", barney = "1")
dftwo <- data.frame(id = "36fdl", fred = "5", daphne = "3")
common <- names(dfone)[names(dfone) %in% names(dftwo)]
merged <- merge(dfone, dftwo, by = common, all = TRUE)

So, since I'm reading a large number of files, here is what I'm doing right now:

fls <- list.files()
merged <- read.csv(fls[1])

for (fl in fls[-1]) {
  dffl <- read.csv(fl)
  common <- names(dffl)[names(dffl) %in% names(merged)]
  merged <- merge(dffl, merged, by = common, all = TRUE)
  # print(paste(nrow(merged), " rows in dataframe.", sep = ""))
  # flush.console()
  # print(paste("Just did ", fl, ".", sep = ""))
  # flush.console()
}

Obviously, the commented-out section is just a way of keeping track of progress as it runs. It does run, although profoundly slowly, and it slows down further as the merged data frame grows.

(1) I am confident that a loop isn't the right way to do this, but I can't figure out a way to vectorize it.

(2) My hope is that there is some way to do the merge that I'm missing that doesn't involve my column-name comparison kludge.

(3) All of which is to say that this is running way too slowly to be viable.

Any thoughts on how to optimize this mess? Thank you very much in advance.

seehuus
  • `merge` defaults to `by = intersect(names(x), names(y))`, so you don't really need to calculate `common`. The right way to do this is probably [a list of data.frames](http://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames/24376207#24376207). – alistaire Jun 20 '16 at 18:59
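
To illustrate the comment: a minimal sketch, using the example data frames from the question, of dropping the explicit `common` calculation and relying on `merge`'s default join columns (base R only, nothing new assumed):

# `by` defaults to intersect(names(dfone), names(dftwo)), i.e. c("id", "fred"),
# so this is equivalent to the explicit `common` version above
merged <- merge(dfone, dftwo, all = TRUE)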

1 Answer


A much shorter and cleaner approach is to read them all into a list and then merge the list down pairwise:

Reduce(merge, lapply(list.files(), read.csv))

It will still be slow, though. You could speed it up by replacing read.csv with something faster (e.g., data.table::fread) and possibly by replacing lapply with parallel::mclapply.
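
A rough sketch of that faster variant, assuming the files are *.csv in the working directory and that you want the same outer-join behaviour (all = TRUE) as in the question; the core count is an arbitrary example value, and mclapply forks, so it only parallelises on Unix-alikes:

library(data.table)
library(parallel)

fls <- list.files(pattern = "\\.csv$")  # assumption: files are *.csv in the working directory

# fread is much faster than read.csv; mc.cores = 4 is an example value, not a recommendation
dts <- mclapply(fls, fread, mc.cores = 4)

# fold the list down with pairwise full outer joins on the shared column names
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), dts)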

Ista
  • Thanks very much. It also sped things up greatly to reshape the data frames from wide to long and rbind instead of merging, which was faster by an order of magnitude. And I'm sure it would be even faster with rbindlist. – seehuus Jun 21 '16 at 21:19
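
For completeness, a rough sketch of the wide-to-long approach described in this comment, assuming data.table is available, every file has an id column, and the files match a *.csv pattern; the final dcast back to wide is optional and illustrative, not part of the original posts:

library(data.table)

fls <- list.files(pattern = "\\.csv$")  # assumption: files are *.csv in the working directory

# read each single-row file and reshape it to long format:
# one (id, variable, value) row per original column;
# variable.factor = FALSE keeps the variable column as character for clean stacking
long <- rbindlist(lapply(fls, function(fl) {
  melt(fread(fl), id.vars = "id", variable.factor = FALSE)
}))

# optionally pivot back to one wide row per id
wide <- dcast(long, id ~ variable, value.var = "value")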