
I have over 1,000 .csv files in my directory. I need to select the first row (i.e. column names) of each of these files, and put them all into a new dataframe. Each .csv has a different number of columns.

Based on information from other Stack Overflow questions, I've come up with this:

setwd("C:/Users/H300904/R project/files/CSV")

file_list <- list.files(getwd())
field_names <- data.frame(df)

for (file in file_list) {
   field_names <- read.csv(file="file", nrows=1, header=FALSE)
}

I need help referencing 'file' correctly and telling it to put the data on the next row in the new dataframe. Any tips?

Thank you for sharing your knowledge with me.

Alex
  • If you have a different number of columns for each row, I'd suggest using a list of lists to store the result. Is there a specific reason why you want to store the column names in a df? – dario Mar 07 '20 at 19:30
  • Hi Dario. A list of lists might be possible. The only reason I decided on a dataframe was because I need to do a word frequency analysis on this data afterwards, and the guide I found uses a dataframe as a starting point. – Alex Mar 07 '20 at 19:34
  • How about `sapply(field_names, read.csv(file="file", nrows=1, header=FALSE))` in place of your `for` loop? – Russ Thomas Mar 07 '20 at 19:35
  • It's just that, at least in my mind, data.frames are rectangular structures where values in the same column are related (e.g. by measuring the same variable) – dario Mar 07 '20 at 19:36

1 Answer

file_list <- list.files(getwd())

field_names <- list()
for (i in seq_along(file_list)) {
  # [[i]] stores the whole one-row data.frame as list element i
  field_names[[i]] <- read.csv(file=file_list[i], nrows=1, header=FALSE)
}

And then, if you really want to use a data.frame:

library(data.table)
rbindlist(field_names, fill=TRUE)
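If data.table isn't available, roughly the same rectangular result can be built in base R by padding the shorter headers with NA. This is only a sketch of an alternative, not what rbindlist does internally; the small `field_names` list below is a made-up stand-in for the one built above, and `max_cols`, `padded` and `header_df` are names I chose for illustration.

```r
# Hypothetical stand-in for the list built above: one single-row data.frame
# per file, with differing numbers of columns.
field_names <- list(
  data.frame(V1 = "id", V2 = "name", stringsAsFactors = FALSE),
  data.frame(V1 = "id", V2 = "price", V3 = "qty", stringsAsFactors = FALSE)
)

# The widest header determines the number of columns in the result.
max_cols <- max(vapply(field_names, ncol, integer(1)))

# Turn each row into a character vector, padded with NA up to max_cols.
padded <- lapply(field_names, function(df) {
  values <- as.character(unlist(df[1, ], use.names = FALSE))
  c(values, rep(NA_character_, max_cols - length(values)))
})

# Stack the padded vectors into one rectangular data.frame (one row per file).
header_df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(header_df)
```

Each row of `header_df` then holds the column names of one file, with NA in the positions that file doesn't have.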

Edit:

As suggested by @r2evans, lapply is probably the more idiomatic choice than the for loop (though both do (almost) exactly the same thing).

field_names <- lapply(file_list, read.csv, nrows=1, header=FALSE)

Here, lapply iterates through the elements of file_list and passes them as the first argument to read.csv (together with the other arguments nrows=1, header=FALSE). Then lapply combines the results into a list.
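To convince yourself that the loop and the lapply call give the same list, here is a small self-contained sketch. The temporary directory and the two tiny CSV files are invented for the demo (they are not the asker's data), and `loop_result` is just an illustrative name. `stringsAsFactors = FALSE` is only there so the result is the same on R versions before 4.0.

```r
# Hypothetical demo data: two small CSV files with different column counts,
# written to a fresh temporary directory so the sketch runs anywhere.
demo_dir <- tempfile("csv_header_demo")
dir.create(demo_dir)
writeLines(c("id,name", "1,a"), file.path(demo_dir, "a.csv"))
writeLines(c("id,price,qty", "2,9.5,3"), file.path(demo_dir, "b.csv"))

# full.names = TRUE because the files are not in the working directory here.
file_list <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# for-loop version: double brackets store each one-row data.frame whole.
loop_result <- vector("list", length(file_list))
for (i in seq_along(file_list)) {
  loop_result[[i]] <- read.csv(file_list[i], nrows = 1, header = FALSE,
                               stringsAsFactors = FALSE)
}

# lapply version: each file path is passed as the first argument of read.csv.
field_names <- lapply(file_list, read.csv, nrows = 1, header = FALSE,
                      stringsAsFactors = FALSE)

# Both approaches should yield the same list of one-row data.frames.
print(identical(loop_result, field_names))
print(vapply(field_names, ncol, integer(1)))
```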

dario
  • Is there a reason you include `getwd()`? The default is `list.files(".")` which is the current working directory. Perhaps more R-idiomatic (than the `for` loop) is `field_names – r2evans Mar 07 '20 at 20:36
  • @r2evans Thanks for the comment! I just tried to leave the original code alone if it was not totally wrong. Same for the `for` loop. For many (as it was for me), at the beginning of the R learning curve, loops are easier to grasp... But yeah, your suggestion is probably the most straightforward! – dario Mar 07 '20 at 20:42
  • @dario. Thanks for working with my original code. Yes, as a beginner, I can get my head around loops and gave it my best shot. So, thanks, you taught me something! – Alex Mar 07 '20 at 20:54
  • @r2evans. Thanks for your very elegant solution. It seems to have done the trick. Now I know what to do in the future :) – Alex Mar 07 '20 at 20:54
  • Alex, since you're listening :-), a few other thoughts about *reproducibility* and extensibility: (1) add a pattern to `list.files` so that you don't accidentally try to read in unrelated files, perhaps just `pattern="\\.csv$"`; (2) if you ever look anywhere other than in the current directory, include `full.names=TRUE`; (3) this applies to reading the full CSV files, too, cf. https://stackoverflow.com/a/24376207/3358272; (4) I occasionally follow up with `setNames(x, file_list)` or just replace `lapply(...)` with `sapply(..., simplify=FALSE)` to keep file names with each list object, which is useful. – r2evans Mar 07 '20 at 21:58
  • @dario, you say: "And then, if you really want to use a data.frame: `library(data.table) rbindlist(field_names, fill=TRUE)`". This code exports the file to .csv, but the formatting of the .csv isn't good. It splits the field names over two rows. Any ideas on how to export this to an Excel-compatible doc? I have to send this list to a colleague who doesn't use R. – Alex Mar 08 '20 at 15:58
  • @Alex If the answer solved your question you could consider accepting it by clicking the check mark. Regarding your new question: Do you mean `rbindlist(field_names, fill=TRUE)` does the export directly?? Because we could adapt the column names before saving as csv... – dario Mar 08 '20 at 16:03
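
The suggestions from the comments above (a `pattern` for `list.files`, `full.names=TRUE`, and keeping the file names with `setNames`) could be combined roughly as follows. This is only a sketch under assumptions: the temporary directory, the demo files, the output file name `field_names_long.csv` and the "one row per field name" layout are all invented here, and whether that layout suits the later word-frequency step or the Excel export is a guess.

```r
# Hypothetical demo directory with two CSV files and one unrelated file.
demo_dir <- tempfile("csv_header_demo")
dir.create(demo_dir)
writeLines(c("id,name", "1,a"), file.path(demo_dir, "a.csv"))
writeLines(c("id,price,qty", "2,9.5,3"), file.path(demo_dir, "b.csv"))
writeLines("not a csv", file.path(demo_dir, "notes.txt"))

# pattern keeps unrelated files out; full.names makes the paths usable
# from any working directory.
file_list <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read only the first row of each file and keep it as a character vector.
headers <- lapply(file_list, function(f) {
  first_row <- read.csv(f, nrows = 1, header = FALSE, stringsAsFactors = FALSE)
  as.character(unlist(first_row, use.names = FALSE))
})

# Keep the file name attached to each set of field names.
headers <- setNames(headers, basename(file_list))

# Long layout: one row per (field name, file) pair, so the table is
# rectangular no matter how many columns each file has.
long_df <- stack(headers)
names(long_df) <- c("field", "file")

# Plain CSV without row names (assumed output name, written to the demo dir).
out_file <- file.path(demo_dir, "field_names_long.csv")
write.csv(long_df, out_file, row.names = FALSE)
print(long_df)
```

Because every row has exactly two columns, the exported file has no ragged rows.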