Formating a column in csvs while I import them

Question

I'm importing several .csvs that are all two columns wide (they're output from a program) - the first column is wavelength and the second is absorbance, but I'm naming it by the file name to be combined later like from this old stack overflow answer (Combining csv files in R to different columns). The incoming .csvs don't have headers, and I'm aware that the way I'm naming them crops the first data points. I would like for the first column to not have any decimals and standardize all of the numbers to four digits - the code I've added works on its own but not in this block - and I would prefer to do this formatting all in one go. I run into errors with $ not being the right operator, but when I use [] I get errors about that too. The column I need to do this to is the first and it's named 'Wavelength' - which also gives me errors either because wavelength doesn't exist or it's nonnumeric. Any ideas?

This is what my script currently looks like:

for (file in file_list) {
  f <- sub("(.*)\\.CSV", "\\1", file)
  assign(f, read.csv(file = file))
  assign(f, setNames(get(f), c(names(get(f))[0:0], "Wavelength")))
  assign(f, setNames(get(f), c(names(get(f))[1:1], file)))
  floor(f[Wavelength]) #the issues are here
  sprintf("%04d", f$Wavelength) #and here
}

The data looks like this in the csv before it gets processed:

1       401.7664      0.1379457
2       403.8058      0.1390427
3       405.8452      0.1421666
4       407.8847      0.1463629
5       409.9241      0.1477264

I would like the output to be:

     Wavelength  (file name)
1       0401      0.1379457
2       0403      0.1390427
3       0405      0.1421666
4       0407      0.1463629
5       0409      0.1477264

And here's the dput that r2evans asked for:

structure(list(X3.997270e.002 = c(401.7664, 403.8058, 405.8452, 
407.8847, 409.9241, 411.9635), X1.393858e.001 = c(0.1379457, 
0.1390427, 0.1421666, 0.1463629, 0.1477264, 0.1476971)), row.names = 
c(NA, 
6L), class = "data.frame")

Thanks in advance!

6/24 Update: When I assign the column name "Wavelength" it only gets added as a character, not as a real column name? When I dput/head the files once they go through (omitting the sprintf/floor functions) it only lists the file name (the second column). When I open the csvs in R studio the first column is properly labeled - and even further I'm able to combine all the csvs sorted by "Wavelength":

list_csvs <- mget(sub("(.*)\\.CSV", "\\1", file_list))
all_csvs <- Reduce(function(x, y) merge(x, y, all=T, 
      by=c("Wavelength")), list_csvs, accumulate=F)

Naturally I've thought about just formatting the column after this, but some of the decimals are off in the thousands place so I do need to format before I merge the csvs.

I've updated the code to use colnames outside of the read.csv:

for (file in file_list) {
  f <- sub("(.*)\\.CSV", "\\1", file)
  assign(f, read.csv(file = file, 
                     header = FALSE, 
                     row.names = NULL))
  colnames(f) <- c("Wavelength", file)
  print(summary(f))
  print(names(f))
  #floor("Wavelength")   #I'm omitting this to see the console errors
  #sprintf("%04.0f", f["Wavelength"])    #omitting this too
  }

but I get the following error:

attempt to set 'colnames' on an object with less than two dimensions

Without the naming bit and without the sprintf/floor I get this back from the summary and names prompt for each file:

 Length     Class      Mode 
    1 character character 
NULL

When I try to call out the first column by f[1], f[[1]], f[,1], or f[[,1]] I get error messages about 'incorrect number of dimensions'. I can clearly see in the R environment that each data frame has a length of 2. I also double checked with .row_names_info(f) that the first column isn't being read as row names. What am I doing wrong?

It would really help to see sample data and expected output, otherwise we're just shooting in the dark. — r2evans, Jun 19 '20 at 19:14
Sample data is often best in one of two forms: `dput(head(x))` or programmatically with `data.frame(...)`, your choice. The key (for me) is "unambiguous", since console output can sometimes mask data types and/or make it difficult to just copy/paste into our consoles. Something is better than nothing, copyable is better than not, unambiguous (dput/data.frame) is often best. Thanks! — r2evans, Jun 19 '20 at 20:36
@r2evans Any thoughts? I've spent a week on this problem but then again I'm relatively new to R — Mulligan, Jun 25 '20 at 15:36
I'm thinking now that the functions are only addressing the column name and not it's contents - how can I correct for this, if it's the issue? — Mulligan, Jun 25 '20 at 16:05

r2evans · Accepted Answer · 2020-06-25T17:06:10.247

I'm going to suggest a dplyr/tidyr pipe for this.

First, data-setup:

writeLines(
"401.7664,0.1379457
403.8058,0.1390427
405.8452,0.1421666
407.8847,0.1463629
409.9241,0.1477264", "sample1.csv")
file.copy("sample1.csv", "sample2.csv")
file_list <- normalizePath(list.files(pattern = ".*\\.csv$", full.names = TRUE), winslash = "/")
file_list
# [1] "C:/Users/r2/StackOverflow/13765634/sample1.csv"
# [2] "C:/Users/r2/StackOverflow/13765634/sample2.csv"

First, I'm going to suggest a slightly different format: not naming the column for the filename. I like this because I'm still going to preserve the filename with the data (as a category, so to speak), but it allows you to combine all of your data into one frame for more efficient processing:

library(dplyr)
library(purrr) # map*
library(tidyr) # pivot_wider
file_list %>%
  set_names(.) %>%
  # set_names(tools::file_path_sans_ext(basename(.))) %>%
  map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
          .id = "filename") %>%
  mutate(freq = sprintf("%04.0f", freq))
#                                          filename freq       val
# 1  C:/Users/r2/StackOverflow/13765634/sample1.csv 0402 0.1379457
# 2  C:/Users/r2/StackOverflow/13765634/sample1.csv 0404 0.1390427
# 3  C:/Users/r2/StackOverflow/13765634/sample1.csv 0406 0.1421666
# 4  C:/Users/r2/StackOverflow/13765634/sample1.csv 0408 0.1463629
# 5  C:/Users/r2/StackOverflow/13765634/sample1.csv 0410 0.1477264
# 6  C:/Users/r2/StackOverflow/13765634/sample2.csv 0402 0.1379457
# 7  C:/Users/r2/StackOverflow/13765634/sample2.csv 0404 0.1390427
# 8  C:/Users/r2/StackOverflow/13765634/sample2.csv 0406 0.1421666
# 9  C:/Users/r2/StackOverflow/13765634/sample2.csv 0408 0.1463629
# 10 C:/Users/r2/StackOverflow/13765634/sample2.csv 0410 0.1477264

Options: if you prefer just the filename (no path) and are certain that there is no filename collision, then use set_names(basename(.)) instead. (This step is really necessary when using the filename as a column name anyway.) I'll also remove the file extension, since they're likely all .csv or similar.

file_list %>%
  # set_names(.) %>%
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
          .id = "filename") %>%
  mutate(freq = sprintf("%04.0f", freq))
#    filename freq       val
# 1   sample1 0402 0.1379457
# 2   sample1 0404 0.1390427
# 3   sample1 0406 0.1421666
# 4   sample1 0408 0.1463629
# 5   sample1 0410 0.1477264
# 6   sample2 0402 0.1379457
# 7   sample2 0404 0.1390427
# 8   sample2 0406 0.1421666
# 9   sample2 0408 0.1463629
# 10  sample2 0410 0.1477264

(If you need to do something to each dataset at a time, then you should use %>% group_by(filename), not sure if that's relevant.)

If you really need the filename to be the column name of the value, then modify this slightly so that it preserves it as a list:

file_list %>%
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  map(~ read.csv(.x, header = FALSE, col.names = c("freq", "val"))) %>%
  map2(., names(.), ~ transmute(.x, freq = sprintf("%04.0f", freq), !!.y := val))
# $sample1
#   freq   sample1
# 1 0402 0.1379457
# 2 0404 0.1390427
# 3 0406 0.1421666
# 4 0408 0.1463629
# 5 0410 0.1477264
# $sample2
#   freq   sample2
# 1 0402 0.1379457
# 2 0404 0.1390427
# 3 0406 0.1421666
# 4 0408 0.1463629
# 5 0410 0.1477264

But I'm going to infer that ultimately you want to combine these column-wise, assuming there will be alignment in the freq column. (I can't think of another reason why you'd want the column name to be the filename.)

For that, try this, reverting to the first use of map_dfr, introducing pivot_wider:

file_list %>%
  set_names(tools::file_path_sans_ext(basename(.))) %>%
  map_dfr(~ read.csv(.x, header = FALSE, col.names = c("freq", "val")),
          .id = "filename") %>%
  mutate(freq = sprintf("%04.0f", freq)) %>%
  pivot_wider(freq, names_from = filename, values_from = val)
# # A tibble: 5 x 3
#   freq  sample1 sample2
#   <chr>   <dbl>   <dbl>
# 1 0402    0.138   0.138
# 2 0404    0.139   0.139
# 3 0406    0.142   0.142
# 4 0408    0.146   0.146
# 5 0410    0.148   0.148

Notes (perhaps more of a soap-box):

Regarding your use of assign, I strongly discourage this behavior. Since the data is effectively all structured the same, I infer that you'll be doing the same thing to each of these files. In that case, it is much better to use one of the *apply functions on a list of data.frames. That is, instead of having to iterate over a list of variable names, get it, do something, then reassign it ... it is often much easier (to program, to read, to maintain) dats <- lapply(dats, some_function) or dats2 <- lapply(dats, function(x) { ...; x; }).
Regarding the use of filename-as-column-name. Some tools (e.g., ggplot2) really benefit from having "long" data (i.e., one or more category columns such as filename, and one column for each type of data ... type is relative to your understanding of the data). You might benefit from reframing your thinking on working with this data.

Thank you so much! It is such a relief to have this issue solved! I'll take your advice on filenames for columns and using assign - I've been trying sapply recently. — Mulligan, Jun 25 '20 at 17:05

Formating a column in csvs while I import them

1 Answers1