
I have thousands of comma-separated .txt files with two columns. The first column is named "wavelength" and holds the same wavelength ("x") values in every file; the second column is named after the file itself and holds the observed response ("y") values, which vary between files.

If I read in a single file with readr, the format looks like this:

# A tibble: 2,151 x 2
   Wavelength       a1lm_00000.asd.ref.sco.txt  ### [filename]
    <dbl>                  <dbl>
 1        350                 0.0542
 2        351                 0.0661
 3        352                 0.0686
 4        353                 0.0608
 5        354                 0.0545
 6        355                 0.0589
 7        356                 0.0644
 8        357                 0.0587
 9        358                 0.0556
10        359                 0.0519
  ...etc.

The end format I need is:

 Filename                 "350"        "351"       "352"     "353"     etc.

a1lm_00000.asd.ref.sco.txt    0.0542       0.0661      0.0686    0.0608    etc.
a1lm_00001.asd.ref.sco.txt    0.0567       0.0680      0.0704    0.0627    etc.   
  ...etc.

In other words, I need the first column as the file identifier, and each following column a spectral response with the associated spectral wavelength as the column name.

So, I need to read all these files in from a directory, and either:

a.) Create a third column that holds the file name, rename each file's second column to something generic like "response", apply bind_rows to all files, then use "spread" from the tidyr package.

b.) Transpose each file as soon as it is read, so that the first row becomes the column names and the second row's original column name (the file name) goes into a new first column as the row identifier, then row-bind the resulting one-row data frames.

Option b. seems preferable; a rough sketch of what I have in mind follows below. Either option seems like it will need lapply and possibly bind_rows or bind_cols, but I'm not sure how best to do this. There is a lot of data, and a few of the methods I've tried have caused my machine to run out of memory, so the more memory-efficient I can make it, the better.
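Here is a minimal sketch of option b., assuming the files live in a directory called "data_dir" and read cleanly with read_csv (read_and_transpose is a made-up helper name), though I'm not sure it's the most memory-efficient approach:

library(tidyverse)

# read one file, rename its response column, and spread it into a single
# wide row keyed by the file name
read_and_transpose <- function(path) {
    df <- read_csv(path)
    fname <- names(df)[2]              # second column is named after the file
    names(df)[2] <- "response"         # standardise the name so spread() works
    df %>%
        spread(Wavelength, response) %>% # wavelengths become column names
        mutate(Filename = fname) %>%
        select(Filename, everything())
}

files <- list.files("data_dir", pattern = "\\.txt$", full.names = TRUE)
result <- map_dfr(files, read_and_transpose)  # row-bind one row per file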

1 Answer


I recommend storing all data.frames in a list. It then becomes a simple matter of joining the data.frames, converting the data from wide to long, and spreading back to wide with a different key.

library(tidyverse)
reduce(lst, full_join) %>%               # join all data.frames on Wavelength
    gather(file, value, -Wavelength) %>% # wide to long: one row per file and wavelength
    spread(Wavelength, value)            # long back to wide, keyed by Wavelength
#                        file    350    351    352    353    354    355    356
#1 a1lm_00000.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608 0.0545 0.0589 0.0644
#2 a1lm_00001.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608 0.0545 0.0589 0.0644
#     357    358    359
#1 0.0587 0.0556 0.0519
#2 0.0587 0.0556 0.0519

Two more comments:

  1. To store data.frames in a list, I would do something along the lines of map(file_names, ~read_csv2(.x)) (or in base R, lapply(file_names, function(x) read.csv(x))). Adjust file_names and the read_csv2/read.csv parameters as necessary; see the sketch after this list.
  2. More generally, I would probably advise against such a format. It seems much easier to keep data in a list of long (and tidy) data.frames.
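A minimal sketch of point 1, assuming the files sit in a directory called "data_dir" and are plain comma-separated (so read_csv rather than read_csv2):

library(tidyverse)
file_names <- list.files("data_dir", pattern = "\\.txt$", full.names = TRUE)
lst <- map(file_names, read_csv)   # one data.frame per file, stored in a list
names(lst) <- basename(file_names) # optional: name list elements by file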

For completeness, the same can be achieved in base R using Reduce+merge to join data, and stack+reshape to convert from wide to long to wide.

# Reduce+merge joins all data.frames on Wavelength; stack() melts the
# response columns into (values, ind) pairs, cbind() re-attaches the
# wavelengths, and reshape() spreads back to wide keyed by Wavelength
df <- Reduce(merge, lst)
reshape(
    cbind(stack(df, select = -Wavelength), Wavelength = df$Wavelength),
    idvar = "ind", timevar = "Wavelength", direction = "wide")
#                          ind values.350 values.351 values.352 values.353
#1  a1lm_00000.asd.ref.sco.txt     0.0542     0.0661     0.0686     0.0608
#11 a1lm_00001.asd.ref.sco.txt     0.0542     0.0661     0.0686     0.0608
#   values.354 values.355 values.356 values.357 values.358 values.359
#1      0.0545     0.0589     0.0644     0.0587     0.0556     0.0519
#11     0.0545     0.0589     0.0644     0.0587     0.0556     0.0519

Sample data

df1 <- read.table(text =
    "Wavelength       a1lm_00000.asd.ref.sco.txt
1        350                 0.0542
2        351                 0.0661
3        352                 0.0686
4        353                 0.0608
5        354                 0.0545
6        355                 0.0589
7        356                 0.0644
8        357                 0.0587
9        358                 0.0556
10        359                 0.0519", header = T)

df2 <- read.table(text =
    "Wavelength       a1lm_00001.asd.ref.sco.txt
1        350                 0.0542
2        351                 0.0661
3        352                 0.0686
4        353                 0.0608
5        354                 0.0545
6        355                 0.0589
7        356                 0.0644
8        357                 0.0587
9        358                 0.0556
10        359                 0.0519", header = T)

lst <- list(df1, df2)
Maurits Evers
  • This worked. It is still slower than I would like, though. I am trying to emulate what the get_spectra function of the asdreader package does. I can't use get_spectra on my data because I had to change the format slightly for splice correction (by proprietary software), and get_spectra won't read the splice-corrected ASCII file output, only the original raw files. Somehow get_spectra reads the data in much faster than your approach, but I'm not sure how to "look under the hood" of the package to figure out how its code works. – photosynthesis Nov 29 '18 at 21:55
    @photosynthesis I've got no idea what `get_spectra` does. But there are many ways to optimise reading in and processing data from CSV files. For reading, probably the quickest method is `data.table`'s `fread`, in particular when you're dealing with large CSV files (see e.g. [this post](https://stackoverflow.com/questions/51765374/read-csv-faster-than-data-tablefread)). I've also added a base R solution, which you might want to benchmark against the `tidyverse` method. – Maurits Evers Nov 29 '18 at 22:00
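A minimal sketch of the fread route mentioned in the last comment, assuming the same hypothetical "data_dir" layout; data.table also handles the bind and reshape steps efficiently, which may help with the memory concerns:

library(data.table)

file_names <- list.files("data_dir", pattern = "\\.txt$", full.names = TRUE)

# read each file, standardise the response column name, and tag each row
# with the file name before binding everything into one long table
dt <- rbindlist(lapply(file_names, function(f) {
    x <- fread(f)
    setnames(x, 2, "response")
    x[, file := basename(f)]
}))

# long to wide: one row per file, one column per wavelength
dcast(dt, file ~ Wavelength, value.var = "response")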