
I need to read hundreds of .bil files (reproducible example):

d19810101 <- data.frame(ID=c(1:10),year=rep(1981,10),month=rep(1,10),day=rep(1,10),value=c(11:20))
d19810102 <- data.frame(ID=c(1:10),year=rep(1981,10),month=rep(1,10),day=rep(2,10),value=c(12:21))
d19820101 <- data.frame(ID=c(1:10),year=rep(1982,10),month=rep(1,10),day=rep(1,10),value=c(13:22))
d19820102 <- data.frame(ID=c(1:10),year=rep(1982,10),month=rep(1,10),day=rep(2,10),value=c(14:23))

The code I wrote works fine when testing on a small number of files, but when I ran it on the entire set it was extremely slow. Please let me know if there is any way I can improve it. What I need to do is simply get the average of 33 years of daily data. Here is the code for testing on a small number of files:

library(plyr)

years <- 1981:1982
days  <- format(seq(as.Date("1981/1/1"), as.Date("1981/1/2"), "day"), '%m%d')
X_Y <- NULL

for (j in days) {
  for (i in years) {
    # read one file and append it to the running data frame
    XYi <- read.table(paste0(i, j, ".csv"), header = TRUE, sep = ",",
                      stringsAsFactors = FALSE)
    X_Y <- rbind(X_Y, XYi)
    cat(paste0("Data in ", i, j, " are being processed now."), "\n")
  }
  # summarize everything accumulated so far by ID, month and day
  X_Y1 <- ddply(X_Y, .(ID, month, day), summarize, avg = mean(value, na.rm = TRUE))
}

EDIT:

Thank you for all your help! I tried putting the files in a list to read, but since these are .bil files that need their raster characteristics extracted, I got an error; that's why I need to read them one by one. Sorry for not making that clear earlier.

Read.files <- function(file.names, sep = ",") {
  library(raster)
  library(plyr)   # needed for ldply()
  ldply(file.names, function(fn) data.frame(Filename = fn, layer = raster(fn)))
}

data1 <- Read.files(paste0("filenames here", days, ".bil"))

Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class 'structure("RasterLayer", package = "raster")' into a data.frame

EDIT 2:

The data structure of my data is actually the same as the example data, except that mine is grid data that needs to be extracted (using the raster function instead of read.csv) and then put into a data frame. Therefore I need to do the following steps:

for (i in days) {
  layer <- raster(paste0("filename here", i, ".bil"))
  projection <- projection(layer)
  cellsize <- res(layer)[1]
  ...
  s <- resample(layer, r, method = 'ngb')
  XY <- data.frame(rasterToPoints(s))
  names(XY) <- c('Long', 'Lat', 'Data')
}
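Following the advice in the comments below, a sketch of how this loop might store each day's data frame in a list and combine everything once at the end, instead of growing a data frame inside the loop (here r is the template raster from the omitted step above, and the file names are placeholders):

library(raster)

XY_list <- vector("list", length(days))
names(XY_list) <- days

for (i in days) {
  layer <- raster(paste0("filename here", i, ".bil"))
  s <- resample(layer, r, method = 'ngb')   # nearest-neighbour resampling, as above
  XY <- data.frame(rasterToPoints(s))
  names(XY) <- c('Long', 'Lat', 'Data')
  XY$day <- i                               # record which file each row came from
  XY_list[[i]] <- XY                        # store in the list, don't rbind here
}

XY_all <- do.call(rbind, XY_list)           # one rbind at the end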
Rosa
  • `read.table` will always be faster with `colClasses` specified. If there are the same number of columns and their classes are in the same order then that should be added. (Pretty sure this is a duplicate question.) What search strategy did you use before posting? – IRTFM Sep 10 '14 at 22:46
  • [This](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r/1820610#1820610) and [this](http://stackoverflow.com/questions/23271323/r-read-in-multiple-dat-files/23273316#23273316) may help speed up reading in the data. It is better to read in all the data in a list rather than growing the dataframe by iteratively rbinding. – user20650 Sep 10 '14 at 22:57
  • 2
    you can probably improve speed if you read all the individual files into elements of a list and then `do.call(rbind,dataList)` rather than growing the data frame a little bit at a time. – Ben Bolker Sep 10 '14 at 22:58
  • does the timing of when your debugging outputs (from `cat()`) appear give you any hints about where the bottlenecks are? – Ben Bolker Sep 10 '14 at 23:03
  • `read.csv` might even be faster since you've got `header=T`, and `sep=","`, both default args in `read.csv` – Rich Scriven Sep 11 '14 at 17:05
  • the files I need to read are .bil files; the .csv file and read.table function were just to give a general idea of how the data looks, sorry for the confusion – Rosa Sep 11 '14 at 17:15
  • Can you provide a sample of the actual data structure? My solution addresses the example data, but I can't be sure what changes in your use case. – x4nd3r Sep 12 '14 at 04:23
  • Please check EDIT 2 and let me know if you have more questions, thanks! – Rosa Sep 15 '14 at 16:26
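Putting the suggestions from these comments together for the CSV example in the question: read every file into a list, bind once, summarize once. A sketch, assuming the five columns of the example data (ID, year, month, day, value) and placeholder file names:

library(plyr)

# one file name per year/day combination
files <- paste0(rep(years, each = length(days)), days, ".csv")

# read all files into a list; colClasses skips type guessing and speeds up read.table
data_list <- lapply(files, read.table, header = TRUE, sep = ",",
                    colClasses = c("integer", "integer", "integer",
                                   "integer", "numeric"))

X_Y <- do.call(rbind, data_list)   # one rbind instead of many

X_Y1 <- ddply(X_Y, .(ID, month, day), summarize,
              avg = mean(value, na.rm = TRUE))   # summarize once, at the end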

1 Answer


It's hard to tell exactly how you are managing file IO, but I think an easier way to achieve this would be to read the files in, combine them into one data.frame (e.g. using rbind()), and then get the summary statistics you need via tapply():

data <- do.call(rbind, mget(ls(pattern = "d[0-9]*"))) # combine data
with(data, tapply(value, list(month, day), mean))     # get mean for each month and day combination

This assumes you have already read in all of the files, to objects named as in your example.
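If you want the result as a data frame grouped by ID as well, matching the ddply() call in the question, base aggregate() is one alternative (assuming the combined data object from above):

aggregate(value ~ ID + month + day, data = data, FUN = mean, na.rm = TRUE)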

x4nd3r