
Hello experts,

I am trying to read a large file in consecutive blocks of 10000 lines, because the file is too large to read in at once. The "skip" argument of read.csv comes in handy for this task (see below). However, I noticed that the program slows down towards the end of the file (i.e. for large values of i). I suspect this is because each call to read.csv(file, skip=nskip, nrows=block) starts reading the file from the beginning and scans until the required starting line "skip" is reached, which becomes increasingly time-consuming as i grows. Question: is there a way to continue reading a file starting from the location reached by the previous block?

numberOfBlocksInFile <- 800
block <- 10000
for (i in 1:(numberOfBlocksInFile - 1))
{
    print(i)
    nskip <- i * block

    out <- read.csv(file, skip = nskip, nrows = block, header = FALSE)
    colnames(out) <- names   # 'names' holds the column names, read earlier

    .....
    print("keep going")
}

many thanks :-)
user3072048
    Have you seen [this post](http://stackoverflow.com/questions/9352887/strategies-for-reading-in-csv-files-in-pieces), or [this post](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r)? – Blue Magister Dec 05 '13 at 21:23
    Connections can behave that way but read.csv and even read.table and scan will close their connection when terminating. I found this comment in ?seek's help page interesting: "Use of seek on Windows is discouraged. We have found so many errors in the Windows implementation of file positioning that users are advised to use it only at their own risk, and asked not to waste the R developers' time with bug reports on Windows' deficiencies." – IRTFM Dec 05 '13 at 21:44
  • @DWin `read.csv` only closes connections it opens, so open it first as illustrated [here](http://stackoverflow.com/questions/18564689/rloops-to-process-large-datasetgbs-in-chunks/18566079); a minimal sketch of this pattern follows these comments. – Martin Morgan Dec 05 '13 at 21:55
  • Sorry. I couldn't get successive reads to work in the manner you illustrated. After one read the connection was reported to be 'invalid'. – IRTFM Dec 05 '13 at 22:05
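
For reference, a minimal sketch of the pattern Martin Morgan describes: open the connection yourself, so that read.csv leaves it open between calls, and each call then continues from wherever the previous one stopped. The file name "test.csv", the chunk size, and the tryCatch wrapper are illustrative assumptions; read.csv raises an error once the connection is exhausted, which is how the loop detects the end of the file.

con <- file("test.csv", open = "r")   # opened by us, so read.csv will not close it
chunk <- read.csv(con, nrows = 10000) # first chunk; this call also consumes the header
hdr <- names(chunk)
repeat {
    ## ... process 'chunk' here ...
    chunk <- tryCatch(
        read.csv(con, nrows = 10000, header = FALSE, col.names = hdr),
        error = function(e) NULL)     # read.csv errors when no lines remain
    if (is.null(chunk)) break
}
close(con)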

2 Answers


One way is to use readLines with a file connection. For example, you could do something like this:

temp.fpath <- tempfile() # create a temp file for this demo
d <- data.frame(a=letters[1:10], b=1:10) # sample data, 10 rows. we'll read 5 at a time
write.csv(d, temp.fpath, row.names=FALSE) # write the sample data
f.cnxn <- file(temp.fpath, 'r') # open a new connection

fields <- readLines(f.cnxn, n=1) # read the header, which we'll reuse for each block
block.size <- 5

repeat { # keep reading block.size-row chunks until the connection is exhausted
    block.text <- readLines(f.cnxn, n=block.size) # read the next chunk
    if (length(block.text) == 0) # if there's nothing left, leave the loop
        break

    block <- read.csv(text=c(fields, block.text)) # parse the chunk, reusing the header
    print(block)
}

close(f.cnxn)
file.remove(temp.fpath)
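
Because the connection stays open between calls to readLines, each read picks up exactly where the previous one stopped, so the file is never re-scanned from the top.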
Matthew Plourde

Another option is to use fread from the data.table package.

library(data.table)                         ## fread() is in data.table, not read.table

N <- 1e6                                    ## rows per chunk (~1 second to read 1e6 rows/10 cols)
hdr <- names(fread("test.csv", nrows = 0))  ## read the column names once
skip <- 1                                   ## line 1 is the header; data starts on line 2
repeat {
  DT <- fread("test.csv", nrows = N, skip = skip, header = FALSE)
  setnames(DT, hdr)
  ## here use DT for your process
  if (nrow(DT) < N) break                   ## a short chunk means we reached the end
  skip <- skip + N
}
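
Note that fread's skip= still has to locate the starting line on each call, but it does so in optimized C code, so in practice it is much faster than read.csv's skip.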
agstudy