
I need to read a large CSV file (more than 30,000 rows and 10,000 columns), and I have to read the data column by column. This is my code:

con <- file("D:\\Data.csv", "r")
datalist <- list()
for (spalte in 5:5)  # column index to extract (here: the 5th column)
{
  for (i in 1:20000)
  {
    # read one line at a time and split it on commas
    line <- readLines(con, n = 1, warn = FALSE)
    m <- list(as.integer(unlist(strsplit(line, split = ","))))
    # append the value of the requested column to the result
    datalist <- c(datalist, sapply(m, "[[", spalte))
  }
}
close(con)

but this code needs 4 minutes just to read a single column (here, the 5th). What can I do to make this faster?

luiges90
Kaja

2 Answers


Don't invent your own solution to well-solved problems. If read.csv is giving you out-of-memory errors, then:

1) Make sure that you are using 64-bit R (no 4GB RAM limit).

2) Ignore rows or columns that you don't need, to save space. The colbycol package is useful for this.

3) Read the file into a database, and import what you need from there. There are lots of solutions for this; start by reading answers to this SO question.

4) Buy more RAM, or run your analysis on a remote workstation with more RAM (maybe a cloud server) or use an out-of-memory package. See the Task View on High Performance Computing.
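As a sketch of option 2 without any extra package: read.csv can skip columns entirely by setting their colClasses to "NULL", so only the columns you keep ever occupy memory. The file name and column layout below are illustrative; a small CSV is written to a temp file so the snippet is self-contained.

```r
# write a small example CSV so the snippet is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = 4:6, c = 7:9), tmp, row.names = FALSE)

# "NULL" drops a column entirely; NA lets R guess the column's type
keep <- rep("NULL", 3)
keep[2] <- NA
df <- read.csv(tmp, colClasses = keep)

ncol(df)   # only column b survives
```

For a file with 10,000 columns you would build the colClasses vector programmatically, keeping NA only at the indices you need.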

Richie Cotton
  • thank you, but the problem is that I need all of the data in the csv file – Kaja Feb 19 '14 at 15:12
  • There are lots of options; the best one depends on your setup. Are you using 64-bit R? How much RAM do you have on your machine? Can you get or buy more? Can you access a workstation with more RAM, or use a cloud compute service? Do you have access to any databases that you can use as a staging area? Are you really sure you need all 30k columns? – Richie Cotton Feb 19 '14 at 15:25
  • @Kaja Do you realise that this is approximately 2.2 GB of data? It will not be easy, nor quick, to analyse this on a typical computer. – James Feb 19 '14 at 15:31

Try fread(filename). It's in the data.table package and is very fast at reading large files.

system.time(fread('abc.csv'))
user  system elapsed 
0.41    0.00    0.40 

system.time(read.csv('abc.csv'))
user  system elapsed 
2.28    0.02    2.29 

If you are having memory issues then, as Richie suggested, use 64-bit R and try to run on a server, or you can even get an Amazon EC2 machine with a large amount of RAM.
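fread also takes a select argument, which lets you read just the columns you need and addresses the column-by-column requirement directly. This sketch writes a small temp file rather than assuming abc.csv exists:

```r
library(data.table)

# write a small example file so the snippet is self-contained
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(a = 1:5, b = 6:10, c = 11:15), tmp)

# read only the 2nd column; the rest of the file is never parsed into memory
dt <- fread(tmp, select = 2)   # or select = "b" by name
```

On a 10,000-column file this avoids materialising the other 9,999 columns at all.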

user1525721