7

It seems intuitive that .RData files would be the fastest file format for R to load, but when scanning some Stack Overflow posts it seems more attention has gone to improving load times for .csv and other formats. Is there a definitive answer?

Prradep
  • This is a very difficult question to answer correctly. You need to consider converting any given file-reading function into compiled & optimized c- or Fortran code, for example. In addition, since you generally don't have a choice of input format unless you generated the files **in R** in the first place, I'm not sure the answer really matters! – Carl Witthoft Jun 05 '15 at 14:38

1 Answer

7

Not a definitive answer, but below are the times it took to load the same data frame, read in as a .tab file with utils::read.delim(), readr::read_tsv() and data.table::fread(), and as a binary .RData file, all timed with the system.time() function:

.tab with utils::read.delim

system.time(
  read.delim("file.tab")
)
#   user  system elapsed 
# 52.279   0.146  52.465

.tab with readr::read_tsv

system.time(
  read_tsv("file.tab")
)
#   user  system elapsed 
# 23.417   0.839  24.275

.tab with data.table::fread

At @Roman's request, the same ~500MB file loaded in a blistering 3 seconds:

system.time(
  data.table::fread("file.tab")
)
# Read 49739 rows and 3005 (of 3005) columns from 0.400 GB file in 00:00:04
#    user  system elapsed 
#   3.078   0.092   3.172 

.RData binary file of the same dataframe

system.time(
  load("file.RData")
)
#    user  system elapsed 
#   2.181   0.028   2.210

Clearly not definitive (sample size = 1!), but in my case with a 500MB data frame:

  1. Binary .RData is quickest
  2. data.table::fread() is a close second
  3. readr::read_tsv() is an order of magnitude slower
  4. utils::read.delim() is slowest, at roughly half the speed of readr
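If you want to repeat the comparison yourself rather than trust a single run, here is a minimal self-contained sketch using a synthetic data frame (the file names and dimensions here are arbitrary placeholders; it assumes the readr and data.table packages are installed):

```r
# Build a synthetic data frame and write it out as both .tab and .RData
set.seed(1)
df <- as.data.frame(matrix(rnorm(1e5 * 20), nrow = 1e5))
write.table(df, "file.tab", sep = "\t", row.names = FALSE)
save(df, file = "file.RData")

# Time each reader on the same file
t_delim <- system.time(read.delim("file.tab"))
t_tsv   <- system.time(readr::read_tsv("file.tab"))
t_fread <- system.time(data.table::fread("file.tab"))
t_load  <- system.time(load("file.RData"))

# Collect the elapsed times into one table for comparison
rbind(read.delim = t_delim, read_tsv = t_tsv,
      fread = t_fread, load = t_load)[, "elapsed", drop = FALSE]
```

Scaling the matrix dimensions up toward your real file size makes the differences more pronounced; with a tiny file the per-call overhead dominates and all four look similar.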
Phil