
Across the web I read that I should use data.table and fread() to load my data.

But when I run a benchmark, I get the following results:

Unit: milliseconds
expr       min        lq      mean    median        uq        max neval
test1  1.229782  1.280000  1.382249  1.366277  1.460483   1.580176    10
test3  1.294726  1.355139  1.765871  1.391576  1.542041   4.770357    10
test2 23.115503 23.345451 42.307979 25.492186 57.772522 125.941734    10

The code used is shown below.

library(microbenchmark)
library(magrittr)   # for %>%

loadpath <- readRDS("paths.rds")

microbenchmark(
  test1 = read.csv(paste0(loadpath, "data.csv"), header = TRUE, sep = ";", stringsAsFactors = FALSE, colClasses = "character"),
  test2 = data.table::fread(paste0(loadpath, "data.csv"), sep = ";"),
  test3 = read.csv(paste0(loadpath, "data.csv")),
  times = 10
) %>%
  print(order = "min")

I understand that fread() should be faster than read.csv(), because read.csv() first reads the rows into memory as character and then tries to convert them to data types such as integer and factor, whereas fread() simply reads everything as character.

If this is true, shouldn't test2 be faster than test3?

Can someone explain why I do not achieve a speed-up, or at least the same speed, with test2 as with test1? :)
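(As an aside, a minimal sketch reusing the file above: fread() has a verbose argument that prints a breakdown of where its time goes, which helps to see whether the time is spent parsing or in setup.)

data.table::fread(paste0(loadpath, "data.csv"), sep = ";", verbose = TRUE)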

KaZyKa

2 Answers


data.table::fread's significant performance advantage becomes clear if you consider larger files. Here is a fully reproducible example.

  1. Let's generate a CSV file consisting of 10^5 rows and 100 columns

    if (!file.exists("test.csv")) {
        set.seed(2017)
        # 10^5 rows x 100 columns of uniform random numbers
        df <- as.data.frame(matrix(runif(10^5 * 100), nrow = 10^5))
        write.csv(df, "test.csv", quote = FALSE)
    }
    
  2. We run a microbenchmark analysis (note that this may take a couple of minutes depending on your hardware)

    library(microbenchmark)
    res <- microbenchmark(
        read.csv = read.csv("test.csv", header = TRUE, stringsAsFactors = FALSE, colClasses = "numeric"),
        fread = data.table::fread("test.csv", sep = ",", stringsAsFactors = FALSE, colClasses = "numeric"),
        times = 10)
    res
    #          Unit: milliseconds
    #     expr        min         lq       mean     median         uq        max
    # read.csv 17034.2886 17669.8653 19369.1286 18537.7057 20433.4933 23459.4308
    #    fread   287.1108   311.6304   432.8106   356.6992   460.6167   888.6531
    
    
    library(ggplot2)
    autoplot(res)
    

[Plot: autoplot() of the microbenchmark results, comparing read.csv and fread timings]
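On the medians above that is roughly 18538 / 357 ≈ 50x. As a quick sanity check that the input really is large (a small aside, assuming the same test.csv generated in step 1):

    file.size("test.csv") / 1024^2   # size of the generated file in MB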

Maurits Evers

If you take a look at the functions, you can see that fread does more checks than read.csv. If the file you are reading is small, it takes more time to do the checking and preparation for reading than to actually read it.

data.table is considerably faster for big datasets.
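A minimal sketch of that point, using a hypothetical throwaway file (names are illustrative only): with just a few rows, the fixed per-call setup cost dominates the timings, so the ordering can flip.

    library(microbenchmark)

    # tiny file: parsing is negligible, so setup/checking dominates the timings
    tmp <- tempfile(fileext = ".csv")
    write.csv(data.frame(a = 1:5, b = letters[1:5]), tmp, row.names = FALSE)

    microbenchmark(
        read.csv = read.csv(tmp, stringsAsFactors = FALSE),
        fread    = data.table::fread(tmp),
        times = 100
    )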

karen
  • `read.csv` calls `read.table`, which is no less complex at a glance – Moody_Mudskipper Aug 09 '18 at 11:25
  • But the thing is that before calling anything, fread checks some conditions, which probably takes more time than directly reading a small file; for example, `identical` is quite slow – karen Aug 09 '18 at 13:29
  • As I understand it, your answer basically says that it's slower because it tests for things, but `read.table` tests things too and you don't compare them. What bugs me the most is that it seems to be implied that `read.csv` is faster because it's 3 lines long, which doesn't make sense. – Moody_Mudskipper Aug 09 '18 at 13:56
  • Well, the thing is that fread uses the function Creadfile and read.csv uses read.table; Creadfile is probably way faster than the latter but stricter about its input, which means data.table requires longer preparation, but if the data is large enough it is totally worth it. – karen Aug 09 '18 at 19:19