I am using fread from data.table to load CSV files. However, my CSV files use "," as the decimal separator (1.23 is written as 1,23). Unlike read.csv, fread does not seem to accept dec as a parameter.

R) args(fread)
function (input = "test.csv", sep = "auto", sep2 = "auto", nrows = -1,
    header = "auto", na.strings = "NA", stringsAsFactors = FALSE,
    verbose = FALSE, autostart = 30)

Do you see a workaround (an R option to set, maybe) that would let me use fread? It is so much faster that it saves me a lot of time.

PS: colClasses is not yet implemented so setAs cannot be used like in this post

statquant
  • As a workaround you could replace with a fast text editor. – Roland Jan 21 '13 at 14:25
  • I would say "ask the package maintainer", especially since this function is in development: http://stackoverflow.com/questions/14124813/data-table-fread-function – Ben Bolker Jan 21 '13 at 14:26
  • Thank you Roland but I have many files plus some strings may hold `,` in the future... and I do not really want to alter them anyway. Thanks though for the suggestion – statquant Jan 21 '13 at 14:26
  • @Ben Bolker: thanks, I'll file a request, but only if I cannot find a workaround :) – statquant Jan 21 '13 at 14:27
  • PS: from looking at the fread code (link in the comments to the other question), it seems to use `strtod` (reference here: http://www.cplusplus.com/reference/cstdlib/strtod/), which means that implementing comma-separated decimals might be a little tricky (the decimal separator is hard-coded in `strtod`). Following up @Roland's comment, if you're on a system with `sed` (Linux, MacOS, or PC with Cygwin) you can use it to do this conversion on the fly: see http://stackoverflow.com/questions/3439001/how-to-change-the-decimal-separator-with-awk-sed – Ben Bolker Jan 21 '13 at 14:31
  • @Ben Bolker: Thanks I am on Windows but I am using gnuWin32 that implements `sed`, I'll try your suggestion and keep you posted! – statquant Jan 21 '13 at 14:36
  • I've also seen some hints that `strtod` is locale-specific, so it *might* (???) automatically handle comma-separated decimals if your locale is set appropriately? Would be worth an experiment or two with `Sys.setlocale`. – Ben Bolker Jan 21 '13 at 14:39
  • @BenBolker: I learned to be careful with locales: http://stat.ethz.ch/pipermail/r-devel/2012-August/064609.html – cbeleites unhappy with SX Jan 21 '13 at 14:51
  • @BenBolker is spot on. Currently `fread` uses `strtod`. Not sure about locale either. It can be done: ultimately (but not ideally) we could just fork `strtod` to allow the decimal separator to be switched. I'd have to remind myself what R itself does and if it exposes its method via R's C API. If that was efficient enough that would be better. – Matt Dowle Jan 21 '13 at 15:23
  • @Matthew: I will raise a request on R-forge, thanks. – statquant Jan 21 '13 at 15:28

1 Answer

Update Oct 2014: now in v1.9.5

fread now accepts dec=',' (and other non-'.' decimal separators), #917. A new paragraph has been added to ?fread. If you are located in a country that uses dec=',' then it should just work. If not, you will need to read the paragraph for an extra step. In case it somehow breaks dec='.', this new feature can be turned off with options(datatable.fread.dec.experiment=FALSE).
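As a rough illustration of the update above, here is a minimal sketch. It assumes an installed data.table version whose fread() has the dec argument; the inline sample data is made up for the example and is not from the original post.

```r
# Sketch only: requires a data.table release in which fread() accepts `dec`.
if (requireNamespace("data.table", quietly = TRUE)) {
  # Semicolon-separated fields, comma as the decimal mark (made-up sample data)
  DT <- data.table::fread("A;B\n3,14;123\n4,22;456\n", sep = ";", dec = ",")
  str(DT)  # column A should come back as numeric rather than character
}
```

Passing dec explicitly avoids having to touch the session locale at all.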



Previous answer ...

Matt Dowle found a nice workaround using locales. First, my sessionInfo:

sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=C
...

Trying the following shows the culprit:

Sys.localeconv()["decimal_point"]
decimal_point 
          "." 

Setting LC_NUMERIC worked on Ubuntu (Matthew) and WinXP (me):

Sys.setlocale("LC_NUMERIC", "French_France.1252")
[1] "French_France.1252"
Warning message:
In Sys.setlocale("LC_NUMERIC", "French_France.1252") :
  setting 'LC_NUMERIC' may cause R to function strangely

With that locale set, fread now parses "," as the decimal separator:

DT = fread("A,B\n3,14;123\n4,22;456\n",sep=";")
str(DT)
Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
 $ V1: num  3.14 4.22
 $ V2: int  123 456

Values that use "." as the decimal separator are now loaded as strings (as they should be); previously it was the opposite.

DT = fread("A,B\n3.14;123\n4.22;456\n",sep=";")
str(DT)
Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
 $ V1: chr  "3.14" "4.22"
 $ V2: int  123 456
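
Since R itself warns that changing LC_NUMERIC may cause strange behaviour, it may be safer to change it only around the read and restore it afterwards. The following is a minimal sketch using base R only; with_numeric_locale is a hypothetical helper name, not part of R or data.table, and the locale name you pass is platform-dependent (for example "French_France.1252" on Windows).

```r
# Hypothetical helper: evaluate `code` under a temporary LC_NUMERIC locale,
# then restore the previous setting even if `code` fails.
with_numeric_locale <- function(locale, code) {
  old <- Sys.getlocale("LC_NUMERIC")
  on.exit(suppressWarnings(Sys.setlocale("LC_NUMERIC", old)), add = TRUE)
  new <- suppressWarnings(Sys.setlocale("LC_NUMERIC", locale))
  # Sys.setlocale() returns "" when the requested locale is not available
  if (identical(new, "")) stop("locale not available: ", locale)
  force(code)  # `code` is lazy, so it is evaluated here, under the new locale
}

# Usage: inspect the decimal point seen under the "C" locale
with_numeric_locale("C", Sys.localeconv()[["decimal_point"]])
```

A call such as with_numeric_locale("French_France.1252", fread(...)) would then leave the rest of the session on its original locale.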
Matt Dowle
statquant