
I realize that reading a .csv file removes the leading zeros, but for some of my files, it maintains the leading zeros without my having to explicitly set colClasses in read.csv. On the other hand, what's confusing me is in other cases, it DOES remove the leading zeros. So my question is: in which cases does read.csv remove the leading zeros?

user3755880

2 Answers


read.csv, read.table, and related functions first read every field in as a character string. Then, depending on the arguments (colClasses in particular, but others too) and on options, they try to "simplify" each column. If a column looks numeric and you have not told the function otherwise, it is converted to a numeric column, which drops any leading 0's (and any trailing 0's after the decimal point). If something in the column does not look like a number, the column is not converted to numeric; it is either kept as character or converted to a factor, and either way the leading 0's survive. The decision is made from the values in the file, not from what the column means, so a column that is obviously not numeric to you (ZIP codes, ID numbers) may still be converted.
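As a minimal sketch of the two cases, the snippet below reads two tiny in-memory CSVs. The column name `id` and the values are made up for illustration; `stringsAsFactors = FALSE` is passed only so the result is the same on R versions before 4.0.

```r
# Two small in-memory CSVs (hypothetical data).
all_numeric <- "id\n007\n042\n100"
has_letter  <- "id\n007\n042\n10A"

a <- read.csv(text = all_numeric)
b <- read.csv(text = has_letter, stringsAsFactors = FALSE)

class(a$id)  # "integer": every value parses as a number
a$id         # 7 42 100, leading zeros dropped
class(b$id)  # "character": "10A" is not a number
b$id         # "007" "042" "10A", leading zeros kept
```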

The safest approach (and quickest) is to specify colClasses so that R does not need to guess (and you do not need to guess what R is going to guess).
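For example, a sketch with hypothetical columns `zip` and `count`, where the first column is declared as character so nothing is guessed:

```r
# Hypothetical data: ZIP codes that start with 0, plus a count column.
csv_text <- "zip,count\n02134,10\n00501,3"

# colClasses is given positionally, one class per column.
d <- read.csv(text = csv_text, colClasses = c("character", "integer"))

d$zip    # "02134" "00501"
d$count  # 10 3
```

With a real file, the same `colClasses` argument is passed alongside the file path instead of `text =`.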

Greg Snow

Basically a supplement to @GregSnow's answer, from the manual.

All quotes from ?read.csv:

Unless colClasses is specified, all columns are read as character columns and then converted using type.convert to logical, integer, numeric, complex or (depending on as.is) factor as appropriate. Quotes are (by default) interpreted in all fields, so a column of values like "42" will result in an integer column.
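The conversion step can also be tried on its own by calling type.convert directly on a character vector. This is only a sketch with made-up values; `as.is = TRUE` is used so that non-numeric input stays character instead of becoming a factor.

```r
x <- c("007", "042", "100")

int_col <- type.convert(x, as.is = TRUE)              # integer: 7 42 100
num_col <- type.convert(c(x, "1.50"), as.is = TRUE)   # numeric: 7 42 100 1.5
chr_col <- type.convert(c(x, "N/A"), as.is = TRUE)    # character, zeros kept

# Quoted fields are converted as well, as the manual says:
q <- read.csv(text = 'v\n"042"\n"007"')
class(q$v)  # "integer"
```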

Also:

The number of data columns is determined by looking at the first five lines of input...

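It is tempting to read this as saying that read.csv guesses each column's type from the first five lines, but the passage only concerns the number of columns. The type is decided by type.convert from the values in the whole column: if all of them parse as numbers the column becomes integer/numeric (and the leading 0's are dropped), otherwise it is kept as character or converted to a factor (and the leading 0's are kept).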
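A quick experiment can check how far read.csv looks when deciding a column's type: put the only non-numeric value well past the fifth row and see what class comes back. The data here is invented for the test.

```r
# Nine numeric-looking IDs, then one non-numeric value in row 10.
ids  <- c(sprintf("%03d", 1:9), "00X")
late <- read.csv(text = paste(c("id", ids), collapse = "\n"),
                 stringsAsFactors = FALSE)

class(late$id)  # "character": row 10 is taken into account
late$id[1]      # "001", leading zeros kept

# Same shape, but every value is numeric.
clean <- read.csv(text = paste(c("id", sprintf("%03d", 1:10)), collapse = "\n"))
class(clean$id) # "integer", leading zeros dropped
```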

If you're still curious about the details, I suggest examining the code with edit(read.csv) and edit(read.table); both are long-ish but spell out every step of what the function is doing.

Lastly, as an aside, it's generally good practice to specify colClasses:

Less memory will be used if colClasses is specified as one of the six atomic vector classes. This can be particularly so when reading a column that takes many distinct numeric values, as storing each distinct value as a character string can take up to 14 times as much memory as storing it as an integer.

Though if you're really concerned about memory usage/speed, you should really be using fread from data.table; even then, specifying colClasses engenders a speed-up.

MichaelChirico