How to detect the right encoding for read.csv?

Question

I have this file (http://b7hq6v.alterupload.com/en/) that I want to read in R with read.csv, but I cannot work out the correct encoding. It seems to be some kind of UTF-8. I am using R 2.12.1 on a Windows XP machine. Any help?

score 57 · Accepted Answer · edited May 23 '17 at 12:18

First of all, according to a more general question on StackOverflow, it is not possible to detect the encoding of a file with 100% certainty.

I've struggled with this many times and have settled on a non-automatic solution:

Use iconvlist to get all possible encodings:

codepages <- setNames(iconvlist(), iconvlist())

Then read the data using each of them:

x <- lapply(codepages, function(enc) try(read.table("encoding.asc",
                   fileEncoding=enc,
                   nrows=3, header=TRUE, sep="\t"))) # you get lots of errors/warning here

It is important here to know the structure of the file (separator, headers). Set the encoding using the fileEncoding argument, and read only a few rows.
Now you can look at the results:

unique(do.call(rbind, sapply(x, dim)))
#        [,1] [,2]
# 437       14    2
# CP1200     3   29
# CP12000    0    1

It seems the correct one is the one with 3 rows and 29 columns, so let's see which encodings produce it:

maybe_ok <- sapply(x, function(x) isTRUE(all.equal(dim(x), c(3,29))))
codepages[maybe_ok]
#    CP1200    UCS-2LE     UTF-16   UTF-16LE      UTF16    UTF16LE 
#  "CP1200"  "UCS-2LE"   "UTF-16" "UTF-16LE"    "UTF16"  "UTF16LE"

You can look at the data too:

x[maybe_ok]

For your file, all of these encodings return identical data (partly because several of the names are redundant, as you can see).

If you don't know the specifics of your file, you need to use readLines with some changes to the workflow (e.g. you can't use fileEncoding, you must use length instead of dim, and you need to do more magic to find the correct ones).
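
A minimal sketch of that idea, with one swap: instead of readLines it reads the raw bytes once and decodes them with iconv for each candidate (the approach hinted at in the comments), then counts lines with length. The sample file, the short candidate list and the "most lines wins" rule are assumptions for illustration, not a reliable detector:

```r
# Hypothetical sample file: a tiny tab-separated table written as UTF-16LE,
# so the sketch runs on its own. Replace `path` with your real file.
path <- tempfile(fileext = ".asc")
sample_text <- "\tFe 2\tZn\n0\t0,003128\t3,82E-05\n"
writeBin(iconv(sample_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# Read the raw bytes once.
bytes <- readBin(path, what = "raw", n = file.size(path))

# A short, hand-picked candidate list (an assumption); iconvlist() gives all of them.
candidates <- c("UTF-8", "latin1", "UTF-16LE", "UTF-16BE")

# Decode the same bytes with each candidate; anything that fails becomes NA.
decode <- function(enc) {
  tryCatch(iconv(list(bytes), from = enc, to = "UTF-8"),
           error = function(e) NA_character_)
}
decoded <- vapply(candidates, decode, character(1))

# With no known layout, compare the number of lines per decoding (length, not dim).
n_lines <- vapply(decoded, function(txt) {
  if (is.na(txt)) 0L else length(strsplit(txt, "\n", fixed = TRUE)[[1]])
}, integer(1))

n_lines
```

A wrong encoding can still decode without an error and simply yield garbage, so it is worth printing the decoded text of the surviving candidates before trusting any of them.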

I did quite a similar thing with iconvlist(), but with a loop. The crucial thing was the use of "fileEncoding"; I had wrongly used "encoding". Thanks for your help. — Alex, Jan 27 '11 at 14:27
I sketched out a similar approach at https://gist.github.com/837414 - I think it's more efficient to load the data once, and then try out different encodings using `iconv`. — hadley, Nov 22 '12 at 13:42
@Marek Nice trick. At least I know my `read.csv` issue doesn't have to do with `fileEncoding`. — Tunn, Nov 24 '16 at 01:35

score 42 · Answer 2 · answered Mar 07 '16 at 21:56

The readr package (https://cran.r-project.org/web/packages/readr/readr.pdf) includes a function called guess_encoding that estimates, for each of several candidate encodings, the probability that the file is encoded in it:

guess_encoding("your_file", n_max = 1000)
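
A rough sketch of feeding the guess into read.csv, assuming readr is installed. The sample file is made up, and taking the first row as the answer is an assumption that breaks down when several encodings score about the same:

```r
library(readr)

# Hypothetical sample: a small CSV with accented characters, written as UTF-8.
path <- tempfile(fileext = ".csv")
writeLines("name,city\nJos\u00e9,M\u00e1laga\nZo\u00eb,K\u00f6ln", path, useBytes = TRUE)

# guess_encoding() returns a tibble with the columns `encoding` and `confidence`.
guesses <- guess_encoding(path, n_max = 1000)
print(guesses)

# It can come back empty, so check before using the first row.
if (nrow(guesses) == 0) stop("no encoding guess available")
best <- guesses$encoding[1]

# Pass the guess to base R through fileEncoding (not encoding).
df <- read.csv(path, fileEncoding = best)
df
```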

Enrique Pérez Herrero

This option was very nice and easy to use. – MadmanLee Apr 09 '19 at 17:07
Sometimes `guess_encoding` fails to deliver definite results, though. Tried it on 11 csv files and 2 of them were equally split between several encodings. – Dutschke Aug 13 '20 at 15:48

daroczig · Answer 3 · 2011-01-27T07:51:18.617

First, you have to figure out what the encoding of the file is, which cannot be done in R (at least as far as I know). You can use external tools for this, e.g. from Perl or Python, or the file utility under Linux/UNIX.

As @ssmir suggested, you have a UTF-16LE (Unicode) encoding here, so load the file with this encoding and use readLines to see what you have in the first (e.g.) 10 lines:

> f <- file('encoding.asc', open="r", encoding="UTF-16LE")   # UTF-16LE, which is "called" Unicode in Windows
> readLines(f,10)
 [1] "\tFe 2\tZn\tO\tC\tSi\tMn\tP\tS\tAl\tN\tCr\tNi\tMo\tCu\tV\tNb 2\tTi\tB\tZr\tCa\tH\tCo\tMg\tPb 2\tW\tCl\tNa 3\tAr"                                                                                                                          
 [2] ""                                                                                                                                                                                                                                         
 [3] "0\t0,003128\t3,82E-05\t0,0004196\t0\t0,001869\t0,005836\t0,004463\t0,002861\t0,02148\t0\t0,004768\t0,0003052\t0\t0,0037\t0,0391\t0,06409\t0,1157\t0,004654\t0\t0\t0\t0,00824\t7,63E-05\t0,003891\t0,004501\t0\t0,001335\t0,01175"         
 [4] "0,0005\t0,003265\t3,05E-05\t0,0003662\t0\t0,001709\t0,005798\t0,004395\t0,002808\t0,02155\t0\t0,004578\t0,0002441\t0\t0,003601\t0,03897\t0,06406\t0,1158\t0,0047\t0\t0\t0\t0,008026\t6,10E-05\t0,003876\t0,004425\t0\t0,001343\t0,01157"  
 [5] "0,001\t0,003332\t2,54E-05\t0,0003052\t0\t0,001704\t0,005671\t0,0044\t0,002823\t0,02164\t0\t0,004603\t0,0003306\t0\t0,003611\t0,03886\t0,06406\t0,1159\t0,004705\t0\t0\t0\t0,008036\t5,09E-05\t0,003815\t0,004501\t0\t0,001246\t0,01155"   
 [6] "0,0015\t0,003313\t2,18E-05\t0,0002616\t0\t0,001678\t0,005689\t0,004447\t0,002921\t0,02171\t0\t0,004621\t0,0003488\t0\t0,003597\t0,03889\t0,06404\t0,1158\t0,004752\t0\t0\t0\t0,008022\t4,36E-05\t0,003815\t0,004578\t0\t0,001264\t0,01144"
 [7] "0,002\t0,003313\t2,18E-05\t0,0002834\t0\t0,001591\t0,005646\t0,00436\t0,003008\t0,0218\t0\t0,004643\t0,0003488\t0\t0,003619\t0,03895\t0,06383\t0,1159\t0,004752\t0\t0\t0\t0,008\t4,36E-05\t0,003771\t0,004643\t0\t0,001351\t0,01142"      
 [8] "0,0025\t0,003488\t2,18E-05\t0,000218\t0\t0,001657\t0,00558\t0,004338\t0,002986\t0,02175\t0\t0,004469\t0,0002616\t0\t0,00351\t0,03889\t0,06374\t0,1159\t0,004621\t0\t0\t0\t0,008131\t4,36E-05\t0,003771\t0,004708\t0\t0,001243\t0,01125"   
 [9] "0,003\t0,003619\t0\t0,0001526\t0\t0,001591\t0,005668\t0,004207\t0,00303\t0,02169\t0\t0,00449\t0,0002834\t0\t0,00351\t0,03874\t0,06383\t0,116\t0,004665\t0\t0\t0\t0,007956\t0\t0,003749\t0,004796\t0\t0,001286\t0,01125"                   
[10] "0,0035\t0,003422\t0\t4,36E-05\t0\t0,001482\t0,005711\t0,004185\t0,003292\t0,02156\t0\t0,004665\t0,0003488\t0\t0,003553\t0,03852\t0,06391\t0,1158\t0,004708\t0\t0\t0\t0,007717\t0\t0,003597\t0,004905\t0\t0,00133\t0,01136"

From this we can see that there is a header and a blank second line (which read.table skips by default), that the separator is \t, and that the decimal character is ,.

> f <- file('encoding.asc', open="r", encoding="UTF-16LE")
> df <- read.table(f, sep='\t', dec=',', header=TRUE)

And see what we have:

> head(df)
       X     Fe.2       Zn         O C       Si       Mn        P        S
1 0.0000 0.003128 3.82e-05 0.0004196 0 0.001869 0.005836 0.004463 0.002861
2 0.0005 0.003265 3.05e-05 0.0003662 0 0.001709 0.005798 0.004395 0.002808
3 0.0010 0.003332 2.54e-05 0.0003052 0 0.001704 0.005671 0.004400 0.002823
4 0.0015 0.003313 2.18e-05 0.0002616 0 0.001678 0.005689 0.004447 0.002921
5 0.0020 0.003313 2.18e-05 0.0002834 0 0.001591 0.005646 0.004360 0.003008
6 0.0025 0.003488 2.18e-05 0.0002180 0 0.001657 0.005580 0.004338 0.002986
       Al N       Cr        Ni Mo       Cu       V    Nb.2     Ti        B Zr
1 0.02148 0 0.004768 0.0003052  0 0.003700 0.03910 0.06409 0.1157 0.004654  0
2 0.02155 0 0.004578 0.0002441  0 0.003601 0.03897 0.06406 0.1158 0.004700  0
3 0.02164 0 0.004603 0.0003306  0 0.003611 0.03886 0.06406 0.1159 0.004705  0
4 0.02171 0 0.004621 0.0003488  0 0.003597 0.03889 0.06404 0.1158 0.004752  0
5 0.02180 0 0.004643 0.0003488  0 0.003619 0.03895 0.06383 0.1159 0.004752  0
6 0.02175 0 0.004469 0.0002616  0 0.003510 0.03889 0.06374 0.1159 0.004621  0
  Ca H       Co       Mg     Pb.2        W Cl     Na.3      Ar
1  0 0 0.008240 7.63e-05 0.003891 0.004501  0 0.001335 0.01175
2  0 0 0.008026 6.10e-05 0.003876 0.004425  0 0.001343 0.01157
3  0 0 0.008036 5.09e-05 0.003815 0.004501  0 0.001246 0.01155
4  0 0 0.008022 4.36e-05 0.003815 0.004578  0 0.001264 0.01144
5  0 0 0.008000 4.36e-05 0.003771 0.004643  0 0.001351 0.01142
6  0 0 0.008131 4.36e-05 0.003771 0.004708  0 0.001243 0.01125
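
For what it's worth, the same read can be sketched without opening a connection by hand: read.delim2 already defaults to sep = "\t", dec = "," and header = TRUE, so only fileEncoding has to be supplied. The tiny sample file below is made up so that the snippet runs on its own:

```r
# Hypothetical sample: header, blank line and two data rows, written as UTF-16LE.
path <- tempfile(fileext = ".asc")
sample_text <- "\tFe 2\tZn\n\n0\t0,003128\t3,82E-05\n0,0005\t0,003265\t3,05E-05\n"
writeBin(iconv(sample_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]], path)

# read.delim2 = read.table with tab separator, comma decimal mark and a header.
df <- read.delim2(path, fileEncoding = "UTF-16LE")
str(df)
```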

Thanks, it works. But why do I have to skip the first 2 lines? And why doesn't this work in read.csv directly? — Alex, Jan 27 '11 at 07:29
@user590885: you are right, `skip=2` can be omitted (I edited my answer based on that); the second, blank line will be skipped anyway. You can also use the `read.csv` function to read this file (with the same parameters given), but as your file is delimited by tabs rather than commas, I do not think it would be pretty. See `?read.table` for details about the similarities of the functions (the differences are in the defaults). — daroczig, Jan 27 '11 at 07:46

score 2 · Answer 4 · answered Jun 01 '18 at 14:43

In addition to using the readr package, you may also choose to use stringi::stri_enc_detect2. This function is particularly efficient if the locale is known and if you are dealing with some form of UTF or ASCII: "..it turns out that (empirically) stri_enc_detect2 works better than the ICU-based one [stringi::stri_enc_detect used by the guess_encoding] if UTF-* text is provided."

Details on stringi::stri_enc_detect.

Details on stringi::stri_enc_detect2.

Change-request for guess_encoding

ssmir · Answer 5 · 2011-01-26T16:26:33.000

This file has UTF-16LE encoding with a BOM (byte order mark). You should probably use encoding = "UTF-16LE".
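
A byte order mark can be checked from R itself by looking at the first few bytes. This is only a sketch: detect_bom is a made-up helper name, it covers just the UTF-8 and UTF-16 marks, and it says nothing about files that have no BOM:

```r
# Hypothetical helper: return the encoding implied by a leading BOM, or NA.
detect_bom <- function(path) {
  head_bytes <- readBin(path, what = "raw", n = 3)
  starts_with <- function(sig) {
    length(head_bytes) >= length(sig) &&
      identical(head_bytes[seq_along(sig)], as.raw(sig))
  }
  if (starts_with(c(0xEF, 0xBB, 0xBF))) {
    "UTF-8"
  } else if (starts_with(c(0xFF, 0xFE))) {
    "UTF-16LE"
  } else if (starts_with(c(0xFE, 0xFF))) {
    "UTF-16BE"
  } else {
    NA_character_
  }
}

# Made-up sample: the UTF-16LE BOM (FF FE) followed by UTF-16LE text.
with_bom <- tempfile(fileext = ".asc")
body <- iconv("a\tb\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xFF, 0xFE)), body), with_bom)

# The same text without a BOM, for comparison.
no_bom <- tempfile(fileext = ".asc")
writeBin(body, no_bom)

detect_bom(with_bom)
detect_bom(no_bom)
```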

edited Jan 26 '11 at 16:26

answered Jan 26 '11 at 16:21

For completeness of this answer: in `read.table` the proper parameter is `fileEncoding`. – Marek Jan 27 '11 at 09:59

Jason Mercer · Answer 6 · 2020-01-31T18:14:04.897

My tidy update to @marek's solution, since I'm running into the same problem in 2020:

#Libraries
library(magrittr)
library(purrr)

#Make a vector of all the encodings supported by R
encodings <- set_names(iconvlist(), iconvlist())
#Make a simple reader function
reader <- function(encoding, file) {
  read.csv(file, fileEncoding = encoding, nrows = 3, header = TRUE)
}
#Create a "safe" version so we only get warnings, but errors don't stop it
# (May not always be necessary)
safe_reader <- safely(reader)

#Use the safe function with the encodings and the file being interrogated
map(encodings, safe_reader, "<TEST FILE LOCATION GOES HERE>") %>%
  #Return just the results
  map("result") %>%
  #Keep only results that are dataframes
  keep(is.data.frame) %>%
  #Keep only results with more than one column
    #This predicate will need to change with the data
    #I knew this would work, because I could open in a text editor
  keep(~ ncol(.x) > 1) %>%
  #Return the names of the encodings
  names()
