Reading Chinese Language (GB2312) Data

Question

I am trying to read a csv file with Chinese language text in it. The file should look like this:

userid,jobid,Title,companyid,industryids1
82497,1160,互联网产品经理,12
96429,658,企划经理（商业公司）,24
14471,95,产品运营经理,25,6
14471,1708,产品营销高级经理,727,2
14471,1558,产品总监,611,4
14471,1777,产品总监,743,1
14471,1697,产品经理,725,234
14471,1716,度假产品总监 ,730,234
14471,1717,产品经理,730,5

but when I read the data in using read.csv() it looks like this in the R console:

  userid jobid                Title companyid industryids1
1  82497  1160       »¥ÁªÍø²úÆ·¾Àí        12           NA
2  96429   658 Æó»®¾Àí£¨ÉÌÒµ¹«Ë¾£©        24           NA
3  14471    95         ²úÆ·ÔËÓª¾Àí        25            6
4  14471  1708     ²úÆ·ÓªÏú¸ß¼¶¾Àí       727            2
5  14471  1558             ²úÆ·×Ü¼à       611            4
6  14471  1777             ²úÆ·×Ü¼à       743            1
7  14471  1697             ²úÆ·¾Àí       725          234
8  14471  1716        ¶È¼Ù²úÆ·×Ü¼à        730          234
9  14471  1717             ²úÆ·¾Àí       730            5

How can I read this in properly?

Session info:

R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
loaded via a namespace (and not attached):
[1] tools_2.14.1

[What have you tried?](http://whathaveyoutried.com) I suspect that R has at least one Unicode library. — , Oct 26 '12 at 17:47
`read.csv` has an `encoding` argument that you can use if you know how the file is encoded. Otherwise, check out the answer here for a way to find which encoding to use: http://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv — Matthew Plourde, Oct 26 '12 at 18:19

score 1 · Answer 1 · answered Oct 26 '12 at 19:40

Are those characters even representable in the Windows-1252 encoding? I doubt it. Because R is running in that locale, you'll need to switch to one in which those characters can be represented, such as a UTF-8 locale.
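To illustrate the point, here is a minimal sketch of how the garbled console output can arise. It assumes the file is GB2312-encoded (as the question title says) and that the `iconv()` build on your platform knows the names `GB2312` and `CP1252`. The variable names are only for illustration.

```r
# "\u4ea7\u54c1" is the two-character prefix 产品 from the Title column
title <- "\u4ea7\u54c1"

# Encode it the way a GB2312 file would store it on disk
gb_bytes <- iconv(title, from = "UTF-8", to = "GB2312", toRaw = TRUE)[[1]]
print(as.character(gb_bytes))   # "b2" "fa" "c6" "b7"

# Interpret those same bytes as Windows-1252, as an English locale would
misread <- iconv(rawToChar(gb_bytes), from = "CP1252", to = "UTF-8")
print(misread)                  # the mojibake seen in the question: ²úÆ·

# Going the other way: the Chinese text has no Windows-1252 representation,
# so iconv() normally reports the conversion as NA
not_representable <- iconv(title, from = "UTF-8", to = "CP1252")
print(is.na(not_representable))
```

So the bytes in the file are fine; they are just being interpreted with the wrong code page.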

Your example works for me in a suitable locale on Linux (using UTF-8):

> df <- read.csv(text = "userid,jobid,Title,companyid,industryids1
+ 82497,1160,互联网产品经理,12
+ 96429,658,企划经理（商业公司）,24
+ 14471,95,产品运营经理,25,6
+ 14471,1708,产品营销高级经理,727,2
+ 14471,1558,产品总监,611,4
+ 14471,1777,产品总监,743,1
+ 14471,1697,产品经理,725,234
+ 14471,1716,度假产品总监 ,730,234
+ 14471,1717,产品经理,730,5", header = TRUE)
> df
  userid jobid                Title companyid industryids1
1  82497  1160       互联网产品经理        12           NA
2  96429   658 企划经理（商业公司）        24           NA
3  14471    95         产品运营经理        25            6
4  14471  1708     产品营销高级经理       727            2
5  14471  1558             产品总监       611            4
6  14471  1777             产品总监       743            1
7  14471  1697             产品经理       725          234
8  14471  1716        度假产品总监        730          234
9  14471  1717             产品经理       730            5

My sessionInfo() is:

> sessionInfo()
R version 2.15.2 RC (2012-10-22 r60997)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8    
 [5] LC_MONETARY=en_GB.utf8    LC_MESSAGES=en_GB.utf8   
 [7] LC_PAPER=C                LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

loaded via a namespace (and not attached):
[1] tools_2.15.2

So it seems you'll either need to tell R to use a different encoding/locale or, as the R Windows FAQ suggests, try a font for the R GUI console that supports the encoding you need.
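As a rough sketch of the "tell R the encoding" route: `read.csv()` accepts a `fileEncoding` argument that re-encodes the file while reading. The example below builds a small GB2312 file in a temporary location (the path and sample rows are made up for the demonstration) and reads it back. It assumes the session's native encoding can represent Chinese text, for example a UTF-8 locale. In a Windows-1252 session the re-encoding step may still fail or warn.

```r
# Sample rows written with \u escapes so the script itself stays ASCII
lines <- c(
  "userid,jobid,Title,companyid",
  "14471,1558,\u4ea7\u54c1\u603b\u76d1,611",   # 产品总监
  "14471,1697,\u4ea7\u54c1\u7ecf\u7406,725"    # 产品经理
)

# Write the rows as raw GB2312 bytes to a hypothetical temporary file
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
for (bytes in iconv(lines, from = "UTF-8", to = "GB2312", toRaw = TRUE)) {
  writeBin(c(bytes, charToRaw("\n")), con)
}
close(con)

# Declare the file's encoding so R converts it while reading
df <- read.csv(path, fileEncoding = "GB2312", stringsAsFactors = FALSE)
print(df)

unlink(path)  # clean up the temporary file
```

With a real file you would replace `path` with your own file name. If `GB2312` does not cover every character in the file, `GB18030` or `GBK` may be worth trying, depending on what your `iconv` supports (see `iconvlist()`).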

score 0 · Answer 2 · answered Jan 16 '15 at 06:25

I'm working with RStudio (R version 3.1.2) under Windows 7 (64-bit). What I did for Chinese text mining was to set the system language to Chinese (Simplified, PRC):

Control Panel -> Region and Language -> Formats -> Chinese (Simplified, PRC)
Control Panel -> Region and Language -> Administrative -> Change System Locale... -> Chinese (Simplified, PRC)

and then I can check the system info:

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lubridate_1.3.3 tmcn_0.1-3     

loaded via a namespace (and not attached):
 [1] bitops_1.0-6   digest_0.6.8   httr_0.6.1     memoise_0.2.1 
 [5] plyr_1.8.1     Rcpp_0.11.3    RCurl_1.95-4.5 Rwordseg_0.2-1
 [9] stringr_0.6.2  swirl_2.2.21   testthat_0.9.1 tools_3.1.2   
[13] yaml_2.1.13

Also, set every encoding option in RStudio to UTF-8:

File -> Reopen with Encoding -> UTF-8
File -> Save with Encoding -> UTF-8
Tools -> Global Options -> General -> Default text encoding -> UTF-8

After that, there should be no problem reading or saving scripts with Chinese characters or printing them to the console. But I have to say, with the locale set as above, warning and error messages also come up in Chinese characters...

> library(dfsaf)
Error in library(dfsaf) : 不存在叫‘dfsaf’这个名字的程辑包
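If changing the system-wide settings is not an option, something similar can be attempted from inside R for a single session. This is only a sketch: the locale names below (`"Chinese"` on Windows, `"zh_CN.UTF-8"` elsewhere) are assumptions and depend on what is installed, and R's documentation warns that changing the character set in the middle of a session may not work reliably. Setting the `LANGUAGE` environment variable is one way to ask R for English messages again.

```r
# Remember the current setting so it can be restored later
old_ctype <- Sys.getlocale("LC_CTYPE")

# Locale names are platform-specific; both names here are assumptions
target <- if (.Platform$OS.type == "windows") "Chinese" else "zh_CN.UTF-8"

# Sys.setlocale() returns "" (with a warning) if the request cannot be honoured
result <- suppressWarnings(Sys.setlocale("LC_CTYPE", target))
if (identical(result, "")) {
  message("Locale '", target, "' is not available on this system")
}

# Ask R to produce its messages in English regardless of the locale
Sys.setenv(LANGUAGE = "en")
```

To undo the change, `Sys.setlocale("LC_CTYPE", old_ctype)` should put the original character type locale back.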

Good Luck
