
I tried the approach in the following question and am still stuck.

[How to detect the right encoding for read.csv?](http://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv)

The following code should be reproducible... Any ideas? I'd rather not use scan() or readLines() because I've been using this code successfully for assorted state-level ACS data in the past.

My other thought is to edit the text file prior to importing it. However, I store the files zipped and use a script to unzip them and then access the data (roughly as sketched after the code below). Having to edit the file outside of the R environment would really gum up that process. Thanks in advance!

Filename <- "g20095us.txt"
Url <- "http://www2.census.gov/acs2005_2009_5yr/summaryfile/2005-2009_ACSSF_By_State_By_Sequence_Table_Subset/UnitedStates/All_Geographies_Not_Tracts_Block_Groups/"

# Field widths, column classes, and column names for the geography header file
Widths <- c(6,2,3,2,7,1,1,1,2,2,3,5,5,6,1,5,4,5,1,3,5,5,5,3,5,1,1,5,3,5,5,5,2,3,
            3,6,3,5,5,5,5,5,1,1,6,5,5,40,200,6,1,50)
Classes <- c(rep('character',4),'integer',rep('character',47))
Names <- c('fileid','stusab','sumlev','geocomp','logrecno','us','region','division',
           'statece','state','county','cousub','place','tract','blkgrp','concit',
           rep('blank',14),'ua',rep('blank',11),'ur',rep('blank',4),'geoid','name',rep('blank',3))
GeoHeader <- read.fwf(paste0(Url, Filename), widths = Widths,
                      colClasses = Classes, col.names = Names,
                      fill = TRUE, strip.white = TRUE)
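
For reference, the unzip step of my workflow looks roughly like this (a minimal sketch; the zip file name below is illustrative, not the actual Census archive name):

# Minimal sketch of the unzip-and-read workflow; zip name is hypothetical
zipfile <- "UnitedStates_All_Geographies.zip"
exdir <- tempdir()
unzip(zipfile, files = "g20095us.txt", exdir = exdir)
GeoPath <- file.path(exdir, "g20095us.txt")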

Four lines from the file "g20095us.txt" are below. The second one ("Cañoncito") is causing the problems. The other files in the download are CSV, but this one is fixed-width and is necessary to identify the geographies of interest (the organization of the data is not very intuitive).

ACSSF US251000000964 2430 090 25100US2430090 Cameron Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT
ACSSF US251000000965 2430 092 25100US2430092 Cañoncito Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT
ACSSF US251000000966 2430 095 25100US2430095 Casamero Lake Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT
ACSSF US251000000967 2430 105 25100US2430105 Chi Chil Tah Chapter, Navajo Nation Reservation and Off-Reservation Trust Land, AZ--NM--UT

  • So that's a big file! But I grabbed one from DC and looked at it... they look like comma-separated files rather than fixed width. They also read in just fine using `read.csv`. If I'm wrong, post the first few lines from the `g20095us.txt` file in your question so we can avoid the big download – Justin Nov 21 '12 at 15:19
  • Thanks Justin. I forgot that one can directly access the file rather than downloading the entire set of data. The code has been updated to point directly at the file in question (which is the only fixed width file in the zipped set I linked to previously). – Michael Williams Nov 21 '12 at 16:04
  • I find it easiest to use a text editor like `emacs` or a command line tool like `sed` to clean up fixed width files to a more manageable format (e.g. tsv or csv). However, take a look [here](http://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv) for more details on determining file encoding – Justin Nov 21 '12 at 16:16
  • Yeah. I posted the same link at the beginning of the post. And I didn't get any closer after working through the author's suggestions. I might have implemented it incorrectly though. – Michael Williams Nov 21 '12 at 16:26

1 Answer


First, we identify all non-ASCII characters. I do this by converting each string to a raw vector and then looking for values over 127 (the last value that ASCII encodes unambiguously).

lines <- readLines("g20095us.txt")

# TRUE if any byte in the string falls outside the ASCII range (0-127)
non_ascii <- function(x) {
  any(charToRaw(x) > 127)
}

bad <- vapply(lines, non_ascii, logical(1), USE.NAMES = FALSE)
lines[bad]
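
To see exactly what this catches, look at the raw bytes of an offending string. A quick illustration (the literal below embeds the single byte 0xf1, which is "ñ" in latin1):

charToRaw("Ca\xf1oncito")
#> [1] 43 61 f1 6f 6e 63 69 74 6f
# f1 is the only byte above 7f, so non_ascii() returns TRUE here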

We then need to figure out what the correct encoding is. This is challenging when we only have two cases, and it often involves some trial and error. In this case I googled for "encoding \xf1" and discovered Why doesn't this conversion to utf8 work?, which suggested that latin1 might be the correct encoding.

I tested that using iconv(), which converts from one encoding to another (you almost always want UTF-8 as the target):

iconv(lines[bad], "latin1", "utf-8")
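
As a supplementary check, iconv() returns NA when the bytes are invalid in the declared source encoding, which can be used to rule encodings out:

# The stray byte is not valid UTF-8, so this round trip returns NA
iconv(lines[bad], "utf-8", "utf-8")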

Finally, we reload the file with the correct encoding. Confusingly, the encoding argument to the read.* functions doesn't do this; you need to manually specify the encoding on the connection:

fixed <- readLines(file("g20095us.txt", encoding = "latin1"))
fixed[bad]
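
The same connection trick applies to the read.fwf() call from the question (a sketch reusing the Widths, Classes, and Names defined there):

# Declare the encoding on the connection rather than passing a bare path
GeoHeader <- read.fwf(file("g20095us.txt", encoding = "latin1"),
                      widths = Widths, colClasses = Classes,
                      col.names = Names, fill = TRUE, strip.white = TRUE)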
– hadley