4

Sometimes a Byte Order Mark (BOM) is present at the beginning of a .CSV file. The symbol is not visible when you open the file using Notepad or Excel, however, When you read the file in R using various methods, you will different symbols in the name of first column. here is an example

A sample csv file with BOM in the beginning.

ID,title,clean_title,clean_title_id
1,0 - 0,,0
2,"""0 - 1,000,000""",,0
27448,"20yr. rope walker
igger",Rope Walker Igger,1832700817

Reading through read.csv in base R package

(x1 = read.csv("file1.csv",stringsAsFactors = FALSE))
#   ï..ID                raw_title        semi_clean semi_clean_id
# 1     1                    0 - 0                               0
# 2     2          "0 - 1,000,000"                               0
# 3 27448 20yr. rope walker\nigger Rope Walker Igger    1832700817

Reading through fread in data.table package

(x2 = data.table::fread("file1.csv"))
#    ID                raw_title        semi_clean semi_clean_id
# 1:     1                    0 - 0                               0
# 2:     2        ""0 - 1,000,000""                               0
# 3: 27448 20yr. rope walker\rigger Rope Walker Igger    1832700817

Reading through read_csv in readr package

(x3 = readr::read_csv("file1.csv"))
#   <U+FEFF>ID                raw_title        semi_clean semi_clean_id
# 1          1                    0 - 0              <NA>             0
# 2          2          "0 - 1,000,000"              <NA>             0
# 3      27448 20yr. rope walker\rigger Rope Walker Igger    1832700817

You can notice different characters in front of variable name ID.

Here are the results when you run names on all of these

names(x1)
# [1] "ï..ID"         "raw_title"     "semi_clean"    "semi_clean_id"
names(x2)
# [1] "ID"         "raw_title"     "semi_clean"    "semi_clean_id"
names(x3)
# [1] "ID"             "raw_title"     "semi_clean"    "semi_clean_id"

In x3, there is nothing 'visible' in front of ID, but when you check

names(x3)[[1]]=="ID"
# [1] FALSE

How to get rid of these unwanted character in each case. PS: Please add more methods to read csv files, the problem faced and the solutions.

MichaelChirico
  • 31,197
  • 13
  • 98
  • 169
Gaurav Singhal
  • 850
  • 2
  • 7
  • 22

1 Answers1

3

For read.csv in base R use:

x1 = read.csv("file1.csv",stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")

For fread, use:

x2 = fread("file1.csv")
setnames(x2, "ID", "ID")

For read_csv, use:

x3 = readr::read_csv("file1.csv")
setDT(X3) #convert into data tables, so that setnames can be used
setnames(x3, "\uFEFFID", "ID")

One non-R based solution is open the file in Notepad++, save the file after change encoding to "Encoding in UTF-8 without BOM"

Gaurav Singhal
  • 850
  • 2
  • 7
  • 22
  • 2
    For folks coming here looking for authoritative info on BOMs & R after being frustrated with the inanity of Windows data files, you should also hit this —https://github.com/ropenscilabs/gtfsr/issues/19#issuecomment-247766324 — this — https://gist.github.com/hrbrmstr/be3bf6e2b7e8b06648fd — this — https://github.com/hadley/readr/issues/263 — and this — https://github.com/hadley/readr/issues/500. – hrbrmstr Sep 20 '16 at 12:28
  • 1
    Post-hoc approaches are to be avoided, I'd say. Better to attack the problem at its source. Also, don't use `as.data.table` in most cases. Use `setDT`, since it doesn't create a copy. – MichaelChirico Sep 20 '16 at 15:31
  • Thanks @MichaelChirico, I have modified the answer. However I am not skilled enough to even understand how to attack the problem at source. – Gaurav Singhal Sep 20 '16 at 18:43
  • No worries! That's not necessarily on you. I would report the issue for `readr`; in fact, there is already a well-upvoted issue on [`readr`'s GH page](https://github.com/hadley/readr/issues/500) – MichaelChirico Sep 20 '16 at 18:45