4

I am trying to load a comma-delimited data file that also has commas in one of its text columns. The following sample code generates such a file'test.csv',which I'll load usingread.csv()to illustrate my problem.

> d <- data.frame(name = c("John Smith", "Smith, John"), age = c(34, 34))
> d
         name age
1  John Smith  34
2 Smith, John  34
> write.csv(d, file = "test.csv", quote = F, row.names = F)
> d2 <- read.csv("test.csv")
> d2
            name age
John Smith    34  NA
Smith       John  34

Because of the ',' in Smith, John, d2 is not assigned correctly. How do I read the file so that d2 looks exactly like d?

Thanks.

Claus Wilke
  • 13,982
  • 6
  • 41
  • 78
Manojit
  • 375
  • 2
  • 12
  • 1
    The problem is that you wrote the file with `quote=F`. Just take that away and it should work. – G5W Dec 21 '17 at 00:53
  • Thanks, but my problem is the file was sent to me as-is. I used this sample code to create such a file. – Manojit Dec 21 '17 at 00:55
  • 1
    @Manojit You need to _escape_ the CSV entries in the file, because some of you data itself has commas in it. Without quotes, `read.csv` will see this in the second line: `Smith, John, 34` ... i.e. it will see _three_ columns. – Tim Biegeleisen Dec 21 '17 at 01:06
  • @Manojit Is it just the one column that has commas, or are there multiple columns that have this problem. – G5W Dec 21 '17 at 01:10
  • @TimBiegeleisen Thanks. Unfortunately, as I explained before, I did not write the file myself, it was sent to me as is (the`write.csv()`bit is to generate an example file to illustrate my problem). I guess there is no easy way to do this with`read.csv()`or`read.table(),`I may have to edit it using something like Excel. – Manojit Dec 21 '17 at 01:15
  • @G5W only one column – Manojit Dec 21 '17 at 01:16

1 Answers1

5

1) read.pattern read.pattern (in gsubfn package) can read such files:

library(gsubfn)

pat <- "(.*),(.*)"
read.pattern("test.csv", pattern = pat, header = TRUE, as.is = TRUE)

giving:

         name age
1  John Smith  34
2 Smith, John  34

2) two pass Another possibility is to read it in, fix it up and then re-read it. This uses no packages and gives the same output.

L <- readLines("test.csv")
read.table(text = sub("(.*),", "\\1|", L), header = TRUE, sep = "|", as.is = TRUE)

Note: For 3 fields with the third field at the end use this in (1)

pat <- "(.*),([^,]+),([^,]+)"

The same situation use this in (2) assuming that there are non-spaces adjacent to each of the last two commas and at least one space adjacent to any commas in the text field and that fields have at least 2 characters:

text = gsub("(\\S),(\\S)", "\\1|\\2", L)

If you have some other arrangement just modify the regular expression in (1) appropriately and the sub or gsub in (2).

G. Grothendieck
  • 211,268
  • 15
  • 177
  • 297
  • Thanks! This solves the problem with`test.csv.`My actual file has another numerical column before the text column. How should I adapt the`read.table()`syntax for that file? – Manojit Dec 21 '17 at 01:31
  • Have moved comment to Note at end of answer. – G. Grothendieck Dec 21 '17 at 01:54
  • Thanks!`pat – Manojit Dec 21 '17 at 01:58
  • Ok - you already answered it while I was typing my previous comment. Thanks again! – Manojit Dec 21 '17 at 02:03
  • Your answer worked almost perfectly. Unfortunately few of the names in my data file have apostrophes in them (for example, "Smith, John's niece"). The`sub/gsub`syntax somehow merges all data in between two such apostrophes as a single data. Is there a way to include those in`sub/gsub`? – Manojit Dec 21 '17 at 02:43
  • 1
    use `quote = ""` in `read.pattern` and `read.table` to disable quotes. – G. Grothendieck Dec 21 '17 at 03:42