4

I have a tool (exe provided to me), which outputs poorly formatted csv's. They are bad in that the last value can have commas, with no quotes, e.g.:

184500,OBJECT_CALENDAR,,,UNITS_NO_UNITS,NULL,,,,Sched N&S B1,1st,3rd,4S,5th&6th

Where the last string actually begins at 'Sched', so I would expect to see something like this:

184500,OBJECT_CALENDAR,,,UNITS_NO_UNITS,NULL,,,,"Sched N&S B1,1st,3rd,4S,5th&6th"

This is screwing up everything I am trying to do, and I am curious how to address it. Is there a way to define the number of columns in read.csv?

I have tried to read it line by line, but it is slow, and less than elegant:

processFile = function(filepath) {
  i = 1
  vector = character(0)
  theFile = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(theFile, n = 1)
    if ( length(line) == 0 ) {
      break
    } else {
      vector[i] <- line
      i = i+1
    }
  }
  close(theFile)
  formatted <- lapply(strsplit(vector[-1],','), function(x) {c(x[1:9], paste(x[10:length(x)], collapse = ','))})
  finalFrame <- as.data.frame(matrix(unlist(formatted),ncol = 10, byrow = TRUE))
  return(finalFrame)
}

Any better ways to do this? Any base functions that can do this, and if not, any libraries?

lukehawk
  • 1,183
  • 3
  • 17
  • 43

2 Answers2

3

This isn't ideal since you still have to read the file in line by line, but stringr::str_split has a parameter n that specifies the maximum number of splits. If you set pattern = "," and n=10, then it will split your string into only 10 pieces, leaving the last chunk as a single string.

divibisan
  • 8,631
  • 11
  • 31
  • 46
  • 1
    Thanks! I'd rather use just the base library (just cuz I am one of THOSE), but I appreciate the heads up. This is definitely helpful. – lukehawk Mar 22 '18 at 20:56
3

Specifying the classes for each column seems to work in my case. So if you have 4 columns and the 4th one might have varying number of commas, try this:

theData <- read.table(filepath, colClasses=rep("character" ,4))

Of course adjust the number of columns and their classes to your situation. Here is what I get on toy csv file:

> read.table("tmp.csv", colClasses=rep("character" ,4), header=FALSE)
  V1 V2 V3    V4
1 A, B, C,   1&2
2 A, C, C, 1,2,3
3 A, V, X,    12
4 A, V, D,   1,0

Another option would be to use read.csv with fill=TRUE argument

theData <- read.csv(filepath, fill=TRUE)

This will produce a data.frame with number of columns equal to the line with the maximum number of commas. Then you would have to manually combine those split commas into one.

NOTE: this will work in the case when only the last column can have loose commas.

Karolis Koncevičius
  • 7,687
  • 9
  • 48
  • 71
  • And then you could just use ```paste``` to join those extra columns together into a new one and then drop the others. Source for pasting columns: https://stackoverflow.com/questions/18115550/how-to-combine-two-or-more-columns-in-a-dataframe-into-a-new-column-with-a-new-n – divibisan Mar 22 '18 at 21:34