
I am trying to parse the following line into a data array:

"John,Doe","123 Main St","Brown Eyes"

I want to get a data array like the one below:

data(0) = John,Doe
data(1) = 123 Main St
data(2) = Brown Eyes

I used the following CSV parser from a website:

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  override protected val whiteSpace = """[ \t]""".r

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String] = (escaped|nonescaped)
  def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}

But all the spaces get trimmed; the data array actually looks like:

data(0) = John,Doe
data(1) = 123MainSt
data(2) = BrownEyes

How do I stop the CSV parser from removing this whitespace? Thanks!

Edamame
  • There is an error in this approach. It specifically corrupts on column values which contain a valid line break. Because of so many issues, even in the presence of an RFC for the .csv MIME-type, I strongly suggest you use a well-maintained RFC driven native Scala library which optimally handles this problem, kantan.csv: https://nrinaudo.github.io/kantan.csv – chaotic3quilibrium Aug 30 '20 at 20:22

4 Answers


Your code says to take a sequence of escaped or nonescaped tokens and join them with no intervening space:

...* ^^ { case ls => ls.mkString("") }

Per the docs for RegexParsers,

  • The parsing methods call the method skipWhitespace (defaults to true) and, if true, skip any whitespace before each parser is called.
  • Protected val whiteSpace returns a regex that identifies whitespace.

Try turning off skipWhitespace:

override val skipWhitespace = false
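
For illustration, here is a minimal sketch of the question's grammar with whitespace skipping turned off (the rules are copied from the question and lightly reformatted; the object and method names CSVNoTrim/parseLine are just for the sketch):

import scala.util.parsing.combinator._

// Minimal sketch: the question's grammar with whitespace skipping disabled,
// so spaces inside fields are kept instead of being skipped before each token.
object CSVNoTrim extends RegexParsers {
  override val skipWhitespace = false

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { _ => "\"" }   // an escaped double quote
  def CR      = "\r"
  def LF      = "\n"
  def TXT     = "[^\",\r\n]".r            // any single character except " , CR LF

  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String]        = escaped | nonescaped
  def escaped: Parser[String]      = (DQUOTE ~> rep(TXT | COMMA | CR | LF | DQUOTE2) <~ DQUOTE) ^^ { _.mkString }
  def nonescaped: Parser[String]   = rep(TXT) ^^ { _.mkString }

  def parseLine(s: String): List[String] = parseAll(record, s) match {
    case Success(res, _) => res
    case _               => Nil
  }
}

// CSVNoTrim.parseLine("\"John,Doe\",\"123 Main St\",\"Brown Eyes\"")
// should yield List("John,Doe", "123 Main St", "Brown Eyes")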
Rob Starling

Is there a particular reason for hand-writing a CSV decoder instead of using one of the many existing, well-tested ones, like OpenCSV or the Jackson CSV module? It would be much simpler to use an existing lib, and you wouldn't bump into the various issues of unescaping quotes, trimming (or not trimming) spaces, and so on.
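
For instance, a rough sketch with OpenCSV (assuming the com.opencsv:opencsv artifact is on the classpath; treat the details as approximate):

import com.opencsv.CSVReader
import java.io.StringReader

// Rough sketch using OpenCSV: the library handles quoting, embedded commas
// and escaped double quotes, and does not trim spaces inside quoted fields.
object OpenCsvSketch extends App {
  val line = "\"John,Doe\",\"123 Main St\",\"Brown Eyes\""
  val reader = new CSVReader(new StringReader(line))
  try {
    val data: Array[String] = reader.readNext()
    data.zipWithIndex.foreach { case (v, i) => println(s"data($i) = $v") }
  } finally reader.close()
}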

StaxMan
  • Agreed. Because of so many issues, even in the presence of an RFC for the .csv MIME-type, I strongly suggest you use a well-maintained RFC driven native Scala library which optimally handles this problem, kantan.csv: https://nrinaudo.github.io/kantan.csv – chaotic3quilibrium Aug 30 '20 at 20:23

The precise answer to your question was given by Rob Starling: set skipWhitespace to false.

The answer to the question "how do I parse CSV reliably?", which I'm assuming is what you really want to know, is "use a dedicated library".

You can use one of the Java ones - opencsv, commons-csv, jackson-csv, univocity... or one of the Scala ones - product-collections, purecsv, kantan.csv...

Don't write your own without a good reason - I wrote tabulate because I needed better type handling than was available at the time - and if you do, don't use one of the Scala parser combinator libraries: they load the whole data as a string in memory before parsing, which doesn't scale at all when your data starts growing.

If you must write your own and want to use a parser combinator library (because, let's face it, it's a fun problem and those libraries are cool), consider fastparse instead, or parboiled, which are both of a higher quality than the standard Scala one.
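
As an illustration, here's a rough sketch using kantan.csv's ops syntax (assuming a recent kantan.csv-core on the classpath; the API has changed between versions, so check the docs rather than taking this verbatim):

import kantan.csv._      // CsvConfiguration, ReadResult, rfc, ...
import kantan.csv.ops._  // enrichment adding readCsv / asCsvReader to strings, files, ...

val input = "\"John,Doe\",\"123 Main St\",\"Brown Eyes\""

// Decode every row as a List[String]; rfc is the default RFC 4180 configuration.
// ReadResult[A] is an Either-like type carrying a ReadError on failure.
val rows: List[ReadResult[List[String]]] = input.readCsv[List, List[String]](rfc)
// rows: List(Success/Right(List(John,Doe, 123 Main St, Brown Eyes)))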

Nicolas Rinaudo

You can do this job in one line:

line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")

The regex is from here; apply it to every line of your file. It splits only on commas that are outside double quotes.

Code to parse the CSV file:

scala> scala.io.Source.fromFile("toto.csv").getLines.toList.map(_.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))
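
For the sample line from the question, that looks roughly like this; note that the surrounding quotes stay attached to each value, so you may want to strip them afterwards:

val line = "\"John,Doe\",\"123 Main St\",\"Brown Eyes\""

// Split on commas that are outside double quotes, then drop the surrounding quotes.
val data: Array[String] = line
  .split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
  .map(_.stripPrefix("\"").stripSuffix("\""))

// data(0) = John,Doe   data(1) = 123 Main St   data(2) = Brown Eyes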

alifirat
  • That's code to parse a subset of CSV file. What happens if one of the values contains a double quote? – Nicolas Rinaudo Jan 06 '16 at 12:41
  • Per @NicolasRinaudo's comment, There is an error in this approach. It specifically corrupts on column values which contain a valid line break. Because of so many issues, even in the presence of an RFC for the .csv MIME-type, I strongly suggest you use a well-maintained RFC driven native Scala library which optimally handles this problem, kantan.csv: https://nrinaudo.github.io/kantan.csv – chaotic3quilibrium Aug 30 '20 at 20:23