UPDATE: Pass this:

#",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"

to clojure.string/split to parse CSV.
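
As a quick check, here is a minimal sketch of that call. The name `unquoted-comma` is only illustrative, and the pattern assumes the quotes in each line are balanced and that no field contains a line break.

```clojure
(require '[clojure.string :as str])

;; Matches a comma only when an even number of double quotes follows it,
;; i.e. when the comma is not inside a quoted field.
;; `unquoted-comma` is a hypothetical name used for this example.
(def unquoted-comma #",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")

(str/split "Hello,\"Newell, Gabe\",world" unquoted-comma)
;; => ["Hello" "\"Newell, Gabe\"" "world"]
```

The quotes stay in the field, so strip them afterwards if you only want the text between them.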

UPDATE: I need a regex that matches all commas that are not in quotes, in a form that can be used by clojure.string/split.

I have written a CSV parse function in Clojure:

(defn parse-csv [data schema]
  (let [split-data (clojure.string/split data #",")]
    (loop [rm-data split-data
           rm-keys (:keys schema)
           rm-trans (:trans schema)
           final {}]
      (if (empty? rm-keys)
        final
        (recur (rest rm-data)
               (rest rm-keys)
               (rest rm-trans)
               (into final
                     {(first rm-keys)
                      ((first rm-trans) (first rm-data))}))))))

schema is simply a hash map consisting of a list of keywords and a list of functions (which are applied to their respective values). This is used to define how the output hash map will look. Here's an example:

(def schema {:keys [:foo :bar :baz] :trans [identity read-string identity]})
(parse-csv "Hello,42,world" schema) ;; returns {:foo "Hello", :bar 42, :baz "world"}

However, if we do this:

(def schema {:keys [:foo :bar :baz] :trans [identity identity identity]})
(parse-csv "Hello,\"Newell, Gabe\",world" schema) ;; returns {:foo "Hello", :bar "\"Newell", :baz " Gabe\""}

The quoted field is split at its inner comma, and the word "world" is dropped. The result should look like:

{:foo "Hello" :bar "\"Newell, Gabe\"" :baz "world"}

The above data, in a file, would actually look like Hello,"Newell, Gabe",world, so we need to avoid triggering the split function when it comes across the comma in "Newell, Gabe".

We need a function that will split a string by a certain character unless the certain character is in quotes.
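
One possible shape for this, as a rough sketch: split on a lookahead regex that only matches commas followed by an even number of quotes, then pair the fields with the schema. The name `parse-csv-quoted` is made up for this example, and it assumes balanced quotes and one record per string. Unlike the loop version, it silently leaves out keys that have no matching field.

```clojure
(require '[clojure.string :as str])

;; Hypothetical variant of parse-csv: same schema format, but the split
;; ignores commas that sit inside double quotes.
(defn parse-csv-quoted
  [data schema]
  (let [fields (str/split data #",(?=([^\"]*\"[^\"]*\")*[^\"]*$)")]
    ;; Apply each transform to its field, then pair the results with the keys.
    (zipmap (:keys schema)
            (map (fn [f v] (f v)) (:trans schema) fields))))

(parse-csv-quoted "Hello,\"Newell, Gabe\",world"
                  {:keys [:foo :bar :baz]
                   :trans [identity identity identity]})
;; => {:foo "Hello", :bar "\"Newell, Gabe\"", :baz "world"}
```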

the_rover
  • In order to allow commas inside quoted strings, you would need to first split out strings, then make fields out of what is left. Or use a real parser. – noisesmith Dec 26 '15 at 20:57
  • https://github.com/clojure/data.csv reads this correctly with default options: `(require '[clojure.data.csv :as csv]) (csv/read-csv "Hello,\"Newell, Gabe\",world") ; => (["Hello" "Newell, Gabe" "world"])` – cfrick Dec 27 '15 at 12:01

1 Answer

To allow separators inside fields (here ,), those fields need to be quoted.

What you are doing with \" is escaping the quotes, Unix style. In CSV, a quote can also be escaped with a second quote.

So, to allow a , inside the field:

"Hello,"Newell, Gabe",world"

Those outer quotes should not be part of the CSV, of course.

Update after the question was edited:

The above data, in a file, would actually look like Hello,"Newell, Gabe",world, so we need to avoid triggering the split function when it comes across the comma in "Newell, Gabe".

Hello,"Newell, Gabe",world

This is perfectly valid CSV, but if it is processed with a plain split on the , you are in trouble.

One option might be to use another separator, like | or ; for example.

Update 2:

so I need a new function that splits the CSV string into fields unless the comma is in quotes (I can't change the dataset). How would I go about implementing this?

For each line, you would need to scan over every character, somewhat like this in pseudocode (I don't know Clojure, so I can't give you real code, but I work a lot with large CSV files):

bool InsideQuotes = false;
loop through chars
  if `,` and InsideQuotes == false -> new field
  if `"` -> InsideQuotes = !InsideQuotes

This way quotes inside a quoted field can be escaped with another quote. For example:

Hello,"17"" monitor, Samsung",world
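
For reference, the scan above could look roughly like this in Clojure. It is only a sketch: `split-unquoted` is a made-up name, the quote characters are kept in the fields as-is, and building each field with `str` is simple rather than fast.

```clojure
;; Hypothetical helper: splits `line` on the character `sep`, except where
;; `sep` appears between double quotes. Quotes are kept in the output.
(defn split-unquoted
  [line sep]
  (loop [cs (seq line)
         inside-quotes? false
         field ""
         fields []]
    (if (empty? cs)
      ;; End of input: the field being built is the last one.
      (conj fields field)
      (let [c (first cs)]
        (cond
          ;; Separator outside quotes: close the current field.
          (and (= c sep) (not inside-quotes?))
          (recur (rest cs) false "" (conj fields field))

          ;; Quote: keep it and flip the flag. A doubled quote flips it
          ;; twice, so the scan stays inside the quoted field.
          (= c \")
          (recur (rest cs) (not inside-quotes?) (str field c) fields)

          ;; Anything else, including a separator inside quotes.
          :else
          (recur (rest cs) inside-quotes? (str field c) fields))))))

(split-unquoted "Hello,\"17\"\" monitor, Samsung\",world" \,)
;; => ["Hello" "\"17\"\" monitor, Samsung\"" "world"]
```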

Update:

For some Regex on csv see this and this.

Danny_ds
  • The data, in a file, would actually look like `Hello,"Newell, Gabe",world`, so we need to avoid triggering the `split` function when it comes across the comma in `"Newell, Gabe"`. – the_rover Dec 26 '15 at 20:59
  • Yes, that should do it. – Danny_ds Dec 26 '15 at 21:01
  • Ok, your comment (and question) has changed - see update in answer. – Danny_ds Dec 26 '15 at 21:18
  • After updated answer: I am using my CSV "reader thingy" on someone else's (multi-megabyte) dataset, so I need a new function that splits the CSV string into fields *unless* the comma is in quotes (I can't change the dataset). How would I go about implementing this? – the_rover Dec 26 '15 at 21:19
  • Ooh, sorry - didn't want to offend you! I thought the csv parser was part of Clojure.. Title says: "Clojure CSV parser". I'll update the answer some more. – Danny_ds Dec 26 '15 at 21:25