
A file contains the following two types of records, where each record has four entries:

abc, 12:30, love coding, re0*10

cde, informative, "love coding, abcd,ab/cd", 0

The usage scenario is this: given a file of 1000 records, each record will be put into a row of a table, and each entry into the corresponding cell of that row. I would like a regex that captures the four entries of each record.

For the first type of record, I can use the following pattern to capture the four entries:

 ^([^,]*),([^,]*),([^,]*),([^,]*)$

For the second type of record, I can use:

^([^,]*),([^,]*),"([.*])",([^,]*)$

But how can I write a single regular expression that captures both patterns, so that it can be used to process the whole file?

user288609
  • I'd recommend using some CSV-parsing framework instead. – Mena Mar 24 '16 at 15:55
  • Possible duplicate of: [Java: splitting a comma-separated string but ignoring commas in quotes](http://stackoverflow.com/questions/1757065/java-splitting-a-comma-separated-string-but-ignoring-commas-in-quotes) - The top/accepted answer there gives a regex that achieves the behavior you want. – Kevin Cruijssen Mar 24 '16 at 15:57
  • I doubt that the second regex works... – fabian Mar 24 '16 at 16:01
  • @fabian it does. Because of backtracking. – f1sh Mar 24 '16 at 16:05
  • @f1sh not the last time I checked... Which is 5 sec ago... – fabian Mar 24 '16 at 16:08
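
To illustrate the CSV-parser suggestion from the comments, here is a minimal sketch. It assumes Python and its standard csv module, since the question does not name a language; the helper name parse_records is made up for this example.

```python
import csv
import io

# The two sample records from the question, as they would appear in the file.
SAMPLE = (
    'abc, 12:30, love coding, re0*10\n'
    'cde, informative, "love coding, abcd,ab/cd", 0\n'
)

def parse_records(text):
    """Return each record as a list of its entries (hypothetical helper)."""
    # skipinitialspace=True drops the blank after each comma, which also lets
    # the reader treat a quote that follows ", " as the start of a quoted field.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [row for row in reader if row]

rows = parse_records(SAMPLE)
for row in rows:
    print(row)
```

With this approach the quotes are removed by the parser and the commas inside the quoted entry are kept as part of that entry.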

3 Answers


You could use the alternation operator "|".

Like this:

^([^,]*), ([^,]*), (?:(".*")|([^,]*)), ([^,]*)$
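
As a rough sketch of how such an alternation could be applied to both sample records (Python's re module and the helper name entries are assumptions, not part of the answer): whitespace inside a pattern is matched literally, so the alternation group below contains no padding spaces. Only one of the two middle groups is populated for a given line, so the helper picks whichever one matched.

```python
import re

# Third entry is either a quoted string (group 3) or a comma-free run (group 4).
LINE = re.compile(r'^([^,]*), ([^,]*), (?:(".*")|([^,]*)), ([^,]*)$')

def entries(line):
    """Return the four entries of a record, or None if the line does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    first, second, quoted, plain, last = m.groups()
    # Exactly one of quoted/plain is set; drop the surrounding quotes if present.
    third = quoted[1:-1] if quoted is not None else plain
    return [first, second, third, last]

print(entries('abc, 12:30, love coding, re0*10'))
print(entries('cde, informative, "love coding, abcd,ab/cd", 0'))
```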
Lucas Araujo

To be able to match both lines you can use alternation like this:

^("[^"]*"|[^,]*), *("[^"]*"|[^,]*), *("[^"]*"|[^,]*), *("[^"]*"|[^,]*)$

("[^"]*"|[^,]*) in each cell matches either a quoted value or anything that is not a comma. Note that it doesn't take care of unbalanced or escaped quoted strings.
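
A small sketch of the pattern above in use, assuming Python's re module (the names CELL, LINE and split_record are made up for illustration). It builds the same four-cell expression by joining the cell pattern with ", *" and also shows the unbalanced-quote limitation mentioned above.

```python
import re

# One cell: either a quoted value or anything that is not a comma.
CELL = r'("[^"]*"|[^,]*)'
# Four cells separated by a comma and optional spaces, as in the answer.
LINE = re.compile('^' + ', *'.join([CELL] * 4) + '$')

def split_record(line):
    """Return the four captured cells as a tuple, or None if there is no match."""
    m = LINE.match(line)
    return m.groups() if m else None

print(split_record('abc, 12:30, love coding, re0*10'))
# Quoted cells keep their surrounding quotes inside the capture.
print(split_record('cde, informative, "love coding, abcd,ab/cd", 0'))
# An unbalanced quote is not rejected: the cell is simply split at the comma.
print(split_record('a, b, "c, d'))
```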


anubhava

I'll only post the solution for a single part of the line (the part corresponding to one capturing group, which may be surrounded by quotes). I think you can continue from there yourself.

"?((?<=")[^"]*(?=")|(?<!")[^,]*(?!"))"?

This uses lookarounds to take care of the quotes, so the groups stay the same as in the original regexes.

Any quotes end up outside the capturing group, and the regex is meant to match only if there are quotes on both sides of the capturing group ((?<=")[^"]*(?=")) or on neither side ((?<!")[^,]*(?!")).
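
One possible way to continue from the single-cell pattern, sketched in Python's re module (an assumption; the question names no language). Joining four cells with ", *" is borrowed from the separator used elsewhere on this page and is not part of this answer; the names CELL, LINE and capture are hypothetical.

```python
import re

# The single-cell pattern from above: optional quotes outside the group,
# lookarounds inside it to check that the quotes are paired.
CELL = r'"?((?<=")[^"]*(?=")|(?<!")[^,]*(?!"))"?'
# Assumed assembly: four such cells separated by a comma and optional spaces.
LINE = re.compile('^' + ', *'.join([CELL] * 4) + '$')

def capture(line):
    """Return the four captured entries, or None if the line does not match."""
    m = LINE.match(line)
    return m.groups() if m else None

print(capture('abc, 12:30, love coding, re0*10'))
# The quotes around the third entry are consumed outside the capturing group.
print(capture('cde, informative, "love coding, abcd,ab/cd", 0'))
```

Compared with capturing the quotes and stripping them afterwards, this keeps exactly four groups per line, at the cost of a harder-to-read cell pattern.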

fabian