2

I'm having a lot of difficulty with spaces in Java while using regular expressions. The assignment is to split a comma-separated input string like:

J,Project report,"F, G, I",1

into separate strings containing:
J
Project report
F, G, I
1
if that makes sense. I'm using a Scanner to split the string. The regex (and code) I'm using is:

while (t.hasNext("([a-zA-Z0-9]| )*(\".+\")*,?")) {
    System.out.println("t.next is : " + t.next());
}

...where t is a Scanner over the input string described above. But the condition never appears to evaluate to true, as nothing is printed. The closest I can get to something that works is using ".*" as my regex, but that separates at spaces, and I need to separate only at the commas that are NOT within quotation marks. Can anyone assist? Thank you.

RichW
Stefan Arambasich
  • Must you use regular expressions? They are perhaps not the best tool for the job here. – stew Dec 10 '11 at 05:46
  • No it's not required, but simply using a "," as a delimiter won't work, as F, G, and I will be separated when they are to be part of one string. This can admittedly be simpler, but I am not sure how to approach it. – Stefan Arambasich Dec 10 '11 at 05:48

4 Answers

1

This CAN be done with a regular expression, but a regular expression is perhaps not the best tool for the job. The expression you are going to end up with is going to be hard to read/maintain, and isn't necessarily going to be any more efficient.

Without going into too much detail, as this is your homework, not mine, I'd think about this another way:

You need a stateful scanner. You have two states, "I'm in the middle of quotes" and "I'm not". Scan the string character by character, and each character will cause you to either accumulate a future result, emit a result, or change states.

If this needs to be more robust, it might need even more states, for example if you also need to parse something like:

a,"b\"c",d
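
A minimal sketch of the two-state scanner described above, assuming one record per string and no escaped quotes (so the `a,"b\"c",d` case is deliberately not handled). The class name `QuoteAwareSplitter` and the method `split` are made up for illustration, and dropping the quote characters from the output is my choice, based on the expected output in the question.

```java
import java.util.ArrayList;
import java.util.List;

final class QuoteAwareSplitter {

    /** Splits one line at commas that are not inside double quotes. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // the whole state: inside quotes or not

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;            // change state, drop the quote itself
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());  // emit a result
                current.setLength(0);
            } else {
                current.append(c);               // accumulate a future result
            }
        }
        fields.add(current.toString());          // emit the final field
        return fields;
    }

    public static void main(String[] args) {
        // Prints the four fields from the question, one per line.
        for (String field : split("J,Project report,\"F, G, I\",1")) {
            System.out.println(field);
        }
    }
}
```

Supporting escaped quotes would likely mean adding a third state ("the previous character was a backslash") rather than changing the loop structure.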
stew
  • No, it doesn't need to be very robust. This is the entire file: http://pastebin.com/imej2AYM I don't need to worry about anything else besides these cases. I like the state idea too. – Stefan Arambasich Dec 10 '11 at 05:56
1

Try this out:

",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"

Reference: Java: splitting a comma-separated string but ignoring commas in quotes

Also, http://regexpal.com/ is a really neat and useful tool when it comes to testing out regexes :)
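
To make the regex less opaque: it matches a comma only when the lookahead `(?=([^\"]*\"[^\"]*\")*[^\"]*$)` succeeds, meaning the rest of the input contains an even number of double quotes. A comma inside a quoted field has an odd number of quotes after it, so it is skipped. Below is a small sketch using `String.split`; the class and constant names are invented for the example, and it assumes the quotes in each line are balanced.

```java
import java.util.Arrays;
import java.util.List;

final class RegexCsvSplit {

    // A comma followed by an even number of double quotes up to the end of the input.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> split(String line) {
        return Arrays.asList(line.split(COMMA_OUTSIDE_QUOTES));
    }

    public static void main(String[] args) {
        // The third field keeps its surrounding quotes: "F, G, I"
        for (String field : split("J,Project report,\"F, G, I\",1")) {
            System.out.println(field);
        }
    }
}
```

Note that the quoted field comes back with its quotation marks still attached, so they have to be stripped separately if they are not wanted.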

Community
Hristo
1

I agree with the suggestion that a robust third-party CSV library is the way to go. However, here is how you can use Scanner.

Scanner t = new Scanner(new File("test.csv"));
// Java string literals need double quotes; the delimiter is a comma outside quoted text
t.useDelimiter(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
while (t.hasNext()) {
    System.out.println(t.next());
}

I used the regex from @Hristo's answer.
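
One caveat with a comma-only delimiter on a multi-line input: a line break is not a delimiter, so the last field of one line and the first field of the next come back as a single token. Here is a hedged sketch of one way around that, reading line by line and running a second Scanner over each line. The names `LineByLineCsv`, `parse` and `stripQuotes` are hypothetical, and the input is passed as a String rather than a File to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class LineByLineCsv {

    // Same delimiter as above: a comma that is not inside double quotes.
    static final String DELIMITER = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    /** Returns one list of fields per input line. */
    static List<List<String>> parse(String text) {
        List<List<String>> rows = new ArrayList<>();
        try (Scanner lines = new Scanner(text)) {
            while (lines.hasNextLine()) {
                List<String> row = new ArrayList<>();
                // A fresh Scanner per line, so "$" in the delimiter means end of that line.
                try (Scanner fields = new Scanner(lines.nextLine())) {
                    fields.useDelimiter(DELIMITER);
                    while (fields.hasNext()) {
                        row.add(stripQuotes(fields.next()));
                    }
                }
                rows.add(row);
            }
        }
        return rows;
    }

    /** Removes one pair of surrounding double quotes, if present. */
    static String stripQuotes(String s) {
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        String text = "J,Project report,\"F, G, I\",1\nK,Notes,\"A, B\",2";
        for (List<String> row : parse(text)) {
            System.out.println(row);
        }
    }
}
```

This still assumes that quoted fields never contain line breaks or escaped quotes, which is where a CSV library starts to pay off.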

JRideout
  • Thank you JRideout! Third-party is good but it's always nice to be able to solve a problem yourself without handing it off to someone else to finish - I want to understand what's going on and be able to understand the gory details to make me a better programmer. But maybe I shouldn't be using Java in that case? ;) The assignment wasn't actually mine, I was helping a friend, but this will help a lot thank you! – Stefan Arambasich Dec 10 '11 at 21:37
1

CSV files are more complex than they first appear. For example, in German-speaking countries the field separator is normally the ";" character. While I understand your assignment was to use regexps, don't waste your time on them when solving this problem for real.

My tool of choice is opencsv. Here's a Groovy script (I leave it to you to convert it to Java) that parses your string:

import au.com.bytecode.opencsv.CSVParser

@Grapes([
    @Grab(group='net.sf.opencsv', module='opencsv', version='2.3')
])

CSVParser csv = new CSVParser()
String[] result = csv.parseLine('J,Project report,"F, G, I",1')

assert result[0] == "J"
assert result[1] == "Project report"
assert result[2] == "F, G, I"
assert result[3] == "1"

The CSVReader object (au.com.bytecode.opencsv.CSVReader, which needs its own import) provides ways to iterate over the file contents:

new File("data.csv").withReader { reader ->
    CSVReader csv = new CSVReader(reader);

    csv.readAll().each {
        println it[0]
        println it[1]
        println it[2]
        println it[3]
    }
}
Mark O'Connor
  • The assignment wasn't actually to use regex and it actually isn't even my assignment haha. My friend needed help for his software engineering class, and it was bothering me that I wasn't able to help solve the problem, hence my thread. But keeping this library in mind would probably be beneficial for both my friend and me. Thanks! I almost picked this response as the answer. Very informative! – Stefan Arambasich Dec 10 '11 at 21:34