
While doing a Java course, I've come across this code:

String[] columns = row.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

Checking the documentation of the split method, we can see that the string between the parentheses is a regular expression.

Checking the regexp documentation, things get trickier.

Splitting the expression into the pieces I can find in that documentation:

  1. , - The character we expect the string to be divided with
  2. ( - closes at point 16
  3. ?= - (?=X) X, via zero-width positive lookahead
  4. ( - closes at point 11
  5. [^\"] - Any character except «"»
  6. * - I don't get this
  7. \" - The character «"»
  8. [^\"] - Again, like in point 5, any character except «"»
  9. * - Again, like in point 6, I'm not sure I get this
  10. \" - Again, like in point 7, the character «"»
  11. ) - Closes expression started at point 4
  12. * - Is this some kind of logical AND ?
  13. [^\"] - Like in points 5 and 8, any character except «"»
  14. * - Same as points 6, 9 and 12
  15. $ - Boundary matcher indicating the end of a line
  16. ) - Closes expression started at point 2
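For reference, the same pattern can be laid out one piece per line with `Pattern.COMMENTS`, which makes the regex engine ignore whitespace and treat `#` as the start of a comment. This is only a sketch: the class name, the `split` helper and the sample row are made up for illustration, and the comments reflect the reading given in the comment thread (`*` repeats the element before it, and the lookahead requires an even number of quotes after the comma).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class AnnotatedCsvSplit {
    // Same pattern as in the question, spread out with Pattern.COMMENTS so that
    // whitespace is ignored and '#' starts a comment inside the regex.
    static final Pattern SEPARATOR = Pattern.compile(
          ",            # a literal comma, the candidate separator\n"
        + "(?=          # lookahead: what follows must match, but is not consumed\n"
        + "  (          #   start of a group\n"
        + "    [^\"]*   #     zero or more characters that are not a quote\n"
        + "    \"       #     one quote\n"
        + "    [^\"]*   #     zero or more characters that are not a quote\n"
        + "    \"       #     one quote\n"
        + "  )*         #   the whole group zero or more times: quotes come in pairs\n"
        + "  [^\"]*     #   then zero or more characters that are not a quote\n"
        + "  $          #   up to the end of the input\n"
        + ")",
        Pattern.COMMENTS);

    // Hypothetical helper: splits a row with the annotated pattern.
    static String[] split(String row) {
        return SEPARATOR.split(row);
    }

    public static void main(String[] args) {
        String row = "1,\"Smith, John\",42"; // made-up sample row
        System.out.println(Arrays.toString(split(row))); // [1, "Smith, John", 42]
    }
}
```

The comma inside the quoted field is left alone because only one quote follows it, so the lookahead cannot pair the quotes up before the end of the input.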

I get lost from point 3 onwards, as I don't know what a lookahead is. I was able to get the gist of it from this answer, which uses this website as a reference, where we can read about positive and negative lookaheads:

Negative lookahead is indispensable if you want to match something not followed by something else. When explaining character classes, this tutorial explained why you cannot use a negated character class to match a q not followed by a u. Negative lookahead provides the solution: q(?!u). The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead, we have the trivial regex u.

Positive lookahead works just the same. q(?=u) matches a q that is followed by a u, without making the u part of the match. The positive lookahead construct is a pair of parentheses, with the opening parenthesis followed by a question mark and an equals sign.

You can use any regular expression inside the lookahead (but not lookbehind, as explained below). Any valid regular expression can be used inside the lookahead.
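The q/u examples from the quoted tutorial can be tried directly in Java. A minimal sketch, where the class name, the `firstMatch` helper and the test words are assumptions chosen for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Returns "text@start-end" for the first match, or "no match".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() + "@" + m.start() + "-" + m.end() : "no match";
    }

    public static void main(String[] args) {
        // Without lookahead, the u is part of the match.
        System.out.println(firstMatch("qu", "quit"));     // qu@0-2
        // Positive lookahead: the u must be there, but is not consumed.
        System.out.println(firstMatch("q(?=u)", "quit")); // q@0-1
        // Negative lookahead: a q that is NOT followed by a u.
        System.out.println(firstMatch("q(?!u)", "quit")); // no match
        System.out.println(firstMatch("q(?!u)", "Iraq")); // q@3-4
    }
}
```

The match for `q(?=u)` ends at index 1, not 2, which is what "zero-width" means: the lookahead checks the text ahead without adding it to the match.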

OK, so now that point 3 is more or less understood, what I have so far is that we are splitting the string on the character «,» (point 1), but only where it is followed by the expression ([^\"]*\"[^\"]*\")*[^\"]*$.
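One way to check this reading is to test every comma by hand: take the text after it and see whether it matches the lookahead body on its own. The sketch below does that; `CommaTrace`, `splitPoints` and the sample row are hypothetical names, and using `String.matches` on the remaining substring is only assumed to be equivalent to the lookahead for single-line rows like this one.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaTrace {
    // Body of the lookahead from the question, without the leading ",(?=" and the closing ")".
    static final String REST = "([^\"]*\"[^\"]*\")*[^\"]*$";

    // Indices of the commas at which the original pattern would split.
    static List<Integer> splitPoints(String row) {
        List<Integer> points = new ArrayList<>();
        for (int i = 0; i < row.length(); i++) {
            // A comma is a separator only if everything after it matches REST.
            if (row.charAt(i) == ',' && row.substring(i + 1).matches(REST)) {
                points.add(i);
            }
        }
        return points;
    }

    public static void main(String[] args) {
        String row = "a,\"b,c\",d"; // made-up sample: a,"b,c",d
        System.out.println(splitPoints(row));                  // [1, 7]
        System.out.println(row.split(",").length);             // 4
        System.out.println(row.split(",(?=" + REST + ")").length); // 3
    }
}
```

The comma at index 4 sits inside the quoted field, so only one quote follows it and it is skipped, while a plain `split(",")` would cut the field in two.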

At this point I'm led to believe that the question has become: how does * combine expressions in a Java regexp?

Looking into the documentation I find three instances of it, but I still don't get it:

  • In Greedy quantifiers: X* means X, zero or more times
  • In Reluctant quantifiers: X*? means X, zero or more times
  • In Possessive quantifiers: X*+ means X, zero or more times
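A small experiment may make the three entries easier to tell apart. In this sketch (the class name, the `first` helper and the inputs are made up), `*` always applies to the single element just before it, whether that is a character class, `.` or a parenthesised group, and the three variants only differ in how much they take and whether they give anything back:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StarDemo {
    // Text of the first match of regex in input, or null when there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // '*' repeats the element before it: here the class [^"].
        System.out.println(first("[^\"]*", "abc\"def")); // abc
        // "Zero times" is allowed, so this matches the empty string before the quote.
        System.out.println("[" + first("[^\"]*", "\"def") + "]"); // []
        // After a group, '*' repeats the whole group.
        System.out.println(first("(ab)*", "ababX"));     // abab

        // Greedy: takes as much as possible, then backs off until the rest fits.
        System.out.println(first("a.*b", "aXbXb"));      // aXbXb
        // Reluctant: takes as little as possible.
        System.out.println(first("a.*?b", "aXbXb"));     // aXb
        // Possessive: takes everything and never gives it back, so the final b cannot match.
        System.out.println(first("a.*+b", "aXbXb"));     // null
    }
}
```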

Please correct me if any of my deductions is wrong, and thank you for your time.

A. Baila
  • 1
    Read this? https://www.xyzws.com/javafaq/what-are-differences-among-greedy-reluctant-and-possessive-quantifiers-in-java-patterns/206 – Jean-Baptiste Yunès Dec 10 '19 at 08:37
  • 2
    Try pasting the regex [here](https://regex101.com/) and read the explanation – CinCout Dec 10 '19 at 08:38
  • 1
    The meaning of `*` in regex is very basic and should be covered in the most fundamental introductions. It allows for the repetition of the previous expression zero or more times. – tripleee Dec 10 '19 at 08:40
  • 1
    Everything is explained very well at http://regular-expressions.info and http://rexegg.com. Links available in the linked thread. – Wiktor Stribiżew Dec 10 '19 at 08:40
  • 1
    In so many words, require there to be an even number of `"` after the separator. In other words, don't split between two `"`:s. (Still probably broken for some dialects of CSV, if that's what you are attempting to split.) – tripleee Dec 10 '19 at 08:42
  • 1
    Well, the regex split on comma which is not contained in a string, but this is a rather piss poor approach if the purpose is to parse CSV – nhahtdh Dec 10 '19 at 09:00
