String regex Parsing with Semicolons in data

Question

How can I put together a regex to split a FIQL string (example below) that separates conditions with semicolons? The problem is that semicolons can also appear inside the values.

I am using String.split but can't find the right regex. I've tried the one below, in which I'm trying to match the last semicolon before the ==:

query.split("(;)[^;]*==")

But it only works for the first key-value pair.

Example string:

Key1==value1; key2==val;ue2;key3==value3

Target is an array or list: key1==value1, key2==val;ue2, key3==value3. The problem here is that the semicolon in value 2 causes a split.

Any idea?

Pshemo · Answer 1 · 2016-09-29T14:13:20.240

1

It looks like you want to split on ; only if it has == after it, but also has no ; between it and that ==.

You were almost there. Your code should look like

split(";(?=[^;]*==)")

Notice that the (?=...) part is a positive look-ahead: it only checks whether the ; is followed by text that the subexpression [^;]*== can match, but it doesn't include that text in the final match, so it won't disappear after splitting (the look-ahead itself is a zero-length match).

DEMO:

String str = "Key1==value1; key2==val;ue2;key3==value3";
for (String s : str.split(";(?=[^;]*==)")){
    System.out.println(s);
}

Output:

Key1==value1
 key2==val;ue2
key3==value3

If you also want to get rid of the space before key2, make it part of the delimiter you split on: let the regex match not only ; but also the whitespace surrounding it. Zero or more whitespace characters can be represented with \s*, so your code can look like

split("\\s*;\\s*(?=[^;]*==)")
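As a quick check of the whitespace-trimming variant, here is a minimal runnable sketch. The class name `TrimmedSplitDemo` and the helper `splitConditions` are made up for this illustration; only the regex comes from the answer above. The extra `trim()` on the whole input is an assumption added to drop leading and trailing whitespace, which the delimiter regex alone does not remove.

```java
import java.util.Arrays;
import java.util.List;

public class TrimmedSplitDemo {

    // Hypothetical helper: splits on ';' (plus surrounding whitespace) only when
    // the text up to the next ';' contains "==", i.e. when a new key follows.
    static List<String> splitConditions(String query) {
        return Arrays.asList(query.trim().split("\\s*;\\s*(?=[^;]*==)"));
    }

    public static void main(String[] args) {
        // Brackets make stray whitespace visible in the output.
        for (String s : splitConditions("Key1==value1; key2==val;ue2;key3==value3")) {
            System.out.println("[" + s + "]");
        }
        // prints:
        // [Key1==value1]
        // [key2==val;ue2]
        // [key3==value3]
    }
}
```

Note that a value which itself contains == after a semicolon would still be split, so this relies on == never appearing inside values.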

edited Sep 29 '16 at 14:13

answered Sep 29 '16 at 14:07

Pshemo


It's an at least quadratic regex, since you're reading the input string twice. If the string is long enough, or the file is big enough, it would be really slow. It works, but should it really be used? – bashnesnos Sep 29 '16 at 14:21
@bashnesnos True, this approach may not be the best in terms of performance because of backtracking, but I am not sure it will be O(N^2) (if that is what you mean by *quadratic*). I suspect it will be closer to O(2*N). This regex will iterate to find `;`, then the look-ahead will try to find a match for `[^;]*==`, so `[^;]*` can iterate at most up to the next `;`. So the only characters that can be visited more than once are those matched by `[^;]*`, and even they are visited at most twice: once when we search for the delimiter `;`, and once in the look-ahead. – Pshemo Sep 29 '16 at 15:29
Yes, I should've written 2*N. I followed the same logic as you, but how it turned into quadratic in the end, I dunno. Excuse me for the confusion :-) – bashnesnos Sep 29 '16 at 15:40

bashnesnos · Answer 2 · 2016-09-29T14:52:42.640

1

Use a group instead, and search for tokens using java.util.regex.Matcher in a loop:

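import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Each token is key==value; the value may contain ';' and ends at a ';' separator or at end of input.
Pattern patrn = Pattern.compile("(?>(\\w+==[\\w;]+)(?:;\\s*|$))");
Matcher mtchr = patrn.matcher("Key1==value1; key2==val;ue2;key3==value3");

// Group 1 holds the condition without its trailing separator.
while (mtchr.find()) {
    System.out.println(mtchr.group(1));
}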

Yields:
Key1==value1
key2==val;ue2
key3==value3

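Simply using an optional ;? as the terminator won't work, unfortunately: the greedy [\w;]+ would then have no reason to back off, so your middle tokens wouldn't terminate at the right semicolon anymore.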
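Since the question asks for an array or list, the Matcher loop above can be wrapped so that it collects the tokens instead of printing them. This is only a sketch: the class name `MatcherTokenizer` and the method `tokenize` are hypothetical, and the pattern is the one from this answer, compiled once so it can be reused across calls. Because the pattern relies on `\w`, it assumes keys and values consist only of letters, digits, and underscores (plus semicolons inside values).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherTokenizer {

    // Same pattern as in the answer; group 1 is the condition without the separator.
    private static final Pattern CONDITION =
            Pattern.compile("(?>(\\w+==[\\w;]+)(?:;\\s*|$))");

    // Hypothetical helper: returns every key==value token found in the query.
    static List<String> tokenize(String query) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = CONDITION.matcher(query);
        while (matcher.find()) {
            tokens.add(matcher.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Key1==value1; key2==val;ue2;key3==value3"));
        // prints: [Key1==value1, key2==val;ue2, key3==value3]
    }
}
```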

edited Sep 29 '16 at 14:52

answered Sep 29 '16 at 14:10

bashnesnos


Nice one. But a few hints: (1) String literals are placed inside `"..."`, not `'...'`; `'` is reserved for `char` literals, (2) there is no point in wrapping the *entire regex* in a non-capturing group `(?:regex)`; you can simply use `regex`. – Pshemo Sep 29 '16 at 14:22
@Pshemo thanks, I copy-pasted it from a Groovy console :-) I used the non-capturing group to avoid backtracking, but I agree that in the current case it's more of a precaution. – bashnesnos Sep 29 '16 at 14:24
You are welcome. BTW `\w` already contains `0-9` range, so you can skip `\d` in your regex. Also you are not obligated to add `EDIT: change description` in your answer. If you see that there is possible improvement or a problem simply correct it in your answer :) – Pshemo Sep 29 '16 at 14:30
@Pshemo yup, spotted the \d while correcting suggestion (2) :-) I've replaced global non-capturing group with a global atomic one, I guess it makes more sense now. – bashnesnos Sep 29 '16 at 14:39

score 0 · Answer 3 · edited May 23 '17 at 12:15

0

RegExp are evil.

If you can request a minimal change to the string being parsed, so that each value is surrounded by double quotes, the string could look like Key1=="value1"; key2=="val;ue2";key3=="value3". Then this post will help you: Java: splitting a comma-separated string but ignoring commas in quotes

Alternatively, you need to write a custom String parser. Here is a quick non-optimized CustomStringParser
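The linked CustomStringParser is not reproduced in this thread. As a rough idea of what a hand-written parser could look like (this is not the answer author's code), the sketch below walks from one == to the next and cuts at the last semicolon before each following ==. The class name `FiqlSplitter` and the method `splitConditions` are hypothetical, and the sketch assumes that == is the only operator and never occurs inside a value.

```java
import java.util.ArrayList;
import java.util.List;

public class FiqlSplitter {

    // Hypothetical helper: splits "k1==v1;k2==v;2;k3==v3" into its conditions
    // without regex, keeping semicolons that belong to a value.
    static List<String> splitConditions(String query) {
        List<String> parts = new ArrayList<>();
        int start = 0;                    // start index of the current condition
        int eq = query.indexOf("==");     // operator of the current condition
        while (eq >= 0) {
            int nextEq = query.indexOf("==", eq + 2);
            if (nextEq < 0) {
                break;                    // current condition is the last one
            }
            // The separator is the last ';' before the next operator,
            // provided it lies after the current operator.
            int sep = query.lastIndexOf(';', nextEq);
            if (sep > eq + 1) {
                parts.add(query.substring(start, sep).trim());
                start = sep + 1;
            }
            eq = nextEq;
        }
        if (start < query.length()) {
            parts.add(query.substring(start).trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitConditions("Key1==value1; key2==val;ue2;key3==value3"));
        // prints: [Key1==value1, key2==val;ue2, key3==value3]
    }
}
```

Unlike the regex versions, this variant does not split on a semicolon that precedes the first ==, so a leading separator stays attached to the first condition.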

Hope this helps.

edited May 23 '17 at 12:15

Community


answered Sep 29 '16 at 14:34

Osama Dwairi


Well, it's not an attempt to parse arbitrary XHTML with regex here :-) Why not? If it's not that performance-critical, regex should do the trick here. – bashnesnos Sep 29 '16 at 15:03
Absolutely, I agree. Anyway, a code snippet should be faster nevertheless :) – Osama Dwairi Sep 29 '16 at 15:09
