Regex getting wrong output in Java

Question

I have a string that looks something like:

" 'a 'b '(d f g (1 2)) '(3 4) (a d) d "

And what I am trying to do is match so I get this output:

'a, 'b, '(d f g (1 2)), '(3 4), (a d), d

I am currently using:

"'\(.*\)|\(\.*\)|'\w+|\w+"

But there is a problem I've run into using this. For example, if I write

'(a b c) (d f)

it will return

'(a b c) (d f)

instead of

'(a b c), (d f)

So my question is whether there is a way to solve this with regex, or do I have to solve it another way?

Since it can't be parsed by regex alone, do you have a preferred language for an alternate solution? – cmbuckley Apr 01 '12 at 22:21
@warbio, I updated my answer with an algorithm proposal. It's a common approach for working with bracket structures. – iehrlich Apr 01 '12 at 22:32

iehrlich · Accepted Answer · 2012-04-01T22:31:40.900

4

The answer is no.

The language you are trying to parse is not regular; it is context-free. So you cannot parse it with a regex.

If you're interested, here is the grammar:

 S->SS|e;
 S->'(A);
 A-> AA|(A)|w+;

It is not regular because you can't build an FSM to represent it, which holds whenever bracket structures can be nested recursively.

Well, whatever. Let's answer the question "How?". Traverse the string from the first character. Once you find an apostrophe, start counting brackets. An opening bracket counts for +1, a closing bracket for -1. Once you hit a closing bracket that brings the counter back to zero, insert a comma after that bracket. Problem solved:

 'a 'b '(d f g (1 2)) '(3 4) (a d) d
        |      |   ||
        |      |   |+-- counter = 0 on closing bracket, insert comma
        |      |   +--- counter = 1
        |      +------- counter = 2
        +-------------- start counting, counter = 1

etc.
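The counting idea above can be sketched in Java roughly as follows. This is a minimal illustration, not the answerer's own code: the class name TopLevelSplitter and the method split are hypothetical, and the sketch goes one step further than the description by cutting at every whitespace character that sits at depth zero, so that bare atoms such as 'a and d are separated too. It assumes terms are separated by whitespace and that brackets are balanced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a string into its top-level terms
// by tracking how many parentheses are currently open.
final class TopLevelSplitter {

    /** Splits the input at whitespace that lies outside all parentheses. */
    static List<String> split(String input) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0; // number of currently open parentheses

        for (char c : input.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    throw new IllegalArgumentException("unbalanced ')' in: " + input);
                }
            }

            if (Character.isWhitespace(c) && depth == 0) {
                // Top-level whitespace ends the current term, if there is one.
                if (current.length() > 0) {
                    terms.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Everything else, including whitespace inside brackets, is kept.
                current.append(c);
            }
        }

        if (depth != 0) {
            throw new IllegalArgumentException("unbalanced '(' in: " + input);
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        String input = " 'a 'b '(d f g (1 2)) '(3 4) (a d) d ";
        // prints: 'a, 'b, '(d f g (1 2)), '(3 4), (a d), d
        System.out.println(String.join(", ", split(input)));
    }
}
```

Because the counter is a plain int, the nesting depth is not limited. Two groups written without whitespace between them, such as (a)(b), would stay together in this sketch.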

edited Apr 01 '12 at 22:31

answered Apr 01 '12 at 22:10

iehrlich


Alright, I guess I have to do it another way, thanks for the quick answer. – warbio Apr 01 '12 at 22:14

Most **regex flavors are not regular**. Matching context-free languages is no problem for PCRE. Example http://stackoverflow.com/questions/7434272/match-an-bn-cn-e-g-aaabbbccc-using-regular-expressions-pcre – Qtax Apr 01 '12 at 22:24
@Qtax in fact, it's rather a bad practice to call something a "regex" that is not in fact a "regex" :( Although I agree that my answer may be not precise in the real world, it's absolutely right in academic context. – iehrlich Apr 01 '12 at 22:28
But @sudd, this is not an academic context. In fact, we call them "regexes" to emphasize that we're **not** talking about theory-pure regular expressions. Many regex flavors can even handle nested structures like the one in this question, though Java doesn't happen to be one of them. – Alan Moore Apr 02 '12 at 03:31
@AlanMoore As I already said, I agree with this point of view. – iehrlich Apr 02 '12 at 03:33

score 0 · Answer 2 · answered Apr 01 '12 at 22:27


If you are using PCRE or the like, you could use an expression like:

'?(?:\w+|(\((?:[^()]+|(?1))*\)))
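Java's java.util.regex has no recursive subpattern call like (?1), so the expression above cannot be used there as written. If the nesting depth is known to be bounded, one workaround is to unroll the recursion by hand. The following is only a sketch under that assumption: it accepts at most one level of nested parentheses, and the names BoundedDepthMatcher, LEVEL_0, LEVEL_1 and terms are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a regex with the recursion unrolled to a fixed depth.
final class BoundedDepthMatcher {

    // A parenthesised group whose contents hold no further parentheses.
    private static final String LEVEL_0 = "\\([^()]*\\)";

    // A group that may contain LEVEL_0 groups, i.e. one level of nesting.
    private static final String LEVEL_1 = "\\((?:[^()]|" + LEVEL_0 + ")*\\)";

    // Optional quote, then either a word or a group nested at most one level deep.
    private static final Pattern TERM =
            Pattern.compile("'?(?:\\w+|" + LEVEL_1 + ")");

    /** Returns every match of TERM in the input, in order of appearance. */
    static List<String> terms(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TERM.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

Calling BoundedDepthMatcher.terms on the string from the question yields the six terms the asker wants. Input nested more deeply than one level is not rejected; it is silently split into fragments, which is the main reason to prefer a counter-based scan when the depth is not known in advance.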

answered Apr 01 '12 at 22:27

Qtax


score 0 · Answer 3 · answered Apr 01 '12 at 22:29

If I understand correctly, you want to add a comma before every space, except in parentheses. Is that right?

If so, there might be a way to do it in regex using lookaheads and lookbehinds, but it's going to get messy fast. Better to split up all the terms first and then add commas just after the ones you want.
