34

Is it possible to create a regular expression with a variable number of groups?

After running this for instance...

Pattern p = Pattern.compile("ab([cd])*ef");
Matcher m = p.matcher("abcddcef");
m.matches();

... I would like to have something like

  • m.group(1) = "c"
  • m.group(2) = "d"
  • m.group(3) = "d"
  • m.group(4) = "c".

(Background: I'm parsing some lines of data, and one of the "fields" is repeating. I would like to avoid a matcher.find loop for these fields.)


As pointed out by @Tim Pietzcker in the comments, perl6 and .NET have this feature.

aioobe

7 Answers

24

According to the documentation, Java regular expressions can't do this:

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.

(emphasis added)
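To make the quoted rule concrete, here is a minimal sketch using the pattern and input from the question. The class name `LastCaptureDemo` and the helper `lastCapture` are made up for illustration; only the `java.util.regex` calls are real API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    // Returns what group 1 holds after a full match, or null if the input does not match.
    static String lastCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("ab([cd])*ef").matcher("abcddcef");
        System.out.println(m.matches());    // true
        System.out.println(m.groupCount()); // 1, fixed when the pattern is compiled
        System.out.println(m.group(1));     // c, only the last iteration survives
    }
}
```

The group matched four times, but `groupCount()` still reports one group and the earlier captures ("c", "d", "d") are gone.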

5

You can use split to get the fields you need into an array and loop through that.

http://download.oracle.com/javase/1,5.0/docs/api/java/lang/String.html#split(java.lang.String)
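A rough sketch of the split idea, assuming (this is not stated in the question) that the repeating field is comma-separated, as in `abc,d,d,cef`. The class name, the pattern and the line format are all hypothetical; one capturing group grabs the whole field and `String.split` breaks it up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitFieldDemo {
    // Assumed line format: "ab", then zero or more comma-separated c/d items, then "ef".
    private static final Pattern LINE = Pattern.compile("ab((?:[cd](?:,[cd])*)?)ef");

    static List<String> items(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected line: " + line);
        }
        String field = m.group(1);
        // "".split(",") yields one empty string, so handle the empty field separately.
        if (field.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(field.split(","));
    }

    public static void main(String[] args) {
        System.out.println(items("abc,d,d,cef")); // [c, d, d, c]
    }
}
```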

Thirtyate
  • 51
  • 1
  • 1
4

I have not used Java regex, but for many languages the answer is: No.

Capturing groups seem to be created when the regex is parsed, and filled when it matches the string. The expression (a)|(b)(c) has three capturing groups, even though only one or two of them can be filled by any given match. (a)* has just one group; the engine leaves the last iteration's match in that group.
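For Java specifically, this can be checked with a small sketch. The class and method names are invented for the example; groups that did not take part in the match come back as `null`.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountDemo {
    // Renders all groups of a full match as a string, or "no match".
    static String groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i); // null when the group did not participate
        }
        return Arrays.toString(out);
    }

    public static void main(String[] args) {
        System.out.println(groups("(a)|(b)(c)", "a"));  // [a, null, null]
        System.out.println(groups("(a)|(b)(c)", "bc")); // [null, b, c]
        System.out.println(groups("(a)*", "aaa"));      // [a]
    }
}
```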

Jens
  • 1
    .NET has [captures](http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.group.captures.aspx), so you can access individual matches of a repeated subgroup. – Tim Pietzcker Feb 16 '11 at 16:18
  • 1
    @Tim, ah, look at that. That's precisely what I'm after (but in Java). – aioobe Feb 16 '11 at 16:22
3
Pattern p = Pattern.compile("ab(?:(c)|(d))*ef");
Matcher m = p.matcher("abcdef");
m.matches();

should do what you want.

EDIT:

@aioobe, I understand now. You want to be able to do something like the grammar

A    ::== <Foo> <Bars> <Baz>
Foo  ::== "foo"
Baz  ::== "baz"
Bars ::== <Bar> <Bars>
        | ε
Bar  ::== "A"
        | "B"

and pull out all the individual matches of Bar.

No, there is no way to do that using java.util.regex. You can recurse and use a regex on the match of Bars or use a parser generator like ANTLR and attach a side-effect to Bar.
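The "use a regex on the match of Bars" option might look roughly like this for the question's pattern. It is only a sketch with made-up names (`TwoPassDemo`, `OUTER`, `ITEM`), and it does still use a `find` loop, just confined to the repeating field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoPassDemo {
    // First pass: validate the whole input and capture the repeating run as one group.
    private static final Pattern OUTER = Pattern.compile("ab([cd]*)ef");
    // Second pass: pull the individual items out of that run.
    private static final Pattern ITEM = Pattern.compile("[cd]");

    static List<String> items(String input) {
        Matcher outer = OUTER.matcher(input);
        if (!outer.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        List<String> result = new ArrayList<>();
        Matcher item = ITEM.matcher(outer.group(1));
        while (item.find()) {
            result.add(item.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(items("abcddcef")); // [c, d, d, c]
    }
}
```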

Mike Samuel
  • Uhm, that's not a variable number of groups. That's always two groups. Perhaps I simplified my example a bit *too* much. (Clarified question.) – aioobe Feb 16 '11 at 16:03
  • @aioobe, I edited this post to address your clarified question. – Mike Samuel Feb 17 '11 at 16:33
0

I have just had a very similar problem, and managed to get a "variable number of groups" by combining a while loop with resetting the matcher.

    int i=0;
    String m1=null, m2=null;

    while(matcher.find(i) && (m1=matcher.group(1))!=null && (m2=matcher.group(2))!=null)
    {
        // do work on two found groups
        i=matcher.end();
    }

But this is for my problem (with two repeating

    Pattern pattern = Pattern.compile("(?<=^ab[cd]{0,100})[cd](?=[cd]{0,100}ef$)");
    Matcher matcher = pattern.matcher("abcddcef");
    int i=0;
    String res=null;

    while(matcher.find(i) && (res=matcher.group())!=null)
    {
        System.out.println(res);
        i=matcher.end();
    }

You lose the ability to allow arbitrarily many repetitions with * or + inside the look-behind, because Java requires a look-behind to have a bounded maximum length; hence the {0,100}. The look-ahead has no such restriction.

v010dya
0

I would think that backtracking rules this out; consider the effect of /([\S\s])/ accumulating a group for every character of something like the Bible. Even if it could be done, the output would be unpredictable, because the groups would lose their positional meaning. It's better to run a separate regex globally for each kind of item and collect the matches in an array.

0

If there is a reasonable max number of matching groups you would encounter:

"ab([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?([cd])?ef"

This example will work for 0 - 8 matches. I admit this is ugly and not humanly readable.
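If the upper bound is known, the pattern does not have to be typed out by hand. The sketch below (hypothetical class and method names) builds it for a given maximum and collects only the groups that actually matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedGroupsDemo {
    // Builds "ab([cd])?([cd])?...ef" with `max` optional groups.
    static Pattern build(int max) {
        StringBuilder sb = new StringBuilder("ab");
        for (int i = 0; i < max; i++) {
            sb.append("([cd])?");
        }
        return Pattern.compile(sb.append("ef").toString());
    }

    static List<String> items(String input, int max) {
        Matcher m = build(max).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match (more than " + max + " items?): " + input);
        }
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) { // unused trailing groups are null
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(items("abcddcef", 8)); // [c, d, d, c]
    }
}
```

Inputs with more than `max` items simply fail to match, so the bound has to be chosen generously.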

kashiraja