Handling duplicate regex group name in Java (C# translation)

Question

I am trying to translate a section of C# code into Java, and while I am familiar with both languages, I am not very strong with the regex libraries.

From MSDN, they give this example

String pattern = @"\D+(?<digit>\d+)\D+(?<digit>\d+)?";

And this output (which I see they are using the capture index, and not the group name itself)

   Match: abc123def456
   Group 1: 456
      Capture 0: 123
      Capture 1: 456

With this note

a group name can be repeated in a regular expression. For example, it is possible for more than one group to be named digit, as the following example illustrates. In the case of duplicate names, the value of the Group object is determined by the last successful capture in the input string.

So maybe this is a bad example (because my actual code isn't using digits), but anyways...

Translating that into Java, it isn't too happy about the second <digit>.

String pattern = "\\D+(?<digit>\\d+)\\D+(?<digit>\\d+)?";
Pattern p = Pattern.compile(pattern);
String matchMe =  "abc123def456";

And errors at Pattern.compile with

Named capturing group <digit> is already defined

Removing all but the last name completely would be an option, I guess, seeing as that would "match" the C# behavior.

This problem arises, though, when I am trying to nest patterns within one another like so

String x =  "(?<InnerData>...)no group(?<InnerGroup>foo)";
String y = "(?<header>[...])some data" + x + "more regex" + x;
Pattern.compile(y);

where x is inner content that repeats within y and it's not something I can stick a repetition modifier onto.

I know it doesn't make sense to have groups of the same name because how would it know what you wanted?

So, question is - what can I do about that?
Is using the Matcher.group(int) my only option and forego the group names?

Use 2 and when matching, check if Group 2 matched. If yes, only grab its value. If you need to get the whole capture stack, just use 2 differently named groups. — Wiktor Stribiżew, Feb 08 '17 at 21:52
Also note that `\d` in C# matches any Unicode digit by default, and in Java, you need to use `Pattern.UNICODE_CHARACTER_CLASS` flag to get the same behavior. — Wiktor Stribiżew, Feb 08 '17 at 21:53
Why are you against using separate names and applying C#'s logic manually? I doubt there's an alternative. — shmosel, Feb 08 '17 at 21:53
According to http://stackoverflow.com/a/5771326/2055998 you cannot have multiple groups with the same name. — PM 77-1, Feb 08 '17 at 21:55
And now please provide feedback if you need help: 1) what can you change? Can you add any code? Or just modify the pattern? 2) what is the expected result for a sample string? — Wiktor Stribiżew, Feb 08 '17 at 22:01
@WiktorStribiżew I'd like to give a more "complete" MCVE, sure. It's code for a client, though, so I can't add much (if any). — OneCricketeer, Feb 08 '17 at 22:25
So, please proceed on your own, and once you need specific help, come back. — Wiktor Stribiżew, Feb 08 '17 at 22:26
You should generate different names for the groups. As is, you have two names each repeated twice. Why do you use named groups? Do you use these names afterwards? — Gangnus, Aug 18 '17 at 11:12
@Gangnus Yes, obviously I should use different names. This code I was given was trying to use "modular" patterns that can be decomposed and therefore easily embedded within each other. The names were being used in the C# code, but I have gotten around the issue using `Matcher.group(int)` in Java — OneCricketeer, Aug 18 '17 at 16:56
There are completely valid use cases where it makes sense to have duplicate group names in your regex. For example, when used with alternation, e.g. `(blah(?<x>.+?)|test(?<x>.+?))`. Here, `x` will be populated with the match from either side of the alternation (`|`). This, as far as I can tell, is not possible using Java's built-in regex API. And that makes me sad. — Josh M., Mar 04 '20 at 21:45
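The suggestion in the comments above (give each group its own name and apply C#'s "last successful capture" rule by hand) could be sketched roughly as follows. The helper `lastMatched` and the names `digit1`, `digit2`, `x1`, `x2` are hypothetical, chosen only for illustration. The sketch also assumes the groups are listed in the order they appear in the pattern, which only approximates C#'s rule for simple left-to-right patterns like these.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    // Hypothetical helper: walks the given group names from last to first and
    // returns the first one that actually took part in the match (or null).
    static String lastMatched(Matcher m, String... names) {
        for (int i = names.length - 1; i >= 0; i--) {
            String value = m.group(names[i]);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The MSDN pattern, with the duplicate name split into digit1/digit2.
        Pattern digits = Pattern.compile("\\D+(?<digit1>\\d+)\\D+(?<digit2>\\d+)?");
        Matcher m = digits.matcher("abc123def456");
        if (m.find()) {
            System.out.println(lastMatched(m, "digit1", "digit2")); // 456
        }

        // The alternation case: only one side can match, so take whichever did.
        Pattern alt = Pattern.compile("blah(?<x1>.+)|test(?<x2>.+)");
        Matcher a = alt.matcher("test42");
        if (a.find()) {
            System.out.println(lastMatched(a, "x1", "x2")); // 42
        }
    }
}
```

When the optional second group does not participate (for example with the input `abc123def`), `m.group("digit2")` is `null` and the helper falls back to `digit1`.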

Pedro Rodrigues · Answer 1 · 2018-10-14T08:33:13.553

You can't do that with regex, if I understood the problem correctly at least. Example data would be helpful, if you can provide some.

First

"(?<header>[...])some data" + x1 + "more regex" + x2

For your example, this works as long as x1 and x2 are the same regex with different group names. But I believe this isn't what you're looking for.
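If the fragment really is reused as a "module", one way to get x1 and x2 without maintaining two copies is to generate it with a name suffix. This is only a sketch: the `inner` method, the suffixes, and the concrete sub-patterns (`\d+`, `[A-Z]+`, the `:`/`-`/`;` separators) are made up here, since the real ones are elided in the question. Java group names may contain only letters and digits, so a numeric suffix is allowed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuffixedGroupsDemo {
    // Hypothetical stand-in for the question's fragment x. The suffix keeps
    // the group names unique each time the fragment is embedded.
    static String inner(String suffix) {
        return "(?<InnerData" + suffix + ">\\d+)-(?<InnerGroup" + suffix + ">foo)";
    }

    public static void main(String[] args) {
        // Same shape as y in the question: header, then the fragment twice.
        String y = "(?<header>[A-Z]+):" + inner("1") + ";" + inner("2");
        Matcher m = Pattern.compile(y).matcher("HDR:12-foo;34-foo");
        if (m.matches()) {
            System.out.println(m.group("header"));     // HDR
            System.out.println(m.group("InnerData1")); // 12
            System.out.println(m.group("InnerData2")); // 34
        }
    }
}
```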

Second

Suppose the string: FEW014 BKN025CB

And that I have 3 parameters I'm interested in, let's say:

a can be OVC, FEW, or BKN

h can be any set of exactly 3 digits

t can be CB, TCU, or absent

Additionally a string of these can have up to 4 occurrences of those 3 parameters; the example has 2, but it can go up to 4.

Now suppose the regex (which matches those 3 parameters):

(?P<a>FEW|BKN|OVC)(?P<h>[\d]{3})(?P<t>CB|TCU)?

I can use a regex engine to get a list of all occurrences of those parameters, but the engine won't go about relating them to each other.

I would get something like the following:

a:
  FEW
  BKN
h:
  014
  025
t:
  CB

See how I lost track of where the CB came from? This is expected behaviour, since a regex engine does not keep state. It just shoves stuff into buckets.

Last

The way to go about this is to not be greedy with your regex: match one set of related things, store it, and keep going.
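In Java, that idea might look roughly like the sketch below, using the simplified a/h/t regex from above. Note that Java spells named groups `(?<name>...)` rather than Python's `(?P<name>...)`. The class name, the `parse` method, and the `String[]` triple are assumptions made for the example, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CloudGroupsDemo {
    // One occurrence of the three related parameters; t is optional.
    private static final Pattern CLOUD =
            Pattern.compile("(?<a>FEW|BKN|OVC)(?<h>\\d{3})(?<t>CB|TCU)?");

    // Returns one {a, h, t} triple per occurrence; t is null when absent.
    static List<String[]> parse(String report) {
        List<String[]> clouds = new ArrayList<>();
        Matcher m = CLOUD.matcher(report);
        while (m.find()) {
            // Each find() yields exactly one related set, so nothing gets mixed up.
            clouds.add(new String[] { m.group("a"), m.group("h"), m.group("t") });
        }
        return clouds;
    }

    public static void main(String[] args) {
        for (String[] c : parse("FEW014 BKN025CB")) {
            System.out.println(String.join(" ", c[0], c[1], String.valueOf(c[2])));
        }
    }
}
```

Because the groups are read after each individual `find()`, the CB stays attached to BKN025 instead of landing in a shared bucket.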

--

The second example I used is taken from a real-world case where this was implemented, with some names changed for simplicity.

FEW014 BKN025CB is part of a textual meteorological report, and is parsed in the way explained.

In case it helps you understand, here is the code that does that:

from collections import namedtuple

# occurs and search are decorators from the avweather project (linked below)
@occurs(4)
@search(r"""
    (?P<amount>FEW|SCT|BKN|OVC)
    (?P<height>[\d]{3}|///)
    (?P<type>CB|TCU|///)?
""")
def pclouds(item):
    """Returns ((amount, height, type),) of ((string, int, string),) for
    clouds or ()"""
    tcloud = namedtuple('Cloud', 'amount height type')
    height = item['height']
    # '///' marks an unavailable height; represent it as -1
    if height == '///':
        height = -1
    else:
        height = int(height)
    return tcloud(item['amount'], height, item['type'])

https://github.com/pedro2555/avweather/blob/master/avweather/_metar_parsers.py#L221

The search decorator searches for one instance of the given regex.
The occurs decorator repeats the search the given number of times.
Call the pclouds function and notice that item holds just a single set of 3 values.

score 0 · Answer 2 · answered Aug 16 '20 at 23:55

Why do you need to name the groups?

I don't think it's necessary for your problem. You can just find successive matches with find(). In this case, the only group is group 1.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main
{
  public static void main(String[] args) {
    String patt = "\\D+(\\d+)";
    String target = "abc123def456";
    Pattern pattern = Pattern.compile(patt);
    Matcher matcher = pattern.matcher(target);
    while (matcher.find()) {
      System.out.println(matcher.group(1));
    }
  }
}

Program output:

123
456

The groups were already named, and I was trying to keep the groups as-is rather than count them off — OneCricketeer, Aug 17 '20 at 18:47
