
I am trying to match inputs like

<foo>
<bar>
#####<foo>
#####<bar>

I tried #{5}?<\w+>, but it does not match <foo> and <bar>.

What's wrong with this pattern, and how can it be fixed?

polygenelubricants
huangli

2 Answers


On ? for optional vs reluctant

The ? metacharacter in Java regex (and some other flavors) can have two very different meanings, depending on where it appears. Immediately following a repetition specifier, ? makes that specifier reluctant; it is not the "zero-or-one"/"optional" repetition specifier in that position.

Thus, #{5}? does not mean "optionally match 5 #". It in fact says "match 5 # reluctantly". Matching "exactly 5, but as few as possible" has no practical effect, since the count is fixed, but this is what the pattern means. The five # are still required, which is why <foo> and <bar> on their own do not match.
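To see this behavior directly, here is a minimal sketch that runs the pattern from the question against the four sample inputs. The class and method names (ReluctantExactCount, matchesOriginal) are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Pattern;

final class ReluctantExactCount {
    // Pattern from the question: the ? after {5} makes the quantifier
    // reluctant; it does NOT make the five # optional.
    static final Pattern ORIGINAL = Pattern.compile("#{5}?<\\w+>");

    // Hypothetical helper: true only if the whole input matches.
    static boolean matchesOriginal(String input) {
        return ORIGINAL.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] inputs = {"<foo>", "<bar>", "#####<foo>", "#####<bar>"};
        for (String input : inputs) {
            System.out.println(input + " -> " + matchesOriginal(input));
        }
    }
}
```

The two inputs without the leading ##### should print false, and the two with it should print true.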


Grouping to the rescue!

One way to fix this problem is to group the optional pattern as (…)?. Something like this should work for this problem:

(#{5})?<\w+>

Now the ? does not immediately follow a repetition specifier (i.e. *, +, ?, or {…}); it follows the closing parenthesis of a group, so it applies to the whole group as "zero-or-one".

Alternatively, you can also use a non-capturing group (?:…) in this case:

(?:#{5})?<\w+>

This achieves the same grouping effect, but doesn't capture into \1.
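As a rough check of both variants, the sketch below runs the capturing and non-capturing patterns against the sample inputs and prints what group 1 holds. The class name OptionalGroupDemo and the field names are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OptionalGroupDemo {
    // Capturing group: the optional ##### is available as group 1.
    static final Pattern CAPTURING = Pattern.compile("(#{5})?<\\w+>");
    // Non-capturing group: same grouping, but nothing is captured.
    static final Pattern NON_CAPTURING = Pattern.compile("(?:#{5})?<\\w+>");

    public static void main(String[] args) {
        String[] inputs = {"<foo>", "<bar>", "#####<foo>", "#####<bar>"};
        for (String input : inputs) {
            Matcher m = CAPTURING.matcher(input);
            boolean ok = m.matches();
            // group(1) is null when the optional group did not take part.
            String prefix = ok ? m.group(1) : "n/a";
            System.out.println(input + " -> capturing: " + ok
                    + ", group 1: " + prefix
                    + ", non-capturing: " + NON_CAPTURING.matcher(input).matches());
        }
        // The capturing pattern defines one group, the other defines none.
        System.out.println(CAPTURING.matcher("").groupCount());
        System.out.println(NON_CAPTURING.matcher("").groupCount());
    }
}
```

All four inputs should match under both patterns. Note that group(1) returns null rather than an empty string when the ##### prefix is absent, so code that reads it needs a null check.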


Bonus material: What about ??

It's worth noting that you can use ?? to match an optional item reluctantly!

    System.out.println("NOMZ".matches("NOMZ??"));
    // "true"

    System.out.println(
          "NOM NOMZ NOMZZ".replaceAll("NOMZ??", "YUM")
    ); // "YUM YUMZ YUMZZ"

Note that Z?? is an optional Z, but it's matched reluctantly. "NOMZ" in its entirety still matches the pattern NOMZ??, but in replaceAll, NOMZ?? can match only "NOM" and doesn't have to take the optional Z even if it's there.

By contrast, NOMZ? will match the optional Z greedily: if it's there, it'll take it.

    System.out.println(
          "NOM NOMZ NOMZZ".replaceAll("NOMZ?", "YUM")
    ); // "YUM YUM YUMZ"

Community
polygenelubricants

Place your # match in a subpattern:

(#{5})?<\w+>
BoltClock