2

Why does Java (using Matcher.find()) not find the longest possible match?

regex = "ab*(bc)?"

With input of "abbbc" the regex finds "abbb", instead of "abbbc" which also matches and is longer. Is there a way to force it to match the longest possible string?

matt b
steve lee

4 Answers

5

The (bc) is an exact string. It wasn't found because the b* was greedy and consumed every 'b', but since (bc)? is optional, the match succeeded after the last 'b'. You probably want something like ab*[bc]?, but that doesn't make much sense, so probably ab*c?. If this regex represents something more elaborate, you should post those examples.

Here is how the regex engine sees it:

Compiling REx "ab*(bc)?"
Matching REx "ab*(bc)?" against "abbbc"
   0 <> <abbbc>              |  1:EXACT <a>(3)
   1 <a> <bbbc>              |  3:STAR(6)
                                  EXACT <b> can match 3 times out of 2147483647...
   4 <abbb> <c>              |  6:  CURLYM[1] {0,1}(16)
   4 <abbb> <c>              | 10:    EXACT <bc>(14)
                                      failed...
                                    CURLYM trying tail with matches=0...
   4 <abbb> <c>              | 16:    END(0)
Match successful!

Compiling REx "ab*[bc]?"
Matching REx "ab*[bc]?" against "abbbc"
   0 <> <abbbc>              |  1:EXACT <a>(3)
   1 <a> <bbbc>              |  3:STAR(6)
                                  EXACT <b> can match 3 times out of 2147483647...
   4 <abbb> <c>              |  6:  CURLY {0,1}(19)
                                    ANYOF[bc] can match 1 times out of 1...
   5 <abbbc> <>              | 19:    END(0)
Match successful!
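
The trace above comes from a different engine, but java.util.regex is also a backtracking engine and gives the same results here. A minimal sketch to check this from Java; the class name GreedyFindDemo and the helper firstMatch are made up for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyFindDemo {
    // Hypothetical helper: returns the first substring matched by find(), or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "abbbc";
        // b* takes all three b's, then the optional (bc) matches nothing.
        System.out.println(firstMatch("ab*(bc)?", input)); // abbb
        // A character class can still pick up the trailing 'c'.
        System.out.println(firstMatch("ab*[bc]?", input)); // abbbc
        System.out.println(firstMatch("ab*c?", input));    // abbbc
    }
}
```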
1

The portions match greedily, left to right. So b* matches greedily, which leaves nothing for (bc)? to match. Since (bc)? is optional, that is fine: the overall match succeeds, so the matcher never backtracks to try a shorter b*.

Maybe ab*?(?:(?![bc])|(bc)) does what you want.
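
As a rough check of the updated regex, the sketch below runs it against a few inputs. The reluctant b*? grows one 'b' at a time until either the next character is neither 'b' nor 'c', or a literal "bc" follows. The class and method names are invented for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyAlternationDemo {
    static final Pattern P = Pattern.compile("ab*?(?:(?![bc])|(bc))");

    // Hypothetical helper: returns {whole match, capture group 1}, or null if nothing matches.
    static String[] firstMatch(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        // group(1) is null when the lookahead branch matched instead of (bc).
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        for (String s : new String[] { "abbbc", "abbb", "abbbd" }) {
            String[] r = firstMatch(s);
            System.out.println(s + " -> " + r[0] + ", group 1 = " + r[1]);
        }
        // abbbc -> abbbc, group 1 = bc
        // abbb -> abbb, group 1 = null
        // abbbd -> abbb, group 1 = null
    }
}
```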

Mike Samuel
  • Thanks for the help, but this seems to have the same issue as it only returns "a". (Yet "abbbc" matches??) I'm basically looking for a way to have an optional substring (e.g. "bc") and force it to include it in the match if it exists. – steve lee Jan 19 '11 at 18:21
  • @steve lee, That was stupid of me. I updated the regex. – Mike Samuel Jan 19 '11 at 18:24
1

Others have helped improve the regexp, but just to emphasize: the answer is "because it does greedy matching". That is, the match you get is the first one the algorithm reaches. It makes each submatch as long as possible, from left to right, and only backtracks when the rest of the pattern fails.
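
This also explains why "abbbc" can match as a whole even though find() stops at "abbb". A small sketch contrasting the two calls, with hypothetical helper names: find() accepts the first successful match, while matches() requires the entire input to match, which forces b* to backtrack and give up one 'b'.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVersusMatchesDemo {
    static final Pattern P = Pattern.compile("ab*(bc)?");

    // First match anywhere in the input, or null.
    static String findFirst(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Match spanning the whole input, or null if the input does not match entirely.
    static String matchWhole(String input) {
        Matcher m = P.matcher(input);
        return m.matches() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findFirst("abbbc"));  // abbb
        System.out.println(matchWhole("abbbc")); // abbbc
    }
}
```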

StaxMan
1

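If your expression really looks like this and you don't care about grouping, it can be rewritten as ab+c?. (This requires at least one 'b', so unlike the original it no longer matches a bare "a"; a(b+c?)? would keep that case.)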

If the expression is actually more complex and having (bc) is essential, you can use a negative lookahead as follows, which I think is more elegant than Mike Samuel's solution: ab*(?!c)(bc)?.
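
A short sketch to try both suggestions from Java. The class name and the firstMatch helper are assumptions made for this example only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeLookaheadDemo {
    // Hypothetical helper: first substring matched by find(), or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Simplified form without the group.
        System.out.println(firstMatch("ab+c?", "abbbc"));         // abbbc
        // (?!c) rejects the position right before 'c', so b* gives back one 'b'
        // and (bc)? can then consume the trailing "bc".
        System.out.println(firstMatch("ab*(?!c)(bc)?", "abbbc")); // abbbc
        // Without a trailing 'c' the lookahead changes nothing.
        System.out.println(firstMatch("ab*(?!c)(bc)?", "abbb"));  // abbb
    }
}
```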

axtavt
  • That works in this case. Thanks. I'm wondering how to apply it to the following similar regex: "[a-z][a-z]*(?!St\.)(St\.)?" That is: starts with a letter, followed by one or more letters. The string either contains no period or else contains "St.". It needs to include the "St." if it exists. – steve lee Jan 19 '11 at 20:12
  • In your solution, why doesn't the greedy '*' consume all of the b's giving abbb (same as "ab*(bc)?"). How does adding the negative lookahead cause it to consider the "bc"? (But "bc" match fails without it?) Thanks for your help! – steve lee Jan 19 '11 at 20:21
  • @steve lee, the (?!c) fails when the next character is 'c', so b* backtracks one character and gives back the last 'b'. Now (?!c) sees a 'b' and is satisfied, and the match picks up the final 'bc', which matches (bc)? –  Jan 19 '11 at 20:41
  • @steve: `[a-z][a-z]*(St\.)?` should work as is, since `S` doesn't match `[a-z]`, shouldn't it? – axtavt Jan 19 '11 at 20:48
  • @steve lee, or `^[a-z]((St\.)|[a-z])*$` if it's in the middle. If it's at the end and you allow upper/lower case, then that's different. –  Jan 19 '11 at 20:55
  • It all makes sense now! Thanks for your help! – steve lee Jan 19 '11 at 21:11
  • @steve lee, @axtavt - `[a-z]((?!St\.)[a-z])*(St\.)?` case insensitive, floating match but optional 'St.' at end of match. When you have more than single character assertions, things are a little different... –  Jan 19 '11 at 21:14
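
To see why the multi-character lookahead matters once case is ignored, here is a hedged sketch comparing the last regex with the plain version. The inputs and the class and field names are invented for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreetSuffixDemo {
    // Regex from the last comment: the lookahead stops the letter loop right before "St.".
    static final Pattern WITH_LOOKAHEAD =
            Pattern.compile("[a-z]((?!St\\.)[a-z])*(St\\.)?", Pattern.CASE_INSENSITIVE);

    // Same shape without the lookahead, for comparison.
    static final Pattern WITHOUT_LOOKAHEAD =
            Pattern.compile("[a-z][a-z]*(St\\.)?", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: first substring matched by find(), or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Case-insensitive [a-z]* swallows "St", so the optional suffix matches nothing.
        System.out.println(firstMatch(WITHOUT_LOOKAHEAD, "MainSt.")); // MainSt
        // The lookahead keeps "St." out of the letter loop.
        System.out.println(firstMatch(WITH_LOOKAHEAD, "MainSt."));    // MainSt.
        System.out.println(firstMatch(WITH_LOOKAHEAD, "Main"));       // Main
    }
}
```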