How does the "leftmost, longest" rule apply to subexpressions in ERE?

Question

The POSIX standard states that for both ERE and BRE:

Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string. For this purpose, a null string shall be considered to be longer than no match at all. For example, matching the BRE "\(.*\).*" against "abcdef", the subexpression "(\1)" is "abcdef", and matching the BRE "\(a*\)*" against "bc", the subexpression "(\1)" is the null string.

My question: how should `(a|ab)(c|bcd)(d*)` match against "abcd"?

My reading of the standard above is that the subexpression `(a|ab)` should match the longest string that still keeps the whole match as long as possible, so it should match "ab". However, when I use GNU regex (through `sed -E`) to match `(a|ab)(c|bcd)(d*)` against "abcd", the first subexpression matches only "a":

echo abcd | sed -E 's/(a|ab)(c|bcd)(d*)/\0, \1, \2, \3/'
abcd, a, bcd,

This example is from this page.

Here is C++ code using Boost.Regex with the regex::extended flag:

#include <boost/regex.hpp>
#include <iostream>
#include <string>

int main()
{
  boost::regex_constants::syntax_option_type regex_flags =
      boost::regex::extended;

  std::string text = "abcd";
  boost::regex expression("(a|ab)(c|bcd)(d*)", regex_flags);
  boost::smatch matches {};

  boost::regex_search(text, matches, expression);

  for (const auto& match : matches) { std::cout << match << ", "; }
  std::cout << std::endl;

  return 0;
}

Compiling and running this produces:

abcd, ab, c, d,

Here, consistent with the standard, the subexpression `(a|ab)` matches the longest string it can ("ab"), but with GNU regex it does not.

It is a regular greedy regex pattern. The first `.*` matches up to the end of string, and there is nothing left for the second `.*` to match — Wiktor Stribiżew, Aug 14 '20 at 23:20
Subexpressions don't change how the rest of the regexp is processed. They just provide a way to group part of the regexp, either for quantification or back-referencing. — Barmar, Aug 14 '20 at 23:26
@Barmar, in light of the updates I added, could you reopen the question? I don't believe the linked question answers the distinction between Boost.Regex and GNU regex in interpreting the ERE standard. — Mahrud, Aug 15 '20 at 00:00
Alternations are implemented differently by different engines. Some will try to find the longest match, others give precedence to the earlier alternative. https://www.regular-expressions.info/alternation.html — Barmar, Aug 15 '20 at 00:53
@Barmar so they simply ignore the rule about leftmost, longest subexpression matches? — Mahrud, Aug 15 '20 at 03:52
leftmost, longest doesn't apply to subexpressions, it applies to expressions in general. — Barmar, Aug 15 '20 at 04:00
@Barmar but the standard says "each subpattern, from left to right, shall match the longest possible string". isn't `(a|ab)` the first subpattern in the new example? — Mahrud, Aug 16 '20 at 01:31
Unfortunately that spec is not very clear in defining its terms. In most other RE literature, what it calls a subexpression is called a capturing group. — Barmar, Aug 16 '20 at 02:06
